Johann
f310ddc470
partial fdct neon: add 16x16_1
For the 8x8_1, the highbd output fit nicely in the existing function. 12-bit
input will overflow this implementation of 16x16_1.
BUG=webm:1424
Change-Id: I2945fe5478b18f996f1a5de80110fa30f3f4e7ec
2017-06-28 15:37:44 -07:00
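The overflow remark above can be checked with worst-case arithmetic. The sketch below assumes a hypothetical model that the commit does not confirm: each column is accumulated in one signed 16-bit lane across every row of the block before widening, and each residual (difference of two bd-bit samples) has magnitude at most 2^bd - 1. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case magnitude of one column sum over `rows` rows of residuals from
 * `bit_depth`-bit input. A residual is the difference of two samples, so
 * |residual| <= 2^bit_depth - 1.
 * ASSUMPTION (hypothetical model): each column is accumulated in a single
 * signed 16-bit lane across all rows before any widening. */
static int32_t worst_case_lane_sum(int rows, int bit_depth) {
  assert(rows > 0 && bit_depth >= 8 && bit_depth <= 12);
  const int32_t max_residual = (1 << bit_depth) - 1;
  return rows * max_residual;
}

/* Nonzero if that worst case still fits in a signed 16-bit lane. */
static int fits_int16_lane(int rows, int bit_depth) {
  return worst_case_lane_sum(rows, bit_depth) <= INT16_MAX;
}
```

Under this model, 8 rows of 12-bit residuals peak at 8 * 4095 = 32760, which still fits INT16_MAX (32767) and is consistent with the 8x8_1 remark, while 16 rows peak at 16 * 4095 = 65520 and do not. 10-bit input stays in range for 16 rows (16 * 1023 = 16368).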
Johann
4959dd3eb3
partial fdct neon: add 4x4_1
BUG=webm:1424
Change-Id: Ib0f3cfd6116fc1f5a99acb8bfd76e25b90177ffc
2017-06-28 15:37:44 -07:00
Johann
cf75ab6ccd
partial fdct neon: move 8x8_1 and enable hbd tests
The function was originally written with HBD in mind. Enable it and
configure the tests.
BUG=webm:1424
Change-Id: I78a2eba8d4d9d59db98a344ba0840d4a60ebe9a1
2017-06-28 15:37:43 -07:00
Johann Koenig
81e25512c3
Merge changes Ib454762d,I966650df,Ie126553e,I068f06c6,Icb72a94e
* changes:
sad neon: rewrite 64x64 and add 64x32
sad neon: rewrite 32x32, add 32x16 and 32x64
sad neon: rewrite 16x8, 16x16, add 16x32
sad neon: rewrite 8x8 and 8x16
sad neon: rewrite 4x4 and add 4x8
2017-06-28 22:37:00 +00:00
Johann Koenig
35f8515c3f
Merge "partial fdct test"
2017-06-28 22:34:53 +00:00
Johann
5ac88162b9
partial fdct test
Test the _1 variant of the fdct, which simply sums the block and applies
a modifying shift based on the block size.
BUG=webm:1424
Change-Id: Ic80d6008abba0c596b575fa0484d5b5855321468
2017-06-28 20:32:20 +00:00
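The _1 variant described above (sum the block, then apply a size-dependent shift) can be sketched as a scalar reference. The per-size scaling below (x2 for 4x4, x1 for 8x8, >>1 for 16x16, >>3 for 32x32) is an assumption to verify against the C reference, and `partial_fdct_1` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* DC-only ("_1") forward transform sketch: sum every residual in the
 * size x size block, then scale by a block-size-dependent shift.
 * ASSUMPTION: the shifts below mirror the C reference; verify before use. */
static int32_t partial_fdct_1(const int16_t *input, int stride, int size) {
  int32_t sum = 0;
  for (int r = 0; r < size; ++r)
    for (int c = 0; c < size; ++c) sum += input[r * stride + c];

  switch (size) {
    case 4: return sum * 2;
    case 8: return sum;
    case 16: return sum >> 1;
    case 32: return sum >> 3;
    default: assert(0 && "unsupported block size"); return 0;
  }
}
```

A test of this kind only has to compare output[0] between the C and SIMD versions, since the _1 variant produces a single coefficient.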
Johann
ad011aaab8
sad neon: rewrite 64x64 and add 64x32
BUG=webm:1425
Change-Id: Ib454762d1c61b05a98324fe81ad58c9e09784717
2017-06-28 12:21:34 -07:00
Johann
77a648885c
sad neon: rewrite 32x32, add 32x16 and 32x64
BUG=webm:1425
Change-Id: I966650df7e3face93e1e771634d1cc5458a35f85
2017-06-28 12:20:27 -07:00
Johann
469643757f
sad neon: rewrite 16x8, 16x16, add 16x32
BUG=webm:1425
Change-Id: Ie126553e5fffcdfaf3d82a85b368ac10ce9ab082
2017-06-28 12:16:00 -07:00
Johann
e40e78be24
sad neon: rewrite 8x8 and 8x16
BUG=webm:1425
Change-Id: I068f06c67b841f09ea07c04ada0c2f1706102138
2017-06-28 12:15:57 -07:00
Johann
46d8660ce3
sad neon: rewrite 4x4 and add 4x8
The previous implementation loaded 8 values and discarded half of them.
BUG=webm:1425
Change-Id: Icb72a94e2557a4ee2db7091266ab58fd92f72158
2017-06-28 11:14:59 -07:00
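For the SAD rewrites above, a scalar reference is useful for checking any SIMD version of a w x h block, including the new 4x8 size. This is a minimal sketch; `sad_ref` is a hypothetical name, not the libvpx prototype.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Scalar sum of absolute differences over a w x h block. */
static unsigned int sad_ref(const uint8_t *src, int src_stride,
                            const uint8_t *ref, int ref_stride, int w, int h) {
  assert(w > 0 && h > 0);
  unsigned int sad = 0;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) sad += abs(src[x] - ref[x]);
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}
```

A 4-wide row is only 4 bytes, which is why loading 8 values per row wastes half of each load.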
Linfeng Zhang
8253a27904
Add vpx_highbd_idct4x4_16_add_sse4_1()
BUG=webm:1412
Change-Id: Ie33482409351a01be4e89466b0441834eb1e905a
2017-06-23 14:30:12 -07:00
Johann Koenig
794a5ad713
Merge "fdct32x32 neon implementation"
2017-06-23 01:58:00 +00:00
Johann
e67660cf37
fdct32x32 neon implementation
Almost 3x faster in constrained loop testing. Over 10x faster in HBD
builds.
BUG=webm:1424
Change-Id: I2b7f8453e1d4ada63cde729d8115d684c4a71ff9
2017-06-22 06:40:17 -07:00
Linfeng Zhang
2b43a1ee18
Clean 32x32 full idct sse2 and ssse3 code
vpx_idct32x32_1024_add_ssse3() is actually an SSE2 function and is faster
than vpx_idct32x32_1024_add_sse2(). Replace the slower one. All changes are
code relocations; there is no new code.
Change-Id: I5dac0e98cc411a4ce05660406921118986638d19
2017-06-21 13:46:49 -07:00
Linfeng Zhang
98967645a1
Remove vpx_idct8x8_64_add_ssse3()
It is almost identical to vpx_idct8x8_64_add_sse2(), apart from minor
differences in instruction order.
Change-Id: Ie60dabc35eaa6ebae7c755e6cff00a710aad284f
2017-06-15 14:09:33 -07:00
Johann Koenig
903375a48a
Merge "fdct16x16 neon optimization"
2017-06-08 15:19:36 +00:00
Johann
eae7cf2368
fdct16x16 neon optimization
Roughly 2x speedup. Since the only change for HBD is to store(), the
improvement appears to hold there as well.
BUG=webm:1424
Change-Id: I15b813d50deb2e47b49a6b0705945de748e83c19
2017-06-07 14:59:55 -07:00
James Zern
ff42e04f9c
Merge "ppc: Add vpx_sadnxmx4d_vsx for n,m = {8, 16, 32 ,64}"
2017-06-06 23:52:39 +00:00
James Zern
4753c23983
Merge "ppc: Add vpx_sad64/32/16x64/32/16_avg_vsx"
2017-06-06 02:19:41 +00:00
Johann Koenig
755b3daf90
Merge "comp_avg_pred neon: used by sub pixel avg variance"
2017-05-31 18:17:28 +00:00
Johann
f695b30ac2
comp_avg_pred neon: used by sub pixel avg variance
BUG=webm:1423
Change-Id: I33de537f238f58f89b7a6c1c2d6e8110de4b8804
2017-05-30 22:47:34 +00:00
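The compound-average predictor used by sub pixel avg variance takes the rounded mean of two predictions per pixel. The sketch below assumes packed (stride == width) pred and output buffers and the rounding (a + b + 1) >> 1; the function name and argument order are illustrative, not the libvpx prototype.

```c
#include <assert.h>
#include <stdint.h>

/* Rounded per-pixel average of a packed prediction and a strided reference.
 * ASSUMPTION: comp_pred and pred are packed with stride == width. */
static void comp_avg_pred_ref(uint8_t *comp_pred, const uint8_t *pred,
                              int width, int height, const uint8_t *ref,
                              int ref_stride) {
  assert(width > 0 && height > 0);
  for (int i = 0; i < height; ++i) {
    for (int j = 0; j < width; ++j)
      comp_pred[j] = (uint8_t)((pred[j] + ref[j] + 1) >> 1);
    comp_pred += width;
    pred += width;
    ref += ref_stride;
  }
}
```

The sum of two 8-bit values needs 9 bits, so a SIMD version either widens or uses a rounding-halving add that keeps the intermediate in range.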
Alexandra Hájková
8bf6eaf433
ppc: Add vpx_sadnxmx4d_vsx for n,m = {8, 16, 32 ,64}
Change-Id: I547d0099e15591655eae954e3ce65fdf3b003123
2017-05-24 13:27:09 +00:00
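The x4d SAD functions compare one source block against four reference blocks in a single call, which lets SIMD versions reuse each source load across all four comparisons. The scalar sketch below only shows the interface shape; its name and signature are hypothetical, and it simply loops over the four references.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* One source block vs. four candidate reference blocks -> four SADs. */
static void sad_x4d_ref(const uint8_t *src, int src_stride,
                        const uint8_t *const ref[4], int ref_stride, int w,
                        int h, uint32_t sad[4]) {
  assert(w > 0 && h > 0);
  for (int k = 0; k < 4; ++k) {
    uint32_t acc = 0;
    for (int y = 0; y < h; ++y)
      for (int x = 0; x < w; ++x)
        acc += abs(src[y * src_stride + x] - ref[k][y * ref_stride + x]);
    sad[k] = acc;
  }
}
```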
Johann
f6fcd3410d
sub pel avg variance neon: 4x block sizes
BUG=webm:1423
Change-Id: Iaab2b9a183fdb54aae5f717aba95d90dc36a9e3b
2017-05-22 14:40:05 -07:00
Johann
188d58eaa9
sub pel variance neon: 4x block sizes
Add optimizations for blocks of width 4
BUG=webm:1423
Change-Id: Idfb458d36db3014d48fbfbe7f5462aa6eb249938
2017-05-22 14:40:01 -07:00
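Sub-pixel variance first interpolates the block with a two-tap bilinear filter, then measures variance against the reference. The sketch below shows one filter pass under stated assumptions: 1/8-pel offsets, taps {128 - 16 * offset, 16 * offset}, and 7-bit rounding. These values and the name `bilinear_pass` are illustrative and should be checked against the C reference.

```c
#include <assert.h>
#include <stdint.h>

#define BILINEAR_BITS 7 /* assumed: taps sum to 1 << 7 = 128 */

/* One pass of a two-tap bilinear filter at a 1/8-pel offset (0..7).
 * pixel_step = 1 filters horizontally; pixel_step = src_stride filters
 * vertically. Each output reads src[x] and src[x + pixel_step], so the
 * source must extend one pixel (or row) past the block.
 * ASSUMPTION: taps are {128 - 16 * offset, 16 * offset}. */
static void bilinear_pass(const uint8_t *src, int src_stride, int pixel_step,
                          uint8_t *dst, int dst_stride, int w, int h,
                          int offset) {
  assert(offset >= 0 && offset < 8);
  const int f0 = 128 - 16 * offset;
  const int f1 = 16 * offset;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      const int v = src[x] * f0 + src[x + pixel_step] * f1;
      /* Round to nearest; taps sum to 128, so the result stays in 0..255. */
      dst[x] = (uint8_t)((v + (1 << (BILINEAR_BITS - 1))) >> BILINEAR_BITS);
    }
    src += src_stride;
    dst += dst_stride;
  }
}
```

A typical pipeline runs the horizontal pass over h + 1 rows into a temporary buffer, then the vertical pass over that buffer, and finally computes variance against the reference block.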
Johann
9b0d306a2f
sub pel avg variance neon: add neon optimizations
These still lack an optimized version of vpx_comp_avg_pred.
BUG=webm:1423
Change-Id: I31fa6ef842e98f7ff3ea079ffed51ae33178e2ed
2017-05-22 13:58:43 -07:00
Linfeng Zhang
c167345ffb
Add vpx_highbd_idct{4x4,8x8,16x16}_1_add_sse2
BUG=webm:1412
Change-Id: Ia338a6057d36f9ed7eaa9cbd4dfbf0c3cbdc6468
2017-05-22 11:24:21 -07:00
Johann Koenig
e7cac13016
Merge changes Ib8dd96f7,Ie9854b77
* changes:
neon variance: process 4x blocks
use memcpy for unaligned neon stores
2017-05-22 17:48:33 +00:00
Johann
7b742da63e
neon variance: process 4x blocks
Continue processing sets of 16 values. Plenty of improvement for 4x8
(doubles the speed) but only about 30% for 4x4.
BUG=webm:1422
Change-Id: Ib8dd96f75d474f0348800271d11e58356b620905
2017-05-17 17:35:01 -07:00
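Processing 4-wide blocks in sets of 16 values means gathering four rows of 4 pixels per step, so one 16-lane vector is fully used. The plain-C sketch below mimics that grouping with a 16-element array standing in for a register; it assumes variance = SSE - sum^2 / (w * h) and a height that is a multiple of 4. The name `variance_4xh` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Variance of a 4 x h block, consuming four rows (16 values) per step. */
static uint32_t variance_4xh(const uint8_t *src, int src_stride,
                             const uint8_t *ref, int ref_stride, int h,
                             uint32_t *sse) {
  assert(h > 0 && h % 4 == 0);
  int32_t sum = 0;
  uint32_t sse_acc = 0;
  for (int y = 0; y < h; y += 4) {
    int16_t diff[16]; /* stands in for one 16-lane vector of differences */
    for (int r = 0; r < 4; ++r)
      for (int c = 0; c < 4; ++c)
        diff[r * 4 + c] = (int16_t)(src[(y + r) * src_stride + c] -
                                    ref[(y + r) * ref_stride + c]);
    for (int i = 0; i < 16; ++i) {
      sum += diff[i];
      sse_acc += (uint32_t)(diff[i] * diff[i]);
    }
  }
  *sse = sse_acc;
  return sse_acc - (uint32_t)(((int64_t)sum * sum) / (4 * h));
}
```

A 4x4 block is a single such step, which is one plausible reason its gain is smaller than the 4x8 gain.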
Johann
105503b839
neon fdct: 4x4 implementation
Approximately twice as fast as the C implementation.
BUG=webm:1424
Change-Id: I3c0307fb08ddc23df42545cd089a78e2ed5c9d3f
2017-05-17 07:38:18 -07:00
Alexandra Hájková
bcbc3929ae
ppc: Add vpx_sad64/32/16x64/32/16_avg_vsx
Change-Id: Ic9639b1331d8c5cbc207c2a036891ff0137fc56f
2017-05-13 13:13:15 +00:00
James Zern
ac8f58f6ab
Merge changes I1b54a7a5,I3028bdad,I59788cd9
* changes:
ppc: Add get_mb_ss_vsx
ppc: Add get4x4sse_cs_vsx
ppc: Add comp_avg_pred_vsx
2017-05-12 15:24:59 +00:00
Luca Barbato
143b21e362
ppc: Add get_mb_ss_vsx
Change-Id: I1b54a7a5bb642e4b836d786ea1ae506eed025e3f
2017-05-12 17:23:00 +02:00
Luca Barbato
6d225eb5f9
ppc: Add get4x4sse_cs_vsx
Change-Id: I3028bdadf653665d18e781d28e9625f62804b3d8
2017-05-12 17:23:00 +02:00
Luca Barbato
a7f8bd451b
ppc: Add comp_avg_pred_vsx
Change-Id: I59788cd98231e707239c2ad95ae54f67cfe24e10
2017-05-12 17:22:55 +02:00
Alexandra Hájková
f48532e271
ppc: Add vpx_sad64x32/64_vsx
Change-Id: I84e3705fa52f75cb91b2bab4abf5cc77585ee3e2
2017-05-12 16:10:16 +02:00
Alexandra Hájková
0b15bf1e54
ppc Add vpx_sad32x16/32/64_vsx
Change-Id: I3c4f9d595275669580413a71b3c3c810e7ddcacd
2017-05-12 16:10:11 +02:00
James Zern
a12ea1d5e9
Merge "ppc: Add vpx_sad16x8/16/32_vsx"
2017-05-12 13:33:51 +00:00
Alexandra Hájková
cc7f0c0f3e
ppc: Add vpx_sad16x8/16/32_vsx
Change-Id: I60619d28fffd9809f93b1af510a50e1aa02519a9
2017-05-10 19:57:30 +00:00
Linfeng Zhang
764b3b8090
Update specializations of idct functions
Commit 0178d97 introduced an append situation that could be confusing.
Clean it up a little and add some comments.
Change-Id: I69ad336f805aca7ce9d45515b8cd237423fadbb2
2017-05-10 12:51:18 -07:00
Johann Koenig
d713ec3c46
Merge changes I92eb4312,Ibb2afe4e
* changes:
subpel variance neon: add mixed sizes
sub pixel variance neon: use generic variance
2017-05-10 18:19:52 +00:00
Johann Koenig
1814463864
Merge changes Id602909a,Ib0e85608
* changes:
neon variance: process two rows of 8 at a time
neon variance: add small missing sizes
2017-05-08 17:34:20 +00:00
Linfeng Zhang
2c3a2ad6f1
Merge changes I0cfe4117,I3581d80d,Ida62c941
* changes:
Split dsp/x86/inv_txfm_sse2.c
Update highbd idct functions arguments to use uint16_t dst
Clean CONVERT_TO_BYTEPTR/SHORTPTR in idct
2017-05-08 16:15:57 +00:00
Johann
2346a6da4a
subpel variance neon: add mixed sizes
Add support for everything except block sizes of 4.
Performance is better but numbers will improve again when the variance
optimizations land.
BUG=webm:1423
Change-Id: I92eb4312b20be423fa2fe6fdb18167a604ff4d80
2017-05-04 15:30:01 -07:00
Johann
cb9133c72f
neon variance: add small missing sizes
Some of the mixed sizes were missing. They can be implemented trivially
using the existing helper function.
Compared with the previous 16x8 and 8x16 implementations, the helper
function is about 10% faster than the 16x8 version. For 8x16 the results
are very close, but the existing version appears to be faster.
BUG=webm:1422
Change-Id: Ib0e856083c1893e1bd399373c5fbcd6271a7f004
2017-05-04 08:59:42 -07:00
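Implementing the missing mixed sizes "trivially using the existing helper function" typically means stamping out one small wrapper per block size. The sketch below uses a macro for that; `variance_helper` and the generated `varianceWxH` names are hypothetical stand-ins, and the helper is scalar here rather than NEON.

```c
#include <assert.h>
#include <stdint.h>

/* Generic helper: sum and SSE of differences over a w x h block. */
static void variance_helper(const uint8_t *src, int src_stride,
                            const uint8_t *ref, int ref_stride, int w, int h,
                            uint32_t *sse, int *sum) {
  assert(w > 0 && h > 0);
  *sse = 0;
  *sum = 0;
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x) {
      const int d = src[y * src_stride + x] - ref[y * ref_stride + x];
      *sum += d;
      *sse += (uint32_t)(d * d);
    }
}

/* Generate a fixed-size wrapper: variance = SSE - sum^2 / (w * h). */
#define VARIANCE_WxH(w, h)                                               \
  static uint32_t variance##w##x##h(const uint8_t *src, int src_stride,  \
                                    const uint8_t *ref, int ref_stride,  \
                                    uint32_t *sse) {                     \
    int sum;                                                             \
    variance_helper(src, src_stride, ref, ref_stride, w, h, sse, &sum);  \
    return *sse - (uint32_t)(((int64_t)sum * sum) / ((w) * (h)));        \
  }

VARIANCE_WxH(8, 16)
VARIANCE_WxH(16, 8)
VARIANCE_WxH(32, 16)
```

Keeping the size-specific code down to a one-line macro instantiation is what makes adding a missing size cheap, while a dedicated implementation (like the existing 8x16) can still be kept where it measures faster.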
Linfeng Zhang
d5de63d2be
Update highbd idct functions arguments to use uint16_t dst
BUG=webm:1388
Change-Id: I3581d80d0389b99166e70987d38aba2db6c469d5
2017-05-03 13:59:16 -07:00
Yi Luo
a3452996a1
High bit depth inter prediction horizontal/vertical filters AVX2
User-level speed improvement on i7-6700, cpu-used=1, x86_64 Linux, at
bitrates of 8Mbps (1080p) and 16Mbps (4K):
- Decoder:
  1080p: ~4%
  4K: ~5%
- Encoder:
  1080p: ~1%
  4K: ~3%
Change-Id: I51b48f9c5de0d62487d5a11aa579c97bd03dd640
2017-05-03 12:18:01 -07:00
Linfeng Zhang
a10a5cb356
Merge changes I8bb660de,Ica51d780,I6037525d
* changes:
Clean specializes of idct functions
Clean add_protos of highbd idct functions
Clean add_protos of idct functions
2017-05-03 19:17:55 +00:00
Luca Barbato
e2ad89092d
ppc: Add convolve8_vsx and convolve8_avg_vsx
Change-Id: Ia5293d948003a7fff5a7cbad6e83d8a72717c857
2017-05-02 20:27:47 -07:00
Luca Barbato
e6ca81ee67
ppc: Add convolve8_avg_vert_vsx
Only the generic version again; speedups for 8x8 and larger blocks will
come later.
Change-Id: I90d481d3a602d1e277ead8f3934eca126b86b72d
2017-05-02 20:27:42 -07:00
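The "generic" vertical 8-tap filter with averaging can be sketched in scalar C. Assumptions, not confirmed by the commit: taps sum to 128 (7-bit rounding), the taps cover rows -3..+4 around each output row, and the "avg" variant stores the rounded mean of the filtered value and the existing destination pixel. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FILTER_BITS 7 /* assumed: taps sum to 1 << 7 = 128 */

static uint8_t clip_pixel(int v) {
  return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* 8-tap vertical convolution averaged into dst.
 * ASSUMPTION: src points at the output row; taps read rows -3..+4. */
static void convolve8_avg_vert_ref(const uint8_t *src, int src_stride,
                                   uint8_t *dst, int dst_stride,
                                   const int16_t filter[8], int w, int h) {
  assert(w > 0 && h > 0);
  src -= 3 * src_stride;
  for (int x = 0; x < w; ++x) {
    for (int y = 0; y < h; ++y) {
      int sum = 0;
      for (int k = 0; k < 8; ++k)
        sum += src[(y + k) * src_stride + x] * filter[k];
      const uint8_t res =
          clip_pixel((sum + (1 << (FILTER_BITS - 1))) >> FILTER_BITS);
      uint8_t *d = &dst[y * dst_stride + x];
      *d = (uint8_t)((*d + res + 1) >> 1); /* rounded average with dst */
    }
  }
}
```

Because every output row reads seven neighbouring rows, block-size-specific versions (such as the promised 8x8 and larger ones) can reuse loaded rows across outputs, which the generic loop does not attempt.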