Linfeng Zhang
39de45d3cc
Update sad4d x86 functions
...
Speed change is marginal.
Change-Id: I4d548e9763ce43bd546f19132202f7a8509a32bf
2018-03-28 12:49:12 -07:00
Kyle Siefring
dccb8b45bb
Merge "Fold adds in 16->32-bit converts in SSE2/AVX2 fDCT"
2018-02-21 23:12:07 +00:00
Kyle Siefring
811b2e412e
Fold adds in 16->32-bit converts in SSE2/AVX2 fDCT
...
Changes in the function size in bytes (in lieu of performance metrics)
Before After Diff
vpx_fdct32x32_avx2 29564 -> 28334 -1230
vpx_fdct32x32_sse2 38053 -> 36309 -1744
Change-Id: Ie0b3e6ed7c3f2e9ea45f9d6a1ce1e27d068cee6b
2018-02-10 14:25:24 -05:00
Scott LaVarnway
15b261d854
Merge "BUG FIX: sse2 subpel variance is not PIC compliant"
2018-01-24 22:54:42 +00:00
Scott LaVarnway
cb9f4dc105
BUG FIX: sse2 subpel variance is not PIC compliant
...
BUG=webm:1464
Change-Id: Ibc15bac54aaf509365bed5892a26a29972ad3540
2018-01-24 05:58:54 -08:00
Scott LaVarnway
b9e44842fc
Merge "vp9_quantize_fp_avx2()"
2018-01-24 13:58:08 +00:00
Linfeng Zhang
231012fdab
Add vp9_highbd_iht16x16_256_add_sse4_1()
...
BUG=webm:1413
Change-Id: I8d7eeae1bd219eb848c1a86071046a477f7a91af
2018-01-23 11:24:42 -08:00
Linfeng Zhang
8f50e06012
Add "vpx_" prefix to 2 idct x86 functions
...
Change-Id: I4f3052d8748e16b06e9155f8daf22f867dfaa7a3
2018-01-23 09:17:38 -08:00
Linfeng Zhang
6fea41abee
Merge "Add vp9_highbd_iht8x8_64_add_sse4_1()"
2018-01-23 17:04:20 +00:00
Linfeng Zhang
9874ec07bd
Add vp9_highbd_iht8x8_64_add_sse4_1()
...
BUG=webm:1413
Change-Id: Id9038226902b2d793fc6c17ac81bb104c1a18988
2018-01-18 15:49:44 -08:00
Scott LaVarnway
c7449b482c
vp9_quantize_fp_avx2()
...
Started from vp9_quantize_fp_sse2 and tweaked to use avx2.
Change-Id: Ic2da50cc9d73896c7ef2f3cd3db5b1c5d7795b8b
2018-01-18 13:33:30 -08:00
Johann
97acbbb701
clang-format v5.0.0 vpx_dsp/
...
Remove comments above #define statements because they get
indented unnecessarily.
https://bugs.llvm.org/show_bug.cgi?id=35930
Add blank lines to prevent comments from being treated as
blocks.
Change-Id: I04dce21b2a10e13b8dc07411a0019c098f6dd705
2018-01-18 12:37:50 -08:00
Linfeng Zhang
e20ca4fead
Add vp9_highbd_iht4x4_16_add_sse4_1()
...
BUG=webm:1413
Change-Id: I14930d0af24370a44ab359de5bba5512eef4e29f
2018-01-08 10:14:20 -08:00
Linfeng Zhang
867b593caa
Update iadst4_sse2()
...
Change-Id: I21ff81df0d6898170a3b80b3b5220f9f3ac7f4e8
2017-12-28 16:47:57 -08:00
Johann
bdbecea1ba
explicitly label .text sections
...
nasm should infer .text but does not for windows:
https://bugzilla.nasm.us/show_bug.cgi?id=3392451
Change-Id: Ib195465e5f33405f5ff61c4cf88aa2a72640cacb
2017-12-01 14:33:04 -08:00
Kyle Siefring
3ae909b0f9
Merge "Remove unnecessary includes of emmintrin_compat.h"
2017-11-29 19:14:45 +00:00
Kyle Siefring
a60da3a2eb
Remove unnecessary includes of emmintrin_compat.h
...
Change-Id: Ie60381a0c6ee01f828cd364a43f01517f4cb03e9
2017-11-29 11:48:24 -05:00
Johann
bd990cad72
quantize x86: dedup some parts
...
Change-Id: I9f95f47bc7ecbb7980f21cbc3a91f699624141af
2017-11-27 13:09:21 -08:00
Kyle Siefring
dd4cc5b596
Merge "Optimize AVX2 get16x16var and get32x16var functions"
2017-11-20 22:37:57 +00:00
Kyle Siefring
07a0bf038f
Optimize AVX2 get16x16var and get32x16var functions
...
Change-Id: If8b91aaa883c01107f0ea3468139fa24cfb301d2
2017-11-17 13:55:49 -05:00
Johann
3e3a568616
fwd txfm ssse3: use GLOBAL() for loading constants
...
Fixes a build issue when relocation is not allowed:
relocation R_X86_64_32 against '.rodata' can not be used when making a shared object
Change-Id: Ica3e90c926847bc384e818d7854f0030f4d69aa0
2017-11-15 13:01:44 -08:00
Scott LaVarnway
8e6022844f
vpx: [x86] add vpx_satd_avx2()
...
SSE2 instrinsic vs AVX2 intrinsic speed gains:
blocksize 16: ~1.33
blocksize 64: ~1.51
blocksize 256: ~3.03
blocksize 1024: ~3.71
Change-Id: I79b28cba82d21f9dd765e79881aa16d24fd0cb58
2017-11-10 12:24:12 -08:00
Scott LaVarnway
2387024f41
runtime error fix: bitdepth_conversion_avx2.h
...
Change-Id: I7364a157de39eb7137b599808474b8d46d19d376
2017-11-09 12:26:43 -08:00
Kyle Siefring
b383a17fa4
Support building AVX-512 and implement sadx4 for AVX-512
...
The added AVX-512 support requires the subset of AVX-512 added in Skylake-X.
Change-Id: I39666b00d10bf96d06c709823663eb09b89265b7
2017-11-03 13:37:23 -04:00
Scott LaVarnway
3bf02ad74a
vpx: hadamard: use ptrdiff_t instead of int for stride
...
Eliminates the following instruction for the x86 (64 bit)
intrinsic code:
movslq %esi,%rax
Change-Id: I8f5ebd40726f998708a668b0f52ea7a0576befae
2017-10-26 11:41:48 -07:00
Kyle Siefring
037e596f04
Merge "Optimize convolve8 SSSE3 and AVX2 intrinsics"
2017-10-24 19:22:36 +00:00
Kyle Siefring
ae35425ae6
Optimize convolve8 SSSE3 and AVX2 intrinsics
...
Changed the intrinsics to perform summation similiar to the way the assembly does.
The new code diverges from the assembly by preferring unsaturated additions.
Results for haswell
SSSE3
Horiz/Vert Size Speedup
Horiz x4 ~32%
Horiz x8 ~6%
Vert x8 ~4%
AVX2
Horiz/Vert Size Speedup
Horiz x16 ~16%
Vert x16 ~14%
BUG=webm:1471
Change-Id: I7ad98ea688c904b1ba324adf8eb977873c8b8668
2017-10-24 10:39:48 -04:00
Scott LaVarnway
512bf4e029
vpx: [x86] vpx_hadamard_16x16_avx2() highbitdepth fix
...
Use an intermediate buffer before storing to coeffs when
highbitdepth is enabled.
Change-Id: I101981a1995f1108ad107c55c37d6e09eadb404b
2017-10-23 08:49:32 -07:00
Scott LaVarnway
4906cea027
vpx: [x86] vpx_hadamard_16x16_avx2() improvements
...
~10% performance gain. Fixed the cosmetics noted in the
previous commit.
Change-Id: Iddf475f34d0d0a3e356b2143682aeabac459ed13
2017-10-20 08:55:06 -07:00
Scott LaVarnway
b58259ab55
Merge "vpx: [x86] add vpx_hadamard_16x16_avx2()"
2017-10-19 23:32:10 +00:00
Scott LaVarnway
55c126a5d7
vpx: [x86] add vpx_hadamard_16x16_avx2()
...
This version is ~1.91x faster than the sse2 version. When
highbitdepth is enabled, it is ~1.74x.
Change-Id: I2b0e92ede9f55c6259ca07bf1f8c8a5d0d0955bd
2017-10-18 18:00:00 -07:00
Kyle Siefring
b3a36f7946
Merge "Refactor x86/vpx_subpixel_8t_intrin_avx2.c"
2017-10-18 16:19:52 +00:00
Linfeng Zhang
9336e01621
Merge changes I17fff122,Ic149e3cb
...
* changes:
Add 4 to 3 scaling SSSE3 optimization
Test extreme inputs in frame scale functions
2017-10-17 16:03:29 +00:00
Kyle Siefring
55805e2786
Refactor x86/vpx_subpixel_8t_intrin_avx2.c
...
Change-Id: I6539111dfb35a43028e9755785b2e9ea31854305
2017-10-17 11:57:40 -04:00
Linfeng Zhang
580d32240f
Add 4 to 3 scaling SSSE3 optimization
...
Note this change will trigger the different C version on SSSE3 and
generate different scaled output.
Its speed is 2x compared with the version calling vpx_scaled_2d_ssse3().
Change-Id: I17fff122cd0a5ac8aa451d84daa606582da8e194
2017-10-16 15:42:42 -07:00
Kyle Siefring
caa116c9be
Merge changes I38783d97,If5160c0c
...
* changes:
Extend 16 wide AVX2 convolve8 code to support averaging.
Add AVX2 version of vpx_convolve8_avg.
2017-10-12 16:12:38 +00:00
Linfeng Zhang
16166bfdaa
Add 4 to 1 scaling x86 optimization
...
Change-Id: I51c190f0a88685867df36912522e67bdae58a673
2017-10-10 16:24:06 -07:00
Kyle Siefring
1b2f92ee8e
Extend 16 wide AVX2 convolve8 code to support averaging.
...
Also adds vpx_convolve8_avg_horiz_avx2.
Change-Id: I38783d972ac26bec77610e9e15a0a058ed498cbf
2017-10-09 19:10:03 -04:00
Kyle Siefring
9ca06bcdd2
Add AVX2 version of vpx_convolve8_avg.
...
vpx_convolve8_avg works by first running a normal horizontal filter then a
vertical filter averages at the end.
The added vpx_convolve8_avg_avx2 calls pre-existing AVX2 code for the
horizontal step.
vpx_convolve8_avg_vert_avx2 is also added, but only uses ssse3 code.
Change-Id: If5160c0c8e778e10de61ee9bf42ee4be5975c983
2017-10-07 23:37:48 -04:00
Linfeng Zhang
127864deb3
Generalize 2:1 vp9_scale_and_extend_frame_ssse3()
...
Change-Id: I882da3a04884d5fabd4cd591c28682cbb2d76aa5
2017-10-04 12:35:39 -07:00
Linfeng Zhang
9a71811d98
Merge changes Id6a8c549,Ib1e0650b,Ic369dd86
...
* changes:
Refactor x86/vpx_subpixel_8t_intrin_ssse3.c
Add vpx_dsp/x86/mem_sse2.h
Add transpose_8bit_{4x4,8x8}() x86 optimization
2017-10-04 16:15:14 +00:00
James Zern
66b6b87471
Merge "vpx: fix nasm build errors"
2017-10-03 21:47:49 +00:00
Scott LaVarnway
bc4bc9b622
vpx: fix nasm build errors
...
BUG=webm:1462,766721
Change-Id: Icfa536a8e38623636b96c396e3c94889bfde7a98
2017-10-03 20:02:21 +00:00
Linfeng Zhang
6543213e87
Refactor x86/vpx_subpixel_8t_intrin_ssse3.c
...
Change-Id: Id6a8c549709a3c516ed5d7b719b05117c5ef8bac
2017-10-03 13:02:05 -07:00
Linfeng Zhang
0f756a307d
Add vpx_dsp/x86/mem_sse2.h
...
Add some load and store sse2 inline functions.
Change-Id: Ib1e0650b5a3d8e2b3736ab7c7642d6e384354222
2017-10-03 12:59:05 -07:00
Linfeng Zhang
67c38c92e7
Add transpose_8bit_{4x4,8x8}() x86 optimization
...
Change-Id: Ic369dd86b3b81686f68fbc13ad34ab8ea8846878
2017-10-03 10:00:30 -07:00
Scott LaVarnway
3bbd62ed27
vpxdsp: [x86] add highbd_d135_predictor functions
...
C vs SSE2 speed gains:
_4x4 : ~1.81x
C vs SSSE3 speed gains:
_8x8 : ~1.96x
_16x16 : ~1.88x
_32x32 : ~2.02x
BUG=webm:1411
Change-Id: Iefaf8b39afbbfe34c1ad1d21e3a003b20f1f61e0
2017-09-29 08:56:38 -07:00
Scott LaVarnway
4cae64c32c
vpxdsp: [x86] add highbd_d117_predictor functions
...
C vs SSE2 speed gains:
_4x4 : ~2.04x
C vs SSSE3 speed gains:
_8x8 : ~2.82x
_16x16 : ~5.93x
_32x32 : ~2.79x
BUG=webm:1411
Change-Id: I31d949695991c067dac89d91e0bed3e666c94993
2017-09-28 14:45:28 -07:00
Scott LaVarnway
80992a746c
Merge "vpxdsp: [x86] add highbd_d153_predictor functions"
2017-09-27 20:40:21 +00:00
Linfeng Zhang
dbbbd44304
fix signed integer overflow of idct
...
Exposed by fuzz test in high bitdepth.
The bug is introduced in commit 64653fa.
BUG=webm:1466
Change-Id: Idd77d5c6a60efb9241471611ce1aba0646cb6ff5
2017-09-27 11:17:54 -07:00