Commit Graph

173 Commits

Author SHA1 Message Date
gxw
25d9adb74b vp9: [loongson] optimize vpx_convolve8 with mmi.
1. vpx_convolve8_vert_mmi
2. vpx_convolve8_horiz_mmi
3. vpx_convolve8_mmi
4. vpx_convolve8_avg_mmi
5. vpx_convolve8_avg_vert_mmi

Change-Id: I41a6b3b4f327d6b67d282e0163cfa0aee8648abe
2018-03-28 18:11:16 +00:00
Linfeng Zhang
3636330490 Add vp9_highbd_iht4x4_16_add_neon()
BUG=webm:1403

Change-Id: Id9833e985fb70958cf4bde38f8e6303ed83c12f9
2018-02-05 13:42:16 -08:00
Johann
bd990cad72 quantize x86: dedup some parts
Change-Id: I9f95f47bc7ecbb7980f21cbc3a91f699624141af
2017-11-27 13:09:21 -08:00
Kyle Siefring
b383a17fa4 Support building AVX-512 and implement sadx4 for AVX-512
The added AVX-512 support requires the subset of AVX-512 added in Skylake-X.

Change-Id: I39666b00d10bf96d06c709823663eb09b89265b7
2017-11-03 13:37:23 -04:00
Scott LaVarnway
b58259ab55 Merge "vpx: [x86] add vpx_hadamard_16x16_avx2()" 2017-10-19 23:32:10 +00:00
Scott LaVarnway
55c126a5d7 vpx: [x86] add vpx_hadamard_16x16_avx2()
This version is ~1.91x faster than the sse2 version.  When
highbitdepth is enabled, it is ~1.74x.

Change-Id: I2b0e92ede9f55c6259ca07bf1f8c8a5d0d0955bd
2017-10-18 18:00:00 -07:00
Kyle Siefring
55805e2786 Refactor x86/vpx_subpixel_8t_intrin_avx2.c
Change-Id: I6539111dfb35a43028e9755785b2e9ea31854305
2017-10-17 11:57:40 -04:00
Linfeng Zhang
6543213e87 Refactor x86/vpx_subpixel_8t_intrin_ssse3.c
Change-Id: Id6a8c549709a3c516ed5d7b719b05117c5ef8bac
2017-10-03 13:02:05 -07:00
Linfeng Zhang
0f756a307d Add vpx_dsp/x86/mem_sse2.h
Add some load and store sse2 inline functions.

Change-Id: Ib1e0650b5a3d8e2b3736ab7c7642d6e384354222
2017-10-03 12:59:05 -07:00
Linfeng Zhang
9d0d13e939 Add vpx_scaled_2d_neon()
BUG=webm:1419

Change-Id: I39c8033734562efc0ac0e28e7f06fa05130f9b96
2017-09-26 09:22:39 -07:00
Johann
eb4238ac70 Revert "Revert "quantize avx: copy 32x32 implementation""
This reverts commit 8c42237bb2.

Because ssse3 code is used for the reference, the qcoeff and dqcoeff
reference buffers must be aligned.

Original change's description:
> quantize avx: copy 32x32 implementation
>
> Ensure avx and ssse3 stay in sync by testing them against each other.
>
> Change-Id: I699f3b48785c83260825402d7826231f475f697c

Change-Id: Ieeef11b9406964194028b0d81d84bcb63296ae06
2017-09-12 14:25:38 -07:00
Scott LaVarnway
d6c9bbc2b6 vpxdsp: [x86] add highbd_d207_predictor functions
C vs SSE2 speed gains:
_4x4 : ~2.31x

C vs SSSE3 speed gains:
_8x8 : ~4.73x
_16x16 : ~10.88x
_32x32 : ~4.80x

BUG=webm:1411

Change-Id: I0bac29db261079181ddabc6814bd62c463109caf
2017-09-11 07:36:24 -07:00
Shiyou Yin
2c7b7424c5 Merge "vpxdsp: [loongson] optimize sad functions with mmi" 2017-09-08 00:55:14 +00:00
Linfeng Zhang
3ec20445b2 Refactor convolve8 NEON functions
Change-Id: I4ac576875c91fee7cb150d298fae4a2c156d374c
2017-09-06 15:55:17 -07:00
Shiyou Yin
f4150163a2 vpxdsp: [loongson] optimize sad functions with mmi
1. vpx_sadWxH_c
2. vpx_sadWxH_avg_c
3. vpx_sadWxHx3_c
4. vpx_sadWxHx8_c
5. vpx_sadWxHx4d_c

Change-Id: Ie13161e3d73a052ea6ea7bac9cfadf55598fea7a
2017-09-02 15:11:32 +00:00
Scott LaVarnway
30d9a1916c vpxdsp: [x86] add highbd_h_predictor functions
C vs SSE2 speed gains:
_4x4 : ~8.12x
_8x8 : ~9.71x
_16x16 : ~8.21x
_32x32 : ~5.0x

BUG=webm:1422

Change-Id: I5e8a1ed4db7b8dc539b3e2a728b0b34d8b4b1993
2017-08-28 17:31:18 -07:00
Marco Paniconi
8c42237bb2 Revert "quantize avx: copy 32x32 implementation"
This reverts commit f60d1dcd3d.

Reason for revert: <INSERT REASONING HERE>
Failures in AVX/VP9QuantizeTest in nightly tests.
Original change's description:
> quantize avx: copy 32x32 implementation
> 
> Ensure avx and ssse3 stay in sync by testing them against each other.
> 
> Change-Id: I699f3b48785c83260825402d7826231f475f697c

TBR=slavarnway@google.com,johannkoenig@google.com,builds@webmproject.org

Change-Id: Ibd38636212269328317dd0721be9d25452113d1c
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
2017-08-25 16:56:08 +00:00
Johann
f60d1dcd3d quantize avx: copy 32x32 implementation
Ensure avx and ssse3 stay in sync by testing them against each other.

Change-Id: I699f3b48785c83260825402d7826231f475f697c
2017-08-24 10:42:34 -07:00
Johann
1787e7dbe0 quantize ssse3: copy implementation to intrinsics
Still does not pass tests. Does match the previous assembly, although
saving the sign before multiplying is dubious.

Change-Id: Ia163f18c755aba542d6e93f7bf7343184660df5a
2017-08-24 07:47:51 -07:00
Shiyou Yin
d080c92524 Merge "vpx_dsp:loongson optimize vpx_mseWxH_c(case 16x16,16X8,8X16,8X8) with mmi." 2017-08-24 00:55:11 +00:00
Johann
7c27872164 quantize avx: copy implementation to intrinsics
Adds an early exit based on ptest. Slightly slower than ssse3 in the
full case because of the extra check, but potentially faster if lots of
rows can be skipped.

Very close in speed to the assembly.

Can run in 32 bit, unlike the assembly. Allows reworking the function
prototype to use structs.

Change-Id: If80e2b9ba059370a4cad3c973196e82a97b4330e
2017-08-23 09:19:16 -07:00
Shiyou Yin
59e065b6ed vpx_dsp:loongson optimize vpx_mseWxH_c(case 16x16,16X8,8X16,8X8) with mmi.
Change-Id: I2c782d18d9004414ba61b77238e0caf3e022d8f2
2017-08-23 15:14:15 +08:00
Shiyou Yin
bff5aa9827 Merge "vpx_dsp:loongson optimize vpx_subtract_block_c (case 4x4,8x8,16x16) with mmi." 2017-08-22 00:37:23 +00:00
Shiyou Yin
7d82e57f5b vpx_dsp:loongson optimize vpx_subtract_block_c (case 4x4,8x8,16x16) with mmi.
Change-Id: Ia120ad1064d0b6106d9685cf075bdab373eef19e
2017-08-18 09:06:49 +08:00
Linfeng Zhang
d72e20b123 Add vpx_highbd_idct32x32_{34, 135, 1024}_add_{sse2, sse4_1}
BUG=webm:1412

Change-Id: I08b562b60fa85fbc2fec1c15c323a3444b44618f
2017-08-14 17:05:22 -07:00
Johann Koenig
0b393ae505 Merge "quantize: copy ssse3 optimizations to intrinsics" 2017-08-10 15:42:20 +00:00
Johann
d52cb59729 quantize: copy ssse3 optimizations to intrinsics
Fairly minor differences from sse2. pabsw and psignw are the big gains.
Also re-uses some values in eob calculation to avoid an extra pcmp.

Fixes test failures in HBD and OS X builds.

Allows using it in 32bit builds, where it is about 40% faster than sse2.

Substantially faster than the assembly for skip_block. 10-20% faster the
rest of the time.

Change-Id: If783bb3567e561e47667e10133b9c84414a334e2
2017-08-08 12:22:14 -07:00
Linfeng Zhang
7f20c3ac44 Add vpx_highbd_idct16x16_{10, 38, 256}_add_sse4_1
BUG=webm:1412

Change-Id: I8877c986b4042f7b8e33f5674c86700675a0e4ca
2017-08-04 15:31:17 -07:00
Scott LaVarnway
c42517568d vpx_dsp: merge avx2 variance files
BUG=webm:1404

Change-Id: Ieb8f85c3811b05df78722cb41eeb1166966ceec4
2017-08-04 07:49:30 -07:00
Johann
2d6b5df657 neon: vpx_quantize_b
With skip block or coeff < zbin it is about twice as fast as C.

If most coeff values are > zbin it is about 10-15x as fast as C.

BUG=webm:1426

Change-Id: I5d3c007b014a372d5ef0882b39bb48983b4131c7
2017-07-31 10:38:46 -07:00
Johann
87610ac45e neon: consolidate horizontal adds
Change-Id: Iaf9e88ff636ccf8f0ef310869c6827f3f205cca8
2017-07-10 15:29:13 -07:00
Alexandra Hájková
d8c277030c ppc: Add vpx_idct4x4_16_add_vsx
Change-Id: Id2673eece32027fb245919c7a5c81994a4a19fd8
2017-07-01 12:32:18 -07:00
Linfeng Zhang
1e3a93e72e Merge changes I5d038b4f,I9d00d1dd,I0722841d,I1f640db7
* changes:
  Add vpx_highbd_idct8x8_{12, 64}_add_sse4_1
  sse2: Add transpose_32bit_4x4x2() and update transpose_32bit_4x4()
  Refactor highbd idct 4x4 sse4.1 code and add highbd_inv_txfm_sse4.h
  Refactor vpx_idct8x8_12_add_ssse3() and add inv_txfm_ssse3.h
2017-06-30 20:49:19 +00:00
Linfeng Zhang
c338f3635e Add vpx_highbd_idct8x8_{12, 64}_add_sse4_1
BUG=webm:1412

Change-Id: I5d038b4fa842ce2f6b9bd5c8c44c70647bda9591
2017-06-29 17:19:34 -07:00
Linfeng Zhang
0fa59a4baf Refactor highbd idct 4x4 sse4.1 code and add highbd_inv_txfm_sse4.h
Also clean highbd_inv_txfm_sse2.h

BUG=webm:1412

Change-Id: I0722841d824ce602874019bd9779b10d49d10c0b
2017-06-29 17:17:43 -07:00
Linfeng Zhang
9ac78ae35f Refactor vpx_idct8x8_12_add_ssse3() and add inv_txfm_ssse3.h
BUG=webm:1412

Change-Id: I1f640db71ad4c644b7521305a781f2218eb1ba9d
2017-06-29 17:13:28 -07:00
Johann
cf75ab6ccd partial fdct neon: move 8x8_1 and enable hbd tests
The function was originally written with HBD in mind. Enable it and
configure the tests.

BUG=webm:1424

Change-Id: I78a2eba8d4d9d59db98a344ba0840d4a60ebe9a1
2017-06-28 15:37:43 -07:00
Linfeng Zhang
8253a27904 Add vpx_highbd_idct4x4_16_add_sse4_1()
BUG=webm:1412

Change-Id: Ie33482409351a01be4e89466b0441834eb1e905a
2017-06-23 14:30:12 -07:00
Johann
e67660cf37 fdct32x32 neon implementation
Almost 3x faster in constrained loop testing. Over 10x faster in HBD
builds.

BUG=webm:1424

Change-Id: I2b7f8453e1d4ada63cde729d8115d684c4a71ff9
2017-06-22 06:40:17 -07:00
Jerome Jiang
943f9ee25c Merge "Merge skin detection code in vp8/9." 2017-06-08 16:36:00 +00:00
Johann Koenig
903375a48a Merge "fdct16x16 neon optimization" 2017-06-08 15:19:36 +00:00
Jerome Jiang
658e854252 Merge skin detection code in vp8/9.
BUG=webm:1438

Change-Id: Ie3dc034c7dbb498a0b088a767b1936ddeed4df14
2017-06-07 21:20:34 -07:00
Johann
eae7cf2368 fdct16x16 neon optimization
Roughly 2x speedup. Since the only change for HBD is to store(), the
improvement appears to hold there as well.

BUG=webm:1424

Change-Id: I15b813d50deb2e47b49a6b0705945de748e83c19
2017-06-07 14:59:55 -07:00
Johann
f695b30ac2 comp_avg_pred neon: used by sub pixel avg variance
BUG=webm:1423

Change-Id: I33de537f238f58f89b7a6c1c2d6e8110de4b8804
2017-05-30 22:47:34 +00:00
Johann
105503b839 neon fdct: 4x4 implementation
Approximately twice as fast as C implementation.

BUG=webm:1424

Change-Id: I3c0307fb08ddc23df42545cd089a78e2ed5c9d3f
2017-05-17 07:38:18 -07:00
Johann
1088b4f87c move neon load/stores to a new file
Move the tran_low_t helper functions to a new file. Additional
load/store functions will be added here.

Change-Id: I52bf652c344c585ea2f3e1230886be93f5caefc3
2017-05-15 08:29:43 -07:00
James Zern
ac8f58f6ab Merge changes I1b54a7a5,I3028bdad,I59788cd9
* changes:
  ppc: Add get_mb_ss_vsx
  ppc: Add get4x4sse_cs_vsx
  ppc: Add comp_avg_pred_vsx
2017-05-12 15:24:59 +00:00
Luca Barbato
a7f8bd451b ppc: Add comp_avg_pred_vsx
Change-Id: I59788cd98231e707239c2ad95ae54f67cfe24e10
2017-05-12 17:22:55 +02:00
Alexandra Hájková
cc7f0c0f3e ppc: Add vpx_sad16x8/16/32_vsx
Change-Id: I60619d28fffd9809f93b1af510a50e1aa02519a9
2017-05-10 19:57:30 +00:00
Linfeng Zhang
2231669a83 Split dsp/x86/inv_txfm_sse2.c
Spin out highbd idct functions.

BUG=webm:1412

Change-Id: I0cfe4117c00039b6778c59c022eee79ad089a2af
2017-05-03 15:43:02 -07:00