Johann Koenig
b63e88e506
Merge "Use 'packssdw' for loading tran_low_t values"
2017-02-16 02:41:00 +00:00
Linfeng Zhang
106c342659
cosmetics,dsp/inv_txfm.c: reorder functions
...
Change-Id: Ie0f7689ebe230c68eadb22a32b14838c1a7543a6
2017-02-15 11:40:35 -08:00
Linfeng Zhang
81914ce68a
Add vpx_highbd_idct16x16_38_add_neon()
...
BUG=webm:1301
Change-Id: Ic6cd8c1e63e1b7a997cbed221e20fff4c599e0fe
2017-02-15 09:12:02 -08:00
Linfeng Zhang
e07e74fb0f
Add vpx_highbd_idct16x16_38_add_c()
...
When eob is less than or equal to 38 for high-bitdepth 16x16 idct,
call this function.
BUG=webm:1301
Change-Id: I09167f89d29c401f9c36710b0fd2d02644052060
2017-02-14 17:25:52 -08:00
Johann
327a02d77e
Use 'packssdw' for loading tran_low_t values
...
This matches bitdepth_conversion_sse2.asm and produces substantially
better assembly. The old way had lots of 'movzwl' and 'shl' and storing
back to memory before loading into an xmm register.
Change-Id: Ib33e35354dfd691a4f8b1e39f4dbcbb14cd5302b
2017-02-14 22:39:49 +00:00
Linfeng Zhang
429e652809
Replace 14 with DCT_CONST_BITS in idct NEON functions' shifts
...
Change-Id: I2a39a3bb87516b04d273bc1c0f4a634e3fb6f0f6
2017-02-14 13:08:41 -08:00
clang-format
4b402746ca
apply clang-format
...
Change-Id: I75e4a9e0b37bd4586f26c8d6c1fa27f3f6ff1bce
2017-02-14 12:45:52 -08:00
Yi Luo
c1a90dc160
Merge "Replace idct32x32_34_add_ssse3 assembly with intrinsics"
2017-02-14 20:13:27 +00:00
Yi Luo
bd86de1ac8
Replace idct32x32_34_add_ssse3 assembly with intrinsics
...
- No user-level speed performance change.
- Pass unit tests.
Change-Id: Idfc598e00f354265e41f6b3219f4734216c115c6
2017-02-14 10:38:36 -08:00
Linfeng Zhang
de9ae32b93
Merge "Add vpx_highbd_idct16x16_256_add_neon()"
2017-02-14 01:15:34 +00:00
Linfeng Zhang
5ad4159ebb
Add vpx_highbd_idct16x16_256_add_neon()
...
BUG=webm:1301
Change-Id: I6bb755552a39bdd26eef3f449601f6a9766c65ec
2017-02-13 15:50:33 -08:00
Johann
5ecde212a8
fdct8x8 highbd neon: use tran_low_t for output
...
Change-Id: I100c4a1955d80bec4d28e82796b3e7f57e84d0ba
2017-02-13 22:16:14 +00:00
Linfeng Zhang
016933ad48
Add vpx_highbd_idct{16x16,32x32}_1_add_neon()
...
and update vpx_highbd_idct8x8_1_add_neon()
BUG=webm:1301
Change-Id: I18d1a0cbe98ba822d5194c1b4e13a4c29c5c75f4
2017-02-13 10:25:22 -08:00
James Zern
91f87e7513
Merge "Add vpx_idct16x16_38_add_neon()"
2017-02-11 03:42:36 +00:00
Linfeng Zhang
bc1c18e18c
Add vpx_idct16x16_38_add_neon()
...
The RunQuantCheck() test on it exposes 16-bit overflow in stage 7 of
pass 2. Change to use saturating add/sub for both
vpx_idct16x16_38_add_neon() and vpx_idct16x16_256_add_neon() for high
bitdepth.
Change-Id: Ibf4c107a887553a52852cc582e28d38a5a5a2712
2017-02-08 12:15:22 -08:00
Yi Luo
ac04d11abc
Replace idct8x8_12_add_ssse3 assembly code with intrinsics
...
- Performance achieves the same as assembly.
- Unit tests pass.
Change-Id: I6eacfbbd826b3946c724d78fbef7948af6406ccd
2017-02-08 10:07:45 -08:00
Linfeng Zhang
cf76ee2cb7
Add vpx_idct16x16_38_add_c()
...
When eob is less than or equal to 38 for 16x16 idct, call this function.
Change-Id: Ief6f3fb16a49ace3c92cebf4e220bf5bf52a6087
2017-02-07 09:40:51 -08:00
Linfeng Zhang
66695533a8
Merge "Update 16x16 8-bit idct NEON intrinsics"
2017-02-07 16:52:40 +00:00
Johann
641fda79bb
highbd x86: consolidate tran_low_t conversions
...
Create new helper files specifically for converting tran_low_t types.
Change-Id: I7c4c458ef910f3b3d10a3cfbf9df4de7682fd905
2017-02-06 10:43:26 -08:00
Jingning Han
bb40844e32
Merge "Add SSSE3 intrinsic 8x8 inverse 2D-DCT"
2017-02-02 22:18:32 +00:00
Kaustubh Raste
5b10674b5c
Merge "Add mips msa sum_squares_2d_i16 function"
2017-02-02 08:09:21 +00:00
Johann Koenig
726556dde9
Merge "Remove neon assembly for idct 16x16 and 8x8"
2017-02-02 03:25:31 +00:00
Johann Koenig
ce6318f254
Merge changes I43521ad3,I013659f6
...
* changes:
satd highbd neon: use tran_low_t for coeff
satd highbd sse2: use tran_low_t for coeff
2017-02-02 03:03:58 +00:00
Linfeng Zhang
e4985cf619
Update 16x16 8-bit idct NEON intrinsics
...
Remove redundant memory accesses.
Change-Id: I8049074bdba5f49eab7e735b2b377423a69cd4c8
2017-02-01 17:04:33 -08:00
Jingning Han
8f95389742
Add SSSE3 intrinsic 8x8 inverse 2D-DCT
...
The intrinsic version reduces the average cycles from 183 to 175.
Change-Id: I7c1bcdb0a830266e93d8347aed38120fb3be0e03
2017-02-01 14:47:53 -08:00
Johann Koenig
dc90501ba3
Merge changes I374dfc08,I7e15192e,Ica414007
...
* changes:
hadamard highbd ssse3: use tran_low_t for coeff
hadamard highbd neon: use tran_low_t for coeff
hadamard highbd sse2: use tran_low_t for coeff
2017-02-01 21:56:36 +00:00
Johann Koenig
f60171bb4f
Merge "deblock: annotate postproc parameters"
2017-02-01 19:57:29 +00:00
Johann
f8d744d91a
satd highbd neon: use tran_low_t for coeff
...
BUG=webm:1365
Change-Id: I43521ad32b6c96737a8ef2b8c327f901fd7eaf84
2017-02-01 11:55:47 -08:00
Johann
2ba383474d
satd highbd sse2: use tran_low_t for coeff
...
BUG=webm:1365
Change-Id: I013659f6b9fbf9cc52ab840eae520fe0b5f883fb
2017-02-01 11:55:16 -08:00
Johann
0f751ecee3
hadamard highbd ssse3: use tran_low_t for coeff
...
BUG=webm:1365
Change-Id: I374dfc08732932382043905f128e928b08cb4f57
2017-02-01 11:51:15 -08:00
Johann
1eb8a718bf
hadamard highbd neon: use tran_low_t for coeff
...
BUG=webm:1365
Change-Id: I7e15192ead3a3631755b386f102c979f06e26279
2017-02-01 11:50:46 -08:00
Johann
2dac808dd1
hadamard highbd sse2: use tran_low_t for coeff
...
BUG=webm:1365
Change-Id: Ica414007d8412ceebfffa9e58e8416226a3fe934
2017-02-01 11:46:57 -08:00
Johann Koenig
3bda634576
Merge "quantize ssse3: remove unused pxor"
2017-02-01 19:41:41 +00:00
Jingning Han
969957f9f2
Fix real-time compression regression in hbd mode
...
This commit resolves the compression performance regression in
real-time encoding setting when high bit-depth mode is enabled.
The current solution temporarily disables the SIMD implementations
of vpx_satd, hadamard8x8, and hadamard16x16 in high bit-depth mode.
The commit makes the coding results bit-wise identical between
regular coding pipeline and high bit-depth at profile 0.
BUG=webm:1365
Change-Id: Icfb900821733749685370460a1a5a7e07f76f4bf
2017-01-31 23:17:09 -08:00
Johann
32f68cc58c
deblock: annotate postproc parameters
...
Clears a clang static analyzer warning where 'cols' is assumed to be
less than 0, preventing the for loop from executing.
The assembly already requires that the size be 8 or 16 (U/V or Y plane)
and cols is a multiple of 8.
Change-Id: Ica4612690ead1638c94cfe56b306e87f8ce644f9
2017-01-31 15:58:57 -08:00
Kaustubh Raste
750e753134
Add mips msa sum_squares_2d_i16 function
...
average improvement ~4x-5x
Change-Id: I8d91b71d0677009be52b412e4f52b40b98573a53
2017-01-31 12:22:43 +00:00
Kaustubh Raste
df7e1fecc1
Add mips msa vpx_minmax_8x8 function
...
average improvement ~4x-5x
Change-Id: I83aee9977534fddb8a9b80d31af646c0b6b1a8c3
2017-01-31 10:00:43 +05:30
Johann
dcfff3ccc8
quantize ssse3: remove unused pxor
...
Change-Id: Ifa22d77fd530827de0b32ae71810dc2213ab2937
2017-01-30 17:02:57 -08:00
Kaustubh Raste
4ce20fb3f4
Add mips msa vpx_vector_var function
...
average improvement ~4x-5x
Change-Id: I2f63ef83d816052ca8dc42421e7e9d42f7a7af6b
2017-01-28 08:53:20 +00:00
Kaustubh Raste
407fad2356
Add mips msa vpx Integer projection row/col functions
...
average improvement ~4x-5x
Change-Id: I17c41383250282b39f5ecae0197ef1df7de20801
2017-01-27 11:11:42 +05:30
Kaustubh Raste
182ea677a0
Add mips msa vpx satd function
...
average improvement ~4x-5x
Change-Id: If8683d636fe2606d4ca1038e28185bca53bbe244
2017-01-24 10:44:22 +05:30
Johann
13234d3c43
Remove neon assembly for idct 16x16 and 8x8
...
Tested using test/partial_idct_test.cc:DISABLED_Speed
Both gcc 4.9 and clang 3.8 from the r13 Android NDK offer improvements
using the intrinsics:
<function> <clang asm> <gcc asm> <clang intrin> <gcc intrin>
idct16x16_256 1720ms 1703ms 1546ms 1554ms
idct16x16_10 1320ms 1247ms 518ms 488ms
idct16x16_1 107ms 108ms 64ms 68ms
idct8x8_64 924ms 931ms 866ms 989ms
idct8x8_12 826ms 824ms 519ms 514ms
idct8x8_1 172ms 166ms 110ms 125ms
idct8x8_64 isn't quite perfect (slight regression with gcc intrinsics)
but as a counter example idct16x16_10 goes from ~1300ms to ~500ms
On a sample clip, clang improved from 48.5 to 49fps and gcc stayed roughly
stable.
BUG=webm:1303
Change-Id: I9d4fd2b41b46ea6174a887b40a82c8e6e4769ed4
2017-01-19 12:27:31 -08:00
Kaustubh Raste
e0c0e65378
Add mips msa vpx hadamard functions
...
average improvement ~4x-5x
Change-Id: I167132d894c04fa85dda8dde7906ff9c61b3a65d
2017-01-19 14:44:03 +05:30
Jingning Han
b6fe63a505
Merge "Rework 8x8 transpose SSSE3 for avg computation"
2017-01-13 18:25:17 +00:00
Jingning Han
553e9e291f
Merge "Rework 8x8 transpose SSSE3 for inverse 2D-DCT"
2017-01-13 18:25:09 +00:00
Jingning Han
39fff1bea0
Rework 8x8 transpose SSSE3 for avg computation
...
Use same transpose process as inv_txfm_sse2 does.
Change-Id: I2db05f0b254628a11f621c4c09abb89501ba6d3c
2017-01-12 15:16:07 -08:00
Jingning Han
f65170ea84
Rework 8x8 transpose SSSE3 for inverse 2D-DCT
...
Use same transpose process as inv_txfm_sse2 does.
Change-Id: Ic4827825bd174cba57a0a80e19bf458a648e7d94
2017-01-12 15:13:18 -08:00
Johann Koenig
9f27d1f843
Merge "arm idct16x16: remove extra config guards"
2017-01-11 20:22:27 +00:00
Johann
68d0f46ec0
arm idct16x16: remove extra config guards
...
This file is guarded by HAVE_NEON_ASM in the .mk file now.
Change-Id: I513a621c234aa90ad52e426c8ed494d8a7d4b74a
2017-01-11 10:17:14 -08:00
Jingning Han
9a780fa7db
Rework forward 8x8 2D-DCT ssse3 implementation
...
This commit reworks the SSSE3 implementation of the forward 8x8
2D-DCT. It uses a cyclic rotation approach to the temporary xmm
registers. It reduces the average cycles from 158 to 154. The SSE2
version uses 169 cycles.
Change-Id: I1b79b9642aae0ed3fb3cefb5b70246e6de5d5caa
2017-01-10 12:50:55 -08:00