Commit Graph

571 Commits

Author SHA1 Message Date
Johann Koenig
b63e88e506 Merge "Use 'packssdw' for loading tran_low_t values" 2017-02-16 02:41:00 +00:00
Linfeng Zhang
106c342659 cosmetics,dsp/inv_txfm.c: reorder functions
Change-Id: Ie0f7689ebe230c68eadb22a32b14838c1a7543a6
2017-02-15 11:40:35 -08:00
Linfeng Zhang
81914ce68a Add vpx_highbd_idct16x16_38_add_neon()
BUG=webm:1301

Change-Id: Ic6cd8c1e63e1b7a997cbed221e20fff4c599e0fe
2017-02-15 09:12:02 -08:00
Linfeng Zhang
e07e74fb0f Add vpx_highbd_idct16x16_38_add_c()
When eob is less than or equal to 38 for high-bitdepth 16x16 idct,
call this function.

BUG=webm:1301

Change-Id: I09167f89d29c401f9c36710b0fd2d02644052060
2017-02-14 17:25:52 -08:00
Johann
327a02d77e Use 'packssdw' for loading tran_low_t values
This matches bitdepth_conversion_sse2.asm and produces substantially
better assembly. The old way had lots of 'movzwl' and 'shl' and storing
back to memory before loading into an xmm register.

Change-Id: Ib33e35354dfd691a4f8b1e39f4dbcbb14cd5302b
2017-02-14 22:39:49 +00:00
Linfeng Zhang
429e652809 Replace 14 with DCT_CONST_BITS in idct NEON functions' shifts
Change-Id: I2a39a3bb87516b04d273bc1c0f4a634e3fb6f0f6
2017-02-14 13:08:41 -08:00
clang-format
4b402746ca apply clang-format
Change-Id: I75e4a9e0b37bd4586f26c8d6c1fa27f3f6ff1bce
2017-02-14 12:45:52 -08:00
Yi Luo
c1a90dc160 Merge "Replace idct32x32_34_add_ssse3 assembly with intrinsics" 2017-02-14 20:13:27 +00:00
Yi Luo
bd86de1ac8 Replace idct32x32_34_add_ssse3 assembly with intrinsics
- No user-level speed performance change.
- Pass unit tests.

Change-Id: Idfc598e00f354265e41f6b3219f4734216c115c6
2017-02-14 10:38:36 -08:00
Linfeng Zhang
de9ae32b93 Merge "Add vpx_highbd_idct16x16_256_add_neon()" 2017-02-14 01:15:34 +00:00
Linfeng Zhang
5ad4159ebb Add vpx_highbd_idct16x16_256_add_neon()
BUG=webm:1301

Change-Id: I6bb755552a39bdd26eef3f449601f6a9766c65ec
2017-02-13 15:50:33 -08:00
Johann
5ecde212a8 fdct8x8 highbd neon: use tran_low_t for output
Change-Id: I100c4a1955d80bec4d28e82796b3e7f57e84d0ba
2017-02-13 22:16:14 +00:00
Linfeng Zhang
016933ad48 Add vpx_highbd_idct{16x16,32x32}_1_add_neon()
and update vpx_highbd_idct8x8_1_add_neon()

BUG=webm:1301

Change-Id: I18d1a0cbe98ba822d5194c1b4e13a4c29c5c75f4
2017-02-13 10:25:22 -08:00
James Zern
91f87e7513 Merge "Add vpx_idct16x16_38_add_neon()" 2017-02-11 03:42:36 +00:00
Linfeng Zhang
bc1c18e18c Add vpx_idct16x16_38_add_neon()
The RunQuantCheck() test on it exposes 16-bit overflow in stage 7 of
pass 2. Change to use saturating add/sub for both
vpx_idct16x16_38_add_neon() and vpx_idct16x16_256_add_neon() for high
bitdepth.

Change-Id: Ibf4c107a887553a52852cc582e28d38a5a5a2712
2017-02-08 12:15:22 -08:00
Yi Luo
ac04d11abc Replace idct8x8_12_add_ssse3 assembly code with intrinsics
- Performance achieves the same as assembly.
- Unit tests pass.

Change-Id: I6eacfbbd826b3946c724d78fbef7948af6406ccd
2017-02-08 10:07:45 -08:00
Linfeng Zhang
cf76ee2cb7 Add vpx_idct16x16_38_add_c()
When eob is less than or equal to 38 for 16x16 idct, call this function.

Change-Id: Ief6f3fb16a49ace3c92cebf4e220bf5bf52a6087
2017-02-07 09:40:51 -08:00
Linfeng Zhang
66695533a8 Merge "Update 16x16 8-bit idct NEON intrinsics" 2017-02-07 16:52:40 +00:00
Johann
641fda79bb highbd x86: consolidate tran_low_t conversions
Create new helper files specifically for converting tran_low_t types.

Change-Id: I7c4c458ef910f3b3d10a3cfbf9df4de7682fd905
2017-02-06 10:43:26 -08:00
Jingning Han
bb40844e32 Merge "Add SSSE3 intrinsic 8x8 inverse 2D-DCT" 2017-02-02 22:18:32 +00:00
Kaustubh Raste
5b10674b5c Merge "Add mips msa sum_squares_2d_i16 function" 2017-02-02 08:09:21 +00:00
Johann Koenig
726556dde9 Merge "Remove neon assembly for idct 16x16 and 8x8" 2017-02-02 03:25:31 +00:00
Johann Koenig
ce6318f254 Merge changes I43521ad3,I013659f6
* changes:
  satd highbd neon: use tran_low_t for coeff
  satd highbd sse2: use tran_low_t for coeff
2017-02-02 03:03:58 +00:00
Linfeng Zhang
e4985cf619 Update 16x16 8-bit idct NEON intrinsics
Remove redundant memory accesses.

Change-Id: I8049074bdba5f49eab7e735b2b377423a69cd4c8
2017-02-01 17:04:33 -08:00
Jingning Han
8f95389742 Add SSSE3 intrinsic 8x8 inverse 2D-DCT
The intrinsic version reduces the average cycles from 183 to 175.

Change-Id: I7c1bcdb0a830266e93d8347aed38120fb3be0e03
2017-02-01 14:47:53 -08:00
Johann Koenig
dc90501ba3 Merge changes I374dfc08,I7e15192e,Ica414007
* changes:
  hadamard highbd ssse3: use tran_low_t for coeff
  hadamard highbd neon: use tran_low_t for coeff
  hadamard highbd sse2: use tran_low_t for coeff
2017-02-01 21:56:36 +00:00
Johann Koenig
f60171bb4f Merge "deblock: annotate postproc parameters" 2017-02-01 19:57:29 +00:00
Johann
f8d744d91a satd highbd neon: use tran_low_t for coeff
BUG=webm:1365

Change-Id: I43521ad32b6c96737a8ef2b8c327f901fd7eaf84
2017-02-01 11:55:47 -08:00
Johann
2ba383474d satd highbd sse2: use tran_low_t for coeff
BUG=webm:1365

Change-Id: I013659f6b9fbf9cc52ab840eae520fe0b5f883fb
2017-02-01 11:55:16 -08:00
Johann
0f751ecee3 hadamard highbd ssse3: use tran_low_t for coeff
BUG=webm:1365

Change-Id: I374dfc08732932382043905f128e928b08cb4f57
2017-02-01 11:51:15 -08:00
Johann
1eb8a718bf hadamard highbd neon: use tran_low_t for coeff
BUG=webm:1365

Change-Id: I7e15192ead3a3631755b386f102c979f06e26279
2017-02-01 11:50:46 -08:00
Johann
2dac808dd1 hadamard highbd sse2: use tran_low_t for coeff
BUG=webm:1365

Change-Id: Ica414007d8412ceebfffa9e58e8416226a3fe934
2017-02-01 11:46:57 -08:00
Johann Koenig
3bda634576 Merge "quantize ssse3: remove unused pxor" 2017-02-01 19:41:41 +00:00
Jingning Han
969957f9f2 Fix real-time compression regression in hbd mode
This commit resolves the compression performance regression in
real-time encoding setting when high bit-depth mode is enabled.

The current solution temporarily disables the SIMD implementations
of vpx_satd, hadamard8x8, and hadamard16x16 in high bit-depth mode.

The commit makes the coding results bit-wise identical between
regular coding pipeline and high bit-depth at profile 0.

BUG=webm:1365

Change-Id: Icfb900821733749685370460a1a5a7e07f76f4bf
2017-01-31 23:17:09 -08:00
Johann
32f68cc58c deblock: annotate postproc parameters
Clears a clang static analyzer warning where 'cols' is assumed to be
less than 0, preventing the for loop from executing.

The assembly already requires that the size be 8 or 16 (U/V or Y plane)
and cols is a multiple of 8.

Change-Id: Ica4612690ead1638c94cfe56b306e87f8ce644f9
2017-01-31 15:58:57 -08:00
Kaustubh Raste
750e753134 Add mips msa sum_squares_2d_i16 function
average improvement ~4x-5x

Change-Id: I8d91b71d0677009be52b412e4f52b40b98573a53
2017-01-31 12:22:43 +00:00
Kaustubh Raste
df7e1fecc1 Add mips msa vpx_minmax_8x8 function
average improvement ~4x-5x

Change-Id: I83aee9977534fddb8a9b80d31af646c0b6b1a8c3
2017-01-31 10:00:43 +05:30
Johann
dcfff3ccc8 quantize ssse3: remove unused pxor
Change-Id: Ifa22d77fd530827de0b32ae71810dc2213ab2937
2017-01-30 17:02:57 -08:00
Kaustubh Raste
4ce20fb3f4 Add mips msa vpx_vector_var function
average improvement ~4x-5x

Change-Id: I2f63ef83d816052ca8dc42421e7e9d42f7a7af6b
2017-01-28 08:53:20 +00:00
Kaustubh Raste
407fad2356 Add mips msa vpx Integer projection row/col functions
average improvement ~4x-5x

Change-Id: I17c41383250282b39f5ecae0197ef1df7de20801
2017-01-27 11:11:42 +05:30
Kaustubh Raste
182ea677a0 Add mips msa vpx satd function
average improvement ~4x-5x

Change-Id: If8683d636fe2606d4ca1038e28185bca53bbe244
2017-01-24 10:44:22 +05:30
Johann
13234d3c43 Remove neon assembly for idct 16x16 and 8x8
Tested using test/partial_idct_test.cc:DISABLED_Speed

Both gcc 4.9 and clang 3.8 from the r13 Android NDK offer improvements
using the intrinsics:
<function>    <clang asm> <gcc asm> <clang intrin> <gcc intrin>
idct16x16_256  1720ms      1703ms    1546ms         1554ms
idct16x16_10   1320ms      1247ms     518ms          488ms
idct16x16_1     107ms       108ms      64ms           68ms
idct8x8_64      924ms       931ms     866ms          989ms
idct8x8_12      826ms       824ms     519ms          514ms
idct8x8_1       172ms       166ms     110ms          125ms

idct8x8_64 isn't quite perfect (slight regression with gcc intrinsics)
but as a counter example idct16x16_10 goes from ~1300ms to ~500ms

On a sample clip, clang improved from 48.5 to 49fps and gcc stayed roughly
stable.

BUG=webm:1303

Change-Id: I9d4fd2b41b46ea6174a887b40a82c8e6e4769ed4
2017-01-19 12:27:31 -08:00
Kaustubh Raste
e0c0e65378 Add mips msa vpx hadamard functions
average improvement ~4x-5x

Change-Id: I167132d894c04fa85dda8dde7906ff9c61b3a65d
2017-01-19 14:44:03 +05:30
Jingning Han
b6fe63a505 Merge "Rework 8x8 transpose SSSE3 for avg computation" 2017-01-13 18:25:17 +00:00
Jingning Han
553e9e291f Merge "Rework 8x8 transpose SSSE3 for inverse 2D-DCT" 2017-01-13 18:25:09 +00:00
Jingning Han
39fff1bea0 Rework 8x8 transpose SSSE3 for avg computation
Use same transpose process as inv_txfm_sse2 does.

Change-Id: I2db05f0b254628a11f621c4c09abb89501ba6d3c
2017-01-12 15:16:07 -08:00
Jingning Han
f65170ea84 Rework 8x8 transpose SSSE3 for inverse 2D-DCT
Use same transpose process as inv_txfm_sse2 does.

Change-Id: Ic4827825bd174cba57a0a80e19bf458a648e7d94
2017-01-12 15:13:18 -08:00
Johann Koenig
9f27d1f843 Merge "arm idct16x16: remove extra config guards" 2017-01-11 20:22:27 +00:00
Johann
68d0f46ec0 arm idct16x16: remove extra config guards
This file is guarded by HAVE_NEON_ASM in the .mk file now.

Change-Id: I513a621c234aa90ad52e426c8ed494d8a7d4b74a
2017-01-11 10:17:14 -08:00
Jingning Han
9a780fa7db Rework forward 8x8 2D-DCT ssse3 implementation
This commit reworks the SSSE3 implementation of the forward 8x8
2D-DCT. It uses a cyclic rotation approach to the temporary xmm
registers. It reduces the average cycles from 158 to 154. The SSE2
version uses 169 cycles.

Change-Id: I1b79b9642aae0ed3fb3cefb5b70246e6de5d5caa
2017-01-10 12:50:55 -08:00