Commit Graph

124 Commits

Author SHA1 Message Date
Linfeng Zhang
2231669a83 Split dsp/x86/inv_txfm_sse2.c
Spin out highbd idct functions.

BUG=webm:1412

Change-Id: I0cfe4117c00039b6778c59c022eee79ad089a2af
2017-05-03 15:43:02 -07:00
Luca Barbato
63860ba7b8 ppc: Add convolve_copy
Change-Id: Ie26d6dbe090e711d84bac01ba7da270db983f405
2017-04-29 15:47:25 +02:00
Luca Barbato
7ad1faa6f8 ppc: vertical intrapred 16x16 and 32x32
Change-Id: Ic80f3c050cfbe7697e81a311b4edaaa597b85cab
2017-04-19 01:48:09 +02:00
James Zern
4ba20da8b1 Merge "Add AVX2 optimization to copy/avg functions" 2017-04-15 00:26:08 +00:00
Yi Luo
aa5a941992 Add AVX2 optimization to copy/avg functions
Change-Id: Ibcef70e4fead74e2c2909330a7044a29381a8074
2017-04-14 16:50:10 -07:00
Johann
28a8622143 vpx_comp_avg_pred: sse2 optimization
Provides over 15x speedup for width > 8.

Due to smaller loads and shifting for width == 8 it gets about 8x
speedup.

For width == 4 it's only about 4x speedup because there is a lot of
shuffling and shifting to get the data properly situated.

BUG=webm:1390

Change-Id: Ice0b3dbbf007be3d9509786a61e7f35e94bdffa8
2017-04-13 08:44:52 -07:00
Linfeng Zhang
27530d484e Add vpx_highbd_idct32x32_1024_add_neon()
BUG=webm:1301

Change-Id: Ib90af0c1712e56b301d0e981dbe9a641e15e36ca
2017-03-17 00:27:46 -07:00
Linfeng Zhang
50b13f75b8 Add vpx_highbd_idct32x32_34_add_neon()
BUG=webm:1301

Change-Id: I74dd16c6c64e7bb71aa991cedccddf0663ef5e06
2017-03-17 00:27:46 -07:00
James Zern
2882778310 Merge "Add vpx_highbd_idct32x32_135_add_neon()" 2017-03-17 07:26:52 +00:00
Linfeng Zhang
65e9fb65e8 Add vpx_highbd_idct32x32_135_add_neon()
BUG=webm:1301

Change-Id: I58c2d65d385080711c3666d6d8f9d241dac7b21a
2017-03-16 22:37:55 -07:00
Rafael de Lucena Valle
405b94c661 Add Hadamard for Power8
Change-Id: I3b4b043c1402b4100653ace4869847e030861b18
Signed-off-by: Rafael de Lucena Valle <rafaeldelucena@gmail.com>
2017-03-15 23:46:18 -03:00
Yi Luo
f62dcc9c33 Replace idct32x32_1024_add_ssse3 assembly with intrinsics
- Encoding/decoding test, BQTerrace_1920x1080_60.y4m, on
  i7-6700, no obvious user-level speed performance downgrade.
- Passed unit tests.

Change-Id: I20688e0dd3731021ec8fb4404734336f1a426bfc
2017-02-16 16:10:40 -08:00
Johann
2104454607 block error avx2: use tran_low_t
Change-Id: Ic5f3a1f569d6f82afeaf4fcd7235374bb460db3c
2017-02-16 12:39:02 -08:00
Linfeng Zhang
016933ad48 Add vpx_highbd_idct{16x16,32x32}_1_add_neon()
and update vpx_highbd_idct8x8_1_add_neon()

BUG=webm:1301

Change-Id: I18d1a0cbe98ba822d5194c1b4e13a4c29c5c75f4
2017-02-13 10:25:22 -08:00
Johann
641fda79bb highbd x86: consolidate tran_low_t conversions
Create new helper files specifically for converting tran_low_t types.

Change-Id: I7c4c458ef910f3b3d10a3cfbf9df4de7682fd905
2017-02-06 10:43:26 -08:00
Jingning Han
bb40844e32 Merge "Add SSSE3 intrinsic 8x8 inverse 2D-DCT" 2017-02-02 22:18:32 +00:00
Kaustubh Raste
5b10674b5c Merge "Add mips msa sum_squares_2d_i16 function" 2017-02-02 08:09:21 +00:00
Jingning Han
8f95389742 Add SSSE3 intrinsic 8x8 inverse 2D-DCT
The intrinsic version reduces the average cycles from 183 to 175.

Change-Id: I7c1bcdb0a830266e93d8347aed38120fb3be0e03
2017-02-01 14:47:53 -08:00
Kaustubh Raste
750e753134 Add mips msa sum_squares_2d_i16 function
average improvement ~4x-5x

Change-Id: I8d91b71d0677009be52b412e4f52b40b98573a53
2017-01-31 12:22:43 +00:00
Johann
13234d3c43 Remove neon assembly for idct 16x16 and 8x8
Tested using test/partial_idct_test.cc:DISABLED_Speed

Both gcc 4.9 and clang 3.8 from the r13 Android NDK offer improvements
using the intrinsics:
<function>    <clang asm> <gcc asm> <clang intrin> <gcc intrin>
idct16x16_256  1720ms      1703ms    1546ms         1554ms
idct16x16_10   1320ms      1247ms     518ms          488ms
idct16x16_1     107ms       108ms      64ms           68ms
idct8x8_64      924ms       931ms     866ms          989ms
idct8x8_12      826ms       824ms     519ms          514ms
idct8x8_1       172ms       166ms     110ms          125ms

idct8x8_64 isn't quite perfect (slight regression with gcc intrinsics)
but as a counter example idct16x16_10 goes from ~1300ms to ~500ms

On a sample clip, clang improved from 48.5 to 49fps and gcc stayed roughly
stable.

BUG=webm:1303

Change-Id: I9d4fd2b41b46ea6174a887b40a82c8e6e4769ed4
2017-01-19 12:27:31 -08:00
Linfeng Zhang
6abdd31555 Refine 8-bit 16x16 idct NEON intrinsics
Speed test shows 25% gain on vpx_idct16x16_256_add_neon(),
and vpx_idct16x16_10_add_neon() got trippled.

Change-Id: If8518d9b6a3efab74031297b8d40cd83c4a49541
2017-01-06 17:52:07 -08:00
Linfeng Zhang
9b187954df Add high bitdepth 8x8 idct NEON intrinsics
BUG=webm:1301

Change-Id: I56e3bc3aab9214e2debac93796389a7194991084
2016-12-27 16:28:53 -08:00
Johann
41b0888a84 postproc: neon down and across macroblock filter
Implement vpx_post_proc_down_and_across_mb_row in NEON.
Runs about 6-7x faster than C.

BUG=webm:1320

Change-Id: Ic5c7d3552a88cfcf999ec5bf2bd46fee460642c2
2016-12-14 15:11:28 -08:00
James Zern
86e340c76e enable vpx_idct32x32_1024_add_neon in hbd builds
BUG=webm:1294

Change-Id: Ibdda54e6d1303b0f73bc7bc71417e4041d7618de
2016-12-12 19:28:35 -08:00
James Zern
228c9940ea Merge changes Ibad079f2,I7858a0a1
* changes:
  enable vpx_idct16x16_10_add_neon in hbd builds
  idct16x16,NEON: rm output_stride from pass1 fns
2016-12-07 01:40:28 +00:00
James Zern
8befcd0089 enable vpx_idct16x16_10_add_neon in hbd builds
BUG=webm:1294

Change-Id: Ibad079f25e673d4f5181961896a8a8333a51e825
2016-12-06 16:09:19 -08:00
Linfeng Zhang
a8eee97b43 Check in vpx_lpf_vertical_4_dual_neon() assembly
This replaces its C version.

Change-Id: Ie39e9324305fdc0fff610ced608a037e44a85a1a
2016-12-02 15:54:30 -08:00
Linfeng Zhang
17a8cf5cc3 Add high bitdepth 4x4 idct NEON intrinsics
Change-Id: I4afc130effa05b8be2e9f982967216b1beb2ce4b
2016-11-30 13:07:13 -08:00
James Zern
21a1abd8e3 enable vpx_idct32x32_135_add_neon in hbd builds
BUG=webm:1294

Change-Id: Ide6d3994fe01c4320c9d143e6d059b49568048e4
2016-11-23 19:59:43 -08:00
James Zern
738c8f23c6 enable vpx_idct32x32_34_add_neon in hbd builds
replace load_and_transpose_s16_8x8() in idct32_6_neon() with a separate
load_tran_low_to_s16() and transpose_s16_8x8(). the combined function is
used in idct32_8_neon() where the input is the correctly sized output
from the earlier stage.

BUG=webm:1294

Change-Id: I4257c4b3a421b2cf5d13651f966eee0680ef98a9
2016-11-08 17:03:36 -08:00
Johann
50b40f114c Optimize idct32x32_135_add for NEON
BUG=webm:1295

Change-Id: I7f80ef4d29813fcb401fc6075babf19e3c195462
2016-11-08 22:06:07 +00:00
Linfeng Zhang
64a5a8fd6f Merge "Add high bitdepth intra prediction NEON optimization (mode dc)" 2016-11-08 16:53:42 +00:00
Johann
cf35ffc025 Extract high bit depth helper functions
These can be used in the vp9 fdct as well.

Change-Id: I4f3875e0cba1b8cad209c3a0581e121deba7675e
2016-11-04 18:13:51 +00:00
Linfeng Zhang
3b74066b10 Add high bitdepth intra prediction NEON optimization (mode dc)
BUG=webm:1316

Change-Id: I984d6004ea2445e86f213fb6fa4d794a9955af8f
2016-11-01 17:07:36 -07:00
James Zern
3ae25974fd idct,NEON: add a tran_low_t->s16 load adapter
enable idct4x4* and idct8x8* which are compatible for 8-bit decodes in
high-bitdepth mode. the adapter narrows 32-bit input to 16, whether the
expansion can be avoided at all in this case remains a TODO. roughly
matches sse2.

BUG=webm:1294

Change-Id: I3ea94e5a2070dfd509b5de0c555aab4e1f4da036
2016-10-31 11:21:16 -07:00
Johann
9720b58aac Optimize idct32x32_34_add for NEON
Approximately 3 times faster than the 1024 version which was used
previously.

BUG=webm:1295

Change-Id: Id15fb3d096029ec38ef01c53e5f6eb08254347c9
2016-10-25 15:43:58 -07:00
James Zern
9dbb3ad396 remove idct32x32*_add_neon.asm
the intrinsics are neutral to ~20% faster on cros/android
devices when using gcc-4.9/clang-3.8.1 and gcc-4.9/clang-3.8.x from the
r13 ndk. neutral results typically came with gcc-4.9 while larger
positive gains were achieved with clang 3.8.x.

BUG=webm:1303

Change-Id: I4d31f9c017944681b881493525d4573a7a5b1e16
2016-10-20 19:47:14 -07:00
Linfeng Zhang
9c8981c666 add vpx high bitdepth convolve8 NEON intrinsics optimization
BUG=webm:1299

Change-Id: I236bfa0441e357b6ff05add8269a2cfb543924d1
2016-10-17 15:23:54 -07:00
Linfeng Zhang
f910d14a1a add vpx_highbd_convolve_{copy,avg}_neon()
BUG=webm:1299

Change-Id: Ib87ac466ada63251eb06ae2abd1e13e61e0d1538
2016-10-13 15:21:14 -07:00
Linfeng Zhang
7aa27bd62f [vpx highbd lpf NEON 1/6] horizontal 4
BUG=webm:1300

Change-Id: Idf441806e6bf397ff5ecd8776146b3f781f50c40
2016-10-06 14:03:04 -07:00
James Zern
a6be7ba1aa enable idct*_1_add_neon in high-bitdepth builds
these are compatible as they only load one element of the input so the
larger size of tran_low_t makes no difference in little endian builds.
note the asm is incompatible with big-endian, but there are other points of
failure there so currently it's considered unsupported.

BUG=webm:1294

Change-Id: Icd2665a0699bccae92d1bea43a95b0a83fb17028
2016-10-05 11:14:25 -07:00
Linfeng Zhang
da14d23e44 Merge "Refactor vpx lpf NEON files (step 2/2)" 2016-10-01 00:07:51 +00:00
Linfeng Zhang
edbca72a53 Merge "Refactor vpx lpf NEON files (step 1/2)" 2016-10-01 00:07:31 +00:00
James Zern
b6277a47c7 Merge changes from topic '8bit-hbd-idct'
* changes:
  *idct*_neon.c: add missing rtcd include
  idct,msa/neon: exclude idct files from hbd build
  *rtcd_defs.pl: remove empty specialize calls
2016-09-30 19:36:08 +00:00
James Zern
b51c4df93a idct,msa/neon: exclude idct files from hbd build
these functions are incompatible currently and unreferenced in rtcd,
exclude them from the build.

BUG=webm:1294

Change-Id: I7790c195a91e1b142f56c04d2a5e305d9133b896
2016-09-30 11:32:47 -07:00
Linfeng Zhang
ca2fe7a8c7 Refactor vpx lpf NEON files (step 2/2)
Change-Id: I0744407cd3361ff752bd7f6e654b70ab6b41a58f
2016-09-30 09:56:28 -07:00
Linfeng Zhang
4779f5308d Refactor vpx lpf NEON files (step 1/2)
Change-Id: I4016d096d46ca691f3b17199b259b7231e983cfb
2016-09-30 09:48:54 -07:00
Linfeng Zhang
b3cb065ee4 Refine vpx convolve8 NEON intrinsics optimization
BUG=webm:1290

Change-Id: I5d7fce62270f9d76ef9ce98b3d188ad11fb21873
2016-09-29 12:48:59 -07:00
Linfeng Zhang
761e5ec2f6 Refactor lpf (size 4 and 8) NEON intrinsics optimization
Also check in 8x8 8-bit transpose NEON intrinsics optimization
transpose_u8_8x8()

Change-Id: I32d321cf97ea21eab158ac4896990fc9a51681c4
2016-09-19 16:41:37 -07:00
Johann
d393885af1 Remove halfpix specialization
This function only exists as a shortcut to subpixel variance with
predefined offsets. xoffset = 4 for horizontal, yoffset = 4 for vertical
and both for "hv"

Removing this allows the existing optimizations for the variance
functions to be called. Instead of having only sse2 optimizations, this
gives sse2, ssse3, msa and neon.

BUG=webm:1273

Change-Id: Ieb407b423b91b87d33c4263c6a1ad5e673b0efd6
2016-08-23 17:05:39 -07:00