263 Commits

Author SHA1 Message Date
Johann
4959dd3eb3 partial fdct neon: add 4x4_1
BUG=webm:1424

Change-Id: Ib0f3cfd6116fc1f5a99acb8bfd76e25b90177ffc
2017-06-28 15:37:44 -07:00
Johann
cf75ab6ccd partial fdct neon: move 8x8_1 and enable hbd tests
The function was originally written with HBD in mind. Enable it and
configure the tests.

BUG=webm:1424

Change-Id: I78a2eba8d4d9d59db98a344ba0840d4a60ebe9a1
2017-06-28 15:37:43 -07:00
Johann
ad011aaab8 sad neon: rewrite 64x64 and add 64x32
BUG=webm:1425

Change-Id: Ib454762d1c61b05a98324fe81ad58c9e09784717
2017-06-28 12:21:34 -07:00
Johann
77a648885c sad neon: rewrite 32x32, add 32x16 and 32x64
BUG=webm:1425

Change-Id: I966650df7e3face93e1e771634d1cc5458a35f85
2017-06-28 12:20:27 -07:00
Johann
469643757f sad neon: rewrite 16x8, 16x16, add 16x32
BUG=webm:1425

Change-Id: Ie126553e5fffcdfaf3d82a85b368ac10ce9ab082
2017-06-28 12:16:00 -07:00
Johann
e40e78be24 sad neon: rewrite 8x8 and 8x16
BUG=webm:1425

Change-Id: I068f06c67b841f09ea07c04ada0c2f1706102138
2017-06-28 12:15:57 -07:00
Johann
46d8660ce3 sad neon: rewrite 4x4 and add 4x8
The previous implementation loaded 8 values (discarding half).

BUG=webm:1425

Change-Id: Icb72a94e2557a4ee2db7091266ab58fd92f72158
2017-06-28 11:14:59 -07:00
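
A hedged sketch of the approach this rewrite implies: load exactly 4 bytes per
row (two rows packed into each 8-byte D register) instead of reading 8 bytes
and discarding half. The function name and structure below are illustrative
assumptions, not the code added by this commit.

    #include <arm_neon.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: 4x4 SAD that loads 4 bytes per row via memcpy
     * (no over-read, no alignment assumption), packing two rows into each
     * 8-byte vector, then accumulating absolute differences with vabal_u8. */
    static unsigned int sad4x4_sketch(const uint8_t *src, int src_stride,
                                      const uint8_t *ref, int ref_stride) {
      uint16x8_t sum = vdupq_n_u16(0);
      for (int i = 0; i < 2; ++i) { /* two iterations, two rows each */
        uint32_t s[2], r[2];
        memcpy(&s[0], src, 4);
        memcpy(&s[1], src + src_stride, 4);
        memcpy(&r[0], ref, 4);
        memcpy(&r[1], ref + ref_stride, 4);
        sum = vabal_u8(sum, vreinterpret_u8_u32(vld1_u32(s)),
                       vreinterpret_u8_u32(vld1_u32(r)));
        src += 2 * src_stride;
        ref += 2 * ref_stride;
      }
      /* Horizontal reduction of the eight 16-bit partial sums. */
      const uint16x4_t lo_hi = vadd_u16(vget_low_u16(sum), vget_high_u16(sum));
      return vget_lane_u32(
          vreinterpret_u32_u64(vpaddl_u32(vpaddl_u16(lo_hi))), 0);
    }
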
Johann
e67660cf37 fdct32x32 neon implementation
Almost 3x faster in constrained loop testing. Over 10x faster in HBD
builds.

BUG=webm:1424

Change-Id: I2b7f8453e1d4ada63cde729d8115d684c4a71ff9
2017-06-22 06:40:17 -07:00
Johann Koenig
903375a48a Merge "fdct16x16 neon optimization" 2017-06-08 15:19:36 +00:00
Johann
eae7cf2368 fdct16x16 neon optimization
Roughly 2x speedup. Since the only change for HBD is to store(), the
improvement appears to hold there as well.

BUG=webm:1424

Change-Id: I15b813d50deb2e47b49a6b0705945de748e83c19
2017-06-07 14:59:55 -07:00
Johann Koenig
755b3daf90 Merge "comp_avg_pred neon: used by sub pixel avg variance" 2017-05-31 18:17:28 +00:00
Johann
f695b30ac2 comp_avg_pred neon: used by sub pixel avg variance
BUG=webm:1423

Change-Id: I33de537f238f58f89b7a6c1c2d6e8110de4b8804
2017-05-30 22:47:34 +00:00
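
For context, a minimal sketch of the averaging that comp_avg_pred performs,
assuming the usual rounded average (pred + ref + 1) >> 1; the real function
also walks width/height/stride, which is omitted here, and the function name
is an assumption.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Illustrative sketch: average 16 predicted and reference pixels with
     * rounding, i.e. (p + r + 1) >> 1, which maps to vrhaddq_u8. */
    static void avg_pred_16_sketch(uint8_t *comp, const uint8_t *pred,
                                   const uint8_t *ref) {
      const uint8x16_t p = vld1q_u8(pred);
      const uint8x16_t r = vld1q_u8(ref);
      vst1q_u8(comp, vrhaddq_u8(p, r));
    }
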
Johann
42ce25821d remove DECLARE_ALIGNED from neon code
Unlike x86, neon only requires type alignment when loading into vectors.

Change-Id: I7bbbe4d51f78776e499ce137578d8c0effdbc02f
2017-05-26 10:41:57 -07:00
Johann
f3c97ed32e subpel variance neon: reduce stack usage
Unlike x86, arm does not impose additional alignment restrictions on
vector loads. For values coming into the first pass it uses vld1_u32(),
which typically does impose 4-byte alignment. However, since the first
pass operates on user-supplied values, we must prepare for unaligned
values anyway (and do; see mem_neon.h).

But for the local temporary values there is no stride and the load will
use vld1_u8(), which does not require 4-byte alignment.

There are 3 temporary structures. In the C code, one is uint16_t. The
arm code saturates between passes but still passes the tests. If this
becomes an issue, new functions will be needed.

Change-Id: I3c9d4701bfeb14b77c783d0164608e621bfecfb1
2017-05-24 13:28:13 -07:00
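
A sketch of the two load patterns described above (names here are
illustrative; the library's versions live in mem_neon.h): user-supplied,
strided input is loaded without any alignment assumption, while the local
temporary has no stride and a plain vld1_u8() suffices.

    #include <arm_neon.h>
    #include <stdint.h>
    #include <string.h>

    /* Unaligned, strided source: copy 4 bytes per row through memcpy so no
     * alignment is assumed, then move the 8 bytes into a D register. */
    static uint8x8_t load_unaligned_u8_sketch(const uint8_t *buf, int stride) {
      uint32_t a[2];
      memcpy(&a[0], buf, 4);
      memcpy(&a[1], buf + stride, 4);
      return vreinterpret_u8_u32(vld1_u32(a));
    }

    /* Local temporary buffer (width == stride, densely packed bytes):
     * vld1_u8 only needs byte alignment, so no special handling is needed. */
    static uint8x8_t load_temp_u8_sketch(const uint8_t *temp) {
      return vld1_u8(temp);
    }
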
Johann
d204c4bf01 Use vdup instead of vmov
Change-Id: Idb6248c1429b55176bb3e9f4e8365ea0ed2be62a
2017-05-24 11:38:15 -07:00
Johann
f6fcd3410d sub pel avg variance neon: 4x block sizes
BUG=webm:1423

Change-Id: Iaab2b9a183fdb54aae5f717aba95d90dc36a9e3b
2017-05-22 14:40:05 -07:00
Johann
188d58eaa9 sub pel variance neon: 4x block sizes
Add optimizations for blocks of width 4.

BUG=webm:1423

Change-Id: Idfb458d36db3014d48fbfbe7f5462aa6eb249938
2017-05-22 14:40:01 -07:00
Johann
9b0d306a2f sub pel avg variance neon: add neon optimizations
These are missing an optimized version of vpx_comp_avg_pred.

BUG=webm:1423

Change-Id: I31fa6ef842e98f7ff3ea079ffed51ae33178e2ed
2017-05-22 13:58:43 -07:00
Johann
e0d294c3af sub pel variance neon: normalize variable names
Match the variable names in vpx_dsp/variance.c.

Change-Id: I228c6f296c183af147b079b7c8bcdf97bd09cf3a
2017-05-22 13:58:43 -07:00
Johann
67ac68e399 variance neon: assert overflow conditions
Change-Id: I12faca82d062eb33dc48dfeb39739b25112316cd
2017-05-22 11:25:06 -07:00
Johann
d217c87139 neon variance: special case 4x
The sub pixel variance uses a temp buffer which guarantees width ==
stride. Take advantage of this for the 4x sizes and avoid the very
costly lane loads.

Change-Id: Ia0c97eb8c29dc8dfa6e51a29dff9b75b3c6726f1
2017-05-22 10:51:31 -07:00
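
A hedged illustration of the width == stride shortcut: a 4x4 block in the
temp buffer is 16 contiguous bytes, so one vld1q_u8() can replace four
per-row lane loads. The function name is an assumption.

    #include <arm_neon.h>
    #include <stdint.h>

    /* With width == stride == 4, rows 0..3 of a 4x4 block are packed back
     * to back, so the whole block is a single 16-byte load. */
    static uint8x16_t load_4x4_contiguous_sketch(const uint8_t *temp) {
      return vld1q_u8(temp);
    }
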
Johann Koenig
e7cac13016 Merge changes Ib8dd96f7,Ie9854b77
* changes:
  neon variance: process 4x blocks
  use memcpy for unaligned neon stores
2017-05-22 17:48:33 +00:00
Johann Koenig
b5055002d7 Merge "neon 4 byte helper functions" 2017-05-19 17:11:30 +00:00
Johann
7b742da63e neon variance: process 4x blocks
Continue processing sets of 16 values. Plenty of improvement for 4x8
(doubles the speed) but only about 30% for 4x4.

BUG=webm:1422

Change-Id: Ib8dd96f75d474f0348800271d11e58356b620905
2017-05-17 17:35:01 -07:00
Johann
2057d3ef75 use memcpy for unaligned neon stores
Advise the compiler that the store is eventually going to a uint8_t
buffer. This helps avoid getting alignment hints which would cause the
memory access to fail.

Originally added as a workaround for clang:
https://bugs.llvm.org//show_bug.cgi?id=24421

Change-Id: Ie9854b777cfb2f4baaee66764f0e51dcb094d51e
2017-05-17 12:11:31 -07:00
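
A sketch of the workaround (assumed shape, not the exact helper): move the
lane out of the vector into a scalar and memcpy it to the uint8_t
destination, so the compiler cannot attach an alignment hint to the store.

    #include <arm_neon.h>
    #include <stdint.h>
    #include <string.h>

    /* Store the low 4 bytes of a D register to a possibly unaligned
     * uint8_t destination without letting the compiler assume alignment. */
    static void store_u32_unaligned_sketch(uint8_t *dst, uint8x8_t v) {
      const uint32_t a = vget_lane_u32(vreinterpret_u32_u8(v), 0);
      memcpy(dst, &a, sizeof(a));
    }
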
Johann
105503b839 neon fdct: 4x4 implementation
Approximately twice as fast as C implementation.

BUG=webm:1424

Change-Id: I3c0307fb08ddc23df42545cd089a78e2ed5c9d3f
2017-05-17 07:38:18 -07:00
Johann
7498fe2e54 neon 4 byte helper functions
When data is guaranteed to be aligned, use helper functions which
assert that requirement.

Change-Id: Ic4b188593aea0799d5bd8eda64f9858a1592a2a3
2017-05-15 13:42:31 -07:00
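
A hypothetical shape for such a helper (the commit's actual function names
may differ): assert the 4-byte alignment the caller guarantees, then load
two rows with lane loads.

    #include <arm_neon.h>
    #include <assert.h>
    #include <stdint.h>

    /* Aligned 4-byte helper: the asserts document (and check in debug
     * builds) the alignment requirement the caller is expected to meet. */
    static uint8x8_t load_u8_4x2_aligned_sketch(const uint8_t *buf,
                                                int stride) {
      uint32x2_t a = vdup_n_u32(0);
      assert(((uintptr_t)buf % sizeof(uint32_t)) == 0);
      assert((stride % (int)sizeof(uint32_t)) == 0);
      a = vld1_lane_u32((const uint32_t *)buf, a, 0);
      a = vld1_lane_u32((const uint32_t *)(buf + stride), a, 1);
      return vreinterpret_u8_u32(a);
    }
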
Johann
1088b4f87c move neon load/stores to a new file
Move the tran_low_t helper functions to a new file. Additional
load/store functions will be added here.

Change-Id: I52bf652c344c585ea2f3e1230886be93f5caefc3
2017-05-15 08:29:43 -07:00
Johann Koenig
d713ec3c46 Merge changes I92eb4312,Ibb2afe4e
* changes:
  subpel variance neon: add mixed sizes
  sub pixel variance neon: use generic variance
2017-05-10 18:19:52 +00:00
Johann
f7d1486f48 neon variance: process 16 values at a time
Read in a Q register. Works on blocks of 16 and larger.

Improvement of about 20% for 64x64. The smaller blocks are faster, but
don't show quite the same level of improvement; 16x32 gains only about 5%.

BUG=webm:1422

Change-Id: Ie11a877c7b839e66690a48117a46657b2ac82d4b
2017-05-08 18:48:55 +00:00
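
A sketch of what reading in a Q register looks like for the variance
kernels (assumed, not the exact library code): load 16 source and 16
reference pixels, widen the differences, and accumulate both the sum and
the sum of squares.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Accumulate one row of 16 pixels. The 16-bit sum lanes limit how many
     * rows can be accumulated before overflow, which is why the block
     * height matters for this scheme. */
    static void variance_16_row_sketch(const uint8_t *src, const uint8_t *ref,
                                       int16x8_t *sum, int32x4_t *sse) {
      const uint8x16_t a = vld1q_u8(src);
      const uint8x16_t b = vld1q_u8(ref);
      /* Unsigned subtraction wraps, but reinterpreting as signed recovers
       * the correct difference in [-255, 255]. */
      const int16x8_t d_lo =
          vreinterpretq_s16_u16(vsubl_u8(vget_low_u8(a), vget_low_u8(b)));
      const int16x8_t d_hi =
          vreinterpretq_s16_u16(vsubl_u8(vget_high_u8(a), vget_high_u8(b)));
      *sum = vaddq_s16(*sum, vaddq_s16(d_lo, d_hi));
      *sse = vmlal_s16(*sse, vget_low_s16(d_lo), vget_low_s16(d_lo));
      *sse = vmlal_s16(*sse, vget_high_s16(d_lo), vget_high_s16(d_lo));
      *sse = vmlal_s16(*sse, vget_low_s16(d_hi), vget_low_s16(d_hi));
      *sse = vmlal_s16(*sse, vget_high_s16(d_hi), vget_high_s16(d_hi));
    }
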
Johann Koenig
1814463864 Merge changes Id602909a,Ib0e85608
* changes:
  neon variance: process two rows of 8 at a time
  neon variance: add small missing sizes
2017-05-08 17:34:20 +00:00
Linfeng Zhang
2c3a2ad6f1 Merge changes I0cfe4117,I3581d80d,Ida62c941
* changes:
  Split dsp/x86/inv_txfm_sse2.c
  Update highbd idct functions arguments to use uint16_t dst
  Clean CONVERT_TO_BYTEPTR/SHORTPTR in idct
2017-05-08 16:15:57 +00:00
Johann
2346a6da4a subpel variance neon: add mixed sizes
Add support for everything except block sizes of 4.

Performance is better but numbers will improve again when the variance
optimizations land.

BUG=webm:1423

Change-Id: I92eb4312b20be423fa2fe6fdb18167a604ff4d80
2017-05-04 15:30:01 -07:00
Johann
19e1ec8359 sub pixel variance neon: use generic variance
When a neon version is available it will be called. This allows
decoupling the variance implementations and has no real downside. For
most configurations, the call will be #define'd to the neon
implementation.

Change-Id: Ibb2afe4e156c5610e89488504d366b3e6d1ba712
2017-05-04 15:30:01 -07:00
Johann
462e29703c fdct 8x8 neon: minor comment cleanup
Simplify the HBD/non-HBD distinction in the test.

Document why transpose_neon.h is not used.

Change-Id: I17659414206ddbb8c2f1ef0d9f4a17f1745d5a52
2017-05-04 15:14:23 -07:00
Johann
d6a7489dd5 neon variance: process two rows of 8 at a time
When the width is equal to 8, process two rows at a time. This doubles
the speed of 8x4 and improves 8x8 by about 20%.

8x16 was using this technique already, but still improved a little bit
with the rewrite.

Also use this for vpx_get8x8var_neon.

BUG=webm:1422

Change-Id: Id602909afcec683665536d11298b7387ac0a1207
2017-05-04 08:59:46 -07:00
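
Sketch of the width == 8 technique (illustrative names): two 8-pixel rows
are combined into one Q register so each loop iteration processes 16 bytes,
matching the 16-wide path.

    #include <arm_neon.h>
    #include <stdint.h>

    /* Combine two consecutive 8-pixel rows into a single 16-byte vector. */
    static uint8x16_t load_two_rows_of_8_sketch(const uint8_t *buf,
                                                int stride) {
      const uint8x8_t row0 = vld1_u8(buf);
      const uint8x8_t row1 = vld1_u8(buf + stride);
      return vcombine_u8(row0, row1);
    }
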
Johann
cb9133c72f neon variance: add small missing sizes
Some of the mixed sizes were missing. They can be implemented trivially
using the existing helper function.

When comparing the previous 16x8 and 8x16 implementations, the helper
function is about 10% faster than the 16x8 version. The 8x16 is very
close, but the existing version appears to be faster.

BUG=webm:1422

Change-Id: Ib0e856083c1893e1bd399373c5fbcd6271a7f004
2017-05-04 08:59:42 -07:00
Linfeng Zhang
d5de63d2be Update highbd idct functions arguments to use uint16_t dst
BUG=webm:1388

Change-Id: I3581d80d0389b99166e70987d38aba2db6c469d5
2017-05-03 13:59:16 -07:00
Linfeng Zhang
081b39f2b7 Clean CONVERT_TO_BYTEPTR/SHORTPTR in idct
BUG=webm:1388

Change-Id: Ida62c941f2b836d6c9e27b427a7d5008ab6dc112
2017-05-03 13:58:31 -07:00
Linfeng Zhang
51dc998f3a Update highbd convolve functions arguments to use uint16_t src/dst
BUG=webm:1388

Change-Id: I6912de2639895d817ce850da8ea9f6c8fe21da42
2017-04-25 14:22:19 -07:00
Linfeng Zhang
bf8a49abbd Clean CONVERT_TO_BYTEPTR/SHORTPTR in convolve
Replace by CAST_TO_BYTEPTR/SHORTPTR.
The rule is: if a short ptr is cast to a byte ptr, any offset
operation on the byte ptr must be doubled. We do this by casting to a
short ptr first, adding the offset, then casting back to a byte ptr.

BUG=webm:1388

Change-Id: I9e18a73ba45ddae58fc9dae470c0ff34951fe248
2017-04-19 12:13:49 -07:00
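
A small illustration of the offset rule stated above. The macro bodies here
are assumptions in the spirit of CAST_TO_BYTEPTR/CAST_TO_SHORTPTR rather
than copies of the library headers.

    #include <stdint.h>

    #define CAST_TO_SHORTPTR_SKETCH(x) ((uint16_t *)(x))
    #define CAST_TO_BYTEPTR_SKETCH(x) ((uint8_t *)(x))

    /* Offset a byte ptr that really points at uint16_t data: cast back to
     * the true type, add the offset in elements, then cast to byte ptr
     * again; the byte address moves by 2 * offset. */
    static uint8_t *offset_byteptr_sketch(uint8_t *byte_ptr_backed_by_u16,
                                          int offset) {
      return CAST_TO_BYTEPTR_SKETCH(
          CAST_TO_SHORTPTR_SKETCH(byte_ptr_backed_by_u16) + offset);
    }
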
Linfeng Zhang
6fc2e57c2c Update 32x32 high bitdepth idct NEON optimization
Preparation for the CONVERT_TO_BYTEPTR/SHORTPTR cleanup.

BUG=webm:1388

Change-Id: I928d30a5698023bb90888d783cf81c51ec183760
2017-04-05 15:28:11 -07:00
James Zern
f91c3bb3ab idct_neon: prefix non-static functions w/'vpx_'
Change-Id: I94fcdeae18468e6ef0cb7119b8142d982a048031
2017-03-22 11:49:23 -07:00
Linfeng Zhang
27530d484e Add vpx_highbd_idct32x32_1024_add_neon()
BUG=webm:1301

Change-Id: Ib90af0c1712e56b301d0e981dbe9a641e15e36ca
2017-03-17 00:27:46 -07:00
Linfeng Zhang
50b13f75b8 Add vpx_highbd_idct32x32_34_add_neon()
BUG=webm:1301

Change-Id: I74dd16c6c64e7bb71aa991cedccddf0663ef5e06
2017-03-17 00:27:46 -07:00
Linfeng Zhang
65e9fb65e8 Add vpx_highbd_idct32x32_135_add_neon()
BUG=webm:1301

Change-Id: I58c2d65d385080711c3666d6d8f9d241dac7b21a
2017-03-16 22:37:55 -07:00
Linfeng Zhang
e54231d613 Clean vpx_idct32x32_1024_add_neon()
Change-Id: I05921e16d6a3e4e7e5b00a90624735050a186636
2017-03-15 11:24:31 -07:00
Linfeng Zhang
c756eb01c8 Fix overflow issue in 32x32 idct NEON intrinsics
Similar issue to Change bc1c18e.

The PartialIDctTest.ResultsMatch test on vpx_idct32x32_135_add_neon()
in high bit-depth mode exposes 16-bit overflow in the final stage of
pass 2 when the test count is increased from 1,000 to 1,000,000.

Change to use saturating add/sub for vpx_idct32x32_34_add_neon(),
vpx_idct32x32_135_add_neon() and vpx_idct32x32_1024_add_neon() in high
bit-depth mode.

Change-Id: Iaec0e9aeab41a3fdb4e170d7e9b3ad1fda922f6f
2017-03-14 16:59:14 -07:00
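
A minimal sketch of the change described above (assumed shape): in high
bit-depth builds the plain 16-bit add in the final stage is replaced with
the saturating form so intermediate values cannot wrap.

    #include <arm_neon.h>

    /* Plain vs. saturating add for the final stage: vaddq_s16 wraps on
     * overflow, vqaddq_s16 clamps to the int16_t range instead. */
    static int16x8_t final_stage_add_sketch(int16x8_t a, int16x8_t b,
                                            int high_bitdepth) {
      return high_bitdepth ? vqaddq_s16(a, b) : vaddq_s16(a, b);
    }
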
Linfeng Zhang
77311e0dff Update vpx_idct32x32_1024_add_neon()
Most are cosmetic changes. Speed is unchanged with clang 3.8 and about
5% faster with gcc 4.8.4.

Tried the strategy used in 8x8 and 16x16 (whose order of operations is
similar to the C code); though speed gets better with gcc, it gets
worse with clang.

Tried to remove store_in_output(), but speed gets worse.

Change-Id: I93c8d284e90836f98962bb23d63a454cd40f776e
2017-03-08 12:39:04 -08:00
Linfeng Zhang
c4e5c54d69 cosmetics,dsp/arm/: vpx_idct32x32_{34,135}_add_neon()
No speed changes and disassembly is almost identical.

Change-Id: Id07996237d2607ca6004da5906b7d288b8307e1f
2017-03-08 08:58:32 -08:00