Johann
4378503665
Merge "Remove redundant arm neon instructions."
2014-02-14 20:02:51 -08:00
Yaowu Xu
ecf392a155
Merge "minor spelling cleanup in comments"
2014-02-14 14:29:35 -08:00
Frank Galligan
b41acbf9bb
Fix neon wide loopfilter for filter8 only branch
...
The current code removed the check to only perform the filter8.
Change-Id: Ie54e19a77745042a5660eab986d9ef1c42e82410
2014-02-12 18:36:17 -08:00
Andrew Russell
549c31f8ae
minor spelling cleanup in comments
...
Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06
2014-02-12 16:32:51 -08:00
James Yu
619f29cdb0
Remove redundant arm neon instructions.
...
Change-Id: I1fabad59747eb5f68c64275a36c3a1d94daf32a3
Signed-off-by: James Yu <james.yu@linaro.org>
2014-02-11 21:19:12 -08:00
Martin Storsjo
03bc491721
arm: Consistently use braces around doubleword arguments to vld
...
This isn't strictly necessary, but makes the file more consistent
with the other arm assembly source files.
Change-Id: I245c9677d89e0ab3f31991e473764858af35b180
2014-02-05 13:24:25 +02:00
Martin Storsjo
c2bb1aa544
arm: Use {} around quadword arguments to vld
...
This fixes building for iOS.
Change-Id: Ice082648c02a3faf93891f7ddc122875e2bdc9cb
2014-02-05 13:24:17 +02:00
Dmitry Kovalev
c49b08c9a1
Removing "_short" suffix from arm transform file names.
...
Change-Id: Iefe118f61a335e88821a21a9f50fb919212c1507
2014-01-31 17:19:02 -08:00
hkuang
770454f3a8
Add vp9_tm_predictor_32x32 neon implementation
...
which is 7.8 times faster than C.
Change-Id: I858ef4ec09202a07d445da8db702783d6d9d7321
2014-01-27 16:01:07 -08:00
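For readers unfamiliar with the TM (TrueMotion) predictor that these neon commits accelerate, the following is a minimal scalar sketch, not the project's actual code. It assumes the usual definition in which each predicted pixel is `left[r] + above[c] - top_left`, clipped to 8 bits. The names `tm_predictor_ref` and `clip_pixel_u8` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an integer to the 8-bit pixel range [0, 255]. */
static uint8_t clip_pixel_u8(int v) {
  return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Hypothetical scalar reference for a bs x bs TM prediction block.
 * above:    the bs pixels directly above the block
 * left:     the bs pixels directly to the left of the block
 * top_left: the pixel diagonally above-left of the block */
static void tm_predictor_ref(uint8_t *dst, ptrdiff_t stride, int bs,
                             const uint8_t *above, const uint8_t *left,
                             uint8_t top_left) {
  assert(bs > 0);
  for (int r = 0; r < bs; ++r) {
    for (int c = 0; c < bs; ++c)
      dst[c] = clip_pixel_u8(left[r] + above[c] - top_left);
    dst += stride;
  }
}
```

A SIMD version would typically compute `above[c] - top_left` once per block in widened lanes, then add each `left[r]` and saturate, which is presumably where most of the reported speedup comes from.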
hkuang
05d2081d38
Fix the vp9_tm_predictor_8x8_neon.
...
Change-Id: I832cf83871044bfee7b7e57dbd31bae05cbd53e9
2014-01-27 10:17:20 -08:00
Frank Galligan
183361dadb
Merge "Optimize vp9_tm_predictor_8x8_neon function"
2014-01-24 16:21:56 -08:00
Frank Galligan
56a8a0b54b
Optimize vp9_tm_predictor_8x8_neon function
...
Change-Id: Ia12aae491202098ff66366145aa0c3da38dc97e5
2014-01-24 11:07:14 -08:00
hkuang
3633ffcbf7
Add vp9_tm_predictor_16x16 neon implementation
...
which is 3.5 times faster than C.
Change-Id: I24439ba7a2971829c11620f34848facf2c916678
2014-01-24 10:22:58 -08:00
hkuang
97826df96b
Add tm_predictor_8x8 neon implementation.
...
Change-Id: I76c2720546b737cb63018a8ab6a3ff62a291786d
2014-01-22 13:43:20 -08:00
hkuang
2a2d8c140f
Merge "Add vp9_tm_predictor_4x4 neon implementation"
2014-01-16 10:18:12 -08:00
hkuang
f2ef389256
Add vp9_tm_predictor_4x4 neon implementation
...
Change-Id: I10c423bde7ea5a3bac9f14f35c73b6bc31c8f3e3
2014-01-15 11:51:36 -08:00
hkuang
5be0ed30dc
Merge "Add initial intra frame neon optimization. 1~2% gain."
2014-01-08 14:41:43 -08:00
hkuang
691111aacf
Add initial intra frame neon optimization. 1~2% gain.
...
More intra optimizations will be added.
Change-Id: I33ae8d93f6002bf7b64cc2669602d9e6bfa5a6e8
2014-01-08 11:58:42 -08:00
Jim Bankoski
b720ba165f
rename loop filter functions
...
This renames all the loop filter functions so that they no
longer refer to mb.
Change-Id: I8a58a8c7fd253d835cb619bde13913e896ece90b
2013-12-17 17:34:34 -08:00
Frank Galligan
b4874e2c82
Fix 16 wide neon horz loopfilter.
...
The multiply by 3 was done on 8-bit vectors when it should have
been done on 16-bit vectors.
Change-Id: I248c1429b3134dfd171dfab0ebb109fd2437e1fc
2013-11-26 10:02:40 -08:00
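The class of bug described in the commit above can be illustrated in plain C. This is only a sketch of why lane width matters, not the neon code that was fixed: multiplying inside an 8-bit lane wraps modulo 256, while widening to 16 bits first keeps the exact product. The helper names `mul3_u8` and `mul3_u16` are hypothetical.

```c
#include <stdint.h>

/* Multiply by 3 but keep the result in an 8-bit lane: wraps modulo 256. */
static uint8_t mul3_u8(uint8_t x) {
  return (uint8_t)(x * 3u);
}

/* Widen to 16 bits first, then multiply: exact for every 8-bit input. */
static uint16_t mul3_u16(uint8_t x) {
  return (uint16_t)((uint16_t)x * 3u);
}
```

For example, an input of 100 gives 300 in the 16-bit version but 44 (300 - 256) in the 8-bit version, which is the kind of silent corruption that shows up as a test vector mismatch.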
Yunqing Wang
ed36720b66
Do vertical loopfiltering in parallel
...
This patch followed the "Add filter_selectively_vert_row2 to enable
parallel loopfiltering" commit and added an x86 SSE2 optimization
to do 16-pixel filtering in parallel. For the other optimizations
(neon and dspr2), the current 16-pixel functions were implemented by
calling the 8-pixel functions twice; real 16-pixel functions could be
added later.
Decoder speedup:
tulip clip: 2% speed gain;
old_town_cross: 1.2% speed gain;
bus: 2% speed gain.
Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
2013-11-22 10:04:51 -08:00
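The fallback mentioned above, where a 16-pixel function is built by calling an 8-pixel function twice, can be sketched as follows. Both function names are hypothetical, and the 8-pixel routine is a trivial stand-in (it halves each pixel) rather than a real loop filter. In the real code the offset between the two calls would depend on the filtering direction and stride.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an optimized 8-pixel routine.
 * It only halves each pixel so that the effect is easy to check. */
static void filter_8px(uint8_t *s) {
  assert(s != NULL);
  for (int i = 0; i < 8; ++i) s[i] = (uint8_t)(s[i] >> 1);
}

/* 16-pixel variant implemented by running the 8-pixel routine
 * on the first half and then on the second half. */
static void filter_16px(uint8_t *s) {
  filter_8px(s);
  filter_8px(s + 8);
}
```

This keeps the 16-pixel entry point available on every platform, at the cost of not gaining any extra parallelism until a native 16-pixel implementation is written.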
Frank Galligan
97d1258375
Revert "Add 16 wide neon horz loopfilter."
...
The change caused mismatches with some test vectors on neon.
Original CL: https://gerrit.chromium.org/gerrit/#/c/67863/
Change-Id: I913891636d53783e93cb1865ca78ded1821dc4b0
2013-11-21 14:01:33 -08:00
Frank Galligan
98de15137e
Add 16 wide neon horz loopfilter.
...
Add support to do 16 pixel horizontal filtering in Neon.
Nexus devices saw about 0.5% decode speed increase.
Change-Id: I2993f6c2d49f31fa74976879eeaa289fd3f4e15d
2013-11-21 09:39:36 -08:00
Yunqing Wang
64f728caef
Do horizontal loopfiltering in parallel
...
This patch followed "Rewrite filter_selectively_horiz for parallel
loopfiltering" commit, and added x86 SSE2 optimization to do
16-pixel filtering in parallel. Also, corrected the declaration
of aligned arrays. For 8-pixel-in-parallel case, improved the
calculation of the masks and filters. Updated the threshold loading
since the thresholds were already duplicated. Updated neon C functions
to call neon loopfilters twice.
Using tulip clip, tests showed it gave a ~1.5% decoder speed gain.
Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
2013-11-15 16:18:43 -08:00
Johann
e72d49a97a
Use lowercase 'b' to branch
...
iOS doesn't recognize B:
bad instruction `B idct32_pass_loop'
Change-Id: I3cf6aede4639f1d9efa97f7962fa287ba6feaaef
2013-11-12 10:41:06 -08:00
hkuang
c689a126ed
Fix a bug in the assembly code.
...
Change-Id: Ic416e3f8a11e82ee298e6f709b2119a9ddf1e2f8
2013-11-11 12:49:12 -08:00
hkuang
6b16f63332
Add back vp9_short_idct32x32_1_add_neon
...
It was deleted in cleanup I63df79a13cf62aa2c9360a7a26933c100f9ebda3.
Change-Id: I034848cf05031618818f7df2e7f9c35102686948
2013-11-05 14:57:32 -08:00
Dmitry Kovalev
65f118d72f
Making input pointer of any inverse transform constant.
...
Also renaming dest_stride to stride in some places.
Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940
2013-10-11 18:27:12 -07:00
Dmitry Kovalev
7ef573914d
Consistent names for inverse hybrid transforms (1 of 2).
...
Renames:
vp9_short_iht4x4_add -> vp9_iht4x4_16_add
vp9_short_iht8x8_add -> vp9_iht8x8_64_add
vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add
Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0
2013-10-11 13:31:32 -07:00
Dmitry Kovalev
1e766b50e2
Giving consistent names to IDCT 32x32 functions.
...
Renames:
vp9_short_idct32x32_add -> vp9_idct32x32_1024_add
vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add
vp9_idct_add_32x32 -> vp9_idct32x32_add
Change-Id: Id85306f5814bac6c47463a6b5901a93082510666
2013-10-10 11:27:39 -07:00
Dmitry Kovalev
b096c5a336
Giving consistent names to IDCT 16x16 functions.
...
Renames:
vp9_short_idct16x16_add -> vp9_idct16x16_256_add
vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add
vp9_short_idct16x16_1_add -> vp9_idct16x16_1_add
vp9_idct_add_16x16 -> vp9_idct16x16_add
Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3
2013-10-07 14:31:10 -07:00
Dmitry Kovalev
c6ad70d5f1
Giving consistent names to IDCT 8x8 functions.
...
Renames:
vp9_short_idct8x8_add -> vp9_idct8x8_64_add
vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add
vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
vp9_idct_add_8x8 -> vp9_idct8x8_add
Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
2013-10-06 00:24:09 -07:00
Dmitry Kovalev
3a0602578e
Giving consistent names to IDCT/IWHT functions.
...
The idea is to have the following names for each transform size:
vp9_idct4x4_add
vp9_idct4x4_1_add
vp9_idct4x4_10_add
vp9_idct4x4_16_add
vp9_idct8x8_add
vp9_idct8x8_1_add
vp9_idct8x8_10_add
vp9_idct8x8_64_add
etc for 16x16, 32x32
The actual list of renames in this patch:
vp9_idct_add_lossless -> vp9_iwht4x4_add
vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
vp9_idct_add -> vp9_idct4x4_add
vp9_short_idct4x4_add -> vp9_idct4x4_16_add
vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
2013-10-04 14:17:06 -07:00
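One way to read the naming scheme above is sketched below, under stated assumptions: the numeric suffix is taken to be the largest number of coefficients a variant handles, and the unsuffixed function is taken to dispatch on the end-of-block position. The enum, the `pick_idct8x8` helper, and the exact thresholds are hypothetical illustrations and are not taken from the commit.

```c
#include <stdint.h>

/* Hypothetical tags for the three specialized 8x8 variants. */
typedef enum { IDCT8X8_1, IDCT8X8_10, IDCT8X8_64 } idct8x8_variant;

/* Sketch of what a dispatcher such as vp9_idct8x8_add might decide,
 * assuming eob is the count of coefficients up to the last nonzero one. */
static idct8x8_variant pick_idct8x8(int eob) {
  if (eob == 1) return IDCT8X8_1;   /* DC only */
  if (eob <= 10) return IDCT8X8_10; /* a few low-frequency coefficients */
  return IDCT8X8_64;                /* full 8x8 transform */
}
```

With names built this way, a call site reveals from the function name alone how much of the coefficient block each variant expects to process.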
Dmitry Kovalev
3fab2125ff
Renaming vp9_short_idct10_8x8_add to vp9_short_idct8x8_10_add.
...
Making name consistent with vp9_short_idct8x8 and vp9_short_idct8x8_1.
Change-Id: I99e0be040ec893f9571dcf090e18f98dc58339f5
2013-09-27 15:26:27 -07:00
Christian Duvivier
b1b4ba1bdd
Properly save neon registers.
...
Replace the current code, which corrupts the stack, with a
duplicate of the vp8 code that saves and restores the neon
registers.
Change-Id: Ibb0220b9aa985d10533befa0a455ebce57a2891a
2013-09-27 14:25:33 -07:00
Dmitry Kovalev
db60c02c9e
Merge "Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10."
2013-09-27 13:08:52 -07:00
Dmitry Kovalev
15a36a0a0d
Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10.
...
Making function name consistent with vp9_short_idct16x16 and
vp9_short_idct16x16_1.
Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517
2013-09-26 14:01:25 -07:00
Christian Duvivier
5b1dc1515f
Fix a bunch of TODO from vp9_short_idct32x32_add_neon.
...
- full ASM version, no more C gateway file.
- integrate combine-add with last step of 2nd pass.
- remove a few push/pop pairs.
- some instruction reordering to hide latency.
Change-Id: Ic9d9933c908b65d1bf7ba8fd47b524cda808c9c6
2013-09-25 21:15:19 -07:00
Johann
a6a00fc6a3
Use lowercase instruction in assembly
...
The iOS compiler does not recognize BLE:
bad instruction `BLE idct32_transpose_pair_loop'
Change-Id: I7426694c66bc31caf939a2d5000968da1222c15b
2013-09-20 16:11:05 -07:00
hkuang
23e1a29fc7
Speed up iht8x8 by rearranging instructions.
...
Speed improves from 282% to 302% faster based on assembly-perf.
Change-Id: I08c5c1a542d43361611198f750b725e4303d19e2
2013-09-16 14:23:26 -07:00
hkuang
86fb12b600
Merge "Add neon optimize iht8x8 which is 282% faster than C."
2013-09-12 15:42:44 -07:00
hkuang
182366c736
Add neon optimize iht8x8 which is 282% faster than C.
...
Change-Id: I963dd4a6e8671957403ccbb9a16ea7de703e3530
2013-09-12 11:49:05 -07:00
Christian Duvivier
6a501462f8
First draft of vp9_short_idct32x32_add_neon.
...
Lots of TODOs, which will be taken care of in upcoming changes. As is,
about 6x faster than the C version.
Change-Id: Ie2557b72fd2d8edca376dbf400a4d173aa5e63e0
2013-09-11 15:19:38 -07:00
hkuang
fc5ec206a7
Speed up idct16x16 by rearranging instructions.
...
Speed improves from 376% to 400% faster based on assembly-perf.
Change-Id: If0b2eccc39d5793dc101ce9feb7fcadf88396ea2
2013-09-09 18:00:13 -07:00
hkuang
01c4e04424
Speed up idct8x8 by rearranging instructions.
...
Speed improves from 264% ~ 270% to 280% ~ 300% based on assembly-perf.
Change-Id: I3e2cc818ec14b432204ff43732f39b6438db685d
2013-09-04 15:57:22 -07:00
hkuang
3b8614a8f6
Add neon-optimized vp9_short_iht4x4_add.
...
Change-Id: I42c497b68ae1ee645b59c9968ad805db0a43e37e
2013-09-04 12:37:58 -07:00
hkuang
3a679e56b2
Add neon-optimized vp9_short_idct16x16_1_add.
...
Change-Id: Ib9354c1d975d03e8081df20d50b6a77dfe2dc7e5
2013-08-27 14:00:27 -07:00
hkuang
36e9b82080
Add neon-optimized vp9_short_idct8x8_1_add.
...
Change-Id: I0b15d5e3b0eb97abb9ab5ec08e88b61f8723aaf4
2013-08-26 16:28:57 -07:00
hkuang
69384f4fad
Add neon-optimized vp9_short_idct4x4_1_add.
...
Change-Id: I6ecb5c4a1a472feb8e84e9f3352b536d5e28a4a5
2013-08-26 15:55:16 -07:00
hkuang
b85367a608
Merge "Optimise idct4x4: rearrange the instructions a bit to improve instruction scheduling."
2013-08-23 10:08:43 -07:00