Christian Duvivier
b1b4ba1bdd
Properly save neon registers.
...
Replace the current code, which corrupts the stack, with a
duplicate of the vp8 code that saves and restores the neon
registers.
Change-Id: Ibb0220b9aa985d10533befa0a455ebce57a2891a
2013-09-27 14:25:33 -07:00
Dmitry Kovalev
db60c02c9e
Merge "Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10."
2013-09-27 13:08:52 -07:00
Dmitry Kovalev
15a36a0a0d
Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10.
...
Make the function name consistent with vp9_short_idct16x16 and
vp9_short_idct16x16_1.
Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517
2013-09-26 14:01:25 -07:00
Christian Duvivier
5b1dc1515f
Fix a bunch of TODO from vp9_short_idct32x32_add_neon.
...
- full ASM version, no more C gateway file.
- integrate combine-add with last step of 2nd pass.
- remove a few push/pop pairs.
- some instruction reordering to hide latency.
Change-Id: Ic9d9933c908b65d1bf7ba8fd47b524cda808c9c6
2013-09-25 21:15:19 -07:00
Johann
a6a00fc6a3
Use lowercase instruction in assembly
...
The iOS compiler does not recognize BLE:
bad instruction `BLE idct32_transpose_pair_loop'
Change-Id: I7426694c66bc31caf939a2d5000968da1222c15b
2013-09-20 16:11:05 -07:00
hkuang
23e1a29fc7
Speed up iht8x8 by rearranging instructions.
...
Speed improves from 282% to 302% faster than C, based on assembly-perf.
Change-Id: I08c5c1a542d43361611198f750b725e4303d19e2
2013-09-16 14:23:26 -07:00
hkuang
86fb12b600
Merge "Add neon optimize iht8x8 which is 282% faster than C."
2013-09-12 15:42:44 -07:00
hkuang
182366c736
Add neon optimize iht8x8 which is 282% faster than C.
...
Change-Id: I963dd4a6e8671957403ccbb9a16ea7de703e3530
2013-09-12 11:49:05 -07:00
Christian Duvivier
6a501462f8
First draft of vp9_short_idct32x32_add_neon.
...
Lots of TODOs, which will be taken care of in upcoming changes. As is,
about 6x faster than the C version.
Change-Id: Ie2557b72fd2d8edca376dbf400a4d173aa5e63e0
2013-09-11 15:19:38 -07:00
hkuang
fc5ec206a7
Speed up idct16x16 by rearranging instructions.
...
Speed improves from 376% to 400% faster, based on assembly-perf.
Change-Id: If0b2eccc39d5793dc101ce9feb7fcadf88396ea2
2013-09-09 18:00:13 -07:00
hkuang
01c4e04424
Speed up idct8x8 by rearranging instructions.
...
Speed improves from 264% ~ 270% to 280% ~ 300% faster, based on assembly-perf.
Change-Id: I3e2cc818ec14b432204ff43732f39b6438db685d
2013-09-04 15:57:22 -07:00
hkuang
3b8614a8f6
Add neon optimize vp9_short_iht4x4_add.
...
Change-Id: I42c497b68ae1ee645b59c9968ad805db0a43e37e
2013-09-04 12:37:58 -07:00
hkuang
3a679e56b2
Add neon optimize vp9_short_idct16x16_1_add.
...
Change-Id: Ib9354c1d975d03e8081df20d50b6a77dfe2dc7e5
2013-08-27 14:00:27 -07:00
hkuang
36e9b82080
Add neon optimize vp9_short_idct8x8_1_add.
...
Change-Id: I0b15d5e3b0eb97abb9ab5ec08e88b61f8723aaf4
2013-08-26 16:28:57 -07:00
hkuang
69384f4fad
Add neon optimize vp9_short_idct4x4_1_add.
...
Change-Id: I6ecb5c4a1a472feb8e84e9f3352b536d5e28a4a5
2013-08-26 15:55:16 -07:00
hkuang
b85367a608
Merge "Optimise idct4x4: rearrange the instructions a bit to improve instruction scheduling."
2013-08-23 10:08:43 -07:00
hkuang
4082bf9d7c
Add neon optimize vp9_short_idct10_16x16_add.
...
vp9_short_idct10_16x16_add handles blocks that only have valid data in the
top-left 4x4 block. All other data is 0, so we can cut many
unnecessary calculations to save instructions.
Change-Id: I6e30a3fee1ece5af7f258532416d0bfddd1143f0
2013-08-22 15:53:22 -07:00
hkuang
610642c130
Optimise idct4x4: rearrange the instructions a bit
...
to improve instruction scheduling.
Change-Id: I5ea881a6e419f9e8ed4b3b619406403b4de24134
2013-08-22 11:02:22 -07:00
hkuang
37cda6dc4c
Add neon optimize vp9_short_idct10_8x8_add.
...
vp9_short_idct10_8x8_add handles blocks that only have valid data in the
top-left 4x4 block. All other data is 0, so we can cut several
unnecessary calculations to save instructions.
Change-Id: I34fda95e29082b789aded97c2df193991c2d9195
2013-08-20 11:51:07 -07:00
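
The idea behind the two idct10 commits (37cda6dc4c for 8x8, 4082bf9d7c for 16x16) can be sketched in plain C. This is only an illustration under stated assumptions: `basis()` is a made-up integer matrix, not the VP9 IDCT constants, and `transform_full` / `transform_top_left` are hypothetical names, not functions from the tree. It shows that when only the top-left 4x4 inputs are non-zero, the row pass can skip the zero rows and both passes can stop after four terms, while producing the same result as the full transform.

```c
#include <stdint.h>

#define N 8  /* transform size */
#define NZ 4 /* only the top-left NZ x NZ inputs may be non-zero */

/* Hypothetical integer basis matrix; NOT the VP9 IDCT constants. */
static int32_t basis(int r, int c) { return (r + 1) * (c + 2) % 7 - 3; }

/* Full separable transform: out = T * in * T^T, every term computed. */
static void transform_full(const int16_t *in, int32_t *out) {
  int32_t tmp[N * N];
  for (int r = 0; r < N; ++r)   /* row pass: tmp = in * T^T */
    for (int c = 0; c < N; ++c) {
      int32_t s = 0;
      for (int k = 0; k < N; ++k) s += in[r * N + k] * basis(c, k);
      tmp[r * N + c] = s;
    }
  for (int r = 0; r < N; ++r)   /* column pass: out = T * tmp */
    for (int c = 0; c < N; ++c) {
      int32_t s = 0;
      for (int k = 0; k < N; ++k) s += basis(r, k) * tmp[k * N + c];
      out[r * N + c] = s;
    }
}

/* Same transform, assuming everything outside the top-left 4x4 is zero. */
static void transform_top_left(const int16_t *in, int32_t *out) {
  int32_t tmp[NZ * N];
  for (int r = 0; r < NZ; ++r)  /* rows NZ..N-1 are all zero: skip them */
    for (int c = 0; c < N; ++c) {
      int32_t s = 0;
      for (int k = 0; k < NZ; ++k) s += in[r * N + k] * basis(c, k);
      tmp[r * N + c] = s;
    }
  for (int r = 0; r < N; ++r)   /* only NZ rows of tmp are non-zero */
    for (int c = 0; c < N; ++c) {
      int32_t s = 0;
      for (int k = 0; k < NZ; ++k) s += basis(r, k) * tmp[k * N + c];
      out[r * N + c] = s;
    }
}
```

In this sketch the row pass drops from 8x8x8 to 4x8x4 multiply-accumulates and the column pass from 8x8x8 to 8x8x4. The actual NEON code may cut different operations.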
Johann
d514b778c4
Merge "Reduce the instructions of idct8x8. Also add the saving and restoring of D registers."
2013-08-16 11:30:21 -07:00
Johann
65aa89af1a
Merge "Reduce instructions of idct4x4."
2013-08-16 11:28:35 -07:00
Frank Galligan
bdc785e976
Merge "vp9: neon: optimise vp9_wide_mbfilter_neon"
2013-08-16 11:16:48 -07:00
hkuang
df0715204c
Reduce instructions of idct4x4.
...
Change-Id: Ia26a2526804e7e2f656b0051618a615fca8fc79d
2013-08-16 10:54:56 -07:00
hkuang
60ecd60c9a
Reduce the instructions of idct8x8. Also add the
...
saving and restoring of D registers.
Change-Id: Id3630c90fcb160ef939fef55411342608af5f990
2013-08-16 10:32:12 -07:00
Mans Rullgard
4fa93bcef4
vp9: neon: use aligned stores in convolve functions
...
The destination is block-aligned so it is safe to use aligned
stores.
Change-Id: I38261e4fa40bc60e6472edffece59e372908da7e
2013-08-16 14:25:08 +01:00
Johann
a9aa7d07d0
Merge "vp9: neon: add vp9_convolve_avg_neon"
2013-08-15 14:55:15 -07:00
Johann
63e140eaa7
Merge "vp9: neon: add vp9_convolve_copy_neon"
2013-08-15 14:55:08 -07:00
Mans Rullgard
67e53716e0
vp9: neon: optimise vp9_wide_mbfilter_neon
...
Break up long dependency chains to improve instruction scheduling.
Change-Id: I0e0cb66943df24af920767bb4167b25c38af9630
2013-08-15 19:07:22 +01:00
hkuang
39f42c8713
Merge "Add neon optimize vp9_short_idct16x16_add."
2013-08-14 14:16:20 -07:00
hkuang
cf6beea661
Add neon optimize vp9_short_idct16x16_add.
...
Change-Id: I27134b9a5cace2bdad53534562c91d829b48838d
2013-08-14 13:52:16 -07:00
Mans Rullgard
0f1deccf86
vp9: neon: add vp9_convolve_avg_neon
...
Change-Id: I33cff9ac4f2234558f6f87729f9b2e88a33fbf58
2013-08-14 16:27:55 +01:00
Mans Rullgard
635ba269be
vp9: neon: add vp9_convolve_copy_neon
...
Change-Id: I15adbbda15d1842e9f15f21878a5ffbb75c3c0c9
2013-08-14 16:27:55 +01:00
Johann
4417c04531
Merge "vp9: neon: optimise convolve8_vert functions"
2013-08-12 17:54:47 -07:00
Mans Rullgard
ad7021dd6c
vp9: neon: optimise convolve8_vert functions
...
Invert loops to operate vertically in the inner loop. This allows
removing redundant loads.
Also add preloading of data.
Change-Id: I4fa85c0ab1735bcb1dd6ea58937efac949172bdc
2013-08-12 15:37:48 +01:00
Mans Rullgard
b84dc949c8
vp9: neon: optimise convolve8_horiz functions
...
Each iteration of the horizontal loop reuses 7 of the 11 source
values. Loading only the 4 new values saves some time.
Also add preload for source data.
Overall 4% faster on Chromebook.
Change-Id: I8f69e749f2b7f79e9734620dcee51dbfcd716b44
2013-08-11 16:21:55 +01:00
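
The reuse described in b84dc949c8 follows from the filter geometry: four outputs of an 8-tap filter need 4 + 8 - 1 = 11 source values, and the next four outputs share 7 of them. A minimal scalar sketch in C, assuming the width is a multiple of 4, that `src` points at the first tap (not the filter centre), and 7-bit rounding; the function names and the window buffer are illustrative, not taken from the NEON source:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAPS 8
#define STEP 4                    /* outputs produced per iteration */
#define WINDOW (STEP + TAPS - 1)  /* 11 source values per iteration */

static uint8_t clip_u8(int v) {
  return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* Reference version: reads all 8 source values for every output. */
static void convolve8_row_naive(const uint8_t *src, uint8_t *dst, int w,
                                const int16_t *filter) {
  for (int x = 0; x < w; ++x) {
    int sum = 0;
    for (int k = 0; k < TAPS; ++k) sum += src[x + k] * filter[k];
    dst[x] = clip_u8((sum + 64) >> 7);  /* 7-bit rounding is an assumption */
  }
}

/* Sliding-window version: keeps 7 values and loads only 4 new ones.
 * src must have w + 7 readable bytes. */
static void convolve8_row_window(const uint8_t *src, uint8_t *dst, int w,
                                 const int16_t *filter) {
  uint8_t win[WINDOW];
  assert(w % STEP == 0);
  memcpy(win, src, TAPS - 1);  /* prime the window with the first 7 values */
  for (int x = 0; x < w; x += STEP) {
    memcpy(win + TAPS - 1, src + x + TAPS - 1, STEP);  /* 4 new values */
    for (int i = 0; i < STEP; ++i) {
      int sum = 0;
      for (int k = 0; k < TAPS; ++k) sum += win[i + k] * filter[k];
      dst[x + i] = clip_u8((sum + 64) >> 7);
    }
    memmove(win, win + STEP, TAPS - 1);  /* keep the 7 reused values */
  }
}
```

In NEON code the window would presumably live in registers rather than in a byte array, so the `memmove` here only stands in for register reuse.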
Christian Duvivier
78182538d6
Neon version of vp9_short_idct4x4_add.
...
Change-Id: Idec4cae0cb9b3a29835fd2750d354c1393d47aa4
2013-08-06 18:41:27 -07:00
Mans Rullgard
355cb14dc7
vp9: neon: convolve: replace some insns with simpler equivalents
...
Change-Id: I5d6906772e6e6adf68d7f0fd5b8b5207a64a3a37
2013-08-02 08:11:28 -07:00
Mans Rullgard
2003468df8
vp9: neon: convolve: simplify branching to C fallbacks
...
Change-Id: Ic7cacd02d6dc9243ad8fc85082c5618a9d1e66dc
2013-08-02 08:11:25 -07:00
Mans Rullgard
5e2e78d024
vp9: neon: optimise loads in horiz convolve functions
...
Loading to single lanes in multiple registers is expensive since
it requires a read and write of each register which saturates
the register file access. Loading to single registers followed
by a separate transpose reduces this pressure.
Change-Id: I4cc35887ddbca80e5e635b50d2b1d158de9668ee
2013-08-02 08:11:08 -07:00
Mans Rullgard
d85ae87183
vp9: neon: add vp9_mb_lpf_* functions
...
Change-Id: I13e0880df234f15abc4cc7c57fe84488d5d46a75
2013-08-02 08:10:50 -07:00
hkuang
588b4daf54
Fix some format error and code error in neon code.
...
Change-Id: I748dee8938dfb19f417f24eed005f3d216f83a82
2013-07-26 14:14:57 -07:00
Frank Galligan
e88db77892
Merge "Speedup loopfilter neon code."
2013-07-22 17:39:42 -07:00
Frank Galligan
5af6bf6c43
Speedup loopfilter neon code.
...
Try to cut down the cycle count by rearranging the instructions
so there are fewer stalls.
Change-Id: Ic1383335ee0f05e656477d9ee9c179ec231285d5
2013-07-22 17:00:01 -07:00
hkuang
97dbee00dd
Merge "Add neon optimize vp9_short_idct8x8_add."
2013-07-19 08:28:39 -07:00
hkuang
d757de744c
Add neon optimize vp9_short_idct8x8_add.
...
Change-Id: Ic32acf3e2939c6d12d9c2bf192a5f5da59705fda
2013-07-18 16:40:41 -07:00
Frank Galligan
7fd5d8e6a4
Fix horz loopfilter loops
...
If count was greater than 1, the src pointer would be off on
the second loop iteration.
Change-Id: I8e09037e68dc4ae92076a8067f7b6dacbbef8263
2013-07-18 09:44:15 -07:00
Johann
59dc4e9cdd
vp9_convolve8_neon placeholder
...
Call the individually optimized horizontal and vertical functions. This
implementation abuses the temp buffer.
This will be replaced with a custom optimized function.
Over 2x speedup.
Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd
2013-07-17 08:39:27 -07:00
Johann
90ebfe621f
Merge "vp9_convolve8_[horiz|vert]_avg"
2013-07-16 09:42:52 -07:00
Frank Galligan
f4f60f6005
Neon: Update mbfilter if all vectors follow one branch.
...
Change the mbfilter Neon code so it no longer executes both branches
when all vectors follow only one branch.
The code is about 5% faster when executing only one branch and about
1% slower when executing both branches.
-PS5: Remove local stack space from mbfilter.
Change-Id: I6a23f9b318a9f4568a2718b4c9348db988fe2182
2013-07-15 13:08:28 -07:00
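
The trade-off in f4f60f6005 can be illustrated with a scalar C sketch. Everything here is an assumption made for illustration: `path_a` and `path_b` are placeholders for the two filter branches, each mask lane is either 0x00 or 0xff, and the function names are hypothetical. The point is only the control flow: test whether every lane agrees, run a single path if so, and otherwise fall back to computing both and blending.

```c
#include <stdint.h>

#define LANES 8

/* Placeholder stand-ins for the two filter branches. */
static uint8_t path_a(uint8_t v) { return (uint8_t)((v + 1) >> 1); }
static uint8_t path_b(uint8_t v) { return (uint8_t)(255 - v); }

/* Baseline: always compute both paths and select per lane with the mask. */
static void filter_both(const uint8_t *in, const uint8_t *mask, uint8_t *out) {
  for (int i = 0; i < LANES; ++i) {
    uint8_t a = path_a(in[i]);
    uint8_t b = path_b(in[i]);
    out[i] = (uint8_t)((a & mask[i]) | (b & ~mask[i]));
  }
}

/* Check for a uniform mask first. Returns how many paths were executed. */
static int filter_uniform_check(const uint8_t *in, const uint8_t *mask,
                                uint8_t *out) {
  uint8_t all = 0xff, any = 0;
  for (int i = 0; i < LANES; ++i) {
    all &= mask[i];
    any |= mask[i];
  }
  if (all == 0xff) {  /* every lane takes path A */
    for (int i = 0; i < LANES; ++i) out[i] = path_a(in[i]);
    return 1;
  }
  if (any == 0) {     /* every lane takes path B */
    for (int i = 0; i < LANES; ++i) out[i] = path_b(in[i]);
    return 1;
  }
  filter_both(in, mask, out);  /* mixed lanes: pay for the check and both */
  return 2;
}
```

The mixed case pays for the extra check on top of both paths, which is consistent with the small slowdown the commit message reports for that case.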
Johann
a15bebfc0a
vp9_convolve8_[horiz|vert]_avg
...
Super basic conversion from the other implementations. Any changes to
one should be trivial to copy over to keep them in sync.
Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8
2013-07-12 16:21:33 -07:00