Yunqing Wang
64f728caef
Do horizontal loopfiltering in parallel
...
This patch followed "Rewrite filter_selectively_horiz for parallel
loopfiltering" commit, and added x86 SSE2 optimization to do
16-pixel filtering in parallel. Also, corrected the declaration
of aligned arrays. For 8-pixel-in-parallel case, improved the
calculation of the masks and filters. Updated the threshold loading
since the thresholds were already duplicated. Updated neon C functions
to call neon loopfilters twice.
Using tulip clip, tests showed it gave a ~1.5% decoder speed gain.
Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
2013-11-15 16:18:43 -08:00
Johann
e72d49a97a
Use lowercase 'b' to branch
...
iOS doesn't recognize B:
bad instruction `B idct32_pass_loop'
Change-Id: I3cf6aede4639f1d9efa97f7962fa287ba6feaaef
2013-11-12 10:41:06 -08:00
hkuang
c689a126ed
Fix a bug in the assembly code.
...
Change-Id: Ic416e3f8a11e82ee298e6f709b2119a9ddf1e2f8
2013-11-11 12:49:12 -08:00
hkuang
6b16f63332
Add back vp9_short_idct32x32_1_add_neon which is deleted in
...
cleanup I63df79a13cf62aa2c9360a7a26933c100f9ebda3.
Change-Id: I034848cf05031618818f7df2e7f9c35102686948
2013-11-05 14:57:32 -08:00
Dmitry Kovalev
65f118d72f
Making input pointer of any inverse transform constant.
...
Also renaming dest_stride to stride in some places.
Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940
2013-10-11 18:27:12 -07:00
Dmitry Kovalev
7ef573914d
Consistent names for inverse hybrid transforms (1 of 2).
...
Renames:
vp9_short_iht4x4_add -> vp9_iht4x4_16_add
vp9_short_iht8x8_add -> vp9_iht8x8_64_add
vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add
Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0
2013-10-11 13:31:32 -07:00
Dmitry Kovalev
1e766b50e2
Giving consistent names to IDCT 32x32 functions.
...
Renames:
vp9_short_idct32x32_add -> vp9_idct32x32_1024_add
vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add
vp9_idct_add_32x32 -> vp9_idct32x32_add
Change-Id: Id85306f5814bac6c47463a6b5901a93082510666
2013-10-10 11:27:39 -07:00
Dmitry Kovalev
b096c5a336
Giving consistent names to IDCT 16x16 functions.
...
Renames:
vp9_short_idct16x16_add -> vp9_idct16x16_256_add
vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add
vp9_short_idct16x16_1_add -> vp9_idct16x16_1_add
vp9_idct_add_16x16 -> vp9_idct16x16_add
Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3
2013-10-07 14:31:10 -07:00
Dmitry Kovalev
c6ad70d5f1
Giving consistent names to IDCT 8x8 functions.
...
Renames:
vp9_short_idct8x8_add -> vp9_idct8x8_64_add
vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add
vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add
vp9_idct_add_8x8 -> vp9_idct8x8_add
Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1
2013-10-06 00:24:09 -07:00
Dmitry Kovalev
3a0602578e
Giving consistent names to IDCT/IWHT functions.
...
The idea is to have the following names for each transform size:
vp9_idct4x4_add
vp9_idct4x4_1_add
vp9_idct4x4_10_add
vp9_idct4x4_16_add
vp9_idct8x8_add
vp9_idct8x8_1_add
vp9_idct8x8_10_add
vp9_idct8x8_64_add
etc for 16x16, 32x32
The actual list of renames in this patch:
vp9_idct_add_lossless -> vp9_iwht4x4_add
vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
vp9_idct_add -> vp9_idct4x4_add
vp9_short_idct4x4_add -> vp9_idct4x4_16_add
vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
2013-10-04 14:17:06 -07:00
Dmitry Kovalev
3fab2125ff
Renaming vp9_short_idct10_8x8_add to vp9_short_idct8x8_10_add.
...
Making name consistent with vp9_short_idct8x8 and vp9_short_idct8x8_1.
Change-Id: I99e0be040ec893f9571dcf090e18f98dc58339f5
2013-09-27 15:26:27 -07:00
Christian Duvivier
b1b4ba1bdd
Properly save neon registers.
...
Replace current code which corrupts the stack by
duplicate of vp8 code to save and restore neon
registers.
Change-Id: Ibb0220b9aa985d10533befa0a455ebce57a2891a
2013-09-27 14:25:33 -07:00
Dmitry Kovalev
db60c02c9e
Merge "Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10."
2013-09-27 13:08:52 -07:00
Dmitry Kovalev
15a36a0a0d
Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10.
...
Making function name consistent with vp9_short_idct16x16 and
vp9_short_idct16x16_1.
Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517
2013-09-26 14:01:25 -07:00
Christian Duvivier
5b1dc1515f
Fix a bunch of TODO from vp9_short_idct32x32_add_neon.
...
- full ASM version, no more C gateway file.
- integrate combine-add with last step of 2nd pass.
- remove a few push/pop pairs.
- some instruction reordering to hide latency.
Change-Id: Ic9d9933c908b65d1bf7ba8fd47b524cda808c9c6
2013-09-25 21:15:19 -07:00
Johann
a6a00fc6a3
Use lowercase instruction in assembly
...
The iOS compiler does not recognize BLE:
bad instruction `BLE idct32_transpose_pair_loop'
Change-Id: I7426694c66bc31caf939a2d5000968da1222c15b
2013-09-20 16:11:05 -07:00
hkuang
23e1a29fc7
Speed up iht8x8 by rearranging instructions.
...
Speed improves from 282% to 302% faster based on assembly-perf.
Change-Id: I08c5c1a542d43361611198f750b725e4303d19e2
2013-09-16 14:23:26 -07:00
hkuang
86fb12b600
Merge "Add neon optimize iht8x8 which is 282% faster than C."
2013-09-12 15:42:44 -07:00
hkuang
182366c736
Add neon optimize iht8x8 which is 282% faster than C.
...
Change-Id: I963dd4a6e8671957403ccbb9a16ea7de703e3530
2013-09-12 11:49:05 -07:00
Christian Duvivier
6a501462f8
First draft of vp9_short_idct32x32_add_neon.
...
Lots of TODO which will be taken care in upcoming changes. As is,
about 6x faster than C version.
Change-Id: Ie2557b72fd2d8edca376dbf400a4d173aa5e63e0
2013-09-11 15:19:38 -07:00
hkuang
fc5ec206a7
Speed up idct16x16 by rearrange instructions.
...
Speed improve from 376% to 400% faster base on assembly-perf.
Change-Id: If0b2eccc39d5793dc101ce9feb7fcadf88396ea2
2013-09-09 18:00:13 -07:00
hkuang
01c4e04424
Speed up idct8x8 by rearrange instructions.
...
Speed improve from 264% ~ 270% to 280% ~ 300% base on assembly-perf.
Change-Id: I3e2cc818ec14b432204ff43732f39b6438db685d
2013-09-04 15:57:22 -07:00
hkuang
3b8614a8f6
Add neon optimize vp9_short_iht4x4_add.
...
Change-Id: I42c497b68ae1ee645b59c9968ad805db0a43e37e
2013-09-04 12:37:58 -07:00
hkuang
3a679e56b2
Add neon optimize vp9_short_idct16x16_1_add.
...
Change-Id: Ib9354c1d975d03e8081df20d50b6a77dfe2dc7e5
2013-08-27 14:00:27 -07:00
hkuang
36e9b82080
Add neon optimize vp9_short_idct8x8_1_add.
...
Change-Id: I0b15d5e3b0eb97abb9ab5ec08e88b61f8723aaf4
2013-08-26 16:28:57 -07:00
hkuang
69384f4fad
Add neon optimize vp9_short_idct4x4_1_add.
...
Change-Id: I6ecb5c4a1a472feb8e84e9f3352b536d5e28a4a5
2013-08-26 15:55:16 -07:00
hkuang
b85367a608
Merge "Optimise idct4x4: rearrange the instructions a bit to improve instruction scheduling."
2013-08-23 10:08:43 -07:00
hkuang
4082bf9d7c
Add neon optimize vp9_short_idct10_16x16_add.
...
vp9_short_idct10_16x16_add is used to handle the block that only have valid data
at top left 4x4 block. All the other datas are 0. So we could cut many
unnecessary calculations in order to save instructions.
Change-Id: I6e30a3fee1ece5af7f258532416d0bfddd1143f0
2013-08-22 15:53:22 -07:00
hkuang
610642c130
Optimise idct4x4: rearrange the instructions a bit
...
to improve instruction scheduling.
Change-Id: I5ea881a6e419f9e8ed4b3b619406403b4de24134
2013-08-22 11:02:22 -07:00
hkuang
37cda6dc4c
Add neon optimize vp9_short_idct10_8x8_add.
...
vp9_short_idct10_8x8_add is used to handle the block that only have valid data
at top left 4x4 block. All the other datas are 0. So we could cut several
unnecessary calculations in order to save instructions.
Change-Id: I34fda95e29082b789aded97c2df193991c2d9195
2013-08-20 11:51:07 -07:00
Johann
d514b778c4
Merge "Reduce the instructions of idct8x8. Also add the saving and restoring of D registers."
2013-08-16 11:30:21 -07:00
Johann
65aa89af1a
Merge "Reduce instructions of idct4x4."
2013-08-16 11:28:35 -07:00
Frank Galligan
bdc785e976
Merge "vp9: neon: optimise vp9_wide_mbfilter_neon"
2013-08-16 11:16:48 -07:00
hkuang
df0715204c
Reduce instructions of idct4x4.
...
Change-Id: Ia26a2526804e7e2f656b0051618a615fca8fc79d
2013-08-16 10:54:56 -07:00
hkuang
60ecd60c9a
Reduce the instructions of idct8x8. Also add the
...
saving and restoring of D registers.
Change-Id: Id3630c90fcb160ef939fef55411342608af5f990
2013-08-16 10:32:12 -07:00
Mans Rullgard
4fa93bcef4
vp9: neon: use aligned stores in convolve functions
...
The destination is block-aligned so it is safe to use aligned
stores.
Change-Id: I38261e4fa40bc60e6472edffece59e372908da7e
2013-08-16 14:25:08 +01:00
Johann
a9aa7d07d0
Merge "vp9: neon: add vp9_convolve_avg_neon"
2013-08-15 14:55:15 -07:00
Johann
63e140eaa7
Merge "vp9: neon: add vp9_convolve_copy_neon"
2013-08-15 14:55:08 -07:00
Mans Rullgard
67e53716e0
vp9: neon: optimise vp9_wide_mbfilter_neon
...
Break up long dependency chains to improve instruction scheduling.
Change-Id: I0e0cb66943df24af920767bb4167b25c38af9630
2013-08-15 19:07:22 +01:00
hkuang
39f42c8713
Merge "Add neon optimize vp9_short_idct16x16_add."
2013-08-14 14:16:20 -07:00
hkuang
cf6beea661
Add neon optimize vp9_short_idct16x16_add.
...
Change-Id: I27134b9a5cace2bdad53534562c91d829b48838d
2013-08-14 13:52:16 -07:00
Mans Rullgard
0f1deccf86
vp9: neon: add vp9_convolve_avg_neon
...
Change-Id: I33cff9ac4f2234558f6f87729f9b2e88a33fbf58
2013-08-14 16:27:55 +01:00
Mans Rullgard
635ba269be
vp9: neon: add vp9_convolve_copy_neon
...
Change-Id: I15adbbda15d1842e9f15f21878a5ffbb75c3c0c9
2013-08-14 16:27:55 +01:00
Johann
4417c04531
Merge "vp9: neon: optimise convolve8_vert functions"
2013-08-12 17:54:47 -07:00
Mans Rullgard
ad7021dd6c
vp9: neon: optimise convolve8_vert functions
...
Invert loops to operate vertically in the inner loop. This allows
removing redundant loads.
Also add preloading of data.
Change-Id: I4fa85c0ab1735bcb1dd6ea58937efac949172bdc
2013-08-12 15:37:48 +01:00
Mans Rullgard
b84dc949c8
vp9: neon: optimise convolve8_horiz functions
...
Each iteration of the horizontal loop reuses 7 of the 11 source
values. Loading only the 4 new values saves some time.
Also add preload for source data.
Overall 4% faster on Chromebook.
Change-Id: I8f69e749f2b7f79e9734620dcee51dbfcd716b44
2013-08-11 16:21:55 +01:00
Christian Duvivier
78182538d6
Neon version of vp9_short_idct4x4_add.
...
Change-Id: Idec4cae0cb9b3a29835fd2750d354c1393d47aa4
2013-08-06 18:41:27 -07:00
Mans Rullgard
355cb14dc7
vp9: neon: convolve: replace some insns with simpler equivalents
...
Change-Id: I5d6906772e6e6adf68d7f0fd5b8b5207a64a3a37
2013-08-02 08:11:28 -07:00
Mans Rullgard
2003468df8
vp9: neon: convolve: simplify branching to C fallbacks
...
Change-Id: Ic7cacd02d6dc9243ad8fc85082c5618a9d1e66dc
2013-08-02 08:11:25 -07:00
Mans Rullgard
5e2e78d024
vp9: neon: optimise loads in horiz convolve functions
...
Loading to single lanes in multiple registers is expensive since
it requires a read and write of each register which saturates
the register file access. Loading to single registers followed
by a separate transpose reduces this pressure.
Change-Id: I4cc35887ddbca80e5e635b50d2b1d158de9668ee
2013-08-02 08:11:08 -07:00