325 Commits

Author SHA1 Message Date
Ronald S. Bultje
7fd643264a SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction.
Change-Id: Iad70966b986f65259329070e258f76ef0af816b4
2013-07-10 09:28:03 -07:00
Ronald S. Bultje
8dade638a1 SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction.
Change-Id: I3441c059214c2956e8261331bbf521525a617a86
2013-07-10 09:28:03 -07:00
Ronald S. Bultje
75b33c68c7 SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 V intra prediction.
Change-Id: I55a6cfa2daba738cbc0c4a02f806893f7e556997
2013-07-10 09:28:03 -07:00
Ronald S. Bultje
92c5d3665d SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 DC intra prediction.
Change-Id: Ibe1690afc5459f3b3beca401e7734fcd03da6dd0
2013-07-10 09:28:03 -07:00
Frank Galligan
198fa6d0a0 Add Neon horizontal and vertical vp9_mbloop_filter
- The vp9 mbfilter C code will branch on flat and mask. This CL
  will perform both branches and combine the data. A later CL will
  perform a check to see if all patch will take one branch.
- These functions are about 1.75 times faster than the C code on
  Nexus 7.

PS #3
- Changed all functions to dub limit, blimit, and thresh from
  vld {dx[]}, freeing up r4-r6.
- Changed code to use vbif to reduce one instruction and free
  up a d register.

Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777
2013-07-09 12:40:05 -07:00
Ronald S. Bultje
8350e7fe38 Make intra prediction pointers RTCD-based.
This probably has a mildly negative impact on performance, but will
(in future commits - or possibly merged with this one) allow SIMD
implementations of individual intra prediction functions. We may
perhaps want to consider having separate functions per txfm-size
also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for
each intra prediction mode), but I haven't played much with that
yet.

Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269
2013-07-08 17:25:51 -07:00
Ronald S. Bultje
c8defcfdee Update quantize SSSE3 SIMD to cover 32x32 transform case also.
Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
2min10.1, i.e. a 2.3% overall speed increase.

Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
2013-07-01 11:36:33 -07:00
Ronald S. Bultje
7353ceab9d Quantize (64-bit only, for now) SSSE3 SIMD.
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
x86-64 only, it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.

Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
2013-07-01 11:36:07 -07:00
Jingning Han
993942ce0c Merge "Enable SSE2 4x4 ADST/DCT transform" 2013-06-29 15:57:04 -07:00
Christian Duvivier
466e0cf303 SSE2 version of vp9_short_fdct32x32_rd.
43,000 -> 5,750 cycles, about 7.5x faster.

Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0
2013-06-29 13:53:00 -07:00
chm
a83cfd4da1 add Neon optimized add constant residual functions
- Add add_constant_residual_8x8 16x16 32x32 functions
- Tested under RealView debugger enviroment

Change-Id: I5c3a432f651b49bf375de6496353706a33e3e68e
2013-06-28 19:06:51 -07:00
Jingning Han
1109b6b888 Enable SSE2 4x4 ADST/DCT transform
This commit enables SSE2 4x4 foward hybrid transform. The runtime
goes from 249 cycles down to 74 cycles. Overall around 2% speed-up
at no compression performance change.

Change-Id: Iad4d526346e05c7be896466c05500711bb763660
2013-06-28 17:24:43 -07:00
Ronald S. Bultje
af660715c0 Make coefficient skip condition an explicit RD choice.
This commit replaces zrun_zbin_boost, a method of biasing non-zero
coefficients following runs of zero-coefficients to be rounded towards
zero, with an explicit skip-block choice in the RD loop.

The logic is basically that if individual coefficients should be rounded
towards zero (from a RD point of view), the trellis/optimize loop should
take care of it. If whole blocks should be zero (from a RD point of
view), a single RD check is much more efficient than a complete
serialization of the quantization loop.

Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
SIMD for quantize will follow in a separate patch. Results for other
test sets pending.

Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
2013-06-28 10:28:49 -07:00
Frank Galligan
1d6dc1b702 Add Neon optimized loop filter functions.
- Added vp9_loop_filter_horizontal_edge_neon and
  vp9_loop_filter_vertical_edge_neon.
- The functions are based off the vp8 loopfilter
  functions.
- Matches x86 md5 checksum.

Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0
2013-06-27 16:14:45 -07:00
Jingning Han
3cc8c8c3a0 Merge "Refactor intra predictor block" 2013-06-25 19:46:55 -07:00
Jingning Han
d19ea3861d Refactor intra predictor block
Remove vp9_intra4x4_predict(). Use the common intra prediction
function for all block sizes.

Change-Id: Ibd19d51dfa3da8bbdfb79ddeb81530b2e2089560
2013-06-25 16:33:13 -07:00
Ronald S. Bultje
c24d922396 Add averaging-SAD functions for 8-point comp-inter motion search.
Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2,
i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
the variance of the averaging predictor. This is slightly suboptimal
because the function is subpixel-position-aware, but it will (at least
for the SSE2 version) not actually use a bilinear filter for a full-pixel
position, thus leading to approximately the same performance compared to
if we implemented an actual average-aware full-pixel variance function.
That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
leading to a total gain of 2.7%.

Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
2013-06-25 12:57:28 -07:00
Yaowu Xu
b9c934df8e Merge "Enable sse2 implmentation of 8x8 ADST/DCT" 2013-06-25 09:13:22 -07:00
Jingning Han
a32a086d23 Enable sse2 implmentation of 8x8 ADST/DCT
This commit makes use of the butterfly structure to enable the sse2
version implementation of 8x8 ADST/DCT hybrid transform coding.

The runtime of hybrid transform module goes down from 1170 cycles
to 245 cycles. Overall speed-up around 1.5%.

Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
2013-06-24 18:41:33 -07:00
John Koleszar
ece724ae16 Merge "Remove unused vp9_build_intra_predictors_sb{y,uv}_s" 2013-06-24 15:08:58 -07:00
John Koleszar
9e7019f7df Remove unused vp9_build_intra_predictors_sb{y,uv}_s
The functions no longer referenced.

Change-Id: If2705dfbc607f79ec8ec2242d5e03bec27a35aaf
2013-06-21 16:10:05 -07:00
Ronald S. Bultje
54b2a59623 Implement SSE2 block_error.
Change vp9_block_error() to return a 64bit error variable, change all
callers to expect a 64bit return value (this will prevent overflows,
which we basically don't check for at all right now). Remove duplicate
block_error() function, which fixed that through truncation. Remove
old (incompatible) mmx/sse2 block_error SIMD versions and replace with
a new one that returns a 64bit value.

Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
3min23, i.e. a 3% overall speedup.

Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
2013-06-21 12:54:52 -07:00
Ronald S. Bultje
25c588b1e4 Add subtract_block SSE2 version and unit test.
3% faster overall (3min35.0 to 3min28.5).

Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
2013-06-21 09:35:37 -07:00
Ronald S. Bultje
1e6a32f1af SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance().
Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
perfectly interleaved, and can probably be improved further in the
future. I've marked this with a few TODOs/FIXMEs in the code.

Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
2013-06-20 15:59:48 -07:00
Ronald S. Bultje
8fb6c58191 Implement sse2 and ssse3 versions for all sub_pixel_variance sizes.
Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
3min58). Specific changes to timings for each function compared to
original assembly-optimized versions (or just new version timings if
no previous assembly-optimized version was available):

sse2   4x4:    99 ->   82 cycles
sse2   4x8:           128 cycles
sse2   8x4:           121 cycles
sse2   8x8:   149 ->  129 cycles
sse2   8x16:  235 ->  245 cycles (?)
sse2  16x8:   269 ->  203 cycles
sse2  16x16:  441 ->  349 cycles
sse2  16x32:          641 cycles
sse2  32x16:          643 cycles
sse2  32x32: 1733 -> 1154 cycles
sse2  32x64:         2247 cycles
sse2  64x32:         2323 cycles
sse2  64x64: 6984 -> 4442 cycles

ssse3  4x4:           100 cycles (?)
ssse3  4x8:           103 cycles
ssse3  8x4:            71 cycles
ssse3  8x8:           147 cycles
ssse3  8x16:          158 cycles
ssse3 16x8:   188 ->  162 cycles
ssse3 16x16:  316 ->  273 cycles
ssse3 16x32:          535 cycles
ssse3 32x16:          564 cycles
ssse3 32x32:          973 cycles
ssse3 32x64:         1930 cycles
ssse3 64x32:         1922 cycles
ssse3 64x64:         3760 cycles

Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
2013-06-20 09:34:25 -07:00
Jingning Han
7088426976 Merge "Make fdct32 computation flow within 16bit range" 2013-06-18 11:40:14 -07:00
Jingning Han
a41a4860c0 Make fdct32 computation flow within 16bit range
This commit makes use of dual fdct32x32 versions for rate-distortion
optimization loop and encoding process, respectively. The one for
rd loop requires only 16 bits precision for intermediate steps.
The original fdct32x32 that allows higher intermediate precision (18
bits) was retained for the encoding process only.

This allows speed-up for fdct32x32 in the rd loop. No performance
loss observed.

Change-Id: I3237770e39a8f87ed17ae5513c87228533397cc3
2013-06-18 09:46:24 -07:00
Jingning Han
0b7910b9ff Merge "Enable sse2 version of sad8x4/4x8" 2013-06-14 13:15:49 -07:00
Jingning Han
c43af9a8a3 Enable sse2 version of sad8x4/4x8
The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.

Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
2013-06-14 09:19:28 -07:00
Jingning Han
15f50e7b42 Enable sse2 version of sad8x4/4x8
The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.

Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
2013-06-13 16:18:18 -07:00
Scott LaVarnway
a81bd12a2e Quick modifications to mb loopfilter intrinsic functions
Modified to work with 8x8 blocks of memory.  Will revisit
later for further optimizations.  For the HD clip used, the
decoder improved by almost 20%.

Change-Id: Iaa4785be293a32a42e8db07141bd699f504b8c67
2013-06-12 19:23:03 -04:00
Yaowu Xu
d682243012 Merge "Quick modifications to wide loopfilter intrinsic functions" 2013-06-12 15:16:11 -07:00
Ronald S. Bultje
fa96eeb835 Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d.
Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56
to 4min42.

Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f
2013-06-12 17:40:01 -04:00
Scott LaVarnway
26496c52bf Quick modifications to wide loopfilter intrinsic functions
Modified to work with 8x8 blocks of memory.  Will revisit
later for further optimizations.  For the HD clip used, the
decoder improved my 20%.

Change-Id: Ia0057f55d66d1445882351ea6c43b595a5a980e5
2013-06-12 16:49:08 -04:00
John Koleszar
ceee4563d6 Remove unused vp9_idct_add_{y,uv}_block
These functions are not used, and appear to have been superceded.

Change-Id: I86fe51b088264f6b1b8d4d232bba97b371b98120
2013-06-12 12:24:22 -07:00
John Koleszar
0e1e16db90 Enable mmx loop filter routines
The mmx routines work as expected for the loop filter, so enable them.

Change-Id: I2bbd9b99a4445fcba17bb95002f1fb6e01fe8f85
2013-06-12 11:28:21 -07:00
John Koleszar
44db42c114 Merge the new loopfilter experiment
Change-Id: I524ba98841f2e1850e3276ac365c501cea31546d
2013-06-10 12:30:12 -07:00
Ronald S. Bultje
073c7d5eec Fix firstpass if framesize is not a multiple of 16.
Change-Id: Iec41736c2b6140715f90f40de5ae6cf52497a9b8
2013-06-08 13:32:05 -07:00
John Koleszar
736c7b804a Merge "Reimplementation of loop filter" into experimental 2013-06-06 17:34:26 -07:00
John Koleszar
043d348aae Reimplementation of loop filter
This version of the loop filter supports non-4:2:0 subsampling and
a fourth plane, as well as changing the filtering order to be more
friendly to hardware implementations.

The filters are applied first to all vertical edges within the
64x64 SB, followed by the top horizontal edge and any internal
horizontal edges. Since filtering is applied on each 4x4 edge
serially, a dependency is created from filtering one block edge
to the next. It would be possible to remove this depencnecy by
building all filtering decisions from the unfiltered
reconstruction data.

Change-Id: I08f3e9683eb7bded8a76651cbc50fc0dfdd05fa7
2013-06-06 08:45:45 -07:00
Jim Bankoski
ced21bd6a6 Creates a new speed 1:
This speed 1 - uses variance threshold stolen from static-thresh
to determine split.  Any superblock with greater than the variance
set by static thresh * quantizer index squared is split. In addition
transform size is set to largest size less than or equal to partition
size, sub pixel filter is set to normal,  and only 12 modes are used
at all.

Change-Id: If7a2858ee70f96d1eb989c04fd87a332b147abef
2013-05-30 19:53:00 -07:00
Jingning Han
7ac5ac52f9 Merge 4x4 block level partition into codebase
Move 4x4/4x8/8x4 partition coding out of experimental list.

This commit fixed the unit test failure issues. It also resolved
the merge conflicts between 4x4 block level partition and iterative
motion search for comp_inter_inter.

Change-Id: I898671f0631f5ddc4f5cc68d4c62ead7de9c5a58
2013-05-23 11:58:50 +01:00
Yunqing Wang
f4fcfe3075 Optimize variance functions
Added SSE2 version of variance functions for super blocks.

Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d
2013-05-22 10:29:38 -07:00
Scott LaVarnway
0c3f3bf1d5 Removed vp9_recon functions
No longer used.

Change-Id: Ica5166f7117f4693dffdf7633dcfc1b263103d0d
2013-05-21 13:57:50 -04:00
Scott LaVarnway
ba48a11130 WIP: 4x4 idct/recon merge
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.

Change-Id: I296604bf73579c45105de0dd1adbcc91bcc53c22
2013-05-20 13:03:17 -04:00
Scott LaVarnway
9aa37a51b2 Merge "WIP: 8x8 idct/recon merge" into experimental 2013-05-16 14:28:30 -07:00
Scott LaVarnway
794a7bedbd WIP: 8x8 idct/recon merge
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.

Change-Id: Iacfd57324fbe2b7beca5d7f3dcae25c976e67f45
2013-05-16 13:52:15 -04:00
Jingning Han
8e3d0e4d7d Add building blocks for 4x8/8x4 rd search
These building blocks enable rate-distortion optimization search
over block sizes of 8x4 and 4x8. Need to convert them into mmx/sse
forms.

Change-Id: I570ea2d22d14ceec3fe3575128d7dfa172a577de
2013-05-16 10:41:29 -07:00
Dmitry Kovalev
cd16fe9160 Merge "Preparing vp9_deblock and vp9_denoise to alpha support." into experimental 2013-05-15 15:40:52 -07:00
Scott LaVarnway
a272ff25cd WIP: 16x16 idct/recon merge
This patch eliminates the intermediate diff buffer usage by
combining the short idct and the add residual into one function.
The encoder can use the same code as well.

Change-Id: Iea7976b22b1927d24b8004d2a3fddae7ecca3ba1
2013-05-15 13:16:02 -04:00