245 Commits

Author SHA1 Message Date
Scott LaVarnway
466f395148 Merge "Removing extra params from x_add_residual() functions" into experimental 2013-04-16 08:58:28 -07:00
Scott LaVarnway
6f95d53e37 Removing extra params from x_add_residual() functions
Now that the predictor is the dest, we do not need the
extra parameters.

Change-Id: I31e2c3d2015f4a1cd12e7f04536d8db478582a0a
2013-04-16 09:59:01 -04:00
Scott LaVarnway
5393379c84 Merge "Removing extra params in dequant functions" into experimental 2013-04-16 06:37:00 -07:00
Jingning Han
aaf33d7df5 Add rectangular block size variance/sad functions.
With this, the RD loop properly supports rectangular blocks.

Change-Id: Iece79048fb4e84741ee1ada982da129a7bf00470
2013-04-15 13:39:07 -07:00
Scott LaVarnway
74610b1ae4 Removing extra params in dequant functions
Now that the predictor is the dest, we do not need the
extra parameters.

Change-Id: I78db73d39b5aff62f15303f3d51ad2797eae74b6
2013-04-15 13:43:11 -04:00
Jingning Han
815e95fbeb Make intra predictor support rectangular blocks
The intra predictor supports configurable block sizes. It can handle
intra prediction down to 4x4 sizes, when enabled in BLOCK_SIZE_TYPE.

Change-Id: I7399ec2512393aa98aadda9813ca0c83e19af854
2013-04-11 16:45:57 -07:00
John Koleszar
2f19cd03aa Merge "Remove unused vp9_recon_mb{y,uv}_s" into experimental 2013-04-11 15:51:20 -07:00
John Koleszar
c382ed09f8 Remove unused vp9_recon_mb{y,uv}_s
These functions now are handled through the common superblock code.

Change-Id: Ib6688971bae297896dcec42fae1d3c79af7a611c
2013-04-11 14:05:59 -07:00
Scott LaVarnway
6189f2bcb1 WIP: removing predictor buffer usage from decoder
This patch will use the dest buffer instead of the
predictor buffer.  This will allow us in future commits
to remove the extra mem copy that occurs in the dequant
functions when eob == 0.  We should also be able to remove
extra params that are passed into the dequant functions.

Change-Id: I7241bc1ab797a430418b1f3a95b5476db7455f6a
2013-04-11 13:55:18 -07:00
Ronald S. Bultje
b4f6098ef7 Make RD superblock mode search size-agnostic.
Merge various super_block_yrd and super_block_uvrd versions into one
common function that works for all sizes. Make transform size selection
size-agnostic also. This fixes a slight bug in the intra UV superblock
code where it used the wrong transform size for txsz > 8x8, and stores
the txsz selection for superblocks properly (instead of forgetting it).
Lastly, it removes the trellis search that was done for 16x16 intra
predictors, since trellis is relatively expensive and should thus only
be done after RD mode selection.

Gives basically identical results on derf (+0.009%).

Change-Id: If4485c6f0a0fe4038b3172f7a238477c35a6f8d3
2013-04-10 16:50:30 -07:00
Ronald S. Bultje
a3874850dd Make SB coding size-independent.
Merge sb32x32 and sb64x64 functions; allow for rectangular sizes. Code
gives identical encoder results before and after. There are a few
macros for rectangular block sizes under the sbsegment experiment; this
experiment is not yet functional and should not yet be used.

Change-Id: I71f93b5d2a1596e99a6f01f29c3f0a456694d728
2013-04-09 21:28:27 -07:00
John Koleszar
4c05a051ab Move qcoeff, dqcoeff from BLOCKD to per-plane data
Start grouping data per-plane, as part of refactoring to support
additional planes, and chroma planes with other-than 4:2:0
subsampling.

Change-Id: Idb76a0e23ab239180c818025bae1f36f1608bb23
2013-04-04 16:30:57 -07:00
Yunqing Wang
0e91bec4b5 Merge "Optimize 32x32 idct function" into experimental 2013-03-27 11:30:48 -07:00
Yunqing Wang
21a718d9a7 Optimize 32x32 idct function
Wrote an SSE2 version of the vp9_short_idct_32x32 function. Compared
to the C version, the SSE2 version is 5X faster.

Change-Id: I071ab7378358346ab4d9c6e2980f713c3c209864
2013-03-27 11:05:42 -07:00
Deb Mukherjee
23144d2345 Implicit weighted prediction experiment
Adds an experiment to use a weighted prediction of two INTER
predictors, where the weight is one of (1/4, 3/4), (3/8, 5/8),
(1/2, 1/2), (5/8, 3/8) or (3/4, 1/4), and is chosen implicitly
based on consistency of the predictors to the already
reconstructed pixels to the top and left of the current macroblock
or superblock.

Currently the weighting is not applied to SPLITMV modes, which
default to the usual (1/2, 1/2) weighting. However the code is in
place controlled by a macro. The same weighting is used for Y and
UV components, where the weight is derived from analyzing the Y
component only.

Results (over compound inter-intra experiment)
derf: +0.18%
yt: +0.34%
hd: +0.49%
stdhd: +0.23%

The experiment suggests bigger benefit for explicitly signaled weights.

Change-Id: I5438539ff4485c5752874cd1eb078ff14bf5235a
2013-03-26 16:58:56 -07:00
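
The five weight pairs listed in the commit above are all multiples of 1/8, so a blend can be sketched in integer arithmetic. This is a minimal illustration, not the libvpx code: the function name `blend`, the round-to-nearest step, and the sample pixel values are assumptions, and the implicit selection of a pair from the neighboring reconstructed pixels is not shown.

```python
# Candidate weight pairs from the commit message, expressed in eighths:
# (1/4, 3/4), (3/8, 5/8), (1/2, 1/2), (5/8, 3/8), (3/4, 1/4).
WEIGHTS_IN_EIGHTHS = [(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)]


def blend(pred0, pred1, w0, w1):
    """Blend two predictors with integer weights that sum to 8.

    The "+ 4" before the shift rounds to nearest; that rounding rule is
    an assumption made for this sketch.
    """
    if w0 + w1 != 8:
        raise ValueError("weights must sum to 8 (i.e. to 1.0 in eighths)")
    return [(w0 * a + w1 * b + 4) >> 3 for a, b in zip(pred0, pred1)]


if __name__ == "__main__":
    # Hypothetical pixel rows from two INTER predictors.
    p0 = [100, 120, 140, 160]
    p1 = [104, 112, 148, 152]
    for w0, w1 in WEIGHTS_IN_EIGHTHS:
        print((w0, w1), blend(p0, p1, w0, w1))
```

With the (4, 4) pair this reduces to the usual (1/2, 1/2) average that the message says SPLITMV modes fall back to.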
Yunqing Wang
869d6c0534 Optimize 16x16 idct10 function
Wrote an SSE2 version of the vp9_short_idct10_16x16 function. Compared
to the C version, the SSE2 version is 2.3X faster.

Change-Id: I314c4f09369648721798321eeed6f58e38857f26
2013-03-21 16:36:01 -07:00
Yunqing Wang
ec3100661c Optimize 16x16 idct function
Wrote an SSE2 version of the vp9_short_idct16x16 function. Compared to
the C version, the SSE2 version is over 2.5X faster.

Change-Id: I38536e2b846427a2cc5c5423aaf305fd0e605d61
2013-03-21 11:44:05 -07:00
Yunqing Wang
6344c84c82 Optimize 8x8 idct function
Wrote SSE2 versions of vp9_short_idct8x8 and vp9_short_idct10_8x8.
Compared to the C versions, the SSE2 versions are 2X faster. The decoder
test didn't show a noticeable gain since the 8x8 idct doesn't take much
of the decoding time (less than 1% in my test).

Change-Id: I56313e18cd481700b3b52c4eda5ca204ca6365f3
2013-03-18 15:34:14 -07:00
Yaowu Xu
12ade55719 Merge "removed reference to "LLM" and "x8"" into experimental 2013-03-18 08:51:19 -07:00
Christian Duvivier
4418b790a7 Faster vp9_short_fdct16x16.
Scalar path is about 1.5x faster (3.1% overall encoder speedup).
SSE2 path is about 7.2x faster (7.8% overall encoder speedup).

Change-Id: I06da5ad0cdae2488431eabf002b0d898d66d8289
2013-03-15 15:55:31 -07:00
Yaowu Xu
005552639b removed reference to "LLM" and "x8"
This commit renames files and functions to remove obsolete
references to LLM and x8.

Change-Id: I973b20fc1a55149ed68b5408b3874768e6f88516
2013-03-13 08:35:46 -07:00
Yunqing Wang
11ca81f8b6 Add vp9_idct4_1d_sse2
Added SSE2 idct4_1d which is called by vp9_short_iht4x4. Also,
modified the parameter type passed to vp9_short_iht functions to
make it work with rtcd prototype.

Change-Id: I81ba7cb4db6738f1923383b52a06deb760923ffe
2013-03-08 15:04:22 -08:00
Yunqing Wang
f240782650 Optimize add_constant_residual function
Optimized adding constant diff to predictor, which gave about
2% decoder performance gain.

Change-Id: I47db20c31428e8c4a8f16214a85cbe386a6e9303
2013-03-07 15:49:07 -08:00
Yunqing Wang
f4e383f3d1 Merge "Optimize add_residual function" into experimental 2013-03-05 16:47:58 -08:00
Yunqing Wang
943c6d7172 Optimize add_residual function
Optimized adding diff to predictor, which gave 0.8% decoder
performance gain.

Change-Id: Ic920f0baa8cbd13a73fa77b7f9da83b58749f0f8
2013-03-05 16:27:45 -08:00
Ronald S. Bultje
4209bba462 Merge changes Ifacbf5a0,Ibad7c3dd into experimental
* changes:
  vpxenc: actually report mismatch on stderr.
  Make superblocks independent of macroblock code and data.
2013-03-05 11:17:14 -08:00
Ronald S. Bultje
111ca42133 Make superblocks independent of macroblock code and data.
Split macroblock and superblock tokenization and detokenization
functions and coefficient-related data structs so that the bitstream
layout and related code of superblock coefficients looks less like it's
a hack to fit macroblocks in superblocks.

In addition, unify chroma transform size selection from luma transform
size (i.e. always use the same size, as long as it fits the predictor);
in practice, this means 32x32 and 64x64 superblocks using the 16x16 luma
transform will now use the 16x16 (instead of the 8x8) chroma transform,
and 64x64 superblocks using the 32x32 luma transform will now use the
32x32 (instead of the 16x16) chroma transform.

Lastly, add a trellis optimize function for 32x32 transform blocks.

HD gains about 0.3%, STDHD about 0.15% and derf about 0.1%. There are
a few negative points here and there that I might want to analyze
a little closer.

Change-Id: Ibad7c3ddfe1acfc52771dfc27c03e9783e054430
2013-03-04 16:34:36 -08:00
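
The chroma transform size rule described in the commit above ("the same size, as long as it fits the predictor") can be sketched as a `min`. This is an illustration under assumptions, not the libvpx code: the function name is hypothetical, sizes are plain integers, and 4:2:0 subsampling is assumed so that the chroma predictor is half the superblock size in each dimension.

```python
def chroma_tx_size(sb_size, luma_tx_size):
    """Return the chroma transform size for a square superblock.

    Assumes 4:2:0 subsampling: the chroma predictor is sb_size // 2 wide,
    and the chroma transform may not be larger than that predictor.
    """
    chroma_pred_size = sb_size // 2
    return min(luma_tx_size, chroma_pred_size)


if __name__ == "__main__":
    for sb, luma in [(32, 16), (64, 16), (64, 32), (32, 32)]:
        print(f"SB {sb}x{sb}, luma tx {luma}x{luma} -> "
              f"chroma tx {chroma_tx_size(sb, luma)}")
```

The first three cases match the ones spelled out in the message. The last one (a 32x32 superblock with a 32x32 luma transform falling back to 16x16 chroma) is inferred from the "fits the predictor" wording.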
Yunqing Wang
37932d9168 Merge "Optimize vp9_short_idct4x4llm function" into experimental 2013-03-04 14:13:31 -08:00
Yunqing Wang
e8bc9f4220 Optimize vp9_short_idct4x4llm function
Wrote an SSE2 version of vp9_short_idct4x4llm to improve decoder
performance.

Change-Id: I90b9d48c4bf37aaf47995bffe7e584e6d4a2c000
2013-03-04 12:01:27 -08:00
John Koleszar
1cfc86ebe0 Add unit test for x4 multi-SAD functions
Update the function prototypes to match between VP9 and VP8.

Change-Id: If58965073989e87df3b62b67a030ec6ce23ca04f
2013-03-01 18:14:02 -08:00
John Koleszar
69c67c9531 Merge master branch into experimental
Picks up some build system changes, compiler warning fixes, etc.

Change-Id: I2712f99e653502818a101a72696ad54018152d4e
2013-03-01 11:06:05 -08:00
Yunqing Wang
c550bb3b09 Add eob<=10 case in idct32x32
Simplified the idct32x32 calculation when there are 10 or fewer
non-zero coefficients in a 32x32 block. This helps decoder
performance.

Change-Id: If7f8893d27b64a9892b4b2621a37fdf4ac0c2a6d
2013-02-28 16:40:29 -08:00
Yunqing Wang
72b146690a Merge "Refactor vp9_dequant_idct_add function" into experimental 2013-02-28 14:34:27 -08:00
Yunqing Wang
6193bc3ba8 Refactor vp9_dequant_idct_add function
Provided a wrapper and removed duplicate code.

Change-Id: Iaef842226ec348422e459202793b001d0983ea30
2013-02-28 14:18:46 -08:00
Scott LaVarnway
aa8fb070b8 Removed vp9_dequantize_b
Change-Id: Ie89bd00d58e30bf4094cb748a282f1dfa81a31d8
2013-02-28 14:08:12 -08:00
Jim Bankoski
714aa9f3c0 this commit converts all sad ptrs to uint32
The sse4_1 code used uint16_t for returning SAD, but that
won't work for 32x32 or 64x64. This commit fixes the
assembly for those sizes and also re-enables sse4_1 on Linux.

Change-Id: I5ce7288d581db870a148e5f7c5092826f59edd81
2013-02-28 08:46:35 -08:00
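
A quick worst-case calculation shows why a 16-bit SAD return value stops working at 32x32, as the commit above says. This sketch assumes 8-bit pixels, so the largest per-pixel absolute difference is 255; the helper name is illustrative.

```python
UINT16_MAX = 0xFFFF  # 65535


def max_sad(width, height, max_pixel_diff=255):
    """Worst-case sum of absolute differences for a width x height block."""
    return width * height * max_pixel_diff


if __name__ == "__main__":
    for size in (16, 32, 64):
        worst = max_sad(size, size)
        verdict = "fits in" if worst <= UINT16_MAX else "overflows"
        print(f"{size}x{size}: worst-case SAD {worst} {verdict} uint16_t")
```

A 16x16 block tops out at 65280, just under the 16-bit limit, while 32x32 and 64x64 blocks can reach 261120 and 1044480.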
Christian Duvivier
c129203f7e Faster vp9_short_fdct8x8.
Scalar path is about 1.4x faster (4% overall encoder speedup).
SSE2 path is about 7x faster (13% overall encoder speedup).

Change-Id: I7e85d8225a914a74c61ea370210414696560094d
2013-02-27 17:23:08 -08:00
John Koleszar
5ac141187a Merge "Remove unused vp9_copy32xn" into experimental 2013-02-27 12:23:45 -08:00
John Koleszar
7ad8dbe417 Remove unused vp9_copy32xn
This function was part of an optimization used in VP8 that required
caching two macroblocks. This is unused in VP9, and might not
survive refactoring to support superblocks, so removing it for now.

Change-Id: I744e585206ccc1ef9a402665c33863fc9fb46f0d
2013-02-27 10:24:56 -08:00
Yunqing Wang
35bc02c6eb Optimize vp9_dc_only_idct_add_c function
Wrote SSE2 version of vp9_dc_only_idct_add_c function. In order to
improve performance, clipped the absolute diff values to [0, 255].
This allowed us to keep the additions/subtractions in 8 bits.
Tests showed a decoder performance increase of over 2%.

Change-Id: Ie1a236d23d207e4ffcd1fc9f3d77462a9c7fe09d
2013-02-26 17:16:13 -08:00
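
The clipping trick in the commit above can be checked with a scalar model. This sketch assumes 8-bit pixels and models unsigned saturating byte add and subtract in plain Python. It is not the SSE2 code, and all function names are illustrative.

```python
def sat_add_u8(a, b):
    """Unsigned 8-bit saturating add (scalar model)."""
    return min(a + b, 255)


def sat_sub_u8(a, b):
    """Unsigned 8-bit saturating subtract (scalar model)."""
    return max(a - b, 0)


def add_dc_saturating(pixel, dc):
    """Add a constant residual using only 8-bit quantities.

    The magnitude of dc is clipped to [0, 255]; the sign selects between
    a saturating add and a saturating subtract.
    """
    magnitude = min(abs(dc), 255)
    if dc >= 0:
        return sat_add_u8(pixel, magnitude)
    return sat_sub_u8(pixel, magnitude)


def add_dc_reference(pixel, dc):
    """Wide-integer reference: add, then clamp to the 8-bit pixel range."""
    return max(0, min(255, pixel + dc))


if __name__ == "__main__":
    mismatches = sum(
        add_dc_saturating(p, d) != add_dc_reference(p, d)
        for p in range(256)
        for d in range(-600, 601)
    )
    print("mismatches:", mismatches)
```

Because the final pixel is clamped to [0, 255] anyway, clipping the magnitude first loses nothing, which is what lets the arithmetic stay in 8 bits.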
Jingning Han
77a3becf92 clean up forward and inverse hybrid transform
Rebased.

Remove the old matrix multiplication transform computation. The 16x16
ADST/DCT can be switched on/off and evaluated by setting ACTIVE_HT16
300/0 in vp9/common/vp9_blockd.h.

Change-Id: Icab2dbd18538987e1dc4e88c45abfc4cfc6e133f
2013-02-25 09:16:12 -08:00
James Zern
e5fb6321a1 give vp9 variance struct a unique name
variance_vtable clashed with vp8/common/variance.h

Change-Id: I09c1de44d5519f1bd13f58c01144c0de4706de6f
2013-02-22 16:25:13 -08:00
Jingning Han
babbd5d170 Forward butterfly hybrid transform
This patch includes 4x4, 8x8, and 16x16 forward butterfly ADST/DCT
hybrid transform. The kernel of 4x4 ADST is sin((2k+1)*(n+1)/(2N+1)).
The kernel of 8x8/16x16 ADST is of the form sin((2k+1)*(2n+1)/4N).

Change-Id: I8f1ab3843ce32eb287ab766f92e0611e1c5cb4c1
2013-02-21 18:24:28 -08:00
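
The two kernels named in the commit above can be tabulated as floating-point basis matrices. This is a sketch under assumptions: the factor of pi inside the sine and the normalization constants are not stated in the message and are supplied here (they correspond to the standard DST-VII and DST-IV definitions), and the function names are illustrative. The real code uses integer butterflies rather than matrix products.

```python
import math


def adst4_style_basis(size):
    """Basis for the kernel sin(pi*(2k+1)*(n+1)/(2N+1)).

    Row k, column n. The pi factor and the 2/sqrt(2N+1) scale are assumptions.
    """
    scale = 2.0 / math.sqrt(2 * size + 1)
    return [[scale * math.sin(math.pi * (2 * k + 1) * (n + 1) / (2 * size + 1))
             for n in range(size)] for k in range(size)]


def adst_butterfly_style_basis(size):
    """Basis for the kernel sin(pi*(2k+1)*(2n+1)/(4N)).

    The pi factor and the sqrt(2/N) scale are assumptions.
    """
    scale = math.sqrt(2.0 / size)
    return [[scale * math.sin(math.pi * (2 * k + 1) * (2 * n + 1) / (4 * size))
             for n in range(size)] for k in range(size)]


def max_orthonormality_error(basis):
    """Largest absolute entry of (B * B^T - I)."""
    size = len(basis)
    worst = 0.0
    for i in range(size):
        for j in range(size):
            dot = sum(basis[i][n] * basis[j][n] for n in range(size))
            worst = max(worst, abs(dot - (1.0 if i == j else 0.0)))
    return worst


if __name__ == "__main__":
    print("4x4:", max_orthonormality_error(adst4_style_basis(4)))
    for size in (8, 16):
        print(f"{size}x{size}:",
              max_orthonormality_error(adst_butterfly_style_basis(size)))
```

With these scale factors both bases come out orthonormal to floating-point precision, so the transpose of each matrix serves as its inverse.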
Ronald S. Bultje
35524e2231 Remove "eobs" array in MACROBLOCKD.
The information is a duplicate of "eob" in BLOCKD.

Change-Id: Ia6416273bd004611da801e4bfa6e2d328d6f02a3
2013-02-21 10:07:36 -08:00
Yaowu Xu
d262e26cc7 Merge lossless experiment
Change-Id: I7b7b8d4fda3a23699e0c920d727f8c15d37d43aa
2013-02-20 07:54:28 -08:00
Jingning Han
cd907b1601 16x16 butterfly inverse ADST/DCT hybrid transform
rebased.

This patch includes 16x16 butterfly inverse ADST/DCT hybrid
transform. It uses the variant ADST of kernel
    sin((2k+1)*(2n+1)/4N),
which allows a butterfly implementation.

The coding gains compared to DCT 16x16 are about 0.1% for
both derf and std-hd. It is noteworthy that in the std-hd set
many sequences gain about 0.5% and some 0.2%. There are also a few
points that show -1% to -3% performance. Hence the average
comes to about 0.1%.

Change-Id: Ie80ac84cf403390f6e5d282caa58723739e5ec17
2013-02-19 09:07:00 -08:00
Ronald S. Bultje
46dff5d233 Remove some Y2-related code.
Change-Id: I4f46d142c2a8d1e8a880cfac63702dcbfb999b78
2013-02-15 14:06:25 -08:00
Scott LaVarnway
7755657ea7 Merge "WIP: ssse3 version of convolve avg functions" into experimental 2013-02-15 07:54:21 -08:00
Yaowu Xu
d3de97794f Merge "fix the lossless experiment" into experimental 2013-02-13 09:54:35 -08:00
Yaowu Xu
16f25f9dc8 fix the lossless experiment
Change-Id: I95acfc1417634b52d344586ab97f0abaa9a4b256
2013-02-13 09:20:26 -08:00