- HBD encoder speed improvement (SSE4.1):
With CONFIG_VP9_HIGHBITDEPTH enabled, on a Xeon E5-2680,
50 frames, park_joy_1080p, 12-bit,
encoding time drops from 4846481 ms to 4177471 ms (about 13.8%).
- Add unit test to verify bit-exactness and EOB calculation
Change-Id: I08e8ef3549ddad5ab36d86e78557df3b288537ea
This reinstates commit efda2831e5f758b4f350679b5c55c0b9282449b0
without the tests and with fixes for 32 bit x86 builds.
Change-Id: I34be4fe1e8a67686d26ba256fd7efe0eb6a569e8
This reverts commit efda2831e5f758b4f350679b5c55c0b9282449b0.
This commit causes a segmentation fault at SSE2/SumSquares2DTest.RandomValues/0
Change-Id: I171937e4daf6f15323e8206418773deb03bd8c53
We can optimize wedge partition selection by pre-computing the
residuals of the 2 underlying predictors, and then blend these
to compute the sse of the compound predictor, without actually
having to compute and subtract the compound predictor.
Similarly we can pre-compute a proxy array which we can use to
cheaply check which mask sign would have lower sse.
Details are in wedge_utils.c.
Mathematically these are equivalence transformations, but due to
finite precision the encoder output will be perturbed, though on
average this should make 0% difference.
ext-inter gains about 4.5% speedup.
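The residual-blending identity described above can be sketched in C as follows. This is a minimal illustration, not the code in wedge_utils.c: the function names, the 0..64 mask weight range, and the choice to compare errors scaled by 64 (which sidesteps the rounding that causes the perturbation mentioned above) are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MASK_MAX 64 /* assumed mask weight range: 0..64 */

/* Residual of one underlying predictor, computed once per block. */
void compute_residual(const uint8_t *src, const uint8_t *pred, int16_t *res,
                      size_t n) {
  for (size_t i = 0; i < n; ++i) res[i] = (int16_t)(src[i] - pred[i]);
}

/* Direct form: blend the two predictors, then subtract from the source.
 * The error is kept scaled by MASK_MAX so that no rounding is involved. */
uint64_t sse_direct_scaled(const uint8_t *src, const uint8_t *p0,
                           const uint8_t *p1, const uint8_t *mask, size_t n) {
  uint64_t sum = 0;
  for (size_t i = 0; i < n; ++i) {
    const int32_t blend = mask[i] * p0[i] + (MASK_MAX - mask[i]) * p1[i];
    const int64_t err = (int64_t)MASK_MAX * src[i] - blend;
    sum += (uint64_t)(err * err);
  }
  return sum;
}

/* Residual form: blend the precomputed residuals r0 and r1 instead.
 * m*(s - p0) + (64 - m)*(s - p1) == 64*s - (m*p0 + (64 - m)*p1),
 * so the compound predictor never has to be built. */
uint64_t sse_from_residuals_scaled(const int16_t *r0, const int16_t *r1,
                                   const uint8_t *mask, size_t n) {
  uint64_t sum = 0;
  for (size_t i = 0; i < n; ++i) {
    const int64_t err = mask[i] * r0[i] + (MASK_MAX - mask[i]) * r1[i];
    sum += (uint64_t)(err * err);
  }
  return sum;
}
```

Both functions return the SSE multiplied by 64*64. A real encoder would round when scaling back down, which is where the small output differences come from.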
Change-Id: Ib2657c3209ae161b4090b58b4b6c392641bf2792
Function-level timing test shows about 27% time saving on
a Xeon E5-2680 v2 desktop.
Rename vp9_dct_sse2.c to vp9_dct_intrin_sse2.c for vp9 and
rename dct_sse2.c to dct_intrin_sse2.c for vp10 to avoid
duplicate basenames.
Actually vp9_fwht4x4_mmx/sse2() and vp10_fwht4x4_mmx/sse2()
are identical. TODO: They should be unified later if there is
no intention to keep a duplicate.
Change-Id: I3e537b7bbd9ba417c606cd7c68c4dbbfa583f77d
- Integrate 5 flip transform types for each 4x4, 8x8, and 16x16
block, for the EXT_TX experiment.
- Encoder speed improves about 12%-15%.
- Update the unit tests for bit-exact result against C.
Change-Id: Idf27c87f1e516ca5b66c7b70142477a115404ccb
- Tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
- Update vp10_fht16x16_test.cc to do bit-exact test against
latest C version.
- HBD encoder speed improves ~1.8%.
Change-Id: Icfc799a212e5289bcf6cedcae3722032133a2bc6
- Tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
- Update bit-exact unit test against current C version.
- HBD encoder speed improves ~3.8%.
Change-Id: Ie13925ba11214eef2b5326814940638507bf68ec
- Optimization on tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
- Overall encoder speed improves ~4.5%-6%.
- Update bit-exact unit test against current C version.
Change-Id: If751c030612245b1c2470200c9570cf40d655504
- Implemented Angie's new fwd txfm algorithm.
- ~100% improvement over the previous 64-bit version; 3 times faster
than the original C code.
- Passed bit-exact unit test.
Change-Id: Ica30b9768706604a6d69fe42da778441f0f5f02e
If --enable-ext-partition is used at build time, the superblock size
(sometimes also referred to as coding unit (CU) size) is extended to
128x128 pixels.
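As a rough illustration of this build-time switch, the superblock dimension could be selected as below. The macro names (CONFIG_EXT_PARTITION, MAX_SB_SIZE_LOG2, MAX_SB_SIZE) and the helper superblocks_across() are assumptions made for this sketch, and the fallback is the usual 64x64 superblock.

```c
/* Hypothetical config macro; in a real build the configure script would
 * define it when --enable-ext-partition is passed. */
#ifndef CONFIG_EXT_PARTITION
#define CONFIG_EXT_PARTITION 0
#endif

#if CONFIG_EXT_PARTITION
#define MAX_SB_SIZE_LOG2 7 /* 128x128 superblocks */
#else
#define MAX_SB_SIZE_LOG2 6 /* 64x64 superblocks */
#endif
#define MAX_SB_SIZE (1 << MAX_SB_SIZE_LOG2)

/* Number of superblock columns needed to cover a frame of the given width,
 * rounding up so that a partial superblock at the edge is counted. */
int superblocks_across(int frame_width) {
  return (frame_width + MAX_SB_SIZE - 1) >> MAX_SB_SIZE_LOG2;
}
```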
Change-Id: Ie09cec6b7e8d765b7555ff5d80974aab60803f3a
- Wrote function: fidtx8_sse2() and fidtx16_sse2().
- Turned on vp10_fht8x8_sse2()/vp10_fht16x16_sse2() for new types.
- Updated 8x8/16x16 unit tests for accuracy/speed.
- Running 20K times with random numbers, cycling through
tx types from V_DCT to H_FLIPADST, SSE2 speed improvement:
8x8: ~131%
16x16: ~66%
Change-Id: Ibbb707e932a08fec3b1f423a7dab280a1d696c9a
- Added function fidtx4_sse2().
- Turned on vp10_fht4x4_sse2() for these tx types.
- Updated 4x4 unit test for speed/accuracy.
- 4x4 Unit test passed.
- Running 20K times with random numbers for tx type from
V_DCT to H_FLIPADST, SSE2 against C, speed improves ~46%.
Change-Id: I828088b7f98dc0f5939a72e3fcd6cb0b8d8dd8bf
- Use Makefile to control the build for highbd_fwd_txfm_sse4.c.
- Fixed hybrid transform (HT) types due to recent update.
- Added new unit test cases for highbd HT.
Change-Id: Ifd768a9b429a8c21ed40c1de8152fb5ac71e2f90
- Setup function vp10_highbd_fht4x4_sse4_1 for highbd SSE4.1
intrinsics optimization.
- Wrote SSE4.1 functions: load_buffer_4x4(), write_buffer_4x4(),
and fdct4x4_sse4_1().
- Used logical right shift to avoid coeff memory write/read.
- Turned on vp10_highbd_fht4x4_sse4_1 for DCT_DCT mode only.
- Improved overall encoding performance by >2.3% on a 50-frame
sequence, park_joy_1080p_12.y4m, encoded with --input-bit-depth=12
and --bit-depth=12.
- Unit test passed.
Change-Id: Idd6dc6e472cbbf235f0ade4f66fbe859a860a004
Makes a set of 16 transforms total, adding all 1D
combinations of ADST and FlipADST, and removing all DST
transforms.
lowres, midres both improve by about 0.1% and hdres by
-0.378% in BDRATE but with fewer transforms that are also
simpler.
Further experiments to continue later.
Change-Id: I7348a4c0e12078fdea5ae3a2d36a89a319ffcc6e
- Implemented fdst16_sse2(), fdst16_8col() against C version: fdst16().
- Turned on 7 DST related hybrid txfm types in vp10_fht16x16_sse2().
- Replaced vp10_fht16x16_c() with vp10_fht16x16_sse2() in
fwd_txfm_16x16().
- Added vp10_fht16x16_sse2() unit test against C version:
vp10_fht16x16_c() (--gtest_filter=*VP10Trans16x16*).
- Unit test passed.
- Speed improvement: 2.4%, 3.2%, 3.2%, for city_cif.y4m, garden_sif.y4m,
and mobile_cif.y4m.
Change-Id: Ib30a67ce5d5964bef143d588d0f8fa438be8901f
fdct16_sse2() was not bit-exact with C reference, fdct16().
The inconsistency was found by writing a unit test for
vp10_fht16x16_sse2(). Since the unit test needs a pending
change on the inherited base class, I will commit this unit
test after making a header file for this base class.
Passed the uncommitted unit test: vp10_fht16x16_test.cc.
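The kind of bit-exact check that exposes such a mismatch can be sketched as below. This is not vp10_fht16x16_test.cc: the 4-point Hadamard transforms stand in for the C reference and the optimized version, and every name here (txfm_fn, hadamard4_ref, hadamard4_fast, is_bit_exact) is hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef void (*txfm_fn)(const int16_t *in, int32_t *out);

/* Stand-in "C reference": 4-point Hadamard as a plain matrix product. */
void hadamard4_ref(const int16_t *in, int32_t *out) {
  static const int h[4][4] = {
    { 1, 1, 1, 1 }, { 1, -1, 1, -1 }, { 1, 1, -1, -1 }, { 1, -1, -1, 1 }
  };
  for (int k = 0; k < 4; ++k) {
    out[k] = 0;
    for (int j = 0; j < 4; ++j) out[k] += h[k][j] * in[j];
  }
}

/* Stand-in "optimized" version: the same transform as two butterfly stages. */
void hadamard4_fast(const int16_t *in, int32_t *out) {
  const int32_t a0 = in[0] + in[1], a1 = in[0] - in[1];
  const int32_t a2 = in[2] + in[3], a3 = in[2] - in[3];
  out[0] = a0 + a2;
  out[1] = a1 + a3;
  out[2] = a0 - a2;
  out[3] = a1 - a3;
}

/* Small deterministic LCG so that a failing input can be reproduced. */
static int16_t next_sample(uint32_t *state) {
  *state = *state * 1103515245u + 12345u;
  return (int16_t)((int)((*state >> 16) % 1024) - 512);
}

/* Returns 1 if both transforms agree on every output of every run. */
int is_bit_exact(txfm_fn ref, txfm_fn opt, int runs) {
  uint32_t state = 1;
  for (int r = 0; r < runs; ++r) {
    int16_t in[4];
    int32_t out_ref[4], out_opt[4];
    for (int i = 0; i < 4; ++i) in[i] = next_sample(&state);
    ref(in, out_ref);
    opt(in, out_opt);
    if (memcmp(out_ref, out_opt, sizeof(out_ref)) != 0) return 0;
  }
  return 1;
}
```

Comparing whole output buffers with memcmp() means any single coefficient difference fails the check, which is the property a bit-exact test needs.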
Change-Id: If2b617883c633a3ea90c19e1d018240c8007102b
Implemented fdst8_sse2() function against C version: fdst8().
Added seven DST related hybrid transform types in vp10_fht8x8_sse2().
Replaced vp10_fht8x8_c() with vp10_fht8x8_sse2() in fwd_txfm_8x8().
Speedup: 18.1%, 11.5%, 22.0% based on speed test from
city_cif.y4m, garden_sif.y4m, mobile_cif.y4m.
Change-Id: Ia4aa1ea44c7a33e494f64ce843037f8703f975e3
This patch eliminates the copying of data when using FLIPADST forward
transforms, by incorporating the necessary data flipping into the
load_buffer_* functions of the SSE2 optimized forward transforms. The
load_buffer_* functions are normally inlined, so the overhead of copying
the data is removed and the overhead of flipping is minimized. Left to
right flipping is still not free, as the columns need to be shuffled in
registers.
To preserve identity between the C and SSE2 implementations, the
appropriate C implementations now also do the data flipping as part of
the transform, rather than relying on the caller for flipping the input.
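A scalar sketch of folding the flip into the load step follows. It is an illustration only, not the SSE2 load_buffer_* code: the function name load_buffer_flipped and its parameters are assumptions. Up/down flipping is done by walking the rows with a negated stride, while left/right flipping has to reverse each row, which corresponds to the in-register column shuffle mentioned above.

```c
#include <stdint.h>

/* Copy an n x n block from `input` (row pitch `stride`) into the contiguous
 * work buffer `out`, applying the requested flips while loading so that no
 * separate flipped copy of the input is needed. */
void load_buffer_flipped(const int16_t *input, int stride, int n, int flipud,
                         int fliplr, int16_t *out) {
  if (flipud) {
    /* Start at the last row and walk upwards. */
    input += (n - 1) * stride;
    stride = -stride;
  }
  for (int r = 0; r < n; ++r) {
    const int16_t *row = input + r * stride;
    for (int c = 0; c < n; ++c) {
      /* A left/right flip reads each row in reverse column order. */
      out[r * n + c] = row[fliplr ? (n - 1 - c) : c];
    }
  }
}
```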
Overall encode speedup is about 1.5-2% in my tests. Note that these
are only the forward transforms. Inverse transforms to come in a later
patch.
There are also a few code hygiene changes:
- Fixed some indents of switch statements.
- DCT_DCT transforms now always use vp10_fht* functions, which dispatch
to vpx_fdct* for DCT_DCT (some of them used to call vpx_fdct*
directly, some of them used to call vp10_fht*).
Change-Id: I93439257dc5cd104ac6129cfed45af142fb64574