Manually cherry-picked the following changes:
8c8d16de vp9 -> vpx in names
75b57d39 VP9_ -> VPX_ in function names
761a7088 VP9_INTERP_EXTEND -> VPX_INTERP_EXTEND
4273a52c VP9->VPX in border pixel macros
03568c31 VP9_FRAME_MARKER -> VPX_FRAME_MARKER
2334f51d VP9->VPX in fdct function names
Change-Id: Icc18dbf4b416dd0fa21033b3e19ab8a47c893508
Added a new expt rect-tx to be used in conjunction with ext-tx.
[rect-tx is a temporary config flag and will eventually be
merged into ext-tx once it works correctly with all other
experiments].
Added 4x8 and 8x4 tranforms for use initially with rectangular
sub8x8 y blocks as part of this experiment.
There is about a -0.2% BDRATE improvement on lowres, others pending.
When var-tx is on rectangular transforms are currently not used.
That will be enabled in a subsequent patch.
Change-Id: Iaf3f88ede2740ffe6a0ffb1ef5fc01a16cd0283a
- HBD encoder speed improvement (SSE4.1):
Enable CONFIG_VP9_HIGHBITDEPTH, on Xeon E5-2680,
50 frames, park_joy_1080p, 12-bit,
Encoding time reduces from 4846481 to 4177471 (ms)
- Add unit test to verify bit-exact and EOB calculation
Change-Id: I08e8ef3549ddad5ab36d86e78557df3b288537ea
Keep track of the best and second best full pixel motion vector
candidates, and do subpel search around both of them.
Compression improvement:
lowres 0.22% midres 0.23% hdres 0.18%
No noticeable encoding speed changes observed on lowres test clips.
Change-Id: I5f4df2a03d1db061cfdfdba6138b27e9ea91f089
This commit bring all up-to-date changes from master that are
applicable to nextgenv2. Due to the remove VP10 code in master,
we had to cherry pick the following commits to get those changes:
Add default flags for arm64/armv8 builds
Allows building simple targets with sane default flags.
For example, using the Android arm64 toolchain from the NDK:
https://developer.android.com/ndk/guides/standalone_toolchain.html
./build/tools/make-standalone-toolchain.sh --arch=arm64 \
--platform=android-24 --install-dir=/tmp/arm64
CROSS=/tmp/arm64/bin/aarch64-linux-android- \
~/libvpx/configure --target=arm64-linux-gcc --disable-multithread
BUG=webm:1143
vpx_lpf_horizontal_4_sse2: Remove dead load.
Change-Id: I51026c52baa1f0881fcd5b68e1fdf08a2dc0916e
Fail early when android target does not include --sdk-path
Change-Id: I07e7e63476a2e32e3aae123abdee8b7bbbdc6a8c
configure: clean up var style and set_all usage
Use quotes whenever possible and {} always for variables.
Replace multiple set_all calls with *able_feature().
Conflicts:
build/make/configure.sh
vp9-svc: Remove some unneeded code/comment.
datarate_test,DatarateTestLarge: normalize bits type
quiets a msvc warning:
conversion from 'const int64_t' to 'size_t', possible loss of data
mips added p6600 cpu support
Removed -funroll-loops
psnr.c: use int64_t for sum of differences
Since the values can be negative.
*.asm: normalize label format
add a trailing ':', though it's optional with the tools we support, it's
more common to use it to mark a label. this also quiets the
orphan-labels warning with nasm/yasm.
BUG=b/29583530
Prevent negative variance
Due to rounding, hbd variance may become negative. This commit put in
check and clamp of negative values to 0.
configure: remove old visual studio support (<2010)
BUG=b/29583530
Conflicts:
configure
configure: restore vs_version variable
inadvertently lost in the final patchset of:
078dff7 configure: remove old visual studio support (<2010)
this prevents an empty CONFIG_VS_VERSION and avoids make failure
Require x86inc.asm
Force enable x86inc.asm when building for x86. Previously there were
compatibility issues so a flag was added to simplify disabling this
code.
The known issues have been resolved and x86inc.asm is the preferred
abstraction layer (over x86_abi_support.asm).
BUG=b:29583530
convolve_test: fix byte offsets in hbd build
CONVERT_TO_BYTEPTR(x) was corrected in:
003a9d2 Port metric computation changes from nextgenv2
to use the more common (x) within the expansion. offsets should occur
after converting the pointer to the desired type.
+ factorized some common expressions
Conflicts:
test/convolve_test.cc
vpx_dsp: remove x86inc.asm distinction
BUG=b:29583530
Conflicts:
vpx_dsp/vpx_dsp.mk
vpx_dsp/vpx_dsp_rtcd_defs.pl
vpx_dsp/x86/highbd_variance_sse2.c
vpx_dsp/x86/variance_sse2.c
test: remove x86inc.asm distinction
BUG=b:29583530
Conflicts:
test/vp9_subtract_test.cc
configure: remove x86inc.asm distinction
BUG=b:29583530
Change-Id: I59a1192142e89a6a36b906f65a491a734e603617
Update vpx subpixel 1d filter ssse3 asm
Speed test shows the new vertical filters have degradation on Celeron
Chromebook. Added "X86_SUBPIX_VFILTER_PREFER_SLOW_CELERON" to control
the vertical filters activated code. Now just simply active the code
without degradation on Celeron. Later there should be 2 set of vertical
filters ssse3 functions, and let jump table to choose based on CPU type.
improve vpx_filter_block1d* based on replace paddsw+psrlw to pmulhrsw
Make set_reference control API work in VP9
Moved the API patch from NextGenv2. An example was included.
To try it, for example, run the following command:
$ examples/vpx_cx_set_ref vp9 352 288 in.yuv out.ivf 4 30
Conflicts:
examples.mk
examples/vpx_cx_set_ref.c
test/cx_set_ref.sh
vp9/decoder/vp9_decoder.c
deblock filter : moved from vp8 code branch
The deblocking filters used in vp8 have been moved to vpx_dsp for
use by both vp8 and vp9.
vpx_thread.[hc]: update webp source reference
+ drop the blob hash, the updated reference will be updated in the
commit message
BUG=b/29583578
vpx_thread: use native windows cond var if available
BUG=b/29583578
original webp change:
commit 110ad5835ecd66995d0e7f66dca1b90dea595f5a
Author: James Zern <jzern@google.com>
Date: Mon Nov 23 19:49:58 2015 -0800
thread: use native windows cond var if available
Vista / Server 2008 and up. no speed difference observed.
100644 blob 4fc372b7bc6980a9ed3618c8cce5b67ed7b0f412 src/utils/thread.c
100644 blob 840831185502d42a3246e4b7ff870121c8064791 src/utils/thread.h
vpx_thread: use InitializeCriticalSectionEx if available
BUG=b/29583578
original webp change:
commit 63fadc9ffacc77d4617526a50c696d21d558a70b
Author: James Zern <jzern@google.com>
Date: Mon Nov 23 20:38:46 2015 -0800
thread: use InitializeCriticalSectionEx if available
Windows Vista / Server 2008 and up
100644 blob f84207d89b3a6bb98bfe8f3fa55cad72dfd061ff src/utils/thread.c
100644 blob 840831185502d42a3246e4b7ff870121c8064791 src/utils/thread.h
vpx_thread: use WaitForSingleObjectEx if available
BUG=b/29583578
original webp change:
commit 0fd0e12bfe83f16ce4f1c038b251ccbc13c62ac2
Author: James Zern <jzern@google.com>
Date: Mon Nov 23 20:40:26 2015 -0800
thread: use WaitForSingleObjectEx if available
Windows XP and up
100644 blob d58f74e5523dbc985fc531cf5f0833f1e9157cf0 src/utils/thread.c
100644 blob 840831185502d42a3246e4b7ff870121c8064791 src/utils/thread.h
vpx_thread: use CreateThread for windows phone
BUG=b/29583578
original webp change:
commit d2afe974f9d751de144ef09d31255aea13b442c0
Author: James Zern <jzern@google.com>
Date: Mon Nov 23 20:41:26 2015 -0800
thread: use CreateThread for windows phone
_beginthreadex is unavailable for winrt/uwp
Change-Id: Ie7412a568278ac67f0047f1764e2521193d74d4d
100644 blob 93f7622797f05f6acc1126e8296c481d276e4047 src/utils/thread.c
100644 blob 840831185502d42a3246e4b7ff870121c8064791 src/utils/thread.h
vp9_postproc.c missing extern.
BUG=webm:1256
deblock: missing const on extern const.
postproc - move filling of noise buffer to vpx_dsp.
Fix encoder crashes for odd size input
clean-up vp9_intrapred_test
remove tuple and overkill VP9IntraPredBase class.
postproc: noise style fixes.
gtest-all.cc: quiet an unused variable warning
under windows / mingw builds
vp9_intrapred_test: follow-up cleanup
address few comments from ce050afaf3e288895c3bee4160336e2d2133b6ea
Change-Id: I3eece7efa9335f4210303993ef6c1857ad5c29c8
- For experiment EXT_INTERP under high bit depth.
- Add unit test to verify bit-exact.
- Speed performance improvement:
On Xeon E5-2680, park_joy_1080p_12.y4m, 50 frames, encoding time
drops from 6682503 ms to 5390270 ms.
Change-Id: Iea4debf5414f3accf1eb5672abeab56a0539ac77
Directly call c functions, otherwise when EXT_TX is enabled, hybrid
transform other than combination of DCT/ADST has not been implemented, thus
will cause assertion failures in the switch loops in vp10_fhtnxn_msa() and
vp10_ihtnxn_nxn_add_msa().
BUG=webm:1239
Change-Id: I2379a07e5406f9489edcd2f3205682f679c9b091
- Apply 8-pixel vertical filtering direction parallelism.
- Add unit tests to verify bit exact.
- Encoder speed improves ~29% (enable EXT_INTERP) on Xeon E5-2680.
- Combinational cycle count of vp10_convolve() drops from 26.06%
to 6.73%.
Change-Id: Ic1ae48f8fb1909991577947a8c00d07832737e57
This reinstates commit efda2831e5f758b4f350679b5c55c0b9282449b0
without the tests and with fixes for 32 bit x86 builds.
Change-Id: I34be4fe1e8a67686d26ba256fd7efe0eb6a569e8
- Apply signal direction/4-pixel vertical/8-pixel vertical
parallelism.
- Add unit test to verify the bit exact result.
- Overall encoding time improves ~24% on Xeon E5-2680 CPU.
Change-Id: I104dcbfd43451476fee1f94cd16ca5f965878e59
This experiment implements non-uniform quantization where
the width of the bins increases gradually to more closely
match a laplacian distribution of the coeficcients.
Performance Gain:
derflr: 0.15%
hevcmr: 0.675%
Change-Id: I25234244e3bcd94b87c1f77cf682190b61c8ef94
This reverts commit efda2831e5f758b4f350679b5c55c0b9282449b0.
This commit causes segmentation fault at SSE2/SumSquares2DTest.RandomValues/0
Change-Id: I171937e4daf6f15323e8206418773deb03bd8c53
We can optimize wedge partition selection by pre-computing the
residuals of the 2 underlying predictors, and then blend these
to compute the sse of the compound predictor, without actually
having to compute and subtract the compound predictor.
Similarly we can pre-compute a proxy array which we can use to
cheaply check which mask sign would have lower sse.
Details are in wedge_utils.c.
Mathematically these are equivalence transformations, but due to the
finite precision the encoder output will be perturbed, though on
average this should make 0% difference.
ext-inter gains about ~4.5% speedup.
Change-Id: Ib2657c3209ae161b4090b58b4b6c392641bf2792
Function level timing test shows about 27% time saving on
a Xeon E5-2680 v2 desktop.
Rename vp9_dct_sse2.c to vp9_dct_intrin_sse2.c for vp9 and
rename dct_sse2.c to dct_intrin_sse2.c for vp10 to avoid
duplicate basenames.
Actually vp9_fwht4x4_mmx/sse2() and vp10_fwht4x4_mmx/sse2()
are identical. TODO: They should be unified later if there is
no intention to keep a duplicate.
Change-Id: I3e537b7bbd9ba417c606cd7c68c4dbbfa583f77d
- Tx_type: DCT_DCT, DCT_ADST, ADST_DCT, ADST_ADST.
- Encoder overall instruction count drops 2.91%.
- Decoder overall instruction count drops 1.01%.
- Add unit test to test bit-exact result against C.
Change-Id: I908c9e0e5106c58f67dd72d28760e6c9ce54278e
This change has no performance impact. It prepares the proper
function interface for better performance optimization.
Change-Id: I12e2f2deaf7f3adc603de0a74852116468c762f6
Unit test shows manually developed SSE4.1 code would performs ~30%
better if TXFM_2D_CFG configuration is set in lower level. This
change only updates function signature. There is no performance
impact.
Change-Id: I62692bd50a21ffc8a944bbd6c155c0a2020ad77b
- Setup function vp10_highbd_fht4x4_sse4_1 for highbd SSE4.1
intrinsics optimization.
- Wrote SSE4.1 functions: load_buffer_4x4(), write_buffer_4x4(),
and fdct4x4_sse4_1().
- Used logic right shift to avoid coeff memory write/read.
- Turned on vp10_highbd_fht4x4_sse4_1 for DCT_DCT mode only.
- Improved overall encoding performance >2.3% for 50 frames
sequence, park_joy_1080p_12.y4m, in which, --input-bit-depth=12,
--bit-depth=12, 50 frames.
- Unit test passed.
Change-Id: Idd6dc6e472cbbf235f0ade4f66fbe859a860a004
- Implemented fdst16_sse2(), fdst16_8col() against C version: fdst16().
- Turned on 7 DST related hybrid txfm types in vp10_fht16x16_sse2().
- Replaced vp10_fht10x10_c() with vp10_fht16x16_sse2() in
fwd_txfm_16x16().
- Added vp10_fht16x16_sse2() unit test against C version:
vp10_fht16x16_c() (--gtest_filter=*VP10Trans16x16*).
- Unit test passed.
- Speed improvement: 2.4%, 3.2%, 3.2%, for city_cif.y4m, garden_sif.y4m,
and mobile_cif.y4m.
Change-Id: Ib30a67ce5d5964bef143d588d0f8fa438be8901f
This commit enables a hybrid 1-D/2-D transform coding scheme and
the accompany entropy coding system. It currently uses hybrid
1-D/2-D DCT transform coding. It provides coding performance gains:
lowres_all 0.55%
hdres_all 0.43%
Change-Id: I2b30dcafd21eb2bb3371f6e854cbab440a4dfa78
Adds new 32x32 masked 1-d transforms that combine 1-D length-16
DCT with length-16 identity transforms.
To be continued in subsequent patches.
Change-Id: I0b4f66492d44c079b3c3b531ba48a97201de1484
1) copy following files from vpx_dsp/ to vp10/common/
vp10_inv_txfm.c
vp10_inv_txfm.h
vp10_inv_txfm_sse2.c
vp10_inv_txfm_sse2.h
2) change the function prefix "vpx_" to "vp10_" in above files
3) add unit test at vp10_inv_txfm_test.cc
Change-Id: I206f10f60c8b27d872c84b7482c3bb1d1cb4b913