In the ssse3 functions, the DEFINE_ARGS macro hard-codes qcoeff and
dqcoeff to r3 and r4. If skip is 1, qcoeff and dqcoeff need to be loaded
from the stack, which doesn't work because of the above definitions.
Currently, the skip=1 case is not used in the encoder. This patch fixes
the issue, so it can be turned on later.
Change-Id: I998d696b1a7a85dca2b3bcee790b21c21e039147
This commit introduces a new block-matching motion estimation method
based on integral projection. The 2-D block and the nearby
region are projected onto horizontal and vertical 1-D vectors,
respectively. The search then runs vector matching, instead of block
matching, over the two separate 1-D vectors to locate the
motion-compensated reference block.
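The projection and vector-match steps described above can be illustrated with a minimal Python sketch. It is not the code added by this commit: the function names, the choice to project the reference over the co-located rows and columns, and the matching cost (a scaled variance of the element-wise difference, which ignores a constant offset between the vectors) are all assumptions made for illustration.

```python
def project(rows):
    """Return (column_sums, row_sums) of a 2-D list of pixel values."""
    col_sums = [sum(col) for col in zip(*rows)]
    row_sums = [sum(row) for row in rows]
    return col_sums, row_sums


def vector_cost(a, b):
    """Scaled variance of (a - b); zero when the vectors differ by a constant.

    Assumed cost for this sketch, not taken from the commit.
    """
    d = [x - y for x, y in zip(a, b)]
    return len(d) * sum(v * v for v in d) - sum(d) ** 2


def vector_match(ref_vec, src_vec):
    """Return the offset into ref_vec whose window best matches src_vec."""
    n = len(src_vec)
    offsets = range(len(ref_vec) - n + 1)
    return min(offsets, key=lambda o: vector_cost(ref_vec[o:o + n], src_vec))


def projection_search(src_block, ref_region):
    """Estimate (dx, dy) of a BxB block inside a (B+2R)x(B+2R) reference region.

    The region is assumed to be centred on the block's co-located position,
    so an offset of R in either vector means zero motion.
    """
    b = len(src_block)
    r = (len(ref_region) - b) // 2
    src_cols, src_rows = project(src_block)
    # Horizontal vector: column sums over the co-located rows.
    ref_cols, _ = project(ref_region[r:r + b])
    # Vertical vector: row sums over the co-located columns.
    _, ref_rows = project([row[r:r + b] for row in ref_region])
    dx = vector_match(ref_cols, src_cols) - r
    dy = vector_match(ref_rows, src_rows) - r
    return dx, dy
```

Two 1-D searches over 2R+1 offsets each replace a 2-D search over (2R+1)^2 block positions, which is where the cycle savings of this kind of approach come from.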
This process is run per 64x64 block to align the reference before
choosing partitioning in speed 6. The overall CPU cycle cost of
this additional 64x64 block match (SSE2 version) is around 2%
at low bit-rate rtc speed 6. When strong motion activity exists in
the video sequence, it substantially improves the partition
selection accuracy, thereby achieving better compression performance
and lower CPU cycles.
The experiments were run in the RTC speed -6 setting:
cloud 1080p 500 kbps
17006 b/f, 37.086 dB, 5386 ms ->
16669 b/f, 37.970 dB, 5085 ms (>0.9dB gain and 6% faster)
pedestrian_area 1080p 500 kbps
53537 b/f, 36.771 dB, 18706 ms ->
51897 b/f, 36.792 dB, 18585 ms (4% bit-rate savings)
blue_sky 1080p 500 kbps
70214 b/f, 33.600 dB, 13979 ms ->
53885 b/f, 33.645 dB, 10878 ms (30% bit-rate savings, 25% faster)
jimred 400 kbps
13380 b/f, 36.014 dB, 5723 ms ->
13377 b/f, 36.087 dB, 5831 ms (2% bit-rate savings, 2% slower)
Change-Id: Iffdb6ea5b16b77016bfa3dd3904d284168ae649c
The high bit depth build failed when building for a 32-bit target.
The bugs were in the vp9_highbd_subpel_variance.asm and
vp9_highbd_sad4d_sse2.asm functions. This patch fixes the bugs
and makes the 32-bit build work.
Change-Id: Idc8e5e1b7965bb70d4afba140c6583c5d9666b75
This patch fixes issue 924:
https://code.google.com/p/webm/issues/detail?id=924
The SECTION_RODATA macro was modified to support macho32 format.
The sub-pixel functions were modified to pass in 2 more parameters
to handle the global offsets for PIC build.
Change-Id: I3bfcd336bcae945edf300bca4ab40376a2628cd4
For key frames at speed 6: enable the non-rd mode selection in the speed setting
and use the (non-rd) variance-based partition.
Adjust some logic/thresholds in variance partition selection for key frames only (no change to delta frames),
mainly to bias toward selecting smaller prediction blocks, and also set the max tx size to 16x16.
Loss in key frame quality (~0.6-0.7dB) compared to rd coding,
but speeds up key frame encoding by at least 6x.
Average PSNR/SSIM metrics over RTC clips go down by ~1-2% for speed 6.
Change-Id: Ie4845e0127e876337b9c105aa37e93b286193405
Also removes some spurious changes in common/vp9_blockd.h which
were introduced by a rebase issue between nextgen and master branches.
Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282
(cherry picked from commit 005d80cd05)
(cherry picked from commit 08d2f54800)
(cherry picked from commit 4230c2306c)
This commit reworks the forward transform and quantization process
for 8x8 block coding. It combines the two operations in a single
function to save a store/load stage of the original transform
coefficients. Overall, speed -6 is slightly faster (around 1%).
The compression performance of speed -6 is improved by
3.4%.
Change-Id: Id6628daef123f3e4649248735ec2ad7423629387
vp9_quantize_fp is the quantization process used by rtc coding
mode. This commit adds an sse2 implementation of it. The
implementation is adapted from vp9_quantize_b_sse2. There is no speed
difference from the ssse3 version.
Change-Id: I24949c5b27df160b4f35117d28858d269454e64a
Combined vp9_denoiser_8xM_sse2 and vp9_denoiser_4xM_sse2 into one
function, vp9_denoiser_NxM_sse2_small, which passes the bit-exact testing.
Renamed the function vp9_denoiser_64_32_16xM_sse2 to
vp9_denoiser_NxM_sse2_big.
Change-Id: Ib22478df585994dd347ebae04202c0b701e7f451
All sad functions that process 32 or more consecutive elements are optimized
for AVX2:
vp9_sad64x64
vp9_sad64x32
vp9_sad32x64
vp9_sad32x32
vp9_sad32x16
vp9_sad64x64_avg
vp9_sad64x32_avg
vp9_sad32x64_avg
vp9_sad32x32_avg
vp9_sad32x16_avg
The functions that appeared as hotspots are vp9_sad32x32 and vp9_sad64x64.
vp9_sad32x32 was optimized by 68% and vp9_sad64x64 was optimized by 90%.
Together they gave an overall ~2.3% user-level gain.
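For reference, the two function families listed above can be sketched in scalar form as follows. This is a simplified Python illustration and not the AVX2 code from this commit: the function names and the list-of-rows block layout are assumptions, and the _avg variant is shown as a SAD against the rounded average of the reference and a second predictor.

```python
def sad(src, ref):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(s - r)
               for src_row, ref_row in zip(src, ref)
               for s, r in zip(src_row, ref_row))


def sad_avg(src, ref, second_pred):
    """SAD of src against the rounded average of ref and second_pred."""
    total = 0
    for src_row, ref_row, pred_row in zip(src, ref, second_pred):
        for s, r, p in zip(src_row, ref_row, pred_row):
            # (r + p + 1) >> 1 rounds the average to the nearest integer.
            total += abs(s - ((r + p + 1) >> 1))
    return total
```

Each row of a 32- or 64-wide block is a run of independent byte differences, which is why these sizes map well onto 32-byte AVX2 registers.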
Change-Id: Iccf86b375a2b54c5fbbe685902ead0c9a561b9fd
The concept:
There's too much noise in the source pixels for variance, and at low bitrate
the reconstructed frame looks nothing like the source, so we have problems
getting good partitionings with either. This skirts the issue by using
a box-blurred, scaled-down version for variance calculations. To compare
against source_var_, moved keyframe to be rd based like source_var.
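The idea of computing variance on a box-filtered, downscaled copy can be illustrated with a small Python sketch. The 2x2 scale factor, the rounding, and the function names are assumptions chosen for illustration; the commit message does not state the filter size the encoder uses.

```python
def box_downscale(pixels, factor=2):
    """Downscale a 2-D list by averaging each factor x factor cell (box filter)."""
    height = len(pixels) // factor
    width = len(pixels[0]) // factor
    area = factor * factor
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            total = sum(pixels[by * factor + dy][bx * factor + dx]
                        for dy in range(factor)
                        for dx in range(factor))
            row.append((total + area // 2) // area)  # rounded average
        out.append(row)
    return out


def block_variance(pixels):
    """Population variance of all pixel values in a 2-D list."""
    values = [v for row in pixels for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

On a flat area carrying pixel-level noise, the raw variance is nonzero while the variance of the downscaled copy drops toward zero, so the partition decision is less likely to split blocks on noise alone.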
Change-Id: Ie3babdbfadae324b7b5a76bea192893af27f0624
This SSE2 code is based on the VP8 denoiser's SSE2 code. In VP8, there are
only 16x16 blocks in the denoiser, while in VP9, there are 13 different
block sizes.
With this SSE2 code, the improvement in encoder speed is around
20% (C code vs. SSE2 code), varying across clips.
The unit test for the VP9 denoiser confirms that the SSE2 code is
bit-exact with the C code. The unit test covers all block sizes.
Change-Id: Ic8d8ac26db4ea40a5f146b5678a065af07eaaa3d
In the sub_pixel_*variance* functions, dst is aligned to 16 bytes, not
to 32 bytes, so the data is now loaded unaligned.
Change-Id: I2e0b9745543697efc56fefa32857ea10117af135
In the functions sad32x32x4d and sad64x64x4d, the source is aligned to 16 bytes,
not to 32 bytes, so the load is now unaligned.
Change-Id: I922fdba56d0936b5cf72e4503519f185645a168c
Remove all the redundant dct functions (dct4x4, dct8x8)
in avx2 except dct32x32. Those functions were originally copied from dct_sse2.
Change-Id: I742576fbf5175f3ac09f2076976a9247b259323e
This commit enables a new quantization process for 32x32 2D-DCT
transform coefficient blocks. It improves the compression
performance of speed 5 by 1.4%. The overall compression gain of
speed 5 due to the new quantization scheme is 4.7%. It also includes
the SSSE3 implementation of the 32x32 quantization process.
Change-Id: I0855b124fd6462418683f783f5bcb44255c9993b