The warning only happens in the VP9 encoder's first pass because src_mi
is not set up yet. It does not cause the encoder to fail, as left_mi and
above_mi are not used in the first pass and are set up again
in the second pass.
Change-Id: I12dffcd5fb1002b2b2dabb083c8726650e4b5f08
The function vp9_filter_block1d16_h8_ssse3 uses the PSHUFB instruction, which has a 3 cycle latency and slows execution when done in blocks of 5 or more on Atom processors.
By replacing the PSHUFB instructions with more efficient single cycle instructions (PUNPCKLBW + PUNPCKHBW + PALIGNR), performance can be improved.
In the original code, the PSHUFB uses every byte and copies consecutive bytes.
This is done more efficiently by PUNPCKLBW and PUNPCKHBW, using PALIGNR to concatenate the intermediate results and then shift right into the next consecutive 16 bytes for the final result.
For example:
filter = 0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8
Reg = 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
REG1 = PUNPCKLBW Reg, Reg = 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7
REG2 = PUNPCKHBW Reg, Reg = 8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15
PALIGNR REG2, REG1, 1 = 0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8
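The byte movement in the example above can be checked with a scalar model. The sketch below is not the SSSE3 assembly from this change; it only emulates the three instructions on plain byte arrays, and all function names in it are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of PUNPCKLBW reg, reg: interleaves the low 8 bytes of
 * reg with themselves, giving 0,0,1,1,...,7,7 for an identity input. */
static void unpack_lo_bytes(const uint8_t reg[16], uint8_t out[16]) {
  for (int i = 0; i < 8; ++i) {
    out[2 * i] = reg[i];
    out[2 * i + 1] = reg[i];
  }
}

/* Scalar model of PUNPCKHBW reg, reg: same as above for the high 8 bytes. */
static void unpack_hi_bytes(const uint8_t reg[16], uint8_t out[16]) {
  for (int i = 0; i < 8; ++i) {
    out[2 * i] = reg[8 + i];
    out[2 * i + 1] = reg[8 + i];
  }
}

/* Scalar model of PALIGNR hi, lo, shift: concatenates hi:lo into 32 bytes
 * (lo in the low half), shifts right by `shift` bytes and keeps the low 16. */
static void align_right(const uint8_t hi[16], const uint8_t lo[16], int shift,
                        uint8_t out[16]) {
  uint8_t cat[32];
  assert(shift >= 0 && shift <= 16);
  memcpy(cat, lo, 16);
  memcpy(cat + 16, hi, 16);
  memcpy(out, cat + shift, 16);
}

/* Produces the same bytes as a shuffle with mask 0,1,1,2,...,7,7,8. */
static void shuffle_pairs(const uint8_t reg[16], uint8_t out[16]) {
  uint8_t lo[16], hi[16];
  unpack_lo_bytes(reg, lo);
  unpack_hi_bytes(reg, hi);
  align_right(hi, lo, 1, out);
}
```

Calling shuffle_pairs on the bytes 0..15 yields the sequence listed for the filter mask in the example.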
This optimization improved the function performance by 23% and produced a 3% user level gain on 1080p content on Atom processors.
There was no observed performance impact on Core processors (expected).
Change-Id: I3cec701158993d95ed23ff04516942b5a4a461c0
Change 72193 made the encoder behave differently
when configured with and without high bitdepth.
This change means the same algorithm is used for both.
Change-Id: I707a44a94afca773a9e0c2f7ebeeea83030257c5
For key frame at speed 6: enable the non-rd mode selection in speed setting
and use the (non-rd) variance_based partition.
Adjust some logic/thresholds in variance partition selection for key frame only (no change to delta frames),
mainly to bias to selecting smaller prediction blocks, and also set max tx size of 16x16.
Loss in key frame quality (~0.6-0.7dB) compared to rd coding,
but speeds up key frame encoding by at least 6x.
Average PSNR/SSIM metrics over RTC clips go down by ~1-2% for speed 6.
Change-Id: Ie4845e0127e876337b9c105aa37e93b286193405
Also removes some spurious changes in common/vp9_blockd.h which
were introduced by a rebase issue between nextgen and master branches.
Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282
(cherry picked from commit 005d80cd05)
(cherry picked from commit 08d2f54800)
(cherry picked from commit 4230c2306c)
This change is made in preparation for a
subsequent patch which adds acceleration
for the highbitdepth transform functions.
The highbitdepth transform functions attempt
to use 16/32bit sse instructions where possible,
but fall back to using the C implementations if
potential overflow is detected. For this reason
the dct routines are made global so they can be
called from the acceleration functions in the
subsequent patch.
Change-Id: Ia921f191bf6936ccba4f13e8461624b120c1f665
(cherry picked from commit 454342d4e7)
This commit reworks the forward transform and quantization process
for 8x8 block coding. It combines the two operations in a single
function to save a store/load stage of the original transform
coefficients. Overall the speed -6 is slightly faster (around 1%
range). The compression performance of speed -6 is improved by
3.4%.
Change-Id: Id6628daef123f3e4649248735ec2ad7423629387
vp9_quantize_fp is the quantization process used by rtc coding
mode. This commit adds a sse2 implementation of it. The
implementation is modified based on vp9_quantize_b_sse2. No speed
difference from ssse3 version.
Change-Id: I24949c5b27df160b4f35117d28858d269454e64a
The function pointer in compressor instance does not change, so this
commit changes to call the function directly.
Change-Id: I9c9c460e3475711c384b74c9842f0b4f3d037cc5
This commit renames a reserved color space entry to BT_2020. It is intended
to let the VP9 bitstream pass along the color space
type defined in BT.2020 (Rec.2020).
Please note this entry does not have any effect on encoding/decoding
behavior, but allows applications to pass the information along
from the encoding end to the decoding end.
Change-Id: I4678520e89141ea5e8900f7bd1c0e95b710b7091
This patch fixes the vpxdec fuzzing3 test failure. When an
error occurs, setjmp() is invoked, which calls the decoder
removing routine. In a multi-threaded situation, other threads
could try to access the frame context memory that is already
deallocated, thus causing a segfault.
An invalid unit test was added for this issue.
Change-Id: Ida7442154f3d89759483f0f4fe0324041fffb952
This will save memory and improve the decode speed by
removing the unnecessary memset of the big prev_mi array for
all the key frames.
Decoding an all-key-frame 1080p video shows a speed improvement of around 2%.
Change-Id: I6284a445c1291056e3c15135c3c20d502f791c10
In the function mb_lpf_horizontal_edge_w_avx2_16, the use of the intrinsic
_mm256_cvtepu8_epi16 triggers a compiler bug in gcc 4.9.1.
Until it is fixed, this workaround performs the up-conversion by
using broadcast128+shuffle.
The bug was reported here:
https://code.google.com/p/webm/issues/detail?id=867
Change-Id: I73452e6806f42e0fadcde96b804ea3afa7eeb351
This will save a lot of memory for the decoder by removing prev_mi,
but prev_mi is still needed in the encoder, so this will slightly increase
memory use for the encoder.
Change-Id: I24b2f1a423ebffa55a9bd2fcee1077dac995b2ed
This patch allocated frame contexts outside VP9_COMMON. This allows
multiple threads to share the same copy of frame contexts, and
reduces the overhead. It also guarantees the correct update of
these contexts during bitstream packing. This patch doesn't change
encoding result.
Change-Id: Ic181a2460b891d1d587278a6d02d8057b9dbd353
Using 4 threads, frame parallel decode is ~3x faster than single thread
decode and around 30% faster than tile parallel decode for frame parallel
encoded video on both Android and desktop. Decode speed also scales
with the number of threads, so decode could be even faster with more threads.
Change-Id: Ia0a549aaa3e83b5a17b31d8299aa496ea4f21e3e
All SAD functions that process 32 or more consecutive elements are optimized
for AVX2:
vp9_sad64x64
vp9_sad64x32
vp9_sad32x64
vp9_sad32x32
vp9_sad32x16
vp9_sad64x64_avg
vp9_sad64x32_avg
vp9_sad32x64_avg
vp9_sad32x32_avg
vp9_sad32x16_avg
The functions that appeared as hotspots are vp9_sad32x32 and vp9_sad64x64.
vp9_sad32x32 was optimized by 68% and vp9_sad64x64 was optimized by 90%.
Together they gave an overall ~2.3% user level gain.
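For reference, the quantity these functions compute is the sum of absolute differences between a source block and a reference block. The sketch below is a plain C model of that definition, not the AVX2 code from this change; the function names and argument order are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over a width x height block.
 * Strides are given in bytes, one byte per pixel. */
static unsigned int sad_block(const uint8_t *src, int src_stride,
                              const uint8_t *ref, int ref_stride,
                              int width, int height) {
  unsigned int sad = 0;
  assert(width > 0 && height > 0);
  for (int r = 0; r < height; ++r) {
    for (int c = 0; c < width; ++c) {
      sad += (unsigned int)abs(src[c] - ref[c]);
    }
    src += src_stride;
    ref += ref_stride;
  }
  return sad;
}

/* Hypothetical 32x32 wrapper, mirroring the shape of the listed functions. */
static unsigned int sad32x32_ref(const uint8_t *src, int src_stride,
                                 const uint8_t *ref, int ref_stride) {
  return sad_block(src, src_stride, ref, ref_stride, 32, 32);
}
```

A vectorized version processes each 32-byte row in one register, which is presumably why only block widths of 32 and 64 appear in the list.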
Change-Id: Iccf86b375a2b54c5fbbe685902ead0c9a561b9fd
This removes an assumption that worker->data1 would be pointing to a
TileWorkerData allocation.
Additionally, within the multi-threaded loopfilter, pass VP9LfSync as a
parameter to the worker hook, removing the need for a shadow pointer in
LFWorkerData.
Change-Id: Ic7b2faa34e3eb59dbcb8a7c67f333448fa047c88
The concept:
There's too much noise in source pixels for variance and at low bitrate
the reconstructed looks nothing like the source so we have problems
getting good partitionings with either. This skirts the issue by using
a box blur scaled down version for variance calculations. To compare
against source_var_ moved keyframe to be rd based like source_var.
Change-Id: Ie3babdbfadae324b7b5a76bea192893af27f0624
This commit breaks the overly broad header files into more
targeted and smaller ones, to help better structure the system
layout.
Change-Id: I7b24559d3ea6e582cf5d9bbe8f71459f9824d71b
The functions b_width_log2 and b_height_log2 only do direct
table fetch. This commit unifies such use cases by using the
table directly and removes these functions.
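The refactoring described above can be sketched as follows. The enum, table name and table contents here are hypothetical and reduced to four block sizes; they only illustrate replacing a one-line wrapper with a direct table access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced block-size enum for illustration only. */
typedef enum { BLK_4X4, BLK_8X8, BLK_16X16, BLK_32X32, BLK_SIZES } BlkSize;

/* log2 of the block width, measured in 4-pixel units. */
static const uint8_t blk_width_log2_lookup[BLK_SIZES] = { 0, 1, 2, 3 };

/* Before: a wrapper whose only job is the table fetch. */
static int blk_width_log2(BlkSize s) { return blk_width_log2_lookup[s]; }

/* After: callers index the table directly and the wrapper can be removed. */
static int blk_width_in_4px_units(BlkSize s) {
  assert((int)s < BLK_SIZES);
  return 1 << blk_width_log2_lookup[s];
}
```

Both forms return the same value; using the table directly just removes one layer of naming.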
Change-Id: I3103fc6ba959c1182886a2799d21b8b77c8a7b6b
Add comments on the use case of these definitions. Further reduce
the scope of header file in vp9_context_tree.h.
Change-Id: Ic4a7638e838d0ac441b64abfc56e57354c059d75
This SSE2 is based on VP8 denoiser's SSE2 code. In VP8, there are
only 16x16 blocks in denoiser, while in VP9, there are 13 different
block sizes.
By adding this SSE2 code, encoder speed improves by around
20% (C code vs. SSE2 code), varying across clips.
The unit test for the VP9 denoiser confirms that the SSE2 code is
bit-exact with the C code. The unit test covers all block sizes.
Change-Id: Ic8d8ac26db4ea40a5f146b5678a065af07eaaa3d
Incorporates the WRAPLOW macro into the non-highbitdepth transforms
to aid hardware verification between a software C model and an
intended hardware implementation through the use of the configure
options: --enable-experimental --enable-emulate-hardware.
Note that to avoid further discrepancies between the sse/sse2
implementations of the transforms and the C implementation, when the
emulate hardware option is invoked, we also disable sse/sse2/etc.
Also includes some minor cleanups/renaming etc.
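The message does not show how WRAPLOW is defined. As a rough illustration of the idea, the sketch below wraps an intermediate value to 16-bit two's complement when a hardware-emulation switch is on and passes it through otherwise. The switch name EMULATE_HARDWARE_SKETCH and the function wrap_low16 are hypothetical and are not the library's macro.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical switch standing in for the configure option. */
#ifndef EMULATE_HARDWARE_SKETCH
#define EMULATE_HARDWARE_SKETCH 1
#endif

#if EMULATE_HARDWARE_SKETCH
/* Keep only the low 16 bits and reinterpret them as a signed value,
 * the way a 16-bit hardware register would. */
static int32_t wrap_low16(int32_t x) {
  const uint32_t low = (uint32_t)x & 0xFFFFu;
  return (low & 0x8000u) ? (int32_t)low - 0x10000 : (int32_t)low;
}
#else
/* Without emulation the value is passed through unchanged. */
static int32_t wrap_low16(int32_t x) { return x; }
#endif
```

With emulation on, a value that overflows 16 bits in the C model wraps the same way it would in a 16-bit datapath, so the two can be compared directly.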
Change-Id: Ib864d8493313927d429cce402982f1c8e45b3287
Miscellaneous bug-fixes for high bitdepth functionality.
With this patch, high bit-depth profiles become mostly functional,
except for an intermittent assert failure issue that is being
tracked.
Change-Id: I6a7fcbdcf1e5b09842e88535f8442d2e1230748c
The commit cleans up the header files in vp9_entropymv.h. This
file should only depend on vp9_mv.h and vp9_prob.h. Remove the
giant vp9_blockd.h from header file list.
Change-Id: I44cd26d2cfd10a16a9325778347dd53f888a874c
Moves transform type defines to vp9_common.h from vp9_idct.h
so that they can be included in vp9_rtcd_defs.pl safely.
Change-Id: Id5106227bee5934f7ce8b06f2eb9fa8a9a2e0ddb
This reverts commit eafc8c9c40.
tran_low_t/tran_high_t don't belong in a public header, they're private.
Similarly the public headers shouldn't rely on config defines,
vpx_config.h isn't installed.
Change-Id: I194ec273598da418df8dd727b6c0e78a556740ad
Some header files included in vp9_idct.c are already included in vp9_idct.h.
This commit removes these redundant includes.
Change-Id: I0238c27e4efff5c981eb437022c6bc6970c4e445
This commit fixes a compiling error in vp9_idct.h, where the codec
checks that the intermediate steps of transformation fit within
16-bit length. The issue was due to broken file dependency.
Change-Id: Ib22bba13a1e6df28489cb23d6774c561969f1fdc
The first comment is obsolete given the way is now normative in the VP9
bitstream. The second comment line was too long.
Change-Id: I6546585babf60d466485ddcf2daa6d2fa79e999a
As reported in issue #850, the condition for border extension was not
complete. This commit added the case when the scaling is enabled.
This fixes issue #850.
Change-Id: I67768b23f0dcc4ac9a9aa0a0825b0fe8cb85a72e
mi_grid_* are arrays of pointer to pointer. They save the pointers that point
to the MIs in cm->mi. But they are unnecessary and complicated. The original
goal was to remove MODE_INFO_t copy. But with an extra MODE_INFO_t pointer
inside MODE_INFO_t, same goal could be achieved.
This commit totally removes the mi_grid_* structures. But there are still
many dummy MODE_INFO_t inside cm->mi which are a waste of memory. Next commit
will do on-demand MODE_INFO_t allocation in order to save this memory.
Change-Id: I3a05cf1610679fed26e0b2eadd315a9ae91afdd6
This commit adds back sse2 or ssse3 optimized versions of a couple of
functions, fixing a ~10% performance regression.
Change-Id: I049786906e5a641224dced63c6492aec9d86d183
Libvpx was memsetting every external frame buffer before decode. This
was to work around a valgrind issue in our C loop filter. Most of
the time this was not needed and we have noticed some significant
performance loss on some platforms. Now we require the application to
zero out the buffers if it is using external frame buffers.
Change-Id: I7330d00a315e65137ed30edd5f813e8929b76242
The issue was discovered on bitstream with 2x vertical downscale. For
zero MVs, y_pad is set to 1 only when vertical convolution is
required. The original code assumes that for y_step_q4 == 32 we don't
perform vertical convolution. But vp9_setup_scale_factors_for_frame()
sets convolve functions so that when x_step and y_step are both not
equal to 16, convolve in both directions is performed. And convolve()
unconditionally subtracts one stride from source pointer when it calls
convolve_horiz(). This leads to invalid memory access.
Change-Id: I882dfa6081a58e172b5ffa55842bfcd6727f10bf
Adds various high bitdepth transform functions and tests.
Many of the changes relate to using the typedefs tran_low_t
and tran_high_t for the final transform coefficients and intermediate
stages of the transform computation respectively, rather than the fixed
types int16_t/int. When the vp9_highbitdepth configure flag is off,
these map to int16_t/int32_t, but when the flag is on, they map
to int32_t/int64_t to make space for the extra precision needed.
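The type mapping described above can be sketched as below. The typedef names come from the message; the switch HIGHBITDEPTH_SKETCH and the helper scale_round are hypothetical stand-ins for the configure flag and for a single transform step.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical switch standing in for the vp9_highbitdepth configure flag. */
#ifndef HIGHBITDEPTH_SKETCH
#define HIGHBITDEPTH_SKETCH 0
#endif

#if HIGHBITDEPTH_SKETCH
typedef int32_t tran_low_t;  /* final transform coefficients */
typedef int64_t tran_high_t; /* intermediate stages */
#else
typedef int16_t tran_low_t;
typedef int32_t tran_high_t;
#endif

/* Hypothetical transform step: multiply by a Q14 constant in the wide
 * type, round, and store the result back in the narrow type. */
static tran_low_t scale_round(tran_low_t a, int coeff_q14) {
  const tran_high_t prod = (tran_high_t)a * coeff_q14;
  return (tran_low_t)((prod + (1 << 13)) >> 14);
}
```

Code written against the two typedefs compiles unchanged in both configurations; only the widths behind the names differ.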
Change-Id: I3c56de79e15b904d6f655b62ffae170729befdd8
If optimizations use more than one cpu feature, allow
specifying them so that '--disable-X' still works
https://code.google.com/p/webm/issues/detail?id=854
Change-Id: I3108ea37b397371a2be84dd5f2380b304db23f18
Removed functions:
* vp9_post_proc_down_and_across_mmx
* vp9_mbpost_proc_down_mmx
* vp9_plane_add_noise_mmx
They all have sse2 equivalents.
Change-Id: I59c1fac12b7c96ca4538d455e4400c2b7875feff
vp9_variance_sse2.c contains a mix of intrinsics and references to
assembly which uses x86inc.asm; it's conditionally included as a result.
Change-Id: I254451483a65881c0b8e18e27bf0c3ddef60c4ec