Make all post-processor code conditionally
compilable based on the CONFIG_VP9_POSTPROC
macro.
Also, remove the vizualization code from VP9
since it is out of date and will not compile.
Change-Id: I1e9e13a09ecd43e9a3f3704c175ae8cd258ababd
vp9_block_error_sse2 can only handle 16 bytes at a time but
the function requires to handle a sequence of 32 bytes at a time
so each 16 bytes is handled in a different register.
With AVX2 optimization the 32 bytes can be handled in one register instead
of two in the SSE2
The vp9_block_error was optimized by 85%.
The user level was optimized by 1.2%
Change-Id: Ia8fffe60e61eff7432a5fbd538757894f6c319fd
As a side-effect, the max_sad check is removed from the
C-implementation of VP8, for consistency with VP9, and to
ensure that the SAD tests common to VP8/VP9 pass.
That will make the VP8 C implementation of sad a little slower
but given that is rarely used in practice, the impact will be
minimal.
Change-Id: I7f43089fdea047fbf1862e40c21e4715c30f07ca
The warning messages complained that there are unused arguments
in a few prediction modes. This structure was designed on purpose,
such that a wrapper function can cover all prediction mode cases
and make them readily accessible as an pointer array.
This commit silences such warnings.
Change-Id: I7036b6bdb70747e5327d8f6fceb154f100abc4c0
Inline loopfilter has been already handled in vp9_decode_frame().
Collecting all similar code in one place now.
Change-Id: I358a0280fc7c2b27cca520bc1e8c16c4eb6491dd
* changes:
Turn on unit tests for SSSE3 8x8 forward and inverse 2D-DCT
Change eob threshold for partial inverse 8x8 2D-DCT to 12
SSSE3 8x8 inverse 2D-DCT with first 10 coeffs non-zero
The scanning order has the first 12 coefficients of the 8x8 2D-DCT
sitting in the top left 4x4 block. Hence the partial inverse 8x8
2D-DCT allows to handle cases with eob below 12.
The overall runtime of the inverse 8x8 2D-DCT unit is reduced from
166 cycles (using SSE2) to 150 cycles (using SSSE3).
Change-Id: I4514f9748042809ac84df4c14382c00f313f1cd2
This commit enables ssse3 assembly implementation of the 8x8
inverse 2D-DCT with only first 10 coefficients non-zero. The
average runtime for this unit goes down from 198 cycles to 129
cycles (34.8% faster).
Change-Id: Ie7fa4386f6d3a2fe0d47a2eb26fc2a6bbc592ac7
Includes changes that are not compatible with VS windows builds.
Amongst other things stdint.h is not supported in VS.
This reverts commit 89fbf3de50.
Change-Id: Ifa86d7df250578d1ada9b539c9ff12ed0c523cdd
This commit enables SSSE3 version full inverse 8x8 2D-DCT and
reconstruction. It makes the runtime of vp9_idct8x8_64_add down
from 256 cycles (SSE2) to 246 cycles.
Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3
Assembly implementation of ssse3 8x8 forward 2D-DCT. The current
version is turned on only for x86_64. The average unit runtime
goes from 157 cycles down to 136 cycles, i.e., about 12.8% faster.
This translates into about 1.5% speed-up for pedestrian_area 1080p
at speed 2.
Change-Id: I0f12435857e9425ed7ce12541344dfa16837f4f4
This patch fixed the uninitialized read errors in Issue 748:
"dr memory VP9 encode errors". In vp9_convolve_avg_sse2,
when width is 4, pavgb reads 8 bytes from dst buffer that is
out of range. An error is reported although the data is not
actually used later. This issue was resolved by preventing
uninitialized reads.
Change-Id: I109a54910aa47139cb13119de86f2062cff207df
The macosx release of clang v5.0 identifies itself as:
Apple LLVM version 5.0 (clang-500.2.79) (based on LLVM 3.3svn)
This version of clang uses the older _mm_broadcastsi128_si256, like
v3.3, as given away in the LLVM svn version above.
Change-Id: I4d6d59d5454efd57d2ae9e75f5eb7486af7cbd0c
Calculate the difference variance between last source frame and
current source frame. The variance is calculated at 16x16 block
level. The variances are compared to several thresholds to decide
final partition sizes.
An adaptive strategy is implemented to decide using
SOURCE_VAR_BASED_PARTITION or FIXED_PARTITION based on motions
in the video. The switching test is done once every
search_type_check_frequency frames.
The selection of source_var_thresh needs to be investigated
further later.
RTC set Borg test showed 0.424% overall psnr gain, and 0.357%
ssim gain. For clips with large enough static area, the
encoding speedup is around 2% to 15%.
Change-Id: Id7d268f1d8cbca7fb8026aa4a53b3c77459dc156
Adds some high-level hooks for profile 2 before further
progress on the implementation.
According to the definitiion in this patch:
1. Profile 2 only supports 10 or 12 bit color but not 8
2. Profile 2 supports all color sampling modes: 444, 422 and 420,
and alpha plane.
3. Profile 3 is currently undefined.
Please consider the definition carefully and suggest modifications
to the definition as needed.
Change-Id: I5b284fc679e54ac5aee171af72fa7994cfd28995
For very large size video image, the scaling calculation may need use
value beyond the range of int. This commit upgrade the value to 64bit
to make sure the calculation do not wrap around INT_MAX.
The change corrected the decoder behavior.
The bug affects only very large resolution video because the scaling
calculation was sufficient for image size smaller than 2^13.
This resolves issue:
https://code.google.com/p/webm/issues/detail?id=750
Change-Id: I2d2ed303ca6482f31f819f3c07d6d3e98ef3adc5
2 functions were optimized for avx2 by using full 256 bit register
In order to handle 32 elements in parallel instead of only 16 in parallel:
1. vp9_sad32x32x4d
2. vp9_sad64x64x4d
The function level gain is 66% and the user level gain is ~1%.
Change-Id: I4efbb3bc7d8bc03b64b6c98f5cd5c4a9dd3212cb
Fixed dr memory errors reported in Issue 736:
https://code.google.com/p/webm/issues/detail?id=736
All elements in left_col buffer need to be initialized to ensure
the correctness of SIMD operations in x86 optimized code.
Change-Id: I8e7f26ab45cca8099c1f9342bcf852f828bda7e4