Restructured and rewrote SSE2 loopfilter functions. Combined u and
v into one function to take advantage of SSE2 128-bit registers.
Tests on test clips showed a 4% decoder performance improvement on
Linux desktop.
Change-Id: Iccc6669f09e17f2224da715f7547d6f93b0a4987
Following conversations with Tim T (Derf) I ran a large number of
tests comparing the existing polynomial expression with a simpler
^2 variant. Though the polynomial was sometimes a little better at
the extremes of Q it was possible to get close for most clips and
even a little better on some.
This code also changes the way the RD multiplier is calculated
when the ZBIN is extended to use a variant of the same ^2
expression.
I hope that this simpler expression will be easier to tune further
as we expand our test set and consider adjustments based on content.
Change-Id: I73b2564346e74d1332c33e2c1964ae093437456c
Besides the slight improvement in round trip error. This
also fixes a sign bias in the forward transform, so the
round trip errors are evenly distributed between +1s and
-1s. The old bias seemed to work well with the dc sign bias
in old fdct, which no longer exist in the improved fdct.
Change-Id: I8635e7be16c69e69a8669eca5438550d23089cef
Corrected setting of "which_buffer" for U & V cases to match that
used for Y, i.e. to refer to the temporally most recent frame of
those to be filtered.
Change-Id: Idf94b287ef47a05f060da3e61134a0b616adcb6b
The new fdct lowers the round trip sum squared error for a
4x4 block ~0.12. or ~0.008/pixel. For reference, the old
matrix multiply version has average round trip error 1.46
for a 4x4 block.
Thanks to "derf" for his suggestions and references.
Change-Id: I5559d1e81d333b319404ab16b336b739f87afc79
bestsad needs to be a int and set to INT_MAX because at the end
of the function it is compared to INT_MAX to determine if there
was a match in the function.
Change-Id: Ie80e88e4c4bb4a1ff9446079b794d14d5a219788
bestsad should be an int initialized to INT_MAX. The optimized
SAD function expects a signed value for bestsad to use for comparison
and early loop termination. When no match is made, which is
determined by a comparison of bestsad to INT_MAX, INT_MAX is returned.
These are mostly vestigial, it's up to the compiler to decide what
should be inlined, and this collided with certain Windows platform SDKs.
Change-Id: I80dd35de25eda7773156e355b5aef8f7e44e179b
1. Unavailability of each reference frame type should be tested
independently,
2. Also, only the VP8_GOLD_FLAG needs to be tested before setting
golden frame specific thresholds, and only VP8_ALT_FLAG needs
testing before setting thresholds relevant to the AltRef frame.
(Raised by gbvalor, in response to Issue 47)
Change-Id: I6a06fc2a6592841d85422bc1661e33349bb6c3b8
Since the intent is
to reset the appropriate bit in ref_frame_flags not to
test a logic condition. Prior result would always have
been ref_frame_flags being set to 0.
(Issue reported by dgohman, issue 47)
Change-Id: I2c12502ed74c73cf38e98c9680e0249c29e16433
The DOUBLE_DIVIDE_CHECK macro prevents from divide by 0,
so must be on the denominator to work as intended.
Change-Id: Ie109242d52dbb9a2c4bc1e11890fa51b5f87ffc7
If the version script produced by the libvpx build system is not
used when linking a shared library on x86-64 Linux, the constant
data in the subpel filters produces R_X86_64_32 relocation errors
due to the use of wrt rip addressing instead of
wrt rip wrt ..gotpcrel.
Instead of adding a new macro for this addressing mode, this patch
sets the ELF visibility of these symbols to "hidden", which
allows wrt rip addressing to work without a text relocation.
This allows building a shared library without using the provided
build system or a separate version script.
Fixes http://code.google.com/p/webm/issues/detail?id=46
Change-Id: Ie108f9d9a4352e5af46938bf4750d2302c1b2dc2
When the license headers were updated, they accidentally contained
trailing whitespace, so unfortunately we have to touch all the files
again.
Change-Id: I236c05fade06589e417179c0444cb39b09e4200d
Change bitreading functions to use a larger window which is refilled less
often.
This makes it cheap enough to do bounds checking each time the window is
refilled, which avoids the need to copy the input into a large circular
buffer.
This uses less memory and speeds up the total decode time by 1.6% on an ARM11,
2.8% on a Cortex A8, and 2.2% on x86-32, but less than 1% on x86-64.
Inlining vp8dx_bool_decoder_fill() has a big penalty on x86-32, as does moving
the refill loop to the front of vp8dx_decode_bool().
However, having the refill loop between computation of the split values and
the branch in vp8_decode_mb_tokens() is a big win on ARM (presumably due to
memory latency and code size: refilling after normalization duplicates the
code in the DECODE_AND_BRANCH_IF_ZERO and DECODE_AND_LOOP_IF_ZERO cases.
Unfortunately, refilling at the end of vp8dx_bool_decoder_fill() and at the
beginning of each decode step in vp8_decode_mb_tokens() means the latter
requires an extra refill at the end.
Platform-specific versions could avoid the problem, but would require most of
detokenize.c to be duplicated.
Change-Id: I16c782a63376f2a15b78f8086d899b987204c1c7
Added sse2 version of vp8_regular_quantize_b which improved encode
performance(for the clip used) by ~10% for 32 bit builds and ~3% for
64 bit builds.
Also updated SHADOW_ARGS_TO_STACK to allow for more than 9 arguments.
Change-Id: I62f78eabc8040b39f3ffdf21be175811e96b39af
This patch addresses issue #79, which is a regression since commit
28de670 "Fix RD bug." If the coded error value is zero, the iiratio
calculation effectively multiplies by 1000000 by the
DOUBLE_DIVIDE_CHECK macro. This can result in a value larger than
INT_MAX, giving a negative ratio. Since the error values are
conceptually unsigned (though they're stored in a double) this patch
makes the iiratio values unsigned, which allows the clamping to work
as expected.
ssim.c comiles in a huge (512M) amount of global scratch space. Allocating
this data on the heap would be a better solution, but this file doesn't
need to be built at all in most cases, so as a first pass, disable it
except when doing opsnr.stt output (--enable-psnr).
Change-Id: I320d812f6d652a12516a16b52295ebff20b5bd42
No good reason to be tricky here. I don't know why 'break' occurred to me
as the natrual replacement for the 'return', but an if/else block is
definitely clearer.
Change-Id: I08a336307afeb0dc7efa494b37398f239f66c2cf
The new scheme introduced in I68d35a2f did not clamp chroma MVs in the SPLITMV
case, and clamped them incorrectly (to the luma plane bounds) in every other
case.
Because chroma MVs are computed from the luma MVs before clamping occurs, they
could still point outside of the frame buffer and cause crashes.
This clamping happens outside of the MV prediction loop, and so should not
affect bitstream decoding.
Restructure vp8_sixtap_predict functions to eliminate extra 5-line
calculation while doing first-pass only. Also, combline functions
to eliminate usage of intermediate buffer. This gives decoder a 3%
performance gain on my test clips.
Change-Id: I13de49638884d1a57d0855c63aea719316d08c1b
This patch removes the secondary MV clamping from the MV decoder. This
behavior was consistent with limits placed on non-split MVs by the
reference encoder, but was inconsistent with the MVs generated in the
split case.
The purpose of this secondary clamping was only to prevent crashes on
invalid data. It was not intended to be a behaviour an encoder could or
should rely on. Instead of doing additional clamping in a way that
changes the entropy context, the secondary clamp is removed and the
border handling is made implmentation specific. With respect to the
spec, the border is treated as essentially infinite, limited only by
the clamping performed on the near/nearest reference and the maximum
encodable magnitude of the residual MV.
This does not affect any currently produced streams.
Change-Id: I68d35a2fbb51570d6569eab4ad233961405230a3
This patch adds support for building shared libraries when configured
with the --enable-shared switch.
Building DLLs would require more invasive changes to the sample
utilities than I want to make in this patch, since on Windows you can't
use the address of an imported symbol in a static initializer. The best
way to work around this is proably to build the codec interface mapping
table with an init() function, but dll support is of questionable value
anyway, since most windows users will probably use a media framework
lib like webmdshow, which links this library in staticly.
Change-Id: Iafb48900549b0c6b67f4a05d3b790b2643d026f4
This patch creates some basic infrastructure for doing bitstream-
incompatible changes to the VP8 encoder. The key parts are:
- --enable-experimental configure switch, to enable support for this
incompatible bitstream. This switch is required to be set to enable
any "experiments"
- A list for "experiments" which translate into --enable-<experiment>
options and CONFIG_<experiment> macros.
- The high bit of the "Version" field is used to indicate that the
bitstream was produced by an experimental encoder. The decoder will
fail to decode an experimental bitstream without
--enable-experimental.
- A new "vp8x" encoder interface is created to set the experimental
bit.
- The vp8x encoder interface is made the default for ivfenc in
experimental mode.
Change-Id: Idbdd5eae4cec5becf75bb4770837dcd256b2abef
Tests on x86 showed this function costed 2.7% of total decoding time
because of all the memory reads/writes. After modification, it only
costs about 0.7% of decoding time, which gives a 2% gain.
Change-Id: I5003ee30b6dc6dea0bfa42a6ad7e7c22fcc7b215
This is to accommodate output packets for both compressed
data and psnr stats. For each frame, there are at least
one packet for compressed data and one for psnr stats. For
a max lag of 25, 64 is large enough to cover all lagged
frames at the end of encoding.
Change-Id: If20787fbc86f96e1aa16a3ccf2adc93e6c1e3d5f
This renames the vpx_codec/ directory to vpx/, to allow applications
to more consistently reference these includes with the vpx/ prefix.
This allows the includes to be installed in /usr/local/include/vpx
rather than polluting the system includes directory with an
excessive number of includes.
Change-Id: I7b0652a20543d93f38f421c60b0bbccde4d61b4f
Avoid an potential name clashes and match other external types.
s/IMG_FMT/VPX_$&/g
s/img_fmt/vpx_$&/g
Change-Id: Ia7ad5bbb6424416b37e71e5f5eb1eca31c3c707f
This doesn't play well with autotools, and the preprocessor magic is
confusing and unhelpful in the vp8-only context.
Change-Id: I2fcb57e6eb7876ecb58509da608dc21f26077ff1
Visual c++ compiler uses xmm registers for floating point
operations for 64 bit architecture, therefore its calling
convention requires the preservation of xmm6-xmm15 in any
function that have used these registers. However, the sse2
functions, that were originally written for 32 bit windows,
may have used xmm6 and xmm7 without preserving the content.
In this particular case, the compiler used xmm6 to save
the variable "two_pass_min_rate", the value of the variable
is mucked up by our sse2 optimized loop filter functions,
hence the results of release/debug mismatching.