NEON has optimized 16x16 half-pixel variance functions, but they
were not part of the RTCD framework. Add these functions to RTCD,
so that other platforms can make use of this optimization in the
future and special-case ARM code can be removed.
A number of functions were taking two variance functions as
parameters. These functions were changed to take a single
parameter, a pointer to a struct containing all the variance
functions for that block size. This provides additional flexibility
for calling additional variance functions (the half-pixel special
case, for example) and by initializing the table for all block sizes,
we don't have to construct this function pointer table for each
macroblock.
Change-Id: I78289ff36b2715f9a7aa04d5f6fbe3d23acdc29c
Post process option to color the block for either the mode
of the macro block, or the frame that the macro block references.
Change-Id: Ie498175497f2d20e3319924d352dc4ddc16f4134
ARM NEON has a platform specific version of vp8_recon16x16mb, though
it's just a stub to extract the various parameters from the
MACROBLOCKD struct and pass them to vp8_recon16x16mb_neon(). Using
that function's prototype directly will be a better long term solution,
but it's quite an invasive change.
Change-Id: I04273149e2ade34749e2d09e7edb0c396e1dd620
The ARM version of vp8_hex_search() is a faster implementation
of the same algorithm. Since it doesn't use any ARM specific
code, it can be made the default implementation. This removes
a linking error.
Change-Id: I77d10f2c16b2515bff4522c350004e03b7659934
Some of the ARM functions differed from their generic counterparts
only by unrolling their loops. Since this change may be useful
on other platforms, or might even supercede the looped version
in the generic case, move it back to the generic file.
This code is left under #if ARCH_ARM for now, but it may be worth
considering a different (possibly new) conditional for these. If
it turns out that this should be runtime selectable, these
functions will have to move to the RTCD infrastructure. Don't want
to take that step at this time without more profile data.
Change-Id: I4612fdbc606fbebba4971a690fb743ad184ff15f
These functions were true duplicates of functions present in the
generic code. This fixes some of the link errors when building
with --enable-shared --enable-pic.
Change-Id: Idff26599d510d954e439207883607ad6b74df20c
These functions made global references but did not set up the GOT,
causing compilation failures in PIC mode.
Change-Id: Iac473bf46733f87eb2e001cd736af4acf73fa51d
Postproc level that uses Bresenham's line algorithm
to draw motion vectors onto the postproc buffer.
Change-Id: I34c7daa324f2bdfee71e84fcb1c50b90fa06f6fb
cppcheck found a leaked file descriptor in the debugging code
enabled by defining ENTROPY_STATS. Fixes issue #60.
Change-Id: I0c1d0669cb94d44fed77860f97b82763be06b7cb
there were four versions for the regular and
macroblock loopfilters:
horizontal [y|uv]
vertical [y|uv]
this moves all the common code into 2 functions:
vp8_loop_filter_neon
vp8_mbloop_filter_neon
this provides no gain in performance. there's a bit
of jitter, but it trends down ~0.25-0.5%. however,
this is a huge gain maintenance. also, there is the
potential to drop some stack usage in the macroblock
loopfilter.
Change-Id: I91506f07d2f449631ff67ad6f1b3f3be63b81a92
The primary goal is to allow a binary to be built which supports
NEON, but can fall back to non-NEON routines, since some Android
devices do not have NEON, even if they are otherwise ARMv7 (e.g.,
Tegra).
The configure-generated flags HAVE_ARMV7, etc., are used to decide
which versions of each function to build, and when
CONFIG_RUNTIME_CPU_DETECT is enabled, the correct version is chosen
at run time.
In order for this to work, the CFLAGS must be set to something
appropriate (e.g., without -mfpu=neon for ARMv7, and with
appropriate -march and -mcpu for even earlier configurations), or
the native C code will not be able to run.
The ASFLAGS must remain set for the most advanced instruction set
required at build time, since the ARM assembler will refuse to emit
them otherwise.
I have not attempted to make any changes to configure to do this
automatically.
Doing so will probably require the addition of new configure options.
Many of the hooks for RTCD on ARM were already there, but a lot of
the code had bit-rotted, and a good deal of the ARM-specific code
is not integrated into the RTCD structs at all.
I did not try to resolve the latter, merely to add the minimal amount
of protection around them to allow RTCD to work.
Those functions that were called based on an ifdef at the calling
site were expanded to check the RTCD flags at that site, but they
should be added to an RTCD struct somewhere in the future.
The functions invoked with global function pointers still are, but
these should be moved into an RTCD struct for thread safety (I
believe every platform currently supported has atomic pointer
stores, but this is not guaranteed).
The encoder's boolhuff functions did not even have _c and armv7
suffixes, and the correct version was resolved at link time.
The token packing functions did have appropriate suffixes, but the
version was selected with a define, with no associated RTCD struct.
However, for both of these, the only armv7 instruction they actually
used was rbit, and this was completely superfluous, so I reworked
them to avoid it.
The only non-ARMv4 instruction remaining in them is clz, which is
ARMv5 (not even ARMv5TE is required).
Considering that there are no ARM-specific configs which are not at
least ARMv5TE, I did not try to detect these at runtime, and simply
enable them for ARMv5 and above.
Finally, the NEON register saving code was completely non-reentrant,
since it saved the registers to a global, static variable.
I moved the storage for this onto the stack.
A single binary built with this code was tested on an ARM11 (ARMv6)
and a Cortex A8 (ARMv7 w/NEON), for both the encoder and decoder,
and produced identical output, while using the correct accelerated
functions on each.
I did not test on any earlier processors.
Change-Id: I45cbd63a614f4554c3b325c45d46c0806f009eaa
The code was not checking for frame sizes smaller than 3 bytes, and the
partition size checks might have failed if the input buffer was within
16MB of the top of the heap.
In addition, the reference count on the current frame buffer was not
being decremented on error, so after a small number of errors, no new
frame buffer could be found and it would run off the list of them.
Change-Id: I0c60dba6adb1e2a29df39754f72a56ab6c776b46
Most of the code that actually uses these matrices indexes them as
if they were a single contiguous array, and coverity produces
reports about the resulting accesses that overflow the static
bounds of the first row.
This is perfectly legal in C, but converting them to actual [16]
arrays should eliminate the report, and removes a good deal of
extraneous indexing and address operators from the code.
Change-Id: Ibda479e2232b3e51f9edf3b355b8640520fdbf23
Change the pts of the altref frame to be as close as possible to the
pts of the preceding frame and still be strictly increasing.
Change-Id: Iae3033a4c89ae5a9d0e5c4198e9196e5f3ee57c7
The first implementation of the firstpass motion map for motion
compensated temporal filtering created a file, fpmotionmap.stt,
in the current working directory. This was not safe for multiple
encoder instances. This patch merges this data into the first pass
stats packet interface, so that it is handled like the other
(numerical) firstpass stats.
The new stats packet is defined as follows:
Numerical Stats (16 doubles) -- 128 bytes
Motion Map -- 1 byte / Macroblock
Padding -- to align packet to 8 bytes
The fpmotionmap.stt file can still be generated for debugging
purposes in the same way that the textual version of the stats
are available (defining OUTPUT_FPF in firstpass.c)
Change-Id: I083ffbfd95e7d6a42bb4039ba0e81f678c8183ca
x86-64 passes most arguments in registers. There is no need to
push them to the stack before using them.
Change-Id: I13c683f1358782682ecafaf1df3fb0af23b978ea
This rewriting reflects changes made in commit "Improve the
accuracy of forward walsh-hadamard transform". Since this function
is not called much, only a small encoder performance gain (~0.5% )
is seen.
Change-Id: Ie9df58a43028a11fd5b115c4bbe3141f7596578b
Instead of doing 8-bit data unpack and 16-bit subtraction, use
psubb to do 16 8-bit subtractions and pcmpgtb to preserve the
sign information. This does not bring noticable gain since
these functions are not called frequently.
Change-Id: I90a0dfaa3db9d422e4ada324076596ffb178548e
generic version got fixed, but not the arm version. fixes:
vp8/encoder/arm/mcomp_arm.c: In function 'vp8_full_search_sadx3':
vp8/encoder/arm/mcomp_arm.c:1208: warning: pointer targets in passing
argument 5 of 'fn_ptr->sdx3f' differ in signedness
vp8/encoder/arm/mcomp_arm.c:1208: note: expected 'unsigned int *' but
argument is of type 'int *'
and another unsigned change to keep the files similar
Change-Id: I1b6255dc3a03b90394a791ee0d15d8167d9454db
vp8_diamond_search_sadx4 isn't used in arm because there is no
corrosponding sdx4df as in x86. rather than keep it in sync with
../mcomp.c, delete it
vp8_hex_search had the original, more readable/understandable code if`d
out. it's also available in ../mcomp.c, so remove the dead copy
Change-Id: Ia42aa6e23b3a2e88040f467280befec091ec080e
when a subsequent frame is encoded as an alt reference frame, it is
unlikely that any mb in current frame will be used as reference for
future frames, so we can enable quantization optimization even when
the RD constant is slightly rate-biased. The change has an overall
benefit between 0.1% to 0.2% bit savings on the test sets based on
vpxssim scores.
Change-Id: I9aa7bc5cd573ea84e3ee655d2834c18c4460ceea
../libvpx/vp8/encoder/bitstream.c: In function ‘pack_inter_mode_mvs’:
../libvpx/vp8/encoder/bitstream.c:1026: warning: array subscript has type ‘char’
Change-Id: Ic77491e0a172fa1821e5b3e914d0dc41fe87c00f
In order to know if all 4/8 neighbor points are within the bounds,
4 bounds checking are enough instead of checking 4 bounds for
each points (16/32 checkings). This improvement reduces cost of
vp8_diamond_search_sadx4() by 30%, and gives encoder a 1.5%
performance gain (test options: 1 pass, good, speed=4).
Change-Id: Ie8da29d18a6ecfc9829e74ac02f6fa70e042331a
This patch moves the scattered updates to the mb skip state
(mode_info_context->mbmi.mb_skip_coeff) to vp8_tokenize_mb. Recent
changes to the quantizer exposed a bug where if a macroblock
could be coded as a skip but isn't, the encoder would run the
loopfilter but the decoder wouldn't, causing a reference buffer
mismatch.
The loopfilter is controlled by a flag called dc_diff. The decoder
looks at the number of decoded coefficients when setting this flag.
The encoder sets this flag based on the skip state, since any
skippable macroblock should be transmitted as a skip. The coefficient
optimization pass (vp8_optimize_b()) could change the coefficients
such that a block that was not a skip becomes one. The encoder was
not updating the skip state in this situation for intra coded blocks.
The underlying issue predates it, but this bug was recently triggered
by enabling trellis quantization on the Y2 block in commit dcd29e3,
and by changing the quantizer range control in commit 305be4e.
Change-Id: I5cce5da0dbc2d22f7d79ee48149f01e868a64802
This uses MB variance to change the RDO weight for mode decision
and quantization.
Activity is normalized against the average for the frame, which is
currently tracked using feed-forward statistics.
This could also be used to adjust the quantizer for the entire
frame, but that requires more extensive rate control changes.
This does not yet attempt to adapt the quantizer within the frame,
but the signaling cost means that will likely only be useful at
very high rates.
Change-Id: I26cd7c755cac3ff33cfe0688b1da50b2b87b9c93
These functions should never change their input, and there's no
reason not to declare that.
This allows them to be passed static const data.
Change-Id: Ia49fe4b01e80e9afcb24b4844817694d4da5995c
There is currently no inexact version of this function, so do not
even compile it without EXACT_QUANT.
This will prevent someone from inadvertently trying to use it without
the proper EXACT_QUANT setup.
Change-Id: Ia13491e0128afb281c05c9222ee5987101e4010d
This is just eliminating some cruft.
Although a number of variables are declared only when INTRARDOPT
is defined, they are used elsewhere without that protection, and
no longer just for intra RDO.
The intra_rd_opt flag was hard-coded to 1 and never checked.
Change-Id: I83a81554ecee8053e7b4ccd8aa04e18fa60f8e4f
Moved vp8_fast_quantize_b_sse from quantize_mmx.asm into
quantize_sse2.asm and renamed. Updated the assembly code to
match the C version.
Change-Id: I1766d9e1ca60e173f65badc0ca0c160c2b51b200
As the zbin and rounding constants are normalized, rounding effectively
does the zbinning, therefore the zbin operation can be removed. In
addition, the memset on the two arrays are no longer necessary.
Change-Id: If39c353c42d7e052296cb65322e5218810b5cc4c
Filed for nasm as:
https://sourceforge.net/tracker/?func=detail&atid=106208&aid=3081103&group_id=6208
nasm just does not accept any size parameter for movhps:
1.asm:2: error: mismatch in operand sizes
Some parts of libvpx already use MMWORD for movhps and MMWORD is
defined-out so it is compatible both with yasm and nasm.
Provide nasm compatibility. No binary change by this patch with yasm on
{x86_64,i686}-fedora13-linux-gnu.
Change-Id: I4008a317ca87ec07c9ada958fcdc10a0cb589bbc
nasm does not support `label wrt rip', it requires `rel label'. It is
still fully compatible with yasm.
Provide nasm compatibility. No binary change by this patch with yasm on
{x86_64,i686}-fedora13-linux-gnu. Few longer opcodes with nasm on
{x86_64,i686}-fedora13-linux-gnu have been checked as safe.
Change-Id: I488773a4e930a56e43b0cc72d867ee5291215f50
nasm requires the instruction length (movd/movq) to match to its
parameters. I find it more clear to really use 64bit instructions when
we use 64bit registers in the assembly.
Provide nasm compatibility. No binary change by this patch with yasm on
{x86_64,i686}-fedora13-linux-gnu. Few longer opcodes with nasm on
{x86_64,i686}-fedora13-linux-gnu have been checked as safe.
Change-Id: Id9b1a5cdfb1bc05697e523c317a296df43d42a91
Raised by Lei Yang, the Y plane stride was used for UV blocks.
This is clearly a typo. But as the comments in the code suggested
that this port of code has not been used yet, so the typo should
not have created any damage yet.
Change-Id: Iea895edc17469a51c803a8cc6d0fce65a1a7fc2f
This code adjust the impact of the amount and speed of motion
on GF and KF boost.
Sections with lots of slow motion will tend to have a
somewhat bigger boost and sections with fast motion may
have less.
There is a knock on effect to the selection of the active
quantizer range.
This will likely require further tuning but helps with a couple
of particularly bad edge cases.
Change-Id: Ic2449cda7305672b69acf42fc0a845b77ac98d40
Experimented with different value for Y2_RD_MULT ranging f[1, 32],
without adapting the value to MB coding mode/frame type/Q value,
4 works out best among all values, providing overall 0.1% coding
gain on the test set.
Change-Id: I6b2583a8aa5db5e7e5c65c646301909c0c58f876
If temporal filtering is enabled but a filter type is not specified
centered filter mode is used by default.
Change-Id: I87306f267c1390074c806c506a69b4ba914d92a2
Modified the range checking of parameters used in the
AltRef temporal filter (arnr-max-frames, arnr-strength,
arnr-type) and default values for each of them.
Change-Id: Ib261028d501b9523f6e44cb4790cc52167b6e92b
This function graduated from being a test func to something that's on
by default. Rename it and remove some spurious comments that confuse
its status.
Change-Id: I689695a3ad29c35e9a72a43ec93766733ac6c20b
Loopfilter deltas are initialized to zero on keyframes in the decoder.
The values then persist from the previous frame unless an update bit
is set in the bitstream. This data is not included in the entropy
data saved by the 'refresh entropy' bit in the bitstream, so it is
effectively an additional contextual element beyond the 3 ref-frames
and the entropy data.
The encoder was treating this delta update bit as update-if-nonzero,
meaning that the value would be refreshed even if it hadn't changed,
and more significantly, if the correct value for the delta changed
to zero, the update wouldn't be sent, and the decoder would preserve
the last (presumably non-zero) value.
This patch updates the encoder to send an update only if the value
has changed from the previously transmitted value. It also forces the
value to be transmitted in error resilient mode, to account for lost
context in the event of lost frames.
Change-Id: I56671d5b42965d0166ac226765dbfce3e5301868
Create look up tables for controlling the active quantizer range.
Some initial tuning to improve quality circa 0.5% on test set.
Clean up of some stats output code
Change-Id: Ia698a8525f8b8129a503cadace3ee73fe888f543
- Scheduling for Atom processors
- Combining of macros to allow for better interleaving
- Change from multiplies to adds for main filter
- Use of movhps/movlps to fill xmm registers without
shifting and orring
Change-Id: I0b3500a5f58abf7085253ec92d64c8a96723040b
Modified AltRef temporal filter to adapt filter length based
on macroblock coding modes selected during first-pass
encode.
Also added sub-pixel motion compensation to the AltRef
filter.
The existing code applied a 6-tap filter with 0's on either end.
We're already paying the branch penalty to avoid computing the two
extra columns needed as input to this filter.
We might as well save time computing the filter as well.
This reduces the inner loop from 21 instructions to 16, the number
of loads per iteration from 4 to 1, and the number of multiplies
from 7 to 4.
The gain in overall decoding performance, however, is small (less
than 1%).
This change also means we now valgrind clean on ARMv6, which is
its real purpose.
The errors reported here were valgrind's fault (it does not detect
that 0 times an uninitialized value is initialized), but Julian
Seward says it would slow down valgrind considerably to make such
checks.
Speeding up libvpx rather, even by a small amount, seems a much
better idea if only to enable proper valgrind checking of the
rest of the codec.
Change-Id: Ifb376ea195e086b60f61daf1097d8910c4d8ff16
This function was accessing values below the stack pointer, which
can be corrupted by signal delivery at any time.
Change-Id: I92945b30817562eb0340f289e74c108da72aeaca
previous implementation compared each set of values to limit and then
&'d them together, requiring a compare and & for each value.
this does the accumulation first, requiring only one compare
Change-Id: Ia5e3a1a50e47699c88470b8c41964f92a0dc1323
This patch avoids compiling some debugging code in onyx_if.c. The most
significant fix is to avoid generating code for vp8_write_yuv_frame,
which is never called. Some other code was removed by the dead code
elimination performed by the compiler, and this patch does it with the
preprocessor instead. There are advantages both ways.
Change-Id: I044fd43179d2e947553f0d6f2cad5b40907ac458
reconintra_mt.c is only required for building the decoder right now.
It could definitely be used for the encoder in the future, but it
currently depends on decoder only data structures. (onyxd_int.h,
VP8D_COMP, etc). Move it from common/ to decoder/ until the
necessary changes to the common multithread code are complete.
This patch is needed to build with --disable-vp8-decoder.
Change-Id: I568c52221a2b309234d269675cba97131ce35c86
Having these symbols be available as functions rather than data is
occasionally more convenient. Implemented this way rather than a
get-codec-by-id style to avoid creating a link-time dependency
between the encoder and the decoder.
Fixes issue #169
Change-Id: I319f281277033a5e7e3ee3b092b9a87cce2f463d
The MV decoding changes in c5fb0eb introduced a bug where the
macroblock clamping state was reset for each partition, so if an
earlier partition needed clamping but a subsequent one didn't,
the MB wouldn't receive clamping. Instead, the state is only
set during splitmv decoding, never cleared.
Change-Id: I224fe258493405ee0f6a04596acdb622c475e845
Movdqu is more expensive (throughput, uops) than movq. Minimal
impact for newer big cores, but ~2.25% gain on Atom.
Change-Id: I62c80bb1cc01d8a91c350c4c7719462809a4ef7f
Use pmaxub instead of a combination of psubusb/por to
determine if any comparisons go over the limit.
Change-Id: I3f0bd7d2aabe5fee9ba6620508e2b60605abcb82
The patch related with issue #55 (5a72620) fixed some warnings, but the
fix was not optimal. It actually was a trick to confuse compiler rather
than a fix.
This patch fixes it by creating a new macro used when needed just a high
limit check for an unsigned.
Change-Id: I94b322e0f7fb07604b3b1df1f9321185f48cfcb5
the previous commit laid the groundwork by doing two sets of idcts
together. this moved that further by grouping the interesting data
(q[0], q+16[0]) together to allow using wider instructions. also
managed to drop a few instructions by recognizing that the constant
for sinpi8sqrt2 could be downshifted all the time which avoided a
dowshift as well as workarounds for a function which only accepted
signed data
looks like a modest gain for performance: at qcif, went from ~180
fps to ~183
Change-Id: I842673f3080b8239e026cc9b50346dbccbab4adf
On each MB, loopfiltering is done right after MB decoding. This
combines two loops in multi-threaded code into one, which reduces
number of synchronizations to half.
The above-row/left-col data are saved in temp buffers for
next-row/next MB decoding.
Tests on 4-core gLucid machine showed 10% decoder performance
gain with threads=4 (tulip clip). Testing on other platforms
isn't done yet.
Change-Id: Id18ea7c1e84965dabea65d4c01ca5bc056ddeac9
This patch reduces the size of the global tables maintained by the
tokenizer to 16k from 80k-96k. See issue #177.
Change-Id: If0275d5f28389af11ac83c5d929d1157cde90fbe
There is no need to make sure that the lower byte of the
register is 0 because the downshift by 11 overwrites that byte.
Change-Id: I89cbf004b2ff532a2c68e0dc399c45a49cdad5a1
Sequentially accessing memory from a low address to a high
address should make it easier for the processor to predict
the cache.
Change-Id: I1921ce996bdd547144fe864fea6435f527f5842d
Improved the subset block search and fill. (about 3% improvement for
32 bit) Modified/merged the code in order to create
vp8_read_mb_modes_mv which can decode the modes/mvs on a macroblock
level. This will allow the decode loop (in the future) to decode
modes/mvs on a frame, row, or mb level.
Change-Id: If637d994b508792f846d39b5d44a7bf9aa5cddf3
Expand 93c32a55 which used SSE2 instructions to do two
idct/dequant/recons at a time to NEON. Initial working
commit. More work needs to be put into rearranging and
interlacing the data to take advantage of quadword
operations, which is when we'll hopefully see a much
better boost
Change-Id: I86d59d96f15e0d0f9710253e2c098ac2ff2865d1
When ARFs are enabled in non-lagged compress modes, the GF interval
was being reset to zero. Non-lagged ARF updates were enabled in commit
63ccfbd, but this incorrect GF interval caused a quality regression.
Change-Id: I615c3b493f4ce2127044f4e68d0bcb07d6b730c3
Changes 'The VP8 project' to 'The WebM project', for consistency
with other webmproject.org repositories.
Fixes issue #97.
Change-Id: I37c13ed5fbdb9d334ceef71c6350e9febed9bbba
vp8_get_compressed_data() was defeating logic in
encode_frame_to_datarate() that determined the reference buffers to
search and forcing all frames to be eligible to search. In cases
where buffers have identical contents, this is unnecessary extra
work.
Change-Id: I9e667ac39128ae32dc455a3db4c62e3efce6f114
ARFs were explicitly disabled except in lagged compress mode. New
ARF logic allows for the ARF buffer to hold an older golden frame,
which does not require lagged compress.
Change-Id: I1dff82b6f53e8311f1e0514b1794ae05919d5f79
Moved partition_bmi and partition_count out of MB_MODE_INFO and
placed into MACROBLOCK. Also reduced the size of other members
of the MB_MODE_INFO struct. For 1080p, the memory was reduced
by 1,209,516 bytes. The decoder performance appeared to improve
by 3% for the clip used.
Note: The main goal for this change is to improve the decoder
performance. The encoder will be revisited at a later date for
further structure cleanup.
Change-Id: I4733621292ee9cc3fffa4046cb3fd4d99bd14613
Remove the dependency on postproc.c for the encoder in general, the only
unchecked need for it is when CONFIG_PSNR is enabled. All other cases
are already wrapped in CONFIG_POSTPROC. In the CONFIG_PSNR case the file
will still be included.
Additionally, when VP8_SET_POSTPROC is used with the encoder when post
processing has been disabled an error will be returned.
This addresses issue #153.
Change-Id: Ia6dfe20167f7077734a6058cbd1d794550346089
There was an extremely rare deadlock that happened when one thread
was waiting to start the loop filter on frame n while the other
threads were starting to work on frame n+1.
Change-Id: Icc94f728b3b6663405435640d9a2996735ba19ef
These changes improve the behaviour of the code with
forced key frames sent in by a calling application.
The sizing of the frames is still suboptimal for two pass in
particular but the behaviour is much better than it was.
Change-Id: I35fae610c67688ccc69d11f385e87dfc884e65a1
The main reason for the change was to reduce cycles in the token
decoder. (~1.5% gain for 32 bit) This layout should be more
cache friendly.
As a result of this change, the encoder had to be updated.
Change-Id: Id5e804169d8889da0378b3a519ac04dabd28c837
Note: dixie uses a similar layout
The memory being zeroed in vp8_update_mode_info_border() was just
allocated with calloc, and so the entire function is actually
redundant, but it should be made correct in case someone expects
it to actually work in the future.
Change-Id: If7a84e489157ab34ab77ec6e2fe034fb71cf8c79
did a test compile with clang and got rid of some warnings that have
been annoying me for a while:
vp8/decoder/detokenize.c: In function 'vp8_init_detokenizer':
vp8/decoder/detokenize.c:121: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:122: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:123: warning: assignment from incompatible pointer type
vp8/decoder/detokenize.c:124: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:125: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:128: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:129: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:130: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:131: warning: assignment discards qualifiers from pointer target type
Change-Id: I78ddab176fe47cbeed30379709dc7bab01c0c2e4
Moving the eob structure allows for a non-struct based
function to handle decoding an entire mb of
idct/dequant/recon data. This allows for SIMD functions
to idct/dequant/recon multiple blocks at once.
SSE2 implementation gives 3% gain on Atom.
Change-Id: I8a8f3efd546ea4e0535f517d94f347cfb737c9c2
The external API exposes the RC initial/optimal/full buffer level in
milliseconds, but this value was truncated internally to seconds. This
patch allows the use of the full precision during the conversion from
time to bits.
Change-Id: If8dd2a87614c05747f81432cbe75dd9e6ed2f04e
move some things around, reorder some instructions
constant 0 is used several times. load it once per call in horiz,
once per loop in vert.
separate saturating instructions to avoid stalls.
just use one usub8 call to set GE flags, rather than uqsub8 followed by
usub8 w/ 0
document some stalls for further consideration
Change-Id: Ic3877e0ddbe314bb8a17fd5db73501a7d64570ec
test cases were causing a crash because the count was being read
incorrectly. after fixing that, noticed that the output was not
matching. fixed that.
Change-Id: Idb0edb887736bd566a3cf6d4aa1a03ea8d20eb27
The C version of the dequant/idct/add function depends on the C
version of the IDCT, but this isn't compiled in on ARM. Since this
code has asm version, we can just remove this file to eliminate the
link error.
Change-Id: I21de74d89d3765a1db2da27292b20727c53178e9
vp8_update_gf_useage_maps() is only used by the encoder. This patch
fixes the ability to build in decode-only or encode-only
configurations.
Change-Id: I3a5211428e539886ba998e09e8abd747ac55c9aa
adds a compile time option: --enable-arm-asm-detok which pulls in
vp8/decoder/arm/detokenize.asm
currently about break even speed wise, but changes are pending to
the fill code (branch and load 3 bytes versus conditionally always
load one) and the error handling. Currently it doesn't handle zero
runs or overrunning the buffer.
this is really just so i don't have to rebase my changes all the
time to run benchmarks - now just need to replace one file!
Change-Id: I56d0e2354dc0ca3811bffd0e88fe1f952fa6c797
asm_offsets contains some definitions which are no longer used. this
was one of them. v6 build works now
Change-Id: If370cfa8acd145de4fead2d9a11b048fccc090df
These copies occurred for each macroblock in the encoder and decoder.
Thetemp MB_MODE_INFO mbmi was removed from MACROBLOCKD. As a result,
a large number compile errors had to be fixed.
Change-Id: I4cf0ffae3ce244f6db04a4c217d52dd256382cf3
The mv_ref and sub_mv_ref token encodings are indexed from NEARESTMV
and LEFT4X4, respectively, rather than being zero-based like the
other token encodings.
Change-Id: I3699c3f84111209ecfb91097c4b900773e9a3ad5
This patch changes a few numbers in the two constant arrays
for quantizer's zerobin and rounding factors, in general to
make the sum of the two factors for any Q to be 128. While
it might be beneficial to calibrate the two arrays for best
quantizer performance, it is not the purpose of this patch.
Normalizing the two arrays will enable quick optimization
of the current faster quantizer, i.e .zerobin check can be
removed.
Change-Id: If9abfd7929bf4b8e9ecd64a79d817c6728c820bd
Replace the exponential search for optimal rounding during
quantization with a linear Viterbi trellis and enable it
by default when using --best.
Right now this operates on top of the output of the adaptive
zero-bin quantizer in vp8_regular_quantize_b() and gives a small
gain.
It can be tested as a replacement for that quantizer by
enabling the call to vp8_strict_quantize_b(), which uses
normal rounding and no zero bin offset.
Ultimately, the quantizer will have to become a function of lambda
in order to take advantage of activity masking, since there is
limited ability to change the quantization factor itself.
However, currently vp8_strict_quantize_b() plus the trellis
quantizer (which is lambda-dependent) loses to
vp8_regular_quantize_b() alone (which is not) on my test clip.
Patch Set 3:
Fix an issue related to the cost evaluation of successor
states when a coefficient is reduced to zero. With this
issue fixed, now the trellis search almost exactly matches
the exponential search.
Patch Set 2:
Overall, the goal of this patch set is to make "trellis"
search to produce encodings that match the exponential
search version. There are three main differences between
Patch Set 2 and 1:
a. Patch set 1 did not properly account for the scale of
2nd order error, so patch set 2 disable it all together
for 2nd blocks.
b. Patch set 1 was not consistent on when to enable the
the quantization optimization. Patch set 2 restore the
condition to be consistent.
c. Patch set 1 checks quantized level L-1, and L for any
input coefficient was quantized to L. Patch set 2 limits
the candidate coefficient to those that were rounded up
to L. It is worth noting here that a strategy to check
L and L+1 for coefficients that were truncated down to L
might work.
(a and b get trellis quant to basically match the exponential
search on all mid/low rate encodings on cif set, without
a, b, trellis quant can hurt the psnr by 0.2 to .3db at
200kbps for some cif clips)
(c gets trellis quant to match the exponential search
to match at Q0 encoding, without c, trellis quant can be
1.5 to 2db lower for encodings with fixed Q at 0 on most
derf cif clips)
Change-Id: Ib1a043b665d75fbf00cb0257b7c18e90eebab95e
This is the first modification of VP8 multi-thread decoder, which uses
same threads to decode macroblocks and then do loopfiltering for each
frame.
Inspired by Rob Clark, synchronization was done on every 8 macroblocks
instead of every macroblock to reduce lock contention.
Comparing with the original code, this implementation gave about 15%-
20% performance gain while decoding my test clips on a Core2 Quad
platform (Linux).
The work is not done yet.
Test on other platforms are needed.
Change-Id: Ice9ddb0b511af1359b9f71e65066143c04fef3b5
Clang defaults to C99 mode, and inline works differently in C99.
(gcc, on the other hand, defaults to a special gnu-style inlining,
which uses different syntax.) Making the functions static makes sure
clang doesn't decide to discard a function because it's too large to
inline.
Thanks to eli.friedman for the patch.
Fixes http://code.google.com/p/webm/issues/detail?id=114
Change-Id: If3c1c3c176eb855a584a60007237283b0cc631a4
global label:data
^^
Provide nasm compatibility. No binary change by this patch with yasm
on {x86_64,i686}-fedora13-linux-gnu. Few longer opcodes with nasm on
{x86_64,i686}-fedora13-linux-gnu have been checked as safe.
Change-Id: I10f17eb1e4d4a718d4ebd1d0ccddc807c365e021
Labels should end by colon (':'), nasm requires it.
Provide nasm compatibility. No binary change by this patch with yasm
on {x86_64,i686}-fedora13-linux-gnu. Few longer opcodes with nasm on
{x86_64,i686}-fedora13-linux-gnu have been checked as safe.
Change-Id: I0b2ec6f01afb061d92841887affb5ca0084f936f
nasm knows only OWORD. yasm knows both OWORD and DQWORD.
Provide nasm compatibility. No binary change by this patch with yasm on
{x86_64,i686}-fedora13-linux-gnu. Few longer opcodes with nasm on
{x86_64,i686}-fedora13-linux-gnu have been checked as safe.
Change-Id: I62151390089e90df9a7667822fa594ac20b00e78
Removed the global variables vp8_an and vp8_cd. vp8_an was causing problems
because it was increasing the .bss by 1572864 bytes.
Change-Id: I6c12e294133c7fb6e770c0e4536d8287a5720a87
To facilitate more testing related to quantizer and rate
control, the old version quantizer is added back. old and
new quantizer can be switched back and forth by define or
un-define the macro "EXACT_QUANT".
Change-Id: Ia77e687622421550f10e9d65a9884128a79a65ff
follow up to Change I0e51492d: neon: disable asm quantizer
Now x86 doesn't segfault with --disable-runtime-cpu-detect and -p=2
Change-Id: I8ca127bb299198efebbcbd5a661e81788361933f
The assembly version of the quantizer has not been updated to match
the new exact quantizer introduced in commit e04e2935. That commit tried
to disable this code but missed the non-RTCD case.
Thanks to David Baker <david.baker at openmarket.com> for isolating the
issue and testing this fix.
Change-Id: I0e51492dc6f8e44d2c10b587427448bf94135c65
Jeff Muizelaar posted some changes to the idct/reconstruction c code.
This is the equivalent update for the arm assembly.
This shows a good boost on v6, and a minor boost on neon.
Here are some numbers for highway in qcif, 2641 frames:
HEAD neon: ~161 fps
new neon: ~162 fps
HEAD v6: ~102 fps
new v6: ~106 fps
The following functions have been updated for armv6 and neon:
vp8_dc_only_idct_add
vp8_dequant_idct_add
vp8_dequant_dc_idct_add
Conflicts:
vp8/decoder/arm/armv6/dequantdcidct_v6.asm
vp8/decoder/arm/armv6/dequantidct_v6.asm
Resolved by removing these files. When I rewrote the functions, I also
moved the files to dequant_dc_idct_v6.asm/dequant_idct_v6.asm
Change-Id: Ie3300df824d52474eca1a5134cf22d8b7809a5d4
This moves the prediction step before the idct and combines the idct and
reconstruction steps into a single step. Combining them seems to give an
overall decoder performance improvement of about 1%.
Change-Id: I90d8b167ec70d79c7ba2ee484106a78b3d16e318
At the end of the decode, frame buffers were being copied.
The frames are not updated after the copy, they are just
for reference on later frames. This change allows multiple
references to the same frame buffer instead of copying it.
Changes needed to be made to the encoder to handle this. The
encoder is still doing frame buffer copies in similar places
where pointer reference could be done.
Change-Id: I7c38be4d23979cc49b5f17241ca3a78703803e66
In two pass encodes, the calculation of the number of bits
allocated to a KF group had the potential to overflow for high data
rates if the interval is very long.
We observed the problem in one test clip where there was one
section where there was an 8000 frame gap between key frames.
Change-Id: Ic48eb86271775d7573b4afd166b567b64f25b787
This replaces the approximate division-by-multiplication in the
quantizer with an exact one that costs just one add and one
shift extra.
The asm versions have not been updated in this patch, and thus
have been disabled, since the new method requires different
multipliers which are not compatible with the old method.
Change-Id: I53ac887af0f969d906e464c88b1f4be69c6b1206
These files were out of date and no longer maintained.
Token decoding has implemented the no-crash code which
is incompatible with this arm assembly code.
Change-Id: Ibf729886c56fca48181af60b44bda896c30023fc