the previous commit laid the groundwork by doing two sets of idcts
together. this moved that further by grouping the interesting data
(q[0], q+16[0]) together to allow using wider instructions. also
managed to drop a few instructions by recognizing that the constant
for sinpi8sqrt2 could be downshifted all the time which avoided a
dowshift as well as workarounds for a function which only accepted
signed data
looks like a modest gain for performance: at qcif, went from ~180
fps to ~183
Change-Id: I842673f3080b8239e026cc9b50346dbccbab4adf
On each MB, loopfiltering is done right after MB decoding. This
combines two loops in multi-threaded code into one, which reduces
number of synchronizations to half.
The above-row/left-col data are saved in temp buffers for
next-row/next MB decoding.
Tests on 4-core gLucid machine showed 10% decoder performance
gain with threads=4 (tulip clip). Testing on other platforms
isn't done yet.
Change-Id: Id18ea7c1e84965dabea65d4c01ca5bc056ddeac9
Improved the subset block search and fill. (about 3% improvement for
32 bit) Modified/merged the code in order to create
vp8_read_mb_modes_mv which can decode the modes/mvs on a macroblock
level. This will allow the decode loop (in the future) to decode
modes/mvs on a frame, row, or mb level.
Change-Id: If637d994b508792f846d39b5d44a7bf9aa5cddf3
Expand 93c32a55 which used SSE2 instructions to do two
idct/dequant/recons at a time to NEON. Initial working
commit. More work needs to be put into rearranging and
interlacing the data to take advantage of quadword
operations, which is when we'll hopefully see a much
better boost
Change-Id: I86d59d96f15e0d0f9710253e2c098ac2ff2865d1
Changes 'The VP8 project' to 'The WebM project', for consistency
with other webmproject.org repositories.
Fixes issue #97.
Change-Id: I37c13ed5fbdb9d334ceef71c6350e9febed9bbba
Moved partition_bmi and partition_count out of MB_MODE_INFO and
placed into MACROBLOCK. Also reduced the size of other members
of the MB_MODE_INFO struct. For 1080p, the memory was reduced
by 1,209,516 bytes. The decoder performance appeared to improve
by 3% for the clip used.
Note: The main goal for this change is to improve the decoder
performance. The encoder will be revisited at a later date for
further structure cleanup.
Change-Id: I4733621292ee9cc3fffa4046cb3fd4d99bd14613
There was an extremely rare deadlock that happened when one thread
was waiting to start the loop filter on frame n while the other
threads were starting to work on frame n+1.
Change-Id: Icc94f728b3b6663405435640d9a2996735ba19ef
The main reason for the change was to reduce cycles in the token
decoder. (~1.5% gain for 32 bit) This layout should be more
cache friendly.
As a result of this change, the encoder had to be updated.
Change-Id: Id5e804169d8889da0378b3a519ac04dabd28c837
Note: dixie uses a similar layout
did a test compile with clang and got rid of some warnings that have
been annoying me for a while:
vp8/decoder/detokenize.c: In function 'vp8_init_detokenizer':
vp8/decoder/detokenize.c:121: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:122: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:123: warning: assignment from incompatible pointer type
vp8/decoder/detokenize.c:124: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:125: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:128: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:129: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:130: warning: assignment discards qualifiers from pointer target type
vp8/decoder/detokenize.c:131: warning: assignment discards qualifiers from pointer target type
Change-Id: I78ddab176fe47cbeed30379709dc7bab01c0c2e4
Moving the eob structure allows for a non-struct based
function to handle decoding an entire mb of
idct/dequant/recon data. This allows for SIMD functions
to idct/dequant/recon multiple blocks at once.
SSE2 implementation gives 3% gain on Atom.
Change-Id: I8a8f3efd546ea4e0535f517d94f347cfb737c9c2
adds a compile time option: --enable-arm-asm-detok which pulls in
vp8/decoder/arm/detokenize.asm
currently about break even speed wise, but changes are pending to
the fill code (branch and load 3 bytes versus conditionally always
load one) and the error handling. Currently it doesn't handle zero
runs or overrunning the buffer.
this is really just so i don't have to rebase my changes all the
time to run benchmarks - now just need to replace one file!
Change-Id: I56d0e2354dc0ca3811bffd0e88fe1f952fa6c797
These copies occurred for each macroblock in the encoder and decoder.
Thetemp MB_MODE_INFO mbmi was removed from MACROBLOCKD. As a result,
a large number compile errors had to be fixed.
Change-Id: I4cf0ffae3ce244f6db04a4c217d52dd256382cf3
This is the first modification of VP8 multi-thread decoder, which uses
same threads to decode macroblocks and then do loopfiltering for each
frame.
Inspired by Rob Clark, synchronization was done on every 8 macroblocks
instead of every macroblock to reduce lock contention.
Comparing with the original code, this implementation gave about 15%-
20% performance gain while decoding my test clips on a Core2 Quad
platform (Linux).
The work is not done yet.
Test on other platforms are needed.
Change-Id: Ice9ddb0b511af1359b9f71e65066143c04fef3b5
Jeff Muizelaar posted some changes to the idct/reconstruction c code.
This is the equivalent update for the arm assembly.
This shows a good boost on v6, and a minor boost on neon.
Here are some numbers for highway in qcif, 2641 frames:
HEAD neon: ~161 fps
new neon: ~162 fps
HEAD v6: ~102 fps
new v6: ~106 fps
The following functions have been updated for armv6 and neon:
vp8_dc_only_idct_add
vp8_dequant_idct_add
vp8_dequant_dc_idct_add
Conflicts:
vp8/decoder/arm/armv6/dequantdcidct_v6.asm
vp8/decoder/arm/armv6/dequantidct_v6.asm
Resolved by removing these files. When I rewrote the functions, I also
moved the files to dequant_dc_idct_v6.asm/dequant_idct_v6.asm
Change-Id: Ie3300df824d52474eca1a5134cf22d8b7809a5d4
This moves the prediction step before the idct and combines the idct and
reconstruction steps into a single step. Combining them seems to give an
overall decoder performance improvement of about 1%.
Change-Id: I90d8b167ec70d79c7ba2ee484106a78b3d16e318
At the end of the decode, frame buffers were being copied.
The frames are not updated after the copy, they are just
for reference on later frames. This change allows multiple
references to the same frame buffer instead of copying it.
Changes needed to be made to the encoder to handle this. The
encoder is still doing frame buffer copies in similar places
where pointer reference could be done.
Change-Id: I7c38be4d23979cc49b5f17241ca3a78703803e66
These files were out of date and no longer maintained.
Token decoding has implemented the no-crash code which
is incompatible with this arm assembly code.
Change-Id: Ibf729886c56fca48181af60b44bda896c30023fc
Change I9fd1a5a4 updated the multithreaded loopfilter to avoid
reinitializing several parameteres if they haven't changed from the
last frame, but the code to update the last frame's parameters wasn't
invoked in the multithreaded case.
Change-Id: Ia23d937af625c01dd739608e02d110f742b7e1f2
When the license headers were updated, they accidentally contained
trailing whitespace, so unfortunately we have to touch all the files
again.
Change-Id: I236c05fade06589e417179c0444cb39b09e4200d
Change bitreading functions to use a larger window which is refilled less
often.
This makes it cheap enough to do bounds checking each time the window is
refilled, which avoids the need to copy the input into a large circular
buffer.
This uses less memory and speeds up the total decode time by 1.6% on an ARM11,
2.8% on a Cortex A8, and 2.2% on x86-32, but less than 1% on x86-64.
Inlining vp8dx_bool_decoder_fill() has a big penalty on x86-32, as does moving
the refill loop to the front of vp8dx_decode_bool().
However, having the refill loop between computation of the split values and
the branch in vp8_decode_mb_tokens() is a big win on ARM (presumably due to
memory latency and code size: refilling after normalization duplicates the
code in the DECODE_AND_BRANCH_IF_ZERO and DECODE_AND_LOOP_IF_ZERO cases.
Unfortunately, refilling at the end of vp8dx_bool_decoder_fill() and at the
beginning of each decode step in vp8_decode_mb_tokens() means the latter
requires an extra refill at the end.
Platform-specific versions could avoid the problem, but would require most of
detokenize.c to be duplicated.
Change-Id: I16c782a63376f2a15b78f8086d899b987204c1c7
No good reason to be tricky here. I don't know why 'break' occurred to me
as the natrual replacement for the 'return', but an if/else block is
definitely clearer.
Change-Id: I08a336307afeb0dc7efa494b37398f239f66c2cf
The new scheme introduced in I68d35a2f did not clamp chroma MVs in the SPLITMV
case, and clamped them incorrectly (to the luma plane bounds) in every other
case.
Because chroma MVs are computed from the luma MVs before clamping occurs, they
could still point outside of the frame buffer and cause crashes.
This clamping happens outside of the MV prediction loop, and so should not
affect bitstream decoding.
This patch removes the secondary MV clamping from the MV decoder. This
behavior was consistent with limits placed on non-split MVs by the
reference encoder, but was inconsistent with the MVs generated in the
split case.
The purpose of this secondary clamping was only to prevent crashes on
invalid data. It was not intended to be a behaviour an encoder could or
should rely on. Instead of doing additional clamping in a way that
changes the entropy context, the secondary clamp is removed and the
border handling is made implmentation specific. With respect to the
spec, the border is treated as essentially infinite, limited only by
the clamping performed on the near/nearest reference and the maximum
encodable magnitude of the residual MV.
This does not affect any currently produced streams.
Change-Id: I68d35a2fbb51570d6569eab4ad233961405230a3
Tests on x86 showed this function costed 2.7% of total decoding time
because of all the memory reads/writes. After modification, it only
costs about 0.7% of decoding time, which gives a 2% gain.
Change-Id: I5003ee30b6dc6dea0bfa42a6ad7e7c22fcc7b215
This renames the vpx_codec/ directory to vpx/, to allow applications
to more consistently reference these includes with the vpx/ prefix.
This allows the includes to be installed in /usr/local/include/vpx
rather than polluting the system includes directory with an
excessive number of includes.
Change-Id: I7b0652a20543d93f38f421c60b0bbccde4d61b4f