follow up to Change I0e51492d: neon: disable asm quantizer
Now x86 doesn't segfault with --disable-runtime-cpu-detect and -p=2
Change-Id: I8ca127bb299198efebbcbd5a661e81788361933f
The assembly version of the quantizer has not been updated to match
the new exact quantizer introduced in commit e04e2935. That commit tried
to disable this code but missed the non-RTCD case.
Thanks to David Baker <david.baker at openmarket.com> for isolating the
issue and testing this fix.
Change-Id: I0e51492dc6f8e44d2c10b587427448bf94135c65
Jeff Muizelaar posted some changes to the idct/reconstruction c code.
This is the equivalent update for the arm assembly.
This shows a good boost on v6, and a minor boost on neon.
Here are some numbers for highway in qcif, 2641 frames:
HEAD neon: ~161 fps
new neon: ~162 fps
HEAD v6: ~102 fps
new v6: ~106 fps
The following functions have been updated for armv6 and neon:
vp8_dc_only_idct_add
vp8_dequant_idct_add
vp8_dequant_dc_idct_add
Conflicts:
vp8/decoder/arm/armv6/dequantdcidct_v6.asm
vp8/decoder/arm/armv6/dequantidct_v6.asm
Resolved by removing these files. When I rewrote the functions, I also
moved the files to dequant_dc_idct_v6.asm/dequant_idct_v6.asm
Change-Id: Ie3300df824d52474eca1a5134cf22d8b7809a5d4
This moves the prediction step before the idct and combines the idct and
reconstruction steps into a single step. Combining them seems to give an
overall decoder performance improvement of about 1%.
Change-Id: I90d8b167ec70d79c7ba2ee484106a78b3d16e318
At the end of the decode, frame buffers were being copied.
The frames are not updated after the copy, they are just
for reference on later frames. This change allows multiple
references to the same frame buffer instead of copying it.
Changes needed to be made to the encoder to handle this. The
encoder is still doing frame buffer copies in similar places
where pointer reference could be done.
Change-Id: I7c38be4d23979cc49b5f17241ca3a78703803e66
In two pass encodes, the calculation of the number of bits
allocated to a KF group had the potential to overflow for high data
rates if the interval is very long.
We observed the problem in one test clip where there was one
section where there was an 8000 frame gap between key frames.
Change-Id: Ic48eb86271775d7573b4afd166b567b64f25b787
This replaces the approximate division-by-multiplication in the
quantizer with an exact one that costs just one add and one
shift extra.
The asm versions have not been updated in this patch, and thus
have been disabled, since the new method requires different
multipliers which are not compatible with the old method.
Change-Id: I53ac887af0f969d906e464c88b1f4be69c6b1206
These files were out of date and no longer maintained.
Token decoding has implemented the no-crash code which
is incompatible with this arm assembly code.
Change-Id: Ibf729886c56fca48181af60b44bda896c30023fc
The libs.mk file must be installed for the vpx.vcproj file to be
generated. It was being installed, but not in the src/ directory as
expected.
Also missed include files yasm.rules, quantize_x86.h
Change-Id: Ic1a6f836e953bfc954d6e42a18c102a0114821eb
Add targets x86-win32-vs9 and x86_64-win64-vs9 for support of Visual
Studio 2008-- this removes the need to convert the vs8 projects before
using them within the IDE.
Change-Id: Idb83e2ae701e07d98db1be71638280a493d770a2
Change submitted for Adrian Grange. Convert threshold
calculation in ARNR filter to a lookup table.
Change-Id: I12a4bbb96b9ce6231ce2a6ecc2d295610d49e7ec
Previously we had assumed that it was necessary to give a full frame's
bit allocation to the alt ref frame if it has been created through temporal
filtering. This is not the case. The active max quantizer control
insures that sufficient bits are allocated if needed and allocating a
full frame's worth of bits creates an excessive overhead for the ARF.
Change-Id: I83c95ed7bc7ce0e53ccae6ff32db5a97f145937a
In the case where the best reference mv is not (0,0) a secondary
search is carried out centered on (0,0). However, rather than
sending tmp_err into the search function, motion_error was
inadvertently passed.
As a result tmp_err remains set at INT_MAX and the (0,0)-based
search result will never be selected, even if it is better.
Change-Id: I3c82b246c8c82ba887b9d3fb4c9e0a0f2fe5a76c
Change I9fd1a5a4 updated the multithreaded loopfilter to avoid
reinitializing several parameteres if they haven't changed from the
last frame, but the code to update the last frame's parameters wasn't
invoked in the multithreaded case.
Change-Id: Ia23d937af625c01dd739608e02d110f742b7e1f2
Restructured and rewrote SSE2 loopfilter functions. Combined u and
v into one function to take advantage of SSE2 128-bit registers.
Tests on test clips showed a 4% decoder performance improvement on
Linux desktop.
Change-Id: Iccc6669f09e17f2224da715f7547d6f93b0a4987
The generated project is vpx.vcproj, change vpx_decoder references to
match. Remove .rules file dependency as it will be pulled from the
source tree.
Change-Id: I679db2748b37adae3bafd764dba8575fc3abde72
Following conversations with Tim T (Derf) I ran a large number of
tests comparing the existing polynomial expression with a simpler
^2 variant. Though the polynomial was sometimes a little better at
the extremes of Q it was possible to get close for most clips and
even a little better on some.
This code also changes the way the RD multiplier is calculated
when the ZBIN is extended to use a variant of the same ^2
expression.
I hope that this simpler expression will be easier to tune further
as we expand our test set and consider adjustments based on content.
Change-Id: I73b2564346e74d1332c33e2c1964ae093437456c
Besides the slight improvement in round trip error. This
also fixes a sign bias in the forward transform, so the
round trip errors are evenly distributed between +1s and
-1s. The old bias seemed to work well with the dc sign bias
in old fdct, which no longer exist in the improved fdct.
Change-Id: I8635e7be16c69e69a8669eca5438550d23089cef
Corrected setting of "which_buffer" for U & V cases to match that
used for Y, i.e. to refer to the temporally most recent frame of
those to be filtered.
Change-Id: Idf94b287ef47a05f060da3e61134a0b616adcb6b
The new fdct lowers the round trip sum squared error for a
4x4 block ~0.12. or ~0.008/pixel. For reference, the old
matrix multiply version has average round trip error 1.46
for a 4x4 block.
Thanks to "derf" for his suggestions and references.
Change-Id: I5559d1e81d333b319404ab16b336b739f87afc79