This commit makes the NearestMV match the chosen
best reference MV. It can be a 0,0 or non zero vector
which means the the compound nearest mv mode can
combine a 0,0 and a non zero vector.
Change-Id: I2213d09996ae2916e53e6458d7d110350dcffd7a
and called this function in vp9_dequant_idct_add_32x32_c when
eob == 1. For the test clip used, the decoder performance improved
by 21+%. Based on Yaowu's 16 point idct work.
Change-Id: Ib579a90fed531d45777980e04bf0c9b23c093c43
fixed a function prototypes to prevent compiler warnings;
removed a function not in use;
un-capitialize "Refstride" to ref_stride
Change-Id: Ib4472b6084f357d96328c6a06e795b6813a9edba
This commit changes the inverse 16 point dct to use the same algorithm
as the one for 32 point idct. In fact, now 16 point dct uses the exact
version of the souce code for even portion of the 32 point idct.
Tests showed current implementation has significant better accuracy
than the previous version. With this implementation and the minor bug
fix on forward 16 point dct, encoding tests showed about 0.2% better
compression of CIF set, test results on std-hd setting pending.
Change-Id: I68224b60c816ba03434e9f08bee147c7e344fb63
This commit changes the 32x32 idct to use integer only. The algorithm
was taken directly from "A Fast Computational Algorithm for the
Discrete Cosine Tranform" by W. Chen, et al., which was published in
IEEE Transaction on Communication Vol. Com.-25 No. 9, 1977. The signal
flow graph in the original paper is for a 32 point forward dct, the
current implementation of inverse DCT was done by follow the graph in
reversed direction.
With this implementation, the 32 point inverse dct contains a 16 point
inverse dct in its even portion, similarly the 16 point idct further
contains 8 point and 4 point inverse dcts.
As of patch 4, encoding tests showed there is no compression loss when
compared against the floating point baseline. Numbers even showed very
small postives. (cif: .01%, std-hd: .05%).
Change-Id: I2d2d17a424b0b04b42422ef33ec53f5802b0f378
Adds a special combination mode specific to intra prediciton
mode D45.
Current results with the compound inter/intra experiment:
derf: 0.2%
yt: 0.55%
std-hd: 0.75%
hd: 0.74%
Change-Id: I8976bdf3b9b0b66ab8c5c628bbc62c14fc72ca86
First step in simplifying the segment mode and
segment EOB flags into a simpler segment skip
flag that implies 0,0 mv and EOB at position 0.
Change-Id: Ib750cac31a7a02dc21082580498efd9f7d8d72a5
We'd backup and restore all cols for a 64x64 SB, but the array wouldn't
be big enough to hold all that data.
Change-Id: Ic68ea721bf07e0b2f3937bd16b0b734bcc743ce1
Adds a flag to disable features that would inhibit frame parallel
decoding. This includes backward adaptation and MV sorting based
on search in ref frame buffer.
Also includes some minor clean-ups.
Change-Id: I434846717a47b7bcb244b37ea670c5cdf776f14d
Adds an error-resilient mode where frames can be continued
to be decoded even when there are errors (due to network losses)
on a prior frame. Specifically, backward updates are turned off
and probabilities of various symbols are reset to defaults at
the beginning of each frame. Further, the last frame's mvs are
not used for the mv reference list, and the sorting of the
initial list based on search on previous frames is turned off
as well.
Also adds a test where an arbitrary set of frames are skipped
from decoding to simulate errors. The test verifies (1) that if
the error frames are droppable - i.e. frame buffer updates have
been turned off - there are no mismatch errors for the remaining
frames after the error frames; and (2) if the error-frames are non
droppable, there are not only no decoding errors but the mismatch
PSNR between the decoder's version of the post-error frames and the
encoder's version is at least 20 dB.
Change-Id: Ie6e2bcd436b1e8643270356d3a930e8989ff52a5
Updated the instrinsic code to match Yaowu's latest loopfilter change.
(I584393906c4f5f948a581d6590959522572743bb)
The decoder performance improved by ~30% for the test clip used.
Change-Id: I026cfc75d5bcb7d8d58be6f0440ac9e126ef39d2
* changes:
Use alt-ref frame context for keyframes
Preserve the previous golden frame on golden updates
Generalize and increase frame coding contexts
Start to anonymize reference frames
Update encoder to use fb_idx_ref_cnt
Remove buffer-to-buffer copy logic
The loop filtering used for MB edge or internal edge of a MB using 8x8
tranform was reading 5 pixel each side and writting 3 pixel each side.
With suggestion from Aki and Scott on hardware&software performance,
this commit changed to read 4 pixel each side and write 3 pixel each
side.
Change-Id: I584393906c4f5f948a581d6590959522572743bb
This commit restores the quality lost when the buffer-to-buffer copy
logic was removed. Note that this is specific to the current use of
golden frames and will need rework when RTC functionality is added.
Change-Id: I7324a75acd96eafd9e0f9b8633d782e390d5dc21
Previously there were two frame coding contexts tracked, one for normal
frames and one for alt-ref frames. Generalize this by signalling the
context to use in the bitstream, rather than tieing it to the alt ref
refresh bit. Also increase the number of contexts available to 4, which
may be useful for temporal scalability.
Change-Id: I7b66daaddd55c535c20cd16713541fab182b1662
Remove lst_fb_idx, gld_fb_idx, alt_fb_idx, refresh_last_frame,
refresh_golden_frame, refresh_alt_ref_frame from common. Gold/Alt are
encode side conventions. From the decoder's perspective, we want to be
dealing with numbered references.
Updates to active_ref 2 signal mode context switches, vestigial from
refresh_alt_ref_frame. This needs some clean up to make sense with
increased numbers of reference frames, as well as reimplementing the
swapping of alt/golden which was previously done using the
buffer-to-buffer copy mechanism removed in an earlier commit.
Change-Id: I7334445158b7666f9295d2a2dd22aa03f4485f58
Do reference counting the same way on the encoder as the decoder does,
rather than maintaining the 'flags' member of YV12_BUFFER_CONFIG.
Change-Id: I91dc210ffca081acaf9d5c09a06e7461b3c3139c
This is the first in a series of commits to add additional reference
frames to the codec. Each frame will be able to update any of the
available references, but copying between references is not
supported.
Change-Id: I5945b5ce6cc3582c495102b4e7eed4f08c44d5a1
Uses a single 1D table to implement the weighting of the predictors
for the compound inter-intra experiment.
Change-Id: I204ffbe4f9fc79d5d43b6c724ad253d800461012
These variables have the type int64_t, not long long. long long could
be a larger type than 64 bits. Emulate INT64_MAX for older versions of
MSVC, and remove the unreferenced vpx_ports/vpxtypes.h
Change-Id: Ideaca71838fcd3849d816d5ab17aa347c97d03b0
This experiment gives little gains and adds relatively much code
complexity (and it hinders other experiments), so let's get rid of
it.
Change-Id: Id25e79a137a1b8a01138aa27a1fa0ba4a2df274a
In commit 9a1d73d, loop filtering was added for UV 4x4 boundaries
when TX_8X8 is used by a MB. This commit further refined the decision
to be based on the actual transform used for the UV planes. When
UV planes use 4x4 transform, i.e. when prediction mode used is either
I8X8_PRED or SPLITMV, UV planes are filtered on 4x4 boundaries, and no
filtering is applied on 4x4 block boundaries when UV planes use 8X8
transform.
Change-Id: Ibb404face0a1d129b4b4abaf67c55d82e8df8bec
Fixes some scaling issues. Adds an option to only compute the
dct on the low-low subband for 32x32 and 64x64 blocks using
only a single 16x16 dct after 1 and 2 wavelet decomposition
levels respectively. Also adds an option to use a 8x8 dct
as building block.
Currenlty with the 2/6 filter and with a single 16x16 dct on
the low low band, the reuslts compared to full 32x32 dct is
as follows:
derf: -0.15%
yt: -0.29%
std-hd: -0.18%
hd: -0.6%
These are my current recommended settings, since the 2/6 filter
is very simple.
Results with 8x8 dct are about 0.3% worse.
Change-Id: I00100cdc96e32deced591985785ef0d06f325e44
and vp9_mb_lpf_vertical_edge_w_sse2. This was quickly done so we can
run some tests over the weekend. Future commits will optimize/refactor these
functions further.
The decoder performance improved by ~17% for the clip used.
Change-Id: I612687cd5a7670ee840a0cbc3c68dc2b84d4af76
Updated the rtcd_defs and used the sse2 uv version
of the loopfilter. The performance improved by ~8%
for the test clip used.
Change-Id: I5a0bca3b6674198d40ca4a77b8cc722ddde79c36
The commit changed to not to use wider lpf within a superblock when
32x32 transform is used for the block.
The commit also changed to use the shorter version of loop filtering:
for UV planes.
Change-Id: I344c1fb9a3be9d1200782a788bcb0b001fedcff8
This patch removes the old pred-filter experiment and replaces it
with one that is implemented using the switchable filter framework.
If the pred-filter experiment is enabled, three interopolation
filters are tested during mode selection; the standard 8-tap
interpolation filter, a sharp 8-tap filter and a (new) 8-tap
smoothing filter.
The 6-tap filter code has been preserved for now and if the
enable-6tap experiment is enabled (in addition to the pred-filter
experiment) the original 6-tap filter replaces the new 8-tap smooth
filter in the switchable mode.
The new experiment applies the prediction filter in cases of a
fractional-pel motion vector. Future patches will apply the filter
where the mv is pel-aligned and also to intra predicted blocks.
Change-Id: I08e8cba978f2bbf3019f8413f376b8e2cd85eba4
This is to add to the 64x64 transform experiment as an alternative to
a 64x64 DCT.
Two levels of wavelet decomposition is used on a 64x64 block, followed
by 16x16 DCT on the four lowest subbands. The highest three subbands
are left untransformed after the first level DWT.
Change-Id: I3d48d5800468d655191933894df6b46e15adca56
This commit did a couple of minor cleanup/refactoring to prepare for
futher loop filter experiments. It merged y_only version of loop filter
function into the regular one, which makes sure that same logic is used
for functions for picking level and for actual loop filtering.
Change-Id: Id10c94dccd45f58e5310bacfdf6ee63cbb60b86f
This experimental change reorders the search so
that all possible references that match the target
reference frame are tested first and these in order
of distance from the current block. These will usually
be the highest scoring candidates.
If we do not find enough good candidates this way
we try non matching cases. These will usually be lower
scoring candidates.
The change in order together with breakouts when
we have found enough candidates should reduce
the computational cost and especially reduce the number
of sort operations.
Quality Results:
Std Hd +0.228%, Hd +0.074%, YT +0.046%, derf +0.137%
This effect is probably due to the fact that more distant
weak candidates are now less likely to get "promoted" over
near candidates even if they are repeated.
Change-Id: Iec37e77d88a48ad0ee1f315b14327a95d63f81f6
The 2-D inverse transform X = M1*Z*Transposed_M2 was calculated
in 2 steps from left to right:
1. Vertical transform: Y = M1*Z
2. Horizontal transform: X= Y*Transposed_M2
In SIMD, a transpose is needed in vertical transform.
Here, switched the calculation order to do it from right to left.
In this way, we could eliminate that transpose by writing the
intermediate results out to their transposed positions.
Change-Id: I34dfe5eb01292f6e363712420d99475e2e81e12c
Various fixups to resolve issues when building vp9-preview under the more stringent
checks placed on the experimental branch.
Change-Id: I21749de83552e1e75c799003f849e6a0f1a35b07
Adds an experiment to derive the previous context of a coefficient
not just from the previous coefficient in the scan order but from a
combination of several neighboring coefficients previously encountered
in scan order. A precomputed table of neighbors for each location
for each scan type and block size is used. Currently 5 neighbors are
used.
Results are about 0.2% positive using a strategy where the max coef
magnitude from the 5 neigbors is used to derive the context.
Change-Id: Ie708b54d8e1898af742846ce2d1e2b0d89fd4ad5
For coefficients, use int16_t (instead of short); for pixel values in
16-bit intermediates, use uint16_t (instead of unsigned short); for all
others, use uint8_t (instead of unsigned char).
Change-Id: I3619cd9abf106c3742eccc2e2f5e89a62774f7da
MSVC 2012 (_MSC_VER=1600) introduced the definition, this commit
prevents the redefinition of the macro
Change-Id: I7de92e7e9e865a342f2bcc4b071f8d3c9b2a508c
Modifies the scanning pattern and uses a floating point 16x16
dct implementation for now to handle scaling better.
Also experiments are in progress with 2/6 and 9/7 wavelets.
Results have improved to within ~0.25% of 32x32 dct for std-hd
and about 0.03% for derf. This difference can probably be bridged by
re-optimizing the entropy stats for these transforms. Currently
the stats used are common between 32x32 dct and dwt/dct.
Experiments are in progress with various scan pattern - wavelet
combinations.
Ideally the subbands should be tokenized separately, and an
experiment will be condcuted next on that.
Change-Id: Ia9cbfc2d63cb7a47e562b2cd9341caf962bcc110
Gives 0.5-0.6% improvement on derf and stdhd, and 1.1% on hd. The
old tables basically derive from times that we had only 4x4 or
only 4x4 and 8x8 DCTs.
Note that some values are filled with 128, because e.g. ADST ever
only occurs as Y-with-DC, as does 32x32; 16x16 ever only occurs
as Y-with-DC or as UV (as complement of 32x32 Y); and 8x8 Y2 ever
only has 4 coefficients max. If preferred, I can add values of
other tables in their place (e.g. use 4x4 2nd order high-frequency
probabilities for 8x8 2nd order), so that they make at least some
sense if we ever implement a larger 2nd order transform for the
8x8 DCT (etc.), please let me know
Change-Id: I917db356f2aff8865f528eb873c56ef43aa5ce22
As suggested by Yaowu, we can use eob to reduce the complexity
of the vp9_ihtllm_c function. For the 1080p test clip used, the decoder
performance improved by 17%.
Change-Id: I32486f2f06f9b8f60467d2a574209aa3a3daa435
Add a function clip_pixel() to clip a pixel value to the [0,255] range
of allowed values, and use this where-ever appropriate (e.g. prediction,
reconstruction). Likewise, consistently use the recently added function
clip_prob(), which calculates a binary probability in the [1,255] range.
If possible, try to use get_prob() or its sister get_binary_prob() to
calculate binary probabilities, for consistency.
Since in some places, this means that binary probability calculations
are changed (we use {255,256}*count0/(total) in a range of places,
and all of these are now changed to use 256*count0+(total>>1)/total),
this changes the encoding result, so this patch warrants some extensive
testing.
Change-Id: Ibeeff8d886496839b8e0c0ace9ccc552351f7628
Some further changes and refactoring of mv
reference code and selection of center point for
searches. Mainly relates to not passing so many
different local copies of things around.
Some place holder comments.
Change-Id: I309f10ffe9a9cde7663e7eae19eb594371c8d055
This commit changed the ENTROPY_CONTEXT conversion between MBs that
have different transform sizes.
In additioin, this commit also did a number of cleanup/bug fix:
1. removed duplicate function vp9_fix_contexts() and changed to use
vp8_reset_mb_token_contexts() for both encoder and decoder
2. fixed a bug in stuff_mb_16x16 where wrong context was used for
the UV.
3. changed reset all context to 0 if a MB is skipped to simplify the
logic.
Change-Id: I7bc57a5fb6dbf1f85eac1543daaeb3a61633275c
Use these, instead of the 4/5-dimensional arrays, to hold statistics,
counts, accumulations and probabilities for coefficient tokens. This
commit also re-allows ENTROPY_STATS to compile.
Change-Id: If441ffac936f52a3af91d8f2922ea8a0ceabdaa5
This adds Debargha's DCT/DWT hybrid and a regular 32x32 DCT, and adds
code all over the place to wrap that in the bitstream/encoder/decoder/RD.
Some implementation notes (these probably need careful review):
- token range is extended by 1 bit, since the value range out of this
transform is [-16384,16383].
- the coefficients coming out of the FDCT are manually scaled back by
1 bit, or else they won't fit in int16_t (they are 17 bits). Because
of this, the RD error scoring does not right-shift the MSE score by
two (unlike for 4x4/8x8/16x16).
- to compensate for this loss in precision, the quantizer is halved
also. This is currently a little hacky.
- FDCT and IDCT is double-only right now. Needs a fixed-point impl.
- There are no default probabilities for the 32x32 transform yet; I'm
simply using the 16x16 luma ones. A future commit will add newly
generated probabilities for all transforms.
- No ADST version. I don't think we'll add one for this level; if an
ADST is desired, transform-size selection can scale back to 16x16
or lower, and use an ADST at that level.
Additional notes specific to Debargha's DWT/DCT hybrid:
- coefficient scale is different for the top/left 16x16 (DCT-over-DWT)
block than for the rest (DWT pixel differences) of the block. Therefore,
RD error scoring isn't easily scalable between coefficient and pixel
domain. Thus, unfortunately, we need to compute the RD distortion in
the pixel domain until we figure out how to scale these appropriately.
Change-Id: I00386f20f35d7fabb19aba94c8162f8aee64ef2b
Only declare the functions in vpx_scale RTCD and include the relevant
header.
Remove unused files and functions in vpx_scale to avoid wasting time
renaming. vpx_scale/win32/scaleopt.c contains functions which have not
been called in a long time but are potentially optimized.
The 'vp8' functions have not been renamed yet. That is for after the
cleanup.
Change-Id: I2c325a101d60fa9d27e7dfcd5b52a864b4a1e09c
This patch reduces the cpu cost of the MV ref
search by only allowing insert for candidates
that would be in the current top 4.
This could alter the outcome and slightly favors
near candidates which are tested first but also
limits the worst case loop count to 4 and means in
many cases it will drop out and not happen.
Change-Id: Idd795a825f9fd681f30f4fcd550c34c38939e113
Only declare the functions in vpx_scale RTCD and include the relevant
header.
Remove unused files and functions in vpx_scale to avoid wasting time
renaming. vpx_scale/win32/scaleopt.c contains functions which have not
been called in a long time but are potentially optimized.
The 'vp8' functions have not been renamed yet. That is for after the
cleanup.
Change-Id: I2c325a101d60fa9d27e7dfcd5b52a864b4a1e09c
Adds support for compound inter-intra prediction with superblocks.
Also, fixes a bug that disabled intra modes for superblocks.
Change-Id: I4d711317e1bc19df8c2f32dc645429f7fff31036
Allows switchbale filters to be used without mismatch when the
superblock experiment is on.
Also removes a spurious clamping code in decodemv.c which causes
rare encode/decode mismatches.
Change-Id: I809d9ee0b2859552b613500b539a615515b863ae
This patch allows use of 8x8 and 4x4 ADST correctly for Intra
16x16 modes and Intra 8x8 modes when the block size selected
is smaller than the prediction mode. Also includes some cleanups
and refactoring.
Rebase.
Change-Id: Ie3257bdf07bdb9c6e9476915e3a80183c8fa005a
This change included:
1. Aligned reads in vp9_mbloop_filter_vertical_edge function.
Since we actually read 16 bytes, we can align the reads to read
starting at (s - 8) instead of (s - 5).
2. Combined u, v loop filters.
3. Added 8x16 transpose.
This gave 2% decoder performance gain (tulip clip).
Change-Id: Ib14c2f1645c4a3436df17fe2f24789506bf0bb58
This commit removed a couple of redundant data structures in frame
coding contextsm, mode_context and mode_context_a, and changed to
use vp9_mode_contexts only. The switch of the context for different
frame type now relies on the switch of frame coding context between
lfc and lfc_a. This commit also removed a number of memcpy among
these redundant data structure.
Change-Id: I42e8174bd60f466b0860afc44c1263896471b0f3
Not all segment feature data elements are full-range powers of two, so
there are values that can be encoded that are invalid. Add a new function
to clamp values to the maximum allowed.
Change-Id: Ie47cb80ef2d54292e6b8db9f699c57214a915bc4
Support for gyp which doesn't support multiple objects in the same
static library having the same basename.
Change-Id: Ib947eefbaf68f8b177a796d23f875ccdfa6bc9dc
Vp9_sad3x16_sse2() is heavily called in decoder, in which the
unaligned reads consume lots of cpu cycles. When CONFIG_SUBPELREFMV
is off, the unaligned offset is 1. In this situation,
we can adjust the src_ptr to be 4-byte aligned, and then do the
aligned reads. This reduced the reading time significantly. Tests
on 1080p clip showed over 2% decoder performance gain with
CONFIG_SUBPELREFM off.
Change-Id: I953afe3ac5406107933ef49d0b695eafba9a6507
Experiments with a larger set of contexts and some
clean up to replace magic numbers regarding the
number of contexts.
The starting values and rate of backwards adaption
are still suspect and based on a small set of tests.
Added forwards adjustment of probabilities.
The net result of adding the new context and forward
update is small compared to the old context from the
legacy find_near function. (down a little on derf but
up by a similar amount for HD)
HOWEVER.... with the new context and forward update
the impact of disabling the reverse update (which may be
necessary in some use cases to facilitate parallel decoding)
is hugely reduced.
For the old context without forward update, the impact of
turning off reverse update (Experiment was with SB off) was
Derf - 0.9, Yt -1.89, ythd -2.75 and sthd -8.35. The impact was
mainly at low data rates.
With the new context and forward update enabled the impact
for all the test sets was no more than 0.5-1% (again most at
the low end).
Change-Id: Ic751b414c8ce7f7f3ebc6f19a741d774d2b4b556