Added hiding global symbols for macho32 and macho64 in x86inc.asm.
This was done to fix exported symbol issue in Chrome build.
Change-Id: I08d5c559b985b82f655b537469fee125615e78c0
"keyframe" variable in the current code actually means that previous
frame is a keyframe because cm->frame_type has not been initialized
in read_uncompressed_header.
Change-Id: I5645b0816c70abdef5dfc70113018d06276dac77
The clamp operation may not affect the values of the final assigned mv
where compiler may make use of strict aliasing rule to optimize out the
clamp operation. This change made the code segments to better comply
the strict aliasing rule.
Change-Id: I24502ff18bd4f9e62507a879cc8760a91a0fd07e
The commit added check to make sure no invalid memory access even when
the decoder instance is never initialized.
Change-Id: I4da343d0b3c78c27777ac7f5ce7688562c69f0c5
For bad input data, the decoder may access the array out of bounds. The
commit added clamp to prevent such out of bound access
Change-Id: I0a1cfd9b8786ea7113a998053c76605c963b077a
The codec should effectively run with motion vector of range (-2048, 2047)
in full pixels, for sequences of 1080p and below. Add assertions to clarify
this behavior.
Change-Id: Ia0cac28249f587d8f8882205228fa480263ab313
The probability model used to code prediction mode is conditioned
on the immediate above and left 8x8 blocks' prediction modes. When
the above/left block is coded in sub8x8 mode, we use the prediction
mode of the bottom-right sub8x8 block as the reference to generate
the context.
This commit moves the update of mbmi.mode out of the sub8x8 decoding
loop, hence removing redundant update steps and keeping the bottom-
right block's mode for the decoding process of next blocks.
Change-Id: I1e8d749684d201c1a1151697621efa5d569218b6
Current x86inc.asm didn't handle 32bit PIC build properly.
TEXTRELs were seen in the library built. The PIC macros from
libvpx's x86_abi_support.asm was used to fix this problem.
The assembly code was modified to use the macros.
Notes: We need this fix in for decoder building. Functions in
encoder will be fixed later.
Change-Id: Ifa548d37b1d0bc7d0528db75009cc18cd5eb1838
This commit cleans up the second reference check in the
rate-distortion optimization loop of sub8x8 blocks.
Change-Id: Ife68feaa4cddbfad2878c9b44d3012788d634f97
Modified the resize unit test so that it optionally
writes the encoded bitstream to file. The macro
WRITE_COMPRESSED_STREAM should be set to 1 to enable
output of the test bitstream; it is set to 0 by default.
Change-Id: I7d436b1942f935da97db6d84574a98d379f57fb1
The sub8x8 check can be directly inferred from block_idx, hence
removed from the arguments if get_sub_block_mv.
Change-Id: Ib766d57e81248fb92df0f6d9b163e6c77b933ccd
Jenkins was failing to detect the case where an existing
file is recreated with new content. In this case, thinking
that the file already existed, Jenkins did not re-copy the
file as it should have.
By adding the file test-data.sha1 as a dependendency to
the LIBVPX_TEST_DATA build target the files will be
recopied if the MD5 of an existing file changes.
This could be further improved to only copy files that
have changed rather than copying the whole set as done in
this patch.
(Thanks to jzern@ who diagnozed ithe problem and suggested
this fix).
Change-Id: Icea7c61a95189bc639fec83020c28c70da5b2b41
The commit added reset of pred_mv at the beginning of each SB64x64
partition mv search, also limited the usage of pred_mv only when
search on the largest partition is already done. This is to fix
a crash at speed 1/2 encoder where an invalid mv is used in mv
search.
Change-Id: I39010177da76d054e3c90b7899a44feb2e3a5b1b
This is incompatible with most toolchains other than gcc.
Revert "Deleted #include <inttypes.h>"
This reverts commit 4d018be950.
This reverts commit d22a504d11.
Change-Id: I1751dc6831f4395ee064e6748281418e967e1dcf
This commit enables adaptive constraint on motion search range for
smaller partitions, given the motion vectors of collocated larger
partition as a candidate initial search point.
It makes speed 0 runtime of bus at CIF and 2000 kbps goes from
167s down to 162s (3% speed-up), at 0.01dB performance gains. In
the settings of speed 1, this makes the runtime goes from 33687 ms
to 32142 ms (4.5% speed-up), at 0.03dB performance gains.
Compression performance wise, it gains at speed 1:
derf 0.118%
yt 0.237%
hd 0.203%
stdhd 0.438%
Change-Id: Ic8b34c67810d9504a9579bef2825d3fa54b69454
The CpuSpeedTest is extended to cover 2pass good quality with CpuUsed
from 0 to 4. The BordersTest is changed to use CpuUsed 1 for faster
turn around.
Change-Id: I005e89adee7fe63af4b1f2a76a3a13ea826feadf
Mis-merge of the following change managed to break mode order
and delete two mode options (new alt ref and near alt ref)
It also created a situation where we could test two undefined
modes off the end of the VP9_mode_order[] data structure.
"clang warnings : remove split and i4x4_pred fake modes"
"Change Id: I8ef3c*"
Initial testing on Akiyo at speed 2.
101.35 44.567 44.447 improves to
96.82 44.915 44.815
Approx 0.3-0.4db gain and 2.5% size reduction
Change-Id: Icff813e7c0778d140ad4f0eea18cf1ed203c4e34
Removes this speed feature since it is very slow and unlikely
to be used in practice. This cleanup removes a bunch of unnecessary
complications in the outer encode loop.
Change-Id: I3c66ef1ca924fbfad7dadff297c9e7f652d308a1
Reformatted version of a patch submitted by Erik/Tamar
from Intel. For the test clips used, the decoder
performance improved by ~2%.
Change-Id: Ifbc37ac6311bca9ff1cfefe3f2e9b7f13a4a511b
Propose some changes to the speed 2 settings to improve quality.
In particular, turns off the adjust_thresholds_by_speed feature
which improves results by 6%. Also removes the code for
adjust_thresholds_by_speed since it conflicts with the adaptive
rd thresh feature.
Overall, with this change speed 2 is -15.2% from speed 0 settings,
on derf, which is significantly better than -21.6% down before.
Change-Id: I6e90a563470979eb0c258ec32d6183ed7ce9a505
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now constists of pointers to MODE_INFO structs. The
MODE_INFO structs are now stored as a stream (decoder only),
eliminating unnecessary copies and is a little more cache
friendly.
Change-Id: I031d376284c6eb98a38ad5595b797f048a6cfc0d
The c code implementation of 32x32 quantization does the zbin check
of all coefficients prior to the quant/dequant loop, hence removing
the redundant zbin check inside the loop. This only affects the
c code version. SSSE3 version does not separate the zbin check out.
Change-Id: Ic197a7d61d0b25fcac3cc092987651378cb56e4e
Added the resize_test unit test to the VP9 set.
Set g_in_frames = 0 to avoid a problem when the total
number of frames being encoded is smaller than
g_in_frames. In this case the test will not have
access to the encoded frames and will not be able to
compare them for testing for encoder/decoder mismatch.
Change-Id: I0d2ff8ef058de7002c5faa894ed6ea794d5f900b
If the current obtained distortion is very small, which happens
for static image case, we pick the current partition type without
further split checking.
This won't affect regular videos. For static videos, we got 10%~12%
encoding speed gain. PSNR was better for some clips, and worse for
others. Overall it was even.
Change-Id: If787a57bedf46fc595ca4f5ded2b0c0a69e9fdef
Thank Paul for the suggestions. While turning on static-thresh
for static-image videos, a big jump on bitrate was seen. In this
patch, we detected static frames in the video using first-pass
stats. For different cases, disable encode breakout or reduce
encode breakout threshold to limit the skipping.
More modification need be done to break incorrect partition
picking pattern for static frames while skipping happens.
Change-Id: Ia25f47041af0f04e229c70a0185e12b0ffa6047f
A previous speed feature skipped modes not used in earlier
partitions but this not longer worked as intended following
changes to the partition coding order and in conjunction
with some other speed features (Especially speed 2 and above).
This modified mode skip feature sets a mask after the first X
modes have been tested in each partition depending on the
reference frame of the current best case.
This patch also makes some changes to the order modes are
tested to fit better with this skip functionality.
Initial testing suggests speed and rd hit count improvements
of up to 20% at speed 1. Quality results. (derf -1.9%, std hd +0.23%).
Change-Id: Idd8efa656cbc0c28f06d09690984c1f18b1115e1
This commit completes the per coefficient accuracy check and memory
overflow check for SSE2 and other implemented versions of 16x16
transform.
Change-Id: If26a3e4f6ba82ccecc13f0b73cb8f7bb6ac14584
This commit refactors the 16x16 transform unit test. It enables the
test on all implemented versions of forward and inverse 16x16 transform
modules.
Change-Id: I0c7d5f3c5fdd5d789a25f73e287aeeaf463b9d69
Sample app: vp9_spatial_scalable_encoder
vpx_codec_control extensions:
VP9E_SET_SVC
VP9E_SET_WIDTH, VP9E_SET_HEIGHT, VP9E_SET_LAYER
VP9E_SET_MIN_Q, VP9E_SET_MAX_Q
expanded buffer size for vp9_convolve
modified setting of initial width in vp9_onyx_if.c so that layer size
can be set prior to initial encode
Default number of layers set to 3 (VPX_SS_DEFAULT_LAYERS)
Number of layers set explicitly in vpx_codec_enc_cfg.ss_number_layers
Change-Id: I2c7a6fe6d665113671337032f7ad032430ac4197
In configure when internal-stats is enabled, because postprocessing
code is needed for computing stats for enabling internal-stats
Change-Id: I3601dc5a4aa65feb99465452486a21e75eb62c1f
The commit changes the border pixel extension from 160 pixel each side
to what is necessary in arnr filter or motion estimation portion, i.e.
16 pixel on top and left side. For right or bottom side, the extension
is changed to either round up image size to multiple of 64 or at least
16 pixels.
Change-Id: Ic05e19b94368c1ab4df568723aae5734e6c3d2c5
The 16x16 transform unit test suggested that the peak coefficient
value can reach 32639. This could cause potential overflow issue
in the SSSE3 implmentation of 16x16 block quantization. This commit
fixes this issue by replacing addition with saturated addition.
Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e
this prevents returning an address smaller than the natural heap
alignment from vpx_malloc on e.g., x86_64
Change-Id: I283e858664a8529f28b22060c3815116a7798c0d
Adds a new end-usage option for constant quality encoding in vpx. This
first version implemented for VP9, encodes all regular inter frames
using the quality specified in the --cq-level= option, while encoding
all key frames and golden/altref frames at a quality better than that.
The current performance on derfraw300 is +0.910% up from bitrate control,
but achieved without multiple recode loops per frame.
The decision for qp for each altref/golden/key frame will be improved
in subsequent patches based on better use of stats from the first pass.
Further, the qp for regular inter frames may also be varied around the
provided cq-level.
Change-Id: I6c4a2a68563679d60e0616ebcb11698578615fb3
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now constists of a pointer to a MODE_INFO struct and
a "in the image" flag. The MODE_INFO structs are now stored
as a stream, eliminating unnecessary copies and is a little
more cache friendly.
For the test clips used, the decoder performance improved
by ~4.3% (1080p) and ~9.7% (720p).
Patch Set 2: Re-encoded clips with latest. Now ~1.7% (1080p)
and 5.9% (720p).
Change-Id: I846f29e88610fce2523ca697a9a9ef2a182e9256
This commit enabled a full functional test on 32x32 forward/inverse
transform, including round-trip error and memory overflow check. It
tests the prototype functions in C and all other implementations if
applicable.
Change-Id: I9cc50b05abdb4863e7abbcb29209a19b1fe90da7
The 32x32 forward transform can potentially reach peak coefficient
value close to 32700, while the rounding factor can go upto 610.
This could cause overflow issue in the SSSE3 implementation of 32x32
quantization process.
This commit resolves this issue by replacing the addition operations
with saturated addition operations in 32x32 block quantization.
Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70
The segment feature SEG_LVL_SKIP requires the prediction unit size
to be at least BLOCK_8X8. This commit makes the requirement to be
explicit. This is to prevent future encoder implementations from
making wrong choices.
Change-Id: I0127f0bd4c66e130b81f0cb0a8d3dbfe3b2da5c2
There is another unit test that has been failing randomly on win32
build. Investigation has shown that the failure was caused by simd
register state is not reset appropriately in the fdct8x8 test. This
commit added ClearSystemState() in the teardown of this test, tests
showed it resolved the random failure issue for win32 build.
Related issue: https://code.google.com/p/webm/issues/detail?id=614
Change-Id: I9381d0c1a6f4b855ccaeef1aca8c417ac8c71ee2
Moves counting of mv branches to where we have a new mv, instead of after
the whole frame is summed.
Change-Id: I945d9f6d9199ba2443fe816c92d5849340d17bbd
Speed 4 fixed partition size. Use fixed size unless it does not
fit inside image, in which case use the largest size that does.
Change-Id: I250f7a80506750dd82ab355721624a1344247223
This commit fixed the potential overflow issue in the SSE2
implementation of 32x32 forward DCT. It resolved the corrupted
coded frames in the border of scenes.
Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9
While static-thresh is on, we only need to transmit skip
flag if skip = 1. The cost of skip bit is added to the
total rate cost.
Change-Id: I64e73e482bc297eba22907026298a15fa8cc3920
- Intermediate height was not correct i.e. when block size is 4 and
y_step_q4 is 6. In this case intermediate height was
(4*6) >> 4 = 1 and vertical interpolation needs two source pixels
plus 7 extra pixels for taps.
- Also if the current output block is 16x16 and we are using 4x upscaling
we need only 12 rows after horizontal filtering instead of 16.
Patch Set 2: Intermediate_height updated after CL 66723
"Fix bug in convolution functions (filter selection)"
Change-Id: I5a1a1bc2ac9d5edb3a6e0818de618bf318fdd589
Added some code to output normalized rd hit count stats.
In effect this approximates to the average number of rd
operations/tests per pixel for the sequence.
The results are not quite accurate and I have not bothered
to account for partial SB64s at frame edges and for key frames
However they do give some idea of the number of modes /
prediction methods being tested for each pixel across the
different partition sizes. This indicates how much scope their
is for further gains either by reducing the number of partitions
examined or the modes per partition through heuristics.
Patch 3 moved place where count incremented so partial rd
tests that are aborted with INT_MAX return are also counted.
Example numbers for first 50 frames of Akiyo.
Speed 0 ~84.4 rd operations / pixel
Speed 1 ~28.8
Speed 2 ~11.9
Change-Id: Ib956e787e12f7fa8b12d3a1a2f6cda19a65a6cb8
The 32x32 quantization process can potentially have the intermediate
stacks over 16-bit range, thereby causing enc/dec mismatch. This commit
fixes this overflow issue in the SSSE3 implementation, as well as the
prototype, of 32x32 quantization.
This fixes issue 607 from webm@googlecode.
Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806
The two arrays are typically initialized to INT64_MAX, if they are not
filled with valid values before the addition, the values can overflow
and lead to wrong results.
Change-Id: I515de22cf3e8f55af4b74bdb2c8eb821a02d3059
This patch is a reformatted version of optimizations done by
engineers at Intel (Erik/Tamar) who have been providing
performance feedback for VP9. For the test clips used (720p, 1080p),
up to 1.2% performance improvement was seen.
Change-Id: Ic1a7149098740079d5453b564da6fbfdd0b2f3d2
Switching from mi_{width, height}_log2 and b_{width, height}_log2 to
num_8x8_blocks_{wide, high} and num_4x4_blocks_{wide, high}. Removing
redundant code, adding const.
Change-Id: Iaab2207590fd24d0b76999071778d1395dc5cd5d
Incorporates a speed feature for fast forward updates of
coefficients. This feature takes 3 values:
0 - use standard 2-loop version
1 - use a 1-loop version
2 - use a 1-loop version with reduced updates
Results: derfraw300 +0.007% (on speed 0) at feature value = 1
-0.160% (on speed 0) at feature value = 2
There is substantial speed up at speeds 2 and above for low
resolution sequences where the entropy updates are a big part
of the overall computations.
Change-Id: Ie96fc50777088a5bd441288bca6111e43d03bcae
This commit resolved a mis-alignment issue in compound inter-inter
prediction of sub8x8. This patch follows solution from dkovalev@.
Change-Id: I3cc0cf7e55b84110e0c42ef4b2e6ca7ac3f8f932
In subpel_avg_variance functions, code similar to the following
punpkldq m2, [addr]
actually reads 8 bytes. For functions that are supposed to work on
buffers only have less 8 bytes a line, this caused valgrind error
of reading uninitialized memory.
Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34
vp9_setup_interp_filters before each inter block decoding, it is not
necessary to call it just before the whole frame decoding.
Change-Id: Id1b0ee62f987474e27eafba0013a4896b492c400
Removing references to plane_block_width and plane_block_height (we are
going to delete the latter ones).
Change-Id: I7982da4d373aebb54d2209dc8886f6192df4d287
Make the current head working properly, while working on fixing an
issue in the SSSE3 implementation of 32x32 quantization.
Change-Id: Ic029da3fd7f1f5e58bc641341cbd226ec49a16bc
- s|source -> src
- dest -> dst
- use verbose names in extend_plane dropping the redundant comments
+ light cosmetics:
- join a few lines / assignments
- drop some unnecessary comments & includes
Change-Id: I6d979a85a0223a0a79a22f79a6d9c7512fd04532
Previous change c4048dbd limits the mv search range assuming max block
size of 64x64, this commit change the search range using actual block
size instead.
Change-Id: Ibe07ab02b62bf64bd9f8675d2b997af20a2c7e11
We could avoid calling clamp_mv2 because it has been already called
inside vp9_find_best_ref_mvs function.
Change-Id: I08edeaf3e11e98c19e67b9711b2523ca5fb1416e
Fix of https://code.google.com/p/webm/issues/detail?id=608. We could have
used invalid display size equal to the previous frame size (not to the
current frame size).
Change-Id: I91b576be5032e47084214052a1990dc51213e2f0
Making code more compact, adding consts, removing redundant arguments,
adding do/while(0) for macros.
Change-Id: Ic9ec0bc58cee0910a5450b7fb8cfbf35fa9d0d16
To the source buffer to be encoded as an alt ref frame. This is to fix
the problem of using uninitialized memory in encoder.
See https://code.google.com/p/webm/issues/detail?id=605
Change-Id: I97618a2fc207e08abcf5301b734aa9e3ad695e2c
(In response to Issue 604:
https://code.google.com/p/webm/issues/detail?id=604)
There were bugs in the convolution code for two cases:
1. Where the filter table was assumed to be aligned to a
256 byte boundary. The offset of the pixel in the
source buffer was computed incorrectly.
2. Where no such alignment assumption was made. An
incorrect address for the filter table base was used.
To fix both problems, I now assume that the filter table is
256-byte aligned and modify the pixel offset calculation to
match.
A later patch should remove the restriction that the filter
table is aligned to a 256-byte boundary.
There was also a bug in the ConvolveTest unit test
(convolve_test.cc).
(Bug & initial fix suggestion submitted by Tero Rintaluoma
and Sami Pietilä).
Change-Id: I71985551e62846e55e40de9e7e3959d4805baa82
Values now carried over frame to frame.
Change to algorithm for decreasing threshold after
a hit and to max threshold (now based on speed)
Removed some old commented out code relating to
VP8 adaptive thresholds.
The impact of these changes tested on Akiyo (50 frames)
and measured in terms of unit rd hits is as follows:
Speed 0 84.36 -> 84.67
Speed 1 29.48 -> 22.22
Speed 2 11.76 -> 8.21
Speed 3 12.32 -> 7.21
Encode speed impact is broadly in line with these.
Change-Id: I5b886efee3077a11553fa950d796fd6d00c8cb19
Most of the focus so far has been on inter frames.
At high speed settings the key frame is now taking a high %
of the cycles.
This patch puts in some masking to reduce the number
of INTRA modes searched during key frame coding (as already
happens for inter frames) at higher speed settings
TODO: Develop this further with either adaptive rd thresholds
when choosing which intra modes to consider or some other
heuristic.
Impact.
At high speed settings on some clips the key frame was starting
to dominate. In a coding of the first 50 frames of AKIYO at speed
2 limiting the key frame intra modes to DC or TM_PRED resulted in
~30% overall speedup. For Bus the number was lower at ~4-5%.
Change-Id: I7bde68aee04995f9d9beb13a1902143112e341e2
Put rectangular partition check flag change according to the rd
costs of NONE and SPLIT partition types under the speed feature.
Change-Id: If681e1e078a8d43d86961ea4b748da5cd1b6c331
vp9_short_idct10_16x16_add is used to handle the block that only have valid data
at top left 4x4 block. All the other datas are 0. So we could cut many
unnecessary calculations in order to save instructions.
Change-Id: I6e30a3fee1ece5af7f258532416d0bfddd1143f0
It is possible to have invalid scale factors and not access them
during decoding. Error is reported if we really try to use invalid scale
factors.
Change-Id: Ie532d3ea7325ee0c7a6ada08269f804350c80fdf
Comment is wrong, we don't initialize any xd pointers. We only initialize
xd->planes[i]->dst and xd->planes[i]->pre[], which are actually initialized
for every block during the decoding.
Change-Id: If152ea872ebef1f83ca70712fa6f8df1b6855f56
remove duplicate allocation from vp9_create_compressor, it was added to
vp9_alloc_frame_buffers in:
d5bec52 Added resizing & initialization of last frame segment map
Change-Id: I996723226a16a62aff8f9a52ac74e0b73cc98fdf
This commit changes the partition search order of superblocks from
{SPLIT, NONE, HORZ, VERT} to {NONE, SPLIT, HORZ, VERT} for
consistency with that of sub8x8 partition search. It enable the use
of early termination in partition search for all block sizes.
For ped_area_1080p 50 frames coded at 4000 kbps, it makes the runtime
goes down from 844305ms -> 818003ms (3% speed-up) at speed 0.
This will further move towards making the in-search partition types
configurable, hence unifying various speed-up approaches.
Some speed 1 and 2 features are turned off during the refactoring
process, including:
disable_split_var_thresh
using_small_partition_info
Stricter constraints are applied to use_square_partition_only for
right/bottom boundary blocks. Will bring back/refine these features
subsequently. At this point, it makes derf set at speed 1 about
0.45% higher in compression performance, and 9% down in run-time.
Change-Id: I3db9f9d1d1a0d6cbe2e50e49bd9eda1cf705f37c
Adds a couple of minor fixes, which may be absorbed in Jingning's
patch. Thanks to Guillaume for pointing these out.
Also adjusts the thresholds for speed 1 and 2 to 16 and 32
respectively, to keep quality drops small.
Results:
--------
derfraw300: threshold = 16, psnr -0.082%, speedup 2-3%
threshold = 32, psnr -0.218%, speedup 5-6%
stdhdraw250: threshold = 16, psnr -0.031%, speedup 2-3%
threshold = 32, psnr -0.273%, speedup 5-6%
Change-Id: I4b11ae8296cca6c2a9f644be7e40de7c423b8330
It appears that the above/left mb_skip_coeff used during
the pick modes, is left over from the previously
encode frame. This patch initializes the flag to the default
value of zero.
Change-Id: Ida4684cc99611d6e3e82628db35ed717e28ce550
+ disable() -> disable_feature() for balance
this avoids shadowing the bash builtin 'enable' allowing the scripts to
be linted with checkbashisms
Change-Id: Ia11cf86c92ec25bd14e69427b0ac0a9a61a5f7a5
the final macroblock rows are scheduled in the main thread. prior to
this change one additional macroblock row would be scheduled in the
worker forcing the main thread to wait before finishing.
Change-Id: I05f3168e5c629b898fcebb0d77eb6d6a90d6105e
Currently, the best quality mode in VP9 is not very well developed,
and unnecessarily makes the encode too slow. Hence the command line
default is changed to "good" quality. Also, the number of passes
default is changed to 2 passes as well, since 1-pass encoding is
not very efficient in VP9.
Besides, a number of VP9 defaults are set to the currently
recommended settings. With these changes, vpxenc
run with --codec=vp9 --kf-max-dist=9999 --cpu-used=0 should
work about the same as our borg results.
Note when the --cpu-used=0 option is dropped there will be a slight
difference in the output, because of a difference in the cpu-used
value for the first pass. Specifically, the default when unspecified
is to use cpu_used=1 for the first pass and cpu_used=0 for the
second pass. But when specified, both passes will use the cpu-used
value specified.
Note that this also changes the default for VP8 as being "good"
but other options stay unchanged.
Change-Id: Ib23c1a05ae2f36ee076c0e34403efbda518c5066
Adding set_contexts contexts function and call it instead of
set_contexts_on_border. Calling txfrm_block_to_raster_xy to get aoff and
loff.
Change-Id: I41897e344afd2cae1f923f4fdbe63daccf6fe80e
Moving foreach_predicted_block_in_plane function to vp9_reconinter.c
because there is only one usage.
Change-Id: I9852feae43fc3cf809b817fc541d043bc5496209
Updating implementation of vp9_get_pred_context_single_ref_p2 using
has_second_ref function to make code easier to read.
Change-Id: I5ba642712f59861a48aab974e73aa01640d086fe
vp9_short_idct10_8x8_add is used to handle the block that only have valid data
at top left 4x4 block. All the other datas are 0. So we could cut several
unnecessary calculations in order to save instructions.
Change-Id: I34fda95e29082b789aded97c2df193991c2d9195
Check the minimum rate-distortion cost of regular quantization and
all zero coeffs cases in the sub8x8 inter prediction rd loop for
luma components. Use this as the cumulative rdcost sent to UV rd
estimation.
Change-Id: Ia4bc7700437d5e13d7cdad4cf9ae57ab036d3e97
Cleans up the switchable filter search logic. Also adds a
speed feature - a variance threshold - to disable filter search
if source variance is lower than this value.
Results: derfraw300
threshold = 16, psnr -0.238%, 4-5% speedup (tested on football)
threshold = 32, psnr -0.381%, 8-9% speedup (tested on football)
threshold = 64, psnr -0.611%, 12-13% speedup (tested on football)
threshold = 96, psnr -0.804%, 16-17% speedup (tested on football)
Based on these results, the threshold is chosen as 16 for speed 1,
32 for speed 2, 64 for speed 3 and 96 for speed 4.
Change-Id: Ib630d39192773b1983d3d349b97973768e170c04
Changes to code to auto select a partition size range
based on data from spatial neighbors.
Now looks at the sb_type in each 8x8 block of above
and left SB64.
The effect on speed 1 is now weaker giving better
quality but less speed gain. Now also used in speed 2.
Change-Id: Iace33a97d5c3498dd2a9a8a4067351941abcbabc
Updating implementation of vp9_get_pred_context_single_ref_p1 using
has_second_ref function to make code easier to read.
Change-Id: Ie8f60403a7195117ceb2c6c43176ca9a9e70b909
As the pixel values beyond image border are duplicates of pixels
on edge, the change limits the mv search range, any mv beyond
the limits no longer produce new/different prediction values
as entire block with pixels used for subpel interpolation are
outside image border.
Change-Id: I4c6fdf06e33c1cef1489f5470ce0fb4e5e01fb79
For certain partition size, the function poniter may not be intialized
at all. The patch prevent the call if the pointer is not set.
Change-Id: I78b8c3992b639e8799a16b3c74f0973d07b8b9ac
This commit enables early termination in the rate-distortion
optimization search loop for chroma components. When the cumulative
rd cost is above the current best value, skip the rest per-block
transform/quantization/coeff_cost and continue to the next
prediction mode.
For bus_cif at 2000 kbps, the average run-time goes down from
168546ms -> 164678ms, (2% speed-up) at speed 0
36197ms -> 34465ms, (4% speed-up) at speed 1
Change-Id: I9d3043864126e62bd0166250d66b3170d520b3c0
Updating all foreach_transformed_block_visitor functions to work with
plane block size instead of general block. Removing a lot of duplicated
code.
Change-Id: I6a9069e27528c611f5a648e1da0c5a5fd17f1bb4
This change set is intermediate. The next one will remove all repetitive
plane_bsize calculations, because it will be passed as argument to
foreach_transformed_block_visitor.
Change-Id: Ifc12e0b330e017c6851a28746b3a5460b9bf7f0b
The intent was to initialize the deltas for the
segment to the computed value, irrespective of mode
and reference frame if (mode_ref_delta_enabled == 0).
(In response to bug posted by Manjit Hota to codec-devel
and webm-discuss lists)
Change-Id: I10435cb63d0f88359bb4c14f22181878a1988e72
Return the distortion value in vp9_rd_pick_intra_mode_sb as sum of
dist_y and dist_uv. Remove the right shift operation on dist_uv,
and make it consistent with that of vp9_rd_pick_inter_mode_sb.
Change-Id: I9d564e242d9add38e32595d33b0e0dddb1d55e5b
When the frame size changes the last frame segment map must
be resized to match and initialized to 0.
Change-Id: Idc10de109f55dbe9af3a6caae355a2974712243d
This commit makes the rate-distortion optimization search of chroma
components consistent across all block sizes. It removes redundant
codes.
Change-Id: I7e76f54d045e8efdd41d84a164c71f55b484471b
VP9_COMMON is the right place to segmentatation struct because it has
global segmentation parameters, not something specific to macroblock
processing.
Change-Id: Ib9ada0c06c253996eb3b5f6cccf6a323fbbba708
Adds a speed feature to disable split partition search based on a
given threshold on the source variance. A tighter threshold derived
from the threshold provided is used to also disable horizontal and
vertical partitions.
Results on derfraw300:
threshold = 16, psnr = -0.057%, speedup ~1% (football)
threshold = 32, psnr = -0.150%, speedup ~4-5% (football)
threshold = 64, psnr = -0.570%, speedup ~10-12% (football)
Results on stdhdraw250:
threshold = 32, psnr = -0.18%, speedup is somewhat more than derf
because of a larger number of smoother blocks at higher resolution.
Based on these results, a threshold of 32 is chosen for speed 1,
and a threshold of 64 is chosen for speeds 2 and above.
Change-Id: If08912fb6c67fd4242d12a0d094783a99f52f6c6
This commit unifies the rate-distortion cost calculation process of
luma and chroma components. It allows early termination to be enabled
later in the rd search loop of chroma components, in consistent with
luma pixels.
Change-Id: I2e52a7c6496176bf2a5e3ef338d34ceb8aad9b3d
write_ivf_file_header would incorrectly skip writing the file header in
the 2nd pass, causing the initial frame header to be overwritten on
close potential causing an overly large frame header to be read and a
crash.
most likely broken since:
9e50ed7 vpxenc: initial implementation of multistream support
fixes issue #585
Change-Id: I7e863e295dd6344c33b3e9c07f9f0394ec496e7b
Making foreach_transformed_block_in_plane more clear (it's not finished
yet). Using explicit tx_size variable consistently instead of
(ss_txfrm_size / 2) or (ss_txfrm_size >> 1) expression.
Change-Id: I1b9bba2c0a9f817fca72c88324bbe6004766fb7d
The macro block mode info context originally contained an
entry for each 16x16 macroblock. In VP9 each entry refers
to an 8x8 region not a macro block, so the naming is misleading.
This first stage clean up changes the names of 3 entries in the
structure to remove the mb_ prefix.
TODO clean up the nomenclature more widely in respect of
mbmi and bmi.
Change-Id: Ia7305c6d0cb805dfe8cdc98dad21338f502e49c6
The conversion was done with the help of the checkbashisms script
and https://wiki.ubuntu.com/DashAsBinSh .
Change-Id: Id64ecefb35c8d72302f343cd2ec442e7ef989d47
Don't do vertical or horizontal splits if subsize < min_partition_size,
except for edge blocks where it makes sense.
Change-Id: I479aa66ba1838d227b5de8312d46be184a8d6401
Enable SSE2 implementation of high precision 32x32 forward DCT. The
intermediate stacks are of 32-bits. The run-time goes down from
32126 cycles to 13442 cycles.
Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
Adding function build_inter_predictors_for_planes to build inter
predictors for specified planes. This function allows to remove
condition "#if CONFIG_ALPHA" and use MAX_MB_PLANE for general case.
Renaming 'which_mv' local var to 'ref', and 'weight' argument to 'ref'.
Change-Id: I1a97160c9263006929d38953f266bc68e9c56c7d
Invert loops to operate vertically in the inner loop. This allows
removing redundant loads.
Also add preloading of data.
Change-Id: I4fa85c0ab1735bcb1dd6ea58937efac949172bdc
Each iteration of the horizontal loop reuses 7 of the 11 source
values. Loading only the 4 new values saves some time.
Also add preload for source data.
Overall 4% faster on Chromebook.
Change-Id: I8f69e749f2b7f79e9734620dcee51dbfcd716b44
'skippable' can remain unset and negatively affect later decisions
address one aspect of issue #599
Change-Id: Iffdf0ac2e49ac481c27dc27c87fa546d4167bb28
Loop filter configuration doesn't belong to macroblock, so moving it from
MACROBLOCKD to VP9_COMMON. Also moving the declaration of loopfilter struct
from vp9_blockd.h to vp9_loopfilter.h.
Change-Id: I4b3e34be9623b47cda35f9b1f9951f8c5b1d5d28
The memset sets 16 bytes rather than the correct size of the
final array dimension (MAX_MODE_LF_DELTAS).
(In response to bug posted by Manjit Hota to codec-devel
and webm-discuss lists)
Change-Id: I8980f5aa71ddc9d7ef57c5b4700bc28ddf8651c7
The mix use of double type and simd code caused invalid values stored
in double variables, further caused unit tests to fail. The failures
were only observed on x86-win32-vs9 build with vs2008.
Change-Id: If0131754a3bf217a5ace303b7963e8f5162c34b5
Adds a new subpel motion estimation function that uses a 2-level
tree-structured decision tree to eliminate redundant computations.
It searches fewer points than iterative search (which can search
the same point multiple times) but has the same quality roughly.
This is made the default setting at speeds 0 and 1, while at
speed 2 and above only a 1-level search is used.
Also includes various cleanups for consistency and redundancy removal.
Results:
derf: +0.012% psnr
stdhd: +0.09% psnr
Speedup of about 2-3%
Change-Id: Iedde4866f5475586dea0f0ba4cb7428fba24eee9
Different partitionings were not being evaluated against
best_rd and there were unnecessary calls to RDCOST. This
could have resulted in a non-optimal partioning being
selected.
I simplified the variables used to track the rate,
distortion and RD values throughout the function.
Change-Id: Ifa7085ee80d824e86791432a5bc6d8fea5a3e313
Using block width and block height instead of their logarithms. Using
SUBPEL_BITS and SUBPEL_SHIFTS constants instead of magic numbers.
Change-Id: I4e10e93c907c8a5e1cb27dfe74d1fcdcc4995448
The low precision 32x32 fdct has all the intermediate steps within
16-bit depth, hence allowing faster SSE2 implementation, at the
expense of larger round-trip error. It was used in the rate-distortion
optimization search loop only.
Using the low precision version, in replace of the high precision one,
affects the compression performance by about 0.7% (derf, stdhd) at
speed 0. For speed 1, it makes derf set down by only 0.017%.
Change-Id: I4e7d18fac5bea5317b91c8e7dabae143bc6b5c8b
Removing the old one bsize_from_dim_lookup. Now we have a way to determine
block size for plane using its subsampling values (ss_size_lookup). And
then we can find the number of pixels in the block (num_pels_log2_lookup).
Change-Id: I6fc981da2ae093de81741d3d78eaefed11015db9
Removes some unused code and speed features, and organizes the
interfaces for fractional mv step functions for use in new speed
features to come.
In the process a new speed feature - number of iterations per
step during the subpel search - is exposed.
No change when this parameter is set as the original value of 3.
Results:
subpel_iters_per_step = 3: baseline
subpel_iters_per_step = 2: psnr -0.067%, 1% speedup
subpel_iters_per_step = 1: psnr -0.331%, 3-4% speedup
Change-Id: I2eba8a21f6461be8caf56af04a5337257a5693a8
Functions scale_mv_q4 and scale_mv_q3_to_q4 were almost identical except
q3->q4 conversion in scale_mv_q3_to_q4. Now q3->q4 conversion happens
directly in vp9_build_inter_predictor.
Also adding useful constants: SUBPEL_BITS and SUBPEL_MASK.
Change-Id: Ia0a6ad2ac07c45fdf95a5139ece6286c035e9639
Enable use_x86inc as a commandline option. Fix Bug with sse2 when
x86inc is disabled. Adds Sad asm protection to x86inc protection
Change-Id: Iee0f9dd235ea10e8ace512eb362ba9bebe8c9df6
Adds a few pattern searches to achieve various tradeoffs
between motion estimation complexity and performance.
The search framework is unified across these searches so that a
common pattern search function is used for all. Besides it will
be easier to experiment with various patterns or combinations
thereof at different scales in the future.
The new pattern search is multi-scale and is capable of using
different patterns at different scales.
The new hex search uses 8 points at the smallest scale
and 6 points at other scales.
Two other pattern searches - big-diamond and square are
also added. Big diamond uses 4 points at the smallest scale and
8 points in diamond shape at the larger scales.
Square is very similar conceptually to the default n-step search
but is somewhat faster since it keeps only one survivor across
all scales.
Psnr/speed-up results on derf300:
hex: -1.6% psnr%, 6-8% speed-up
big-diamond: -0.96% psnr, 4-5% speedup
square: -0.93% psnr, 4-5% speedup
Change-Id: I02a7ef5193f762601e0994e2c99399a3535a43d2
There was no benefit having this function. For example, inside
read_switchable_filter_type switchable filter context was calculated twice.
Change-Id: I79cd5bf95cbc0f6d8bf91a2e32289e01b18dcff1
Converting arguments of two functions (clamp_mv_ref, lower_mv_precision)
from int_mv* to MV*. Rewriting is_inside function to make it much shorter.
Change-Id: Ie4c4cf3eccd46707c7df099ec21fb1b61c72fc7a
This is in preparation for the SSE2 version of the high-precision
32x32 forward DCT which will share a lot of code with the existing
low precision version used for rate-distortion search.
Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0
Support enabling it or disabling it. Moved read out to configure.sh
so that its done once instead of in make and in config.
Change-Id: I73a9190cf31de9f03e8a577f478fa522f8c01c8b
Adds a speed feature to skip all intra modes other than
DC_PRED if the source variance is small. This feature is
made part of speed 1 and up.
Results on derf300: psnr -0.07%, speedup about 1-2%
Also uses the source variance to fine-tune the early
termination criteria when FLAG_EARLY_TERMINATE is on.
This feature is made part of speed 2 and up.
Results on derf300: psnr -0.52%, speedup about 5-7%
Change-Id: I59e38aa836557cfa5405ae706fc64815cbfe4232
Currently the only threaded option for vp9 decode. Enabled when the
decoder config thread count is > 1.
Change-Id: I082959abac9e31aa4a38ed9fd68b94680e57f4df
This changeset allows to remove vp9_switchable_interp and
vp9_switchable_interp_map arrays and make code much clear. Actually we
still have to use these mapping but only inside read_interp_filter_type and
write_interp_filter_type functions.
Change-Id: I4026c6f8c4acefba6c81421b7bacbaa52cc45f50
Chromium does not support 32bit builds for Mac which use x86inc.asm.
Make the files which include it work if 64bit or not PIC enabled
starting with vp9_copy_sse2.asm
Consolidate these targets in vp9_rtcd_defs.sh
Change-Id: If18f0b957a611efd085a3ee7d245cf1eb91e8248
This is an attempt at rewriting vp9_find_mv_refs_idx. I believe that it gains
about 1-2% decode speed
Change-Id: Ia5359c94ce9bb43b32652890e605e9a385485c1b
Loading to single lanes in multiple registers is expensive since
it requires a read and write of each register which saturates
the register file access. Loading to single registers followed
by a separate transpose reduces this pressure.
Change-Id: I4cc35887ddbca80e5e635b50d2b1d158de9668ee
Removing assign_and_clamp_mv function, making implementation of clamp_mv
and clamp_mv2 more clear and consistent.
Change-Id: Iecd08e1c1bf0379f8314ebe01811f8253f4ade58
Adds a function to compute source variance for various
sb_types to be used for pruning mode and partition searches.
[The existing activity measure function is currently specialized
for only 16x16 MBs and needs to be updated].
Change-Id: I22a41e6f1430184201487326fdbebb9b47e6fc24
The inverse 32x32 transform detects all zero entries and skips the
computations accordingly per 8 rows in the first 1-D operation. The
function vp9_short_idct10_32x32_add performs differently and is not
used anywhere, hence removed.
Change-Id: Ic4fad422debbde7b6b6ffed47c69fbd4268a906c
If the partition is out of partition size range, we don't
need to process small partition information.
Change-Id: Ice9bfbbdebe1f2ef79271a3aee17de0ed4608376
use_min_partition_size and use_max_partition_size are not used
currently, and could be added back if needed later.
Change-Id: Ib22a9c06b064567a7c1d6d5445567ed77e0d3acc
The function name rd_pick_intra4x4mby_modes is confusing, so
I changed it to rd_pick_intra_sub_8x8_y_modes to better
reflect what the function does. Also added const qualifiers
to some of the input parameters and removed camel-case.
Change-Id: I23d53d4c7af5d79ed8a471acd59a09bbb47add39
This commit exploits the sparsity of quantized coefficient matrix.
It detects each 32x8 array and skip the corresponding inverse
transformation if all entries are zero.
For ped1080p at 8000 kbps, this on average reduces the runtime of
32x32 inverse 2D-DCT SSE2 function from 6256 cycles -> 5200
cycles. It makes the overall encoding process about 2% faster at
speed 0. The speed-up is more pronounceable for the decoding process.
Change-Id: If20056c3566bd117642a76f8884c83e8bc8efbcf
This commit removes redundant arguments passing in the function of
rd_pick_reference_frame. This resolves the clang warnings about
potential use of uninitialized values.
Change-Id: Ic68f949a9f8fcd0a583786b0c75321104ea44739
Refactor the frame buffer referencing in choose_partition and make
it consistent with other places. This means to prevent potential
issues when we extend reference frame buffer.
Change-Id: I5ff33ed5f671e1f4cc7049622212769a9b4578d9
The tokenize_b function is only called when output flag is on. Hence
removing the conditional branch on it therein.
Change-Id: Ib709f47f23f39ca05a695faf86fa3377f11f2dd0
This commit optimizes the tokenization and detokenization operational
flow for speed-up. It makes the coding process about 0.3% faster at
speed 0.
Change-Id: I28008df7482874e4b5f237f2d418ff82a249dd56
This commit makes the encoder skip the redundant tokenization process
in the rate-distortion optimization search loop, while updating the
entropy contexts accordingly. It makes the speed 0 encoding process
about 0.5% faster at no performance change.
Change-Id: I34a4155a0b5332afeb45c93a51c7f35a294d685c
This commit provides special handle on 16x16 inverse 2D-DCT, where
only DC coefficient is quantized to be non-zero value.
Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c
Fixes a warning on MSVS 2012 where the alignment of vp9_default_iscan_8x8
didn't match between its declaration and definition.
Change-Id: I1466a15635f4b22594d705d570b7e399bfb6cf21
This allows us to increment the position at the band-level only as
we go from one band to the next; more importantly, that allows us to
use an add instead of multiply instruction, and omit the instruction
altogether if the band doesn't change from one coef to the next, thus
being slightly faster (probably more noticeable on systems where a
multiply is expensive, like arm).
Change-Id: I4343fe35b9f9a47fa00b217bdcbf5f91ff96c381
This commit brought back the shortcut implementation of 8x8/16x16
inverse 2D-DCT. When the eob <= 10, it skips the inverse transform
operations on row 4:7/4:15 in the first round. For bus_cif at 1000
kbps, this provides about 2% speed-up at speed 0.
Change-Id: I453e2d72956467d75be4ad8c04b4482ab889d572
This commit enables a special handle for the 8x8 inverse 2D-DCT,
where only DC coefficient is quantized to be non-zero. For bus_cif
at 2000 kbps, it provides about 1% speed-up at speed 0.
Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011
Speed feature experiment to set an upper and lower
partition size limit based on what has been seen
in spatial neighbors.
This seems to gives quite reasonable speed gains in local
(10-15%) and when used with speed 0 the losses are small
(0.25% derf, 0.35% stdhd). However, for now I am only
enabling it on speed 1 as there may be clashes with the existing
temporal partition selection in speed 2.
Using a tighter min / max around the range derived from the
neighbors increases speed further but at the cost of a
bigger quality loss. However, I think this spatial method could
be combined with data from either the last frame or a variance
method (or both) to refine the range of minimum and maximum
partition size. I.e. consider the min and max from spatial and
temporal neighbors and the variance recommendation.
Change-Id: I1b96bf8b84368d6aad0c7aa600fe141b4f07435f
Used 3 * standard_deviation in internal threshold calculation
instead of fit curve. This actually approached the algorithm
better.
For comparison, similar tests were done:
The overall psnr loss is less than before.
1. derf set:
when static-thresh = 1, psnr loss is 0.329%;
when static-thresh = 500, psnr loss is 0.970%;
2. stdhd set:
when static-thresh = 1, psnr loss is 0.922%;
when static-thresh = 500, psnr loss is 1.307%;
Similar speedup is achieved. For example,
clip bitrate static-thresh psnr time
akiyo(cif) 500 0 48.952 5.077s(50f)
akiyo 500 500 48.866 4.169s(50f)
parkjoy(1080p) 4000 0 30.388 78.20s(30f)
parkjoy 4000 500 30.367 70.85s(30f)
sunflower(1080p) 4000 0 44.402 74.55s(30f)
sunflower 4000 500 44.414 68.69s(30f)
Change-Id: Ic78833642ce1911dbbd1cb6c899a2d7e2dfcc1f3
Now read_inter_mode_info calls read_intra_block_part (renamed from
read_intra_block_modes) or read_inter_block_part (just added).
Change-Id: I541badea6b663e0ae692ec158665efb90ed20c03
This option exists in VP8, and it was rewritten in VP9 to support
skipping on different partition levels. After prediction is done,
we can check if the residuals in the partition block will be all
quantized to 0. If this is true, the skip flag is set, and only
prediction data are needed in reconstruction. Based on DCT's energy
conservation property, the skipping check can be estimated in
spatial domain.
The prediction error is calculated and compared to a threshold.
The threshold is determined by the dequant values, and also
adjusted by partition sizes. To be precise, the DC and AC parts
for Y, U, and V planes are checked to decide skipping or not.
Test showed that
1. derf set:
when static-thresh = 1, psnr loss is 0.666%;
when static-thresh = 500, psnr loss is 1.162%;
2. stdhd set:
when static-thresh = 1, psnr loss is 1.249%;
when static-thresh = 500, psnr loss is 1.668%;
For different clips, encoding speedup range is between several
percentage and 20+% when static-thresh <= 500. For example,
clip bitrate static-thresh psnr time
akiyo(cif) 500 0 48.923 5.635s(50f)
akiyo 500 500 48.863 4.402s(50f)
parkjoy(1080p) 4000 0 30.380 77.54s(30f)
parkjoy 4000 500 30.384 69.59s(30f)
sunflower(1080p) 4000 0 44.461 85.2s(30f)
sunflower 4000 500 44.418 78.1s(30f)
Higher static-thresh values give larger speedup with larger
quality loss.
Change-Id: I857031ceb466ff314ab580ac5ec5d18542203c53
Removing unused constants, macros, and function declarations. Using
ROUND_POWER_OF_TWO macro, vp9_zero, vp9_copy where possible. Moving
#include from *.h to *.c. Merging for loops for motion vectors.
Change-Id: Ic3bf841764a2bb177128bb3a6d7aa8f68229cd13
Simplified the code that extracts and uses the motion
vectors for the 4 sub-partitions in rd_pick_partition.
Change-Id: Iaf698ef7ee3aef9edd59015e1ae065dd359b17d9
This commit makes the initialization of trellis coeff optimization
a per-plane operation, thereby eliminating the redundant steps in
encode_sby and encode_sbuv. It makes the encoder at speed 0 slightly
faster.
Change-Id: Iffe9faca6a109dafc0dd69dc7273cbdec19b17cd
The feature that uses small partition results as a measure to skip
mode evaluation at larger partition requires the flags to be reset.
The reset was missing in the code path that calls rd_use_partition().
Change-Id: Ia0a3a0aee1a862b6e2333d596808db7c48033d50
Add SSE2 implementation to handle the special case of inverse 2D-DCT
where only DC coefficient is non-zero.
Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f
Adding common merge_probs and merge_probs2 functions. Changing ints to
usigned ints in some places.
Change-Id: Icf088ffdea7cf5b95284a128916409bdd53506b0
Prior to 73c4e284, the generated .sln files didn't contain any
information about the different configurations when using .vcxproj
project files. The MSVS IDE was able to fill this in just fine when
loaded though.
When building for ARM, the obj_int_extract project still is built
for x86, in order for the build process to be able to use
obj_int_extract.exe.
Now that configuration info is generated, it breaks current ARM
setups, since the configurations generated by gen_msvs_sln.sh only
included configurations from the last parsed project file (as
mentioned in the comment).
In these setups, the MSVS IDE generated a third meta-platform, called
"Mixed Platforms". This meta-platform points to either ARM or
Win32 as platform in each of the individual projects.
When the MSVS IDE generated this automatically, it also included
the original ARM and Win32 platforms as separate choices, but these
can be omitted since they don't make sense.
Change-Id: Ie25226496f91af4bb1ad8eb9ae9ca5bfed0433d7
Adding plane type check condition because it was always used outside of
get_tx_type_{4x4, 8x8, 16x16}.
Change-Id: I02f0bbfee8063474865bd903eb25b54d26e07230
Although local copies of the mode member variables
(mode, ref_frame) were made, they were not used in
all places. Also, made a local copy of the
second_ref_frame member.
Change-Id: I84d8c822e5cb3d8a02fc3de8a4037ca3fea8bfad
Prevents doing duplicate IDCTs; encoding of first 50 frames of bus
(speed 0) @ 1500kbps goes from 1min4.0 to 1min3.5, i.e. 0.87% faster
overall.
Change-Id: I2df39e29ed9d5ea5e7d2704a34940ba622832ddd
Encoding time of first 50 frames of bus (speed 0) @ 1500kbps goes from
1min5.4 to 1min4.0, i.e. 2.2% faster overall.
Change-Id: I8c32f2aff9a649ce7dd49d910dc5ba16b99c3bc6
Counts are separate from frame context. We have several frame contexts but
need only one copy of all counts.
Change-Id: I5279b0321cb450bbea7049adaa9275306a7cef7d
The struct optimize_block_args is defined same as encode_b_args.
Remove this redundant definition, and use encode_b_args consistently.
Change-Id: I1703aeeb3bacf92e98a34f4355202712110173d9
All filters are interpolating now, so we don't need this array, all
values from this array are evaluated to true.
Change-Id: I9af6d8219ae0eb984063cd15e4e2296374ae4961
The xform_quant() module is only used by inter modes, hence removing
the redundant switches therein conditioned on tx_type.
Change-Id: Ib87ce5b2f2e4cbf3ceb133a1108afa173c933a3f
When all the transform coefficients were quantized to zero, skip
the inverse transform operation. For bus_cif at 1000 kbps, the
runtime goes from 154967ms -> 149842ms, i.e., about 3% speed-up,
at speed 0.
Change-Id: Ic0a813fff5e28972d4888ee42d8747846a6c3cc6
many structures use bw and bh and they have different meanings. This cl attempts
to start this clean up and remove unneccessary 2 step look up log and then
shift operations...
also removed partition type multiple operation code in bitstream.c.
Change-Id: I7e03e552bdfc0939738e430862e3073d30fdd5db
Renamed:
MAX_MB_SEGMENTS to MAX_SEGMENTS
MB_SEG_TREE_PROBS to SEG_TREE_PROBS
The minimum unit for segmentation in the segment map
is now 8x8 so it is misleading to use MB_ as macro-block
traditionally refers to a 16x16 region.
Change-Id: I0b55a6f0426bb46dd13435fcfa5bae0a30a7fa22
Moving common encoder/decoder code to update_tx_counts. Also renaming
vp9_get_pred_probs_tx_size to get_tx_probs2 and adding get_tx_probs to
call vp9_get_pred_context_tx_size inside read_selected_tx_size only once
(twice before).
Change-Id: Ia50247f3893de88ef8e9041b0d44be44a40aaa4d
Update logic for both mode and mvref was the same, so using MODE_COUNT_SAT,
MODE_MAX_UPDATE_FACTOR, update_ct, update_ct2 for both cases. Removing
function update_tx_ct because it was identical to update_mode_ct2.
Change-Id: Iff566be27dbd6cde4c2ec04e8d988f207046b8f0
Optional change in diamond search to continue in the best move
direction until that move turns worse.
This is still WIP since the exact way the new method is to be used is
under investigation. One option is to make it an option in diamond
search and use it only when motion is large.
Overall slightly positive on derfraw300 +0.02%, stdhdraw +0.13%,
but works a lot better for high motion sequences (ex. football : +1%).
Change-Id: If88e01a6021daa0cda934680cdc70be1ee04f798
Stack the rate-distortion statistics in the sub8x8 rd loop. This allows
the encoder to skip the forward transform, quantization, and coeff cost
estimation, in the sub8x8 rd optimization search, if the motion
vector(s) are of integer pixel value, and have been tested in the
previous prediction filter type rd loops of the same block.
This gives about 2% speed-up for bus_cif at 2000 kpbs, for speed 0.
Its efficacy depends how frequently the motion search will select an
integer motion vector.
Change-Id: Iee15d4283ad4adea05522c1d40b198b127e6dd97
Mode search order in rd loop changed to better reflect
observed hit counts.
Also some adjustment of the baseline mode rd thresholds
to reflect the order change and observed frequencies.
Change-Id: I47a131cc83e11551df8add6d6d8d413d78d3a63c
When CONFIG_POSTPROC is set there was a now
invalid reference to cm->filter_level.
Changed to cpi->mb.e_mbd.lf.filter_level in line with
change Iaf5fb71c33719cdfa1b991f671caf071be9ea035
Change-Id: If746e60044903f7ba8d0d346225b3d015226c7d0
This commit allows the encoder to skip a few buffer update steps in
rd_pick_best_mbsegmentation, when early breakout has been triggered
in the rd_check_segment_txsize. It provides about 1% speed-up for
bus_cif at 2000 kbps, in the settings of speed 0.
Change-Id: Ica034f10a24dec572b397d8389a2b81020ebc0b9
At speed 2, due to the threshold scheme used, it is possible the rate
and distortion assigned with INT_MAX value. The patch added checking
to prevent the INT_MAX value is used in further calculation of RD
scores. The patch also changed the assertion in rd_use_partition() to
be mirror similar assertion in rd_pick_partition().
Change-Id: Idb52c543cc1e10abdf6e6a5d6e9cb535a42214dc
Adding loopfilter struct with fields from MACROBLOCKD and VP9Common.
Eventually it will be moved to vp9_loopfilter.h for better code structure.
Change-Id: Iaf5fb71c33719cdfa1b991f671caf071be9ea035
Changing fc->tx_probs back to fc->pre_tx_probs. This change actually
affects the bitstream but current test vectors work. Chrome branch is not
affected at all. Broken since:
cc662dd Adding struct tx_probs and struct tx_counts to cleanup the code.
Change-Id: I36dd4b3678e902e10aba8dd49b0012eb558c209d
This patch modifies the auto_mv_step_size speed feature to
use a combination of the maximum magnitude mv from the last
inter frame, and the maximum magnitude mv for the two reference
mvs with the same reference. For arf frames, the max mav step
for the resolution is used.
The bounds therefore are slightly tighter. The feature is made
a speed 1 feature.
Rebased.
Results (when this feature is turned on over speed 0):
derfraw300: -0.046% psnr, about 5+% speedup
(tested on football: goes from 4m30.760s to 4m17.410s).
Change-Id: If492797a61b0b4b3e58c0b8f86afb880165fc9f6
Renaming vp9_sb_mv_ref_tree to vp9_inter_mode_tree, and
vp9_sb_mv_ref_encoding_array to vp9_inter_mode_encodings.
Change-Id: I0e91fbf81350d3ec5a2599064c74089b5d06133a
We would skip the rectangular blocks for sub8x8 partitions because
we would conclude that PARTITION_NONE was better than PARTITION_SPLIT,
however, that conclusion was made before we actually really tested
PARTITION_SPLIT.
Change-Id: I8fa91e59894badc1d8cee3ba8a49e40ae4c4a489
This prevents a duplicate memcpy of a 128-byte struct every time
set_scale_factors() is called (which is a lot), thus leading to a
decrease from 3.7 MB to 1.85 MB of struct copying per 64x64 block
RD/partition loop.
Overall, this decreases encoding time of the first 50 frames of bus
@ 1500kbps (speed 0) from 1min5.9 to 1min4.9, i.e. about a 1.5%
overall speedup. We can likely get more gains by removing the copy
of the other struct (and replacing it with an indexing) as well.
Change-Id: I3dceb7e79f71e6fe911b11cc994cf89a869dde7a
This means we only do UV intra mode selection if we find any intra
mode to actually be useful at all; in addition, we only do UV intra
mode selection for the transform sizes that were selected, rather
than all sizes available in this partition.
First 50 frames of bus @ 1500kbps (speed 0) gains about 5% with this
change.
Change-Id: I7b461eb8b803247f57896c5a9505f745b55502b3
The break statement only breaks out of the nested loop, not the
top-level loop, so it doesn't always work as intended. Changing it
to a return statement does what's intended.
Change-Id: I585419823b39a04ec8826b1c8a216099b1728ba7
these are only used in the encoder.
frames_since_golden / frames_till_alt_ref_frame -> VP[89]_COMP
Change-Id: Ie14a6f46987bced685ddb449b85dc261caba6dfe
This could happen during golden overlay frame coding from a previous
alt-ref frame if the special overlay code was triggered.
Change-Id: I3056d0c547cd26903b260ef93c94026e96bd9868
These arrays have constant values (no any updates). Removing two
corresponding memcpy calls. Making a little cleanup in vp9_entropymode.h
as well: removing redundant 'extern' keyword and moving all function
declarations at the end.
Change-Id: Ia16b38b46aec2e2500f5df29c40a297ae241dede
vp9_init_quantizer() is called in vp9_create_compressor(), and
should not be called in vp9_set_speed_features().
Change-Id: Ic2f1f4b0531b9d46bb841d7e1d8da9812207dad6
Encoding of first 50 frames of bus (speed 0) @ 1500kbps goes from
1min6.2 to 1min5.9, i.e. 0.5% faster overall.
Change-Id: I59d8a3b2f0a75010fa041d5e2646c8caac5bd683
Encode of first 50 frames of bus @ 1500kbps (speed 0) goes from
1min7.3 to 1min6.2, i.e. 1.7% faster overall.
Change-Id: I19d2deacfbffadd61d32551cee9586757ab4a987
Encode of first 50 frames of bus @ 1500kbps (speed 0) goes from 1min12.8
to 1min7.3, i.e. 8% faster.
Change-Id: Ia22d1c7b687316c553cc60eacae988b24e175b62
About 15% faster for bus (speed 0) first 50 frames @ 1500kbps, which
goes from 1min36 to 1min24. Results become slightly better (+0.2% on
derf/yt, +0.4% on hd), probably because of a bugfix for skipmode in
super_block_yrd(). Overall speed change (on derfraw300) is roughly
-13%. This can probably be improved further by caching best_yrd
between partition searches. Also, we might be able to get more
speedups by always doing PARTITION_NONE before PARTITIONS_SPLIT, not
just at the sb8x8 level.
Change-Id: I83736949ebd5b4a3b400ee688d7661913fefc98b
Current partition checking starts from small sizes, and then goes up
to large sizes. This experiment uses the small partitions' motion
estimation result, which is already available, to speed up the
large partition's motion estimation. We can decide to skip some
patition checkings if they are unlikely choices. We could use the
motion vector(MV) result as current partition's prediction MV, limit
the search range and reference frame.
Current result at speed 1:
psnr loss: 1.19% for stdhd, 0.287% for derf.
speed gain: 14% for sunflower(hd), 11% for akiyo.
Further improvement will be done later.
Change-Id: I5abfd070e9cace2e91e2a0247d1325df313887ab
Call the individually optimized horizontal and vertical functions. This
implementation abuses the temp buffer.
This will be replaced with a custom optimized function.
Over 2x speedup.
Change-Id: I5b908d2a73d264e9810d6022bbff73207a3055dd
Use an estimate based on DC_PRED for intra uv cost
within the rd loop then only do a full uv mode analysis
if an intra mode is chosen.
Significant speed gains in some cases. Currently only
enabled for speed 2 pending speed/quality tests.
Change-Id: Ie851a12400d5483bce47ec0e3ccb8516041e91c0
This commit makes the encoder to perform motion search only once
per reference frame type for each 4x4/4x8/8x4 block. For bus_cif
at 2000 kbps, the runtime goes from 253812ms -> 217817ms
(14% speed-up) for speed 0.
Change-Id: I5f17599ccc8cfaf93ccb4f98fcb6008af6d79e92
Removing tile_rows and tile_columns from VP9Common, removing redundant
constants MIN_TILE_WIDTH and MAX_TILE_WIDTH, changing signature of
vp9_get_tile_n_bits.
Change-Id: I8ff3104a38179b2c6900df965c144c1d6f602267
Cosmetic code changes, renaming 'flat' local var to 'mask', removing
unused field 'blim' from loopfilter_info_n and loop_filter_info structs.
Change-Id: I51e6ccf727fe361ad9a08e29e1201aa7abd4987f
This commit enables SSE2 implementation of 16x16 inverse ADST/DCT
hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles.
This provides about 1% encoding speed-up at speed 0.
Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b
Making implementation of vp9_set_pred_flag_{seg_id, mbskip} consistent
with vp9_get_segment_id without using confusing sub(a, b) macro. Passing
mi_row and mi_col to functions explicitly instead of replying on
mb_to_right_edge and mb_to_bottom_edge.
Change-Id: I54c1087dd2ba9036f8ba7eb165b073e807d00435
In the prior code, the above context pointers used for entropy
decoding were initialized on the first frame, and not updated when
the frame size changed. The per-frame code which initializes the
contexts assumes that the contexts are contiguous, leading to an
incomplete initialization when the frame is smaller. This commit
updates the pointers so that the context is contigous whenever
the frame size changes.
Change-Id: I08b53e3a30c8289491212311682ff1b8028cff6c
This is a short term optimization till we work out a decoder
implementation requiring no frame border extension.
Change-Id: I02d15bfde4d926b50a4e58b393d8c4062d1be70f
This is required because upon downscaling, if a motion vector points
partially into the UMV (e.g. all minus 1 of 64+7 pixels, i.e. 70),
then we can point up to 140 pixels into the larger-resolution (2x)
reference buffer UMV, which means the UMV for reference buffers in
downscaling needs to be 140 rounded up to the nearest multiple of 32,
i.e. 160.
Longer-term, we should probably handle the UMV differently by detecting
edge coverage on-the-fly and using a temporary buffer for edge extensions
instead of adding 160 pixels on all sides of the image (which means a
CIF image uses 3x its own area size for borders).
Change-Id: I5184443e6731cd6721fc6a5d430a53e7d91b4f7e
Cycle times:
4x4: 151 to 131 cycles (15% faster)
8x8: 334 to 306 cycles (9% faster)
16x16: 1401 to 1368 cycles (2.5% faster)
32x32: 7403 to 7367 cycles (0.5% faster)
Total encode time of first 50 frames of bus @ 1500kbps (speed 0)
goes from 1min39.2 to 1min38.6, i.e. a 0.67% overall speedup.
Change-Id: I799a49460e5e3fcab01725564dd49c629bfe935f
Renaming flatmask4 to flat_mask4, flatmask5 to flat_mask5, hevmask to
hev_mask, filter to filter4, mbfilter to filter8, wide_mbfilter to
filter16.
Change-Id: Ic61c73e59c2eee505257584867aafac99833cea1
Also inline some of the block calculations to assist the compiler to
not do silly things like calculating the same offset (or converting
between raster/transform block offset or block, mi and pixel unit)
many, many, many times.
Cycle times:
4x4: 584 -> 505 cycles (16% faster)
8x8: 1651 -> 1560 cycles (6% faster)
16x16: 7897 -> 7704 cycles (2.5% faster)
32x32: 16096 -> 15852 cycles (1.5% faster)
Overall, this saves about 0.5 seconds (1min49.8 -> 1min49.3) on the
first 50 frames of bus (speed 0) @ 1500kbps, i.e. 0.5% overall.
Change-Id: If3dd62453f8e2ab9d4ee616bc4ea956fb8874b80
Removing unused DEC_DEBUG define and dec_debug variable. Changing function
signatures to eliminate code duplication, renaming function
mb_init_dequantizer to init_dequantizer. Also removing redundant curly
braces, and comments.
Change-Id: Ia56ee1b0be5f24abb0e878581845be8a4773c298
Change the mbfilter Neon code from executing both branches if all
vectors follow only one branch.
The code is about 5% faster when executing only one branch and about
1% slower when executing both branches.
-PS5: Remove local stack space from mbfilter.
Change-Id: I6a23f9b318a9f4568a2718b4c9348db988fe2182
Skip the inverse transform and reconstruction of inter-mode coded
blocks in the rate-distortion optimization loop, when skip_encode_sb
feature is turned on. This provides about 1% speed-up at speed 0,
and 1.5% speed-up at speed 1. No performance change in both settings.
Change-Id: I2932718bf4d007163702b61b16b6ff100cf9d007
This speed feature allows the encoder to largely remove the spatial
dependency between blocks inside a 64x64 superblock, thereby removing
the need to repeatedly encode superblocks per partition type in the
rate-distortion optimization loop.
A major challenge lies in the intra modes tested in the rate-distortion
optimization loop. The subsequent blocks do not have access to the
reconstructed boundary pixels without the intermediate coding steps.
This was resolved by using the original pixels for intra prediction
in the rd loop, followed by an appropriately designed distortion
modeling on the quantization parameters. Experiments also suggested
that the performance impact is more discernible at lower bit-rate/psnr
settings. Hence a quantizer dependent threshold is applied to deactivate
skip of block coding.
For bus_cif at 2000 kbps,
speed 0: runtime 269854ms -> 237774ms (12% speed-up) at 0.05dB
performance loss.
speed 1: runtime 65312ms -> 61536ms, (7% speed-up) at 0.04dB
performance loss.
This operation is currently turned on in settings of speed 1.
Change-Id: Ib689741dfff8dd38365d8c1b92860a3e176f56ec
This commit enables SSE2 implementation of 8x8 inverse ADST/DCT
transform. The runtime goes from 1216 cycles -> 266 cycles.
For bus_cif at 2000 kbps, the overall runtime reduces from
253707ms -> 248430ms, i.e., 2% speed-up at speed 0.
Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb
Adding missed parenthesis around boolean expressions. Bitstream is changed.
Regenerating test vectors.
Change-Id: I4cc00b761e9473f92f180a9fc3a0c607f0aaae56
This function is actually called from set_offsets which is called right
before vp9_read_mode_info.
Change-Id: Ibb9d5ad606194bc80eab264fad85b31c9dfd8f77
Super basic conversion from the other implementations. Any changes to
one should be trivial to copy over keep in sync.
Change-Id: I1720b4128e0aba4b2779e3761f6494f8a09d3ea8
this was never fleshed out in the context of VP8, for which it was
added. for VP9 it has no meaning.
Change-Id: Iba2ecc026d9e947067b96690245d337e51e26eff
Implements some of the helper functions more efficiently with
lookups rathers than branches. Modeling function is consolidated
to reduce some computations.
Also merged the two enums BLOCK_SIZE_TYPES and BlockSize into
one because there is no need to keep them separate (even though
the semantics are a little different).
No bitstream or output change.
About 0.5% speedup
Change-Id: I7d71a66e8031ddb340744dc493f22976052b8f9f
For some reason iOS builds take a really long time to sort this
function out.
It's not used anywhere so remove it.
Change-Id: Ia5c8513a0d9c7eb32641cca58ca1c1113e2dd9f4
Adding segmentation struct to vp9_seg_common.h. Struct members are from
macroblockd and VP9Common structs. Moving segmentation related constants
and enums to vp9_seg_common.h.
Change-Id: I23fabc33f11a359249f5f80d161daf569d02ec03
Independent horizontal and vertical implementations.
Requires that blocks be built from 4x4 and [xy]_step_q4 == 16
6-10% improvement. CIF improved the least.
Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda
The function encode_block is called only by inter-prediction modes,
hence removing the transform type branching there.
Change-Id: I34a3172e28ce2388835efd0f8781922211bff857
This patch is in experimental but was not merged into master.
This patch swaps ptrs instead of copying and uses the
last show_frame flag instead of setting the entire buffer
to zero.
Change-Id: Ia0950466c8ba301a2a5bf917ff3d07bc1a2c2311
With sf->auto_mv_step_size on it is questionable
whether sf->reduce_first_step_size is worthwhile.
At speed 2 it was not having a big impact.
Even at speed 2 sf->optimize_coefficients = 0 is not
having a big speed imapct so for now I have moved it
down into a higher speed setting.
Change-Id: I8a54de76d486ad37aabce76474889da2768b14c1
Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from
292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the
overall runtime of speed 0 goes from 301s to 295s (2% speed-up).
Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628
Where possible, do the 16 pixel wide filter while doing the horizontal
filtering pass. The same approach can be taken for the mbloop_filter
when that's implemented. Doing so on the vertical pass is a little more
involved, but possible.
Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320
This commit fixed the mis-use of the tx_type for inverse transform
in intra4x4 rate-distortion optimization loop. It improves the
overall coding performance.
Change-Id: I7fe9953175b74890357dbcee33c138573766e980
Adds a speed feature to eliminate full-rd computation if the modeled
rd or rd based on a different parameter in the same mode is already
a lot larger than the best rd yet.
Specifically, only search the sharp and smooth filters if the modeled
rd cost based on the regular filter is within a certain factor of the
best rd cost so far. Also, skip full-rd computation of non splitmv
inter modes if the modeled rd cost based on pred error is within the
same factor of the best rd cost so far.
Also adds some enhancements in the rd search for splitmv mode to
speed things up by early breakouts. Negligible impact on performance.
Resuts on derfraw300:
psnr: -0.013% with the splitmv enhancements, -0.24% with the rd
breakout feature on.
speedup: 6% with splitmv enhancements, 20% with also residual breakout
(tested on football sequence at 600 Kbps)
Change-Id: I37abc308ea9f110c1679ce649b6a7e73ab1ad5fc
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
no compression performance loss.
Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
Encode time of first 50 frames of bus (speed 0) @ 1500kbps goes from
2min4.9 to 2min3.1, i.e. a 1.4% speedup overall.
Change-Id: I9b25e87974430cb942caa276410bb2eda815bd83
Replace case statement with lookup.
Small speed gain at low speed settings but at speed 2+ where the
number of motion searches etc. falls the impact rises to ~3-4%.
Change-Id: Idff639b7b302ee65e042b7bf836943ac0a06fad8
Change-Id: I5940719a4a161f8c26ac9a6753f1678494cec644
It does encodings with min and max q set at 0, and check to make sure
output PSNR at MAX_PSNR (100).
Change-Id: Ia2418353cccf6e487204ea4ff874a7e71e55cb3e
- The vp9 mbfilter C code will branch on flat and mask. This CL
will perform both branches and combine the data. A later CL will
perform a check to see if all patch will take one branch.
- These functions are about 1.75 times faster than the C code on
Nexus 7.
PS #3
- Changed all functions to dub limit, blimit, and thresh from
vld {dx[]}, freeing up r4-r6.
- Changed code to use vbif to reduce one instruction and free
up a d register.
Change-Id: I028dae0e434dc9891c3677bdb182e201ffb04777
This probably has a mildly negative impact on performance, but will
(in future commits - or possibly merged with this one) allow SIMD
implementations of individual intra prediction functions. We may
perhaps want to consider having separate functions per txfm-size
also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for
each intra prediction mode), but I haven't played much with that
yet.
Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269
The resulting reconstruction is never used, thus it just wastes CPU
cycles. Reduces encode time of first 50 frames of bus (speed 0) @
1500kbps from 2min2.0 to 2min1.2, i.e. a 0.65% overall speedup.
Change-Id: I74755ca3aadc21e2be220f486259060bd4088c45
Encode time for first 50 frames of bus (speed 0) @ 1500kbps goes from
2min10.9 to 2min10.5, i.e. 0.3% faster overall, basically because we
prevent the call overhead.
Change-Id: I1eab1a95dd3eae282f9b866f1f0b3dcadff073d5
Changes cost_mv_ref() into doing a LUT into pre-calculated cost
arrays instead. Encode time of first 50 frames of bus (speed 0)
@ 1500kbps goes from 2min11.6 to 2min10.9, i.e. 0.5% faster overall.
Change-Id: If186e92c34c201b29cbbc058785a15c9c09e433a
First 50 frames of bus @ 1500kbps (speed 0) goes from 2min12.6 to
2min11.6, i.e. 0.75% overall speedup.
Change-Id: I67054f8146e82a02b6457c51a1c8627a937e5e1e
Encode time of first 50 frames of bus (speed 0) @ 1500kbps goes from
2min4.9 to 2min3.1, i.e. a 1.4% speedup overall.
Change-Id: Ibe8b08d159797504c5d0c5122de1b6da3b6595e0
Overall, on all test sets, this gains about +0.2% on all metrics.
City is a clip where this really hurts (-1.0% on all metrics), I'm
not quite sure why yet. Maybe interesting to look into in the future.
Change-Id: I6f0eecb20e72f0194633270d30bf00d76d9eae78
Skips mode searches for intra and compound inter modes depending
on the best mode so far and the reference frames. The various
heuristics to be used are selected by bits from a flag. The
previous direction based intra mode search pruning is also absorbed
in this framework.
Specifically the flags and their impact are:
1) FLAG_SKIP_INTRA_BESTINTER (skip intra mode search for oblique
directional modes and TM_PRED if the best so far is
an inter mode)
derfraw300: -0.15%, 10% speedup
2) FLAG_SKIP_INTRA_DIRMISMATCH (skip D27, D63, D117 and D153
mode search if the best so far is not one of the closest
hor/vert/diagonal directions.
derfraw300: -0.05%, about 9% speedup
3) FLAG_SKIP_COMP_BESTINTRA (skip compound prediction mode
search if the best so far is an intra mode)
derfraw300: -0.06%, about 7-8% speedup
4) FLAG_SKIP_COMP_REFMISMATCH (skip compound prediction search
if the best single ref inter mode does not have the same ref
as one of the two references being tested in the compound mode)
derfraw300: -0.56%, about 10% speedup
Change-Id: I1a736cd29b36325489e7af9f32698d6394b2c495
intermediate_height for horizontal filtering must be at least 8
pixels to be able to do vertical filtering correctly. Currently
it can be less for small block and y_step_q4 sizes.
Change-Id: I2ee28b0591b2041c2fa9844d0ae2ff8a1a59cc21
This commit allows encoder to detect the cumulative rate-distortion
cost per transformed block inside a partition. If the cumulative
rd cost is already above the best rd value, it terminates the rest
operations and continue to next prediction mode test.
It reduces the runtime of bus at target bit-rate 2000 from 308 second
to 266 second, i.e., about 13% speed-up at no performance penalty.
Change-Id: I5f15a3d8955d97031d5653006027866a00654e7a
sf->unused_mode_skip_lvl. Tests modes as normal for all
sizes at or below the given level. At larger sizes it skips
all modes that were not chosen at any smaller size.
Hence setting BLOCK_SIZE_SB64X64 is in effect off.
Setting BLOCK_SIZE_AB4X4 will only consider modes that
were chosen for one or more 4x4 blocks at larger sizes.
sf->reference_masking.
Do a test encode of the NONE partition at one size and create
a reference frame mask based on the best rd choice. In the
full search only allow this reference frame.
Currently it is testing 64x64 and repeats this in the full search.
This does not work well with Jim's Partition code just now and
is disabled by default.
Change-Id: I8f8c52d2ef4a0c08100150b0ea4155d1aaab93dd
This patch allows the frame_parallel_decoding_mode flag
to be set instead of returning a codec error.
Change-Id: I4a1631c625723ac8873290d0fd0211074a87d112
This commit adds a speed feature where only squared partition are
evaluated in partition picking. Enable this feature in cpu-used 2
reduces encoding time by ~30%.
loss of compression:
-0.9% on cif set
-1.23% on stdhd
Change-Id: Ia6fad11210f0b78365abb889f9245604513be5b9
If none of the 16 coefficients that we quantize per loop iteration
are larger than the zbin, directly skip to the next round of coeffs,
rather than doing a full quantize loop that will eventually result
in 16 zeroes. This incurs a jump cost, but saves a lot of other work.
32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded
no significantly positive results for smaller transforms, so is not
used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles).
Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
This speed feature will skip searching the directional intra prediction
modes D63, D117, D27, D153 if the best intra mode so far is not one of
the diagonal, horizontal or vertical directions closest to the respective
directions being tested. In other words, this implements a sort of
binary search in the angular domain.
Speedup: about 9-10%
Results: -0.05% only on derfraw300.
Change-Id: I413584c41f2a3e8dabfbdeb40718c8fc4b1d63a2
(1) Refines the modeling function and uses that to add some speed
features. Specifically, intead of using a flag use_largest_txfm as
a speed feature, an enum tx_size_search_method is used, of which
two of the types are USE_FULL_RD and USE_LARGESTALL. Two other
new types are added:
USE_LARGESTINTRA (use largest only for intra)
USE_LARGESTINTRA_MODELINTER (use largest for intra, and model for
inter)
(2) Another change is that the framework for deciding transform type
is simplified to use a heuristic count based method rather than
an rd based method using txfm_cache. In practice the new method
is found to work just as well - with derf only -0.01 down.
The new method is more compatible with the new framework where
certain rd costs are based on full rd and certain others are
based on modeled rd or are not computed. In this patch the existing
rd based method is still kept for use in the USE_FULL_RD mode.
In the other modes, the count based method is used.
However the recommendation is to remove it eventually since the
benefit is limited, and will remove a lot of complications in
the code
(3) Finally a bug is fixed with the existing use_largest_txfm speed feature
that causes mismatches when the lossless mode and 4x4 WH transform is
forced.
Results on derf:
USE_FULL_RD: +0.03% (due to change in the tables), 0% encode time reduction
USE_LARGESTINTRA: -0.21%, 15% encode time reduction (this one is a
pretty good compromise)
USE_LARGESTINTRA_MODELINTER: -0.98%, 22% encode time reduction
(currently the benefit of modeling is limited for txfm size selection,
but keeping this enum as a placeholder) .
USE_LARGESTALL: -1.05%, 27% encode-time reduction (same as existing
use_largest_txfm speed feature).
Change-Id: I4d60a5f9ce78fbc90cddf2f97ed91d8bc0d4f936
Uses mapping tables instead of complicated modulo/division
operations for prob mapping for forward updates.
No bit-stream or output change.
Change-Id: Ifd9ce8ac1437835c305c94f64c18273c7a68f546
Added a speed feature in speed 1 to disable splitmv for HD (>=720)
clips. Test result on stdhd set: 0.3% psnr loss and 0.07% ssim
loss. Encoding speedup is 36%.
(For reference: The test result on derf set showed 2% psnr loss
and 1.6% ssim loss. Encoding speedup is 34%. SPLITMV should be
enabled for small resolution videos.)
Change-Id: I54f72b94f506c6d404b47c42e71acaa5374d6ee6
Compute the rate-distortion cost per transformed block, and cumulate
the cost through all blocks inside a partition. This allows encoder
to detect if the cumulative rd cost is already above the best rd cost,
thereby enabling early termination in the rate-distortion optimization
search.
Change-Id: I0a856367a9a7b6dd0b466e7b767f54d5018d09ac
Remove the use of sf->comp_inter_joint_search_thresh
from the baseline speed 0. Approx +0.4% on derf.
Change-Id: Icc14db98909830f40e5ac66130d40e78d2e55c71
This cl converts use partition from last frame to do the following:
if part is none,horz, vert -> try split
if part != none and one of the children is not split - try none
Change-Id: I5b6c659e35f3ac9f11c051b92ba98af6d7e8aa87
Signed-off-by: Jim Bankoski <jimbankoski@google.com>
This should significantly speedup cost_coeffs(). Basically what the
patch does is to make the neighbour arrays padded by one item to
prevent an eob check in get_coef_context(), then it populates each
col/row scan and left/top edge coefficient with two times the same
neighbour - this prevents a single/double context branch in
get_coef_context(). Lastly, it populates neighbour arrays in pixel
order (rather than scan order), so we don't have to dereference the
scantable to get the correct neighbours.
Total encoding time of first 50 frames of bus (speed 0) at 1500kbps
goes from 2min10.1 to 2min5.3, i.e. a 2.6% overall speed increase.
Change-Id: I42bcd2210fd7bec03767ef0e2945a665b851df56
Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
2min10.1, i.e. a 2.3% overall speed increase.
Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
x86-64 only, it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.
Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
Added a speed feature that focuses only on thresholds
for new motion modes.
Moved sf->comp_inter_joint_search_thresh into speed
1. This has ~+0.4% impact on quality at speed 0 as
our quality reference baseline.
Slight adjustment to baseline thresholds.
Change-Id: I7ebf104f1fe29af77ed4837b2e84be065621bbe5
since:
92479d9 Make update_partition_context faster
fixes:
vp9/common/vp9_blockd.h:408:22: error:
non-constant-expression cannot be narrowed from type 'int' to 'char' in
initializer list [-Wc++11-narrowing]
char pcvalue[2] = {~(0xe << boffset), ~(0xf <<boffset)};
^~~~~~~~~~~~~~~~~
Change-Id: Id5b00b9a72d00a2b314081a23879bd1fa3ce983b
This commit enables SSE2 4x4 foward hybrid transform. The runtime
goes from 249 cycles down to 74 cycles. Overall around 2% speed-up
at no compression performance change.
Change-Id: Iad4d526346e05c7be896466c05500711bb763660
Adding read_skip_coeff function. Renaming decode_mv to read_mv for
consistency with another function names. Removing redundant function
arguments. Renaming kfread_modes to read_intra_mode_info, read_mb_modes_mv
to read_inter_mode_info, vp9_decode_mb_mode_mv to vp9_read_mode_info,
vp9_decode_mode_mvs_init to vp9_prepare_read_mode_info. Inlining function
mb_mode_mv_init inside vp9_prepare_read_mode_info.
Change-Id: Ifee05d333da4cd331d4aff40ce41ccd9b70e494a
Makes cost_coeffs() a lot faster:
4x4: 236 -> 181 cycles
8x8: 888 -> 588 cycles
16x16: 3550 -> 2483 cycles
32x32: 17392 -> 12010 cycles
Total encode time of first 50 frames of bus (speed 0) @ 1500kbps goes
from 2min51.6 to 2min43.9, i.e. 4.7% overall speedup.
Change-Id: I16b8d595946393c8dc661599550b3f37f5718896
Adding CHECK_MEM_ERROR macro to vp9_common.h and removing two duplicated
ones from vp9_onyx_int.h and vp9_onyxd_int.h.
Change-Id: I916afec61b3019f18193135dac7c35ed0f89b8b6
Cycle timings for first 3 frames of bus (speed 0) at 1500kbps:
4x4: 298 -> 234 cycles
8x8: 1227 -> 878 cycles
16x16: 23426 -> 18134 cycles
32x32: 4906 -> 3664 cycles
Total encode time of first 50 frames of bus @ 1500kbps (speed 0) goes
from 3min0.7 to 2min51.6 seconds, i.e. 5.3% faster.
Change-Id: I68a0e1b530b0563b84a67342cca4b45146077e95
This commit replaces zrun_zbin_boost, a method of biasing non-zero
coefficients following runs of zero-coefficients to be rounded towards
zero, with an explicit skip-block choice in the RD loop.
The logic is basically that if individual coefficients should be rounded
towards zero (from a RD point of view), the trellis/optimize loop should
take care of it. If whole blocks should be zero (from a RD point of
view), a single RD check is much more efficient than a complete
serialization of the quantization loop.
Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
SIMD for quantize will follow in a separate patch. Results for other
test sets pending.
Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
This commit change the partition search order to allow checking of
rectangular partition to be done after square partitions. It also
added a speed feature to skip rectangular partition check when
NONE is better than SPLIT in RD sense.
This feature roughly speed up encoder by 1.5X with loss on compression
-0.91% on cif set
-0.56% on stdhd set
Change-Id: I0d2d06993041aa9ea9073fcc39c54f73a127dfa4
Using vp9_set_pred_flag function instead of custom code, adding
decode_tokens function which is now called from decode_atom,
decode_sb_intra, and decode_sb.
Change-Id: Ie163a7106c0241099da9c5fe03069bd71f9d9ff8
- Added vp9_loop_filter_horizontal_edge_neon and
vp9_loop_filter_vertical_edge_neon.
- The functions are based off the vp8 loopfilter
functions.
- Matches x86 md5 checksum.
Change-Id: Id1c4dddb03584227e5ecd29f574a6ac27738fdd0
Encoding time of first 50 frames of bus @ 1500kbps (speed 0) goes from
3min15.0 to 3min10.9, i.e. 2.1% faster overall.
Change-Id: If592ee99be09bcd34a7c8498347f44e7305e982c
currently threading is internal to libvpx so thread safety is unneeded
in libgtest -- visual studio builds already operate in this way as they
do not have pthread.h available by default.
this removes an unconditional link to libpthread using $(extralibs)
should libvpx require it.
Change-Id: I2f278b711f533d0f4d8a6c896833e3e2237d1f45
This commit enables configurable reference buffer pointer for intra
predictor. This allows later removal of spatial dependency between
blocks inside a 64x64 superblock in the rate-distortion optimization
loop.
Change-Id: I02418c2077efe19adc86e046a6b49364a980f5b1
Use vpx_memset for updating the partition contexts. Thanks to Noah
for pointing out the need of refactoring in this part.
Change-Id: I67fb78429d632298f1cd8a0be346cc76f79392a6
Also tweaks to other features and experiments with
what is on and off at different speed settings.
Change-Id: I3e1d0be0d195216bf17c2ac5df67f34ce0b306b2
comment out some unused parameters and adjust the format to avoid:
./test/fdct4x4_test.cc|27| warning C4138: '*/' found outside of comment
Change-Id: I60f93b4c3cd7e8d61f0de80019f3404b40161f03
The aligned array in parameter list caused win32 build to report
c2719 error. This commit fixed the issue by make the parameter
type a pointer instead of an array.
Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992
Each frame we reset all adaptive thresholds to MAX
rather than base. As modes are picked their thresholds
drop down.
Change-Id: Ia37f03a73003c2d9bfcda57edea07205e9a0e5e8
Renamed cpi->sf.first_step to cpi->sf.reduce_first_step_size
and changed its meaning such that it is a delta applied to
reduce the default first step size (>> x) in the motion search
rather than an absolute value.
The default first step size is already changed according to the image
dimensions (smaller for smaller images). cpi->sf.reduce_first_step_size
now applies a further correction from the default.
Change-Id: Ia94e08bc24c67b604831f980909af7e982fcd16d
The part where we align it by 8 or 16 is an implementation detail that
shouldn't matter to the outside world.
Change-Id: I9edd6f08b51b31c839c0ea91f767640bccb08d53
Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2,
i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
the variance of the averaging predictor. This is slightly suboptimal
because the function is subpixel-position-aware, but it will (at least
for the SSE2 version) not actually use a bilinear filter for a full-pixel
position, thus leading to approximately the same performance compared to
if we implemented an actual average-aware full-pixel variance function.
That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
leading to a total gain of 2.7%.
Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
This commit enables 8x8 DCT and hybrid transform unit tests. It
also tunes the forward hybrid transform rounding opertions for
more precise round-trip performance.
Change-Id: If05c1ce59d75d641b9c6c91527d02d3a6ef498c3
This commit makes use of the butterfly structure to enable the sse2
version implementation of 8x8 ADST/DCT hybrid transform coding.
The runtime of hybrid transform module goes down from 1170 cycles
to 245 cycles. Overall speed-up around 1.5%.
Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
This reduced the size of the MODE_INFO array (mip and prev_mip)
by 425,568 bytes each for 1080p resolutions.
Change-Id: Ifa513ec2d0a49e8ec0867ec90620762fb7f1261d
; do the stores in reverse order with negative post-increment, by changing
; the sign of the stride
rsbr2,r2,#0
vst1.32{d27[1]},[r1],r2
vst1.32{d27[0]},[r1],r2
vst1.32{d26[1]},[r1],r2
vst1.32{d26[0]},[r1]; no post-increment
bxlr
ENDP; |vp9_short_iht4x4_add_neon|
END
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.