All filters are interpolating now, so we don't need this array, all
values from this array are evaluated to true.
Change-Id: I9af6d8219ae0eb984063cd15e4e2296374ae4961
The xform_quant() module is only used by inter modes, hence removing
the redundant switches therein conditioned on tx_type.
Change-Id: Ib87ce5b2f2e4cbf3ceb133a1108afa173c933a3f
When all the transform coefficients were quantized to zero, skip
the inverse transform operation. For bus_cif at 1000 kbps, the
runtime goes from 154967ms -> 149842ms, i.e., about 3% speed-up,
at speed 0.
Change-Id: Ic0a813fff5e28972d4888ee42d8747846a6c3cc6
many structures use bw and bh and they have different meanings. This cl attempts
to start this clean up and remove unneccessary 2 step look up log and then
shift operations...
also removed partition type multiple operation code in bitstream.c.
Change-Id: I7e03e552bdfc0939738e430862e3073d30fdd5db
Renamed:
MAX_MB_SEGMENTS to MAX_SEGMENTS
MB_SEG_TREE_PROBS to SEG_TREE_PROBS
The minimum unit for segmentation in the segment map
is now 8x8 so it is misleading to use MB_ as macro-block
traditionally refers to a 16x16 region.
Change-Id: I0b55a6f0426bb46dd13435fcfa5bae0a30a7fa22
Moving common encoder/decoder code to update_tx_counts. Also renaming
vp9_get_pred_probs_tx_size to get_tx_probs2 and adding get_tx_probs to
call vp9_get_pred_context_tx_size inside read_selected_tx_size only once
(twice before).
Change-Id: Ia50247f3893de88ef8e9041b0d44be44a40aaa4d
Optional change in diamond search to continue in the best move
direction until that move turns worse.
This is still WIP since the exact way the new method is to be used is
under investigation. One option is to make it an option in diamond
search and use it only when motion is large.
Overall slightly positive on derfraw300 +0.02%, stdhdraw +0.13%,
but works a lot better for high motion sequences (ex. football : +1%).
Change-Id: If88e01a6021daa0cda934680cdc70be1ee04f798
Stack the rate-distortion statistics in the sub8x8 rd loop. This allows
the encoder to skip the forward transform, quantization, and coeff cost
estimation, in the sub8x8 rd optimization search, if the motion
vector(s) are of integer pixel value, and have been tested in the
previous prediction filter type rd loops of the same block.
This gives about 2% speed-up for bus_cif at 2000 kpbs, for speed 0.
Its efficacy depends how frequently the motion search will select an
integer motion vector.
Change-Id: Iee15d4283ad4adea05522c1d40b198b127e6dd97
Mode search order in rd loop changed to better reflect
observed hit counts.
Also some adjustment of the baseline mode rd thresholds
to reflect the order change and observed frequencies.
Change-Id: I47a131cc83e11551df8add6d6d8d413d78d3a63c
When CONFIG_POSTPROC is set there was a now
invalid reference to cm->filter_level.
Changed to cpi->mb.e_mbd.lf.filter_level in line with
change Iaf5fb71c33719cdfa1b991f671caf071be9ea035
Change-Id: If746e60044903f7ba8d0d346225b3d015226c7d0
This commit allows the encoder to skip a few buffer update steps in
rd_pick_best_mbsegmentation, when early breakout has been triggered
in the rd_check_segment_txsize. It provides about 1% speed-up for
bus_cif at 2000 kbps, in the settings of speed 0.
Change-Id: Ica034f10a24dec572b397d8389a2b81020ebc0b9
At speed 2, due to the threshold scheme used, it is possible the rate
and distortion assigned with INT_MAX value. The patch added checking
to prevent the INT_MAX value is used in further calculation of RD
scores. The patch also changed the assertion in rd_use_partition() to
be mirror similar assertion in rd_pick_partition().
Change-Id: Idb52c543cc1e10abdf6e6a5d6e9cb535a42214dc
Adding loopfilter struct with fields from MACROBLOCKD and VP9Common.
Eventually it will be moved to vp9_loopfilter.h for better code structure.
Change-Id: Iaf5fb71c33719cdfa1b991f671caf071be9ea035
This patch modifies the auto_mv_step_size speed feature to
use a combination of the maximum magnitude mv from the last
inter frame, and the maximum magnitude mv for the two reference
mvs with the same reference. For arf frames, the max mav step
for the resolution is used.
The bounds therefore are slightly tighter. The feature is made
a speed 1 feature.
Rebased.
Results (when this feature is turned on over speed 0):
derfraw300: -0.046% psnr, about 5+% speedup
(tested on football: goes from 4m30.760s to 4m17.410s).
Change-Id: If492797a61b0b4b3e58c0b8f86afb880165fc9f6
Renaming vp9_sb_mv_ref_tree to vp9_inter_mode_tree, and
vp9_sb_mv_ref_encoding_array to vp9_inter_mode_encodings.
Change-Id: I0e91fbf81350d3ec5a2599064c74089b5d06133a
We would skip the rectangular blocks for sub8x8 partitions because
we would conclude that PARTITION_NONE was better than PARTITION_SPLIT,
however, that conclusion was made before we actually really tested
PARTITION_SPLIT.
Change-Id: I8fa91e59894badc1d8cee3ba8a49e40ae4c4a489
This prevents a duplicate memcpy of a 128-byte struct every time
set_scale_factors() is called (which is a lot), thus leading to a
decrease from 3.7 MB to 1.85 MB of struct copying per 64x64 block
RD/partition loop.
Overall, this decreases encoding time of the first 50 frames of bus
@ 1500kbps (speed 0) from 1min5.9 to 1min4.9, i.e. about a 1.5%
overall speedup. We can likely get more gains by removing the copy
of the other struct (and replacing it with an indexing) as well.
Change-Id: I3dceb7e79f71e6fe911b11cc994cf89a869dde7a
This means we only do UV intra mode selection if we find any intra
mode to actually be useful at all; in addition, we only do UV intra
mode selection for the transform sizes that were selected, rather
than all sizes available in this partition.
First 50 frames of bus @ 1500kbps (speed 0) gains about 5% with this
change.
Change-Id: I7b461eb8b803247f57896c5a9505f745b55502b3
The break statement only breaks out of the nested loop, not the
top-level loop, so it doesn't always work as intended. Changing it
to a return statement does what's intended.
Change-Id: I585419823b39a04ec8826b1c8a216099b1728ba7
these are only used in the encoder.
frames_since_golden / frames_till_alt_ref_frame -> VP[89]_COMP
Change-Id: Ie14a6f46987bced685ddb449b85dc261caba6dfe
This could happen during golden overlay frame coding from a previous
alt-ref frame if the special overlay code was triggered.
Change-Id: I3056d0c547cd26903b260ef93c94026e96bd9868
These arrays have constant values (no any updates). Removing two
corresponding memcpy calls. Making a little cleanup in vp9_entropymode.h
as well: removing redundant 'extern' keyword and moving all function
declarations at the end.
Change-Id: Ia16b38b46aec2e2500f5df29c40a297ae241dede
vp9_init_quantizer() is called in vp9_create_compressor(), and
should not be called in vp9_set_speed_features().
Change-Id: Ic2f1f4b0531b9d46bb841d7e1d8da9812207dad6
Encoding of first 50 frames of bus (speed 0) @ 1500kbps goes from
1min6.2 to 1min5.9, i.e. 0.5% faster overall.
Change-Id: I59d8a3b2f0a75010fa041d5e2646c8caac5bd683
Encode of first 50 frames of bus @ 1500kbps (speed 0) goes from
1min7.3 to 1min6.2, i.e. 1.7% faster overall.
Change-Id: I19d2deacfbffadd61d32551cee9586757ab4a987
Encode of first 50 frames of bus @ 1500kbps (speed 0) goes from 1min12.8
to 1min7.3, i.e. 8% faster.
Change-Id: Ia22d1c7b687316c553cc60eacae988b24e175b62
About 15% faster for bus (speed 0) first 50 frames @ 1500kbps, which
goes from 1min36 to 1min24. Results become slightly better (+0.2% on
derf/yt, +0.4% on hd), probably because of a bugfix for skipmode in
super_block_yrd(). Overall speed change (on derfraw300) is roughly
-13%. This can probably be improved further by caching best_yrd
between partition searches. Also, we might be able to get more
speedups by always doing PARTITION_NONE before PARTITIONS_SPLIT, not
just at the sb8x8 level.
Change-Id: I83736949ebd5b4a3b400ee688d7661913fefc98b
Current partition checking starts from small sizes, and then goes up
to large sizes. This experiment uses the small partitions' motion
estimation result, which is already available, to speed up the
large partition's motion estimation. We can decide to skip some
patition checkings if they are unlikely choices. We could use the
motion vector(MV) result as current partition's prediction MV, limit
the search range and reference frame.
Current result at speed 1:
psnr loss: 1.19% for stdhd, 0.287% for derf.
speed gain: 14% for sunflower(hd), 11% for akiyo.
Further improvement will be done later.
Change-Id: I5abfd070e9cace2e91e2a0247d1325df313887ab
Use an estimate based on DC_PRED for intra uv cost
within the rd loop then only do a full uv mode analysis
if an intra mode is chosen.
Significant speed gains in some cases. Currently only
enabled for speed 2 pending speed/quality tests.
Change-Id: Ie851a12400d5483bce47ec0e3ccb8516041e91c0
This commit makes the encoder to perform motion search only once
per reference frame type for each 4x4/4x8/8x4 block. For bus_cif
at 2000 kbps, the runtime goes from 253812ms -> 217817ms
(14% speed-up) for speed 0.
Change-Id: I5f17599ccc8cfaf93ccb4f98fcb6008af6d79e92
Removing tile_rows and tile_columns from VP9Common, removing redundant
constants MIN_TILE_WIDTH and MAX_TILE_WIDTH, changing signature of
vp9_get_tile_n_bits.
Change-Id: I8ff3104a38179b2c6900df965c144c1d6f602267
Making implementation of vp9_set_pred_flag_{seg_id, mbskip} consistent
with vp9_get_segment_id without using confusing sub(a, b) macro. Passing
mi_row and mi_col to functions explicitly instead of replying on
mb_to_right_edge and mb_to_bottom_edge.
Change-Id: I54c1087dd2ba9036f8ba7eb165b073e807d00435
This is a short term optimization till we work out a decoder
implementation requiring no frame border extension.
Change-Id: I02d15bfde4d926b50a4e58b393d8c4062d1be70f
Cycle times:
4x4: 151 to 131 cycles (15% faster)
8x8: 334 to 306 cycles (9% faster)
16x16: 1401 to 1368 cycles (2.5% faster)
32x32: 7403 to 7367 cycles (0.5% faster)
Total encode time of first 50 frames of bus @ 1500kbps (speed 0)
goes from 1min39.2 to 1min38.6, i.e. a 0.67% overall speedup.
Change-Id: I799a49460e5e3fcab01725564dd49c629bfe935f
Also inline some of the block calculations to assist the compiler to
not do silly things like calculating the same offset (or converting
between raster/transform block offset or block, mi and pixel unit)
many, many, many times.
Cycle times:
4x4: 584 -> 505 cycles (16% faster)
8x8: 1651 -> 1560 cycles (6% faster)
16x16: 7897 -> 7704 cycles (2.5% faster)
32x32: 16096 -> 15852 cycles (1.5% faster)
Overall, this saves about 0.5 seconds (1min49.8 -> 1min49.3) on the
first 50 frames of bus (speed 0) @ 1500kbps, i.e. 0.5% overall.
Change-Id: If3dd62453f8e2ab9d4ee616bc4ea956fb8874b80
Skip the inverse transform and reconstruction of inter-mode coded
blocks in the rate-distortion optimization loop, when skip_encode_sb
feature is turned on. This provides about 1% speed-up at speed 0,
and 1.5% speed-up at speed 1. No performance change in both settings.
Change-Id: I2932718bf4d007163702b61b16b6ff100cf9d007
This speed feature allows the encoder to largely remove the spatial
dependency between blocks inside a 64x64 superblock, thereby removing
the need to repeatedly encode superblocks per partition type in the
rate-distortion optimization loop.
A major challenge lies in the intra modes tested in the rate-distortion
optimization loop. The subsequent blocks do not have access to the
reconstructed boundary pixels without the intermediate coding steps.
This was resolved by using the original pixels for intra prediction
in the rd loop, followed by an appropriately designed distortion
modeling on the quantization parameters. Experiments also suggested
that the performance impact is more discernible at lower bit-rate/psnr
settings. Hence a quantizer dependent threshold is applied to deactivate
skip of block coding.
For bus_cif at 2000 kbps,
speed 0: runtime 269854ms -> 237774ms (12% speed-up) at 0.05dB
performance loss.
speed 1: runtime 65312ms -> 61536ms, (7% speed-up) at 0.04dB
performance loss.
This operation is currently turned on in settings of speed 1.
Change-Id: Ib689741dfff8dd38365d8c1b92860a3e176f56ec
this was never fleshed out in the context of VP8, for which it was
added. for VP9 it has no meaning.
Change-Id: Iba2ecc026d9e947067b96690245d337e51e26eff
Implements some of the helper functions more efficiently with
lookups rathers than branches. Modeling function is consolidated
to reduce some computations.
Also merged the two enums BLOCK_SIZE_TYPES and BlockSize into
one because there is no need to keep them separate (even though
the semantics are a little different).
No bitstream or output change.
About 0.5% speedup
Change-Id: I7d71a66e8031ddb340744dc493f22976052b8f9f
For some reason iOS builds take a really long time to sort this
function out.
It's not used anywhere so remove it.
Change-Id: Ia5c8513a0d9c7eb32641cca58ca1c1113e2dd9f4
Adding segmentation struct to vp9_seg_common.h. Struct members are from
macroblockd and VP9Common structs. Moving segmentation related constants
and enums to vp9_seg_common.h.
Change-Id: I23fabc33f11a359249f5f80d161daf569d02ec03
The function encode_block is called only by inter-prediction modes,
hence removing the transform type branching there.
Change-Id: I34a3172e28ce2388835efd0f8781922211bff857
This patch is in experimental but was not merged into master.
This patch swaps ptrs instead of copying and uses the
last show_frame flag instead of setting the entire buffer
to zero.
Change-Id: Ia0950466c8ba301a2a5bf917ff3d07bc1a2c2311
With sf->auto_mv_step_size on it is questionable
whether sf->reduce_first_step_size is worthwhile.
At speed 2 it was not having a big impact.
Even at speed 2 sf->optimize_coefficients = 0 is not
having a big speed imapct so for now I have moved it
down into a higher speed setting.
Change-Id: I8a54de76d486ad37aabce76474889da2768b14c1
This commit fixed the mis-use of the tx_type for inverse transform
in intra4x4 rate-distortion optimization loop. It improves the
overall coding performance.
Change-Id: I7fe9953175b74890357dbcee33c138573766e980
Adds a speed feature to eliminate full-rd computation if the modeled
rd or rd based on a different parameter in the same mode is already
a lot larger than the best rd yet.
Specifically, only search the sharp and smooth filters if the modeled
rd cost based on the regular filter is within a certain factor of the
best rd cost so far. Also, skip full-rd computation of non splitmv
inter modes if the modeled rd cost based on pred error is within the
same factor of the best rd cost so far.
Also adds some enhancements in the rd search for splitmv mode to
speed things up by early breakouts. Negligible impact on performance.
Resuts on derfraw300:
psnr: -0.013% with the splitmv enhancements, -0.24% with the rd
breakout feature on.
speedup: 6% with splitmv enhancements, 20% with also residual breakout
(tested on football sequence at 600 Kbps)
Change-Id: I37abc308ea9f110c1679ce649b6a7e73ab1ad5fc
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
no compression performance loss.
Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230