Use b_mode_info to store the inter prediction mode of sub8x8 block,
in replacement of the use of partition_info. Remove redundant buffer
update for partition_info. For bus_cif at 2000 kbps, this seem to make
speed 0 about 1% faster.
Change-Id: Id1b3be45e75a24fb4b42335ac480c23e440978f6
When all coefficients are zeros, skip the corresponding 1-D inverse
transform. This practice has been used in the SSE2 implementation of
inverse 32x32 DCT. This commit imports this algorithm into the C code.
Change-Id: I0f58bfcb183a569fab85d524d5d9cf8ae8653f86
We already have itxm_add member in MACROBLOCKD structure. Both
inv_txm4x4_1_add and inv_txm4x4_add are just its special cases for
different eob values. But eob logic is already implemented in
vp9_iwht4x4_add and vp9_idct4x4_add (that's why also removing
inverse_transform_b_4x4_add).
Change-Id: I80bec9b6f7d40c5e5033c613faca5c819c3e6326
For CpuUsed 1 & 2, this commit allow to skip retangular partition check
when NONE is better than SPLIT. It also changed to allow such logic
on alt ref frame coding rather than use square partition all them. The
change has gain compressio about .3% on yt and ythd for both 1&2, It
helped .6% compression on cif and stdhd for both CpuUsed 1&2.
Change-Id: I814b653baf89f59acd20e042629a12938a1bd4e5
Now we have entropy code separate from scan/iscan code. The next step
in future is to move iscan code from common part to the encoder.
Change-Id: Id9732f7d80aec00af35c1d58d1137c4c96c91451
A new set of MSVC warnings were introduced by change
I3f36d3f7cd8d15195a6e2fafd1777cdaf9ecb847
In particular MSVC does not like:-
typedef const int16_t subpel_kernel[SUBPEL_TAPS];
struct subpix_fn_table {
const subpel_kernel *filter_x;
const subpel_kernel *filter_y;
};
causes new warning in MSVC.
warning C4114: same type qualifier used more than once
Change-Id: Iae596fd13aadf36169faf00c68eabe9a32a9b156
This commit allows sub8x8 intra modes test in the rate-distortion
loop for hd sequences in speed 1 and 2.
For sequence y90n of hd set at 8000 kbps, speed 2 runtime goes
from 207s to 210s. For ped_1080p at 3000 kbps, speed 2 runtim goes
from 336s to 337s. Both are running with 300 frames.
This improves compression performance by 0.24% for stdhd and 0.32%
for hd.
Change-Id: I173ca38a6411565ae6cfadd184c42b2070c5de1f
The idea is to have the following names for each transform size:
vp9_idct4x4_add
vp9_idct4x4_1_add
vp9_idct4x4_10_add
vp9_idct4x4_16_add
vp9_idct8x8_add
vp9_idct8x8_1_add
vp9_idct8x8_10_add
vp9_idct8x8_64_add
etc for 16x16, 32x32
The actual list of renames in this patch:
vp9_idct_add_lossless -> vp9_iwht4x4_add
vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add
vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add
vp9_idct_add -> vp9_idct4x4_add
vp9_short_idct4x4_add -> vp9_idct4x4_16_add
vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add
Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1
Speed 4 still does not give a big gain over speed 3.
This just cleans it up a little from the last patch and comments
out features that do not seem to be giving much benefit.
Change-Id: I5f366e6160e1dbe5dc45cf5eb90cc02712baa1b6
Allow selective masking of individual split modes rather than
just a single on / off flag.
For speed 2 recovers the large speed loss seen for some derf
clips in change Ie6bdfa0a370148dd60bd800961077f7e97e67dd4
and a small quality gain.
For speed 1 10 % speed increase observed locally on some derf clips
for minimal quality change.
Change-Id: If86191087b93cbc05351c26c60c7933e2149e485
Moving INTERPOLATIONFILTERTYPE enum and subpix_fn_table struct to
vp9_filter.h. Adding convenient typedef for subpel kernels.
Function vp9_setup_interp_filters() besides setting xd->subpix.filter_x &
xd->subpix.filter_y has a side effect of also setting scale factors. This
is not required inside decode_modes_b() because scale factors have been
already set by set_ref() calls. That's why replacing
vp9_setup_interp_filters() call with newly created vp9_get_filter_kernel()
call. The behavior of vp9_setup_interp_filters() is unchanged (it
is used from the encoder).
Change-Id: I3f36d3f7cd8d15195a6e2fafd1777cdaf9ecb847
This commit removes the redundant second reference frame check in
the rate-distortion optimization loop for sub8x8 blocks.
Change-Id: I13a57a6f624c4a9bcef02ff2a867fa30d8b44a93
This commit defines b_mode_info as a struct type. This will allow
us to further remove the use of PARTITION_INFO in the encoding process.
Change-Id: I975b0f7d557b5e0f66545a61b472def76b671cce
This commit separates the rate-distortion optimization loop of
superblocks from that of sub8x8 blocks. This allows better design
rate-distortion optimization search loop for each setting. It also
removes the use of SPLITMV and I4X4_PRED therein.
No performance change in speed 0 settings. For bus@CIF at 2000kbps,
the speed 1 runtime goes from 48009ms to 43894ms (about 10% faster).
The overall compression performance on derf changed by -0.021%.
Speed 2 runtime goes from 27114ms to 28700ms (6% slower), while the
overall coding efficiency goes up by 1.629% for derf, 1.236% for yt.
Change-Id: Ie6bdfa0a370148dd60bd800961077f7e97e67dd4
In subpixel filters, prefetched source data, unrolled loops,
and interleaved instructions.
In HORIZx4, integrated the idea in Scott's CL (commit:
d22a504d11), which was suggested by
Erik/Tamar from Intel. Further tweaking was done to combine row 0,
2, and row 1, 3 in registers to do more 2-row-in-1 operations until
the last add.
Test showed a ~2% decoder speedup.
Change-Id: Ib53d04ede8166c38c3dc744da8c6f737ce26a0e3
Substantial reworking of the speed vs quality trade offs for
speed 1 and 2.
In this patch I am attempting to freeze the "quality" meaning of
speeds 1 and 2 relative to speed 0 so that in future we can
better evaluate progress.
I am targeting :
Speed 1 quality ~-5% vs speed 0.
Speed 2 quality ~-10% vs speed 0
It is inevitable that quality will still fluctuate a little as we adjust
settings and add new features, but we will attempt to keep as
close as possible to these values. Above speed 2 things will remain
a bit more fluid for now.
In this patch speed 1 is approximately 4-5x as fast as speed 0. This
is similar to before but the quality hit is a lot less. Likewise speed 2
is approximately 2x as fast as speed 1 but is similar in quality to the
previous speed 1 configuration.
Also slight change to behavior of FLAG_EARLY_TERMINATE to insure
all reference frames get at least one rd test. Important for very low
variance regions.
WIP :- Added a new speed level with old speed 4 becoming speed 5.
Speed 3 and 4 tradeoffs still WIP
Change-Id: Ic7a38dd7b5b63ab1501f9352411972f480ac6264
This commit causes use last partition to consider whether a 64x64 has
motion that might make a new partitioning worth while.
Change-Id: I3a57bedef4f3cd961fadbfa96651c206fa36da4a
Simplify the k_cvtlo_epi16 and k_cvthi_epi16 to only two
instructions. Then inlined them.
quoting from intel MMX_App_Compute_16bit_Vector.pdf
"The PMADDWD instruction multiplies four
pairs of 16-bit numbers and produces partial sums of the results
and can do so once per clock (with a three-clock latency)."
so I am assuming that there will be three clock overhead after the
last _mm_madd_pi16 command.
Even with the overhead the number of clocks in general should be
smaller. I am not sure though becasue I could not find information
about number of clocks required for instructions in k_cvtlo_epi16
and k_cvthi_epi16. I will run a test and compare the execution time.
Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7
Moving functions from vp9_idct_blk to vp9_idct because these functions are
used from both encoder and decoder. Removing duplicated code from
vp9_encodemb.c and reusing existing functions.
Change-Id: Ia0a6782f8c4c409efb891651b871dd4bf22d5fe8
The codec should effectively run with motion vector of range (-2048, 2047)
in full pixels, for sequences of 1080p and below. Add assertions to clarify
this behavior.
Change-Id: Ia0cac28249f587d8f8882205228fa480263ab313
Moving out decode_tokens function calls and adding decode_blocks boolean
variable. We only have to decode if eobtotal > 0, i.e. we have at least one
non-zero coefficient. Also inlining and remove vp9_set_pred_flag_mbskip
function.
Change-Id: I7be38b12ee8206faf0beea2bbf4d52be42575b03
Interleaved the instructions, reduced register dependency, and
prefetched the source data. This improved the decoder speed
by 0.6% - 2%.
Change-Id: I568067aa0c629b2e58219326899c82aedf7eccca
byte version of ronalds d153 ssse3 optimizations for
4x4 and 8x8
(commit: fc91a2a112238a1aee568f3b840585de4e928fca)
Change-Id: Iec4426032311483f615fd9e0dceba3ee85ddebd7
Make encoder skip rectangular partition check in speed 1 and above,
when early termination was triggered in partition split.
Thanks Guillaume (gmartres@) for catching this issue.
This change makes bus_cif at 2000kbps speed 1 runtime goes down from
25612ms to 23438ms (about 9% speed-up), at the expense of -0.235%
performance down.
Change-Id: I98613fad081a261d30d5fa206f934ca70601c180
We don't need these functions anymore. The only one which was actually
used is vp9_add_constant_residual_32x32. Addition of
vp9_short_idct32x32_1_add eliminates this single usage. SSE2 optimized
version of vp9_short_idct32x32_1_add will be added in the next patch set,
right now it is only C implementation. Now we have all idct functions
implemented in a consistent manner.
Change-Id: I63df79a13cf62aa2c9360a7a26933c100f9ebda3
The code now takes into account temporal and spatial
information to determine the partition size range, but the
frequency counts have been removed.
The net effect is similar in quality but about 10% faster.
Change-Id: I39a513fb79cec9177b73b2a7218f0da70963ae95
This patch deletes the variance based speed three partitioning.
Speed 3 now uses the same partitioning method as speed 2
but with some stricter conditions.
The speed and quality are now somewhere between speeds 2 and 4
whereas before it was worse in both than speed 4.
Change-Id: Ia142e7007299d79db3ceee6ca8670540db6f7a41