byte version of ronalds d153 ssse3 optimizations for
4x4 and 8x8
(commit: fc91a2a112238a1aee568f3b840585de4e928fca)
Change-Id: Iec4426032311483f615fd9e0dceba3ee85ddebd7
We don't need these functions anymore. The only one which was actually
used is vp9_add_constant_residual_32x32. Addition of
vp9_short_idct32x32_1_add eliminates this single usage. SSE2 optimized
version of vp9_short_idct32x32_1_add will be added in the next patch set,
right now it is only C implementation. Now we have all idct functions
implemented in a consistent manner.
Change-Id: I63df79a13cf62aa2c9360a7a26933c100f9ebda3
Both vp9_init_mbmode_probs() and vp9_zero(cm->ref_frame_sign_bias) are
called inside vp9_setup_past_independence() which called in any case for
encoder/decoder after VP9_COMMON struct creation.
Change-Id: I3724d1a4fb8060101ff0290dd6a158f0b5c57bb4
Replace current code which corrupts the stack by
duplicate of vp8 code to save and restore neon
registers.
Change-Id: Ibb0220b9aa985d10533befa0a455ebce57a2891a
- full ASM version, no more C gateway file.
- integrate combine-add with last step of 2nd pass.
- remove a few push/pop pairs.
- some instruction reordering to hide latency.
Change-Id: Ic9d9933c908b65d1bf7ba8fd47b524cda808c9c6
Current x86inc.asm didn't handle 32bit PIC build properly.
TEXTRELs were seen in the library built. The PIC macros from
libvpx's x86_abi_support.asm was used to fix this problem.
The assembly code was modified to use the macros.
Notes: We need this fix in for decoder building. Functions in
encoder will be fixed later.
Change-Id: Ifa548d37b1d0bc7d0528db75009cc18cd5eb1838
The sub8x8 check can be directly inferred from block_idx, hence
removed from the arguments if get_sub_block_mv.
Change-Id: Ib766d57e81248fb92df0f6d9b163e6c77b933ccd
This is incompatible with most toolchains other than gcc.
Revert "Deleted #include <inttypes.h>"
This reverts commit 4d018be950.
This reverts commit d22a504d11.
Change-Id: I1751dc6831f4395ee064e6748281418e967e1dcf
Reformatted version of a patch submitted by Erik/Tamar
from Intel. For the test clips used, the decoder
performance improved by ~2%.
Change-Id: Ifbc37ac6311bca9ff1cfefe3f2e9b7f13a4a511b
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now constists of pointers to MODE_INFO structs. The
MODE_INFO structs are now stored as a stream (decoder only),
eliminating unnecessary copies and is a little more cache
friendly.
Change-Id: I031d376284c6eb98a38ad5595b797f048a6cfc0d
Sample app: vp9_spatial_scalable_encoder
vpx_codec_control extensions:
VP9E_SET_SVC
VP9E_SET_WIDTH, VP9E_SET_HEIGHT, VP9E_SET_LAYER
VP9E_SET_MIN_Q, VP9E_SET_MAX_Q
expanded buffer size for vp9_convolve
modified setting of initial width in vp9_onyx_if.c so that layer size
can be set prior to initial encode
Default number of layers set to 3 (VPX_SS_DEFAULT_LAYERS)
Number of layers set explicitly in vpx_codec_enc_cfg.ss_number_layers
Change-Id: I2c7a6fe6d665113671337032f7ad032430ac4197
The commit changes the border pixel extension from 160 pixel each side
to what is necessary in arnr filter or motion estimation portion, i.e.
16 pixel on top and left side. For right or bottom side, the extension
is changed to either round up image size to multiple of 64 or at least
16 pixels.
Change-Id: Ic05e19b94368c1ab4df568723aae5734e6c3d2c5
Adds a new end-usage option for constant quality encoding in vpx. This
first version implemented for VP9, encodes all regular inter frames
using the quality specified in the --cq-level= option, while encoding
all key frames and golden/altref frames at a quality better than that.
The current performance on derfraw300 is +0.910% up from bitrate control,
but achieved without multiple recode loops per frame.
The decision for qp for each altref/golden/key frame will be improved
in subsequent patches based on better use of stats from the first pass.
Further, the qp for regular inter frames may also be varied around the
provided cq-level.
Change-Id: I6c4a2a68563679d60e0616ebcb11698578615fb3
mode_info_context was stored as a grid of MODE_INFO structs.
The grid now constists of a pointer to a MODE_INFO struct and
a "in the image" flag. The MODE_INFO structs are now stored
as a stream, eliminating unnecessary copies and is a little
more cache friendly.
For the test clips used, the decoder performance improved
by ~4.3% (1080p) and ~9.7% (720p).
Patch Set 2: Re-encoded clips with latest. Now ~1.7% (1080p)
and 5.9% (720p).
Change-Id: I846f29e88610fce2523ca697a9a9ef2a182e9256
The 32x32 forward transform can potentially reach peak coefficient
value close to 32700, while the rounding factor can go upto 610.
This could cause overflow issue in the SSSE3 implementation of 32x32
quantization process.
This commit resolves this issue by replacing the addition operations
with saturated addition operations in 32x32 block quantization.
Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70
Moves counting of mv branches to where we have a new mv, instead of after
the whole frame is summed.
Change-Id: I945d9f6d9199ba2443fe816c92d5849340d17bbd
This commit fixed the potential overflow issue in the SSE2
implementation of 32x32 forward DCT. It resolved the corrupted
coded frames in the border of scenes.
Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9
- Intermediate height was not correct i.e. when block size is 4 and
y_step_q4 is 6. In this case intermediate height was
(4*6) >> 4 = 1 and vertical interpolation needs two source pixels
plus 7 extra pixels for taps.
- Also if the current output block is 16x16 and we are using 4x upscaling
we need only 12 rows after horizontal filtering instead of 16.
Patch Set 2: Intermediate_height updated after CL 66723
"Fix bug in convolution functions (filter selection)"
Change-Id: I5a1a1bc2ac9d5edb3a6e0818de618bf318fdd589
The 32x32 quantization process can potentially have the intermediate
stacks over 16-bit range, thereby causing enc/dec mismatch. This commit
fixes this overflow issue in the SSSE3 implementation, as well as the
prototype, of 32x32 quantization.
This fixes issue 607 from webm@googlecode.
Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806
This patch is a reformatted version of optimizations done by
engineers at Intel (Erik/Tamar) who have been providing
performance feedback for VP9. For the test clips used (720p, 1080p),
up to 1.2% performance improvement was seen.
Change-Id: Ic1a7149098740079d5453b564da6fbfdd0b2f3d2
Make the current head working properly, while working on fixing an
issue in the SSSE3 implementation of 32x32 quantization.
Change-Id: Ic029da3fd7f1f5e58bc641341cbd226ec49a16bc
Previous change c4048dbd limits the mv search range assuming max block
size of 64x64, this commit change the search range using actual block
size instead.
Change-Id: Ibe07ab02b62bf64bd9f8675d2b997af20a2c7e11
Making code more compact, adding consts, removing redundant arguments,
adding do/while(0) for macros.
Change-Id: Ic9ec0bc58cee0910a5450b7fb8cfbf35fa9d0d16
(In response to Issue 604:
https://code.google.com/p/webm/issues/detail?id=604)
There were bugs in the convolution code for two cases:
1. Where the filter table was assumed to be aligned to a
256 byte boundary. The offset of the pixel in the
source buffer was computed incorrectly.
2. Where no such alignment assumption was made. An
incorrect address for the filter table base was used.
To fix both problems, I now assume that the filter table is
256-byte aligned and modify the pixel offset calculation to
match.
A later patch should remove the restriction that the filter
table is aligned to a 256-byte boundary.
There was also a bug in the ConvolveTest unit test
(convolve_test.cc).
(Bug & initial fix suggestion submitted by Tero Rintaluoma
and Sami Pietilä).
Change-Id: I71985551e62846e55e40de9e7e3959d4805baa82
vp9_short_idct10_16x16_add is used to handle the block that only have valid data
at top left 4x4 block. All the other datas are 0. So we could cut many
unnecessary calculations in order to save instructions.
Change-Id: I6e30a3fee1ece5af7f258532416d0bfddd1143f0
It is possible to have invalid scale factors and not access them
during decoding. Error is reported if we really try to use invalid scale
factors.
Change-Id: Ie532d3ea7325ee0c7a6ada08269f804350c80fdf
Adding set_contexts contexts function and call it instead of
set_contexts_on_border. Calling txfrm_block_to_raster_xy to get aoff and
loff.
Change-Id: I41897e344afd2cae1f923f4fdbe63daccf6fe80e
Moving foreach_predicted_block_in_plane function to vp9_reconinter.c
because there is only one usage.
Change-Id: I9852feae43fc3cf809b817fc541d043bc5496209
Updating implementation of vp9_get_pred_context_single_ref_p2 using
has_second_ref function to make code easier to read.
Change-Id: I5ba642712f59861a48aab974e73aa01640d086fe
vp9_short_idct10_8x8_add is used to handle the block that only have valid data
at top left 4x4 block. All the other datas are 0. So we could cut several
unnecessary calculations in order to save instructions.
Change-Id: I34fda95e29082b789aded97c2df193991c2d9195
Updating implementation of vp9_get_pred_context_single_ref_p1 using
has_second_ref function to make code easier to read.
Change-Id: Ie8f60403a7195117ceb2c6c43176ca9a9e70b909
As the pixel values beyond image border are duplicates of pixels
on edge, the change limits the mv search range, any mv beyond
the limits no longer produce new/different prediction values
as entire block with pixels used for subpel interpolation are
outside image border.
Change-Id: I4c6fdf06e33c1cef1489f5470ce0fb4e5e01fb79
Updating all foreach_transformed_block_visitor functions to work with
plane block size instead of general block. Removing a lot of duplicated
code.
Change-Id: I6a9069e27528c611f5a648e1da0c5a5fd17f1bb4
This change set is intermediate. The next one will remove all repetitive
plane_bsize calculations, because it will be passed as argument to
foreach_transformed_block_visitor.
Change-Id: Ifc12e0b330e017c6851a28746b3a5460b9bf7f0b
The intent was to initialize the deltas for the
segment to the computed value, irrespective of mode
and reference frame if (mode_ref_delta_enabled == 0).
(In response to bug posted by Manjit Hota to codec-devel
and webm-discuss lists)
Change-Id: I10435cb63d0f88359bb4c14f22181878a1988e72
When the frame size changes the last frame segment map must
be resized to match and initialized to 0.
Change-Id: Idc10de109f55dbe9af3a6caae355a2974712243d
VP9_COMMON is the right place to segmentatation struct because it has
global segmentation parameters, not something specific to macroblock
processing.
Change-Id: Ib9ada0c06c253996eb3b5f6cccf6a323fbbba708
This commit unifies the rate-distortion cost calculation process of
luma and chroma components. It allows early termination to be enabled
later in the rd search loop of chroma components, in consistent with
luma pixels.
Change-Id: I2e52a7c6496176bf2a5e3ef338d34ceb8aad9b3d
Making foreach_transformed_block_in_plane more clear (it's not finished
yet). Using explicit tx_size variable consistently instead of
(ss_txfrm_size / 2) or (ss_txfrm_size >> 1) expression.
Change-Id: I1b9bba2c0a9f817fca72c88324bbe6004766fb7d
The macro block mode info context originally contained an
entry for each 16x16 macroblock. In VP9 each entry refers
to an 8x8 region not a macro block, so the naming is misleading.
This first stage clean up changes the names of 3 entries in the
structure to remove the mb_ prefix.
TODO clean up the nomenclature more widely in respect of
mbmi and bmi.
Change-Id: Ia7305c6d0cb805dfe8cdc98dad21338f502e49c6
Enable SSE2 implementation of high precision 32x32 forward DCT. The
intermediate stacks are of 32-bits. The run-time goes down from
32126 cycles to 13442 cycles.
Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
Adding function build_inter_predictors_for_planes to build inter
predictors for specified planes. This function allows to remove
condition "#if CONFIG_ALPHA" and use MAX_MB_PLANE for general case.
Renaming 'which_mv' local var to 'ref', and 'weight' argument to 'ref'.
Change-Id: I1a97160c9263006929d38953f266bc68e9c56c7d
Invert loops to operate vertically in the inner loop. This allows
removing redundant loads.
Also add preloading of data.
Change-Id: I4fa85c0ab1735bcb1dd6ea58937efac949172bdc
Each iteration of the horizontal loop reuses 7 of the 11 source
values. Loading only the 4 new values saves some time.
Also add preload for source data.
Overall 4% faster on Chromebook.
Change-Id: I8f69e749f2b7f79e9734620dcee51dbfcd716b44
Loop filter configuration doesn't belong to macroblock, so moving it from
MACROBLOCKD to VP9_COMMON. Also moving the declaration of loopfilter struct
from vp9_blockd.h to vp9_loopfilter.h.
Change-Id: I4b3e34be9623b47cda35f9b1f9951f8c5b1d5d28
The memset sets 16 bytes rather than the correct size of the
final array dimension (MAX_MODE_LF_DELTAS).
(In response to bug posted by Manjit Hota to codec-devel
and webm-discuss lists)
Change-Id: I8980f5aa71ddc9d7ef57c5b4700bc28ddf8651c7
Using block width and block height instead of their logarithms. Using
SUBPEL_BITS and SUBPEL_SHIFTS constants instead of magic numbers.
Change-Id: I4e10e93c907c8a5e1cb27dfe74d1fcdcc4995448
Removing the old one bsize_from_dim_lookup. Now we have a way to determine
block size for plane using its subsampling values (ss_size_lookup). And
then we can find the number of pixels in the block (num_pels_log2_lookup).
Change-Id: I6fc981da2ae093de81741d3d78eaefed11015db9
Functions scale_mv_q4 and scale_mv_q3_to_q4 were almost identical except
q3->q4 conversion in scale_mv_q3_to_q4. Now q3->q4 conversion happens
directly in vp9_build_inter_predictor.
Also adding useful constants: SUBPEL_BITS and SUBPEL_MASK.
Change-Id: Ia0a6ad2ac07c45fdf95a5139ece6286c035e9639
Enable use_x86inc as a commandline option. Fix Bug with sse2 when
x86inc is disabled. Adds Sad asm protection to x86inc protection
Change-Id: Iee0f9dd235ea10e8ace512eb362ba9bebe8c9df6
There was no benefit having this function. For example, inside
read_switchable_filter_type switchable filter context was calculated twice.
Change-Id: I79cd5bf95cbc0f6d8bf91a2e32289e01b18dcff1
Converting arguments of two functions (clamp_mv_ref, lower_mv_precision)
from int_mv* to MV*. Rewriting is_inside function to make it much shorter.
Change-Id: Ie4c4cf3eccd46707c7df099ec21fb1b61c72fc7a
Support enabling it or disabling it. Moved read out to configure.sh
so that its done once instead of in make and in config.
Change-Id: I73a9190cf31de9f03e8a577f478fa522f8c01c8b
Currently the only threaded option for vp9 decode. Enabled when the
decoder config thread count is > 1.
Change-Id: I082959abac9e31aa4a38ed9fd68b94680e57f4df
This changeset allows to remove vp9_switchable_interp and
vp9_switchable_interp_map arrays and make code much clear. Actually we
still have to use these mapping but only inside read_interp_filter_type and
write_interp_filter_type functions.
Change-Id: I4026c6f8c4acefba6c81421b7bacbaa52cc45f50
Chromium does not support 32bit builds for Mac which use x86inc.asm.
Make the files which include it work if 64bit or not PIC enabled
starting with vp9_copy_sse2.asm
Consolidate these targets in vp9_rtcd_defs.sh
Change-Id: If18f0b957a611efd085a3ee7d245cf1eb91e8248
This is an attempt at rewriting vp9_find_mv_refs_idx. I believe that it gains
about 1-2% decode speed
Change-Id: Ia5359c94ce9bb43b32652890e605e9a385485c1b