Add a full range motion search for regular block sizes. This runs
exhaustive search within the given reference area. This commit further
optimizes the search process by combining 4 points test into one
pipeline, which gives 30% speed-up as compared to run each individual
point at a time.
This full range search serves as a best possible motion search reference.
When replacing the diamond search with full range search, the speed 0
runtime of bus CIF at 2000 kbps goes from 153872ms to 623051ms. The
compression performance compared to speed 0 setting gains 0.585% for
derf set.
Change-Id: Ieef1225216b0b86b4ac4872fa7fb9e18bf2eabb3
This patch followed "Add filter_selectively_vert_row2 to enable
parallel loopfiltering" commit, and added x86 SSE2 optimization
to do 16-pixel filtering in parallel. For other optimizations
(neon and dspr2), current 16-pixel functions were done by calling
8-pixel functions twice, and real 16-pixel functions could be added
later.
Decoder speedup:
tulip clip: 2% speed gain;
old_town_cross: 1.2% speed gain;
bus: 2% speed gain.
Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7
This patch followed "Rewrite filter_selectively_horiz for parallel
loopfiltering" commit, and added x86 SSE2 optimization to do
16-pixel filtering in parallel. Also, corrected the declaration
of aligned arrays. For 8-pixel-in-parallel case, improved the
calculation of the masks and filters. Updated the threshold loading
since the thresholds were already duplicated. Updated neon C functions
to call neon loopfilters twice.
Using tulip clip, tests showed it gave a ~1.5% decoder speed gain.
Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35
on arm until we implenment real vp9_idct32x32_34_add_neon.
This issue is due to commit 47665452f0da3c11427ecb4852535e1787bb0c5b
Merge "Add 32x32 idct function for eob<=34 case".
Change-Id: I56b5f0abc20e7dd1bba521f78a995e85d65ea296
This CL contains two AVX2 optimized loop filter functions,
mb_lpf_horizontal_edge_w_avx2_8 and mb_lpf_horizontal_edge_w_avx2_16.
Change-Id: I604e4fe6e99752b7800c2ea98721d97f7e0b931b
When only upper-left 8x8 area has non-zero dct coefficients, we
could skip 1D IDCT for 9th to 32th rows to save operations. This
function is called when eob <= 34.
Change-Id: I9684b75947bdde346cfe3720f08a953aa7a13fb5
For consistency with idct function names. Renames:
vp9_short_fdct4x4 -> vp9_fdct4x4
vp9_short_walsh4x4 -> vp9_fwht4x4
Change-Id: Id15497cc1270acca626447d846f0ce9199770f58
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.
Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b
This patch fixed a bug that caused 32bit PIC build mismatch. The
stack pointer was modified after "GET_GOT". Loading left pointer
from a hard-coded position gave wrong result.
Change-Id: Iea0aec6f917b12a6b3393ffc986bad74510248cc
Commit "d207 intra prediction ssse3 using bytes" caused mismatch
while building 32bit PIC code. Disabled these SSSE3 functions
until we fix the bug.
Change-Id: Ic444e531d3d4058092fe6eab09006b44fcb18e4c
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.
Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.
Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.
Change-Id: Id623c5113262655fa50f7c9d6cec9a91fcb20bb4
We have two SSE2-optimized functions for idct4_1d:
vp9_idct4_1d_sse2 <-- removing this one
idct4_1d_sse2
vp9_idct4_1d_sse2 was used only by the following functions which already
have SSE2 optimized variants:
vp9_idct4x4_16_add_c -> vp9_idct4x4_16_add_see2
idct8_1d -> vp9_idct8x8_{16, 10, 1}_see2
vp9_short_iht4x4_add_c -> vp9_short_iht4x4_add_see2
Change-Id: Ib0a7f6d1373dbaf7a4a41208cd9d0671fdf15edb
To ensure fast encoding/decoding on devices without ssse3 support,
SSE2 optimization of sub-pixel filters was done. Test using 1080p
clip showed the decoder speeds were ~70fps with ssse3 filters, ~60fps
with sse2 filters, and ~15fps with c filters.
Change-Id: Ie2088f87d83a889fba80a613e4d0e287aadd785c