Commit Graph

186 Commits

Author SHA1 Message Date
Dmitry Kovalev
25a666ef39 Moving pair_set_epi32 macro into vp9_dct32x32_sse2.c.
Change-Id: I642a7d343677bf934e9a54cf4ad78e908620e39a
2014-05-01 16:45:49 -07:00
Dmitry Kovalev
e05b92c0aa Merge "Removing half-variance asm functions which are not used." 2014-05-01 14:50:45 -07:00
Jingning Han
39761eb5d6 Merge "Enable SSSE3 implementation of 8x8 forward 2D-DCT" 2014-04-30 13:41:36 -07:00
Dmitry Kovalev
94f5491c46 Removing half-variance asm functions which are not used.
Corresponding C functions were removed in
I99695564a3aa9bc8c79ac0a551d257e2ff3ad3c3

Change-Id: I50a5575065a7a9e41904eb2161afd739def927db
2014-04-30 12:21:54 -07:00
Jingning Han
1eaa3a76dc Enable SSSE3 implementation of 8x8 forward 2D-DCT
Assembly implementation of ssse3 8x8 forward 2D-DCT. The current
version is turned on only for x86_64. The average unit runtime
goes from 157 cycles down to 136 cycles, i.e., about 12.8% faster.
This translates into about 1.5% speed-up for pedestrian_area 1080p
at speed 2.

Change-Id: I0f12435857e9425ed7ce12541344dfa16837f4f4
2014-04-29 15:49:18 -07:00
Dmitry Kovalev
6e01079cc0 Removing unused vp9_variance_halfpixvar*() functions.
Change-Id: I99695564a3aa9bc8c79ac0a551d257e2ff3ad3c3
2014-04-25 11:50:07 -07:00
Dmitry Kovalev
2fc3a18653 Removing unused vp9_mcomp_x86.h file.
We don't use declarations from this file. The real declarations
(differently named) are in vp9_rtcd_defs.pl, e.g. vp9_full_search_sad.

Change-Id: I73cbf064305710ba20747233cfdbe67366f069a0
2014-04-14 11:32:58 -07:00
levytamar82
0fa8b668c1 AVX2 SAD Optimization:
2 functions were optimized for avx2 by using full 256 bit register
In order to handle 32 elements in parallel instead of only 16 in parallel:
1. vp9_sad32x32x4d
2. vp9_sad64x64x4d

The function level gain is 66% and the user level gain is ~1%.

Change-Id: I4efbb3bc7d8bc03b64b6c98f5cd5c4a9dd3212cb
2014-03-21 13:53:32 -07:00
Yaowu Xu
5511968f21 Removed several unused functions.
Change-Id: Ib9e27298c575afc02a98b593bc6ad60762064d9b
2014-03-17 14:09:29 -07:00
Andrew Russell
e337322e63 Merge "improved speed of 4x4 sse2 fdct." 2014-03-05 14:35:44 -08:00
Andrew Russell
a46f5459c3 improved speed of 4x4 sse2 fdct.
* speed improvment of 30 percent achieved
* multiplies and adds remain the same
* non-arithmetic instructions minimized by hand, by:
   -expanding 2 pass loop
   -removing irrelivant "shuffles"
   -combining last two rounding steps
* further improvments may be possible

Change-Id: Idec2c3f52910c48e6a0e0f9aefed5cae31b0b8c0
2014-03-03 14:25:42 -08:00
levytamar82
ea14909687 AVX2 SubPixel AVG Variance Optimization
Optimizing 2 functions to process 32 elements in parallel instead of 16:
1. vp9_sub_pixel_avg_variance64x64
2. vp9_sub_pixel_avg_variance32x32
both of those function were calling vp9_sub_pixel_avg_variance16xh_ssse3
instead of calling that function, it calls vp9_sub_pixel_avg_variance32xh_avx2
that is written in avx2 and process 32 elements in parallel.
This Optimization gave 80% function level gain and 2% user level gain

Change-Id: Iea694654e1b7612dc6ed11e2626208c2179502c8
2014-02-28 22:51:04 -07:00
James Zern
d12b39daab vp9_subpel_variance_impl_intrin_avx2.c: make some tables static
+ fix formatting

Change-Id: I7b4ec11b7b46d8926750e0b69f7a606f3ab80895
2014-02-18 20:42:49 -08:00
levytamar82
52dac5d1cb AVX2 SubPixel Variance Optimization
Optimizing 2 functions to process 32 elements in parallel instead of 16:
1. vp9_sub_pixel_variance64x64
2. vp9_sub_pixel_variance32x32
both of those function were calling vp9_sub_pixel_variance16xh_ssse3
instead of calling that function, it calls vp9_sub_pixel_variance32xh_avx2
that is written in avx2 and process 32 elements in parallel.
This Optimization gave 70% function level gain and 2% user level gain

Change-Id: I4f5cb386b346ff6c878a094e1c3b37e418e50bde
2014-02-14 16:59:11 -07:00
Andrew Russell
549c31f8ae minor spelling cleanup in comments
Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06
2014-02-12 16:32:51 -08:00
Yunqing Wang
0d43bd77e5 Bug fix in ssse3 quantize function
A bug was reported in Issue 702: "SIGILL (Illegal instruction) when
transcoding with vp9 - using FFmpeg". It was reproduced and fixed.

Change-Id: Ie32c149a89af02856084aeaf289e848a905c7700
2014-02-07 14:32:30 -08:00
Dmitry Kovalev
005fc6970b Finally removing "short" from transform names.
Change-Id: I5259b68dc1bcceb153e3ffe638a79a59a3019e9d
2014-02-06 11:54:15 -08:00
Dmitry Kovalev
ff41764920 Removing _1d suffix from transform names.
It is enough to specify (e.g.) idct16, it is obviously different from
idct16x16.

Change-Id: I6b408a37a945de3162429380b59a775b03b95db0
2014-01-27 16:15:36 -08:00
James Zern
b453941caf vp9/encoder: add extern "C" to headers
Change-Id: I4f51ce859a97bf1b8fd2b37ac585b7c643232b69
2014-01-23 16:21:24 -08:00
levytamar82
357b65369f AVX2 Variance Optimization
Optimizing the variance functions: vp9_variance16x16, vp9_variance32x32,
vp9_variance64x64, vp9_variance32x16, vp9_variance64x32,
vp9_mse16x16 by migrating to AVX2
some of the functions were optimized by processing 32 elements instead of 16.
some of the functions were optimized by processing 2 loop strides of 16
elements in a single 256 bit register
This optimization gives between 2.4% - 2.7% user level performance gain
and 42% function level gain.

Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d
2014-01-08 12:05:53 -07:00
James Zern
bd9a388a06 vp9: normalize include guards
Change-Id: If4ddbdcfb3ab387cbca6910b42cf4df8111e6879
2013-12-16 19:40:49 -08:00
Yaowu Xu
e9c19617bf Merge "vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2" 2013-11-27 10:27:32 -08:00
levytamar82
8def766de2 vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2
Change-Id: I6366e84490883b72362f762369d7e5bccb64f02f
2013-11-21 14:19:49 -08:00
Abo Talib Mahfoodh
ec2dbdd107 Improve vp9_fdct4x4_sse2 (x1.2)
Modifications are done to reduce the total clock cycle.
Speedup: 1.2

Tested with: park_joy_420_720p50.y4m

Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f
2013-11-21 15:04:35 -05:00
Jingning Han
fabc783695 Fix an overflow issue in SSE2 forward ADST
The step that sums three input samples could potentially cause the
intermediate result go beyond 16 bit limit, when operating as the
second 1-D transform. This commit fixes the issue.

Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516
2013-11-13 15:15:59 -08:00
Yunqing Wang
d7289658fb Remove TEXTREL from 32bit encoder
This patch fixed the issue reported in "Issue 655: remove textrel's
from 32-bit vp9 encoder". The set of vp9_subpel_variance functions
that used x86inc.asm ABI didn't build correctly for 32bit PIC. The
fix was carefully done under the situation that there was not
enough registers.

After the change, we got
$ eu-findtextrel libvpx.so
eu-findtextrel: no text relocations reported in 'libvpx.so'

Change-Id: I1b176311dedaf48eaee0a1e777588043c97cea82
2013-11-07 13:39:40 -08:00
Dmitry Kovalev
600a3860a4 Making input pointer constant for all fdct/fht functions.
Change-Id: I78f7012f967a777ddd39bae6671eb501df6bbfe8
2013-10-24 11:48:25 -07:00
Dmitry Kovalev
fd724f13b0 Renaming vp9_short_fdct4x4 and vp9_short_walsh4x4.
For consistency with idct function names. Renames:
  vp9_short_fdct4x4  -> vp9_fdct4x4
  vp9_short_walsh4x4 -> vp9_fwht4x4

Change-Id: Id15497cc1270acca626447d846f0ce9199770f58
2013-10-23 14:28:39 -07:00
Dmitry Kovalev
a018988ce8 Renaming vp9_short_fdct32x32 to vp9_fdct32x32.
For consistency with idct function names.

Change-Id: Ie77b7178e0894c57cd5cb9243c949eb9224ece18
2013-10-23 13:41:40 -07:00
Dmitry Kovalev
5bdd4d9ccf Merge "Renaming vp9_short_fdct16x16 to vp9_fdct16x16." 2013-10-23 13:37:09 -07:00
Dmitry Kovalev
02feb63684 Renaming vp9_short_fdct16x16 to vp9_fdct16x16.
For consistency with idct function names.

Change-Id: I5ca355ba99fdba04f09254be95cf79808b534f71
2013-10-23 10:57:12 -07:00
Dmitry Kovalev
fa143dbc8e Renaming vp9_short_fdct8x8 to vp9_fdct8x8.
For consistency with idct function names.

Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f
2013-10-23 10:52:33 -07:00
Dmitry Kovalev
9f09618bd4 Merge "Using stride (# of elements) instead of pitch (bytes) in fdct4x4." 2013-10-22 13:05:24 -07:00
Dmitry Kovalev
a767d10fa5 Merge "Using stride (# of elements) instead of pitch (bytes) in fdct8x8." 2013-10-22 11:34:17 -07:00
Dmitry Kovalev
190c2b4591 Using stride (# of elements) instead of pitch (bytes) in fdct4x4.
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.

Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b
2013-10-21 15:27:35 -07:00
Dmitry Kovalev
e5fa44c869 Using stride (# of elements) instead of pitch (bytes) in fdct8x8.
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.

Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1
2013-10-18 12:20:26 -07:00
Dmitry Kovalev
1aa7fd5aef Using stride (# of elements) instead of pitch (bytes) in fdct16x16.
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.

Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d
2013-10-18 11:49:33 -07:00
Dmitry Kovalev
e05412fc23 Using stride (# of elements) instead of pitch (bytes) in fdct32x32.
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.

Change-Id: Id623c5113262655fa50f7c9d6cec9a91fcb20bb4
2013-10-17 13:02:28 -07:00
Dmitry Kovalev
a4585285ed Removing unused 8x4 transform from the encoder.
Change-Id: Icbcf68b5b685a56f255ebc3859c9692accdadf9e
2013-10-15 11:27:28 -07:00
Jingning Han
80f215198f Merge "Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16" 2013-10-09 16:08:42 -07:00
Jim Bankoski
9603989c72 Merge "cpplint vp9_variance_sse2.c" 2013-10-07 15:44:50 -07:00
Jim Bankoski
f59cb3eacc Merge "added nolint to function that doesn't seem easy to breakup" 2013-10-05 16:47:23 -07:00
Jim Bankoski
5b4f836148 cpplint issues resolved in vp9_variance_mmx.c
Change-Id: Idbfabe427fbeab44210f13fec8b6f63f7a4eb0dd
2013-10-04 14:22:08 -07:00
Jim Bankoski
eb5b7ac27b added nolint to function that doesn't seem easy to breakup
Change-Id: I5489b116aea7c510ea5ebbed3c1445f321b05f3e
2013-10-04 14:17:47 -07:00
Jim Bankoski
25ecb1f0b3 cpplint vp9_variance_sse2.c
Change-Id: Ifce8f5b57a1ea8952e8a67c5b92a127a061899fa
2013-10-04 14:15:06 -07:00
A.Mahfoodh
5215b83aea Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16
Simplify the k_cvtlo_epi16 and k_cvthi_epi16 to only two
instructions. Then inlined them.

quoting from intel MMX_App_Compute_16bit_Vector.pdf‎
"The PMADDWD instruction multiplies four
pairs of 16-bit numbers and produces partial sums of the results
and can do so once per clock (with a three-clock latency)."
so I am assuming that there will be three clock overhead after the
last _mm_madd_pi16 command.
Even with the overhead the number of clocks in general should be
smaller. I am not sure though becasue I could not find information
about number of clocks required for instructions in k_cvtlo_epi16
and k_cvthi_epi16. I will run a test and compare the execution time.

Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7
2013-10-02 20:02:03 -04:00
A.Mahfoodh
13c7715a75 Number of instructions in fdct4_1d_sse2 reduced by two.
Mathematically the results are the same.

Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0
2013-09-23 17:23:27 -07:00
Jingning Han
09bc942b47 Fix overflow issue in 16x16 quantization SSSE3
The 16x16 transform unit test suggested that the peak coefficient
value can reach 32639. This could cause potential overflow issue
in the SSSE3 implmentation of 16x16 block quantization. This commit
fixes this issue by replacing addition with saturated addition.

Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e
2013-09-06 21:06:10 -07:00
Jingning Han
458c2833c0 Use saturated addition in SSSE3 of 32x32 quant
The 32x32 forward transform can potentially reach peak coefficient
value close to 32700, while the rounding factor can go upto 610.
This could cause overflow issue in the SSSE3 implementation of 32x32
quantization process.

This commit resolves this issue by replacing the addition operations
with saturated addition operations in 32x32 block quantization.

Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70
2013-09-05 12:49:12 -07:00
Jingning Han
3cf46fa591 Fix 32x32 forward transform SSE2 version
This commit fixed the potential overflow issue in the SSE2
implementation of 32x32 forward DCT. It resolved the corrupted
coded frames in the border of scenes.

Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9
2013-08-31 18:47:08 -07:00
Jingning Han
c86c5443eb Merge "Fix overflow issue in SSSE3 32x32 quantization" 2013-08-29 16:49:04 -07:00
Jingning Han
abff678866 Fix overflow issue in SSSE3 32x32 quantization
The 32x32 quantization process can potentially have the intermediate
stacks over 16-bit range, thereby causing enc/dec mismatch. This commit
fixes this overflow issue in the SSSE3 implementation, as well as the
prototype, of 32x32 quantization.

This fixes issue 607 from webm@googlecode.

Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806
2013-08-29 11:00:54 -07:00
Yaowu Xu
9482c07953 fixed the reading too many bytes
In subpel_avg_variance functions, code similar to the following

punpkldq m2, [addr]

actually reads 8 bytes. For functions that are supposed to work on
buffers only have less 8 bytes a line, this caused valgrind error
of reading uninitialized memory.

Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34
2013-08-27 08:39:20 -07:00
Yaowu Xu
6c5433c836 Fix the reading of too many input pixels
in VP9_get4x4var_mmx

Change-Id: I4b4a8f45f25ebdfad281f169cc87aba5e2d6f227
2013-08-26 12:35:27 -07:00
Jingning Han
78136edcdc SSE2 high precision 32x32 forward DCT
Enable SSE2 implementation of high precision 32x32 forward DCT. The
intermediate stacks are of 32-bits. The run-time goes down from
32126 cycles to 13442 cycles.

Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
2013-08-12 16:52:53 -07:00
Jingning Han
2c091f9768 Merge "Place holder for high-precision 32x32 fdct" 2013-08-06 14:47:30 -07:00
Jim Bankoski
5b307886fb variance x86inc guards
also fixed bug in sad calcs

Change-Id: I6571fcbe37556c16ae32be66dc0fd879852aac1d
2013-08-06 14:17:13 -07:00
Jingning Han
28566a6cd5 Place holder for high-precision 32x32 fdct
Resolve compile warnings on re-define FDCT32x32_2D template.

Change-Id: Idb3a54ef8d2710ce7245b726379a0e5c875f5cad
2013-08-06 11:44:08 -07:00
Christian Duvivier
3d98205fce Move fdct32x32 SSE2 implementation in separate file.
This is in preparation for the SSE2 version of the high-precision
32x32 forward DCT which will share a lot of code with the existing
low precision version used for rate-distortion search.

Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0
2013-08-06 10:17:11 -07:00
Ronald S. Bultje
c13e0bcb52 Remove unused fwalsh/fdct x86 SIMD implementations.
Change-Id: Ia942e56cf322821d42ba06178672791eeee2847e
2013-07-10 18:22:51 -07:00
Jingning Han
114423538f SSE2 16x16 ADST/DCT hybrid transform
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
no compression performance loss.

Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
2013-07-10 12:14:53 -07:00
Jingning Han
a38cf2658a Merge "Refactor SSE2 8x8 functional units" 2013-07-05 11:18:18 -07:00
Jingning Han
2cb75c9607 Refactor SSE2 8x8 functional units
These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT
hybrid transform coding.

Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d
2013-07-03 10:11:59 -07:00
Ronald S. Bultje
e5fb4b61b6 Use pmovmskb to skip quantize loops over empty coefficients.
If none of the 16 coefficients that we quantize per loop iteration
are larger than the zbin, directly skip to the next round of coeffs,
rather than doing a full quantize loop that will eventually result
in 16 zeroes. This incurs a jump cost, but saves a lot of other work.
32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded
no significantly positive results for smaller transforms, so is not
used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles).

Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647
2013-07-02 16:34:24 -07:00
Ronald S. Bultje
c8defcfdee Update quantize SSSE3 SIMD to cover 32x32 transform case also.
Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to
2min10.1, i.e. a 2.3% overall speed increase.

Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87
2013-07-01 11:36:33 -07:00
Ronald S. Bultje
7353ceab9d Quantize (64-bit only, for now) SSSE3 SIMD.
Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps
goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is
x86-64 only, it needs some minor modifications to be 32bit compatible,
because it uses 15 xmm registers, whereas 32bit only has 8.

Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904
2013-07-01 11:36:07 -07:00
Jingning Han
993942ce0c Merge "Enable SSE2 4x4 ADST/DCT transform" 2013-06-29 15:57:04 -07:00
Christian Duvivier
466e0cf303 SSE2 version of vp9_short_fdct32x32_rd.
43,000 -> 5,750 cycles, about 7.5x faster.

Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0
2013-06-29 13:53:00 -07:00
Jingning Han
1109b6b888 Enable SSE2 4x4 ADST/DCT transform
This commit enables SSE2 4x4 foward hybrid transform. The runtime
goes from 249 cycles down to 74 cycles. Overall around 2% speed-up
at no compression performance change.

Change-Id: Iad4d526346e05c7be896466c05500711bb763660
2013-06-28 17:24:43 -07:00
Jingning Han
9def7f72a0 Fix switch statement in 8x8 transform
Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f
2013-06-28 13:40:36 -07:00
Ronald S. Bultje
af660715c0 Make coefficient skip condition an explicit RD choice.
This commit replaces zrun_zbin_boost, a method of biasing non-zero
coefficients following runs of zero-coefficients to be rounded towards
zero, with an explicit skip-block choice in the RD loop.

The logic is basically that if individual coefficients should be rounded
towards zero (from a RD point of view), the trellis/optimize loop should
take care of it. If whole blocks should be zero (from a RD point of
view), a single RD check is much more efficient than a complete
serialization of the quantization loop.

Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim.
SIMD for quantize will follow in a separate patch. Results for other
test sets pending.

Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4
2013-06-28 10:28:49 -07:00
Yaowu Xu
60dc7375af fixed a compiling problem with MSVC win32 build
The aligned array in parameter list caused win32 build to report
c2719 error. This commit fixed the issue by make the parameter
type a pointer instead of an array.

Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992
2013-06-26 09:33:16 -07:00
Ronald S. Bultje
c5be54eef3 Merge "Add averaging-SAD functions for 8-point comp-inter motion search." 2013-06-25 13:50:53 -07:00
Ronald S. Bultje
c24d922396 Add averaging-SAD functions for 8-point comp-inter motion search.
Makes first 50 frames of bus @ 1500kbps encode from 3min22.7 to 3min18.2,
i.e. 2.3% faster. In addition, use the sub_pixel_avg functions to calc
the variance of the averaging predictor. This is slightly suboptimal
because the function is subpixel-position-aware, but it will (at least
for the SSE2 version) not actually use a bilinear filter for a full-pixel
position, thus leading to approximately the same performance compared to
if we implemented an actual average-aware full-pixel variance function.
That gains another 0.3 seconds (i.e. encode time goes to 3min17.4), thus
leading to a total gain of 2.7%.

Change-Id: I3f059d2b04243921868cfed2568d4fa65d7b5acd
2013-06-25 12:57:28 -07:00
Jingning Han
0084e61d5f Tune the rounding operations in 8x8 ADST/DCT sse2
Improve the round-trip precision to meet the unit test setttings.

Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79
2013-06-25 12:02:26 -07:00
Jingning Han
67365520e7 Merge "Use aligned buffer operations in 8x8/16x16 2D-DCT" 2013-06-25 09:49:03 -07:00
Yaowu Xu
b9c934df8e Merge "Enable sse2 implmentation of 8x8 ADST/DCT" 2013-06-25 09:13:22 -07:00
Jingning Han
82d504b50f Use aligned buffer operations in 8x8/16x16 2D-DCT
This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles.

Change-Id: I137758b81cd127b936175284310e81378db64552
2013-06-24 19:56:23 -07:00
Jingning Han
a32a086d23 Enable sse2 implmentation of 8x8 ADST/DCT
This commit makes use of the butterfly structure to enable the sse2
version implementation of 8x8 ADST/DCT hybrid transform coding.

The runtime of hybrid transform module goes down from 1170 cycles
to 245 cycles. Overall speed-up around 1.5%.

Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
2013-06-24 18:41:33 -07:00
Ronald S. Bultje
fc033b38ee Remove emms - that shouldn't be there.
Change-Id: I8fcab81e390f93dc17e9666bbf8f77883b5aa897
2013-06-21 14:45:04 -07:00
Ronald S. Bultje
ba42c02654 Add missing SECTION .text marker in assembly file.
Fixes a crash on Windows when building with MSVC.

Change-Id: I124ac756a1be55d190fadda5fcc46d23b1445dbf
2013-06-21 12:55:46 -07:00
Ronald S. Bultje
54b2a59623 Implement SSE2 block_error.
Change vp9_block_error() to return a 64bit error variable, change all
callers to expect a 64bit return value (this will prevent overflows,
which we basically don't check for at all right now). Remove duplicate
block_error() function, which fixed that through truncation. Remove
old (incompatible) mmx/sse2 block_error SIMD versions and replace with
a new one that returns a 64bit value.

Encoding time of first 50 frames of bus @ 1500kbps goes from 3min29 to
3min23, i.e. a 3% overall speedup.

Change-Id: Ib71ac5508b5ee8a80f1753cd85d72df1629abe68
2013-06-21 12:54:52 -07:00
Ronald S. Bultje
25c588b1e4 Add subtract_block SSE2 version and unit test.
3% faster overall (3min35.0 to 3min28.5).

Change-Id: I5ff8a5c2c91586b6632ca5009ad1ea51ce94af5e
2013-06-21 09:35:37 -07:00
Ronald S. Bultje
1e6a32f1af SSE2/SSSE3 optimizations and unit test for sub_pixel_avg_variance().
Encoding of bus @ 1500kbps (first 50 frames) goes from 3min57 to
3min35, i.e. approximately a 10.5% speedup. Note that the SIMD versions
which use a bilinear filter (x_offset & 7 || y_offset & 7) aren't
perfectly interleaved, and can probably be improved further in the
future. I've marked this with a few TODOs/FIXMEs in the code.

Change-Id: I5c9e900c0f0d32e431a50fecae213b510b2549f9
2013-06-20 15:59:48 -07:00
Ronald S. Bultje
8fb6c58191 Implement sse2 and ssse3 versions for all sub_pixel_variance sizes.
Overall speedup around 5% (bus @ 1500kbps first 50 frames 4min10 ->
3min58). Specific changes to timings for each function compared to
original assembly-optimized versions (or just new version timings if
no previous assembly-optimized version was available):

sse2   4x4:    99 ->   82 cycles
sse2   4x8:           128 cycles
sse2   8x4:           121 cycles
sse2   8x8:   149 ->  129 cycles
sse2   8x16:  235 ->  245 cycles (?)
sse2  16x8:   269 ->  203 cycles
sse2  16x16:  441 ->  349 cycles
sse2  16x32:          641 cycles
sse2  32x16:          643 cycles
sse2  32x32: 1733 -> 1154 cycles
sse2  32x64:         2247 cycles
sse2  64x32:         2323 cycles
sse2  64x64: 6984 -> 4442 cycles

ssse3  4x4:           100 cycles (?)
ssse3  4x8:           103 cycles
ssse3  8x4:            71 cycles
ssse3  8x8:           147 cycles
ssse3  8x16:          158 cycles
ssse3 16x8:   188 ->  162 cycles
ssse3 16x16:  316 ->  273 cycles
ssse3 16x32:          535 cycles
ssse3 32x16:          564 cycles
ssse3 32x32:          973 cycles
ssse3 32x64:         1930 cycles
ssse3 64x32:         1922 cycles
ssse3 64x64:         3760 cycles

Change-Id: I81ff6fe51daf35a40d19785167004664d7e0c59d
2013-06-20 09:34:25 -07:00
Ronald S. Bultje
d9fc451666 Move subpixel variance function from common/ to encoder/.
This seems to only be used in the encoder. Also remove an empty wrapper
file that contained forward declarations for this function, but didn't
actually define any actual functions.

Change-Id: Ifc561eef7ebe374a7d03698055e51e105f6d614b
2013-06-17 16:54:09 -07:00
Jingning Han
0b7910b9ff Merge "Enable sse2 version of sad8x4/4x8" 2013-06-14 13:15:49 -07:00
Jingning Han
15f50e7b42 Enable sse2 version of sad8x4/4x8
The encoding time for bus at CIF goes from 661s to 625s. This commit
also enabled unit test of sad8x4/4x8 in sad_test.cc.

Change-Id: If3d10ebb56bda584bdb69bcf056599d580b12cb1
2013-06-13 16:18:18 -07:00
Ronald S. Bultje
fa96eeb835 Implement SSE version for sad4x8x4d and SSE2 version for sad8x4x4d.
Encoding time of crew (CIF, first 50 frames) @ 1500kbps goes from 4min56
to 4min42.

Change-Id: I92c0c8b32980d2ae7c6dafc8b883a2c7fcd14a9f
2013-06-12 17:40:01 -04:00
John Koleszar
d0ed677a34 Merge branch 'master' into experimental
Change-Id: Ie648398b82f7311143709f55c0e30ba452f50eff
2013-06-11 16:29:28 -07:00
Yunqing Wang
f4fcfe3075 Optimize variance functions
Added SSE2 version of variance functions for super blocks.

Change-Id: Ibeaae8771ca21c99d41dd74067574a51e97b412d
2013-05-22 10:29:38 -07:00
Johann
e43662e8e6 Remove unused quantize optimizations.
Files were copied from vp8 and never maintained.

Change-Id: I9659a8755985da73e8c19c3c984423b6666d8871
2013-04-30 18:42:05 -07:00
Johann
32a5c52856 Merge branch 'master' into experimental
Conflicts:
	vp9/common/vp9_findnearmv.c
	vp9/common/vp9_rtcd_defs.sh
	vp9/decoder/vp9_decodframe.c
	vp9/decoder/x86/vp9_dequantize_sse2.c
	vp9/encoder/vp9_rdopt.c
	vp9/vp9_common.mk

Resolve file name changes in favor of master. Resolve rdopt changes in
favor of experimental, preserving the newer experiments.

Change-Id: If51ed8f457470281c7b20a5c1a2f4ce2cf76c20f
2013-04-26 12:57:10 -07:00
Johann
e3038ca8b7 Whitespace nit
Change-Id: I7486970c57cda75d26ec2c6d1f36bd668c955f66
2013-04-26 01:03:35 -07:00
Johann
863601c589 Normalize more intrinsic filenames
vp9_dequantize_x86 has only sse2 functions.

vp9_dct_sse2_intrinsics has no namespace collision and can drop
_intrinsics.

vp9_idct_mmx.h is unused.

Change-Id: Ic16e31fb372a1d1e841a62ecb4189fe8f95808ec
2013-04-25 23:26:20 -07:00
John Koleszar
15255eef82 Move dequant from BLOCKD to per-plane MACROBLOCKD
This data can vary per-plane, but not per-block.

Change-Id: I1971b0b2c2e697d2118e38b54ef446e52f63c65a
2013-04-25 11:57:20 -07:00
John Koleszar
cbd1315ac4 Move src_diff to per-plane MACROBLOCK data
First in a series of commits making certain MACROBLOCK members
addressable per-plane. This commit also refactors the block subtraction
functions vp9_subtract_b, vp9_subtract_sby_c, etc to be
loops-over-planes and variable subsampling aware.

Change-Id: I371d092b914ae0a495dfd852ea1a3d2467be6ec3
2013-04-23 12:18:51 -07:00
Jingning Han
6f43ff5824 Make the use of pred buffers consistent in MB/SB
Use in-place buffers (dst of MACROBLOCKD) for  macroblock prediction.
This makes the macroblock buffer handling consistent with those of
superblock. Remove predictor buffer MACROBLOCKD.

Change-Id: Id1bcd898961097b1e6230c10f0130753a59fc6df
2013-04-18 14:59:36 -07:00
Ronald S. Bultje
0c481f4d18 Add SSE2 versions for rectangular sad and sad4d functions.
About 11% overall encoder speedup with the sbsegment experiment enabled.

Change-Id: Iffb1bdba6932d9f11a6c791cda8697ccf9327183
2013-04-17 10:31:59 -07:00
Christian Duvivier
5b6d33f9af Faster vp9_short_fdct4x4 and vp9_short_fdct8x4.
Scalar path is about 1.3x faster (2.1% overall encoder speedup).
SSE2 path is about 5.0x faster (8.4% overall encoder speedup).

Change-Id: I360d167b5ad6f387bba00406129323e2fe6e7dda
2013-04-16 16:38:30 -07:00