Commit Graph

52 Commits

Author SHA1 Message Date
Jingning Han
e253eaa036 Unify the high bit-depth forward hybrid transforms
The SSE2 version high bit-depth forward hybrid transforms are
essentially using the C functions via cross referencing to 1-D
functions in vp9_dct.c. This commit unifies the two versions and
removes the unnecessary dependency.

Change-Id: Ib4d0702a138f8daf7d0bd97c141ee7088f293765
2015-07-20 11:17:49 -07:00
James Zern
a989c66b84 rename vp9_dct_impl_sse2.c to vp9_dct_sse2_impl.h
this file shouldn't be built directly, it is included in vp9_dct_sse2.c
to create a non-high-bitdepth and a high-bitdepth version

silences missing prototype warnings for the unused FDCT* functions

Change-Id: Ide6ff8c24ab31bdb0f833260505ae33660a1ad5b
2015-05-15 17:01:19 -07:00
James Zern
587a71f1d6 rename vp9_dct32x32_sse2.c to vp9_dct32x32_sse2_impl.h
this file shouldn't be built directly, it is included in vp9_dct_sse2.c
to create a non-high-bitdepth and a high-bitdepth version

silences missing prototype warnings for the unused FDCT32x32* functions

Change-Id: I0e38f16dae5ea1728de184ee2c89287d48675c51
2015-05-15 16:59:52 -07:00
James Zern
330fba41e2 vp9 intrinsics: add vp9_rtcd include
silences a missing declaration warning

Change-Id: I59a34e1a1377cf3529b678d7ec0122bd43ab1bf1
2015-05-15 10:43:47 -07:00
James Zern
8515e62e6b vp9_dct_sse2: make some functions static
silences missing prototype warnings

Change-Id: I773b6a6b5bd7c57db18c3b17c519534f80e131de
2015-05-15 10:43:47 -07:00
James Zern
198b039e2a vp9_fdct8x8_quant_sse2: quiet a static analysis warning
add an assert to validate 'in' array size

Change-Id: Ib72946a86f34e1ce8a69954e8e3e4fe1a0f18a91
2015-03-18 14:33:04 -07:00
Jingning Han
e47033319d Fix fwd transform sse2 build issue on older gcc version
Change-Id: I3e0e53d129552babf29e6c5d047483733983973c
2015-02-24 23:25:21 -08:00
Jingning Han
d0f2377027 Revert "Revert "Removal of legacy zbin_extra / zbin_oq_value.""
This reverts commit 9946ee23e0.

Fix the ssse3 asm function.

Change-Id: I07f77a63aa98087626e45c4e87aa5dcafc0b0b07
2014-12-22 10:09:25 -08:00
Paul Wilkins
9946ee23e0 Revert "Removal of legacy zbin_extra / zbin_oq_value."
This reverts commit e9b586e21b.

Change-Id: I5b36e6727da6c05278d97e2c37b80c109f79bed4
2014-12-19 15:02:58 +00:00
Paul Wilkins
e9b586e21b Removal of legacy zbin_extra / zbin_oq_value.
zbin extra / zbin_oq_value was widely passed around,
hence removal touches a lot of code.

Change-Id: Idc94359735b60c38a160e4385ae09d5ca8b6b8e5
2014-12-18 16:49:11 +00:00
Deb Mukherjee
6615706af2 sse2 visual studio build fix
Change-Id: Id8c8c3be882bcd92afea3ccec6ebdf3f208d28ef
2014-12-03 16:35:26 -08:00
Peter de Rivaz
7e40a55ef9 Added high bitdepth sse2 transform functions
Also removes some spurious changes in common/vp9_blockd.h which
was introduced by a rebase issue between nextgen and master branches.

Change-Id: If359f0e9a71bca9c2ba685a87a355873536bb282
(cherry picked from commit 005d80cd05)
(cherry picked from commit 08d2f54800)
(cherry picked from commit 4230c2306c)
2014-12-02 11:16:24 -08:00
Jingning Han
c6908fd5f7 Combine fdct8x8 and quantization process
This commit reworks the forward transform and quantization process
for 8x8 block coding. It combines the two operations in a single
function to save a store/load stage of the original transform
coefficients. Overall the speed -6 is slightly faster (around 1%
range). The compression performance of speed -6 is improved by
3.4%.

Change-Id: Id6628daef123f3e4649248735ec2ad7423629387
2014-11-18 18:10:56 -08:00
Yaowu Xu
2c4fee17bc Fix visual studio 2013 compiler warnings
For configured with --enable-vp9-highbitdepth

Change-Id: I2b181519d7192f8d7a241ad5760c3578255f24e6
2014-11-05 13:47:28 -08:00
Dmitry Kovalev
070210e20b Removing duplicated code.
Change-Id: I7b5c776d5e6f5ca428b87fa9411ae4012a9538ba
2014-09-02 17:57:35 -07:00
Jingning Han
5b21708fd5 Fix def pairs in 32x32 2D-DCT sse2
Properly pair the def/undef order.

Change-Id: I9736a6f8d2efc075b1d72dafc75b9350d055cf65
2014-08-20 09:40:30 -07:00
Jingning Han
ccba289f8d Fast computation path for forward transform and quantization
This commit enables a fast path computational flow for forward
transformation. It checks the sse and variance of prediction
residuals and decides if the quantized coefficients are all
zero, dc only, or more. It then selects the corresponding coding
path in the forward transformation and quantization stage.

It is currently enabled in rtc coding mode. Will do it for rd
coding mode next.

In speed -6, the runtime for pedestrian_area 1080p at 1000 kbps
goes down from 14234 ms to 13704 ms, i.e., about 4% speed-up.
Overall coding performance for rtc set is changed by -0.18%.

Change-Id: I0452da1786d59bc8bcbe0a35fdae9f623d1d44e1
2014-06-12 11:10:54 -07:00
Jingning Han
7f547336b7 Adjust the forward 16x16 DCT computation steps
This commit adjusts the forward 16x16 DCT computation steps to
simplify the register level operations. It fixes the corresponding
sse2 version accordingly.

Change-Id: I72a9c25b8ca9442fc5e113f47cd701ae55aa7f08
2014-05-19 12:39:26 -07:00
Andrew Russell
a46f5459c3 improved speed of 4x4 sse2 fdct.
* speed improvment of 30 percent achieved
* multiplies and adds remain the same
* non-arithmetic instructions minimized by hand, by:
   -expanding 2 pass loop
   -removing irrelivant "shuffles"
   -combining last two rounding steps
* further improvments may be possible

Change-Id: Idec2c3f52910c48e6a0e0f9aefed5cae31b0b8c0
2014-03-03 14:25:42 -08:00
Andrew Russell
549c31f8ae minor spelling cleanup in comments
Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06
2014-02-12 16:32:51 -08:00
Dmitry Kovalev
005fc6970b Finally removing "short" from transform names.
Change-Id: I5259b68dc1bcceb153e3ffe638a79a59a3019e9d
2014-02-06 11:54:15 -08:00
Dmitry Kovalev
ff41764920 Removing _1d suffix from transform names.
It is enough to specify (e.g.) idct16, it is obviously different from
idct16x16.

Change-Id: I6b408a37a945de3162429380b59a775b03b95db0
2014-01-27 16:15:36 -08:00
Abo Talib Mahfoodh
ec2dbdd107 Improve vp9_fdct4x4_sse2 (x1.2)
Modifications are done to reduce the total clock cycle.
Speedup: 1.2

Tested with: park_joy_420_720p50.y4m

Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f
2013-11-21 15:04:35 -05:00
Jingning Han
fabc783695 Fix an overflow issue in SSE2 forward ADST
The step that sums three input samples could potentially cause the
intermediate result go beyond 16 bit limit, when operating as the
second 1-D transform. This commit fixes the issue.

Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516
2013-11-13 15:15:59 -08:00
Dmitry Kovalev
600a3860a4 Making input pointer constant for all fdct/fht functions.
Change-Id: I78f7012f967a777ddd39bae6671eb501df6bbfe8
2013-10-24 11:48:25 -07:00
Dmitry Kovalev
fd724f13b0 Renaming vp9_short_fdct4x4 and vp9_short_walsh4x4.
For consistency with idct function names. Renames:
  vp9_short_fdct4x4  -> vp9_fdct4x4
  vp9_short_walsh4x4 -> vp9_fwht4x4

Change-Id: Id15497cc1270acca626447d846f0ce9199770f58
2013-10-23 14:28:39 -07:00
Dmitry Kovalev
a018988ce8 Renaming vp9_short_fdct32x32 to vp9_fdct32x32.
For consistency with idct function names.

Change-Id: Ie77b7178e0894c57cd5cb9243c949eb9224ece18
2013-10-23 13:41:40 -07:00
Dmitry Kovalev
5bdd4d9ccf Merge "Renaming vp9_short_fdct16x16 to vp9_fdct16x16." 2013-10-23 13:37:09 -07:00
Dmitry Kovalev
02feb63684 Renaming vp9_short_fdct16x16 to vp9_fdct16x16.
For consistency with idct function names.

Change-Id: I5ca355ba99fdba04f09254be95cf79808b534f71
2013-10-23 10:57:12 -07:00
Dmitry Kovalev
fa143dbc8e Renaming vp9_short_fdct8x8 to vp9_fdct8x8.
For consistency with idct function names.

Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f
2013-10-23 10:52:33 -07:00
Dmitry Kovalev
9f09618bd4 Merge "Using stride (# of elements) instead of pitch (bytes) in fdct4x4." 2013-10-22 13:05:24 -07:00
Dmitry Kovalev
a767d10fa5 Merge "Using stride (# of elements) instead of pitch (bytes) in fdct8x8." 2013-10-22 11:34:17 -07:00
Dmitry Kovalev
190c2b4591 Using stride (# of elements) instead of pitch (bytes) in fdct4x4.
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.

Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b
2013-10-21 15:27:35 -07:00
Dmitry Kovalev
e5fa44c869 Using stride (# of elements) instead of pitch (bytes) in fdct8x8.
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.

Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1
2013-10-18 12:20:26 -07:00
Dmitry Kovalev
1aa7fd5aef Using stride (# of elements) instead of pitch (bytes) in fdct16x16.
Just making fdct consistent with iht/idct/fht functions which all use
stride (# of elements) as input argument.

Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d
2013-10-18 11:49:33 -07:00
Dmitry Kovalev
a4585285ed Removing unused 8x4 transform from the encoder.
Change-Id: Icbcf68b5b685a56f255ebc3859c9692accdadf9e
2013-10-15 11:27:28 -07:00
A.Mahfoodh
13c7715a75 Number of instructions in fdct4_1d_sse2 reduced by two.
Mathematically the results are the same.

Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0
2013-09-23 17:23:27 -07:00
Jingning Han
78136edcdc SSE2 high precision 32x32 forward DCT
Enable SSE2 implementation of high precision 32x32 forward DCT. The
intermediate stacks are of 32-bits. The run-time goes down from
32126 cycles to 13442 cycles.

Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56
2013-08-12 16:52:53 -07:00
Jingning Han
28566a6cd5 Place holder for high-precision 32x32 fdct
Resolve compile warnings on re-define FDCT32x32_2D template.

Change-Id: Idb3a54ef8d2710ce7245b726379a0e5c875f5cad
2013-08-06 11:44:08 -07:00
Christian Duvivier
3d98205fce Move fdct32x32 SSE2 implementation in separate file.
This is in preparation for the SSE2 version of the high-precision
32x32 forward DCT which will share a lot of code with the existing
low precision version used for rate-distortion search.

Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0
2013-08-06 10:17:11 -07:00
Jingning Han
114423538f SSE2 16x16 ADST/DCT hybrid transform
This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2
operations. It reduces the runtime from 5433 cycles to 1621 cycles, at
no compression performance loss.

Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230
2013-07-10 12:14:53 -07:00
Jingning Han
2cb75c9607 Refactor SSE2 8x8 functional units
These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT
hybrid transform coding.

Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d
2013-07-03 10:11:59 -07:00
Jingning Han
993942ce0c Merge "Enable SSE2 4x4 ADST/DCT transform" 2013-06-29 15:57:04 -07:00
Christian Duvivier
466e0cf303 SSE2 version of vp9_short_fdct32x32_rd.
43,000 -> 5,750 cycles, about 7.5x faster.

Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0
2013-06-29 13:53:00 -07:00
Jingning Han
1109b6b888 Enable SSE2 4x4 ADST/DCT transform
This commit enables SSE2 4x4 foward hybrid transform. The runtime
goes from 249 cycles down to 74 cycles. Overall around 2% speed-up
at no compression performance change.

Change-Id: Iad4d526346e05c7be896466c05500711bb763660
2013-06-28 17:24:43 -07:00
Jingning Han
9def7f72a0 Fix switch statement in 8x8 transform
Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f
2013-06-28 13:40:36 -07:00
Yaowu Xu
60dc7375af fixed a compiling problem with MSVC win32 build
The aligned array in parameter list caused win32 build to report
c2719 error. This commit fixed the issue by make the parameter
type a pointer instead of an array.

Change-Id: I4ed654ce4eba2db4995d9cdc136c68e9a6acc992
2013-06-26 09:33:16 -07:00
Jingning Han
0084e61d5f Tune the rounding operations in 8x8 ADST/DCT sse2
Improve the round-trip precision to meet the unit test setttings.

Change-Id: I303febae56b4b990ea3798b8ebed94c0510ecf79
2013-06-25 12:02:26 -07:00
Jingning Han
82d504b50f Use aligned buffer operations in 8x8/16x16 2D-DCT
This reduces 16x16 2D-DCT runtime from 865 cycles to 837 cycles.

Change-Id: I137758b81cd127b936175284310e81378db64552
2013-06-24 19:56:23 -07:00
Jingning Han
a32a086d23 Enable sse2 implmentation of 8x8 ADST/DCT
This commit makes use of the butterfly structure to enable the sse2
version implementation of 8x8 ADST/DCT hybrid transform coding.

The runtime of hybrid transform module goes down from 1170 cycles
to 245 cycles. Overall speed-up around 1.5%.

Change-Id: Ic808ffd21ece8a9d0410d8c0243d7b6c28ac3b3f
2013-06-24 18:41:33 -07:00