generic-library/vpx

Author	SHA1	Message	Date
Yaowu Xu	e9c19617bf	Merge "vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2"	2013-11-27 10:27:32 -08:00
levytamar82	8def766de2	vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2 Change-Id: I6366e84490883b72362f762369d7e5bccb64f02f	2013-11-21 14:19:49 -08:00
Abo Talib Mahfoodh	ec2dbdd107	Improve vp9_fdct4x4_sse2 (x1.2) Modifications are done to reduce the total clock cycle. Speedup: 1.2 Tested with: park_joy_420_720p50.y4m Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f	2013-11-21 15:04:35 -05:00
Jingning Han	fabc783695	Fix an overflow issue in SSE2 forward ADST The step that sums three input samples could potentially cause the intermediate result go beyond 16 bit limit, when operating as the second 1-D transform. This commit fixes the issue. Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516	2013-11-13 15:15:59 -08:00
Yunqing Wang	d7289658fb	Remove TEXTREL from 32bit encoder This patch fixed the issue reported in "Issue 655: remove textrel's from 32-bit vp9 encoder". The set of vp9_subpel_variance functions that used x86inc.asm ABI didn't build correctly for 32bit PIC. The fix was carefully done under the situation that there was not enough registers. After the change, we got $ eu-findtextrel libvpx.so eu-findtextrel: no text relocations reported in 'libvpx.so' Change-Id: I1b176311dedaf48eaee0a1e777588043c97cea82	2013-11-07 13:39:40 -08:00
Dmitry Kovalev	600a3860a4	Making input pointer constant for all fdct/fht functions. Change-Id: I78f7012f967a777ddd39bae6671eb501df6bbfe8	2013-10-24 11:48:25 -07:00
Dmitry Kovalev	fd724f13b0	Renaming vp9_short_fdct4x4 and vp9_short_walsh4x4. For consistency with idct function names. Renames: vp9_short_fdct4x4 -> vp9_fdct4x4 vp9_short_walsh4x4 -> vp9_fwht4x4 Change-Id: Id15497cc1270acca626447d846f0ce9199770f58	2013-10-23 14:28:39 -07:00
Dmitry Kovalev	a018988ce8	Renaming vp9_short_fdct32x32 to vp9_fdct32x32. For consistency with idct function names. Change-Id: Ie77b7178e0894c57cd5cb9243c949eb9224ece18	2013-10-23 13:41:40 -07:00
Dmitry Kovalev	5bdd4d9ccf	Merge "Renaming vp9_short_fdct16x16 to vp9_fdct16x16."	2013-10-23 13:37:09 -07:00
Dmitry Kovalev	02feb63684	Renaming vp9_short_fdct16x16 to vp9_fdct16x16. For consistency with idct function names. Change-Id: I5ca355ba99fdba04f09254be95cf79808b534f71	2013-10-23 10:57:12 -07:00
Dmitry Kovalev	fa143dbc8e	Renaming vp9_short_fdct8x8 to vp9_fdct8x8. For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f	2013-10-23 10:52:33 -07:00
Dmitry Kovalev	9f09618bd4	Merge "Using stride (# of elements) instead of pitch (bytes) in fdct4x4."	2013-10-22 13:05:24 -07:00
Dmitry Kovalev	a767d10fa5	Merge "Using stride (# of elements) instead of pitch (bytes) in fdct8x8."	2013-10-22 11:34:17 -07:00
Dmitry Kovalev	190c2b4591	Using stride (# of elements) instead of pitch (bytes) in fdct4x4. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b	2013-10-21 15:27:35 -07:00
Dmitry Kovalev	e5fa44c869	Using stride (# of elements) instead of pitch (bytes) in fdct8x8. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1	2013-10-18 12:20:26 -07:00
Dmitry Kovalev	1aa7fd5aef	Using stride (# of elements) instead of pitch (bytes) in fdct16x16. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d	2013-10-18 11:49:33 -07:00
Dmitry Kovalev	e05412fc23	Using stride (# of elements) instead of pitch (bytes) in fdct32x32. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Id623c5113262655fa50f7c9d6cec9a91fcb20bb4	2013-10-17 13:02:28 -07:00
Dmitry Kovalev	a4585285ed	Removing unused 8x4 transform from the encoder. Change-Id: Icbcf68b5b685a56f255ebc3859c9692accdadf9e	2013-10-15 11:27:28 -07:00
Jingning Han	80f215198f	Merge "Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16"	2013-10-09 16:08:42 -07:00
Jim Bankoski	9603989c72	Merge "cpplint vp9_variance_sse2.c"	2013-10-07 15:44:50 -07:00
Jim Bankoski	f59cb3eacc	Merge "added nolint to function that doesn't seem easy to breakup"	2013-10-05 16:47:23 -07:00
Jim Bankoski	5b4f836148	cpplint issues resolved in vp9_variance_mmx.c Change-Id: Idbfabe427fbeab44210f13fec8b6f63f7a4eb0dd	2013-10-04 14:22:08 -07:00
Jim Bankoski	eb5b7ac27b	added nolint to function that doesn't seem easy to breakup Change-Id: I5489b116aea7c510ea5ebbed3c1445f321b05f3e	2013-10-04 14:17:47 -07:00
Jim Bankoski	25ecb1f0b3	cpplint vp9_variance_sse2.c Change-Id: Ifce8f5b57a1ea8952e8a67c5b92a127a061899fa	2013-10-04 14:15:06 -07:00
A.Mahfoodh	5215b83aea	Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16 Simplify the k_cvtlo_epi16 and k_cvthi_epi16 to only two instructions. Then inlined them. quoting from intel MMX_App_Compute_16bit_Vector.pdf‎ "The PMADDWD instruction multiplies four pairs of 16-bit numbers and produces partial sums of the results and can do so once per clock (with a three-clock latency)." so I am assuming that there will be three clock overhead after the last _mm_madd_pi16 command. Even with the overhead the number of clocks in general should be smaller. I am not sure though becasue I could not find information about number of clocks required for instructions in k_cvtlo_epi16 and k_cvthi_epi16. I will run a test and compare the execution time. Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7	2013-10-02 20:02:03 -04:00
A.Mahfoodh	13c7715a75	Number of instructions in fdct4_1d_sse2 reduced by two. Mathematically the results are the same. Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0	2013-09-23 17:23:27 -07:00
Jingning Han	09bc942b47	Fix overflow issue in 16x16 quantization SSSE3 The 16x16 transform unit test suggested that the peak coefficient value can reach 32639. This could cause potential overflow issue in the SSSE3 implmentation of 16x16 block quantization. This commit fixes this issue by replacing addition with saturated addition. Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e	2013-09-06 21:06:10 -07:00
Jingning Han	458c2833c0	Use saturated addition in SSSE3 of 32x32 quant The 32x32 forward transform can potentially reach peak coefficient value close to 32700, while the rounding factor can go upto 610. This could cause overflow issue in the SSSE3 implementation of 32x32 quantization process. This commit resolves this issue by replacing the addition operations with saturated addition operations in 32x32 block quantization. Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70	2013-09-05 12:49:12 -07:00
Jingning Han	3cf46fa591	Fix 32x32 forward transform SSE2 version This commit fixed the potential overflow issue in the SSE2 implementation of 32x32 forward DCT. It resolved the corrupted coded frames in the border of scenes. Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9	2013-08-31 18:47:08 -07:00
Jingning Han	c86c5443eb	Merge "Fix overflow issue in SSSE3 32x32 quantization"	2013-08-29 16:49:04 -07:00
Jingning Han	abff678866	Fix overflow issue in SSSE3 32x32 quantization The 32x32 quantization process can potentially have the intermediate stacks over 16-bit range, thereby causing enc/dec mismatch. This commit fixes this overflow issue in the SSSE3 implementation, as well as the prototype, of 32x32 quantization. This fixes issue 607 from webm@googlecode. Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806	2013-08-29 11:00:54 -07:00
Yaowu Xu	9482c07953	fixed the reading too many bytes In subpel_avg_variance functions, code similar to the following punpkldq m2, [addr] actually reads 8 bytes. For functions that are supposed to work on buffers only have less 8 bytes a line, this caused valgrind error of reading uninitialized memory. Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34	2013-08-27 08:39:20 -07:00
Yaowu Xu	6c5433c836	Fix the reading of too many input pixels in VP9_get4x4var_mmx Change-Id: I4b4a8f45f25ebdfad281f169cc87aba5e2d6f227	2013-08-26 12:35:27 -07:00
Jingning Han	78136edcdc	SSE2 high precision 32x32 forward DCT Enable SSE2 implementation of high precision 32x32 forward DCT. The intermediate stacks are of 32-bits. The run-time goes down from 32126 cycles to 13442 cycles. Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56	2013-08-12 16:52:53 -07:00
Jingning Han	2c091f9768	Merge "Place holder for high-precision 32x32 fdct"	2013-08-06 14:47:30 -07:00
Jim Bankoski	5b307886fb	variance x86inc guards also fixed bug in sad calcs Change-Id: I6571fcbe37556c16ae32be66dc0fd879852aac1d	2013-08-06 14:17:13 -07:00
Jingning Han	28566a6cd5	Place holder for high-precision 32x32 fdct Resolve compile warnings on re-define FDCT32x32_2D template. Change-Id: Idb3a54ef8d2710ce7245b726379a0e5c875f5cad	2013-08-06 11:44:08 -07:00
Christian Duvivier	3d98205fce	Move fdct32x32 SSE2 implementation in separate file. This is in preparation for the SSE2 version of the high-precision 32x32 forward DCT which will share a lot of code with the existing low precision version used for rate-distortion search. Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0	2013-08-06 10:17:11 -07:00
Ronald S. Bultje	c13e0bcb52	Remove unused fwalsh/fdct x86 SIMD implementations. Change-Id: Ia942e56cf322821d42ba06178672791eeee2847e	2013-07-10 18:22:51 -07:00
Jingning Han	114423538f	SSE2 16x16 ADST/DCT hybrid transform This commit enables 16x16 ADST/DCT forward hybrid transform using SSE2 operations. It reduces the runtime from 5433 cycles to 1621 cycles, at no compression performance loss. Change-Id: I75fd7f1984e9e28846af459f810ff0d6ae125230	2013-07-10 12:14:53 -07:00
Jingning Han	a38cf2658a	Merge "Refactor SSE2 8x8 functional units"	2013-07-05 11:18:18 -07:00
Jingning Han	2cb75c9607	Refactor SSE2 8x8 functional units These serve as building blocks for SSE2 8x8 and 16x16 ADST/DCT hybrid transform coding. Change-Id: I4089a754c66e0c986f67d9b8ec4dfb9627ad430d	2013-07-03 10:11:59 -07:00
Ronald S. Bultje	e5fb4b61b6	Use pmovmskb to skip quantize loops over empty coefficients. If none of the 16 coefficients that we quantize per loop iteration are larger than the zbin, directly skip to the next round of coeffs, rather than doing a full quantize loop that will eventually result in 16 zeroes. This incurs a jump cost, but saves a lot of other work. 32x32 quant goes from 1349 -> 1184 cycles. The same approach yielded no significantly positive results for smaller transforms, so is not used there (8x8: 103 -> 101 cycles; 16x16: 302 -> 306 cycles). Change-Id: I8fca17dc2543fc8eed1dbcd5100145e3c3a9b647	2013-07-02 16:34:24 -07:00
Ronald S. Bultje	c8defcfdee	Update quantize SSSE3 SIMD to cover 32x32 transform case also. Encode time of bus (speed 0) 50 frames @ 1500kbps goes from 2min14.4 to 2min10.1, i.e. a 2.3% overall speed increase. Change-Id: I3699580e74ec26c7d24e03681bc47ba25ee1ee87	2013-07-01 11:36:33 -07:00
Ronald S. Bultje	7353ceab9d	Quantize (64-bit only, for now) SSSE3 SIMD. Total encoding time for first 50 frames of bus (speed 0) @ 1500kbps goes 2min34.8 to 2min14.4, i.e. a 10.4% overall speedup. The code is x86-64 only, it needs some minor modifications to be 32bit compatible, because it uses 15 xmm registers, whereas 32bit only has 8. Change-Id: I2df53770c2e850813ffa713e1a91b45b0082b904	2013-07-01 11:36:07 -07:00
Jingning Han	993942ce0c	Merge "Enable SSE2 4x4 ADST/DCT transform"	2013-06-29 15:57:04 -07:00
Christian Duvivier	466e0cf303	SSE2 version of vp9_short_fdct32x32_rd. 43,000 -> 5,750 cycles, about 7.5x faster. Change-Id: Ibfd92821b9603f4ed9c256e0ececec14fa4565d0	2013-06-29 13:53:00 -07:00
Jingning Han	1109b6b888	Enable SSE2 4x4 ADST/DCT transform This commit enables SSE2 4x4 foward hybrid transform. The runtime goes from 249 cycles down to 74 cycles. Overall around 2% speed-up at no compression performance change. Change-Id: Iad4d526346e05c7be896466c05500711bb763660	2013-06-28 17:24:43 -07:00
Jingning Han	9def7f72a0	Fix switch statement in 8x8 transform Change-Id: I7c46354c4983feb5f6202c3ab4a1d9534da7e30f	2013-06-28 13:40:36 -07:00
Ronald S. Bultje	af660715c0	Make coefficient skip condition an explicit RD choice. This commit replaces zrun_zbin_boost, a method of biasing non-zero coefficients following runs of zero-coefficients to be rounded towards zero, with an explicit skip-block choice in the RD loop. The logic is basically that if individual coefficients should be rounded towards zero (from a RD point of view), the trellis/optimize loop should take care of it. If whole blocks should be zero (from a RD point of view), a single RD check is much more efficient than a complete serialization of the quantization loop. Quality change: derf +0.5% psnr, +1.6% ssim; yt +0.6% psnr, +1.1% ssim. SIMD for quantize will follow in a separate patch. Results for other test sets pending. Change-Id: Ife5fa641163ac5150ac428011e87188f1937c1f4	2013-06-28 10:28:49 -07:00

1 2 3

115 Commits