generic-library/vpx

Author	SHA1	Message	Date
Andrew Russell	e337322e63	Merge "improved speed of 4x4 sse2 fdct."	2014-03-05 14:35:44 -08:00
Andrew Russell	a46f5459c3	improved speed of 4x4 sse2 fdct. * speed improvement of 30 percent achieved * multiplies and adds remain the same * non-arithmetic instructions minimized by hand, by: -expanding 2 pass loop -removing irrelevant "shuffles" -combining last two rounding steps * further improvements may be possible Change-Id: Idec2c3f52910c48e6a0e0f9aefed5cae31b0b8c0	2014-03-03 14:25:42 -08:00
levytamar82	ea14909687	AVX2 SubPixel AVG Variance Optimization Optimizing 2 functions to process 32 elements in parallel instead of 16: 1. vp9_sub_pixel_avg_variance64x64 2. vp9_sub_pixel_avg_variance32x32 Both of those functions were calling vp9_sub_pixel_avg_variance16xh_ssse3; instead of calling that function, they now call vp9_sub_pixel_avg_variance32xh_avx2, which is written in AVX2 and processes 32 elements in parallel. This optimization gave 80% function level gain and 2% user level gain Change-Id: Iea694654e1b7612dc6ed11e2626208c2179502c8	2014-02-28 22:51:04 -07:00
James Zern	d12b39daab	vp9_subpel_variance_impl_intrin_avx2.c: make some tables static + fix formatting Change-Id: I7b4ec11b7b46d8926750e0b69f7a606f3ab80895	2014-02-18 20:42:49 -08:00
levytamar82	52dac5d1cb	AVX2 SubPixel Variance Optimization Optimizing 2 functions to process 32 elements in parallel instead of 16: 1. vp9_sub_pixel_variance64x64 2. vp9_sub_pixel_variance32x32 Both of those functions were calling vp9_sub_pixel_variance16xh_ssse3; instead of calling that function, they now call vp9_sub_pixel_variance32xh_avx2, which is written in AVX2 and processes 32 elements in parallel. This optimization gave 70% function level gain and 2% user level gain Change-Id: I4f5cb386b346ff6c878a094e1c3b37e418e50bde	2014-02-14 16:59:11 -07:00
Andrew Russell	549c31f8ae	minor spelling cleanup in comments Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06	2014-02-12 16:32:51 -08:00
Yunqing Wang	0d43bd77e5	Bug fix in ssse3 quantize function A bug was reported in Issue 702: "SIGILL (Illegal instruction) when transcoding with vp9 - using FFmpeg". It was reproduced and fixed. Change-Id: Ie32c149a89af02856084aeaf289e848a905c7700	2014-02-07 14:32:30 -08:00
Dmitry Kovalev	005fc6970b	Finally removing "short" from transform names. Change-Id: I5259b68dc1bcceb153e3ffe638a79a59a3019e9d	2014-02-06 11:54:15 -08:00
Dmitry Kovalev	ff41764920	Removing _1d suffix from transform names. It is enough to specify (e.g.) idct16, it is obviously different from idct16x16. Change-Id: I6b408a37a945de3162429380b59a775b03b95db0	2014-01-27 16:15:36 -08:00
James Zern	b453941caf	vp9/encoder: add extern "C" to headers Change-Id: I4f51ce859a97bf1b8fd2b37ac585b7c643232b69	2014-01-23 16:21:24 -08:00
levytamar82	357b65369f	AVX2 Variance Optimization Optimizing the variance functions: vp9_variance16x16, vp9_variance32x32, vp9_variance64x64, vp9_variance32x16, vp9_variance64x32, vp9_mse16x16 by migrating to AVX2. Some of the functions were optimized by processing 32 elements instead of 16. Some of the functions were optimized by processing 2 loop strides of 16 elements in a single 256 bit register. This optimization gives between 2.4% - 2.7% user level performance gain and 42% function level gain. Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d	2014-01-08 12:05:53 -07:00
James Zern	bd9a388a06	vp9: normalize include guards Change-Id: If4ddbdcfb3ab387cbca6910b42cf4df8111e6879	2013-12-16 19:40:49 -08:00
Yaowu Xu	e9c19617bf	Merge "vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2"	2013-11-27 10:27:32 -08:00
levytamar82	8def766de2	vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2 Change-Id: I6366e84490883b72362f762369d7e5bccb64f02f	2013-11-21 14:19:49 -08:00
Abo Talib Mahfoodh	ec2dbdd107	Improve vp9_fdct4x4_sse2 (x1.2) Modifications are done to reduce the total clock cycle. Speedup: 1.2 Tested with: park_joy_420_720p50.y4m Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f	2013-11-21 15:04:35 -05:00
Jingning Han	fabc783695	Fix an overflow issue in SSE2 forward ADST The step that sums three input samples could potentially cause the intermediate result to go beyond the 16 bit limit, when operating as the second 1-D transform. This commit fixes the issue. Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516	2013-11-13 15:15:59 -08:00
Yunqing Wang	d7289658fb	Remove TEXTREL from 32bit encoder This patch fixed the issue reported in "Issue 655: remove textrel's from 32-bit vp9 encoder". The set of vp9_subpel_variance functions that used x86inc.asm ABI didn't build correctly for 32bit PIC. The fix was carefully done under the situation that there were not enough registers. After the change, we got $ eu-findtextrel libvpx.so eu-findtextrel: no text relocations reported in 'libvpx.so' Change-Id: I1b176311dedaf48eaee0a1e777588043c97cea82	2013-11-07 13:39:40 -08:00
Dmitry Kovalev	600a3860a4	Making input pointer constant for all fdct/fht functions. Change-Id: I78f7012f967a777ddd39bae6671eb501df6bbfe8	2013-10-24 11:48:25 -07:00
Dmitry Kovalev	fd724f13b0	Renaming vp9_short_fdct4x4 and vp9_short_walsh4x4. For consistency with idct function names. Renames: vp9_short_fdct4x4 -> vp9_fdct4x4 vp9_short_walsh4x4 -> vp9_fwht4x4 Change-Id: Id15497cc1270acca626447d846f0ce9199770f58	2013-10-23 14:28:39 -07:00
Dmitry Kovalev	a018988ce8	Renaming vp9_short_fdct32x32 to vp9_fdct32x32. For consistency with idct function names. Change-Id: Ie77b7178e0894c57cd5cb9243c949eb9224ece18	2013-10-23 13:41:40 -07:00
Dmitry Kovalev	5bdd4d9ccf	Merge "Renaming vp9_short_fdct16x16 to vp9_fdct16x16."	2013-10-23 13:37:09 -07:00
Dmitry Kovalev	02feb63684	Renaming vp9_short_fdct16x16 to vp9_fdct16x16. For consistency with idct function names. Change-Id: I5ca355ba99fdba04f09254be95cf79808b534f71	2013-10-23 10:57:12 -07:00
Dmitry Kovalev	fa143dbc8e	Renaming vp9_short_fdct8x8 to vp9_fdct8x8. For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f	2013-10-23 10:52:33 -07:00
Dmitry Kovalev	9f09618bd4	Merge "Using stride (# of elements) instead of pitch (bytes) in fdct4x4."	2013-10-22 13:05:24 -07:00
Dmitry Kovalev	a767d10fa5	Merge "Using stride (# of elements) instead of pitch (bytes) in fdct8x8."	2013-10-22 11:34:17 -07:00
Dmitry Kovalev	190c2b4591	Using stride (# of elements) instead of pitch (bytes) in fdct4x4. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b	2013-10-21 15:27:35 -07:00
Dmitry Kovalev	e5fa44c869	Using stride (# of elements) instead of pitch (bytes) in fdct8x8. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1	2013-10-18 12:20:26 -07:00
Dmitry Kovalev	1aa7fd5aef	Using stride (# of elements) instead of pitch (bytes) in fdct16x16. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d	2013-10-18 11:49:33 -07:00
Dmitry Kovalev	e05412fc23	Using stride (# of elements) instead of pitch (bytes) in fdct32x32. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Id623c5113262655fa50f7c9d6cec9a91fcb20bb4	2013-10-17 13:02:28 -07:00
Dmitry Kovalev	a4585285ed	Removing unused 8x4 transform from the encoder. Change-Id: Icbcf68b5b685a56f255ebc3859c9692accdadf9e	2013-10-15 11:27:28 -07:00
Jingning Han	80f215198f	Merge "Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16"	2013-10-09 16:08:42 -07:00
Jim Bankoski	9603989c72	Merge "cpplint vp9_variance_sse2.c"	2013-10-07 15:44:50 -07:00
Jim Bankoski	f59cb3eacc	Merge "added nolint to function that doesn't seem easy to breakup"	2013-10-05 16:47:23 -07:00
Jim Bankoski	5b4f836148	cpplint issues resolved in vp9_variance_mmx.c Change-Id: Idbfabe427fbeab44210f13fec8b6f63f7a4eb0dd	2013-10-04 14:22:08 -07:00
Jim Bankoski	eb5b7ac27b	added nolint to function that doesn't seem easy to breakup Change-Id: I5489b116aea7c510ea5ebbed3c1445f321b05f3e	2013-10-04 14:17:47 -07:00
Jim Bankoski	25ecb1f0b3	cpplint vp9_variance_sse2.c Change-Id: Ifce8f5b57a1ea8952e8a67c5b92a127a061899fa	2013-10-04 14:15:06 -07:00
A.Mahfoodh	5215b83aea	Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16 Simplify the k_cvtlo_epi16 and k_cvthi_epi16 to only two instructions. Then inlined them. Quoting from Intel MMX_App_Compute_16bit_Vector.pdf: "The PMADDWD instruction multiplies four pairs of 16-bit numbers and produces partial sums of the results and can do so once per clock (with a three-clock latency)." So I am assuming that there will be three clock overhead after the last _mm_madd_pi16 command. Even with the overhead the number of clocks in general should be smaller. I am not sure though because I could not find information about number of clocks required for instructions in k_cvtlo_epi16 and k_cvthi_epi16. I will run a test and compare the execution time. Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7	2013-10-02 20:02:03 -04:00
A.Mahfoodh	13c7715a75	Number of instructions in fdct4_1d_sse2 reduced by two. Mathematically the results are the same. Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0	2013-09-23 17:23:27 -07:00
Jingning Han	09bc942b47	Fix overflow issue in 16x16 quantization SSSE3 The 16x16 transform unit test suggested that the peak coefficient value can reach 32639. This could cause potential overflow issue in the SSSE3 implementation of 16x16 block quantization. This commit fixes this issue by replacing addition with saturated addition. Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e	2013-09-06 21:06:10 -07:00
Jingning Han	458c2833c0	Use saturated addition in SSSE3 of 32x32 quant The 32x32 forward transform can potentially reach peak coefficient value close to 32700, while the rounding factor can go up to 610. This could cause overflow issue in the SSSE3 implementation of 32x32 quantization process. This commit resolves this issue by replacing the addition operations with saturated addition operations in 32x32 block quantization. Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70	2013-09-05 12:49:12 -07:00
Jingning Han	3cf46fa591	Fix 32x32 forward transform SSE2 version This commit fixed the potential overflow issue in the SSE2 implementation of 32x32 forward DCT. It resolved the corrupted coded frames in the border of scenes. Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9	2013-08-31 18:47:08 -07:00
Jingning Han	c86c5443eb	Merge "Fix overflow issue in SSSE3 32x32 quantization"	2013-08-29 16:49:04 -07:00
Jingning Han	abff678866	Fix overflow issue in SSSE3 32x32 quantization The 32x32 quantization process can potentially have the intermediate stacks over 16-bit range, thereby causing enc/dec mismatch. This commit fixes this overflow issue in the SSSE3 implementation, as well as the prototype, of 32x32 quantization. This fixes issue 607 from webm@googlecode. Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806	2013-08-29 11:00:54 -07:00
Yaowu Xu	9482c07953	fixed the reading too many bytes In subpel_avg_variance functions, code similar to the following punpckldq m2, [addr] actually reads 8 bytes. For functions that are supposed to work on buffers that have fewer than 8 bytes a line, this caused a valgrind error of reading uninitialized memory. Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34	2013-08-27 08:39:20 -07:00
Yaowu Xu	6c5433c836	Fix the reading of too many input pixels in VP9_get4x4var_mmx Change-Id: I4b4a8f45f25ebdfad281f169cc87aba5e2d6f227	2013-08-26 12:35:27 -07:00
Jingning Han	78136edcdc	SSE2 high precision 32x32 forward DCT Enable SSE2 implementation of high precision 32x32 forward DCT. The intermediate stacks are of 32-bits. The run-time goes down from 32126 cycles to 13442 cycles. Change-Id: Ib5ccafe3176c65bd6f2dbdef790bd47bbc880e56	2013-08-12 16:52:53 -07:00
Jingning Han	2c091f9768	Merge "Place holder for high-precision 32x32 fdct"	2013-08-06 14:47:30 -07:00
Jim Bankoski	5b307886fb	variance x86inc guards also fixed bug in sad calcs Change-Id: I6571fcbe37556c16ae32be66dc0fd879852aac1d	2013-08-06 14:17:13 -07:00
Jingning Han	28566a6cd5	Place holder for high-precision 32x32 fdct Resolve compile warnings on re-define FDCT32x32_2D template. Change-Id: Idb3a54ef8d2710ce7245b726379a0e5c875f5cad	2013-08-06 11:44:08 -07:00
Christian Duvivier	3d98205fce	Move fdct32x32 SSE2 implementation in separate file. This is in preparation for the SSE2 version of the high-precision 32x32 forward DCT which will share a lot of code with the existing low precision version used for rate-distortion search. Change-Id: I7084b6bdfb480b1fabb8493fb14e3f7fcc7888c0	2013-08-06 10:17:11 -07:00


127 Commits
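
Note on the saturated-addition fixes (09bc942b47, 458c2833c0): those commits describe replacing plain 16-bit addition with saturated addition in the SSSE3 quantizers. The following is only a scalar sketch of the difference, not the code from those commits. The function names wrap_add_i16 and sat_add_i16 are hypothetical, and the values 32700 and 610 are the peak coefficient and rounding factor quoted in the 32x32 commit message. On x86 the per-lane saturating behavior modeled here corresponds to the paddsw instruction (intrinsic _mm_adds_epi16).

```c
#include <stdint.h>

/* Hypothetical helper: plain 16-bit add that wraps modulo 2^16,
 * modeling what a non-saturating packed add does in each lane. */
int16_t wrap_add_i16(int16_t a, int16_t b) {
  /* Add as unsigned so the wrap-around is well defined in C. */
  uint16_t u = (uint16_t)((uint16_t)a + (uint16_t)b);
  /* Map the unsigned bit pattern back to a signed 16-bit value. */
  return (int16_t)(u < 0x8000u ? (int)u : (int)u - 0x10000);
}

/* Hypothetical helper: signed saturating 16-bit add. Results outside
 * the int16_t range are clamped to INT16_MIN or INT16_MAX. */
int16_t sat_add_i16(int16_t a, int16_t b) {
  int32_t sum = (int32_t)a + (int32_t)b;
  if (sum > INT16_MAX) return INT16_MAX;
  if (sum < INT16_MIN) return INT16_MIN;
  return (int16_t)sum;
}
```

With a coefficient of 32700 and a rounding factor of 610, the true sum 33310 does not fit in 16 signed bits: wrap_add_i16 yields a negative value, while sat_add_i16 clamps to 32767, which is why the saturating form avoids the overflow described in the commit messages.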