generic-library/vpx

Author	SHA1	Message	Date
James Zern	00f1cf40ed	Merge "vp9_denoiser_sse2.c: eliminate gcc warnings"	2014-10-17 03:26:06 -07:00
JackyChen	8514d03402	vp9_denoiser_sse2.c: eliminate gcc warnings Change-Id: I5f63f48e11e31ea9951223c5b18f42a2471e4560	2014-10-17 11:00:57 +02:00
Alex Converse	7497d2fb23	Add a 32-bit friendly sse2 quantizer. This is based on the 64-bit ssse3 quantizer. 1.1x speedup for screen content at speed 7. Change-Id: I57d15415ef97c49165954bbe3daaaf9318e37448	2014-10-14 11:37:41 -07:00
James Zern	7c6fec672f	vp9_avg_intrin_sse2: correct intrinsics include immintrin.h -> emmintrin.h fixes build where newer intrinsics are unavailable Change-Id: I79311b39bfa782fc2abeb45884ecb417050cb9f8	2014-10-10 10:05:47 +02:00
Jim Bankoski	0ce51d823f	experimental : partition using 1/8 x 1/8 image The concept: There's too much noise in source pixels for variance and at low bitrate the reconstructed looks nothing like the source so we have problems getting good partitionings with either. This skirts the issue by using a box blur scaled down version for variance calculations. To compare against source_var_ moved keyframe to be rd based like source_var. Change-Id: Ie3babdbfadae324b7b5a76bea192893af27f0624	2014-10-07 16:36:14 -07:00
JackyChen	80465dae88	Add SSE2 code and unit test for VP9 denoiser. This SSE2 is based on VP8 denoiser's SSE2 code. In VP8, there are only 16x16 blocks in denoiser, while in VP9, there are 13 different block sizes. By adding this SSE2 code, the improvement of encoder speed is around 20%(using C code vs using SSE2 code), vary for different clips. The unit test for VP9 denoiser is to confirm that the SSE2 code is bit-exact with the C code. The unit test covers all block size. Change-Id: Ic8d8ac26db4ea40a5f146b5678a065af07eaaa3d	2014-10-06 15:27:40 -07:00
Dmitry Kovalev	1f19ebbab6	Replacing vp9_get_mb_ss_sse2 asm implementation with intrinsics. Change-Id: Ib4f5dd733eb2939b108070a01e83da5d9990bac0	2014-09-06 00:10:25 -07:00
Dmitry Kovalev	48197f0a70	Adding sse2 variant for vp9_mse{8x8, 8x16, 16x8}. Change-Id: I6786d25ce4f32b8d8912f2d239a45ca15b310c4b	2014-09-03 19:02:14 -07:00
Dmitry Kovalev	ab73dba65f	Merge "Replacing asm 16x16 variance calculation with intrinsics."	2014-09-03 18:57:33 -07:00
Dmitry Kovalev	7f4c3b8d93	Merge "Cleaning up vp9_variance_avx2.c."	2014-09-03 13:21:38 -07:00
Dmitry Kovalev	070210e20b	Removing duplicated code. Change-Id: I7b5c776d5e6f5ca428b87fa9411ae4012a9538ba	2014-09-02 17:57:35 -07:00
Dmitry Kovalev	318fc0c34f	Removing MMX SAD calculation code. Removed functions: * vp9_sad_16x16_mmx * vp9_sad_8x16_mmx * vp9_sad_16x8_mmx * vp9_sad_8x8_mmx * vp9_sad_4x4_mmx Change-Id: Ic5174b93b64d65d846f0c11e72cab149e9472bc3	2014-09-02 14:41:36 -07:00
Dmitry Kovalev	6f6bd282c9	Replacing asm 16x16 variance calculation with intrinsics. New code is 20% faster for 64-bit and 15% faster for 32-bit. Compiled using clang. Change-Id: Icfea461238411001fd093561293dbfedfbf8d0bb	2014-09-02 13:54:34 -07:00
Dmitry Kovalev	5c937db029	Cleaning up vp9_variance_avx2.c. Change-Id: I75eb47dd21f87015efd673dbd2aa71f4386afdf5	2014-09-02 11:01:29 -07:00
Dmitry Kovalev	0b721db543	Replacing asm 8x8 variance calculation with intrinsics. New code is 10% faster for 64-bit and 25% faster for 32-bit. Compiled using clang. Change-Id: I8ba1544c30dd6f3ca479db806384317549650dfc	2014-08-29 17:28:31 -07:00
Dmitry Kovalev	12cd6f421d	Removing variance MMX code. Removed functions: * vp9_mse16x16_mmx * vp9_get_mb_ss_mmx * vp9_get4x4var_mmx * vp9_get8x8var_mmx * vp9_variance4x4_mmx * vp9_variance8x8_mmx * vp9_variance16x16_mmx * vp9_variance16x8_mmx * vp9_variance8x16_mmx They all have SSE2 equivalent. Change-Id: I3796f2477c4f59b35b4828f46a300c16e62a2615	2014-08-29 10:26:42 -07:00
Dmitry Kovalev	dcac083cf3	Implementing 4x4 variance calculation with SSE2. New SSE2 function is three times faster than MMX one. Change-Id: I4f387ce9f75b88379176ec7bdc62d86eb5f70fbe	2014-08-28 15:01:16 -07:00
Jingning Han	5b21708fd5	Fix def pairs in 32x32 2D-DCT sse2 Properly pair the def/undef order. Change-Id: I9736a6f8d2efc075b1d72dafc75b9350d055cf65	2014-08-20 09:40:30 -07:00
levytamar82	efdfdf5787	32 Align Load bug In the sub_pixel_avg_variance the parameter sec was also aligned load and changed to unaligned. Change-Id: I4d4966e0291059ea4d705baed1503dc58444fcb7	2014-08-14 14:07:28 -07:00
levytamar82	69a5f5ecf7	Fix bug 807 in the sub_pixel_variance function the dst is aligned to 16 bytes and not to 32 bytes - now load unaligned data Change-Id: I2e0b9745543697efc56fefa32857ea10117af135	2014-08-07 18:51:02 -07:00
levytamar82	af10457e02	Fix bug 806 in the function sad32x32x4d and sad64x64x4d the source is aligned to 16 bytes and not to 32 bytes - the load is now unaligned. Change-Id: I922fdba56d0936b5cf72e4503519f185645a168c	2014-08-07 14:13:30 -07:00
levytamar82	4ba92dc5ab	Fix bug 805 Remove all the redundant dct functions (dct4x4, dct8x8) in avx2 except dct32x32 those functions were copied originally from dct_sse2 Change-Id: I742576fbf5175f3ac09f2076976a9247b259323e	2014-07-28 15:46:01 -07:00
Jingning Han	9ad1b9fc67	Re-design quantization process for 32x32 transform block This commit enables a new quantization process for 32x32 2D-DCT transform coefficient blocks. It improves the compression performance of speed 5 by 1.4%. The overall compression gains of speed 5 due to the new quantization scheme is 4.7%. It also includes the SSSE3 implementation of the 32x32 quantization process. Change-Id: I0855b124fd6462418683f783f5bcb44255c9993b	2014-07-08 16:55:28 -07:00
Jingning Han	00fc0e3ff5	Tune SSSE3 implementation of fast path quantization This commit further simplifies the SSSE3 implementation of the fast path quantization process. Change-Id: I5be3286ec0f1bd81d1cf5be3168fece6384fb9ca	2014-07-07 11:06:53 -07:00
Jingning Han	9ac2f66320	Re-design quantization process This commit re-designs the quantization process for transform coefficient blocks of size 4x4 to 16x16. It improves compression performance for speed 7 by 3.85%. The SSSE3 version for the new quantization process is included. The average runtime of the 8x8 block quantization is reduced from 285 cycles -> 255 cycles, i.e., over 10% faster. Change-Id: I61278aa02efc70599b962d3314671db5b0446a50	2014-07-01 17:00:07 -07:00
Jingning Han	d5ae43318e	Merge "Fast computation path for forward transform and quantization"	2014-06-12 11:59:52 -07:00
Jingning Han	ccba289f8d	Fast computation path for forward transform and quantization This commit enables a fast path computational flow for forward transformation. It checks the sse and variance of prediction residuals and decides if the quantized coefficients are all zero, dc only, or more. It then selects the corresponding coding path in the forward transformation and quantization stage. It is currently enabled in rtc coding mode. Will do it for rd coding mode next. In speed -6, the runtime for pedestrian_area 1080p at 1000 kbps goes down from 14234 ms to 13704 ms, i.e., about 4% speed-up. Overall coding performance for rtc set is changed by -0.18%. Change-Id: I0452da1786d59bc8bcbe0a35fdae9f623d1d44e1	2014-06-12 11:10:54 -07:00
Dmitry Kovalev	e6fadb5ba8	Merge "Cleaning up vp9_variance_mmx.c."	2014-06-10 17:27:12 -07:00
Jingning Han	540d910350	Fix potential overflow issue in SSSE3 forward 8x8 2D-DCT The SSSE3 implementation might find a potential overflow issue in its second 1-D transform, if all input residual pixels are close to 255. This commit fixes the issue and re-enables the unit test on the SSSE3 version. Change-Id: I0520478abdab7afd3ff2842516bec951111e9b3c	2014-06-03 14:21:47 -07:00
Yaowu Xu	d553cc10dc	Merge "Fixed a crash windows build"	2014-05-29 08:16:19 -07:00
Yaowu Xu	43414f3f7b	Fixed a crash windows build Change-Id: I58baa1da1f3bfc8a6da454399139fe6a7473ff10	2014-05-28 15:50:50 -07:00
Dmitry Kovalev	ac3d97f124	Cleaning up vp9_variance_mmx.c. Change-Id: I42d83f91e272c92daed604c233f74439fe6307c5	2014-05-28 12:03:55 -07:00
Dmitry Kovalev	a789bfec87	Cleaning up vp9_variance_sse2.c. Change-Id: I5ec336848f6489c31cf2b645026fa2025db07466	2014-05-27 13:53:19 -07:00
Dmitry Kovalev	72ab966d5e	Removing vp9_pragmas.h. Change-Id: I9120a87e27e73e496932d11716937e2fad246521	2014-05-22 13:46:31 -07:00
Deb Mukherjee	b59b324171	Merge "Renames x86_64 specific asm files"	2014-05-22 12:30:38 -07:00
Deb Mukherjee	e272273443	Renames x86_64 specific asm files Renames all x86_64 specific assembly files to consistently end in _x86_64.asm. This will be useful for build systems to handle these files differently. All new 64-bit specific assembly files should use the new naming convention. Change-Id: I36c89584967c82ffc4088b1b5044ac15d2bb7536	2014-05-21 13:55:56 -07:00
Jingning Han	d8b26caa71	Merge "Adjust the forward 16x16 DCT computation steps"	2014-05-21 09:16:04 -07:00
Deb Mukherjee	a185bc3350	Extends temporal filtering to work for 422 data This is needed for profiles 1 and 2. Change-Id: I5dd7644c2932d055ab89e050d4be7d4117cd1028	2014-05-20 15:19:40 -07:00
Jingning Han	7f547336b7	Adjust the forward 16x16 DCT computation steps This commit adjusts the forward 16x16 DCT computation steps to simplify the register level operations. It fixes the corresponding sse2 version accordingly. Change-Id: I72a9c25b8ca9442fc5e113f47cd701ae55aa7f08	2014-05-19 12:39:26 -07:00
Yunqing Wang	c661cf0dad	Merge "AVX2 To VP9 Block Error Optimization"	2014-05-15 11:29:29 -07:00
levytamar82	1fbab853c8	AVX2 To VP9 Block Error Optimization vp9_block_error_sse2 can only handle 16 bytes at a time but the function requires to handle a sequence of 32 bytes at a time so each 16 bytes is handled in a different register. With AVX2 optimization the 32 bytes can be handled in one register instead of two in the SSE2 The vp9_block_error was optimized by 85%. The user level was optimized by 1.2% Change-Id: Ia8fffe60e61eff7432a5fbd538757894f6c319fd	2014-05-14 11:51:07 -07:00
Alex Converse	b5422fab46	Add an x86inc MMX fwht4x4. Change-Id: Ib0a73d4863478f9b8a00976379d25d2f6ebbb197	2014-05-08 12:01:27 -07:00
Dmitry Kovalev	68a600d82a	Merge "Moving pair_set_epi32 macro into vp9_dct32x32_sse2.c."	2014-05-07 13:34:05 -07:00
Paul Wilkins	33b1c457ed	Revert "Add an MMX fwht4x4" Includes changes that are not compatible with VS windows builds. Amongst other things stdint.h is not supported in VS. This reverts commit `89fbf3de50`. Change-Id: Ifa86d7df250578d1ada9b539c9ff12ed0c523cdd	2014-05-07 12:53:27 +01:00
Alex Converse	75d05d5ed4	Merge "Add an MMX fwht4x4"	2014-05-06 11:12:27 -07:00
Alex Converse	89fbf3de50	Add an MMX fwht4x4 7% faster encoding a desktop lossless at RT speed 4. Change-Id: I41627f5b737752616b6512bb91a36ec45995bf64	2014-05-05 15:10:48 -07:00
Jingning Han	52ae97b6aa	SSSE3 implementation of full inverse 8x8 2D-DCT This commit enables SSSE3 version full inverse 8x8 2D-DCT and reconstruction. It makes the runtime of vp9_idct8x8_64_add down from 256 cycles (SSE2) to 246 cycles. Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3	2014-05-05 10:49:27 -07:00
Dmitry Kovalev	25a666ef39	Moving pair_set_epi32 macro into vp9_dct32x32_sse2.c. Change-Id: I642a7d343677bf934e9a54cf4ad78e908620e39a	2014-05-01 16:45:49 -07:00
Dmitry Kovalev	e05b92c0aa	Merge "Removing half-variance asm functions which are not used."	2014-05-01 14:50:45 -07:00
Jingning Han	39761eb5d6	Merge "Enable SSSE3 implementation of 8x8 forward 2D-DCT"	2014-04-30 13:41:36 -07:00
Dmitry Kovalev	94f5491c46	Removing half-variance asm functions which are not used. Corresponding C functions were removed in I99695564a3aa9bc8c79ac0a551d257e2ff3ad3c3 Change-Id: I50a5575065a7a9e41904eb2161afd739def927db	2014-04-30 12:21:54 -07:00
Jingning Han	1eaa3a76dc	Enable SSSE3 implementation of 8x8 forward 2D-DCT Assembly implementation of ssse3 8x8 forward 2D-DCT. The current version is turned on only for x86_64. The average unit runtime goes from 157 cycles down to 136 cycles, i.e., about 12.8% faster. This translates into about 1.5% speed-up for pedestrian_area 1080p at speed 2. Change-Id: I0f12435857e9425ed7ce12541344dfa16837f4f4	2014-04-29 15:49:18 -07:00
Dmitry Kovalev	6e01079cc0	Removing unused vp9_variance_halfpixvar*() functions. Change-Id: I99695564a3aa9bc8c79ac0a551d257e2ff3ad3c3	2014-04-25 11:50:07 -07:00
Dmitry Kovalev	2fc3a18653	Removing unused vp9_mcomp_x86.h file. We don't use declarations from this file. The real declarations (differently named) are in vp9_rtcd_defs.pl, e.g. vp9_full_search_sad. Change-Id: I73cbf064305710ba20747233cfdbe67366f069a0	2014-04-14 11:32:58 -07:00
levytamar82	0fa8b668c1	AVX2 SAD Optimization: 2 functions were optimized for avx2 by using full 256 bit register In order to handle 32 elements in parallel instead of only 16 in parallel: 1. vp9_sad32x32x4d 2. vp9_sad64x64x4d The function level gain is 66% and the user level gain is ~1%. Change-Id: I4efbb3bc7d8bc03b64b6c98f5cd5c4a9dd3212cb	2014-03-21 13:53:32 -07:00
Yaowu Xu	5511968f21	Removed several unused functions. Change-Id: Ib9e27298c575afc02a98b593bc6ad60762064d9b	2014-03-17 14:09:29 -07:00
Andrew Russell	e337322e63	Merge "improved speed of 4x4 sse2 fdct."	2014-03-05 14:35:44 -08:00
Andrew Russell	a46f5459c3	improved speed of 4x4 sse2 fdct. * speed improvement of 30 percent achieved * multiplies and adds remain the same * non-arithmetic instructions minimized by hand, by: -expanding 2 pass loop -removing irrelevant "shuffles" -combining last two rounding steps * further improvements may be possible Change-Id: Idec2c3f52910c48e6a0e0f9aefed5cae31b0b8c0	2014-03-03 14:25:42 -08:00
levytamar82	ea14909687	AVX2 SubPixel AVG Variance Optimization Optimizing 2 functions to process 32 elements in parallel instead of 16: 1. vp9_sub_pixel_avg_variance64x64 2. vp9_sub_pixel_avg_variance32x32 both of those function were calling vp9_sub_pixel_avg_variance16xh_ssse3 instead of calling that function, it calls vp9_sub_pixel_avg_variance32xh_avx2 that is written in avx2 and process 32 elements in parallel. This Optimization gave 80% function level gain and 2% user level gain Change-Id: Iea694654e1b7612dc6ed11e2626208c2179502c8	2014-02-28 22:51:04 -07:00
James Zern	d12b39daab	vp9_subpel_variance_impl_intrin_avx2.c: make some tables static + fix formatting Change-Id: I7b4ec11b7b46d8926750e0b69f7a606f3ab80895	2014-02-18 20:42:49 -08:00
levytamar82	52dac5d1cb	AVX2 SubPixel Variance Optimization Optimizing 2 functions to process 32 elements in parallel instead of 16: 1. vp9_sub_pixel_variance64x64 2. vp9_sub_pixel_variance32x32 both of those function were calling vp9_sub_pixel_variance16xh_ssse3 instead of calling that function, it calls vp9_sub_pixel_variance32xh_avx2 that is written in avx2 and process 32 elements in parallel. This Optimization gave 70% function level gain and 2% user level gain Change-Id: I4f5cb386b346ff6c878a094e1c3b37e418e50bde	2014-02-14 16:59:11 -07:00
Andrew Russell	549c31f8ae	minor spelling cleanup in comments Change-Id: Ia91c6c406273345b08505097ffe1af3896980f06	2014-02-12 16:32:51 -08:00
Yunqing Wang	0d43bd77e5	Bug fix in ssse3 quantize function A bug was reported in Issue 702: "SIGILL (Illegal instruction) when transcoding with vp9 - using FFmpeg". It was reproduced and fixed. Change-Id: Ie32c149a89af02856084aeaf289e848a905c7700	2014-02-07 14:32:30 -08:00
Dmitry Kovalev	005fc6970b	Finally removing "short" from transform names. Change-Id: I5259b68dc1bcceb153e3ffe638a79a59a3019e9d	2014-02-06 11:54:15 -08:00
Dmitry Kovalev	ff41764920	Removing _1d suffix from transform names. It is enough to specify (e.g.) idct16, it is obviously different from idct16x16. Change-Id: I6b408a37a945de3162429380b59a775b03b95db0	2014-01-27 16:15:36 -08:00
James Zern	b453941caf	vp9/encoder: add extern "C" to headers Change-Id: I4f51ce859a97bf1b8fd2b37ac585b7c643232b69	2014-01-23 16:21:24 -08:00
levytamar82	357b65369f	AVX2 Variance Optimization Optimizing the variance functions: vp9_variance16x16, vp9_variance32x32, vp9_variance64x64, vp9_variance32x16, vp9_variance64x32, vp9_mse16x16 by migrating to AVX2 some of the functions were optimized by processing 32 elements instead of 16. some of the functions were optimized by processing 2 loop strides of 16 elements in a single 256 bit register This optimization gives between 2.4% - 2.7% user level performance gain and 42% function level gain. Change-Id: I265ae08a2b0196057a224a86450153ef3aebd85d	2014-01-08 12:05:53 -07:00
James Zern	bd9a388a06	vp9: normalize include guards Change-Id: If4ddbdcfb3ab387cbca6910b42cf4df8111e6879	2013-12-16 19:40:49 -08:00
Yaowu Xu	e9c19617bf	Merge "vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2"	2013-11-27 10:27:32 -08:00
levytamar82	8def766de2	vp9_short_fdct32x32_rd vp9_short_fdct32x32 optimized for AVX2 Change-Id: I6366e84490883b72362f762369d7e5bccb64f02f	2013-11-21 14:19:49 -08:00
Abo Talib Mahfoodh	ec2dbdd107	Improve vp9_fdct4x4_sse2 (x1.2) Modifications are done to reduce the total clock cycle. Speedup: 1.2 Tested with: park_joy_420_720p50.y4m Change-Id: Ia36b87e62e2f80a5fadaf5628729aedc80f38f3f	2013-11-21 15:04:35 -05:00
Jingning Han	fabc783695	Fix an overflow issue in SSE2 forward ADST The step that sums three input samples could potentially cause the intermediate result go beyond 16 bit limit, when operating as the second 1-D transform. This commit fixes the issue. Change-Id: Iaf512449ac2d25ddd8a806d760afab362c62a516	2013-11-13 15:15:59 -08:00
Yunqing Wang	d7289658fb	Remove TEXTREL from 32bit encoder This patch fixed the issue reported in "Issue 655: remove textrel's from 32-bit vp9 encoder". The set of vp9_subpel_variance functions that used x86inc.asm ABI didn't build correctly for 32bit PIC. The fix was carefully done under the situation that there was not enough registers. After the change, we got $ eu-findtextrel libvpx.so eu-findtextrel: no text relocations reported in 'libvpx.so' Change-Id: I1b176311dedaf48eaee0a1e777588043c97cea82	2013-11-07 13:39:40 -08:00
Dmitry Kovalev	600a3860a4	Making input pointer constant for all fdct/fht functions. Change-Id: I78f7012f967a777ddd39bae6671eb501df6bbfe8	2013-10-24 11:48:25 -07:00
Dmitry Kovalev	fd724f13b0	Renaming vp9_short_fdct4x4 and vp9_short_walsh4x4. For consistency with idct function names. Renames: vp9_short_fdct4x4 -> vp9_fdct4x4 vp9_short_walsh4x4 -> vp9_fwht4x4 Change-Id: Id15497cc1270acca626447d846f0ce9199770f58	2013-10-23 14:28:39 -07:00
Dmitry Kovalev	a018988ce8	Renaming vp9_short_fdct32x32 to vp9_fdct32x32. For consistency with idct function names. Change-Id: Ie77b7178e0894c57cd5cb9243c949eb9224ece18	2013-10-23 13:41:40 -07:00
Dmitry Kovalev	5bdd4d9ccf	Merge "Renaming vp9_short_fdct16x16 to vp9_fdct16x16."	2013-10-23 13:37:09 -07:00
Dmitry Kovalev	02feb63684	Renaming vp9_short_fdct16x16 to vp9_fdct16x16. For consistency with idct function names. Change-Id: I5ca355ba99fdba04f09254be95cf79808b534f71	2013-10-23 10:57:12 -07:00
Dmitry Kovalev	fa143dbc8e	Renaming vp9_short_fdct8x8 to vp9_fdct8x8. For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f	2013-10-23 10:52:33 -07:00
Dmitry Kovalev	9f09618bd4	Merge "Using stride (# of elements) instead of pitch (bytes) in fdct4x4."	2013-10-22 13:05:24 -07:00
Dmitry Kovalev	a767d10fa5	Merge "Using stride (# of elements) instead of pitch (bytes) in fdct8x8."	2013-10-22 11:34:17 -07:00
Dmitry Kovalev	190c2b4591	Using stride (# of elements) instead of pitch (bytes) in fdct4x4. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I0ba3c52513a5fdd194f1e7e2901092671398985b	2013-10-21 15:27:35 -07:00
Dmitry Kovalev	e5fa44c869	Using stride (# of elements) instead of pitch (bytes) in fdct8x8. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Ibc944952a192e6c7b2b6a869ec2894c01da82ed1	2013-10-18 12:20:26 -07:00
Dmitry Kovalev	1aa7fd5aef	Using stride (# of elements) instead of pitch (bytes) in fdct16x16. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: I2d95fdcbba96aaa0ed24a80870cb38f53487a97d	2013-10-18 11:49:33 -07:00
Dmitry Kovalev	e05412fc23	Using stride (# of elements) instead of pitch (bytes) in fdct32x32. Just making fdct consistent with iht/idct/fht functions which all use stride (# of elements) as input argument. Change-Id: Id623c5113262655fa50f7c9d6cec9a91fcb20bb4	2013-10-17 13:02:28 -07:00
Dmitry Kovalev	a4585285ed	Removing unused 8x4 transform from the encoder. Change-Id: Icbcf68b5b685a56f255ebc3859c9692accdadf9e	2013-10-15 11:27:28 -07:00
Jingning Han	80f215198f	Merge "Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16"	2013-10-09 16:08:42 -07:00
Jim Bankoski	9603989c72	Merge "cpplint vp9_variance_sse2.c"	2013-10-07 15:44:50 -07:00
Jim Bankoski	f59cb3eacc	Merge "added nolint to function that doesn't seem easy to breakup"	2013-10-05 16:47:23 -07:00
Jim Bankoski	5b4f836148	cpplint issues resolved in vp9_variance_mmx.c Change-Id: Idbfabe427fbeab44210f13fec8b6f63f7a4eb0dd	2013-10-04 14:22:08 -07:00
Jim Bankoski	eb5b7ac27b	added nolint to function that doesn't seem easy to breakup Change-Id: I5489b116aea7c510ea5ebbed3c1445f321b05f3e	2013-10-04 14:17:47 -07:00
Jim Bankoski	25ecb1f0b3	cpplint vp9_variance_sse2.c Change-Id: Ifce8f5b57a1ea8952e8a67c5b92a127a061899fa	2013-10-04 14:15:06 -07:00
A.Mahfoodh	5215b83aea	Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16 Simplify the k_cvtlo_epi16 and k_cvthi_epi16 to only two instructions. Then inlined them. Quoting from Intel MMX_App_Compute_16bit_Vector.pdf: "The PMADDWD instruction multiplies four pairs of 16-bit numbers and produces partial sums of the results and can do so once per clock (with a three-clock latency)." so I am assuming that there will be three clock overhead after the last _mm_madd_pi16 command. Even with the overhead the number of clocks in general should be smaller. I am not sure though because I could not find information about number of clocks required for instructions in k_cvtlo_epi16 and k_cvthi_epi16. I will run a test and compare the execution time. Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7	2013-10-02 20:02:03 -04:00
A.Mahfoodh	13c7715a75	Number of instructions in fdct4_1d_sse2 reduced by two. Mathematically the results are the same. Change-Id: I1c5126cd3ca64e8515ca6331e0989c6f7dd651a0	2013-09-23 17:23:27 -07:00
Jingning Han	09bc942b47	Fix overflow issue in 16x16 quantization SSSE3 The 16x16 transform unit test suggested that the peak coefficient value can reach 32639. This could cause potential overflow issue in the SSSE3 implementation of 16x16 block quantization. This commit fixes this issue by replacing addition with saturated addition. Change-Id: I6d5bb7c5faad4a927be53292324bd2728690717e	2013-09-06 21:06:10 -07:00
Jingning Han	458c2833c0	Use saturated addition in SSSE3 of 32x32 quant The 32x32 forward transform can potentially reach peak coefficient value close to 32700, while the rounding factor can go up to 610. This could cause overflow issue in the SSSE3 implementation of 32x32 quantization process. This commit resolves this issue by replacing the addition operations with saturated addition operations in 32x32 block quantization. Change-Id: Id6b98996458e16c5b6241338ca113c332bef6e70	2013-09-05 12:49:12 -07:00
Jingning Han	3cf46fa591	Fix 32x32 forward transform SSE2 version This commit fixed the potential overflow issue in the SSE2 implementation of 32x32 forward DCT. It resolved the corrupted coded frames in the border of scenes. Change-Id: If87eef2d46209269f74ef27e7295b6707fbf56f9	2013-08-31 18:47:08 -07:00
Jingning Han	c86c5443eb	Merge "Fix overflow issue in SSSE3 32x32 quantization"	2013-08-29 16:49:04 -07:00
Jingning Han	abff678866	Fix overflow issue in SSSE3 32x32 quantization The 32x32 quantization process can potentially have the intermediate stacks over 16-bit range, thereby causing enc/dec mismatch. This commit fixes this overflow issue in the SSSE3 implementation, as well as the prototype, of 32x32 quantization. This fixes issue 607 from webm@googlecode. Change-Id: I85635e6ca236b90c3dcfc40d449215c7b9caa806	2013-08-29 11:00:54 -07:00
Yaowu Xu	9482c07953	fixed the reading too many bytes In subpel_avg_variance functions, code similar to the following: punpckldq m2, [addr] actually reads 8 bytes. For functions that are supposed to work on buffers that have fewer than 8 bytes per line, this caused a valgrind error of reading uninitialized memory. Change-Id: I2a4c079dbdbc747829bd9e2ed85f0018ad2a3a34	2013-08-27 08:39:20 -07:00

233 Commits