generic-library/vpx

Author	SHA1	Message	Date
Yunqing Wang	687c56e802	Merge "SAD32xh and SAD64xh for AVX2"	2014-10-20 12:37:55 -07:00
levytamar82	7045aec00a	SAD32xh and SAD64xh for AVX2 All SAD functions that process 32 or more consecutive elements are optimized for AVX2: vp9_sad64x64, vp9_sad64x32, vp9_sad32x64, vp9_sad32x32, vp9_sad32x16, vp9_sad64x64_avg, vp9_sad64x32_avg, vp9_sad32x64_avg, vp9_sad32x32_avg, vp9_sad32x16_avg. The functions that appeared as hotspots are vp9_sad32x32 and vp9_sad64x64. vp9_sad32x32 was optimized by 68% and vp9_sad64x64 by 90%; together they gave an overall ~2.3% user-level gain. Change-Id: Iccf86b375a2b54c5fbbe685902ead0c9a561b9fd	2014-10-19 13:59:10 -07:00
JackyChen	6356d21a47	vp9_denoiser_sse2.c: solve windows build error. Change-Id: Ib5df91c8580d5dbeb0b3554edc9c2ca906ba4c4d	2014-10-17 09:28:22 -07:00
James Zern	00f1cf40ed	Merge "vp9_denoiser_sse2.c: eliminate gcc warnings"	2014-10-17 03:26:06 -07:00
JackyChen	8514d03402	vp9_denoiser_sse2.c: eliminate gcc warnings Change-Id: I5f63f48e11e31ea9951223c5b18f42a2471e4560	2014-10-17 11:00:57 +02:00
Alex Converse	7497d2fb23	Add a 32-bit friendly sse2 quantizer. This is based on the 64-bit ssse3 quantizer. 1.1x speedup for screen content at speed 7. Change-Id: I57d15415ef97c49165954bbe3daaaf9318e37448	2014-10-14 11:37:41 -07:00
James Zern	7c6fec672f	vp9_avg_intrin_sse2: correct intrinsics include immintrin.h -> emmintrin.h fixes build where newer intrinsics are unavailable Change-Id: I79311b39bfa782fc2abeb45884ecb417050cb9f8	2014-10-10 10:05:47 +02:00
Jim Bankoski	0ce51d823f	experimental : partition using 1/8 x 1/8 image The concept: There's too much noise in source pixels for variance and at low bitrate the reconstructed looks nothing like the source so we have problems getting good partitionings with either. This skirts the issue by using a box blur scaled down version for variance calculations. To compare against source_var_ moved keyframe to be rd based like source_var. Change-Id: Ie3babdbfadae324b7b5a76bea192893af27f0624	2014-10-07 16:36:14 -07:00
JackyChen	80465dae88	Add SSE2 code and unit test for VP9 denoiser. This SSE2 code is based on the VP8 denoiser's SSE2 code. In VP8, there are only 16x16 blocks in the denoiser, while in VP9 there are 13 different block sizes. With this SSE2 code, encoder speed improves by around 20% (C code vs. SSE2 code), varying across clips. The unit test for the VP9 denoiser confirms that the SSE2 code is bit-exact with the C code. The unit test covers all block sizes. Change-Id: Ic8d8ac26db4ea40a5f146b5678a065af07eaaa3d	2014-10-06 15:27:40 -07:00
Dmitry Kovalev	1f19ebbab6	Replacing vp9_get_mb_ss_sse2 asm implementation with intrinsics. Change-Id: Ib4f5dd733eb2939b108070a01e83da5d9990bac0	2014-09-06 00:10:25 -07:00
Dmitry Kovalev	48197f0a70	Adding sse2 variant for vp9_mse{8x8, 8x16, 16x8}. Change-Id: I6786d25ce4f32b8d8912f2d239a45ca15b310c4b	2014-09-03 19:02:14 -07:00
Dmitry Kovalev	ab73dba65f	Merge "Replacing asm 16x16 variance calculation with intrinsics."	2014-09-03 18:57:33 -07:00
Dmitry Kovalev	7f4c3b8d93	Merge "Cleaning up vp9_variance_avx2.c."	2014-09-03 13:21:38 -07:00
Dmitry Kovalev	070210e20b	Removing duplicated code. Change-Id: I7b5c776d5e6f5ca428b87fa9411ae4012a9538ba	2014-09-02 17:57:35 -07:00
Dmitry Kovalev	318fc0c34f	Removing MMX SAD calculation code. Removed functions: * vp9_sad_16x16_mmx * vp9_sad_8x16_mmx * vp9_sad_16x8_mmx * vp9_sad_8x8_mmx * vp9_sad_4x4_mmx Change-Id: Ic5174b93b64d65d846f0c11e72cab149e9472bc3	2014-09-02 14:41:36 -07:00
Dmitry Kovalev	6f6bd282c9	Replacing asm 16x16 variance calculation with intrinsics. New code is 20% faster for 64-bit and 15% faster for 32-bit. Compiled using clang. Change-Id: Icfea461238411001fd093561293dbfedfbf8d0bb	2014-09-02 13:54:34 -07:00
Dmitry Kovalev	5c937db029	Cleaning up vp9_variance_avx2.c. Change-Id: I75eb47dd21f87015efd673dbd2aa71f4386afdf5	2014-09-02 11:01:29 -07:00
Dmitry Kovalev	0b721db543	Replacing asm 8x8 variance calculation with intrinsics. New code is 10% faster for 64-bit and 25% faster for 32-bit. Compiled using clang. Change-Id: I8ba1544c30dd6f3ca479db806384317549650dfc	2014-08-29 17:28:31 -07:00
Dmitry Kovalev	12cd6f421d	Removing variance MMX code. Removed functions: * vp9_mse16x16_mmx * vp9_get_mb_ss_mmx * vp9_get4x4var_mmx * vp9_get8x8var_mmx * vp9_variance4x4_mmx * vp9_variance8x8_mmx * vp9_variance16x16_mmx * vp9_variance16x8_mmx * vp9_variance8x16_mmx They all have SSE2 equivalent. Change-Id: I3796f2477c4f59b35b4828f46a300c16e62a2615	2014-08-29 10:26:42 -07:00
Dmitry Kovalev	dcac083cf3	Implementing 4x4 variance calculation with SSE2. New SSE2 function is three times faster than MMX one. Change-Id: I4f387ce9f75b88379176ec7bdc62d86eb5f70fbe	2014-08-28 15:01:16 -07:00
Jingning Han	5b21708fd5	Fix def pairs in 32x32 2D-DCT sse2 Properly pair the def/undef order. Change-Id: I9736a6f8d2efc075b1d72dafc75b9350d055cf65	2014-08-20 09:40:30 -07:00
levytamar82	efdfdf5787	32 Align Load bug In sub_pixel_avg_variance, the parameter sec was also read with an aligned load; it is changed to an unaligned load. Change-Id: I4d4966e0291059ea4d705baed1503dc58444fcb7	2014-08-14 14:07:28 -07:00
levytamar82	69a5f5ecf7	Fix bug 807: in the sub_pixel_variance function, dst is aligned to 16 bytes and not to 32 bytes, so the data is now loaded unaligned. Change-Id: I2e0b9745543697efc56fefa32857ea10117af135	2014-08-07 18:51:02 -07:00
levytamar82	af10457e02	Fix bug 806: in the functions sad32x32x4d and sad64x64x4d, the source is aligned to 16 bytes and not to 32 bytes, so the load is now unaligned. Change-Id: I922fdba56d0936b5cf72e4503519f185645a168c	2014-08-07 14:13:30 -07:00
levytamar82	4ba92dc5ab	Fix bug 805: remove all the redundant dct functions (dct4x4, dct8x8) in avx2 except dct32x32. Those functions were originally copied from dct_sse2. Change-Id: I742576fbf5175f3ac09f2076976a9247b259323e	2014-07-28 15:46:01 -07:00
Jingning Han	9ad1b9fc67	Re-design quantization process for 32x32 transform block This commit enables a new quantization process for 32x32 2D-DCT transform coefficient blocks. It improves the compression performance of speed 5 by 1.4%. The overall compression gains of speed 5 due to the new quantization scheme is 4.7%. It also includes the SSSE3 implementation of the 32x32 quantization process. Change-Id: I0855b124fd6462418683f783f5bcb44255c9993b	2014-07-08 16:55:28 -07:00
Jingning Han	00fc0e3ff5	Tune SSSE3 implementation of fast path quantization This commit further simplifies the SSSE3 implementation of the fast path quantization process. Change-Id: I5be3286ec0f1bd81d1cf5be3168fece6384fb9ca	2014-07-07 11:06:53 -07:00
Jingning Han	9ac2f66320	Re-design quantization process This commit re-designs the quantization process for transform coefficient blocks of size 4x4 to 16x16. It improves compression performance for speed 7 by 3.85%. The SSSE3 version for the new quantization process is included. The average runtime of the 8x8 block quantization is reduced from 285 cycles -> 255 cycles, i.e., over 10% faster. Change-Id: I61278aa02efc70599b962d3314671db5b0446a50	2014-07-01 17:00:07 -07:00
Jingning Han	d5ae43318e	Merge "Fast computation path for forward transform and quantization"	2014-06-12 11:59:52 -07:00
Jingning Han	ccba289f8d	Fast computation path for forward transform and quantization This commit enables a fast path computational flow for forward transformation. It checks the sse and variance of prediction residuals and decides if the quantized coefficients are all zero, dc only, or more. It then selects the corresponding coding path in the forward transformation and quantization stage. It is currently enabled in rtc coding mode. Will do it for rd coding mode next. In speed -6, the runtime for pedestrian_area 1080p at 1000 kbps goes down from 14234 ms to 13704 ms, i.e., about 4% speed-up. Overall coding performance for rtc set is changed by -0.18%. Change-Id: I0452da1786d59bc8bcbe0a35fdae9f623d1d44e1	2014-06-12 11:10:54 -07:00
Dmitry Kovalev	e6fadb5ba8	Merge "Cleaning up vp9_variance_mmx.c."	2014-06-10 17:27:12 -07:00
Jingning Han	540d910350	Fix potential overflow issue in SSSE3 forward 8x8 2D-DCT The SSSE3 implementation might find a potential overflow issue in its second 1-D transform, if all input residual pixels are close to 255. This commit fixes the issue and re-enables the unit test on the SSSE3 version. Change-Id: I0520478abdab7afd3ff2842516bec951111e9b3c	2014-06-03 14:21:47 -07:00
Yaowu Xu	d553cc10dc	Merge "Fixed a crash windows build"	2014-05-29 08:16:19 -07:00
Yaowu Xu	43414f3f7b	Fixed a crash windows build Change-Id: I58baa1da1f3bfc8a6da454399139fe6a7473ff10	2014-05-28 15:50:50 -07:00
Dmitry Kovalev	ac3d97f124	Cleaning up vp9_variance_mmx.c. Change-Id: I42d83f91e272c92daed604c233f74439fe6307c5	2014-05-28 12:03:55 -07:00
Dmitry Kovalev	a789bfec87	Cleaning up vp9_variance_sse2.c. Change-Id: I5ec336848f6489c31cf2b645026fa2025db07466	2014-05-27 13:53:19 -07:00
Dmitry Kovalev	72ab966d5e	Removing vp9_pragmas.h. Change-Id: I9120a87e27e73e496932d11716937e2fad246521	2014-05-22 13:46:31 -07:00
Deb Mukherjee	b59b324171	Merge "Renames x86_64 specific asm files"	2014-05-22 12:30:38 -07:00
Deb Mukherjee	e272273443	Renames x86_64 specific asm files Renames all x86_64 specific assembly files to consistently end in _x86_64.asm. This will be useful for build systems to handle these files differently. All new 64-bit specific assembly files should use the new naming convention. Change-Id: I36c89584967c82ffc4088b1b5044ac15d2bb7536	2014-05-21 13:55:56 -07:00
Jingning Han	d8b26caa71	Merge "Adjust the forward 16x16 DCT computation steps"	2014-05-21 09:16:04 -07:00
Deb Mukherjee	a185bc3350	Extends temporal filtering to work for 422 data This is needed for profiles 1 and 2. Change-Id: I5dd7644c2932d055ab89e050d4be7d4117cd1028	2014-05-20 15:19:40 -07:00
Jingning Han	7f547336b7	Adjust the forward 16x16 DCT computation steps This commit adjusts the forward 16x16 DCT computation steps to simplify the register level operations. It fixes the corresponding sse2 version accordingly. Change-Id: I72a9c25b8ca9442fc5e113f47cd701ae55aa7f08	2014-05-19 12:39:26 -07:00
Yunqing Wang	c661cf0dad	Merge "AVX2 To VP9 Block Error Optimization"	2014-05-15 11:29:29 -07:00
levytamar82	1fbab853c8	AVX2 To VP9 Block Error Optimization vp9_block_error_sse2 can only handle 16 bytes at a time, but the function needs to handle a sequence of 32 bytes at a time, so each 16 bytes is handled in a different register. With the AVX2 optimization, the 32 bytes can be handled in one register instead of two as in SSE2. vp9_block_error was optimized by 85%. The user level was optimized by 1.2%. Change-Id: Ia8fffe60e61eff7432a5fbd538757894f6c319fd	2014-05-14 11:51:07 -07:00
Alex Converse	b5422fab46	Add an x86inc MMX fwht4x4. Change-Id: Ib0a73d4863478f9b8a00976379d25d2f6ebbb197	2014-05-08 12:01:27 -07:00
Dmitry Kovalev	68a600d82a	Merge "Moving pair_set_epi32 macro into vp9_dct32x32_sse2.c."	2014-05-07 13:34:05 -07:00
Paul Wilkins	33b1c457ed	Revert "Add an MMX fwht4x4" Includes changes that are not compatible with VS windows builds. Amongst other things stdint.h is not supported in VS. This reverts commit `89fbf3de50`. Change-Id: Ifa86d7df250578d1ada9b539c9ff12ed0c523cdd	2014-05-07 12:53:27 +01:00
Alex Converse	75d05d5ed4	Merge "Add an MMX fwht4x4"	2014-05-06 11:12:27 -07:00
Alex Converse	89fbf3de50	Add an MMX fwht4x4 7% faster encoding a desktop lossless at RT speed 4. Change-Id: I41627f5b737752616b6512bb91a36ec45995bf64	2014-05-05 15:10:48 -07:00
Jingning Han	52ae97b6aa	SSSE3 implementation of full inverse 8x8 2D-DCT This commit enables SSSE3 version full inverse 8x8 2D-DCT and reconstruction. It makes the runtime of vp9_idct8x8_64_add down from 256 cycles (SSE2) to 246 cycles. Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3	2014-05-05 10:49:27 -07:00

186 Commits