generic-library/vpx

Author	SHA1	Message	Date
Yunqing Wang	75cd57503d	Refactor vp9_diamond_search_sad function Currently, vp9_diamond_search_sadx4() is only called when sse3 is enabled, which is improper since sse2 optimization of sdx4df functions are available. Changed to always use vp9_diamond_search_sadx4(). Change-Id: I4b95d6b7a3c6c645783c373f0ba8d645ece24717	2014-07-10 09:19:03 -07:00
Yunqing Wang	30117a576d	Refactor refining_search_sad code There are sse2 optimization of sdx4df functions. Instead of calling vp9_refining_search_sadx4 only when sse3 is enabled, call it always. Change-Id: I24f93818f7d4209d1425039e0eb099ff9ff08fe9	2014-07-09 16:50:11 -07:00
Jingning Han	9ad1b9fc67	Re-design quantization process for 32x32 transform block This commit enables a new quantization process for 32x32 2D-DCT transform coefficient blocks. It improves the compression performance of speed 5 by 1.4%. The overall compression gains of speed 5 due to the new quantization scheme is 4.7%. It also includes the SSSE3 implementation of the 32x32 quantization process. Change-Id: I0855b124fd6462418683f783f5bcb44255c9993b	2014-07-08 16:55:28 -07:00
Jingning Han	9ac2f66320	Re-design quantization process This commit re-designs the quantization process for transform coefficient blocks of size 4x4 to 16x16. It improves compression performance for speed 7 by 3.85%. The SSSE3 version for the new quantization process is included. The average runtime of the 8x8 block quantization is reduced from 285 cycles -> 255 cycles, i.e., over 10% faster. Change-Id: I61278aa02efc70599b962d3314671db5b0446a50	2014-07-01 17:00:07 -07:00
James Zern	88df435d6b	Merge "vp9_rtcd: correct avx2 references"	2014-06-16 17:39:13 -07:00
Jingning Han	d5ae43318e	Merge "Fast computation path for forward transform and quantization"	2014-06-12 11:59:52 -07:00
Jingning Han	ccba289f8d	Fast computation path for forward transform and quantization This commit enables a fast path computational flow for forward transformation. It checks the sse and variance of prediction residuals and decides if the quantized coefficients are all zero, dc only, or more. It then selects the corresponding coding path in the forward transformation and quantization stage. It is currently enabled in rtc coding mode. Will do it for rd coding mode next. In speed -6, the runtime for pedestrian_area 1080p at 1000 kbps goes down from 14234 ms to 13704 ms, i.e., about 4% speed-up. Overall coding performance for rtc set is changed by -0.18%. Change-Id: I0452da1786d59bc8bcbe0a35fdae9f623d1d44e1	2014-06-12 11:10:54 -07:00
James Zern	9f3a0dbb5e	vp9_rtcd: correct avx2 references s/"\$avx2_x86inc"/"avx2"/ avx2 code is all intrinsics and as a result doesn't rely on x86inc.asm Change-Id: I76ad39474d8a00658f3e43131830ef0f4f34772a	2014-06-10 16:26:36 -07:00
James Zern	520cb3f39f	vp9_sub_pixel_variance: disable avx2 variants tests failing under Win32/Win64 + variance_test: add missing avx2 functions (partially disabled) Change-Id: I6abc0657ea076379ab9ca65c12678b9ea199849d	2014-06-10 16:11:15 -07:00
James Zern	d3ff009d84	vp9_sad*x4d: disable avx2 variants tests failing under Win32/Win64 + sad_test: add missing avx2 functions (disabled) Change-Id: I8224fba2b270f6039ab1877d71e1e512f0081856	2014-06-10 16:10:12 -07:00
James Zern	dd9f502933	vp9_f(dct\|ht): disable avx2 variants tests failing under Win32/Win64 + dct16x16_test: add missing avx2 functions (partially disabled) exercises the forward transforms no idct/iht implementations, so the c-code is used Change-Id: I04f64a457fa0828a00f32b5c9fe4f55294f21f61	2014-06-09 18:48:11 -07:00
James Zern	5704578f5f	convolve: disable avx2 variants tests failing under Win32/Win64 Change-Id: I5d49d11911bcda3a832b14efe5500d22597bedcf	2014-06-09 18:42:03 -07:00
Jingning Han	0c4a4225ec	Merge "Enable SSSE3 inverse 2D-DCT with 10 non-zero coeffs"	2014-06-03 16:51:39 -07:00
Dmitry Kovalev	f7ff24cdd0	Reusing existing vp9_get{8x8, 16x16}var() instead of new ones. Change-Id: I87b7c657d8813d7fb383ab519d150c0ffb1dd377	2014-05-29 11:14:06 -07:00
Jingning Han	6d21cbd20b	Enable SSSE3 inverse 2D-DCT with 10 non-zero coeffs This commit enables SSSE3 implementation of the inverse 2D-DCT with only first 10 coefficients non-zero. It reduces the runtime of SSE2 version from 745 cycles to 538 cycles, i.e., 27% speed-up. Change-Id: I18ba4128859b09c704a6ee361d69a86c09fe8dfe	2014-05-28 10:53:33 -07:00
Yunqing Wang	1f2200080b	Revert "Making vp9_get_sse_sum_{8x8, 16x16} static." This reverts commit `e8bbb3d9db`. Change-Id: Ie368d36fd249d323d859d208609c711f04537bbc	2014-05-27 13:37:08 -07:00
Deb Mukherjee	444f93945b	Merge "Remove Wextra warnings from vp9_sad.c"	2014-05-27 11:54:05 -07:00
Jingning Han	48b0891370	Inverse 16x16 2D-DCT SSSE3 implementation This commit enables the SSSE3 implementation of full inverse 16x16 2D-DCT. The unit runtime goes down from 1642 cycles to 1519 cycles, about 7% speed-up. Change-Id: I14d2fdf9da1fb4ed1e5db7ce24f77a1bfc8ea90d	2014-05-23 15:09:35 -07:00
Deb Mukherjee	916550428d	Remove Wextra warnings from vp9_sad.c As a side-effect, the sad unit tests for VP8 and VP9 had to be separated. Change-Id: I068cc2391eed51e9b140ea6aba78338c5fec8d71	2014-05-22 22:21:16 -07:00
Deb Mukherjee	a185bc3350	Extends temporal filtering to work for 422 data This is needed for profiles 1 and 2. Change-Id: I5dd7644c2932d055ab89e050d4be7d4117cd1028	2014-05-20 15:19:40 -07:00
Jim Bankoski	ec82d2dfec	Merge "Revert "Remove Wextra warnings from vp9_sad.c""	2014-05-15 11:54:23 -07:00
Yunqing Wang	c661cf0dad	Merge "AVX2 To VP9 Block Error Optimization"	2014-05-15 11:29:29 -07:00
Jim Bankoski	a16794dd31	Revert "Remove Wextra warnings from vp9_sad.c" This reverts commit `7ab9a9587b`. Nightly test http://build.webmproject.org/jenkins/view/libvpx-nightly-tests/job/libvpx%20unit%20tests%20(valgrind-2)/arch=x86_64-linux-gcc,filter=-VP8:Large./276/console failed. This patch did not address all the assembly issues: some of the vp8 assembly counts on 5 arguments being passed in to this function (one example: vp8_sad8x16_wmt). Please address or split this into vp9 and vp8 patches. Change-Id: I78afcc171649894f887bb8ee3c66de24aaddc7ca	2014-05-15 08:31:20 -07:00
levytamar82	1fbab853c8	AVX2 To VP9 Block Error Optimization vp9_block_error_sse2 can only handle 16 bytes at a time, but the function needs to handle a sequence of 32 bytes at a time, so each 16 bytes is handled in a different register. With the AVX2 optimization, the 32 bytes can be handled in one register instead of the two used by SSE2. vp9_block_error was optimized by 85%. The user level was optimized by 1.2%. Change-Id: Ia8fffe60e61eff7432a5fbd538757894f6c319fd	2014-05-14 11:51:07 -07:00
Deb Mukherjee	7ab9a9587b	Remove Wextra warnings from vp9_sad.c As a side-effect, the max_sad check is removed from the C-implementation of VP8, for consistency with VP9, and to ensure that the SAD tests common to VP8/VP9 pass. That will make the VP8 C implementation of sad a little slower but given that is rarely used in practice, the impact will be minimal. Change-Id: I7f43089fdea047fbf1862e40c21e4715c30f07ca	2014-05-14 03:17:31 -07:00
Johann	ce23931a3f	Only build neon assembly for armv7 targets Allow selectively building just the intrinsics for armv8 Change-Id: I2f29b2e4508b8b8e5649c2906b3159ad1d4ec477	2014-05-12 08:52:02 -07:00
Alex Converse	ec8a3272fa	Merge "Add an x86inc MMX fwht4x4."	2014-05-09 13:48:49 -07:00
Jingning Han	9412785b02	Merge changes I3edd4b95,I4514f974,Ie7fa4386 * changes: Turn on unit tests for SSSE3 8x8 forward and inverse 2D-DCT; Change eob threshold for partial inverse 8x8 2D-DCT to 12; SSSE3 8x8 inverse 2D-DCT with first 10 coeffs non-zero	2014-05-09 09:58:39 -07:00
Alex Converse	b5422fab46	Add an x86inc MMX fwht4x4. Change-Id: Ib0a73d4863478f9b8a00976379d25d2f6ebbb197	2014-05-08 12:01:27 -07:00
Jingning Han	41a350a83d	Change eob threshold for partial inverse 8x8 2D-DCT to 12 The scanning order has the first 12 coefficients of the 8x8 2D-DCT sitting in the top left 4x4 block. Hence the partial inverse 8x8 2D-DCT allows to handle cases with eob below 12. The overall runtime of the inverse 8x8 2D-DCT unit is reduced from 166 cycles (using SSE2) to 150 cycles (using SSSE3). Change-Id: I4514f9748042809ac84df4c14382c00f313f1cd2	2014-05-08 09:48:58 -07:00
Jingning Han	9e7b09bc5d	SSSE3 8x8 inverse 2D-DCT with first 10 coeffs non-zero This commit enables ssse3 assembly implementation of the 8x8 inverse 2D-DCT with only first 10 coefficients non-zero. The average runtime for this unit goes down from 198 cycles to 129 cycles (34.8% faster). Change-Id: Ie7fa4386f6d3a2fe0d47a2eb26fc2a6bbc592ac7	2014-05-07 17:40:02 -07:00
Paul Wilkins	33b1c457ed	Revert "Add an MMX fwht4x4" Includes changes that are not compatible with VS windows builds. Amongst other things stdint.h is not supported in VS. This reverts commit `89fbf3de50`. Change-Id: Ifa86d7df250578d1ada9b539c9ff12ed0c523cdd	2014-05-07 12:53:27 +01:00
Alex Converse	75d05d5ed4	Merge "Add an MMX fwht4x4"	2014-05-06 11:12:27 -07:00
Jingning Han	d289deb04c	Merge "SSSE3 implementation of full inverse 8x8 2D-DCT"	2014-05-06 09:17:22 -07:00
Dmitry Kovalev	e8bbb3d9db	Making vp9_get_sse_sum_{8x8, 16x16} static. Change-Id: Ifb7937c977308c682986f0ce9645a0807d2aa46a	2014-05-05 19:12:38 -07:00
Alex Converse	89fbf3de50	Add an MMX fwht4x4 7% faster encoding a desktop lossless at RT speed 4. Change-Id: I41627f5b737752616b6512bb91a36ec45995bf64	2014-05-05 15:10:48 -07:00
Jingning Han	52ae97b6aa	SSSE3 implementation of full inverse 8x8 2D-DCT This commit enables the SSSE3 version of the full inverse 8x8 2D-DCT and reconstruction. It brings the runtime of vp9_idct8x8_64_add down from 256 cycles (SSE2) to 246 cycles. Change-Id: I0600feac894d6a443a3c9d18daf34156d4e225c3	2014-05-05 10:49:27 -07:00
Jingning Han	39761eb5d6	Merge "Enable SSSE3 implementation of 8x8 forward 2D-DCT"	2014-04-30 13:41:36 -07:00
Dmitry Kovalev	d2bc8816a1	Merge "Adding search_site_config struct."	2014-04-29 16:59:47 -07:00
Jingning Han	1eaa3a76dc	Enable SSSE3 implementation of 8x8 forward 2D-DCT Assembly implementation of ssse3 8x8 forward 2D-DCT. The current version is turned on only for x86_64. The average unit runtime goes from 157 cycles down to 136 cycles, i.e., about 12.8% faster. This translates into about 1.5% speed-up for pedestrian_area 1080p at speed 2. Change-Id: I0f12435857e9425ed7ce12541344dfa16837f4f4	2014-04-29 15:49:18 -07:00
Dmitry Kovalev	aa464eca5e	Adding search_site_config struct. Change-Id: I2ad333553e673dbabcdc0f0366aea311e90849bf	2014-04-29 10:34:53 -07:00
Dmitry Kovalev	6e01079cc0	Removing unused vp9_variance_halfpixvar*() functions. Change-Id: I99695564a3aa9bc8c79ac0a551d257e2ff3ad3c3	2014-04-25 11:50:07 -07:00
Dmitry Kovalev	03e7deae4f	Removing unused vp9_sub_pixel_mse* functions. Change-Id: I8d906da3bd6de0d3042676846f61a8b2a3444508	2014-04-24 11:49:12 -07:00
Dmitry Kovalev	63fa722179	Removing unused cost arguments from mcomp functions. Change-Id: Id81a76d18be6b2de69f81bb563d74c3bb356d434	2014-04-11 10:24:36 -07:00
Yunqing Wang	4e66293fcb	Use source frame difference to make partition decision Calculate the difference variance between last source frame and current source frame. The variance is calculated at 16x16 block level. The variances are compared to several thresholds to decide final partition sizes. An adaptive strategy is implemented to decide using SOURCE_VAR_BASED_PARTITION or FIXED_PARTITION based on motions in the video. The switching test is done once every search_type_check_frequency frames. The selection of source_var_thresh needs to be investigated further later. RTC set Borg test showed 0.424% overall psnr gain, and 0.357% ssim gain. For clips with large enough static area, the encoding speedup is around 2% to 15%. Change-Id: Id7d268f1d8cbca7fb8026aa4a53b3c77459dc156	2014-04-08 17:03:02 -07:00
levytamar82	0fa8b668c1	AVX2 SAD Optimization: 2 functions were optimized for avx2 by using the full 256 bit register in order to handle 32 elements in parallel instead of only 16 in parallel: 1. vp9_sad32x32x4d, 2. vp9_sad64x64x4d. The function level gain is 66% and the user level gain is ~1%. Change-Id: I4efbb3bc7d8bc03b64b6c98f5cd5c4a9dd3212cb	2014-03-21 13:53:32 -07:00
James Zern	805078a1bf	build: convert rtcd.sh to perl significantly speeds up file generation. the goal of this change is to convert rtcd.sh to perl as directly as possible to allow for simple comparison. future changes can make it more perl-like. --- Linux: [CREATE] vpx_scale_rtcd.h real 0m0.485s -> 0m0.022s; [CREATE] vp8_rtcd.h real 0m4.619s -> 0m0.060s; [CREATE] vp9_rtcd.h real 0m10.102s -> 0m0.087s. Windows: [CREATE] vpx_scale_rtcd.h real 0m8.360s -> 0m0.080s; [CREATE] vp8_rtcd.h real 1m8.083s -> 0m0.160s; [CREATE] vp9_rtcd.h real 2m6.489s -> 0m0.233s. Change-Id: Idfb71188206c91237d6a3c3a81dfe00d103f11ee	2014-03-03 14:47:11 -08:00

47 Commits