generic-library/vpx

Author	SHA1	Message	Date
Yunqing Wang	f9404f2406	Revert "Revert "SSSE3 convolution optimization"" This reverts commit `b645257121`. Change-Id: I60d1bf57ae8e9eb6127f42f2d5a780124ac51b45	2014-01-13 12:29:55 -08:00
Paul Wilkins	b645257121	Revert "SSSE3 convolution optimization" This reverts commit `511d218c60`. In current form intrinsics break borg build. Change-Id: Ied37936af841250ecff449802e69a3d3761c91b9	2014-01-10 13:38:26 +00:00
Jingning Han	a4c94a94cc	Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P2"	2014-01-09 18:17:25 -08:00
Jingning Han	faa2ba86cc	Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P1"	2014-01-09 18:17:12 -08:00
Jingning Han	af31b27aae	Optimze inv 16x16 DCT with 10 non-zero coeffs - P2 This commit further optimizes SSE2 operations in the second 1-D inverse 16x16 DCT, with (<10) non-zero coefficients. The average runtime of this module goes down from 779 cycles -> 725 cycles. Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f	2014-01-09 12:46:09 -08:00
Yunqing Wang	f3b9b97c0e	Merge "SSSE3 convolution optimization"	2014-01-09 12:39:47 -08:00
levytamar82	511d218c60	SSSE3 convolution optimization Optimizing all SSSE3 assembly for convolution: 1. vp9_filter_block1d4_h8_sse2 2. vp9_filter_block1d8_h8_sse2 3. vp9_filter_block1d16_h8_sse2 4. vp9_filter_block1d4_v8_sse2 5. vp9_filter_block1d8_v8_sse2 6. vp9_filter_block1d16_v8_sse2 my optimization include: -processing 2x8 elements in one 128 bit register instead of processing 8 elements in one 128 bit register. -removing unecessary loads. This optimization gives between 2.4% user level gain for 480p input and 1.6% user level gain for 720p. This Optimization done only for 64bit. Change-Id: Icb586dc0c938b56699864fcee6c52fd43b36b969	2014-01-09 12:27:51 -07:00
Jingning Han	ba6ab46cdc	Optimze inv 16x16 DCT with 10 non-zero coeffs - P1 This commit is the first patch optimizing SSE2 implementation of inverse 16x16 DCT with <10 non-zero coefficients. It focused on the first 1-D (row) transformation. It exploits the fact that only top-left 4x4 block contains non-zero coefficients, in a 2-D inverse 16x16 DCT with <10 coeffients. The average runtime of idct16x16_10 unit is reduced from 883 cycles -> 779 cycles (12% faster). For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes down from 310651 ms -> 305910 ms. The decoding speed goes up from 80.37 fps -> 80.87 fps. Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645	2014-01-08 15:36:45 -08:00
Jingning Han	3e0c62b53f	Tune IDCT8_1D macro function interface This commit adds input/output ports for IDCT8_1D macro function to provide more flexibility in variable use. It allows to skip several buffer swap operations. Change-Id: I21f3450509537322293043b3281bfd3949868677	2014-01-03 15:23:47 -08:00
Jingning Han	0b1a27135a	Reduce num of buffer swap calls in idct8_1d_sse2 This commit merges the initial buffer swap operations in idct8_1d_sse2 into the array transpose step, hence reducing number of instructions therein. Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479	2014-01-03 12:12:03 -08:00
Jingning Han	1bb11781e2	Rework idct8x8_10 SSE2 implementation This commit optimizes the SSE2 implmentation of idct8x8_10. It exploits the fact that only top-left 4x4 block contains non-zero coefficients, and hence reduces the instructions needed. The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles, estimated by averaging over 100000 runs. For pedestrian_area_1080p 300 frames coded at 4000kbps, the average decoding speed goes up from 79.3 fps to 79.7 fps. Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180	2014-01-03 12:04:09 -08:00
Yunqing Wang	b6a0ac11f0	Merge "Code clean up"	2013-12-20 08:46:11 -08:00
Yunqing Wang	09faf55916	Code clean up Removed unused filter coefficients. Change-Id: Ib395a51305e23ff41ab69c1808d56946d25961cd	2013-12-19 11:09:23 -08:00
Jim Bankoski	b720ba165f	rename loop filter functions This renames all the loop filter functions so that they no longer refer to mb Change-Id: I8a58a8c7fd253d835cb619bde13913e896ece90b	2013-12-17 17:34:34 -08:00
Abo Talib Mahfoodh	e4419ab691	Improve idct16x16: _256_add_sse2(x1.107)&_10_add_sse2(x1.012) The performance gain of idct16x16_10_add_sse2 function is not noticeable. However since both functions use the IDCT16_1D, idct16x16_10_add_sse2 should be modified as well. Tested with: park_joy_420_720p50.y4m Change-Id: I02b957e36fcf997c677d15baf496533895271bff	2013-12-02 21:08:56 -05:00
Yunqing Wang	8f182a1cac	Merge "improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2"	2013-12-02 15:10:05 -08:00
Abo Talib Mahfoodh	f97d91ab67	improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2 vp9_idct32x32_34_add_sse2: speedup: 1.472 IDCT32_1D_34 and MULTIPLICATION_AND_ADD_2 are optimized based on the fact that Only upper-left 8x8 has non-zero values. vp9_idct32x32_1024_add_sse2: speedup: 1.032 Tested with: park_joy_420_720p50.y4m Change-Id: I8670ce547552b48695049de298e2fc46ce28dfbc	2013-11-26 12:28:26 -05:00
Yunqing Wang	ed36720b66	Do vertical loopfiltering in parallel This patch followed "Add filter_selectively_vert_row2 to enable parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. For other optimizations (neon and dspr2), current 16-pixel functions were done by calling 8-pixel functions twice, and real 16-pixel functions could be added later. Decoder speedup: tulip clip: 2% speed gain; old_town_cross: 1.2% speed gain; bus: 2% speed gain. Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7	2013-11-22 10:04:51 -08:00
Yunqing Wang	256cf7ee7d	Correct ssse3 8/16-pixel wide sub-pixel filter calculation Although no mismatch was indicated for 8/16 wide sub-pixel filters in issue 661, they had similar problems that could cause mismatch potentially. This patch fixed calculations in HORIZx8/16 and VERTx8/16. Change-Id: I169961c9d40a20340995b7d22aafc89ccf30bfca	2013-11-20 12:52:56 -08:00
Yunqing Wang	0ef63f596d	Fix stack pointer in sub-pixel filters In commit "3d50da5397d20abc932d81453b26cde758293a40", the stack pointer was modified while aligning the stack, and it needed to be pop out at the end. Change-Id: I062971e195f1f2ab9d0ab5fb84dcf215a0fcaa67	2013-11-20 09:42:44 -08:00
Yunqing Wang	3d50da5397	Fix decoder mismatch with ssse3 enabled This patch fixed issue 661: "Decoder produces mismatched outputs with ssse3 enabled and disabled." In sub-pixel filters, a pixel value was multiplied by a filter coefficient, and the results were added up. The order of adding up these multiplications had to be arranged carefully to prevent incorrect overflowing. Change-Id: Id08af4200fea9e1b896fc40157b8651c2c7e80f2	2013-11-19 15:10:04 -08:00
Abo Talib Mahfoodh	613e2d2e90	Improve vp9_iht4x4_16_add_sse2 (x1.341) This rebase is a better implementation of the previous ones. Modifications are done to reduce the total clock cycle. Speedup: 1.341 Compiled with -O3 Tested with: park_joy_420_720p50.y4m Change-Id: I940eaf283f60597ca0d9d2e13d518878d55ff02d	2013-11-18 20:53:13 -05:00
Yunqing Wang	64f728caef	Do horizontal loopfiltering in parallel This patch followed "Rewrite filter_selectively_horiz for parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. Also, corrected the declaration of aligned arrays. For 8-pixel-in-parallel case, improved the calculation of the masks and filters. Updated the threshold loading since the thresholds were already duplicated. Updated neon C functions to call neon loopfilters twice. Using tulip clip, tests showed it gave a ~1.5% decoder speed gain. Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35	2013-11-15 16:18:43 -08:00
Yunqing Wang	e731b2ba2c	Merge "Improve vp9_idct4x4_1_add_sse2"	2013-11-08 12:00:36 -08:00
James Zern	2d980b803a	vp9 ssse3 d207_predictor_32x32: add missing GLOBAL() removes a textrel for sh_b23456789abcdefff Change-Id: I80cb9dfd8e49a0fe884c8ff76472275b3a00cb57	2013-11-01 20:33:22 -07:00
Tamar Levy	54f9205653	mb_lpf_horizontal_edge AVX2 optimization This CL contains two AVX2 optimized loop filter functions, mb_lpf_horizontal_edge_w_avx2_8 and mb_lpf_horizontal_edge_w_avx2_16. Change-Id: I604e4fe6e99752b7800c2ea98721d97f7e0b931b	2013-10-31 10:26:15 -06:00
Yunqing Wang	47665452f0	Merge "Add 32x32 idct function for eob<=34 case"	2013-10-25 09:34:46 -07:00
Yunqing Wang	f88315cb29	Add 32x32 idct function for eob<=34 case When only upper-left 8x8 area has non-zero dct coefficients, we could skip 1D IDCT for 9th to 32th rows to save operations. This function is called when eob <= 34. Change-Id: I9684b75947bdde346cfe3720f08a953aa7a13fb5	2013-10-24 16:13:21 -07:00
Dmitry Kovalev	fa143dbc8e	Renaming vp9_short_fdct8x8 to vp9_fdct8x8. For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f	2013-10-23 10:52:33 -07:00
Abo Talib Mahfoodh	908a992d7f	Improve vp9_idct4x4_1_add_sse2 Simple modification to reduce number of cycles in the function. Original function number of cycles: 973 Modified function number of cycles: 835 Improvment factor: 1.165 Tested with: park_joy_420_720p50.y4m Change-Id: Ic5857272ea3aafe21d5ef9a69258d78c688f69bd	2013-10-22 09:35:36 -04:00
Yunqing Wang	dd51042802	Fix d207 intra prediction SSSE3 functions This patch fixed a bug that caused 32bit PIC build mismatch. The stack pointer was modified after "GET_GOT". Loading left pointer from a hard-coded position gave wrong result. Change-Id: Iea0aec6f917b12a6b3393ffc986bad74510248cc	2013-10-18 17:00:18 -07:00
Jingning Han	bf187d1b2d	Merge "Fix a few indent format issues in buffer defs"	2013-10-15 16:23:50 -07:00
Jingning Han	0a66541619	Fix a few indent format issues in buffer defs Change-Id: Iac55891ac9e6f13718c9f822aa099b5ca491832a	2013-10-15 11:51:09 -07:00
Dmitry Kovalev	65f118d72f	Making input pointer of any inverse transform constant. Also renaming dest_stride to stride in some places. Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940	2013-10-11 18:27:12 -07:00
Dmitry Kovalev	7ef573914d	Consistent names for inverse hybrid transforms (1 of 2). Renames: vp9_short_iht4x4_add -> vp9_iht4x4_16_add vp9_short_iht8x8_add -> vp9_iht8x8_64_add vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0	2013-10-11 13:31:32 -07:00
Dmitry Kovalev	9c8f3063b1	Merge "Removing vp9_idct4_1d_sse2 function."	2013-10-11 10:43:56 -07:00
Yunqing Wang	57b97b56f6	Code cleanup Minor code cleanup. Change-Id: I47c1f794842d4570bb39cfd23b80f54f5606bba6	2013-10-11 09:08:41 -07:00
Yunqing Wang	3a0b59e3fd	Merge "SSE2 8-tap sub-pixel filter optimization"	2013-10-11 08:44:56 -07:00
Dmitry Kovalev	ddf1b76205	Removing vp9_idct4_1d_sse2 function. We have two SSE2-optimized functions for idct4_1d: vp9_idct4_1d_sse2 <-- removing this one idct4_1d_sse2 vp9_idct4_1d_sse2 was used only by the following functions which already have SSE2 optimized variants: vp9_idct4x4_16_add_c -> vp9_idct4x4_16_add_see2 idct8_1d -> vp9_idct8x8_{16, 10, 1}_see2 vp9_short_iht4x4_add_c -> vp9_short_iht4x4_add_see2 Change-Id: Ib0a7f6d1373dbaf7a4a41208cd9d0671fdf15edb	2013-10-10 16:50:43 -07:00
Scott LaVarnway	83936e8cd5	d207 intra prediction ssse3 using bytes byte version of ronalds d207 ssse3 optimizations (commit: f891f84d3ba9345b0074e682f0fea09b8ddf4f1e) Change-Id: If15f71a589ea16f78ac86a501b0c5c6231dc9af1	2013-10-10 15:50:31 -07:00
Dmitry Kovalev	2be3b84aed	Merge "Giving consistent names to IDCT 32x32 functions."	2013-10-10 15:31:25 -07:00
Yunqing Wang	86528586a3	Merge "d153 intra prediction (32x32) ssse3 using bytes"	2013-10-10 15:16:45 -07:00
Yunqing Wang	3fb728c749	SSE2 8-tap sub-pixel filter optimization To ensure fast encoding/decoding on devices without ssse3 support, SSE2 optimization of sub-pixel filters was done. Test using 1080p clip showed the decoder speeds were ~70fps with ssse3 filters, ~60fps with sse2 filters, and ~15fps with c filters. Change-Id: Ie2088f87d83a889fba80a613e4d0e287aadd785c	2013-10-10 14:12:47 -07:00
Dmitry Kovalev	1e766b50e2	Giving consistent names to IDCT 32x32 functions. Renames: vp9_short_idct32x32_add -> vp9_idct32x32_1024_add vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add vp9_idct_add_32x32 -> vp9_idct32x32_add Change-Id: Id85306f5814bac6c47463a6b5901a93082510666	2013-10-10 11:27:39 -07:00
Dmitry Kovalev	b096c5a336	Giving consistent names to IDCT 16x16 functions. Renames: vp9_short_idct16x16_add -> vp9_idct16x16_256_add vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add vp9_short_idct16x16_1_add -> vp9_idct16x16_1_add vp9_idct_add_16x16 -> vp9_idct16x16_add Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3	2013-10-07 14:31:10 -07:00
Dmitry Kovalev	2ae93a776b	Merge "Giving consistent names to IDCT 8x8 functions."	2013-10-07 14:19:50 -07:00
Scott LaVarnway	a2a3b4a479	d153 intra prediction (32x32) ssse3 using bytes Change-Id: Ie2c0d84ff9f6294084d65f4380e1f30c09e681c9	2013-10-07 11:21:10 -04:00
Jim Bankoski	bf893e84bd	Merge changes I8a106dd6,Iec442603 * changes: d153 intra prediction (16x16) ssse3 using bytes d153 intra prediction ssse3 using bytes	2013-10-06 20:11:24 -07:00
Dmitry Kovalev	c6ad70d5f1	Giving consistent names to IDCT 8x8 functions. Renames: vp9_short_idct8x8_add -> vp9_idct8x8_64_add vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add vp9_idct_add_8x8 -> vp9_idct8x8_add Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1	2013-10-06 00:24:09 -07:00
Dmitry Kovalev	3a0602578e	Giving consistent names to IDCT/IWHT functions. The idea is to have the following names for each transform size: vp9_idct4x4_add vp9_idct4x4_1_add vp9_idct4x4_10_add vp9_idct4x4_16_add vp9_idct8x8_add vp9_idct8x8_1_add vp9_idct8x8_10_add vp9_idct8x8_64_add etc for 16x16, 32x32 The actual list of renames in this patch: vp9_idct_add_lossless -> vp9_iwht4x4_add vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add vp9_idct_add -> vp9_idct4x4_add vp9_short_idct4x4_add -> vp9_idct4x4_16_add vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1	2013-10-04 14:17:06 -07:00

1 2 3 4

182 Commits