generic-library/vpx

Author	SHA1	Message	Date
Jingning Han	a4c94a94cc	Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P2"	2014-01-09 18:17:25 -08:00
Jingning Han	faa2ba86cc	Merge "Optimze inv 16x16 DCT with 10 non-zero coeffs - P1"	2014-01-09 18:17:12 -08:00
Jingning Han	af31b27aae	Optimze inv 16x16 DCT with 10 non-zero coeffs - P2 This commit further optimizes SSE2 operations in the second 1-D inverse 16x16 DCT, with (<10) non-zero coefficients. The average runtime of this module goes down from 779 cycles -> 725 cycles. Change-Id: Iac31b123640d9b1e8f906e770702936b71f0ba7f	2014-01-09 12:46:09 -08:00
Yunqing Wang	f3b9b97c0e	Merge "SSSE3 convolution optimization"	2014-01-09 12:39:47 -08:00
levytamar82	511d218c60	SSSE3 convolution optimization Optimizing all SSSE3 assembly for convolution: 1. vp9_filter_block1d4_h8_sse2 2. vp9_filter_block1d8_h8_sse2 3. vp9_filter_block1d16_h8_sse2 4. vp9_filter_block1d4_v8_sse2 5. vp9_filter_block1d8_v8_sse2 6. vp9_filter_block1d16_v8_sse2 The optimizations include: -processing 2x8 elements in one 128 bit register instead of processing 8 elements in one 128 bit register. -removing unnecessary loads. This optimization gives a 2.4% user level gain for 480p input and a 1.6% user level gain for 720p. This optimization was done only for 64bit. Change-Id: Icb586dc0c938b56699864fcee6c52fd43b36b969	2014-01-09 12:27:51 -07:00
Jingning Han	ba6ab46cdc	Optimze inv 16x16 DCT with 10 non-zero coeffs - P1 This commit is the first patch optimizing SSE2 implementation of inverse 16x16 DCT with <10 non-zero coefficients. It focused on the first 1-D (row) transformation. It exploits the fact that only top-left 4x4 block contains non-zero coefficients, in a 2-D inverse 16x16 DCT with <10 coefficients. The average runtime of idct16x16_10 unit is reduced from 883 cycles -> 779 cycles (12% faster). For pedestrian_area_1080p 300 frames at 4000 kbps, the speed 2 runtime goes down from 310651 ms -> 305910 ms. The decoding speed goes up from 80.37 fps -> 80.87 fps. Change-Id: Ic6f3ac5a637a76c07ba73ddaafe318a699fea645	2014-01-08 15:36:45 -08:00
Jingning Han	3e0c62b53f	Tune IDCT8_1D macro function interface This commit adds input/output ports for IDCT8_1D macro function to provide more flexibility in variable use. This allows several buffer swap operations to be skipped. Change-Id: I21f3450509537322293043b3281bfd3949868677	2014-01-03 15:23:47 -08:00
Jingning Han	0b1a27135a	Reduce num of buffer swap calls in idct8_1d_sse2 This commit merges the initial buffer swap operations in idct8_1d_sse2 into the array transpose step, hence reducing number of instructions therein. Change-Id: I219f6f50813390d2ec3ee37eecf2a4a2b44ae479	2014-01-03 12:12:03 -08:00
Jingning Han	1bb11781e2	Rework idct8x8_10 SSE2 implementation This commit optimizes the SSE2 implementation of idct8x8_10. It exploits the fact that only top-left 4x4 block contains non-zero coefficients, and hence reduces the instructions needed. The runtime of idct8x8_10_sse2 goes down from 216 to 198 CPU cycles, estimated by averaging over 100000 runs. For pedestrian_area_1080p 300 frames coded at 4000kbps, the average decoding speed goes up from 79.3 fps to 79.7 fps. Change-Id: I6d277bbaa3ec9e1562667906975bae06904cb180	2014-01-03 12:04:09 -08:00
Yunqing Wang	b6a0ac11f0	Merge "Code clean up"	2013-12-20 08:46:11 -08:00
Yunqing Wang	09faf55916	Code clean up Removed unused filter coefficients. Change-Id: Ib395a51305e23ff41ab69c1808d56946d25961cd	2013-12-19 11:09:23 -08:00
Jim Bankoski	b720ba165f	rename loop filter functions This renames all the loop filter functions so that they no longer refer to mb Change-Id: I8a58a8c7fd253d835cb619bde13913e896ece90b	2013-12-17 17:34:34 -08:00
Abo Talib Mahfoodh	e4419ab691	Improve idct16x16: _256_add_sse2(x1.107)&_10_add_sse2(x1.012) The performance gain of idct16x16_10_add_sse2 function is not noticeable. However since both functions use the IDCT16_1D, idct16x16_10_add_sse2 should be modified as well. Tested with: park_joy_420_720p50.y4m Change-Id: I02b957e36fcf997c677d15baf496533895271bff	2013-12-02 21:08:56 -05:00
Yunqing Wang	8f182a1cac	Merge "improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2"	2013-12-02 15:10:05 -08:00
Abo Talib Mahfoodh	f97d91ab67	improve vp9_idct32x32_34(x1.472)&1024(x1.032)_add_sse2 vp9_idct32x32_34_add_sse2: speedup: 1.472 IDCT32_1D_34 and MULTIPLICATION_AND_ADD_2 are optimized based on the fact that Only upper-left 8x8 has non-zero values. vp9_idct32x32_1024_add_sse2: speedup: 1.032 Tested with: park_joy_420_720p50.y4m Change-Id: I8670ce547552b48695049de298e2fc46ce28dfbc	2013-11-26 12:28:26 -05:00
Yunqing Wang	ed36720b66	Do vertical loopfiltering in parallel This patch followed "Add filter_selectively_vert_row2 to enable parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. For other optimizations (neon and dspr2), current 16-pixel functions were done by calling 8-pixel functions twice, and real 16-pixel functions could be added later. Decoder speedup: tulip clip: 2% speed gain; old_town_cross: 1.2% speed gain; bus: 2% speed gain. Change-Id: I4818a0c72f84b34f5fe678e496cf4a10238574b7	2013-11-22 10:04:51 -08:00
Yunqing Wang	256cf7ee7d	Correct ssse3 8/16-pixel wide sub-pixel filter calculation Although no mismatch was indicated for 8/16 wide sub-pixel filters in issue 661, they had similar problems that could cause mismatch potentially. This patch fixed calculations in HORIZx8/16 and VERTx8/16. Change-Id: I169961c9d40a20340995b7d22aafc89ccf30bfca	2013-11-20 12:52:56 -08:00
Yunqing Wang	0ef63f596d	Fix stack pointer in sub-pixel filters In commit "3d50da5397d20abc932d81453b26cde758293a40", the stack pointer was modified while aligning the stack, and it needed to be popped at the end. Change-Id: I062971e195f1f2ab9d0ab5fb84dcf215a0fcaa67	2013-11-20 09:42:44 -08:00
Yunqing Wang	3d50da5397	Fix decoder mismatch with ssse3 enabled This patch fixed issue 661: "Decoder produces mismatched outputs with ssse3 enabled and disabled." In sub-pixel filters, a pixel value was multiplied by a filter coefficient, and the results were added up. The order of adding up these multiplications had to be arranged carefully to prevent incorrect overflowing. Change-Id: Id08af4200fea9e1b896fc40157b8651c2c7e80f2	2013-11-19 15:10:04 -08:00
Abo Talib Mahfoodh	613e2d2e90	Improve vp9_iht4x4_16_add_sse2 (x1.341) This rebase is a better implementation of the previous ones. Modifications are done to reduce the total clock cycle. Speedup: 1.341 Compiled with -O3 Tested with: park_joy_420_720p50.y4m Change-Id: I940eaf283f60597ca0d9d2e13d518878d55ff02d	2013-11-18 20:53:13 -05:00
Yunqing Wang	64f728caef	Do horizontal loopfiltering in parallel This patch followed "Rewrite filter_selectively_horiz for parallel loopfiltering" commit, and added x86 SSE2 optimization to do 16-pixel filtering in parallel. Also, corrected the declaration of aligned arrays. For 8-pixel-in-parallel case, improved the calculation of the masks and filters. Updated the threshold loading since the thresholds were already duplicated. Updated neon C functions to call neon loopfilters twice. Using tulip clip, tests showed it gave a ~1.5% decoder speed gain. Change-Id: Id02638626ac27a4b0e0b09d71792a24c0499bd35	2013-11-15 16:18:43 -08:00
Yunqing Wang	e731b2ba2c	Merge "Improve vp9_idct4x4_1_add_sse2"	2013-11-08 12:00:36 -08:00
James Zern	2d980b803a	vp9 ssse3 d207_predictor_32x32: add missing GLOBAL() removes a textrel for sh_b23456789abcdefff Change-Id: I80cb9dfd8e49a0fe884c8ff76472275b3a00cb57	2013-11-01 20:33:22 -07:00
Tamar Levy	54f9205653	mb_lpf_horizontal_edge AVX2 optimization This CL contains two AVX2 optimized loop filter functions, mb_lpf_horizontal_edge_w_avx2_8 and mb_lpf_horizontal_edge_w_avx2_16. Change-Id: I604e4fe6e99752b7800c2ea98721d97f7e0b931b	2013-10-31 10:26:15 -06:00
Yunqing Wang	47665452f0	Merge "Add 32x32 idct function for eob<=34 case"	2013-10-25 09:34:46 -07:00
Yunqing Wang	f88315cb29	Add 32x32 idct function for eob<=34 case When only upper-left 8x8 area has non-zero dct coefficients, we could skip 1D IDCT for 9th to 32nd rows to save operations. This function is called when eob <= 34. Change-Id: I9684b75947bdde346cfe3720f08a953aa7a13fb5	2013-10-24 16:13:21 -07:00
Dmitry Kovalev	fa143dbc8e	Renaming vp9_short_fdct8x8 to vp9_fdct8x8. For consistency with idct function names. Change-Id: I7b6af2f92c66eff56f84ed29edc3a66af8dc421f	2013-10-23 10:52:33 -07:00
Abo Talib Mahfoodh	908a992d7f	Improve vp9_idct4x4_1_add_sse2 Simple modification to reduce number of cycles in the function. Original function number of cycles: 973 Modified function number of cycles: 835 Improvement factor: 1.165 Tested with: park_joy_420_720p50.y4m Change-Id: Ic5857272ea3aafe21d5ef9a69258d78c688f69bd	2013-10-22 09:35:36 -04:00
Yunqing Wang	dd51042802	Fix d207 intra prediction SSSE3 functions This patch fixed a bug that caused 32bit PIC build mismatch. The stack pointer was modified after "GET_GOT". Loading left pointer from a hard-coded position gave wrong result. Change-Id: Iea0aec6f917b12a6b3393ffc986bad74510248cc	2013-10-18 17:00:18 -07:00
Jingning Han	bf187d1b2d	Merge "Fix a few indent format issues in buffer defs"	2013-10-15 16:23:50 -07:00
Jingning Han	0a66541619	Fix a few indent format issues in buffer defs Change-Id: Iac55891ac9e6f13718c9f822aa099b5ca491832a	2013-10-15 11:51:09 -07:00
Dmitry Kovalev	65f118d72f	Making input pointer of any inverse transform constant. Also renaming dest_stride to stride in some places. Change-Id: I75f602b623a5a7071d4922b747c45fa0b7d7a940	2013-10-11 18:27:12 -07:00
Dmitry Kovalev	7ef573914d	Consistent names for inverse hybrid transforms (1 of 2). Renames: vp9_short_iht4x4_add -> vp9_iht4x4_16_add vp9_short_iht8x8_add -> vp9_iht8x8_64_add vp9_short_iht16x16_add_c -> vp9_iht16x16_256_add Change-Id: Ibca7a188fd062b196787ac5efc1ea545e7f166c0	2013-10-11 13:31:32 -07:00
Dmitry Kovalev	9c8f3063b1	Merge "Removing vp9_idct4_1d_sse2 function."	2013-10-11 10:43:56 -07:00
Yunqing Wang	57b97b56f6	Code cleanup Minor code cleanup. Change-Id: I47c1f794842d4570bb39cfd23b80f54f5606bba6	2013-10-11 09:08:41 -07:00
Yunqing Wang	3a0b59e3fd	Merge "SSE2 8-tap sub-pixel filter optimization"	2013-10-11 08:44:56 -07:00
Dmitry Kovalev	ddf1b76205	Removing vp9_idct4_1d_sse2 function. We have two SSE2-optimized functions for idct4_1d: vp9_idct4_1d_sse2 <-- removing this one idct4_1d_sse2 vp9_idct4_1d_sse2 was used only by the following functions which already have SSE2 optimized variants: vp9_idct4x4_16_add_c -> vp9_idct4x4_16_add_sse2 idct8_1d -> vp9_idct8x8_{16, 10, 1}_sse2 vp9_short_iht4x4_add_c -> vp9_short_iht4x4_add_sse2 Change-Id: Ib0a7f6d1373dbaf7a4a41208cd9d0671fdf15edb	2013-10-10 16:50:43 -07:00
Scott LaVarnway	83936e8cd5	d207 intra prediction ssse3 using bytes byte version of ronalds d207 ssse3 optimizations (commit: f891f84d3ba9345b0074e682f0fea09b8ddf4f1e) Change-Id: If15f71a589ea16f78ac86a501b0c5c6231dc9af1	2013-10-10 15:50:31 -07:00
Dmitry Kovalev	2be3b84aed	Merge "Giving consistent names to IDCT 32x32 functions."	2013-10-10 15:31:25 -07:00
Yunqing Wang	86528586a3	Merge "d153 intra prediction (32x32) ssse3 using bytes"	2013-10-10 15:16:45 -07:00
Yunqing Wang	3fb728c749	SSE2 8-tap sub-pixel filter optimization To ensure fast encoding/decoding on devices without ssse3 support, SSE2 optimization of sub-pixel filters was done. Test using 1080p clip showed the decoder speeds were ~70fps with ssse3 filters, ~60fps with sse2 filters, and ~15fps with c filters. Change-Id: Ie2088f87d83a889fba80a613e4d0e287aadd785c	2013-10-10 14:12:47 -07:00
Dmitry Kovalev	1e766b50e2	Giving consistent names to IDCT 32x32 functions. Renames: vp9_short_idct32x32_add -> vp9_idct32x32_1024_add vp9_short_idct32x32_1_add -> vp9_idct32x32_1_add vp9_idct_add_32x32 -> vp9_idct32x32_add Change-Id: Id85306f5814bac6c47463a6b5901a93082510666	2013-10-10 11:27:39 -07:00
Dmitry Kovalev	b096c5a336	Giving consistent names to IDCT 16x16 functions. Renames: vp9_short_idct16x16_add -> vp9_idct16x16_256_add vp9_short_idct16x16_10_add -> vp9_idct16x16_10_add vp9_short_idct16x16_1_add -> vp9_idct16x16_1_add vp9_idct_add_16x16 -> vp9_idct16x16_add Change-Id: Ief8a3904de78deab0f4ede944c4d0339c228cfc3	2013-10-07 14:31:10 -07:00
Dmitry Kovalev	2ae93a776b	Merge "Giving consistent names to IDCT 8x8 functions."	2013-10-07 14:19:50 -07:00
Scott LaVarnway	a2a3b4a479	d153 intra prediction (32x32) ssse3 using bytes Change-Id: Ie2c0d84ff9f6294084d65f4380e1f30c09e681c9	2013-10-07 11:21:10 -04:00
Jim Bankoski	bf893e84bd	Merge changes I8a106dd6,Iec442603 * changes: d153 intra prediction (16x16) ssse3 using bytes d153 intra prediction ssse3 using bytes	2013-10-06 20:11:24 -07:00
Dmitry Kovalev	c6ad70d5f1	Giving consistent names to IDCT 8x8 functions. Renames: vp9_short_idct8x8_add -> vp9_idct8x8_64_add vp9_short_idct8x8_1_add -> vp9_idct8x8_1_add vp9_short_idct8x8_10_add -> vp9_idct8x8_10_add vp9_idct_add_8x8 -> vp9_idct8x8_add Change-Id: Ifb8d3a45b4c0397aa805b30463f3d14581bf72c1	2013-10-06 00:24:09 -07:00
Dmitry Kovalev	3a0602578e	Giving consistent names to IDCT/IWHT functions. The idea is to have the following names for each transform size: vp9_idct4x4_add vp9_idct4x4_1_add vp9_idct4x4_10_add vp9_idct4x4_16_add vp9_idct8x8_add vp9_idct8x8_1_add vp9_idct8x8_10_add vp9_idct8x8_64_add etc for 16x16, 32x32 The actual list of renames in this patch: vp9_idct_add_lossless -> vp9_iwht4x4_add vp9_short_iwalsh4x4_add -> vp9_iwht4x4_16_add vp9_short_iwalsh4x4_1_add -> vp9_iwht4x4_1_add vp9_idct_add -> vp9_idct4x4_add vp9_short_idct4x4_add -> vp9_idct4x4_16_add vp9_short_idct4x4_1_add -> vp9_idct4x4_1_add Change-Id: I6f43f7437c68dd30cdd05d72e213765578ed30b1	2013-10-04 14:17:06 -07:00
Yunqing Wang	134dfea878	Merge "Rewrite HORIZx4 and HORIZx8 in subpixel filter functions"	2013-10-03 12:17:47 -07:00
Yunqing Wang	ed22179a82	Rewrite HORIZx4 and HORIZx8 in subpixel filter functions In subpixel filters, prefetched source data, unrolled loops, and interleaved instructions. In HORIZx4, integrated the idea in Scott's CL (commit: `d22a504d11`), which was suggested by Erik/Tamar from Intel. Further tweaking was done to combine row 0, 2, and row 1, 3 in registers to do more 2-row-in-1 operations until the last add. Test showed a ~2% decoder speedup. Change-Id: Ib53d04ede8166c38c3dc744da8c6f737ce26a0e3	2013-10-03 09:04:02 -07:00
Scott LaVarnway	20a09d928a	d153 intra prediction (16x16) ssse3 using bytes Change-Id: I8a106dd61b0a2520fae792d87d6348e662649b2d	2013-10-02 16:34:05 -04:00
Dmitry Kovalev	3c4e9e341f	Adding SSE2 optimized vp9_short_idct32x32_1_add function. Change-Id: I4b1c6bb9ff615f5872b96ed07dbf0f5e18e63643	2013-10-01 18:34:36 -07:00
Yunqing Wang	03698aa6d8	Merge "Modify HORIZx16 macro in subpixel filter functions"	2013-10-01 14:18:10 -07:00
Yunqing Wang	df8e156432	Modify HORIZx16 macro in subpixel filter functions Interleaved the instructions, reduced register dependency, and prefetched the source data. This improved the decoder speed by 0.6% - 2%. Change-Id: I568067aa0c629b2e58219326899c82aedf7eccca	2013-10-01 12:49:25 -07:00
Scott LaVarnway	27b390e1a1	d153 intra prediction ssse3 using bytes byte version of ronalds d153 ssse3 optimizations for 4x4 and 8x8 (commit: fc91a2a112238a1aee568f3b840585de4e928fca) Change-Id: Iec4426032311483f615fd9e0dceba3ee85ddebd7	2013-10-01 09:05:20 -04:00
Jim Bankoski	152fd59964	fixed cpp lint issue in vp9_postproc_x86 Change-Id: I2b2af1dd9f5c29c05e28a4fd51fa58ccc4071477	2013-09-29 18:44:58 -07:00
Jim Bankoski	ec421b7810	nolintify intrinsic idct file Change-Id: Id2cc5c829399a2afdf7a8a82615a4e272c814986	2013-09-29 18:42:24 -07:00
Dmitry Kovalev	3fab2125ff	Renaming vp9_short_idct10_8x8_add to vp9_short_idct8x8_10_add. Making name consistent with vp9_short_idct8x8 and vp9_short_idct8x8_1. Change-Id: I99e0be040ec893f9571dcf090e18f98dc58339f5	2013-09-27 15:26:27 -07:00
Dmitry Kovalev	db60c02c9e	Merge "Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10."	2013-09-27 13:08:52 -07:00
Dmitry Kovalev	15a36a0a0d	Renaming vp9_short_idct10_16x16 to vp9_short_idct16x16_10. Making function name consistent with vp9_short_idct16x16 and vp9_short_idct16x16_1. Change-Id: I70e54be9e6b9a1dddab0de470686591e96d05517	2013-09-26 14:01:25 -07:00
Scott LaVarnway	208658490c	d63 intra prediction ssse3 using bytes byte version of ronalds d63 ssse3 optimizations (commit: c5a1c8cf3541cf3665fee981b36d22c9fbd4191e) Change-Id: Ifd3e6d454a2246085f23eabb38518a930321e807	2013-09-25 16:16:44 -04:00
Yunqing Wang	9d901217c6	Fix x86inc.asm to build PIC code correctly Current x86inc.asm didn't handle 32bit PIC build properly. TEXTRELs were seen in the library built. The PIC macros from libvpx's x86_abi_support.asm was used to fix this problem. The assembly code was modified to use the macros. Notes: We need this fix in for decoder building. Functions in encoder will be fixed later. Change-Id: Ifa548d37b1d0bc7d0528db75009cc18cd5eb1838	2013-09-18 13:45:46 -07:00
James Zern	2d58761993	Revert "Improved 8t filters" This is incompatible with most toolchains other than gcc. Revert "Deleted #include <inttypes.h>" This reverts commit `4d018be950`. This reverts commit `d22a504d11`. Change-Id: I1751dc6831f4395ee064e6748281418e967e1dcf	2013-09-13 15:13:06 -07:00
Paul Wilkins	4d018be950	Deleted #include <inttypes.h> This seems not to be needed and is not supported in the Windows build. Change-Id: Iaca3bbf8cca283aee6bc336cb31ba9dd4610322b	2013-09-12 13:43:07 +01:00
Scott LaVarnway	d22a504d11	Improved 8t filters Reformatted version of a patch submitted by Erik/Tamar from Intel. For the test clips used, the decoder performance improved by ~2%. Change-Id: Ifbc37ac6311bca9ff1cfefe3f2e9b7f13a4a511b	2013-09-11 13:56:32 -04:00
Scott LaVarnway	22dc946a7e	Improved mb_lpf_horizontal_edge_w_sse2_8 This patch is a reformatted version of optimizations done by engineers at Intel (Erik/Tamar) who have been providing performance feedback for VP9. For the test clips used (720p, 1080p), up to 1.2% performance improvement was seen. Change-Id: Ic1a7149098740079d5453b564da6fbfdd0b2f3d2	2013-08-29 08:30:17 -04:00
Jingning Han	9d67495f72	Optimize 32x32 2D inverse DCT for speed-up This commit exploits the sparsity of quantized coefficient matrix. It detects each 32x8 array and skips the corresponding inverse transformation if all entries are zero. For ped1080p at 8000 kbps, this on average reduces the runtime of 32x32 inverse 2D-DCT SSE2 function from 6256 cycles -> 5200 cycles. It makes the overall encoding process about 2% faster at speed 0. The speed-up is more pronounced for the decoding process. Change-Id: If20056c3566bd117642a76f8884c83e8bc8efbcf	2013-07-31 17:13:31 -07:00
Jingning Han	a7c4de22e1	16x16 inverse 2D-DCT with DC only This commit provides special handling for the 16x16 inverse 2D-DCT when only the DC coefficient is quantized to a non-zero value. Change-Id: I7bf71be7fa13384fab453dc8742b5b50e77a277c	2013-07-29 14:45:53 -07:00
Ronald S. Bultje	6f3054b65d	Merge "d45 intra prediction SSSE3 optimizations."	2013-07-26 17:21:09 -07:00
Jingning Han	325e0aa650	Special handle on DC only inverse 8x8 2D-DCT This commit enables special handling for the 8x8 inverse 2D-DCT when only the DC coefficient is quantized to be non-zero. For bus_cif at 2000 kbps, it provides about 1% speed-up at speed 0. Change-Id: I2523222359eec26b144cf8fd4c63a4ad63b1b011	2013-07-26 14:16:51 -07:00
Ronald S. Bultje	94b0c6791d	d45 intra prediction SSSE3 optimizations. Change-Id: Ie48035ff4f93c41f8a9b3023e6444fd10432d8fb	2013-07-26 13:30:02 -07:00
Jingning Han	384e37e32b	SSE2 inverse 4x4 2D-DCT with DC only Add SSE2 implementation to handle the special case of inverse 2D-DCT where only DC coefficient is non-zero. Change-Id: I2c6a59e21e5e77b8cf39a4af5eecf4d5ade32e2f	2013-07-24 23:19:56 -07:00
Jingning Han	d2de1ca37b	Merge vp9_dc_only_idct_add and vp9_short_idct4x4_1 They share the same functionality, so merging together. Change-Id: I98a0386fcee052cb854f9ff90c283c1b844bcb79	2013-07-24 16:51:15 -07:00
James Zern	98e132bde0	Merge changes I40454d26,I892e76d5,I865ab3f9,I4a4bec17,I61c4351e,I37eb3559,I1031c556,I8c8f1f42 * changes: delete vp9_loopfilter_sse2.asm vp9_loopfilter_intrin_sse2: cosmetics: fix indent delete x86/vp9_loopfilter_x86.h vp9_loopfilter_intrin_sse2: make some funcs static vp9_loopfilter_intrin_sse2: remove unused uv funcs vp9_loopfilter: remove uv function typedef filter_block_plane: reuse some constants vp9_loopfilter.c: make some functions static	2013-07-16 14:25:32 -07:00
James Zern	50015f6eba	delete vp9_loopfilter_sse2.asm sse2 functions are provided by vp9_loopfilter_intrin_sse2.c Change-Id: I40454d26034e3ef915eeaf889937fe7d1b519b9b	2013-07-16 13:09:16 -07:00
James Zern	8f4787a383	vp9_loopfilter_intrin_sse2: cosmetics: fix indent Change-Id: I892e76d5ad1443b2ea0d1a7839fe26afe9c68ffb	2013-07-16 13:09:16 -07:00
James Zern	af58254267	delete x86/vp9_loopfilter_x86.h also remove prototype_loopfilter{,_block} defines from vp9_loopfilter.h Change-Id: I865ab3f9436c7b1ca166f76630328abf01389405	2013-07-16 13:09:05 -07:00
Jingning Han	d05f66aa10	SSE2 16x16 inverse ADST/DCT hybrid transform This commit enables SSE2 implementation of 16x16 inverse ADST/DCT hybrid transform. The runtime goes from 5742 cycles -> 1821 cycles. This provides about 1% encoding speed-up at speed 0. Change-Id: I1678d0988bf30b9efd524877705bbb3645edb17b	2013-07-16 12:51:42 -07:00
James Zern	04606d7258	vp9_loopfilter_intrin_sse2: make some funcs static + drop 'vp9_' Change-Id: I4a4bec175316aab8f65c3a23bacc8362399a1357	2013-07-13 18:48:00 -07:00
James Zern	dc968d3d45	vp9_loopfilter_intrin_sse2: remove unused uv funcs vp9_mbloop_filter_horizontal_edge_sse2 / vp9_mbloop_filter_vertical_edge_uv_sse2 Change-Id: I61c4351ef0cce79fa4156a47ddace781f1566869	2013-07-13 18:44:32 -07:00
James Zern	bd6b79c44d	vp9_loopfilter: remove uv function typedef loop_filter_uvfunction is unused Change-Id: I37eb3559e9eb2808f1f29dfea429441c94c9df2a	2013-07-13 18:38:28 -07:00
Jingning Han	91365addf8	SSE2 8x8 inverse ADST/DCT transform This commit enables SSE2 implementation of 8x8 inverse ADST/DCT transform. The runtime goes from 1216 cycles -> 266 cycles. For bus_cif at 2000 kbps, the overall runtime reduces from 253707ms -> 248430ms, i.e., 2% speed-up at speed 0. Change-Id: Ib0372e17e9162d7b11a10d653b1c8be547c878fb	2013-07-12 21:03:16 -07:00
Jingning Han	dac5891a1a	Merge "SSE2 4x4 invserse ADST/DCT transform"	2013-07-11 14:17:23 -07:00
Johann	158c80cbb0	convolve8 optimizations for neon Independent horizontal and vertical implementations. Requires that blocks be built from 4x4 and [xy]_step_q4 == 16 6-10% improvement. CIF improved the least. Change-Id: I137f5ceae4440adc0960bf88e4453e55a618bcda	2013-07-11 11:08:19 -07:00
Jim Bankoski	5000cdf0ff	Merge "Wide loopfilter 16 pix at a time"	2013-07-11 06:44:02 -07:00
Jingning Han	49b6302044	SSE2 4x4 invserse ADST/DCT transform Enable SSE2 4x4 inverse ADST/DCT transform. The runtime goes from 292 cycles down to 89 cycles. Running bus_cif at 2000 kbps, the overall runtime of speed 0 goes from 301s to 295s (2% speed-up). Change-Id: I24098136e7fee7ab2fbf1c11755bdf2ca37f3628	2013-07-10 20:16:02 -07:00
Ronald S. Bultje	decead7336	Replace copy_memNxM functions with a generic copy/avg function. Change-Id: I3ce849452ed4f08527de9565a9914d5ee36170aa	2013-07-10 18:27:24 -07:00
John Koleszar	64f7a4d8cb	Wide loopfilter 16 pix at a time Where possible, do the 16 pixel wide filter while doing the horizontal filtering pass. The same approach can be taken for the mbloop_filter when that's implemented. Doing so on the vertical pass is a little more involved, but possible. Change-Id: I010cb505e623464247ae8f67fa25a0cdac091320	2013-07-10 16:32:44 -07:00
Ronald S. Bultje	3f210f10eb	Remove unused iwalsh4x4 MMX/SSE2 functions. Change-Id: I2d22577911a37ed7d8c7e08cac20764842267652	2013-07-10 14:52:47 -07:00
Ronald S. Bultje	48c53233fd	Remove unused 16x3/3x16 sad SSE2 functions. Change-Id: I30a597c0cc366e34c9a3e2afe32d70e044f95ca4	2013-07-10 14:52:47 -07:00
Ronald S. Bultje	e6f955251f	Merge "SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction."	2013-07-10 14:52:23 -07:00
Ronald S. Bultje	6a60249071	Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction."	2013-07-10 14:52:19 -07:00
Ronald S. Bultje	44b29a769c	Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 V intra prediction."	2013-07-10 10:24:16 -07:00
Ronald S. Bultje	89810bfd71	Merge "SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 DC intra prediction."	2013-07-10 10:13:16 -07:00
Ronald S. Bultje	7fd643264a	SSSE3 assembly for 4x4/8x8/16x16/32x32 H intra prediction. Change-Id: Iad70966b986f65259329070e258f76ef0af816b4	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	8dade638a1	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 TM intra prediction. Change-Id: I3441c059214c2956e8261331bbf521525a617a86	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	75b33c68c7	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 V intra prediction. Change-Id: I55a6cfa2daba738cbc0c4a02f806893f7e556997	2013-07-10 09:28:03 -07:00
Ronald S. Bultje	92c5d3665d	SSE/SSE2 assembly for 4x4/8x8/16x16/32x32 DC intra prediction. Change-Id: Ibe1690afc5459f3b3beca401e7734fcd03da6dd0	2013-07-10 09:28:03 -07:00
Dmitry Kovalev	aeed28f143	Removing vp9_maskingmv.c and corresponding assembly file. Change-Id: I9842d02d61d78d17dc3449bae8ffbe60f4b3ecb3	2013-07-09 11:22:56 -07:00
Ronald S. Bultje	8350e7fe38	Make intra prediction pointers RTCD-based. This probably has a mildly negative impact on performance, but will (in future commits - or possibly merged with this one) allow SIMD implementations of individual intra prediction functions. We may perhaps want to consider having separate functions per txfm-size also (i.e. 4x4, 8x8, 16x16 and 32x32 intra prediction functions for each intra prediction mode), but I haven't played much with that yet. Change-Id: Ie739985eee0a3fcbb7aed29ee6910fdb653ea269	2013-07-08 17:25:51 -07:00


230 Commits