generic-library/vpx

Author	SHA1	Message	Date
Shiyou Yin	d080c92524	Merge "vpx_dsp:loongson optimize vpx_mseWxH_c(case 16x16,16X8,8X16,8X8) with mmi."	2017-08-24 00:55:11 +00:00
Johann	7c27872164	quantize avx: copy implementation to intrinsics Adds an early exit based on ptest. Slightly slower than ssse3 in the full case because of the extra check, but potentially faster if lots of rows can be skipped. Very close in speed to the assembly. Can run in 32 bit, unlike the assembly. Allows reworking the function prototype to use structs. Change-Id: If80e2b9ba059370a4cad3c973196e82a97b4330e	2017-08-23 09:19:16 -07:00
Shiyou Yin	59e065b6ed	vpx_dsp:loongson optimize vpx_mseWxH_c(case 16x16,16X8,8X16,8X8) with mmi. Change-Id: I2c782d18d9004414ba61b77238e0caf3e022d8f2	2017-08-23 15:14:15 +08:00
Shiyou Yin	bff5aa9827	Merge "vpx_dsp:loongson optimize vpx_subtract_block_c (case 4x4,8x8,16x16) with mmi."	2017-08-22 00:37:23 +00:00
Shiyou Yin	7d82e57f5b	vpx_dsp:loongson optimize vpx_subtract_block_c (case 4x4,8x8,16x16) with mmi. Change-Id: Ia120ad1064d0b6106d9685cf075bdab373eef19e	2017-08-18 09:06:49 +08:00
Linfeng Zhang	d72e20b123	Add vpx_highbd_idct32x32_{34, 135, 1024}_add_{sse2, sse4_1} BUG=webm:1412 Change-Id: I08b562b60fa85fbc2fec1c15c323a3444b44618f	2017-08-14 17:05:22 -07:00
Johann Koenig	0b393ae505	Merge "quantize: copy ssse3 optimizations to intrinsics"	2017-08-10 15:42:20 +00:00
Johann	d52cb59729	quantize: copy ssse3 optimizations to intrinsics Fairly minor differences from sse2. pabsw and psignw are the big gains. Also re-uses some values in eob calculation to avoid an extra pcmp. Fixes test failures in HBD and OS X builds. Allows using it in 32bit builds, where it is about 40% faster than sse2. Substantially faster than the assembly for skip_block. 10-20% faster the rest of the time. Change-Id: If783bb3567e561e47667e10133b9c84414a334e2	2017-08-08 12:22:14 -07:00
Linfeng Zhang	7f20c3ac44	Add vpx_highbd_idct16x16_{10, 38, 256}_add_sse4_1 BUG=webm:1412 Change-Id: I8877c986b4042f7b8e33f5674c86700675a0e4ca	2017-08-04 15:31:17 -07:00
Scott LaVarnway	c42517568d	vpx_dsp: merge avx2 variance files BUG=webm:1404 Change-Id: Ieb8f85c3811b05df78722cb41eeb1166966ceec4	2017-08-04 07:49:30 -07:00
Johann	2d6b5df657	neon: vpx_quantize_b With skip block or coeff < zbin it is about twice as fast as C. If most coeff values are > zbin it is about 10-15x as fast as C. BUG=webm:1426 Change-Id: I5d3c007b014a372d5ef0882b39bb48983b4131c7	2017-07-31 10:38:46 -07:00
Johann	87610ac45e	neon: consolidate horizontal adds Change-Id: Iaf9e88ff636ccf8f0ef310869c6827f3f205cca8	2017-07-10 15:29:13 -07:00
Alexandra Hájková	d8c277030c	ppc: Add vpx_idct4x4_16_add_vsx Change-Id: Id2673eece32027fb245919c7a5c81994a4a19fd8	2017-07-01 12:32:18 -07:00
Linfeng Zhang	1e3a93e72e	Merge changes I5d038b4f,I9d00d1dd,I0722841d,I1f640db7 * changes: Add vpx_highbd_idct8x8_{12, 64}_add_sse4_1 sse2: Add transpose_32bit_4x4x2() and update transpose_32bit_4x4() Refactor highbd idct 4x4 sse4.1 code and add highbd_inv_txfm_sse4.h Refactor vpx_idct8x8_12_add_ssse3() and add inv_txfm_ssse3.h	2017-06-30 20:49:19 +00:00
Linfeng Zhang	c338f3635e	Add vpx_highbd_idct8x8_{12, 64}_add_sse4_1 BUG=webm:1412 Change-Id: I5d038b4fa842ce2f6b9bd5c8c44c70647bda9591	2017-06-29 17:19:34 -07:00
Linfeng Zhang	0fa59a4baf	Refactor highbd idct 4x4 sse4.1 code and add highbd_inv_txfm_sse4.h Also clean highbd_inv_txfm_sse2.h BUG=webm:1412 Change-Id: I0722841d824ce602874019bd9779b10d49d10c0b	2017-06-29 17:17:43 -07:00
Linfeng Zhang	9ac78ae35f	Refactor vpx_idct8x8_12_add_ssse3() and add inv_txfm_ssse3.h BUG=webm:1412 Change-Id: I1f640db71ad4c644b7521305a781f2218eb1ba9d	2017-06-29 17:13:28 -07:00
Johann	cf75ab6ccd	partial fdct neon: move 8x8_1 and enable hbd tests The function was originally written with HBD in mind. Enable it and configure the tests. BUG=webm:1424 Change-Id: I78a2eba8d4d9d59db98a344ba0840d4a60ebe9a1	2017-06-28 15:37:43 -07:00
Linfeng Zhang	8253a27904	Add vpx_highbd_idct4x4_16_add_sse4_1() BUG=webm:1412 Change-Id: Ie33482409351a01be4e89466b0441834eb1e905a	2017-06-23 14:30:12 -07:00
Johann	e67660cf37	fdct32x32 neon implementation Almost 3x faster in constrained loop testing. Over 10x faster in HBD builds. BUG=webm:1424 Change-Id: I2b7f8453e1d4ada63cde729d8115d684c4a71ff9	2017-06-22 06:40:17 -07:00
Jerome Jiang	943f9ee25c	Merge "Merge skin detection code in vp8/9."	2017-06-08 16:36:00 +00:00
Johann Koenig	903375a48a	Merge "fdct16x16 neon optimization"	2017-06-08 15:19:36 +00:00
Jerome Jiang	658e854252	Merge skin detection code in vp8/9. BUG=webm:1438 Change-Id: Ie3dc034c7dbb498a0b088a767b1936ddeed4df14	2017-06-07 21:20:34 -07:00
Johann	eae7cf2368	fdct16x16 neon optimization Roughly 2x speedup. Since the only change for HBD is to store(), the improvement appears to hold there as well. BUG=webm:1424 Change-Id: I15b813d50deb2e47b49a6b0705945de748e83c19	2017-06-07 14:59:55 -07:00
Johann	f695b30ac2	comp_avg_pred neon: used by sub pixel avg variance BUG=webm:1423 Change-Id: I33de537f238f58f89b7a6c1c2d6e8110de4b8804	2017-05-30 22:47:34 +00:00
Johann	105503b839	neon fdct: 4x4 implementation Approximately twice as fast as C implementation. BUG=webm:1424 Change-Id: I3c0307fb08ddc23df42545cd089a78e2ed5c9d3f	2017-05-17 07:38:18 -07:00
Johann	1088b4f87c	move neon load/stores to a new file Move the tran_low_t helper functions to a new file. Additional load/store functions will be added here. Change-Id: I52bf652c344c585ea2f3e1230886be93f5caefc3	2017-05-15 08:29:43 -07:00
James Zern	ac8f58f6ab	Merge changes I1b54a7a5,I3028bdad,I59788cd9 * changes: ppc: Add get_mb_ss_vsx ppc: Add get4x4sse_cs_vsx ppc: Add comp_avg_pred_vsx	2017-05-12 15:24:59 +00:00
Luca Barbato	a7f8bd451b	ppc: Add comp_avg_pred_vsx Change-Id: I59788cd98231e707239c2ad95ae54f67cfe24e10	2017-05-12 17:22:55 +02:00
Alexandra Hájková	cc7f0c0f3e	ppc: Add vpx_sad16x8/16/32_vsx Change-Id: I60619d28fffd9809f93b1af510a50e1aa02519a9	2017-05-10 19:57:30 +00:00
Linfeng Zhang	2231669a83	Split dsp/x86/inv_txfm_sse2.c Spin out highbd idct functions. BUG=webm:1412 Change-Id: I0cfe4117c00039b6778c59c022eee79ad089a2af	2017-05-03 15:43:02 -07:00
Luca Barbato	63860ba7b8	ppc: Add convolve_copy Change-Id: Ie26d6dbe090e711d84bac01ba7da270db983f405	2017-04-29 15:47:25 +02:00
Luca Barbato	7ad1faa6f8	ppc: vertical intrapred 16x16 and 32x32 Change-Id: Ic80f3c050cfbe7697e81a311b4edaaa597b85cab	2017-04-19 01:48:09 +02:00
James Zern	4ba20da8b1	Merge "Add AVX2 optimization to copy/avg functions"	2017-04-15 00:26:08 +00:00
Yi Luo	aa5a941992	Add AVX2 optimization to copy/avg functions Change-Id: Ibcef70e4fead74e2c2909330a7044a29381a8074	2017-04-14 16:50:10 -07:00
Johann	28a8622143	vpx_comp_avg_pred: sse2 optimization Provides over 15x speedup for width > 8. Due to smaller loads and shifting for width == 8 it gets about 8x speedup. For width == 4 it's only about 4x speedup because there is a lot of shuffling and shifting to get the data properly situated. BUG=webm:1390 Change-Id: Ice0b3dbbf007be3d9509786a61e7f35e94bdffa8	2017-04-13 08:44:52 -07:00
Linfeng Zhang	27530d484e	Add vpx_highbd_idct32x32_1024_add_neon() BUG=webm:1301 Change-Id: Ib90af0c1712e56b301d0e981dbe9a641e15e36ca	2017-03-17 00:27:46 -07:00
Linfeng Zhang	50b13f75b8	Add vpx_highbd_idct32x32_34_add_neon() BUG=webm:1301 Change-Id: I74dd16c6c64e7bb71aa991cedccddf0663ef5e06	2017-03-17 00:27:46 -07:00
James Zern	2882778310	Merge "Add vpx_highbd_idct32x32_135_add_neon()"	2017-03-17 07:26:52 +00:00
Linfeng Zhang	65e9fb65e8	Add vpx_highbd_idct32x32_135_add_neon() BUG=webm:1301 Change-Id: I58c2d65d385080711c3666d6d8f9d241dac7b21a	2017-03-16 22:37:55 -07:00
Rafael de Lucena Valle	405b94c661	Add Hadamard for Power8 Change-Id: I3b4b043c1402b4100653ace4869847e030861b18 Signed-off-by: Rafael de Lucena Valle <rafaeldelucena@gmail.com>	2017-03-15 23:46:18 -03:00
Yi Luo	f62dcc9c33	Replace idct32x32_1024_add_ssse3 assembly with intrinsics - Encoding/decoding test, BQTerrace_1920x1080_60.y4m, on i7-6700, no obvious user-level speed performance downgrade. - Passed unit tests. Change-Id: I20688e0dd3731021ec8fb4404734336f1a426bfc	2017-02-16 16:10:40 -08:00
Johann	2104454607	block error avx2: use tran_low_t Change-Id: Ic5f3a1f569d6f82afeaf4fcd7235374bb460db3c	2017-02-16 12:39:02 -08:00
Linfeng Zhang	016933ad48	Add vpx_highbd_idct{16x16,32x32}_1_add_neon() and update vpx_highbd_idct8x8_1_add_neon() BUG=webm:1301 Change-Id: I18d1a0cbe98ba822d5194c1b4e13a4c29c5c75f4	2017-02-13 10:25:22 -08:00
Johann	641fda79bb	highbd x86: consolidate tran_low_t conversions Create new helper files specifically for converting tran_low_t types. Change-Id: I7c4c458ef910f3b3d10a3cfbf9df4de7682fd905	2017-02-06 10:43:26 -08:00
Jingning Han	bb40844e32	Merge "Add SSSE3 intrinsic 8x8 inverse 2D-DCT"	2017-02-02 22:18:32 +00:00
Kaustubh Raste	5b10674b5c	Merge "Add mips msa sum_squares_2d_i16 function"	2017-02-02 08:09:21 +00:00
Jingning Han	8f95389742	Add SSSE3 intrinsic 8x8 inverse 2D-DCT The intrinsic version reduces the average cycles from 183 to 175. Change-Id: I7c1bcdb0a830266e93d8347aed38120fb3be0e03	2017-02-01 14:47:53 -08:00
Kaustubh Raste	750e753134	Add mips msa sum_squares_2d_i16 function average improvement ~4x-5x Change-Id: I8d91b71d0677009be52b412e4f52b40b98573a53	2017-01-31 12:22:43 +00:00
Johann	13234d3c43	Remove neon assembly for idct 16x16 and 8x8 Tested using test/partial_idct_test.cc:DISABLED_Speed. Both gcc 4.9 and clang 3.8 from the r13 Android NDK offer improvements using the intrinsics (times listed as clang asm / gcc asm / clang intrin / gcc intrin): idct16x16_256 1720ms / 1703ms / 1546ms / 1554ms; idct16x16_10 1320ms / 1247ms / 518ms / 488ms; idct16x16_1 107ms / 108ms / 64ms / 68ms; idct8x8_64 924ms / 931ms / 866ms / 989ms; idct8x8_12 826ms / 824ms / 519ms / 514ms; idct8x8_1 172ms / 166ms / 110ms / 125ms. idct8x8_64 isn't quite perfect (slight regression with gcc intrinsics) but as a counter example idct16x16_10 goes from ~1300ms to ~500ms. On a sample clip, clang improved from 48.5 to 49fps and gcc stayed roughly stable. BUG=webm:1303 Change-Id: I9d4fd2b41b46ea6174a887b40a82c8e6e4769ed4	2017-01-19 12:27:31 -08:00

154 Commits