generic-library/vpx

Author	SHA1	Message	Date
Linfeng Zhang	a76b6b232c	Update load_input_data() in x86 Split to load_input_data4() and load_input_data8(). Use pack with signed saturation instruction for high bitdepth. Change-Id: Icda3e0129a6fdb4a51d1cafbdc652ae3a65f4e06	2017-06-26 13:38:33 -07:00
Linfeng Zhang	8253a27904	Add vpx_highbd_idct4x4_16_add_sse4_1() BUG=webm:1412 Change-Id: Ie33482409351a01be4e89466b0441834eb1e905a	2017-06-23 14:30:12 -07:00
Linfeng Zhang	b8a4b5dd8d	Cosmetics, 8x8 idct SSE2 optimization Change-Id: Id21fa94fd323e36cd19a2d890bf4a0cafb7d964d	2017-06-23 14:30:12 -07:00
James Zern	88a302e743	Merge changes from topic 'missing-proto' * changes: onyxd_int.h: add missing prototypes onyxd.h: add vp8dx_references_buffer prototype vp[89],vpx_dsp: add missing includes vp8,encodeframe.h: correct prototypes vp8: add temporal_filter.h add picklpf.h add ethreading.h vp8,bitstream.h: add missing prototypes vp8: remove vp8_fast_quantize_b_mmx vp8,loopfilter_filters: make some functions static vp9_ratectrl: make adjust_gf_boost_lag_one_pass_vbr static vp9_encodeframe: make scale_part_thresh_sumdiff static vp9_alt_ref_aq: correct vp9_alt_ref_aq_create proto tiny_ssim: make some functions static	2017-06-23 05:44:24 +00:00
Johann Koenig	794a5ad713	Merge "fdct32x32 neon implementation"	2017-06-23 01:58:00 +00:00
Johann	e67660cf37	fdct32x32 neon implementation Almost 3x faster in constrained loop testing. Over 10x faster in HBD builds. BUG=webm:1424 Change-Id: I2b7f8453e1d4ada63cde729d8115d684c4a71ff9	2017-06-22 06:40:17 -07:00
James Zern	44418c659f	vp[89],vpx_dsp: add missing includes quiets -Wmissing-prototypes Change-Id: I841cfc019d592f2bc6b3fec5818051a31f4c53b5	2017-06-21 19:00:15 -07:00
Linfeng Zhang	466b667ff3	Clean vpx_idct16x16_256_add_sse2() Remove macro IDCT16 which is redundant with idct16_8col(). Change-Id: I783c5f4fda038a22d5ee5c2b22e8c2cdfb38432c	2017-06-21 13:47:15 -07:00
Linfeng Zhang	42522ce0b7	Update vpx_idct{8x8,16x16,32x32}_1_add_sse2() Change-Id: I365f8e53d9ccd028cef0f561d4de9e5916278609	2017-06-21 13:47:05 -07:00
Linfeng Zhang	2b43a1ee18	Clean 32x32 full idct sse2 and ssse3 code vpx_idct32x32_1024_add_ssse3() is actually a sse2 function and faster than vpx_idct32x32_1024_add_sse2(). Replace the slow one. All are code relocations, no new code. Change-Id: I5dac0e98cc411a4ce05660406921118986638d19	2017-06-21 13:46:49 -07:00
Linfeng Zhang	c7e4917e97	Clean 8x8 idct x86 optimization Create load_buffer_8x8() and write_buffer_8x8(). Change-Id: Ib26dd515d734a5402971c91de336ab481b213fdf	2017-06-15 14:30:00 -07:00
Linfeng Zhang	98967645a1	Remove vpx_idct8x8_64_add_ssse3() It's almost identical with vpx_idct8x8_64_add_sse2(), except little difference in instructions order. Change-Id: Ie60dabc35eaa6ebae7c755e6cff00a710aad284f	2017-06-15 14:09:33 -07:00
Linfeng Zhang	6da6a23291	Update high bitdepth load_input_data() in x86 BUG=webm:1412 Change-Id: Ibf9d120b80c7d3a7637e79e123cf2f0aae6dd78c	2017-06-13 16:53:53 -07:00
Linfeng Zhang	d6eeef9ee6	Clean array_transpose_{4X8,16x16,16x16_2) in x86 Change-Id: I341399ecbde37065375ea7e63511a26bfc285ea0	2017-06-13 16:50:44 -07:00
Linfeng Zhang	9c72e85e4c	Remove array_transpose_8x8() in x86 Duplicate of transpose_16bit_8x8() Change-Id: Iaa5dd63b5cccb044974a65af22c90e13418e311f	2017-06-13 16:50:44 -07:00
Linfeng Zhang	cbb991b6b8	Convert 8x8 idct x86 macros to inline functions Change-Id: Id59865fd6c453a24121ce7160048d67875fc67ce	2017-06-13 16:50:43 -07:00
Jerome Jiang	943f9ee25c	Merge "Merge skin detection code in vp8/9."	2017-06-08 16:36:00 +00:00
Johann Koenig	903375a48a	Merge "fdct16x16 neon optimization"	2017-06-08 15:19:36 +00:00
Jerome Jiang	658e854252	Merge skin detection code in vp8/9. BUG=webm:1438 Change-Id: Ie3dc034c7dbb498a0b088a767b1936ddeed4df14	2017-06-07 21:20:34 -07:00
Johann	eae7cf2368	fdct16x16 neon optimization Roughly 2x speedup. Since the only change for HBD is to store(), the improvement appears to hold there as well. BUG=webm:1424 Change-Id: I15b813d50deb2e47b49a6b0705945de748e83c19	2017-06-07 14:59:55 -07:00
James Zern	ff42e04f9c	Merge "ppc: Add vpx_sadnxmx4d_vsx for n,m = {8, 16, 32 ,64}"	2017-06-06 23:52:39 +00:00
James Zern	4753c23983	Merge "ppc: Add vpx_sad64/32/16x64/32/16_avg_vsx"	2017-06-06 02:19:41 +00:00
Johann Koenig	755b3daf90	Merge "comp_avg_pred neon: used by sub pixel avg variance"	2017-05-31 18:17:28 +00:00
Linfeng Zhang	30ea3ef283	Merge "Update vpx_highbd_idct4x4_16_add_sse2()"	2017-05-31 15:56:20 +00:00
Johann	f695b30ac2	comp_avg_pred neon: used by sub pixel avg variance BUG=webm:1423 Change-Id: I33de537f238f58f89b7a6c1c2d6e8110de4b8804	2017-05-30 22:47:34 +00:00
Linfeng Zhang	45048dc9dc	Update vpx_highbd_idct4x4_16_add_sse2() BUG=webm:1412 Change-Id: I26e4b34ae9bc1ae80c24f56d740d737a95f1ab84	2017-05-30 09:25:30 -07:00
Johann Koenig	b9649d2407	Merge "comp_avg_pred: alignment"	2017-05-30 16:21:05 +00:00
Johann	ea8b4a450d	comp_avg_pred: alignment x86 requires 16 byte alignment for some vector loads/stores. arm does not have the same requirement. The asserts are still in avg_pred_sse2.c. This just removes them from the common code. Change-Id: Ic5175c607a94d2abf0b80d431c4e30c8a6f731b6	2017-05-30 07:46:43 -07:00
Johann	42ce25821d	remove DECLARE_ALIGNED from neon code Unlike x86 neon only requires type alignment when loading into vectors. Change-Id: I7bbbe4d51f78776e499ce137578d8c0effdbc02f	2017-05-26 10:41:57 -07:00
Johann	f3c97ed32e	subpel variance neon: reduce stack usage Unlike x86, arm does not impose additional alignment restrictions on vector loads. For incoming values to the first pass, it uses vld1_u32() which typically does impose a 4 byte alignment. However, as the first pass operates on user-supplied values we must prepare for unaligned values anyway (and have, see mem_neon.h). But for the local temporary values there is no stride and the load will use vld1_u8 which does not require 4 byte alignment. There are 3 temporary structures. In the C, one is uint16_t. The arm saturates between passes but still passes tests. If this becomes an issue new functions will be needed. Change-Id: I3c9d4701bfeb14b77c783d0164608e621bfecfb1	2017-05-24 13:28:13 -07:00
Johann	d204c4bf01	Use vdup instead of vmov Change-Id: Idb6248c1429b55176bb3e9f4e8365ea0ed2be62a	2017-05-24 11:38:15 -07:00
Johann Koenig	de1a9c77a7	Merge changes Iaab2b9a1,Idfb458d3 * changes: sub pel avg variance neon: 4x block sizes sub pel variance neon: 4x block sizes	2017-05-24 18:33:53 +00:00
Johann Koenig	b11a37f540	Merge changes I31fa6ef8,I228c6f29 * changes: sub pel avg variance neon: add neon optimizations sub pel variance neon: normalize variable names	2017-05-24 18:32:02 +00:00
Alexandra Hájková	8bf6eaf433	ppc: Add vpx_sadnxmx4d_vsx for n,m = {8, 16, 32 ,64} Change-Id: I547d0099e15591655eae954e3ce65fdf3b003123	2017-05-24 13:27:09 +00:00
Linfeng Zhang	6444958f62	Update inv_txfm_sse2.h and inv_txfm_sse2.c Extract shared code into inline functions. Change-Id: Iee1e5a4bc6396aeed0d301163095c9b21aa66b2f	2017-05-23 14:54:46 -07:00
Johann	f6fcd3410d	sub pel avg variance neon: 4x block sizes BUG=webm:1423 Change-Id: Iaab2b9a183fdb54aae5f717aba95d90dc36a9e3b	2017-05-22 14:40:05 -07:00
Johann	188d58eaa9	sub pel variance neon: 4x block sizes Add optimizations for blocks of width 4 BUG=webm:1423 Change-Id: Idfb458d36db3014d48fbfbe7f5462aa6eb249938	2017-05-22 14:40:01 -07:00
Johann	9b0d306a2f	sub pel avg variance neon: add neon optimizations These are missing an optimized version of vpx_comp_avg_pred BUG=webm:1423 Change-Id: I31fa6ef842e98f7ff3ea079ffed51ae33178e2ed	2017-05-22 13:58:43 -07:00
Johann	e0d294c3af	sub pel variance neon: normalize variable names match vpx_dsp/variance.c variable names Change-Id: I228c6f296c183af147b079b7c8bcdf97bd09cf3a	2017-05-22 13:58:43 -07:00
Linfeng Zhang	27beada6d0	Merge "Add vpx_highbd_idct{4x4,8x8,16x16}_1_add_sse2"	2017-05-22 20:58:18 +00:00
Johann	67ac68e399	variance neon: assert overflow conditions Change-Id: I12faca82d062eb33dc48dfeb39739b25112316cd	2017-05-22 11:25:06 -07:00
Linfeng Zhang	c167345ffb	Add vpx_highbd_idct{4x4,8x8,16x16}_1_add_sse2 BUG=webm:1412 Change-Id: Ia338a6057d36f9ed7eaa9cbd4dfbf0c3cbdc6468	2017-05-22 11:24:21 -07:00
Johann	d217c87139	neon variance: special case 4x The sub pixel variance uses a temp buffer which guarantees width == stride. Take advantage of this with the 4x and avoid the very costly lane loads. Change-Id: Ia0c97eb8c29dc8dfa6e51a29dff9b75b3c6726f1	2017-05-22 10:51:31 -07:00
Johann Koenig	e7cac13016	Merge changes Ib8dd96f7,Ie9854b77 * changes: neon variance: process 4x blocks use memcpy for unaligned neon stores	2017-05-22 17:48:33 +00:00
Johann Koenig	b5055002d7	Merge "neon 4 byte helper functions"	2017-05-19 17:11:30 +00:00
Johann Koenig	3c603eadb4	Merge "neon fdct: 4x4 implementation"	2017-05-19 17:08:58 +00:00
Johann	7b742da63e	neon variance: process 4x blocks Continue processing sets of 16 values. Plenty of improvement for 4x8 (doubles the speed) but only about 30% for 4x4. BUG=webm:1422 Change-Id: Ib8dd96f75d474f0348800271d11e58356b620905	2017-05-17 17:35:01 -07:00
Johann	2057d3ef75	use memcpy for unaligned neon stores Advise the compiler that the store is eventually going to a uint8_t buffer. This helps avoid getting alignment hints which would cause the memory access to fail. Originally added as a workaround for clang: https://bugs.llvm.org//show_bug.cgi?id=24421 Change-Id: Ie9854b777cfb2f4baaee66764f0e51dcb094d51e	2017-05-17 12:11:31 -07:00
Johann	105503b839	neon fdct: 4x4 implementation Approximately twice as fast as C implementation. BUG=webm:1424 Change-Id: I3c0307fb08ddc23df42545cd089a78e2ed5c9d3f	2017-05-17 07:38:18 -07:00
Linfeng Zhang	18e8baa5c0	Add transpose_32bit_4x4() and rename transpose_4x4() for vpx_dsp/x86 Change-Id: Ib57377f6cf6573c04720d3cc5dea4285362b4220	2017-05-16 17:46:37 -07:00

... 4 5 6 7 8 ...

995 Commits