generic-library/vpx

Author	SHA1	Message	Date
Johann	8152b0904d	sad4d neon: 8x[4,8,16] BUG=webm:1425 Change-Id: I7de2500cca4b621f21478c4b0333c56d76dbc9a4	2017-07-12 13:25:03 +00:00
Johann	dd4347e9ec	sad4d neon: 4x4, 4x8 BUG=webm:1425 Change-Id: I5081b5ce131821d590c53ac1206a94f50cb8b468	2017-07-12 03:38:03 +00:00
Johann	66a96fd3de	avg_neon: fix 4x4, update 8x8 4x4 was failing with a bus error. Most likely due to clang alignment hints on 32bit loads. Change-Id: Ib191ce0e6239fc55d85f10e4dbe15876e5052edb	2017-07-10 15:29:34 -07:00
Johann	87610ac45e	neon: consolidate horizontal adds Change-Id: Iaf9e88ff636ccf8f0ef310869c6827f3f205cca8	2017-07-10 15:29:13 -07:00
Johann Koenig	4b78c6e6f7	Merge "remove vp9_full_sad_search"	2017-07-10 20:42:40 +00:00
Johann	109faffe9b	remove vp9_full_sad_search This code is unused in vp9. Only vp8 still contains references to vpx_sad_NxMx[3|8] and only for sizes 16x16, 16x8, 8x16, 8x8 and 4x4. Remove the remaining sizes and all the highbitdepth versions. BUG=webm:1425 Change-Id: If6a253977c8e0c04599e25cbeb45f71a94f563e8	2017-07-10 11:20:35 -07:00
Johann Koenig	4e16f70703	Merge changes Id84d9780,Iaa6ea75b,I3362e0dd,I0020a49e,Ia42e4f36, ... * changes: sad neon: avg for 64x[32,64] sad neon: macroize 64xN definitions sad neon: avg for 32x[16,32,64] sad neon: macroize 32xN definitions sad neon: avg for 16x[8,16,32] sad neon: macroize 16xN definitions	2017-07-07 21:01:23 +00:00
Johann Koenig	6c375b9cd0	Merge "fdct neon: 32x32_rd"	2017-07-07 14:05:51 +00:00
Johann	e4e08556db	sad neon: avg for 64x[32,64] BUG=webm:1425 Change-Id: Id84d97807a6a0fbcc889c4dfe11929d54f85493d	2017-07-07 07:04:04 -07:00
Johann	6ae8f8dbe8	sad neon: macroize 64xN definitions Change-Id: Iaa6ea75b10e75784f31b1e08637eecf0dcb5cff9	2017-07-07 07:04:04 -07:00
Johann	67cffc1ef6	sad neon: avg for 32x[16,32,64] BUG=webm:1425 Change-Id: I3362e0dded3b46ca032caa7f44db42f324bc596d	2017-07-07 07:04:04 -07:00
Johann	b0d15713be	sad neon: macroize 32xN definitions Change-Id: I0020a49e77d27514375a03095d5821dc0aa7d128	2017-07-07 07:04:04 -07:00
Johann	527e0c9b1c	sad neon: avg for 16x[8,16,32] BUG=webm:1425 Change-Id: Ia42e4f36547c5fe12114fb58379e34bce82eb2f2	2017-07-07 07:04:04 -07:00
Johann	3c18acf452	sad neon: macroize 16xN definitions Change-Id: I5aea6ffbfa48eb1970afe3be54f0bba275d7fa58	2017-07-07 07:04:04 -07:00
Johann	d6423b3166	sad neon: macroize 8xN definitions Change-Id: I7b36a57e893c1795a37ba7994995bec7ff021409	2017-07-06 07:51:59 -07:00
Johann	63bdc574e5	sad neon: avg for 8x[4,8,16] BUG=webm:1425 Change-Id: If2ab51e3050e078b0011b174efe41fcb65a15f44	2017-07-06 07:43:09 -07:00
Johann	6bac3f80ee	sad neon: avg for 4x4 and 4x8 BUG=webm:1425 Change-Id: Ifc685a96cb34f7fd9243b4c674027480564b84fb	2017-07-06 07:12:47 -07:00
Johann	75b00592c7	fdct neon: 32x32_rd About 40% faster than the non-rd version. BUG=webm:1424 Change-Id: Ia99d14eb9532302eeaab8cd3e503395b0374b5a2	2017-07-06 06:30:50 -07:00
James Zern	a6531cbc54	Merge changes from topic 'missing-proto' * changes: fwd_txfm_msa.c: add missing vpx_dsp_rtcd.h vpx_convolve_*_msa.c: add missing vpx_dsp_rtcd.h loopfilter_*_msa.c: add missing vpx_dsp_rtcd.h	2017-07-05 20:00:25 +00:00
Johann Koenig	b6321025cd	Merge "partial fdct neon: maintain neon registers"	2017-07-05 19:12:38 +00:00
James Zern	fb135ff050	Merge changes I4ed1312f,Id2673eec * changes: ppc: Add vpx_idct8x8_64_add_vsx ppc: Add vpx_idct4x4_16_add_vsx	2017-07-02 02:38:39 +00:00
Alexandra Hájková	c757d6dde4	ppc: Add vpx_idct8x8_64_add_vsx Change-Id: I4ed1312f365509e0595dcc09890ecb050f6f2069	2017-07-01 12:55:47 -07:00
Alexandra Hájková	d8c277030c	ppc: Add vpx_idct4x4_16_add_vsx Change-Id: Id2673eece32027fb245919c7a5c81994a4a19fd8	2017-07-01 12:32:18 -07:00
James Zern	3dd993e4be	highbd_idct8x8_add_sse4: make << of neg. val a multiply left shifting a negative value is undefined; quiets a ubsan warning. this is applied to a constant, no change in the generated code. Change-Id: Ia17a7672d4832463decbc4afd6cd42974d02698e	2017-07-01 11:56:56 -07:00
Johann	3ae458f2f3	partial fdct neon: maintain neon registers Finish the calculations in neon registers. This avoids a potentially expensive move from neon to gp and allows at least clang to store directly to memory. BUG=webm:1424 Change-Id: Idef25eec95f7610947167818e9194bde8b00d282	2017-07-01 09:29:38 -07:00
James Zern	a876d04072	fwd_txfm_msa.c: add missing vpx_dsp_rtcd.h + only expose compatible functions in high-bitdepth build quiets -Wmissing-prototypes warnings Change-Id: I8ef7db08a34c5c54b5cde6e732c0d70f4287c89a	2017-06-30 18:53:30 -07:00
James Zern	8710c6d884	vpx_convolve_*_msa.c: add missing vpx_dsp_rtcd.h quiets -Wmissing-prototypes warnings Change-Id: I1ab5b8ae4a62f54e0f9eb3fc81371c9b99972c30	2017-06-30 18:50:56 -07:00
James Zern	329dabf57e	loopfilter_*_msa.c: add missing vpx_dsp_rtcd.h + make some functions static quiets -Wmissing-prototypes warnings Change-Id: I2130e06142e71a004a1eb30e173feba4f6fe68a0	2017-06-30 18:50:52 -07:00
James Zern	27e37e1a8a	fwd_txfm_msa.c: correct vpx_fdct8x8_1_msa prototype this makes the function compatible with high-bitdepth and fixes test failures since: `5ac88162b` partial fdct test Change-Id: Ib630694608237f0c515948942e05dbea259ba338	2017-06-30 18:50:47 -07:00
Linfeng Zhang	1e3a93e72e	Merge changes I5d038b4f,I9d00d1dd,I0722841d,I1f640db7 * changes: Add vpx_highbd_idct8x8_{12, 64}_add_sse4_1 sse2: Add transpose_32bit_4x4x2() and update transpose_32bit_4x4() Refactor highbd idct 4x4 sse4.1 code and add highbd_inv_txfm_sse4.h Refactor vpx_idct8x8_12_add_ssse3() and add inv_txfm_ssse3.h	2017-06-30 20:49:19 +00:00
Johann Koenig	89d3dc043e	Merge changes Id5beb35d,I2945fe54,Ib0f3cfd6,I78a2eba8 * changes: partial fdct neon: add 32x32_1 partial fdct neon: add 16x16_1 partial fdct neon: add 4x4_1 partial fdct neon: move 8x8_1 and enable hbd tests	2017-06-30 01:00:07 +00:00
Linfeng Zhang	c338f3635e	Add vpx_highbd_idct8x8_{12, 64}_add_sse4_1 BUG=webm:1412 Change-Id: I5d038b4fa842ce2f6b9bd5c8c44c70647bda9591	2017-06-29 17:19:34 -07:00
Linfeng Zhang	ee5cb8d87f	sse2: Add transpose_32bit_4x4x2() and update transpose_32bit_4x4() BUG=webm:1412 Change-Id: I9d00d1ddbd724fd5f825fd974c4cf46a9bca6cb3	2017-06-29 17:18:01 -07:00
Linfeng Zhang	0fa59a4baf	Refactor highbd idct 4x4 sse4.1 code and add highbd_inv_txfm_sse4.h Also clean highbd_inv_txfm_sse2.h BUG=webm:1412 Change-Id: I0722841d824ce602874019bd9779b10d49d10c0b	2017-06-29 17:17:43 -07:00
Linfeng Zhang	9ac78ae35f	Refactor vpx_idct8x8_12_add_ssse3() and add inv_txfm_ssse3.h BUG=webm:1412 Change-Id: I1f640db71ad4c644b7521305a781f2218eb1ba9d	2017-06-29 17:13:28 -07:00
James Zern	bd77931421	dct_partial_test,fwd_txfm: change << to * left shift of a negative number is undefined in C; quiets a ubsan warning Change-Id: Ib1624ad5326ac8e0eead9348468ef7fe5d4df9a4	2017-06-29 14:42:03 -07:00
Johann	9fe510c12a	partial fdct neon: add 32x32_1 Always return an int32_t. Since it needs to be moved to a register for shifting, this doesn't really penalize the smaller transforms. The values could potentially be summed and shifted in place. BUG=webm:1424 Change-Id: Id5beb35d79c7574ebd99285fc4182788cf2bb972	2017-06-28 15:37:44 -07:00
Johann	f310ddc470	partial fdct neon: add 16x16_1 For the 8x8_1, the highbd output fit nicely in the existing function. 12 bit input will overflow this implementation of 16x16_1. BUG=webm:1424 Change-Id: I2945fe5478b18f996f1a5de80110fa30f3f4e7ec	2017-06-28 15:37:44 -07:00
Johann	4959dd3eb3	partial fdct neon: add 4x4_1 BUG=webm:1424 Change-Id: Ib0f3cfd6116fc1f5a99acb8bfd76e25b90177ffc	2017-06-28 15:37:44 -07:00
Johann	cf75ab6ccd	partial fdct neon: move 8x8_1 and enable hbd tests The function was originally written with HBD in mind. Enable it and configure the tests. BUG=webm:1424 Change-Id: I78a2eba8d4d9d59db98a344ba0840d4a60ebe9a1	2017-06-28 15:37:43 -07:00
Johann Koenig	81e25512c3	Merge changes Ib454762d,I966650df,Ie126553e,I068f06c6,Icb72a94e * changes: sad neon: rewrite 64x64 and add 64x32 sad neon: rewrite 32x32, add 32x16 and 32x64 sad neon: rewrite 16x8, 16x16, add 16x32 sad neon: rewrite 8x8 and 8x16 sad neon: rewrite 4x4 and add 4x8	2017-06-28 22:37:00 +00:00
Johann Koenig	35f8515c3f	Merge "partial fdct test"	2017-06-28 22:34:53 +00:00
Johann	5ac88162b9	partial fdct test Test the _1 variant of the fdct, which simply sums the block and applies a modifying shift based on the block size. BUG=webm:1424 Change-Id: Ic80d6008abba0c596b575fa0484d5b5855321468	2017-06-28 20:32:20 +00:00
Johann	ad011aaab8	sad neon: rewrite 64x64 and add 64x32 BUG=webm:1425 Change-Id: Ib454762d1c61b05a98324fe81ad58c9e09784717	2017-06-28 12:21:34 -07:00
Johann	77a648885c	sad neon: rewrite 32x32, add 32x16 and 32x64 BUG=webm:1425 Change-Id: I966650df7e3face93e1e771634d1cc5458a35f85	2017-06-28 12:20:27 -07:00
Johann	469643757f	sad neon: rewrite 16x8, 16x16, add 16x32 BUG=webm:1425 Change-Id: Ie126553e5fffcdfaf3d82a85b368ac10ce9ab082	2017-06-28 12:16:00 -07:00
Johann	e40e78be24	sad neon: rewrite 8x8 and 8x16 BUG=webm:1425 Change-Id: I068f06c67b841f09ea07c04ada0c2f1706102138	2017-06-28 12:15:57 -07:00
Johann	46d8660ce3	sad neon: rewrite 4x4 and add 4x8 The previous implementation loaded 8 values (discarding half) BUG=webm:1425 Change-Id: Icb72a94e2557a4ee2db7091266ab58fd92f72158	2017-06-28 11:14:59 -07:00
Linfeng Zhang	0bb31a46a4	Update vpx_idct8x8_12_add_ssse3() Change-Id: I0f38801c391db87ddae168602a786a062cd34b1d	2017-06-26 14:57:41 -07:00
Linfeng Zhang	a76b6b232c	Update load_input_data() in x86 Split to load_input_data4() and load_input_data8(). Use pack with signed saturation instruction for high bitdepth. Change-Id: Icda3e0129a6fdb4a51d1cafbdc652ae3a65f4e06	2017-06-26 13:38:33 -07:00
Linfeng Zhang	8253a27904	Add vpx_highbd_idct4x4_16_add_sse4_1() BUG=webm:1412 Change-Id: Ie33482409351a01be4e89466b0441834eb1e905a	2017-06-23 14:30:12 -07:00
Linfeng Zhang	b8a4b5dd8d	Cosmetics, 8x8 idct SSE2 optimization Change-Id: Id21fa94fd323e36cd19a2d890bf4a0cafb7d964d	2017-06-23 14:30:12 -07:00
James Zern	88a302e743	Merge changes from topic 'missing-proto' * changes: onyxd_int.h: add missing prototypes onyxd.h: add vp8dx_references_buffer prototype vp[89],vpx_dsp: add missing includes vp8,encodeframe.h: correct prototypes vp8: add temporal_filter.h add picklpf.h add ethreading.h vp8,bitstream.h: add missing prototypes vp8: remove vp8_fast_quantize_b_mmx vp8,loopfilter_filters: make some functions static vp9_ratectrl: make adjust_gf_boost_lag_one_pass_vbr static vp9_encodeframe: make scale_part_thresh_sumdiff static vp9_alt_ref_aq: correct vp9_alt_ref_aq_create proto tiny_ssim: make some functions static	2017-06-23 05:44:24 +00:00
Johann Koenig	794a5ad713	Merge "fdct32x32 neon implementation"	2017-06-23 01:58:00 +00:00
Johann	e67660cf37	fdct32x32 neon implementation Almost 3x faster in constrained loop testing. Over 10x faster in HBD builds. BUG=webm:1424 Change-Id: I2b7f8453e1d4ada63cde729d8115d684c4a71ff9	2017-06-22 06:40:17 -07:00
James Zern	44418c659f	vp[89],vpx_dsp: add missing includes quiets -Wmissing-prototypes Change-Id: I841cfc019d592f2bc6b3fec5818051a31f4c53b5	2017-06-21 19:00:15 -07:00
Linfeng Zhang	466b667ff3	Clean vpx_idct16x16_256_add_sse2() Remove macro IDCT16 which is redundant with idct16_8col(). Change-Id: I783c5f4fda038a22d5ee5c2b22e8c2cdfb38432c	2017-06-21 13:47:15 -07:00
Linfeng Zhang	42522ce0b7	Update vpx_idct{8x8,16x16,32x32}_1_add_sse2() Change-Id: I365f8e53d9ccd028cef0f561d4de9e5916278609	2017-06-21 13:47:05 -07:00
Linfeng Zhang	2b43a1ee18	Clean 32x32 full idct sse2 and ssse3 code vpx_idct32x32_1024_add_ssse3() is actually a sse2 function and faster than vpx_idct32x32_1024_add_sse2(). Replace the slow one. All are code relocations, no new code. Change-Id: I5dac0e98cc411a4ce05660406921118986638d19	2017-06-21 13:46:49 -07:00
Linfeng Zhang	c7e4917e97	Clean 8x8 idct x86 optimization Create load_buffer_8x8() and write_buffer_8x8(). Change-Id: Ib26dd515d734a5402971c91de336ab481b213fdf	2017-06-15 14:30:00 -07:00
Linfeng Zhang	98967645a1	Remove vpx_idct8x8_64_add_ssse3() It's almost identical with vpx_idct8x8_64_add_sse2(), except little difference in instructions order. Change-Id: Ie60dabc35eaa6ebae7c755e6cff00a710aad284f	2017-06-15 14:09:33 -07:00
Linfeng Zhang	6da6a23291	Update high bitdepth load_input_data() in x86 BUG=webm:1412 Change-Id: Ibf9d120b80c7d3a7637e79e123cf2f0aae6dd78c	2017-06-13 16:53:53 -07:00
Linfeng Zhang	d6eeef9ee6	Clean array_transpose_{4X8,16x16,16x16_2} in x86 Change-Id: I341399ecbde37065375ea7e63511a26bfc285ea0	2017-06-13 16:50:44 -07:00
Linfeng Zhang	9c72e85e4c	Remove array_transpose_8x8() in x86 Duplicate of transpose_16bit_8x8() Change-Id: Iaa5dd63b5cccb044974a65af22c90e13418e311f	2017-06-13 16:50:44 -07:00
Linfeng Zhang	cbb991b6b8	Convert 8x8 idct x86 macros to inline functions Change-Id: Id59865fd6c453a24121ce7160048d67875fc67ce	2017-06-13 16:50:43 -07:00
Jerome Jiang	943f9ee25c	Merge "Merge skin detection code in vp8/9."	2017-06-08 16:36:00 +00:00
Johann Koenig	903375a48a	Merge "fdct16x16 neon optimization"	2017-06-08 15:19:36 +00:00
Jerome Jiang	658e854252	Merge skin detection code in vp8/9. BUG=webm:1438 Change-Id: Ie3dc034c7dbb498a0b088a767b1936ddeed4df14	2017-06-07 21:20:34 -07:00
Johann	eae7cf2368	fdct16x16 neon optimization Roughly 2x speedup. Since the only change for HBD is to store(), the improvement appears to hold there as well. BUG=webm:1424 Change-Id: I15b813d50deb2e47b49a6b0705945de748e83c19	2017-06-07 14:59:55 -07:00
James Zern	ff42e04f9c	Merge "ppc: Add vpx_sadnxmx4d_vsx for n,m = {8, 16, 32 ,64}"	2017-06-06 23:52:39 +00:00
James Zern	4753c23983	Merge "ppc: Add vpx_sad64/32/16x64/32/16_avg_vsx"	2017-06-06 02:19:41 +00:00
Johann Koenig	755b3daf90	Merge "comp_avg_pred neon: used by sub pixel avg variance"	2017-05-31 18:17:28 +00:00
Linfeng Zhang	30ea3ef283	Merge "Update vpx_highbd_idct4x4_16_add_sse2()"	2017-05-31 15:56:20 +00:00
Johann	f695b30ac2	comp_avg_pred neon: used by sub pixel avg variance BUG=webm:1423 Change-Id: I33de537f238f58f89b7a6c1c2d6e8110de4b8804	2017-05-30 22:47:34 +00:00
Linfeng Zhang	45048dc9dc	Update vpx_highbd_idct4x4_16_add_sse2() BUG=webm:1412 Change-Id: I26e4b34ae9bc1ae80c24f56d740d737a95f1ab84	2017-05-30 09:25:30 -07:00
Johann Koenig	b9649d2407	Merge "comp_avg_pred: alignment"	2017-05-30 16:21:05 +00:00
Johann	ea8b4a450d	comp_avg_pred: alignment x86 requires 16 byte alignment for some vector loads/stores. arm does not have the same requirement. The asserts are still in avg_pred_sse2.c. This just removes them from the common code. Change-Id: Ic5175c607a94d2abf0b80d431c4e30c8a6f731b6	2017-05-30 07:46:43 -07:00
Johann	42ce25821d	remove DECLARE_ALIGNED from neon code Unlike x86 neon only requires type alignment when loading into vectors. Change-Id: I7bbbe4d51f78776e499ce137578d8c0effdbc02f	2017-05-26 10:41:57 -07:00
Johann	f3c97ed32e	subpel variance neon: reduce stack usage Unlike x86, arm does not impose additional alignment restrictions on vector loads. For incoming values to the first pass, it uses vld1_u32() which typically does impose a 4 byte alignment. However, as the first pass operates on user-supplied values we must prepare for unaligned values anyway (and have, see mem_neon.h). But for the local temporary values there is no stride and the load will use vld1_u8 which does not require 4 byte alignment. There are 3 temporary structures. In the C, one is uint16_t. The arm saturates between passes but still passes tests. If this becomes an issue new functions will be needed. Change-Id: I3c9d4701bfeb14b77c783d0164608e621bfecfb1	2017-05-24 13:28:13 -07:00
Johann	d204c4bf01	Use vdup instead of vmov Change-Id: Idb6248c1429b55176bb3e9f4e8365ea0ed2be62a	2017-05-24 11:38:15 -07:00
Johann Koenig	de1a9c77a7	Merge changes Iaab2b9a1,Idfb458d3 * changes: sub pel avg variance neon: 4x block sizes sub pel variance neon: 4x block sizes	2017-05-24 18:33:53 +00:00
Johann Koenig	b11a37f540	Merge changes I31fa6ef8,I228c6f29 * changes: sub pel avg variance neon: add neon optimizations sub pel variance neon: normalize variable names	2017-05-24 18:32:02 +00:00
Alexandra Hájková	8bf6eaf433	ppc: Add vpx_sadnxmx4d_vsx for n,m = {8, 16, 32 ,64} Change-Id: I547d0099e15591655eae954e3ce65fdf3b003123	2017-05-24 13:27:09 +00:00
Linfeng Zhang	6444958f62	Update inv_txfm_sse2.h and inv_txfm_sse2.c Extract shared code into inline functions. Change-Id: Iee1e5a4bc6396aeed0d301163095c9b21aa66b2f	2017-05-23 14:54:46 -07:00
Johann	f6fcd3410d	sub pel avg variance neon: 4x block sizes BUG=webm:1423 Change-Id: Iaab2b9a183fdb54aae5f717aba95d90dc36a9e3b	2017-05-22 14:40:05 -07:00
Johann	188d58eaa9	sub pel variance neon: 4x block sizes Add optimizations for blocks of width 4 BUG=webm:1423 Change-Id: Idfb458d36db3014d48fbfbe7f5462aa6eb249938	2017-05-22 14:40:01 -07:00
Johann	9b0d306a2f	sub pel avg variance neon: add neon optimizations These are missing an optimized version of vpx_comp_avg_pred BUG=webm:1423 Change-Id: I31fa6ef842e98f7ff3ea079ffed51ae33178e2ed	2017-05-22 13:58:43 -07:00
Johann	e0d294c3af	sub pel variance neon: normalize variable names match vpx_dsp/variance.c variable names Change-Id: I228c6f296c183af147b079b7c8bcdf97bd09cf3a	2017-05-22 13:58:43 -07:00
Linfeng Zhang	27beada6d0	Merge "Add vpx_highbd_idct{4x4,8x8,16x16}_1_add_sse2"	2017-05-22 20:58:18 +00:00
Johann	67ac68e399	variance neon: assert overflow conditions Change-Id: I12faca82d062eb33dc48dfeb39739b25112316cd	2017-05-22 11:25:06 -07:00
Linfeng Zhang	c167345ffb	Add vpx_highbd_idct{4x4,8x8,16x16}_1_add_sse2 BUG=webm:1412 Change-Id: Ia338a6057d36f9ed7eaa9cbd4dfbf0c3cbdc6468	2017-05-22 11:24:21 -07:00
Johann	d217c87139	neon variance: special case 4x The sub pixel variance uses a temp buffer which guarantees width == stride. Take advantage of this with the 4x and avoid the very costly lane loads. Change-Id: Ia0c97eb8c29dc8dfa6e51a29dff9b75b3c6726f1	2017-05-22 10:51:31 -07:00
Johann Koenig	e7cac13016	Merge changes Ib8dd96f7,Ie9854b77 * changes: neon variance: process 4x blocks use memcpy for unaligned neon stores	2017-05-22 17:48:33 +00:00
Johann Koenig	b5055002d7	Merge "neon 4 byte helper functions"	2017-05-19 17:11:30 +00:00
Johann Koenig	3c603eadb4	Merge "neon fdct: 4x4 implementation"	2017-05-19 17:08:58 +00:00
Johann	7b742da63e	neon variance: process 4x blocks Continue processing sets of 16 values. Plenty of improvement for 4x8 (doubles the speed) but only about 30% for 4x4. BUG=webm:1422 Change-Id: Ib8dd96f75d474f0348800271d11e58356b620905	2017-05-17 17:35:01 -07:00
Johann	2057d3ef75	use memcpy for unaligned neon stores Advise the compiler that the store is eventually going to a uint8_t buffer. This helps avoid getting alignment hints which would cause the memory access to fail. Originally added as a workaround for clang: https://bugs.llvm.org//show_bug.cgi?id=24421 Change-Id: Ie9854b777cfb2f4baaee66764f0e51dcb094d51e	2017-05-17 12:11:31 -07:00
Johann	105503b839	neon fdct: 4x4 implementation Approximately twice as fast as C implementation. BUG=webm:1424 Change-Id: I3c0307fb08ddc23df42545cd089a78e2ed5c9d3f	2017-05-17 07:38:18 -07:00
Linfeng Zhang	18e8baa5c0	Add transpose_32bit_4x4() and rename transpose_4x4() for vpx_dsp/x86 Change-Id: Ib57377f6cf6573c04720d3cc5dea4285362b4220	2017-05-16 17:46:37 -07:00
Johann Koenig	2300e16675	Revert "Add visibility="protected" attribute for global variables referenced in asm files." This reverts commit `0d88e15454`. Reason for revert: chromium builds are failing to locate vpx_rv during dlopen() dlopen failed: cannot locate symbol "vpx_rv" referenced by "libstandalonelibwebviewchromium.so" Original change's description: > Add visibility="protected" attribute for global variables referenced in asm files. > > During aosp builds with binutils-2.27, we're seeing linker error > messages of this form: > libvpx.a(subpixel_mmx.o): relocation R_386_GOTOFF against preemptible > symbol vp8_bilinear_filters_x86_8 cannot be used when making a shared > object > > subpixel_mmx.o is assembled from "vp8/common/x86/subpixel_mmx.asm". > Other messages refer to symbol references from deblock_sse2.o and > subpixel_sse2.o, also assembled from asm files. > > This change marks such symbols as having "protected" visibility. This > satisfies the linker as the symbols are not preemptible from outside > the shared library now, which I think is the original intent anyway. > > Change-Id: I2817f7a5f43041533d65ebf41aefd63f8581a452 > TBR=jzern@google.com,johannkoenig@google.com,rahulchaudhry@chromium.org,builds@webmproject.org Change-Id: I0c2ea375aa7ef5fda15b9d9e23e654bb315c941b	2017-05-16 15:54:33 -07:00
Johann	7498fe2e54	neon 4 byte helper functions When data is guaranteed to be aligned, use helper functions which assert that requirement. Change-Id: Ic4b188593aea0799d5bd8eda64f9858a1592a2a3	2017-05-15 13:42:31 -07:00
Johann	1088b4f87c	move neon load/stores to a new file Move the tran_low_t helper functions to a new file. Additional load/store functions will be added here. Change-Id: I52bf652c344c585ea2f3e1230886be93f5caefc3	2017-05-15 08:29:43 -07:00
Alexandra Hájková	bcbc3929ae	ppc: Add vpx_sad64/32/16x64/32/16_avg_vsx Change-Id: Ic9639b1331d8c5cbc207c2a036891ff0137fc56f	2017-05-13 13:13:15 +00:00
Rahul Chaudhry	0d88e15454	Add visibility="protected" attribute for global variables referenced in asm files. During aosp builds with binutils-2.27, we're seeing linker error messages of this form: libvpx.a(subpixel_mmx.o): relocation R_386_GOTOFF against preemptible symbol vp8_bilinear_filters_x86_8 cannot be used when making a shared object subpixel_mmx.o is assembled from "vp8/common/x86/subpixel_mmx.asm". Other messages refer to symbol references from deblock_sse2.o and subpixel_sse2.o, also assembled from asm files. This change marks such symbols as having "protected" visibility. This satisfies the linker as the symbols are not preemptible from outside the shared library now, which I think is the original intent anyway. Change-Id: I2817f7a5f43041533d65ebf41aefd63f8581a452	2017-05-12 11:11:16 -07:00
James Zern	ac8f58f6ab	Merge changes I1b54a7a5,I3028bdad,I59788cd9 * changes: ppc: Add get_mb_ss_vsx ppc: Add get4x4sse_cs_vsx ppc: Add comp_avg_pred_vsx	2017-05-12 15:24:59 +00:00
Luca Barbato	143b21e362	ppc: Add get_mb_ss_vsx Change-Id: I1b54a7a5bb642e4b836d786ea1ae506eed025e3f	2017-05-12 17:23:00 +02:00
Luca Barbato	6d225eb5f9	ppc: Add get4x4sse_cs_vsx Change-Id: I3028bdadf653665d18e781d28e9625f62804b3d8	2017-05-12 17:23:00 +02:00
Luca Barbato	a7f8bd451b	ppc: Add comp_avg_pred_vsx Change-Id: I59788cd98231e707239c2ad95ae54f67cfe24e10	2017-05-12 17:22:55 +02:00
Alexandra Hájková	f48532e271	ppc: Add vpx_sad64x32/64_vsx Change-Id: I84e3705fa52f75cb91b2bab4abf5cc77585ee3e2	2017-05-12 16:10:16 +02:00
Alexandra Hájková	0b15bf1e54	ppc: Add vpx_sad32x16/32/64_vsx Change-Id: I3c4f9d595275669580413a71b3c3c810e7ddcacd	2017-05-12 16:10:11 +02:00
James Zern	a12ea1d5e9	Merge "ppc: Add vpx_sad16x8/16/32_vsx"	2017-05-12 13:33:51 +00:00
Alexandra Hájková	cc7f0c0f3e	ppc: Add vpx_sad16x8/16/32_vsx Change-Id: I60619d28fffd9809f93b1af510a50e1aa02519a9	2017-05-10 19:57:30 +00:00
Linfeng Zhang	764b3b8090	Update specializations of idct functions Introduced append situation in Commit `0178d97` which could be confusing. Clean a little bit and add some comments. Change-Id: I69ad336f805aca7ce9d45515b8cd237423fadbb2	2017-05-10 12:51:18 -07:00
Johann Koenig	d713ec3c46	Merge changes I92eb4312,Ibb2afe4e * changes: subpel variance neon: add mixed sizes sub pixel variance neon: use generic variance	2017-05-10 18:19:52 +00:00
Linfeng Zhang	f532504864	Clean 32x32 idct C code Change-Id: I73b8104a9e7a70ffe827c1b7ff43618f24f5d7bd	2017-05-09 11:05:51 -07:00
Linfeng Zhang	ecd1eb2162	Update 4x4 idct sse2 functions It's a bit faster to call idct4_sse2() in vpx_idct4x4_16_add_sse2() Change-Id: I1513be7a895cd2fc190f4a8297c240b17de0f876	2017-05-08 16:16:52 -07:00
Johann	f7d1486f48	neon variance: process 16 values at a time Read in a Q register. Works on blocks of 16 and larger. Improvement of about 20% for 64x64. The smaller blocks are faster, but don't have quite the same level of improvement. 16x32 is only about 5% BUG=webm:1422 Change-Id: Ie11a877c7b839e66690a48117a46657b2ac82d4b	2017-05-08 18:48:55 +00:00
Johann Koenig	1814463864	Merge changes Id602909a,Ib0e85608 * changes: neon variance: process two rows of 8 at a time neon variance: add small missing sizes	2017-05-08 17:34:20 +00:00
Linfeng Zhang	2c3a2ad6f1	Merge changes I0cfe4117,I3581d80d,Ida62c941 * changes: Split dsp/x86/inv_txfm_sse2.c Update highbd idct functions arguments to use uint16_t dst Clean CONVERT_TO_BYTEPTR/SHORTPTR in idct	2017-05-08 16:15:57 +00:00
Johann	2346a6da4a	subpel variance neon: add mixed sizes Add support for everything except block sizes of 4. Performance is better but numbers will improve again when the variance optimizations land. BUG=webm:1423 Change-Id: I92eb4312b20be423fa2fe6fdb18167a604ff4d80	2017-05-04 15:30:01 -07:00
Johann	19e1ec8359	sub pixel variance neon: use generic variance When a neon version is available it will be called. This allows decoupling the variance implementations and has no real downside. For most configurations, the call will be #define'd to the neon implementation. Change-Id: Ibb2afe4e156c5610e89488504d366b3e6d1ba712	2017-05-04 15:30:01 -07:00
Johann	462e29703c	fdct 8x8 neon: minor comment cleanup Simplify HBD/non distinction in test. Document why transpose_neon.h is not used Change-Id: I17659414206ddbb8c2f1ef0d9f4a17f1745d5a52	2017-05-04 15:14:23 -07:00
Johann	d6a7489dd5	neon variance: process two rows of 8 at a time When the width is equal to 8, process two rows at a time. This doubles the speed of 8x4 and improves 8x8 by about 20%. 8x16 was using this technique already, but still improved a little bit with the rewrite. Also use this for vpx_get8x8var_neon BUG=webm:1422 Change-Id: Id602909afcec683665536d11298b7387ac0a1207	2017-05-04 08:59:46 -07:00
Johann	cb9133c72f	neon variance: add small missing sizes Some of the mixed sizes were missing. They can be implemented trivially using the existing helper function. When comparing the previous 16x8 and 8x16 implementations, the helper function is about 10% faster than the 16x8 version. The 8x16 is very close, but the existing version appears to be faster. BUG=webm:1422 Change-Id: Ib0e856083c1893e1bd399373c5fbcd6271a7f004	2017-05-04 08:59:42 -07:00
Linfeng Zhang	2231669a83	Split dsp/x86/inv_txfm_sse2.c Spin out highbd idct functions. BUG=webm:1412 Change-Id: I0cfe4117c00039b6778c59c022eee79ad089a2af	2017-05-03 15:43:02 -07:00
Linfeng Zhang	d5de63d2be	Update highbd idct functions arguments to use uint16_t dst BUG=webm:1388 Change-Id: I3581d80d0389b99166e70987d38aba2db6c469d5	2017-05-03 13:59:16 -07:00
Linfeng Zhang	081b39f2b7	Clean CONVERT_TO_BYTEPTR/SHORTPTR in idct BUG=webm:1388 Change-Id: Ida62c941f2b836d6c9e27b427a7d5008ab6dc112	2017-05-03 13:58:31 -07:00
Yi Luo	a3452996a1	High bit depth inter prediction horizontal/vertical filters AVX2 User level speed improvement on i7-6700, cpu-used=1, x86_64 Linux, bitrate, 1080p, 8Mbps, 4K, 16Mbps: - Decoder: 1080p: ~4% 4K: ~5% - Encoder: 1080p: ~1% 4K: ~3% Change-Id: I51b48f9c5de0d62487d5a11aa579c97bd03dd640	2017-05-03 12:18:01 -07:00
Linfeng Zhang	a10a5cb356	Merge changes I8bb660de,Ica51d780,I6037525d * changes: Clean specializes of idct functions Clean add_protos of highbd idct functions Clean add_protos of idct functions	2017-05-03 19:17:55 +00:00
Luca Barbato	e2ad89092d	ppc: Add convolve8_vsx and convolve8_avg_vsx Change-Id: Ia5293d948003a7fff5a7cbad6e83d8a72717c857	2017-05-02 20:27:47 -07:00
Luca Barbato	e6ca81ee67	ppc: Add convolve8_avg_vert_vsx Only the generic one again, speedups for 8x8 and larger blocks to come later. Change-Id: I90d481d3a602d1e277ead8f3934eca126b86b72d	2017-05-02 20:27:42 -07:00
Luca Barbato	a65f1771ad	ppc: Add convolve8_vert Only the generic one again, speedups for 8x8 and larger blocks to come later. Change-Id: Ia509d6225984b4930ec03928c9bcbf51486da99f	2017-05-02 20:27:33 -07:00
Luca Barbato	77772350f3	ppc: Add convolve8_horiz_avg The 8x8 and larger blocks cases can be sped up further. Change-Id: I54549b03ac6c7a4e3f485738b100c3cac7ac2e15	2017-05-02 20:27:28 -07:00
Luca Barbato	08edb85bd0	ppc: Add convolve8_horiz The 8x8 and larger blocks cases can be sped up further. Change-Id: I89b635d6b01c59f523f2d54b1284ed32916c5046	2017-05-02 20:27:16 -07:00
Linfeng Zhang	0178d974e5	Clean specializes of idct functions Change-Id: I8bb660de47b5f97263ec381dc428db96e9c9a4b2	2017-05-02 18:01:19 -07:00
Linfeng Zhang	4412996d59	Clean add_protos of highbd idct functions Change-Id: Ica51d780b92b316ce9112740c56cdf7670816371	2017-05-02 17:59:38 -07:00
Linfeng Zhang	a7a57d9756	Clean add_protos of idct functions Change-Id: I6037525d92ec172810edab720389eb1865ed3b1a	2017-05-02 17:58:40 -07:00
Luca Barbato	d51d3934f5	ppc: Add convolve_avg Change-Id: Ib203c444c708f42072e38301ee3db97b5b53d014	2017-04-29 15:47:25 +02:00
Luca Barbato	63860ba7b8	ppc: Add convolve_copy Change-Id: Ie26d6dbe090e711d84bac01ba7da270db983f405	2017-04-29 15:47:25 +02:00
Linfeng Zhang	51dc998f3a	Update highbd convolve functions arguments to use uint16_t src/dst BUG=webm:1388 Change-Id: I6912de2639895d817ce850da8ea9f6c8fe21da42	2017-04-25 14:22:19 -07:00
Luca Barbato	914b160fb5	ppc: h predictor 8x8 Slightly faster with the current compiler. Change-Id: Iae225fac08395eb430c97a2abec69c60f5cf5c47	2017-04-19 19:57:51 -07:00
Luca Barbato	0b9be93205	ppc: d63 predictor 8x8 10x faster. Change-Id: I7cedbf4df2ce7df5b6f1108b11815d088fdb9ba8	2017-04-19 19:57:51 -07:00
Luca Barbato	ee9325b0bd	ppc: tm predictor 4x4 Slightly faster. Change-Id: I0ca43f309b3d9b50435d69bd5be64b53a99bd191	2017-04-19 19:57:51 -07:00
Luca Barbato	2904eb5800	ppc: h predictor 4x4 2x faster. Change-Id: I0583dec353299c6797401b646099f18db4e0420d	2017-04-19 19:57:51 -07:00
Luca Barbato	58245d7050	ppc: dc predictor 8x8 Slightly faster, the other dc predictors cannot be faster since the computation speedup is overwhelmed by the time spent reading dst to write just the 8x8 part. Change-Id: I94a0b50500adf8b7b6bb919dbf5c7adf5b9fba66	2017-04-19 19:57:51 -07:00
Luca Barbato	6b4a65e8b1	ppc: d45 predictor 8x8 11x faster. Change-Id: I5b8f39213ee1f5260724fc254e3fb5c462435798	2017-04-19 19:57:51 -07:00
Luca Barbato	92e33c7b31	ppc: d63 predictor 32x32 About 10x faster. Change-Id: If7d0645f75c5d7deb9751edd0bf47e2f9068e9e7	2017-04-19 19:57:51 -07:00
Luca Barbato	a5469a00a8	ppc: d63 predictor 16x16 About 18x faster. Change-Id: Id043bf76c011e03e992085bb5e20f330d3e98cd4	2017-04-19 19:57:51 -07:00
Luca Barbato	cc868da526	ppc: d45 predictor 32x32 About 12x faster. Change-Id: I22c150256aefb4941861ab1f6c17d554fb694bed	2017-04-19 19:57:51 -07:00
Luca Barbato	7a7dc9e624	ppc: d45 predictor 16x16 About 16x faster. Change-Id: Ie5469fb32d5fd11bb6cb06318cea475d8a5b00b9	2017-04-19 19:57:51 -07:00
Luca Barbato	c08baa2900	ppc: dc predictor 32x32 10x and 5x faster. Change-Id: I7913c58c768334d818f541a5e219f1035791eeaf	2017-04-19 19:57:47 -07:00
Luca Barbato	22ca468c7c	ppc: dc top and left predictor 32x32 6x faster. Change-Id: I717995b4056e5579c68191d11b495372971fe1ae	2017-04-19 19:49:31 -07:00
Luca Barbato	ad9dea1f6d	ppc: dc top and left predictor 16x16 13x faster. Change-Id: I1771ac39fda599153f933cb3f0506c9f97a6cbe6	2017-04-19 19:49:31 -07:00
Luca Barbato	d68d37872c	ppc: dc_128 predictor 32x32 6x faster. Change-Id: I1da8f51b4262871cb98f0aa03ccda41b0ac2b08b	2017-04-19 19:49:31 -07:00
Luca Barbato	f9d20e6df2	ppc: dc_128 predictor 16x16 20x faster. Change-Id: I05f0deb2d38ae7966eae6b71fbc0aa51880e5709	2017-04-19 19:49:31 -07:00
Luca Barbato	0d9417de4a	ppc: tm predictor 32x32 About 8x faster. Change-Id: I9bad827ccbdf47ec95406e961c74ac2ff45f80cf	2017-04-19 19:49:26 -07:00
James Zern	a81f037f15	Merge changes I1f5a3752,I95123051,I3bb724e0,Ie81077fa,Ic80f3c05, ... * changes: ppc: tm predictor 16x16 ppc: tm predictor 8x8 ppc: horizontal predictor 32x32 ppc: horizontal predictor 16x16 ppc: vertical intrapred 16x16 and 32x32 configure: Workaround clang not enabling altivec on -mvsx configure: Match power64 as ppc64	2017-04-20 02:45:45 +00:00
Linfeng Zhang	bf8a49abbd	Clean CONVERT_TO_BYTEPTR/SHORTPTR in convolve Replace by CAST_TO_BYTEPTR/SHORTPTR. The rule is: if a short ptr is cast to a byte ptr, any offset operation on the byte ptr must be doubled. We do this by casting to short ptr first, adding offset, then casting back to byte ptr. BUG=webm:1388 Change-Id: I9e18a73ba45ddae58fc9dae470c0ff34951fe248	2017-04-19 12:13:49 -07:00
Luca Barbato	479443a570	ppc: tm predictor 16x16 About 10x faster. Change-Id: I1f5a3752d346459df3b45f92963208bf3e520f06	2017-04-19 01:48:10 +02:00
Luca Barbato	c8f5a55df4	ppc: tm predictor 8x8 About 5x faster. Change-Id: I951230517f49c0dca9ac9eac2efa8916a303b85a	2017-04-19 01:48:09 +02:00
Luca Barbato	7b0e12934e	ppc: horizontal predictor 32x32 About 5x faster. Change-Id: I3bb724e07baffd901aa2d0f65060ba48882cc9b8	2017-04-19 01:48:09 +02:00
Luca Barbato	a7a2d1653b	ppc: horizontal predictor 16x16 About 10x faster. Change-Id: Ie81077fa32ad214cdb46bdcb0be4e9e2c7df47c2	2017-04-19 01:48:09 +02:00
Luca Barbato	7ad1faa6f8	ppc: vertical intrapred 16x16 and 32x32 Change-Id: Ic80f3c050cfbe7697e81a311b4edaaa597b85cab	2017-04-19 01:48:09 +02:00
Johann	9fa24f03b5	re-enable vpx_comp_avg_pred_sse2 Buffers on 32 bit x86 builds only guaranteed 8 byte alignment. Fixed with "AvgPred test: use aligned buffers" and "sad avg: align intermediate buffer" Also re-enable asserts on the C version. BUG=webm:1390 Change-Id: I93081f1b0002a352bb0a3371ac35452417fa8514	2017-04-17 08:40:43 -07:00
Johann	069b772915	sad avg: align intermediate buffer comp_avg_pred has started declaring a requirement for aligned buffers. BUG=webm:1390 Change-Id: Idaf6667498ea343e8d49b32bc9d8b9d0aa43ef5c	2017-04-17 14:26:33 +00:00
James Zern	4ba20da8b1	Merge "Add AVX2 optimization to copy/avg functions"	2017-04-15 00:26:08 +00:00
Yi Luo	aa5a941992	Add AVX2 optimization to copy/avg functions Change-Id: Ibcef70e4fead74e2c2909330a7044a29381a8074	2017-04-14 16:50:10 -07:00
Johann	eaa7cdf05d	Disable vpx_comp_avg_pred_sse2 Failures on windows: unknown file: error: SEH exception with code 0xc0000005 thrown in the test body. Alignment check errors on linux: test_libvpx: ../libvpx/vpx_dsp/variance.c:230: void vpx_comp_avg_pred_c(uint8_t *, const uint8_t *, int, int, const uint8_t *, int): Assertion `((intptr_t)comp_pred & 0xf) == 0' failed. BUG=webm:1390 Change-Id: I5eed5381c0f1a8fe594a128eb415e77232f544ea	2017-04-14 08:43:06 -07:00
Johann	28a8622143	vpx_comp_avg_pred: sse2 optimization Provides over 15x speedup for width > 8. Due to smaller loads and shifting for width == 8 it gets about 8x speedup. For width == 4 it's only about 4x speedup because there is a lot of shuffling and shifting to get the data properly situated. BUG=webm:1390 Change-Id: Ice0b3dbbf007be3d9509786a61e7f35e94bdffa8	2017-04-13 08:44:52 -07:00
James Zern	04e9456567	Merge changes from topic 'Wshorten' * changes: configure: enable -Wshorten-64-to-32 for hbd vp9_encodeframe: resolve -Wshorten-64-to-32 in hbd Resolve -Wshorten-64-to-32 in highbd variance.	2017-04-07 07:32:14 +00:00
James Zern	47b9a09120	Resolve -Wshorten-64-to-32 in highbd variance. For 8-bit the subtrahend is small enough to fit into uint32_t. This is the same that was done for: `c0241664a` Resolve -Wshorten-64-to-32 in variance. For 10/12-bit apply: `63a37d16f` Prevent negative variance Change-Id: Iab35e3f3f269035e17c711bd6cc01272c3137e1d	2017-04-05 17:34:02 -07:00
Linfeng Zhang	6fc2e57c2c	Update 32x32 high bitdepth idct NEON optimization Preparation of CONVERT_TO_BYTEPTR/SHORTPTR clean up. BUG=webm:1388 Change-Id: I928d30a5698023bb90888d783cf81c51ec183760	2017-04-05 15:28:11 -07:00
James Zern	aefc1088a2	intrapred: sync highbd_d135_predictor w/d135_ previously: `05437805f` intrapred/d135: flatten border results before storing BUG=webm:1316 Change-Id: I3b8bd89117ad7f2f4560b57f7c148da781e86f85	2017-03-24 20:45:44 -07:00
James Zern	67cde46dd7	intrapred: specialize highbd 4x4 predictors d207/d63/d45/d117/d135/d153 ~9-45% better depending on the predictor on 32-bit ARM, similar range on x86-64 this matches the non-highbitdepth implementation BUG=webm:1316 Change-Id: Iddebdf7c58c6f31c47cae04da95c6e5318200e4c	2017-03-24 20:45:36 -07:00
James Zern	e05f4cf8f4	intrapred: rename d63f to d63e this is consistent with he/ve/d45e Change-Id: I75641ae5667430b0ecd370db86fff6e666cb577d	2017-03-24 20:41:39 -07:00
James Zern	d45617c702	remove CONFIG_MISC_FIXES this belonged to vp10 with the changes now migrated to av1. Change-Id: Ie30ead3e7b71f465bc14136e1b6f156ea978c43f	2017-03-24 20:41:39 -07:00
Kaustubh Raste	8ee9b855a0	Merge "Fix mips msa fwd xform mismatch"	2017-03-23 07:44:16 +00:00
James Zern	f16ea6a6eb	Merge "vp9_rdopt: correct size to vpx_sum_squares_2d_i16"	2017-03-23 00:53:22 +00:00
James Zern	e097bb1d39	Merge "idct_neon: prefix non-static functions w/'vpx_'"	2017-03-22 19:30:11 +00:00
James Zern	5661cd8ff4	vp9_rdopt: correct size to vpx_sum_squares_2d_i16 the current implementations expect pixel size, not the block type BUG=webm:1392 Change-Id: Ib91e9f30a1f56e13566b1fb76f089dae9bb50cdc	2017-03-22 12:04:33 -07:00
James Zern	f91c3bb3ab	idct_neon: prefix non-static functions w/'vpx_' Change-Id: I94fcdeae18468e6ef0cb7119b8142d982a048031	2017-03-22 11:49:23 -07:00
Kaustubh Raste	e45c1f55b4	Fix mips msa fwd xform mismatch Change-Id: I32a6df11463144aa1a562256ee7d57a41fd678d6	2017-03-22 14:01:03 +05:30
Yi Luo	cb9b277b2f	Merge "Make butterfly_self() signature consistent with butterfly()"	2017-03-21 22:32:20 +00:00
Yi Luo	266868a40b	Make butterfly_self() signature consistent with butterfly() - Refer to patch: `48fca113d` inv_txfm_ssse3,butterfly: fix win32 abi compatibility. - Change four butterfly() calls to butterfly_self(), to simplify the operations. Change-Id: Ib2a8cfe6cddcaf0a59e6e6270d8380055ea42ef3	2017-03-21 09:36:35 -07:00
James Zern	e0b4c4d1ae	Merge "Add vpx_highbd_idct32x32_1024_add_neon()"	2017-03-21 03:27:35 +00:00
James Zern	6d71d33d55	Merge "Add vpx_highbd_idct32x32_34_add_neon()"	2017-03-21 03:02:51 +00:00
James Zern	5da2e500d7	inv_txfm_sse2: clear conversion warning in hbd build tran_high -> tran_low in return from dct_const_round_shift() Change-Id: I2fe06c4b604823b1d1fe40a487017c3c2819a440	2017-03-17 01:16:38 -07:00
Linfeng Zhang	27530d484e	Add vpx_highbd_idct32x32_1024_add_neon() BUG=webm:1301 Change-Id: Ib90af0c1712e56b301d0e981dbe9a641e15e36ca	2017-03-17 00:27:46 -07:00
Linfeng Zhang	50b13f75b8	Add vpx_highbd_idct32x32_34_add_neon() BUG=webm:1301 Change-Id: I74dd16c6c64e7bb71aa991cedccddf0663ef5e06	2017-03-17 00:27:46 -07:00
James Zern	2882778310	Merge "Add vpx_highbd_idct32x32_135_add_neon()"	2017-03-17 07:26:52 +00:00
Linfeng Zhang	65e9fb65e8	Add vpx_highbd_idct32x32_135_add_neon() BUG=webm:1301 Change-Id: I58c2d65d385080711c3666d6d8f9d241dac7b21a	2017-03-16 22:37:55 -07:00
James Zern	68efc64b72	Merge "Clean vpx_idct32x32_1024_add_neon()"	2017-03-17 05:24:58 +00:00
Rafael de Lucena Valle	405b94c661	Add Hadamard for Power8 Change-Id: I3b4b043c1402b4100653ace4869847e030861b18 Signed-off-by: Rafael de Lucena Valle <rafaeldelucena@gmail.com>	2017-03-15 23:46:18 -03:00
Linfeng Zhang	e54231d613	Clean vpx_idct32x32_1024_add_neon() Change-Id: I05921e16d6a3e4e7e5b00a90624735050a186636	2017-03-15 11:24:31 -07:00
Yi Luo	8440cc4817	Merge "Improve idct32x32_1024_add SSSE3 intrinsics performance"	2017-03-15 02:32:52 +00:00
Linfeng Zhang	c756eb01c8	Fix overflow issue in 32x32 idct NEON intrinsics Similar issue as Change `bc1c18e`. The PartialIDctTest.ResultsMatch test on vpx_idct32x32_135_add_neon() in high bit-depth mode exposes 16-bit overflow in final stage of pass 2, when changing the test number from 1,000 to 1,000,000. Change to use saturating add/sub for vpx_idct32x32_34_add_neon(), vpx_idct32x32_135_add_neon and vpx_idct32x32_1024_add_neon() in high bit-depth mode. Change-Id: Iaec0e9aeab41a3fdb4e170d7e9b3ad1fda922f6f	2017-03-14 16:59:14 -07:00
Yi Luo	fedcf83f33	Improve idct32x32_1024_add SSSE3 intrinsics performance - Function level speed improves ~12%. Change-Id: I9b7dbddabf08c7d0f6b25264e6074d5ccbe39290	2017-03-14 14:04:08 -07:00
Linfeng Zhang	b0bfcc368c	Merge "Add vpx_highbd_idct32x32_135_add_c()"	2017-03-13 18:49:01 +00:00
James Zern	48fca113d1	inv_txfm_ssse3,butterfly: fix win32 abi compatibility only the first 3 parameters can be aligned to 16 as required by __m128i, make them all pointers for consistency. since: `07c48ccfe` Improve idct32x32_34_add SSSE3 intrinsics performance BUG=webm:1384 Change-Id: I0324f701e723a27cb470036a180693ba8829d01d	2017-03-10 19:57:17 -08:00
Yi Luo	327add990f	Improve idct32x32_135_add SSSE3 intrinsics performance - Split the inv txfm into three parts to avoid stack spillover. - Function level speed improves ~12%. - Use function and macro to remove some repeated code. Change-Id: I14f5f072334fd766808cb52bf648df792e7379ee	2017-03-09 16:17:54 -08:00
Linfeng Zhang	77311e0dff	Update vpx_idct32x32_1024_add_neon() Most are cosmetics changes. Speed has no change with clang 3.8, and about 5% faster with gcc 4.8.4 Tried the strategy used in 8x8 and 16x16 (which operations' orders are similar to the C code), though speed gets better with gcc, it's worse with clang. Tried to remove store_in_output(), but speed gets worse. Change-Id: I93c8d284e90836f98962bb23d63a454cd40f776e	2017-03-08 12:39:04 -08:00
Linfeng Zhang	48f5886605	Add vpx_highbd_idct32x32_135_add_c() When eob is less than or equal to 135 for high-bitdepth 32x32 idct, call this function. BUG=webm:1301 Change-Id: I8a5864f5c076e449c984e602946547a7b09c9fe6	2017-03-08 10:46:33 -08:00
Linfeng Zhang	c4e5c54d69	cosmetics,dsp/arm/: vpx_idct32x32_{34,135}_add_neon() No speed changes and disassembly is almost identical. Change-Id: Id07996237d2607ca6004da5906b7d288b8307e1f	2017-03-08 08:58:32 -08:00
Linfeng Zhang	3cf5c213f1	cosmetics,dsp/arm/: rename a variable Rename cospi_6_26_14_18N to cospi_6_26N_14_18N for consistency. Change-Id: I00498b43bb612b368219a489b3adaa41729bf31a	2017-03-08 08:55:41 -08:00
Yi Luo	07c48ccfe0	Improve idct32x32_34_add SSSE3 intrinsics performance - Split the transform into first half and second half. - Reschedule the instructions to avoid stack spillover. - Function level speed improves ~16%. Change-Id: I166889840d23aa8a273eca00f6fbdae8b4566f35	2017-03-01 11:14:48 -08:00
James Zern	47d6f16a04	get_prob(): rationalize int types promote the unsigned int calculation to uint64_t rather than int64_t for type consistency Change-Id: Ic34dee1dc707d9faf6a3ae250bfe39b60bef3438	2017-02-24 15:36:52 -08:00
Jerome Jiang	b1dcaf7f1e	Merge "Fix segmentation fault caused by denoiser working with spatial SVC."	2017-02-22 04:44:55 +00:00
Yi Luo	6036a0d24f	Following SSSE3 intrinsics functions also work for HBD - vpx_idct8x8_12_add_ssse3 vpx_idct8x8_64_add_ssse3 vpx_idct32x32_34_add_ssse3 vpx_idct32x32_135_add_ssse3 vpx_idct32x32_1024_add_ssse3 - turn on unit tests. Change-Id: I788b2b3b2074a6f3ab6a0e6f469c1327a123eff7	2017-02-21 12:37:53 -08:00
Jerome Jiang	0d1e5a21c4	Fix segmentation fault caused by denoiser working with spatial SVC. Re-enable the affected test. BUG=webm:1374 Change-Id: I98cd49403927123546d1d0056660b98c9cb8babb	2017-02-21 09:38:28 -08:00
Yi Luo	1f8e8e5bf1	Fix idct8x8 SSSE3 SingleExtremeCoeff unit tests - In SSSE3 optimization, 16-bit addition and subtraction would overflow when input coefficient is 16-bit signed extreme values. - Function-level speed becomes slower (unit ms): idct8x8_64: 284 -> 294 idct8x8_12: 145 -> 158. BUG=webm:1332 Change-Id: I1e4bf9d30a6d4112b8cac5823729565bf145e40b	2017-02-17 14:05:05 -08:00
James Zern	3e7025022e	Merge "Add vpx_highbd_idct16x16_10_add_neon()"	2017-02-17 20:29:37 +00:00
Yi Luo	f62dcc9c33	Replace idct32x32_1024_add_ssse3 assembly with intrinsics - Encoding/decoding test, BQTerrace_1920x1080_60.y4m, on i7-6700, no obvious user-level speed performance downgrade. - Passed unit tests. Change-Id: I20688e0dd3731021ec8fb4404734336f1a426bfc	2017-02-16 16:10:40 -08:00
Johann Koenig	a9b81da575	Merge "block error avx2: use tran_low_t"	2017-02-16 23:51:14 +00:00
Linfeng Zhang	0620081731	Add vpx_highbd_idct16x16_10_add_neon() BUG=webm:1301 Change-Id: If686c8144764c4162458f0bc4bb1bbf6555c48ab	2017-02-16 15:13:50 -08:00
James Zern	0f014c97e5	Merge "Fix mips vpx_post_proc_down_and_across_mb_row_msa function"	2017-02-16 23:02:10 +00:00
Johann Koenig	06a82af0de	Merge "correct bitdepth_conversion_sse2.h header guard"	2017-02-16 21:41:28 +00:00
Johann	6c2d732bf4	correct bitdepth_conversion_sse2.h header guard Change-Id: Ic4ffd861608e67fe59bcb3a86010ce3ef11a5519	2017-02-16 12:43:33 -08:00
Yi Luo	1cb44945fb	Merge "Add idct32x32_135_add SSSE3 intrinsics"	2017-02-16 20:43:29 +00:00
Johann	2104454607	block error avx2: use tran_low_t Change-Id: Ic5f3a1f569d6f82afeaf4fcd7235374bb460db3c	2017-02-16 12:39:02 -08:00
Yi Luo	72a43e2378	Add idct32x32_135_add SSSE3 intrinsics - Replace the corresponding assembly code. - No user level speed performance degrade. - Unit tests passed. Change-Id: Idd0c5a4bad4976f1617c34100cb46e75e3b961e5	2017-02-16 11:29:34 -08:00
Johann	4682130b60	quantize_fp highbd ssse3: use tran_low_t for coeff Change-Id: Iebade0efc0efbb0a80a0f3adbef4962e3a2f25e8	2017-02-16 07:40:56 -08:00
Johann	44600442dc	bitdepth conversion: really use num elements The previous implementation confused bit/bytes/elements. It was using '32' as the multiplier but that was mistakenly adopted because a 32x32 transform embedded the stride. Change-Id: Ieeb867a332416b9a40580b5e7c9b20088e9e691a	2017-02-16 15:02:48 +00:00
Kaustubh Raste	fddf66b741	Fix mips vpx_post_proc_down_and_across_mb_row_msa function Added fix to handle non-multiple of 16 cols case for size 16 Change-Id: If3a6d772d112077c5e0a9be9e612e1148f04338c	2017-02-16 13:17:00 +05:30
Johann Koenig	b63e88e506	Merge "Use 'packssdw' for loading tran_low_t values"	2017-02-16 02:41:00 +00:00
Linfeng Zhang	106c342659	cosmetics,dsp/inv_txfm.c: reorder functions Change-Id: Ie0f7689ebe230c68eadb22a32b14838c1a7543a6	2017-02-15 11:40:35 -08:00
Linfeng Zhang	81914ce68a	Add vpx_highbd_idct16x16_38_add_neon() BUG=webm:1301 Change-Id: Ic6cd8c1e63e1b7a997cbed221e20fff4c599e0fe	2017-02-15 09:12:02 -08:00
Linfeng Zhang	e07e74fb0f	Add vpx_highbd_idct16x16_38_add_c() When eob is less than or equal to 38 for high-bitdepth 16x16 idct, call this function. BUG=webm:1301 Change-Id: I09167f89d29c401f9c36710b0fd2d02644052060	2017-02-14 17:25:52 -08:00
Johann	327a02d77e	Use 'packssdw' for loading tran_low_t values This matches bitdepth_conversion_sse2.asm and produces substantially better assembly. The old way had lots of 'movzwl' and 'shl' and storing back to memory before loading into an xmm register. Change-Id: Ib33e35354dfd691a4f8b1e39f4dbcbb14cd5302b	2017-02-14 22:39:49 +00:00
Linfeng Zhang	429e652809	Replace 14 with DCT_CONST_BITS in idct NEON functions' shifts Change-Id: I2a39a3bb87516b04d273bc1c0f4a634e3fb6f0f6	2017-02-14 13:08:41 -08:00
clang-format	4b402746ca	apply clang-format Change-Id: I75e4a9e0b37bd4586f26c8d6c1fa27f3f6ff1bce	2017-02-14 12:45:52 -08:00
Yi Luo	c1a90dc160	Merge "Replace idct32x32_34_add_ssse3 assembly with intrinsics"	2017-02-14 20:13:27 +00:00
Yi Luo	bd86de1ac8	Replace idct32x32_34_add_ssse3 assembly with intrinsics - No user-level speed performance change. - Pass unit tests. Change-Id: Idfc598e00f354265e41f6b3219f4734216c115c6	2017-02-14 10:38:36 -08:00
Linfeng Zhang	de9ae32b93	Merge "Add vpx_highbd_idct16x16_256_add_neon()"	2017-02-14 01:15:34 +00:00
Linfeng Zhang	5ad4159ebb	Add vpx_highbd_idct16x16_256_add_neon() BUG=webm:1301 Change-Id: I6bb755552a39bdd26eef3f449601f6a9766c65ec	2017-02-13 15:50:33 -08:00
Johann	5ecde212a8	fdct8x8 highbd neon: use tran_low_t for output Change-Id: I100c4a1955d80bec4d28e82796b3e7f57e84d0ba	2017-02-13 22:16:14 +00:00
Linfeng Zhang	016933ad48	Add vpx_highbd_idct{16x16,32x32}_1_add_neon() and update vpx_highbd_idct8x8_1_add_neon() BUG=webm:1301 Change-Id: I18d1a0cbe98ba822d5194c1b4e13a4c29c5c75f4	2017-02-13 10:25:22 -08:00
James Zern	91f87e7513	Merge "Add vpx_idct16x16_38_add_neon()"	2017-02-11 03:42:36 +00:00
Linfeng Zhang	bc1c18e18c	Add vpx_idct16x16_38_add_neon() The RunQuantCheck() test on it exposes 16-bit overflow in stage 7 of pass 2. Change to use saturating add/sub for both vpx_idct16x16_38_add_neon() and vpx_idct16x16_256_add_neon() for high bitdepth. Change-Id: Ibf4c107a887553a52852cc582e28d38a5a5a2712	2017-02-08 12:15:22 -08:00
Yi Luo	ac04d11abc	Replace idct8x8_12_add_ssse3 assembly code with intrinsics - Performance achieves the same as assembly. - Unit tests pass. Change-Id: I6eacfbbd826b3946c724d78fbef7948af6406ccd	2017-02-08 10:07:45 -08:00
Linfeng Zhang	cf76ee2cb7	Add vpx_idct16x16_38_add_c() When eob is less than or equal to 38 for 16x16 idct, call this function. Change-Id: Ief6f3fb16a49ace3c92cebf4e220bf5bf52a6087	2017-02-07 09:40:51 -08:00
Linfeng Zhang	66695533a8	Merge "Update 16x16 8-bit idct NEON intrinsics"	2017-02-07 16:52:40 +00:00
Johann	641fda79bb	highbd x86: consolidate tran_low_t conversions Create new helper files specifically for converting tran_low_t types. Change-Id: I7c4c458ef910f3b3d10a3cfbf9df4de7682fd905	2017-02-06 10:43:26 -08:00
Jingning Han	bb40844e32	Merge "Add SSSE3 intrinsic 8x8 inverse 2D-DCT"	2017-02-02 22:18:32 +00:00
Kaustubh Raste	5b10674b5c	Merge "Add mips msa sum_squares_2d_i16 function"	2017-02-02 08:09:21 +00:00
Johann Koenig	726556dde9	Merge "Remove neon assembly for idct 16x16 and 8x8"	2017-02-02 03:25:31 +00:00
Johann Koenig	ce6318f254	Merge changes I43521ad3,I013659f6 * changes: satd highbd neon: use tran_low_t for coeff satd highbd sse2: use tran_low_t for coeff	2017-02-02 03:03:58 +00:00
Linfeng Zhang	e4985cf619	Update 16x16 8-bit idct NEON intrinsics Remove redundant memory accesses. Change-Id: I8049074bdba5f49eab7e735b2b377423a69cd4c8	2017-02-01 17:04:33 -08:00
Jingning Han	8f95389742	Add SSSE3 intrinsic 8x8 inverse 2D-DCT The intrinsic version reduces the average cycles from 183 to 175. Change-Id: I7c1bcdb0a830266e93d8347aed38120fb3be0e03	2017-02-01 14:47:53 -08:00
Johann Koenig	dc90501ba3	Merge changes I374dfc08,I7e15192e,Ica414007 * changes: hadamard highbd ssse3: use tran_low_t for coeff hadamard highbd neon: use tran_low_t for coeff hadamard highbd sse2: use tran_low_t for coeff	2017-02-01 21:56:36 +00:00
Johann Koenig	f60171bb4f	Merge "deblock: annotate postproc parameters"	2017-02-01 19:57:29 +00:00

994 Commits