generic-library/vpx

Author	SHA1	Message	Date
Linfeng Zhang	7c0529728a	cosmetics: NEON scaling code Change-Id: Ib91054622c1f09c4ca523bc6837d7d8ab9f03618	2017-09-19 16:39:17 -07:00
Linfeng Zhang	f357335c38	Refactor convolve NEON code Rename a couple of hbd static functions. Move the position of NEON function convolve8_4(). Change-Id: Idfac00edf2e99cdd8e0a73b9f895402f60be6349	2017-09-19 16:28:36 -07:00
Linfeng Zhang	a9bbe53dbb	Add 4 to 1 scaling NEON optimization BUG=webm:1419 Change-Id: If82a93935d2453e61b7647aae70983db1740bec7	2017-09-11 10:17:28 -07:00
Linfeng Zhang	3ec20445b2	Refactor convolve8 NEON functions Change-Id: I4ac576875c91fee7cb150d298fae4a2c156d374c	2017-09-06 15:55:17 -07:00
Linfeng Zhang	d331e7a1c0	Remove get_filter_base() and get_filter_offset() in convolve so that the convolve functions are independent of table alignment. Change-Id: Ieab132a30d72c6e75bbe9473544fbe2cf51541ee	2017-09-05 15:22:36 -07:00
Johann Koenig	dfafd10ef5	Merge "quantize neon: round dqcoeff towards zero"	2017-08-23 19:20:53 +00:00
Johann	2a5aa98a35	quantize neon: round dqcoeff towards zero Add 1 if negative to get dqcoeff to round towards zero. 10-15% faster than converting to positive before shifting. Change-Id: I01a62fd0c9bca786b6885b318bd447bb9229903d	2017-08-23 08:05:50 -07:00
Johann	2c56bb97f2	quantize: ignore skip_block in arm Change-Id: Icfb70687476b2edb25d255793ba325b261d40584	2017-08-21 14:37:50 -07:00
Johann	93166c5e51	neon: vpx_quantize_b_32x32 With skip block the neon is about twice as fast as C. The neon has no shortcut for coeff < zbin so it always takes the same amount of time. Even if the C can take the shortcut, it is over twice as fast in neon. If it can't, that gap increases to over 10x. BUG=webm:1426 Change-Id: I400722146c1b5a5f6289f67d85fd642463d2bfc6	2017-08-08 14:05:18 -07:00
Johann	2d6b5df657	neon: vpx_quantize_b With skip block or coeff < zbin it is about twice as fast as C. If most coeff values are > zbin it is about 10-15x as fast as C. BUG=webm:1426 Change-Id: I5d3c007b014a372d5ef0882b39bb48983b4131c7	2017-07-31 10:38:46 -07:00
Johann	e381753926	sad4d neon: 64x[32,64] Rewrite 64x64. BUG=webm:1425 Change-Id: I336bf5a3aa4b783389c10b16a50f0f559346ecbf	2017-07-12 13:26:39 +00:00
Johann	e1bde306c8	sad4d neon: 32x[16,32,64] Rewrite 32x32. Use half the accumulator registers. BUG=webm:1425 Change-Id: Ibf5e61dc4ba15056102aef8495f4a02c668c5d13	2017-07-12 13:25:18 +00:00
Johann	807ce8fb1e	sad4d neon: 16x[8,16,32] Rewrite 16x16. Use half the accumulator registers. BUG=webm:1425 Change-Id: I44b48512b1e3629505d83c2645e800f53878ccc2	2017-07-12 13:25:11 +00:00
Johann	8152b0904d	sad4d neon: 8x[4,8,16] BUG=webm:1425 Change-Id: I7de2500cca4b621f21478c4b0333c56d76dbc9a4	2017-07-12 13:25:03 +00:00
Johann	dd4347e9ec	sad4d neon: 4x4, 4x8 BUG=webm:1425 Change-Id: I5081b5ce131821d590c53ac1206a94f50cb8b468	2017-07-12 03:38:03 +00:00
Johann	66a96fd3de	avg_neon: fix 4x4, update 8x8 4x4 was failing with a bus error. Most likely due to clang alignment hints on 32bit loads. Change-Id: Ib191ce0e6239fc55d85f10e4dbe15876e5052edb	2017-07-10 15:29:34 -07:00
Johann	87610ac45e	neon: consolidate horizontal adds Change-Id: Iaf9e88ff636ccf8f0ef310869c6827f3f205cca8	2017-07-10 15:29:13 -07:00
Johann Koenig	4e16f70703	Merge changes Id84d9780,Iaa6ea75b,I3362e0dd,I0020a49e,Ia42e4f36, ... * changes: sad neon: avg for 64x[32,64] sad neon: macroize 64xN definitions sad neon: avg for 32x[16,32,64] sad neon: macroize 32xN definitions sad neon: avg for 16x[8,16,32] sad neon: macroize 16xN definitions	2017-07-07 21:01:23 +00:00
Johann Koenig	6c375b9cd0	Merge "fdct neon: 32x32_rd"	2017-07-07 14:05:51 +00:00
Johann	e4e08556db	sad neon: avg for 64x[32,64] BUG=webm:1425 Change-Id: Id84d97807a6a0fbcc889c4dfe11929d54f85493d	2017-07-07 07:04:04 -07:00
Johann	6ae8f8dbe8	sad neon: macroize 64xN definitions Change-Id: Iaa6ea75b10e75784f31b1e08637eecf0dcb5cff9	2017-07-07 07:04:04 -07:00
Johann	67cffc1ef6	sad neon: avg for 32x[16,32,64] BUG=webm:1425 Change-Id: I3362e0dded3b46ca032caa7f44db42f324bc596d	2017-07-07 07:04:04 -07:00
Johann	b0d15713be	sad neon: macroize 32xN definitions Change-Id: I0020a49e77d27514375a03095d5821dc0aa7d128	2017-07-07 07:04:04 -07:00
Johann	527e0c9b1c	sad neon: avg for 16x[8,16,32] BUG=webm:1425 Change-Id: Ia42e4f36547c5fe12114fb58379e34bce82eb2f2	2017-07-07 07:04:04 -07:00
Johann	3c18acf452	sad neon: macroize 16xN definitions Change-Id: I5aea6ffbfa48eb1970afe3be54f0bba275d7fa58	2017-07-07 07:04:04 -07:00
Johann	d6423b3166	sad neon: macroize 8xN definitions Change-Id: I7b36a57e893c1795a37ba7994995bec7ff021409	2017-07-06 07:51:59 -07:00
Johann	63bdc574e5	sad neon: avg for 8x[4,8,16] BUG=webm:1425 Change-Id: If2ab51e3050e078b0011b174efe41fcb65a15f44	2017-07-06 07:43:09 -07:00
Johann	6bac3f80ee	sad neon: avg for 4x4 and 4x8 BUG=webm:1425 Change-Id: Ifc685a96cb34f7fd9243b4c674027480564b84fb	2017-07-06 07:12:47 -07:00
Johann	75b00592c7	fdct neon: 32x32_rd About 40% faster than the non-rd version. BUG=webm:1424 Change-Id: Ia99d14eb9532302eeaab8cd3e503395b0374b5a2	2017-07-06 06:30:50 -07:00
Johann	3ae458f2f3	partial fdct neon: maintain neon registers Finish the calulations in neon registers. This avoids a potentially expensive move from neon to gp and allows at least clang to store directly to memory. BUG=webm:1424 Change-Id: Idef25eec95f7610947167818e9194bde8b00d282	2017-07-01 09:29:38 -07:00
Johann	9fe510c12a	partial fdct neon: add 32x32_1 Always return an int32_t. Since it needs to be moved to a register for shifting, this doesn't really penalize the smaller transforms. The values could potentially be summed and shifted in place. BUG=webm:1424 Change-Id: Id5beb35d79c7574ebd99285fc4182788cf2bb972	2017-06-28 15:37:44 -07:00
Johann	f310ddc470	partial fdct neon: add 16x16_1 For the 8x8_1, the highbd output fit nicely in the existing function. 12 bit input will overflow this implementation of 16x16_1. BUG=webm:1424 Change-Id: I2945fe5478b18f996f1a5de80110fa30f3f4e7ec	2017-06-28 15:37:44 -07:00
Johann	4959dd3eb3	partial fdct neon: add 4x4_1 BUG=webm:1424 Change-Id: Ib0f3cfd6116fc1f5a99acb8bfd76e25b90177ffc	2017-06-28 15:37:44 -07:00
Johann	cf75ab6ccd	partial fdct neon: move 8x8_1 and enable hbd tests The function was originally written with HBD in mind. Enable it and configure the tests. BUG=webm:1424 Change-Id: I78a2eba8d4d9d59db98a344ba0840d4a60ebe9a1	2017-06-28 15:37:43 -07:00
Johann	ad011aaab8	sad neon: rewrite 64x64 and add 64x32 BUG=webm:1425 Change-Id: Ib454762d1c61b05a98324fe81ad58c9e09784717	2017-06-28 12:21:34 -07:00
Johann	77a648885c	sad neon: rewrite 32x32, add 32x16 and 32x64 BUG=webm:1425 Change-Id: I966650df7e3face93e1e771634d1cc5458a35f85	2017-06-28 12:20:27 -07:00
Johann	469643757f	sad neon: rewrite 16x8, 16x16, add 16x32 BUG=webm:1425 Change-Id: Ie126553e5fffcdfaf3d82a85b368ac10ce9ab082	2017-06-28 12:16:00 -07:00
Johann	e40e78be24	sad neon: rewrite 8x8 and 8x16 BUG=webm:1425 Change-Id: I068f06c67b841f09ea07c04ada0c2f1706102138	2017-06-28 12:15:57 -07:00
Johann	46d8660ce3	sad neon: rewrite 4x4 and add 4x8 The previous implementation loaded 8 values (discarding half) BUG=webm:1425 Change-Id: Icb72a94e2557a4ee2db7091266ab58fd92f72158	2017-06-28 11:14:59 -07:00
Johann	e67660cf37	fdct32x32 neon implementation Almost 3x faster in constrained loop testing. Over 10x faster in HBD builds. BUG=webm:1424 Change-Id: I2b7f8453e1d4ada63cde729d8115d684c4a71ff9	2017-06-22 06:40:17 -07:00
Johann Koenig	903375a48a	Merge "fdct16x16 neon optimization"	2017-06-08 15:19:36 +00:00
Johann	eae7cf2368	fdct16x16 neon optimization Roughly 2x speedup. Since the only change for HBD is to store(), the improvement appears to hold there as well. BUG=webm:1424 Change-Id: I15b813d50deb2e47b49a6b0705945de748e83c19	2017-06-07 14:59:55 -07:00
Johann Koenig	755b3daf90	Merge "comp_avg_pred neon: used by sub pixel avg variance"	2017-05-31 18:17:28 +00:00
Johann	f695b30ac2	comp_avg_pred neon: used by sub pixel avg variance BUG=webm:1423 Change-Id: I33de537f238f58f89b7a6c1c2d6e8110de4b8804	2017-05-30 22:47:34 +00:00
Johann	42ce25821d	remove DECLARE_ALIGNED from neon code Unlike x86 neon only requires type alignment when loading into vectors. Change-Id: I7bbbe4d51f78776e499ce137578d8c0effdbc02f	2017-05-26 10:41:57 -07:00
Johann	f3c97ed32e	subpel variance neon: reduce stack usage Unlike x86, arm does not impose additional alignment restrictions on vector loads. For incoming values to the first pass, it uses vld1_u32() which typically does impose a 4 byte alignment. However, as the first pass operates on user-supplied values we must prepare for unaligned values anyway (and have, see mem_neon.h). But for the local temporary values there is no stride and the load will use vld1_u8 which does not require 4 byte alignment. There are 3 temporary structures. In the C, one is uint16_t. The arm saturates between passes but still passes tests. If this becomes an issue new functions will be needed. Change-Id: I3c9d4701bfeb14b77c783d0164608e621bfecfb1	2017-05-24 13:28:13 -07:00
Johann	d204c4bf01	Use vdup instead of vmov Change-Id: Idb6248c1429b55176bb3e9f4e8365ea0ed2be62a	2017-05-24 11:38:15 -07:00
Johann	f6fcd3410d	sub pel avg variance neon: 4x block sizes BUG=webm:1423 Change-Id: Iaab2b9a183fdb54aae5f717aba95d90dc36a9e3b	2017-05-22 14:40:05 -07:00
Johann	188d58eaa9	sub pel variance neon: 4x block sizes Add optimizations for blocks of width 4 BUG=webm:1423 Change-Id: Idfb458d36db3014d48fbfbe7f5462aa6eb249938	2017-05-22 14:40:01 -07:00
Johann	9b0d306a2f	sub pel avg variance neon: add neon optimizations These are missing an optimized version of vpx_comp_avg_pred BUG=webm:1423 Change-Id: I31fa6ef842e98f7ff3ea079ffed51ae33178e2ed	2017-05-22 13:58:43 -07:00

1 2 3 4 5

245 Commits