generic-library/vpx

Author	SHA1	Message	Date
Johann	105503b839	neon fdct: 4x4 implementation Approximately twice as fast as C implementation. BUG=webm:1424 Change-Id: I3c0307fb08ddc23df42545cd089a78e2ed5c9d3f	2017-05-17 07:38:18 -07:00
Johann	7498fe2e54	neon 4 byte helper functions When data is guaranteed to be aligned, use helper functions which assert that requirement. Change-Id: Ic4b188593aea0799d5bd8eda64f9858a1592a2a3	2017-05-15 13:42:31 -07:00
Johann	1088b4f87c	move neon load/stores to a new file Move the tran_low_t helper functions to a new file. Additional load/store functions will be added here. Change-Id: I52bf652c344c585ea2f3e1230886be93f5caefc3	2017-05-15 08:29:43 -07:00
Johann Koenig	d713ec3c46	Merge changes I92eb4312,Ibb2afe4e * changes: subpel variance neon: add mixed sizes sub pixel variance neon: use generic variance	2017-05-10 18:19:52 +00:00
Johann	f7d1486f48	neon variance: process 16 values at a time Read in a Q register. Works on blocks of 16 and larger. Improvement of about 20% for 64x64. The smaller blocks are faster, but don't have quite the same level of improvement. 16x32 is only about 5% BUG=webm:1422 Change-Id: Ie11a877c7b839e66690a48117a46657b2ac82d4b	2017-05-08 18:48:55 +00:00
Johann Koenig	1814463864	Merge changes Id602909a,Ib0e85608 * changes: neon variance: process two rows of 8 at a time neon variance: add small missing sizes	2017-05-08 17:34:20 +00:00
Linfeng Zhang	2c3a2ad6f1	Merge changes I0cfe4117,I3581d80d,Ida62c941 * changes: Split dsp/x86/inv_txfm_sse2.c Update highbd idct functions arguments to use uint16_t dst Clean CONVERT_TO_BYTEPTR/SHORTPTR in idct	2017-05-08 16:15:57 +00:00
Johann	2346a6da4a	subpel variance neon: add mixed sizes Add support for everything except block sizes of 4. Performance is better but numbers will improve again when the variance optimizations land. BUG=webm:1423 Change-Id: I92eb4312b20be423fa2fe6fdb18167a604ff4d80	2017-05-04 15:30:01 -07:00
Johann	19e1ec8359	sub pixel variance neon: use generic variance When a neon version is available it will be called. This allows decoupling the variance implementations and has no real downside. For most configurations, the call will be #define'd to the neon implementation. Change-Id: Ibb2afe4e156c5610e89488504d366b3e6d1ba712	2017-05-04 15:30:01 -07:00
Johann	462e29703c	fdct 8x8 neon: minor comment cleanup Simplify HBD/non distinction in test. Document why transpose_neon.h is not used Change-Id: I17659414206ddbb8c2f1ef0d9f4a17f1745d5a52	2017-05-04 15:14:23 -07:00
Johann	d6a7489dd5	neon variance: process two rows of 8 at a time When the width is equal to 8, process two rows at a time. This doubles the speed of 8x4 and improves 8x8 by about 20%. 8x16 was using this technique already, but still improved a little bit with the rewrite. Also use this for vpx_get8x8var_neon BUG=webm:1422 Change-Id: Id602909afcec683665536d11298b7387ac0a1207	2017-05-04 08:59:46 -07:00
Johann	cb9133c72f	neon variance: add small missing sizes Some of the mixed sizes were missing. They can be implemented trivially using the existing helper function. When comparing the previous 16x8 and 8x16 implementations, the helper function is about 10% faster than the 16x8 version. The 8x16 is very close, but the existing version appears to be faster. BUG=webm:1422 Change-Id: Ib0e856083c1893e1bd399373c5fbcd6271a7f004	2017-05-04 08:59:42 -07:00
Linfeng Zhang	d5de63d2be	Update highbd idct functions arguments to use uint16_t dst BUG=webm:1388 Change-Id: I3581d80d0389b99166e70987d38aba2db6c469d5	2017-05-03 13:59:16 -07:00
Linfeng Zhang	081b39f2b7	Clean CONVERT_TO_BYTEPTR/SHORTPTR in idct BUG=webm:1388 Change-Id: Ida62c941f2b836d6c9e27b427a7d5008ab6dc112	2017-05-03 13:58:31 -07:00
Linfeng Zhang	51dc998f3a	Update highbd convolve functions arguments to use uint16_t src/dst BUG=webm:1388 Change-Id: I6912de2639895d817ce850da8ea9f6c8fe21da42	2017-04-25 14:22:19 -07:00
Linfeng Zhang	bf8a49abbd	Clean CONVERT_TO_BYTEPTR/SHORTPTR in convolve Replace by CAST_TO_BYTEPTR/SHORTPTR. The rule is: if a short ptr is cast to a byte ptr, any offset operation on the byte ptr must be doubled. We do this by casting to short ptr first, adding offset, then casting back to byte ptr. BUG=webm:1388 Change-Id: I9e18a73ba45ddae58fc9dae470c0ff34951fe248	2017-04-19 12:13:49 -07:00
Linfeng Zhang	6fc2e57c2c	Update 32x32 high bitdepth idct NEON optimization Preparation of CONVERT_TO_BYTEPTR/SHORTPTR clean up. BUG=webm:1388 Change-Id: I928d30a5698023bb90888d783cf81c51ec183760	2017-04-05 15:28:11 -07:00
James Zern	f91c3bb3ab	idct_neon: prefix non-static functions w/'vpx_' Change-Id: I94fcdeae18468e6ef0cb7119b8142d982a048031	2017-03-22 11:49:23 -07:00
Linfeng Zhang	27530d484e	Add vpx_highbd_idct32x32_1024_add_neon() BUG=webm:1301 Change-Id: Ib90af0c1712e56b301d0e981dbe9a641e15e36ca	2017-03-17 00:27:46 -07:00
Linfeng Zhang	50b13f75b8	Add vpx_highbd_idct32x32_34_add_neon() BUG=webm:1301 Change-Id: I74dd16c6c64e7bb71aa991cedccddf0663ef5e06	2017-03-17 00:27:46 -07:00
Linfeng Zhang	65e9fb65e8	Add vpx_highbd_idct32x32_135_add_neon() BUG=webm:1301 Change-Id: I58c2d65d385080711c3666d6d8f9d241dac7b21a	2017-03-16 22:37:55 -07:00
Linfeng Zhang	e54231d613	Clean vpx_idct32x32_1024_add_neon() Change-Id: I05921e16d6a3e4e7e5b00a90624735050a186636	2017-03-15 11:24:31 -07:00
Linfeng Zhang	c756eb01c8	Fix overflow issue in 32x32 idct NEON intrinsics Similar issue as Change bc1c18e. The PartialIDctTest.ResultsMatch test on vpx_idct32x32_135_add_neon() in high bit-depth mode exposes 16-bit overflow in final stage of pass 2, when changing the test number from 1,000 to 1,000,000. Change to use saturating add/sub for vpx_idct32x32_34_add_neon(), vpx_idct32x32_135_add_neon and vpx_idct32x32_1024_add_neon() in high bit-depth mode. Change-Id: Iaec0e9aeab41a3fdb4e170d7e9b3ad1fda922f6f	2017-03-14 16:59:14 -07:00
Linfeng Zhang	77311e0dff	Update vpx_idct32x32_1024_add_neon() Most are cosmetics changes. Speed has no change with clang 3.8, and about 5% faster with gcc 4.8.4 Tried the strategy used in 8x8 and 16x16 (which operations' orders are similar to the C code), though speed gets better with gcc, it's worse with clang. Tried to remove store_in_output(), but speed gets worse. Change-Id: I93c8d284e90836f98962bb23d63a454cd40f776e	2017-03-08 12:39:04 -08:00
Linfeng Zhang	c4e5c54d69	cosmetics,dsp/arm/: vpx_idct32x32_{34,135}_add_neon() No speed changes and disassembly is almost identical. Change-Id: Id07996237d2607ca6004da5906b7d288b8307e1f	2017-03-08 08:58:32 -08:00
Linfeng Zhang	3cf5c213f1	cosmetics,dsp/arm/: rename a variable Rename cospi_6_26_14_18N to cospi_6_26N_14_18N for consistency. Change-Id: I00498b43bb612b368219a489b3adaa41729bf31a	2017-03-08 08:55:41 -08:00
Linfeng Zhang	0620081731	Add vpx_highbd_idct16x16_10_add_neon() BUG=webm:1301 Change-Id: If686c8144764c4162458f0bc4bb1bbf6555c48ab	2017-02-16 15:13:50 -08:00
Linfeng Zhang	81914ce68a	Add vpx_highbd_idct16x16_38_add_neon() BUG=webm:1301 Change-Id: Ic6cd8c1e63e1b7a997cbed221e20fff4c599e0fe	2017-02-15 09:12:02 -08:00
Linfeng Zhang	429e652809	Replace 14 with DCT_CONST_BITS in idct NEON functions' shifts Change-Id: I2a39a3bb87516b04d273bc1c0f4a634e3fb6f0f6	2017-02-14 13:08:41 -08:00
Linfeng Zhang	de9ae32b93	Merge "Add vpx_highbd_idct16x16_256_add_neon()"	2017-02-14 01:15:34 +00:00
Linfeng Zhang	5ad4159ebb	Add vpx_highbd_idct16x16_256_add_neon() BUG=webm:1301 Change-Id: I6bb755552a39bdd26eef3f449601f6a9766c65ec	2017-02-13 15:50:33 -08:00
Johann	5ecde212a8	fdct8x8 highbd neon: use tran_low_t for output Change-Id: I100c4a1955d80bec4d28e82796b3e7f57e84d0ba	2017-02-13 22:16:14 +00:00
Linfeng Zhang	016933ad48	Add vpx_highbd_idct{16x16,32x32}_1_add_neon() and update vpx_highbd_idct8x8_1_add_neon() BUG=webm:1301 Change-Id: I18d1a0cbe98ba822d5194c1b4e13a4c29c5c75f4	2017-02-13 10:25:22 -08:00
Linfeng Zhang	bc1c18e18c	Add vpx_idct16x16_38_add_neon() The RunQuantCheck() test on it exposes 16-bit overflow in stage 7 of pass 2. Change to use saturating add/sub for both vpx_idct16x16_38_add_neon() and vpx_idct16x16_256_add_neon() for high bitdepth. Change-Id: Ibf4c107a887553a52852cc582e28d38a5a5a2712	2017-02-08 12:15:22 -08:00
Linfeng Zhang	66695533a8	Merge "Update 16x16 8-bit idct NEON intrinsics"	2017-02-07 16:52:40 +00:00
Johann Koenig	726556dde9	Merge "Remove neon assembly for idct 16x16 and 8x8"	2017-02-02 03:25:31 +00:00
Johann Koenig	ce6318f254	Merge changes I43521ad3,I013659f6 * changes: satd highbd neon: use tran_low_t for coeff satd highbd sse2: use tran_low_t for coeff	2017-02-02 03:03:58 +00:00
Linfeng Zhang	e4985cf619	Update 16x16 8-bit idct NEON intrinsics Remove redundant memory accesses. Change-Id: I8049074bdba5f49eab7e735b2b377423a69cd4c8	2017-02-01 17:04:33 -08:00
Johann	f8d744d91a	satd highbd neon: use tran_low_t for coeff BUG=webm:1365 Change-Id: I43521ad32b6c96737a8ef2b8c327f901fd7eaf84	2017-02-01 11:55:47 -08:00
Johann	1eb8a718bf	hadamard highbd neon: use tran_low_t for coeff BUG=webm:1365 Change-Id: I7e15192ead3a3631755b386f102c979f06e26279	2017-02-01 11:50:46 -08:00
Johann	13234d3c43	Remove neon assembly for idct 16x16 and 8x8 Tested using test/partial_idct_test.cc:DISABLED_Speed. Both gcc 4.9 and clang 3.8 from the r13 Android NDK offer improvements using the intrinsics (columns: function | clang asm | gcc asm | clang intrin | gcc intrin): idct16x16_256 | 1720ms | 1703ms | 1546ms | 1554ms; idct16x16_10 | 1320ms | 1247ms | 518ms | 488ms; idct16x16_1 | 107ms | 108ms | 64ms | 68ms; idct8x8_64 | 924ms | 931ms | 866ms | 989ms; idct8x8_12 | 826ms | 824ms | 519ms | 514ms; idct8x8_1 | 172ms | 166ms | 110ms | 125ms. idct8x8_64 isn't quite perfect (slight regression with gcc intrinsics) but as a counter example idct16x16_10 goes from ~1300ms to ~500ms. On a sample clip, clang improved from 48.5 to 49fps and gcc stayed roughly stable. BUG=webm:1303 Change-Id: I9d4fd2b41b46ea6174a887b40a82c8e6e4769ed4	2017-01-19 12:27:31 -08:00
Johann	68d0f46ec0	arm idct16x16: remove extra config guards This file is guarded by HAVE_NEON_ASM in the .mk file now. Change-Id: I513a621c234aa90ad52e426c8ed494d8a7d4b74a	2017-01-11 10:17:14 -08:00
James Zern	9480da21e8	Merge "Refine 8-bit 16x16 idct NEON intrinsics"	2017-01-09 23:52:29 +00:00
Johann	c23970ec25	postproc: vpx_mbpost_proc_down_neon This was much more amenable to optimization than the across filter. Speedup of almost 2.5x BUG=webm:1320 Change-Id: I49acc0f9cb2e7642303df90132cbc938acade4c4	2017-01-09 10:21:56 -08:00
Johann Koenig	9af97fb630	Merge "postproc: vpx_mbpost_proc_across_ip_neon"	2017-01-09 18:17:26 +00:00
Linfeng Zhang	6abdd31555	Refine 8-bit 16x16 idct NEON intrinsics Speed test shows 25% gain on vpx_idct16x16_256_add_neon(), and vpx_idct16x16_10_add_neon() got tripled. Change-Id: If8518d9b6a3efab74031297b8d40cd83c4a49541	2017-01-06 17:52:07 -08:00
Johann	4dca923454	postproc: vpx_mbpost_proc_across_ip_neon The speedup is pretty poor. I would be concerned except the SSE2 is worse: Existing SSE2 improvement: 22% New neon improvement: 35% BUG=webm:1320 Change-Id: Ied598a261134aa6cbe69f96f58589d2bae17bf62	2017-01-06 16:39:17 -08:00
Linfeng Zhang	2d12a52ff0	Merge "Add high bitdepth 8x8 idct NEON intrinsics"	2017-01-06 16:47:23 +00:00
Linfeng Zhang	911bb980b1	Clean DC only idct NEON intrinsics BUG=webm:1301 Change-Id: Iffc83854218460b3f687f3774e71d45b552382a5	2016-12-28 13:51:44 -08:00
Linfeng Zhang	9b187954df	Add high bitdepth 8x8 idct NEON intrinsics BUG=webm:1301 Change-Id: I56e3bc3aab9214e2debac93796389a7194991084	2016-12-27 16:28:53 -08:00

238 Commits