openh264

Author	SHA1	Message	Date
Sindre Aamås	b43e58a366	[Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 8. Because pshufb does not cross 128-bit lanes, the overhead of address calculations and loads is relatively greater as compared with an SSSE3 implementation. Fall back to a generic approach for ratios > 8. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2, ~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:47 +02:00
Sindre Aamås	b1013095b1	[Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 4. Fall back to a generic approach for ratios > 4. The use of blendps makes this require SSE4.1. The pshufb path can be backported to SSSE3 and the generic path to SSE2 for a minor reduction in performance by replacing blendps and preceding instructions with an equivalent sequence. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2, ~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios > 4 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:39 +02:00
Sindre Aamås	1995e03d91	[Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 4. Fall back to a generic approach for ratios > 4. Note that the generic approach can be backported to SSE2. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2, ~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios > 4 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:31 +02:00
Guangwei Wang	64657d3cfd	add new c and assembly functions to optimize downsampler when downscale equal 1:3/1:4	2015-09-11 16:45:40 +08:00
Martin Storsjö	51efa57a3d	Convert tabs to spaces in vertically aligned code	2015-06-10 10:21:29 +03:00
Martin Storsjö	43767cddb6	Remove tabs from commented out code	2015-06-10 10:21:21 +03:00
Sijia Chen	4e89e71e8f	reformat cpp files for unit tests	2014-10-23 17:54:33 +08:00
ruil2	3ff145e839	rename namespace and funciton name to avoid conflicts with old library	2014-09-17 15:50:59 +08:00
Martin Storsjö	c3710c4130	Avoid warnings about comparison between signed and unsigned There's no need for these variables to be explicitly unsigned. This also matches the function above.	2014-08-28 11:13:38 +03:00
zhiliang wang	93af7bfc64	Add UT for Downsample functions.	2014-08-27 15:40:14 +08:00

10 Commits