openh264

Author	SHA1	Message	Date
Sindre Aamås	8a0af4a3f2	[Processing/x86] DyadicBilinearDownsample optimizations Average vertically before horizontally; horizontal averaging is more worksome. Doing the vertical averaging first reduces the number of horizontal averages by half. Use pmaddubsw and pavgw to do the horizontal averaging for a slight performance improvement. Minor tweaks. Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines. The non-temporal loads used in the SSE4 routines do nothing for cache-backed memory AFAIK. Adjust tests because averaging vertically first gives slightly different output. ~2.39x speedup for the widthx32 routine on Haswell when not memory-bound. ~2.20x speedup for the widthx16 routine on Haswell when not memory-bound. Note that the widthx16 routine can be unrolled for further speedup.	2016-06-02 13:44:28 +02:00
Sindre Aamås	563376df0c	[UT] Test downsampling routines with a wider variety of height ratios	2016-05-25 14:16:29 +02:00
Sindre Aamås	4fec6d581e	[UT] Test generic downsampling routines with a wider variety of width ratios Get coverage of all code paths for routines that branch to different paths for different scaling ratios.	2016-05-23 20:23:47 +02:00
Sindre Aamås	e490215990	[Processing/x86] Add an AVX2 implementation of GeneralBilinearAccurateDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 8. Because pshufb does not cross 128-bit lanes, the overhead of address calculations and loads is relatively greater as compared with an SSSE3/SSE4.1 implementation. Fall back to a generic approach for ratios > 8. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2, ~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios within (4, 8], and ~5.06x/~4.09x for ratios > 8 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:47 +02:00
Sindre Aamås	b43e58a366	[Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 8. Because pshufb does not cross 128-bit lanes, the overhead of address calculations and loads is relatively greater as compared with an SSSE3 implementation. Fall back to a generic approach for ratios > 8. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2, ~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:47 +02:00
Sindre Aamås	b1013095b1	[Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 4. Fall back to a generic approach for ratios > 4. The use of blendps makes this require SSE4.1. The pshufb path can be backported to SSSE3 and the generic path to SSE2 for a minor reduction in performance by replacing blendps and preceding instructions with an equivalent sequence. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2, ~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios > 4 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:39 +02:00
Sindre Aamås	1995e03d91	[Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 4. Fall back to a generic approach for ratios > 4. Note that the generic approach can be backported to SSE2. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2, ~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios > 4 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:31 +02:00
Sindre Aamås	93db6511a8	[UT] Test VAA routines with a wider variety of resolutions Test even and odd multiples of 32 width because some AVX2 routines have conditional logic based on that.	2016-04-11 16:40:36 +02:00
Sindre Aamås	57fc3e9917	[Processing] Add AVX2 VAA routines Process 8 lines at a time rather than 16 lines at a time because this appears to give more reliable memory subsystem performance on Haswell. Speedup is > 2x as compared to SSE2 when not memory-bound on Haswell. On my Haswell MBP, VAACalcSadSsdBgd is about 3x faster when uncached, which appears to be related to processing 8 lines at a time as opposed to 16 lines at a time. The other routines are also faster as compared to the SSE2 routines in this case but to a lesser extent.	2016-04-11 16:09:56 +02:00
Guangwei Wang	64657d3cfd	add new c and assembly functions to optimize downsampler when downscale equal 1:3/1:4	2015-09-11 16:45:40 +08:00
Martin Storsjö	51efa57a3d	Convert tabs to spaces in vertically aligned code	2015-06-10 10:21:29 +03:00
Martin Storsjö	43767cddb6	Remove tabs from commented out code	2015-06-10 10:21:21 +03:00
Martin Storsjö	dd913ef878	Don't use tabs for indentation in multi-line macros The astyle configuration makes sure normal code is indented consistently with 2 spaces, but astyle doesn't seem to touch the indentation in these multi-line macros.	2015-05-13 22:06:54 +03:00
Martin Storsjö	acafbb442d	Add checks for cpu features in tests This allows running the tests on devices that don't have all the SIMD instruction sets.	2015-01-24 22:47:23 +02:00
Sijia Chen	4e89e71e8f	reformat cpp files for unit tests	2014-10-23 17:54:33 +08:00
ruil2	3ff145e839	rename namespace and function name to avoid conflicts with old library	2014-09-17 15:50:59 +08:00
zhuiling	0fe477625c	improve py, and change mk according to mk	2014-09-12 10:25:46 +08:00
Martin Storsjö	c3710c4130	Avoid warnings about comparison between signed and unsigned There's no need for these variables to be explicitly unsigned. This also matches the function above.	2014-08-28 11:13:38 +03:00
zhiliang wang	93af7bfc64	Add UT for Downsample functions.	2014-08-27 15:40:14 +08:00
zhiliang wang	0163eb520d	Add UT for VaaCalc Functions.	2014-08-27 13:53:18 +08:00
HFVideoMac	910c64ef22	add ARM64 Adaptive Quantization code and UT	2014-07-22 15:07:25 +08:00
Martin Storsjö	4f594deff9	Don't reset the random number generator within the unit tests This makes sure we don't accidentally return the same sequence of random numbers multiple times within one test (which would be very non-random). Every time srand(time()) is called, the pseudo random number generator is initialized to the same value (as long as time() returned the same value). By initializing the random number generator once and for all before starting to run the unit tests, we are sure we don't need to reinitialize it within all the tests and all the functions that use random numbers. This fixes occasional errors in MotionEstimateTest. MotionEstimateTest was designed to allow the test to occasionally not succeed - if it didn't succeed, it tried again, up to 100 times. However, since the YUVPixelDataGenerator function reset the random seed to time(), every attempt actually ran with the same random data (as long as all 100 attempts ran within 1 second) - thus if one attempt in MotionEstimateTest failed, all 100 of them would fail. If the utility functions don't touch the random seed, this is not an issue.	2014-07-01 10:20:45 +03:00
huili2	dc3fae4477	astyle all	2014-06-25 18:50:41 -07:00
Martin Storsjö	f99336d866	Don't compare a boolean to an integer This avoids a warning when building with MSVC.	2014-05-04 14:53:36 +03:00
lyao2	4248cc9c42	fix typedef re-define issue	2014-04-30 16:46:58 +08:00
lyao2	1c90837001	remove typedef.h	2014-04-29 15:50:08 +08:00
lyao2	34ad719cf2	Squashed commit of the following: commit f73d6cf0fcae5f401fc2817ab736af996113ca09 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Thu Apr 24 15:02:21 2014 +0800 remove comments commit 75416c2cf6c1ebb7aabf9e8c52d8c7163a8009b7 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Thu Apr 24 14:52:09 2014 +0800 for test commit 7dfb65ce514edcff892bfb3919921cadcce1d055 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Thu Apr 24 14:12:31 2014 +0800 for test commit eff771645e8c349dc4e454ab1751530b3cef18ed Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Thu Apr 24 10:51:34 2014 +0800 for test commit 9c42b9a7a04068e70be94529941f549b58e63780 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Wed Apr 23 17:46:59 2014 +0800 update cpu_flag commit cce3fccc0a4249b82ab2e0e92fe53579ef942799 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Wed Apr 23 17:26:56 2014 +0800 for test commit 3d292995b3c4437a2674a687cc4e8da1b5fb83f5 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Wed Apr 23 16:45:57 2014 +0800 remove space commit c608c2ba7cf010f1dcf8c0344f68536c48e181cb Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Wed Apr 23 16:42:43 2014 +0800 remove tabs commit 3b769342a06e25ad23a2c86f23a94d0d7ca1a4c8 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Wed Apr 23 16:33:55 2014 +0800 refine UT case commit 89b869f0c8f8c9bbd61e9de32caa77877aeae064 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Tue Apr 22 13:40:50 2014 +0800 Squashed commit of the following: commit abe55494134ef8342ffe9566df4e1b3265fe21b6 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Tue Apr 22 10:50:07 2014 +0800 set MV range commit 8c7f70c351e50d945c29118bed8b3781c22b7dbc Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Mon Apr 21 16:53:10 2014 +0800 refinement commit bf35f19a7dc88743aacf8e89e681e0ef3302d40a Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Fri Apr 18 17:24:31 2014 +0800 correct tabs commit 130b7f895d7020bfc571d910966891da93150242 Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Fri Apr 18 17:17:06 2014 +0800 correct format commit 0429703b0844363559dd2b3d44e45034232a9d8f Author: lyao2 <lyao2@LYAO2-WS01.cisco.com> Date: Fri Apr 18 15:12:44 2014 +0800 add scroll UT	2014-04-24 15:12:49 +08:00
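
The 8a0af4a3f2 message in the log above says that averaging vertically first halves the number of horizontal averages and gives slightly different output. A minimal scalar sketch of that reasoning follows. It assumes every step is a rounding average, (a + b + 1) >> 1, as the x86 pavgb instruction computes. The names Avg, HorizontalFirst and VerticalFirst are hypothetical, and this is not the SSSE3 routine itself, which per the message uses pmaddubsw and pavgw for the horizontal step.

```cpp
#include <cstdint>

// Rounding average of two 8-bit samples: (a + b + 1) >> 1.
// Assumption: this mirrors pavgb-style rounding; the real routines may differ.
constexpr std::uint8_t Avg(std::uint8_t a, std::uint8_t b) {
  return static_cast<std::uint8_t>((a + b + 1) >> 1);
}

// One output pixel from the 2x2 source block
//   [a b]
//   [c d]
// Horizontal first: two horizontal averages, then one vertical average.
constexpr std::uint8_t HorizontalFirst(std::uint8_t a, std::uint8_t b,
                                       std::uint8_t c, std::uint8_t d) {
  return Avg(Avg(a, b), Avg(c, d));
}

// Vertical first: two vertical averages, then a single horizontal average,
// so the horizontal step runs half as often per output pixel.
constexpr std::uint8_t VerticalFirst(std::uint8_t a, std::uint8_t b,
                                     std::uint8_t c, std::uint8_t d) {
  return Avg(Avg(a, c), Avg(b, d));
}

// Because each step rounds, the two orders can disagree by one.
static_assert(HorizontalFirst(0, 1, 2, 1) != VerticalFirst(0, 1, 2, 1),
              "rounding makes the result depend on the averaging order");
```

Under this rounding assumption, the block [0 1; 2 1] yields 2 when averaged horizontally first and 1 when averaged vertically first. A difference of this kind would explain why the commit had to adjust the expected test output.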

27 Commits
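
The 4f594deff9 message describes why reseeding inside a helper defeats a retry loop: every call that reseeds with the same time() value restarts the same pseudo-random sequence. The sketch below illustrates that behavior with the standard std::srand and std::rand calls. The functions GenerateWithReseed and GenerateWithoutReseed are hypothetical stand-ins, not the project's YUVPixelDataGenerator, and the parameter `now` plays the role of the value time() would return.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical generator that reseeds on every call, as the commit message
// describes. Two calls within the same second receive the same `now`.
std::vector<int> GenerateWithReseed(unsigned now, int count) {
  assert(count >= 0);
  std::srand(now);  // restarts the sequence from the same point each call
  std::vector<int> data;
  data.reserve(static_cast<std::size_t>(count));
  for (int i = 0; i < count; ++i) data.push_back(std::rand());
  return data;
}

// Hypothetical generator that leaves the seed alone. The caller seeds once,
// before any test runs, so successive calls continue one sequence.
std::vector<int> GenerateWithoutReseed(int count) {
  assert(count >= 0);
  std::vector<int> data;
  data.reserve(static_cast<std::size_t>(count));
  for (int i = 0; i < count; ++i) data.push_back(std::rand());
  return data;
}
```

Calling GenerateWithReseed twice with the same `now` returns identical vectors, so a test that retries up to 100 times within one second would see the same data on every attempt. With GenerateWithoutReseed and a single std::srand call at startup, each attempt continues the sequence instead of restarting it.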