27 Commits

Author SHA1 Message Date
Sindre Aamås
8a0af4a3f2 [Processing/x86] DyadicBilinearDownsample optimizations
Average vertically before horizontally, because horizontal averaging is
more costly. Doing the vertical averaging first halves the number of
horizontal averages.

Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.

Minor tweaks.

Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.

Adjust tests because averaging vertically first gives slightly different
output.

~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.

Note that the widthx16 routine can be unrolled for further speedup.
2016-06-02 13:44:28 +02:00
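The effect of the averaging order described in 8a0af4a3f2 can be illustrated with a minimal scalar sketch. It assumes each stage is a rounded average of the form (a + b + 1) >> 1, which is what pavgb and pavgw compute. The function names and the 2x2 sample block are hypothetical and chosen only for illustration; this is not the routine from the commit.

```python
def avg(a, b):
    """Rounded average, as computed by pavgb/pavgw: (a + b + 1) >> 1."""
    return (a + b + 1) >> 1


def downsample_horizontal_first(top, bottom):
    """Average horizontal pairs in each row, then average the two rows."""
    h_top = [avg(top[i], top[i + 1]) for i in range(0, len(top), 2)]
    h_bottom = [avg(bottom[i], bottom[i + 1]) for i in range(0, len(bottom), 2)]
    return [avg(t, b) for t, b in zip(h_top, h_bottom)]


def downsample_vertical_first(top, bottom):
    """Average the two rows, then average horizontal pairs of the result."""
    v = [avg(t, b) for t, b in zip(top, bottom)]
    return [avg(v[i], v[i + 1]) for i in range(0, len(v), 2)]


# One 2x2 block whose exact mean is 1.
top, bottom = [0, 1], [2, 1]
h_first = downsample_horizontal_first(top, bottom)
v_first = downsample_vertical_first(top, bottom)
print(h_first, v_first)  # [2] [1]
```

For two rows of width W, the horizontal-first order performs W horizontal pair averages (W/2 per row), while the vertical-first order performs only W/2. The two orders round at different points, which is why the outputs can differ by one and why the tests had to be adjusted. One plausible way to get the horizontal step from the named instructions, stated here as an assumption, is to sum adjacent bytes into 16-bit words with pmaddubsw against a vector of ones and then apply pavgw against zero.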
Sindre Aamås
563376df0c [UT] Test downsampling routines with a wider variety of height ratios 2016-05-25 14:16:29 +02:00
Sindre Aamås
4fec6d581e [UT] Test generic downsampling routines with a wider variety of width ratios
Get coverage of all code paths for routines that branch to different
paths for different scaling ratios.
2016-05-23 20:23:47 +02:00
Sindre Aamås
e490215990 [Processing/x86] Add an AVX2 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3/SSE4.1 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2,
~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios
within (4, 8], and ~5.06x/~4.09x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
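The remark in e490215990 that pshufb does not cross 128-bit lanes can be made concrete with a scalar model of the 256-bit form of the instruction. This is a sketch of the documented instruction semantics only; the function name and the sample mask are hypothetical and nothing here is taken from the routine itself.

```python
def vpshufb256(src, mask):
    """Scalar model of 256-bit vpshufb: an independent byte shuffle inside
    each 128-bit lane. A mask byte with its high bit set yields 0."""
    assert len(src) == 32 and len(mask) == 32
    out = []
    for lane in (0, 16):
        for m in mask[lane:lane + 16]:
            out.append(0 if m & 0x80 else src[lane + (m & 0x0F)])
    return out


src = list(range(32))
# Ask for "byte 20" in the low lane and "byte 3" in the high lane.
mask = [20] + [0x80] * 15 + [3] + [0x80] * 15
out = vpshufb256(src, mask)
print(out[0], out[16])  # 4 19
```

Only the low four bits of each mask byte are used as an index, and that index is relative to the lane the output byte lives in. A request for byte 20 from the low lane therefore returns byte 4, so each lane has to be loaded from an address that already places its source pixels within reach, which is consistent with the extra address calculations and loads the commit message mentions.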
Sindre Aamås
b43e58a366 [Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2,
~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios
within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
Sindre Aamås
b1013095b1 [Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4.

The use of blendps makes this require SSE4.1. The pshufb path can be
backported to SSSE3 and the generic path to SSE2 for a minor reduction
in performance by replacing blendps and preceding instructions with an
equivalent sequence.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2,
~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:39 +02:00
Sindre Aamås
1995e03d91 [Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:31 +02:00
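The pshufb-based extraction mentioned in 1995e03d91 can be sketched with a scalar model. The model of pshufb follows the documented instruction semantics; everything else (the 16.16 fixed-point step, the group size of four outputs, and the names gather_mask and ratio_q16) is a hypothetical choice for illustration and may not match how the routine tracks offsets.

```python
def pshufb(src, mask):
    """Scalar model of 128-bit pshufb: each output byte is src[mask & 15],
    or 0 when the mask byte has its high bit set."""
    assert len(src) == 16 and len(mask) == 16
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]


def gather_mask(ratio_q16, n_out):
    """Build a shuffle mask that selects, for each of n_out output pixels,
    the two neighbouring source bytes needed for horizontal interpolation.
    ratio_q16 is the horizontal step in 16.16 fixed point (assumed format)."""
    mask = []
    for i in range(n_out):
        x = (i * ratio_q16) >> 16  # integer source offset of output pixel i
        if x + 1 > 15:
            raise ValueError("offsets do not fit in one 16-byte load")
        mask += [x, x + 1]
    return mask + [0x80] * (16 - len(mask))  # zero the unused output bytes


src = list(range(100, 116))  # stand-in for 16 loaded source pixels
pairs = pshufb(src, gather_mask(3 << 16, 4))
print(pairs[:8])  # [100, 101, 103, 104, 106, 107, 109, 110]
```

In this sketch, four outputs at a ratio of 4 reach source offsets 0 to 13 and fit in one 16-byte load, while a ratio of 5 would need offset 16 and raises an error. That illustrates why a shuffle-based path has an upper ratio limit and needs a generic fallback, although the actual thresholds in the commit may be derived differently.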
Sindre Aamås
93db6511a8 [UT] Test VAA routines with a wider variety of resolutions
Test even and odd multiples of 32 width because some AVX2 routines
have conditional logic based on that.
2016-04-11 16:40:36 +02:00
Sindre Aamås
57fc3e9917 [Processing] Add AVX2 VAA routines
Process 8 lines at a time rather than 16 lines at a time because
this appears to give more reliable memory subsystem performance on
Haswell.

Speedup is > 2x as compared to SSE2 when not memory-bound on Haswell.
On my Haswell MBP, VAACalcSadSsdBgd is ~3x faster when uncached,
which appears to be related to processing 8 lines at a time as opposed
to 16 lines at a time. The other routines are also faster as compared
to the SSE2 routines in this case but to a lesser extent.
2016-04-11 16:09:56 +02:00
Guangwei Wang
64657d3cfd Add new C and assembly functions to optimize the downsampler when the downscale ratio is 1:3 or 1:4 2015-09-11 16:45:40 +08:00
Martin Storsjö
51efa57a3d Convert tabs to spaces in vertically aligned code 2015-06-10 10:21:29 +03:00
Martin Storsjö
43767cddb6 Remove tabs from commented out code 2015-06-10 10:21:21 +03:00
Martin Storsjö
dd913ef878 Don't use tabs for indentation in multi-line macros
The astyle configuration makes sure normal code is indented consistently
with 2 spaces, but astyle doesn't seem to touch the indentation in
these multi-line macros.
2015-05-13 22:06:54 +03:00
Martin Storsjö
acafbb442d Add checks for cpu features in tests
This allows running the tests on devices that don't have
all the SIMD instruction sets.
2015-01-24 22:47:23 +02:00
Sijia Chen
4e89e71e8f reformat cpp files for unit tests 2014-10-23 17:54:33 +08:00
ruil2
3ff145e839 rename namespace and function name to avoid conflicts with old library 2014-09-17 15:50:59 +08:00
zhuiling
0fe477625c improve py, and change mk according to mk 2014-09-12 10:25:46 +08:00
Martin Storsjö
c3710c4130 Avoid warnings about comparison between signed and unsigned
There's no need for these variables to be explicitly unsigned.
This also matches the function above.
2014-08-28 11:13:38 +03:00
zhiliang wang
93af7bfc64 Add UT for Downsample functions. 2014-08-27 15:40:14 +08:00
zhiliang wang
0163eb520d Add UT for VaaCalc Functions. 2014-08-27 13:53:18 +08:00
HFVideoMac
910c64ef22 add ARM64 Adaptive Quantization code and UT 2014-07-22 15:07:25 +08:00
Martin Storsjö
4f594deff9 Don't reset the random number generator within the unit tests
This makes sure we don't accidentally return the same sequence
of random numbers multiple times within one test (which would
be very non-random).

Every time srand(time()) is called, the pseudo random number
generator is initialized to the same value (as long as time()
returned the same value).

By initializing the random number generator once and for all
before starting to run the unit tests, we are sure we don't
need to reinitialize it within all the tests and all the
functions that use random numbers.

This fixes occasional errors in MotionEstimateTest.

MotionEstimateTest was designed to allow the test to occasionally
not succeed - if it didn't succeed, it tried again, up to 100 times.
However, since the YUVPixelDataGenerator function reset the random
seed to time(), every attempt actually ran with the same random
data (as long as all 100 attempts ran within 1 second) - thus if
one attempt in MotionEstimateTest failed, all 100 of them would
fail. If the utility functions don't touch the random seed,
this is not an issue.
2014-07-01 10:20:45 +03:00
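The failure mode described in 4f594deff9 can be reproduced with a small sketch. Python's random.seed stands in for srand, and a fixed integer stands in for the value returned by time(); make_test_data is a hypothetical helper, not the YUVPixelDataGenerator function from the test suite.

```python
import random


def make_test_data(n, now, reseed):
    """Hypothetical data generator; `now` stands in for the value of time()."""
    if reseed:
        random.seed(now)  # mirrors calling srand(time()) inside the helper
    return [random.randrange(256) for _ in range(n)]


now = 1404199245  # fixed stand-in timestamp; any whole-second value works

# Reseeding inside the helper: two attempts within the same second
# start from the same seed and so produce identical data.
a = make_test_data(8, now, reseed=True)
b = make_test_data(8, now, reseed=True)
print(a == b)  # True

# Seeding once up front: successive attempts continue the same sequence
# and so draw different data.
random.seed(now)
c = make_test_data(8, now, reseed=False)
d = make_test_data(8, now, reseed=False)
print(c == d)
```

With reseeding in the helper, a retry loop that finishes within one second repeats the same input on every attempt, so one failure implies that all attempts fail. Seeding once before the tests run lets each retry see new data.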
huili2
dc3fae4477 astyle all 2014-06-25 18:50:41 -07:00
Martin Storsjö
f99336d866 Don't compare a boolean to an integer
This avoids a warning when building with MSVC.
2014-05-04 14:53:36 +03:00
lyao2
4248cc9c42 fix typedef re-define issue 2014-04-30 16:46:58 +08:00
lyao2
1c90837001 remove typedef.h 2014-04-29 15:50:08 +08:00
lyao2
34ad719cf2 Squashed commit of the following:
commit f73d6cf0fcae5f401fc2817ab736af996113ca09
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Thu Apr 24 15:02:21 2014 +0800

    remove comments

commit 75416c2cf6c1ebb7aabf9e8c52d8c7163a8009b7
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Thu Apr 24 14:52:09 2014 +0800

    for test

commit 7dfb65ce514edcff892bfb3919921cadcce1d055
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Thu Apr 24 14:12:31 2014 +0800

    for test

commit eff771645e8c349dc4e454ab1751530b3cef18ed
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Thu Apr 24 10:51:34 2014 +0800

    for test

commit 9c42b9a7a04068e70be94529941f549b58e63780
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Wed Apr 23 17:46:59 2014 +0800

    update cpu_flag

commit cce3fccc0a4249b82ab2e0e92fe53579ef942799
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Wed Apr 23 17:26:56 2014 +0800

    for test

commit 3d292995b3c4437a2674a687cc4e8da1b5fb83f5
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Wed Apr 23 16:45:57 2014 +0800

    remove space

commit c608c2ba7cf010f1dcf8c0344f68536c48e181cb
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Wed Apr 23 16:42:43 2014 +0800

    remove tabs

commit 3b769342a06e25ad23a2c86f23a94d0d7ca1a4c8
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Wed Apr 23 16:33:55 2014 +0800

    refine UT case

commit 89b869f0c8f8c9bbd61e9de32caa77877aeae064
Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
Date:   Tue Apr 22 13:40:50 2014 +0800

    Squashed commit of the following:

    commit abe55494134ef8342ffe9566df4e1b3265fe21b6
    Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
    Date:   Tue Apr 22 10:50:07 2014 +0800

        set MV range

    commit 8c7f70c351e50d945c29118bed8b3781c22b7dbc
    Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
    Date:   Mon Apr 21 16:53:10 2014 +0800

        refinement

    commit bf35f19a7dc88743aacf8e89e681e0ef3302d40a
    Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
    Date:   Fri Apr 18 17:24:31 2014 +0800

        correct tabs

    commit 130b7f895d7020bfc571d910966891da93150242
    Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
    Date:   Fri Apr 18 17:17:06 2014 +0800

        correct format

    commit 0429703b0844363559dd2b3d44e45034232a9d8f
    Author: lyao2 <lyao2@LYAO2-WS01.cisco.com>
    Date:   Fri Apr 18 15:12:44 2014 +0800

        add scroll UT
2014-04-24 15:12:49 +08:00