Commit Graph

10 Commits

Author SHA1 Message Date
Sindre Aamås
b43e58a366 [Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2,
~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios
within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
Sindre Aamås
b1013095b1 [Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4.

The use of blendps makes this require SSE4.1. The pshufb path can be
backported to SSSE3 and the generic path to SSE2 for a minor reduction
in performance by replacing blendps and preceding instructions with an
equivalent sequence.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2,
~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:39 +02:00
Sindre Aamås
1995e03d91 [Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:31 +02:00
Guangwei Wang
64657d3cfd add new c and assembly functions to optimize downsampler when downscale equal 1:3/1:4 2015-09-11 16:45:40 +08:00
Martin Storsjö
51efa57a3d Convert tabs to spaces in vertically aligned code 2015-06-10 10:21:29 +03:00
Martin Storsjö
43767cddb6 Remove tabs from commented out code 2015-06-10 10:21:21 +03:00
Sijia Chen
4e89e71e8f reformat cpp files for unit tests 2014-10-23 17:54:33 +08:00
ruil2
3ff145e839 rename namespace and funciton name to avoid conflicts with old library 2014-09-17 15:50:59 +08:00
Martin Storsjö
c3710c4130 Avoid warnings about comparison between signed and unsigned
There's no need for these variables to be explicitly unsigned.
This also matches the function above.
2014-08-28 11:13:38 +03:00
zhiliang wang
93af7bfc64 Add UT for Downsample functions. 2014-08-27 15:40:14 +08:00