268 Commits

Author SHA1 Message Date
Karina
7c0ca2fc14 Use average downsampling first, then general downsampling, when the dst resolution is > 1/4 and < 1/2 of the source resolution 2016-06-17 10:30:47 +08:00
Martin Storsjö
e945654f06 Use assert.h instead of cassert
This fixes building for android differently than in f5e483ce.

On android, <cassert> isn't available in the normal include path,
only when the STL headers are available.

We intentionally avoid using STL within the main libopenh264.so, to
simplify dependency chains for users of the library (which otherwise
could run into conflicts if the surrounding app would want to use
a different STL implementation).

The previous fix only provided headers, not actually linking
against STL, so at this point it's not a real issue yet, but it's
still a very slippery slope towards accidentally starting to rely on
STL within the core library.

Instead explicitly avoid using STL within the core library, by not
even providing the include path.
2016-06-15 21:06:11 +03:00
HaiboZhu
2e6c9f7cd3 Merge pull request #2496 from saamas/processing-relax-downsample-buffer-size-requirement
[Processing] Relax downsample buffer size requirement
2016-06-15 10:31:53 +08:00
ruil2
4b6f037020 Merge pull request #2489 from saamas/processing-dyadic-bilinear-downsample-optimizations
[Processing] DyadicBilinearDownsample optimizations
2016-06-12 10:02:55 +08:00
Sindre Aamås
f183891c5b [Processing/x86] Use lddqu in case we still run on anything that benefits 2016-06-04 00:41:35 +02:00
Sindre Aamås
5a9c6db335 [Processing] Relax downsample buffer size requirement
AFAICT, it is sufficient that the sample buffer has space for half
the source width/height. With the current sample buffer size, this
enables its use for resolutions up to 3840x2176.
2016-06-03 15:14:09 +02:00
Sindre Aamås
68a5910f8f [Processing] Clear LSB before rounding up dyadic downsample width 2016-06-03 12:03:01 +02:00
Sindre Aamås
8a0af4a3f2 [Processing/x86] DyadicBilinearDownsample optimizations
Average vertically before horizontally; horizontal averaging is more
costly. Doing the vertical averaging first reduces the number of
horizontal averages by half.

Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.

Minor tweaks.

Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.

Adjust tests because averaging vertically first gives slightly different
output.

~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.

Note that the widthx16 routine can be unrolled for further speedup.
2016-06-02 13:44:28 +02:00
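
The effect of the averaging order described in 8a0af4a3f2 can be illustrated with a small scalar model. This is only a sketch, not the SIMD code from the commit: it assumes each stage is a rounding average, (a + b + 1) >> 1, which is what pavgb/pavgw compute per element, and the function names are made up for illustration. It shows both points made in the message: averaging vertically first halves the number of horizontal averages, and the output can differ slightly from the horizontal-first order.

```python
def avg(a, b):
    """Rounding average, (a + b + 1) >> 1, as pavgb/pavgw compute per element."""
    return (a + b + 1) >> 1


def downsample_horizontal_first(rows):
    """2:1 downsample in both directions, averaging horizontally first.

    Returns the output rows and the number of horizontal averages performed.
    """
    out, h_ops = [], 0
    for y in range(0, len(rows), 2):
        top, bottom = rows[y], rows[y + 1]
        t = [avg(top[x], top[x + 1]) for x in range(0, len(top), 2)]
        b = [avg(bottom[x], bottom[x + 1]) for x in range(0, len(bottom), 2)]
        h_ops += len(t) + len(b)  # both source rows are averaged horizontally
        out.append([avg(p, q) for p, q in zip(t, b)])
    return out, h_ops


def downsample_vertical_first(rows):
    """Same 2:1 downsample, but averaging vertically first."""
    out, h_ops = [], 0
    for y in range(0, len(rows), 2):
        column_avg = [avg(p, q) for p, q in zip(rows[y], rows[y + 1])]
        line = [avg(column_avg[x], column_avg[x + 1])
                for x in range(0, len(column_avg), 2)]
        h_ops += len(line)  # only the merged row is averaged horizontally
        out.append(line)
    return out, h_ops


# A 2x2 block where the two orders round differently.
block = [[0, 2],
         [1, 1]]
h_first, h_first_ops = downsample_horizontal_first(block)
v_first, v_first_ops = downsample_vertical_first(block)
print(h_first, h_first_ops)  # [[1]] 2
print(v_first, v_first_ops)  # [[2]] 1
```

The one-level difference on this block is the kind of deviation that would require the test adjustment mentioned in the commit message.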
Sindre Aamås
7cbb75eac6 [Processing] Pick dyadic downsample function based on stride
Assume that data can be written into the padding area following each
line. This enables the use of faster routines for more cases.

Align downsample buffer stride to a multiple of 32.

With this all strides used should be a multiple of 16, which means
that use of narrower downsample routines can be dropped altogether.
2016-06-02 13:44:28 +02:00
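
A minimal sketch of the stride-based selection described in 7cbb75eac6. The helper names and the dispatch thresholds are assumptions made for illustration; only the "widthx32"/"widthx16" labels are borrowed from routine names mentioned elsewhere in this log. The idea is that if the padding after each line may be overwritten, the routine can be chosen from the stride rather than from the visible width, and aligning the buffer stride to 32 keeps the widest routine usable.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def pick_dyadic_routine(stride):
    """Hypothetical dispatch: pick the widest routine the stride allows.

    Writing past the visible width is assumed to be safe as long as the
    write stays within the stride (that is, inside the line's padding).
    """
    if stride % 32 == 0:
        return "widthx32"
    if stride % 16 == 0:
        return "widthx16"
    return "generic"


# A 961-pixel-wide line gets a 992-byte stride, so the widest routine applies.
buffer_stride = align_up(961, 32)
print(buffer_stride, pick_dyadic_routine(buffer_stride))  # 992 widthx32
```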
Sindre Aamås
770e48ac2b [Processing] Remove unused align macros
The WELS_ALIGN macro here aliases the WELS_ALIGN macro in macros.h
which is inconvenient. Just remove these unused macros.
2016-06-02 13:44:28 +02:00
Sindre Aamås
e490215990 [Processing/x86] Add an AVX2 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3/SSE4.1 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2,
~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios
within (4, 8], and ~5.06x/~4.09x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
Sindre Aamås
b43e58a366 [Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2,
~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios
within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
Sindre Aamås
b1013095b1 [Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4.

The use of blendps makes this require SSE4.1. The pshufb path can be
backported to SSSE3 and the generic path to SSE2 for a minor reduction
in performance by replacing blendps and preceding instructions with an
equivalent sequence.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2,
~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:39 +02:00
Sindre Aamås
1995e03d91 [Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:31 +02:00
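
The pshufb-based extraction that the GeneralBilinear*Downsample commits describe can be modelled in scalar code. The sketch below is an illustration under stated assumptions, not the routine from these commits: the 16.16 fixed-point step, the four outputs per 16-byte load, the rounding, and all function names are hypothetical. Only the pshufb semantics (a byte is zeroed when bit 7 of its index is set, otherwise the low four bits select a source byte) are those of the real instruction. For ratios <= 4, the left and right neighbours of four consecutive outputs lie within 14 bytes of the first one, so a single 16-byte load plus one shuffle can gather them.

```python
def pshufb(src, mask):
    """Emulate pshufb on 16 bytes: zero if bit 7 of the index is set,
    otherwise pick src[index & 15]."""
    assert len(src) == 16 and len(mask) == 16
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]


def shuffle_mask(step_fp, frac_bits=16, outputs=4):
    """Relative byte offsets of the (left, right) source pixels per output."""
    mask = []
    for i in range(outputs):
        idx = (i * step_fp) >> frac_bits  # integer part of the source position
        mask += [idx, idx + 1]
    assert max(mask) < 16, "ratio too large for a single 16-byte load"
    return mask + [0x80] * (16 - len(mask))  # unused lanes are zeroed


def downsample_chunk(src16, step_fp, frac_bits=16, outputs=4):
    """Bilinear horizontal scaling of one 16-byte chunk (hypothetical rounding)."""
    pairs = pshufb(src16, shuffle_mask(step_fp, frac_bits, outputs))
    one = 1 << frac_bits
    out = []
    for i in range(outputs):
        frac = (i * step_fp) & (one - 1)  # fractional part of the position
        left, right = pairs[2 * i], pairs[2 * i + 1]
        out.append((left * (one - frac) + right * frac + (one >> 1)) >> frac_bits)
    return out


# Horizontal ratio 2.5 in 16.16 fixed point, applied to a linear ramp.
step = int(2.5 * 65536)
ramp = list(range(0, 160, 10))
print(shuffle_mask(step)[:8])        # [0, 1, 2, 3, 5, 6, 7, 8]
print(downsample_chunk(ramp, step))  # [0, 25, 50, 75]
```

Larger ratios spread the needed pixels beyond one 16-byte load, which is consistent with the commits falling back to a generic approach above their stated ratio limits.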
Sindre Aamås
cbaf087583 [Processing] Reduce duplication in downsampling wrappers 2016-05-23 13:19:17 +02:00
Karina
96b2a87030 Add one new downsampling algorithm 2016-05-16 09:28:19 +08:00
ruil2
56618249d7 Merge pull request #2436 from saamas/processing-add-avx2-vaa-routines
[Processing] Add AVX2 VAA routines
2016-04-28 09:08:03 +08:00
Karina
1ecb9582df update arm assembly comments 2016-04-14 14:57:21 +08:00
Karina
d34e209266 fix 32-bit parameters issue on arm64 assembly function 2016-04-13 19:30:08 +08:00
Sindre Aamås
57fc3e9917 [Processing] Add AVX2 VAA routines
Process 8 lines at a time rather than 16 lines at a time because
this appears to give more reliable memory subsystem performance on
Haswell.

Speedup is > 2x as compared to SSE2 when not memory-bound on Haswell.
On my Haswell MBP, VAACalcSadSsdBgd is ~3x faster when uncached,
which appears to be related to processing 8 lines at a time as opposed
to 16 lines at a time. The other routines are also faster as compared
to the SSE2 routines in this case but to a lesser extent.
2016-04-11 16:09:56 +02:00
unknown
3873addc3d fix frame size constraints for width and height 2016-02-01 15:55:53 +08:00
Martin Storsjö
c31e4e23f2 Fix indentation to consistently use spaces instead of tabs
Also get rid of other stray tabs in scripts.
2015-09-15 08:41:19 +03:00
Martin Storsjö
77bd41ca7e Fix building down_sample_neon.S with gnu binutils 2015-09-14 21:38:26 +03:00
Guangwei Wang
64657d3cfd Add new C and assembly functions to optimize the downsampler when the downscale ratio equals 1:3 or 1:4 2015-09-11 16:45:40 +08:00
Martin Storsjö
78e0ec6130 Convert tabs to spaces before comments 2015-06-10 10:22:29 +03:00
Martin Storsjö
3052b7ac64 Remove tabs from vertically aligned function declarations and typedefs 2015-06-10 10:22:13 +03:00
Martin Storsjö
764793d74b Remove tabs in struct and class definitions 2015-06-10 10:22:01 +03:00
Martin Storsjö
ca51ee0f44 Remove tabs where a simple space is just enough 2015-06-10 10:21:52 +03:00
Martin Storsjö
51efa57a3d Convert tabs to spaces in vertically aligned code 2015-06-10 10:21:29 +03:00
Martin Storsjö
723044837a Convert tabs to spaces in defines 2015-06-10 10:21:25 +03:00
Martin Storsjö
43767cddb6 Remove tabs from commented out code 2015-06-10 10:21:21 +03:00
Martin Storsjö
c134aa753a Remove unnecessary/pointless/accidental tabs from the middle of lines of code 2015-06-03 15:39:30 +03:00
Martin Storsjö
b052a9580e Convert tabs to spaces in code that looks like tables
Also fix the alignment in some related cases, even though they
didn't use tabs originally.
2015-06-03 13:26:36 +03:00
Martin Storsjö
df994fa3f5 Convert tabs to spaces in enums and tables of defines 2015-05-15 11:20:11 +03:00
Martin Storsjö
b05468b5c1 Convert tabs to spaces in multiline comments 2015-05-15 10:50:49 +03:00
Martin Storsjö
0ca7ff49e2 Convert tabs to spaces in assignment of SIMD function pointers 2015-05-14 14:07:49 +03:00
Martin Storsjö
95ac72754e Convert tabs to spaces in .def files
The three def files in the project currently use tabs very inconsistently.
2015-05-14 13:58:44 +03:00
Martin Storsjö
d152c25485 Remove tabs from the copyright/license section in file headers 2015-05-14 13:58:40 +03:00
Martin Storsjö
7a80c21526 Reformat tables without tabs 2015-05-13 22:06:58 +03:00
Martin Storsjö
dd913ef878 Don't use tabs for indentation in multi-line macros
The astyle configuration makes sure normal code is indented consistently
with 2 spaces, but astyle doesn't seem to touch the indentation in
these multi-line macros.
2015-05-13 22:06:54 +03:00
Martin Storsjö
f324c354b1 Remove unnecessary double spaces and tabs in ifdef directives 2015-04-29 15:34:38 +03:00
Martin Storsjö
0995390c4a Remove apple specific versions of arm macros with arguments
The apple assembler for arm can handle the gnu binutils style
macros just fine these days, so there is no need to duplicate all
of these macros in two syntaxes, when the new one works fine in all cases.

We already require a new enough assembler to support the gnu binutils
style features since we use the .rept directive in a few places.
2015-03-27 11:11:45 +02:00
Martin Storsjö
d8202cf38f Remove apple specific versions of arm64 macros with arguments
The apple assembler for arm64 can handle the gnu binutils style
macros just fine, so there is no need to duplicate all of these
macros in two syntaxes, when the new one works fine in all cases.

We already require a new enough assembler to support the gnu binutils
style features since we use the .rept directive in a few places.
2015-03-27 11:11:23 +02:00
Martin Storsjö
0b0884874d Remove superfluous .text directives at the start of arm assembly files
This directive can be set by the common include header that is
included by all files anyway.
2015-03-27 10:46:34 +02:00
Martin Storsjö
b98e7c1f7d Rename a vcproj folder to camelcase, to match all other folders in the same project 2015-03-25 11:46:41 +02:00
Sijia Chen
431bcee310 1. Update the max-nal-size setting in the UT and param check, since we are using a larger input check
2. Fix potential overflow (will change bs, but with little impact on bs)
2015-02-06 13:24:20 +08:00
Martin Storsjö
a3063531c4 Remove accidental double semicolons 2015-02-02 09:20:35 +02:00
ruil2
5b5cc8434e add rc function 2015-01-15 11:14:05 +08:00
dong zhang
b18e905946 Check and Fix some issue#1535 2014-11-19 16:03:34 +08:00
Martin Storsjö
b17e9bb320 Make nasm commands in vcproj files consistent
Some commands had different spacing than others, and commands
for some files had accidentally missed a few parameters.
2014-10-22 10:14:22 +03:00