2691 Commits

Author SHA1 Message Date
HaiboZhu
60cbb77583 Merge pull request #2500 from ruil2/downsampling
use average downsampling first, then general downsampling
2016-06-21 10:11:44 +08:00
Karina
7c0ca2fc14 use average downsampling first, then general downsampling, when dst resolution > 1/4 source resolution and dst resolution < 1/2 source resolution 2016-06-17 10:30:47 +08:00
Sindre Aamås
0f7b8365b9 [Encoder] Avoid valgrind downsampling false positives
X86 SIMD downsampling routines may, for convenience, read slightly beyond
the input data and into the alignment padding area beyond each line. This
causes valgrind to warn about uninitialized values even if these values
only affect lanes of a SIMD vector that are effectively never used.

Avoid these false positives by zero-initializing the padding area beyond
each line of the source buffer used for downsampling.
2016-06-16 21:19:17 +02:00
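A minimal sketch of the zero-initialization described in the commit above, assuming a hypothetical line-padded buffer layout (the function name and parameters are illustrative, not the library's actual API):

    // Zero the padding bytes between the end of each line's pixel data and the
    // end of its stride, so SIMD over-reads only ever touch initialized bytes.
    #include <cstdint>
    #include <cstring>

    void ZeroLinePadding (uint8_t* pBuf, int32_t iWidth, int32_t iStride, int32_t iHeight) {
      for (int32_t y = 0; y < iHeight; ++y) {
        uint8_t* pLine = pBuf + y * iStride;
        memset (pLine + iWidth, 0, iStride - iWidth);  // padding area beyond the line
      }
    }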
Martin Storsjö
e945654f06 Use assert.h instead of cassert
This fixes building for Android differently than in f5e483ce.

On Android, <cassert> isn't available in the normal include path,
only when the STL headers are available.

We intentionally avoid using STL within the main libopenh264.so, to
simplify dependency chains for users of the library (which otherwise
could run into conflicts if the surrounding app would want to use
a different STL implementation).

The previous fix only provided the headers, without actually linking
against STL, so at this point it's not a real issue yet, but it's
still a very slippery slope towards accidentally starting to rely on
STL within the core library.

Instead explicitly avoid using STL within the core library, by not
even providing the include path.
2016-06-15 21:06:11 +03:00
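For illustration, the C-style header keeps assertions available without touching the C++ STL include path; a minimal sketch (the clamp helper is hypothetical):

    // Use the C header directly so the core library never depends on STL headers.
    #include <assert.h>   // not <cassert>, which needs the STL include path on Android

    static int32_t ClampValue (int32_t iVal, int32_t iLow, int32_t iHigh) {
      assert (iLow <= iHigh);
      return iVal < iLow ? iLow : (iVal > iHigh ? iHigh : iVal);
    }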
HaiboZhu
2e6c9f7cd3 Merge pull request #2496 from saamas/processing-relax-downsample-buffer-size-requirement
[Processing] Relax downsample buffer size requirement
2016-06-15 10:31:53 +08:00
HaiboZhu
d35647ec3b Merge pull request #2491 from ruil2/nalsize
add nalsize checking UT and fix nalsize control when CABAC is on
2016-06-15 10:24:18 +08:00
HaiboZhu
151a7ff643 Merge pull request #2490 from sijchen/refactor_ref4
[Encoder] refactor: avoid using only idx0 in syntax writing; for now it has no impact on the bitstream
2016-06-15 10:23:38 +08:00
HaiboZhu
84a7669b63 Merge pull request #2464 from bumblebritches57/MVC
MVC aka Stereoscopic 3D support
2016-06-15 10:05:15 +08:00
ruil2
4b6f037020 Merge pull request #2489 from saamas/processing-dyadic-bilinear-downsample-optimizations
[Processing] DyadicBilinearDownsample optimizations
2016-06-12 10:02:55 +08:00
Karina
b5cef5d49c modify reserved nal header size and change source frame in NalSizeChecking UT 2016-06-08 10:12:27 +08:00
Karina
40f4fc05bb get each spatial layer qp 2016-06-06 17:13:22 +08:00
Karina
c1255451d7 use the correct frametype in statistics info 2016-06-06 17:06:56 +08:00
ruil2
106d13d26c Merge pull request #2492 from saamas/processing-x86-downsample-use-lddqu
[Processing/x86] Use lddqu in case we still run on anything that benefits
2016-06-06 12:46:55 +08:00
Sindre Aamås
f183891c5b [Processing/x86] Use lddqu in case we still run on anything that benefits 2016-06-04 00:41:35 +02:00
Sindre Aamås
5a9c6db335 [Processing] Relax downsample buffer size requirement
AFAICT, it is sufficient that the sample buffer has space for half
the source width/height. With the current sample buffer size, this
enables its use for resolutions up to 3840x2176.
2016-06-03 15:14:09 +02:00
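As a rough worked example of the relaxed requirement (the exact buffer layout is an assumption here, not taken from the code): a half-resolution luma plane for a 3840x2176 source is 1920 x 1088 = 2,088,960 samples, so a buffer sized for half the source dimensions covers 4K-class input.

    // Hedged sketch of the implied size check: the sample buffer must hold at
    // least one half-resolution plane of the source picture.
    #include <cstdint>

    bool BufferBigEnough (int32_t iSrcWidth, int32_t iSrcHeight, int32_t iBufSizeInSamples) {
      const int32_t kiHalfW = (iSrcWidth + 1) / 2;    // round odd widths up
      const int32_t kiHalfH = (iSrcHeight + 1) / 2;
      return kiHalfW * kiHalfH <= iBufSizeInSamples;  // 1920 * 1088 for a 3840x2176 source
    }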
Sindre Aamås
68a5910f8f [Processing] Clear LSB before rounding up dyadic downsample width 2016-06-03 12:03:01 +02:00
Karina
2171d84f1e add nalsize checking UT and fix nalsize control when CABAC is on 2016-06-03 17:36:14 +08:00
ruil2
3eba80765c Merge pull request #2487 from sijchen/refactor_ref31
[Encoder] Preprocess: refactor to improve code readability
2016-06-03 13:39:04 +08:00
Karina
4f41c3a5bf fix codingIdx update issue 2016-06-02 21:17:31 +08:00
Sindre Aamås
8a0af4a3f2 [Processing/x86] DyadicBilinearDownsample optimizations
Average vertically before horizontally; horizontal averaging is more
costly. Doing the vertical averaging first reduces the number of
horizontal averages by half.

Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.

Minor tweaks.

Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.

Adjust tests because averaging vertically first gives slightly different
output.

~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.

Note that the widthx16 routine can be unrolled for further speedup.
2016-06-02 13:44:28 +02:00
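A scalar sketch of the vertical-first ordering described above; the SIMD routines follow the same idea, and the rounding here is illustrative rather than bit-exact (as the commit notes, changing the averaging order changes the output slightly):

    // Illustrative 2:1 (dyadic) bilinear downsample, vertical pass first, so the
    // costlier horizontal averaging runs on half as many samples.
    #include <cstdint>

    void DyadicDownsampleScalar (uint8_t* pDst, int32_t iDstStride,
                                 const uint8_t* pSrc, int32_t iSrcStride,
                                 int32_t iSrcWidth, int32_t iSrcHeight) {
      for (int32_t y = 0; y < iSrcHeight / 2; ++y) {
        const uint8_t* pRow0 = pSrc + (2 * y) * iSrcStride;
        const uint8_t* pRow1 = pRow0 + iSrcStride;
        for (int32_t x = 0; x < iSrcWidth / 2; ++x) {
          // Vertical averages of the two source rows at columns 2x and 2x+1 ...
          const int32_t iV0 = (pRow0[2 * x]     + pRow1[2 * x]     + 1) >> 1;
          const int32_t iV1 = (pRow0[2 * x + 1] + pRow1[2 * x + 1] + 1) >> 1;
          // ... then a single horizontal average per output pixel.
          pDst[y * iDstStride + x] = (uint8_t) ((iV0 + iV1 + 1) >> 1);
        }
      }
    }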
Sindre Aamås
7cbb75eac6 [Processing] Pick dyadic downsample function based on stride
Assume that data can be written into the padding area following each
line. This enables the use of faster routines for more cases.

Align downsample buffer stride to a multiple of 32.

With this all strides used should be a multiple of 16, which means
that use of narrower downsample routines can be dropped altogether.
2016-06-02 13:44:28 +02:00
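A small sketch of the stride rounding mentioned above (the macro name is hypothetical):

    // Round a stride up to the next multiple of 32 so every derived stride stays
    // a multiple of 16 and the wide downsample routines can always be used.
    #define ALIGN_UP(x, n) (((x) + (n) - 1) & ~((n) - 1))   // n must be a power of two

    // e.g. ALIGN_UP (1500, 32) == 1504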
Sindre Aamås
770e48ac2b [Processing] Remove unused align macros
The WELS_ALIGN macro here aliases the WELS_ALIGN macro in macros.h
which is inconvenient. Just remove these unused macros.
2016-06-02 13:44:28 +02:00
sijchen@cisco.com
a7ae1efc3a add back the missing part after merging and formatting 2016-06-01 21:33:33 -07:00
sijchen@cisco.com
8bacc3d4d0 Preprocess: refactor to improve code readability 2016-06-01 21:26:24 -07:00
sijchen@cisco.com
8537a9274d fix a problem 2016-06-01 09:21:12 -07:00
sijchen@cisco.com
a9601cdc59 refactor to avoid using only idx0 in syntax writing; for now it has no impact on the bitstream, but may benefit future usage 2016-06-01 09:21:12 -07:00
Karina
268a0eb6f4 remove redundant initialization 2016-06-01 10:52:51 +08:00
HaiboZhu
515eeb41e4 Merge pull request #2481 from ruil2/maxbitrate1
fix iContinualSkipFrames calculation
2016-06-01 09:03:57 +08:00
HaiboZhu
7ccc377d55 Merge pull request #2480 from ruil2/fix
fix wrongly removed parameter setting
2016-06-01 09:03:49 +08:00
ruil2
2d3fc37a07 Merge pull request #2484 from sijchen/refactor_preprocess13
[Encoder] Refactor: add class for different preprocess strategies
2016-06-01 08:31:02 +08:00
Karina
87e81a7a40 use the same name to avoid confusion. 2016-06-01 08:21:03 +08:00
sijchen@cisco.com
03863ae4c6 different preprocess strategies actually use different source picture management 2016-05-31 14:36:21 -07:00
sijchen@cisco.com
a1cae49732 add class for different preprocess strategies 2016-05-31 13:48:45 -07:00
sijchen
c29da290b9 Merge pull request #2479 from ruil2/refine_rc1
get the correct did for savc case
2016-05-31 10:58:38 -07:00
Karina
dd021b6ca8 fix iContinualSkipFrames calculation 2016-05-31 21:01:11 +08:00
Karina
8effa45edd fix wrongly removed parameter setting 2016-05-31 20:46:13 +08:00
Karina
64ad70b0ea get the correct did for savc case 2016-05-31 17:35:20 +08:00
HaiboZhu
df77a5d587 Merge pull request #2478 from ruil2/refine_rc1
refine RC
2016-05-31 17:20:46 +08:00
Karina
4fc2b1f636 refine RC 2016-05-31 16:44:04 +08:00
Karina
7f2ba4dcb6 add savc setting in configure file and command line 2016-05-31 13:53:31 +08:00
Karina
e3c306608c fix dependency ID mapping issue 2016-05-30 15:03:39 +08:00
ruil2
39c2fb3d6b Merge pull request #2472 from saamas/processing-x86-general-bilinear-downsample-optimizations
[Processing/x86] GeneralBilinearDownsample optimizations
2016-05-27 15:17:31 +08:00
HaiboZhu
c17a58efdf Merge pull request #2473 from ruil2/update_interface
modify the interface to use an independent subseqID for each layer
2016-05-25 10:00:13 +08:00
Karina
2ef9613e55 avoid overflow 2016-05-24 13:25:05 +08:00
Sindre Aamås
e490215990 [Processing/x86] Add an AVX2 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3/SSE4.1 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.

Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2,
~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios
within (4, 8], and ~5.06x/~4.09x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
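For context, a scalar sketch of the fixed-point horizontal stepping that these SIMD routines accelerate; the 16.16 precision, the starting offset, and the names are assumptions, not the exact routine:

    // Illustrative horizontal bilinear scaling: the source position is tracked as
    // an integer pixel offset plus a fraction, and each output pixel blends two
    // neighbouring source pixels by that fraction.
    #include <cstdint>

    void ScaleRowBilinear (uint8_t* pDst, int32_t iDstWidth,
                           const uint8_t* pSrc, int32_t iSrcWidth) {
      const uint32_t kuiStep = ((uint32_t) iSrcWidth << 16) / (uint32_t) iDstWidth;
      uint32_t uiPos = 0;                            // 16.16 fixed-point source position
      for (int32_t x = 0; x < iDstWidth; ++x) {
        const uint32_t uiInt  = uiPos >> 16;         // integer pixel offset
        const uint32_t uiFrac = uiPos & 0xFFFF;      // fractional weight
        const uint32_t uiA = pSrc[uiInt];
        const uint32_t uiB = pSrc[uiInt + 1 < (uint32_t) iSrcWidth ? uiInt + 1 : uiInt];
        pDst[x] = (uint8_t) ((uiA * (65536 - uiFrac) + uiB * uiFrac + 32768) >> 16);
        uiPos += kuiStep;
      }
    }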
Sindre Aamås
b43e58a366 [Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.

Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2,
~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios
within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
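The 128-bit lane restriction mentioned in both AVX2 commits can be illustrated with intrinsics: _mm256_shuffle_epi8 shuffles each 128-bit half independently, so gathering pixels across lanes needs extra address arithmetic or cross-lane permutes (this snippet only illustrates the instruction's behaviour):

    #include <immintrin.h>

    // vpshufb: byte i of each 128-bit lane of the result is selected from the
    // same lane of vPixels; indices cannot reach into the other half.
    __m256i ShufflePerLane (__m256i vPixels, __m256i vIndices) {
      return _mm256_shuffle_epi8 (vPixels, vIndices);
    }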
Sindre Aamås
b1013095b1 [Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4.

The use of blendps makes this require SSE4.1. The pshufb path can be
backported to SSSE3 and the generic path to SSE2 for a minor reduction
in performance by replacing blendps and preceding instructions with an
equivalent sequence.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.

Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2,
~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:39 +02:00
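As a hint of the backport path described above, a constant-mask blendps can be replaced by an SSE2-era and/andnot/or sequence; this standalone sketch is not the routine itself:

    #include <xmmintrin.h>

    // SSE equivalent of blending b over a under mask (mask lanes all-ones select b).
    __m128 BlendWithMask (__m128 a, __m128 b, __m128 mask) {
      return _mm_or_ps (_mm_and_ps (mask, b), _mm_andnot_ps (mask, a));
    }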
Sindre Aamås
1995e03d91 [Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.

Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:31 +02:00
Sindre Aamås
cbaf087583 [Processing] Reduce duplication in downsampling wrappers 2016-05-23 13:19:17 +02:00
ruil2
c96c8b05a8 Merge pull request #2468 from sijchen/refactor_pre
[Encoder] Refactor: create different functions for different cases to make the logic clean
2016-05-23 13:21:40 +08:00