Commit Graph

4338 Commits

Author SHA1 Message Date
Karina
c1255451d7 use the correct frametype in statistics info 2016-06-06 17:06:56 +08:00
Karina
02218e2dbd modify configure file comments 2016-06-06 16:22:09 +08:00
ruil2
106d13d26c Merge pull request #2492 from saamas/processing-x86-downsample-use-lddqu
[Processing/x86] Use lddqu in case we still run on anything that benefits
2016-06-06 12:46:55 +08:00
Sindre Aamås
f183891c5b [Processing/x86] Use lddqu in case we still run on anything that benefits 2016-06-04 00:41:35 +02:00
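
For context, lddqu is an SSE3 unaligned-load instruction that only pays off on some older microarchitectures and behaves like a plain movdqu elsewhere. A minimal C++ intrinsics sketch of the substitution (illustrative only, not the project's assembly):

    #include <pmmintrin.h>  // SSE3: _mm_lddqu_si128
    #include <stdint.h>

    // Unaligned 16-byte load via lddqu instead of movdqu; equivalent on current
    // CPUs, potentially cheaper for cache-line-splitting loads on older ones,
    // which is the case this commit caters to.
    static inline __m128i LoadUnaligned16 (const uint8_t* p) {
      return _mm_lddqu_si128 (reinterpret_cast<const __m128i*> (p));
    }
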
Sindre Aamås
5a9c6db335 [Processing] Relax downsample buffer size requirement
AFAICT, it is sufficient that the sample buffer has space for half
the source width/height. With the current sample buffer size, this
enables its use for resolutions up to 3840x2176.
2016-06-03 15:14:09 +02:00
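
A quick sanity check of the claim in plain C++ (the function name and buffer dimensions are illustrative, not the actual code): half of 3840x2176 is 1920x1088, so a sample buffer with that much room suffices.

    #include <cassert>

    // Illustrative check of the relaxed requirement: the sample buffer only needs
    // room for half of the source width/height (rounded up).
    bool FitsDownsampleBuffer (int iSrcWidth, int iSrcHeight, int iBufWidth, int iBufHeight) {
      return (iSrcWidth + 1) / 2 <= iBufWidth && (iSrcHeight + 1) / 2 <= iBufHeight;
    }

    int main () {
      assert (FitsDownsampleBuffer (3840, 2176, 1920, 1088));  // half is 1920x1088
      return 0;
    }
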
Sindre Aamås
68a5910f8f [Processing] Clear LSB before rounding up dyadic downsample width 2016-06-03 12:03:01 +02:00
Karina
2171d84f1e add nalsize checking UT and fix nalsize control when cabac on 2016-06-03 17:36:14 +08:00
ruil2
3eba80765c Merge pull request #2487 from sijchen/refactor_ref31
[Encoder] Preprocess: refactor to improve code readability
2016-06-03 13:39:04 +08:00
sijchen
1fa02f6b07 Merge pull request #2488 from ruil2/codingIdx1
fix codingIdx update issue
2016-06-02 10:00:56 -07:00
Karina
4f41c3a5bf fix codingIdx update issue 2016-06-02 21:17:31 +08:00
Sindre Aamås
8a0af4a3f2 [Processing/x86] DyadicBilinearDownsample optimizations
Average vertically before horizontally; horizontal averaging is more
costly. Doing the vertical averaging first halves the number of
horizontal averages.

Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.

Minor tweaks.

Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.

Adjust tests because averaging vertically first gives slightly different
output.

~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.

Note that the widthx16 routine can be unrolled for further speedup.
2016-06-02 13:44:28 +02:00
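
A rough C++ intrinsics sketch of the reordered averaging (illustrative only, not the actual assembly, which also handles strides, tails and unrolling; the vertical-first order is also why the expected test output changed slightly):

    #include <emmintrin.h>   // SSE2
    #include <tmmintrin.h>   // SSSE3: _mm_maddubs_epi16 (pmaddubsw)
    #include <stdint.h>

    // 2x2 dyadic downsample of 16 source columns: average the two rows first
    // (one pavgb), then average horizontally with pmaddubsw + pavgw.
    static inline __m128i DyadicAvg16 (const uint8_t* pRow0, const uint8_t* pRow1) {
      const __m128i kOnes = _mm_set1_epi8 (1);
      const __m128i kZero = _mm_setzero_si128 ();
      __m128i r0 = _mm_loadu_si128 (reinterpret_cast<const __m128i*> (pRow0));
      __m128i r1 = _mm_loadu_si128 (reinterpret_cast<const __m128i*> (pRow1));
      __m128i v  = _mm_avg_epu8 (r0, r1);          // vertical average, rounded
      __m128i s  = _mm_maddubs_epi16 (v, kOnes);   // sum adjacent byte pairs -> 16-bit
      __m128i h  = _mm_avg_epu16 (s, kZero);       // (sum + 1) >> 1, rounded
      return _mm_packus_epi16 (h, h);              // 8 output pixels in the low half
    }
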
Sindre Aamås
7cbb75eac6 [Processing] Pick dyadic downsample function based on stride
Assume that data can be written into the padding area following each
line. This enables the use of faster routines for more cases.

Align downsample buffer stride to a multiple of 32.

With this all strides used should be a multiple of 16, which means
that use of narrower downsample routines can be dropped altogether.
2016-06-02 13:44:28 +02:00
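
A minimal sketch of the stride handling, assuming a hypothetical helper in place of the project's alignment macro: rounding the downsample buffer stride up to a multiple of 32 guarantees that every stride seen by the SIMD routines is a multiple of 16, so the narrower routines become unnecessary.

    // Hypothetical stand-in for the project's alignment macro.
    static inline int RoundUpTo (int x, int n) { return (x + n - 1) & ~(n - 1); }

    int AlignedDownsampleStride (int iWidth) {
      return RoundUpTo (iWidth, 32);  // e.g. 1910 -> 1920, 1920 stays 1920
    }
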
Sindre Aamås
770e48ac2b [Processing] Remove unused align macros
The WELS_ALIGN macro here aliases the WELS_ALIGN macro in macros.h,
which is inconvenient. Just remove these unused macros.
2016-06-02 13:44:28 +02:00
sijchen@cisco.com
a7ae1efc3a add back the missing part after merging and formatting 2016-06-01 21:33:33 -07:00
sijchen@cisco.com
8bacc3d4d0 Preprocess: refactor to improve code readability 2016-06-01 21:26:24 -07:00
sijchen
f6b6a0f6aa Merge pull request #2485 from ruil2/init
remove redundant initialization
2016-06-01 09:28:02 -07:00
sijchen@cisco.com
8537a9274d fix a problem 2016-06-01 09:21:12 -07:00
sijchen@cisco.com
a9601cdc59 refactor to avoid using only idx0 in syntax writing; for now this has no impact on the bitstream, but it may benefit future usage 2016-06-01 09:21:12 -07:00
Karina
268a0eb6f4 remove redundant initialization 2016-06-01 10:52:51 +08:00
HaiboZhu
515eeb41e4 Merge pull request #2481 from ruil2/maxbitrate1
fix iContinualSkipFrames calculation
2016-06-01 09:03:57 +08:00
HaiboZhu
7ccc377d55 Merge pull request #2480 from ruil2/fix
fix parameter setting that was wrongly removed
2016-06-01 09:03:49 +08:00
ruil2
2d3fc37a07 Merge pull request #2484 from sijchen/refactor_preprocess13
[Encoder] Refactor: add class for diff preprocess strategy
2016-06-01 08:31:02 +08:00
Karina
87e81a7a40 use the same name to avoid confusion. 2016-06-01 08:21:03 +08:00
sijchen@cisco.com
03863ae4c6 different preprocess strategies actually use different source picture management 2016-05-31 14:36:21 -07:00
sijchen@cisco.com
a1cae49732 add class for diff preprocess strategy 2016-05-31 13:48:45 -07:00
sijchen
c29da290b9 Merge pull request #2479 from ruil2/refine_rc1
get the correct did for savc case
2016-05-31 10:58:38 -07:00
Karina
dd021b6ca8 fix iContinualSkipFrames calculation 2016-05-31 21:01:11 +08:00
Karina
8effa45edd fix parameter setting that was wrongly removed 2016-05-31 20:46:13 +08:00
Karina
64ad70b0ea get the correct did for savc case 2016-05-31 17:35:20 +08:00
HaiboZhu
df77a5d587 Merge pull request #2478 from ruil2/refine_rc1
refine RC
2016-05-31 17:20:46 +08:00
Karina
4fc2b1f636 refine RC 2016-05-31 16:44:04 +08:00
HaiboZhu
3f199f92a9 Merge pull request #2477 from ruil2/add_param_configure
add savc setting in configure file and command line
2016-05-31 16:33:40 +08:00
Karina
7f2ba4dcb6 add savc setting in configure file and command line 2016-05-31 13:53:31 +08:00
HaiboZhu
1d2b52e4cc Merge pull request #2476 from ruil2/did1
fix dependency ID mapping issue
2016-05-31 11:08:16 +08:00
Karina
e3c306608c fix dependency ID mapping issue 2016-05-30 15:03:39 +08:00
ruil2
39c2fb3d6b Merge pull request #2472 from saamas/processing-x86-general-bilinear-downsample-optimizations
[Processing/x86] GeneralBilinearDownsample optimizations
2016-05-27 15:17:31 +08:00
Sindre Aamås
563376df0c [UT] Test downsampling routines with a wider variety of height ratios 2016-05-25 14:16:29 +02:00
HaiboZhu
c17a58efdf Merge pull request #2473 from ruil2/update_interface
modify the interface to use an independent subseqID for each layer
2016-05-25 10:00:13 +08:00
HaiboZhu
780101fcfd Merge pull request #2474 from ruil2/overflow
avoid overflow
2016-05-25 09:59:36 +08:00
Karina
2ef9613e55 avoid overflow 2016-05-24 13:25:05 +08:00
Sindre Aamås
4fec6d581e [UT] Test generic downsampling routines with a wider variety of width ratios
Get coverage of all code paths for routines that branch to different
paths for different scaling ratios.
2016-05-23 20:23:47 +02:00
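
The paths in question branch on the horizontal scaling ratio (see the downsampler commits below: a shuffle path for ratios up to 4 or 8 depending on the variant, and a generic fallback beyond that). A hypothetical set of width pairs that straddles those thresholds:

    #include <utility>
    #include <vector>

    // Source/destination width pairs whose ratios sit on and around the branch
    // thresholds, so every code path gets exercised (values are illustrative).
    std::vector<std::pair<int, int>> WidthRatioCases () {
      return {
        {640, 480},   // ratio ~1.33
        {1280, 320},  // ratio 4  (SSSE3/SSE4.1 shuffle-path boundary)
        {1280, 256},  // ratio 5
        {2560, 320},  // ratio 8  (AVX2 shuffle-path boundary)
        {2560, 256},  // ratio 10 (generic fallback)
      };
    }
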
Sindre Aamås
e490215990 [Processing/x86] Add an AVX2 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3/SSE4.1 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.

Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2,
~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios
within (4, 8], and ~5.06x/~4.09x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
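
As a reference for what these routines compute, here is a scalar C++ sketch of the horizontal pass with a fixed-point offset accumulator (the 16.16 format, rounding, and names are assumptions for illustration; the vectorized versions track these offsets to build pshufb masks when the ratio permits, and fall back to a generic path otherwise):

    #include <stdint.h>

    // Horizontal bilinear downscale of one row, scalar reference.
    // uiStep is the source/destination width ratio in 16.16 fixed point.
    void ScaleRowBilinear (uint8_t* pDst, int iDstWidth,
                           const uint8_t* pSrc, uint32_t uiStep) {
      uint32_t uiPos = 0;
      for (int i = 0; i < iDstWidth; ++i) {
        uint32_t uiFrac = uiPos & 0xFFFF;            // fraction = right-tap weight
        const uint8_t* p = pSrc + (uiPos >> 16);     // integer part = left tap
        uint32_t uiVal = p[0] * (0x10000 - uiFrac) + p[1] * uiFrac;
        pDst[i] = (uint8_t) ((uiVal + 0x8000) >> 16);  // round to nearest
        uiPos += uiStep;
      }
    }
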
Sindre Aamås
b43e58a366 [Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.

Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2,
~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios
within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
Sindre Aamås
b1013095b1 [Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4.

The use of blendps makes this require SSE4.1. The pshufb path can be
backported to SSSE3 and the generic path to SSE2 for a minor reduction
in performance by replacing blendps and preceding instructions with an
equivalent sequence.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.

Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2,
~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:39 +02:00
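
On the blendps point: a lane select can be composed from and/andnot/or on pre-SSE4.1 hardware, which is the kind of equivalent sequence the backport would use (illustrative only; as the commit notes, the instructions feeding the blend would also need reworking):

    #include <xmmintrin.h>  // SSE: _mm_and_ps / _mm_andnot_ps / _mm_or_ps

    // Select b where mask bits are set, a elsewhere; a stand-in for a blend
    // on hardware without SSE4.1 blendps.
    static inline __m128 SelectPs (__m128 a, __m128 b, __m128 mask) {
      return _mm_or_ps (_mm_andnot_ps (mask, a), _mm_and_ps (mask, b));
    }
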
Sindre Aamås
1995e03d91 [Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.

Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:31 +02:00
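
A hypothetical dispatch sketch for the ratio cutoff described here (names are not the project's API): the pshufb path is only taken when the horizontal ratio fits its gather window; beyond that the generic path runs, which, as noted, could also be backported to SSE2.

    #include <stdint.h>

    typedef void (*RowScaler) (uint8_t* pDst, int iDstWidth,
                               const uint8_t* pSrc, uint32_t uiStep);

    // Pick the row scaler based on the horizontal downscale ratio.
    RowScaler PickRowScaler (int iSrcWidth, int iDstWidth,
                             RowScaler pShuffle, RowScaler pGeneric) {
      return (iSrcWidth <= 4 * iDstWidth) ? pShuffle : pGeneric;  // ratio <= 4 here
    }
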
Sindre Aamås
cbaf087583 [Processing] Reduce duplication in downsampling wrappers 2016-05-23 13:19:17 +02:00
ruil2
c96c8b05a8 Merge pull request #2468 from sijchen/refactor_pre
[Encoder] Refactor: create different functions for different cases to make the logic clean
2016-05-23 13:21:40 +08:00
HaiboZhu
685b6144a5 Merge pull request #2469 from ruil2/fix_bitrate
add GetBsPostion for cabac and cavlc
2016-05-23 09:49:45 +08:00
Karina
9b2dd55324 add GetBsPostion for cabac and cavlc 2016-05-20 14:29:48 +08:00
sijchen
27e803f6f4 refactor to make logic clean 2016-05-19 09:42:39 -07:00