Commit Graph

711 Commits

Author SHA1 Message Date
ruil2
4b6f037020 Merge pull request #2489 from saamas/processing-dyadic-bilinear-downsample-optimizations
[Processing] DyadicBilinearDownsample optimizations
2016-06-12 10:02:55 +08:00
Sindre Aamås
fe4a47a979 [UT] Add comment on X86_ASM checksum ifdef 2016-06-08 21:53:30 +02:00
Karina
4f41c3a5bf fix codingIdx update issue 2016-06-02 21:17:31 +08:00
Sindre Aamås
8a0af4a3f2 [Processing/x86] DyadicBilinearDownsample optimizations
Average vertically before horizontally; horizontal averaging is more
worksome. Doing the vertical averaging first reduces the number of
horizontal averages by half.

Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.

Minor tweaks.

Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.

Adjust tests because averaging vertically first gives slightly different
output.

~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.

Note that the widthx16 routine can be unrolled for further speedup.
2016-06-02 13:44:28 +02:00
Karina
4fc2b1f636 refine RC 2016-05-31 16:44:04 +08:00
Karina
e3c306608c fix dependency ID mapping issue 2016-05-30 15:03:39 +08:00
Sindre Aamås
563376df0c [UT] Test downsampling routines with a wider variety of height ratios 2016-05-25 14:16:29 +02:00
Sindre Aamås
4fec6d581e [UT] Test generic downsampling routines with a wider variety of width ratios
Get coverage of all code paths for routines that branch to different
paths for different scaling ratios.
2016-05-23 20:23:47 +02:00
Sindre Aamås
e490215990 [Processing/x86] Add an AVX2 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3/SSE4.1 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2,
~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios
within (4, 8], and ~5.06x/~4.09x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
Sindre Aamås
b43e58a366 [Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2,
~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios
within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
Sindre Aamås
b1013095b1 [Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4.

The use of blendps makes this require SSE4.1. The pshufb path can be
backported to SSSE3 and the generic path to SSE2 for a minor reduction
in performance by replacing blendps and preceding instructions with an
equivalent sequence.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2,
~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:39 +02:00
Sindre Aamås
1995e03d91 [Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:31 +02:00
Karina
9b2dd55324 add GetBsPostion for cabac and cavlc 2016-05-20 14:29:48 +08:00
sijchen
ffb85046b4 Refactoring: Wrap all the operations related to eSpsPpsIdStrategy to class, to improve code readability 2016-05-04 15:06:02 -07:00
HaiboZhu
c30cc41261 Merge pull request #2448 from saamas/encoder-getnonzerocount-sse42
[Encoder] Add an SSE4.2 implementation of WelsGetNonZeroCount
2016-05-04 09:49:47 +08:00
ruil2
e9dc97803d Merge pull request #2447 from saamas/encoder-cavlcparamcal-sse42
[Encoder] Add an SSE4.2 implementation of CavlcParamCal
2016-04-28 09:08:44 +08:00
ruil2
7d65687284 Merge pull request #2441 from saamas/encoder-add-avx2-4x4-quantization-routines
[Encoder] Add AVX2 4x4 quantization routines
2016-04-28 09:08:31 +08:00
Sindre Aamås
4645bd26aa [Encoder] Add an SSE4.2 implementation of WelsGetNonZeroCount
Avoid touching some cache lines by using popcnt instead of table
lookups.

Also gives a speedup of ~1.4x on Haswell as compared with SSE2.
2016-04-20 19:10:24 +02:00
Sindre Aamås
d906dda224 [UT] Improve GetNonZeroCount tests
Reduce duplication.
Test more combinations.
Always test boundary cases.
2016-04-20 19:10:24 +02:00
Sindre Aamås
3f31aff4dc [Encoder] Add an SSE4.2 implementation of CavlcParamCal
Use a combination of table lookups and pshufb to convert coefficients
to zero run/level format. Two 16-entry lookup tables are used for a
total of 192 bytes worth of tables. (The existing SSE2 version uses a
table of size 2048 bytes.)

Speedup is ~1.5x-3x as compared with the SSE2 version on Haswell (the
speedup is greater for input with many trailing zeros).

The use of popcnt makes it require SSE4.2. This can be replaced with
a small LUT and accumulation which would reduce the requirement to
SSSE3.
2016-04-20 18:37:08 +02:00
Sindre Aamås
502b16925e [UT] Add tests for CavlcParamCal_c and CavlcParamCal_sse2 2016-04-20 18:37:08 +02:00
Sindre Aamås
bb49e23719 [Encoder] Add AVX2 4x4 quantization routines
WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2)
WelsQuantFour4x4_avx2    (~2.32x speedup over SSE2)
WelsQuant4x4Dc_avx2      (~1.49x speedup over SSE2)
WelsQuant4x4_avx2        (~1.42x speedup over SSE2)
2016-04-13 11:56:47 +02:00
Sindre Aamås
1e83bec860 [UT] Add some missing quantization tests 2016-04-13 11:56:44 +02:00
Sindre Aamås
abaf3a4104 [UT] Reduce duplication in quantization tests 2016-04-13 08:59:16 +02:00
Sindre Aamås
93db6511a8 [UT] Test VAA routines with a wider variety of resolutions
Test even and odd multiples of 32 width because some AVX2 routines
have conditional logic based on that.
2016-04-11 16:40:36 +02:00
Sindre Aamås
57fc3e9917 [Processing] Add AVX2 VAA routines
Process 8 lines at a time rather than 16 lines at a time because
this appears to give more reliable memory subsystem performance on
Haswell.

Speedup is > 2x as compared to SSE2 when not memory-bound on Haswell.
On my Haswell MBP, VAACalcSadSsdBgd is about ~3x faster when uncached,
which appears to be related to processing 8 lines at a time as opposed
to 16 lines at a time. The other routines are also faster as compared
to the SSE2 routines in this case but to a lesser extent.
2016-04-11 16:09:56 +02:00
Martin Storsjö
81493590f8 Remove a stray empty line
This disappears when regenerating the makefiles.
2016-03-24 10:01:48 +02:00
sijchen
47d310539f Squashed commit of the following:
commit c8111942e07437034a74b33887c33b5ad78e476a
Author: Karina <ruil2@cisco.com>
Date:   Wed Mar 23 14:31:18 2016 +0800

    update SHA table

commit f36a25344c25a131581dcbcd2d103fc4b131012e
Author: Karina <ruil2@cisco.com>
Date:   Wed Mar 23 13:45:58 2016 +0800

    fix bitrate overflow issue when adaptive quality turns on
2016-03-23 10:23:33 -07:00
sijchen
33bb96f604 Merge pull request #2420 from sijchen/fix_sps
[Encoder] fix the lack of eSpsPpsIdStrategy==INCREASING_ID under simulcast avc on
2016-03-21 21:51:07 -07:00
zhilwang
d7570bfa52 Merge pull request #2401 from saamas/decoder-use-encoder-x86-idct-routines
[Decoder] Use encoder x86 IDCT routines
2016-03-18 08:50:33 +08:00
Sindre Aamås
b6c4a5447c [Decoder/x86] IDCT one block at a time with SSE2
At lower bitrates, it is overall faster to conditionally do one block
at a time with SSE2 on Haswell and likely other common architectures.
At higher bitrates, it is faster to use the wider routine that IDCTs
four blocks at a time. To avoid potential performance regressions
as compared to MMX, stick with single-block IDCTs with SSE2. There
is still a performance advantage as compared to MMX because the
single-block SSE2 routine is faster than the corresponding MMX
routine.

Stick with four blocks at a time with AVX2 for which that appears
to be consistently faster on Haswell.
2016-03-16 19:55:11 +01:00
huili2
a8d9576297 Merge pull request #2405 from HaiboZhu/Fix_UT_decoder_init_fail
Fix the decoder init failed case in UT
2016-03-16 16:28:14 +08:00
sijchen
c009183e97 fix the lack of eSpsPpsIdStrategy==INCREASING_ID under simulcast avc on 2016-03-14 11:28:44 -07:00
Haibo Zhu
46f42ec5f3 Fix the decoder init failed case in UT 2016-03-14 17:06:58 +08:00
Karina
f84f2315ab change downsampling logic that downsampling source is from the nearest layer instead of the highest layer 2016-03-14 09:55:36 +08:00
Sindre Aamås
98042f1600 [Decoder] Use encoder x86 IDCT routines
Move asm routines to common. Delete obsolete decoder routines.

Use wider routines where applicable.

~1.07x overall faster decode on a quick 720p30 4Mbps test on Haswell.
2016-03-09 10:41:42 +01:00
Sindre Aamås
48a520915a [Encoder/x86] Add AVX2 SATD routines
WelsSampleSatd16x16_avx2 (~2.31x speedup over SSE4.1 on Haswell).
WelsSampleSatd16x8_avx2  (~2.19x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x16_avx2  (~1.68x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x8_avx2   (~1.53x speedup over SSE4.1 on Haswell).
2016-03-08 11:31:17 +01:00
sijchen
4db9c32976 remove sink in WelsThreadPool and hide the construtor to finish the singleTon 2016-03-02 17:08:09 -08:00
sijchen
d4f09d9048 put CWelsThreadPool to singleTon for future usage (including add sink for IWelsTask) 2016-02-29 11:40:25 -08:00
Gregory J. Wolfe
03890fe86f Added support for "video signal type present" information.
The "Video signal type present" information is written to the output
video file when it is created, and later is used by the decoder to
properly decode the compressed video data. The saved attributes
are:

- format type (PAL, NTSC, etc.)
- color primaries (BT709, SMPTE170M, etc.)
- transfer characteristics (BT709, SMPTE170M, etc.)
- color matrix ((BT709, SMPTE170M, etc.)

These modifications allow the client to specify these attributes
and, if specified, makes sure they are written to the output file.
2016-02-24 10:33:18 -05:00
Gregory J. Wolfe
c7fcba06c7 Added support for "video signal type present" information.
The "Video signal type present" information is written to the output
video file when it is created, and later is used by the decoder to
properly decode the compressed video data. The saved attributes
are:

- format type (PAL, NTSC, etc.)
- color primaries (BT709, SMPTE170M, etc.)
- transfer characteristics (BT709, SMPTE170M, etc.)
- color matrix ((BT709, SMPTE170M, etc.)

These modifications allow the client to specify these attributes
and, if specified, makes sure they are written to the output file.
2016-02-23 13:21:06 -05:00
sijchen
aaa25160ec Merge pull request #2353 from saamas/encoder-x86-dct-opt2
[Encoder] x86 DCT optimizations
2016-02-08 15:00:12 -08:00
sijchen
e5e7013b73 Merge pull request #2350 from sijchen/th00
[Common] Add sink to IWelsTask
2016-02-08 14:59:38 -08:00
Sindre Aamås
c8c74903f8 [Encoder] Add single-block AVX2 4x4 DCT/IDCT routines
We do four blocks at a time when possible, but need to handle
single blocks at a time for intra prediction.

~3.15x speedup over MMX for the DCT on Haswell.
~2.94x speedup over MMX for the IDCT on Haswell.

Returns diminish with increasing vector length because a larger
proportion of the time is spent on load/store/shuffling.
2016-02-02 17:22:49 +01:00
Sindre Aamås
f90960983c [Encoder] Add single-block SSE2 4x4 DCT/IDCT routines
We do four blocks at a time when possible, but need to handle
single blocks at a time for intra prediction.

~2.31x speedup over MMX for the DCT on Haswell.
~1.92x speedup over MMX for the IDCT on Haswell.
2016-02-02 17:22:48 +01:00
unknown
3873addc3d fix frame size constraints for width and height 2016-02-01 15:55:53 +08:00
HaiboZhu
1030820ec4 Merge pull request #2342 from sijchen/enh_ut_tem
[UT] correct and enhance the ut template and trace improvement
2016-02-01 09:08:05 +08:00
sijchen
47e3f4c45c correct and enhance the ut template 2016-01-19 17:16:39 -08:00
Sindre Aamås
cc8d541432 [UT] Utilize DCT function pointer typedefs 2016-01-19 22:00:24 +01:00
Sindre Aamås
a45c10cf91 [UT] Only run AVX2 tests if host supports AVX2 2016-01-19 14:27:46 +01:00