2702 Commits

Author SHA1 Message Date
Karina
7f2ba4dcb6 add savc setting in configure file and command line 2016-05-31 13:53:31 +08:00
Karina
e3c306608c fix dependency ID mapping issue 2016-05-30 15:03:39 +08:00
ruil2
39c2fb3d6b Merge pull request #2472 from saamas/processing-x86-general-bilinear-downsample-optimizations
[Processing/x86] GeneralBilinearDownsample optimizations
2016-05-27 15:17:31 +08:00
HaiboZhu
c17a58efdf Merge pull request #2473 from ruil2/update_interface
modify the interface that use a independent subseqID for each layer
2016-05-25 10:00:13 +08:00
Karina
2ef9613e55 avoid overflow 2016-05-24 13:25:05 +08:00
Sindre Aamås
e490215990 [Processing/x86] Add an AVX2 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3/SSE4.1 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2,
~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios
within (4, 8], and ~5.06x/~4.09x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
Sindre Aamås
b43e58a366 [Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 8. Because
pshufb does not cross 128-bit lanes, the overhead of address
calculations and loads is relatively greater as compared with an
SSSE3 implementation.

Fall back to a generic approach for ratios > 8.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2,
~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios
within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound
on Haswell as compared with the current SSE2 implementation.
2016-05-23 20:23:47 +02:00
Sindre Aamås
b1013095b1 [Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4.

The use of blendps makes this require SSE4.1. The pshufb path can be
backported to SSSE3 and the generic path to SSE2 for a minor reduction
in performance by replacing blendps and preceding instructions with an
equivalent sequence.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2,
~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:39 +02:00
Sindre Aamås
1995e03d91 [Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied; which AFAICT is safe with
the current usage of these routines.

Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:31 +02:00
Sindre Aamås
cbaf087583 [Processing] Reduce duplication in downsampling wrappers 2016-05-23 13:19:17 +02:00
ruil2
c96c8b05a8 Merge pull request #2468 from sijchen/refactor_pre
[Encoder] Refactor: create diff func for diff case to make logic clean
2016-05-23 13:21:40 +08:00
Karina
9b2dd55324 add GetBsPostion for cabac and cavlc 2016-05-20 14:29:48 +08:00
sijchen
27e803f6f4 refactor to make logic clean 2016-05-19 09:42:39 -07:00
Karina
ac37666cf1 modify the interface that use a independent subseqID for each layer 2016-05-19 17:17:17 +08:00
Karina
8a341070f2 fix overflow issue 2016-05-19 12:00:49 +08:00
sijchen
1ac02f3002 fix conflict with master 2016-05-18 10:57:39 -07:00
Karina
c298d66d48 fix temporal layer skip issue 2016-05-18 09:47:49 +08:00
Haibo Zhu
85f4beb9a8 Fix the wrong variable name which casue the build error 2016-05-17 13:46:04 +08:00
HaiboZhu
46220cfb3b Merge pull request #2461 from HaiboZhu/Bugfix_remove_undefined_behavior_warning
Remove the undefined behavior waring in parse_cabac
2016-05-17 10:51:18 +08:00
Haibo Zhu
86c1f0d2c6 Remove the undefined behavior waring in parse_cabac 2016-05-17 09:40:03 +08:00
ruil2
0ec686f7ec Merge pull request #2452 from sijchen/refactor_sps2
Refactoring: Wrap all the operations related to eSpsPpsIdStrategy to class
2016-05-17 09:19:14 +08:00
sijchen
1eb735299a Merge pull request #2458 from ruil2/downsampling2
add one new downsampling algorithms
2016-05-16 10:59:35 -07:00
sijchen
00747540fb move strategy related pointer to class 2016-05-16 10:55:13 -07:00
Karina
3b55d64902 fix crash when temporal layer is skipped, the frame should not be encoded 2016-05-16 14:43:13 +08:00
Karina
96b2a87030 add one new downsampling algorithms 2016-05-16 09:28:19 +08:00
sijchen
ffb85046b4 Refactoring: Wrap all the operations related to eSpsPpsIdStrategy to class, to improve code readability 2016-05-04 15:06:02 -07:00
HaiboZhu
c30cc41261 Merge pull request #2448 from saamas/encoder-getnonzerocount-sse42
[Encoder] Add an SSE4.2 implementation of WelsGetNonZeroCount
2016-05-04 09:49:47 +08:00
ruil2
e9dc97803d Merge pull request #2447 from saamas/encoder-cavlcparamcal-sse42
[Encoder] Add an SSE4.2 implementation of CavlcParamCal
2016-04-28 09:08:44 +08:00
ruil2
7d65687284 Merge pull request #2441 from saamas/encoder-add-avx2-4x4-quantization-routines
[Encoder] Add AVX2 4x4 quantization routines
2016-04-28 09:08:31 +08:00
ruil2
56618249d7 Merge pull request #2436 from saamas/processing-add-avx2-vaa-routines
[Processing] Add AVX2 VAA routines
2016-04-28 09:08:03 +08:00
Sindre Aamås
fb0b2b3f41 [Encoder/x86] Drop unneeded LOAD_4_PARA in CavlcParamCal_sse42 2016-04-24 22:59:35 +02:00
Sindre Aamås
d1c7713191 [Encoder/x86] Minor CavlcParamCal_sse42 tweak
Do more elaborate register allocation to avoid a few mov instructions.
2016-04-24 22:36:23 +02:00
Sindre Aamås
f56bdc3aa4 [Encoder/x86] Minor CavlcParamCal_sse42 tweak
Avoid loading single-use parameter.
2016-04-21 16:29:02 +02:00
Sindre Aamås
2eb8800712 [Encoder/x86] Remove a leftover mov instruction in CavlcParamCal_sse42 2016-04-21 15:53:33 +02:00
Sindre Aamås
4645bd26aa [Encoder] Add an SSE4.2 implementation of WelsGetNonZeroCount
Avoid touching some cache lines by using popcnt instead of table
lookups.

Also gives a speedup of ~1.4x on Haswell as compared with SSE2.
2016-04-20 19:10:24 +02:00
Sindre Aamås
3f31aff4dc [Encoder] Add an SSE4.2 implementation of CavlcParamCal
Use a combination of table lookups and pshufb to convert coefficients
to zero run/level format. Two 16-entry lookup tables are used for a
total of 192 bytes worth of tables. (The existing SSE2 version uses a
table of size 2048 bytes.)

Speedup is ~1.5x-3x as compared with the SSE2 version on Haswell (the
speedup is greater for input with many trailing zeros).

The use of popcnt makes it require SSE4.2. This can be replaced with
a small LUT and accumulation which would reduce the requirement to
SSSE3.
2016-04-20 18:37:08 +02:00
Sindre Aamås
502b16925e [UT] Add tests for CavlcParamCal_c and CavlcParamCal_sse2 2016-04-20 18:37:08 +02:00
HaiboZhu
98c6c6de11 Merge pull request #2446 from HaiboZhu/Reduce_log_size_for_parse_only_mode
Add the log reduce logic into parse only mode
2016-04-20 10:48:57 +08:00
HaiboZhu
3b68840d5f Merge pull request #2444 from GuangweiWang/fix-assembly-arm64
Fix assembly arm64
Code review at: https://rbcommons.com/s/OpenH264/r/1594/
2016-04-20 10:03:01 +08:00
Haibo Zhu
3ccecfbdbe Add the log reduce logic into parse only mode 2016-04-20 09:58:12 +08:00
Guangwei Wang
cc407b4b21 fix code style 2016-04-17 19:47:55 +08:00
Guangwei Wang
0b8cdcaff8 extension 32-bit parameters to 64-bit on arm64 assembly function 2016-04-17 19:41:57 +08:00
Karina
1ecb9582df update arm assembly comments 2016-04-14 14:57:21 +08:00
Karina
dd340b7fe7 modify neon comment 2016-04-14 14:49:11 +08:00
Karina
525dbe7093 add 32-bit parameter sign-extentions for block_add_aarch64_neon.S 2016-04-14 10:06:57 +08:00
Karina
d34e209266 fix 32-bit parameters issue on arm64 assembly function 2016-04-13 19:30:08 +08:00
Sindre Aamås
bb49e23719 [Encoder] Add AVX2 4x4 quantization routines
WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2)
WelsQuantFour4x4_avx2    (~2.32x speedup over SSE2)
WelsQuant4x4Dc_avx2      (~1.49x speedup over SSE2)
WelsQuant4x4_avx2        (~1.42x speedup over SSE2)
2016-04-13 11:56:47 +02:00
Karina
7943764869 add missing sign extension for arm64 2016-04-12 16:27:58 +08:00
Sindre Aamås
57fc3e9917 [Processing] Add AVX2 VAA routines
Process 8 lines at a time rather than 16 lines at a time because
this appears to give more reliable memory subsystem performance on
Haswell.

Speedup is > 2x as compared to SSE2 when not memory-bound on Haswell.
On my Haswell MBP, VAACalcSadSsdBgd is about ~3x faster when uncached,
which appears to be related to processing 8 lines at a time as opposed
to 16 lines at a time. The other routines are also faster as compared
to the SSE2 routines in this case but to a lesser extent.
2016-04-11 16:09:56 +02:00
Karina
7e14400b0b refine the workflow for encode one frame 2016-03-31 16:58:20 +08:00