Commit Graph

461 Commits

Author SHA1 Message Date
HaiboZhu
84a7669b63 Merge pull request #2464 from bumblebritches57/MVC
MVC aka Stereoscopic 3D support
2016-06-15 10:05:15 +08:00
Sindre Aamås
8a0af4a3f2 [Processing/x86] DyadicBilinearDownsample optimizations
Average vertically before horizontally; horizontal averaging is more
expensive. Doing the vertical averaging first halves the number of
horizontal averages.

Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.

Minor tweaks.

Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.

Adjust tests because averaging vertically first gives slightly different
output.

~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.

Note that the widthx16 routine can be unrolled for further speedup.
2016-06-02 13:44:28 +02:00
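A minimal scalar sketch of the ordering described in the commit above, not the SSSE3 routine itself. It assumes each averaging step is a rounding byte average (the behaviour of one pavgb lane); the function names are hypothetical. Per 2x2 input block, horizontal-first needs two horizontal averages and one vertical, while vertical-first needs two vertical averages and one horizontal, and the two orders can round differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounding average of two bytes: (a + b + 1) >> 1. */
uint8_t avg_u8(uint8_t a, uint8_t b) {
    return (uint8_t)((a + b + 1) >> 1);
}

/* 2x2 block laid out as  a b / c d.
   Horizontal first: two horizontal averages, then one vertical. */
uint8_t downsample_h_first(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return avg_u8(avg_u8(a, b), avg_u8(c, d));
}

/* Vertical first: two vertical averages, then one horizontal. */
uint8_t downsample_v_first(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return avg_u8(avg_u8(a, c), avg_u8(b, d));
}

/* 2:1 downsample of a plane using the vertical-first order.
   src must hold at least (2 * dst_w) x (2 * dst_h) pixels. */
void dyadic_downsample_v_first(uint8_t *dst, ptrdiff_t dst_stride,
                               const uint8_t *src, ptrdiff_t src_stride,
                               int dst_w, int dst_h) {
    for (int y = 0; y < dst_h; ++y) {
        const uint8_t *top = src + (ptrdiff_t)(2 * y) * src_stride;
        const uint8_t *bot = top + src_stride;
        for (int x = 0; x < dst_w; ++x) {
            dst[y * dst_stride + x] = downsample_v_first(
                top[2 * x], top[2 * x + 1], bot[2 * x], bot[2 * x + 1]);
        }
    }
}
```

For the block 0 2 / 1 1 the horizontal-first order yields 1 and the vertical-first order yields 2, which is the kind of small output difference that would require test adjustments.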
Sindre Aamås
1995e03d91 [Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.

Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.

The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.

Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:31 +02:00
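A scalar sketch of the position tracking that horizontal bilinear scaling relies on, under stated assumptions: 16.16 fixed-point positions, endpoint-aligned sampling, and a hypothetical function name. It is not the SSSE3 code from the commit above and does not reproduce its exact rounding; the integer part of each position is the kind of per-pixel offset a byte shuffle such as pshufb could gather in bulk.

```c
#include <stdint.h>

/* Scale one row from src_w to dst_w pixels with bilinear interpolation.
   Positions are tracked in 16.16 fixed point (an assumption of this sketch). */
void scale_row_bilinear(uint8_t *dst, int dst_w,
                        const uint8_t *src, int src_w) {
    /* Source distance covered per output pixel; endpoints map to endpoints. */
    uint32_t step = dst_w > 1
        ? (uint32_t)((((uint64_t)(src_w - 1)) << 16) / (uint64_t)(dst_w - 1))
        : 0;
    uint32_t pos = 0;
    for (int x = 0; x < dst_w; ++x) {
        int i = (int)(pos >> 16);          /* left source pixel index */
        uint32_t f = pos & 0xFFFFu;        /* fractional weight of the right pixel */
        uint32_t a = src[i];
        uint32_t b = src[i + 1 < src_w ? i + 1 : i];  /* clamp at the row end */
        dst[x] = (uint8_t)((a * (65536u - f) + b * f + 32768u) >> 16);
        pos += step;
    }
}
```

Scaling the row {0, 10, 20, 30} to three pixels samples positions 0, 1.5 and 3, giving {0, 15, 30}.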
ruil2
7d65687284 Merge pull request #2441 from saamas/encoder-add-avx2-4x4-quantization-routines
[Encoder] Add AVX2 4x4 quantization routines
2016-04-28 09:08:31 +08:00
Karina
dd340b7fe7 modify neon comment
2016-04-14 14:49:11 +08:00
Karina
d34e209266 fix 32-bit parameters issue on arm64 assembly function
2016-04-13 19:30:08 +08:00
Sindre Aamås
bb49e23719 [Encoder] Add AVX2 4x4 quantization routines
WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2)
WelsQuantFour4x4_avx2    (~2.32x speedup over SSE2)
WelsQuant4x4Dc_avx2      (~1.49x speedup over SSE2)
WelsQuant4x4_avx2        (~1.42x speedup over SSE2)
2016-04-13 11:56:47 +02:00
Karina
7943764869 add missing sign extension for arm64
2016-04-12 16:27:58 +08:00
Martin Storsjö
a4e71d6662 Add missing sign extension for x86_64 in mb_copy.asm
This fixes running the code built for x86_64 OS X with Xcode 7.3.
2016-03-24 10:20:42 +02:00
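The sign-extension fixes above concern 32-bit integer arguments that arrive in 64-bit registers. The sketch below models that in portable C, assuming the upper half of the register may hold unrelated bits; the function names are hypothetical. It computes what movsxd (x86-64) or sxtw (AArch64) would produce before the value is used as a pointer offset.

```c
#include <stdint.h>

/* Sign-extend the low 32 bits of a 64-bit register value, ignoring
   whatever is in the upper 32 bits. */
int64_t sign_extend_low32(uint64_t reg) {
    uint64_t low = reg & 0xFFFFFFFFu;
    /* Flip the sign bit, then subtract 2^31: maps [0, 2^32) to [-2^31, 2^31). */
    return (int64_t)(low ^ 0x80000000u) - (int64_t)0x80000000;
}

/* Apply a signed 32-bit stride held in a 64-bit register to a pointer.
   Using stride_reg directly would add a huge bogus offset if the upper
   bits are not a proper sign extension. */
const uint8_t *row_at(const uint8_t *base, uint64_t stride_reg) {
    return base + sign_extend_low32(stride_reg);
}
```

With a register value of 0xDEADBEEFFFFFFFFC the recovered stride is -4, whereas the raw 64-bit value would point far outside the buffer.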
Sindre Aamås
b6c4a5447c [Decoder/x86] IDCT one block at a time with SSE2
At lower bitrates, it is overall faster to conditionally do one block
at a time with SSE2 on Haswell and likely other common architectures.
At higher bitrates, it is faster to use the wider routine that IDCTs
four blocks at a time. To avoid potential performance regressions
as compared to MMX, stick with single-block IDCTs with SSE2. There
is still a performance advantage as compared to MMX because the
single-block SSE2 routine is faster than the corresponding MMX
routine.

Stick with four blocks at a time with AVX2, for which that appears
to be consistently faster on Haswell.
2016-03-16 19:55:11 +01:00
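A scalar sketch of the "one block at a time, conditionally" idea from the commit above, using the standard H.264 4x4 inverse core transform. The per-block flag array, the 2x2 layout of the four blocks, and all names are assumptions made for illustration; the actual SSE2 routine and its calling convention are not shown in this log.

```c
#include <stdint.h>

uint8_t clip_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Inverse-transform one 4x4 block (row-major coef) and add it to dst.
   Right shifts of negative values are assumed to be arithmetic. */
void idct4x4_add(uint8_t *dst, int stride, const int16_t *coef) {
    int tmp[16];
    for (int i = 0; i < 4; ++i) {               /* row pass */
        const int16_t *c = coef + 4 * i;
        int e0 = c[0] + c[2];
        int e1 = c[0] - c[2];
        int e2 = (c[1] >> 1) - c[3];
        int e3 = c[1] + (c[3] >> 1);
        tmp[4 * i + 0] = e0 + e3;
        tmp[4 * i + 1] = e1 + e2;
        tmp[4 * i + 2] = e1 - e2;
        tmp[4 * i + 3] = e0 - e3;
    }
    for (int j = 0; j < 4; ++j) {               /* column pass */
        int e0 = tmp[j] + tmp[8 + j];
        int e1 = tmp[j] - tmp[8 + j];
        int e2 = (tmp[4 + j] >> 1) - tmp[12 + j];
        int e3 = tmp[4 + j] + (tmp[12 + j] >> 1);
        int r[4] = { e0 + e3, e1 + e2, e1 - e2, e0 - e3 };
        for (int i = 0; i < 4; ++i) {
            uint8_t *p = dst + i * stride + j;
            *p = clip_u8(*p + ((r[i] + 32) >> 6));
        }
    }
}

/* Four 4x4 blocks arranged 2x2 inside an 8x8 area (hypothetical layout).
   Blocks without coefficients are skipped, leaving the prediction as is. */
void idct_four_blocks_conditional(uint8_t *dst, int stride,
                                  const int16_t *coef,
                                  const uint8_t has_coef[4]) {
    for (int b = 0; b < 4; ++b) {
        if (has_coef[b]) {
            idct4x4_add(dst + (b & 1) * 4 + (b >> 1) * 4 * stride,
                        stride, coef + 16 * b);
        }
    }
}
```

When most blocks carry no coefficients, which is plausibly the low-bitrate case the commit refers to, the per-block check skips most of the work; a routine that always transforms four blocks cannot.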
Marcus Johnson
4d6b1c23fe MVC support 2
2016-03-16 01:32:56 -04:00
Marcus Johnson
69bae68698 Add support for MVC NALs to EWelsNalUnitType
2016-03-16 01:28:55 -04:00
Sindre Aamås
98042f1600 [Decoder] Use encoder x86 IDCT routines
Move asm routines to common. Delete obsolete decoder routines.

Use wider routines where applicable.

~1.07x overall faster decode on a quick 720p30 4Mbps test on Haswell.
2016-03-09 10:41:42 +01:00
Sindre Aamås
48a520915a [Encoder/x86] Add AVX2 SATD routines
WelsSampleSatd16x16_avx2 (~2.31x speedup over SSE4.1 on Haswell).
WelsSampleSatd16x8_avx2  (~2.19x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x16_avx2  (~1.68x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x8_avx2   (~1.53x speedup over SSE4.1 on Haswell).
2016-03-08 11:31:17 +01:00
volvet
d4c68527b1 Merge pull request #2389 from saamas/common-x86-deblock-chroma-horizontal-ssse3-optimizations
[Common/x86] Deblock chroma horizontal ssse3 optimizations
2016-03-08 17:09:08 +08:00
sijchen
4db9c32976 remove sink in WelsThreadPool and hide the constructor to finish the singleton
2016-03-02 17:08:09 -08:00
sijchen
d4f09d9048 make CWelsThreadPool a singleton for future usage (including adding a sink for IWelsTask)
2016-02-29 11:40:25 -08:00
Sindre Aamås
a009153741 [Common/x86] DeblockChromaEq4H_ssse3 optimizations
Use packed 8-bit operations rather than unpack to 16-bit.

~5.80x speedup on Haswell (x86-64).
~1.69x speedup on Haswell (x86 32-bit).
2016-02-26 10:58:16 +01:00
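The deblocking commits in this log repeatedly mention using packed 8-bit operations rather than unpacking to 16-bit. As one illustration of why that is possible, the rounding average that deblocking filters use heavily can be computed without any intermediate wider than a byte. This is a generic identity, not a claim about which instructions those commits use; the function names are hypothetical.

```c
#include <stdint.h>

/* Reference: widen so that a + b + 1 cannot overflow a byte. */
uint8_t avg_widened(uint8_t a, uint8_t b) {
    uint16_t wide = (uint16_t)(a + b + 1);
    return (uint8_t)(wide >> 1);
}

/* Same result using only values that fit in 8 bits:
   (a | b) - ((a ^ b) >> 1) == (a + b + 1) >> 1 for all bytes a, b. */
uint8_t avg_packed8(uint8_t a, uint8_t b) {
    return (uint8_t)((a | b) - ((a ^ b) >> 1));
}
```

Staying at 8 bits means a 128-bit register processes 16 pixels per operation instead of 8, which is consistent with the speedups reported in these commits, though the exact instruction sequences are not given here.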
Sindre Aamås
9909c306f1 [Common/x86] DeblockChromaLt4H_ssse3 optimizations
Use packed 8-bit operations rather than unpack to 16-bit.

~5.72x speedup on Haswell (x86-64).
~1.85x speedup on Haswell (x86 32-bit).
2016-02-26 10:58:16 +01:00
ruil2
2754129064 Merge pull request #2360 from saamas/common-x86-deblock-optimizations
[Common/x86] Deblocking optimizations
2016-02-19 09:52:39 +08:00
sijchen
e07ee9c096 use WELS_DELETE_OP for deleting
2016-02-17 10:07:33 -08:00
sijchen
74955c877f set pointers to null and call uninit
2016-02-17 10:07:33 -08:00
sijchen
cc675f9fd1 add error handling for the memory allocation failure case
2016-02-17 10:07:33 -08:00
sijchen
71aa533038 move the printing of the MEMORY_CHECK part to a more reasonable place
2016-02-15 10:12:34 -08:00
Sindre Aamås
e96a7b5c92 [Common/x86] DeblockChromaEq4V_ssse3 optimizations
Use packed 8-bit operations rather than unpack to 16-bit.

Avoid spills.

~2.07x speedup on Haswell (x86-64).
~2.12x speedup on Haswell (x86 32-bit).
2016-02-15 02:08:03 +01:00
Sindre Aamås
fc16010583 [Common/x86] DeblockChromaLt4V_ssse3 optimizations
Use packed 8-bit operations rather than unpack to 16-bit.

Avoid spills.

~2.68x speedup on Haswell (x86-64).
~2.38x speedup on Haswell (x86 32-bit).
2016-02-15 02:07:25 +01:00
Sindre Aamås
62fb37d096 [Common/x86] DeblockLumaEq4_ssse3 optimizations
Use packed 8-bit operations rather than unpack to 16-bit.

Minimize spills.

~2.31x speedup on Haswell (x86-64).
~2.40x speedup on Haswell (x86 32-bit).
2016-02-15 02:06:39 +01:00
Sindre Aamås
732e1c5f78 [Common/x86] DeblockLumaLt4_ssse3 optimizations
Use packed 8-bit operations rather than unpack to 16-bit.

Avoid spills.

~1.97x speedup on Haswell (x86-64).
~3.09x speedup on Haswell (x86 32-bit).
2016-02-15 02:06:18 +01:00
sijchen
e5e7013b73 Merge pull request #2350 from sijchen/th00
[Common] Add sink to IWelsTask
2016-02-08 14:59:38 -08:00
unknown
3873addc3d fix frame size constraints for width and height
2016-02-01 15:55:53 +08:00
Sindre Aamås
3088d96978 [Encoder] Add an AVX2 4x4 IDCT implementation
~2.03x faster on Haswell as compared to the SSE2 version.
2016-01-19 13:12:28 +01:00
sijchen
5eb18b101e change how the debug trace is output
2016-01-13 22:13:43 -08:00
sijchen
cce1c29844 add sink to IWelsTask (for further enhancements)
2016-01-13 16:24:54 -08:00
Sijia Chen
3e0ee69812 remove unneeded code and add some logs
2015-11-02 23:15:29 -08:00
Sijia Chen
054a297ca7 adjust encoder tasks, add ut and enable new thread pool under some slice modes
2015-10-28 09:39:26 -07:00
HaiboZhu
e0cee02d77 Merge pull request #2177 from sijchen/thp21
[Encoder] add encoder tasks and task-management class
2015-10-23 13:21:42 +08:00
Martin Storsjö
80c8b7b1cc Add a missing include of stdlib.h
This is required for malloc in this header.

This fixes building for Windows Phone.
2015-10-20 08:59:41 +03:00
Sijia Chen
819f6f5d93 [Encoder] add encoder tasks and task-management class
https://rbcommons.com/s/OpenH264/r/1334/
2015-10-19 22:48:28 -07:00
Martin Storsjö
dac26cf923 Remove unused STL includes
This fixes building for Android, where libopenh264.so is intended
not to link to any particular STL implementation.
2015-10-19 11:21:29 +03:00
Sijia Chen
b29760ee31 remove unneeded parts
2015-10-15 11:31:34 -07:00
Sijia Chen
ade32f5c48 implementation for WelsSleep on WP8.0
https://rbcommons.com/s/OpenH264/r/1315/
2015-10-15 11:27:43 -07:00
Sijia Chen
a3f606e58a replacement of std::list for m_cBusyThreads
https://rbcommons.com/s/OpenH264/r/1320/
2015-10-15 11:17:29 -07:00
Sijia Chen
bc566f0923 store m_cIdleThreads in a CWelsCircleQueue rather than a std::map
https://rbcommons.com/s/OpenH264/r/1313/
2015-10-15 10:24:48 -07:00
Sijia Chen
eb00d5cb9e change std::list to internal implementation and add the new ut file for CWelsCircleQueue
https://rbcommons.com/s/OpenH264/r/1310/
2015-10-15 10:11:29 -07:00
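For readers unfamiliar with the data structure behind the name CWelsCircleQueue, here is a minimal fixed-capacity circular queue in C. It only illustrates the general technique of replacing a node-based std::list with an array-backed ring; the type, the function names and the fixed capacity are all hypothetical and do not describe the actual class in these commits.

```c
#include <stddef.h>

#define CQ_CAPACITY 8  /* hypothetical fixed capacity for this sketch */

typedef struct {
    void  *items[CQ_CAPACITY];
    size_t head;   /* index of the oldest element */
    size_t count;  /* number of stored elements */
} circle_queue;

void cq_init(circle_queue *q) {
    q->head = 0;
    q->count = 0;
}

/* Append at the tail; returns 1 on success, 0 if the queue is full. */
int cq_push(circle_queue *q, void *item) {
    if (q->count == CQ_CAPACITY) {
        return 0;
    }
    q->items[(q->head + q->count) % CQ_CAPACITY] = item;
    q->count++;
    return 1;
}

/* Remove and return the oldest element, or NULL if the queue is empty. */
void *cq_pop(circle_queue *q) {
    if (q->count == 0) {
        return NULL;
    }
    void *item = q->items[q->head];
    q->head = (q->head + 1) % CQ_CAPACITY;
    q->count--;
    return item;
}
```

An array-backed ring needs no per-element allocation and no STL dependency, which fits the later commit in this log that removes STL includes for the Android build.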
Sijia Chen
757a596e97 add basic threadpool functions
https://rbcommons.com/s/OpenH264/r/1294/
2015-10-15 10:04:00 -07:00
Haibo Zhu
03d16bb4d1 Remove UBSAN warnings about negative left shift
2015-10-14 19:43:19 -07:00
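Left-shifting a negative signed value is undefined behavior in C, which is what UBSAN reports. A common way to silence such a warning without changing results is to multiply by a power of two instead. The sketch below shows that replacement; the function name is hypothetical and this is not necessarily how the commit above fixed it.

```c
#include <stdint.h>

/* Equivalent of value << bits that is well defined for negative values.
   Preconditions: bits <= 30 and the mathematical result fits in int32_t. */
int32_t shl_signed(int32_t value, unsigned bits) {
    return value * (int32_t)(1u << bits);
}
```

For example, shl_signed(-3, 4) yields -48, where -3 << 4 would be flagged.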
HaiboZhu
3067d127aa Merge pull request #2153 from mstorsjo/fix-warnings
Fix warnings when building for iOS with xcode
2015-10-13 18:26:56 +08:00
Martin Storsjö
8363d43588 Fix warnings when building for iOS with xcode
2015-10-13 12:27:11 +03:00
Martin Storsjö
837599becc Revert an accidental change that broke MSVC compilation
This reverts an unrelated part of e7e3b4f37f.

Since the function still is declared as taking an int32_t parameter
in the header, changing the function implementation makes it end
up as a different function.
2015-10-13 12:15:01 +03:00
Haibo Zhu
e7e3b4f37f Init the string value and add protection for WelsStrcat()
2015-10-10 08:45:48 -07:00