Sindre Aamås
1995e03d91
[Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample
...
Keep track of relative pixel offsets and utilize pshufb to efficiently
extract relevant pixels for horizontal scaling ratios <= 4.
Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.
The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.
Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
2016-05-23 20:23:31 +02:00
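As a rough illustration of the pshufb idea described in 1995e03d91, the sketch below models the 128-bit byte shuffle in scalar C++ and builds a shuffle mask from relative pixel offsets. It is not the commit's code: `ShuffleBytes`, `BuildPairMask`, the 16.16 fixed-point step and the choice of four output pixels per mask are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec16 = std::array<uint8_t, 16>;

// Scalar model of 128-bit PSHUFB: each output byte is src[mask & 0x0F],
// or 0 when the mask byte has its high bit set.
Vec16 ShuffleBytes(const Vec16& src, const Vec16& mask) {
  Vec16 out{};
  for (int i = 0; i < 16; ++i)
    out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
  return out;
}

// Hypothetical helper: for four output pixels, gather each source pixel and
// its right-hand neighbour (the two taps of a horizontal bilinear filter).
// step_16_16 is the horizontal step in 16.16 fixed point and must be <= 4.0
// so that every index stays inside one 16-byte load.
Vec16 BuildPairMask(uint32_t step_16_16) {
  assert(step_16_16 <= (4u << 16));
  Vec16 mask;
  mask.fill(0x80);  // unused lanes are zeroed by the shuffle
  uint32_t pos = 0;
  for (int i = 0; i < 4; ++i) {
    const uint32_t x = pos >> 16;  // integer pixel offset
    mask[2 * i] = static_cast<uint8_t>(x);
    mask[2 * i + 1] = static_cast<uint8_t>(x + 1);
    pos += step_16_16;
  }
  return mask;
}
```

With a step of 3.0, the mask selects source bytes 0, 1, 3, 4, 6, 7, 9 and 10, so one shuffle replaces eight separate loads. For steps above 4.0 the offsets no longer fit in a single 16-byte vector, which is consistent with the commit falling back to a generic path there.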
ruil2
7d65687284
Merge pull request #2441 from saamas/encoder-add-avx2-4x4-quantization-routines
...
[Encoder] Add AVX2 4x4 quantization routines
2016-04-28 09:08:31 +08:00
Karina
dd340b7fe7
modify neon comment
2016-04-14 14:49:11 +08:00
Karina
d34e209266
fix 32-bit parameter issue in arm64 assembly function
2016-04-13 19:30:08 +08:00
Sindre Aamås
bb49e23719
[Encoder] Add AVX2 4x4 quantization routines
...
WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2)
WelsQuantFour4x4_avx2 (~2.32x speedup over SSE2)
WelsQuant4x4Dc_avx2 (~1.49x speedup over SSE2)
WelsQuant4x4_avx2 (~1.42x speedup over SSE2)
2016-04-13 11:56:47 +02:00
Karina
7943764869
add missing sign extension for arm64
2016-04-12 16:27:58 +08:00
Martin Storsjö
a4e71d6662
Add missing sign extension for x86_64 in mb_copy.asm
...
This fixes running the code built for x86_64 OS X with Xcode 7.3.
2016-03-24 10:20:42 +02:00
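The class of bug fixed in a4e71d6662 (and in the arm64 commits above) can be modelled in plain C++. In the common 64-bit calling conventions a 32-bit integer argument occupies the low half of a 64-bit register, and the upper half is not guaranteed to hold a sign-extended value. The sketch below assumes the upper half happens to be zero; the function names are hypothetical and the real routines are assembly.

```cpp
#include <cstdint>

// Models assembly that uses the full 64-bit register directly: a negative
// 32-bit stride is seen as a large positive 64-bit value.
int64_t OffsetWithoutSignExtension(int32_t stride, int64_t row) {
  const uint64_t reg = static_cast<uint32_t>(stride);  // upper 32 bits are 0
  return static_cast<int64_t>(reg) * row;
}

// Models the fix: sign-extend the 32-bit value first (movsxd on x86_64,
// sxtw on arm64) before using it in 64-bit address arithmetic.
int64_t OffsetWithSignExtension(int32_t stride, int64_t row) {
  const int64_t reg = stride;  // sign-extended to 64 bits
  return reg * row;
}
```

For a stride of -16 and row 2, the sign-extended version yields -32, while the unextended one yields 8589934560, which would point far outside the buffer.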
Sindre Aamås
b6c4a5447c
[Decoder/x86] IDCT one block at a time with SSE2
...
At lower bitrates, it is overall faster to conditionally do one block
at a time with SSE2 on Haswell and likely other common architectures.
At higher bitrates, it is faster to use the wider routine that IDCTs
four blocks at a time. To avoid potential performance regressions
as compared to MMX, stick with single-block IDCTs with SSE2. There
is still a performance advantage as compared to MMX because the
single-block SSE2 routine is faster than the corresponding MMX
routine.
With AVX2, stick with four blocks at a time, which appears
to be consistently faster on Haswell.
2016-03-16 19:55:11 +01:00
Sindre Aamås
98042f1600
[Decoder] Use encoder x86 IDCT routines
...
Move asm routines to common. Delete obsolete decoder routines.
Use wider routines where applicable.
~1.07x faster decode overall on a quick 720p30 4Mbps test on Haswell.
2016-03-09 10:41:42 +01:00
Sindre Aamås
48a520915a
[Encoder/x86] Add AVX2 SATD routines
...
WelsSampleSatd16x16_avx2 (~2.31x speedup over SSE4.1 on Haswell).
WelsSampleSatd16x8_avx2 (~2.19x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x16_avx2 (~1.68x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x8_avx2 (~1.53x speedup over SSE4.1 on Haswell).
2016-03-08 11:31:17 +01:00
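For readers unfamiliar with SATD (sum of absolute transformed differences), the metric accelerated in 48a520915a, here is a minimal scalar sketch for a single 4x4 block. It is a textbook formulation, not the project's routine: `Satd4x4Sketch` is a hypothetical name, and it returns the raw sum without the scaling that some implementations apply.

```cpp
#include <cstdint>
#include <cstdlib>

// Sum of absolute values of the 4x4 Hadamard transform of (a - b).
// Strides are in bytes. No normalization is applied to the result.
int Satd4x4Sketch(const uint8_t* a, int stride_a,
                  const uint8_t* b, int stride_b) {
  int d[4][4];
  for (int y = 0; y < 4; ++y)
    for (int x = 0; x < 4; ++x)
      d[y][x] = a[y * stride_a + x] - b[y * stride_b + x];

  // Horizontal butterflies on each row.
  for (int y = 0; y < 4; ++y) {
    const int s0 = d[y][0] + d[y][1], s1 = d[y][0] - d[y][1];
    const int s2 = d[y][2] + d[y][3], s3 = d[y][2] - d[y][3];
    d[y][0] = s0 + s2;
    d[y][1] = s1 + s3;
    d[y][2] = s0 - s2;
    d[y][3] = s1 - s3;
  }

  // Vertical butterflies on each column, accumulating absolute values.
  int sum = 0;
  for (int x = 0; x < 4; ++x) {
    const int s0 = d[0][x] + d[1][x], s1 = d[0][x] - d[1][x];
    const int s2 = d[2][x] + d[3][x], s3 = d[2][x] - d[3][x];
    sum += std::abs(s0 + s2) + std::abs(s1 + s3) +
           std::abs(s0 - s2) + std::abs(s1 - s3);
  }
  return sum;
}
```

The larger block sizes named in the commit (8x8 up to 16x16) are conventionally computed by summing such sub-block results, which is what makes the operation amenable to wide SIMD.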
volvet
d4c68527b1
Merge pull request #2389 from saamas/common-x86-deblock-chroma-horizontal-ssse3-optimizations
...
[Common/x86] Deblock chroma horizontal ssse3 optimizations
2016-03-08 17:09:08 +08:00
sijchen
4db9c32976
remove sink in WelsThreadPool and hide the constructor to finish the singleton
2016-03-02 17:08:09 -08:00
sijchen
d4f09d9048
make CWelsThreadPool a singleton for future usage (including adding a sink for IWelsTask)
2016-02-29 11:40:25 -08:00
Sindre Aamås
a009153741
[Common/x86] DeblockChromaEq4H_ssse3 optimizations
...
Use packed 8-bit operations rather than unpack to 16-bit.
~5.80x speedup on Haswell (x86-64).
~1.69x speedup on Haswell (x86 32-bit).
2016-02-26 10:58:16 +01:00
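The deblocking commits here repeatedly mention using packed 8-bit operations rather than unpacking to 16-bit. The commit messages do not say which operations are used, so the following is only a generic illustration of the idea: a rounded average can be computed without any intermediate value exceeding 8 bits, which is what lets a SIMD routine keep twice as many pixels per register. Both function names are hypothetical.

```cpp
#include <cstdint>

// Widening version: the sum a + b + 1 needs 9 bits, so a SIMD routine
// written this way must first unpack bytes into 16-bit lanes.
uint8_t AvgRoundWide(uint8_t a, uint8_t b) {
  return static_cast<uint8_t>((static_cast<uint16_t>(a) + b + 1) >> 1);
}

// Narrow version: (a | b) - ((a ^ b) >> 1) equals (a + b + 1) >> 1, and
// every intermediate value fits in 8 bits. PAVGB computes the same
// rounded average directly on packed bytes.
uint8_t AvgRoundNarrow(uint8_t a, uint8_t b) {
  return static_cast<uint8_t>((a | b) - ((a ^ b) >> 1));
}
```

The identity follows from a + b = 2 * (a & b) + (a ^ b) together with a | b = (a & b) + (a ^ b).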
Sindre Aamås
9909c306f1
[Common/x86] DeblockChromaLt4H_ssse3 optimizations
...
Use packed 8-bit operations rather than unpack to 16-bit.
~5.72x speedup on Haswell (x86-64).
~1.85x speedup on Haswell (x86 32-bit).
2016-02-26 10:58:16 +01:00
ruil2
2754129064
Merge pull request #2360 from saamas/common-x86-deblock-optimizations
...
[Common/x86] Deblocking optimizations
2016-02-19 09:52:39 +08:00
sijchen
e07ee9c096
use WELS_DELETE_OP for deleting
2016-02-17 10:07:33 -08:00
sijchen
74955c877f
set pointers to null and call uninit
2016-02-17 10:07:33 -08:00
sijchen
cc675f9fd1
add error handling for the memory allocation failure case
2016-02-17 10:07:33 -08:00
sijchen
71aa533038
move the printing of the MEMORY_CHECK part to a more reasonable place
2016-02-15 10:12:34 -08:00
Sindre Aamås
e96a7b5c92
[Common/x86] DeblockChromaEq4V_ssse3 optimizations
...
Use packed 8-bit operations rather than unpack to 16-bit.
Avoid spills.
~2.07x speedup on Haswell (x86-64).
~2.12x speedup on Haswell (x86 32-bit).
2016-02-15 02:08:03 +01:00
Sindre Aamås
fc16010583
[Common/x86] DeblockChromaLt4V_ssse3 optimizations
...
Use packed 8-bit operations rather than unpack to 16-bit.
Avoid spills.
~2.68x speedup on Haswell (x86-64).
~2.38x speedup on Haswell (x86 32-bit).
2016-02-15 02:07:25 +01:00
Sindre Aamås
62fb37d096
[Common/x86] DeblockLumaEq4_ssse3 optimizations
...
Use packed 8-bit operations rather than unpack to 16-bit.
Minimize spills.
~2.31x speedup on Haswell (x86-64).
~2.40x speedup on Haswell (x86 32-bit).
2016-02-15 02:06:39 +01:00
Sindre Aamås
732e1c5f78
[Common/x86] DeblockLumaLt4_ssse3 optimizations
...
Use packed 8-bit operations rather than unpack to 16-bit.
Avoid spills.
~1.97x speedup on Haswell (x86-64).
~3.09x speedup on Haswell (x86 32-bit).
2016-02-15 02:06:18 +01:00
sijchen
e5e7013b73
Merge pull request #2350 from sijchen/th00
...
[Common] Add sink to IWelsTask
2016-02-08 14:59:38 -08:00
unknown
3873addc3d
fix frame size constraints for width and height
2016-02-01 15:55:53 +08:00
Sindre Aamås
3088d96978
[Encoder] Add an AVX2 4x4 IDCT implementation
...
~2.03x faster on Haswell as compared to the SSE2 version.
2016-01-19 13:12:28 +01:00
sijchen
5eb18b101e
change how the debug trace is output
2016-01-13 22:13:43 -08:00
sijchen
cce1c29844
add sink to IWelsTask (for further enhancements)
2016-01-13 16:24:54 -08:00
Sijia Chen
3e0ee69812
remove unneeded code and add some logs
2015-11-02 23:15:29 -08:00
Sijia Chen
054a297ca7
adjust encoder tasks, add unit tests and enable new thread pool under some slice modes
2015-10-28 09:39:26 -07:00
HaiboZhu
e0cee02d77
Merge pull request #2177 from sijchen/thp21
...
[Encoder] add encoder tasks and task-management class
2015-10-23 13:21:42 +08:00
Martin Storsjö
80c8b7b1cc
Add a missing include of stdlib.h
...
This is required for malloc in this header.
This fixes building for Windows Phone.
2015-10-20 08:59:41 +03:00
Sijia Chen
819f6f5d93
[Encoder] add encoder tasks and task-management class
...
https://rbcommons.com/s/OpenH264/r/1334/
2015-10-19 22:48:28 -07:00
Martin Storsjö
dac26cf923
Remove unused STL includes
...
This fixes building for Android, where libopenh264.so is intended
not to link to any particular STL implementation.
2015-10-19 11:21:29 +03:00
Sijia Chen
b29760ee31
remove unneeded parts
2015-10-15 11:31:34 -07:00
Sijia Chen
ade32f5c48
implementation for WelsSleep on WP8.0
...
https://rbcommons.com/s/OpenH264/r/1315/
2015-10-15 11:27:43 -07:00
Sijia Chen
a3f606e58a
replacement of std::list for m_cBusyThreads
...
https://rbcommons.com/s/OpenH264/r/1320/
2015-10-15 11:17:29 -07:00
Sijia Chen
bc566f0923
store m_cIdleThreads in a CWelsCircleQueue rather than std::map
...
https://rbcommons.com/s/OpenH264/r/1313/
2015-10-15 10:24:48 -07:00
Sijia Chen
eb00d5cb9e
replace std::list with an internal implementation and add a new unit test file for CWelsCircleQueue
...
https://rbcommons.com/s/OpenH264/r/1310/
2015-10-15 10:11:29 -07:00
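To show the kind of STL-free container that eb00d5cb9e refers to, here is a minimal fixed-capacity circular queue of pointers. It is a sketch of the general technique only: the class name, its interface and the fixed capacity are assumptions, and the project's CWelsCircleQueue may differ in all of them.

```cpp
#include <cstddef>

// Fixed-capacity FIFO of non-owning pointers, stored in a ring buffer.
template <typename T, std::size_t kCapacity>
class CircleQueueSketch {
 public:
  // Appends item at the tail; returns false when the queue is full.
  bool PushBack(T* item) {
    if (count_ == kCapacity) return false;
    items_[(head_ + count_) % kCapacity] = item;
    ++count_;
    return true;
  }

  // Removes and returns the oldest item, or nullptr when empty.
  T* PopFront() {
    if (count_ == 0) return nullptr;
    T* item = items_[head_];
    head_ = (head_ + 1) % kCapacity;  // wrap around the end of the array
    --count_;
    return item;
  }

  std::size_t Size() const { return count_; }

 private:
  T* items_[kCapacity] = {};
  std::size_t head_ = 0;   // index of the oldest element
  std::size_t count_ = 0;  // number of stored elements
};
```

A container like this needs no STL headers at link time, which fits the later dac26cf923 note that libopenh264.so on Android is intended not to link to any particular STL implementation.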
Sijia Chen
757a596e97
add basic threadpool functions
...
https://rbcommons.com/s/OpenH264/r/1294/
2015-10-15 10:04:00 -07:00
Haibo Zhu
03d16bb4d1
Remove UBSAN warnings about negative left shift
2015-10-14 19:43:19 -07:00
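Left-shifting a negative signed integer is undefined behaviour in C and in C++ before C++20, which is what UBSAN reports. The commit message for 03d16bb4d1 does not say how the warnings were removed; one common approach, sketched here with a hypothetical helper name, is to multiply by a power of two instead.

```cpp
#include <cassert>
#include <cstdint>

// Equivalent to value << bits for results that fit in int32_t, but
// well-defined for negative values because it is a plain multiplication.
inline int32_t ShiftLeftSigned(int32_t value, unsigned bits) {
  assert(bits < 31);  // keep the multiplier itself representable
  return value * (int32_t{1} << bits);
}
```

For example, `ShiftLeftSigned(-3, 4)` yields -48. Compilers typically turn the multiplication back into a shift, so there is usually no runtime cost.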
HaiboZhu
3067d127aa
Merge pull request #2153 from mstorsjo/fix-warnings
...
Fix warnings when building for iOS with xcode
2015-10-13 18:26:56 +08:00
Martin Storsjö
8363d43588
Fix warnings when building for iOS with xcode
2015-10-13 12:27:11 +03:00
Martin Storsjö
837599becc
Revert an accidental change that broke MSVC compilation
...
This reverts an unrelated part of e7e3b4f37f0.
Since the function still is declared as taking an int32_t parameter
in the header, changing the function implementation makes it end
up as a different function.
2015-10-13 12:15:01 +03:00
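The failure mode described in 837599becc follows from C++ overloading: a definition whose parameter type differs from the declaration is a separate function, so callers compiled against the header are left without a matching definition. The sketch below uses hypothetical names and types (the commit does not say which type the implementation was changed to) and defines both overloads so that it compiles and shows they are distinct.

```cpp
#include <cstdint>

// What the header declares and callers are compiled against.
int ParamKind(int32_t) { return 32; }

// What a changed implementation would define instead. Because the
// parameter type differs, this is a second overload, not a definition
// of the function above.
int ParamKind(uint32_t) { return 33; }
```

If only the second definition existed, a call through the header's `int32_t` declaration would have no definition to link against.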
Haibo Zhu
e7e3b4f37f
Init the string value and add protection for WelsStrcat()
2015-10-10 08:45:48 -07:00
HaiboZhu
d0d7ad57c2
Merge pull request #2116 from mstorsjo/remove-tab-indentation
...
Fix indentation to consistently use spaces instead of tabs
2015-09-16 09:12:07 +08:00
Martin Storsjö
c31e4e23f2
Fix indentation to consistently use spaces instead of tabs
...
Also get rid of other stray tabs in scripts.
2015-09-15 08:41:19 +03:00
fstd
4d063b84cc
Build successfully on OpenBSD (which lacks sysctlbyname(3))
2015-09-12 21:31:39 +02:00
Nathan Kidd
fdabca4cc9
Only use CPU_COUNT if available
...
Fixes build error on Linux hosts with GLIBC < 2.6.
Resolves issue #2089
2015-09-02 18:24:24 -04:00
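A feature test of the kind fdabca4cc9 describes can be sketched as follows. This is not the project's code: `DetectCpuCount` is a hypothetical function, and the manual counting loop and the `sysconf` fallback are assumptions about how one might cope when the `CPU_COUNT` macro is missing.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for cpu_set_t and the CPU_* macros on glibc
#endif
#include <sched.h>
#include <unistd.h>

// Returns the number of CPUs available to this process, or 1 if unknown.
int DetectCpuCount() {
#if defined(__linux__)
  cpu_set_t set;
  CPU_ZERO(&set);
  if (sched_getaffinity(0, sizeof(set), &set) == 0) {
#ifdef CPU_COUNT
    return CPU_COUNT(&set);  // available from glibc 2.6
#else
    int n = 0;  // older glibc: count the set bits by hand
    for (int i = 0; i < CPU_SETSIZE; ++i)
      if (CPU_ISSET(i, &set)) ++n;
    return n;
#endif
  }
#endif
  // Fallback for other systems or if the affinity query failed.
  const long n = sysconf(_SC_NPROCESSORS_ONLN);
  return n > 0 ? static_cast<int>(n) : 1;
}
```

Because `CPU_COUNT` is a macro, `#ifdef` is enough to detect it; no configure-time probe is required.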