Average vertically before horizontally; horizontal averaging is more
expensive. Doing the vertical averaging first halves the number of
horizontal averages needed.
Use pmaddubsw and pavgw to do the horizontal averaging for a slight
performance improvement.
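A minimal sketch of this scheme in isolation (NASM syntax; the register
roles are assumptions, not the actual routine): xmm0 and xmm1 hold 16
pixels each from two adjacent source rows, xmm6 is zeroed and xmm7
holds all-ones bytes.

    pavgb     xmm0, xmm1   ; vertical: rounded average of the two rows
    pmaddubsw xmm0, xmm7   ; horizontal: sum adjacent byte pairs into words
    pavgw     xmm0, xmm6   ; (sum + 1) >> 1, i.e. the rounded average
    packuswb  xmm0, xmm0   ; pack the averaged words back to bytes

pmaddubsw against ones sums each byte pair into a word, and pavgw
against zero rounds and halves that sum, replacing a longer
unpack/add/shift sequence.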
Minor tweaks.
Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines.
The non-temporal loads used in the SSE4 routines do nothing for cache-
backed memory AFAIK.
Adjust tests because averaging vertically first gives slightly different
output.
~2.39x speedup for the widthx32 routine on Haswell when not memory-bound.
~2.20x speedup for the widthx16 routine on Haswell when not memory-bound.
Note that the widthx16 routine can be unrolled for further speedup.
Keep track of relative pixel offsets and use pshufb to efficiently
extract the relevant pixels for horizontal scaling ratios <= 4.
Fall back to a generic approach for ratios > 4. Note that the generic
approach can be backported to SSE2.
The implementation assumes that data beyond the end of each line,
before the next line begins, can be dirtied, which AFAICT is safe with
the current usage of these routines.
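A hedged sketch of the pshufb extraction (illustrative register roles,
not the actual code): a shuffle-control register holds the offsets,
relative to the current 16-byte source window, of the pixels the next
batch of output samples needs, derived from the accumulated fixed-point
x positions.

    movdqu xmm0, [rsi + rax]  ; load a 16-byte source window
    pshufb xmm0, xmm3         ; gather the needed pixels in one shuffle

The smaller the ratio, the more output samples fit within one window
per shuffle, which presumably is why the speedup figures below come in
ratio tiers.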
Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2,
~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios
> 4 when not memory-bound on Haswell as compared with the current SSE2
implementation.
WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2)
WelsQuantFour4x4_avx2 (~2.32x speedup over SSE2)
WelsQuant4x4Dc_avx2 (~1.49x speedup over SSE2)
WelsQuant4x4_avx2 (~1.42x speedup over SSE2)
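For reference, a hedged sketch of the core step these routines
vectorize, assuming the usual openh264 quant formula
out = sign(in) * (((abs(in) + ff) * mf) >> 16); register roles are
assumptions:

    vmovdqa  ymm0, [rdi]        ; 16 coefficients
    vpabsw   ymm3, ymm0         ; abs(in)
    vpaddusw ymm3, ymm3, ymm1   ; + rounding offset ff
    vpmulhuw ymm3, ymm3, ymm2   ; * mf, keeping the high word (>> 16)
    vpsignw  ymm3, ymm3, ymm0   ; reapply the original sign
    vmovdqa  [rdi], ymm3

The Max variants additionally track the largest quantized level per
4x4 block (e.g. via vpmaxuw on the results before the sign is
reapplied), and the Dc variants use a single ff/mf pair for all
coefficients.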
At lower bitrates, it is overall faster to conditionally do one block
at a time with SSE2 on Haswell and likely other common architectures.
At higher bitrates, it is faster to use the wider routine that IDCTs
four blocks at a time. To avoid potential performance regressions
as compared to MMX, stick with single-block IDCTs with SSE2. There
is still a performance advantage as compared to MMX because the
single-block SSE2 routine is faster than the corresponding MMX
routine.
Stick with four blocks at a time with AVX2, which appears to be
consistently faster on Haswell.
Move asm routines to common. Delete obsolete decoder routines.
Use wider routines where applicable.
~1.07x overall faster decode on a quick 720p30 4Mbps test on Haswell.
WelsSampleSatd16x16_avx2 (~2.31x speedup over SSE4.1 on Haswell).
WelsSampleSatd16x8_avx2 (~2.19x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x16_avx2 (~1.68x speedup over SSE4.1 on Haswell).
WelsSampleSatd8x8_avx2 (~1.53x speedup over SSE4.1 on Haswell).
pFeatureC is a pointer to uint32_t, so load only 32 bits into ecx.
This avoids loading potentially uninitialized data into the upper
half of the rcx register, fixing valgrind warnings in some build
setups (depending on how the compiler chooses to lay out the stack
in the calling function).
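The essence of the fix (illustrative operands, not the exact routine):

    mov ecx, [pFeatureC]   ; 4-byte load; writing ecx zeroes the upper half of rcx
    ; previously: mov rcx, [pFeatureC] -- an 8-byte load whose upper
    ; 4 bytes may be uninitialized stack in the caller

Writes to a 32-bit register implicitly zero-extend into the full
64-bit register, so the narrower load is sufficient wherever rcx is
consumed afterwards.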
Previously the assembly sources had mixed indentation consisting
of both spaces and tabs, making them quite hard to read unless
the right tab size was used in the editor.
Tabs have been interpreted as 4 spaces in most cases, matching
the surrounding code.