openh264

Author	SHA1	Message	Date
Sindre Aamås	98042f1600	[Decoder] Use encoder x86 IDCT routines. Move asm routines to common. Delete obsolete decoder routines. Use wider routines where applicable. ~1.07x overall faster decode on a quick 720p30 4Mbps test on Haswell.	2016-03-09 10:41:42 +01:00
Sindre Aamås	c8c74903f8	[Encoder] Add single-block AVX2 4x4 DCT/IDCT routines. We do four blocks at a time when possible, but need to handle single blocks at a time for intra prediction. ~3.15x speedup over MMX for the DCT on Haswell. ~2.94x speedup over MMX for the IDCT on Haswell. Returns diminish with increasing vector length because a larger proportion of the time is spent on load/store/shuffling.	2016-02-02 17:22:49 +01:00
Sindre Aamås	f90960983c	[Encoder] Add single-block SSE2 4x4 DCT/IDCT routines. We do four blocks at a time when possible, but need to handle single blocks at a time for intra prediction. ~2.31x speedup over MMX for the DCT on Haswell. ~1.92x speedup over MMX for the IDCT on Haswell.	2016-02-02 17:22:48 +01:00
Sindre Aamås	7486de2844	[Encoder] AVX2 DCT tweaks. Do some shuffling in load/store unpack/pack to save some work in horizontal DCTs. Use a few 128-bit broadcasts to compact data vectors a bit. ~1.04x speedup for the DCT case on Haswell. ~1.12x speedup for the IDCT case on Haswell.	2016-02-02 17:22:48 +01:00
Sindre Aamås	e22d731f26	[Encoder] yasm-compatible vinserti128 syntax in DCT asm	2016-01-19 21:48:23 +01:00
Sindre Aamås	144ff0fd51	[Encoder] SSE2 4x4 IDCT optimizations. Use a combination of instruction types that distributes more evenly across execution ports on common architectures. Do the horizontal IDCT without transposing back and forth. Minor tweaks. ~1.14x faster on Haswell. Should be faster on other architectures as well.	2016-01-19 13:12:29 +01:00
Sindre Aamås	991e344d8c	[Encoder] SSE2 4x4 DCT optimizations. Use a combination of instruction types that distributes more evenly across execution ports on common architectures. Do the horizontal DCT without transposing back and forth. Minor tweaks. ~1.54x faster on Haswell. Should be faster on other architectures as well.	2016-01-19 13:12:28 +01:00
Sindre Aamås	3088d96978	[Encoder] Add an AVX2 4x4 IDCT implementation. ~2.03x faster on Haswell as compared to the SSE2 version.	2016-01-19 13:12:28 +01:00
Sindre Aamås	b267163f10	[Encoder] Add an AVX2 4x4 DCT implementation. ~2.52x faster on Haswell as compared to the SSE2 version.	2016-01-19 13:12:28 +01:00
Martin Storsjö	57f6bcc4b0	Convert all tabs to spaces in assembly sources, unify indentation. Previously the assembly sources had mixed indentation consisting of both spaces and tabs, making it quite hard to read unless the right tab size was used in the editor. Tabs have been interpreted as 4 spaces in most cases, matching the surrounding code.	2014-06-01 01:35:43 +03:00
Martin Storsjö	ac03b8b503	Avoid unnecessary tabs in macro declarations	2014-06-01 01:13:01 +03:00
Martin Storsjö	ed9c03408f	Rename the asm subdirectories to x86. This is consistent with having the arm assembly in a subdirectory called arm.	2014-03-18 23:09:45 +02:00

12 Commits