openh264

Author	SHA1	Message	Date
Guangwei Wang	7d00e8bc42	add option to enable/disable AVX2	2016-07-15 12:15:57 +08:00
HaiboZhu	c30cc41261	Merge pull request #2448 from saamas/encoder-getnonzerocount-sse42 [Encoder] Add an SSE4.2 implementation of WelsGetNonZeroCount	2016-05-04 09:49:47 +08:00
ruil2	e9dc97803d	Merge pull request #2447 from saamas/encoder-cavlcparamcal-sse42 [Encoder] Add an SSE4.2 implementation of CavlcParamCal	2016-04-28 09:08:44 +08:00
Sindre Aamås	fb0b2b3f41	[Encoder/x86] Drop unneeded LOAD_4_PARA in CavlcParamCal_sse42	2016-04-24 22:59:35 +02:00
Sindre Aamås	d1c7713191	[Encoder/x86] Minor CavlcParamCal_sse42 tweak. Do more elaborate register allocation to avoid a few mov instructions.	2016-04-24 22:36:23 +02:00
Sindre Aamås	f56bdc3aa4	[Encoder/x86] Minor CavlcParamCal_sse42 tweak. Avoid loading single-use parameter.	2016-04-21 16:29:02 +02:00
Sindre Aamås	2eb8800712	[Encoder/x86] Remove a leftover mov instruction in CavlcParamCal_sse42	2016-04-21 15:53:33 +02:00
Sindre Aamås	4645bd26aa	[Encoder] Add an SSE4.2 implementation of WelsGetNonZeroCount. Avoid touching some cache lines by using popcnt instead of table lookups. Also gives a speedup of ~1.4x on Haswell as compared with SSE2.	2016-04-20 19:10:24 +02:00
Sindre Aamås	3f31aff4dc	[Encoder] Add an SSE4.2 implementation of CavlcParamCal. Use a combination of table lookups and pshufb to convert coefficients to zero run/level format. Two 16-entry lookup tables are used for a total of 192 bytes worth of tables. (The existing SSE2 version uses a table of size 2048 bytes.) Speedup is ~1.5x-3x as compared with the SSE2 version on Haswell (the speedup is greater for input with many trailing zeros). The use of popcnt makes it require SSE4.2. This can be replaced with a small LUT and accumulation which would reduce the requirement to SSSE3.	2016-04-20 18:37:08 +02:00
Sindre Aamås	bb49e23719	[Encoder] Add AVX2 4x4 quantization routines: WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2), WelsQuantFour4x4_avx2 (~2.32x speedup over SSE2), WelsQuant4x4Dc_avx2 (~1.49x speedup over SSE2), WelsQuant4x4_avx2 (~1.42x speedup over SSE2).	2016-04-13 11:56:47 +02:00
Sindre Aamås	98042f1600	[Decoder] Use encoder x86 IDCT routines. Move asm routines to common. Delete obsolete decoder routines. Use wider routines where applicable. ~1.07x overall faster decode on a quick 720p30 4Mbps test on Haswell.	2016-03-09 10:41:42 +01:00
Sindre Aamås	c8c74903f8	[Encoder] Add single-block AVX2 4x4 DCT/IDCT routines. We do four blocks at a time when possible, but need to handle single blocks at a time for intra prediction. ~3.15x speedup over MMX for the DCT on Haswell. ~2.94x speedup over MMX for the IDCT on Haswell. Returns diminish with increasing vector length because a larger proportion of the time is spent on load/store/shuffling.	2016-02-02 17:22:49 +01:00
Sindre Aamås	f90960983c	[Encoder] Add single-block SSE2 4x4 DCT/IDCT routines. We do four blocks at a time when possible, but need to handle single blocks at a time for intra prediction. ~2.31x speedup over MMX for the DCT on Haswell. ~1.92x speedup over MMX for the IDCT on Haswell.	2016-02-02 17:22:48 +01:00
Sindre Aamås	7486de2844	[Encoder] AVX2 DCT tweaks. Do some shuffling in load/store unpack/pack to save some work in horizontal DCTs. Use a few 128-bit broadcasts to compact data vectors a bit. ~1.04x speedup for the DCT case on Haswell. ~1.12x speedup for the IDCT case on Haswell.	2016-02-02 17:22:48 +01:00
Sindre Aamås	e22d731f26	[Encoder] yasm-compatible vinserti128 syntax in DCT asm	2016-01-19 21:48:23 +01:00
Sindre Aamås	144ff0fd51	[Encoder] SSE2 4x4 IDCT optimizations. Use a combination of instruction types that distributes more evenly across execution ports on common architectures. Do the horizontal IDCT without transposing back and forth. Minor tweaks. ~1.14x faster on Haswell. Should be faster on other architectures as well.	2016-01-19 13:12:29 +01:00
Sindre Aamås	991e344d8c	[Encoder] SSE2 4x4 DCT optimizations. Use a combination of instruction types that distributes more evenly across execution ports on common architectures. Do the horizontal DCT without transposing back and forth. Minor tweaks. ~1.54x faster on Haswell. Should be faster on other architectures as well.	2016-01-19 13:12:28 +01:00
Sindre Aamås	3088d96978	[Encoder] Add an AVX2 4x4 IDCT implementation. ~2.03x faster on Haswell as compared to the SSE2 version.	2016-01-19 13:12:28 +01:00
Sindre Aamås	b267163f10	[Encoder] Add an AVX2 4x4 DCT implementation. ~2.52x faster on Haswell as compared to the SSE2 version.	2016-01-19 13:12:28 +01:00
Martin Storsjö	a00e2e7229	Convert tabs to spaces in sample_sc.asm. This makes them consistent with the rest of the assembly source files. Prior to f2314151e8, all the assembly files had consistent indentation, but after that, this file had been made different.	2015-04-27 14:07:04 +03:00
ruil2	ed341048de	refine common module for part of intra prediction function	2014-09-25 14:03:11 +08:00
zhiliang wang	ef88889404	refine format and add UT cases	2014-08-15 09:22:37 +08:00
zhiliang wang	76863f977a	Refine asm code format	2014-08-15 08:46:55 +08:00
zhiliang wang	b35f5797de	Add x86 32/64bit asm code for Scc_hash	2014-08-14 18:41:52 +08:00
zhiliang wang	f2314151e8	Add x86 32/64bit asm code for SumOfBlocks.	2014-08-13 11:18:39 +08:00
Martin Storsjö	d1a00d8173	Remove mismatched chars at the end of a line marker. None of the other markers close by have similar chars.	2014-06-09 11:11:25 +03:00
Martin Storsjö	57f6bcc4b0	Convert all tabs to spaces in assembly sources, unify indentation. Previously the assembly sources had mixed indentation consisting of both spaces and tabs, making it quite hard to read unless the right tab size was used in the editor. Tabs have been interpreted as 4 spaces in most cases, matching the surrounding code.	2014-06-01 01:35:43 +03:00
Martin Storsjö	faaf62afad	Get rid of double spaces in macro declarations	2014-06-01 01:13:01 +03:00
Martin Storsjö	ac03b8b503	Avoid unnecessary tabs in macro declarations	2014-06-01 01:13:01 +03:00
Licai Guo	485b2b5b43	Add IntraSad asm code. Enable intraSad ASM code. Refine format. Add X86_ASM protect for intraSad ASM code UT remove duplicated code.	2014-05-04 12:12:38 +08:00
Licai Guo	5c60e8f868	Add ASM related functions for ME cross search. Add asm level functions. Add asm code for ME. Modify format. Add unit test for asm code. Modify function name and format. Remove unused comment. Modify targets file. Add Macro protect for SSE41 function test. Modify according to review request.	2014-04-08 11:24:45 +08:00
Martin Storsjö	ed9c03408f	Rename the asm subdirectories to x86. This is consistent with having the arm assembly in a subdirectory called arm.	2014-03-18 23:09:45 +02:00

32 Commits
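
Note on commit 4645bd26aa: the message describes counting nonzero coefficients with popcnt instead of table lookups. The idea can be sketched in scalar C as follows. This is only an illustration, not the assembly from the commit: the helper names count_nonzero_16 and popcount16 are hypothetical, the block size of 16 int16_t coefficients is an assumption, and the portable bit-clearing loop stands in for the popcnt instruction.

```c
#include <stdint.h>

/* Portable population count of a 16-bit mask.
 * Stand-in for the hardware popcnt instruction (assumption for portability). */
static int popcount16(uint16_t mask) {
    int count = 0;
    while (mask) {
        mask &= (uint16_t)(mask - 1);  /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Hypothetical helper: count the nonzero entries among 16 coefficients.
 * Step 1 builds a bitmask with one bit per nonzero coefficient;
 * step 2 counts the set bits, so no lookup table is read. */
int count_nonzero_16(const int16_t coeffs[16]) {
    uint16_t mask = 0;
    for (int i = 0; i < 16; i++) {
        mask |= (uint16_t)((coeffs[i] != 0) << i);
    }
    return popcount16(mask);
}
```

In a SIMD version the mask would typically come from a vector compare plus a move-mask step, with a single popcnt on the result. The scalar loop above only mirrors that structure.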