openh264

Author	SHA1	Message	Date
ruil2	4b6f037020	Merge pull request #2489 from saamas/processing-dyadic-bilinear-downsample-optimizations [Processing] DyadicBilinearDownsample optimizations	2016-06-12 10:02:55 +08:00
Sindre Aamås	fe4a47a979	[UT] Add comment on X86_ASM checksum ifdef	2016-06-08 21:53:30 +02:00
Karina	4f41c3a5bf	fix codingIdx update issue	2016-06-02 21:17:31 +08:00
Sindre Aamås	8a0af4a3f2	[Processing/x86] DyadicBilinearDownsample optimizations Average vertically before horizontally; horizontal averaging is more worksome. Doing the vertical averaging first reduces the number of horizontal averages by half. Use pmaddubsw and pavgw to do the horizontal averaging for a slight performance improvement. Minor tweaks. Improve the SSSE3 dyadic downsample routines and drop the SSE4 routines. The non-temporal loads used in the SSE4 routines do nothing for cache-backed memory AFAIK. Adjust tests because averaging vertically first gives slightly different output. ~2.39x speedup for the widthx32 routine on Haswell when not memory-bound. ~2.20x speedup for the widthx16 routine on Haswell when not memory-bound. Note that the widthx16 routine can be unrolled for further speedup.	2016-06-02 13:44:28 +02:00
Karina	4fc2b1f636	refine RC	2016-05-31 16:44:04 +08:00
Karina	e3c306608c	fix dependency ID mapping issue	2016-05-30 15:03:39 +08:00
Sindre Aamås	563376df0c	[UT] Test downsampling routines with a wider variety of height ratios	2016-05-25 14:16:29 +02:00
Sindre Aamås	4fec6d581e	[UT] Test generic downsampling routines with a wider variety of width ratios Get coverage of all code paths for routines that branch to different paths for different scaling ratios.	2016-05-23 20:23:47 +02:00
Sindre Aamås	e490215990	[Processing/x86] Add an AVX2 implementation of GeneralBilinearAccurateDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 8. Because pshufb does not cross 128-bit lanes, the overhead of address calculations and loads is relatively greater as compared with an SSSE3/SSE4.1 implementation. Fall back to a generic approach for ratios > 8. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~8.52x/~6.89x (32-bit/64-bit) for horizontal ratios <= 2, ~7.81x/~6.13x for ratios within (2, 4], ~5.81x/~4.52x for ratios within (4, 8], and ~5.06x/~4.09x for ratios > 8 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:47 +02:00
Sindre Aamås	b43e58a366	[Processing/x86] Add an AVX2 implementation of GeneralBilinearFastDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 8. Because pshufb does not cross 128-bit lanes, the overhead of address calculations and loads is relatively greater as compared with an SSSE3 implementation. Fall back to a generic approach for ratios > 8. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~10.42x/~5.23x (32-bit/64-bit) for horizontal ratios <= 2, ~9.49x/~4.64x for ratios within (2, 4], ~6.43x/~3.18x for ratios within (4, 8], and ~5.42x/~2.50x for ratios > 8 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:47 +02:00
Sindre Aamås	b1013095b1	[Processing/x86] Add an SSE4.1 implementation of GeneralBilinearAccurateDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 4. Fall back to a generic approach for ratios > 4. The use of blendps makes this require SSE4.1. The pshufb path can be backported to SSSE3 and the generic path to SSE2 for a minor reduction in performance by replacing blendps and preceding instructions with an equivalent sequence. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~5.32x/~4.25x (32-bit/64-bit) for horizontal ratios <= 2, ~5.06x/~3.97x for ratios within (2, 4], and ~3.93x/~3.13x for ratios > 4 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:39 +02:00
Sindre Aamås	1995e03d91	[Processing/x86] Add an SSSE3 implementation of GeneralBilinearFastDownsample Keep track of relative pixel offsets and utilize pshufb to efficiently extract relevant pixels for horizontal scaling ratios <= 4. Fall back to a generic approach for ratios > 4. Note that the generic approach can be backported to SSE2. The implementation assumes that data beyond the end of each line, before the next line begins, can be dirtied; which AFAICT is safe with the current usage of these routines. Speedup is ~6.67x/~3.26x (32-bit/64-bit) for horizontal ratios <= 2, ~6.24x/~3.00x for ratios within (2, 4], and ~4.89x/~2.17x for ratios > 4 when not memory-bound on Haswell as compared with the current SSE2 implementation.	2016-05-23 20:23:31 +02:00
Karina	9b2dd55324	add GetBsPostion for cabac and cavlc	2016-05-20 14:29:48 +08:00
sijchen	ffb85046b4	Refactoring: Wrap all the operations related to eSpsPpsIdStrategy to class, to improve code readability	2016-05-04 15:06:02 -07:00
HaiboZhu	c30cc41261	Merge pull request #2448 from saamas/encoder-getnonzerocount-sse42 [Encoder] Add an SSE4.2 implementation of WelsGetNonZeroCount	2016-05-04 09:49:47 +08:00
ruil2	e9dc97803d	Merge pull request #2447 from saamas/encoder-cavlcparamcal-sse42 [Encoder] Add an SSE4.2 implementation of CavlcParamCal	2016-04-28 09:08:44 +08:00
ruil2	7d65687284	Merge pull request #2441 from saamas/encoder-add-avx2-4x4-quantization-routines [Encoder] Add AVX2 4x4 quantization routines	2016-04-28 09:08:31 +08:00
Sindre Aamås	4645bd26aa	[Encoder] Add an SSE4.2 implementation of WelsGetNonZeroCount Avoid touching some cache lines by using popcnt instead of table lookups. Also gives a speedup of ~1.4x on Haswell as compared with SSE2.	2016-04-20 19:10:24 +02:00
Sindre Aamås	d906dda224	[UT] Improve GetNonZeroCount tests Reduce duplication. Test more combinations. Always test boundary cases.	2016-04-20 19:10:24 +02:00
Sindre Aamås	3f31aff4dc	[Encoder] Add an SSE4.2 implementation of CavlcParamCal Use a combination of table lookups and pshufb to convert coefficients to zero run/level format. Two 16-entry lookup tables are used for a total of 192 bytes worth of tables. (The existing SSE2 version uses a table of size 2048 bytes.) Speedup is ~1.5x-3x as compared with the SSE2 version on Haswell (the speedup is greater for input with many trailing zeros). The use of popcnt makes it require SSE4.2. This can be replaced with a small LUT and accumulation which would reduce the requirement to SSSE3.	2016-04-20 18:37:08 +02:00
Sindre Aamås	502b16925e	[UT] Add tests for CavlcParamCal_c and CavlcParamCal_sse2	2016-04-20 18:37:08 +02:00
Sindre Aamås	bb49e23719	[Encoder] Add AVX2 4x4 quantization routines WelsQuantFour4x4Max_avx2 (~2.06x speedup over SSE2) WelsQuantFour4x4_avx2 (~2.32x speedup over SSE2) WelsQuant4x4Dc_avx2 (~1.49x speedup over SSE2) WelsQuant4x4_avx2 (~1.42x speedup over SSE2)	2016-04-13 11:56:47 +02:00
Sindre Aamås	1e83bec860	[UT] Add some missing quantization tests	2016-04-13 11:56:44 +02:00
Sindre Aamås	abaf3a4104	[UT] Reduce duplication in quantization tests	2016-04-13 08:59:16 +02:00
Sindre Aamås	93db6511a8	[UT] Test VAA routines with a wider variety of resolutions Test even and odd multiples of 32 width because some AVX2 routines have conditional logic based on that.	2016-04-11 16:40:36 +02:00
Sindre Aamås	57fc3e9917	[Processing] Add AVX2 VAA routines Process 8 lines at a time rather than 16 lines at a time because this appears to give more reliable memory subsystem performance on Haswell. Speedup is > 2x as compared to SSE2 when not memory-bound on Haswell. On my Haswell MBP, VAACalcSadSsdBgd is about ~3x faster when uncached, which appears to be related to processing 8 lines at a time as opposed to 16 lines at a time. The other routines are also faster as compared to the SSE2 routines in this case but to a lesser extent.	2016-04-11 16:09:56 +02:00
Martin Storsjö	81493590f8	Remove a stray empty line This disappears when regenerating the makefiles.	2016-03-24 10:01:48 +02:00
sijchen	47d310539f	Squashed commit of the following: commit c8111942e07437034a74b33887c33b5ad78e476a Author: Karina <ruil2@cisco.com> Date: Wed Mar 23 14:31:18 2016 +0800 update SHA table commit f36a25344c25a131581dcbcd2d103fc4b131012e Author: Karina <ruil2@cisco.com> Date: Wed Mar 23 13:45:58 2016 +0800 fix bitrate overflow issue when adaptive quality turns on	2016-03-23 10:23:33 -07:00
sijchen	33bb96f604	Merge pull request #2420 from sijchen/fix_sps [Encoder] fix the lack of eSpsPpsIdStrategy==INCREASING_ID under simulcast avc on	2016-03-21 21:51:07 -07:00
zhilwang	d7570bfa52	Merge pull request #2401 from saamas/decoder-use-encoder-x86-idct-routines [Decoder] Use encoder x86 IDCT routines	2016-03-18 08:50:33 +08:00
Sindre Aamås	b6c4a5447c	[Decoder/x86] IDCT one block at a time with SSE2 At lower bitrates, it is overall faster to conditionally do one block at a time with SSE2 on Haswell and likely other common architectures. At higher bitrates, it is faster to use the wider routine that IDCTs four blocks at a time. To avoid potential performance regressions as compared to MMX, stick with single-block IDCTs with SSE2. There is still a performance advantage as compared to MMX because the single-block SSE2 routine is faster than the corresponding MMX routine. Stick with four blocks at a time with AVX2 for which that appears to be consistently faster on Haswell.	2016-03-16 19:55:11 +01:00
huili2	a8d9576297	Merge pull request #2405 from HaiboZhu/Fix_UT_decoder_init_fail Fix the decoder init failed case in UT	2016-03-16 16:28:14 +08:00
sijchen	c009183e97	fix the lack of eSpsPpsIdStrategy==INCREASING_ID under simulcast avc on	2016-03-14 11:28:44 -07:00
Haibo Zhu	46f42ec5f3	Fix the decoder init failed case in UT	2016-03-14 17:06:58 +08:00
Karina	f84f2315ab	change downsampling logic so that the downsampling source is the nearest layer instead of the highest layer	2016-03-14 09:55:36 +08:00
Sindre Aamås	98042f1600	[Decoder] Use encoder x86 IDCT routines Move asm routines to common. Delete obsolete decoder routines. Use wider routines where applicable. ~1.07x overall faster decode on a quick 720p30 4Mbps test on Haswell.	2016-03-09 10:41:42 +01:00
Sindre Aamås	48a520915a	[Encoder/x86] Add AVX2 SATD routines WelsSampleSatd16x16_avx2 (~2.31x speedup over SSE4.1 on Haswell). WelsSampleSatd16x8_avx2 (~2.19x speedup over SSE4.1 on Haswell). WelsSampleSatd8x16_avx2 (~1.68x speedup over SSE4.1 on Haswell). WelsSampleSatd8x8_avx2 (~1.53x speedup over SSE4.1 on Haswell).	2016-03-08 11:31:17 +01:00
sijchen	4db9c32976	remove sink in WelsThreadPool and hide the constructor to finish the singleton	2016-03-02 17:08:09 -08:00
sijchen	d4f09d9048	put CWelsThreadPool to singleton for future usage (including add sink for IWelsTask)	2016-02-29 11:40:25 -08:00
Gregory J. Wolfe	03890fe86f	Added support for "video signal type present" information. The "Video signal type present" information is written to the output video file when it is created, and later is used by the decoder to properly decode the compressed video data. The saved attributes are: - format type (PAL, NTSC, etc.) - color primaries (BT709, SMPTE170M, etc.) - transfer characteristics (BT709, SMPTE170M, etc.) - color matrix (BT709, SMPTE170M, etc.) These modifications allow the client to specify these attributes and, if specified, make sure they are written to the output file.	2016-02-24 10:33:18 -05:00
Gregory J. Wolfe	c7fcba06c7	Added support for "video signal type present" information. The "Video signal type present" information is written to the output video file when it is created, and later is used by the decoder to properly decode the compressed video data. The saved attributes are: - format type (PAL, NTSC, etc.) - color primaries (BT709, SMPTE170M, etc.) - transfer characteristics (BT709, SMPTE170M, etc.) - color matrix (BT709, SMPTE170M, etc.) These modifications allow the client to specify these attributes and, if specified, make sure they are written to the output file.	2016-02-23 13:21:06 -05:00
sijchen	aaa25160ec	Merge pull request #2353 from saamas/encoder-x86-dct-opt2 [Encoder] x86 DCT optimizations	2016-02-08 15:00:12 -08:00
sijchen	e5e7013b73	Merge pull request #2350 from sijchen/th00 [Common] Add sink to IWelsTask	2016-02-08 14:59:38 -08:00
Sindre Aamås	c8c74903f8	[Encoder] Add single-block AVX2 4x4 DCT/IDCT routines We do four blocks at a time when possible, but need to handle single blocks at a time for intra prediction. ~3.15x speedup over MMX for the DCT on Haswell. ~2.94x speedup over MMX for the IDCT on Haswell. Returns diminish with increasing vector length because a larger proportion of the time is spent on load/store/shuffling.	2016-02-02 17:22:49 +01:00
Sindre Aamås	f90960983c	[Encoder] Add single-block SSE2 4x4 DCT/IDCT routines We do four blocks at a time when possible, but need to handle single blocks at a time for intra prediction. ~2.31x speedup over MMX for the DCT on Haswell. ~1.92x speedup over MMX for the IDCT on Haswell.	2016-02-02 17:22:48 +01:00
unknown	3873addc3d	fix frame size constraints for width and height	2016-02-01 15:55:53 +08:00
HaiboZhu	1030820ec4	Merge pull request #2342 from sijchen/enh_ut_tem [UT] correct and enhance the ut template and trace improvement	2016-02-01 09:08:05 +08:00
sijchen	47e3f4c45c	correct and enhance the ut template	2016-01-19 17:16:39 -08:00
Sindre Aamås	cc8d541432	[UT] Utilize DCT function pointer typedefs	2016-01-19 22:00:24 +01:00
Sindre Aamås	a45c10cf91	[UT] Only run AVX2 tests if host supports AVX2	2016-01-19 14:27:46 +01:00
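
Commit 8a0af4a3f2 above notes that averaging vertically before horizontally "gives slightly different output". The scalar C sketch below illustrates why, under stated assumptions: the function names (`avg_u8`, `downsample_h_first`, `downsample_v_first`) are hypothetical and are not openh264 APIs, and the rounding average `(a + b + 1) >> 1` is assumed to match the rounding used by the real routines. It is not the project's implementation, only a small model of the two orderings.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounding average of two bytes (same rounding as the x86 pavgb
 * instruction). Assumption: the real routines round this way. */
static uint8_t avg_u8(uint8_t a, uint8_t b) {
    return (uint8_t)((a + b + 1) >> 1);
}

/* Horizontal first: average each horizontal pair in both rows
 * (two horizontal averages per 2x2 block), then average vertically. */
void downsample_h_first(uint8_t *dst, size_t dst_stride,
                        const uint8_t *src, size_t src_stride,
                        size_t src_w, size_t src_h) {
    for (size_t y = 0; y + 1 < src_h; y += 2) {
        const uint8_t *r0 = src + y * src_stride;
        const uint8_t *r1 = r0 + src_stride;
        uint8_t *out = dst + (y / 2) * dst_stride;
        for (size_t x = 0; x + 1 < src_w; x += 2) {
            uint8_t top = avg_u8(r0[x], r0[x + 1]);
            uint8_t bot = avg_u8(r1[x], r1[x + 1]);
            out[x / 2] = avg_u8(top, bot);
        }
    }
}

/* Vertical first: average the two rows column by column, then do a
 * single horizontal average per 2x2 block on the combined row. */
void downsample_v_first(uint8_t *dst, size_t dst_stride,
                        const uint8_t *src, size_t src_stride,
                        size_t src_w, size_t src_h) {
    for (size_t y = 0; y + 1 < src_h; y += 2) {
        const uint8_t *r0 = src + y * src_stride;
        const uint8_t *r1 = r0 + src_stride;
        uint8_t *out = dst + (y / 2) * dst_stride;
        for (size_t x = 0; x + 1 < src_w; x += 2) {
            uint8_t left  = avg_u8(r0[x], r1[x]);
            uint8_t right = avg_u8(r0[x + 1], r1[x + 1]);
            out[x / 2] = avg_u8(left, right);
        }
    }
}
```

For the 2x2 block with rows {1, 0} and {1, 2}, the horizontal-first order yields 2 and the vertical-first order yields 1, because the intermediate roundings fall on different pairs. In a SIMD version the vertical average is a plain element-wise operation on two row vectors, so doing it first leaves only half as much data for the more expensive horizontal step.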

711 Commits