ffmpeg

Author	SHA1	Message	Date
Christophe Gisquet	f4b0d12f5b	x86: sbrdsp: Implement SSE neg_odd_64 Timing on Arrandale: C SSE Win32: 57 44 Win64: 47 38 Unrolling and not storing mask both save some cycles. Signed-off-by: Diego Biurrun <diego@biurrun.de>	2013-04-05 22:47:04 +02:00
Diego Biurrun	c9f933b5b6	Add av_cold attributes to arch-specific init functions	2013-02-05 17:01:05 +01:00
Christophe Gisquet	4f50646697	x86: sbrdsp: Implement SSE qmf_post_shuffle 255 to 174 cycles on Arrandale / Win64. Unrolling yields no gain. Signed-off-by: Diego Biurrun <diego@biurrun.de>	2013-01-06 13:57:01 +01:00
Christophe Gisquet	44a0036d10	x86: sbrdsp: Implement SSE sum64x5 698 to 174 cycles on Arrandale. Unrolling is a 6 cycles gain. Signed-off-by: Diego Biurrun <diego@biurrun.de>	2013-01-06 13:57:01 +01:00
Christophe Gisquet	2aef3d66c9	SBR DSP x86: implement SSE sbr_hf_gen Start and end index are multiple of 2, therefore guaranteeing aligned access. Also, this allows to generate 4 floats per loop, keeping the alignment all along. Timing: - 32 bits: 326c -> 172c - 64 bits: 323c -> 156c Signed-off-by: Diego Biurrun <diego@biurrun.de>	2012-12-07 11:04:26 +01:00
Diego Biurrun	e0c6cce447	x86: Replace checks for CPU extensions and flags by convenience macros This separates code relying on inline from that relying on external assembly and fixes instances where the coalesced check was incorrect.	2012-09-08 18:18:34 +02:00
Christophe GISQUET	2784d18791	SBR DSP x86: implement SSE sbr_hf_g_filt Unrolling the main loop to process, instead of 4 elements: - 8: minor gain of 2 cycles (not worth the extra object size) - 2: loss of 8 cycles. Assigning STEP to a register is a loss. Output address (Y) is almost always unaligned. Timings: - C (32/64 bits): 117/109 cycles - SSE: 57 cycles Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2012-02-23 15:50:09 -08:00
Christophe GISQUET	34454c761f	SBR DSP x86: implement SSE sbr_sum_square_sse The 32bits targets have been compiled with -mfpmath=sse for proper reference. sbr_sum_square C /32bits: 82c (unrolled)/102c C /64bits: 69c (unrolled)/82c SSE/32bits: 42c SSE/64bits: 31c Use of SSE4.1 dpps to perform the final sum is slower. Not unrolling to perform 8 operations in a loop yields 10 more cycles. Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>	2012-02-23 15:50:06 -08:00

8 Commits