The non-intra-pcm branch in hl_decode_mb (simple, 8bpp) goes from 700
to 672 cycles, and the complete loop of decode_mb_cabac and hl_decode_mb
(in the decode_slice loop) goes from 1759 to 1733 cycles on the clip
tested (cathedral), i.e. almost 30 cycles per mb faster.
Signed-off-by: Martin Storsjö <martin@martin.st>
This way, they can be shared between mpeg4qpel and h264qpel without
requiring either one to be compiled unconditionally.
Signed-off-by: Martin Storsjö <martin@martin.st>
From 312 to 89/68 (sse/sse2) cycles on Arrandale and Win64.
Sandybridge: 68/47 cycles.
Having a loop counter is a 7 cycle gain.
Unrolling is another 7 cycle gain.
Working in reverse scan is another 6 cycles.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Timing on Arrandale:
C SSE
Win32: 57 44
Win64: 47 38
Unrolling and not storing mask both save some cycles.
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Timing on Arrandale:
C SSE
Win32: 57 44
Win64: 47 38
Unrolling and not storing mask both save some cycles.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'e5c2794a7162e485eefd3133af5b98fd31386aeb':
x86: consistently use unaligned movs in the unaligned bswap
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This way, the special IDCT permutations are no longer needed. Bfin code
is disabled until someone updates it. This is similar to how H264 does
it, and removes the dsputil dependency imposed by the scantable code.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This way, they can be shared between mpeg4qpel and h264qpel without
requiring either one to be compiled unconditionally.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'e8c52271c45ec27d783e74238dcfad0c2008731c':
Revert "Move H264/QPEL specific asm from dsputil.asm to h264_qpel_*.asm."
Conflicts:
libavcodec/x86/dsputil.asm
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This reverts commit f90ff772e7e35b4923c2de429d1fab9f2569b568.
The code should be put back in h264_qpel_8bit.asm, but unfortunately
it is unconditionally used from dsputil_mmx.c since 71155d7.
* commit '845cfc92f908791714b8c4c8a49c91b8c64b685e':
x86: dsputil: Drop aliasing of ff_put_pixels8_mmx to ff_put_pixels8_mmxext
Conflicts:
libavcodec/x86/dsputil_mmx.c
Note, the commit message is wrong, there are no mmxext instructions as
claimed in the function. The change should do no harm though
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '096cc11ec102701a18951b4f0437d609081ca1dd':
x86: vc1dsp: Move ff_avg_vc1_mspel_mc00_mmxext out of dsputil_mmx.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
The external assembly function uses mmxext instructions and should not be
masqueraded as an mmx-only function. Instead, use the mmx-only inline
assembly function.
This fixes crashes in chromium on win64 on machines with AVX
(crashes that apparently aren't triggered by fate).
Signed-off-by: Martin Storsjö <martin@martin.st>
The non-intra-pcm branch in hl_decode_mb (simple, 8bpp) goes from 700
to 672 cycles, and the complete loop of decode_mb_cabac and hl_decode_mb
(in the decode_slice loop) goes from 1759 to 1733 cycles on the clip
tested (cathedral), i.e. almost 30 cycles per mb faster.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>