It has already been demonstrated that the de Bruijn method has benefits
over the current implementation: commit 971d12b7f9.
That commit implemented it for long long, this extends it to the int version.
Tested with FATE.
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
This reduces the memory & cache need from 256 to 64 bytes
the code also seems faster with this change
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
This uses Stein's binary GCD algorithm:
https://en.wikipedia.org/wiki/Binary_GCD_algorithm
to get a roughly 4x speedup over Euclidean GCD on standard architectures
with a compiler intrinsic for ctzll, and a roughly 2x speedup otherwise.
At the moment, the compiler intrinsic is used on GCC and Clang due to
its easy availability.
Quick note regarding overflow: yes, subtractions on int64_t can, but the
llabs takes care of that. The llabs is also guaranteed to be safe, with
no annoying INT64_MIN business since INT64_MIN being a power of 2, is
shifted down before being sent to llabs.
The binary GCD needs ff_ctzll, an extension of ff_ctz for long long (int64_t). On
GCC, this is provided by a built-in. On Microsoft, there is a
BitScanForward64 analog of BitScanForward that should work; but I can't confirm.
Apparently it is not available on 32 bit builds; so this may or may not
work correctly. On Intel, per the documentation there is only an
intrinsic for _bit_scan_forward and people have posted on forums
regarding _bit_scan_forward64, but often their documentation is
woeful. Again, I don't have it, so I can't test.
As such, to be safe, for now only the GCC/Clang intrinsic is added, the rest
use a compiled version based on the De-Bruijn method of Leiserson et al:
http://supertech.csail.mit.edu/papers/debruijn.pdf.
Tested with FATE, sample benchmark (x86-64, GCC 5.2.0, Haswell)
with a START_TIMER and STOP_TIMER in libavutil/rationsl.c, followed by a
make fate.
aac-am00_88.err:
builtin:
714 decicycles in av_gcd, 4095 runs, 1 skips
de-bruijn:
1440 decicycles in av_gcd, 4096 runs, 0 skips
previous:
2889 decicycles in av_gcd, 4096 runs, 0 skips
Signed-off-by: Ganesh Ajjanagadde <gajjanagadde@gmail.com>
Signed-off-by: Michael Niedermayer <michael@niedermayer.cc>
Intel compiler also defines __GNUC__, so the Intel specific intrinsics were not
really being used.
Reviewed-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: James Almer <jamrial@gmail.com>
* commit '5ff998a233d759d0de83ea6f95c383d03d25d88e':
flacenc: use uint64_t for bit counts
flacenc: remove wasted trailing 0 bits
lavu: add av_ctz() for trailing zero bit count
flacenc: use a separate buffer for byte-swapping for MD5 checksum on big-endian
fate: aac: Place LATM tests and general AAC tests in different groups
build: The A64 muxer depends on rawenc.o for ff_raw_write_packet()
Conflicts:
doc/APIchanges
libavutil/version.h
tests/fate/aac.mak
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '2d09b36c0379fcda8f984bc8ad8816c8326fd7bd':
doc/platform: Add info on shared builds with MSVC
doc/platform: Move a caveat down to the notes section
ARM: reinstate optimised intmath.h
ffv1: update to ffv1 version 3
Conflicts:
doc/platform.texi
libavcodec/ffv1.c
libavcodec/ffv1.h
libavcodec/ffv1dec.c
libavcodec/ffv1enc.c
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'd15c21e5fa3961f10026da1a3080a3aa3cf4cec9':
avutil: Add a copy of ff_sqrt_tab back into avutil to restore ABI compatibility
avutil: make some tables visible again
avutil: remove inline av_log2 from public API
celp_math: rename ff_log2 to ff_log2_q15
Conflicts:
libavutil/libavutil.v
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This removes inline av_log2 and av_log2_16bit from the public API,
instead exporting them as regular functions. In-tree code still
gets the inline and otherwise optimised variants.
Signed-off-by: Mans Rullgard <mans@mansr.com>
* commit '9734b8ba56d05e970c353dfd5baafa43fdb08024':
Move avutil tables only used in libavcodec to libavcodec.
Conflicts:
libavcodec/mathtables.c
libavutil/intmath.h
Merged-by: Michael Niedermayer <michaelni@gmx.at>
GCC 4.3 and later do the right thing with the plain C code. Earlier
versions in 32-bit mode generate one extra instruction, needlessly
zeroing what would be the high half of the shifted value. At least
two gcc configurations miscompile the inline asm in some situations.
In 64-bit mode, all gcc versions generate imul r64, r64 followed by
shr. On Intel i7 and later, this imul is faster 32-bit mul. On
older Intel and all AMD, it is slightly slower. On Atom it is much
slower.
Considering where the FASTDIV macro is used, any overall negative
performance impact of this change should be negligible. If anyone
cares, they should file a bug against gcc and get the instruction
selection fixed.
Signed-off-by: Mans Rullgard <mans@mansr.com>
* qatar/master:
build: x86: Only compile mpegvideo optimizations when necessary
configure: Drop fastdiv option
build: Make the E-AC-3 encoder select the AC-3 encoder
fate: flac: Only run tests requiring samples when samples are available
Conflicts:
configure
Merged-by: Michael Niedermayer <michaelni@gmx.at>
There is no point in having the user disable any fastdiv macros.
Besides the condition implementation was broken and only disabled
the C implementation, but no platform specific assembly versions.
* qatar/master: (22 commits)
aacdec: Fix PS in ADTS.
avconv: Consistently use PIX_FMT_NONE.
dsputil: use cpuflags in x86 emu_edge_core
dsputil: use movups instead of movdqu in ff_emu_edge_core_sse()
wma: initialize prev_block_len_bits, next_block_len_bits, and block_len_bits.
mov: Remove some redundant and obsolete comments.
Add libavutil/mathematics.h #includes for INFINITY
doxy: structure libavformat groups
doxy: introduce an empty structure in libavcodec
doxy: provide a start page and document libavutil
doxy: cleanup pixfmt.h
regtest: split video encode/decode tests into individual targets
ARM: add explicit .arch and .fpu directives to asm.S
pthread: do not touch has_b_frames
avconv: cleanup the transcoding loop in output_packet().
avconv: split subtitle transcoding out of output_packet().
avconv: split video transcoding out of output_packet().
avconv: split audio transcoding out of output_packet().
avconv: reindent.
avconv: move streamcopy-only code out of decoding loop.
...
Conflicts:
avconv.c
libavcodec/aaccoder.c
libavcodec/pthread.c
libavcodec/version.h
libavutil/audioconvert.h
libavutil/avutil.h
libavutil/mem.h
tests/ref/vsynth1/dv
tests/ref/vsynth1/mpeg2thread
tests/ref/vsynth2/dv
tests/ref/vsynth2/mpeg2thread
Merged-by: Michael Niedermayer <michaelni@gmx.at>
This is a bit hackish. I will try to think of something nicer, but
this will do for now.
Originally committed as revision 22366 to svn://svn.ffmpeg.org/ffmpeg/trunk