Intrinsics only used on aarch64 since the existing ARMv7 NEON asm
is slightly faster (Cortex-A9, gcc-4.8, micro-benchmarks and full
decoding time).
Signed-off-by: James Yu <james.yu@linaro.org>
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
Such files can be created using the --bff x264 option.
Sample-Id: h264_direct_temporal_mvs_bff.mkv
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
The previous implementation targeted DTS Coherent Acoustics, which only
requires nbits == 4 (fft16()). This case was (and still is) linked directly
rather than being indirected through ff_fft_calc_vfp(), but now the full
range from radix-4 up to radix-65536 is available. This benefits other codecs
such as AAC and AC3.
The implementaion is based upon the C version, with each routine larger than
radix-16 calling a hierarchy of smaller FFT functions, then performing a
post-processing pass. This pass benefits a lot from loop unrolling to
counter the long pipelines in the VFP. A relaxed calling standard also
reduces the overhead of the call hierarchy, and avoiding the excessive
inlining performed by GCC probably helps with I-cache utilisation too.
I benchmarked the result by measuring the number of gperftools samples that
hit anywhere in the AAC decoder (starting from aac_decode_frame()) or
specifically in the FFT routines (fft4() to fft512() and pass()) for the
same sample AAC stream:
Before After
Mean StdDev Mean StdDev Confidence Change
Audio decode 2245.5 53.1 1599.6 43.8 100.0% +40.4%
FFT routines 940.6 22.0 348.1 20.8 100.0% +170.2%
Signed-off-by: Martin Storsjö <martin@martin.st>
The previous implementation targeted DTS Coherent Acoustics, which only
requires mdct_bits == 6. This relatively small size lent itself to
unrolling the loops a small number of times, and encoding offsets
calculated at assembly time within the load/store instructions of each
iteration.
In the more general case (codecs such as AAC and AC3) much larger arrays
are used - mdct_bits == [8, 9, 11]. The old method does not scale for
these cases, so more integer registers are used with non-unrolled versions
of the loops (and with some stack spillage). The postrotation filter loop
is still unrolled by a factor of 2 to permit the double-buffering of some
VFP registers to facilitate overlap of neighbouring iterations.
I benchmarked the result by measuring the number of gperftools samples
that hit anywhere in the AAC decoder (starting from aac_decode_frame())
or specifically in ff_imdct_half_c / ff_imdct_half_vfp, for the same
example AAC stream:
Before After
Mean StdDev Mean StdDev Confidence Change
aac_decode_frame 2368.1 35.8 2117.2 35.3 100.0% +11.8%
ff_imdct_half_* 457.5 22.4 251.2 16.2 100.0% +82.1%
Signed-off-by: Martin Storsjö <martin@martin.st>
For implicit signaling cases (as possible for Spectral Band Replication
and Parametric Stereo Tools), the decoder must decode the first frame to
correctly identify the stream configuration (as called from
avformat_find_stream_info). The mechanism for this is built-in and only
requires adding CODEC_CAP_CHANNEL_CONF to the libfdk-aacdec AVCodec
struct.
Signed-off-by: Omer Osman <omer.osman@iis.fraunhofer.de>
Signed-off-by: Martin Storsjö <martin@martin.st>
The default error concealment method if none is set via
aacDecoder_SetParam(AAC_CONCEAL_METHOD) is set in
CConcealment_InitCommonData within the fdk-aac library
and is set to Energy Interpolation. This method requires one frame
delay to the output. To reduce the default decoder output delay and
avoid missing the last frame in file based decoding, use Noise
Substitution as the default concealment method.
Signed-off-by: Omer Osman <omer.osman@iis.fraunhofer.de>
Signed-off-by: Martin Storsjö <martin@martin.st>
Add ability to handle multiple palettes and objects simultaneously.
Each simultaneous object is given its own AVSubtitleRect.
Note that there can be up to 64 currently valid objects, but only
2 at any one time can be "presented".
Signed-off-by: Anton Khirnov <anton@khirnov.net>
With optimzations disabled compilers have trouble doing dead code
elimination on 'if (foo && 0)' expressions, while 'if (0 && foo)'
still works, so use the latter to avoid problems.
Bug-Id: 707
As indicated in the function documentation, the header MUST be
checked prior to calling it because no consistency check is done
there.
CC:libav-stable@libav.org
Fixes an invalid read past the end of avpriv_mpa_freq_tab.
Fixes divide-by-zero due to sample_rate being set to 0.
Bug-Id: 705
CC:libav-stable@libav.org
Intel i263 codec has special 8-byte dummy frames that should not be decoded,
so do not even attempt to decode them and skip them instead.
Signed-off-by: Kostya Shishkov <kostya.shishkov@gmail.com>
Since error resilience uses AVFrame pointers instead of references it
has to copy NULL pointers too. After a codec flush the last/next frame
pointers in MpegEncContext are NULL and the old pointers remaining in
ERContext are invalid. Fixes a crash in vlc for android thumbnailer.
Reported and debugged by Adrien Maglo <magsoft@videolan.org>.
We know that the called function (ff_chroma_inter_body_mmxext)
doesn't touch the redzone, and thus will be kept intact - thus,
this doesn't fix any bug per se.
However, valgrind's memcheck tool intentionally assumes that the
redzone is clobbered on every function call and function return
(see a long comment in valgrind/memcheck/mc_main.c). This avoids
false positives in that tool, at the cost of an extra stack pointer
adjustment.
The other alternative would be a valgrind suppression for this issue,
but that's an extra burden for everybody that wants to run libavcodec
within valgrind.
Signed-off-by: Martin Storsjö <martin@martin.st>
The actual predictor value, set by the trellis code, never
was written back into the variable that was written into
the block header. This was accidentally removed in b304244b.
This significantly improves the audio quality of the trellis
case, which was plain broken since b304244b.
Encoding IMA QT with trellis still actually gives a slightly
worse quality than without trellis, since the trellis encoder
doesn't use the exact same way of rounding as in
adpcm_ima_qt_compress_sample and adpcm_ima_qt_expand_nibble.
CC: libav-stable@libav.org
Signed-off-by: Martin Storsjö <martin@martin.st>
This was broken in 095be4fb - samples+ch (for the previous
non-planar case) equals &samples_p[ch][0]. The confusion
probably stemmed from the IMA WAV case where it originally
was &samples[avctx->channels + ch], which was correctly
changed into &samples_p[ch][1].
CC: libav-stable@libav.org
Signed-off-by: Martin Storsjö <martin@martin.st>
This reduces the number of different licenses used within libav,
and is preferrable since it has less ambiguous wordings than
the BSD license with respect to the duties of the user of the code.
Fraunhofer have now indicated that they're allowed to contribute
code under this license as well.
Signed-off-by: Martin Storsjö <martin@martin.st>
Move the GNU as check before the arch specific asm checks since the .dn
check requires gas compatible assembler.
Disable the VC-1 motion compensation NEON asm which is the only part
using that directive. The integrated assembler in the upcoming clang 3.5
does not support .dn/.qn without plans to change that. Too much effort
to implement it while it is rarely used.
http://llvm.org/bugs/show_bug.cgi?id=18199.
Blackfin is a painful platform to work with, no test machines are available
and the range of multimedia applications is dubious. Thus it only represents
a maintenance burden.
Some encoders (e.g. flac) need to send side data when there is no more
data to be output. This enables them to output a packet with no data in
it, only side data.
This avoids all the ABI troubles associated with avpriv_.
Since this function is very small and does not depend on any tables,
making it inline should have no adverse effects.
Also include zero in the table, eliminating a special case in the
decoder.
Signed-off-by: Niels Möller <nisse@southpole.se>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
Add AV_PKT_DATA_DISPLAYMATRIX and AV_FRAME_DATA_DISPLAYMATRIX as stream and
frame side data (respectively) to describe a display transformation matrix
for linear transformation operations on the decoded video.
Add functions to easily extract a rotation angle from a matrix and
conversely to setup a matrix for a given rotation angle.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
Side data count is incremented by by calling av_packet_new_side_data()
in the following loop, setting it explicitly results in the resulting
value being twice what it should be.
CC: libav-stable@libav.org
Right now, the caller has to manually manage some allocated
AVCodecContext fields, like extradata or subtitle_header. This is
fragile and prone to leaks, especially if we want to add more such
fields in the future.
The only reason for this behaviour is so that the AVStream codec context
can be reused for decoding. Such reuse is discouraged anyway, so this
commit is the first step to deprecating it.
Initial implementation by Andrew D'Addesio <modchipv12@gmail.com> during
GSoC 2012.
Completion by Anton Khirnov <anton@khirnov.net>, sponsored by the
Mozilla Corporation.
Further contributions by:
Christophe Gisquet <christophe.gisquet@gmail.com>
Janne Grunau <janne-libav@jannau.net>
Luca Barbato <lu_zero@gentoo.org>
It leverages the new hwaccel 1.2 features:
- get_buffer2 is never called
- the internal context is automatically initialized/deinitialized
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
This should make it possible for Fraunhofer to contribute to these
wrappers - they didn't want to contribute to code under LGPL2.1 with
the "or any later version" clause (which allowed using the code
under the LGPL3 license).
Signed-off-by: Martin Storsjö <martin@martin.st>
7.1(wide) and 7.1(wide-side) channel layouts are supported in
fdk-aac since the 0.1.3 release.
The earlier versions of fdk-aac didn't include any library
version defines in the public headers, thus checking for
the AACENCODER_LIB_VL0 define is enough to know that we're
building against a new enough version of fdk-aac.
This change includes contributions by Tim Walker,
Michael Niedermayer and Timothy Gu.
Signed-off-by: Martin Storsjö <martin@martin.st>
Old Intel GPUs expect the reference frame index to the actual surface,
instead of the index into RefFrameList as specified by the spec.
This workaround should be set when using one of the "ClearVideo" decoder
devices.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
The latest H.264 DXVA specification states that the index in this
structure should refer to a valid entry in the RefFrameList of the picture
parameter structure, and not to the actual surface index.
Fixes H.264 DXVA2 decoding on recent Intel GPUs (tested on Sandy and Ivy)
Signed-off-by: Anton Khirnov <anton@khirnov.net>
This caused mpv (and possibly others) to fallback to software decoding after
seeking a VC1 stream.
Bug-Id: 667
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
The picture slot can be recycled by select_input_picture and
only current_picture is populated with the valid pts.
Unbreak timestamps when in cbr mode.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Fixes artifacts
Fixes use of freed memory
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
Also set the RGBA pixel format correctly as the native endian format,
which is what it returns.
This fixes the tests on big endian.
Signed-off-by: Martin Storsjö <martin@martin.st>
avctx->coded_{height,width} will always equal h->{height,width} since
init_dimensions() does that explicitly, Size changes are detected by
changes in mb_{height,width} earlier and propagated through the
needs_reinit variable.
This array is written using AV_WN32A, assuming alignment.
This hopefully fixes the failing vp7 fate test on sparc.
Signed-off-by: Martin Storsjö <martin@martin.st>
No initialization is needed in dca_decode_frame, because the next
thing it does is calling dca_parse_frame_header, which takes care of
the needed initialization.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Paul B Mahol <onemda@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: Justin Ruggles <justin.ruggles@gmail.com>
CC:libav-stable@libav.org
The most interesting parts are initialization in ff_MPV_common_init() and
uninitialization in ff_MPV_common_end().
ff_mpeg_unref_picture and ff_thread_release_buffer have additional NULL
checks for Picture.f, because these functions can be called on
uninitialized or partially initialized Pictures.
NULL pointer checks are added to ff_thread_release_buffer() stub function.
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
The old implementation is unusable due to changes in the Xvid API.
Further fixes by Michael Niedermayer <michaelni@gmx.at>.
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
Sandy Bridge Win64:
180 cycles in ff_synth_filter_inner_sse2
150 cycles in ff_synth_filter_inner_avx
Also switch some instructions to a three operand format to avoid
assembly errors with Yasm 1.1.0 or older.
Signed-off-by: James Almer <jamrial@gmail.com>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
Further performance improvements and security fixes by
Vittorio Giovara, Luca Barbato and Diego Biurrun.
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
Signed-off-by: Diego Biurrun <diego@biurrun.de>
Additional fixes and enhancements by Vittorio Giovara, Gonzalo Garramuno,
Nicolas George, Paul B Mahol and Michael Niedermayer.
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
The assumption of (MPEG) Picture and H264Picture layout matching might
not hold true in the future.
Signed-off-by: Hendrik Leppkes <h.leppkes@gmail.com>
This allows proper muxing and seeking in things like MPEG-TS, by
placing headers by random access points.
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
This was only used in hevc muxing code so far.
This makes the return values match what get_se_golomb returns for
the same bitstream reader instances.
The logic for producing a signed golomb code out of an unsigned one
was based on the corresponding code in get_se_golomb, which operated
directly on the bitstream reader buffer - not on the equivalent
return value from get_ue_golomb.
CC: libav-stable@libav.org
Signed-off-by: Martin Storsjö <martin@martin.st>
Profiling results for overall decode and the output_data function in
particular are as follows:
Before After
Mean StdDev Mean StdDev Confidence Change
6:2 total 339.6 15.1 329.3 16.0 95.8% +3.1% (insignificant)
6:2 function 24.6 6.0 9.9 3.1 100.0% +148.5%
8:2 total 324.5 15.5 323.6 14.3 15.2% +0.3% (insignificant)
8:2 function 20.4 3.9 9.9 3.4 100.0% +104.7%
6:6 total 572.8 20.6 539.9 24.2 100.0% +6.1%
6:6 function 54.5 5.6 16.0 3.8 100.0% +240.9%
8:8 total 741.5 21.2 702.5 18.5 100.0% +5.6%
8:8 function 63.9 7.6 18.4 4.8 100.0% +247.3%
The assembly version has also been tested with a fuzz tester to ensure that
any combinations of inputs not exercised by my available test streams still
generate mathematically identical results to the C version.
Signed-off-by: Martin Storsjö <martin@martin.st>
Profiling on a Raspberry Pi revealed the best performance to correspond
with VLC_BITS = 5. Results for overall audio decode and the get_vlc2 function
in particular are as follows:
Before After
Mean StdDev Mean StdDev Confidence Change
6:2 total 348.8 20.1 339.6 15.1 88.8% +2.7% (insignificant)
6:2 function 38.1 8.1 26.4 4.1 100.0% +44.5%
8:2 total 339.1 15.4 324.5 15.5 99.4% +4.5%
8:2 function 33.8 7.0 27.3 5.6 99.7% +23.6%
6:6 total 604.6 20.8 572.8 20.6 100.0% +5.6%
6:6 function 95.8 8.4 68.9 8.2 100.0% +39.1%
8:8 total 766.4 17.6 741.5 21.2 100.0% +3.4%
8:8 function 106.0 11.4 86.1 9.9 100.0% +23.1%
Signed-off-by: Martin Storsjö <martin@martin.st>
Profiling results for overall audio decode and the rematrix_channels function
in particular are as follows:
Before After
Mean StdDev Mean StdDev Confidence Change
6:2 total 370.8 17.0 348.8 20.1 99.9% +6.3%
6:2 function 46.4 8.4 45.8 6.6 18.0% +1.2% (insignificant)
8:2 total 343.2 19.0 339.1 15.4 54.7% +1.2% (insignificant)
8:2 function 38.9 3.9 40.2 6.9 52.4% -3.2% (insignificant)
6:6 total 658.4 15.7 604.6 20.8 100.0% +8.9%
6:6 function 109.0 8.7 59.5 5.4 100.0% +83.3%
8:8 total 896.2 24.5 766.4 17.6 100.0% +16.9%
8:8 function 223.4 12.8 93.8 5.0 100.0% +138.3%
The assembly version has also been tested with a fuzz tester to ensure that
any combinations of inputs not exercised by my available test streams still
generate mathematically identical results to the C version.
Signed-off-by: Martin Storsjö <martin@martin.st>
Profiling results for overall audio decode and the mlp_filter_channel(_arm)
function in particular are as follows:
Before After
Mean StdDev Mean StdDev Confidence Change
6:2 total 380.4 22.0 370.8 17.0 87.4% +2.6% (insignificant)
6:2 function 60.7 7.2 36.6 8.1 100.0% +65.8%
8:2 total 357.0 17.5 343.2 19.0 97.8% +4.0% (insignificant)
8:2 function 60.3 8.8 37.3 3.8 100.0% +61.8%
6:6 total 717.2 23.2 658.4 15.7 100.0% +8.9%
6:6 function 140.4 12.9 81.5 9.2 100.0% +72.4%
8:8 total 981.9 16.2 896.2 24.5 100.0% +9.6%
8:8 function 193.4 15.0 103.3 11.5 100.0% +87.2%
Experiments with adding preload instructions to this function yielded no
useful benefit, so these have not been included.
The assembly version has also been tested with a fuzz tester to ensure that
any combinations of inputs not exercised by my available test streams still
generate mathematically identical results to the C version.
Signed-off-by: Martin Storsjö <martin@martin.st>
Matroska, MP4, and other containers require it.
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
There is no point in populating NuvContext with another DSPContext.
Also split static and dynamic initialization bits to avoid running the
static initialization parts over and over.
This gets rid of aliasing completely unrelated structs to Picture.
Fixes the remaining compilation warnings in the vdpau code.
Signed-off-by: Anton Khirnov <anton@khirnov.net>
The code passed H264Picture* and Picture*, and assumed the
hwaccel_picture_private field was in the same place in both
structs. Somehow this happened to work in Libav, but broke in
FFmpeg (and probably subtly breaks in Libav too).
Signed-off-by: Anton Khirnov <anton@khirnov.net>
This is done to disentangle ER from mpegvideo. In order to use a
classic Picture, callers can use ff_mpeg_set_erpic() or use a custom function
to set the fields. Please note that buffers need to be allocated before
calling ff_er_frame_end().
The function is assigned to a function pointer that does not have the
restrict keyword for that parameter.
This fixes compilation for MSVC builds that don't recognize "restrict",
broken since ed9625eb62.
The function is supposed to confirm that the compiler provided enough
alignment, but in practice it is only run in certain code paths and
insufficient alignment problems are restricted to legacy compilers.
Based on the aarch64 asm. CPU cycle counts on cortex-a9 compared to
gcc 4.8.2:
before: 475 decicycles in get_cabac_noinline, 67106035 runs, 2829 skips
after: 393 decicycles in get_cabac_noinline, 67106474 runs, 2390 skips
Overall speedup is above 2%. Code generated by clang 3.4 is slower on
the same hardware and the relative change is a little larger.
Based on the x86 branchless get_cabac asm. get_cabac_noinline() gets
approximately 20% faster (no cycle counts available) compared to clang
from Xcode 5.1 beta5. More than 6% faster overall. A part of the overall
speedup might be explained by additional inlining of get_cabac().
The overread avoidance fix in cbddee1cca
broke the computation for the last row since it prevented the safe
reading from the height+1-th row.
CC: libav-stable@libav.org
Some content requires an higher number of slices in order to
render properly.
Rise the number to 1024 and warn if ever it exceeds.
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
The SVQ3 decoder reuses large parts of the H.264 decoder so it
makes no sense to enable the former but not the latter.
Also drop unnecessary h263.o object from SVQ3 decoder object list.
Avoid a division by 0 in ff_mpeg4_set_one_direct_mv.
Sample-Id: 00000168-google
Reported-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
Signed-off-by: Vittorio Giovara <vittorio.giovara@gmail.com>
avcodec_flush_buffers() must release all internally held references
according to its documentation, for which all the threads need to be
flushed.
CC:libav-stable@libav.org
Bug-Id: vlc/9665
These codecs compile all of the MJPEG code anyway, so there is little
point in not enabling the MJPEG decoder directly. This also simplifies
the dependency declarations for the MJPEG codec family.
This codec compiles all of the SP5X code anyway, so there is little
point in not enabling the decoder directly. This also simplifies the
dependency declaration for the AMV decoder.
AAC LOAS can have new audio config objects in the stream itself.
Make sure the decoder reconfigures itself when the first one arrives
midstream.
Bug-Id: 644
Signed-off-by: Luca Barbato <lu_zero@gentoo.org>
The vector dequantization has a test in a loop preventing effective SIMD
implementation. By moving it out of the loop, this loop can be DSPized.
Therefore, modify the current DSP implementation. In particular, the
DSP implementation no longer has to handle null loop sizes.
The decode_hf implementations have following timings:
For x86 Arrandale:
C SSE SSE2 SSE4
win32: 260 162 119 104
win64: 242 N/A 89 72
The arm NEON optimizations follow in a later patch as external asm. The
now unused check for the y modifier in arm inline asm is removed from
configure.
Based on a patch from Christophe Gisquet.
Unrolling of the m == 0 case avoids a possible use of the uninitilized
value sum when s->predictor_history is not set. I failed to find a
sample for it. It also reduced the cycle count from 220 to 150 on
sandy bridge, x86_64 linux, gcc 4.8.2 compared to his patch.
Timings for Arrandale:
C SSE
win32: 2108 334
win64: 1152 322
Factorizing the inner loop with a call/jmp is a >15 cycles cost, even with
the jmp destination being aligned.
Unrolling for ARCH_X86_64 is a 20 cycles gain.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
The scaling factor is constant so it is faster to scale the
FIR coefficients in the tables during compilation.
Signed-off-by: Janne Grunau <janne-libav@jannau.net>
Framerate is now a sane rational instead of an integer, and
inputDepth is changed to what it actually is.
Signed-off-by: Derek Buitenhuis <derek.buitenhuis@gmail.com>