hevc seems to be the only place where the C implementation
of the av_clip function is explicitly selected, precluding
platform-specific optimizations.
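
As a rough illustration of why calling the C variant by name blocks an
optimized one, here is a minimal sketch. The names clip_c_sketch and
clip_sketch are hypothetical stand-ins, not the real libavutil definitions;
the sketch only assumes that a generic name may be redirected by a platform
header while the C variant stays fixed.

```c
/* Hypothetical portable reference implementation of a clip function. */
static inline int clip_c_sketch(int a, int amin, int amax)
{
    if (a < amin)
        return amin;
    if (a > amax)
        return amax;
    return a;
}

/* Hypothetical hook: a platform header could define clip_sketch to an
 * optimized variant before this point. Callers that use clip_c_sketch
 * directly never benefit from such an override. */
#ifndef clip_sketch
#define clip_sketch clip_c_sketch
#endif

/* Caller written against the generic name, so an override would apply. */
static int clip_pixel8_sketch(int v)
{
    return clip_sketch(v, 0, 255);
}
```

Calling the generic name leaves the choice of implementation to the build.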
Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net>
Signed-off-by: Anton Khirnov <anton@khirnov.net>
Original x86 intrinsics code and initial yasm port by Pierre-Edouard Lepere.
Refactoring and optimizations by James Almer.
Benchmarks of BQTerrace_1920x1080_60_qp22.bin with an Intel Core i5-4200U
Width 32
158583 decicycles in edge, sao_edge_filter_8 runs, 0 skips
5205 decicycles in ff_hevc_sao_edge_filter_32_8_ssse3, 32767 runs, 1 skips
2942 decicycles in ff_hevc_sao_edge_filter_32_8_avx2, 32767 runs, 1 skips
Width 64
705639 decicycles in sao_edge_filter_8, 262144 runs, 0 skips
19224 decicycles in ff_hevc_sao_edge_filter_64_8_ssse3, 262111 runs, 33 skips
10433 decicycles in ff_hevc_sao_edge_filter_64_8_avx2, 262115 runs, 29 skips
Signed-off-by: James Almer <jamrial@gmail.com>
As with sao_band_filter, pass the two variables the function needs instead of the whole struct.
This simplifies writing asm optimized versions.
Reviewed-by: Mickaël Raulet <mraulet@insa-rennes.fr>
Signed-off-by: James Almer <jamrial@gmail.com>
Use edge emu buffers
And enable the code unconditionally
Speed difference without USE_SAO_SMALL_BUFFER and with the new code:
Decicycles: 26772->26220 (BO32), 83803->80942 (BO64)
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
cherry picked from commit 5d9f79edef2c11b915bdac3a025b59a32082f409
SAO edge filter uses pre-SAO pixel data on the left and top of the ctb, so
this data must be kept available. This was done previously by having 2
copies of the frame, one before and one after SAO.
This commit reduces the storage to just that data, instead of the previous
whole second frame.
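
A minimal sketch of the idea, assuming the border pixels are saved before a
block is filtered in place. The function name, the buffer layout and the
choice of saving the bottom row and right column (which become the top and
left neighbours of later blocks) are assumptions for illustration, not the
layout used by the decoder.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy the bottom row and right column of a w x h
 * block into side buffers so that blocks filtered later can still read
 * the unfiltered values of their top and left neighbours. */
static void save_borders_sketch(const uint8_t *blk, ptrdiff_t stride,
                                int w, int h,
                                uint8_t *row_buf, uint8_t *col_buf)
{
    /* Bottom row: one contiguous copy of w bytes. */
    memcpy(row_buf, blk + (ptrdiff_t)(h - 1) * stride, (size_t)w);

    /* Right column: one byte per row. */
    for (int y = 0; y < h; y++)
        col_buf[y] = blk[(ptrdiff_t)y * stride + (w - 1)];
}
```

Per block this stores w + h bytes rather than a second copy of w * h pixels.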
Commit message taken from patch by Christophe Gisquet <christophe.gisquet@gmail.com>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
For band filter, source and destination are aligned (except for 16x16 ctbs),
and otherwise, they are most often aligned. Overall, the total width is also
too small for amortizing memcpy.
Timings (using an intrinsic version of edge filters):
          B/32   B/64   E/32    E/64
Before:  32045  93952  38925  126896
After:   26772  83803  33942  117182
Pass the two variables the function needs instead of the whole struct.
This simplifies writing asm optimized versions of the function.
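
A simplified sketch of such a signature, assuming 8-bit samples. The struct
SAOParamsSketch, the function band_filter_sketch and all parameter names are
hypothetical; only the general shape (offsets plus left band class passed
directly) follows the description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter struct; the filter needs only two of its fields. */
typedef struct SAOParamsSketch {
    int16_t offset_val[5];   /* [0] unused, [1..4] are the band offsets */
    int     band_position;   /* first of the four consecutive bands */
    int     other_state[8];  /* stands in for fields the filter never reads */
} SAOParamsSketch;

static uint8_t clip_u8_sketch(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Takes the offsets and the left band class directly, so an asm version
 * does not need to know the struct layout. */
static void band_filter_sketch(uint8_t *dst, const uint8_t *src,
                               ptrdiff_t dst_stride, ptrdiff_t src_stride,
                               const int16_t *offset_val, int left_class,
                               int width, int height)
{
    int offset_table[32] = { 0 };

    /* Four consecutive bands (modulo 32) receive an offset. */
    for (int k = 0; k < 4; k++)
        offset_table[(k + left_class) & 31] = offset_val[k + 1];

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++)
            dst[x] = clip_u8_sketch(src[x] + offset_table[src[x] >> 3]);
        dst += dst_stride;
        src += src_stride;
    }
}
```

A caller holding the struct would pass p.offset_val and p.band_position.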
Signed-off-by: James Almer <jamrial@gmail.com>
* commit 'a7a17e3f1915ce69b787dc58c5d8dba0910fc0a4':
hevc_filter: move some conditions out of loops
Conflicts:
libavcodec/hevc_filter.c
This is possibly less readable than the variant used before.
Thus please take a look, and if people agree it's worse, don't
hesitate to revert.
See: 83976e40e89655162e5394cf8915d9b6d89702d9
Merged-by: Michael Niedermayer <michaelni@gmx.at>
1) each of the loops runs within a single CTB, so the relevant reference
list is constant
2) when that CTB is, or lies on the same slice as, the current one, we
can use a simple access instead of a relatively expensive call to
ff_hevc_get_ref_list()
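
The two points above can be sketched as follows. All types and names here
(RefListSketch, CtxSketch, slow_lookup_sketch, list_for_ctb_sketch) are
hypothetical; slow_lookup_sketch merely stands in for an expensive lookup and
does not reproduce the signature or behaviour of ff_hevc_get_ref_list().

```c
/* Hypothetical reference list and decoder context. */
typedef struct RefListSketch {
    int nb_refs;
} RefListSketch;

typedef struct CtxSketch {
    int                  cur_slice_addr; /* slice of the current CTB */
    const RefListSketch *cur_list;       /* list already at hand for it */
    const RefListSketch *all_lists;      /* indexed by slice address */
    int                  slow_lookups;   /* counts slow-path calls */
} CtxSketch;

/* Stand-in for the relatively expensive general lookup. */
static const RefListSketch *slow_lookup_sketch(CtxSketch *c, int slice_addr)
{
    c->slow_lookups++;
    return &c->all_lists[slice_addr];
}

/* Resolve the list once per CTB, outside the per-edge loops: use the
 * cheap access when the CTB shares the current slice. */
static const RefListSketch *list_for_ctb_sketch(CtxSketch *c,
                                                int ctb_slice_addr)
{
    if (ctb_slice_addr == c->cur_slice_addr)
        return c->cur_list;
    return slow_lookup_sketch(c, ctb_slice_addr);
}
```

Because the result is constant for a CTB, it can be computed before the loop
and reused in every iteration.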
The x86 asm expects int32_t so use that type.
Reviewed-by: Mickaël Raulet <mraulet@insa-rennes.fr>
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
This should help cache locality. On win64:
Before: 1397x cycles, 16216 bytes
After: 1369x cycles, 16040 bytes
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit '73bb8f61d48dbf7237df2e9cacd037f12b84b00a':
hevcdsp: remove an unneeded variable in the loop filter
Conflicts:
libavcodec/hevc_filter.c
See: d7e162d46b4a0fc03ca5161cdcac840152f048cb
Merged-by: Michael Niedermayer <michaelni@gmx.at>
beta0 and beta1 will always be the same within a CU
Signed-off-by: Mickaël Raulet <mraulet@insa-rennes.fr>
cherry picked from commit 4a23d824741a289c7d2d2f2871d1e2621b63fa1b
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
There's a lag of one CTB line for SAO behind the deblocking filter, except for
the last line. However, once SAO has been completed on a line, all its pixels,
i.e. up to y+ctb_size, are filtered and ready to be used as reference.
Without SAO, when the deblocking filter finishes a CTB line, only the bottom
4 pixels may still be filtered when the next CTB is processed by the
deblocking filter.
The await_progress for hevc then checks whether the bottom pixels of a PU
require access beyond that point, so the reporting should effectively
report up to the above limits.
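
The limits described above can be written as a small helper. This is a sketch
under stated assumptions: the function name and parameters are hypothetical,
y is taken to be the top row of the CTB line whose last filter stage just
completed, and the value returned is the row that would be reported as ready.

```c
/* Hypothetical helper: row up to which pixels are final.
 *   y         top row of the CTB line that just finished filtering
 *   ctb_size  CTB height in pixels
 *   sao_done  nonzero if SAO has completed on this line; zero if only the
 *             deblocking filter ran (no SAO) */
static int reportable_row_sketch(int y, int ctb_size, int sao_done)
{
    if (sao_done)
        return y + ctb_size;     /* the whole line is final */
    return y + ctb_size - 4;     /* bottom 4 rows may still change */
}
```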
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
- add one extra pixel all around the frame
- do not copy when SAO is not applied
5% improvement
cherry picked from commit 10fc29fc19a12c4d8168fbe1a954b76386db12d0
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
- support for 4:2:2 and 4:4:4 up to 12 bits
- add a new profile for range extension
(cherry picked from commit d3c067fa65bbc871758d28aa07f54123430ca346)
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>
* commit 'ff486c0f7f6b2ace3f0238660bc06cc35b389676':
hevc: Do not right shift a negative value in get_pcm
Merged-by: Michael Niedermayer <michaelni@gmx.at>
* commit '50c988aa6d6c6f0ceb8f922bcea34800b56b85d9':
hevc: Drop unnecessary shifts in deblocking_filter_CTB
Merged-by: Michael Niedermayer <michaelni@gmx.at>
Fixes use of uninitialized memory
Fixes: 93728afd9aa074ba14a09bfd93a632fd-asan_static-oob_124a17d_1445_cov_1021181966_DBLK_D_VIXS_1.bit
Found-by: Mateusz "j00ru" Jurczyk and Gynvael Coldwind
Signed-off-by: Michael Niedermayer <michaelni@gmx.at>