- Use the CRC32 instruction to calculate:
  a. short data of no more than 16 bytes;
  b. long data, once folding has reduced it to 16 bytes.
- Add a fast path for short data to make the procedure more efficient.
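As an illustration of the short-data case only (not the actual patch; the helper name is hypothetical, and the hardware crc32 instruction computes the CRC32-C polynomial), a minimal C sketch using SSE4.2 intrinsics, compiled with -msse4.2:

    #include <nmmintrin.h>  /* SSE4.2 _mm_crc32_* intrinsics */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical fast path: CRC32-C over a short buffer (<= 16 bytes),
     * consuming 8 bytes at a time, then the remaining tail bytes. */
    static uint32_t crc32c_short(uint32_t crc, const unsigned char *buf, size_t len)
    {
        while (len >= 8) {
            uint64_t v;
            memcpy(&v, buf, 8);
            crc = (uint32_t)_mm_crc32_u64(crc, v);
            buf += 8;
            len -= 8;
        }
        while (len--)
            crc = _mm_crc32_u8(crc, *buf++);
        return crc;
    }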
Signed-off-by: Maodi Ma <mamaodi@hygon.cn>
Using the highest-level instruction set may not yield the best
performance on every platform. For example, using the AVX
implementation for EC updating instead of the AVX2 one can be
faster on Hygon 1/2/3 platforms.
This commit identifies Hygon platforms and uses a special
dispatch case for ec_encode_data_update to choose a specific
instruction set implementation.
Signed-off-by: Maodi Ma <mamaodi@hygon.cn>
To generate the side-by-side pattern of two 128-bit xgfts within a
YMM reg, loading them with VBROADCASTI128 directly from memory can be
faster than loading and then swapping them with VMOVDQU + VPERM2i128.
Remove some out-of-date macros as well.
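Roughly, in C intrinsics terms (the patch itself is assembly, so this is only an illustration of the two load patterns, compiled with -mavx2, not the exact sequence used):

    #include <immintrin.h>

    /* Old pattern: 128-bit load, then duplicate into both lanes. */
    static __m256i load_dup_perm(const void *tbl)
    {
        __m128i t = _mm_loadu_si128((const __m128i *)tbl);
        __m256i x = _mm256_castsi128_si256(t);
        return _mm256_permute2x128_si256(x, x, 0x00); /* VPERM2I128 */
    }

    /* New pattern: compilers emit VBROADCASTI128 with a memory operand. */
    static __m256i load_dup_broadcast(const void *tbl)
    {
        return _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)tbl));
    }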
Signed-off-by: Maodi Ma <mamaodi@hygon.cn>
Add a multi-thread benchmark to the raid_funcs_perf application,
enabled with the --coremask command-line parameter.
This parameter allows multiple threads to be spawned on the
cores defined in the mask, so that multi-core throughput
can be calculated.
Example:
./raid/raid_funcs_perf -t pq_gen -s 32K --coremask 0x0f0
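The core pinning this relies on can be sketched as follows (a hypothetical, minimal illustration using pthreads on Linux, not the application's actual code):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdint.h>

    /* Hypothetical worker: each thread would run the selected RAID
     * function in a loop and report its own throughput. */
    static void *worker(void *arg)
    {
        (void)arg;
        return NULL;
    }

    /* Spawn one pinned thread per bit set in the coremask (e.g. 0x0f0). */
    static int run_on_coremask(uint64_t coremask)
    {
        pthread_t tids[64];
        int n = 0;

        for (int core = 0; core < 64; core++) {
            if (!(coremask & (1ULL << core)))
                continue;
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(core, &set);
            pthread_attr_t attr;
            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
            if (pthread_create(&tids[n], &attr, worker, NULL) == 0)
                n++;
            pthread_attr_destroy(&attr);
        }
        for (int i = 0; i < n; i++)
            pthread_join(tids[i], NULL);
        return n;
    }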
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Randomizing the data when initializing the buffers takes a long time
when the buffers are large (or the cold cache test is used).
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
The number of destination buffers is already included in the calculation
of the buffer size for the cold cache test.
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Reduce one slli instruction and remove the dependency between the vle8.v and ld instructions.
gf5 and gf7 are not modified, as +5 and +7 are not used in actual scenarios.
Signed-off-by: Shuo Lv <lv.shuo@sanechips.com.cn>
On AArch64 systems with SVE support, 128-bit SVE implementations can
perform significantly worse than equivalent NEON code due to the
different optimization strategies used in each implementation. The NEON
version is unrolled 4 times, providing excellent performance at the
fixed 128-bit width. The SVE version can achieve similar or better
performance through its variable-width operations on systems with
256-bit or 512-bit SVE, but on 128-bit SVE systems, the NEON unrolled
implementation is faster due to reduced overhead.
This change adds runtime detection of SVE vector length and falls back
to the optimized NEON implementation when SVE is operating at 128-bit
width, ensuring optimal performance across all AArch64 configurations.
This implementation checks the vector length with an intrinsic if the
compiler supports it (which works on Apple as well) and falls back to
using prctl otherwise.
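A minimal sketch of that detection, assuming the ACLE svcntb() intrinsic when __ARM_FEATURE_SVE is defined and the Linux prctl fallback otherwise (the helper name is hypothetical):

    #include <stdint.h>

    #if defined(__ARM_FEATURE_SVE)
    #include <arm_sve.h>
    #elif defined(__linux__)
    #include <sys/prctl.h>
    #endif

    /* Hypothetical helper: SVE vector length in bytes, or 0 if unknown. */
    static uint64_t sve_vector_length_bytes(void)
    {
    #if defined(__ARM_FEATURE_SVE)
        return svcntb();                    /* ACLE intrinsic: bytes per vector */
    #elif defined(__linux__) && defined(PR_SVE_GET_VL)
        int ret = prctl(PR_SVE_GET_VL);
        if (ret < 0)
            return 0;
        return (uint64_t)(ret & PR_SVE_VL_LEN_MASK);
    #else
        return 0;
    #endif
    }

    /* Dispatch idea: if the vector length is 16 bytes (128-bit SVE),
     * prefer the 4x-unrolled NEON implementation instead. */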
This optimization ensures that systems benefit from:
- 4x unrolled NEON code on 128-bit SVE systems
- Variable-width SVE optimizations on wider SVE implementations
- Maintained compatibility across different AArch64 configurations
Performance improvement on systems with 128-bit SVE:
- Encode: 7509.80 MB/s → 8995.59 MB/s (+19.8% improvement)
- Decode: 9383.67 MB/s → 12272.38 MB/s (+30.8% improvement)
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>
We only ever load 32 bits into it, and we only ever want to compare against
32 bits. There was no need to declare it as 64 bits.
Furthermore, there were cases where a 64-bit comparison around
isal_out_overflow_1 led us to erroneously set the block state to
ISAL_BLOCK_INPUT_DONE when it should have been left at ISAL_BLOCK_NEW_HDR.
Fixes #316
Signed-off-by: Tim Burke <tim.burke@gmail.com>
Somewhere between Command Line Tools for Xcode 16.2 and 16.3, clang
started complaining like
<instantiation>:91:26: error: unexpected token in argument list
movk x7, br_low_b2, lsl 32
^
crc/aarch64/crc32_ieee_norm_pmull.S:34:1: note: while in macro instantiation
crc32_norm_func crc32_ieee_norm_pmull
It seems to be due to some change in macro expansion; work around it by
replacing .equ directives with #defines.
Fixes #352
Signed-off-by: Tim Burke <tim.burke@gmail.com>
There is a possibility that zstate.msg is NULL, as set
in the inflateInit2() function. In that case, we should not
compare it against another string.
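A minimal sketch of the kind of guard this implies (the helper shown here is hypothetical):

    #include <string.h>
    #include <stddef.h>

    /* Sketch: msg may legitimately be NULL after inflateInit2(),
     * so guard the pointer before any string comparison. */
    static int msg_matches(const char *msg, const char *expected)
    {
        return msg != NULL && strcmp(msg, expected) == 0;
    }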
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
This experimental library is a drop-in replacement for zlib that
utilizes ISA-L for improved compression/decompression performance.
Signed-off-by: Karpenko, Veronika <veronika.karpenko@intel.com>
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
The ISA-L EC code has been written using RVV vector instructions and the minimum multiplication table,
resulting in a performance improvement of over 10 times compared to the existing implementation.
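For context, the table idea can be illustrated with a generic C sketch of split-nibble GF(2^8) multiplication (this is not the RVV code; 0x11d is the reduction polynomial commonly used for erasure coding):

    #include <stdint.h>

    /* Reference GF(2^8) multiply, reducing by x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
    static uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        uint8_t p = 0;
        for (int i = 0; i < 8; i++) {
            if (b & 1)
                p ^= a;
            uint8_t carry = a & 0x80;
            a <<= 1;
            if (carry)
                a ^= 0x1d;
            b >>= 1;
        }
        return p;
    }

    /* Split-nibble tables for a fixed coefficient c: 2 x 16 entries suffice,
     * since x = (x & 0x0f) ^ (x & 0xf0) and GF multiplication is linear. */
    static void gf_build_nibble_tables(uint8_t c, uint8_t lo[16], uint8_t hi[16])
    {
        for (int i = 0; i < 16; i++) {
            lo[i] = gf_mul(c, (uint8_t)i);
            hi[i] = gf_mul(c, (uint8_t)(i << 4));
        }
    }

    /* Table-driven multiply: one XOR of two lookups per byte. */
    static uint8_t gf_mul_tbl(const uint8_t lo[16], const uint8_t hi[16], uint8_t x)
    {
        return lo[x & 0x0f] ^ hi[x >> 4];
    }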
Signed-off-by: Shuo Lv <lv.shuo@sanechips.com.cn>
Added new RAID performance application which consolidates the
existing XOR and P+Q gen performance applications.
This application accepts the buffer sizes to benchmark
(as a single value, a list or a range), the RAID function
to test and the number of sources.
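For example (the -t and -s options appear in the earlier --coremask commit; the comma-separated list syntax shown here is an assumption):

    ./raid/raid_funcs_perf -t pq_gen -s 4K,64K,1M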
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
To benchmark a cold cache scenario, the option `--cold`
has been added as a parameter of the CRC benchmark application,
where the addresses of the input buffers are randomized
within a 1GB preallocated memory buffer.
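The address randomization could be sketched roughly as follows (a hypothetical helper, not the benchmark's actual code), picking a random aligned offset into the preallocated region for each run:

    #include <stdlib.h>
    #include <stdint.h>
    #include <stddef.h>

    #define POOL_SIZE (1ULL << 30)   /* 1 GB preallocated region */
    #define ALIGNMENT 64             /* cache-line aligned starting addresses */

    /* Return a pointer to a random, aligned buffer of buf_len bytes
     * somewhere inside the preallocated pool. */
    static uint8_t *random_buf(uint8_t *pool, size_t buf_len)
    {
        size_t span = (size_t)POOL_SIZE - buf_len;
        size_t off = ((size_t)rand() % span) & ~((size_t)ALIGNMENT - 1);
        return pool + off;
    }

Because each iteration starts from an effectively random, cache-cold address, the measurement reflects memory rather than cache performance.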
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Added new CRC performance application which consolidates the
existing CRC performance applications (CRC16, CRC32 and CRC64).
This application accepts the buffer sizes to benchmark
(as a single value, a list or a range) and the CRC function
to test (or all of them).
Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>