Ternlog has additional benefit in by16 crc main loop in both reflected
and non-reflected polynomial crcs. Some arch see 4-7% improvement.
Revisited on suggestion by Nicola Torracca.
Change-Id: I806266a7080168cf33409634983e254a291a0795
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
- It should be fine to enable pmull always on Apple Silicon
- macOS 12+ is required for PMULL instruction.
- Changed the conditional macro to __APPLE__
- Rewritten dispatcher using sysctlbyname
- Use __USER_LABEL_PREFIX__
- Use __TEXT,__const as readonly section
- use ASM_DEF_RODATA macro
- fix func decl
Change-Id: I800593f21085d8187b480c8bb3ab2bd70c4a6974
Signed-off-by: Taiju Yamada <tyamada@bi.a.u-tokyo.ac.jp>
Prior to this change, a missing loop bounds check in the aarch64
version of gf_vect_mul would cause the routine to return 1 (error)
in the normal case.
This change introduces a check and branch to "return_pass" (success), and
also adds checks of the return code of gf_vect_mul to the supplied unit
test; it was previously ignored.
Change-Id: I9f7fe0014189b24f9600e0473ee02b5316c2da91
Signed-off-by: Surendar Chandra <vsurench@amazon.com>
Large cold perf tests were allocating more then allowed stack size.
Change-Id: I2c54f36ac6b42b359078dae7fffa5ce0b6d4890a
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
The cold versions of tests depended on a fixed size of last level
cache that is too low on some arch and too high for the total
available memory on others.
Change-Id: Iee98403f9ace02e01b810c296a5fe44b933bfb17
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
The relic slver is no longer used for individual versioning
on functions and is confusing tools looking for data in text
sections. This removes all instances instead of fixing since
its usefulness is waining. Fixes#221
Change-Id: Ife0b9f105950a90337c58e8a41ac2cffc0f67d99
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
The cflag to link with dynamic msvcrt /MD is not necessary and causes
warnings when static linking. Fixes#219
Change-Id: I0085d468afc4acbe323b0783cbbc6760b4c70704
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
In the adler32_neon function, during the last iteration of the
loop through "accum32_neon", we would load data after the end of the
buffer (in the ld1 instruction, the "start" register points to the end
of the buffer).
If this memory is unmapped, this would cause a segfault. If the memory
is mapped, the checksum would be correct because that value would
only be used in the next iteration, but this happens during the last
iteration.
To fix this, we can simply do the load before incrementing "start". And
while we're at it, we can load directly into d0_v/d1_v, saving a couple
of mov's.
Finally, the ld1 done during the function initialization can be removed
as the values aren't used for anything.
Change-Id: I4a0f2811adc523852ebe774da0a6fb1f5419192f
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
The memory block size calculated by t10dif is generally 512 bytes in
sectors. prefetching can effectively reduce cache misses.Use ldp instead
of ldr to reduce the number of instructions, pmull+pmull2 can resuce
register access. The perf test result shows that the performance is
improved by 5x ~ 14x after optimization.
Change-Id: Ibd3f08036b6a45443ffc15f808fd3b467294c283
Signed-off-by: Chunsong Feng <fengchunsong@huawei.com>
While the external headers define the API, we could really use this
overview to get users started and point them to examples.
Change-Id: Iba419e61d0d7723e1029a3b6e7259facfeb39522
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
1. Revert "x86: Generate .note.gnu.property section for ELF output"
This reverts commit 8074e3fe1b, which is
a hack to work around the old nasm which doesn't support
section .note.gnu.property note alloc noexec align=8
This hack doesn't work for downstream, like:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=2040091
2. If Intel CET is enabled, require nasm with note section support to
add
section .note.gnu.property note alloc noexec align=N
to assembly codes.
Verified with
$ CC="gcc -Wl,-z,cet-report=error -fcf-protection" CXX="g++ -Wl,-z,cet-report=error -fcf-protection" .../configure x86_64-linux
$ make -j8
on Tiger Lake.
Change-Id: I6d66fe6fd054420d7fde35b1508ca9f09defdeca
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
The goal of this patch is to make isa-l testsuite pass on s390 with
minimal changes to the library. The one and only reason isa-l does not
work on s390 at the moment is that s390 is big-endian, and isa-l
assumes little-endian at a lot of places.
There are two flavors of this: loading/storing integers from/to
memory, and overlapping structs. Loads/stores are already helpfully
wrapped by unaligned.h header, so replace the functions there with
endianness-aware variants. Solve struct member overlap by reversing
their order on big-endian.
Also, fix a couple of usages of uninitialized memory in the testsuite
(found with MemorySanitizer).
Fixes s390x part of #188.
Change-Id: Iaf14a113bd266900192cc8b44212f8a47a8c7753
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
This patch adds Arm (aarch64) SVE [1] variable-length vector assembly support
into ISA-L erasure code library. "Arm designed the Scalable Vector Extension
(SVE) as a next-generation SIMD extension to AArch64. SVE allows flexible
vector length implementations with a range of possible values in CPU
implementations. The vector length can vary from a minimum of 128 bits up to
a maximum of 2048 bits, at 128-bit increments. The SVE design guarantees
that the same application can run on different implementations that support
SVE, without the need to recompile the code. " [3]
Test method:
- This patch was tested on Fujitsu's A64FX [2], and it passed all erasure
code related test cases, including "make checks" , "make test", and
"make perf".
- To ensure code testing coverage, parameters in files (erasure_code/
erasure_code_test.c , erasure_code_update_test.c and gf_vect_mad_test.c)
are modified to cover all _vect versions of _mad_sve() / _dot_prod_sve()
rutines.
Performance improvements over NEON:
In general, SVE benchmarks (bandwidth in MB/s) are 40% ~ 100% higher than NEON
when running _cold style (data uncached and pulled from memory) perfs. This
includes routines of dot_prod, mad, and mul.
Optimization points:
This patch was tuned for the best performance on A64FX. Tuning points being
touched in this patch include:
1) Data prefetch into L2 cache before loading. See _sve.S files.
2) Instruction sequence orchestration. Such as interleaving every two
'ld1b/st1b' instructions with other instructions. See _sve.S files.
3) To improve dest vectors parallelism, in highlevel, running
gf_4vect_dot_prod_sve twice is better than running gf_8vect_dot_prod_sve()
once, and it's also better than running _7vect + _vect, _6vect + _2vect,
and _5vect + _3vect. The similar idea is applied to improve 11 ~ 9 dest
vectors dot product computing as well. The related change can be found
in ec_encode_data_sve() of file:
erasure_code/aarch64/ec_aarch64_highlevel_func.c
Notes:
1) About vector length: A64FX has a vector register length of 512bit. However,
this patchset was written with variable length assembly so it work
automatically on aarch64 machines with any types of SVE vector length,
such as SVE-128, SVE-256, etc..
2) About optimization: Due to differences in microarchitecture and
cache/memory design, to achieve optimum performance on SVE capable CPUs
other than A64FX, it is considered necessary to do microarchitecture-level
tunings on these CPUs.
[1] Introduction to SVE - Arm Developer.
https://developer.arm.com/documentation/102476/latest/
[2] FUJITSU Processor A64FX.
https://www.fujitsu.com/global/products/computing/servers/supercomputer/a64fx/
[3] Introducing SVE.
https://developer.arm.com/documentation/102476/0001/Introducing-SVE
Change-Id: If49eb8a956154d799dcda0ba4c9c6d979f5064a9
Signed-off-by: Guodong Xu <guodong.xu@linaro.org>
Github actions checkout changed to pull only a single generated merge
commit instead of the actual PR commit id. This breaks check_format
test for signoff. Pulling history of 2 will include the actual commit
ID.
Change-Id: I7d83871159d24faaf2f8e6086f12173e14cbcf3c
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
New mem_zero_detect function will fail on avx only machines.
Change-Id: I3bca49bff886f9c130c89e8c74b31110e9bac76b
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
micro-optimizations: vpcmpeqb+vpmaskmov is faster than vptest according
to uops.info; make usually untaken branches target forward.
reduce numbers of data dependant branches and code size.
Change-Id: Ie70b4bc99685368e5131f23344348bfaf7c27d3e
Signed-off-by: Nicola Torracca <shark@bitchx.it>
The variable D= can be used to quickly add defines. This sets a null
default so it can only be overridden by the make command line.
fixes#184
Change-Id: I84615174547f36208d6d577c1e30b6fac83139b3
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Instead of using a constant as default zlib header, create the header on the fly. Both zlib
header bytes depend on the wbits and compression level used.
Make sure that ISA-L compression level 0 is advertised as the fastest compression in
both the gzip header (setting xfl flag to 0x04) and the zlib header (as 0, fastest, other levels are 1, fast).
Change-Id: I1f30e4397a0f5fcf6df593c40178e7d6f6c05328
Signed-off-by: Ruben Vorderman <r.h.p.vorderman@lumc.nl>
The file types.h has long been misnamed and overlaps with
functionality in the test helper routines.
Change-Id: I774047d3a0074198b67a6b4e909f1e2ce1938195
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Timing functions are made os-independent with test.h include.
Change-Id: Iab7d6325254d5c32263504efc756dbbe51d77153
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Windows def file was missing an exported ec support function.
Also added path in nmake file to build extra examples.
Change-Id: I59ac1599dcb8cdb45077347c74b57aeca4751c35
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
The osx brew and older linux targets are failing the update.
This removes the older linux builds and change the osx to
take the latest brew that comes with the image instead of
doing a brew update on every build.
Change-Id: Ib1543296a733875c9eff798326b0d45854153923
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Both gcc and clang are showing a warning on this despite the buffer
always being set before use.
Change-Id: I0e8f6b9e3451efe69e49814abc883d49b04f2666
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
The raid functions xor_gen, pq_gen and check functions
must have at least two sources. Fixes#175
Change-Id: I2e4509e037c2b1dc88f3f7449d80f4c763e1e124
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Make changed the interpretation of escaped # in a quote causing
warnings in the test for pthreads.
Change-Id: Ice94116713aea3c3e9725b38232e03f53d6633cc
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Here is the bug report on ceph. https://tracker.ceph.com/issues/48681
Change-Id: Ie1c60a71f28c1a169c8899a621be9bb455f5e244
Signed-off-by: luo rixin <luorixin@huawei.com>
Author of this patch is Taiju Yamada <tyamada@bi.a.u-tokyo.ac.jp>
Re-organized by Jerry Yu <jerry.h.yu@arm.com>
Clang version must be later than 9.x according to https://reviews.llvm.org/D61719
Change-Id: I7516cca17ef4556b828fb6ecfa755e6451052359
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
Clang has deprecated the option -fsanitize-coverage=trace-pc-guard
for use with fuzzing.
Change-Id: I7fe5da0f57ab44110208d098858b786450a0a5e7
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Clang with sanitizer on was catching on cast of static header.
Switching to uload64 macro for better general solution.
Change-Id: I495d440407bb1773841e2f7cdc48bd95fc1a2df4
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>