Arrange the two xor instructions according to the specified
paradigm, then the two xor instructions can be fused to execute
which can save one issue slot and one execution latency.
Change-Id: Ic64bcfe569b2468e4dc9c13d073d367cc81fd937
Signed-off-by: liuqinfei <lucas.liuqinfei@huawei.com>
- It should be fine to enable pmull always on Apple Silicon
- macOS 12+ is required for PMULL instruction.
- Changed the conditional macro to __APPLE__
- Rewritten dispatcher using sysctlbyname
- Use __USER_LABEL_PREFIX__
- Use __TEXT,__const as readonly section
- use ASM_DEF_RODATA macro
- fix func decl
Change-Id: I800593f21085d8187b480c8bb3ab2bd70c4a6974
Signed-off-by: Taiju Yamada <tyamada@bi.a.u-tokyo.ac.jp>
The memory block size calculated by t10dif is generally 512 bytes in
sectors. prefetching can effectively reduce cache misses.Use ldp instead
of ldr to reduce the number of instructions, pmull+pmull2 can resuce
register access. The perf test result shows that the performance is
improved by 5x ~ 14x after optimization.
Change-Id: Ibd3f08036b6a45443ffc15f808fd3b467294c283
Signed-off-by: Chunsong Feng <fengchunsong@huawei.com>
To reduce the cache missing events, the mix layout is changed
to PMULL+CRC. It also relaxes the final delay caused by data
dependency.
As results, the cold perf was improved about 20% and warm perf
was improved about 4%.
Change-Id: I7756f846edcb4f1665b4643a5a0e02283938cfdf
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
Hardware folding algorithm depend on CRC32 and PMULL instruction.
And it should match both flags .
Change-Id: I361068402db1fe6d7c0bd8d2c7048f1d94880233
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
+ Utilise `pmull2` instruction in main loops of arm64 crc functions and
avoid the need for `dup` to align multiplicands.
+ Use just 1 ASIMD register to hold both 64b p4 constants,
appropriately aligned.
+ Interleave quadword `ldr` with `pmull{2}` to avoid unnecessary stalls
on existing LITTLE uarch (which can only issue these instructions every
other cycle).
+ Similarly interleave scalar instructions with ASIMD instructions to
increase likelihood of instruction level parallelism on a variety of
uarch.
+ Cut down on needless instructions in non-critical sections to help
performance for small buffers.
+ Extract common instruction sequences into inner macros and moved
them into shared header - crc_common_pmull.h
+ Use the same human readable register aliases and register allocation
in all 4 implementations, never refer to registers without using human
readable alias.
+ Use #defines rather than .req to allow use of same names across
several implementations
+ Reduce tail case size from 1024B to 64B
+ Phrased the `eor` instructions in the main loop to more clearly show
that we can rewrite pairs of `eor` instructions with a single `eor3`
instruction in the presence of Armv8.2-SHA (should probably be an option
in multibinary in future).
Change-Id: I3688193ea4ad88b53cf47e5bd9a7fd5c2b4401e1
Signed-off-by: Samuel Lee <samuel.lee@microsoft.com>
This issue occurs when dynamic compilation is used
and gcc's -fsanitize memory detection option is turned on.
[Log] relocation truncated to fit: R_AARCH64_LD_PREL_LO19 against `.rodata'
Change-Id: Ic2f82264610552f347e043f82ac5ebafc93748e2
Signed-off-by: Zhiyuan Zhu <zhiyuan.zhu@arm.com>
Some CPUs report "illegal instruction" error for the crc test because
they do not support the relevant optional feature . This can be fixed by
introducing CPU feature detection for AArch64 .
The difference with the x86 implementation is the dispatcher . It is based
on the glibc function `getauxval(AT_HWCAP)` and `getauxval(AT_HWCAP2)` , not
registers or instructions .
On a heterogeneous system (big.LITTLE) , it is dangerous to detect CPU
features using identification registers . And while it is possible to use
architectural feature registers from userspace on recent kernels, this
won't necessarily work with older platforms . Thus we use the HW_CAPs
exported from the kernel (and visible in getauxval) as the solution.
- According to kernel suggestion , getauxval should be used for this purpose .
- [CPU Feature detection](https://github.com/torvalds/linux/blob/master/Documentation/arm64/cpu-feature-registers.rst)
- According to AAPCS result/paramter registers should be saved/restore for function call
- [AAPCS](http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf)
- [GLibc](https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=blob;f=sysdeps/aarch64/dl-trampoline.S)
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
Change-Id: Ic9abe0d2268ac95537e1abf10acc642fc58a5054
Some arm64 machines don't support pmull instructions, so set these
crc interface to base functions. For long-term solution, will
provide better multi-binary support with cpu features detection.
Change-Id: I02791a2a50283dc8df2f9ba124eb309912b5b4b7
Signed-off-by: Zhiyuan Zhu <zhiyuan.zhu@arm.com>