isa-l

mirror of https://github.com/intel/isa-l.git synced 2024-12-12 17:33:50 +01:00

Author	SHA1	Message	Date
liuqinfei	4815174a68	crc: optimize by supporting arm xor fusion feature Arrange the two xor instructions according to the specified paradigm, then the two xor instructions can be fused to execute which can save one issue slot and one execution latency. Change-Id: Ic64bcfe569b2468e4dc9c13d073d367cc81fd937 Signed-off-by: liuqinfei <lucas.liuqinfei@huawei.com>	2023-08-18 07:53:59 +00:00
Pablo de Lara	2bbce31943	crc: add CRC64 rocksoft implementation - Added reference implementation - Added base implementation - Added functional and performance tests Change-Id: I60c5097bd5fb89ee7a50910e71d449d50d155d0a Signed-off-by: Pablo de Lara <pablo.de.lara.guarch@intel.com> Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>	2023-05-08 12:37:44 +00:00
Taiju Yamada	1187583a97	Fixes for aarch64 mac - It should be fine to enable pmull always on Apple Silicon - macOS 12+ is required for PMULL instruction. - Changed the conditional macro to __APPLE__ - Rewritten dispatcher using sysctlbyname - Use __USER_LABEL_PREFIX__ - Use __TEXT,__const as readonly section - use ASM_DEF_RODATA macro - fix func decl Change-Id: I800593f21085d8187b480c8bb3ab2bd70c4a6974 Signed-off-by: Taiju Yamada <tyamada@bi.a.u-tokyo.ac.jp>	2022-10-28 08:27:26 -07:00
Chunsong Feng	e297ecae7a	crc16: Accelerate T10DIF performance with prefetch and pmull2 The memory block that T10DIF is computed over is generally a 512-byte sector. Prefetching can effectively reduce cache misses. Use ldp instead of ldr to reduce the number of instructions; pmull+pmull2 can reduce register access. The perf test result shows that the performance is improved by 5x ~ 14x after optimization. Change-Id: Ibd3f08036b6a45443ffc15f808fd3b467294c283 Signed-off-by: Chunsong Feng <fengchunsong@huawei.com>	2022-03-31 09:58:04 -07:00
Jerry Yu	1c71f9c0ae	crc32: tweak performance of crc32/crc32c Tweak performance with prefetch instructions. The test results are below: - Neoverse N1: ~30% - Cortex-A72: ~3% - Cortex-A57: ~90% - Others: 50% - 5x Change-Id: I3ab292a953043dbaea98af3c66778f57da3a1331 Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>	2020-07-09 17:37:00 +08:00
Zhiyuan Zhu	031450f697	crc32: Implement default mix mode optimization Change-Id: Ib3bf04215cca491db522ec33905fe48df173cc2f Signed-off-by: Zhiyuan Zhu <zhiyuan.zhu@arm.com>	2020-05-09 08:10:34 +00:00
Jerry Yu	6c4d3dbf6c	crc32:NeoverseN1: Change CRC32/PMULL order to PMULL first To reduce cache-miss events, the mix layout is changed to PMULL+CRC. This also relaxes the final delay caused by the data dependency. As a result, cold performance improved by about 20% and warm performance by about 4%. Change-Id: I7756f846edcb4f1665b4643a5a0e02283938cfdf Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>	2020-04-16 20:38:41 +08:00
Jerry Yu	92fc8733fa	crc32: Fix prototype mismatch bug Change-Id: I7c8a2348441f32a43ff386122612405e418d9947 Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>	2020-04-10 00:46:41 +00:00
Jerry Yu	9bcd6768fd	crc32:Adjust hardware folding algorithm flags The hardware folding algorithm depends on both the CRC32 and PMULL instructions, so it should match both flags. Change-Id: I361068402db1fe6d7c0bd8d2c7048f1d94880233 Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>	2020-04-08 13:50:15 +08:00
Jerry Yu	0033f42189	crc32:Optimize crc32/c for cortex-a72 Change-Id: Ib1658fd4b87b31d8ea6c93f697b50d9b409c186e Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>	2020-04-08 13:49:38 +08:00
Jerry Yu	a2fc2c000d	crc32:Add optimization implementation for Neoverse N1 This patch is based on the algorithm in reference (1), with some changes. - Redefine the block number to two. - This is because only two pipelines can be used for the CRC32 calculation. - Redefine the block size: - The block size of CRC is 1536B and PMULL is 512B - Interleave CRC and PMULL instructions. The optimization parameters are calculated based on reference (2). References: - https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fast-crc-computation-generic-polynomials-pclmulqdq-paper.pdf - https://developer.arm.com/docs/swog309707/a Change-Id: I1c9e593d59b521f56e4b3c807b396c083c181636 Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>	2020-03-30 09:20:29 -07:00
Samuel Lee	4785428d2f	crc: arm64 implementation tweaks + Utilise `pmull2` instruction in main loops of arm64 crc functions and avoid the need for `dup` to align multiplicands. + Use just 1 ASIMD register to hold both 64b p4 constants, appropriately aligned. + Interleave quadword `ldr` with `pmull{2}` to avoid unnecessary stalls on existing LITTLE uarch (which can only issue these instructions every other cycle). + Similarly interleave scalar instructions with ASIMD instructions to increase likelihood of instruction level parallelism on a variety of uarch. + Cut down on needless instructions in non-critical sections to help performance for small buffers. + Extract common instruction sequences into inner macros and moved them into shared header - crc_common_pmull.h + Use the same human readable register aliases and register allocation in all 4 implementations, never refer to registers without using human readable alias. + Use #defines rather than .req to allow use of same names across several implementations + Reduce tail case size from 1024B to 64B + Phrased the `eor` instructions in the main loop to more clearly show that we can rewrite pairs of `eor` instructions with a single `eor3` instruction in the presence of Armv8.2-SHA (should probably be an option in multibinary in future). Change-Id: I3688193ea4ad88b53cf47e5bd9a7fd5c2b4401e1 Signed-off-by: Samuel Lee <samuel.lee@microsoft.com>	2019-11-13 10:58:19 -07:00
Zhiyuan Zhu	f3993f5c0b	crc: Fix dynamic relocation link failure on Arm This issue occurs when dynamic compilation is used and gcc's -fsanitize memory detection option is turned on. [Log] relocation truncated to fit: R_AARCH64_LD_PREL_LO19 against `.rodata' Change-Id: Ic2f82264610552f347e043f82ac5ebafc93748e2 Signed-off-by: Zhiyuan Zhu <zhiyuan.zhu@arm.com>	2019-10-11 15:37:29 -07:00
Jerry Yu	183385f02f	multibinary: Add run-time cpu feature detect for aarch64 Some CPUs report an "illegal instruction" error for the crc test because they do not support the relevant optional feature. This can be fixed by introducing CPU feature detection for AArch64. The difference from the x86 implementation is the dispatcher. It is based on the glibc functions `getauxval(AT_HWCAP)` and `getauxval(AT_HWCAP2)`, not registers or instructions. On a heterogeneous system (big.LITTLE), it is dangerous to detect CPU features using identification registers. And while it is possible to use architectural feature registers from userspace on recent kernels, this won't necessarily work with older platforms. Thus we use the HW_CAPs exported from the kernel (and visible in getauxval) as the solution. - According to the kernel's suggestion, getauxval should be used for this purpose. - [CPU Feature detection](https://github.com/torvalds/linux/blob/master/Documentation/arm64/cpu-feature-registers.rst) - According to AAPCS, result/parameter registers should be saved/restored across a function call - [AAPCS](http://infocenter.arm.com/help/topic/com.arm.doc.ihi0055b/IHI0055B_aapcs64.pdf) - [GLibc](https://sourceware.org/git/gitweb.cgi?p=glibc.git;a=blob;f=sysdeps/aarch64/dl-trampoline.S) Signed-off-by: Jerry Yu <jerry.h.yu@arm.com> Change-Id: Ic9abe0d2268ac95537e1abf10acc642fc58a5054	2019-08-26 17:58:42 +08:00
Zhiyuan Zhu	c80610a2bb	crc: push the aarch64 crc optimization back to base functions Some arm64 machines don't support pmull instructions, so set these crc interfaces to the base functions. As a long-term solution, better multi-binary support with CPU feature detection will be provided. Change-Id: I02791a2a50283dc8df2f9ba124eb309912b5b4b7 Signed-off-by: Zhiyuan Zhu <zhiyuan.zhu@arm.com>	2019-07-16 07:18:54 +00:00
Zhiyuan Zhu	a46da529d9	crc: optimize crc with arm64 assembly Change-Id: I49166ee06b3ad24babb90aeb0b834d8aacfc2d03 Signed-off-by: Zhiyuan Zhu <zhiyuan.zhu@arm.com>	2019-06-21 17:02:16 +08:00
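
Commits 183385f02f and 9bcd6768fd describe run-time dispatch built on `getauxval(AT_HWCAP)`, with the hardware folding path requiring both the CRC32 and PMULL flags. The following is a minimal sketch of that idea in C, not the dispatcher isa-l actually ships (which is written in assembly). The names `FEAT_CRC32`, `FEAT_PMULL`, `detect_crc_features`, `select_crc32_impl`, and the `crc_impl` enum are hypothetical and exist only for this illustration. Falling back to the base function whenever either flag is missing is a simplifying assumption.

```c
#include <stdint.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_CRC32, HWCAP_PMULL */
#endif

/* Hypothetical feature bits for this sketch (not isa-l identifiers). */
#define FEAT_CRC32 0x1u
#define FEAT_PMULL 0x2u

/* Hypothetical implementation choices for this sketch. */
enum crc_impl { CRC_IMPL_BASE, CRC_IMPL_HW_FOLD };

/* Ask the kernel which optional instructions are usable, instead of
 * reading CPU identification registers directly. On anything other than
 * Linux/AArch64 this sketch reports no features. */
unsigned detect_crc_features(void)
{
    unsigned feats = 0;
#if defined(__linux__) && defined(__aarch64__)
    unsigned long hwcap = getauxval(AT_HWCAP);
    if (hwcap & HWCAP_CRC32)
        feats |= FEAT_CRC32;
    if (hwcap & HWCAP_PMULL)
        feats |= FEAT_PMULL;
#endif
    return feats;
}

/* The folding path mixes CRC32 and PMULL instructions, so it is chosen
 * only when both flags are present (assumption: otherwise use base). */
enum crc_impl select_crc32_impl(unsigned feats)
{
    const unsigned need = FEAT_CRC32 | FEAT_PMULL;
    return ((feats & need) == need) ? CRC_IMPL_HW_FOLD : CRC_IMPL_BASE;
}
```

A caller would typically run `select_crc32_impl(detect_crc_features())` once and cache the result, so that a CPU without PMULL never reaches the accelerated path and never raises an illegal-instruction fault.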

16 Commits
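
Commit e297ecae7a mentions prefetching while computing T10DIF over 512-byte sectors. The sketch below only illustrates where a prefetch hint could sit in a CRC loop; it is a plain bitwise CRC16 T10DIF (polynomial 0x8BB7, no reflection, caller-supplied seed), not the `ldp`/`pmull`/`pmull2` assembly from the commit. The function name `crc16_t10dif_sketch`, the 64-byte line size, and the prefetch distance are assumptions made for illustration, and a bitwise loop like this is compute-bound, so no speedup should be expected from it.

```c
#include <stddef.h>
#include <stdint.h>

#define T10DIF_POLY     0x8BB7u
#define CACHE_LINE      64u   /* assumed cache-line size */
#define PREFETCH_AHEAD  256u  /* hypothetical prefetch distance in bytes */

/* Bitwise CRC16 T10DIF with a read-prefetch hint issued once per
 * (assumed) cache line. The hint is a no-op on compilers without
 * __builtin_prefetch. */
uint16_t crc16_t10dif_sketch(uint16_t seed, const uint8_t *buf, size_t len)
{
    uint16_t crc = seed;

    for (size_t i = 0; i < len; i++) {
#if defined(__GNUC__)
        if ((i % CACHE_LINE) == 0 && i + PREFETCH_AHEAD < len)
            __builtin_prefetch(buf + i + PREFETCH_AHEAD, 0 /* read */, 3);
#endif
        crc ^= (uint16_t)((uint16_t)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ T10DIF_POLY);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

In an optimized implementation the same hint would be placed ahead of wide loads so that the next part of the sector is already in cache when the folding step needs it.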