To reduce the cache missing events, the mix layout is changed
to PMULL+CRC. It also relaxes the final delay caused by data
dependency.
As results, the cold perf was improved about 20% and warm perf
was improved about 4%.
Change-Id: I7756f846edcb4f1665b4643a5a0e02283938cfdf
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
Hardware folding algorithm depend on CRC32 and PMULL instruction.
And it should match both flags .
Change-Id: I361068402db1fe6d7c0bd8d2c7048f1d94880233
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
The nmake default is changed for a modern nasm. Older nasm and yasm versions
will still work with windows but the nmake options must be changed appropriately
for max AS_FEATURE_LEVEL to match. Also now generates debug symbol pdb files.
Change-Id: I94a2dd7ecf541c6564ccbd4a184c33995d7b31ad
Signed-off-by: Poornima Kumar <poornima.kumar@intel.com>
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
This patch provides microarchitecture information
and make microarchitecture optimization possible. It
will trap into kernel due to mrs instruction. So it
should be called only in dispatcher, that will be
called only once in program lifecycle. And HWCAP must
be match,That will make sure there are no illegal
instruction errors.
Change-Id: I393ec742010bf3f10ce335482c0350aa4202c788
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
Change improper stack push in windows prolog. Error was not reachable without
windows nasm support and so went undetected.
Change-Id: I8b715195d1c8efd173843c043d42fc610ddebd17
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Previously windows build could only use yasm because some procedural items such
as proc_start were not supported by nasm. This adds a few macros and fixes so
nasm can be used to build on windows.
Change-Id: Ia05dc3ff482f33b0f915bb1be3c7df5e4a753b3a
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Push of registers overlapped xmm push. Error was not reachable without windows
nasm support and so went undetected.
Change-Id: I0ffd66f6d32ac37ea03fe9b11924968aa50f8fa7
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
For builds under windows this could emit a non-vec mov that's not optional for
AVX versions.
Change-Id: I31e6ea3b62d48c5a13f6e83f8d684f0b5551087b
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
The gf_{2-6}vect_dot_prod tests were kept in other_tests since the 5,6vect
functions were not strictly called by the higher level ec_encode_data() and
needed independent testing. As this has now changed the extra tests can be
removed as redundant.
Change-Id: I8a95e31487b150a2a8f929c5586785524d951fde
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Vec versions mix much better with other avx code.
Change-Id: I2544c75d09231ee70f16c384b1e57062976199d9
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
1) Implement the ErasureCode function in Altivec Intrinsics
2) Coding style update
Change-Id: I2c81d035f4083e9b011dbf3b741f628813b68606
Thanks-to: Daniel Axtens <dja@axtens.net>
Signed-off-by: Hong Bo Peng <penghb@cn.ibm.com>
Clang errors when generating dependencies due to a stray semicolon following a
function definition.
Change-Id: Iefb4aca988b643bb62a69bbbaf197aca20a2d085
Signed-off-by: Zhang Jinde <zjd5536@163.com>
Fixes invalid logic that attempted to eliminate unnecessary copy of input to the
history buffer in cases where it is not required. Correction should improve
performance and not change functionality.
Change-Id: Ife24dcc9d920ce220b1a394031e971321737a171
Signed-off-by: Zhang Jinde <zjd5536@163.com>
This patch addresses some cppcheck issues.
And some minor changes to maintain code consistency.
- Cleanup cppcheck issues.
[log][igzip/igzip_perf.c] (error) Shifting signed 32-bit value by 31 bits is undefined behaviour
[log][igzip/igzip_hist_perf.c:132]: (error) Memory leak: outbuf
- Some minor changes to maintain code consistency.
igzip/igzip_build_hash_table_perf.c
igzip/igzip_hist_perf.c
igzip/igzip_semi_dyn_file_perf.c
- delete unused variable
outbuf and outbuf_size from igzip/igzip_hist_perf.c
Change-Id: Icbbd8f70de689931c8a844d89e457af8d97c6793
Signed-off-by: Zhiyuan Zhu <zhiyuan.zhu@arm.com>
Added AVX512 optimized functions to calculate the
GF(2^8) vector dot product with 5 and 6 outputs
at a time. Also added GF(2^8) vector multiply
AVX512 optimized functions with 5 and 6 accumulate.
Change-Id: I6d2c080f4f4f8e4823ad9a9be2c65c3b5b3bb1f8
Signed-off-by: John Kariuki <John.K.Kariuki@intel.com>
+ Utilise `pmull2` instruction in main loops of arm64 crc functions and
avoid the need for `dup` to align multiplicands.
+ Use just 1 ASIMD register to hold both 64b p4 constants,
appropriately aligned.
+ Interleave quadword `ldr` with `pmull{2}` to avoid unnecessary stalls
on existing LITTLE uarch (which can only issue these instructions every
other cycle).
+ Similarly interleave scalar instructions with ASIMD instructions to
increase likelihood of instruction level parallelism on a variety of
uarch.
+ Cut down on needless instructions in non-critical sections to help
performance for small buffers.
+ Extract common instruction sequences into inner macros and moved
them into shared header - crc_common_pmull.h
+ Use the same human readable register aliases and register allocation
in all 4 implementations, never refer to registers without using human
readable alias.
+ Use #defines rather than .req to allow use of same names across
several implementations
+ Reduce tail case size from 1024B to 64B
+ Phrased the `eor` instructions in the main loop to more clearly show
that we can rewrite pairs of `eor` instructions with a single `eor3`
instruction in the presence of Armv8.2-SHA (should probably be an option
in multibinary in future).
Change-Id: I3688193ea4ad88b53cf47e5bd9a7fd5c2b4401e1
Signed-off-by: Samuel Lee <samuel.lee@microsoft.com>
Removed the redundant parts that apply to all arch.
Change-Id: I2015c436cc8ea09913a8d0d4ce2cf1f112d71dde
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
Replace hardcode gcc with $(CC). as_filter
will work correct in cross compile
Change-Id: I484d5074abdfc80ed5cd14fdd1358274f306bcfd
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
The third parameter must be 32bit register . Those assmebly
put 64bit register here , it is wrong .
Change-Id: Iebe17516b555a6a9b94ea7baa4778ad4b9dd0878
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
with -Wstric-prototype option , GCC report the
warning .
Change-Id: Ic2d1adb566ad21deec65c66552e2863254e1376a
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
- Disable clang test for travis and drone.io
- Add document about compiler requirement
Change-Id: I81f8dc31088d40f315dd4ec062bed5df8ab7b633
Signed-off-by: Jerry Yu <jerry.h.yu@arm.com>
Bug in Homebrew auto-update causes post-update install to use the old
environment.
Change-Id: I03e20d899f558f71579dfd4be3f96903b77f1998
Signed-off-by: Greg Tucker <greg.b.tucker@intel.com>
1.Replace below erasure code interfaces to arm neon interface by mbin_interface function.
ec_encode_data
gf_vect_mul
gf_vect_dot_prod
gf_vect_mad
ec_encode_data_update
2.Utilise arm neon instrution to accelerate GF(2^8) set compute by 128bit registor.
Change-Id: Ib0ecbfbd1837d2b1f823d26815c896724d2d22e4
Signed-off-by: Zhou Xiong <zhouxiong13@huawei.com>
Enable new arm64 architecture in TravisCI, add tests for
following compilers:
gcc: v5.4.0
clang: v3.8.0
Change-Id: Id0b2f2231fabcbeff7061f85050db99df12c9a67
Signed-off-by: Jun He <jun.he@arm.com>