5b349fc22e
The strategy for large block sizes is LDRD and STRD with offset addressing, where the main loop copies 64 bytes in every iteration, (i.e., 8 calls to LDRD and STRD pairs), interleaving load and stores (i.e., the pairs of LDRD and STRD of the same data are consecutive instructions), and the writeback of an updated address is a separate instruction, which allows us to write back the accumulated update once per iteration. This strategy is implemented in memcpy.S. In some configurations, a plain version of memcpy (included from memcpy-stub.c) is used instead of the optimized one. Validation: * Correctness: checked memcpy using a test harness for block sizes ranging between 1 to 128, and source and destination buffers alignment ranging in { 0,1,2,3,4,8,12 } bytes each. * Performance: benchmarking on Cortex-A15 FPGA indicates that this strategy is better for A15 than the strategy used by glibc and even slightly better than using NEON. Benchmarking on Cortex-A9 bare metal and Linux shows that the proposed strategy is reasonable: not as fast as the version of memcpy from glibc (which is the best open source strategy for A9), but comparable with csl and bionic. * Integration with GCC: no regression for arm-none-eabi --with-cpu cortex-a15 and cortex-a9. Change-Id: Ied56354d8992c62ae3e02d582a2bd55585d814b9 Signed-off-by: Vassilis Laganakos <vasileios.laganakos@arm.com> |
||
---|---|---|
libc | ||
libdl | ||
libm | ||
libstdc++ | ||
libthread_db | ||
linker | ||
tests | ||
.gitignore | ||
Android.mk | ||
CleanSpec.mk |