This is actually a slightly newer upstream version than the one I
originally pulled. Hopefully now it's in upstream-freebsd it will
be easier to track upstream, though I still need to sit down and
write the necessary scripts at some point.
Bug: 5110679
Change-Id: I87e563f0f95aa8e68b45578e2a8f448bbf827a33
The defines HAVE_32_BYTE_CACHE_LINES and ARCH_ARM_USE_NON_NEON_MEMCPY
are not used by any code. The previous memcpy code that used these
has been split into different architecture versions to avoid the need
for them.
Bug: 8005082
(cherry picked from commit 6e1a5cf31b)
Change-Id: I69654d47db1458136782b5504290f620e924ee75
Don't pull in unnecessary header files. AFAIK, I've fixed all
the code which didn't include the correct header files.
Change-Id: If0b7bba74e77cb24a0cf9ce8968aa07400855e58
The attached patch provides a new implementation of strcmp for ARM,
using LDRD instead of LDR whenever possible.
For older architectures that do not support LDRD, this implementation
uses the same algorithm as before.
Testing and benchmarking:
* Validation: successfully passes a test that compares different strings
of length 1-128 and offsets 0-8 from a word boundary. Checked on
qemu/A15/A9, ARM/Thumb mode, Big/Little Endian.
* Integration with gcc: no regression on qemu for arm-none-eabi --with-cpu
a15/a9 --with-mode arm/thumb.
Change-Id: I9e230e1b99dbdc9119b69ee858a89038c516a4ea
Signed-off-by: Vassilis Laganakos <vasileios.laganakos@arm.com>
The strategy for large block sizes is LDRD and STRD with offset addressing,
where the main loop copies 64 bytes in every iteration, (i.e., 8 calls to
LDRD and STRD pairs), interleaving load and stores (i.e., the pairs of LDRD
and STRD of the same data are consecutive instructions), and the writeback
of an updated address is a separate instruction, which allows us to write
back the accumulated update once per iteration.
This strategy is implemented in memcpy.S. In some configurations, a plain
version of memcpy (included from memcpy-stub.c) is used instead of the
optimized one.
Validation:
* Correctness: checked memcpy using a test harness for block sizes
ranging between 1 to 128, and source and destination buffers alignment
ranging in { 0,1,2,3,4,8,12 } bytes each.
* Performance: benchmarking on Cortex-A15 FPGA indicates that this strategy
is better for A15 than the strategy used by glibc and even slightly better
than using NEON. Benchmarking on Cortex-A9 bare metal and Linux shows
that the proposed strategy is reasonable: not as fast as the version of
memcpy from glibc (which is the best open source strategy for A9), but
comparable with csl and bionic.
* Integration with GCC: no regression for arm-none-eabi --with-cpu
cortex-a15 and cortex-a9.
Change-Id: Ied56354d8992c62ae3e02d582a2bd55585d814b9
Signed-off-by: Vassilis Laganakos <vasileios.laganakos@arm.com>
Move arch specific code for arm, mips, x86 into separate
makefiles.
In addition, add different arm cpu versions of memcpy/memset.
Bug: 8005082
(cherry picked from commit acdde8c1cf)
Change-Id: I0108d432af9f6283ae99adfc92a3399e5ab3e31d