memcpy.a15.S/strcmp.a15.S files were submitted by ARM for use as the basis
for the memcpy/strcmp implementations in cortex-a15.
memset.S was moved in to the generic directory.
NOTE: memcpy.a9.S was submitted by Linaro to be the basis for the memcpy
for cortex-a9/cortex-a15 but has not been incorporated yet.
Bug: 10971279
Merge from internal master.
(cherry-picked from 48fc3e8b9f)
Change-Id: I8f9297578990d517f004e4e8840e2b2cbd5a47d8
The check for __ARM_FEATURE_DSP being defined is pointless since it
is always defined.
Bug: 10971279
Merge from internal master.
(cherry-picked from d2642fa70c)
Change-Id: If23ab3271f4da0c38cd531ffdc9a7e5eed6ec5dc
Much of the per-architecture duplication can be removed, so let's do so
before we add the 64-bit architectures.
Change-Id: Ieb796503c8e5353ea38c3bab768bb9a690c9a767
I accidentally did a signed comparison of the size_t values passed in
for three of the _chk functions. Changing them to unsigned compares.
Add three new tests to verify this failure is fixed.
Bug: 10691831
Merge from internal master.
(cherry-picked from 883ef2499c)
Change-Id: Id9a96b549435f5d9b61dc132cf1082e0e30889f5
The backtrace when a fortify check failed was not correct. This change
adds all of the necessary directives to get a correct backtrace.
Fix the strcmp directives and change all labels to local labels.
Testing:
- Verify that the runtime can decode the stack for __memcpy_chk, __memset_chk,
__strcpy_chk, __strcat_chk fortify failures.
- Verify that gdb can decode the stack properly when hitting a fortify check.
- Verify that the runtime can decode the stack for a seg fault for all of the
_chk functions and for memcpy/memset.
- Verify that gdb can decode the stack for a seg fault for all of the _chk
functions and for memcpy/memset.
- Verify that the runtime can decode the stack for a seg fault for strcmp.
- Verify that gdb can decode the stack for a seg fault in strcmp.
Bug: 10342460
Bug: 10345269
Merge from internal master.
(cherry-picked from 05332f2ce7)
Change-Id: Ibc919b117cfe72b9ae97e35bd48185477177c5ca
The libcorkscrew stack unwinder does not understand cfi directives,
so add .save directives so that it can function properly.
Also add the directives in to strcmp.S and fix a missing set of
directives in cortex-a9/memcpy_base.S.
Bug: 10345269
Merge from internal master.
(cherry-picked from 5f7ccea3ff)
Change-Id: If48a216203216a643807f5d61906015984987189
I accidentally did a signed comparison of the size_t values passed in
for three of the _chk functions. Changing them to unsigned compares.
Add three new tests to verify this failure is fixed.
Bug: 10691831
Change-Id: Ia831071f7dffd5972a748d888dd506c7cc7ddba3
The backtrace when a fortify check failed was not correct. This change
adds all of the necessary directives to get a correct backtrace.
Fix the strcmp directives and change all labels to local labels.
Testing:
- Verify that the runtime can decode the stack for __memcpy_chk, __memset_chk,
__strcpy_chk, __strcat_chk fortify failures.
- Verify that gdb can decode the stack properly when hitting a fortify check.
- Verify that the runtime can decode the stack for a seg fault for all of the
_chk functions and for memcpy/memset.
- Verify that gdb can decode the stack for a seg fault for all of the _chk
functions and for memcpy/memset.
- Verify that the runtime can decode the stack for a seg fault for strcmp.
- Verify that gdb can decode the stack for a seg fault in strcmp.
Bug: 10342460
Bug: 10345269
Change-Id: I1dedadfee207dce4a285e17a21e8952bbc63786a
The libcorkscrew stack unwinder does not understand cfi directives,
so add .save directives so that it can function properly.
Also add the directives in to strcmp.S and fix a missing set of
directives in cortex-a9/memcpy_base.S.
Bug: 10345269
Change-Id: I043f493e0bb6c45bd3f4906fbe1d9f628815b015
This change pulls the memcpy code out into a new file so that the
__strcpy_chk and __strcat_chk can use it with an include.
The new versions of the two chk functions uses assembly versions
of strlen and memcpy to implement this check. This allows near
parity with the assembly versions of strcpy/strcat. It also means that
as memcpy implementations get faster, so do the chk functions.
Other included changes:
- Change all of the assembly labels to local labels. The other labels
confuse gdb and mess up backtracing.
- Add .cfi_startproc and .cfi_endproc directives so that gdb is not
confused when falling through from one function to another.
- Change all functions to use cfi directives since they are more powerful.
- Move the memcpy_chk fail code outside of the memcpy function definition
so that backtraces work properly.
- Preserve lr before the calls to __fortify_chk_fail so that the backtrace
actually works.
Testing:
- Ran the bionic unit tests. Verified all error messages in logs are set
correctly.
- Ran libc_test, replacing strcpy with __strcpy_chk and replacing
strcat with __strcat_chk.
- Ran the debugger on nexus10, nexus4, and old nexus7. Verified that the
backtrace is correct for all fortify check failures. Also verify that
when falling through from __memcpy_chk to memcpy that the backtrace is
still correct. Also verified the same for __memset_chk and bzero.
Verified the two different paths in the cortex-a9 memset routine that
save variables to the stack still show the backtrace properly.
Bug: 9293744
(cherry-picked from 2be91915dc)
Change-Id: Ia407b74d3287d0b6af0139a90b6eb3bfaebf2155
This change creates assembler versions of __memcpy_chk/__memset_chk
that is implemented in the memcpy/memset assembler code. This change
avoids an extra call to memcpy/memset, instead allowing a simple fall
through to occur from the chk code into the body of the real
implementation.
Testing:
- Ran the libc_test on __memcpy_chk/__memset_chk on all nexus devices.
- Wrote a small test executable that has three calls to __memcpy_chk and
three calls to __memset_chk. First call dest_len is length + 1. Second
call dest_len is length. Third call dest_len is length - 1.
Verified that the first two calls pass, and the third fails. Examined
the logcat output on all nexus devices to verify that the fortify
error message was sent properly.
- I benchmarked the new __memcpy_chk and __memset_chk on all systems. For
__memcpy_chk and large copies, the savings is relatively small (about 1%).
For small copies, the savings is large on cortex-a15/krait devices
(between 5% to 30%).
For cortex-a9 and small copies, the speed up is present, but relatively
small (about 3% to 5%).
For __memset_chk and large copies, the savings is also small (about 1%).
However, all processors show larger speed-ups on small copies (about 30% to
100%).
Bug: 9293744
Merge from internal master.
(cherry-picked from 7c860db074)
Change-Id: I916ad305e4001269460ca6ebd38aaa0be8ac7f52
This change pulls the memcpy code out into a new file so that the
__strcpy_chk and __strcat_chk can use it with an include.
The new versions of the two chk functions uses assembly versions
of strlen and memcpy to implement this check. This allows near
parity with the assembly versions of strcpy/strcat. It also means that
as memcpy implementations get faster, so do the chk functions.
Other included changes:
- Change all of the assembly labels to local labels. The other labels
confuse gdb and mess up backtracing.
- Add .cfi_startproc and .cfi_endproc directives so that gdb is not
confused when falling through from one function to another.
- Change all functions to use cfi directives since they are more powerful.
- Move the memcpy_chk fail code outside of the memcpy function definition
so that backtraces work properly.
- Preserve lr before the calls to __fortify_chk_fail so that the backtrace
actually works.
Testing:
- Ran the bionic unit tests. Verified all error messages in logs are set
correctly.
- Ran libc_test, replacing strcpy with __strcpy_chk and replacing
strcat with __strcat_chk.
- Ran the debugger on nexus10, nexus4, and old nexus7. Verified that the
backtrace is correct for all fortify check failures. Also verify that
when falling through from __memcpy_chk to memcpy that the backtrace is
still correct. Also verified the same for __memset_chk and bzero.
Verified the two different paths in the cortex-a9 memset routine that
save variables to the stack still show the backtrace properly.
Bug: 9293744
Change-Id: Id5aec8c3cb14101d91bd125eaf3770c9c8aa3f57
(cherry picked from commit 2be91915dc)
Create one version of strcat/strcpy/strlen for cortex-a15/krait and another
version for cortex-a9.
Tested with the libc_test strcat/strcpy/strlen tests.
Including new tests that verify that the src for strcat/strcpy do not
overread across page boundaries.
NOTE: The handling of unaligned strcpy (same code in strcat) could probably
be optimized further such that the src is read 64 bits at a time instead of
the partial reads occurring now.
strlen improves slightly since it was recently optimized.
Performance improvements for strcpy and strcat (using an empty dest string):
cortex-a9
- Small copies vary from about 5% to 20% as the size gets above 10 bytes.
- Copies >= 1024, about a 60% improvement.
- Unaligned copies, from about 40% improvement.
cortex-a15
- Most small copies exhibit a 100% improvement, a few copies only
improve by 20%.
- Copies >= 1024, about 150% improvement.
- Unaligned copies, about 100% improvement.
krait
- Most small copies vary widely, but on average 20% improvement, then
the performance gets better, hitting about a 100% improvement when
copies 64 bytes of data.
- Copies >= 1024, about 100% improvement.
- When coping MBs of data, about 50% improvement.
- Unaligned copies, about 90% improvement.
As strcat destination strings get larger in size:
cortex-a9
- about 40% improvement for small dst strings (>= 32).
- about 250% improvement for dst strings >= 1024.
cortex-a15
- about 200% improvement for small dst strings (>=32).
- about 250% improvement for dst strings >= 1024.
krait
- about 25% improvement for small dst strings (>=32).
- about 100% improvement for dst strings >=1024.
Merge from internal master.
(cherry-picked from d119b7b6f4)
Change-Id: I296463b251ef9fab004ee4dded2793feca5b547a
This change creates assembler versions of __memcpy_chk/__memset_chk
that is implemented in the memcpy/memset assembler code. This change
avoids an extra call to memcpy/memset, instead allowing a simple fall
through to occur from the chk code into the body of the real
implementation.
Testing:
- Ran the libc_test on __memcpy_chk/__memset_chk on all nexus devices.
- Wrote a small test executable that has three calls to __memcpy_chk and
three calls to __memset_chk. First call dest_len is length + 1. Second
call dest_len is length. Third call dest_len is length - 1.
Verified that the first two calls pass, and the third fails. Examined
the logcat output on all nexus devices to verify that the fortify
error message was sent properly.
- I benchmarked the new __memcpy_chk and __memset_chk on all systems. For
__memcpy_chk and large copies, the savings is relatively small (about 1%).
For small copies, the savings is large on cortex-a15/krait devices
(between 5% to 30%).
For cortex-a9 and small copies, the speed up is present, but relatively
small (about 3% to 5%).
For __memset_chk and large copies, the savings is also small (about 1%).
However, all processors show larger speed-ups on small copies (about 30% to
100%).
Bug: 9293744
Change-Id: I8926d59fe2673e36e8a27629e02a7b7059ebbc98
Create one version of strcat/strcpy/strlen for cortex-a15/krait and another
version for cortex-a9.
Tested with the libc_test strcat/strcpy/strlen tests.
Including new tests that verify that the src for strcat/strcpy do not
overread across page boundaries.
NOTE: The handling of unaligned strcpy (same code in strcat) could probably
be optimized further such that the src is read 64 bits at a time instead of
the partial reads occurring now.
strlen improves slightly since it was recently optimized.
Performance improvements for strcpy and strcat (using an empty dest string):
cortex-a9
- Small copies vary from about 5% to 20% as the size gets above 10 bytes.
- Copies >= 1024, about a 60% improvement.
- Unaligned copies, from about 40% improvement.
cortex-a15
- Most small copies exhibit a 100% improvement, a few copies only
improve by 20%.
- Copies >= 1024, about 150% improvement.
- Unaligned copies, about 100% improvement.
krait
- Most small copies vary widely, but on average 20% improvement, then
the performance gets better, hitting about a 100% improvement when
copies 64 bytes of data.
- Copies >= 1024, about 100% improvement.
- When coping MBs of data, about 50% improvement.
- Unaligned copies, about 90% improvement.
As strcat destination strings get larger in size:
cortex-a9
- about 40% improvement for small dst strings (>= 32).
- about 250% improvement for dst strings >= 1024.
cortex-a15
- about 200% improvement for small dst strings (>=32).
- about 250% improvement for dst strings >= 1024.
krait
- about 25% improvement for small dst strings (>=32).
- about 100% improvement for dst strings >=1024.
Change-Id: Ifd091ebdbce70fe35a7c5d8f71d5914255f3af35
This is needed when passing -mcpu=cortex-a9 or higher on a modern
toolchain for prebuilt library compatibility
Change-Id: I73eb2393377914ae26216a8c2828ad973d1c1225
Tested using a static version of the strlen libc_test program
on a nexus7 that uses the generic code.
Merge from internal master.
(cherry-picked from d8d10a8994)
Change-Id: I88f7dc01dc5b5c3ac2d5580d92153bc1bc36c564
This optimized version is primarily targeted at cortex-a15.
Tested on all nexus devices using the system/extras/libc_test strlen test.
Tested alignments from 1 to 32 that are powers of 2.
Tested that strlen does not cross page boundaries at all alignments.
Speed improvements listed below:
cortex-a15
- Sizes >= 32 bytes, ~75% improvement.
- Sizes >= 1024 bytes, ~250% improvement.
cortex-a9
- Sizes >= 32 bytes, ~75% improvement.
- Sizes >= 1024 bytes, ~85% improvement.
krait
- Sizes >= 32 bytes, ~95% improvement.
- Sizes >= 1024 bytes, ~160% improvement.
Merge from internal master.
(cherry-picked from 2fc0717977)
Change-Id: I1ceceb4e745fd68e9d946f96d1d42e0cdaff6ccf
We cleaned up the auto-generated ones a while back to not touch
the stack unnecessarily if they have <= 4 arguments. This patch
cleans up some hand-crafted ones.
Also improve comments in clone.S.
Change-Id: I8850bf98f2b26829385315304472a760e6880ed8
Tested using a static version of the strlen libc_test program
on a nexus7 that uses the generic code.
Change-Id: If04d15dcb6c0b18f27f2fefadca5510ed49016c5
This memcpy code uses NEON/VFP to achieve very good performance
on ARMv7-A processors. It is specifically tuned for A15 but should
provide good performance on A9 also. It is equivalent to the code
in cortex-strings rev 116.
This patch is a follow up the existing gerrit change:
I7f6f77995f3ca903ad9c66d14261441667a2a935
This version includes a tweak for performance on misaligned
buffers and splits the header comment into license and
documentation sections.
Change-Id: Ibd2e23c8d8e01357ba0247be1d05192de3ceba69
Signed-off-by: Will Newton <will.newton@linaro.org>
This memcpy code uses NEON/VFP to achieve very good performance
on ARMv7-A processors. It is specifically tuned for A15 but should
provide good performance on A9 also. It is equivalent to the code
in cortex-strings rev 116.
This patch is a follow up the existing gerrit change:
I7f6f77995f3ca903ad9c66d14261441667a2a935
But this version includes a tweak for performance on misaligned
buffers.
Change-Id: I285abac0068f8ae29a1cbf7862ea8590aadaf0a7
Signed-off-by: Will Newton <will.newton@linaro.org>
Streamline the memcpy a bit removing some unnecessary instructions.
The biggest speed improvement comes from changing the size of
the preload. On krait, the sweet spot for the preload in the main
loop is twice the L1 cache line size.
In most cases, these small tweaks yield > 1000MB/s speed ups. As
the size of the memcpy approaches about 1MB, the speed improvement
disappears.
Change-Id: Ief79694d65324e2db41bee4707dae19b8c24be62
This uses the new code original submitted as memcpy.a15.S as
the base. However, the old code handled unaligned src/dst better
so that was spliced in. I optimized the original unaligned code by
removing a few unnecessary instructions. I optimized the a15 code by
rewriting the pre and post code. I also modified the main loop to add
a pld so that larger copies would not stall waiting for memory.
Test cases for the new memcpy:
- Copy all sized values from 0 to 1024 bytes, using whatever alignment
is returned by malloc.
For each alignment case described below, the test copied from 0 to 128
bytes.
- Src and dst pointers are both aligned to the same value, starting
at one going through every power of two up to and including 128.
- Src aligned to double word boundary, dst aligned to word boundary.
- Src aligned to word boundary, dst aligned to double word boundary.
- Src aligned to 16 bit boundary, dst aligned to word boundary.
- Src aligned to word boundary, dst aligned to 16 byte boundary.
- Src aligned to word boundary, dst aligned to 1 byte from a word
boundary.
- Src aligned to word boundary, dst aligned to 2 bytes from a word
boundary.
- Src aligned to word boundary, dst aligned to 3 bytes from a word
boundary.
- Src aligned to 1 byte from a word boundary, dst aligned to a word
boundary.
- Src aligned to 2 bytes from a word boundary, dst aligned to a word
boundary.
- Src aligned to 3 bytes from a word boundary, dst aligned to a word
boundary.
Cases to verify the unaligned source code properly aligns to a 16 bit
boundary.
- Src aligned to 1 byte from a 128 bit boundary, dst aligned to
4 + 128 bit boundary.
- Src aligned to 1 byte from a 128 bit boundary, dst aligned to
8 + 128 bit boundary.
- Src aligned to 1 byte from a 128 bit boundary, dst aligned to
12 + 128 bit boundary.
- Src aligned to 1 byte from a 128 bit boundary, dst aligned to
16 + 128 bit boundary.
In all cases, a two byte fencepost was placed at the end of the
destination to verify that only the requested number of bytes were copied.
Bug: 8005082
Merge from internal master.
(cherry-picked from commit 21ede92d79)
Change-Id: Ief70c9e6dc8c6473ae245b6570b2c266fed9618c
This lets us move all the ARM syscall stubs over to the kernel <asm/unistd.h>.
Our generated <sys/linux-syscalls.h> is now unused, but I'll remove that in a
later change.
Change-Id: Ie5ff2cc4abce1938576af7cbaef615a79c7f310d
For some reason, socketcalls.c was only being compiled for ARM, where
it makes no sense. For x86 we generate stubs for the socket functions
that use __NR_socketcall directly.
Change-Id: I84181e6183fae2314ae3ed862276eba82ad21e8e
<sys/linux-syscalls.h> only contains constants for the syscalls
we're generating stubs for. We want all the syscalls available
on the architecture in question.
Keep using <sys/linux-syscalls.h> on ARM for now because the
__NR_ARM_set_tls and __NR_ARM_cacheflush values aren't in <asm/unistd.h>.
Change-Id: I66683950d87d9b18d6107d0acc0ed238a4496f44
This uses the new strcmp.a15.S code as the basis for new versions
of strcmp.S.
The cortex-a15 code is the performance optimized version of strcmp.a15.S
taken with only the addition of a few pld instructions.
The cortex-a9 code is the same as the cortex-a15 code except that the
unaligned strcmp code was taken from the original strcmp.S.
The krait code is the same as the cortex-a15 code except that one path
in the unaligned strcmp code was taken from the original strcmp.S code
(the 2 byte overlap case).
The generic code is the original unmodified strmp.S from the bionic
subdirectory.
All three new versions underwent these test cases:
Strings the same, all same size:
- Both pointers double word aligned.
- One pointer double word aligned, one pointer word aligned.
- Both pointers word aligned.
- One pointer double word aligned, one pointer 1 off a word alignment.
- One pointer double word aligned, one pointer 2 off a word alignment.
- One pointer double word aligned, one pointer 3 off a word alignment.
- One pointer word aligned, one pointer 1 off a word alignment.
- One pointer word aligned, one pointer 2 off a word alignment.
- One pointer word aligned, one pointer 3 off a word alignment.
For all cases where it made sense, the two pointers were also tested
swapped.
Different strings, all same size:
- Single difference at double word boundary.
- Single difference at word boudary.
- Single difference at 1 off a word alignment.
- Single difference at 2 off a word alignment.
- Single difference at 3 off a word alignment.
Different sized strings, strings the same until the end:
- Shorter string ends on a double word boundary.
- Shorter string ends on word boundary.
- Shorter string ends at 1 off a word boundary.
- Shorter string ends at 2 off a word boundary.
- Shorter string ends at 3 off a word boundary.
For all different cases, run them through the same pointer alignment
cases when the strings are the same size.
For all cases the two pointers were also tested swapped.
Bug: 8005082
Merge from internal master.
(cherry-picked from commit a9a5870d16)
Change-Id: I4c2b98f8a50804fb98ab67f75e9d660f1315a144
We only need one logging API, and I prefer the one that does no
allocation and is thus safe to use in any context.
Also use O_CLOEXEC when opening the /dev/log files.
Move everything logging-related into one header file.
Change-Id: Ic1e3ea8e9b910dc29df351bff6c0aa4db26fbb58
Move arch specific code for arm, mips, x86 into separate
makefiles.
In addition, add different arm cpu versions of memcpy/memset.
Bug: 8005082
Merge from internal master (acdde8c1cf).
Change-Id: I04f3d0715104fab618e1abf7cf8f7eec9bec79df
The attached patch provides a new implementation of strcmp for ARM,
using LDRD instead of LDR whenever possible.
For older architectures that do not support LDRD, this implementation
uses the same algorithm as before.
Testing and benchmarking:
* Validation: successfully passes a test that compares different strings
of length 1-128 and offsets 0-8 from a word boundary. Checked on
qemu/A15/A9, ARM/Thumb mode, Big/Little Endian.
* Integration with gcc: no regression on qemu for arm-none-eabi --with-cpu
a15/a9 --with-mode arm/thumb.
Change-Id: I9e230e1b99dbdc9119b69ee858a89038c516a4ea
Signed-off-by: Vassilis Laganakos <vasileios.laganakos@arm.com>
The strategy for large block sizes is LDRD and STRD with offset addressing,
where the main loop copies 64 bytes in every iteration, (i.e., 8 calls to
LDRD and STRD pairs), interleaving load and stores (i.e., the pairs of LDRD
and STRD of the same data are consecutive instructions), and the writeback
of an updated address is a separate instruction, which allows us to write
back the accumulated update once per iteration.
This strategy is implemented in memcpy.S. In some configurations, a plain
version of memcpy (included from memcpy-stub.c) is used instead of the
optimized one.
Validation:
* Correctness: checked memcpy using a test harness for block sizes
ranging between 1 to 128, and source and destination buffers alignment
ranging in { 0,1,2,3,4,8,12 } bytes each.
* Performance: benchmarking on Cortex-A15 FPGA indicates that this strategy
is better for A15 than the strategy used by glibc and even slightly better
than using NEON. Benchmarking on Cortex-A9 bare metal and Linux shows
that the proposed strategy is reasonable: not as fast as the version of
memcpy from glibc (which is the best open source strategy for A9), but
comparable with csl and bionic.
* Integration with GCC: no regression for arm-none-eabi --with-cpu
cortex-a15 and cortex-a9.
Change-Id: Ied56354d8992c62ae3e02d582a2bd55585d814b9
Signed-off-by: Vassilis Laganakos <vasileios.laganakos@arm.com>
Fix the pthread_setname_np test to take into account that emulator kernels are
so old that they don't support setting the name of other threads.
The CLONE_DETACHED thread is obsolete since 2.5 kernels.
Rename kernel_id to tid.
Fix the signature of __pthread_clone.
Clean up the clone and pthread_setname_np implementations slightly.
Change-Id: I16c2ff8845b67530544bbda9aa6618058603066d
If r0 == 0, we're the child. If r0 > 0, we're the parent.
Otherwise set errno.
The __bionic_clone code I copy & pasted was wrong. This patch
fixes both.
Bug: 3461078
Change-Id: Ibb7d6cc7e54e666841f2f0dc59a141a0b31982e4
MIPS and x86 appear to have been correct already.
(Also fix unit tests that ASSERT_EQ with errno so that the
arguments are in the retarded junit order.)
Bug: 3461078
Change-Id: I2418ea98927b56e15b4ba9cfec97f5e7094c6291