Commit Graph

9 Commits

Author SHA1 Message Date
Christopher Ferris
2be91915dc Create optimized __strcpy_chk/__strcat_chk.
This change pulls the memcpy code out into a new file so that the
__strcpy_chk and __strcat_chk can use it with an include.

The new versions of the two chk functions uses assembly versions
of strlen and memcpy to implement this check. This allows near
parity with the assembly versions of strcpy/strcat. It also means that
as memcpy implementations get faster, so do the chk functions.

Other included changes:
- Change all of the assembly labels to local labels. The other labels
  confuse gdb and mess up backtracing.
- Add .cfi_startproc and .cfi_endproc directives so that gdb is not
  confused when falling through from one function to another.
- Change all functions to use cfi directives since they are more powerful.
- Move the memcpy_chk fail code outside of the memcpy function definition
  so that backtraces work properly.
- Preserve lr before the calls to __fortify_chk_fail so that the backtrace
  actually works.

Testing:

- Ran the bionic unit tests. Verified all error messages in logs are set
  correctly.
- Ran libc_test, replacing strcpy with __strcpy_chk and replacing
  strcat with __strcat_chk.
- Ran the debugger on nexus10, nexus4, and old nexus7. Verified that the
  backtrace is correct for all fortify check failures. Also verify that
  when falling through from __memcpy_chk to memcpy that the backtrace is
  still correct. Also verified the same for __memset_chk and bzero.
  Verified the two different paths in the cortex-a9 memset routine that
  save variables to the stack still show the backtrace properly.

Bug: 9293744
Change-Id: Id5aec8c3cb14101d91bd125eaf3770c9c8aa3f57
2013-08-13 18:05:33 -07:00
Christopher Ferris
7c860db074 Optimize __memset_chk, __memcpy_chk.
This change creates assembler versions of __memcpy_chk/__memset_chk
that is implemented in the memcpy/memset assembler code. This change
avoids an extra call to memcpy/memset, instead allowing a simple fall
through to occur from the chk code into the body of the real
implementation.

Testing:

- Ran the libc_test on __memcpy_chk/__memset_chk on all nexus devices.
- Wrote a small test executable that has three calls to __memcpy_chk and
  three calls to __memset_chk. First call dest_len is length + 1. Second
  call dest_len is length. Third call dest_len is length - 1.
  Verified that the first two calls pass, and the third fails. Examined
  the logcat output on all nexus devices to verify that the fortify
  error message was sent properly.
- I benchmarked the new __memcpy_chk and __memset_chk on all systems. For
  __memcpy_chk and large copies, the savings is relatively small (about 1%).
  For small copies, the savings is large on cortex-a15/krait devices
  (between 5% to 30%).
  For cortex-a9 and small copies, the speed up is present, but relatively
  small (about 3% to 5%).
  For __memset_chk and large copies, the savings is also small (about 1%).
  However, all processors show larger speed-ups on small copies (about 30% to
  100%).

Bug: 9293744

Change-Id: I8926d59fe2673e36e8a27629e02a7b7059ebbc98
2013-08-06 15:38:29 -07:00
Christopher Ferris
d119b7b6f4 Optimize strcat/strcpy, small tweaks to strlen.
Create one version of strcat/strcpy/strlen for cortex-a15/krait and another
version for cortex-a9.

Tested with the libc_test strcat/strcpy/strlen tests.
Including new tests that verify that the src for strcat/strcpy do not
overread across page boundaries.

NOTE: The handling of unaligned strcpy (same code in strcat) could probably
be optimized further such that the src is read 64 bits at a time instead of
the partial reads occurring now.

strlen improves slightly since it was recently optimized.

Performance improvements for strcpy and strcat (using an empty dest string):

cortex-a9
- Small copies vary from about 5% to 20% as the size gets above 10 bytes.
- Copies >= 1024, about a 60% improvement.
- Unaligned copies, from about 40% improvement.

cortex-a15
- Most small copies exhibit a 100% improvement, a few copies only
  improve by 20%.
- Copies >= 1024, about 150% improvement.
- Unaligned copies, about 100% improvement.

krait
- Most small copies vary widely, but on average 20% improvement, then
  the performance gets better, hitting about a 100% improvement when
  copies 64 bytes of data.
- Copies >= 1024, about 100% improvement.
- When coping MBs of data, about 50% improvement.
- Unaligned copies, about 90% improvement.

As strcat destination strings get larger in size:

cortex-a9
- about 40% improvement for small dst strings (>= 32).
- about 250% improvement for dst strings >= 1024.

cortex-a15
- about 200% improvement for small dst strings (>=32).
- about 250% improvement for dst strings >= 1024.

krait
- about 25% improvement for small dst strings (>=32).
- about 100% improvement for dst strings >=1024.

Change-Id: Ifd091ebdbce70fe35a7c5d8f71d5914255f3af35
2013-08-02 10:31:51 -07:00
Christopher Ferris
2fc0717977 Add new optimized strlen for arm.
This optimized version is primarily targeted at cortex-a15.

Tested on all nexus devices using the system/extras/libc_test strlen test.
Tested alignments from 1 to 32 that are powers of 2.
Tested that strlen does not cross page boundaries at all alignments.

Speed improvements listed below:

cortex-a15
- Sizes >= 32 bytes, ~75% improvement.
- Sizes >= 1024 bytes, ~250% improvement.

cortex-a9
- Sizes >= 32 bytes, ~75% improvement.
- Sizes >= 1024 bytes, ~85% improvement.

krait
- Sizes >= 32 bytes, ~95% improvement.
- Sizes >= 1024 bytes, ~160% improvement.

Change-Id: I361b1a36ed89ab991f2a8f0abbf0d7416d39c8f5
2013-07-15 12:37:51 -07:00
Christopher Ferris
7ffad9c120 Rewrite memset for cortexa15 to use strd.
Change-Id: Iac3af55f7813bd2b40a41bd19403f2b4dca5224b
2013-04-08 18:48:46 -07:00
Christopher Ferris
6ffaa931c3 Add missing branch in memcpy.S dst aligned case.
Change-Id: I0651e46dc5d9b49a997c0c6d099b75097eedb504
2013-04-02 09:19:00 -07:00
Christopher Ferris
21ede92d79 Update to latest cortexa15 memcpy code.
This uses the new code original submitted as memcpy.a15.S as
the base. However, the old code handled unaligned src/dst better
so that was spliced in. I optimized the original unaligned code by
removing a few unnecessary instructions. I optimized the a15 code by
rewriting the pre and post code. I also modified the main loop to add
a pld so that larger copies would not stall waiting for memory.

Test cases for the new memcpy:

- Copy all sized values from 0 to 1024 bytes, using whatever alignment
  is returned by malloc.
For each alignment case described below, the test copied from 0 to 128
bytes.
- Src and dst pointers are both aligned to the same value, starting
  at one going through every power of two up to and including 128.
- Src aligned to double word boundary, dst aligned to word boundary.
- Src aligned to word boundary, dst aligned to double word boundary.
- Src aligned to 16 bit boundary, dst aligned to word boundary.
- Src aligned to word boundary, dst aligned to 16 byte boundary.
- Src aligned to word boundary, dst aligned to 1 byte from a word
  boundary.
- Src aligned to word boundary, dst aligned to 2 bytes from a word
  boundary.
- Src aligned to word boundary, dst aligned to 3 bytes from a word
  boundary.
- Src aligned to 1 byte from a word boundary, dst aligned to a word
  boundary.
- Src aligned to 2 bytes from a word boundary, dst aligned to a word
  boundary.
- Src aligned to 3 bytes from a word boundary, dst aligned to a word
  boundary.

Cases to verify the unaligned source code properly aligns to a 16 bit
boundary.
- Src aligned to 1 byte from a 128 bit boundary, dst aligned to
  4 + 128 bit boundary.
- Src aligned to 1 byte from a 128 bit boundary, dst aligned to
  8 + 128 bit boundary.
- Src aligned to 1 byte from a 128 bit boundary, dst aligned to
  12 + 128 bit boundary.
- Src aligned to 1 byte from a 128 bit boundary, dst aligned to
  16 + 128 bit boundary.

In all cases, a two byte fencepost was placed at the end of the
destination to verify that only the requested number of bytes were copied.

Bug: 8005082
Change-Id: I700b2fab81941959d301ab1934c18fbd8ee3eee4
2013-03-30 14:32:49 -07:00
Christopher Ferris
a9a5870d16 Create arch specific versions of strcmp.
This uses the new strcmp.a15.S code as the basis for new versions
of strcmp.S.

The cortex-a15 code is the performance optimized version of strcmp.a15.S
taken with only the addition of a few pld instructions.
The cortex-a9 code is the same as the cortex-a15 code except that the
unaligned strcmp code was taken from the original strcmp.S.
The krait code is the same as the cortex-a15 code except that one path
in the unaligned strcmp code was taken from the original strcmp.S code
(the 2 byte overlap case).
The generic code is the original unmodified strmp.S from the bionic
subdirectory.

All three new versions underwent these test cases:

Strings the same, all same size:
- Both pointers double word aligned.
- One pointer double word aligned, one pointer word aligned.
- Both pointers word aligned.
- One pointer double word aligned, one pointer 1 off a word alignment.
- One pointer double word aligned, one pointer 2 off a word alignment.
- One pointer double word aligned, one pointer 3 off a word alignment.
- One pointer word aligned, one pointer 1 off a word alignment.
- One pointer word aligned, one pointer 2 off a word alignment.
- One pointer word aligned, one pointer 3 off a word alignment.
For all cases where it made sense, the two pointers were also tested
swapped.

Different strings, all same size:
- Single difference at double word boundary.
- Single difference at word boudary.
- Single difference at 1 off a word alignment.
- Single difference at 2 off a word alignment.
- Single difference at 3 off a word alignment.

Different sized strings, strings the same until the end:
- Shorter string ends on a double word boundary.
- Shorter string ends on word boundary.
- Shorter string ends at 1 off a word boundary.
- Shorter string ends at 2 off a word boundary.
- Shorter string ends at 3 off a word boundary.

For all different cases, run them through the same pointer alignment
cases when the strings are the same size.
For all cases the two pointers were also tested swapped.

Bug: 8005082
Change-Id: I5f3dc02b48afba2cb9c13332ab45c828ff171a1c
2013-03-12 10:25:11 -07:00
Christopher Ferris
acdde8c1cf Break bionic implementations into arch versions.
Move arch specific code for arm, mips, x86 into separate
makefiles.
In addition, add different arm cpu versions of memcpy/memset.

Bug: 8005082
Change-Id: I04f3d0715104fab618e1abf7cf8f7eec9bec79df
2013-02-26 16:04:34 -08:00