Add an optimized memset that is ~20% faster for cortex-a7 and
cortex-a53.
Add a 32 bit optimized cortex-a53 memcpy that is about ~20% faster
on cached data.
Fix the cortex-a15 __str{cat,cpy}_chk.S, memcpy_base.S to remove
the phony functions, since they aren't needed any more. Then add
a direct include of these for cortex-a53.
Verified the new functions by stepping through all of the major
paths and verifying the backtrace is still correct.
Bug: 22696180
Change-Id: Iec92a3f82d51243cca76c9aff9f35d920ff865ae