use vld1.8 {d0[]}, [r0] rather than ldrb+vdup; mildly faster

Change-Id: I5c24d49a90c2855c94395184774b289da8e9d5a7