Ported from ARM NEON, with vector_dmul_scalar added. The functions are between 1.5 and 5 times faster than the C implementations when built with Apple's clang-503.0.19 and run on an A7.
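
For reference, a minimal scalar sketch of what vector_dmul_scalar is assumed to compute: each double in the source is multiplied by a single scalar and stored in the destination. The name vector_dmul_scalar_ref and its parameter list are hypothetical, chosen only for illustration; this is not the NEON code from this change.

```c
#include <assert.h>

/* Hypothetical scalar reference: dst[i] = src[i] * mul for i in [0, len).
 * Assumed semantics only; the NEON version would process several doubles
 * per iteration instead of one. */
static void vector_dmul_scalar_ref(double *dst, const double *src,
                                   double mul, int len)
{
    assert(len >= 0);
    for (int i = 0; i < len; i++)
        dst[i] = src[i] * mul;
}
```

A scalar loop like this is the kind of C baseline the speedup figures are measured against.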