Oops.
The SSE2 version is 4%-35% faster than the MMX version, depending on the width.
The AVX2 version is 1%-13% faster than the SSE2 version, depending on the width.
Heavily based upon ff_add_bytes by Christophe Gisquet.

Reviewed-by: James Almer <jamrial@gmail.com>
Signed-off-by: Timothy Gu <timothygu99@gmail.com>