Signed-off-by: Ronald S. Bultje <rsbultje@gmail.com>
The intermediate buffer is always aligned.
Speed: 3.9x to 9.6x faster than C, and up to 15% faster than the existing MMX code (particularly for bigger filters).