use of assembly language routines: rename the assembly language function
to the private_* variant unconditionally and perform tests from a small
C wrapper.
Modest improvement coefficients mean that code already had some
parallelism and there was not very much room for improvement. Special
thanks to Ted Krovetz for benchmarking the code with such patience.