Move asm routines to common. Delete obsolete decoder routines.
Use wider routines where applicable.
~1.07x overall faster decode on a quick 720p30 4Mbps test on Haswell.
valgrind thinks xmm2 is uninitialized - in fact it is, but
its value here doesn't really matter. Instead set it to a known value
before using it in SUMW_HORIZON.
Previously the assembly sources had mixed indentation consisting
of both spaces and tabs, making it quite hard to read unless
the right tab size was used in the editor.
Tabs have been interpreted as 4 spaces in most cases, matching
the surrounding code.