Assume that data can be written into the padding area following each
line. This enables the use of faster routines for more cases.
Align downsample buffer stride to a multiple of 32.
With this all strides used should be a multiple of 16, which means
that use of narrower downsample routines can be dropped altogether.