
Assume that data can be written into the padding area following each line. This enables the use of faster routines for more cases. Align downsample buffer stride to a multiple of 32. With this all strides used should be a multiple of 16, which means that use of narrower downsample routines can be dropped altogether.