df69c751a7
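The code was expanding to Q registers so that vqrshrn (vector saturating rounding shift right and narrow) could be used. That instruction takes a 128-bit Q register as its source.

If 4 values are added together, the sum is shifted right by 2. If 8 values are added, it is shifted right by 3. The shift alone brings the result back within 8 bits, so overflow is impossible and saturation never triggers. That means the saturating narrowing shift can be skipped. The shift can be done on the 16-bit lanes instead, which allows keeping the values in D registers and casting the 16-bit result to 8 bits.

Change-Id: I8d9cfa07176271f492c116ffa6a7b351af0b8751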