d202138621
This change includes:
1. Aligned reads in the vp9_mbloop_filter_vertical_edge function. Since we actually read 16 bytes, we can align the reads by starting at (s - 8) instead of (s - 5).
2. Combined the u and v loop filters.
3. Added an 8x16 transpose.

This gave a 2% decoder performance gain (tulip clip).

Change-Id: Ib14c2f1645c4a3436df17fe2f24789506bf0bb58
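
Note on the 8x16 transpose: the following is a minimal scalar sketch of what such a transpose computes, not the code from this change. It assumes the input is 16 rows of 8 bytes (for example, 8 rows taken from the u plane followed by 8 rows from the v plane, which is one plausible reading of the combined u, v filter) and that the output is 8 rows of 16 bytes. The function name transpose_8x16 and its parameters are hypothetical, and a real implementation would presumably use SIMD unpack instructions rather than a loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Transposes a block of 16 rows x 8 columns of bytes into
 * 8 rows x 16 columns, so that each source column becomes one
 * contiguous 16-byte destination row. */
static void transpose_8x16(const uint8_t *src, ptrdiff_t src_stride,
                           uint8_t *dst, ptrdiff_t dst_stride) {
  for (int r = 0; r < 16; ++r) {
    for (int c = 0; c < 8; ++c) {
      /* Source element (row r, column c) moves to (row c, column r). */
      dst[c * dst_stride + r] = src[r * src_stride + c];
    }
  }
}
```

After the transpose, the pixels that sat on either side of a vertical edge lie in separate 16-byte rows, so a filter written for rows can process 16 edge positions at once.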