- Performance matches the assembly version.
- Unit tests pass.

Change-Id: I6eacfbbd826b3946c724d78fbef7948af6406ccd

The intrinsic version reduces the average cycle count from 183 to 175.

Change-Id: I7c1bcdb0a830266e93d8347aed38120fb3be0e03