On AArch64 systems with SVE support, 128-bit SVE implementations can
perform significantly worse than equivalent NEON code due to the
different optimization strategies used in each implementation. The NEON
version is unrolled 4 times, providing excellent performance at the
fixed 128-bit width. The SVE version can achieve similar or better
performance through its variable-width operations on systems with
256-bit or 512-bit SVE, but on 128-bit SVE systems, the NEON unrolled
implementation is faster due to reduced overhead.
This change adds runtime detection of SVE vector length and falls back
to the optimized NEON implementation when SVE is operating at 128-bit
width, ensuring optimal performance across all AArch64 configurations.
This implementation checks the vector length with an intrinsic if the
compiler supports it (which works on Apple as well) and falls back to
using prctl otherwise.
This optimization ensures that systems benefit from:
- 4x unrolled NEON code on 128-bit SVE systems
- Variable-width SVE optimizations on wider SVE implementations
- Maintained compatibility across different AArch64 configurations
Performance improvement on systems with 128-bit SVE:
- Encode: 7509.80 MB/s → 8995.59 MB/s (+19.8% improvement)
- Decode: 9383.67 MB/s → 12272.38 MB/s (+30.8% improvement)
Signed-off-by: Jonathan Swinney <jswinney@amazon.com>