Add launch bounds attributes to all CUDA kernels (cherry picked from commit d22516872c)