Add launch bounds attributes to all CUDA kernels

(cherry picked from commit d22516872cab0fa7c9b661f85e1eb1d36b2ff7cb)