CUDA 6.0 had a bug in the NPP LUT implementation (invalid results when
src == 255). The bug was fixed in CUDA 6.5.
Replaced the NPP LUT call with our own implementation (ported from the
master branch) to be independent of the CUDA Toolkit version.
(cherry picked from commit eaaa2d27d5ab334c74c2d10550a6097f437fb297)