When accessing global memory by DWORD4, memory bandwidth
can be fully utilized on Intel platform. This patch will
make more image format(e.g. 8UC4) be processed in DWORD4
by work-item. After applying this patch, 3 subcase of
./opencv_perf_core --gtest_filter=OCL_RepeatFixture_Repeat.Repeat/*
can be speedup on HD4000 graphics card with Beignet:
OCL_RepeatFixture_Repeat.Repeat/2, 64% improvement.
OCL_RepeatFixture_Repeat.Repeat/6, 50% improvement.
OCL_RepeatFixture_Repeat.Repeat/8, 56% improvement.
Signed-off-by: Chuanbo Weng <chuanbo.weng@intel.com>