I propose forEach method for cv::Mat and cv::Mat_.
This is solution for the overhead of MatIterator_<_Tp>.
I runs a test that micro opecode runs all over the pixel of cv::Mat_<cv::Point3_<uint8_t>>.
And this implementation 40% faster than the simple pointer, 80% faster than iterator.
With OpenMP, 70% faster than simple pointer, 95% faster than iterator (Core i7 920).
Above all, code is more readable.
My test code is here.
https://gist.github.com/kazuki-ma/8285876
Thanks.