.. _gpuBasicsSimilarity:

Similarity check (PSNR and SSIM) on the GPU
*******************************************

Goal
====

In the :ref:`videoInputPSNRMSSIM` tutorial I already presented the PSNR and SSIM methods for
checking the similarity between two images. As you could see there, performing these takes
quite some time, especially in the case of the SSIM. However, if the performance numbers of an
OpenCV implementation for the CPU do not satisfy you and you happen to have an NVidia CUDA GPU
device in your system, all is not lost. You may try to port or write your algorithm for the video
card.

This tutorial will give you a good grasp of how to approach coding with the GPU module of OpenCV. As
a prerequisite you should already know how to handle the core, highgui and imgproc modules. So, our
goals are:

.. container:: enumeratevisibleitemswithsquare

   + What's different compared to the CPU?
   + Create the GPU code for the PSNR and SSIM
   + Optimize the code for maximal performance

The source code
===============

You may also find the source code and the video files in the
:file:`samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity` folder of the
OpenCV source library or :download:`download it from here
<../../../../samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp>`. The
full source code is quite long (due to the controlling of the application via the command line
arguments and the performance measurement). Therefore, to avoid cluttering up these sections with
those, you'll find here only the functions themselves.

The PSNR returns a float number that, if the two inputs are similar, lies between 30 and 50 (higher
is better).

.. literalinclude:: ../../../../samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp
   :language: cpp
   :linenos:
   :tab-width: 4
   :lines: 165-210, 18-23, 210-235

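
To make the PSNR value more concrete, here is a minimal sketch of the arithmetic behind it in plain
C++, with no OpenCV involved. It assumes 8-bit data (peak value 255). The function name ``psnr8u``
and the choice to return 0 for identical inputs are assumptions made for this illustration; they are
not taken from the sample file.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// PSNR in decibels for two equally sized 8-bit buffers (hypothetical helper).
// Assumption: identical inputs return 0 to avoid dividing by a zero error.
double psnr8u(const std::vector<std::uint8_t>& a, const std::vector<std::uint8_t>& b)
{
    if (a.size() != b.size() || a.empty())
        throw std::invalid_argument("inputs must be non-empty and of equal size");

    double sse = 0.0;                                  // sum of squared errors
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sse += d * d;
    }
    if (sse <= 1e-10)                                  // practically identical inputs
        return 0.0;

    const double mse = sse / static_cast<double>(a.size());
    return 10.0 * std::log10((255.0 * 255.0) / mse);   // peak value 255 for 8-bit data
}
```

For example, the buffers {10, 20, 30, 40} and {12, 18, 33, 37} have a mean squared error of 6.5,
which gives a PSNR of roughly 40 dB, inside the "similar" range mentioned above.
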
The SSIM returns the MSSIM of the images. This is also a float number between zero and one (higher
is better); however, we get one for each channel. Therefore, we return a *Scalar* OpenCV data
structure:

.. literalinclude:: ../../../../samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp
   :language: cpp
   :linenos:
   :tab-width: 4
   :lines: 235-355, 26-42, 357-

How to do it? - The GPU
=======================

As you can see, we have three types of functions for each operation: one for the CPU and two for
the GPU. The reason I made two for the GPU is to illustrate that often simply porting your CPU code
to the GPU will actually make it slower. If you want some performance gain you will need to remember
a few rules, which I'm going to detail later on.

The GPU module was designed to resemble its CPU counterpart as much as possible. This is to make
porting easy. The first thing you need to do before writing any code is to link the GPU module to
your project and include the header file for the module. All the functions and data structures of
the GPU module are in a *gpu* sub-namespace of the *cv* namespace. You may add this to the default
one via the *using namespace* directive, or mark it explicitly everywhere via the namespace prefix
(as in ``gpu::GpuMat``) to avoid confusion. I'll do the latter.

.. code-block:: cpp

   #include <opencv2/gpu.hpp>        // GPU structures and methods

GPU stands for **g**\ raphics **p**\ rocessing **u**\ nit. It was originally built to render
graphical scenes. These scenes build on a lot of data. However, these data aren't all dependent on
one another in a sequential way, so it is possible to process them in parallel. Because of this, a
GPU contains many smaller processing units. These aren't state-of-the-art processors, and in a
one-on-one test against a CPU each of them will fall behind. However, the GPU's strength lies in its
numbers. In recent years there has been an increasing trend to harness this massively parallel
power of the GPU for tasks other than rendering graphical scenes. This gave birth to
general-purpose computation on graphics processing units (GPGPU).

The GPU has its own memory. When you read data from the hard drive with OpenCV into a *Mat* object,
the data ends up in your system's memory. The CPU works more or less directly on this (via its
cache); the GPU, however, cannot. It has to transfer the information it will use for calculations
from the system memory to its own. This is done via an upload process and takes time. In the end the
result has to be downloaded back to your system memory for your CPU to see and use it. Porting
small functions to the GPU is not recommended, as the upload/download time will be larger than the
time you gain through parallel execution.

*Mat* objects are stored only in the system memory (or the CPU cache). To get an OpenCV matrix
to the GPU you'll need to use its GPU counterpart :gpudatastructure:`GpuMat <gpu-gpumat>`. It works
similarly to the *Mat*, with a 2D-only limitation and no reference returning for its functions (you
cannot mix GPU references with CPU ones). To upload a *Mat* object to the GPU you need to call the
*upload* function after creating an instance of the class. To download you may use simple
assignment to a *Mat* object or use the *download* function.

.. code-block:: cpp

   Mat I1;          // Main memory item - read an image into it with imread, for example
   gpu::GpuMat gI1; // GPU matrix - for now empty
   gI1.upload(I1);  // Upload data from the system memory to the GPU memory

   I1 = gI1;        // Download, gI1.download(I1) will work too

Once you have your data up in the GPU memory you may call the GPU-enabled functions of OpenCV. Most
of the functions keep the same name as on the CPU, with the difference that they only accept
*GpuMat* inputs. You will find a full list of these in the documentation: `online here
<http://docs.opencv.org/modules/gpu/doc/gpu.html>`_ or in the OpenCV reference manual that comes
with the source code.

Another thing to keep in mind is that you cannot make efficient GPU algorithms for every channel
count. Generally, I found that the input images for the GPU functions need to have either one or
four channels, with char or float as the item type. There is no double support on the GPU, sorry.
Passing other types of objects to some functions will result in an exception being thrown and an
error message on the error output. In most places the documentation details the types accepted for
the inputs. If you have three-channel images as input you can do two things: either add a new
channel (and use char elements) or split up the image and call the function for each channel. The
first one isn't really recommended, as you waste memory.

For some functions, where the position of the elements (neighbor items) doesn't matter, the quick
solution is to just reshape the image into a single-channel one. This is the case for the PSNR
implementation, where for the *absdiff* method the values of the neighbors are not important.
However, for the *GaussianBlur* this isn't an option, so we need to use the split method for the
SSIM. With this knowledge you can already write viable GPU code (like my GPU version) and run it.
You'll be surprised to see that it might turn out slower than your CPU implementation.

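
Why the reshape trick is safe for an element-wise operation can be checked without a GPU. The
sketch below, in plain C++ with hypothetical function names, computes an absolute difference on an
interleaved three-channel buffer in two ways: once channel by channel, and once treating the buffer
as a flat single-channel array. Both functions assume the caller passes buffers of equal size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Absolute difference of two bytes.
static std::uint8_t absDiffByte(std::uint8_t a, std::uint8_t b)
{
    return static_cast<std::uint8_t>(a > b ? a - b : b - a);
}

// Treats the interleaved data as one long single-channel row (the "reshaped" view).
std::vector<std::uint8_t> absdiffFlat(const std::vector<std::uint8_t>& a,
                                      const std::vector<std::uint8_t>& b)
{
    std::vector<std::uint8_t> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = absDiffByte(a[i], b[i]);
    return out;
}

// Walks the same data one channel at a time, as a channel-aware routine would.
std::vector<std::uint8_t> absdiffPerChannel(const std::vector<std::uint8_t>& a,
                                            const std::vector<std::uint8_t>& b,
                                            std::size_t channels)
{
    std::vector<std::uint8_t> out(a.size());
    const std::size_t pixels = a.size() / channels;
    for (std::size_t c = 0; c < channels; ++c)
        for (std::size_t p = 0; p < pixels; ++p) {
            const std::size_t idx = p * channels + c;  // interleaved layout: B G R B G R ...
            out[idx] = absDiffByte(a[idx], b[idx]);
        }
    return out;
}
```

Both functions return the same buffer, because each output element depends only on the two input
elements at the same position. A blur reads neighboring pixels of the same channel, so flattening
the channels would mix values that do not belong together.
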
Optimization
============

The reason for this is that you're throwing away time on memory allocation and data transfer, and
on the GPU this price is damn high. Another possibility for optimization is to introduce
asynchronous OpenCV GPU calls with the help of the :gpudatastructure:`gpu::Stream <gpu-stream>`.

1. Memory allocation on the GPU is expensive. Therefore, allocate new memory as few times as
   possible. If you create a function that you intend to call multiple times, it is a good idea to
   allocate any local parameters for the function only once, during the first call. To do this you
   create a data structure containing all the local variables you will use. For instance, in the
   case of the PSNR these are:

   .. code-block:: cpp

      struct BufferPSNR                                     // Optimized GPU versions
      {   // Data allocations are very expensive on GPU. Use a buffer to solve: allocate once reuse later.
          gpu::GpuMat gI1, gI2, gs, t1,t2;

          gpu::GpuMat buf;
      };

   Then create an instance of this in the main program:

   .. code-block:: cpp

      BufferPSNR bufferPSNR;

   And finally pass this to the function each time you call it:

   .. code-block:: cpp

      double getPSNR_GPU_optimized(const Mat& I1, const Mat& I2, BufferPSNR& b)

   Now you access these local parameters as *b.gI1*, *b.buf* and so on. The GpuMat will only
   reallocate itself on a new call if the new matrix size is different from the previous one.

#. Avoid unnecessary data transfers. Even a small data transfer becomes significant once you go
   to the GPU. Therefore, if possible make all calculations in-place (in other words, do not create
   new memory objects, for the reasons explained in the previous point). For example, although
   arithmetic operations may be easier to express as one-line formulas, this will be slower. In
   the case of the SSIM, at one point I need to calculate:

   .. code-block:: cpp

      b.t1 = 2 * b.mu1_mu2 + C1;

   Although the call above will succeed, observe that there is a hidden data transfer present.
   Before it makes the addition it needs to store the result of the multiplication somewhere.
   Therefore, it will create a local matrix in the background, add the *C1* value to that and
   finally assign the result to *t1*. To avoid this we use the gpu functions instead of the
   arithmetic operators:

   .. code-block:: cpp

      gpu::multiply(b.mu1_mu2, 2, b.t1); //b.t1 = 2 * b.mu1_mu2 + C1;
      gpu::add(b.t1, C1, b.t1);

#. Use asynchronous calls (the :gpudatastructure:`gpu::Stream <gpu-stream>`). By default, whenever
   you call a gpu function it will wait for the call to finish and return with the result
   afterwards. However, it is possible to make asynchronous calls, meaning the function queues the
   operation for execution, makes the costly data allocations for the algorithm and returns right
   away. Now you can call another function if you wish to do so. For the MSSIM this is a small
   optimization point. In our default implementation we split up the image into channels and then
   call the gpu functions for each channel. A small degree of parallelization is possible with the
   stream. By using a stream we can make the data allocation and upload operations while the GPU is
   already executing a given method. For example, we need to upload two images. We queue these one
   after another and immediately call the function that processes them. The function will wait
   for the upload to finish; however, while that happens, the output buffer allocations are made for
   the function to be executed next.

   .. code-block:: cpp

      gpu::Stream stream;

      stream.enqueueConvert(b.gI1, b.t1, CV_32F);    // Upload

      gpu::split(b.t1, b.vI1, stream);              // Methods (pass the stream as final parameter).
      gpu::multiply(b.vI1[i], b.vI1[i], b.I1_2, stream);        // I1^2

Result and conclusion
=====================

On an Intel P8700 laptop CPU paired with a low end NVidia GT220M here are the performance numbers:

.. code-block:: cpp

   Time of PSNR CPU (averaged for 10 runs): 41.4122 milliseconds. With result of: 19.2506
   Time of PSNR GPU (averaged for 10 runs): 158.977 milliseconds. With result of: 19.2506
   Initial call GPU optimized:              31.3418 milliseconds. With result of: 19.2506
   Time of PSNR GPU OPTIMIZED ( / 10 runs): 24.8171 milliseconds. With result of: 19.2506

   Time of MSSIM CPU (averaged for 10 runs): 484.343 milliseconds. With result of B0.890964 G0.903845 R0.936934
   Time of MSSIM GPU (averaged for 10 runs): 745.105 milliseconds. With result of B0.89922 G0.909051 R0.968223
   Time of MSSIM GPU Initial Call            357.746 milliseconds. With result of B0.890964 G0.903845 R0.936934
   Time of MSSIM GPU OPTIMIZED ( / 10 runs): 203.091 milliseconds. With result of B0.890964 G0.903845 R0.936934

Compared to the CPU implementation, the optimized GPU version runs about 1.7 times faster for the
PSNR and about 2.4 times faster for the MSSIM. It may be just the improvement needed for your
application to work. You may observe a runtime instance of this on `YouTube here
<https://www.youtube.com/watch?v=3_ESXmFlnvY>`_.

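
If you want to compare your own measurements in the same way, the arithmetic is short. The helpers
below are a plain C++ sketch with hypothetical names; the only inputs are two timings in
milliseconds, such as the CPU and optimized GPU values printed above.

```cpp
// How many times faster the optimized version is than the baseline.
double speedup(double baselineMs, double optimizedMs)
{
    return baselineMs / optimizedMs;
}

// Share of the baseline run time that the optimized version saves, in percent.
double timeSavedPercent(double baselineMs, double optimizedMs)
{
    return 100.0 * (baselineMs - optimizedMs) / baselineMs;
}
```

Feeding in the timings from the table, ``speedup(41.4122, 24.8171)`` comes out at roughly 1.67 for
the PSNR and ``speedup(484.343, 203.091)`` at roughly 2.38 for the MSSIM.
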
.. raw:: html

   <div align="center">
   <iframe title="Similarity check (PSNR and SSIM) on the GPU" width="560" height="349" src="http://www.youtube.com/embed/3_ESXmFlnvY?rel=0&loop=1" frameborder="0" allowfullscreen align="middle"></iframe>
   </div>