.. _gpuBasicsSimilarity:

Similarity check (PSNR and SSIM) on the GPU
*******************************************

Goal
====

In the :ref:`videoInputPSNRMSSIM` tutorial I presented the PSNR and SSIM methods for checking the
similarity between two images. As you could see there, computing them takes quite some time,
especially in the case of the SSIM. However, if the performance of the OpenCV implementation for
the CPU does not satisfy you and you happen to have an NVidia CUDA GPU device in your system, all
is not lost. You may try to port or write your algorithm for the video card.

This tutorial will give you a good grasp of how to approach coding with the GPU module of OpenCV.
As a prerequisite you should already know how to handle the core, highgui and imgproc modules. Our
goals are:

.. container:: enumeratevisibleitemswithsquare

   + What's different compared to the CPU?
   + Create the GPU code for the PSNR and SSIM
   + Optimize the code for maximal performance

The source code
===============

You may also find the source code and the video files in the
:file:`samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity` folder of the
OpenCV source library or :download:`download it from here
<../../../../samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp>`. The
full source code is quite long (due to the command line argument handling and the performance
measurement). Therefore, to avoid cluttering up these sections, you will find here only the
functions themselves.

The PSNR returns a float number that, if the two inputs are similar, lies between 30 and 50 (higher
is better).

.. literalinclude:: ../../../../samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp
   :language: cpp
   :linenos:
   :tab-width: 4
   :lines: 165-210, 18-23, 210-235

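For reference, the quantity these functions compute can be sketched without OpenCV. The following
is only a minimal illustration, assuming 8-bit data with a peak value of 255 and the usual
definition of the PSNR as ten times the base-10 logarithm of the squared peak value over the mean
squared error. The name ``psnrOfBuffers`` is hypothetical and not part of the sample, and returning
0 for identical inputs is a choice of this sketch to avoid a division by zero.

.. code-block:: cpp

   #include <cmath>
   #include <cstddef>
   #include <cstdint>
   #include <vector>

   // Hypothetical helper (not part of the OpenCV sample): PSNR of two 8-bit buffers.
   double psnrOfBuffers(const std::vector<std::uint8_t>& a,
                        const std::vector<std::uint8_t>& b)
   {
       if (a.size() != b.size() || a.empty())
           return -1.0;                 // invalid input marker (an assumption of this sketch)

       double sse = 0.0;                // sum of squared errors over all elements
       for (std::size_t i = 0; i < a.size(); ++i)
       {
           const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
           sse += d * d;
       }

       if (sse <= 1e-10)                // identical inputs: avoid dividing by zero
           return 0.0;

       const double mse = sse / static_cast<double>(a.size());
       return 10.0 * std::log10((255.0 * 255.0) / mse);
   }

The loop is an element-wise difference followed by a reduction, which is the kind of work that
spreads well over the many processing units of a GPU.
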
The SSIM returns the MSSIM of the images. This too is a float number between zero and one (higher
is better); however, we have one for each channel. Therefore, we return a *Scalar* OpenCV data
structure:

.. literalinclude:: ../../../../samples/cpp/tutorial_code/gpu/gpu-basics-similarity/gpu-basics-similarity.cpp
   :language: cpp
   :linenos:
   :tab-width: 4
   :lines: 235-355, 26-42, 357-

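As a reminder of what is being computed, the commonly used definition of the structural similarity
of two image windows :math:`x` and :math:`y` is given below. This is the standard textbook form and
not a transcription of the sample code. The symbols :math:`\mu` (local mean), :math:`\sigma^2`
(local variance), :math:`\sigma_{xy}` (local covariance) and the small stabilizing constants
:math:`C_1` and :math:`C_2` follow the usual notation.

.. math::

   \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}
                              {(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}

The MSSIM is the mean of this value over all window positions, computed separately for each
channel.
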
How to do it? - The GPU
=======================

As you can see, we have three types of functions for each operation: one for the CPU and two for
the GPU. The reason I made two for the GPU is to illustrate that simply porting your CPU code to
the GPU will often make it slower. If you want some performance gain you will need to remember a
few rules, which I am going to detail later on.

The GPU module was designed to resemble its CPU counterpart as much as possible. This makes porting
easy. The first thing you need to do before writing any code is to link the GPU module to your
project and include the header file for the module. All the functions and data structures of the
GPU are in a *gpu* sub namespace of the *cv* namespace. You may add this to the default one via the
*using namespace* directive, or mark it everywhere explicitly via the gpu:: prefix to avoid
confusion. I'll do the latter.

.. code-block:: cpp

   #include <opencv2/gpu.hpp>        // GPU structures and methods

GPU stands for **g**\ raphics **p**\ rocessing **u**\ nit. It was originally built to render
graphical scenes. These scenes build on a lot of data. However, these data do not all depend on one
another in a sequential way, so they can be processed in parallel. Because of this a GPU contains
many smaller processing units. These are not state-of-the-art processors, and in a one-on-one test
against a CPU each of them falls behind. However, the strength of the GPU lies in its numbers. In
recent years there has been an increasing trend to harness this massively parallel power of the GPU
for tasks other than rendering graphical scenes. This gave birth to general-purpose computation on
graphics processing units (GPGPU).

The GPU has its own memory. When you read data from the hard drive with OpenCV into a *Mat* object,
that data lives in your system's memory. The CPU works more or less directly on this (via its
cache); the GPU cannot. It has to transfer the information it will use for calculations from the
system memory to its own. This is done via an upload process and takes time. In the end the result
has to be downloaded back to your system memory for your CPU to see it and use it. Porting small
functions to the GPU is not recommended, as the upload/download time will be larger than the time
you gain by parallel execution.

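Put as a rule of thumb, moving a function to the GPU pays off only when the whole round trip is
shorter than doing the work on the CPU. The symbols below are introduced here only for illustration
and denote the measured times of each step:

.. math::

   t_{\mathrm{upload}} + t_{\mathrm{GPU}} + t_{\mathrm{download}} < t_{\mathrm{CPU}}

For a small function :math:`t_{\mathrm{CPU}}` is already tiny, so the two transfer terms alone can
exceed it, no matter how fast the GPU computation is.
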
Mat objects are stored only in the system memory (or the CPU cache). To get an OpenCV matrix to the
GPU you need to use its GPU counterpart :gpudatastructure:`GpuMat <gpu-gpumat>`. It works similarly
to the Mat, with a 2D only limitation and no reference returning for its functions (you cannot mix
GPU references with CPU ones). To upload a Mat object to the GPU, call the upload function after
creating an instance of the class. To download, you may use simple assignment to a Mat object or
use the download function.

.. code-block:: cpp

   Mat I1;          // Main memory item - read an image into it with imread, for example
   gpu::GpuMat gI1; // GPU matrix - for now empty
   gI1.upload(I1);  // Upload data from the system memory to the GPU memory

   I1 = gI1;        // Download, gI1.download(I1) will work too

Once you have your data up in the GPU memory you may call the GPU enabled functions of OpenCV. Most
of the functions keep the same name as on the CPU, with the difference that they only accept
*GpuMat* inputs. You will find a full list of these in the documentation: `online here
<http://docs.opencv.org/modules/gpu/doc/gpu.html>`_ or in the OpenCV reference manual that comes
with the source code.

Another thing to keep in mind is that you cannot make efficient GPU algorithms for every channel
count. Generally, I found that the input images for the GPU need to be either one or four channel
ones, with char or float as the element type. No double support on the GPU, sorry. Passing other
types of objects to some functions will result in an exception being thrown and an error message on
the error output. The documentation details in most places the types accepted for the inputs. If
you have three channel images as input you can do two things: either add a new channel (and use
char elements) or split up the image and call the function for each channel. The first one isn't
really recommended as you waste memory.

For some functions, where the position of the elements (neighbor items) doesn't matter, the quick
solution is to just reshape the image into a single channel one. This is the case for the PSNR
implementation, where for the *absdiff* method the value of the neighbors is not important.
However, for the *GaussianBlur* this isn't an option, so for the SSIM we need to use the split
method. With this knowledge you can already write working GPU code (like my plain GPU version) and
run it. You'll be surprised to see that it might turn out slower than your CPU implementation.

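Why reshaping is harmless for element-wise operations can be checked on plain buffers. The sketch
below does not use OpenCV. It assumes interleaved channel storage and two buffers of equal length,
and both function names are hypothetical. It computes the sum of squared differences once over the
flat buffer and once pixel by pixel and channel by channel, and the two results are the same
because no element ever looks at its neighbors.

.. code-block:: cpp

   #include <cstddef>
   #include <cstdint>
   #include <cstdlib>
   #include <vector>

   // Sum of squared differences over a flat buffer; knows nothing about channels.
   double sumSquaredDiffFlat(const std::vector<std::uint8_t>& a,
                             const std::vector<std::uint8_t>& b)
   {
       double sse = 0.0;
       for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
       {
           const int d = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
           sse += static_cast<double>(d) * d;
       }
       return sse;
   }

   // Same quantity, walking the interleaved buffer pixel by pixel and channel by channel.
   // Assumes a.size() == b.size() and that the size is a multiple of `channels`.
   double sumSquaredDiffPerChannel(const std::vector<std::uint8_t>& a,
                                   const std::vector<std::uint8_t>& b,
                                   std::size_t channels)
   {
       double sse = 0.0;
       const std::size_t pixels = a.size() / channels;
       for (std::size_t p = 0; p < pixels; ++p)
           for (std::size_t c = 0; c < channels; ++c)
           {
               const std::size_t i = p * channels + c;
               const int d = std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
               sse += static_cast<double>(d) * d;
           }
       return sse;
   }

A blur, in contrast, combines each element with its neighbors in the same channel, so flattening
the channels would mix unrelated values.
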
Optimization
============

The reason for this is that you are ignoring the price of memory allocation and data transfer, and
on the GPU this price is very high. Another possibility for optimization is to introduce
asynchronous OpenCV GPU calls with the help of the
:gpudatastructure:`gpu::Stream <gpu-stream>`.

1. Memory allocation on the GPU is costly. Therefore, allocate new memory as few times as
   possible. If you create a function that you intend to call multiple times, it is a good idea to
   allocate any local variables of the function only once, during the first call. To do this,
   create a data structure containing all the local variables you will use. For instance, in the
   case of the PSNR these are:

   .. code-block:: cpp

      struct BufferPSNR                                     // Optimized GPU versions
      {   // Data allocations are very expensive on GPU. Use a buffer to solve: allocate once reuse later.
          gpu::GpuMat gI1, gI2, gs, t1, t2;

          gpu::GpuMat buf;
      };

   Then create an instance of this in the main program:

   .. code-block:: cpp

      BufferPSNR bufferPSNR;

   And finally pass this to the function each time you call it:

   .. code-block:: cpp

      double getPSNR_GPU_optimized(const Mat& I1, const Mat& I2, BufferPSNR& b)

   Now you access these local variables as *b.gI1*, *b.buf* and so on. The GpuMat will only
   reallocate itself on a new call if the new matrix size is different from the previous one.

#. Avoid unnecessary data transfers. Any small data transfer becomes significant once you go to
   the GPU. Therefore, if possible, make all calculations in-place (in other words, do not create
   new memory objects, for the reasons explained in the previous point). For example, although
   arithmetic operations may be easier to express as one-line formulas, these will be slower. In
   the case of the SSIM, at one point I need to calculate:

   .. code-block:: cpp

      b.t1 = 2 * b.mu1_mu2 + C1;

   Although the call above will succeed, observe that there is a hidden data transfer present.
   Before it makes the addition it needs to store the result of the multiplication somewhere.
   Therefore, it will create a local matrix in the background, add the *C1* value to that and
   finally assign the result to *t1*. To avoid this we use the gpu functions instead of the
   arithmetic operators:

   .. code-block:: cpp

      gpu::multiply(b.mu1_mu2, 2, b.t1); //b.t1 = 2 * b.mu1_mu2 + C1;
      gpu::add(b.t1, C1, b.t1);

#. Use asynchronous calls (the :gpudatastructure:`gpu::Stream <gpu-stream>`). By default, whenever
   you call a gpu function it waits for the call to finish and returns with the result afterwards.
   However, it is possible to make asynchronous calls, meaning that the function queues the
   operation for execution, makes the costly data allocations for the algorithm and returns right
   away. Now you can call another function if you wish to do so. For the MSSIM this is a small
   optimization point. In our default implementation we split up the image into channels and then
   call the gpu functions for each channel. A small degree of parallelization is possible with the
   stream. By using a stream we can make the data allocations and upload operations while the GPU
   is already executing a given method. For example, we need to upload two images. We queue these
   one after another and already call the function that processes them. The function will wait for
   the upload to finish; however, while that happens, the output buffer allocations are made for
   the function to be executed next.

   .. code-block:: cpp

      gpu::Stream stream;

      stream.enqueueConvert(b.gI1, b.t1, CV_32F);    // Upload

      gpu::split(b.t1, b.vI1, stream);              // Methods (pass the stream as final parameter).
      gpu::multiply(b.vI1[i], b.vI1[i], b.I1_2, stream);        // I1^2

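The first two rules can be tried out on the CPU with plain containers. The sketch below is only an
analogy and does not use the GPU module. ``ScratchBuffer`` and ``scaleAndShift`` are hypothetical
names, ``std::vector`` stands in for the GpuMat, and the ``allocations`` counter exists only to
make the reuse visible. The buffer is created once by the caller and grows on the first call only,
and the result is produced by a multiply step followed by an in-place add, without a temporary.

.. code-block:: cpp

   #include <cstddef>
   #include <vector>

   // Hypothetical stand-in for the tutorial's buffer structs: scratch storage that
   // outlives a single call, so repeated calls reuse the same allocation.
   struct ScratchBuffer
   {
       std::vector<float> t1;
       std::size_t allocations = 0;    // counts how often t1 had to grow

       void ensureSize(std::size_t n)
       {
           if (t1.size() != n)         // only touch the storage when the size changes
           {
               if (n > t1.capacity())
                   ++allocations;
               t1.resize(n);
           }
       }
   };

   // Computes t1 = 2 * src + c1 in two steps that write straight into the buffer,
   // mirroring the multiply-then-add sequence from the list above.
   void scaleAndShift(const std::vector<float>& src, float c1, ScratchBuffer& b)
   {
       b.ensureSize(src.size());
       for (std::size_t i = 0; i < src.size(); ++i)
           b.t1[i] = 2.0f * src[i];    // step 1: multiply into the buffer
       for (std::size_t i = 0; i < src.size(); ++i)
           b.t1[i] += c1;              // step 2: add in place
   }

Calling ``scaleAndShift`` ten times with inputs of the same size leaves ``allocations`` at one.
This is the behavior the optimized GPU functions rely on when they receive their buffer struct by
reference.
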
Result and conclusion
=====================

On an Intel P8700 laptop CPU paired with a low end NVidia GT220M, here are the performance numbers:

.. code-block:: text

   Time of PSNR CPU (averaged for 10 runs): 41.4122 milliseconds. With result of: 19.2506
   Time of PSNR GPU (averaged for 10 runs): 158.977 milliseconds. With result of: 19.2506
   Initial call GPU optimized:              31.3418 milliseconds. With result of: 19.2506
   Time of PSNR GPU OPTIMIZED ( / 10 runs): 24.8171 milliseconds. With result of: 19.2506

   Time of MSSIM CPU (averaged for 10 runs): 484.343 milliseconds. With result of B0.890964 G0.903845 R0.936934
   Time of MSSIM GPU (averaged for 10 runs): 745.105 milliseconds. With result of B0.89922 G0.909051 R0.968223
   Time of MSSIM GPU Initial Call            357.746 milliseconds. With result of B0.890964 G0.903845 R0.936934
   Time of MSSIM GPU OPTIMIZED ( / 10 runs): 203.091 milliseconds. With result of B0.890964 G0.903845 R0.936934

In both cases the optimized GPU version clearly outperforms the CPU implementation: it runs roughly
1.7 times faster for the PSNR and 2.4 times faster for the MSSIM, while the plain GPU port is
slower than the CPU in both cases. It may be just the improvement needed for your application to
work. You may observe a runtime instance of this on `YouTube here
<https://www.youtube.com/watch?v=3_ESXmFlnvY>`_.

.. raw:: html

  <div align="center">
  <iframe title="Similarity check (PSNR and SSIM) on the GPU" width="560" height="349" src="http://www.youtube.com/embed/3_ESXmFlnvY?rel=0&loop=1" frameborder="0" allowfullscreen align="middle"></iframe>
  </div>