intro fix in gpu module

Anatoly Baksheev 2011-02-02 15:50:01 +00:00
parent 8d36926271
commit 82441a4b56
3 changed files with 83559 additions and 83753 deletions


@@ -3,7 +3,7 @@
\cvclass{gpu::DevMem2D\_}\label{cppfunc.gpu.DevMem2D}
This is a simple lightweight class that encapsulate pitched memory on GPU. It is untented to pass to nvcc-compiled code, i.e. CUDA kernels. Its members can be called both from host and from device code.
This is a simple lightweight class that encapsulates pitched memory on the GPU. It is intended to be passed to nvcc-compiled code, i.e. CUDA kernels, so it is used internally by OpenCV and by users who write their own device code. Its members can be called from both host and device code.
\begin{lstlisting}
template <typename T> struct DevMem2D_
@@ -33,7 +33,7 @@ template <typename T> struct DevMem2D_
\cvclass{gpu::PtrStep\_}\label{cppfunc.gpu.PtrStep}
This is structure is similar toDevMem2D\_ but contains only pointer and row step. Width and height fields are excluded due to performance reasons.
This structure is similar to DevMem2D\_ but contains only a pointer and a row step. The width and height fields are excluded for performance reasons. The structure is intended for internal use or for users who write their own device code.
\begin{lstlisting}
template<typename T> struct PtrStep_
@@ -55,7 +55,7 @@ template<typename T> struct PtrStep_
\end{lstlisting}
\cvclass{gpu::PtrElemStrp\_}\
This is structure is similar to DevMem2D\_ but contains only pointer and row step in elements. Width and height fields are excluded due to performance reasons. This class is can only be constructed if sizeof(T) is a multiple of 256.
This structure is similar to DevMem2D\_ but contains only a pointer and a row step in elements. The width and height fields are excluded for performance reasons. This class can only be constructed if sizeof(T) is a multiple of 256. The structure is intended for internal use or for users who write their own device code.
\begin{lstlisting}
template<typename T> struct PtrElemStep_ : public PtrStep_<T>


@@ -2,52 +2,58 @@
\subsection{General information}
The OpenCV GPU module is a set of classes and functions to utilize GPU computational capabilities. It is implemented using NVidia CUDA Runtime API, so only that vendor GPUs are supported. It includes utility functions, low level vision primitives as well as high level algorithms. I.e. the module is being developed as power infrastructure for fast vision algorithms building on GPU with some high level state of the art functionality.
The OpenCV GPU module is a set of classes and functions to utilize GPU computational capabilities. It is implemented using the NVidia CUDA Runtime API, so only NVidia GPUs are supported. It includes utility functions, low-level vision primitives, and high-level algorithms. The utility functions and low-level primitives provide a powerful infrastructure for developing fast vision algorithms that take advantage of the GPU, whereas the high-level functionality includes some state-of-the-art algorithms (such as stereo correspondence and face and people detectors) that are ready to be used by application developers.
The GPU module is designed as host level API, i.e. if a user has precompiled OpenCV GPU binaries, it is not necessary for him to have Cuda Toolkit installed and have deal with code to execute on GPU. Additional advantage of this is that with the binaries users can use any compiler for any platform. But probably a device layer API will be introduced in future to provide more agility and performance in internal GPU module implementation and more functionality for users.
The GPU module is designed as a host-level API, i.e. if a user has pre-compiled OpenCV GPU binaries, it is not necessary to have the Cuda Toolkit installed or to write any extra code to make use of the GPU.
External dependencies of the module are only libraries included in Cuda Toolkit and NVidia Performance Primitives library (NPP). These libraries can be downloaded from NVidia site for all supported platforms. Only comparability with the latest Cuda Toolkit and NPP is provided for trunk OpenCV version and we switch to each new release very fast. So please keep it up to date. OpenCV GPU code can be compiled only on such platforms where Cuda Runtime Toolkit is supported by NVidia.
The GPU module depends on the Cuda Toolkit and the NVidia Performance Primitives library (NPP). Make sure you have the latest versions of both. The two libraries can be downloaded from the NVidia site for all supported platforms. To compile the OpenCV GPU module you will need a compiler compatible with the Cuda Runtime Toolkit.
OpenCV GPU module is designed to make its usage as easy as it possible. It can be used without any knowledge about Cuda. But for advanced programming and extremely optimization it is highly recommended to learn principles of programming and optimization for GPU. This is helpful because of understanding how much each operation costs, what it does, and how it is better to call. In this case GPU module became an effective instrument of development computer vision algorithms for GPU on prototyping stage and when hard optimization is in process.
The OpenCV GPU module is designed for ease of use and does not require any knowledge of Cuda. However, such knowledge will certainly be useful in non-trivial cases, or when you want to get the highest performance. It helps to understand the costs of various operations, what the GPU does, which data formats are preferred, and so on. The GPU module is an effective instrument for quick implementation of GPU-accelerated computer vision algorithms. However, if your algorithm involves many simple operations, then for the best possible performance you may still need to write your own kernels, to avoid extra write and read operations on the intermediate results.
The OpenCV can be compiled with enabled and disabled \texttt{WITH\_CUDA} flag in CMake. Building with the flag set will force compilation of device code from GPU module and requires dependences above installed. If OpenCV is compiled without the flag, GPU module will also be built, but all functions from it will throw \cvCppCross{Exception} with \texttt{CV\_GpuNotSupported} error code, except \cvCppCross{gpu::getCudaEnabledDeviceCount()}. The last function will return zero GPU count in this case. Building OpenCV without CUDA does not perform device code compilation, so it does not require Cuda Toolkit installed and supported by NVidia compiler. Also such behavior makes it possible to develop in future smart enough algorithms for OpenCV, that can decide itself whether it is reasonable to call GPU or do their work in CPU or use both. Thereby disabling \texttt{WITH\_CUDA} flag will force using only CPU. The mechanism can be used also by OpenCV users in their applications to enable or disable GPU support.
To enable CUDA support, configure OpenCV using CMake with \texttt{WITH\_CUDA=ON}. When the flag is set and CUDA is installed, the full-featured OpenCV GPU module is built. Otherwise, the module is still built, but at runtime all functions from the module throw \cvCppCross{Exception} with the \texttt{CV\_GpuNotSupported} error code, except for \cvCppCross{gpu::getCudaEnabledDeviceCount()}, which returns a zero GPU count in this case. Building OpenCV without CUDA support does not perform device code compilation, so it does not require the Cuda Toolkit to be installed. Therefore, using the \cvCppCross{gpu::getCudaEnabledDeviceCount()} function, it is possible to implement a high-level algorithm that detects GPU presence at runtime and chooses the appropriate implementation (CPU or GPU) accordingly.
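The runtime selection described above might be sketched as follows. This is only an illustration: \texttt{Backend}, \texttt{chooseBackend} and \texttt{backendName} are hypothetical names that are not part of OpenCV, and the device count is passed in as a plain integer so that the sketch compiles without the GPU module.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper types; they are not part of the OpenCV API.
enum class Backend { Cpu, Gpu };

// Choose an implementation from the number of CUDA-enabled devices.
// A count of zero covers both "no GPU present" and "OpenCV built
// without CUDA support", as described in the text above.
Backend chooseBackend(int cudaDeviceCount)
{
    assert(cudaDeviceCount >= 0);
    return cudaDeviceCount > 0 ? Backend::Gpu : Backend::Cpu;
}

// Human-readable name, e.g. for logging which path was taken.
std::string backendName(Backend b)
{
    return b == Backend::Gpu ? "GPU" : "CPU";
}
```

In a real application the argument would come from \cvCppCross{gpu::getCudaEnabledDeviceCount()}, and the returned value would decide which implementation of the algorithm to call.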
\subsection{Compilation for different NVidia platforms.}
NVidia compiler allows generating binary output (cubin and fatbin) and intermediate code (PTX). Binary code is a code to directly run on GPU, binary code compatibility of GPU is not guaranteed across different generations. PTX generation is a building to virtual platform. A virtual GPU is defined entirely by the set of capabilities, or features, so compilation to it is just a claim what GPU features are used and what features are restricted to use (example some unsupported instructions can be emulated).
The NVidia compiler can generate binary code (cubin and fatbin) and intermediate code (PTX). Binary code often implies a specific GPU architecture and generation, so compatibility with other GPUs is not guaranteed. PTX targets a virtual platform, which is defined entirely by its set of capabilities, or features. Depending on the virtual platform chosen, some instructions will be emulated or disabled, even if the real hardware supports all the features.
On first GPU call run PTX code is passed to Just In Time (JIT) compilation for concrete GPU platform on which it is run. There is a rule that PTX code can be compiled for all newer platforms (because they will support current feature set) but not for older (because current PTX may contain features not supported by older).
On the first call, the PTX code is compiled to binary code for the particular GPU using the JIT compiler. When the target GPU has a lower "compute capability" (CC) than the PTX code, JIT compilation fails.
By default the following images are linked to GPU module library:
By default, the OpenCV GPU module includes:
\begin{itemize}
\item Binaries for compute capabilities 1.3 and 2.0 (controlled by \texttt{CUDA\_ARCH\_BIN} in CMake)
\item PTX code for compute capabilities 1.1 and 1.3 (controlled by \texttt{CUDA\_ARCH\_PTX} in CMake)
\end{itemize}
That means for devices with CC 1.3 and 2.0 binary images are ready to run. For all newer platforms the PTX code for 1.3 is JITed to a binary image. For devices with 1.1 and 1.2 the PTX for 1.1 is JITed. For devices with CC 1.0 no code present and execution will fails with \cvCppCross{Exception} somewhere. For platforms where JIT compilation is performed first run will be slow.
That means that for devices with CC 1.3 and 2.0 binary images are ready to run. For all newer platforms the PTX code for 1.3 is JIT'ed to a binary image. For devices with CC 1.1 and 1.2 the PTX for 1.1 is JIT'ed. For devices with CC 1.0 no code is available and the functions will throw \cvCppCross{Exception}. On platforms where JIT compilation is performed, the first run will be slow.
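The selection rule in the paragraph above can be illustrated with a small stand-alone sketch. It is a simplified model of the rule as stated here, not the actual CUDA runtime logic: \texttt{BuildConfig}, \texttt{hasBinary} and \texttt{bestPtx} are hypothetical names, and a compute capability is encoded as major * 10 + minor (13 for CC 1.3).

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of a build; the fields mirror the
// CUDA_ARCH_BIN and CUDA_ARCH_PTX CMake settings mentioned in the text.
struct BuildConfig {
    std::vector<int> bin;  // compute capabilities with ready-to-run binaries
    std::vector<int> ptx;  // compute capabilities with PTX code
};

// True if a binary image was built for exactly this compute capability.
bool hasBinary(const BuildConfig& cfg, int deviceCC)
{
    for (int cc : cfg.bin)
        if (cc == deviceCC) return true;
    return false;
}

// Highest PTX version that can be JIT-compiled for the device, i.e. the
// largest PTX compute capability not exceeding deviceCC; -1 if none exists.
int bestPtx(const BuildConfig& cfg, int deviceCC)
{
    assert(deviceCC >= 10);
    int best = -1;
    for (int cc : cfg.ptx)
        if (cc <= deviceCC && cc > best) best = cc;
    return best;
}
```

With the default configuration (binaries for 13 and 20, PTX for 11 and 13), a CC 1.2 device has no binary and falls back to the 1.1 PTX, while a CC 1.0 device has neither a binary nor usable PTX.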
Devices with compute capability 1.0 are supported by most of GPU functionality now (just compile the library corresponding settings). There are only a couple things that can not run on it. They are guarded with asserts. But the in future the number will raise, because of CC 1.0 support requires writing special implementation for it. So, It is decided not to spend time for old platform support.
If you happen to have a GPU with CC 1.0, the GPU module can still be compiled for it and most of the functions will run just fine on such a card. Simply add "1.0" to the list of binaries, for example, \texttt{CUDA\_ARCH\_BIN="1.0 1.3 2.0"}. The functions that cannot run on CC 1.0 GPUs will throw an exception.
Because of OpenCV can be compiled not for all architectures, there can be binary incompatibility between GPU and code linked to OpenCV. In this case unclear error is returned in arbitrary place. But there is a way to check if the module was build to be able to run on the given device using \cvCppCross{gpu::DeviceInfo::isCompatible} function.
You can always determine at runtime whether the binaries (or PTX code) built into the OpenCV GPU module are compatible with your GPU. The function \cvCppCross{gpu::DeviceInfo::isCompatible} returns the compatibility status (true/false).
\subsection{Threading and multi-threading.}
Because GPU module is written using Cuda Runtime API, it derives from the API all practices and rules to work with threads. So on first the API call a Cuda context is created implicitly, attached and made current for the calling thread. All farther operations, such as memory allocation, GPU kernels loads and compilation, will be associated with the context and the thread. Because another thread is not attached to the context, memory allocations done in first thread are not valid for it. For second thread another context will be created on first Cuda call. So by default different threads do not share resources.
The OpenCV GPU module follows the Cuda Runtime API conventions for multi-threaded programming. That is, on the first API call a Cuda context is created implicitly, attached to the current CPU thread and then used as the thread's "current" context. All further operations, such as memory allocation and GPU code compilation, are associated with the context and the thread. Because no other thread is attached to the context, memory (and other resources) allocated in the first thread cannot be accessed by another thread. Instead, Cuda creates a separate context for each other thread. In short, by default different threads do not share resources.
But such limitation can be removed via using Cuda Driver API. (\textbf{Warning!} Interoperability between Cuda Driver and Runtime APIs is supported only in Cuda Toolkit 3.1 and latter). The Driver API allows retrieving context reference and attaching it to another thread. In this case if the context was created with shared access policy both threads can use the same resources. Shared access policy is default for implicit context creating now.
This limitation can be removed by using the Cuda Driver API (version 3.1 or later). You can retrieve the context reference for one thread, attach it to another thread and make it "current" for that thread. The threads can then share memory and other resources. It is also possible to create a context explicitly before calling any GPU code and attach it to all the threads that should share the resources.
Also there is possible in Cuda Driver API to create context explicitly before first Cuda runtime call, and make it current for all necessary threads. Cuda Runtime API (and OpenCV functions respectively) will pick up it.
May be in future the tricks above will be wrapped by OpenCV GPU utility functions (it is also necessary for Multi-GPU modes).
It is also possible to create a context explicitly using the Cuda Driver API, then attach it and make it "current" for all necessary threads. The Cuda Runtime API (and, consequently, the OpenCV functions) will pick it up.
\subsection{Multi-GPU}
At the current stage all OpenCV GPU algorithms are single GPU algorithms. So to utilize multiple GPUs users have to manually parallelize work between GPUs. Multi-GPU practices is also derived from Cuda APIs, so for detailed information please read Cuda documentation. Here is two ways to use several GPUs:
In the current version each of the OpenCV GPU algorithms can use only a single GPU. So, to utilize multiple GPUs, the user has to distribute the work between the GPUs manually. There are two ways of utilizing multiple GPUs:
\begin{itemize}
\item In case of using only synchronous functions, several threads for each GPU are created and for each thread CUDA context is initialized (explicitly by Driver API or by calling \newline \cvCppCross{gpu::setDevice()}, cudaSetDevice) that is associated with the corresponding GPU (CUDA context is always associated only with one GPU). Now each thread can workload its own GPU.
\item In case of asynchronous functions, it is possible to create several Cuda contexts associated with different GPUs but attached to one thread. This can be done only by Driver API. Next switch between devices is done by making corresponding context current for the thread. With non-blocking GPU calls managing algorithm is clear.
\item If you only use synchronous functions, first create several CPU threads (one per GPU) and, from within each thread, create a CUDA context for the corresponding GPU using \cvCppCross{gpu::setDevice()} or the Driver API. Each of the threads will then use the associated GPU.
\item In the case of asynchronous functions, it is possible to create several Cuda contexts associated with different GPUs but attached to one CPU thread. This can be done only with the Driver API. Within the thread you can switch from one GPU to another by making the corresponding context "current". With non-blocking GPU calls, the managing algorithm is straightforward.
\end{itemize}
While developing algorithms for multiple GPUs a data passing overhead have to be taken into consideration. For primitive functions and for small images it can be significant and this stops the idea to use several GPU. But for some high level algorithms Multi-GPU acceleration is suitable. For example, we have done parallelization of Stereo Block Matching by divide the stereo pair into two parts horizontally with overlapping, process each part on separate Fermi GPU, next download and merge resulting disparity. Performance for two GPU is about 180\%. As conclusion, may be in future Cuda context managing functions will be wrapped in GPU module and some multi-GPU high level algorithms be implemented. But now user has to do this manually.
When developing algorithms for multiple GPUs, the data transfer overhead has to be taken into consideration. For primitive functions and for small images it can be significant and can eliminate all the advantages of having multiple GPUs. For high-level algorithms, however, multi-GPU acceleration may be suitable. For example, the Stereo Block Matching algorithm has been successfully parallelized using the following scheme:
\begin{itemize}
\item
Each image of the stereo pair is split into two horizontal overlapping stripes.
\item
Each pair of stripes (from the left and the right images) is processed on a separate Fermi GPU.
\item
The results are merged into a single disparity map.
\end{itemize}
With this scheme, two GPUs deliver about 180\% of the performance of a single Fermi GPU. The source code of the example is available at
\url{https://code.ros.org/svn/opencv/trunk/opencv/examples/gpu/}
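The first step of the scheme above, splitting an image into overlapping horizontal stripes, might look like the following sketch. It is not taken from the referenced example: \texttt{Stripe} and \texttt{splitRows} are hypothetical names, and the overlap size is an assumed parameter that a real stereo implementation would derive from the matching window.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Half-open row range [rowBegin, rowEnd) of one horizontal stripe.
struct Stripe {
    int rowBegin;
    int rowEnd;
};

// Split `rows` image rows into `parts` stripes of roughly equal height and
// extend each stripe by `overlap` rows on both sides, clamped to the image.
std::vector<Stripe> splitRows(int rows, int parts, int overlap)
{
    assert(rows > 0 && parts > 0 && overlap >= 0);
    std::vector<Stripe> stripes;
    for (int i = 0; i < parts; ++i) {
        int begin = rows * i / parts;        // nominal stripe start
        int end = rows * (i + 1) / parts;    // nominal stripe end
        begin = std::max(0, begin - overlap);
        end = std::min(rows, end + overlap);
        stripes.push_back(Stripe{begin, end});
    }
    return stripes;
}
```

For a 480-row image split in two with a 16-row overlap, this gives the row ranges [0, 256) and [224, 480). Each range would be processed on its own GPU, and the overlapping rows would be dropped when the partial results are merged.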

File diff suppressed because it is too large