Doxygen documentation: cuda

Maksim Shabunin
2014-11-20 16:42:06 +03:00
parent 472c210687
commit ceb6e8bd94
80 changed files with 2917 additions and 398 deletions


@@ -0,0 +1,85 @@
CUDA Module Introduction {#cuda_intro}
========================
General Information
-------------------
The OpenCV CUDA module is a set of classes and functions to utilize CUDA computational capabilities.
It is implemented using NVIDIA\* CUDA\* Runtime API and supports only NVIDIA GPUs. The OpenCV CUDA
module includes utility functions, low-level vision primitives, and high-level algorithms. The
utility functions and low-level primitives provide a powerful infrastructure for developing fast
vision algorithms taking advantage of CUDA whereas the high-level functionality includes some
state-of-the-art algorithms (such as stereo correspondence, face and people detectors, and others)
ready to be used by the application developers.
The CUDA module is designed as a host-level API. This means that if you have pre-compiled OpenCV
CUDA binaries, you are not required to have the CUDA Toolkit installed or write any extra code to
make use of CUDA.
The OpenCV CUDA module is designed for ease of use and does not require any knowledge of CUDA.
However, such knowledge is certainly useful for handling non-trivial cases or achieving the highest
performance. It is helpful to understand the cost of various operations, what the GPU does, what the
preferred data formats are, and so on. The CUDA module is an effective instrument for quick
implementation of CUDA-accelerated computer vision algorithms. However, if your algorithm involves
many simple operations, then, for the best possible performance, you may still need to write your
own kernels to avoid extra write and read operations on the intermediate results.
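
The cost of intermediate results can be illustrated with a plain C++ analogy. This is only a sketch on the CPU, not CUDA code and not part of the module: the function names are hypothetical, and the scale-then-offset operation is chosen purely for illustration. The two-pass version writes and re-reads a temporary buffer, which is what chaining two simple primitives does; the fused version is what a hand-written kernel would do.

```cpp
#include <cstddef>
#include <vector>

// Two separate passes: the first pass writes an intermediate buffer
// that the second pass has to read back (hypothetical helper).
inline std::vector<float> scaleThenOffsetTwoPass(const std::vector<float>& src,
                                                 float scale, float offset)
{
    std::vector<float> tmp(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        tmp[i] = src[i] * scale;          // extra write
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = tmp[i] + offset;         // extra read
    return dst;
}

// One fused pass: each element is read once and written once
// (hypothetical helper).
inline std::vector<float> scaleThenOffsetFused(const std::vector<float>& src,
                                               float scale, float offset)
{
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i)
        dst[i] = src[i] * scale + offset;
    return dst;
}
```

On a GPU the same reasoning applies to global memory traffic: every separate primitive call stores its result to device memory, and the next call loads it again.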
To enable CUDA support, configure OpenCV using CMake with WITH\_CUDA=ON. When the flag is set and
CUDA is installed, the full-featured OpenCV CUDA module is built. Otherwise, the module is still
built, but at runtime all functions from the module throw an Exception with the CV\_GpuNotSupported
error code, except for cuda::getCudaEnabledDeviceCount(). The latter function returns a GPU count of
zero in this case. Building OpenCV without CUDA support does not perform device code compilation, so
it does not require the CUDA Toolkit to be installed. Therefore, using the
cuda::getCudaEnabledDeviceCount() function, you can implement a high-level algorithm that detects
GPU presence at runtime and chooses an appropriate implementation (CPU or GPU) accordingly.
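
A minimal sketch of such a runtime dispatch is shown below. The `Backend` enum and both helper functions are hypothetical names introduced for illustration; the only OpenCV call assumed is cuda::getCudaEnabledDeviceCount(), whose result is passed in as a plain integer so that the decision logic does not depend on how OpenCV was built. Any non-positive count is treated as "no usable GPU".

```cpp
#include <string>

// Hypothetical tag for the implementation to run; not an OpenCV type.
enum class Backend { Cpu, Gpu };

// deviceCount is expected to come from cv::cuda::getCudaEnabledDeviceCount(),
// which returns 0 when OpenCV was built without CUDA support.
inline Backend chooseBackend(int deviceCount)
{
    return deviceCount > 0 ? Backend::Gpu : Backend::Cpu;
}

// Human-readable name, for example for logging the chosen path.
inline std::string backendName(Backend b)
{
    return b == Backend::Gpu ? "GPU" : "CPU";
}
```

In an application, the call would look like `chooseBackend(cv::cuda::getCudaEnabledDeviceCount())`, after which the code branches to the CPU or the CUDA implementation of the algorithm.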
Compilation for Different NVIDIA\* Platforms
--------------------------------------------
The NVIDIA\* compiler can generate binary code (cubin and fatbin) and intermediate code (PTX).
Binary code often implies a specific GPU architecture and generation, so compatibility with
other GPUs is not guaranteed. PTX targets a virtual platform that is defined entirely by its
set of capabilities or features. Depending on the selected virtual platform, some
instructions are emulated or disabled, even if the real hardware supports all the features.
At the first call, the PTX code is compiled to binary code for the particular GPU using a JIT
compiler. When the target GPU has a compute capability (CC) lower than the PTX code, JIT fails. By
default, the OpenCV CUDA module includes:
- Binaries for compute capabilities 1.3 and 2.0 (controlled by CUDA\_ARCH\_BIN in CMake)
- PTX code for compute capabilities 1.1 and 1.3 (controlled by CUDA\_ARCH\_PTX in CMake)
This means that for devices with CC 1.3 and 2.0 binary images are ready to run. For all newer
platforms, the PTX code for 1.3 is JIT'ed to a binary image. For devices with CC 1.1 and 1.2, the
PTX for 1.1 is JIT'ed. For devices with CC 1.0, no code is available and the functions throw
an Exception. On platforms where JIT compilation is performed, the first run is slow.
On a GPU with CC 1.0, you can still compile the CUDA module and most of the functions will run
flawlessly. To achieve this, add "1.0" to the list of binaries, for example,
CUDA\_ARCH\_BIN="1.0 1.3 2.0". The functions that cannot be run on CC 1.0 GPUs throw an exception.
You can always determine at runtime whether the OpenCV GPU-built binaries (or PTX code) are
compatible with your GPU. The function cuda::DeviceInfo::isCompatible returns the compatibility
status (true/false).
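
The selection rules described in this section can be modelled with a short sketch. It is a simplified illustration, not the logic used by cuda::DeviceInfo::isCompatible: all names are hypothetical, compute capabilities are encoded as major\*10 + minor (1.3 becomes 13), and a binary is assumed to be usable only on an exact CC match.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical result categories for this sketch.
enum class CodePath { Binary, JitFromPtx, Unavailable };

struct Selection
{
    CodePath path;
    int version; // CC of the code that is used, 0 if none is available
};

// deviceCc, bin and ptx use the major*10 + minor encoding (assumption).
inline Selection selectCode(int deviceCc, const std::vector<int>& bin,
                            const std::vector<int>& ptx)
{
    // A ready-to-run binary exists for exactly this compute capability.
    if (std::find(bin.begin(), bin.end(), deviceCc) != bin.end())
        return {CodePath::Binary, deviceCc};

    // Otherwise JIT the newest PTX that does not exceed the device CC.
    int best = 0;
    for (int v : ptx)
        if (v <= deviceCc && v > best)
            best = v;
    if (best > 0)
        return {CodePath::JitFromPtx, best};

    // PTX newer than the device cannot be JIT-compiled for it.
    return {CodePath::Unavailable, 0};
}
```

With the default lists from above, `bin = {13, 20}` and `ptx = {11, 13}`, a CC 3.0 device falls back to JIT from the 1.3 PTX, a CC 1.2 device uses the 1.1 PTX, and a CC 1.0 device gets no code.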
Utilizing Multiple GPUs
-----------------------
In the current version, each of the OpenCV CUDA algorithms can use only a single GPU. So, to utilize
multiple GPUs, you have to manually distribute the work between GPUs. Switching the active device can
be done using the cuda::setDevice() function. For more details, please read the CUDA C Programming Guide.
While developing algorithms for multiple GPUs, keep the data-passing overhead in mind. For primitive
functions and small images, it can be significant, which may eliminate all the advantages of having
multiple GPUs. But for high-level algorithms, consider using multi-GPU acceleration. For example, the
Stereo Block Matching algorithm has been successfully parallelized using the following scheme:
1. Split each image of the stereo pair into two horizontal overlapping stripes.
2. Process each pair of stripes (from the left and right images) on a separate Fermi\* GPU.
3. Merge the results into a single disparity map.
With this algorithm, a dual GPU gave a 180% performance increase compared to the single Fermi GPU.
For a source code example, see <https://github.com/Itseez/opencv/tree/master/samples/gpu/>.
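
The split and merge bookkeeping from steps 1 and 3 can be sketched independently of OpenCV. The code below is an illustration under assumptions: the `Stripe` structure, the `splitRows` function, and the overlap parameter are hypothetical and not taken from the linked sample. Each stripe records which rows it processes (including the overlap) and which rows it contributes to the merged disparity map.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Row ranges are half-open: [begin, end).
struct Stripe
{
    int srcBegin, srcEnd; // rows given to one GPU, overlap included
    int outBegin, outEnd; // rows this stripe contributes to the merged result
};

// Splits `rows` image rows into `parts` horizontal stripes that overlap
// their neighbours by `overlap` rows on each shared border.
inline std::vector<Stripe> splitRows(int rows, int parts, int overlap)
{
    assert(rows > 0 && parts > 0 && overlap >= 0);
    std::vector<Stripe> stripes;
    for (int i = 0; i < parts; ++i)
    {
        Stripe s;
        s.outBegin = rows * i / parts;
        s.outEnd = rows * (i + 1) / parts;
        // Extend the processed range into the neighbouring stripes,
        // clamped to the image borders.
        s.srcBegin = std::max(0, s.outBegin - overlap);
        s.srcEnd = std::min(rows, s.outEnd + overlap);
        stripes.push_back(s);
    }
    return stripes;
}
```

For a 480-row image, two stripes, and a 16-row overlap, this yields processed ranges [0, 256) and [224, 480), with output ranges [0, 240) and [240, 480). In a real application each stripe would presumably be taken with GpuMat::rowRange and processed in its own thread after a cuda::setDevice() call.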


@@ -49,10 +49,22 @@
#include "opencv2/core/cuda.hpp"
/**
@defgroup cuda CUDA-accelerated Computer Vision
@ref cuda_intro "Introduction page"
@{
@defgroup cuda_objdetect Object Detection
@}
*/
namespace cv { namespace cuda {
//////////////// HOG (Histogram-of-Oriented-Gradients) Descriptor and Object Detector //////////////
//! @addtogroup cuda_objdetect
//! @{
struct CV_EXPORTS HOGConfidence
{
double scale;
@@ -61,31 +73,92 @@ struct CV_EXPORTS HOGConfidence
std::vector<double> part_scores[4];
};
/** @brief The class implements Histogram of Oriented Gradients (@cite Dalal2005) object detector.
Interfaces of all methods are kept similar to the CPU HOG descriptor and detector analogues as much
as possible.
@note
- An example applying the HOG descriptor for people detection can be found at
opencv\_source\_code/samples/cpp/peopledetect.cpp
- A CUDA example applying the HOG descriptor for people detection can be found at
opencv\_source\_code/samples/gpu/hog.cpp
- (Python) An example applying the HOG descriptor for people detection can be found at
opencv\_source\_code/samples/python2/peopledetect.py
*/
struct CV_EXPORTS HOGDescriptor
{
enum { DEFAULT_WIN_SIGMA = -1 };
enum { DEFAULT_NLEVELS = 64 };
enum { DESCR_FORMAT_ROW_BY_ROW, DESCR_FORMAT_COL_BY_COL };
/** @brief Creates the HOG descriptor and detector.
@param win\_size Detection window size. Align to block size and block stride.
@param block\_size Block size in pixels. Align to cell size. Only (16,16) is supported for now.
@param block\_stride Block stride. It must be a multiple of cell size.
@param cell\_size Cell size. Only (8, 8) is supported for now.
@param nbins Number of bins. Only 9 bins per cell are supported for now.
@param win\_sigma Gaussian smoothing window parameter.
@param threshold\_L2hys L2-Hys normalization method shrinkage.
@param gamma\_correction Flag to specify whether the gamma correction preprocessing is required or
not.
@param nlevels Maximum number of detection window increases.
*/
HOGDescriptor(Size win_size=Size(64, 128), Size block_size=Size(16, 16),
Size block_stride=Size(8, 8), Size cell_size=Size(8, 8),
int nbins=9, double win_sigma=DEFAULT_WIN_SIGMA,
double threshold_L2hys=0.2, bool gamma_correction=true,
int nlevels=DEFAULT_NLEVELS);
/** @brief Returns the number of coefficients required for the classification.
*/
size_t getDescriptorSize() const;
/** @brief Returns the block histogram size.
*/
size_t getBlockHistogramSize() const;
/** @brief Sets coefficients for the linear SVM classifier.
*/
void setSVMDetector(const std::vector<float>& detector);
/** @brief Returns coefficients of the classifier trained for people detection (for default window size).
*/
static std::vector<float> getDefaultPeopleDetector();
/** @brief Returns coefficients of the classifier trained for people detection (for 48x96 windows).
*/
static std::vector<float> getPeopleDetector48x96();
/** @brief Returns coefficients of the classifier trained for people detection (for 64x128 windows).
*/
static std::vector<float> getPeopleDetector64x128();
/** @brief Performs object detection without a multi-scale window.
@param img Source image. CV\_8UC1 and CV\_8UC4 types are supported for now.
@param found\_locations Top-left corner points of detected object boundaries.
@param hit\_threshold Threshold for the distance between features and SVM classifying plane.
Usually it is 0 and should be specified in the detector coefficients (as the last free
coefficient). But if the free coefficient is omitted (which is allowed), you can specify it
manually here.
@param win\_stride Window stride. It must be a multiple of block stride.
@param padding Mock parameter to keep the CPU interface compatibility. It must be (0,0).
*/
void detect(const GpuMat& img, std::vector<Point>& found_locations,
double hit_threshold=0, Size win_stride=Size(),
Size padding=Size());
/** @brief Performs object detection with a multi-scale window.
@param img Source image. See cuda::HOGDescriptor::detect for type limitations.
@param found\_locations Detected objects boundaries.
@param hit\_threshold Threshold for the distance between features and SVM classifying plane. See
cuda::HOGDescriptor::detect for details.
@param win\_stride Window stride. It must be a multiple of block stride.
@param padding Mock parameter to keep the CPU interface compatibility. It must be (0,0).
@param scale0 Coefficient of the detection window increase.
@param group\_threshold Coefficient to regulate the similarity threshold. When detected, some
objects can be covered by many rectangles. 0 means not to perform grouping. See groupRectangles.
*/
void detectMultiScale(const GpuMat& img, std::vector<Rect>& found_locations,
double hit_threshold=0, Size win_stride=Size(),
Size padding=Size(), double scale0=1.05,
@@ -98,6 +171,17 @@ struct CV_EXPORTS HOGDescriptor
double hit_threshold, Size win_stride, Size padding,
std::vector<HOGConfidence> &conf_out, int group_threshold);
/** @brief Returns block descriptors computed for the whole image.
@param img Source image. See cuda::HOGDescriptor::detect for type limitations.
@param win\_stride Window stride. It must be a multiple of block stride.
@param descriptors 2D array of descriptors.
@param descr\_format Descriptor storage format:
- **DESCR\_FORMAT\_ROW\_BY\_ROW** - Row-major order.
- **DESCR\_FORMAT\_COL\_BY\_COL** - Column-major order.
The function is mainly used to learn the classifier.
*/
void getDescriptors(const GpuMat& img, Size win_stride,
GpuMat& descriptors,
int descr_format=DESCR_FORMAT_COL_BY_COL);
@@ -145,20 +229,82 @@ protected:
//////////////////////////// CascadeClassifier ////////////////////////////
// The cascade classifier class for object detection: supports old haar and new lbp xml formats and nvbin for haar cascades only.
/** @brief Cascade classifier class used for object detection. Supports HAAR and LBP cascades.
@note
- A cascade classifier example can be found at
opencv\_source\_code/samples/gpu/cascadeclassifier.cpp
- An NVIDIA API specific cascade classifier example can be found at
opencv\_source\_code/samples/gpu/cascadeclassifier\_nvidia\_api.cpp
*/
class CV_EXPORTS CascadeClassifier_CUDA
{
public:
CascadeClassifier_CUDA();
/** @brief Loads the classifier from a file. The cascade type is detected automatically from the constructor parameter.
@param filename Name of the file from which the classifier is loaded. Only the old haar classifier
(trained by the haar training application) and NVIDIA's nvbin are supported for HAAR, and only the
new type of OpenCV XML cascade is supported for LBP.
*/
CascadeClassifier_CUDA(const String& filename);
~CascadeClassifier_CUDA();
/** @brief Checks whether the classifier is loaded or not.
*/
bool empty() const;
/** @brief Loads the classifier from a file. The previous content is destroyed.
@param filename Name of the file from which the classifier is loaded. Only the old haar classifier
(trained by the haar training application) and NVIDIA's nvbin are supported for HAAR, and only the
new type of OpenCV XML cascade is supported for LBP.
*/
bool load(const String& filename);
/** @brief Destroys the loaded classifier.
*/
void release();
/* returns number of detected objects */
/** @overload */
int detectMultiScale(const GpuMat& image, GpuMat& objectsBuf, double scaleFactor = 1.2, int minNeighbors = 4, Size minSize = Size());
/** @brief Detects objects of different sizes in the input image.
@param image Matrix of type CV\_8U containing an image where objects should be detected.
@param objectsBuf Buffer to store detected objects (rectangles). If it is empty, it is allocated
with the default size. If not empty, the function searches for no more than N objects, where
N = sizeof(objectsBuf's data)/sizeof(cv::Rect).
@param maxObjectSize Maximum possible object size. Objects larger than that are ignored. Used for
second signature and supported only for LBP cascades.
@param scaleFactor Parameter specifying how much the image size is reduced at each image scale.
@param minNeighbors Parameter specifying how many neighbors each candidate rectangle should have
to retain it.
@param minSize Minimum possible object size. Objects smaller than that are ignored.
The detected objects are returned as a list of rectangles.
The function returns the number of detected objects, so you can retrieve them as in the following
example:
@code
cuda::CascadeClassifier_CUDA cascade_gpu(...);
Mat image_cpu = imread(...);
GpuMat image_gpu(image_cpu);
GpuMat objbuf;
int detections_number = cascade_gpu.detectMultiScale( image_gpu,
objbuf, 1.2, minNeighbors);
Mat obj_host;
// download only detected number of rectangles
objbuf.colRange(0, detections_number).download(obj_host);
Rect* faces = obj_host.ptr<Rect>();
for(int i = 0; i < detections_number; ++i)
cv::rectangle(image_cpu, faces[i], Scalar(255));
imshow("Faces", image_cpu);
@endcode
@sa CascadeClassifier::detectMultiScale
*/
int detectMultiScale(const GpuMat& image, GpuMat& objectsBuf, Size maxObjectSize, Size minSize = Size(), double scaleFactor = 1.1, int minNeighbors = 4);
bool findLargestObject;
@@ -174,8 +320,13 @@ private:
friend class CascadeClassifier_CUDA_LBP;
};
//! @} cuda_objdetect
//////////////////////////// Labeling ////////////////////////////
//! @addtogroup cuda
//! @{
//!performs labeling via graph cuts of a 2D regular 4-connected graph.
CV_EXPORTS void graphcut(GpuMat& terminals, GpuMat& leftTransp, GpuMat& rightTransp, GpuMat& top, GpuMat& bottom, GpuMat& labels,
GpuMat& buf, Stream& stream = Stream::Null());
@@ -192,8 +343,13 @@ CV_EXPORTS void connectivityMask(const GpuMat& image, GpuMat& mask, const cv::Sc
//! performs connected components labeling.
CV_EXPORTS void labelComponents(const GpuMat& mask, GpuMat& components, int flags = 0, Stream& stream = Stream::Null());
//! @}
//////////////////////////// Calib3d ////////////////////////////
//! @addtogroup cuda_calib3d
//! @{
CV_EXPORTS void transformPoints(const GpuMat& src, const Mat& rvec, const Mat& tvec,
GpuMat& dst, Stream& stream = Stream::Null());
@@ -201,13 +357,34 @@ CV_EXPORTS void projectPoints(const GpuMat& src, const Mat& rvec, const Mat& tve
const Mat& camera_mat, const Mat& dist_coef, GpuMat& dst,
Stream& stream = Stream::Null());
/** @brief Finds the object pose from 3D-2D point correspondences.
@param object Single-row matrix of object points.
@param image Single-row matrix of image points.
@param camera\_mat 3x3 matrix of intrinsic camera parameters.
@param dist\_coef Distortion coefficients. See undistortPoints for details.
@param rvec Output 3D rotation vector.
@param tvec Output 3D translation vector.
@param use\_extrinsic\_guess Flag to indicate that the function must use rvec and tvec as an
initial transformation guess. It is not supported for now.
@param num\_iters Maximum number of RANSAC iterations.
@param max\_dist Euclidean distance threshold to detect whether a point is an inlier or not.
@param min\_inlier\_count Number of inliers at which the function must stop: the search ends once an
equal or greater number of inliers is achieved. It is not supported for now.
@param inliers Output vector of inlier indices.
*/
CV_EXPORTS void solvePnPRansac(const Mat& object, const Mat& image, const Mat& camera_mat,
const Mat& dist_coef, Mat& rvec, Mat& tvec, bool use_extrinsic_guess=false,
int num_iters=100, float max_dist=8.0, int min_inlier_count=100,
std::vector<int>* inliers=NULL);
//! @}
//////////////////////////// VStab ////////////////////////////
//! @addtogroup cuda
//! @{
//! removes points (CV_32FC2, single row matrix) with zero mask value
CV_EXPORTS void compactPoints(GpuMat &points0, GpuMat &points1, const GpuMat &mask);
@@ -215,6 +392,8 @@ CV_EXPORTS void calcWobbleSuppressionMaps(
int left, int idx, int right, Size size, const Mat &ml, const Mat &mr,
GpuMat &mapx, GpuMat &mapy);
//! @}
}} // namespace cv { namespace cuda {
#endif /* __OPENCV_CUDA_HPP__ */