Prev Tutorial: Similarity check (PNSR and SSIM) on the GPU
Goal
Thrust is an extremely powerful library for various cuda accelerated algorithms. However thrust is designed to work with vectors and not pitched matricies. The following tutorial will discuss wrapping cv::cuda::GpuMat's into thrust iterators that can be used with thrust algorithms.
This tutorial should show you how to:
- Wrap a GpuMat into a thrust iterator
- Fill a GpuMat with random numbers
- Sort a column of a GpuMat in place
- Copy values greater than 0 to a new gpu matrix
- Use streams with thrust
Wrapping a GpuMat into a thrust iterator
The following code will produce an iterator for a GpuMat
template<typename T>
thrust::permutation_iterator<thrust::device_ptr<T>, thrust::transform_iterator<step_functor<T>, thrust::counting_iterator<int>>> GpuMatBeginItr(
cv::cuda::GpuMat mat,
int channel = 0)
{
if (channel == -1)
{
channel = 0;
}
return thrust::make_permutation_iterator(thrust::device_pointer_cast(mat.
ptr<T>(0) + channel),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), step_functor<T>(mat.
cols, mat.
step /
sizeof(T), mat.
channels())));
}
Template "trait" class for OpenCV primitive data types.
Definition traits.hpp:113
Base storage class for GPU memory with reference counting.
Definition cuda.hpp:106
GpuMat reshape(int cn, int rows=0) const
int channels() const
returns number of channels
int cols
Definition cuda.hpp:352
int depth() const
returns element type
uchar * ptr(int y=0)
returns pointer to y-th row
size_t step
a distance between successive rows in bytes; includes the gap if any
Definition cuda.hpp:355
#define CV_Assert(expr)
Checks a condition at runtime and throws exception if it fails.
Definition base.hpp:359
template<typename T>
thrust::permutation_iterator<thrust::device_ptr<T>, thrust::transform_iterator<step_functor<T>, thrust::counting_iterator<int>>> GpuMatEndItr(
cv::cuda::GpuMat mat,
int channel = 0)
{
if (channel == -1)
{
channel = 0;
}
return thrust::make_permutation_iterator(thrust::device_pointer_cast(mat.
ptr<T>(0) + channel),
thrust::make_transform_iterator(thrust::make_counting_iterator(mat.
rows*mat.
cols), step_functor<T>(mat.
cols, mat.
step /
sizeof(T), mat.
channels())));
}
int rows
the number of rows and columns
Definition cuda.hpp:352
Our goal is to have an iterator that will start at the beginning of the matrix, and increment correctly to access continuous matrix elements. This is trivial for a continuous row, but how about for a column of a pitched matrix? To do this we need the iterator to be aware of the matrix dimensions and step. This information is embedded in the step_functor.
template<typename T> struct step_functor : public thrust::unary_function<int, int>
{
int columns;
int step;
int channels;
__host__ __device__ step_functor(int columns_, int step_, int channels_ = 1) : columns(columns_), step(step_), channels(channels_) { };
{
step = mat.
step /
sizeof(T);
}
__host__ __device__
int operator()(int x) const
{
int row = x / columns;
int idx = (row * step) + (x % columns)*channels;
return idx;
}
};
The step functor takes in an index value and returns the appropriate offset from the beginning of the matrix. The counting iterator simply increments over the range of pixel elements. Combined into the transform_iterator we have an iterator that counts from 0 to M*N and correctly increments to account for the pitched memory of a GpuMat. Unfortunately this does not include any memory location information, for that we need a thrust::device_ptr. By combining a device pointer with the transform_iterator we can point thrust to the first element of our matrix and have it step accordingly.
Fill a GpuMat with random numbers
Now that we have some nice functions for making iterators for thrust, lets use them to do some things OpenCV can't do. Unfortunately at the time of this writing, OpenCV doesn't have any Gpu random number generation. Thankfully thrust does and it's now trivial to interop between the two. Example taken from http://stackoverflow.com/questions/12614164/generating-a-random-number-vector-between-0-and-1-0-using-thrust
First we need to write a functor that will produce our random values.
struct prg
{
float a, b;
__host__ __device__
prg(float _a = 0.f, float _b = 1.f) : a(_a), b(_b) {};
__host__ __device__
float operator()(const unsigned int n) const
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a, b);
rng.discard(n);
return dist(rng);
}
};
This will take in an integer value and output a value between a and b. Now we will populate our matrix with values between 0 and 10 with a thrust transform.
{
cv::cuda::GpuMat d_value(1, 100, CV_32F);
auto valueBegin = GpuMatBeginItr<float>(d_value);
auto valueEnd = GpuMatEndItr<float>(d_value);
thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));
cv::Mat h_value(d_value);
}
Sort a column of a GpuMat in place
Lets fill matrix elements with random values and an index. Afterwards we will sort the random numbers and the indecies.
{
cv::cuda::GpuMat d_data(1, 100, CV_32SC2);
// Thrust compatible begin and end iterators to channel 1 of this matrix
auto keyBegin = GpuMatBeginItr<int>(d_data, 1);
auto keyEnd = GpuMatEndItr<int>(d_data, 1);
// Thrust compatible begin and end iterators to channel 0 of this matrix
auto idxBegin = GpuMatBeginItr<int>(d_data, 0);
auto idxEnd = GpuMatEndItr<int>(d_data, 0);
// Fill the index channel with a sequence of numbers from 0 to 100
thrust::sequence(idxBegin, idxEnd);
// Fill the key channel with random numbers between 0 and 10. A counting iterator is used here to give an integer value for each location as an input to prg::operator()
thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_data.cols), keyBegin, prg(0, 10));
// Sort the key channel and index channel such that the keys and indecies stay together
thrust::sort_by_key(keyBegin, keyEnd, idxBegin);
cv::Mat h_idx(d_data);
}
Copy values greater than 0 to a new gpu matrix while using streams
In this example we're going to see how cv::cuda::Streams can be used with thrust. Unfortunately this specific example uses functions that must return results to the CPU so it isn't the optimal use of streams.
{
cv::cuda::GpuMat d_value(1, 100, CV_32F);
auto valueBegin = GpuMatBeginItr<float>(d_value);
auto valueEnd = GpuMatEndItr<float>(d_value);
cv::cuda::Stream stream;
//! [random_gen_stream]
// Same as the random generation code from before except now the transformation is being performed on a stream
thrust::transform(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));
//! [random_gen_stream]
// Count the number of values we are going to copy
int count = thrust::count_if(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), valueBegin, valueEnd, pred_greater<float>(0.0));
// Allocate a destination for copied values
cv::cuda::GpuMat d_valueGreater(1, count, CV_32F);
// Copy values that satisfy the predicate.
thrust::copy_if(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), valueBegin, valueEnd, GpuMatBeginItr<float>(d_valueGreater), pred_greater<float>(0.0));
cv::Mat h_greater(d_valueGreater);
}
First we will populate a GPU mat with randomly generated data between -1 and 1 on a stream.
// Same as the random generation code from before except now the transformation is being performed on a stream
thrust::transform(thrust::system::cuda::par.on(cv::cuda::StreamAccessor::getStream(stream)), thrust::make_counting_iterator(0), thrust::make_counting_iterator(d_value.cols), valueBegin, prg(-1, 1));
Notice the use of thrust::system::cuda::par.on(...), this creates an execution policy for executing thrust code on a stream. There is a bug in the version of thrust distributed with the cuda toolkit, as of version 7.5 this has not been fixed. This bug causes code to not execute on streams. The bug can however be fixed by using the newest version of thrust from the git repository. (http://github.com/thrust/thrust.git) Next we will determine how many values are greater than 0 by using thrust::count_if with the following predicate:
template<typename T> struct pred_greater
{
T value;
__host__ __device__ pred_greater(T value_) : value(value_){}
__host__ __device__ bool operator()(const T& val) const
{
return val > value;
}
};
We will use those results to create an output buffer for storing the copied values, we will then use copy_if with the same predicate to populate the output buffer. Lastly we will download the values into a CPU mat for viewing.