OpenCV  4.5.4
Open Source Computer Vision
Vectorizing your code using Universal Intrinsics

Prev Tutorial: How to use the OpenCV parallel_for_ to parallelize your code

Compatibility OpenCV >= 3.0

Goal

The goal of this tutorial is to provide a guide to using the Universal intrinsics feature to vectorize your C++ code for a faster runtime. We'll briefly look into SIMD intrinsics and how to work with wide registers, followed by a tutorial on the basic operations using wide registers.

Theory

In this section, we will briefly look into a few concepts to better help understand the functionality.

Intrinsics

Intrinsics are functions which are separately handled by the compiler. These functions are often optimized to perform in the most efficient ways possible and hence run faster than normal implementations. However, since these functions depend on the compiler, it makes it difficult to write portable applications.

SIMD

SIMD stands for Single Instruction, Multiple Data. SIMD Intrinsics allow the processor to vectorize calculations. The data is stored in what are known as registers. A register may be 128-bits, 256-bits or 512-bits wide. Each register stores multiple values of the same data type. The size of the register and the size of each value determines the number of values stored in total.

Depending on what Instruction Sets your CPU supports, you may be able to use the different registers. To learn more, look here

Universal Intrinsics

OpenCVs universal intrinsics provides an abstraction to SIMD vectorization methods and allows the user to use intrinsics without the need to write system specific code.

OpenCV Universal Intrinsics support the following instruction sets:

We will now introduce the available structures and functions:

Register Structures

The Universal Intrinsics set implements every register as a structure based on the particular SIMD register. All types contain the nlanes enumeration which gives the exact number of values that the type can hold. This eliminates the need to hardcode the number of values during implementations.

Note
Each register structure is under the cv namespace.

There are two types of registers:

Load and Store operations

Now that we know how registers work, let us look at the functions used for filling these registers with values.

The universal intrinsics set provides element wise binary and unary operations.



Note
Comparison and Min/Max operators are not available for 64 bit integers. Bitwise shift and logic operators are available only for integer values. Bitwise shift is available only for 16, 32 and 64 bit registers.

Reduce and Mask

Demonstration

In the following section, we will vectorize a simple convolution function for single channel and compare the results to a scalar implementation.

Note
Not all algorithms are improved by manual vectorization. In fact, in certain cases, the compiler may autovectorize the code, thus producing faster results for scalar implementations.

You may learn more about convolution from the previous tutorial. We use the same naive implementation from the previous tutorial and compare it to the vectorized version.

The full tutorial code is here.

Vectorizing Convolution

We will first implement a 1-D convolution and then vectorize it. The 2-D vectorized convolution will perform 1-D convolution across the rows to produce the correct results.

1-D Convolution: Scalar

void conv1d(Mat src, Mat &dst, Mat kernel)
{
int len = src.cols;
dst = Mat(1, len, CV_8UC1);
int sz = kernel.cols / 2;
copyMakeBorder(src, src, 0, 0, sz, sz, BORDER_REPLICATE);
for (int i = 0; i < len; i++)
{
double value = 0;
for (int k = -sz; k <= sz; k++)
value += src.ptr<uchar>(0)[i + k + sz] * kernel.ptr<float>(0)[k + sz];
dst.ptr<uchar>(0)[i] = saturate_cast<uchar>(value);
}
}
  1. We first set up variables and make a border on both sides of the src matrix, to take care of edge cases.
    int len = src.cols;
    dst = Mat(1, len, CV_8UC1);
    int sz = kernel.cols / 2;
    copyMakeBorder(src, src, 0, 0, sz, sz, BORDER_REPLICATE);
  2. For the main loop, we select an index i and offset it on both sides along with the kernel, using the k variable. We store the value in value and add it to the dst matrix.
    for (int i = 0; i < len; i++)
    {
    double value = 0;
    for (int k = -sz; k <= sz; k++)
    value += src.ptr<uchar>(0)[i + k + sz] * kernel.ptr<float>(0)[k + sz];
    dst.ptr<uchar>(0)[i] = saturate_cast<uchar>(value);
    }

    1-D Convolution: Vector

We will now look at the vectorized version of 1-D convolution.

void conv1dsimd(Mat src, Mat kernel, float *ans, int row = 0, int rowk = 0, int len = -1)
{
if (len == -1)
len = src.cols;
Mat src_32, kernel_32;
const int alpha = 1;
src.convertTo(src_32, CV_32FC1, alpha);
int ksize = kernel.cols, sz = kernel.cols / 2;
copyMakeBorder(src_32, src_32, 0, 0, sz, sz, BORDER_REPLICATE);
int step = v_float32().nlanes;
float *sptr = src_32.ptr<float>(row), *kptr = kernel.ptr<float>(rowk);
for (int k = 0; k < ksize; k++)
{
v_float32 kernel_wide = vx_setall_f32(kptr[k]);
int i;
for (i = 0; i + step < len; i += step)
{
v_float32 window = vx_load(sptr + i + k);
v_float32 sum = vx_load(ans + i) + kernel_wide * window;
v_store(ans + i, sum);
}
for (; i < len; i++)
{
*(ans + i) += sptr[i + k]*kptr[k];
}
}
}
  1. In our case, the kernel is a float. Since the kernel's datatype is the largest, we convert src to float32, forming src_32. We also make a border like we did for the naive case.
    Mat src_32, kernel_32;
    const int alpha = 1;
    src.convertTo(src_32, CV_32FC1, alpha);
    int ksize = kernel.cols, sz = kernel.cols / 2;
    copyMakeBorder(src_32, src_32, 0, 0, sz, sz, BORDER_REPLICATE);
  2. Now, for each column in the kernel, we calculate the scalar product of the value with all window vectors of length step. We add these values to the already stored values in ans
    int step = v_float32().nlanes;
    float *sptr = src_32.ptr<float>(row), *kptr = kernel.ptr<float>(rowk);
    for (int k = 0; k < ksize; k++)
    {
    v_float32 kernel_wide = vx_setall_f32(kptr[k]);
    int i;
    for (i = 0; i + step < len; i += step)
    {
    v_float32 window = vx_load(sptr + i + k);
    v_float32 sum = vx_load(ans + i) + kernel_wide * window;
    v_store(ans + i, sum);
    }
    for (; i < len; i++)
    {
    *(ans + i) += sptr[i + k]*kptr[k];
    }
    }
    • We declare a pointer to the src_32 and kernel and run a loop for each kernel element
      int step = v_float32().nlanes;
      float *sptr = src_32.ptr<float>(row), *kptr = kernel.ptr<float>(rowk);
      for (int k = 0; k < ksize; k++)
      {
    • We load a register with the current kernel element. A window is shifted from 0 to len - step and its product with the kernel_wide array is added to the values stored in ans. We store the values back into ans
      v_float32 kernel_wide = vx_setall_f32(kptr[k]);
      int i;
      for (i = 0; i + step < len; i += step)
      {
      v_float32 window = vx_load(sptr + i + k);
      v_float32 sum = vx_load(ans + i) + kernel_wide * window;
      v_store(ans + i, sum);
      }
    • Since the length might not be divisible by steps, we take care of the remaining values directly. The number of tail values will always be less than step and will not affect the performance significantly. We store all the values to ans which is a float pointer. We can also directly store them in a Mat object
      for (; i < len; i++)
      {
      *(ans + i) += sptr[i + k]*kptr[k];
      }
    • Here is an iterative example:
        For example:
        kernel: {k1, k2, k3}
        src:           ...|a1|a2|a3|a4|...
      
      
        iter1:
        for each idx i in (0, len), 'step' idx at a time
            kernel_wide:          |k1|k1|k1|k1|
            window:               |a0|a1|a2|a3|
            ans:               ...| 0| 0| 0| 0|...
            sum =  ans + window * kernel_wide
                =  |a0 * k1|a1 * k1|a2 * k1|a3 * k1|
      
        iter2:
            kernel_wide:          |k2|k2|k2|k2|
            window:               |a1|a2|a3|a4|
            ans:               ...|a0 * k1|a1 * k1|a2 * k1|a3 * k1|...
            sum =  ans + window * kernel_wide
                =  |a0 * k1 + a1 * k2|a1 * k1 + a2 * k2|a2 * k1 + a3 * k2|a3 * k1 + a4 * k2|
      
        iter3:
            kernel_wide:          |k3|k3|k3|k3|
            window:               |a2|a3|a4|a5|
            ans:               ...|a0 * k1 + a1 * k2|a1 * k1 + a2 * k2|a2 * k1 + a3 * k2|a3 * k1 + a4 * k2|...
            sum =  sum + window * kernel_wide
                =  |a0*k1 + a1*k2 + a2*k3|a1*k1 + a2*k2 + a3*k3|a2*k1 + a3*k2 + a4*k3|a3*k1 + a4*k2 + a5*k3|
      
Note
The function parameters also include row, rowk and len. These values are used when using the function as an intermediate step of 2-D convolution

2-D Convolution

Suppose our kernel has ksize rows. To compute the values for a particular row, we compute the 1-D convolution of the previous ksize/2 and the next ksize/2 rows, with the corresponding kernel row. The final values is simply the sum of the individual 1-D convolutions

void convolute_simd(Mat src, Mat &dst, Mat kernel)
{
int rows = src.rows, cols = src.cols;
int ksize = kernel.rows, sz = ksize / 2;
dst = Mat(rows, cols, CV_32FC1);
copyMakeBorder(src, src, sz, sz, 0, 0, BORDER_REPLICATE);
int step = v_float32().nlanes;
for (int i = 0; i < rows; i++)
{
for (int k = 0; k < ksize; k++)
{
float ans[N] = {0};
conv1dsimd(src, kernel, ans, i + k, k, cols);
int j;
for (j = 0; j + step < cols; j += step)
{
v_float32 sum = vx_load(&dst.ptr<float>(i)[j]) + vx_load(&ans[j]);
v_store(&dst.ptr<float>(i)[j], sum);
}
for (; j < cols; j++)
dst.ptr<float>(i)[j] += ans[j];
}
}
const int alpha = 1;
dst.convertTo(dst, CV_8UC1, alpha);
}
  1. We first initialize variables and make a border above and below the src matrix. The left and right sides are handled by the 1-D convolution function.
    int rows = src.rows, cols = src.cols;
    int ksize = kernel.rows, sz = ksize / 2;
    dst = Mat(rows, cols, CV_32FC1);
    copyMakeBorder(src, src, sz, sz, 0, 0, BORDER_REPLICATE);
    int step = v_float32().nlanes;
  2. For each row, we calculate the 1-D convolution of the rows above and below it. we then add the values to the dst matrix.
    for (int i = 0; i < rows; i++)
    {
    for (int k = 0; k < ksize; k++)
    {
    float ans[N] = {0};
    conv1dsimd(src, kernel, ans, i + k, k, cols);
    int j;
    for (j = 0; j + step < cols; j += step)
    {
    v_float32 sum = vx_load(&dst.ptr<float>(i)[j]) + vx_load(&ans[j]);
    v_store(&dst.ptr<float>(i)[j], sum);
    }
    for (; j < cols; j++)
    dst.ptr<float>(i)[j] += ans[j];
    }
    }
  3. We finally convert the dst matrix to a 8-bit unsigned char matrix
    const int alpha = 1;
    dst.convertTo(dst, CV_8UC1, alpha);

    Results

In the tutorial, we used a horizontal gradient kernel. We obtain the same output image for both methods.

Improvement in runtime varies and will depend on the SIMD capabilities available in your CPU.