Introduction

Halide code is the same for every device we use. But for achieving the satisfied efficiency we should schedule computations properly. In this tutorial we describe the ways to schedule your networks using Halide backend in OpenCV deep learning module.

For better understanding of Halide scheduling you might want to read tutorials @ http://halide-lang.org/tutorials.

If it's your first meeting with Halide in OpenCV, we recommend to start from How to enable Halide backend for improve efficiency.

Configuration files

You can schedule computations of Halide pipeline by writing textual configuration files. It means that you can easily vectorize, parallelize and manage loops order of layers computation. Pass path to file with scheduling directives for specific device into cv::dnn::Net::setHalideScheduler before the first cv::dnn::Net::forward call.

Scheduling configuration files represented as YAML files where each node is a scheduled function or a scheduling directive.

relu1:
  reorder: [x, c, y]
  split: { y: 2, c: 8 }
  parallel: [yo, co]
  unroll: yi
  vectorize: { x: 4 }
conv1_constant_exterior:
  compute_at: { relu1: yi }

Considered use variables n for batch dimension, c for channels, y for rows and x for columns. For variables after split are used names with the same prefix but o and i suffixes for outer and inner variables correspondingly. In example, for variable x in range [0, 10) directive split: { x: 2 } gives new ones xo in range [0, 5) and xi in range [0, 2). Variable name x is no longer available in the same scheduling node.

You can find scheduling examples at opencv_extra/testdata/dnn and use it for schedule your networks.

Layers fusing

Thanks to layers fusing we can schedule only the top layers of fused sets. Because for every output value we use the fused formula. In example, if you have three layers Convolution + Scale + ReLU one by one,

conv(x, y, c, n) = sum(...) + bias(c);
scale(x, y, c, n) = conv(x, y, c, n) * weights(c);
relu(x, y, c, n) = max(scale(x, y, c, n), 0);

fused function is something like

relu(x, y, c, n) = max((sum(...) + bias(c)) * weights(c), 0);

So only function called relu require scheduling.

Scheduling patterns

Sometimes networks built using blocked structure that means some layer are identical or quite similar. If you want to apply the same scheduling for different layers accurate to tiling or vectorization factors, define scheduling patterns in section patterns at the beginning of scheduling file. Also, your patters may use some parametric variables.

# At the beginning of the file
patterns:
  fully_connected:
    split: { c: c_split }
    fuse: { src: [x, y, co], dst: block }
    parallel: block
    vectorize: { ci: c_split }
# Somewhere below
fc8:
  pattern: fully_connected
  params: { c_split: 8 }

Automatic scheduling

You can let DNN to schedule layers automatically. Just skip call of cv::dnn::Net::setHalideScheduler. Sometimes it might be even more efficient than manual scheduling. But if specific layers require be scheduled manually, you would be able to mix both manual and automatic scheduling ways. Write scheduling file and skip layers that you want to be scheduled automatically.