This article describes an ICP algorithm used in depth fusion pipelines such as KinectFusion.

The goal of ICP is to align two point clouds: the old one (the existing points and normals in the 3D model) and the new one (new points and normals that we want to integrate into the existing model). ICP returns the rotation and translation, i.e. the rigid transform, between these two point clouds.

Iterative Closest Point (ICP) minimizes an objective function, the Point to Plane Distance (PPD) between corresponding points in the two point clouds:

$E=\sum_{i}\left\|ppd(p_{i}, q_{i}, n_{i})\right\|_{2}\rightarrow0$

What is ppd(p, q, n)?

Specifically, for each pair of corresponding points P and Q, it is the distance from the point P to the plane defined by the point Q and the normal N at Q. Two points P and Q are considered corresponding if, given the current camera pose, they project to the same pixel.

p - i'th point in the new point cloud

q - i'th point in the old point cloud

n - normal in the point q in the old point cloud

Therefore, ppd(...) can be expressed as the dot product of (the difference between the transformed p and q) and (n):

$dot(T_{p2q}(p)-q, n)=dot((R\cdot p + t)-q,n)=[(R\cdot p + t)-q]^{T}\cdot n$

T(p) is a rigid transform of point p:

$T_{p2q}(p) = (R \cdot p + t)$

where R - rotation, t - translation.

T is the transform that ICP searches for. Its purpose is to bring each point p closer to the corresponding point q in terms of point to plane distance.
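The point to plane distance defined above can be written in a few lines. This is a minimal sketch using NumPy; the function name `point_to_plane_distance` and the sample points are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def point_to_plane_distance(p, q, n, R, t):
    """Signed distance from the transformed point R @ p + t
    to the plane through q with unit normal n."""
    return float(np.dot(R @ p + t - q, n))

# Old-cloud point q lying on the plane z = 1, with its normal along +z.
q = np.array([0.0, 0.0, 1.0])
n = np.array([0.0, 0.0, 1.0])
# New-cloud point p: shifted sideways and 0.5 above that plane.
p = np.array([0.3, -0.2, 1.5])

# With the identity transform only the offset along the normal counts.
d_identity = point_to_plane_distance(p, q, n, np.eye(3), np.zeros(3))
# Translating p by -0.5 along z puts it on the plane.
d_aligned = point_to_plane_distance(p, q, n, np.eye(3), np.array([0.0, 0.0, -0.5]))
```

Note that the sideways offset of p does not contribute to the distance: only the component of T(p) - q along the normal does, which is what distinguishes point to plane from point to point distance.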

How to minimize the objective function?

We use the Gauss-Newton method for the function minimization.

The Gauss-Newton method takes sequential steps, changing R and t at each step so that E decreases, i.e. moving against its gradient:

At each step we linearize the point to plane distance inside E around the current pose: its current value plus the Jacobian matrix multiplied by delta x, which is the concatenation of the delta R and delta t vectors.
We find delta R and delta t by minimizing the resulting E_approx(delta_x), i.e. by setting its derivative to zero.
We apply delta R and delta t to the current Rt transform and proceed to the next iteration.

How to linearize E?

Let's approximate it in an infinitesimal neighborhood of the current pose.

Here's a formula we're going to minimize by changing R and t:

$E=\sum\left\|[(R\cdot p + t)-q]^{T}\cdot n\right\|_{2}$

While the point to plane distance is linear in both R and t, the space of rotations is not itself linear. You can see this in how R is built from its rotation angles:

$\displaystyle R = R_{z}(\gamma)R_{y}(\beta )R_{x}(\alpha)= \begin{bmatrix} cos(\gamma) & -sin(\gamma) & 0 \\ sin(\gamma) & cos(\gamma) & 0\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} cos(\beta) & 0 & sin(\beta)\\ 0 & 1 & 0\\ -sin(\beta) & 0 & cos(\beta) \end{bmatrix} \begin{bmatrix} 1 & 0 & 0\\ 0 & cos(\alpha) & -sin(\alpha)\\ 0 & sin(\alpha) & cos(\alpha) \end{bmatrix}$

But since we have infinitesimal rotations, R can be approximated in the following form:

$R=I + Ad\theta$

where I is the identity matrix and A is an element of so(3), the Lie algebra of the three-dimensional special orthogonal group SO(3), i.e. a skew-symmetric matrix.

Replacing each sin(t) with t and each cos(t) with 1 (their first-order approximations as t --> 0) and dropping products of angles, we get the following representation:

$R = I + \begin{bmatrix}0 & -\gamma & \beta \\ \gamma & 0 & -\alpha \\ -\beta & \alpha & 0 \end{bmatrix} = I + skew(\begin{bmatrix} \alpha & \beta & \gamma \end{bmatrix}^{T}) = I + skew(R_{shift})$
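The quality of this small-angle approximation can be checked numerically. The sketch below assumes NumPy; the helper names `skew` and `rotation_zyx` and the chosen angles are illustrative. It compares the exact product of the three axis rotations with I + skew(R_shift).

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(a) @ b == cross(a, b)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def rotation_zyx(alpha, beta, gamma):
    """Exact rotation R = Rz(gamma) @ Ry(beta) @ Rx(alpha)."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, ca, -sa], [0.0, sa, ca]])
    Ry = np.array([[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]])
    Rz = np.array([[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

# Small rotation angles in radians (alpha, beta, gamma).
angles = np.array([1e-3, -2e-3, 1.5e-3])

R_exact = rotation_zyx(*angles)
R_linear = np.eye(3) + skew(angles)

# The difference is second order in the angles.
max_error = np.abs(R_exact - R_linear).max()
```

For angles of about 1e-3 rad the largest entry-wise difference is on the order of the squared angles, which is why the linearization is only trusted for small steps.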

Substituting the approximation of R back into E expression, we get:

$E_{approx}=\sum\left\|[(I + skew(R_{shift})) \cdot p + t - q]^{T} \cdot n \right \|_{2}$

$E_{approx} = \sum \left \| [I \cdot p + skew(R_{shift}) \cdot p + t - q]^{T} \cdot n \right \|_{2}$

$E_{approx} = \sum \left \| [skew(R_{shift}) \cdot p + t + p- q]^{T} \cdot n \right \|_{2}$

Let's introduce a function f that approximates the transform shift, where x is the 6-vector formed by concatenating R_shift and t:

$f(x, p) = skew(R_{shift}) \cdot p + t$

$E_{approx} = \sum \left \| [f(x, p) + p- q]^{T} \cdot n \right \|_{2}$

How to minimize E_approx?

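E_approx is minimal where its differential (i.e. its change under an infinitesimal increment of the argument) is zero, so let's find that differential. Each term of the sum is taken here as the squared point to plane distance: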

$d(E_{approx}) = \sum_i d(\left \| ppd(T_{approx}(p_i), q_i, n_i) \right \|_2) =$ $\sum_i d(ppd(T_{approx}(p_i), q_i, n_i)^2) =$ $\sum_i 2\cdot ppd(...)\cdot d(ppd(T_{approx}(p_i), q_i, n_i))$

Let's differentiate ppd:

$d(ppd(T_{approx}(p_i), q_i, n_i)) = d([f(x, p_i) + p_i- q_i]^{T} \cdot n_i) = df(x, p_i)^{T} \cdot n_i = dx^T f'(x, p_i)^T \cdot n_i$

Here's what we get for all variables x_j from vector x:

$\frac{\partial E}{\partial x_{j}} = \sum [2 \cdot (f(x, p) + p - q)^{T} \cdot n] \cdot [f_{j}'(x, p)^{T} \cdot n] = 0$

$\sum [2 \cdot n^{T} \cdot (f(x, p) + p - q)] \cdot [n^{T} \cdot f{}'(x, p)] = 0$

Let's introduce a new variable: $\triangle p = p - q$

$\sum [2 \cdot n^{T} \cdot (f(x, p) + \triangle p)] \cdot [n^{T} \cdot f{}'(x, p)] = 0$

$\sum [(f(x, p) + \triangle p)^{T} \cdot (n \cdot n^{T})] \cdot f{}'(x, p) = 0$

$\sum f{}'(x, p)^{T} \cdot [n \cdot n^{T}] \cdot [f(x, p) + \triangle p] = 0$

f(x, p) can be represented as a matrix-vector multiplication. To show that, recall that $cross(a, b) = skew(a) \cdot b = skew(b)^{T} \cdot a$ :

$f(x, p) = skew(R_{shift}) \cdot p + t_{shift} = skew(p)^T R_{shift} + t_{shift}$

$f(x, p) = \begin{bmatrix} skew(p)^{T} & I_{3\times 3}\end{bmatrix} \cdot \begin{bmatrix} \triangle R & \triangle t \end{bmatrix}^{T} = G(p) \cdot x$

Here $\triangle R = R_{shift}$ and $\triangle t = t_{shift}$ are the rotational and translational parts of x.

G(p) is introduced for simplification.

Since $d(f(x, p)) = G(p) \cdot dx = f'(x, p) \cdot dx$ we get $f'(x, p) = G(p)$ .
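The matrix form of f can be verified directly. The following sketch assumes NumPy; the function name `G` mirrors the notation above, while `skew` and the sample values are illustrative. It checks that G(p) multiplied by x equals the cross product form.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(a) @ b == cross(a, b)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def G(p):
    """3x6 matrix [skew(p)^T | I] so that G(p) @ x == cross(r_shift, p) + t_shift."""
    return np.hstack([skew(p).T, np.eye(3)])

p = np.array([0.3, -0.2, 1.5])
r_shift = np.array([0.01, -0.02, 0.03])   # rotational part of x
t_shift = np.array([0.1, 0.0, -0.05])     # translational part of x
x = np.concatenate([r_shift, t_shift])

f_matrix = G(p) @ x                        # matrix-vector form
f_direct = np.cross(r_shift, p) + t_shift  # skew(r_shift) @ p + t_shift
```

Because G(p) does not depend on x, f is linear in x and its derivative is simply G(p), as stated above.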

Moving the $\triangle p$ term to the right-hand side of the zero-derivative condition gives:

$\sum f{}'(x, p)^{T} \cdot [n \cdot n^{T}] \cdot [f(x, p)] = \sum f{}'(x, p)^{T} \cdot [n \cdot n^{T}] \cdot [- \triangle p]$

$\sum G(p)^{T} \cdot [n \cdot n^{T}] \cdot [G(p) \cdot x] = \sum G(p)^{T} \cdot [n \cdot n^{T}] \cdot [- \triangle p]$

Let's introduce a new value:

$C = G(p)^{T} \cdot n$

$C^{T} = (G(p)^{T} \cdot n)^{T} = n^{T} \cdot G(p)$

Substituting C, we get:

$\sum C \cdot C^{T} \cdot x = \sum C \cdot n^{T} \cdot [- \triangle p]$

$\sum C\cdot C^{T}\cdot \begin{bmatrix} \triangle R\\ \triangle t \end{bmatrix} = \sum C \cdot n^{T} \cdot [- \triangle p]$

Solving this 6x6 linear system gives the rigid transform shift for each Gauss-Newton iteration.
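The system above can be assembled and solved as in the following sketch. It assumes NumPy, known correspondences, and unit normals; the function name `solve_icp_step` and the synthetic data are illustrative. It uses the fact that $C = G(p)^{T} \cdot n$ is the 6-vector made of cross(p, n) followed by n.

```python
import numpy as np

def solve_icp_step(p, q, n):
    """One linearized point to plane step.

    p, q, n: (N, 3) arrays of new points, old points, and old normals.
    Returns x = [r_shift, t_shift] solving sum(C C^T) x = -sum(C n^T (p - q)).
    """
    C = np.hstack([np.cross(p, n), n])           # (N, 6); row i is C_i^T
    residual = np.einsum("ij,ij->i", p - q, n)   # n_i^T (p_i - q_i)
    A = C.T @ C                                  # sum_i C_i C_i^T, 6x6
    b = -C.T @ residual                          # -sum_i C_i n_i^T (p_i - q_i)
    return np.linalg.solve(A, b)

# Synthetic old cloud with random unit normals (illustrative data only).
rng = np.random.default_rng(0)
q = rng.uniform(-1.0, 1.0, size=(200, 3))
n = rng.normal(size=(200, 3))
n /= np.linalg.norm(n, axis=1, keepdims=True)

# Small ground-truth transform: rotation by gamma about z plus a translation.
gamma = 2e-3
R_true = np.array([[np.cos(gamma), -np.sin(gamma), 0.0],
                   [np.sin(gamma), np.cos(gamma), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.01, -0.02, 0.005])

# New cloud built so that R_true @ p_i + t_true == q_i for every i.
p = (q - t_true) @ R_true

x = solve_icp_step(p, q, n)
expected = np.concatenate([[0.0, 0.0, gamma], t_true])
```

For such a small transform a single step already recovers the rotation vector and translation up to second-order terms. With larger misalignment, or with correspondences that change between iterations, several steps are needed.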

How do we apply transform shift?

We build a rotation and translation matrix from the shift and then multiply the current pose matrix by it.

While the translational part of the shift goes into the resulting matrix as-is, the rotational part takes a bit more work. The rotation shift is converted from so(3) to SO(3) by exponentiation. In fact, the 3-by-1 rshift vector represents the rotation axis multiplied by the rotation angle. We use the Rodrigues transform to get a rotation matrix from it. For more details, see the wiki page.
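A minimal sketch of this update step follows, assuming NumPy and a 4x4 homogeneous pose matrix. The names `rodrigues` and `apply_shift` are illustrative. Left-multiplying the pose by the shift is an assumption here, chosen because the shift was derived for points already transformed by the current pose.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix such that skew(a) @ b == cross(a, b)."""
    x, y, z = v
    return np.array([[0.0, -z, y],
                     [z, 0.0, -x],
                     [-y, x, 0.0]])

def rodrigues(r):
    """Rotation matrix for the rotation vector r (unit axis times angle)."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        # Near zero the first-order form I + skew(r) is accurate enough.
        return np.eye(3) + skew(r)
    K = skew(r / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def apply_shift(pose, x):
    """Compose the shift x = [r_shift, t_shift] with a 4x4 pose.

    Assumption: the shift is applied on the left, new_pose = shift @ pose.
    """
    shift = np.eye(4)
    shift[:3, :3] = rodrigues(x[:3])
    shift[:3, 3] = x[3:]
    return shift @ pose

# A quarter turn about z plus a translation, applied to the identity pose.
pose = np.eye(4)
x = np.array([0.0, 0.0, np.pi / 2, 1.0, 2.0, 3.0])
new_pose = apply_shift(pose, x)
```

Unlike I + skew(rshift), the matrix returned by `rodrigues` is an exact rotation, so the pose stays a valid rigid transform after many iterations.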