an aesthetic derivation
In robotics, before anything else, the robot must estimate its state. An aerial vehicle must know its position, velocity, and heading before planning its path and creating a control law to follow it. An autonomous vehicle must know its velocity and the position of surrounding vehicles. Even a Roomba vacuum cleaner must know its location relative to the household cat in order to avoid an early demise at the hands of General Mittens. The state is thus whatever quantity of interest must be estimated for subsequent safe, reliable robot operation.
Furthermore, the state is never known perfectly but rather estimated from sensor measurements, which are themselves imperfect. An accelerometer is corrupted by noise; a camera can suffer from motion blur; feature points on images may be incorrectly associated across frames, yielding outliers. The problem is thus statistical in nature, and is often solved in a Maximum A Posteriori (MAP) optimization framework which, under Gaussian sensor models, takes the form of a nonlinear least squares problem. For “well-behaved” problems with no outliers, Gaussian sensor models suffice. When outliers are present, robust losses are used to temper the effect of their large residuals on the solution quality.
For Gaussian sensor models, the Maximum A Posteriori problem takes the nonlinear least squares form
\begin{align} \hat{\mathbf{x}}&=\text{argmin}_{\mathbf{x}} J(\mathbf{x}) = \text{argmin}_{\mathbf{x}} \sum_{i=1}^N \frac{1}{2} \mathbf{e}_i^T(\mathbf{x}) \mathbf{e}_i(\mathbf{x}), \end{align}
where each $\mathbf{e}_i(\mathbf{x})$ is a nonlinear, vector-valued function of the state $\mathbf{x}$, and usually corresponds to the difference between the expected value of the $i$’th measurement (based on the sensor model) and the value actually received.
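For concreteness, here is one hypothetical example (the landmark position $\mathbf{l}_i$, range reading $r_i$, and noise standard deviation $\sigma_i$ are made up for illustration): a robot at position $\mathbf{x}$ measuring its range to a known landmark would contribute the error term
\begin{align} e_i(\mathbf{x}) &= \frac{r_i - \lVert \mathbf{x} - \mathbf{l}_i \rVert}{\sigma_i}, \end{align}
where dividing by $\sigma_i$ whitens the error, so that the Gaussian MAP objective reduces to the unweighted sum above.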
The Newton optimization update at an iterate $\mathbf{x}^k$ is derived by minimizing a quadratic expansion of $J$ about the current iterate, which yields
\begin{align} \frac{\partial^2 J}{\partial \mathbf{x} \partial \mathbf{x}^T} \Delta \mathbf{x} = -\frac{\partial J}{\partial \mathbf{x}^T}. \end{align}
For the nonlinear-least-squares form of the loss $J(\mathbf{x})$, the Jacobian and Hessian become
\begin{align} \frac{\partial J}{\partial \mathbf{x}} &= \sum_{i=1}^N \frac{\partial J}{\partial \mathbf{e}_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}} = \sum_{i=1}^N \mathbf{e}_i^T \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}}, \\ \frac{\partial^2 J}{\partial \mathbf{x} \partial \mathbf{x}^T} &= \sum_{i=1}^N \left(\frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}} +\sum_{j=1}^{n_{e_i}} e_{i,j} \frac{\partial ^2 e_{i,j}}{\partial \mathbf{x} \partial \mathbf{x}^T} \right). \end{align}
The second term in the Hessian sum is neglected to yield the Gauss-Newton update,
\begin{align} \left(\sum_{i=1}^N \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}}\right) \Delta \mathbf{x} &= -\sum_{i=1}^N \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \mathbf{e}_i. \end{align}
Typically we see all the error terms stacked into one big error, which yields a cleaner expression. However, for the sake of the robust loss derivation later on, we will keep them separated.
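To make the update concrete, here is a minimal Gauss-Newton sketch in NumPy that keeps the factors separate, as in the derivation above. The factor interface (a list of callables, each returning the pair $(\mathbf{e}_i, \partial\mathbf{e}_i/\partial\mathbf{x})$ at the current iterate) is just an assumption made for illustration, not a prescription.

```python
import numpy as np

def gauss_newton(x0, factors, num_iters=10):
    """Gauss-Newton for J(x) = sum_i 0.5 * e_i(x)^T e_i(x).

    `factors` is a list of callables, each returning (e_i, J_i), where
    e_i is the residual vector and J_i = de_i/dx at the current iterate.
    """
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(num_iters):
        H = np.zeros((n, n))  # approximate Hessian: sum_i J_i^T J_i
        g = np.zeros(n)       # gradient (column form): sum_i J_i^T e_i
        for factor in factors:
            e, Jac = factor(x)
            H += Jac.T @ Jac
            g += Jac.T @ e
        x = x + np.linalg.solve(H, -g)  # solve H dx = -g and step
    return x
```

A real implementation would add step-size control or Levenberg-Marquardt damping; the bare loop is only meant to show the structure of the update.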
As an aside, a convenient way to derive or check vector derivative identities is to expand the matrix operation as a sum, carry out the differentiation elementwise, and reassemble the output into a matrix expression. For example, the Jacobian of the quadratic form $\mathbf{e}^T \mathbf{e}$ (with $n_e$ the dimension of $\mathbf{e}$) may be derived by writing out the definition of the Jacobian, keeping in mind that we are working with a scalar, so the Jacobian has only one row,
\begin{align} \left . \frac{\partial \mathbf{e}^T \mathbf{e}}{\partial \mathbf{e}} \right|_{1j} &= \frac{\partial}{\partial e_j} \sum_{i=1}^{n_e} e_i^2 = 2e_j, \end{align}
which may be reassembled back into matrix form as
\begin{align} \frac{\partial \mathbf{e}^T \mathbf{e}}{\partial \mathbf{e}} &= 2\mathbf{e}^T. \end{align}
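Such identities are also easy to sanity-check numerically. Here is a quick forward-difference check of the identity above (the dimension and tolerance are chosen arbitrarily):

```python
import numpy as np

# Forward-difference check of d(e^T e)/de = 2 e^T at a random point.
rng = np.random.default_rng(0)
e = rng.standard_normal(5)
eps = 1e-6
numeric = np.array([
    ((e + eps * np.eye(5)[j]) @ (e + eps * np.eye(5)[j]) - e @ e) / eps
    for j in range(5)
])
print(np.allclose(numeric, 2.0 * e, atol=1e-4))  # prints True
```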
In the case of robust losses, a robust loss $\rho_i: \mathbb{R}\rightarrow \mathbb{R}$ is assigned to each quadratic form $f_i=\frac{1}{2}\mathbf{e}_i^T\mathbf{e}_i$, downweighting the effect of very large residuals:
\begin{align} \hat{\mathbf{x}}&=\text{argmin}_{\mathbf{x}} J(\mathbf{x}) = \text{argmin}_{\mathbf{x}} \sum_{i=1}^N \rho_i\left(\frac{1}{2}\mathbf{e}_i^T(\mathbf{x}) \mathbf{e}_i(\mathbf{x})\right). \end{align}
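As one concrete (and assumed, not prescribed by anything above) choice, the Huber loss with threshold $\delta$, written here as a function of $f_i$ rather than of the residual norm, is
\begin{align} \rho_i(f_i) = \begin{cases} f_i, & f_i \le \tfrac{1}{2}\delta^2, \\ \delta\sqrt{2 f_i} - \tfrac{1}{2}\delta^2, & \text{otherwise}, \end{cases} \qquad \frac{\partial \rho_i}{\partial f_i} = \begin{cases} 1, & f_i \le \tfrac{1}{2}\delta^2, \\ \frac{\delta}{\sqrt{2 f_i}}, & \text{otherwise}, \end{cases} \end{align}
which leaves small residuals ($\lVert\mathbf{e}_i\rVert \le \delta$) untouched and grows only linearly, rather than quadratically, in $\lVert\mathbf{e}_i\rVert$ beyond the threshold.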
The Jacobian of the loss is given by
\begin{align} \frac{\partial J}{\partial \mathbf{x}} &= \sum_{i=1}^N \frac{\partial J}{\partial \mathbf{e}_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}} = \sum_{i=1}^N\frac{\partial \rho_i}{\partial f_i}\mathbf{e}_i^T \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}}, \end{align}
while the Hessian of the loss is given by
\begin{align} \frac{\partial^2 J}{\partial \mathbf{x} \partial \mathbf{x}^T} &= \left( \frac{\partial}{\partial \mathbf{x}} \frac{\partial J}{\partial \mathbf{x}^T}\right)^T \\ &= \left( \frac{\partial}{\partial \mathbf{x}} \sum_{i=1}^N \left( \frac{\partial \rho_i}{\partial f_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \mathbf{e}_i \right) \right)^T. \end{align}
The derivative of the summand is obtained by a double application of the product rule, such that
\begin{align} \frac{\partial}{\partial \mathbf{x}} \left( \frac{\partial \rho_i}{\partial f_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \mathbf{e}_i \right) &= \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \mathbf{e}_i \, \frac{\partial}{\partial \mathbf{x}} \left(\frac{\partial \rho_i}{\partial f_i}\right) + \frac{\partial \rho_i}{\partial f_i} \left( \left( \frac{\partial}{\partial \mathbf{x}} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \right) \mathbf{e}_i + \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}} \right). \end{align}
The key step in the Gauss-Newton iteration, which carries over to the robust loss case, is neglecting the second-order derivatives of $\mathbf{e}_i$, such that $\frac{\partial}{\partial \mathbf{x}} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T}\approx \mathbf{0}$. Furthermore, the chain rule must be applied, since $\frac{\partial \rho_i}{\partial f_i}$ is itself a function of $\mathbf{x}$: $\frac{\partial}{\partial \mathbf{x}} \left(\frac{\partial \rho_i}{\partial f_i}\right) = \frac{\partial^2 \rho_i}{\partial f_i^2} \frac{\partial f_i}{\partial \mathbf{x}} = \frac{\partial^2 \rho_i}{\partial f_i^2} \mathbf{e}_i^T \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}}$. This yields
\begin{align} \frac{\partial}{\partial \mathbf{x}} \left( \frac{\partial \rho_i}{\partial f_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \mathbf{e}_i \right) &= \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \mathbf{e}_i \, \frac{\partial}{\partial \mathbf{x}} \left(\frac{\partial \rho_i}{\partial f_i}\right) + \frac{\partial \rho_i}{\partial f_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}} \\ &= \frac{\partial^2 \rho_i}{\partial f_i^2} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \mathbf{e}_i \, \mathbf{e}_i^T \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}} + \frac{\partial \rho_i}{\partial f_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}}. \end{align}
The overall Hessian is obtained by substituting the summand expression we just derived back into the sum,
\begin{align} \frac{\partial^2 J}{\partial \mathbf{x} \partial \mathbf{x}^T} &= \left( \frac{\partial}{\partial \mathbf{x}} \frac{\partial J}{\partial \mathbf{x}^T}\right)^T \\ &= \sum_{i=1}^N \left( \frac{\partial^2 \rho_i}{\partial f_i^2} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \mathbf{e}_i \, \mathbf{e}_i^T \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}} + \frac{\partial \rho_i}{\partial f_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}} \right), \end{align}
where the outer transpose can be dropped because each summand is symmetric.
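Since the neglected second-derivative term vanishes exactly when each $\mathbf{e}_i$ is linear in $\mathbf{x}$, the expression above can be verified numerically on a small linear problem. The Cauchy-style loss $\rho(f) = \log(1+f)$ and the random data below are assumptions made purely for the check:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 3, 2, 4
A = [rng.standard_normal((m, n)) for _ in range(N)]  # e_i(x) = A_i x - b_i is linear
b = [rng.standard_normal(m) for _ in range(N)]

def drho(f):   # first derivative of rho(f) = log(1 + f)
    return 1.0 / (1.0 + f)

def ddrho(f):  # second derivative of rho(f) = log(1 + f)
    return -1.0 / (1.0 + f) ** 2

def grad(x):   # gradient (column form): sum_i drho(f_i) * J_i^T e_i
    g = np.zeros(n)
    for Ai, bi in zip(A, b):
        e = Ai @ x - bi
        g += drho(0.5 * (e @ e)) * (Ai.T @ e)
    return g

def hess(x):   # sum_i ddrho * (J_i^T e_i)(e_i^T J_i) + drho * J_i^T J_i
    H = np.zeros((n, n))
    for Ai, bi in zip(A, b):
        e = Ai @ x - bi
        f = 0.5 * (e @ e)
        v = Ai.T @ e
        H += ddrho(f) * np.outer(v, v) + drho(f) * (Ai.T @ Ai)
    return H

x = rng.standard_normal(n)
eps = 1e-6
H_fd = np.column_stack([(grad(x + eps * np.eye(n)[j]) - grad(x)) / eps for j in range(n)])
print(np.allclose(H_fd, hess(x), atol=1e-4))  # prints True
```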
The first term in the sum, involving the second derivative of the robust loss, is called the Triggs correction. In practice it is often dropped, since $\frac{\partial^2 \rho_i}{\partial f_i^2} < 0$ for large residuals and the term can make the approximate Hessian indefinite; doing so leaves the robust Gauss-Newton update
\begin{align} \left(\sum_{i=1}^N \frac{\partial \rho_i}{\partial f_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}}\right) \Delta \mathbf{x} &= -\sum_{i=1}^N\frac{\partial \rho_i}{\partial f_i} \frac{\partial \mathbf{e}_i}{\partial \mathbf{x}^T} \mathbf{e}_i. \end{align}
Comparing this expression to the Gauss-Newton case clarifies why the scheme is called iteratively reweighted least squares. At each iteration, we take each factor’s unweighted error $\mathbf{e}_i$, rescale it by $\sqrt{\frac{\partial \rho_i}{\partial f_i}}$ (with the weight evaluated at the current iterate and treated as a constant), and apply the ordinary Gauss-Newton update to the rescaled errors, which reproduces exactly the robust update we just derived. The weights are re-evaluated at every iteration, hence “iteratively reweighted.” Note that if only a single factor is present, the weight $\frac{\partial \rho_i}{\partial f_i}$ appears on both sides of the update and cancels, so the robust loss does not change the Gauss-Newton iterates at all.
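In code, the only change relative to a plain Gauss-Newton loop is the per-factor weight. A minimal sketch, again assuming a made-up factor interface and using a Huber weight as an illustrative choice:

```python
import numpy as np

def irls_gauss_newton(x0, factors, drho, num_iters=10):
    """Robust Gauss-Newton with the Triggs correction dropped (IRLS).

    `factors` is a list of callables returning (e_i, J_i) at the current
    iterate; `drho` maps f_i = 0.5 * e_i^T e_i to the weight d(rho_i)/d(f_i).
    """
    x = np.asarray(x0, dtype=float)
    n = x.size
    for _ in range(num_iters):
        H = np.zeros((n, n))
        g = np.zeros(n)
        for factor in factors:
            e, Jac = factor(x)
            w = drho(0.5 * (e @ e))  # scalar weight, frozen at this iterate
            # Identical to plain Gauss-Newton applied to sqrt(w) * e_i.
            H += w * (Jac.T @ Jac)
            g += w * (Jac.T @ e)
        x = x + np.linalg.solve(H, -g)
    return x

def drho_huber(f, delta=1.0):
    """Huber weight d(rho)/d(f) as a function of f = 0.5 * ||e||^2."""
    r = np.sqrt(2.0 * max(f, 1e-12))  # residual magnitude ||e||
    return 1.0 if r <= delta else delta / r
```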
Different robust losses may be used, and correspond to specific heavy-tailed distributions if we work “backward” to the MAP problem from the negative log-likelihood. A Cauchy loss corresponds to a Cauchy distribution assumed on sensor noise. However, the motivation for using robust losses is usually not any kind of rigorous statistical argument. We do not consider the statistical properties of erroneous feature associations across camera frames to determine the “right” robust loss. Rather, we use trial and error. It works because it works, and because we have nice methods for solving the resulting problem. This somewhat echoes a theme I’ve seen on Ben Recht’s substack. We often work with given models not because they are the most accurate or rigorous ones. Rather, we work with them because we are able to solve them. Afterward, we can try to justify why these models correspond to reality.