Linear Quadratic Regulators

No, they’re not named after the three kids on my highschool football team that thought they were super cool and referred to themselves as the regulators.

Building off of what we talked about in the case of the Kalman filter, we can derive a state space regulator that is also optimal in some sense. This sense is to drive some state x to the smallest possible value as soon as possible, but without spending too much control effort to achieve that goal. Given a continuous-time linear time invariant (LTI) system of the form

\begin{aligned}  1) \displaystyle \qquad \dot{x} &= Ax + Bu, \quad x \in \mathbb{R}^n, u \in \mathbb{R}^k, \\  y &= Cx, \qquad y \in \mathbb{R}^m, \\  z &= Gx + Hu, \quad z \in \mathbb{R}^l \\  \end{aligned}  

The above dynamic system looks more complicated that it needs to be because we have a measured output y, and a controlled output z. The measured output is the signal that we have available for measurement, whereas the controlled output is the signal that we would like to make as small as possible in the shortest time possible. It’s possible that z and y are the same, or z can be comprised of y and it’s derivative, etc.

The LQR problem is then to find a control input, u, that minimizes the following cost function:

\begin{aligned}  2) \displaystyle \qquad J_{LQR} = \int_0^\infty \| z(t) \|^2 + \rho \|u(t) \|^2 dt  \end{aligned}  

\rho is simply a tuning parameter used to weigh which part of our minimization problem we want to place more importance on. Is it more important for us to minimize the energy of the controlled output, z? Then select \rho to be very small. Is it more important for us to minimize the energy of the control signal, u? Then select \rho to be very large. One of the things that makes the LQR a bit annoying to some is that the LQR problem involves minimizing both of these energies.

Equation 2 is useful because it illustrates more compactly what we are trying to achieve, but the LQR problem is usually written more generally in textbooks as

\begin{aligned}  3) \displaystyle \qquad J_{LQR} = \int_0^\infty \left ( z^T \bar{Q} z + \rho \, u^T \bar{R} u \right ) dt  \end{aligned}  

Where \displaystyle \bar{Q} \in \mathbb{R}^{lxl} and \displaystyle \bar{R} \in \mathbb{R}^{mxm} are symmetric positive-definite matrices.

Or we can represent things in a bit more complicated way as:

\begin{aligned}  4) \displaystyle \qquad J_{LQR} = \int_0^\infty \left ( x^T Q x + u^T R u + 2x^T N u \right ) dt  \end{aligned}  

Without getting into the math yet (algebraic Riccati equations, yay!) the feedback control law that minimizes the value of the above cost function is

\begin{aligned}  5) \displaystyle \qquad u = -Kx  \end{aligned}  

Where the control gain, K is given by:

\begin{aligned}  6) \displaystyle \qquad K = R^{-1} \left ( B^T P +N^T \right )  \end{aligned}  

Where P is the solution to the, in this case, continuous time algebraic Ricatti equation (ARE):

\begin{aligned}  7) \displaystyle \qquad A^TP + PA + Q - \left (PB + N) \right ) ^{-1} \left (B^T P +N^T \right )=0  \end{aligned}  

Note: another thing I’ve noticed is that algebraic Riccati equation is a bit of a buzz word with a lot of students not being able to define what one is and profs just throwing it around like “ofcourse, if you remember back to 3rd grade, just after learning your multiplication tables you’ll recall the ever present algebraic Ricatti equation.” -_-
Now, a Riccati equation is just any first-order ordinary differential equation that is quadratic in the unknown function. In our case, P is the unknown function, and if we carry out all of the matrix multiplications we can easily see that the equation in question is quadratic in P.

Now, this is about the point where people start to get frustrated with LQRs… not only do we have these stupid matrices Q and R that we have to figure out what to do with, but we also have to figure out how to solve the ARE above. Oh, and there’s now this nuisance of a cross term, N, that most books decide, in a super frustrating fashion, to just ignore talking about.

To arrive back at the more simple representation in equation 2 we use the following relations:

\begin{aligned}  8) \displaystyle \qquad Q &= G^T G\\  R &= H^T H + \rho I \\  N &= G^T H  \end{aligned}  

And to arrive back at the slightly more general equation 3, use the following relations:

\begin{aligned}   9) \displaystyle \qquad Q &= G^T \bar{Q} G \\   R &= H^T \bar{Q} H + \rho \bar{R} \\   N &= G^T \bar{Q} H   \end{aligned}   

If you’re using Matlab, as it seems like controls theoriticians are absolutely in love with (move to python, please), you can use the command:

[K, P, E] = lqr(A,B,Q,R,N)

to solve the ARE and compute the optimal state feedback gain matrix, K.

Now we are left with a couple things to scratch our head about.

First, what the hell do we do with those \displaystyle \bar{Q}  and \displaystyle \bar{R}  matrices? And second, how the hell do we actually solve that algebraic Riccati equation?

The first one is a bit easier to tackle. The simplest thing to do is set \displaystyle Q = I, \ R = \rho I . From here we just vary the single parameter \displaystyle \rho until we get something that has a decent response.

Many books suggest using Bryson’s Rule to begin your tuning of the LQR [1]

\begin{aligned}   10) \displaystyle \qquad \bar{Q}_{ii} &= \frac{1}{maximum \ acceptable \ value \ of \ z_i^2}, \qquad i \in \left (1,2,\dotsb,l \right ),\\   \bar{R}_{jj} &= \frac{1}{maximum \ acceptable \ value \ of \ u_j^2}, \qquad j \in \left (1,2, \dotsb, k \right )   \end{aligned}   

which then corresponds to the following cost function

\begin{aligned}  11) \displaystyle \qquad J_{LQR} = \int_0^\infty \left ( \sum_{i=1}^l \bar{Q}_{ii} z_i(t)^2 + \rho \sum_{j=1}^m \bar{R}_{jj} u(t)^2 \right ) dt  \end{aligned}  

This effectively sets the maximum acceptable value for each term to be 1. This generally gives semi-decent results, but really only acts as a starting point for us to start twiddling our control engineer knobs.

For example, if for our first state we want the maximum possible error to be 1 m and the second state to have a maximum possible error of 1/60 radians, we can use the following

\begin{aligned}  \displaystyle \qquad q_{11} &= \left ( \frac{1}{1} \right ) ^2 \qquad q_{11}x_1^2 = 1 \ when \ x_1 = 1m \\  q_{22} &= (60)^2 \qquad q_{22}x_2^2 = 1 \ when \ x_2 = \frac{1}{60} \ rad  \end{aligned}  

The second question can be easy or hard… If you just want to use an already available library, like the python control library, you can just use the lqr method of the control class.

OR, you can write your own solver piggybacking on the scipy linear algebra library using the following code snippet[2]:

import numpy as np
import scipy.linalg
 
def lqr(A,B,Q,R):
    """Solve the continuous time lqr controller.
     
    dx/dt = A x + B u
     
    cost = integral x.T*Q*x + u.T*R*u
    """
    #ref Bertsekas, p.151
 
    #first, try to solve the ricatti equation
    X = np.matrix(scipy.linalg.solve_continuous_are(A, B, Q, R))
     
    #compute the LQR gain
    K = np.matrix(scipy.linalg.inv(R)*(B.T*X))
     
    eigVals, eigVecs = scipy.linalg.eig(A-B*K)
     
    return K, X, eigVals
 
def dlqr(A,B,Q,R):
    """Solve the discrete time lqr controller.
     
     
    x[k+1] = A x[k] + B u[k]
     
    cost = sum x[k].T*Q*x[k] + u[k].T*R*u[k]
    """
    #ref Bertsekas, p.151
 
    #first, try to solve the ricatti equation
    X = np.matrix(scipy.linalg.solve_discrete_are(A, B, Q, R))
     
    #compute the LQR gain
    K = np.matrix(scipy.linalg.inv(B.T*X*B+R)*(B.T*X*A))
     
    eigVals, eigVecs = scipy.linalg.eig(A-B*K)
     
    return K, X, eigVals

If you’re interested in writing your own solver, what you can do is solve the ARE via the eigenvalues and eigenvectors of the Hamiltonian matrix.

I’ll expand on this Hamiltonian matrix solution more later, but for now I’ll leave off by saying that we first construct the Hamiltonian matrix, find the eigenvalues and eigenvectors of it, selected the eigenvectors associated with stable eigenvalues, then compute the steady state solution P from these.

References

  1. Hespahna, J. (2009). Linear Systems Theory. Princeton University Press
  2. http://www.kostasalexis.com/lqr-control.html

The Extended Kalman Filter

What’s better than a Kalman filter? An Extended Kalman Filter! [1] (I’m just kidding, this isn’t from the book, but wouldn’t it be awesome if it was? Most of the other stuff in here is, though.)

Up until now all of our estimators have necessarily required the system dynamics to be linear and time invariant. Unfortunately, most things in the real world are non-linear, which sucks. One of the most common ways of dealing with this when using Kalman filters is to employ the extended Kalman filter.

The main difference between the EKF and standard KF is that the EKF is able to handle non-linear dynamics by linearizing the non-linear system around the Kalman filter estimate, and then the Kalman filter estimate is based on the linearized system. It’s a bit of a mouthful, but essentially we are just linearizing our system around our estimate, and then feeding that back in to the filter. It’s bootstrapping.

Let’s get crackin’!

For the case of the discrete time extended Kalman filter, we have the usual suspects of system and measurement equations, but this time the state transition matrix F and the output matrix H can be nonlinear.

\begin{aligned}  1) \displaystyle \qquad x_k &= F_{k-1}(x_{k-1}, u_{k-1}, w_{k-1})\\  y_k &= H_k(x_k, v_k) \\  w_k &\sim (0, Q_k) \\  v_k &\sim (0, R_k)\\  \end{aligned}  

We then initialize the EKF as before:

\begin{aligned}  2) \displaystyle \qquad \hat{x}_0^+ &= E(x_0)\\  P_0^+ &= E \left[ \left(x_0 - \hat{x}_0^+ \right ) \left(x_0 - \hat{x}_0^+ \right)^T \right ] \\  \end{aligned}  

Once we let the EKF start rolling in measurements there is a lot more work to be done now that we are linearizing the system. For each time step we then perform the following step to linearize the state transition matrix about the previous steps a posteriori state estimate:

\begin{aligned}  3) \displaystyle \qquad F_{k-1} &= \frac{\partial{f_{k-1}}}{\partial{x}} \bigg\rvert_{\hat{x}_{k-1}^+}\\  L_{k-1} &= \frac{\partial{f_{k-1}}}{\partial{w}} \bigg\rvert_{\hat{x}_{k-1}^+}\\  \end{aligned}  

After computing the partial derivatives update the state estimate and estimation-error covariances:

\begin{aligned}  4) \displaystyle \qquad P_k^- &= F_{k-1}P_{k-1}^+F_{k-1}^T+L_{k-1}Q_{k-1}L_{k-1}^T\\  \hat{x}_k^- &= f_{k-1} \left ( \hat{x}_{k-1}^+, u_{k-1}, 0 \right)\\  \end{aligned}  

The time update of the state estimate above just indicates to plug the previous steps a posteriori estimate back in to the original nonlinear system model.

With the a preori state estimate for the current time step calculated, we then linearize the output model and evaluate it about the a preori state estimate.

\begin{aligned}  5) \displaystyle \qquad H_{k} &= \frac{\partial{h_{k}}}{\partial{x}} \bigg\rvert_{\hat{x}_{k}^-}\\  M_{k} &= \frac{\partial{h_{k}}}{\partial{v}} \bigg\rvert_{\hat{x}_{k}^-}\\  \end{aligned}  

We then use the linearized output model evaluated about the a preori state estimate to calculate the Kalman gain, and therefore the a posteriori state estimate and a posteriori estimation-error covariance.

\begin{aligned}  6) \displaystyle \qquad K_k &= P_k^- H_k^T \left( H_k P_k^- H_k^T + M_k R_k M_k^T \right ) ^{-1}\\  \hat{x}_k^+ &= \hat{x}_k^- + K-k \left [y_k - h_k(\hat{x}_k^-, 0) \right ]\\  P_k^+ &= \left (I - K_k H_k \right ) P_k^-  \end{aligned}  

EKFs, The Popular Crowd Kalman Filter

The EKF is one of the most popular tools for state estimation that you’ll come across within the robotics community. As opposed to tools like the particle filter, the EKF is much more computationally efficient because it represents the belief by a multivariate Gaussian distribution. Just like the standard Kalman filter, each update of the EKF requires time O(k^{2.4} + n^2), where k is the dimension of the measurement vector, and n is the dimension of the state vector.[2]

You may be thinking, “Surely, Eric, this is as deep as it goes,” but you’d be wrong. Why? Well, one of the limitations of the EKF is that we are approximating our state transition and measurement matrices by Taylor expansions (that whole partial derivative thing above). In most real-world applications these matrices are nonlinear (duh, that’s why we were using the EKF in the first place!) but if they are highly nonlinear about our evaluation point this approximation may not be good enough.

How Much More Complicated Can This Get?

Well, we have to keep grad students and profs busy, so there are lots of more complicated variations. We can represent posteriors using mixtures or sums of Gaussians, such is the case of the multi-hypothesis extended Kalman filter, or MHEKF. Then there’s the
information filter, which propagates the inverse of the estimation-error covariance, which can be computationally more efficient if we have significantly more measurements than states. Then there’s the \boldsymbol{ \alpha - \beta} filter, which is actually less complicated because it’s been studied since the days of Newton (it involves a two state Newtonian dynamical system with a position measurement). We have fading memory Kalman filters for when we have a crap system model and we don’t want our estimate to diverge from the true state, The \boldsymbol{H_\infty} filter for when we need a very robust filter to minimize the worst-case estimation error, the hybrid extended Kalman filter, for when we want to combine continuous and discrete time dynamics, the second order extended Kalman filter where we employ a second order Taylor series expansion of our state transition and measurement matrices to reduce the linearization error of the standard EKF, the iterated extended Kalman filter where we perform our linearization process again (or as many times as we like) after getting our a posteriori state estimate. If we want to get even trickier there’s the unscented Kalman filter that aims to further reduce linearization errors present in the EKF by transforming our probability density functions through a nonlinear function (which is difficult) by means of deterministic vectors known as sigma points whose ensemble mean and covariance are equal to our state mean and estimation-error covariance. Finally, we have the very common particle filter that just brute forces its way through estimation, which is pretty awesome now that computation power is relatively cheap.

References

  1. Simon, D. (2006). Optimal State Estimation. Hoboken, New Jersey: Wiley. ISBN: 9780471708582
  2. Thrun, S., Burgard, W.,, Fox, D. (2005). Probabilistic robotics. Cambridge, Mass.: MIT Press. ISBN: 0262201623 9780262201629 

Kalman Filters, They’re Magic!

They’re not really magic, and I’ll probably hate you if you say they are.

The Kalman filter, and variations thereof, are probably the most commonly come across optimal state estimators that you will find. They just work well and mathematically are pretty, well, badass. I’ve seen plenty of people use them despite knowing their noise characteristics, or really having too good of an idea of what the system processes are, and still work decently well when tuned. That said, one of the most common causes for failure of Kalman filtering is due to modeling errors.

Guts of the Kalman Filter

We’ve already talked about a few different flavors of least squares estimators, and the Kalman filter continues to build on these concepts. In the case of the recursive least squares estimator, we now only need to change the symbols we use slightly to denote a priori and a posteriori state estimates and covariances. Oh, and actually have an idea of what the hell it is our system is doing, along with its internal noise process. This then allows us to estimate what the next set of states will be for the system given our model of it, and also incorporates our actual measurements in to it. One of the beautiful things about the Kalman filter is that the Kalman gain term then selectively weighs how much we should trust our model versus our measurement given the measurement statistics.

With the extra complication of having to know how our system evolves in time a la its state transition matrix, we can describe the discrete-time Kalman filter by first modelling our dynamic system as follows:

\begin{aligned}  1) \displaystyle \qquad x_k &= F_{k-1}x_{k-1} + G_{k-1} u_{k-1} + w_{k-1}\\  y_k &= H_k x_k + v_k \\  E \left( w_k w_j^T \right ) &= Q_k \delta_{k-j} \\  E \left( v_k v_j^T \right ) &= R_k \delta_{k-j} \\  E \left( w_k v_j^T \right ) &= 0  \end{aligned}  

Woah, woah, woah. There’s a whole other noise process now that we have to figure out what the hell to do with? That’s right! But fortunately for us the process noise covariance, Q, is often just used as a tuning parameter in the filter. As the process noise covariance is increased, the covariance between samples will increase, thereby making the Kalman filter gain coverge to a larger value, which will then cause a larger emphasis to be placed on measurements. From this it’s easy to see we can use Q to favor or devalue the effect of our measurements if we have the desire to do so.

Moving on to initializing the Kalman filter:

\begin{aligned}  2) \displaystyle \qquad \hat{x}_0^+ &= E(X_0)\\  P_0^+ &= E \left [ (x_0 - \hat{x}_0^+)(x_0 - x_0^+) \right ]  \end{aligned}  

Where the \hat{x}_0^+ refers to the initial a posteriori state estimate, and P_0^+ refers to the initial a posteriori covariance.

With this model and initialization, it’s time to just let the Kalman filter chug along with the following equations at each time step k = 1, 2, etc.

\begin{aligned}  3) \displaystyle \qquad P_k^- &= F_{k-1}P_{k-1}^+ F_{k-1}^T + Q_{k-1}\\  K_K &= P_k^- H_k^T \left (H_k P_k^- H_k^T + R_k \right ) ^{-1} \\  &= P_k^+ H_K^T R_k^{-1} \\  \hat{x}_k^- &= F_{k-1} \hat{x}_{k-1}^+ + G_{k-1} u_{k-1} \\  \hat{x}_k^+ &= \hat{x}_k^- + K_k \left( y_k - H_k \hat{x}_k^- \right ) \\  P_k^+ &= \left (I - K_kH_k \right ) P_k^- \left (I- K_k H_k \right )^T + K_k R_k K_k^T \\  &= \left [ (P_k^-)^{-1} + H_k^T R_k^{-1} H_k \right ] ^{-1}\\  &= \left (I - K_k H_k \right )P_K^- \\  \end{aligned}  

Where \hat{x}_k^- is the a priori state estimate, and \hat{x}_k^+ is the a posteriori state estimate. The first representation of the a posteriori covariance is known as the Joseph stabilized version, and is the most robust, whereas the third version is much more succinct, but is less robust. If the second expression for the Kalman gain is used, so too must the second expression for the a posteriori covariance.

Cool Things to Take Away

In the measurement update equation above, the quantity (y_k - H_k \hat{x}_k^-) is known as the innovations, which is simply the difference between the the actual measurement and the predicted measurement. This is the part of the measurement that contains new information about the state. As you can see, the Kalman filter then weights the innovations by the kalman gain, and then adds it to the a priori state estimate, to finally obtain the a posteriori state estimate. Therefore, the only difference between the a priori and a posteriori state estimate is that the latter takes the new measurement into account, whereas the former is the state estimate according to what our dynamic model predicts. If you’re coming at this from a mobile robotics application, people will generally refer the dynamic model as the motion model, and the state estimate the belief.

I won’t go into the proof, but the Kalman filter is the optimal filter given Gaussian noise processes. Even if the noise processes are not Gaussian, the Kalman filter is still the optimal linear filter.

I really like the little block diagram that wikipedia uses on their page for the Kalman filter; I think it does a pretty nice job of illustrating the flow of the filter.

Kalman Filter Block Diagram Courtesy https://en.wikipedia.org/wiki/Kalman_filter

Another way that I’ve seen the Kalman Filter depicted is with a drawing similar to this one from Mathworks:

In the above image, I would prefer it if the mean for the leftmost, blue, Gaussian was labelled \hat{x}_k^- and the mean for the center Gaussian was labelled \hat{x}_k^+, but I think you get the idea they are trying to convey.

Lastly, I feel like there isn’t much need in writing your own filter from scratch as there are a decent number of libraries out there that you can piggy back off of. If you want to give FilterPy a try, it has a class named kalman that will do basically what you’d like, otherwise there are libraries like PyKalman.

Verifying Your Kalman Filter Isn’t Broken and Final Thoughts

One of the easiest things to do to verify the performance of your Kalman filter is to monitor what the innovations are doing. The innovations should be a white noise process with zero mean with a covariance of

4) \displaystyle \qquad \left (H_kP_k^-H_k^T + R_k \right )  

If we see that the innovations appear to be colored, nonzero-mean, or has the wrong covariance, then there is something wrong with the filter; this is usually due to an error in modelling.

Oh, also make sure your error covariance matrix, P is converging (kind of important).

Sometimes to tie a bunch of the concepts together people will refer to Kalman filters as Linear Quadratic Estimators, simply because we assume a linear system, and our cost minimization function is quadratic.

Until next time, happy filtering!

Recursive Least Squares Estimation

So, we’ve talked about least squares estimation and how we can weight that estimation based on our certainty in our measurements. Now let’s talk about when we want to do this shit online and roll in each subsequent measurement!

Things begin to get a bit more complicated now that we have an “update” step that requires us to fold in new information, but we are getting closer and closer to the more complicated Kalman filter. In order to fold in this newly acquired data, we add in a matrix, K_k. known as the estimator gain matrix, or if you’re dealing with a Kalman filter it’s called the Kalman gain matrix, either way, it’s the same thing: just a weighting of how confident we are in our measurements based on their statistics.

We will therefore amend our state equations as follows:

\begin{aligned}  1) \displaystyle \qquad y_k &= H_kx + v_k)\\  \hat{x}_k &= \hat{x}_{k-1} + K_k \left (y_k - H_k \hat{x}_{k-1}\right ) \\  \end{aligned}  

The difference this time being that we compute \hat{x_k} based on the previous estimate, \hat{x}_{k-1} and the measurement y_k.

For the sake of expediency I’m going to skip most of the derivation of this one because the matrix equations continue to get longer, so let’s just jump right to the useful part of it all.

Recursive Least Squares Estimation Algorithm

We begin by first initiating the estimator as follows:

\begin{aligned}  2) \displaystyle \qquad \hat{x}_0 &= E(x)\\  P_0 &= E \left [ (x - \hat{x}_0)(x - \hat{x}_0)^T \right ] \\  \end{aligned}  

Where P_k is the estimation-error covariance. There’s a bit of a trick we play next to get the ball rolling: if we don’t know anything about x before measurements start rolling in, we set P_0 = \infty I. If we know exactly what x is before measurements roll in we set P_0 = 0. So, basically the less we know about x the larger we set our estimation-error covariance.

With the estimator initialized, let’s start chugging along. We update our estimate of x and the estimation-error covariance, P, with the following equations:

\begin{aligned}  3) \displaystyle \qquad K_k &= P_{k-1}H_k^T \left ( H_k P_{k-1} H_k^T + R_k \right )^{-1}\\  \hat{x}_k &= \hat{x}_{k-1} + K_k \left ( y_k - H_k \hat{x}_{k-1} \right) \\ P_k &= \left (I - K_kH_k \right ) P_{k-1} \left (I - K_kH_k \right )^T + K_kR_kK_k^T  \end{aligned}  

Where R_k is the covariance of our zero-mean noise process, v_k. In the real world this is literally just the variance in your measurement variable when your state is held constant. So, if our output measurement variable were, for example, the angular velocity of a table saw blade, we could run the table saw at constant velocity, collect our data, and determine R to be the variance in our angular velocity measurements.

Here’s a picture I found from researchgate[1] that illustrates the effect of a recursive least squares estimator (black line) on measured data (blue line). This scenario shows a RLS estimator being used to smooth data from a cutting tool. Don’t worry about the red line, that’s a bayesian RLS estimator.

Slightly Sexier Examples (Curve Fitting)

Despite what many people with their resistor based examples of what least squares estimators will tell you, we can actually do some pretty cool things with these guys. Take for example curve fitting. Let’s suppose we are measuring the altitude of a free falling object. From basic physics we know that the underlying governing equation for this is a quadratic function of time:

\begin{aligned}  4) \displaystyle \qquad r = r_0 + v_0t + \frac{a}{2}t^2  \end{aligned}  

Where r is position, r_0 is the initial position, v_0 is the initial velocity, and a is the acceleration. If we then measure r at many different times and fit a quadratic to the resulting position vs time curve, we will have an estimate of our initial velocity and acceleration.

In this scenario, our output equation takes the form

\begin{aligned}  5) \displaystyle \qquad y_k &= x_1 + x_2 t_k + x_3 t_k^2 + v_k \\  E(v_k^2) &= R_k \\  H_k &= \left [ 1 \, t_k \, t_k^2 \right ]  \end{aligned}  

From here, it’s just a simple matter of initializing the estimator and then letting the algorithm governed by the equations set out in 3 do its thing, or what ever other flavor of least squares estimation you prefer.

In the words of Jerry Smith, later days, amigo.

Resources

  1. Mehta, Parikshit & Mears, Laine. (2014). Adaptive control for multistage machining process scenario—bar turning with varying material properties. International Journal of Advanced Manufacturing Technology. 10.1007/s00170-014-6739-x.

Weighted Least Squares Estimation of a Stationary Linear System

I figured I’d kind of skip over this variant in the least squares estimator, after all, my goal is to talk about all the stepping stones to get to the Kalman filter, and I’m a little impatient, but I like talking about the weighted least squares estimator because it shows that even if some of our measurements are less reliable than others we should never discard them. If we continue to run with the beat to death example of a resistor, let’s now assume we are using a couple different multimeters to make measurements, one that is really expensive and calibrated, and one that’s been purchased from Amazon for 10 dollars.

The above scenario essentially boils down to us saying we have more faith in one measuring device than the other, but how do we represent this in a mathematically meaningful way? With statistics!

Let’s take the same equations from our last little talk about least squares estimation, but modify them now by actually characterizing the noise that we are experiencing. Now, if we assume that the noise for each measurement is zero-mean and independent we can write our expectation and measurement covariance matrix as follows:

\begin{aligned}  1) \displaystyle \qquad E(v_i^2) &= \sigma_i^2 \qquad (i=1, \dotsb, k)\\  R &= E(vv^T) \\  &= diag \left (\sigma_1^2, \dotsb , \sigma_k^2 \right )  \end{aligned}  

We then construct our cost function like we did before, but this time weigh the sum of squares equation by the variance in our measurements. This allows us to place more emphasis on measurements that are less noisy and less emphasis on measurements that are very noisy. This technique of weighing things by how much faith we have in measurements will continue to show up as we discuss “better” estimators. This weighting results in a cost function as follows:

\begin{aligned}  2) \displaystyle \qquad J &= \frac{\epsilon_{y1}^2}{\sigma_1^2} + \dotsb + \frac{\epsilon_{yk}^2}{\sigma_k^2}\\  &= \epsilon_y^T R^{-1} \epsilon_y \end{aligned}  

Honestly, when we carry out the matrix multiplication shown above we get a pretty dumb and long looking equation that I don’t really care to type out, and unless you’re just looking for a proof for a homework assignment it doesn’t matter much. Let’s call it good enough to show that when we again take the partial derivative of our cost function with respect to \hat{x} we arrive at

\begin{aligned}  3) \displaystyle \qquad \hat{x} = \left ( H^TR^{-1}H \right )^{-1}H^TR^{-1}y \end{aligned}  

It goes without saying that since we are inverting R in equation 3 it must be nonsingular, in other words, we have to assume that each of our measurements are affected by at least some amount of noise.

More Examples!

If we return to our example of estimating the resistance of a resistor, we can basically just employ equation 3 directly, which ends up being

\begin{aligned}  4) \displaystyle \qquad \hat{x} &= \left ( H^TR^{-1}H \right )^{-1}H^TR^{-1}y \\  &= \left( \sum \frac{1}{\sigma_i^2} \right )^{-1} \left (\frac {y_1}{\sigma_1^2} + \dotsb + \frac{y_k}{\sigma_k^2} \right ) \end{aligned}  

This time, instead of just the average of measurements our solution becomes the weighted sum of measurements, where each measurement is weighted by the inverse of our uncertainty in that measurement.

Boom.

Least Squares Estimation of Stationary Linear Systems

Least squares estimators are typically used to determine a constant on the basis of several noisy measurements of that constant. Mathematically I think they’re pretty beautiful, and the derivation of them leads us to the somewhat ubiquitous Moore-Penrose pseudo inverse.

Let’s just jump right in!

Suppose we want to determine a “best” estimate of some state constant n-element state vector, x, and y is a k-element noisy measurement vector. Typically estimates of something are denoted as that “something” with a hat over it, so we will call our estimated vector \hat{x}. Let’s assume that the process governing this system is linear, otherwise our lives have just gotten a whole lot harder, such that each element of the measurement vector y is a linear combination of elements of the state vector, x, with some added noise on top.

\begin{aligned} 1) \displaystyle \qquad y_1 &= H_{11}x_1 +\dotsb +H_{1n}x_n + v_1 \\  &\vdots \\ y_1 &= H_{k1}x_1 +\dotsb +H_{kn}x_n + v_k  \end{aligned} 

Which can be written in matrix form simply as:

2) \displaystyle \qquad y = Hx + v

Now, let’s introduce something called the measurement residual, which is simply the difference between the noisy measurement (actual system output) and the estimated output.

3) \displaystyle \qquad \epsilon_y = y - H \hat{x}

So, as we can see from equation 3, the smaller the measurement residual becomes, in some sense, the better our estimate of that parameter is. Simply by inspection it makes sense that if our measurement residual becomes zero we have done a pretty good job estimating x.

Since we are dealing with least squares estimation, the above “some sense” is in a least squares sense, so we will construct a cost function, J, that is the sum of squares of our measurement residuals as follows:

\begin{aligned}  4) \displaystyle \qquad J &= \epsilon_{y1}^2 + \dotsb + \epsilon_{yk}^2\\  &= \epsilon_y^T \epsilon_y   \end{aligned}  

Now, let’s substitute equation 3 into equation 4 so we obtain the cost function in terms of our measurement vector, y, the output matrix, H, and estimated state vector \hat{x} .

\begin{aligned}  5) \displaystyle \qquad J &= \left (y - H \hat{x} \right )^T \left (y - H \hat{x} \right )\\  &= y^Ty - \hat{x}^TH^Ty - y^TH\hat{x} + \hat{x}^TH^Th\hat{x}   \end{aligned}  

Cool, now that we have our cost function constructed, the next step is to minimize said function with respect to \hat{x}. Remembering back to calc 3 we know that in order to minimize a multivariable equation with respect to one of its variables we must compute the partial derivative and set it equal to zero.

\begin{aligned}  6) \displaystyle \qquad \frac{\partial J}{\partial \hat{x}} &= -y^TH -y^TH + 2 \hat{x}^TH^TH\\  &= 0  \end{aligned}  

If we gently massage equation 6 so as to solve for our state estimate we obtain:

\begin{aligned}  7) \displaystyle \qquad H^Ty &= H^TH\hat{x}\\  \hat{x} &= \left (H^TH \right ) ^{-1}H^Ty \\  &= H^Ly  \end{aligned}  

Where H^L is the left pseudo inverse. The pseudo inverse exists if k \geq n and H is full rank. This just means that we need to have more measurements than there are number of variables and the measurements must be linearly independent.

Example Time!

The example that I always see given for this simple problem is that of estimating the resistance of an unmarked resistor based on k measurements from a multimeter. In this scenario our state variable x is the resistance of the resistor, and y are our noisy measurements from the multimeter.

\begin{aligned}  8) \displaystyle \qquad y_1 &= x + v_1\\  &\vdots \\  y_k &= x + v_k \\  \end{aligned}  

which again just reduces to the simple matrix equation

\begin{aligned}  \displaystyle \qquad y = x + v\\  \end{aligned}  

What the hell, our output matrix isn’t there! Well, it is, it’s just a k x 1 matrix filled with 1’s. Now, if we just plug this into equation 7 from above, our optimal (in the sense of least squares) estimate of the state x is simply

\begin{aligned}  9) \displaystyle \qquad \hat{x} &= \left (H^TH \right )^{-1}H^Ty\\  &= \frac{1}{k}\left (y1 + \dotsb + yk \right)  \end{aligned}  

In this, almost stupidly, simple example, the least squares estimate reduces to just what a normal person would have done and is the average of the noisy measurements.

Allan Variance and Its Use in Characterizing Inertial Measurement Unit Errors

This is going to be a super quick and dirty writeup until I take the time to make it more presentable, but here we go!

For the time being this is mostly going to just be a link and short description of my allan variance github repo used to create synthetic IMU error models using a first order Gauss-Markov process: https://github.com/EVictorson/allan_variance

Also, I’m going to REALLY simplify the technical discussion here because this seems to be one of those areas where PhDs in control theory like to start waving their… boy parts… around to show who knows more about what, which is unnecessary.

Allan Variance Background

The Allan Variance (AVAR) is historically used as a measure of frequency stability in clocks, oscillators, and amplifiers. It is named after the one and only David W. Allan.

Mathematically the AVAR is is expressed as:
1) \displaystyle \qquad {\sigma_{y}}^2(\tau)

While the Allan Deviation is, as you may expect from basic statistics:
2) \displaystyle \qquad {\sigma_{y}}(\tau)

We often see the M-sample Allan variance, which is a measure of frequency stability using M samples, time T between measurements, and observation time tau. The M-sample Allan variance is expressed as:

3) \displaystyle \qquad {\sigma_{y}}^2(M,T,\tau)

To explain the AVAR as academically as you can in as few words as possible is to say that it is defined as one half of the time average of the squares of the differences between successive readings of the frequency deviation sampled over the sampling period. The Allan variance depends on the time period used between samples: therefore it is a function of the sample period, commonly denoted as tau, likewise the distribution being measured, and is displayed as a graph rather than a single number. (Editors note: TL;DR) A low Allan variance is characteristic of a clock with good stability over the measured period. [1]

The results of AVAR or ADEV are related to five basic noise terms appropriate for inertial sensor data, namely:

  • Quantization noise
  • Angle Random Walk
  • Bias Instability
  • Rate Random Walk
  • Rate Ramp

Now, you may ask, why the hell are we using something that was designed to measure the frequency stability of clocks in characterizing errors of an IMU? The answer is, to put it simply, we typically separate the errors of inertial measurement devices into two groups: deterministic errors like bias and nonlinearity, and stochastic processes like the ones mentioned above. Within the field of system dynamics, everything can be decomposed into frequency domain, and we are therefore concerned about the contribution to errors from different frequency sources.
As wikipedia succinctly puts it: “The Allan variance is intended to estimate stability due to noise processes and not that of systematic errors or imperfections such as frequency drift or temperature effects. The Allan variance and Allan deviation describe frequency stability.” Basically, the easiest way to think about this, if you’ve taken an undergraduate controls course, this will be a piece-meal Bode plot with different error frequencies. I’ll expand on this in a minute, but that’s basically what we are dealing with.

Again, the Allan variance / Allan deviation is nothing more than a Bode plot that analyzes the magnitude of different noise sources with differing frequency power. This will cause different contributing noise terms to appear with differing slopes on the Allan deviation plot; these differing slopes making it easy to see that each noise process differs by an integrator (1/s).

The following error terms are the ones most commonly used, and can be distinguished on the Allan Variance / Deviation plot with their corresponding slopes.

  • Quantization Noise: slope -1
    • Quantization noise is Gaussian and zero mean. It is strictly due to the digital nature of ring laser gyro outputs.
  • Angle Random Walk: slope = -1/2
    • ARW is high frequency noise that can be obvserved as a short-term variation in the sensor output. After integration, it causes random error in angle with distribution that is proportional to the square root of the elapsed time. It is often thought to be from randomized dither and the spontaneous emission of photons for a ring laser gyro.
  • Bias Instability: slope = 0
    • BI has an impact on long-term stability. It is a slowly fluctuating error so it appears in low frequencies as 1/f (pink, or flicker) noise. For a gyroscope, this measures how the bias of the sensor changes over a specified period of time at a constant temperature. For a ring laser gyro this noise originates in the discharge assembly, electronics, and other components susceptible to random flickering.
  • Rate Random Walk: slope = 1/2
    • RRW basically characterizes the time scale over which changes in the bias offset occur and can give an idea of when recalibration of sensors should occur. This random process is of uncertain origin, possibly a limiting case of an exponentially correlated noise with a very long correlation time. Mechanical as well as rate biased laser gyros exhibit this noise term.
  • Rate Ramp: slope = 1
    • Slow monotonic change of output over a long time period. This is more of a deterministic error rather than random noise. Its presence in the data may indicate a very slow monotonic change of the ring laser gyro intensity persisting over a very long period of time.

So, what do one of these bad boys look like you may ask? Well, to steal, er borrow, an image from Freescale’s application note AN5087 “Allan Variance: Noise analysis for gyroscopes” [2] we see something like this:

Allan Deviation Plot, courtesy Freescale AN5087

In the above plot, and with all AVAR / ADEV plots, the horizontal axis is the averaging time, an the vertical axis is either the AVAR or ADEV, ADEV in the case of the plot above.

Empirical Data Acquisition and Testing

For an in-depth tutorial on how best to collect data to perform AVAR, see Section B.4 of [4]. To summarize their recommendations, you can select the sample rate to be at least twice the sample bandwidth. The record length should be at least several times the required performance interval. The recommended approach is to limit the time/frequency domain dynamic range to about 3 orders of magnitude. In practice this is done by dividing the total range into overlapping intervals of geometrically increasing lengths. Thus, the high frequency data is acquired for a short period of time. Lower frequency data is filtered (integrated) and acquired for a longer period; e.g. 100 us data is collected for 0.1s, 0.01s data for 10s, 1s data for 1000s, and 100s data for 10^5 s. The appropriate sampling rates/ record lengths should be chosen to overap about one decade of frequency (time).

If we are considering a gyroscope, the Allan variance can be defined in terms of the output rate, {\Omega}(t), or the output angle as

4) \displaystyle \qquad {\theta}(t) = \int^{t}{\Omega}(t')dt'

The lower integration limit is not present in equation 4 because only angle differences are employed in all of the definitions.

The angle measurements that are fed into equation 4 are made at times t = k{\tau}_0, where k varies from 1 to N (the number of samples), and the sample period is \tau_0. For a discrete set of samples, a cumulative sum can be used to give N values of \theta. In this case the cumulative sum of the gyro output samples at each k\tau_0 is taken and each sum obtained is then multiplied by the sample period \tau_0 to give N values of \theta. [2]

A general heuristic to use is to set the averaging time to be \tau = m{\tau}_0, where m is the averaging factor. The averaging factor can be chosen arbitrarily as some value according the inequality m < (N - 1) /2. [2]

Cool, now that we’ve got our values of \theta calculated, we compute the Allan variance with the following equation:

5) \displaystyle \qquad {\sigma}^2(\tau) = {\frac {1}{2\tau^2}} < (\theta_{k+2m} - 2\theta_{k+m} + \theta_k)^2 >

Equation 5 represents Allan variance as a function of \tau and < > is the ensemble average. Upon expanding the ensemble average in equation 5 we obtain the following estimate of the Allan variance:

6) \displaystyle \qquad {\sigma}^2(\tau) = {\frac {1}{2\tau^2 (N - 2m)}} \sum_{k=1}^{N-2m} (\theta_{k+2m} - 2\theta_{k+m} + \theta_k)^2

For those who care, and if you want to swing your boy parts around in public, the Allan variance is related to the two-sided power spectral density, S_{\Omega}(f) by:

7) \displaystyle \qquad {\sigma}^2(\tau) = 4 \int_0^\infty S_{\Omega}(f) \frac{sin^{4}(\pi f \tau)}{(\pi f \tau)^2} df

Cool, so hopefully if you do this correctly you’ll have a plot that looks something like this:

Simulation and Validation

Alrighty, so we can get a plot from our time series data easily enough, what next? Well, if we presume to already know the noise coefficients we can simulate the Allan variance plot and compare the two, or we can try to characterize the noise coefficients from the Allan variance plot itself.

For simulation and modelling purposes, typically a first order Gauss-Markov process is used to model these kinds of errors, which is just an exponentially correlated, noise driven stochastic process characterized by the following differential equation:

8) \displaystyle \qquad \frac{dx}{dt} = \frac{-1}{\tau} x + N(\mu, \sigma)

Which is ultimately just an exponentially decaying autoregressive term with driven white noise. The discrete time version of this is:

9) \displaystyle \qquad x(n+1) = x(n) e^{\frac{-1}{\tau}dt} + N(\mu, \sigma) dt

The first order gauss-markov process may also be defined in terms of its autocorrelation function:

10) \displaystyle \qquad R_{xx}= \sigma^2 e^{-\beta \tau}

where \sigma is the entire process variation (NOT JUST THE DRIVEN NOISE VARIATION), \beta is the inverse of the correlation time, T=\frac{1}{\beta}, and \tau is the autocorrelation time lag.

For sufficiently long time series (much longer than the time constant), the autocorrelation plot of the first order Gauss-Markov process can verify the correlation time, which is the time lag at which R_{xx} = \frac{\sigma^2} {e}, and the entire time series variance will appear as the peak at time lag = 0. Note that the driven noise standard deviation cannot be validated via autocorrelation. Also note that matlab has two different autocorrelation functions that will produce slightly different results if your time scales are short (autocorr and xcorr), as well as an autocovariance function, xcov. Autocorr and xcov are normalized by the process mean such that a constant process will have a steady autocorrelation (autocorrelation is just autocovariance scaled by the inverse of the process variance). xcorr, however, does not do this, so a constant process will have a decaying autocorrelation as the time lag increases, which results in a shortening of the summation by 1 element each iteration.

For process lengths on the order of magnitude of the time constant or less the process will be dominated by the integrated white noise term, resulting in a random walk. For time scales much longer than the time constant the exponentially correlated term will begin to be apparent. This can be see on an Allan deviation plot, where for sampling intervals much shorter than the time constant the Gauss-Markov Allan variance reduces to that of a singly integrated white noise process (rate random walk), whose slope is +1/2, and the noise magnitude (standard deviation) may be picked off by finding the intersection of the +1/2 slope line and sampling_interval = 3.
This can also be verified by differentiating this time series to obtain Gaussian noise, whose Allan deviation has a slope of -1/2, and the noise magnitude is interpreted as the intersection of this line with sampling_interval = 1.

For process time scales large enough to see the decay time constant, and Allan deviations with sampling intervals ranging from much less than the time constant to sampling intervals much larger than the time constant, the slope will transition from +1/2 for sampling intervals much shorter than the time constant, to 0 as the sampling interval approaches the time constant at it’s maxima, to -1/2 as the interval becomes much larger than the time constant.

When multiple gauss markov processes are added together it will be nearly impossible to pick apart any of these parameters, but when run individually the driven noise magnitude (qc) and time constant (Tc) can be found on the allan deviation plot as follows:
Tc = argmax_sampling_interval / 1.89, where argmax_sampling_interval is the allan deviation sampling interval that maximizes the allan deviation. The driven noise magnitude may then be found with the following relationship:

11) \displaystyle \qquad  q_c = \frac{\sigma_{max}} {0.437 * \sqrt{T_c}}

where \sigma_{max} is the Allan deviation maxima.

If the combined output of all of these Gauss-Markov processes are analyzed, the only easy thing to pick off will be the velocity random walk / angle random walk, which may be found by fitting a line to the segment with a slope of -1/2 and finding the intersection of this line with tau = 1.

Characterizing Error Sources

Angle Random Walk

Angle random walk appears on the AVAR plot as the following equation:

12) \displaystyle \qquad  \sigma^2(\tau) = \frac{N^2}{\tau}

where N is the angle random walk coefficient. In a log-log plot of \sigma(\tau) vs \tau ARW has a slope of -1/2. The numerical value of N can be obtained by reading the -1/2 slope line at \tau = 1.

Bias Instability

Bias instability appears on the AVAR plot as the following cumbersome equation:

13) \displaystyle \qquad  \sigma^2(\tau) = \frac{2B^2}{\pi}\left [ln 2 - \frac{sin^3(x)}{2x^2}\left (sin(x) + 4xcos(x)\right ) + Ci(2x) - Ci(4x) \right ]

Where x is \pi f_0 \tau,
B is the bias instability coefficient
f_0 is the cutoff frequency, and
Ci is the cosine-integral function.

As you can see it’s probably a bit easier to just pick off the part of the AVAR plot that has a slope of 0.

Rate Random Walk

Rate random walk appears on the AVAR plot with the following equation:

14) \displaystyle \qquad \sigma^2(\tau) = \frac{K^2 \tau}{3}

Where K is the rate random walk coefficient.

Because we are essentially dealing with Bode plots, the presence of \tau in the numerator indicates we will be searching for a slope of +1/2. The magnitude of this noise can be read off the +1/2 slope line at \tau = 3.

Rate Ramp

Rate ramp appears on the AVAR plot with the following equation:

15) \displaystyle \qquad \sigma^2(\tau) = \frac{R^2 \tau^2}{2}

Where R is the rate ramp coefficient.

With one more power of tau in the numerator we are therefore looking for a slope of +1. The magnitude of R can be obtained from the +1 slope line at \tau = \sqrt{2}.

Quantization Noise

Quantization noise appears on the AVAR plot with the following equation:

16) \displaystyle \qquad \sigma^2(\tau) = 3\frac{Q^2}{\tau^2}

Where Q is the quantization noise coefficient.

Quantization noise appears with a slope of -1 on the log-log plot and can be read off the intersection of the line with slope -1 and \tau = \sqrt{3}.

Exponentially Correlated (Markov) Noise

Markov noise appears on the AVAR plot as:

17) \displaystyle \qquad \sigma^2(\tau) = \frac{(q_c T_c)^2}{\tau}\left [1- \frac{T_c}{2\tau}\left (3-4e^{-\frac{\tau}{T_c}}+e^{-\frac{2\tau}{T_c}}\right )\right ]

Where qc is the noise amplitude, and Tc is the correlation time as noted above.

It is easier to view the behavior of the Markov noise at various limits of the equation. For cases with \tau >> T_c we obtain:

18) \displaystyle \qquad \sigma^2(\tau) = \frac{(q_c T_c)^2}{\tau}

Which is the Allan variance for angle random walk where N = qcTc is the angle random walk coefficient. For \tau << T_c we obtain:

19) \displaystyle \qquad \sigma^2(\tau) = \frac{q_c ^2}{3}\tau

Which is the Allan variance for rate random walk.

Sinusoidal Noise

Also, sinusoidal noise exists, but I won’t talk about it because it hasn’t concerned any of my work. Just know it exists.

Now, if we take and superimpose all of these noise sources on one plot we should see a piece-wise representation of the empirically derived plot like that shown in [8]:

References

  1. IA state AERE432 Lecture Notes, retrieved 8/10/2018 http://home.engineering.iastate.edu/~shermanp/AERE432/lectures/Rate%20Gyros/Allan%20variance.pdf
  2. Freescale appnote AN5087, retrieved 8/10/2018 http://cache.freescale.com/files/sensors/doc/app_note/AN5087.pdf
  3. IEEE Standard Specification Format Guide and Test Procedure for Single-Axis Interferometric Fiber Optic Gyros. IEEE std 952-1997 (R2008)
  4. IEEE Standard Specification Format Guide and Test Procedure for Single-Axis Laser Gyros. IEEE std 647-1995
  5. Kirkko-Jaakkola, M., Collin, J., & Takala, J. (2012). Bias Prediction for MEMS Gyroscopes. IEEE Sensors Journal, 12(6), 2157-2163. DOI: 10.1109/JSEN.2012.2185692
  6. Peng-Yu, C. (2015). Experimental Assessment of MEMS INS Stochastic Error Model. Graduate Thesis, The Ohio Sate University, Department of Civil Engineering.
  7. Martin, V. MEMS Gyroscope Performance Comparison Using Allan Variance Method, Brno University of Technology.
  8. ST Design Tip DT0064. Noise analysis and identification in MEMS sensors. Retreived 8/10/2018. https://www.st.com/content/ccc/resource/technical/document/design_tip/group0/bf/92/6b/31/e7/c1/4d/c8/DM00311184/files/DM00311184.pdf/jcr:content/translations/en.DM00311184.pdf