Data Assimilation: Analytical Methods

Apr 23, 2020 20:47 · 8274 words · 39 minute read

  • So where I wanna start this afternoon, and then continue tomorrow morning, is the discussion of what’s referred to as data assimilation. I think I mentioned on the first day that the first data assimilation meeting for ecologists I went to spent most of its three days arguing about what ecologists meant by data assimilation. But in the physical environmental sciences there’s a very specific meaning, which is this process of iteratively updating models as new information becomes available. That’s different from calibration, which is constraining your model’s parameters with data (I’ve known ecologists who wanna call that data assimilation), and from what I am calling data fusion, which we’ll talk about on Friday morning, which is when we try to use multiple different types of data constraints simultaneously, because then we’re trying to combine multiple sources of information. We’ve talked about calibration, we’ve talked about propagating uncertainty into forecasts and understanding it. So now, maybe we’ve made the forecast.

01:14 - The new observations become available and we wanna assimilate that new information back in, feed it back to have additional constraints on the state of the system that we can then use to update our forecasts. So what we’re gonna focus on over the course of these two lectures is this idea of what I’ll call the forecast-analysis cycle. So, this iterative process, again, of predicting the future given our current understanding of the system: here’s our current uncertainty, we project that from time one to time two. We compare that to observations.

01:49 - We then do a statistical analysis to update that state. So the analysis is the process of combining our predicted understanding with new observations, to update that understanding. So, in terms of those two steps, we essentially spent the morning talking about how the forecast step works. Everything we talked about this morning, about how we propagate uncertainty in making a forecast, that is the forecast step. So I wanna focus now on the analysis step, which is: given that, how do we do that update? So, prior to observing how the future plays out, what is our best estimate of the future state of the system? If the state variable we’re interested in is X, what’s our best estimate of X at time t plus one? So, if you felt X at time t was your best predictor of X at t plus one, then you’ve subscribed to a random walk model.

02:56 - And the point here, is that prior to observing how the future plays out, our forecast is our best estimate of the future state of the system, that’s why we made it. So everything we know about what we think is gonna happen in the future, is what we want to build into our forecast. Before the future becomes the present, that is our best understanding of what we think the future could be. If it wasn’t, we would improve our forecast. If we knew something that wasn’t in our forecast, we’d wanna put it into our forecast.

03:26 - So we’ve assumed that we’re putting all of our information, all of our understanding, into our forecast model, and that it’s accounting for all the uncertainties we have about the system. And so, that’s our best guess of the future. So, it’s our forecast, our predictive probability, the forecast probability of X at time t plus one. Okay, so now we make observations of the system, Y. Those observations are imperfect. Now what’s the best estimate of the state of the system? Some balance between the two. Yes, and importantly, it’s not the data.

04:08 - So, I think a lot of folks, before they get introduced to how data assimilation works, have the kind of intuition that if I make a prediction, and I observe some data, I should just reset my model to the data. Which seems reasonable, but the data is imperfect. The data is noisy. Maybe we shouldn’t trust the data completely? We should instead think about how we combine information. So an example would, say, come from Catherine’s work on phenology. So she’s trying to predict, say, percentage of leaf-on and leaf-off through the season.

04:51 - So, if I look at a MODIS image right now, I’ll get an estimate of LAI every week. Throughout the summer, that’s going to be bouncing around because there’s non-trivial observation error in the MODIS product. Should our estimate of whether leaf-on has occurred or not bounce around mid-summer? No. We should trust the model that says it is the middle of the summer. All you’re seeing there is observation error. You’re not seeing any process, all you’re seeing is noise. If you caused your model to follow all of the wiggles in that, you would just be following noise. So we don’t wanna just follow the noise in the data, we wanna balance that against the model. So instead we think about, well, how do we get that? What do we want? We started with the probability of X, our forecast; now we want to understand the probability of X given some new observations. So how do we get to the probability of X conditioned on some new data? Can we calculate this directly, the probability of the latent state X, the true state of the system, given the data? Not really.

06:14 - But what we can do, we can definitely calculate the probability of observing some real-world data, given some state variable. What do we call this? - [Class] Observation error? - It’s the observation model but– - [Student] Likelihood? - Likelihood. That’s what I was thinking. This is the likelihood. So if I have a likelihood for Y given X, but what I really want is X given Y, how do I get from one to the other? - [Student] You practice? - [Class] Bayes. - Bayes. (class laughs) Yes. We need Bayes’ theorem. (class laughs) So if we want to update our understanding of the state of the system, given the new data, we need Bayes’ theorem. We need a prior. We need a likelihood, to give us this updated posterior.

07:13 - The likelihood, in this case, yes, it is mostly that observation error model. Where’s the prior? - [Student] Forecast. - Forecast. Good, you’re paying attention. (laughs) And this, in a nutshell, is actually the key insight of how data assimilation works. Which is, most of the time when we apply Bayes’ theorem, we spend a lot of time thinking about the likelihood, about bringing the data in. But priors are sometimes uninformative; sometimes we have some prior data, sometimes we use expert elicitation.

07:48 - The prior gets some weight, but it’s not a lot, and it could come from lots of places. But in a forecast system, the prior’s really important because, as we just said, it embeds everything we know about the system to date. It can be highly informative and it’s what we wanna update. So, the key idea is that the analysis problem is just Bayes’ theorem. And it’s just Bayes’ theorem where the model itself, the forecast itself, is providing the prior.
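
Written out (a standard statement of the analysis step, with the forecast serving as the prior):

$$ p(X_{t+1} \mid Y_{t+1}) \;\propto\; \underbrace{p(Y_{t+1} \mid X_{t+1})}_{\text{likelihood (observation model)}} \;\times\; \underbrace{p(X_{t+1})}_{\text{prior (the forecast)}} $$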

08:22 - It should say the forecast, not the model, because it’s the forecast, with all of the uncertainties propagated in, that is our prior. So, that’s the key concept. Pretty much everything else after this is detail. It’s example. Because armed with that key concept, you could actually go forth and say, “I can write down the prior for my model out of that forecast however I want. I can write down the likelihood however it makes sense. I can do this analysis using whatever Bayesian computational tools I have available to me, and I can customize it.”

08:59 - What we’re gonna see over the rest of this lecture and the beginning of next lecture is largely, in some ways, the history of how people have dealt with this problem through a series of specific special cases, specific simplifying assumptions, that deal with certain problems efficiently. But one of the take-homes that we will return to at the end of this is to remember there’s nothing sacred about the current class of data assimilation models that you will find in the literature. And one of the things I said earlier, I think I’ve said multiple times, was that the data assimilation tools that we borrow from the atmospheric sciences were tools that were optimized for one type of forecast problem, and we might not have that problem. So it’s completely fair game to say, if we take this concept we might, as ecologists, optimize it differently for the different classes of problems we have. With that, I wanna dive into some of the ways that this actually has been used.

10:05 - So again, conceptually we have some estimate of the state, we forecast it into the future, we make some observations, and now we use Bayes’ theorem to combine the two to get our updated estimate, and that then serves as the initial condition for our next forecast. On the previous slide, I showed a case where the data and the model don’t completely agree with each other, so you’re gonna find the average, the balance between the two. But if you’re doing this well, if you’re forecasting well, the forecast and the observation shouldn’t be wildly different, and what you’re getting is this case where they’re reinforcing each other. So, if your forecasts are good, it’s not a matter of jumping to the data and ditching what the model predicted, it’s more a matter of the data reinforcing. But if your model’s wrong, then the data should pull you back to reality.

11:00 - I wanna start with walking through what I believe to be the simplest possible case of data assimilation as the starting point. It’ll help you get your head around it and it also will be the foundation for more complicated cases. So, simplest thing: whenever statisticians want to make life as simple as possible, they start assuming that everything that can be Gaussian distributed is Gaussian distributed, so we’ll do that. We’ll assume that our forecast is Gaussian distributed with some mean and variance. And as a place to get started, since we’re going to start at the analysis step, we’ll assume that we know what our forecast is.

11:44 - It would be kind of hard to do data assimilation if you didn’t know what your own forecast was. We’re gonna also assume that the observation is, again, Gaussian distributed with some observation error. There’s no bias correction in here, there’s no calibration model in here, it’s just simple: observations are distributed around the true value with some observation error. So again, simplest possible observation error model. That’s our likelihood, or data model. So we’re gonna assume that our observations are known, which seems like a good assumption; again, that the forecast mean and variance are known; and that we know what the observation error is, assuming the observation error has been reported.

12:32 - We’re not trying to estimate the observation error as we’re going. So the only thing really left as the unknown here is the state of the system that we’re trying to update. Now, this is a case of a normal prior times a normal likelihood. It’s a conjugate combination of prior and likelihood, so it has an analytical solution. If you guys read Chapter 5, where we introduce Bayes, there’s a box that goes with the derivation of normal times normal as an example of conjugacy.

13:04 - It wasn’t an accident that I went through that derivation ‘cause I knew I was going to need it here. So this is the actual posterior, but I should point out that I can recode this in a way that makes it a little bit easier to understand. So specifically, let’s call Pf the precision of my forecast, one over tau squared. Let’s call Po the precision of my observations, one over sigma squared. If I do that, then the precision of my analysis, that guy over here, is just the precision of the forecast plus the precision of the observations.

13:55 - And if you stare at this awhile, you’ll realize you have one over that, one over that, added together, one over the whole thing. This just looks nasty because you’re dealing with it in terms of variances, but if you deal with it in terms of precisions, all we’re doing is adding the precisions together. Here, I’m also simplifying to the case where I only have one observation, but I can deal with n observations just as easily. Now, the mean. The mean can similarly be simplified by saying that the mean of the analysis involves the mean of the forecast and the data, and each of those things is weighted by its precision: the precision of the forecast over the precision of the analysis, plus the precision of the observations over the precision of the analysis.

14:57 - So again, the precision of the analysis is just the total precision, and so this works out to saying that in the simplest example, normal forecast, normal observation error, the mean is a weighted average between the forecast and the data, and they are weighted by their precisions: whichever is more precise gets more weight. And the precision of the analysis is the sum of the precisions. Again, that idea that they are mutually reinforcing: the analysis is always more precise than either because it combines those two pieces of information. The posterior is just the weighted average of the prior and the data, each weighted by its precision. So this has a couple of implications, a couple of examples. If we have imprecise data, then the posterior of the analysis will stay pretty close to the forecast. That was, again, the idea that if you have noisy data, you don’t want to just be following the noisy data. You want to buffer the degree to which you follow the noisy data based on the degree to which the model says, yes, this is or is not a time where I expect to be certain. If I’m very confident that it’s the middle of the summer and the data’s really noisy, I just ignore the data. That’s actually the right thing to do.
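
As a minimal sketch of that weighted average (a hypothetical helper function with made-up numbers, not code from the course):

```python
# Univariate normal-normal analysis step: precisions add,
# and the mean is a precision-weighted average.

def analysis(mu_f, tau2, y, sigma2):
    """Combine a Gaussian forecast N(mu_f, tau2) with one
    observation y ~ N(x, sigma2)."""
    p_f = 1.0 / tau2            # precision of the forecast
    p_o = 1.0 / sigma2          # precision of the observation
    p_a = p_f + p_o             # precision of the analysis
    mu_a = (p_f * mu_f + p_o * y) / p_a
    return mu_a, 1.0 / p_a      # analysis mean and variance

# Imprecise data: the analysis stays close to the forecast.
print(analysis(mu_f=0.5, tau2=0.01, y=0.9, sigma2=1.0))
# Precise data: the analysis moves most of the way to the observation.
print(analysis(mu_f=0.5, tau2=0.01, y=0.9, sigma2=0.001))
```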

16:20 - Likewise, if the model is not very confident but the data is really good, yeah, your posterior is mostly gonna look like your data. I think I said this earlier in the week, but I’ll reinforce it here. Even if you never do data assimilation ever, an important take-home message about data assimilation is that to be able to do it, we need to know the uncertainty in the data and we need to understand the uncertainty in the models. We cannot combine data and models if we don’t know the uncertainties in each, and those uncertainties need to be robust and as complete as we can make them. So if you have an estimated observation error in the data, but it’s only considering part of the error and it’s actually leaving tons of stuff out, then you’re giving the data way more weight than it should have.

17:14 - Likewise, if you have a model and you’re propagating, you guys just saw, propagating five sources of uncertainty into a prediction, if you propagate only one of those and you choose the wrong one, you may be way overconfident in your model. Let’s say you only considered the initial condition uncertainty in that model that you did in the last exercise; you would be giving that model huge weight that it didn’t deserve. If you failed to include the parameter error, the process error, the driver error, you’d be giving that model way too much weight. So, doing this in theory is simple. It’s just a weighted average, we’re done. (laughs) Doing this in practice is about becoming obsessive about characterizing these uncertainties correctly, which is why that’s where we started the course, on characterizing uncertainties, because you can’t do data assimilation and forecasting without it.

18:15 - So if that’s our simplest possible analysis, let’s talk through our simplest forecast. Technically, our simplest forecast would be a random walk. I’m gonna make this slightly more interesting by assuming that the process model has some slope to it, m, some rate of change. So that may be decaying back to a background rate, it may be a rate of growth, but we’re gonna assume that there’s some constant: X at t plus one is some constant, slope m, times the current state. If m is one, this is just a random walk model. We’re gonna assume that the process error is normally distributed, mean zero, so no bias in the process error, with variance q. The reason for q is murky in the depths of the data assimilation literature, but it’s always q for some reason, so we’re gonna stick with it ‘cause you’ll see that in other literature.
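
In symbols, the process model just described is

$$ X_{t+1} = m X_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, q), $$

which reduces to the random walk when m equals one.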

19:16 - At this stage, with the simplest model, we’re gonna assume that the model’s parameters and the process error are known. I’ll point out here that it is not hard to augment what we’re doing to consider that slope to be an unknown that needs to be updated. It is very hard to consider this q to be an unknown that needs to be updated. The reason for that is because, very similar to the math we worked out here, when we combine normals they end up as weighted averages of their uncertainties. If you put q in the numerator and q in the denominator, it blows up. (laughs) You can’t constrain it. If I put m in the denominator, fine. No problem, I can update m.

20:17 - But I can’t update q, if q is the weight, with only assuming normal, normal. There are ways to get beyond it, but it requires adding terms to the model. In fact, this is something that I think I have, in tomorrow’s lectures, some examples of doing this with some mechanistic process models with Megan’s lab mate, Ian Rayhoe. He and I have spent a lot of time obsessing over the fact that, in ecological problems, I would totally believe that we understand the data, we could figure out the observation error, I might even be able to wave my hands and believe that I could constrain the model somehow prior to starting. But the idea that I knew the process error before I started, is just ridiculous.

21:05 - There’s no ecological problem where we know the process error before we start (laughs). Process model, process error. In this case, we now need to consider the initial conditions as what’s going in because we have a model that doesn’t have drivers, so we don’t have that term. It doesn’t have random effects, so it doesn’t have that term. Again, we’ve assumed the parameters are known, so it doesn’t have that term. So of the five terms we need to worry about, we have initial conditions and process error.

21:35 - So what are the initial conditions? Well, that’s what came out of the analysis. So I’m keeping this mu a, tau a; that tau a is just Pa equals one over tau a squared. So how do we figure out what our forecast is? We went over five different ways to make a forecast this morning. Anyone think about which one might be useful here? You could definitely use all of them, but one of them gives you the most elegant– - [Student] Maybe analytically, right? - Yeah, we can derive this analytically. We can derive this analytically because this is normal, and this is normal, and this is linear.

22:26 - So since it’s linear, the analytical approach of propagating the moments will work. And since everything is Gaussian, the Gaussian is fully defined by its mean and variance. So if we just know that mean and variance, then we know everything we need to know. We’re not losing any information in higher moments. And since it’s a linear transformation of a normal, it’s going to come out as normal on the other side, so we’re not losing anything there.

23:00 - So, how would we do that? Well, we kinda saw that this morning. The expected value of X at t plus one is the expected value of the process model, while the expected value of the error is zero. The expected value of the model is the constant times the expected value of X, which is mu. Variance, we’ve done this derivation already. The variance of X is the variance of the linear model.

23:32 - So if we do this out in full: slope squared times the variance of X, plus the variance of this error, plus two times the covariance. Now, I’m gonna make a simplifying assumption, which is that there’s no covariance between the process and the error. It’s actually not uncommon to find that your residuals are not correlated with your parameters. So that’s approximately m squared times the variance of X plus the residual variance, and the variance of X there is our variance at the initial conditions, which is what came out of the analysis.

24:25 - The m squared comes from the process model. So a forecast involves initial condition uncertainty, process error, and the process model. Worth pointing out: if you had a linear model that did include other uncertainties and other processes but remained linear, you could expand this out to be that full five-term derivation I did before just as easily. This is just the classic derivation of the simplest forecast model, but if you had a driver in that model, you would just add that term in. It’s gonna, again, end up with some slope times the uncertainty in your drivers. Same with your parameters, same with any random effects.
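
Collecting the two moments just derived, with the analysis at time t serving as the initial condition:

$$ E[X_{t+1}] = m\,\mu_a, \qquad \mathrm{Var}[X_{t+1}] = m^2\,\tau_a^2 + q $$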

25:11 - And so overall, the probability of X at time point t plus one, our forecast, is normal. Mean, mu X. I guess that should be– - [Student] Mu t? - Yeah, mu at time t, and then our forecast variance. Cool. So now we have the capacity to put these things together. We have a forecast step where we update from X at t to X at t plus one, and we have an analysis step where we then update X. And as I wrote on the board here, we are now, again, putting this in precision notation: sums of the precisions, and a weighted average of the data and the forecast.

26:06 - So what comes next? This has an iterative analytical solution: you can’t write the whole thing out in closed form, but you can iterate between these two steps in a completely analytical way. There’s no need to do sampling, there’s no need to run ensembles; the model propagates all the uncertainties analytically. Cool. This analytical solution has a name. This is known as the Kalman filter. It is named after Rudolf Kalman, who got a presidential medal for this. Now, you don’t get a presidential medal for just doing a bit of clever math; that bit of clever math has actually enabled an enormous amount of technology in our society, and I’ll come back to some examples of where the Kalman filter has been applied in the real world later in this lecture.
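
Putting the forecast and analysis steps together, here is a minimal end-to-end sketch of the univariate Kalman filter on simulated AR(1) data, similar in spirit to the example discussed next (all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed-known quantities, per the setup above:
m, q, sigma2 = 0.9, 0.05, 0.25   # slope, process error, observation error

# Simulate a true AR(1) state and noisy observations, with a data gap
T = 50
x = np.zeros(T)
for t in range(1, T):
    x[t] = m * x[t - 1] + rng.normal(0.0, np.sqrt(q))
y = x + rng.normal(0.0, np.sqrt(sigma2), T)
y[20:30] = np.nan                 # missing data

# Kalman filter: alternate forecast and analysis, forward in time only
mu, tau2 = 0.0, 1.0               # initial analysis mean and variance
for t in range(T):
    # Forecast step: propagate mean and variance through the linear model
    mu_f = m * mu
    tau2_f = m**2 * tau2 + q
    if np.isnan(y[t]):
        # No observation: the forecast is the analysis; uncertainty grows
        mu, tau2 = mu_f, tau2_f
    else:
        # Analysis step: precisions add, mean is precision-weighted
        p_a = 1.0 / tau2_f + 1.0 / sigma2
        mu = (mu_f / tau2_f + y[t] / sigma2) / p_a
        tau2 = 1.0 / p_a
    print(t, round(mu, 3), round(tau2, 3))
```

During the data gap the analysis variance ratchets upward each step and snaps back down at the next observation, which is exactly the sawtooth pattern discussed below.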

27:04 - I also cannot help but snark on the fact that I could not imagine the current administration giving a scientist any credit for anything; that was a more enlightened time. (laughs) So, this is an example of applying a Kalman filter, just focusing on the means. Here, the solid line is some true state of the system. The dots are data; the data don’t fall on the truth because some observation error has been added onto them. And then, the filter is the Kalman filter, so it’s trying to reconstruct the dynamics of that system through time.

27:45 - And it does a pretty good job of following it. When you get to places like here, where there are gaps in the data, it has no choice but to project across them. And this was simple AR(1) simulated data fit with a Kalman filter where the slope was known. Now, that last image didn’t have the confidence interval on it, so here is just a separate plot looking at how the uncertainty in the Kalman filter evolves over time. What we see is that when there is data, the uncertainty reaches a steady state, reaches an equilibrium, ‘cause in this particular case the observation error is not changing through time and the process error is known. These spikes correspond to those periods of missing data.

28:35 - So when you have missing data, the analysis uncertainty increases during those data gaps. Now, if you remember yesterday when Shannon talked about state-space models, they had this behavior where they had a confidence interval around the data but if there was a gap in the data, they would balloon up and come back down. What do you notice that’s different about the shape of these uncertainties, compared to what we saw in the state-space model? - [Student] They’re not symmetric? - They’re not symmetric. No. So in the state-space model, remember, the state-space model could look ahead and see that there were these constraints after the gap. It looks both forward and backwards in time.

29:33 - That ability to look forward and backwards in time is characteristic of what’s known as a smoother. So, the state-space model is a smoother. It uses information in both directions. The Kalman filter is a filter; it only progresses through time in one direction. It’s designed to do iterative assimilation, it’s designed to use information as it’s coming in. Not to go, oh, now there’s a new observation, so I’m gonna revise the past. So, the Kalman filter does not revise past time steps that it’s seen previously.

30:13 - When it gets to this data point now, it goes, ah, now I do know where I am. It doesn’t say, oh, and I should go back and revise where I was previously. It’s a forward operation. And because of that, it has this characteristic kind of shark tooth behavior to it, or sawtooth behavior, to its uncertainty. Yeah, sawtooth. As opposed to this, which I kind of liken to the balloon animal model of uncertainty, it kind of bulges out where you aren’t constraining it with data. So, the Kalman filter derived in the simplest possible case for one variable, may be of use to ecologists but was not gonna cut it with atmospheric scientists, where they’ve got hundreds of thousands of atmospheric grid cells with dozens of state variables.

31:01 - They wanted something that could deal with more than one thing at a time. And so, what we’re gonna do next, is generalize this Kalman filter from the univariate case to a multivariate case. Before I dive into the math for doing that, I wanted to highlight some of the key concepts behind what we’re gonna gain from doing that. And I talked about this earlier in the week, but it’s worth reiterating. So, imagine that my forecast model implies a covariance between two of my states.

31:37 - At this stage, with an analytical Kalman filter, it’s not done as an ensemble. It would literally be done as a mean and covariance. But the key point is that there is a covariance in the forecast, so that when we observe one thing about one of my state variables, I can use Bayes’ theorem to update that thing, but I can also use the covariance to update other latent states. So that’s one of the advantages of the Kalman filter. This kind of data assimilation approach allows us to easily update latent variables during the analysis step and likewise, as I mentioned earlier in the week, to have mutual information.

32:25 - So if I did have an observation here about AGB, the update on each of these variables would be coming from the direct constraint on that variable and the indirect constraint through the correlations. And this works not just over multiple state variables, it can also work over multiple locations. So, let’s say I am forecasting only one thing, but I’m forecasting that one thing at multiple locations in space. When I’m doing a forecast at multiple locations in space, you could easily see that my forecast for the population here and my forecast for the population there are also positively correlated, because they may be responding to the same environmental drivers, the same variability in the same environmental drivers. So, information at one location helps me constrain what’s happening at other locations in proportion to the degree to which we forecast them to be correlated.

33:21 - So again, this allows us to do what I kind of like to call the idea of models as scaffolds. So, the model itself, the thing that we’re using to make the predictions, ultimately what it is contributing to this is that covariance structure. It’s contributing a prediction that tells us how the covariances between different processes in space and time should be related to each other. And thus it allows different types of data that may not be directly comparable to each other to be mutually informative. Even if you can’t make a scatter plot of how variable X is related to variable Y, because they’re operating on different scales or are just disconnected in their process, your understanding of how the process works, which is what you’ve built into the model, your hypotheses about how reality works, are used to link different data sets together through this data assimilation process. So, generalizing this to a multivariate case.

34:36 - I’m gonna try to keep the notation as similar as possible. If you haven’t been thinking about matrix algebra in a while, you’re gonna take a few things for granted, but I’ll try to highlight what I’m doing. Mu a and mu f are means. Instead of being a single number, they’re now gonna be a vector of our means, length n long. So, now I care about n means. Those n’s may be n different locations, they may be n different variables that my model’s predicting, or they can be some combination. So if I’m predicting five things at 20 locations, n is unwrapped to be 100 long, and it’s my responsibility to figure out how to wrap them back up (laughs), to know which means go with which variables and locations. It’s not actually hard.

35:29 - The error covariance matrices in the analysis and in the forecast are gonna be Pa and Pf; that’s what I was previously calling tau a and tau f, and those are n by n covariance matrices because there’s n states. I then have some vector of observations that’s length p long. Key point: p and n don’t have to be the same number. So like I said, you can have latent variables, you can have states that are not observed directly. Likewise, you can have states that are observed with multiple observations.

36:05 - So p could be bigger than n, p could be smaller than n, p and n could be the same but all the observations are just for one state; it doesn’t matter. What matters is that there is going to be a term that gives us a map between Y and X, and that’s gonna be called H. Since I have p observations, the observation error is described by a p by p covariance matrix, R. It’s not uncommon for that to be diagonal, to assume that our observations are independent, but the structure of the Kalman filter doesn’t require that your observation errors be independent. In practice, they often are, but they don’t have to be.

36:47 - And so, H is what’s called our observation matrix and that’s essentially, I’ll show an example on a slide very soon, but it’s essentially just a look-up table that says which Y goes with which X. So, the first X goes with the first Y. The second X goes with the third Y, whatever. It allows us to keep track of which goes with which. M is now our process model. This can now be n by n, so we now have a matrix process model. I don’t care what part of ecology you come from, you’ve probably seen matrix models before.

37:30 - Matrix population models; ecosystem models, since all pool-and-flux models can be represented in matrix form; food webs can be represented in matrix form. So, matrix models: I’m gonna take that as actually not that big of a limiting constraint. And so, our analysis problem comes down to a normal likelihood of the data, given the latent state X and the observation error, times the forecast, which has some mean and covariance. So our analysis step is again normal times normal; it’s just multivariate normal by multivariate normal. They’re still conjugate, and so we can still multiply them together.

38:15 - Conceptually, the answer will be the same, which is: the precisions will add up and the means will be a weighted average of the means. It just gets uglier. (laughs) So, here’s my variance. This one-over-one-over, this inverse, is just turning those into precisions. I’m adding those precisions up, and I’m inverting them again to get back to a covariance. So again, I’m just summing up the precisions. That term shows up again, so that’s in the denominator, so it’s again the weight.

39:00 - Here’s the data, weighted by its observation error. Here’s the forecast, weighted by the forecast error. It’s just uglier. (laughs) It’s very common to reorganize this in terms of this term, K, which is called the Kalman gain. It’s just a reorganization that says, instead of thinking of this thing being weighted against that thing, think of the analysis as the forecast corrected by some gain. Okay, so let’s give a really simple example.
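
In the standard notation, the usual textbook form of that reorganization, using the quantities defined above, is:

$$
\begin{aligned}
K &= P^f H^T \left( H P^f H^T + R \right)^{-1} \\
\mu^a &= \mu^f + K \left( y - H \mu^f \right) \\
P^a &= \left( I - K H \right) P^f
\end{aligned}
$$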

39:37 - Let’s assume that we are forecasting three things, three state variables: one, two, and three. And let’s assume that we observe two of them, y2 and y3. So we don’t observe the first variable, whether that’s the first location or the first state variable. We’ll also assume that our observation error is diagonal, with the same sigma for every observation, and that the errors are independent, which is a simplifying case but a pretty common assumption for observation error. Now H, that observation matrix, again, what it’s doing is mapping between the two.

40:22 - So it’s saying the first y maps to the second X, the second y maps to the third X. So it’s very common for these H matrices to just be zeroes and ones mapping between these things. That said, they don’t strictly have to be just zeroes and ones. So, if you have some process where you have your forecast on one grid and your observations on a slightly different grid, which can happen with remote sensing, these numbers might be fractional to say, I have to add up remote sensing pixels mapped to my observation, so they each get 1/3 weight; any sort of spatial regridding or down-scaling is always handled in H, as is any place where you need to bias correct. So, H being one essentially implies no bias.
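
To make that mapping concrete, for the three-state, two-observation example (first y to the second X, second y to the third X), the observation matrix is just

$$ H = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad H X = \begin{pmatrix} X_2 \\ X_3 \end{pmatrix}. $$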

41:21 - So if you need to bias correct, you can put that in there. It’s also any place where there’s just proxy data; in the TDR example, there’s a line with a slope relating TDR to soil moisture, and that slope is what goes in there. In a more general sense, when you get beyond the most basic Kalman filter, H is just that observation process, the observation model. In more complicated examples, H can be a function that maps between the latent states in the model and the real-world observation. When Shannon was talking about state-space models yesterday, I gave the example.

42:01 - For your remote sensor, H may be everything between our latent state, which may be the vegetation that’s actually on the ground, and what we observe from the satellite. So it could be a complex function of the cloudiness of the air and the aerosols and sensor bandwidths and stuff like that. So, it can get more complicated, but it’s not uncommon for it to just be, like I said, a mapping variable. The posteriors for x2 and x3 aren’t gonna be the super interesting ones; I’m gonna focus on the posterior for x1, because we don’t actually observe x1 directly. So, we end up with a case where the posterior mean for x1 involves the mean of the forecast plus some weight based on the difference between what we forecast and what we observed for each of the two observed states.

43:11 - And these weightings are coming from these error matrices. I think this is interesting and worth mentioning because this math is gonna be familiar to some of you. This is the exact same math that goes on in every spatial statistical model when you’re doing kriging. This is the same math as when you do a time-series model and are projecting to an unknown state. So if you thought of those x’s as locations and P as a spatial covariance matrix, this is kriging.

43:47 - So, what I wanted to point out here is that the only real difference between spatial multivariate models, time-series multivariate models, and data assimilation models is really what P means. In a spatial model, you have a covariance matrix, and the way you construct that covariance matrix is by writing down a function that describes how correlation changes as a function of distance; that is an empirically calibrated function just of distance. And the reason you do that is because you have n observations but need some way of filling in an n by n covariance matrix. There’s no way to fill in an n by n covariance matrix with only n observations, unless you make some assumption about the structure of that covariance matrix. And the common assumption in spatial statistics is that the correlation decays as a function of distance.

44:51 - In data assimilation, you’re doing essentially the same thing, but the way you’re filling in that covariance matrix is with your forecast model. So, you’re using your forecast model as a way of filling in that covariance matrix. Once you do that, under the hood, it’s just like kriging, just like time series. So you could imagine, if you have a model making predictions in space, it’s essentially borrowing information in space, not based on distance, but based on the similarity of your forecasts. Which is probably implicitly a function of distance, but not explicitly.

45:32 - Okay, the forecast step, I think, is actually gonna be a bit more straightforward. The mean of the forecast, the expected value of X f given X a, is just M times the mean of the analysis. The covariance of the forecast is Q, our process error, plus M times the analysis covariance matrix times M transpose, which, if you’re not familiar with matrix algebra, is essentially analogous to m squared times the variance. We just now have a covariance matrix, and M is also a matrix. So, examples of using the Kalman filter.
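
Before turning to those examples, here is a minimal sketch putting the two multivariate steps together; the process model M, the observation setup, and all numbers here are hypothetical:

```python
import numpy as np

# Three states, two of which are observed (the example from above).
M = np.array([[0.9, 0.1, 0.0],    # hypothetical linear process model
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.7]])
Q = 0.05 * np.eye(3)              # process error covariance
H = np.array([[0.0, 1.0, 0.0],    # y1 observes x2
              [0.0, 0.0, 1.0]])   # y2 observes x3
R = 0.1 * np.eye(2)               # independent observation errors

def forecast(mu_a, P_a):
    """Forecast step: mu_f = M mu_a, P_f = M P_a M^T + Q."""
    return M @ mu_a, M @ P_a @ M.T + Q

def analysis(mu_f, P_f, y):
    """Analysis step written in terms of the Kalman gain."""
    K = P_f @ H.T @ np.linalg.inv(H @ P_f @ H.T + R)
    mu_a = mu_f + K @ (y - H @ mu_f)
    P_a = (np.eye(3) - K @ H) @ P_f
    return mu_a, P_a

mu, P = np.zeros(3), np.eye(3)
mu_f, P_f = forecast(mu, P)
mu, P = analysis(mu_f, P_f, y=np.array([0.5, -0.2]))
# x1 is never observed, but its mean and variance still update
# through the covariances the process model creates.
print(mu, np.diag(P))
```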

46:25 - If, for example, M were the equations for Newtonian mechanics, where we do know the constants, then the Kalman filter is how your GPS gets you around. It’s making observations of where you are in space with uncertainty, and it’s assuming that you’re obeying the laws of physics in where you are going. So, it’s predicting where you’re gonna be at the next time point and constantly correcting that based on where it thinks you are, but it never believes the GPS satellite completely (laughs), which I think you all know is true. (laughs) And it’s also consistent with why sometimes you exit and the GPS thinks you’re still on the highway: it has not rejected the hypothesis that you are still on the highway until it’s accumulated enough data that it goes, a-ha, the observations are no longer compatible with that model. If M is Newtonian mechanics, you can also land on the moon.

47:27 - That’s why Kalman got the presidential medal. One of the first real-world applications of the Kalman filter was in the Apollo navigation system. Because if you try to get from here to the moon by dead reckoning, launch and never correct anything along the way, you’re going to miss the moon. (laughs loudly) Instead, you are observing where you are in space and re-adjusting based on an algorithm that’s keeping track of where you are. And as my colleague Dave Moore, who leads the flux course that I’ve taught at for a number of years, likes to say, “Data assimilation is not rocket science, but you can use it for that.” (laughs loudly) Which is true.

48:13 - Data assimilation is used in often less innocent kinds of rocket launches, as well. (laughs) Okay, so pros and cons of the Kalman filter. Big pro: analytically tractable. That makes it very computationally efficient, and it gives you some understanding of what’s going on in data assimilation. So, even as we start relaxing assumptions, the intuitions stay very similar. We’ll start relaxing the math, but the idea that the analysis is gonna be a weighted average between the data and the model is gonna stay.

48:47 - The idea that when we combine pieces of information, we’re gonna be more confident than we were with either, that’s gonna stay. It depends only on the previous state. So, the Kalman filter is Markovian in that regard. You need to know where you are; you don’t need to know where you were n steps in the past. That’s really handy, and in fact, that’s one of the reasons data assimilation can be helpful. So, imagine I was trying to forecast the weather for the next few hours, ‘cause they update that every six hours, and if I had to re-assimilate all of the data since we started numerical weather forecasting in the ‘60s till today just to predict the next six hours, then numerical weather forecasting would get harder every day.

49:35 - (laughs loudly) No, what you’re saying is all of the information about everything that came before now is embedded in the forecast. So, we embed all of our understanding in the forecast. And so, we don’t need to go back and look at the previous data. The past is in the past. Which, again, makes this something that we can handle. There’s a lot of problems that are computationally infeasible to do as a smoother where you have to fit all of the data at once.

50:07 - Even if you’re not making a forecast, you can do it as a filter by moving through the problem in time, because you only care about each observation at each point in time as you update your understanding. When you go back and rerun the analysis over the past, that’s a reanalysis, which, if you’ve ever used atmospheric data products, you may have used: for example, NCEP’s reanalysis products, or the ECMWF reanalysis products. Those are the analysis part of the data assimilation, rerun after all the data was available and they’ve cleaned it all up. So, it’s not the raw observations, it’s that analysis step. Then there are things that are either pros or cons depending on how you are thinking about the world. Linear, normal: I tend not to believe the world is linear and normal, but sometimes it’s a nice simplifying assumption. It does involve matrix inversion.

51:01 - When that n by n matrix gets really big, the matrix inversion itself is the computational cost. So imagine if I wanted to assimilate, say, a snapshot of the world at Landsat resolution, 30 by 30 meters; that’s a lot of pixels. And I have to unwrap that into a length-n vector and then make an n by n covariance matrix between every single Landsat pixel on the planet? That’s a very big matrix. (laughs loudly) It assumes that all the parameters are known, and it’s, again, forward only. If we look at the history of the Kalman filter, among the communities that were using it, which were largely engineers and atmospheric scientists and those folks, the linearity assumption was the one that presented them with the most immediate problem.

51:55 - So if we think about what we learned this morning, how might we relax that linearity assumption? - Taylor approximation, exactly. So, what we’re gonna see is that the major flavors of data assimilation that are used by the community actually map fairly cleanly onto what we learned this morning about the alternative ways of propagating uncertainty in a forecast. So, the analytical moments approach maps onto the Kalman filter. The Taylor series approach maps onto what’s called the extended Kalman filter. We could’ve been more original, but extended Kalman filter works.

52:39 - Can anyone guess, since we’re gonna stay on the trend of lacking originality, what they call this one? - [Class] Ensemble Kalman filter. - Ensemble Kalman filter. You guys are brilliant. Yes, the ensemble Kalman filter is just the same math, but now we’re doing the forecast with an ensemble. And then, here they were more clever: if you extend this to using Monte Carlo uncertainty propagation, that’s called a particle filter. To my knowledge, no one has tried this. (laughs) So the last thing I wanted to finish with is the extended Kalman filter; like I said earlier, this lecture covers the analytical approaches, and we’ll pick up with the numerical approaches, the ensemble Kalman filter and the particle filter, in the morning. The extended Kalman filter addresses that linearity assumption, and it does it by saying mu f is just whatever my non-linear model predicts if you put the mean in.

53:48 - Technically, that violates Jensen’s inequality. It is an approximation. But I guess the counter-argument is, well, what else would you put in, if all you know is the mean and the variance? (laughs) We’re gonna update the variance using a Taylor series expansion. And to do that, we’re gonna need F, which is going to be the Jacobian, the matrix of all the possible pairwise derivatives. Once we have that matrix of derivatives, which is essentially a matrix of slopes, the forecast variance just switches to being Q plus F times the initial condition uncertainty times F transpose, instead of Q plus M times the initial condition uncertainty times M transpose. So instead of a matrix of constant slopes, we have a matrix of derivatives, which is a matrix of slopes.
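
As a minimal sketch of that forecast step, using a hypothetical non-linear model (discrete logistic growth; the model, parameters, and numbers are illustrative, not from the lecture):

```python
# Extended Kalman filter forecast step, univariate case.

r, K_cap, Q = 0.3, 10.0, 0.05     # growth rate, carrying capacity, process error

def f(x):
    """Non-linear process model: discrete logistic growth."""
    return x + r * x * (1.0 - x / K_cap)

def F(x):
    """Jacobian (a scalar derivative here), re-evaluated at the
    current mean every time step."""
    return 1.0 + r - 2.0 * r * x / K_cap

mu_a, P_a = 2.0, 0.5              # analysis mean and variance from the last step

# Push the mean through the non-linear model
# (this is the Jensen's-inequality approximation mentioned above)...
mu_f = f(mu_a)
# ...and propagate the variance with the local linearization:
P_f = F(mu_a)**2 * P_a + Q
print(mu_f, P_f)
```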

54:45 - But the key point here is that when you run the normal Kalman filter, this matrix of slopes is always the same. When you run the extended Kalman filter, this matrix of slopes gets re-updated every time step, because the derivatives are taken with respect to a different mean every time step. So you analytically solve the derivatives, but you have to plug in what the current mean is, every time. The reason the extended Kalman filter works is because you are changing those means every time step. Is this a decent approximation? That’s honestly, in many ways, kind of a Calc I sort of question.

55:29 - If we have a bunch of derivatives and I want to approximate some process, if I make my time step small enough, then approximating that with derivatives isn’t too bad. The bigger the time step, the worse the approximation becomes. The more non-linear the process, the worse an approximation this is. So if you have a highly non-linear process and you’re taking really big time steps, like if you wanna update soil microbes annually, it’s not gonna work (laughs) because they have generation times on the order of hours. So the time step would need to be appropriately short. Technically, this can be extended to higher-order moments, but I’m not gonna go through that derivation. Key take-homes: the degree to which this works has a lot to do with the non-linearity of the model and the time step, but technically it is biased; because of Jensen’s inequality it can’t not be. And likewise, because of Jensen’s inequality, the assumption of normality is also technically false. Because if I put in a normal distribution as my initial condition and I apply a non-linear transform to it, what comes out on the other side can’t be normal.

56:54 - The degree to which these are bad assumptions, again, has a lot to do with how non-linear things are and how often you’re doing the updating.