Data Assimilation: Ensemble Methods
Apr 23, 2020 20:26 · 9980 words · 47 minute read
- So this is the second lecture on data assimilation methods and the one where I want to introduce numerical methods for doing data assimilation. And so this pairs very nicely with the lesson on uncertainty propagation, where we covered analytical methods for propagating uncertainty: the linear tangent approach, the exact transformation of the distribution, and the analytical approach to transforming moments. And we saw when I introduced the analytical approaches to data assimilation yesterday that we could map the analytical approach to uncertainty propagation to the Kalman Filter and the linear tangent approach to the Extended Kalman Filter. And so just to refresh, again, we're thinking about this idea of a forecast cycle. We have some initial state of the system, we project that uncertainty into the forecast, we have some new observations, and we wanna update that state in order to update our forecast.
01:07 - Again, the big picture idea is that we always want to be updating our forecast as new information becomes available. The implicit part of this is we wanna do that update without refitting the whole darn model to every observation we've ever had. Fundamentally that becomes a problem that gets perpetually harder. That said, if you have a super simple model, you may be like, whatever, I'm just gonna ignore this 'cause I don't mind refitting the model. Like I know Tom Hobbes, who does some of these population assessments, he does them annually and he's got 10 years of data, 15 years of data.
01:51 - He just refits the whole darn thing every year because, so what if it takes a few hours to run? He runs it once a year. (laughs) Because it's a small model with an infrequent update. By contrast, if I was trying to update something more complex, with a larger volume of data, I don't wanna be refitting everything. I just wanna be incrementally updating as new information becomes available. And again, we mentioned yesterday this idea that, in the multivariate case, we also want to be able to borrow strength and fuse multiple sources of information.
02:27 - So as I was saying earlier, we can map the Kalman Filter and the Extended Kalman Filter onto those two analytical methods for propagating uncertainty. And so today I wanna introduce two numerical methods for doing iterative forecasting, or data assimilation, that map onto the two numerical methods we know for propagating uncertainty. The first being the Ensemble Kalman Filter, which uses an ensemble based uncertainty propagation, and the Particle Filter, which uses a Monte Carlo based uncertainty propagation. Okay, so to refresh, where we left off yesterday was on the Extended Kalman Filter, which took where we started, the simplest possible model: normal likelihood, normal process error, linear model.
03:24 - And it tried to address that assumption of linearity in the forecast. And it did that by assuming that the forecast mean would be whatever nonlinear thing comes out of our forecast model when we put in the analysis mean, which we know technically violates Jensen's Inequality, and then we update the uncertainties in the forecast using our Taylor Series expansion. So we need this Jacobian matrix of derivatives, which essentially gives us the instantaneous slope at that point in time, rather than using a constant set of slopes in a linear model. So this had two potential problems.
04:10 - One is that we're violating Jensen's Inequality, so this mean is technically biased. And we're assuming normality, but if we have something that starts normal and we do a nonlinear transform on it, it can't technically be normal anymore. But as I said yesterday, those assumptions might be something you can live with or they might be fatal. It has a lot to do with the actual shape of the error in the real world and a lot to do with the degree of non-linearity in the problem. But if I jump back to my title slide, in my mind, the contrast between the Extended Kalman Filter and the Ensemble Kalman Filter is summed up well in this quote, which wasn't about Kalman Filters at all.
04:59 - “An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem.” I kind of view the Extended Kalman Filter as the exact, analytically tractable answer to the approximate problem. By contrast, I view the Ensemble Kalman Filter as an approximate answer; it's not analytically tractable, it's done through numerical sampling. So it's an approximate answer, but to what I think is the right problem. So why do I think that? So first of all I'm gonna point out that in the Ensemble Kalman Filter, like in the Extended Kalman Filter, the analysis step is identical.
05:39 - So we're gonna keep that assumption of normality in the forecast, and we keep the assumption of normality in the process error and in the observation error, and therefore we have this conjugacy and an analytical solution to the analysis problem. But we're gonna do the error propagation using an ensemble based approach. So specifically we're gonna draw some number of samples from the analysis posterior and then we're gonna run the process model forward, with whatever process error you need to add to that model, and what we get out is an ensemble of M forecasts. What you do after that is actually relatively straightforward. I have an ensemble of M forecasts, and I need to plug that in to an analysis step that assumes that those M forecasts follow a multivariate normal distribution.
06:44 - So how do I get from M forecasts to a multivariate normal distribution? Well, I just use the sample mean and the sample variance. It's really that simple. When we introduced ensemble approaches to uncertainty propagation earlier, I always found that they're fairly conceptually straightforward. If you can sample from something, you can transform those samples through your model and you get samples as the output. And remember, when we transform samples, we are dealing with the nonlinearity correctly. So this is why I think we're getting the approximate answer to the right problem.
07:21 - So we're dealing with the non-linearities in our model as they are. The approximation comes in from the fact that we're just using the sample mean and covariance and again assuming normality, and the degree of approximation here improves the larger your ensemble size. So if you use a really tiny ensemble size, you get a very approximate answer. If you use a larger and larger ensemble size, you get a better and better description of this prior mean and prior covariance matrix. To show that graphically, here's an ensemble size of five, which I would never ever do.
(laughs) 08:02 - But it makes for a clearer picture. You have some initial conditions, you draw some discrete number of samples from your initial conditions, you simulate those forward, and they give different predictions. So here, this one started high, it ended up in the middle; this one started in the middle and ended up high, whatever. So that's my distribution of predictions. I fit a mean and variance to that. I then do the math of the analysis, which remember is just a weighted average between the data and the model, each weighted by their precisions, and the summing of the precisions to get the updated analysis. So I have an observation, I shrink towards that observation relative to its uncertainties.
08:44 - I sample again, I forecast again, update again, forecast again. It's really actually not that complex. So here's a simple figure comparing a Kalman Filter and an Ensemble Kalman Filter. The Kalman Filter, the solid line with dots, is the exact same figure I showed yesterday when I was introducing the Kalman Filter. Here's an Ensemble Kalman Filter with an ensemble size of 10, and this is just making the point that even with a fairly small ensemble size for a univariate problem, they give you virtually the same results. And asymptotically, as I increase the ensemble size, it will converge to the same results.
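For the univariate case just described, one forecast-analysis cycle can be sketched in a few lines. This is a minimal illustration, not anyone's production code; the random-walk process model, the error variances, and the single observation are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
M = 1000                    # ensemble size
q, r = 0.5, 1.0             # process and observation error variances (assumed)
y_obs = 2.0                 # one new observation (made up)

# Forecast step: sample the previous analysis, run each member forward
# through the (here, random-walk) process model with process error added
x_analysis = rng.normal(0.0, 1.0, M)
x_forecast = x_analysis + rng.normal(0.0, np.sqrt(q), M)

# Summarize the ensemble as a normal via the sample mean and variance
mu_f, P_f = x_forecast.mean(), x_forecast.var(ddof=1)

# Analysis step: precisions add, and the means are precision-weighted
P_a = 1.0 / (1.0 / P_f + 1.0 / r)
mu_a = P_a * (mu_f / P_f + y_obs / r)
```

The analysis mean lands between the forecast mean and the observation, and the analysis variance is always smaller than the forecast variance, which is exactly the shrink-towards-the-observation behavior in the figure.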
09:28 - And it does so without having to take derivatives. So some of the pros and cons of the Ensemble Kalman Filter. My opinion and observation is that if we look at all the trade-offs across the different methods we're going to use, this is probably the most commonly used approach to data assimilation in ecology right now, because I think it balances a lot of these trade-offs fairly well. It's intellectually tractable, the implementation is usually fairly straightforward, and it has computational demands, but not huge computational demands. So pro: it deals with the nonlinearity correctly, so you have no violation of Jensen's Inequality.
10:16 - Your model can be as crazily nonlinear as you want. If your model is hugely crazily nonlinear and you throw a whole ensemble at it, the ensemble might give you spaghetti output, but that's actually the correct spaghetti output. That's just the crazy stuff your model does. (laughs) But it doesn't make any approximation to that. It uses your existing code. You don't need this Jacobian. You don't need this linear tangent version of your model.
10:43 - What you do need, and you need this anyway for all of the data assimilation variants, is the ability to restart your model. And that's actually, with a lot of these process based models I work with, the biggest limitation for data assimilation: you need to be able to stop the model, muck around with its current configuration, and then restart it under the analysis state. One thing that I'm gonna mention here is that a really important test to perform when you're building a data assimilation system is what's called the Hop Test. So to envision the hop test, imagine I have time and some state X, I start my model at a specific initial condition, and I run it forward.
11:39 - Then, I try this in data assimilation mode, but without any data. I run it forward, I stop the model. I then restart the model, feeding it back in its current state. (chuckles) It should go and give you the same result. If it goes off and gives you different results, your model does not have the capacity to restart correctly. So the Hop Test is just verifying that if you stop the model and restart it, that it goes and gives you numerically identical results to what you had previously.
12:25 - Failing the hop test is usually because your model didn't write out all of its state variables before it stopped, or it didn't read in all of its state variables when it restarted. With more complex models, that's really easy to do. Like you have some sort of latent thing where you assume, when I start the model, I just initialize all these things to zero. No, when I restart the model I have to initialize them to what they were when I stopped. And you miss one and suddenly you can't restart the model correctly.
12:56 - So if you’re doing a Hop Test on a Stochastic model, you also need to write out and read back in the seed on your random number generator. Or else it will not be numerically identical. But it’s a good test to perform that your model doesn’t give different results when you restart it. Because when we do data assimilation, we’re gonna restart it under different conditions and if it can’t even… Yeah, you don’t want it to go crazy. I’m only mentioning it because I’ve yet to work with a process based model that passes the Hop Test the first time. (laughs) Okay.
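As a concrete sketch of the Hop Test, here's a toy stochastic model run 10 steps straight through, then run 5 steps, stopped, its state and random number generator written out, read back in, and run 5 more. The toy model is hypothetical; the point is the bitwise comparison at the end.

```python
import pickle
import numpy as np

def step(x, rng):
    """A toy stochastic process model: one random-walk step."""
    return x + rng.normal(0.0, 0.1)

# Reference run: 10 steps with no stop
rng = np.random.default_rng(1)
x = 1.0
for _ in range(10):
    x = step(x, rng)
x_full = x

# Hopped run: stop at step 5, write out the state AND the RNG state
rng = np.random.default_rng(1)
x = 1.0
for _ in range(5):
    x = step(x, rng)
restart_file = pickle.dumps((x, rng.bit_generator.state))

# ...restart: read everything back in and continue for 5 more steps
x, rng_state = pickle.loads(restart_file)
rng = np.random.default_rng()
rng.bit_generator.state = rng_state
for _ in range(5):
    x = step(x, rng)

assert x == x_full  # numerically identical: the model passes the Hop Test
```

Drop the RNG state from the restart file and the assertion fails, which is exactly the stochastic-model pitfall described above.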
13:40 - But like I said, you need to be able to do that to do the Kalman Filter or the Extended Kalman Filter, and the protocol for all of these approaches to data assimilation is based on the idea that we're stopping the model, grabbing new information, and restarting. Back to the Ensemble Kalman Filter. It is simple to implement. You just need to be able to sample things and run your model. Simple to understand. I just need to calculate the ensemble mean and covariance. How big should your ensemble size be? That's really a basic power analysis question. If I have some amount of variability in my ensemble and I want some specific amount of numerical precision in my estimate of the mean and covariance, what sample size do I need to get a specific standard error? That's all it is. This is really stats 101.
14:35 - I have some observed variability, I want a specific standard error, I need to solve for the sample size that gives me the standard error I want. Just do a test run to figure out how variable your ensemble is. And remember that, in many ecological problems, the process variability in the model is fairly nontrivial. We have a substantial amount of variability, such that you don't necessarily need to estimate your mean down to like 12 decimal points if you only trust the model to one decimal point. Maybe you only need to get this down to one decimal point.
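That standard-error calculation is genuinely one line. A sketch, with made-up numbers for the observed ensemble spread and the precision you actually need:

```python
import math

sd = 2.0          # ensemble standard deviation from a test run (assumed)
target_se = 0.1   # standard error you want on the ensemble mean (assumed)

# SE = sd / sqrt(M)  =>  M = (sd / SE)^2
M = math.ceil((sd / target_se) ** 2)
print(M)  # 400 ensemble members
```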
15:25 - So that means that you can get by with reasonable sample sizes. The con of the Ensemble Kalman Filter: the computational cost is larger than the analytical method's. With the analytical method you only run the model forward once, and the uncertainty propagation is all done analytically. One other thing that I think is another advantage of the Ensemble Kalman Filter is that it is particularly easy to add additional sources of uncertainty.
15:56 - So if I wanted to add uncertainty in the drivers, I just sample the uncertainties in the drivers as well. If I wanted to add on some uncertainty in the parameters, I'd sample the parameters as well. This is exactly what we did in the uncertainty propagation activity. If we want to propagate five sources of uncertainty, we just sample from all five. If you're doing an Ensemble Kalman Filter, that's all you have to do.
16:20 - If you can sample from it and put it in your ensemble, you are propagating that source of uncertainty, done. By contrast, if I did this with the Extended Kalman Filter, every time I add a term, I need to add the derivatives with respect to that term. So now I don't just need the Jacobian of all of the state variables and how they relate to each other in the model, I also need all of the derivatives of the forecast relative to every input, and I need the derivatives with respect to every parameter. That can be really messy. And again, that just gives you a linear tangent approximation. On moments: we are good on Jensen's Inequality, because we are transforming individual ensemble members, not the distribution all at once.
17:13 - Like with the Extended Kalman Filter, we are making the assumption the forecast is normal, but we know technically it can't be. That's the assumption, but then again, that was the assumption we made with all ensemble based methods, which we have to make in order to work with a small ensemble size, relative to the Monte Carlo methods where you run thousands and thousands of iterations so you can draw the full posterior distribution just from the samples. If I'm working with tens to hundreds of samples, I need to make some sort of distributional assumption. The Ensemble Kalman Filter, in order to do this conjugate math in the analysis step, assumes that's normal.
17:57 - But again, it's worth being aware that in many cases this is not a huge deal. But it's worth checking, I will say that. As you start running ensembles, I would definitely say it is worth looking at a histogram of your ensemble projections to say, does this look approximately normal or not? If you look at the projections of your ensemble and they look bi-modal and you stick that into something that assumes a normal, imagine what you're making the mean represent. (chuckles) Because it's actually not hard to generate that. Let's say I have some X, and the histogram of X looks like, actually I'm gonna draw it exactly how it happens in my models.
18:51 - (audience laughing) What happens in my model, where X is often biomass: everything is happy, or everything dies. Everything dies could be a choice of parameter combinations that aren't viable, an ensemble member that imposes some severe stress, a driver ensemble member that causes a disturbance event. So this could just be no fire versus fire, no hurricane versus hurricane. Now if I fit a (chuckles) normal to that, the mean doesn't represent any of my ensemble members at all. (chuckles) So I'm putting in my mean forecast as something that never actually happens.
19:47 - So those are the sorts of cases where you go, well, that's a bad assumption, I need to do something a little bit more sophisticated. But if on the other hand you only had this case and you're like, well, sure, that's approximately normal, I can live with that. And that's the last key point here: the analysis is not hard to generalize, but it just won't have an analytical solution. And when you're dealing with problems that are really big, the computation on that analysis step really has to have an analytical solution to be able to do the forecast.
20:24 - So if you look at the weather forecast, where every six hours they have to update the analysis, to put this in perspective: NOAA's current operational weather forecast system right now assimilates 10 terabytes of data per day. They can't MCMC their way (audience laughing) through 10 terabytes. So two and a half terabytes every six hours. I can't even download two and a half terabytes every six hours (laughs), let alone assimilate it. They have more computers than I do. (laughs) By a lot.
(laughs) 21:05 - NOAA's next generation analysis system is being designed to assimilate 10 terabytes an hour. So anytime you think that ecologists have entered the era of big data, (all laughing) you talk to our brethren in the physical environmental sciences and go, “Oh, we're small potatoes.” Really, we can afford to do things the hard way for quite a while still. Okay, a couple of things I wanted to introduce that are kind of additional flavors on this theme. So, in the standard version of the Ensemble Kalman Filter, when you get that analysis posterior, you just sample from it.
21:51 - Someone at some point came up with the idea that, well, there's a number of cases where just completely resampling from it might not be exactly what I wanna do. Instead of resampling, can I just nudge the current ensemble back towards the mean? So if I have some set of ensemble members coming in for the forecast, and this is actually what I visually showed earlier, they're summarized by this distribution, and then the analysis says they need to have this new mean and covariance. The standard Kalman Filter just says: you take this, you take that, and then I just draw samples from this.
22:50 - And they have no connection to the samples from this. Alternatively, you might say, well, what I wanna do is just nudge these so they stay in the same order. They represent the same quantiles, but they've shifted the mean and shifted the covariance. If this is a univariate problem, this is really easy; all of you could figure out the math for doing that. This is just the matrix version of that math, which essentially boils down to: here are the residuals between the forecast mean and the ensemble members, I'm going to standardize those to be Z scores, which is just a bit of matrix math that essentially means divide by the standard deviation. I then take that Z score, essentially multiply it by the new standard deviation, add on the new mean, and get it back into the new domain. To do that, you need to be able to calculate a singular value decomposition, which is something that's in every programming language in the world.
23:55 - There's an R package for it, there's a Python package; it's built into everything. It's very straightforward and a bit of math. So when would you want to do this though? A couple of key places. One would be if you have latent states that you did not write out. So if I have 12 Xs in my model, but there's actually a few more things in the model that I need to restart but I'm not choosing to nudge, then I might wanna keep those consistent from one state to another.
24:27 - The other place where it's particularly useful is when I'm propagating other uncertainties and I wanna keep them associated. So imagine that part of why I got this ensemble spread is I have a particular set of parameters here and a different set of parameters there. The parameter set that caused this one to be high and the parameter set that caused this one to be low, I wanna keep that high one here and I wanna keep the low one there. So that would be a reason why I would want to keep this ancillary information with each ensemble member as they move from one step to another. Similarly with the drivers. If I have a particularly wet or dry set of ensemble members from my drivers, I might wanna keep them associated with their previous ensembles.
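The nudge described above can be sketched as follows. This is my own minimal version of that shift-and-rescale math, using an SVD-based matrix square root; the analysis mean and covariance here are arbitrary inputs. Each member keeps its identity (and whatever parameters or drivers ride along with it) while the ensemble as a whole moves to the analysis moments.

```python
import numpy as np

def sqrtm_svd(P):
    """Symmetric matrix square root via singular value decomposition."""
    U, s, Vt = np.linalg.svd(P)
    return U @ np.diag(np.sqrt(s)) @ Vt

def nudge_ensemble(X_f, mu_a, P_a):
    """Shift/rescale a forecast ensemble X_f (M x n) to mean mu_a, cov P_a."""
    mu_f = X_f.mean(axis=0)
    P_f = np.cov(X_f, rowvar=False)
    Z = (X_f - mu_f) @ np.linalg.inv(sqrtm_svd(P_f))  # standardize: Z scores
    return Z @ sqrtm_svd(P_a) + mu_a                  # rescale and recenter

rng = np.random.default_rng(0)
X_f = rng.normal([0.0, 0.0], [1.0, 2.0], size=(500, 2))  # forecast ensemble
mu_a = np.array([1.0, -1.0])                             # analysis mean
P_a = np.array([[0.5, 0.1], [0.1, 0.5]])                 # analysis covariance
X_a = nudge_ensemble(X_f, mu_a, P_a)
# X_a now has exactly the analysis mean and covariance, with members
# preserving their relative positions rather than being resampled
```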
25:19 - Okay, another thing that actually does apply to all Kalman Filter variants as well. I mentioned when I introduced the Kalman Filter that the real computational cost of the Kalman Filter is that matrix inversion. So as the number of states in my model gets large, the size of that covariance matrix gets very large, and the math for the analytical solution of combining the two involves a number of places where we have to invert matrices, particularly that covariance matrix. The vanilla flavor for how computer scientists invert a matrix is what's called an order n cubed algorithm. And that means the time it takes to do this operation scales with the third power of the size of the matrix.
26:14 - So imagine I had a matrix that was 100 by 100 and let's say that took one second to invert. If I then switch to something that's 1000 by 1000, I made n 10 times bigger, so now it takes 10 cubed, a thousand times, longer to do that computation. So you can see, even if this starts at a millisecond or a lot smaller, when these matrices start getting bigger, the computational cost of inverting them ends up being really non-trivial. Enter the applied mathematicians, who have a number of non-vanilla algorithms for inverting matrices, but they only work when matrices have very specific structure. So a diagonal matrix I can invert in order n, because if the matrix is diagonal, the inverse is just one over everything on the diagonal. It's trivial.
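To make the diagonal case concrete (a sketch, not tied to any particular model):

```python
import numpy as np

d = np.array([2.0, 4.0, 5.0])      # the diagonal of a 3 x 3 diagonal matrix
D = np.diag(d)

inv_fast = np.diag(1.0 / d)        # order n: just one over the diagonal
inv_general = np.linalg.inv(D)     # the general-purpose order n^3 routine

assert np.allclose(inv_fast, inv_general)  # same answer, far cheaper
```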
(clears throat) 27:22 - You wouldn't actually want a covariance matrix that's purely diagonal, because then we've lost that whole borrowing strength thing. But if your matrix has, say, block diagonal structure, or basically anything where you can make the matrix sparse with very specific structure, sparse meaning you put in a lot of zeros, then you can employ more sophisticated approaches to matrix inversion. The key way we do this is an approach called Localization. And the idea of localization is that you assume that the correlations truncate to zero beyond some distance. So we know that correlations decay as a function of distance.
28:06 - That decay is implicitly generated by the models, and we just lop the correlations off. That creates, potentially, a lot of zeros in a nice, often block diagonal structure. So if you imagine that your Xs were different locations in space, if I truncated them after a certain distance, I would end up with a more diagonal structure, which is something that I can invert more easily. So that increases the computational efficiency, and it can also help us avoid spurious correlations. So if I have thousands of state variables and hundreds of ensemble members, I'm gonna get strong correlations every once in a while just by chance.
28:45 - And if I leverage those I can get wacky things happening. So Andy Fox, who was running data assimilation at NEON before that was descoped, had this one example where there was, I think, some phenological transition or some flux where there was what he strongly believes to be a spurious correlation, a strong correlation between like Alaska and Florida. So data in Alaska suddenly changed the predictions for Florida very strongly because the correlation was like 0.95, and it's like, that makes no sense biologically. It just happened because, with that number of correlation coefficients being calculated, it's gonna happen every once in a while. So you kind of lop it off and say, even if I do find a correlation beyond this distance, it's probably not real.
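A minimal sketch of that hard-cutoff version of localization, with made-up site coordinates; operational systems typically use a smooth taper (e.g. Gaspari-Cohn) rather than a sharp cutoff, but the idea is the same: zero out the covariance between anything further apart than some distance.

```python
import numpy as np

sites = np.array([0.0, 1.0, 2.0, 50.0])         # 1-D site locations (made up)
dist = np.abs(sites[:, None] - sites[None, :])  # pairwise distance matrix

# A dense sample covariance: everything correlated with everything
P = np.full((4, 4), 0.8) + 0.2 * np.eye(4)

cutoff = 10.0
P_loc = P * (dist <= cutoff)  # localization: distant covariances set to zero
# The far site (index 3) is now uncorrelated with the near cluster, giving
# the matrix a block diagonal structure that is much cheaper to invert
```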
29:36 - Also noting that that distance doesn't need to be physical. You can drop correlations beyond some distance in whatever way you want to define distance: community similarity, some sort of other environmental space, some other measure of distance like Manhattan distance, whatever. However you want to define distance, the idea is you have some rule, some metric, where you say I'm gonna start chopping things off and setting the correlations to zero. Another thing to be aware of in all forms of data assimilation, not just ensemble ones, is this idea of Filter Divergence. So here I've set up a simple data assimilation example.
30:20 - I don't even remember if this was an Ensemble Kalman Filter or a Kalman Filter or whatever, but the idea is I started off two versions of the assimilation at the same starting point, same initial conditions, with pretty high uncertainty. I am assimilating the same observations, but what's different between these two variants is what I've assumed about the process error. Remember, in the standard Kalman Filter variants, you have to assume that process error. In one case I said the process error is fairly high, and in the other one I said it's fairly low, so the model doesn't have much unexplained variability in it.
30:56 - In the model with higher process variability, it's following the data, and then there was kind of a regime shift in the data at time 10 and it shifts to the new mean. When there was low process variability, the model observes data, it gains information from the data, it becomes more precise, and so the precision gets tighter and tighter, and then the regime shift occurs. The model says, I'm very confident in the model, so I gave it a lot of weight because its uncertainty is small. The data has higher uncertainty than the model, so the model continues to get most of the weight. So this is an instance of filter divergence, which is essentially what happens when you become falsely overconfident in your model, causing the model to stop following the data, to diverge from reality.
31:51 - When we do the analysis, we're summing the precisions of the two, so you always get more precise, and then I'm weighting them by their relative uncertainties. What happens in many cases is your uncertainties reach an equilibrium, because you have the precision from the data making the analysis tighter, and then you have the process error making it larger again. So if you keep doing this cycle of analysis pulling you in, process error pushing you out, analysis pulling you in, process error pushing you out, essentially you hit a steady state. And now you've hit a steady state where the model is falsely overconfident.
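That pull-in/push-out cycle is easy to see numerically. A sketch with arbitrary process and observation error variances, iterating the variance update to its fixed point:

```python
Q = 0.1   # process error variance (assumed)
R = 1.0   # observation error variance (assumed)

P_a = 5.0                                # start with a wide analysis variance
for _ in range(100):
    P_f = P_a + Q                        # process error pushes variance out
    P_a = 1.0 / (1.0 / P_f + 1.0 / R)    # analysis pulls it back in

print(P_a)  # converges to an equilibrium (~0.27 here), wherever you start
```

Shrinking Q shrinks that equilibrium toward zero, which is exactly how a too-small process error makes the filter falsely overconfident and lets it ignore the data.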
32:33 - This usually isn't just an issue of how you tuned your process error. This is largely just a reflection that you've ended up in a place where you are falsely overconfident in your model. There's some uncertainty about your model that you're not propagating appropriately. With this Filter Divergence, I will say, the folks that I know that do data assimilation, the atmospheric scientists, spend a lot of time worrying about filter divergence. This idea of a model variance collapsing to zero or some small amount, whereby it ignores the data.
33:14 - Because a model that's falsely overconfident diverges from the data, their solution is to tune the process error to stop that from happening. That stops it from happening, but it has no theoretical justification. So I'd label that as not necessarily the best solution to your problem: turning knobs to get the answer you want. It's also worth noting that ecology is far less chaotic than the atmosphere. So if this happens in atmospheric science, where they have a chaotic system, it's even more prone to happen in ecological systems, where we have stabilizing feedbacks.
33:56 - But that also means that occasionally, convergence of your ensemble members to one answer is the right answer. So if I'm trying to predict phenology, my prediction for what percentage of leaf-on we're at right now, in the middle of the summer, is 100%. If I did this as an ensemble forecast, I'm fairly confident every ensemble member would say it is 100% and would ignore any data that says it's not, because what I'm gonna see in the data right now is just sampling noise. So in that case, all of my ensemble members converging to the same answer is not the wrong answer, even if it causes the model to start ignoring the data. It's a problem if the model in August doesn't say, well, actually there's some probability that we're going into an early fall.
34:51 - And so if we go into the fall, the process model needs the ability to recreate that ensemble spread again. So convergence by itself is not a wrong answer if that convergence does represent true understanding of the system and its stabilizing feedbacks. But in many other cases, it can really indicate misspecification of the process model or mispartitioning of the process error. So very often it means that there's some source of uncertainty in the real world that you did not put in the model. And this is an easy way to get that: if I take one of our models and run it in the Ensemble Kalman Filter with just initial condition uncertainty, I'll get filter divergence fairly easily, because I'm ignoring uncertainty in the parameters. I'm ignoring uncertainty in the drivers.
35:50 - If I start ignoring sources of uncertainty, then I become falsely overconfident in the model. Or like the example I had before: if in the real world there's disturbance and in the model I'm ignoring that disturbance can occur, all of my ensemble members will converge on a nice successional forest, and when I get observations that say no, that forest has burned down, my ensemble's gonna ignore it if I didn't put in that there's some probability of that disturbance occurring. On the flip side, that's actually one of the strengths of data assimilation in my mind, which is that there are stochastic events in the real world, like disturbances, and I sure as heck want to update my model's forecast the second I observe one. So if I observe that a fire has occurred, I want the data assimilation system to say, oh no, our forecast shifts to a completely different state right now. I need to reconcile that with reality.
36:54 - So even if you had a model that was perfect in its representation of process, it can't be perfect in its representation of stochastic events. And so you'll always wanna be assimilating those. Reiterating that process error is important, but none of our Kalman Filter variants can estimate it explicitly. So again, for example, a simple random walk state space model has priors on the process and observation errors. The Kalman Filter variants don't have priors on those things 'cause they're not treating them as unknowns.
37:27 - If you put priors on them and treat them as unknowns, you can do that, but you lose the conjugacy and the analytical solution. And we'll come back to that again later in the lecture, but what I wanna do now is move on to thinking about: what if we forecast with a larger ensemble? What happens if we use a large enough number of ensemble members that we can approximate the distribution just with that sample? We go back to this Monte Carlo approach to propagating uncertainty. Because the only real disadvantage we saw in the Ensemble Kalman Filter was making this normality assumption, and it might not be appropriate. We don't have to. We know from MCMC we can approximate any distribution just by samples from it. Why can't we do the same thing in data assimilation? We can eliminate these distributional assumptions, we can eliminate this normal-normal analysis.
38:26 - We can start dealing with estimating the process error and things like that. But how do we do it? So with the Monte Carlo approach to uncertainty propagation, the problem isn't in the forecast step. In the forecast step, you just run a bigger ensemble. You run a big enough ensemble, you approximate the distribution with samples from that ensemble. The problem comes with the analysis step. My prior is now just a pile of samples.
38:55 - I'm not writing down a distribution, because the whole point of this was to not write down a distribution. So how do I update that analysis? How do I update the system? So the idea behind this, which is what's called the Particle Filter, is: imagine I have some set of initial conditions, I sample from it, and I run an ensemble. Here I've got a small number just for visualization purposes. But remember, in a Monte Carlo approach I'm gonna want to run like 10,000 simulations, because I want to be able to approximate the probability distribution at all times. So this is my set of samples, and the Particle Filter essentially calls each ensemble member a particle. That's where it gets its name.
39:45 - It's envisioning these as particles moving through. I now observe some data, what do I do? (sighs) So I look at it and I go, intuitively, these ensemble members just became a lot less likely, and these ensemble members became more likely. The likelihood tells me that. How do I convey that information to these ensemble members, that I like these ones better and I like these ones less? So the key idea, the key way that I tell these particles that I like this one more and I like this one less, is I give them weights. So when I start out, I sample from the prior, and since I'm sampling randomly from the prior, all samples from the prior are equally likely.
40:45 - They all have the same weight, weight equals one. But then I observe some data and I can say, well, this value has some likelihood, it's non-zero, not particularly high. This one has a higher likelihood. If I had one here, it'd have the highest likelihood. And these guys out here, a very very low likelihood. So I give the values out here very low weight, and I give the values over here much higher weight. The weight I give them is the likelihood. I use the likelihood to give them that weight. Okay.
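As a minimal sketch of this weighting step (the 1-D state, the Gaussian observation error, and the observation value here are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 1-D state: sample an ensemble of "particles" from the prior.
n = 10_000
particles = rng.normal(loc=0.0, scale=2.0, size=n)
weights = np.ones(n)        # all samples from the prior start equally likely

# A new observation arrives; assume Gaussian observation error (sd = 1).
y_obs = 1.5
lik = np.exp(-0.5 * (y_obs - particles) ** 2)   # likelihood of each particle

# The weight of each particle is its likelihood.
weights *= lik
weights /= weights.sum()    # normalize so the weights sum to one
```

Particles near the observation end up with high weight; particles out in the tails end up with weight near zero.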
41:27 - So now every particle has two pieces of information: its state and its weight. What do I do? How do I use that? I panic, whoa. (audience laughing) Okay, wait, this has made things harder. Okay, but then I go, okay, no, no, this isn't actually that bad. I know how to calculate a weighted mean. So if I want the mean of my ensemble, I just calculate a weighted mean instead of a non-weighted mean. Okay, turns out I can also calculate a weighted variance. That's not actually that hard to do. I can calculate weighted quantiles. It's not hard to do. Instead of building the cumulative distribution counting all particles equally, you build the cumulative weight, summing up their weights in order. Basically anything that I want to calculate from my posterior, I calculate the weighted version of that thing. So a confidence interval is done with a weighted quantile; mean, variance, all the weighted versions. So we're actually good.
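Those weighted summaries are all one-liners. A sketch (the particle values and weights here are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
particles = rng.normal(size=5000)
w = rng.uniform(size=5000)
w /= w.sum()                                   # normalized weights

# Weighted mean and weighted variance of the particle cloud.
mean_w = np.sum(w * particles)
var_w = np.sum(w * (particles - mean_w) ** 2)

# Weighted quantile: sort the particles, accumulate their weights in
# order (a weighted CDF), then invert that CDF at the desired quantile.
def weighted_quantile(x, w, q):
    order = np.argsort(x)
    cdf = np.cumsum(w[order])
    return x[order][np.searchsorted(cdf, q)]

median_w = weighted_quantile(particles, w, 0.5)
```

With equal weights these all reduce to the ordinary sample mean, variance, and quantiles.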
42:31 - As long as we remember to use those weights, that's the only catch. I can make a weighted histogram perfectly fine. If I'm just using those weights, I have a way of carrying that information forward. Now I make a second observation, so I get a weight from that second likelihood. So the total weight is now the product of those two weights. And if I get a third observation, I'm just continuing to accumulate weights.
43:02 - I continue to accumulate weights; these guys that are consistent with the data keep getting more and more weight, these guys out in the tails keep getting lower and lower weight. So in the basic, vanilla-flavored Particle Filter, the weights are provided by the likelihood (the posterior is likelihood times prior), and any estimates are based on those weighted versions. If you're doing this in a non-time-series model, this is often also referred to as Sequential Monte Carlo. Problem: weights can converge on one or a small set of ensemble members.
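Each new observation just multiplies the weights by its likelihood. One practical detail beyond what was said here: implementations usually accumulate log-weights to avoid numerical underflow. A sketch with made-up observations and an assumed sd = 1 observation error:

```python
import numpy as np

rng = np.random.default_rng(7)
particles = rng.normal(0.0, 2.0, size=2000)
log_w = np.zeros(particles.size)      # log(1) = 0: equal starting weights

# Each observation adds its Gaussian log-likelihood to the running total,
# which is the same as multiplying the weights by each likelihood.
for y in [0.9, 1.1, 1.0]:
    log_w += -0.5 * (y - particles) ** 2

w = np.exp(log_w - log_w.max())       # subtract the max before exponentiating
w /= w.sum()
```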
43:37 - So if I keep doing that process, those values that are near the middle keep getting lots and lots of weight, and those values that are out in the tails keep getting less and less weight. That causes two related problems. One, the whole idea of a Particle Filter is that I'm approximating my distribution with samples from my distribution. But if I'm giving a lot of weight to a small subset of those and very little weight to the others, the effective sample size gets very small. So I'm actually doing a very bad job of approximating that histogram.
44:17 - So if I kept doing this and I gave 100% of the weight to one ensemble member and zero weight to the other 9999, my histogram is just a point estimate and it's a really bad histogram. At the same time, those other 9999 that have virtually no weight, I keep paying all of the cost of carrying them forward and I'm getting almost nothing out of it. Like, I incur the cost of running the model, I incur the cost of comparing the model to data, but they contribute essentially nothing to my posterior. So the idea of resampling Particle Filters is to try to deal with both of those problems. And it's exactly the intuition that you had.
45:03 - Which is, I want to resample from my posterior. And when I resample from that posterior, things that have more weight are resampled more often; I resample in proportion to the weights. So things that have high weights get resampled more, things that have low weights get resampled less often. Things that have very low weights may get dropped from the ensemble completely. So let's look at what this might look like graphically.
45:28 - So I might start with a set of forecasts coming out of different ensemble members, and imagine this line represents the likelihood surface. And so these guys get low weights, these guys get higher weights. And so here's my posterior in terms of that weighted histogram. Obviously it's a very bad posterior because I didn't use nearly enough particles, but the idea is still: these guys get more weight, these guys get lower weights, these guys get very little weight. And I can work with that, in terms of its weighted mean, weighted covariance, et cetera.
46:12 - Then, during the resample step, if I resample in proportion to these weights, this one with high weight might get cloned three times. These guys with intermediate weights might get cloned twice. These guys might stay, and these guys with very low weights might just get dropped. So once I resample, all the weights get set back to one and I just have a histogram again. So now, approximating the posterior as a histogram, I don't have to calculate a weighted mean, a weighted covariance; I just have the sample mean, sample quantiles, whatever.
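The resample step itself is one call: draw particle indices in proportion to the weights, with replacement, then reset the weights. A sketch (multinomial resampling, with an arbitrary Gaussian likelihood centered at 0.5):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
particles = rng.normal(size=n)                 # prior ensemble
w = np.exp(-0.5 * (0.5 - particles) ** 2)      # weights from a likelihood
w /= w.sum()

# Multinomial resampling: high-weight particles get cloned,
# low-weight particles tend to get dropped entirely.
idx = rng.choice(n, size=n, replace=True, p=w)
particles = particles[idx]
w = np.ones(n) / n                 # reset the weights after resampling
```

Fancier schemes such as systematic resampling do the same job with less Monte Carlo noise.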
46:51 - Now the key for the resampling Particle Filter to work is that there has to be process error in the model, such that these three ensemble members, when I run them forward to the next step, don't just give me the exact same answer. Because if they did, I'd still have this problem of degeneracy. I would just be doing the calculation three times to get the exact same result, and I could have just saved myself the trouble. You need process error causing these ensemble members to spread back out over time and produce different predictions. And so I can go through this cycle of forecast, analysis, resample, forecast, analysis, resample.
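Putting the pieces together, here is a toy forecast-analysis-resample loop for a random-walk state (the observations and the process and observation error values are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(0.0, 1.0, size=n)       # initial state ensemble

obs = [0.5, 1.0, 1.5]                  # synthetic observations
sd_proc, sd_obs = 0.5, 1.0             # assumed process / observation sds

for y in obs:
    # Forecast: the process error is what spreads the cloned
    # particles back out after each resample.
    x = x + rng.normal(0.0, sd_proc, size=n)   # random-walk process model
    # Analysis: weight each particle by the likelihood of the observation.
    w = np.exp(-0.5 * ((y - x) / sd_obs) ** 2)
    w /= w.sum()
    # Resample: survival in proportion to weight; weights reset to equal.
    x = x[rng.choice(n, size=n, replace=True, p=w)]
```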
47:38 - So this is just the same idea in a different graphic, now pointing out that this easily deals with multimodal distributions and non-normal distributions. And again, reinforcing this idea that we're carrying forward both the states and the weights. So yeah: forecast, then analysis, which gives us weights. Resample, forecast again, analysis gives us weights, and then we can resample. This is the part where I will say, if you've never heard of this thing before, your brain may be going, what the hell is going on? (laughs) This is so different than MCMC.
48:21 - It's not that different, but it is definitely different in algorithm from the other forms of data assimilation that we've talked about. So, when to resample? There's a balancing act. If I resample too often, I lose particles simply through drift. So imagine I started out, I took a sample from my prior, and at every step there's no data, I just keep resampling. I resample from that ensemble, I'll lose some ensemble members just by chance. I resample again, I'll lose more just by chance. So if I kept resampling with no information, I would still converge on one ensemble member just by a process of drift.
49:06 - So it's not guaranteed that I wanna resample every single time step. On the flip side, if I don't resample often enough, the filter converges on too few ensemble members, we get this degeneracy problem, we don't approximate the distribution well, and we're spending a lot of time on computation we don't need. So there's no hard rule for when to resample. It's kind of like the Jump Distribution in MCMC, you kind of tune it to get something to behave well. But typically you do that using rules of thumb, such as: when the effective sample size drops below some threshold, resample.
49:47 - So a very common one would be: if the effective sample size drops by more than half, you might wanna resample. And in fact the effective sample size is not hard to calculate. If you standardize the weights to make sure they sum to one, then one over the sum of the squared weights gives you the effective sample size. So if every ensemble member is counted equally, the effective sample size is just the sample size, and this down-weights it when you have some that are counted high and some that are counted low. And then, just again, a reminder that when you resample, set those weights back to one. Okay, so the pros and cons of the Particle Filter. The biggest con is the computation. You have to have a problem where you can maintain enough particles to maintain a good approximation of your full distribution.
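A sketch of that calculation and the resampling rule of thumb:

```python
import numpy as np

def effective_sample_size(w):
    """1 / sum(w^2), with w first standardized to sum to one."""
    w = np.asarray(w, dtype=float) / np.sum(w)
    return 1.0 / np.sum(w ** 2)

# Equal weights: the effective sample size is just the sample size (~1000).
ess_equal = effective_sample_size(np.ones(1000))

# All the weight on one particle: the effective sample size collapses to 1.
w = np.zeros(1000)
w[0] = 1.0
ess_degenerate = effective_sample_size(w)

# Common rule of thumb: resample when ESS drops below half the ensemble size.
should_resample = ess_degenerate < 1000 / 2
```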
50:54 - And building in the fact that, you know, you're gonna encounter this degeneracy problem. So if I'm resampling whenever the effective sample size drops below, say, n over two, then instead of 5000 particles I need at least 10,000, probably more. There's gonna be a curse of dimensionality here too. When you sample parameters, for example, you only get that initial sample. If you have a very high dimensional space, you could easily be missing parts of parameter space that are actually quite likely.
51:29 - So if you have a very wide prior and a very small ensemble, the odds that you had good ensemble members in that initial ensemble may be very low. So again, the less informative your priors, the more samples you need to explore parameter space. The bigger the dimensionality of your problem, the more samples you need to make sure that you've explored parameter space. Because all you can do in the Particle Filter is up-weight or down-weight what you already have. Pros: it's actually not hard to implement. Other than that weighting bit, it's really not that hard.
52:07 - It's fairly general and flexible, and it's parallelizable. All those ensemble members can run independently, and this is also true of the Ensemble Kalman Filter. If you have a big computer, you can put each of your model runs on different nodes or different CPUs. And you can also evaluate all of your parameters. So you can update the parameters and the process error and the observation error and all those things. There's a but, though. There is a but.
52:37 - Which is, remember that the Particle Filter only works because, when I resample specific values, they have to have this process error that causes them to diverge back out. That's true of your states, but it's not true of your parameters. When I resample a parameter and I run my model forward, the parameter that comes out of the model isn't different from the parameter I put into the model. So because parameters lack process error, they can be subject to that degeneracy that we avoid, by resampling, for the process states. So there are a couple of ways out there to try to deal with this, and they basically come down to trying to come up with an approximate way to resample the parameters.
53:29 - Often that involves using some sort of Kernel Smoother. So to step back: when we work with particles, the same as when we work with MCMC, we are working with a discrete approximation to a joint probability distribution. So we have a set of discrete parameter values for our parameters, just a list of numbers. To be able to resample from that, we need to generate some sort of continuous approximation to that joint distribution, 'cause we need to be able to generate parameter values that aren't in our current parameter set.
54:01 - That's often done by putting some sort of Smoothing Kernel over that. And Smoothing Kernels are the sorts of things that we see all the time. Like, if you run an MCMC and you make a smooth density plot at the end instead of the raw histogram, that's just a Kernel Smoother. This is nothing more or less complicated than that. But when you do that smoothing, you have to make a choice about how you're doing the smoothing, the bandwidth and stuff like that.
54:26 - So here's a very simple example, which is kind of the global Gaussian Smoother, which is essentially: fit a multivariate normal to the current posterior parameter distribution, and then update the parameters with some weighting, h, between no smoothing and complete smoothing. So in this example, if you set h to zero, then this term goes away and you are essentially just redrawing parameters from this multivariate normal approximation to the posterior. If you set h to one, you're essentially not redrawing anything and your updated parameters are just your current parameters. And you can tune h to give you something in between.
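A sketch of that global Gaussian smoother. This follows the Liu-West style shrinkage; the (1 - h²) kernel variance is an assumed detail, chosen because it keeps the ensemble mean and covariance unchanged:

```python
import numpy as np

def smooth_parameters(theta, h, rng):
    """Global Gaussian kernel smoother for an (n_particles x n_params) array.

    h = 1: parameters come back unchanged (no smoothing).
    h = 0: parameters are redrawn from a multivariate normal fit
           to the current particle cloud (complete smoothing).
    """
    mean = theta.mean(axis=0)
    cov = np.cov(theta, rowvar=False)
    shrunk = h * theta + (1.0 - h) * mean             # pull toward the mean
    noise = rng.multivariate_normal(                  # kernel perturbation
        np.zeros(theta.shape[1]), (1.0 - h**2) * cov, size=theta.shape[0])
    return shrunk + noise

rng = np.random.default_rng(5)
theta = rng.normal([1.0, -2.0], [0.5, 0.3], size=(4000, 2))
theta_new = smooth_parameters(theta, h=0.95, rng=rng)
```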
55:14 - Even better than just tuning h to give you something reasonable is to actually build a formal Metropolis-Hastings accept or reject into these proposed moves. So you might, for example, split every particle in two, one with its current parameter value, one with a new proposed parameter value, run them forward, and when you get to the next time step, accept or reject which one of those you keep. It does require doubling your particle count, but then, rather than kind of a tuned thing, you have a formal Bayesian way of doing it. So the idea of putting resampling together with this Metropolis-Hastings move is called the Resample-Move Particle Filter. And I'm not gonna go into the implementation.
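A caricature of that per-particle accept/reject step. Everything here, including the trivial "model" in which the predicted state just equals the parameter, is invented to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2000
theta = rng.normal(0.5, 0.2, size=n)        # current parameter particles

# Split every particle in two: keep the current parameter and
# propose a perturbed copy (a symmetric random-walk proposal).
theta_prop = theta + rng.normal(0.0, 0.05, size=n)

# "Run both forward" and score them against the next observation.  Here
# the model is trivial: the predicted state is just the parameter itself.
y = 0.7
loglik_cur = -0.5 * ((y - theta) / 0.1) ** 2
loglik_prop = -0.5 * ((y - theta_prop) / 0.1) ** 2

# Metropolis-Hastings accept/reject, per particle (symmetric proposal,
# so the acceptance ratio is just the likelihood ratio).
accept = np.log(rng.uniform(size=n)) < (loglik_prop - loglik_cur)
theta_new = np.where(accept, theta_prop, theta)
```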
56:00 - This is now kind of just giving you buzzwords to Google. (audience laughing) If you ever decide to dive into it and you get to these papers that are like more equations than text. That’s where I am right now. I’m at the like, I get the general idea, but I’m still scratching my head over some of the equations. (laughs) Getting close to wrapping up. So just again, big picture, we have a bunch of different options for how we assimilate data into models, into forecast.
56:28 - Iteratively, they largely map onto the different ways that we propagate uncertainty. What about that analysis step? I wanted to come back to that. So why don’t you just do the whole analysis step? Why don’t we just do the whole thing by MCMC? So option one, just refit the full State Space Model every time step. I said this earlier, that’s exactly what Tom Hobbes does. If I’ve got 50 observations over 12 years and I refit it every single time, once a year by hand, who cares? That’s still an option.
57:01 - So you can always refit the full State Space Model and then just make a new forecast out of that. Option two, what if we just update the forecast coming out of the State Space Model? So if we treat the priors as samples, we can just do that as a Particle Filter. So in many ways the MCMC can run seamlessly into the Particle Filter. ‘Cause they’re just treating things as particles. Alternatively, we can approximate those with distributions, and then we’re essentially doing some generalized form of ensemble filtering where we don’t have to assume the same conjugacy that Kalman did.
57:46 - But remember, like with classical statistics where I said everything was done so that there was an analytical solution? Kalman likewise had to do things so there was an analytical solution, because MCMC wasn't invented when the Kalman Filter was derived. No one was doing that. So here's a very simple example of generalizing this. Here I've just got a simple Dynamic Linear Model, and I actually started by doing it just with a standard Ensemble Kalman Filter, assuming everything was known. And then I incrementally started relaxing those assumptions about what was known.
But I refit each analysis step using a full MCMC. So I didn't refit the full model, I just did each analysis step by MCMC. The first thing I did is I said, well, the assumption that the process error is known is the least tenable of all of these assumptions, so I wanna fit that. And it turns out, if I treat that as an unknown, I get something that converges to the same thing as the Kalman Filter so quickly that they're essentially on top of each other. I then said, well, what if the process and observation error are both unknown? Well, it turns out I have no idea at all what I'm doing for the first five time steps, 'cause as I'm going through the analysis step, the model is trying to reconcile these two parameters and come up with estimates of them.
59:20 - But then they start converging, and by 10, 20 steps I'm largely getting the same answer that I did without that. Then I said, well, what if I treat the parameters as unknown? Now it takes me longer to start sorting out these three parameters that I need to estimate. But still, I'm trying to estimate three parameters with 10 observations. It's not unreasonable to expect that it takes a while to do that. By the time I have 20 observations, I can actually get decent estimates of those three parameters.
59:49 - And what I see after that is largely similar, not identical, because again there's always uncertainty in these parameters that causes their posteriors to be a little bit wider. Example code for this is not in the GitHub repository for the course examples, but it is in the GitHub repository for all the code that goes with the book. So if you ever want to look at this, you can see that there's this process of running the forecast step with an ensemble, stopping, running a JAGS model to do the analysis, then sampling from that JAGS posterior to run the next forecast. We've generalized this even further in some of the work I've been doing with Ann Ray, who I've mentioned a number of times in discussion over the last few days. And this is an example of some of Ann's work.
00:49 - So the data here is an estimate of above-ground biomass and woody biomass increment that's coming from fusing tree ring data with forest inventory data. The model itself is a Forest Gap Model that's projecting the forest dynamics through time. And so here we have the tree ring data, we have the model prediction at each time step, and then the pink bar is the analysis at each time step, the fusion of the model and the data. It's always kind of following the same general mean trend line as the data, but in many ways suggesting that there's inter-annual variability there that might be missed by the tree ring reconstruction.
01:31 - When we did this, we generalized it so that we treated the process error as an unknown, and we fit that with a Wishart. We assumed a Wishart prior on it, which we update every analysis time step. And from that, I can actually estimate the process error covariance matrix. What's shown is just the process error correlation matrix, because let's be honest, none of us can actually interpret what a covariance matrix looks like, (chuckles) but a correlation matrix we're all pretty good at. The diagonal is one because it has to be. So how do we interpret this? How I interpret this is: Yellow Birch and Cedar have a positive interaction that is not being captured by the current process model.
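Going from the estimated covariance matrix to that more readable correlation matrix is just a rescaling (the numbers here are made up for illustration, not the actual fitted values):

```python
import numpy as np

# Illustrative 3x3 process-error covariance matrix (invented values,
# e.g. for three species in the stand).
cov = np.array([[ 4.0,  1.2, -0.8],
                [ 1.2,  1.0, -0.3],
                [-0.8, -0.3,  2.0]])

# Divide by the outer product of the standard deviations.
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)
```

The diagonal comes out as one by construction, and the off-diagonals are the pairwise correlations.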
02:16 - By contrast, Yellow Birch and Cedar have a negative interaction with Hemlock, beyond the interactions that are already in the process model. And that's not something that's gonna come out obviously from the data alone, because if I look at the data alone, this stand is undergoing succession. So Hemlock biomass increases, Yellow Birch biomass increases. If I just did the correlation between Yellow Birch and Hemlock, it's positive. I can't detect a negative competitive interaction between those species, but I get it out of the process error.
02:49 - And more importantly, what I'm getting out of the process error is that the interaction is stronger than what the Gap Model already thinks. 'Cause the Gap Model already represents competition for light and water and nutrients, but there's something stronger going on that's not captured, and that shows up in the process error. By contrast, Maple just does its own thing. (laughs) So that either means Maple is ignoring the other trees in the stand, or that the process model is already accurately capturing the interactions between Maple and the other species in the stand.
03:23 - Which is probably the better way of thinking about it. And then again, I mentioned this yesterday: instead of using a Multivariate Normal in the analysis, we use this Multivariate Tobit, which allows us to zero-truncate and zero-inflate the probability distributions. The zero truncation and zero inflation don't really show up as important for these two species, but when you do this and you look at rare species in the stand, it's very easy to have cases where specific ensemble members predict a species to go extinct, so you have zeros in the model forecasts, or vice versa, the model predicts a species to be in the stand that isn't observed, so the data has a zero. So you get more zeros in both the model and the data for the rare species than you would have predicted by chance. And there, it's really handy.
04:18 - But with these common species, the Multivariate Normal would have been fine. So, some take-homes. The iterative forecast-analysis cycle of data assimilation allows us to continually confront models with data. All of these data assimilation variants are forward-only special cases of the State Space Model. So the State Space Model is the parent of all of these; these are all just special cases. Forecast step: standard data assimilation methods map onto our uncertainty propagation options. Analysis step:
04:52 - Don't feel constrained by Kalman's normal-normal assumption. Take those assumptions into your own hands; MCMC is a perfectly viable way of doing the analysis step, except for very large compute problems. And that's all I had on numerical ways of approaching model data fusion.