Propagating Uncertainty

Apr 23, 2020 20:27 · 10273 words · 49 minute read

  • Like I said, we spent a good bit of the last day and a half after the general introduction really focused on statistical tools that solve what is ultimately the calibration problem. We have some model, we have some data, we need to estimate the parameters of that model. We covered some fairly sophisticated ways of doing that: methods for dealing with the complexity of data, observation errors, errors in variables, non-constant variance, non-Gaussian distributions; we talked about using hierarchical models to partition that variability; we talked about using state-space models to fit these as dynamic time series. But they were all ultimately about calibration, about fitting that model. So what we wanna do today is think about how we take that calibrated model and integrate it, and the other sources of uncertainty we have, into making projections, but then also spend time thinking about how this creates a feedback loop, where we can analyze the uncertainties themselves to better understand what data we need to collect and how we might do that data collection more efficiently.

01:18 - So the four things I want to emphasize in this lecture, the four key concepts, are: sensitivity analysis, basically asking the question, how does a change in x, one of our inputs, translate into a change in y, the thing we’re trying to predict? Then building on that, uncertainty propagation: how does the uncertainty in x affect the uncertainty in y, and likewise, how do we forecast y with uncertainty? Uncertainty analysis: which sources of uncertainty are most important? And then optimal design: thinking about how we can most effectively reduce the uncertainties in our forecasts. I’m gonna spend a disproportionate amount of time on uncertainty propagation because this is very much a course on forecasting and, like I said, this is a key concept in forecasting. But I think there is a deep connection between sensitivity analysis, uncertainty propagation, and uncertainty analysis, and one of the values of uncertainty analysis is to think explicitly about that feedback loop, how we can use our forecasts to actually improve the way we make measurements and the way we do monitoring. I wanna take for granted that most of you have probably had some prior exposure to the idea of a sensitivity analysis. Is that pretty fair? So I’m just gonna give a quick review. There are multiple ways to assess sensitivity, to understand how our outputs change given our inputs.

03:00 - Probably the most straightforward of those is just to do it analytically. If we wanna know how our outputs change as a function of our inputs, that is essentially what a derivative is: a change in output given a change in input. So that first derivative for a simple model is a very simple way of understanding sensitivity, and if you have a simple model, go for it, you’re basically done. But all these other numerical methods exist because this is often nontrivial, or for things that exist only as computer codes, not really even something you can do. So then, we can classify sensitivity methods into those that are local, those that assess sensitivity at a specific point in parameter space, usually centered around the mean value, versus those that are trying to understand the sensitivity more globally, so they look at how the outputs change in response to the inputs across a larger part of that input space.

04:10 - I’m not gonna go through all of the global methods today, more of them are covered in the book and then even more are covered in the Saltelli book, which is a great reference if you’re interested in sensitivity analysis, it really gets into the details. But I think the important thing to know is that global sensitivity analyses are pretty much always more computationally expensive, which makes sense: if you’re exploring more parameter space, it’s gonna take more cost. You can arrange many of the other methods along a gradient of how extensively they explore parameter space, with more thorough exploration being more computationally costly versus things that give you a more approximate answer for a smaller number of evaluations, so there’s a set of trade-offs. I’m gonna particularly focus on these two right here.

05:12 - The simplest univariate numerical sensitivity analyses are these one-at-a-time analyses. You hold every parameter except the one you’re interested in at its mean or its default, and then you vary that one parameter. So it’s essentially allowing you to numerically construct that derivative, holding everything else constant. This is some of the work that we’ve done with vegetation models: the y-axis is NPP, and we’re asking, as we varied a bunch of different parameters, how did NPP change? In this particular instance, the gray line was when we did this with the prior distribution and the black line was when we did this with the posterior distribution. And actually this brings up one thing that I like to emphasize about sensitivity analysis relative to how it is often done in earlier literature.

06:11 - At least when I was trained, I am older than most of you, you’d see a lot of one-at-a-time sensitivity analyses that were like plus or minus 20%, plus or minus 10%, which were completely arbitrary amounts. One of the things that I find particularly valuable is that if you have an estimate of the uncertainty, you can do your sensitivity analysis plus or minus a certain number of standard deviations, or, for non-Gaussian distributions, the quantile equivalents of those standard deviations. So that’s what you’re seeing here: the prior mean and the prior plus and minus one, two, three standard deviations. This was not normally distributed, so these are the quantile equivalents of those standard deviations, and then again with the posterior.
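
To make that concrete, here's a minimal sketch of a one-at-a-time analysis using quantile-equivalent perturbations. The model function and the parameter distributions are made up for illustration; a real analysis would swap in the vegetation model and its actual prior or posterior distributions.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for a vegetation model: maps a parameter dict to NPP.
def model(params):
    return 0.5 * params["SLA"] * np.sqrt(params["Vcmax"])  # toy response only

# Hypothetical parameter distributions (e.g. priors or posteriors).
dists = {
    "SLA":   stats.lognorm(s=0.3, scale=15.0),
    "Vcmax": stats.norm(loc=50.0, scale=10.0),
}

# Quantiles equivalent to the median and +/- 1, 2, 3 standard deviations of a
# standard normal, so non-Gaussian parameters get comparable perturbations.
q_levels = stats.norm.cdf([-3, -2, -1, 0, 1, 2, 3])
medians = {name: d.ppf(0.5) for name, d in dists.items()}

# One at a time: vary one parameter across its quantiles, hold the rest at
# their medians, and record the model output.
oat = {}
for name, d in dists.items():
    runs = []
    for q in q_levels:
        params = dict(medians)
        params[name] = d.ppf(q)
        runs.append(model(params))
    oat[name] = np.array(runs)

for name, runs in oat.items():
    print(name, np.round(runs, 2))
```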

07:00 - And I find that useful because it gives me a better frame of reference for understanding what range of variability is actually meaningful for a particular parameter, because some parameters vary by plus or minus 2% and some vary by plus or minus 200%. So why would I vary them both the same amount? In this case, we find that plants are really sensitive to their specific leaf area. For those that don’t think about terrestrial NPP, and I didn’t even give the units, I think it was megagrams per hectare, that goes from like ridiculously huge to dead. (laughs) At the other extreme are our global methods that explore parameter space very broadly. One of the simpler global sensitivity methods is just Monte Carlo sensitivity analysis, where you sample completely from that joint distribution to understand what the outputs look like over a wide range of different parameter values.

08:02 - So in the one-at-a-time analysis, I was holding every other parameter constant; here, I’m letting every other parameter vary simultaneously. These are actually predictions from a simple Gaussian plume pollutant transport model, looking in this case at the sensitivity of pollutant concentration at a specific location downwind to wind speed and atmospheric pressure. So you can see the more full shape and variability in that sensitivity. There are two metrics that I find useful if you do this global sensitivity analysis but you still want to distill that down to a specific number that’s handy for assessing the relative importance of different parameters. The first is just fitting a regression slope through that cloud of points; again, slope is dy/dx.

08:56 - So one number for average sensitivity would just be that regression slope. The other number that I find useful is the partial R squared associated with that input, because if you want to partition out the relative importance of different variables, those partial R squareds tell you how much each parameter contributes to the overall variability. Because you’re often approximating nonlinear models with linear models, the partial R squareds are not gonna sum to 100% because of interactions. But it’s also fairly easy, since you’re already throwing a regression through there, to put in interaction terms and actually assess the interactions.
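
Here's a minimal sketch of those two summaries, a regression slope and a partial R squared, computed from a Monte Carlo sample. The inputs and the toy plume-like response are made up for illustration; in practice you'd use the joint samples and outputs from your own model runs.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical Monte Carlo sample: joint draws of two inputs and the model
# output at each draw (a toy plume-like response, for illustration only).
n = 5000
wind = rng.normal(5.0, 1.0, n)              # wind speed
pressure = rng.normal(1013.0, 5.0, n)       # atmospheric pressure
conc = 100.0 / wind + rng.normal(0, 2, n)   # pollutant concentration

X = np.column_stack([np.ones(n), wind, pressure])
y = conc

# Regression through the Monte Carlo cloud: slopes approximate the average
# sensitivities dy/dx.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Partial R^2 for one input: compare residual sums of squares with and
# without that column in the regression.
def partial_r2(X, y, col):
    full_res = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    Xr = np.delete(X, col, axis=1)
    red_res = y - Xr @ np.linalg.lstsq(Xr, y, rcond=None)[0]
    ss_full, ss_red = np.sum(full_res**2), np.sum(red_res**2)
    return (ss_red - ss_full) / ss_red

print("slopes:", beta[1:])
print("partial R2, wind:    ", partial_r2(X, y, 1))
print("partial R2, pressure:", partial_r2(X, y, 2))
```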

09:45 - So here, we can see wind speed has some sensitivity; obviously sensitivity units are dependent upon the units of the y and the units of the x, but it’s a decently important parameter. By contrast, for atmospheric pressure the slope is essentially zero, the R squared is essentially zero, pollution transport was not at all sensitive. The other reason that I emphasize Monte Carlo sensitivity analysis, which will come up when I talk about uncertainty propagation in a minute, is that one of the approaches to uncertainty propagation is Monte Carlo based. So if you’re doing a Monte Carlo based approach to uncertainty propagation, you essentially get the Monte Carlo sensitivity analysis for free by just doing some post hoc analysis of those posterior samples. So that’s mostly what I wanted to say about sensitivity analysis in order to set up being able to think about uncertainty propagation.

10:53 - For forecasting, what we want to do is say we have some set of uncertainties in our current state of the system and our other parameters or our other inputs, and then we want to project that into the future with uncertainty. How do we do that? Well, how we make predictions with uncertainty is just a special case of a more general statistical problem, which is how do we transform distributions through functions? How do we transform variables? The idea of a variable transform is: I have some probability distribution and I have some function, and I want to transform that probability distribution through the function to convert the distribution from one space to another. In forecasting, the function you’re transforming through is your model, and you’re transforming into the future or into some novel condition or novel location. All of the methods for propagating uncertainty into the future are just a special case of transforming uncertainties more generally. I’m gonna go over five different ways that we can think about propagating uncertainty into projections and I’m gonna organize them in this matrix.

12:11 - I’m just gonna point out that in the textbook, there is a typo that didn’t get caught. For whatever reason, I sent this, this is what I sent in, but this got moved down here. So your book has the Taylor series as a numerical uncertainty approach when it is in fact very much analytical. So what I’ve done here is I’ve taken the approaches to propagating uncertainty and classified them in two directions, depending on what we want to get out and the approach we wanna take. So first, classifying these into analytical versus numerical methods.

12:48 - Analytical methods are ones that involve doing math, but they give us an equation. That equation is handy ‘cause it gives us a more general understanding of what’s going on, and also, once you have it, you can keep reusing it and it’s often very computationally efficient, because I plug a number in, I get an answer, I’m done. Numerical methods are ones that rely on some form of computer simulation; they’re more computationally demanding but they often require you to do less math. And so that’s the trade off, you’re trading off computers doing hard work versus your brain doing hard work. I’ll tell you, it’s not uncommon for ecologists to decide that they’d rather have a computer churn on a problem for a few weeks than to remember how to do derivatives.

13:36 - (laughs) And then there are a few models we work with where doing derivatives takes more than a few weeks. If I’m working with a global earth system model and I wanted to solve for the full matrix of all pairwise derivatives, it would take years to code that up and verify that there are no bugs in it. The other direction I’m gonna think about is what we actually want to come out of this. One thing we might want out is the full probability distribution. In fact, ideally that’s what we would like to get out, the full probability distribution.

14:15 - Alternatively, we might be okay with just getting the statistical moments. I make a prediction in the future, I’m fine with just getting the mean and standard deviation, I can live with that. Why would I be fine with that? Because, not surprisingly, it’s harder to get the full distribution than it is to just get the mean and standard deviation. So again, there’s more work in this column, and this column is often easier to achieve. You’ve probably all now zeroed in on it: this one’s easier, and that one’s easier, so I’m gonna totally end up in this corner, right? (laughs) Can we just cut to that one? No.

14:58 - I’m gonna start with the hard one, and I start in the other corner, which is what I’m gonna call the gold standard: analytical methods for variable transformation. You get a closed form solution, you get the full distribution, and this is the thing you can find in a graduate-level stats textbook. I will say I’ve done this transformation in graduate-level statistics classes, I’ve never done it after a graduate-level statistics class, and I’ve only done it for fairly simple problems because it’s actually pretty nontrivial. So think about it: I have some probability distribution for my input, I have some function, and I wanna know the probability distribution of the output. Well first of all, in an ideal world, you’d be like, the probability distribution of y should just be, I’m gonna take my function and plug the probability distribution of my inputs into my function, won’t that work? No. (laughs) That doesn’t work.

16:04 - You can’t just take a Gaussian and plop it into your function and expect to get something out. What you have to do instead is to plop the inverse of your function into the probability distribution. So if you have a model, you need to analytically solve for the inverse of the model and then plug that into the equation for your probability distribution. And that’s the easy part. Because you also need to take the derivative of that inverse of the function, which isn’t too bad if you have a univariate model. You never have a univariate model. I mean, the only univariate model we actually have is fitting a constant mean and that’s not a particularly interesting forecast; even a regression is a bivariate problem, and then if you acknowledge that you have uncertainties about the xs as well, you now have three terms.
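
For reference, this is the change-of-variables formula being described, written for a monotonic univariate transform and then for the multivariate case, where the derivative of the inverse becomes a Jacobian determinant.

```latex
% Univariate, monotonic y = f(x):
p_Y(y) = p_X\!\left(f^{-1}(y)\right)\,\left|\frac{d\,f^{-1}(y)}{dy}\right|

% Multivariate case:
p_Y(\mathbf{y}) = p_X\!\left(f^{-1}(\mathbf{y})\right)\,
  \left|\det\frac{\partial f^{-1}(\mathbf{y})}{\partial \mathbf{y}}\right|
```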

17:00 - So you now have a trivariate model, and in a trivariate model this is not one derivative, this is a three-by-three matrix of derivatives. And that’s just for your simplest straight-line case. If you have a 12 parameter model, this is a 12-by-12 matrix of all pairwise derivatives. And then you’re multiplying that by the probability distribution, and that gives you the output probability distribution. Now, just having the probability distribution by itself isn’t particularly useful, because it’s unlikely to be a named probability distribution; like, you’re not gonna do this and suddenly it’s gonna be like, oh, normal, great.

17:42 - It’s gonna be some weird obscure thing that no one’s ever seen before. So then you have to go, okay, well what if I wanna know things like the mean of that distribution? Well then I have to take this thing and say, if I wanna know the mean, that’s the integral of y times the probability of y. If I wanna know the variance, it’s the integral of y minus y bar, squared, times the probability of y. And so now you have this incredibly nasty multivariate thing, and you have to multiply it by this and then take the integral of the whole thing analytically, just to find out what the standard deviation is.
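
Written out, those moment integrals over the transformed density are:

```latex
\bar{y} = \mathrm{E}[y] = \int y\, p_Y(y)\, dy
\qquad
\mathrm{Var}[y] = \int \left(y - \bar{y}\right)^2 p_Y(y)\, dy
```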

18:35 - Like I said, you never do it. The places you do it are cases like f(x) = x squared. And then you go through a whole lot of work and you find out that if x is distributed normally and you wanna know the transformation for x squared, it ends up being chi-squared. Which is, have you ever wondered why the chi-squared is called the chi-squared? It’s because it’s just the distribution of x squared. You will do that as homework in a grad-level course and you don’t do it in practice. So gold standard: absolutely wonderful, absolutely useless. So let’s move to the next corner, which says, well, we can’t get a full solution analytically, can we get an approximate solution analytically by focusing just on the mean and standard deviation? If you’re gonna take that approach, what you’re gonna do is rely on the fact that there are a lot of existing proofs that explain how random variables interact with each other and how random variables interact with constants. So here’s a set of rules, the algebra of variances. Here, capital letters are random variables, things coming out of probability distributions, and lowercase letters are just constants.

20:00 - And if you look at the table for the rules for how means behave, it’s actually relatively intuitive as long as you’re not multiplying things together; it largely works out like, the mean of a sum of random variables is the sum of the means, (laughs) nothing crazy. Variances, on the other hand, are a bit unintuitive. In the algebra of variances, the variance of a constant times a random variable is the constant squared times the variance. The variance of a random variable plus a constant is just the variance of the random variable, the constant goes away. Variances of constants are zero, constants don’t vary by definition. And then if you add two random variables together, yeah, you can add their variances together, but then you also need this term for their covariance. And then you can start saying, well, if I combine constants with random variables, I start getting these a squareds, b squareds, and 2ab covariances. The algebra is not actually hard, it’s just not necessarily intuitive until you’ve done it a few times. But it actually is something where you can take a model, apply these rules, and work out analytically what the moments of the predictions would be.
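
Written out, the rules being described here are:

```latex
% Means (X, Y random variables; a, b constants):
\mathrm{E}[a] = a, \qquad
\mathrm{E}[aX] = a\,\mathrm{E}[X], \qquad
\mathrm{E}[X + Y] = \mathrm{E}[X] + \mathrm{E}[Y]

% Variances:
\mathrm{Var}[a] = 0, \qquad
\mathrm{Var}[X + a] = \mathrm{Var}[X], \qquad
\mathrm{Var}[aX] = a^2\,\mathrm{Var}[X]

\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}[X, Y]

\mathrm{Var}[aX + bY] = a^2\,\mathrm{Var}[X] + b^2\,\mathrm{Var}[Y] + 2ab\,\mathrm{Cov}[X, Y]
```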

21:26 - There’s an important caveat, though. All the rules that I show here correspond to linear combinations of random variables. The method of dealing with the moments analytically only works out exactly if you’re using linear combinations. So essentially, this works very elegantly, but it only works very elegantly for linear models. Anytime there’s any nonlinearity, you don’t get an exact solution through the analytical moments. So let’s look at this graphically, and I wanted to do this to emphasize again what’s going on but also to graphically show the relationship between uncertainty propagation and sensitivity analysis, ‘cause I said that these are not unrelated. Let’s say I have a linear model.

22:17 - Well, my simplest univariate linear model is y equals m theta plus b; theta is my x. And again, m is just the slope, which is our derivative; that’s just a reminder that the slope here is the sensitivity. If I have some distribution here, what I want to do is transform the uncertainties in my inputs into the uncertainties in my outputs. So if I wanna know the variance in my output, that’s just the variance of the full model. When I apply these analytical rules for dealing with uncertainties, essentially you’re applying them from the outside inward.

22:56 - So I would say I have the variance of something plus a constant; that becomes the variance of something, because the constant doesn’t matter. I can then apply the rule for the variance of a constant times a random variable and get out that the variance of y is m squared times the variance of theta. And again, because that slope is the sensitivity, we get back this property for a linear transformation: the variance in y is the sensitivity squared times the variance in theta. Which is something I actually mentioned on day one, that if we wanna understand the uncertainties in our predictions, that is determined by two things: the uncertainties in our inputs and the sensitivity of our outputs to those inputs. And if you remember that standard deviations are the square roots of variances, if you take the square root of this, you essentially get sensitivity times standard deviation.
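
Step by step, that outside-in derivation for the linear model is:

```latex
\mathrm{Var}[y] = \mathrm{Var}[m\theta + b]
              = \mathrm{Var}[m\theta]
              = m^2\,\mathrm{Var}[\theta]
              = \left(\frac{dy}{d\theta}\right)^{2} \mathrm{Var}[\theta]
\quad\Rightarrow\quad
\mathrm{SD}[y] = \left|\frac{dy}{d\theta}\right| \mathrm{SD}[\theta]
```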

24:14 - So what if you have a nonlinear model? Well, you can’t apply these rules for variances exactly, but there are a couple of alternatives. One that I’m not gonna cover in detail is that there are approximation equations for the variances of other mathematical operations. And again, you would apply those from the outside inward: you say, the variance of my function, I’ve written out my full function, and I’m gonna break that down, working my way from the outside inward, back to the random variables. The other way you can think about this is to say, well, what if it’s nonlinear and I have some uncertainty in the input? The thing I can do is approximate that nonlinear function with its linear approximation. So I can take a linear tangent approximation of my nonlinear function. If I take a linear tangent approximation of my function, I’m then back to having a linear model and I can apply these analytical rules.

25:09 - Those two approaches are actually going to be the same, because all of the equations in the appendix for dealing with the variances of products and quotients and other mathematical transforms are linear approximations. But they’re ones that potentially allow you to get away with not having to solve derivatives. Then if I have a linear approximation, I’m estimating the uncertainty in my output relative to that linear approximation, not to the full nonlinear model, but it gives me an approximate answer. It’s not computationally demanding, it does require knowing these derivatives, but it gives you an analytical solution that again has some generality to it. And we just saw that if the model is linear, then the variance in y is exactly dy/d theta squared times the variance in theta; if we’re taking a linear approximation, it’s going to be approximately the sensitivity squared times the variance of the input.

26:18 - To get at the same idea in a slightly different way, if I have some function and we want the variance of this function, the variance of what we’re predicting with our model, that’s going to be approximately the variance of the Taylor series approximation of that function. And what we’re essentially doing is truncating that Taylor series approximation after the first two terms to make a linear approximation. So if this is what I’m doing, I’m taking a Taylor series linear approximation of my model, I can again apply those same rules for how variances work. So, the variance of the function evaluated at its mean: well, the function evaluated at the mean of the parameters is a constant.

27:07 - I take constants, I plug them into a model, I get a constant back, and so the variance of the model evaluated at its mean is zero, that term goes away. df/d theta, a derivative evaluated at the parameter means, is actually also a constant: your derivative is evaluated at the mean parameter values, theta is a random variable, the mean of theta is a constant. So, df/d theta times theta: a constant times a random variable gives us the constant squared times the variance of the random variable. And the last term, a constant times a constant, no variance there.

28:01 - Again, we’re getting the same answer: the variance of a function is approximately its sensitivity squared times the variance of the input. Again, that works if you have one parameter, and you never do, so the more general solution involves summing over multiple parameters. So I have multiple parameters, I have a direct contribution of each parameter summed up over all the parameters, and then, if you remember, when I sum over multiple parameters, properly, the variance of ax plus by was a squared times the variance of x, plus b squared times the variance of y, plus 2ab times the covariance of x and y, which is essentially what we’re seeing here. So these are the terms: these are the a squared variance of x, the b squared variance of y, and then these are the 2ab covariances. And in fact, you can generalize all of this to just being the sum over i, sum over j, of df/dxi times df/dxj times the covariance between xi and xj.
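
Written out, the Taylor series (delta method) approximation and its multi-parameter generalization are:

```latex
% Linear (first-order) Taylor approximation around the mean:
f(\theta) \approx f(\bar{\theta})
  + \left.\frac{df}{d\theta}\right|_{\bar{\theta}} (\theta - \bar{\theta})
\quad\Rightarrow\quad
\mathrm{Var}\big[f(\theta)\big] \approx \left(\frac{df}{d\theta}\right)^{2} \mathrm{Var}[\theta]

% Multiple parameters: direct effects plus pairwise interaction terms ...
\mathrm{Var}\big[f(\mathbf{x})\big] \approx
  \sum_i \left(\frac{\partial f}{\partial x_i}\right)^{2} \mathrm{Var}[x_i]
  + \sum_i \sum_{j \ne i}
    \frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\,
    \mathrm{Cov}[x_i, x_j]

% ... which collapses to the general form
\mathrm{Var}\big[f(\mathbf{x})\big] \approx
  \sum_i \sum_j
    \frac{\partial f}{\partial x_i}\frac{\partial f}{\partial x_j}\,
    \mathrm{Cov}[x_i, x_j]
```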

29:49 - And if you write it like that, the variance terms are just the special case where i and j are the same, and the two comes up because you end up with ij and ji being identical terms, and so it shows up twice. This is essentially the general solution, but I think most people like seeing it the other way because it emphasizes the direct effects and the interactions a little more. So in the book, I go through an example of applying this to a fairly simple nonlinear model, the Michaelis-Menten model, where f(x) is vx over (k plus x). The Michaelis-Menten model has the behavior that it asymptotes to v, because as x goes to infinity, x gets a lot bigger than k and this is essentially v times one; and then k has the special property that when x equals k, this ends up being one half times v. So 1/2 v maps to k, the half-saturation constant, and those are the two parameters. If I apply this analytical approach, taking the derivatives of this model, I find that v has a linear sensitivity, so the transformation between v and y is actually exact because it’s a linear sensitivity.

31:36 - By contrast, the relationship between k and the predictions is nonlinear, and so if I have some uncertainty about k, we see here that the mean value of k translates to a specific prediction, but the overall mean of the predictions is offset from it. That property, that the mean of a function is not equal to the function evaluated at its mean parameters, is Jensen’s inequality. So this is just a graphical example that the mean of a function and the function at the mean are not the same thing, and that’s actually a really important thing to remember when we’re dealing with transforming uncertainties through nonlinear functions. If you run your model under your mean parameter set, then what comes out is not your mean prediction if your model is nonlinear, because what you actually have is a distribution of predictions and the transformation itself is nonlinear. So, I included that just to give me a chance to emphasize the importance of remembering Jensen’s inequality.
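
Here's a minimal sketch of that Michaelis-Menten example, comparing the Taylor series (delta method) variance against a Monte Carlo check. The parameter means and standard deviations are made up, and the parameters are treated as independent for simplicity; the comparison also shows the Jensen's inequality offset between the model run at the mean parameters and the mean of the transformed predictions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Michaelis-Menten: f(x) = v * x / (k + x)
def mm(x, v, k):
    return v * x / (k + x)

# Hypothetical parameter uncertainty (independent, for illustration).
v_mu, v_sd = 10.0, 1.0
k_mu, k_sd = 2.0, 0.5
x = 1.5  # prediction point

# Delta-method variance: sensitivities evaluated at the parameter means.
dfdv = x / (k_mu + x)                # linear in v, so this term is exact
dfdk = -v_mu * x / (k_mu + x) ** 2   # nonlinear in k, so this is approximate
var_taylor = dfdv**2 * v_sd**2 + dfdk**2 * k_sd**2  # assumes Cov(v, k) = 0

# Monte Carlo check, which also exposes Jensen's inequality: the mean of the
# transformed samples is not the model run at the mean parameters.
v = rng.normal(v_mu, v_sd, 100_000)
k = rng.normal(k_mu, k_sd, 100_000)
y = mm(x, v, k)

print("f(mean params): ", mm(x, v_mu, k_mu))
print("mean of f:      ", y.mean())      # shifted by Jensen's inequality
print("Taylor variance:", var_taylor)
print("Monte Carlo var:", y.var())
```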

32:59 - And so here we get a mean and confidence interval for the relationship between x and y, and here, the black line is showing the model evaluated with the mean parameter set. So again, this difference here between the middle line and the black line is the result of Jensen’s inequality. And this is a confidence interval that is derived completely analytically, so this is just from closed-form math. If you go back to pretty much every model you’ve seen in classical stats that draws a confidence interval, it was derived analytically: someone went through the math of deriving analytically how to calculate that confidence interval.

33:38 - For things like linear models, they’re linear, so the math works out perfectly, and actually that is another thing worth refreshing and reminding people about. If I have some linear regression between x and y, it gives me a straight line. If I were to propagate the uncertainty in just the intercept, that would give me an interval that just perfectly parallels the line. If I were to propagate the uncertainty in just the slope, it looks just like that. If I added those two things together, it would look just like that, because I would just have this interval with the additional uncertainty in the intercept.

34:32 - Is that what a regression confidence interval looks like? No. Why not? Why does the regression confidence interval not just look like a big V? We have an uncertainty in the intercept, we have an uncertainty in the slope, we add those things together, why doesn’t it look like that? - [Student] Because of the parameter correlation. - Because of the covariances. This is really important. When you propagate uncertainty through, those covariances are critical. The correlations between your parameters are important. And so, this happens a lot and it’s really easy to do.

35:14 - You get some posterior estimate of your parameters, you plot up these marginal distributions, and you think those marginal distributions are your posteriors. No, your posterior is the full joint distribution of all of your parameters, with all the complexity of their covariance structure, and for many of our parameters, those covariance structures can themselves be very nonlinear. So when you sample from parameters, you have to sample from that joint distribution, you have to account for that correlation. It’s that correlation that gives us our classic hourglass shape. Our classic hourglass shape comes about because when I choose an intercept that’s bigger than average, the slope is flatter than average.

36:00 - Because otherwise, it doesn’t go through the data. So even for something as simple as a regression, there’s always a strong negative correlation between the slope and the intercept. It just gets more important with more complex models. So that covariance term can’t be dropped. So we’ve covered our analytical approaches: one that’s impossible, one that’s pretty straightforward, and one that isn’t bad but has the limitation that the more nonlinear the model, the worse the approximation it gives. If I’m looking at this and looking at that linear approximation, I might not be particularly comfortable.

36:44 - I may be like, that’s a pretty bad approximation. That said, if I have a model with only a slight curvature to it, it might be a perfectly adequate approximation. So let’s think about alternative, numerical ways to propagate uncertainty. Well, we just saw with Jensen’s inequality that I can’t just take the moments of my input distribution and transform those moments through the model. So I can’t take the mean and transform it through a nonlinear function and get the mean prediction out.

37:23 - I definitely can’t do that with the variance either. By contrast, one of the things that I can do is take samples from this distribution. So if I have a distribution here and I take random samples, I draw random numbers from this input distribution, once I take a random number from this distribution, I have a number. I can transform that number through the function, no problem, ‘cause I’m not transforming a moment, I’m not transforming a distribution, I’m transforming an actual number. So if I draw from this distribution, I can transform that to my heart’s content.

38:00 - The key idea in all of the numerical approaches to uncertainty propagation is taking samples from the distribution, transforming those samples, and essentially approximating the distribution on the back side using those samples. Now, the idea of approximating a distribution with samples from that distribution is pretty common. That’s exactly what we’re doing in Bayesian MCMC methods: we don’t get an analytical solution, we get samples from a distribution. With frequentist bootstrapping methods, you’re essentially doing the same thing as well, you’re approximating a distribution with samples from the distribution.

38:42 - So this idea that’s behind most of these Monte Carlo methods is a straightforward conceptual one: if I have samples from a distribution, I can transform those samples because they’re just numbers. And actually it’s a powerful approach, because I can transform into a distribution here, then take that distribution and transform it through something else, and repeat this process through increasing layers of complexity and analysis, and I’m always propagating the uncertainty properly because I’m just propagating individual instances, individual numbers. This approach of propagating through samples is the Monte Carlo approach to error propagation, but I’m going to make a distinction for the rest of this chunk between your standard Monte Carlo approach and an ensemble-based approach. I will say that this distinction is not one that’s found throughout the literature, but it’s one that I make and I think it’s a useful one to think about: what I’m gonna call the Monte Carlo approach is when we’re trying to approximate that full distribution from samples.

39:56 - By contrast, we also have lots of times where we have ensemble-based approaches, and the only real difference between an ensemble-based approach and a Monte Carlo approach is that when most people talk about ensembles, they’re usually talking about a much smaller number of samples. And there are some trade offs there, and the key one is gonna be that when you have a much smaller number of samples, you cannot approximate the full distribution with those samples. There’s no math for the Monte Carlo approach, it’s just an algorithm. And it’s a pretty straightforward algorithm: you set up a loop, you draw random values of your inputs from their distributions, you run the model under those inputs, and you save your predictions. You can then summarize those predictions in terms of probability distributions; you can make histograms of them, you can calculate sample means, sample variances, sample quantiles to get confidence intervals, but you’re working with those samples.

40:55 - One thing that makes this particularly straightforward to do if you’re Bayesian is that the input to this is random samples of your inputs. Well, if you’ve done MCMC to fit your model, that’s what you have; that’s the only thing you have, random samples. So it’s very straightforward to take those samples from the MCMC and run them through the model as a way of propagating uncertainty. And in fact, the key part here, as I was saying before about capturing the covariances, is that when you sample from this, essentially you’re sampling rows. The key trick is to sample row numbers from your posterior samples, and then when you sample the whole row of parameters, you’re capturing the covariances across the parameters.
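
Here's a minimal sketch of that trick. The "posterior" matrix here is simulated rather than coming from a real MCMC fit, and the model is a toy regression, but the pattern is the same: sample row indices, run the model on each whole parameter row, then summarize the predictions with quantiles.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-in for MCMC output: one row per posterior sample, one column per
# parameter (intercept, slope). A real analysis would load its MCMC samples.
posterior = np.column_stack([
    rng.normal(2.0, 0.5, 5000),   # intercept samples
    rng.normal(1.2, 0.1, 5000),   # slope samples
])

def model(x, params):
    b0, b1 = params
    return b0 + b1 * x

x = np.linspace(0, 10, 50)

# Sample ROW indices so each draw keeps a whole parameter vector together,
# preserving the posterior covariances across parameters.
n_draws = 2000
rows = rng.integers(0, posterior.shape[0], size=n_draws)
preds = np.array([model(x, posterior[r, :]) for r in rows])  # (n_draws, 50)

# Summarize the samples: median and 95% interval at every x.
lower, median, upper = np.quantile(preds, [0.025, 0.5, 0.975], axis=0)
print(lower[:3], median[:3], upper[:3])
```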

41:52 - So this just shows a simple numerical simulation of doing this for a linear model, where each of these thin lines is me drawing parameters from their joint distribution and drawing a line under that specific choice of parameters. And when the number of draws is small, I don’t get a very good approximation of the uncertainty, but as the number of draws gets large, you can see that, yeah, there are outliers in the tails, but the bulk of this gives me the classic hourglass shape that I would get from analytical methods. Sorry, you’re right, I misspoke: for each of those lines, I’m drawing a slope and intercept from their posterior distribution and I’m drawing the line using that slope and intercept. Here, I’ve essentially made 10,000 predictions with a linear model, each with different parameters. At any particular x, I can go in and I essentially have a histogram at that specific value.

42:53 - And if I have a histogram at any particular x I wanna look at, I can summarize that in terms of its sample mean, its sample standard deviation, its sample quantiles; I can draw the distribution. Once I have all the samples, anything else that I want to calculate can be summarized from those samples. So, making a confidence interval is simply taking these predictions and applying the quantile function across a range of xs. So now, the distinction I’m gonna make between Monte Carlo approaches and ensemble approaches is a pragmatic one. Usually when someone’s talking about an ensemble, they’re talking about a much smaller number of runs.

43:41 - Usually if someone’s running a model 100,000, 10,000, even 1,000 times, they’re not referring to it as an ensemble. You usually get ensembles when, you know, the ensemble weather forecast for today has 21 ensemble members and they represent different perturbations of initial conditions and different model structures. In some sense, the core part of the algorithm is essentially the same: you’re drawing samples from the input distributions, you’re running the models, you’re saving the results. What I think is the key distinction between an ensemble and a Monte Carlo method is that with an ensemble, having a smaller sample size, you can’t trust that histogram of your predictions to really have characterized the uncertainty.

44:33 - So if I have 10,000 draws from a distribution and I draw a histogram, I’m fairly confident in saying that histogram is a pretty good approximation of the distribution itself. If I draw 10 numbers from a normal and I make a histogram, you’re gonna tell me that’s a good approximation of what the whole distribution looks like? No, it’s gonna be lumpy, and then maybe there’s one over here and maybe there’s one over there. Is that sample a good approximation of the full distribution of values? No. So you can’t use that sample to represent the distribution itself. But what you can do is make distributional assumptions.

45:25 - So here’s where you’re gonna see a trade off: a trade off between making an assumption and then gaining something from that assumption. So if I have these samples, and I assume those samples are normally distributed, I can compute the sample mean and the sample variance and assume that these came from a normal distribution. And then I can go out 1.96 standard deviations and get a confidence interval from the distribution. So now my interval estimates are not coming from the samples, they’re coming from the distribution.

46:04 - I don’t have to assume it’s a normal distribution, but the thing I’m essentially doing is getting the ability to estimate that shape by assuming a shape. That assumption might not be perfect, but that’s essentially the trade off. Basically, it’s hinging on the idea that you need far fewer samples to estimate the moments of a distribution than to estimate the full distribution. So if I want to estimate the full distribution, the rule of thumb I usually use when I teach my grad course is that I want to see an effective sample size from an MCMC of around 5,000 if I want to approximate a distribution. By contrast, you don’t need 5,000 data points to estimate a sample mean confidently. That’s stats 101.
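
As a minimal sketch of that ensemble trade-off (with a made-up set of 20 ensemble members): the moments are estimated from the samples, but the interval comes from the assumed Gaussian shape rather than from the sample quantiles.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical small ensemble: 20 model runs of some forecast quantity.
ensemble = rng.gamma(shape=4.0, scale=2.5, size=20)

# With ~20 members the histogram is a poor estimate of the full predictive
# distribution, but the first two moments are reasonably stable.
mu = ensemble.mean()
sd = ensemble.std(ddof=1)

# The interval estimate comes from an assumed distribution (Gaussian here),
# not from the sample quantiles themselves.
lower, upper = mu - 1.96 * sd, mu + 1.96 * sd
print(f"ensemble mean {mu:.2f}, assumed-Gaussian 95% CI ({lower:.2f}, {upper:.2f})")
```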

46:55 - How many sample points do you need to get a stable estimate of a mean? Tens, hundreds if you’re feeling ambitious. But people go out and do field experiments with tens of samples all the time and they get stable estimates of the means. Likewise, if I want stable estimates of an ensemble mean, I can do it with tens of ensemble members. I can get a stable estimate of the standard deviation with tens of ensemble members, and I can get a confidence interval, but only because I’ve made an assumption about the distribution. The choice between these two, I hope, is fairly straightforward: if you have a computationally efficient model, if you can afford to run your model 1,000 to 10,000 times, go ahead and do the full Monte Carlo; you don’t have to make any assumptions, you get the distribution.

47:45 - The distributions are gonna be whatever they’re gonna be; they don’t have to conform to named distributions. If you have a computationally challenging model, you can get by with a smaller ensemble if you’re willing to make a distributional assumption. And if you have a really computationally challenging model that you cannot afford to run multiple times, you have to revert to analytical methods and doing math. You can get by with making a prediction, running the model once, if you’ve solved all those derivatives, because you can then take that single model run and figure out how to propagate those uncertainties analytically.

48:33 - So this is just a graphic highlighting some of the differences between these high-level classes. Here is an example of an input probability distribution, its mean, its covariance; these dots are all samples from the input distribution, transformed through a nonlinear function. This crazy manifold shape here is the true shape, given all those samples, and then here’s the mean and covariance of those samples, which we take as a summary of that output distribution. But realize that that is just a summary; that distribution is definitely not Gaussian. Even if the input distribution looked Gaussian, this one definitely is not.

49:20 - And that actually points out an important point: even if your input distributions are Gaussian, once you’ve applied a nonlinear function to them, the output distributions cannot be Gaussian. So that’s where that assumption in the ensemble approach is always worth thinking about: if you’re putting a Gaussian assumption on your output, by definition a lot of these things can’t be Gaussian. This is the Taylor series approach: I have the mean and covariance of my inputs, I have an ensemble size of one, I only have to run the model once, and I analytically transform that. And in this case, because the model was so crazy nonlinear, the approximation put me in the right part of parameter space, but it wasn’t perfect: it didn’t really capture the covariance structure, and there was this Jensen’s inequality bias between the mean of the function and the function of the mean.

50:18 - Another point that’s worth thinking about, and an area where I think there could be some interesting further research, is that when we do ensemble methods the way that I spelled out, it says take a random sample from your inputs and transform them. Well, if I’m working with a fairly small sample, you might go, maybe I could do better if I sampled non-randomly, if I actually thought about it systematically. An example of that is the unscented transform, which, instead of sampling from the input space randomly, has an algorithm that chooses sigma points and analytically back-transforms from those sigma points to the covariance structure of the transformed distribution, using a clever bit of math on how to relate specific design points in one space to design points in the transformed space. And you can see that it actually did a pretty good job of approximation. Interestingly, the name of the unscented transform, (laughs) it didn’t stink (laughs) - [Student] This is on video, you’re sure you wanna do that? (laughs) - Hey, I didn’t invent it.

51:45 - The graduate student who invented this a few decades ago named it for the fact that the transformation didn’t stink. (laughs) So it was the unscented transform. So that wraps up the highlights of the different trade offs for uncertainty propagation. I wanna now quickly go through how we can analyze what comes out of that. So first, I’m gonna come back to where I started on day one, saying that we have a general approach to thinking about how we can take the uncertainty in our predictions and partition it out into different sources.

52:31 - Hopefully, you can now see where this actually came from. This came from that Taylor series approximation applied to the general form for a dynamic ecological model. But also, by spelling it out this way rather than writing the sums, it makes it clear that if I want to understand the overall variance, I can actually calculate these terms, and if I calculate each of these terms individually, I can say, well, what percentage of the overall variance is due to this term? Well, it’s just that term divided by the sum. So I can actually partition out that variability and say something about the relative importance of these different terms. I can also then dive into each of them. So if I have a whole vector of parameters, I can decompose that into the contributions of each parameter in the model to the uncertainty.
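
As a sketch of the bookkeeping being described (keeping just two illustrative terms of the decomposition, not the full general form from day one):

```latex
\mathrm{Var}[y] \;\approx\;
  \underbrace{\left(\frac{\partial f}{\partial \theta}\right)^{2}\mathrm{Var}[\theta]}_{\text{parameters}}
  \;+\;
  \underbrace{\left(\frac{\partial f}{\partial x}\right)^{2}\mathrm{Var}[x]}_{\text{inputs / initial conditions}}
  \;+\; \cdots
\qquad
\text{fraction}_{\theta_i} =
  \frac{\left(\partial f/\partial \theta_i\right)^{2}\mathrm{Var}[\theta_i]}{\mathrm{Var}[y]}
```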

53:24 - And again, I mentioned this on Monday, thinking about the covariances and scaling, but you now see where those covariances came from; that’s literally what I wrote there, the same thing. So, an uncertainty analysis. A couple of things I want to highlight about uncertainty analysis. The idea of an uncertainty analysis is to estimate the contribution of each input to the overall output uncertainty. Unlike a sensitivity analysis, where a change in x gives you a change in y, we’re now saying the uncertainty in x gives you a partial uncertainty in y, and in some sense we’re summing up all of those partial uncertainties to get the overall uncertainty.

54:09 - And remember that that was related to two things, the uncertainty of the input and the sensitivity. So this just highlights that if I wanna understand what is important to the uncertainty in my predictions, that there’s two ways things can be important. Things can be important because they’re highly sensitive or things can be important because they’re highly uncertain. And so things that are highly uncertain and highly sensitive make large contributions to the predictive uncertainty. Things that are insensitive and well constrained don’t make much contribution to predictive uncertainty, those are the things we can ignore.

54:49 - Thankfully, there’s actually usually a lot of these and then things can be important, either by being sensitive or by being uncertain. So one of the take homes from this is that sensitivity analysis by itself doesn’t tell you what is important in your model because you could have something that’s highly sensitive, but the thing itself is very well constrained. So it doesn’t actually contribute to your uncertainties. Likewise, you could have something that’s very insensitive but really poorly constrained, that does matter. So it’s the combination of the two that tells you what’s important.

55:32 - We’ve been doing these sorts of uncertainty analyses with process-based carbon cycle models for quite a while now. This comes from one of the first times we did it, which was some biofuel work that we did at the University of Illinois, and what you’re seeing here is a whole bunch of parameters in a model. And here are the contributions, their partial contributions to the overall predictive variance. Black was the posterior, gray was the priors. And for some of these, we got some good constraint. So we rank ordered them by their contribution; you can see things that matter to the model and things that don’t matter to the model.

56:17 - And then here, CV: all I’ve done is taken the uncertainty of the parameters and normalized it as a coefficient of variation so that I have them all in the same units. And then here, elasticity: likewise, I’ve taken the sensitivity and normalized it so they’re all in the same units. An elasticity of one means a unit change in x produces a unit change in y, but it has a sign because the slopes can be negative. So I can now go back and say, well, this thing’s really important because it’s sensitive and not well constrained. But I can jump down and see other examples: leaf turnover rate is important because it’s poorly constrained, so even though it’s not particularly sensitive, it’s not well constrained, while specific leaf area is well constrained but has a particularly high sensitivity. And so they contribute almost equally.
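
For reference, the usual normalizations behind those two columns (written here in the standard way; the exact scaling used in the figure isn't spelled out in the talk, so treat this as an assumption):

```latex
\mathrm{CV}_i = \frac{\mathrm{SD}[\theta_i]}{\bar{\theta}_i}
\qquad
\text{elasticity}_i =
  \left.\frac{\partial f}{\partial \theta_i}\right|_{\bar{\theta}}
  \cdot \frac{\bar{\theta}_i}{f(\bar{\theta})}
```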

57:13 - Another thing that I think is worth noting: if you have a model that has 100 factors going into it, that means the average factor contributes 1% to your predictive uncertainty, which means the average thing doesn’t matter. And most uncertainty analyses look like this, they tend to look a little bit exponential. So this is why I’ll make the assertion that most of the time, it’s a small number of factors that are really driving your predictive uncertainty, and there’s a long tail of things that average out in the wash, which is useful. This figure also shows a great example of using this information to actually constrain models. I said the grays were the priors, the blacks were the posteriors; all of these parameters up here were ones that were just calibrated using literature data.

58:02 - The things that didn’t change are things where there just isn’t literature data. This guy is a different story: the stomatal slope, which in the prior was maybe the fifth or sixth most important parameter. When I did this at the University of Illinois, Andy Leakey, the guy who had an office next to me, is really, really obsessive about plant stomata. So, for all the animal people: stomata are those little holes on the leaves where the CO2 comes in and the water goes out, and the stomatal slope is a parameter in the model that essentially controls the sensitivity of that stomatal closure, mostly as a function of humidity. So this is the sensitivity of the stomata to the environmental conditions.

58:52 - Andy is great at measuring that, he’s great. He sent a graduate student out into the field for two days with a LI-COR to measure the stomatal sensitivity and essentially eliminated that as a parameter worth worrying about. He knocked it back, he nixed the uncertainty. So this was an example where we identified an uncertainty and, two days of field work, done. I have another example I don’t show here, of a master’s student from our work in Alaska, where I had him spend the first year of his master’s going through the literature to understand the system.

59:35 - So he took an existing vegetation model we were working with and moved it up to apply it in the Alaskan tundra, a system that model had never been applied in before. He went through iterative rounds of reading about things, but also gleaning quantitative information from the literature synthesis he was doing and using that to constrain individual parameters, which is something you can do in a process-based model: you can actually say, this observation in the field maps to this specific parameter. And he did this as a feedback loop, ‘cause he would do these sorts of analyses and say, oh, I need to read more papers about roots because roots matter. And then around April, I stopped him and said, “Okay, design your summer field campaign around your uncertainty analysis.” And then he could go out in the field and measure the things that the model was telling him were driving his forecast uncertainty.

00:29 - And his master’s thesis has a series of these that show the uncertainties marching down as he does literature synthesis, field study, Bayesian inverse calibration, and at the end, we end up with a well-calibrated model. I really like this idea of a feedback loop. That leads me to a few final thoughts about using these uncertainty analyses to constrain what we do in the field, this idea of model-data feedback. So I’m gonna talk a little bit about power analysis and then observational design. Power analysis is the simple version of these questions, like what sample size is needed to detect a certain effect size; what’s the minimum effect size detectable in an observational design is the slightly more complicated version of that. So just as a quick reminder, in classical hypothesis testing, you have your null model and some threshold in your tail probabilities that gives you your p value, but then if you have an alternative hypothesis, you have some rate of false positives and false negatives, and that one minus beta, all of that integrated area, all this integral, is your power: how well do you distinguish the alternative hypothesis? And so power is a function of your effect size and the uncertainties.

01:59 - You can do this sort of power analysis for traditional statistical models fairly easily. This is the one I like to show in my undergrad class, looking at the power of regression expressed in terms of R squared, essentially saying that if you want to be able to detect an R squared that’s 10% or bigger with 90% probability or greater, you essentially need a sample size of approximately 100, which is a good rule of thumb. I use that rule of thumb a lot when reading papers. We often undersample a lot and I encourage folks to think about that: when you get non-significant effects, did you get a non-significant effect just ‘cause you didn’t have the power to detect it? More generally, we know from classic sampling theory that as we increase the number of samples, our parameter uncertainties typically go down as one over the square root of n. It doesn’t perfectly follow that same rule with some nonlinear models and Bayesian calibration, but it’s a good first approximation.

03:01 - The uncertainties are gonna go down with sampling. But that points to, I think, another useful thing to think about when trying to use model-data feedbacks, which is: if I am in this part of the curve, a few samples can reduce my uncertainty a lot; if I’m in this part of the curve and I do a bunch of samples, I’m gonna get very little return on investment. Actually, I’m gonna jump back. So for example, if I had these two cases, a parameter that’s very sensitive and is already fairly well constrained versus a parameter that’s not particularly sensitive but is under-constrained, and they both contribute about the same to the predictive uncertainty, which should I go out and measure? (class mumbles) Turnover, yeah. I’m gonna get the most bang for the buck by measuring leaf turnover rate because I have the most potential to reduce its uncertainty with further measurement.

04:09 - By contrast, I can’t really change the inherent sensitivity of the model to this process, so I’m not gonna get a lot of bang for the buck there. I’m not necessarily gonna just measure these things in the rank order that they’re given here, I’m gonna tackle the things that are going to give me the greatest return on investment. Approaches to estimating these uncertainties, to estimating power, with nonlinear processes and Bayesian models are most often done with some sort of pseudo-data simulation. So here, we’re gonna simulate data of some specific sample size, fit the model, and save the parameter or parameters of interest; these are essentially extensions of bootstrapping sorts of methods.

04:54 - We’re either resampling the data or assuming model parameters and simulating the data, but the idea is that you can generate random data of a specific size. In a bootstrap analysis, you’re always generating a sample that’s the same size as your original, because you want to understand the uncertainty about the original. But for things like experimental design, you may be simulating: what if I doubled, tripled, or quadrupled my current sample size, how is the uncertainty in the parameter gonna go down as I increase sample sizes? So here, you might end up embedding this in a loop over multiple sample sizes.
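
Here's a minimal sketch of that loop, using a simple regression and pseudo-data simulated from assumed "true" parameters; a real analysis would swap in its own model and fitting routine (e.g. a Bayesian fit).

```python
import numpy as np

rng = np.random.default_rng(11)

# Assumed "true" parameters used to simulate pseudo-data (hypothetical values).
b0_true, b1_true, sigma = 1.0, 0.5, 2.0

def simulate_and_fit(n):
    """Simulate one pseudo-dataset of size n and return the fitted slope."""
    x = rng.uniform(0, 10, n)
    y = b0_true + b1_true * x + rng.normal(0, sigma, n)
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Loop over candidate sample sizes; the spread of fitted slopes across
# replicates approximates the parameter uncertainty at each design.
for n in [10, 25, 50, 100, 200]:
    slopes = [simulate_and_fit(n) for _ in range(500)]
    print(n, round(float(np.std(slopes)), 3))  # shrinks roughly as 1/sqrt(n)
```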

05:40 - Here’s a simple conceptual example of putting these things together. I might have an understanding of how the parameter error, the error in a specific parameter, may change as a function of sample size, either analytically or through some sort of bootstrap approach. Then, using everything we talked about earlier, we may be able to think about how the forecast or model error changes as a function of sample size: I’m transforming the uncertainty in the parameter into the uncertainty in the predictions. One of the reasons that’s handy is that when I transform into the uncertainties in the predictions, I can make apples-to-apples comparisons across different sources of uncertainty, because they’re all expressed in the same units, which is the predictive uncertainty. I can’t really compare the uncertainty reduction in the slope versus the uncertainty reduction in the intercept; I need to be able to put them together. And so here, I’m imagining a case where variable one might actually contribute more to the uncertainty, but its uncertainty declines more slowly than variable two, which starts out a little bit lower but gives me a better return on investment.

06:48 - I can then use basic economic theory to say, if I’m willing to invest a certain sample size, what should I measure? And so, it turns out I’m gonna measure this one for a while ‘cause it gives me the best return on investment, but then at some point this starts to level out and I shouldn’t keep sampling it. You can take that a step further and say, well, not all types of measurements cost me the same in terms of man hours, dollars, whatever. So if you could actually put a cost on these, you could then say maybe variable two gives me a better return on investment per sample, but it’s more expensive. And so I might actually end up with a sampling strategy where I make a whole lot of measurements of this guy, because even though I get less return on investment per measurement, I get a better return on investment because it’s cheap. And you can extend this further to include not just the marginal costs of each sample, but also the fixed costs involved in field research, or in lab research as well.

07:51 - Your fixed costs are the instruments themselves and the time involved in setting up new sites. Typically, for me, the cost of getting the permitting on a new site is the same whether I’m putting in one plot or 50 plots; the difference doesn’t really scale much. Which leads to an actually interesting conclusion: the results of that uncertainty analysis, of what I should measure first, could be very different for different labs depending on which toys they already own. (laughs) Like, if I already own an eddy covariance tower, which costs $100,000 to put up but gives me measurements 20 times a second that are virtually free, I might make a lot of eddy covariance measurements. If I don’t own that already, it’s not gonna be the first thing I’m gonna buy.

08:43 - I’ll be like, I’m gonna go for that specific leaf area, because I need a hole punch and a drying oven. (laughs) And then the thing I wanted to end on is this idea of what are called observing system simulation experiments, which is just taking what we’ve been talking about to the next level: thinking about designing not just a measurement but often designing a campaign or a network. These sorts of experiments are often done by folks like NASA and NOAA and other agencies. Say you’re gonna deploy a new satellite and you want to optimize its orbit or optimize what wavelengths it’s measuring. What you can do is simulate your true system: you use your model to predict what you think is happening in the world, you simulate pseudo-observations from that system, adding your observation error in, and then you try to assimilate those pseudo-observations back into your model to assess their impacts. And you can do this with whole networks. I might say, what if I deployed buoys in this configuration? What if I set up towers in that other configuration? What if I added this instrument? And so you see these used on larger-scale research networks, and in other parts of the physical environmental sciences, we see this being done more often.

10:11 - We don’t see this a lot in the ecological realm. I think NEON wanted to do this, but they only had the capacity to do it years after they had already locked in their design, ‘cause it took them years to actually get up to the point where they could do it. Sorry, I ran a bit further than I intended. Any questions?