Characterizing Uncertainty

Apr 23, 2020 20:28 · 4804 words · 23 minute read

  • I’m going to talk about characterizing uncertainty in a kind of broad sense, but really only broad within the context of why you want to use Bayesian statistics. We have some classic assumptions whenever we’re learning about statistics throughout much of our training, and for many people still, probably, whenever you’re doing statistical inference. Homoscedasticity, so our variance is constant across observations. The X variables, X being the predictor variables, are in fact not random variables, right? There is no error in them; all your error is in Y, and it’s generally all crunched into the measurement error, so your residuals, or things you just didn’t count right. We generally assume that that error is symmetric about the mean, it’s normally distributed, observations are independent, and you don’t have missing data.

00:56 - If you do have missing data, generally you lose a whole row, right? These are our assumptions for many linear models, and they have been associated with classic or frequentist statistics, which are really rigorous if you can meet those assumptions. But these methods were generally designed for agricultural systems and well-controlled systems. I have a lot of colleagues who do things in jars: they put dirt in a jar, put it in an incubator, and have really tight control. But most of ecology really doesn’t fit those assumptions. And so if you had this cloud of data, yes, you can fit a linear model that goes through those data, and you often would get something like an R squared back out, right? The R squared might be 0.6, and your slope is really strong, P less than 0.05.

01:50 - But you’re still not explaining 40% of the variation in those data. If your goal is to get a statistically significant slope, that’s great. If your goal is to make a prediction or a forecast based on that slope, it’s probably not great. And I’m going to, again, somewhat ad nauseam, go over some terminology and notation to make sure we’re all on the same page, right? So this line here is a model, a linear model, and in this case it’s a linear model that has an intercept and a slope. Those are our beta terms, plus some error. We can think about this as having a data model, where the Y, our observations, are normally distributed, so that’s the N, with some mean, mu, and an error term; that’s our observation model.

02:39 - And that’s describing, generally, how big the spread of our observations is around that line. If we measured our Y’s perfectly, they would fall on that line, if this model, our process model, were correct, right? So your process model is that linear model, okay? As you get more and more data, your beta naught and beta one parameter estimates get tighter, but not necessarily your observation error. Because we’re working in a Bayesian framework, we also then have the parameter model: our parameters in the data and process models, which here are the precision, so one over the variance, and the betas, are themselves treated as random variables.

03:30 - That means that they are distributions that have their own parameters, which we will fix. For this one, for instance, we have a mean of zero and some standard deviation, which is generally how we think of it, right? You’ve got your normal error, and then here we would again have a mean, of the beta naught or the beta one, with some variance about that mean. So they each have their own variance, right? So in a Bayesian framework, even our parameters are treated as random variables, and we’re going to assume that they themselves have a distribution that we’re going to describe, and we’re usually trying to describe the variance on that distribution. And I drew these lines again to emphasize that most of this spread is considered to be observation error. Given that model, the process model, if you could measure things or count them better, then they should fall on that process model, that line.
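To keep the three levels straight, one way to write them out is shown below; the specific priors (mean-zero normals on the betas, a Gamma on the precision) are just illustrative defaults, not the only possible choice.

$$
\begin{aligned}
y_i &\sim \mathcal{N}(\mu_i,\ \sigma^2) && \text{data (observation) model} \\
\mu_i &= \beta_0 + \beta_1 x_i && \text{process model} \\
\beta_0, \beta_1 &\sim \mathcal{N}(0,\ V_\beta), \quad 1/\sigma^2 \sim \mathrm{Gamma}(a,\ b) && \text{parameter models (priors)}
\end{aligned}
$$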

04:37 - Another way to write this is in graph notation. Again, we have our linear model up top, and what we’re really saying is that we have a data model for how we are assuming Y is sampled. We have a process model that has parameters, and then we have parameter models. Again, because we’re working in a Bayesian framework, our parameters themselves have models, or are assumed to come from distributions. And we’ll use this kind of terminology again and again, so if it doesn’t look right, stop me.

05:11 - All right, that kind of fell off the side a little bit, but again, this is reemphasizing that each of these lines has its own equation, or series of equations, that goes with it. So beyond the classic assumptions, what are you actually going to do with real ecological data? That’s why we’re all here. First of all, I think it’s useful for some people to back up and think about what we are actually doing when we’re thinking about a data model or a distribution at all. Why are we using a distribution? We assume that the data, these bars, this histogram, are random samples from a true population, and we describe that population by a probability distribution, right? This one happened to be a normal distribution. We describe it with a distribution because we can’t sample everything: if we wanted to know the height of people in the room, we could measure that, but if we want to know the height of everybody who’s interested in ecological forecasting, we would just use this group to estimate it; we’re not going to measure everybody. The distribution allows us to make some assumptions about how this group represents the entire population.

06:21 - So we all know that, and we do it because we generally pick distributions that have known parameters, mu, standard deviation, things that describe the distribution, and that gives us more flexibility as we move further on to make comparisons. It also helps us estimate what we didn’t measure, right? That tail down there we didn’t get any data for, but if we believe the population is normally distributed, then we can estimate what it should be. Okay, so distributions. But not all data are normal, and even for data that aren’t normal, classic statistics has many very rigorous ways to deal with them, and I’m sure everyone in this room has used them. So here we have a distribution that’s clearly not normal. It goes from zero upwards, right? And it’s not continuous, either. If you fit a normal distribution to these data, if you do a statistical analysis assuming that these data are drawn from a normally distributed population, you will get a mu and you’ll get a sigma. But your mean here of three is not representing the data; it’s not representing what was actually measured. It’s wrong.

07:45 - But you don’t necessarily know that, right? Unless you plot them out. And so this is when you pick a different distribution that better describes that error, or the difference between the observations you collect and the assumptions about the true population. In this case that would be a Poisson distribution, which is just a general distribution for non-negative, discrete data. And instead of a mu and a sigma, you describe a single mean parameter, which you can now see is a little more representative of the data, and make your assumptions about that. Incidentally, this is also a distribution that doesn’t require homoscedasticity, right? The variance increases with the mean.
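A minimal sketch, with made-up numbers, of why that choice matters for count data: compare a normal fit and a Poisson fit to the same simulated counts.

```python
# Compare a normal fit and a Poisson fit to simulated count data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
counts = rng.poisson(lam=3.0, size=200)        # non-negative, discrete "observations"

# Normal fit: the fitted mu and sigma are just the sample mean and SD.
mu_hat, sigma_hat = counts.mean(), counts.std(ddof=1)

# Poisson fit: the fitted mean is also the sample mean, but the distribution
# respects the support (0, 1, 2, ...) and its variance grows with its mean.
lam_hat = counts.mean()

print("normal log-likelihood: ", stats.norm(mu_hat, sigma_hat).logpdf(counts).sum())
print("Poisson log-likelihood:", stats.poisson(lam_hat).logpmf(counts).sum())
print("P(Y < 0) under the fitted normal:", stats.norm(mu_hat, sigma_hat).cdf(0))
```

The fitted normal puts real probability on impossible negative counts, which is one concrete way the "wrong" mean and sigma mislead you.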

08:34 - So sometimes you can deal with that assumption just by choosing a better distribution. We’re going to practice and talk a lot about the distributions that we’re working with, for the different datasets as well as for the parameters. Are they realistic? Are you just choosing normal distributions because they multiply by each other really well, or are they actually a biologically relevant representation of the data? Another common example of data that don’t look normal is binary data, and in this case you can use a binomial, or Bernoulli, distribution to look at the probability of a success, for instance seed germination. The probability in this case is treated with a logit, which is a transformation that maps the probability space, which is zero to one, onto the whole real line, and then you can model that as a linear model, right? The important thing to keep in mind as we move forward is that we’ve got our data model, and we’re assuming that we are sampling out there from some true population of zeros and ones, not a true population of 0.5’s, right? But then we have this link that makes it linear, and the process itself.
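Here is a minimal sketch of that Bernoulli data model with a logit link, for example seed germination against some predictor x. The numbers are invented, and the fit is by maximum likelihood rather than the Bayesian machinery discussed in this talk.

```python
# Bernoulli observations with a logit-linear process model.
import numpy as np
from scipy.special import expit              # inverse logit: (-inf, inf) -> (0, 1)
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
p_true = expit(-2.0 + 0.5 * x)               # process model, on the logit scale
y = rng.binomial(1, p_true)                  # data model: zeros and ones, not 0.5's

def neg_log_lik(beta):
    p = np.clip(expit(beta[0] + beta[1] * x), 1e-12, 1 - 1e-12)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_log_lik, x0=[0.0, 0.0])
print("intercept and slope on the logit scale:", fit.x)
```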

10:01 - Both of those together, the link and the linear model, are part of your process model. So as you collect more data, you’re getting a better and better estimate of that process model. And because we’re Bayesian, we also have our parameter model. Another assumption I touched on is homoscedasticity: the variance doesn’t increase over time. One of these datasets gets a check, it meets that assumption, and one of them does not. We would test it to make sure, but I’m going to tell you it doesn’t: these data look like the variance increases over time, and that’s a problem, because it violates the assumption of a classic linear model.

10:44 - And there are many different ways to deal with that. One would be to try to find a distribution that does not assume the variance is constant over time. But another way would be to model it; you could model the variance. If you remember the graph notation I showed you earlier, Y is distributed normally with mean beta one plus beta two X, right? That’s our linear model, and before, we just had a single variance term. Now you actually have a model for the variance that’s also based on X: you’re modeling the change in variance over that variable X. And the way it looks in the graph notation is that you just add in parameter models for the variance.
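A sketch of that "model the variance too" idea with invented coefficients: both the mean and the spread of y change with x. This version is fit by maximum likelihood; the Bayesian version described in the talk would put priors on all four parameters.

```python
# Heteroscedastic regression: a model for the mean and a model for the variance.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5 + 0.4 * x)   # spread grows with x

def neg_log_lik(theta):
    b0, b1, a0, a1 = theta
    mu = b0 + b1 * x                     # process model for the mean
    sigma = np.exp(a0 + a1 * x)          # variance model (log link keeps sigma > 0)
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=sigma))

fit = minimize(neg_log_lik, x0=[0.0, 1.0, 0.0, 0.0])
print("b0, b1 (mean model) and a0, a1 (variance model):", fit.x)
```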

11:34 - These are a couple of examples of how, once you put this in a probabilistic or Bayesian framework, you can pull apart the data model from the process model, and that gives you more flexibility to tweak either one to better represent your data and your question. In this case, the red is the linear model run as if the variance were constant, and the green is that second model run with changing variance. So X is here, these are our observations, and the true line is in the middle, in dark black; red and green fall on top of each other here and here. For the green one, the one that actually models the variance, the credible interval changes over X, and the red one’s doesn’t really. The mean of the posterior also better reflects the true mean once you actually take the changing variance into account. But there’s really no difference if you do it on data that are not heteroscedastic. And in this case, this is a model that Mike ran, and the DIC, the deviance information criterion, is one way you can evaluate a Bayesian model, similar to an AIC, right? So what have we done here? We’ve added another parameter to the green model.

13:05 - Okay, so it’s got one more parameter, and in a traditional information criterion that’s a penalty: it’s the same data, but now I have one more parameter. Smaller numbers mean a better score. So even though you’ve added another parameter, you’ve fit the data better, and you’ve improved the information criterion.
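For reference, the DIC being compared here is usually defined as

$$
\mathrm{DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\theta}),
$$

where $D(\theta) = -2\log p(y \mid \theta)$ is the deviance, $\bar{D}$ is its posterior mean, and $p_D$ is the effective number of parameters, the penalty just mentioned; smaller DIC is better.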

13:31 - And so again, we’re talking this week about uncertainty, and doing a better job of capturing uncertainty does not necessarily mean you’re going to end up more certain. You’re more certain out here if you use the model that doesn’t fit as well, right? But that doesn’t make you right. And that becomes much more problematic when you’re trying to forecast. All right, so another class of assumptions concerns the observation error: a regression model assumes all the error is in the Y’s. You figure out what the best-fit line is, and you calculate residuals by looking at the difference between your observation and the line. There’s always some uncertainty associated with the parameters that you estimate.

14:14 - You can make that uncertainty smaller by collecting more observations, but the noise that’s left is always assumed to be in Y, in the observation itself. If your process model were perfect, then that’s fine. It’s important to know what your observation error is, and maybe everything about the real variability really is captured by the model: the model explains 40% of the variance, and maybe the other 60% really is just that you can’t work a DBH tape. But I don’t think most people get to the end of their linear regression and feel that way.

14:53 - Usually you think that probably something else is biologically relevant, and it’s not just that you couldn’t count it; yet that’s what the assumption is. Okay, so generally we assume that the observation error is symmetric about that mean, or predicted value, but sometimes you also don’t measure X very well, right? Sometimes there’s actually error in our predictor variables. In fact, a lot of the time there’s probably error in our predictor variables, and there are few frequentist options for capturing that. Sometimes it’s not a big deal, and sometimes it can be: you can imagine that if there’s error in X, and you’re trying to use X to make a prediction, then that error is going to propagate out and also be in your prediction. If you haven’t somehow accounted for it or described it, then it makes your prediction overconfident.
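One standard illustration of what error in X does to a naive regression (not a result from the talk itself, and with made-up numbers): the estimated slope is attenuated toward zero.

```python
# Attenuation of the regression slope when the predictor is measured with error.
import numpy as np

rng = np.random.default_rng(3)
n = 5000
x_true = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x_true + rng.normal(0.0, 0.5, n)   # error in Y only
x_obs = x_true + rng.normal(0.0, 1.0, n)           # predictor measured with error

def ols_slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print("slope estimated from the true X: ", ols_slope(x_true, y))   # close to 2.0
print("slope estimated from the noisy X:", ols_slope(x_obs, y))    # roughly half that
```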

16:03 - All right, so errors in variables is a way to deal with the fact that we can often have errors in X, or uncertainty in our predictor variables, as well as in our response variables, okay? The classic assumption is that it’s all in the response variables, but in ecology the reality is that it’s often also in our predictor variables. So how do you deal with that? The Bayesian framework, again because we have this probabilistic structure, gives us the flexibility to build that into a model. In this case, we’ve got the same linear model we’ve been working with, where our parameters describe, for instance, the slope, the intercept, and the observation error, but we’ve also got a model for the predictor variable X, described by its own set of parameters, that also informs Y. I’ve written it up here a little bit differently: we’re actually modeling X as a random variable.
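Written out, one common form of that errors-in-variables structure looks like the following; the specific distributions are illustrative choices.

$$
\begin{aligned}
x^{\text{obs}}_i &\sim \mathcal{N}(x_i,\ \tau_x^2) && \text{observation model for the predictor} \\
y_i &\sim \mathcal{N}(\beta_0 + \beta_1 x_i,\ \sigma^2) && \text{data and process model for the response} \\
x_i &\sim \mathcal{N}(\mu_x,\ \sigma_x^2) && \text{model for the true, latent } x_i
\end{aligned}
$$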

17:10 - Now we’re going to move into latent variables: anything that’s not directly observed. Sometimes that’s just variables measured with error, whether that error is biased, meaning you’ve got an instrument that’s always a little bit high, or it’s really the random error generally assumed with measurement error, and you want to account for that explicitly. Sometimes it’s missing data, where you want to estimate data that you didn’t actually collect. And most often we hear about latent variables with proxy measures: you’ve got something that you measure, and you’ve got something that you actually want to interpret. I think we heard a lot of people in the project descriptions say, I’ve got these data, and this is what I want to do with these data.

18:01 - And generally it’s not "I want to summarize those data exactly, because I think they’re the true measurement of the population." You’ve got data that represent some bigger population, and sometimes they’re actually proxies for something that you can’t count. I think the book talks about GPP being a proxy, right, built from component things that you go out and measure and then put together. NEP would be the clearest example: you put pieces together and try to make some summary about net ecosystem productivity. So ignoring the fact that there are latent variables can have a whole bunch of outcomes; modeling a derived response, or a flawed observation, as if it were the quantity of interest can lead to incorrect or falsely overconfident conclusions.

18:54 - And I think everybody knows that, but if you go back and look through a lot of analyses, it does not stop most people from doing it. So we’ll talk more explicitly about what a latent variable is and how you might treat it. But first, missing data. We’ve got a dataset with the response on the Y axis and some variable, in this case time, on the X axis, and you’ve got missing data. If you’ve got a good model, then you can predict what’s missing here, right? You can predict Y given some known X and the error that you’ve already estimated. You can make good predictions if your model is a good representation of those data. That’s what most people have probably experienced doing with missing data: you use regression to fill in a gap.

19:47 - That’s this example. I guess sometimes people fill them in other ways, but this is what people are most familiar with: you’ve got a missing Y, and you fill it in given your model. In a Bayesian framework, again because it’s all probability, everything that’s not known is treated as a random variable. If you have missing data, then when you run the model, it will estimate those missing Y’s. Likewise, if you have missing X’s, missing predictor variables, and you define a model for those predictor variables, a distribution they can be drawn from, then you can estimate the predictor variables. And not only can you estimate what the missing ones are, you can estimate how they influence the Y’s, right? It does it all at once.

20:40 - So that’s pretty powerful, especially if you’ve been out there collecting data and you have lots of it, but then you’ve got weird sensor issues, or somebody got shin splints and couldn’t help you that day, so you missed counts on one day. You’ve got random things missing throughout your dataset. Most often in a classic linear regression, you would have to just get rid of that whole row, or you would have to do some kind of gap filling that just fills in a point value. This allows you to actually estimate what it should be, as well as look at the influence of that missing data on the rest of the response variables. So by the time you get down here, everything has been defined as to what distribution it was drawn from,

21:33 - and you’re going to update the regression model, right? That’s the mean equals beta naught plus beta one X. You update the regression model based on all the rows of data, given the current values of the missing data. You start the model with an initial value, then you update it, and then you update the missing data based on that regression model. This is similar to regression gap-filling, except that you’re doing it in an iterative fashion, so you’re not just gap-filling a point, you’re filling it in with all of the uncertainty in how much information you have to inform that point. In order to do that, you assume that your data are missing at random. If your data are not missing at random, if there is some reason that you’re missing chunks of data over and over, like we never got nighttime measurements, or this one person, every time they went out things went really badly, and they were out every Tuesday, you know what I mean? If there are reasons that you have missing data, then you have to build a better missing-data model than what I just showed, because this one is based on the idea that you just have random missing data.
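A toy version of that iterative scheme, for missing Y’s in a simple linear regression: alternate between drawing the regression coefficients given the currently filled-in data and re-drawing the missing Y’s from the regression’s predictive distribution. Flat priors, the observation SD is held fixed for brevity, the data are assumed missing at random, and all numbers are invented.

```python
# Gibbs-style loop: update the regression, then update the missing data.
import numpy as np

rng = np.random.default_rng(11)
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 + 1.5 * x + rng.normal(0.0, 2.0, n)
missing = rng.random(n) < 0.2                    # ~20% of the Y's are missing
y_work = y.copy()
y_work[missing] = y[~missing].mean()             # crude initial values

sigma = 2.0                                      # observation SD, fixed here
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

draws = []
for it in range(2000):
    # (1) update the regression given the current values of the missing data
    beta_hat = XtX_inv @ X.T @ y_work
    beta = rng.multivariate_normal(beta_hat, sigma**2 * XtX_inv)
    # (2) update the missing data given the current regression
    y_work[missing] = rng.normal(beta[0] + beta[1] * x[missing], sigma)
    draws.append(beta)

print("posterior mean of (intercept, slope):", np.mean(draws[500:], axis=0))
```

Each imputed point carries the full predictive spread rather than a single gap-filled value, which is the point being made above.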

22:43 - If it’s systematic, then you can’t estimate the missing data unless you have other data that should be replicates of it, right? This is to say, again, there’s a lot of power in Bayesian inference and in using a probabilistic approach to statistics, but if you don’t have information in the data, you’re still not going to do magic with the statistics. Latent variables: when you observe Y, but interpret it as Z. We already talked about the fact that we do that with our data model; we’re saying that we’re observing data and expecting it to come from this observation error around a true distribution. But missing data, and these proxies where you have one or multiple things informing your response variable, are kind of more common. I was going to end with an example of the breadth of one latent-data approach. This is actually going back more than a decade for me, when I was a forest ecologist and I worked at the Duke Forest FACE site. It was one of the coolest experiments I’ve worked on, because it was just so big; it was DOE, right? They had these big towers that extended above the trees and just blew carbon dioxide onto them.

24:14 - At the time, that was a pretty futuristic CO2 level. And it was a pine plantation, so the trees are all the same age. We know the year that the gas was turned on, we know how old all the pine trees were, and they’re all the same species. It’s kind of like a jar of dirt that you put in an incubator, but out at a big scale. And I got to spend September and October climbing up those towers above the trees with binoculars and counting things; it was a beautiful PhD.

24:47 - It’s the perfect time of the year; everyone else in the lab is out in August measuring trees that are all poison-ivy infested, and I only had to go out in the fall. They’d been collecting seeds before I got there: they put up these laundry baskets and collected seeds in the baskets. And they could see that you got more seeds in the baskets in the elevated rings, which are the red, or the bigger ones here, and fewer in the ambient rings. There were three ambient and three elevated. And so the question here is: is this a CO2 fumigation response? If you’re not a forest ecologist, what you need to know is that the fecundity of trees is often related to tree size.

25:28 - So you make the assumption that the bigger the tree is, the more resources it has to put into fecundity. They have to grow big first, and then they start putting out seeds. And then people do things like: fecundity equals the number of seeds, which is a function of diameter and treatment. Tree size isn’t really diameter, but diameter is a good proxy. Seed number is really hard to estimate; if you want to count seeds on a tree, it turns out that’s hard, so you count them in these seed baskets, and then you have to try to figure out how many of them came from which tree.

26:05 - So right away we’ve got error in Y, error in X, and then I’m trying to ask about this treatment response. But there’s also the question: is this more trees putting out seeds? Are there more trees contributing seeds, or are the trees just bigger and each contributing more seeds, right? So there are lots of different layers to this, which made it a good PhD. Remember, the assumption is that if you know the diameter, you can predict something about fecundity. And we went out and looked; this is total seed cones, so I counted cones, because counting seeds is insane, and you can actually count the cones on a tree with binoculars.

26:44 - The diameter of the tree is here, and if I pull out one chunk, you can see that there’s a ton of variance in fecundity at that diameter, especially in the elevated rings. And if I moved it over here, you would see that there’s a ton of variance in the ambient rings as well. So there’s a lot of individual variability in seed production. If you just wanted to take a mean, then at that diameter class the mean ambient would be about seven cones per tree and the mean elevated about 52, but you can see that that mean isn’t really representative of these data, especially for elevated. And I’m not a perfect counter, but I’m not off by 50 either, right? The variance there is more than just my inability to count.

27:38 - And so we built a model that could take a bunch of different information into that process of fecundity. The fecundity is still a function of tree size, but it’s now conditioned on whether or not the tree is mature, because there are a couple of different ways you can get a zero, right? You can get a zero because the tree just didn’t produce anything that year, which happens in some years. You can get a zero because the tree isn’t mature and shouldn’t be counted as mature; there’s a separate trigger that seems to turn a tree reproductive. And we used both the cones and the seeds to inform the fecundity estimate, and part of the reason we were able to do that is because we had older trees.

28:30 - Remember, all these trees are the same age, so even though they have a diameter range that goes from around 10 up to about 30 centimeters, that’s not really a big diameter range in terms of trees, or even loblolly pine, and we actually had information about bigger trees that could help. If you don’t have any information out here, about 40-centimeter trees, then when you’re trying to estimate a change in fecundity with size, you can’t really inform that line very well. So if you have information to constrain what that should look like, which we did with these other seed traps in other stands, then we could actually use that to help separate what’s observation error from what’s actually within the realm of natural variability, okay? So you can use different data sources to start to describe a latent variable. In this case, we’ve got the latent variables of maturation status and fecundity, where fecundity is informed both by the numbers of seeds in the laundry baskets, which come in via a dispersal model, and by the number of cones that were counted on a tree.
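To make that latent-variable structure concrete, here is a toy generative sketch of the idea: each tree has a latent maturation status, fecundity depends on size only for mature trees, and the observed cone counts carry observation error. This is only an illustration with invented numbers, not the model that was actually fit to the FACE data.

```python
# Toy generative model: latent maturation, size-dependent fecundity, imperfect counts.
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(5)
n_trees = 200
diameter = rng.uniform(10, 30, n_trees)           # cm, roughly the range mentioned

p_mature = expit(-6.0 + 0.4 * diameter)           # bigger trees more likely mature
mature = rng.binomial(1, p_mature)                # latent maturation status

expected_cones = np.exp(-1.0 + 0.15 * diameter)   # fecundity as a function of size
true_cones = rng.poisson(expected_cones * mature) # zero if immature, or by chance

detection = 0.8                                   # binocular counts miss some cones
observed_cones = rng.binomial(true_cones, detection)

print("fraction of zero counts:", (observed_cones == 0).mean())
print("mean count, mature trees only:", observed_cones[mature == 1].mean())
```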