Fusing Data

Apr 23, 2020 20:51 · 8526 words · 41 minute read

  • This is really my last lecture and the thing I wanted to talk about here is fusing data, and by that I’m talking about how we combine multiple sources of information. In the book, and in the grad version of this course, I actually cover this earlier in the semester, I think right after state-space models but before uncertainty propagation and data assimilation. But, really, I wanted you guys to get that material earlier so you could start applying it, and figured we could actually talk about how you deal with multiple sources of information a little bit later. So, some of this is going to jump back to issues that have more to do with calibration, but they definitely come into play during data assimilation as well. One of the things I think is important about data fusion is that fusion is fundamentally about synthesis.

00:57 - It’s about bringing multiple sources of information together. Synthesis is a key complement to forecasting. To do good forecasts, we need to do good synthesis. Synthesis is also the complement to reduction in science, and it deals with the fact that for virtually every system we study there’s no single data set that provides the complete picture of how that system works.

01:28 - And very often you have ecological processes that have multiple parts, and we might have information on those different parts. So, if I’m trying to project a population, it may have age or stage structure, so I might have different information on different life history stages. I might have some information that’s on the individual level and some information that’s on a population level. If I’m working with a carbon cycle model, or a biogeochemical model, I might have different pools and different fluxes, and I might have different ways of making measurements of those different things. An important thing about data fusion: data fusion is far more than just the informatics challenge of bringing different data sets together.

02:10 - It involves much more than just concatenating files together. That would actually probably be the last thing you want to do (laughs): if you have data that were collected differently, just putting them all together in one file loses the fact that they were collected differently. And the other challenge is that we often have data observed at different scales, and there are different uncertainties associated with different scales and different measurement types. So we need to deal with the fact that we can’t just have a single observation error plopped down on top of all the different data sources. It’s also worth talking about data fusion because, if you look in the literature, I think there are a lot of naive approaches out there that just drop the uncertainties and drop the covariances.

03:00 - There was a time, not that long ago, where a lot of folks thought that the right way to build a model was to build models for each subcomponent, to put those subcomponents together, and when you put all your pieces together, it should make the right prediction. That’s great, but when you calibrate each part of a model independently, you don’t estimate the covariances between them. Those covariances are often essential. A great example that I’ve seen in the literature I work with: imagine I’m trying to predict the composition of a forest, or specifically let’s say I’m trying to predict the leaf area in a forest. I could do this by trying to predict the leaf area of every individual species in the forest and then sum those independent predictions up. If I sum those independent predictions up, the range of variability in that prediction is going to be huge, because each individual species is going to be fairly variable, fairly uncertain, and so the uncertainty in the overall prediction would be very large.

04:08 - But, does that actually reflect our understanding of how canopies behave? No, canopies have emergent phenomena whereby it doesn’t matter what species are there; the LAI in a given biome is fairly predictable. That’s because the predictions aren’t independent, there are these covariances between them, and if you don’t capture those in the process you have to capture them in the error structure. They have to be part of that calibration. So, I’m going to go through a couple ways that we think about synthesis. I’m going to start with the simplest, which is meta-analytical methods. These are things that have become more popular in ecology as ways of synthesizing information. One thing that’s improving these days is the ability to access raw data from previously published studies, but there’s a lot of legacy research out there where all you get from a published study are the numbers in the published study.

05:02 - So, summary statistics: you might have a mean, and a sample size, and a standard error, or some other summary statistic from different studies. And when you do a meta-analysis, you often have some effect size that you’re interested in, such as a difference between means, a correlation coefficient, or a regression slope, that you’re trying to combine information about. In the context of a lot of predictive synthesis, and predictive forecasting, the things that are often of interest to us, and the work that we’ve done, are things that map onto model parameters, the things that constrain specific processes. It’s always worth noting that when you do a meta-analysis you face this challenge of reporting bias, that neutral or negative results are less likely to be published than positive results. That can give you skewed estimates of parameters.

05:55 - The fact that this has become very common in ecology is a source of inspiration to me. About a year ago, Frank Davis, who is one of the former directors of NCEAS (the National Center for Ecological Analysis and Synthesis), came to B.U. to start a sabbatical here, he’s been here all year, and I was telling him about some of the forecasting stuff we’re doing and how we’re trying to advance ecological forecasting. He kind of told me this story that when NCEAS started you could spot any NCEAS paper in the literature very easily, and they knew they had essentially achieved their mission goals when you could look through the literature and find lots of synthesis papers and you couldn’t tell which ones were coming from NCEAS working groups and which ones were just done by the rest of the field. So, that’s kind of my inspirational thought about what I want for ecological forecasting.

06:48 - I expect that for the next few years I’ll see you guys publishing on ecological forecasting; when folks that haven’t come through here are the ones publishing on it, I think that’s when we’re starting to make a difference. There are lots of ways of doing meta-analysis, and figures like this really highlight, in my mind, the connection between meta-analysis and some of the things we’ve been talking about over the past few days about updating our inference as we go along. Often a meta-analysis might take a series of effect sizes, each of which has a lot of uncertainty, and try to synthesize them into an overall aggregate story that combines the information across all these different individual studies; often we can reach a confident conclusion from the sum of a whole bunch of fairly unconfident results. What I like is this figure here, the cumulative meta-analysis, that says, “What if we start with the first study, and then what do we get if we combine the information from the first study with the second study? Then what do we get if we combine that with the third study, and then combine it with the fourth study, and then combine it with the fifth study?” What you’re actually seeing here is essentially this Bayesian updating process that’s similar to what we use in forecasting. Every time there’s a new study, they’re adding that information.

08:16 - I mean, they’re doing this all retrospectively, but the thing that I find amazing about this is you can realize that, as a community, you could distinguish your hypotheses from some null model often quite early, and you’re just gaining confidence after that, and things like these huge sample size studies actually don’t really change what was already a clear picture. But I like the iterative version of this. I wanted to touch on one specific version of a meta-analytical model which is one that we’ve been using in some of our work, which is to think about meta-analysis in the context of a hierarchical Bayesian model. One of the things that separates the meta-analytical model from any other model is that in the data model you don’t have the raw observations. You have things like sample means, sample standard deviations, and sample sizes; you have some summary statistics. What we do is we then write down, as our process model stage, something relating, for example, what we believe the latent true mean of that study was to the observed sample mean and summary statistics.

09:35 - So, we actually have a layer that tries to say what the true mean of that study is, even though the mean of that study is what was reported in the study; what’s reported in the study is the sample mean. So, we might say, for example, that the sample mean might be normally distributed around the latent true mean, given some standard error that depends on, say, the within-site variability and the actual sample size. We might similarly write down a model that says that the observed within-study variance might be related to the actual variance and, again, the sample size. So one of the things that’s important about a meta-analysis, when you combine information across multiple studies, is that they don’t all count equally. Studies that have larger sample sizes tend to count more.

10:30 - Studies that are more precise tend to count more. We can then, as a hierarchical level, write down a model, for example just a simple mean, describing the mean of these individual studies, and then the cross-study variance. These are unknowns and so we need to put priors on them. Likewise, we might write down a model describing the study-to-study variability in the variance itself, and again we need priors on these parameters, and we need priors on those parameters. So, we could actually end up with a hierarchical model for the site-to-site mean and a hierarchical model for the site-to-site variance.
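As a rough sketch of what such a meta-analytical model can look like in JAGS, assuming each study reports a sample mean, sample variance, and sample size (all variable names here are illustrative, and I use the simplified shared within-study precision mentioned below rather than the full hierarchical variance model):

```
model{
  for(s in 1:nstudy){
    ## data model: we only see summary statistics, not raw observations
    ybar[s] ~ dnorm(theta[s], n[s] * tau.w)              # SE of the mean = sigma.w / sqrt(n)
    s2[s]   ~ dgamma((n[s]-1)/2, (n[s]-1) * tau.w / 2)   # sampling distribution of the sample variance
    ## process model: latent true study-level means
    theta[s] ~ dnorm(mu, tau.study)                      # across-study (site-to-site) variability
  }
  ## hyperpriors
  mu        ~ dnorm(0, 0.001)     # overall (global) mean
  tau.study ~ dgamma(0.1, 0.1)    # across-study precision
  tau.w     ~ dgamma(0.1, 0.1)    # shared within-study precision (simplifying assumption)
}
```

Studies with larger n[s] or smaller reported variance pull harder on mu, which is exactly the weighting just described.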

11:18 - In practice we often make the simplifying assumption that the within-study variability is actually fairly similar across studies, which just turns this into a prior and takes that extra hierarchical layer away. We’ve actually, in our system, operationalized this into a meta-analytical model that runs on trait databases largely in a fully automated way. Anytime we run ecosystem models it goes into the trait databases we’re connected to, pulls the latest trait data down, and updates parameters. The way the system works is we start with uninformative, or expert-elicited, priors on model parameters, and we then update those with these trait data. So this is just a graphical version of what we’re talking about.

12:13 - We’re fitting an overall, across-site mean and a random site effect; we also include, in our version, a random treatment effect, because often we are synthesizing information that’s coming from experimental studies, and, so, we might have different treatments. And then, since we’re dealing with plants mostly, we have a fixed effect for any sort of greenhouse or potted-plant study, knowing that the traits that come out of those studies may be systematically biased relative to those that you find in natural systems. Then again, we combine those to get study-specific means, constraining them by the observed means, observed sample size, and observed…. The next thing I want to introduce is: what if we wanted to assimilate all of the data at once rather than doing this as an iterative process? Though it’s worth pointing out that mathematically they are equivalent. So, posterior one is proportional to likelihood one times the original prior, posterior two is proportional to likelihood two times posterior one (acting as the prior), posterior three is proportional to likelihood three times posterior two.

13:30 - That’s actually mathematically equivalent to saying posterior three is proportional to likelihood one, times likelihood two, times likelihood three, times the original prior. It’s worth noting that whether you fit your data all at once, or fit iteratively, you should get the same answer. In practice, because we have issues like the fact that the thing that comes out of the posterior is a set of samples, which you then either have to particle filter or approximate with a distribution, you can lose a little bit of information when you do it iteratively; but on the flip side, if you do it iteratively, you don’t have to go back and refit your whole model any time you get new information. So, those are the trade-offs. But, either way, whether you’re doing this iteratively or all at once, I want to point out the idea of: what if I have some process model that’s making a prediction, and it doesn’t matter what this is, but I’m predicting the mean, given some covariates and some parameters. What if I have K different types of observations that are relevant to that prediction? So I’m predicting something and I have a bunch of different ways of observing that thing.
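In symbols (just a restatement of what was said above, writing $L_i(\theta) = p(y_i \mid \theta)$ for the likelihood of the i-th data set):

$$
\begin{aligned}
p(\theta \mid y_1) &\propto L_1(\theta)\,p(\theta) \\
p(\theta \mid y_1, y_2) &\propto L_2(\theta)\,p(\theta \mid y_1) \\
p(\theta \mid y_1, y_2, y_3) &\propto L_3(\theta)\,p(\theta \mid y_1, y_2) \propto L_1(\theta)\,L_2(\theta)\,L_3(\theta)\,p(\theta)
\end{aligned}
$$

and, more generally, with $K$ conditionally independent data types all informing the same process-model prediction $\mu(x,\theta)$,

$$
p(\theta \mid y_1,\dots,y_K) \propto \left[\prod_{k=1}^{K} p\big(y_k \mid \mu(x,\theta)\big)\right] p(\theta).
$$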

14:45 - So I might end up with K different observation models, which essentially translates to K different likelihoods describing the probability of that data given the model’s predictions. So how do we combine them? I’m going to give a simple example of just a regression because I feel regression is something that people can get their heads around. So imagine I have two types of data that both tell me about the relationship between X and Y, and let’s also assume that they’re actually measuring the same X and Y, which is an important assumption, because when you have two different approaches to measuring what is supposed to be the same thing, it’s not uncommon that they’re actually measuring slightly different things. Let’s here assume that they’re actually measuring the same thing, but what I’m seeing is a trade-off between one method, the blue method, which might be cheap but more uncertain, and the green method, which is more precise but probably was more expensive, so I have less of that data. So I have precise data, less of it, and less precise data, more of it.

16:02 - I know lots of people who would just be like, “Well, let’s throw out the blue data, it’s less precise. Let’s just use the green.” Well, that’s throwing out information. Especially since the total information contribution might actually be similar, because with enough low-quality data you still often end up getting a constraint. So here’s what I would get if I fit a Bayesian regression to these two data sets independently. You can see I get slightly different slopes, but they’re not incompatible with each other. I could fit each of these individually by just writing it down; this is just the JAGS code for how I might write down a regression.

16:42 - I have some priors on the slope and intercept, I have some prior on the standard deviation, I loop over all the data, I have my regression model and I have my data model. How do I expand this to fitting both of these data sets at the same time? Because that’s what I want to do. So I want to fit this, which is the synthesis fit across both data sets. So, here’s an example of how I might do that. First, I have one set of priors because I’m fitting one line to both data sets. I have the same exact process model with the same parameters in both places. But when I fit one regression it has its own observation error, and when I fit the second regression it has a different observation error. So I have two likelihoods, each with their own uncertainty, but fitting the same underlying process model. What we get out of that is kind of what we would expect given all that we’ve talked about in terms of updating the forecast, and stuff like that. The resulting line is more precise than either individually, because we’ve combined their information; we’re using both pieces of information.
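A minimal JAGS sketch of that two-likelihood regression might look like the following (the data and variable names are mine, not from any particular script): data set 1 is the cheap, noisy method and data set 2 the precise, sparse one; both share one process model and one set of parameters, but each gets its own observation precision.

```
model{
  ## one set of priors: we are fitting one line to both data sets
  beta0 ~ dnorm(0, 0.001)
  beta1 ~ dnorm(0, 0.001)
  tau1  ~ dgamma(0.1, 0.1)   # observation precision, method 1 (cheap, noisy)
  tau2  ~ dgamma(0.1, 0.1)   # observation precision, method 2 (precise, sparse)

  ## likelihood 1
  for(i in 1:n1){
    mu1[i] <- beta0 + beta1 * x1[i]   # same process model
    y1[i]  ~ dnorm(mu1[i], tau1)      # its own observation error
  }
  ## likelihood 2
  for(i in 1:n2){
    mu2[i] <- beta0 + beta1 * x2[i]   # same parameters
    y2[i]  ~ dnorm(mu2[i], tau2)      # a different observation error
  }
}
```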

18:07 - Obviously, if these were not measuring the exact same things your observation models might end up being more complicated. For example, if you took one of these types of observations as truth you might need an observation model on the other that involves some sort of calibration process for it being a proxy. So I might actually end up with a linear model relating the latent state to a proxy variable, or some other model relating that latent state to that proxy variable, which is completely fine and valid to do. I’m going to give a few more complicated examples of trying to combine multiple pieces of information. This one comes from the plots I actually did my dissertation work on in North Carolina.

18:55 - Shannon’s lost almost as much blood to these plots as I have. There’s lots of things that like to bite you in them. But this is aerial imagery, at fairly high resolution, where you can pick out individual tree crowns. A colleague of Shannon’s and mine, Mike Wallison, for his dissertation, was actually digitizing the individual tree crowns to estimate how much light those trees received. We actually have multiple pieces of information, not just the remotely sensed image, to try to understand how much light each tree is receiving.

19:33 - And so in the end we ended up realizing we had three pieces of information that actually enter through four different likelihoods. We had this remotely sensed estimate of light availability. First of all, what I’m trying to estimate here is lambda, a tree-by-tree estimate of its light environment, how much light is reaching each individual tree. I have an estimate of that that’s coming from the remotely sensed imagery. I can measure the size of its crown and translate that to the scale of the light estimate, which I think is actually expressed in terms of crown area.

20:15 - But, you know, it’s measured with some uncertainty. But we also found that remote sensing doesn’t see every tree in the forest. It mostly sees the ones at the top. So we also ended up with a logistic regression describing the probability of being observed in these images versus not being observed in the images, and that itself was a function of light availability. So, whether we saw the tree or not in the imagery itself gave us information about how much light a tree was getting. Because if we didn’t see it we’d know it was receiving less.

20:52 - So in that case the missing data was actually information. Then we have, sort of, the field data that traditionally gets collected if you’re wandering around in the woods doing a forest inventory: canopy status. Was this tree in the canopy? Was this tree in the midstory or was this tree in the understory? So, we get estimates of light as a function of a multinomial regression for these canopy statuses. And then we had a mechanistic model: we had mapped every tree in this stand, so we could put it into a 3D ray tracing model to predict each tree’s light environment, but those models are also imperfect. So we have a mechanistic model, and the overall estimate of the light environment is the synthesis of information coming from the remote sensing, coming from the field data on canopy status, and coming from this process-based model of the light environment.
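Just as a rough, hypothetical sketch of how four likelihoods can hang off one latent variable (this is not the published model; all names, scales, and distributional choices here are illustrative), the structure in JAGS would look something like:

```
model{
  for(i in 1:ntree){
    lambda[i] ~ dbeta(1, 1)              # latent light environment per tree (0-1 scale assumed)

    ## 1. remote-sensing (crown-based) light estimate, measured with error
    rs[i] ~ dnorm(lambda[i], tau.rs)

    ## 2. detection: was the tree seen in the imagery at all?
    seen[i] ~ dbern(ilogit(a0 + a1 * lambda[i]))

    ## 3. field canopy status (1 = understory, 2 = midstory, 3 = canopy)
    for(k in 1:3){
      phi[i, k] <- exp(b0[k] + b1[k] * lambda[i])
    }
    status[i] ~ dcat(phi[i, 1:3])        # dcat normalizes the weights

    ## 4. prediction from the 3-D ray-tracing model, itself imperfect
    ray[i] ~ dnorm(lambda[i], tau.ray)
  }
  ## priors (reference category fixed for identifiability)
  tau.rs  ~ dgamma(0.1, 0.1)
  tau.ray ~ dgamma(0.1, 0.1)
  a0 ~ dnorm(0, 0.01)
  a1 ~ dnorm(0, 0.01)
  b0[1] <- 0
  b1[1] <- 0
  for(k in 2:3){
    b0[k] ~ dnorm(0, 0.01)
    b1[k] ~ dnorm(0, 0.01)
  }
}
```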

21:45 - Each of these has the same latent variable entering in as the X on all of them. Four likelihoods all constrain the same latent variable. In these examples we’ve talked mostly, up to now, about combining multiple sources of information at a specific point in time, kind of a snapshot, which is very common for regression analyses. I want to next think about how we combine information across space and time. So, essentially, bring this idea of data synthesis into thinking about our state-space models.

22:24 - Imagine we have our state-space model, some latent X that’s evolving through time according to our process model. We have some observations, Y, and one of the things that we learned from Shannon’s lecture is that the state-space model is fairly robust to missing data. I might not have an observation for Y here, I might have observations for Y here and here, and that’s fine. I’m kind of borrowing strength from these to make an inference about this. But we can apply what we did, for example in the simple regression model, here as well.

22:57 - I might have a second set of observations that give me an estimate of X, and they might have their own observation model, they might have their own observation error, similar to that first example of regression that had two likelihoods, each related to the same underlying model. And the nice thing about the state-space model is that it’s very flexible and it can deal with the fact that maybe I just have one, maybe I just have the other, or maybe I have both. So when I have both, I’m going to have the most constraint on X because I literally have four pieces of information constraining it: the previous X, the next X, Y, and Z. While in these other cases, here I might have two, because I have the previous X and the observation but might not have a future observation.

23:50 - Likewise if I’m at the start of the time series, I might have two. So, you have different levels of constraint in proportion to the different number of likelihoods. If all of these are Gaussian, the constraint, again, just ends up being the sum of all of their precisions. If they’re not, it’s conceptually similar. So, it’s straightforward to extend the state-space model to take on this idea of trying to fuse multiple types of observations.
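Here is a minimal JAGS sketch of that idea, using a random walk as a stand-in for whatever your process model is (names are illustrative). Missing Y or Z values can just be passed in as NA and the model borrows strength exactly as described; when everything is Gaussian, the posterior precision on x[t] is literally the sum of the precisions contributed by its temporal neighbors and by each observation.

```
model{
  x[1] ~ dnorm(x_ic, tau_ic)             # initial condition
  for(t in 2:nt){
    x[t] ~ dnorm(x[t-1], tau_proc)       # process model (random-walk stand-in) + process error
  }
  for(t in 1:nt){
    y[t] ~ dnorm(x[t], tau_y)            # observation stream 1, its own error
    z[t] ~ dnorm(x[t], tau_z)            # observation stream 2, its own error
  }
  tau_proc ~ dgamma(0.1, 0.1)
  tau_y    ~ dgamma(0.1, 0.1)
  tau_z    ~ dgamma(0.1, 0.1)
}
```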

24:20 - What do we do if they come from different scales, though? There are basically two options: one is to scale the process model and the other is to aggregate the data model. Let’s look at what that might be. First option, let’s think about this in terms of time. Imagine Y is operating on a coarser timescale than Z, so Z is at a fine timescale. I might choose to write down my process model at the timescale of Z and then, when I write down the observation model for Y, a single Y may actually be mapped to multiple latent states. Worth noting: if Y is just instantaneous but coarser resolution in time, I’m simply mapping it to individual Xs, just infrequently; but often you have types of measurements that integrate over space, or integrate over time.

25:21 - So what was the cumulative discharge through a weir, or what was the cumulative flux observed, or what was the cumulative number of individuals that fell into a pitfall trap? It integrates over the whole time that the sensor was making the measurement; it doesn’t map to a specific discrete time. But that’s actually not hard to do. You might say that the sum or the average of all of these latent states is related to that coarse observation, according to a likelihood. The advantage of this is we’re working at the full resolution, taking advantage of the high-resolution data. The cons are the computation, since when you’re working at the high resolution you have a lot more states to estimate, and then potentially identifiability. So if I had a period where I didn’t have the Zs and I only had the Ys, then there are potentially an infinite number of permutations of these Xs that are compatible with their sum.
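A hedged sketch of that first option in JAGS, assuming for illustration that each coarse observation y[j] integrates over nagg consecutive fine-scale steps (the index bookkeeping and names here are mine):

```
model{
  x[1] ~ dnorm(x_ic, tau_ic)
  for(t in 2:nt){
    x[t] ~ dnorm(x[t-1], tau_proc)               # process model at the fine (Z) timescale
  }
  for(t in 1:nt){
    z[t] ~ dnorm(x[t], tau_z)                    # fine-scale observations
  }
  for(j in 1:ny){
    agg[j] <- sum(x[((j-1)*nagg + 1):(j*nagg)])  # integral of the latent states over interval j
    y[j]   ~ dnorm(agg[j], tau_y)                # coarse observation of that integrated quantity
  }
  tau_proc ~ dgamma(0.1, 0.1)
  tau_z    ~ dgamma(0.1, 0.1)
  tau_y    ~ dgamma(0.1, 0.1)
}
```

Use mean() instead of sum() if the coarse measurement is an average rather than a cumulative total; the second option, modeling at the coarse scale, just flips which side of the likelihood gets aggregated.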

26:24 - Again, that’s not a problem if you do have the Zs. And it’s also not a problem if you have a strong process model linking these; say I was observing a trend and then I saw the aggregate over that trend. Yeah, it’s not hard to disentangle that. But if all you’re seeing is coarse resolution, so, as an extreme, if I only had coarse-resolution data and decided to model at very high resolution, I might not have any way of disaggregating that. I might just end up with four latent states that are highly correlated with each other and just trading off. I can also do the opposite. I might choose to model at the coarse resolution, in which case I might write down a likelihood that compares this coarse timescale step to, for example, the sum or mean of my high-resolution data.

27:23 - That is computationally more efficient, but potentially involves a loss of information because I’m taking the high-resolution data I have and aggregating it to a coarser scale. In this example I talk about doing this in time, but everything here also works in space as well. So here’s an example, not mine, of applying this concept to, for example, dealing with irregular polygon data, such as GIS layers, where the observations might be township- or county-level summary statistics and you’re trying to make an inference about the latent continuous process. So, there’s some continuous process of which you observe, maybe, county-level summary statistics and you’re trying to disaggregate that. Again, it’s not strictly identifiable if every single high-resolution pixel is independent, but if you make some assumption about the spatial smoothness of that process (and it doesn’t have to be a strong assumption; for example, you might just assume it’s a spatially smooth process with some spatial autocorrelation parameter that has to be estimated), you can actually do that disaggregation.
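As a deliberately simplified, one-dimensional analog of that change-of-support idea (a real application would use a proper two-dimensional spatial model; everything here is illustrative), the structure in JAGS is just a smoothness prior on the fine-scale latent process plus a likelihood on the polygon-level summaries:

```
model{
  ## latent fine-scale process with a smoothness (autocorrelation) assumption
  w[1] ~ dnorm(0, 0.001)
  for(p in 2:npix){
    w[p] ~ dnorm(w[p-1], tau_space)      # random-walk smoothness prior along the transect
  }
  ## observed polygon-level (e.g., county) summary statistics
  for(j in 1:npoly){
    wbar[j] <- mean(w[lo[j]:hi[j]])      # average of the pixels falling in polygon j
    ybar[j] ~ dnorm(wbar[j], tau_obs)
  }
  tau_space ~ dgamma(0.1, 0.1)
  tau_obs   ~ dgamma(0.1, 0.1)
}
```

Here lo[j] and hi[j] are passed in as data giving which pixels belong to polygon j; the smoothness prior is what makes the disaggregation identifiable.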

28:41 - And it works even better if you then do have other data sets at other scales that give you information. If you need to dive into this literature, these sorts of things are often referred to as change-of-support problems in the stats literature, because you’re dealing with integrating data across different scales. In the spatial stats literature there’s also a whole series of examples dealing with, for example, combining point data with areal extent data, whether it be polygon or raster. The other takeaway is that the way our GIS software does it is completely wrong. (laughs) It does all of its upscaling and downscaling without any accounting for uncertainty; Arc would just smooth it.

29:34 - The nice thing about the state-space approach is when you ask it to disaggregate information you have the full posterior about the thing you disaggregated. You don’t just have an interpolated surface, which can be really important, because if I was taking this map and feeding it into some other analysis as a covariate or as an input I don’t want to treat each of those disaggregated pieces of information as if they were data, because that can result in a lot of false confidence. I could set this up at some fine spatial resolution and have millions of data points entering into the next stage of the analysis. I don’t really have millions of data points, I have like thirty townships. So change of support, changing of scales, can create a false impression of more information than you actually have.

30:28 - By contrast, if we do this in a Bayesian way, not only do we have the uncertainty, but remember when we draw things from posteriors we draw them jointly, so you might draw a whole row of samples; you might draw a whole map that would account for the uncertainties appropriately. The next challenge that I’ve seen a lot, when it comes to data fusion, is the challenge of identifiability. In my line of work the classic example of this is eddy covariance towers. If you’ve not seen these before they’re a cool bit of technology.

31:06 - You set up a scaffold in an ecosystem and it has a bunch of toys on it that measure wind speed using sound waves and concentrations of gas using lasers, and it’s cool. It’s neat because it can measure the net flux of gases between the atmosphere and the land surface, most commonly the CO2 flux and the water flux. So it might give me the net carbon flux in and out of the system and the net water flux in and out of the system. Okay, but that net is made up of a whole bunch of underlying processes. And there are literally an infinite number of ways to get that net flux from all the trade-offs among these different processes.

31:53 - So if all you observe is the net, it’s very hard to disentangle that. So, this is, again, an example where synthesis is very valuable, because if I have that net flux but I also combine that with a lot of detailed information about specific processes, then I have a way of starting to disaggregate it. So this is an example where I might have a model that has multiple processes and some observations constrain the aggregate that comes out of it and others constrain parts. A population analog, again, would be something like an age- or stage-structured model where I might have detailed information about specific transitions, I might have individual-level data, and then I also might have population-level data that’s constraining the net overall behavior of this system, and then I combine both of those pieces of information. This study, done by Trevor Keenan a few years ago, I thought was a neat example of exploring some of the challenges of fusing multiple pieces of information and trying to understand the information contribution of different sources of information.

33:07 - Each of these is a different data set, and this is actually an attempt to constrain a carbon cycle model at Harvard Forest. Harvard Forest is an LTER and NEON site that’s about an hour and a half due west of us, in central Mass. What we see here is the posterior distribution of our error estimate on a log scale. And when he starts with just the first data set it’s kind of off the scale. Then here are the posterior estimates of individual parameters plotted as violin plots.

33:41 - Then here is his carbon cycle prediction from 2000 out to 2100. We can see that there’s a good bit of uncertainty in the forecast, a lot of uncertainty in the parameters, and a lot of process error. What he did was he went through and assimilated one data set at a time, kind of that iterative process, take the posteriors from one, feed it as the priors for the next. You can see how, when he added different data sets, how he constrained the overall error, constrained individual parameters, some of which still never ended up well constrained, and then how that affected the forecast error in the end. Trevor’s working with a fairly simple model, so he was able to do something that I rarely do, which is he did this as a forward selection problem like he would do in a regression.

34:32 - In the first stage he literally tried fitting every single data set by itself and asked, “Which data set gave me the best constraint on the model?” So the first thing at the top was actually the single data set that gave the best constraint on the whole system. He then took the N minus one data sets remaining, fit all of them conditional on the posterior from the first, and asked, “What’s the second most important data set?” This is why it’s called “Rate my data”. He could actually say what was the order of importance of the data. And as a nice contrast to the idea of the value of multiple constraints, Trevor will also point out that when you get to the end there’s also the reality of redundancy: two pieces of information may be providing you essentially the same constraint on the system.

35:28 - While they still constrain parameters, they’re not nearly as valuable as additional, independent axes of constraint. For example, he found that some of the explicit phenological measurements were not as valuable because a lot of the information about phenology was already embedded in the flux data. This isn’t the last thing I want to focus on, but it sets me up for what I think is one of the biggest and most underappreciated challenges in data synthesis, which is what happens when you combine data sets that are not equal in size. If I have the same number of samples from 10 different data sets and I combine them together, I get a decent constraint; they all contribute roughly equally. But what if I have something like this? This is the NEE data, so this is half-hourly data throughout the year.

36:32 - At the Harvard Forest, the towers have been running for over 20 years, so that’s about 17,000 observations per year times over 20 years. This is a very large amount of data. By contrast, soil carbon. How many times are you going to go out with a soil corer, hammering it into the ground? You’re definitely not doing that every half hour. Unless you want to kill your undergraduates, or be killed by mutinous undergraduates, you’re never going to have someone measure soil carbon on a half-hourly timescale. What happens when you combine something that is measured manually with a small number of samples with something that’s measured in an automated sense? Well, on one hand, you definitely want to include both, because usually the reason you’re combining this additional information is because it’s giving you that additional axis of information. You’re trying to use it to tell you something you didn’t already know from the big data.

37:38 - But often your likelihood will end up dominated by the larger data set, such that your model calibration will often just ignore that other high-quality, manually collected data just because it’s getting overwhelmed by the large volume of data. If you look at the existing literature, at least the literature that I’ve found, it’s full of ad hoc solutions. Things like multiplying the likelihoods by arbitrary, expert-chosen numbers in order to make one data set more important and the other one less important, or subsampling the data, or averaging the data to a lower frequency to bring them into balance. Well, that works, but if you’re subsampling or averaging the data, you’re essentially throwing out all this information that you actually have, which is painful. And the first one is technically invalid, because if you just multiply your likelihood by arbitrary numbers, not only does it affect your mean, but it can have a real big impact on your confidence intervals.

38:44 - It’s like, yeah, if I measure 10 observations but I multiply my likelihood by the number 100, my confidence intervals look great. (laughs) It doesn’t actually mean I know what’s going on, it just means I artificially pretended I had 100 times more information than I actually do. Then we have loss of information, and again, this is somewhat arbitrary. You can in some sense get the answer you want by tuning the degree of averaging. If I don’t like this answer I can try monthly or try daily or try weekly, and at some point I’m going to get something where I say, “Okay, I’ve decided that I like the answer I got, because it seems to balance these two pieces of competing information,” but, again, it’s arbitrary.

39:33 - So how about we do this in a little less arbitrary way. Oh yeah, and avoid double dipping. I’ve also seen people do things like, actually, I’ve seen Trevor do it: assimilate daily, monthly, and annual aggregates of the same data. Are those independent pieces of information? No. He’s put in the same data three times as three different constraints. I love Trevor but I don’t always agree with all of his choices.

40:05 - But he’s not the only one who’s done that, I’ve seen lots of people do this. This animation is kind of meant to give an example of something I think is intuitive to a lot of people, which is that when we have high-frequency automated measurements that may make thousands and thousands of observations per year, it doesn’t actually mean that we have thousands and thousands of pieces of information. Here I am starting with a full data set that’s not thinned, and I’m thinning it by half each time, and I’m looking at how the autocorrelation changes, and you can also just visually look and see that when I first start doing this, I’m not actually losing much information, even though I’m halving the sample size every single time. So, the point there is that when we put in high-frequency automated measurements in space or time and we ignore the autocorrelation, we’re giving them too much weight, because the observations are not independent. So here, for that simple simulated example, I’m actually plotting log number of observations against log autocorrelation, and there’s just a strong trade-off between the thinning and how the autocorrelation changes, such that the effective sample size in the data does not change nearly as quickly as the reduction in the number of observations.

41:44 - In some sense, it’s the effective sample size that’s closer to a real measure of the information contribution of a data set. I guess the point there is: treat the uncertainties in data appropriately when you try to combine multiple pieces of information. Things like, when you have repeated-measures data, include the fact that it is repeated-measures data. If you ignore the autocorrelation in space and time you will give a lot of these automated data sets much more weight than they deserve. If you treat each observation as independent you’re really inflating their weight.
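One common way to quantify this, assuming the residuals behave roughly like a first-order autoregressive (AR(1)) process with lag-1 autocorrelation $\rho$, is the effective sample size

$$
n_{\mathrm{eff}} \approx n\,\frac{1-\rho}{1+\rho}
$$

so, for example, 17,000 half-hourly observations with $\rho = 0.95$ carry roughly the information of only about 440 independent observations (the 0.95 is just an illustrative value, not an estimate for any particular tower).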

42:22 - That’s one of the reasons that these automated measurements can swamp other measurements: if done naively, we treat them as independent and inflate their importance. The other thing that we can think about is that when someone goes out and puts a flux tower up, or puts any other sort of automated measurement up, and I do have data loggers running in lots of places, I usually put out one sensor. I may have a lot of time series information, but I have a sample size of one. So at the same time, if you think about it, if I have sampling uncertainty, how do I even estimate my sampling uncertainty if I only take one sample? That’s another thing that’s interesting; again, it affects why we need to think about the uncertainty in these sorts of measurements appropriately. So if you have high temporal resolution information that’s unreplicated, how do you deal with the fact that you don’t know its replication? In fact, if you believe that there’s some overall mean that you’re sampling from and you don’t know where you’ve drawn from that distribution, where you happen to set up your data logger, you can essentially treat that as a bias.

43:42 - So, think about it this way: if I had unlimited money and I set up a dozen flux towers I could get a good estimate of their sample mean and their variability, and it’s really that overall landscape mean that I’m interested in. If I happened to have put up just that one tower, then all of the information I’m collecting at that tower has some bias associated with it, and I don’t know what that bias is because I don’t have any sampling of that variability. If I have any way of estimating that sampling uncertainty, I sure as heck should put an informative prior on it. If I have an uninformative prior, essentially, the whole thing blows up. I’ve never even tried that. But, in some sense, that’s something that you would want to think about.
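In JAGS, treating that unknown site-selection error as a persistent bias might look like the hypothetical sketch below. The key point is that tau_bias has to come from an informative prior (here just passed in as data); with a vague prior the bias and the mean are confounded and, as noted above, the whole thing blows up.

```
model{
  bias ~ dnorm(0, tau_bias)              # how far this one tower might sit from the landscape mean;
                                         # tau_bias supplied as data from an informative prior
  mu0      ~ dnorm(0, 0.001)             # landscape-level mean we actually care about
  tau_proc ~ dgamma(0.1, 0.1)
  tau_obs  ~ dgamma(0.1, 0.1)
  for(t in 1:nt){
    x[t] ~ dnorm(mu0, tau_proc)          # stand-in for whatever the process model predicts
    y[t] ~ dnorm(x[t] + bias, tau_obs)   # every observation from this tower shares the same offset
  }
}
```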

44:42 - How do you account for that? And then the other thing I want to focus more on for the rest of this lecture is systematic errors. This gives kind of a hint of it. By chance I’ve set up some place that’s slightly different than average, but every place I’ve set up is slightly different from average. But there are also lots of measurement techniques out there that, themselves, have some errors associated with them. The important thing about systematic errors is they don’t average out. So we talked about how, if you treat data as independent, it over-inflates its information.

45:21 - If I can account for its autocorrelation I can reduce that, but I still have this property that random errors will average out. But what if there’s also a systematic error? I can have 17,000 observations a year; the random component of that error will essentially go to zero by the end of the year, but any systematic error that I have in that observation persists, no matter how many observations I take. I could have tons of data, where every single second I take an observation on a process, but if the way I’m taking that observation is biased, I’m still getting a biased estimate. Something like this happened to me when I was a post-doc, one of the first times I tried to constrain a process model.

46:07 - I was working with data from Bartlett Forest, up in New Hampshire, and I knew that I couldn’t just use the net carbon flux because of the identifiability problem, so I also had soil respiration fluxes. But in the raw data the soil respiration numbers were bigger than the total ecosystem respiration numbers. I didn’t realize that until after I tried the calibration. I can tell you that the optimal solution when you try to make a mechanistic model produce higher soil respiration than ecosystem respiration results in a crazy ecosystem. It basically says there is no autotrophic respiration and there’s a whole lot of heterotrophic respiration.

46:56 - Plants grow absolutely insane, because they have a constraint that is physically impossible. The data sets have systematic errors. In this case you didn’t actually know which one was wrong, but you knew that at least one of them had to be. They were incompatible with each other. In examples like that it’s obvious, but it’s also true that it’s not always obvious, and it’s definitely true that you get really funky errors because of these systematic errors. Don’t forget that systematic errors are present even when we’re only calibrating with one data set. But they often become much more obvious when we’re trying to do synthesis, because if you’re only calibrating models against one data set, you can often get the model to fit that data quite well.

47:45 - But when you’re trying to calibrate a model with multiple data sets you can get to a point where that really highlights that there’s no way to get the model compatible with both sets of observations. So, the last bit here comes from some work I did with David Cameron, the Scottish ecologist, not the Prime Minister. A few years ago I started working with a cross-E.U. COST Action; COST Actions are kind of similar to what we have as RCNs here in the U.S. I found a kindred spirit in someone else who also was obsessed with the challenge of synthesizing multiple sources of information and some of these challenges of systematic errors, inconsistencies, and data imbalance.

48:35 - So one of the things that David and I did was some pseudodata experiments with a very simple ecosystem model. In fact we named the model the Very Simple Ecosystem Model, and it’s very similar to what you guys played with in the particle filter exercise: we just had an NPP process that allocates carbon to leaf and wood, the leaf and wood turn over to the soil, we have some turnover of the soil, and then the amount of leaves affects the amount of NPP. Pretty simple model. What we did was we took that model, we simulated data from it, we then added different types of errors, systematic and random errors in the data and systematic and random errors in the model, and explored some trade-offs with balanced and unbalanced data. First, if you have a perfect model, so we didn’t put any errors in the model, and we have very balanced data, you get a good fit of the model to the data and you recover the right parameters. - [Student] What do you mean by balanced data? - So, the sample sizes of the different things.

49:45 - I think we simulated measurements of net carbon flux, soil carbon, and aboveground carbon, and we made measurements of each of those things, the same number of observations of each of these different things. So, this avoids that problem of unbalanced data that I was talking about before, where I have a lot of data on one thing and a little bit of data on another thing. Next, we created that unbalanced data situation, which we had observed leads to some of these problems we’ve seen previously with real data and real models, which was: with unbalanced data, the model would fit the high-volume data set very well and just ignore the others. Well, it turns out if you have a perfect model, you still get a good fit and the right parameters if you’re fitting the right data to the right model.

50:42 - That highlighted that the problems of unbalanced data are not inherently in the fact that they’re unbalanced. We then introduced errors in the model, and in some sense that meant that the model we were fitting the data to was slightly different than the model used to generate the data. So, the model that we fit the data to was an approximation of the true model. When we did that with balanced data we could get the model to fit the data well, but we did not actually recover the true parameters. And if we fit to unbalanced data, we didn’t recover the true parameters and we couldn’t get the model to fit the low-volume data well.

51:32 - Again, this is exactly what we see in the real world: with unbalanced data you fit to the high-volume data and you kind of ignore the low-volume data. So that kind of highlights that it’s actually not the imbalance itself that’s the problem when you’re synthesizing data. One problem is error in the models themselves. You may ask, “Okay, great. Why don’t you just fix the model?” You can’t actually ever do that. I mean, you can improve the model, but models are always approximations of reality. Models are always approximate.

52:11 - So I can improve the model to deal with some of the known errors, but the model will never be perfect, the model will never be reality, and therefore if I have high-volume data, and in some parts of ecology we are getting to big data, if I have big data I will always create a conflict between the model and reality at some point. Because the model is not perfect, that conflict always emerges at some point. So here is a simple example of what this looks like: model error, unbalanced data. The high-volume data we fit great; for the low-volume data, here’s the truth and here’s what the model tried to reconstruct, and what it reconstructed is driven by the high-volume data much more than by the true observations. This is the classic thing we see a lot, which is models ignoring low-volume information.

53:12 - Then we asked, “Well, what if the data isn’t perfect either?” I highlighted that earlier. So we ran a couple simulations. First, the model is perfect and the data sets are balanced, so we have the same volumes of information, but we have a bias in the data now. We can get a good fit, but we can’t recover the exact true parameters, because the data itself has a systematic error in it; but we can still get a good fit that we can use to make good predictions. On the flip side, if the data’s unbalanced and there’s some bias in the data, then even if we use the true model, fitting back to that true model, we can’t get it to fit right once it’s unbalanced. Then, obviously, if we have errors in the data and errors in the model, no, it doesn’t work. Especially when it’s unbalanced. So here’s another example.

54:16 - Errors in the model, unbalanced data, errors in the data: we fit the high-volume data well, but we don’t even have the direction of what’s going on in this term correct. What we then asked is the question, “What if we build corrections for the systematic errors in the data or in the models into the likelihood itself?” So here ‘linear model’ means that we’ve put a linear bias-correction model into the likelihood. We’ve said that the data may be additively biased, it might be multiplicatively biased, so we just put both in. All of these are cases where the data is unbalanced, but we explored where there was an error in the model, where there were errors in the data, or where there were errors in the model and the data. In all of these cases we can actually get good fits to data once we include the fact that there are systematic errors as part of the data model itself, as part of the likelihood.
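Mechanically, building that into the likelihood is simple. Here is a hedged JAGS sketch with both an additive and a multiplicative bias term on one data stream (names are illustrative, and in practice this is only identifiable because other, unbiased data streams and the process model itself also constrain the prediction):

```
model{
  alpha ~ dnorm(0, 0.01)                      # additive bias in this data stream
  beta  ~ dnorm(1, 0.01)                      # multiplicative bias (1 = unbiased)
  tau   ~ dgamma(0.1, 0.1)
  b0 ~ dnorm(0, 0.001)
  b1 ~ dnorm(0, 0.001)
  for(i in 1:n){
    mu[i] <- b0 + b1 * x[i]                   # stand-in for the process-model prediction
    y[i]  ~ dnorm(alpha + beta * mu[i], tau)  # data model includes the bias-correction model
  }
}
```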

55:25 - We never recover the true parameters when there are biases, but we actually can get good fits. So here’s an example: NEE, soil carbon, vegetative carbon, low-volume data, biased data; we can get things to behave right, and we can actually get the bias correction to get the state right. Some of the take-home was that while information content is important to fusing data, the errors and biases in the models and data were actually much more important. Perfect models can deal with unbalanced data, though if you don’t account for the autocorrelation you can still end up with overconfident estimates of your parameters. And building the bias correction into the calibration can lead to good performance of your model, so you can make decent predictions, but you’re never actually recovering the true system.