Machine Learning
Apr 23, 2020 20:51 · 8544 words · 41 minute read
- I’m Barbara A. Han, I’m a disease ecologist at the Cary Institute, and I work with Shannon and Kathy who are just down the hall from me. Mike asked me to talk to you guys today about machine learning, which seems like a really daunting field or area to talk about, so I’m gonna try to give sort of a broad overview and really connect it more towards the type of datasets and questions that we might ask as ecologists, and I’m gonna use some examples from work that I do in my program looking at infectious diseases. But I think it would be more useful for both of us if you guys would interject and ask questions and say, hey, you haven’t talked about this yet but I keep hearing this term, can you tell me what it means and whether it’s relevant to us, that kind of stuff. So I’m gonna start us out with a broad picture of where machine learning fits into the bigger world of AI and data mining and that kind of stuff, and then I’ll start to move towards examples and the rationale behind when we would want to use these types of things, but please feel free to just stop me and say, this might be a non-sequitur but can you talk about this other thing. I’m gonna talk about machine learning, but I’m gonna set the stage first: machine learning is oftentimes conflated with artificial intelligence, and they’re sort of used interchangeably in the media, but really machine learning sits underneath a broader umbrella of artificial intelligence.
01:19 - But I would say that, in contrast to how this sort of diagram is showing it, machine learning bleeds over into and is sort of taking over a lot of these other areas, because it’s like the engine that creates the understanding, the patterns that come from all of the big data sets that these other systems are sitting on top of. So IBM Watson has a question-and-answer type of algorithm sitting underneath it, and that bleeds over into speech to text and text to speech, that sort of intuitive stuff. Have you guys seen those videos from, I think it’s actually here, there’s a really famous robotics lab - [Man] Boston Dynamics. - Yes, that’s the one, Boston Dynamics. There are these really sort of sad videos of them training a robot to pick up a box, and they’ll knock the box out of its hands, and they figure out what the robot does, how it picks it up, and then they’ll try to push the robot with a hockey stick and it’ll sort of self-correct. I mean it’s really amazing to watch this robot take in the information about what’s happening and then re-orient itself. It’s just a little sad to watch it continue to be bullied with a hockey stick.
02:22 - And then there’s this really exciting area of image recognition and machine vision, and so if any of you guys have used Pinterest or Google Image Search, this is the kind of algorithm it’s using. Underneath machine learning there’s this thing called deep learning, which connects to machine vision a little bit, but deep learning is an even more specialized, even more cutting-edge version of machine learning, and predictive analytics is just a broad umbrella term that tries to capture the idea that what machine learning really excels at is making predictions, based on the input data that you give it about something, based on some pattern. So what machine learning really does is identify patterns from data, learn those patterns, and try to figure out how to optimize some task it is given, whether that’s classification, or predicting some number as in regression, or separating things into multiple groups, and things like that. There is a more formalized definition of this that was put forward by a guy named Tom Mitchell, who wrote one of the seminal books, Machine Learning. He says it’s a computer program, okay, so it’s a program, it’s an algorithm, and it’s said to learn from experience E with respect to some class of tasks T and performance measure P if its performance on those tasks, as measured by P, improves with experience E. So it’s a program that’s learning to do some thing based on experience, and you measure it with performance P. And then there’s this other book, which those of you who are really interested in machine learning should check out; there are a couple of different editions of this now.
03:49 - But Hastie, Tibshirani, and Friedman said that vast amounts of data are being generated in many fields, and it’s our job to make sense of it all, to extract important patterns and trends, so that’s the learning part, and to understand what the data say. And we call this learning from the data, and I think that definition is pretty agnostic to field, but it really does apply to ecologists: we’re trying to make sense of the world around us, and sometimes that means that we are formulating our own hypotheses based on observation, and then we do something to test those hypotheses in a rigorous way. And that sort of assumes that we have a process in mind, and so it’s important for us to assign error and to understand how well we’ve gotten that process model correct. Other times you have lots of data and you can make lots of observations, sometimes very high-frequency observations, about the world around us, and then we wanna extract pattern from that information that’s suggestive of hypotheses that the data are telling us about. And that’s sort of a top-down approach to generating hypotheses and gleaning information about what the system is showing us.
04:48 - The Coursera course, Machine Learning, is taught by Andrew Ng, who’s one of the founders of Coursera and one of the real innovators in this area; he helped pioneer a lot of the modern work in deep learning. Elements of Statistical Learning is a Springer series book, and a lot of the Springer series books are free online, especially if you have a nice university subscription; there’s a ton of the Springer yellow series that are free and they’re all released. This one also has lots of great code, and it’s quite mathy, so all of the formulations are in there for all the methods, what’s underneath the hood. When you think about machine learning, there are a couple of ways that you can break it up. The two main things that you’ll hear about are probably supervised learning and unsupervised learning; reinforcement learning is way on the other side there because it’s a little bit weird, and then semi-supervised learning is of course between supervised and unsupervised. But what supervised basically means is that you have kind of a label: if you are thinking about a matrix, you have rows and you have columns.
05:46 - Those columns are related to the row in some way, right? So if your rows are animals and you’re measuring something about the animal, that thing is the label, and if you want to use a bunch of variables to predict that thing, you would want a supervised learning type of approach, right, because you’re trying to make some associations between a whole bunch of predictors and some label. When you don’t have a label and you just have lots of data, lots of columns about a system, it might be that you want to learn about the patterns in that system, and so you ask maybe an unsupervised learning algorithm to cluster the rows into groups according to how far apart you can separate them on the basis of their predictors, their other variables. You might wanna ask it to summarize, so you guys are probably very familiar with things like PCA; these are really common dimensionality reduction techniques, which just means that you’re taking a whole bunch of features and collapsing them down into a smaller number of features that capture the majority of the information there. There’s another group of algorithms called association rule algorithms, and those are based on market basket analysis. Have you guys heard about this example? People that run supermarkets wanna know how best to market their products, and so you crunch the data on what objects are found together in somebody’s shopping cart. You’ll find that, you know, close to Super Bowl Sunday, beer and chips and salsa are usually found in the cart together, so you might want to put those things together in physical proximity in the store at that time of year.
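To make that distinction concrete, here is a minimal R sketch of the unsupervised side, assuming a hypothetical trait table `traits` with only numeric columns and no missing values; the association-rule call uses the example `Groceries` transactions that ship with the arules package rather than any real shopping data.

```r
# Unsupervised learning: no label column, just look for structure in the rows
traits_scaled <- scale(traits)            # hypothetical numeric trait matrix

# Clustering: split the rows into groups by similarity
groups <- kmeans(traits_scaled, centers = 3)$cluster

# Dimensionality reduction: collapse many features into a few components (PCA)
pca <- prcomp(traits_scaled)
summary(pca)                              # variance captured by each component

# Association rules (market basket analysis) on example grocery transactions
library(arules)
data("Groceries")
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift")))   # rules like "if X in cart, then Y likely"
```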
07:17 - You might find that wine and diapers are kinda near each other, you know, like when you’re a parent (mumbles). Yeah, these association rules are generated from the data, where the data are grouped in some way, and they give you some rules that you can use as rules of thumb to make predictions about what you might find in another basket. And then you have things like anomaly detection, where you’re kind of trying to establish a baseline pattern according to some group of features and then you want to be able to detect the anomalies, the weird things, so a lot of the financial and fraud algorithms are anomaly detection. Semi-supervised is, depending on who you talk to, not really recognized as its own field. It’s sort of a mash of supervised and unsupervised where, in the majority of cases, and especially the ecological cases, you have a label for some small number of samples, but for the majority of the other samples you don’t really know what the label is, you have no observation for those other ones. So what does semi-supervised learning try to do?
08:19 - It tries to glean pattern information from the unlabeled things and use that information to help you do a better job of predicting from the information that you have labels for. So it’s kind of borrowing information from both the labeled and the unlabeled things to do a better job at prediction. So in all of these cases the job is maximizing prediction accuracy, and then reinforcement learning is this weird thing used for stuff like walking: if you want to teach a four-legged thing to walk in simulation, you can give it a goal, say, I want you to minimize the time it takes for you to get from point A to point B, and you can put some obstacles in front of it. The way that you program these things is you don’t give it the rules of the game, you don’t tell it how to scale those obstacles or anything, but you give it this goal and then you give it incentives and penalties for things like dying or falling, okay.
09:14 - Another example you see is video gaming. There’s this one game called Space Invaders that’s super old, where there are aliens that shoot things down and there’s this tank at the bottom that’s trying to shoot them, and there are little blocks you can hide behind. In this one example, if you train this little tank to try and get rid of the aliens, in the beginning it dies really fast; it doesn’t know what it’s doing, it doesn’t know what the rules are, but it knows it gets a penalty if it dies. So it gets shot right away, it dies, and it’s like, oh, that was a penalty, I’m gonna try again, and so you leave that model running overnight and by the morning it’s got the whole game figured out, to the point where it shoots the last alien before the alien gets to the spot where it’s supposed to be to get shot. It’s so good at learning that, it’s kind of amazing. So that’s one example of reinforcement learning, and it’s a little bit weird because of the way it’s programmed, but it sits underneath machine learning. So, here I just grabbed a couple of my favorite examples of some ecological use cases of machine learning. Mike Walsh did this nice study where he’s trying to understand the climate suitability for the anthrax pathogen, and I’m not really sure whether he used random forests in this particular model, but that’s a very recent one that’s come out.
10:28 - And there are some older papers: Ana Davidson did this great study where she tried to predict the IUCN threat status of the world’s mammal species using a bunch of their features, and the analysis suggested some groups of traits that are really important, sometimes in really intuitive ways and other times in counterintuitive ways, that suggest multiple pathways to extinction and what we should do about that from a conservation perspective. And then my collaborators JP Schmidt and John Drake did this great series of studies, I think there’s another paper that follows up to this one, but they’re basically asking this question of why some plant genera are more invasive than others, and they do that by using features from a whole huge database of plants. I think they found things like polyploidy and the ability to adapt to certain environments, traits that are very suggestive of when some genera are able to invade better than others. I thought that was a really ingenious way to use that data set, and they followed it up with another bioeconomic forecasting model that suggests when it’s most cost-effective to try and eradicate something that has a certain suite of features. So those are the sort of non-diseasy ones, and then there’s a bunch of things that I do that try to predict when an organism is likely to be a problem for disease.
11:41 - But really quickly I wanted to talk about this, I don’t know, I wouldn’t call it a dichotomy, but sometimes I feel like machine learning and forecasting get a little bit confused as well. If you talk to ecologists, I feel like when we talk about forecasting, we’re talking about making some prediction in the future. So it’s an inherently time-series-based word: when we forecast something, it’s something in the future, right. But when you talk to computer scientists they don’t always mean that, so in computer science forecasting is sometimes used synonymously with machine learning, and really we just mean prediction. So if you are predicting some state of something that you don’t know the answer to yet, even if it’s not in the future, that could also be counted as forecasting. But in ecology I think you’re fundamentally identifying some underlying process model, whereas machine learning does not do that; there’s no process involved, there are no hypotheses that you’re testing.
12:34 - You start from data, you use the data, you end up with predictions based on the data, and the process is not there; you’re not specifying it at all. I think that Mike is going to talk about data assimilation methods later, but basically what data assimilation tries to do, at a high level, is use incoming data streams to update the states of a model that you have specified, and then you sort of update the predictions and the forecasts as you go using the data that you take in. You’re sort of ingesting it along the way. An analogy to that in machine learning might be online learning versus batch or offline learning. Offline learning is where you take a data set, you split it into training and test sets, you make your predictions on the test data, and you’re done. When you take in a new piece of information that’s not in that original set of data, and you wanna update your whole modeling process using that information and then come up with a slightly modified model that can make predictions on everything else, that’s online learning: we are taking in new information and updating our model. That might be an analogy to data assimilation, though they’re pretty different.
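As a toy illustration of batch versus online learning, here is a sketch in R on made-up data: the batch fit sees everything at once, while the online version updates a simple linear model one observation at a time with a stochastic-gradient step. This is just to show the updating idea, not how any of the models discussed in this talk were fit.

```r
set.seed(1)
x <- rnorm(500)
y <- 2 * x + rnorm(500)              # made-up data "stream"

# Batch ("offline") learning: fit once on all of the data in hand
batch_fit <- coef(lm(y ~ x))

# Online learning: update the coefficients one observation at a time
w  <- c(0, 0)                        # intercept, slope
lr <- 0.01                           # learning rate
for (i in seq_along(x)) {
  err <- y[i] - (w[1] + w[2] * x[i])
  w   <- w + lr * err * c(1, x[i])   # gradient step on squared error
}
rbind(batch = batch_fit, online = w) # the two estimates end up close
```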
13:37 - I’m gonna go into a series of examples from disease ecology to kind of illustrate the utility of this for some of the work that I do, and also to walk you through a particular algorithm; I’ll talk about boosted regression trees. The general questions of interest that I try to work on are: what organisms are most likely to give rise to zoonotic diseases? Zoonotic diseases are just diseases that originate in animals and are transmissible to humans and can cause problems. And then, more ecologically speaking, I’m really interested in what makes that fraction of organisms unique. So what is it about certain vectors that allows them to carry a lot of parasites that are infectious to humans? What makes some pathogens infect humans, versus others, when the vast majority of them don’t cause any problems in humans? And then, for hosts, what makes some hosts really good reservoirs of disease, while others don’t seem to be problems at all? The first time that we did this, we used rodent data, because rodents sort of have a bad reputation and we know that they carry lots of diseases, there are lots of them, and they’re fairly well studied. So we started off by just mapping all of the ranges to get a sense of where they are: there are 244 reservoir species that we know of, and they carry between one and 11 zoonoses. So right away you can see that you can cast this as a regression problem, because you have a count, so my label is the number of zoonotic diseases, or you can cast it as a classification problem and say I just want to be able to separate reservoirs from non-reservoirs.
14:59 - You can do it either way, but there are 2277 total rows of data, so this is probably (laughs) the highest number of labels I’ve ever had to work with, so it was a good thing I picked rodents to begin with. It was somewhat of a less challenging problem than the other groups that I’ll talk about, but we used intrinsic features about the rodents, things that describe their biology, things that describe their ecology and their life history. I didn’t list all of them here; I think we had about 80 variables at the end, but they loosely captured these categories of features. One of the hallmarks, I guess, of machine learning is that you’re kind of implying that there’s lots of data to work with, and I think when we hear about business examples, or sort of industry-standard examples in computer science, you’re talking about hundreds of thousands of rows or millions of rows of data, and that’s not the case here obviously; there are just 2000 rows, and I’ve gotten this to work with less than a thousand rows of data, a couple hundred rows of data, before too. But you have to be a little bit careful about the dimensions of the data set: if you have a data set that’s really short but really, really wide, you get the curse of dimensionality, do you guys know what that is? You just don’t end up having enough examples to capture most of the combinations of those features, so your algorithm just can’t learn, or it’ll come up with some results that are just not accurate. Here we had a pretty nice size, nice dimensions, a good number of rows and a reasonable number of features.
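In code, the two framings differ only in how the label column is defined and which error distribution the boosted-tree model is given; a minimal sketch with hypothetical column names (`n_zoonoses` for the count, `reservoir` for the yes/no label):

```r
# rodents: hypothetical table, one row per species, a count column n_zoonoses,
# and ~80 trait columns; both framings use the same traits as predictors
rodents$reservoir <- as.integer(rodents$n_zoonoses > 0)   # 0/1 label

# Classification framing (separate reservoirs from non-reservoirs):
#   gbm(..., distribution = "bernoulli") on the 0/1 reservoir label
# Regression framing (predict how many zoonoses a species carries):
#   gbm(..., distribution = "poisson") on the n_zoonoses count
```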
16:22 - Most of them ended up not being super important, but we collected them from lots of sources, which is another part of machine learning that I guess I didn’t anticipate, I don’t know why: it’s like 85% data cleaning (laughs). So, lots of tedious stuff like, why is this value not falling in range, primates never get that big, why is this number off? And they don’t live up there in Antarctica, that number is off too. Just fixing those kinds of little things in the data set, which is more common than you think it might be. So we collected a bunch of data from multiple sources, and of course, as with ecological data, we run into common issues all the time, right: there are hidden interactions between these things that you just can’t know in advance, you wouldn’t have even imagined to test for them in advance. There’s obviously collinearity among a lot of variables. There are weird outliers that are outliers because they’re biologically meaningful, not because they’re just inconvenient and you want to toss them out. There are missing values, so of course, as in any data set, you have lots of data for some species and not a lot of data for other species, and some features are non-randomly missing, which is typically a huge problem. And then you have diverse data types: you can classify some things into categories, other things have really precise numbers with variances attached to them. So the data types are, I think, both a strength and a difficulty with ecological data.
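The kind of cleaning pass she is describing is mundane but takes most of the time; a hedged sketch in R of what those checks might look like, with entirely hypothetical column names:

```r
# Spot impossible values (e.g. a primate-sized body mass in a rodent table)
summary(rodents$adult_body_mass_g)
rodents$adult_body_mass_g[rodents$adult_body_mass_g <= 0] <- NA

# How non-randomly missing are the traits? Fraction of NAs per column
sort(colSums(is.na(rodents)) / nrow(rodents), decreasing = TRUE)

# Flag strongly collinear trait pairs
num <- rodents[sapply(rodents, is.numeric)]
cm  <- cor(num, use = "pairwise.complete.obs")
which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
```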
17:38 - So there are some early solutions to dealing with some of these issues, right: classification and regression trees, you guys have all seen pictures of the classic regression tree, and then neural networks are another option that has come out of the machine learning literature from the computer science world. But these things are not without their shortcomings. The splits are really sensitive to small perturbations in the data when you start building the tree: if you change the data slightly early on, you can get a totally different tree, and the prediction accuracy there is obviously really dependent on the quality and the starting point of your data. And neural networks work great, but they have a tendency to overfit, and so the model is not generalizable to new records. So there are some shortfalls there, and the newer methods tend to overcome these outstanding issues, and the one that I’ve been working with most often is boosted regression trees, which is just a combination of two algorithms that I’m going to walk through.
18:29 - So there’s a magazine, IEEE Spectrum; it’s an engineering magazine that I had actually never heard of, but the work that I do is really weird and it uses tools that are sort of off-the-shelf and easy for a computer scientist to understand, so they thought it was cool and they wanted to do a feature, a general-audience type of piece. So I worked with an artist to try and describe what this would look like using the traits that we use, and I don’t know if you guys can read this in the back, but they’re things like body mass, geographic range area, sexual maturity; these are really intrinsic properties of species, so all these data are at the species level. This top tree here is just a classic classification tree. Each of these dots is a species and we’re trying to classify them into two groups, reservoirs and non-reservoirs, so you have nos and you have yeses, and the idea is that if you have a perfect tree then all of the nos and yeses will segregate, right, and you’ll get all the answers perfectly right. But of course, as you can see from all the open circles down here, it didn’t do great: it classified the species correctly only 64% of the time. The way it achieves this is by selecting a variable and splitting on it into two groups, so that each of the two groups is as homogeneous as possible and their values are as far apart from each other as possible, so sort of maximizing the distance between groups and the homogeneity within groups, and then you keep doing that, you just keep splitting until some stopping point, and that stopping point is a criterion that you tune the model for. Okay, so that’s one of the hyperparameters. So it split on these variables, it got some answers, and then the model says, okay, how did I do, that’s how it gets this classification accuracy, and it says, okay, I did awfully, okay.
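A single classification tree like the one she is describing can be grown in R with rpart. This is a minimal sketch on the hypothetical `rodents` table from above, not the analysis from the paper; the zoonosis count is dropped so the tree only sees traits.

```r
library(rpart)

dat <- subset(rodents, select = -n_zoonoses)         # traits plus the 0/1 label
dat$reservoir <- factor(dat$reservoir, labels = c("no", "yes"))

tree <- rpart(reservoir ~ ., data = dat, method = "class",
              control = rpart.control(minsplit = 20, cp = 0.01))  # stopping rule

printcp(tree)                           # where and why the splitting stopped
pred <- predict(tree, type = "class")   # "no" / "yes" for each species
mean(pred == dat$reservoir)             # raw accuracy of this single, weak tree
```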
20:08 - So it’s got its predictions and it got a lot of them wrong, and now the boosting algorithm assigns weights. It says, some of these I just tanked, I did terribly, and some of these I did great; I’m gonna assign my weights accordingly and I’m gonna build a new tree based on those weights. Okay, so what’s effectively happening is the boosting algorithm is now chain-linking this first tree and its results to the next tree, which it’s going to use to maximize classification accuracy on the residuals, on the weighted errors. So this second tree is based on this first tree, and it splits on terrestrial or burrowing, whether it’s a terrestrial or a burrowing animal, and then it continues to split, it gets some answers, and the combination of the first and the second tree is at 71% accuracy, so it’s doing a little better, it’s climbing up. The third tree does it again, this time splitting by carnivore or herbivore; some of these features are repeated, so geographic range is here and it’s also here, there’s nothing to say it can’t select the same thing again, and this time it’s at 79% accuracy. So you can imagine that even if you start off with a terribly weak classifier in the beginning, maybe just a little bit better than 50%, by the time you’ve added 3,000 or 5,000 or 50,000 trees, this ensemble of weak predictive models comes together into a really powerful, accurate ensemble classifier. So now you guys know how to do boosted regression, easy, right. The way that we oftentimes measure prediction accuracy is based on this receiver operating characteristic (ROC) curve, which is just the true positive rate plotted against the false positive rate. A bad model would predict no better than a coin toss, and that’s what you get if there’s no signal in the data, nothing relating the predictor variables to the label; and if you get a prediction with 100% accuracy, especially if you’re working with ecological data, there’s definitely something wrong, it’s probably overfit, life is not that easy. So you want something in between those two, and this is how our model did for the rodent data.
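Here is what that workflow might look like with the gbm package she mentions later: fit the boosted ensemble on a training split, check the ROC AUC on a held-out test split, then flag species currently labeled negative that the model nonetheless scores highly. This is a hedged sketch on the hypothetical `rodents` table; the hyperparameter values are placeholders, not the settings from the published analysis.

```r
library(gbm)
library(pROC)

dat <- subset(rodents, select = -n_zoonoses)      # traits plus the 0/1 reservoir label

set.seed(42)
idx   <- sample(nrow(dat), size = round(0.8 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

fit <- gbm(reservoir ~ ., data = train,
           distribution = "bernoulli",            # classification framing
           n.trees = 5000,                        # many weak trees
           interaction.depth = 3,                 # shallow trees
           shrinkage = 0.01,                      # learning rate
           bag.fraction = 0.75,
           cv.folds = 10)
best <- gbm.perf(fit, method = "cv")              # cross-validated tree count

# ROC AUC on the held-out test split
p_test <- predict(fit, test, n.trees = best, type = "response")
auc(roc(test$reservoir, p_test))

# Species labeled 0 but predicted with high probability to be reservoirs
p_all <- predict(fit, dat, n.trees = best, type = "response")
rownames(dat)[dat$reservoir == 0 & p_all > 0.7]   # candidate undiscovered reservoirs
```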
22:13 - So, how well can we train the algorithms to predict rodent reservoirs using trait data? It performed with 90% accuracy on the test data, so, you guys understand this concept of training and test: you’re splitting the data up, training the model on the training data and setting it loose on the test data. On the training data it got to about 95%, and on the test data it got to about 89, 90% performance, so that tells us that the model is pretty good at classifying between the reservoirs and the non-reservoirs just on the basis of their traits. So just the intrinsic features of these animals are enough information for this algorithm to do a good job classifying those two groups. And as soon as I got these answers, I was like, I wanna know which rodents look like they should carry stuff but aren’t known to carry anything at the moment, so we pulled out those species where the model says they should carry stuff and we have no information about whether they carry things. We got eight species predicted with a greater than 70% probability to be novel reservoirs, so undiscovered reservoirs of zoonotic diseases, and this map is so ugly, I’m sorry; the redder areas just mean there are more species ranges overlapping each other, but the hatched regions that you see here are two particular species that were confirmed to be new reservoirs for zoonotic diseases between the time that we finished the analysis and I gave the first talk at ESA, and the time that I submitted the results for publication. So, I don’t know if you guys do this, where you’re like, okay, I’m done with my model, I just want to check it one more time just to make sure that I’m right.
23:49 - So then, you know, I took the results out and I was going through the species one by one just to check, to make sure that I hadn’t mislabeled stuff, and those two guys came back as positive labels. I’m like, no, I have to retrain my whole model and do the whole thing again, but it turns out that in the meantime scientists had independently observed these guys infected with zoonotic parasites in the field. I mean, it’s not in the paper, because I couldn’t say, oh, I had these model results before and then these guys confirmed it; I couldn’t say that, so I updated the model, but I just wanted to throw that story in there. So the thing that I find sort of gratifying and a little bit scary about these models is that you can generate these very clear predictions, like, this species exists here and we think that it carries something. That’s something you can go out and validate, right, or you can look into the data and say, people have sampled it, the sample sizes are awful, maybe we should jack those up a little bit and see if we get something. And then when you’re right it’s like, oh, that’s kind of gross; I think I was giving this talk in Minnesota and one of these species is very common over there, so people in Minnesota are like, oh, what’s that species again, I want to make sure I don’t pick up any rodents. So, the other thing that’s really exciting about this work, I think, is that once you’ve done the modeling, you can ask the algorithm to tell you which features were the most important for its prediction accuracy. There are multiple ways to do this, but one way is to do a permutation analysis, where you perturb the values of one variable while holding the others fixed and see how the prediction accuracy drops.
25:29 - You do that for each of the variables, and the more the prediction accuracy drops, the more important that variable was for getting the prediction accuracy high. And when you look at how the predictions depend on each of the important variables, you get these curves, called partial dependence plots. I just pulled out four to give an example, but the most important feature for prediction accuracy in our model was how many other mammal species that rodent co-occurs with across its geographic range. It turns out that rodents that have really low mammal diversity across their geographic range are more likely to carry zoonoses, and more likely to carry more zoonoses. And then there were a bunch of life history features that were really predictive. The blue bars are just a frequency histogram of the whole data set, so here is what most of the rodents look like in terms of their litter size, and this is not logged, so most rodents have between two and five pups in a litter, but species that are reservoirs tend to have more than four. So even though the majority of rodents have fewer, around here, reservoirs tend to have more than that.
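With the fitted gbm object from the earlier sketch (`fit`, `best`, `dat`), relative importance and partial dependence come more or less for free, and a bare-bones permutation check of a single variable looks like this; `litter_size` is a hypothetical column name.

```r
library(pROC)

# Relative influence of each trait in the boosted ensemble
summary(fit, n.trees = best)

# Partial dependence: predicted probability as a function of one trait
plot(fit, i.var = "litter_size", n.trees = best, type = "response")

# Permutation check: shuffle one variable, re-predict, and watch the AUC drop
p_orig <- predict(fit, dat, n.trees = best, type = "response")
shuffled <- dat
shuffled$litter_size <- sample(shuffled$litter_size)
p_perm <- predict(fit, shuffled, n.trees = best, type = "response")
auc(roc(dat$reservoir, p_orig)) - auc(roc(dat$reservoir, p_perm))  # bigger drop = more important
```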
26:39 - Reservoirs also tend to reach sexual maturity much earlier than the majority of species; sorry, these figures should be blown up a little bit so you can see the variation there. They also tend to be larger as adults, about the size of an Australian wood rat for scale, and they tend to have neonates that are slightly larger than the majority of species; most species have pretty small neonates, but reservoirs tend to have babies that are about the size of a baby red squirrel. The thing that I found really satisfying about this result is that there are lots of scientists who have studied rodents, and their immunity, and what their immunity looks like when they have a fast life history strategy versus a slow life history strategy, and where the fast life history strategy rodents tend to live, and so there is this rich, empirically driven body of understanding about rodents in general. And so when we got this trait profile back out of the machine learning algorithm, it was clear that it was giving us the signal of something that is fast-living: things that have larger litters, they reach sexual maturity early. They tend to live in seasonally dynamic habitats, and so it sort of corroborated what all of these one-off studies have shown independently for different systems.
27:47 - It kinda gives you this intuition about how the system is working just on the basis of data, right. So you’re not out there testing a particular hypothesis about fast life histories, which you can do, but it might take us a lot longer to get there as a discipline, because you’re paying attention to the lay of the land, who’s finding what, getting a pulse on things, and then going to test hypotheses. This allows you to start from data first, identify some hypotheses that might be worth testing, and then move forward with those. Sometimes, though, when you do this analysis, it performs really well in terms of prediction, you can get really, really high prediction accuracy, but the ecological signal is not that clear, and one example of that was this analysis we did for filovirus-positive bat species. Filoviruses are things like Ebola virus and Marburg virus, hemorrhagic fevers. We did this analysis right after the West Africa outbreak of Ebola started to get kind of out of control, and so we thought, well, nobody knows what the reservoir is, we’ll just start with that and see what we get. The data were really hard to work with because there were only 21 filovirus-positive labels, and, I don’t know if you guys work with disease data, but there’s something called seropositivity, which basically means you’ve been exposed: your blood looks like it’s been exposed and there are some antibody signatures, but it doesn’t mean that you’re actually infected. It just means that you might be infected or you might not; you might just have been in the vicinity of something, the Ebola was there and so you got exposed to it.
29:08 - So anyway, we scraped together all of those data, and there are 21 labels out of 1116 total bat species globally, so that’s not a high percentage of labels. This model was considerably harder to train. This is just a bunch of distribution information; we got a prediction of where new carriers should be, and there are a couple of things I want to point out here. We did the analysis globally; all of the filovirus data have only ever come out of this one region of the world, so the fact that there are bats popping up in the Western Hemisphere, everyone was like, huh, what does that mean? Does that mean we have Ebola? That’s not what it means; we probably don’t have Ebola, and actually the people that do the testing tell me that we’ve never had a positive sample. But the bat virologists say that’s really interesting, because it might indicate that there’s a niche opening, and there might be something that’s similar to a filovirus filling the same sort of niche in the Western bats, which was not a hypothesis that I would have come up with, not being a bat virologist. So the point here is to work with bat people, or to work with people who are domain experts in the data, because it’s really easy to train a model on anything and then make erroneous inferences because you don’t understand what your data actually mean, or any of the biological dependencies in your data. So, we got a couple of positives that made good sense based on the fact that some crazy guy experimentally injected this bat with filoviruses, I don’t know who pays for these studies, but he got it to replicate Ebola virus in the lab, which is one of the only times that we’ve been able to get a bat to do this. But this species was not included in our dataset as a positive, because it was never confirmed in the wild to be positive for Ebola viruses, so we said, well, we’re just going to be conservative and call you negative. But when we got this prediction and we checked the literature, it’s true that it can replicate Ebola virus in its bloodstream. And this species is actually really closely related to Rousettus, a fruit bat that is a known reservoir of Marburg; it’s one of the only species that we know for sure replicates and sheds virus into the environment, and it has been responsible for human outbreaks of Marburg.
31:13 - So this all makes sort of intuitive sense, and then the coolest/awful thing about this is when we zoomed in on Southeast Asia; if you look at this map, obviously there’s a hot spot in Southeast Asia. We zoom in there, and okay, well, there are 25-plus species of bats that are overlapping, and they’re all within the 90th percentile of probability of being positive for Ebola virus. What does that mean for us? Are there filoviruses in Asia? Why are we not having outbreaks? We know that the filoviruses are there, we know that they circulate in wildlife, we know that there’s high biodiversity in Southeast Asia, but we have no human outbreaks, and the question is, is there something biological that’s keeping the outbreak from happening? Are there outbreaks that we’re not seeing? It raises a whole bunch of surveillance questions. But since this study was published, there have been new filoviruses detected in this region, and also there was a new filovirus found in a new reservoir, and this guy was number five on our list; if we rank-ordered everything, this is the one that had the fifth highest probability of being positive for Ebola. That paper came out maybe six months after our paper had been published, so obviously I didn’t know that these guys were working on it. Filoviruses are a really hot topic right now, but you know, it’s one of those things where you’re like, yes! no, that’s awful.
32:32 - (audience laughs) I don’t know whether to cheer or not cheer. So, we have this intuition now that filoviruses tend to hide out in really mega-diverse areas, and that bats that are more likely to be filovirus-positive have ranges with extraordinarily high mammal biodiversity, which is totally opposite to what the rodents showed, right: rodent reservoirs have depauperate mammalian diversity and filovirus-positive bats have really high biodiversity. They also tend to occur in really large population groups; I don’t know if you can see this picture, this dark area, they’re all bats, roosting together in a single cave, and these guys are in protective gear with respirators and everything because it’s a huge health hazard. And then the other really cool output was that filovirus bats tend to have high production, which is this sort of size-corrected measure of reproductive output, a fitness measure, so how much biomass do you produce per unit body mass is what production means. And here is an example of a mother, I think this is some species of fruit bat, and this is its offspring, just to give you a sense for how big this ratio can be. So, I mean, we can hand-wave about what we think these variables mean, but the honest answer is, I don’t know; that doesn’t really indicate a clear ecological story to me.
33:50 - So we tried to do the same type of analysis with Zika virus in primates, right after the Zika virus outbreak in Brazil, South America and the Americas. We tried to figure out, well, okay, the Zika virus outbreak is generally under control, we have herd immunity now, it’s probably not gonna break out to the same level as it did before, but based on all of our biological understanding about flaviviruses in general, we think there’s a high likelihood that if the Aedes aegypti mosquitoes are there and they’re carrying Zika virus, transmitting it to humans, able to pick it up from one human and transmit it to another, then they can also pick it up and transmit it to a primate, because in many other mosquito-borne flavivirus systems, primates are the wildlife reservoir. And so that was the motivation for this work: can we identify in advance the primates that are susceptible to becoming these spillback hosts, and therefore reservoirs of Zika virus in the long term? And what do we need to do in terms of our conservation management, decision making, and preventive health care, in order to make sure that these repeated spillovers from primates back to humans do not happen in the future?
34:52 - These data were really hard to work with; I think the rodents were the easiest data I ever had to work with, and everything after that was totally downhill. We had traits for 285 primate species, so way fewer data than we were used to working with, and we had almost no labels for Zika virus. We had like two species that have ever tested positive for Zika virus, and there are four other mosquito-borne flaviviruses for which many other primate species are known to test positive for one or more of them. The flaviviruses that are mosquito-borne are few, there are like five of them, and so we implemented this Bayesian multitask learning approach where we wanted to borrow information: if you are positive for one mosquito-borne flavivirus, and you are living in an area where we know there’s another mosquito-borne flavivirus, then we want to borrow information from those species to help us do a better job of predicting for the data that we don’t have labels for. Okay, so it’s an inherently Bayesian way of thinking about the learning problem. And also, this Bayesian multitask learning requires that you have a complete data set, and for primates we obviously do not have complete data on all of our features.
36:06 - So we did this multiple imputation by chained equations approach, which I still don’t completely understand, but I did this work in collaboration with a bunch of scientists from IBM Watson, which is close to Cary, in Yorktown Heights, New York, and so they were sort of the magic behind the multiple imputation stuff. So I’m just gonna skip to the predictions here: we made a bunch of predictions about which species are likely to be Zika virus positive, and the takeaway is that it’s not great news, because these are all super common species and they have really high contact with humans, so that was more bad news. I’m like sunshine and rainbows when I give these talks, all bad news. The out-of-sample validation procedure that I wanted to mention really quickly is one thing that you can do to try and figure out how badly your model is doing, or whether your data are up to the task of prediction. In this example, what we did was we took each of the reservoirs that we had a positive label for and we relabeled it to be zero, so we changed it from a one to a zero, we retrained the model, and we said, okay, we wanna know what you would predict this species to be if you didn’t have prior information about it. And it turns out that the model is not able to compensate for really grossly missing data; these species are these guys right here, the tenfold CV scores were really low, which means that the model is not doing a good job of assigning accurate labels to these guys, and that’s directly correlated with how much imputation had to be done on that species, because there’s so much missing data for some of them. The analogy that I like to use is: some of these guys, like the black howler and the ring-tailed lemur, are not conservation concerns, they’re not rare, they’re super easy to catch; it’s like not knowing the basic life history of the robin. You see it everywhere, it’s not hard to catch, it’s not hard to study, but you don’t know how many eggs it lays, and you don’t know what it eats, and you don’t know when it’s active.
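Two pieces of that workflow are easy to sketch in R, though this is emphatically not the Bayesian multitask pipeline built with the IBM collaborators: chained-equation imputation with the mice package, and a crude leave-one-positive-out check using the boosted-tree approach from earlier. All object and column names (`primates`, `positive`) are hypothetical.

```r
library(mice)
library(gbm)

# Multiple imputation by chained equations on a trait table with missing values
imp    <- mice(primates, m = 5, method = "pmm", seed = 7)
filled <- complete(imp, action = 1)          # one completed data set

# Leave-one-positive-out: flip each known positive to 0, refit, and ask whether
# the model still gives that species a high predicted probability
for (i in which(filled$positive == 1)) {
  tmp <- filled
  tmp$positive[i] <- 0
  refit <- gbm(positive ~ ., data = tmp, distribution = "bernoulli",
               n.trees = 2000, interaction.depth = 3, shrinkage = 0.01,
               cv.folds = 10)
  nt <- gbm.perf(refit, method = "cv", plot.it = FALSE)
  cat(rownames(filled)[i], ":",
      predict(refit, filled[i, ], n.trees = nt, type = "response"), "\n")
}
```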
38:04 - I mean, it’s that level of biological ignorance about some of these guys, and so the point that I wanna make here is that data are really powerful and you can get a lot out of them, but you can’t make them up; you can’t make up for missing data, right. You can’t get something from nothing, and so that was one of the lessons that was really clear from that study: machine learning is really powerful, but it just won’t make up for missing data and the lack of basic research. I think in general for machine learning, and especially for deep learning, the language of choice is Python. I don’t code in Python, so that’s unfortunate for me, but I’m learning Python, and there are analogies to all of these packages in R, and if you are interested in a particular package, I can look it up for you or I can tell you what I’ve used in the past. For boosted regressions I generally use gbm, which is by Greg Ridgeway; there’s also another package that’s more friendly for spatial analysis and species distribution modeling called dismo, d-i-s-m-o, by Jane Elith; and then there’s the caret package, c-a-r-e-t. caret has random forests, GBM, a bunch of SVMs, a whole bunch of different algorithms under the hood. The reason I don’t use caret is that I find it really hard to fiddle with things under the hood or figure out why it did certain things: it makes decisions differently than dismo does, and dismo does things slightly differently than gbm does. gbm I find is the easiest to take everything apart, fix things, and know exactly what’s happening. But this is a nice little cheat sheet: if you have really sparse data, it’s just not gonna work, just get more data.
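For reference, the same boosted-tree model can be run through the two other interfaces she names; a rough sketch reusing the hypothetical rodent table from earlier (the tuning values are placeholders, not settings from any published analysis).

```r
library(dismo)   # wraps gbm; gbm.step picks the number of trees by cross-validation
library(caret)   # one generic interface over many algorithms

dat <- subset(rodents, select = -n_zoonoses)

# dismo: give predictor columns (gbm.x) and the label column (gbm.y) by position
brt <- gbm.step(data = dat,
                gbm.x = which(names(dat) != "reservoir"),
                gbm.y = which(names(dat) == "reservoir"),
                family = "bernoulli", tree.complexity = 3,
                learning.rate = 0.005, bag.fraction = 0.75)

# caret: the same model family behind a generic train() call; needs a factor outcome
dat$reservoir <- factor(dat$reservoir, labels = c("no", "yes"))
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE)
fit_caret <- train(reservoir ~ ., data = dat, method = "gbm",
                   trControl = ctrl, verbose = FALSE)
```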
39:39 - If you’re trying to predict a category and you have labeled data, then you might want to do some of these things to try and predict those labels; if you don’t have labeled data, you might want to do some clustering analysis to figure out how your data separate from each other. If you’re not predicting a category and you don’t have a quantity, you’re just looking around, then you might wanna do something over here, and if you are predicting a quantity, here are some other things that you can do. So, another rule of thumb that I’ve heard, and have not validated myself, is that sometimes for wider data sets, like if your data matrix is wider than it is long, random forests are supposed to perform better. I don’t actually know why; I actually don’t think computer scientists know why, and that’s part of why they use the term black box, right. It’s black boxy because you’re not specifying a process, and that makes sense, but it’s also black boxy because sometimes we can’t figure out why these methods work so well: we know exactly what they’re doing, but the level of prediction accuracy is kind of staggering, and that’s the black box nature of machine learning.
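If you wanted to try that rule of thumb, a random forest is nearly a one-liner in R with the randomForest package; again a sketch on the hypothetical rodent table, not a claim about how wide data should actually be handled.

```r
library(randomForest)

dat <- subset(rodents, select = -n_zoonoses)
dat$reservoir <- factor(dat$reservoir, labels = c("no", "yes"))

rf <- randomForest(reservoir ~ ., data = dat, ntree = 1000,
                   importance = TRUE, na.action = na.roughfix)

rf$err.rate[nrow(rf$err.rate), "OOB"]   # out-of-bag error, a built-in accuracy check
varImpPlot(rf)                          # which traits the forest leaned on
```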
40:37 - That’s what computer scientists refer to when they call something a black box. Thanks for having me. (audience applauds)