Taxonomic assignment

Mar 30, 2021 19:00 · 4168 words · 20 minute read

Hi, I’m Ben Kaehler. I’m going to talk to you about taxonomic classification. I will cover what taxonomic classification is, how to do it in QIIME 2, and a few ways that you might make your taxonomic classification more accurate.

00:22 - So - what is taxonomic classification? We will assume that, at this point in the pipeline, you have obtained a set of denoised sequences. These sequences are probably the output of DADA2 or Deblur, so you might be calling them ASVs or sOTUs; I’ll just call them sequences. You can use those raw sequences if you’re interested in getting quantitative about the similarities and differences between samples, but if you want to actually figure out which critters are present in your sample, then at some point you’re going to have to figure out where these genetic sequences came from, and that’s what taxonomic classification is.

It looks at each one of these sequences, and it tries to figure out which of a range of species, depending on your application, that sequence came from. The range of species that you might be comparing your sequences against depends on your reference database. Several good reference databases exist and more are being created all the time. Two classics are SILVA and Greengenes, and they look like the reference databases here. One of the two database files looks a lot like your sequences: it’s just a FASTA file, imported as a FeatureData[Sequence] QIIME 2 artifact, and it contains the reference sequences and a label for each reference sequence.

And then there’s another file, which is a FeatureData[Taxonomy] artifact, and for each of the sequence labels from your sequence file it holds taxonomic information. So in this example here, you’ve got four reference sequences, and each one has a string that categorizes it in terms of its taxonomy. Usually they start with the kingdom or domain, then the phylum, and so on down through usually seven levels, including genus and species at the bottom.

So in a perfect world, we could just take the sequences from the FeatureData[Sequence] artifact that we’ve obtained from DADA2 or Deblur, and search for each sequence inside our set of reference sequences. When we find a match we could say: okay, that matches this reference sequence with this label, so we take that label, look it up in the taxonomy file, and that tells us which species that particular sequence came from. Sadly, in the real world, it’s more complicated than that, because we have natural variation and we have noisy sequencing.
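
To make that perfect-world picture concrete, here’s a minimal sketch in Python. The sequences and labels are invented for illustration; real reference databases hold hundreds of thousands of entries, but the lookup logic is the same.

```python
# Toy reference database: labels -> sequences (as in the FASTA file) and
# labels -> taxonomy strings (as in the taxonomy file).
ref_seqs = {
    "ref1": "GACGAAGTTAGCC",
    "ref2": "TTCGGACCATGGA",
}
ref_taxa = {
    "ref1": "k__Bacteria; p__Proteobacteria; c__Gammaproteobacteria",
    "ref2": "k__Bacteria; p__Firmicutes; c__Bacilli",
}

def classify_exact(query):
    """Return the taxonomy of an exact reference match, or None."""
    for label, seq in ref_seqs.items():
        if seq == query:
            return ref_taxa[label]
    return None

print(classify_exact("TTCGGACCATGGA"))  # exact hit -> the Firmicutes lineage
print(classify_exact("TTCGGACCATGGT"))  # one mismatched base -> None
```

The second call is the real-world problem in miniature: a single noisy base, and the naive lookup returns nothing.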

So it’s fairly rare that you’ll get an exact match between your sequences and the sequences in your reference database. Not to mention the fact that the particular critter that was in your sample might not be in your reference database at all. So we try to come up with the nearest match, and this is what taxonomic classification is all about. Say, for instance, we take a sequence from our experiment - it’s called feature five in this example - and our taxonomic classifier compares it with a range of sequences in our reference database. It might say: well, actually it doesn’t match any of them spectacularly well, but it matches most of the bacterial sequences really well, so we’ll call it a bacterium.

And feature 5 here matches all of the Proteobacteria quite well, and better than the other phyla, so we’ll call it a proteobacterium, but more than that I’m not really confident to say. Depending on how good a match the taxonomic classifier gets, it will classify each of the sequences from your experiment down to a particular level. So for instance, for feature two there, it’s more confident and it’s gone down four levels - four ranks in your taxonomy - whereas for features one, four, and five it’s only managed two, and for feature three it’s only managed three.

Ok, so, how might we go about matching the sequences from our experiment with the sequences in our reference database? Well, we need an algorithm. Because you’re watching this video, you’re probably a biologist or a bioinformatician, so you’ve probably heard of BLAST and you’re probably reasonably familiar with the concept of sequence alignment. So one way to do this, which might make intuitive sense to you, is to take each of your query sequences - the sequences from your experiment - and align them or BLAST them against your reference database.

That’s certainly one way to do taxonomic classification - I’ll talk about that a little bit more in a moment. But more commonly, we use what’s known as a machine learning classifier. So a machine learning classifier, in this case, will take your sequence and it will output a classification - and by classification, I mean a taxonomy for that particular sequence. So at the user level, we don’t really care too much about what the algorithm is inside of that machine learning classifier, so long as it can take a sequence and give us a taxonomy string out the other end.

But I’ll tell you about some of the common features of how this problem is approached. A machine learning classifier takes a set of features and it outputs a classification. So for a sequence of DNA, you might ask: how do we get out of it the features that we give to our machine learning classifier? Well, we start by writing out our sequence, up the top there in bold, and then we take the first seven nucleotides, g-a-c-g-a-a-g, and that’s called a 7-mer.

And then we move across one position and take a second 7-mer, then a third 7-mer, and a fourth 7-mer. They’re called 7-mers because they’re seven nucleotides long; in general, they’re called k-mers, because they’re k nucleotides long. So the way we extract features from our sequence is to look at which specific k-mers are present in the sequence we’re interested in, and the presence or absence of those k-mers are the features that you feed to your machine learning classifier.
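
The k-mer extraction described above is just a sliding window. A toy sketch (the sequence is invented):

```python
def kmer_set(sequence, k=7):
    """Slide a window of width k along the sequence, collecting the distinct k-mers."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

print(sorted(kmer_set("GACGAAGTT")))
# ['ACGAAGT', 'CGAAGTT', 'GACGAAG']
```

That set - the presence or absence of each possible k-mer - is the "bag of k-mers" the classifier sees.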

As a user, you don’t have to worry about that too much; it all takes place inside the QIIME 2 feature-classifier plugin. You might think of it as: we take a bag of k-mers, we feed it into our machine learning classifier, and it spits out a classification like it has done here. I think there are six levels down there, so in the way that we usually set it up, that means the machine learning classifier has taken this bag of k-mers and, with some level of confidence, it can say that it’s actually the genus Geobacter in this particular taxonomy, but it wasn’t confident enough to call the species.

So the next question is: well, where does the machine learning classifier come from? When we create a machine learning classifier, we say that we train it. And we train a machine learning classifier by showing it examples where we know the answer. Of course, the examples where we know the answer are our reference database. So we take our long list of sequences from the reference database, and we take all of the taxonomic strings that are associated with those sequences, and we tell the machine learning classifier: if you see a sequence like this one, you should be outputting a taxonomic string like this one.
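
That training step can be sketched in a few lines: count which k-mers occur under each taxonomy in the reference, then score a query against those counts. This toy version, with invented sequences and crude Laplace smoothing, is only meant to show the shape of the idea - QIIME 2’s actual classifier is built on scikit-learn.

```python
import math
from collections import defaultdict

# Toy reference database; real training uses many thousands of sequences.
ref_seqs = {"ref1": "GACGAAGTTAGCCA", "ref2": "TTTTCCCCGGGGAAAA"}
ref_taxa = {"ref1": "k__Bacteria; p__Proteobacteria",
            "ref2": "k__Bacteria; p__Firmicutes"}

def train(ref_seqs, ref_taxa, k=7):
    """Count, per taxon, how often each k-mer occurs in its reference sequences."""
    kmer_counts = defaultdict(lambda: defaultdict(int))
    class_counts = defaultdict(int)
    for label, seq in ref_seqs.items():
        taxon = ref_taxa[label]
        class_counts[taxon] += 1
        for i in range(len(seq) - k + 1):
            kmer_counts[taxon][seq[i:i + k]] += 1
    return kmer_counts, class_counts

def classify_nb(query, kmer_counts, class_counts, k=7):
    """Score each taxon: log prior plus smoothed log-likelihood of each query k-mer."""
    kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
    total = sum(class_counts.values())
    def score(taxon):
        n = sum(kmer_counts[taxon].values())
        s = math.log(class_counts[taxon] / total)
        for kmer in kmers:
            # Laplace smoothing so unseen k-mers don't zero out the likelihood.
            s += math.log((kmer_counts[taxon][kmer] + 1) / (n + 4 ** k))
        return s
    return max(class_counts, key=score)

kmer_counts, class_counts = train(ref_seqs, ref_taxa)
print(classify_nb("GACGAAGTTAGCC", kmer_counts, class_counts))
# k__Bacteria; p__Proteobacteria
```

Here the query shares all of its k-mers with ref1 and none with ref2, so the Proteobacteria lineage wins the score.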

And it turns out there are certain algorithms that can learn reasonably well to go from a bag of k-mers to a taxonomic string. The most commonly used one is called Naive Bayes. Don’t be tricked by the title: it’s not really from the Bayesian school of thought in statistics, and it’s not really from the frequentist school of thought either. It’s a machine learning algorithm. Okay. So, once we’ve got our taxonomic classifications, we can proceed on downstream in our pipeline and we can come up with plots like the one here on the right.

This is a bar plot that shows the relative frequencies across a bunch of different samples that are grouped by sample type, and in this particular case we’ve asked it to display the frequencies down to the phylum level.

10:34 - But because it’s a QIIME 2 visualization - not in this presentation, but if you go on to look at these in the tutorial - you’ll see that this is actually an interactive plot, and you can select the taxonomic level down to which you want the plot to be rendered. You can sort it by different criteria, you can filter, and you can do a bunch of other stuff: you can select particular taxa and it will show you what they’re doing specifically across the plot, which is really kind of nice.

And once you’ve got it looking how you want it to look, it will output your plot in a publication-quality file that you can then go and include in your publication.

11:21 - So like I said, the most common method is to use a machine learning classifier, but you can actually use BLAST, or a program that has similar functions called VSEARCH, and that will let you compare the sequences that you got from your experiment with the reference sequences. It does a thing called consensus classification. For each query sequence, it will look at the BLAST hits, for instance, that it gets in the reference database, and it will go down that list and say, in this example here: all of the hits were Bacteria, so we’ll call it a bacterium.

In fact, all the hits were Proteobacteria, so we’ll call it a proteobacterium. And so on all the way down, until we say: okay, all of the hits were in the family Legionellaceae, so we’ll call it that family, and indeed all were in the genus Legionella, so we’ll call it that genus, but they weren’t all the same species, so we won’t say what species it is.
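
The consensus walk-down can be sketched like this. The hits, the rank strings, and the 0.51 agreement threshold are all invented for illustration - the real consensus classifiers have their own parameters and defaults.

```python
def consensus_taxonomy(hits, min_fraction=0.51):
    """Descend the ranks, keeping a rank only while at least min_fraction of
    the hits agree on the same name at that rank."""
    split = [h.split("; ") for h in hits]
    depth = min(len(s) for s in split)
    consensus = []
    for rank in range(depth):
        names = [s[rank] for s in split]
        top = max(set(names), key=names.count)
        if names.count(top) / len(names) >= min_fraction:
            consensus.append(top)
        else:
            break  # agreement lost - stop assigning deeper ranks
    return "; ".join(consensus)

hits = [
    "k__Bacteria; p__Proteobacteria; g__Legionella; s__pneumophila",
    "k__Bacteria; p__Proteobacteria; g__Legionella; s__longbeachae",
    "k__Bacteria; p__Proteobacteria; g__Legionella; s__pneumophila",
    "k__Bacteria; p__Proteobacteria; g__Legionella; s__longbeachae",
]
print(consensus_taxonomy(hits))
# k__Bacteria; p__Proteobacteria; g__Legionella
```

With the hits split two against two at the species rank, agreement falls below the threshold and the walk stops at the genus.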

12:32 - Now, you can adjust the sensitivity and various thresholds for that. When I say consensus, you can actually make it vote amongst the hits that it gets - or a certain number of the top hits - to see which taxonomy has the majority, and you can adjust the confidence in that way. These algorithms work almost as well as the Naive Bayes classifier, so if you’re not happy with the Naive Bayes classifier, or you think it’s playing up in some way, then you can try these consensus classifiers.

Or even if you’re just worried about the risk that using a particular piece of software is biasing your results in some way, you can try out these consensus BLAST and consensus VSEARCH methods to get a second opinion on your taxonomic classification. Finally, people have been doing this for a long time - longer than QIIME 2 has been around - so if you do have a taxonomy from the past, for instance in a BIOM file, you’re able to import that into one of these FeatureData[Taxonomy] artifacts, and then you can use it downstream in the same way you would have done had it been generated by the QIIME 2 feature-classifier plugin.

Right, that’s an overview of what taxonomic classification is, now let’s start to get into the nitty-gritty of how this fits in with the QIIME 2 ecosystem. So this graph here, this flowchart, is fairly intimidating. There’s a lot going on - I’m not expecting you to take it all in in the brief time that I will show it to you, you can go and look at it on the internet if you want to really pore over it. But these two bits here are the interesting bits, in the sense that they are related to what we’ve been talking about so far.

So down the bottom left here, we’ve got the query sequences. They’re the sequences that you’ve gotten from your experiment. The little rounded rectangle that says taxonomic classifier is the taxonomic classifier that we’ve already trained, and you can see from the way those flows come together that we combine them in the feature-classifier plugin’s classify-sklearn action, with that taxonomic classifier and our query sequences as inputs, and it will output a set of sequence classification results.

From there you can use those to filter your sequences or go downstream and create all the nice plots that we’ve seen. You could even do differential abundance analysis if you wanted to. Alternatively you could use the consensus classifiers there - they don’t need a pre-trained taxonomic classifier, they work straight off the raw reference databases, which you can see being imported on the top left. Now we actually offer pre-trained taxonomic classifiers for your convenience on the QIIME 2 website.

So if you go to the QIIME 2 website - I’ll show you the link in a minute - and then go to the documentation, you’ll find some data resources that contain some commonly used pre-trained taxonomic classifiers. So if you’re using 16S, or specifically the V4 region of the 16S gene, and you aren’t interested in using anything other than Greengenes or SILVA, then I’d recommend that you go online, find that taxonomic classifier, and go ahead and do your taxonomic classification.

There is a caveat to that, and that is that with those classifiers using those standard techniques, you can probably be fairly confident that you’re going to get reasonable classification down to the genus level. If you want to go all the way to the species level, then you probably have to work a bit harder, and we’re going to talk about that in a second. So if you want to do that, you’re going to have to train a taxonomic classifier for your particular problem.

Alternatively, if you aren’t using 16S - if you’re using 18S, ITS, or COI - then you will have to train a classifier on a reference database that you have obtained externally to this process. To do that, you go to this part of the flowchart: you’ve got your reference taxonomy and your reference sequences, you import those into FeatureData[Taxonomy] and FeatureData[Sequence] artifacts, and then you feed them into the feature-classifier plugin’s fit-classifier action to train a taxonomic classifier.

There’s an additional step in there that’s labeled trim reference sequences. Now, in most of the cases where you’re doing this yourself, I’d probably advise skipping that step. You can take the FeatureData[Sequence] data from your reference sequences and feed it straight into the feature-classifier training step without going through the trim. We found that trimming reference sequences to a particular amplicon region of interest was mildly beneficial for, say, 16S V4, but if you’ve got a novel database that you’re using and you’re not sure, for instance, that all of your primers are present, or there are other complicating factors, then I’d happily advise you to skip that trim reference sequences step.
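
For intuition, here is a deliberately naive sketch of what trimming to an amplicon region means: find the forward primer, find the reverse complement of the reverse primer, and keep what lies between. The primer strings are invented toys, and unlike QIIME 2’s real trimming step this version tolerates no mismatches or degenerate bases - which also shows how a missing primer makes trimming silently drop a reference sequence (the None case below).

```python
def revcomp(seq):
    """Reverse complement of a DNA string (ACGT only, in this toy)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def extract_amplicon(ref, f_primer, r_primer):
    """Return the region between an exact forward-primer match and an exact
    match to the reverse complement of the reverse primer, or None."""
    i = ref.find(f_primer)
    if i == -1:
        return None  # forward primer absent - this reference would be dropped
    start = i + len(f_primer)
    j = ref.find(revcomp(r_primer), start)
    if j == -1:
        return None  # reverse primer absent
    return ref[start:j]

ref = "AAAAGTGCCAGCTTTTGGGGCCCCGTAGTCCAAAA"
print(extract_amplicon(ref, "GTGCCAGC", "GGACTAC"))        # TTTTGGGGCCCC
print(extract_amplicon("AAAATTTT", "GTGCCAGC", "GGACTAC"))  # None - primer missing
```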

Just take your reference taxonomy, your reference sequences, and then go ahead and train your taxonomic classifier. Okay so let’s go a little bit deeper into what’s happening in this orange rectangle here and why we might want to do that.

19:13 - Sorry, before we do that, this is just a link out to the data resources. So there are some of those pre-trained classifiers, and there’s a little bit of a warning that comes along with them: if you are using a pre-trained classifier with our feature-classifier plugin, then you should make sure that it’s coming from a trusted source and that you’re downloading it in a secure fashion, using, say, HTTPS, and you should probably check your checksums. I know that’s not really what checksums are for, but it can’t hurt.

Now, the reason for that is that if someone really wanted to hack your system, and somehow they figured out that you were using these, and somehow could impersonate somebody you know, then it is possible to inject malicious code into those pre-trained classifiers. But I’m calling that a fairly remote possibility. Just make sure you either train them yourself or you get them from someone you trust, in a way that you can trust. Now, if you want to learn more about training feature classifiers, we’ll talk about that a little bit in the tutorial, I think, or there are online tutorials and there’s a link to a tutorial in the QIIME 2 documentation.

You can find that pretty easily just by going to the QIIME 2 documentation and putting in taxonomic classification, it will be one of the top hits.

20:44 - So why might we want to use something other than the standard reference databases? Well, there are many issues that come with making reference databases, but one of them is that it’s a big job. While you might hope that every classification in your reference database looks like this ideal 16S here, where you know where it sits in your taxonomic tree right down to the strain level, in reality there are a bunch of sequences in most reference databases that look like this.

They haven’t been classified, for whatever reason, even past genus, and the species is a not particularly informative OTU number. These things exist right through most reference databases, and it is the ongoing work of a large number of dedicated people to produce better reference databases - if you want the latest and greatest, then you might want to try one of the novel reference databases that are becoming available. A second issue is that, well, I said it’s a big job; it’s also a hard job.

So, for example, look at this sequence here. This is the first 90 nucleotides of the V4 region of 16S for Lactobacillus helveticus. It is also the first 90 nucleotides of that same region for Lactobacillus hamsteri. So: two different species, exactly the same sequence. This is kind of a toy example that I have inserted here to illustrate a point, but it is a symptom of a much larger problem, where you can have two quite different bacterial species - different even at the genus level or above - that are almost identical genetically.

So how are you going to get around the fact that the data that you’re using to pick between species might not be able to tell you the difference between those two species? Well, one way is to bring in additional information. For instance, I downloaded data from a 2002 study where they took 3921 human vaginal samples, and I performed taxonomic classification and found that Lactobacillus helveticus accounted for 21% of the reads across those almost four thousand samples.

Now that’s huge. It shows you that there’s really quite low diversity in these samples, when this one single species accounts for 21% of all the genetic material that we saw. On the other hand, Lactobacillus hamsteri is a bacterial species that was isolated from a hamster gut in 1987, and is not known to be associated with human vaginal samples. So if you saw that exact sequence in a human vaginal sample, would you call it Lactobacillus helveticus or would you call it Lactobacillus hamsteri? You’d be able to make a pretty safe guess that it’s Lactobacillus helveticus.

The point that I’m trying to get to is that if you know where a sample comes from, this problem of a single read possibly belonging to multiple different species is made much simpler. It turns out that there are many fewer overlaps of a single genetic sequence, or very close genetic sequences, in terms of their taxonomies, if you’re sampling from a specific habitat. So you can use that information. One of the ways you can use it is that, for several different habitats, you can download a custom database.

So there are custom databases for human vaginal samples, and for several other habitats that people are specifically interested in, you can download databases for those specific habitats. They have been shown to increase the accuracy of your taxonomic classification past the genus level, down to the species level and possibly beyond. We’ve taken a slightly different approach in QIIME 2, and I’ll show you this fairly complicated slide here.

The approach that we have taken is to say: okay, when we usually train a taxonomic classifier - supervised learning is the particular type of machine learning that we’re interested in here - the assumption implicit behind the standard approaches is that you’re equally likely to observe any of the classifications that you might output. So if you use the standard taxonomic classifiers that we have made available under the data resources, for instance, there is an assumption in the mathematics behind them that says: well, if we have to guess between hamsteri and helveticus, it’s a 50:50 split.

We can do better than that by looking at the distributions of species across samples for a particular habitat type. So that’s what we did: Nick Bokulich - that’s his floating head there - and I wrote this tool called clawback, which plugs into QIIME 2 and can query the Qiita database, an online database of many thousands of microbiome samples. It can build up a profile of what the taxonomic distribution is for a particular type of habitat.

So for instance, if you’re interested in stool, you might look at animal distal gut. We use the Earth Microbiome Project ontology habitat classifications. So you can pull out a particular set of what we call taxonomic weights. For animal distal gut, for instance, you’ve got a particular distribution that says: okay, you’re more likely to observe some species than others if you’re looking at animal distal gut. And then you can feed that as an input when you’re training your taxonomic classifier.
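
The effect of those weights is easiest to see as the prior in a maximum a posteriori choice. The numbers below are invented for illustration, not real clawback output: the two species get identical sequence likelihoods - the reads can’t tell them apart - so the habitat prior decides.

```python
import math

# Identical log-likelihoods: the V4 reads can't distinguish these two species.
log_likelihood = {"Lactobacillus helveticus": -42.0, "Lactobacillus hamsteri": -42.0}

# Habitat-specific weights (invented numbers): helveticus is common in human
# vaginal samples, hamsteri essentially absent. Only relative size matters
# for the argmax, so the weights need not sum to one here.
vaginal_weights = {"Lactobacillus helveticus": 0.21, "Lactobacillus hamsteri": 1e-6}

def map_classification(log_likelihood, prior):
    """Maximum a posteriori choice: argmax of log prior + log likelihood."""
    return max(prior, key=lambda sp: math.log(prior[sp]) + log_likelihood[sp])

print(map_classification(log_likelihood, vaginal_weights))
# Lactobacillus helveticus
```

With uniform weights the same call would be a coin flip; the habitat weights are what break the tie.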

So that all sounds a little bit complicated. We’ve tried to take some of the hard work out of it by making a repository of taxonomic weights that we’ve trained on Qiita samples and from some other sources, at a repository called readytowear, which is in my GitHub account if you want to go and have a look. So you can download those weights, use them to train a classifier against, usually, Greengenes or SILVA, and then go on to use that taxonomic classifier to get better accuracy in your taxonomic classifications.

Does it work? Yes it works. So we did a bunch of rigorous testing where we trained weights and classifiers on a particular set of samples for a particular reference database and then we tested those classifiers on samples that hadn’t been used to generate the weights or the classifier. And we found that if you are using taxonomic weights from a habitat and then you go to classify samples drawn from a habitat like that, then it’s always going to improve the accuracy of your taxonomic classification.

So for instance, here’s a data set that Nick Bokulich had where previously, with a certain level of confidence, the taxonomic classifications were down to the genus level. You can see that on the top plot here, where you’ve got Enterobacteriaceae and Bifidobacterium and Bacteroides. And on the bottom plot, with the same level of confidence, using these taxonomic weights to train our taxonomic classifier, we were able to say that it’s actually E. coli or B. adolescentis or B. ovatus. And if you’re interested in your taxonomic classification, you’re probably interested in which specific species you’ve got in your sample - whether in fact it came from a hamster or from a human.

Okay, that’s it. If you would like to know more, please look at the tutorials: there are tutorials for how to do taxonomic classification using the standard classifiers, how to train taxonomic classifiers, and how to use clawback. And if you have any questions, please don’t hesitate to ask on the QIIME 2 forum.