PD Mice: Taxonomic classification

Mar 30, 2021 19:00 · 3408 words · 16 minute read

Hello, my name is Anthony Simard, and in this video we will be discussing the taxonomic classification portion of the QIIME2 Parkinson’s mouse tutorial. This video will generally assume you’ve been following the tutorial videos up to this point, and also you’ve watched Ben Kaehler’s lecture video on taxonomic assignment, because in this tutorial video we will be discussing how to implement some of those concepts in QIIME2. So to start with, you can see I have these three files in my current working directory.

These three files are what you will need at bare minimum from previous portions of the tutorial to complete this portion of the tutorial. If you’ve been following the tutorial up to this point, you will have these three files and more, and you will be fine. If you’re trying to hop right in on this portion of the tutorial, you will need to go download these from previous portions. And these are our feature table sequence, our feature table frequency, and our tabular metadata, respectively.

00:56 - So moving on to the tutorial itself, we have taxonomic classification. For this analysis, we will use a pre-trained Naive Bayes machine learning classifier, which you may remember from Ben Kaehler’s lecture, is perhaps the most common way to identify different taxa. This particular classifier was trained to differentiate taxa present in the 99% greengenes 13 8 reference set, trimmed to 250 base pairs of the v4 hypervariable region. Now why do we care about that? We care because Naive Bayes classifiers perform best when they’re trained for the specific specific hyper-variable region amplified.

So if you’re using a Naive Bayes classifier that was trained on p4 v4 data or rather, if your data is v4 data, you’re going to want to use a machine learning classifier that was trained on v4 data. If you have v4 data and you use a classifier that wasn’t trained on v4 data, you’re not likely to get good results. So just keep that in mind when you’re doing this, make sure the classifier you use matches the data that you have. So let’s actually get that classifier by using this w get command in the workshop server terminal, and you can see that went off fine and we now have our classifier.

02:16 - The next thing we’ll do is run the command that’s really the meat of this portion of the tutorial, and that’s QIIME feature classifier sklearn. This will take our feature data sequence, the classifier we just downloaded, and it will produce a feature data taxonomy artifact. Now that artifact will contain all of our ASVs with different taxa assigned to them by the classifier. So I will run this command and I strongly suspect this command will take probably more time than all the other commands in this portion of the tutorial combined, it really is the meat of this portion of the tutorial, and in the real world, using a real full-size data set, this one could take some significant amount of time.

I mean you can see here we’re using a fairly small sort of toy data set for this tutorial so that things don’t take too long, and even then it’s taking a minute. So you can see we now have our feature data taxonomy saved to taxonomy. qza, but you might remember from previous portions of the tutorial or some previous experience, that these qza files these QIIME artifacts aren’t really meant to be human readable. They contain the data in a format that is easily understood by computers, and not so much by people.

So in order for us to interpret that data, we’re going to want to turn it into a visualization, hence the next command in the tutorial: QIIME metadata tabulate. This takes our feature data taxonomy artifact and it outputs this visualization that will allow us to actually view the data in a more human readable form. And this should run significantly faster than the last command and you can see it did. So to actually view this visualization, I’m going to go over here to this web page that is currently looking at the current working directory on the workshop server that I’m running this tutorial in.

I’m going to refresh the page so that it gets back all of the contents that have entered that directory since the last time I was here and loaded this page, and I’m going to copy this link address, go to QIIME view, go to file from the web, paste that URL, and go. And you can see we now have our visualization. This visualization contains feature ids, the taxa associated with those feature ids, and the confidence with which our classifier was able to associate this taxa with this feature id.

You can sort this any way you like, ascending/descending feature id, ascending/descending taxon, ascendingd/escending confidence. I’m going to keep it on ascending feature id, which is the default. You can also see here we have a search bar. This search bar can be used to narrow down the data in here, as you might imagine. So let’s say I just want this particular feature id, plop it in the search bar, and now that’s all we have. Let’s say I want only things that are in the phylum Firmicutes - type in the search bar, press enter, all that’s left are taxons or taxa that are from the phylum Firmicutes.

You can do the same with confidence but it doesn’t work maybe in quite the way you’d expect, if I were to put in say 0. 9, well we can basically have everything with a confidence of 0. 9 and up, because these all contain 0. 9. But if I were to put in 0. 90, now we only have things that contain 0. 90 explicitly in any one of these three columns. In this case given that I put in 0. 90 it’s only really going to be in confidence. So it’s not anything 0. 90 and above, it’s things that have 0.

90 in them. So don’t think you can put in say 0. 98 to get everything with a confidence of 0. 98 and higher, it won’t do that. It’ll give you everything with a confidence between 0. 98 and 0. 99.

06:23 - So moving on to the next portion of the tutorial, we’re going to want to visualize some more of our data. In particular we’re going to want to visualize our feature table sequence, and so we’re going to do that using feature table tabulate seqs.

06:42 - We’re going to run that and it’s going to be exactly the same as before, we’re going to take in that artifact, we’re going to output a visualization, we’re going to go to our page that has the workshop server directory open in it, we’re going to reload. You can see that we now have this dada2repset. qzv - no, I do not want this page in Catalan, it is not in Catalan, rather - we’re going to copy this link address, QIIME view, go here, file from the web, paste, go, and we have our visualization.

This one contains the feature ids of all of our features, the sequence length which, as you can see, is 150 for all of them, and the sequence itself with a hyperlink on it. And we’ll see what this hyperlink is for later.

07:42 - So going back to the tutorial, we have some questions to answer. First, find this feature. What is the taxonomic classification of this sequence? What’s the confidence for the assignment? So if you learned from before, you might guess that what we want to do is copy this feature, go back to our taxonomy. qzv, and search this feature. And you can see that we have this particular taxon associated with this feature id, and we’ll discuss what this means in a moment, the fact that this is just a g double underscore and an s double underscore as opposed to an actual genus and an actual species.

You can also see that the classifier was about 98. 36 so on, so on, percent confident that this feature id belongs to this particular taxon. This number here might be a little bit different for you, not a huge deal. So that’s how you do that. Let’s clear this search bar and go back to the next question. How many features are classified as g akkermansia? So what we’re going to do is double click this, copy paste in here, and you can see we have two g akkermansia features, and in fact these are both classified as the species mucinephila with a very high confidence.

So you can see we have these two features, both the same taxon, both very high confidence.

09:19 - And at this point I think it is worth the worthwhile to address these notes because we’ve now seen both of them. First you might notice that some features do not have taxonomic assignments, which, for the greengenes database, is indicated by a blank string at the level eg double underscore. These indicate that there is not enough information for the greengenes database to differentiate members of that clade either due to ambiguity in the database or because the gene region being sequenced doesn’t provide the resolution to distinguish members of that clade.

So if we go back to our visualization and we get rid of this, you can see here we have family, genus, and species, double underscore. Let’s actually resort by taxa, by feature id low/high. I don’t know why it wasn’t sorted correctly there, my apologies. But yeah let’s get this example instead. So we can see we have genus and species is just g double underscore and s double underscore. So what that means is that the database itself doesn’t have enough data to determine what genus or what species this particular feature was.

It could get it to kingdom, phylum, and class, order, and family - genus and species it doesn’t know.

10:31 - And there is another kind of missing data which is something like this. You can see we have a genus but we don’t even have an s double underscore for species we just have nothing. That comes from the case where q2 feature classifier cannot reliably classify the ASVto a deeper level. In those cases incomplete taxonomy strings will be provided. So, in other words, the ones where we see the level double underscore come from the greengenes database not having enough data to differentiate that taxon to that level, and the ones where we don’t see the letter or double underscores at all means that q2 feature classifier with your given classifier wasn’t able to assert what that was at that level.

The next note you might notice that one ASV has the same taxonomic assignment as another. That is perfectly normal and in fact we did see that with akkermansia here. Unique ASVs do not necessarily map to unique taxonomic groups, which makes perfect sense. If you’re gathering a bunch of data, especially if you gather a bunch of data from the same kind of area, you might get data from the same thing and that’s perfectly fine. So now that we’ve addressed those two notes, let’s move on to the third question: use the tabulated representative sequences to look up these features - that being the g akkermansia features.

If you blast them against NCBI, which I will show you how to do, do you get the same taxonomic identifiers you obtained with q2 feature classifier? So I’m going to give this back, I’m going to go back to our visualization, I’m going to search it again, and you can see we still have these two feature ids with this particular taxon. So, in order to blast it against NCBI, what I’m going to do is I’m going to take this feature id, I’m going to copy it, I’m going to go over to our rep set qzv, and I’m going to use the browser search - which is control f - I’m going to paste that feature id in,and we’re going to click on this hyperlink on the sequence which will bring up this page.

And as you can see right now, NCBI blast is having some server issues. So I actually ran these commands beforehand, so I will be able to show you the output, but I’m not going to run them live here. I’m just going to show you how you would run them. So again that’s find the id in this qzv that came from your feature table sequence, click on the hyperlink on the sequence, and it’ll take you to this page and you would click view report. Now again, we’re not actually going to do that because the servers are down, and also it can take a significant amount of time to run these commands.

I actually ran them earlier before the servers went down and it took close to half an hour. So depending on the load on the NCBI servers and whether the servers are down for maintenance, etc. , etc. you know all of that can affect the degree to which you can do this. But just reiterate those steps: we’ll look at the second feature id from that same taxon, so again we will copy the feature id, go over to our other visualization, paste it in the search bar.

Here we are, here’s the sequence. Click on the sequence, takes us to the NCBI blast page, this is what it should look like if the server is working properly -this is the server being bad, this is what it should look like if it worked. And again you would click view report and wait for results. And now I will show you what those results look like. So as I said previously, I ran it before and this is the output from that first one, from that first feature id, clicking on the hyperlink on the sequence and then clicking on that view report, this is what it came up with.

So if we scroll down here we can see what different taxa blasts associated with. And also you might remember blast again from Ben Kaehler’s lecture. Basically what it does is it’s taking your sequence and comparing it to a whole bunch of sequences that are already sort of known and whatever it has the most in common with, whatever known sequences has the most in common with, that’s what blast is going to tell you it probably is, which makes sense. If you look down here, we can see that some of these don’t really have a name that we care about, but if we move down a bit we can see akkermansia municiphila, and in fact if we go over to the results for the second feature we can again see akkermansia municiphila.

So yes ncbi blast did give us the exact same taxon for these features as QIIME 2 feature classifier did. So moving back to the tutorial, the next portion is the taxonomy bar chart portion. Since we saw a difference in diversity in this data set, we might want to look at the taxonomic composition of these samples. To visualize this, we will build the taxonomic bar chart of the samples we analyzed in the diversity data set. Before we do this, we will first filter out any samples with fewer features than our rarefaction threshold, which was discussed in a previous tutorial.

In this case our rare fraction threshold was 2000. So we can filter these samples out by using the q2 feature table plug-in with the filter samples method. That is going to take our feature table frequency, our minimum frequency - in this case 2000, and is going to output a new artifact of feature table frequency that has been filtered. So we’re going to run that to get this table2k. qza, which again is pretty much the same as dada2table. qca, only any features with less than 2000 sequences have been removed.

Or rather any samples with less than 2000 features have been removed, my mistake. And again we’re going to want to create a visualization out of this artifact so we can actually view the data. This visualization is going to take the table we just generated, our new feature table sequence, it’s going to take our feature data taxonomy that we generated before using classify sklearn, and it’s going to take our metadata. So let’s run this. I’m going to clear because this is being a bit messy, paste and enter, and we will have a new visualization.

And there’s actually only one question associated with this one, so we’ll look at the question a bit before we go look at the viz to motivate why I look at the vis in the way I do. So the question: visualize the data at level 2 phylum level and sort the samples by donor, then by genotype. Can you observe a consistent difference in phylum between the donors? Does this surprise you? Why or why not? So let’s go look at that viz and see if we can answer this question.

So I’m going to go back here, I’m going to reload the page again - taxa bar plot. qzv is the viz we want now. I’m going to copy the link address, I’m going to go to QIIME view, I’m going to click on this, I’m going to get a file from the web, paste, go, and we now have this visualization. And the first thing I’m going to do is to change the taxonomic level from level 1 kingdom to level 2 phylum, because everything we have is in the kingdom bacteria, and maybe a tiny bit of unassigned here.

So this isn’t very interesting to look at, but if I change it to phylum, now we have more variety. The next thing I’m going to do is take this bar width slider and slide it over as far as I can without moving any samples off the right, and you can see that blew up the viz and made it much more easy to look at. So let’s go look at that question again: sort by donor then by genotype, can you observe a consistent difference in phylum between the donors? How do we do that?Well sort samples by over here, click on it.

We have donor, we have genotype. So if I click on donor, we can see that everything is still pretty dominated by bacteriadites, also has significant numbers of firmicutes, but if you look at this hc1 donor herem this one is the only one that really contains this actinobacteria phylum. And if I go to genotype we can see it’s really only the susceptible mice here that contain this varuco microbia. The others are fairly well spread out on genotype but voruca microbia is only in one wild type mouse and is fairly prevalent in some of the susceptible mice.

So this can give you a good idea of where to start with your analysis, it’s not you know the end-all be-all of what to do, this doesn’t necessarily mean anything significant -your gut feeling, my gut feeling, anyone’s gut feeling looking at this vis does not replace the need to do some real statistical analysis. But this might be a good place to start that analysis. And also worth mentioning, we can do some more fiddling with this visualization. So let’s say you want to publish this in your paper but you don’t like the way it looks, we can change the color palette around, maybe you prefer one of these - that one’s a bit interesting, it can be hard to read.

I kind of like the default one. We can also sort ascending or descending and so on. So that answers this question I believe and remember that the answer to this question does not necessarily give a definitive conclusion for your study. I hope you learned something here today, I hope this was of some use to you, and I hope you will use QIIME 2 in the future to do some taxonomic classification for real world projects. Thank you. .