Beta diversity metrics
Apr 27, 2021 19:00 · 5268 words · 25 minute read
Now, as we move on to talk specifically about beta diversity, what we're looking at is a comparison between two samples: how do the microbes in those samples overlap? We do this by expressing beta diversity as distance metrics. When I talk about distance, what I mean is how similar or different two samples are from each other. Do they share many of the same microorganisms, or do they not overlap at all? We're going to get into how we measure this in a second, but first I want to show you how this data is going to look.
You may recall from alpha diversity that each sample has its own alpha diversity measure. You can add alpha diversity as a column to your metadata, because no matter what comparison you're making, the alpha diversity per sample stays the same, just like any other per-sample value such as the temperature at which you took the sample or the number of days since the start of the experiment. Beta diversity, on the other hand, depends entirely on which two samples we're comparing, so we end up with a matrix like this instead.
You'll notice that across the top here we have sample names (stilton, 4r2, 4r3), and along the rows we have those same samples again. What we're looking at is the distance between each pair of samples, meaning how different they are. The two samples in this top left-hand corner are exactly the same, so you'll see that the distance is zero: there's no difference between them. But if the samples are very different, like the two in this other comparison, we get up to a distance of 0.5 or even 0.6. If it helps to see the structure, there's a small sketch of this kind of matrix below.
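Here is a minimal sketch in Python of that structure; the sample IDs echo the slide, but the distance and alpha diversity values are made up for illustration:

```python
import pandas as pd

# Hypothetical beta diversity distance matrix: square and symmetric,
# with zeros on the diagonal (each sample compared to itself).
samples = ["stilton", "4r2", "4r3"]
distances = pd.DataFrame(
    [[0.0, 0.5, 0.6],
     [0.5, 0.0, 0.2],
     [0.6, 0.2, 0.0]],
    index=samples, columns=samples)
print(distances)

# Alpha diversity, by contrast, is just one number per sample,
# so it can live in an ordinary metadata column.
alpha = pd.Series([3.2, 2.8, 2.9], index=samples, name="shannon")
```

The key point is that beta diversity lives in a square, symmetric table with a zero diagonal, while alpha diversity is just one value per sample.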
So again, what we're looking at is a direct comparison between every single sample and every other sample in your entire data set. As you might imagine, these distance matrices can get very, very large; this is a really small example with only about 50 samples in the study. We don't usually look at the distance matrices themselves; we visualize them instead, which we'll show you nearer to the end of this presentation.

You may be wondering where the numbers in this distance matrix come from. There are a lot of different metrics you can choose to evaluate your beta diversity. One of the most commonly used metrics in the literature today is the UniFrac distance. Here's an example of it, shown as a PCoA plot, which we're going to show you how to make later. Basically, each point represents the entire diversity of a single sample, and the closer two points are to each other, the more similar the microbiomes of those samples are. We can then color the points by different metadata to look for patterns in our data. In this case we used UniFrac to measure the distances, so all of the numbers in our distance matrix came from a method called UniFrac, and then we plotted them. You can see that UniFrac really separates the free-living organisms from the vertebrate gut organisms; we don't see a lot of overlap. So the distance between two free-living samples is probably pretty small, maybe 0.1 or 0.2,
while the distance between a free-living sample and a vertebrate gut sample is much larger, maybe one of those 0.6 or 0.7 values. But if we chose a different metric, the data might not separate as well; here we see some odd patterns along the vertical axes. So the choice of metric is really important, because it shapes your data differently.

I'm going to get into a few of the commonly used methods for calculating these distances, how the math works for each, and some advantages and disadvantages of each. The main ones we'll talk about today are Jaccard, Bray-Curtis, unweighted UniFrac, weighted UniFrac, and generalized UniFrac. UniFrac is the one we saw in the last slide, but even within that metric there are a lot of different choices for what you can do.

There are a few rules that every distance metric has to follow. First, the distance between one sample and another has to be greater than or equal to zero; you can't have negative numbers in a distance matrix, which is why all of the numbers we saw were zero or above. Second, the distance between sample one and sample two equals zero if the samples are exactly the same, which is another thing we saw in that example matrix earlier. Third, the distance from sample one to sample two is the same as the distance from sample two to sample one, so it doesn't matter in which order we calculate them; in the distance matrix each comparison appears twice, with the diagonal being zero and the values mirrored across it. The fourth rule, the triangle rule, is a little more confusing: the distance between sample one and sample two has to be less than or equal to the distance between sample one and sample three plus the distance between sample three and sample two. If you picture those three points plotted as a triangle, the direct path between two of the points can never be longer than going through the third point, which is something we're used to from basic geometry. These rules are summarized below. Don't worry too much about the equations if they're confusing; they don't matter as much for just using and applying the distance metrics, but for me, knowing the fundamental math behind the decisions we're making helps me understand what we're doing a little better. This isn't going to be on the test, so to speak; it's just something to keep in mind.
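For reference, here are those four rules written out compactly, where d(A, B) is the distance between samples A and B:

$$
\begin{aligned}
d(A, B) &\geq 0 \\
d(A, B) &= 0 \quad \text{when } A = B \\
d(A, B) &= d(B, A) \\
d(A, B) &\leq d(A, C) + d(C, B)
\end{aligned}
$$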
Now we can get into what some of these actual metrics are. The first and simplest metric is the Jaccard distance. If you've already listened to the alpha diversity lecture, you know that the simplest measure of alpha diversity is richness: just counting the number of different microbes present in a sample. This is the beta diversity equivalent: it's based on how many shared microbes are found in the two samples. If you're a visual learner, here's a picture of what this looks like. If the exact same microbes are found in both samples, the distance is zero, no matter how much of each microbe there is; it's just a count. There's E. coli in both samples, so the distance is zero. If, on the other hand, there are no shared microbes between the two samples, the distance is one: they don't overlap at all. And if about half of the microbes in each sample are shared, the distance is 0.5. This one is really simple and really intuitive, which is nice.

For those of you with more mathematical minds, here's how that works. The equation is one minus the number of features that are in both samples A and B, divided by the number of features that are in A or B, that is, the total between the two samples. Say this is our feature table; notice it's the same type of table we used in alpha diversity, not yet a distance matrix, which is what we're going to build here. We want to compare 4ac2 and e375, which I'll just call yellow and blue. We count up the total number of features present in either sample, and again it doesn't matter how often they occur; it doesn't matter what the number in the box is, as long as it's greater than zero. The yellow sample has one, two, three, four features, the blue sample has one, two, three, four features, and in total five features are represented, because all five features show up in at least one of the two samples. So our denominator is five. For the numerator we count how many are shared: feature one is shared, feature two is not, feature three is shared, feature four is shared, feature five is not. That's three shared features, so the distance is one minus three divided by five, and that is what fills in this box here and this box here. Of course, on the diagonal it's all going to be zero, because there we're comparing each sample to itself.

Okay, I apologize, I forgot that there were animations. So this is how the matrix fills in, and because it's a distance matrix, the other side is exactly the same; remember rule number three, which says the distance between sample one and sample two is the same as the distance between sample two and sample one. Something you'll notice about this matrix is that we don't actually see a lot of differences here: our distances only range from about 0.4 to 0.5. A small sketch of this calculation is below.
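Here is a minimal sketch of that Jaccard calculation in Python, using the presence/absence pattern from the worked example (yellow standing in for 4ac2, blue for e375):

```python
# Presence/absence view of the two samples from the worked example
# (which features are present is taken from the feature table on the slide)
yellow = {"feature1", "feature3", "feature4", "feature5"}
blue = {"feature1", "feature2", "feature3", "feature4"}

shared = yellow & blue   # features present in both samples
total = yellow | blue    # features present in either sample
jaccard = 1 - len(shared) / len(total)
print(jaccard)           # 1 - 3/5 = 0.4
```

For count vectors converted to booleans, scipy.spatial.distance.jaccard gives the same value.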
There's not a lot distinguishing the samples, even though if we just glance at the feature table we might expect more differences, especially between this purple sample and some of the others: the purple sample has a lot more zeros, plus a lot of feature three and only a little of feature four, whereas the others have a lot of feature four and only a little of feature three. So in this case Jaccard may not be the best distance metric to pick up on what makes that purple sample odd. Here we may want to also include some sort of evenness measure, because the number of sequences associated with each feature might be important to understanding this data set. If you remember, in alpha diversity we used the Shannon index to capture both richness and evenness; the beta diversity equivalent is the Bray-Curtis distance.

For now, go ahead and ignore the math at the bottom here, because the equation is a little more complicated than it looks, and let's just look at the visual building blocks of this metric. In this figure, each column is a feature (a bacterium), the number of squares in a column is how often that feature shows up in a sample, and we have a blue sample and a red sample. Comparing blue and red in the first case, they have exactly the same features at exactly the same frequencies, so the distance is again zero: the two samples are identical. At the other end, these two samples share barely any features: feature one is in the blue sample but not the red sample, feature two is in the red sample but not the blue sample, and so on. Our only overlap is feature five, with three occurrences in the blue sample and one in the red sample, but overall these are distinct enough that the distance is about one; they barely overlap at all. And notice that the abundances themselves are fairly balanced: there's a feature with an abundance of four in blue and a feature with an abundance of four in red, so the two samples aren't very uneven even though their richness differs. Then here in the middle is an example of a distance of about 0.5,
which is more like what we'd probably see realistically: there are only a couple of features that don't overlap at all. Feature three is present at high abundance in blue and not at all in red, feature five is present at low abundance in red but not in blue, feature four is shared between both, feature two is very similar between both, and feature one is not that different. What we're seeing here comes out to a distance of 0.5. This one is a little more difficult to see visually, but hopefully the math will help you understand it better.

The formula looks a bit more complicated than the last one we saw, so let's break it down. We're taking a sum over all of the features, where each term is the frequency of feature i in sample A minus the frequency of that same feature i in sample B, as an absolute difference. So for feature one, that might be 42 in sample A, the yellow sample, compared to feature one in sample B, the blue sample, which here is 12.
That gives 42 minus 12, but then we have to do this for every single feature. So to get our numerator it's 42 minus 12, which is 30; then 0 minus 1, which is 1 once we take the absolute value; then 37 minus 22, which is 15; 99 minus 88, which is 11; and 1 minus 0, which is 1. We add all of those numbers together and get 58 for our numerator. Our denominator is the opposite: a sum over all of the features of the frequency of feature i in sample A plus the frequency of feature i in sample B. So for the denominator we're doing 42 plus 12, plus 0 plus 1, plus 37 plus 22, plus 99 plus 88, plus 1 plus 0.
That adds up to 302, so you don't have to try to do that math yourself. So when we're comparing the yellow sample and the blue sample, to fill out this box we compute 58 divided by 302, which is about 0.19. We would do that for every pair of samples to fill out one side of the distance matrix, and remember, the other side is exactly the same. If we compare this back to the distance matrix we saw previously with the Jaccard distance, you'll notice we actually see a lot more distinction between samples here. Remember I said it was interesting that sample five had such similar distances to the other samples even though it looks different to me; in this case, comparing sample five to the others gives much larger distances, around 0.65, 0.69, and 0.7, whereas comparing sample four and sample one gives only 0.15. A small sketch of this calculation is below.
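If you'd rather let code do that arithmetic, here is a minimal sketch using SciPy and the counts from the worked example; braycurtis computes exactly the sum of absolute differences divided by the sum of the totals:

```python
from scipy.spatial.distance import braycurtis

# Feature counts for the two samples from the worked example
yellow = [42, 0, 37, 99, 1]
blue   = [12, 1, 22, 88, 0]

print(braycurtis(yellow, blue))   # 58 / 302, about 0.19
```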
So Bray-Curtis is going to show us a lot more differences between samples in this case than the Jaccard distance did. This is an example of why you might choose Bray-Curtis: if your sequence counts are really uneven, meaning you see similar features in all of your samples but at very different frequencies, Bray-Curtis might be the better metric.

And now, finally, we've gotten to the UniFrac distance, which is the one I was teasing at the very beginning. This distance is really valuable because it incorporates phylogenetics. There are a couple of different ways to measure UniFrac; the two big ones are unweighted and weighted, and just based on the names you can probably guess the difference: unweighted only measures richness, so only the presence or absence of features, whereas weighted also includes the amount, the frequency at which those features are present. Unweighted is usually the one we teach on because it's a little easier to understand. Its formula is actually quite simple even though the math behind it is more involved: it's just the sum of the unique branch lengths divided by the sum of the observed branch lengths (restated below).

In this first example, sample one and sample two share all of the same branches; that's why they're drawn in purple here, because red and blue make purple. Since all of the branches are shared, the numerator of the formula is zero, because none of the branches are unique, and the denominator is all of these branch lengths added together. Remember, it doesn't matter what the counts in sample one and sample two are, just whether they're nonzero. That's why the distance ends up being zero. At the other end, sample one and sample two don't share any features at all. This is similar to that first example we saw for the Jaccard distance, except it's even more informative, because not only do they share no features, they don't even share any branches: sample one and sample two sit in two completely different portions of the phylogenetic tree. This is almost like comparing a sample of bacteria to a sample of archaea; there's no overlap at all until a very, very distant common ancestor way back here. So the unique branch length ends up being the same as the observed branch length, because every branch that we see is unique and none are shared, which is how the unweighted UniFrac distance ends up being one.

Then there's the more middle-ground example: sample one and sample two still don't share very many ASVs. If we look for the purple ASVs, that is, the shared features, there are only two shared features between these two samples, so the Jaccard distance would be very high, close to one, because they don't overlap much. However, with UniFrac, even though they have distinct ASVs, those ASVs are still closely related phylogenetically. For example, sample one has none of ASV 4 and sample two has none of ASV 5, so the branch lengths leading directly to ASV 4 and to ASV 5 are unique, but the branch before that, where those two diverge, is shared. That's why this distance ends up being about 0.5. This is the kind of thing you might see if you're looking at a couple of gut samples where the genera are really similar between your two samples, but the actual species that exist are different, so only the last branches are unique.
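Restating the formula those tree examples are applying, for two samples A and B:

$$
d_{\mathrm{unweighted\ UniFrac}}(A, B) =
\frac{\sum \text{branch lengths unique to } A \text{ or } B}
     {\sum \text{branch lengths observed in } A \text{ or } B}
$$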
That's the general principle, so let's do a practice example, just like we've done with all the others. Like always, let's start by comparing the yellow sample to the blue sample; I have a little illustration here for you. Again, we're looking for unique branches. Both samples have feature one present (it doesn't matter that it's more frequent in the yellow sample, it's present in both), so here's feature one on our phylogenetic tree, and both samples share this branch and this branch. Our unique total so far is zero, and our running denominator is 1.75. Then let's go to feature two and feature three over here. Both samples have feature three, so both samples share this branch and this branch leading to feature three; however, only the blue sample has feature two, so that last tiny branch going to feature two is our first unique branch length, up in the numerator. Feature four is shared, so they both have this branch and this branch leading to feature four, but only the yellow sample has feature five, so that last little branch is another unique one. So we only have two unique branches, 0.5 and 0.25,
which means our numerator is 0.75. Meanwhile, our denominator consists of all of the branches covered by either of the two samples: 1.25 plus 0.5 plus 0.5 plus 0.5 plus 0.6 plus 1.45 plus 0.75 plus 0.25. Notice that each of those branches only gets counted once, even if it's present in both samples. The denominator ends up being 5.8, so we compute 0.75 divided by 5.8 and get about 0.13. I apologize if it's confusing to just listen to that; if we were doing an in-person workshop I could write these numbers on the board and help you follow along a little better. The good news, if you don't quite follow, is that first, we have some resources near the end of this section that will help if reading about this works better for you than listening, and second, you're never actually going to have to do this math by hand, because QIIME 2 will do it for you. The main reason we teach the actual formulas is so you understand what each metric is really telling you and can make an informed decision about which ones to include in your analysis.

Real quick, for another example we can compare the yellow and the red samples, sorry, we've switched to purple: the yellow and purple samples. The reason I point this one out is that feature two is absent in both yellow and purple, so notice that we don't count that 0.5 branch length at all;
it's not unique and it's not shared. I just want to make it clear that when we count shared branch lengths, we're only capturing the branches present in the samples being compared; it's not like we're counting the entire phylogenetic tree that we made earlier. I hope that makes sense. So what's interesting now is that if we're comparing yellow and blue, our distance is 0.13, and if we're comparing yellow and purple, our distance is 0.14. (A small sketch of the yellow-versus-blue arithmetic is below.)
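Here is the yellow-versus-blue calculation written out as a tiny Python sketch, using the branch lengths read off the tree in the slide:

```python
# Branch lengths covered by the yellow and/or blue samples (each counted once)
observed = [1.25, 0.5, 0.5, 0.5, 0.6, 1.45, 0.75, 0.25]

# Branches found in only one of the two samples: the tip leading to
# feature 2 (blue only) and the tip leading to feature 5 (yellow only)
unique = [0.5, 0.25]

distance = sum(unique) / sum(observed)
print(round(distance, 2))   # 0.75 / 5.8, about 0.13
```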
So even though they look different, and they were fairly different based on Bray-Curtis, phylogenetically they're not that different, because there are really only a couple of differing features in each comparison, and they're all nested within the same phylogenetic tree. Once again we've returned to these samples looking very similar to each other, even though maybe they aren't. So again, if we really wanted to distinguish them, we might use a metric that also includes evenness. If we want a metric that includes both evenness and phylogeny, we would turn to weighted UniFrac (sorry, the title at the top of this slide is not correct). In this case we're just adding in the evenness component, and we don't even give you the formula for this one because it's fairly involved and difficult to explain. So now, instead of the first distance being zero, it's still very low, but it's about 0.1,
because we do see a difference in the weights: if we look at ASV 10, it's 170 in sample one and only 7 in sample two, so that still counts as a difference even though that branch is shared. Meanwhile, the distance over here is still going to be exactly one, because there are still no shared branches; the entire tree is unique. And in the middle case, weighted UniFrac will skew a little higher again, because it's also looking at how different the weights are between the two samples. So basically, unweighted UniFrac biases toward rare taxa, because it doesn't matter whether a feature is present with a frequency of one or a frequency of a hundred, it shows up the same in unweighted UniFrac, whereas weighted UniFrac leans heavily on the abundant taxa and largely ignores the rarer ones. These tell you different things about your data, which can be good or bad.

More recently, some of the QIIME 2 developers have started to recommend the generalized UniFrac model. We don't really have a good picture for it, but basically it gives you a middle ground. If you're interested, you can go to the plugin linked here; we don't demonstrate it in the tutorial because it's a little more advanced, but you can set the alpha parameter in that plugin to 0.5 so that it only half-weights the UniFrac distance, if you will. Really, you can set that number to anything to shift the balance a little and get some different insight into your data. We're just putting that there as a resource for you to explore as you dive into your own data later, and there are some good hints about using it on the forum if you have questions. If you'd like to play with the unweighted and weighted metrics outside of QIIME 2, there's a small sketch below.
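Scikit-bio exposes both the unweighted and weighted UniFrac metrics. This is a rough sketch only: the feature table, tree, and IDs are invented for illustration, and the otu_ids argument name is from scikit-bio 0.5.x (newer releases may name it differently):

```python
from io import StringIO
from skbio import TreeNode
from skbio.diversity import beta_diversity

# A made-up 2-sample by 3-ASV feature table (rows are samples)
counts = [[100, 1, 5],
          [  0, 80, 5]]
ids = ["sample-1", "sample-2"]
otu_ids = ["ASV1", "ASV2", "ASV3"]

# A small rooted tree with branch lengths for those ASVs
tree = TreeNode.read(StringIO("((ASV1:0.5,ASV2:0.5):0.5,ASV3:1.0):0.0;"))

unweighted = beta_diversity("unweighted_unifrac", counts, ids,
                            otu_ids=otu_ids, tree=tree)
weighted = beta_diversity("weighted_unifrac", counts, ids,
                          otu_ids=otu_ids, tree=tree)

# Unweighted only notices that ASV1 is missing from sample-2; weighted
# also reflects how different the ASV1 and ASV2 abundances are, even
# though ASV2 and ASV3 are present in both samples.
print(unweighted["sample-1", "sample-2"])
print(weighted["sample-1", "sample-2"])
```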
With that, I'm going to turn it back to Yoshiki to talk about how we visualize these distance matrices so we can actually gain something from them, since, as I showed you earlier, looking at a raw distance matrix is not super helpful in a real-world context. Thanks for spending this time with me.

Thanks so much, Ariel. Now that we know how to compute these distances between samples, and now that we have a better intuition for how to interpret the different distance metrics, let's talk about how we can visualize them. By far one of the most popular methods to do this is principal coordinates analysis. Principal coordinates analysis, or PCoA, is a form of dimensionality reduction where the main input is a distance matrix. This is not to be confused with principal components analysis, where the main input is a feature table or contingency matrix. The main implication is that principal coordinates analysis can operate on matrices in any metric space, whereas principal components analysis can only operate in Euclidean spaces.

As Ariel mentioned, beta diversity on its own does not have any direct relationship to the sample metadata that we collect. If we ran a PCoA analysis on a data set with no metadata, all we would have to look at is a black-and-white picture like this one, where the points are distributed in the space but there's really not much we can do to interpret that distribution. It is only when we add the sample metadata that we get to the more interesting results and interpretations. In this case we can see the differences between sample types in panel A, or, in panel B, the lack of differences between the host subjects or the sex of the hosts who donated these samples, and we can also see a little bit of the temporal variability. In general, principal coordinates analysis is not the only way to visualize this data: you can also visualize distributions of distances between groups, like in panels E and F, or cluster the samples using a hierarchical clustering scheme to compare the differences between them.

As a side note, we've come a very long way in visualizing these ordination plots. In 1957, Bray and Curtis published the paper in which they analyzed the forest communities of southern Wisconsin, and in that paper they included a figure showing a three-dimensional ordination that they built out of wooden sticks and balls. These days, using QIIME 2, this is much easier than it was 70 years ago.

One important thing to note about principal coordinates analysis is that it is not a form of statistical testing. If you're interested in doing statistical comparisons, we recommend PERMANOVA or PERMDISP for categorical comparisons, the Mantel test for continuous univariate comparisons, and adonis for other multivariate applications. A rough sketch of computing a PCoA and running PERMANOVA outside of QIIME 2 is below.
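As a rough sketch of those two steps with scikit-bio (the distance matrix and grouping below are invented for illustration; in practice you would export the distance matrix that QIIME 2 already computed for you):

```python
import numpy as np
from skbio import DistanceMatrix
from skbio.stats.ordination import pcoa
from skbio.stats.distance import permanova

# Hypothetical 4x4 Bray-Curtis distance matrix: symmetric, zero diagonal
data = np.array([[0.00, 0.15, 0.65, 0.69],
                 [0.15, 0.00, 0.66, 0.70],
                 [0.65, 0.66, 0.00, 0.19],
                 [0.69, 0.70, 0.19, 0.00]])
ids = ["s1", "s2", "s3", "s4"]
dm = DistanceMatrix(data, ids)

# Principal coordinates analysis: the first two axes are what a
# typical PCoA scatter plot displays
ordination = pcoa(dm)
print(ordination.samples[["PC1", "PC2"]])

# PERMANOVA tests whether distances differ between metadata groups
grouping = ["gut", "gut", "free-living", "free-living"]
print(permanova(dm, grouping, permutations=999))
```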
In summary, in general we have two different types of metrics: qualitative and quantitative. Qualitative metrics do not account for the relative abundances of the features observed in a sample, and quantitative metrics do. This means there will be different tradeoffs in how low-abundance or rare features are represented in the end results. In addition, we also have metrics that do or do not account for phylogenetic relationships between the features. Non-phylogenetic metrics assume that everything in your feature table is equally related; phylogenetic metrics take the distances between the features into account in order to assess how different two samples are.

Lastly, here is a list of frequently asked questions that we thought would be beneficial to include. One of the most common questions we get is: what is the best distance metric, and which should I use for my analysis? The reality is that there is no single best distance metric. Different metrics account for different properties of the data, and you should rely on the interpretation of the metrics to choose which ones to use for your analysis. Another question we often get is: how do I know which metadata category is the most important in my analysis, and can I assess that through a PCoA plot alone? In general, no; you want to use statistical tests like the ones we mentioned before to support any of the results that you visualize in a PCoA plot. There are a few options available in QIIME 2 and in other plugins that let you visualize additional axes of data on top of these PCoA plots; biplots are another very useful tool for this purpose. One other question we get a lot is: what other metrics are available in QIIME 2?
There really are a few dozen metrics available in QIIME 2. If you want to see them, we recommend looking at the qiime diversity beta command's --p-metric option; its help text will show you the list of all of these metrics. For explanations and citations for these metrics, we recommend that you check out this forum post. Thanks so much.