# Beta diversity metrics

## Apr 27, 2021 19:00 · 5268 words · 25 minute read

Now, as we move on to talk specifically about beta diversity, what we're looking at is a comparison between two samples: how do the microbes in those samples overlap? We do this by computing distance metrics between samples. When I talk about distance, what I mean is how similar or different two samples are from each other. Do they share many of the same microorganisms, or do they not overlap at all? We're going to get into how we measure this in a second, but first I want to show you how this data is going to look.

You may recall from alpha diversity that each sample basically has its own alpha diversity measure. You can add alpha diversity as a column to your metadata, because no matter what comparison you're making, the alpha diversity per sample is the same, just like any other per-sample value such as the temperature at which you took the sample or the days since the start of the experiment. Beta diversity, though, depends entirely on which two samples we're comparing, so we end up with a matrix instead.
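To make that shape difference concrete, here's a minimal Python sketch. The feature sets and the simple set-overlap distance below are made up for illustration; they are not the study's real data:

```python
from itertools import combinations

# Hypothetical presence/absence data, just to show the shapes involved.
features = {
    "stilton": {"f1", "f2", "f3"},
    "4r2":     {"f2", "f3", "f4"},
    "4r3":     {"f1", "f3", "f4"},
}

# Alpha diversity (here, plain richness): one number per sample,
# so it fits as a single metadata column.
alpha = {name: len(feats) for name, feats in features.items()}

# Beta diversity: one number per *pair* of samples, so it needs an
# n-by-n matrix. This stand-in distance is simply the fraction of
# features not shared between the two samples.
def distance(a, b):
    return len(a ^ b) / len(a | b)

names = list(features)
matrix = {s: {t: 0.0 for t in names} for s in names}
for s, t in combinations(names, 2):
    d = distance(features[s], features[t])
    matrix[s][t] = matrix[t][s] = d  # symmetric, with a zero diagonal
```

Alpha fits in one metadata column; beta needs the full n-by-n table, which is why distance matrices grow so quickly with the number of samples.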

You'll notice that across the top we have sample names (stilton, 4r2, 4r3), and along the rows we have those same samples again. What each cell shows is the distance between two samples, that is, how different they are. The samples in the top left corner are exactly the same, so the distance is zero: there's no difference between them. But if two samples are very different, the distance can get up to 0.5 or even 0.6.

So what we're looking at is a direct comparison between every single sample and every other sample in your entire data set. As you might imagine, these distance matrices can get very, very large; this is a really small example, with only about 50 samples in the study. We don't usually look at the distance matrices themselves. Instead we visualize them, which we'll show you nearer to the end of this presentation.

You may be wondering where the numbers in this distance matrix come from. There are a lot of different metrics you can choose to evaluate beta diversity. One of the most commonly used metrics in the literature today is the UniFrac distance. Here's an example of it, shown as a PCoA plot, which we're going to show you how to make later. Each point represents the entire diversity of a single sample, and the closer two points are to each other, the more similar the microbiomes of those samples are. We can then color the points by different metadata to look for patterns in our data. In this case we used UniFrac to measure the distances, so every number in the distance matrix came from UniFrac, and then we plotted them. You can see that UniFrac really separates the free-living organisms from the vertebrate gut organisms; we don't see a lot of overlap. The distance between two free-living samples is probably pretty small, maybe 0.1 or 0.2.

The distance between a free-living sample and a vertebrate gut sample is much larger; it might be up around 0.6 or 0.7. But if we chose a different metric, the data might not separate as well, and we'd see some odd patterns along the vertical axes instead. So the choice of metric is really important, because each one shapes your data differently.

I'm going to go through just a few of the commonly used methods for calculating beta diversity, roughly how the math works for each, and some advantages and disadvantages. The main ones we're going to talk about today are Jaccard, Bray-Curtis, unweighted UniFrac, weighted UniFrac, and generalized UniFrac. UniFrac is the one we saw on the last slide, but even within that metric there are a lot of different choices for what you can do.

There are a few rules that any distance metric has to follow. First, the distance between one sample and another has to be greater than or equal to zero; you can't have negative numbers in a distance matrix, and you'll notice all of the numbers we saw were zero or above. Second, the distance between sample one and sample two equals zero if the samples are exactly the same, which is another thing we saw in that example matrix earlier. Third, the distance between sample one and sample two is the same as the distance between sample two and sample one, so it doesn't matter in which order we calculate them; in the distance matrix each comparison appears twice, identical across the diagonal, with the diagonal itself being zero. The fourth rule, the triangle inequality, is a little more confusing: the distance between sample one and sample two has to be less than or equal to the distance between sample one and sample three plus the distance between sample three and sample two. If you picture the three samples plotted as a triangle, no one side can be longer than the other two sides combined, which is something we're used to seeing in basic geometry. Don't worry too much about these equations if they're confusing; they don't matter as much if you're just applying the distance metrics. For me, though, knowing the fundamental math behind the decisions we're making helps me understand what we're doing a little better. This isn't going to be on the test, so to speak; it's just something to keep in mind.

Now we can get into what some of these metrics actually are. The first and simplest is the Jaccard distance. If you've already listened to the alpha diversity lecture, you know that the simplest measure of alpha diversity is richness: just counting the number of different microbes present in a sample. This is kind of the equivalent in beta diversity; it's based on how many microbes are shared between the two samples. If you're a visual learner, here's a picture of what this looks like. If the exact same microbes are found in both samples, the distance is zero, no matter how much of each microbe there is; it's purely presence or absence. If there's E. coli in both samples, it counts as shared regardless of abundance. If, on the other hand, there are no microbes in common between the two samples, the distance is one: they don't overlap at all. And if about half of the microbes in each sample are shared, the distance is 0.5. This one is really simple and really intuitive, which is nice.

For those of you with more mathematical minds, here's how that works. The equation is one minus the number of features present in both samples A and B, divided by the number of features present in either A or B, in other words the total across the two samples. Take this feature table; notice it's the same type of table we used in alpha diversity, not yet a distance matrix (that's what we're going to build). We want to compare 4ac2 and e375, which I'll just call yellow and blue. We count up the total number of features present in each sample, and again it doesn't matter how often a feature occurs; the number in the box doesn't matter as long as it's greater than zero. In the yellow sample we have four features, in the blue sample we have four features, and in total five distinct features are represented, so our denominator is five. For the numerator we count how many features are shared: feature one is shared, feature two is not, feature three is shared, feature four is shared, feature five is not. So we have three shared features, and the distance is 1 - 3/5 = 0.4. That's what fills in this cell and its mirror image, and of course the diagonal is all zeros, because there we're comparing each sample to itself. (I apologize, I forgot that there were animations.) So this is how the matrix fills in, and because it's symmetric the other side is exactly the same; remember rule three, which says the distance between sample one and sample two is the same as the distance between sample two and sample one. Something you'll notice about this matrix is that we don't actually see a lot of differences: our distances only range from 0.4 to 0.5.
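The Jaccard calculation we just walked through can be written in a few lines, assuming the counts from the example feature table (42, 0, 37, 99, 1 for the yellow sample and 12, 1, 22, 88, 0 for the blue):

```python
# Counts from the worked example's feature table (yellow = 4ac2,
# blue = e375). Only presence or absence matters for Jaccard.
yellow = [42, 0, 37, 99, 1]
blue   = [12, 1, 22, 88, 0]

def jaccard_distance(xs, ys):
    """1 - (features present in both) / (features present in either)."""
    shared = sum(1 for x, y in zip(xs, ys) if x > 0 and y > 0)
    total  = sum(1 for x, y in zip(xs, ys) if x > 0 or y > 0)
    return 1 - shared / total

d = jaccard_distance(yellow, blue)  # 3 shared of 5 total: 1 - 3/5 = 0.4
```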

There's not a lot distinguishing the samples, even though, glancing at the feature table, we might expect to see more differences, especially between this purple sample and some of the others. The purple sample has a lot more zeros, a lot of feature three, and only a little of feature four, whereas the others have a lot of feature four and only a little of feature three. So in this case Jaccard may not be the best distance metric to capture what makes the purple sample odd. Here we may want to also include some sort of evenness measure, because the number of sequences associated with each feature might be important to understanding this data set. If you remember, in alpha diversity we used the Shannon index to include both richness and evenness; our beta diversity equivalent is the Bray-Curtis distance.

For now, go ahead and ignore the math at the bottom, because the equation is a little more complicated than the Jaccard one, and just look at the visual building blocks of this metric. Each column is a feature (a bacterium), and the number of squares in a column is how often that feature shows up in a sample; we have a blue sample and a red sample. In the first comparison, blue and red have exactly the same features at exactly the same frequencies, so the distance is zero: the two samples are identical. At the other end, the two samples share barely any features. Feature one is present in the blue sample but not the red, feature two is present in the red sample but not the blue, and so on; our only overlap is feature five, with three occurrences in the blue sample and one occurrence in the red sample. Overall these are distinct enough that we call the distance about one: they barely overlap at all. Note that the abundances within each sample are spread about the same way (each sample has a feature with an abundance of four, for example), so the samples are not very different in evenness even though they have different richness. Then here in the middle is an example of a distance of about 0.5.

That's more like what we'd probably realistically see: only a couple of features don't overlap at all. Feature three is present at high abundance in blue and not at all in red, feature five is present at low abundance in red but not in blue, feature four is shared between the two, feature two is very similar between the two, and feature one is not that different. What we're seeing here is a distance of 0.5. This one is a little more difficult to see visually, but hopefully the math will help you understand it a little better.

The formula looks a little more complicated than the last one we saw. To break it down, the numerator is a sum over all features of the absolute difference between the two samples' counts, |x_iA - x_iB|, where x_iA is the frequency of feature i in sample A and x_iB is the frequency of that same feature in sample B. For feature one, that means the 42 in this yellow sample versus the 12 in this blue sample. We have to do this for every single feature: |42 - 12| is 30, |0 - 1| is 1, |37 - 22| is 15, |99 - 88| is 11, and |1 - 0| is 1. We add all of those together, and that gives us 58 for our numerator. The denominator is the opposite: the sum of the frequency of feature i in sample A plus the frequency of feature i in sample B, over every feature. So for the denominator we're doing 42 + 12 + 0 + 1 + 37 + 22 + 99 + 88 + 1 + 0.
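Here's that computation as a short sketch, using the counts from the worked example. Note that the standard Bray-Curtis formula takes the absolute value of each per-feature difference, so the feature with counts 0 and 1 contributes +1 to the numerator:

```python
# Counts from the worked example (yellow = 4ac2, blue = e375).
yellow = [42, 0, 37, 99, 1]
blue   = [12, 1, 22, 88, 0]

def bray_curtis(xs, ys):
    """Sum of absolute per-feature differences over the total count."""
    numerator = sum(abs(x - y) for x, y in zip(xs, ys))
    denominator = sum(x + y for x, y in zip(xs, ys))
    return numerator / denominator

d = bray_curtis(yellow, blue)  # (30 + 1 + 15 + 11 + 1) / 302, about 0.19
```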

That adds up to 302, so you don't have to try to do that math. When we're comparing the yellow sample and the blue sample, filling out this cell, the calculation is 58 divided by 302, which gives us about 0.19. We would do that for every pair of samples to fill out one side of the distance matrix, and remember, the other side is exactly the same. If we compare this back to the distance matrix we saw previously with the Jaccard distance, you'll notice that we actually see a lot more distinction between samples here. Remember I was saying it's interesting that we saw really similar distances when comparing sample five to the others, because it looks different to me. In this case, comparing sample five to the others, we see a much larger difference; we're at 0.65, 0.69, and 0.7, whereas if we're comparing, say, sample four and sample one, it's only 0.
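The contrast described here, where Jaccard reports similar distances but Bray-Curtis separates the odd sample out, is easy to reproduce. These toy counts are invented for illustration: both samples contain exactly the same four features, so Jaccard sees no difference at all, while Bray-Curtis picks up the very different abundances:

```python
def jaccard_distance(xs, ys):
    """Presence/absence only: 1 - shared features / total features."""
    shared = sum(1 for x, y in zip(xs, ys) if x > 0 and y > 0)
    total = sum(1 for x, y in zip(xs, ys) if x > 0 or y > 0)
    return 1 - shared / total

def bray_curtis(xs, ys):
    """Abundance-aware: sum of absolute differences over total count."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / sum(x + y for x, y in zip(xs, ys))

# Same four features in both samples, but in near-opposite proportions.
typical = [50, 40, 5, 5]
outlier = [5, 5, 40, 50]

jaccard_distance(typical, outlier)  # 0.0: the feature lists are identical
bray_curtis(typical, outlier)       # 0.8: the abundances are very different
```

This is exactly why the choice of metric matters: the same pair of samples can look identical under one metric and highly distinct under another.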