GTN Training - Transcriptomics - Single-cell RNA-seq - Filter, Plot and Explore

Jun 25, 2021 16:13 · 3763 words · 18 minute read

Welcome to this single cell analysis tutorial. My name is Wendi Bacon. I am a lecturer at the Open University as well as a visiting researcher at EMBL-EBI. This tutorial, I’m going to do on the Human Cell Atlas Galaxy instance. It should function on most Galaxy instances that I know of, and if not, try Galaxy.eu or Galaxy.org. I’m only using the Human Cell Atlas one because it’s the first one I ever used, so it’s my favorite. This tutorial is not going to be explaining a lot of the science; it’s just going to be a tool or resource for you to be able to follow along and get your parameters right. Or, if anything is going wrong, or you’re not sure where to click or what to do, hopefully watching this tutorial will help you.

For any of the scientific content, please read the tutorial itself. That is going to be your home for the information. Okay! So we start with getting data, and you can either use the input history and then import it, or I’m going to import that and the data object that was made in the previous tutorial. So this can sometimes take a while, so I’m gonna copy the link to that input history.

01:22 - And I’m just gonna import that. Cool. Okay.

01:34 - Rename the history and then make sure: is it .h5ad? Yes, it is. As part of the first question, you might be inspecting the AnnData object. And again, if you’re using the tutorial version, you can just… you can just click.

01:52 - So I can just click on that and then I’m sorted.

01:58 - And I want to do that a few more times. I want to see the ‘obs’ and the genes, okay? And then, we can look at all sorts of our lovely information.

02:13 - Okay, and then we’re gonna plot. I’m gonna copy the Keys over.

02:53 - And I want to do the same thing again. And then finally for ‘batch’.

03:12 - And then, I’m going to make some scatter plots.

03:21 - I’m just copying and pasting from the tutorial.

03:41 - And I mean, you can pretty much use anything in your ‘obs’ information. This, I just think, came out the cleanest in terms of being able to visualize on this dataset, because it’s quite a messy dataset, so it helps clarify where you might want to put your filters, etc. Okay, and then we’re naming our plots.

04:17 - That’s just for easy access, because you end up with quite a lot of plots on here.

04:41 - You… All right, well, now we have all the plots in the world! And you can also, if you want, you can enable ‘Scratchbook’, and then you can sort of look at the plots now. You can resize that. You can look at the plots side by side.

05:20 - So if I then click off this, and then I also have that, well, now I can sort of look at the two plots at the same time. And that’s quite handy when you’re using Scratchbook. So that’s a little…

a little handy utility within Galaxy. Okay, and then you’re going to be given lots of different questions to analyze what’s going on in each of the plots. And then, it’s time to apply the filters! So, it’s very easy to accidentally hit the ‘Seurat’ one here, by the way, when you’re searching, so make sure you don’t do that! I’m going to do the standard.

06:04 - And there are a few that automatically come up - we’re not using those, because they’re less clear, because the data is quite messy.

06:14 - So I do that. Oh… oh I see. I’m gonna plot next.

06:44 - And we’re looking at genotype just because that’s our most important variable. To be fair, you want all the variables to be as similar as possible, in terms of, like, sequencing depth and output. So that’s an important thing to keep in mind when you’re setting up! And then, I’m going to ‘Inspect’.

07:17 - And this is quite cool, because even if you examine it, you can see right here - without even having to do an ‘Inspect’ step - the cells and the genes that are left. And then, you can compare that with the original, right, which had a lot more cells, about the same number of genes. So, you can look at this, and see each of your different parameters, and what’s happened. You can compare it with the previous one, and say, “oh look at these different batches, how cool is that?” And I think, actually, the best way to set this up right now is to say: we want violin, genotype, so I’m going to enable Scratchbook, violin by genotype, right, and then I also want this violin as well, so I can see what’s changed.

So you can look and see. That’s not really done anything, let’s look over here.

08:30 - And we can see now, we’re seeing a much more significant change, because that’s what we were filtering by. We’ve sort of done a cut off of this bottom, for better or for worse, all right, and then now we’re going to filter ‘total counts’ as opposed to ‘gene count’. So, I’m just going to cheat: redo this step! I want to filter by ‘total counts’ rather than the ‘log1p of gene counts’, and we’re gonna go with 6.3. It’ll always be higher than your genes, because you are expecting to have multiple copies of a transcript… well not copies, not PCR amplicons, rather, but copies that the cell has made. It should have more than one transcript of a given expressed gene.

09:16 - So you would want to put your cutoff higher and then we repeat our plot.
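
To put a number on that cutoff: here is a minimal sketch in plain Python, assuming the 6.3 is on the log1p scale (like the gene filter before it). The cell names and counts are made up for illustration, and this is only the arithmetic behind the filter, not what the Galaxy tool runs internally. On that assumption, 6.3 works out to roughly 540 raw counts.

```python
import math

# Hypothetical cells: name -> total counts (made-up numbers for illustration).
total_counts = {"cell_a": 250, "cell_b": 600, "cell_c": 5000, "cell_d": 540}

LOG1P_MIN_COUNTS = 6.3  # cutoff from the video, assumed to be on the log1p scale

# log1p(x) = ln(1 + x), so expm1 turns the cutoff back into raw counts.
raw_cutoff = math.expm1(LOG1P_MIN_COUNTS)

# Keep only the cells whose log1p(total counts) clears the cutoff.
kept = {name: n for name, n in total_counts.items()
        if math.log1p(n) >= LOG1P_MIN_COUNTS}

print(f"raw cutoff is about {raw_cutoff:.0f} counts")
print(sorted(kept))
```

Converting the cutoff back with `expm1` is a handy sanity check when you are choosing a threshold off a log-scaled violin plot.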

09:32 - When I’m so zoomed in, there’s a lot more effort to do these, sort of, repeat steps! All right, and we’re there again, so I will rename everything.

09:47 - I know it’s a little bit of a faff to rename everything all the time. This is going to be a long tutorial, and it can be so confusing when you’re using the same tools over and over again - which we are! A lot of the single cell tools have a lot of stuff packed into them, a lot of abilities packed into them, and it can be so confusing when you’re looking back. So, the more you can get in the habit of managing your data well, the better your life will be! So, do it! Okay, and then I’ll add this ‘filter by counts’ in, and now I can actually - using my Scratchbook - see each phase of filtering and what happens! This is not a great look, if I’m honest.

To be frank, this is a better look, because you can at least see the bottom of the violin plot, which is a lot better… But you know, this is… I’m putting in quite low cutoffs, so I’m being very liberal with my cutoffs. That’s how I would put it, to keep as much as I can, because I know the data is messy and there was a lot of background. Okay, and now that we’ve done by genes per cell, by counts per cell - or you know the log versions of it - we’re gonna do the same thing with ‘percent counts mito’.

So, we know that mitochondrial RNA, etc., is a sign of stress, and we don’t want to keep our stressed out, angry, half dead cells. So, we’re gonna filter out our… Ah, well, this is easier, just do that! And five percent is pretty standard. Always, you know, read the text. It depends on the sample that you’re working with, but we’re going to go with 4.5 today just to be a bit contrary, really! I always think the ultimate test of your data is how well it survives weird analysis.
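
As a rough illustration of what the ‘percent counts mito’ filter is doing, here is a small plain-Python sketch. The cells and their counts are hypothetical, and the function name `pct_mito` is just something made up for this example; only the 4.5 cutoff comes from the video.

```python
# Hypothetical per-cell counts: (total counts, counts from mitochondrial genes).
cells = {
    "cell_a": (2000, 40),    # 2.0 % mito
    "cell_b": (1500, 90),    # 6.0 % mito
    "cell_c": (3000, 120),   # 4.0 % mito
}

MAX_PCT_MITO = 4.5  # the cutoff chosen in the video

def pct_mito(total, mito):
    """Percentage of a cell's counts that come from mitochondrial genes."""
    return 100.0 * mito / total

# Cells above the cutoff are treated as stressed and dropped.
kept = [name for name, (total, mito) in cells.items()
        if pct_mito(total, mito) <= MAX_PCT_MITO]
print(kept)
```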

11:46 - And, as always, re-labeling! And then, we can again look at all of our plots. One, two, three… okay! And the last step we’re gonna do is… ‘FilterGenes’! Now, the first time I ran this protocol, I got a bit lackadaisical and thought, “Oh, it probably doesn’t matter the order in which I do these steps, filtering’s filtering, right? No big deal!” So I went rogue! I went rogue, and I did the ‘filter gene’ step first - like a fool - and basically, this means that later on, once you filter out all these cells, you’ll end up with a bunch of genes that don’t have any cells associated with them! So you have these, like, empty columns in your data set, and later on (much later on) a bunch of the tools will break because they can’t handle the fact that some genes have nothing in them.

So, don’t make my mistake of filtering genes first! Make sure you do all of your cell filtering, and THEN hit ‘FilterGenes’.
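
Here is a toy sketch of why the order matters, in plain Python. The count matrix, the cell and gene names, and the helper functions are all hypothetical; the point is only that a gene detected solely in a low-quality cell passes the gene filter if it runs first, and is then left empty once that cell is removed.

```python
# Toy count matrix: counts[cell][gene]; values are made up for illustration.
counts = {
    "good_cell_1": {"GeneA": 5, "GeneB": 3, "GeneC": 0},
    "good_cell_2": {"GeneA": 2, "GeneB": 0, "GeneC": 0},
    "bad_cell":    {"GeneA": 0, "GeneB": 0, "GeneC": 4},
}

def genes_detected(matrix, min_cells=1):
    """Genes with a non-zero count in at least `min_cells` cells."""
    genes = {g for row in matrix.values() for g in row}
    return {g for g in genes
            if sum(row[g] > 0 for row in matrix.values()) >= min_cells}

def drop_cells(matrix, cells_to_drop):
    """Return the matrix without the listed cells."""
    return {c: row for c, row in matrix.items() if c not in cells_to_drop}

# Wrong order: filter genes first, then remove the low-quality cell.
genes_first = genes_detected(counts)
after_cells = drop_cells(counts, {"bad_cell"})
empty_genes = genes_first - genes_detected(after_cells)

# Right order: remove cells first, then filter genes.
cells_first = genes_detected(drop_cells(counts, {"bad_cell"}))

print("left empty by the wrong order:", sorted(empty_genes))
print("kept by the right order:", sorted(cells_first))
```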

13:28 - I always love making these videos, because making the videos takes, you know, as long as it takes you guys to get through the tutorial, but then the final video output at the end (when you edit out all of the time of waiting for stuff) is much shorter! And it seems so efficient! So if steps are taking longer than you’re seeing in the video, it’s because I edit out all of the other time waiting. All right, onward with the process! You know what, I’m just gonna… I’m just gonna… if I search Scanpy here, usually… it’ll be a lot easier for me to find all the stuff that I want. Although, if you’re using the ‘tutorial mode’, then you’re pretty much good to go! Okay, normalizing some data? Yes, number 20. Do that! And then, I’m just going to set up the next one, because it’s kind of a sequence of pretty standardized (mostly) steps.

14:22 - Although there’s a whole bunch of parameters here you can change if you wish. Okay! And then ‘ScaleData’.

14:36 - Truncate to 10. And then, we’re going to ‘RunPCA’. So, this is our first, well, I guess ‘FindVariableGenes’ you’re also downsampling a bit, but now we’re properly going to start reducing the size of our data, reducing the dimensions. I always double check the inputs on this, because you end up with quite a big history, so I do recommend that! Yes, I want that. And we use 50, so we’ll show 50.
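
For a feel of what “Truncate to 10” does in the scaling step, here is a minimal sketch using only the Python standard library. It assumes scaling means a per-gene z-score using the sample standard deviation, with only the upper end capped; those details and the toy gene are assumptions for illustration, not a description of the tool’s internals.

```python
import statistics

MAX_VALUE = 10  # "Truncate to 10" from the video

def scale_gene(values, max_value=MAX_VALUE):
    """Z-score one gene across all cells, then cap anything above max_value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [min((v - mean) / sd, max_value) for v in values]

# Hypothetical gene: silent in 199 cells, expressed in one.
expression = [0.0] * 199 + [1.0]
scaled = scale_gene(expression)

print(f"silent cells: {scaled[0]:.3f}")
print(f"expressing cell: {scaled[-1]}")
```

Uncapped, the single expressing cell would sit at a z-score of about 14, so the cap stops one outlier from dominating the PCA that follows.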

15:30 - Okay, and I will rename that. I don’t know if you can hear it, but it’s a wonderful sunny day! Some lovely birds singing in the background… mocking me… All right, and then you can decide, “Oh, where do I want to put my cut-off?” How exciting! Okay, on to our ‘compute graph’! No space! Right, so this is where we’re defining our ‘nearest neighbor graph’ and trying to put everything on, like, a single graph, as opposed to 1001 dimensions.

Okay, um, let’s just check - I want 15 neighbors? Sure! And then the number of PCs? We’re going to use 20, because that’s what we’ve determined from the previous plot, or at least such is what we think! And then, we can also do a bit of… get the rest of our plots calculated, right? So, different ways of reducing dimensions. Different places you’re going to be on your XY graph.
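
If the ‘nearest neighbor graph’ feels abstract, this is the basic idea in a few lines of plain Python. It is a simplified sketch with hypothetical cells in a 2-PC space and k=2, using plain Euclidean distance; the video uses 15 neighbors and 20 PCs, and the real tool does considerably more than this.

```python
import math

# Hypothetical cells with coordinates in a 2-PC space (the video uses 20 PCs).
pcs = {
    "cell_a": (0.0, 0.0),
    "cell_b": (0.1, 0.2),
    "cell_c": (0.3, 0.0),
    "cell_d": (5.0, 5.0),
    "cell_e": (5.1, 4.9),
}

def nearest_neighbors(coords, k):
    """For each cell, list its k closest other cells by Euclidean distance."""
    graph = {}
    for cell, point in coords.items():
        others = [(math.dist(point, other), name)
                  for name, other in coords.items() if name != cell]
        graph[cell] = [name for _, name in sorted(others)[:k]]
    return graph

graph = nearest_neighbors(pcs, k=2)  # the video uses 15 neighbors
for cell, neighbors in graph.items():
    print(cell, "->", neighbors)
```

Cells that keep turning up in each other’s neighbor lists are the ones the clustering step will later tend to group together.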

16:49 - Oh perplexity! Yeah, so, perplexity will be 30, but you can change it if you’re working in a group.

16:57 - Yep, and it is working on 29. And we also want ‘RunUMAP’. These other ones, ‘PAGA’ and ‘RunFDG’ are more looking at trajectories, so that’s a separate tutorial - the next one in fact, if you’re following along! So now I have a lovely UMAP, but we’re not going to get any pretty plots out of this yet. Because, it’s just calculating the coordinates, all right? And now - with those coordinates - we can now try and calculate, all right, if those things are right next to each other on this nearest neighbor graph.

What’s the likelihood that they’re a cluster? Let’s start calling some clusters. I’ll use ‘Louvain’, lots of people now are using ‘Leiden’.

17:42 - And we’re going to use this resolution, because I know this data set, and it’s not the cleanest. We don’t want to make too many clusters, because we’re… the more detail you have per cell, the more you can trust, the more clusters, is what I would say. But, it also depends on your sample. If it’s a super homogeneous set, then you also don’t want to be calling a whole bunch of clusters where there are not, so you kind of have to take a few things into account on your cluster calling.

All right, we have our lovely clusters! So, let’s now do the fun bit, which is figure out, why a cluster is a cluster? So, let’s look at the genes that make it so! So, sometimes, you just need to hit that. I don’t know if that glitch has been fixed yet, but for whatever reason, you do need to click that sometimes.

18:35 - Loads of parameters you can change there! So it will, sort of, automatically make clusters by louvain clustering. We might also be interested in if there are any differences across genotype, and then, you know, the ‘ManipulateAnnData’ tool could let you filter out specific clusters. Then you could just compare those to each other, like, there’s a lot of fun that you can have with the ‘FindMarkers’ to compare different things! Okay, and again, this is actually particularly important.

You want to make sure that you re-label these things appropriately because you’ve got four things that look like ‘FindMarkers’, and the other side of this is, it will store the result of your ‘FindMarkers’ within the object. Oh! I’ve just done that wrong! Oops! Okay, it will store that information within the object, and we are more interested in… Yeah, so here’s your marker table, and this is, this… We’re more interested in the cluster comparisons rather than the genotype, because the genotype is sort of like glorified bulk RNA-seq because you’re just smashing everything together.

So, I’ll just, I’ll just get rid of that one, so that I keep my final object anyway. You want to, so, yeah, you want to keep stored within your AnnData object the results of the comparisons between clusters.

20:00 - All right! Now we’re going to do a little jiggery pokery! Final object, and I want ‘variable’ information.

20:17 - Right, so we want this information because, if you look at it, “Oh, it has the Ensembl ID, which is the most accurate way of counting transcripts… but the symbols are really what we’re going to be talking about when we’re trying to understand it from a biological point of view, and right now our lovely cluster and genotype tables, they only have the Ensembl ID as their label!” So we’re going to do a bit of jiggery pokery to make that work! ‘Join two datasets side by side’: I want to specify the field, and their column - four is the one that has that Ensembl ID.

Yes, that’s what we want! And column two. That’s: yes, yes, no, yes.

21:26 - Now at this point, I strongly recommend checking, because sometimes the order will be a little bit bizarre, so you just want to make sure you have the right number.
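
The join and the check that goes with it can be sketched like this in plain Python. The table layouts, IDs and symbols are placeholders invented for the example (they are not real Ensembl IDs or the real column order), so treat it as the idea rather than the exact tables you will see.

```python
# Hypothetical marker table rows: (cluster, gene ID, score). IDs are placeholders.
markers = [
    ("0", "ENS_ID_2", 12.3),
    ("0", "ENS_ID_1", 9.8),
    ("1", "ENS_ID_3", 4.1),
]

# Hypothetical gene annotation rows: (gene ID, symbol), in a different order.
gene_info = [
    ("ENS_ID_1", "SymbolA"),
    ("ENS_ID_2", "SymbolB"),
    ("ENS_ID_3", "SymbolC"),
]

symbol_by_id = dict(gene_info)

# Join on the ID column: each marker row gains the matching symbol.
joined = [(cluster, gene_id, symbol_by_id.get(gene_id, "NA"), score)
          for cluster, gene_id, score in markers]

# The check recommended in the video: same number of rows, nothing unmatched.
unmatched = [row for row in joined if row[2] == "NA"]
print(len(joined), "rows joined,", len(unmatched), "unmatched")
for row in joined:
    print(row)
```

Because the join goes by ID rather than by row position, it does not matter that the two tables list the genes in a different order.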

22:00 - All right, and then I want to rename these tables! Should have ‘genes’ and then ‘symbol’. Awesome! And I know that it’s the shorter one. It’s going to be when you’re splunging everything together.

22:16 - It’s going to be when you’re comparing across genotype. Really, the most interesting one is when it’s by cluster.

22:28 - All right, and you know what? Just for my own peace of mind, I’m just gonna…

22:36 - do that. I’m gonna hide them. And I’ll have a nice, nice history. And now the best bit of all! It’s time to plot! So we get to see all of our lovely hard work! Yes! Final object! Oh yes, we’re going to start with PCA using our predefined knowledge. We’re going to be plotting by a whole bunch of different things.

23:18 - Lots of different bits and bobs for changing how your plot looks, essentially, and then do the same thing for tSNE.

23:33 - and UMAP. No! Autocorrect mocking me… And we’re there! Oh there’s buckets and buckets of information you can now get from these images. It would help if I zoomed out a bit. All right, so buckets and buckets of information you can analyze and think about and interpret, and I strongly recommend that you do that! You know, the more, sort of, time you spend trying to get into the mind of, “why might you want these plots? what might they be telling you?”, the more easily you’ll be able to direct your questions, direct your analysis.

Yeah! But we’re going to move on to the annotating clusters step.

24:27 - We’re going to rename our categories. We’ve cunningly been able to figure out exactly what each cluster is, looking at their marker tables and our known marker genes.

24:44 - And so, rather than have it be named “Cluster 0”, we’re gonna give them their actual names, cell type names. And then, we don’t want to necessarily delete that. Especially, you know, if you look at the marker table, it will be “Cluster 0”, “Cluster 1”. So, it would be nice - although I suppose you could just run the marker table again using the new categories and that would work too - but we’ll just add them back in.

25:24 - So now we’re copying that cluster annotation back into your original object, and then that means we essentially have “louvain” and “louvain_0”, so that’s not ideal. And now it’s called ‘louvain_0’, and we don’t really want that. Actually, I’m gonna get a fresh one, so they don’t accidentally repeat the same thing again. So, we don’t want it to be called ‘louvain’. We want it to be called ‘celltype’, so this second ‘louvain’ category is getting changed! There we go! And now we have ‘louvain’ and ‘celltype’, so that’s a lot nicer.

So, we’re gonna rename this our ‘final cell annotated object’.
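
What we have ended up with is two labels per cell: the original cluster number and a cell type name. A minimal plain-Python sketch of that bookkeeping is below; the cells, cluster numbers and type names are all hypothetical stand-ins, since the real names come from reading the marker tables.

```python
# Hypothetical cluster labels per cell, as produced by the clustering step.
louvain = {"cell_a": "0", "cell_b": "1", "cell_c": "0", "cell_d": "2"}

# Hypothetical names; the real ones come from the marker tables.
cluster_names = {"0": "type_X", "1": "type_Y", "2": "type_Z"}

# Keep the numbered clusters AND add the names as a separate 'celltype' label,
# so marker tables that still say "Cluster 0" can be matched up later.
obs = {cell: {"louvain": cluster, "celltype": cluster_names[cluster]}
       for cell, cluster in louvain.items()}

for cell, labels in obs.items():
    print(cell, labels)
```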

26:34 - So, if we want to now plot that - so that we get our lovely labels - we can re-run one of these, and just run it on the final object.

26:44 - And then we can add in ‘celltype’. Or, you can switch ‘louvain’ to ‘celltype’. It’ll color the same way, it’ll just label it differently.

27:00 - And now, if we look at the plot, it’s labeled by ‘celltype’ rather than number, and that’s cool! And there’s whole heaps of information, again, across all of these different plots, and what you can interpret, so please do take some time and do that! The last bit is when we’re looking at some of our interactive visualization. And so, if we go over here, to our UCSC cell browser.

27:26 - Oh, choose the format - so it’s Scanpy that we’ve been using. Yes, our final annotated object. Sure, yeah, ‘louvain’ is fine, we can execute that.

27:41 - And we’re up and running! So, now I can hit ‘view data’. Oh, I’ve zoomed far too far in! There we go. Okay, and then - oh this is brilliant, and it’s something you’re going to want to spend some time playing around with! You can look at all sorts of different visualizations. You can color by different things.

28:09 - All right, now we’re coloring by genotype! You can color the batch. Yeah, it’s a lovely thing to mess around with, and to be able to interrogate your data. So, spend a bit of time mucking around and seeing all the wonderful things you can do in this! It’s also nice because you can just share your history, and someone can immediately click on this, and start playing around in it just the same way you were… which is awesome! All right, yeah, we’re going to do that.

All right! And it’s finally changed, so I’m ready. So, I’ll click here to display, and this is going to take you into it. Always makes me do this… that’s fine, the interactive viewers for whatever reason always catch out my security stuff. Oh well. I will find, sometimes, if this happens to you, and it says “proxy target missing” - I don’t know why it glitches this way, it is - I promise - it’s worth the pain, okay? Just run it again, and then leave it for a few minutes before trying to look at it.

As I said, if this plays you up, or if it says “proxy target missing” or something, just exit; leave it for a minute; and then come back and do it again; and then leave it for like five minutes; and then usually you’ll get this to happen, which is pretty great! So cool! This is a whole world to explore, if I’m honest. There’s all sorts of cool things you can do with it, like… Okay, I want, you know, I want to color it by batch; I want to color it by cell type; or genotype.

Right, oh, that’s… this is interesting when you color it by genotype. And then now, I’ve created a little population, and I can click it. There’s all sorts of wicked stuff you can do to explore. And this is just nice, because it means you’re exploring your data without having to recreate plots left, right, and center! And then you, kind of, pick which plots show what you’re looking at, and what you can investigate. And it’s just a really nice exploration tool when you’re trying to interpret your data.

All right, and then just make sure to come here and hit stop when you’re done looking through that tool. And that brings us to the end of this tutorial, so I hope you had a great time!