GTN Smörgåsbord - Day 2 - Transcriptomics - Introduction to RNA-Seq

Mar 18, 2021 09:35 · 4097 words · 20 minute read

Hi everyone. Today we will be discussing a bit about transcriptomics and how to do RNA-Seq data analysis using Galaxy. This is going to be a very short introduction to transcriptomics, so that we can identify a few of the basic concepts before we move on to the hands-on parts of the day. Before going through the details here, which you can find on the Galaxy training page, it would be nice also to have an introduction to Galaxy analysis and how it works, and to how to do overall sequence analysis; there are some very good slides about quality control and mapping that you can find there.

So first and foremost, a key question to address is: what is actually RNA sequencing? For this I will first do a quick introduction about what RNA is. This might be very familiar to a lot of you, so I will just give a very brief context here. If you look at the DNA at this level, what you actually have are a lot of different parts that comprise a gene. You might have enhancers and promoters, you have the main part of the gene with the “open reading frame”, the part which actually leads to the protein, and you have additional elements on the right and left side of the gene.

So transcription transforms your DNA into the pre-mRNA, which comprises only the exon parts and the intron parts of the gene. So the rest of the area around the gene is cut off, if you like, and the main part is left here, the red part. The post-transcriptional modifications of the pre-mRNA lead to the mature mRNA, and this contains only the exon parts. The intron parts, the grey parts of the gene, are cut out, and only the protein-coding region of the mRNA is kept.

This part is what can actually be translated into a protein, which does the actual work. In any case, RNA is in a nutshell the transcribed form of the DNA, and it is what can be used to produce a protein that will carry out the activities within your cells later on. When we talk about RNA sequencing, it’s basically the part where we take the mRNA and we try to quantify it. What it achieves is RNA quantification at single-base resolution. So we start from the DNA, as we said earlier, and we have the pre-mRNA.

When we’re talking about the mRNA part, this is what is retrieved during the RNA sequencing process. During the library prep, this is fragmented into RNA fragments. Reverse transcription takes place, and the cDNA that is produced is usually sequenced through high-throughput sequencing, and this is the part that you actually get at the end. So the RNA-Seq process is a cost-efficient way to analyze the whole of the transcriptome of a particular cell or a particular sample in a high-throughput manner.

So let’s look at the process in a bit more detail and ask where your data is actually coming from. This is where you take your cells, you extract the RNA, and depending on whether you’re talking about mRNA or small RNA, you have different processes. In all those cases, what you eventually end up with is a library where you have the fragments of the RNA listed here, and these are going to be sequenced. And the sequenced part is what you get as your input data.

So moving away from the biology part, and going directly to the RNA sequencing: what is the main principle of RNA sequencing? This is one of my favorite comic strips, if you like. You have the scientist here saying: I want to get the transcriptome of everything that I have, all the different cells that I have in my samples, or different samples. So what actually happens is you have the transcriptome, and you basically shred it.

You have your mRNA fragmented into literally millions, or even hundreds of millions or billions, of different small pieces. And what you get from the sequencing machine is this whole mess of colored strips. The RNA sequencing bioinformatics approach, the computational approach here, is to take all those shredded pieces of paper and try to reconstruct the original red piece of paper. And as you can imagine, because it’s definitely not an easy process, it might have a few mismatches here and there, which means that what you get at the end might have some inconsistencies, like these yellow and blue parts in the red area.
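To make the shredding picture a bit more concrete, here is a minimal Python sketch. It is a toy illustration only, not how real sequencers or mappers work: the sequence, the read length, and the helper names `shred` and `place_reads` are all made up for this example. It cuts a short transcript into overlapping reads, introduces one simulated sequencing error, and places each read back at its best-matching position while tolerating one mismatch.

```python
def shred(transcript, read_len, step):
    """Cut a transcript into overlapping reads of fixed length."""
    return [transcript[i:i + read_len]
            for i in range(0, len(transcript) - read_len + 1, step)]


def place_reads(reference, reads, max_mismatches=1):
    """Place each read at the offset with the fewest mismatches.

    Returns (read, position, mismatches); position and mismatches are
    None when no offset is within the allowed number of mismatches.
    """
    placements = []
    for read in reads:
        best_pos, best_mm = None, None
        for pos in range(len(reference) - len(read) + 1):
            window = reference[pos:pos + len(read)]
            mm = sum(a != b for a, b in zip(read, window))
            if mm <= max_mismatches and (best_mm is None or mm < best_mm):
                best_pos, best_mm = pos, mm
        placements.append((read, best_pos, best_mm))
    return placements


# Hypothetical toy transcript (21 bases), shredded into 8-base reads.
transcript = "ATGGCGTACGTTAGCCTAGGA"
reads = shred(transcript, read_len=8, step=4)

# Simulate a single sequencing error in the second read (A -> T).
reads[1] = reads[1][:3] + "T" + reads[1][4:]

placements = place_reads(transcript, reads)
for read, pos, mm in placements:
    print(read, pos, mm)
```

The read with the simulated error still lands in the right place, but with one mismatch recorded. That is the kind of inconsistency the computational process has to keep track of.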

So part of the computational process is also to ensure that such errors are identified, or at least kept in mind. So what are the actual challenges of RNA sequencing? There are three main points. The first is that when you do the sequencing, what you have as a sample might be completely different from, or have some very notable differences with, the reference genome that we will be using to create the mapping and to quantify those RNA sequences.

The second might be noise. In other words, you might not have a clean extraction of your RNA; you might have additional material here, fragments of other RNA that are present when you do this process and are consequently sequenced. And finally, this is all a bit of chemistry, which means that there might be some sequencing biases, such as PCR over-amplification or errors, or anything else that comes into play during the preparation part.

So all those are challenges that one needs to be aware of before doing the actual RNA-seq analysis. But beyond the challenges, there are also a lot of benefits. The main one, as I mentioned earlier, is that it’s cost-efficient and high-throughput: it allows us to get a lot of information in a relatively short time and at a low enough cost. It allows us to have a very good understanding of the quantities of RNA that exist in a particular sample, and to identify splicing points, novel transcripts and gene fusions, and all in all to have a better understanding of what is happening at the molecular level.

So if you look at the actual questions that are being addressed using RNA-Seq, you have two main applications, if you like. The first addresses the question: what are the actual RNA molecules that I have in my particular sample? So this is the transcript discovery part, and in this case the primary goal of RNA-seq is to identify novel isoforms, alternative splicing points, fusion genes, potential single nucleotide variations, and so forth.

So the main focus here is to identify and annotate the RNA molecules that you find. The second question is: what is the concentration of the different RNA molecules in my sample? So here the particular point is to quantify RNA. We can aim for absolute expression: if we look within a particular sample, we want to understand the differences in expression between different genes, for example. Or, if you’re looking between different samples, or different groups of samples: what is the differential expression of genes there? So these are the two main applications for RNA-seq.

What we are going to be seeing is mostly the RNA quantification part, but at some points I’ll be highlighting how transcript discovery can also be applied in the same context. So: how to analyze RNA-Seq data, aiming for RNA quantification? Roughly the process is as follows: what you get from the sequencing is basically a lot of fragments of the different molecules. And after you sequence those, you try to map them onto a particular genome.

And you have those black lines here, corresponding to each read. And you might identify cases here where some part of a particular read is mapped to a particular gene, or a particular exon of a gene, and another part of the same read is mapped to a different part. I will be covering this in a moment. But eventually what you do is map your reads onto a reference, a genome in this instance, and then you start counting. You count how many reads, or how many layers in this instance, you find for a particular gene.

So for example, for the purple gene here, you find basically one layer. For the yellow one you find one, two, three, or two if you see these as straight lines. And for the blue one here you have one, two, three. So counting how many layers of reads you have on a particular gene is one way of quantifying. There are different ways, and hopefully I’ll address a few of them here. So the data processing pipeline is basically those five steps.
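The counting idea from the slide (one read on the purple gene, three on the yellow, three on the blue) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the coordinates are invented, intervals are half-open, strand is ignored, and a read is counted for every gene it overlaps. Real counting tools apply more careful rules.

```python
# Hypothetical gene models: name -> (start, end), half-open coordinates.
genes = {
    "purple": (100, 200),
    "yellow": (300, 450),
    "blue": (600, 800),
}

# Hypothetical mapped reads as (start, end) intervals on the same reference.
mapped_reads = [
    (110, 160),                          # overlaps purple
    (310, 360), (330, 380), (400, 449),  # overlap yellow
    (610, 660), (650, 700), (720, 770),  # overlap blue
    (900, 950),                          # overlaps no annotated gene
]


def count_reads_per_gene(genes, reads):
    """Count how many reads overlap each gene interval."""
    counts = {name: 0 for name in genes}
    for r_start, r_end in reads:
        for name, (g_start, g_end) in genes.items():
            # Two half-open intervals overlap if each starts before the other ends.
            if r_start < g_end and r_end > g_start:
                counts[name] += 1
    return counts


counts = count_reads_per_gene(genes, mapped_reads)
print(counts)
```

The last read falls outside every annotated gene and is simply not counted. This is one reason why the quality of the annotation matters for quantification.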

So you have basically a set of reads, either single-end, so only the forward end, or paired-end. And you have multiple sets of reads for multiple samples, for the control, for example, and for the treatment. And although there is no standardized workflow for RNA-seq, like a gold standard that everyone uses, there are a lot of best practices and some standard ideas that can be used for every data set. And these are basically the steps that correspond to them.

So after you get these files from the sequencing facility, the first thing to do is basically some quality control. This is already covered in a different lesson in Galaxy; you’ve already seen this yesterday, on day one of this event, so I will not be covering it in more detail. Then, after mapping to a reference genome, you take some annotation, which gives you information about how your reads match particular features, like genes.

You can then do the read counting. And eventually, from this process, what you get for every one of those samples is a count table. Having those multiple tables at the end, per group for example, you can ask different questions. And one of the most common ones is to do differential expression analysis.

If you look at the data pre-processing, the part right here, this is a single step where you try to refine your data, and to ensure that whatever information goes onwards from there is clean enough and minimizes the noise, and also potential errors that may come in from the sequencing part. So the first thing is to do some adapter clipping. If there are any adapters that have been left over from the sequencing, this is a good step to actually remove them. And also to do some quality assessment: if there are any low-quality reads, or low-quality bases, you can remove them or trim them, depending on the strategy that you’re going to be using.
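As a rough sketch of what adapter clipping and quality trimming do to a single read, consider the Python below. It assumes an exact adapter match and Phred+33 quality encoding, and the function names, the example read, and the threshold of 20 are choices made for this illustration. Dedicated trimming tools also handle partial adapters, mismatches, and paired-end reads.

```python
def clip_adapter(seq, qual, adapter):
    """Remove the adapter and everything after it (exact match only)."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Hypothetical read: 10 bases of insert followed by leftover adapter sequence.
adapter = "AGATCGGAAGAGC"
seq = "ACGTACGTCC" + adapter
# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
qual = "IIIIIIII##" + "I" * len(adapter)

clipped_seq, clipped_qual = clip_adapter(seq, qual, adapter)
trimmed_seq, trimmed_qual = trim_low_quality_tail(clipped_seq, clipped_qual)
print(clipped_seq, trimmed_seq)
```

Whether to trim such bases or to discard the whole read is exactly the kind of strategy decision mentioned above.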

The key part, and this is one that I will focus on a bit more, is how to do the annotation of these RNA-seq reads. So you have a lot of different fragments, these small black lines, and the question is: okay, how can we figure out where those came from in our reference genome? So it’s basically a mapping process, but it’s not always a straightforward or easy one. Keep in mind that what we have as input is basically fragments of this mRNA, of this whole orange piece of information.

But if we try to map this onto the reference genome, you actually have a few blanks, if you like, the black areas: these are the introns, which have been removed from the mRNA before sequencing. So if we map those reads to the mRNA, you expect something like this: everything is going to be mapped across the entire sequence of the transcript. However, if you try to do this onto your genome, you might have cases, those highlighted in red here, of reads that fall between two different exons.

If you map them at the genome level, they appear to have a gap. So one part of the read is in one exon, then you have a big gap which corresponds to the intron, and the second part of the same read is mapped to the next exon. So this is a useful piece of information. It is also a challenge, and one of the underlying challenges that different mappers need to address. But it’s something to always consider, especially if you are thinking about identifying novel transcripts, alternative splicing points, fusion genes, and so forth.
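The gap that a spliced read shows on the genome can be illustrated with a small coordinate conversion. The sketch below assumes a hypothetical three-exon gene with made-up coordinates and half-open intervals, and the helper names are illustrative only. It takes a read's position on the mature transcript and projects it back onto the genome, splitting it into blocks wherever it crosses an exon boundary.

```python
def project_to_genome(exons, t_start, t_end):
    """Project a transcript interval [t_start, t_end) onto genomic blocks.

    exons: list of (genome_start, genome_end) half-open intervals,
    in transcript order on the forward strand.
    """
    blocks = []
    offset = 0  # transcript coordinate of the first base of the current exon
    for g_start, g_end in exons:
        length = g_end - g_start
        lo = max(t_start, offset)
        hi = min(t_end, offset + length)
        if lo < hi:
            blocks.append((g_start + lo - offset, g_start + hi - offset))
        offset += length
    return blocks


def gap_lengths(blocks):
    """Sizes of the genomic gaps (introns) between consecutive blocks."""
    return [nxt[0] - cur[1] for cur, nxt in zip(blocks, blocks[1:])]


# Hypothetical gene with three exons of length 100, 50 and 200.
exons = [(1000, 1100), (1500, 1550), (2000, 2200)]

within_exon = project_to_genome(exons, 10, 60)    # read inside exon 1
spanning = project_to_genome(exons, 80, 180)      # read crossing two junctions

print(within_exon, gap_lengths(within_exon))
print(spanning, gap_lengths(spanning))
```

The first read is contiguous on the genome, while the second one appears as three blocks separated by two intron-sized gaps. A genome mapper has to be able to recover alignments of this kind.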

Spliced reads like these are one way that those things can be identified. So if you look at the mapping, going back to this particular process, there are three main ways of dealing with that. The first approach is to map directly to the transcriptome. So having this particular piece of information, the transcriptome itself, we map our reads directly here, and this is straightforward. The second strategy is to map to the genome. So if we don’t have, or we don’t want to use, the transcriptome, the second way is to map those reads onto the genome itself.

Again, with the challenges that we’ve identified earlier. A third strategy is to do a de novo assembly: take the reads, try to reconstruct a transcriptome, and use that as the reference for counting. I’ll go through those in a bit more detail. Starting with transcriptome mapping: this is the easiest one to achieve, because this is what you have, a transcriptome. You have basically exon one, exon two, exon three for a particular transcript.

And what you do: you take your reads, you align them directly to the transcriptome, and even if it’s paired-end, you can see how they can be aligned to different parts. So this is easy enough to achieve, but it has two main disadvantages, if you like. The first is that you really need to have a reliable gene model. So what you use as a reference transcriptome needs to be reliable enough. These exist for a lot of the reference organisms and different species out there, but if you are working with a not-so-common species, that might be a bit more difficult to achieve.

Also, if you want to detect novel genes, this is not possible, because what you’re doing here is aligning your reads to your existing transcriptome, so already known transcripts of genes. So novel genes and novel isoforms are not going to be identified through this process. The second strategy, as I said, is to use genome mapping. This is a bit more difficult to achieve, more challenging. As I said, if you have a paired-end read, the first read would be easy to align in this instance, because it maps completely within exon 1, but the second read actually spans three exons: a bit of exon 1, the entire exon 2, and a little bit of exon 3.

So this one would be a bit harder to align, but it has a very distinct advantage: you are able to identify splicing points, and also to detect potentially new genes and new isoforms. So this is the advantage of genome mapping. Both of them, as you can understand, have a very common theme. In both cases you require a high-quality reference genome or reference transcriptome, ideally in FASTA format. And in order to have a good annotation of the regions of this reference genome or transcriptome, you also need the annotation of the known genes, usually in GTF file format.

But there are additional formats that are equally compatible here. And both those pieces of information are relatively easy to find, especially if you are aiming for the most studied organisms, like human, mouse, and so forth. Some projects and organizations that are actually producing and maintaining annotations include EMBL-EBI, UCSC, RefSeq, Ensembl, and so forth. So you can look into these projects and organizations and retrieve those files.
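For those who have not looked inside an annotation file before, here is a minimal Python sketch of reading one GTF record. GTF is a tab-separated format with nine columns, the last one holding `key "value";` attributes. The example line and its identifiers are invented for illustration, and the parser is deliberately simple: it would not cope with semicolons inside quoted values, for instance.

```python
def parse_gtf_line(line):
    """Parse one GTF record into a dictionary (simplified sketch)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF record has exactly 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields

    attrs = {}
    for item in attributes.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),   # GTF coordinates are 1-based, end-inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attrs,
    }


# Hypothetical exon record; the gene and transcript IDs are made up.
example = ('chr1\texample\texon\t1000\t1100\t.\t+\t.\t'
           'gene_id "GENE1"; transcript_id "GENE1.1";')

record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attributes"])
```

The `gene_id` and `transcript_id` attributes are what let a counting step decide which gene or transcript a mapped read should be assigned to.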

And so if neither of those two strategies works for you, or if you don’t have a reference genome or you don’t need one, then the third strategy is the one that might work. In this case what you do is assemble your reads into transcripts: you do a de novo assembly. And then you use whatever is produced as a reference to map your reads back, and actually do the quantification. If you aim only at identifying the transcripts, so putting together the list of the individual molecules that you found, this first step is sufficient.

But if you want to do a quantification as well, which is our goal here for this particular introduction, then you need to map your original reads back to your transcripts, so you can have a quantification of the reads for each transcript. So these are the strategies for the mapping part. And the next step, once you’ve done the mapping, is to actually do the quantification. Which is again addressing the question: what is the expression level of the genomic feature that we’re looking at? So if we want to count the number of reads per feature, that is relatively easy.

If we have the features, and we have the mapped reads, we count them, depending on how they are structured. But there are also challenges. The first is one that we have already touched upon a few times: if we have reads that are mapped to multiple locations, what will happen in this case? So if you have, for example, repeat regions, you might have a read that is aligned to multiple locations. The read could be coming from multiple different regions, so the aligner, the mapper itself, will propose that this particular read can be mapped here, and here, and here.

So how to address this is a question that needs to be decided upon during the analysis. Also, a different question is: if we want to do a quantification of features, how do we want to distinguish the different isoforms, for example? Are we going to do this at the gene level? Are we using the different transcripts? Or are we going to do this at the exon level? So all those are different questions that need to be addressed before doing the quantification itself.
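Two common ways of dealing with multi-mapped reads are to ignore them, or to split each such read equally among the places it maps to. The Python sketch below shows both. The read and gene names are hypothetical, and these two strategies are only examples chosen for this illustration; actual counting tools offer their own options and defaults.

```python
def count_with_multimappers(alignments, strategy="unique"):
    """Count reads per gene, handling multi-mapped reads.

    alignments: dict mapping read ID -> list of genes the read aligns to.
    strategy: "unique" drops multi-mapped reads;
              "fractional" gives each candidate gene an equal share.
    """
    if strategy not in ("unique", "fractional"):
        raise ValueError("strategy must be 'unique' or 'fractional'")

    counts = {}
    for genes in alignments.values():
        if len(genes) == 1:
            counts[genes[0]] = counts.get(genes[0], 0.0) + 1.0
        elif len(genes) > 1 and strategy == "fractional":
            share = 1.0 / len(genes)
            for gene in genes:
                counts[gene] = counts.get(gene, 0.0) + share
    return counts


# Hypothetical alignments: r2 and r4 each map to two genes.
alignments = {
    "r1": ["geneA"],
    "r2": ["geneA", "geneB"],
    "r3": ["geneB"],
    "r4": ["geneB", "geneC"],
}

unique_counts = count_with_multimappers(alignments, "unique")
fractional_counts = count_with_multimappers(alignments, "fractional")
print(unique_counts)
print(fractional_counts)
```

Note that with the "unique" strategy geneC gets no count at all, while the fractional strategy assigns it half a read. The choice can therefore change which features appear to be expressed.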

21:53 - So, given that we have the quantification done, and we’re going to be seeing a few tools in the hands-on later in this tutorial, we need to move on to the differential expression. We’ve done the quantification in a single sample, but we may want to identify the differences in the numbers, in the concentrations, of RNA across different groups, different conditions, or different samples. So essentially what we are going to be producing per sample is this sort of distribution, if you like, across the same reference.

But then, if we want to do this differential expression analysis, we need to account for the variability of expression across both the biological replicates and the different technical replicates, again with the help of the counts.

22:47 - The first step usually is to do normalization. In other words, try to make the expression levels comparable across the different groups. And there are different ways of doing that. You can do it by features, at the gene level or the isoform level, for example. You can also do it by samples, so that you ensure that all the samples are comparable. And there are multiple methods that achieve that. And every method usually corresponds to a different tool that actually implements it.

So there are the FPKM and RPKM methods of normalization across different samples. There is the TMM method that is implemented in the edgeR package. And there is also the DESeq2 method, available through the package of the same name in R, which is the most commonly used approach. It’s important to highlight that the ones shown so far to be the most robust are the DESeq2 and TMM ones, because they cope better with different library sizes and different compositions of the libraries.
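To give a feeling for what these normalizations compute, here is a small Python sketch of two of them: RPKM for a single gene, and per-sample size factors using a median-of-ratios calculation in the spirit of the DESeq2 approach. This is a simplified illustration with made-up counts, not the code of either package, and TMM is not shown.

```python
import math
from statistics import median


def rpkm(count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_length_bp * total_mapped_reads)


def median_of_ratios_size_factors(counts):
    """Per-sample size factors via median of ratios (simplified sketch).

    counts: dict mapping gene -> list of raw counts, one per sample.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for values in counts.values():
        if any(v == 0 for v in values):
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(v) for v in values) / n_samples
        for j, v in enumerate(values):
            log_ratios[j].append(math.log(v) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical counts: sample 2 was sequenced twice as deeply as sample 1.
raw_counts = {
    "g1": [100, 200],
    "g2": [50, 100],
    "g3": [30, 60],
    "g4": [0, 5],      # skipped because of the zero
}

size_factors = median_of_ratios_size_factors(raw_counts)
normalized = {g: [v / s for v, s in zip(vals, size_factors)]
              for g, vals in raw_counts.items()}

print(rpkm(500, 2000, 10_000_000))
print(size_factors)
print(normalized["g1"])
```

After dividing by the size factors, the two samples give the same value for g1, g2 and g3, which is the sense in which the samples have been made comparable.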

So if you’re comparing different sets of samples, then TMM and DESeq2 are the most relevant ones. And in closing, it might also be important to keep in mind that the number of replicates used in differential gene expression, as well as the sequencing depth of each individual sample, are critical aspects, and have an effect on which genes, and how many of those genes, are identified as differentially expressed, that is, as showing a significant difference in expression.

As you can see, this is a study by Conesa et al., 2016, and you see how the number of replicates per group and the sequencing depth actually have an impact on the probability of detecting differential expression at a significance level of 5%. And as you can see, by increasing the number of replicates, you significantly increase that probability. Having a sufficient depth of sequencing also increases the probability of identifying those differentially expressed genes.

So the rule of thumb basically is to have at least three biological replicates per group, in order to have sufficient power in the statistical analysis that is done at the end, when you identify the differentially expressed genes.
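To see why replicates matter, here is a deliberately simplified Python sketch: a log2 fold change between two groups, and a Welch-style t statistic on log-transformed counts. This is an assumption-laden toy, with made-up counts and a pseudocount of 1. Packages such as DESeq2 and edgeR model counts with negative binomial distributions rather than with a t-test, so this only illustrates how within-group variability weakens the evidence for a difference.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treatment, pseudocount=1.0):
    """log2 ratio of mean treatment to mean control counts."""
    return math.log2((mean(treatment) + pseudocount) /
                     (mean(control) + pseudocount))


def welch_t(control, treatment):
    """Welch t statistic on log2(count + 1) values (toy illustration)."""
    a = [math.log2(x + 1) for x in control]
    b = [math.log2(x + 1) for x in treatment]
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error


# Hypothetical normalized counts for one gene, three replicates per group.
control = [90, 100, 110]
consistent_treatment = [380, 400, 420]   # replicates agree with each other
noisy_treatment = [100, 400, 700]        # same mean, replicates disagree

print(log2_fold_change(control, consistent_treatment))
print(welch_t(control, consistent_treatment))
print(welch_t(control, noisy_treatment))
```

Both treatment groups have the same mean, so the fold change is identical, but the noisy group gives a much smaller test statistic. With only one or two replicates there is little or no way to estimate that variability at all.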

25:55 - In closing, after doing the differential expression, the next step is usually to do some visualization. And there are different visualizations for different parts of the process. So for example, if you’re looking at the aligned reads, using the BAM files which we saw earlier, you can use IGV or Trackster to visualize those aligned reads. And you can also do Sashimi plots, again through IGV or other tools, to see what the read coverage along exons and splice junctions looks like.

At the end, after having the counts, you can also do a more efficient visualization of the counts, and of differential expression and fold change for example, using packages like CummeRbund, which was designed to connect with Cufflinks as part of the Tuxedo pipeline a few years ago. There are a lot of tutorials available for that. We will now be seeing the reference-based RNA-seq pipeline, leading up to the R part, how to do the analysis of counts using R. I would like to acknowledge the Galaxy Training Network, and particularly Berenice, Anika and Marcus, for putting together this particular tutorial.

And thank you for listening!