GTN Smörgåsbord- Day 1 - Quality Control (Slides)

Mar 18, 2021 09:35 · 5122 words · 25 minute read

Hello everybody, I welcome you to the quality control session; and this is split into the theoretical part and the practical part. Here I will cover, um, briefly the intention behind quality control for next generation sequencing experiments; and then go deeper into how to do quality control. There we use mainly a tool called FastQC and in order to and for you to understand this session you need to have listened to the introductory courses for galaxy, so you need to kind of know how to operate in galaxy; and I will answer these three main questions: so how to control quality of next generation sequencing data, what are so what are the main quality parameters to check, and, um, I also sometimes briefly explain, if you see there’s a quality breach, then how can I improve my data; and the main objectives you get from the session is are: that you can manipulate FASTQ files, that you can control so you can you can do a quality quality control with FASTQ files, and that you can use tools like FastQC to see understand what the output is of FastQC and then act accordingly if you see there are some quality breaches; and then maybe do some quality corrections.

So why do you have to do quality control? I mean it’s, um, obvious why you would do a quality control, because you want to check if your experiment worked all right. So you want to check… You have a lot of steps for next generation sequencing data, so the experiment, the library preparations, the sequencing, and even some additional steps in between; and you want to check for example did my knockout work worked correctly? Did the sequencing, actually I mean if you send it to a sequencing facility, it’s always worth to check if the sequencing facility, um, did something incorrect.

Something might happen there. Or even for your library preparation, right, so there could be some contamination happened and you want to check if your library is really, if there was a quality breach, and there are many many next generation sequencing experiments: CHIP-Seq, CLIP-Seq, RNA-Seq and so on. They all generate their own, so they all have their own protocols, their own biases, which you have to look after; and even, for example for CLIP-Seq. So the investigation of the proteome and the transcriptome, where you link the protein to RNA with cross-linking.

Even there you have different protocols like i-CLIP, e-CLIP and PAR-CLIP, HITS and so on; and all of these protocols, I can tell you, they have even different biases, which you have to check. However every next generation sequencing data follows kind of this these last steps. They are, um, so sequenced of course with illumina ion torrent, or nanopores; and what you get out of it is either a raw FASTA file, but more importantly you need a FASTQ file. Um, for nanopore for example, you get a FASTA, also sometimes, which is called a FASTA5 file.

But even the FASTA5 files you try to convert it into FASTQ; and this FASTQ file you then use for quality control; and after you have done the quality control after you checked if my raw data has some quality pages you go in, in other steps like mapping and your general data analysis.

04:24 - So now let’s talk about the main format. So I mentioned the, uh, you have the FASTA format and you have a FASTQ file. So what, what’s FASTA? If you never heard about it a FASTA file it mainly lists, uh, or it lists mainly all of your reads which you sequenced, and it always follows this format that you have one line for your read, which is your read identifier. Sometimes after the read identifier you have some comments like, um, the organism which was sequenced, and then the second line is the actual sequence, so A T C or G.

We symbolize it here with X’s. You sometimes have, um, N’s in there, um, what an N stands for I will explain later on. And then after the second line follows the second read, so a second read identifier, a second, um, so the sequence for this read, and this goes on and on, so then follows the third, fourth and so on. Um, so this is the basic FASTA file, and then you have the FASTQ format which, so the Q stands for quality, and again you have two lines; the read, um, one line for the read identifier with some comments, and then the second line for the sequence.

Then you have an additional third line; you very very often see a + sign here, so this is kind of a filler line if you want to include again some additional information. And then you have a fourth line where you have the quality string; we symbolize it here with Q’s, but of course there are different characters, and each, um, Q here so to say, so each character, stands for the quality that this base, or this base call, was wrong or not. So the accuracy of this base call from the sequencing device.

06:36 - After the fourth line follows again the, um, the second read; so the second read identifier, the sequence, the additional filler line and then the quality string. And then again the third read, fourth read and so on. So what is this quality string, or what are these quality scores? They are called Phred quality scores, or Phred scores, and they again give you an idea about the accuracy of the the base call. Let’s say you have a Phred quality score of 10, this means that your base call was 90 percent accurate, meaning um, to 10, to a chance or probability of 10 percent this base call was wrong or not.

So let’s say if you have 10 reads, so 10 times the same read, then one of these reads, um, this base is wrong. Obviously if the Phred score increases, so let’s say you have a Phred score of 20, then you have a base call accuracy of 99 percent, meaning there’s only a chance of 1 percent that the base call was wrong, so um that this particular base, so this A, T, G or C, is wrong or not.

08:06 - How is this Phred score actually calculated? So there’s a conversion, so this probability of your base call being wrong; so let’s say one percent, 0. 1. Um, you calculate or you convert it into a Phred score by taking the logarithm, and then multiplying it by -10. So let’s say one percent, 0. 1, the logarithm is then -1 x -10 = 10, so you have a Phred score of 10.

08:43 - And here also in this picture it’s very important; so you have to make sure which kind of sequencing device you use. So this calculation is different for the technologies, so let’s say for Solexa it’s kind of a bit of a different conversion, and so you always have to check which kind of sequencing device you have used. And, um, I always - uh also mentioned that at the end what you have in the FASTQ file is a quality string. So not numbers, because obviously if you have numbers it’s either a very long long line, or it’s very hard to use, to to know them because you have a very very big number to know what kind of quality scores you have or not.

So at some point it was decided that the Phred score is converted to a character by using the ASCII code. So what you have in your computer, all of these characters, so what you have on your keyboard, um, if you type it in actually this is, um, normally represented in the machine, so the computer, as a number. Um, you can see so in ASCII code the @ sign has the number 64. And now we can use this to convert the number of the Phred score into the, um, into a character, so to say.

And again it’s very important here that you know what kind of sequencing machine, so which kind of sequencing machine was used, because if you ever want to investigate, um, these base calls yourself, or these quality strings, then based on the sequencing device, um, there’s a different default where these Phred scores start. So for example here already also with, um, with different versions of the sequencing devices it can change. So for example here in Illumina 1.

3 they started with a Phred score of zero at this @ sign, and then if you have a Phred score of 40 you would end up with an H. Whereas in illumina 1. 8 and higher they changed that; the default is here, then there’s the Phred score of zero is at the ! mark, and if you have a Phred score of 40 or 41 then you’re at J or K. So you always have to check for your sequencing device.

11:26 - Okay, so now we go a bit deeper into identifying potential quality issues, and for that we concentrate on, yeah, the very important tool called FastQC. So, if you try to search for it in Galaxy you will find it very quickly, and you apply it to a FASTQ or a set of FASTQ files. You can also apply it to some BAM files. And then you get as a result an HTML report. So if you open it you’re greeted with the first page, so to say, where you see the basic statistics.

It lists you some general information about your read library; like the file name, the file type, the encoding - so there you can also check again which kind of sequencing device and version you have used, or was used. Then the total number of sequences in your read library; so you have here a thousand reads. Um, then the sequences which are flagged by FastQC as pure quality; and so here it’s zero. Then you see the general sequencing length; so all of the reads in this read library have a length of 100 nucleotides.

And you see the average, or the general GC content, or the fraction of GC content, and it’s here. So almost all rates have like, 50% of, uh, or a GC content of 50%. And if you then go to the first plot, um, this is one of the most important ones, where you see the base quality score. So on the x-axis you see the position and the read, so the base pair or the base, and for each position you get the distribution for the quality, so the PHRED quality score, which is plotted as a box plot, as you can see here.

And the blue line is the mean quality score for all of the positions. And the red zone is basically very poor quality. The green zone, which begins at around 20 to 28, is okayish quality. And 28 to 40, is then a very good or good quality, so the green zone. And so a PHRED score of 20, as you may remember, is around about, like, an accuracy of 99%, so 1% that the base is wrong or not, a probability of 1%. Here you can see, this is an experiment where you see already good quality.

And in this example you see an experiment with a very bad - or not a very bad, but a bad quality. And you see the end of the reads in the read library, um, you have very poor quality, so it’s in this red area. You will also get an information, immediately, an information in the report about this. And now you have two decisions, or you can make several decisions now; so you either can get rid of the data set at all, if you have other, a lot of other quality breaches as well; but you can also decide to actually trim off the ends.

Um, this is actually quite common, so you do this quite often, um, because with every sequencing technology, even with this Nanopore, the base quality at the end of the reads is always a bit worse than in the beginning. And the longer the read the worse it gets. And quite often you then do something called an end trimming; so you cut off the ends if there is some quality issues. And this you can do with Chromatic or Cutadapt. For example here in this intermediate quality example, where you can see that you have only here three positions at the end which are poor quality, you will probably do an end trimming, and leave, of course, leave the whole data set.

After this per base quality you have this per sequence quality, where you see on the x-axis the PHRED score, or the distribution of the PHRED scores, and on the y-axis the number of reads with this average PHRED score. And so it’s basically a distribution of the average PHRED score in your read library, and you hope to see a very high peak, at least with a PHRED score of 20 or 20+. If you see a little bump here with some reads with lower average quality, then you can either again decide to get rid of them; so this you can do either again with, Cutadapt has, has I think some options to get rid of low quality reads in general, and also mappers, some mappers already do this internally like Bowtie.

Of course if you do an end trimming this can also change, so if you do with Cutadapt and end trimming, then the average quality for this read increases, and this can already improve your overall quality for your read library.

17:09 - Then you have, um, this plot here, which is specific for Illumina sequencing. If you’re familiar with Illumina sequencing, you have a flow cell with different lanes. And you have a flow cell with two lanes, here on the y-axis you see this, and on the x-axis is the position in the read. Because it’s a sequence by synthesis, it’s kind of a plot showing you how the sequence performed over time. So the longer, so the, uh, higher the position or the later the position so to say, the later the time, and, um, it’s a called by what scaling here meaning that um cold the cold colors of blue color stands for good quality, hot color so red stands for poor quality.

Ideally this picture should be completely blue but you can see there are some bars some positions here in both lanes which are red. Doesn’t have doesn’t have to mean now that this was a very poor experiment or sequencing. Maybe these two positions have been or these positions have found a poor quality. I would actually keep this experiment. However if you see a very big stretch or a lot of positions with red then this is um an indication that the sequencing went wrong.

So what might have happened is then quite often there’s there was an air bubble in the floor cell and then the sequencing didn’t work out um well enough and you would probably then get rid of this data set.

18:59 - And if we come back to this plot which is the per bay sequence content showing you on the x-axis the position in the read and on the y-axis the fraction of the reads which have this nucleotide so the fraction for the T for C A and G. um ideally um of course there shouldn’t be any bias, so the fraction for all of the nucleotides should be similar through the whole read library. This is of course not the case. Um you see it actually quite often in the beginning or at the end at the end, because the quality is sometimes a bit worse in the beginning, uh could also be because of adapters and also at the end because of adapters.

Um however there are also protocols like clipstick or chipset even like cut and run, they have internal biases because of the restriction enzymes they use or sometimes even if you uh because of the restriction enzymes you have used. And then you see um generally some biases and then this does not mean this is good quality it’s just the bias of the protocol. So you have to kind of so this is um we have to say you really have to again know what your protocol is doing and what kind of biases your protocol is inducing to your data.

So it doesn’t if you do, for example eclipsing experiment and I see something like this doesn’t mean immediately this is a bad quality, it just confirms the bias, the the the protocol and uses. However if you know and you anticipate no bias or known different nucleotide composition in your data then this is either a quality breach because of poor fragmentation or your fragmentation didn’t work so well. Because you have done a very aggressive adapter trimming I come to that in a moment what this means meaning that you remove the adapters quite heavily and ending up with a bias because of this step or because of overrepresented sequences so there’s maybe contamination or some repetitive um RNA in there or some some RNAs from from a repetitive regions then this can show you or this could be the case for for for this nucleotide bias.

21:56 - The other plot here, the sequence GC content shows you what you also get from the initial statistic, shows you the neon the x-axis the mean GC content so the the uh fraction for it or the percent and on the y-axis the number of reads this is GC content and blue is always the tiered theoretical distribution which should be normally distributed and in red you see the actual empirical distribution of your data. So there, can be it’s again a plot where you kind of have to know a bit more about your experiment, your organism and the protocol you have used.

Because you can end up here also with a bimodal distribution so meaning you have a second bump or more than one bump in here. If you do a metagenome analysis of course if you have different organisms if you analyze a sample with different organisms they have different GC contents and therefore end up with a distribution um showing more than one peak. However if you don’t anticipate something like this so it should be unimodal so only one normal distribution one peak then this maybe means that you see a bimodal distribution that you have a contamination in your data.

A contamination in the data probably is also shown if the distribution is very broad. And another thing if the distribution is very spiky so there’s a very spiky peak then this could imply that you have a very heavy adapter contamination in there which has a specific GC content of course and then you have this big spike in there. So look out for this GC content so this indicates a bit more about contaminations about library contaminations.

24:08 - Then you have this per-base N content maybe you remember I mentioned when I talked about the fasta and fastq file format that you have your sequence string. So where you have A, T, G and C however there might be another character an N in there. And this is due to the case that the sequencing machine of course does your base calling so it decides at which position which base you have, and gives you then also this quality score so the spread score for it.

However if the algorithm because the signal was not strong enough or it couldn’t really detect what this base is and it gets really really low then it cannot decide with with high control with some confidence anymore that this is an A, T, G or C and therefore it sets a placeholder and this placeholder character is an N. So showing you the sequencing machine couldn’t anymore detect what kind of base at this position is correct or not or is yeah if this is A, T, G or C.

Of course here on the x axis you see the position in the read on the y-axis how many or the fraction of the reads in your read library having an N at that position and of course ideally you hope to see nothing so everything should be at zero. If you see something like this here so this is very poor quality then maybe you can get rid of it by again and trimming so getting rid of these bases at the end or at the beginning. However if you see a lot of them here so like this there’s also in the middle some a lot of reads which has ends here so there is really then this you would then combine all of the plots which we already discussed and see if there may be happen something with a sequencing device or if there was a problem with contamination um then you would probably maybe get rid of the data set entirely.

26:18 - Then you have the sequence length distribution, just showing you yeah the length of your sequences in your read library. Here on the x-axis on the y-axis is the number of reads, so this here means that all of the reads have a length of 75 bases.

26:42 - This doesn’t have to be so. If you already anticipate that there is a variation in the sequences because of fragmentation, or because of your experimental design, then at least you hope to see a very high or big spike at a specific length. You, how you designed, which… you know how you designed your uh read library, and then it drops immediately. Also when you do and the end trimming or adapter trimming and you investigate again the quality of the reads, to check these steps, and of course also these reads have been different length, and you would also see a non-constant distribution.

Yeah so this is a general checkup if um yeah where you can check if your design of your read library was correct. And then we come back to this plot, duplicated– the duplication level plot showing you on the x-axis the sequences- sequence duplication level As you know, or maybe know, that because of PCR, you will end up with sequence duplicates. So reads or sequences which are simply duplicated because of the PCR and the sequencing technology. This happens and ideally would see actually no duplicates, so duplication level of one but quite often you see at least two.

A bit of a spike here at two, which you can see here. So the blue line is your actual distribution, and the red line is the distribution if you would do the deduplications. Here on the y-axis, so this is on the y-axis is the fractional reads with this duplication level. And how FastQC is doing this is it just checks this the- the sequence so in the read library, if it can find the same sequence once, twice, three- trice, and so on. Again doesn’t mean that this experiment was bad or not.

If you see uh spikes and higher- higher duplication levels like maybe 9, 10, or even higher.

29:29 - It’s just, so for example in Chip-Seq or Clip-Seq, I anticipate to see something like this.

29:39 - And RNA-Seq, also probably. However if you see a really a very big spike in higher levels, then this is an indication that you have done too many PCR cycles and yeah. You probably have to then redo the experiment. If not, so small bumps in here or maybe a bit more bumps and higher levels it’s not that big of a deal. , You only have to then consider if you do a deduplication or not deduplication, so remove some of these non-unique molecules.

30:20 - This is where I would say you have to to decide based on the experiment again, and if you have enough material. For example in Chip-Seq I see that again often that there is a bit of, some doubly duplicated reads in there. And I see that I only have 2 million reads, depending on the org[anism], let’s say on human, 2 million reads. Then probably I wouldn’t do a deduplication if I have to do the differential expression analysis, because I reduce the data too much to…, or I remove too much of important information.

Because also in Chip-Seq maybe these reads come from different transcript isoforms, so I actually remove some important information. And I have not enough data so for Chip-Seq if you don’t want to do differential expression analysis, at least you need to have around about 10 million reads, which is- which is a good number to do a differential expression analysis. And in Clip-Seq however, um you there quite often do a duplication because you’ve, um, it’s very important to know where your protein is binding you to your transcriptome and the reads are generally quite short, and therefore it’s um-, you have a high duplication level and you need to do a deduplication.

31:52 - This one here, this plot shows adapter contamination so FastQC also checks for standard adapters, from known, for well-known devices Illumina and Nexterra, and so on. So these kind of adapters, it checks if it can find these sequences in your data. Hopefully um- so if you already removed your adapters, it will not find anything of course. If you if you really have the raw data, then you will see that there is something in your data and so if you have quality breach, then you do an adapter trimming with cutadapt or TrimGalore.

There are a lot of tools which will help you with that. And I have to point out, if you do an adapter trimming, re-check this again. Don’t assume that I applied now my tool to my data then the adapter is gone now, please check this. It’s also sometimes really important to check this because you, there are sometimes read library which are a bit more complex, you have maybe an adapter at the beginning and at the end, and sometimes a double ligation of an adapter and so you have to do a double adapter trimming.

And so really check this if this is gone. So on the x-axis is the position of the read, on the y-axis the fraction um of the reads having this adapter sequence. And again if you see in, a quality breach in here, so as soon as there is a fraction or some fraction higher than two or three percent, it will show you a warning that there are probably some adapter sequences in your data, and then you have to have to do an adapter trimming. This goes more or less hand-in-hand with this plot, it’s the k-mer content plot showing you the position here on the x-axis of the read on the y-axis the fraction of your reads with this k-mer at this position.

34:04 - I said in hand because if you have adapters in there then you quite often see also something in the k-mer content. Of course you have a molecule or a sequence which occurs more than once in your read library, and therefore you have an increase in some k-mers, um also then shown here. If you removed some adapters and you still see some k-mer content, doesn’t have to be that again that this is a poor quality, maybe it just means it happens quite often for RNA-Seq experiments so um if you have a repetitive or satellite sequences in there, then, of course you see something in the k-mer content.

34:56 - Okay with that we are more or less at the end. So I already talked at every plot a bit um what you can do, what you can to to improve the quality, I told you that um so filtering of sequences, if you see that you have this average or mean quality score distribution of this, is a if you have their quality breach you can filter form low quality sequences some mappers include this already internally, but you can also do it with cutadapt or some other tools.

If the reads are too shortm so you see that your read lengths are too short, so lower than 30, maybe also worth to filter that out depending on your experiment with the tool also you can use cutadapter or Trim-o-matic. With too many N bases you can do an N trimming or again filter out these sequences. GC content wise, if you see a GC content bias in there which you don’t assume, or which you have not anticipated, there is a tool also to correct for this, um it’s called uh if you search for deepTools GC Content correction in galaxy, there’s a tool which you can use to correct for GC content bias.

But only use it if you are sure the gc content, or there’s a bias, should not be in your data. There are other quality measures tools which you can use to remove some quality breaches. I talked a bit already already about, at every step now, again cutting and trimming sequences is possible. So this end trimming adapter trimming um you probably have to do.

37:05 - Okay so I hope you now know how to run the quality control in the practical presentation. I will do the hands-on so we go a bit yeah deeper and doing it a bit um hands-on in galaxy, how to, to run this. I hope you know which kind of quality parameters are important for next generation sequencing data, so you know how to run, maybe, already FastQC. And what kind of impact it has for your quality control. And with that, I thank you very much, and yeah I hope you enjoyed this presentation, which is part of the galaxy training network, so you find these slides or of course there.

And if you want to also join on the practical part then we see and hear each other soonish. Bye bye!.