Hands-on: miRNA differential expression analysis

Apr 19, 2021 22:51 · 2107 words · 10 minute read

Hi everyone! Let’s start with the practical part of our tutorial.

00:06 - At first, we are going to provide a name to our history; for example, Arabidopsis transcriptome analysis.

00:16 - Then, we are going to upload our datasets from Zenodo.

00:19 - In order to do that, we’ll use the tabular data provided in the training.

00:30 - We should copy it, and then, select the uploader tool, by using the rule-based method.

00:39 - Select collection, and paste the tabular data.

00:44 - And build. Now we are going to provide some rules; we should click on Rules, add/modify column definitions, add definition.

00:56 - Then List of Identifiers, and select column A. Now, Add Definition, collection name, and select column B. After that, select URL and column C. Finally, Add Definition, Type, and select the column D. Apply and finally upload.

01:29 - Meanwhile we are going to upload the rest of our datasets.

01:38 - We should copy the tabular data of the additional datasets, and select the uploader tool.

01:54 - Select datasets, paste the tabular data, and Build.

02:04 - Once again, we are going to create two rules.

02:08 - Add Definition, Name and select the column A. And URL, and select the column B. Then apply and upload.

02:21 - Now, we have all the required datasets available in Galaxy.

02:32 - Once the process of uploading has finished, we can check the content of one of the collections.

02:42 - Now we can open a FASTQ file; we can see four lines for each read.

02:49 - The sequence identification, the reads, and a line encoding the quality.

02:59 - Now we can add a few tags in order to identify our sequences.

03:06 - For example, in this case, we can add brassinosteroids and miRNA.

03:21 - Here we can see the additional datasets that we are going to require.

03:26 - It includes the annotation, the whole transcriptome, and a few additional datasets.

03:36 - The next step is to evaluate the quality of our reads, for which we are going to use FASTQC.

03:54 - We should open the collections, select the control miRNAs, leave the rest of the parameters as default, and repeat the same process with the brassinosteroids treated dataset.

04:15 - FASTQC provides a simple way to do some quality control on our reads.

04:26 - Now, we can check the results of this tool.

04:33 - It generates two different outputs, Here we can see the different information which it provides, like the Per Tile sequence quality, Per Sequence quality score, per base sequence content, per sequence GC content, per base unknown bases contents, sequences length, sequences duplications and overrepresented sequences.

05:06 - Here he can see some of the adaptors which are present in our reads.

05:14 - Now, in order to compare the results of the two treatments, we are going to merge the outputs of FASTQ.

05:28 - In order to do it, we are going to use the tool Merge Collections into a single list of datasets.

05:44 - We are going to select the collections. The control and the brassinosteroid raw collections, and execute.

05:58 - Now, we have raw reports merged in a single list.

06:07 - After that, we are going to use the MultiQC tool in order to generate our report.

06:19 - Select FasQC, the merged collection and a title.

06:28 - For example, Initial Quality Assessment, and execute.

06:36 - In that way, will be able to visualize all our datasets in a single report.

06:44 - MultiQC generates two types of outputs, an HTML, and a stats file.

06:58 - We can check the stats file; it provides raw information.

07:08 - But the most interesting one is the HTML file.

07:16 - Here we can see the information that it provides.

07:20 - As we can see, our reads include a lot of duplicated reads.

07:26 - Also it provides information about the quality scores, per sequence quality scores.

07:39 - Here we can compare the value of all our datasets.

07:45 - As we can see, all of them show similar trends.

07:53 - A high percentage of reads include adapters, which will be necessary to remove, as they can interfere in the subsequent analysis.

08:05 - To remove the adaptors, we’ll use the Trim Galore! Tool.

08:14 - Select Trim Galore!, collections, control collection, select the Illumina small RNA adaptors, and execute.

08:31 - Now we should repeat the process with the brassinosteroid treated collection.

08:41 - Select the Illumina small RNA adaptors, and execute.

08:50 - The next step is to evaluate once again the quality of our reads, after the processing by Trim Galore!.

09:05 - We should select the FastQC tool, select the processed reads, control collection, and execute.

09:22 - And, we should do the same with the brassinosteroids treated collection.

09:27 - As we did before, we need to merge the outputs generated by Fastqc.

09:42 - We should wait a little bit. Select the Merge Collection tool; select the control and the brassinosteroid treated collections, and execute.

10:18 - Once we get the merged collection, we should once again use the MultiQC tool, in order to generate the report.

10:33 - Select Fastqc tool, and the merged collection, We should provide a report title, as for example, Post-processing quality assessment.

10:59 - Once we have obtained the quality report of the processed sequences, we’ll compare the quality parameters before and after having processed the samples, for which we are going to use the Scratch book option.

11:19 - We should select Scratchbook, and open the visualization of both reports, side by side.

11:32 - In that way will be quite easy to compare them.

11:43 - Here we have the initial quality assessment, and the post-processing quality assessment.

11:50 - As we can see, we have a quite similar amount of duplicated reads before and after the processing.

12:05 - The sequence quality histograms are also quite similar, The per sequence quality scores have been improved after the trimming.

12:21 - We can also compare the per base sequence content.

12:30 - As we can see the per sequence GC content has been improved in our reads.

12:40 - In the case of the sequence lengths, we can see that the original reads had all of them similar lengths, and after the trimming we have reads with different sizes.

13:01 - The duplication levels are quite similar. Also the overrepresented sequences show a similar pattern.

13:14 - And the adapter content. We can see that we have removed all the adapters.

13:24 - Now we can start with the quantification of the miRNAs.

13:31 - It will be done by using the MiRDeep2 tool.

13:37 - Go to the tool search bar. Here we have three modules; we are going to use MiRDeep2 mapper and MiRDeep2 quantifier.

13:54 - First, we’ll use the MiRDeep2 mapper. We should select the control processed collection, select collapse and execute.

14:16 - Now We’ll repeat the process with the brassinosteroids treated collection.

14:30 - Once again select collapse, and execute. The collapse tool ensures that each sequence only occurs one.

14:42 - To indicate how many times reads the sequence represents, a suffix is added to each fasta identifier.

14:57 - Now we are going to use the MiRDeep2 quantifier tool.

15:03 - We are going to select first the control collection, the precursor sequences, the mature miRNA sequences and the STAR sequences.

15:18 - All those files are required in order to perform the quantification.

15:26 - Now we should leave the rest of the parameters as default.

15:31 - And execute. Now, we should repeat the process with the brassinosteroid treated collection.

15:42 - Once again, select each of the datasets and execute.

15:50 - This tool maps the reads to predefined miRNA precursors and determines by that the expression of the corresponding miRNAs.

16:01 - Firstly, the predefined mature are mapped to the predefined precursors, Then the predefined star sequences are mapped to the precursors too.

16:15 - This tool generates two types of outputs, the raw reads which include the information such as the miRNA names, the read counts, the precursors and the normalized counts.

16:41 - It also provides an HTML file with additional information, such as the miRBase precursors, the mature read counts, the star read counts, and the mature sequences, the miRBase star sequences and the precursor sequences.

17:08 - Before performing the differential expression analysis will be necessary to extract the first two columns from the quantification files.

17:19 - Those columns include the information about the miRNA identification and the read counts.

17:34 - Then, we should select the tool Cut columns from a table.

17:46 - Select the first two columns, select the collection of control sequences and execute.

17:57 - We should repeat the process with the brassinosteroid treated collection.

18:06 - And execute. Now everything is ready for performing the differential expression analysis.

18:24 - We can have a look at the outputs. We’ll use the DESeq2 tool.

18:45 - We should specify a factor name, for example effect brassinosteroids, specify the first factor level, which will be brassinosteroids, select the collection corresponding to brassinosteroids, which is the 104.

19:10 - Then specify the second factor, which is the control.

19:20 - We should unset the files have a header option, and leave the rest of the parameters as default.

19:32 - And execute. DESeq2 is a popular tool for gene level differential expression analysis.

19:39 - It uses the negative binomial distribution, employing a slightly more stringent approach compared to other methods, it provides a good balance between sensitivity and specificity.

19:57 - DESeq2 generates two different types of outputs, plots and the differential expression datafile.

20:09 - Let’s have a look at each of them. One of the plots generated by DESeq2 is the Principal Component Analysis plot, which provides insight into the association between the samples.

20:26 - As we can see, the x-axis includes the 46% of the variance of the sample.

20:35 - Other plots generated by DESeq2 are the sample-to-sample-distance, which provides similar information to the PCA plot, the dispersion estimates, histogram of p-values, and the MA-plot.

20:59 - Now let’s check the tabular data generated by DESes2.

21:06 - It provides diverse statistics, such as the mean base expression, the fold change, standard error, the Wald-statistics, the p-value and the p-adjusted value.

21:22 - The p-value is a measure of the probability that an observed difference occurs just by random chance.

21:33 - The p-adjusted value takes into account the false discovery rate, which is necessary when we are measuring thousands of variables.

21:43 - Now we are going to filter those genes that show statistical significant differences between the two experiment conditions.

21:53 - We should select the output of DESeq2, select the column number seven, which corresponds with the p-adjusted values, and execute.

22:07 - To perform differential expression analysis is recommended to use at least three biological replicates of each experimental condition.

22:17 - In addition, another important factor in determining statistical power is the sample size.

22:26 - Let’s take a look at the results. Unfortunately, we haven’t identify any gene with significant differential expression.

22:32 - This is caused by the fact that our sample size is not large enough, since we have used a subsample obtained from the original data.

22:44 - Let’s repeat the last step by using the full miRNA dataset.

22:51 - We should copy the Zenodo file address, provide a proper name, for example miRNA DESeq2 complete dataset and start.

23:26 - Now we are going to repeat the filtering process with this new dataset.

23:34 - Will re-run the previous process, select the complete dataset, and execute.

23:47 - Let’s check the results. When using the original data, the statistical power is significantly increased, since the samples are one hundred times larger in size.

24:06 - In this case, we have identified 39 genes whose differential expression is statistically significant.

24:15 - Finally, we’ll filter out the upregulated miRNAs.

24:24 - Let’s select the previously filtered dataset, and set the column 3 higher than 0.

24:43 - In this way, we’ll keep only those transcripts whose fold change is higher than 0.

24:56 - Finally we have obtained 16 miRNAs whose expression show a significant increase in brassinosteroid treated samples compared with the control samples.

25:08 - That’s all, I hope you enjoyed it!.