GTN Smörgåsbord - Day 4 - MaxQuant and MSstats for the analysis of label-free data

Apr 19, 2021 22:56 · 8816 words · 42 minute read

Hi everyone! And welcome to the proteomics hands-on training.

00:06 - In this training, we will learn how to use MaxQuant and MSstats for the analysis of a label-free tissue cohort dataset.

00:16 - The data comes from skin tissue samples of 19 patients, and they have different types of tumors.

00:25 - One group consists of metastasizing cutaneous squamous cell carcinoma, and the other one of RDEB cutaneous squamous cell carcinoma.

00:36 - And we shorten the cutaneous squamous cell carcinoma as cSCC here.

00:43 - So it’s a type of skin cancer. And our objective for this training is to learn how to use MaxQuant and MSstats for the analysis of such a real label-free tissue cohort dataset.

01:05 - So we will start by uploading the data. Part of the data is deposited on Zenodo.

01:13 - We can copy everything, and transfer it here to the upload button.

01:24 - Choose Paste and Fetch, and then paste the links to these three files.

01:33 - Then start and close. If you’re not familiar with Galaxy yet, please look at the Galaxy beginners lectures and hands-on training.

01:47 - We then continue with uploading the raw files.

01:55 - Here on this hat, you have a direct link to the Galaxy training material site.

02:01 - That was where I was before. And so we can also copy the links from here.

02:08 - So this is raw data deposited in the PRIDE repository.

02:13 - And this is a repository, um, that hosts publicly shared proteomics raw data.

02:21 - And we have 19 raw data files. And they take really long to load.

02:28 - And they will also, um, take a long time in the MaxQuant run.

02:33 - And that’s why we will already load also the results of the MaxQuant run, which you can find at the end of the “Hands-on: MaxQuant analysis” box.

02:45 - So we will already get the MaxQuant results.

02:49 - And we will also upload them. Because of the Zenodo link, they now have a weird name.

03:07 - So I will rename each of these files. So to do so, I click on the pencil and then remove the beginning of this, um, file name.

03:22 - So we have here a protein database. This is an annotation file that we need for MSStats.

03:29 - And then we have a comparison matrix that we need for MSStats.

03:38 - The raw files will take quite a while to load, so we can already look at these files here.

03:47 - So this is a human protein database with 20,000, um, entries.

03:54 - You remember from the theory lecture that this first line is the header line, and it contains here the Uniprot ID and, um yeah, more IDs, and the name of the protein.

04:07 - And then this is followed by the actual amino acid sequence.

04:11 - And here then starts the second protein. The other two files, So this, um, annotation file is important because for the statistical analysis, we want to compare our two conditions.

04:27 - So we have the metastasizing tumor type, and the RDEB tumor type.

04:33 - And we would like to compare the files of both groups to find differentially abundant proteins between the two groups.

04:41 - It’s really crucial how this annotation file is set up, and many of the MSstats errors happen here.

04:51 - Um, the raw file actually has to match the name in the evidence raw file column.

05:01 - We can later check that out. So here it’s very important that this name fits to what is written in the evidence file, the output of MaxQuant.

05:10 - Otherwise, um, this metadata cannot be attached to the raw files’ names.

05:17 - Then we have the conditions. So this is a binary comparison.

05:21 - Here we have only two groups, and replicates indicate that from each patient we have one sample processed.

05:32 - So we have 19 files, and we also have 19 different patients.

05:37 - And because it’s label-free, the 19 samples were measured in 19 different runs.

05:44 - And when we want to perform MSStats after MaxQuant analysis, it’s important that we also add an isotope label type column to the annotation file.

05:55 - But this is in all cases only, um ‘L’, which stands for ‘light’, because we don’t have any heavy spike-in peptides.
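The matching rule described here can be checked outside Galaxy before MSstats is run. The following is a minimal sketch, not part of the training: the column names `Raw.file`, `Condition`, `BioReplicate`, `IsotopeLabelType` and `Raw file` are assumptions, and all file names and values are invented for illustration.

```python
import csv
import io

# Hypothetical annotation table (tab-separated); file names are invented.
ANNOTATION = (
    "Raw.file\tCondition\tBioReplicate\tIsotopeLabelType\n"
    "sample_01\tmetastasized\t1\tL\n"
    "sample_02\tRDEB\t2\tL\n"
)

# Hypothetical excerpt of a MaxQuant evidence file.
EVIDENCE = (
    "Sequence\tRaw file\tIntensity\n"
    "AAAK\tsample_01\t1000\n"
    "CCCK\tsample_02\t2000\n"
    "DDDK\tsample_03\t3000\n"
)


def column_values(text, column):
    """Return the set of values found in one column of a tab-separated table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[column] for row in reader}


def unmatched_runs(annotation_text, evidence_text):
    """Return (runs missing from the annotation, annotated runs missing from the evidence)."""
    annotated = column_values(annotation_text, "Raw.file")
    measured = column_values(evidence_text, "Raw file")
    return sorted(measured - annotated), sorted(annotated - measured)


missing_in_annotation, missing_in_evidence = unmatched_runs(ANNOTATION, EVIDENCE)
print("Not annotated:", missing_in_annotation)
print("Annotated but not measured:", missing_in_evidence)
```

Any name that shows up in either list would be a candidate cause for a failed MSstats run, because the metadata could not be attached to that file.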

06:05 - The third file is also needed for the MSStats analysis, it’s called a comparison matrix or table.

06:12 - And in this table we specify which comparison we want to perform on our groups.

06:17 - In our case, there’s only one possibility, because we only have two groups.

06:23 - So we compare the metastasizing condition versus the RDEB condition.

06:28 - It’s important to write again the names here exactly as they were written in the annotation file, so that they can be matched.

06:36 - Um, the name can be anything. It doesn’t matter.

06:41 - So I just named it that we want to compare the metastasize versus the RDEB, um, cancer type.

06:48 - And the condition um, that comes first, has a ‘1’ and the condition that is in the, um yeah, it is compared against, comes in with ‘-1’.

07:03 - And if there would be more conditions, we could also leave them at 0, so that we first compare only those two conditions.

07:09 - And then we could add another row with a new, um comparison.

07:15 - And then we could leave these conditions 0 and, for example, compare two other conditions.

07:22 - But here we only have two. So this table is quite short.
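The logic of the comparison matrix can be written down as a small sketch. This is only an illustration of the 1 / -1 / 0 scheme described above, not the MSstats implementation; the third condition `healthy` is hypothetical and does not exist in this dataset.

```python
def contrast_row(conditions, numerator, denominator):
    """Build one comparison row: 1 for the first condition, -1 for the one
    it is compared against, and 0 for every condition left out."""
    if numerator == denominator:
        raise ValueError("numerator and denominator must differ")
    for name in (numerator, denominator):
        if name not in conditions:
            raise ValueError(f"unknown condition: {name}")
    row = []
    for condition in conditions:
        if condition == numerator:
            row.append(1)
        elif condition == denominator:
            row.append(-1)
        else:
            row.append(0)
    return row


# The two groups of this training: only one comparison is possible.
two_groups = ["metastasized", "RDEB"]
row_two_groups = contrast_row(two_groups, "metastasized", "RDEB")
print(row_two_groups)

# Hypothetical third condition to show how a condition is left at 0.
three_groups = ["metastasized", "RDEB", "healthy"]
row_three_groups = contrast_row(three_groups, "RDEB", "healthy")
print(row_three_groups)
```

The condition names passed in have to be spelled exactly as in the annotation file, which is why the sketch rejects unknown names instead of silently returning zeros.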

07:28 - And you can see that the raw data is still downloading.

07:33 - And this is, um, not typical for Galaxy training.

07:37 - So often we used Galaxy trainings on cropped and very, um, small, but descriptive data files.

07:45 - But in this training, we actually use a real, um a real cohort.

07:51 - And we didn’t crop the files, to get a proper statistical analysis.

07:56 - And therefore, we yeah, we don’t have time in the training to wait for the MaxQuant run to be finished, and therefore on Zenodo you can already get the results of the MaxQuant run.

08:13 - So this is the evidence output, I rename it again.

08:19 - Then we have the protein groups. Save.

08:29 - And last, the quality control report. So we can already try, we need them in a collection later on.

08:57 - Ah no, let’s first rename. So the renaming might not work well if the data is still uploading, but we might be lucky here.

09:13 - But it can happen that if it’s still yellow, that it will not, um, keep the new name.

09:25 - Yes. So we have only six of the RDEB condition, but that’s actually quite a lot.

09:37 - So RDEB stands for Recessive Dystrophic Epidermolysis Bullosa, and this is a rare disease in which patients have a defect in the Collagen VII gene, and they cannot produce functional Collagen VII.

09:58 - And Collagen VII anchors the epidermis to the dermis, And in case this, um, protein is not, um yeah is not functional, the problem is that the two skin layers are causing friction, and this leads to blisters, and inflammation.

10:20 - And one of the long-term outcomes of the disease is the development of such a squamous cell carcinoma, which is then relatively aggressive.

10:36 - And that we have six patients here is already quite a high number, because it’s such a rare disease.

10:43 - And we were lucky because the proteomics experiment was performed on formalin fixed paraffin embedded (FFPE) tissues, and they can be stored at room temperature for many years.

10:56 - So for this experiment, we could use, um yeah, we could yeah make use of also tissues that were collected a long time ago.

11:08 - And the metastasizing cSCC is also not so common because, most cutaneous squamous cell carcinomas are quite yeah, benign, or they don’t, at least they don’t really often metastasize.

11:24 - But in this case, we were lucky to get several metastasizing cSCCs.

11:31 - Um, but these are just the sporadic cSCCs, so there’s no genetic defect.

11:39 - It’s most often, um, generated by UV light.

11:46 - Um, normal sunlight exposure over a long time, many, many years.

11:54 - And the aim here is to find differentially abundant proteins between these two relatively aggressive cSCC types.

12:04 - But, um, they have quite a different origin.

12:11 - So I can try to, so we want to have them in a collection.

12:17 - Then it’s easier to handle in the history. So I click on “operations on multiple data sets”, and then “Select All”, but not these three files, and not these three files.

12:31 - So I built a data set list. This is all the files.

12:39 - I hide the original elements, um, and create a list.

12:52 - And then if I click here again, then the, um, boxes go away and now it looks way nicer.

13:00 - So here’s my collection with the 19 raw files, and if I click on them, here they are.

13:07 - And now they are green. Um so in the European Galaxy server the upload immediately um, turns out the right format.

13:17 - So the format has to be thermo.raw. In case this went wrong on your Galaxy instance, you can manually change it by clicking the pencil again, and then click on ‘Datatype’, and then look for thermo.raw, and then press ‘Change datatype’.

13:37 - You would need to do this for each of the 19 files then.

13:41 - We can also give the history a name. So this is the MaxQuant MSStats label-free training.

13:51 - And now we are ready to look into the MaxQuant tool.

13:57 - So MaxQuant is pretty powerful. It can not only identify peptides and proteins.

14:06 - It can also quantify them, and it allows many different quantification methods.

14:11 - So label-free, but also many types of labels are supported.

14:16 - In our case, we can leave pretty much the default parameters.

14:24 - So here we have to choose our input type. It’s possible to also load mzml and mzxml, So the open standard formats.

14:33 - Um, but MaxQuant requires them to be in quite a specific type, so it might be a bit tricky to get the right subtype of these XML files.

14:49 - So we will use the raw data, in the thermo.raw file format, and our FASTA file with the protein sequences has already been pre-selected as FASTA file input.

15:05 - And we leave the parse rule which determines how we split up the header of the FASTA file, which parts we keep.

15:13 - We don’t have any fractions or PTM, so we don’t need a template.

15:19 - We’re quite happy with all these parameters here.

15:24 - So we do FDR of 1% on peptide spectrum matches and protein level.

15:32 - The protein quantification is performed again in MSStats, so MSStats will only use the peptide-level identification of MaxQuant, and then performs a protein quantification, or summary, summarization by itself.

15:53 - So here the parameters for the protein level quantification don’t matter that much.

16:00 - And here we have now the actual input that we need to choose.

16:07 - Because it’s possible in a parameter group to specify different parameters for different files.

16:14 - But in our case, we want to have the same parameters for all files.

16:17 - And in order to recognize the collection, we need to click here to choose the dataset collection.

16:26 - And now it says it’s not available. That might happen if there’s one data set which was not correctly recognized.

16:38 - But it looks good, so I just reload MaxQuant.

16:47 - So we anyway left everything the same. And it still says we don’t have a collection.

17:12 - Mhm. This one is wrongly recognized.

17:24 - This is interesting. So normally it either works for everything or for nothing.

17:30 - So we change the data type here. And this was right.

17:37 - right. right. This was wrong.

17:48 - This is wrong. This is right. Try it.

18:19 - Yeah, okay. The other seemed to be fine.

18:28 - So for whatever reason, there were three datasets that were not recognized correctly.

18:34 - So now I can go back to MaxQuant, and see if we have it now.

18:41 - Yeah, and now here it is. The collection number 45.

18:48 - We also leave all these parameters as they are.

18:53 - If you want to learn more on the MaxQuant parameters and their meaning, um, please check the beginners MaxQuant training.

19:03 - There’s a whole section explaining the most important parameters of MaxQuant.

19:09 - So we had digested with trypsin, so we have selected this.

19:14 - And here we could choose label-free quantification.

19:18 - Um, but actually, this is only a normalization step on protein, um, level.

19:28 - And because we anyway work with the peptide-level data in MSstats, it does not matter.

19:33 - So we can also leave quantitation method as None.

19:39 - Included into the MaxQuant wrapper is PTXQC functionality, and this generates an automated QC report, with many different and very helpful plots that describe the dataset.

19:55 - And also the, um, indirectly the machine, the instrument performance.

20:04 - And now, in the last step, we need to select an output.

20:06 - There are several options. So for Max-, for MSStats, we need protein groups and evidence file.

20:17 - And we could hit start now, but the analysis takes several hours, so it’s not really worth clicking the button, because it will only finish many hours later.

20:33 - And if we would like to continue with the training, we already have the PTXQC report downloaded from Zenodo, the protein groups, and the evidence file.

20:46 - So we can look into them, and just assume that this will also be the output of this MaxQuant run.

20:56 - So I open the PDF with this scratchbook. So this is an overview about different parameters, and how they perform in different files.

21:18 - Um, don’t worry too much about red lines here.

21:23 - Often the cut-offs do not fit all types of mass specs, and all types of experiments.

21:35 - So there’s many different parameters written down that MaxQuant has used.

21:41 - We get several overview figures, and this is an important figure.

21:48 - So MaxQuant appends automatically potential contaminant sequences onto the FASTA file.

21:56 - And in this cohort, quite many of these potential contaminant proteins were found, and up to 60% of the intensity in the samples can be attributed to potential contaminants.

22:15 - But it’s important to think about what this means, and which types of contaminants are in there.

22:22 - And many of the contaminants, or many of the potential contaminants are of course human proteins, um, that the experimenter could, yeah, um, yeah, contaminate the sample, with such as, um, a skin piece or hair.

22:42 - And so all the skin proteins are inside the contaminant list.

22:46 - But here we have analyzed human skin, and therefore we expect that we have the typical skin proteins in our sample.

22:54 - So in the later analysis, we will assume that we have, um, that the human skin proteins are actually from our sample, and not contaminants.

23:10 - So there’s a lot of more plots, which you can explore.

23:14 - I will go back. Yeah, so I close this.

23:22 - Um, let’s also have a look at the protein output.

23:27 - If we click on the file, we can see that we have 2635 lines.

23:34 - One line is the header line. So there’s, um, 2634 protein entries in the file.

23:46 - Here’s the beginning of the file, we can see that some IDs are already labeled as contaminant.

23:54 - That’s why we have the ‘CON_’ in front of the Uniprot ID.

23:58 - And now there’s a lot of information about peptides, and unique peptides, their mass and so on.

24:05 - What we would need to know is um, yes, so here the intensities of the proteins, but we will not really care about them.

24:17 - So we need to have this column: here every entry that has a plus is considered a potential contaminant.

24:27 - And what MSStats will later do, is to remove all the proteins that have here a plus, that are potential contaminants.

24:37 - So what we will do next is, we will remove the contaminants that are not human, because they actually are contaminants, they’re not expected in our sample.

24:49 - And then we will replace the pluses that are remaining here, um, with nothing.

24:55 - And this removes the plus signs from this column.

25:00 - And then MSStats will not recognize that we will still have a few protein IDs that come from, um, contaminant, potential contaminants.

25:13 - So this was column 118, and there’s also one column in the evidence file.

25:20 - So here we have the feature information, so we have peptide sequences, and they have different charges and come from different of the samples.

25:36 - And what I said before was that in the annotation file for MSStats, we need to put the right name.

25:43 - And this here is actually the name that we need to put in.

25:48 - So the raw file column of the evidence, is how the run, the files should be named in the annotation file of MSStats.

26:00 - And then we also look for the contaminant column.

26:11 - So here this it’s 54. At least here at the top there are no contaminants.

26:16 - Um, so we need to remember that in column 54, we need to replace the pluses with nothing, in order to remove them.

26:29 - So we will start by keeping only the human proteins, because all non-human proteins are probably contaminants.

26:40 - We will do this with the Select tool. So from the Select tool, there’s many different varieties.

26:52 - It’s quite tricky to find the right one. Um, sometimes it’s also easier to just look into the different categories because the search is not always working so well.

27:19 - Um, so it would probably be ‘Filter and Sort’, or ‘Text Manipulation’.

27:39 - No, it’s not here. So the easiest option is, um, if you are on a Galaxy server that supports that feature, you click again on the hat, and go to the training material.

28:07 - And then you can directly click on the tool, and it will be selected for you.

28:13 - So I can already copy also the pattern that we use.

28:16 - I click the select tool, and here it is. So it’s called ‘Select lines that match an expression’.

28:23 - And first we um, so we keep only lines of the dataset that either contain the word ‘HUMAN’, because these are human proteins, or ‘Majority’, and Majority is just there in order that we can keep the header line, because in the header line there is no ‘HUMAN’ word written.

28:46 - And here in the second, in the second column, the header is called Majority.

28:54 - So if we filter the protein groups, which is number 24, with the Majority option, we keep lines that are either matched to ‘HUMAN’, or to ‘Majority’.

29:06 - And we do a similar thing with the evidence file.

29:11 - So we also want to remove non-human proteins from the evidence file that are probably contaminants.

29:17 - We do it again with this Select Tool. Um, I can click again on the training, I can copy the pattern that we’re looking for, and press the selector.

29:34 - And this time we use the evidence file. And the evidence file contains the word sequence in one of the headlines.

29:42 - Here it is. So here in the first column, it’s called ‘Sequence’, so we can keep the header by also selecting the ‘Sequence’.
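What the Select step does can be sketched in a few lines of Python. This is an illustration only: the patterns `HUMAN|Majority` and `HUMAN|Sequence` follow the description above but may be written differently in the training material, and all protein IDs and values are invented.

```python
import re


def select_lines(lines, pattern):
    """Keep only the lines in which the regular expression matches somewhere."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


# Hypothetical excerpt of a protein groups file (tab-separated).
protein_groups = [
    "Protein IDs\tMajority protein IDs\tIntensity",
    "sp|P00001|EXAMPLE1_HUMAN\tsp|P00001|EXAMPLE1_HUMAN\t1000",
    "CON__sp|P00002|EXAMPLE2_BOVIN\tCON__sp|P00002|EXAMPLE2_BOVIN\t500",
    "CON__sp|P00003|EXAMPLE3_HUMAN\tCON__sp|P00003|EXAMPLE3_HUMAN\t800",
]

# Hypothetical excerpt of an evidence file (tab-separated).
evidence = [
    "Sequence\tProteins\tRaw file",
    "AAAK\tsp|P00001|EXAMPLE1_HUMAN\tsample_01",
    "CCCK\tCON__sp|P00002|EXAMPLE2_BOVIN\tsample_01",
]

# 'Majority' and 'Sequence' only serve to keep the respective header line.
kept_protein_groups = select_lines(protein_groups, r"HUMAN|Majority")
kept_evidence = select_lines(evidence, r"HUMAN|Sequence")

print(len(kept_protein_groups), "of", len(protein_groups), "protein group lines kept")
print(len(kept_evidence), "of", len(evidence), "evidence lines kept")
```

Note that the human entry carrying a contaminant prefix is kept on purpose; only the non-human line is dropped.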

29:54 - And now we need to get, so now our data set has lost several lines of proteins, and these were non-human proteins that came from the contaminant database.

30:06 - And they are removed now. But still, we have contaminants here in our protein list, but they are only potential contaminants.

30:17 - And here we can already see this is probably a keratin, and we would like to keep it.

30:22 - But because this has a plus in the potential contaminant column, we need to remove the plus.

30:28 - And this we can do with the “Replace” tool.

30:40 - The ‘Replace text in a specific column’ tool.

30:47 - And we start with the ‘Select on Data 24’, that was the protein group file.

30:55 - And if you remember, we have seen the plus sign, potential contaminants in column 118.

31:03 - We’re looking for the pattern ‘+’, and we replace it with nothing.

31:10 - And this is how we remove it. And we do the same again.

31:18 - We can use the rerun button. So this is the rerun button.

31:23 - With this button, you can first of all see the parameters that you have chosen in the run before.

31:29 - But it also allows you to use the same parameters, and the same tool for another run.

31:35 - Um so I only need to change here the input.

31:40 - We want to have 47, the ‘Select’ on the evidence file, and the column already appeared.

31:47 - So 118 does not exist in the evidence file.

31:52 - And this was here, the contaminants were in column 54.

31:58 - And again we find a plus, and we just remove it by replacing it with nothing.
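The replacement step can be pictured as follows. This is a minimal sketch of the idea, not the Galaxy tool itself: the toy table has only three columns, so the contaminant flag sits in column 3 here instead of column 118 or 54, and all values are invented.

```python
def clear_plus_in_column(lines, column_number):
    """Remove every '+' from one column (1-based) of tab-separated lines.

    Lines are expected without a trailing newline.
    """
    index = column_number - 1
    cleaned = []
    for line in lines:
        fields = line.split("\t")
        if index < len(fields):
            fields[index] = fields[index].replace("+", "")
        cleaned.append("\t".join(fields))
    return cleaned


# Hypothetical table: the third column flags potential contaminants with '+'.
table = [
    "Protein\tIntensity\tPotential contaminant",
    "EXAMPLE1_HUMAN\t1000\t",
    "EXAMPLE3_HUMAN\t800\t+",
]

cleaned_table = clear_plus_in_column(table, 3)
for line in cleaned_table:
    print(repr(line))
```

Only the chosen column is touched, so a plus sign that happens to appear in another column would stay in place.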

32:06 - And now we are ready for the MSStats analysis.

32:10 - So if you’re not sure, if you can remember which is the protein group, and which is the evidence, you can also rename it.

32:17 - And say: this is the ‘manipulated protein group’, and this will be the ‘manipulated evidence group’.

32:43 - No group, just the manipulated evidence file.

32:48 - And with these files we are now ready to go already to the statistical analysis.

32:56 - So we look for the MSStats tool. There’s two different tools.

33:01 - So the MSStats tool is for label-free data, and MSStatsTMT is for TMT or iTRAQ labeled experiments.

33:09 - So here we choose the MSstats tool for label-free data.

33:15 - And now we need to choose where our data comes from.

33:20 - And this time the data comes from MaxQuant.

33:23 - We could load directly the evidence file from MaxQuant here, but then we would still have, um, the non-human contaminants inside, and would automatically remove our human potential contaminants, which we don’t consider as contaminants.

33:41 - And therefore we use the filtered, or manipulated evidence file, and we use the manipulated protein groups file, so that we make sure that we keep the human skin proteins.

33:54 - And now, here we select the annotation file.

33:59 - And yes, yes, we can here choose that we use the, um, leading razor protein column, so that might be a bit less IDs.

34:17 - Then there are some transformation options, how we can transform, um, the data which is now formatted into an MSstats-compatible format.

34:30 - And here we just select ‘remove the proteins which have only 1 peptide and charge’.

34:36 - The data process options are all good as they are.

34:42 - And here we just add the, the sample quantification matrix table.

34:49 - And we can also add the RunLevelData. Here are some processing options hidden.

34:58 - Um, for the condition plot, we do a zero degree, um, label angle.

35:08 - And we only choose one um, yeah, we only generate a QC plot for all proteins.

35:20 - Otherwise, we would generate a plot for each protein, and we have more than 1000 proteins, so, um, that’s too much information, which we don’t need now.

35:31 - And last, we would like to do a group comparison: ‘Yes’.

35:35 - And if we select this, we need a comparison matrix.

35:39 - And here it is. We load the comparison matrix, we leave the outputs.

35:46 - We don’t even need a comparison plot, which is only helpful if one has more than one comparison.

35:52 - And again we can adjust the plotting options for this volcano plot.

35:57 - Here we set a fold change of 1.5 as a cut off.

36:09 - And we don’t want to display the protein names in the volcano plot.

36:13 - Let’s check that we have everything. Yes, this looks good.

36:24 - So we can start MaxQua- MSstats sorry. This is MSstats now, um, just takes a few minutes, so it might be a good point to make a short break.

36:40 - Okay, so let’s have a look again at the MSstats parameter.

36:47 - So that was all to do the statistical analysis.

36:52 - So the first part is only a conversion of the MaxQuant output files into an MSstats-compatible format.

37:03 - And here it’s really important that the annotation file contains the right annotations and the, um, right file names.

37:13 - Because that’s how the metadata in the annotation file is actually mapped to the, um, runs’ names in the MaxQuant output.

37:25 - The second part, is the data processing. You might have realized that we didn’t change any parameters here.

37:35 - So the processing consists of a log2 transformation, median normalization, and feature selection, missing value imputation, and then the feature intensities are summarized with the, um, TMP method, into protein intensities.
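The first two processing steps can be illustrated with a small sketch. It is not the MSstats code: it assumes that median normalization shifts every run so that all runs end up with the same median log2 intensity, and it leaves out feature selection, imputation and the TMP summarization entirely. Run names and intensities are invented.

```python
import math
from statistics import median


def log2_median_normalize(intensities_by_run):
    """Log2-transform intensities and shift each run to a common median.

    Assumes all intensities are positive; missing or zero values would
    need to be handled before calling this function.
    """
    log_values = {
        run: [math.log2(value) for value in values]
        for run, values in intensities_by_run.items()
    }
    run_medians = {run: median(values) for run, values in log_values.items()}
    # Common target: the median of the per-run medians (an assumption).
    target = median(run_medians.values())
    return {
        run: [value - run_medians[run] + target for value in values]
        for run, values in log_values.items()
    }


# Hypothetical feature intensities; run_B was measured twice as intense overall.
raw_intensities = {
    "run_A": [4.0, 8.0, 16.0],
    "run_B": [8.0, 16.0, 32.0],
}

normalized = log2_median_normalize(raw_intensities)
for run, values in normalized.items():
    print(run, values, "median:", median(values))
```

After the shift both runs share the same median, which is the pattern one would hope to see later in the QC plot.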

37:58 - So these outputs all come from the, um, feature and processing level, and only afterwards we decided to compare the two conditions.

38:12 - And they were defined in our comparison matrix.

38:15 - So only the comparison result table and the volcano plot are actually, um, showing the actual um, statistical modeling results.

38:28 - So let’s have a look at the MSstats output.

38:33 - So first we have a log file. This is just a text file that captures information about the data analysis steps, warnings and so on.

38:45 - Um, if you click on one of these, um, we also see the number of proteins and the number of peptides per protein that were in this dataset.

39:01 - So the process data summarizes, um, or it is a list of features.

39:07 - So we have the peptide and charge state, and for the features, we have a lot of data, including the metadata from the annotation files, of the group they belong to, and the subject they belong to, the run, they belong to, and then for each feature we have an intensity, and we have an abundance value.

39:30 - And the abundance is the log-transformed and normalized intensity value.

39:36 - And in the next step, um, these features were then summarized with the abundances to generate an overall abundance for each protein in each run.

39:50 - And that’s actually the data that we can find in the run-level information.

39:58 - Here we have now a summary of protein intensities per run.

40:04 - So here are the intensities, and we will now, um, calculate the distribution of the numbers of features per protein and run.

40:17 - And see how many features were on average used to quantify a protein.

40:26 - And we will also look into this column, with the run identifiers, and um count how often each unique run entry appears here, in order to find out how many features were present in, in which run.

40:47 - So to do this, we use, um, the ‘Datamash’ tool.

40:59 - So here this. So this is a really helpful tool to perform summarizations on text files.

41:08 - We want to do it on the run level data, and we need to know the column we are interested in.

41:17 - We can still check this here in the small window.

41:21 - So the run name is in column 8. We actually have a header line here.

41:27 - We want to print the header line, and we would like to sort the inputs.

41:34 - And now we count, with this we just count the number of lines that are now getting summarized.

41:43 - And the other thing we would like to know is the amount of features per protein.

41:50 - So we perform a simple summary statistics, again on the run level data, and this time it’s column 4, on the number of features per protein and run.

42:05 - So we change this to c4, and run the step. In the meantime, we can explore more of the MSstats output files.
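The two summaries requested here can be mirrored in plain Python to see what kind of numbers to expect. This is a sketch of the idea only, not of Datamash; the rows are invented and reduced to three fields (run, protein, number of features).

```python
from collections import Counter
from statistics import mean, median

# Hypothetical run-level rows: (run name, protein, number of features).
run_level_rows = [
    ("run_A", "P1", 3),
    ("run_A", "P2", 7),
    ("run_B", "P1", 2),
    ("run_B", "P2", 8),
]


def count_rows_per_run(rows):
    """Group on the run name and count how many rows each run has."""
    return Counter(run for run, _protein, _features in rows)


def feature_summary(rows):
    """Return (median, mean) of the number of features per protein and run."""
    feature_counts = [features for _run, _protein, features in rows]
    return median(feature_counts), mean(feature_counts)


rows_per_run = count_rows_per_run(run_level_rows)
median_features, mean_features = feature_summary(run_level_rows)

for run in sorted(rows_per_run):
    print(run, rows_per_run[run])
print("median:", median_features, "mean:", mean_features)
```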

42:32 - So the next file is a QC plot. It only wants to open like this.

42:41 - So this plot shows the um, protein abundances.

42:46 - So the abundances of all proteins for each sample, and we see that the medians here are, um, pretty on one line.

42:56 - Which is a good sign. And we also have a relatively, um, normally distributed data.

43:06 - And this plot you could also generate for each protein, but then the PDF would have many, many plots.

43:12 - And that’s why we have chosen that we would only like to get the, um, summary for all proteins.

43:20 - Then there’s the sample quantification matrix.

43:24 - This gives us the quantities for each condition.

43:29 - So for each, um, for each sample and for each protein.

43:35 - So we have 19 columns here, one column per sample, and we get an intensity or abundance value for each protein.

43:46 - And then the same is here, um, per condition.

43:50 - For each protein we get, um, a summary of the intensities for each condition.

43:59 - And then the comparison result is the actual, um, statistical table.

44:06 - It contains the protein name, the label, in our case, we only have the comparison metastasized versus RDEB.

44:14 - The log2 fold change, and here the most important column is the adjusted P-value column.

44:22 - There’s also an ‘issue’ column. Um, and here is an example of a protein that was not at all measured in one condition and therefore the issue says, um that the protein is missing in one condition.

44:40 - And later we will filter this table for the adjusted p-value, and the log2 fold change, in order to find, um, the significantly differentially abundant proteins.

44:54 - And the last output file we have here is the volcano plot.

45:06 - So here we have plotted a p-value. Um, in a way that smaller p-values actually give higher values in this plot.

45:16 - So it’s easier to interpret. And on the x-axis we have the log2 fold change.

45:21 - And in the MSstats tool, we said that we would like to plot the volcano plot with a, um, fold change cut-off of 1.5, and if we take the log2 of it, it’s 0.58, and -0.58, that’s the dashed lines here.

45:40 - And everything that is now in the upper right part and in red are proteins that are up-regulated in metastasized um, condition.

45:50 - And here, in this part are the proteins that are down-regulated, which means that they are up-regulated in the RDEB condition.
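The numbers behind the volcano plot can be reproduced with a short sketch. It assumes that the y-axis shows the negative log10 of the adjusted p-value and that 0.05 is the significance level; both are assumptions about the plot, not statements about the MSstats code. The example proteins are invented.

```python
import math


def volcano_point(log2_fold_change, adjusted_p_value, fold_change_cutoff=1.5, alpha=0.05):
    """Return (x, y, label) for one protein in a volcano plot.

    y is -log10(adjusted p-value), so smaller p-values give larger y values.
    An adjusted p-value of 0 is mapped to infinity (a choice made for this sketch).
    """
    threshold = math.log2(fold_change_cutoff)
    if adjusted_p_value <= 0:
        y_value = math.inf
    else:
        y_value = -math.log10(adjusted_p_value)
    if adjusted_p_value < alpha and log2_fold_change > threshold:
        label = "up in metastasized"
    elif adjusted_p_value < alpha and log2_fold_change < -threshold:
        label = "up in RDEB"
    else:
        label = "not significant"
    return log2_fold_change, y_value, label


# The cut-off of 1.5 on the fold change corresponds to the dashed lines.
log2_cutoff = math.log2(1.5)
print(round(log2_cutoff, 2))

# Hypothetical proteins.
print(volcano_point(1.2, 0.01))
print(volcano_point(-0.9, 0.001))
print(volcano_point(0.3, 0.01))
```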

46:04 - So in the meantime, we have a result for our summary statistics.

46:10 - So this is the summary of the numbers of features per protein and run.

46:15 - So we can see that on median, there are three features summarized into a protein, or on average, around five.

46:28 - The datamash tool gives us an, um, overview now about how many lines, um, the run-level data had for each file.

46:41 - And because one line was one feature, this corresponds to the number of features.

46:46 - So we have here the numbers for each file, but we could also visualize it, when we click here on this visualize button.

46:58 - And then choose the bar diagram. We can give it a name.

47:12 - So it’s the number of features per sample. And here we can choose the, um, data that we want.

47:24 - And we only need to change this here, and now we can interactively browse to see which file is which bar.

47:33 - We can save this, and you will find the plot if you go to User, uh, yeah, to User and then Visualizations. That’s where your plots are stored.

47:48 - Okay, so now let’s continue with this um, actually, statistical result here.

47:56 - So we would like to filter only for proteins that have an adjusted p-value below 0.05, and that have a log2 fold change um, above 0.58 or below -0.58.

48:14 - And furthermore, we would like to, um, make the protein ID a bit easier to read.

48:21 - So we only keep the actual Uniprot accession number, and remove this part before, and this part afterwards.

48:31 - So we start with this. We do this with the ‘Replace’ tool.

48:43 - ‘Replace text in a specific column’. We would like to use the comparison result, and column 1.

48:55 - So we want to get rid of this part, but because this pipe sign (|) also has another meaning in this regular expression, um, here, we need to use a backslash, to actually, um, find our pipe, and because you want to remove it, we do not replace it with anything.

49:18 - And now we do the same with the right part.

49:21 - So again we want to have a pipe, and then everything that comes afterwards.

49:27 - And again we want to remove it so we don’t need to put anything in here.
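The two replacements can be tried out in Python, where the pipe also means "or" in a regular expression unless it is escaped. This is a sketch with assumed patterns; the exact patterns typed into the Galaxy tool are not shown in the transcript, and the protein ID is invented.

```python
import re


def accession_only(protein_id):
    """Strip the part before the first pipe and everything from the second pipe on."""
    # Step 1: remove the leading part up to and including the first pipe.
    # Inside [...] the pipe is literal; outside it has to be escaped as \|.
    without_prefix = re.sub(r"^[^|]*\|", "", protein_id, count=1)
    # Step 2: remove the next pipe and everything that comes afterwards.
    return re.sub(r"\|.*$", "", without_prefix)


# Hypothetical ID in the UniProt FASTA style.
example_id = "sp|P12345|EXAMPLE_HUMAN"
short_id = accession_only(example_id)
print(short_id)
```

Replacing with an empty string is what "we do not replace it with anything" amounts to: the matched part simply disappears.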

49:40 - So next, we can already start the filter tool.

49:45 - So when we click here, it’s a safe way to find it.

49:54 - We want to run the filter tool on our replaced file here, and we start by filtering for the adjusted p-value below 0.05.

50:08 - So it’s column 8, below 0.05, and we skip one header line.

50:18 - And now we can directly click the rerun button.

50:21 - We want to use the same tool again, but this time on number 61, which contains only the low p-values.

50:30 - And now we filter for the log2 fold change, which is column three, and we only keep the proteins that have a log2 fold change above 0.58.

50:48 - And we repeat the procedure. So we do it still on this p-value filtered dataset.

50:55 - But this time we switch it, and look for the down-regulated proteins.
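The three filter runs can be sketched as follows. The column positions (3 for the log2 fold change, 8 for the adjusted p-value) come from the steps above; the column names and all values in the small table are assumptions for illustration only.

```python
def filter_rows(rows, keep, header_lines=1):
    """Keep the header lines plus every data row for which keep(row) is true."""
    header, data = rows[:header_lines], rows[header_lines:]
    return header + [row for row in data if keep(row)]

# Made-up comparison table; index 2 is column 3, index 7 is column 8.
table = [
    ["Protein", "Label", "log2FC", "SE", "Tvalue", "DF", "pvalue", "adj.pvalue"],
    ["A", "cmp", "1.20", "0.2", "6.0", "17", "0.001", "0.010"],
    ["B", "cmp", "-0.90", "0.2", "-4.5", "17", "0.004", "0.030"],
    ["C", "cmp", "0.20", "0.1", "2.0", "17", "0.0001", "0.001"],
    ["D", "cmp", "2.00", "1.5", "1.3", "17", "0.200", "0.400"],
]

# First run: adjusted p-value below 0.05.
significant = filter_rows(table, lambda r: float(r[7]) < 0.05)
# Second and third run, both on the p-value filtered table.
up = filter_rows(significant, lambda r: float(r[2]) > 0.58)
down = filter_rows(significant, lambda r: float(r[2]) < -0.58)

print([r[0] for r in up[1:]], [r[0] for r in down[1:]])
```

Protein C passes the p-value filter but fails both fold change filters, which is why the fold change cutoff is applied as a separate step.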

51:09 - And we can now count the number of lines. So we started with 1297, and we have 123 that have a p-value below 0.05.

51:29 - And we have now 8 lines that have a negative fold change.

51:34 - So these are the proteins that are higher in the RDEB condition.

51:39 - And once this is finished, we know about the proteins that are higher in the metastasized condition.

51:46 - And we will continue with both datasets, and therefore we give them a tag.

51:52 - So when we put a hashtag (#) in front of the tag, we can actually keep the tag.

51:59 - So every file that is derived from this file then has this tag here.

52:11 - And here I think we can only do it once it’s finished.

52:16 - Ah, ah yeah, here it is, maybe it survives.

52:31 - And you can also use a tag without the hashtag, but if you use the hashtag, the tag is propagated through your history, and that makes it easier to track the many files that we will generate in the next steps.

53:03 - The job is finished, and now we can visualize the results.

53:09 - And we see that the adjusted p-value is below 0.05 for all proteins.

53:17 - But many of the proteins have an adjusted p-value of zero, because they’re missing in one condition.

53:25 - So in one condition, there was no feature identified and quantified for this protein.

53:34 - And these proteins might also be of interest, and so the first follow-up that we do is on the proteins that are missing in one condition.

53:47 - And to obtain them we will now filter this adjusted p-value, and keep only the proteins that have a zero here.

53:55 - So this is still column 8. We can use the filter tool again.

54:01 - We would like to use the metastasized up-regulated proteins first.

54:09 - And we say that column 8 should be exactly 0.

54:15 - We still have a header line. And now we would only like to keep the protein IDs.

54:31 - So we used the cut tool on the last file that we filtered, and we keep only column 1.

54:42 - So the other values are now not important anymore.

54:46 - We just look at the proteins that all have in common that they were not detected in the RDEB condition, but in the metastasized condition.

55:00 - And we repeat the same for the RDEB dataset.

55:04 - So, from file 63 which has the RDEB tag, we only keep the 0 adjusted p-value proteins.

55:14 - And then the next step, I’ll use again the rerun button, because it’s way faster.

55:21 - And we cut the filtered dataset, and keep only the ID column.

55:31 - And here we can already see the ID column. So it still has a header line.

55:38 - And because we would now like to combine both ID lists into one file,

55:46 - we need to remove this header line, and this can be done with the ‘Remove beginning’ tool.

55:53 - So from this RDEB ID list, we remove the first line.

56:00 - And then we can combine them. And this can be done with the ‘Concatenate’ tool.

56:07 - We concatenate file 65, the metastasized IDs, with file 68, which is the RDEB IDs that have no header line.

56:24 - So we can directly attach them here. In the RDEB file, there was only one protein that was detected in the RDEB condition, but not in the metastasized condition.

56:51 - And now we have removed the beginning, so it’s only one line left, and this line is now attached to file 65.

57:00 - So we should obtain 81 lines now. And these 81 lines contain the protein IDs that were missing in one condition.
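A minimal sketch of the ‘Remove beginning’ and ‘Concatenate’ steps, assuming each ID list is simply a list of lines with one header line on top. The IDs are made-up placeholders.

```python
def remove_beginning(lines, n=1):
    """Drop the first n lines, here the header line."""
    return lines[n:]

def concatenate(first, second):
    """Append the lines of the second list to the first one."""
    return first + second

# Hypothetical ID lists, each starting with a header line.
metastasized_ids = ["Protein", "P00001", "P00002"]
rdeb_ids = ["Protein", "P00003"]

combined = concatenate(metastasized_ids, remove_beginning(rdeb_ids))
print(len(combined))
```

Removing the header from the second list first means the combined file keeps exactly one header line, which matters for the join later on.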

57:11 - And what is interesting now, is to see in how many samples each protein was actually found.

57:20 - And to do this, we go back to the sample quantification table from MSstats.

57:31 - There was the sample quantification matrix.

57:35 - And here we have for each protein and each sample, we have an intensity value, and we would like to get these values now for our proteins that turned out to be missing in one condition.

57:50 - And because we have only kept everything between these two pipes, we now also need to extract only these IDs from this matrix, in order to make it possible to automatically join the IDs of both files to obtain the quantification.

58:12 - So we will do the replace step. Where is it? We have done it before.

58:22 - So we will rerun this step, but this time on the sample quantification matrix.

58:29 - And everything else we can keep as it is.

58:33 - So we want to remove the part before the ID and everything afterwards; we just do it on the sample quantification matrix, not on the run-level data.

58:49 - And after doing this, there’s a tool called ‘Join’, which automatically joins two files according to mutual information in one column.

59:03 - So we can already select the tool. And in the tool we could even decide if we would like to keep the IDs that are in both files,

59:18 - or the IDs that are only in the first, but not in the second, and so on.

59:24 - So there are many conditions which make this tool very powerful.

59:28 - So here we would like to use the sample quantification table.

59:36 - That’s the one where we just replaced the protein IDs, in order to have the same format as we currently have in our ID list.

59:49 - And the IDs are in column 1. And we would like to join this with our IDs that we have from the metastasized and the RDEB file.

60:04 - where the IDs are also in column 1.

60:09 - Which we can verify by looking at the dataset once it’s finished.

60:15 - And in both files, we should have still a header line.
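What the join does here can be sketched as an inner join on the ID column: only rows whose ID occurs in both files are kept. This is a simplified illustration with made-up IDs, sample names and values, not the implementation of the Galaxy tool.

```python
def inner_join(left_rows, right_rows, left_col=0, right_col=0):
    """Join two tables (lists of rows, header first) on one column each.

    Only keys present in both tables are kept; the key column of the
    right table is not repeated in the output.
    """
    right_by_key = {row[right_col]: row for row in right_rows[1:]}
    header = left_rows[0] + [c for i, c in enumerate(right_rows[0]) if i != right_col]
    joined = [header]
    for row in left_rows[1:]:
        match = right_by_key.get(row[left_col])
        if match is not None:
            joined.append(row + [v for i, v in enumerate(match) if i != right_col])
    return joined

# Hypothetical sample quantification matrix and ID list.
quant = [
    ["Protein", "sample_01", "sample_02"],
    ["P00001", "21.3", "NA"],
    ["P00002", "19.8", "20.4"],
    ["P00003", "NA", "18.9"],
]
ids = [["Protein"], ["P00001"], ["P00003"]]

joined = inner_join(quant, ids)
for row in joined:
    print("\t".join(row))
```

The join only works because both files now use the same ID format, which is why the replace step had to be repeated on the quantification matrix.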

60:19 - So we say: header line: Yes. And next, or last, we use a heatmap tool, and we will put our data into the heatmap.

60:48 - The data will have a header and row names. After joining, the IDs are still in the first column, and that’s what we also have here automatically.

61:04 - And we would like to change the output file a bit, so make it a bit bigger, so 15 width, and 10 height should be fine.

61:40 - So we need to wait for the concatenate step to finish, and then the join tool and the heatmap will automatically start.

62:01 - Okay, so the jobs are finished. We can look at the visualization heatmap.

62:31 - There was only one protein that was present in 4 out of 6 RDEB samples

62:38 - and in none of the metastasized samples, and there were many proteins that were not detected at all in RDEB, but in several metastasized samples.

62:49 - But there’s only one protein that is actually present in each metastasized sample, and in none of the RDEB samples.

62:59 - And this Uniprot ID corresponds to Collagen VII, which is expected to be absent in the RDEB patients that have the genetic disease.

63:13 - And this shows that looking into proteins that are missing in one condition is actually also an important step.

63:21 - One might want to apply further filter criteria.

63:25 - Because, for example, there are a few proteins that are only present in 4 of the metastasized samples, so they might not be of interest.

63:43 - But others that are more frequently found could actually be interesting proteins.
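One possible further filter criterion is the number of samples per condition in which a protein was detected. The sketch below assumes that missing values are written as `NA` and that we know which sample belongs to which condition; the sample names, the values and the threshold of 3 are all made up.

```python
def detection_counts(row, sample_names, condition_of, missing="NA"):
    """Count, per condition, in how many samples a protein has a value."""
    counts = {condition: 0 for condition in set(condition_of.values())}
    for name, value in zip(sample_names, row[1:]):
        if value != missing:
            counts[condition_of[name]] += 1
    return counts

# Hypothetical joined table: protein ID followed by one value per sample.
header = ["Protein", "met_1", "met_2", "met_3", "rdeb_1", "rdeb_2"]
condition_of = {
    "met_1": "metastasized", "met_2": "metastasized", "met_3": "metastasized",
    "rdeb_1": "RDEB", "rdeb_2": "RDEB",
}
rows = [
    ["P00001", "20.1", "21.0", "19.5", "NA", "NA"],
    ["P00002", "18.2", "NA", "NA", "NA", "NA"],
]

# Keep proteins that were detected in at least min_detected samples
# of one condition (the threshold is an arbitrary example value).
min_detected = 3
kept = [
    row[0] for row in rows
    if max(detection_counts(row, header[1:], condition_of).values()) >= min_detected
]
print(kept)
```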

63:51 - And that was the part on the missing proteins, now we will switch to the proteins that were not missing.

63:59 - We will continue with the significant proteins that were not missing in one condition, and to do so, we go back to the first filtering step here.

64:15 - So we already had a filtering step to get only the proteins with p-values below 0.05, and then we filtered for the fold change cutoff, to find up- and down-regulated proteins.

64:31 - And we have seen that these proteins have an adjusted p-value of zero when the protein was missing in one condition.

64:37 - But now we are interested in the proteins that have a p-value above zero, and below 0.05.

64:44 - So we will continue with the file 62 and then afterwards with 63, and filter for a p-value.

64:57 - So we will filter 62, the metastasized up-regulated file, on column 8, and we would like to have a p-value above zero.

65:09 - And because we have already filtered for below 0.05, this should now give us the proteins that are significant and not missing in one condition.

65:20 - And we repeat exactly the same for the RDEB up-regulated proteins.

65:31 - And then we only keep the ID column. So we used the Cut tool.

65:39 - Yes, we’re now here. We follow up. Now here, we find differentially abundant proteins.

65:50 - We use the Cut tool, and we keep column 1. And because it’s exactly the same procedure,

65:58 - we can actually use the ‘multiple datasets’ option here, and select both files that we have just filtered for the p-value between 0 and 0.05.

66:12 - And from this we cut column 1 to obtain the protein IDs.

66:24 - And here we can already see how many proteins we have.

66:27 - So 7 lines means we have 6 proteins that are up-regulated in RDEB, and we have 36 that are up-regulated in metastasized cSCC.

66:41 - And now we would also like to visualize their quantities.

66:46 - So we will again match the IDs with the sample quantification matrix.

66:55 - But this time we do it separately for RDEB and metastasized.

66:59 - So we don’t concatenate the IDs and make only one heatmap.

67:05 - But we repeat the procedure twice. So here.

67:16 - So we would like to have the sample quantification matrix, but we had to replace the IDs in it before.

67:26 - So it was file 70, where we had the correct IDs, and then we had the quantities for each sample.

67:34 - And it’s column 1, and we should be able to do the same trick again, because we do it for RDEB and metastasized files.

67:45 - We use both files here, and we will obtain the joined data once for the metastasized IDs and the sample quantification, and once for the RDEB IDs and the sample quantification.

68:01 - And the files all have a header line, and we need to select the column, it’s column 1 for both files.

68:12 - And this time, we use a different heatmap tool, just because it looks nicer.

68:18 - So it’s ‘heatmap2’, and we can do it again in parallel for both files.

68:33 - But not if we want to give it a name; then we do it separately.

68:37 - So if you want to use a plot title, this would be the up-regulated proteins in metastasized cSCC.

68:50 - You can have a look here, column 1, and then the quantities.

68:56 - We don’t want to use clustering, and we would like to scale the data.
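Scaling can be understood as bringing every protein onto a comparable scale before the colors are assigned. The sketch below assumes row-wise z-scoring (subtract the row mean, divide by the row standard deviation); whether the heatmap tool scales by row or by column depends on the option selected, and the values are made up.

```python
from statistics import mean, stdev

def scale_row(values):
    """Z-score one row: subtract the row mean, divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # raises an error for fewer than two values
    return [(v - m) / s for v in values]

# Hypothetical intensities of one protein across three samples.
row = [18.0, 20.0, 22.0]
print(scale_row(row))
```

After scaling, each row has mean 0, so the heatmap shows in which samples a protein is above or below its own average rather than its absolute intensity. A row with identical values has a standard deviation of 0 and would need special handling.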

69:06 - But now we can use the rerun button, and we just need to change the input file to the RDEB file, and we need to change the plot title to RDEB.

69:21 - Okay, but now we’re still at the level of the protein IDs, and in order to have something more meaningful than protein IDs,

69:39 - we would like to know the protein names. And there are many different ways to do this.

69:48 - So there’s a Uniprot tool in Galaxy, the Uniprot ID mapping and retrieval tool. Here we can put in the output of the join tool for the metastasized proteins, use the first column, which is the ID column, and then retrieve the Uniprot entries as a FASTA file.

70:19 - And we repeat the same for the RDEB file, column 1.

70:26 - And because we now have a FASTA file, it is quite tricky to read, because it contains all the amino acid sequences, so as a last step we use the FASTA-to-Tabular converter.

70:46 - We can again convert both in parallel, and we split up the header into two parts, and that gives us a table format with the protein names of the up- and down-regulated proteins.
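A minimal sketch of such a FASTA-to-tabular conversion, assuming the header is split at the first space into an ID part and a description part. The two FASTA entries are invented placeholders, not real Uniprot records.

```python
def make_row(header, seq_parts):
    """Split the header at the first space and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return (identifier, description, "".join(seq_parts))

def fasta_to_tabular(fasta_text):
    """Return one (id, description, sequence) row per FASTA entry."""
    rows = []
    header, seq_parts = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new entry starts: store the previous one first.
            if header is not None:
                rows.append(make_row(header, seq_parts))
            header, seq_parts = line[1:], []
        else:
            seq_parts.append(line)
    if header is not None:
        rows.append(make_row(header, seq_parts))
    return rows

# Hypothetical FASTA content with sequences wrapped over several lines.
FASTA = """>sp|P00000|EXAMPLE_HUMAN Example protein OS=Homo sapiens
MKT
AAL
>sp|P00001|OTHER_HUMAN Other protein
MSS
"""

for row in fasta_to_tabular(FASTA):
    print("\t".join(row))
```

The first column holds the ID, the second the readable protein name, and the third the amino acid sequence.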

71:09 - In the meantime, we can inspect the heatmap, one heatmap is at least finished.

71:26 - So this is the heatmap for the proteins up-regulated in metastasized cSCC.

71:34 - And we clearly see here that they have higher intensities than in the RDEB samples.

71:45 - And we will see the opposite here, once it is finished, and here is already one result.

71:53 - So these are now the proteins up-regulated in metastasized cSCC, and we have split the FASTA file so that we have this ID again.

72:07 - But then here in the second column, we can read the actual protein name.

72:14 - And in the third column comes the sequence with the amino acids.

72:21 - And here’s, for example, vitronectin, which was also found in the original study.

72:28 - But in the original study, the MaxQuant parameters were different, and a different statistical approach was used.

72:39 - And we see several histones here, and also many RNA-related proteins.

72:54 - Okay, so we will just need to wait to see the results.

73:09 - So if you want to analyze your own data with MSstats, the really crucial part is to set up the annotation file correctly, because this is the information that MSstats uses to decide how to fit the linear models to the data.

73:38 - And then, as you have seen now, there are many text manipulation tools in Galaxy, so everything that you can do in Excel is also doable in Galaxy.

73:51 - The most complicated part is probably to know the exact name of the tool.

73:56 - But we have already used a lot of tools here, so you already know some of them.

74:11 - Now the tools are finished. We can look at the heatmap here, and we see these are the proteins up-regulated in the RDEB condition, and they have higher intensities than in the metastasized condition.

74:35 - And now we also have the tabular file with the protein names.

74:39 - And we find two collagens here, in the list of proteins that are more abundant in the RDEB condition than in the metastasized condition.

74:52 - And that could be explained by the fact that when collagen VII is missing, there’s a compensation effect, and other collagens become up-regulated. And actually, for the collagen 14 protein, immunofluorescence stainings were also done in the original publication, which confirmed that the abundance of collagen 14 is way higher in RDEB cSCC than in metastasized cSCC.

75:32 - So that was the training. I hope you have learned something, and you will be able to repeat the training on the datasets, and maybe in the future also on your own data.

75:48 - Thank you very much for joining this video.

75:54 - Hands-on demonstration!