GTN Smörgåsbord - Day 2 - Advanced R
Mar 18, 2021 09:35 · 9436 words · 45 minute read
Hi everyone, welcome to this tutorial about using R in Galaxy. This is a continuation of the previous video, where we covered how to use basic R functionality through Galaxy. As a quick overview, we’re going to use the interactive RStudio provided by the Galaxy interface. Since we already have a session running, you can go to User, then Interactive Tools, and you’ll see that it is running here.
If I click on the RStudio link, it will pop up the interface that we were working in before. In our previous session we had a look at some basic operations in R: how to create variables, the different ways to apply mathematical operations, how to assign values and how to remove them. We also talked a bit about how to work with vectors and how to do subsetting, and finally a few things about lists. In this session we will discuss a few more steps on how to deal with data, specifically how to manipulate tabular data from Galaxy in R.
We will check how to load a tabular dataset and explore its shape and contents using base R functions, and we’ll see a few things about factors and how they can be used to store information. We’ll also use one of the most commonly used libraries in R, dplyr, to manipulate the data. An interesting point to keep in mind is that a substantial amount of the data we work with in science is tabular, that is, data arranged in rows and columns, so there are some principles that are good to keep in mind when working with such data.
The first is to keep the raw data separate from the analyzed data. In principle, you load the raw data and then you don’t touch them anymore; when you write back your analyzed, refined or normalized data, you write them to a different file, so you don’t run the risk of accidentally changing the original data. The second is to keep spreadsheet-type data tidy. A simple way of thinking about this is to have one row in the spreadsheet for each observation or sample, and one column for every variable that we want to measure or report on.
Although this is simple and easy to explain, it is also one of the easiest principles to violate. It is worth keeping in mind that a vast amount of scientists’ time is dedicated to tidying data for analysis. And finally, a crucial aspect is to trust the data but always verify. You don’t need to be paranoid about the state the data is in, but you should always have a plan to assess and verify them. This is actually one of the focuses of this particular discussion right now.
In many cases you might have some assumptions or expectations about the data: the range of values, how many there are, and what the different observations will be. As the data grow larger, these expectations might not reflect the actual content, and they are no longer easy to verify by eye. For this reason it is good practice to check at the beginning that the data are correct, and we will see how this works.
So one of the first things we’re going to do now is create a new R script. I’m going to close this one and go to File, New File, R Script, and I’m going to save it as R_advanced. There we go. Now we are ready to import some tabular data into R. The easiest way to do that is to use a function called “read.csv”, which can take a lot of different things as input: a file, a stream, or, as in our case, a URL.
Essentially, we will be using the output of the RNA-seq lesson just before: the annotated differentially expressed genes that we created there. I’m going to save this as a new variable, “annotatedDEGenes”, so I write the name of our variable, the function that reads the data in, and the full URL. I’m going to execute this, and as you can see we have a new entry in the environment here.
If I have a quick look, it does not look as good as we would have liked, which means that it kind of worked, but not in the best way. If I open the file I’ve downloaded and check it locally, we’ll see that it is tab delimited instead of comma separated. So what I can do is use an additional parameter here, the separator “sep”, which I define as “\t”.
If I rerun this command, you see now that I have 130 observations, which are equivalent to rows, and 13 variables, so 13 columns. If I expand this entry, I can see the different columns that this table has. I can also use commands in the console to have a look at the column names of this variable.
So I’m going to use “colnames” and run this, and you can see that the same names that I see here in the environment also appear as a vector in my console. So congratulations! You’ve successfully loaded your data into RStudio.
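The effect of the separator can be reproduced without the real file. The sketch below is only an illustration: it feeds a small made-up tab-separated string to “read.csv” through its “text” argument, and the column names and values are assumptions, not the actual dataset.

```r
# Toy stand-in for the real file: three tab-separated columns, two rows.
# Column names and values are made up for illustration.
tsv_text <- "GeneID\tBase.mean\tStrand\ng1\t1086.97\t-\ng2\t6409.58\t+"

# With the default separator (a comma), each line ends up in a single column.
wrong <- read.csv(text = tsv_text)

# Declaring the tab separator splits the columns as intended.
right <- read.csv(text = tsv_text, sep = "\t")

dim(wrong)       # rows and columns when the separator is wrong
dim(right)       # rows and columns when the separator is right
colnames(right)  # the column names as a character vector
```

With a real dataset the first argument would be the file path or URL instead of “text”.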
07:10 - Now that we’ve loaded the data, the next thing we need to do is have a quick look at what they actually contain, so get some summary of the information here. To do that, we can use a function called “summary”, which is very convenient because we don’t really need to remember a lot of things. We run it with Control+Enter, and now in the console output we see a lot of information.
We can also use the function that we saw last time, called “str” (short for structure), which gives us a better understanding of the structure of this variable. All right, we have a lot of information here, so let’s have a better look. In terms of the summary, you can see that every column is represented as a block, so there’s a block here, a block here, and so forth. If the column is numeric, it gives us some basic statistics: the minimum, the first quartile, the median, the mean, the third quartile and the maximum.
So we have a basic indication of what the information is. If the column is character, and you can see that you have character here, here, and in those last ones, you only see a few things: the length of this column, so how many elements it has, its class and its mode, and that’s it. Another point to keep in mind is that the structure output gives us information about each individual column.
In other words, we can see that the GeneID is a column of character type and these are its first values, that this one is an integer, and so forth. So using both the summary and the structure we get a better understanding of the content. An interesting point here is that a lot of the variables, like the base mean, the log2.FC (for fold change) and the P-value, are numerical data, and they get these summary statistics.
Some others are treated as character data, and for them we get length 130, class and mode. It is useful to have this in mind because it helps us understand how the data will behave. Another point, and this is one of the new topics of this lesson, is that by default all those columns are treated as either character or numeric. In some cases, like Start and End, which are whole numbers, they are stored as integers instead of numerics, the difference being whether or not they can hold a floating-point value.
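As a rough illustration of what “summary” and “str” report for different column types, here is a small sketch on a made-up data frame; the column names mimic the ones in the talk, but the values are invented.

```r
# Made-up table with a character, a numeric and an integer column.
toy <- data.frame(
  GeneID    = c("g1", "g2", "g3", "g4"),
  Base.mean = c(19.2, 1086.9, 65114.8, 2192.3),
  Start     = c(100L, 2500L, 40L, 980L),
  stringsAsFactors = FALSE
)

summary(toy)  # numeric columns: min, quartiles, median, mean, max
              # character columns: length, class, mode
str(toy)      # one line per column: its type and its first values

class(toy$Base.mean)  # numeric (floating point)
class(toy$Start)      # integer (whole numbers only)
```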
However, there are some cases where the information is of more use if it is actually categorical data. What does this mean? There is a parameter when loading the file called “stringsAsFactors”. Until recently this parameter was set to “TRUE” by default, and with the new major version of R (4.0.0) it has been changed to “FALSE”. I will set it to “TRUE”, rerun this command, and also rerun the summary and the structure.
You’re going to see some changes compared to the previous case. If I go back to the structure, you can see that all the character-based columns have now been changed to factors. Factors are one of the major data structures in R because they allow us to work with categories; you can think of them as a special case of character vectors. And if we go to the summary, you can see that instead of saying that GeneID, for example, has length 130, it now gives a count of how often each of the categories within a column occurs.
The easiest example to see is Strand, where we have 72 cases of plus and 58 cases of minus. Or Feature, for example, where we have 126 protein coding genes, 3 long non-coding RNAs and 1 pseudogene. Sometimes you might need to treat data as a factor, and otherwise you may want to keep them as character. In this instance I’ve explicitly asked for everything to be changed to factors so that we can show exactly how factors work.
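The difference made by “stringsAsFactors” can be sketched on a tiny made-up table (the data below are assumptions for illustration only):

```r
# Made-up tab-separated text with a categorical Strand column.
tsv_text <- "GeneID\tStrand\ng1\t+\ng2\t-\ng3\t+"

as_chr <- read.csv(text = tsv_text, sep = "\t", stringsAsFactors = FALSE)
as_fct <- read.csv(text = tsv_text, sep = "\t", stringsAsFactors = TRUE)

class(as_chr$Strand)    # character
class(as_fct$Strand)    # factor

summary(as_chr$Strand)  # only length, class and mode
summary(as_fct$Strand)  # a count for each category
```

Setting the parameter explicitly, rather than relying on the default, keeps the result the same across R versions.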
13:14 - So the first thing we are going to do, let me change the size of the table again, is to extract the Feature column from the annotated differentially expressed genes table and save it into a new variable called “feature”. Let’s run it like that. As you can see, here we have our original data frame, our table, and now we have another variable. If I use the function called “head”, which shows the first few values of the variable, you see that it prints out that the first values of the vector are “protein_coding”, “protein_coding”, “protein_coding”, “protein_coding”, “protein_coding”, “protein_coding”.
So everything is the same, but it also gives us another piece of information, called “Levels”. Levels are essentially the different categories that are supported by this particular factor. An easy way to think of a factor is as a drop-down list with a limited number of options: you cannot have a value there other than the ones already defined here. And if I look at the structure of “feature”, it tells us that we have a factor with three levels, it lists the levels, and there is some more information here, which is numbers.
So why are we having numbers here? For the sake of efficiency, and to store less information, R stores the content of a factor as a vector of integers, where an integer is assigned to each of the possible values in alphabetical order. In other words, looking at the levels of our “feature” object, the first one is the long non-coding RNA (lincRNA), because L comes first alphabetically; then we have protein_coding and then pseudogene, because PR comes before PS.
So in terms of numbers, 1 is lincRNA, the long non-coding RNA, 2 is protein_coding and 3 is pseudogene. The structure function gives us the different levels plus the first few values. As we’ve seen before, the first few values are all protein_coding, so if I look at the first few values, which are the numbers we have here, you can see that they are 2, 2, 2, 2, 2. In other words, this is the internal representation of the different factor values as integers.
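The integer representation can be inspected directly. A minimal sketch, using a short made-up vector with the same three categories mentioned in the talk:

```r
# Short made-up vector; the real column has 130 values.
feature <- factor(c("protein_coding", "protein_coding", "lincRNA",
                    "pseudogene", "protein_coding"))

levels(feature)      # the allowed categories, sorted alphabetically
as.integer(feature)  # each value stored as its position within levels()
str(feature)         # levels plus the first few integer codes
```

Here “lincRNA” gets code 1, “protein_coding” code 2 and “pseudogene” code 3, so the vector is stored as 2, 2, 1, 3, 2.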
One of the most common uses of factors is to plot categorical values, so let’s try that using the base function “plot”. I’m going to type “plot(feature)” and execute it, and as you can see down here, let me zoom in a bit, it produced a simple bar plot. We’re going to see much more about how to create nice publication-quality graphics in a later tutorial, but for the time being it is good to know how this works. So factors are an efficient way of storing categorical data, using this integer representation internally.
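A sketch of the same plot on a reconstructed factor: the counts (126, 3 and 1) are the ones quoted in the talk, but the vector itself is rebuilt here for illustration and is not the real column.

```r
# Rebuild a factor with the category counts quoted in the talk.
feature <- factor(c(rep("protein_coding", 126),
                    rep("lincRNA", 3),
                    "pseudogene"))

counts <- table(feature)  # how many values fall into each level
counts

plot(feature)  # for a factor, plot() draws one bar per level
```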
17:13 - So now we’ve seen the factors, and we will return to them in a bit. Another point to check is how to do subsetting: we have a table here, but we might need to extract a particular piece of information. We’re going to use the exact same notation as we did with vectors, the square brackets, so let’s try a few ways of subsetting. Let’s say, for example, that we want to extract the first value of the first observation.
In other words, if I use this button to see the data as an actual table, what I want to extract is this one: the value in my first row and my first column. I can use the index [1,1]. If I run this, it gives me my first value, and it actually gives me more information, because I’ve extracted a piece of a factor column: it reminds me that this is a factor and lists the different levels that exist in it.
I can do the exact same thing with different indexes, let’s say [2,4]. If I run this I get this particular value, and if I check the actual table and go to the second row, fourth column, there it is: this is the value that we’ve just retrieved. As we did with vectors, we can also specify ranges. I can use, for example, [1:4,1], and if I run this it will give me the first four values of my first column, so rows 1 to 4 of column 1.
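The row and column indexing described here can be tried on a small made-up data frame (all names and values below are assumptions for illustration):

```r
# Made-up table: 5 rows, 4 columns.
toy <- data.frame(
  GeneID    = c("g1", "g2", "g3", "g4", "g5"),
  Base.mean = c(1086.9, 6409.6, 65114.8, 2192.3, 19.2),
  log2.FC   = c(-4.1, 2.9, -2.4, 2.7, 1.2),
  Strand    = c("-", "+", "+", "-", "+"),
  stringsAsFactors = FALSE
)

toy[1, 1]      # first row, first column
toy[2, 4]      # second row, fourth column
toy[1:4, 1]    # rows 1 to 4 of column 1, returned as a vector
toy[1:3, 1:2]  # rows 1 to 3 of columns 1 to 2, still a data frame
```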
Again, if I go to my first column here, 1, 2, 3, 4, it gives me those four values. I can combine ranges in both directions, so I can say 1 to 10 for the rows and 1 to 5 for the columns, and if I run this I get a subset of the same table. And, again similar to vectors, I can be more explicit: I want the first 10 rows, but I also want to capture the columns named “Feature” and “Gene.Name”. OK, so I’ve caused an error here because I mistyped one of the column names. I can check exactly what the names are, and there we go, this is my problem: I have a typo. I wrote the name with a capital “N” whereas the actual column name has a small “n”. So I’m going to change this to an “n” and run it again, and now it gives me the correct subset: the first 10 rows, but only for the columns “Feature” and “Gene.name”.
As we did last time, I can also disregard parts of the table, for example by asking for everything but the first row. In this case, because I have nothing in the second position, I have all the columns, but without the first row: as you can see, instead of starting from position 1 this starts from position 2. I can do the exact same thing without the minus sign, and in this case it will give me all the columns and only the second row. Or I can combine these and ask for rows 2 and 3 and all columns, and vice versa.
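Selecting columns by name and dropping rows with a negative index could look like the sketch below. The table is made up; only the column names “Feature” and “Gene.name” come from the talk.

```r
# Made-up table with 4 rows.
toy <- data.frame(
  GeneID    = c("g1", "g2", "g3", "g4"),
  Feature   = c("protein_coding", "lincRNA", "protein_coding", "pseudogene"),
  Gene.name = c("Ama", "Sesn", "Kal1", "Ant2"),
  stringsAsFactors = FALSE
)

# Column names are case sensitive, so checking them first avoids typos.
colnames(toy)
"Gene.Name" %in% colnames(toy)  # capital N: not a column of this table
"Gene.name" %in% colnames(toy)  # small n: the actual column

by_name   <- toy[1:2, c("Feature", "Gene.name")]  # rows 1-2, two named columns
no_first  <- toy[-1, ]   # all columns, every row except the first
second    <- toy[2, ]    # all columns, only the second row
rows_2_3  <- toy[2:3, ]  # all columns, rows 2 and 3
cols_1_2  <- toy[, 1:2]  # all rows, columns 1 and 2
```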
I’m going to change that and ask for all the rows but only columns 1 to 3, and here, as you can see, I have 130 observations but only the first 3 columns, from GeneID to log2.FC. Another point, which is specific to data frames, is that you can use the dollar sign that we saw earlier to retrieve a particular column. For example, in this case I’ve asked for the whole “GeneID” column, and this is how it works.
A final point, again specific to data frames, although we also kind of saw this with vectors, is that I can use a condition: I can access, for example, the Feature column and say that I want to extract only the contents of the table that have “pseudogene” as a feature. As you can see, it gives me all the columns, because I have nothing after the comma, but only the rows for which the Feature column contains “pseudogene”.
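A minimal sketch of the dollar sign and of condition-based row selection, again on a made-up table:

```r
# Made-up table; only the column names follow the talk.
toy <- data.frame(
  GeneID  = c("g1", "g2", "g3", "g4"),
  Feature = c("protein_coding", "lincRNA", "protein_coding", "pseudogene"),
  stringsAsFactors = FALSE
)

toy$GeneID  # a whole column as a vector

# The comparison yields one TRUE/FALSE per row; rows with TRUE are kept.
is_pseudo <- toy$Feature == "pseudogene"
pseudo    <- toy[is_pseudo, ]  # nothing after the comma: all columns
pseudo
```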
I could take it one step further and also add a column selection to the subset, but I will stop the subsetting here. There are a lot of exercises in the training material on Galaxy, so I would suggest that you have a look and try them out; hopefully they provide a bit more context on how subsetting works. So now we’ve seen how subsetting works in data frames, and I’ll go back to the factors.
When we loaded the original data table, you saw that we have factors and numbers and so forth, but we might not want all of these to be factors; we might want some of them as they originally were, in other words as characters. So one common activity is to change the type of the values of a column from one type to another, which is basically what we call coercion. Let’s try this with the gene name: we don’t want “Gene.name” to be a factor, we want it to hold the actual values.
So let’s check the structure of the “Gene.name” column. Let me scroll down, there it is: it tells us that it is a factor with 130 levels, and these are the integers associated with each level. Now let’s try something weird. We know that this column contains characters: if I open the table and scroll to the end, I see characters here, so I want characters. We saw previously with vectors that if I try to coerce strings to numbers, R gives us a warning and introduces a lot of missing values. So let’s try that here with “as.numeric”, putting the annotated “Gene.name” column as input.
It will quite possibly give an error. Huh, this is interesting: it actually ran without any warnings, and for some reason the characters have been changed to numbers. So it works, but it actually doesn’t: instead of giving an error message, R returns numeric values, which in this case are the integers assigned to the levels of this factor, as you can see here. For example, “Ama” was 88, and you can see 88 here, and so forth. This is the kind of behavior that can sometimes lead to hard-to-find bugs.
For example, if you load a table and a column is accidentally converted to a factor, and you then try some numerical operations on it, it will be changed to numbers with no errors, so you assume everything is going well when actually there is a problem. If you don’t look carefully, you may not notice it. So how do we handle this? We make an explicit coercion of this particular column to characters.
So if I try this one, “as.character”, I’m going to paste it here and run it, and you can see that this time it works as we wanted and these are the different values. Now that we know this works, I’m going to overwrite, and this is important to highlight, my original column in the data frame that I loaded with the converted values, and you see that in my table the column has changed from a factor to a character. However, bear in mind that I have not written anything back to my original data file, so the change exists only within RStudio. If need be, I can easily reload my original data and apply the exact same process again.
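The silent behaviour of “as.numeric” on a factor, and the explicit coercion with “as.character”, can be reproduced on a few made-up gene names (only “Ama” is taken from the talk; the rest are placeholders):

```r
# Made-up factor column with three distinct labels.
toy <- data.frame(Gene.name = c("Ama", "Sesn", "Ama", "CG1234"),
                  stringsAsFactors = TRUE)

str(toy$Gene.name)  # a factor with 3 levels plus integer codes

# No error and no warning: the result is the level codes, not the labels.
codes <- as.numeric(toy$Gene.name)
codes

# Explicit coercion to character recovers the labels.
labels <- as.character(toy$Gene.name)
labels

# Overwrite the column in the data frame; the file on disk is untouched.
toy$Gene.name <- as.character(toy$Gene.name)
class(toy$Gene.name)
```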
So again, as a reminder: when loading a data table, this parameter now defaults to “FALSE”, so no column is changed to a factor. However, if you set it to “TRUE”, or if you explicitly convert a column to a factor, always remember to be careful when applying coercion, especially when numbers are involved. If a column that should contain numbers accidentally contains some characters, everything in it is read as character and converted to a factor, and the levels of that factor correspond to integers. If you coerce this to numeric, the integer codes of the levels are retrieved: you still see numbers, and unless you check that they correspond to what you expect, you might miss the problem altogether.
It’s a common enough mistake, and it can be summed up in a single point: when dealing with factors, and you want to convert a factor or apply a function to it, always keep in mind what data you’re actually applying this to.
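The numeric variant of this pitfall is sketched below with invented values: a column that should hold numbers contains one stray text entry, so it ends up as a factor. Going through “as.character” first is a common way to recover the values; it is shown here as one possible approach, not as the method used in the video.

```r
# Made-up column: numbers stored as text, with one stray entry ("n/a").
counts <- factor(c("10", "2.5", "n/a", "40"))

# Direct coercion returns the level codes, which look like plausible numbers.
codes <- as.numeric(counts)
codes

# Converting to character first recovers the real values; the stray entry
# becomes NA. Without suppressWarnings(), R warns that NAs were introduced
# by coercion, which is a useful signal in real work.
values <- suppressWarnings(as.numeric(as.character(counts)))
values
```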
28:41 - All right, so now we’ve covered all this information about factors, let’s check how we can also apply numerical functions to a column. For example, in this table we also have the “Base.mean” column, which corresponds to the normalized counts across all the samples, normalized for sequencing depth and library composition. Let’s say we want to get some numeric information from it. We can apply functions directly to a column, as we’ve just seen: for example “mean”, which gives us the average value, “min”, which gives the minimum, “max”, which gives the maximum, and so forth.
Let’s try finding the maximum value of this column, using the exact same approach: the name of the variable, the dollar sign, and then “Base.mean”. Control+Enter, and as you can see it gives us the maximum value here. This is useful, and we can also sort the table by this information. To do that I’m going to use the subsetting approach, but instead of giving indexes I will ask R to “order” the rows based on the value of “Base.mean”, and I’m putting a comma after it because I want all the columns: I literally want to reorder my entire table by ordering my rows based on this value.
I can store the sorted table in a new variable, for example “sorted_by_mean”, and assign it here. If I run this, you see it has been executed and the new variable is available in the environment. I can do a quick check with “head” of “sorted_by_mean”, and here we see that the “Base.mean” values start at 19, 23, 24, 26 and so forth.
If I do the exact same thing on my original table, the “Base.mean” values are in no particular order: about one thousand, then sixty-five thousand, then two thousand, so going up and down, exactly as we loaded them. In the sorted table they are in ascending order, starting from 19 and going from the smallest to the largest number. There is an option in “order” that allows us to change that: the parameter is called “decreasing”, and by default it is “FALSE”, but we can set it to “TRUE”. If I run the same command again with that option and check the “head” of “sorted_by_mean” again, we see that it now starts with the maximum value that we found earlier, and continues in decreasing order.
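Sorting a table with “order” inside the square brackets can be sketched as follows. The values are made up; “sorted_by_mean” follows the name used in the talk, and “sorted_desc” is a hypothetical name for the decreasing version.

```r
# Made-up table in no particular order.
toy <- data.frame(
  GeneID    = c("g1", "g2", "g3", "g4"),
  Base.mean = c(1086.9, 65114.8, 19.2, 2192.3),
  stringsAsFactors = FALSE
)

max(toy$Base.mean)  # largest value in the column

# order() returns the row positions that would sort the column; using them
# as the row index, with nothing after the comma, reorders the whole table.
sorted_by_mean <- toy[order(toy$Base.mean), ]
head(sorted_by_mean)

# decreasing = TRUE reverses the direction (the default is FALSE).
sorted_desc <- toy[order(toy$Base.mean, decreasing = TRUE), ]
head(sorted_desc)
```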
By doing that we can play around with the different options and create the version of the table in the form we find most appropriate. Now that we’ve done that, the next step we might want is to save this changed table into a new file, so that we will be able to reuse it. The function to do that is called “write.csv”, and what it expects as input is, first of all, what we want to save, so this particular data, and then the file to save it in: a “.csv” file named for the annotated differentially expressed genes, and we can add “plus strand” to the name because there is also information about strands here.
If I run this, we can go to the Files tab and see that indeed we have the brand new file created here. There is also an option for pushing information from RStudio back to Galaxy; you can find it in the tutorial. That way you can bring the outputs that are here back into Galaxy and continue your pipeline there.
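A minimal sketch of writing a table to a new file and reading it back. The file name and the use of a temporary directory are assumptions made so the example leaves the working directory untouched; “row.names = FALSE” is an optional choice that keeps R’s row numbers out of the file.

```r
# Made-up table to save.
toy <- data.frame(
  GeneID    = c("g1", "g2"),
  Base.mean = c(19.2, 1086.9),
  stringsAsFactors = FALSE
)

# Hypothetical output path inside R's temporary directory.
out_file <- file.path(tempdir(), "toy_table.csv")

# First argument: what to save. "file": where to save it.
write.csv(toy, file = out_file, row.names = FALSE)

file.exists(out_file)  # the new file is there; the original data is untouched

# Reading it back shows the saved content.
reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
reloaded
```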
33:50 - So now we’ve covered some of the basic functions for working with tables. Again, real fast: we talked about what the table contains and the different ways to see its structure, how we can subset a table, how factors work, the caveats about coercion and accidentally converting something into numbers that it should not be, and then how to apply functions. Selecting columns and rows, subsetting, and playing around with the data to reformat them is one of the most common things done before the actual analysis.
If we want to do a lot of these things one after the other, we might eventually end up with a rather complex set of commands. There is a package called “dplyr”, created around 2014, that provides an additional level of functionality in R, specifically for aggregating, combining and analyzing tabular data in a much more efficient way. An important point is that it works directly on the data where they are located, so it is generally quite memory efficient.
So how can we use it? The first thing we need to do is load the functionality into our session. To load dplyr we specify “library(dplyr)”. If I press Enter, R tells us that it has attached the package dplyr and lists some changes that we need to be aware of, such as functions from other packages that are now masked. So let’s see a few things that can be done for subsetting using dplyr functionality, and you can see how much more intuitive and convenient this process is.
First of all, let’s say that we want to select some columns and filter some rows, which is basically subsetting. dplyr provides a function called “select”. It expects as input the table that we want to use, so the annotated genes, and then we specify by name the columns that we need. If we open this one here, you can see the columns available: maybe we want GeneID, Start, End, and also Strand.
If I run this, you see that the output is a table that only has the columns we specified. As you can see, we didn’t need to specify an index or anything else, just the table itself and the names of the columns we need, as they are listed here. Similarly to standard subsetting, we can select everything except some of the columns by using the minus sign.
Let’s say that we want everything except “Chromosome”. If I run this, you can see that it prints out all the columns, let me scroll a bit more, and after P.adjust, where Chromosome should be, it is missing, and the output goes straight on to Start, End and so forth. So it is a very easy way to extract information. dplyr also gives additional functionality. Let’s use the same structure as in the beginning and say, for argument’s sake, that the columns we want all start with “P.”: we want to extract only P.value and P.adjust, for example.
We can either specify P.value and P.adjust by name, or we can use, sorry, I’ve put a double “s” here, “starts_with”, and give exactly the string that we want the column names to start with, so “P.”. If I run this, you see that dplyr is clever enough to do a quick pattern check across all the column names and give us only the ones that start with “P.”. There is more functionality here for more complex selections, for example columns whose name contains a given string or ends with it, and even regular expressions.
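The three uses of “select” described here can be sketched on a made-up table. The column names imitate the ones mentioned in the talk and the values are invented; the example assumes the dplyr package is installed.

```r
library(dplyr)

# Made-up table with a few of the columns mentioned in the talk.
toy <- data.frame(
  GeneID     = c("g1", "g2", "g3"),
  P.value    = c(0.001, 0.020, 0.300),
  P.adjust   = c(0.010, 0.080, 0.600),
  Chromosome = c("chrX", "chr2R", "chr3L"),
  Start      = c(100L, 2500L, 40L),
  End        = c(900L, 3100L, 75L),
  Strand     = c("+", "-", "+"),
  stringsAsFactors = FALSE
)

by_name <- select(toy, GeneID, Start, End, Strand)  # columns listed by name
no_chr  <- select(toy, -Chromosome)                 # everything except one column
p_cols  <- select(toy, starts_with("P."))           # columns whose name starts with "P."

colnames(by_name)
colnames(no_chr)
colnames(p_cols)
```

dplyr also provides “ends_with”, “contains” and “matches” (for regular expressions) as helpers of the same kind.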
So “select”, and I’ll put this in as a comment, allows us to select columns, basically the features of the data. Subsetting as we did before also needs to select rows, and “filter” is the dplyr function that allows us to filter rows. How do we do that? Again the same structure as before: we use the “filter” function, then the table we want it applied to, and then the question is how we want our rows to be filtered. Here we need to put a condition; for example, let’s say we want to keep only the rows for which the strand is “+”.
If I run this, we see that it prints out all the columns, but instead of the 130 observations that we have here it only prints 72, and if we check the Strand column you see that everything is “+”. The equivalent in base R that we saw before is to use “annotatedDEGenes” again, then inside the square brackets the annotated genes’ Strand column equal to “+”, and then a comma: in other words, give me all the columns and only the rows where this particular column has a “+” sign.
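The dplyr and base R versions of the same row filter, side by side on a made-up table (assumes dplyr is installed; “dplyr_way” and “base_way” are hypothetical names):

```r
library(dplyr)

# Made-up table with a Strand column.
toy <- data.frame(
  GeneID = c("g1", "g2", "g3", "g4"),
  Strand = c("-", "+", "+", "-"),
  stringsAsFactors = FALSE
)

# dplyr: the table first, then the condition, with bare column names.
dplyr_way <- filter(toy, Strand == "+")

# Base R: the condition goes in the row position, the column position is empty.
base_way <- toy[toy$Strand == "+", ]

dplyr_way
base_way  # same rows; base R keeps the original row numbers as row names
```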
If I run this it gives me exactly the same information, but as you can see, in terms of reading the code and thinking about what you’re trying to do, the dplyr version is much more convenient. So these two are equivalent. Let’s try something more: we can filter, for example, by chromosome. We do the same thing again, filtering the rows, but this time on the chromosome, and we can make it a bit more complex. Let’s say that we want to select only chromosome X and chromosome 2R. If I run this and scroll a bit, you can see that the “Chromosome” column contains only chromosome X or chromosome 2R. In other words, I keep all the rows for which the value of “Chromosome” is either chromosome X or chromosome 2R. We can also use other logical operations here. I’m going to do the same thing again with the annotated genes, and let’s say that I want to filter on the log2 fold change, the “log2.FC.” column (note the trailing dot), to be greater than or equal to 2. If I run this it gives us only the 6 rows for which the “log2.FC.” value is greater than or equal to 2. This is one of the most useful functionalities here, and it’s also important to highlight that you can combine these conditions. Let’s say that we want to combine those two. I’m going to copy this part directly: I want to filter for rows that are on chromosome X or 2R only, but I also want “log2.FC.” to be greater than or equal to 2. In order to combine those two, I can put a logical AND (&) between them. If I run this it gives me only two rows, and as you can see, this is now our entire table: chromosome X, chromosome 2R. So it filtered on both aspects, the chromosome and the “log2.FC.” value being greater than or equal to 2. This is one way of combining multiple criteria.
An interesting point: if you have many criteria, you can easily imagine that this becomes quite extensive, with many different logical operations one after the other, so eventually it might end up being a very, very long line. dplyr provides a very nice additional functionality called “piping”. It’s a way to write code where the output of one function is provided as input to the next one, and so forth. It’s very similar to what is done in a Unix environment, so let me show you how it works.
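The three filters described above (a set of chromosomes, a threshold, and both combined with &) can be sketched as follows. The toy values and the “chrX”/“chr2R” spellings are assumptions for illustration, not the real data.

```r
library(dplyr)

# Toy stand-in for the annotated table; all values are invented for illustration
annotatedDEGenes <- data.frame(
  GeneID     = c("g1", "g2", "g3", "g4", "g5", "g6"),
  log2.FC.   = c(2.5, -1.2, 3.0, 0.8, 2.0, -2.4),
  Chromosome = c("chrX", "chr2R", "chr2L", "chrX", "chr2R", "chr2L"),
  stringsAsFactors = FALSE
)

# Rows whose chromosome is either chrX or chr2R
on_x_or_2r <- filter(annotatedDEGenes, Chromosome %in% c("chrX", "chr2R"))

# Rows whose log2 fold change is greater than or equal to 2
strong <- filter(annotatedDEGenes, log2.FC. >= 2)

# Both conditions at once, joined with a logical AND
both <- filter(annotatedDEGenes,
               Chromosome %in% c("chrX", "chr2R") & log2.FC. >= 2)

print(both)
```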
The actual symbol for piping is “%>%”, so this is the pipe in dplyr. What I’m doing here is this: I have this command, the table name by itself, and if I remove the pipe and run it, it prints everything out. So the output of this command is basically the entire table. I’m piping the entire table and then asking to filter by strand so that it is only “+”. The output of the first part is passed into the next one, and this is why, and this is important to highlight, I don’t need to give the table again as input.
As you can see, “filter” as a function expects an input and then the rest of the parameters. Because it’s now part of a pipe, it doesn’t require an explicit input: that is already given through the pipe. If I run this it filters the rows for the “+” strand; if you check here, this is the plus strand. And I can continue doing that. I can put another pipe here: I’ve selected the rows, so now I can select some columns, and I want to keep only “GeneID”, “Start”, “End”, and “Chromosome”. If I run this you see that I now have only these 1, 2, 3, 4 columns.
More importantly, you can check that I filtered based on strand, but “Strand” doesn’t need to exist after the second part: I’ve already removed it with “select”. If I swapped those commands one after the other it wouldn’t work, because in that case the “Strand” column would be missing when “filter” runs. I would urge you to try this and see why it fails. And just for verification purposes, you see that the filter gave us 72 lines; by then selecting only columns we still keep the same number of rows.
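A minimal sketch of the pipe and of why the order matters, using a made-up table (the values are assumptions for illustration). The swapped version is wrapped in “tryCatch” so that the expected error is captured instead of stopping the script.

```r
library(dplyr)

# Toy stand-in for the annotated table; all values are invented for illustration
annotatedDEGenes <- data.frame(
  GeneID     = c("g1", "g2", "g3", "g4", "g5", "g6"),
  Chromosome = c("chrX", "chr2R", "chr2L", "chrX", "chr2R", "chr2L"),
  Start      = c(100, 200, 300, 400, 500, 600),
  End        = c(150, 260, 380, 470, 590, 640),
  Strand     = c("+", "-", "+", "+", "-", "+"),
  stringsAsFactors = FALSE
)

# Filter rows first (Strand still exists), then drop columns
result <- annotatedDEGenes %>%
  filter(Strand == "+") %>%
  select(GeneID, Start, End, Chromosome)

# Swapped order: select() has already dropped Strand, so filter() raises an error
swapped <- tryCatch(
  annotatedDEGenes %>%
    select(GeneID, Start, End, Chromosome) %>%
    filter(Strand == "+"),
  error = function(e) "failed: Strand is no longer in the table"
)

print(result)
print(swapped)
```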
I can continue the pipe. Let’s say that we want to see only the top few lines, so I pipe this into “head” and it gives me only the top lines, so I can quickly check how this goes. As you can hopefully see, this is a very convenient way to create subsets of an original table based on your criteria and your purpose, without having to do a lot of complex operations within the same command. You can split your commands up, one after the other.
This also allows us to create a new object, a new variable, that we can then save. I will remove the “head”, because I added it only to show a few lines, and I’m going to say that I want to keep this as the “+” strand genes and save it. If I run this entire thing, you’re going to see that there is a new table here called “plus_strand_genes”. Let’s check! It has only the 4 columns.
Now if I do “head” of “plus_strand_genes” and run it, it gives me exactly the same output as before. So by starting with the annotated genes, the original table, we can make subsets, we can filter, we can retain only the columns that we need, and we can eventually save this information as a new table that we can use for further analysis.
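Saving the result of a pipe under a new name could look like the sketch below. The toy table is an assumption for illustration; only the name “plus_strand_genes” comes from the video.

```r
library(dplyr)

# Toy stand-in for the annotated table; all values are invented for illustration
annotatedDEGenes <- data.frame(
  GeneID     = c("g1", "g2", "g3", "g4"),
  Chromosome = c("chrX", "chr2R", "chr2L", "chrX"),
  Start      = c(100, 200, 300, 400),
  End        = c(150, 260, 380, 470),
  Strand     = c("+", "-", "+", "+"),
  stringsAsFactors = FALSE
)

# Assign the piped result to a new object; the original table is left unchanged
plus_strand_genes <- annotatedDEGenes %>%
  filter(Strand == "+") %>%
  select(GeneID, Start, End, Chromosome)

# Inspect the first rows and the dimensions of the new table
head(plus_strand_genes, n = 2)
dim(plus_strand_genes)
```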
47:53 - So this is how we can select columns and filter rows. Another key and quite useful functionality is to create a new column. Right now our original table has the “log2.FC.” column, but this is the logarithmic version. Let’s say that we want to change that to the fold change, not the log of the fold change. In order to do that we will create a new table, sorry, a new column, based on this particular column.
To do that I will be using pipes, to be a bit more explicit. I’m going to pipe the table in and use a new function called “mutate”. “mutate” expects first of all the table, which I provide through the pipe, and second the name of my new column, which I’m going to call “FC” for fold change, and how it is going to be calculated: 2 raised to the power of my original column, which is “log2.FC.”.
Let’s check this out. I’m going to pipe this into “head” so we can see only the first few rows. You see that all the existing columns are the same, nothing has been changed, but now there is an additional column at the end called “FC”. And if you do the reverse operation, taking the log2 of that new column, you get back the original value. So this is how you can create a new column. A final point to keep in mind: now that we’ve created a new column, selected and filtered, it might also be extremely useful to think about more complex situations.
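The “mutate” step can be sketched as below, again on a made-up table (the values are assumptions). The last line checks the reverse operation mentioned above: taking log2 of the new column recovers the original one.

```r
library(dplyr)

# Toy stand-in for the annotated table; all values are invented for illustration
annotatedDEGenes <- data.frame(
  GeneID   = c("g1", "g2", "g3"),
  log2.FC. = c(2, -1, 3),
  stringsAsFactors = FALSE
)

# Add a fold-change column computed as 2 raised to the log2 fold change
with_fc <- annotatedDEGenes %>%
  mutate(FC = 2^log2.FC.)

head(with_fc)

# Reverse operation: log2 of FC gives back the original column
all.equal(log2(with_fc$FC), with_fc$log2.FC.)
```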
For example, let’s say that we want to answer the question: how many differentially expressed genes do we have per chromosome? This is a very natural question. If we think about it, what we are asking is to create one subset of the annotated genes per chromosome and then count how many genes we have in each. This process that I just described, splitting our initial data into groups and then applying an operation to each group, is called “split-apply-combine”, and it is offered directly by dplyr through the “group_by” and “summarize” functions.
Let’s put in a subtitle, group and summarize, and see how this works. I’ll start again with the annotated DE genes and address this particular question: how many genes do we have per chromosome? First we need to group, so let’s call a function called “group_by”, and I want to group my table by “Chromosome”. Now we have our groups. The groups themselves are not really printed out; this is internal bookkeeping in dplyr.
It says that any new function I provide from now on should be applied with these groups in mind. The second step is that I want to summarize, and I want to summarize by counting. If we group our original data per chromosome, how many rows does each group have, which basically corresponds to how many genes? That count is done by “n()”, which counts the rows in each group. It’s a very short piece of code, but if I run it you can see that it addresses a very interesting question, and one that comes up quite often in research.
You see that it now gives two columns: one is “Chromosome”, which is what we grouped by, and the second is the count from “n()”, because that is what we asked for here. You see that chromosome 2L has 24 rows, chromosome 2R has 31 rows, and so forth. It is a very common operation, and we can do this even better. I can copy this entire thing and use an equivalent version: instead of “n()”, I’m going to use “count”.
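A minimal sketch of the group-and-summarize step on a made-up table (the values are assumptions). Naming the result column “n” inside “summarize” is a choice made here for readability; without a name, dplyr labels the column after the expression.

```r
library(dplyr)

# Toy stand-in for the annotated table; all values are invented for illustration
annotatedDEGenes <- data.frame(
  GeneID     = c("g1", "g2", "g3", "g4", "g5", "g6"),
  Chromosome = c("chr2L", "chr2L", "chr2L", "chr2R", "chr2R", "chrX"),
  stringsAsFactors = FALSE
)

# Split by chromosome, count the rows in each group, combine into one table
per_chromosome <- annotatedDEGenes %>%
  group_by(Chromosome) %>%
  summarize(n = n())

print(per_chromosome)
```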
“count” basically counts how many rows you have, and I can name the column “n”, so this should be exactly the same command. If I run this, sorry, I mistyped something here. I have to change this because I have not named my column. Again, my apologies.
53:16 - Well, I realized what actually happened, so let me repeat that. Because grouping by chromosome and counting the rows per group is such a common operation, dplyr provides a function that is a shorthand for exactly this. So “group_by” on “Chromosome” followed by “summarize” with “n()” is equivalent to “count” on “Chromosome”. If I run this you see that it provides the same kind of information, and if I want to name the count column I can set a name there.
So, as I wanted to say, this is one way of applying the same operation to subgroups, and you don’t necessarily need to have a single grouping variable. I may want to ask the same thing not only by chromosome but also by strand. If I run this it creates additional groups, because now there is one line for each combination of chromosome and strand: chromosome 2L strand “-”, chromosome 2L strand “+”, and so on, and it again provides the count for each of them.
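The “count” shorthand and the two-variable version can be sketched as follows. The toy values are assumptions; the “name” argument of “count” sets the label of the count column, and “n” is also its default.

```r
library(dplyr)

# Toy stand-in for the annotated table; all values are invented for illustration
annotatedDEGenes <- data.frame(
  GeneID     = c("g1", "g2", "g3", "g4", "g5", "g6"),
  Chromosome = c("chr2L", "chr2L", "chr2L", "chr2R", "chr2R", "chrX"),
  Strand     = c("+", "-", "+", "-", "-", "+"),
  stringsAsFactors = FALSE
)

# Long form: group, then summarize with n()
long_form <- annotatedDEGenes %>%
  group_by(Chromosome) %>%
  summarize(n = n())

# Shorthand: count() groups and counts in one call
short_form <- annotatedDEGenes %>%
  count(Chromosome, name = "n")

# Counting by two variables gives one row per chromosome/strand combination
by_chrom_and_strand <- annotatedDEGenes %>%
  count(Chromosome, Strand, name = "n")

print(by_chrom_and_strand)
```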
So this is a very versatile command that allows us to create quick summaries of the information that we have. We can use different functions here as well: an average, an absolute number, or, if we want something more advanced, a completely different mathematical operation applied to this piece of information.
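As a sketch of using other functions inside “summarize”, the example below computes a per-chromosome average and a maximum absolute value. Both summaries and the toy values are assumptions chosen for illustration; they are not shown in the video.

```r
library(dplyr)

# Toy stand-in for the annotated table; all values are invented for illustration
annotatedDEGenes <- data.frame(
  Chromosome = c("chr2L", "chr2L", "chr2L", "chr2R", "chr2R", "chrX"),
  log2.FC.   = c(2, -1, 3, -2, 4, 1),
  stringsAsFactors = FALSE
)

# Several summaries per group in a single summarize() call
stats_per_chromosome <- annotatedDEGenes %>%
  group_by(Chromosome) %>%
  summarize(
    n_genes       = n(),                  # rows per group
    mean_log2FC   = mean(log2.FC.),       # average log2 fold change
    max_abs_log2FC = max(abs(log2.FC.))   # largest change in either direction
  )

print(stats_per_chromosome)
```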
55:01 - Right, so we’ve seen how we can massage, aggregate and summarize information in R using dplyr. Another thing that is quite useful to keep in mind, especially when we start thinking about plotting, which we’ll be seeing shortly in the next video, is how to reshape the data. I’m going to use the exact same example here, so I’ll copy it over so that we have some information to try this on.
Let me actually change the name here to “n”, so that we name the column. If I rerun this you’ll see that the name has now been changed to “n”. We can change this a couple of times if you want, and so on and so forth. So here we have a quite traditional way of representing information: we have the “Chromosome”, the “Strand”, and how many genes this particular combination has, that is, how many rows of my original table match these particular criteria.
In some cases, however, we might need a different type of representation. For example, we might need a table that has the chromosome, then “+” as one column and “-” as another column. There is a library in R called “tidyr”, which we have here, let me run this, and now it’s loaded, which allows us to transform and reshape a table from one format to another. So we change the way the information is represented without actually changing the information.
This is the key difference between dplyr and tidyr. dplyr aims to aggregate, summarize, filter and select on a particular table. tidyr aims at changing the shape in which this information is represented. So let me run this command again. We see this kind of information, and let’s say that what we want is the representation with chromosome, then strand “+” and how many, then strand “-” and how many. I’m going to use a new function from tidyr called “spread”.
What “spread” expects is two pieces of information: which column needs to be taken into consideration, and which information will be split across the different columns. Let me explain this a bit more. First it expects the column whose values should be changed from a single vector into multiple columns. “Strand” in this case has two values, “+” and “-”, so those two will become separate columns. Second, it expects the column whose values will be distributed into these new columns, which here is the count.
If I run this, you see that we now have a new representation of this information. We have the chromosome, and then we have the “+” and the “-” columns. If we do the math, we’ll see that the two different representations match completely. In one table 2L “-” is 12, and in the other 2L “-” is again 12; 2L “+” is 12, and again 12. If we check 2R, the minus is 17 and the plus is 14, and in the other table the minus is 17 and the plus is 14. So it is exactly the same piece of information, just expressed in another way.
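A minimal sketch of the “spread” step. The counts for 2L and 2R are the ones read out in the video; the “chr” prefix and the reduction to two chromosomes are assumptions made to keep the example small.

```r
library(tidyr)

# Long counts table: one row per chromosome/strand combination
counts <- data.frame(
  Chromosome = c("chr2L", "chr2L", "chr2R", "chr2R"),
  Strand     = c("-", "+", "-", "+"),
  n          = c(12L, 12L, 17L, 14L),
  stringsAsFactors = FALSE
)

# key: the column whose values become new column names
# value: the column whose values fill those new columns
wide <- spread(counts, key = Strand, value = n)

print(wide)
```

In recent tidyr releases “spread” still works but is marked as superseded by “pivot_wider”, so newer code may prefer that function.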
And tidyr allows us to move from one representation to the other in the same way. As you can imagine, this form might be quite easy to represent visually, for example as a heat map, with these two pieces of information each shown as a particular column. Let’s say that we save this one. To save some typing, I’m going to save it as a wide representation; it’s called wide because you take a column and you spread it across multiple different columns.
Let me run this, and we’ll see here that we have a brand new table. The opposite operation of “spread”, as you might imagine, is a function called “gather”. I’m going to use the exact same approach: I pipe this into “gather”, and it works in a similar way. I’m going to gather all the column names into a new column called “Strand”, and put the values that are stored there into a new column called “n”, and I’m passing “Chromosome” with a minus sign, in the sense that this column should not be affected at all by this operation.
If I run this it gives me essentially my original table. I’m going to save this as a long representation, so that it’s clear what the difference between those two versions is, and you can see here that we have the wide case and the long case. So we can use these functions from dplyr and tidyr, as we’ve seen, to go from one representation to another, to do some analysis, and to save each individual table that is produced so that the rest of the analysis can continue from there onwards.
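The round trip from long to wide and back could be sketched like this. The counts are the ones read out in the video; the object names “wide” and “long” and the “chr” prefix are assumptions for illustration.

```r
library(tidyr)

# Long counts table: one row per chromosome/strand combination
counts <- data.frame(
  Chromosome = c("chr2L", "chr2L", "chr2R", "chr2R"),
  Strand     = c("-", "+", "-", "+"),
  n          = c(12L, 12L, 17L, 14L),
  stringsAsFactors = FALSE
)

# Wide: one column per strand
wide <- spread(counts, key = Strand, value = n)

# Long again: column names go into "Strand", cell values into "n";
# -Chromosome tells gather() to leave that column untouched
long <- gather(wide, key = "Strand", value = "n", -Chromosome)

print(long)
```

The rows of “long” may come back in a different order than in “counts”, but they carry the same information. As with “spread”, newer tidyr versions offer “pivot_longer” as the successor to “gather”.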
It’s important to highlight one of the principles that we mentioned earlier in this video: we don’t touch the original data again. We load it once and we continue working from that, but everything is repeatable, and we haven’t overwritten the original data set. If we want, we can save any of those tables using the “write.csv” function into our local folder and then use it from there as normal. If you go to the Galaxy training material you will be able to find some exercises on that.
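Saving a derived table with “write.csv” might look like the sketch below. The file name is hypothetical, and a temporary directory is used here so that nothing existing is overwritten; in an RStudio session you would normally give a path in your working folder instead.

```r
# A small derived table to save; counts as read out in the video
long <- data.frame(
  Chromosome = c("chr2L", "chr2L", "chr2R", "chr2R"),
  Strand     = c("-", "+", "-", "+"),
  n          = c(12L, 12L, 17L, 14L),
  stringsAsFactors = FALSE
)

# Hypothetical output file, kept separate from the raw input data
out_file <- file.path(tempdir(), "strand_counts_long.csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(long, out_file, row.names = FALSE)

# Read it back to confirm that the table survived the round trip
reloaded <- read.csv(out_file, stringsAsFactors = FALSE)
print(reloaded)
```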
61:35 - tidyr and dplyr have many, many more functions than the few that I’ve just shown you, but these are the most basic ones, and as you’ve seen they’re quite versatile and allow a lot of things to be done. I hope you found this useful, and please feel free to check out the tutorial itself for more information and exercises that you can do.