GTN Smörgåsbord - Day 2 - R Basics

Mar 18, 2021 09:35 · 7409 words · 35 minute read

Hi everyone. In this video I'll be showing you a few of the basic structures of R and how to use them when coding. In the previous video you saw how to start RStudio through Galaxy. This is the instance I was running before, so it's still ongoing. In case you are using a Galaxy instance that does not support RStudio at this point, an alternative option for running this exercise is the RStudio Cloud service, which is provided by RStudio. You can create a free account, and you'll see the exact same environment, which you can use to run these exercises.

For our purposes I will continue using the instance of RStudio that I've spun up on Galaxy. If I want to open the interface again, I can click on the RStudio link here, which will open up my interface. This is the script that I created in the previous video; if I open it up, you can see some information here. But in order to have a brand new place to put our information, I will create a new script.

To do that, you go to File, New File and select R Script, and you now see a new, empty script here. Just a few things about R, in case you're unaware: R is a free and open-source programming language. It has been growing in popularity for quite some time now, it's widely used, and it has a broad community that continuously supports both base R and a whole set of packages and libraries that extend and enhance the functionality R provides.

It's quite powerful, and it can run on multiple environments and platforms, including Windows, macOS and Unix. As you can see, you can also have it on several other platforms, including Galaxy, which can run R and RStudio directly, like here. In this tutorial I'll be talking about a few of the basic aspects of R, and I will start with one main thing in R, which is how to create variables.

A variable is basically a name that holds a value that is useful for R to remember and use. Let's create a variable, or an object if you like, called a. I've written a comment saying so, and to do that you name your variable a. Actually, let me make this a bit clearer by putting brackets there: we have the name, then the assignment operator, which looks like an arrow (the less-than symbol followed by a dash), and then the value.
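
The assignment step described above can be sketched as follows. The value 1 is a placeholder, since the video does not say which value was typed.

```r
# variable a [name <- value]
# The value 1 is an assumption for illustration only.
a <- 1   # name, then the assignment operator (<-), then the value

# Evaluating the name on its own prints the stored value.
a
```

In RStudio, running the first line with Ctrl+Enter should also make `a` appear in the Environment pane.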

Before I run this, it would be nice to save: as you can see the script is still untitled, so it's unsaved. I press Ctrl+S, which is a shortcut, and I will name it r basic; as you can see, our new script appeared down here. To run the line I can press Ctrl+Enter, and now you see on the console that the line has just run. It's also useful to look at the Environment pane, where RStudio is clever enough to tell us "look, you have defined a new variable, and this is its value".

The environment is basically a space where we can always keep a good overview of what information is there. In this instance we've named our variable a. There are a few points about names that are useful to remember. For example, we should avoid spaces. If I type something like a new variable here, it is not a good name, because it contains spaces. To avoid that, we tend to use the underscore to connect the words together as a single string.

It's also very good practice to avoid symbols like the exclamation mark, the at sign or the hash as part of a name, because each of those symbols has its own function in R. This would create a problem: for example, if I put a hash, you see that RStudio has already identified this as a comment and not a variable as I was intending. We also cannot start a name with a number, so something like 1number is not going to work; you can already see an indication here saying that you are doing something wrong.

So this is a very good rule of thumb. Now let's create another variable. I'm going to call it human_chromosome_number and I'm going to assign it the value 23. There is a problem here, as you can see, and the interesting thing is that RStudio tells us that what we're doing might include an error. The problem is that, by accident, I left a blank around here.

So R basically sees one piece of text called human, followed by a second one, and the problem is that the second one starts with a symbol that it doesn't understand. As I said, it's good practice to always start a name with a letter, and here the leading underscore creates a mess. By deleting the blank, this works absolutely fine. Now that we've done that, let's create yet another one, and this time the comment I'm going to write is "reassign object names".
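
The naming rules above can be illustrated with a short sketch. The name `a_new_variable` and its value are hypothetical, and `make.names()` is used here only to show how base R itself would repair names that break the rules; it is not something the video uses.

```r
# Valid names: start with a letter and join words with underscores.
a_new_variable <- 1             # hypothetical name and value
human_chromosome_number <- 23   # the value assigned in the video

# make.names() converts invalid names into syntactically valid ones,
# which makes the rules visible without triggering a syntax error.
make.names("1number")         # "X1number": a name cannot start with a digit
make.names("a new variable")  # "a.new.variable": spaces are not allowed
```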

Let's say that we want to create a new variable called gene_name, and I'm going to give it the value TP53. I run this with Ctrl+Enter, and you see that the command has been executed and now I have yet another variable here. If I want to actually see the value, I can type gene_name, and you see that it prints out exactly this content. So in this case we see how we put a value into the environment and how we can retrieve it from there.

If for some reason I'm done with this variable and I don't need it anymore, I can use the function called remove, rm, and provide gene_name as the argument. The command has been executed, and now you see in my environment that the variable does not exist anymore. The same shows if I now try to access the value of gene_name: if I run this, it gives us the error that the object gene_name is not found. In other words, the variable was created and has now been deleted, so trying to access it again gives an error, which is exactly what we expect.
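
A minimal sketch of the create, print and remove cycle just described. `exists()` is used here instead of typing the removed name, so that the example runs without stopping on the "object not found" error.

```r
gene_name <- "TP53"   # create the object
gene_name             # typing the name prints its value

rm(gene_name)         # remove it from the environment

# exists() reports whether an object with that name can still be found.
exists("gene_name")   # FALSE once the object has been removed
```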

So this is how we define variables, how we assign values and how we remove them. Let's have a look at some of the properties that those variables have. Every object in R has two main properties. The first is the length, in other words how many distinct values are held within this particular object. The second is the mode, in other words the classification, the type, of this particular object. There are several different modes, and the most common one is numeric, which corresponds to types like floats, integers, decimals and so forth.

There are the character ones, which represent a sequence of letters and numbers all together, like names, text and so forth. And we also have the logicals, which are Boolean values, TRUE and FALSE, that's it. There are a few more, which I will not go into because they would require some additional context. The main idea to remember is that length gives us the number of values contained in the object, and mode tells us what type of information is there.

So let's try this out; let's see mode and length. I'm going to first define a new variable. Let's call it chromosome_name, and I'm going to assign it the value chromosome 02. I run this, and now we see our new variable. If I type mode and start typing chromosome_name, you see that RStudio is clever enough: after typing a few characters, it identified what I'm trying to write, so it auto-completed it.

If I run mode, you see that its type is character. If I type length of chromosome_name, you see that it contains only one value, which is exactly what we should have expected. So this is how mode and length work. If you go to the corresponding tutorial in the Galaxy Training Network material, you'll see some exercises, and I definitely encourage you to go through them as well, to familiarize yourself with how mode and length actually work.
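
The two properties can be checked as in the sketch below. The exact spelling of the value ("chr02") is an assumption based on what is spoken as "chromosome 02"; the numeric and logical lines are extra illustrations of the other modes mentioned.

```r
# "chr02" is an assumed spelling of the value spoken in the video.
chromosome_name <- "chr02"

mode(chromosome_name)    # "character"
length(chromosome_name)  # 1: a single value is stored

# The other common modes, for comparison.
mode(23)     # "numeric"
mode(TRUE)   # "logical"
```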

The second point worth discussing now is that, beyond assigning values, you can also perform operations on them, and some of the main ones are mathematical operations. All the basic math operations are available in R: plus and minus for addition and subtraction, the asterisk for multiplication and the forward slash for division. Either the caret or the double asterisk corresponds to exponentiation, and there's also the double percent sign: if I have two values a and b, a %% b is the modulus, the remainder of the integer division. This is how we can use them.

As in math, we can use parentheses to prioritize the application of one operation over another. Let's say that we want to do this kind of operation: 5 to the power of 0.5, add 1 to the result, and divide the whole thing by 2. I can run this and it gives me the number 1.618034. This is one way of working, but given that we already have some variables, we can use these operations not only on numbers but also on variables that contain numbers.
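
As a quick sketch of the operators listed above: the operands 7 and 2 are arbitrary illustrative values, and the name `golden` is hypothetical.

```r
7 + 2    # addition: 9
7 - 2    # subtraction: 5
7 * 2    # multiplication: 14
7 / 2    # division (forward slash): 3.5
7 ^ 2    # exponentiation with the caret: 49
7 ** 2   # the double asterisk is an equivalent spelling: 49
7 %% 2   # modulus, the remainder of integer division: 1

# Parentheses control the order of evaluation, as in the video's example.
golden <- (1 + 5^0.5) / 2
golden   # prints 1.618034
```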

In our environment here we have human_chromosome_number, 23, so we can try human_chromosome_number times two. If I run this, it gives us 46. R takes the name of the object, replaces it with its content at execution time, which is 23, multiplies it by two, and prints out the result. So this is how this whole thing works. Another point that is extremely useful in R is how to use multiple values at the same time.
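
The variable arithmetic just described looks like this in a short sketch. The name `doubled` is hypothetical and only shows that the result can be stored as well as printed.

```r
human_chromosome_number <- 23

# R substitutes the current value of the variable (23) before multiplying.
human_chromosome_number * 2                  # 46

# Storing the result; the original variable is not modified.
doubled <- human_chromosome_number * 2       # hypothetical name
human_chromosome_number                      # still 23
```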

Vectors are one of the most commonly used object types in R. A vector is basically a collection of values that, importantly, are all of the same type. So, for example, we have a vector of numbers, a vector of characters and so forth. This allows us to put lots of pieces of information of the same type into the same object, the same variable, so we can access them and operate on them at the same time. So let's start working with vectors.

What I'm going to do initially is create a new vector called snp_genes. To do that I'm going to use a new function called c, which stands for combine, and I'm going to put a few genes here: let's say OXTR, ACTN3, AR and OPRM1. So, four genes. If I run this, you see that the snp_genes object now exists here, and this time around it is represented a bit differently.

It says chr, for character, it has elements one to four, and these are the elements. If we have a longer vector with more values in it, it will not show everything, just the first few, but it's always a good idea to have a look at what is in our environment. Let's connect this to what we saw earlier and look at the properties of this object. As we said, we have the mode and we have the length. So let's try mode of snp_genes.

As you can see, the mode again gives us the type of the values that are contained, and because we've put in something that is basically text, it correctly gives us character. We can also use length on this new snp_genes variable; as you might expect, the result is four, which can also be seen here. So both mode and length give us the information that we expect. There is another extremely useful function that basically combines the output of both mode and length, called structure, str. If I run str on snp_genes, it gives us the same information: the mode is character and the indices run from one to four.

So four pieces of information, and these are some of the values there. If it were a longer variable, it would give us just a few of the values instead of all of them. I think the cutoff is around 1000 when printed out, but this can be changed if need be. Not that it is extremely useful to see a thousand values at the same time. So this is how we create a vector and how we can see its properties, but if we create a vector, we actually want to start working with it.
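
Creating the vector and inspecting its properties can be sketched as follows, using the four gene names from the video.

```r
# c() combines individual values into a single vector.
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")

mode(snp_genes)     # "character"
length(snp_genes)   # 4

# str() reports the type, the index range and the first values in one line:
#  chr [1:4] "OXTR" "ACTN3" "AR" "OPRM1"
str(snp_genes)
```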

A few things that we might need to do are to get a value from a particular vector, to take a subset of the vector, to retrieve a range and so forth. So let's create a few more vectors of different types so that we can see how this works. Let's start with a new object called snps. Again I'm going to use combine, and I'm going to put in a few SNP identifiers: rs53576, then another one, rs1815739, yet another one, rs6152, and finally rs1799971.

So these are four SNP identifiers from dbSNP. I pressed Enter, not Ctrl+Enter, so it hasn't executed the command yet. Let's also create another one, snp_chromosomes, and the idea here is to capture which chromosome each SNP comes from. So let's say chromosome 3, chromosome 11, chromosome X and chromosome 6.

And let's also put in the SNP positions, snp_positions. Sorry, I forgot the dash in my assignment. Again I'm going to use combine, and this time around I'm going to enter only numeric values.

These are going to be 762685, 66560624, 67545785 and 144039662. So we have defined those three variables here, but we haven't run them yet, which is why we don't see them in the environment. I'm going to highlight them and click on Run, which will run all those commands at the same time. Now you can see that I have four vectors here: the one we played with before, snp_genes, and now also the chromosomes, positions and SNPs. And you can see that, with the exception of the positions, which are numeric because we literally put numbers there, all of them are character vectors.
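
The three additional vectors can be sketched as below. The position values are written as they are heard in the recording, so individual digits may differ from what was actually typed; treat them as illustrative.

```r
snps            <- c("rs53576", "rs1815739", "rs6152", "rs1799971")
snp_chromosomes <- c("3", "11", "X", "6")   # quoted, so stored as text

# Values as heard in the recording; the exact digits are an assumption.
snp_positions   <- c(762685, 66560624, 67545785, 144039662)

mode(snps)              # "character"
mode(snp_chromosomes)   # "character"
mode(snp_positions)     # "numeric": entered without quotation marks
```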

Now that we have them, let's say that we want to access a value from one of them, for example the third value of the snp_genes vector. To do that I type snp_genes, use square brackets, and put in the index that I'm looking for. In this case we want the gene in position three, so if I run that, you see that it prints "AR". Bear in mind, and this is an important point, that in R indexes start at one. So here, as you can see, OXTR is one, ACTN3 is two and AR is three.

And that's why, when I execute this, it retrieves the third element and gives me AR. In addition to retrieving a single element, I can retrieve multiple ones, for example a range. Let's say that we want to retrieve all values from one to three. What I can do is type one, a colon, and three, and if I run this it returns a subset with only values one, two and three, as you can see. If I don't want the values to be sequential, I can take the same approach, but instead of specifying a range directly I can use c to combine specific indexes.

For example, say I want the values in positions one, three and four. If I run this command, it gives me the first, the third and the fourth value as a new vector. I can also combine these two representations. For example, I can take snp_genes and say that I want elements one to three and also four. So in this case what I'm doing is creating a vector of elements one, two and three, and to this vector I also add the fourth one.
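
The indexing forms described so far can be sketched as follows; the vector is redefined here so the example runs on its own.

```r
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")

# Indexes start at 1 in R.
snp_genes[3]            # "AR"

# A range of positions, written with a colon.
snp_genes[1:3]          # "OXTR" "ACTN3" "AR"

# Non-sequential positions, combined with c().
snp_genes[c(1, 3, 4)]   # "OXTR" "AR" "OPRM1"

# A range and a single index combined inside one c().
snp_genes[c(1:3, 4)]    # all four elements
```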

So if I run this... Okay! As you can see, I made a typo, so it told me that spn_genes is not found, because I spelled it the wrong way around. I'm going to change this to snp_genes, and if I run it now it gives me "incorrect number of dimensions", because I'm trying to access the fourth element outside of the combine. What I was expected to do is close the parenthesis not here but here, after the four, because I want to combine the range 1 to 3 plus 4.

I showed this again because it is quite a common mistake. In that particular case I was basically trying to access multiple dimensions: instead of saying that I want to combine elements 1 to 3 and 4 into one index vector, I said that I want to index with two different vectors. So I'm going to change this back to the original form and rerun it, and it gives me essentially my original vector. So this is how we can access elements, or subsets of the elements, of a vector. But the question is what happens if we want to add new elements to the vector.
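
The misplaced parenthesis can be reproduced safely with the sketch below. `tryCatch()` is used here only to capture the error message so the example keeps running; it is not part of what the video shows, and the name `result` is hypothetical.

```r
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")

# With the 4 outside c(), the comma is read as separating two dimensions,
# but a plain vector has only one, so R raises an error.
result <- tryCatch(
  snp_genes[c(1:3), 4],
  error = function(e) conditionMessage(e)
)
result   # "incorrect number of dimensions"

# With the 4 inside c(), a single index vector 1, 2, 3, 4 is built.
snp_genes[c(1:3, 4)]
```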

So let's say that I want to add more genes to the snp_genes vector. This is similar to what we did here with the indexes: we had a vector of elements one, two, three and we extended it with element four. In this context, I'm going to say that I want to combine the existing snp_genes vector with a few more genes: CYP1A1, and let's also add APOA5. If I run this, it gives me a new vector of six elements, because we already had four in snp_genes and we added two more.

However, bear in mind that if I print the content of snp_genes again, my original four elements are still there. I've added the genes, but I did not save the result back to my variable. To do that, I have to overwrite my original variable, snp_genes, with the output produced by this command. So again I'm combining the contents of the snp_genes vector with these two additional elements. If I run this now, it doesn't print anything, but you see here that snp_genes now has a range of one to six and additional values: the four original ones plus the two I've just added. Please keep in mind that you are changing your original vector, so this is something to do only if you are aware that you are actually updating the original data.

As you can see, by using positive indexes I can access elements in a vector. Now let's do the opposite: we don't want to add elements but to remove them. To do that we use negative values. I'm going to type snp_genes again, but I'm going to put -6 in the brackets. (Sorry, the interface froze for a bit there.) If we run this, you'll see that, from the original vector that contained six elements, up to APOA5, the output no longer contains the sixth element. So by putting minus six in the same square brackets, we've dropped this particular value. Let's say that we want to save this change. In the same way as before, I'm going to overwrite my original snp_genes by assigning it snp_genes with minus 6.
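
Adding and removing elements, and the difference between printing a result and saving it, can be sketched as:

```r
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1")

# This prints a six-element vector but does not change snp_genes.
c(snp_genes, "CYP1A1", "APOA5")
length(snp_genes)   # still 4

# Assigning the result back overwrites the original vector.
snp_genes <- c(snp_genes, "CYP1A1", "APOA5")
length(snp_genes)   # 6

# A negative index returns everything except that position.
snp_genes[-6]       # the first five genes, without "APOA5"

# Again, the change only persists if it is assigned back.
snp_genes <- snp_genes[-6]
length(snp_genes)   # 5
```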

If I run this, I can check that my vector has changed from a length of six to a length of five. Another interesting point, and something you should be aware of, is that you can always explicitly add a value at a specific position, and here we do this with double brackets. For example, I can say that in snp_genes, in position seven, I want to add the element APOA5. If I run this, it works, and now you see that my vector has seven elements: from five it moved to seven. Let's have a look at how this actually looks. I'm printing it here, and as you can see the original five elements are still there, up to CYP1A1, then we have something called NA, and then the one that we just defined. What happened is that, because we explicitly asked R to add this element in position seven, it created an NA, a missing value, for position six and then added the element in position seven. This might be a good or a not so good thing depending on what you're trying to do, but it is something to be absolutely aware of at any time.

So, what we've seen so far with vectors: we create a vector with combine, putting in the elements that we want, and we can do this for numeric values too, like this one. If we want to access a particular element, we request it by its index number, and indexes start at one. We can use a range like this one, or a combination of ranges like this one. If we want to remove an element we use a negative index, as when we removed the element in position six, and we can add an element at a particular position with the double brackets, as you've seen here.

Another way of extracting information, of subsetting, is logical subsetting. Let's use the positions, which are numeric. I can put an index here, so "give me the value in position three", Ctrl+Enter, and it gives me the correct value. But let's say that I want to retrieve all values that are greater than 100 million. It gives me a single element, because if we look at our original vector, the only value that is big enough is the fourth one. So I can use this kind of logical operation to retrieve information. To provide some context, the logical operators are: less than, less than or equal, greater than, greater than or equal; exactly equal to, which is the double equals symbol, not a single equals, and please be aware of that, as it is one of the common mistakes; not equal, which is an exclamation mark followed by an equals sign; and then the logical OR, which is the vertical bar, and the logical AND, which is the ampersand. It is good to keep these in mind.

So how did this actually work? This is a nice structure to keep in mind. Let me copy this command right here. What I've put as the index is basically the application of a logical operation to a numeric vector. If I run this by itself, you see that what it produces is a logical vector: FALSE, FALSE, FALSE, TRUE. It applies the operation to each individual element of my original vector, so it checks whether the first value is greater than 100 million, whether the second is greater than 100 million and so forth, and depending on the outcome it gives us FALSE or TRUE. The only TRUE value is for the last element, the fourth one. So what happened here is that we said "give me all the positions whose value is greater than 100 million": R creates a logical vector and returns only the elements for which the operation gave TRUE.

This answers the question "which positions are greater than 100 million?", but I can also ask for the index of those values. To do that I use a function called which, and I can put the exact same operation inside it: which indices of my original vector have values greater than 100 million? You can see here that it's four. Why is this important, and why do I stress it a bit more? Because when we program, we might not always know the inputs, the values that are going to be used when we run our code. So instead of using a predetermined value like the 100 million that I put here, we can use variables as parameters. In this case I can say that I want a snp_marker_cutoff that has the value of 100 million; I'm going to copy it directly here and run this, and you can see that we have a new variable. Then I ask for the positions for which snp_positions is greater than snp_marker_cutoff. If I run this, it gives me the exact same value. Done this way, I won't need to change this value at every point in my code, but just once, in line 79 in this instance, and from here onward I only refer to it through the variable snp_marker_cutoff.

Another point, to close with vectors, is how we can investigate whether we have missing values or not. For that there is a function called is.na. It checks whatever you give it as input, for example snp_genes, for missing values. As you can see, if I run this the output is FALSE FALSE FALSE FALSE FALSE TRUE FALSE. If you remember, because we inserted a new gene in position seven, the sixth position became an NA. So by using is.na we can get this kind of information back, and this is a good tip to remember.

We might also have a case where we want to check for specific values by name. For that we can use another operator called in, which is typed as percent, in, percent. Let's say that we want to check whether two particular genes, one of them APOA5, are present in snp_genes. If I run this, it gives me TRUE TRUE. If I also add, let's say, TP53 and run this, it gives me TRUE TRUE FALSE: TRUE for the first two, and for the last one it says "I cannot find this one in snp_genes". So this operator takes every element of the vector on the left and checks whether it is among the elements of the vector on the right. Again, if you go to the training material you'll see that there are a few exercises there; I definitely encourage you to have a closer look and try them out to get a better understanding of how vectors work.

I'm going to continue with another key point in R, which is coercing values. Coercion means, in some cases, asking R to change the type of the data in a vector to a different type. For example, we might have a vector that holds the positions of SNPs, but by accident some text was thrown in there, and for this reason everything is changed to character. Let's first see how these things work, and then I'll show you how we can change a vector from one type to another. If you remember, we have the chromosomes, so let's check snp_chromosomes again. If I run this, we have 3, 11, X and 6.
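
Pulling together the subsetting steps from this stretch of the video, plus the implicit coercion just described, here is a sketch that runs on its own. The position values are as heard in the recording, the first gene in the `%in%` check (ACTN3) is an assumption because the recording only clearly names APOA5, and the names `is_large`, `present` and `mixed` are hypothetical.

```r
# Positions as heard in the recording; exact digits are an assumption.
snp_positions <- c(762685, 66560624, 67545785, 144039662)

# A comparison is applied to every element and yields a logical vector.
is_large <- snp_positions > 100000000
is_large                    # FALSE FALSE FALSE TRUE

snp_positions[is_large]     # only the values where the test is TRUE
which(is_large)             # 4: the index rather than the value

# Keeping the threshold in a variable means it is changed in one place only.
snp_marker_cutoff <- 100000000
snp_positions[snp_positions > snp_marker_cutoff]

# Assigning to position 7 of a five-element vector pads position 6 with NA.
snp_genes <- c("OXTR", "ACTN3", "AR", "OPRM1", "CYP1A1")
snp_genes[[7]] <- "APOA5"
is.na(snp_genes)            # FALSE FALSE FALSE FALSE FALSE TRUE FALSE

# %in% tests each element on the left for membership in the vector on the right.
# "ACTN3" is an assumed choice for the first gene.
present <- c("ACTN3", "APOA5", "TP53") %in% snp_genes
present                     # TRUE TRUE FALSE

# Implicit coercion: a single text value turns the whole vector into character.
mixed <- c(3, 11, "X", 6)
mode(mixed)                 # "character"
```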

so if I try to check the mode of the um snip chromosomes we’ll see that it’s type of character right um and this has been done because if I scroll a bit up and we check the chromosomes I explicitly stated them as characters I put the quotation marks before and after let’s create a second variable let’s call it chromos chromosomes chromosomes ii underscore two and but this time what i’m going to do i’m going to say okay i’m not going to change them as numbers i’m going to say that I have 3 and 11 x is actually a number so i’m going as a symbol so i’m going to put it in quotation marks and we have six so here I have a mix of numbers and strings if I run this you see that it actually worked absolutely fine but let’s check what is the mode so mode of the chromosomes two if I run this it’s still a character so um what happened so the problem is that um because the vector has to have a single type of information there and I need to be um I need to r needs to transform all the contents to the largest common type in other words it tries to change all of these values into a single type sense and three level and six are numbers so they are easily converted to a text and by saying that this is the symbol three instead of the actual number three so in order to have this all as a single type because x is not easy to change into a number because r does not know what the number of x corresponds to and it changes everything into a um intro into a not a into a character this is a question so it it’s it’s automatically changes um the type of the values into the largest common type and however we can force r to change if possible from one type to another but we have to be explicit about this so let’s create and i’m going to copy the positions from here but this time around i’m going to save it into a different environment called positions2 and instead of having them as numbers i’m going to put everything in quotation marks so instead of dealing with them as strings as as characters 
i’m going sorry as numbers i’m going to deal with them as characters if I run this and we see this here you can see that it’s like characters if I put mode positions 2 I can verify that it’s in that character and I can also try access one of the element positions to element one and it gives me the first one as an actual character so um however this is not really a very convenient way of dealing that because we already see and we are aware because we put the data there then all of these should be numbers so this is where the question comes in and we say that we want to change positions 2 by using the as dots and now it will we can select a type so we want to convert positions to from a character perspective into a number so this is done by us explicitly and we’ve requested um sorry let’s see now the mode again positions two and now you see that it’s been converted to numbers if I check this here you can see indeed that it is numbers and all of them are considered as numbers so um this is a bit straightforward because we’ve converted um positions and this these are all characters that are basically numbers and so it was not difficult to do however let’s try to do the same thing but for chromosomes let’s try to convert chromosomes to as numeric and i’m going to consume two again so as a remainder reminder we have numbers here 3 11 and 6 but we also have something that is not easy to convert in a number we don’t cannot think of how so if I run this it actually executes fine but you see that r put out a warning which says that because you coerced you explicitly asked me to change the type of the values from strings from characters to numbers i’ve done as best as I could but some that I could not i’ve changed them to na’s so in other words if I if we if we check first of all our the type of of um the mode of our vector is is numeric but if we want to check it out and run it you can see that we actually have numbers 3 11 6 but you also have a missing value so it works but 
at a cost. This again might be a good or a bad thing, but it's something to always be aware of.

So if I want to summarize this a bit before we move on to lists and talk about them a little bit: it's always important to be careful, and to check the results, when explicitly coercing one data type into another. The implicit coercion, the implicit change that R does by itself like here, is a safe conversion, because no loss of information actually happens. It's always a good plan to use the structure function (str) on a particular variable before using it and before we apply any conversions, just so we are always aware of how it's stored. The implicit coercion, again, is fairly safe. But this may have been a vector of 10,000 numbers and one character. Looking briefly through the data, you might have the misconception that this will indeed be numeric, but because there was by accident a character in there somehow, R will implicitly coerce everything to character. So checking the actual structure right after loading or creating a vector will help you easily identify such issues.

The final point to keep in mind is another type of structure that R provides, which is called a list. As opposed to vectors, lists are able to contain multiple different data types, and this is extremely useful because it allows us to store multiple pieces of information at the same time. If you look through the Galaxy Training Network tutorial, you'll see a few links with additional tutorials about lists, but here is one of the easiest ways to convey how this works. Let's say that we want to combine all the pieces that we have so far, and let's call it the data, by adding them into a list. To be more clear on what is happening here, I'm going to split it into multiple lines. So all of this is within the list function, and I'm going to say that I'm
going to have a column in my list called genes, and this one will contain the snp genes. Comma, I'm going to have the column called reference snp, and these are going to be my snps. Let me check: I have snps, there it is. Just making sure that everything is in place. We also want to have the chromosome, and this is going to take its information from snp chromosomes. And finally we also want to have a position, and this one is going to be the snp positions.

So, having done that, I'm going to highlight and run everything, and now you see that we have a different type of data here. RStudio, in order to have them easily seen, put this information into a different section called Data. As opposed to Values, which are all of the same data type, a list in this case can have multiple different types. You can see that we have a character vector, a character vector, a character vector and a numeric vector.

We can also retrieve information directly. If I want to access the snp data, I can use the dollar sign and I can say, okay, give me the position. By using that, it gives me the elements stored there. In the same sense as with the vectors, I can access this position vector and ask for the element in position two, for example, and if I run this it will give me the second element of my position vector. So accessing the contents of a list is done using the dollar sign, and as soon as we get into the list we actually have vectors, so we can apply exactly the same process we've seen before.

In other words, lists are quite elegant ways of combining information, so that we have it compacted into a single entity that can be accessed. At this point I'm going to save this, so that we have the script ready for the next video, and I will again definitely encourage you to go through the tutorial on the Galaxy Training Network. There are a few
exercises and additional links there, and I hope you found this useful. Bye!
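
For reference, the steps typed in this part of the video can be sketched roughly as follows. This is only a minimal sketch and not the exact script from the recording: the variable names (snp_genes, snps, snp_chromosomes, snp_positions, snp_positions_2, snp_chromosomes_2, snp_data) and all of the values are placeholders assumed for illustration, and the "X" entry is an assumed stand-in for the chromosome value that cannot be converted to a number.

```r
# NOTE: all names and values below are hypothetical placeholders,
# not the data used in the video.
snp_genes       <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D")
snps            <- c("rs_a", "rs_b", "rs_c", "rs_d")
snp_chromosomes <- c("3", "11", "X", "6")   # "X" is an assumed non-numeric entry
snp_positions   <- c(100, 200, 300, 400)

# Positions stored as characters instead of numbers
snp_positions_2 <- c("100", "200", "300", "400")
mode(snp_positions_2)    # "character"
snp_positions_2[1]       # "100"

# Explicit coercion: every string here is a valid number, so nothing is lost
snp_positions_2 <- as.numeric(snp_positions_2)
mode(snp_positions_2)    # "numeric"

# Explicit coercion at a cost: "X" cannot be converted, so R emits a
# warning and stores NA (a missing value) in that position
snp_chromosomes_2 <- as.numeric(snp_chromosomes)
mode(snp_chromosomes_2)  # "numeric"
snp_chromosomes_2        # 3 11 NA 6

# Check the structure of a variable before converting or using it
str(snp_chromosomes)

# A list can hold vectors of different types under named elements
snp_data <- list(
  genes         = snp_genes,
  reference_snp = snps,
  chromosome    = snp_chromosomes,
  position      = snp_positions
)
str(snp_data)

# The dollar sign returns one element of the list (here a numeric vector),
# which can then be indexed like any other vector
snp_data$position
snp_data$position[2]     # 200
```

Running str on the list right after creating it is a quick way to confirm that each element kept the type you expected.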