GTN Training - Galaxy Interface - Rule Based Uploader

Jun 25, 2021 16:13 · 3258 words · 16 minute read

Hello! My name is Assunta DeSanto. I am a core developer for Galaxy - um - working at Penn State University - um - I’ve been on the team for a little over a year now. And today I’m going to be stepping you through the Rule Based Uploader Tutorial.

00:22 - So the Rule Based Uploader allows you to upload data sets or collections, depending on what you have, and apply rules to them as you upload them. Rather than upload them just as data sets as they are and then apply the rules, you can kind of modify them as they’re being imported. And this can save you time, it can be a little bit more efficient, especially if you’re doing the same thing over and over again.

00:50 - Hopefully that gives you some background as to what it is that we’re doing or why.

01:02 - All right - so this is the actual tutorial. That’s what it should look like. Um - And we’re just gonna go through some of the basic things that you can do with the tutorial. One thing I want to draw your attention to is these data blocks, with this little copy button. When you click this copy button it will automatically copy this data and then it’s very easy then for you just to Ctrl + V or paste - um - the data in the upload box. That’s a really nice feature.

We’ll be using that a bit, so you’ll want to have the actual - um - tutorial open in a tab that way you can quickly copy that data like that. Alright, well, let’s get started.

01:45 - So, this is the first chunk of data we’re going to be working with. I’m going to copy that now and then we’re going to head over to the - to Galaxy and we’re going to upload it into the [sic] rule builder. Uh - first thing you might want to do is create a new history. I’ve already done that, but I’m just going to rename it Rule Based Uploader - um - and then it’ll all be in one place.

02:11 - So from this main page here, over on the tool panel, is a button that says “Upload Data”. Looks like a little arrow pointing up and if you click that you’ll see the uploader. It should start up for you like this - um - we don’t want the regular uploader, this time at least. We want the rule based uploader, so you just can navigate to that tab. This first one we’re uploading the data as datasets and it’s from a pasted table so we’re just going to paste what we had from that copy block and click build.

Now this is what the rule based uploader looks like. You can see that it has broken our data into these different tabs. We have a column header here that we’re going to be getting rid of, and we have a - um - a warning up here, which is what is halting us from clicking upload. So when you see that this warning has disappeared, this button for upload should be blue and you should be able to click it, but right now it says that it’s disabled because the data is not validated yet.

It’s not valid yet. So the first thing that we’re going to do [… ] is get rid of this first row. We don’t need it, it doesn’t have any data that we actually need. Uh - if we tried to use it, it doesn’t have a url link, so it’ll just break. It won’t work - um - so we’re going to — let me show that again. Go to filter, filter first or last n rows, and we want to filter the first row. Just one of them. If you had a number of data - a number of - um - headers at the top here that you wanted to get rid of, then you could get rid of more than one.

But we just want to get rid of the first and that’s the only one that doesn’t have real data. So click apply.
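To make the "filter first or last n rows" step concrete, here is a minimal Python sketch of what it does to a pasted table. The table contents, the example.org URLs, and the helper names `parse_table` and `filter_first_rows` are all hypothetical; this only mimics the idea and is not how Galaxy implements the rule.

```python
# Hypothetical pasted table (tab-separated). The values are placeholders,
# not the data from the tutorial.
pasted = (
    "study\tsample\trun\turl\n"
    "S1\tA\trun_1\thttps://example.org/run_1.fastq.gz\n"
    "S1\tB\trun_2\thttps://example.org/run_2.fastq.gz\n"
)


def parse_table(text):
    """Split pasted text into rows of cells, one list per non-empty line."""
    return [line.split("\t") for line in text.splitlines() if line]


def filter_first_rows(rows, n=1):
    """Drop the first n rows, for example header lines that carry no data."""
    return rows[n:]


rows = filter_first_rows(parse_table(pasted), n=1)
for row in rows:
    print(row)
```

If the table had several header lines at the top, the same call with a larger `n` would drop all of them.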

04:00 - So the next two things, I believe, that we’re going to do to our data [… ] is add a column definition for the name and add a column definition for the url. So our name is going to be column C and the url definition is going to be column D - the one that looks like a url. And that is actually what we need in order to move on - is the url. So we’re going to go to the rules button here and click add or modify column definitions. From here I said we’re going to add a name and that name is column C.

We can apply that, and then we can do it again. Add a definition, url and that’s column d. And once we apply that, we see that that warning at the top has disappeared, and we can click upload now. That’s what we’re going to do. This should create six datasets, named these different things, and it should get that data from the Zenodo link. And you just gotta wait a few minutes while it - uh - while it does the upload. It shouldn’t take very long, but if a lot of people are - um - uploading at the same time, it can take a bit.
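The column definitions can be pictured as a mapping from a role (name, url) to a column letter. The sketch below is only an illustration under that assumption: the rows, the example.org URLs, and the helpers `column_index` and `apply_definitions` are made up for this example and are not part of Galaxy.

```python
import string


def column_index(letter):
    """Convert a column letter such as 'C' to a zero-based index."""
    return string.ascii_uppercase.index(letter.upper())


def apply_definitions(rows, definitions):
    """Turn each row into a dict keyed by role, given {role: column letter}."""
    return [
        {role: row[column_index(letter)] for role, letter in definitions.items()}
        for row in rows
    ]


# Hypothetical rows that remain after the header row has been filtered out.
rows = [
    ["S1", "A", "run_1", "https://example.org/run_1.fastq.gz"],
    ["S1", "B", "run_2", "https://example.org/run_2.fastq.gz"],
]

# Column C is the name and column D is the url, as in the walkthrough.
datasets = apply_definitions(rows, {"name": "C", "url": "D"})
for dataset in datasets:
    print(dataset["name"], dataset["url"])
```

Each resulting dict corresponds to one dataset that would be fetched from its url and given its name.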

05:25 - In the meanwhile, let’s talk about why we should be using the rule based uploader instead of manually editing our data. So manually editing your data is not reproducible which means you can’t keep doing it over and over again. It’s not scalable, so if you have a thousand data sets, and you’re trying to put them all into the same collection, and you have to change something for each one of them, that’s going to take you a very long time. Using this will allow you to get it done like that.

And it’s also error prone. So when you’re doing things all by yourself, you can make mistakes. I went out of order, but that’s what the section says. There’s also - uh - a link here. “Why not use excel for this?” - which you guys can check out for more context there.

06:15 - So I see up here, where it says six jobs have completed, and we can see now that they’re all green. And we can see that they have their data that was - um uh - taken from Zenodo. And that’s a basic use case of the uploader. That’s just for datasets.

06:36 - So now we’re going to work on a collection. It does say you need to create a new history here, but we just did that. Maybe we’ll do it anyways - um - we’ll call it - simple data list - simple list uploader. There we go. You can call whatever you want that doesn’t matter - um - and the first thing we’re going to do is upload the metadata from the first example. And this is important - as a normal paste upload. We don’t want to use the rule builder this time, we want to do upload data.

We want to go to regular, Paste/Fetch data [… ] - ah - [… ] and copy that again, because I lost it. And we want to paste it there, and I want to click start. Then we wait for that to finish uploading.

08:07 - Let me take a look at it. It’s just the same exact - um - chart that we - um - had already uploaded. Alright, now we’re going to open up the rule based uploader, but this time we’re going to upload the data as a collection and we’re going to load the data from that data set that we just uploaded. So we’re going to go to upload data, rule-based, and then you’re going to change this to collection and we’re no longer putting a pasted table in here.

We’re using a history data set and that history data set is the very same one that we just uploaded.

08:57 - And you can click build. We see, again, that there are some warnings up here, and we’re going to need to resolve all three of these in order to move forward with the tutorial. But of course I’m going to show you how to do that.

09:15 - So, the first thing we’re going to do, just like last time, is get rid of this first row. Again, it doesn’t have any data that we need, so there’s no sense in keeping it around.

09:32 - The second thing we’re going to do is add or modify our column definitions. We’re going to be using D as the url just like last time, because that is the column that looks like a url and is in fact the url - um. The only difference this time is that C is not going to be the name, it is going to be a list identifier. We’re going to go to add or modify column definitions, we’re going to add a list identifier C, and we’re going to add a url D. We’re going to apply that - um.

10:15 - The type for this is fastqsanger.gz, so we can go over here to type and we can start - start to type that. And whenever it matches up and you can find it, you just click it and you’re done there. And the last thing that we have to do is - this time it’s a collection - so a collection needs a name and we’re going to name it [… ] According to the tutorial we’re going to name it this. That’s fine, you can name it whatever you’d like, you can name it “my special list” or whatever.

And finally you’re gonna upload them and this time we should see that those six data sets are in one collection, one list. We’ll wait for it to - um - finish up loading there. And again, this can take just a little bit of time but - uh.

11:24 - It can also be very quick, it really depends on who else is using the same resources.

11:36 - There we have it, we have our collection with our six datasets, all the information that we didn’t want was stripped from them, and there they are.

11:48 - So that’s how you create a simple list. Next we’re going to be creating a list of dataset pairs. This is a little bit more of a complex collection.

12:00 - We’re going to copy this chunk of data here, and we’re going to be uploading it as a collection in the rule builder. The rule builder upload we’re uploading as a collection, from a pasted table. So we’re going to do this, I’m going to want to clear that out, we don’t - we don’t want that anymore. We want a collection from a pasted table - um - we just want to get rid of all that. Make sure that I’m copying the right thing, and I’m going to click build.

12:36 - Um - so - we again have our warnings up at the top that tell us what we need to do to move forward, and let’s get that data in order to move forward. So again, this line up here. If you ever have a header that doesn’t have any valid data in it, you want to get rid of that. So that’s how we’re going to start here: get rid of the first row - apply.

13:12 - Um and then we’re going to go to our rules menu, select add or modify the column definition, and set column C to the list identifier and add our type as well. Those are things that we just did, they shouldn’t be – I clicked the wrong thing - list identifier - we don’t have a paired end indicator - it’s not - yet. Um, you apply that, and then down here, we’re going to do that, set our type. Okay.

13:51 - So, I want to look at column D here. Now if you look at column D, which we’ve been using as our urls, you’ll see that that looks like two urls and in fact it is there are two urls there. And they’re separated by a semicolon in the middle, so we need to break that up, and that’s what we’re going to do next.

14:14 - From column we’re going to select use a regular expression, and we’re going to create matching group expressions using this regular expression. I’m going to actually copy this regular expression to make it easy, although this is a pretty simple one.

14:32 - Column, using a regular expression and we’re looking at column D because that is the one that we want to break up. We’re going to paste our regular expression there, and we’re going to do create columns matching regular expression groups. So the second radio button, and I believe our number of groups is two. Yes, that is correct, two groups. So this is how they should look: from column D, matching expression groups, the regular expression, which is the parentheses with the dot star inside each one and the semicolon in between.

And the number of groups is two. When you click apply, we’re gonna see that we have two new columns on the end here we’ve got column E and column F. And they are column D split into two different columns.
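The regular expression described here, two parenthesized dot-star groups with a semicolon in between, would be written `(.*);(.*)`. As a rough sketch of what the rule does, the Python below appends one new column per matching group. The row contents and the helper `add_group_columns` are hypothetical, and Python's `re` module stands in for whatever regular expression engine Galaxy uses.

```python
import re


def add_group_columns(rows, column, pattern, group_count):
    """Append one new column per regex group matched in the given column."""
    regex = re.compile(pattern)
    result = []
    for row in rows:
        match = regex.match(row[column])
        if match is None:
            raise ValueError(f"pattern did not match {row[column]!r}")
        groups = [match.group(i) for i in range(1, group_count + 1)]
        result.append(row + groups)
    return result


# Hypothetical row: column D (index 3) holds two urls joined by a semicolon.
rows = [
    ["S1", "A", "run",
     "https://example.org/run_1.fastq.gz;https://example.org/run_2.fastq.gz"],
]

split = add_group_columns(rows, column=3, pattern=r"(.*);(.*)", group_count=2)
print(split[0][4])  # new column E: text before the semicolon
print(split[0][5])  # new column F: text after the semicolon
```

The original column D is still present after this step, which is why the walkthrough removes it next.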

15:38 - There’s some information on the tutorial explaining how to use regular expressions - um - briefly, but really dot star means any number of characters, and the parentheses capture whatever is matched inside them, which is how it matches that up. So now we’re going to get rid of column D. Column D is this one: it has the two urls separated by a semicolon, and we really don’t need it because we - um - just took out the data from it that we needed and separated it out.

So we can do - um - sorry - rules, remove columns, column D. And then when you click apply, D has disappeared and those E and F, they have jumped over. So this is what was on the left-hand side of that double url column, and this is what was on the right side.

16:34 - Um - moving on. Now we’re going to split our columns - um - the odd row columns are going to be column D, and the even row columns are column E. Again from the rule menu, we go to split column, the odd row was D, the even was E and that’s going to line these up very nicely. However, now it just looks like we have a list, and what we really wanted was a list paired, so we need to keep going forward.
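One way to picture the split column step: every input row becomes two output rows, the first keeping the "odd" column and the second keeping the "even" column, with the other cells repeated. The sketch below assumes exactly that behaviour; the helper `split_columns` and the placeholder cells are hypothetical, and it places the kept url after the shared cells, which matches the layout in the walkthrough where D and E are the last two columns.

```python
def split_columns(rows, odd_column, even_column):
    """Turn each row into two rows: one with the odd column, one with the even."""
    result = []
    for row in rows:
        # Cells that are shared by both output rows.
        shared = [
            cell for index, cell in enumerate(row)
            if index not in (odd_column, even_column)
        ]
        result.append(shared + [row[odd_column]])   # odd output row
        result.append(shared + [row[even_column]])  # even output row
    return result


# Hypothetical row after column D was removed: the two urls are now D and E.
rows = [["S1", "A", "run", "u1", "u2"]]

for row in split_columns(rows, odd_column=3, even_column=4):
    print(row)
```

After this step the table has one url per row, which is why it looks like a plain list until the paired indicator is added.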

17:14 - We need to define - it says inform Galaxy which of these rows are our forward [… ] reads, and which ones are our reverse reads. And we’re going to do this by adding a new column, using a regular expression. The underscore 1 or underscore 2 in the name of the file. So we’re going to go to use regular expression again, column D. And this time again, I suggest you just copy the regular expression. It’s easier to just copy it, if you’re following a tutorial, because if you mess up by just a little bit, you could be matching on the wrong thing.

So - um - using a regular expression. Column D, and we’re going to create one matching group. So, column using a regular expression, from column D, create groups. We can paste in our regular expression there. I’m looking for one group. I’m just going to make sure that that is - um - all correct. And looks like it is, so we’re going to click apply, and now we can see that there’s a new column added to the end here. E which has taken the one or the two - underscore one or two from here in the file name - and put it into this column.
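The exact regular expression is not spelled out in this recording, so the pattern in the sketch below is an assumption: it captures the single digit that sits between an underscore and `.fastq.gz` in the url. The helper `add_pair_indicator` and the example.org urls are likewise hypothetical.

```python
import re

# Assumed pattern, not quoted from the tutorial: capture the digit between
# the final underscore and the ".fastq.gz" suffix.
PAIR_PATTERN = r".*_(\d)\.fastq\.gz"


def add_pair_indicator(rows, url_column, pattern=PAIR_PATTERN):
    """Append a column holding the captured pair indicator ('1' or '2')."""
    regex = re.compile(pattern)
    result = []
    for row in rows:
        match = regex.match(row[url_column])
        if match is None:
            raise ValueError(f"no pair indicator found in {row[url_column]!r}")
        result.append(row + [match.group(1)])
    return result


# Hypothetical rows after the split step: column D (index 3) is the url.
rows = [
    ["S1", "A", "run", "https://example.org/run_1.fastq.gz"],
    ["S1", "A", "run", "https://example.org/run_2.fastq.gz"],
]

with_indicator = add_pair_indicator(rows, url_column=3)
for row in with_indicator:
    print(row[4], row[3])
```

This also shows why copying the expression matters: a slightly different pattern could capture a digit from elsewhere in the url.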

18:36 - Um - there’s an optional step here to swap the columns. I’ll show you guys how to do that, because - uh - it could be useful. It says it’s more useful to see what you’re doing. We’re gonna swap column D and E, and that’s just going to make this column with the urls go to the end and this column with the paired indicators come forward. That way we can see them more clearly, they’re more at the forefront there.

19:10 - Um and now we’re going to tell the rule based uploader that those are our paired indicators, because it doesn’t know that yet - um - so from rules menu, we’re going to add or modify a column definition, our paired end indicator is going to be column D, and our url is column E. So rules, add or modify, we’re going to add a paired end indicator, and like I said, that is column D. That is the ending that we took off of the url, and then we’re going to also add our url, and that is column E, and we’re going to click apply.

And now we can see that most of our conflicts have been resolved, all that’s left up here at least is to name the collection, and, in fact, that is all that the tutorial also asks you to do. So we can name it whatever we want. We can call it - uh - “our paired list PRJDB3920”, since that is what is in column A, and we can upload that. You can call it, again, whatever you’d like, and when that’s done we should have a list of pairs that split up along that paired end indicator that was in the url, and matched along those.
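To show how the list identifier and the paired end indicator work together, here is a small sketch that groups rows into a list of pairs. It assumes the indicator `1` means forward and `2` means reverse, and it uses a hypothetical helper `build_paired_list` with placeholder urls; Galaxy's own handling may differ in detail.

```python
def build_paired_list(rows, identifier_col, indicator_col, url_col):
    """Group rows by list identifier into {'forward': url, 'reverse': url}."""
    # Assumption for this sketch: '1' marks forward reads, '2' reverse reads.
    roles = {"1": "forward", "2": "reverse"}
    collection = {}
    for row in rows:
        pair = collection.setdefault(row[identifier_col], {})
        pair[roles[row[indicator_col]]] = row[url_col]
    return collection


# Hypothetical rows after the swap: C = list identifier, D = paired
# indicator, E = url.
rows = [
    ["S1", "A", "runX", "1", "https://example.org/runX_1.fastq.gz"],
    ["S1", "A", "runX", "2", "https://example.org/runX_2.fastq.gz"],
    ["S1", "B", "runY", "1", "https://example.org/runY_1.fastq.gz"],
    ["S1", "B", "runY", "2", "https://example.org/runY_2.fastq.gz"],
]

paired = build_paired_list(rows, identifier_col=2, indicator_col=3, url_col=4)
for identifier, pair in paired.items():
    print(identifier, pair["forward"], pair["reverse"])
```

Rows that share a list identifier end up in the same pair, and the indicator decides which side of the pair each url lands on.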

20:41 - Um - again this might take just a little bit of time. Um- while it’s waiting - uh - hopefully you get through it successfully. But I want to direct your attention to - uh - the feedback form, which is at the bottom of this tutorial. Uh - providing us the feedback on how you thought this tutorial went is really helpful for us. And if you’re interested in learning more about the Rule Builder, there is an advanced rule builder - um and that’s down here too.

Do you want to extend your knowledge, using Galaxy, managing your data, rule-based uploader, advanced and you can click hands-on. That will take you to the advanced rule-based uploader - um - hopefully that’ll help you - uh - understand how to, how to use this - um - tool even better. Um - again, here’s our completed paired list and I hope that was informative for you.

21:54 - Thank you for watching this and participating. I hope you guys have a great rest of the day and enjoy whatever else you’re up to.