Launch, monitor and manage data pipelines on any infrastructure with Nextflow Tower

Aug 9, 2021 01:04 · 10945 words · 52 minute read

[Captions are autogenerated] Welcome to the Australian BioCommons webinar series. My name's Christina, I'm the Australian BioCommons Training and Communications Manager, and I'll be your host for today. In this series we aim to share useful information about the latest digital techniques, data and tools for the life science community. We invite local and international peers to present a bioinformatics topic that we hope will support Australians to deliver their best environmental, agricultural and medical research.

You can keep up to date with the latest Australian BioCommons news and events via the channels listed here on the screen. Before we begin we'd like to take a moment to acknowledge the traditional owners and their custodianship of the lands on which we come together today. In my case this is the Wurundjeri people. We pay our respects to their ancestors and their descendants, who continue cultural and spiritual connections to Country. We recognise their valuable contributions to Australian and global society.

00:57 - In our mission to improve the digital infrastructure available to life scientists, Australian BioCommons and our partners regularly meet with the world's best innovators and enablers to understand if their approaches and products might just be the solution we need. In the Australian BioCommons team we're keen to investigate how Nextflow Tower can interact with institutional, national and commercial compute and cloud services. Our focus is always on community-scale benefits, so we've opened our discussions via this webinar, offering everyone the opportunity to join in and ask questions.

Seqera Labs is a leading provider of the open source workflow orchestration software needed for data pipeline processing, cloud infrastructure and secure collaboration, and we're really grateful to have two members of the team with us today. They've been very generous with their time and are joining us at 8 pm and 2 am in their respective time zones. Firstly we have Rob Lalonde here; he's the Chief Commercial Officer of Seqera Labs. Hi Rob.

02:02 - Hey folks welcome everyone nice to be here.

02:05 - Thank you very much for being here. Great, feel free to reach out if you have questions after the webinar; obviously you have my contact info there in the slides. Thanks.

02:22 - Now thank you very much Rob, and today's speaker is Evan Floden. He's the CEO and co-founder of Seqera Labs and a founding member of the Nextflow project. His role at Seqera is executing on the vision of simplified, scalable analysis pipelines in the cloud. Evan has experience bringing transformative healthcare solutions from ideation to market through roles in the biotechnology and medical device industries. He holds a PhD in biomedicine for work on large-scale sequence alignment, and his broader interests encompass everything at the intersection of life sciences and cloud computing.

So I’ll hand over now to Evan to start your presentation.

03:08 - Cool, awesome. Thank you very much for the introduction Christina, and welcome everybody to this webinar; thanks everyone for making the time and joining. So we're going to talk today mostly about Nextflow Tower, which is really the infrastructure side of what we do at Seqera. But I wanted to set the scene a little bit with Nextflow and how we came about developing Tower, what the background was, so it sets the scene for what we ended up building over the last couple of years.

So, just a little bit of background on myself and on the project. My background was much more biotech. As you may be able to tell I'm originally from New Zealand, but I moved to Europe to do bioinformatics; I saw that biology was becoming digital and really wanted to get some experience in that. It was when I moved to the CRG here in Barcelona that I met my co-founder Paolo Di Tommaso, and this is coming up to about eight years ago now.

At the time he was working on some interesting technologies combining things like containers, the cloud (which was becoming very interesting at this time), GitHub and so on, and bringing all of these ideas together to develop a workflow engine. This workflow engine eventually became Nextflow, which we've been working on now for quite some time. To set a little bit of background on the problem we went to solve initially: we saw that there were a couple of issues around handling bioinformatics data, as well as some common problems in bioinformatics generally.

One is that these data analysis pipelines can get really complicated, and they can get complicated in a few ways. One of them is the volume of data you need to deal with, since our data sets keep growing. The other is the number of different tools you need to deal with: maybe you've taken some code from GitHub, something from Bioconda, and you pull all of this together.

The flip side is that managing the infrastructure, the cloud and the cluster integrations can become difficult. So what we're trying to solve is both of those problems in the context of teams of people who are collaborating together, maybe from opposite sides of the world, who need to share infrastructure, share results, share data and so on. And finally we're trying to do that via what we think is the key entry point to this, which for us is data pipelines, the actual analysis itself, particularly on distributed compute.

So what are data pipelines and what are we trying to do? Well, here's a typical, and very much idealised, example of what a data pipeline is. You take some input data, in this case some reads (say germline reads and somatic reads, as FASTQs), you process them through some pipeline, maybe in parallel; then you do some variant calling, say by comparing the two samples together with several different tools, and you get some result and some output.

As I said, this is a very idealised view of what they are. In reality these things become much more complicated and much more nuanced, and if you have some experience writing pipelines you'll know that over time you can end up with some pretty complicated pipelines, and managing all of that orchestration becomes quite a challenge. So here's an example of what a realistic pipeline looks like. This is Companion, a parasite genome annotation pipeline. Each of those circles on the right-hand side is a script, a piece of software or a tool, and you can see that the data flows through via these arrows, which link the different tools together.

So these things can become incredibly complicated, and managing that is the main thing we were going after with Nextflow itself.

06:54 - So, summarising or generalising the problem: everyone's got large data. This is coming from sequencing data, from imaging data, but we see all sorts of new data types coming up and the challenges that poses. Bioinformatics is typically file based, so we're dealing with big files, which are just huge strings; that means the tools we need are a little bit different from, say, the typical database-style executions you see in some other fields.

These workloads are also embarrassingly parallel. This means we can take our samples and split them up; maybe you run, say, BamTools on each of your samples. You parallelise in what we call an embarrassing way, basically at the task level, and each task gets sent off to be computed on some cluster or some VM in the cloud, so you can really parallelise in this way.

We use a ton of different tools in bioinformatics as well, and typically we don't want to rewrite these tools ourselves, although obviously many people do. What this means is that we're often taking things from Conda or Bioconda, where there are over 7,000 packages; you want to be able to pull those things off the shelf and use them very easily. However, this creates some difficulties in dealing with dependencies.

I'm sure many of you have been in situations where you spend a lot of time, when you first start a bioinformatics project, just installing software, or even just installing software from someone else's toolkit. So we really want to eliminate that problem as well. You also end up with situations where the dependencies conflict with each other: you have one version of this that conflicts with another version of that which you need. That creates another problem, and that kind of dependency tree can often make the pipelines very fragile.

08:44 - So we published, early on in 2017, the publication around Nextflow. Our first angle on this, at least from the academic perspective, was reproducibility; it was, and still is, a very hot topic. What we were able to show is that, writing the same pipeline with the same versions of software, everything identical, you could still get different results depending on the operating system you used.

So in that annotation pipeline I showed you before, we observed that we were getting genes which start and stop in different places depending on whether you ran it on macOS or on Linux. Obviously we're able to solve that now with containers, but it was a nice insight into some of the problems and challenges in writing pipelines like this and making them, in this case, reproducible. There's another example at the bottom here where we did the same thing with Kallisto and Sleuth, which are quite common transcript quantification tools, and there we were seeing genes which would be called as differentially expressed, or not, in each case.

So reproducibility was one part of it, the first angle, but there are some other parts which we think are just as important. One is portability: being able to take your analysis, your pipeline and your data, give it to someone else and run it on different infrastructure. This obviously matters for being able to publish your work or share it with colleagues who have different infrastructure, but it also links into the third point here, which is scalability.

If we're trying to improve the quality of the software we develop, we obviously want to incorporate some modern software development practices, and one of them is test-driven development: the idea that you write a small test data set associated with your pipeline that can always run, and that you can link in with continuous integration. That means I want my pipeline to run on my laptop, and then, when I need to, to scale it out into the cloud or onto the cluster, point it at the big data set and run it that way.

These pipelines are also increasingly used in production or clinical settings, and therefore they need to be validated: they need to be tested, and their outputs need to be verified, stored and validated. They also need to be usable, and by this I mean not just usable by a bioinformatician, and definitely not just by the person who wrote the pipeline, but by people who ideally don't even know what Nextflow is, or to a certain extent what bioinformatics is. That means usable in the sense of creating very simplified GUIs for launching these things. At the other end of the spectrum are people who need to integrate their pipelines as part of larger systems, so we need to be able to launch them via API calls, or integrate, say, long-running services for the execution and monitoring of those pipelines.

That's very much what we're going to talk about and what I'm going to show you in the demo later on with Nextflow Tower. Okay, so that brings us to our solution for these problems, and this is why we developed Nextflow. Nextflow is an open source workflow manager. It's a couple of things in itself. One of them is a way to write pipelines, so you can think of it as a language or a syntax for writing these pipelines. You take your existing code (this could be Perl, Python, Bash scripts, R scripts, anything you've already got) and you wrap it in code blocks that we call processes.

You then link these processes together with dataflow programming, which I'll explain in a bit, define the dependencies for those tasks, and then you wrap your pipeline in a Git repository. This approach has become quite popular, quite a common way to do things now, but early on it was really an emergence of several different technologies at the time. The key point, though, is that once you have a Nextflow pipeline developed in this way, you can run it automatically on any of the different supported platforms.

Those supported platforms are traditional schedulers, things like Slurm, LSF and PBS, as well as the managed services in the cloud (AWS Batch, Google Cloud Life Sciences, Azure Batch), plus Kubernetes and some custom executors. The point is that because there's a separation between the definition of the pipeline and the configuration, essentially where it runs, you can take things off the shelf and run them anywhere. It also means the community can come together, develop those pipelines in Git or in some common repository, work together, get domain experts involved, and then anyone can take them and leverage them.

And this ability to have very complex pieces of software, made up of hundreds of different steps and hundreds of different software tools, and to collaborate on them, is I think one of the most powerful lessons of the Nextflow project so far; it enabled something we didn't even imagine would happen. Okay, a little bit on the syntax itself. It's designed for scientists, for engineers, for bioinformaticians.

You want to be able to write highly parallel applications and workflows without having to have a ton of experience in doing so. On the left-hand side you can see a task; here we're using Salmon, which again is a quantification tool. You reuse your existing script (something you might type into the command line now, or something you have in an existing script), you define the inputs and the outputs alongside the script section, and then you link those tasks together inside this workflow block. We'll see a little bit more of what that looks like in a moment.

I just want to go over some of the high-level differences here and their significance. Nextflow is not the only workflow manager out there. There are a couple of different approaches; some popular ones include things like Snakemake, and there are also language specifications, things like CWL and WDL. So let's just go through some of the differences, and what makes Nextflow a little bit unique in some of these ways.

First of all, it's a custom DSL, a domain-specific language: a language and syntax written on top of another programming language. This is useful in the sense that it allows you to very quickly write a pipeline, because the key things you need to do (define processes, link them together, do modifications on channels) are all pre-written for you. But in situations where you need to do something a little bit different, you can access the underlying programming language underneath.

This is a key difference from something like CWL, which is a specification in itself, where you're stuck with what's available there. Nextflow also has a programming model which is a little bit different, and this is where the dataflow part of Nextflow comes from. It's what we call a reactive programming model: all of the tasks, all of the processes you set up, are alive, sitting there waiting for data. They react to data entering them, and this cascades, giving you the ability to have millions of tasks from a single execution of a Nextflow pipeline.

This really makes it highly scalable, although it takes a little bit of getting used to in terms of the way the model works. Another key decision early on was the self-contained approach. Every task you run (if your pipeline has 100 tasks, or 10,000 tasks, whatever) gets its own working directory, and you can think of each task as a complete, independent unit of compute. This makes it very powerful and lends itself very nicely to the idea of using containers, because containers are environments you can put some compute into and run on some third-party service.

Those ideas match together really well. And the final one here is the portability I mentioned before: the separation of the definition of the pipeline from the configuration itself. Okay, what does it look like in practice? Let's have a look. Here we can see a task example with bwa mem, again something you might do from the command line. How do we write this in Nextflow? We take the existing command line, wrap it in a process block, define the inputs (you can see here I have a reference and a sample), define the outputs, which are the things I'm going to capture from the output of this process, and then define the script section, which can remain exactly the same as before.

Now the real magic, the dataflow part, is linking this together: how could I use the BAM sample downstream? The way we do this, and this is the thing that's a little bit different, is what we call channels. With this bam channel, I can take it from the output of my align-sample process and use it as the input of this index-sample process, a downstream task which takes place after it. The way the model works is that both of these processes are, as I said, alive, and as soon as the sample BAM enters the bam channel it fires off the execution of the second task. That cascades through; obviously the pipelines can become very complicated, with all of their links together, but it's really simple in the way each of those channels defines the linking between the processes themselves.
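A minimal DSL2 sketch of this pattern might look like the following. The process names, file names and parameters here are illustrative, not taken from the slides:

```nextflow
nextflow.enable.dsl = 2

// Hypothetical two-step pipeline: align reads, then index the resulting BAM.
process ALIGN_SAMPLE {
    input:
    path reference
    path sample

    output:
    path 'sample.bam', emit: bam

    script:
    """
    bwa mem $reference $sample | samtools sort -o sample.bam
    """
}

process INDEX_SAMPLE {
    input:
    path bam

    output:
    path "${bam}.bai"

    script:
    """
    samtools index $bam
    """
}

workflow {
    reference = file(params.reference)
    reads_ch  = Channel.fromPath(params.reads)
    ALIGN_SAMPLE(reference, reads_ch)
    // Fires as soon as a BAM enters the channel:
    INDEX_SAMPLE(ALIGN_SAMPLE.out.bam)
}
```

The `ALIGN_SAMPLE.out.bam` channel is the only link between the two processes; no explicit scheduling is written anywhere.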

Here's a more technical explanation: channels are asynchronous first-in, first-out (FIFO) queues, and they link processes together. You can also do operations on them, modify them and so on. In practice, the channels themselves contain data. Imagine we had a channel with three FASTQ files and we wanted to do some processing, treating each sample in a parallel manner. We would place each of those files into a channel, say a channel containing x, y and z, these three files, and the fact that the channel contains three files means we would get three executions of the process, what we call three tasks. A task is a single initiation of a process, and here we get three outputs.

It's maybe a little bit technical, but let's imagine it in practice. Say I had a FastQC process I wanted to run, just running FastQC on some reads. I could set it up to run with a single sample by saying the samples channel contains only a single file, sample.fastq; and if I then wanted to run it on all my samples, I simply specify *.fastq, and now FastQC will run all of those in parallel.

And when I say in parallel, I mean really in parallel, in the sense that the tasks are completely independent: if I have 100 CPUs and 100 files there, they're going to be running in parallel in that manner, or they'll be submitted out to my cluster or my queue. Okay, that was a very quick, ten-minute introduction to the syntax itself. Let's think now about deployment, because this goes more into where we're going with Nextflow Tower.
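Sketched in DSL2, the one-sample-versus-all-samples switch really is just the channel definition. The directory and file names below are made up for illustration:

```nextflow
nextflow.enable.dsl = 2

// Illustrative QC step: one FASTQC task is spawned per file in the channel.
process FASTQC {
    input:
    path reads

    output:
    path '*_fastqc.zip'

    script:
    """
    fastqc $reads
    """
}

workflow {
    // One sample, one task:
    // samples_ch = Channel.fromPath('data/sample.fastq')

    // All samples: one task per matching file, executed in parallel
    // locally or submitted to the configured cluster/cloud executor.
    samples_ch = Channel.fromPath('data/*.fastq')
    FASTQC(samples_ch)
}
```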

So there are a couple of ways you can run Nextflow itself. The basic way is local execution. This is typically when you're developing an application, or maybe you're doing some CI/CD work and need to run a small test data set, where Nextflow runs locally on a machine. There are also situations where you spin up a big box, maybe 64 CPUs, and just run everything locally on that machine. Obviously that has some limitations, and maybe it's not the most efficient use of resources, but it's certainly possible, and you can do it with or without containers.

In that scenario Nextflow uses the local storage for the intermediate files, and it runs in a pretty basic manner. What's very popular, though, is the ability to connect directly to your cloud and cluster resources. These are the traditional schedulers, things like Slurm, Grid Engine and so on, and here Nextflow submits each task to the scheduler: each Nextflow task execution becomes a Slurm job, for example.

This can be a very useful approach. It makes it very easy to interact with a cluster, and it's very powerful in the sense that you don't have to change any of your pipeline: you no longer need to think about the specific commands for that cluster. Nextflow takes care of that, and it takes care of the fact that there's an abstract idea of what CPUs are, what memory is and what a queue is, and all of those things can be translated across the different execution platforms it supports.

What's become really popular, though, is managed cloud batch orchestration, things like AWS Batch. In this case Nextflow submits each task to, say, the Batch API; that API in turn is linked to a compute environment with some VMs. Those VMs spin up as the task enters the queue, run the task, put the data back into S3, and then the VMs go down.

The nice thing about this is that you scale to zero: you can require, say, 10,000 CPUs, but only for a couple of hours because you need to run some big sequencing experiment very quickly, and then you kill everything afterwards. That approach can be very powerful, particularly in bioinformatics where we have very lumpy workloads: often we need a lot of compute and then we need nothing, or we need specific resources like GPUs, things it wouldn't make sense to buy but which you can rent in the cloud for a certain amount of time.

And then if you want to go the next step and actually manage that cloud infrastructure yourself, there are a couple of ways you can do this. This is a slightly detailed explanation of what happens, at least in AWS, where you can have multiple jobs on the same virtual machine. There's a bin-packing problem taking place here, which again improves efficiency a little bit, because it means you can have a collection of VMs which are spun up, and those tasks automatically go onto them.

I should say that as a user you don't really need to know any of this; this is more the guts of what's happening in the back. We've been spending some time looking at how we can improve the performance of this, because now that we can make it very easy to set up that infrastructure (and I'll show you how we do that), we can start to think about the most efficient way to run bioinformatics pipelines, and whether there are obvious, easy things we can do to improve that. One example is with GATK here, where we were looking at whether we can use a shared file system in the cloud: in this case FSx for Lustre versus S3 storage.

With S3 storage you have to transfer the data from the storage bucket to the virtual machine, run the job, and transfer it back; with FSx you no longer need to do that. Now, there's a trade-off, because FSx has a cost associated with it: you pay for that disk while the workload is running. But we were actually able to show that even though it's more expensive in the sense that you have to pay for the storage, it turns out to be cheaper overall, because the runtime for each of those virtual machines is lower. The overall cost ends up being lower, and in some cases the run is up to about two and a half times faster. So we think there's a whole bunch of things to work on here, again making this easier for people to utilise and access.

Here's a quick visualisation of that portability and how it works. I could run my Nextflow script with the local executor here on my machine. I could then specify, okay, I want to take that pipeline, maybe something off the shelf, and run it on Slurm: I specify the slurm executor, my queue, and how much memory I want for each of those tasks. And then I could say, okay, I want to take that same pipeline and literally just change the executor to point at AWS Batch. It still has the same concept of what a queue is (an abstraction for queue, an abstraction for memory, et cetera), so it converts those settings into the corresponding ones for the API, in this case for Batch, and runs the job that way.
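That executor switch is typically expressed as profiles in `nextflow.config`. A hedged sketch, where the queue names, bucket and resource values are all made up:

```nextflow
// nextflow.config -- illustrative profiles; queue names and sizes are hypothetical.
profiles {
    standard {
        process.executor = 'local'
    }
    cluster {
        process.executor = 'slurm'
        process.queue    = 'long'       // hypothetical Slurm partition
        process.memory   = '16 GB'
    }
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'     // hypothetical AWS Batch job queue
        workDir          = 's3://my-bucket/work'
    }
}
```

The same pipeline then runs unchanged with `nextflow run main.nf -profile standard`, `-profile cluster` or `-profile cloud`.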

A lot of these things are key: you write the pipeline in a way which is decoupled from the deployment. That portable deployment also relies on a couple of other things, and one of the main technologies behind it was obviously containers. If you're not familiar with them, I think they've become much more popular than when we started the company three years or so ago, when there was not so much awareness of them; they've really become a driving force for a lot of this portability.

Comparing them to virtual machines: they can spin up very fast, so you can have a container which comes up really quickly, whereas virtual machines are typically a bit slow to do this. They're also immutable: when you spin up a container, every time it's going to be in the exact same state you built it in, which comes back to reproducibility and makes that very easy. They're composable, so you can have layers and build containers on top of other containers. And there's just a whole bunch of great tooling out there for using them; the advent of the Docker tool set and of container registries, which are available to everybody now, has made all of this really easy to use. We're piggybacking on a lot of other technologies, taking containers and using them in a way they weren't really designed to be used, but they work fantastically for this.

So Nextflow added support for Docker very early on. Docker was released, I think, in the middle of 2013, when there was the first talk about it, and we saw this technology and thought, hey, we can really take this and use it for bioinformatics workloads, for long-running workloads. You can see there has also been other support coming in since: things like Singularity, which is really for using containers in HPC environments, things like Podman, and a couple of others that Nextflow supports as well; they're all essentially container runtimes.

The nice thing about them is that they're pretty interchangeable: you write your pipeline once, you define it with, say, Docker, and the other runtimes are able to handle it. One thing I think is really good when you use containers, and once you get used to them (hopefully the audience is now more used to this), is that it becomes hard to imagine how you did bioinformatics before. Spending so much time installing software is hopefully a thing of the past, or at least something you do once, when you create the container.
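In configuration terms, that interchangeability might look like the following sketch; the image name and tag are illustrative examples, not from the talk:

```nextflow
// nextflow.config -- the same container image definition works across runtimes.
// The image below is a hypothetical example of a Biocontainers-style image.
process.container = 'quay.io/biocontainers/salmon:1.5.2--h84f40af_0'

// Enable exactly one runtime; the pipeline itself does not change:
docker.enabled = true
// singularity.enabled = true   // e.g. on an HPC cluster instead of Docker
```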

Okay, some high-level things now on Nextflow itself. The project has grown quite considerably, particularly in the last couple of years, and a lot of that has been driven not necessarily by contributions to the Nextflow open source code, which obviously are very valuable, but by community contributions of Nextflow pipelines. This means people coming together to create these pipelines, sharing them with the community and working together in this way.

So we're up to around ten thousand people who read the Nextflow documentation every month, and a big driver, as I said, has been projects such as nf-core, a community which came about organically from a couple of Nextflow conferences. From there they've been developing pipelines which you can now take off the shelf and use. If you're interested in this I'd recommend checking out nf-core (we'll show some pipeline examples in a moment), and you can join the Slack community there; they've got a couple of thousand people, and it's a very welcoming community. It's great for taking things off the shelf, but people also contribute back, maybe modifying things, and we see a lot of customers now who come to us primarily to run nf-core pipelines, which is obviously great to see.

29:50 - Okay, another couple of things here around the use of Nextflow. One thing we've been really pleased to see is how quickly it's been taken up and adopted into relatively significant projects. The UK has been very forward in the sequencing of COVID, particularly around the identification of variants, and part of that was the development of a Nextflow pipeline very early on that was taken up by a consortium, who were then able to distribute that pipeline to different infrastructure groups around the UK.

And as part of this, that one consortium has published around 25% of all of the COVID sequences to date, and they've all gone through a single Nextflow pipeline which they developed. There have obviously been many other pipelines developed for this analysis, but I think it's a nice showcase of what you can do with Nextflow: create a pipeline very quickly, share it, and then run it on different, distributed infrastructure depending on what's available at the time.

We also see a lot of adoption in the life sciences area generally. We have commercial deployments of Nextflow Tower inside many large pharma companies — AstraZeneca, Roche, Johnson & Johnson, et cetera — and we see a big increase particularly in the personalised medicine space: there's a lot of sequencing data, and companies who are cloud-forward in this way, maybe dealing with clinical data or production-like setups, which I think is a very good sweet spot for Nextflow.

Obviously there are also the traditional research organisations using it. While Nextflow is a very useful tool — we developed it really for the bioinformatician who, like us, wanted to write these pipelines, share them and run them on different infrastructure — there are still some limitations. A command-line tool is not typically very good for long-running services.

It doesn't have a database on the back end, and a command-line tool isn't very useful for sharing with colleagues or interacting in a collaborative way. So while Nextflow makes it easy to use all of this different compute and to use containers, et cetera, we found it quite difficult to see how we were going to take it to the next step. Thinking about this, and talking about it with users over many years, we started to think that we needed something other than a command-line tool.

And this is really what we developed Nextflow Tower for. Nextflow Tower is a full-stack web application — you can think of it as a centralised command post for your Nextflow pipelines, a full management piece of software. Early on we thought about what kind of model we wanted to use. In the past there have been SaaS-like approaches to this problem, where there's a service running which you pay to use, but we didn't think that fit well with what we feel is the Nextflow way of doing things, which is really to just install something and run it anywhere.

For that reason we decided that Nextflow Tower should be as portable as Nextflow is: you can take Nextflow Tower and install it on your laptop, bring it up on your cluster, or run it in the cloud if you wish. We started off with the monitoring of workflows — the ability to monitor them, understand what's happening, with a database on the back end — and then moved on to things like launching them via GUI and via API, team and organisation management, and improving the ability to set up and provision the infrastructure for the compute environments.

So I'm going to give you a little demo now of how it all works. I'm going to show you Nextflow Tower, which is at tower.nf, and I should point out that tower.nf is available for you to evaluate for free, so I'd recommend trying it out if you wish. The first thing you can see is that you can log in with any authentication provider; for our customers, of course, we support their own authentication, so when they install it, if they have any kind of OpenID Connect setup, they can use their own authentication.

I'm just going to use GitHub for my login. The first thing to think about is the most common action, which is typically launching a pipeline. A pipeline here is something that has been predefined for us, that we just want to kick off and launch. I'm in my own personal workspace, so I'll first select a pipeline I wish to run — and these Nextflow pipelines now have customisable GUIs, which are defined in the pipeline repository.

This makes it really easy for people who don't know anything about Nextflow, or even much about bioinformatics: they just know, here's my pipeline I wish to run, this is the type of input data — maybe it's some PacBio data, maybe I want drop-down menus associated with that — and it makes it pretty trivial to kick off and launch a pipeline. Once I've launched a pipeline — and this can be any pipeline — I can start to think about visualising the results.

Here are some pipelines I've launched before, and here's the one we've just kicked off. You can see that on the back end it's really just running Nextflow — a long-running service for launching Nextflow pipelines. You'll notice that Nextflow runs from git repositories, which could be public or private, hosted wherever you prefer, and then you have parameters, configuration and execution logs, and you can follow these things live.

One of Nextflow's real advantages is the ability to relaunch a pipeline: maybe something failed and you want to relaunch the execution without starting again from the beginning. If you know anything about Nextflow, this is the resume functionality — it uses a hash of each task, recomputes only the tasks which failed, and picks up at the right stage. You can also do things like download the logs, view an execution timeline, visualise what the execution of the pipeline looked like, and understand which steps are taking long, et cetera.
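The resume mechanism described here can be illustrated with a toy sketch — this is an editor's simplification, not Nextflow's actual implementation: each task's cache key is a hash over its script and inputs, and a task only runs if no cached result exists for that key.

```python
import hashlib

def cache_key(script: str, inputs: tuple) -> str:
    """Toy stand-in for Nextflow's task hash: script + inputs -> key."""
    h = hashlib.sha256()
    h.update(script.encode())
    for item in inputs:
        h.update(str(item).encode())
    return h.hexdigest()

def run_with_resume(tasks, cache):
    """Run each (name, script, inputs) task, skipping ones already cached."""
    executed = []
    for name, script, inputs in tasks:
        key = cache_key(script, inputs)
        if key not in cache:          # only uncached (failed/new) tasks run
            cache[key] = f"result-of-{name}"
            executed.append(name)
    return executed

cache = {}
tasks = [("fastqc", "fastqc $reads", ("s1.fq",)),
         ("align", "bwa mem ...", ("s1.fq", "ref.fa"))]
first = run_with_resume(tasks, cache)   # both tasks execute
second = run_with_resume(tasks, cache)  # nothing re-runs: all keys cached
```

Because the key changes whenever the script or the inputs change, edits to a pipeline invalidate exactly the affected tasks and everything downstream of them.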

Now, I should point out that all of these features are able to run on any of the different compute environments, which I'll show in a moment. You can see some general information here: the name of the pipeline and where it ran — in this case AWS Batch in Paris, so the eu-west-3 region — the version of Nextflow we used, and the work directory, which in this case is an S3 bucket.

You have the status of the tasks as they run through the processes on the left-hand side, as well as some aggregate statistics. On the back end we have a full database of cloud costs for all the different cloud providers, all the regions and all the instance types. This means you're able to quickly sum up all of those costs and generate an estimated cost — something that can otherwise be very difficult. People are a little hesitant to go to cloud because they don't know, if not the true cost, then at least an estimated cost of what they're running.
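That cost estimate reduces to a simple sum, sketched below with hypothetical prices (these figures and instance names are illustrative, not Tower's actual pricing data): each task contributes its runtime multiplied by the hourly price of the instance type it ran on.

```python
# Hypothetical per-hour spot prices by instance type (USD); illustrative only
PRICES = {"c5.large": 0.04, "r5.xlarge": 0.09}

def estimated_cost(tasks):
    """Sum runtime-hours x hourly price over all tasks.

    tasks: iterable of (instance_type, runtime_minutes) pairs.
    """
    return sum(PRICES[itype] * minutes / 60 for itype, minutes in tasks)

run = [("c5.large", 30), ("c5.large", 90), ("r5.xlarge", 20)]
cost = estimated_cost(run)  # 0.02 + 0.06 + 0.03 = 0.11 USD
```

With per-task runtimes already in the database, the per-run total is just this aggregation over every task in the workflow.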

Was it ten dollars, or was it ten thousand dollars? There's obviously some fear around burning through your PI's credit card in this regard, so these are ways to create some guard rails and make it simpler for people to use. There's also information around efficiency, but I think what's most useful is to jump into the task information. When you have a pipeline of 10,000 tasks, understanding exactly what's happening in each of them — searching for a task, finding out what an error is — can be particularly difficult.

So here we can jump in and see the different tasks — let's look at fastqc. You can see the command that was run, the exit status, the work directory, et cetera. I can also see the execution time and the resources requested: this was running in this container, on this queue, with two CPUs, six GB of memory and a six-hour limit, on AWS Batch, on this instance type in eu-west-3 on spot instances, and this is the total cost of that task.

Those are the resources that were requested, but I can also see the resources that were actually used. Comparing the two can be quite useful, because it means we can start to think about optimising: we've now got a full database of all of that information, so how can we visualise it, firstly, and then act on it to try and save resources and make our pipelines more efficient as they run? So there's a visualisation of that.

At the bottom you can see CPU usage and memory usage for these different tasks. There are quite a few different processes in this step, and you can see each of them — reasonable efficiency would be somewhere around 70–80%, and since this is running on test data sets the memory utilisation here is pretty low. But it's something you can always optimise from inside Nextflow, and we're working on optimising this automatically now.
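The requested-versus-used comparison is just a ratio; as a toy calculation (an editor's sketch, not Tower's code), a task that asked for far more memory than it used flags an easy saving:

```python
def efficiency(requested, used):
    """Percent of a requested resource actually used (same units for both)."""
    return 100.0 * used / requested

# A task that requested 6 GB but peaked at 1.5 GB is only 25% efficient,
# suggesting its memory directive could safely be lowered.
mem_eff = efficiency(requested=6.0, used=1.5)
```

Applied across thousands of recorded tasks, ratios like this are what make automatic resource optimisation possible.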

So you can see this first pipeline has kicked off and these tasks have been submitted. In a moment I'll go into AWS and show you that this is transparently happening in your own account, in your own compute environment — whether that's Slurm, AWS or any of the other execution platforms we have. Before we do that, though, I want to quickly show how you can create pipelines for different users: how do I take a pipeline I'm developing and make it available to other people in my group, maybe my institution, or maybe make it public and available to the world?

To create a new pipeline we give it a name, which is required, and a description if we want. All pipelines link to git repositories, so I can go to nf-core and select the pipeline I want — you could choose anything from git, but I'll pick something relatively simple. Let's say we want to run this pipeline: I copy the git repository URL and paste it in here, then copy and paste a description, and choose a compute environment.

I'll show you exactly how we set these up in a moment, but for now I'll just choose my default one — Batch in Paris. Once I've selected a pipeline, it shows me all of the git commits, versions, tags and branches available to run. Going back to the reproducibility element, this enables me to specify a particular version of a pipeline, or a particular code commit, that I'm launching, and all of those things come together to make the pipeline very reproducible. We're just leveraging git here — there are so many fantastic tools out there, why not make the most of them?
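Pinning a revision works the same way from the command line; as an illustrative sketch (the tag number here is an example, not from the talk), the `-r` flag selects a specific tag, branch or commit:

```shell
# Run a specific released tag of a pipeline for reproducibility
# (tag name is illustrative)
nextflow run nf-core/rnaseq -r 3.4 -profile test,docker
```

Recording the revision alongside the parameters is what lets a run be repeated exactly later on.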

I'll name it, say, "rnaseq BioCommons", choose a config profile — I'll just put some test data in here — and we're good to go: create the pipeline. It then becomes available for anyone in my workspace to run. When they select the pipeline it has a custom GUI — you can see there's input, output, email and MultiQC options — and we can save the merged fastqs and launch the pipeline.

This form is something defined inside the git repository — a nice feature we've added in the last six months or so: the ability to define custom GUIs and custom forms for the input of your Nextflow pipeline. This also leads to the ability to do parameter validation in the future, so we can say this file should have this format and warn you before the pipeline kicks off. You saw before that I was saving the merged fastqs — that was a boolean, this was the icon used for it, and all of that just gets rendered inside Tower for launching.
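In nf-core pipelines the form is driven by a JSON schema shipped in the repository (`nextflow_schema.json`); the fragment below is a trimmed, illustrative sketch rather than a real pipeline's schema:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "title": "Pipeline parameters",
  "properties": {
    "input": {
      "type": "string",
      "format": "file-path",
      "description": "Path to a samplesheet CSV describing the input reads"
    },
    "save_merged_fastq": {
      "type": "boolean",
      "description": "Save merged FastQ files",
      "fa_icon": "fas fa-save"
    }
  }
}
```

Types drive the rendered widgets (a boolean becomes a toggle, a file path a file picker) and also give Tower the information it needs to validate parameters before launch.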

There are a couple of other ways to launch pipelines besides the GUI. You can launch them via what we call actions. Actions are ways to create an endpoint specifically for launching a pipeline: I create a pipeline action — the same setup as creating a pipeline before — and that action is an endpoint, a launch hook, so if I hit that endpoint with my parameters, it triggers the execution of the pipeline.

This is an easy way into the API, and in fact the whole thing can be driven by the API: you have full access to the documentation for all of the endpoints, and the whole Tower system is controlled through an OpenAPI specification. Another way of doing this: we have a command-line interface tool, so instead of `nextflow run your-pipeline` you'd say `tower launch` — that's coming out in a few weeks' time. In terms of the runs, there's more information here too: firstly the datasets functionality, which we think is pretty cool — it means you can define datasets, like CSV files.
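Since a launch hook is just an HTTP endpoint, triggering it can be as simple as a POST. The sketch below is illustrative only — the URL path, token and parameters are placeholders; the exact endpoint shape is defined by Tower's own OpenAPI documentation:

```shell
# Trigger a Tower pipeline action; URL, token and parameters are placeholders
curl -X POST "https://tower.nf/api/actions/<action-id>/launch" \
  -H "Authorization: Bearer <your-access-token>" \
  -H "Content-Type: application/json" \
  -d '{"params": {"input": "s3://my-bucket/samplesheet.csv"}}'
```

This is how external systems — a LIMS, a sequencer's run-completion hook, a CI job — can kick off pipelines without a human touching the GUI.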

Currently people typically upload these files to a location — your CSV file telling the pipeline where your reads are — and then link that in when launching the pipeline, along with reports, the outputs of the pipeline, to make them visible. But I think the most interesting thing to finish on is the infrastructure itself and how easy it can be to set up. You can see here that we've got some tasks running in AWS Batch in Paris.

I'll show you now what's happening on the back end — as I said before, this happens transparently. I'll go into my AWS console, make sure I'm in the right region first (Paris), and then go to Batch, which is the main service Nextflow interacts with. In Batch you can see all of my tasks running — Tower is providing an interface into this — and you can see the queues which were automatically created by Tower, and the compute environments those queues link to.

If I select that compute environment — if you're familiar with AWS, it's actually an ECS cluster — it will in turn spin up EC2 instances, spot instances, and manage them: shutting them down when they're no longer needed, at the price that you specify. The tasks themselves run as Docker containers on those instances. So you can see those running here; if I select this task, you can see it's the individual task that was launched, and if I look at the EC2 instances underneath, it's actually just a c5.large instance — all of the compute we have running here is really just a c5.large instance.
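From Nextflow's side, targeting an AWS Batch compute environment like this comes down to a few configuration settings. The queue, bucket and region values below are illustrative placeholders, not the ones from the demo environment:

```groovy
// Illustrative Nextflow configuration for AWS Batch execution
process.executor = 'awsbatch'
process.queue    = 'tower-queue'          // Batch job queue (name illustrative)
workDir          = 's3://my-bucket/work'  // shared work directory on S3
aws.region       = 'eu-west-3'            // Paris, as in the demo
```

Tower's compute-environment setup generates and wires up these resources so users never have to write this by hand, but the same settings work for a plain command-line Nextflow run.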

Now, if I send a hundred or a thousand tasks, there'll be multiple of these instances spun up. This is a very powerful and very useful way to run pipelines, but it's very intimidating for most people — seeing this, you may well get the feeling that it's not something you want to be touching or setting up — and one of the things we've spent a long time thinking about is how to make it easy and accessible. The way we did this is via the concept of compute environments, which can be set up for any of the different cloud providers, on-premise schedulers or Kubernetes environments.

I'm going to show you how I can do this completely on the fly. Let's call this one, say, "AWS Batch BioCommons". I choose a platform — in this case Amazon — and put my credentials in; this is just a key, and the permissions required for it are shown below. I choose a region — I think I've got some buckets in a US region, so I'll select that — and select my working directory. Tower is using the API to look up and show me what resources I have available, which is also very nice: it means I don't have to remember exactly what's there.

I choose how many CPUs in total I want that queue to have — this is a limit on how big the cluster should grow when I'm submitting thousands of tasks — let's say 256. I can choose EBS auto-scaling, and various different file systems, EFS or FSx, for high-performance storage. If I want to get really advanced, I can choose which instance types to put in there: do I need GPUs, do I need specific things for these kinds of workloads to place inside that queue?

Ultimately, all of that just needs to be created: from here this will create the queue, create the compute environment and link those resources, and within about 10 seconds those resources will have been created inside AWS for me, and we'd be able to simply go and run the pipeline with that compute environment. I'll show you another quick example of how this looks for something like Slurm — for a scheduler.

Here, instead of cloud credentials we'd use SSH credentials, and we'd specify similar things: a work directory on our Slurm cluster, the username we use to log in, the Slurm queue name for the Nextflow head job, and the Slurm queue for the actual task jobs themselves — and these can of course be overridden from the Nextflow configuration. All of this can be managed in the context of teams: you have organisations inside the application, and you can think of several kinds of users — someone who wants to run pipelines, someone who wants to create pipelines, someone who wants to manage the infrastructure.
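The equivalent Nextflow configuration for a Slurm cluster is similarly small; the queue name and paths below are illustrative placeholders:

```groovy
// Illustrative Nextflow configuration for Slurm execution
process.executor = 'slurm'
process.queue    = 'compute'        // queue for task jobs (name illustrative)
workDir          = '/scratch/work'  // work directory on shared storage
```

Because the executor is just a configuration setting, the same pipeline can move between Slurm and AWS Batch without any change to the pipeline code itself.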

From inside the Seqera Labs organisation I can go into my verified pipelines, and you can see I have credentials here — shared credentials available to everyone in this workspace. We have participants who can have different roles — owner, maintainer, launcher, et cetera — and those roles define how people can interact with the resources and the permissions they have.

Let's quickly consider a few more settings. From the organisations themselves you can create different workspaces, which are the main way we interact with this. You also have members — the people who are part of your organisation — as well as teams. And there's an admin panel, for a global admin of the whole application, which gives you a fair bit of control over the way it runs.

I think I'm getting towards the end of my time here, so I'll stop sharing now and maybe we can take some questions for the last 10 minutes or so.

50:02 - That's perfect, thanks very much Evan — thank you for the excellent overview of Nextflow and Nextflow Tower and a showcase of what's possible. Now is the time for everyone in the audience to send through your questions: please type any questions you have for either Evan or Rob into the Q&A box and we'll do our best to answer them. We'll just wait a few moments for some Q&A to come through. So Evan, if there's something you quickly wanted to show, if you're feeling cut short of time, we might have a few more moments.

50:42 - Yeah, I can also speak about where we're going with this in the future — this has obviously been about a couple of years of work. A lot of people often ask what the business model is. The way we think about it is that we're an open-core company, where Nextflow is the core: 99% of the value we create is provided for people who want to use Nextflow, and the model is really around people who want to use Nextflow in production, particularly organisations who want to deploy Nextflow Tower.

As I said before, you can take the application itself and run it anywhere you want, and the model for that is a subscription licence to install the software yourself. There's a question: is it possible to implement Nextflow Tower locally, linking it to a local cluster? Yes — Nextflow Tower is a couple of Docker containers with a database on the back end, so you can install it on your local cluster like you would a similar service.

Think about something like RStudio that's available for everyone, or JupyterLab — something like that: you would install Tower and then link it to your local cluster, maybe a Slurm cluster, et cetera. An interesting point is that where Tower is installed and where the execution runs are quite distinct. I showed you all those compute environments; Tower in this case is actually installed in AWS, I think in London, but all the people using Tower are connecting to their clusters all over the world — their clusters can be essentially anywhere.

There's a clear distinction between where Tower is installed and the compute environment itself, and I think this creates the portability that we have. There's a question: does Nextflow support private git repositories? Yes, absolutely — with Nextflow itself you can log into your private git repositories using the traditional keys, and inside Tower there's credential management, which I can quickly show you.

In the credential management you create your credentials for Bitbucket, GitHub, GitLab, et cetera, and if you have another git provider you want integrated, that's always possible as well. This also works for self-hosted installations — if you have your own GitLab you can point it at your own URL — which can be quite useful for people working with that kind of model. Okay, another question: what's the recommended strategy for migrating existing legacy pipelines to Nextflow? If you're going to do it yourself, there are some key concepts, so it's probably worth investing a little time in understanding the Nextflow model.
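For command-line Nextflow, pointing at a self-hosted git server is done in the `~/.nextflow/scm` configuration file; the server name, user and token below are placeholders, not real values:

```groovy
// ~/.nextflow/scm — placeholder values, adjust for your own server
providers {
    mygitlab {
        platform = 'gitlab'
        server   = 'https://gitlab.example.org'
        user     = 'my-user'
        password = 'my-access-token'
    }
}
```

With a provider defined like this, `nextflow run` can fetch pipelines from the private server the same way it does from github.com.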

Ah — there are a couple of blog posts from, I think, six months ago that summarise a lot of the learning material for going through that process of learning Nextflow initially and converting pipelines. It starts off with five-minute overviews and goes through to full 10-hour courses if you want to do that. I can quickly share the blog post link with you right now.

It's the blog post called "Learning Nextflow in 2020", I believe — let me paste that into the chat. All right, well, I think that is all we have time for today. So I'd once again like to express our thanks to you, Evan, and also to Rob, for taking the time to talk to us today, especially given the unsociable hours at which you joined us. Thank you very much for being here. "Anytime, thanks a lot folks. Thanks a lot for your time." Thank you, thank you.

54:39 - So just before we wrap up, I'd like to quickly share some information. This webinar is part of a series of training events organised by the Australian BioCommons. You can find out more about upcoming webinars, workshops and events on our website, and watch recordings of previous events on our YouTube channel. Finally, we'd like to acknowledge our funding: the Australian BioCommons is enabled by NCRIS via Bioplatforms Australia funding. Thanks so much for watching; we hope you enjoyed today's presentation and that you'll join us again soon.

Until next time, goodbye.