Supercomputing with MPI meets the Common Workflow Language standards: an experience report

Oct 15, 2020 12:10 · 3356 words · 16 minute read

[Rupe] Hi! Thank you for listening to our talk about what happens when supercomputing with MPI meets the Common Workflow Language standards. So who are we? My name is Rupert Nash and I work at EPCC at the University of Edinburgh. My colleague Nick Brown also contributed to this work, as did Max Kontak at the German Aerospace Center, and joining me to present today is Michael [Crusoe]. [Michael] Hello. I'm at VU Amsterdam and with ELIXIR Netherlands. I'm also the project leader of, and one of the co-founders of, the Common Workflow Language standards project. [Rupe] What are we going to talk about? This is sort of an experience report talk.

We want to tell a bit of a story about how this work came about: the sort of 90% solution that we found that was missing a key feature – spoiler alert: the solution was CWL – and the thing it missed was MPI. Then we'll talk about the work we did together to produce a solution that does work, some of the unanticipated benefits that we got from this on the HPC side of things, and then some of the limitations and thoughts about how to take this forward. Nick, Max, and I have been working on the Visual Exploration and Sampling Toolkit for Extreme Computing (VESTEC) project, which is funded by the European Commission, and we're seeking to fuse high-performance computing and real-time data for urgent decision making for disaster response. So things like wildfires, and that's what the example workflow shown here is for.

And one of the things 01:44 - that we need to do is run workflows that include MPI-parallelised applications for doing simulations, and these need to be running on proper High Performance Computing systems: tens to tens of thousands of cores. My colleague Gordon, who is credited with that, is going to talk about this in one of the other S[uper]C[omputing] workshops, UrgentHPC. We wanted to use existing tools. We did have to have control over whether, where, and with what parameters parts of the workflow needed to run, to deal with new data, user interaction, things like that. So we did have to implement our [own] custom workflow management system, but we didn't want to implement all the parts of this. In particular, we were sure that tools must exist that could describe a workflow step's inputs; actually do the execution (hopefully of an MPI-parallel program); describe the results that this process generates; and represent all of this in a structured, computer-manipulable way.

Really what we wanted was a 02:48 - standardized way of doing this. And we found that CWL was a very close match to what we wanted. So why didn't the Common Workflow Language and its tooling work with MPI? First of all, for anyone who doesn't know, I'm going to briefly discuss what MPI (the Message Passing Interface) is. This is probably one of the most important standards for High Performance Computing, by which I mean large, tightly coupled simulations.

The basic way it works is that you start many copies of the same program 03:20 - across one or more servers, and they differ only by a unique index, which is called their rank. MPI is a library that allows these programs to communicate by passing messages in a point-to-point way, to do collective communications, synchronisation, input/output, and other sorts of things that you might need to do. Probably the most important thing on this slide is the final part here: typically, when you run an MPI program you have to start it with a special launcher program that deals with connecting between the multiple nodes of your cluster, and this could be called something like "mpiexec". The MPI standard specifically doesn't require that you start your program in any particular way, or at all, but if you do, it recommends that you provide a program called "mpiexec" which accepts as arguments the number of processes to start in parallel, your application, and any command-line arguments it would normally need. [Michael] So, briefly, the Common Workflow Language is a set of standards for describing command line tools and connecting these descriptions together to form workflows. We're talking about batch, automated data processing workflows, and one of the things we try to do with CWL, for portability, is to take things that we might write in a certain way on one particular system and instead be more explicit about what the goal actually is.

So, 04:54 - for example, let's say you want to use, or you have to use, a Docker software container. Instead of baking "docker run my_container" into the command line, we model that as "Hey! There's this Docker (format software) container available!", because the workflow engine might want to use the Singularity runtime engine, or Docker, or a different one, and we want to provide that flexibility. Since Rupert pointed out that MPI (invocation) doesn't have this fixed form, we need to have MPI modeled in the same way [as we modeled Docker]. And today in the [CWL] standards there is no MPI [concept], which is why Rupert came to us. So if we look at an example CWL description without any MPI, we see we've got this baseCommand called "echo" -- we're just doing a little "hello, world" thing here -- and we've got this input message [that] we want to put on the command line. So we see the description on the left, and on the right we see the CWL reference runner actually executing that.
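The slide itself isn't reproduced in this transcript, so here is a minimal sketch of the kind of plain "hello, world" tool description being discussed (the file name and layout are illustrative, not the exact slide contents):

```yaml
#!/usr/bin/env cwl-runner
# echo.cwl -- a plain CommandLineTool with no MPI involved
cwlVersion: v1.1
class: CommandLineTool

baseCommand: echo

inputs:
  message:
    type: string
    inputBinding:
      position: 1   # put the message on the command line after "echo"

outputs: []
```

Running it with the reference runner looks something like `cwltool echo.cwl --message "hello, world"`.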

You can imagine adding stuff to make this do 05:58 - Docker and other things, but this is just the basic example. [Rupe] So yes, now I want to talk a little about how we came to work together. Myself, Nick, and Max were working on VESTEC. We were exploring using CWL for our purposes and we discovered that CWL supports using JavaScript to compute values for some of the parts of the tool description. I created a few functions that would programmatically insert the necessary MPI job launch commands at the front of the command line that [the] CWL [runner] was constructing. There's a little example on your right, and you might notice that it has got a bit more going on in it than the one Michael showed a few minutes ago.
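The slide isn't shown here either, so the following is a hedged reconstruction of that style of workaround rather than the exact code from the slide. The launcher is prepended via an expression in `arguments`, since `baseCommand` itself cannot be computed, and in the real version the helper functions were pulled in from a shared file rather than written inline:

```yaml
#!/usr/bin/env cwl-runner
# echo-mpi-js.cwl -- illustrative only: prepending the MPI launcher with JavaScript
cwlVersion: v1.1
class: CommandLineTool

requirements:
  InlineJavascriptRequirement:
    expressionLib:
      - |
        // In the real tool descriptions these helpers were $include'd
        // from a shared library file, and knew about more than one launcher.
        function mpi_launch(nproc, command) {
          return ["mpirun", "-n", nproc.toString(), command];
        }

# No usable baseCommand: the real command is computed by the expression below.
arguments:
  - valueFrom: $(mpi_launch(inputs.processes, "echo"))
    position: 0

inputs:
  processes:
    type: int
  message:
    type: string
    inputBinding:
      position: 2

outputs: []
```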

This worked fine on my laptop and it worked on one of the 06:52 - supercomputers that we were seeking to use (a Cray machine in Edinburgh). As a sort of neutral point, this required using Node.js as a JavaScript execution engine, which is, you know, more software to build when you deploy these things, so that's not great. But probably one big problem is that this is a rather ugly tool description: you've got a lot of extra stuff going on; we've got to import a load of things; and then, this is probably the worst bit, instead of actually using "echo" as the baseCommand in the tool description, I'm having to do this ugliness with setting arguments and computing the [actual baseCommand] by calling this JavaScript function here. Even worse than that, it didn't work on any of the SLURM-based clusters we were trying to use, because SLURM, in its common setup, requires that you set some environment variables, and CWL does not support setting environment variables whose names have to be computed at runtime. And this totally failed to work with containers as well.

So we reached out to the CWL community and started to work with Michael on this, 08:09 - and it became clear that MPI has to be treated differently: it needs to be integrated into the job runner, comparable to the [CWL] software container functionality. So together we formulated these requirements. Michael? [Michael] Working through this process we realized which parts needed to be modelled in the tool description and which parts actually need to be site-specific configuration. We're making this an extension to the CWL standards for now, but ideally in the future it will become part of [the official] CWL [standards]. So we determined: actually we just need to tag the tool as "MPI" and say how many processes we're going to pass to it; basically everything else needed to be site-specific. So let's look at that CWL extension.

08:59 - Here we see on the right: this is our language for describing the CWL standards themselves, and so we're just going to add in this MPIRequirement, and it is going to have one field called "processes", which is the number of MPI processes desired. If it is set to zero then we determine we're not going to use MPI with this tool [for now], so there's actually some additional flexibility there. This [extension] was implemented in the CWL reference runner by Rupe, but we hide it behind a feature flag because it is an extension. We encourage the development of extensions [to the CWL standards], but we want people to know that they're doing something outside the standard.

So let's see what it looks 09:44 - like now, applied to our "hello, world" example. Because it is an extension we have to do a namespace thing, but really it is just that "MPIRequirement" and "processes". We've [added] an additional input so that in the workflow we could control the number of MPI processes [without changing the tool definition], but that could have been a fixed number as well. [Rupe] Yeah, exactly. [Michael] So, more brief than what we had before.
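As a concrete sketch of what that slide showed (reconstructed rather than copied, so names like `nproc` are illustrative), the same "hello, world" tool with the extension looks roughly like this:

```yaml
#!/usr/bin/env cwl-runner
# echo-mpi.cwl -- "hello, world" again, now tagged as an MPI tool
cwlVersion: v1.1
class: CommandLineTool

# The extension lives in the cwltool namespace, not the CWL standards proper.
$namespaces:
  cwltool: "http://commonwl.org/cwltool#"

requirements:
  cwltool:MPIRequirement:
    # Number of MPI processes; taken from an input here so the workflow
    # can control it, but a fixed integer would also work.
    processes: $(inputs.nproc)

baseCommand: echo

inputs:
  nproc:
    type: int
  message:
    type: string
    inputBinding:
      position: 1

outputs: []
```

With the reference runner this is gated behind the extensions flag, so the invocation is something like `cwltool --enable-ext echo-mpi.cwl --nproc 4 --message "hello, world"`.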

10:06 - [Rupe] Yeah, and from a user's point of view, you just need to figure out how to choose the number of processes, which is a thing you would have to be doing anyway if you're running a parallel job. The other thing that we had to do was to have a way to adapt to the various idiosyncrasies of the HPC systems we're using, and we settled on adding a configuration file. We pass this configuration file to the runner via a command line option, and it's just a simple YAML file that contains the sort of data you might expect (the full description is here [on the slide]): things like the name of the [MPI launcher] command to use; the flag that you use to set the number of processes; information about environment variables; and any extra flags you might need to pass in. We're going to show an example of this [configuration file] in a few slides. When you actually execute your tool, the runner checks for the "MPIRequirement" that Michael just explained, [and the runner] evaluates the "processes" attribute; if it's present and not zero, it uses these configuration details to construct an appropriate command line, which is prepended to the tool's actual command line.
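As a hedged sketch (the field names below follow the MPI platform configuration of the reference runner, cwltool, as we understand it; the SLURM-flavoured values themselves are made-up examples), such a file might look like:

```yaml
# mpi-config.yml -- illustrative platform configuration for a SLURM-style cluster,
# passed to the reference runner with something like:
#   cwltool --enable-ext --mpi-config-file mpi-config.yml workflow.cwl job.yml
runner: srun              # the MPI launcher command (mpiexec, mpirun, aprun, srun, ...)
nproc_flag: "-n"          # flag used to pass the number of processes to the launcher
default_nproc: 1          # processes to use if the tool doesn't say otherwise
extra_flags: []           # any extra site-specific flags for the launcher
env_pass:                 # environment variables to pass through to the launched job
  - PATH
env_pass_regex:           # variables matched by pattern, e.g. the ones SLURM needs
  - "SLURM_.*"
env_set:                  # variables to set explicitly for every MPI launch
  OMP_NUM_THREADS: "1"
```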

Given that we are giving this talk, it is no surprise 11:25 - that this basically works, but we in the VESTEC project got a few unanticipated bonus features. [The] tests pass, of course, and they are integrated into the CWL reference runner's test suite. We use this [CWL reference runner] within the VESTEC workflow management system to wrap individual tasks, and we could get this going on a laptop, on the ARCHER UK national supercomputer, and on a couple of SLURM-based clusters we used. There was no need for JavaScript; the tool descriptions were much improved. And we managed to make it work at least once using a container engine.

And here are some pretty pictures showing the outputs of a wildfire 12:05 - simulation. We also got this working with sub-workflows, which are a feature of CWL where you can compose many steps together in a workflow and treat that one thing as your entire tool invocation. This was quite useful to VESTEC for the cases where we've got a series of steps that are always going to occur together, for example: pre-processing some data; doing a large simulation; and post-processing to reduce that data down to just the results you need. When we set up one of these workflows, which is shown on the right here, we found that it just worked transparently. We had to do literally no further work here.
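As an illustrative sketch (the file names and ports here are invented, not taken from the actual VESTEC workflow), such a three-step group can be described as a workflow of its own and then used as a single step in the outer workflow, which would declare SubworkflowFeatureRequirement:

```yaml
#!/usr/bin/env cwl-runner
# simulate-group.cwl -- a pre-process / simulate / post-process group,
# intended to be run as one sub-workflow step inside a larger workflow.
cwlVersion: v1.1
class: Workflow

inputs:
  raw_data: File
  simulation_processes: int

outputs:
  results:
    type: File
    outputSource: postprocess/summary

steps:
  preprocess:                  # serial step
    run: preprocess.cwl
    in:
      data: raw_data
    out: [prepared]
  simulate:                    # MPI-parallel step; simulate.cwl carries the MPIRequirement
    run: simulate.cwl
    in:
      mesh: preprocess/prepared
      nproc: simulation_processes
    out: [fields]
  postprocess:                 # serial reduction of the simulation output
    run: postprocess.cwl
    in:
      fields: simulate/fields
    out: [summary]
```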

And we can, for example, run this preprocessing 12:50 - step on a single process (for various reasons it had to be run on the backend nodes of the supercomputers), then run our parallel simulation across anything from a few dozen up to a few hundred processes, and do the post-processing in serial, in the usual way. And this uses the same platform configuration for all of these MPI steps. Another thing that was useful from our point of view was performance monitoring. We have used the LIKWID tool from Friedrich-Alexander University in Germany to do this within the VESTEC system, because we are interested in monitoring the performance of our runs. We use the MPI configuration file to basically replace or add some extra arguments to our MPI run command to have it track various things, as sketched below.
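A hedged illustration of the idea (the exact flags on the slide aren't reproduced here, and the output-file pattern is an assumption): if the extra flags are inserted between the launcher and the application command, as they are in the reference runner's MPI support to our understanding, they can wrap each rank in LIKWID's likwid-perfctr tool:

```yaml
# mpi-config-likwid.yml -- illustrative: the launcher settings as before, plus
# extra flags that wrap the application in likwid-perfctr so that each rank's
# double-precision floating-point performance (the FLOPS_DP group) is measured.
runner: mpiexec
nproc_flag: "-n"
extra_flags:
  - likwid-perfctr
  - "-g"
  - FLOPS_DP
  - "-o"
  - perf_%h_%r.txt   # assumed output-file pattern; a core selection (-C) appropriate
                     # to the site's process pinning would also be needed in practice
```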

So for 13:45 - example on the right: this is using it to track the double precision floating-point performance and output this. This was on one of the clusters at DLR, the German Aerospace Centre. We validated that this was working by using the High Performance Conjugate Gradient (HPCG) benchmark, which we picked because it reports its own estimate of its floating-point performance, so we would be able to be confident that we had got things working correctly. I don't really want to dwell on this slide, but you can see that [we are comparing] the numbers HPCG reports to what we're measuring using LIKWID. Here, the two numbers in the [inaudible] are very similar to each other.

The fact that LIKWID is slightly over-reporting 14:27 - with respect to HPCG is not surprising, because HPCG gives an estimate based on a lower bound on the number of operations required. And one of the nice things is that we can just vary the flags in the configuration file to track different metrics which may be of interest to you. [Michael] So, as we hinted at earlier, we did test once using [the] Singularity [software container engine], and different HPC sites may be experimenting with or using software container approaches, so this is compatible with that, but probably some more work needs to be done on it in the future. [Rupe] Another issue, if you're running across multiple high performance computing centres, is getting all your software compiled, because in general you can't use software containers, which are one of the nicer ways of encapsulating all these things, on these systems. CWL supports this "SoftwareRequirement" feature.

So it's somewhat 15:29 - orthogonal to the classic Common Workflow Language approach that Michael was explaining before, but basically you add in your tool description a hint which says "I require some software packages". Here it's called "Meso-NH" (which is the name of the weather simulation code that we were using) and you specify an optional version, and the reference runner has a feature which allows you to map these identifiers to locally available packages, because not every centre is going to install these with exactly the same name, and perhaps there might be more details about exactly how it was compiled. For example, on ARCHER our Meso-NH module is loaded by the command "module load mesonh" (with no dash) and a long, complicated version string that specifies which compilers were used. So CWL lets us encapsulate all those things and have it work across multiple systems. Wrapping up, I'm just going to mention the issues that we know about and ask if the audience has any thoughts about those.
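A hedged sketch of that kind of hint (the version string here is invented, and the exact spelling used on the slide isn't reproduced):

```yaml
# Fragment of a tool description: declare the software needed as a hint,
# and let each site map the package name onto its own locally installed module.
hints:
  SoftwareRequirement:
    packages:
      - package: mesonh
        version: ["5.4.2"]   # illustrative version, not the one actually used
```

The mapping from this package name to a site-specific "module load mesonh/<version>" line is then handled by the reference runner's dependency-resolver configuration, outside the tool description, so each centre can point the same hint at its own installation.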

So, a few known limitations: we've given 16:49 - the Common Workflow Language only a very simple model of what an MPI program is. It just knows that the program has to be started in a special way with a given number of processes. But if you are running an HPC code you often need to consider many more things. For example, if you had a hybrid MPI/OpenMP code, you might want to run one MPI process per NUMA region on your hardware and then fill up the rest of the cores in that NUMA region by setting OMP_NUM_THREADS accordingly. The issue is that when you're writing your tool description you don't know how many cores and NUMA regions your execution hardware is going to have. So the question is how you express this to the runner in a fairly generic way.

This is a difficult problem, 17:39 - and so for now we are just setting these via the platform configuration file, but the problem with that, from our point of view, is that it then applies across all tools run by that invocation. So if you had a workflow that had a pure MPI tool in it and then a hybrid MPI/OpenMP one, you would have issues around setting the number of processes correctly and any extra flags about memory affinity and so on. We've deferred this to future work, but we do have a feeling that we could tackle it using the overrides feature of the [CWL] reference runner; if anyone in the audience has suggestions, that would be really nice to hear. There are a few smaller issues that we want to look at. We want to be able to specify which version of the MPI standard your tool requires (the MPI Forum is working towards version 4.0 at the moment), and we want to be able to specify what level of thread support your application actually needs, because that can be anything from no thread support at all to a fully multithreaded MPI library [being] available.
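To make the overrides idea slightly more concrete, here is a very tentative sketch (the file name, tool name, and settings are all hypothetical; it only shows the general shape of the reference runner's per-tool override mechanism, not a worked-out answer to the hybrid MPI/OpenMP problem):

```yaml
# overrides.yml -- per-tool tweaks applied at load time by the reference runner,
# so a hybrid MPI/OpenMP tool could get different settings from a pure MPI one
# without editing either tool description or the shared platform configuration.
cwltool:overrides:
  hybrid-simulation.cwl:        # hypothetical hybrid MPI/OpenMP tool
    requirements:
      EnvVarRequirement:
        envDef:
          OMP_NUM_THREADS: "18"   # e.g. fill one NUMA region per MPI process
```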

We need to do some 18:49 - proper testing with software containers, as these are now becoming much more popular in the HPC world. Finally, we want to extend the CWLProv tool to capture runtime information, specifically about things like the "module load" invocations that you do as part of setting up the tool. To sum up, we've created a minimal extension to the CWL [standards] that lets you specify "this is an MPI tool" and set the number of processes to start, and we've implemented this in the [CWL] reference runner along with a mechanism for platform configuration. So, from our point of view in the VESTEC project, we are pretty happy, because we found that CWL was a powerful and flexible tool that we could alter to support MPI.

19:35 - And that the CWL developers were very open, helped us to make these alterations, and then accepted them into the reference runner very promptly. We were also pleased to get the extra benefits around software requirements (making our tool descriptions more portable), the multi-step workflows that we could have, and the performance monitoring. [Michael] From our perspective we're really excited to see that our work on the CWL standards [was found] to be applicable to new communities, and to find somebody wanting to extend and improve an existing solution. So please go wheel shopping, and if the wheel isn't perfect, see what you can do to make it fit all your needs. We look forward to getting some version of the MPIRequirement into a future version of the CWL standards.

20:25 - [Rupe] All that remains is to say thank you very much for your attention, and to acknowledge funding from the VESTEC project and the use of the ARCHER national supercomputer. [Michael] And we look forward to having questions and answers next; thanks! [Rupe] Thank you.