Jeroen Janssens - Set your code free; turn it into a command-line tool

May 20, 2021 07:43 · 2521 words · 12 minute read

Thank you, and hello there. My name is Jeroen, and in the next 20 minutes I’m gonna try to convince you that if you have written some code, you should consider turning it into a command-line tool.

This is what I have in mind for the next 20 minutes. I’m going to say a few words about myself so that you know where I’m coming from. I’m going to briefly explain what the command line is, what tools are, and, most importantly, why you should care about this. Then we’re actually going to build a command-line tool so that you can see how that process goes, and then I’m gonna wrap up.

So, a few words about myself. I have a background in computer science and artificial intelligence. I did my PhD in machine learning around 2010, and back then I worked in Windows and coded everything in MATLAB. My code was not free at all; I wasn’t thinking in tools, so to speak. After my PhD I moved to New York to work as a data scientist at various startups, and that’s where things changed. I got the opportunity to write this book, Data Science at the Command Line. I’m currently working very hard on the second edition, which should come out sometime in October. You can read both editions for free, if you’re interested, at datascienceatthecommandline.com. And if I am not writing, I’m spending my time coaching and training others in a variety of topics related to data science.

So, about that title. There are a couple of interpretations of the word “free”. There’s free as in beer, there’s free as in speech, but I’m not going to talk about those. I’m also not going to talk about versioning code, packaging code, and distributing code. They’re all important, yes, but they’re outside the scope of this talk. I’m going to talk about free as in bird, in the sense that your code is light and can go everywhere it pleases.

The command line is a stark and unforgiving environment, but if you know which spells to cast, it can offer unlimited possibilities. So, a while ago, in February, Nature argued that researchers
should embrace the command line. They say that it can help you wrangle big files, parallelize your experiments, and automate your work. And yes, there are plenty of other reasons. Most importantly, it can save you from velociraptors.

But I’m not here to convince you to use the command line. No, I want you to consider turning your code into a command-line tool: something that can be executed from the command line and that can interact with other command-line tools. And the real reason I am talking to you about this topic is that I believe command-line tools enable these wider community ideals that I read here about csv,conf: interoperability, hackability, and simplicity. I took the liberty of not highlighting “data”, because there can of course also be interoperability regarding tools.

So command-line tools are simple. Well, most of them are. They usually do one thing and they do it well. They’re hackable in the sense that you can combine tools: they all have text as a universal language. And there’s also a sense of interoperability, in that you are able to leverage the tools and the command line in other places.

So, for example, here is JupyterLab. I’m showing here some Python code, there’s a Jupyter notebook, and there is a full terminal here at the bottom, all showing different ways in which you can use the command line within Python. Plenty of programming languages and environments have this ability. R and RStudio also allow you to use the command line and its tools.

So... excuse me? Yeah, you wanted to say something?

“I’m sorry, your slides aren’t popping up.”

Oh. Oh my. You’ve just been looking at my face, and that’s the least important thing of all. I forgot to press this button. Oh my, and just the thing that I worked so hard on, these slides. So how are we doing on time? I have plenty of time left, so let’s just...

“You’re good.”

I was thinking, where is all the laughter? I mean, all these
yeah, I’m working on this book. Yeah, Data Science at the Command Line. Here we go: my company; free as in bird; this is what the command line looks like, in case you have not seen it before. There’s this Nature article, which you should be able to find if you google for “nature” and “bash”. Now this, of course, refers to... well, I’m not going to explain it; you see the slides. These are the core values of csv,conf. There’s value in repeating all this, of course. JupyterLab, R, RStudio. Well, I’m glad you told me. Thank you for that.

And even Spark. Here I highlighted a sentence from the book Spark: The Definitive Guide, by the original author of Spark, and they say the pipe method is probably one of Spark’s more interesting methods. I think that’s quite a compliment coming from, well, the author of this 800-pound gorilla when it comes to wrangling a lot of data. And I think it’s really interesting that they’ve decided to add the functionality to leverage a 50-year-old technology.

So now I have sort of established what the command line is (we’re going to look at tools in a moment), but also that it’s in a lot of different places. Not only in different programming languages and environments: it can be found on everything from supercomputers to microcontrollers, and of course on your laptops. Even on Windows now, with its WSL, Windows Subsystem for Linux, you can easily run a Unix command line on your Windows system. So it’s everywhere and it’s here to stay.

So, about turning your code into a tool. All you have to do is follow these six easy steps. Now, I can imagine that right now you’re feeling a bit like Alice tumbling down the rabbit hole. Oh, don’t worry. Let’s all just take the red pill and I’ll show you how deep the rabbit hole goes. And let’s go to... here we go.

If you don’t see the full screen (I now have JupyterLab open): what I’ve learned about Crowdcast is that if you resize your browser, if you make it a bit wider, the aspect ratio changes and the entire slide or my entire
screen should be visible.

So in this demonstration I’m going to use Python, but the same principles apply to other programming languages. If your weapon of choice is R or JavaScript or Java, the syntax will be slightly different, but the same steps can be taken. I’ve also chosen a very simple example so that we can focus on the process itself rather than the code.

So here we’ve got some analysis. What we’re going to do in this piece of code is open up a text file that contains the adventures of Alice in Wonderland, count the number of lines that contain the term “Alice”, and then print the result. Now of course it’s up to you to use your imagination and think: okay, how does that apply to my code?

So the very first step, although it’s a bit trivial at this point, is to copy this to a file, a regular text file, and that’s what I have right here. That’s the first step that you need to do, and of course this becomes more complicated when your code is scattered across many different files or notebooks and so forth.

Let’s see now. Shoot, where is my command line? Oh, there, it’s hiding. Let’s move that over here. So now you should see... click on that line... I’ve got a couple of files here: count.py. Now let’s first check that it’s working. Right: 401. That’s the number of lines on which Alice appears in this text. And this is, again, a really simple command-line tool. In fact, what I’ve just done is implement, quite poorly to be honest, grep. Oh, it should be... right. But that doesn’t matter; it’s the process, like I said, which matters.

So I’m now going to take you through the steps that are needed in order to turn this into a proper command-line tool. By the way, if you have any questions, I unfortunately don’t have the time to look at them now, but there are various ways you can get a hold of me after this session.

So, okay, the very first thing we want to do... not this file; that is the text itself. Let’s copy that over from the previous directory; that will speed things up a little bit. So the very first step that we want to do, or the second step actually, after we’ve put things into a file, is to add some arguments. Because this piece of code does the same thing over and over again, and in order to make this tool more usable and more general, you have to think about: okay, what are some parts that can vary, that I would like to change? So what I’ve added here is this import statement, and I’m using an argument that is passed here at the command line. The first element is always the name of the file, and so the second element, which in Python is element one, would be an argument. So now I can test if “Alice” still works, but I could also try out other values.

So we’ve already made our code a little bit more general, but of course there are other things we can do. Now let’s see. This piece of code only works on Alice. This code takes on the responsibility of opening up a file and reading that file, and it’s the same file over and over again. Now of course you could turn that into an argument, just like the pattern is. So “Alice” [unclear], that’s an argument. So we could also turn
the file name into an argument, but another approach is to let your tool read from standard input. That’s a standardized way of feeding data into a tool, and that’s what this code is doing. Let me put them side by side... I guess if I do it like this. So now we have those two files side by side, and the second version here, the newer version, is even a bit simpler. It doesn’t open a file; it just reads from standard input. So it has moved that responsibility to outside the tool. So now we have to read it in. This is one way you can do it, and then we can pipe that into... well, okay, so this works. But now that this tool is reading from standard input, anything is possible. We could even...

“Sorry, this is your five-minute warning.”

Thank you very much. That’s actually a minute more than I had; I’m gonna make use of it.

So any data that you feed into this, it can now handle. What I’m doing here is using the command-line tool curl to download a file. This is the sequel to that book, Through the Looking-Glass. Wow, that took a long time to download a book. I can silence that one... is it going to take that long still? Anyway, you can see that Alice is mentioned on 465 lines in this book. So there’s already a lot more possible now. In fact, it is just text, so we could even generate a list of numbers. Here I’m generating 1 through 100, and I could pipe that to the tool and say: I want you to count the number of lines on which a 3 appears.

So, moving on, because this doesn’t really feel like a proper command-line tool yet. What I’ve changed here is I’ve added a single line. These two first characters are known as the shebang, or hashbang, and this tells the shell how to execute the file: what follows those characters is the program which is responsible for interpreting this source code. Now, unfortunately (or no, that’s not really unfortunate), we need one more step in
order to make this work, and that is to change the permissions on this file. Otherwise we would get an error like: hey, you don’t have the right permissions. So now I can use it like so. And then, I guess, if you change this to just being count, because the command line doesn’t really care about extensions, this already starts to feel more like a proper tool. Now there’s one final step. I’m gonna leave that as a take-home exercise; you’ll be able to find it in my book.

Let’s wrap up, because I have a few closing thoughts here. These steps, I’d say, are pretty easy. It’s the thinking about what goes into this tool that is hard. Once you have a tool, even if you don’t use it yourself, it will benefit others, at least if you want others to be able to use your tool. They, and perhaps you as well, would then be able to tap into the existing ecosystem of all these command-line tools. So all this functionality of downloading files, of scheduling, of monitoring, and even parallelizing all becomes available. And then I haven’t even talked about packaging and distribution, which is very important in itself.

So that’s my talk. Thank you very much for listening. I hope I’ve been able to sort of convince you. If you need further convincing, or if you have any questions at all, you can leave a message on Slack or you can send me a tweet; I’m on Twitter here. And again, good luck. Thank you to the organizers of csv,conf. Enjoy the rest of your conference, and, yeah, I’ll...
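
The code shown on screen during the demonstration is not part of this transcript. As a rough illustration of where the walkthrough ends up, here is a minimal sketch that combines the steps described in the talk: a shebang line, a pattern taken from the command-line arguments, and input read from standard input. The function names `count_matching_lines` and `main` are assumptions made for this sketch, not the speaker’s actual count.py.

```python
#!/usr/bin/env python3
"""count: print how many lines on standard input contain a pattern."""
import sys


def count_matching_lines(lines, pattern):
    """Return the number of lines that contain pattern as a substring."""
    return sum(1 for line in lines if pattern in line)


def main(argv):
    # argv[0] is the name of the script; argv[1] is the pattern to look for.
    if len(argv) != 2:
        print(f"usage: {argv[0]} PATTERN", file=sys.stderr)
        return
    # Reading from standard input leaves it to the caller to decide
    # where the data comes from (a file, a download, another tool).
    print(count_matching_lines(sys.stdin, argv[1]))


if __name__ == "__main__":
    main(sys.argv)
```

Saved as `count` and made executable with `chmod u+x count`, this could be fed from a file, for example `./count Alice < alice.txt` (where `alice.txt` is a hypothetical file name), or from another tool, for example `seq 100 | ./count 3`, which should print 19.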