AutoML (Machine Learning para Vagos) - PyConES 2020

Oct 3, 2020 14:30 · 2635 words · 13 minute read

Hello and welcome to another PyConES 2020 Pandemic Edition talk. The title of this talk is: “AutoML, Machine Learning for Lazy People”.

Let me share a little bit about myself. I'm Manuel Garrido, Lead Data Architect at Daltix. I'm also a professor at ENAE Business School and the Lisbon Data Science Academy, and the author of the Udemy course “Machine Learning y Data Science con Python” (in Spanish). About my educational background, I studied Industrial Engineering at Universidad Politècnica de València and I have a Master in Management from IE Business School.

Regarding my work experience, I've had a pretty standard career: I started as a Consultant, then moved to an Analyst position, and then I became a Data Scientist. About the technologies I have used: as a Consultant I used Excel. As an Analyst I also used Excel, but there comes a time when you are tired of doing the same reports again and again and you discover the world of programming with VBA macros. Amazing! Then you want to move to something more serious and you start using R, you start developing predictive models, and finally you move to Python, basically because it's better, I'm sorry.

This is my contact info: hola@manugarri.com. And a little bit of corporate info about Daltix: Daltix is a Belgian company with its tech department here in Portugal (I live in Lisbon), and we do Retail Analytics: analyses and information for the main supermarket chains in Europe. We are looking for a Data Engineer/Analyst to work in Lisbon, on my squad, my team. If you want to come and live in Lisbon, which is super trendy and has fewer coronavirus cases, send me a message.

Let's continue: what is AutoML? We call AutoML the set of techniques that help automate the development of predictive models.

02:15 - If Machine Learning is already a way to automate certain processes, using historical data so that a machine can perform tasks autonomously, Automated Machine Learning goes one step further: it focuses on automating the steps that can be automated in the Machine Learning development process. I have looked for the origins of AutoML; it's fairly recent. One of the most important research groups is automl.org, a joint research group between the universities of Freiburg and Hannover, and one of the most prolific groups in AutoML research. This is their site, automl.org; pretty neat, they have a lot of papers.

As I mentioned before, on every Machine Learning project, on every Data Science project, there are a series of steps that are common; steps that happen on every project even though the actual problem and the dataset are different each time. These are the general steps we follow on a Data Science project: goal definition with the business stakeholders, or the client; they meet with you to define what they actually want to achieve and how it can be translated into a Machine Learning problem: which goal we are going to optimize, what will be considered a success, expected deadlines, etcetera.

03:44 - Afterwards we perform Exploratory Data Analysis to gain an understanding of our data: find errors, check whether we have enough data or the required fields, whether it is even possible to achieve the defined goal with our current data, etcetera. Next we move to the model development phase; here I have expanded this step in a bit more detail. We usually start by defining the data preprocessing, i.e. what to do with null values, outliers, encodings of categorical fields, etcetera. Next we move to feature selection/engineering: feature selection, meaning which independent variables we want to keep, and feature engineering, meaning how we are going to vectorize our fields, which embeddings we will use for text data, etcetera. Next we move to model selection, a phase in which we either use our domain knowledge, select models that the existing literature shows are promising, or do what most people do: import a ton of different models and brute-force them in a for loop, trying to assess which ones seem to perform better.
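To make that last step concrete, here is a minimal sketch of that manual brute-force loop (the dataset and the candidate models are arbitrary choices, purely for illustration):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

X, y = fetch_california_housing(return_X_y=True)

# The manual "brute force" search: try a bunch of models and keep
# the one with the best cross-validated score.
for model in [LinearRegression(), Ridge(), RandomForestRegressor(n_estimators=100)]:
    score = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5).mean()
    print(type(model).__name__, score)
```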

04:54 - Once you have found the most promising algorithm, we have to optimize it and find the best set of hyperparameters; that is, the settings each model has, to find which ones work best in terms of our initial goal. And then we go back to the beginning. We keep iterating like this, hopefully improving our goal along the way, minimizing our loss function. And finally we deploy our final model. All the tasks inside this box are sometimes quite tedious, and they are the steps that AutoML tries to automate.
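And here is the matching sketch of the manual hyperparameter search, using plain sklearn GridSearchCV (again an arbitrary model and grid, for illustration):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = fetch_california_housing(return_X_y=True)

# Manual hyperparameter optimization: try every combination of
# settings and keep the best one according to our scoring function.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestRegressor(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```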

What is not AutoML? As of today, in the year 2020, AutoML techniques are not a plug-and-play solution, at least in my opinion. The existing open source libraries, and I have been checking them for years, need maintenance and someone taking care of them; they are not something you can import and expect to work magically with your dataset.

05:46 - If the field of Machine Learning is by itself a fairly new field, the field of AutoML is currently in a bit of chaos: libraries in alpha, libraries that don't even work out of the box, etcetera; you need to know what you are doing to use them. They are like the JavaScript of Data Science, where you have tons of different frameworks and it's madness. Commercial solutions work, but they always require your data to fit into their expected schemas; rare is the system where you can drag in a dataset exactly as your client provided it and have it just work. So AutoML is not a plug-and-play solution, and of course it is not a replacement for an actual Data Scientist. Most of the value a data scientist provides is not in the Machine Learning modeling part, even though that is what most hype articles on the internet would have us believe; more and more, model development is becoming a commodity, where it is quite easy to get a model that works well enough.

06:49 - And if your model's performance is the core competitive edge of your company, you can always set everything up with an existing solution and then spend your time developing cool algorithms, publishing papers, and being a rockstar. A big part of the value we Data Scientists provide is translating business needs into a Machine Learning problem. I have been working on this for some time, and at the end of the day that is the most important part: being able to define goals clearly, to get a clean dataset, etcetera, is where you actually spend most of your time and where Data Science projects are made or broken. That being said, I'm going to show a few of the most “established” libraries, “established” in quotes, meaning that I would still consider all of them alpha projects, and I don't know of any company that uses any of these libraries without being very careful. Me, I don't use them in production at all. We start with auto-sklearn.

07:57 - This is the project homepage; auto-sklearn is part of the automl.org group, one of their projects, and one of the oldest libraries in this space. It is quite easy to use: it tries to be a drop-in replacement for a scikit-learn estimator, and it provides model selection (of sklearn models) and model optimization via SMAC, a Python library for bayesian optimization of complex functions. I am not going to dive into the topic; if you are interested, here is the paper describing the SMAC algorithm: you can go to the link and spend a couple of days reading it.
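Since we are not diving into SMAC itself, here is a toy illustration of the general bayesian optimization idea, using scikit-optimize instead of SMAC (my substitution, purely to show the concept; this is not what auto-sklearn runs internally):

```python
from skopt import gp_minimize

# Pretend this trains a model with hyperparameter x and returns a
# validation loss that we want to minimize.
def objective(params):
    x = params[0]
    return (x - 3.0) ** 2

# Instead of a blind grid, bayesian optimization picks each new
# candidate based on a probabilistic model of past evaluations.
result = gp_minimize(objective, dimensions=[(-10.0, 10.0)], n_calls=25, random_state=0)
print(result.x, result.fun)  # best x found and its loss
```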

08:38 - Let's just say it uses bayesian optimization techniques to decide which models seem to perform better and which ones to try next. All AutoML libraries work by doing a search; the same search that you would do manually, these libraries try to automate. auto-sklearn tries to behave like a sklearn estimator. How does it work? Simple: we import from autosklearn.regression if it's a regression problem, or from autosklearn.classification if it's a classification problem; there we find AutoSklearnRegressor or AutoSklearnClassifier. Then we import the scoring function we want to use and create an instance of the estimator, AutoSklearnRegressor. In these AutoML libraries we don't know in advance which model we will end up with, so we can't specify its hyperparameters.

09:28 - So instead we use meta-hyperparameters: the parameters that define the search. In this particular library, autosklearn uses the parameter time_left_for_this_task, meaning how long autosklearn may spend searching for a model (defined in seconds). Here we are telling autosklearn: “You have 3 minutes to search for the best model, try to optimize the mean squared error.” We fit the model like any other sklearn model, with the training X and y, and we are done! Now we have a trained model: we can predict with it, and we can save it with joblib like any other sklearn estimator. This is the AutoML library I would recommend if you want to use one in production. Be aware that it does not preprocess data, meaning that if you have categorical variables or null values it's going to explode.
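Putting that together, a minimal sketch (assuming X_train, y_train and X_test are already clean, numeric arrays, for the reason just mentioned):

```python
import joblib
from autosklearn.metrics import mean_squared_error
from autosklearn.regression import AutoSklearnRegressor

# "You have 3 minutes to search for the best model,
# try to optimize the mean squared error."
automl = AutoSklearnRegressor(
    time_left_for_this_task=180,  # search budget, in seconds
    metric=mean_squared_error,
)
automl.fit(X_train, y_train)

predictions = automl.predict(X_test)
joblib.dump(automl, "model.joblib")  # save it like any other sklearn estimator
```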

10:22 - TPOT. This library is quite “mature”, again with quotes. This is the homepage; it tries to offer a scikit-learn-like interface as well. It supports scikit-learn and XGBoost, and its model search is completely different from autosklearn's: it uses evolutionary algorithms. It creates trees of ML pipelines, builds generations, chooses the best ones based on the scoring function, and then creates new generations from those pipelines with slight variations; it is pretty neat. It uses DEAP, a library for implementing evolutionary algorithms in Python, to find the pipeline that works best.

11:06 - How do we use TPOT? Simple: from tpot import TPOTRegressor, or TPOTClassifier for classification problems. We instantiate the search; we can use the parameter max_time_mins, how many minutes it may spend on the search, for example 3 minutes, plus the scoring function, which works with any of the sklearn ones, for example mean squared error. Afterwards we fit, and done, we have the best model fitted. We can predict, and because it uses sklearn or XGBoost we can save the best pipeline once trained; it is accessed via the fitted_pipeline_ attribute and we can save it with joblib. One neat yet mostly useless thing TPOT does is that it can autogenerate the code required to build the pipeline as a .py file, like a script; when I have tried this it has never worked out of the box, but well, it's something interesting.
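As a sketch (with the same assumption of pre-cleaned X_train, y_train, X_test as before):

```python
import joblib
from tpot import TPOTRegressor

# 3-minute search budget; sklearn scorers are maximized,
# hence the "neg_" prefix on the metric name.
tpot = TPOTRegressor(max_time_mins=3, scoring="neg_mean_squared_error")
tpot.fit(X_train, y_train)

predictions = tpot.predict(X_test)
joblib.dump(tpot.fitted_pipeline_, "pipeline.joblib")  # the best sklearn pipeline
tpot.export("best_pipeline.py")  # the autogenerated script mentioned above
```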

11:59 - Next we have autokeras, a slightly newer library that tries to generate deep learning models using the popular library Keras, now tensorflow.keras. It implements model selection, and because it uses deep learning models it supports many types of inputs, as long as they are numerical, that is. It supports text, images, structured data, and multi-modal inputs; those are cool. For example, at Daltix we have product data containing images, product descriptions, prices, etcetera, and you can use them all as inputs for the model search. It uses a technique named NAS, Neural Architecture Search (this is autokeras' site, by the way): the set of AutoML techniques for finding deep learning model architectures automatically; here we see a tiny machine that decides when to add a layer and when to remove one. It works similarly to the previous libraries: we import our structured data classifier or regressor (there are other types for text or images) and we fit. One thing I don't like about autokeras is that it doesn't let you specify the search time; you have to tell it how many models it may test during the search, and then, when fitting, how many epochs it should train for. Once the search is done you can predict as usual, and you can export the model as a Keras model with export_model.
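A minimal sketch for the structured data case (the trial and epoch counts are arbitrary choices):

```python
import autokeras as ak

# max_trials = how many candidate architectures the search may test;
# there is no wall-clock budget like in the previous libraries.
reg = ak.StructuredDataRegressor(max_trials=10)
reg.fit(X_train, y_train, epochs=20)  # epochs to train for

predictions = reg.predict(X_test)
model = reg.export_model()  # a regular tf.keras model you can save
```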

13:38 - Finally we have PyCaret; this is the homepage, pycaret.org. It's fairly new, but it seems to be well maintained, and I've included it because it looks super promising. It is focused on helping with every aspect of Data Science: you can open your Jupyter notebook and it provides a ton of features covering the whole ML workflow, like data preprocessing, feature selection, and model selection (it supports scikit-learn, XGBoost, and LightGBM), plus model optimization, and it can even deploy models to AWS; pretty neat. How do you use it? From pycaret.regression, or pycaret.classification, we import what we need (they recommend using the asterisk import, but I don't like doing that and it's not necessary), including the setup function used to compare AutoML models. I use the original dataset; I don't split train and test, pycaret handles that. I specify the target variable and the train size, and it does the preprocessing. compare_models performs the model search; I think pycaret just brute-forces it, so it's not as cool as the other libraries. We specify the search time, in minutes, and the metric we want the models to be ranked by. This generates a ranking (if you are in a notebook it renders pretty cool HTML tables), and then you can fit the final model with the function automl; this produces a fully trained model, and then you use predict to generate predictions.
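A minimal sketch of that flow (df, the "price" target and new_df are placeholders, and I'm assuming the pycaret 2.x API here):

```python
from pycaret.regression import setup, compare_models, automl, predict_model

# pycaret handles the train/test split and the preprocessing itself.
setup(data=df, target="price", train_size=0.8)

# Rank candidate models; budget_time is the search budget in minutes.
compare_models(sort="MSE", budget_time=3)

best = automl(optimize="MSE")  # a fully trained best model
predictions = predict_model(best, data=new_df)
```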

There are many more libraries; I have put them here ranked from the most trustworthy to the least. The two at the end have an asterisk because they are not pure AutoML. Prophet is AutoML, but focused exclusively on time series; it's from Facebook, pretty stable, and it works very well, but again, only for time series.

15:29 - pandas-profiling does not do any AutoML, but it generates an automatic report on your data that is pretty useful; I always recommend it to my students.
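In case it's useful, this is roughly all it takes (the CSV path is a placeholder):

```python
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("my_dataset.csv")
profile = ProfileReport(df, title="Dataset report")
profile.to_file("report.html")  # a full, browsable EDA report
```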

About AutoML as a service: all of the major cloud providers have an AutoML offering. Google has Google Cloud AutoML; AWS, as part of its Data Science suite SageMaker, has something called Autopilot; and Microsoft's Azure has Automated Machine Learning. Finally, I have DataRobot here because it was one of the first companies to offer something like this, all the way back in 2012 or 2013. I used their tool back in the day and it worked well; they provide a full platform to train and deploy models.

And finally, since I have been testing these libraries for quite some time, I have a GitHub repository named autoautoml where I try to automate the automated search of Machine Learning models. I share it here in case anyone would like to check it out, and if anyone would like to help, here is the repository. 16:44 - I implemented some basic utilities to help test and run ML problems in different containers, and I have some utils to run the problems on AWS Batch or Docker. Let's say you want to do something like this: I have this problem and I want to run it in Docker; I want to read the dataset from this working directory in S3; this is the target variable; it's a regression problem; and we want to run this container, use Ludwig, and generate these artifacts. It's in very early alpha, because I work on it in my spare time. And I don't know how much time I have left for the talk; I hope we are done, because I don't have anything else to say. If you have any questions let me know, and thanks for watching!