Visualizing AutoML Processes with AutoMLizer

Opening the covers of these black-box systems to understand their inner workings is a step in the right direction for greater transparency. AutoMLizer helps in this regard by providing a mechanism to track the progress of an AutoML system in real time and gain deeper visual insight into its decisions.

February 26, 2020 - 6 minute read -
Realtime Analytics ML

Summary

AutoML systems attempt to find the best machine learning pipelines through data preprocessing, feature engineering, and algorithm and hyperparameter selection. Most AutoML systems are black boxes. Opening the covers of these black-box systems to understand their inner workings is a step in the right direction for greater transparency. Why did the AutoML system select a particular pipeline? What was the scope of the hyperparameter space? And what was the impact of various hyperparameters on pipeline performance? This blog proposes a set of analytical methods to answer such questions and provide an intuition about the inner workings of such systems.

Background

The impact of Machine Learning on our world is significant. Across applications, the model-building flow is fairly consistent and comprises a series of steps, as shown below. Most of them are iterative in nature and are prime targets for automation. Automated Machine Learning (AutoML) systems automate parts of this flow.

ML Process

Several open-source and commercial AutoML systems are currently available, and their use is slowly becoming ubiquitous given their value in automating the drudgery of finding the best-performing model.

And just as everything AI-related tends to be hyped, AutoML is not spared. Cutting through this hype, the concept and workings of AutoML systems are fairly simple.

AutoML systems primarily solve the CASH (Combined Algorithm Selection and Hyperparameter optimization) problem. They generate a vast domain of pipelines from a predefined set of algorithms and relevant hyperparameter (or hyperpartition) values. Executing every pipeline and tracking its performance is a near-impossible task given the sheer number of permutations and combinations in the algorithm and hyperparameter space. To navigate this search space effectively and find top-performing models, AutoML systems combine optimization techniques such as Bayesian optimization with meta-learning. This allows them to identify the best pipelines within an acceptable budget (usually expressed as processing time or generations, or bounded by the number of algorithms and the range of hyperparameters provided).

In other words, an AutoML system finds the best-performing pipeline within the specified budget by building thousands of pipelines and searching this space with optimization techniques.
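The search described above can be sketched in miniature. The sketch below uses plain random sampling of (algorithm, hyperparameter) combinations as a stand-in for the Bayesian optimization and meta-learning a real AutoML system would use; the search space and budget are illustrative, not TPOT's actual configuration.

```python
import random

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Candidate algorithms and hyperparameter grids (the "configuration space").
SEARCH_SPACE = {
    LogisticRegression: {"C": [0.01, 0.1, 1.0, 10.0], "max_iter": [200, 500]},
    RandomForestClassifier: {"n_estimators": [10, 50], "min_samples_leaf": [1, 5, 10]},
}

def sample_pipeline():
    """Sample one (algorithm, hyperparameter) combination at random."""
    algo, grid = random.choice(list(SEARCH_SPACE.items()))
    params = {name: random.choice(values) for name, values in grid.items()}
    return algo(**params)

def search(X, y, budget=10):
    """Evaluate `budget` sampled pipelines and keep the best scorer."""
    best_score, best_model = -1.0, None
    for _ in range(budget):
        model = sample_pipeline()
        score = cross_val_score(model, X, y, cv=3).mean()
        if score > best_score:
            best_score, best_model = score, model
    return best_score, best_model

X, y = load_iris(return_X_y=True)
score, model = search(X, y, budget=8)
print(f"best score={score:.3f} model={type(model).__name__}")
```

A real system replaces the random sampler with an optimizer that proposes the next configuration based on the scores observed so far, which is what lets it find good pipelines without exhausting the space.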

AutoML System

Users running AutoML systems are faced with some key decisions and need answers to critical questions.

  1. Does the default configuration space (algorithm and hyperparameter choices) produce high-performing pipelines, or does it need to be adjusted? Is the search space optimally configured?
  2. Does increasing the training budget effectively find a “better” pipeline?
  3. Can I get deeper insights into the performance of the AutoML system during the execution cycle in order to increase confidence in the pipelines generated?

The proposed analytical and visualization methods (referred to as AutoMLizer going forward) enable the user to reach these key decisions.

Related Work

This work is heavily inspired by the paper ATMSeer: Increasing Transparency and Controllability in Automated Machine Learning. In the paper, the authors propose similar ideas for ATM, an AutoML system they built. I am using TPOT as the underlying AutoML system.

AutoML Visualizer (AutoMLizer)

AutoMLizer is a real-time visual intelligence application written in Python. It helps the user reach critical decisions by analyzing the effectiveness of the pipelines generated by the AutoML system and by providing a deeper analysis of the effect of hyperparameters/hyperpartitions on pipeline performance.

Algo Histogram

The front-end UI comprises three sections:

  • Run Specifications: This control panel allows the user to provide the specifics of a “Run,” including the training data, label, problem type (classification or regression), evaluation metric, and budget. The label selection drop-down is automatically populated from the columns of the provided dataset.
  • Run Status and Pipeline Performance: This section lets the user track the status of the Run in real time and follow pipeline performance via a histogram, pipeline profile charts, and a Top 10 Pipelines chart. In addition, two gauge charts show the percentage of the hyperparameter and algorithm search space covered during the course of the Run.
  • Detailed Profiler: This section contains algorithm histograms for deeper analysis of each algorithm’s performance and an interactive experience driven by the Top 10 Pipelines chart. Selecting any pipeline on this chart displays the details of its components (currently limited to 10 values). By further selecting each algorithm in the pipeline, the user is shown detailed profiles of the hyperparameters and their performance during the run.
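The two gauge charts reduce to simple ratios: distinct algorithms and configurations tried so far, divided by the size of the configured search space. A minimal sketch, assuming the space is represented as a mapping from algorithm name to hyperparameter-value lists (the counts below are illustrative, not TPOT's real configuration):

```python
def space_size(config_space):
    """Total number of hyperparameter combinations across all algorithms."""
    total = 0
    for grid in config_space.values():
        combos = 1
        for values in grid.values():
            combos *= len(values)
        total += combos
    return total

def coverage(config_space, evaluated):
    """Return (algorithm %, hyperparameter %) covered so far.

    `evaluated` is a list of (algorithm_name, frozenset_of_param_items)
    pairs, one per pipeline the Run has tried.
    """
    algos_tried = {name for name, _ in evaluated}
    configs_tried = set(evaluated)
    algo_pct = 100.0 * len(algos_tried) / len(config_space)
    hp_pct = 100.0 * len(configs_tried) / space_size(config_space)
    return algo_pct, hp_pct

space = {
    "LogisticRegression": {"C": [0.1, 1.0], "max_iter": [200, 500]},
    "RandomForestClassifier": {"n_estimators": [10, 100]},
}
tried = [
    ("LogisticRegression", frozenset({("C", 0.1), ("max_iter", 200)})),
    ("LogisticRegression", frozenset({("C", 1.0), ("max_iter", 200)})),
    ("RandomForestClassifier", frozenset({("n_estimators", 10)})),
]
print(coverage(space, tried))  # (100.0, 50.0)
```

Feeding these two percentages to the gauge components gives the real-time coverage view described above.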

Application Architecture

Application Architecture

The application architecture comprises:

  1. Backend Engine: Facilitates all requests from the front end, runs the AutoML engine, and coordinates I/O with a back-end database. Because it is built from the ground up around APIs, every aspect of the application can be accessed from any standard client.
  2. Front End: A thin visualization client built with Dash that uses the APIs to communicate with the Backend Engine.
  3. Database: MongoDB persists all the data generated during the course of the Run.

Use Case 1:

Diabetes Dataset: The dataset is used to predict whether a patient has diabetes based on the various attributes provided in the dataset.

Type: Classification

The user has selected 29 generations as the budget for the AutoML framework.

The Run Status components of the app indicate a successful run, with most pipelines scoring above 70%, providing the necessary benchmark for further modification. In addition, the best score is around 78.8%, with more than 90% of the hyperparameters and 80% of the algorithms covered from the given configuration.

Run Status

By comparing the algorithm histograms, the user gets an intuition that LogisticRegression and RandomForest, along with their ensembles, are performing better than the rest of the configuration.

This is further confirmed in the Top 10 Pipelines chart, as Pipeline-122 and Pipeline-6 are composed of these two algorithms. The user then analyzes how the hyperparameters/hyperpartitions of these pipelines contribute to pipeline accuracy by selecting a pipeline and then the corresponding algorithm. The details of the hyperparameters and hyperpartitions are shown in the Detailed Profiler.

Top 10 Pipelines

In addition, the user notes that the hyperpartition profiles of C and max_iter may need to be adjusted in the next run. Selecting RandomForest reveals something interesting in the max_features and min_samples_leaf profiles: keeping min_samples_leaf below 10 and max_features between 0.4 and 0.6 seems to yield the highest-performing pipelines.
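This kind of observation amounts to slicing the evaluated-pipeline records by hyperparameter region and comparing scores. A small sketch of that slicing, using made-up records (the numbers are illustrative, not the actual run's output):

```python
def best_region(records, min_leaf_max=10, mf_range=(0.4, 0.6)):
    """Filter RandomForest evaluations to the hyperparameter region the
    Detailed Profiler suggests performs best, and report its mean score."""
    region = [
        r for r in records
        if r["min_samples_leaf"] < min_leaf_max
        and mf_range[0] <= r["max_features"] <= mf_range[1]
    ]
    mean = sum(r["score"] for r in region) / len(region)
    return region, mean

# Illustrative evaluation records (not real run output).
records = [
    {"min_samples_leaf": 2, "max_features": 0.5, "score": 0.78},
    {"min_samples_leaf": 4, "max_features": 0.45, "score": 0.77},
    {"min_samples_leaf": 15, "max_features": 0.5, "score": 0.71},
    {"min_samples_leaf": 3, "max_features": 0.9, "score": 0.70},
]
region, mean = best_region(records)
print(len(region), round(mean, 3))  # 2 0.775
```

If the mean inside the region is clearly above the mean outside it, that is the cue to narrow the configuration space for the next Run.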

Hyperparameter Analysis

Use Case 2:

Boston Dataset: The dataset is used to forecast the median price of a house in Boston based on various real-estate attributes

Type: Regression

The user selects the training file and then picks the target from the drop-down box. In addition, the user selects the problem type and sets 5 generations as the budget for the AutoML framework to establish a baseline.

Regression Run Status

Once training starts, AutoMLizer provides details of the training through the real-time Run Status and Pipeline Performance components. The best score (RMSE) is 3.2.

On closer examination of the algorithm histograms, it is clear that the random forest regressor and gradient boosting regressor are involved in the high-performing pipelines, which can be confirmed from the Top 10 Pipelines chart.

And similar to the last use case, further analysis can be done through the detailed profiler section.

The end to end process is shown below.

Future Work

  1. Support multiple AutoML frameworks: Generalize the underlying functions to handle multiple AutoML frameworks, including Auto-Sklearn, Auto-Keras, Auto-Gluon, Lucid, and others.
  2. Support multiple problem spaces: Extend the current problem space of regression and classification to time series and others.
  3. Performance: The loop over generations can be parallelized via a Dask cluster.
  4. Pause / Resume: Build the Pause / Resume functionality to allow the user to analyze mid-stream and determine if further processing is required or not.
  5. Integration: Make the functionality available via the Machine Works Platform
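The parallelization idea in item 3 can be sketched in miniature with the standard library: threads stand in for Dask workers, and a dummy `evaluate` function stands in for fitting a real pipeline. With Dask, `executor.map` would become `client.map` on a distributed cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(pipeline_id):
    """Placeholder for fitting and scoring one candidate pipeline."""
    return pipeline_id, 0.5 + (pipeline_id % 7) / 20.0  # dummy score

def evaluate_generation(pipeline_ids, workers=4):
    """Score all pipelines of one generation concurrently, keep the best."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        results = list(executor.map(evaluate, pipeline_ids))
    return max(results, key=lambda r: r[1])

best_id, best_score = evaluate_generation(range(20))
print(best_id, best_score)
```

Since each candidate pipeline in a generation is evaluated independently, the generation is embarrassingly parallel; only the selection step between generations is sequential.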

Credits: This work would not have been possible without the Open Source community (Dash by Plotly and TPOT). A special shout-out to the kind and enthusiastic folks at Epistasis Labs who quickly answered my questions on TPOT.
