Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Date post: 28-Jan-2018
Upload: benjamin-bengfort
Page 1: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Visualizing Model Selection with Scikit-Yellowbrick

An Introduction to Developing Visualizers

Page 2: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

What is Yellowbrick?

- Model Visualization

- Data Visualization for Machine Learning

- Visual Diagnostics

- Visual Steering

Not a replacement for visualization libraries.

Page 3: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Enhance the Model Selection Process

Page 4: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

The Model Selection Process

Page 5: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

The Model Selection Triple

Arun Kumar http://bit.ly/2abVNrI

Feature Analysis

Algorithm Selection

Hyperparameter Tuning

Page 6: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

The Model Selection Triple

- Define a bounded, high-dimensional feature space that can be effectively modeled.

- Transform and manipulate the space to make modeling easier.

- Extract a feature representation of each instance in the space.

Feature Analysis

Page 7: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Algorithm Selection

The Model Selection Triple

- Select a model family that best/correctly defines the relationship between the variables of interest.

- Define a model form that specifies exactly how features interact to make a prediction.

- Train a fitted model by optimizing internal parameters to the data.

Page 8: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Hyperparameter Tuning

The Model Selection Triple

- Evaluate how the model form is interacting with the feature space.

- Identify hyperparameters (i.e. parameters that affect training or the prior, not prediction).

- Tune the fitting and prediction process by modifying these parameters.

Page 9: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Automatic Model Selection Criteria

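from sklearn.model_selection import KFold

# sklearn.cross_validation was removed in scikit-learn 0.20;
# KFold now lives in sklearn.model_selection and yields indices via split().
kfolds = KFold(n_splits=12)

# Assumes model is an estimator and X, y are NumPy arrays.
scores = [
    model.fit(X[train], y[train]).score(X[test], y[test])
    for train, test in kfolds.split(X)
]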

F1

R2

Page 10: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Try Them All!

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import KFold, cross_val_score

classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]

kfold = KFold(n_splits=12)

# Best mean cross-validated score; note that mean() must be called.
max([
    cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])

Page 11: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Search Hyperparameter Space

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', SGDClassifier()),
])

# Keys follow the pattern <step name>__<parameter name>.
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'model__alpha': (0.00001, 0.000001),
    'model__penalty': ('l2', 'elasticnet'),
}

# Assumes X is a list of raw text documents and y their labels.
search = GridSearchCV(pipeline, parameters)
search.fit(X, y)

Page 12: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Automatic Model Selection: Search?

Search is difficult, particularly in high-dimensional space.

Even with techniques like genetic algorithms or particle swarm optimization, there is no guarantee of a solution.

As dimensions are added to the search space, the time required for an exhaustive search increases exponentially.
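To make that growth concrete, the sketch below counts the candidate configurations in the grid from the hyperparameter search slide using only the standard library. The `grid_size` helper and the extra `model__max_iter` values are illustrative assumptions, not part of the original talk.

```python
from math import prod

# Same grid as the hyperparameter search slide.
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'model__alpha': (0.00001, 0.000001),
    'model__penalty': ('l2', 'elasticnet'),
}


def grid_size(grid):
    """Number of distinct combinations in an exhaustive grid (hypothetical helper)."""
    return prod(len(values) for values in grid.values())


base = grid_size(parameters)  # 3 * 3 * 2 * 2 * 2 * 2

# Adding one more hyperparameter with three candidate values triples the work.
bigger = dict(parameters, model__max_iter=(500, 1000, 2000))
print(base, grid_size(bigger))
```

Each combination is also fitted once per cross-validation fold, so the number of model fits is the grid size multiplied by the number of folds.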

Page 13: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Visual Steering Improves Model Selection to Reach Better Models, Faster

Page 14: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Visual Steering

- Interventions or guidance by human pattern recognition.

- Humans engage the modeling process through visualization.

- Overview first, zoom and filter, details on demand.

Page 15: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

We will show that:

- Visual steering leads to improved models (better F1, R2 scores)

- Time-to-model is faster.

- Modeling is more interpretable.

- Formal user testing and possible research paper.

Proof: User Testing

Page 16: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Yellowbrick Extends the Scikit-Learn API

Page 17: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

The trick: combine functional/procedural matplotlib + object-oriented Scikit-Learn.

Yellowbrick

Page 18: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Estimators

The main API implemented by Scikit-Learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm, or a transformer that extracts/filters useful features from raw data.

class Estimator(object):

    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self

    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred

Page 19: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Transformers

Transformers are special cases of Estimators -- instead of making predictions, they transform the input dataset X to a new dataset X’.

Understanding X and y in Scikit-Learn is essential to being able to construct visualizers.

class Transformer(Estimator):

    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime

Page 20: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Visualizers

A visualizer is an estimator that produces visualizations based on data rather than new datasets or predictions.

Visualizers are intended to work in concert with Transformers and Estimators to allow human insight into the modeling process.

class Visualizer(Estimator):

    def draw(self):
        """
        Draw the data
        """
        self.ax.plot()

    def finalize(self):
        """
        Complete the figure
        """
        self.ax.set_title()

    def poof(self):
        """
        Show the figure
        """
        plt.show()

Page 21: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Pipelines

The purpose of the pipeline is to assemble several steps that can be cross-validated and operationalized together.

Sequentially applies a list of transforms and a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement fit() and transform() methods. The final estimator only needs to implement fit().

class Pipeline(Transformer):

    @property
    def named_steps(self):
        """
        Sequence of estimators
        """
        return self.steps

    @property
    def _final_estimator(self):
        """
        Terminating estimator
        """
        return self.steps[-1]
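As a usage sketch of the behaviour described on this slide, the example below builds a two-step scikit-learn pipeline on a small synthetic dataset. The dataset, the step names, and the choice of `StandardScaler` and `LogisticRegression` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset, used only for illustration.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

pipeline = Pipeline([
    ('scale', StandardScaler()),      # intermediate step: fit() and transform()
    ('model', LogisticRegression()),  # final estimator: only needs fit()
])

# fit() fits and applies each transformer in order, then fits the final estimator.
pipeline.fit(X, y)

# predict() pushes X through every transform before the final predict().
predictions = pipeline.predict(X)

# Steps remain accessible by name after fitting.
scaler = pipeline.named_steps['scale']
```

Because the whole pipeline is itself an estimator, it can be handed to cross-validation or grid search as a single object.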

Page 22: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Scikit-Learn Pipelines: fit() and predict()

Page 23: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Yellowbrick Visual Transformers

fit() draw()

predict()

fit() predict() score() draw()

Page 24: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Model Selection Pipelines

Page 25: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Primary YB Requirements

Page 26: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Requirements

1. Fits into the sklearn API and workflow

2. Implements matplotlib calls efficiently

3. Low overhead if poof() is not called

4. Just flexible enough for users to adapt to their data

5. Easy to add new visualizers

6. Looks as good as Seaborn

Page 27: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Primary Requirement: Implement Visual Steering

Page 28: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Dependencies

Like all libraries, we want to do our best to minimize the number of dependencies:

- Scikit-Learn

- Matplotlib

- Numpy

… c’est tout!

Page 29: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

The Visualizer

Page 30: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Current Package Hierarchy: make uml

Page 31: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Current Class Hierarchy: make uml

Page 32: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Current Class Hierarchy: make uml

Page 33: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Current Class Hierarchy: make uml

Page 34: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Visualizer Interface

Visualizers must hook into the Scikit-Learn API; data is received from the user via:

- fit(X, y=None, **kwargs)

- transform(X, **kwargs)

- predict(X, **kwargs)

- score(X, y, **kwargs)

These methods then call the internal draw() method.

Draw could be called multiple times for different reasons.

Users call for visualizations via the poof() method which will:

- finalize()

- savefig() or show()

Page 35: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Visualizer Interface

# Instantiate the visualizer

visualizer = ParallelCoordinates(classes=classes, features=features)

# Fit the data to the visualizer

visualizer.fit(X, y)

# Transform the data

visualizer.transform(X)

# Draw/show/poof the data

visualizer.poof()

Page 36: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Axes Management

Multiple visualizers may be simultaneously drawing.

Visualizers must only work on a local axes object that can be specified by the user, or created on demand.

E.g. no plt.method() calls, use the corresponding ax.set_method() call.
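The rule above can be sketched as follows: a drawing function accepts an optional axes object, creates one only when none is given, and uses `ax.set_*` methods instead of global `plt` calls. The function name `draw_counts` and the sample counts are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt


def draw_counts(counts, ax=None, title="Counts"):
    """Draw a bar chart on the given axes, creating one only if needed."""
    if ax is None:
        _, ax = plt.subplots()
    ax.bar(range(len(counts)), counts)
    ax.set_title(title)      # ax.set_title(), not plt.title()
    ax.set_xlabel("class")   # ax.set_xlabel(), not plt.xlabel()
    return ax


# Two drawings share one figure without interfering with each other.
fig, (left, right) = plt.subplots(ncols=2)
draw_counts([3, 5, 2], ax=left, title="train")
draw_counts([1, 4, 4], ax=right, title="test")
```

Since neither call touches matplotlib's global "current axes" state, the order in which the two drawings run does not matter.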

Page 37: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

A simple example

- Create a bar chart comparing the frequency of classes in the target vector.

- Where to hook into Scikit-Learn?

- What does draw() do?

- What does finalize() do?
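One possible answer to these questions is sketched below; it is not Yellowbrick's own implementation. The class name `ClassFrequencyVisualizer` is hypothetical, and it is written as a plain class rather than a Scikit-Learn subclass to keep the sketch dependency-light. It hooks in at fit(), because that is where the target vector y arrives.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np


class ClassFrequencyVisualizer:
    """Hypothetical visualizer: bar chart of class frequencies in y."""

    def __init__(self, ax=None):
        self.ax = ax

    def fit(self, X, y=None, **kwargs):
        # Hook: the target vector arrives through fit(), so count classes here.
        self.classes_, self.counts_ = np.unique(y, return_counts=True)
        self.draw()
        return self

    def draw(self):
        # Draw only the data; create the axes on demand.
        if self.ax is None:
            _, self.ax = plt.subplots()
        self.ax.bar(np.arange(len(self.classes_)), self.counts_)
        return self.ax

    def finalize(self):
        # Add the title, ticks, and labels once drawing is complete.
        self.ax.set_title("Class frequency")
        self.ax.set_xticks(np.arange(len(self.classes_)))
        self.ax.set_xticklabels([str(c) for c in self.classes_])
        self.ax.set_ylabel("count")

    def poof(self, outpath=None):
        # finalize(), then save to disk or show on screen.
        self.finalize()
        if outpath is not None:
            self.ax.figure.savefig(outpath)
        else:
            plt.show()


y = np.array(["a", "b", "a", "c", "a", "b"])
visualizer = ClassFrequencyVisualizer().fit(X=None, y=y)
visualizer.finalize()
```

Keeping draw() limited to the data and finalize() limited to decoration means draw() can be called repeatedly without duplicating titles or labels.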

Page 38: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Feature Visualizers

FeatureVisualizers describe the data space -- usually a high dimensional data visualization problem!

Come before, between, or after transformers.

Intersect at fit() or transform()?

fit() draw()

predict()

Page 39: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Some Feature Visualizer Examples

Page 40: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Score Visualizers

Score visualizers describe the behavior of the model in model space and are used to measure bias vs. variance.

Intersect at the score() method.

Currently we wrap estimators and pass through to the underlying estimator.

fit() predict() score() draw()
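The wrap-and-pass-through idea might look roughly like the sketch below. Both class names are hypothetical, and the tiny `MajorityClassifier` stands in for a real Scikit-Learn estimator so the example runs without it; the actual Yellowbrick wrapper may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt


class ScoreBarVisualizer:
    """Hypothetical wrapper: delegates to a model and draws its score."""

    def __init__(self, model, ax=None):
        self.estimator = model
        self.ax = ax

    def __getattr__(self, name):
        # Called only for attributes missing on the wrapper itself:
        # pass them through to the wrapped estimator.
        if name == "estimator":
            raise AttributeError(name)
        return getattr(self.estimator, name)

    def fit(self, X, y=None, **kwargs):
        self.estimator.fit(X, y, **kwargs)
        return self

    def score(self, X, y, **kwargs):
        # Intersect at score(): compute the score, then draw it.
        self.score_ = self.estimator.score(X, y, **kwargs)
        self.draw()
        return self.score_

    def draw(self):
        if self.ax is None:
            _, self.ax = plt.subplots()
        self.ax.bar([0], [self.score_])
        self.ax.set_ylim(0, 1)
        return self.ax


class MajorityClassifier:
    """Tiny stand-in estimator that always predicts the most common class."""

    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

    def score(self, X, y):
        hits = sum(p == t for p, t in zip(self.predict(X), y))
        return hits / len(y)


X = [[0], [1], [2], [3]]
y = [1, 1, 1, 0]
viz = ScoreBarVisualizer(MajorityClassifier()).fit(X, y)
accuracy = viz.score(X, y)     # computes the score and draws one bar
predictions = viz.predict(X)   # passed through to the wrapped estimator
```

Because predict() is never defined on the wrapper, the call falls through `__getattr__` to the underlying model, which is what lets the visualizer stand in for the estimator inside a pipeline.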

Page 41: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Score Visualizer Examples

Page 42: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Multi-Estimator Visualizers

Not implemented yet, but how do we enable visual model selection?

Need a method to fit multiple models into a single visualization.

Consider hyperparameter tuning examples.

Page 43: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Multi-Model visualizations

Page 44: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Visual Pipelines

Page 45: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Multiple Visualizations

How do we engage the pipeline process to add multiple visualizer components?

How do we organize visualization with steering?

How can we ensure that all visualizers are called appropriately?

Page 46: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Interactivity

How can we embed interactive visualizations in notebooks?

Can we allow the user to tune the model selection process in real time?

Do we pause the pipeline process to allow interaction for steering?

Page 47: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Features and Utilities

Page 48: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Optimizing Visualization

Can we use analytics methods to improve the performance of our visualization?

E.g. minimize overlap by rearranging features in parallel coordinates and radviz.

Select K-Best; Show Regularization, etc.

Page 49: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Style Management

We should look good doing it! Inspired by Seaborn we have implemented:

- set_palette()

- set_context()

Automatic color code updates: bgrmyck

As many palettes and sequences as we can fit!

Page 50: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Best Fit Lines

Support for automatically drawing best fit lines by fitting a:

- Linear polyfit

- Quadratic polyfit

- Exponential fit

- Logarithmic fit
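The four fit types can all be reduced to `numpy.polyfit` calls, as the sketch below shows on noise-free synthetic data. This is one common approach and an assumption here, not necessarily how Yellowbrick implements its best fit lines; the exponential and logarithmic fits work by fitting a straight line after a log transform.

```python
import numpy as np

x = np.linspace(1, 10, 20)

# Linear and quadratic fits: polyfit returns coefficients, highest degree first.
linear = np.polyfit(x, 2.0 * x + 1.0, deg=1)
quadratic = np.polyfit(x, 0.5 * x**2 - x + 3.0, deg=2)

# Exponential fit y = a * exp(b * x): fit a line to log(y).
b, log_a = np.polyfit(x, np.log(3.0 * np.exp(0.4 * x)), deg=1)
a = np.exp(log_a)

# Logarithmic fit y = c * log(x) + d: fit a line against log(x).
c, d = np.polyfit(np.log(x), 1.5 * np.log(x) + 2.0, deg=1)
```

With real, noisy data the log-transformed exponential fit weights small y values more heavily than a direct nonlinear fit would, which is worth keeping in mind when reading the resulting line.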

Page 51: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Type Detection

We’ve had to do a lot of manual work to polish visualizations:

- is_estimator()

- is_classifier()

- is_regressor()

- is_dataframe()

- is_categorical()

- is_sequential()

- is_numeric()
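To give a feel for what such helpers involve, here is a minimal duck-typing sketch of three of them. The function bodies are assumptions written for illustration and should not be read as Yellowbrick's actual implementations; the `Dummy` class exists only for the demonstration.

```python
def is_estimator(obj):
    """Duck-typed check: anything exposing a callable fit() counts."""
    return callable(getattr(obj, "fit", None))


def is_dataframe(obj):
    """Check by class name so pandas does not have to be imported."""
    return type(obj).__name__ == "DataFrame"


def is_numeric(values):
    """True if every value is an int or float (bool excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
    )


class Dummy:
    """Stand-in object with a fit() method, used only for the demo."""

    def fit(self, X, y=None):
        return self


print(is_estimator(Dummy()), is_estimator(object()))
print(is_numeric([1, 2.5]), is_numeric([1, "a"]))
```

Checks like these let a visualizer fail early with a clear message when it is handed the wrong kind of object, rather than producing a confusing plot.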

Page 52: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Exceptions

Page 53: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Documentation

Page 54: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

reStructuredText: cd docs && make html

Page 55: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Contributing

Page 56: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Git/Branch Management

All work happens in develop.

Select a card from “ready”, move to “in-progress”.

Create a branch called “feature-[feature name]”, work & commit into that branch:

$ git checkout -b feature-myfeature develop

Once you are done working (and have tested), merge into develop:

$ git checkout develop
$ git merge --no-ff feature-myfeature
$ git branch -d feature-myfeature
$ git push origin develop

Repeat.

Once a milestone is completed, it is pushed to master and released.

Page 57: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Milestones, Issues, and Labels

Each release (identified by semantic versioning; e.g. major and minor releases) is stored in a milestone.

Each milestone is a sprint.

Issues are added to the milestone, and the release is done when all issues are complete.

Issues are labeled for easy categorization.

Page 58: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Waffle Kanban

Page 59: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

Testing (Python 2.7 and 3.5+): make test

Page 60: Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Developing Visualizers

User Testing and Research

