Introduction to Machine Learning with Python and scikit-learn

Post on 27-Jan-2015

A PyATL talk that introduces machine learning and shows how to apply it with Python and scikit-learn. Includes simple examples with code and results.

Python Atlanta
Nov. 14th 2013

Matt Hagy
matt@liveramp.com

Slide #2 Intro to Machine Learning with Python matt@liveramp.com

Machine Learning (ML):

• Finding patterns in data

• Modeling patterns

• Use models to make predictions

ML can be easy*

• You already have ML applications!

• You can start applying ML methods now with Python & scikit-learn

• Theoretical knowledge of ML not needed (initially)*

*Gaining more background, theory, and experience will help

Simple Example

Simple Model

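import numpy as np
from sklearn.linear_model import LinearRegression

# np.load on an .npz file returns a dict-like archive, so index it by key
# (assumes the arrays were saved under the keys 'x' and 'y')
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

# scikit-learn expects 2D inputs of shape (n_samples, n_features)
model = LinearRegression()
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])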
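The data.npz file used on the slide is not included with the transcript. As a minimal sketch, assuming the file holds two arrays under the hypothetical keys 'x' and 'y', the following writes synthetic data to a temporary file and then runs the same fit. The slope, intercept, and noise level are made up for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the talk's data (assumption: a noisy straight line).
rng = np.random.default_rng(0)
x = np.linspace(0, 200, 100)
y = 3.0 * x + 10.0 + rng.normal(scale=5.0, size=x.shape)

# Save and reload under the hypothetical keys 'x' and 'y'.
path = os.path.join(tempfile.mkdtemp(), 'data.npz')
np.savez(path, x=x, y=y)
with np.load(path) as data:
    x_loaded, y_loaded = data['x'], data['y']

# scikit-learn expects a 2D input of shape (n_samples, n_features).
model = LinearRegression()
model.fit(x_loaded[:, np.newaxis], y_loaded)

# Predict on an evenly spaced grid, as on the slide.
x_test = np.linspace(0, 200)
y_test = model.predict(x_test[:, np.newaxis])
print(model.coef_[0], model.intercept_)
```

The fitted slope should land close to the value used to generate the data, which is a quick way to check that the loading and reshaping steps are wired up correctly.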

Variance/Bias Trade Off

• Need models that can adapt to relationships in our data

• Highly adaptable models can over-fit and will not generalize

• Regularization – Common strategy to address variance/bias trade off
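One way to see the effect of regularization is to fit the same highly adaptable model with and without a penalty. The sketch below is an illustration only: it uses ridge regression (an L2 penalty) on polynomial features, which is a technique swapped in for this example rather than one shown in the talk, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a sine curve sampled with noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=20))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
X = x[:, np.newaxis]

# Same flexible model (degree-9 polynomial), without and with an L2 penalty.
flexible = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1e-3))
flexible.fit(X, y)
regularized.fit(X, y)

# The penalty shrinks the coefficients, limiting how sharply the curve can bend.
flexible_norm = np.linalg.norm(flexible.steps[-1][1].coef_)
regularized_norm = np.linalg.norm(regularized.steps[-1][1].coef_)
print(flexible_norm, regularized_norm)
```

The penalty strength (alpha here) is the knob that trades variance for bias: a larger value gives a smoother, less adaptable fit.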

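import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumes the arrays were saved under the keys 'x' and 'y'
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

# Standardize the input, then fit an RBF-kernel support vector regressor.
# C sets the regularization strength (a larger C means weaker regularization).
model = Pipeline([
    ('standardize', StandardScaler()),
    ('svr', SVR(kernel='rbf', verbose=0, C=5e6, epsilon=20)),
])
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])

Callout on the slide: regularization term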
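To try the same pipeline without the original data file, here is a hedged sketch on synthetic data. The sine-shaped data and the values of C and epsilon are assumptions chosen for this example, not the settings or data from the talk. It compares the kernel model with a straight-line fit on a relationship that is clearly nonlinear.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic nonlinear data (assumption): roughly one noisy sine period over [0, 200].
rng = np.random.default_rng(0)
x = np.linspace(0, 200, 200)
y = 100.0 * np.sin(x / 30.0) + rng.normal(scale=10.0, size=x.shape)
X = x[:, np.newaxis]

svr_model = Pipeline([
    ('standardize', StandardScaler()),
    ('svr', SVR(kernel='rbf', C=1e3, epsilon=5.0)),
])
svr_model.fit(X, y)
line_model = LinearRegression().fit(X, y)

# score() returns R^2 on the given data; here it is measured on the training set.
svr_r2 = svr_model.score(X, y)
line_r2 = line_model.score(X, y)
print(svr_r2, line_r2)
```

A training-set score only shows that the model can adapt to the data. Judging whether it generalizes requires held-out data.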

Supervised Learning

Figure: a table of samples (rows) with an Input, X column and an Output, Y column.

Modeling the relationship between inputs and outputs

Multiple Inputs

Figure: a table of samples (rows) with input columns X1, X2, X3, ..., Xn and an Output, Y column.
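In scikit-learn this layout maps onto a 2D array for the inputs, with one row per sample and one column per input, plus a 1D array of outputs. A minimal sketch with made-up values, where the outputs are generated from a known rule so the fit can be checked:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is one sample; each column is one input (X1, X2, X3).
# The values are made up for illustration.
X = np.array([
    [0.0, 2.0, 1.0],
    [3.0, 3.0, 0.0],
    [1.0, 1.0, 3.0],
    [3.0, 6.0, 1.0],
    [4.0, 8.0, 2.0],
    [2.0, 9.0, 7.0],
])

# One output per sample, generated here from a known linear rule.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 3.0

model = LinearRegression().fit(X, y)
print(X.shape, y.shape)  # (6, 3) (6,)
print(model.coef_, model.intercept_)
```

Because the outputs contain no noise, the fitted coefficients recover the rule used to generate them.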

Example: Image Classification

• Classify handwritten digits with ML models

• Each input is an entire image

• Output is digit in the image

Figure: images of handwritten digits as Input, X, with the digits 9 and 2 as Output, Y.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Load the training images and labels, then the unlabeled test images
with np.load('train.npz') as data:
    pixels_train = data['pixels']
    labels_train = data['labels']
with np.load('test.npz') as data:
    pixels_test = data['pixels']

# Flatten each image into a 1D feature vector
X_train = pixels_train.reshape(pixels_train.shape[0], -1)
X_test = pixels_test.reshape(pixels_test.shape[0], -1)

model = RandomForestClassifier(n_estimators=50)
model.fit(X_train, labels_train)
labels_test = model.predict(X_test)

Trains on 50,000 images in roughly 20 seconds.

96% accurate!
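The train.npz and test.npz files are not part of the transcript. A runnable sketch of the same idea, assuming scikit-learn's small built-in digits dataset (8x8 images) as a stand-in, could look like the following. The dataset, the train/test split, and the random seeds are choices made for this example, so the accuracy will not match the figure quoted on the slide.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Built-in digits dataset (8x8 images), used here as a stand-in
# for the train.npz / test.npz files from the slide.
digits = load_digits()
pixels, labels = digits.images, digits.target

# Flatten each 2D image into a 1D feature vector.
X = pixels.reshape(pixels.shape[0], -1)

X_train, X_test, labels_train, labels_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, labels_train)

# Fraction of held-out images whose digit is predicted correctly.
accuracy = model.score(X_test, labels_test)
print(accuracy)
```

Holding out part of the labeled data makes it possible to measure accuracy, which the slide's unlabeled test set does not allow on its own.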

Kaggle Data Science Competition

• Given 6 million training questions labeled with tags

• Predict the tags for 2 million unlabeled test questions

www.users.globalnet.co.uk/~slocks/instructions.html
stackoverflow.com/questions/895371/bubble-sort-homework

Predicting the tags of Stack Overflow questions with machine learning

Text Classification Overview

Raw Posts -> [Feature Extraction & Selection] -> Vector Space -> [Model Selection & Training] -> Machine Learning Model
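The flow from raw posts to a trained model can be sketched as a scikit-learn pipeline. This is a simplified illustration, not the competition solution: the posts and tags are invented, each post gets a single tag instead of several, and the choice of CountVectorizer with LogisticRegression is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy posts and tags, invented for this sketch.
posts = [
    'need help with java homework',
    'java null pointer exception in loop',
    'python list comprehension syntax',
    'sort a python dictionary by value',
]
tags = ['java', 'java', 'python', 'python']

model = Pipeline([
    ('vectorize', CountVectorizer()),    # raw posts -> vector space
    ('classify', LogisticRegression()),  # vector space -> trained model
])
model.fit(posts, tags)

predicted = model.predict(['java homework question', 'python dictionary lookup'])
print(predicted)
```

Wrapping both stages in one Pipeline means new posts pass through the same feature extraction that was fitted on the training posts.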

Term Frequency Feature Extraction

Characterize text by the frequency of specific words in each text entry

Example Title:

“Why is processing a sorted array faster than processing an array that is not sorted?”

Term Frequencies

why  processing  sorted  array  faster
1    2           2       2      1

Ignore common words (i.e., stop words)
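The counts for the example title can be reproduced with scikit-learn's CountVectorizer. In this sketch the stop-word list is hand-picked for this one title, which is an assumption, since the talk does not say which list it used.

```python
from sklearn.feature_extraction.text import CountVectorizer

title = ("Why is processing a sorted array faster than "
         "processing an array that is not sorted?")

# Hand-picked stop words for this one title (an assumption).
stop_words = ['is', 'a', 'an', 'than', 'that', 'not']

vectorizer = CountVectorizer(stop_words=stop_words)
counts = vectorizer.fit_transform([title]).toarray()[0]

# vocabulary_ maps each kept term to its column index in the count vector.
term_frequencies = {term: int(counts[index])
                    for term, index in vectorizer.vocabulary_.items()}
print(term_frequencies)
```

CountVectorizer lowercases the text and strips punctuation by default, so "Why" and "sorted?" are counted as "why" and "sorted".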

Frequency of key terms is anticipated to be correlated with the tags of the question

         why  processing  sorted  array  faster  need  help  java  homework
Title 1  1    2           2       2      1       0     0     0     0
Title 2  0    0           0       0      0       1     1     1     1
Title 3  0    0           1       1      0       0     1     0     1
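To illustrate how term frequencies can relate to tags, the sketch below fits a linear classifier to the three rows of the table and inspects one weight per term. The tags are hypothetical, invented for this example because the slide does not give them, and LogisticRegression is an assumed model choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

terms = ['why', 'processing', 'sorted', 'array', 'faster',
         'need', 'help', 'java', 'homework']

# Term-frequency rows for Title 1, Title 2, and Title 3 from the table above.
X = np.array([
    [1, 2, 2, 2, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 0, 1],
])

# Hypothetical tags, invented for this sketch; the slide does not give them.
tags = np.array(['performance', 'homework', 'homework'])

model = LogisticRegression().fit(X, tags)

# For two classes, coef_ has one row; positive weights favor model.classes_[1].
weights = dict(zip(terms, model.coef_[0]))
print(model.classes_[1], weights)
```

Terms that appear only in the first title get positive weights toward its tag, while terms that appear only in the other titles get negative weights. Three samples are far too few for a real model; the point is only to show where the coefficients live.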

Example Model Coefficients

ML can be easy*

• You already have ML problems!

• You can start applying ML methods now with Python & scikit-learn

• Theoretical knowledge of ML not needed (initially)*

scikit-learn.org

github.com/scikit-learn

Check out: liveramp.com/careers

Helping companies use their marketing data to delight customers

Opportunities

• Backend Engineers

• Data Scientists

• Full-Stack Engineers

Tools

• Java

• Hadoop (Map/Reduce)

• Ruby

Build and work with large distributed systems that process massive data sets.
