Introduction to Machine Learning with Python and scikit-learn
Python Atlanta, Nov. 14th 2013
Matt [email protected]
Slide #2 Intro to Machine Learning with Python [email protected]
Machine Learning (ML):
• Finding patterns in data
• Modeling patterns
• Using models to make predictions
ML can be easy*
• You already have ML applications!
• You can start applying ML methods now with Python & scikit-learn
• Theoretical knowledge of ML not needed (initially)*
*Gaining more background, theory, and experience will help
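import numpy as np
from sklearn.linear_model import LinearRegression

# Load the training data; the array names 'x' and 'y' inside data.npz are assumed
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

# scikit-learn expects a 2-D (n_samples, n_features) input
model = LinearRegression()
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])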
Variance/Bias Trade Off
• Need models that can adapt to relationships in our data
• Highly adaptable models can over-fit the training data and then fail to generalize to new data
• Regularization – Common strategy to address variance/bias trade off
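The trade-off in the bullets above can be sketched with a small experiment. This is a minimal illustration, not the talk's own example: the noisy sine data, the degree-9 polynomial, and the choice of Ridge with alpha=1.0 are all assumptions made for demonstration. Both models get the same highly adaptable feature set, but the regularized one is penalized for large coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical data: 15 noisy samples of a sine curve
rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 15))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)


def fit_polynomial(regressor, degree=9):
    """Fit `regressor` on standardized polynomial features of x."""
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        regressor,
    )
    return model.fit(x[:, np.newaxis], y)


unregularized = fit_polynomial(LinearRegression())
regularized = fit_polynomial(Ridge(alpha=1.0))

# Compare the size of the fitted coefficient vectors; the L2 penalty
# in Ridge shrinks the coefficients relative to plain least squares
ols_norm = np.linalg.norm(unregularized.steps[-1][1].coef_)
ridge_norm = np.linalg.norm(regularized.steps[-1][1].coef_)
print("unregularized:", ols_norm, "regularized:", ridge_norm)
```

Very large coefficients usually go with a curve that chases the noise in the training points. Raising alpha pushes the model toward a smoother, more biased fit, and lowering it does the opposite.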
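import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the training data; the array names 'x' and 'y' inside data.npz are assumed
with np.load('data.npz') as data:
    x, y = data['x'], data['y']
x_test = np.linspace(0, 200)

# Standardize the input, then fit an RBF-kernel support vector regressor.
# C is the regularization term: smaller C means stronger regularization.
model = Pipeline([
    ('standardize', StandardScaler()),
    ('svr', SVR(kernel='rbf', verbose=0, C=5e6, epsilon=20)),
])
model.fit(x[:, np.newaxis], y)
y_test = model.predict(x_test[:, np.newaxis])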
Supervised Learning
Modeling relationship between inputs and outputs
[Figure: table with one row per sample, an Input (X) column and an Output (Y) column]
Multiple Inputs
[Figure: table with one row per sample, input columns X1, X2, X3, …, Xn and an output column Y]
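With multiple inputs, the data is laid out as one row per sample and one column per input. A minimal sketch of that layout, assuming made-up data with three inputs and an exactly linear output (the coefficients below are hypothetical and chosen only so the result is easy to check):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 20 samples (rows), 3 inputs X1, X2, X3 (columns)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(20, 3))

# Output Y: one value per sample, built from a known linear rule
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + 3.0

# The same fit/predict calls work for any number of input columns
model = LinearRegression().fit(X, y)

# Predict the output for one new sample with inputs (1, 2, 4)
X_new = np.array([[1.0, 2.0, 4.0]])
y_new = model.predict(X_new)
print(model.coef_, model.intercept_, y_new)
```

Because the output here follows the linear rule exactly, the fitted coefficients should match the ones used to generate it. Real data would include noise.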
Example: Image Classification
• Classify handwritten digits with ML models
• Each input is an entire image
• Output is the digit in the image
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Load the training images with labels, and the unlabeled test images
with np.load('train.npz') as data:
    pixels_train = data['pixels']
    labels_train = data['labels']
with np.load('test.npz') as data:
    pixels_test = data['pixels']

# Flatten each image into a 1-D feature vector
X_train = pixels_train.reshape(pixels_train.shape[0], -1)
X_test = pixels_test.reshape(pixels_test.shape[0], -1)

# Train a 50-tree random forest, then predict the test digits
model = RandomForestClassifier(n_estimators=50)
model.fit(X_train, labels_train)
labels_test = model.predict(X_test)
Trains on 50,000 images in roughly 20 seconds.
96% accurate!
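The same flatten, fit, and predict steps can be tried without the talk's data files. The sketch below swaps in scikit-learn's small bundled digits dataset (8x8 images, not the 50,000-image set used in the talk) and holds out a quarter of it to measure accuracy. The split size and random seeds are assumptions, so the score will differ from the figure quoted above.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Swapped-in dataset: 1,797 bundled 8x8 digit images, not the talk's data
digits = load_digits()

# Flatten each image into a 1-D feature vector, as in the slide code
X = digits.images.reshape(digits.images.shape[0], -1)

# Hold out 25% of the images to estimate accuracy on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.25, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Fraction of held-out digits classified correctly
accuracy = model.score(X_test, y_test)
print("held-out accuracy:", accuracy)
```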
Kaggle Data Science Competition
• Given 6 million training questions labeled with tags
• Predict the tags for 2 million unlabeled test questions
www.users.globalnet.co.uk/~slocks/instructions.html
stackoverflow.com/questions/895371/bubble-sort-homework
Predicting the tags of Stack Overflow questions with machine learning
Text Classification Overview
Raw Posts → Feature Extraction & Selection → Vector Space → Model Selection & Training → Machine Learning Model
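The stages in this overview map naturally onto a scikit-learn Pipeline. The sketch below is only an illustration under stated assumptions: the four posts and their tags are invented, each post has a single tag (real questions can carry several), and TfidfVectorizer with LogisticRegression is one possible choice, not necessarily the one used in the competition entry.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical raw posts, one tag each
posts = [
    "Need help with my Java homework on bubble sort",
    "Java NullPointerException when calling a method",
    "How do I sort a list of tuples in Python",
    "Python list comprehension with a condition",
]
tags = ["java", "java", "python", "python"]

model = Pipeline([
    # Raw posts -> vector space (feature extraction)
    ("features", TfidfVectorizer(stop_words="english")),
    # Vector space -> trained model
    ("classifier", LogisticRegression()),
])
model.fit(posts, tags)

# Predict the tag of an unseen post
predicted = model.predict(["Sorting a dictionary in Python"])
print(predicted)
```

Keeping both stages in one Pipeline means new posts go through exactly the same feature extraction as the training posts.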
Term Frequency Feature Extraction
Characterize text by the frequency of specific words in each text entry

Example Title:
“Why is processing a sorted array faster than processing an array that is not sorted?”

Term Frequencies:
why  processing  sorted  array  faster
1    2           2       2      1

Ignore common words (i.e. stop words)
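The counting step can be sketched in a few lines of plain Python. The stop-word set here is a hand-picked assumption that covers only the common words in this one title. A real system would use a fuller list.

```python
import re
from collections import Counter

title = ("Why is processing a sorted array faster than "
         "processing an array that is not sorted?")

# Hypothetical stop-word list, just large enough for this title
STOP_WORDS = {"is", "a", "an", "than", "that", "not"}


def term_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word for word in words if word not in stop_words)


tf = term_frequencies(title)
print(tf)
```

For this title the counts come out as why 1, processing 2, sorted 2, array 2, faster 1.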
The frequency of key terms is expected to correlate with the tags of the question
         why  processing  sorted  array  faster  need  help  java  homework
Title 1    1           2       2      2       1     0     0     0         0
Title 2    0           0       0      0       0     1     1     1         1
Title 3    0           0       1      1       0     0     1     0         1
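A term-frequency matrix like this one can be built with scikit-learn's CountVectorizer. In the sketch below the vocabulary is fixed to the nine terms shown. The wording of the second and third titles is invented to be consistent with their rows, since only the first title is given in the slides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The nine key terms, in column order
vocabulary = ["why", "processing", "sorted", "array", "faster",
              "need", "help", "java", "homework"]

titles = [
    "Why is processing a sorted array faster than "
    "processing an array that is not sorted?",
    "Need help with Java homework",        # hypothetical title 2
    "Help with sorted array homework",     # hypothetical title 3
]

# With a fixed vocabulary, words outside it (e.g. "is", "with") are ignored
vectorizer = CountVectorizer(vocabulary=vocabulary)

# One row per title, one column per term
matrix = vectorizer.fit_transform(titles).toarray()
print(matrix)
```

Each row is the feature vector for one title. Stacking the rows gives the X matrix that a classifier is trained on.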
ML can be easy*
• You already have ML problems!
• You can start applying ML methods now with Python & scikit-learn
• Theoretical knowledge of ML not needed (initially)*
scikit-learn.org
github.com/scikit-learn
Check out: liveramp.com/careers
Helping companies use their marketing data to delight customers
Opportunities
• Backend Engineers
• Data Scientists
• Full-Stack Engineers
Tools
• Java
• Hadoop (Map/Reduce)
• Ruby
Build and work with large distributed systems that process massive data sets.