Scikit Learn: Machine Learning in Python
Gianluca Corrado
Machine Learning
G. Corrado (disi) sklearn Machine Learning 1 / 22
Python Scientific Lecture Notes
Scikit Learn is based on Python
especially on NumPy, SciPy, and matplotlib
which are packages for scientific computing in Python
Basics on Python and on scientific computing
http://scipy-lectures.github.io/
Downloading and Installing
Requires:
Python (≥2.6 or ≥3.3)
NumPy (≥ 1.6.1)
SciPy (≥ 0.9)
http://scikit-learn.org/stable/install.html
Documentation and Reference
Documentation: http://scikit-learn.org/stable/documentation.html
Reference Manual with class descriptions: http://scikit-learn.org/stable/modules/classes.html
Outline
Today we are going to learn how to:
Load and generate datasets
Split a dataset for cross-validation
Use some learning algorithms
    - Naive Bayes
    - SVM
    - Random forest
Evaluate the performance of the algorithms
    - Accuracy
    - F1-score
    - AUC ROC
Datasets
The sklearn.datasets module includes utilities to load datasets
Load and fetch popular reference datasets (e.g. Iris)
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html
Artificial data generators (e.g. binary classification)
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
Now inspect the data structures
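As a starting point for inspecting the data structures, here is a minimal sketch that loads Iris and generates an artificial binary problem. The sample count, feature count, and random seed are arbitrary choices made for illustration; they do not come from the slides.

```python
from sklearn.datasets import load_iris, make_classification

# Load the Iris reference dataset: 150 samples, 4 features, 3 classes.
iris = load_iris()
print(iris.data.shape)    # feature matrix
print(iris.target.shape)  # one integer class label per sample
print(iris.target_names)  # names of the three species

# Generate an artificial binary classification problem.
# n_samples, n_features and random_state are illustrative assumptions.
X, y = make_classification(n_samples=200, n_features=10, n_classes=2,
                           random_state=0)
print(X.shape, y.shape)
```

Both calls give NumPy arrays: a two-dimensional feature matrix and a one-dimensional label vector.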
Cross-validation
k-fold cross-validation
Split the dataset D into k equal-sized disjoint subsets D_i
For i ∈ [1, k]
    - train the predictor on T_i = D \ D_i
    - compute the score of the predictor on the test set D_i
Return the average score across the folds
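The procedure above can be sketched in plain Python, without scikit-learn. The function names `k_fold_indices` and `cross_validate`, and the toy scoring function, are hypothetical and only serve this illustration; when the dataset size is not divisible by k, the first folds get one extra sample.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k disjoint folds of (nearly) equal size."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


def cross_validate(n_samples, k, train_and_score):
    """Average train_and_score(train_idx, test_idx) over the k folds."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # T_i = D \ D_i: every index that is not in the held-out fold
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)


# Toy scoring function (hypothetical): the fraction of data used for training.
mean_score = cross_validate(10, 5, lambda tr, te: len(tr) / (len(tr) + len(te)))
print(mean_score)
```

In a real experiment, `train_and_score` would fit a classifier on the training indices and return its score on the test indices.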
Cross-validation
The sklearn.cross_validation module includes utilities for cross-validation and performance evaluation (from scikit-learn 0.18 onwards these utilities live in sklearn.model_selection)
e.g. k-fold cross validation
http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html
Now inspect the data structures
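A minimal sketch of k-fold splitting follows. It assumes a recent scikit-learn release, where `KFold` is imported from `sklearn.model_selection`; the `sklearn.cross_validation.KFold` class linked above belongs to older releases and has a different interface. The toy arrays and the seed are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold  # assumption: scikit-learn >= 0.18

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each
y = np.array([0, 1] * 5)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # train_idx and test_idx are integer index arrays into X and y
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    print(fold, train_idx, test_idx)
```

Each iteration yields two index arrays, so the same split can be applied to both the feature matrix and the labels.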
Naive Bayes
Hint
Attribute values are assumed independent of each other
$$P(a_1, \dots, a_m \mid y_i) = \prod_{j=1}^{m} P(a_j \mid y_i)$$
Definition
$$y^* = \operatorname*{argmax}_{y_i} \prod_{j=1}^{m} P(a_j \mid y_i)\, P(y_i)$$
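To make the decision rule concrete, here is a small worked sketch with two classes and two binary attributes. The class names and all probability values are invented for illustration; the code just evaluates P(y_i) times the product of the per-attribute likelihoods for each class and returns the argmax.

```python
from math import prod  # available from Python 3.8

# Hypothetical toy model (all numbers are made up for this example).
priors = {"spam": 0.4, "ham": 0.6}   # P(y_i)
likelihoods = {                      # P(a_j = 1 | y_i) for j = 1, 2
    "spam": [0.8, 0.3],
    "ham": [0.1, 0.5],
}


def predict(attributes):
    """Return (argmax_y P(y) * prod_j P(a_j | y), all class scores)."""
    scores = {}
    for label, prior in priors.items():
        per_attribute = [p if a == 1 else 1.0 - p
                         for a, p in zip(attributes, likelihoods[label])]
        scores[label] = prior * prod(per_attribute)
    return max(scores, key=scores.get), scores


label, scores = predict([1, 0])
print(label, scores)
```

For the attribute vector [1, 0] the unnormalised scores are 0.4 · 0.8 · 0.7 for "spam" and 0.6 · 0.1 · 0.5 for "ham", so "spam" is selected.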
Naive Bayes
The sklearn.naive_bayes module implements naive Bayes algorithms
e.g. Gaussian naive Bayes
http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
Now inspect the data structures
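A minimal sketch of fitting Gaussian naive Bayes and inspecting the fitted object. Training and predicting on the same Iris data is done here only to show the attributes, not to estimate performance.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X, y = iris.data, iris.target

clf = GaussianNB()
clf.fit(X, y)

pred = clf.predict(X[:5])         # predicted class labels
proba = clf.predict_proba(X[:5])  # per-class posterior probabilities
print(pred)
print(proba.shape)
print(clf.class_prior_)           # estimated P(y_i) for each class
print(clf.theta_.shape)           # per-class mean of each feature
```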
SVM
Hint
SVM
The sklearn.svm module includes Support Vector Machine algorithms
e.g. C-Support Vector Classification
http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
Now inspect the data structures
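A minimal sketch of training an `SVC` with an RBF kernel on a generated dataset. The dataset size, the simple 150/50 split, the value C=1.0, and the seed are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

clf = SVC(C=1.0, kernel="rbf")
clf.fit(X_train, y_train)

print(clf.support_vectors_.shape)         # support vectors kept by the model
print(clf.predict(X_test[:5]))            # predicted labels
print(clf.decision_function(X_test[:5]))  # signed scores, usable for ROC curves
test_accuracy = clf.score(X_test, y_test) # mean accuracy on the held-out part
print(test_accuracy)
```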
Random Forest
Hint
Random Forest
The sklearn.ensemble module includes ensemble-based methods for classification and regression
e.g. Random Forest Classifier
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Now inspect the data structures
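A minimal sketch of training a random forest and looking at what the fitted object exposes. The dataset, the split, the number of trees, and the seed are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

clf = RandomForestClassifier(n_estimators=100, criterion="gini",
                             random_state=0)
clf.fit(X_train, y_train)

print(len(clf.estimators_))            # the individual decision trees
print(clf.feature_importances_)        # one importance value per feature
print(clf.predict_proba(X_test[:5]))   # class probabilities averaged over trees
test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```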
Performance evaluation
Recap
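$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Pre} = \frac{TP}{TP + FP} \qquad \mathrm{Rec} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2\,(\mathrm{Pre} \cdot \mathrm{Rec})}{\mathrm{Pre} + \mathrm{Rec}}$$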
AUC ROC
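The count-based formulas in this recap can be checked with a few lines of plain Python. The function name and the confusion-matrix counts are hypothetical, and the sketch assumes every denominator is non-zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes all denominators are non-zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    return accuracy, precision, recall, f1


# Made-up counts: 40 true positives, 45 true negatives,
# 5 false positives, 10 false negatives.
acc, pre, rec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(acc, pre, rec, f1)
```

With these counts, Acc = 85/100, Pre = 40/45, Rec = 40/50, and F1 = 80/95.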
Performance evaluation
The sklearn.metrics module includes score functions, performance metrics, and pairwise metrics and distance computations.
e.g. accuracy, F1-score, AUC ROC
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
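A minimal sketch of the three metrics on hand-written toy labels. Note that `auc` takes the points of a curve, so the ROC curve is first computed with `roc_curve` from real-valued scores rather than from hard predictions. The label and score values are made up for illustration.

```python
from sklearn.metrics import accuracy_score, auc, f1_score, roc_curve

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                   # ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                   # hard predictions
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]  # scores for class 1

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# ROC curve from the scores, then the area under it
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print(acc, f1, roc_auc)
```

For an SVC, the scores could come from `decision_function`; for GaussianNB or a random forest, from the positive-class column of `predict_proba`.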
Choosing parameters
Some algorithms have parameters
e.g. parameter C for SVM, number of trees for Random Forest
Performance can significantly vary according to the chosen parameters
It is important to choose wisely
train, VALIDATION, test
Choosing parameters
e.g. SVM
np.argmax requires adding import numpy as np
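The code shown on this slide did not survive extraction, so what follows is not the original example. It is a minimal sketch of the train / validation / test idea, assuming a generated dataset, fixed slice-based splits, and accuracy as the selection criterion; it uses `np.argmax` to pick the best C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# train / VALIDATION / test split (the sizes are arbitrary for this sketch)
X_train, y_train = X[:180], y[:180]
X_val, y_val = X[180:240], y[180:240]
X_test, y_test = X[240:], y[240:]

C_values = [1e-2, 1e-1, 1e0, 1e1, 1e2]
val_scores = []
for C in C_values:
    clf = SVC(C=C, kernel="rbf").fit(X_train, y_train)
    val_scores.append(accuracy_score(y_val, clf.predict(X_val)))

# Index of the highest validation score selects the parameter value
best_C = C_values[int(np.argmax(val_scores))]

# Refit on train + validation with the selected C, then score once on test
final = SVC(C=best_C, kernel="rbf").fit(X[:240], y[:240])
test_acc = accuracy_score(y_test, final.predict(X_test))
print(best_C, test_acc)
```

The test set is touched only once, after the parameter has been chosen, so the reported score is not biased by the selection step.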
Summary
sklearn allows you to:
load and generate datasets
split them to perform cross-validation
easily apply learning algorithms
evaluate the performance of such algorithms
Assignment
The second ML assignment is to compare the performance of three different classification algorithms, namely Naive Bayes, SVM, and Random Forest.
For this assignment you need to generate a random binary classification problem and train the three algorithms using 10-fold cross validation. For the algorithms that have parameters, an inner 5-fold cross validation is needed to choose them. Then show the classification performance (per-fold and averaged) in the report, briefly discussing the results.
Note
The report must also contain a short description of the methodology used to obtain the results.
Assignment
Steps
1 Create a classification dataset (n_samples ≥ 1000, n_features ≥ 10)
2 Split the dataset using 10-fold cross validation
3 Train the algorithms
    - GaussianNB
    - SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], and RBF kernel)
    - RandomForestClassifier (possible n_estimators values [10, 100, 1000], and Gini impurity)
4 Evaluate the cross-validated performance
    - accuracy
    - F1-score
    - AUC ROC
5 Write a short report summarizing the methodology and the results
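One possible way to organise the outer 10-fold loop with an inner 5-fold parameter search is sketched below for the SVC only. It assumes a recent scikit-learn release (`sklearn.model_selection`) and uses `GridSearchCV` for the inner search, which is a choice made for this sketch rather than something the slides prescribe; the seed is arbitrary, and only accuracy is computed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

outer = KFold(n_splits=10, shuffle=True, random_state=0)
param_grid = {"C": [1e-2, 1e-1, 1e0, 1e1, 1e2]}

fold_acc = []
for train_idx, test_idx in outer.split(X):
    # Inner 5-fold search picks C using the outer training part only
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5,
                          scoring="accuracy")
    search.fit(X[train_idx], y[train_idx])
    # After fitting, the search object predicts with the refitted best model
    pred = search.predict(X[test_idx])
    fold_acc.append(accuracy_score(y[test_idx], pred))

mean_acc = float(np.mean(fold_acc))
print(fold_acc)   # per-fold accuracy
print(mean_acc)   # averaged accuracy
```

The same loop structure applies to the other classifiers and metrics: GaussianNB needs no inner search, and the random forest would use a grid over `n_estimators`.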
Assignment
After completing the assignment submit it via email
Send an email to [email protected] (cc:[email protected])
Subject: sklearnSubmit
Attachment: id_name_surname.zip containing:
    - the Python code
    - the report (PDF format)
NOTE
No group work
This assignment is mandatory in order to enroll in the oral exam