
Machine Learning

Ludovic Samper

Antidot

September 1st, 2015

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 1 / 77

Antidot

Software vendor since 1999

Paris, Lyon, Aix-en-Provence

45 employees

Founders: Fabrice Lacroix (CEO), Stephane Loesel (CTO), Jerome Mainka (Chief Scientist Officer)

Software products and solutions

Antidot Finder Suite (AFS) search engine

Antidot Information Factory (AIF) a pipe & filters framework

SaaS, Hosted License, On-site License

50% of the revenue invested in R&D

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 2 / 77

Antidot

Machine Learning

Automatic text document classification

Named Entity Extraction

Compound splitter (for German words)

Clustering algorithm (for news aggregation)

Open Data, Semantic Web

http://www.rechercheisidore.fr/ Social Sciences and Humanities research platform, enriched with open resources

https://github.com/antidot/db2triples/ open source library to export a database as RDF

Antidot is a Partner organization in WDAqua project

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 3 / 77

Tutorial

Study a classical task in Machine Learning: text classification

Present scikit-learn (scikit-learn.org), a Python machine learning library

Follow the "Working with text data" tutorial: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Additional material on http://blog.antidot.net/

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 4 / 77

Summary of the tutorial

1 Problem definition
    Supervised classification
    Evaluation metrics

2 Extracting features from text files
    Bag of words model
    Term frequency inverse document frequency (tfidf)

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion
    Methodology

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 5 / 77

Outline

1 Problem definition
    Supervised classification
    Evaluation metrics

2 Extracting features from text files

3 Algorithms for classification

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 6 / 77

20 newsgroups dataset

http://qwone.com/~jason/20Newsgroups/

20 newsgroups

Documents from 20 newsgroups, collected in the 1990s

The label is the newsgroup the document belongs to

A popular collection

18846 documents: 11314 in train, 7532 in test

wiss-ml.ipynb#The-20-newsgroups-dataset
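For readers without the notebook, a minimal sketch (not part of the original slides) of loading this dataset with scikit-learn's built-in helper; the variable names train and test are just local choices reused in later sketches:

from sklearn.datasets import fetch_20newsgroups

# Download (or load from the local cache) the two official splits
train = fetch_20newsgroups(subset='train')   # 11314 documents
test = fetch_20newsgroups(subset='test')     # 7532 documents

print(len(train.data), len(test.data))   # raw texts
print(train.target_names)                # the 20 newsgroup labels
print(train.target[:5])                  # integer label of the first documents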

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 7 / 77

Classification

Problem statement

One label per document

Given a set of documents and their labels, automatically determine the label of an unseen document

A supervised classification problem

Training

Set of documents and their labels

Build a model

Inference

Given a new document, use the model to predict its label

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 8 / 77

Precision and Recall I

Binary classification

                  e ∈ C                  e ∉ C
Labeled C         TP (True Positive)     FP (False Positive)
Not labeled C     FN (False Negative)    TN (True Negative)

Precision

P = TP / (TP + FP)

Proba(e ∈ C | e labeled C)

Recall

R = TP / (TP + FN)

Proba(e labeled C | e ∈ C)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 9 / 77

Precision and Recall II

F1

F1 = 2 P R / (P + R)

Harmonic mean of Precision and Recall

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 10 / 77

Multiclass I

N_C = number of classes

Macro average

B_macro = (1 / N_C) ∑_{k=1}^{N_C} B_binary(TP_k, FP_k, TN_k, FN_k)

Average of the measure over classes: large classes count as much as small ones.

Micro average

B_micro = B_binary(∑_{k=1}^{N_C} TP_k, ∑_{k=1}^{N_C} FP_k, ∑_{k=1}^{N_C} TN_k, ∑_{k=1}^{N_C} FN_k)

Average of the measure over instances.

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 11 / 77

Multiclass II

Micro average in single-label multiclass

Each document gets exactly one predicted class, so

∑_{k=1}^{N_C} FN_k = ∑_{k=1}^{N_C} FP_k   and   ∑_{k=1}^{N_C} TP_k + ∑_{k=1}^{N_C} FP_k = Nb_doc

Then,

Precision_micro = Recall_micro = Accuracy = (∑_{k=1}^{N_C} TP_k) / Nb_doc
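A small sketch (toy labels, purely illustrative) of how these averages behave with scikit-learn's metrics; it also shows that micro precision, micro recall and accuracy coincide in the single-label multiclass case:

from sklearn.metrics import precision_score, recall_score, accuracy_score

y_true = [0, 0, 1, 1, 2, 2]   # true classes of 6 toy documents
y_pred = [0, 1, 1, 1, 2, 0]   # predicted classes

print(precision_score(y_true, y_pred, average='macro'))  # mean of per-class precisions
print(precision_score(y_true, y_pred, average='micro'))  # precision on pooled counts
print(recall_score(y_true, y_pred, average='micro'))     # same value as micro precision
print(accuracy_score(y_true, y_pred))                    # same value again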

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 12 / 77

Outline

1 Problem definition

2 Extracting features from text files
    Bag of words model
    Term frequency inverse document frequency (tfidf)

3 Algorithms for classification

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 13 / 77

Bag of words

From text to features

Count the number of occurrences of words in text

“bag” because position isn’t taken into account

Extensions

Remove stop words

Remove too frequent words (max_df)

lowercase

N-grams (ngram_range): tokenize n-grams instead of single words. Useful to take local word order into account

wiss-ml.ipynb#Bag-of-words
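A sketch of the bag-of-words step with scikit-learn's CountVectorizer, assuming the train object from the loading sketch above; the option values are illustrative, not tuned:

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(lowercase=True,        # lowercase the text (default)
                             stop_words='english',  # remove English stop words
                             max_df=0.95,           # drop words present in >95% of docs
                             ngram_range=(1, 2))    # unigrams and bigrams
X_train_counts = count_vect.fit_transform(train.data)
print(X_train_counts.shape)   # (nb documents, nb features), a sparse matrix of counts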

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 14 / 77

Term frequency inverse document frequency (tfidf) I

Intuition

Take into account the relative importance of each word regarding the whole dataset.
If a word occurs in every document, it doesn't hold any information.

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 15 / 77

Term frequency inverse document frequency (tfidf) II

Definition

Term frequency × inverse document frequency

tfidf(w, d) = tf(w, d) × idf(w)

tf(w, d) = term frequency of word w in document d

idf(w) = log(N_doc / doc_freq(w))

In scikit-learn:

tfidf(w, d) = tf(w, d) × (idf(w) + 1)

so that terms occurring in all documents (idf = 0) are not entirely ignored

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 16 / 77

Term frequency inverse document frequency (tfidf) III

Options

Normalisation: ||doc|| = 1. E.g., for the L2 norm, ∑_{w∈d} tfidf(w, d)² = 1

Smoothing: add one to the document frequencies, as if an extra document contained every term of the collection exactly once

idf(w) = log((N_doc + 1) / (doc_freq(w) + 1))

Example

Show the most significant words of a document: wiss-ml.ipynb#Tfidf
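A sketch of the tf-idf step on top of the counts above (X_train_counts is assumed from the bag-of-words sketch); scikit-learn applies the "idf + 1" and smoothing variants described on these slides by default:

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm='l2',        # each document normalised to unit L2 norm
                         smooth_idf=True)  # the "+1" on document frequencies
X_train_tfidf = tfidf.fit_transform(X_train_counts)
print(X_train_tfidf.shape)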

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 17 / 77

Outline

1 Problem definition

2 Extracting features from text files

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 18 / 77

Supervised classification problem I

Notations

x = (x_1, ..., x_n) = (x_i)_{1≤i≤n} feature vector

{(x_d, y_d)}_{0≤d<D} the training set

∀d, x_d ∈ R^n, where n is the dimension of the feature space

∀d, y_d ∈ {1, ..., N_C}, where N_C is the number of classes and y_d the class of document d

ŷ class prediction: for a new vector x, ŷ is the predicted class of x

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 19 / 77

Supervised classification problem II

Goal

Find a function F:

F : R^n → {1, ..., N_C}
      x ↦ ŷ

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 20 / 77

In 20newsgroups I

Values in 20 newsgroups

n = 130107 nb features (number of unique terms)

D = 11314 training samples

NC = 20 different classes

Goal

Find a function F that, given a new document, predicts its class

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 21 / 77

Naïve Bayes Algorithm I

Bayes' theorem

P(A|B) = P(B|A) P(A) / P(B)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 22 / 77

Naïve Bayes Algorithm II

Posterior probability of class C

P(C|x) = P(x|C) P(C) / P(x)

P(x) does not depend on C, so

P(C|x) ∝ P(x|C) P(C)

Naïve Bayes independence assumption: given the class, each feature i is conditionally independent of every other feature j

P(C|x) ∝ P(C) × ∏_{i=1}^{n} P(x_i|C)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 23 / 77

Naïve Bayes Algorithm III

Classifier from the probability model

ŷ = argmax_{k ∈ {1,...,N_C}} P(y = k) × ∏_{i=1}^{n} P(x_i | y = k)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 24 / 77

Parameter estimation in the Naïve Bayes classifier

Prior of a class

P(y = k) = (nb of samples in class k) / (total nb of samples)

Can also be uniform: P(y = k) = 1 / N_C

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 25 / 77

Multinomial Naïve Bayes I

Naïve Bayes

P(x | y = k) = ∏_{i=1}^{n} P(x_i | y = k)

Multinomial distribution

The event "the word is i" follows a multinomial distribution with parameters (p_1, ..., p_n), where p_i = P(word = i)

P(x_1, ..., x_n) ∝ ∏_{i=1}^{n} p_i^{x_i}

with ∑_i p_i = 1, and one such distribution for each class y.

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 26 / 77

Multinomial Naïve Bayes II

Multinomial Naïve Bayes

One multinomial distribution for each class

P(i | y = k) = (total occurrences of word i in class k) / (total nb of words in class k)

             = (∑_{d∈k} x_{d,i}) / (∑_{j=1}^{n} ∑_{d∈k} x_{d,j})

where x_{d,i} is the count of word i in document d. With smoothing,

P(i | y = k) = (∑_{d∈k} x_{d,i} + α) / (∑_{j=1}^{n} ∑_{d∈k} x_{d,j} + α n)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 27 / 77

Multinomial Naïve Bayes III

Inference in Multinomial Naïve Bayes

ŷ = argmax_k P(y = k | x)

  = argmax_k P(y = k) ∏_{i=1}^{n} P(i | y = k)^{x_i}

  = argmax_k ( log P(y = k) + ∑_{i=1}^{n} x_i log P(i | y = k) )

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 28 / 77

Multinomial Naïve Bayes IV

A linear model

In the log space,

(log P(y = k | x))_k ∝ W_0 + W^T x

W_0 is the vector of priors:

W_0 = (log P(y = k))_k

W is the matrix of log distributions:

W = (w_{ik}), i ∈ [1, n], k ∈ [1, N_C], with w_{ik} = log P(i | y = k)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 29 / 77

Multinomial Naïve Bayes V

Example step-by-step

http://www.antidot.net/wiss2015/wiss-ml.html#Naive-Bayes
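A sketch of the multinomial model in scikit-learn, assuming X_train_tfidf, count_vect and tfidf from the previous sketches (the two example sentences come from the scikit-learn tutorial):

from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB(alpha=1.0)            # alpha is the smoothing parameter α above
clf.fit(X_train_tfidf, train.target)

docs_new = ["God is love", "OpenGL on the GPU is fast"]
X_new = tfidf.transform(count_vect.transform(docs_new))
for doc, k in zip(docs_new, clf.predict(X_new)):
    print(doc, "=>", train.target_names[k])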

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 30 / 77

Outline

1 Problem definition

2 Extracting features from text files

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 31 / 77

A linear classifier

(Figure slides 32-36: illustrations only, no text content.)

Support Vector Machine, notations

Problem

S, the training set: {(x_i, y_i), x_i ∈ R^n, y_i ∈ {−1, 1}}_{0≤i<D}

Find a linear function 〈w, x_i〉 + b such that:

sign(〈w, x_i〉 + b) = y_i

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 37 / 77

SVM, maximum margin classifier

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 38 / 77

Margin

With x⁺ and x⁻ points on the two margin hyperplanes (〈w, x⁺〉 + b = 1 and 〈w, x⁻〉 + b = −1),

distance(x⁺, x⁻) = 〈 w / ||w|| , x⁺ − x⁻ 〉

                 = (1 / ||w||) (〈w, x⁺〉 − 〈w, x⁻〉)

                 = (1 / ||w||) ((〈w, x⁺〉 + b) − (〈w, x⁻〉 + b))

                 = (1 / ||w||) (1 − (−1))

                 = 2 / ||w||

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 39 / 77

SVM, maximum margin classifier

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 40 / 77

Solving an optimization problem using the Lagrangian

Primal problem

minimize_{w,b} f(w, b)

under the constraints h_i(w, b) ≥ 0

Lagrange function

L(w, b, α) = f(w, b) − ∑_i α_i h_i(w, b)

Let g(α) = inf_{(w,b)} L(w, b, α).
Then ∀w, b : g(α) ≤ L(w, b, α).
Moreover, for α_i ≥ 0 and (w, b) feasible, L(w, b, α) ≤ f(w, b).
Thus, ∀α_i ≥ 0, g(α) ≤ min_{w,b} f(w, b).
And with the Karush-Kuhn-Tucker (KKT) optimality conditions,

max_α g(α) = min_{w,b} f(w, b)  ⇔  α_i h_i(w, b) = 0

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 41 / 77

Support Vector Machine, problem

Primal problem

minimize_{w,b} ||w||² / 2

under the constraints ∀i, y_i(〈w, x_i〉 + b) ≥ 1

Lagrange function

L(w, b, α) = (1/2) ||w||² − ∑_i α_i (y_i(〈w, x_i〉 + b) − 1)

Dual problem: maximize_α inf_{w,b} L(w, b, α), with α_i ≥ 0

The optimum in (w, b) is a saddle point with respect to α

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 42 / 77

Support Vector Machine, problem

The derivatives in w and b must vanish

∂_w L(w, b, α) = w − ∑_i α_i y_i x_i = 0

∂_b L(w, b, α) = ∑_i α_i y_i = 0

Dual problem

maximize_α  −(1/2) ∑_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 + ∑_i α_i

under the constraints: ∑_i α_i y_i = 0 and α_i ≥ 0

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 43 / 77

Support Vectors

Support vectors

w = ∑_i y_i α_i x_i

Karush-Kuhn-Tucker (KKT) optimality condition

Lagrange multiplier times constraint equals zero:

α_i (y_i(〈w, x_i〉 + b) − 1) = 0

Thus, either α_i = 0, or α_i > 0 ⇒ y_i(〈w, x_i〉 + b) = 1 (x_i is a support vector)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 44 / 77

Experiments with separable space

SVMvaryingC.ipynb

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 45 / 77

What happens if space is not separable

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 46 / 77

Adding slack variables

The problem was

minimize_{w,b} ||w||² / 2

with y_i(w·x_i + b) ≥ 1

With slack variables ξ_i

minimize_{w,b} ||w||² / 2 + C ∑_i ξ_i

with y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 47 / 77

Support Vector Machine, without slack

Primal problem

minimize_{w,b} ||w||² / 2

with y_i(w·x_i + b) ≥ 1

Lagrange function

L(w, b, α) = (1/2) ||w||² − ∑_i α_i (y_i(〈w, x_i〉 + b) − 1)

Dual problem: maximize_α inf_{w,b} L(w, b, α)

The optimum in (w, b) is a saddle point with respect to α

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 48 / 77

Support Vector Machine, with slack

Primal problem

minimize_{w,b} ||w||² / 2 + C ∑_i ξ_i

with y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Lagrange function

L(w, b, ξ, α, η) = (1/2) ||w||² + C ∑_i ξ_i − ∑_i α_i (y_i(〈x_i, w〉 + b) + ξ_i − 1) − ∑_i η_i ξ_i

Dual problem: maximize_{α,η} inf_{w,b,ξ} L(w, b, ξ, α, η)

The optimum in (w, b, ξ) is a saddle point with respect to (α, η)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 49 / 77

Support Vector Machine, problem

The derivatives in w, b and ξ must vanish

∂_w L(w, b, ξ, α, η) = w − ∑_i α_i y_i x_i = 0

∂_b L(w, b, ξ, α, η) = ∑_i α_i y_i = 0

∂_{ξ_i} L(w, b, ξ, α, η) = C − α_i − η_i = 0  ⇒  η_i = C − α_i

Dual problem

maximize_α  −(1/2) ∑_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 + ∑_i α_i

under the constraints: ∑_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 50 / 77

Support Vectors

Support vectors

w = ∑_i y_i α_i x_i

Karush-Kuhn-Tucker (KKT) optimality conditions

Lagrange multiplier times constraint equals zero:

α_i (y_i(〈w, x_i〉 + b) + ξ_i − 1) = 0

η_i ξ_i = 0  ⇔  (C − α_i) ξ_i = 0

Thus:

α_i = 0      ⇒ y_i(〈w, x_i〉 + b) ≥ 1
0 < α_i < C  ⇒ y_i(〈w, x_i〉 + b) = 1
α_i = C      ⇒ y_i(〈w, x_i〉 + b) ≤ 1

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 51 / 77

Support Vector Machine, Loss functions

Primal problem

minimize_{w,b} ||w||² / 2 + C ∑_i ξ_i

with y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0

With a loss function

minimize_{w,b} ||w||² / 2 + C ∑_i max(0, 1 − y_i(w·x_i + b))

here, loss(x_i, y_i) = max(0, 1 − y_i(w·x_i + b)) = max(0, 1 − y_i f(x_i))

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 52 / 77

Support Vector Machine, Common loss functions

Common loss functions

hinge loss (L1 loss): max(0, 1 − y_i(w·x_i + b))

squared hinge (L2 loss): max(0, 1 − y_i(w·x_i + b))²

logistic loss: log(1 + exp(−y_i(w·x_i + b)))

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 53 / 77


Experiments with different values for C

SVMvaryingC.ipynb#Varying-C-parameter
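For readers without the notebook, a toy sketch (synthetic 2-D data, illustrative values) of how C trades margin width against training errors; a small C keeps many support vectors on a wide margin, a large C penalises violations heavily:

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.r_[rng.randn(20, 2) - [2, 2], rng.randn(20, 2) + [2, 2]]  # two Gaussian blobs
y = [0] * 20 + [1] * 20

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_)   # support vectors per class, typically fewer as C grows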

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 55 / 77

Non linearly separable data

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 56 / 77

Non linearly separable data, Φ(x) = (x, x²)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 57 / 77

Non linearly separable data, Φ(x) = (x, x²)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 58 / 77

Linear case

Primal problem

minimize_{w,b} (1/2) ||w||² + C ∑_i ξ_i

subject to y_i(〈w, x_i〉 + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Dual problem

maximize_α  −(1/2) ∑_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 + ∑_i α_i

subject to ∑_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Support vector expansion

f(x) = ∑_i α_i y_i 〈x_i, x〉 + b

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 59 / 77

With a transformation Φ : x ↦ Φ(x)

Primal problem

minimize_{w,b} (1/2) ||w||² + C ∑_i ξ_i

subject to y_i(〈w, Φ(x_i)〉 + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Dual problem

maximize_α  −(1/2) ∑_{i,j} α_i α_j y_i y_j 〈Φ(x_i), Φ(x_j)〉 + ∑_i α_i

subject to ∑_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Support vector expansion

f(x) = ∑_i α_i y_i 〈Φ(x_i), Φ(x)〉 + b

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 60 / 77

The kernel trick

Kernel function

k(x, x′) = 〈Φ(x), Φ(x′)〉

We only need to compute the dot product in the new space

Dual problem

maximize_α  −(1/2) ∑_{i,j} α_i α_j y_i y_j k(x_i, x_j) + ∑_i α_i

subject to ∑_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Support vector expansion

f(x) = ∑_i α_i y_i k(x_i, x) + b

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 61 / 77

Kernels

Kernel functions

linear: k(x, x′) = 〈x, x′〉

polynomial: k(x, x′) = (γ〈x, x′〉 + r)^d

rbf: k(x, x′) = exp(−γ ||x − x′||²)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 62 / 77

The RBF kernel corresponds to an infinite-dimensional feature space

Here we are in dimension 1, x ∈ R, with γ = 1:

k(x, x′) = exp(−(x − x′)²)

         = exp(−x²) exp(−x′²) exp(2xx′)

With a Taylor expansion of exp(2xx′),

k(x, x′) = exp(−x²) exp(−x′²) ∑_{k=0}^{∞} (2^k x^k x′^k) / k!

         = 〈 (..., (2^{k/2} / √(k!)) exp(−x²) x^k, ...), (..., (2^{k/2} / √(k!)) exp(−x′²) x′^k, ...) 〉

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 63 / 77

Experiments with different kernels

www.antidot.net/wiss2015/SVMvaryingC.html#Non-linear-kernels
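For readers without the notebook, a sketch comparing these kernels on the same toy data (X, y) as the varying-C sketch above; the hyper-parameter values are illustrative only:

from sklearn.svm import SVC

for kernel, params in [('linear', {}),
                       ('poly', {'degree': 3, 'gamma': 1.0, 'coef0': 1.0}),
                       ('rbf', {'gamma': 0.5})]:
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X, y)
    print(kernel, clf.score(X, y))   # training accuracy, for illustration only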

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 64 / 77

SVM in multiclass

one-vs-the-rest

N_C binary classifiers (but each trained on the whole dataset)

At prediction time, choose the class with the maximum decision value

one-vs-one

N_C (N_C − 1) / 2 binary classifiers

At prediction time, vote

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 65 / 77

SVM in scikit-learn

SVC : Support Vector Classification

sklearn.svm.LinearSVC

based on Liblinear library

strategy : one-vs-the rest

only linear kernel

loss can be : ‘hinge’ or ‘squared hinge’

sklearn.svm.SVC

based on libSVM

multiclass strategy : one-vs-one

kernel can be : linear, polynomial, RBF, sigmoid, precomputed

only hinge loss
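A sketch of both classes on the 20 newsgroups features, assuming X_train_tfidf and train from the earlier sketches; LinearSVC scales much better to this feature dimension than SVC with a non-linear kernel:

from sklearn.svm import LinearSVC, SVC

lin_clf = LinearSVC(C=1.0)                 # liblinear, one-vs-the-rest, linear kernel
lin_clf.fit(X_train_tfidf, train.target)
print(lin_clf.score(X_train_tfidf, train.target))

rbf_clf = SVC(C=1.0, kernel='rbf', gamma=0.1)   # libsvm, one-vs-one
# rbf_clf.fit(X_train_tfidf, train.target)      # works, but is slow on 130107 features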

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 66 / 77

Outline

1 Problem definition

2 Extracting features from text files

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 67 / 77

Cross validation I

http://scikit-learn.org/stable/modules/cross_validation.html

Overfitting

Tuning the parameters on the test set can lead to overfitting: the chosen parameters are the best for this particular test set, but not in the general case.

Train, test and validation datasets

A solution:

tune the parameters on the test set

validate on a separate validation dataset

drawback: fewer data remain for training

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 68 / 77

Cross validation II

Cross validation

k-fold cross validation

Split the training data into k partitions of the same size

Train the model on k − 1 partitions

Then evaluate on the k-th partition

Repeat so that each partition is used once for evaluation, and average the scores
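A sketch of k-fold cross validation with scikit-learn (5 folds), assuming X_train_tfidf and train from the earlier sketches; recent versions expose this in sklearn.model_selection (older ones in sklearn.cross_validation):

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

scores = cross_val_score(MultinomialNB(), X_train_tfidf, train.target,
                         cv=5, scoring='accuracy')   # one score per fold
print(scores.mean(), scores.std())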

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 69 / 77

Cross validation III

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 70 / 77

Grid Search

http://scikit-learn.org/stable/modules/grid_search.html

Grid search

Try every value of every parameter (all combinations)

A brute-force way to find the best value for each parameter

In scikit-learn

Automatically runs k × (number of parameter combinations) trainings

Keeps the best model

Demo with scikit-learn: http://www.antidot.net/wiss2015/grid_search_20newsgroups.html
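A sketch of such a grid search over both feature-extraction and classifier parameters, using a Pipeline so that every combination is refitted from the raw texts; the parameter values are illustrative, not the ones used in the demo:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB())])
param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'clf__alpha': [0.01, 0.1, 1.0]}

grid = GridSearchCV(pipeline, param_grid, cv=5)   # 5-fold CV for each of the 6 combinations
grid.fit(train.data, train.target)
print(grid.best_params_, grid.best_score_)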

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 71 / 77

Outline

1 Problem definition

2 Extracting features from text files

3 Algorithms for classification

4 Conclusion
    Methodology

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 72 / 77

1 Problem definition
    Supervised classification
    Evaluation metrics

2 Extracting features from text files
    Bag of words model
    Term frequency inverse document frequency (tfidf)

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion
    Methodology

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 73 / 77

Methodology

To solve a problem using Machine Learning, you have to:

1 Understand the data

2 Choose an evaluation measure

3 Be able to test the model

4 Find the main features

5 Try the algorithms, with different parameters
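Putting these steps together, a minimal end-to-end sketch on 20 newsgroups (tf-idf features, a linear SVM, evaluation on the held-out test split); parameters are left at simple defaults rather than tuned:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

vect = TfidfVectorizer(stop_words='english')
X_train = vect.fit_transform(train.data)   # fit the vocabulary on the train split only
X_test = vect.transform(test.data)         # reuse it on the test split

clf = LinearSVC(C=1.0).fit(X_train, train.target)
print(classification_report(test.target, clf.predict(X_test),
                            target_names=test.target_names))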

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 73 / 77

Conclusion

Machine Learning has a lot of applications

With libraries like scikit-learn, there is no need to implement the algorithms yourself

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 74 / 77

Questions ?

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 75 / 77

References

Machine Learning in Python :

http://scikit-learn.org

Alex Smola's very good lecture on Machine Learning at CMU:

http://alex.smola.org/teaching/10-701-15/

Kernels : https://www.youtube.com/watch?v=0Nis-oMLbDs

SVM : https://www.youtube.com/watch?v=bsbpqNIKQzU

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 76 / 77

Bernoulli Naïve Bayes

Features

x_i = 1 iff word i is present in the document, else x_i = 0.
The number of occurrences of word i doesn't matter.

Bernoulli model

For each feature i,

P(x_i | y = k) = P(i | y = k) x_i + (1 − P(i | y = k))(1 − x_i)

The absence of a feature is explicitly taken into account.

Estimation of P(i | y = k)

P(i | y = k) = (1 + nb of documents in class k that contain word i) / (2 + nb of documents in class k)
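A sketch of this variant in scikit-learn, assuming X_train_counts and train from the earlier sketches; binarize=0.0 turns the word counts into presence/absence features as described above:

from sklearn.naive_bayes import BernoulliNB

bnb = BernoulliNB(alpha=1.0, binarize=0.0)   # alpha: smoothing; binarize: threshold on counts
bnb.fit(X_train_counts, train.target)
print(bnb.score(X_train_counts, train.target))   # training accuracy, for illustration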

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 77 / 77

