Introduction to ML

Abhijit Mishra

Research Scholar

Center for Indian Language Technology

Department of Computer Science and Engineering

Indian Institute of Technology Bombay

Email: [email protected]

http://www.cse.iitb.ac.in/~abhijitmishra

Task: Get mangoes of a particular type from the market

Randomness?? Ambiguity?? Nuances??

Task 1: Solve an equation

Task 2: Get mangoes of a particular type from the market

• Randomness: Slight variation in shape, size, color, odor etc.
• Ambiguity: Similar in size and color, but belong to different categories
• Nuances: Different in size and color, but belong to the same category

How to make machines understand these?

Introduction to ML - Roadmap
• Definition of Machine Learning
• Learning to predict
  • Classification
  • Regression
• Learning Paradigms
  • Rule based
  • Statistical
  • Example Based
• Statistical Machine Learning
  • Supervised
  • Semi-supervised
  • Unsupervised
  • Reinforcement
• Supervised approaches
  • Probabilistic approaches
  • Non-probabilistic approaches
• Example - Text Classification
• Books, Online Courses and Tools

Definition of Machine Learning
• Machine learning [1] is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed.
• It explores the study and construction of algorithms that can learn from data and make predictions on it.
• Applications:
  • Pattern Recognition (e.g., Handwriting Recognition, Face detection, Gesture detection)
  • Prediction of events (e.g., Stock market predictions, weather forecasting, prediction of diseases based on symptoms)
• Almost all popular online services (e.g., Google, Facebook, Amazon) use ML.

[1] https://en.wikipedia.org/wiki/Machine_learning







Learning to Predict - Classification
• Classification is the problem of predicting to which of a set of categories (sub-populations) a new observation belongs.
• Input: Properties of the new observation, $X$
• Output: $y \in \{1, 2, \dots, N\}$, the class of the new observation
• When $N = 2$, the problem is called a "binary classification problem" (e.g., classifying emails into spam or non-spam categories).
• When $N > 2$, the problem is called an N-class/multi-class classification problem (e.g., classifying documents into multiple categories like sports, health, politics etc.).

Learning to Predict - Regression
• When the output space of a predictor is a real number instead of nominal categories (as in classification), the prediction problem is referred to as statistical regression or simply regression.
• Input: Properties of the new observation, $X$
• Output: $y$, where $y \in \mathbb{R}$
• Example: Predicting the temperature of a day given the climatic conditions of the previous day; estimating the number of units of a new product to be sold in a year.

Note: Structured prediction
• Deals with more complex output (instead of a scalar output, as in classification and regression)
• Output: a structured object such as a sequence or a tree, rather than a single value
• Example: Automatic text translation (output is a sentence in another language), Parse tree generation (output is a tree structure), Image Captioning

We will only focus on classification problems.







Learning Objective
• Back to Mangoes

Task: Given some basic measurable properties of a certain mango, predict which category it belongs to.

• Measurable properties / Attributes / Features: Color, Weight, Smell, Dimensions, Taste??
• Classes: Alphonso / Alice / Irwin

Learning Objectives
• What to learn?
  • Correspondences between various attributes of the input object and the classes
• How to learn?
  • Rule based learning
  • Statistical learning
  • Example based learning

Learning Paradigms – Rule Based
• Learning is based on a set of rules handcrafted by humans.
• The collection of rules, or the "rule-base", has to be exhaustive enough to capture all the corner cases.
• Problems: Extremely hard, needs domain expertise, and is highly time-consuming

    if (weight < 0.5 && color == "yellow" || color == "green") {
        category = "Alphonso";
    } else if (…) {
        category = "Alice";
    }

Learning Paradigms – Example Based
• A very small set of examples with complete information (both input and classes) is available.
• Templates for each class are learned automatically.
• When a new observation arrives, the class prediction is made based on the template that fits the observation best.
• Problems:
  • Templates are generic representatives that are supposed to stand for the whole sub-population belonging to a class. For many problems, it is quite hard to come up with such representatives from a small number of examples.
  • Susceptible to changes in the nature of the input data
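One way to picture template learning is a nearest-centroid classifier. The sketch below is only an illustration under stated assumptions: each template is taken to be the mean feature vector of a class, "best fit" is taken to be the smallest Euclidean distance, and the mango measurements and the function names `learn_templates` and `predict` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def learn_templates(examples):
    """Average the feature vectors of each class into one template (centroid)."""
    sums, counts = {}, {}
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict(templates, features):
    """Return the class whose template is closest to the observation."""
    return min(templates, key=lambda label: dist(templates[label], features))

# Hypothetical mangoes: (weight in kg, length in cm) -> variety
examples = [
    ((0.25, 9.0), "Alphonso"),
    ((0.30, 10.0), "Alphonso"),
    ((0.45, 13.0), "Irwin"),
    ((0.50, 14.0), "Irwin"),
]
templates = learn_templates(examples)
prediction = predict(templates, (0.28, 9.5))
print(prediction)
```

With only two examples per class, each template is a rough average, which is the weakness the slide points out.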

Learning Paradigms – Statistical
• Beneficial if a large set of diverse examples is available.
• Feature-class correspondences are learned better.
• Easy to update the classifier if the nature of the input data changes.
• Can leverage the huge volume of available web data
• Problems: Overlearning can sometimes happen (referred to as overfitting). Feature selection affects system accuracy.







Statistical Machine Learning - Supervised Approaches
• Learning is based on a set of observations for which class labels are available.

[Figure: labelled mangoes (Alice, Irwin, Alphonso) → Learned Model → prediction for a new mango: Alphonso]

Statistical Machine Learning - Semi-Supervised Approaches
• Learning is based on a set of observations for which class labels are available AND another set (typically larger than the labelled set) of observations for which class labels are not available.

[Figure: labelled mangoes (Alice, Irwin, Alphonso) plus unlabelled mangoes → Learned Model → prediction for a new mango: Alphonso]

Statistical Machine Learning - Unsupervised Approaches
• Learning when no class labels are available.

Statistical Machine Learning - Reinforcement Learning
• Learning happens with the objective of maximizing the reward associated with the task.
• Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented. The association is captured in terms of rewards.







Supervised Approaches
• Recap:
  • Measurable properties / Attributes / Features: Color, Weight, Smell, Dimensions, Taste??
  • Classes: Alphonso / Alice / Irwin

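Supervised Approaches – Probabilistic Models
• Given a set of features, the classification decision of probabilistic models can be expressed as

$$\hat{y} = \arg\max_{y} P(y \mid X)$$

where $X = (x_1, \dots, x_n)$ is the feature vector and $y \in \{1, \dots, N\}$ ranges over the classes.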

Supervised Approaches – Naïve Bayes

$$P(y \mid X) = \frac{P(X \mid y)\, P(y)}{P(X)}$$

• Posterior: $P(y \mid X)$
• Likelihood: $P(X \mid y)$
• Prior: $P(y)$

The prior can be assumed to be a multinomial distribution for classification problems.

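Supervised Approaches – Naïve Bayes (1)
• Now, if we assume that the features $x_1, \dots, x_n$ are independent of each other given the class:

$$P(X \mid y) = \prod_{i=1}^{n} P(x_i \mid y) \quad\Rightarrow\quad \hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

• Note: The independence assumption may not hold true for many real-life problems.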
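As a rough illustration of the Naïve Bayes decision rule, the sketch below estimates the prior and the per-feature likelihoods by counting and then picks the class with the highest score. It rests on assumptions that are not in the slides: categorical features, add-one (Laplace) smoothing, log-probabilities to avoid underflow, and made-up mango data. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def train_naive_bayes(samples):
    """Count classes and per-feature values; counts are smoothed at prediction time."""
    class_counts = Counter(label for _, label in samples)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts
    values_seen = defaultdict(set)       # feature index -> distinct values observed
    for features, label in samples:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict(model, features):
    """Return argmax over classes of log P(y) + sum_i log P(x_i | y)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        score = log(count / total)  # log prior
        for i, value in enumerate(features):
            # Add-one smoothing (an assumption here) keeps unseen values from
            # driving the whole product to zero.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(values_seen[i])
            score += log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: (color, size) -> variety
samples = [
    (("yellow", "small"), "Alphonso"),
    (("yellow", "small"), "Alphonso"),
    (("green", "big"), "Irwin"),
    (("yellow", "big"), "Irwin"),
]
model = train_naive_bayes(samples)
prediction = predict(model, ("yellow", "small"))
print(prediction)
```

The evidence term P(X) is omitted because it is the same for every class and does not change the argmax.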

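Supervised Approaches – Logistic Regression
• Remember: $\hat{y} = \arg\max_{y} P(y \mid X)$
• In Logistic Regression, $P(y \mid X)$ is directly estimated:

$$P(y = 1 \mid X) = \frac{1}{1 + e^{-u}}$$

where $u$ follows a regular weighted linear equation

$$u = w_0 + \sum_{i=1}^{n} w_i x_i$$

The coefficients ($w_0$ and $w_i$) have to be learned during training.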
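The prediction step of binary logistic regression can be sketched as follows. This shows only how a probability is computed once the coefficients are known; the coefficient values, the 0.5 decision threshold and the names `sigmoid` and `predict_proba` are assumptions for illustration, and training is not shown.

```python
from math import exp

def sigmoid(u):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-u))

def predict_proba(weights, bias, features):
    """P(y = 1 | X) for a weighted linear combination u of the features."""
    u = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(u)

# Hypothetical coefficients; in practice they are learned during training.
weights = [2.0, -1.0]
bias = 0.5

p = predict_proba(weights, bias, [1.0, 2.5])  # u = 0.5 + 2.0 - 2.5 = 0.0
label = 1 if p >= 0.5 else 0                  # threshold of 0.5 is an assumption
print(p, label)
```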

Supervised Approaches: Non-Probabilistic Models

[Figure: points of Class-1 and Class-2 plotted in a two-dimensional feature space (x1, x2)]

Supervised Approaches: K-Nearest Neighbor

[Figure: a new point among the points of Class-1 and Class-2]

The K closest neighbors are decided based on a pre-defined distance measure. The class to which the maximum number of close neighbors belong becomes the winner class.
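The voting rule above can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance as the pre-defined measure and a brute-force search over all training points; the data points and the name `knn_predict` are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(training, query, k=3):
    """Return the majority class among the k training points nearest to query."""
    neighbours = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional points (x1, x2) with their classes
training = [
    ((1.0, 1.0), "Class-1"),
    ((1.5, 2.0), "Class-1"),
    ((2.0, 1.5), "Class-1"),
    ((6.0, 6.0), "Class-2"),
    ((6.5, 7.0), "Class-2"),
    ((7.0, 6.5), "Class-2"),
]
prediction = knn_predict(training, (2.0, 2.0), k=3)
print(prediction)
```

An odd K is a common choice for two classes because it avoids tied votes.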

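Distance/similarity measures
• Euclidean Distance (between vectors $X_1$ and $X_2$)

$$d(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

which is a special case ($p = 2$) of the Minkowski Distance

$$d_p(X_1, X_2) = \left( \sum_{i=1}^{n} |x_{1i} - x_{2i}|^p \right)^{1/p}$$

• Cosine Distance

$$d_{\cos}(X_1, X_2) = 1 - \frac{X_1 \cdot X_2}{\|X_1\| \, \|X_2\|}$$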
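These measures translate directly into code. The sketch below uses plain Python lists or tuples as vectors and defines cosine distance as one minus cosine similarity, which is one common convention; the function names are chosen here for illustration.

```python
from math import sqrt

def minkowski(x1, x2, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)

def euclidean(x1, x2):
    """Euclidean distance: the Minkowski distance with p = 2."""
    return minkowski(x1, x2, 2)

def cosine_similarity(x1, x2):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = sqrt(sum(a * a for a in x1))
    norm2 = sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)

def cosine_distance(x1, x2):
    """One minus the cosine similarity (assumed convention)."""
    return 1.0 - cosine_similarity(x1, x2)

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))        # straight-line distance
print(minkowski(a, b, 1))     # p = 1 gives the Manhattan distance
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # orthogonal vectors
```

Cosine distance ignores vector length, so two vectors pointing in the same direction have a distance of about zero even if their magnitudes differ.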

Supervised Approaches: Support Vector Machines

[Figure: points of Class-1 and Class-2 separated by the line $w \cdot x - b = 0$]

$$f(x, w, b) = \mathrm{sign}(w \cdot x - b)$$

SVMs: Specifying the boundary

[Figure: the classifier boundary lies midway between the plus-plane and the minus-plane; the "Predict Class = +1" zone lies beyond the plus-plane and the "Predict Class = -1" zone lies beyond the minus-plane]

• Plus-plane: $w \cdot x - b = 1$
• Minus-plane: $w \cdot x - b = -1$
• Classifier boundary: $w \cdot x - b = 0$
• M = Margin Width = $\dfrac{2}{\sqrt{w \cdot w}}$

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin

So now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. This is primarily done through quadratic programming.
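The two checks for a guessed w and b can be sketched as below. This only evaluates candidate boundaries; it does not perform the quadratic-programming search. The data points and both guesses are made up for illustration, labels are assumed to be +1 or -1, and a score of exactly zero is assigned to class +1 by assumption.

```python
from math import sqrt

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def classify(w, b, x):
    """sign(w . x - b), with a score of exactly 0 mapped to +1 (an assumption)."""
    return 1 if dot(w, x) - b >= 0 else -1

def all_points_correct(w, b, points):
    """True if every point lies on or beyond its own plus- or minus-plane."""
    return all(y * (dot(w, x) - b) >= 1 for x, y in points)

def margin_width(w):
    """Distance between the plus-plane and the minus-plane: 2 / sqrt(w . w)."""
    return 2.0 / sqrt(dot(w, w))

# Hypothetical linearly separable data: (x1, x2) with label -1 or +1
points = [((1.0, 1.0), -1), ((2.0, 1.0), -1), ((4.0, 4.0), 1), ((5.0, 4.0), 1)]

guess_a = ((0.5, 0.5), 2.5)
guess_b = ((1.0, 1.0), 5.0)
for w, b in (guess_a, guess_b):
    print(w, b, all_points_correct(w, b, points), margin_width(w))
```

Both guesses place every point in the correct half-plane, but the first has the wider margin, so a search over w and b would prefer it.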

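Supervised Approaches – Decision Tree

[Figure: example decision tree over two categorical attributes, one continuous attribute and a class column. Node labels: MarSt, Color, TaxInc. Branch labels: Yellow / Green; Small / Big, Medium; < 80. Leaves: A or B.]

There could be more than one tree that fits the same data!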
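The last remark can be made concrete with two one-level trees written as plain functions. The attribute names, values and training rows below are hypothetical; the point is only that both trees agree with every training row yet disagree on an unseen mango.

```python
def tree_by_color(mango):
    """A tree that splits only on color."""
    return "A" if mango["color"] == "Yellow" else "B"

def tree_by_size(mango):
    """A different tree that splits only on size."""
    return "A" if mango["size"] == "Small" else "B"

def fits(tree, data):
    """True if the tree reproduces the label of every training row."""
    return all(tree(features) == label for features, label in data)

# Hypothetical training rows: attributes -> class
training = [
    ({"color": "Yellow", "size": "Small"}, "A"),
    ({"color": "Yellow", "size": "Small"}, "A"),
    ({"color": "Green", "size": "Big"}, "B"),
    ({"color": "Green", "size": "Medium"}, "B"),
]

unseen = {"color": "Yellow", "size": "Big"}
print(fits(tree_by_color, training), fits(tree_by_size, training))
print(tree_by_color(unseen), tree_by_size(unseen))
```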

Supervised Approaches - Note
• It is important to decide on a set of features that adequately explains the data.
• Selecting an extremely small number of features may underspecify the data and may not help the classifier learn properly.
• As the number of features increases, the model complexity increases (i.e., more parameters have to be learned and the chance of overfitting increases).
• Very high-dimensional feature vectors make it unintuitive to analyze them, to design distance functions and to perform combinatorics and optimization. This is known as the "Curse of Dimensionality".



Example – Text Classification
• Text classification is an important problem in the field of Natural Language Processing and Machine Learning.
• Objective: Assign a class label to a given text
• Example:
  1: Obama won the election: Politics
  2: Brasil lost the football match: Sports

Problems in Text Classification
• Lexical Problems:
  • Presence of ambiguous words, e.g., Cricket (game) vs Cricket (insect)
• Structural Problems:
  • Complexity at the syntactic level, e.g., Mohd. Kaif, who was the hero of the Natwest final match against England in 2002, has joined BJP and will be running for an MP position. (Politics)
• Semantic Problems:
  • Complexity at the semantic level, e.g., With the humiliating defeat in Bihar, INC's innings seems to be over.
• Pragmatic Problems:
  • e.g., India lost to Zimbabwe yesterday. (Sports) Bernie lost to Clinton in New York. (Politics)

Text Classification – Method

[Figure: pipeline. Some documents → annotation → training data → compute features → features and labels → MODEL (Naïve Bayes, SVM, Decision Tree etc.). Any unseen document → compute features → MODEL → prediction.]

Text Classification – Feature Extraction
• Example:
  • Training Sample: (Domain classification)
    1: Obama won the election: Politics
    2: Brasil lost the football match: Sports
  • Features:
    • Vocabulary: <Obama, won, the, election, Brasil, lost, football, match>
    • Bag of Words features based on presence/absence:
      • 1: <1,1,1,1,0,0,0,0>:0
      • 2: <0,0,1,0,1,1,1,1>:1
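The presence/absence vectors above can be reproduced with a short sketch. It assumes whitespace tokenization, case-sensitive matching and a vocabulary ordered by first occurrence; the function names are hypothetical.

```python
def build_vocabulary(documents):
    """Collect distinct words in order of first occurrence."""
    vocabulary = []
    for text in documents:
        for word in text.split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def presence_vector(text, vocabulary):
    """1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.split())
    return [1 if word in words else 0 for word in vocabulary]

documents = ["Obama won the election", "Brasil lost the football match"]
labels = [0, 1]  # 0 = Politics, 1 = Sports

vocabulary = build_vocabulary(documents)
vectors = [presence_vector(text, vocabulary) for text in documents]
for vector, label in zip(vectors, labels):
    print(vector, label)
```

A word that never occurred in training, such as "Clinton", has no position in the vocabulary and is silently dropped from the vector.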

Text Classification – Training and Testing
• Training:
  • The weight of each feature towards a label is computed by the training algorithm. The weight determines how predictive the feature is.
• Test:
  • Based on the features present in the test data, the combined weight is computed and a label is decided.
• Problem: A feature in the test data may not have been seen in the training data (data sparsity problem).
• Solution: Instead of bag-of-words features, consider bag of senses, word embeddings etc.

Text Classification – Evaluation Metric
• The performance of classifiers is typically measured by Accuracy, Precision, Recall and F-Measure.
• For a binary classification problem, if the class labels are positive and negative:
  • True Positive (TP): Number of test documents that are actually positive and are predicted positive.
  • True Negative (TN): Number of test documents that are actually negative and are predicted negative.
  • False Positive (FP): Number of test documents that are actually negative but are predicted positive.
  • False Negative (FN): Number of test documents that are actually positive but are predicted negative.

Text Classification – Evaluation Metric (1)
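In terms of the four counts, the standard definitions are Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F-Measure = 2 · Precision · Recall / (Precision + Recall). The sketch below computes them for a made-up set of predictions; it assumes the balanced F-measure (F1) and that no denominator is zero, and the function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p != positive:
            tn += 1
        elif a != positive and p == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def evaluate(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1; assumes non-zero denominators."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Hypothetical test labels and classifier output
actual = ["positive", "positive", "positive", "negative",
          "negative", "negative", "negative", "positive"]
predicted = ["positive", "positive", "negative", "negative",
             "negative", "positive", "negative", "positive"]

counts = confusion_counts(actual, predicted)
metrics = evaluate(*counts)
print(counts, metrics)
```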

Text Classification - DEMO
• Package: Scikit-learn (install the numpy, scipy, matplotlib and scikit-learn packages)
• Demo:
  • Naïve Bayes
  • SVM
  • KNN
  • Decision Tree
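The original demo code is not part of these slides, so the following is only a sketch of what such a demo might look like with scikit-learn. The training sentences beyond the two slide examples, the test sentence, the choice of `LinearSVC` for the SVM and the hyperparameters (`n_neighbors=1`, `random_state=0`) are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Tiny hypothetical corpus; a real demo would use far more documents.
train_texts = [
    "Obama won the election",
    "The senate passed the bill",
    "Brasil lost the football match",
    "The team won the cricket final",
]
train_labels = ["Politics", "Politics", "Sports", "Sports"]
test_texts = ["The football team lost the final"]

# Bag-of-words features based on presence/absence (binary=True).
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # unseen words are ignored

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

predictions = {}
for name, classifier in classifiers.items():
    classifier.fit(X_train, train_labels)
    predictions[name] = classifier.predict(X_test)[0]
    print(name, "->", predictions[name])
```

With so little training data the four models need not agree, which is itself a useful point to discuss in a demo.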

Books and Online Courses
• Books
  • Machine Learning by Tom Mitchell
  • Pattern Recognition and Machine Learning by Christopher M. Bishop
  • Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh, Ameet Talwalkar
  • Machine Learning: A Probabilistic Perspective by Kevin Murphy
  • Bayesian Reasoning and Machine Learning by David Barber
  • Probabilistic Graphical Models: Principles and Techniques by Daphne Koller, Nir Friedman
• Courses
  • Machine Learning - Stanford University (Coursera) - Andrew Ng
  • Mining Massive Datasets - Stanford Online

Tools
• Java
  • Weka (for supervised/semi-supervised) (www.cs.waikato.ac.nz/ml/weka/)
  • Mallet (for unsupervised) (www.mallet.cs.umass.edu)
• Python
  • Scikit-Learn (http://scikit-learn.org/)
  • Statsmodels (www.statsmodels.sourceforge.net)
• R statistical packages (https://cran.r-project.org/web/packages/)

Thank you

Questions?


