Introduction to Machine Learning for NLP I
Benjamin Roth
CIS LMU München
Benjamin Roth (CIS LMU München) Introduction to Machine Learning for NLP I 1 / 49
Outline
1 This Course
2 Overview
3 Machine Learning Definition: Data (Experience), Tasks, Performance Measures
4 Linear Regression: Overview and Cost Function
5 Summary
Course Overview
Foundations of machine learning
  - loss functions
  - linear regression
  - logistic regression
  - gradient-based optimization
  - neural networks and backpropagation
Deep learning tools in Python
  - Numpy
  - Theano
  - Keras
  - (some) Tensorflow?, (some) Pytorch?
Applications
  - Word Embeddings
  - Sentiment Analysis
  - Relation extraction
  - (some) Machine Translation?
  - Practical projects (NLP related, to be agreed on during the course)
Lecture Times, Tutorials
Course homepage: dl-nlp.github.io
9-11 is supposed to be the lecture slot, and 11-12 the tutorial slot ...
... but we will not stick to that allocation
We will sometimes have longer Q&A-style/interactive “tutorial” sessions, sometimes more lectures (see next slide)
Tutor: Simon Schafer
  - Will discuss exercise sheets in the tutorials
  - Will help you with the projects
Plan
Date  | 9-11 slot                          | 11-12 slot         | Ex. sheet
10/18 | Overview / ML Intro I              | ML Intro I         | Linear algebra chapter
10/25 | Linear algebra Q&A / ML II         | ML II              | Probability chapter
11/1  | public holiday                     |                    |
11/8  | Probability Q&A / ML III           | Numpy              | Numpy
11/15 | ML IV / Theano Intro               | Convolution        | Theano I
11/22 | Embeddings / CNNs & RNNs for NLP   | Numpy Q&A          | Read LSTM/RNN
11/29 | LSTM (reading group)               | Theano I Q&A       | Theano II
12/6  | Keras                              | Keras              | Keras
12/13 | DL for Relation Prediction         | Theano II Q&A      | Relation Prediction
12/20 | Word Vectors                       | Project Topics     | Project Assignments
1/10  | Keras Q&A, Rel.Extr. Q&A           | Tensorflow         | –
1/17  | Optimization methods / PyTorch     | Help with projects | –
1/24  | Other Work at CIS / LMU, Neural MT | Help with projects | –
1/31  | Project presentations              | Presentations      | –
2/7   | Project presentations              | Presentations      | –
Formalities
This class is graded by a project
The grade of the project is determined by taking the average of:
  - Grade of the code written for the project.
  - Grade of the project documentation / mini-report.
  - Grade of the presentation about your project.
⇒ You have to pass all three elements in order to pass the course.
Bonus points: The grade can be improved by 0.5 absolute grades through the exercise sheets before New Year.
Formula:

g_project = (g_project-code + g_project-report + g_project-presentation) / 3

g_final = round(g_project − 0.5 · x)

where x is the fraction of points reached in the exercises (between 0 and 1), and round selects the closest value of 1; 1.3; 1.7; 2; ...; 3.7; 4
Exercise sheets, Projects, Presentations
6 ECTS, 14 weeks ⇒ avg. work load ∼ 13 hrs / week (3 in class, 10 at home)
  - in the first weeks, spend enough time to read and prepare so that you are not lost later
  - from mid-November to mid-December: programming assignments; coding takes time, and can be frustrating (but rewarding)!
Exercise sheets
  - Work on non-programming exercise sheets individually
  - For exercise sheets that contain programming parts, submit in teams of 2 or 3
Projects
  - A list of topics will be proposed by me: ∼ implement a deep learning technique applied to information extraction (or other NLP task)
  - Own ideas also possible, need to be discussed with me
  - Work in groups of two or three
  - Project report: 3 pages / team member
Good project code ...
... shows that you master the techniques taught in the lectures and exercises.
... shows that you can make “own decisions”: e.g. adapt model / task / training data etc. if necessary.
... is well-structured and easy to understand (telling variable names, meaningful modularization; avoid: code duplication, dead code)
... is correct (especially: train/dev/test splits, evaluation)
... is within the scope of this lecture (time-wise should not exceed 5 × 10h)
A good project presentation ...
... is short (10 min. per person + 15 min. Q&A per team)
... similar to the report, contains the problem statement, motivation, model, and results
... is targeted to your fellow students, who do not know details beforehand
... contains interesting stuff: unexpected observations? conclusions / recommendations? did you deviate from some common practice?
... demonstrates that all team members worked together on the project
Possible outline
  - Background / Motivation
  - Formal characterization of techniques used
  - Technical Approach and Difficulties
  - Experiments, Results and Interpretation
A good project report ...
... is concise (3 pages / person) and clear
... motivates and describes the model that you have implemented and the results that you have obtained
... shows that you can correctly describe the concepts taught in this class
... contains interesting stuff: unexpected observations? conclusions / recommendations? did you deviate from some common practice?
Outline
1 This Course
2 Overview
3 Machine Learning Definition: Data (Experience), Tasks, Performance Measures
4 Linear Regression: Overview and Cost Function
5 Summary
Machine Learning
Machine learning for natural language processing
  - Why?
  - Advantages and disadvantages to alternatives?
  - Accuracy; Coverage; resources required (data, expertise, human labour); Reliability/Robustness; Explainability

S → NP VP
VP → V NP
NP → Det NN
Deep Learning
Learn complex functions that are (recursively) composed of simpler functions.
Many parameters have to be estimated.
Deep Learning
Main Advantage: Feature learning
  - Models learn to capture the most essential properties of the data (according to some performance measure) as intermediate representations.
  - No need to hand-craft feature extraction algorithms
Neural Networks
First training methods for deep nonlinear NNs appeared in the 1960s (Ivakhnenko and others).
Increasing interest in NN technology (again) since around 5 years ago (“Neural Network Renaissance”): orders of magnitude more data and faster computers now.
Many successes:
  - Image recognition and captioning
  - Speech recognition
  - NLP and Machine Translation (demo of the Bahdanau / Cho / Bengio system)
  - Game playing (AlphaGO)
  - ...
Machine Learning
Deep Learning builds on general Machine Learning concepts
argmin_{θ ∈ H} ∑_{i=1}^{m} L(f(x_i; θ), y_i)
Fitting data vs. generalizing from data
[Figure: three scatter plots of prediction vs. feature, illustrating fitting data vs. generalizing from data.]
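The objective above can be made concrete in code. A minimal sketch in numpy, assuming a linear model and squared loss (both illustrative choices, not fixed by the slide):

```python
import numpy as np

def squared_loss(y_hat, y):
    # per-example loss L(f(x; theta), y)
    return (y_hat - y) ** 2

def empirical_risk(theta, X, y):
    # sum of per-example losses over the training set
    y_hat = X @ theta           # linear model f(x; theta) = theta^T x
    return squared_loss(y_hat, y).sum()

X = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([5.0, 7.0])
theta = np.array([1.0, 2.0])    # predictions: [5.0, 7.0]
print(empirical_risk(theta, X, y))   # 0.0 -- this theta fits the data exactly
```

Learning then means searching the parameter space H for the theta that minimizes this quantity.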
Outline
1 This Course
2 Overview
3 Machine Learning Definition: Data (Experience), Tasks, Performance Measures
4 Linear Regression: Overview and Cost Function
5 Summary
A Definition
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.” (Mitchell 1997)
Learning: Attaining the ability to perform a task.
A set of examples (“experience”) represents a more general task.
Examples are described by features: sets of numerical properties that can be represented as vectors x ∈ R^n.
Outline
1 This Course
2 Overview
3 Machine Learning Definition: Data (Experience), Tasks, Performance Measures
4 Linear Regression: Overview and Cost Function
5 Summary
Data
“A computer program is said to learn from experience E [...], if its performance [...] improves with experience E.”
Dataset: collection of examples
Design matrix X ∈ R^{n×m}
  - n: number of examples
  - m: number of features
  - Example: X_{i,j} = count of feature j (e.g. a stem form) in document i.
Unsupervised learning:
  - Model X, or find interesting properties of X.
  - Training data: only X.
Supervised learning:
  - Predict specific additional properties from X.
  - Training data: label vector y ∈ R^n together with X
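Such a design matrix can be built directly; a small sketch with made-up documents and a made-up feature vocabulary:

```python
import numpy as np

docs = ["the cat sat", "the dog sat sat", "a cat and a dog"]
vocab = ["the", "cat", "sat", "dog"]    # feature j = count of word j

# X[i, j] = count of feature j in document i  (n examples x m features)
X = np.array([[doc.split().count(w) for w in vocab] for doc in docs])
print(X.shape)   # (3, 4): n = 3 examples, m = 4 features
print(X)
```

For supervised learning, a label vector y of length n would accompany this matrix.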
Data
Low training error does not mean good generalization.
Algorithm may overfit.
[Figure: two scatter plots of prediction vs. feature, contrasting a model that generalizes with one that overfits the training points.]
Data Splits
Best Practice: Split data into training, cross-validation and test set. (“Cross-validation set” = “development set”.)
  - Optimize low-level parameters (feature weights ...) on the training set.
  - Select models and hyper-parameters on the cross-validation set (type of machine learning model, number of features, regularization, priors).
  - It is possible to overfit both in the training as well as in the model selection stage!
  - ⇒ Report the final score on the test set only after the model has been selected!
Don’t report the error on the training or cross-validation set as your model performance!
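A minimal sketch of such a split in numpy (the 60/20/20 proportions and the random data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 5))   # made-up feature matrix
y = rng.normal(size=n)        # made-up labels

# shuffle once, then carve out train / cross-validation / test indices
idx = rng.permutation(n)
train, dev, test = idx[:60], idx[60:80], idx[80:]

X_train, y_train = X[train], y[train]   # fit low-level parameters here
X_dev, y_dev = X[dev], y[dev]           # select models / hyper-parameters here
X_test, y_test = X[test], y[test]       # report the final score here, once
```

Shuffling before splitting avoids accidental ordering effects (e.g. data sorted by label).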
Outline
1 This Course
2 Overview
3 Machine Learning Definition: Data (Experience), Tasks, Performance Measures
4 Linear Regression: Overview and Cost Function
5 Summary
Machine Learning Tasks
“A computer program is said to learn [...] with respect to some class of tasks T [...] if its performance at tasks in T [...] improves [...]”
Types of Tasks:
Classification
Regression
Structured Prediction
Anomaly Detection
Synthesis and sampling
Imputation of missing values
Denoising
Clustering
Reinforcement learning
. . .
Machine Learning Tasks: Typical Examples & Examples from Recent NLP Research
What are the most important conferences relevant to the intersection of ML and NLP?
Task: Classification
Which of k classes does an example belong to?
f : R^n → {1, ..., k}

Typical example: Categorize image patches
  - Feature vector: color intensities for each pixel; derived features.
  - Output categories: predefined set of labels

Typical example: Spam Classification
  - Feature vector: high-dimensional, sparse vector. Each dimension indicates the occurrence of a particular word, or other email-specific information.
  - Output categories: “spam” vs. “ham”
Task: Classification
EMNLP 2017: Given a person name in a sentence that contains keywords related to police (“officer”, “police” ...) and to killing (“killed”, “shot”), was the person a civilian killed by police?
Task: Regression
Predict a numerical value given some input.
f : R^n → R

Typical examples:
  - Predict the risk of an insurance customer.
  - Predict the value of a stock.
Task: Regression
ACL 2017: Given a response in a multi-turn dialogue, predict (on a scale from 1 to 5) how natural the response is.
Task: Structured Prediction
Predict a multi-valued output with special inter-dependencies and constraints.
Typical examples:
  - Part-of-speech tagging
  - Syntactic parsing
  - Protein-folding
Often involves search and problem-specific algorithms.
Task: Structured Prediction
ACL 2017: Jointly find all relations of interest in a sentence by tagging arguments and combining them.
Task: Reinforcement Learning
In reinforcement learning, the model (also called agent) needs to select a series of actions, but only observes the outcome (reward) at the end.
The goal is to predict actions that will maximize the outcome.
EMNLP 2017: The computer negotiates with humans in natural language in order to maximize its points in a game.
Task: Anomaly Detection
Detect atypical items or events.
Common approach: estimate density and identify items that have low probability.
Examples:
  - Quality assurance
  - Detection of criminal activity
Often items categorized as outliers are sent to humans for further scrutiny.
Task: Anomaly Detection
ACL 2017: Schizophrenia patients can be detected by their non-standard use of metaphors and more extreme sentiment expressions.
Supervised and Unsupervised Learning
Unsupervised learning: learn interesting properties, such as the probability distribution p(x)
Supervised learning: learn a mapping from x to y, typically by estimating p(y|x)
Supervised learning in an unsupervised way:
p(y | x) = p(x, y) / ∑_{y′} p(x, y′)
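The normalization above can be sketched with a small made-up joint distribution:

```python
import numpy as np

# made-up joint p(x, y) over 2 feature values (rows) and 3 labels (columns)
p_xy = np.array([[0.1, 0.2, 0.1],
                 [0.3, 0.2, 0.1]])

# p(y | x) = p(x, y) / sum_y' p(x, y')
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)
print(p_y_given_x[0])   # [0.25, 0.5, 0.25]
```

Each row of the result is a proper conditional distribution over labels (it sums to 1).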
Outline
1 This Course
2 Overview
3 Machine Learning Definition: Data (Experience), Tasks, Performance Measures
4 Linear Regression: Overview and Cost Function
5 Summary
Performance Measures
“A computer program is said to learn [...] with respect to some [...] performance measure P, if its performance [...] as measured by P, improves [...]”
Quantitative measure of algorithm performance.
Task-specific.
Discrete Loss Functions
Can be used to measure classification performance.
Not applicable to measure density estimation or regression performance.
Accuracy
  - Proportion of examples for which the model produces the correct output.
  - 0-1 loss = error rate = 1 − accuracy.
Accuracy may be inappropriate for skewed label distributions, where the relevant category is rare

F1-score = (2 · Prec · Rec) / (Prec + Rec)
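Accuracy and F1 on a skewed label distribution can be computed directly; a small sketch (the labels below are made up):

```python
import numpy as np

def accuracy(y_true, y_pred):
    return (y_true == y_pred).mean()

def f1_score(y_true, y_pred, positive=1):
    tp = ((y_pred == positive) & (y_true == positive)).sum()
    prec = tp / max((y_pred == positive).sum(), 1)
    rec = tp / max((y_true == positive).sum(), 1)
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

# skewed labels: 8 negatives, 2 positives; model finds only one positive
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])
print(accuracy(y_true, y_pred))   # 0.9 -- looks good despite the miss
print(f1_score(y_true, y_pred))   # about 0.667 -- reflects the missed positive
```

The gap between the two numbers is exactly why accuracy can mislead on rare-category tasks.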
Discrete vs. Continuous Loss Functions
Discrete loss functions cannot indicate how wrong a wrong decision for one example is.
Continuous loss functions ...
  - ... are more widely applicable.
  - ... are often easier to optimize (differentiable).
  - ... can also be applied to discrete tasks (classification).
Sometimes algorithms are optimized using one loss (e.g. Hinge loss) and evaluated using another loss (e.g. F1-score).
Examples for Continuous Loss Functions
Density estimation: log probability of example
Regression: squared error
Classification: loss L(y_i · f(x_i)) is a function of label × prediction
  - label ∈ {−1, 1}, prediction ∈ R
  - Correct prediction: y_i · f(x_i) > 0
  - Wrong prediction: y_i · f(x_i) ≤ 0
  - Zero-one loss, Hinge loss, logistic loss ...
Loss on a data set is the sum of per-example losses.
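The losses as functions of the margin y_i · f(x_i) can be sketched as:

```python
import numpy as np

def zero_one(margin):      # 1 if y * f(x) <= 0, else 0
    return (margin <= 0).astype(float)

def hinge(margin):         # max(0, 1 - y * f(x))
    return np.maximum(0.0, 1.0 - margin)

def logistic(margin):      # log(1 + exp(-y * f(x)))
    return np.log1p(np.exp(-margin))

y = np.array([1, -1, 1])        # labels in {-1, +1}
f = np.array([2.0, 0.5, -1.0])  # real-valued predictions f(x)
margin = y * f                  # > 0 correct, <= 0 wrong

# loss on the data set is the sum of per-example losses
print(zero_one(margin).sum())   # 2.0: two examples are misclassified
print(hinge(margin).sum())      # 0 + 1.5 + 2.0 = 3.5
```

Unlike the zero-one loss, hinge and logistic losses grow with how badly an example is misclassified, which is what makes them useful for optimization.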
Outline
1 This Course
2 Overview
3 Machine Learning Definition: Data (Experience), Tasks, Performance Measures
4 Linear Regression: Overview and Cost Function
5 Summary
Linear Regression
For one instance:
  - Input: vector x ∈ R^n
  - Output: scalar y ∈ R (actual output: y; predicted output: ŷ)
  - Linear function:

    ŷ = wᵀx = ∑_{j=1}^{n} w_j x_j
Linear Regression
Linear function:

ŷ = wᵀx = ∑_{j=1}^{n} w_j x_j

Parameter vector w ∈ R^n
Weight w_j decides whether the value of feature x_j increases or decreases the prediction ŷ.
Linear Regression
For the whole data set:
  - Use matrix X and vector y to stack instances on top of each other.
  - Typically the first column contains all 1s for the intercept (bias, shift) term.

X = [[1, x_12, x_13, ..., x_1n],
     [1, x_22, x_23, ..., x_2n],
     ...,
     [1, x_m2, x_m3, ..., x_mn]]

y = [y_1, y_2, ..., y_m]ᵀ
For the entire data set, the predictions are stacked on top of each other:

ŷ = Xw
Estimate parameters using X^(train) and y^(train).
Make high-level decisions (which features ...) using X^(dev) and y^(dev).
Evaluate the resulting model using X^(test) and y^(test).
Simple Example: Housing Prices
Predict Munich property prices (in 1K Euros) from just one feature: square meters of property.
X = [[1, 450],
     [1, 900],
     [1, 1350]]

y = [730, 1300, 1700]ᵀ

Prediction is:

ŷ = [w_1 + 450·w_2,  w_1 + 900·w_2,  w_1 + 1350·w_2]ᵀ = Xw

w_1 will contain costs incurred in any property acquisition;
w_2 will contain the remaining average price per square meter.
The optimal parameters for the above case are:

w = [273.3, 1.08]ᵀ,   ŷ = [759.1, 1245.1, 1731.1]ᵀ
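The optimal parameters above can be reproduced with an ordinary least-squares solve; a sketch using numpy's `lstsq` (one possible tool, not prescribed by the course):

```python
import numpy as np

# design matrix with intercept column, and the property prices in 1K Euros
X = np.array([[1.0, 450.0],
              [1.0, 900.0],
              [1.0, 1350.0]])
y = np.array([730.0, 1300.0, 1700.0])

# least-squares fit: w minimizing ||Xw - y||^2
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)        # approximately [273.33, 1.08]
print(X @ w)    # predicted prices for the three properties
```

w[0] is the fixed acquisition cost and w[1] the price per square meter, matching the slide's rounded values.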
Linear Regression: Mean Squared Error
The mean squared error of a training (or test) data set is the average of the squared differences between the predictions and labels of all m instances.

MSE_train = (1/m) ∑_{i=1}^{m} (ŷ_i^(train) − y_i^(train))²

In matrix notation:

MSE_train = (1/m) · ||ŷ^(train) − y^(train)||₂²
          = (1/m) · ||X^(train)·w − y^(train)||₂²
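The matrix form of the MSE translates directly to code; a minimal numpy sketch with made-up data:

```python
import numpy as np

def mse(X, w, y):
    # mean squared error: (1/m) * ||Xw - y||_2^2
    residual = X @ w - y
    return (residual ** 2).mean()

X = np.array([[1.0, 2.0], [1.0, 4.0]])
y = np.array([3.0, 5.0])

w = np.array([1.0, 1.0])   # predictions: [3.0, 5.0] -- a perfect fit
print(mse(X, w, y))        # 0.0
w2 = np.array([0.0, 1.0])  # predictions: [2.0, 4.0] -- off by 1 each
print(mse(X, w2, y))       # 1.0
```

Minimizing this cost function with respect to w is the topic of the next lecture.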
Outline
1 This Course
2 Overview
3 Machine Learning Definition: Data (Experience), Tasks, Performance Measures
4 Linear Regression: Overview and Cost Function
5 Summary
Summary
Deep Learning
  - many successes in recent years
  - feature learning instead of feature engineering
  - builds on general machine learning concepts
Machine learning definition
  - Data
  - Task
  - Cost function
Machine learning tasks
  - Classification
  - Regression
  - ...
Linear regression
  - Output depends linearly on the input
  - Cost function: mean squared error
Next up: estimating the parameters