Neural Natural Language Processing
Lecture 1: Introduction to natural language processing and text
categorization
29.10.19
Plan of the lecture
● Part 1: About the course: logistics, organization, materials, etc.
● Part 2: Motivation for the course: neural NLP models, the “neural revolution” in NLP.
● Part 3: A short introduction to NLP.
● Part 4: The text classification task and a simple model to solve it: Naive Bayes.
Lecture 1
Part 1: About the course: logistics, organization, materials, etc.
Acknowledgments
● Based on the materials of the following courses:
– Lectures and assignments are adapted from the “Neural Networks for Natural Language Processing” course by Nikolay Arefyev (Samsung Moscow Research Center and Moscow State University).
– Seminars are adapted from various sources, notably the NLP course of the Yandex School of Data Analysis.
– Additional sources will be indicated as needed.
Instructors
Lectures:
● Prof. Alexander Panchenko, Skoltech
● Dr. Nikolay Arefyev, Samsung / Moscow State University
Seminars, assignments:
● Dr. Artem Shelmanov, Skoltech
● Dr. Varvara Logacheva, Skoltech
● Olga Kozlova, MTS Innovation Center
● Viktoria Chekalina, Skoltech / Philips Innovation Center
● Irina Nikishina, Skoltech
● Daryna Dementieva, Skoltech
Final projects:
● Olga Kozlova, Alexander Panchenko, ...
Tentative schedule of the class
Assignments
● A Kaggle-style competition for the best F-score
● One task (sentiment analysis), different models
Assignments
● Sentiment analysis using Naive Bayes classifier.
● Sentiment analysis using Logistic Regression and a Feedforward Neural Network.
● Sentiment analysis using word and document embeddings.
● Sentiment analysis using RNNs.
● Sentiment analysis using BERT or ELMo.
The assignments are ordered by increasing model complexity (and, presumably, performance).
Assignments
Evaluation criteria:
● Results: what was the rank of your solution among other submissions?
● Reproducibility: the possibility to get the results using your script.
● Readability: how easy is it to understand your code?
● Timing: did you deliver in time?
Final project
Various options:
● Find an interesting task and propose a (neural) NLP model to solve it.
● Propose a new NLP task, or a variant of an existing one, and come up with a baseline for its solution.
● Take a recently published NLP paper and replicate its results. Discuss the outcomes.
Final project
● The list of topics can be found here: http://bit.ly/nnlp_topics
– To be further extended.
● Can be done in a group of up to 3 people.
● You can propose your own topic as well.
– To propose a topic, enter your name and topic here: http://bit.ly/nnlp_topics_distribution
● It is advised to ask an instructor during a seminar about the suitability of a topic (but this is not a strict requirement).
Final project
Requirements:
● The outcome of a project is a Jupyter notebook which describes the entire experiment:
– It should be readable (with supporting text: task, motivation, discussion);
– It should be executable: we should be able to reproduce your results on the first try.
● Due to time constraints, there is no oral presentation. Instead, communicate what you have done in code, text, formulas, tables, and plots.
● Deadline: 19.12.2019 EoD.
● We suggest starting ASAP!
Final project
Evaluation criteria:
● Relevance of the task: are you tackling a relevant research problem? Did you do something that has not been done yet (at least in some aspect), or was a solution already available on GitHub before you started?
● Readability: can we easily understand what has been done?
● Reproducibility: can we get the same numbers and plots?
● Results: did you manage to improve something (or gain some interesting insights from negative results)?
● Originality: how innovative was your approach?
● Timing: did you deliver in time?
Exam
● It is not obvious how to organize an exam in our case:
– e.g., the Deep Learning course has no exam.
● Mostly questions about various models:
– Structure,
– Applications,
– Training methods,
– Objectives.
Cost of various activities
● Assignments: 40%
● Final project: 40%
● Exam: 20%
● If you have already completed a similar NLP course and/or have at least a workshop-level publication at a major NLP conference, you can do a final project worth 80% and skip the assignments.
– The topic will be provided by an instructor (less freedom in topic choice).
– The load is expected to be the same as Assignments + Final project.
Prerequisites
● Basic concepts from Calculus, Linear Algebra, Probability, Statistics, and Computer Science.
● Fundamentals of Machine Learning:
– Recommended machine learning courses: https://www.coursera.org/learn/machine-learning, http://cs229.stanford.edu
– … or an analogous course on ML and DL at Skoltech!
● Python programming language:
– Programming assignments are in Python;
– De facto standard for ML/DL/NLP.
● This is NOT a generic machine learning / deep learning course:
– Some introductory lectures will give a reminder of the basics, though;
– We rather focus on specific architectures of neural networks in NLP.
Outline of the course topics
Lecture logistics
● 45 minutes of lecture
● 10 minutes break
● 45 minutes of lecture
● 10 minutes break
● 45 minutes of lecture
Let us dive right in!
Image source: http://fastml.com/introduction-to-pointer-networks
Lecture 1
Part 2: Motivation for the course: neural NLP models, the “neural revolution” in NLP.
Natural Language
● Language is what makes us different from other living beings:
– Allowing the sharing and accumulation of knowledge;
– Allowing us to organize a society in a complex way;
– ...
Image source: Wikipedia
Natural Language
Images source: Wikipedia
Natural Language Processing (NLP)
● NLP is a subfield of Artificial Intelligence (AI) which relies on:
– Computer Science (recently, most notably, machine learning)
– Linguistics
● The goal is to make computers understand and generate natural language to perform useful tasks, such as:
– Translating a text from one language to another, e.g. Yandex Translate
– Searching for and extracting information
  ● Search engines, e.g. Google
  ● Question answering systems, e.g. IBM Watson
– Dialogue systems
  ● Answer questions, execute voice commands, voice typing
  ● Samsung Bixby, Apple Siri, Google Assistant, etc.
● Language understanding is an “AI-complete” problem:
– we hope to train computers to extract the signal relevant to a particular task
More NLP Applications
● Dialog systems for customer support
● Sentiment analysis
● Topic categorization
● Spell checking
● Summarization
● Fact extraction
Traditional NLP Pipeline
Source of the slide: Socher & Manning, cs224n
A glance at the history of Natural Language Processing
A part of the table of contents of the Jurafsky & Martin (2009) textbook, augmented with points 1.6.7 and 1.6.8
ML vs. DL: Function family F?
Source: Socher, Manning. CS224n, 2017
Good old-fashioned ML
Source: Socher, Manning. CS224n, 2017
Deep Learning
Source: Socher, Manning. CS224n, 2017
Why Deep Learning?
Source: Socher, Manning. CS224n, 2017
Why now?
Source: Socher, Manning. CS224n, 2017
Speech recognition
Source: Hinton, Neural Networks for Machine Learning @ Coursera, 2012 (Lecture 1, slide 13)
>30% WER improvement
Speech recognition
Source: Hinton, Bengio & LeCun, Deep Learning, NIPS’2015 Tutorial, slide 69
ImageNet
● > 1.4M images from the web, 1000 classes
NVIDIA CES 2016 Press Conference, slide 10
● Krizhevsky, Sutskever, Hinton, 2012:
– 74.2% → 83.6% Top-5 accuracy
– 25.8% → 16.4% Top-5 error rate
– 36% error reduction (fixed every third error)
ImageNet Top-5 Error Rate
● Human error:
– 5.1% (trained and patient)
– 15% (non-trained, less patient)
● Best result in 2016: 3.08% (Inception-v4 + 3×ResNet ensemble)
[Fei-Fei Li & Justin Johnson & Serena Yeung, cs231n, 2017. Lecture 1]
[Andrej Karpathy, What I learned from competing against a ConvNet on ImageNet, 2014]
ImageNet – Learnt features
Matthew D. Zeiler and Rob Fergus, Visualizing and Understanding Convolutional Networks
What does BERT learn about the structure of language?
Source: Jawahar G., Sagot B., Seddah D. What does BERT learn about the structure of language? ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Jul 2019, Florence, Italy
The ongoing “neural revolution” in NLP: from Collobert to BERT
What problems Neural NLP is addressing:
● The need for feature engineering.
● The curse of dimensionality:
– SVD and NMF can be used to obtain embeddings, but these algorithms do not scale well to large datasets.
● The need to develop a custom algorithm / model for each task separately.
– The idea is rather to develop a single model for any NLP task.
A simpler and more generic NLP pipeline
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
A simpler and more generic NLP pipeline … which yields good results
Step 1: Embed
An embedding table maps long, sparse, binary vectors into shorter, dense, continuous vectors.
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
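To make the step concrete, here is a minimal sketch of an embedding lookup in PyTorch (the course's framework); the toy vocabulary and the dimensions are illustrative assumptions, not from the slides.

```python
import torch
import torch.nn as nn

# Toy vocabulary (an assumption for illustration)
vocab = {"<pad>": 0, "mary": 1, "slapped": 2, "the": 3, "green": 4, "witch": 5}

# The embedding table: one dense 8-dimensional vector per vocabulary entry
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

token_ids = torch.tensor([[1, 2, 3, 4, 5]])  # one sentence of 5 token ids
vectors = embed(token_ids)                   # shape: (1, 5, 8), dense and continuous
print(vectors.shape)
```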
A simpler and more generic NLP pipeline … which yields good results
Step 2: Encode
Given a sequence of word vectors, the encode step computes a representation that I'll call a sentence matrix, where each row represents the meaning of each token in the context of the rest of the sentence.
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
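A minimal sketch of the encode step, assuming PyTorch and a bidirectional LSTM (one of several possible encoders): it turns the embedded tokens into a sentence matrix with one contextualized row per token.

```python
import torch
import torch.nn as nn

emb = torch.randn(1, 5, 8)                       # stand-in for the embed step's output
encoder = nn.LSTM(input_size=8, hidden_size=16,
                  batch_first=True, bidirectional=True)

# Each row of the output mixes a token's embedding with its left/right context
sentence_matrix, _ = encoder(emb)                # shape: (1, 5, 32)
print(sentence_matrix.shape)
```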
A simpler and more generic NLP pipeline … which yields good results
Step 3: Attend
The attend step reduces the matrix representation produced by the encode step to a single vector, so that it can be passed on to a standard feed-forward network for prediction.
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
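A minimal sketch of the attend step; the particular scoring scheme here (a learned linear scorer followed by a softmax-weighted sum) is an illustrative assumption.

```python
import torch
import torch.nn as nn

sentence_matrix = torch.randn(1, 5, 32)                   # from the encode step

scorer = nn.Linear(32, 1)                                 # one scalar score per token
weights = torch.softmax(scorer(sentence_matrix), dim=1)   # (1, 5, 1), sums to 1 over tokens
sentence_vector = (weights * sentence_matrix).sum(dim=1)  # (1, 32): matrix reduced to a vector
print(sentence_vector.shape)
```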
A simpler and more generic NLP pipeline … which yields good results
Step 4: Predict
Once the text or pair of texts has been reduced into a single vector, we can learn the target representation — a class label, a real value, a vector, etc.
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
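A minimal sketch of the predict step, assuming a binary classification target: a small feed-forward network maps the single sentence vector to class scores.

```python
import torch
import torch.nn as nn

sentence_vector = torch.randn(1, 32)             # from the attend step

# Standard feed-forward network; 2 output classes is an assumption
predictor = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))
logits = predictor(sentence_vector)              # (1, 2) class scores
print(logits.softmax(dim=-1))                    # predicted class probabilities
```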
MT vs. Human translation
https://www.eff.org/ai/metrics#Translation
Google Neural Machine Translation (NMT) System
Source: Socher, Manning. CS224n, 2017
GLUE benchmark
Source: Wang et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2019
GLUE leaderboard
Source: https://gluebenchmark.com/leaderboard
SuperGLUE leaderboard
Source: https://super.gluebenchmark.com/leaderboard
Lecture 1
Part 3: A short introduction to NLP.
Materials in this part are adapted from: Rao, D. & McMahan, B. (2019): Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O’Reilly, 1st Edition. ISBN-13: 978-1491978238
A Quick Tour of Traditional NLP
● Natural language processing (NLP) and computational linguistics (CL) are two areas of the computational study of human language:
– NLP: how to build a technical system that knows something about (i.e., performs processing of) human language, solving practical problems involving language, such as:
  ● information extraction;
  ● automatic speech recognition;
  ● machine translation;
  ● sentiment analysis;
  ● question answering;
  ● summarization.
– CL: how to learn about some aspect of language using various mathematical and computational methods, models, and algorithms; it employs computational methods to understand the properties of human language:
  ● How do we understand language?
  ● How do we produce language?
  ● How do we learn languages?
  ● What relationships do languages have with one another?
Corpora, Tokens, and Types
● NLP methods, be they classic or modern, begin with a text dataset, also called a corpus (plural: corpora).
– A corpus usually contains raw text (in ASCII or UTF-8) and any metadata associated with the text.
● The raw text is a sequence of characters (bytes), but it is usually useful to group those characters into contiguous units called tokens.
● Types are the unique tokens present in a corpus. The set of all types in a corpus is its vocabulary or lexicon.
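A tiny illustration of the distinction, assuming whitespace tokenization (a simplification; proper tokenizers are discussed below):

```python
# Tokens vs. types: whitespace splitting is a simplifying assumption here.
text = "the green witch slapped the other green witch"
tokens = text.split()            # 8 tokens
types = set(tokens)              # 5 types; this set is the vocabulary
print(len(tokens), len(types))   # -> 8 5
```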
Tokenization
● The process of breaking a text down into tokens is called tokenization.
– There are six tokens in the sentence “Mary slapped the green witch.”; “.” is one of them.
– Tokenization can become more complicated than simply splitting text on non-alphanumeric characters.
Tokenization: the case of Turkish
Tokenization: Twitter data
● Tokenizing tweets involves preserving hashtags and @handles, and segmenting smileys such as :-) and URLs as single units.
● These decisions can significantly affect accuracy in practice!
Tokenization
● Using spaCy
● Using NLTK
Both are shown in the sketch below.
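A minimal sketch of both options (assuming spaCy with the en_core_web_sm model installed and the NLTK punkt data downloaded); NLTK's TweetTokenizer illustrates the Twitter-specific choices mentioned above. The example sentences are made up for illustration.

```python
import spacy
from nltk.tokenize import word_tokenize, TweetTokenizer

text = "Mary, don't slap the green witch!"

# spaCy: the tokenizer comes bundled with a language model
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp(text)])

# NLTK: word_tokenize (requires nltk.download('punkt'))
print(word_tokenize(text))

# TweetTokenizer keeps #hashtags, @handles, and smileys intact
tweet = "@midnight loving the new movie #great :-)"
print(TweetTokenizer().tokenize(tweet))
```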
Feature engineering
● Feature engineering is the process of understanding the linguistics of a language and applying it to solve an NLP problem.
● This is something we keep to a minimum in neural NLP, for:
– portability of models across languages;
– applicability to more tasks;
– avoiding the need for expert knowledge.
● When building real-world production systems, feature engineering is indispensable, despite recent claims to the contrary.
– Will this change in the future?
Unigrams, Bigrams, Trigrams, …, N-grams
● N-grams are fixed-length (n) consecutive token sequences occurring in the text:
– A bigram has two tokens;
– a unigram has one token, etc.
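A hypothetical helper (not from the slides) showing how n-grams are generated from a token list:

```python
def ngrams(tokens, n):
    """All consecutive n-token subsequences of the input."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = ["mary", "slapped", "the", "green", "witch"]
print(ngrams(tokens, 2))  # bigrams: [['mary', 'slapped'], ['slapped', 'the'], ...]
```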
Unigrams, Bigrams, Trigrams, …, N-grams
● When subword information itself carries useful information, one might want to generate character N-grams:
– For example, the suffix “ol” in “methanol” indicates it is a kind of alcohol.
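The same idea applies at the character level; a sketch of a hypothetical helper:

```python
def char_ngrams(word, n):
    """All consecutive n-character substrings of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("methanol", 3))  # ['met', 'eth', 'tha', 'han', 'ano', 'nol']
```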
Lemmas and Stems
● Lemmas are the root forms of words.
● The verb fly can be inflected into many different word forms: flies, flew, flown, flying.
● Lemmatization reduces tokens to their lemmas, e.g., to keep the dimensionality of the vector representation low.
Lemmas and Stems
● Stemming uses handcrafted rules to strip the endings of words, reducing them to a common form called a stem.
– Cons: quality; the “poor man’s lemmatization”.
– Pros: efficiency; it was (and is) popular in information retrieval for this reason.
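A minimal sketch contrasting the two with NLTK (assuming the wordnet data has been downloaded); note how the stemmer can produce non-words while the lemmatizer returns the dictionary form.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()   # requires nltk.download('wordnet')

for word in ["flies", "flew", "flown", "flying"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
# e.g. "flies" -> stem "fli" (not a word), lemma "fly"
```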
Categorizing Sentences and Documents
● One of the earliest applications of NLP:
– Topic categorization, predicting the sentiment of reviews, filtering spam emails, language identification, etc.
Categorizing Sentences and Documents: TF representation
TF-IDF representation: TF(w) · IDF(w)
● The TF representation weights a word w proportionally to its frequency:
– Common words do not add anything to understanding.
– A rare word is likely to be indicative.
● TF-IDF penalizes common tokens and rewards rare tokens in the vector representation:
– IDF(w) = log(N / nw), where nw is the number of documents containing the word w and N is the total number of documents.
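A minimal sketch of TF and TF-IDF document vectors with scikit-learn (an assumed dependency; note that sklearn's IDF adds smoothing terms, so its numbers differ slightly from the plain formula above). The corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["mary slapped the green witch",
          "the witch flew away",
          "mary saw the witch fly"]

tf = CountVectorizer().fit_transform(corpus)     # raw term-frequency counts
tfidf = TfidfVectorizer().fit_transform(corpus)  # TF-IDF weights per document
print(tf.toarray())
print(tfidf.toarray().round(2))
```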
Categorizing Words: POS Tagging
● One can label not only documents but also individual words or tokens:
– Part-of-speech (POS) tagging
– Morphological analysis, etc.
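A minimal sketch of POS tagging with spaCy (assuming the en_core_web_sm model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("Mary slapped the green witch."):
    print(token.text, token.pos_)   # e.g. Mary/PROPN, slapped/VERB, ...
```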
Categorizing Spans: Chunking and Named Entity Recognition
● Label a span of text, i.e., a contiguous multi-token sequence:
– Chunking: [NP Mary] [VP slapped] [the green witch]
– Named entity recognition: [PER Mary Johnson] slapped the green witch
Categorizing Spans: Chunking and Named Entity Recognition
● Chunking and named entity recognition in code: see the sketch below.
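A minimal sketch with spaCy (assuming en_core_web_sm; a larger model may recognize more entities):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mary Johnson slapped the green witch.")

print([chunk.text for chunk in doc.noun_chunks])     # noun-phrase chunks
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. ('Mary Johnson', 'PERSON')
```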
Structure of sentences: identifying relations between phrases
A constituent parse of the sentence “Mary slapped the green witch.”
Structure of sentences: identifying relations between phrases
A dependency parse of the sentence “Mary slapped the green witch.”
Word Senses and Semantics
● Words can have multiple senses:
– WordNet
– Automatic discovery of senses from context
– ...
Lecture 1
Part 4: The text classification task and a simple model to solve it: Naive Bayes.
Materials in this part are adapted from: Jurafsky & Martin (2019): Speech and Language Processing (3rd edition). https://web.stanford.edu/~jurafsky/slp3/
Who wrote which Federalist papers?
● 1787–88: anonymous essays (by Jay, Madison, and Hamilton) try to convince New York to ratify the U.S. Constitution.
● The authorship of 12 of the letters is in dispute.
● 1963: solved by Mosteller and Wallace using Bayesian methods.
(Portraits: James Madison and Alexander Hamilton)
Positive or negative movie review?
● Unbelievably disappointing
● Full of zany characters and richly applied satire, and some great plot twists
● This is the greatest screwball comedy ever filmed
● It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?
MEDLINE Article → ? → MeSH Subject Category Hierarchy:
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Text Classification
● Assigning subject categories, topics, or genres
● Spam detection
● Authorship identification
● Age/gender identification
● Language identification
● Sentiment analysis
● …
Text Classification: definition
Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
● Rules based on combinations of words or other features
– spam: black-list-address OR (“dollars” AND “have been selected”)
● Accuracy can be high
– If the rules are carefully refined by an expert
● But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  • a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
● Any kind of classifier:
– Naïve Bayes
– Logistic regression
– Support-vector machines
– k-Nearest Neighbors
– …
– Deep neural networks
Naïve Bayes Intuition
● Simple (“naïve”) classification method based on Bayes rule
● Relies on a very simple representation of the document:
– Bag of words
The Bag of Words Representation
Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c:
  P(c | d) = P(d | c) · P(c) / P(d)
Naïve Bayes Classifier
cMAP = argmax_{c ∈ C} P(c | d)                      (MAP is “maximum a posteriori” = the most likely class)
     = argmax_{c ∈ C} P(d | c) · P(c) / P(d)        (Bayes rule)
     = argmax_{c ∈ C} P(d | c) · P(c)               (dropping the denominator)
     = argmax_{c ∈ C} P(x1, x2, …, xn | c) · P(c)   (document d represented as features x1..xn)
Naïve Bayes Classifier
cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) · P(c)
● P(c): how often does this class occur? We can just count the relative frequencies in a corpus.
● P(x1, x2, …, xn | c): has O(|X|^n · |C|) parameters; it could only be estimated if a very, very large number of training examples was available.
Multinomial Naïve Bayes Independence Assumptions
P(x1, x2, …, xn | c)
• Bag of Words assumption: assume position doesn’t matter.
• Conditional Independence: assume the feature probabilities P(xi | c) are independent given the class c:
  P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
Multinomial Naïve Bayes Classifier
cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) · P(c)
cNB = argmax_{cj ∈ C} P(cj) · ∏_{x ∈ X} P(x | cj)
Applying multinomial Naive Bayes classifiers to text classification, with positions = all word positions in the test document:
cNB = argmax_{cj ∈ C} P(cj) · ∏_{i ∈ positions} P(xi | cj)
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates, i.e., simply use the frequencies in the data:
  P̂(cj) = doccount(C = cj) / Ndoc
  P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)
  (the fraction of times word wi appears among all words in documents of topic cj)
• Create a mega-document for topic j by concatenating all docs in this topic;
• use the frequency of w in the mega-document.
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?
  P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0
• Zero probabilities cannot be conditioned away, no matter the other evidence!
  cMAP = argmax_c P̂(c) · ∏_i P̂(xi | c)
Laplace (add-1) smoothing for Naïve Bayes
P̂(wi | c) = count(wi, c) / Σ_{w ∈ V} count(w, c)
becomes, with add-1 smoothing:
P̂(wi | c) = (count(wi, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
Multinomial Naïve Bayes: Learning
• From the training corpus, extract the Vocabulary.
• Calculate the P(cj) terms:
  – For each cj in C: docsj ← all docs with class = cj
  – P(cj) = |docsj| / |total # documents|
• Calculate the P(wk | cj) terms (with add-α smoothing):
  – nk ← count of wk in the concatenation of all docs in docsj; n ← its total token count
  – P(wk | cj) = (nk + α) / (n + α · |Vocabulary|)
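Putting the formulas together, a from-scratch sketch of training and prediction for multinomial Naive Bayes with add-1 smoothing (toy data; log-probabilities avoid underflow when multiplying many small terms):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class) pairs."""
    classes = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    for tokens, c in docs:
        word_counts[c].update(tokens)            # "mega-document" per class
    vocab = {w for tokens, _ in docs for w in tokens}
    log_prior = {c: math.log(n / len(docs)) for c, n in classes.items()}
    log_lik = {}
    for c in classes:
        total = sum(word_counts[c].values())
        # add-1 (Laplace) smoothing: no zero probabilities for in-vocabulary words
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(model, tokens):
    log_prior, log_lik, vocab = model
    # sum of log-probabilities = log of the product in the formula above
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

model = train_nb([("great fun great plot".split(), "pos"),
                  ("boring pathetic plot".split(), "neg")])
print(predict_nb(model, "great plot".split()))   # -> 'pos'
```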
Summary: Naive Bayes is Not So Naive
● Very fast; low storage requirements.
● Robust to irrelevant features:
– Irrelevant features cancel each other out without affecting results.
● Very good in domains with many equally important features.
● Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem.
● A good, dependable baseline for text classification.
● But we will see other classifiers that give better accuracy.
Evaluation: Precision and Recall
● The 2-by-2 contingency table:

                correct   not correct
  selected      tp        fp
  not selected  fn        tn

● Precision: % of selected items that are correct: P = tp / (tp + fp)
● Recall: % of correct items that are selected: R = tp / (tp + fn)
Evaluation: F1 score
• A combined measure that assesses the P/R tradeoff is the F measure (a weighted harmonic mean):
  F = 1 / (α · (1/P) + (1 − α) · (1/R)) = (β² + 1) · P · R / (β² · P + R)
• The harmonic mean is a very conservative average.
• People usually use the balanced F1 measure, i.e., with β = 1 (that is, α = ½): F1 = 2PR / (P + R)
Evaluation: Confusion matrix c
• For each pair of classes <c1, c2>: how many documents from c1 were incorrectly assigned to c2?
  – e.g., c3,2 = 90: 90 wheat documents were incorrectly assigned to poultry.

Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK                 95              1                 13                0                  1                 0
True poultry             0              1                  0                0                  0                 0
True wheat              10             90                  0                1                  0                 0
True coffee              0              0                  0               34                  3                 7
True interest            -              1                  2               13                 26                 5
True trade               0              0                  2               14                  5                10
Evaluation: per-class measures
Recall: fraction of docs in class i classified correctly:
  Ri = cii / Σj cij
Precision: fraction of docs assigned class i that are actually about class i:
  Pi = cii / Σj cji
Accuracy (1 − error rate): fraction of docs classified correctly:
  Σi cii / Σi Σj cij
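A minimal sketch computing these per-class measures and accuracy from a confusion matrix c (rows = true class, columns = assigned class); the 3-class matrix is an excerpt of the table above.

```python
def per_class_metrics(c):
    """c[i][j] = number of class-i docs assigned to class j."""
    n = len(c)
    total = sum(sum(row) for row in c)
    accuracy = sum(c[i][i] for i in range(n)) / total
    for i in range(n):
        precision = c[i][i] / max(sum(c[j][i] for j in range(n)), 1)  # column sum
        recall = c[i][i] / max(sum(c[i]), 1)                          # row sum
        print(f"class {i}: P={precision:.2f} R={recall:.2f}")
    print(f"accuracy={accuracy:.2f}")

per_class_metrics([[95, 1, 13],
                   [0, 1, 0],
                   [10, 90, 0]])
```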
Development Test Sets and Cross-validation
• Metric: P/R/F1 or Accuracy
• Unseen test set:
  • avoids overfitting (“tuning to the test set”);
  • gives a more conservative estimate of performance.
• Cross-validation over multiple splits:
  • handles sampling errors from different datasets;
  • pool the results over each split;
  • compute the pooled dev set performance.
(Diagram: the data is split into Training Set / Dev Test Set / Test Set; under cross-validation, the Dev Test portion rotates across splits while the Test Set stays held out.)
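A minimal sketch of cross-validation with scikit-learn (an assumed dependency; the dataset and categories are illustrative), pooling F1 over five splits while the test set stays untouched:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Two-class text classification problem (downloads the data on first run)
data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos"])

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 5-fold cross-validation on the training data; the test subset stays unseen
scores = cross_val_score(model, data.data, data.target, cv=5, scoring="f1")
print(scores.mean(), scores.std())
```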