Neural Natural Language Processing
Lecture 1: Introduction to natural language processing and text
categorization
29.10.19
Plan of the lecture
● Part 1: About the course: logistics, organization, materials, etc.
● Part 2: Motivation for the course: neural NLP models, the “neural revolution” in NLP.
● Part 3: A short introduction to NLP.
● Part 4: The text classification task and a simple model to solve it: Naive Bayes.
Lecture 1
Part 1: About the course: logistics, organization, materials, etc.
Acknowledgments
● Based on the materials of the following courses:
– Lectures and assignments are adapted from the “Neural Networks for Natural Language Processing” course by Nikolay Arefyev (Samsung Moscow Research Center and Moscow State University).
– Seminars are adapted from various sources, notably the NLP course of the Yandex School of Data Analysis.
– Additional sources will be indicated as needed.
Instructors
Lectures:
● Prof. Alexander Panchenko, Skoltech
● Dr. Nikolay Arefyev, Samsung / Moscow State University
Seminars, assignments:
● Dr. Artem Shelmanov, Skoltech
● Dr. Varvara Logacheva, Skoltech
● Olga Kozlova, MTS Innovation Center
● Viktoria Chekalina, Skoltech / Philips Innovation Center
● Irina Nikishina, Skoltech
● Daryna Dementieva, Skoltech
Final projects:
● Olga Kozlova, Alexander Panchenko, ...
Tentative schedule of the class
Assignments
● A Kaggle-style competition for the best F-score
● One task (sentiment analysis), different models
Assignments
● Sentiment analysis using Naive Bayes classifier.
● Sentiment analysis using Logistic Regression and a Feedforward Neural Network.
● Sentiment analysis using word and document embeddings.
● Sentiment analysis using RNNs.
● Sentiment analysis using BERT or ELMo.
The assignments are ordered by increasing model complexity (and, presumably, performance).
Assignments
Evaluation criteria:
● Results: what was the rank of your solution among other submissions?
● Reproducibility: the possibility to get the results using your script.
● Readability: how easy is it to understand your code?
● Timing: did you deliver in time?
Final project
Various options:
● Find an interesting task and propose a (neural) NLP model to solve it.
● Propose a new NLP task, or a variant of an existing one, and come up with a baseline for its solution.
● Take a recently published NLP paper and replicate its results. Discuss the outcomes.
Final project
● The list of topics can be found here: http://bit.ly/nnlp_topics
– To be further extended.
● Can be done in a group of up to 3 people.
● You can propose your own topic as well.
– To propose a topic, enter your name and topic here: http://bit.ly/nnlp_topics_distribution
● It is advised to ask an instructor during a seminar about the suitability of a topic (but this is not a strict requirement).
Final project
Requirements:
● The outcome of a project is a Jupyter notebook which describes the entire experiment:
– It should be readable (with supporting text: task, motivation, discussion);
– It should be executable: we should be able to reproduce your results on the first try.
● Due to time constraints, there is no oral presentation. Instead, communicate what you have done in code, text, formulas, tables, and plots.
● Deadline: 19.12.2019 EoD.
● We suggest starting ASAP!
Final project
Evaluation criteria:
● Relevance of the task: are you tackling a relevant research problem? Did you do something that has not been done yet (at least in some aspect), or was a solution already available on GitHub before you started?
● Readability: can we easily understand what has been done?
● Reproducibility: can we get the same numbers and plots?
● Results: did you manage to improve something (or gain some interesting insights from negative results)?
● Originality: how innovative was your approach?
● Timing: did you deliver in time?
Exam
● It is not obvious how to organize an exam in our case:
– e.g., the Deep Learning course has no exam.
● Mostly questions about various models:
– Structure,
– Applications,
– Training methods,
– Objectives.
Cost of various activities
● Assignments: 40%
● Final project: 40%
● Exam: 20%
● If you have already completed a similar NLP course and/or have at least a workshop-level publication at a major NLP conference, you can do a final project worth 80% and skip the assignments.
– The topic will be provided by an instructor (less freedom in topic choice).
– The load is expected to be the same as Assignments + Final project.
Prerequisites
● Basic concepts from Calculus, Linear Algebra, Probability, Statistics, and Computer Science.
● Fundamentals of Machine Learning:
– Recommended machine learning courses: https://www.coursera.org/learn/machine-learning, http://cs229.stanford.edu
– … or an analogous course on ML and DL at Skoltech!
● Python programming language:
– Programming assignments are in Python;
– De facto standard for ML/DL/NLP.
● This is NOT a generic machine learning / deep learning course:
– Some introductory lectures will give a reminder of the basics, though;
– We rather focus on specific architectures of neural networks in NLP.
Outline of the course topics
Lecture logistics
● 45 minutes of lecture
● 10 minutes break
● 45 minutes of lecture
● 10 minutes break
● 45 minutes of lecture
Let us dive right in!
Image source: http://fastml.com/introduction-to-pointer-networks
Lecture 1
Part 2: Motivation for the course: neural NLP models, the “neural revolution” in NLP.
Natural Language
● Language is what makes us different from other living beings:
– Allowing the sharing and accumulation of knowledge;
– Allowing us to organize a society in a complex way;
– ...
Image source: Wikipedia
Natural Language
Images source: Wikipedia
Natural Language Processing (NLP)
● NLP is a subfield of Artificial Intelligence (AI) which relies on:
– Computer Science (recently, most notably, machine learning)
– Linguistics
● The goal is to make computers understand and generate natural language to perform useful tasks, such as:
– Translating a text from one language to another, e.g. Yandex Translate
– Searching for and extracting information
  ● Search engines, e.g. Google
  ● Question answering systems, e.g. IBM Watson
– Dialogue systems
  ● Answer questions, execute voice commands, voice typing
  ● Samsung Bixby, Apple Siri, Google Assistant, etc.
● Language understanding is an “AI-complete” problem:
– we hope to train computers to extract the signal relevant to a particular task
More NLP Applications
● Dialog systems for customer support
● Sentiment analysis
● Topic categorization
● Spell checking
● Summarization
● Fact extraction
Traditional NLP Pipeline
Source of the slide: Socher & Manning, cs224n
A glance at the history of Natural Language Processing
A part of the table of contents of the Jurafsky & Martin (2009) textbook, augmented with points 1.6.7 and 1.6.8
ML vs. DL: Function family F?
Source: Socher, Manning. CS224n, 2017
Good old-fashioned ML
Source: Socher, Manning. CS224n, 2017
Deep Learning
Source: Socher, Manning. CS224n, 2017
Why Deep Learning?
Source: Socher, Manning. CS224n, 2017
Why now?
Source: Socher, Manning. CS224n, 2017
Speech recognition
Source: Hinton, Neural Networks for Machine Learning @ Coursera, 2012 (Lecture 1, slide 13)
>30% WER improvement
Speech recognition
Source: Hinton, Bengio & LeCun, Deep Learning, NIPS’2015 Tutorial, slide 69
ImageNet
● > 1.4M images from the web, 1000 classes
NVIDIA CES 2016 Press Conference, slide 10
● Krizhevsky, Sutskever, Hinton, 2012:
– 74.2% → 83.6% Top-5 accuracy
– 25.8% → 16.4% Top-5 error rate
– 36% error reduction (fixed every third error)
ImageNet Top-5 Error Rate
● Human error:
– 5.1% (trained and patient)
– 15% (non-trained, less patient)
● Best result in 2016: 3.08% (Inception-v4 + 3×ResNet ensemble)
[Fei-Fei Li & Justin Johnson & Serena Yeung, cs231n, 2017. Lecture 1]
[Andrej Karpathy, What I learned from competing against a ConvNet on ImageNet, 2014]
ImageNet – Learnt features
Matthew D. Zeiler and Rob Fergus, Visualizing and Understanding Convolutional Networks
What does BERT learn about the structure of language?
Source: Jawahar G., Sagot B., Seddah D. What does BERT learn about the structure of language? ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Jul 2019, Florence, Italy
The ongoing “neural revolution” in NLP: from Collobert to BERT
What problems Neural NLP is addressing:
● The need for feature engineering.
● The curse of dimensionality:
– SVD and NMF can be used to obtain embeddings, but these algorithms do not scale well to large datasets.
● The need to develop a custom algorithm / model for each task separately.
– The idea is rather to develop a single model for any NLP task.
A simpler and more generic NLP pipeline
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
A simpler and more generic NLP pipeline … which yields good results
Step 1: Embed
An embedding table maps long, sparse, binary vectors into shorter, dense, continuous vectors.
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
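To make the step concrete, here is a minimal sketch of an embedding lookup in PyTorch (the course's framework); the toy vocabulary and the dimensions are illustrative assumptions, not from the slides.

```python
import torch
import torch.nn as nn

# Toy vocabulary (an assumption for illustration)
vocab = {"<pad>": 0, "mary": 1, "slapped": 2, "the": 3, "green": 4, "witch": 5}

# The embedding table: one dense 8-dimensional vector per vocabulary entry
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)

token_ids = torch.tensor([[1, 2, 3, 4, 5]])  # one sentence of 5 token ids
vectors = embed(token_ids)                   # shape: (1, 5, 8), dense and continuous
print(vectors.shape)
```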
A simpler and more generic NLP pipeline … which yields good results
Step 2: Encode
Given a sequence of word vectors, the encode step computes a representation that I'll call a sentence matrix, where each row represents the meaning of each token in the context of the rest of the sentence.
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
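A minimal sketch of the encode step, assuming PyTorch and a bidirectional LSTM (one of several possible encoders): it turns the embedded tokens into a sentence matrix with one contextualized row per token.

```python
import torch
import torch.nn as nn

emb = torch.randn(1, 5, 8)                       # stand-in for the embed step's output
encoder = nn.LSTM(input_size=8, hidden_size=16,
                  batch_first=True, bidirectional=True)

# Each row of the output mixes a token's embedding with its left/right context
sentence_matrix, _ = encoder(emb)                # shape: (1, 5, 32)
print(sentence_matrix.shape)
```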
A simpler and more generic NLP pipeline … which yields good results
Step 3: Attend
The attend step reduces the matrix representation produced by the encode step to a single vector, so that it can be passed on to a standard feed-forward network for prediction.
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
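A minimal sketch of the attend step; the particular scoring scheme here (a learned linear scorer followed by a softmax-weighted sum) is an illustrative assumption.

```python
import torch
import torch.nn as nn

sentence_matrix = torch.randn(1, 5, 32)                   # from the encode step

scorer = nn.Linear(32, 1)                                 # one scalar score per token
weights = torch.softmax(scorer(sentence_matrix), dim=1)   # (1, 5, 1), sums to 1 over tokens
sentence_vector = (weights * sentence_matrix).sum(dim=1)  # (1, 32): matrix reduced to a vector
print(sentence_vector.shape)
```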
A simpler and more generic NLP pipeline … which yields good results
Step 4: Predict
Once the text or pair of texts has been reduced into a single vector, we can learn the target representation — a class label, a real value, a vector, etc.
Source: https://explosion.ai/blog/deep-learning-formula-nlp?ref=Welcome.AI
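A minimal sketch of the predict step, assuming a binary classification target: a small feed-forward network maps the single sentence vector to class scores.

```python
import torch
import torch.nn as nn

sentence_vector = torch.randn(1, 32)             # from the attend step

# Standard feed-forward network; 2 output classes is an assumption
predictor = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 2))
logits = predictor(sentence_vector)              # (1, 2) class scores
print(logits.softmax(dim=-1))                    # predicted class probabilities
```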
MT vs. Human translation
https://www.eff.org/ai/metrics#Translation
Google Neural Machine Translation (NMT) System
Source: Socher, Manning. CS224n, 2017
GLUE benchmark
Source: Wang et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, 2019
GLUE leaderboard
Source: https://gluebenchmark.com/leaderboard
SuperGLUE leaderboard
Source: https://super.gluebenchmark.com/leaderboard
Lecture 1
Part 3: A short introduction to NLP.
Materials in this part are adapted from: Rao, D. & McMahan, B. (2019): Natural Language Processing with PyTorch: Build Intelligent Language Applications Using Deep Learning. O’Reilly, 1st Edition. ISBN-13: 978-1491978238
A Quick Tour of Traditional NLP
● Natural language processing (NLP) and computational linguistics (CL) are two areas of the computational study of human language:
– NLP: how to build a technical system that knows something about (i.e., performs processing of) human language, solving practical problems involving language, such as:
  ● information extraction;
  ● automatic speech recognition;
  ● machine translation;
  ● sentiment analysis;
  ● question answering;
  ● summarization.
– CL: how to learn about some aspect of language using various mathematical and computational methods, models, and algorithms; it employs computational methods to understand the properties of human language:
  ● How do we understand language?
  ● How do we produce language?
  ● How do we learn languages?
  ● What relationships do languages have with one another?
Corpora, Tokens, and Types
● NLP methods, be they classic or modern, begin with a text dataset, also called a corpus (plural: corpora).
– A corpus usually contains raw text (in ASCII or UTF-8) and any metadata associated with the text.
● The raw text is a sequence of characters (bytes), but it is usually useful to group those characters into contiguous units called tokens.
● Types are the unique tokens present in a corpus. The set of all types in a corpus is its vocabulary or lexicon.
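A tiny illustration of the distinction, assuming whitespace tokenization (a simplification; proper tokenizers are discussed below):

```python
# Tokens vs. types: whitespace splitting is a simplifying assumption here.
text = "the green witch slapped the other green witch"
tokens = text.split()            # 8 tokens
types = set(tokens)              # 5 types; this set is the vocabulary
print(len(tokens), len(types))   # -> 8 5
```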
Tokenization
● The process of breaking a text down into tokens is called tokenization.
– There are six tokens in the sentence “Mary slapped the green witch.”; “.” is one of them.
– Tokenization can become more complicated than simply splitting text on non-alphanumeric characters.
Tokenization: the case of Turkish
Tokenization: Twitter data
● Tokenizing tweets involves preserving hashtags and @handles, and segmenting smileys such as :-) and URLs as single units.
● These decisions can significantly affect accuracy in practice!
Tokenization
● Using spaCy
● Using NLTK
Both are shown in the sketch below.
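A minimal sketch of both options (assuming spaCy with the en_core_web_sm model installed and the NLTK punkt data downloaded); NLTK's TweetTokenizer illustrates the Twitter-specific choices mentioned above. The example sentences are made up for illustration.

```python
import spacy
from nltk.tokenize import word_tokenize, TweetTokenizer

text = "Mary, don't slap the green witch!"

# spaCy: the tokenizer comes bundled with a language model
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp(text)])

# NLTK: word_tokenize (requires nltk.download('punkt'))
print(word_tokenize(text))

# TweetTokenizer keeps #hashtags, @handles, and smileys intact
tweet = "@midnight loving the new movie #great :-)"
print(TweetTokenizer().tokenize(tweet))
```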
Feature engineering
● Feature engineering is the process of understanding the linguistics of a language and applying it to solve an NLP problem.
● This is something we keep to a minimum in neural NLP, for:
– portability of models across languages;
– applicability to more tasks;
– avoiding the need for expert knowledge.
● When building real-world production systems, feature engineering is indispensable, despite recent claims to the contrary.
– Will this change in the future?
Unigrams, Bigrams, Trigrams, …, N-grams
● N-grams are fixed-length (n) consecutive token sequences occurring in the text:
– A bigram has two tokens;
– a unigram has one token, etc.
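A hypothetical helper (not from the slides) showing how n-grams are generated from a token list:

```python
def ngrams(tokens, n):
    """All consecutive n-token subsequences of the input."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = ["mary", "slapped", "the", "green", "witch"]
print(ngrams(tokens, 2))  # bigrams: [['mary', 'slapped'], ['slapped', 'the'], ...]
```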
Unigrams, Bigrams, Trigrams, …, N-grams
● When subword information itself carries useful information, one might want to generate character N-grams:
– For example, the suffix “ol” in “methanol” indicates it is a kind of alcohol.
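The same idea applies at the character level; a sketch of a hypothetical helper:

```python
def char_ngrams(word, n):
    """All consecutive n-character substrings of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("methanol", 3))  # ['met', 'eth', 'tha', 'han', 'ano', 'nol']
```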
Lemmas and Stems
● Lemmas are the root forms of words.
● The verb fly can be inflected into many different word forms: flies, flew, flown, flying.
● Lemmatization reduces tokens to their lemmas, e.g., to keep the dimensionality of the vector representation low.
Lemmas and Stems
● Stemming uses handcrafted rules to strip the endings of words, reducing them to a common form called a stem.
– Cons: quality; the “poor man’s lemmatization”.
– Pros: efficiency; it was (and is) popular in information retrieval for this reason.
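A minimal sketch contrasting the two with NLTK (assuming the wordnet data has been downloaded); note how the stemmer can produce non-words while the lemmatizer returns the dictionary form.

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()   # requires nltk.download('wordnet')

for word in ["flies", "flew", "flown", "flying"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))
# e.g. "flies" -> stem "fli" (not a word), lemma "fly"
```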
Categorizing Sentences and Documents
● One of the earliest applications of NLP:
– Topic categorization, predicting the sentiment of reviews, filtering spam emails, language identification, etc.
Categorizing Sentences and Documents: TF representation
TF-IDF representation: TF(w) · IDF(w)
● The TF representation weights a word w proportionally to its frequency:
– Common words do not add anything to understanding.
– A rare word is likely to be indicative.
● TF-IDF penalizes common tokens and rewards rare tokens in the vector representation:
– IDF(w) = log(N / nw), where nw is the number of documents containing the word w and N is the total number of documents.
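A minimal sketch of TF and TF-IDF document vectors with scikit-learn (an assumed dependency; note that sklearn's IDF adds smoothing terms, so its numbers differ slightly from the plain formula above). The corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["mary slapped the green witch",
          "the witch flew away",
          "mary saw the witch fly"]

tf = CountVectorizer().fit_transform(corpus)     # raw term-frequency counts
tfidf = TfidfVectorizer().fit_transform(corpus)  # TF-IDF weights per document
print(tf.toarray())
print(tfidf.toarray().round(2))
```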
Categorizing Words: POS Tagging
● One can label not only documents but also individual words or tokens:
– Part-of-speech (POS) tagging
– Morphological analysis, etc.
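A minimal sketch of POS tagging with spaCy (assuming the en_core_web_sm model is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("Mary slapped the green witch."):
    print(token.text, token.pos_)   # e.g. Mary/PROPN, slapped/VERB, ...
```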
Categorizing Spans: Chunking and Named Entity Recognition
● Label a span of text, i.e., a contiguous multi-token sequence:
– Chunking: [NP Mary] [VP slapped] [the green witch]
– Named entity recognition: [PER Mary Johnson] slapped the green witch
Categorizing Spans: Chunking and Named Entity Recognition
● Chunking and named entity recognition in code: see the sketch below.
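A minimal sketch with spaCy (assuming en_core_web_sm; a larger model may recognize more entities):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Mary Johnson slapped the green witch.")

print([chunk.text for chunk in doc.noun_chunks])     # noun-phrase chunks
print([(ent.text, ent.label_) for ent in doc.ents])  # e.g. ('Mary Johnson', 'PERSON')
```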
Structure of sentences: identifying relations between phrases
A constituent parse of the sentence “Mary slapped the green witch.”
Structure of sentences: identifying relations between phrases
A dependency parse of the sentence “Mary slapped the green witch.”
Word Senses and Semantics
● Words can have multiple senses:
– WordNet
– Automatic discovery of senses from context
– ...
Lecture 1
Part 4: The text classification task and a simple model to solve it: Naive Bayes.
Materials in this part are adapted from: Jurafsky & Martin (2019): Speech and Language Processing (3rd edition). https://web.stanford.edu/~jurafsky/slp3/
Who wrote which Federalist papers?
● 1787–88: anonymous essays (by Jay, Madison, and Hamilton) try to convince New York to ratify the U.S. Constitution.
● The authorship of 12 of the letters is in dispute.
● 1963: solved by Mosteller and Wallace using Bayesian methods.
(Portraits: James Madison and Alexander Hamilton)
Positive or negative movie review?
● Unbelievably disappointing
● Full of zany characters and richly applied satire, and some great plot twists
● This is the greatest screwball comedy ever filmed
● It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?
MEDLINE Article → ? → MeSH Subject Category Hierarchy:
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Text Classification
● Assigning subject categories, topics, or genres
● Spam detection
● Authorship identification
● Age/gender identification
● Language identification
● Sentiment analysis
● …
Text Classification: definition
Input:
• a document d
• a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
● Rules based on combinations of words or other features
– spam: black-list-address OR (“dollars” AND “have been selected”)
● Accuracy can be high
– If the rules are carefully refined by an expert
● But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output:
  • a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
● Any kind of classifier:
– Naïve Bayes
– Logistic regression
– Support-vector machines
– k-Nearest Neighbors
– …
– Deep neural networks
Naïve Bayes Intuition
● Simple (“naïve”) classification method based on Bayes rule
● Relies on a very simple representation of the document:
– Bag of words
The Bag of Words Representation
Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c:
  P(c | d) = P(d | c) · P(c) / P(d)
Naïve Bayes Classifier
cMAP = argmax_{c ∈ C} P(c | d)                      (MAP is “maximum a posteriori” = the most likely class)
     = argmax_{c ∈ C} P(d | c) · P(c) / P(d)        (Bayes rule)
     = argmax_{c ∈ C} P(d | c) · P(c)               (dropping the denominator)
     = argmax_{c ∈ C} P(x1, x2, …, xn | c) · P(c)   (document d represented as features x1..xn)
Naïve Bayes Classifier
cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) · P(c)
● P(c): how often does this class occur? We can just count the relative frequencies in a corpus.
● P(x1, x2, …, xn | c): has O(|X|^n · |C|) parameters; it could only be estimated if a very, very large number of training examples was available.
Multinomial Naïve Bayes Independence Assumptions
P(x1, x2, …, xn | c)
• Bag of Words assumption: assume position doesn’t matter.
• Conditional Independence: assume the feature probabilities P(xi | c) are independent given the class c:
  P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
Multinomial Naïve Bayes Classifier
cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) · P(c)
cNB = argmax_{cj ∈ C} P(cj) · ∏_{x ∈ X} P(x | cj)
Applying multinomial Naive Bayes classifiers to text classification, with positions = all word positions in the test document:
cNB = argmax_{cj ∈ C} P(cj) · ∏_{i ∈ positions} P(xi | cj)
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates, i.e., simply use the frequencies in the data:
  P̂(cj) = doccount(C = cj) / Ndoc
  P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)
  (the fraction of times word wi appears among all words in documents of topic cj)
• Create a mega-document for topic j by concatenating all docs in this topic;
• use the frequency of w in the mega-document.
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic classified in the topic positive (thumbs-up)?
  P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0
• Zero probabilities cannot be conditioned away, no matter the other evidence!
  cMAP = argmax_c P̂(c) · ∏_i P̂(xi | c)
Laplace (add-1) smoothing for Naïve Bayes
P̂(wi | c) = count(wi, c) / Σ_{w ∈ V} count(w, c)
becomes, with add-1 smoothing:
P̂(wi | c) = (count(wi, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
Multinomial Naïve Bayes: Learning
• From the training corpus, extract the Vocabulary.
• Calculate the P(cj) terms:
  – For each cj in C: docsj ← all docs with class = cj
  – P(cj) = |docsj| / |total # documents|
• Calculate the P(wk | cj) terms (with add-α smoothing):
  – nk ← count of wk in the concatenation of all docs in docsj; n ← its total token count
  – P(wk | cj) = (nk + α) / (n + α · |Vocabulary|)
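Putting the formulas together, a from-scratch sketch of training and prediction for multinomial Naive Bayes with add-1 smoothing (toy data; log-probabilities avoid underflow when multiplying many small terms):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, class) pairs."""
    classes = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)
    for tokens, c in docs:
        word_counts[c].update(tokens)            # "mega-document" per class
    vocab = {w for tokens, _ in docs for w in tokens}
    log_prior = {c: math.log(n / len(docs)) for c, n in classes.items()}
    log_lik = {}
    for c in classes:
        total = sum(word_counts[c].values())
        # add-1 (Laplace) smoothing: no zero probabilities for in-vocabulary words
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(model, tokens):
    log_prior, log_lik, vocab = model
    # sum of log-probabilities = log of the product in the formula above
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

model = train_nb([("great fun great plot".split(), "pos"),
                  ("boring pathetic plot".split(), "neg")])
print(predict_nb(model, "great plot".split()))   # -> 'pos'
```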
Summary: Naive Bayes is Not So Naive
● Very fast; low storage requirements.
● Robust to irrelevant features:
– Irrelevant features cancel each other out without affecting results.
● Very good in domains with many equally important features.
● Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem.
● A good, dependable baseline for text classification.
● But we will see other classifiers that give better accuracy.
Evaluation: Precision and Recall
● The 2-by-2 contingency table:

                correct   not correct
  selected      tp        fp
  not selected  fn        tn

● Precision: % of selected items that are correct: P = tp / (tp + fp)
● Recall: % of correct items that are selected: R = tp / (tp + fn)
Evaluation: F1 score
• A combined measure that assesses the P/R tradeoff is the F measure (a weighted harmonic mean):
  F = 1 / (α · (1/P) + (1 − α) · (1/R)) = (β² + 1) · P · R / (β² · P + R)
• The harmonic mean is a very conservative average.
• People usually use the balanced F1 measure, i.e., with β = 1 (that is, α = ½): F1 = 2PR / (P + R)
Evaluation: Confusion matrix c
• For each pair of classes <c1, c2>: how many documents from c1 were incorrectly assigned to c2?
  – e.g., c3,2 = 90: 90 wheat documents were incorrectly assigned to poultry.

Docs in test set   Assigned UK   Assigned poultry   Assigned wheat   Assigned coffee   Assigned interest   Assigned trade
True UK                 95              1                 13                0                  1                 0
True poultry             0              1                  0                0                  0                 0
True wheat              10             90                  0                1                  0                 0
True coffee              0              0                  0               34                  3                 7
True interest            -              1                  2               13                 26                 5
True trade               0              0                  2               14                  5                10
Evaluation: per-class measures
Recall: fraction of docs in class i classified correctly:
  Ri = cii / Σj cij
Precision: fraction of docs assigned class i that are actually about class i:
  Pi = cii / Σj cji
Accuracy (1 − error rate): fraction of docs classified correctly:
  Σi cii / Σi Σj cij
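A minimal sketch computing these per-class measures and accuracy from a confusion matrix c (rows = true class, columns = assigned class); the 3-class matrix is an excerpt of the table above.

```python
def per_class_metrics(c):
    """c[i][j] = number of class-i docs assigned to class j."""
    n = len(c)
    total = sum(sum(row) for row in c)
    accuracy = sum(c[i][i] for i in range(n)) / total
    for i in range(n):
        precision = c[i][i] / max(sum(c[j][i] for j in range(n)), 1)  # column sum
        recall = c[i][i] / max(sum(c[i]), 1)                          # row sum
        print(f"class {i}: P={precision:.2f} R={recall:.2f}")
    print(f"accuracy={accuracy:.2f}")

per_class_metrics([[95, 1, 13],
                   [0, 1, 0],
                   [10, 90, 0]])
```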
Development Test Sets and Cross-validation
• Metric: P/R/F1 or Accuracy
• Unseen test set:
  • avoids overfitting (“tuning to the test set”);
  • gives a more conservative estimate of performance.
• Cross-validation over multiple splits:
  • handles sampling errors from different datasets;
  • pool the results over each split;
  • compute the pooled dev set performance.
(Diagram: the data is split into Training Set / Dev Test Set / Test Set; under cross-validation, the Dev Test portion rotates across splits while the Test Set stays held out.)
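A minimal sketch of cross-validation with scikit-learn (an assumed dependency; the dataset and categories are illustrative), pooling F1 over five splits while the test set stays untouched:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Two-class text classification problem (downloads the data on first run)
data = fetch_20newsgroups(subset="train",
                          categories=["sci.space", "rec.autos"])

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 5-fold cross-validation on the training data; the test subset stays unseen
scores = cross_val_score(model, data.data, data.target, cv=5, scoring="f1")
print(scores.mean(), scores.std())
```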