Overview of text mining and NLP (+software)

Date post: 18-Aug-2015
Upload: florian-leitner
Text mining and natural language processing. Florian Leitner, Technical University of Madrid (UPM), Spain. Tyba, Madrid, ES, 12th of June, 2015. License:
Transcript
Page 1: Overview of text mining and NLP (+software)

Text mining and natural language processing

Florian Leitner
Technical University of Madrid (UPM), Spain

Tyba

Madrid, ES, 12th of June, 2015

License:

Page 2: Overview of text mining and NLP (+software)

Is language understanding & generation key to artificial intelligence?

• “Her” (Samantha): movie, 2013
• “The Singularity: ~2030”: Ray Kurzweil, Google’s director of engineering
• “Watson” & “CRUSH”: IBM’s bet on the future: data streams, mainframes & AI

“predict crimes before they happen”: Criminal Reduction Utilizing Statistical History (IBM, reality) vs. Precogs (Minority Report, movie)

if? when?

cognitive computing: “processing information more like a human than a machine”

Page 3: Overview of text mining and NLP (+software)

Examples of text mining and natural language processing applications.

• Spam filtering
• Document classification
• Social media/brand monitoring
• Opinion mining (& text classification)
• Search engines
• Information retrieval
• Plagiarism detection
• Content-based recommendation systems
• Watson (Jeopardy!, IBM)
• Question answering
• Spelling correction
• Language modeling
• Website translation (Google)
• Machine translation
• Digital assistants (MS’ Clippy)
• Dialog systems (“Turing test”)
• Siri (Apple) and Google Now
• Speech recognition & language understanding
• Event detection (in e-mails)
• Information extraction

[Axis label: Text Mining … Language Processing]

Relevant FOSS (only!) libraries will be down here… (MIT, ALv2, GPL, BSD, …)

Page 4: Overview of text mining and NLP (+software)

Concepts & Terminology

Page 5: Overview of text mining and NLP (+software)

Document and text classification/clustering

[Figure: two scatter plots of documents over the 1st and 2nd principal components. The first marks the distance between documents; the second marks a cluster and its centroid.]

Supervised (“Learning to classify from examples”, e.g., spam filtering)
vs.
Unsupervised (“Exploratory grouping”, e.g., topic modeling)

LIBSVM
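The contrast between the two settings can be sketched in a few lines of Python. This is only an illustration under assumptions: the 2-D coordinates and the spam/ham labels are made up, a nearest-centroid rule stands in for a real classifier (it is not what LIBSVM implements), and k-means stands in for "exploratory grouping".

```python
from math import dist

# Hypothetical 2-D document coordinates (e.g., after projecting onto two principal components).
docs = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]

def centroid(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

# Supervised: labels are given; learn one centroid per class, then classify new documents.
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]  # made-up labels
classes = sorted(set(labels))
class_centroids = [centroid([d for d, l in zip(docs, labels) if l == c]) for c in classes]

def classify(point):
    """Assign the label of the nearest class centroid."""
    return classes[nearest(point, class_centroids)]

# Unsupervised: no labels; k-means alternates between assignment and centroid updates.
def kmeans(points, initial_centroids, iterations=10):
    centroids = list(initial_centroids)
    for _ in range(iterations):
        assignment = [nearest(p, centroids) for p in points]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = centroid(members)
    return [nearest(p, centroids) for p in points], centroids

clusters, cluster_centroids = kmeans(docs, initial_centroids=[docs[0], docs[3]])
```

In the supervised half the class names come from the training labels; in the unsupervised half the clusters are only numbered, and interpreting them is left to the analyst.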

Page 7: Overview of text mining and NLP (+software)

Words, Tokens, and N-Grams/Shingles

Text: This is a sentence.

Token N-Grams (each piece is a token or shingle):
• shingles (unigrams): This | is | a | sentence | .
• 2-shingles (bigrams): This is | is a | a sentence | sentence .
• 3-shingles (trigrams): This is a | is a sentence | a sentence .

NB: “tokenization”. Splitting: character-based, regular expressions, probabilistic, …

Character N-Grams (“k-shingling”): e.g., all trigrams of the word “sentence”: [sen, ent, nte, ten, enc, nce]

Snag: the terms “shingle”, “token” and “n-gram” are not used consistently… but “n-gram” and “token” are far more common!
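A minimal sketch of regular-expression tokenization and n-gram generation, reproducing the slide's example. The function names and the particular regular expression are illustrative choices, not taken from the talk.

```python
import re

def tokenize(text):
    """Regular-expression tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(items, n):
    """All contiguous length-n windows over a sequence (a token list or a string)."""
    return [items[i:i + n] for i in range(len(items) - n + 1)]

tokens = tokenize("This is a sentence.")
token_bigrams = [" ".join(g) for g in ngrams(tokens, 2)]
token_trigrams = [" ".join(g) for g in ngrams(tokens, 3)]
char_trigrams = ngrams("sentence", 3)  # character 3-shingles of one word
```

The same sliding window serves both cases: over a token list it yields token n-grams, over a string it yields character n-grams.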

Page 8: Overview of text mining and NLP (+software)

Lemmatization, Part-of-Speech (PoS) tagging, and Named Entity Recognition (NER)

Token         Lemma         PoS   NER
Constitutive  constitutive  JJ    O
binding       binding       NN    O
to            to            TO    O
the           the           DT    O
peri-κ        peri-kappa    NN    B-DNA
B             B             NN    I-DNA
site          site          NN    I-DNA
is            be            VBZ   O
seen          see           VBN   O
in            in            IN    O
monocytes     monocyte      NNS   B-cell
.             .             .     O

Linguistic annotations of tokens (used to train automated classifiers).

PoS column: de facto standard PoS tagset {NN, JJ, DT, VBZ, …}: Penn Treebank

NER column: B-I-O chunk encoding: Begin-Inside-Outside (relevant) token; consecutive B and I tokens form a chunk.
Common alternatives: I-O, I-E-O, B-I-E-W-O (E: End token; W: (unigram) Word)

Stanford CoreNLP, FACTORIE, FreeLing, and many more…
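To show how the B-I-O encoding is read back into chunks, here is a small decoder run over the tags from the table above. It is a sketch; the function name is made up, and the Greek-letter token is spelled out as "peri-kappa" (its lemma) to keep the example ASCII-only.

```python
def bio_chunks(tokens, tags):
    """Collect (label, text) chunks from parallel token and B-I-O tag lists."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # a new chunk starts right after another one
                chunks.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O", or an I- tag that does not continue the open chunk
            if current:
                chunks.append((label, " ".join(current)))
            current, label = [], None
    if current:  # flush a chunk that runs to the end of the sentence
        chunks.append((label, " ".join(current)))
    return chunks

tokens = ["Constitutive", "binding", "to", "the", "peri-kappa", "B", "site",
          "is", "seen", "in", "monocytes", "."]
tags = ["O", "O", "O", "O", "B-DNA", "I-DNA", "I-DNA", "O", "O", "O", "B-cell", "O"]
chunks = bio_chunks(tokens, tags)
```

The B- prefix is what lets two adjacent chunks of the same type be told apart, which the plain I-O scheme cannot do.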

Page 9: Overview of text mining and NLP (+software)

Word vectors and inverted indices

Comparing text vectors: e.g., cosine similarity

Similarity(T1, T2) := cos(T1, T2)

[Figure: Text 1 and Text 2 drawn as vectors in a space with axes count(Word 1), count(Word 2), and count(Word 3), each running from 0 to 10, with angles α, β, γ. Each token/word is a dimension!]

Text vectorization: inverted index (“search engine basics”)

Text 1: He that not wills to the end neither wills to the means.
Text 2: If the mountain will not go to Moses, then Moses must go to the mountain.

tokens    Text 1  Text 2
end       1       0
go        0       2
he        1       0
if        0       1
means     1       0
Moses     0       2
mountain  0       2
must      0       1
not       1       1
that      1       0
the       2       2
then      0       1
to        2       2
will      2       1

INDRI
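As a worked example, the counts from the table can be turned into an inverted index and compared with cosine similarity. This is a sketch; the counts are copied from the table as given (normalized as on the slide), and the function names are illustrative.

```python
from collections import Counter
from math import sqrt

# Token counts copied from the table above.
text1 = Counter({"end": 1, "he": 1, "means": 1, "not": 1, "that": 1,
                 "the": 2, "to": 2, "will": 2})
text2 = Counter({"go": 2, "if": 1, "Moses": 2, "mountain": 2, "must": 1,
                 "not": 1, "the": 2, "then": 1, "to": 2, "will": 1})

def inverted_index(docs):
    """Map each token to the documents that contain it, with counts."""
    index = {}
    for name, counts in docs.items():
        for token, count in counts.items():
            index.setdefault(token, {})[name] = count
    return index

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

index = inverted_index({"Text 1": text1, "Text 2": text2})
similarity = cosine(text1, text2)  # 11 / (sqrt(17) * sqrt(25)), about 0.53
```

Only the four shared tokens (not, the, to, will) contribute to the dot product, which is why sparse representations keep this computation cheap.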

Page 11: Overview of text mining and NLP (+software)

Inverted indices and the central dogma of machine learning

$y = h_\theta(X)$

$X^T \times \theta = y$

• $X^T$: inverted index (transposed); rows are the “texts” (n: instances, observations), columns are the n-grams (p: variables, features)
• $\theta$: parameters, per feature (“nonparametric”: per instance)
• $y$: rank, class, expectation, probability, descriptor*, …

(Hyperparameters are settings that control the learning algorithm.)
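Read as a linear model, the picture amounts to one dot product per text. The sketch below assumes a linear hypothesis; the feature subset is taken from the two example texts on the word-vector slide, and the parameter values are hypothetical.

```python
def h(theta, X_t):
    """Linear hypothesis: one score per instance, y_i = sum_j X_t[i][j] * theta[j]."""
    return [sum(x_ij * t_j for x_ij, t_j in zip(row, theta)) for row in X_t]

features = ["go", "not", "the", "will"]  # p = 4 n-gram features (columns)
X_t = [                                  # n = 2 texts (rows), counts per feature
    [0, 1, 2, 2],                        # Text 1
    [2, 1, 2, 1],                        # Text 2
]
theta = [0.5, -1.0, 0.0, 0.25]           # hypothetical parameters, one per feature
y = h(theta, X_t)
```

Learning means choosing theta from training data; a parametric model keeps one parameter per feature regardless of how many texts are seen.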

Page 12: Overview of text mining and NLP (+software)

The curse of dimensionality (R.E. Bellman, 1961) [inventor of dynamic programming]

• p ≫ n (far more tokens/features than texts/instances)

• Inverted indices (X) are (discrete) sparse matrices.

• Even with millions of training examples, unseen tokens will keep popping up during evaluation or in production.

‣ In such a high-dimensional hypercube, most instances are closer to a face of the cube (“nothing”, outside) than to other instances.

✓ Remedy: (feature) dimensionality reduction. The “blessing of non-uniformity.”

• feature extraction (compression): PCA/LSA (projection), factor analysis (regression), compression, auto-encoders & deep learning (compression & embedding), …

• feature selection (elimination): LASSO (regularization), SVM (support vectors), Bayesian nets (structure learning), locality-sensitive hashing, random projections, …
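The hypercube claim can be checked with a small simulation. This is a sketch under assumptions: points are drawn uniformly at random from the unit hypercube (real text vectors are not uniform, which is exactly the “blessing of non-uniformity”), and the sample size and seed are arbitrary.

```python
import random
from math import dist

def fraction_closer_to_face(dimensions, n_points=200, seed=0):
    """Share of random points in the unit hypercube that are nearer to a face
    of the cube than to any other sampled point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dimensions)] for _ in range(n_points)]
    closer = 0
    for i, p in enumerate(points):
        to_face = min(min(x, 1.0 - x) for x in p)  # distance to the nearest face
        to_neighbor = min(dist(p, q) for j, q in enumerate(points) if j != i)
        if to_face < to_neighbor:
            closer += 1
    return closer / n_points

low = fraction_closer_to_face(dimensions=1)     # few points hug the boundary
high = fraction_closer_to_face(dimensions=100)  # practically all of them do
```

With 100 dimensions a point is never farther than 0.5 from some face, while its nearest neighbor is typically several units away, so nearly every instance sits "at the edge" of the space.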


Page 13: Overview of text mining and NLP (+software)

Applications

Page 14: Overview of text mining and NLP (+software)

Google’s review summaries: Opinion mining (“sentiment” analysis).

Don’t do it, please… ;-) (If you must: see document and text classification software.)

Page 15: Overview of text mining and NLP (+software)

Polarity of sentiment keywords in IMDB.

Example: “not good”

Christopher Potts. On the negativity of negation. 2011
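Why “not good” is a problem for keyword polarity can be illustrated with negation marking, a common preprocessing heuristic in the sentiment-analysis literature. Whether the talk used it is not stated; the negation word list and the scope rule (up to the next punctuation mark) are simplifying assumptions.

```python
import re

NEGATIONS = {"not", "no", "never"}  # small illustrative list, not exhaustive

def mark_negation(text):
    """Append _NEG to every token between a negation word and the next punctuation mark."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    marked, negated = [], False
    for token in tokens:
        if re.fullmatch(r"[^\w\s]", token):  # punctuation ends the negation scope
            negated = False
            marked.append(token)
        elif token in NEGATIONS:
            negated = True
            marked.append(token)
        else:
            marked.append(token + "_NEG" if negated else token)
    return marked

marked = mark_negation("The plot is not good, but the cast is good.")
```

After marking, "good" and "good_NEG" are distinct features, so a classifier can learn opposite polarities for them instead of averaging the two.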

Page 16: Overview of text mining and NLP (+software)

Language understanding: Parsing and semantic analysis.

• Named Entity Recognition
• Entity Grounding
• Coreference (Anaphora) Resolution

All of these call for disambiguation!

Apple Siri

L. Tesnière
N. Chomsky

Stanford, BLLIP (C-J), Malt, LinkGrammar, RedShift, and many more…

Page 17: Overview of text mining and NLP (+software)

Automatic text summarization…

…is very difficult because…

• Variance/human agreement: When is a summary “correct”?

• Coherence: providing discourse structure (text flow) to the summary.

• Paraphrasing: important sentences are repeated, but with different wordings.

• Implied messages: (the Dow Jones index rose 10 points → the economy is thriving)

• Anaphora (coreference) resolution: very hard, but crucial.

Image Source: www.lexalytics.com

Lex[Page]Rank (JUNG), sumy, TextTeaser (the author got hired by Google…)
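The graph-ranking idea behind LexRank-style extractive summarizers can be sketched as PageRank over a sentence-similarity graph. This is a simplified illustration, not the implementation found in sumy or TextTeaser: it uses raw word counts instead of tf-idf weights, applies no similarity threshold, and the example sentences are made up.

```python
import re
from collections import Counter
from math import sqrt

def bag_of_words(sentence):
    """Lowercased word counts of one sentence."""
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def rank_sentences(sentences, damping=0.85, iterations=50):
    """PageRank-style centrality scores over a sentence-similarity graph."""
    n = len(sentences)
    bags = [bag_of_words(s) for s in sentences]
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]  # total edge weight leaving each sentence
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(scores[j] * sim[j][i] / out_weight[j]
                            for j in range(n) if out_weight[j])
            for i in range(n)
        ]
    return scores

sentences = [
    "The Dow Jones index rose ten points.",
    "The index rose as the economy is thriving.",
    "Analysts say the economy is thriving.",
    "Bananas ripen quickly.",
]
scores = rank_sentences(sentences)
best = sentences[scores.index(max(scores))]  # one-sentence extractive "summary"
```

Picking the most central sentence addresses none of the difficulties listed above (coherence, paraphrase, implied messages, anaphora); it only finds the sentence most similar to the rest.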

Page 18: Overview of text mining and NLP (+software)

Machine translation: Deep learning with auto-encoders.

Different languages…

‣ have only one gender (en) or use opposing genders (es vs. de: el/die; la/der; …/das).
‣ have different verb placements (es⬌de).
‣ have different concepts of verbs (latin, arab, cjk).
‣ use different tenses (en⬌de).
‣ have different word orders (latin, arab, cjk).

DL4J

Page 19: Overview of text mining and NLP (+software)

Question answering: The champions league of TM & NLP.

Biggest issue: statistical inference

Category: Oscar Winning Movies
Hint: Its final scene includes the line “I do wish we could chat longer, but I’m having an old friend for dinner”
Answer: The Silence of the Lambs

All men are mortal. Socrates probably is a man… Therefore, Socrates might be mortal. (cognitive computing)

IBM Watson, WolframAlpha

Page 20: Overview of text mining and NLP (+software)

Information extraction: Knowledge mining for molecular biology.

[Diagram: a text mining pipeline linking Articles, the WWW, Ontologies & Thesauri, and Biological Repositories. Its labeled parts are listed below.]

• Article Classification
• Named Entity Recognition
• Entity Mapping (Grounding), e.g., Cdk5: UniProt Q03114; Rat: TaxID 10116
• Entity Associations
• Relationship Extraction
• Binary Interactions
• Relationship Annotations
• Experimental Methods
• Biological Model
• Short Factoid Question Answering

MITIE, OpenDMAP, ClearTK

Page 21: Overview of text mining and NLP (+software)

Text mining and language processing is all about resolving ambiguities.

Anaphora resolution: Carl and Bob were fighting: “You should shut up,” Carl told him.

Part-of-Speech tagging: The robot wheels out the iron.

Paraphrasing: Unemployment is on the rise. vs. The economy is slumping.

Entity recognition & grounding: Is Princeton really good for you?
