
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 2. Linguistic Essentials and Text...

Date post: 29-Dec-2015
Upload: lesley-priscilla-marshall

CSC 594 Topics in AI – Text Mining and Analytics

Fall 2015/16

2. Linguistic Essentials and Text Mining Preliminaries


The Description of Language

• Language = Words and Rules
  – Dictionary (vocabulary) + Grammar
• Dictionary
  – set of words defined in the language
  – open (dynamic)
• Grammar
  – set of rules which describe what is allowable in a language
• Classic/empirical Grammars
  – definitions and rules are mainly supported by examples
  – no (or almost no) formal description tools
• Explicit/formal Grammar (CFG, Dependency Grammars etc.)
  – formal description
  – can be programmed & tested on data (texts)
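To illustrate what "can be programmed & tested on data" might look like, here is a minimal sketch of a context-free grammar in Chomsky normal form, checked with the CYK algorithm. The grammar rules, the toy dictionary and the function name `recognizes` are all made up for illustration; they are not taken from the course.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form: (B, C) -> {A} means A -> B C.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}

# Hypothetical toy dictionary: word -> possible categories.
LEXICON = {
    "the": {"Det"},
    "student": {"N"},
    "book": {"N"},
    "reads": {"V"},
}


def recognizes(sentence: str) -> bool:
    """Return True if the toy grammar derives the sentence (CYK algorithm)."""
    words = sentence.lower().split()
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the categories that can span words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICON.get(word, set()))
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span - 1
            for split in range(start, end):
                left, right = table[start][split], table[split + 1][end]
                for pair in product(left, right):
                    table[start][end] |= BINARY_RULES.get(pair, set())
    return "S" in table[0][n - 1]


print(recognizes("the student reads the book"))  # True
print(recognizes("student the reads"))           # False
```

Because both the dictionary and the rules are plain data, the grammar can be extended and re-tested against texts without touching the recognizer.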


Levels of Language Analysis

1. Phonology
   • study of sound systems of languages
2. Morphology
   • study of structure of words: the structure of words in a language, including patterns of inflections and derivations
3. Syntax
   • study of organization of words in sentences: the ordering of and relationship between the words in phrases and sentences
4. Semantics
   • study of meaning in language: the study of how meaning in language is created
5. Pragmatics
   • study of language in use: the branch of linguistics that studies language use rather than language structure
6. Discourse
   • study of language, especially the type of language used in a particular context or subject
7. World Knowledge


Parts of Speech

• There are eight basic parts of speech for words in the English language: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection.

• The part of speech indicates how the word functions in meaning as well as grammatically within a sentence.

1. Noun: people, animals, concepts, things (e.g. “birds”)

2. Pronoun: a word used in place of a noun (e.g. “it”, “they”, “I”, “she”)

3. Verb: expresses action in the sentence (e.g. “sing”)

4. Adjective: describes properties of nouns (e.g. “yellow”)

5. Adverb: modifies or describes a verb, an adjective, or another adverb (e.g. “extremely”, “slowly”)

6. Preposition: a word placed before a noun/pronoun to form a phrase modifying another word/phrase (e.g. “in”, “for”, “without”)


Quiz!

• Identify the part of speech (one of the basic eight) of each word in the following sentences.

– “The student put books on the table.”

– “We may also collect information you voluntarily add to your profile, such as your mobile phone number and mobile service provider.”


Morphology

• The study of how words are composed of morphemes (the smallest meaning-bearing units of a language)

• Two broad classes of morphemes:
  – Stems: “main” morpheme of the word, supplying meaning
  – Affixes: bits and pieces that combine with stems to modify their meanings and grammatical functions (prefixes, suffixes, circumfixes, infixes)
    • Unlike
    • Trying
• Multiple affixes
  – Unreadable

Source: Joyce Choi, CSE 842, Michigan State University
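The stem/affix split above can be sketched as a small lookup-based segmenter. The affix lists, the set of known stems and the function name `segment` are hypothetical and deliberately tiny; a real morphological analyzer would use a full lexicon and handle spelling changes.

```python
# Hypothetical, deliberately tiny affix lists and stem lexicon.
PREFIXES = ("un",)
SUFFIXES = ("able", "ing")
KNOWN_STEMS = {"like", "try", "read"}


def segment(word: str) -> list[str]:
    """Split a word into prefix, stem and suffix when the stem is known."""
    word = word.lower()
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            stem = rest[: len(rest) - len(suffix)]
            if stem in KNOWN_STEMS:
                # Drop the empty strings standing for "no prefix" / "no suffix".
                return [part for part in (prefix, stem, suffix) if part]
    # No analysis found: return the word unsegmented.
    return [word]


print(segment("Unlike"))      # ['un', 'like']
print(segment("Trying"))      # ['try', 'ing']
print(segment("Unreadable"))  # ['un', 'read', 'able']
```

Requiring the stem to be in a known list is what keeps a word such as "under" from being split into "un" + "der".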


Ways to Form Words

• Inflection: new forms of the same word (usually in the same class)
  – Tense, number, mood, voice marking in verbs
  – Number, gender marking in nominals
  – Comparison of adjectives
• Derivation: yields different words, usually in a different class
  – Deverbal nominals
  – Denominal adjectives and verbs
• Compounding: new words out of two or more other words
  – Noun-noun compounding (e.g., doghouse)
• Cliticization: combine a word with a clitic (which acts syntactically like a word but in a reduced form, e.g., I’ve)

Source: Joyce Choi, CSE 842, Michigan State University


English Inflectional Morphology

• Word stem combines with grammatical morpheme
  – Usually produces word of same class
  – Usually serves a grammatical role that the stem could not (e.g. agreement)
    • like -> likes or liked
    • bird -> birds

• Nouns have a simple inflectional morphology: markers for plural and markers for possessives

• Verbs are slightly more complex:

Source: Joyce Choi, CSE 842, Michigan State University


Nominal Inflection

• Nominal morphology
  – Plural forms
    • s or es
    • Irregular forms, e.g., Goose/Geese, Mouse/Mice
  – Possessives
    • children’s

Source: Joyce Choi, CSE 842, Michigan State University
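The plural rules on this slide can be sketched as a rule plus an exception list. The spelling rule for "-es" below is a simplifying assumption (sibilant endings only), and the name `pluralize` is made up for illustration; the two irregular pairs are the ones given on the slide.

```python
# Irregular plurals have to be listed explicitly; these two are from the slide.
IRREGULAR_PLURALS = {"goose": "geese", "mouse": "mice"}

# Simplifying assumption: only these endings take "-es".
ES_ENDINGS = ("s", "x", "z", "ch", "sh")


def pluralize(noun: str) -> str:
    """Return the plural of an English noun under the simplified rules above."""
    noun = noun.lower()
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(ES_ENDINGS):
        return noun + "es"
    return noun + "s"


print(pluralize("bird"))   # birds
print(pluralize("box"))    # boxes
print(pluralize("Goose"))  # geese
```

The exception table is consulted first, which is the usual way to let a short irregular list override a productive rule.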


Verbal Inflection

• Main verbs (walk, like) are relatively regular
  – -s, -ing, -ed
  – And productive: emailed, instant-messaged, faxed
  – But eat/ate/eaten, catch/caught/caught
• Primary (be, have, do) and modal verbs (can, will, must) are often irregular and not productive
  – Be: am/is/are/were/was/been/being
• Irregular verbs few (~250) but frequently occurring
• English verbal inflection is much simpler than e.g. Latin

Source: Joyce Choi, CSE 842, Michigan State University


English Derivational Morphology

• Word stem combines with grammatical morpheme
  – Usually produces word of different class
  – More complicated than inflectional
• Example: nominalization
  – -ize verbs -> -ation nouns
  – generalize, realize -> generalization, realization
• Example: verbs, nouns -> adjectives
  – embrace, pity -> embraceable, pitiable
  – care, wit -> careless, witless

Source: Joyce Choi, CSE 842, Michigan State University


• Example: adjective -> adverb
  – happy -> happily
• More complicated to model than inflection
  – Less productive: *science-less, *concern-less, *go-able, *sleep-able
  – Meanings of derived terms harder to predict by rule

Source: Joyce Choi, CSE 842, Michigan State University
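A minimal sketch of why derivation is harder to model than inflection: the rules themselves are easy to write as suffix rewrites, but applying them blindly over-generates. The rule table, the rule names and the function `derive` are hypothetical.

```python
# Each rule: (suffix to remove, suffix to add, input class, output class).
# Rule names are made up for this sketch.
DERIVATION_RULES = {
    "nominalize": ("ize", "ization", "verb", "noun"),
    "adverbialize": ("y", "ily", "adjective", "adverb"),
    "add_less": ("", "less", "noun", "adjective"),
}


def derive(word: str, rule_name: str) -> tuple[str, str]:
    """Apply one derivational rule; return (derived word, new word class)."""
    old, new, _in_class, out_class = DERIVATION_RULES[rule_name]
    if not word.endswith(old):
        raise ValueError(f"rule {rule_name!r} does not apply to {word!r}")
    stem = word[: len(word) - len(old)]
    return stem + new, out_class


print(derive("generalize", "nominalize"))  # ('generalization', 'noun')
print(derive("happy", "adverbialize"))     # ('happily', 'adverb')
print(derive("care", "add_less"))          # ('careless', 'adjective')
# The same rule happily produces a form the slide marks as unacceptable:
print(derive("science", "add_less"))       # ('scienceless', 'adjective')
```

Nothing in the rule itself distinguishes "careless" from "*scienceless", so a practical system needs a lexicon or corpus evidence to filter the output.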


Morphological Analysis Tools

• E.g. Porter Stemmer
  – A simple approach: just hack off the end of the word!
  – Does NOT convert a word to its base form!!!
  – Frequently used in Information Retrieval, but results are pretty ugly!

Source: Marti Hearst, i256, at UC Berkeley

Original:
Rudolph Agnew , 55 years old and former chairman of
Consolidated Gold Fields PLC , was named a nonexecutive director of
this British industrial conglomerate . A form of asbestos once used to
make Kent cigarette filters has caused a high percentage of cancer
deaths among a group of workers exposed to it more than 30 years ago ,

Results:
Rudolph Agnew , 55 year old and former chairman of
Consolid Gold Field PLC , wa name a nonexecut director of
thi British industri conglomer . A form of asbesto onc use to
make Kent cigarett filter ha caus a high percentag of cancer
death among a group of worker expos to it more than 30 year ago ,
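The "hack off the end of the word" idea can be sketched in a few lines. This is NOT the Porter algorithm (which applies several ordered rule phases with conditions on the stem); it is a much cruder stand-in with a hypothetical suffix list, and unlike the output above it lowercases its input.

```python
# Hypothetical suffix list, checked in this order. Not the Porter rule set.
SUFFIXES = ("ing", "ed", "es", "s")
MIN_STEM_LENGTH = 3  # assumption: never leave fewer than 3 characters


def crude_stem(word: str) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


print([crude_stem(w) for w in "years named filters exposed".split()])
# ['year', 'nam', 'filter', 'expos']
```

Even this toy version shows the trade-off on the slide: "years" and "filters" come out as real words, while "named" and "exposed" are truncated to pseudo-stems.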


Stemming vs. Lemmatization

• The purpose of both stemming and lemmatization is to reduce morphological variation.

• Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas (morphological stems).
  – Stemming: car, cars, car's, cars' => car
  – Lemmatizing: am, are, is => be; drive, drives, drove, driven => drive

• In a way, lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance.
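The contrast can be sketched with two tiny functions: a suffix-stripping pseudo-stemmer and a dictionary-based lemmatizer. The lookup table holds only the slide's examples, and both function names are made up; a real lemmatizer relies on a full morphological lexicon and often on the word's part of speech.

```python
# Hypothetical lookup table containing only the examples from the slide.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "drives": "drive", "drove": "drive", "driven": "drive",
}


def lemmatize(word: str) -> str:
    """Map an inflected form to its lemma by dictionary lookup."""
    word = word.lower()
    return LEMMA_TABLE.get(word, word)


def pseudo_stem(word: str) -> str:
    """Strip plural and possessive endings without consulting a dictionary."""
    word = word.lower()
    for ending in ("s'", "'s", "s"):
        # Keep at least two characters so that short words like "is" survive.
        if word.endswith(ending) and len(word) > len(ending) + 1:
            return word[: -len(ending)]
    return word


print([pseudo_stem(w) for w in ["car", "cars", "car's", "cars'"]])
# ['car', 'car', 'car', 'car']
print([lemmatize(w) for w in ["am", "are", "is", "drove"]])
# ['be', 'be', 'be', 'drive']
print(pseudo_stem("drove"))  # drove
```

Suffix stripping cannot relate "drove" to "drive", which is the kind of irregular inflection that only a lexicon-backed lemmatizer resolves.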


Is Stemming/Lemmatization Useful?

• Both help reduce the size of vocabulary.
• However, there are problems…
  – Stemming can conflate semantically different words
    • E.g. “Gallery” and “gall” may both be stemmed to “gall”
  – Also, truncated stems can be unintelligible to users
  – Lemmatization is better, but it only deals with inflectional variance (e.g. “go”, “went”, “gone” => “go”, but not “attend”/verb, “attendance”/noun)

• Despite the problems, stemming is done often in Information Retrieval (IR) and Text Mining.


Quiz!

• The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs, would you agree, should NOT be conflated? Give your reasoning.
  – abandon / abandonment
  – marketing / markets
  – university / universe
  – volume / volumes

• FYI: Porter Stemmer Online (http://9ol.es/porter_js_demo.html)

Introduction to Information Retrieval, C. Manning, P. Raghavan and H. Schutze, 2008


POS Tagging

• The process of assigning a part-of-speech or lexical class marker to each word in a sentence (and all sentences in a collection).

Input: the lead paint is unsafe

Output: the/Det lead/N paint/N is/V unsafe/Adj

Source: Jurafsky & Martin “Speech and Language Processing”


Why is POS Tagging Useful?

• First step of a vast number of practical tasks
• Helps in stemming/lemmatization
• Parsing
  – Need to know if a word is an N or V before you can parse
  – Parsers can build trees directly on the POS tags instead of maintaining a lexicon
• Information Extraction
  – Finding names, relations, etc.
• Machine Translation
• Selecting words of specific Parts of Speech (e.g. nouns) in pre-processing documents (for IR etc.)

Source: Jurafsky & Martin “Speech and Language Processing”


POS Tagging: Choosing a Tagset

• To do POS tagging, we need to choose a standard set of tags to work with
• Could pick very coarse tagsets
  – N, V, Adj, Adv.
• More commonly used set is finer grained, the “Penn TreeBank tagset”, 45 tags
  – PRP$, WRB, WP$, VBG

• Even more fine-grained tagsets exist

Source: Jurafsky & Martin “Speech and Language Processing”
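One way to see the relationship between a fine-grained and a coarse tagset is to collapse Penn Treebank tags by prefix. The mapping below is an illustrative assumption, not a standard: it covers only the four coarse classes named on the slide and sends everything else to "Other".

```python
# Assumed prefix-based collapse of Penn Treebank tags into the slide's coarse tags.
COARSE_BY_PREFIX = (
    ("NN", "N"),    # NN, NNS, NNP, NNPS
    ("VB", "V"),    # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "Adj"),  # JJ, JJR, JJS
    ("RB", "Adv"),  # RB, RBR, RBS
)


def coarse_tag(penn_tag: str) -> str:
    """Map a Penn Treebank tag to a coarse class, or 'Other' if none matches."""
    for prefix, coarse in COARSE_BY_PREFIX:
        if penn_tag.startswith(prefix):
            return coarse
    return "Other"


print([coarse_tag(t) for t in ["PRP$", "WRB", "WP$", "VBG", "NNS"]])
# ['Other', 'Other', 'Other', 'V', 'N']
```

Note that a pure prefix rule sends WRB (wh-adverb) to "Other" rather than "Adv", which shows how much information a coarse tagset discards or has to special-case.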


Penn TreeBank POS Tagset


Difficulties with POS Tagging

• Words often have more than one POS – ambiguity:
  – The back door = JJ
  – On my back = NN
  – Win the voters back = RB
  – Promised to back the bill = VB

• The POS tagging problem is to determine the POS tag for a particular instance of a word.

Another example of Part-of-speech ambiguities

“Fed raises interest rates 0.5 % in effort to control inflation”

Candidate tags shown above the words on the slide (word alignment lost in extraction): NNP NNS NNS NNS CD NN; VBZ VBZ VBZ; VB

Source: Jurafsky & Martin “Speech and Language Processing”, Andrew McCallum, UMass Amherst


POS Tagging Techniques

Source: Jurafsky & Martin “Speech and Language Processing”, Andrew McCallum, UMass Amherst

1. Rule-based
   • Hand-coded rules
2. Probabilistic/Stochastic
   • Sequence (n-gram) models; machine learning
     – HMM (Hidden Markov Model)
     – MEMMs (Maximum Entropy Markov Models)
3. Transformation-based
   • Rules + n-gram machine learning
     – Brill tagger
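As a rough sketch of the transformation-based idea: start from an initial tagging and apply context rules that rewrite tags. A real Brill tagger learns its rules from a tagged corpus; here the single rule is written by hand, and the initial tags and the function name `apply_transformations` are assumptions for illustration.

```python
# Each transformation: (from_tag, to_tag, required tag of the previous word).
# Hand-written here; a Brill tagger would learn such rules from data.
TRANSFORMATIONS = [("NN", "VB", "TO")]


def apply_transformations(tagged):
    """Apply each rule left to right over a list of (word, tag) pairs."""
    tags = [tag for _, tag in tagged]
    for from_tag, to_tag, prev_tag in TRANSFORMATIONS:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]


# Assumed initial tagging in which "back" received its noun tag.
initial = [("promised", "VBD"), ("to", "TO"), ("back", "NN"),
           ("the", "DT"), ("bill", "NN")]
print(apply_transformations(initial))
# [('promised', 'VBD'), ('to', 'TO'), ('back', 'VB'), ('the', 'DT'), ('bill', 'NN')]
```

Only "back" changes, because it is the only NN preceded by TO; "bill" follows a determiner and keeps its noun tag.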


Current Performance

Input: the lead paint is unsafe

Output: the/Det lead/N paint/N is/V unsafe/Adj

• Using state-of-the-art automated method, how many tags are correct?
  – About 97% currently
  – But baseline is already 90%
• Baseline is performance of simplest possible method:
  – Tag every word with its most frequent tag, and
  – Tag unknown words as nouns

Source: Andrew McCallum, UMass Amherst
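The baseline described above fits in a few lines. The training sentences below are a made-up toy corpus, and the function names are hypothetical; the 90% figure on the slide refers to real corpora, not to this sketch.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Return a dict mapping each known word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, most_frequent_tag, unknown_tag="NN"):
    """Tag known words with their most frequent tag and unknown words as nouns."""
    return [(w, most_frequent_tag.get(w.lower(), unknown_tag)) for w in words]


# Made-up toy training data in Penn Treebank tags.
TRAINING = [
    [("the", "DT"), ("lead", "NN"), ("paint", "NN"), ("is", "VBZ"), ("unsafe", "JJ")],
    [("they", "PRP"), ("paint", "VBP"), ("the", "DT"), ("door", "NN")],
    [("the", "DT"), ("paint", "NN"), ("is", "VBZ"), ("dry", "JJ")],
]

model = train_baseline(TRAINING)
print(tag_baseline(["The", "paint", "is", "toxic"], model))
# [('The', 'DT'), ('paint', 'NN'), ('is', 'VBZ'), ('toxic', 'NN')]
```

"paint" is tagged NN because it is a noun in two of its three training occurrences, and the unseen word "toxic" falls back to NN, which is exactly the kind of error the more sophisticated taggers are meant to fix.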


Quiz!

1. Find one tagging error in each of the following sentences that are tagged with the Penn Treebank tagset.
   a. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN.
   b. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS ?

2. Tag each word in the following sentence with the Penn Treebank tagset.
   – “We may also collect information you voluntarily add to your profile, such as your mobile phone number and mobile service provider.”


Named Entity Recognition (NER)

• Named Entities (NEs) are proper names in texts, i.e. the names of persons, organizations, locations, times and quantities

• NE Recognition (NER) is a sub-task of Information Extraction (IE)

• The NER task is to process a text and identify the named entities in it
  – e.g. “U.N. official Ekeus heads for Baghdad.”

• NER is also an important task for texts in specific domains such as biomedical texts

Source: J. Choi, CSE842, MSU; Marti Hearst, i256, at UC Berkeley


Common Entity Types

NE Type                       Examples
ORGANIZATION                  Georgia-Pacific Corp., WHO
PERSON                        Eddy Bonte, President Obama
LOCATION                      Murray River, Mount Everest
DATE                          June, 2008-06-29
TIME                          two fifty a m, 1:30 p.m.
MONEY                         175 million Canadian Dollars, GBP 10.40
PERCENT                       twenty pct, 18.75 %
FACILITY                      Washington Monument, Stonehenge
GPE (geo political entity)    South East Asia, Midlothian


Difficulties with NER

• Names are too numerous to include in dictionaries
• Variations
  – e.g. “John Smith”, “Mr Smith”, “John”
• Changing constantly
  – new names introduce unknown words
• Ambiguities
  – Same name refers to different entities, e.g.
    • JFK – the former president
    • JFK – his son
    • JFK – airport in NY
• Multi-word entities – difficult to find boundaries
  – “DePaul University”
  – “Cecil H. Green Library and Escondido Village Conference Service Center”


Landscape of IE/NER Techniques

Each technique is illustrated on the sentence “Abraham Lincoln was born in Kentucky.”

• Lexicons: is the candidate a member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
• Classify Pre-segmented Candidates: a classifier decides which class each candidate belongs to
• Sliding Window: a classifier decides which class each window belongs to; try alternate window sizes
• Boundary Models: a classifier marks positions as BEGIN or END
• Context Free Grammars: most likely parse? (tree node labels on the slide: NNP, V, P, NP, PP, VP, S)
• Finite State Machines: most likely state sequence?

Any of these models can be used to capture words, formatting or both.

Source: Marti Hearst, i256, at UC Berkeley
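The first and third techniques can be combined in a short sketch: slide a window of varying size over the tokens and test each candidate for lexicon membership instead of calling a classifier. The lexicon contents, the tokenization (whitespace split plus stripping of "." and ",") and the function name are assumptions made for this example.

```python
# Hypothetical lexicon of place names (lowercased), including one two-word entry.
PLACE_LEXICON = {"alabama", "alaska", "kentucky", "wisconsin", "wyoming", "new york"}


def find_lexicon_matches(text, lexicon, max_window=2):
    """Return (start, end, phrase) for every token window found in the lexicon."""
    tokens = [token.strip(".,") for token in text.split()]
    matches = []
    for size in range(1, max_window + 1):          # try alternate window sizes
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate.lower() in lexicon:       # the "member?" test
                matches.append((start, start + size, candidate))
    return matches


print(find_lexicon_matches("Abraham Lincoln was born in Kentucky.", PLACE_LEXICON))
# [(5, 6, 'Kentucky')]
print(find_lexicon_matches("She moved to New York.", PLACE_LEXICON))
# [(3, 5, 'New York')]
```

A pure lexicon approach finds "Kentucky" but has no way to recognize "Abraham Lincoln" unless the name is listed, which is why the other techniques replace the membership test with a trained classifier.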


NER State of the Art Performance

• Named entity recognition
  – Person, Location, Organization, …
  – F1 score (the harmonic mean of precision and recall) in high 80’s or low- to mid-90’s
• However, performance depends on the entity types
  – [Wikipedia] At least two hierarchies of named entity types have been proposed in the literature. BBN categories [1], proposed in 2002, is used for Question Answering and consists of 29 types and 64 subtypes. Sekine's extended hierarchy [2], proposed in 2002, is made of 200 subtypes.

• Also, various domains use different entity types (e.g. concepts in biomedical texts)
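For reference, a minimal sketch of how the F1 score is computed from a set of gold entities and a set of predicted entities. Entities are compared by exact (text, type) match, which is one common convention but an assumption here; the gold annotations for the example sentence are also made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 over sets of (text, type) entities."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed annotations for "U.N. official Ekeus heads for Baghdad."
gold = {("U.N.", "ORGANIZATION"), ("Ekeus", "PERSON"), ("Baghdad", "LOCATION")}
predicted = {("Ekeus", "PERSON"), ("Baghdad", "LOCATION")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 1.0 0.67 0.8
```

The system here makes no wrong predictions (precision 1.0) but misses one entity (recall 2/3), and F1 lands between the two at 0.8.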


Is NER Useful?

• Yes, especially when the text uses domain-specific vocabulary (e.g. legal, medical).


Bing Liu

UIC - CS 594

Stop words

• Many of the most frequently used words in English are worthless in IR and text mining – these words are called stop words.
  – the, of, and, to, ….
  – Typically about 400 to 500 such words
  – For an application, an additional domain-specific stop word list may be constructed
• Why do we want to remove stop words?
  – Reduce indexing (or data) file size
    • stop words account for 20-30% of total word counts.
  – Improve efficiency
    • stop words are not useful for searching or text mining
    • stop words always have a large number of hits

• Example Stopword Lists


Difficulties with Stopwords

• Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. (Wikipedia)
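Since there is no universal list, stop word removal is usually written so that the list is a parameter. The sketch below uses a hypothetical four-word list built from the examples given earlier in these slides ("the, of, and, to"); real lists contain hundreds of entries, and the function names are made up.

```python
# Hypothetical mini stop list; real lists hold several hundred words.
STOP_WORDS = {"the", "of", "and", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]


def stop_word_share(tokens, stop_words=STOP_WORDS):
    """Return the fraction of tokens that are stop words."""
    if not tokens:
        return 0.0
    return sum(t.lower() in stop_words for t in tokens) / len(tokens)


tokens = "The student put books on the table".split()
print(remove_stop_words(tokens))
# ['student', 'put', 'books', 'on', 'table']
print(round(stop_word_share(tokens), 2))  # 0.29
```

Passing a different `stop_words` set is how a domain-specific list would be swapped in; with this tiny list "on" survives, whereas a fuller list would normally remove it as well.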
