
LING 001 Introduction to Linguistics Spring 2010 Syntactic parsing Part-Of-Speech tagging Apr. 5...

Date post: 29-Jan-2016
Upload: darren-brooks

LING 001 Introduction to Linguistics, Spring 2010

Syntactic parsing
Part-Of-Speech tagging

Apr. 5

Computational linguistics


Computational linguistics

• Syntax, semantics, grammar, and the lexicon
• Lexical semantics and ontologies
• Phonology/morphology, word segmentation, and tagging
• Summarization
• Language generation
• Paraphrasing and textual entailment
• Parsing and chunking
• Spoken language processing, understanding and speech-to-speech translation
• Linguistic, psychological and mathematical models of language
• Computational pragmatics
• Dialogue and conversational agents
• Computational models of discourse
• Information retrieval
• Question answering
• Word sense disambiguation
• Information extraction and text mining
• Semantic role labeling
• Sentiment analysis and opinion mining
• Corpus-based modeling of language
• Machine translation and translation aids
• Multilingual processing
• Multimodal systems and representations
• Statistical and machine learning methods
• Applications
• Corpus development and language resources
• Evaluation methods and user studies


• Emphasis on integrating linguistic and other knowledge to produce working systems.

• System performance is important. Computational linguistics deals with language as it’s actually used.
  • little need to worry about rare constructions and distinctions;
  • need to worry about fragments, typos, false starts, ambiguities, non-native speakers, etc.

• Ambiguity in natural language is pervasive, which makes computational linguistics hard.


Ambiguity

• Lexical:
  • Bank
  • Unlockable

• Syntactic:
  • I shot an elephant in my pajamas. (How he got in my pajamas, I'll never know.)
  • I forgot how good beer tastes.
  • I met Mary and Elena’s mother at the mall yesterday.

• Semantic:
  • Every cat chases a mouse.
  • The police refused the demonstrators a permit because ...
    ... they feared violence.
    ... they advocated violence.


Parsing

• Parsing: taking an input and producing some sort of structure for it.

• A syntactic parser is a device (or algorithm) that takes a phrase or sentence as input, and uses a grammar (including a lexicon) to produce the syntactic structure(s) appropriate for that phrase or sentence (often called parse trees or just trees).


Context-free grammar

• The type of grammar often applied in parsing is known as a context-free grammar.

• A context-free grammar is a set of rules/productions (and a lexicon) that specify how a syntactic constituent can be composed of smaller constituents. (The term “context-free” means that expanding a constituent doesn't depend on what other constituents are around it.)


• The symbols for constituents (e.g., phrases and sentences) are called non-terminal symbols. Those representing words are called terminal symbols.

• Each rule has a single non-terminal symbol on the left hand side of the arrow. This symbol is expanded into the symbols (non-terminal or terminal) on the right hand side. The non-terminal symbols on the right hand side can then be expanded by other rules.

• The vertical stroke | is just a shorthand for alternative expansions.

• The grammar “accepts” a sentence if there is a way of expanding S (the start symbol), then expanding all the sub-constituents, and so on, until the leaves of the tree match the words in the sentence (which are terminal symbols). If we want to accept noun phrases, we can treat NP as a start symbol.
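
As a minimal sketch of the definitions above, the following Python snippet stores a toy context-free grammar as text rules, splits alternatives on the vertical stroke, and separates non-terminal from terminal symbols. The grammar itself and the helper names `parse_rules` and `symbols` are illustrative assumptions, not taken from the lecture.

```python
# Toy grammar written in arrow notation; "|" separates alternative expansions.
# The rules and words below are illustrative assumptions.
RULES = """
S   -> NP VP
NP  -> Det N | Det N PP
VP  -> V NP | V NP PP
PP  -> P NP
Det -> the | a
N   -> dog | beach | woman
V   -> kept | saw
P   -> on
"""

def parse_rules(text):
    """Map each left-hand-side symbol to a list of right-hand sides."""
    grammar = {}
    for line in text.strip().splitlines():
        lhs, rhs = line.split("->")
        alternatives = [alt.split() for alt in rhs.split("|")]
        grammar.setdefault(lhs.strip(), []).extend(alternatives)
    return grammar

def symbols(grammar):
    """Non-terminals appear on a left-hand side; all other symbols are terminals."""
    nonterminals = set(grammar)
    used = {sym for alts in grammar.values() for alt in alts for sym in alt}
    return nonterminals, used - nonterminals

GRAMMAR = parse_rules(RULES)
NONTERMINALS, TERMINALS = symbols(GRAMMAR)
```

Here `S` plays the role of the start symbol; to accept bare noun phrases instead, a parser would simply start from `NP`.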


Parsing

• Parsing means running a grammar backwards to find the possible structures of a sentence. It can be viewed as a search problem.

• Top-down strategy: All the expansions of the start symbol are considered, then expansions of each of those constituents, and so on, until we reach expansions that match all the words in the sentence. (What are the problems?)
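
The top-down strategy can be sketched as a small backtracking recognizer. The toy grammar and the function names `expand` and `accepts` are assumptions made for illustration. The sketch also hints at two of the problems: it tries expansions that the words cannot support, and a left-recursive rule such as NP -> NP PP would make it recurse forever.

```python
# Toy grammar (an illustrative assumption): non-terminals map to lists of expansions;
# any symbol that is not a key of GRAMMAR is a terminal (a word).
GRAMMAR = {
    "S": [["Aux", "NP", "VP"], ["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Aux": [["does"]],
    "Det": [["this"], ["a"]],
    "N": [["flight"], ["meal"]],
    "V": [["include"]],
}

def expand(symbols, words):
    """Top-down search: can the symbol sequence be expanded into exactly these words?"""
    if not symbols:
        return not words                  # success only if every word was consumed
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:              # terminal: it must match the next word
        return bool(words) and words[0] == first and expand(rest, words[1:])
    # Non-terminal: try every expansion in turn, backtracking on failure.
    return any(expand(alt + rest, words) for alt in GRAMMAR[first])

def accepts(sentence, start="S"):
    return expand([start], sentence.lower().split())
```

For example, `accepts("does this flight include a meal")` succeeds, while dropping the determiner before "meal" makes every expansion fail.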


• Bottom-up strategy: The words are examined and all the small constituents that might contain them are postulated, then we see which of those can be fitted together into larger constituents, and so on, until we reach a tree. (What are the problems?)
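
One common way to organize a bottom-up search is a CKY-style chart, which the lecture does not name; it is used here only as an example of the bottom-up idea. The sketch assumes a toy grammar in Chomsky normal form (every rule has two non-terminals on the right, plus a lexicon) and counts how many trees cover the ambiguous sentence "I shot an elephant in my pajamas" from the Ambiguity slide. All rule and function names are assumptions.

```python
from collections import defaultdict

# Binary rules (Chomsky normal form) and lexicon; both are illustrative assumptions.
BINARY = {
    ("NP", "VP"): ["S"],
    ("Det", "N"): ["NP"],
    ("NP", "PP"): ["NP"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],
    ("P", "NP"): ["PP"],
}
LEXICON = {
    "i": ["NP"], "shot": ["V"], "an": ["Det"], "elephant": ["N"],
    "in": ["P"], "my": ["Det"], "pajamas": ["N"],
}

def count_parses(words, start="S"):
    """Return the number of distinct parse trees spanning the whole sentence."""
    n = len(words)
    # chart[(i, j)][A] = number of ways category A can cover words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):              # smallest constituents: single words
        for cat in LEXICON.get(word, []):
            chart[(i, i + 1)][cat] += 1
    for width in range(2, n + 1):                 # then wider and wider spans
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):             # split point between two sub-constituents
                for left, left_count in list(chart[(i, k)].items()):
                    for right, right_count in list(chart[(k, j)].items()):
                        for parent in BINARY.get((left, right), []):
                            chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)][start]

print(count_parses("i shot an elephant in my pajamas".split()))  # prints 2
```

The two trees correspond to attaching "in my pajamas" to the verb phrase or to "an elephant". The chart also builds constituents that never end up in a complete tree, which is one weakness of a purely bottom-up search.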


• The left-corner strategy (top-down prediction with bottom-up verification): Make the left-most expansion (top-down), find rules that handle the left-most words (bottom-up), repeat the procedure.

• Does this flight include a meal?
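
The left-corner idea can be sketched for the example sentence as follows. This is a simplified recognizer under stated assumptions: the three rules, the lexicon, and the function names are invented for illustration, and the sketch omits the left-corner table that practical parsers use to filter predictions against the current goal.

```python
# Toy grammar and lexicon for the example sentence; both are illustrative assumptions.
RULES = [
    ("S", ["Aux", "NP", "VP"]),
    ("NP", ["Det", "N"]),
    ("VP", ["V", "NP"]),
]
LEXICON = {
    "does": "Aux", "this": "Det", "a": "Det",
    "flight": "N", "meal": "N", "include": "V",
}

def parse(goal, words, i):
    """Yield every position j such that words[i:j] can form a `goal` constituent."""
    if i < len(words) and words[i] in LEXICON:
        # Bottom-up step: start from the category of the left-most word.
        yield from complete(LEXICON[words[i]], goal, words, i + 1)

def complete(found, goal, words, i):
    """`found` ends at position i; try to grow it into `goal`."""
    if found == goal:
        yield i
    for lhs, rhs in RULES:
        if rhs[0] == found:                  # `found` is the left corner of this rule
            # Top-down step: predict and parse the remaining right-hand-side symbols.
            for j in parse_sequence(rhs[1:], words, i):
                yield from complete(lhs, goal, words, j)

def parse_sequence(symbols, words, i):
    """Yield end positions after parsing each symbol in `symbols` in turn."""
    if not symbols:
        yield i
        return
    for j in parse(symbols[0], words, i):
        yield from parse_sequence(symbols[1:], words, j)

def accepts(sentence, start="S"):
    words = sentence.lower().split()
    return any(j == len(words) for j in parse(start, words, 0))
```

With this grammar, "does" is recognized as Aux, which is the left corner of S -> Aux NP VP, so the parser then predicts NP and VP and verifies each against the following words.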


Probabilistic CFGs and Statistical Parsing

• Attach probabilities to context-free grammar rules (PCFG): the probabilities of the expansions for a given non-terminal sum to 1.

• Goal: find a single parse tree (the max probability tree) for a sentence instead of all possible parse trees.
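
A minimal sketch of these two points, assuming a hand-made PCFG: the rule probabilities below are invented for illustration, and the sketch leaves out lexical (word) probabilities. The probability of a tree is taken to be the product of the probabilities of the rules it uses, and the preferred parse is the candidate with the largest product.

```python
import math
from collections import defaultdict

# Hypothetical rule probabilities; the numbers are assumptions for illustration only.
PCFG = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("NP", "PP")): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("VP", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
}

def is_normalized(pcfg, tolerance=1e-9):
    """Check that the expansions of every non-terminal sum to 1."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in pcfg.items():
        totals[lhs] += prob
    return all(math.isclose(total, 1.0, abs_tol=tolerance) for total in totals.values())

def tree_probability(tree, pcfg):
    """Multiply the probabilities of all rules used in a tree.

    A tree is (label, child, child, ...); a pre-terminal is (label, word) and
    contributes no factor here, since this sketch omits lexical probabilities.
    """
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0
    rhs = tuple(child[0] for child in children)
    prob = pcfg[(label, rhs)]
    for child in children:
        prob *= tree_probability(child, pcfg)
    return prob

def best_tree(trees, pcfg):
    """Pick the maximum-probability tree among candidate parses."""
    return max(trees, key=lambda t: tree_probability(t, pcfg))
```

In practice the candidates are not enumerated one by one; a chart parser can track the best probability for each span, but the scoring rule is the same product.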


.15 * .40 * .05 * .05 * … = 1.5 * 10^-6
.15 * .40 * .40 * .05 * … = 1.7 * 10^-6


• Probabilities can be computed from an annotated database (a Treebank).

• The Penn Treebank:
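
Computing rule probabilities from a treebank can be sketched as relative-frequency counting: the probability of a rule is its count divided by the count of all rules with the same left-hand side. The two-tree "treebank" below is hand-made for illustration and is not Penn Treebank data; the function names are assumptions as well.

```python
from collections import Counter

# A tiny hand-made "treebank" (an assumption for illustration; not Penn Treebank data).
# Each tree is (label, child, ...); a pre-terminal is (label, word).
TREEBANK = [
    ("S", ("NP", ("Det", "the"), ("N", "dog")), ("VP", ("V", "barked"))),
    ("S", ("NP", ("Det", "the"), ("N", "cat")),
          ("VP", ("V", "saw"), ("NP", ("Det", "a"), ("N", "dog")))),
]

def rules_in(tree):
    """Yield every (lhs, rhs) rule used in a tree, including lexical rules."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        yield label, (children[0],)
        return
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from rules_in(child)

def estimate_pcfg(treebank):
    """Relative-frequency estimate: P(A -> beta) = count(A -> beta) / count(A)."""
    rule_counts = Counter(rule for tree in treebank for rule in rules_in(tree))
    lhs_counts = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}

PROBS = estimate_pcfg(TREEBANK)
```

In this toy data VP expands once as V and once as V NP, so each of those rules receives probability 0.5.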


Human parsing

• While most sentences are ambiguous in some way, people rarely notice these ambiguities. Instead, they only seem to see one interpretation for a sentence.

• Lexical subcategorization preferences:
  • The women kept the dogs on the beach.
    The women kept the dogs which were on the beach. 5%
    The women kept them (the dogs) on the beach. 95%
  • The women discussed the dogs on the beach.
    The women discussed the dogs which were on the beach. 90%
    The women discussed them (the dogs) while on the beach. 10%
  (keep has a preference for VP -> V NP PP, discuss has a preference for VP -> V NP)

• Part-of-speech preferences:
  • The complex houses married and single students and their families.
    (houses is more likely to be a noun)


Head lexicalization of PCFGs

• The head word of a phrase gives a good representation of the phrase’s structure and meaning.

• Puts the properties of words back into a PCFG.

• Lexicalized Probabilistic Context-Free Grammars perform much better than PCFGs (88% vs. 73% accuracy).
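
Head lexicalization can be sketched as passing a head word up the tree, so that each phrase label carries the word that heads it. The head table below is a deliberately tiny assumption (real head-finding rules are much more detailed), and `lexicalize` is a hypothetical helper name.

```python
# Hypothetical head table: which child categories may supply the head of each phrase.
# This is an assumption for illustration; real head rules are far more detailed.
HEAD_CHILDREN = {"S": ["VP"], "VP": ["V", "VP"], "NP": ["N", "NP"], "PP": ["P"]}

def lexicalize(tree):
    """Return (annotated_tree, head_word), relabelling each node as Category(head)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        word = children[0]                   # pre-terminal: the word is its own head
        return (f"{label}({word})", word), word
    annotated, head = [], None
    for child in children:
        new_child, child_head = lexicalize(child)
        annotated.append(new_child)
        if head is None and child[0] in HEAD_CHILDREN.get(label, []):
            head = child_head                # the head word percolates up from the head child
    if head is None:
        raise ValueError(f"no head child found for {label}")
    return (f"{label}({head})", *annotated), head
```

Once nodes are labelled this way, a rule such as VP(kept) -> V NP PP can be given a different probability from VP(discussed) -> V NP PP, which is how word-specific preferences like those of keep and discuss can re-enter the grammar.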


Part of Speech tagging

• Part of Speech (POS) tagging:
  Input: the lead paint is unsafe
  Output: the/Det lead/N paint/N is/V unsafe/Adj

• Uses of POS tagging:
  • Text-to-speech: how do we pronounce “lead”? (http://www.ivona.com/) Which words bear a pitch accent?
  • It can differentiate word senses that involve part-of-speech differences (what is the meaning of “interest”?).
  • Tagged text helps linguists find interesting syntactic constructions in texts (“google”, “ssh”, etc. used as a verb).

• POS tagging is not parsing. It is highly accurate: the state of the art is 97% accuracy. But the baseline is already 90%: (1) tag every word with its most frequent tag; (2) tag unknown words as nouns.
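
The two-step baseline can be sketched in a few lines. The three training sentences and the tag names are invented for illustration and are far too small to reproduce the 90% figure; `train_baseline` and `tag_baseline` are hypothetical helper names.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set (an assumption for illustration; not real corpus data).
TRAINING = [
    [("the", "Det"), ("lead", "N"), ("paint", "N"), ("is", "V"), ("unsafe", "Adj")],
    [("they", "Pron"), ("lead", "V"), ("the", "Det"), ("team", "N")],
    [("the", "Det"), ("lead", "N"), ("is", "V"), ("heavy", "Adj")],
]

def train_baseline(tagged_sentences):
    """Record the most frequent tag of every word seen in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, most_frequent, unknown_tag="N"):
    """Step 1: most frequent tag for known words. Step 2: unknown words become nouns."""
    return [(w, most_frequent.get(w.lower(), unknown_tag)) for w in words]

MODEL = train_baseline(TRAINING)
print(tag_baseline("they lead zebras".split(), MODEL))
# [('they', 'Pron'), ('lead', 'N'), ('zebras', 'N')]
```

The output shows both the strength and the weakness of the baseline: the unknown word "zebras" is correctly guessed to be a noun, but the verb "lead" is mis-tagged because the noun reading is more frequent in the training data.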


• The Penn Treebank tagset, which is the most widely used, has 45 different POS tags.


• Figure: percentage of the words accented (“stressed”) under each part-of-speech category in different speech genres.


Hidden Markov Model POS tagger


• HMMs have been widely used in many fields: natural language processing, speech synthesis/recognition, computer vision, biology, economics, climatology, etc.

• The top row shows the unobserved (hidden) states, interpreted as POS tags; the bottom row shows the observed output (words).

• Goal: find the most likely hidden state sequence (POS tag sequence) given an observation sequence (word sequence).
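
Assuming a first-order (bigram) HMM, which the slide does not state explicitly, this goal can be written as follows. The notation is an assumption rather than the lecture's: $w_{1:n}$ are the words, $t_{1:n}$ the tags, and $t_0$ a start symbol. The second step uses Bayes' rule and drops $P(w_{1:n})$, which is the same for every tag sequence; the third applies the HMM independence assumptions.

```latex
\hat{t}_{1:n}
  = \arg\max_{t_{1:n}} P(t_{1:n} \mid w_{1:n})
  = \arg\max_{t_{1:n}} P(w_{1:n} \mid t_{1:n}) \, P(t_{1:n})
  \approx \arg\max_{t_{1:n}} \prod_{i=1}^{n} P(w_i \mid t_i) \, P(t_i \mid t_{i-1})
```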


• Representation for Paths (hidden state sequences): Trellis
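
A standard way to search a trellis is the Viterbi algorithm, which the slide does not name; the sketch below is offered as one plausible reading. Each trellis column holds, for every tag, the probability of the best path ending in that tag plus a back-pointer to the previous tag. The three tags and every probability are assumptions chosen for illustration, not estimates from data.

```python
# Hypothetical HMM parameters for a three-tag toy tagger; every number is an
# assumption chosen for illustration, not an estimate from real data.
TAGS = ["Det", "N", "V"]
START = {"Det": 0.6, "N": 0.3, "V": 0.1}          # P(tag | start)
TRANS = {                                          # P(next tag | tag)
    "Det": {"Det": 0.05, "N": 0.9, "V": 0.05},
    "N":   {"Det": 0.1, "N": 0.4, "V": 0.5},
    "V":   {"Det": 0.5, "N": 0.4, "V": 0.1},
}
EMIT = {                                           # P(word | tag)
    "Det": {"the": 0.9},
    "N":   {"lead": 0.3, "paint": 0.4, "peels": 0.1},
    "V":   {"lead": 0.3, "paint": 0.2, "peels": 0.4},
}

def viterbi(words):
    """Return (best tag sequence, its joint probability) using a trellis."""
    # trellis[i][tag] = (probability of the best path ending in tag at word i, previous tag)
    trellis = [{t: (START[t] * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            # Best predecessor for this tag, judged by path probability times transition.
            prev = max(TAGS, key=lambda p: trellis[-1][p][0] * TRANS[p][tag])
            score = trellis[-1][prev][0] * TRANS[prev][tag] * EMIT[tag].get(word, 0.0)
            column[tag] = (score, prev)
        trellis.append(column)
    # Follow back-pointers from the best final state.
    tag = max(TAGS, key=lambda t: trellis[-1][t][0])
    best_prob = trellis[-1][tag][0]
    path = [tag]
    for column in reversed(trellis[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path)), best_prob

tags, prob = viterbi("the lead paint peels".split())
print(tags)  # ['Det', 'N', 'N', 'V']
```

Because each column keeps only the best path into each tag, the work grows linearly with sentence length instead of enumerating every possible tag sequence.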


HAL
