Collins - Statistical Methods in Natural Language Processing (Slides)

Statistical Methods in Natural Language Processing
Michael Collins, AT&T Labs-Research
Page 1: Collins - Statistical Methods in Natural Language Processing (Slides)

8/4/2019 Collins - Statistical Methods in Natural Language Processing (Slides)

http://slidepdf.com/reader/full/collins-statistical-methods-in-natural-language-processing-slides 1/96

Statistical Methods in Natural Language Processing

Michael Collins, AT&T Labs-Research

Page 2: Collins - Statistical Methods in Natural Language Processing (Slides)


Overview

Some NLP problems:

• Information extraction
  (Named entities, Relationships between entities, etc.)

• Finding linguistic structure
  (Part-of-speech tagging, “Chunking”, Parsing)

Techniques:

• Log-linear (maximum-entropy) taggers

• Probabilistic context-free grammars (PCFGs); PCFGs with enriched non-terminals

• Discriminative methods: Conditional MRFs, Perceptron algorithms, Kernel methods

Page 3: Collins - Statistical Methods in Natural Language Processing (Slides)


Some NLP Problems

• Information extraction

– Named entities

– Relationships between entities

– More complex relationships

• Finding linguistic structure

– Part-of-speech tagging

– “Chunking” (low-level syntactic structure)

– Parsing

• Machine translation

Page 4: Collins - Statistical Methods in Natural Language Processing (Slides)


Common Themes

• Need to learn mapping from one discrete structure to another

– Strings to hidden state sequences:
  named-entity extraction, part-of-speech tagging

– Strings to strings:
  machine translation

– Strings to underlying trees:
  parsing

– Strings to relational data structures:
  information extraction

• Speech recognition is similar (and shares many techniques)

Page 5: Collins - Statistical Methods in Natural Language Processing (Slides)


Two Fundamental Problems

TAGGING: Strings to Tagged Sequences

a b e e a f h j  ⇒  a/C b/D e/C e/C a/D f/C h/D j/C

PARSING: Strings to Trees

d e f g  ⇒  (A (B (D d) (E e)) (C (F f) (G g)))

Page 6: Collins - Statistical Methods in Natural Language Processing (Slides)


Information Extraction: Named Entities

INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

OUTPUT: Profits soared at [Company Boeing Co.], easily topping forecasts on [Location Wall Street], as their CEO [Person Alan Mulally] announced first quarter results.

Page 7: Collins - Statistical Methods in Natural Language Processing (Slides)


Information Extraction: Relationships between Entities

INPUT: Boeing is located in Seattle. Alan Mulally is the CEO.

OUTPUT:

{ Relationship = Company-Location; Company = Boeing; Location = Seattle }

{ Relationship = Employer-Employee; Employer = Boeing Co.; Employee = Alan Mulally }

Page 8: Collins - Statistical Methods in Natural Language Processing (Slides)


Information Extraction: More Complex Relationships

INPUT: Alan Mulally resigned as Boeing CEO yesterday. He will be succeeded by Jane Swift, who was previously the president at Rolls Royce.

OUTPUT:

{ Relationship = Management Succession; Company = Boeing Co.; Role = CEO; Out = Alan Mulally; In = Jane Swift }

{ Relationship = Management Succession; Company = Rolls Royce; Role = president; Out = Jane Swift }

Page 9: Collins - Statistical Methods in Natural Language Processing (Slides)


Part-of-Speech Tagging

INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

OUTPUT: Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.

N = Noun
V = Verb
P = Preposition
ADV = Adverb
ADJ = Adjective

Page 10: Collins - Statistical Methods in Natural Language Processing (Slides)


“Chunking” (Low-level syntactic structure)

INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

OUTPUT: [NP Profits] soared at [NP Boeing Co.], easily topping [NP forecasts] on [NP Wall Street], as [NP their CEO Alan Mulally] announced [NP first quarter results].

[NP ] = non-recursive noun phrase

Page 11: Collins - Statistical Methods in Natural Language Processing (Slides)


Chunking as Tagging

INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

OUTPUT: Profits/S soared/N at/N Boeing/S Co./C ,/N easily/N topping/N forecasts/S on/N Wall/S Street/C ,/N as/N their/S CEO/C Alan/C Mulally/C announced/N first/S quarter/C results/C ./N

N = Not part of noun-phrase
S = Start noun-phrase
C = Continue noun-phrase
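The S/C/N encoding above can be decoded back into noun-phrase spans with a few lines of code. This is an illustrative sketch; the helper name and the (word, tag) list format are my own, not from the slides:

```python
def decode_np_chunks(tagged):
    """Recover noun-phrase chunks from S/C/N chunk tags.

    S = start NP, C = continue NP, N = not part of an NP.
    `tagged` is a list of (word, tag) pairs; returns a list of
    NP word-lists.
    """
    chunks, current = [], None
    for word, tag in tagged:
        if tag == "S":                   # a new NP begins here
            if current is not None:
                chunks.append(current)
            current = [word]
        elif tag == "C" and current is not None:
            current.append(word)         # extend the open NP
        else:                            # tag == "N": close any open NP
            if current is not None:
                chunks.append(current)
                current = None
    if current is not None:
        chunks.append(current)           # NP still open at sentence end
    return chunks

example = [("Profits", "S"), ("soared", "N"), ("at", "N"),
           ("Boeing", "S"), ("Co.", "C"), (",", "N")]
print(decode_np_chunks(example))  # [['Profits'], ['Boeing', 'Co.']]
```

The same decoding idea applies to the SC/CC/SL/... named-entity tags on the next slide, with one open span per entity type.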

Page 12: Collins - Statistical Methods in Natural Language Processing (Slides)


Named Entity Extraction as Tagging

INPUT: Profits soared at Boeing Co., easily topping forecasts on Wall Street, as their CEO Alan Mulally announced first quarter results.

OUTPUT: Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

NA = No entity
SC = Start Company
CC = Continue Company
SL = Start Location
CL = Continue Location
SP = Start Person
CP = Continue Person

Page 13: Collins - Statistical Methods in Natural Language Processing (Slides)


Parsing (Syntactic Structure)

INPUT: Boeing is located in Seattle.

OUTPUT:

(S (NP (N Boeing))
   (VP (V is)
       (VP (V located)
           (PP (P in)
               (NP (N Seattle))))))

Page 14: Collins - Statistical Methods in Natural Language Processing (Slides)


Machine Translation

INPUT:Boeing is located in Seattle. Alan Mulally is the CEO.

OUTPUT:

Boeing ist in Seattle. Alan Mulally ist der CEO.

Page 15: Collins - Statistical Methods in Natural Language Processing (Slides)


Summary

Problem | Well-studied? | Class of problem / learning approaches
Named entity extraction | Yes | Tagging
Relationships between entities | A little | Parsing
More complex relationships | No | ??
Part-of-speech tagging | Yes | Tagging
Chunking | Yes | Tagging
Syntactic structure | Yes | Parsing
Machine translation | Yes | ??

Page 16: Collins - Statistical Methods in Natural Language Processing (Slides)


Techniques Covered in this Tutorial

• Log-linear (maximum-entropy) taggers

• Probabilistic context-free grammars (PCFGs)

• PCFGs with enriched non-terminals

• Discriminative methods:

– Conditional Markov Random Fields
– Perceptron algorithms
– Kernels over NLP structures

Page 17: Collins - Statistical Methods in Natural Language Processing (Slides)


Log-Linear Taggers: Notation

• Set of possible words = V; set of possible tags = T

• Word sequence w[1:n] = [w_1, w_2, ..., w_n]

• Tag sequence t[1:n] = [t_1, t_2, ..., t_n]

• Training data is n tagged sentences, where the i'th sentence is of length n_i:

  (w^(i)[1:n_i], t^(i)[1:n_i]) for i = 1 ... n

Page 18: Collins - Statistical Methods in Natural Language Processing (Slides)


Log-Linear Taggers: Independence Assumptions

• The basic idea:

  P(t[1:n] | w[1:n])
    = Π_{i=1..n} P(t_i | t_1, ..., t_{i-1}, w[1:n])     (chain rule)
    = Π_{i=1..n} P(t_i | t_{i-1}, t_{i-2}, w[1:n])      (independence assumptions)

• Two questions:

1. How to parameterize P(t_i | t_{i-1}, t_{i-2}, w[1:n])?

2. How to find argmax_{t[1:n]} P(t[1:n] | w[1:n])?

Page 19: Collins - Statistical Methods in Natural Language Processing (Slides)


The Parameterization Problem

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .

• There are many possible tags in the position ??

• Need to learn a function from (context, tag) pairs to a probability P(tag | context)

Page 20: Collins - Statistical Methods in Natural Language Processing (Slides)


Representation: Histories

• A history is a 4-tuple ⟨t_{-1}, t_{-2}, w[1:n], i⟩

• t_{-1}, t_{-2} are the previous two tags.

• w[1:n] are the n words in the input sentence.

• i is the index of the word being tagged.

Page 21: Collins - Statistical Methods in Natural Language Processing (Slides)


Representation: Histories

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .

• History h = ⟨t_{-1}, t_{-2}, w[1:n], i⟩

• t_{-1}, t_{-2} = DT, JJ

• w[1:n] = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩

• i = 6

Page 22: Collins - Statistical Methods in Natural Language Processing (Slides)


Feature–Vector Representations

• Take a history/tag pair (h, t).

• φ_s(h, t) for s = 1 ... d are features representing tagging decision t in context h.

Example features:

φ_1000(h, t) = 1 if current word w_i is base and t = VB
             = 0 otherwise

φ_1001(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩
             = 0 otherwise
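Indicator features of this kind are direct to express as functions on (history, tag) pairs. A sketch, assuming histories are plain Python tuples in the slide's ⟨t_{-1}, t_{-2}, words, i⟩ order and using 0-based word indexing; the function and variable names are my own:

```python
def phi_1000(history, tag):
    """1 if the current word is 'base' and the tag is VB, else 0."""
    t_prev, t_prev2, words, i = history   # <t_-1, t_-2, w[1:n], i>
    return 1 if words[i] == "base" and tag == "VB" else 0

def phi_1001(history, tag):
    """1 if the tag trigram <t_-2, t_-1, t> equals <DT, JJ, VB>, else 0."""
    t_prev, t_prev2, words, i = history
    return 1 if (t_prev2, t_prev, tag) == ("DT", "JJ", "VB") else 0

# History for tagging the word 'base' (position 5, 0-based), with the
# previous two tags JJ (t_-1) and DT (t_-2):
words = ["Hispaniola", "quickly", "became", "an", "important", "base"]
h = ("JJ", "DT", words, 5)
print(phi_1000(h, "VB"), phi_1001(h, "VB"))  # 1 1
```

Each feature fires only for its specific (context, tag) configuration; a real tagger has hundreds of thousands of such indicators.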

Page 23: Collins - Statistical Methods in Natural Language Processing (Slides)


Representation: Histories

• A history is a 4-tuple ⟨t_{-1}, t_{-2}, w[1:n], i⟩

• t_{-1}, t_{-2} are the previous two tags.

• w[1:n] are the n words in the input sentence.

• i is the index of the word being tagged.

Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere .

• t_{-1}, t_{-2} = DT, JJ

• w[1:n] = ⟨Hispaniola, quickly, became, ..., Hemisphere, .⟩

• i = 6

Page 24: Collins - Statistical Methods in Natural Language Processing (Slides)


Feature–Vector Representations

• Take a history/tag pair (h, t).

• φ_s(h, t) for s = 1 ... d are features representing tagging decision t in context h.

Example: POS Tagging [Ratnaparkhi 96]

• Word/tag features

  φ_100(h, t) = 1 if current word w_i is base and t = VB
              = 0 otherwise

  φ_101(h, t) = 1 if current word w_i ends in ing and t = VBG
              = 0 otherwise

• Contextual features

  φ_103(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩
              = 0 otherwise

Page 25: Collins - Statistical Methods in Natural Language Processing (Slides)


Part-of-Speech (POS) Tagging [Ratnaparkhi 96]

• Word/tag features

  φ_100(h, t) = 1 if current word w_i is base and t = VB
              = 0 otherwise

• Spelling features

  φ_101(h, t) = 1 if current word w_i ends in ing and t = VBG
              = 0 otherwise

  φ_102(h, t) = 1 if current word w_i starts with pre and t = NN
              = 0 otherwise

Page 26: Collins - Statistical Methods in Natural Language Processing (Slides)


Ratnaparkhi’s POS Tagger

• Contextual features

  φ_103(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩
              = 0 otherwise

  φ_104(h, t) = 1 if ⟨t_{-1}, t⟩ = ⟨JJ, VB⟩
              = 0 otherwise

  φ_105(h, t) = 1 if ⟨t⟩ = ⟨VB⟩
              = 0 otherwise

  φ_106(h, t) = 1 if previous word w_{i-1} = the and t = VB
              = 0 otherwise

  φ_107(h, t) = 1 if next word w_{i+1} = the and t = VB
              = 0 otherwise

Page 27: Collins - Statistical Methods in Natural Language Processing (Slides)


Log-Linear (Maximum-Entropy) Models

• Take a history/tag pair (h, t).

• φ_s(h, t) for s = 1 ... d are features

• W_s for s = 1 ... d are parameters

• Parameters define a conditional distribution

  P(t | h, W) = exp( Σ_s W_s φ_s(h, t) ) / Z(h, W)

  where Z(h, W) = Σ_{t' ∈ T} exp( Σ_s W_s φ_s(h, t') )
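The conditional distribution above is a softmax over the tag set. A direct, unoptimized transcription of the formula; the function names, the history layout ⟨t_{-1}, t_{-2}, words, i⟩, and the toy feature are my own illustrations:

```python
import math

def tag_distribution(features, W, history, tagset):
    """P(t | h, W) = exp(sum_s W[s] * phi_s(h, t)) / Z(h, W).

    `features` is a list of feature functions phi_s(h, t); `W` is the
    parallel list of parameters W_s.
    """
    def score(t):
        return sum(w * phi(history, t) for w, phi in zip(W, features))
    exp_scores = {t: math.exp(score(t)) for t in tagset}
    Z = sum(exp_scores.values())          # normalizer Z(h, W)
    return {t: v / Z for t, v in exp_scores.items()}

# Toy example: a single feature that favours VB when the word is 'base'.
phi0 = lambda h, t: 1 if h[2][h[3]] == "base" and t == "VB" else 0
h = ("JJ", "DT", ["an", "important", "base"], 2)
dist = tag_distribution([phi0], [2.0], h, ["NN", "VB"])
print(round(dist["VB"], 3))  # exp(2) / (exp(2) + 1) ≈ 0.881
```

Raising the weight on a feature pushes probability toward the tags that fire it; the normalizer keeps the distribution summing to one.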

Page 28: Collins - Statistical Methods in Natural Language Processing (Slides)


Log-Linear (Maximum Entropy) Models

• Word sequence w[1:n] = [w_1, w_2, ..., w_n]

• Tag sequence t[1:n] = [t_1, t_2, ..., t_n]

• Histories h_i = ⟨t_{i-1}, t_{i-2}, w[1:n], i⟩

  log P(t[1:n] | w[1:n])
    = Σ_{i=1..n} log P(t_i | h_i)
    = Σ_{i=1..n} Σ_s W_s φ_s(h_i, t_i)    [linear score]
      − Σ_{i=1..n} log Z(h_i, W)          [local normalization terms]

Page 29: Collins - Statistical Methods in Natural Language Processing (Slides)


Log-Linear Models

• Word sequence w[1:n] = [w_1, w_2, ..., w_n]

• Tag sequence t[1:n] = [t_1, t_2, ..., t_n]

  log P(t[1:n] | w[1:n])
    = Σ_{i=1..n} log P(t_i | h_i)
    = Σ_{i=1..n} Σ_s W_s φ_s(h_i, t_i) − Σ_{i=1..n} log Z(h_i, W)

  where h_i = ⟨t_{i-2}, t_{i-1}, w[1:n], i⟩

Page 30: Collins - Statistical Methods in Natural Language Processing (Slides)


Log-Linear Models

• Parameter estimation: maximize likelihood of training data through gradient descent, iterative scaling

• Search for argmax_{t[1:n]} P(t[1:n] | w[1:n]): dynamic programming, O(n |T|³) complexity

• Experimental results:

– Almost 97% accuracy for POS tagging [Ratnaparkhi 96]

– Over 90% accuracy for named-entity extraction [Borthwick et al. 98]

– Around 93% precision/recall for NP chunking

– Better results than an HMM for FAQ segmentation [McCallum et al. 2000]
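The argmax search can be implemented with the Viterbi algorithm over pairs of adjacent tags, which gives the O(n |T|³) dynamic program mentioned above. A minimal sketch, assuming a scoring function `log_prob(t, t2, t1, words, i)` that returns log P(t_i = t | t_{i-2} = t2, t_{i-1} = t1, w[1:n]); the function names and the "*" start-padding convention are my own:

```python
def viterbi_trigram(words, tagset, log_prob):
    """Find argmax_{t[1:n]} P(t[1:n] | w[1:n]) for a trigram tagger.

    Dynamic programming over (t_{i-1}, t_i) pairs: for each of n
    positions we consider |T|^2 previous pairs times |T| new tags,
    hence O(n |T|^3) time.
    """
    pi = {("*", "*"): 0.0}   # best log-score of any path ending in this pair
    history = []             # one back-pointer table per position
    for i in range(len(words)):
        new_pi, bp = {}, {}
        for (t2, t1), score in pi.items():
            for t in tagset:
                cand = score + log_prob(t, t2, t1, words, i)
                if cand > new_pi.get((t1, t), float("-inf")):
                    new_pi[(t1, t)] = cand
                    bp[(t1, t)] = t2          # remember best predecessor
        pi = new_pi
        history.append(bp)
    if not history:
        return []
    # Recover the best sequence by following back-pointers.
    u, v = max(pi, key=pi.get)
    tags = [u, v]
    for bp in reversed(history[2:]):
        tags.insert(0, bp[(tags[0], tags[1])])
    return [t for t in tags if t != "*"]      # drop the start padding

# Toy model: each position simply prefers the tag spelled like its word.
score = lambda t, t2, t1, words, i: 0.0 if t == words[i] else -1.0
print(viterbi_trigram(["A", "B", "A"], ["A", "B"], score))  # ['A', 'B', 'A']
```

With a locally normalized model the per-position log-probabilities simply add, so Viterbi recovers the exact global argmax.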

Page 31: Collins - Statistical Methods in Natural Language Processing (Slides)


Techniques Covered in this Tutorial

• Log-linear (maximum-entropy) taggers

• Probabilistic context-free grammars (PCFGs)

• PCFGs with enriched non-terminals

• Discriminative methods:

– Conditional Markov Random Fields
– Perceptron algorithms
– Kernels over NLP structures

Page 32: Collins - Statistical Methods in Natural Language Processing (Slides)


Data for Parsing Experiments

• Penn WSJ Treebank = 50,000 sentences with associated trees

• Usual set-up: 40,000 training sentences, 2,400 test sentences

An example tree (leaves with their part-of-speech tags; the original figure draws the full constituent structure, with NP, VP, PP, ADVP, QP, WHADVP, SBAR, and S nodes under TOP):

Canadian/NNP Utilities/NNPS had/VBD 1988/CD revenue/NN of/IN C$/$ 1.16/CD billion/CD ,/PUNC, mainly/RB from/IN its/PRP$ natural/JJ gas/NN and/CC electric/JJ utility/NN businesses/NNS in/IN Alberta/NNP ,/PUNC, where/WRB the/DT company/NN serves/VBZ about/RB 800,000/CD customers/NNS ./PUNC.

Canadian Utilities had 1988 revenue of C$ 1.16 billion, mainly from its natural gas and electric utility businesses in Alberta, where the company serves about 800,000 customers.


Page 33: Collins - Statistical Methods in Natural Language Processing (Slides)


The Information Conveyed by Parse Trees

1) Part of speech for each word (N = noun, V = verb, D = determiner)

(S (NP (D the) (N burglar))
   (VP (V robbed)
       (NP (D the) (N apartment))))

Page 34: Collins - Statistical Methods in Natural Language Processing (Slides)


2) Phrases

(S (NP (DT the) (N burglar))
   (VP (V robbed)
       (NP (DT the) (N apartment))))

Noun Phrases (NP): “the burglar”, “the apartment”

Verb Phrases (VP): “robbed the apartment”

Sentences (S): “the burglar robbed the apartment”

Page 35: Collins - Statistical Methods in Natural Language Processing (Slides)


3) Useful Relationships

Schematically: in a configuration (S (NP ...) (VP (V ...) ...)), the NP is the subject of the verb V.

(S (NP (DT the) (N burglar))
   (VP (V robbed)
       (NP (DT the) (N apartment))))

⇒ “the burglar” is the subject of “robbed”


Page 36: Collins - Statistical Methods in Natural Language Processing (Slides)


An Example Application: Machine Translation

• English word order is subject – verb – object

• Japanese word order is subject – object – verb

English: IBM bought Lotus
Japanese: IBM Lotus bought

English: Sources said that IBM bought Lotus yesterday
Japanese: Sources yesterday IBM Lotus bought that said


Page 37: Collins - Statistical Methods in Natural Language Processing (Slides)


Context-Free Grammars

[Hopcroft and Ullman 1979]

A context-free grammar is a 4-tuple G = (N, Σ, R, S) where:

• N is a set of non-terminal symbols

• Σ is a set of terminal symbols

• R is a set of rules of the form X → Y_1 Y_2 ... Y_n, for n ≥ 0, X ∈ N, Y_i ∈ (N ∪ Σ)

• S ∈ N is a distinguished start symbol


Page 38: Collins - Statistical Methods in Natural Language Processing (Slides)


A Context-Free Grammar for English

N = {S, NP, VP, PP, D, Vi, Vt, N, P}
S = S
Σ = {sleeps, saw, man, woman, telescope, the, with, in}

R:
S  → NP VP
VP → Vi
VP → Vt NP
VP → VP PP
NP → D N
NP → NP PP
PP → P NP
Vi → sleeps
Vt → saw
N  → man
N  → woman
N  → telescope
D  → the
P  → with
P  → in

Note: S = sentence, VP = verb phrase, NP = noun phrase, PP = prepositional phrase, D = determiner, Vi = intransitive verb, Vt = transitive verb, N = noun, P = preposition

Left-Most Derivations

Page 39: Collins - Statistical Methods in Natural Language Processing (Slides)


A left-most derivation is a sequence of strings s_1 ... s_n, where

• s_1 = S, the start symbol

• s_n ∈ Σ*, i.e. s_n is made up of terminal symbols only

• Each s_i for i = 2 ... n is derived from s_{i-1} by picking the left-most non-terminal X in s_{i-1} and replacing it by some β where X → β is a rule in R

For example: [S], [NP VP], [D N VP], [the N VP], [the man VP], [the man Vi], [the man sleeps]

Representation of a derivation as a tree:

(S (NP (D the) (N man)) (VP (Vi sleeps)))
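The derivation just walked through can be reproduced mechanically: repeatedly find the left-most non-terminal and rewrite it. A sketch, assuming for simplicity that each non-terminal has a single chosen right-hand side (a real grammar offers choices at each step); the helper name and data layout are my own:

```python
def leftmost_derivation(rules, max_steps):
    """Expand from S, always rewriting the left-most non-terminal.

    `rules` maps each non-terminal to one chosen right-hand side (a
    list of symbols); any symbol not in `rules` is a terminal.
    Returns the sequence of strings s_1 ... s_n.
    """
    seq = [["S"]]
    for _ in range(max_steps):
        current = seq[-1]
        # find the left-most non-terminal, if any remains
        idx = next((j for j, sym in enumerate(current) if sym in rules), None)
        if idx is None:
            break                        # all terminals: derivation done
        expanded = current[:idx] + rules[current[idx]] + current[idx + 1:]
        seq.append(expanded)
    return seq

rules = {"S": ["NP", "VP"], "NP": ["D", "N"], "VP": ["Vi"],
         "D": ["the"], "N": ["man"], "Vi": ["sleeps"]}
for s in leftmost_derivation(rules, 10):
    print(" ".join(s))
# S / NP VP / D N VP / the N VP / the man VP / the man Vi / the man sleeps
```

This reproduces exactly the seven-step example above, ending at the terminal string "the man sleeps".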


Page 40: Collins - Statistical Methods in Natural Language Processing (Slides)


Notation

• We use T to denote the set of all left-most derivations (trees) allowed by a grammar

• We use T(x), for a string x ∈ Σ*, to denote the set of all derivations whose final string (“yield”) is x.


Page 41: Collins - Statistical Methods in Natural Language Processing (Slides)


The Problem with Parsing: Ambiguity

INPUT: She announced a program to promote safety in trucks and vans

POSSIBLE OUTPUTS: several distinct parse trees for this one string (the original slide draws them). And there are more...


Page 42: Collins - Statistical Methods in Natural Language Processing (Slides)


An Example Tree

Canadian Utilities had 1988 revenue of C$ 1.16 billion, mainly from its natural gas and electric utility businesses in Alberta, where the company serves about 800,000 customers.

Canadian/NNP Utilities/NNPS had/VBD 1988/CD revenue/NN of/IN C$/$ 1.16/CD billion/CD ,/PUNC, mainly/RB from/IN its/PRP$ natural/JJ gas/NN and/CC electric/JJ utility/NN businesses/NNS in/IN Alberta/NNP ,/PUNC, where/WRB the/DT company/NN serves/VBZ about/RB 800,000/CD customers/NNS ./PUNC.

(The original slide draws the full constituent tree, with NP, VP, PP, ADVP, QP, WHADVP, SBAR, and S nodes under TOP.)


Page 43: Collins - Statistical Methods in Natural Language Processing (Slides)


A Probabilistic Context-Free Grammar

S  → NP VP      1.0
VP → Vi         0.4
VP → Vt NP      0.4
VP → VP PP      0.2
NP → D N        0.3
NP → NP PP      0.7
PP → P NP       1.0
Vi → sleeps     1.0
Vt → saw        1.0
N  → man        0.7
N  → woman      0.2
N  → telescope  0.1
D  → the        1.0
P  → with       0.5
P  → in         0.5

• Probability of a tree with rules α_i → β_i is Π_i P(α_i → β_i | α_i)

• Maximum likelihood estimation:

  P(VP → V NP | VP) = Count(VP → V NP) / Count(VP)
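The maximum-likelihood estimate can be read directly off treebank counts. A sketch, with trees encoded as nested (label, children) tuples and leaves as plain strings; the encoding and function name are my own, not from the slides:

```python
from collections import Counter

def mle_rule_probs(trees):
    """Estimate P(alpha -> beta | alpha) = Count(alpha -> beta) / Count(alpha).

    A tree is a (label, children) pair; a leaf child is a plain string.
    """
    rule_count, lhs_count = Counter(), Counter()

    def visit(node):
        label, children = node
        # The right-hand side: child labels, or the word itself at a leaf.
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        rule_count[(label, rhs)] += 1
        lhs_count[label] += 1
        for c in children:
            if not isinstance(c, str):
                visit(c)

    for tree in trees:
        visit(tree)
    return {rule: count / lhs_count[rule[0]]
            for rule, count in rule_count.items()}

# (S (NP (D the) (N man)) (VP (Vi sleeps)))
tree = ("S", [("NP", [("D", ["the"]), ("N", ["man"])]),
              ("VP", [("Vi", ["sleeps"])])])
probs = mle_rule_probs([tree])
print(probs[("S", ("NP", "VP"))])  # 1.0
```

A tree's probability is then the product of the probabilities of the rules it uses, as in the formula above.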


Page 44: Collins - Statistical Methods in Natural Language Processing (Slides)


PCFGs

[Booth and Thompson 73] showed that a CFG with rule probabilities correctly defines a distribution over the set of derivations T provided that:

1. The rule probabilities define conditional distributions over the different ways of rewriting each non-terminal.

2. A technical condition on the rule probabilities ensures that the probability of the derivation terminating in a finite number of steps is 1. (This condition is not really a practical concern.)


Page 45: Collins - Statistical Methods in Natural Language Processing (Slides)


(TOP (S (NP (N IBM))
        (VP (V bought)
            (NP (N Lotus)))))

PROB = P(TOP → S)
     × P(S → NP VP)  × P(N → IBM)
     × P(VP → V NP)  × P(V → bought)
     × P(NP → N)     × P(N → Lotus)
     × P(NP → N)


Page 46: Collins - Statistical Methods in Natural Language Processing (Slides)


The SPATTER Parser (Magerman 95; Jelinek et al 94)

• For each rule, identify the “head” child

S  → NP VP
VP → V NP
NP → DT N

• Add the head word to each non-terminal

(S(questioned)
   (NP(lawyer) (DT the) (N lawyer))
   (VP(questioned) (V questioned)
      (NP(witness) (DT the) (N witness))))

A Lexicalized PCFG


S(questioned)  → NP(lawyer) VP(questioned)      ??
VP(questioned) → V(questioned) NP(witness)      ??
NP(lawyer)     → D(the) N(lawyer)               ??
NP(witness)    → D(the) N(witness)              ??

• The big question: how to estimate rule probabilities??


CHARNIAK (1997)

S(questioned)

  + P(NP VP | S(questioned))

S(questioned)
 → NP VP(questioned)

  + P(lawyer | S, VP, NP, questioned)

S(questioned)
 → NP(lawyer) VP(questioned)

Smoothed Estimation


P(NP VP | S(questioned))

  = λ₁ × Count(S(questioned) → NP VP) / Count(S(questioned))

  + λ₂ × Count(S → NP VP) / Count(S)

• Where 0 ≤ λ₁, λ₂ ≤ 1 and λ₁ + λ₂ = 1
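The interpolated estimate above can be sketched directly; the counts and λ values below are hypothetical toy numbers chosen only to show the blending:

```python
def interpolate(estimates, lambdas):
    """Linearly blend ML estimates; the lambdas must be >= 0 and sum to 1."""
    assert all(l >= 0 for l in lambdas) and abs(sum(lambdas) - 1.0) < 1e-9
    return sum(l * e for l, e in zip(lambdas, estimates))

# Hypothetical counts: the specific history S(questioned) is sparse,
# the backed-off history S is much better attested.
p_specific = 3 / 10       # Count(S(questioned) -> NP VP) / Count(S(questioned))
p_backoff = 7000 / 10000  # Count(S -> NP VP) / Count(S)

p = interpolate([p_specific, p_backoff], [0.4, 0.6])
print(p)  # 0.4 * 0.3 + 0.6 * 0.7 = 0.54
```

The same helper handles the three-way interpolation on the next slide by passing three estimates and three lambdas.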

Smoothed Estimation


P(lawyer | S, NP, VP, questioned)

  = λ₁ × Count(lawyer, S, NP, VP, questioned) / Count(S, NP, VP, questioned)

  + λ₂ × Count(lawyer, S, NP, VP) / Count(S, NP, VP)

  + λ₃ × Count(lawyer, NP) / Count(NP)

• Where 0 ≤ λ₁, λ₂, λ₃ ≤ 1 and λ₁ + λ₂ + λ₃ = 1


P(NP(lawyer) VP(questioned) | S(questioned))

  = ( λ₁ × Count(S(questioned) → NP VP) / Count(S(questioned))
    + λ₂ × Count(S → NP VP) / Count(S) )

  × ( λ₁ × Count(lawyer, S, NP, VP, questioned) / Count(S, NP, VP, questioned)
    + λ₂ × Count(lawyer, S, NP, VP) / Count(S, NP, VP)
    + λ₃ × Count(lawyer, NP) / Count(NP) )

Lexicalized Probabilistic Context-Free Grammars

• Transformation to lexicalized rules:
  S → NP VP  vs.  S(questioned) → NP(lawyer) VP(questioned)

• Smoothed estimation techniques "blend" different counts

• Search for most probable tree through dynamic programming

• Perform vastly better than PCFGs (88% vs. 73% accuracy)

Independence Assumptions

• PCFGs

  [S [NP [DT the] [N lawyer]]
     [VP [V questioned] [NP [DT the] [N witness]]]]

• Lexicalized PCFGs

  [S(questioned) [NP(lawyer) [DT the] [N lawyer]]
                 [VP(questioned) [V questioned]
                                 [NP(witness) [DT the] [N witness]]]]

Results


Method                                                 Accuracy
PCFGs (Charniak 97)                                    73.0%
Conditional Models – Decision Trees (Magerman 95)      84.2%
Lexical Dependencies (Collins 96)                      85.5%
Conditional Models – Logistic (Ratnaparkhi 97)         86.9%
Generative Lexicalized Model (Charniak 97)             86.7%
Generative Lexicalized Model (Collins 97)              88.2%
Logistic-inspired Model (Charniak 99)                  89.6%
Boosting (Collins 2000)                                89.8%

• Accuracy = average recall/precision

Parsing for Information Extraction:
Relationships between Entities

INPUT: Boeing is located in Seattle.

OUTPUT:

  Relationship = Company-Location
  Company = Boeing
  Location = Seattle

A Generative Model (Miller et al.)


[Miller et al. 2000] use non-terminals to carry lexical items and semantic tags:

  [S-is-CL
    [NP-Boeing-COMPANY Boeing]
    [VP-is-CLLOC [V is]
      [VP-located-CLLOC [V located]
        [PP-in-CLLOC [P in]
          [NP-Seattle-LOCATION Seattle]]]]]

  (in PP-in-CLLOC, "in" is the lexical head and CLLOC is the semantic tag)

A Generative Model [Miller et al. 2000]


We're now left with an even more complicated estimation problem,

  P(S-is-CL → NP-Boeing-COMPANY VP-is-CLLOC)

See [Miller et al. 2000] for the details.

• Parsing algorithm recovers annotated trees ⇒ simultaneously recovers syntactic structure and named-entity relationships

• Accuracy (precision/recall) is greater than 80% in recovering relations

Techniques Covered in this Tutorial

• Log-linear (maximum-entropy) taggers

• Probabilistic context-free grammars (PCFGs)

• PCFGs with enriched non-terminals

• Discriminative methods:
  – Conditional Markov Random Fields
  – Perceptron algorithms
  – Kernels over NLP structures

Linear Models for Parsing and Tagging

• Three components:

  GEN is a function from a string to a set of candidates

  Φ maps a candidate to a feature vector

  W is a parameter vector

Component 1: GEN

• GEN enumerates a set of candidates for a sentence

She announced a program to promote safety in trucks and vans

  ⇓ GEN

(six candidate parse trees for the sentence, differing in how "to promote safety", the PP "in trucks", and the coordination "trucks and vans" attach)

Examples of GEN

• A context-free grammar

• A finite-state machine

• Top N most probable analyses from a probabilistic grammar

Component 2: Φ

• Φ maps a candidate to a feature vector ∈ ℝᵈ

• Φ defines the representation of a candidate

(a candidate parse tree for the example sentence)

  ⇓ Φ

  ⟨1, 0, 2, 0, 0, …⟩

Features

• A "feature" is a function on a structure, e.g.,

  h(x) = Number of times  A → B C  is seen in x

(two example trees, T₁ and T₂; the configuration A → B C appears once in T₁ and twice in T₂)

  h(T₁) = 1     h(T₂) = 2

Feature Vectors

• A set of functions h₁ … h_d define a feature vector

  Φ(x) = ⟨h₁(x), h₂(x), …, h_d(x)⟩

(for the same two trees T₁ and T₂)

  Φ(T₁) = ⟨1, 0, 0, 3⟩     Φ(T₂) = ⟨2, 0, 1, 1⟩

Component 3: W

• W is a parameter vector ∈ ℝᵈ

• Φ and W together map a candidate to a real-valued score

(a candidate parse tree)

  ⇓ Φ

  ⟨1, 0, 2, 0, 0, …⟩

  ⇓ Φ · W

  a real-valued score

Putting it all Together

• X is the set of sentences, Y is the set of possible outputs (e.g. trees)

• Need to learn a function F : X → Y

• GEN, Φ, W define

  F(x) = argmax_{y ∈ GEN(x)} Φ(y) · W

  Choose the highest scoring tree as the most plausible structure

• Given examples (x_i, y_i), how to set W?
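The three components combine into a one-line decoder. A toy sketch, in which the candidate names, feature table, and weights are all invented for illustration (a real GEN would come from a grammar or tagger):

```python
def score(phi, w):
    """Inner product Phi(y) . W."""
    return sum(p * wi for p, wi in zip(phi, w))

def F(x, gen, phi, w):
    """F(x) = argmax over y in GEN(x) of Phi(y) . W."""
    return max(gen(x), key=lambda y: score(phi(y), w))

# Toy instance: GEN returns two candidate parses for the input,
# Phi is a lookup table of precomputed feature vectors.
candidates = {"she saw him": ["parse_a", "parse_b"]}
features = {"parse_a": [1, 0, 2], "parse_b": [0, 1, 1]}
w = [0.5, 1.0, 0.25]

print(F("she saw him", candidates.__getitem__, features.__getitem__, w))
# parse_b (it scores 1.25 vs. 1.0 for parse_a)
```

The open question the slide ends on, how to set W from examples (x_i, y_i), is exactly what the MRF and perceptron sections answer.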

She announced a program to promote safety in trucks and vans

  ⇓ GEN

(the six candidate parse trees)

  ⇓ Φ (one feature vector per candidate)

  ⇓ Φ · W (one score per candidate)

  13.6    12.2    12.1    3.3    9.4    11.1

  ⇓ argmax

(the candidate scoring 13.6 is returned)

Markov Random Fields

• Parameters W define a conditional distribution over candidates:

  P(y_i | x_i, W) = e^{Φ(y_i) · W} / Σ_{y ∈ GEN(x_i)} e^{Φ(y) · W}

• Gaussian prior: log P(W) = −‖W‖² / 2σ² + constant

• MAP parameter estimates maximise

  Σ_i log ( e^{Φ(y_i) · W} / Σ_{y ∈ GEN(x_i)} e^{Φ(y) · W} ) − ‖W‖² / 2σ²

Note: This is a "globally normalised" model

Markov Random Fields Example 1: [Johnson et al. 1999]


• GEN is the set of parses for a sentence with a hand-crafted grammar (a Lexical Functional Grammar)

• Φ can include arbitrary features of the candidate parses

• W is estimated using conjugate gradient descent

Markov Random Fields Example 2: [Lafferty et al. 2001]

Going back to tagging:

• Inputs x are sentences w[1:n]

• GEN(w[1:n]) = T^n, i.e. all tag sequences of length n

• Global representations Φ are composed from local feature vectors φ,

  Φ(w[1:n], t[1:n]) = Σ_{j=1}^{n} φ(h_j, t_j)

  where h_j = ⟨t_{j−2}, t_{j−1}, w[1:n], j⟩

Markov Random Fields Example 2: [Lafferty et al. 2001]

• Typically, local features are indicator functions, e.g.,

  φ₁₀₁(h_j, t_j) = 1 if current word w_j ends in "ing" and t_j = VBG; 0 otherwise

• and global features are then counts,

  Φ₁₀₁(w[1:n], t[1:n]) = Number of times a word ending in "ing" is tagged as VBG in (w[1:n], t[1:n])
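The construction of a global feature as a sum of a local indicator over positions can be sketched as follows; the history-tuple layout follows the slide, while the example sentence and tags are invented:

```python
def phi_101(history, tag):
    """Local indicator: 1 iff the current word ends in 'ing' and tag is VBG."""
    _t_minus2, _t_minus1, words, j = history
    return 1 if words[j].endswith("ing") and tag == "VBG" else 0

def global_feature(words, tags, local):
    """Global feature = sum of a local feature over all positions j."""
    total = 0
    for j in range(len(words)):
        # h_j = <t_{j-2}, t_{j-1}, w[1:n], j>, with "*" padding at the start.
        history = (tags[j - 2] if j >= 2 else "*",
                   tags[j - 1] if j >= 1 else "*",
                   words, j)
        total += local(history, tags[j])
    return total

words = ["he", "is", "running", "and", "singing"]
tags = ["PRP", "VBZ", "VBG", "CC", "VBG"]
print(global_feature(words, tags, phi_101))  # 2
```

Any local feature with this signature can be summed the same way, which is what makes the dynamic-programming computations later in the section possible.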

Markov Random Fields Example 2: [Lafferty et al. 2001]

Conditional random fields are globally normalised models:


  log P(t[1:n] | w[1:n]) = Φ(w[1:n], t[1:n]) · W − log Z(w[1:n], W)

                         = Σ_{j=1}^{n} Σ_s W_s φ_s(h_j, t_j)  −  log Z(w[1:n], W)
                           [linear model]                        [normalization]

  where Z(w[1:n], W) = Σ_{t[1:n] ∈ T^n} e^{Φ(w[1:n], t[1:n]) · W}

Log-linear taggers (see the earlier part of the tutorial) are locally normalised models:

  log P(t[1:n] | w[1:n]) = Σ_{j=1}^{n} Σ_s W_s φ_s(h_j, t_j)  −  Σ_{j=1}^{n} log Z(h_j, W)
                           [linear model]                        [local normalization]
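The global normalization above can be checked by brute force on a toy instance, computing one partition function over all |T|^n tag sequences. The feature set and weights below are invented; a real CRF replaces the enumeration with dynamic programming:

```python
import math
from itertools import product

def seq_score(words, tags, w):
    """Phi(w[1:n], t[1:n]) . W with toy local features keyed on
    (previous tag, tag) and (word, tag); weights are invented."""
    s, prev = 0.0, "*"
    for word, tag in zip(words, tags):
        s += w.get((prev, tag), 0.0) + w.get((word, tag), 0.0)
        prev = tag
    return s

def crf_prob(words, tags, w, tagset):
    """Globally normalised P(t[1:n] | w[1:n]): a single partition function
    over all |T|^n tag sequences (brute force; fine at toy sizes)."""
    z = sum(math.exp(seq_score(words, list(u), w))
            for u in product(tagset, repeat=len(words)))
    return math.exp(seq_score(words, tags, w)) / z

words = ["the", "dog"]
tagset = ["DT", "NN"]
w = {("*", "DT"): 1.0, ("DT", "NN"): 1.0, ("the", "DT"): 2.0, ("dog", "NN"): 2.0}
print(crf_prob(words, ["DT", "NN"], w, tagset))
```

Because the normalizer is a single sum over whole sequences rather than a per-position sum, the probabilities of all tag sequences add to exactly 1, which is the property the label-bias discussion below turns on.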

Problems with Locally Normalized Models

• "Label bias" problem [Lafferty et al. 2001]
  See also [Klein and Manning 2002]

• Example of a conditional distribution that locally normalized models can't capture (under a bigram tag representation):

  a b c → A B C   with P(A B C | a b c) = 1
  a b e → A D E   with P(A D E | a b e) = 1

• Impossible to find parameters that satisfy

  P(A | a) × P(B | b, A) × P(C | c, B) = 1
  P(A | a) × P(D | b, A) × P(E | e, D) = 1

Markov Random Fields Example 2: [Lafferty et al. 2001]
Parameter Estimation

• Need to calculate the gradient of the log-likelihood,

  ∂/∂W Σ_i log P(t^i[1:n_i] | w^i[1:n_i], W)

  = ∂/∂W ( Σ_i Φ(w^i[1:n_i], t^i[1:n_i]) · W − Σ_i log Z(w^i[1:n_i], W) )

  = Σ_i Φ(w^i[1:n_i], t^i[1:n_i])
    − Σ_i Σ_{u[1:n_i] ∈ T^{n_i}} P(u[1:n_i] | w^i[1:n_i], W) Φ(w^i[1:n_i], u[1:n_i])

The last term looks difficult to compute. But because Φ is defined through "local" features, it can be calculated efficiently using dynamic programming. (Very similar problem to that solved by the EM algorithm for HMMs.) See [Lafferty et al. 2001].

Techniques Covered in this Tutorial

• Log-linear (maximum-entropy) taggers

• Probabilistic context-free grammars (PCFGs)

• PCFGs with enriched non-terminals

• Discriminative methods:
  – Conditional Markov Random Fields
  – Perceptron algorithms
  – Kernels over NLP structures

A Variant of the Perceptron Algorithm


Inputs: Training set (x_i, y_i) for i = 1 … n

Initialization: W = 0

Define: F(x) = argmax_{y ∈ GEN(x)} Φ(y) · W

Algorithm: For t = 1 … T, i = 1 … n

  z_i = F(x_i)

  If (z_i ≠ y_i) then W = W + Φ(y_i) − Φ(z_i)

Output: Parameters W
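The algorithm above can be sketched almost line for line. Sparse dict features and the tiny separable training set below are invented for illustration:

```python
from collections import defaultdict

def sparse_score(phi, w):
    """Phi(c) . W for a sparse feature dict."""
    return sum(v * w[f] for f, v in phi.items())

def train_perceptron(examples, gen, phi, T=5):
    """The variant above: W starts at 0; on a mistake on (x_i, y_i),
    add Phi(y_i) and subtract Phi(z_i)."""
    w = defaultdict(float)
    for _ in range(T):
        for x, y in examples:
            z = max(gen(x), key=lambda c: sparse_score(phi(c), w))  # F(x_i)
            if z != y:
                for f, v in phi(y).items():
                    w[f] += v
                for f, v in phi(z).items():
                    w[f] -= v
    return w

# Toy separable problem: each input has one correct and one incorrect
# candidate, distinguished by a single feature.
GEN = {"x1": ["bad1", "good1"], "x2": ["bad2", "good2"]}
PHI = {"good1": {"g": 1}, "bad1": {"b": 1},
       "good2": {"g": 1}, "bad2": {"b": 1}}
w = train_perceptron([("x1", "good1"), ("x2", "good2")],
                     GEN.__getitem__, PHI.__getitem__)
print(w["g"], w["b"])  # 1.0 -1.0
```

After one mistake the weights already separate the data, which is the behaviour the mistake bound on the next slide quantifies.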

Theory Underlying the Algorithm

• Definition: GEN′(x_i) = GEN(x_i) − {y_i}, i.e. the incorrect candidates for x_i

• Definition: The training set is separable with margin δ if there is a vector U ∈ ℝᵈ with ‖U‖ = 1 such that

  ∀i, ∀z ∈ GEN′(x_i):  U · Φ(y_i) − U · Φ(z) ≥ δ

Theorem: For any training sequence (x_i, y_i) which is separable with margin δ, the perceptron algorithm satisfies

  Number of mistakes ≤ R² / δ²

where R is a constant such that ∀i, ∀z ∈ GEN′(x_i): ‖Φ(y_i) − Φ(z)‖ ≤ R

Proof: Direct modification of the proof for the classification case. See [Collins 2002].

More Theory for the Perceptron Algorithm

• Question 1: what if the data is not separable?
  [Freund and Schapire 99] give a modified theorem for this case

• Question 2: performance on training data is all very well, but what about performance on new test examples?

Assume some distribution P(x, y) underlying examples.

Theorem [Helmbold and Warmuth 95]: For any distribution P(x, y) generating examples, if e is the expected number of mistakes of an online algorithm on a sequence of m + 1 examples, then a randomized algorithm trained on m samples will have probability e/(m + 1) of making an error on a newly drawn example from P.

[Freund and Schapire 99] use this to define the Voted Perceptron.

Perceptron Algorithm 1: Tagging

• Score for a (w[1:n], t[1:n]) pair is

  Score(w[1:n], t[1:n]) = Σ_{j=1}^{n} Σ_s W_s φ_s(h_j, t_j) = Σ_s W_s Φ_s(w[1:n], t[1:n])

• Note: no normalization terms

• Note: Score(w[1:n], t[1:n]) is not a log probability

• Viterbi algorithm for

  argmax_{t[1:n] ∈ T^n} Score(w[1:n], t[1:n])

Training the Parameters

Inputs: Training set (w^i[1:n_i], t^i[1:n_i]) for i = 1 … n

Initialization: W = 0

Algorithm: For t = 1 … T, i = 1 … n

  z[1:n_i] = argmax_{u[1:n_i] ∈ T^{n_i}} Σ_j Σ_s W_s φ_s(h_j, u_j)

  (z[1:n_i] is the output on the i'th sentence with the current parameters)

  If z[1:n_i] ≠ t^i[1:n_i] then

    W_s = W_s + Φ_s(w^i[1:n_i], t^i[1:n_i]) − Φ_s(w^i[1:n_i], z[1:n_i])
                [correct tags' feature value]  [incorrect tags' feature value]

Output: Parameter vector W

An Example

Say the correct tags for the i'th sentence are

  the/DT man/NN bit/VBD the/DT dog/NN

Under current parameters, the output is

  the/DT man/NN bit/NN the/DT dog/NN

Assume also that features track: (1) all bigrams; (2) word/tag pairs

Parameters incremented:

  ⟨NN, VBD⟩   ⟨VBD, DT⟩   ⟨VBD → bit⟩

Parameters decremented:

  ⟨NN, NN⟩   ⟨NN, DT⟩   ⟨NN → bit⟩
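The update in this example can be reproduced mechanically: take the feature-vector difference between the correct and the proposed tag sequences, using the feature map the example assumes (tag bigrams plus word/tag pairs):

```python
from collections import Counter

def phi(words, tags):
    """Feature map from the example: (1) tag bigrams; (2) word/tag pairs."""
    feats = Counter()
    prev = "*"
    for w, t in zip(words, tags):
        feats[(prev, t)] += 1  # tag bigram
        feats[(w, t)] += 1     # word/tag pair
        prev = t
    return feats

words = ["the", "man", "bit", "the", "dog"]
correct = ["DT", "NN", "VBD", "DT", "NN"]
output = ["DT", "NN", "NN", "DT", "NN"]

# Counter subtraction keeps only positive counts, so these are exactly
# the features incremented and decremented by the perceptron update.
incremented = phi(words, correct) - phi(words, output)
decremented = phi(words, output) - phi(words, correct)
print(sorted(incremented))  # [('NN', 'VBD'), ('VBD', 'DT'), ('bit', 'VBD')]
print(sorted(decremented))  # [('NN', 'DT'), ('NN', 'NN'), ('bit', 'NN')]
```

Features shared by both tag sequences cancel in the difference, which is why only the three features around the mistagged word move.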

Experiments

• Wall Street Journal part-of-speech tagging data
  Perceptron = 2.89%, Max-ent = 3.28%
  (11.9% relative error reduction)

• [Ramshaw and Marcus 95] NP chunking data
  Perceptron = 93.63%, Max-ent = 93.29%
  (5.1% relative error reduction)

See [Collins 2002]

Perceptron Algorithm 2: Reranking Approaches

• GEN is the top n most probable candidates from a base model

  – Parsing: a lexicalized probabilistic context-free grammar
  – Tagging: "maximum entropy" tagger
  – Speech recognition: existing recogniser

Parsing Experiments

• GEN: Beam search used to parse training and test sentences; around 27 parses for each sentence

• Φ = ⟨L(x), h₁(x), …, h_m(x)⟩, where L(x) is the log-likelihood from the first-pass parser and h₁ … h_m are around 500,000 indicator functions

  h₁(x) = 1 if x contains the rule ⟨S → NP VP⟩; 0 otherwise

(a candidate parse tree)

  ⇓ Φ

  ⟨the log-likelihood L(x), followed by 0/1 indicator values⟩

Named Entities

• GEN: Top 20 segmentations from a "maximum-entropy" tagger

• Φ = ⟨L(x), h₁(x), …, h_m(x)⟩,

  h₁(x) = 1 if x contains a boundary = "[The"; 0 otherwise

Whether you're an aging flower child or a clueless [Gen-Xer], "[The Day They Shot John Lennon]," playing at the [Dougherty Arts Center], entertains the imagination.

  ⇓ Φ

  ⟨the log-likelihood L(x), followed by 0/1 indicator values⟩

Candidate segmentations, each mapped by Φ to its log-likelihood plus indicator values:

Whether you're an aging flower child or a clueless [Gen-Xer], "[The Day They Shot John Lennon]," playing at the [Dougherty Arts Center], entertains the imagination.

Whether you're an aging flower child or a clueless Gen-Xer, "The Day [They Shot John Lennon]," playing at the [Dougherty Arts Center], entertains the imagination.

Whether you're an aging flower child or a clueless [Gen-Xer], "The Day [They Shot John Lennon]," playing at the [Dougherty Arts Center], entertains the imagination.

Experiments


Parsing Wall Street Journal Treebank

Training set = 40,000 sentences, test     2,416 sentencesState-of-the-art parser: 88.2% F-measureReranked model: 89.5% F-measure (11% relative error reduction)Boosting: 89.7% F-measure (13% relative error reduction)

Recovering Named-Entities in Web Data

Training data = 53,609 sentences (1,047,491 words), test data = 14,717 sentences (291,898 words)
State-of-the-art tagger: 85.3% F-measure
Reranked model: 87.9% F-measure (17.7% relative error reduction)

Boosting: 87.6% F-measure (15.6% relative error reduction)
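The relative error reductions quoted here follow directly from the F-measures: error is 100 − F, and the reduction is the drop in error as a fraction of the baseline error. A quick arithmetic check (my own sketch, not part of the slides):

```python
def relative_error_reduction(baseline_f, improved_f):
    """Relative reduction in error (100 - F) versus the baseline, in percent."""
    baseline_err = 100.0 - baseline_f
    improved_err = 100.0 - improved_f
    return 100.0 * (baseline_err - improved_err) / baseline_err

# Parsing: 88.2% -> 89.5% F-measure is an 11% relative error reduction
print(round(relative_error_reduction(88.2, 89.5)))      # -> 11
# Named entities: 85.3% -> 87.9% is a 17.7% relative error reduction
print(round(relative_error_reduction(85.3, 87.9), 1))   # -> 17.7
```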

Perceptron Algorithm 3: Kernel Methods (Work with Nigel Duffy)


   ̄ It’s simple to derive a “dual form” of the perceptron algorithm

If we can compute Φ(x) · Φ(y) efficiently, we can learn efficiently with the representation Φ
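In the dual form, the weight vector is never stored explicitly: the algorithm keeps one count per training example (how often it was misclassified) and makes predictions using only inner products, so any kernel K(x, y) = Φ(x) · Φ(y) can be plugged in. A minimal binary-classification sketch of this idea (the toy data and polynomial kernel are illustrative, not from the slides):

```python
def kernel_perceptron(examples, kernel, epochs=10):
    """Dual-form perceptron: alpha[i] counts mistakes made on example i."""
    alpha = [0] * len(examples)
    for _ in range(epochs):
        for i, (x, y) in enumerate(examples):
            # Prediction touches inputs only through kernel evaluations
            score = sum(a * yj * kernel(xj, x)
                        for a, (xj, yj) in zip(alpha, examples) if a)
            if y * score <= 0:          # mistake: bump this example's weight
                alpha[i] += 1
    return alpha

def predict(examples, alpha, kernel, x):
    s = sum(a * yj * kernel(xj, x) for a, (xj, yj) in zip(alpha, examples) if a)
    return 1 if s > 0 else -1

# Toy XOR-like data: not linearly separable in the input space,
# but separable under a degree-2 polynomial kernel.
poly2 = lambda u, v: (1 + sum(ui * vi for ui, vi in zip(u, v))) ** 2
data = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
alpha = kernel_perceptron(data, poly2)
print([predict(data, alpha, poly2, x) for x, _ in data])  # -> [1, 1, -1, -1]
```

Equivalently, the implicit weight vector is Σᵢ αᵢ yᵢ Φ(xᵢ); the kernel lets us use that vector without ever materializing Φ.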

“All Subtrees” Representation [Bod 98]

   ̄ Given: Non-Terminal symbols {A, B, C, …}


Terminal symbols {b, c, …}

   ̄ An infinite set of subtrees, e.g.:

      A        A         A
     / \       |        / \
    B   C      B       B   C
               |       |
               b       b

      …

   ̄ Step 1: Choose an (arbitrary) mapping from subtrees to integers, so that

  h_i(x) = Number of times subtree i is seen in x

  Φ(x) = ⟨h_1(x), h_2(x), h_3(x), …⟩


   ̄ Φ is now huge

   ̄ But the inner product Φ(T₁) · Φ(T₂) can be computed efficiently using dynamic programming. See [Collins and Duffy 2001, Collins and Duffy 2002]
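The dynamic program rests on a recursion for C(n1, n2), the number of common subtrees rooted at both n1 and n2: it is 0 when the productions at the two nodes differ, and otherwise the product over child positions of (1 + C of the corresponding children); the kernel is the sum of C over all node pairs. A sketch along those lines (the tuple tree encoding and helper names are my own, not from the slides):

```python
def production(node):
    """A node is (label, [children]); leaves are (label, [])."""
    label, children = node
    return (label, tuple(child[0] for child in children))

def common_subtrees(n1, n2):
    """C(n1, n2): number of common subtrees rooted at both n1 and n2."""
    if not n1[1] or not n2[1]:           # leaves root no subtrees
        return 0
    if production(n1) != production(n2):
        return 0
    result = 1
    for c1, c2 in zip(n1[1], n2[1]):     # product over child positions
        result *= 1 + common_subtrees(c1, c2)
    return result

def tree_kernel(t1, t2):
    """Phi(T1) . Phi(T2) = sum of C(n1, n2) over all node pairs."""
    def nodes(t):
        yield t
        for child in t[1]:
            yield from nodes(child)
    return sum(common_subtrees(a, b) for a in nodes(t1) for b in nodes(t2))

# Two small trees sharing the fragments [A B C], [A [B b] C], and [B b]
t1 = ("A", [("B", [("b", [])]), ("C", [("c", [])])])
t2 = ("A", [("B", [("b", [])]), ("C", [("d", [])])])
print(tree_kernel(t1, t2))  # -> 3
```

Memoizing `common_subtrees` over node pairs turns the plain recursion into the quadratic-time dynamic program the slide refers to.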

Similar Kernels Exist for Tagged Sequences



Whether you’re an aging flower child or a clueless [Gen-Xer], “[The Day They Shot John Lennon],” playing at the [Dougherty Arts Center], entertains the imagination.

  ⇓ Φ
⟨features counting fragments of the tagged sequence, e.g. “Whether”, “[Gen-Xer],”, “Day They”, “John Lennon],” playing, …⟩

Experiments



Parsing Wall Street Journal Treebank
Training set = 40,000 sentences, test set = 2,416 sentences
State-of-the-art parser: 88.5% F-measure
Reranked model: 89.1% F-measure (5% relative error reduction)

Recovering Named-Entities in Web Data

Training data = 53,609 sentences (1,047,491 words), test data = 14,717 sentences (291,898 words)
State-of-the-art tagger: 85.3% F-measure
Reranked model: 87.6% F-measure (15.6% relative error reduction)

Conclusions

Some Other Topics in Statistical NLP:



   ̄ Machine translation

   ̄ Unsupervised/partially supervised methods

   ̄ Finite state machines

   ̄ Generation

   ̄ Question answering

   ̄ Coreference

   ̄ Language modeling for speech recognition

   ̄ Lexical semantics

   ̄ Word sense disambiguation

   ̄ Summarization

MACHINE TRANSLATION (BROWN ET AL.)

   ̄ Training corpus: Canadian parliament (French-English translations)


   ̄ Task: learn mapping from French Sentence → English Sentence

   ̄ Noisy channel model:

  Translation(f) = argmax_e P(e | f) = argmax_e P(e) P(f | e)

   ̄ Parameterization:

  P(f | e) = Σ_a P(a | e) P(f | a, e)

   ̄ P(f | e) is a sum over possible alignments a from English to French. Model estimation through EM
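The simplest member of the Brown et al. family, IBM Model 1, makes this concrete: P(f | a, e) reduces to a product of word-to-word probabilities t(f_j | e_{a_j}), and the E-step has a closed form. A toy EM sketch (ignoring the NULL word; the two-sentence corpus is illustrative, not from the slides):

```python
from collections import defaultdict
from itertools import product

def model1_em(corpus, iterations=10):
    """EM for IBM Model 1 translation probabilities t(f | e)."""
    f_vocab = {f for _, fs in corpus for f in fs}
    e_vocab = {e for es, _ in corpus for e in es}
    # Uniform initialization
    t = {(f, e): 1.0 / len(f_vocab) for f, e in product(f_vocab, e_vocab)}
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for es, fs in corpus:
            for f in fs:
                # E-step: posterior over which English word f aligns to
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize so that sum_f t(f | e) = 1
        t = {(f, e): count[(f, e)] / total[e] if total[e] else 0.0
             for (f, e) in t}
    return t

# Toy parallel corpus (English, French): "la" co-occurs with "the" in both
# sentence pairs, so EM pushes t(la | the) toward 1.
corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "flower"], ["la", "fleur"])]
t = model1_em(corpus)
print(round(t[("la", "the")], 2))  # close to 1.0, and closer with more iterations
```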

References

[Bod 98] Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.

[Booth and Thompson 73] Booth, T., and Thompson, R. (1973). Applying probability measures to abstract languages. IEEE Transactions on Computers, C-22(5), pages 442–450.

[Borthwick et al. 98] Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. In Proceedings of the Sixth Workshop on Very Large Corpora.

[Collins and Duffy 2001] Collins, M., and Duffy, N. (2001). Convolution Kernels for Natural Language. In Proceedings of NIPS 14.

[Collins and Duffy 2002] Collins, M., and Duffy, N. (2002). New Ranking Algorithms for Parsing and Tagging: Kernels over Discrete Structures, and the Voted Perceptron. In Proceedings of ACL 2002.

[Collins 2002] Collins, M. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm. In Proceedings of EMNLP 2002.

[Freund and Schapire 99] Freund, Y., and Schapire, R. (1999). Large Margin Classification using the Perceptron Algorithm. Machine Learning, 37(3):277–296.

[Helmbold and Warmuth 95] Helmbold, D., and Warmuth, M. (1995). On Weak Learning. Journal of Computer and System Sciences, 50(3):551–573, June 1995.

[Hopcroft and Ullman 1979] Hopcroft, J. E., and Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading, Mass.: Addison–Wesley.

[Johnson et al. 1999] Johnson, M., Geman, S., Canon, S., Chi, S., and Riezler, S. (1999). Estimators for stochastic “unification-based” grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. San Francisco: Morgan Kaufmann.

[Lafferty et al. 2001] Lafferty, J., McCallum, A., and Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML-01, pages 282–289.

[MSM93] Marcus, M., Santorini, B., and Marcinkiewicz, M. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313–330.

[McCallum et al. 2000] McCallum, A., Freitag, D., and Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. In Proceedings of ICML 2000.

[Miller et al. 2000] Miller, S., Fox, H., Ramshaw, L., and Weischedel, R. (2000). A Novel Use of Statistical Parsing to Extract Information from Text. In Proceedings of ANLP 2000.

[Ramshaw and Marcus 95] Ramshaw, L., and Marcus, M. P. (1995). Text Chunking Using Transformation-Based Learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, Association for Computational Linguistics, 1995.

[Ratnaparkhi 96] Ratnaparkhi, A. (1996). A Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Empirical Methods in Natural Language Processing Conference.