Information extraction 2 Day 37 LING 681.02 Computational Linguistics Harry Howard Tulane University
Page 1

Information extraction 2
Day 37

LING 681.02
Computational Linguistics

Harry Howard
Tulane University

Page 2

23-Nov-2009 LING 681.02, Prof. Howard, Tulane University

Course organization

http://www.tulane.edu/~howard/NLP/

Page 3

Extracting information from text

NLPP §7

Page 4

Workflow for info extraction

Page 5

Chunking

Page 6

Hierarchical structure

Chunks can be represented as trees, as seen in the chunk parser from last time.

Hierarchy from tags: IOB tags

Inside, Outside, Begin
IOB tags, for example:

We      PRP  B-NP
saw     VBD  O
the     DT   B-NP
little  JJ   I-NP
yellow  JJ   I-NP
dog     NN   I-NP
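To make the link between IOB tags and chunks concrete, here is a minimal sketch in plain Python (no NLTK needed) that groups the tagged words above into chunks. The function name `iob_to_chunks` is made up for this illustration and is not part of NLTK; an I- tag with no preceding B- tag is simply treated as outside any chunk, which is an assumption of this sketch.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (chunk_type, [words]) chunks."""
    chunks = []
    current = None  # the chunk currently being built, if any
    for word, pos, iob in tagged:
        if iob.startswith("B-"):
            # A B- tag always opens a new chunk of the given type
            current = (iob[2:], [word])
            chunks.append(current)
        elif iob.startswith("I-") and current is not None and current[0] == iob[2:]:
            # An I- tag continues the open chunk of the same type
            current[1].append(word)
        else:
            # O tag (or a stray I- tag): we are outside any chunk
            current = None
    return chunks

# The example sentence from the slide
sentence = [
    ("We", "PRP", "B-NP"),
    ("saw", "VBD", "O"),
    ("the", "DT", "B-NP"),
    ("little", "JJ", "I-NP"),
    ("yellow", "JJ", "I-NP"),
    ("dog", "NN", "I-NP"),
]

print(iob_to_chunks(sentence))
# [('NP', ['We']), ('NP', ['the', 'little', 'yellow', 'dog'])]
```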

Page 7

Results

Page 8

Developing & evaluating chunkers

NLPP 7.3

Page 9

Overview

Need a corpus that is already chunked to evaluate a new chunker.
CoNLL-2000 Chunking Corpus from the Wall Street Journal

Evaluation
Training
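Chunkers are commonly scored by comparing the chunks they guess against the gold-standard chunks, using precision, recall and F-measure. The sketch below shows that comparison in plain Python. The helper names and the "guessed" tag sequence are invented for illustration; they are not taken from the CoNLL-2000 corpus or from NLTK.

```python
def chunk_spans(iob_tags):
    """Return the set of (type, start, end) spans encoded by a list of IOB tags."""
    spans = set()
    start, ctype = None, None
    for i, tag in enumerate(iob_tags + ["O"]):  # sentinel "O" closes the last chunk
        continues = tag.startswith("I-") and ctype == tag[2:]
        if start is not None and not continues:
            spans.add((ctype, start, i))  # end index is exclusive
            start, ctype = None, None
        if tag.startswith("B-"):
            start, ctype = i, tag[2:]
    return spans

def precision_recall_f(gold, guess):
    """Score guessed chunk spans against gold spans (exact match only)."""
    correct = len(gold & guess)
    precision = correct / len(guess) if guess else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Gold tags for "We saw the little yellow dog"
gold_tags = ["B-NP", "O", "B-NP", "I-NP", "I-NP", "I-NP"]
# Hypothetical chunker output that splits the second NP incorrectly
guess_tags = ["B-NP", "O", "B-NP", "I-NP", "O", "B-NP"]

p, r, f = precision_recall_f(chunk_spans(gold_tags), chunk_spans(guess_tags))
print("precision=%.2f recall=%.2f f=%.2f" % (p, r, f))
# precision=0.33 recall=0.50 f=0.40
```

Only exact span matches count as correct here, so a chunk with the right start but the wrong end earns no credit.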

Page 10

Recursion in linguistic structure

NLPP 7.4

Page 11

Nested structure

We have looked at trees, but they are different from normal linguistic trees.
NP chunks do not contain NP chunks, i.e. they are not recursive.
They do not go arbitrarily deep.
(Example on board.)

Page 12

Trees

(S (NP Alice) (VP (V chased) (NP (Det the) (N rabbit))))
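As a rough illustration of how a bracketed string like this encodes nesting, the following plain-Python sketch parses it into nested (label, children) tuples and measures how deep the structure goes. The names `parse_bracketed` and `height` are invented for this example and the parser assumes well-formed input; it is not how NLTK itself reads trees.

```python
import re

def parse_bracketed(text):
    """Parse '(S (NP Alice) ...)' into nested (label, children) tuples."""
    # Tokens are "(", ")" or any run of non-space, non-bracket characters
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(", "a subtree must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())  # recursive case: a subtree
            else:
                children.append(tokens[pos])   # base case: a word (leaf)
                pos += 1
        pos += 1  # skip the closing ")"
        return (label, children)

    return parse_node()

def height(node):
    """Leaves have height 1; a subtree is one taller than its tallest child."""
    if isinstance(node, str):
        return 1
    label, children = node
    return 1 + max(height(child) for child in children)

tree = parse_bracketed("(S (NP Alice) (VP (V chased) (NP (Det the) (N rabbit))))")
print(tree[0], height(tree))
# S 5
```

Because `parse_node` calls itself for every opening bracket, nothing in the format limits how deep an NP inside a VP inside an S can go, unlike the flat NP chunks discussed earlier.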

Page 13

Trees in NLTK

A tree is created in NLTK by giving a node label and a list of children:
>>> import nltk
>>> tree1 = nltk.Tree('NP', ['Alice'])
>>> print(tree1)
(NP Alice)
>>> tree2 = nltk.Tree('NP', ['the', 'rabbit'])
>>> print(tree2)
(NP the rabbit)

They can be incorporated into successively larger trees as follows:
>>> tree3 = nltk.Tree('VP', ['chased', tree2])
>>> tree4 = nltk.Tree('S', [tree1, tree3])
>>> print(tree4)
(S (NP Alice) (VP chased (NP the rabbit)))

Page 14

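Tree traversal

def traverse(t):
    try:
        # t.label() is the NLTK 3 accessor (t.node in NLTK 2)
        t.label()
    except AttributeError:
        # t has no label, so it is a leaf (a plain string)
        print(t, end=" ")
    else:
        # Now we know that t.label() is defined, so t is a subtree
        print('(', t.label(), end=" ")
        for child in t:
            traverse(child)
        print(')', end=" ")

>>> t = nltk.Tree.fromstring('(S (NP Alice) (VP chased (NP the rabbit)))')
>>> traverse(t)
( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) )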

Page 15

Named entity recognition & relation extraction

NLPP 7.5 & 7.6

Page 16

More named entities

NE Type       Examples
ORGANIZATION  Georgia-Pacific Corp., WHO
PERSON        Eddy Bonte, President Obama
LOCATION      Murray River, Mount Everest
DATE          June, 2008-06-29
TIME          two fifty a m, 1:30 p.m.
MONEY         175 million Canadian Dollars, GBP 10.40
PERCENT       twenty pct, 18.75 %
FACILITY      Washington Monument, Stonehenge
GPE           South East Asia, Midlothian

Page 17

Overview

Identify all textual mentions of a named entity (NE):
Identify the boundaries of the NE;
Identify its type.

Classifiers are good at this.

Page 18

Relation extraction

Once named entities have been identified in a text, we then want to extract the relations that exist between them.

We will typically look for relations between specified types of a named entity.

One way of approaching this task is to initially look for all triples of the form (X, α, Y), where X and Y are named entities of the required types, and α is the string of words that intervenes between X and Y.

We can then use regular expressions to pull out just those instances of α that express the relation that we are looking for.
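The two steps described above can be sketched in plain Python as follows. The input format (a list mixing plain words with (type, text) entity pairs), the helper names, and the example sentence are all assumptions made for this illustration; in practice the entities would come from an earlier named entity recognition step.

```python
import re

def triples(items):
    """Yield (X, alpha, Y) for each pair of consecutive named entities.

    X and Y are (type, text) pairs; alpha is the string of words between them.
    """
    prev, between = None, []
    for item in items:
        if isinstance(item, tuple):          # a named entity
            if prev is not None:
                yield prev, " ".join(between), item
            prev, between = item, []
        elif prev is not None:               # a plain word after some entity
            between.append(item)

def extract(items, x_type, y_type, pattern):
    """Keep triples with the required entity types whose alpha matches pattern."""
    return [
        (x[1], alpha, y[1])
        for x, alpha, y in triples(items)
        if x[0] == x_type and y[0] == y_type and pattern.match(alpha)
    ]

# Hypothetical pre-annotated sentence, invented for this example
sentence = [
    ("ORGANIZATION", "WHO"), "has", "its", "headquarters", "in",
    ("LOCATION", "Geneva"), ",", "not", "near",
    ("LOCATION", "Mount Everest"),
]

# alpha must contain the word "in" to count as an "X in Y" relation
IN = re.compile(r".*\bin\b")

print(extract(sentence, "ORGANIZATION", "LOCATION", IN))
# [('WHO', 'has its headquarters in', 'Geneva')]
```

The second triple (Geneva, ", not near", Mount Everest) is discarded because X is not of the required type, which shows how the type restriction narrows the candidates before the regular expression is applied.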

Page 19

Postscript

Much of what we have described goes under the heading of text mining.

Page 20

Quiz grades

       Q7    Q8    Q9   Q10
MIN   5.0   7.0   9.0   7.0
AVG   8.3   8.8   9.8   7.6
MAX  10.0  10.0  10.0   8.0

Page 21

Next time

No quiz

NLPP §10

Analyzing the meaning of sentences

