
ELABORAZIONE DEL LINGUAGGIO NATURALE (Natural Language Processing)

Date post: 24-Feb-2016
Upload: zyta
Page 1: ELABORAZIONE DEL LINGUAGGIO NATURALE

NATURAL LANGUAGE PROCESSING

SEMANTICS: NAMED ENTITIES

RELATIONS

Page 2: ELABORAZIONE DEL LINGUAGGIO NATURALE

MODERN SEMANTICS

• Its basic subtasks
  – Entity classification: NAMED ENTITY RECOGNITION (and classification)
  – Recognition of predicates and their arguments: RELATION EXTRACTION

Page 3: ELABORAZIONE DEL LINGUAGGIO NATURALE

Named Entity Recognition (NER)

Input:

Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne.

Output:

Apple Inc., formerly Apple Computer, Inc., is an American multinational corporation headquartered in Cupertino, California that designs, develops, and sells consumer electronics, computer software and personal computers. It was established on April 1, 1976, by Steve Jobs, Steve Wozniak and Ronald Wayne.

Page 4: ELABORAZIONE DEL LINGUAGGIO NATURALE

Named Entity Recognition (NER)

• Locate and classify atomic elements in text into predefined categories (persons, organizations, locations, temporal expressions, quantities, percentages, monetary values, …)
• Input: a block of text
  – Jim bought 300 shares of Acme Corp. in 2006.
• Output: annotated block of text
  – <ENAMEX TYPE="PERSON">Jim</ENAMEX> bought <NUMEX TYPE="QUANTITY">300</NUMEX> shares of <ENAMEX TYPE="ORGANIZATION">Acme Corp.</ENAMEX> in <TIMEX TYPE="DATE">2006</TIMEX>
  – ENAMEX tags (MUC in the 1990s)
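The inline annotation on this slide can be reproduced with a few lines of Python. This is only a sketch: the entity spans are supplied by hand rather than recognised, and the names `TAG_FOR_TYPE`, `span_of` and `annotate` are hypothetical helpers, not part of any MUC tooling. The type-to-tag mapping covers only the types in the example.

```python
from typing import List, Tuple

# Which tag wraps which entity type, following the example on the slide.
# Illustrative assumption, not a complete MUC tag inventory.
TAG_FOR_TYPE = {
    "PERSON": "ENAMEX",
    "ORGANIZATION": "ENAMEX",
    "QUANTITY": "NUMEX",
    "DATE": "TIMEX",
}

def span_of(text: str, mention: str, etype: str) -> Tuple[int, int, str]:
    """Return the (start, end, type) character span of the first occurrence."""
    start = text.index(mention)
    return (start, start + len(mention), etype)

def annotate(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each non-overlapping character span in its inline tag."""
    out = []
    cursor = 0
    for start, end, etype in sorted(spans):
        tag = TAG_FOR_TYPE[etype]
        out.append(text[cursor:start])  # untouched text before the entity
        out.append(f'<{tag} TYPE="{etype}">{text[start:end]}</{tag}>')
        cursor = end
    out.append(text[cursor:])  # trailing text after the last entity
    return "".join(out)

text = "Jim bought 300 shares of Acme Corp. in 2006."
spans = [
    span_of(text, "Jim", "PERSON"),
    span_of(text, "300", "QUANTITY"),
    span_of(text, "Acme Corp.", "ORGANIZATION"),
    span_of(text, "2006", "DATE"),
]
print(annotate(text, spans))
```

The hard part of NER is producing `spans`; the tagging step itself is mechanical.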

Page 5: ELABORAZIONE DEL LINGUAGGIO NATURALE

THE STANDARD NEWS DOMAIN

• Most work on NER focuses on
  – NEWS
  – Variants of the repertoire of entity types first studied in MUC and then in ACE:
    • PERSON
    • ORGANIZATION
    • GPE
    • LOCATION
    • TEMPORAL ENTITY
    • NUMBER

Page 6: ELABORAZIONE DEL LINGUAGGIO NATURALE

HOW

• Two tasks:
  – Identifying the part of text that mentions an entity (RECOGNITION)
  – Classifying it (CLASSIFICATION)
• The two tasks are reduced to a standard classification task by having the system classify WORDS

Page 7: ELABORAZIONE DEL LINGUAGGIO NATURALE

Basic Problems in NER

• Variation of NEs, e.g. John Smith, Mr Smith, John.
• Ambiguity of NE types
  – John Smith (company vs. person)
  – May (person vs. month)
  – Washington (person vs. location)
  – 1945 (date vs. time)
• Ambiguity with common words, e.g. “may”

Page 8: ELABORAZIONE DEL LINGUAGGIO NATURALE

Problems in NER

• Category definitions are intuitively quite clear, but there are many grey areas.

• Many of these grey areas are caused by metonymy.
  – Organisation vs. Location: “England won the World Cup” vs. “The World Cup took place in England”
  – Company vs. Artefact: “shares in MTV” vs. “watching MTV”
  – Location vs. Organisation: “she met him at Heathrow” vs. “the Heathrow authorities”

Page 9: ELABORAZIONE DEL LINGUAGGIO NATURALE

Approaches to NER: List Lookup

• System that recognises only entities stored in its lists (GAZETTEERS).

• Advantages: simple, fast, language independent, easy to retarget

• Disadvantages: collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
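A list-lookup recogniser can be sketched as a greedy longest-match search over a token sequence. The gazetteer entries and the function name `lookup` below are illustrative assumptions. The second call shows the name-variant weakness listed above.

```python
from typing import Dict, List, Tuple

# Toy gazetteer keyed by token tuples; entries are illustrative only.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Steve", "Jobs"): "PERSON",
    ("Apple", "Inc."): "ORGANIZATION",
    ("Cupertino",): "LOCATION",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def lookup(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) token spans found in the gazetteer."""
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "Apple Inc." beats a shorter entry.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            etype = GAZETTEER.get(tuple(tokens[i:i + n]))
            if etype is not None:
                found.append((i, i + n, etype))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found

print(lookup("Steve Jobs founded Apple Inc. in Cupertino".split()))
print(lookup("Mr Jobs left".split()))  # variant not in the list: nothing found
```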

Page 10: ELABORAZIONE DEL LINGUAGGIO NATURALE

Approaches to NER: Shallow Parsing

• Names often have internal structure. These components can be either stored or guessed.

Location:
  – CapWord + {City, Forest, Center}, e.g. Sherwood Forest
  – CapWord + {Street, Boulevard, Avenue, Crescent, Road}, e.g. Portobello Street
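The internal-structure patterns above translate almost directly into a regular expression. In this sketch, `CapWord` is approximated as one capital letter followed by lowercase letters, which is a simplifying assumption. The suffix list is taken from the slide.

```python
import re

# CapWord approximated as a capitalised ASCII word (assumption).
CAP = r"[A-Z][a-z]+"
LOCATION_SUFFIXES = [
    "City", "Forest", "Center",
    "Street", "Boulevard", "Avenue", "Crescent", "Road",
]
# CapWord followed by one of the location-marking suffix words.
INTERNAL_LOCATION = re.compile(
    rf"\b{CAP} (?:{'|'.join(LOCATION_SUFFIXES)})\b"
)

def find_locations(text: str) -> list:
    """Return every substring matching CapWord + location suffix."""
    return INTERNAL_LOCATION.findall(text)

print(find_locations("They walked from Portobello Street to Sherwood Forest."))
```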

Page 11: ELABORAZIONE DEL LINGUAGGIO NATURALE

Shallow Parsing Approach (e.g., Mikheev et al. 1998)

• External evidence: names are often used in very predictive local contexts

Location:
  – “to the” COMPASS “of” CapWord, e.g. to the south of Loitokitok
  – “based in” CapWord, e.g. based in Loitokitok
  – CapWord “is a” (ADJ)? GeoWord, e.g. Loitokitok is a friendly city
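The three context patterns can be approximated with regular expressions. This is a rough sketch under stated assumptions: `COMPASS` and `GEO` are small hand-picked word lists, the optional ADJ slot is matched by any single word because no part-of-speech tagger is used, and `locations_from_context` is a hypothetical helper name.

```python
import re

CAP = r"[A-Z][a-z]+"                     # CapWord, approximated
COMPASS = r"(?:north|south|east|west)"   # assumed COMPASS word list
GEO = r"(?:city|town|village|river)"     # assumed GeoWord list

CONTEXT_PATTERNS = [
    re.compile(rf"\bto the {COMPASS} of ({CAP})"),
    re.compile(rf"\bbased in ({CAP})"),
    # (?:\w+ )? stands in for the optional adjective.
    re.compile(rf"\b({CAP}) is a (?:\w+ )?{GEO}\b"),
]

def locations_from_context(text: str) -> list:
    """Collect the CapWord captured by any of the context patterns."""
    hits = []
    for pattern in CONTEXT_PATTERNS:
        hits.extend(m.group(1) for m in pattern.finditer(text))
    return hits

for example in ("to the south of Loitokitok",
                "based in Loitokitok",
                "Loitokitok is a friendly city"):
    print(example, "->", locations_from_context(example))
```

Unlike list lookup, these rules can label a name such as Loitokitok that appears in no gazetteer.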

Page 12: ELABORAZIONE DEL LINGUAGGIO NATURALE

Machine learning approaches to NER

• NER as classification: the IOB representation
• Supervised methods
  – Support Vector Machines
  – Logistic regression (aka Maximum Entropy)
  – Sequence pattern learning
  – Hidden Markov Models
  – Conditional Random Fields
• Distant learning
• Semi-supervised methods

Page 13: ELABORAZIONE DEL LINGUAGGIO NATURALE

THE ML APPROACH TO NE: THE IOB REPRESENTATION
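In the IOB representation every word receives one label: B-TYPE for the word that begins an entity, I-TYPE for a word inside one, and O for a word outside any entity. The sketch below converts hand-supplied token spans into such labels. It uses the variant in which every entity starts with B-. Some datasets, including the CoNLL 2003 sample later in this deck, use B- only to separate adjacent entities of the same type. The function name `to_iob` is a hypothetical helper.

```python
from typing import List, Tuple

def to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn (start, end, type) token spans into one IOB label per token."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"       # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"       # remaining tokens of the entity
    return labels

tokens = ["Jim", "bought", "300", "shares", "of", "Acme", "Corp.", "in", "2006", "."]
spans = [(0, 1, "PER"), (5, 7, "ORG")]
for token, label in zip(tokens, to_iob(tokens, spans)):
    print(f"{token}\t{label}")
```

With this encoding, recognition and classification collapse into a single per-word labelling problem that any of the listed classifiers can be trained on.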

Page 14: ELABORAZIONE DEL LINGUAGGIO NATURALE

THE ML APPROACH TO NE: FEATURES

Page 15: ELABORAZIONE DEL LINGUAGGIO NATURALE

FEATURES

Page 16: ELABORAZIONE DEL LINGUAGGIO NATURALE

FEATURES

Page 17: ELABORAZIONE DEL LINGUAGGIO NATURALE

Supervised ML for NER

• Methods already seen
  – Decision trees
  – Support Vector Machines
• Sequence pattern learning (also supervised)
  – Hidden Markov Models
  – Maximum Entropy Models
  – Conditional Random Fields

Page 18: ELABORAZIONE DEL LINGUAGGIO NATURALE

EVALUATION

Page 19: ELABORAZIONE DEL LINGUAGGIO NATURALE

TYPICAL PERFORMANCE

Page 20: ELABORAZIONE DEL LINGUAGGIO NATURALE

NER Evaluation Campaigns

• English NER: CoNLL 2003 (PER/ORG/LOC/MISC)
  – Training set: 203,621 tokens
  – Development set: 51,362 tokens
  – Test set: 46,435 tokens
• Italian NER: Evalita 2009 (PER/ORG/LOC/GPE)
  – Development set: 223,706 tokens
  – Test set: 90,556 tokens
• Mention Detection: ACE 2005
  – 599 documents

Page 21: ELABORAZIONE DEL LINGUAGGIO NATURALE

CoNLL2003 shared task (1)

• English and German language
• 4 types of NEs:
  – LOC Location
  – MISC Names of miscellaneous entities
  – ORG Organization
  – PER Person
• Training Set for developing the system
• Test Data for the final evaluation

Page 22: ELABORAZIONE DEL LINGUAGGIO NATURALE

CoNLL2003 shared task (2)

• Data
  – Columns separated by a single space
  – A word for each line
  – An empty line after each sentence
  – Tags in IOB format
• An example

    Milan NNP B-NP I-ORG
    's POS B-NP O
    player NN I-NP O
    George NNP I-NP I-PER
    Weah NNP I-NP I-PER
    meet VBP B-VP O
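The column format described on this slide is easy to read programmatically. The sketch below splits the data into sentences on empty lines and then groups consecutive tagged words into entities. It assumes four space-separated columns (word, POS tag, chunk tag, NE tag) as in the sample, and treats a change of type or an explicit B- tag as the start of a new entity. `read_sentences` and `entities` are hypothetical helper names.

```python
from typing import List, Tuple

SAMPLE = """Milan NNP B-NP I-ORG
's POS B-NP O
player NN I-NP O
George NNP I-NP I-PER
Weah NNP I-NP I-PER
meet VBP B-VP O
"""

Row = Tuple[str, str, str, str]

def read_sentences(data: str) -> List[List[Row]]:
    """Split column-formatted text into sentences (lists of 4-column rows)."""
    sentences, current = [], []
    for line in data.splitlines():
        if not line.strip():            # an empty line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, chunk, ne = line.split()
        current.append((word, pos, chunk, ne))
    if current:
        sentences.append(current)
    return sentences

def entities(sentence: List[Row]) -> List[Tuple[str, str]]:
    """Group consecutive NE-tagged words into (text, type) pairs."""
    found, words, etype = [], [], None
    for word, _pos, _chunk, tag in sentence:
        tag_type = None if tag == "O" else tag[2:]
        # Close the open entity on O, on a type change, or on an explicit B-.
        if words and (tag_type != etype or tag.startswith("B-")):
            found.append((" ".join(words), etype))
            words, etype = [], None
        if tag_type is not None:
            words.append(word)
            etype = tag_type
    if words:
        found.append((" ".join(words), etype))
    return found

sentences = read_sentences(SAMPLE)
print(entities(sentences[0]))
```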

Page 23: ELABORAZIONE DEL LINGUAGGIO NATURALE

CoNLL2003 shared task (3)

English      precision   recall    F
[FIJZ03]     88.99%      88.54%    88.76%
[CN03]       88.12%      88.51%    88.31%
[KSNM03]     85.93%      86.21%    86.07%
[ZJ03]       86.13%      84.88%    85.50%
--------------------------------------------
[Ham03]      69.09%      53.26%    60.15%
baseline     71.91%      50.90%    59.61%
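The F column in these results is the balanced F-measure, the harmonic mean of precision and recall: F = 2PR / (P + R). The short sketch below recomputes it from the precision and recall figures reported on this slide. The function name `f1` and the `RESULTS` dictionary layout are illustrative choices.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F) in percent, copied from the slide.
RESULTS = {
    "FIJZ03": (88.99, 88.54, 88.76),
    "CN03": (88.12, 88.51, 88.31),
    "KSNM03": (85.93, 86.21, 86.07),
    "ZJ03": (86.13, 84.88, 85.50),
    "Ham03": (69.09, 53.26, 60.15),
    "baseline": (71.91, 50.90, 59.61),
}

for system, (p, r, reported) in RESULTS.items():
    print(f"{system:10s} computed F = {f1(p, r):.2f}   reported F = {reported:.2f}")
```

Because the harmonic mean is pulled towards the lower of the two values, the baseline's low recall keeps its F near 60 despite a precision above 70.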

Page 24: ELABORAZIONE DEL LINGUAGGIO NATURALE

CURRENT RESEARCH ON NER

• New domains
• New approaches:
  – Semi-supervised
  – Distant
• Handling many NE types
• Integration with Machine Translation
• Handling difficult linguistic phenomena such as metonymy

Page 25: ELABORAZIONE DEL LINGUAGGIO NATURALE

NEW DOMAINS

• BIOMEDICAL
• CHEMISTRY
• HUMANITIES: MORE FINE GRAINED TYPES

Page 26: ELABORAZIONE DEL LINGUAGGIO NATURALE

Bioinformatics Named Entities

• Protein
• DNA
• RNA
• Cell line
• Cell type
• Drug
• Chemical

Page 27: ELABORAZIONE DEL LINGUAGGIO NATURALE

NER IN THE HUMANITIES

LOC

SITE

CULTURE

Page 28: ELABORAZIONE DEL LINGUAGGIO NATURALE

SEMANTIC INTERPRETATION 2: FROM SENTENCES TO PROPOSITIONS

Powell met Zhu Rongji
Powell met with Zhu Rongji
Powell and Zhu Rongji met
Powell and Zhu Rongji had a meeting
. . .

Proposition: meet(Powell, Zhu Rongji)

When Powell met Zhu Rongji on Thursday they discussed the return of the spy plane.

meet(Powell, Zhu) discuss([Powell, Zhu], return(X, plane))

debate, consult, join, wrestle, battle

meet(Somebody1, Somebody2)

Page 29: ELABORAZIONE DEL LINGUAGGIO NATURALE

OTHER ASPECTS OF SEMANTIC INTERPRETATION

• Identification of RELATIONS between entities mentioned
  – Focus of interest in modern CL since 1993 or so
• Identification of TEMPORAL RELATIONS
  – From about 2003 on
• QUALIFICATION of such relations (modality, epistemicity)
  – From about 2010 on

Page 30: ELABORAZIONE DEL LINGUAGGIO NATURALE

TYPES OF RELATIONS

• Predicate-argument structure (verbs and nouns)
  – John kicked the ball
• Nominal relations
  – The red ball
• Relations between events / temporal relations
  – John kicked the ball and scored a goal

Page 31: ELABORAZIONE DEL LINGUAGGIO NATURALE

PREDICATE-ARGUMENT STRUCTURE

• Linguistic Theories
  – Case Frames – Fillmore – FrameNet
  – Lexical Conceptual Structure – Jackendoff – LCS
  – Proto-Roles – Dowty – PropBank
  – English verb classes (diathesis alternations) – Levin – VerbNet
  – Talmy, Levin and Rappaport

Page 32: ELABORAZIONE DEL LINGUAGGIO NATURALE

Fillmore’s Case Theory

• Sentences have a DEEP STRUCTURE with CASE RELATIONS
• A sentence is a verb + one or more NPs
  – Each NP has a deep-structure case
    • A(gentive)
    • I(nstrumental)
    • D(ative)
    • F(actitive)
    • L(ocative)
    • O(bjective)
  – Subject is no more important than Object
    • Subject/Object are surface structure

Page 33: ELABORAZIONE DEL LINGUAGGIO NATURALE

THEMATIC ROLES

• Following on Fillmore’s original work, many theories of predicate-argument structure / thematic roles were proposed, among which the best known are perhaps
  – Jackendoff’s LEXICAL CONCEPTUAL SEMANTICS
  – Dowty’s PROTO-ROLES theory

Page 34: ELABORAZIONE DEL LINGUAGGIO NATURALE

Dowty’s PROTO-ROLES

• Event-dependent
• Prototypes based on shared entailments
• Grammatical relations such as subject related to observed (empirical) classification of participants
• Typology of grammatical relations
• Proto-Agent
• Proto-Patient

Page 35: ELABORAZIONE DEL LINGUAGGIO NATURALE

Proto-Agent

• Properties
  – Volitional involvement in event or state
  – Sentience (and/or perception)
  – Causing an event or change of state in another participant
  – Movement (relative to position of another participant)
  – (exists independently of event named) *may be discourse pragmatic

Page 36: ELABORAZIONE DEL LINGUAGGIO NATURALE

Proto-Patient

• Properties:
  – Undergoes change of state
  – Incremental theme
  – Causally affected by another participant
  – Stationary relative to movement of another participant
  – (does not exist independently of the event, or at all) *may be discourse pragmatic

Page 37: ELABORAZIONE DEL LINGUAGGIO NATURALE

Semantic role labels:

Jan broke the LCD projector.

break (agent(Jan), patient(LCD-projector))

cause(agent(Jan), change-of-state(LCD-projector))

(broken(LCD-projector))

agent(A) -> intentional(A), sentient(A), causer(A), affector(A)

patient(P) -> affected(P), change(P),…

Fillmore, 68

Jackendoff, 72

Dowty, 91

Page 38: ELABORAZIONE DEL LINGUAGGIO NATURALE

VERBNET AND PROPBANK

• Dowty’s theory of proto-roles was the basis for the development of PROPBANK, the first corpus annotated with information about predicate-argument structure

Page 39: ELABORAZIONE DEL LINGUAGGIO NATURALE

PROPBANK REPRESENTATION

a GM-Jaguar pact that would give the U.S. car maker an eventual 30% stake in the British company.

Arg0: a GM-Jaguar pact (trace *T*-1)
Arg2: the US car maker
Arg1: an eventual 30% stake in the British company

give(GM-J pact, US car maker, 30% stake)
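One way to hold such an analysis in a program is a predicate plus a mapping from numbered argument labels to text spans. The sketch below is an illustrative data structure only, not the PropBank file format. The class name `Proposition` and its fields are assumptions. Arguments are kept in surface order so that the printed form matches the notation on the slide.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Proposition:
    """A predicate with labelled arguments (illustrative structure only)."""
    predicate: str
    args: Dict[str, str]   # label -> argument text, in surface order

    def __str__(self) -> str:
        # Dicts preserve insertion order, so this prints in surface order.
        return f"{self.predicate}({', '.join(self.args.values())})"

prop = Proposition(
    predicate="give",
    args={
        "Arg0": "GM-J pact",       # the giver
        "Arg2": "US car maker",    # the recipient
        "Arg1": "30% stake",       # the thing given
    },
)
print(prop)
print(prop.args["Arg1"])
```

Keeping the labels separate from the order lets the same structure represent sentences in which the arguments appear in a different order.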

Page 40: ELABORAZIONE DEL LINGUAGGIO NATURALE

ARGUMENTS IN PROPBANK

• Arg0 = agent
• Arg1 = direct object / theme / patient
• Arg2 = indirect object / benefactive / instrument / attribute / end state
• Arg3 = start point / benefactive / instrument / attribute
• Arg4 = end point
• Per word vs frame level – more general?

Page 41: ELABORAZIONE DEL LINGUAGGIO NATURALE

FROM PREDICATES TO FRAMES

In one of its senses, the verb observe evokes a frame called Compliance: this frame concerns people’s responses to norms, rules or practices.

The following sentences illustrate the use of the verb in the intended sense:
  – Our family observes the Jewish dietary laws.
  – You have to observe the rules or you’ll be penalized.
  – How do you observe Easter?
  – Please observe the illuminated signs.

Page 42: ELABORAZIONE DEL LINGUAGGIO NATURALE

FrameNet

FrameNet records information about English words in the general vocabulary in terms of

1. the frames (e.g. Compliance) that they evoke,
2. the frame elements (semantic roles) that make up the components of the frames (in Compliance, Norm is one such frame element), and
3. each word’s valence possibilities, the ways in which information about the frames is provided in the linguistic structures connected to them (with observe, Norm is typically the direct object).


Page 43: ELABORAZIONE DEL LINGUAGGIO NATURALE

NOMINAL RELATIONS

Page 44: ELABORAZIONE DEL LINGUAGGIO NATURALE

CLASSIFICATION SCHEMES FOR NOMINAL RELATIONS

Page 45: ELABORAZIONE DEL LINGUAGGIO NATURALE

ONE EXAMPLE (Barker et al. 1998, Nastase & Szpakowicz 2003)

Page 46: ELABORAZIONE DEL LINGUAGGIO NATURALE

THE TWO-LEVEL TAXONOMY OF RELATIONS, 2

Page 47: ELABORAZIONE DEL LINGUAGGIO NATURALE

THE SEMEVAL-2007 CLASSIFICATION OF RELATIONS

• Cause-Effect: laugh wrinkles
• Instrument-Agency: laser printer
• Product-Producer: honey bee
• Origin-Entity: message from outer-space
• Theme-Tool: news conference
• Part-Whole: car door
• Content-Container: the air in the jar

Page 48: ELABORAZIONE DEL LINGUAGGIO NATURALE

THE MUC AND ACE TASKS

• Modern research in relation extraction was also kicked off by the Message Understanding Conference (MUC) campaigns and continued through the Automatic Content Extraction (ACE) and Machine Reading follow-ups
• MUC: NE, coreference, TEMPLATE FILLING
• ACE: NE, coreference, relations

Page 49: ELABORAZIONE DEL LINGUAGGIO NATURALE

TEMPLATE-FILLING

Page 50: ELABORAZIONE DEL LINGUAGGIO NATURALE

EXAMPLE MUC: JOB POSTING

Page 51: ELABORAZIONE DEL LINGUAGGIO NATURALE

THE ASSOCIATED TEMPLATE

Page 52: ELABORAZIONE DEL LINGUAGGIO NATURALE

AUTOMATIC CONTENT EXTRACTION (ACE)

Page 53: ELABORAZIONE DEL LINGUAGGIO NATURALE

ACE: THE DATA

Page 54: ELABORAZIONE DEL LINGUAGGIO NATURALE

ACE: THE TASKS

Page 55: ELABORAZIONE DEL LINGUAGGIO NATURALE

RELATION DETECTION AND RECOGNITION

Page 56: ELABORAZIONE DEL LINGUAGGIO NATURALE

ACE: RELATION TYPES

Page 57: ELABORAZIONE DEL LINGUAGGIO NATURALE

OTHER PRACTICAL VERSIONS OF RELATION EXTRACTION

• Biomedical domain (BioNLP, BioCreative)
• Chemistry
• Cultural Heritage

Page 58: ELABORAZIONE DEL LINGUAGGIO NATURALE

THE TASK OF SEMANTIC RELATION EXTRACTION

Page 59: ELABORAZIONE DEL LINGUAGGIO NATURALE

SEMANTIC RELATION EXTRACTION: THE CHALLENGES

Page 60: ELABORAZIONE DEL LINGUAGGIO NATURALE

HISTORY OF RELATION EXTRACTION

• Before 1993: symbolic methods (using knowledge bases)
• Since then: statistical / heuristic-based methods
  – From 1995 to around 2005: mostly SUPERVISED
  – More recently: also quite a lot of UNSUPERVISED / SEMI-SUPERVISED techniques

Page 61: ELABORAZIONE DEL LINGUAGGIO NATURALE

MORE COMPLEX SEMANTICS

• Modalities
• Temporal interpretation

