Current Status of the Swedish System Swedish Named-Entity … · 2005-11-18 · Fefor, Jan, 2003 1...


Fefor, Jan, 2003

Swedish Named-Entity Recognition in the Nomen Nescio Project
…the work conducted during 2002

Dimitrios Kokkinakis

Outline

• Current Status of the Swedish System
  – What has been accomplished in 2002
  – Approach, architecture, individual modules, examples...
• The (Swedish) Resources in more detail
  – Anatomy of the grammars and the DCA approach
• NE Tagset and some General/Specific Guidelines
• Some Preliminary Evaluations based on SUC2
  – Evaluation on some difficult/frequent cases
• Work in 2003
  – Continuous development, better integration of modules and better DCA implementation
  – Extensive testing & evaluation (second half of 2003)
  – Extensions: Web-Interface, New Groups (e.g. Money)

mainly

... Status of the Swedish System: The Work in 2002

• Gathered Lists of Single and Multiword Names (www, corpora, available name lists)
• Studied large corpora and extracted a number of (incomplete) patterns using reg. expressions, which were then generalised and formalised into fs-rules; tested, extended...
• ...those rules were implemented as sets of context-sensitive finite-state grammars ’semantic grammars’, one for each NE-group
• Implemented a first version of the DCA-algorithm
• Defined an XML-schema for the validation of NE-annotated data

... Status of the Swedish System: General Approach

• Application on unannotated texts (portability, ease of reusability…)
• Modular and scalable architecture
• Since no individual criterion can achieve both high precision and recall, a combination of criteria (external & internal evidence) is applied
• Separation of lexical (internal) and grammatical (external) resources
• Five major components (see later slide...)

Evidence: McDonald (1996)

Internal: is taken from within the sequence of words that comprise the name, such as the content of lists of proper names (gazetteers), abbreviations and acronyms

External: provided by the context in which a name appears – the characteristic properties or events in a syntactic relation (verbs, adjectives) with a proper noun can be used to provide confirming or criterial evidence for a name’s category – important complementary information since internal evidence can never be complete...

Operational definition of NE: NE are words or sequences of words which usually cannot be found in common (defining) dictionaries, and yet encapsulate important info that can be useful for the semantic interpretation of texts

... Status of the Swedish System: Architecture

[Figure: architecture diagram. Boxes: Source Texts, Tokenization, Multiword Lists, FS-Grammars, Single-Word Lists, DCA, Filter/Revision, Output.xml]


Current Status of the Swedish System

Approximate number of resources:

CATEGORY        MULTI LISTS    SINGLE LISTS    # GRAMM. RULES
                (internal)     (internal)      (external)
PERSON          1              80,000          ≈140
LOCATION        410            4,500           ≈180
ORGANIZATION    1,310          1,900           ≈210
EVENT           20             20              ≈160
WORK&ART        50             50              ≈230
OBJECT          150            10              ≈240

Lexical Resources (1) (Internal Evidence)

Name Lists (Gazetteers)

Multiword names
  Organizations (PLT): 35
  Organizations (ATH): 80
  Organizations (FIN): 70
  Organizations (TVR): 25
  Organizations (CRP): 1100
  Locations (GPL): 15
  Locations (FNC): 75
  Locations (PPL): 320
  Events: 20
  Objects: 150

Single names
  Organiz. (FIN): 42
  Organiz. (TVR): 15
  Organiz. (POL): 35
  Organiz. (ATH): 9
  Organiz. (CRP): 1,800
  Locat. (non-S): 2,400
  Locat. (countr): 320
  Locat. (S): 1,700
  Persons: 80,000
  Events: 20
  Objects: 30
  Work&Art: 50

Lexical Resources (2) (Internal Evidence)

Persons: titles, premodifiers, appositions...
  PostMods: Jr, Junior,…
  PreTitles: VD, Dr, sir,…
  Nationality: belgaren, brasilianaren, dansken,…
  Profess.: amiral, inspektor,...
  Honor.: von, van der, af,...
  Verbs: säger, frågar...

Organizations: designators, affixes, short phrases and trigger words
  Trigger words: bolaget X, föreningen X, institutet X,…; X Agency, X Biotech, X Chemical, X Consultancy,…
  Affixes: +kollegium, +verket,...
  Designators: AB, A/S, IFK,...
  Phrases: *match mot, VD för

Lexical Resources (3) (Internal Evidence)

Events: trigger words, graphical criteria
  Trigger words: turneringen, tävlingen, spel, utställning, mässa, festival, kongress, workshop, Games, Final, Cup,...
  Graph. criteria: tävlingen ”...”

Locations: trigger words, affixes, MOVE-verbs and prepositions
  Trigger words: floden, ö, ...
  Affixes: *väg, *gatan, *byn, *staden, *...
  Prepos.: längs X, söder om X, väster om, utanför,...
  Short Phrases: ’ligger nära’
  MOVE-verbs: ’resa till’,...

Lexical Resources (4) (Internal Evidence)

Objects: trigger words, small phrases, verbs
  Trigger words & phrases: ’ombord på’, sorten, märket, medlet, sjukdomen,...
  Verbs: ’flyger en’, ’äter en’, ’dricker en’, lanserar, tillverkar,...

Work&Art: trigger words, verbs, graphical criteria and affixes
  Trigger words: boken, filmen, dikten, plattan,…
  Affixes: +posten, +journalen, +kuriren,...
  Verbs: regisserat X, ’föreläsa om’, komponerat X, sjöng X...

Grammars: Anatomy of PERSONs

U=[A-ZÅÄÖÈÉ]

/*OCCUPATION-TRIGGERS PRECEDING*/
[^\n ]*(bil|jur|bas|kem|...)(ist|isten)(" \'"|" \:"|" ,")?(" "{U}[^ \n]*)+
[^\n ]*(bil|jur|bas|kem|...)(ister|isterna)(" \'"|" \:"|" ,")?(" "{U}[^ \n]*)+([ \,]+{U}[^ \n]*)*(" "(och|eller)(" "{U}[^ \n]*)+)?

/*OCCUPATION-TRIGGERS FOLLOWING*/
{U}[^ \n]+" "{U}[^ \n]+" \, "(professor|expert|...)

/*NATIONALITY-TRIGGERS*/
(bulgari|fransy|japan|...)(skor|skorna)(" \'"|" \:"|" ,")?(" "{U}[^ \n]*)+([ \,]+{U}[^ \n]*)*(" "(och|eller)(" "{U}[^ \n]*)+)?

/*TYPICAL VERBS FOLLOWING NAME*/
{U}[^ \n]+" "{U}[^ \n][^ \n]+" "(jobbade|arbetar|…)

/*TYPICAL VERBS PRECEDING NAME*/
(säger|förklarar|...)" "{U}[^ \n]+(" "{U}[^ \n]*)*

/*TRIGGER PHRASES*/
(gift|kamrat...)" med "{U}[^ \n]*(" "{U}[^ \n]*)*
(manus|regi|...)" av "{U}[^ \n]*(" "{U}[^ \n]*)*
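A rough Python rendering of the "typical verbs preceding name" rule may help when reading the grammar notation. This is only a sketch and not the project's own finite-state grammar: the function name `person_candidates` is hypothetical, only the two verbs visible on the slide are listed (the elided ones are not guessed), and the input is assumed to be tokenised with single spaces.

```python
import re

# Upper-case initial, as defined on the slide: U=[A-ZÅÄÖÈÉ]
U = "[A-ZÅÄÖÈÉ]"

# Slide rule: (säger|förklarar|...)" "{U}[^ \n]+(" "{U}[^ \n]*)*
# Only the two verbs shown on the slide are included here.
VERB_BEFORE_NAME = re.compile(
    rf"(?:säger|förklarar) ({U}[^ \n]+(?: {U}[^ \n]*)*)"
)


def person_candidates(text):
    """Return capitalised token sequences that directly follow a trigger verb."""
    return VERB_BEFORE_NAME.findall(text)


print(person_candidates("Det går bra , säger Maria Bäckström till TT ."))
```

The capture stops at the first token that does not start with an upper-case letter, which mirrors the `(" "{U}[^ \n]*)*` tail of the original rule.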


Grammars: General Pattern Examples

Rules capture local patterns that characterize entities, extracted from part-of-speech annotated data and semi-automatic analysis of corpora:

• .*(trafiken|banan|tåget|vägen|bygge|sträckan|tunnel...) mellan XXX och YYY: XXX and YYY are with high probability locations
    tåget mellan Warszawa och Berlin
    sträckan mellan Bornholm och Gotland

• ^XXX YYY , [0-9]+ ,: XXX and YYY are with high probability humans
    Robert Broberg , 51 ,
    Maria Bäckström , 25 ,
    Hillary Clinton , 44 ,

• XXX köpte YYY: XXX and YYY are with high probability organizations
    EMI köpte Virgin Music Group
    Grundin köpte Hornline
    Moyne köpte Trustor
    Optiroc köpte Stråbruken
    Pandox köpte Park Avenue Hotel
    SF köpte Europafilm
    Stagecoach köpte Swebus
    Trelleborg köpte Intertrade
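The three contextual patterns on this slide can be approximated with ordinary regular expressions. The following is a minimal sketch, assuming space-tokenised input; the names `CAP`, `NAME` and `tag_by_context` are illustrative and not taken from the project, and the trigger list contains only the words shown on the slide.

```python
import re

CAP = r"[A-ZÅÄÖÈÉ][^ \n]*"      # one capitalised token
NAME = rf"{CAP}(?: {CAP})*"      # one or more capitalised tokens

# "...(trafiken|banan|...) mellan XXX och YYY" -> both are probably locations
LOCATION_PAIR = re.compile(
    rf"(?:trafiken|banan|tåget|vägen|bygge|sträckan|tunnel) mellan ({NAME}) och ({NAME})"
)
# "^XXX YYY , [0-9]+ ," -> a person followed by an age
PERSON_AGE = re.compile(rf"^({CAP} {CAP}) , [0-9]+ ,", re.MULTILINE)
# "XXX köpte YYY" -> both are probably organizations
ORG_BUYS = re.compile(rf"({NAME}) köpte ({NAME})")


def tag_by_context(line):
    """Return (text, label) pairs suggested by the contextual patterns."""
    found = []
    m = LOCATION_PAIR.search(line)
    if m:
        found += [(m.group(1), "LOCATION"), (m.group(2), "LOCATION")]
    m = PERSON_AGE.search(line)
    if m:
        found.append((m.group(1), "PERSON"))
    m = ORG_BUYS.search(line)
    if m:
        found += [(m.group(1), "ORGANIZATION"), (m.group(2), "ORGANIZATION")]
    return found


print(tag_by_context("Pandox köpte Park Avenue Hotel"))
```

Such patterns only give external evidence with "high probability"; in the system described here they are complemented by the name lists and the later DCA and filter steps.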

DCA (1)

• ...for names not covered by contextual patterns
• No evidence to disambiguate
• Idea: ”Important words are typically used in a document more than once and in different contexts. Ambiguous words and phrases are usually unambiguously introduced at least once in the text unless they’re part of common knowledge presupposed to be known by the readers”; Mikheev ’00
• Accordingly: look at unambiguous usages and then assign these to words in ambiguous positions
• A kind of inference & generalization process used to accurately tag isolated unknown names; prior knowledge (e.g. rules) is used in order to make the performance increase
• Important step in case of only UPPER/lowercase texts

see later the ”Matti” & ”Julius” examples

DCA (2) - Process

• Annotated Text: [... <LPG Telecom AB>...]
• Array of Annotations: <LPG Telecom AB>
• Partial Orderings of Annotations (if >1 tagged) (excl. single occ. of den, det, a, AB...): <LPG Telecom>, <Telecom AB>, <LPG>, <Telecom>; also <Lpg>, <TELECOM>
• Apply the Newly ”Learned” Tags on Unannotated Tokens
• Produce (Hopefully) New Annotations
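The DCA process outlined on this slide could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the function names, the `STOP` set (filled only with the tokens named on the slide) and the case-insensitive matching used to cover variants such as <Lpg> and <TELECOM> are all choices made for the example.

```python
# Single tokens that are never propagated on their own (from the slide).
STOP = {"den", "det", "a", "AB"}


def sub_sequences(tokens):
    """All contiguous sub-sequences of an annotated name, longest first."""
    subs = []
    n = len(tokens)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            sub = tokens[start:start + length]
            if length == 1 and sub[0] in STOP:
                continue  # skip isolated stop tokens such as "AB"
            subs.append(tuple(sub))
    return subs


def learn(annotations):
    """Map every lower-cased sub-sequence of each annotated name to its tag."""
    learned = {}
    for name, tag in annotations:
        for sub in sub_sequences(name.split()):
            learned.setdefault(tuple(t.lower() for t in sub), tag)
    return learned


def apply_learned(tokens, learned):
    """Tag unannotated tokens, preferring the longest learned sequence."""
    max_len = max((len(key) for key in learned), default=0)
    out, i = [], 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + length])
            if key in learned:
                out.append((" ".join(tokens[i:i + length]), learned[key]))
                i += length
                break
        else:
            i += 1  # no learned sequence starts at this token
    return out


learned = learn([("LPG Telecom AB", "ORG")])
print(apply_learned("Enligt TELECOM ökar vinsten , skriver Lpg".split(), learned))
```

Matching on lower-cased forms is what makes the step useful for all-uppercase or all-lowercase text, but it also risks tagging common nouns, so a real system would need further checks.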

Filter/Revision Capabilities

• Technique inspired by Mooney ’93 which he calls ”theory refinement”
• Tags put on NEs at previous steps are considered as default (preferential) tags & can be modified
• These can be updated given an appropriate context
• e.g. isolated occurrences of ”Sverige” are tagged as LOC/GPL but in a sequence such as ”Sverige klar favorit i EM”, ”Sverige” is an ORG/ATH and thus the LOC/GPL annotation will be replaced by a new tag
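As an illustration of the revision idea, a default tag can be overridden when a context cue is present. The sketch below is hypothetical throughout: the slide does not say how contexts are encoded, so the single cue (taken from the example sentence) and the function name `revise` are assumptions.

```python
import re

# Hypothetical sports-context cue, taken from the slide's example sentence.
SPORT_CUE = re.compile(r"favorit i EM\b")


def revise(default_tag, sentence):
    """Return the default tag unless the context calls for a revision."""
    if default_tag == ("LOC", "GPL") and SPORT_CUE.search(sentence):
        return ("ORG", "ATH")  # a region mentioned in a sport-related context
    return default_tag


print(revise(("LOC", "GPL"), "Sverige klar favorit i EM"))
```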

NE Tagset and General Guidelines

• e.g.: produce a single, unambiguous output for any relevant string; no nested expressions will be marked
• no extra whitespaces or carriage returns to be inserted in the markup; only tags in angled brackets; the markup has the form of the entity type and attribute information
    <ELEMENT-NAME ATTR-NAME=”ATTR-VALUE”>text-string</ENAMEX>
• definite articles commonly associated with entity names are tagged: <Den Norske Opera>
• roman numerals & arabic ordinals associated with entity names are tagged: <Gustav IV Adolf>
• defined an XML-schema for the validation of tagged texts produced by the system
    <ELEMENT-NAME ATTR-NAME=”ATTR-VALUE”><METHOD=”METH-VALUE”/>text-string</ENAMEX>

Annotation Guidelines

• First draft specifications for the creation of simple guidelines for the NER work as applied on Swedish data have been written
• Ideas from MUC, ACE, Eckhard’s reports, own experience...
• The guidelines have evolved during the course of the project, and have been refined and (might be) extended
• The purpose of the guidelines is to try and impose some consistency measures for annotation and evaluation, and to give the potential future users of the system a clearer picture of what the recognition components can offer


Some Name Taxonomies in the Literature

• Bauer G. 1985. Namenkunde des Deutschen (e.g. Anthroponyms, Ergonyms, etc.)
• Jonasson K. 1994. Le Nom Propre. Constructions et Interprétations
• Paik et al. 1996. Categorizing and Standardizing Proper Nouns for Efficient IR
• Sheremetyeva et al. 1998. A Multilingual Onomasticon as a Multipurpose NLP Resource
• Sekine et al. 2002. Extended Named Entity Hierarchies
• ...

NE Tagset and Specific Guidelines - PERSONS

HUM: humans (alive or dead), fictional human characters appearing in TV, movies etc.; appositives (Jr) or structural words (von) are part of the name
MTH: names of saints, apostles, gods, mythical names, humanoids, religious figures
  e.g. Gud, Allah, Jesus, Djävulen, ...
ANM: names of animals, pets and mythical beasts
  e.g. Pegasus, pudeln Zeb,...
CLC: collective names of tribes, dynasties, ethnical and race names
  e.g. Yorubafolket, Yarlung-dynastin,...

NE Tagset and Specific Guidelines - LOCATION

AST: astronomical defined location with physical extent; planets, comets, galaxies
GPL: geographically/geologically defined location; bodies of water, archipelagos, rivers, mountains, continents, oceans
PPL: geo-social-political entities, politically defined geographical regions; nations, states, towns, villages, territories, provinces
FNC: man-made artefacts falling under the domains of architecture, transportation infrastructure and civil engineering; museums, parks, tunnels, stadiums, galleries, factories, bridges, airports, mil. bases
STR: names of streets, avenues, roads and postal addresses

NE Tagset and Specific Guidelines - ORGANIZ.

CRP: corporations, company groups, governmental, non-profit org., businesses, unions
FIN: financial institutions, banks, capital management and funding organizations
ATH: sports teams, athletic organizations, even mentions of regions in sport-related contexts
CLT: artistic and music groups, circus, orchestras, bands
PLT: political parties, groups and movements, terrorist and criminal organizations, liberation armies
TVR: organization with a media profile, tv-channels and radio stations

NE Tagset and Specific Guidelines - EVENTS

HPL: historical or political events and manifestations, wars, scandals, battles
  e.g. Gulfkriget, Förintelsen,...
WTH: events that include some kind of natural motion; weather phenomena, hurricanes, cyclones, storms
  e.g. El Niño, orkanen George,...
CLU: festivals, fairs and conferences, workshops
  e.g. Vattenfestivalen, Expo 98, ...
ATL: sports races, games and competitions, tournaments
  e.g. US Open, VM 1997, Formel 3000,...

NE Tagset and Specific Guidelines - OBJECTS

MDC: medical and pharmaceutical products, names of medicines, diseases, proteins
  e.g. Xenical, Alzheimers, Omega-3,...
FWP: food and wine products, names of drinks, fruits
  e.g. Granny Smith, Biff Oscar, Fanta,...
CMP: S/W and H/W products, as well as telephony
VH(AGW): vehicles and transportation means, water, land and air/space
PRZ: scholarships, honours and prizes
PRD: general products and artefacts that do not fit in the previous subcategories
  e.g. Marlboro Lights, Libress Invisible,...


NE Tagset and Specific Guidelines - WORK&ART

WRT: names that deal primarily with written material, magazines, newspapers, journals, books
  e.g. Bibeln, Vanity Fair,...
RTV: names that denote radio and tv-programs; tv-series, soap-operas
  e.g. TV-serien Glappet, Lilla Sportspegeln,...
WAA: names of paintings, songs, films, operas, movies
  e.g. Filmen Titanic, operan ”Staden”,...

NE Tagset

ENTITY          attr.: TYPE    attr.: SBT
PERSON          PRS            HUM, ANM, MTH, CLC
LOCATION        LOC            AST, STR, GPL, PPL, FNC
ORGANIZATION    ORG            FIN, TVR, ATH, CLT, PLT, CRP
EVENT           EVN            HPL, CLT, ATH, WTH
OBJECT          OBJ            VHW, VHA, VHG, CMP, MDC, PRZ, FWP, PRD
WORK&ART        WRK            WRT, RTV, WAA
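A tagset of this kind lends itself to a simple lookup that checks whether a TYPE/SBT combination is allowed. The sketch below copies the codes from the tagset slide as printed; note that its EVENT row (HPL, CLT, ATH, WTH) differs from the EVENTS guideline slide, which lists HPL, WTH, CLU and ATL, and no attempt is made here to reconcile the two. The names `TAGSET` and `is_valid` are illustrative, and this check is not the project's XML schema.

```python
# TYPE code -> allowed SBT codes, as printed on the tagset slide.
TAGSET = {
    "PRS": {"HUM", "ANM", "MTH", "CLC"},
    "LOC": {"AST", "STR", "GPL", "PPL", "FNC"},
    "ORG": {"FIN", "TVR", "ATH", "CLT", "PLT", "CRP"},
    "EVN": {"HPL", "CLT", "ATH", "WTH"},
    "OBJ": {"VHW", "VHA", "VHG", "CMP", "MDC", "PRZ", "FWP", "PRD"},
    "WRK": {"WRT", "RTV", "WAA"},
}


def is_valid(type_code, sbt_code):
    """True if the subtype is listed under the given entity type."""
    return sbt_code in TAGSET.get(type_code, set())


print(is_valid("LOC", "PPL"), is_valid("PRS", "PPL"))
```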

Notes on the Evaluation

Evaluation consists of (at least) three parts:
  – Entity Detection (of the string that names an entity): <ENAMEX>Fjärran Östern</ENAMEX>
  – Attribute Recognition/Classification (of the entity): <ENAMEX TYPE=“LOCATION” SBT=“XXX”>Fjärran Östern</ENAMEX>
  – Extent Recognition (measure the ability of a system to correctly determine an entity’s extent; partial correctness): Fjärran <ENAMEX TYPE=“LOCATION”>Östern</ENAMEX>

Notes on the Evaluation cont’d

• Existing systems identify names on the scale ~90-95% in newswire texts (several languages)
• Metrics: vary from test case to test case; the commonest definitions are:
    Precision = #CorrectReturned / #TotalReturned
    Recall = #CorrectReturned / #CorrectPossible
• Quite high figures in P&R can be found in the literature based exclusively on these metrics...
• Papers lack a thorough discussion on metonymy or other difficult cases, while the evaluations have been solely based on monolithic types of texts...

Notes on the Evaluation cont’d

Guidelines for more rigid evaluation criteria have been imposed by the MUC; e.g.

Precision = (Correct + (0.5 * Partially Correct)) / Actual
Recall = (Correct + (0.5 * Partially Correct)) / Possible

Correct: two single annotations are considered identical
Partially Correct: two single annotations are not identical, but partial credit should still be given
Actual = Correct + Incorrect + Partially Correct + Spurious
Spurious: a response object has no key object aligned with it

Effectiveness (Meadow et al. ’00): Eff = 1 - √((1-P)² + (1-R)²) / √2
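The MUC-style measures can be computed directly from the five counts. The sketch below makes two assumptions that are not stated on the slide: Possible is taken as Correct + Incorrect + Partially Correct + Missing (the usual MUC definition), and the effectiveness formula is read as one minus the distance from the ideal point (P = 1, R = 1) divided by √2. The function name `muc_scores` is illustrative.

```python
import math


def muc_scores(correct, partial, incorrect, spurious, missing):
    """Return (precision, recall, effectiveness) from MUC-style counts."""
    actual = correct + incorrect + partial + spurious
    # Assumption: usual MUC definition of Possible (not given on the slide).
    possible = correct + incorrect + partial + missing
    credit = correct + 0.5 * partial  # partial matches earn half credit
    precision = credit / actual if actual else 0.0
    recall = credit / possible if possible else 0.0
    # Assumed reading of the slide: distance from (1, 1), scaled to [0, 1].
    eff = 1 - math.sqrt((1 - precision) ** 2 + (1 - recall) ** 2) / math.sqrt(2)
    return precision, recall, eff


p, r, eff = muc_scores(correct=8, partial=2, incorrect=0, spurious=0, missing=0)
print(p, r, eff)
```

With this reading, effectiveness is 1 for a perfect system and 0 when both precision and recall are 0.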

SUC-2

The SUC-2 (appx 1,1 mil. tokens) has been semi-automatically? annotated with ”NAME-tags”

15131 PERSON
8771 PLACE
6309 INST
1887 WORK
638 PRODUCT
540 OTHER
364 ANIMAL
280 MYTH
245 EVENT
242 FORMULA

Här har <NAME TYPE=ANIMAL>Nalle</NAME> frukosterat...

...ber <NAME TYPE=MYTH>Herren</NAME> välsigna vår...

...årsmöte i <NAME TYPE=OTHER>Kristiansborgskyrkan</NAME>…

...till nitrat ( <DISTINCT TYPE=FORMULA>NO3-</DISTINCT> ) och därefter...


SUC-2; Problems...

Though manually checked(?), SUC-2 is not completely error-free; for instance:

...over 80 cases of PERSON-tags in which the role names are part of the annotation
<NAME TYPE=PERSON>professor Ulf af Trolle</NAME>
<NAME TYPE=PERSON>fröken Lundmark</NAME>
<NAME TYPE=PERSON>komminister Harry Åström</NAME>

...cases of inconsistencies in the annotations
Plåtslagarns hus under de [...] <NAME TYPE=PERSON>Plåtslagarmästarens</NAME> hus , hade han [...]
...sonsöner i bön till den Allsmäktige ,
glad kunde väl inte ens <NAME TYPE=MYTH>Den Allsmäktige</NAME>
...han påpekade att <NAME TYPE=PERSON>Jesus Kristus</NAME> i själva
....sin uppenbarelse från <NAME TYPE=MYTH>Jesus Kristus</NAME>

...erroneous tags?
Efter slaget på <NAME TYPE=PLACE>Vita berget</NAME> 1620 [...]
...ballong med inskriptionen <NAME TYPE=EVENT>Teaterdagar</NAME>

The "EVENT"s in SUC-2

Tested the current status of the Swedish system that recognizes & annotates "EVENT" entities
A rather "open" category, on whose content it is difficult to reach consensus
The SUC-2 annotations of EVENT (245) seem at first glance to closely resemble the NN-group of EVENT
Hypothesis: does our system recognize more events than those annotated in SUC2? Answer: YES

Mixed-All: F=82,65% Eff=0,77 P=97,6% (166/170) R=67,7% (166/245)

"EVENT": Error Analysis

1 partially correct: <EVENT>U.S.-Japan Workshop on Smart</> / Intelligent Materials and Systems
3 annotations were not in the SUC2 sentences with EVENT-tags: VM-kvalet, märkes-VM, världscupen

There were a few questionable cases in the SUC2:
...ballong med inskriptionen <SUC EVENT>Teaterdagar</>
Dubbeltriumfen i <SUC EVENT>Wimbledon</> var inte sämre...

Some cases not recognized by the Swedish system:
... torsdag , är det <SUC EVENT>Qingming</> , dagen då gravarna ...
" <SUC EVENT>Globengrejen</> " förra året var ju...
<SUC EVENT>Barcelona</> , då ?
...såg dock till att de dominerade <SUC EVENT>Chiquitatalangen</> , ....
...Kulturhusets annonserade " <SUC EVENT>Titelmatch i 15 satser</> " ...
...en minneslapp där det står " <SUC EVENT>Canossa</> "
...sammankomster där " <SUC EVENT>Catechismi Förhör</> " hölls...
...skedde på <SUC EVENT>Yom Kippur</> nittonhundratrettisex .

EVENTs not Marked in SUC2

FOUND 168 "EVN" not marked as such in SUC2; 32 of those found were wrong, of which ATG (12), Paris All Stars (3), Warszawapakten (2); e.g.
"Musikerna i Paris All Stars strålar samman..."
"...konstruerade men aldrig avslutate Multi-User-Dungeon-spel ( MUD )..."
"Researrangören heter Holiday on Ice och då..."

14 SBT="HPL">andra världskriget</ENAMEX>
9 SBT="ATL">Allsvenskan</ENAMEX>
8 SBT="HPL">första världskriget</ENAMEX>
4 SBT="CLU">D-föreställning</ENAMEX>
2 SBT="WTH">orkanen Andrew</ENAMEX>
2 SBT="HPL">Vietnamkriget</ENAMEX>
2 SBT="HPL">Tjernobylolyckan</ENAMEX>
2 SBT="HPL">finska inbördeskriget</ENAMEX>
2 SBT="CLU">Karl XII-utställning</ENAMEX>
2 SBT="CLU">Dionysosfesten</ENAMEX>
2 SBT="CLU">1992 års FN-konferens</ENAMEX>
2 SBT="ATL">VM</ENAMEX>
.............

EVENT-discrepancies in SUC2

egentligen inte har i <NAME TYPE=EVENT>OS</NAME> att göra .
bara några dagar före <NAME TYPE=EVENT>OS</NAME>
...första matchen här i OS-turneringen
...Den här säsongen satsar jag på <NAME TYPE=INST>AIK</NAME> och OS ,

Men de har inte hört om <NAME TYPE=EVENT>Tjernobyl</NAME> ,
...svårt av effekterna av <NAME TYPE=EVENT>Tjernobylolyckan</NAME>
...minska effekterna av Tjernobylolyckan , ...

... agerandet under <NAME TYPE=EVENT>andra världskriget</NAME> och
han kom själv hit som flykting under andra världskriget .
och hörde till slutet av första världskriget till ...

The "MYTH"s in SUC-2

Tested the current status of the Swedish system that recognizes & annotates "MYTH" entities
The SUC-2 annotations of MYTH (280) seem at first glance to closely resemble the NN-group of PRS:MTH
Hypothesis: does our system recognize more myths than those annotated in SUC2? Answer: YES

Mixed-All: F=83,2% Eff=0,82 P=88,6% (218/246) R=77,8% (218/280)


"MYTH": Error Analysis

Strange cases
<NAME TYPE=MYTH>Paris</NAME> , vrålar den bedragne <NAME TYPE=PERSON>Menelaos</NAME> , är en horbock...
...placerade där <NAME TYPE=MYTH>Atlantis</NAME> , den sjukna kontinenten
...att <NAME TYPE=PERSON>Läraren</NAME> själv skulle återuppstå...
...tolkas som <NAME TYPE=PERSON>Jesus</NAME>, eftersom han liksom
...fungerade den <NAME TYPE=MYTH>Gudomliga Uppenbarelsen</NAME> som en mall ...
...beroende av den <NAME TYPE=MYTH>Gudomliga Uppenbarelsen</NAME>
...aldrig ätit av <NAME TYPE=MYTH>Kunskapens träd</NAME>
ska skriva <NAME TYPE=MYTH>Den stora amerikanska romanen</NAME>
...pocketversionen av <NAME TYPE=MYTH>Baskervilles hund</NAME> ...

MYTHs not Marked in SUC2

FOUND 146 "MTH" not marked as such in SUC2 (tagged as PERSON); 5 of those found were wrong; e.g.
"...översättningar är extremt ordagranna , t_ex Klagovisorna , Hesekiel…"

85 SBT="MTH">Jesus</ENAMEX>
33 SBT="MTH">Jesu</ENAMEX>
8 SBT="MTH">Gud</ENAMEX>
4 SBT="MTH">Guden</ENAMEX>
2 SBT="MTH">Jungfru Maria</ENAMEX>
2 SBT="MTH">guden</ENAMEX>
1 SBT="MTH">profeten Mohammeds</ENAMEX>
1 SBT="MTH">profeten Jesus</ENAMEX>
1 SBT="MTH">Johannes Döparen</ENAMEX>
1 SBT="MTH">Jesus Kristus</ENAMEX>
1 SBT="MTH">heliga Anden</ENAMEX>
1 SBT="MTH">den Allsmäktige</ENAMEX>
1 SBT="MTH">Buddha</ENAMEX>

MYTH-discrepancies in SUC2

...sonsöner i bön till den Allsmäktige ,
...glad kunde väl inte ens <NAME TYPE=MYTH>Den Allsmäktige</NAME>

...är att profeten <NAME TYPE=PERSON>Jesus</NAME> verkligen
...den försoning som <NAME TYPE=PERSON>Jesus</NAME> död innebär
...himmelske <NAME TYPE=MYTH>Jesus</NAME> hade uppenbarat sig
...frälsande tron på <NAME TYPE=MYTH>Jesus</NAME>

... påpekade att <NAME TYPE=PERSON>Jesus Kristus</NAME> i själva
...sin uppenbarelse från <NAME TYPE=MYTH>Jesus Kristus</NAME>

Gode Gud , om det syntes på henne .
<NAME TYPE=PERSON>David</NAME> skrek till å Gud
...och den <NAME TYPE=MYTH>Gud</NAME> välsigne våra tappra
...bekant med <NAME TYPE=MYTH>Guds</NAME> röst

The "ANIMAL"s in SUC-2

Tested the current status of the Swedish system that recognizes & annotates "ANIMAL" entities
The SUC-2 annotations of ANIMAL (364) seem at first glance to closely resemble the NN-group of PRS:ANM
Hypothesis: does our system recognize more animals than those annotated in SUC2? Answer: NO!

Mixed-All: F=51,9% Eff=0,36 P=93% (40/43) R=10,9% (40/364)

WHY??

"ANIMAL": Error Analysis

Two evaluations were conducted: one using only the "ANIMAL" part of the PERSON-grammar and one using the grammar plus the DCA. In the first case: Precision=93% (40/43) & Recall=10,9% (40/364). Using the grammars and the DCA module, Precision increased slightly (97,4%) while Recall was substantially raised to 81,6%. Why is that?

The vast majority of the "ANIMAL" occurrences in SUC are found in texts from the area "K"="Imaginative prose" and particularly "KK"="General fiction" (~60%; 218 of 364); simple DCA on these texts raises Recall!

See the two examples of "Matti" and "Julius"

Mixed-All: F=89,5% Eff=0,73
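The recall jump can be illustrated with a minimal sketch of the idea, assuming that DCA here means propagating a name that the grammar recognized at least once to all other, still untagged occurrences in the same document. The function name, the seed dictionary, and the handling of the genitive -s are hypothetical choices made for this illustration; they are not the project's implementation.

```python
import re


def propagate_names(sentences, seeds):
    """Tag every untagged occurrence of a seed name within one document.

    sentences: plain-text sentences from a single document.
    seeds: maps a name already recognized by the grammar to (type, subtype).
    """
    tagged = []
    for sentence in sentences:
        for name, (typ, sbt) in seeds.items():
            # Whole-token match, optionally with genitive -s ("Mattis").
            # The look-arounds skip names that already sit inside a tag.
            pattern = re.compile(
                r"(?<![\w>])(" + re.escape(name) + r"s?)(?![\w<])"
            )
            sentence = pattern.sub(
                lambda m: f'<ENAMEX TYPE="{typ}" SBT="{sbt}">{m.group(1)}</ENAMEX>',
                sentence,
            )
        tagged.append(sentence)
    return tagged


doc = [
    "Sedan ville Matti gå in .",
    "En månad efter kastrationen var skillnaden i Mattis humör liten eller obefintlig .",
]
for line in propagate_names(doc, {"Matti": ("PRS", "ANM")}):
    print(line)
```

In a text such as KK60, where the grammar matches only 2 of the 55 occurrences of "Matti", a step of this kind would let the two contextual matches license the remaining mentions, which is consistent with the rise in Recall reported above.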

The Volvo/Saab Evaluation (1)

the Volvo/Saab case (can be generalized to other metonymic cases)
a typical, frequent and fairly difficult example that illustrates how metonymic distinctions are captured in the grammars
searched in corpora (872 inst.); idea: capture and model difficult & unusual cases, tag the rest as default (in this case <ORG:CRP>); e.g.

Object: <VHG>
...Saab 9000...
...mellanklassbilar som Volvo,...
...att köra Volvo i en Volvostad som...
... i en stor svart Volvo och blinkade...
...tjuven försvinner i en stulen Saab
...tappat kontrollen över sin Volvo

Object: <PRD>
Volvo steg med 12 kronor
Saab backade med 1 procent
...gick Volvo ned med 10 kronor...

Organization: <CRP>
rest of cases are tagged as default...


The Volvo/Saab Evaluation (2)

Sense1: object; product (vehicle); Patterns after corpus analysis

1. (Poss-Pron|Color-Adj|Partcpl)* (Saab | Volvo) NUM+ [A-Z]?
2. (Poss-Pron|Color-Adj|Partcpl)* (Saab | Volvo) NUM? [A-Z]? (coupé|turbo|dieselcabriolet|corvette|transporter|cc|...)
3. bil(ar)? (av märket|som) (Saab | Volvo) NUM? [A-Z]?
4. (Typical-Verbs:köra|Typical-Nouns) (Saab | Volvo)

no rules without exception: … dagarna kör Volvo två fälttester i Sverige …

but the above simplistic patterns return, in the 143 (of 872) occurrences that were OBJ/VHG, a 95,1% (118/124) precision and 82,5% (118/143) recall
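The four vehicle patterns can be approximated with ordinary regular expressions. The sketch below is only a rough transcription: the word lists for possessive pronouns, colour adjectives, participles, model words and forms of "köra" are small hypothetical stand-ins for the grammar's word classes, and the "Typical-Nouns" part of pattern 4 is left out because the slide does not list it.

```python
import re

# Hypothetical mini-lexicons standing in for the grammar's word classes.
POSS = r"(?:min|din|sin|hans|hennes|vår|er|deras)"
COLOR = r"(?:svart|vit|röd|blå|grön|gul|grå)"
PARTCPL = r"(?:stulen|parkerad|begagnad)"
MOD = rf"(?:(?:{POSS}|{COLOR}|{PARTCPL})\s+)*"
BRAND = r"(?:Saab|Volvo)"
MODEL = r"(?:coupé|turbo|diesel|cabriolet|cc)"

VEHICLE_PATTERNS = [
    # 1. modifiers* BRAND NUM+ [A-Z]?
    re.compile(rf"{MOD}{BRAND}(?:\s+\d+)+(?:\s+[A-Z]\b)?"),
    # 2. modifiers* BRAND NUM? [A-Z]? model-word
    re.compile(rf"{MOD}{BRAND}(?:\s+\d+)?(?:\s+[A-Z]\b)?\s+{MODEL}\b"),
    # 3. bil(ar)? (av märket|som) BRAND
    re.compile(rf"bil(?:ar)?\s+(?:av märket|som)\s+{BRAND}\b"),
    # 4. typical verb (forms of "köra") directly before BRAND
    re.compile(rf"\bkör(?:a|de)?\s+{BRAND}\b"),
]


def is_vehicle(fragment):
    """True if any of the sense-1 (OBJ/VHG) patterns fires on the fragment."""
    return any(p.search(fragment) for p in VEHICLE_PATTERNS)


print(is_vehicle("...Saab 9000..."))
print(is_vehicle("i dagarna kör Volvo två fälttester i Sverige"))
```

The second call also fires, because "kör Volvo" satisfies pattern 4 even though Volvo is the organization in that sentence; this is the kind of exception the slide points out.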

The Volvo/Saab Evaluation (3)

Sense2: object; share; Patterns after corpus analysis

1. (Saab | Volvo) AUX? VERB(steg/stig*/backa*)
2. (Saab | Volvo) AUX? VERB(öka*/minska*)? med NUM procent
3. (Saab | Volvo) gick (tillbaka kraftigt|mot strömmen|upp|ned)
4. (Saab | Volvo) NUM procent

the above patterns return, in the 11 (of 872) occurrences that were OBJ/PRD, a 100% (10/10) precision and 90,9% (10/11) recall
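A rough Python transcription of the share patterns, combined with the default-sense fallback, might look as follows. It is a sketch under stated assumptions: the auxiliary list and the number format are hypothetical, the patterns are transcribed literally from the slide, and only the share sense and the default are modelled.

```python
import re

BRAND = r"(?:Saab|Volvo)"
# Hypothetical list of auxiliaries; the grammar's AUX class is not given.
AUX = r"(?:(?:har|hade|kan|kunde|ska|skulle)\s+)?"
# Integer or comma decimal, e.g. "12" or "1,5" (assumed format).
NUM = r"\d+(?:,\d+)?"

SHARE_PATTERNS = [
    # 1. BRAND AUX? VERB(steg/stig*/backa*)
    re.compile(rf"\b{BRAND}\s+{AUX}(?:steg|stig\w*|backa\w*)\b"),
    # 2. BRAND AUX? VERB(öka*/minska*)? med NUM procent
    re.compile(rf"\b{BRAND}\s+{AUX}(?:(?:öka|minska)\w*\s+)?med\s+{NUM}\s+procent\b"),
    # 3. BRAND gick (tillbaka kraftigt|mot strömmen|upp|ned)
    re.compile(rf"\b{BRAND}\s+gick\s+(?:tillbaka kraftigt|mot strömmen|upp|ned)\b"),
    # 4. BRAND NUM procent
    re.compile(rf"\b{BRAND}\s+{NUM}\s+procent\b"),
]


def sense(fragment):
    """Return 'OBJ/PRD' for the share reading, else the default 'ORG/CRP'."""
    if any(p.search(fragment) for p in SHARE_PATTERNS):
        return "OBJ/PRD"
    return "ORG/CRP"


print(sense("Volvo steg med 12 kronor"))
print(sense("Volvo har sin instrumentmiljö , trist"))
```

Because the patterns are transcribed literally, inverted word order such as "gick Volvo ned" falls through to the default in this sketch; a fuller grammar would need a variant of pattern 3 for that order.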

The Volvo/Saab Evaluation (4)

Sense3: location; share; No patterns could be found after corpus analysis

Sense4: organization; was the default sense, since the vast majority in corpora are of this sense

the default pattern (organization) returned, in the 710 (of 872) occurrences that were ORG/CRP, a 95,9% (708/738) precision and 99,7% (708/710) recall

In the 872 instances there were also 4 cases not captured by the patterns, and not belonging to the 4 sense distinctions discussed:

e.g. "Kungsträdgården <VHG>Volvo</> showroom 19/9..."

The Volvo/Saab Evaluation (5)

In the 872 fragments extracted from corpora containing either Volvo or Saab: 143 were OBJ/VHG, 4 were OBJ/VHA, 710 were ORG/CRP, 11 were OBJ/PRD, 3 were LOC/FNC and 1 was ORG/ATH

LOC/FNC Error: Kungsträdgården <VHG>Volvo</> showroom 19/9....
OBJ/ORG Error: ...i dagarna kör <VHG>Volvo</> två fälttester i ...
OBJ/VHG=>VHA Errors: ...och <VHG>Saab 2000</> trappas ner ...
...av typen <VHG>Saab 2000</> med utrustning för kontroll
...jag flugit <VHG>Saab B 17</> ! Herregud ...
OBJ/VHG Error: ...svågerns <ORG>Volvo</> under S:t Eriksbron...
OBJ/VHA Error: ...bilsätena till <ORG>Volvo</> skärs och sys...
Mats Salomons <ORG>Saab</> på tvären och voltar...
<ORG>Volvo</> har sin instrumentmiljö , trist...
UNDECIDABLE: ... I say <ORG>Saab</> , I say Bjorn and Benny .

Problems: Metonymy, Adjacent Names

a speaker uses a reference to one entity to refer to another entity (or entities) related to it; in ALL groups we found words metonymic with other groups!

OBJ#LOC: ... A-influensan som är ganska lik Asiaten och Hongkong
OBJ#PRS: ...gäver av pipspännarkonstruktion som bär namn som Carl Gustaf...
WRK#W&A: ...släpptes första singeln Château d' amour...
ANM#PRS: Underbara Barbara galloperade...
EVN#PRS: Maria Marathon vs Stockholm Marathon

Adjacent names might be a problem in certain contexts:
...på Karolinska sjukhuset i Solna Gunilla Bolinder
...enligt Zhang Yimov Kinas mest berömda komikern
...grundaren av Ruter Dam Gunilla Arhén

Wrapping UP: example 1

<GP030121>Efter bilen kommer turen till hunden

ALTEA : På en bensinstation i kuststaden Altea i sydöstra Spanien ( cirka tio mil söder om Valencia ) kan hundägare slå två flugor i en smäll . På stationen erbjuds inte bara tvätt av bilen , här kan även hunden få sig en grundlig rengöring . Peter Ardon och hans till synes skeptiska hund Pia var i går på platsen och undersökte automatens valmöjligheter . Man kunde välja mellan fyra olika tvättningsprogram : tvätt , skölj , avlusning och torkning . Vad Peter Ardon och Pia valde förtäljer inte historien .

Module legend on the slide: MULTIS, GRAMMARS, DCA, SINGLE, FILTER


Wrapping UP: example 2

<GP030121>Erdogan friad från mutmisstankar

ANKARA : Den turkiske ledaren Recep Tayyip Erdogan , ledare för det regerande Rättvise- och utvecklingspartiet ( AKP ) , friades i går från anklagelser om korruption . En domstol i Ankara har undersökt påståenden om att Erdogan skaffat sig en förmögenhet genom mutor under sin tid som borgmästare i Istanbul på 1990-talet . Erdogan och hans parti AKP tog hem en jordskredsseger i valet i november . Men Erdogan har inte tillåtits att ta plats i parlamentet . Orsaken är att han 1998 dömdes för religiös uppvigling . En lagändring gör att Erdogan nu kan ställa upp i ett fyllnadsval den 9 februari . ( TT-DPA )

Module legend on the slide: MULTIS, GRAMMARS, DCA, SINGLE, FILTER

Wrapping UP: example 3

<GP030121>LGP Telecom gör nytt försök att köpa Allgon

LGP Telecom gör ett nytt försök att köpa antenntillverkaren Allgon . För tre år sedan blev det nobben till fusionen . I dag när verkligheten ser annorlunda ut och priset är en femtedel verkar Allgons ägare beskedligare . LGP Telecom lade i går ett bud på samtliga aktier i Allgon . Köpet ska ske genom betalning med egna nyutgivna aktier . Budet är värt 760 miljoner kronor baserat på börskurserna i de bägge bolagen som båda är underleverantören till telekomindustrin . Budpremien - det erbjudna priset jämfört med Allgons börsvärde - angav LGP till 62 procent . Men den premien åts snabbt upp när LGP:s aktie rasade med 20 procent sedan affären blivit känd i går .

Module legend on the slide: MULTIS, GRAMMARS, DCA, SINGLE, FILTER

Wrapping UP: example 4 SUC-file: KK60 - "Matti"

I hela sitt liv väntade Matti på att jag skulle ta initiativ .
Jag överförde mänskliga egenskaper på Matti .
Matti fick tre promenader och mental träning varje dag .
I början förutsatte jag att Matti var en normal hund .
I början av januari fick Matti stygnen borttagna där hans testiklar hade opererats bort .
En månad efter kastrationen var skillnaden i Mattis humör liten eller obefintlig .
Matti hade ätit I/D i fyra och ett halvt år .
Matti stank hel och hållen .
Matti låg på sin madrass .
Sigge kom in i rummet och befann sig omkring tre meter från Matti .
<ENAMEX TYPE="PRS" SBT="ANM">Matti</ENAMEX> morrade kraftigt .
<ENAMEX TYPE="PRS" SBT="ANM">Matti</ENAMEX> morrade ännu mer och reste sig .
Matti ville ha sin liggplats ifred , det var naturligt .
Sedan ville Matti gå in .
Matti visste att Janne och jag ville honom väl .
Jag hade ett sorgset och förtvivlat tonfall när jag pratade med och om Matti .
Matti hörde mitt sorgsna tonfall och blev påmind om att han var sjuk .

...................

"Matti": 55 occ., 2 matches!

Wrapping UP: example 4 SUC-file: KK58 - "Julius"

Julius .
Julius hade ingen respekt för Matti .
Julius ställde sig på bakbenen och lekte med Mattis svans .
När Matti ville nosa satte Julius upp en tass : Stopp !
På nätterna ville Julius vara hos mig .
Julius fick lov att vara uppe hos mig igen .
Julius låg och spanade efter kroppsdelar som rörde sig .
När Matti låg på sidan kröp Julius ihop intill Mattis mage .
På morgnarna satte jag ut mat åt Julius i ladugården .
Julius höll sig runt gården .
Julius var ensam hemma .
Det var som om Julius lockade till sig mössen .
Jag ville inte att Julius skulle bo i lägenhet .
Om inte Julius fanns ...
<ENAMEX TYPE="PRS" SBT="ANM">Julius</ENAMEX> var en underbar katt .
En morgon tog jag Julius med mig i bilen .
Han tog med sig Julius en bit in i skogen .
Julius åt .

...................

"Julius": 30 occ., 1 match!

Some Final Remarks

One of the challenges with NER is creating a stable definition of what an entity is and creating a taxonomy of entities to map to...

Having done that, it becomes simpler to solve most of the metonymy and other ambiguity problems...

Problems still remain; where shall we draw the entity boundaries?

Have we chosen the best approach?
How can we best measure performance? (different text genres, formats etc.)
...

Plan for 2003

February-May: continuous testing of modules in isolation and together; integration of modules (different permutations)
May-August: work on the interface
June-November: large-scale evaluation on an improved SUC2(?) and/or other material
December: wrapping-up
Upcoming events: NODALIDA '03 & ACL '03 Workshop "Multilingual NER"

