Confidential © MetaCarta, Inc. 2004

MetaCarta Federal Users Group
Natural Language Processing Technology Overview

Andras Kornai
Chief Scientist

MetaCarta
875 Massachusetts Avenue
Cambridge MA 02138
[email protected]

Plan of the talk

• What is NLP?

• What is geography?

• How do we combine them?

Where…

• Where is/are…

– Al Sadr?

– The WMD?

– Osama Bin Laden?

– The enemy?

MetaCarta Internal Testing Results

 #docs   %geo      #docs   %geo
 50672   95.69     34476   62.10
 12245   89.43     73771   52.65
 17709   90.01     59755   64.03
105944   59.1      47917   74.94
 91766   65.18     48532   73.08
 97829   43.94     40696   71.43
 54693   59.17     29011   68.09
 26447   51.85     65856   87.87
 20964   69.19     77641   83.53
 29693   81.43     85610   89.41
 42480   64.32     75934   87.28
 34498   64.72     65848   86.38
 37020   50.47     81774   76.99
 30558   60.21     10320   74.88
 19611   58.35     81938   93.74
 61570   81.54     77719   92.01
 33129   86.18     65276   88.15
 47936   64.78     67563   87.19
                   10042   74.66

WWW Data Average

# Docs       % Geo-Relevant
1,914,443    74.05

What is NLP?

• NLP stands for Natural Language Processing

• Sometimes the field is referred to as Computational Linguistics

• The main split

– Numeric techniques: pattern matching, light parsing, induction

– Symbolic techniques: tokenization, grammar rules, deduction

What is GIS?

• GIS stands for Geographic Information System

• Such systems have

– Georeferenced data

– Map display interface

– Decision support

A Typical Document

PATHFINDER RECORD NUMBER: 17
GENDATE: 20021101
CLASS : UNCLASSIFIED
CLASSIF: UNCLASSIFIED
GENTIME: 200211010835
INFODATE: 20021101
INFOTIME: 1325
DTG: 011325Z NOV 02
FROM: FBIS RESTON VA
TITLE: CAR: Bangui residents cautiously resume routine activities after rebels crushed
WARNING:
TOPIC: DISSENT, DOMESTIC POLITICAL, LEADER, MILITARY
SERIAL: AFP20021101000102
DOCCOUNTRY: CENTRAL AFRICAN REPUBLIC
DOCCOUNTRY: CHAD
DOCCOUNTRY: LIBYA
DOCCOUNTRY: FRANCE
DOCCOUNTRY: UNITED STATES
TITLE: CAR: Bangui residents cautiously resume routine activities after rebels crushed
TEXT: 1. Chad: Government alleges 80 to 120 compatriots ’massacred’ in CAR capital Bangui
SOURCE: Paris AFP (World Service) in English 1153 GMT 01 Nov 02
TEXT:[FBIS Transcribed Text]

BANGUI, Nov 1 (AFP) – Residents of the Central African Republic (CAR) capital cautiously resumed their routine activities Friday, two days after rebels were driven out of Bangui by pro-government forces. The government in neighbouring Chad on Thursday [31 October] night urged CAR President Ange Felix Patasse "to halt the mass arrests" of Chadian nationals living in his country.

In a statement, the Ndjamena government alleged that Patasse’s presidential guards had on Thursday massacred between 80 and 120 Chadians amid allegations that Chad had been involved in the six-day uprising.

The rebellion was launched last Friday [25 October] by supporters of renegade former army chief General Francois Bozize, who was dismissed by Patasse last year and has gone into exile in France after fleeing to Chad.

Chad’s Communications Minister Moctar Wawa Dahab told AFP that "between 80 and 120 Chadians were massacred Thursday at around 5:00 pm (1600 GMT)" in the PK12 district, about 12 kilometers (eight miles) north of central Bangui.

[Description of Source: Paris AFP (World Service) in English]

THIS REPORT MAY CONTAIN COPYRIGHTED MATERIAL. COPYING AND DISSEMINATION IS PROHIBITED WITHOUT PERMISSION OF THE COPYRIGHT OWNERS.

(endall)
BT
<<PATHFINDER ANNOTATIONS>>
PERSON: PRESIDENT ANGE FELIX PATASSE
PERSON: GENERAL FRANCOIS BOZIZE
MILUNIT: LIBYAN FORCES
COUNTRY: UNITED STATES
COUNTRY: CHAD
COUNTRY: LIBYA

Document Structure

• Typically, documents are not just running text; they have structure
• In this domain, the structure is simple: header – body – footer
• In other domains the structure can be much more complex
– For example, US patents have the structure: barcode – “United States Patent” – number – author – date – full title – inventors – assignee – notice – application number – filing date – international classification – US classification – field of search – references cited – primary examiner – attorney – abstract – claims/drawings count – figures – title – background – summary – description of drawings – detailed description of invention – claims
• Many of these fields are metadata (data about the data), others are “the” data

Metadata

• Much of the critical metadata is external to the document:
– Access path, revision history, importance assigned by analyst, ...

• Critical pieces of metadata are not explicit in the document but need to be computed.
– Computing the size of a document is easy
– Computing the stance is hard

• The data/metadata division is task driven:
– One person’s metadata trash is another person’s data treasure

Data Styles

                 Free                 Fielded
Repository       numeric techniques   database
Information      textual              numerical
Use case         parse                compute
Metadata         inferred             documented
Typical field    Message traffic      MASINT

Adding Documents to the GIS Flow

• Parse document into relevant segments (symbolic techniques are strong)

• Apply segment-specific rules
– Date field: do not extract 1 May (a town in Kazakhstan)
– Author field: do not extract Rachel Creek (Little Rock, AK)
– Free text field: symbolic techniques are weak

• Build georeference index

• Visualization of georeferenced textual data
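The segment-specific rules above can be sketched as a per-segment policy consulted before any lookup. This is a minimal illustration, not MetaCarta's implementation: the toy gazetteer, the segment names, and the policy table `EXTRACT_FROM` are all assumptions made up for the example.

```python
# Hypothetical toy gazetteer (names only) for illustration.
GAZETTEER = {"1 May", "Rachel Creek", "Bangui", "Chad"}

# Hypothetical policy: which segment types may yield georeferences.
EXTRACT_FROM = {"date": False, "author": False, "free_text": True}


def extract_toponyms(segment_type, text):
    """Return gazetteer names found in text, unless the segment type forbids extraction."""
    if not EXTRACT_FROM.get(segment_type, False):
        return []
    return sorted(name for name in GAZETTEER if name in text)


segments = [
    ("date", "1 May 2004"),
    ("author", "Rachel Creek"),
    ("free_text", "Rebels were driven out of Bangui; Chad protested."),
]
hits = {seg_type: extract_toponyms(seg_type, text) for seg_type, text in segments}
```

Here the date and author segments yield nothing even though their text matches a gazetteer entry, while the free-text segment is searched normally.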

Load-time Document Flow

Doc repository or feed
→ ingestion → document-size files
→ format-based segmentation → sectioned data
→ section-specific extraction → potential georeferences
→ disambiguation → georeferenced text segments
→ indexer → database entries

Ingestion

• Sources
– Database
– Shared drives
– NFS mounts
– Document mgmt system
– Crawl
– Live feed
• Authentication
• Format conversion
– Plain ASCII
– MS Word
– PDF
– HTML
• Mixing push and pull models
• Metadata indexes

Format-based Segmentation – Header

PATHFINDER RECORD NUMBER: 17
GENDATE: 20021101
CLASS : UNCLASSIFIED
CLASSIF: UNCLASSIFIED
GENTIME: 200211010835
INFODATE: 20021101
INFOTIME: 1325
DTG: 011325Z NOV 02
FROM: FBIS RESTON VA
TITLE: CAR: Bangui residents cautiously resume routine activities after rebels crushed
WARNING:
TOPIC: DISSENT, DOMESTIC POLITICAL, LEADER, MILITARY
SERIAL: AFP20021101000102
DOCCOUNTRY: CENTRAL AFRICAN REPUBLIC
DOCCOUNTRY: CHAD
DOCCOUNTRY: LIBYA
DOCCOUNTRY: FRANCE
DOCCOUNTRY: UNITED STATES
TITLE: CAR: Bangui residents cautiously resume routine activities after rebels crushed
TEXT: 1. Chad: Government alleges 80 to 120 compatriots ’massacred’ in CAR capital Bangui
SOURCE: Paris AFP (World Service) in English 1153 GMT 01 Nov 02
TEXT:[FBIS Transcribed Text]

BANGUI, Nov 1 (AFP) – Residents of the Central African Republic (CAR) capital cautiously resumed their routine activities Friday, two days after rebels were driven out of Bangui by pro-government forces. The government in neighbouring Chad on Thursday [31 October] night urged CAR President Ange Felix Patasse "to halt the mass arrests" of Chadian nationals living in his country.

In a statement, the Ndjamena government alleged that Patasse’s presidential guards had on Thursday massacred between 80 and 120 Chadians amid allegations that Chad had been involved in the six-day uprising.

The rebellion was launched last Friday [25 October] by supporters of renegade former army chief General Francois Bozize, who was dismissed by Patasse last year and has gone into exile in France after fleeing to Chad.

Chad’s Communications Minister Moctar Wawa Dahab told AFP that "between 80 and 120 Chadians were massacred Thursday at around 5:00 pm (1600 GMT)" in the PK12 district, about 12 kilometers (eight miles) north of central Bangui.

[Description of Source: Paris AFP (World Service) in English]

THIS REPORT MAY CONTAIN COPYRIGHTED MATERIAL. COPYING AND DISSEMINATION IS PROHIBITED WITHOUT PERMISSION OF THE COPYRIGHT OWNERS.

(endall)

BT

<<PATHFINDER ANNOTATIONS>>

PERSON: PRESIDENT ANGE FELIX PATASSE

PERSON: GENERAL FRANCOIS BOZIZE

MILUNIT: LIBYAN FORCES

COUNTRY: UNITED STATES

COUNTRY: CHAD

COUNTRY: LIBYA
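A header – body – footer split of a record like this one can be sketched with a single regular expression for `KEY: value` lines. This is only an illustration under stated assumptions: the `FIELD` pattern, the rule that the first non-field line starts the body, and the use of `(endall)` as the footer marker are guesses from this one sample, not a documented message format.

```python
import re

# Assumed shape of a header field: an upper-case key, optional space, a colon, a value.
FIELD = re.compile(r"^([A-Z][A-Z ]*?)\s*:\s*(.*)$")


def segment(record):
    """Split a record into (header fields, body lines, footer lines)."""
    header, body, footer = [], [], []
    state = "header"
    for line in record.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        if state == "header":
            m = FIELD.match(line)
            if m:
                header.append((m.group(1), m.group(2)))
                continue
            state = "body"  # first non-field line starts the body
        if state == "body" and line == "(endall)":
            state = "footer"  # assumed end-of-text marker
        (body if state == "body" else footer).append(line)
    return header, body, footer


# Abbreviated version of the sample record.
RECORD = """
PATHFINDER RECORD NUMBER: 17
GENDATE: 20021101
CLASS : UNCLASSIFIED
DOCCOUNTRY: CHAD
TEXT:[FBIS Transcribed Text]
BANGUI, Nov 1 (AFP) - Residents of the CAR capital cautiously resumed their routine activities Friday.
(endall)
BT
<<PATHFINDER ANNOTATIONS>>
COUNTRY: CHAD
"""
header, body, footer = segment(RECORD)
```

Once the record is in the footer state, `COUNTRY: CHAD` is kept as a footer line rather than parsed as a header field, so each segment can later get its own extraction rules.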

Section-specific extraction of georeferences

• Toponyms
– Alaska, Baltimore, Cambridge, DC, ... (points and regions)
• Explicit coordinates
– 48.3N 33.5W
• Relative references
– 40 miles south of Fallujah
• Descriptive phrases
– three miles upriver
• Street addresses
– 1489 Jefferson Davis Highway
• Phone numbers
– 01155663224123 – Feliz Natal, Brazil
• IP addresses
– 171.64.22.122 – Palo Alto, CA
• Company names
– Y-Not Variety – Somerville, MA
• Etc…
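Of the georeference types listed, explicit coordinates are the easiest to sketch. The pattern below is an assumption for illustration only: it handles decimal degrees with hemisphere letters, as in the slide's example, and ignores other notations such as degrees-minutes-seconds.

```python
import re

# Assumed notation: decimal degrees followed by N/S, then decimal degrees followed by E/W.
COORD = re.compile(r"(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])")


def find_coordinates(text):
    """Return (latitude, longitude) pairs in signed decimal degrees."""
    results = []
    for lat, ns, lon, ew in COORD.findall(text):
        latitude = float(lat) * (1 if ns == "N" else -1)
        longitude = float(lon) * (1 if ew == "E" else -1)
        results.append((latitude, longitude))
    return results


coords = find_coordinates("Contact reported at 48.3N 33.5W at dawn.")
```

South and west are mapped to negative values, so `48.3N 33.5W` becomes the pair `(48.3, -33.5)`.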

Disambiguation

• Multiple meanings, same location:
– New York (city or state)
– Haora (city, lake, district)

• Same name, multiple locations:
– Paris TX, Paris France

• Vaguely defined
– Appalachia
– Labrador

• Historically shifting
– Poland

• Contested
– Kashmir

Indexing

• Input:
– Document identifiers
– Index key

• Output:
– Key lookup acceleration structure

• Standard indexes
– Keyword
– Key phrase
– Metadata
– Range queries

• CartaTrees
– Specifically for geographic information
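For the standard keyword case, the "key lookup acceleration structure" can be as simple as an inverted index. The sketch below shows only that standard technique, with made-up document identifiers. It says nothing about how CartaTrees work, which the talk does not describe.

```python
from collections import defaultdict


def build_keyword_index(docs):
    """Map each lowercased keyword to the sorted list of document identifiers containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?\"'()")].add(doc_id)
    return {key: sorted(ids) for key, ids in index.items() if key}


# Hypothetical two-document collection.
docs = {
    "d1": "Rebels were driven out of Bangui.",
    "d2": "Chad urged a halt to the arrests in Bangui.",
}
index = build_keyword_index(docs)
```

A lookup such as `index["bangui"]` then returns the matching document identifiers without scanning the documents again.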

Query-time Flow

user query
→ query parser → formal query
→ index → document list
→ map-based filtering → geolocated doclist
→ relevance ranking → WSDL stream
→ GUI → map, text, mouseover, ...

Query Parsing

• Ideal:
– Computer, what is enemy strength in Gamma Quadrant?

• Current state of the art:
– enemy OR hostile NOT Klingon AND forces OR deployment AND “Gamma Quadrant”

• Rock bottom reality:
– SELECT FROM KEYWORD WHERE …

• Supporting loose but clever queries is already very hard
– Arbitrary (nested) Booleans
– Near search
– Metadata index restricted search

Parsing Support

• Ideal: parse every doc in advance, parse query
• Required: high speed high quality sentential parser!
• Today: high speed high quality word-level parsers
– Typically rule-based/hand-crafted
– Machine learning systems emulate rule-based

• Within striking distance: phrase-level parsers
– Named Entity Extraction
– Role assignment (template filling)
– Modals
– Coreference

• Science Fiction:
– Conditionals
– Counterfactuals
– Above sentence level

Indexing

• Cheap: metadata
• Medium: keyword
• Expensive: key phrase
• Very expensive: positional
• Prohibitive: context dependent
• Futuristic: dynamically weighted
• Key issue: minimizing disk seeks

Map-based filtering

• Wanted: Bordeaux
• Workaround: Medoc OR Margaux OR Saint-Julien OR Saint-Estephe OR Moulis OR Pauillac OR Graves OR Pomerol OR … (over 250 significant chateaux, about 5k significant terms)

• The user-friendly solution: use map as filter
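Using the map as a filter amounts to a spatial test on each document's georeferences instead of a long keyword disjunction. A minimal sketch, assuming a rectangular map window and documents already tagged with (lat, lon) points. The document names, the coordinates, and the box are rough illustrative values, not data from the talk.

```python
def in_box(point, box):
    """True if (lat, lon) lies inside box = (south, west, north, east); no antimeridian handling."""
    lat, lon = point
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def filter_by_map(doc_locations, box):
    """Keep documents that have at least one georeference inside the box."""
    return [doc_id for doc_id, points in doc_locations.items()
            if any(in_box(p, box) for p in points)]


# Hypothetical documents with rough coordinates.
doc_locations = {
    "margaux_review": [(45.04, -0.67)],
    "napa_review": [(38.30, -122.29)],
}
bordeaux_area = (44.5, -1.3, 45.6, 0.0)  # rough window around the Bordeaux region
selected = filter_by_map(doc_locations, bordeaux_area)
```

The query never has to name Margaux: any document located inside the window is selected.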

Relevance Ranking

• Wanted: best results on top
• Factors used:
– Number of keywords matched
– Relevance of keywords
– Confidence in match
– Emphasis of keywords
– Quality/authority of doc
– External pointers to doc
– Timeliness of doc
– Usage pattern on query side (user adaptation)
– Usage pattern on response side (collaborative filtering)

GUI

• The Wall Street Journal's resistance to photographs is legendary. "I've always thought that one word was worth a thousand pictures," retired executive editor Fred Taylor told a reporter on the occasion of the Journal's 100th anniversary. (From the Smithsonian hed exhibit intro)

• Consumer market: eye candy loses to spartan interfaces

• GUI is justified only if it
– Works and plays well with text
– Is truly graphical (avoid fontitis)
– Comes with its own symbolic structure
– Uses a symbol code already understood by the user

Extraction Techniques

• Key issue: precision-recall tradeoff

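• Precision = 1 - false positive rate, where the rate is the share of extracted items that are wrong

• Recall = 1 - false negative rate, where the rate is the share of true items that are missed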
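As a worked example of the two measures, the sketch below scores a set of extracted items against a gold-standard set. The item lists are invented for illustration.

```python
def precision_recall(extracted, gold):
    """Precision and recall of a set of extracted items against a gold-standard set."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations and system output.
gold = {"Bangui", "Chad", "France", "PK12 district"}
extracted = {"Bangui", "Chad", "Friday"}
precision, recall = precision_recall(extracted, gold)
```

Two of the three extracted items are correct (precision 2/3), and two of the four true items are found (recall 1/2). Extracting more aggressively would tend to raise recall at the expense of precision.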

Peeking Ahead

<GeoText>KRASNOYARSK</GeoText>. Nov 1 (<OrgText>Interfax-AVN</OrgText>) - Five 85-100mm artillery shells have been found near <GeoText>the lodge of the Stalmost plant in the city of Ulan-Ude in the Buryatian autonomous republic</GeoText>.

"According to specialists who examined the explosive find, a combat shell was among the devices. Four others do not pose any danger," a spokesman for the <OrgText>Siberian regional center of the Emergencies Ministry</OrgText> told <OrgText>Interfax-Military News Agency</OrgText> on Friday.

The site is sealed off, workers are evacuated to safe places and there are no residential blocks around. The only danger is posed by <GeoText>the plant’s oxygen shop</GeoText> that is located 50 meters from <GeoText>the lodge</GeoText>.

Sappers of the <OrgText>Defense Ministry</OrgText>, operations groups of the <OrgText>Buryatian Emergencies Ministry</OrgText>, officials of the <OrgText>Federal Security Service and Interior Ministry</OrgText> are working at the site.

Lexical Lookup

• The single most effective technique in Named Entity Recognition is lookup
– Try to find every word or phrase stored in the lexicon, and use the information stored with the lexical entry

• In geographic text search, the lexicon is called the gazetteer
– For each toponym it typically contains: latitude, longitude, polygon, elevation, population, feature type, (admin) region, ...

• To these, MetaCarta adds a number called confidence
– Measures our strength of belief that the word (or phrase) is used in a geographic sense

• The better the gazetteer, the higher the recall, and the lower the precision
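Lookup with a confidence threshold can be sketched as a greedy longest-match scan over the tokens. Everything concrete here is an assumption for illustration: the three gazetteer entries, their approximate coordinates, the confidence values, and the default threshold are invented, and the talk does not say how MetaCarta computes or applies confidence.

```python
# Hypothetical gazetteer keyed by token tuples; coordinates are approximate and the
# confidence values are made up.
GAZETTEER = {
    ("new", "york"): {"lat": 40.7, "lon": -74.0, "confidence": 0.9},
    ("york",): {"lat": 53.96, "lon": -1.08, "confidence": 0.6},
    ("of",): {"lat": 40.9, "lon": 40.3, "confidence": 0.01},
}
MAX_LEN = max(len(key) for key in GAZETTEER)


def lookup(text, min_confidence=0.5):
    """Greedy longest-match lookup; returns (phrase, entry) pairs at or above the threshold."""
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            entry = GAZETTEER.get(key)
            if entry is not None:
                if entry["confidence"] >= min_confidence:
                    found.append((" ".join(key), entry))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found


matches = [phrase for phrase, _ in lookup("the mayor of new york spoke")]
all_hits = [phrase for phrase, _ in lookup("the mayor of new york spoke", min_confidence=0.0)]
```

With the threshold in place only `new york` survives. With the threshold at zero the word `of` is also returned as a place, which is the recall-versus-precision effect described above.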

Problems with Lexical Lookup

• Once the lexicon is large, false positives are rampant
– The Ridge in New Caledonia
– Macho Town in Honduras
– Energy Town in Williamson Cty, Illinois
– Of Town in Turkey
– Harrison Ford in Crawford Cty, Missouri
– Hence spurious hits in “The macho energy of Harrison Ford”

• Even with a large lexicon, plenty of placenames are not there (out of vocabulary)

• Some things are too numerous to list
– A camp 10 miles south of Basra
– A camp 11 miles south of Basra
– A camp 12 miles south of Basra
– A camp 13 miles south of Basra
– A camp 14 miles south of Basra

Solution #1: Pattern Matching

• Extend lexicon to cover regular expressions

• Technology: regular-expression based tokenization (lex), often combined with context-free template matching (yacc)
– Ideal for small lexicon and known fixed syntax: programming languages
– Requires work for medium to large lexicons and constrained syntax: weather reports, standard (template-filling) messages
– Too much work for large lexicons, unconstrained syntax, ambiguities: natural language, mixed format message traffic
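Extending the lexicon with regular expressions lets a single pattern stand for a whole family of phrases that is too numerous to list, such as "N miles south of X". The pattern below is a simplified assumption: it takes any capitalized word as the anchor toponym, whereas a real system would check the anchor against the gazetteer.

```python
import re

# Assumed template: <number> <unit> <direction> of <Capitalized word>.
RELATIVE = re.compile(
    r"(\d+(?:\.\d+)?)\s+(miles|kilometers|km)\s+"
    r"(north|south|east|west)\s+of\s+([A-Z][a-z]+)"
)


def find_relative_references(text):
    """Return (distance, unit, direction, anchor toponym) tuples."""
    return [(float(d), unit, direction, anchor)
            for d, unit, direction, anchor in RELATIVE.findall(text)]


refs = find_relative_references(
    "A camp 13 miles south of Basra and another 40 miles south of Fallujah."
)
```

One pattern covers every distance, so no lexicon entry is needed for each individual "camp N miles south of Basra".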

Pattern Matching Technology

• Regular expression tokenization
– Classic tools like grep, egrep, agrep, lex, flex, ... going back to Thompson (1968). Embedded in the Perl/Python regex capability. The modern approach emphasizes transducers and two-level rule systems (Koskenniemi 1983, Kaplan and Kay 1994).
– For an overview, see Andras Kornai: Extended Finite State Models of Language. Cambridge University Press 1999.

• Context-free template matching
– Classic tools like yacc. The modern approach builds in the lexer: JavaCC. CFGs started with Chomsky (1956), and the classic parsing algorithms (Earley, CKY) could cope with ambiguity. Programming languages need to be unambiguous, which put the emphasis on deterministic parsers. Natural language is nondeterministic.
– For an overview, see Masaru Tomita: Current Issues in Parsing Technology. Kluwer Academic 1991.

Solution #2: Statistical Analysis

• Key observation: Zipf’s Laws

Zipf’s 2nd Law

Leveraging the statistics

• Build background model
– Unigram: bag of words
– Bigram: transitions
– N-gram: phrases

• Use Markov assumption
– Text is like low-level motor action: plan as you go

• Fight sparseness
– Aggregate data
– Extrapolate for missing data
– Exploit sparse structure to speed things up
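A background bigram model with one simple way of extrapolating for missing data can be sketched as follows. Add-one smoothing is used only because it is the shortest technique to show. The talk does not say which smoothing method is meant, and the training text is invented.

```python
from collections import Counter


def train_bigram(tokens):
    """Collect unigram and bigram counts from a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


tokens = "the camp south of the city south of the river".split()
unigrams, bigrams = train_bigram(tokens)
p_seen = bigram_prob("of", "the", unigrams, bigrams)      # transition seen twice
p_unseen = bigram_prob("of", "river", unigrams, bigrams)  # transition never seen
```

The unseen transition still receives a small nonzero probability (1/8 here, against 3/8 for the seen one), so a single missing bigram does not zero out the score of a whole sentence.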

Statistical Technologies

• Hidden Markov Models

• Stochastic CFGs

• Clustering
– Unsupervised
– Supervised

• Artificial Neural Nets

• Graphical models (Bayesian Nets)

A Tale of Two Coins

• Fake: produces heads with probability .9, tails with probability .1

• Real: produces heads with probability .5, tails with probability .5

• Coin swap is hard: the magician will use the same coin as in the preceding trial with probability .99

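How HMMs work

• To define an HMM we need
– A set of model states
– Initial probabilities
– Transition probabilities
– Output probabilities

• In a single run, begin in a state chosen according to the initial probabilities, and emit some signal in accordance with that state's output probabilities. Move to some state according to the transition probabilities, and continue. The probability of a particular sequence of states and signals is the product of the initial, transition, and output probabilities used along the way.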
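The four ingredients can be written down for the two-coin example, together with the probability of one particular run. The output and transition numbers come from the two-coin slide. The 50/50 initial probabilities are an assumption, since the talk gives none, and the dictionary names are chosen only for this sketch.

```python
STATES = ("fake", "real")
INITIAL = {"fake": 0.5, "real": 0.5}  # assumption: not given in the talk
TRANSITION = {
    "fake": {"fake": 0.99, "real": 0.01},
    "real": {"fake": 0.01, "real": 0.99},
}
OUTPUT = {
    "fake": {"H": 0.9, "T": 0.1},
    "real": {"H": 0.5, "T": 0.5},
}


def joint_probability(states, outputs):
    """Probability of following this state path while emitting these outputs."""
    prob = INITIAL[states[0]] * OUTPUT[states[0]][outputs[0]]
    for prev, cur, signal in zip(states, states[1:], outputs[1:]):
        prob *= TRANSITION[prev][cur] * OUTPUT[cur][signal]
    return prob


# Fake coin shows heads twice, then the magician swaps and the real coin shows tails.
p = joint_probability(["fake", "fake", "real"], "HHT")
```

The result is the product 0.5 × 0.9 × (0.99 × 0.9) × (0.01 × 0.5), which is small mainly because of the unlikely coin swap.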

Using the Markov Assumption

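• Brute force computation would require a number of multiplications that grows exponentially with the length of the sequence; because of the Markov assumption this can be reduced to a number that grows only linearly with it

• The trick: instead of a single register accumulating the probabilities through various paths, maintain a separate register for each state.
– Initialization: each register holds the initial probability of its state times the probability of that state emitting the first signal
– Iteration: each new register value is the sum of the old registers weighted by the transition probabilities into its state, times the probability of emitting the next signal (on the order of n multiplications for each of the n registers)
– Termination: the probability of the whole sequence is the sum of the registers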
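The one-register-per-state scheme is the standard forward algorithm, and it can be sketched on the two-coin model. The function and variable names are chosen for this example, and the 50/50 initial probabilities are an assumption not taken from the talk.

```python
STATES = ("fake", "real")
INITIAL = {"fake": 0.5, "real": 0.5}  # assumption: not given in the talk
TRANSITION = {
    "fake": {"fake": 0.99, "real": 0.01},
    "real": {"fake": 0.01, "real": 0.99},
}
OUTPUT = {
    "fake": {"H": 0.9, "T": 0.1},
    "real": {"H": 0.5, "T": 0.5},
}


def forward(outputs, states, initial, transition, output):
    """Total probability of the output sequence, summed over all state paths."""
    # Initialization: one register per state.
    reg = {s: initial[s] * output[s][outputs[0]] for s in states}
    # Iteration: each new register sums over the n predecessor registers.
    for signal in outputs[1:]:
        reg = {
            s: sum(reg[r] * transition[r][s] for r in states) * output[s][signal]
            for s in states
        }
    # Termination: add up the registers.
    return sum(reg.values())


total = forward("HH", STATES, INITIAL, TRANSITION, OUTPUT)
```

For two heads in a row the four possible paths contribute 0.40095 + 0.00225 + 0.00225 + 0.12375 = 0.5292, and the register scheme reaches the same total without enumerating the paths.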

Using the Markov Assumption (continued)

• The Viterbi algorithm. Goal: find the best state sequence given some sequence of signals

• Brute force: maintain the score along each of the possible paths, with some heuristic for pruning when this number gets too large.

• Using the Markov assumption: for each state maintain only the best path that ends there, because the best path that is one step longer will be an extension of one of these.
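Keeping only the best path per state can be sketched the same way on the two-coin model. As before, the names are chosen for this example and the 50/50 initial probabilities are an assumption. Real implementations usually work with log probabilities to avoid underflow on long sequences, which this sketch omits.

```python
STATES = ("fake", "real")
INITIAL = {"fake": 0.5, "real": 0.5}  # assumption: not given in the talk
TRANSITION = {
    "fake": {"fake": 0.99, "real": 0.01},
    "real": {"fake": 0.01, "real": 0.99},
}
OUTPUT = {
    "fake": {"H": 0.9, "T": 0.1},
    "real": {"H": 0.5, "T": 0.5},
}


def viterbi(outputs, states, initial, transition, output):
    """Return (probability, path) of the most probable state path for the outputs."""
    # best[s] = (probability, path) of the best path ending in state s.
    best = {s: (initial[s] * output[s][outputs[0]], [s]) for s in states}
    for signal in outputs[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[r][0] * transition[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * output[s][signal], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


prob_heads, path_heads = viterbi("HHHH", STATES, INITIAL, TRANSITION, OUTPUT)
prob_tails, path_tails = viterbi("TTTT", STATES, INITIAL, TRANSITION, OUTPUT)
```

A run of heads is best explained by the fake coin throughout and a run of tails by the real coin, because a swap costs a factor of .01 that the output probabilities cannot recover over so short a sequence.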

Major Parameters of HMMs

• Increasing the Markovian parameter: instead of chains with memory going back 2, 3, ... steps, build 1-chains made out of compound states (pairs, triples, ...)

– Number of states

– Number of mixtures

– Number of codebooks

– Dimension of feature vectors

– Diagonal vs. full covariances

– Transition structure

Networks of Networks are Networks

1. Word models built from character models

2. n-gram models built on character models

3. Syntax models built from word n-grams

Advantages of HMMs

• Variation in data is confronted

• Trainable

• Delayed decisions

• Captures lots of structure

• Fast and robust implementation

Stochastic CFGs

• Generalization of HMMs to Context Free Grammars

• Trainable

• Delayed decisions -- but not at the right level

• Captures lots of structure

• Satisfies neither side

Chomsky Hierarchy revisited

• Finite State
– Rich toolkits both discrete and stochastic
– HMMs and FSTs differently flavored

• Context Free
– Rich discrete toolkits
– Many stochastic implementations but little success

• Mildly Context Sensitive
– What linguists agree on
– Stochastic work exists but slow: O(n^6)

• Context Sensitive
– Classic use in phonology and morphology (now FS)

• Recursively enumerable

Clustering

• Classical statistical technique (1930s and 1940s)
– Many modern offshoots

• Two main flavors:
– Supervised (you know what you want)
– Unsupervised (promises structure discovery)

• Fielded data: clustering is dominant
– Very widespread use
– Many data mining techniques

• Free text: N-gram clustering
– Gets much harder with larger N
– Radial Basis Networks

• Linear analysis
– principal components (PCA)
– linear discriminants (LDA)

• Singular Value Decomposition: “Latent Semantic Analysis”

Neural Nets

• Genuinely useful Machine Learning technique
– As long as you ignore the biological pretensions

• Strange history
– Invented in fifties (Rosenblatt)
– Knocked in sixties (Minsky and Papert)
– Back with a vengeance since late seventies (PDP group)

• Forces fixed width representation
– Not competitive with HMMs on temporally ordered data

• Can’t justify decisions

Graphical Models (Bayesian Nets)

• Genuinely useful Machine Learning technique
– As long as you ignore the philosophical pretensions

• Long history
– Roots in mathematical logic and probability theory
– Strongly hyped by Microsoft
– Still a follower not a leader

• Forces fixed width representation
– Not competitive with HMMs on temporally ordered data

• Can justify decisions

Questions?

