Confidential © MetaCarta, Inc. 2004
MetaCarta Federal Users Group
Natural Language Processing Technology Overview
Andras Kornai
Chief Scientist
MetaCarta
875 Massachusetts Avenue
Cambridge MA [email protected]
Plan of the talk
• What is NLP?
• What is geography?
• How do we combine them?
Where…
• Where is/are…
– Al Sadar?
– The WMD?
– Osama Bin Ladin?
– The enemy?
MetaCarta Internal Testing Results
# Docs    % Geo
50672     95.69
12245     89.43
17709     90.01
105944    59.1
91766     65.18
97829     43.94
54693     59.17
26447     51.85
20964     69.19
29693     81.43
42480     64.32
34498     64.72
37020     50.47
30558     60.21
19611     58.35
61570     81.54
33129     86.18
47936     64.78
34476     62.10
73771     52.65
59755     64.03
47917     74.94
48532     73.08
40696     71.43
29011     68.09
65856     87.87
77641     83.53
85610     89.41
75934     87.28
65848     86.38
81774     76.99
10320     74.88
81938     93.74
77719     92.01
65276     88.15
67563     87.19
10042     74.66
WWW Data Average
# Docs       % Geo-Relevant
1,914,443    74.05
What is NLP?
• NLP stands for Natural Language Processing
• Sometimes the field is referred to as Computational Linguistics
• The main split
– numeric techniques: pattern matching, light parsing, induction
– symbolic techniques: tokenization, grammar rules, deduction
What is GIS?
• GIS stands for Geographic Information System
• Such systems have
– Georeferenced data
– Map display interface
– Decision support
A Typical Document
PATHFINDER RECORD NUMBER: 17
GENDATE: 20021101
CLASS : UNCLASSIFIED
CLASSIF: UNCLASSIFIED
GENTIME: 200211010835
INFODATE: 20021101
INFOTIME: 1325
DTG: 011325Z NOV 02
FROM: FBIS RESTON VA
TITLE: CAR: Bangui residents cautiously resume routine activities after rebels crushed
WARNING:
TOPIC: DISSENT, DOMESTIC POLITICAL, LEADER, MILITARY
SERIAL: AFP20021101000102
DOCCOUNTRY: CENTRAL AFRICAN REPUBLIC
DOCCOUNTRY: CHAD
DOCCOUNTRY: LIBYA
DOCCOUNTRY: FRANCE
DOCCOUNTRY: UNITED STATES
TITLE: CAR: Bangui residents cautiously resume routine activities after rebels crushed
TEXT: 1. Chad: Government alleges 80 to 120 compatriots ’massacred’ in CAR capital Bangui
SOURCE: Paris AFP (World Service) in English 1153 GMT 01 Nov 02
TEXT:[FBIS Transcribed Text]
BANGUI, Nov 1 (AFP) – Residents of the Central African Republic (CAR) capital cautiously resumed their routine activities Friday, two days after rebels were driven out of Bangui by pro-government forces. The government in neighbouring Chad on Thursday [31 October] night urged CAR President Ange Felix Patasse "to halt the mass arrests" of Chadian nationals living in his country.
In a statement, the Ndjamena government alleged that Patasse’s presidential guards had on Thursday massacred between 80 and 120 Chadians amid allegations that Chad had been involved in the six-day uprising.
The rebellion was launched last Friday [25 October] by supporters of renegade former army chief General Francois Bozize, who was dismissed by Patasse last year and has gone into exile in France after fleeing to Chad.
Chad’s Communications Minister Moctar Wawa Dahab told AFP that "between 80 and 120 Chadians were massacred Thursday at around 5:00 pm (1600 GMT)" in the PK12 district, about 12 kilometers (eight miles) north of central Bangui.
[Description of Source: Paris AFP (World Service) in English]
THIS REPORT MAY CONTAIN COPYRIGHTED MATERIAL. COPYING AND DISSEMINATION IS PROHIBITED WITHOUT PERMISSION OF THE COPYRIGHT OWNERS.
(endall)
BT
<<PATHFINDER ANNOTATIONS>>
PERSON: PRESIDENT ANGE FELIX PATASSE
PERSON: GENERAL FRANCOIS BOZIZE
MILUNIT: LIBYAN FORCES
COUNTRY: UNITED STATES
COUNTRY: CHAD
COUNTRY: LIBYA
Document Structure
• Typically, documents are not just running text; they have structure
• In this domain, the structure is simple: header – body – footer
• In other domains the structure can be much more complex
– For example, US patents have the structure: barcode – “United States Patent” – number – author – date – full title – inventors – assignee – notice – application number – filing date – international classification – US classification – field of search – references cited – primary examiner – attorney – abstract – claims/drawings count – figures – title – background – summary – description of drawings – detailed description of invention – claims
• Many of these fields are metadata (data about the data), others are “the” data
Metadata
• Much of the critical metadata is external to the document:
– Access path, revision history, importance assigned by analyst, ...
• Critical pieces of metadata are not explicit in the document but need to be computed.
– Computing the size of a document is easy
– Computing the stance is hard.
• The data/metadata division is task driven:
– One person’s metadata trash is the other person’s data treasure
Data Styles
                 free                  fielded
Repository       numeric techniques    database
Information      textual               numerical
Use case         parse                 compute
Metadata         inferred              documented
Typical field    Message traffic       MASINT
Adding Documents to the GIS Flow
• Parse document into relevant segments (symbolic techniques are strong)
• Apply segment-specific rules
– Date field: do not extract 1 May (a town in Kazakhstan)
– Author field: do not extract Rachel Creek (Little Rock, AK)
– Free text field: symbolic techniques are weak
• Build georeference index
• Visualization of georeferenced textual data
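The segment-specific rules can be pictured as a per-field switch placed in front of the extractor. The following is a minimal sketch, not MetaCarta's implementation: the tiny `GAZETTEER` set, the field names in `SKIP_FIELDS`, and the `extract_toponyms` helper are all assumptions made for illustration.

```python
# Hypothetical mini-gazetteer; a real one holds millions of entries.
GAZETTEER = {"1 May", "Rachel Creek", "Bangui", "Ndjamena"}

# Assumed names of fields that are never searched for toponyms.
SKIP_FIELDS = {"DATE", "AUTHOR"}


def extract_toponyms(field, text):
    """Return gazetteer entries found in text, honoring per-field rules."""
    if field in SKIP_FIELDS:
        return []  # "1 May" in a date field is a date, not a town
    # Naive substring matching; good enough for a sketch.
    return sorted(name for name in GAZETTEER if name in text)


print(extract_toponyms("DATE", "1 May 2004"))
print(extract_toponyms("AUTHOR", "Rachel Creek"))
print(extract_toponyms("TEXT", "Rebels were driven out of Bangui."))
```

The first two calls return nothing because the field rule fires before any lookup happens; only the free-text call reaches the gazetteer.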
Load-time Document Flow
Doc repository or feed
ingestion
document-size files
format-based segmentation
sectioned data
section-specific extraction
potential georeferences
disambiguation
georeferenced text segments
indexer
database entries
Ingestion
• Sources
– Database
– Shared drives
– NFS mounts
– Document mgmt system
– Crawl
– Live feed
• Authentication
• Format conversion
– Plain ascii
– MS Word
– PDF
– HTML
• Mixing push and pull models
• Metadata indexes
Format-based segmentation
Format-based Segmentation – Header
PATHFINDER RECORD NUMBER: 17
GENDATE: 20021101
CLASS : UNCLASSIFIED
CLASSIF: UNCLASSIFIED
GENTIME: 200211010835
INFODATE: 20021101
INFOTIME: 1325
DTG: 011325Z NOV 02
FROM: FBIS RESTON VA
TITLE: CAR: Bangui residents cautiously resume routine activities after rebels crushed
WARNING:
TOPIC: DISSENT, DOMESTIC POLITICAL, LEADER, MILITARY
SERIAL: AFP20021101000102
DOCCOUNTRY: CENTRAL AFRICAN REPUBLIC
DOCCOUNTRY: CHAD
DOCCOUNTRY: LIBYA
DOCCOUNTRY: FRANCE
DOCCOUNTRY: UNITED STATES
TITLE: CAR: Bangui residents cautiously resume routine activities after rebels crushed
TEXT: 1. Chad: Government alleges 80 to 120 compatriots ’massacred’ in CAR capital Bangui
SOURCE: Paris AFP (World Service) in English 1153 GMT 01 Nov 02
TEXT:[FBIS Transcribed Text]
BANGUI, Nov 1 (AFP) – Residents of the Central African Republic (CAR) capital cautiously resumed their routine activities Friday, two days after rebels were driven out of Bangui by pro-government forces. The government in neighbouring Chad on Thursday [31 October] night urged CAR President Ange Felix Patasse "to halt the mass arrests" of Chadian nationals living in his country.
In a statement, the Ndjamena government alleged that Patasse’s presidential guards had on Thursday massacred between 80 and 120 Chadians amid allegations that Chad had been involved in the six-day uprising.
The rebellion was launched last Friday [25 October] by supporters of renegade former army chief General Francois Bozize, who was dismissed by Patasse last year and has gone into exile in France after fleeing to Chad.
Chad’s Communications Minister Moctar Wawa Dahab told AFP that "between 80 and 120 Chadians were massacred Thursday at around 5:00 pm (1600 GMT)" in the PK12 district, about 12 kilometers (eight miles) north of central Bangui.
[Description of Source: Paris AFP (World Service) in English]
THIS REPORT MAY CONTAIN COPYRIGHTED MATERIAL. COPYING AND DISSEMINATION IS PROHIBITED WITHOUT PERMISSION OF THE COPYRIGHT OWNERS.
(endall)
BT
<<PATHFINDER ANNOTATIONS>>
PERSON: PRESIDENT ANGE FELIX PATASSE
PERSON: GENERAL FRANCOIS BOZIZE
MILUNIT: LIBYAN FORCES
COUNTRY: UNITED STATES
COUNTRY: CHAD
COUNTRY: LIBYA
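For a format this regular, header – body – footer segmentation can be driven by a couple of marker lines. The sketch below assumes, purely for illustration, that the header ends at the line containing `[FBIS Transcribed Text]` and that the footer starts at `(endall)`; the real segmenter's rules are not given in the talk.

```python
def segment(lines):
    """Split a message into header, body, and footer line lists."""
    header, body, footer = [], [], []
    part = header
    for line in lines:
        if part is body and line.strip() == "(endall)":
            part = footer  # assumed start-of-footer marker
        part.append(line)
        if part is header and "[FBIS Transcribed Text]" in line:
            part = body  # assumed end-of-header marker
    return header, body, footer


sample = [
    "GENDATE: 20021101",
    "TEXT:[FBIS Transcribed Text]",
    "BANGUI, Nov 1 (AFP)",
    "(endall)",
    "COUNTRY: CHAD",
]
header, body, footer = segment(sample)
print(len(header), len(body), len(footer))
```

Once the three parts are separated, fielded rules can be applied to the header and footer while the free-text extractor runs only on the body.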
Section-specific extraction of georeferences
• Toponyms
– Alaska, Baltimore, Cambridge, DC, ... (points and regions)
• Explicit coordinates
– 48.3N 33.5W
• Relative references
– 40 miles south of Fallujah
• Descriptive phrases
– three miles upriver
• Street addresses
– 1489 Jefferson Davis Highway
• Phone numbers
– 01155663224123 – Feliz Natal, Brazil
• IP addresses
– 171.64.22.122 – Palo Alto, CA
• Company names
– Y-Not Variety – Somerville, MA
• Etc…
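Of these georeference types, explicit coordinates are the easiest to pick up mechanically. A minimal sketch, assuming only the decimal-degree notation shown above (`48.3N 33.5W`); the `COORD` pattern and the `find_coordinates` helper are illustrative names, and degree-minute-second or grid notations would need further patterns.

```python
import re

# Latitude with N/S, whitespace, longitude with E/W, in decimal degrees.
COORD = re.compile(
    r"\b(\d{1,2}(?:\.\d+)?)([NS])\s+(\d{1,3}(?:\.\d+)?)([EW])\b"
)


def find_coordinates(text):
    """Return (latitude, longitude) pairs in signed decimal degrees."""
    results = []
    for lat, ns, lon, ew in COORD.findall(text):
        lat_deg = float(lat) * (1 if ns == "N" else -1)
        lon_deg = float(lon) * (1 if ew == "E" else -1)
        # Discard values that cannot be valid coordinates.
        if abs(lat_deg) <= 90 and abs(lon_deg) <= 180:
            results.append((lat_deg, lon_deg))
    return results


print(find_coordinates("Last seen at 48.3N 33.5W heading east."))
```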
Disambiguation
• Multiple meaning, same location:
– New York (city or state)
– Haora (city, lake, district)
• Same meaning, multiple location:
– Paris TX, Paris France
• Vaguely defined
– Appalachia
– Labrador
• Historically shifting
– Poland
• Contested
– Kashmir
Indexing
• Input:
– Document identifiers
– Index key
• Output:
– Key lookup acceleration structure
• Standard indexes
– Keyword
– Key phrase
– Metadata
– Range queries
• CartaTrees
– Specifically for geographic information
Query-time Flow
user query
query parser
formal query
index
document list
map-based filtering
geolocated doclist
relevance ranking
WSDL stream
GUI
map, text, mouseover, ...
Query Parsing
• Ideal:
– Computer, what is enemy strength in Gamma Quadrant?
• Current state of the art:
– enemy OR hostile NOT Klingon AND forces OR deployment AND “Gamma Quadrant”
• Rock bottom reality:
– SELECT FROM KEYWORD WHERE …
• Supporting loose but clever queries is already very hard
– Arbitrary (nested) Booleans
– Near search
– Metadata index restricted search
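To give a feel for what arbitrary nested Booleans involve, here is a small recursive-descent evaluator. It is a sketch under stated assumptions, not the query parser described in the talk: operators are the uppercase words `AND`, `OR`, `NOT`, precedence is NOT over AND over OR, straight double quotes delimit phrases, and a document is modeled as a set of lowercase terms.

```python
import re

# Quoted phrase, parenthesis, or bare word.
TOKEN = re.compile(r'"[^"]*"|\(|\)|[^\s()"]+')


def matches(query, doc_terms):
    """Evaluate a Boolean query against the set of terms in one document."""
    tokens = TOKEN.findall(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()  # always parse, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = advance()
        if token == "(":
            value = parse_or()
            advance()  # consume the closing ")"
            return value
        return token.strip('"').lower() in doc_terms

    return parse_or()


doc = {"enemy", "forces", "gamma quadrant"}
print(matches('(enemy OR hostile) AND NOT Klingon AND "Gamma Quadrant"', doc))
```

Malformed queries (for example an unbalanced parenthesis) simply raise an error here; a production parser would need error recovery, near search, and metadata restrictions on top of this.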
Parsing Support
• Ideal: parse every doc in advance, parse query
• Required: high speed high quality sentential parser!
• Today: High speed high quality word-level parsers
– Typically rule-based/hand-crafted
– Machine learning systems emulate rule-based
• Within striking distance: phrase-level parsers
– Named Entity Extraction
– Role assignment (template filling)
– Modals
– Coreference
• Science Fiction:
– Conditionals
– Counterfactuals
– Above sentence level
Indexing
• Cheap: metadata
• Medium: keyword
• Expensive: key phrase
• Very expensive: positional
• Prohibitive: context dependent
• Futuristic: dynamically weighted
• Key issue: minimizing disk seeks
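The cost difference between keyword and positional indexes is easier to see in code: a keyword index stores one document list per word, while a positional index also stores every offset, which is what makes phrase queries possible. A minimal in-memory sketch follows; the function names are illustrative, and it ignores the disk layout that dominates real cost.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return ids of documents containing the phrase words consecutively."""
    words = phrase.lower().split()
    if not words or any(word not in index for word in words):
        return set()
    hits = set()
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            if all(start + k in index[w].get(doc_id, ())
                   for k, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "rebels driven out of Bangui",
    2: "Bangui residents out of town",
}
index = build_positional_index(docs)
print(sorted(index["bangui"]))               # keyword lookup
print(phrase_search(index, "out of Bangui"))  # needs positions
```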
Map-based filtering
• Wanted: Bordeaux
• Workaround: Medoc OR Margaux OR Saint-Julien OR Saint-Estephe OR Moulis OR Pauillac OR Graves OR Pomerol OR … (over 250 significant chateaux, about 5k significant terms)
• The user-friendly solution: use map as filter
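Using the map as a filter amounts to replacing the long OR list with a geometric test on the georeferences already attached to each document. A minimal sketch with a rectangular map window; the document ids, coordinates, and box are made-up illustrative values, and the box is assumed not to cross the 180° meridian.

```python
def in_box(point, box):
    """True if (lat, lon) lies inside (south, west, north, east)."""
    lat, lon = point
    south, west, north, east = box
    return south <= lat <= north and west <= lon <= east


def filter_by_map(doc_locations, box):
    """Keep documents with at least one georeference inside the box."""
    return sorted(doc_id for doc_id, points in doc_locations.items()
                  if any(in_box(p, box) for p in points))


# Hypothetical georeferenced documents: id -> list of (lat, lon).
doc_locations = {
    "doc-a": [(45.2, -0.75)],
    "doc-b": [(48.85, 2.35)],
    "doc-c": [(44.84, -0.58), (51.5, -0.12)],
}
window = (44.0, -1.5, 45.6, 0.5)  # illustrative map window
print(filter_by_map(doc_locations, window))
```

The user never has to name the places inside the window; any document georeferenced there passes the filter.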
Relevance Ranking
• Wanted: best results on top
• Factors used:
– Number of keywords matched
– Relevance of keywords
– Confidence in match
– Emphasis of keywords
– Quality/authority of doc
– External pointers to doc
– Timeliness of doc
– Usage pattern on query side (user adaptation)
– Usage pattern on response side (collaborative filtering)
GUI
• The Wall Street Journal's resistance to photographs is legendary. "I've always thought that one word was worth a thousand pictures," retired executive editor Fred Taylor told a reporter on the occasion of the Journal's 100th anniversary. (From the Smithsonian hed exhibit intro)
• Consumer market: eye candy loses over spartan interface
• GUI is justified only if
– Works and plays well with text
– Is truly graphical (avoid fontitis)
– Comes with its own symbolic structure
– Symbol code is already understood by user
Extraction Techniques
• Key issue: precision-recall tradeoff
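• Precision = 1 - false positive rate (the share of extracted items that are wrong), i.e. TP / (TP + FP)
• Recall = 1 - false negative rate (the share of correct items that are missed), i.e. TP / (TP + FN)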
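A small worked example makes the tradeoff concrete. The sketch below scores a set of extracted items against a reference set; both sets are invented for illustration, loosely based on names from the sample document.

```python
def precision_recall(extracted, relevant):
    """Compute precision and recall from two sets of items."""
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical extractor output versus hand-marked toponyms.
extracted = {"Bangui", "Chad", "Friday", "France"}
relevant = {"Bangui", "Chad", "France", "Ndjamena", "PK12"}
print(precision_recall(extracted, relevant))
```

Here three of the four extracted items are correct (precision 0.75) and three of the five true toponyms were found (recall 0.6). Extracting more aggressively would tend to raise recall at the expense of precision.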
Peeking Ahead
<GeoText>KRASNOYARSK</GeoText>. Nov 1
(<OrgText>Interfax-AVN</OrgText>) - Five 85-100mm artillery shells have been found near <GeoText>the lodge of the Stalmost plant in the city of Ulan-Ude in the Buryatian autonomous republic</GeoText>.
"According to specialists who examined the explosive find, a combat shell was among the devices. Four others do not pose any danger," a spokesman for the <OrgText>Siberian regional center of the Emergencies Ministry</OrgText> told <OrgText>Interfax-Military News Agency</OrgText> on Friday.
The site is sealed off, workers are evacuated to safe places and there are no residential blocks around. The only danger is posed by <GeoText>the plant’s oxygen shop</GeoText> that is located 50 meters from <GeoText>the lodge</GeoText>.
Sappers of the <OrgText>Defense Ministry</OrgText>, operations groups of the <OrgText>Buryatian Emergencies Ministry</OrgText>, officials of the <OrgText>Federal Security Service and Interior Ministry</OrgText> are working at the site.
Lexical Lookup
• The single most effective technique in Named Entity Recognition is lookup
– Try to find every word or phrase stored in the lexicon, and use the information stored with the lexical entry
• In geographic text search, the lexicon is called the gazetteer
– For each toponym it typically contains:
• Latitude, longitude, polygon, elevation, population, feature type, (admin) region, ...
• To these, MetaCarta adds a number called confidence
– Measures our strength of belief that the word (or phrase) is used in a geographic sense
• The better the gazetteer, the higher the recall, and the lower the precision
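Lookup with a confidence value can be sketched as a longest-match scan over the text, keeping only entries whose confidence clears a threshold. Everything concrete below is an assumption for illustration: the three gazetteer entries, their rough coordinates, the invented confidence numbers, and the `lookup` helper itself.

```python
# Hypothetical gazetteer: name -> (latitude, longitude, confidence).
# Coordinates are rough and confidences are invented for illustration.
GAZETTEER = {
    "bangui": (4.4, 18.6, 0.95),
    "central african republic": (7.0, 21.0, 0.99),
    "of": (40.9, 40.3, 0.01),  # "Of" is also a town in Turkey
}
MAX_WORDS = max(len(name.split()) for name in GAZETTEER)


def lookup(text, threshold=0.5):
    """Longest-match gazetteer lookup, keeping entries above threshold."""
    words = [w.strip('.,;:"()') for w in text.lower().split()]
    hits, i = [], 0
    while i < len(words):
        step = 1
        # Try the longest candidate phrase first.
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in GAZETTEER:
                if GAZETTEER[phrase][2] >= threshold:
                    hits.append(phrase)
                step = n
                break
        i += step
    return hits


sentence = "The capital of the Central African Republic is Bangui."
print(lookup(sentence))
print(lookup(sentence, threshold=0.0))
```

With the default threshold the low-confidence entry "of" is suppressed; with the threshold at zero it comes back as a false positive, which is the recall versus precision effect of a larger gazetteer.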
Problems with Lexical Lookup
• Once the lexicon is large, false positives are rampant
– The Ridge in New Caledonia
– Macho Town in Honduras
– Energy Town in Williamson Cty, Illinois
– Of Town in Turkey
– Harrison Ford in Crawford Cty, Missouri
• Even with a large lexicon, plenty of placenames are not there (out of vocabulary)
• Some things are too numerous to list
– A camp 10 miles south of Basra
– A camp 11 miles south of Basra
– A camp 12 miles south of Basra
– A camp 13 miles south of Basra
– A camp 14 miles south of Basra
The macho energy of Harrison Ford
Solution #1: Pattern Matching
• Extend lexicon to cover regular expressions
• Technology: regular-expression based tokenization (lex), often combined with context-free template matching (yacc)
– Ideal for small lexicon and known fixed syntax (programming languages)
– Requires work for medium to large lexicons and constrained syntax (weather reports, standard template-filling messages)
– Too much work for large lexicons, unconstrained syntax, ambiguities (natural language, mixed format message traffic)
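Extending the lexicon with a regular expression lets one pattern stand in for the unbounded family "N miles south of X". The sketch below uses Python's `re` module rather than lex/yacc; the `ANCHORS` list, the unit and direction vocabularies, and the `find_relative` helper are assumptions made for illustration.

```python
import re

# Hypothetical anchor lexicon; a real system would consult the gazetteer.
ANCHORS = ["Basra", "Fallujah", "Bangui"]

RELATIVE = re.compile(
    r"(\d+(?:\.\d+)?)\s+(miles|kilometers|km)\s+"
    r"(north|south|east|west)\s+of\s+(?:central\s+)?("
    + "|".join(map(re.escape, ANCHORS)) + r")\b"
)


def find_relative(text):
    """Return (distance, unit, direction, anchor) tuples found in text."""
    return [(float(d), unit, direction, place)
            for d, unit, direction, place in RELATIVE.findall(text)]


print(find_relative("A camp 13 miles south of Basra"))
print(find_relative("about 12 kilometers (eight miles) north of central Bangui"))
```

The first call succeeds for any number, so the camps no longer need to be listed one by one. The second sentence, taken from the sample report, is not matched because of the parenthetical, which shows why even constrained text "requires work".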
Pattern Matching Technology
• Regular expression tokenization
– Classic tools like grep, egrep, agrep, lex, flex, ... going back to Thompson (1968). Embedded in the perl/python regex capability. Modern approach emphasizes transducers, two-level rule systems (Koskenniemi 1983, Kaplan and Kay 1994).
• For an overview, see Andras Kornai: Extended Finite State Models of Language. Cambridge University Press 1999.
• Context-free template matching
– Classic tools like yacc. Modern approach builds in the lexer: JavaCC. CFGs started with Chomsky (1956), and the classic parsing algorithms (Earley, CKY) could cope with ambiguity. Programming languages need to be unambiguous, which put the emphasis on deterministic parsers. Natural language is nondeterministic.
• For an overview, see Masaru Tomita: Current Issues in Parsing Technology. Kluwer Academic 1991.
Solution #2: Statistical Analysis
• Key observation: Zipf’s Laws
Leveraging the statistics
• Build background model
– Unigram: bag of words
– Bigram: transitions
– N-gram: phrases
• Use Markov assumption
– Text is like low-level motor action: plan as you go
• Fight sparseness
– Aggregate data
– Extrapolate for missing data
– Exploit sparse structure to speed things up
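A bigram background model with the simplest form of extrapolation for missing data can be written in a few lines. This is a sketch only: add-one (Laplace) smoothing is chosen here because it is the easiest to show, not because the talk names it, and the training text is invented.

```python
from collections import Counter


def train_bigram(tokens):
    """Return P(word | prev) estimated with add-one smoothing."""
    contexts = Counter(tokens[:-1])            # how often each word is followed
    bigrams = Counter(zip(tokens, tokens[1:]))  # observed transitions
    vocab_size = len(set(tokens))

    def prob(prev, word):
        # Unseen transitions get a small nonzero probability.
        return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)

    return prob


tokens = "the rebels left the capital and the rebels fled".split()
prob = train_bigram(tokens)
print(prob("the", "rebels"))  # seen twice after "the"
print(prob("the", "fled"))    # never seen after "the"
```

In this toy corpus "the" is followed three times and the vocabulary has six words, so the seen transition gets (2 + 1) / (3 + 6) and the unseen one gets 1 / (3 + 6) instead of zero.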
Statistical Technologies
• Hidden Markov Models
• Stochastic CFGs
• Clustering
– Unsupervised
– Supervised
• Artificial Neural Nets
• Graph models (Bayesian Nets)
A Tale of Two Coins
• Fake: produces heads with probability p = .9, tails with p = .1
• Real: produces heads with probability p = .5, tails with p = .5
• Coin swap is hard: the magician will use the same coin as in the preceding trial with probability p = .99
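How HMMs work
• To define an HMM we need
– A set of model states $s_1, \dots, s_n$
– Initial probabilities $\pi_i$
– Transition probabilities $a_{ij}$
– Output probabilities $b_i(o)$
• In a single run begin in state $s_i$ according to the $\pi_i$, and emit some signal $o$ in accordance with $b_i(o)$. Move to some state $s_j$ with probability $a_{ij}$, continue. The probability of a particular sequence of states $q_1, \dots, q_T$ emitting the signals $o_1, \dots, o_T$ is $\pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)$
Using the Markov Assumption
• Brute force computation would require on the order of $T n^T$ multiplications; because of the Markov assumption this can be reduced to on the order of $n^2 T$
• The trick: instead of a single register accumulating the probabilities through various paths, maintain a separate register $\alpha_t(i)$ for each state.
– Initialization: $\alpha_1(i) = \pi_i b_i(o_1)$
– Iteration: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{n} \alpha_t(i) a_{ij} \right] b_j(o_{t+1})$ (about $n$ multiplications for each of the $n$ registers)
– Termination: $P(o_1, \dots, o_T) = \sum_{i=1}^{n} \alpha_T(i)$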
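The one-register-per-state computation can be checked against brute force on the two-coin example. The sketch below takes the output and swap probabilities from the coin slide; the uniform initial probabilities are an assumption, since the talk does not give them, and the function names are illustrative.

```python
from itertools import product

STATES = ("fake", "real")
INITIAL = {"fake": 0.5, "real": 0.5}  # assumption: not given in the talk
TRANSITION = {
    "fake": {"fake": 0.99, "real": 0.01},
    "real": {"fake": 0.01, "real": 0.99},
}
OUTPUT = {
    "fake": {"H": 0.9, "T": 0.1},
    "real": {"H": 0.5, "T": 0.5},
}


def forward(observations):
    """Probability of the observations, one register per state."""
    alpha = {s: INITIAL[s] * OUTPUT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            j: sum(alpha[i] * TRANSITION[i][j] for i in STATES) * OUTPUT[j][obs]
            for j in STATES
        }
    return sum(alpha.values())


def brute_force(observations):
    """Same probability, summing over every possible state path."""
    total = 0.0
    for path in product(STATES, repeat=len(observations)):
        p = INITIAL[path[0]] * OUTPUT[path[0]][observations[0]]
        for prev, cur, obs in zip(path, path[1:], observations[1:]):
            p *= TRANSITION[prev][cur] * OUTPUT[cur][obs]
        total += p
    return total


print(forward("HHTH"), brute_force("HHTH"))
```

Both functions return the same value up to rounding, but the brute-force loop visits every one of the exponentially many paths while the forward pass does a fixed amount of work per observation.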
Using the Markov Assumption (continued)
• The Viterbi algorithm. Goal: find the best state sequence given some observation sequence
• Brute force: maintain the score along each of the $n^T$ paths, with some heuristic for pruning when this number gets too large.
• Using the Markov assumption: for each state maintain only the best path that ends there, because the best path that is one step longer will be an extension of one of these.
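Keeping only the best path into each state takes a few lines on the two-coin model. As before this is a sketch: the output and swap probabilities come from the coin slide, the uniform initial probabilities are an assumption, and plain probabilities are used instead of the log scores a real decoder would need for long sequences.

```python
STATES = ("fake", "real")
INITIAL = {"fake": 0.5, "real": 0.5}  # assumption: not given in the talk
TRANSITION = {
    "fake": {"fake": 0.99, "real": 0.01},
    "real": {"fake": 0.01, "real": 0.99},
}
OUTPUT = {
    "fake": {"H": 0.9, "T": 0.1},
    "real": {"H": 0.5, "T": 0.5},
}


def viterbi(observations):
    """Return (probability, path) of the most probable state sequence."""
    # best[s] holds the score and path of the best sequence ending in s.
    best = {s: (INITIAL[s] * OUTPUT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        new_best = {}
        for j in STATES:
            score, path = max(
                ((best[i][0] * TRANSITION[i][j], best[i][1]) for i in STATES),
                key=lambda item: item[0],
            )
            new_best[j] = (score * OUTPUT[j][obs], path + [j])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


print(viterbi("HHHHHHHH")[1])
print(viterbi("TTTTTT")[1])
```

A long run of heads is best explained by the fake coin throughout, while a run of tails is best explained by the real coin, because a swap costs a factor of .01.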
Major Parameters of HMMs
• Increasing the Markovian parameter: instead of chains with memory going back 2, 3, ... steps, build 1-chains made out of compound states (pairs, triples, ...)
– Number of states
– Number of mixtures
– Number of codebooks
– Dimension of feature vectors
– Diagonal vs. full covariances
– Transition structure
Networks of Networks are Networks
1. Word models built from character models
2. n-gram models built on character models
3. Syntax models built from word n-grams
Advantages of HMMs
• Variation in data is confronted
• Trainable
• Delayed decisions
• Captures lots of structure
• Fast and robust implementation
Stochastic CFGs
• Generalization of HMMs to Context Free Grammars
• Trainable
• Delayed decisions -- but not at the right level
• Captures lots of structure
• Satisfies neither side
Chomsky Hierarchy revisited
• Finite State
– Rich toolkits both discrete and stochastic
– HMMs and FSTs differently flavored
• Context Free
– Rich discrete toolkits
– Many stochastic implementations but little success
• Mildly Context Sensitive
– What linguists agree on
– Stochastic work exists but slow, O(n^6)
• Context Sensitive
– Classic use in phonology and morphology (now FS)
• Recursively enumerable
Clustering
• Classical statistical technique (1930s and 1940s)
– Many modern offshoots
• Two main flavors:
– Supervised (you know what you want)
– Unsupervised (promises structure discovery)
• Fielded data: clustering is dominant
– Very widespread use
– Many data mining techniques
• Free text: N-gram clustering
– Gets much harder with larger N
– Radial Basis Networks
• Linear analysis
– principal components (PCA)
– linear discriminants (LDA)
• Singular Value Decomposition “Latent Semantic Analysis”
Neural Nets
• Genuinely useful Machine Learning technique
– As long as you ignore the biological pretensions
• Strange history
– Invented in fifties (Rosenblatt)
– Knocked in sixties (Minsky and Papert)
– Back with a vengeance since late seventies (PDP group)
• Forces fixed width representation
– Not competitive with HMMs on temporally ordered data
• Can’t justify decisions
Graphical Models (Bayesian Nets)
• Genuinely useful Machine Learning technique
– As long as you ignore the philosophical pretensions
• Long history
– Roots in mathematical logic and probability theory
– Strongly hyped by Microsoft
– Still a follower not a leader
• Forces fixed width representation
– Not competitive with HMMs on temporally ordered data
• Can justify decisions