Geographic Text Search. Corporate Proprietary, Copyright 1999-2003, MetaCarta, Inc.
On building a high-performance gazetteer database
Amittai Axelrod, MetaCarta Inc
Thanks to
Keith Baker
Kenneth Baker
Michael Bukatin
András Kornai
Plan of the talk
• Database background
• Relating geographic names and features
• Handling ambiguities and inconsistencies in geographic names
• Classification and storage system for geographic features
Databases
• No DB (faking it with flat files) -- clumsy
• Record-oriented -- still runs the world
• Relational -- making headway
• Object-oriented -- still very academic
• For MetaCarta GazDB, relational approach made most sense:
  • Overlapping records (McKinley/Denali)
  • Need for frequent updates of subparts of records
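A minimal sketch of why a relational layout suits overlapping records, assuming a hypothetical three-table design (`feature`, `name`, `feature_name`). The table and column names are illustrative and are not the actual GazDB schema; the coordinates are approximate.

```python
import sqlite3

# In-memory database; the schema below is hypothetical, not the real GazDB.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (id INTEGER PRIMARY KEY, lat REAL, lon REAL);
CREATE TABLE name (id INTEGER PRIMARY KEY, text TEXT UNIQUE);
CREATE TABLE feature_name (
    feature_id INTEGER REFERENCES feature(id),
    name_id INTEGER REFERENCES name(id),
    PRIMARY KEY (feature_id, name_id)
);
""")

# One feature, two names: the McKinley/Denali overlap becomes two link rows.
conn.execute("INSERT INTO feature VALUES (1, 63.07, -151.01)")
conn.executemany("INSERT INTO name VALUES (?, ?)",
                 [(1, "Mount McKinley"), (2, "Denali")])
conn.executemany("INSERT INTO feature_name VALUES (?, ?)", [(1, 1), (1, 2)])

def names_for(feature_id):
    """Return all names linked to a feature, sorted alphabetically."""
    rows = conn.execute(
        "SELECT n.text FROM name n "
        "JOIN feature_name fn ON fn.name_id = n.id "
        "WHERE fn.feature_id = ? ORDER BY n.text",
        (feature_id,),
    )
    return [row[0] for row in rows]

# Updating a subpart of the record (the latitude) leaves the names untouched.
conn.execute("UPDATE feature SET lat = ? WHERE id = ?", (63.0695, 1))
```

Here `names_for(1)` returns both names for the single feature, and the coordinate update touches only one row of one table.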
Gazetteer production process
Conversion scripts
• Enforce uniform structure on the data
• Normalize across sources (e.g. lat/lon to decimal degrees, spelling, …)
• Configuration required once per source
• Load data into GazDB
• Combination of Perl and SQL
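The lat/lon normalization step could look roughly like the sketch below. It is written in Python for illustration rather than the Perl/SQL the slides mention, and the function name `dms_to_decimal` and the accepted input layout (degrees, minutes, seconds, then a hemisphere letter) are assumptions, not the actual conversion scripts.

```python
import re

def dms_to_decimal(text):
    """Convert a 'degrees minutes seconds hemisphere' string to decimal degrees.

    Accepts inputs such as '63 04 10 N' or '63°04\'10"N' (assumed layouts).
    """
    parts = re.findall(r"\d+(?:\.\d+)?", text)
    hemisphere = text.strip()[-1:].upper()
    if len(parts) != 3 or hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees, minutes, seconds = (float(p) for p in parts)
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal-degree notation.
    return -value if hemisphere in {"S", "W"} else value
```

A per-source configuration would then only need to say which columns hold coordinates and which input layout they use.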
Relating features and names
Other tables used in GazDB
• Population
• Elevation
• Language
• Feature type
• Source/versioning info
• Temporal extent
• Hierarchical information
• Confidence
• Comments
• Change logs (full auditing)
Geographic names
• Internationalization
  • Full Unicode (UTF-8) support
  • Maintain detailed language information (SIL)
• Name resolution
  • Canonical form (16 bits)
  • Display form (8 bits)
  • Search form (6 bits)
• Authoritativeness
• Explicitness
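One way to read the three name forms is as progressively coarser encodings of the same name. The sketch below assumes that the canonical form is full Unicode, the display form is restricted to Latin-1 (8 bits), and the search form is folded into a small alphabet that fits in 6 bits. That reading, the function names, and the alphabet itself are assumptions made for illustration; the slides do not spell out the actual mappings.

```python
import unicodedata

# Hypothetical 6-bit search alphabet: 39 symbols, under the 64 that 6 bits allow.
SEARCH_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 -'"

def strip_marks(text):
    """Decompose characters and drop combining marks (accents, carons, ...)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def display_form(canonical):
    """Keep characters that fit in Latin-1; strip marks from the others.

    Characters with no decomposition pass through unchanged in this sketch.
    """
    out = []
    for ch in unicodedata.normalize("NFC", canonical):
        out.append(ch if ord(ch) < 256 else strip_marks(ch))
    return "".join(out)

def search_form(canonical):
    """Strip marks, uppercase, and keep only symbols in the search alphabet."""
    folded = strip_marks(canonical).upper()
    return "".join(c for c in folded if c in SEARCH_ALPHABET)
```

Under these assumptions a query typed without diacritics can still match, because both the query and the stored name reduce to the same search form.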
Updating a name in the GazDB
Geographic features
• Spatial representations
  • Point, line, area, …
• Functional classes
  • Building, field, campus, city, …
• Administrative types
  • Nation, province, county, international org, …
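The three classification axes above can be treated as independent attributes of a feature. The following is a small sketch under that assumption; the class and field names are hypothetical and do not come from GazDB.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SpatialRepresentation(Enum):
    POINT = "point"
    LINE = "line"
    AREA = "area"

@dataclass(frozen=True)
class FeatureType:
    """One value per axis; the administrative type is optional."""
    spatial: SpatialRepresentation
    functional_class: str                      # e.g. "building", "campus", "city"
    administrative_type: Optional[str] = None  # e.g. "nation", "county"

# A campus has a functional class but no administrative role.
campus = FeatureType(SpatialRepresentation.AREA, "campus")
# A city that is also administered as a county carries both.
city_county = FeatureType(SpatialRepresentation.AREA, "city", "county")
```

Keeping the axes separate avoids having to enumerate every combination as its own feature type.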
Export scripts
• Read GazDB
• Select which fields to include in custom output
• Create .gbdm (MetaCarta format) binaries
• Combination of Perl and SQL
• Not yet general across binary output formats
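Field selection followed by binary output can be sketched as below. The layout of the .gbdm format is not described in the slides, so this example packs the selected numeric fields as little-endian doubles in a made-up fixed-width record; `pack_records` and the field names are hypothetical.

```python
import struct

def pack_records(rows, fields):
    """Pack the chosen numeric fields of each row as little-endian doubles.

    This is an illustrative fixed-width layout, not the .gbdm format.
    """
    record = struct.Struct("<" + "d" * len(fields))
    return b"".join(
        record.pack(*(float(row[name]) for name in fields)) for row in rows
    )

# Rows as they might come back from a database query (illustrative values).
rows = [
    {"id": 1, "lat": 63.07, "lon": -151.01, "elevation": 6190.0},
    {"id": 2, "lat": 42.36, "lon": -71.06, "elevation": 43.0},
]

# Export only latitude and longitude: 2 rows x 2 doubles x 8 bytes.
blob = pack_records(rows, ("lat", "lon"))
```

Choosing the fields at export time keeps each custom gazetteer as small as its purpose allows.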
Conclusions
• Accept multiple sources (only configure once per source)
• Fast loading of large datasets (1 million entries per hour on a Linux desktop)
• Simple update procedure
• Outputting large binary custom gazetteers for different purposes at extreme speeds (1 million entries per minute)