Post on 04-Jan-2016

Geographic Text Search. Corporate Proprietary, Copyright 1999-2003, MetaCarta, Inc.

On building a high-performance gazetteer database

Amittai Axelrod, MetaCarta Inc.

Thanks to

Keith Baker

Kenneth Baker

Michael Bukatin

András Kornai

Plan of the talk

• Database background

• Relating geographic names and features

• Handling ambiguities and inconsistencies in geographic names

• Classification and storage system for geographic features

Databases

• No DB (faking it with flat files) -- clumsy

• Record-oriented -- still runs the world

• Relational -- making headway

• Object-oriented -- still very academic

• For MetaCarta GazDB, the relational approach made the most sense:
  • Overlapping records (McKinley/Denali)
  • Need for frequent updates of subparts of records
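The two reasons above can be illustrated with a minimal relational sketch. The table and column names below are hypothetical (the slides do not show the actual GazDB schema), SQLite stands in for whatever engine was used, and the coordinates are approximate. The point is only that one feature can carry several names through a link table, and that one part of a record can be updated without rewriting the rest.

```python
import sqlite3

# Hypothetical three-table layout: features, names, and a link table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    lat REAL NOT NULL,
    lon REAL NOT NULL
);
CREATE TABLE name (
    name_id INTEGER PRIMARY KEY,
    text TEXT NOT NULL UNIQUE
);
CREATE TABLE feature_name (
    feature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    name_id INTEGER NOT NULL REFERENCES name(name_id),
    PRIMARY KEY (feature_id, name_id)
);
""")

# One mountain, two names (approximate coordinates, for illustration only).
conn.execute("INSERT INTO feature VALUES (1, 63.0695, -151.0074)")
conn.executemany("INSERT INTO name VALUES (?, ?)",
                 [(1, "Mount McKinley"), (2, "Denali")])
conn.executemany("INSERT INTO feature_name VALUES (?, ?)", [(1, 1), (1, 2)])


def names_for(feature_id):
    """Return all names attached to a feature, sorted alphabetically."""
    rows = conn.execute(
        "SELECT n.text FROM name n "
        "JOIN feature_name fn ON fn.name_id = n.name_id "
        "WHERE fn.feature_id = ? ORDER BY n.text",
        (feature_id,),
    )
    return [row[0] for row in rows]


# Update a subpart of the record: the latitude changes, the names do not.
conn.execute("UPDATE feature SET lat = ? WHERE feature_id = ?", (63.07, 1))
```

With flat files or whole-record storage, the same change would mean rewriting every copy of the record that mentions either name.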

Gazetteer production process

Conversion scripts

• Enforce uniform structure on the data

• Normalize across sources (e.g. lat/lon to decimal degrees, spelling, …)

• Configuration required once per source

• Load data into GazDB

• Combination of Perl and SQL
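As an illustration of the normalization step, here is a small sketch of converting degrees/minutes/seconds coordinates to decimal degrees. The original scripts were Perl and SQL and are not shown in the slides; this Python version, the function name `dms_to_decimal`, and the accepted input format are all assumptions made for the example.

```python
import re

# Degrees, minutes, seconds separated by any non-digit characters,
# followed by a hemisphere letter, e.g. 63°04'10"N or "151 00 27 W".
_DMS = re.compile(
    r"^\s*(\d{1,3})\D+(\d{1,2})\D+(\d{1,2}(?:\.\d+)?)\D*([NSEW])\s*$",
    re.IGNORECASE,
)


def dms_to_decimal(text):
    """Convert a DMS coordinate string to signed decimal degrees."""
    match = _DMS.match(text)
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees = int(match.group(1))
    minutes = int(match.group(2))
    seconds = float(match.group(3))
    hemisphere = match.group(4).upper()
    if degrees > 180 or minutes >= 60 or seconds >= 60:
        raise ValueError(f"component out of range: {text!r}")
    value = degrees + minutes / 60 + seconds / 3600
    # South and west are negative by convention.
    return -value if hemisphere in "SW" else value
```

For example, `dms_to_decimal("151 00 27 W")` yields -151.0075, since 27 seconds is 0.0075 of a degree.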

Relating features and names

Other tables used in GazDB

• Population
• Elevation
• Language
• Feature type
• Source/versioning info
• Temporal extent
• Hierarchical information
• Confidence
• Comments
• Change logs (full auditing)

Geographic names

• Internationalization
  • Full Unicode (UTF-8) support
  • Maintain detailed language information (SIL)

• Name resolution
  • Canonical form (16-bit)
  • Display form (8-bit)
  • Search form (6-bit)

• Authoritativeness

• Explicitness
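The slides do not say how the three name forms are derived. One plausible reading, sketched below, is that each form restricts the name to a narrower character repertoire: full Unicode for the canonical form, Latin-1 for the display form, and a small uppercase alphabet for the search form. Every rule in this sketch (NFC normalization, the Latin-1 cutoff, the 37-symbol search alphabet, the `?` fallback) is an assumption for illustration, not MetaCarta's method.

```python
import unicodedata

# Assumed search alphabet: 26 letters, 10 digits, space (37 symbols, fits in 6 bits).
SEARCH_ALPHABET = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ")


def canonical_form(name):
    """Full-Unicode form, normalized so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", name)


def display_form(name):
    """Form limited to 8-bit (Latin-1) characters; others lose their diacritics."""
    out = []
    for ch in canonical_form(name):
        if ord(ch) < 256:
            out.append(ch)
        else:
            # Decompose and keep only the ASCII base letter, if there is one.
            base = unicodedata.normalize("NFKD", ch)
            base = base.encode("ascii", "ignore").decode("ascii")
            out.append(base or "?")
    return "".join(out)


def search_form(name):
    """Uppercase form without diacritics, restricted to SEARCH_ALPHABET."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in stripped.upper() if c in SEARCH_ALPHABET)
```

Under these assumptions, "András Kornai" keeps its accent in the canonical and display forms and becomes "ANDRAS KORNAI" in the search form, so a query typed without diacritics still matches.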

Updating a name in the GazDB

Geographic features

• Spatial representations
  • Point, line, area, …

• Functional classes
  • Building, field, campus, city, …

• Administrative types
  • Nation, province, county, international org, …

Export scripts

• Read GazDB

• Select which fields to include in custom output

• Create .gbdm (MetaCarta format) binaries

• Combination of Perl and SQL

• Not yet general across binary output formats
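The export steps above (read records, select fields, write a compact binary) can be sketched as follows. The .gbdm format is proprietary and not described in the slides, so the fixed-width record layout here is purely hypothetical, as are the field names and the `export_records` function.

```python
import io
import struct

# Hypothetical per-field encodings: unsigned 32-bit ints and 64-bit floats.
FIELD_FORMATS = {
    "feature_id": "I",
    "lat": "d",
    "lon": "d",
    "population": "I",
}


def export_records(records, fields, out):
    """Write the selected fields of each record as fixed-width binary rows.

    Returns the struct layout used and the number of records written.
    """
    # "<" means little-endian with no padding between fields.
    layout = struct.Struct("<" + "".join(FIELD_FORMATS[f] for f in fields))
    count = 0
    for record in records:
        out.write(layout.pack(*(record[f] for f in fields)))
        count += 1
    return layout, count


# Usage: export only the id and coordinates, leaving population out.
records = [
    {"feature_id": 1, "lat": 63.0695, "lon": -151.0074, "population": 0},
]
buffer = io.BytesIO()
layout, written = export_records(records, ["feature_id", "lat", "lon"], buffer)
```

Fixed-width rows keep both writing and later lookup simple, which is one way to reach high export throughput; a format that is general across binary outputs would need the layout to be described in the file itself.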

Conclusions

• Accept multiple sources (only configure once per source)
• Fast loading of large datasets (1 million entries per hour on a Linux desktop)
• Simple update procedure
• Outputting large custom binary gazetteers for different purposes at extreme speeds (1 million entries per minute)