Integrating Government Data

using Semantic Web technology

Dean Allemang, Chief Scientist, TopQuadrant Inc.

Prepared for ISWC 2009

Government Data Sources

Recent efforts have changed the face of government data distribution

– Better motivated
– More sources
– ‘Mandate’ (well, memorandum, anyway) for sharing data

Government data sources
– Data.gov (main focus)
– DOI Architecture - http://www.doi.gov/ocio/architecture/
– USGS Earthquakes - http://earthquake.usgs.gov/eqcenter/
– USASpending.gov

Non-government data sources
– DBpedia.org
– oeGov

“Objets trouvés”

Artwork made from “found objects” Project Runway, etc.

Lal Hitchcock Sculptures

“Found data”

Data integration efforts try to make data reusable
– Data ‘wholesale’ instead of ‘retail’
– Multiple efforts result in multiple data formats
– Many efforts to ‘unify’ how data is represented – (competing) global data standards.
– Maybe one day, one will win.

Until that time, we have to make do with “found data” – data that is already available, however it is.

RDF (etc.) can help us do that

Formats for “Found Data” in government

Format | Examples | Notes
Spreadsheets | Data.gov, USASpending.gov, DOI | Flexibility makes it popular, but makes work at re-use time
XML | Data.gov | Not really a single format, but can be parsed uniformly
RSS | USASpending.gov, USGS | Syntax wars largely irrelevant now. Easy to read, dynamic
RDFa | <none?> | New kid on the block, supported by Google, Yahoo!, Drupal
SPARQL Endpoint | DBpedia.org | Most flexible of all, dynamic
RDF/N3/SKOS | oeGov, Tetherless World | Flexible, relatively static. Great for vocabularies etc.

Quality Considerations of Found Data

Correctness
– Usual notion for data quality; is it right?
– Misspellings, out-of-date data, etc.

Understandability
– Found data requires interpretation.
– E.g., what do columns in a spreadsheet mean?

Accessibility
– How easily can the data be organized?
– E.g., spreadsheets can have haphazard organization
– E.g., RSS feeds that aren’t dynamic, don’t have readable fields, etc.

Reusability/Repurposing
– References to Controlled Vocabularies
– Use of standardized ‘columns’ (properties)

A few species of Found Data

Quantitative Data feeds
– This is what we are usually actually interested in
– Data is described using properties, units, tags, etc.

Vocabularies*
– Structured, unstructured
– Sometimes with strong standards behind them (Westlaw, AGROVOC)
– Not always advertised as ‘vocabularies’ – also as org diagrams, architectures, or even data
  • FEA, TOGAF
  • Geographical entities (States, cities, countries), FAO Geopolitical ontology
  • Units of measure, structure of gov’t agencies

Schema*
– Used to standardize properties (columns, XML tags, etc.)
  • DC, WGS, FOAF, SIOC
  • ISO/IEC 11179

* Two kinds of “controlled vocabulary” – often confused!

Integration strategy using RDF

IMPORT data into RDF
– RDF is a sort of ‘least common denominator’ data representation

MERGE data
– A wide variety of technologies available here
– Semantic Web approach – you MODEL your mapping.

ANALYZE and DISPLAY conclusions
– RDF is a sort of ‘least common denominator’ data representation

Import data into RDF

RDF as Common Data representation

‘rote’ transformations

<Person id="3">
  <name>Irene Polikoff</name>
  <employer>TopQuadrant</employer>
  <position>CEO</position>
</Person>

Name | Address | Company | Title
Dean Allemang | 10 Downing St. | TopQuadrant | Chief Scientist
Michael Brodie | 14 Wysteria Lane | Verizon | Chief Scientist
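A ‘rote’ import can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the tutorial: triples are plain tuples, the `http://example.org/data/` namespace and the function names `rows_to_triples` and `element_to_triples` are invented for the example, and only the standard library is used. Each spreadsheet row and each XML element becomes one entity; each column header or child tag becomes one property.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical namespace, for illustration only (not from the slides).
BASE = "http://example.org/data/"

def rows_to_triples(csv_text, source):
    """One entity per spreadsheet row, one property per column header."""
    triples = []
    for n, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        subject = f"{BASE}{source}/row{n}"
        for column, value in row.items():
            triples.append((subject, f"{BASE}{source}#{column}", value))
    return triples

def element_to_triples(xml_text, source):
    """One entity per XML element, one property per child tag."""
    elem = ET.fromstring(xml_text)
    subject = f"{BASE}{source}/{elem.tag}{elem.get('id')}"
    return [(subject, f"{BASE}{source}#{child.tag}", child.text) for child in elem]

SHEET = """Name,Address,Company,Title
Dean Allemang,10 Downing St.,TopQuadrant,Chief Scientist
Michael Brodie,14 Wysteria Lane,Verizon,Chief Scientist
"""
PERSON = ('<Person id="3"><name>Irene Polikoff</name>'
          '<employer>TopQuadrant</employer><position>CEO</position></Person>')

sheet_triples = rows_to_triples(SHEET, "sheet")
xml_triples = element_to_triples(PERSON, "xml")

for triple in sheet_triples + xml_triples:
    print(triple)
```

Nothing is interpreted at this stage: `Company` and `employer` remain different properties in different namespaces. Reconciling them is the job of the merge step.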

Import Data into RDF

Each common data type can be input ‘rote’ into RDF
– Input preserves information from original; entities for e.g. spreadsheet rows, XML elements, database tables, RSS channels, etc.
– Often “found data” requires further processing to make sense, e.g.:
  • Extracting trees from spreadsheets
  • Resolving references in XML
– SPARQL CONSTRUCT is useful for any of these, once data is ‘rote’ translated into RDF

Genus | Species | Sub-species
Canis | Dog | Collie
Canis | Dog | Beagle
Canis | Dog | Terrier
Canis | Wolf | Steppen
Canis | Wolf | Lone

Canis
  Dog
    Collie
    Beagle
    Terrier
  Wolf
    Steppen
    Lone
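Extracting a tree from a flat spreadsheet like the one above can be sketched as follows. This is a plain-Python illustration under stated assumptions rather than the SPARQL CONSTRUCT approach the slides recommend: the function names and the `broader` relation label (chosen to echo SKOS) are invented for the example.

```python
def rows_to_tree(rows):
    """Turn flat (genus, species, sub-species) rows into nested dicts."""
    tree = {}
    for row in rows:
        node = tree
        for label in row:
            # Reuse the node if this label was already seen at this level.
            node = node.setdefault(label, {})
    return tree

def tree_to_edges(tree, parent=None):
    """Flatten the nested dicts into (child, 'broader', parent) statements."""
    edges = []
    for label, children in tree.items():
        if parent is not None:
            edges.append((label, "broader", parent))
        edges.extend(tree_to_edges(children, label))
    return edges

ROWS = [
    ("Canis", "Dog", "Collie"),
    ("Canis", "Dog", "Beagle"),
    ("Canis", "Dog", "Terrier"),
    ("Canis", "Wolf", "Steppen"),
    ("Canis", "Wolf", "Lone"),
]

taxonomy = rows_to_tree(ROWS)
edges = tree_to_edges(taxonomy)

for edge in edges:
    print(edge)
```

The repeated `Canis` and `Dog` cells collapse into single nodes, which is exactly the information the spreadsheet layout hides.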

Data Quality and Controlled Vocabularies

Do you reference a controlled vocabulary?
– Flickr, del.icio.us, no
– DOI, GSA, FTF, etc. reference FEA
– Some reference more than one, e.g., GSA references TOGAF also
– Legal briefs reference West Key Numbering System (WestLaw)
– If you reference one (or more), then information sharing becomes possible along that vocabulary

Did you tell us which one you referenced?
– Reference is often implicit, or hidden in column name “Service Standard” (did you recognize that as FEA?)
– Reference is often explicit but informal, e.g., ISBN-10: 0123735564
– RDF provides global means of referencing vocabulary with a URI
  http://www.fao.org/aos/agrovoc#c_16080 rdfs:label “Cow milk”@en

Data Quality and Controlled Vocabularies (cont)

How did you specify the term?
– del.icio.us, Flickr, etc. use (uncontrolled) strings
– FEA uses controlled strings (which notion of “Quality” do you mean?)
– WestLaw uses Key Numbering System: 2233(2) “Regular income”
– RDF/SKOS uses global means of referring to terms with the URI
  http://www.fao.org/aos/agrovoc#c_16080 rdfs:label “Cow milk”@en

Sounds familiar? The URI solves many problems of reference with respect to shared controlled vocabularies!

Unstructured data and Controlled Vocabularies

Found Data sometimes doesn’t refer to vocabularies directly
– “Microsoft announced today that negotiations to acquire search giant Yahoo! have stalled.”
– MICROSOFT, YAHOO! etc. could be controlled terms!
– ‘standard’ terms might not match exactly (SEC names, etc.)

Concept Extraction technology can be relevant here
– Reuters Calais reads news stories and extracts concepts in a controlled vocabulary
– Still has all the reference issues from before
– Calais uses RDF (URIs) to resolve this.

Hooray for Calais!

Merging Data

“Schema mapping”
– Useful when multiple data sources provide the same information about similar items
– Same information is described using different terms (columns, properties)

“Tagging” or “Sorting”
– ‘tags’ data (like del.icio.us or Library of Congress)
– Useful for grouping similar items for search and discovery

Both can be used together
– E.g., use tags to find similar things, then map schemas to report data uniformly

Data mapping Style 1: Schema Mapping Examples

Different sources use different names

Name | Address | Company | Title
Dean Allemang | 10 Downing St. | TopQuadrant | Chief Scientist
Michael Brodie | 14 Wysteria Lane | Verizon | Chief Scientist

<Person id="3">
  <name>Irene Polikoff</name>
  <employer>TopQuadrant</employer>
  <position>CEO</position>
</Person>

Name = name, but Company = employer and Title = position

Schema Mapping Examples (cont)

Different structures for similar data

<rss:item ID="3">
  <wgs:lat>39.945345</wgs:lat>
  <wgs:long>-79.34524</wgs:long>
</rss:item>

<image src="doggie.jpg">
  <wgs:Point>
    <wgs:lat>39.945345</wgs:lat>
    <wgs:long>-79.34524</wgs:long>
  </wgs:Point>
</image>

<Entry>
  <position>39.945345,-79.34524</position>
</Entry>

Schema mapping solutions:

With RDFS/OWL:

  :employer owl:equivalentProperty :Company .
  :position owl:equivalentProperty :Title .

With SPARQL:

  CONSTRUCT { ?x wgs:lat ?lat .
              ?x wgs:long ?long . }
  WHERE { ?x sxml:child ?point .
          ?point a :Point .
          ?point wgs:lat ?lat .
          ?point wgs:long ?long }

With SPARQL extensions:

  CONSTRUCT { ?x wgs:lat ?lat .
              ?x wgs:long ?long . }
  WHERE { ?x :position ?pos .
          LET (?lat := str:before(?pos, ","))
          LET (?long := str:after(?pos, ",")) }
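The effect of the two CONSTRUCT queries can be imitated in plain Python to make the intent concrete. This is a sketch under stated assumptions, not an implementation of SPARQL: triples are tuples of strings, the subject names `entry1`, `image1` and `pt1` are invented, and `rdf:type` stands in for the `a` keyword.

```python
def lift_point(triples):
    """Copy wgs:lat / wgs:long from a nested :Point node up to its parent."""
    points = {s for s, p, o in triples if p == "rdf:type" and o == ":Point"}
    parent_of = {o: s for s, p, o in triples if p == "sxml:child" and o in points}
    return [
        (parent_of[s], p, o)
        for s, p, o in triples
        if s in parent_of and p in ("wgs:lat", "wgs:long")
    ]

def split_position(triples):
    """Split a 'lat,long' string into separate wgs:lat and wgs:long triples."""
    out = []
    for s, p, o in triples:
        if p == ":position":
            lat, _, long_ = o.partition(",")  # same idea as str:before / str:after
            out.append((s, "wgs:lat", lat.strip()))
            out.append((s, "wgs:long", long_.strip()))
    return out

image = [
    ("image1", "sxml:child", "pt1"),
    ("pt1", "rdf:type", ":Point"),
    ("pt1", "wgs:lat", "39.945345"),
    ("pt1", "wgs:long", "-79.34524"),
]
entry = [("entry1", ":position", "39.945345,-79.34524")]

lifted = lift_point(image)
split = split_position(entry)
print(lifted)
print(split)
```

After both steps, all three source structures describe a location the same way: a subject with a `wgs:lat` and a `wgs:long` attached directly to it.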

Schema mapping solutions (cont)

With a controlled meta-vocabulary and RDFS: E.g., 11179

  :employer rdfs:subPropertyOf 11179:Concept1234 .
  :position rdfs:subPropertyOf 11179:Concept5678 .
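What the `rdfs:subPropertyOf` statements buy you can be shown with a small Python sketch of the inference rule. Several things here are assumptions for illustration: the `shared:` concept names stand in for the registry concepts, the mappings for `:Company` and `:Title` are added on the assumption that the spreadsheet's owner maps to the same concepts, and only a single level of sub-property is followed (no transitive chains).

```python
# Mapping statements kept as data: each source property points at a shared concept.
# 'shared:Employer' and 'shared:JobTitle' are hypothetical placeholder names.
SUB_PROPERTY_OF = {
    ":employer": "shared:Employer",
    ":Company": "shared:Employer",
    ":position": "shared:JobTitle",
    ":Title": "shared:JobTitle",
}

def infer_super_properties(triples, sub_property_of):
    """RDFS rule: if (s p o) and p subPropertyOf q, then also (s q o)."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in sub_property_of:
            inferred.add((s, sub_property_of[p], o))
    return inferred

data = [
    ("row1", ":Company", "TopQuadrant"),
    ("row1", ":Title", "Chief Scientist"),
    ("person3", ":employer", "TopQuadrant"),
    ("person3", ":position", "CEO"),
]

merged = infer_super_properties(data, SUB_PROPERTY_OF)

# One query over the shared property now reaches both sources.
topquadrant_staff = sorted(
    s for s, p, o in merged if p == "shared:Employer" and o == "TopQuadrant"
)
print(topquadrant_staff)
```

The original triples are kept, so each source can still be read in its own terms; the mapping only adds a common view on top.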

Role of Standards in the Mapping

Schema standards like WGS:
– If all parties use them, no mapping necessary!!
– Simple standards encourage reuse: Microformats

Schema meta-standards like 11179
– If all parties map to them, no new mapping necessary – just use theirs!
– One mega-standard makes re-use difficult
– Meta-standard (don’t use my words, just map to them) makes reuse easier

Vocabulary standards (AGROVOC, WestLaw, FEA, etc.)
– Not very applicable at this stage
– Will come into their own in the next step . . .

Data Mapping Style 2: Tagging or Sorting

Like del.icio.us etc.

<Bookmark href="http://www.topquadrant.com">
  <tag>Semantic Web</tag>
</Bookmark>

<System name="Central Bookkeeping">
  <Evaluation>
    <PerformanceMeasure>Quality</PerformanceMeasure>
    <Result>Fair</Result>
  </Evaluation>
</System>

That’s an FEA reference!

Where does this come from?

Role of Standards in the Mapping

Vocabulary standards (AGROVOC, WestLaw, FEA, etc.)
– Useful for organizing collaboration among groups
– Used extensively by libraries, professional organizations, focused domain groups, etc.
– Not used by del.icio.us, Flickr, etc.
– Related to “Folksonomies”

Analysis and Display

Wide variety of options, including e.g.:
– Use tags and tag structure to amalgamate data
– Display merged properties in a table
– Display merged data on a specific widget (e.g., mapping geospatial data)
– Business Intelligence reporting – pie chart, bars, graphs, etc.

Tags as Amalgamation

[Figure labels: FEA, DOI, GSA]

If two sources use the same controlled vocabulary, they can be amalgamated along that dimension.
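Amalgamating along a shared vocabulary amounts to grouping items from different sources by the term they reference. The sketch below is purely illustrative: the system identifiers and the `fea:Timeliness` term are invented, and `fea:Quality` is written as a prefixed name only to suggest a URI-based reference to the FEA “Quality” measure mentioned earlier.

```python
from collections import defaultdict

# Hypothetical records: (item, controlled-vocabulary term it is tagged with).
DOI_SYSTEMS = [("doi:SystemA", "fea:Quality"), ("doi:SystemB", "fea:Timeliness")]
GSA_SYSTEMS = [("gsa:SystemX", "fea:Quality")]

def amalgamate(*sources):
    """Group items from any number of sources by their shared vocabulary term."""
    by_term = defaultdict(list)
    for source in sources:
        for item, term in source:
            by_term[term].append(item)
    return dict(by_term)

merged = amalgamate(DOI_SYSTEMS, GSA_SYSTEMS)
for term, items in merged.items():
    print(term, items)
```

The grouping only works because both sources identify the term in the same way; with free-text tags, "Quality" in one source and "quality of service" in another would land in different buckets.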

Mapping Columns

Model-driven displays

SELECT ?lat ?long
WHERE { ?item a :DisplayLocation .
        ?item geo:lat ?lat .
        ?item geo:long ?long . }

Name | latitude | longitude
Slausen | -171.3 | 38.4
Union | -171.4 | 38.2
Vine | -170.9 | 37.9
McArthur | -170.4 | 38.1
Anaheim | -171.3 | 38.2
Chinatown | -171.1 | 38.5
Beverly | -171.3 | 38.1

[Figure: model relating latitude, longitude and Station to geo:lat, geo:long and :DisplayLocation through domain, subPropertyOf and subClassOf links]
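How a model can drive the display query may be sketched in Python. The exact links in the diagram are not recoverable from the text, so this example assumes that `latitude` and `longitude` are sub-properties of `geo:lat` and `geo:long`, and that `Station` is a subclass of `:DisplayLocation`; stations are typed explicitly rather than through domain inference. The two sample rows copy values from the station table as they appear there.

```python
# Assumed model statements (see the hedges above).
SUB_PROPERTY_OF = {"latitude": "geo:lat", "longitude": "geo:long"}
SUB_CLASS_OF = {"Station": ":DisplayLocation"}

def infer(triples):
    """Apply one level of subPropertyOf and subClassOf inference."""
    out = set(triples)
    for s, p, o in triples:
        if p in SUB_PROPERTY_OF:
            out.add((s, SUB_PROPERTY_OF[p], o))
        if p == "rdf:type" and o in SUB_CLASS_OF:
            out.add((s, "rdf:type", SUB_CLASS_OF[o]))
    return out

def display_locations(triples):
    """Rough equivalent of the SELECT ?lat ?long query over the given triples."""
    items = {s for s, p, o in triples if p == "rdf:type" and o == ":DisplayLocation"}
    lat = {s: o for s, p, o in triples if p == "geo:lat"}
    lon = {s: o for s, p, o in triples if p == "geo:long"}
    return sorted((lat[i], lon[i]) for i in items if i in lat and i in lon)

stations = [
    ("Slausen", "rdf:type", "Station"),
    ("Slausen", "latitude", -171.3),
    ("Slausen", "longitude", 38.4),
    ("Union", "rdf:type", "Station"),
    ("Union", "latitude", -171.4),
    ("Union", "longitude", 38.2),
]

rows = display_locations(infer(stations))
print(rows)
```

The query never mentions `Station`, `latitude` or `longitude`. Pointing a new data source at the same display only requires adding model statements, not changing the query.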

Exercises

Will use TopBraid™ Ensemble and TopBraid™ Composer

Using data from
– oeGov
– USASpending.gov
– … others TBD …

Merge, slice, amalgamate, etc…
