Date post: | 14-May-2015 |
Integrating Government Data
using Semantic Web technology
Dean Allemang, Chief Scientist, TopQuadrant Inc.
Prepared for ISWC 2009
Government Data Sources
Recent efforts have changed the face of government data distribution
– Better motivated
– More sources
– ‘Mandate’ (well, memorandum, anyway) for sharing data
Government data sources
– Data.gov (main focus)
– DOI Architecture - http://www.doi.gov/ocio/architecture/
– USGS Earthquakes - http://earthquake.usgs.gov/eqcenter/
– USASpending.gov
Non-government data sources
– Dbpedia.org
– oeGov
“Objets trouvés”
Artwork made from “found objects”
– Project Runway, etc.
Lal Hitchcock Sculptures
“Found data”
Data integration efforts try to make data reusable
– Data ‘wholesale’ instead of ‘retail’
– Multiple efforts result in multiple data formats
– Many efforts to ‘unify’ how data is represented – (competing) global data standards.
– Maybe one day, one will win.
Until that time, we have to make do with “found data” – data that is already available, in whatever form it comes.
RDF (etc.) can help us do that
Formats for “Found Data” in government
Format | Examples | Notes
Spreadsheets | Data.gov, USASpending.gov, DOI | Flexibility makes it popular, but makes work at re-use time
XML | Data.gov | Not really a single format, but can be parsed uniformly
RSS | USASpending.gov, USGS | Syntax wars largely irrelevant now. Easy to read, dynamic
RDFa | <none?> | New kid on the block, supported by Google, Yahoo!, Drupal
SPARQL Endpoint | Dbpedia.org | Most flexible of all, dynamic
RDF/N3/SKOS | OEGov, Tetherless World | Flexible, relatively static. Great for vocabularies etc.
Quality Considerations of Found Data
Correctness
– Usual notion of data quality: is it right?
– Misspellings, out-of-date data, etc.
Understandability
– Found data requires interpretation.
– E.g., what do the columns in a spreadsheet mean?
Accessibility
– How easily can the data be organized?
– E.g., spreadsheets can have haphazard organization
– E.g., RSS feeds that aren’t dynamic, don’t have readable fields, etc.
Reusability/Repurposing
– References to controlled vocabularies
– Use of standardized ‘columns’ (properties)
A few species of Found Data
Quantitative data feeds
– This is what we are usually actually interested in
– Data is described using properties, units, tags, etc.
Vocabularies*
– Structured, unstructured
– Sometimes with strong standards behind them (Westlaw, AGROVOC)
– Not always advertised as ‘vocabularies’ – also as org diagrams, architectures, or even data
• FEA, TOGAF
• Geographical entities (states, cities, countries): FAO Geopolitical ontology
• Units of measure, structure of gov’t agencies
Schema*
– Used to standardize properties (columns, XML tags, etc.)
• DC, WGS, FOAF, SIOC
• 11179
* Two kinds of “controlled vocabulary” – often confused!
Integration strategy using RDF
IMPORT data into RDF
– RDF is a sort of ‘least common denominator’ data representation
MERGE data
– A wide variety of technologies available here
– Semantic Web approach – you MODEL your mapping.
ANALYZE and DISPLAY conclusions
– RDF is a sort of ‘least common denominator’ data representation
Import data into RDF
RDF as common data representation: ‘rote’ transformations
<Person id="3">
  <name>Irene Polikoff</name>
  <employer>TopQuadrant</employer>
  <position>CEO</position>
</Person>
Name | Address | Company | Title
Dean Allemang | 10 Downing St. | TopQuadrant | Chief Scientist
Michael Brodie | 14 Wysteria Lane | Verizon | Chief Scientist
Import Data into RDF
Each common data type can be input ‘rote’ into RDF
– Input preserves information from the original: entities for, e.g., spreadsheet rows, XML elements, database tables, RSS channels, etc.
– Often “found data” requires further processing to make sense, e.g.:
• Extracting trees from spreadsheets
• Resolving references in XML
– SPARQL CONSTRUCT is useful for any of these, once data is ‘rote’ translated into RDF
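A ‘rote’ import can be sketched in a few lines of Python. This is only an illustration, not the tooling used in the talk: triples are modelled as plain tuples, the namespace `http://example.org/found#` and the function names are hypothetical, and the sample data is the spreadsheet and XML fragment shown on the earlier slide.

```python
import csv
import io
import xml.etree.ElementTree as ET

BASE = "http://example.org/found#"  # hypothetical namespace, for illustration only


def rows_to_triples(csv_text, sheet_name):
    """Rote import: one entity per spreadsheet row, one property per column."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader, start=1):
        subject = f"{BASE}{sheet_name}_row{index}"
        for column, value in row.items():
            if value:  # skip empty cells
                triples.append((subject, BASE + column, value))
    return triples


def element_to_triples(xml_text):
    """Rote import: one entity per XML element, one property per child tag."""
    element = ET.fromstring(xml_text)
    subject = f"{BASE}{element.tag}_{element.get('id')}"
    return [(subject, BASE + child.tag, child.text) for child in element]


SHEET = """Name,Address,Company,Title
Dean Allemang,10 Downing St.,TopQuadrant,Chief Scientist
Michael Brodie,14 Wysteria Lane,Verizon,Chief Scientist
"""
PERSON = (
    '<Person id="3"><name>Irene Polikoff</name>'
    "<employer>TopQuadrant</employer><position>CEO</position></Person>"
)

sheet_triples = rows_to_triples(SHEET, "people")
xml_triples = element_to_triples(PERSON)
for triple in sheet_triples + xml_triples:
    print(triple)
```

Nothing is interpreted at this stage: `Company` and `employer` remain different properties. Reconciling them is left to the merge step.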
Genus | Species | Sub-species
Canus | Dog | Collie
Canus | Dog | Beagle
Canus | Dog | Terrier
Canus | Wolf | Steppen
Canus | Wolf | Lone
[Tree diagram: Canus at the root; Dog with children Collie, Beagle, Terrier; Wolf with children Steppen, Lone]
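“Extracting trees from spreadsheets” might look like the following sketch. It is written in plain Python rather than SPARQL CONSTRUCT, and the `broader` edge label is an assumption borrowed loosely from SKOS, not something the slides specify.

```python
ROWS = [
    ("Canus", "Dog", "Collie"),
    ("Canus", "Dog", "Beagle"),
    ("Canus", "Dog", "Terrier"),
    ("Canus", "Wolf", "Steppen"),
    ("Canus", "Wolf", "Lone"),
]


def rows_to_tree(rows):
    """Fold repeated leading columns into a nested dict, one level per column."""
    tree = {}
    for row in rows:
        node = tree
        for level in row:
            node = node.setdefault(level, {})
    return tree


def tree_to_edges(tree, parent=None):
    """Flatten the nested dict into (child, 'broader', parent) triples."""
    edges = []
    for name, children in tree.items():
        if parent is not None:
            edges.append((name, "broader", parent))
        edges.extend(tree_to_edges(children, name))
    return edges


tree = rows_to_tree(ROWS)
edges = tree_to_edges(tree)
for edge in edges:
    print(edge)
```

The repeated “Canus” and “Dog” cells collapse into single nodes, which is the structure the spreadsheet only implied.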
Data Quality and Controlled Vocabularies
Do you reference a controlled vocabulary?
– Flickr, del.icio.us: no
– DOI, GSA, FTF, etc. reference FEA
– Some reference more than one, e.g., GSA references TOGAF also
– Legal briefs reference West Key Numbering System (WestLaw)
– If you reference one (or more), then information sharing becomes possible along that vocabulary
Did you tell us which one you referenced?
– Reference is often implicit, or hidden in column name “Service Standard” (did you recognize that as FEA?)
– Reference is often explicit but informal: ISBN-10: 0123735564
– RDF provides global means of referencing vocabulary with a URI
http://www.fao.org/aos/agrovoc#c_16080 rdfs:label "Cow milk"@en
Data Quality and Controlled Vocabularies (cont)
How did you specify the term?
– del.icio.us, Flickr, etc. use (uncontrolled) strings
– FEA uses controlled strings (which notion of “Quality” do you mean?)
– WestLaw uses Key Numbering System: 2233(2) “Regular income”
– RDF/SKOS uses global means of referring to terms with the URI
http://www.fao.org/aos/agrovoc#c_16080 rdfs:label "Cow milk"@en
Sound familiar? The URI solves many problems of reference with respect to shared controlled vocabularies!
Unstructured data and Controlled Vocabularies
Found Data sometimes doesn’t refer to vocabularies directly
– “Microsoft announced today that negotiations to acquire search giant Yahoo! have stalled.”
– MICROSOFT, YAHOO! etc. could be controlled terms!
– ‘Standard’ terms might not match exactly (SEC names, etc.)
Concept Extraction technology can be relevant here
– Reuters Calais reads news stories and extracts concepts in a controlled vocabulary
– Still has all the reference issues from before
– Calais uses RDF (URIs) to resolve this.
Hooray for Calais!
Merging Data
“Schema mapping”
– Useful when multiple data sources provide the same information about similar items
– Same information is described using different terms (columns, properties)
“Tagging” or “Sorting”
– ‘Tags’ data (like del.icio.us or Library of Congress)
– Useful for grouping similar items for search and discovery
Both can be used together
– E.g., use tags to find similar things, then map schemas to report data uniformly
Data mapping Style 1: Schema Mapping Examples
Different sources use different names
Name | Address | Company | Title
Dean Allemang | 10 Downing St. | TopQuadrant | Chief Scientist
Michael Brodie | 14 Wysteria Lane | Verizon | Chief Scientist
<Person id="3">
  <name>Irene Polikoff</name>
  <employer>TopQuadrant</employer>
  <position>CEO</position>
</Person>
Name = name, but
Company = employer
Title = position
Schema Mapping Examples (cont)
Different structures for similar data
<rss:item ID="3">
  <wgs:lat>39.945345</wgs:lat>
  <wgs:long>-79.34524</wgs:long>
</rss:item>
<image src="doggie.jpg">
  <wgs:Point>
    <wgs:lat>39.945345</wgs:lat>
    <wgs:long>-79.34524</wgs:long>
  </wgs:Point>
</image>
<Entry>
  <position>39.945345,-79.34524</position>
</Entry>
Schema mapping solutions:
With RDFS/OWL:
:employer owl:equivalentProperty :Company .
:position owl:equivalentProperty :Title .
With SPARQL
CONSTRUCT { ?x wgs:lat ?lat . ?x wgs:long ?long . }
WHERE {
  ?x sxml:child ?point .
  ?point a :Point .
  ?point wgs:lat ?lat .
  ?point wgs:long ?long
}
With SPARQL extensions
CONSTRUCT { ?x wgs:lat ?lat . ?x wgs:long ?long . }
WHERE {
  ?x :position ?pos .
  LET (?lat := str:before(?pos, ","))
  LET (?long := str:after(?pos, ","))
}
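For readers who do not have a SPARQL engine at hand, the same two mapping ideas can be mimicked on tuple-based triples in Python. This is a rough sketch and not an OWL reasoner: `owl:equivalentProperty` is symmetric, whereas the dictionary below rewrites in one direction only, and both function names are hypothetical. The two uses of `:position` (job title and coordinates) come from different sources on the slides, so they are processed separately here.

```python
# One-directional stand-in for the owl:equivalentProperty statements;
# the :name -> :Name entry reflects "Name = name" from the earlier slide.
EQUIVALENT_PROPERTY = {
    ":name": ":Name",
    ":employer": ":Company",
    ":position": ":Title",
}


def rename_properties(triples, equivalents):
    """Rewrite each predicate to its mapped name, leaving unmapped ones alone."""
    return [(s, equivalents.get(p, p), o) for s, p, o in triples]


def split_position(triples, position_property=":position"):
    """Split a 'lat,long' string into separate wgs:lat and wgs:long triples."""
    result = []
    for s, p, o in triples:
        if p == position_property and "," in o:
            lat, _, long = o.partition(",")
            result.append((s, "wgs:lat", lat.strip()))
            result.append((s, "wgs:long", long.strip()))
        else:
            result.append((s, p, o))
    return result


person = [
    (":person3", ":name", "Irene Polikoff"),
    (":person3", ":employer", "TopQuadrant"),
    (":person3", ":position", "CEO"),
]
entry = [(":entry1", ":position", "39.945345,-79.34524")]

mapped_person = rename_properties(person, EQUIVALENT_PROPERTY)
mapped_entry = split_position(entry)
print(mapped_person)
print(mapped_entry)
```

In both cases the mapping is data (a dictionary, or in RDF a handful of triples) rather than code buried in an import script, which is the sense in which you “model” your mapping.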
Schema mapping solutions (cont)
With a controlled meta-vocabulary and RDFS, e.g., 11179:
:employer rdfs:subPropertyOf 11179:Concept1234 .
:position rdfs:subPropertyOf 11179:Concept5678 .
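The effect of mapping to a meta-vocabulary can be illustrated with a small hand-rolled inference step. This sketch assumes that the spreadsheet columns `:Company` and `:Title` are mapped to the same concepts as `:employer` and `:position`. The slide shows only the latter two, so the extra entries are an assumption, as are the function name and the sample subjects.

```python
SUB_PROPERTY_OF = {
    ":employer": "11179:Concept1234",
    ":position": "11179:Concept5678",
    # Assumed for illustration: the spreadsheet columns map to the same concepts.
    ":Company": "11179:Concept1234",
    ":Title": "11179:Concept5678",
}


def with_super_properties(triples, sub_property_of):
    """Add (s, super, o) for every super-property reachable from each predicate."""
    inferred = set(triples)
    for s, p, o in triples:
        seen = set()  # guards against accidental cycles in the mapping
        parent = sub_property_of.get(p)
        while parent is not None and parent not in seen:
            seen.add(parent)
            inferred.add((s, parent, o))
            parent = sub_property_of.get(parent)
    return inferred


data = [
    (":person3", ":employer", "TopQuadrant"),
    (":row1", ":Company", "TopQuadrant"),
    (":row2", ":Company", "Verizon"),
]

inferred = with_super_properties(data, SUB_PROPERTY_OF)
employers = sorted((s, o) for s, p, o in inferred if p == "11179:Concept1234")
print(employers)
```

Neither source has to adopt the other's column names. Each maps once to the shared concept, and a query against that concept sees both.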
Role of Standards in the Mapping
Schema standards like WGS:
– If all parties use them, no mapping necessary!!
– Simple standards encourage reuse: Microformats
Schema meta-standards like 11179
– If all parties map to them, no new mapping necessary – just use theirs!
– One mega-standard makes re-use difficult
– Meta-standard (don’t use my words, just map to them) makes reuse easier
Vocabulary standards (AGROVOC, WestLaw, FEA, etc.)
– Not very applicable at this stage
– Will come into their own in the next step . . .
Data Mapping Style 2: Tagging or Sorting
Like del.icio.us etc.
<Bookmark href="http://www.topquadrant.com">
  <tag>Semantic Web</tag>
</Bookmark>
<System name="Central Bookkeeping">
  <Evaluation>
    <PerformanceMeasure>Quality</PerformanceMeasure>
    <Result>Fair</Result>
  </Evaluation>
</System>
That’s an FEA reference!
Where does this come from?
Role of Standards in the Mapping
Vocabulary standards (AGROVOC, WestLaw, FEA, etc.)
– Useful for organizing collaboration among groups
– Used extensively by libraries, professional organizations, focused domain groups, etc.
– Not used by del.icio.us, Flickr, etc.
– Related to “Folksonomies”
Analysis and Display
Wide variety of options, including, e.g.:
– Use tags and tag structure to amalgamate data
– Display merged properties in a table
– Display merged data on a specific widget (e.g., mapping geospatial data)
– Business Intelligence reporting – pie charts, bars, graphs, etc.
Tags as Amalgamation
FEA
DOI
GSA
If two sources use the same controlled vocabulary, they can be amalgamated along that dimension.
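Amalgamation along a shared vocabulary amounts to grouping records by the term they reference. In the sketch below, only “Central Bookkeeping” and the “Quality” measure come from the slides. The other systems, the `fea:` term identifiers, and the function name are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (system name, controlled-vocabulary term it references)
SOURCES = {
    "DOI": [
        ("Central Bookkeeping", "fea:Quality"),
        ("Permit Tracking", "fea:Timeliness"),
    ],
    "GSA": [
        ("Fleet Management", "fea:Quality"),
    ],
}


def amalgamate_by_term(sources):
    """Group (source, item) pairs under the vocabulary term they share."""
    grouped = defaultdict(list)
    for source, records in sources.items():
        for item, term in records:
            grouped[term].append((source, item))
    return dict(grouped)


amalgamated = amalgamate_by_term(SOURCES)

# Keep only terms referenced by more than one source
shared = {
    term: items
    for term, items in amalgamated.items()
    if len({source for source, _ in items}) > 1
}
print(shared)
```

The grouping only works because both sources use the same identifier for the term. With free-text tags, the two “Quality” entries would not be guaranteed to mean the same thing.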
Mapping Columns
Model-driven displays
SELECT ?lat ?long
WHERE {
  ?item a :DisplayLocation .
  ?item geo:lat ?lat .
  ?item geo:long ?long .
}
Name | latitude | longitude
Slausen | -171.3 | 38.4
Union | -171.4 | 38.2
Vine | -170.9 | 37.9
McArthur | -170.4 | 38.1
Anaheim | -171.3 | 38.2
Chinatown | -171.1 | 38.5
Beverly | -171.3 | 38.1
[Diagram: properties latitude and longitude (domain Station) and geo:lat and geo:long (domain :DisplayLocation), connected by two subPropertyOf edges and one subClassOf edge]
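One plausible reading of the diagram is that `latitude` and `longitude` are sub-properties of `geo:lat` and `geo:long`, and that `Station` is a subclass of `:DisplayLocation`. Under that assumption, a few RDFS-style rules are enough for the generic display query to pick up station data it was never written for. The sketch below hand-codes three rules (subPropertyOf, domain, subClassOf) over tuple triples. It is not a full RDFS reasoner, and the coordinate values are copied as they appear in the table.

```python
SCHEMA = [
    (":latitude", "rdfs:subPropertyOf", "geo:lat"),
    (":longitude", "rdfs:subPropertyOf", "geo:long"),
    (":latitude", "rdfs:domain", ":Station"),
    (":longitude", "rdfs:domain", ":Station"),
    ("geo:lat", "rdfs:domain", ":DisplayLocation"),
    ("geo:long", "rdfs:domain", ":DisplayLocation"),
    (":Station", "rdfs:subClassOf", ":DisplayLocation"),
]
DATA = [
    (":Slausen", ":latitude", "-171.3"),
    (":Slausen", ":longitude", "38.4"),
    (":Union", ":latitude", "-171.4"),
    (":Union", ":longitude", "38.2"),
]


def rdfs_closure(triples):
    """Apply subPropertyOf, domain and subClassOf rules until nothing new appears."""
    facts = set(triples)
    while True:
        sub_property = [(s, o) for s, p, o in facts if p == "rdfs:subPropertyOf"]
        domain = [(s, o) for s, p, o in facts if p == "rdfs:domain"]
        sub_class = [(s, o) for s, p, o in facts if p == "rdfs:subClassOf"]
        new = set()
        for s, p, o in facts:
            new.update((s, parent, o) for child, parent in sub_property if child == p)
            new.update((s, "rdf:type", cls) for prop, cls in domain if prop == p)
            if p == "rdf:type":
                new.update(
                    (s, "rdf:type", parent) for child, parent in sub_class if child == o
                )
        if new <= facts:
            return facts
        facts |= new


def display_locations(facts):
    """Mimic the SELECT: every :DisplayLocation with its geo:lat and geo:long."""
    items = sorted(
        s for s, p, o in facts if p == "rdf:type" and o == ":DisplayLocation"
    )
    rows = []
    for item in items:
        lats = [o for s, p, o in facts if s == item and p == "geo:lat"]
        longs = [o for s, p, o in facts if s == item and p == "geo:long"]
        rows.extend((item, lat, long) for lat in lats for long in longs)
    return rows


before = display_locations(set(SCHEMA + DATA))
after = display_locations(rdfs_closure(SCHEMA + DATA))
print(before)
print(after)
```

Without the inference step the query finds nothing, because the raw data says nothing about `:DisplayLocation`. The model is what connects the station spreadsheet to the map widget.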
Exercises
Will use TopBraid™ Ensemble and TopBraid™ Composer
Using data from
– oeGov
– USASpending.gov
– … others TBD …
Merge, slice, amalgamate, etc…