
Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Date post: 27-Jan-2015
Upload: rinke-hoekstra
Transcript
Page 1: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Converts' rally, Evangelistic Committee of New York City, Carnegie Hall, Sept.14, 1908

Page 2: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 3: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Linked Open Data: Six Ingredients

The missing ★

Mix ‘n Mash

Contextualize!

Choose your Grain Size

Lower the Threshold

Repeatable Transformation

Page 4: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

1. The missing ★

http://give.everything/a/URI

HTTPs URIs only please! (or resolver + URN)

Version information

Version agnostic

Guessable
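
The three URI properties above can be read as one predictable scheme: a version-agnostic identifier for the thing itself, plus versioned variants built from it. A minimal sketch, assuming a hypothetical base `http://example.org/id` and a date string as version label; neither comes from the slides.

```python
from urllib.parse import quote

BASE = "http://example.org/id"  # hypothetical base URI, not from the slides


def agnostic_uri(kind: str, local_id: str) -> str:
    """Version-agnostic URI: a stable name for the thing, whatever its current version."""
    return f"{BASE}/{quote(kind)}/{quote(local_id)}"


def versioned_uri(kind: str, local_id: str, version: str) -> str:
    """Versioned URI: the agnostic URI followed by an explicit version segment."""
    return f"{agnostic_uri(kind, local_id)}/{quote(version)}"


if __name__ == "__main__":
    # Anyone who knows the pattern can guess both URIs from the identifier alone.
    print(agnostic_uri("dataset", "parking-zones"))
    print(versioned_uri("dataset", "parking-zones", "2014-06-01"))
```

Because the versioned URI extends the agnostic one, a client can move between the two by adding or dropping the last path segment.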

Page 5: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

2. Repeatable Transformation

Transformation should be part of routine ...

... manageable and scalable ...

... repeatable ...

Linked Data will not be the official source anytime soon

http://www.w3.org/TR/prov-overview/

Provenance is key
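
A repeatable transformation can be sketched as a deterministic function that is run together with a small provenance record of what went in and what came out. The transformation itself, the record fields and the `script_version` label below are hypothetical illustrations, not something the slides prescribe.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Fingerprint of a byte string, used to identify inputs and outputs."""
    return hashlib.sha256(data).hexdigest()


def transform(rows):
    """Hypothetical transformation: lower-case the keys and strip whitespace."""
    return [{k.strip().lower(): v.strip() for k, v in row.items()} for row in rows]


def run_with_provenance(rows, script_version: str):
    """Run the transformation and return (result, provenance record)."""
    source_bytes = json.dumps(rows, sort_keys=True).encode("utf-8")
    result = transform(rows)
    result_bytes = json.dumps(result, sort_keys=True).encode("utf-8")
    record = {
        "activity": "transform",
        "script_version": script_version,      # which version of the routine ran
        "used": sha256_of(source_bytes),       # fingerprint of the official source
        "generated": sha256_of(result_bytes),  # fingerprint of the derived data
        "ended_at": datetime.now(timezone.utc).isoformat(),
    }
    return result, record


if __name__ == "__main__":
    data, prov = run_with_provenance([{" Name ": " Amsterdam "}], "v1")
    print(data)
    print(prov)
```

Running it twice on the same source yields the same `used` and `generated` fingerprints, which is one simple way to check that the routine is in fact repeatable.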

Page 6: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

3. Choose your Grain Size

• The document is the traditional grain size (Dublin Core)

• Linked data allows for deep links into data

• Cost versus usefulness

• Are you the right party to provide detailed descriptions?

http://creatingandeducating.blogspot.nl/2011/11/blog-post.html

Page 7: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

4. Mix ‘n Mash

• Multiple vocabularies won’t bite

• Multiple identifiers won’t bite


• Choose what’s useful for you...

• ... then map to others!

Image © David Sykes 2009 All rights reserved

Good News: the bulk has already been done for you!
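
"Choose what's useful, then map to others" can be as small as a table of identifier mappings serialised as `owl:sameAs` links. A minimal sketch; the local URI is hypothetical, the DBpedia URI is used purely as an example target, and whether `owl:sameAs` or a weaker mapping property fits is a modelling decision the slides leave open.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical mapping table: our own identifier -> identifiers used by others.
MAPPINGS = {
    "http://example.org/id/city/amsterdam": [
        "http://dbpedia.org/resource/Amsterdam",
    ],
}


def mapping_triples(mappings):
    """Yield one N-Triples line per mapping between a local and an external URI."""
    for local, others in sorted(mappings.items()):
        for other in others:
            yield f"<{local}> <{OWL_SAME_AS}> <{other}> ."


if __name__ == "__main__":
    for line in mapping_triples(MAPPINGS):
        print(line)
```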

Page 8: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

5. Contextualize!

• Information is not always compatible

• Make explicit in which context the information holds ...

• ... and who stated the information, why and how.

Flat Earth and Square Earth idea courtesy of Szymon Klarman
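
One way to make context explicit is to store every statement together with the context it holds in and the party that stated it, so that incompatible claims can sit side by side. A minimal sketch using the flat/square Earth example; the `ex:` names and the `Statement` structure are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    obj: str
    context: str    # in which context the statement holds
    stated_by: str  # who stated it


STATEMENTS = [
    Statement("ex:Earth", "ex:shape", "flat", "ex:contextA", "ex:sourceA"),
    Statement("ex:Earth", "ex:shape", "square", "ex:contextB", "ex:sourceB"),
]


def holds_in(statements, context):
    """Return only the statements asserted within the given context."""
    return [s for s in statements if s.context == context]


def conflicts(statements):
    """Collect values per (subject, predicate); more than one value means sources disagree."""
    seen = {}
    for s in statements:
        seen.setdefault((s.subject, s.predicate), set()).add(s.obj)
    return {key: values for key, values in seen.items() if len(values) > 1}


if __name__ == "__main__":
    print(holds_in(STATEMENTS, "ex:contextA"))
    print(conflicts(STATEMENTS))
```

Within a single context the data stays consistent; the disagreement only becomes visible when contexts are merged.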

Page 9: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 10: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 11: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Data2Semantics

From Data to Semantics for Scientific Data Publishers

Page 12: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 13: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Photo by Filip Dujardin, http://www.filipdujardin.be

Page 14: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 15: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 16: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 17: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 18: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 19: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 20: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 21: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 22: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Provenance and Reuse of Open Data

Rinke Hoekstra

VU University Amsterdam / University of Amsterdam

Photo by Filip Dujardin, http://www.filipdujardin.be

Page 23: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 24: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Definition (Oxford English Dictionary)

• The fact of coming from some particular source or quarter; origin, derivation;

• the history or pedigree of a work of art, manuscript, rare book, etc.;

• concretely, a record of the passage of an item through its various owners.

Page 25: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Making trust judgements

Liability, trust and privacy in open government data

Compliance and auditing of business processes

Licensing and attribution of combined information

Page 26: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Curt Tilmes, Peter Fox, Xiaogang Ma, Deborah L. McGuinness, Ana Pinheiro Privette, Aaron Smith, Anne Waple, Stephan Zednik, Jinguang Zheng: Provenance Representation for the National Climate Assessment in the Global Change Information System. IEEE T. Geoscience and Remote Sensing 51(11): 5160-5168 (2013)

Integrated & Summarized Data

Transparency and Trust

“Provenance is the number one issue that we face when publishing government data in data.gov.uk”

John Sheridan, UK National Archives, data.gov.uk

Page 27: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Provenance?

• Provenance = Metadata? Provenance can be seen as metadata, but not all metadata is provenance.

• Provenance = Trust? Provenance provides a substrate for deriving different trust metrics.

• Provenance = Authentication? Provenance records can be used to verify and authenticate amongst users.

Page 28: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 29: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 30: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Three Dimensions

• Content: capturing and representing provenance information

• Management: storing, querying, and accessing provenance information

• Use: interpreting and understanding provenance in practice

Page 31: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)


Page 32: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)


Page 33: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Three Dimensions

• Content: capturing and representing provenance information (recording, annotating, workflows)

• Management: storing, querying, and accessing provenance information (scalability, interoperability)

• Use: interpreting and understanding provenance in practice (trust, accountability, compliance, explanation, debugging)
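
Querying stored provenance for explanation or debugging often comes down to a lineage question: what does this item ultimately derive from? A minimal sketch over a hypothetical in-memory table of derivation edges; the file names are invented, and a real store would more likely be a triple store or database.

```python
from collections import deque

# Hypothetical derivation edges: item -> items it was directly derived from.
DERIVED_FROM = {
    "report.pdf": ["table.csv"],
    "table.csv": ["raw-a.xls", "raw-b.xls"],
    "raw-a.xls": [],
    "raw-b.xls": [],
}


def lineage(item, derived_from):
    """Return every item that `item` directly or indirectly derives from (breadth-first)."""
    seen = []
    queue = deque([item])
    while queue:
        current = queue.popleft()
        for source in derived_from.get(current, []):
            if source not in seen:
                seen.append(source)
                queue.append(source)
    return seen


if __name__ == "__main__":
    print(lineage("report.pdf", DERIVED_FROM))
```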

Page 34: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Standardization

Page 35: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

W3C PROV Standard

Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.

http://www.w3.org/TR/prov-overview
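
The definition above maps onto the three core PROV classes (entity, activity, agent) and the relations between them. A minimal sketch that writes such a record as N-Triples using PROV-O terms; the `http://example.org/` resources and the scenario (an office converting one dataset version into the next) are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for the example resources


def triple(s: str, p: str, o: str) -> str:
    """Format one N-Triples line whose subject, predicate and object are all URIs."""
    return f"<{s}> <{p}> <{o}> ."


def describe_publication():
    """Describe how a derived dataset came about: what was used, by which activity, by whom."""
    dataset, source = EX + "dataset-v2", EX + "dataset-v1"
    activity, agent = EX + "conversion-run", EX + "data-office"
    return [
        triple(dataset, RDF_TYPE, PROV + "Entity"),
        triple(source, RDF_TYPE, PROV + "Entity"),
        triple(activity, RDF_TYPE, PROV + "Activity"),
        triple(agent, RDF_TYPE, PROV + "Agent"),
        triple(activity, PROV + "used", source),
        triple(dataset, PROV + "wasGeneratedBy", activity),
        triple(dataset, PROV + "wasDerivedFrom", source),
        triple(activity, PROV + "wasAssociatedWith", agent),
        triple(dataset, PROV + "wasAttributedTo", agent),
    ]


if __name__ == "__main__":
    print("\n".join(describe_publication()))
```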

Page 36: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Luc Moreau & Paul Groth

W3C PROV Standard

Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.

http://www.w3.org/TR/prov-overview

Page 37: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

http://doc.metalex.eu

http://yasgui.data2semantics.org

Page 38: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 39: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Interpretation

Page 40: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 41: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Naive Approaches

InProv: Visualizing Provenance Graphs with Radial Layouts and Time-Based Hierarchical Grouping Madelaine D. Boyd - http://www.seas.harvard.edu/sites/default/files/files/archived/Boyd.pdf

Orbiter has several limitations. It does not have capabilities for query subgraph highlighting, regular expression filters, process grouping, annotations, or programmable views [16]. Furthermore, the structure of each summary node, where child nodes are grouped within parents and are hidden until the parent is expanded, benefits queries earlier in the dependency chain. Initial overviews often correspond with system bootup, and appear very similar across different traces (time slices of system activity).

Figure 10: In these screenshots of Orbiter, the presence of edges overwhelms the visibility of nodes. By relying on a node-link graph layout and using spatial location to encode object relationships, Orbiter’s graph layout algorithm must draw many long edges to communicate node connections. Without edge bundling or opacity variation, the meanings of these relationships are obscured.

Another one of Orbiter’s weaknesses is its node-link diagram layout. As a result, each node’s position in the X-Y plane and the length and angle of connecting lines are wasted attributes. The chosen graph layout algorithm (dot by default) arranges nodes to minimize edge crossings and total edge lengths. However, depending on the interrelationships among nodes, it may be impossible to find an optimal layout. In this case, undesirable designs with dense quantities of long edges may emerge, as seen in Figure 10. At the scale of a typical provenance graph, related nodes may be drawn far apart. This weakens the effectiveness of edges as “connections” that show relationships between nodes.

2.4 Large Graph Visualization

While a complete survey of graph and tree visualization is beyond the scope of this paper, I will summarize some notable approaches. See Herman et al. for a more detailed overview of graphs and information visualization [27], or see Ellis and Dix for an overview of clutter reduction techniques for visualization of large data sets [20].

There is a variety of current efforts to visualize large graphs. Many of these tools were designed for social network or genomics data sets, for which there is a motivation to see both patterns in the data set at large, as well as node-level detail. Visualization attempts for large graphs mostly fall within three categories: summary node-link diagrams, tree

Figure 11: (Top): A screenshot of the portion of the graph generated by GraphViz for a trace of the third provenance challenge. (Bottom): A zoomed-in view of the same graph. The horizontal black bars across the images are dense collections of edges.

Effective large graph visualizations present the user with a summary view that can be explored, filtered, and expanded interactively.

2.5 Tree Visualization

While trees are a subcategory of graphs, because of their hierarchical composition, tree visualization forms its own subfield of research. A survey of over two hundred tree visualizations is given at Hans-Jörg Schulz’s treevis.net. Visitors can narrow down by dimensionality (2D, 3D, or mixed), representation (explicit node-link diagram, implicit treemap, or combination), alignment (XY plot, radial layout, or free diagram) [55]. These categories are shown by the icons in Figure 13.

Figure 12: Left: Pajek uses various summary node-link and matrix-based representations depending on the structure of the supplied data set. Pictured is a main core subgraph extracted from routing data on the Internet. Right: TopoLayout optimizes the choice of visualization display depending on the underlying graph structure. The right column is TopoLayout’s output, while the left and middle columns are the outputs of the GRIP and FM graph layout algorithms.

Figure 13: treevis.net defines different categories for tree maps. Tree maps can be categorized by dimensionality (2D, 3D, or mixed), representation (explicit, implicit, or mixed), or alignment (XY, radial, or spring).

Tree visualizations are either explicit or implicit. Explicit representations resemble node-link diagrams. An example of an implicit representation is a tree map, a diagram where the entire tree is inscribed in a rectangle representing the root node. This root is subdivided hierarchically into more rectangles, which represent child nodes, and each child node is subdivided into more child nodes. Treemaps are excellent for displaying hierarchical or categorical data [57]. One famous example, shown in Figure 14, is the “Map of the Market” from SmartMoney.com, which displays in red and green the changes in market value of publicly-traded companies, grouped by market sector, with cell size proportional to market capitalization [64].

TreePlus is an example of a tree-inspired graph visualization tool (Figure 15). It uses the guiding metaphor of “plant a seed to watch it grow” to summarize navigation of its tree-based large graph visualization tool [42]. The visual interface displays a tree, starting from the graph root or a user-specified starting node. Nodes at the same level are listed vertically; parents and children are listed to the left or right. When the user hovers over displayed

Page 42: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 43: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Width of activities and entities is based on information flow

Activities and entities are extracted from an ego graph
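
An ego graph is the part of a graph that lies within a given number of steps from one focal node. A minimal sketch of that extraction over a hypothetical provenance graph; the slides do not say how the ego graph or the flow-based widths are actually computed, so this version ignores edge direction and leaves widths out.

```python
from collections import deque

# Hypothetical provenance edges between entities (files) and activities (steps).
EDGES = [
    ("raw.csv", "clean"), ("clean", "clean.csv"),
    ("clean.csv", "publish"), ("publish", "dataset"),
    ("other.csv", "unrelated"),
]


def ego_graph(edges, center, radius):
    """Return (nodes, edges) within `radius` undirected hops of `center`."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == radius:
            continue  # do not expand beyond the requested radius
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    nodes = set(distance)
    return nodes, [(a, b) for a, b in edges if a in nodes and b in nodes]


if __name__ == "__main__":
    print(ego_graph(EDGES, "clean.csv", 1))
```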

Page 44: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Capturing

Page 45: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 46: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 47: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Application Developers want to consume data: “We need an intuitive REST-like API to integrate Open Government data. Dealing with all these different formats and identifiers is really taking too much time.”

Civil Servant wants to publish data: “I have all this data, and I want to make (part of) it available for the general public, but haven't a clue how!”

[Architecture diagram: Amsterdam Open Data Exchange (ODE)]

Apps and applications: Visual interactions with Open Data. Application-specific logic (e.g. 'danger')

CitySDK API: HTTP API to the CitySDK. Returns JSON, Turtle, etc. (includes the Linked Data API of CitySDK)

SPARQL API: SPARQL Endpoint to the Linked Data storage of the ODE

Partial Synchronisation

CitySDK Datastores

Linked Data Triplestore

Feed into

Query Orchestrator

Amsterdam Open Data Exchange: HTTP API to 'canned queries' across multiple datasets. Returns JSON-LD, Turtle

Data Integrator

ODE Best Practices: Best practices for publishing Open Data

CitySDK Ingestion Plugins: "Standard" adapters part of CitySDK

ODE Ingestion Adapters: Ingestion adapters developed within ODE

Municipal Legacy Systems; Excel Files; Amsterdam Open Data CKAN

Amsterdam Open Data Catalog: Will point to datasets in the ODE. May provide a direct query interface on top of ODE

Wrapper-based
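
The slides do not spell out what wrapper-based capture looks like in code. One possible reading is that an existing processing step is wrapped so that each call leaves a provenance record behind, without changing the step itself. A minimal sketch under that assumption; the decorator, the in-memory log and the `to_upper` step are all hypothetical.

```python
import functools

PROVENANCE_LOG = []  # hypothetical in-memory store for captured records


def capture_provenance(func):
    """Wrap a function so that each call appends what it used and what it generated."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "used": {"args": args, "kwargs": kwargs},
            "generated": result,
        })
        return result
    return wrapper


@capture_provenance
def to_upper(names):
    """Hypothetical processing step; it knows nothing about provenance."""
    return [n.upper() for n in names]


if __name__ == "__main__":
    to_upper(["amsterdam"])
    print(PROVENANCE_LOG)
```

The wrapped step keeps its own name and behaviour, so callers do not need to change.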

Page 48: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Workflow-based

Page 49: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Tom de Nies (Ghent University)

Sara Magliacane (VU University Amsterdam)

Page 50: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Integrated

Page 51: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 52: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
Page 53: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Data2Semantics

From Data to Semantics for Scientific Data Publishers

The Big Future of Data, 2 October 2014

Page 54: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Enrich Publish Analyze

Page 55: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Semantic Publication of Data

Publish directly from the cloud to the cloud

On-the-fly analysis and tag suggestion

Page 56: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Interactive Data Construction via Instrumented IPython Notebook

Integration in popular tool

No “green field”

Page 57: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Visual Exploration of Big Data

Virtualisation

Discover patterns

Interactive visualisation

Sparse and heterogeneous

Page 58: Provenance and Reuse of Open Data (PILOD 2.0 June 2014)
