20111120 warsaw learning curve by b hyland notes


Climbing the Learning Curve with Linked Data

Open Government Data Camp 20-Oct-2011

Bernadette Hyland, CEO

bhyland@3roundstones.com

Twitter @BernHyland

Wednesday, October 19, 2011

Information overload, Impatient society, Change is the only constant

Software is not valued by its usefulness ... but by its expected future value

• Linked Data is about publishing and consuming data using international data standards

• Based on a 20-year-old idea

• Goal is to solve organizational issues related to data silos, requirements for faster data integration and an environment of reduced IT budgets

Why am I speaking on Linked Data and sharing? I’m here in my role as co-chair of the W3C Government Linked Data Working Group (GLD WG). I’m also a long-time entrepreneur in this space, having founded companies that led to several of the most widely used Open Source projects for Linked Data, including Mulgara, OpenRDF/Sesame, PURLs 2.0 and Callimachus. I’ve authored chapters in two of these peer-reviewed books published by Springer, which are available in hardcopy or for free via the Web.

There is a Process

Identify → Model → Name → Describe → Convert → Publish → Maintain

Identify the data, model exemplar records -- what you are going to carry forward & what you are going to leave behind. Name all of the NOUNs. Turn the records into URIs. Next, describe RESOURCES with vocabularies. Write a script or process to convert from canonical form to RDF. Then publish. Maintain over time.
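The name, describe and convert steps can be sketched in a few lines of Python. This is only a minimal illustration, not the process used in the talk: the base URI, the `Record` class URI, the column names and the sample rows are all hypothetical placeholders, and the output is plain N-Triples built with the standard library.

```python
import csv
import io
from urllib.parse import quote

# All URIs and the sample data below are hypothetical placeholders.
BASE_URI = "http://example.org/id/record/"
RECORD_CLASS = "http://example.org/def/Record"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

SAMPLE_CSV = "id,name\nitem-1,First example record\nitem-2,Second example record\n"


def escape_literal(text):
    """Escape backslashes, quotes, and line breaks for an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def name_record(record_id):
    """Name step: turn a record identifier into a URI."""
    return BASE_URI + quote(record_id, safe="")


def describe_record(row):
    """Describe step: state the record's type and label as N-Triples lines."""
    subject = name_record(row["id"])
    label = escape_literal(row["name"])
    return [
        f"<{subject}> <{RDF_TYPE}> <{RECORD_CLASS}> .",
        f'<{subject}> <{RDFS_LABEL}> "{label}" .',
    ]


def convert(csv_text):
    """Convert step: a repeatable conversion of a CSV extract to N-Triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        triples.extend(describe_record(row))
    return triples


if __name__ == "__main__":
    print("\n".join(convert(SAMPLE_CSV)))
```

Because the conversion is a pure function of the extract, it can be re-run whenever the source data changes, which is what makes the later maintain step practical.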

Preparation

1. Leverage what exists

• Request a copy of the logical and physical model of the database(s)

• Obtain data extracts (i.e., databases and/or spreadsheets) or create data in a way that can be replicated.

Linked Data modelers typically model two or three exemplar objects to begin the process. We figure out the relationships and identify how each object relates to the real world, initially drawing on a large white board or collaborative wiki site.

Model the data

2. Model data without context to allow for reuse and easier merging of data sets

•Traditional DBAs organize data for specified Web services or applications.

•With LD, application logic does not drive the data schema, concepts, etc.

LD domain experts model data without context, whereas traditional modelers typically organize data for specified Web services or applications. Application logic does not drive the data schema. This better enables data reuse and easier merging of data sets.

Model the data

3. Look for real world objects of interest (e.g., people, places, things, locations) and model them.

• Investigate how others are already modeling similar or related data.

• Look for duplication and normalize the data

•Use common sense to decide whether or not to make a link


Model the data ...

4. Connect data from different sources and authoritative vocabularies (see list of popular vocabularies below).

•Use URIs as names for your objects

During the modeling process, don’t think about how an application will use your data. Instead, focus on modeling the real world things the data describes and how they are related to other objects. Take the time to understand the data and how the objects represented in the data are related to each other.

Model the data ...

•Put aside immediate needs of any application

•Don’t think about how an application will use your data

•Do think about time and how the data will change over time.

Focus on modeling real world things that are known about the data and how it is related to other objects. Take the time to understand the data and how the objects represented in the data are related to each other.

Convert, Publish & Maintain

5. Write a script or process to convert the data set repeatedly

6. Publish to the Web and announce it! (more details shortly)

7. Maintenance strategy (more details in the social contract at the end)

  1. Expect to be maintained in perpetuity

  2. Do not encode the name of the department or agency currently defining and naming a concept, as that may be re-assigned

  3. Support a direct response, or redirect to department/agency servers
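The maintenance points above amount to a redirect table: the public URI names the concept, and a resolver forwards requests to whichever department or agency server hosts the data today. The following is a rough sketch under stated assumptions, not a description of any real PURL server: the paths, the `.example` host names, and the `resolve` and `reassign` helpers are hypothetical, and a 302 response is just one of several redirect types a resolver might use.

```python
from http import HTTPStatus

# Hypothetical redirect table: persistent URI prefix -> current server prefix.
# The persistent side names the concept, not the agency that hosts it today.
redirects = {
    "/id/substance/": "http://data.agency-a.example/substance/",
}


def resolve(path, table):
    """Return (status, location) for a request to a persistent URI path."""
    # Prefer the longest matching prefix so specific rules win over general ones.
    for prefix in sorted(table, key=len, reverse=True):
        if path.startswith(prefix):
            return HTTPStatus.FOUND, table[prefix] + path[len(prefix):]
    return HTTPStatus.NOT_FOUND, None


def reassign(table, prefix, new_target):
    """Point a persistent prefix at a new host without changing public URIs."""
    updated = dict(table)
    updated[prefix] = new_target
    return updated


if __name__ == "__main__":
    print(resolve("/id/substance/123", redirects))
    moved = reassign(redirects, "/id/substance/", "http://data.agency-b.example/registry/")
    print(resolve("/id/substance/123", moved))
```

When responsibility for a concept moves to another agency, only the table entry changes; every published link to `/id/substance/...` keeps working.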

Take the plunge ... Be forgiving

•Simplistic data models can still be useful

•Better to make progress with something rather than do nothing because we cannot be comprehensive and complete

Science still doesn’t have a good understanding of a gene. We have gene therapy yet we haven’t agreed on a definition of a gene.

We capture vast quantities of topographical data (USGS), yet scientists still debate the meaning of topographical elements. From the time we are young children, we use monosyllabic words to navigate trees and roads. If our parents had said we could not do anything until we had a perfect model of the world, we could never have learned to navigate our homes as toddlers.

Take an iterative approach

1. Review modeling decisions

2. Review vocabularies chosen and developed

3. Modify/update data conversion scripts

4. Do a maintenance walk-through with real use cases

5. Show how to explore data with SPARQL and visualizations

6. Discuss a persistent identifier strategy (think PURLs)

Iterate on this process in short sprints, two weeks at a time. Don’t be afraid to review modeling decisions with SMEs. Review vocabulary choices. Do a maintenance walk-through with actual use cases and ensure the team can carry forward. Show people their OWN DATA in visualization tools like Callimachus.

Reality ... We started with the usual CSV dump ... ugly, cumbersome data

We used two common RDF vocabulary description languages in our modeling for SRS: RDF Schema (RDFS) and Simple Knowledge Organization System (SKOS). RDFS is used to give labels to objects, synonyms and substance lists. Human-readable comments were added using the rdfs:comment property.
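As a rough illustration of what such labeling produces, the sketch below builds a small Turtle description for one resource. The resource URI, the label text and the `describe` helper are hypothetical. The use of `skos:altLabel` for synonyms is an assumption made here for illustration; the notes do not say which property carried synonyms in the SRS model.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
SKOS = "http://www.w3.org/2004/02/skos/core#"


def turtle_string(text):
    """Quote and escape a Python string as a Turtle string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def describe(uri, label, comment, synonyms=()):
    """Build a Turtle description with a label, a comment, and synonyms."""
    statements = [
        f"rdfs:label {turtle_string(label)}",
        f"rdfs:comment {turtle_string(comment)}",
    ]
    # Assumption: synonyms are expressed with skos:altLabel.
    statements += [f"skos:altLabel {turtle_string(s)}" for s in synonyms]
    body = " ;\n    ".join(statements)
    return (
        f"@prefix rdfs: <{RDFS}> .\n"
        f"@prefix skos: <{SKOS}> .\n\n"
        f"<{uri}>\n    {body} .\n"
    )


if __name__ == "__main__":
    # Hypothetical resource, used only to show the shape of the output.
    print(describe(
        "http://example.org/id/substance/1",
        "Example substance",
        "Human-readable note about this record.",
        synonyms=["Example synonym"],
    ))
```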

Possible Solutions for Data Management

•Roll your own three-tier

•Content Management System

•Wiki-based

•Linked Data Management System

A few different possible solutions to the three challenges stated earlier

Content Management Systems

•WordPress

•Drupal

•Joomla!

The big downside to three-tier architecture is the upfront cost, as well as getting people to agree upfront on the schema. So we then looked at CMSs. These systems can be up and running the same day; however, they are architected to work well with primarily unstructured content.

We have a strong heritage in FLOSS projects, starting with the first community-supported RDF database in 2003. We offered a commercial version used primarily by the US defense community, and in 2004 open sourced 80% of it into what became the Mulgara triple store, which is used by institutions all over the world. OpenRDF/Sesame was led by Aduna.

Linked Data Management System

•Callimachus (kəlĭm'əkəs) is a framework for data-driven applications based on Linked Data principles.

•Callimachus allows Web authors to quickly and easily create semantically-enabled Web applications.

Wiki systems don't handle structured content well, nor do they promulgate change well. A tool for Web 2.0 developers creating DATA RICH web sites was needed … We created Callimachus, a triples up & down solution (no MySQL under the covers). HIGHLY SCALABLE for real world use.

Callimachus is named for the father of bibliography (the Pinakes) at the Great Library of Alexandria, who lived from 305 to c. 240 BCE. He could not categorize his own work using Aristotle's hierarchical system. He was the first person who defined the use case for Linked Data.

Callimachus uses RDFa as a query language; templates are parsed to build SPARQL from RDFa markup, and the query result set is returned to the Web page for a human to read or a machine to parse. This is very valuable and, to our knowledge, there is no other solution available as FLOSS or commercially that compares to Callimachus at this time.
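The general idea of deriving a query from markup can be shown with a toy example. This is not Callimachus code and makes no claim about how Callimachus parses templates: the template, the `PropertyCollector` class and the `template_to_sparql` function are invented for illustration. The sketch reads the RDFa `property` attributes in an HTML fragment and emits one optional SPARQL triple pattern for each.

```python
from html.parser import HTMLParser

# Hypothetical template: each RDFa "property" attribute marks a value to fetch.
TEMPLATE = """
<div>
  <h1 property="rdfs:label"></h1>
  <p property="rdfs:comment"></p>
</div>
"""

PREFIXES = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"


class PropertyCollector(HTMLParser):
    """Collect the values of RDFa property attributes in document order."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "property" and value:
                self.properties.append(value)


def template_to_sparql(template, resource_uri):
    """Build a SELECT query asking for every property the template mentions."""
    collector = PropertyCollector()
    collector.feed(template)
    variables = [f"?v{i}" for i in range(len(collector.properties))]
    patterns = [
        f"  OPTIONAL {{ <{resource_uri}> {prop} {var} }}"
        for prop, var in zip(collector.properties, variables)
    ]
    select_vars = " ".join(variables)
    return PREFIXES + f"SELECT {select_vars} WHERE {{\n" + "\n".join(patterns) + "\n}"


if __name__ == "__main__":
    # The resource URI is a placeholder for the page's subject.
    print(template_to_sparql(TEMPLATE, "http://example.org/id/record/item-1"))
```

The appeal of this style is that the page author writes only markup; the query follows from what the page says it wants to display.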

Once we had the data modeled and validated with SMEs, we converted it and loaded it into Callimachus. We spent about 1 hour creating templates to view the data in Callimachus. So here is the power of LOD in action -- within one hour, we could view the data, navigate through the data and verify the contents without being a DBA or Java developer!

Callimachus’ forms driven interface allows authorized users to modify the underlying triples in the database -- we are round tripping create/modify/delete to a triple store via a Web page!

Note the fixed name and added comment.

A history of changes is kept. Note the change to the name and the added comment, along with the time/date and name of the user who made the edit.

Callimachus view page of the SRS, created in less than an hour. Someone with HTML, CSS and RDFa / SPARQL skills can create this type of page. No understanding of semantics or deep RDF knowledge is required.

Notice the wiki-like editing capabilities of a Callimachus page!

•Web 2.0 developers can create data-driven applications with templates in hours

•Triples up & down (no MySQL under the covers)

•Wiki editing of content

•Access control

•Collaboration via Web

•Change tracking (history)

•Page/form Templates

Callimachus is a great way to collaboratively manage your Linked Data. MediaWiki is to free text what Callimachus is to Linked Data. Callimachus uses a straightforward ACL for Linked Data.

Join the Community

•Callimachus has benefited from 2+ years of corporate support

•We’re using it for real world Web applications in environmental protection, finance and healthcare

•We’d love to work with the publishing industry

•Open Source project

•Visit callimachusproject.org

• Join the discussion

Twitter: @BernHyland

Email: bhyland@3roundstones.com

WHY SHARE AND WHO BENEFITS?

Bernadette Hyland, co-chair W3C Government Linked Data Working Group

http://purl.org/net/bhyland/why-share-2011-10

Next talk today @ 14:00 Sala I - “Linked Open Government Data Workshop”

