Iswc 2014-hammond-pasin-presentation-final

Talk for ISWC 2014 (Industry Track) by Tony Hammond and Michele Pasin on October 22, 2014 at Riva del Garda, Italy: 'Linked data experience at Macmillan: Building discovery services for scientific and scholarly content on top of a semantic data model'
Page 1: Iswc 2014-hammond-pasin-presentation-final

LINKED DATA EXPERIENCE AT MACMILLAN

Building discovery services for scientific and scholarly content on top of a semantic data model

22 October 2014

Tony Hammond

Michele Pasin

Page 2: Iswc 2014-hammond-pasin-presentation-final

Linked Data at Macmillan | 22 October 2014

1

Background

About Macmillan and what we are doing

Page 3: Iswc 2014-hammond-pasin-presentation-final

Macmillan Science and Education

Group brands and businesses

Page 4: Iswc 2014-hammond-pasin-presentation-final

MS&E Current trends

Change Drivers

●Digital first workflow

– print becomes secondary

– support for multiple workflows

●User-centric design

– things, not data

– focus on user experience

●Deeply integrated datasets

– standard naming convention

– common metadata model

– flexible schema management

– rich dataset descriptions

Developing a richer graph of objects

Page 5: Iswc 2014-hammond-pasin-presentation-final

NPG Linked Data Platform (2012)

Deliverables (2012–2014)

●Prototype for external use

●Two RDF dataset releases in 2012

– April 2012 (22m triples)

– July 2012 (270m triples)

●Live updates to query endpoint

●SPARQL query service (decommissioned)

Current Work (2014–)

●Focus on internal use-cases

●Publish ontology pages

●Periodic data snapshots

data.nature.com

Page 6: Iswc 2014-hammond-pasin-presentation-final

NPG Core Ontology (2014)

Features

●Classes: ~65

●Properties: ~200

●Named graphs (per class)

Namespaces

●npg: => http://ns.nature.com/terms/

●npgg: => http://ns.nature.com/graphs/
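
The two prefixes above can be expanded mechanically into full URIs. The following is a minimal sketch of that expansion using only the namespace URIs given on the slide; the local names `Article` and `articles` are hypothetical and are not taken from the ontology itself.

```python
# Prefix-to-namespace map, copied from the slide.
PREFIXES = {
    "npg": "http://ns.nature.com/terms/",    # ontology terms
    "npgg": "http://ns.nature.com/graphs/",  # named graphs
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as 'npg:Article' into a full URI."""
    prefix, sep, local = curie.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {curie!r}")
    return PREFIXES[prefix] + local


# 'Article' and 'articles' are illustrative local names (assumptions).
print(expand("npg:Article"))
print(expand("npgg:articles"))
```

With one named graph per class, a class term and its graph would each get a URI under its own namespace, which is what the two calls illustrate.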

Approach

●Incremental formalization (RDF, RDFS, OWL-DL)

●Shared metamodel vs. automatic inference

●Minimal commitment to external vocabs

Things: assets, documents, events, types

Page 7: Iswc 2014-hammond-pasin-presentation-final

NPG Subject Pages (2014)

Features

●Based on SKOS taxonomy

– >2500 scientific terms

– content inherited via SKOS tree

●Dynamically generated

– one webpage per subject term

– secondary pages for article types

●Various formats, e.g. e-alerts, feeds

– allows people to ‘follow’ a subject

●Customized related content

– ads, jobs, events, etc.
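
"Content inherited via SKOS tree" can be read as: a subject page lists content tagged with its own term and with any narrower term. The sketch below illustrates that reading under stated assumptions. The taxonomy, term names and article identifiers are all hypothetical, and each term has a single broader term for simplicity, although SKOS itself allows more than one.

```python
from collections import defaultdict

# Hypothetical miniature taxonomy: term -> its broader (parent) term.
BROADER = {
    "biochemistry": "biological-sciences",
    "biophysics": "biological-sciences",
    "enzymes": "biochemistry",
}

# Hypothetical tagging: term -> articles tagged directly with that term.
TAGGED = {
    "enzymes": ["article-1"],
    "biochemistry": ["article-2"],
    "biophysics": ["article-3"],
}


def narrower_index(broader):
    """Invert the broader relation into term -> list of narrower terms."""
    children = defaultdict(list)
    for child, parent in broader.items():
        children[parent].append(child)
    return children


def inherited_content(term, children, tagged):
    """Collect articles tagged with the term or any of its descendants."""
    seen, stack, found = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:  # guard against accidental cycles
            continue
        seen.add(current)
        found.update(tagged.get(current, []))
        stack.extend(children.get(current, []))
    return sorted(found)


CHILDREN = narrower_index(BROADER)
print(inherited_content("biochemistry", CHILDREN, TAGGED))
print(inherited_content("biological-sciences", CHILDREN, TAGGED))
```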

Topical access to content

Page 8: Iswc 2014-hammond-pasin-presentation-final

2

Data Storage and Query

Achieving speed by means of a hybrid architecture

Page 9: Iswc 2014-hammond-pasin-presentation-final

Content Hub

Capabilities

●Discovery – Graph

●Storage – Content Repos

Features

●Hybrid RDF + XML architecture

– MarkLogic for XML, RDF/XML

– Triplestore (TDB) for RDF validation

●Repos for binary assets

Datasets

●Documents (large; >1m)

●Ontologies (small; <10k)

Managed content warehouse for data discovery

Page 10: Iswc 2014-hammond-pasin-presentation-final

System Architecture

Hub content

Page 11: Iswc 2014-hammond-pasin-presentation-final

Content Discovery – Principles

Generations

●1st – Generic linked data API (RDF/*)

●2nd – Specific page model API (JSON)

Concerns

●Speed (20ms single object; 200ms filtered object)

●Simplicity (data construction)

●Stability (backup, clustering, security, transactions)

Principles

●Chunky not chatty, all data in a single response

●Data as consumed, rather than as stored

●Support common use cases in simple, obvious ways

●Ensure a guaranteed, consistent speed of response for more complex queries

●Build on foundation of standard, pragmatic REST (collections, items)
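
The "chunky not chatty" and "data as consumed" principles can be illustrated with a page model that gathers everything one page needs into a single JSON document. This is only a sketch: the in-memory stores, field names and the `subject_page` function are hypothetical and do not describe the actual API.

```python
import json

# Hypothetical in-memory stand-ins for the content hub.
SUBJECTS = {"biophysics": {"label": "Biophysics"}}
ARTICLES = [
    {"id": "a1", "title": "First article", "subject": "biophysics",
     "published": "2014-09-01"},
    {"id": "a2", "title": "Second article", "subject": "biophysics",
     "published": "2014-10-01"},
    {"id": "a3", "title": "Unrelated article", "subject": "genetics",
     "published": "2014-10-15"},
]


def subject_page(subject_id, limit=10):
    """Build the whole page model in one response (chunky, not chatty)."""
    matching = [a for a in ARTICLES if a["subject"] == subject_id]
    # ISO dates sort correctly as strings; newest first.
    matching.sort(key=lambda a: a["published"], reverse=True)
    return {
        "subject": {"id": subject_id, **SUBJECTS[subject_id]},
        "total": len(matching),
        "articles": [
            {"id": a["id"], "title": a["title"], "published": a["published"]}
            for a in matching[:limit]
        ],
    }


print(json.dumps(subject_page("biophysics"), indent=2))
```

The client makes one request and receives the subject, the count and the article list already shaped for display, rather than assembling the page from several generic lookups.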

Readying the API for applications

Page 12: Iswc 2014-hammond-pasin-presentation-final

Content Discovery – Optimization

Approaches

●TDB + Fuseki – SPARQL

●MarkLogic Semantics – SPARQL

●MarkLogic – XQuery

●MarkLogic (Optimized) – XQuery

Techniques

●Partitioning – RDF/XML objects

●Streaming – serialization

●Hashing – dictionary lookup

●Caching – Varnish
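
The slide does not say how hashing and caching were implemented, so the following is only an illustration of the two general ideas. It uses a dictionary keyed by a hash of each object's URI for constant-time lookup, and Python's in-process `functools.lru_cache` as a stand-in for an HTTP cache such as Varnish (a different technique, swapped in here to keep the example self-contained). All names and URIs are hypothetical.

```python
import hashlib
from functools import lru_cache

# Hypothetical objects; the example.org URIs are placeholders.
OBJECTS = [
    {"uri": "http://example.org/articles/1", "title": "First"},
    {"uri": "http://example.org/articles/2", "title": "Second"},
]


def key_for(uri: str) -> str:
    """Derive a fixed-length lookup key from a URI."""
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()


# Dictionary lookup: one hash computation instead of a scan over OBJECTS.
INDEX = {key_for(obj["uri"]): obj for obj in OBJECTS}


@lru_cache(maxsize=1024)
def render(uri: str) -> str:
    """Serialize one object; repeated calls are served from the cache."""
    obj = INDEX[key_for(uri)]
    return f'{obj["uri"]} | {obj["title"]}'


print(render("http://example.org/articles/1"))
print(render("http://example.org/articles/1"))  # second call hits the cache
print(render.cache_info())
```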

Tuning the API for performance

Page 13: Iswc 2014-hammond-pasin-presentation-final

Content Storage – Layout and Indexing

Challenges

●Sort orders

●RDF Lists

●Faceting, counting

Layout

●Semantic RDF/XML includes in XML

●RDF objects serialized in list order

●Application XML for subject hierarchy
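
RDF lists are a challenge because an ordered collection is stored as a chain of `rdf:first` / `rdf:rest` triples that a store may return in any order. The sketch below shows, under simple assumptions, how such a chain can be walked once so that the members can then be serialized in list order; the blank-node labels and author values are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"

# Hypothetical triples for a three-item list, deliberately out of order.
TRIPLES = [
    ("_:b2", REST, "_:b3"),
    ("_:b1", FIRST, "author-A"),
    ("_:b3", FIRST, "author-C"),
    ("_:b1", REST, "_:b2"),
    ("_:b3", REST, NIL),
    ("_:b2", FIRST, "author-B"),
]


def flatten_rdf_list(head, triples):
    """Follow rdf:rest from the head node, collecting rdf:first values."""
    first = {s: o for s, p, o in triples if p == FIRST}
    rest = {s: o for s, p, o in triples if p == REST}
    items, node, seen = [], head, set()
    while node != NIL:
        if node in seen:
            raise ValueError("cycle detected in RDF list")
        seen.add(node)
        items.append(first[node])
        node = rest[node]
    return items


print(flatten_rdf_list("_:b1", TRIPLES))
```

Writing the members out in this order at storage time means the delivery layer can read them sequentially instead of reconstructing the chain on every request.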

Indexes

●Indexes over all elements

●Range indexes for datatypes (e.g. datetimes)
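
A range index keeps typed values sorted so that "between two dates" queries do not have to scan every document. The following is a generic, minimal sketch of that idea using a sorted list and binary search; it is not MarkLogic's implementation, and the `RangeIndex` class and document identifiers are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


class RangeIndex:
    """Sorted (value, doc_id) pairs supporting inclusive range queries."""

    def __init__(self, pairs):
        self._entries = sorted(pairs)
        self._keys = [value for value, _ in self._entries]

    def between(self, low, high):
        """Return doc ids whose value lies in [low, high], in value order."""
        start = bisect_left(self._keys, low)
        end = bisect_right(self._keys, high)
        return [doc_id for _, doc_id in self._entries[start:end]]


# Hypothetical documents keyed by a datetime value.
index = RangeIndex([
    (datetime(2014, 10, 22), "doc-b"),
    (datetime(2012, 4, 1), "doc-a"),
    (datetime(2012, 7, 1), "doc-c"),
])

print(index.between(datetime(2012, 1, 1), datetime(2012, 12, 31)))
```

Because the entries are already sorted, the same structure also answers sort-order and counting questions (the length of the returned slice) without touching the documents themselves.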

Readying the data for page delivery

Page 14: Iswc 2014-hammond-pasin-presentation-final

In Conclusion

Summary

●An RDF metamodel allows for scalable enterprise-level data organization

●It is crucial to adequately distinguish between external and internal use cases

●A hybrid architecture proved to be an efficient internal solution for content delivery

Future Work

●Grow the ontology so that it matches product requirements more closely

●Support automated reasoning and richer query options – both RDF and XML based

●Maintain and expand the vision of a shared semantic model as a core enterprise asset

A few lessons learned

Page 15: Iswc 2014-hammond-pasin-presentation-final

For more information please contact

TONY HAMMOND, Data Architect, Content Data [email protected]

MICHELE PASIN, Information Architect, Product [email protected]

Thank you
