
Pistoia Alliance SESL pilot Bio IT World Hanover 12 Oct 2011

Date post: 23-Jan-2015
Upload: ian-harrow
Description:
Towards a brokering framework for knowledge-based services: learning from the Pistoia Alliance SESL pilot. Ian Harrow, PhD, for the Pistoia Alliance. This presentation describes a pilot project to determine the feasibility of biomedical knowledge brokering. It shows queries across multiple disparate data sources through a brokering demonstrator built on RDF triple store technology. The learning from this pilot is contributing to larger-scale projects such as the Innovative Medicines Initiative project Open PHACTS.
Transcript
Page 1: Pistoia Alliance SESL pilot Bio IT World Hanover 12 Oct 2011

http://pistoiaalliance.org

Ian Harrow, PhD

Co-Leader of Pistoia Alliance SESL pilot (ex-Pfizer)

Founder, Director & Principal Consultant at Ian Harrow Consulting Ltd

Bio IT World, Hanover, October 2011

Towards a brokering framework for knowledge-based services: Learning from the Pistoia Alliance SESL pilot

Page 2:

Outline

• Industry Drivers
• Mission and Strategy of Pistoia
• Vision for the SESL pilot
• Minimal configuration to test a brokering service
• Public demonstrator and standards
• Deliverables achieved by SESL pilot
• Learning and future direction

Page 3:

What is Core to your Business? What is Critical?

[Figure: matrix with axes "Core?" and "Critical?". Quadrant labels: Focus Staff on Innovation; Externalize for Cost Reduction; Externalize for Best Practices; Reduce Non-Value Added Work. Year labels: 1990, 2012.]

Page 4:

Why the Pistoia Alliance?

• Industry was at a crossroads
– Change in business models required
• We are all in this (mess) together (Life Science, technology vendors, service IT, academia, etc.)
• Need industry-applicable services and standards
• Collect all the stakeholders together
– Agree on commonly shared, pre-competitive use cases
• Focus on delivery of proofs of concept to stimulate and foster new business models

Henry Chesbrough, UC Berkeley 2011

Page 5:

The Mission of the Pistoia Alliance

Lowering the barriers to innovation by improving the interoperability of R&D business processes via pre-competitive collaborations

Page 6:

Page 7:

Pistoia Alliance Membership Sept 2011

Page 8:

A Reality Check: Setting Expectations

Page 9:

Signpost clearly

Page 10:

Pistoia Strategy

Page 11:

Domains of Action

• Biology & Translational Medicine
• Chemistry
• Scientific Collaboration

Page 12:

The Focus of Each Domain

Focus areas:
• Big Data, Analytics, Semantics
• Supply Chain, Tech Transfer
• Vocabularies, Use Cases, Best Practices

Domains:
• Biology
• Chemistry
• Scientific Collaboration

Page 13:

Try this at your desk…

Which diseases are correlated with the gene TCF7L2?

• Gene/Protein
• Literature – Abstracts
• Inherited diseases
• Gene expression
• Literature – Full Text

Page 14:

Try it again with Pistoia’s SESL…

• Gene naming/synonyms
• Gene function
• Literature statistics
• Disease co-occurrences
• Gene/protein interactions

…all in one report from one search

HOW? A standard vocabulary, data model, query language, report structure, etc.

Page 15:

SESL Pilot project description

• Deliverables:
– Publication of standards and recommendations for brokering service implementation
– Public demonstrator service for a single disease area
– Dialogue and assessment of potential business impact with key content suppliers
• Scope:
– Development of an assertion database in combination with a user interface and associated web services for one disease/indication/phenotype of broad interest: Type II Diabetes
– Assertional content derived from 3 structured data sources and limited journal content (co-occurrence and statistical derivation from full text)
– Assertional evidence for filtering and drill-down to primary data
– Limited vocabulary development for area of focus: Type II Diabetes
• Participants and Cost:
– AZ, Pfizer, GSK, Roche, Unilever, EMBL-EBI, NPG, OUP, Elsevier & RSC
– Single contract between Pistoia Alliance & EMBL-EBI
– £200K cost (= 2 x FTEs), shared by industry
– 12-month project, January 2010 start

Page 16:

The Knowledge Service Framework

[Diagram, components as labelled on the slide:
• Content Suppliers, behind the Supplier Firewall: Corpus 1, Db 2, Db 3, Db 4, Corpus 5
• Common Service Broker: Transform/Translate (RDF triples); Integrator/Aggregator (Triple store); Assertion & Meta Data Management; Service Layer; Std Public Vocabularies; Business Rules; Open Standards
• Multiple Consumers, behind the ‘Consumer’ Firewall: Knowledge Applications; Disease Dossier]
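The Transform/Translate step in the framework turns supplier records into RDF triples before they reach the triple store. The following is a minimal sketch of that idea in plain Python, not the pilot's actual code. The `http://example.org/sesl/` namespace, the predicate names and the record layout are all assumptions made for illustration; only the N-Triples line format is standard.

```python
# Hypothetical namespace and predicate names, for illustration only.
BASE = "http://example.org/sesl/"


def record_to_triples(record):
    """Turn one structured gene record into (subject, predicate, object) triples."""
    subject = f"{BASE}gene/{record['gene']}"
    triples = [(subject, f"{BASE}vocab/symbol", record["gene"])]
    for disease in record["diseases"]:
        triples.append((subject, f"{BASE}vocab/associatedWith", f"{BASE}disease/{disease}"))
    triples.append((subject, f"{BASE}vocab/source", record["source"]))
    return triples


def term(value):
    """Format a URI as <...> and anything else as a quoted plain literal."""
    if value.startswith("http://"):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialise triples as N-Triples, one statement per line."""
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


# Illustrative record: TCF7L2 with a type 2 diabetes mention from OMIM.
record = {"gene": "TCF7L2", "diseases": ["type_2_diabetes"], "source": "OMIM"}
triples = record_to_triples(record)
ntriples = to_ntriples(triples)
print(ntriples)
```

Keeping the transform as a pure function from record to triples means each supplier can have its own translator while the aggregator only ever sees one shape of data.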

Page 17:

Minimal configuration to test the technical feasibility of a Knowledge Broker Service

[Diagram, layers as labelled on the slide:
• Interface Layer: User Interface
• Brokering service Layer: Broker #1 and Broker #2, each with Service Layer, Std Public Vocabularies, Query templates, Assertion & Meta Data Mgmt, Transform / Translate, and its own triple store (Triple store 1, Triple store 2)
• Condition: Identical structure. Different content which can overlap
• Primary source Layer: Elsevier corpus; EBI Uniprot database; RSC corpus; NPG corpus; OUP corpus; EBI Array Express database; UK-Pubmed Central corpus; EBI Uniprot database; NCBI OMIM database]
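The condition on this slide (two brokers with identical structure but different, possibly overlapping content) can be sketched as one query pattern sent to both stores, with the results merged and the answering broker recorded. This is an illustrative sketch only: the gene labels, the `cooccursWith` predicate, the store contents and the function names are hypothetical, and Python sets stand in for real triple stores.

```python
# Two in-memory "triple stores" with the same structure. Contents are
# hypothetical placeholders and overlap on one triple.
STORE_1 = {
    ("gene:GENE_A", "cooccursWith", "disease:T2D"),
    ("gene:GENE_B", "cooccursWith", "disease:T2D"),
}
STORE_2 = {
    ("gene:GENE_A", "cooccursWith", "disease:T2D"),
    ("gene:GENE_C", "cooccursWith", "disease:T2D"),
}


def disease_template(disease):
    """A stand-in for a query template: every gene co-occurring with a disease."""
    return (None, "cooccursWith", disease)


def match(store, pattern):
    """Return the triples in a store that match a pattern; None is a wildcard."""
    return {t for t in store if all(p is None or p == v for p, v in zip(pattern, t))}


def broker_query(stores, pattern):
    """Send one pattern to every broker and merge results, keeping provenance."""
    merged = {}
    for name, store in stores.items():
        for triple in match(store, pattern):
            merged.setdefault(triple, set()).add(name)
    return merged


stores = {"broker1": STORE_1, "broker2": STORE_2}
results = broker_query(stores, disease_template("disease:T2D"))
for triple, brokers in sorted(results.items()):
    print(triple[0], "reported by", sorted(brokers))
```

Because both stores share one structure, the same template works against either. The overlap shows up as a triple reported by both brokers rather than as a duplicate row.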

Page 18:

Simple Graphical User Interface to the SESL public demonstrator

SESL public demonstrator: http://www.pistoia-sesl.org

1. Single point of query through a simple GUI
A. Gene Query and/or B. Disease Query
Show, filtered by:
1) Everything
2) Consensus
3) Co-occurrence
4) OMIM
5) Array Express

2. Aggregated results on a single web page
A. Gene query results summary
1) Co-occurrence documents
2) Uniprot names and annotation
3) OMIM disease names
4) Array Express disease and/or pancreas expression
5) Uniprot GO terms
6) Uniprot binary interactions
B. Disease query results summary
1) Co-occurrence documents
2) OMIM disease names
3) Array Express disease expression
Full text detail: Title, Authors, Citation; co-occurrence of gene and disease mentions in text extracts

The results include links out to the primary sources
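The filter choices on this slide can be read as a selection over source-tagged assertions. The sketch below shows one way that could work; it is not the demonstrator's implementation. In particular, the slide does not define "Consensus", so the rule used here (a gene-disease pair backed by at least two different sources) is an assumption, and the gene labels and rows are hypothetical placeholders.

```python
# Hypothetical assertions as (gene, disease, source) rows.
ASSERTIONS = [
    ("GENE_A", "type 2 diabetes", "Co-occurrence"),
    ("GENE_A", "type 2 diabetes", "OMIM"),
    ("GENE_B", "type 2 diabetes", "Co-occurrence"),
    ("GENE_C", "type 2 diabetes", "Array Express"),
]


def apply_filter(assertions, mode):
    """Return the assertions shown for one of the filter names on the slide."""
    if mode == "Everything":
        return list(assertions)
    if mode == "Consensus":
        # ASSUMPTION: consensus means the same gene-disease pair is backed
        # by at least two different sources. The slide does not define it.
        sources = {}
        for gene, disease, source in assertions:
            sources.setdefault((gene, disease), set()).add(source)
        return [a for a in assertions if len(sources[(a[0], a[1])]) >= 2]
    # Any other mode is treated as the name of a single source.
    return [a for a in assertions if a[2] == mode]


consensus = apply_filter(ASSERTIONS, "Consensus")
omim_only = apply_filter(ASSERTIONS, "OMIM")
print(consensus)
print(omim_only)
```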

Page 19:

Type 2 diabetes genes in SESL demonstrator

Human protein name | Human gene name | Source: UniProt diabetes mention | SESL: UniProt diabetes mention | Google Scholar: type 2 diabetes, 2006 to June 2011 | Pubmed: type 2 diabetes, June 2011 | SESL: gene and type 2 diabetes co-occurrence in Full Text | Source: OMIM diabetes mention | SESL: OMIM diabetes mention | Source: Array Express Atlas pancreas | SESL: Array Express pancreas | Source: Uniprot GO terms | SESL: GO terms | Source: Uniprot Intact binary interactions | SESL: Binary interactions
ATP-binding cassette sub-family C member 8 | ABCC8 | 1 | 1 | 753 | 37 | 6 | 6 | 6 | 5 | 7 | 7 | 9 | 0 | 0
Calpain-10 | CAPN10 | 1 | 1 | 810 | 168 | 21 | 1 | 1 | 1 | 1 | 12 | 12 | 0 | 0
Glucokinase | GCK | 1 | 1 | 3,950 | 708 | 12 | 7 | 7 | 0 | 0 | 19 | 19 | 2 | 2
Hematopoietically-expressed homeobox protein | HHEX | 0 | 0 | 626 | 91 | 24 | 1 | 0 | 2 | 2 | 21 | 23 | 3 | 0
Hepatocyte nuclear factor 1-alpha | HNF1A | 1 | 1 | 633 | 340 | 23 | 3 | 4 | 2 | 2 | 12 | 12 | 6 | 6
Hepatocyte nuclear factor 1-beta | HNF1B | 1 | 1 | 408 | 269 | 20 | 1 | 1 | 2 | 2 | 9 | 8 | 1 | 0
Hepatocyte nuclear factor 4-alpha | HNF4A | 1 | 1 | 811 | 173 | 34 | 2 | 2 | 3 | 3 | 22 | 20 | 5 | 5
Insulin | INS | 2 | 1 | 166,000 | 37,670 | 5 | 9 | 0 | 7 | 0 | 59 | 59 | 0 | 0
Insulin receptor substrate 1 | IRS1 | 1 | 1 | 7,970 | 616 | 9 | 1 | 0 | 2 | 2 | 24 | 24 | 3 | 0
Insulin receptor | INSR | 1 | 1 | 14,00 | 4,830 | 16 | 2 | 4 | 6 | 6 | 41 | 43 | 9 | 9
ATP-sensitive inward rectifier potassium channel 11 | KCNJ11 | 1 | 1 | 1,260 | 45 | 35 | 3 | 1 | 0 | 0 | 12 | 12 | 1 | 0
Hepatic triacylglycerol lipase | LIPC | 1 | 0 | 2,090 | 89 | 1 | 1 | 1 | 1 | 1 | 17 | 17 | 0 | 0
C-Jun-amino-terminal kinase-interacting protein 1 | MAPK8IP1 | 1 | 1 | 248 | 4 | 1 | 1 | 1 | 1 | 1 | 6 | 6 | 4 | 4
Neurogenic differentiation factor 1 | NEUROD1 | 1 | 1 | 549 | 50 | 7 | 2 | 2 | 2 | 4 | 13 | 14 | 0 | 0
Pancreas/duodenum homeobox protein 1 | PDX1 | 1 | 1 | 2,270 | 154 | 9 | 2 | 0 | 1 | 1 | 9 | 9 | 0 | 0
Peroxisome proliferator-activated receptor gamma | PPARG | 1 | 1 | 9,540 | 1,556 | 48 | 1 | 1 | 2 | 2 | 40 | 42 | 7 | 7
Protein phosphatase 1 regulatory subunit 3A | PPP1R3A | 1 | 1 | 141 | 23 | 3 | 1 | 0 | 1 | 1 | 2 | 2 | 0 | 0
Zinc transporter 8 | SLC30A8 | 1 | 0 | 724 | 117 | 0 | 2 | 1 | 3 | 4 | 13 | 13 | 0 | 0
Transcription factor 7-like 2 | TCF7L2 | 1 | 1 | 2,000 | 284 | 65 | 1 | 1 | 3 | 3 | 33 | 31 | 5 | 5
Mitochondrial brown fat uncoupling protein 1 | UCP1 | 1 | 0 | 1,760 | 50 | 3 | 0 | 0 | 0 | 0 | 6 | 6 | 0 | 0

Page 20:

Gene discovery in SESL demonstrator

[Venn diagram: gene count intersections from the data sources in the demonstrator. Sets: pancreas expression in Array Express db; T2D disease genes in Full Text documents; T2D disease gene mention in OMIM db; T2D disease gene mention in Uniprot db. Region counts as extracted: 1, 3, 1, 1, 0, 4, 3, 10.]
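Gene count intersections of this kind can be derived by assigning each gene to the exact combination of sources that mention it. The sketch below illustrates the counting only. The gene sets are hypothetical placeholders, since the transcript does not say which genes fall in which region, and the source names are shortened labels chosen here.

```python
# Hypothetical gene sets per data source (placeholders, not the pilot's data).
SOURCES = {
    "UniProt": {"GENE_A", "GENE_B", "GENE_C"},
    "OMIM": {"GENE_A", "GENE_B", "GENE_D"},
    "ArrayExpress": {"GENE_A", "GENE_E"},
    "FullText": {"GENE_A", "GENE_B", "GENE_C", "GENE_F"},
}


def exclusive_region_counts(sources):
    """Count genes per Venn region, i.e. per exact combination of sources."""
    all_genes = set().union(*sources.values())
    counts = {}
    for gene in all_genes:
        region = tuple(sorted(name for name, genes in sources.items() if gene in genes))
        counts[region] = counts.get(region, 0) + 1
    return counts


counts = exclusive_region_counts(SOURCES)
for region, n in sorted(counts.items()):
    print(" & ".join(region), "->", n)
```

Each gene lands in exactly one region, so the region counts always add up to the number of distinct genes across all sources.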

Page 21:

Selected content loaded as RDF triples

Source Description | # triples | %
Expression data Array Express | 182,840 | 0.5%
Experimental Factor Ontology from Array Express | 49,026 | 0.1%
Disease vocabulary from UMLS | 6,906,735 | 18.8%
Vocabulary from Disease Ontology | 1,863,664 | 5.1%
Terms from Gene Ontology | 495,595 | 1.3%
Human genes from Uniprot | 12,552,239 | 34.1%
Meta data from Full Text documents | 3,485,212 | 9.5%
Gene annotations from Full Text documents | 2,373,584 | 6.5%
Disease annotations from Full Text documents | 4,983,788 | 13.6%
GO annotations from Full Text documents | 3,870,834 | 10.5%
Totals | 36,763,517 | 100%
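The total and the percentage column in this table can be recomputed from the triple counts as a quick consistency check. The counts below are copied from the slide; the variable names are chosen here for illustration.

```python
# Triple counts per source, as listed on the slide.
TRIPLE_COUNTS = {
    "Expression data Array Express": 182_840,
    "Experimental Factor Ontology from Array Express": 49_026,
    "Disease vocabulary from UMLS": 6_906_735,
    "Vocabulary from Disease Ontology": 1_863_664,
    "Terms from Gene Ontology": 495_595,
    "Human genes from Uniprot": 12_552_239,
    "Meta data from Full Text documents": 3_485_212,
    "Gene annotations from Full Text documents": 2_373_584,
    "Disease annotations from Full Text documents": 4_983_788,
    "GO annotations from Full Text documents": 3_870_834,
}

total = sum(TRIPLE_COUNTS.values())
# Share of each source in percent, rounded to one decimal as on the slide.
shares = {name: round(100 * n / total, 1) for name, n in TRIPLE_COUNTS.items()}

print(f"{total:,} triples in total")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The recomputed total matches the slide's 36,763,517. Content derived from full-text documents (meta data plus gene, disease and GO annotations) accounts for about 40% of all triples.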

Page 22:

Signposting: Standards used in SESL

Blending of existing standards

Category | Name | Community
Triple Store | RDF | W3C
Triple Store | SPARQL | W3C
Triple Store | Jena, Sesame, Virtuoso | Open Source
Text Mining | IeXML | EBI & CALBC
Text Mining | LexEBI/BioLexicon | EBI, NaCTeM, U of Pisa
Text Mining | CALBC | EBI & CALBC
URIs | UniProt | EBI, PIR, SIB, etc
URIs | Disease Ontology and UMLS | OBO, NIH/NLM
URIs | ArrayExpress | EBI
URIs | NCBI Taxonomy | NCBI
RDF Schema | Dublin Core | W3C
RDF Schema | N3 notation | W3C
RDF Schema | Co-occurrence of gene-disease | EBI
RDF Schema | PMC doc standard | NCBI
Ontology | Relation ontology | OBO
 | URI server | W3C

Page 23:

The Deliverables of the SESL pilot

• A proof of concept to demonstrate feasibility and clarify requirements
– http://www.pistoia-sesl.org
• A functional specification for query brokering, result filtering, report generation
– Expect publication by end 2011
– http://www.pistoiaalliance.com/workinggroups/sesl.html
• Academia, Life Science Industry and Publishers
– Attained a better understanding of each other’s needs
– Demonstration of potential for a new business model
– Explore follow-on via Open Innovation consortia

Page 24:

Learning and Future Direction

• Framework to maximise re-use of existing standards
– Minimise use of bespoke, hard-coded implementations
• Crucial features of a knowledge brokering service:
– RDF triples for a scalable meta index to broker across primary sources (both databases and literature)
– Important to define business rules for query & extraction
– Recommend a registry of suitable data sources, similar to a web services registry
• What is next? Example follow-on to the SESL pilot:
– Open PHACTS consortium => www.openphacts.org
– 3-year IMI pre-competitive project (started early 2011)
– Data providers and Life Science industry working together
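The recommended registry of suitable data sources could, in its simplest form, be a lookup that a broker consults before routing a query. The sketch below is an assumption-laden illustration and not part of the SESL deliverables. The class names, the fields, the `example.org` endpoints and the vocabulary assigned to each source are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    kind: str                 # "database" or "literature"
    endpoint: str             # hypothetical address of the source's service
    vocabularies: tuple = ()  # public vocabularies the source is annotated with


class SourceRegistry:
    """In-memory registry a broker could consult before routing a query."""

    def __init__(self):
        self._sources = {}

    def register(self, source):
        if source.name in self._sources:
            raise ValueError(f"{source.name} is already registered")
        self._sources[source.name] = source

    def find(self, kind=None, vocabulary=None):
        """Return sources matching the given kind and/or vocabulary, sorted by name."""
        hits = []
        for source in self._sources.values():
            if kind is not None and source.kind != kind:
                continue
            if vocabulary is not None and vocabulary not in source.vocabularies:
                continue
            hits.append(source)
        return sorted(hits, key=lambda s: s.name)


registry = SourceRegistry()
registry.register(DataSource("UniProt", "database", "http://example.org/uniprot", ("Gene Ontology",)))
registry.register(DataSource("OMIM", "database", "http://example.org/omim", ("Disease Ontology",)))
registry.register(DataSource("UK PubMed Central", "literature", "http://example.org/ukpmc",
                             ("Gene Ontology", "Disease Ontology")))

literature = [s.name for s in registry.find(kind="literature")]
go_sources = [s.name for s in registry.find(vocabulary="Gene Ontology")]
print(literature)
print(go_sources)
```

Recording the vocabularies alongside each source lets a broker pick only those sources whose annotations it can map onto the shared meta index.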

Page 25:

Acknowledgements

Industry
Wendy Filsell – Unilever (SESL co-leader)
Ian Stott – Unilever
Nigel Wilkinson – PFE
Catherine Marshall – PFE
Peter Woollard – GSK
Ashley George – GSK
Mike Westaway – AZ
Nick Lynch – AZ
Ian Dix – AZ
Michael Braxenthaler – Roche
John Wise – Pistoia Alliance

Publishers
Claire Bird – OUP
Richard O’Beirne – OUP
Colin Batchelor – RSC
Richard Kidd – RSC
David Hoole – NPG
Alf Eaton – NPG
Jabe Wilson – Elsevier
Bradley Allen – Elsevier

EMBL-EBI
Dietrich Rebholz-Schuhmann (Technical Team Leader)
Christoph Grabmueller
Silvestras Kavaliauskas
Dominic Clark
Rodrigo Lopez
Jo McEntyre – UK-PMC
Janet Thornton

