Date posted: 23-Jan-2015
Category: Technology
Uploaded by: ian-harrow
http://pistoiaalliance.org
Ian Harrow, PhD
Co-Leader of Pistoia Alliance SESL pilot (ex-Pfizer)
Founder, Director & Principal Consultant at Ian Harrow Consulting Ltd
Bio IT World, Hanover, October 2011
Towards a brokering framework for knowledge-based services:
Learning from the Pistoia Alliance SESL pilot
Outline
• Industry Drivers
• Mission and Strategy of Pistoia
• Vision for the SESL pilot
• Minimal configuration to test a brokering service
• Public demonstrator and standards
• Deliverables achieved by SESL pilot
• Learning and future direction
What is Core to your Business? What is Critical?
[Figure: matrix with axes "Core?" and "Critical?" and a timeline from 1990 to 2012. Quadrant labels: Focus Staff on Innovation; Externalize for Cost Reduction; Externalize for Best Practices; Reduce Non-Value Added Work.]
Why the Pistoia Alliance?
• Industry was at a crossroads
– Change in business models required
• We are all in this (mess) together (Life Science,
technology vendors, service IT, academia, etc.)
• Need industry-applicable services and
standards
• Collect all the stakeholders together
– Agree on commonly-shared, pre-competitive use
cases
• Focus on delivery of proofs of concept to
stimulate and foster new business models
Henry Chesbrough, UC Berkeley 2011
The Mission of the Pistoia Alliance
Lowering the barriers to innovation
by improving the interoperability of R&D business processes
via pre-competitive collaborations
Pistoia Alliance Membership Sept 2011
A Reality Check: Setting Expectations
Signpost clearly
Pistoia Strategy
Domains of Action
• Biology & Translational Medicine
• Chemistry
• Scientific Collaboration
The Focus of Each Domain
[Figure: focus areas shown across the domains Biology, Chemistry and Scientific Collaboration. Focus area labels: Big Data, Analytics, Semantics; Supply Chain, Tech Transfer; Vocabularies, Use Cases, Best Practices.]
Try this at your desk…
Which diseases are correlated to the gene TCF7L2?
[Figure: separate searches across five kinds of source: Gene/Protein; Literature (Abstracts); Inherited diseases; Gene expression; Literature (Full Text).]
Try it again with Pistoia’s SESL…
• Gene naming/synonyms
• Gene function
• Literature statistics
• Disease co-occurrences
• Gene/protein interactions
…all in one report from one search
HOW? A standard vocabulary, data model, query language, report structure, etc.
SESL Pilot project description
• Deliverables:
– Publication of standards and recommendations for brokering service implementation
– Public demonstrator service for a single disease area
– Dialogue and assessment of potential business impact with key content suppliers
• Scope:
– Development of an assertion database in combination with a user interface and associated web services for one disease/indication/phenotype of broad interest: Type II Diabetes
– Assertional content derived from 3 structured data sources and limited journal content (co-occurrence and statistical derivation from full text)
– Assertional evidence for filtering and drill-down to primary data
– Limited vocabulary development for area of focus: Type II Diabetes
• Participants and Cost:
– AZ, Pfizer, GSK, Roche, Unilever, EMBL-EBI, NPG, OUP, Elsevier & RSC
– Single contract between Pistoia Alliance & EMBL-EBI
– £200K cost (= 2 x FTEs), shared by industry
– 12-month project, January 2010 start
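The scope above mentions assertional content derived by co-occurrence from full text. As a rough illustration only, the sketch below counts gene and disease terms that appear in the same sentence. The term lists, the sample text and the sentence-level window are all assumptions made for this example; they are not the SESL text-mining pipeline.

```python
import re
from collections import Counter

# Hypothetical mini-vocabularies; a real service would use curated lexicons.
GENE_TERMS = {"TCF7L2", "PPARG", "KCNJ11"}
DISEASE_TERMS = {"type 2 diabetes", "obesity"}


def sentence_cooccurrences(text):
    """Count gene/disease pairs that are mentioned in the same sentence."""
    counts = Counter()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        lowered = sentence.lower()
        genes = {g for g in GENE_TERMS
                 if re.search(rf"\b{re.escape(g)}\b", sentence)}
        diseases = {d for d in DISEASE_TERMS if d in lowered}
        for gene in genes:
            for disease in diseases:
                counts[(gene, disease)] += 1
    return counts


# Invented sample text, for illustration only.
SAMPLE_TEXT = (
    "Variants in TCF7L2 are associated with type 2 diabetes. "
    "PPARG is linked to obesity and type 2 diabetes. "
    "KCNJ11 encodes a potassium channel."
)

cooc = sentence_cooccurrences(SAMPLE_TEXT)
for (gene, disease), n in sorted(cooc.items()):
    print(gene, disease, n)
```

Each counted pair could then be stored as one assertion, with the sentence kept as evidence for drill-down.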
The Knowledge Service Framework
[Figure: architecture diagram with three parts.]
• Content Suppliers (behind the Supplier Firewall): Corpus 1, Db 2, Db 3, Db 4, Corpus 5
• Common Service Broker: Transform / Translate (RDF triples); Integrator/Aggregator (Triple store); Assertion & Meta Data Management; Service Layer; Std Public Vocabularies; Business Rules; Open Standards
• Multiple Consumers (behind the ‘Consumer’ Firewall): Knowledge Applications; Disease Dossier
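The Transform / Translate and Integrator/Aggregator steps of the framework can be sketched in a few lines. This is a minimal sketch under stated assumptions: the `http://example.org/sesl/` namespace, the predicate names and the record layout are hypothetical, and a Python set stands in for a real triple store.

```python
BASE = "http://example.org/sesl/"  # hypothetical namespace, not a real SESL URI


def to_ntriples(record):
    """Translate one flat gene record into a list of N-Triples lines."""
    subject = f"<{BASE}gene/{record['gene']}>"
    protein = record["protein"].replace('"', '\\"')  # escape quotes in the literal
    lines = [f'{subject} <{BASE}vocab/proteinName> "{protein}" .']
    for disease in record["diseases"]:
        disease_uri = f"<{BASE}disease/{disease.replace(' ', '_')}>"
        lines.append(f"{subject} <{BASE}vocab/associatedWith> {disease_uri} .")
    return lines


def aggregate(records):
    """Collect triples from all suppliers; the set removes exact duplicates."""
    store = set()
    for record in records:
        store.update(to_ntriples(record))
    return store


# Illustrative records; the second repeats the first, as if sent by another supplier.
RECORDS = [
    {"gene": "TCF7L2", "protein": "Transcription factor 7-like 2",
     "diseases": ["type 2 diabetes"]},
    {"gene": "TCF7L2", "protein": "Transcription factor 7-like 2",
     "diseases": ["type 2 diabetes"]},
    {"gene": "GCK", "protein": "Glucokinase", "diseases": ["type 2 diabetes"]},
]

triple_store = aggregate(RECORDS)
for line in sorted(triple_store):
    print(line)
```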
Minimal configuration to test the technical feasibility of a Knowledge Broker Service
[Figure: layered architecture with two brokers.]
• Interface Layer: User Interface
• Brokering service Layer: Broker #1 and Broker #2, each with a Service Layer, Std Public Vocabularies, Query templates, Assertion & Meta Data Mgmt, Transform / Translate, and its own Triple store (Triple store 1 and Triple store 2)
• Condition: identical structure; different content, which can overlap
• Primary source Layer: Elsevier corpus; EBI Uniprot database; RSC corpus; NPG corpus; OUP corpus; EBI Array Express database; UK-Pubmed Central corpus; EBI Uniprot database; NCBI OMIM database
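The condition in this configuration (identical structure, different content which can overlap) suggests a broker that sends the same query to each store and merges the answers. The sketch below shows one way to do that while recording which store supplied each triple. The class, the predicate names and the sample triples are hypothetical, not the SESL implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TripleStore:
    """A toy in-memory store holding (subject, predicate, object) tuples."""
    name: str
    triples: set = field(default_factory=set)

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return {
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        }


def broker_query(stores, **pattern):
    """Send one pattern to every store and merge results with provenance."""
    merged = {}
    for store in stores:
        for triple in store.match(**pattern):
            merged.setdefault(triple, set()).add(store.name)
    return merged


store1 = TripleStore("broker1", {
    ("TCF7L2", "mentionedWith", "type 2 diabetes"),
    ("TCF7L2", "hasName", "Transcription factor 7-like 2"),
})
store2 = TripleStore("broker2", {
    ("TCF7L2", "mentionedWith", "type 2 diabetes"),  # overlaps with store1
    ("TCF7L2", "listedIn", "OMIM"),
})

results = broker_query([store1, store2], subject="TCF7L2")
for triple, sources in sorted(results.items()):
    print(triple, sorted(sources))
```

Keeping the set of contributing stores per triple lets overlapping content be reported once while preserving links back to each source.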
Simple Graphical User Interface to the SESL public demonstrator
SESL public demonstrator: http://www.pistoia-sesl.org
1. Single point of query through a simple GUI
A. Gene Query and/or B. Disease Query
Show, filtered by:
1) Everything
2) Consensus
3) Co-occurrence
4) OMIM
5) Array Express
2. Aggregated Results on a single web page
A. Gene query results summary
1) Co-occurrence Documents
2) Uniprot names and annotation
3) OMIM disease names
4) Array express disease and/or pancreas expression
5) Uniprot GO terms
6) Uniprot Binary interactions
B. Disease query results summary
1) Co-occurrence Documents
2) OMIM disease names
3) Array express disease expression
Full text detail: Title, Authors, Citation, and co-occurrence of gene and disease mentions in text extracts
The results include links out to the primary sources
Type 2 diabetes genes in SESL demonstrator
Human protein name | Human gene name | Source: UniProt diabetes mention | SESL: UniProt diabetes mention | Google Scholar: type 2 diabetes, 2006 to June 2011 | PubMed: type 2 diabetes, June 2011 | SESL: gene and type 2 diabetes co-occurrence in Full Text | Source: OMIM diabetes mention | SESL: OMIM diabetes mention | Source: Array Express Atlas pancreas | SESL: Array Express pancreas | Source: UniProt GO terms | SESL: GO terms | Source: UniProt IntAct binary interactions | SESL: Binary interactions
ATP-binding cassette sub-family C member 8 | ABCC8 | 1 | 1 | 753 | 37 | 6 | 6 | 6 | 5 | 7 | 7 | 9 | 0 | 0
Calpain-10 | CAPN10 | 1 | 1 | 810 | 168 | 21 | 1 | 1 | 1 | 1 | 12 | 12 | 0 | 0
Glucokinase | GCK | 1 | 1 | 3,950 | 708 | 12 | 7 | 7 | 0 | 0 | 19 | 19 | 2 | 2
Hematopoietically-expressed homeobox protein | HHEX | 0 | 0 | 626 | 91 | 24 | 1 | 0 | 2 | 2 | 21 | 23 | 3 | 0
Hepatocyte nuclear factor 1-alpha | HNF1A | 1 | 1 | 633 | 340 | 23 | 3 | 4 | 2 | 2 | 12 | 12 | 6 | 6
Hepatocyte nuclear factor 1-beta | HNF1B | 1 | 1 | 408 | 269 | 20 | 1 | 1 | 2 | 2 | 9 | 8 | 1 | 0
Hepatocyte nuclear factor 4-alpha | HNF4A | 1 | 1 | 811 | 173 | 34 | 2 | 2 | 3 | 3 | 22 | 20 | 5 | 5
Insulin | INS | 2 | 1 | 166,000 | 37,670 | 5 | 9 | 0 | 7 | 0 | 59 | 59 | 0 | 0
Insulin receptor substrate 1 | IRS1 | 1 | 1 | 7,970 | 616 | 9 | 1 | 0 | 2 | 2 | 24 | 24 | 3 | 0
Insulin receptor | INSR | 1 | 1 | 14,00 | 4,830 | 16 | 2 | 4 | 6 | 6 | 41 | 43 | 9 | 9
ATP-sensitive inward rectifier potassium channel 11 | KCNJ11 | 1 | 1 | 1,260 | 45 | 35 | 3 | 1 | 0 | 0 | 12 | 12 | 1 | 0
Hepatic triacylglycerol lipase | LIPC | 1 | 0 | 2,090 | 89 | 1 | 1 | 1 | 1 | 1 | 17 | 17 | 0 | 0
C-Jun-amino-terminal kinase-interacting protein 1 | MAPK8IP1 | 1 | 1 | 248 | 4 | 1 | 1 | 1 | 1 | 1 | 6 | 6 | 4 | 4
Neurogenic differentiation factor 1 | NEUROD1 | 1 | 1 | 549 | 50 | 7 | 2 | 2 | 2 | 4 | 13 | 14 | 0 | 0
Pancreas/duodenum homeobox protein 1 | PDX1 | 1 | 1 | 2,270 | 154 | 9 | 2 | 0 | 1 | 1 | 9 | 9 | 0 | 0
Peroxisome proliferator-activated receptor gamma | PPARG | 1 | 1 | 9,540 | 1,556 | 48 | 1 | 1 | 2 | 2 | 40 | 42 | 7 | 7
Protein phosphatase 1 regulatory subunit 3A | PPP1R3A | 1 | 1 | 141 | 23 | 3 | 1 | 0 | 1 | 1 | 2 | 2 | 0 | 0
Zinc transporter 8 | SLC30A8 | 1 | 0 | 724 | 117 | 0 | 2 | 1 | 3 | 4 | 13 | 13 | 0 | 0
Transcription factor 7-like 2 | TCF7L2 | 1 | 1 | 2,000 | 284 | 65 | 1 | 1 | 3 | 3 | 33 | 31 | 5 | 5
Mitochondrial brown fat uncoupling protein 1 | UCP1 | 1 | 0 | 1,760 | 50 | 3 | 0 | 0 | 0 | 0 | 6 | 6 | 0 | 0
Gene discovery in SESL demonstrator
[Figure: Venn diagram of gene count intersections from the data sources in the demonstrator. Sets: Pancreas expression in Array Express db; T2D disease genes in Full Text documents; T2D disease gene mention in OMIM db; T2D disease gene mention in Uniprot db. Region counts as extracted (positions not recoverable): 1, 3, 1, 1, 0, 4, 3, 10.]
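Gene count intersections of this kind can be derived by recording, for each gene, which sources return a non-zero count. The sketch below does this for six genes, using their SESL counts from the table of type 2 diabetes genes. Because it is only a subset, it does not reproduce the region counts in the figure, and treating any count above zero as membership is an assumption.

```python
from collections import Counter

# SESL counts for six genes, copied from the table of type 2 diabetes genes:
# (UniProt mention, full-text co-occurrence, OMIM mention, Array Express pancreas)
SESL_COUNTS = {
    "ABCC8": (1, 6, 6, 7),
    "HHEX": (0, 24, 0, 2),
    "INS": (1, 5, 0, 0),
    "SLC30A8": (0, 0, 1, 4),
    "TCF7L2": (1, 65, 1, 3),
    "UCP1": (0, 3, 0, 0),
}
SOURCES = ("UniProt", "FullText", "OMIM", "ArrayExpress")


def venn_regions(counts):
    """Map each exact combination of sources to the number of genes in it."""
    regions = Counter()
    for values in counts.values():
        # Assumption: a gene belongs to a source's set if its count is above zero.
        present = frozenset(s for s, v in zip(SOURCES, values) if v > 0)
        regions[present] += 1
    return regions


regions = venn_regions(SESL_COUNTS)
for region, n in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(region), n)
```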
Selected content loaded as RDF triples
Source Description | # triples | %
Expression data Array Express | 182,840 | 0.5%
Experimental Factor Ontology from Array Express | 49,026 | 0.1%
Disease vocabulary from UMLS | 6,906,735 | 18.8%
Vocabulary from Disease Ontology | 1,863,664 | 5.1%
Terms from Gene Ontology | 495,595 | 1.3%
Human genes from Uniprot | 12,552,239 | 34.1%
Meta data from Full Text documents | 3,485,212 | 9.5%
Gene annotations from Full Text documents | 2,373,584 | 6.5%
Disease annotations from Full Text documents | 4,983,788 | 13.6%
GO annotations from Full Text documents | 3,870,834 | 10.5%
Totals | 36,763,517 | 100%
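The percentage column can be recomputed from the triple counts as a quick consistency check. The short script below uses the counts exactly as listed in the table; the variable names are chosen for this example.

```python
# Triple counts per source description, as listed in the table.
TRIPLE_COUNTS = {
    "Expression data Array Express": 182_840,
    "Experimental Factor Ontology from Array Express": 49_026,
    "Disease vocabulary from UMLS": 6_906_735,
    "Vocabulary from Disease Ontology": 1_863_664,
    "Terms from Gene Ontology": 495_595,
    "Human genes from Uniprot": 12_552_239,
    "Meta data from Full Text documents": 3_485_212,
    "Gene annotations from Full Text documents": 2_373_584,
    "Disease annotations from Full Text documents": 4_983_788,
    "GO annotations from Full Text documents": 3_870_834,
}

total = sum(TRIPLE_COUNTS.values())
# Share of each source in the triple store, as a percentage to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in TRIPLE_COUNTS.items()}

print(f"{total:,} triples in total")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The sum matches the stated total, and roughly 40% of the triples come from annotations and metadata of full-text documents.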
Signposting: Standards used in SESL
Blending of existing standards
Category | Name | Community
Triple Store | RDF | W3C
Triple Store | SPARQL | W3C
Triple Store | Jena, Sesame, Virtuoso | Open Source
Text Mining | IeXML | EBI & CALBC
Text Mining | LexEBI/BioLexicon | EBI, NaCTeM, U of Pisa
Text Mining | CALBC | EBI & CALBC
URIs | UniProt | EBI, PIR, SIB, etc.
URIs | Disease Ontology and UMLS | OBO, NIH/NLM
URIs | ArrayExpress | EBI
URIs | NCBI Taxonomy | NCBI
RDF Schema | Dublin Core | W3C
RDF Schema | N3 notation | W3C
RDF Schema | Co-occurrence of gene-disease | EBI
RDF Schema | PMC doc standard | NCBI
Ontology | Relation ontology | OBO
Ontology | URI server | W3C
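SPARQL is listed among the standards, and the brokers in the minimal configuration carry query templates. The sketch below shows how such a template might be filled in for a gene query. The `ex:` prefix, the predicates and the validation rule are hypothetical and are not taken from the SESL functional specification.

```python
from string import Template

# Hypothetical vocabulary prefix and predicates, for illustration only.
GENE_DISEASE_TEMPLATE = Template("""\
PREFIX ex: <http://example.org/sesl/vocab/>
SELECT ?disease ?document
WHERE {
  ?gene ex:symbol "$gene_symbol" .
  ?gene ex:coOccursWith ?disease .
  ?disease ex:mentionedIn ?document .
}
LIMIT $limit
""")


def build_query(gene_symbol, limit=10):
    """Fill the template, rejecting symbols that could break out of the literal."""
    # Simplification: only plain alphanumeric symbols are accepted here.
    if not gene_symbol.isalnum():
        raise ValueError(f"unexpected characters in gene symbol: {gene_symbol!r}")
    return GENE_DISEASE_TEMPLATE.substitute(gene_symbol=gene_symbol, limit=int(limit))


query = build_query("TCF7L2", limit=5)
print(query)
```

Keeping the query text in a template means each broker can run the same query shape against its own triple store.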
The Deliverables of the SESL pilot
• A proof-of-concept to demonstrate feasibility and
clarify requirements
– http://www.pistoia-sesl.org
• A functional specification for query brokering,
result filtering, report generation
– Expect publication by end 2011
– http://www.pistoiaalliance.com/workinggroups/sesl.html
• Academia, Life Science Industry and Publishers
– Attained a better understanding of each other’s needs
– Demonstration of potential for a new business model
– Explore follow-on via Open Innovation consortia
Learning and Future Direction
• Framework to maximise re-use of existing standards
– Minimise use of bespoke, hard-coded implementations
• Crucial features of a knowledge brokering service:
– RDF triples for a scalable meta-index to broker across
primary sources (both databases and literature)
– Important to define business rules for query & extraction
– Recommend a registry of suitable data sources
• similar to a web services registry
• What is next?
– Example follow-on to the SESL pilot:
– Open PHACTS consortium => www.openphacts.org
– 3 year IMI pre-competitive project (started early 2011)
– Data providers and Life Science industry working together
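The recommended registry of suitable data sources could start as a simple catalogue that a broker consults before sending queries. The sketch below is one possible shape. The field names and the `example.org` endpoints are hypothetical; the source names and providers are taken from the primary source layer of the demonstrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    kind: str      # "database" or "literature"
    provider: str
    endpoint: str  # hypothetical URL, for illustration only


class SourceRegistry:
    """A minimal in-memory registry of data sources available to a broker."""

    def __init__(self):
        self._sources = {}

    def register(self, source):
        if source.name in self._sources:
            raise ValueError(f"{source.name} is already registered")
        self._sources[source.name] = source

    def find(self, kind=None, provider=None):
        """Return sources matching the given filters, sorted by name."""
        return sorted(
            (s for s in self._sources.values()
             if (kind is None or s.kind == kind)
             and (provider is None or s.provider == provider)),
            key=lambda s: s.name,
        )


registry = SourceRegistry()
registry.register(DataSource("UniProt", "database", "EBI", "http://example.org/uniprot"))
registry.register(DataSource("OMIM", "database", "NCBI", "http://example.org/omim"))
registry.register(DataSource("UK-PubMed Central", "literature", "UK-PMC", "http://example.org/ukpmc"))

for source in registry.find(kind="database"):
    print(source.name, source.provider)
```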
Acknowledgements
Industry
Wendy Filsell - Unilever
(SESL co-leader)
Ian Stott - Unilever
Nigel Wilkinson - PFE
Catherine Marshall - PFE
Peter Woollard - GSK
Ashley George - GSK
Mike Westaway - AZ
Nick Lynch - AZ
Ian Dix - AZ
Michael Braxenthaler – Roche
John Wise – Pistoia Alliance
Publishers
Claire Bird – OUP
Richard O’Beirne – OUP
Colin Batchelor – RSC
Richard Kidd – RSC
David Hoole – NPG
Alf Eaton – NPG
Jabe Wilson – Elsevier
Bradley Allen – Elsevier
EMBL-EBI
Dietrich Rebholz-Schuhmann
(Technical Team Leader)
Christoph Grabmueller
Silvestras Kavaliauskas
Dominic Clark
Rodrigo Lopez
Jo McEntyre – UK-PMC
Janet Thornton