Data provenance in biomedical discovery
Donald Dunbar
Queen’s Medical Research Institute
University of Edinburgh
Workshop on Principles of Provenance in Databases
May 21st 2008
Background
biomedical research
basic & clinical science
animal, cell models, patients
genes, proteins, pathways
data analysis & mining
publication
Biomedical discovery
• Looking for contributions to human health and disease
• In-house experiments
– data workflows
– knowledge capture
• Use public databases
– many data types
– integration is a problem
Databases we use
sequence
structure
function
expression
domain specific
Data workflows
experiment 2
spreadsheet
raw data
calculations
publication
database
processed data
experiment 1 database
Data workflows
copy and paste
open from file
‘algorithm’
copy and paste
save to file
IN
OUT
BUT:
web services
automated tools & databases
bioinformatics workflows
Bioinformatics workflows
Is our field changing?
databases
experiments knowledge knowledgebase
Knowledge capture
What provenance do we need?
Example: Gene expression in a transgenic animal
gene annotation gene expression measurements
public databases output from machine
processing
integration
where, when
which identifiers how
when, what, how
data mining: what and how did we select genes
……
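The questions on this slide (where, when, which identifiers, how) can be read as the fields of a per-step provenance record. The following is a minimal sketch under that assumption, not a method described in the talk; the class `ProvenanceStep`, the function `describe`, and all values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class ProvenanceStep:
    """One step of an analysis, with the slide's questions as fields."""
    name: str        # what was done, e.g. "gene annotation"
    source: str      # where the data came from
    when: datetime   # when it was retrieved or produced
    how: str         # tool or method used
    identifiers: List[str] = field(default_factory=list)      # which identifiers
    parameters: Dict[str, str] = field(default_factory=dict)  # settings used


def describe(steps: List[ProvenanceStep]) -> List[str]:
    """Return one human-readable line per step, in order."""
    return [
        f"{s.name}: from {s.source} on {s.when.date().isoformat()} via {s.how}"
        for s in steps
    ]


# Hypothetical values, for illustration only.
steps = [
    ProvenanceStep(
        name="gene annotation",
        source="public database (hypothetical)",
        when=datetime(2008, 5, 1, tzinfo=timezone.utc),
        how="download",
        identifiers=["hypothetical_gene_id"],
    ),
    ProvenanceStep(
        name="processing",
        source="output from machine",
        when=datetime(2008, 5, 2, tzinfo=timezone.utc),
        how="normalisation script (hypothetical)",
        parameters={"method": "hypothetical setting"},
    ),
]

for line in describe(steps):
    print(line)
```

Keeping the steps in an ordered list means the record also answers "in what order", which a flat set of database fields would not.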
What provenance do we need?
Example: Curated protein database
expert data database links
curator input
archive
contributor, date
verify, add, delete, modify
source, identifiers, dates
Curated database: versions, dates
development: schema & interface changes
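One way to picture the curator-input part of this example is an append-only log of curation events, each carrying the contributor, the date, and one of the four actions named on the slide (verify, add, delete, modify). The sketch below assumes that reading; `Action`, `CurationEvent`, `CurationLog`, and the entry and curator names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List


class Action(Enum):
    """The four curator actions listed on the slide."""
    VERIFY = "verify"
    ADD = "add"
    DELETE = "delete"
    MODIFY = "modify"


@dataclass(frozen=True)
class CurationEvent:
    """A single, immutable record of who did what to which entry, and when."""
    entry_id: str
    action: Action
    contributor: str
    on: date
    source: str = ""  # optional: where the information came from


class CurationLog:
    """Append-only log: events are recorded, never edited or removed."""

    def __init__(self) -> None:
        self._events: List[CurationEvent] = []

    def record(self, event: CurationEvent) -> None:
        self._events.append(event)

    def history(self, entry_id: str) -> List[CurationEvent]:
        """All events for one entry, in the order they were recorded."""
        return [e for e in self._events if e.entry_id == entry_id]


# Hypothetical entries and curators, for illustration only.
log = CurationLog()
log.record(CurationEvent("protein_1", Action.ADD, "curator_a",
                         date(2008, 1, 10), source="expert data"))
log.record(CurationEvent("protein_2", Action.ADD, "curator_a",
                         date(2008, 1, 11), source="database links"))
log.record(CurationEvent("protein_1", Action.VERIFY, "curator_b",
                         date(2008, 2, 1)))

for event in log.history("protein_1"):
    print(event.on.isoformat(), event.contributor, event.action.value)
```

Because events are only ever appended, the state of an entry at any past date can in principle be rebuilt by replaying its history up to that date.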
What do we do now (for provenance)?
• We trust the main data providers a lot!
– a pragmatic approach
• We use tools and note the settings
– rarely fully
• We put extra fields in our databases
– source, modify date
• We deposit our data in public repositories
– but only when we need to
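The "extra fields" approach might look something like the sketch below, which uses Python's standard `sqlite3` module with an in-memory database. The table name `expression`, its columns, and the sample values are assumptions for illustration; the talk does not specify a schema or a database system.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# Two provenance columns sit alongside the data itself:
# `source` (where the value came from) and `modified_at` (modify date).
conn.execute(
    """
    CREATE TABLE expression (
        gene_id     TEXT NOT NULL,
        value       REAL NOT NULL,
        source      TEXT NOT NULL,
        modified_at TEXT NOT NULL DEFAULT (datetime('now'))
    )
    """
)

# Hypothetical row; modified_at is filled in by the column default.
conn.execute(
    "INSERT INTO expression (gene_id, value, source) VALUES (?, ?, ?)",
    ("hypothetical_gene", 1.5, "experiment 1 spreadsheet"),
)

# An update must refresh the modify date explicitly.
conn.execute(
    "UPDATE expression SET value = ?, modified_at = datetime('now') "
    "WHERE gene_id = ?",
    (2.0, "hypothetical_gene"),
)

row = conn.execute(
    "SELECT gene_id, value, source, modified_at FROM expression"
).fetchone()
print(row)
```

This records only the latest source and modify date for each row, which fits the talk's description of such provenance as rudimentary: earlier values and who changed them are lost.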
What might we do next?
• Use workflow tools like Taverna
– capture workflow provenance
• Build provenance tool & database
– widely applicable
• Make provenance more visible to biologists
– so they value and use it
Conclusions
• In biology we don’t do provenance well (yet)
• We use databases and manual workflows
• We implement rudimentary provenance
• We should build useful provenance tools
• We need to make provenance visible