An Overview of Intersect’s GDM System
Joe Thurbon, Intersect
Prelude
The Brief
• The research activities and communities it supports
• The data management and analysis issues it seeks to resolve
• The types and volumes of data/metadata it will manage
• The philosophy behind its design (e.g. why use semantic web technologies, if you are)
Supported Communities
Issues Addressed
Data / Metadata
• We keep
  – All the short reads
  – All the bioanalyser output
  – All the trimmed reads
  – All the tertiary analysis output ‘that works’
  – All parameters used to generate all of the above
    • Chemistry versions
    • Command line parameters
    • Species, locations, etc.
Data Counts

Platform / Metric                      Tissue Samples Per Year   Chunks of Data Per Year   Fields of Metadata Per Year
Illumina with standard multiplexing    ~2050                     >12000                    ~100000
454 with standard multiplexing         ~3500                     >14000                    ~120000
Data Sizes

Platform / Metric                      Untrimmed / Year   Trimmed / Year
Illumina with standard multiplexing    1.2TB              300GB
454 with standard multiplexing         ~2.5TB (?)         600GB
Philosophy
• Anthropology
• Epistemology
• History and Philosophy of Science
• Bio-informatics
• GDM
What do people do?
What do Researchers do?
What do Scientists do?
What do Bio-Informaticians do?
What does GDM Do?
• Allows bio-informaticians to
  – Stand on the shoulders of giants (including themselves)
  – Record their observations
  – Record their experimental parameters
  – Manage their data
• Iteratively
Demo
A result corresponds to a single experiment
• What experiment did I do? (steps)
• What parameters did I set? (parameters)
• What observations did I make? (outputs)
• What did I end up with? (data)
• What did I start from? (parents)
Metadata + Data + Parents = One Result
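The "Metadata + Data + Parents = One Result" idea can be sketched as a small record type. This is only an illustration, not GDM's actual schema: the class name Result, its field names, the lineage helper, and every sample value (file names, parameter values) are hypothetical, chosen to mirror the five questions above.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Hypothetical record of a single experiment (not GDM's real schema)."""
    name: str
    steps: list = field(default_factory=list)         # What experiment did I do?
    parameters: dict = field(default_factory=dict)    # What parameters did I set?
    observations: list = field(default_factory=list)  # What observations did I make?
    data: list = field(default_factory=list)          # What did I end up with?
    parents: list = field(default_factory=list)       # What did I start from?

    def lineage(self):
        """Return the names of all ancestor results, nearest first."""
        names = []
        queue = list(self.parents)
        while queue:
            parent = queue.pop(0)
            if parent.name not in names:
                names.append(parent.name)
                queue.extend(parent.parents)
        return names


# Illustrative chain: raw reads -> trimmed reads -> assembly.
raw = Result(name="raw-reads", data=["reads.fastq"])
trimmed = Result(
    name="trimmed-reads",
    steps=["trim"],
    parameters={"min_quality": "20"},
    data=["trimmed.fastq"],
    parents=[raw],
)
assembly = Result(
    name="assembly",
    steps=["assemble"],
    parameters={"kmer": "31"},
    observations=["assembly completed"],
    data=["contigs.fa"],
    parents=[trimmed],
)

print(assembly.lineage())
```

Because each result keeps references to its parents, later work can build on earlier results and the chain back to the starting data stays recoverable.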
GDM Repository
[Diagram labels: SCU, Ramaciotti, ANU; VELVET, T-Coffee, BLAST; AC3; NCBI, EMBL, DDBJ]