Post on 27-Jan-2015
Data management for large collaborative projects: challenges and solutions
Arek Kasprzyk
Head of Data Management
Center for Translational Genomics and Bioinformatics
San Raffaele Scientific Institute,
March 27, 2014
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.
Big Data
2000 – 1 Genome ----- Human Genome Project
2008 – 1,000 Genomes ----- 1000 Genomes Project
2008 – 25,000 Genomes ----- ICGC
2012 – 100,000 Genomes ----- UK 100,000 Genomes Project
Big Data?
Libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses. They contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics. Information contained in biological databases includes gene function, structure, localization, and clinical effects of mutations, as well as similarities of biological sequences and structures.
Biomedical databases
www.biomart.org
Egyptian Hieroglyphs
Phoenician Alphabet
Biological abstractions
Query abstractions
Dataset
Filter
Attribute
Examples
human genes
located on chromosome 1, expressed in lungs
name, chromosome, description
rat genes
up-regulated in brain and associated with a QTL for a neurological disorder
Upstream sequences
Rihanna songs
released before 2012
UK top 10
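The three abstractions above can be sketched as a small data structure. This is a minimal illustration, not BioMart's own code: the `Query` class, its `run` method and the toy records are all hypothetical names and data invented here.

```python
from dataclasses import dataclass, field


@dataclass
class Query:
    """A query expressed as dataset + filters + attributes (hypothetical class)."""
    dataset: str                                     # what kind of records to search
    filters: dict = field(default_factory=dict)      # restrictions on those records
    attributes: list = field(default_factory=list)   # columns to return

    def run(self, records):
        """Return the requested attributes of every record that passes all filters."""
        hits = [r for r in records
                if all(r.get(k) == v for k, v in self.filters.items())]
        return [{a: r.get(a) for a in self.attributes} for r in hits]


# Toy records, invented for illustration (not real annotation).
genes = [
    {"name": "GENE_A", "chromosome": "1", "tissue": "lung", "description": "toy gene A"},
    {"name": "GENE_B", "chromosome": "2", "tissue": "lung", "description": "toy gene B"},
    {"name": "GENE_C", "chromosome": "1", "tissue": "brain", "description": "toy gene C"},
]

query = Query(
    dataset="human genes",
    filters={"chromosome": "1", "tissue": "lung"},
    attributes=["name", "chromosome", "description"],
)
result = query.run(genes)
print(result)
```

The same shape fits the non-biological example: the dataset is "Rihanna songs", the filter is "released before 2012", and the attribute is the chart position.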
Graphical User Interface
Dataset
Filter
Attribute
SOAP/REST
<Query>
  <Dataset name="hsapiens_gene_ensembl">
    <Filter name="chromosome_name" value="1"/>
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="ensembl_transcript_id"/>
    <Attribute name="biotype"/>
  </Dataset>
</Query>
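An XML query of this shape can also be assembled programmatically. The sketch below uses only Python's standard library; the helper name `build_query` is an assumption made for illustration, and submitting the query to a server is deliberately left out.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Build a <Query> document from a dataset name, filters and attributes.

    build_query is a hypothetical helper, not part of any BioMart client.
    """
    query = ET.Element("Query")
    ds = ET.SubElement(query, "Dataset", name=dataset)
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", name=name, value=value)
    for name in attributes:
        ET.SubElement(ds, "Attribute", name=name)
    return ET.tostring(query, encoding="unicode")


xml_query = build_query(
    "hsapiens_gene_ensembl",
    {"chromosome_name": "1"},
    ["ensembl_gene_id", "ensembl_transcript_id", "biotype"],
)
print(xml_query)
```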
What percentage of patients with primary breast cancer relapsed within 5 years of surgery?
What is the average survival of patients with Chronic Myeloid Leukaemia (CML), with and without splenomegaly at diagnosis?
Find the age and gender of patients who have been diagnosed with Hodgkin's disease, where the initial diagnosis occurred between the ages of 50 and 70 inclusive.
What percentage of patients diagnosed with primary breast cancer in the age range 30 to 70 were surgically treated and had postoperative haematoma/seroma?
Examples
?
BioMart schema – “reversed star”
Filter
Attribute
Dataset
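One way to picture a "reversed star" layout is a central main table with dimension tables that point back at it, so filters and attributes can come from either. The sketch below is an assumption-laden illustration using an in-memory SQLite database; the table and column names are invented here and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: one row per central object (here, a gene).
    CREATE TABLE gene__main (
        gene_id_key INTEGER PRIMARY KEY,
        name        TEXT,
        chromosome  TEXT
    );
    -- Dimension table: many rows per gene, each pointing back at the main table.
    CREATE TABLE gene__expression__dm (
        gene_id_key INTEGER REFERENCES gene__main(gene_id_key),
        tissue      TEXT
    );
    INSERT INTO gene__main VALUES
        (1, 'GENE_A', '1'), (2, 'GENE_B', '2'), (3, 'GENE_C', '1');
    INSERT INTO gene__expression__dm VALUES
        (1, 'lung'), (1, 'brain'), (2, 'lung'), (3, 'brain');
""")

# Filter on a main-table column and a dimension-table column together.
rows = conn.execute(
    """
    SELECT DISTINCT m.name, m.chromosome
    FROM gene__main AS m
    JOIN gene__expression__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.chromosome = ? AND d.tissue = ?
    ORDER BY m.name
    """,
    ("1", "lung"),
).fetchall()
print(rows)
```

In this picture the dataset corresponds to the main table, while each filter or attribute maps to a column in the main table or in one of its dimension tables.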
BioMart Architecture
Mart Configurator
Add new data sources
Federate data sources
Edit metadata
Convert relational databases into mart schemas ‘virtual marts’
Website with a click of a button
BioMart “out of the box” website
Multiple Graphical User Interfaces
Multiple Application Programming Interfaces
What is BioMart?
BioMart
URL SOAP
REST JAVA
What is BioMart?
SPARQL
BioMart
Bioclipse Taverna
Galaxy
Cytoscape
BioConductor
WebLab
What is BioMart?
BioMart Central Portal: an open database network for the biological community. Guberman et al., Database, Vol. 2011, doi:10.1093/database/bar041
BioMart Central Portal
BioMart Community
The G-protein coupled receptor domain (GPCR) has the InterPro ID of IPR000276. Find the human protein-coding genes in Ensembl that code for this domain, and investigate whether any of them are detectable with the Affy HuGene 1_0 st v1 array
esv263 is the DGVa accession number of a structural variation from Redon et al. (20). What genomic region does this copy number variation span?
Find the genes from Escherichia coli strain K12 that are found within the region ‘360473–365601’ and discover whether there are any orthologs in the related strains E. coli O157:H7 EC4115 and E. coli DH10B
The three-gene APL1 locus encodes essential components of the mosquito immune defense against malaria parasites. Find the variations within the APL1A, APL1B and APL1C genes as well as the strain name, strain genotype, allele and biotype
Ensembl
Find all IKMC resources for genes encoding transcription factors on chromosome 1 between 180-190 Mbp
Find all IKMC resources for genes expressed in heart
Find all IKMC mice available from the EMMA Repository with information on the vector used to make the mutation
Show me all the distributed EMMA lines that have passed Southern blot quality control at a distribution center
Is there any existing phenotype data for other mouse knockouts of the same gene for mouse lines produced from EUCOMM ES resources?
International Knockout Mouse Consortium (IKMC)
Find all gene fusion mutations involving the FUS gene with a primary site of bone, and display mutation and sample information
Find variation information for all genes from mutated samples with a primary site of breast, and display COSMIC gene, mutation and sample information along with Ensembl variation information
Check the transcriptomic alteration status of the genes gained in lung cancer
Find all missense substitution mutations for BRAF in cell lines, and display sample, mutation, site, and histology information
COSMIC
Find genes commonly deregulated in pancreatic cancer precursor lesions, pancreatic intraepithelial neoplasia (PanIN) samples and display gene information, comparison and direction of regulation
Find genes differentially expressed in the serum of pancreatic cancer patients when compared to the serum of patients with benign pancreatic diseases (chronic pancreatitis and pancreatic pseudocyst). Find associated pathways via query integration with Reactome. Display gene and protein information, experimental details and pathway information
Find DNA copy number high-level amplifications in PDAC samples that also contain genes differentially expressed in PDAC versus chronic pancreatitis (CP) and display copy number information, gene information and differential expression experimental details
Find miRNAs differentially expressed in PDAC versus CP whose expression has been confirmed by RT–PCR techniques and display miRNA attributes and study information
Pancreatic Expression Database
Scales better
No central funding
No admin overhead
Very green footprint
Maintained by experts
“Virtual Bioinformatics Institute”
ICGC Data Portal
International Cancer Genome Consortium Data Portal: a one-stop shop for cancer genomics data. Zhang et al., Database, Vol. 2011, doi:10.1093/database/bar038
International Cancer Genome Consortium
Goals
Catalogue genomic abnormalities in tumors in 50 different cancer types and/or subtypes of clinical and societal importance across the globe
Generate complementary catalogues of transcriptomic and epigenomic datasets from the same tumors
Make the data available to the entire research community as rapidly as possible and with minimal restrictions to accelerate research into the causes and control of cancer
50 different tumor types and/or subtypes
500 samples per tumor
25,000 Human Genome Projects!
ICGC members
Data models
Genes
Samples
Simple mutations
Copy number mutations
Structural rearrangements
Gene expression
DNA methylation
miRNA
Exon junction
Architecture
“Parallel” Query Engine
Gene Report
Ensembl
KEGG
ICGC
Pancreatic Expression Database
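A gene report drawn from several independent sources suggests a fan-out pattern: send the same question to every source at once, then merge the answers. The sketch below is one possible illustration of that idea, not the portal's actual engine. The in-memory `SOURCES` dictionary stands in for remote databases, and its contents, along with the function names, are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for independent remote sources; all contents are invented.
SOURCES = {
    "Ensembl": {"GENE_A": {"chromosome": "1"}},
    "KEGG": {"GENE_A": {"pathway": "toy pathway"}},
    "ICGC": {"GENE_A": {"mutated_samples": 3}},
}


def query_source(source_name, gene):
    """Look the gene up in one source; a real portal would make a network call here."""
    return source_name, SOURCES[source_name].get(gene, {})


def gene_report(gene):
    """Query every source concurrently and merge the answers into one report."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [pool.submit(query_source, name, gene) for name in SOURCES]
        return {name: data for name, data in (f.result() for f in futures)}


report = gene_report("GENE_A")
print(report)
```

Because each source is queried independently, the total wait is roughly that of the slowest source rather than the sum of all of them.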
Quick Search
Ensembl
KEGG
ICGC
Ensembl
Retrieve clinical staging data for colorectal cancer patients with non-synonymous simple mutations in genes that are involved in WNT signaling pathway
Search for genes affected by copy number loss and also detected as deletion from structural rearrangement analysis
In pancreatic cancer data set, retrieve all RNA-seq expression data for genes that are affected by copy number gains
ICGC Query Examples
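The second example query, genes with copy number loss that are also reported as deletions by structural rearrangement analysis, reduces to an intersection of two result sets. A minimal sketch with invented gene identifiers:

```python
# Invented result sets from two separate analyses.
copy_number_loss = {"GENE_A", "GENE_B", "GENE_D"}
structural_deletion = {"GENE_B", "GENE_C", "GENE_D"}

# Genes supported by both lines of evidence.
both = sorted(copy_number_loss & structural_deletion)
print(both)
```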
“Digital Medicine” University Health Network (UHN)
“Digital Medicine” Pilot
“Digital Medicine” Architecture
KEGG Reactome COSMIC Ensembl
BioMart Central Portal
HICT
COSMIC Ensembl
HICT
KEGG
Reactome
TCGA - Ovarian
ICGC
ICGC - Pancreas
TCGA
Cancer Portal (ICGC) Clinical Trials
Collaboration between UHN, OICR and Pfizer on colorectal cancer
Sequencing
Stem cells
Clinical data
Pop-cure project
PopCure data management architecture
KEGG Reactome COSMIC Ensembl
BioMart Central Portal
COSMIC Ensembl
HICT
KEGG
Reactome
TCGA - Colorectal
ICGC
ICGC - Colorectal
TCGA
Cancer Portal (ICGC) Pfizer internal data UHN internal data
Collaboration between BioMart and Pfizer on their internal data management infrastructure
BioMart technology provided a single access point to:
BioMart community portal (Ensembl, Reactome etc)
ICGC Portal
Internal Pfizer resources
The “La Jolla” Project
Institut National de la Recherche Agronomique (INRA)
San Raffaele Scientific Institute (SRSI)
SRSI is one of the principal research Institutes in Italy as per volume and profile of scientific output
Italy’s leading center for translational medicine
Center for Translational Genomics and Bioinformatics
NGS and Disease
To support, nationally and internationally, the use of genomic technologies and bioinformatics methodologies to improve our understanding of human diseases, enabling better prevention, diagnosis and treatment.
Interdisciplinarity
To offer the Institute and the Hospital, collaborators and clients an integrated, interdisciplinary platform ranging from molecular biology to medicine, from statistics to computer science, from mathematics to ethics.
Dissemination
To contribute internationally recognized publications to the genomic and bioinformatic sciences.
Service
To offer professional, qualified support in delivering genomic and bioinformatic services, ensuring timely and accurate results while retaining the natural scientific curiosity and collaborative attitude that characterize research.
“Consumer driven market”
Plethora of web tools
The usual problems
Incompatibility of the websites
Incompatibility of technologies
Etc
The reality of translational bioinformatics
BioMart v 0.9 Data analysis and visualization framework
Enrichment Tool
All Ensembl species
Plethora of identifiers
Homology
BED files (CNVs, DMRs)
Full programmatic access
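The slides do not say which statistic the enrichment tool uses. A common choice for this kind of analysis is the hypergeometric test, sketched below purely as an assumption; the function name and the example numbers are invented.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, overlap):
    """Upper-tail hypergeometric probability.

    Probability of seeing at least `overlap` annotated genes when `sample`
    genes are drawn without replacement from `population` genes, of which
    `annotated` carry the term of interest.
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Invented numbers: 20,000 genes, 200 annotated, 100 in the user's list, 10 shared.
p = enrichment_p_value(population=20000, annotated=200, sample=100, overlap=10)
print(p)
```

A small p-value suggests the term is over-represented in the submitted list. In practice a correction for testing many terms at once would also be needed.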
All data and analytics under one roof
All publicly available disease data
Cancer
Mendelian
Complex
Analysis
Enrichment
Prioritizer
“Disease report” for your experimental data
BioMart Disease Portal
BioMart
Services: Single access point to biomedical data
Software: Support for collaborative efforts
Data federation
Scalability
Data agnostic
Summary
Host institutions
European Bioinformatics Institute (EBI)
Ontario Institute for Cancer Research (OICR)
San Raffaele Scientific Institute (SRSI)
BioMart community
28 organizations
50 database projects
BioMart developers
Acknowledgments
Center for Translational Genomics and Bioinformatics, SRSI, Milan, Italy
Director: Professor Giorgio Casari
Co-Director: Dr Elia Stupka
www.biomart.org