BioMart Federated Database Architecture
Arek Kasprzyk, EBI
9 June 2005
BioMart
• A joint project
– European Bioinformatics Institute (EBI)
– Cold Spring Harbor Laboratory (CSHL)
• Aim
– To develop a simple and scalable data management system capable of integrating distributed data sources.
Challenges
• Data sources
– Large
– Distributed
– Different data
Requirements
• User
– All data accessible through a single set of interfaces
– Suitable for power biologists and bioinformaticians
• Deployer
– ‘Out of the box’ installation
– Built-in query optimization
– Easy data federation
• Architecture
– Distributed
– Domain agnostic
– Platform independent
Query Engine
Federated architecture
BioMart
Data mart
User interfaces
Data sources
Data mart and dataset
Dataset
Data mart, dataset and schema
Schema
Dataset Configuration (XML)
BioMart abstractions
• Dataset
– A subset of data organized into 1 or more tables
• Attribute
– A single data point, e.g. gene name
• Filter
– An operation on an attribute, e.g. ‘Chromosome = 1’
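The three abstractions can be illustrated with a minimal Python sketch. This is not BioMart code: the `Dataset` class, its `query` method, and the sample rows are hypothetical, chosen only to show that an attribute names a data point to return and a filter restricts rows by an attribute's value.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for a BioMart dataset: a named collection of rows."""
    name: str
    rows: list = field(default_factory=list)

    def query(self, attributes, filters=None):
        """Return the requested attributes for rows matching every filter.

        attributes: names of the data points to return (e.g. "gene_name")
        filters: dict of attribute -> required value (e.g. {"chromosome": "1"})
        """
        filters = filters or {}
        return [
            tuple(row[a] for a in attributes)
            for row in self.rows
            if all(row.get(k) == v for k, v in filters.items())
        ]


# Illustrative data only; not taken from any real database.
genes = Dataset("genes", [
    {"gene_name": "geneA", "chromosome": "1"},
    {"gene_name": "geneB", "chromosome": "2"},
    {"gene_name": "geneC", "chromosome": "1"},
])

# Attribute = gene_name, filter = 'chromosome = 1'
result = genes.query(["gene_name"], {"chromosome": "1"})
print(result)
```

Only equality filters are modelled here; range or list filters would need an operator alongside each value.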
Datasets, Attributes and Filters
GENE
gene_id (PK)
gene_stable_id
gene_start
gene_chrom_end
chromosome
gene_display_id
description
Mart
Dataset
Attribute
Filter
Examples
• Upstream sequences for all kinases up-regulated in brain and associated with a QTL for a neurological disorder
• Name, chromosome position, description of all genes located on chromosome 1, expressed in lung, associated with human homologues and non-synonymous SNP changes
Data model
[Diagrams: source tables connected through primary keys (PK) and foreign keys (FK)]
Data model - ‘reversed star’
[Diagram: main tables 1 and 2 (keys PK1, PK2) connected to dimension (dm) tables through foreign keys FK1 and FK2]
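A small sketch of a ‘reversed star’ layout, using Python's built-in `sqlite3` module. The reading of the diagram (a central main table whose key is referenced by dimension tables) is an assumption, and every table and column name below (`demo__gene__main`, `demo__xref__dm`, `gene_id_key`, `accession`) is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: one row per central object (here, a gene).
    CREATE TABLE demo__gene__main (
        gene_id_key INTEGER PRIMARY KEY,
        gene_name   TEXT,
        chromosome  TEXT
    );
    -- Dimension table: zero or more rows per main-table row.
    CREATE TABLE demo__xref__dm (
        gene_id_key INTEGER REFERENCES demo__gene__main (gene_id_key),
        accession   TEXT
    );
    INSERT INTO demo__gene__main VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
    INSERT INTO demo__xref__dm VALUES (1, 'ACC001'), (1, 'ACC002'), (2, 'ACC003');
""")

# Filter on the main table, fetch an attribute from the dimension table.
rows = conn.execute("""
    SELECT m.gene_name, d.accession
    FROM demo__gene__main AS m
    JOIN demo__xref__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.chromosome = ?
    ORDER BY d.accession
""", ("1",)).fetchall()
print(rows)
```

Because every dimension table joins to the main table on the same key, a query touches the main table plus only the dimension tables it needs.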
Dataset
Fixed schema transformation
[Diagram labels: A, B, C, TA, TB]
BioMart abstractions
• Link
– ‘Common currency’ between two datasets, e.g. accession
• Exportable
– Potential links to export
• Importable
– Potential links to import
Exportables, Importables and Links
Dataset 1
Dataset 2
Links
Exportables, Importables and Links
Dataset 1 (Exportable)
– name = uniprot_id
– attributes = uniprot_ac
Dataset 2 (Importable)
– name = uniprot_id
– filters = uniprot_ac
Links
Exportables, Importables and Links
Dataset 1 (Exportable)
– name = genomic_region
– attributes = chr_name, chr_start, chr_end
Dataset 2 (Importable)
– name = genomic_region
– filters = chr_name (=), chr_start (>=), chr_end (<=)
Links
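One way to read the two examples above: an exportable lists attributes that a dataset can hand out, an importable lists filters that another dataset accepts, and a link exists where the two share a name. The Python sketch below follows that reading. The `Exportable` and `Importable` classes and the `find_links` function are hypothetical, not part of the BioMart API, and the rule that the attribute and filter lists must have equal length is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exportable:
    name: str
    attributes: tuple


@dataclass(frozen=True)
class Importable:
    name: str
    filters: tuple


def find_links(exportables, importables):
    """Pair exportables and importables that share a name.

    Assumption: a link is only valid when the number of exported
    attributes equals the number of importing filters.
    """
    by_name = {imp.name: imp for imp in importables}
    links = []
    for exp in exportables:
        imp = by_name.get(exp.name)
        if imp is not None and len(imp.filters) == len(exp.attributes):
            # Each exported attribute feeds the filter in the same position.
            links.append((exp.name, list(zip(exp.attributes, imp.filters))))
    return links


# Values taken from the two slides above.
dataset1 = [
    Exportable("uniprot_id", ("uniprot_ac",)),
    Exportable("genomic_region", ("chr_name", "chr_start", "chr_end")),
]
dataset2 = [
    Importable("uniprot_id", ("uniprot_ac",)),
    Importable("genomic_region", ("chr_name", "chr_start", "chr_end")),
]

links = find_links(dataset1, dataset2)
for name, pairs in links:
    print(name, pairs)
```

The comparison operators shown for `genomic_region` (=, >=, <=) are left out of this sketch; a fuller model would store one operator per filter.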
Building BioMart databases
[Diagram: Source databases; Transformation (MartBuilder); Mart; Configuration (MartEditor, XML)]
Table naming convention
Naïve configuration
• Tables
– Meta tables: meta_content
– Data tables: dataset__content__type
• Data tables
– Main: __main
– Dimension: __dm
• Columns
– Key: _key
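The naming convention can be checked mechanically. The Python sketch below splits a data table name of the form `dataset__content__type` and picks out key columns by their `_key` suffix. The function names and the example table name `mydataset__gene__main` are hypothetical, and the sketch assumes that dataset and content names never contain a double underscore themselves.

```python
def parse_table_name(table_name):
    """Split a data table name of the form dataset__content__type."""
    parts = table_name.split("__")
    if len(parts) != 3:
        raise ValueError(f"not a dataset__content__type name: {table_name!r}")
    dataset, content, table_type = parts
    if table_type not in ("main", "dm"):
        raise ValueError(f"unknown table type: {table_type!r}")
    return {"dataset": dataset, "content": content, "type": table_type}


def key_columns(columns):
    """Return the columns that follow the _key suffix convention."""
    return [c for c in columns if c.endswith("_key")]


parsed = parse_table_name("mydataset__gene__main")
keys = key_columns(["gene_id_key", "gene_name", "chromosome"])
print(parsed, keys)
```

Meta tables such as `meta_content` use a single underscore and are deliberately rejected by `parse_table_name`.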
Retrieval
[Diagram: data sources (myDatabase, SNP, Vega, Ensembl, UniProt, myMart, MSD); BioMart API (Java, Perl); MartExplorer, MartShell, MartView]
BioMart architecture
[Diagram: Schema transformation (MartBuilder); Configuration (MartEditor, XML); Databases; Public data (local or remote); MartView, MartExplorer, MartShell]
Mart Query Language (MQL)
● Using = dataset, Get = attribute, Where = filter
● Mart Query Language (MQL) syntax:
using <dataset> get <attributes> where <filters>
● Can join datasets together:
using Dataset1 get Attribute1 where Filter1=var1 as q;
using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
martshell.sh -E MQLscript.mql > results.txt
martshell.sh -E MQLscript.mql | wc
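Since an MQL statement is plain text, it can be assembled by a script before being handed to MartShell. The helper below is a hypothetical Python sketch, not part of BioMart. It follows the `using ... get ... where ...` shape shown above and joins filters with `and` as in the join example; separating multiple attributes with commas is an assumption.

```python
def build_mql(dataset, attributes, filters=None):
    """Assemble a 'using <dataset> get <attributes> where <filters>' string.

    attributes: list of attribute names (comma-joined; separator is an assumption)
    filters: optional list of ready-made filter expressions, e.g. "Filter1=var1"
    """
    if not attributes:
        raise ValueError("at least one attribute is required")
    query = f"using {dataset} get {','.join(attributes)}"
    if filters:
        query += " where " + " and ".join(filters)
    return query


# Reproduces the two statements of the join example above.
first = build_mql("Dataset1", ["Attribute1"], ["Filter1=var1"]) + " as q;"
second = build_mql("Dataset2", ["Attribute2"], ["Filter2=var2", "filter3 in q"])
print(first)
print(second)
```

The two strings could be written to a file such as `MQLscript.mql` and run with the `martshell.sh -E` invocation shown on the slide.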
Third party software
• Bioconductor (biomaRt)
– BioMart schema
• Taverna
– BioMart Java library
• DAS ProServer
– BioMart Perl library
biomaRt
Taverna
ProServer
• No programming
• DAS requests and responses defined by Exportables and Importables and configured by MartEditor
• DAS1
BioMart deployers
• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
Hinxton example
[Diagram: EBI (UniProt, MSD); Sanger (Ensembl, SNP, Vega, Sequence); WWW]
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)
WormBase
Ensembl
ArrayExpress
BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating user data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)
dbSNP, HapMap, Ensembl
Give me frequency data from dbSNP
Give me genotype and frequency data from HapMap
Give me SNP locations on genes/transcripts
Give me frequency, genotype, location on gene/transcript from dbSNP, HapMap, Ensembl, RefSeq, AceView and Vega
Java graphical user interface
WWW web browser
GMIA_SNP_mart_database
RefSeq
SNP1 T/A AL13929 963253 1
SNP2 C/T AL13929 963255 -1
SNP3 C/G AL13929 963258 1
…
AceView Vega
Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France.
… what next?
BioMart model
• Already applied
– Ensembl
– Vega
– SNP
– UniProt
– MSD
– ArrayExpress
– WormBase
– Variety of ‘in house’ projects
• In development
– HapMap
Summary
• BioMart interface
– Batch queries
– ‘Data mining’
– Large annotation
• BioMart software
– Set up your own database
– Make your database scalable and responsive
– Federate with other data
Where are we?
• 0.2 released in February
• 0.3 to be released in June
– Platforms
• MySQL
• Oracle
• Postgres
Acknowledgments
• BioMart
– Damian Smedley (EBI)
– Darin London (EBI)
– Will Spooner (CSHL)
• Contributors
– Arne Stabenau (Ensembl)
– Andreas Kahari (Ensembl)
– Craig Melsopp (Ensembl)
– Katerina Tzouvara (UniProt)
– Paul Donlon (Unilever)