BioMart Federated Database Architecture Arek Kasprzyk EBI 9 June 2005.

BioMart Federated Database Architecture

Arek Kasprzyk, EBI, 9 June 2005

BioMart
• A joint project
  – European Bioinformatics Institute (EBI)
  – Cold Spring Harbor Laboratory (CSHL)
• Aim
  – To develop a simple and scalable data management system capable of integrating distributed data sources.

Challenges
• Data sources
  – Large
  – Distributed
  – Different data

Requirements
• User
  – All data accessible through a single set of interfaces
  – Suitable for power biologists and bioinformaticians
• Deployer
  – ‘Out of the box’ installation
  – Built in query optimization
  – Easy data federation
• Architecture
  – Distributed
  – Domain agnostic
  – Platform independent

Federated architecture
[Diagram labels: User interfaces, Query Engine, BioMart, Data mart, Data sources]

Data mart and dataset
[Diagram label: Dataset]

Data mart, dataset and schema
[Diagram label: Schema]

Dataset Configuration
[Diagram labels: XML (three times)]

BioMart abstractions
• Dataset
  – A subset of data organized into 1 or more tables
• Attribute
  – A single data point, e.g. gene name
• Filter
  – An operation on an attribute, e.g. ‘Chromosome = 1’

Datasets, Attributes and Filters
[Diagram labels: Mart, Dataset, Attribute, Filter]
GENE table columns: gene_id (PK), gene_stable_id, gene_start, gene_chrom_end, chromosome, gene_display_id, description
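The relationship between a dataset, its attributes and its filters can be sketched as a simple query over the GENE table. This is only an illustration, not BioMart's query engine: the column names come from the slide, while the rows, the `run_query` helper and its equality-only filters are assumptions made for the example.

```python
import sqlite3

# In-memory stand-in for a dataset. Columns follow the GENE table on the
# slide; the three rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, gene_stable_id TEXT, "
    "gene_start INTEGER, gene_chrom_end INTEGER, chromosome TEXT, "
    "gene_display_id TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "G0001", 100, 900, "1", "geneA", "example kinase"),
        (2, "G0002", 1500, 2400, "1", "geneB", "example receptor"),
        (3, "G0003", 300, 700, "2", "geneC", "example transporter"),
    ],
)


def run_query(conn, dataset, attributes, filters):
    """Hypothetical helper: return `attributes` for rows matching all `filters`.

    `filters` maps a column name to the value it must equal.
    """
    if not dataset.isidentifier():
        raise ValueError(f"invalid dataset name: {dataset!r}")
    # Column names cannot be bound as SQL parameters, so check them against
    # the table definition before building the statement.
    known = {row[1] for row in conn.execute(f'PRAGMA table_info("{dataset}")')}
    unknown = (set(attributes) | set(filters)) - known
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    sql = f'SELECT {", ".join(attributes)} FROM "{dataset}"'
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    return conn.execute(sql, list(filters.values())).fetchall()


# Attributes: gene_display_id, gene_start. Filter: chromosome = 1.
rows = run_query(conn, "gene", ["gene_display_id", "gene_start"], {"chromosome": "1"})
print(sorted(rows))
```

Here the dataset is the table, each attribute is a selected column, and each filter becomes one condition in the WHERE clause.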

Examples
• Upstream sequences for all kinases up-regulated in brain and associated with a QTL for a neurological disorder
• Name, chromosome position and description of all genes located on chromosome 1, expressed in lung, associated with human homologues and non-synonymous SNP changes

Data model
[Diagrams: three slides showing source tables related through primary keys (PK) and foreign keys (FK)]

Data model - ‘reversed star’
[Diagram: main tables (main1, 2) with primary keys PK1 and PK2; dimension tables (dm) reference them through foreign keys FK1 and FK2]
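A minimal sketch of the ‘reversed star’ layout, assuming one main table and one dimension table: the main table holds the primary key, and the dimension table points back to it through a foreign key. The table names follow the `dataset__content__type` convention that appears later in the talk, but the specific names, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Main table: one row per gene, carrying the primary key.
    CREATE TABLE gene__gene__main (
        gene_id_key INTEGER PRIMARY KEY,
        name        TEXT,
        chromosome  TEXT
    );
    -- Dimension table: zero or more rows per gene, each holding a
    -- foreign key back to the main table.
    CREATE TABLE gene__xref__dm (
        gene_id_key INTEGER REFERENCES gene__gene__main (gene_id_key),
        accession   TEXT
    );
    INSERT INTO gene__gene__main VALUES (1, 'geneA', '1'), (2, 'geneB', '2');
    INSERT INTO gene__xref__dm VALUES (1, 'ACC001'), (1, 'ACC002'), (2, 'ACC003');
    """
)

# Filter on the main table, then pull an attribute from the dimension table.
rows = conn.execute(
    """
    SELECT m.name, d.accession
    FROM gene__gene__main AS m
    JOIN gene__xref__dm AS d ON d.gene_id_key = m.gene_id_key
    WHERE m.chromosome = '1'
    ORDER BY d.accession
    """
).fetchall()
print(rows)  # [('geneA', 'ACC001'), ('geneA', 'ACC002')]
```

Every query in this sketch needs at most one join per dimension table, always on the same key, which is one plausible reason such a layout is easy to query generically.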

Fixed schema transformation
[Diagram labels: Dataset, A, B, C, TA, TB]

BioMart abstractions
• Link
  – ‘Common currency’ between two datasets, e.g. accession
• Exportable
  – Potential links to export
• Importable
  – Potential links to import

Exportables, Importables and Links
[Diagram: Dataset 1 and Dataset 2 connected by Links]

Exportables, Importables and Links
• Dataset 1: Exportable
  – name = uniprot_id
  – attributes = uniprot_ac
• Dataset 2: Importable
  – name = uniprot_id
  – filters = uniprot_ac

Exportables, Importables and Links
• Dataset 1: Exportable
  – name = genomic_region
  – attributes = chr_name, chr_start, chr_end
• Dataset 2: Importable
  – name = genomic_region
  – filters = chr_name (=), chr_start (>=), chr_end (<=)
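One way to read the exportable/importable pairing is that the attribute values exported by the first dataset are fed, position by position, into the filters of the second. The sketch below illustrates that reading for the genomic_region link. The `Exportable`, `Importable` and `link` names, the in-memory rows and the matching logic are all assumptions made for the example, not BioMart's implementation.

```python
import operator
from dataclasses import dataclass


@dataclass(frozen=True)
class Exportable:
    name: str
    attributes: tuple  # attribute names whose values are exported


@dataclass(frozen=True)
class Importable:
    name: str
    filters: tuple  # (attribute name, comparison) pairs, same order as above


OPS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le}


def link(rows1, exportable, rows2, importable):
    """Return rows of dataset 2 that pass the filters for any exported value set."""
    if exportable.name != importable.name:
        raise ValueError("exportable and importable names differ")
    if len(exportable.attributes) != len(importable.filters):
        raise ValueError("attribute and filter counts differ")
    exported = [tuple(row[a] for a in exportable.attributes) for row in rows1]

    def matches(row, values):
        return all(
            OPS[op](row[attr], value)
            for (attr, op), value in zip(importable.filters, values)
        )

    return [row for row in rows2 if any(matches(row, v) for v in exported)]


# The genomic_region link from the slide; the rows are invented.
region_out = Exportable("genomic_region", ("chr_name", "chr_start", "chr_end"))
region_in = Importable(
    "genomic_region",
    (("chr_name", "="), ("chr_start", ">="), ("chr_end", "<=")),
)
regions = [{"chr_name": "1", "chr_start": 100, "chr_end": 500}]
features = [
    {"id": "snp1", "chr_name": "1", "chr_start": 150, "chr_end": 150},
    {"id": "snp2", "chr_name": "1", "chr_start": 600, "chr_end": 600},
    {"id": "snp3", "chr_name": "2", "chr_start": 150, "chr_end": 150},
]
linked = link(regions, region_out, features, region_in)
print([row["id"] for row in linked])  # ['snp1']
```

The uniprot_id link is the single-attribute case of the same idea: one exported attribute (uniprot_ac) matched by one equality filter.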

Building BioMart databases
[Diagram labels: Source databases, Transformation (MartBuilder), Mart, Configuration (MartEditor), XML]

MartEditor

Table naming convention
Naïve configuration
• Tables
  – Meta tables: meta_content
  – Data tables: dataset__content__type
• Data tables
  – Main: __main
  – Dimension: __dm
• Columns
  – Key: _key
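The naming convention above is regular enough that a table's role can be read from its name. The sketch below shows one possible reading, assuming that data tables consist of exactly three parts separated by double underscores and that meta tables start with `meta_`. The `MartTable` type, the helper functions and the example table names are hypothetical.

```python
from typing import NamedTuple, Optional


class MartTable(NamedTuple):
    kind: str                  # "data" or "meta"
    dataset: Optional[str]     # None for meta tables
    content: str
    table_type: Optional[str]  # "main" or "dm" for data tables, else None


def parse_table_name(name: str) -> MartTable:
    """Classify a table name following dataset__content__type / meta_content."""
    parts = name.split("__")
    if len(parts) == 3 and all(parts) and parts[2] in ("main", "dm"):
        return MartTable("data", parts[0], parts[1], parts[2])
    if name.startswith("meta_") and len(name) > len("meta_"):
        return MartTable("meta", None, name[len("meta_"):], None)
    raise ValueError(f"table name does not follow the convention: {name!r}")


def is_key_column(column: str) -> bool:
    """Key columns end in _key."""
    return column.endswith("_key")


print(parse_table_name("hsapiens_gene__gene__main"))
print(parse_table_name("hsapiens_gene__xref__dm"))
print(is_key_column("gene_id_key"))
```

A tool that generates a first-pass configuration could group tables by the dataset part, treat `main` tables as the dataset's core and join `dm` tables through the shared `_key` columns.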

Retrieval
[Diagram labels: myDatabase, SNP, Vega, Ensembl, UniProt, myMart, MSD; BioMart API (Java, Perl); MartExplorer, MartShell, MartView]

BioMart architecture
[Diagram labels: Databases, Public data (local or remote), Schema transformation, MartBuilder, Configuration, MartEditor, XML]

MartView

MartExplorer

MartShell

Using = dataset
Get = attribute
Where = filter

Mart Query Language (MQL)
● Mart Query Language (MQL) syntax:
  using <dataset> get <attributes> where <filters>
● Can join datasets together:
  using Dataset1 get Attribute1 where Filter1=var1 as q;
  using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
● Can script and pipe:
  martshell.sh -E MQLscript.mql > results.txt
  martshell.sh -E MQLscript.mql | wc
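Because MQL statements follow a fixed `using ... get ... where ...` shape, a script file for `martshell.sh -E` can be assembled programmatically. The sketch below builds the two-step join from the slide as a string. The `build_mql` helper is hypothetical, and two details are assumptions not shown on the slide: that several attributes are separated by commas, and that the where clause may be omitted when there are no filters.

```python
def build_mql(dataset, attributes, filters=(), alias=None):
    """Hypothetical helper: format one MQL statement from its parts."""
    if not attributes:
        raise ValueError("at least one attribute is required")
    # Assumption: multiple attributes are comma-separated.
    query = f"using {dataset} get {','.join(attributes)}"
    if filters:
        # Filters are combined with 'and', as in the slide's second query.
        query += " where " + " and ".join(filters)
    if alias:
        # 'as <name>' stores the result so a later query can refer to it.
        query += f" as {alias}"
    return query


first = build_mql("Dataset1", ["Attribute1"], ["Filter1=var1"], alias="q")
second = build_mql("Dataset2", ["Attribute2"], ["Filter2=var2", "filter3 in q"])
script = ";".join([first, second])
print(script)
# using Dataset1 get Attribute1 where Filter1=var1 as q;using Dataset2 get Attribute2 where Filter2=var2 and filter3 in q
```

The resulting string could be written to a file such as MQLscript.mql and passed to martshell.sh as shown on the slide.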

Third party software
• Bioconductor (biomaRt)
  – BioMart schema
• Taverna
  – BioMart Java library
• DAS ProServer
  – BioMart Perl library

biomaRt

Taverna

ProServer
• No programming
• DAS requests and responses defined by Exportables and Importables and configured by MartEditor
• DAS1

BioMart deployers
• Large scale data federation (EBI)
• Optimising access to a large database (Ensembl, WormBase)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

Hinxton example
[Diagram labels: EBI (UniProt, MSD); Sanger (Ensembl, SNP, Vega, Sequence); WWW]

BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase, ArrayExpress)
• Connecting proprietary datasets to public data (Pasteur, Unilever, Serono, Sanofi-Aventis, DevGen etc.)

WormBase

Ensembl

ArrayExpress

BioMart deployers
• Large scale data federation (Hinxton)
• Optimising access to a large database (Ensembl, WormBase)
• Federating user data with public data (Pasteur, INRA, Bayer, Unilever, Serono, Sanofi-Aventis, DevGen, Solexa etc.)

[Diagram labels: dbSNP, HapMap, Ensembl, RefSeq, AceView, Vega]
• Give me frequency data from dbSNP
• Give me genotype and frequency data from HapMap
• Give me SNP locations on gene/transcript
• Give me frequency, genotype and location on gene/transcript from dbSNP, HapMap, Ensembl, RefSeq, AceView and Vega

Interfaces: Java graphical user interface, WWW web browser

GMIA_SNP_mart_database
SNP1  T/A  AL13929  963253   1
SNP2  C/T  AL13929  963255  -1
SNP3  C/G  AL13929  963258   1
…

Genetics of Infectious and Autoimmune Diseases, Pasteur Institute, INSERM U730, Paris, France.

… what next ?

BioMart model
• Already applied
  – Ensembl
  – Vega
  – SNP
  – UniProt
  – MSD
  – ArrayExpress
  – WormBase
  – Variety of ‘in house’ projects
• In development
  – HapMap

Summary
• BioMart interface
  – Batch queries
  – ‘Data mining’
  – Large annotation
• BioMart software
  – Set up your own database
  – Make your database scalable and responsive
  – Federate with other data

Where are we?
• 0.2 released in February
• 0.3 to be released in June
  – Platforms
    • MySQL
    • Oracle
    • Postgres

Acknowledgments
• BioMart
  – Damian Smedley (EBI)
  – Darin London (EBI)
  – Will Spooner (CSHL)
• Contributors
  – Arne Stabenau (Ensembl)
  – Andreas Kahari (Ensembl)
  – Craig Melsopp (Ensembl)
  – Katerina Tzouvara (UniProt)
  – Paul Donlon (Unilever)