European Bioinformatics Institute / MGED Society
Establishing the infrastructure
for sharing microarray data
Alvis Brazma
European Bioinformatics Institute EMBL-EBI
Microarray Gene Expression Data Society
Outline
Establishing the infrastructure for sharing microarray data – MGED, MIAME, MAGE-ML, databases
Microarray Informatics at the EBI
Microarrays
- a tool for the golden age of genome discoveries
Some questions for the golden age of genomics
How does gene expression differ in different cell types?
How does gene expression change when the organism develops and cells differentiate?
How does gene expression differ between a normal and a diseased (e.g., cancerous) cell?
How does gene expression change when a cell is treated with a drug?
How is gene expression regulated: which genes regulate which, and how?
Potential amounts of microarray data
Experiments:
~ 30 000 genes in a human genome
~ 320 cell types in a human organism
– 2000 compounds for screening
– 2 concentrations
– 3 time points
– 5 replicates
Data: ~ 10^12 data points, about 1 terabyte
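The estimate above can be checked with a few lines of arithmetic. The sketch below only multiplies the factors listed on the slide; the dictionary keys and the function name are illustrative, not part of the original talk.

```python
# Back-of-the-envelope check of the slide's estimate.
# Keys are illustrative labels for the factors listed on the slide.
factors = {
    "genes": 30_000,
    "cell_types": 320,
    "compounds": 2_000,
    "concentrations": 2,
    "time_points": 3,
    "replicates": 5,
}


def total_data_points(factors):
    """Multiply all experimental factors to get the number of data points."""
    total = 1
    for count in factors.values():
        total *= count
    return total


data_points = total_data_points(factors)
# The product is a little under 10^12, consistent with the slide's "~".
print(f"{data_points:.2e} data points")
```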
Making microarray data available to the public
Authors' web-sites
Local, lab based public databases (Stanford University, Whitehead,…)
Journal web-sites
There is a wide community consensus that there is a need for public repositories for microarray data, analogous to DDBJ/EMBL/GenBank for sequence data
Which data to share?
[Figure: raw data (array scans), quantitation matrices (spots × quantitations) and the gene expression data matrix (genes × samples, holding gene expression levels)]
[Figure: gene expression matrix (genes × samples) with sample annotations (problem 1), gene expression levels (problem 2) and gene annotations]
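One way to picture the annotated gene expression matrix is as a small data structure that keeps the expression levels together with the gene and sample annotations. This is a minimal sketch; the class name, field names and example values are all hypothetical and do not come from any MGED specification.

```python
from dataclasses import dataclass, field


@dataclass
class ExpressionMatrix:
    """Genes x samples matrix plus annotations (hypothetical layout)."""
    genes: list      # row labels
    samples: list    # column labels
    levels: list     # levels[i][j] = expression of genes[i] in samples[j]
    gene_annotations: dict = field(default_factory=dict)
    sample_annotations: dict = field(default_factory=dict)

    def level(self, gene, sample):
        """Look up one expression level by gene and sample name."""
        return self.levels[self.genes.index(gene)][self.samples.index(sample)]

    def unannotated_samples(self):
        """Samples whose annotation is missing."""
        return [s for s in self.samples if s not in self.sample_annotations]


# Made-up example values, for illustration only.
matrix = ExpressionMatrix(
    genes=["geneA", "geneB"],
    samples=["sample1", "sample2"],
    levels=[[0.5, -1.2], [2.0, 0.1]],
    gene_annotations={"geneA": "hypothetical kinase"},
    sample_annotations={"sample1": {"treatment": "none"}},
)
print(matrix.unannotated_samples())
```

Without the sample annotation, the numbers in a column cannot be interpreted, which is why the annotations travel with the matrix here.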
[Figure: sample annotation and gene annotation. Boxes: source, sample treatment, RNA extract, labelled nucleic acid, hybridisation, array, design, elements (spots), protocols, image, quantitation matrix. Several such hybridisations make up an experiment; transformation and integration give the gene expression data matrix of gene expression measurements.]
Problem 4
The gene expression data and annotations described above are complex in nature and structure
For public repositories to make maximum use of these data, standards for representing and communicating them should be established
Standards for microarray data
Understanding and agreement on what data and annotations should be provided
Standard controlled vocabularies (ontologies) that can be used in such annotations
Standard format for exchange of annotated data
Understanding how to compare different datasets
A Microarray Gene Expression Database meeting was organised in Cambridge, UK, in November 1999 to discuss these problems
MGED 1 – some participants
Affymetrix
DDBJ
DKFZ
EMBL
Gene Logic
Incyte
Max Planck Institute
NCGR
NHGRI
Sanger Centre
Stanford University
Uni Pennsylvania
Uni Washington, Seattle
Whitehead Institute
MGED working groups
Experiment annotation
Data exchange format and modelling
Ontologies
Data normalisation and transformations
Queries
MGED meetings
MGED 2, Heidelberg, May 2000
MGED 3, Stanford University, April 2001
MGED 4, Boston, February 2002
MGED 5, Tokyo, September 2002
The MGED Society was founded in June 2002
The Microarray Gene Expression Data (MGED) Society is an international organisation for facilitating the sharing of functional genomics and proteomics array data
Board of 17 directors
www.mged.org
MGED standards
Annotation content – MIAME
Data representation and exchange format – MAGE-OM (MAGE-ML), jointly with OMG
MIAME – Minimum Information About a Microarray Experiment
An attempt to outline the minimum information required to interpret unambiguously, and potentially reproduce and verify, an array based gene expression experiment
www.mged.org/miame
MGED standards
[Figure: an experiment made up of several hybridisations, each linking a sample, RNA extract and labelled nucleic acid to an array (design, elements/spots); normalization and integration give the gene expression data matrix]
MIAME – the content (annotation) of all boxes and lines should be given
MIAME ‘checklist’ to authors and reviewers
Experimental design
Samples used, RNA extraction and labelling
Hybridisation
Measurement data and specifications
– (Raw images)
– Image quantitation (data and specification)
– Gene expression data matrix (data and transformations)
Array design
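A checklist of this kind lends itself to a simple completeness check. The sketch below treats a submission as a plain dictionary and reports which checklist sections are missing or empty. The section keys and the example submission are hypothetical names chosen for illustration; they are not an official MIAME schema.

```python
# Section keys are hypothetical labels for the checklist items on the slide.
MIAME_SECTIONS = [
    "experimental_design",
    "samples_extraction_labelling",
    "hybridisation",
    "measurements",
    "array_design",
]


def missing_sections(submission):
    """Return the checklist sections that are absent or empty."""
    return [name for name in MIAME_SECTIONS if not submission.get(name)]


# Made-up submission with two sections left out.
submission = {
    "experimental_design": "time course, 3 time points",
    "samples_extraction_labelling": "protocol text",
    "hybridisation": "protocol text",
    "measurements": "",  # present but empty, so it counts as missing
}
print(missing_sections(submission))
```

A reviewer or a submission tool could run such a check before accepting a data set; what counts as sufficient content within each section is a separate question that this sketch does not address.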
MIAME ‘checklist’
An open letter was sent to the journals last week: all the information in the MIAME ‘checklist’ should be made available as a requirement for accepting publications
The Lancet has indicated that it will adopt the MIAME checklist as a requirement
Nature will adjust its policy in line with MIAME recommendations
A need for a supporting infrastructure
MIAME itself will not solve the problem
A standard format is needed for representing and exchanging this information
MGED standards 2
Data exchange format – MicroArray Gene Expression Mark-up Language – MAGE-ML – an XML based file format able to capture all MIAME required information
Based on object model MAGE-OM (Paul Spellman, Michael Miller, Jason Stewart, Ugis Sarkans, …)
Adopted by OMG as a standard for microarrays
www.mged.org/mage
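To give a feel for what an XML based exchange format involves, the sketch below writes and re-reads a tiny XML document with Python's standard library. The element and attribute names (ExperimentRecord, Hybridisation, Sample, ArrayDesign) are invented for this illustration and are not taken from the MAGE-ML DTD; a real MAGE-ML document follows the MAGE-OM object model and is considerably richer.

```python
import xml.etree.ElementTree as ET


def build_document(experiment_name, hybridisations):
    """Serialise a toy experiment description to an XML string.

    Element names are hypothetical, NOT the real MAGE-ML vocabulary.
    """
    root = ET.Element("ExperimentRecord", name=experiment_name)
    for hyb in hybridisations:
        node = ET.SubElement(root, "Hybridisation", id=hyb["id"])
        ET.SubElement(node, "Sample").text = hyb["sample"]
        ET.SubElement(node, "ArrayDesign").text = hyb["array_design"]
    return ET.tostring(root, encoding="unicode")


xml_text = build_document(
    "demo",
    [{"id": "h1", "sample": "s1", "array_design": "a1"}],
)
# A receiving database would parse the same text back into objects.
parsed = ET.fromstring(xml_text)
print(parsed.find("Hybridisation/Sample").text)
```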
UML packages of MAGE
[Figure: package diagram showing Experiment, ArrayDesign, DesignElement, Array, BioMaterial, BioAssay, BioAssayData, BioEvent, BioSequence, QuantitationType, HigherLevelAnalysis, Description, Protocol, Measurement, AuditAndSecurity and BQS, with Treatment and Transformation]
MAGE – an example diagram
Use case of MAGE: ArrayExpress architecture
[Figure: architecture diagram with the components ArrayExpress (Oracle), MAGE-OM, MAGE-ML (DTD), MAGE-ML documents, data loader, Castor, object/relational mapping, MIAMExpress, Java servlets, Tomcat, Velocity template engine, web page templates and browser]
MGED standards 3
MGED ontologies – organism part, cell type, diseased state, genotype, chemical compounds (Chris Stoeckert, Helen Parkinson, Susanna Sansone,…)
Symposium “Standards and Ontologies for Functional Genomics” – November 17-20, Cambridge, UK
www.mged.org/ontology
MGED standards 4
Data transformation and normalisation (Cathy Ball, John Quackenbush, Gavin Sherlock, …)
www.mged.org/normalization
Infrastructure for sharing microarray data
Standard for experiment annotation
Standard for data exchange
Public repositories
Local databases and LIMS
Ways of comparing the data
ArrayExpress – a MIAME/MAGE supportive public repository for microarray data at EBI
[Figure: ArrayExpress, MIAMExpress and Expression Profiler connected over the Internet (www, MAGE-ML) for submissions, queries and analysis]
Microarray data sharing infrastructure
[Figure: public repositories at the centre, linked by MAGE-ML, www and html to data submissions, array descriptions (from manufacturers), MIAMExpress local installations, LIMS, data analysis software and other databases, and serving data queries, retrieval, and analysis]
MIAME/MAGE supportive software
Sanger Institute LIMS (MIDAS)
TIGR LIMS
Gene Traffic (Iobion)
Affymetrix
MAXDB (Manchester)
Rosetta Resolver (Rosetta Biosoftware)
Base (Lund)
J-Express (Molmine)
MIAMExpress (EBI)
ArrayExpress (EBI)
Acknowledgements
MGED board
– Cathy Ball (Stanford)
– Helen Causton (Imperial Col)
– Terry Gaasterland (Rockefel)
– Jason Gonzales (Iobion)
– Pascal Hingamp (Marseille)
– Barbara Jasny (Science)
– Helen Parkinson (EBI)
– John Quackenbush (TIGR)
– Martin Ringwald (Jackson)
– Gavin Sherlock (Stanford)
– Paul Spellman (Berkeley)
– Jason Stewart (Open Inf)
– Chris Stoeckert (Uni Penns)
– Yoshio Tateno (DDBJ)
– Ron Taylor (Colorado)
– Charles Troup (Agilent)
MGED supporters
– Rob Andrews (Sanger)
– Wilhelm Ansorge (EMBL)
– Mike Cherry (Stanford)
– Peter Dansky (Affymetrix)
– David Hancock (Manchester)
– Frank Holstege (Utrecht)
– Michael Miller (Rosetta)
– Kate Rice (Sanger)
– Christian Schwager (EMBL)
– Joe White (TIGR)
– Rick Young (MIT)
EBI Microarray Team
– Niran Abeygunawardena
– Helen Parkinson
– Philippe Rocca-Serra
– Susanna Sansone
– Ugis Sarkans
– Mohammadreza Shojatalab
– Jaak Vilo
Microarray informatics at the EBI
ArrayExpress (Helen Parkinson)
Expression Profiler data analysis tool and promoter analysis (Jaak Vilo)
Reconstructing and analysing gene networks
[Figure: gene network graph; the nodes are labelled with yeast gene and ORF names, for example STE2, FUS3, SWI4 and YAL004W]
Gene Networks – graphs: nodes are genes, arcs are relationships
Different ways to build a gene network
G1 → G2: the product of gene G1 is a transcription factor, which binds to the promoter of gene G2 – physical interaction network
G1 → G2: the disruption of gene G1 changes the expression level of gene G2 – data interpretation network
G1 → G2: gene G2 is mentioned in a paper about gene G1 – literature networks
Data for over 200 gene disruptions in Yeast
Hughes et al, Cell, 102 (2000)
Discretization of the data: the normalized expression log(ratios) X are discretized using different thresholds γ = 2, 2.1, …, 4:
$$d(X) = \begin{cases} -1, & X < -\gamma \\ 0, & -\gamma \le X \le \gamma \\ 1, & X > \gamma \end{cases}$$
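The discretization step can be sketched in a few lines. The code assumes a symmetric threshold (called gamma here, a name chosen for this sketch) and assumes that values below the negative threshold map to -1, values above the positive threshold map to 1, and everything in between maps to 0. The function names are illustrative.

```python
def discretize(x, gamma):
    """Map a normalized log-ratio to -1, 0 or 1 using a symmetric threshold."""
    if x < -gamma:
        return -1
    if x > gamma:
        return 1
    return 0


def thresholds(start=2.0, stop=4.0, step=0.1):
    """Thresholds 2, 2.1, ..., 4 as listed on the slide."""
    n = round((stop - start) / step)
    return [round(start + i * step, 1) for i in range(n + 1)]


# Made-up log-ratios discretized at the lowest threshold.
values = [-3.5, -1.0, 0.0, 2.0, 2.6]
print([discretize(v, 2.0) for v in values])
```

Raising the threshold keeps only the strongest expression changes, so the networks built from the discretized data become sparser as the threshold grows.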
Gene disruption network
[Figure: a small example with genes A, B, C and D, showing the disruption data for mutants A, B and C and the resulting network]
Data for over 200 gene disruptions in Yeast
Hughes et al, Cell, 102 (2000)
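A gene disruption network of this kind can be sketched as a set of directed edges: an arc from the disrupted gene to every gene whose discretized expression changed in that mutant. The input layout (a dictionary of dictionaries) and the rule that skips the disrupted gene itself are assumptions made for this sketch, and the toy data are invented.

```python
def disruption_network(discretized):
    """Build directed edges (disrupted gene -> affected gene).

    discretized[mutant][gene] is -1, 0 or 1 (assumed input layout).
    """
    edges = set()
    for mutant, profile in discretized.items():
        for gene, value in profile.items():
            # Skip unchanged genes and the disrupted gene itself (assumption).
            if value != 0 and gene != mutant:
                edges.add((mutant, gene))
    return edges


# Invented toy data: three mutants, four measured genes.
toy = {
    "A": {"A": -1, "B": 1, "C": 0, "D": 0},
    "B": {"A": 0, "B": -1, "C": 1, "D": -1},
    "C": {"A": 0, "B": 0, "C": -1, "D": 0},
}
print(sorted(disruption_network(toy)))
```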
Mutation network for S. cerevisiae
[Figure: gene network graph with nodes labelled by yeast gene and ORF names]
Mutation network, threshold = 2, filtered for the genes marked in red (mating)
Thomas Schlitt, Johan Rung
Comparison to literature network derived from YPD
Result: the overlap between calculated networks and the YPD-graph is always larger than the overlap between randomised networks and the YPD-graph
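A comparison of this kind can be sketched as counting shared edges between a calculated network and a reference network, then repeating the count for randomised versions of the calculated network. The randomisation used here (shuffling the edge targets) is a simple stand-in chosen for the sketch; the slides do not say which randomisation scheme was actually used, and all names and data are illustrative.

```python
import random


def overlap(edges_a, edges_b):
    """Number of directed edges present in both networks."""
    return len(set(edges_a) & set(edges_b))


def randomise(edges, rng):
    """Shuffle targets across edges (a simplification: duplicate edges
    collapse and self-loops may appear)."""
    sources = [s for s, _ in edges]
    targets = [t for _, t in edges]
    rng.shuffle(targets)
    return set(zip(sources, targets))


def compare(calculated, reference, trials=100, seed=0):
    """Observed overlap plus the overlaps of randomised networks."""
    rng = random.Random(seed)
    edges = sorted(calculated)
    observed = overlap(edges, reference)
    random_overlaps = [
        overlap(randomise(edges, rng), reference) for _ in range(trials)
    ]
    return observed, random_overlaps


# Invented toy networks.
calculated = {("A", "B"), ("A", "C"), ("B", "D")}
reference = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
observed, random_overlaps = compare(calculated, reference)
print(observed, max(random_overlaps))
```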
Network modularity
Is there one “big” dominant connected component and possibly a number of small components, or several components of comparable sizes?
Can the network be broken down into several components of comparable size by removing nodes of high degree (i.e., nodes with many incoming or outgoing edges)?
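Both questions can be explored with a short experiment: compute the sizes of the connected components, remove a fraction of the highest-degree nodes, and compute the sizes again. The sketch below treats edges as undirected when finding components (weak connectivity), which is an assumption, and uses an invented star-shaped toy network.

```python
from collections import defaultdict


def component_sizes(nodes, edges):
    """Sizes of weakly connected components, largest first."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, sizes = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def remove_top_degree(nodes, edges, fraction):
    """Drop the given fraction of nodes with the highest total degree."""
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    k = int(len(nodes) * fraction)
    removed = set(sorted(nodes, key=lambda n: (-degree[n], n))[:k])
    kept_nodes = [n for n in nodes if n not in removed]
    kept_edges = [(a, b) for a, b in edges
                  if a not in removed and b not in removed]
    return kept_nodes, kept_edges


# Invented toy network: one hub connected to four other genes.
nodes = ["H", "A", "B", "C", "D"]
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D")]
print(component_sizes(nodes, edges))
print(component_sizes(*remove_top_degree(nodes, edges, 0.2)))
```

In the toy network, removing the single hub shatters the graph into isolated nodes; the question on the slide is whether a real disruption network instead falls apart into a few components of comparable size.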
Number of connected components in the networks
threshold  component  full network  1% removed  5% removed  10% removed
2.0        largest    5383          4707        3682        2614
           second     -             -           2           5
           total      1             1           2           2
3.0        largest    3556          2461        1385        764
           second     2             2           4           6
           total      2             2           9           17
4.0        largest    2354          1205        542         45
           second     3             3           6           28
           total      4             7           22          51
Other opinions
Wagner, 2002 (Genome Res) – there exist many independent modules
Featherstone, 2002 (Bioessays) – there is only one giant module
All depends on the definition of the ‘module’
Disruption network properties
In and out degree of genes distributed according to a power law
There are no obvious modules in this particular network
‘Local’ networks make sense
(J. Rung, T. Schlitt et al., to appear in ECCB special issue of Bioinformatics)
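The in and out degree distributions mentioned above can be tabulated directly from an edge list. The sketch below counts, for each degree value, how many genes have that in-degree or out-degree; genes with degree zero are not counted, and the toy edges are invented. On real data, a power law would show up as a roughly straight line when these counts are plotted against degree on log-log axes.

```python
from collections import Counter


def degree_distribution(edges):
    """Return (in_hist, out_hist): degree value -> number of genes."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    in_hist = Counter(in_degree.values())
    out_hist = Counter(out_degree.values())
    return in_hist, out_hist


# Invented toy edges (disrupted gene -> affected gene).
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "D")]
in_hist, out_hist = degree_distribution(toy_edges)
print(dict(in_hist), dict(out_hist))
```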
Gaurab Mukherjee, Alvis Brazma, Gonzalo Garcia Lara, Ugis Sarkans, Koichi Tazaki, Ahmet Ociamen, Helen Parkinson, Mohammadreza Shojatalab, Thomas Schlitt, Katja Kivinen, Misha Kapushesky, Ele Holloway, Nastja Samsonova, Philippe Rocca-Serra, Johan Rung, Niran Abeygunawardena, Susanna Sansone, Jaak Vilo
Microarray Informatics at the EBI