GenePattern
Overview for MAGE-TAB Workshop
Ted Liefeld
January 24, 2007
a platform for integrative genomics
Client User Interfaces
Pipeline Environment
Module Repository
Module Integrator
Desktop
Programming
Web
[Figure: example pipelines run on the all_aml_train and all_aml_test datasets. Analysis steps: Preprocess, Class Neighbors, Weighted Voting Cross-Val, SOM Clustering, Weighted Voting Train/Test. Viewers: SOM Cluster Viewer, Marker Selection Viewer, Prediction Results Viewer.]
Golub and Slonim et al. 1999
KNN
SVM
SOM
GSEA
NMF
PCA
Features
Automatic Module Integration
• Add new modules without writing code
• Supports any command-line callable code (language independent)
Multiple user interfaces
• Desktop client
• Web client
• Programmatic interfaces to Java, MATLAB, R
Local and Distributed Computing
• Laptop
• Client/Server
• Compute farm
• Public server (1/2008)
Interoperability
• caBIG
• caArray
• caGrid
• geWorkbench
• Cytoscape
Analytic Reproducibility
• Easy, rapid sharing of methodologies via pipelines
• Versioning using Life Sciences Identifier (LSID)
• Executable history of all sessions
• Automatic pipeline generation from result files
• Executable research documents
Comprehensive Module Repository
• ~90 modules: analysis, visualization, pipelines
• Expression, proteomic, sequence, variation (SNP), and whole genome association data
• Construction of context-sensitive, flexible analytic workflows
• Module suites
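To illustrate the LSID versioning point above: an LSID has the general form urn:lsid:authority:namespace:object[:revision], so two versions of one module differ only in the last field. The sketch below is a minimal illustration of that idea, not GenePattern's own code; the function names and the example identifiers are made up for this example.

```python
from typing import NamedTuple


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: str  # empty string when the LSID carries no revision


def parse_lsid(text: str) -> Lsid:
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    is_lsid = (
        len(parts) in (5, 6)
        and parts[0].lower() == "urn"
        and parts[1].lower() == "lsid"
    )
    if not is_lsid:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else ""
    return Lsid(parts[2], parts[3], parts[4], revision)


def same_module(a: str, b: str) -> bool:
    """True when two LSIDs name the same module, ignoring the revision."""
    return parse_lsid(a)[:3] == parse_lsid(b)[:3]


# Hypothetical identifiers, invented for this example.
v1 = "urn:lsid:example.org:analysis.module:00042:1"
v2 = "urn:lsid:example.org:analysis.module:00042:2"
print(same_module(v1, v2), parse_lsid(v2).revision)
```

Recording the full identifier, revision included, in a pipeline is what would let a shared pipeline rerun against the same module versions.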
Module Integrator
Add modules and visualizers without writing code
Share custom analysis tasks
Integrate your own or “third-party” tools easily
Add tools to a common repository
GenePattern as a Visualization & Analysis Engine
http://www.broad.mit.edu/mmgp
Portal
GenePattern
LSF Worker Nodes
GenePattern SNPViewer visualizer (running as applet)
Run GenePattern Analyses
Using MAGE-ML today
MAGE-TAB use tomorrow
Ideally
• Be able to automatically find raw/derived bioassay data when parsing MAGE-TAB files
• Use MAGE-TAB like our native (tab-delimited) data formats (GCT, RES) in (almost) any GenePattern analysis module
• Not require user interaction to specify assays or quantitation types
• ? MGED Ontology terms for common data transform protocols (e.g. RMA, MAS5) in addition to free text
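A minimal sketch of what "automatically find raw/derived bioassay data" could look like, assuming the SDRF is a tab-delimited table whose data columns are headed "Array Data File" and "Derived Array Data File". The function name and the sample rows are hypothetical, and this is not an existing GenePattern module.

```python
import csv
import io
from typing import Dict, List

# Column headings assumed to mark raw and derived data files in the SDRF.
DATA_COLUMNS = ("Array Data File", "Derived Array Data File")


def find_data_files(sdrf_text: str) -> Dict[str, List[str]]:
    """Collect the distinct file names listed under each data column."""
    reader = csv.reader(io.StringIO(sdrf_text), delimiter="\t")
    header = next(reader)
    found: Dict[str, List[str]] = {name: [] for name in DATA_COLUMNS}
    for row in reader:
        for index, column in enumerate(header):
            if column not in found or index >= len(row):
                continue
            value = row[index]
            if value and value not in found[column]:
                found[column].append(value)
    return found


# Made-up SDRF content for illustration only.
SAMPLE_SDRF = (
    "Source Name\tHybridization Name\tArray Data File\tDerived Array Data File\n"
    "patient1\thyb1\thyb1.CEL\tnormalized.txt\n"
    "patient2\thyb2\thyb2.CEL\tnormalized.txt\n"
)
print(find_data_files(SAMPLE_SDRF))
```

No user interaction is needed here because the column headings alone identify which files are raw and which are derived.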
Sub-optimal but still good
• Have an interactive viewer to convert from MAGE-TAB to a native format (e.g. MAGE-ML import viewer)
• Human interaction required…
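Whether the conversion is automatic or goes through a viewer, its last step is writing a native file. The sketch below shows that step for GCT version 1.2: a "#1.2" line, a line with the row and column counts, a header starting with Name and Description, then one row per feature. The function name and the input layout are assumptions for this example.

```python
from typing import Sequence, Tuple

Row = Tuple[str, str, Sequence[float]]  # (name, description, values)


def to_gct(sample_names: Sequence[str], rows: Sequence[Row]) -> str:
    """Render an expression matrix as GCT 1.2 text."""
    lines = ["#1.2", f"{len(rows)}\t{len(sample_names)}"]
    lines.append("\t".join(["Name", "Description", *sample_names]))
    for name, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"row {name!r} has {len(values)} values, "
                             f"expected {len(sample_names)}")
        lines.append("\t".join([name, description, *(str(v) for v in values)]))
    return "\n".join(lines) + "\n"


# Made-up values for illustration only.
print(to_gct(["s1", "s2"], [("g1", "gene one", [1.5, 2.0])]))
```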
More MAGE-TAB thoughts
Define structure/format for keeping multiple MAGE-TAB files together
• IDF, ADF, SDRF, raw data files -> package together as ZIP? tgz?
• Subdirectories in the zip? (defined)
Does MAGE-TAB support multiple arrays in one file?
• Useful, and MAGE-ML allows this now (but I don’t like it for automated processing)
• E.g. E-GEOD-995.mageml.tgz from ArrayExpress
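One possible shape for the packaging idea, sketched with Python's standard zipfile module. The layout (metadata files at the top level, data files under a "data/" subdirectory) is purely an assumption for illustration; no such layout has been defined, and the file names are invented.

```python
import io
import zipfile
from typing import Dict


def package_magetab(metadata: Dict[str, bytes], data_files: Dict[str, bytes]) -> bytes:
    """Bundle IDF/ADF/SDRF files and data files into one ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Metadata at the top level so a parser can find it without searching.
        for name, content in sorted(metadata.items()):
            archive.writestr(name, content)
        # Data files in a fixed subdirectory (hypothetical convention).
        for name, content in sorted(data_files.items()):
            archive.writestr(f"data/{name}", content)
    return buffer.getvalue()


# Invented file names and contents.
blob = package_magetab(
    {"experiment.idf.txt": b"idf", "experiment.sdrf.txt": b"sdrf"},
    {"hyb1.CEL": b"raw"},
)
print(zipfile.ZipFile(io.BytesIO(blob)).namelist())
```

A fixed, documented layout is what would let an analysis module open such a package without asking the user where the SDRF or the data files are.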
More MAGE-TAB thoughts
Persistent identifiers
• For protocols, samples, etc.
• Allow use of SDRF, data matrix (e.g. in GP with persistent references to external entities)
• Array details, experiment design, etc.
Question?
• Should we consider MAGE-TAB DAG to record data processing pipelines (provenance - HLA)?
• e.g. a protocol for each module execution added to MAGE-TAB file outputs
• File growth issues…
• Record all analysis for a publication
• Add additional SDRF file at each step
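A rough sketch of the "protocol for each module execution" idea, assuming each analysis step appends a "Protocol REF" column and a "Derived Array Data File" column to every SDRF row. The function name, the protocol identifiers, and the file names are hypothetical. The sketch also makes the file growth concern concrete: every step widens the table by two columns.

```python
import csv
import io


def add_processing_step(sdrf_text: str, protocol_ref: str, output_file: str) -> str:
    """Return SDRF text with one more processing step appended to each row."""
    rows = list(csv.reader(io.StringIO(sdrf_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    header = header + ["Protocol REF", "Derived Array Data File"]
    body = [row + [protocol_ref, output_file] for row in body]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows([header] + body)
    return out.getvalue()


# Invented SDRF content and identifiers.
sdrf = "Hybridization Name\tArray Data File\nhyb1\thyb1.CEL\n"
step1 = add_processing_step(sdrf, "urn:lsid:example.org:analysis.module:00042:1", "preprocessed.gct")
step2 = add_processing_step(step1, "urn:lsid:example.org:analysis.module:00077:3", "clustered.gct")
print(step2)
```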
Release Information
Initially released in March, 2004
Current version 3.0, released April 2007
Version 3.1 due February 2008
Currently 5900+ users, 500+ organizations, ~90 countries
Availability
Freely available
Windows, Mac OS, and Unix platforms
Resources
http://www.genepattern.org
User workshops, documentation, email help desk, online user forum
Reich et al. (2006) Nature Genetics
GenePattern is a winner of the 2005 BioIT World Best Practices Award
Collaborations
caBIG
MAGNet NCBC
NCIBI NCBC