Reproducible HTS research: MINSEQE and more
Chris Stoeckert, Dept. of Genetics, Perelman School of Medicine
CHOP/Penn NGS Symposium, June 17, 2011
Why worry about reproducible research?
• Genotyping and expression analysis of patients is common.
• Diagnostics based on microarrays are available, and those based on sequencing are being explored.
• An example of what can go wrong:
Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, Cragun J, Cottrill H, Kelley MJ, Petersen R, Harpole D, Marks J, Berchuck A, Ginsburg GS, Febbo P, Lancaster J, Nevins JR. Genomic signatures to guide the use of chemotherapeutics. Nat Med. 2006 Nov;12(11):1294-300. Epub 2006 Oct 22. Erratum in: Nat Med. 2008 Aug;14(8):889. Nat Med. 2007 Nov;13(11):1388.
From www.newsobserver.com:
Shoffner, now 63, has invasive ductile adenocarcinoma, a form of breast cancer that begins in the milk ducts. When Shoffner's oncologist at Duke mentioned the clinical trial in July 2008, she eagerly volunteered.
"I'm devastated by this whole thing," she said. "If you have a very serious cancer and two-and-a-half years later you think you are involved in a study that is cutting edge and [it's discredited], it is devastating."
"I have been told there was no harm done to us," she said. "But no good was done to us, either. I got all the side effects from chemo and none of the benefit."
From July 21, 2010 N.Y. Times: “Last year, two biostatisticians [Keith A. Baggerly and Kevin R. Coombes] at the University of Texas MD Anderson Cancer Center published an article in the scientific journal Annals of Applied Statistics in which they identified errors in Duke’s data analysis and said they had not been able to reproduce Duke’s results.”
In the beginning there were microarrays…
• And they were developed for gene expression profiling.
• Microarray studies were published but only gene lists were provided.
– No details, no place to get the data
• In 1999, the Microarray Gene Expression Data Society (MGED) began to address the lack of verifiable and reproducible datasets by developing and promoting standards.
MGED Standards
• What information is needed for a microarray experiment?
– MIAME: Minimum Information About a Microarray Experiment. Brazma et al., Nature Genetics 2001
• How do you “code up” microarray data?
– MAGE-OM: MicroArray Gene Expression Object Model. Spellman et al., Genome Biology 2002
– MAGE-TAB. Rayner et al., BMC Bioinformatics 2006
• What words do you use to describe a microarray experiment?
– MO: MGED Ontology. Whetzel et al., Bioinformatics 2006
MIAME in a nutshell (à la Alvis Brazma)
[Figure: workflow diagram. Labels: Sample; RNA extract; labelled nucleic acid; hybridisation; array (Array design); Microarray; Experiment; gene expression data matrix (genes); normalization; integration. A Protocol accompanies each step.]
Stoeckert et al. Drug Discovery Today TARGETS 2004
Sequencing is replacing array technology
@HWI-EAS266_0011:8:1:6:969#0/1
GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTGTCT
+HWI-EAS266_0011:8:1:6:969#0/1
_abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[\\aZTZVY
@HWI-EAS266_0011:8:1:7:1688#0/1
AAGATGANGGCAGGGTGCAAGATGGCAGGATGCAAGATGGCAGG
+HWI-EAS266_0011:8:1:7:1688#0/1
a`^ab`^D\a]a`b``b_bbbaabb^abaa``^a_^_aa\]_VR
@HWI-EAS266_0011:8:1:7:593#0/1
CAGTTCANTTCTCAGCACCACACTGGGATGCTCACACATGCCTG
+HWI-EAS266_0011:8:1:7:593#0/1
abbbb_VD[bbbba_`bbbbbbbbbbbaa_`bbaabaabb_aa_
@HWI-EAS266_0011:8:1:7:139#0/1
CATGGGGNATAATTGCAATCCCCGATCCCCATCACGAATGGGGT
+HWI-EAS266_0011:8:1:7:139#0/1
aab`[^YDY]Z\baa`aabaaaa`aa`a]aa```\aY]^\]ZVX
@HWI-EAS266_0011:8:1:7:1390#0/1
GAATAATNGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGA
+HWI-EAS266_0011:8:1:7:1390#0/1
_U^b_`]D\__a_a`S```Y[a__]a\aa_`]`aTVZ__\HYVX
@HWI-EAS266_0011:8:1:7:1663#0/1
TGATGTTNGTGGCAATAATGGGGGTAGCGGCAATGGTGGCGGGG
+HWI-EAS266_0011:8:1:7:1663#0/1
a`[_X]\DQTZ[^YYa[[aXV[PZUUYSYBBBBBBBBBBBBBBB
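As a rough illustration of what these reads contain, the sketch below parses four-line FASTQ records (name, sequence, "+" line, quality string) and converts the quality characters to numeric scores. The offset of 64 is an assumption: it was common for Illumina pipelines of that period, but the slide does not state the encoding. The names `FastqRecord`, `parse_fastq` and `phred_scores` are hypothetical helpers, not part of any library.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if len(rows) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(rows), 4):
        header, seq, plus, qual = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


def phred_scores(quality: str, offset: int = 64) -> List[int]:
    """Convert quality characters to integers (the offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


# First read from the slide; the raw string keeps the two backslashes literal.
example = [
    "@HWI-EAS266_0011:8:1:6:969#0/1",
    "GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTGTCT",
    "+HWI-EAS266_0011:8:1:6:969#0/1",
    r"_abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[\\aZTZVY",
]
records = list(parse_fastq(example))
scores = phred_scores(records[0].quality)
print(records[0].name, len(records[0].sequence), min(scores))
```

Under this assumed offset, the lowest score in the read falls on the uncalled base `N`, which is the kind of detail that is lost when only processed results are shared.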
[Figure: workflow diagram updated for sequencing. Labels: Sample; Chromatin, DNA extract; nucleic acid; hybridisation; array (Array design); Microarray; Experiment; ChIP-Seq, MeDIP-Seq, etc.; normalization; integration; genes. A Protocol accompanies each step.]
From MGED to FGED
• What information is needed for an HTS experiment?
– MINSEQE: Minimum Information about a high-throughput SeQuencing Experiment
• How do you “code up” functional genomics data?
– MAGE-TAB can still be utilized
• What words do you use to describe a functional genomics experiment?
– OBI: Ontology for Biomedical Investigations, incorporates MO
• http://obi-ontology.org/page/Main_Page
http://www.fged.org
Minimum Information about a high-throughput Nucleotide SeQuencing Experiment – MINSEQE
(April, 2008)
• The description of the biological system and the particular states that are studied
• The sequence read data for each assay
• The 'final' processed (or summary) data for the set of assays in the study
• The experiment design including sample data relationships
• General information about the experiment
• Essential experimental and data processing protocols
http://www.fged.org/projects/minseqe/
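One way to make this checklist operational is to test a submission's metadata against the six items. The sketch below is only an illustration: the keys in `MINSEQE_ITEMS`, the `missing_minseqe_items` helper and the example record are hypothetical labels chosen here, not part of the MINSEQE specification.

```python
from typing import Dict, List

# Hypothetical short keys for the six MINSEQE items listed above.
MINSEQE_ITEMS: Dict[str, str] = {
    "biological_system": "Description of the biological system and states studied",
    "read_data": "Sequence read data for each assay",
    "processed_data": "'Final' processed (or summary) data for the set of assays",
    "experiment_design": "Experiment design including sample data relationships",
    "general_information": "General information about the experiment",
    "protocols": "Essential experimental and data processing protocols",
}


def missing_minseqe_items(record: Dict[str, object]) -> List[str]:
    """Return descriptions of checklist items that are absent or empty."""
    return [
        description
        for key, description in MINSEQE_ITEMS.items()
        if not record.get(key)
    ]


# Made-up submission record: two items are deliberately left out or empty.
submission = {
    "biological_system": "placeholder description",
    "read_data": ["reads_1.fastq", "reads_2.fastq"],
    "processed_data": [],
    "experiment_design": "placeholder design",
    "general_information": "placeholder summary",
}

for item in missing_minseqe_items(submission):
    print("Missing:", item)
```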
Minimum Information for Biological and Biomedical Investigations (MIBBI)
38 Projects registered as of June, 2011
RNA-seq and ChIP-seq Guidelines
• Quality measures
• Data reporting checklist
• Based on ENCODE/modENCODE
• Developed by the BCBC Workgroup on Bioinformatics/Epigenomics
– Mike Snyder (Chair), Stanford
– Howard Chang, Stanford
– Klaus Kaestner, Penn
– Eugene Kolker, Seattle Children’s Research Institute
– Chris Stoeckert, Penn
– JP Cartailler, Vanderbilt
– Mark Magnuson, Vanderbilt
– Olivier Blondel, NIDDK
– Kristin Abraham, NIDDK
RNA-seq Checklist
• Experiment:
– Contact person
– Objective
– Experimental factors
• Samples:
– Descriptions
– Protocols used for cell isolation
– Estimated cell purity
– Identify biological vs. technical replicates
– Amplification method and amount, if used
Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.
RNA-seq Checklist (cont.)
• Sequence:
– Protocols used
– Relation to samples
– Spike-in used
– Machine (version)
– Sequence length
– Single or paired end reads
– Strand specific?
– Bar codes?
– Data file of sequence reads with quality scores
• Sequence analysis:
– Quality assessment (correlations between biological replicates)
– Reference genome, transcriptome
– Mapping software (version) and parameters used
– Numbers of reads (total, mapped, unique) and whether trimming was done
– File of gene/transcript coordinates and normalized number of assigned reads (RPKM/FPKM)
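The normalized value named in the last checklist item can be computed from counts the checklist already asks for. As a sketch, RPKM (reads per kilobase of transcript per million mapped reads) divides a feature's assigned read count by its length in kilobases and by the total mapped reads in millions; FPKM applies the same formula to fragments rather than reads. The `rpkm` function name and the numbers in the example are illustrative only.

```python
def rpkm(assigned_reads: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return assigned_reads * 1e9 / (feature_length_bp * total_mapped_reads)


# Made-up numbers: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
value = rpkm(assigned_reads=500, feature_length_bp=2000, total_mapped_reads=10_000_000)
print(value)
```

Because the result depends on the total mapped read count, reporting that count and the mapping parameters alongside the RPKM file is what lets someone else recompute the values.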
You can still deposit your sequence reads
• ArrayExpress/European Nucleotide Archive/European Genome-Phenome Archive
• GEO/Short Read Archive
– http://www.ncbi.nlm.nih.gov/About/news/09may2011.html
ChIP-seq Checklist
• Experiment:
– Contact person
– Objective
– Experimental factors
• Samples:
– Descriptions (where applicable):
  • Cell line, Lot number
  • Cell or tissue
  • Mouse strain, relevant alleles
  • Gender, age, disease status
– Protocols used for cell isolation
– Identify biological vs. technical replicates
• Antibody Characterization
– Company/Core, catalog or ID and lot number
– Describe methods used and quality assessment; provide images.
Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009 Oct;10(10):669-80.
ChIP-seq Checklist (cont.)
• Sequence:
– Protocols used
– Relation to samples
– Machine (version)
– Sequence length
– Single or paired end reads
– Data file of sequence reads with quality scores
• Sequence analysis:
– Quality assessment (correlations between biological replicates)
– Reference genome
– Mapping software (version) and parameters used
– Numbers of reads (total, mapped, unique) and whether trimming was done
– Methods used to call peaks
– Peak coordinates
– Peak signal
– Confidence score (e.g., p-value)
Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009 Oct;10(10):669-80.
MAGE-TAB Format
What’s MAGE-TAB?
• MAGE-TAB is a simple spreadsheet view made up of two files:
– IDF: describes the experiment design, contact details, variables and protocols
– SDRF: a spreadsheet with columns that describe samples, annotations, protocol references, hybridizations and data
• Linked data files (e.g. CEL files) are referenced by the SDRF
• For single-channel data, one row in the SDRF = 1 hybridization; for two-channel data, one row = 1 channel
• MAGE-TAB can also be used to annotate Next Gen Sequencing data
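Because the SDRF is a tab-delimited spreadsheet, it can be read with standard tooling. The sketch below parses a small made-up SDRF fragment with Python's `csv` module and maps each hybridization to its raw data file. The column headings follow common MAGE-TAB usage (`Source Name`, `Protocol REF`, `Hybridization Name`, `Array Data File`), but the rows, the file names and the `data_files_by_hybridization` helper are invented for illustration.

```python
import csv
import io
from typing import Dict

# Made-up single-channel SDRF fragment: one row per hybridization.
SDRF_TEXT = (
    "Source Name\tProtocol REF\tHybridization Name\tArray Data File\n"
    "sample_1\textraction\thyb_1\thyb_1.CEL\n"
    "sample_2\textraction\thyb_2\thyb_2.CEL\n"
)


def data_files_by_hybridization(sdrf_text: str) -> Dict[str, str]:
    """Map each hybridization name to the data file referenced in its row."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    return {row["Hybridization Name"]: row["Array Data File"] for row in reader}


files = data_files_by_hybridization(SDRF_TEXT)
print(files)
```

Real SDRF files often repeat a heading such as `Protocol REF` several times in one row; `csv.DictReader` keeps only the last value for a repeated heading, so a fuller reader would use `csv.reader` and track column positions instead.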
Where can I get MAGE-TAB from?
• ~10,000 MAGE-TAB files are available for download from ArrayExpress (includes GEO derived and ArrayExpress data)
• caArray also provides MAGE-TAB files for download.
Who’s using MAGE-TAB?
• BioConductor
• GenePattern
• MeV
MAGE-TAB File covering MIAME components making use of the MGED Ontology
IDF (Investigation Description Format) for E-TABM-34, taken from the ArrayExpress archive
MAGE-TAB File covering MIAME components making use of the MGED Ontology
SDRF (Sample and Data Relationship Format) for E-TABM-34, taken from the ArrayExpress archive
University of Pennsylvania: Chris Stoeckert, Elisabetta Manduchi, Emily Allen, Ashwin Vishnuvardharna
Vanderbilt University: Mark Magnuson, Jean-Philippe Cartailler, Thomas Houfek, Josh Norman
MAGE-TAB is used by the Beta Cell Biology Consortium (BCBC)
http://www.betacell.org
In-silico Component
[Figure: ChIP-seq workflow. Protocol steps, in order: cluster generation, image acquisition, sequencing, alignment, peak calling. Peak files: Ptf1a_peaks, Rbpjl_peaks.]
Sample      Sequence file       Alignment file
Ptf1a_s5    Ptf1a_s5_seq.txt    s5_eland.txt
Ptf1a_s4    Ptf1a_s4_seq.txt    s4_eland.txt
Input_s8    Input_s8_seq.txt    s8_eland.txt
Rbpjl_s6    Rbpjl_s6_seq.txt    s6_eland.txt
Input_s2    Input_s2_seq.txt    s2_eland.txt
Rbpjl_s4    Rbpjl_s4_seq.txt    s4_eland.txt
Annotare - An open source standalone MAGE-TAB editor
Shankar R, Parkinson H, Burdett T, Hastings E, Liu J, Miller M, Srinivasa R, White J, Brazma A, Sherlock G, Stoeckert CJ Jr, Ball CA. Annotare - a tool for annotating high-throughput biomedical investigations and resulting data. Bioinformatics. 2010 Aug 23.
Annotare Features
• Intuitive graphical user interface forms for editing
• Ontology support: an inbuilt ontology and web services connectivity to BioPortal
• Searchable standard templates
• Design wizard
• Validation module
• Mac and Windows support
http://code.google.com/p/annotare/
MAGE-TAB supports reproducible HTS research
• Captures the minimal information (MINSEQE) needed
– RNA-seq and ChIP-seq guidelines are available from the Beta Cell Biology Consortium, based on ENCODE/modENCODE guidelines.
• Provides a description of the workflow used
• Files can be generated with Excel or Annotare (which facilitates use of ontologies)
• Files can be used to submit data to archives for publication