Reproducible HTS research: MINSEQE and more Chris Stoeckert Dept. of Genetics, Perelman School of...

Reproducible HTS research: MINSEQE and more

Chris Stoeckert
Dept. of Genetics, Perelman School of Medicine

CHOP/Penn NGS Symposium
June 17, 2011

Why worry about reproducible research?

• Genotyping and expression analysis of patients is common.

• Diagnostics based on microarrays are available, and those based on sequencing are being explored.

• An example of what can go wrong:

Potti A, Dressman HK, Bild A, Riedel RF, Chan G, Sayer R, Cragun J, Cottrill H, Kelley MJ, Petersen R, Harpole D, Marks J, Berchuck A, Ginsburg GS, Febbo P, Lancaster J, Nevins JR. Genomic signatures to guide the use of chemotherapeutics. Nat Med. 2006 Nov;12(11):1294-300. Epub 2006 Oct 22. Erratum in: Nat Med. 2008 Aug;14(8):889. Nat Med. 2007 Nov;13(11):1388.

From www.newsobserver.com:

Shoffner, now 63, has invasive ductal adenocarcinoma, a form of breast cancer that begins in the milk ducts. When Shoffner's oncologist at Duke mentioned the clinical trial in July 2008, she eagerly volunteered.

"I'm devastated by this whole thing," she said. "If you have a very serious cancer and two-and-a-half years later you think you are involved in a study that is cutting edge and [it's discredited], it is devastating."

"I have been told there was no harm done to us," she said. "But no good was done to us, either. I got all the side effects from chemo and none of the benefit."

From July 21, 2010 N.Y. Times: “Last year, two biostatisticians [Keith A. Baggerly and Kevin R. Coombes] at the University of Texas MD Anderson Cancer Center published an article in the scientific journal Annals of Applied Statistics in which they identified errors in Duke’s data analysis and said they had not been able to reproduce Duke’s results.”

http://www.newsobserver.com/

In the beginning there were microarrays…

• And they were developed for gene expression profiling.

• Microarray studies were published but only gene lists were provided.
– No details, no place to get the data

• In 1999, the Microarray Gene Expression Data Society (MGED) began to address the lack of verifiable and reproducible datasets by developing and promoting standards.

MGED Standards

• What information is needed for a microarray experiment?
– MIAME: Minimum Information About a Microarray Experiment. Brazma et al., Nature Genetics 2001

• How do you “code up” microarray data?
– MAGE-OM: MicroArray Gene Expression Object Model. Spellman et al., Genome Biology 2002
– MAGE-TAB. Rayner et al., BMC Bioinformatics 2006

• What words do you use to describe a microarray experiment?
– MO: MGED Ontology. Whetzel et al., Bioinformatics 2006

[Figure: MIAME workflow diagram. Each Sample yields an RNA extract, which becomes a labelled nucleic acid that is hybridised to an array (described by an Array design); a Protocol is attached to each step. The hybridisations of the Experiment are combined through normalization and integration into a gene expression data matrix (genes).]

MIAME in a nutshell (ala Alvis Brazma)

Stoeckert et al. Drug Discovery Today TARGETS 2004



Sequencing is replacing array technology

@HWI-EAS266_0011:8:1:6:969#0/1
GTTTGCCNGTGTGTACGCTACCCCCTTCTTGTGTGTGTGTGTCT
+HWI-EAS266_0011:8:1:6:969#0/1
_abb`a[DZ`aabaa_a`b]___^^aa_`aa_a^a[\\aZTZVY
@HWI-EAS266_0011:8:1:7:1688#0/1
AAGATGANGGCAGGGTGCAAGATGGCAGGATGCAAGATGGCAGG
+HWI-EAS266_0011:8:1:7:1688#0/1
a`^ab`^D\a]a`b``b_bbbaabb^abaa``^a_^_aa\]_VR
@HWI-EAS266_0011:8:1:7:593#0/1
CAGTTCANTTCTCAGCACCACACTGGGATGCTCACACATGCCTG
+HWI-EAS266_0011:8:1:7:593#0/1
abbbb_VD[bbbba_`bbbbbbbbbbbaa_`bbaabaabb_aa_
@HWI-EAS266_0011:8:1:7:139#0/1
CATGGGGNATAATTGCAATCCCCGATCCCCATCACGAATGGGGT
+HWI-EAS266_0011:8:1:7:139#0/1
aab`[^YDY]Z\baa`aabaaaa`aa`a]aa```\aY]^\]ZVX
@HWI-EAS266_0011:8:1:7:1390#0/1
GAATAATNGAATAGGACCGCGGTTCTATTTTGTTGGTTTTCGGA
+HWI-EAS266_0011:8:1:7:1390#0/1
_U^b_`]D\__a_a`S```Y[a__]a\aa_`]`aTVZ__\HYVX
@HWI-EAS266_0011:8:1:7:1663#0/1
TGATGTTNGTGGCAATAATGGGGGTAGCGGCAATGGTGGCGGGG
+HWI-EAS266_0011:8:1:7:1663#0/1
a`[_X]\DQTZ[^YYa[[aXV[PZUUYSYBBBBBBBBBBBBBBB
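The reads on this slide are in FASTQ format: each record has an "@" header line, the base calls, a "+" separator line, and one quality character per base. A minimal sketch of reading such records follows. It assumes four lines per record and a quality offset of 64, which is typical of Illumina pipelines of that period but is not stated on the slide; the names parse_fastq and FastqRecord are illustrative only.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    qualities: list  # one integer quality score per base


def parse_fastq(handle: TextIO, offset: int = 64) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read FASTQ stream.

    offset=64 is an assumption (older Illumina encoding); Sanger-style
    files use offset=33.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(ch) - offset for ch in quality]
        yield FastqRecord(header[1:].strip(), sequence, scores)


# Third read from the slide.
example = io.StringIO(
    "@HWI-EAS266_0011:8:1:7:593#0/1\n"
    "CAGTTCANTTCTCAGCACCACACTGGGATGCTCACACATGCCTG\n"
    "+HWI-EAS266_0011:8:1:7:593#0/1\n"
    "abbbb_VD[bbbba_`bbbbbbbbbbbaa_`bbaabaabb_aa_\n"
)
records = list(parse_fastq(example))
print(records[0].read_id, len(records[0].sequence), min(records[0].qualities))
```

Under the assumed offset, the lowest score in this read falls on the uncalled base "N", which is the kind of check a reader needs the quality scores for.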


[Figure: the same workflow diagram adapted to sequencing. Labels: Sample; Chromatin, DNA extract; nucleic acid; ChIP-Seq, MeDIP-Seq, etc. (in place of hybridisation to a microarray); Experiment; normalization; integration; genes.]


From MGED to FGED

• What information is needed for an HTS experiment?
– MINSEQE: Minimum Information about a high-throughput SeQuencing Experiment

• How do you “code up” functional genomics data?
– MAGE-TAB can still be utilized

• What words do you use to describe a functional genomics experiment?
– OBI: Ontology for Biomedical Investigations, incorporates MO
– http://obi-ontology.org/page/Main_Page

http://www.fged.org

Minimum Information about a high-throughput Nucleotide SeQuencing Experiment – MINSEQE (April, 2008)

• The description of the biological system and the particular states that are studied

• The sequence read data for each assay

• The 'final' processed (or summary) data for the set of assays in the study

• The experiment design including sample-data relationships

• General information about the experiment

• Essential experimental and data processing protocols

http://www.fged.org/projects/minseqe/
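The six MINSEQE elements can be treated as a simple completeness checklist for a submission. The sketch below is only an illustration: the dictionary keys and the function name missing_minseqe_elements are made up for this example and are not part of MINSEQE or of any archive's submission format.

```python
# The six MINSEQE elements from the slide, keyed by hypothetical field names.
MINSEQE_ELEMENTS = {
    "biological_system": "Description of the biological system and states studied",
    "sequence_reads": "Sequence read data for each assay",
    "processed_data": "Final processed (or summary) data for the set of assays",
    "experiment_design": "Experiment design, including sample-data relationships",
    "general_information": "General information about the experiment",
    "protocols": "Essential experimental and data processing protocols",
}


def missing_minseqe_elements(submission: dict) -> list:
    """Return the keys of MINSEQE elements that are absent or empty."""
    return [key for key in MINSEQE_ELEMENTS if not submission.get(key)]


# A hypothetical draft submission with two elements still missing.
draft = {
    "biological_system": "Mouse pancreas, wild type vs. knockout",
    "sequence_reads": ["sample1.fastq", "sample2.fastq"],
    "processed_data": [],
    "experiment_design": "2 conditions x 3 biological replicates",
    "general_information": "Contact, title, summary",
}

for key in missing_minseqe_elements(draft):
    print(f"Missing: {MINSEQE_ELEMENTS[key]}")
```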

Minimum Information for Biological and Biomedical Investigations

38 projects registered as of June 2011

RNA-seq and ChIP-seq Guidelines

• Quality measures

• Data reporting checklist

• Based on ENCODE/ modENCODE

• Developed by the BCBC Workgroup on Bioinformatics/ Epigenomics
– Mike Snyder (Chair), Stanford
– Howard Chang, Stanford
– Klaus Kaestner, Penn
– Eugene Kolker, Seattle Children’s Research Institute
– Chris Stoeckert, Penn
– JP Cartailler, Vanderbilt
– Mark Magnuson, Vanderbilt
– Olivier Blondel, NIDDK
– Kristin Abraham, NIDDK

Available at http://www.betacell.org/about/policies/

RNA-seq Checklist

• Experiment:
– Contact person
– Objective
– Experimental factors

• Samples:
– Descriptions
– Protocols used for cell isolation
– Estimated cell purity
– Identify biological vs. technical replicates
– Amplification method and amount, if used

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63.

RNA-seq Checklist (cont.)

• Sequence:
– Protocols used
– Relation to samples
– Spike-in used
– Machine (version)
– Sequence length
– Single or paired end reads
– Strand specific?
– Bar codes?
– Data file of sequence reads with quality scores

• Sequence analysis:
– Quality assessment (correlations between biological replicates)
– Reference genome, transcriptome
– Mapping software (version) and parameters used
– Numbers of reads, whether trimming was done: total, mapped, unique
– File of gene/ transcript coordinates and normalized number of assigned reads (RPKM/ FPKM)
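RPKM stands for reads per kilobase of transcript per million mapped reads; FPKM is the same calculation with fragments (read pairs) counted instead of individual reads. A minimal sketch of the calculation follows. The function name and the example numbers are illustrative and do not come from the checklist.

```python
def rpkm(assigned_reads: int, feature_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of feature per million mapped reads.

    RPKM = assigned_reads / (feature_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = assigned_reads * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and total mapped reads must be positive")
    return assigned_reads * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp transcript, 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))
```

Because the value depends on the total number of mapped reads and on the transcript coordinates used, the checklist asks for both alongside the normalized counts.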

You can still deposit your sequence reads

• ArrayExpress/ European Nucleotide Archive/ European Genome-Phenome Archive

• GEO/ Short Read Archive
– http://www.ncbi.nlm.nih.gov/About/news/09may2011.html

ChIP-seq Checklist

• Experiment:
– Contact person
– Objective
– Experimental factors

• Samples:
– Descriptions (where applicable):
   • Cell line, lot number
   • Cell or tissue
   • Mouse strain, relevant alleles
   • Gender, age, disease status
– Protocols used for cell isolation
– Identify biological vs. technical replicates

• Antibody characterization:
– Company/ Core, catalog or ID and lot number
– Describe methods used and quality assessment; provide images.

Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009 Oct;10(10):669-80.

ChIP-seq Checklist (cont.)

• Sequence:
– Protocols used
– Relation to samples
– Machine (version)
– Sequence length
– Single or paired end reads
– Data file of sequence reads with quality scores

• Sequence analysis:
– Quality assessment (correlations between biological replicates)
– Reference genome
– Mapping software (version) and parameters used
– Numbers of reads, whether trimming was done: total, mapped, unique
– Methods used to call peaks
– Peak coordinates
– Peak signal
– Confidence score (e.g., p-value).

Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009 Oct;10(10):669-80.
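Both checklists ask for correlations between biological replicates as a quality assessment. One simple way to report this is the Pearson correlation of per-region (or per-gene) read counts from two replicates. The sketch below assumes that choice; the checklist does not prescribe a specific statistic, and the counts are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical read counts over the same five regions in two replicates.
replicate_1 = [12, 40, 7, 95, 33]
replicate_2 = [15, 38, 9, 90, 30]
print(f"replicate correlation: {pearson(replicate_1, replicate_2):.3f}")
```

Counts are often log-transformed before correlating so that a few highly covered regions do not dominate; whichever variant is used should be stated with the reported value.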

MAGE-TAB Format

What’s MAGE-TAB?

• MAGE-TAB is a simple spreadsheet view made up of two files plus the data files they reference:
– IDF: describes the experiment design, contact details, variables and protocols
– SDRF: a spreadsheet with columns that describe samples, annotations, protocol references, hybridizations and data
– Linked data files (e.g. CEL files), which are referenced by the SDRF

• For single-channel data, one row in the SDRF = 1 hybridization; for two-channel data, one row = 1 channel

• MAGE-TAB can also be used to annotate Next Gen Sequencing data
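Because both files are tab-delimited text, they can be read with any spreadsheet or with a few lines of code. The sketch below reads a toy IDF (one label per row, values across the columns) and a toy SDRF (one header row, then one row per hybridization). The file contents are invented for illustration; only the row and column labels follow MAGE-TAB naming, and the helper names read_idf and read_sdrf are hypothetical.

```python
import csv
import io


def read_idf(handle):
    """Map each IDF row label to its list of non-empty values."""
    idf = {}
    for row in csv.reader(handle, delimiter="\t"):
        if row and row[0].strip():
            idf[row[0].strip()] = [value for value in row[1:] if value.strip()]
    return idf


def read_sdrf(handle):
    """Return (header, rows). Column names such as 'Protocol REF' may repeat,
    so rows are kept as lists rather than dictionaries."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    return header, rows


# Toy, made-up content standing in for real IDF and SDRF files.
toy_idf = io.StringIO(
    "Investigation Title\tToy expression study\n"
    "Protocol Name\tP-EXTRACT\tP-HYB\n"
    "SDRF File\ttoy.sdrf.txt\n"
)
toy_sdrf = io.StringIO(
    "Source Name\tProtocol REF\tHybridization Name\tArray Data File\n"
    "liver_1\tP-HYB\thyb_1\thyb_1.CEL\n"
    "liver_2\tP-HYB\thyb_2\thyb_2.CEL\n"
)

idf = read_idf(toy_idf)
header, rows = read_sdrf(toy_sdrf)
data_file_column = header.index("Array Data File")
print(idf["Protocol Name"], [row[data_file_column] for row in rows])
```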

Where can I get MAGE-TAB from?

• ~10,000 MAGE-TAB files are available for download from ArrayExpress (includes GEO derived and ArrayExpress data)

• caArray also provides MAGE-TAB files for download.

Who’s using MAGE-TAB?

• Bioconductor

• GenePattern

• MeV

http://annotare.googlecode.com/files/E-TABM-34.idf.txt

http://annotare.googlecode.com/files/E-TABM-34.sdrf.txt

MAGE-TAB File covering MIAME components making use of the MGED Ontology

IDF (Investigation Description Format) for E-TABM-34 taken from the ArrayExpress archive

MAGE-TAB File covering MIAME components making use of the MGED Ontology

SDRF (Sample and Data Relationship Format) for E-TABM-34 taken from the ArrayExpress archive

University of Pennsylvania
Chris Stoeckert
Elisabetta Manduchi
Emily Allen
Ashwin Vishnuvardharna

Vanderbilt University
Mark Magnuson
Jean-Philippe Cartailler
Thomas Houfek
Josh Norman

MAGE-TAB is used by the Beta Cell Biology Consortium (BCBC)

http://www.betacell.org

A ChIP-Seq study

[Figure: MAGE-TAB representation of the study, with an IDF panel and an SDRF panel. The experimental design has a bench component and an in-silico component.]

Sample      Sequence file       ELAND file
Ptf1a_s5    Ptf1a_s5_seq.txt    s5_eland.txt
Ptf1a_s4    Ptf1a_s4_seq.txt    s4_eland.txt
Input_s8    Input_s8_seq.txt    s8_eland.txt
Rbpjl_s6    Rbpjl_s6_seq.txt    s6_eland.txt
Input_s2    Input_s2_seq.txt    s2_eland.txt
Rbpjl_s4    Rbpjl_s4_seq.txt    s4_eland.txt

Peak sets: Ptf1a_peaks, Rbpjl_peaks

Protocols: cluster generation, image acquisition, sequencing, alignment, peak calling
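An SDRF for a study like this one can be generated as a tab-delimited table with one row per sequenced sample, linking each sample to its files through the protocols applied. The sketch below uses the sample and file names from the slide. The column layout is an assumption based on general MAGE-TAB column naming, not the actual SDRF shown in the figure, and pairing each peak set with the samples that share its name prefix is likewise an assumption.

```python
import csv
import io

# (sample, sequence file, alignment file) as listed on the slide.
ASSAYS = [
    ("Ptf1a_s5", "Ptf1a_s5_seq.txt", "s5_eland.txt"),
    ("Ptf1a_s4", "Ptf1a_s4_seq.txt", "s4_eland.txt"),
    ("Input_s8", "Input_s8_seq.txt", "s8_eland.txt"),
    ("Rbpjl_s6", "Rbpjl_s6_seq.txt", "s6_eland.txt"),
    ("Input_s2", "Input_s2_seq.txt", "s2_eland.txt"),
    ("Rbpjl_s4", "Rbpjl_s4_seq.txt", "s4_eland.txt"),
]

# Assumed pairing of peak sets with samples, by name prefix; input (control)
# samples get no peak file of their own in this sketch.
PEAK_FILES = {"Ptf1a": "Ptf1a_peaks", "Rbpjl": "Rbpjl_peaks"}

# Assumed column layout; real submissions should follow the archive's template.
HEADER = [
    "Source Name", "Protocol REF", "Assay Name", "Array Data File",
    "Protocol REF", "Derived Array Data File",
    "Protocol REF", "Derived Array Data File",
]


def build_sdrf_rows():
    """One row per sample: sequencing, then alignment, then peak calling."""
    rows = []
    for sample, seq_file, aligned_file in ASSAYS:
        target = sample.split("_")[0]
        peaks = PEAK_FILES.get(target, "")
        rows.append([sample, "sequencing", sample, seq_file,
                     "alignment", aligned_file, "peak calling", peaks])
    return rows


def write_sdrf(handle):
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(build_sdrf_rows())


buffer = io.StringIO()
write_sdrf(buffer)
print(buffer.getvalue())
```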

Annotare - An open source standalone MAGE-TAB editor

Shankar R, Parkinson H, Burdett T, Hastings E, Liu J, Miller M, Srinivasa R, White J, Brazma A, Sherlock G, Stoeckert CJ Jr, Ball CA. Annotare - a tool for annotating high-throughput biomedical investigations and resulting data. Bioinformatics. 2010 Aug 23.

Annotare Features

• Intuitive graphical user interface forms for editing

• Ontology support: an inbuilt ontology and web services connectivity to BioPortal

• Searchable standard templates

• Design wizard

• Validation module

• Mac and Windows support

http://code.google.com/p/annotare/

MAGE-TAB supports reproducible HTS research

• Captures the minimal information (MINSEQE) needed
– RNA-seq and ChIP-seq guidelines are available from the Beta Cell Biology Consortium based on ENCODE / modENCODE guidelines.

• Provides a description of the workflow used

• Can use Excel or Annotare (facilitates use of ontologies) to generate these files

• Can be used to submit data to archives for publication

Thank you’s

• FGED Society

• Beta Cell Biology Consortium
– Bioinformatics/ Epigenomics Working group
– Beta Cell Genomics team

• MAGE-TAB/ Annotare group
