Getting Data into R & Bioconductor

Post on 17-Jan-2016

Aedín Culhane

aedin@jimmy.harvard.edu

http://www.hsph.harvard.edu/research/aedin-culhane/

Simple Excel SpreadSheet data

• Already described
– read.table()
– read.csv()
– scan()

• There are other formats, e.g. NetCDF

• However, these are more specialized to a data type
– Look at Technologies on BiocViews
– http://www.bioconductor.org/packages/release/BiocViews.html
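As a reminder of the base R readers listed above, here is a minimal sketch. The file contents and the variable names (tab_file, csv_file, tab, csv, header) are made up for illustration; only read.table(), read.csv() and scan() come from the slide.

```r
# Write a small tab-delimited table to a temporary file (made-up example data).
tab_file <- tempfile(fileext = ".txt")
writeLines(c("gene\tsample1\tsample2",
             "geneA\t1.5\t2.5",
             "geneB\t3.0\t4.0"), tab_file)

# read.table(): the general reader; header and separator are stated explicitly.
tab <- read.table(tab_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

# The same data as comma-separated values, e.g. a sheet saved from Excel as CSV.
csv_file <- tempfile(fileext = ".csv")
write.csv(tab, csv_file, row.names = FALSE)

# read.csv(): read.table() with defaults suited to CSV (header = TRUE, sep = ",").
csv <- read.csv(csv_file, stringsAsFactors = FALSE)

# scan(): reads a flat vector of values; here only the header line, as character tokens.
header <- scan(tab_file, what = character(), sep = "\t", nlines = 1, quiet = TRUE)

str(tab)
```

Note that R is case-sensitive, so the functions are read.table() and read.csv(), all lower case.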

Some common data types

• Microarray

• SNP

• Increasingly NGS

May 2011

A Microarray Overview

Reading Affymetrix Data

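library(affy)   # load the affy package

require(affy)   # alternative: returns FALSE instead of stopping if the package is missing

# Read all CEL files in a directory into an AffyBatch object
affybatch <- ReadAffy(celfile.path = "[Location of your data]")

# Read and RMA-normalize the CEL files in one step; returns an ExpressionSet
eSet <- justRMA(celfile.path = "[Location of your data]")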

Sample R code

ExpressionSet Class in R

Assessing Data Quality

Public Microarray Data

ArrayExpress • 21,997 studies (622,617 profiles)

GEO • 22,735 studies (558,074 profiles)

Statistics May 2011

>500,000 arrays x $500 = $250,000,000

Cancer Studies account for >14% of all studies in databases…

R Code

More on GEOquery

require(GEOquery)

Let's try to load the GDS810 dataset, which contains data on Alzheimer's disease at various stages of severity.

GDS810 <- getGEO("GDS810")

The getGEO function returns an object of class GEOData. You can get a description of this class like this:

help("GEOData-class")

Meta(GDS810)

Columns(GDS810)

head(Table(GDS810))

Affy SNP Arrays

Process – Affy SNP Arrays (Oligo package)

Other Arrays

• Illumina
– lumi package

• 2 color spotted arrays
– limma package

• Other arrays
– http://www.bioconductor.org/help/workflows/oligo-arrays/

Next Generation Sequencing Data

R Code

Exercise

• Bring down a GSE series from GEO

• Download the dataset GSE1297 using getGEO

• getGEO returns the series as a list of ExpressionSet (eSet) objects; to see the expression data and phenoData, use exprs and pData on the first element

• Use arrayQualityMetrics to assess the quality of these data
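One possible way to approach the exercise is sketched below. It assumes the Bioconductor packages GEOquery, Biobase and arrayQualityMetrics are installed and that a network connection is available. The function name fetch_gse and the argument qc_dir are hypothetical, chosen only for this illustration, and nothing is downloaded until the function is called.

```r
# Sketch only: fetch_gse and qc_dir are illustrative names, not part of any package.
fetch_gse <- function(accession = "GSE1297", qc_dir = NULL) {
  for (pkg in c("GEOquery", "Biobase")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package not installed: ", pkg)
    }
  }

  # For a GSE accession, getGEO() returns a list of ExpressionSet objects.
  gse_list <- GEOquery::getGEO(accession)
  eset <- gse_list[[1]]

  if (!is.null(qc_dir)) {
    # Writes an HTML quality report into qc_dir (overwriting it if present).
    arrayQualityMetrics::arrayQualityMetrics(expressionset = eset,
                                             outdir = qc_dir,
                                             force = TRUE)
  }

  # Expression matrix (probes x samples) and sample annotation table.
  list(expression = Biobase::exprs(eset), phenotype = Biobase::pData(eset))
}

# Usage (needs network access):
# res <- fetch_gse("GSE1297", qc_dir = "GSE1297_QC")
# dim(res$expression); head(res$phenotype)
```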

• With thanks to

• www.bioconductor.org/help/course.../Bioconductor-Introduction-lab.pdf

Quick Aside: Interpreting hierarchical clustering trees

Hierarchical analysis results are viewed using a dendrogram (tree)

• Distance between nodes (scale)

• Ordering of nodes is not important (like a baby mobile)

[Figure: two dendrograms, labelled A and B]

Tree A and B are equivalent
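The "baby mobile" point can be checked in base R. The sketch below uses made-up data and illustrative variable names (x, tree_a, tree_b, same_tree). It builds one tree, flips every branch to get a second drawing, and compares the cophenetic distances, i.e. the heights at which pairs of leaves join.

```r
# Made-up data: five samples measured on two features.
x <- matrix(c(1.0, 1.2, 5.0, 5.3, 9.0,
              1.0, 0.9, 5.1, 4.8, 9.2),
            ncol = 2, dimnames = list(paste0("s", 1:5), c("f1", "f2")))

hc <- hclust(dist(x), method = "average")
tree_a <- as.dendrogram(hc)

# rev() flips the left-to-right order of the branches at every node.
tree_b <- rev(tree_a)

# The leaf order differs between the two drawings...
labels(tree_a)
labels(tree_b)

# ...but the height at which any two leaves join is unchanged.
samples <- rownames(x)
coph_a <- as.matrix(cophenetic(tree_a))[samples, samples]
coph_b <- as.matrix(cophenetic(tree_b))[samples, samples]
same_tree <- isTRUE(all.equal(coph_a, coph_b))
same_tree
```

Calling plot(tree_a) and plot(tree_b) draws the two trees side by side for a visual comparison.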