Date post: | 15-Jan-2016 |
Upload: | milton-gregory |
© 2010 by University of Pennsylvania School of Medicine
Making Sense out of Flow Cytometry Data Overload
A crash course in R/Bioconductor and flow cytometry fingerprinting
Outline
• Background
  • R
  • Bioconductor
• Motivating examples
• Starting R, entering commands
• How to get help
• R fundamentals
  • Sequences and Repeats
  • Characters and Numbers
  • Vectors and Matrices
  • Data Frames and Lists
  • Importing data from spreadsheets
• flowCore
  • Loading flow cytometry (FCS) data
  • gating
  • compensation
  • transformation
  • visualization
• flowFP
  • Binning
  • Fingerprinting
  • Comparing multivariate distributions
• Writing your own functions
• Installing and running R on your computer
• Suggestions for further reading and reference
Background
• R
  • Is an integrated suite of software facilities for data manipulation, simulation, calculation and graphical display.
  • It handles and analyzes data very effectively and it contains a suite of operators for calculations on arrays and matrices.
  • In addition, it has the graphical capabilities for very sophisticated graphs and data displays.
  • It is an elegant, object-oriented programming language.
  • Started by Robert Gentleman and Ross Ihaka (hence “R”) in 1995 as a free, independent, open-source implementation of the S programming language (now part of Spotfire)
  • Currently maintained by the R Core development team, an international group of hard-working volunteer developers
http://www.r-project.org
http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
Background
• Bioconductor
  • “Is an open source and open development software project to provide tools for the analysis and comprehension of genomic data.”
  • Goals
    • To provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
    • To provide a common software platform that enables the rapid development and deployment of extensible, scalable, and interoperable software.
    • To further scientific understanding by producing high-quality documentation and reproducible research.
    • To train researchers on computational and statistical methods for the analysis of genomic data.
http://bioconductor.org/overview
A motivating example
I’ve just collected data from a T cell stimulation experiment in a 96-well plate format. I need to gate the data on CD3/CD4. How consistent are the distributions, so that I can establish one set of gates for the whole plate?
Another motivating example
I’m concerned that drawing gates to analyze my data introduces unintended bias. Additionally, since I have multiple data files, drawing multiple gates is time consuming. Can I use R to compute gates and then apply these same objective gating criteria to multiple data files?
Another motivating example
• Autogate lymphocytes and monocytes
• Automatically analyze FMO tubes
Back to the basics
• R is a command-line driven program
  • the prompt is: >
  • you type a command (shown in blue), and R executes the command and gives the answer (shown in black)
Simple example: enter a set of measurements
• Use the function c() to combine terms together
• Create a variable named mfi
• Put the result of c() into mfi using the assignment operator <- (you can also use =)
• The result is a vector; the [1] at the start of the printed output is the index of the first element shown on that line
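The steps above can be sketched as follows. The numbers are made-up fluorescence intensities chosen only for illustration; they are not taken from the original slides.

```r
# Combine four hypothetical measurements into a vector and store it in mfi
mfi <- c(1250, 980, 1432, 1105)

# Typing the variable name prints its contents, prefixed with [1]
mfi

# The vector can be passed straight to other functions
mean(mfi)
length(mfi)
```
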
Help, functions, polymorphism
> help (log)
> ?log
> apropos("log")
Vignettes – really good help!
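A short sketch of finding help and vignettes from the command line. The vignette name in the commented-out line is taken from the flowCore URL in the reference section and only works if flowCore is installed.

```r
# Find every object whose name contains "log"
hits <- apropos("log")
head(hits)

# Show the arguments a function accepts
args(log)

# List the vignettes of all installed packages without opening a viewer
vigs <- vignette()
head(vigs$results[, c("Package", "Item", "Title"), drop = FALSE])

# Open one vignette (assumes the flowCore package is installed):
# vignette("HowTo-flowCore", package = "flowCore")
```
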
Sequences and Repeats
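The original slide content for this topic did not survive extraction. As a minimal sketch, base R builds sequences with the colon operator and seq(), and repeats with rep(); the values and labels below are illustrative only.

```r
# Integer sequence with the colon operator
wells <- 1:12

# Sequence with a fixed step
steps <- seq(0, 1, by = 0.25)              # 0.00 0.25 0.50 0.75 1.00

# Sequence with a fixed number of points
doses <- seq(10, 100, length.out = 4)      # 10 40 70 100

# Repeat a whole vector, or repeat each element in turn
alternating <- rep(c("ctrl", "stim"), times = 3)
blocked     <- rep(c("ctrl", "stim"), each = 3)
```
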
Characters and Numbers
• Characters and character strings are enclosed in "" or ''
• Special numbers
  • NA – “Not Available”
  • Inf – “Infinity”
  • NaN – “Not a Number”
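A brief sketch of character strings and the special values in practice. The marker names and numbers are invented for illustration.

```r
# Character strings: double or single quotes both work
marker <- "CD4"
label  <- paste(marker, 'FITC', sep = "-")   # "CD4-FITC"

# NA marks a missing measurement and propagates through calculations
x <- c(1.5, NA, 3)
m_all  <- mean(x)                  # NA
m_seen <- mean(x, na.rm = TRUE)    # 2.25

# Inf and NaN arise from arithmetic
pos_inf <- 1 / 0                   # Inf
not_num <- 0 / 0                   # NaN

# is.na() is TRUE for both NA and NaN, but not for Inf
flags <- is.na(c(NA, NaN, Inf))
```
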
Vectors and Matrices
• The subset operator for vectors and matrices is [ ]
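A minimal sketch of the [ ] operator on a vector and on a matrix, using small made-up objects.

```r
v <- c(10, 20, 30, 40, 50)
second   <- v[2]          # one element by position: 20
some     <- v[c(1, 3)]    # several positions: 10 30
no_first <- v[-1]         # negative index drops an element: 20 30 40 50
big      <- v[v > 25]     # logical index keeps matching elements: 30 40 50

# matrix() fills column by column, so m is
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)
cell    <- m[1, 2]        # row 1, column 2: 3
col3    <- m[, 3]         # whole column: 5 6
row2    <- m[2, ]         # whole row: 2 4 6
```
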
Vectors and Matrices
• You can extend the length of a vector via subsetting
… but not a matrix
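A small sketch of the difference: assigning past the end of a vector grows it (padding with NA), while the same attempt on a matrix raises an error. tryCatch() is used here only so the example runs to completion.

```r
# Assigning to position 5 of a length-3 vector extends it
v <- 1:3
v[5] <- 10L
v                         # 1 2 3 NA 10

# A 2 x 2 matrix cannot be grown by assigning to a third row
m <- matrix(1:4, nrow = 2)
outcome <- tryCatch({
  m[3, 1] <- 99L
  "ok"
}, error = function(e) conditionMessage(e))
outcome                   # an error message, not "ok"
```
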
Vectors and Matrices
• However, all’s not lost if you want to extend either the columns …
… or rows
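A minimal sketch of extending a matrix with cbind() for columns and rbind() for rows. Both return a new matrix, so the result has to be assigned.

```r
m <- matrix(1:4, nrow = 2)      # 2 x 2

# Add a column
wider <- cbind(m, c(5, 6))      # 2 x 3

# Add a row
taller <- rbind(m, c(7, 8))     # 3 x 2

dim(wider)
dim(taller)
```
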
Data Frames
• A Data Frame is like a matrix, except that the data type in each column need not be the same
• Often, a Data Frame is created from an Excel spreadsheet using the function read.table()
  • In Excel, use Save As… to write a tab-delimited text file.
Data Frames from spreadsheets
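A sketch of the import step. So that it runs anywhere, the example first writes a small tab-delimited file to a temporary location to stand in for the file exported from the spreadsheet; the column names and values are invented.

```r
# Stand-in for a spreadsheet saved as tab-delimited text
tmp <- tempfile(fileext = ".txt")
writeLines(c("Sample\tTreatment\tCD4_pct",
             "A01\tunstim\t41.2",
             "A02\tstim\t38.7",
             "A03\tstim\t40.1"), tmp)

# header = TRUE takes column names from the first line; sep = "\t" splits on tabs
df <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

str(df)                           # columns keep their own types
df$CD4_pct                        # one column, by name
stim <- df[df$Treatment == "stim", ]   # rows matching a condition
```
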
Lists
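The slide content for lists was not preserved. As a minimal sketch, a list can hold elements of different types and lengths, accessed by name with $ or [[ ]]; the element names below are illustrative.

```r
# A list mixing a string, another string, and a numeric vector
well <- list(id = "A01", treatment = "stim", mfi = c(1250, 980, 1432))

well$id                    # access by name
second_mfi <- well[["mfi"]][2]   # [[ ]] returns the element itself: 980
sub <- well["mfi"]         # [ ] returns a shorter list

# Elements can be added after creation
well$gated <- TRUE
names(well)
sapply(well, class)
```
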
Handling Flow Cytometry Data: flowCore
• flowCore is a base package that supports reading and manipulation of FCS data files
• The fundamental object that encapsulates the data in an FCS file is a flowFrame
• A container object that holds a collection of flowFrames is called a flowSet
• In the next slides we will go over
  • reading an FCS file
  • gating
  • compensation
  • transformation
  • visualization
Check out the example data
Read an FCS file, summarize the flowFrame
Apply the lymphocyte gate with Subset
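The code from these slides was not preserved, so the following is only a sketch of the two steps under stated assumptions: that the flowCore package is installed, that the example file 0877408774.B08 ships in its extdata directory, and that the file has FSC-H and SSC-H channels. The gate limits are arbitrary placeholders, not the values used in the talk. The block skips itself if the assumptions do not hold.

```r
has_flowCore <- requireNamespace("flowCore", quietly = TRUE)

if (has_flowCore) {
  library(flowCore)

  # Example file assumed to be bundled with flowCore
  fcs_path <- system.file("extdata", "0877408774.B08", package = "flowCore")

  if (nzchar(fcs_path)) {
    # Read the file into a flowFrame, keeping the data as stored
    ff <- read.FCS(fcs_path, transformation = FALSE)
    print(ff)
    print(summary(ff))

    if (all(c("FSC-H", "SSC-H") %in% colnames(ff))) {
      # Rectangular scatter gate; limits are illustrative only
      lymph_gate <- rectangleGate("FSC-H" = c(200, 600),
                                  "SSC-H" = c(0, 400),
                                  filterId = "lymphocytes")

      # filter() reports how many events fall in the gate ...
      fres <- filter(ff, lymph_gate)
      print(summary(fres))

      # ... and Subset() returns a new flowFrame with only those events
      lymphs <- Subset(ff, lymph_gate)
      print(nrow(lymphs))
    }
  }
}
```
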
needs to be transformed because it is rendering the linear data in the FCS file
hasn’t been compensated!
• Lines require library(fields)
• Percentages are in summary(fres)$p[1:4]
• Percentages are drawn in the graph with text()
Fingerprinting Flow Cytometry Data: flowFP
• flowFP aims to transform flow cytometric data into a form amenable to algorithmic analysis tools
  • Acts as an intermediate step between acquisition of high-throughput FCM data and empirical modeling, machine learning and knowledge discovery
  • Implements ideas from
    Roederer M, Moore W, Treister A, Hardy RR & Herzenberg LA. Probability binning comparison: a metric for quantitating multivariate distribution differences. Cytometry 45:47-55, 2001.
    and
    Rogers WT, Moser AR, Holyst HA, Bantly A, Mohler ER III, Scangas G, and Moore JS. Cytometric Fingerprinting: Quantitative Characterization of Multivariate Distributions. Cytometry 73A: 430-441, 2008.
The basic idea
• Subdivide multivariate space into bins
  • Call this a “model” of the space
• For each flowFrame in a flowSet, count the number of events in each bin in the model
• Flatten the collection of counts for a flowFrame into a 1D feature vector
• Combine all of the feature vectors together into an n x m matrix
  • n = number of flowFrames (instances)
  • m = number of bins in the model (features)
• Also, tag each event with its bin membership
  • facilitates visualization, interpretation
  • can be used for gating
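The idea above can be sketched in base R. This is a simplified illustration and not flowFP's implementation: it assumes bins are made by recursively splitting at the median of the parameter with the largest variance, and the function names fit_model, n_bins, bin_events and fingerprint are hypothetical. The data are simulated.

```r
# Build a model: recursively split the events at the median of the
# parameter with the largest variance. A NULL node is a leaf (one bin).
fit_model <- function(x, levels) {
  if (levels == 0) return(NULL)
  d   <- which.max(apply(x, 2, var))
  cut <- median(x[, d])
  lo  <- x[, d] <= cut
  list(dim = d, cut = cut,
       low  = fit_model(x[lo, , drop = FALSE],  levels - 1),
       high = fit_model(x[!lo, , drop = FALSE], levels - 1))
}

# Number of bins (leaves) in a model
n_bins <- function(model) {
  if (is.null(model)) 1L else n_bins(model$low) + n_bins(model$high)
}

# Tag each event (row of x) with the number of the bin it falls in
bin_events <- function(model, x) {
  if (is.null(model)) return(rep(1L, nrow(x)))
  lo   <- x[, model$dim] <= model$cut
  bins <- integer(nrow(x))
  bins[lo]  <- bin_events(model$low, x[lo, , drop = FALSE])
  bins[!lo] <- n_bins(model$low) + bin_events(model$high, x[!lo, , drop = FALSE])
  bins
}

# One row of counts per sample: an n x m matrix of fingerprints
fingerprint <- function(model, samples) {
  m <- n_bins(model)
  t(sapply(samples, function(x) tabulate(bin_events(model, x), nbins = m)))
}

# Simulated two-parameter samples, 1000 events each
set.seed(1)
samples <- list(a = matrix(rnorm(2000), ncol = 2),
                b = matrix(rnorm(2000, mean = 0.5), ncol = 2),
                c = matrix(rnorm(2000, sd = 2), ncol = 2))

model <- fit_model(samples$a, levels = 3)   # 2^3 = 8 bins
fp    <- fingerprint(model, samples)        # 3 x 8 matrix
fp
```

Because the model is built from sample a, its events spread evenly across the bins; samples drawn from a different distribution show uneven counts, which is what makes the fingerprint useful for comparison.
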
Probability Binning
[Figure axis label: Bin Number]
> plot (mod, fs)
Class Constructors
• flowFPModel (base class)
  • Consumes a flowFrame or flowSet
  • Produces a model, which is a recipe for subdividing multivariate space
• flowFP
  • Consumes a flowFrame or flowSet, and a flowFPModel
  • Produces a flowFP, which represents the multivariate probability density function as a fingerprint
  • Also tags each event with its bin membership
• flowFPPlex
  • Consumes a collection of flowFPs
  • The flowFPPlex is a container object to facilitate handling large and complex collections of flowFPs
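A hedged sketch of how the first two constructors might be chained. It assumes the flowFP package is installed, that it ships an example flowSet named fs1, that flowFPModel() accepts an nRecursions argument, and that counts() extracts the count matrix; these details are recalled rather than confirmed by the slides, so check ?flowFPModel and ?flowFP before relying on them. The block skips itself when flowFP is not available.

```r
has_flowFP <- requireNamespace("flowFP", quietly = TRUE)

if (has_flowFP) {
  library(flowFP)

  # Example flowSet assumed to be bundled with flowFP
  data(fs1)

  # Model: the recipe for subdividing the space
  # (nRecursions = 5 is assumed to give 2^5 = 32 bins)
  mod <- flowFPModel(fs1, nRecursions = 5)

  # Fingerprint: apply the model to each flowFrame in the flowSet
  fp <- flowFP(fs1, mod)

  # One row per flowFrame, one column per bin (accessor name assumed)
  print(dim(counts(fp)))
}
```
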
Writing Your Own Functions
• Parts of a function definition: comments, declaration, assignment, return, code block

#
# It's a good idea to comment your code
#
myfunc <- function (arg1=10, arg2, ...)
{
    # your code goes here
    answer <- log (arg1, base=arg2)

    return (answer)
}
Writing Your Own Functions
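A short usage sketch for myfunc. The definition is repeated so the example is self-contained; the argument values are arbitrary.

```r
myfunc <- function(arg1 = 10, arg2, ...) {
  # logarithm of arg1 in base arg2
  answer <- log(arg1, base = arg2)
  return(answer)
}

# Arguments matched by position
r1 <- myfunc(100, 10)       # log base 10 of 100

# arg1 falls back to its default of 10; arg2 is given by name
r2 <- myfunc(arg2 = 2)      # log base 2 of 10

# Mixing positional and named arguments
r3 <- myfunc(8, arg2 = 2)   # log base 2 of 8

# arg2 has no default, so leaving it out is an error
r4 <- tryCatch(myfunc(5), error = function(e) "error: arg2 is required")
```
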
Obtaining R and Bioconductor
• R http://cran.r-project.org/
• Bioconductor http://bioconductor.org/GettingStarted
General Reference Material
• A good beginner’s guide to R
  http://cran.r-project.org/doc/contrib/Owen-TheRGuide.pdf
• A nice one-page reference card
  http://cran.r-project.org/doc/contrib/Short-refcard.pdf
• Outstanding summary of R/Bioconductor, with many examples
  http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#R_favorite
• The definitive reference for writing R extensions (advanced!)
  http://cran.r-project.org/doc/manuals/R-exts.pdf
• Books
  • William N. Venables and Brian D. Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, New York, 2002. ISBN 0-387-95457-0.
  • John M. Chambers. Programming with Data. Springer, New York, 1998. ISBN 0-387-98503-4 (aka “the Green Book”)
Flow-Specific References
• Vignettes
  http://bioconductor.org/packages/2.6/bioc/vignettes/flowCore/inst/doc/HowTo-flowCore.pdf
  http://bioconductor.org/packages/2.6/bioc/vignettes/flowViz/inst/doc/filters.pdf
  http://bioconductor.org/packages/2.6/bioc/vignettes/flowStats/inst/doc/GettingStartedWithFlowStats.pdf
  http://bioconductor.org/packages/2.6/bioc/vignettes/flowQ/inst/doc/DataQualityAssessment.pdf
  http://bioconductor.org/packages/2.6/bioc/vignettes/flowFP/inst/doc/flowFP_HowTo.pdf
• Original Articles
  • flowCore
    Hahne, F., N. LeMeur, et al. (2009). "flowCore: a Bioconductor package for high throughput flow cytometry." BMC Bioinformatics 10: 106.
  • Fingerprinting
    Rogers, W. T., A. R. Moser, et al. (2008). "Cytometric fingerprinting: quantitative characterization of multivariate distributions." Cytometry A 73(5): 430-41.
    Rogers, W. T. and H. A. Holyst (2009). "flowFP: A Bioconductor Package for Fingerprinting Flow Cytometric Data." Advances in Bioinformatics 2009(Article ID 193947): 11.