R-packages:
from ideas to software
Dr. Michael Shmoish
Computer Science Department
Technion – IIT
Workshop on R, Technion
March 28, 2005
Outline
• R-resources
• R-extension packages
a) What are they? How many of them? What kinds?
b) Loading packages
c) Getting help on packages
d) Using packages
• Home-made examples of transforming ideas into R-software:
a) Gene expression plots in the GeneCards http://genecards.weizmann.ac.il/
b) Gel-simulation
c) GC visualization
d) Genomic rearrangements
• R is a de facto language of Bioinformatics
• Bioconductor packages
R-Resources
• R homepage
http://www.r-project.org/
contains information on the R-project and (almost) everything related to it.
• CRAN page
http://cran.r-project.org/
is the download area, with the R-base software itself, extension packages, and PDF manuals.
R-Resources: R Reference Cards
• Rpad and R reference card, by Tom Short (a long one: 4 pages)
www.rpad.org/Rpad/Rpad-refcard.pdf
• R reference card, by Jonathan Baron (very short: 1 page only)
www.psych.upenn.edu/~baron/refcard.pdf
Publications (PNAS)
Publications (Nature)
What are R-packages?
• packages are self-contained units of code with documentation
• there are automatic testing features built in
• all functions must have examples and the examples must run
• interesting commands:
– update.packages, example, e.g. > example(hclust)
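A minimal sketch of the commands named above, assuming a standard R installation with its help files present. Here example() is asked to return the documented examples of hclust as text instead of running them; the calls that draw plots or contact a CRAN mirror are guarded so that they run only in an interactive session.

```r
# Fetch the examples documented for stats::hclust as a character vector,
# without executing them (give.lines = TRUE).
ex_lines <- example(hclust, package = "stats", give.lines = TRUE)
print(head(ex_lines))

if (interactive()) {
  # Run the same examples; this draws dendrograms on the graphics device.
  example(hclust, ask = FALSE)
  # update.packages() needs network access to a CRAN mirror and asks
  # before replacing each outdated package.
  update.packages(ask = TRUE)
}
```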
How many R packages?
Quite a lot!
• Base packages (CRAN: base, graphics, methods, stats).
• Contributed packages (CRAN: 485 packages as of March 2005).
• Bioconductor project packages (Bioconductor.org: annotate, affy, marray, multtest, hgu95av2, ALL, EMBO03).
• Others (Rgeo for analysis of spatial data; Rmetrics for financial market analysis; packages made available before submission to CRAN, not always updated and checked: dna, DNAcopy, DEDS).
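The counts above refer to the repositories in March 2005. A rough sketch of how to count what is installed locally, assuming nothing beyond base R; the numbers depend entirely on the local library.

```r
# installed.packages() returns a character matrix, one row per package.
ip <- installed.packages()
n_installed <- nrow(ip)

# Packages shipped with R itself carry Priority "base".
is_base <- !is.na(ip[, "Priority"]) & ip[, "Priority"] == "base"
base_pkgs <- sort(unique(rownames(ip)[is_base]))

cat("Installed packages:", n_installed, "\n")
cat("Base packages:", paste(base_pkgs, collapse = ", "), "\n")
```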
What kind of R packages?
• Analysis packages: implementation of statistical and graphical methods (cluster, lattice, nnet, rpart).
• Data packages: datasets for tutorials/books (UsingR, datasets), biological metadata packages consisting of environment objects for mappings between different gene identifiers (e.g., Affymetrix ID, LocusLink ID, PubMed ID), CDF and probe sequence information for Affymetrix chips (GO, hgu95av2, humanLLMappings, KEGG).
• Specialized/custom packages: code, data, documentation, and exercises for a particular project, article, or course:
– DNAcopy: detecting chromosomal regions with abnormal DNA copy number;
– EMBO03: Bioconductor course package;
– GeneTS: analysing multiple gene expression time series;
– golubEsets: Golub et al. (2000) ALL/AML dataset;
– yeastCC: Spellman et al. (1998) yeast cell cycle dataset;
– qtl: analysing QTL data.
R packages (cluster analysis)
• cclust: convex clustering methods.
• class: self-organizing maps (SOM).
• cluster:
– AGglomerative NESting (agnes),
– Clustering LARge Applications (clara),
– DIvisive ANAlysis (diana),
– Fuzzy Analysis (fanny),
– MONothetic Analysis (mona),
– Partitioning Around Medoids (pam).
• e1071:
– fuzzy C-means clustering (cmeans),
– bagged clustering (bclust).
• mva (now part of stats):
– hierarchical clustering (hclust),
– k-means (kmeans).
• GeneSOM: self-organizing maps.
• flexmix, fpc: fixed point clusters, clusterwise regression and discriminant plots.
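As an illustration of the two stats functions listed above, here is a short sketch that clusters the built-in USArrests data with hclust and kmeans. The choice of dataset, linkage method and k = 3 is an assumption made for the example only.

```r
# Standardise the four numeric columns so no variable dominates the distance.
x <- scale(USArrests)

# Hierarchical clustering on Euclidean distances, cut into 3 groups.
hc <- hclust(dist(x), method = "average")
hc_groups <- cutree(hc, k = 3)

# k-means with 3 centres; several random starts make the result more stable.
set.seed(2005)
km <- kmeans(x, centers = 3, nstart = 10)

# Cross-tabulate the two partitions to see how far they agree.
print(table(hclust = hc_groups, kmeans = km$cluster))
```

In an interactive session, plot(hc) draws the dendrogram behind hc_groups.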
Installing/loading R-packages
• From the RGui console menu – Packages
• Loading from the command line:
> library(RpackageName)
• Loading from inside other functions:
> require(RpackageName)
Both load the R-package named ‘RpackageName’ and make its functions available in the R environment.
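The practical difference between the two calls shows up when the package is missing: library() stops with an error, while require() warns and returns FALSE, which is why it suits use inside functions. A small sketch, using a deliberately non-existent package name (hypothetical, chosen for the example):

```r
pkg <- "notARealPackage123"  # hypothetical name, assumed not to be installed

# require() returns FALSE (with a warning) when the package cannot be loaded.
loaded <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)

# library() signals an error instead, so it has to be caught explicitly.
result <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")

cat("require() returned:", loaded, "\n")
cat("library() outcome:", result, "\n")
```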
Getting help on R-packages
The list of a package's functions (names with a brief description) can be displayed with the commands
> library(help = RpackageName) # or
> help(package = RpackageName)
To get help on a specific function FOO one can use:
> help(FOO, package = RpackageName) # or
> ?FOO # if the relevant package is loaded
> library() # lists all packages available for loading
> search() # gives a list of attached packages
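The same commands, written out for the stats package, which is attached by default in a standard R session. This is only a sketch; results are assigned to variables so that nothing is opened in a pager.

```r
# Package-level information (DESCRIPTION and function index).
pkg_index <- library(help = "stats")
print(class(pkg_index))

# Names of the objects the attached package makes available.
exported <- ls("package:stats")
print(head(exported))

# Locate the help page for one function without displaying it.
hclust_help <- help("hclust", package = "stats")

# The search path lists attached packages as "package:<name>".
attached <- search()
print(attached)
```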
Getting help on R-packages (example)
> help( package = limma)
> help( ebayes, package = limma)
> library(limma)
> ?ebayes
Warning:
> ?RpackageName #could fail!
R packages
• sound - provides basic functions for dealing with wav files and sound samples
• rimage - provides functions for image processing, including Sobel filter, rank filters, FFT, histogram equalization, and reading JPEG files (could be used for the automatic processing of gels).
Using R packages: rimage
> library (rimage)
> a = read.jpeg ("sb.jpg")
> plot (a)
> plot(rgb2grey(a))
> print(a)
size: 179 x 600
type: rgb
> plot(imagematrix(matrix(rnorm(179 * 600), 179, 600)))
Warning message:
Pixel values were automatically clipped because of range over. in: imagematrix(matrix(rnorm(179 * 600), 179, 600))
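The warning appears because rnorm() produces values outside the intensity range an image can hold. The sketch below reproduces the effect in base R only, without rimage, under the assumption that intensities are expected in [0, 1]; it compares clipping with rescaling.

```r
set.seed(42)
noise <- matrix(rnorm(179 * 600), nrow = 179, ncol = 600)

# Share of pixels that fall outside [0, 1] and would therefore be clipped.
frac_out_of_range <- mean(noise < 0 | noise > 1)

# Clipping: values below 0 become 0, values above 1 become 1.
clipped <- pmin(pmax(noise, 0), 1)

# Rescaling: map the full range of the data linearly onto [0, 1].
rescaled <- (noise - min(noise)) / (max(noise) - min(noise))

cat("Fraction of pixels out of range:", round(frac_out_of_range, 3), "\n")

if (interactive()) {
  # Show both versions side by side as grey-level images.
  op <- par(mfrow = c(1, 2))
  image(t(clipped), col = grey(0:255 / 255), axes = FALSE, main = "clipped")
  image(t(rescaled), col = grey(0:255 / 255), axes = FALSE, main = "rescaled")
  par(op)
}
```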
Probability distributions
• cumulative distribution function P(X ≤ x): ‘p’ for the CDF
• probability density function dP(X ≤ x)/dx: ‘d’ for the density
• quantile function (given q, the smallest x such that P(X ≤ x) ≥ q): ‘q’ for the quantile
• simulation of random numbers from the distribution: ‘r’ for random
Distribution      R name   Additional arguments
beta              beta     shape1, shape2, ncp
binomial          binom    size, prob
chi-squared       chisq    df, ncp
exponential       exp      rate
F                 f        df1, df2, ncp
gamma             gamma    shape, scale
geometric         geom     prob
hypergeometric    hyper    m, n, k
Probability distributions (cont.)
Distribution        R name
log-normal          lnorm
logistic            logis
negative binomial   nbinom
normal              norm
Poisson             pois
Student’s t         t
uniform             unif
Wilcoxon            wilcox
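The function name is the prefix letter followed by the R name from the table, e.g. ‘p’ + ‘norm’ gives pnorm. A short sketch for the normal and binomial distributions; the argument values are arbitrary choices for illustration.

```r
# Normal distribution: density, CDF, quantile, random numbers.
d0   <- dnorm(0)               # density at 0
p196 <- pnorm(1.96)            # P(X <= 1.96)
q975 <- qnorm(0.975)           # quantile for probability 0.975
set.seed(1)
r5   <- rnorm(5, mean = 10, sd = 2)

# Binomial distribution: the additional arguments are size and prob.
db <- dbinom(3, size = 10, prob = 0.5)    # P(X = 3)
pb <- pbinom(3, size = 10, prob = 0.5)    # P(X <= 3)
qb <- qbinom(0.5, size = 10, prob = 0.5)  # smallest x with P(X <= x) >= 0.5

print(c(d0 = d0, p196 = p196, q975 = q975))
print(c(db = db, pb = pb, qb = qb))
print(r5)
```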
Home-made examples of transforming ideas into R-software
Gene expression in the GeneCards
• non-standard ‘root’ scale
• colors
• easy to add texts and plots to existing plots
• correlation/length computations for variation plots
• mining Unigene/CGAP
Dr. Shmoish and Prof. Lancet, Dr. Chalifa-Caspi, Dr. Shmueli, Mrs. Safran of the Weizmann Institute of Science
Nucleic Acids Research 31,1:142-146 (2003)
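A sketch of how a ‘root’ scale, colors and added text can be combined in base graphics. This is not the GeneCards code: the expression values, tissue names and the choice of a fourth root are all assumptions made for illustration.

```r
# Hypothetical expression levels spanning several orders of magnitude.
expr <- c(heart = 12, brain = 340, liver = 2500, kidney = 45, lung = 0)

# Assumed root scale: a fourth root compresses large values but, unlike
# a log scale, still maps zero to zero.
root_power <- 1 / 4
y <- expr^root_power

# Tick marks are placed at transformed positions, labelled in original units.
ticks <- c(0, 1, 10, 100, 1000, 3000)

plot_root_scale <- function() {
  bp <- barplot(y, axes = FALSE, col = rainbow(length(y)),
                ylim = c(0, max(ticks)^root_power * 1.1),
                ylab = "expression (root scale)")
  axis(2, at = ticks^root_power, labels = ticks, las = 1)
  # Text can be added to the existing plot after it has been drawn.
  text(bp, y, labels = expr, pos = 3, xpd = TRUE)
}

if (interactive()) plot_root_scale()
```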
GC visualization
>Prochlorococcus_MIT9302 gi|51235135|gb|AY599029.1| Prochlorococcus marinus str. MIT 9302 PsbA (psbA) gene, partial cds
----GTTCCTTCATCTAACGCTATTGGTCTACACTTCTACCCAATTTGGGAAGCAGCTACTGTAGATGAGTGGT
TATACAACGGTGGTCCTTACCAGCTTGTTATTTTCCACTTCCTAATTGGTATCTCAGCATACATGGGAAG
ACAGTGGGAGCTTTCATACCGTTTAGGTATGCGTCCTTGGATCTGTGTTGCATACTCTGCACCAGTTTCA
GCAGCTTTCGCAGTATTTCTTGTATACCCATTCGGTCAAGGTTCATTCTCTGACGGAATGCCTCTAGGTA
TCTCTGGAACATTCAACTTCATGTTTGTTTTCCAGGCAGAGCACAACATTCTTATGCACCCATTCCATAT
GGCTGGTGTTGCTGGTATGTTCGGAGGATCTTTATTCTCAGCTATGCATGGTTCACTTGTTACTTCGTCT
CTAATCAGAGAAACAACTGAGACAGAGTCTCAGAACTATGGTTACAAGTTCGGACAAGAAGAAGAAACAT
>Prochlorococcus_MIT9515 gi|51235137|gb|AY599030.1| Prochlorococcus marinus str. MIT 9515 PsbA (psbA) gene, partial cds
----GTTCCTTCTTCAAATGCTATTGGTCTACACTTCTACCCAATTTGGGAAGCAGCTACTGTAGATGAGTGGT
TATACAACGGTGGTCCTTACCAGCTAGTAATTTTCCACTTCCTTATTGGTATCTCAGCTTACATGGGACG
TCAGTGGGAGCTTTCATACCGTTTAGGTATGCGTCCTTGGATCTGTGTTGCATACTCTGCACCAGTTTCA
GCAGCTTTCGCAGTATTCCTTGTATATCCATTTGGTCAAGGTTCATTCTCTGACGGAATGCCTTTAGGTA
TCTCTGGAACATTCAACTTCATGTTTGTTTTCCAGGCAGAGCACAACATTCTTATGCACCCATTCCATAT
GGCTGGTGTTGCAGGTATGTTCGGAGGATCATTATTCTCAGCAATGCATGGTTCACTTGTTACTTCATCT
CTAATCAGAGAAACAACTGAGACAGAGTCTCAGAACTATGGTTACAAGTTCGGACAAGAAGAAGAAACAT
>Prochlorococcus_MIT9312_HL
------gtggacatagacgaaataagcgagccagttgctggttcattcctatatggaaacaacatcatctcaggtgcagttgttccttcatccaacgctattggtc
Tacacttctacccaatttgggaagcagctactgtagatgagtggttatac
Dr. Shmoish and Dr. Beja, Mrs. Limor, Mr. Zeidner of Biology, Technion.
To appear in Environ Microbiol. (2005)
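Assuming that "GC" here means G+C content, a minimal base-R sketch of the underlying computation is a sliding-window GC fraction along a sequence. The window width and step are arbitrary, and the sequence is the first line of the MIT9302 record shown above, including its leading alignment gaps.

```r
# Split a sequence into upper-case bases, dropping alignment gaps ("-")
# and any other non-ACGT symbols.
clean_bases <- function(s) {
  bases <- strsplit(toupper(s), "")[[1]]
  bases[bases %in% c("A", "C", "G", "T")]
}

# Overall fraction of G and C.
gc_content <- function(s) mean(clean_bases(s) %in% c("G", "C"))

# GC fraction in windows of `width` bases, moved along by `step` bases.
sliding_gc <- function(s, width = 20, step = 5) {
  bases <- clean_bases(s)
  stopifnot(length(bases) >= width)
  starts <- seq(1, length(bases) - width + 1, by = step)
  vapply(starts,
         function(i) mean(bases[i:(i + width - 1)] %in% c("G", "C")),
         numeric(1))
}

mit9302 <- "----GTTCCTTCATCTAACGCTATTGGTCTACACTTCTACCCAATTTGGGAAGCAGCTACTGTAGATGAGTGGT"
gc_total  <- gc_content(mit9302)
gc_window <- sliding_gc(mit9302, width = 20, step = 5)

cat("Overall GC fraction:", round(gc_total, 3), "\n")
if (interactive()) {
  plot(gc_window, type = "b", xlab = "window", ylab = "GC fraction")
}
```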
Gel-simulation
Dr. Shmoish and Prof. Manor, Mr. Romi of Biology, Technion.
Genomic rearrangements
Dr. Shmoish and Prof. Pinter, Mr. Swidan of Computer Science Dept., Technion.
http://www.zbi.uni-saarland.de/cbi/stud/perspekt.shtml
R is a (de facto) language of Bioinformatics
Bioinformatics is an interdisciplinary field
Definitions of Bioinformatics on the Web:
• The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research, particularly in genomics. www.informatics.jax.org/mgihome/other/glossary.shtml
• The study of the application of computer and statistical techniques to the management of biological information. In genome projects, bioinformatics includes the development of methods to search databases quickly, to analyze DNA sequence information, and to predict protein sequence and structure from DNA sequence data. home.san.rr.com/dna/darryl/glossary.html
• The analysis of biological information using computers and statistical techniques; the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research. www.niehs.nih.gov/nct/glossary.htm
• The science of informatics as applied to biological research. Informatics is the management and analysis of data using advanced computing techniques. www.genencor.com/wt/gcor/glossary
Definitions of Bioinformatics on the Web (cont.):
• (Computational biology). This word has not a clear definition. It involves the analysis and interpretation of data and the development of algorithms and statistics. The term was coined to encompass computer applications in biological sciences but is now used to mean rather different things, from artificial intelligence and robotics to genome analysis. The term was originally applied to the computational manipulation and analysis of biological sequence data (DNA and/or protein), but now tends also to be used to embrace the manipulation and analysis of 3D structural data. www.biol.lu.se/mibiol/research/wachen/glossary.htm
• The use of computers to handle biological information. The term is often used to describe computational molecular biology – the use of computers to store, search and characterize the genetic code of genes, the proteins linked to each gene and their associated functions. www.syngenta.com/en/about_syngenta/research_tech_gloss.asp
• The application of computational techniques to the management and analysis of biological information. bioinf.uta.fi/xml/courses/glossary/glossary-items.xml
Definitions of Bioinformatics on the Web (cont.):
• The science that uses advanced computing techniques for management and analysis of biological data. Bioinformatics is particularly important as an adjunct to genomic research, which generates a large amount of complex data, involving billions of individual DNA building-blocks, and tens of thousands of genes. (SNP consortium) www.variagenics.com/glossary.html
• The science of managing and analyzing biological data using advanced computing techniques. Especially important in analyzing genomic research data. See also: informatics. doegenomestolife.org/glossary/glossary_b.html
• The use of computers in solving information problems in the life sciences. It mainly involves the creation of extensive electronic databases on genomes, protein sequences etc. Also involves techniques such as three-dimensional modelling of biomolecules and biological systems. www.universityscience.ie/pages/glossary.htm
• Computational or algorithmic approaches to the analysis and integration of genomic, proteomic, or chemical data residing in databases. Bioinformatics includes applications for the analysis of DNA and protein sequence patterns and similarities, tools for t www.dddmag.com/scripts/glossary.asp
R packages (proteomics)
Algorithms used for analysis of MALDI-MS spectra
1) LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis)
R package: MASS; functions: lda, qda
2) KNN (k-nearest neighbor)
R package: class; function: knn
3) Bagging, boosting classification trees
R packages: rpart, tree; functions: rpart, tree
4) SVM (Support Vector Machine)
R package: e1071; function: svm
Wu, B., Abbott, T., Fishman, D., McMurray, W., Mor, G., Stone, K., Ward, D., Williams, K., and Zhao, H. (2003) Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics, 19: 1636-1643
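To show what calling two of these functions looks like, the sketch below fits lda (MASS) and knn (class) to the built-in iris data. The iris measurements stand in for spectra purely for illustration, and the split size, seed and k = 5 are assumptions; this does not reproduce the analysis in the cited paper. MASS and class are recommended packages and are assumed to be installed.

```r
library(MASS)   # lda, qda
library(class)  # knn

# Random split of the 150 iris flowers into training and test sets.
set.seed(1)
train_idx <- sample(nrow(iris), 100)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Linear discriminant analysis: fit on the training set, predict the test set.
fit_lda  <- lda(Species ~ ., data = train)
pred_lda <- predict(fit_lda, newdata = test)$class

# k-nearest neighbours works directly on the numeric feature columns.
pred_knn <- knn(train = train[, 1:4], test = test[, 1:4],
                cl = train$Species, k = 5)

acc_lda <- mean(pred_lda == test$Species)
acc_knn <- mean(pred_knn == test$Species)
cat("Test accuracy, LDA:", acc_lda, " kNN:", acc_knn, "\n")
```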
R/qtl
Authors: Karl Broman, Hao Wu, Gary Churchill, Saunak Sen, & Brian Yandell
Bioinformatics challenges
• Large data: tens of thousands of genes across a few hundred samples
• Much of the data is non-numeric (e.g., the annotation of genes, mutations, genomic rearrangements)
• The role of the gene in a particular pathway
• Integration of the “omic” data (data sources are varied, with different formats)
Bioconductor
• Bioconductor is a (relatively) new software initiative
– www.bioconductor.org
• among the goals of this project is the deployment of high quality software for the analysis of the “omic” data
• the challenges are varied and exciting
Bioconductor Packages
• General infrastructure: Biobase, Biostrings, DynDoc, reposTools, rhdf5, ruuid, tkWidgets, widgetTools.
• Annotation: annotate, AnnBuilder + metadata packages.
• Graphics: geneplotter, hexbin.
• Pre-processing microarray data: affy, affycomp, affydata, affylmGUI, affyPLM, annaffy, gcrma, makecdfenv, limma, limmaGUI, marray, vsn.
• Other assays: aCGH, DNAcopy, prada, PROcess, RSNPer, SAGElyzer.
• Differential gene expression: EBarrays, edd, factDesign, genefilter, limma, limmaGUI, multtest, ROC.
• Graphs and networks: graph, RBGL, Rgraphviz.
• Gene Ontology: GOstats, goTools.
R packages (microarray data analysis)
[Diagram: microarray data analysis workflow and the packages used at each stage]
• Pre-processing: CEL, CDF files (affy, vsn); .gpr, .Spot, MAGEML files (marray, limma, vsn); result: exprSet
• Differential expression: edd, genefilter, limma, multtest, ROC, + CRAN
• Graphs & networks: graph, RBGL, Rgraphviz
• Cluster analysis: CRAN packages class, cluster, MASS, mva
• Prediction: CRAN packages class, e1071, ipred, LogitBoost, MASS, nnet, randomForest, rpart
• Annotation: annotate, annaffy, + metadata packages
• Graphics: geneplotter, hexbin, + CRAN
Messages to take home
• R is a good thing to be aware of
• you can (and have to) tRy it at home
• for Life Science researchers/students, R is a comprehensible high-level language with a lot of useful packages and many friendly features that extend their standard Excel abilities for data analysis
• for Exact Science researchers/students, R is a good entry point into the exciting world of Bioinformatics and arguably one of the best tools for transforming bioinformatics ideas into working and well-documented software organized in packages.
Thank you!