Data Analysis using the R Project for Statistical Computing
Daniela Ushizima, NERSC Analytics
Lawrence Berkeley National Laboratory
Outline
I. R programming
– Why to use R
– R in the scientific community
– Extensible
– Graphics
– Profiling
II. Exploratory data analysis
– Regression
– Clustering algorithms
III. Case study
– Accelerated laser-wakefield particles
IV. HPC
– State of the art
R-PROGRAMMING
Packages, data visualization and examples
Download: http://www.r-project.org
Recommended tutorial: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
R is a language and environment for statistical computing and graphics, a GNU project. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
1. Why to use R?
• Open-source, multiplatform, extensible;
• Easy for users familiar with S/S+, Matlab, Python or IDL;
• Active and growing community:
– Google, Pfizer, Merck, Bank of America, Boeing, the InterContinental Hotels Group and Shell.
2. R in the scientific community
2.1. YouR with NERSC
• Get started with R on DaVinci:
> module load R
> R
> help()
> demo()
> help.start()
> source('your_function.R')
> library(package_name)
3. Extensible
• Add-on packages:
– Data input/output: hdf5, Rnetcdf, DICOM, etc.
– Graphics: trellis, gplot, RGL, fields, etc.
– Multivariate analysis: MASS, mclust, ape, etc.
– Other languages: Rcpp, Rpy, R.matlab, etc.
4. Statistical analysis and graphs
• Histogram
• Density
• Boxplot
• Multivariate plot
• Conditioning plot
• Contour plot
4.1. Multivariate plots
> data = read.table('ozone.data.txt', header=TRUE)
> names(data)
[1] "rad" "temp" "wind" "ozone"
> pairs(data, panel=panel.smooth)  # panel.smooth = locally-weighted polynomial regression
Ex: Explanatory variables: solar radiation, temperature and wind; response variable: ozone;
- use of pairs() with data frames to check for dependencies between the variables.
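The file ozone.data.txt is not distributed with R, so the following is only a sketch that uses the built-in airquality data set as a stand-in. It holds the same kind of measurements (solar radiation, temperature, wind, ozone); the column renaming is an assumption made here to mirror the names on the slide.

```r
# Stand-in for ozone.data.txt: the airquality data set that ships with R.
aq <- na.omit(airquality[, c("Solar.R", "Temp", "Wind", "Ozone")])
names(aq) <- c("rad", "temp", "wind", "ozone")   # names chosen to match the slide

# Scatterplot matrix with a smoothed trend line in every panel.
pairs(aq, panel = panel.smooth)

# Numerical counterpart of the visual check: pairwise correlations.
round(cor(aq), 2)
```

Note that panel has to be passed by name, since the second positional argument of pairs() is labels.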
4.2. Conditional plots
• Check the relation between the two explanatory variables wind and temp and the response variable ozone;
> coplot(ozone ~ wind | temp, data=data, panel=panel.smooth)
4.3. Package RGL for 3D visualization
• OpenGL
– rgl.demo.lsystem()
– kernel density estimation
Use VisIt: https://wci.llnl.gov/codes/visit/
5. Profiling
• Where does your program spend more time?
several.times <- function(n, f, ...) {  # "..." passes a variable number of arguments on to f
  for (i in 1:n) { f(...) }
}
matrix.multiplication <- function(s) {
  A <- matrix(1:(s*s), nrow=s, ncol=s)
  B <- matrix(1:(s*s), nrow=s, ncol=s)
  C <- A %*% B
}
v <- NULL
for (i in 2:10) {
  v <- append(v, system.time(several.times(10000, matrix.multiplication, i))[1])
}
plot(v, type='b', pch=15, main="Matrix product computation time")
Also try packages: profr and proftools
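Besides system.time(), base R ships a sampling profiler, Rprof(), that reports which functions account for the elapsed time. The sketch below reuses the two helper functions from this slide; the repetition count, the matrix size and the sampling interval are arbitrary choices made here so that the run lasts long enough to collect samples.

```r
several.times <- function(n, f, ...) {
  for (i in seq_len(n)) f(...)
}
matrix.multiplication <- function(s) {
  A <- matrix(1:(s * s), nrow = s, ncol = s)
  A %*% A
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)     # start sampling the call stack
several.times(200000, matrix.multiplication, 10)
Rprof(NULL)                           # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)                    # functions ranked by time spent in their own code
```

The by.total component of the summary gives the same ranking with time spent in callees included.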
EXPLORATORY DATA ANALYSIS
Basics and beyond
1. Statistical analysis
• Statistical modeling: check for variations in the response variable given explanatory variables;
– Linear regression
• Multivariate statistics: look for structure in the data;
– Clustering:
• Hierarchical
– Dendrograms
• Partitioning
– Kmeans (stats)
– Mixture-models (mclust)
2. Linear regression
• Ex: Find the equation that best fits the data, given the decay of radioactive emission over a 50-day period
• Linear regression: variables expected to be linearly related;
• Maximum likelihood estimates of parameters = least squares (for normally distributed errors);
2.1. Linear regression
data = read.table('sapdecay.txt', header=TRUE)
attach(data)
par(mfrow=c(1,3))
plot(x, y, main='Decay of radioactive emission over a 50-day period', xlab='days')
# the log(y) gives a rough idea of the decay constant, a, for these data by linear regression of log(y) against x
mylm = lm(log(y) ~ x)
print(mylm$coefficients)
# sum of squares of the difference between the observed yv and predicted yp values of y, given a specific value of parameter a
sumsq <- function(a, xv=x, yv=y) {
  yp = exp(-a*xv)  # predicted model for y
  sum((yv-yp)^2)
}
a = seq(0.01, 0.2, .005)
sq = sapply(a, sumsq)
plot(a, sq, type='l', xlab='decay constant', ylab='sum of squares of (observ-predicted)')
decayK = a[min(sq)==sq]  # this is the least-squares estimate for the decay constant
matplot(decayK, min(sq), pch=19, col='red', add=TRUE)
plot(x, y)
days = seq(0, 50, 0.1)
lines(days, exp(-decayK*days), col='blue')
detach(data)
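Since sapdecay.txt is not bundled with R, the same two estimates can be tried on simulated data. This is a sketch under stated assumptions: the true decay constant (0.05), the noise level and the random seed are invented here purely for illustration.

```r
set.seed(42)
x <- 0:50                                   # days
true_a <- 0.05                              # assumed decay constant
y <- exp(-true_a * x) * exp(rnorm(length(x), sd = 0.05))  # multiplicative noise keeps y > 0

# Rough estimate: slope of the regression of log(y) against x.
fit  <- lm(log(y) ~ x)
a_lm <- -unname(coef(fit)["x"])

# Least-squares estimate on the original scale, by grid search as on the slide.
sumsq  <- function(a) sum((y - exp(-a * x))^2)
a_grid <- seq(0.01, 0.2, by = 0.005)
sq     <- sapply(a_grid, sumsq)
a_ls   <- a_grid[which.min(sq)]

c(log_linear = a_lm, grid_search = a_ls)
```

The grid search can only return values on the grid, so its resolution is limited to the step of 0.005; nls() or optimize() would refine it.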
3. Cluster analysis
• Hierarchical
– dendrogram (stats)
• Partitioning
– kmeans (stats)
• Mixture-models:
– Mclust (mclust)
Iris dataset: 150 samples of Iris flowers described in terms of their petal and sepal length and width
3.1. Hierarchical clustering
• Analysis of a set of dissimilarities, combined with an agglomeration method:
• Dissimilarities: Euclidean, Manhattan, …
• Methods:
– ward, single, complete, average, mcquitty, median or centroid.
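A minimal sketch of these two choices on the built-in iris data: dist() computes the dissimilarities and hclust() applies the agglomeration method. The choice of Euclidean distance, average linkage and three groups is an assumption made here for illustration. In recent R versions the Ward method is spelled "ward.D" or "ward.D2".

```r
# Dissimilarities between the 150 flowers, from the four measurements.
d <- dist(iris[, 1:4], method = "euclidean")

# Agglomerative clustering with average linkage.
hc <- hclust(d, method = "average")

# Cut the tree into three groups and compare them with the known species.
groups <- cutree(hc, k = 3)
table(groups, iris$Species)

# plot(hc, labels = FALSE) draws the dendrogram.
```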
3.2. K-means
• Split n observations into k clusters;
– each observation belongs to the cluster with the nearest mean.
    setosa versicolor virginica
  1      0         48        14
  2      0          2        36
  3     50          0         0
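A cross-tabulation like the one above can be obtained by comparing k-means cluster labels with the known species. The sketch below is one way to do it; the seed and the number of random starts are choices made here, and the cluster numbering is arbitrary, so rows may appear in a different order.

```r
set.seed(123)                       # k-means starts from random centers
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Rows: cluster labels; columns: species.
tab <- table(km$cluster, iris$Species)
tab

km$centers                          # the three cluster means
```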
3.3. Model-based clustering
• Mixture Models
– Each cluster is mathematically represented by a parametric distribution;
– Set of k distributions is called a mixture, and the overall model is a finite mixture model;
– Each probability distribution gives the probability of an instance being in a given cluster.
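To make the idea concrete, here is a bare-bones expectation-maximization (EM) fit of a two-component, one-dimensional Gaussian mixture in base R. It is a teaching sketch, not what mclust does internally: the simulated data, the starting values and the fixed number of iterations are all assumptions made here.

```r
set.seed(1)
x <- c(rnorm(150, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))
n <- length(x)
k <- 2

# Starting values: equal weights, means at the quartiles, common spread.
w  <- rep(1 / k, k)
mu <- as.numeric(quantile(x, c(0.25, 0.75)))
s  <- rep(sd(x), k)

for (iter in 1:200) {
  # E-step: probability that each observation belongs to each component.
  dens <- sapply(1:k, function(j) w[j] * dnorm(x, mu[j], s[j]))  # n x k matrix
  resp <- dens / rowSums(dens)
  # M-step: update weights, means and standard deviations.
  nk <- colSums(resp)
  w  <- nk / n
  mu <- colSums(resp * x) / nk
  s  <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
}

cluster <- apply(resp, 1, which.max)   # hard assignment to the likelier component
round(rbind(weight = w, mean = mu, sd = s), 2)
```

With the mclust package installed, Mclust(iris[, 1:4]) fits multivariate Gaussian mixtures and chooses the covariance model and the number of components by BIC.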
Case study
Accelerated laser-wakefield particles
http://www.lbl.gov/publicinfo/newscenter/features/2008/apr/af-bella.html
Knowledge discovery in LWFA science via machine learning
• PI: C. Geddes (LBNL) in SciDAC COMPASS project, INCITE.
• Accomplishments:
– Described compact electron clouds using minimum enclosing ellipsoids;
– Developed algorithms to adapt mixture model clustering to large datasets;
• Science Impact:
– Automated detection and analysis of compact electron clouds;
– Derived dispersion features of electron clouds;
– Extensible algorithms to other science problems;
• Collaborators:
– Tech-X
– Math Group, LBNL
– UC Davis, University of Kaiserslautern
Framework
• Goal: automate the analysis of electron bunches by detecting compact groups of particles with similar momentum and spatio-temporal coherence.
B1. Select relevant particles
• Beams of interest are characterized by high density of high-energy particles:
1. Elimination of low-energy particles (px < 1e10)
– Wake oscillation: px <= 1e9
– Excludes particles of the background
2. Calculation of the simulation average number of particles (µs);
3. Elimination of time steps with fewer particles than µs;
Representation of particle momentum in one time step: spline interpolation onto a grid for visualization of irregularly spaced input data.
Packages: akima, hdf5, fields
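The three selection steps above can be sketched on a toy table of particles. Everything about the data layout is hypothetical: the data frame, its column names (step, px) and the values are invented here, and averaging over all time steps (including any left empty by the threshold) is an assumption about how µs is defined.

```r
# Hypothetical toy data: one row per particle, with its time step and momentum px.
particles <- data.frame(
  step = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  px   = c(5e8, 2e10, 3e10, 4e10, 6e10, 8e9, 5e10, 9e10, 1e9, 2e9)
)

# 1. Drop low-energy particles (threshold taken from the slide).
px_min <- 1e10
high <- particles[particles$px >= px_min, ]

# 2. Average number of remaining particles per time step.
all_steps  <- sort(unique(particles$step))
counts     <- table(factor(high$step, levels = all_steps))
n_per_step <- as.vector(counts)
mu_s       <- mean(n_per_step)

# 3. Keep only time steps with at least mu_s particles.
keep_steps <- all_steps[n_per_step >= mu_s]
selected   <- high[high$step %in% keep_steps, ]
selected
```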
B2. Kernel-based estimation
• Kernel density estimators are less sensitive than histograms to the placement of the bin edges;
• Goal: retrieve a dense group of particles with similar spatial and momentum characteristics: argmax f(x,y,px), Neighborhood: 2 µm
Packages: misc3d, rgl, fields
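The argmax idea can be illustrated in two dimensions with kde2d() from the MASS package, which is normally installed alongside R. This is a simplified stand-in for the three-dimensional estimate f(x,y,px) on the slide: the simulated points, the grid size and the default bandwidth are assumptions made here, and the 2 µm neighborhood step is not implemented.

```r
library(MASS)

set.seed(7)
# Simulated particles: a diffuse background plus one dense bunch near (7, 3).
background <- cbind(runif(300, 0, 10), runif(300, 0, 10))
bunch      <- cbind(rnorm(200, 7, 0.3), rnorm(200, 3, 0.3))
pts <- rbind(background, bunch)

# Kernel density estimate on a 50 x 50 grid over the whole domain.
dens <- kde2d(pts[, 1], pts[, 2], n = 50, lims = c(0, 10, 0, 10))

# Grid cell with the highest estimated density (the argmax).
peak    <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
mode_xy <- c(x = dens$x[peak[1]], y = dens$y[peak[2]])
mode_xy
```

image(dens) or contour(dens) shows the estimated surface with the peak at the bunch.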
B3. Identify beam candidates
• Detection of compact groups of particles independent of being a maximum in one of the variables;
B4. Cluster using mixture models
• Model and number of clusters can be selected at run time (mclust);
• Partition of multidimensional space;
• Assume that the functional form of the underlying probability density follows a mixture of normal distributions;
Packages: mclust, rgl
B5. Evaluation of compactness
• Bunches of interest move at speed ≈ c, hence are nearly stationary in the moving simulation window;
• Moving averages smooth out short-term fluctuations and highlight longer-term trends.
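A centered moving average can be computed in base R with stats::filter(). In this sketch the series is a made-up compactness measure per time step; the trend, the noise level and the window width of 5 are assumptions chosen only to show the smoothing.

```r
set.seed(3)
steps  <- 1:40
# Hypothetical compactness measure: slow trend plus short-term noise.
spread <- 5 + 0.05 * steps + rnorm(40, sd = 0.5)

k <- 5                                            # window width
smooth <- stats::filter(spread, rep(1 / k, k), sides = 2)

# The first and last (k - 1) / 2 values are NA because the window is incomplete there.
head(cbind(step = steps, raw = spread, smoothed = smooth))
```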
High performance computing
Packages, challenges and new businesses
1. Improve performance/reusability
• Good coding: avoid loops, use vectorization;
• Extend R using compiled code:
– packages: Rcpp, inline
• Recycle your Python codes:
– Package: Rpython
• Parallelism:
– Explicit: packages Rmpi, Rpvm, nws
– Implicit: packages pnmath, pnmath0 for multithreaded math functions
• Use out-of-memory processing with
– packages bigmemory and ff
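The first point, replacing loops with vectorized calls, is easy to try out. The sketch below compares an explicit loop with its vectorized equivalent on an arbitrary example (a sum of squares); the function names and the vector length are invented for illustration.

```r
n <- 1e5
x <- runif(n)

# Explicit loop: one interpreted iteration per element.
loop_sum_sq <- function(x) {
  total <- 0
  for (xi in x) total <- total + xi^2
  total
}

# Vectorized version: the loop runs inside compiled code.
vec_sum_sq <- function(x) sum(x^2)

same <- isTRUE(all.equal(loop_sum_sq(x), vec_sum_sq(x)))
same

system.time(for (r in 1:20) loop_sum_sq(x))
system.time(for (r in 1:20) vec_sum_sq(x))
```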
2. What is going on in HPC with R?
• Parallelism:
– Multicore: multicore, pnmath, …
– Computer cluster: snow, Rmpi, rpvm, …
– Grid computing: GRIDR, …
• GPU:
– gputools: parallel algorithms using CUDA + CUBLAS
• Extremely large data:
– ff: memory mapped pages of binary flat files.
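Current R distributions include the parallel package, which builds on the multicore and snow packages listed above. As a minimal sketch of multicore parallelism, mclapply() is a drop-in replacement for lapply(); the worker function, the task list and the use of two cores are assumptions made here, and forking is not available on Windows, hence the fallback to one core.

```r
library(parallel)

# Forking is unavailable on Windows, so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Hypothetical slow task.
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

serial   <- lapply(1:8, slow_square)
parallel <- mclapply(1:8, slow_square, mc.cores = cores)

identical(serial, parallel)
```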
3. Nothing is perfect…
• Limits on individual objects: in R versions before 3.0.0, the maximum number of elements of a vector is 2^31 – 1 (later 64-bit builds support longer vectors);
• R will take all the RAM it can get (Linux only);
• More information, type:
> help('Memory-limits')
> gc()  # garbage collector
> object.size(your_obj)  # size of your object
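These commands can be tried on a throwaway object. In the sketch below the vector length of one million is an arbitrary choice; .Machine$integer.max is shown because it equals the 2^31 – 1 figure quoted above.

```r
# Largest value of R's integer type, the same number as the limit quoted above.
.Machine$integer.max == 2^31 - 1

# One million doubles take about 8 bytes each, plus a small header.
x <- numeric(1e6)
size_bytes <- as.numeric(object.size(x))
size_bytes
print(object.size(x), units = "Mb")

# Report memory in use and trigger a garbage collection.
invisible(gc())
```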
Take home
Source: http://www.nettakeaway.com/tp/R/129/understanding-r
References
• Michael J. Crawley. Statistics: An Introduction using R. Wiley, 2005. ISBN 0-470-02297-3.
– data: http://www.bio.ic.ac.uk/research/mjcraw/therbook/
• Robert H. Shumway and David S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, New York, 2006. ISBN 978-0-387-29317-2
• Basics
– http://cran.r-project.org/doc/contrib/Short-refcard.pdf
– http://cran.r-project.org/doc/contrib/refcard.pdf
– http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
– http://www.manning.com/kabacoff/Kabacoff_MEAPCH1.pdf
• Intermediate
– http://math.acadiau.ca/ACMMaC/Rmpi/basics.html
– User-lists
Cheat sheets
Acknowledgements
http://www.sciviews.org/Tinn-R/