
Data Analysis using the R Project for Statistical Computing

Date post: 28-Apr-2018
Data Analysis using the R Project for Statistical Computing. Daniela Ushizima, NERSC Analytics, Lawrence Berkeley National Laboratory
Transcript
Page 1

Data Analysis using the R Project for Statistical Computing

Daniela Ushizima
NERSC Analytics
Lawrence Berkeley National Laboratory

Page 2

Outline

I.  R programming
   –  Why to use R
   –  R in the scientific community
   –  Extensible
   –  Graphics
   –  Profiling

II.  Exploratory data analysis
   –  Regression
   –  Clustering algorithms

III.  Case study
   –  Accelerated laser-wakefield particles

IV.  HPC
   –  State of the art

Page 3

R PROGRAMMING

Packages, data visualization and examples

Page 4

Download: http://www.r-project.org

Recommended tutorial: http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf

R is a language and environment for statistical computing and graphics, a GNU project. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.

Page 5

1. Why use R?

•  Open-source, multiplatform, extensible;

•  Easy for users familiar with S/S+, Matlab, Python or IDL;

•  Active and growing community:
   –  Google, Pfizer, Merck, Bank of America, Boeing, the InterContinental Hotels Group and Shell.

Page 6

2. R in the scientific community

Page 7

2.1. You R with NERSC

•  Get started with R on DaVinci:

> module load R                # shell: load the R environment module
> R                            # shell: start an interactive R session
> help()                       # R: show how to use the help system
> demo()                       # R: list the available demonstrations
> help.start()                 # R: open the HTML documentation
> source('your_function.R')    # R: run the code stored in a file
> library(package_name)        # R: attach an installed package

Page 8

3. Extensible

•  Add-on packages:
   –  Data input/output: hdf5, Rnetcdf, DICOM, etc.
   –  Graphics: trellis, gplot, RGL, fields, etc.
   –  Multivariate analysis: MASS, mclust, ape, etc.
   –  Other languages: Rcpp, Rpy, R.matlab, etc.

Page 9

4. Statistical analysis and graphs

•  Histogram
•  Density
•  Boxplot
•  Multivariate plot
•  Conditioning plot
•  Contour plot
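
A minimal sketch of the first three plot types in the list above. The slide names no data set, so this assumes R's built-in airquality data as a stand-in; the variable names are illustrative only.

```r
# Assumption: the built-in airquality data stands in for the talk's data.
if (!interactive()) pdf(NULL)          # discard plots when run as a script

ozone <- airquality$Ozone[!is.na(airquality$Ozone)]   # drop missing readings

par(mfrow = c(1, 3))                   # three panels side by side
h <- hist(ozone, main = "Histogram", xlab = "ozone")
d <- density(ozone)                    # Gaussian kernel density estimate
plot(d, main = "Density")
b <- boxplot(ozone, main = "Boxplot")  # b$stats holds the five-number summary
```

The objects returned by hist(), density() and boxplot() carry the numbers behind each plot, which is handy when the summary is needed without the picture.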

Page 10

4.1. Multivariate plots

> data = read.table('ozone.data.txt', header = TRUE)
> names(data)
[1] "rad"   "temp"  "wind"  "ozone"
> pairs(data, panel = panel.smooth)   # panel.smooth = locally-weighted polynomial regression

Ex: explanatory variables: solar radiation, temperature and wind; response variable: ozone;

- use of pairs() with data frames to check for dependencies between the variables.
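
The file 'ozone.data.txt' is not part of this transcript, so the following is only a sketch that assumes the built-in airquality data as a substitute, renamed to the column names shown on the slide. It adds the correlation matrix as a numeric counterpart to the scatterplot matrix.

```r
# Assumption: airquality replaces 'ozone.data.txt', which is not available here.
if (!interactive()) pdf(NULL)          # discard plots when run as a script

aq <- na.omit(airquality[, c("Solar.R", "Temp", "Wind", "Ozone")])
names(aq) <- c("rad", "temp", "wind", "ozone")   # names used on the slide

pairs(aq, panel = panel.smooth)   # scatterplot matrix with smoothed trend lines
r <- cor(aq)                      # pairwise linear correlations
print(round(r["ozone", ], 2))     # how ozone relates to each variable
```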

Page 11

4.2. Conditional plots

•  Check the relation between the two explanatory variables wind and temp and the response variable ozone;

> coplot(ozone ~ wind | temp, data = data, panel = panel.smooth)
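
A hedged, self-contained variant of the call above, again assuming the built-in airquality data in place of the talk's file. It also prints the overlapping temperature ranges that define the panels, computed with co.intervals() using coplot()'s default settings.

```r
# Assumption: airquality stands in for the ozone data used in the talk.
if (!interactive()) pdf(NULL)          # discard plots when run as a script

aq <- na.omit(airquality[, c("Ozone", "Wind", "Temp")])

# One ozone-versus-wind panel per temperature range
coplot(Ozone ~ Wind | Temp, data = aq, panel = panel.smooth)

# The conditioning ranges behind the panels (6 ranges, 50% overlap)
temp_ranges <- co.intervals(aq$Temp, number = 6, overlap = 0.5)
print(temp_ranges)
```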

Page 12

4.3. Package RGL for 3D visualization

•  OpenGL
   –  rgl.demo.lsystem()
   –  kernel density estimation

Use VisIt: https://wci.llnl.gov/codes/visit/

Page 13

5. Profiling

# Call f n times; "..." forwards a variable number of arguments to f
several.times <- function(n, f, ...) {
  for (i in 1:n) {
    f(...)
  }
}

# Multiply two s-by-s matrices
matrix.multiplication <- function(s) {
  A <- matrix(1:(s*s), nrow = s, ncol = s)
  B <- matrix(1:(s*s), nrow = s, ncol = s)
  C <- A %*% B
}

# User CPU time of 10000 products for matrix sizes 2 to 10
v <- NULL
for (i in 2:10) {
  v <- append(v, system.time(several.times(10000, matrix.multiplication, i))[1])
}
plot(v, type = 'b', pch = 15, main = "Matrix product computation time")

•  Where does your program spend more time?

Also try packages: profr and proftools

Page 14

EXPLORATORY DATA ANALYSIS

Basics and beyond

Page 15

1. Statistical analysis

•  Statistical modeling: check for variations in the response variable given explanatory variables;
   –  Linear regression

•  Multivariate statistics: look for structure in the data;
   –  Clustering:
      •  Hierarchical
         –  Dendrograms
      •  Partitioning
         –  Kmeans (stats)
         –  Mixture models (mclust)

Page 16

2. Linear regression

•  Ex: Find the equation that best fits the data, given the decay of radioactive emission over a 50-day period

•  Linear regression: variables expected to be linearly related;
•  Maximum likelihood estimates of the parameters = least-squares estimates (for independent, normally distributed errors);
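
The last bullet can be checked numerically. The sketch below uses synthetic data (an assumption, not the talk's data): over a grid of candidate slopes, the slope with the smallest residual sum of squares should also be the slope with the largest Gaussian log-likelihood.

```r
# Assumption: synthetic data with a linear trend and normal errors.
set.seed(1)
x <- seq(0, 50, by = 2)
y <- 3 - 0.08 * x + rnorm(length(x), sd = 0.2)

fit   <- lm(y ~ x)                      # least-squares fit
b0    <- unname(coef(fit)[1])           # fitted intercept, held fixed below
sigma <- sqrt(mean(residuals(fit)^2))   # ML estimate of the error s.d.

slopes <- seq(-0.12, -0.04, by = 0.001) # candidate slopes

# Residual sum of squares and Gaussian log-likelihood for each candidate
rss    <- sapply(slopes, function(b1) sum((y - b0 - b1 * x)^2))
loglik <- sapply(slopes, function(b1)
  sum(dnorm(y, mean = b0 + b1 * x, sd = sigma, log = TRUE)))

best_ls <- slopes[which.min(rss)]       # least-squares choice on the grid
best_ml <- slopes[which.max(loglik)]    # maximum-likelihood choice on the grid
print(c(least_squares = best_ls, max_likelihood = best_ml,
        lm_slope = unname(coef(fit)[2])))
```

For a fixed error standard deviation the log-likelihood is a constant minus the residual sum of squares divided by 2*sigma^2, which is why the two criteria select the same slope.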

Page 17

2.1. Linear regression

data = read.table('sapdecay.txt', header = TRUE)
attach(data)

par(mfrow = c(1, 3))
plot(x, y, main = 'Decay of radioactive emission over a 50-day period', xlab = 'days')

# log(y) gives a rough idea of the decay constant, a, for these data
# by linear regression of log(y) against x
mylm = lm(log(y) ~ x)
print(mylm$coefficients)

# sum of squares of the differences between the observed (yv) and
# predicted (yp) values of y, given a specific value of the parameter a
sumsq <- function(a, xv = x, yv = y) {
  yp = exp(-a * xv)        # predicted model for y
  sum((yv - yp)^2)
}

a = seq(0.01, 0.2, .005)
sq = sapply(a, sumsq)
plot(a, sq, type = 'l', xlab = 'decay constant', ylab = 'sum of squares of (observed - predicted)')

decayK = a[which.min(sq)]  # this is the least-squares estimate for the decay constant
points(decayK, min(sq), pch = 19, col = 'red')

plot(x, y)
days = seq(0, 50, 0.1)
lines(days, exp(-decayK * days), col = 'blue')
detach(data)

Page 18

3. Cluster analysis

•  Hierarchical
   –  dendrogram (stats)
•  Partitioning
   –  kmeans (stats)
•  Mixture models:
   –  Mclust (mclust)

Iris dataset: 150 samples of Iris flowers, each described by its petal and sepal length and width

Page 19

3.1. Hierarchical clustering

•  Analysis of a set of dissimilarities, combined with an agglomeration method:

•  Dissimilarities: Euclidean, Manhattan, …

•  Methods:
   –  ward, single, complete, average, mcquitty, median or centroid.
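
A short sketch of the two ingredients above on the Iris data, using dist() for the dissimilarities and hclust() for the agglomeration. The choice of Euclidean distance, average linkage and three clusters is an assumption made for illustration. In R 3.1.0 and later the method called "ward" on the slide is spelled "ward.D", with "ward.D2" as a further variant.

```r
x  <- iris[, 1:4]                       # the four numeric measurements
d  <- dist(x, method = "euclidean")     # dissimilarities ("manhattan" also works)
hc <- hclust(d, method = "average")     # agglomeration method

if (!interactive()) pdf(NULL)           # discard plots when run as a script
plot(hc, labels = FALSE, main = "Iris, average linkage")   # dendrogram

groups <- cutree(hc, k = 3)             # cut the tree into 3 clusters
print(table(groups, iris$Species))      # clusters versus known species
```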

Page 20

3.2. K-means

•  Split n observations into k clusters;
   –  each observation belongs to the cluster with the nearest mean.

Clusters (rows) versus Iris species (columns):

     setosa  versicolor  virginica
1         0          48         14
2         0           2         36
3        50           0          0
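
A table of this kind can be produced along the following lines. This is a sketch rather than the talk's exact call: the seed and the nstart value are assumptions, and because cluster labels are arbitrary the row order may differ from the table above.

```r
set.seed(123)                           # k-means starts from random centers
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)

# Cross-tabulate cluster membership against the known species
tab <- table(cluster = km$cluster, species = iris$Species)
print(tab)
print(km$centers)                       # the three cluster means
```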

Page 21

3.3. Model-based clustering

•  Mixture models
   –  Each cluster is mathematically represented by a parametric distribution;
   –  The set of k distributions is called a mixture, and the overall model is a finite mixture model;
   –  Each probability distribution gives the probability of an instance being in a given cluster.
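
To make the three points above concrete, here is a bare-bones expectation-maximization loop for a two-component normal mixture in base R. It is a teaching sketch under stated assumptions (the built-in faithful data, hand-picked starting values, a fixed number of iterations) and is not how the mclust package is implemented.

```r
# Assumption: eruption durations from the built-in faithful data, two clusters.
x <- faithful$eruptions

# Hand-picked starting values (assumptions)
pi1   <- 0.5              # mixing weight of component 1
mu    <- c(2, 4.5)        # component means
sigma <- c(0.5, 0.5)      # component standard deviations

for (iter in 1:200) {
  # E-step: probability that each observation belongs to component 1
  d1 <- pi1 * dnorm(x, mu[1], sigma[1])
  d2 <- (1 - pi1) * dnorm(x, mu[2], sigma[2])
  z1 <- d1 / (d1 + d2)

  # M-step: update weight, means and standard deviations
  pi1   <- mean(z1)
  mu    <- c(sum(z1 * x) / sum(z1), sum((1 - z1) * x) / sum(1 - z1))
  sigma <- c(sqrt(sum(z1 * (x - mu[1])^2) / sum(z1)),
             sqrt(sum((1 - z1) * (x - mu[2])^2) / sum(1 - z1)))
}

cluster <- ifelse(z1 > 0.5, 1, 2)       # assign to the more probable component
print(round(c(weight = pi1, mean1 = mu[1], mean2 = mu[2]), 3))
```

The vector z1 is the per-observation membership probability mentioned in the last bullet; a hard clustering is obtained only at the end by thresholding it.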


Page 22

Case study

Accelerated laser-wakefield particles

http://www.lbl.gov/publicinfo/newscenter/features/2008/apr/af-bella.html

Page 23

Knowledge discovery in LWFA science via machine learning

•  PI: C. Geddes (LBNL) in SciDAC COMPASS project, INCITE.

•  Accomplishments:
   –  Described compact electron clouds using minimum enclosing ellipsoids;
   –  Developed algorithms to adapt mixture model clustering to large datasets;

•  Science impact:
   –  Automated detection and analysis of compact electron clouds;
   –  Derived dispersion features of electron clouds;
   –  Algorithms extensible to other science problems;

•  Collaborators:
   –  Tech-X
   –  Math Group, LBNL
   –  UC Davis, University of Kaiserslautern

Page 24

Framework

•  Goal: automate the analysis of electron bunches by detecting compact groups of particles with similar momentum and spatio-temporal coherence.

Page 25

B1. Select relevant particles

•  Beams of interest are characterized by a high density of high-energy particles:

1.  Elimination of low-energy particles (px < 1e10)
    –  Wake oscillation: px <= 1e9
    –  Excludes particles of the background

2.  Calculation of the simulation's average number of particles per time step (µs);

3.  Elimination of time steps with fewer particles than µs;

Representation of particle momentum in one time step: spline interpolation onto a grid for visualization of irregularly spaced input data.

Packages: akima, hdf5, fields
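
The three selection steps can be sketched as follows. Everything about the data layout here is hypothetical: a data frame with one row per particle and columns named step and px, filled with synthetic values, since the talk's simulation files are not available.

```r
# Hypothetical layout: one row per particle, columns 'step' and 'px'.
set.seed(7)
particles <- data.frame(
  step = rep(1:5, each = 200),
  px   = 10^runif(1000, min = 8, max = 11)   # momenta between 1e8 and 1e11
)

# 1. Eliminate low-energy particles
fast <- particles[particles$px >= 1e10, ]

# 2. Average number of remaining particles per time step (mu_s)
counts <- as.vector(table(factor(fast$step, levels = 1:5)))
mu_s   <- mean(counts)

# 3. Drop time steps with fewer particles than mu_s
keep_steps <- (1:5)[counts >= mu_s]
selected   <- fast[fast$step %in% keep_steps, ]

print(counts)
print(keep_steps)
```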

Page 26

B2. Kernel-based estimation

•  Kernel density estimators are less sensitive to the placement of the bin edges than histograms;

•  Goal: retrieve a dense group of particles with similar spatial and momentum characteristics: argmax f(x, y, px), neighborhood: 2 µm

Packages: misc3d, rgl, fields
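
A reduced, two-dimensional sketch of the argmax idea. The slide works in three variables (x, y, px) with the packages listed above; this example instead assumes synthetic (x, px) data and uses kde2d() from the MASS package, so it illustrates the principle rather than the talk's pipeline.

```r
library(MASS)   # provides kde2d(); MASS ships with standard R distributions

# Assumption: synthetic particles, a uniform background plus one dense
# bunch centered near (5, 5).
set.seed(3)
x  <- c(runif(500, 0, 10), rnorm(500, mean = 5, sd = 0.2))
px <- c(runif(500, 0, 10), rnorm(500, mean = 5, sd = 0.2))

dens <- kde2d(x, px, n = 100)           # density estimate on a 100 x 100 grid

# Grid cell with the highest estimated density (the argmax)
peak    <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
x_peak  <- dens$x[peak[1]]              # rows of z follow dens$x
px_peak <- dens$y[peak[2]]              # columns of z follow dens$y
print(c(x = x_peak, px = px_peak))
```

Particles within a chosen radius of the peak would then form the candidate bunch.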

Page 27

B3. Identify beam candidates

•  Detection of compact groups of particles, independent of whether they are a maximum in one of the variables;

Page 28

B4. Cluster using mixture models

•  Model and number of clusters can be selected at run time (mclust);

•  Partition of multidimensional space;

•  Assume that the functional form of the underlying probability density follows a mixture of normal distributions;

Packages: mclust, rgl
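
A hedged sketch of run-time model selection with mclust. The particle data are synthetic and the column names x and px are hypothetical. The call to Mclust() is guarded because mclust is an add-on package that may not be installed.

```r
# Assumption: two synthetic groups of particles in (x, px) space.
set.seed(42)
particles <- rbind(
  cbind(rnorm(150, mean = 0, sd = 0.3), rnorm(150, mean = 0, sd = 0.3)),
  cbind(rnorm(150, mean = 3, sd = 0.5), rnorm(150, mean = 3, sd = 0.5))
)
colnames(particles) <- c("x", "px")

if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  fit <- Mclust(particles, G = 1:5)     # try 1 to 5 normal components
  print(fit$G)                          # number of clusters chosen by BIC
  print(fit$modelName)                  # covariance model chosen by BIC
  print(table(fit$classification))      # cluster sizes
}
```

Mclust() fits each candidate model and number of components and keeps the combination with the best BIC, which is what makes the choice a run-time one.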

Page 29

B5. Evaluation of compactness

•  Bunches of interest move at speed ≈ c, hence are nearly stationary in the moving simulation window;

•  Moving averages smooth out short-term fluctuations and highlight longer-term trends.
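
A small sketch of a centered moving average using stats::filter(). The helper name moving_average, the window width and the synthetic signal are assumptions for illustration.

```r
# Centered moving average over a window of k points (k odd);
# the first and last (k - 1) / 2 values are NA.
moving_average <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

# Assumption: a slow trend plus noise, standing in for a per-time-step measure
set.seed(11)
step     <- 1:100
signal   <- 0.05 * step + rnorm(100, sd = 1)
smoothed <- moving_average(signal, k = 9)

print(moving_average(c(1, 2, 3, 4, 5), k = 3))   # NA 2 3 4 NA
```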

Page 30

High performance computing

Packages, challenges and new businesses

Page 31

1. Improve performance/reusability

•  Good coding: avoid loops, use vectorization;
•  Extend R using compiled code:
   –  packages: Rcpp, inline
•  Recycle your Python codes:
   –  Package: Rpython
•  Parallelism:
   –  Explicit: packages Rmpi, Rpvm, nws
   –  Implicit: packages pnmath, pnmath0 for multithreaded math functions
•  Use out-of-memory processing with
   –  packages bigmemory and ff
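
The first bullet can be illustrated with a small timing comparison. The function names and problem size below are assumptions, and the measured times depend on the machine, so none are quoted.

```r
n <- 1e5
x <- runif(n)

# Explicit loop: one interpreted iteration per element
loop_sum_sq <- function(x) {
  total <- 0
  for (xi in x) total <- total + xi^2
  total
}

# Vectorized: the loop runs inside compiled code
vec_sum_sq <- function(x) sum(x^2)

t_loop <- system.time(for (i in 1:20) loop_sum_sq(x))["elapsed"]
t_vec  <- system.time(for (i in 1:20) vec_sum_sq(x))["elapsed"]
print(c(loop = unname(t_loop), vectorized = unname(t_vec)))
```

Both functions return the same value up to rounding, so the vectorized form is a drop-in replacement.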


Page 32

2. What is going on with HPC in R?

•  Parallelism:
   –  Multicore: multicore, pnmath, …
   –  Computer cluster: snow, Rmpi, rpvm, …
   –  Grid computing: GRIDR, …

•  GPU:
   –  gputools: parallel algorithms using CUDA + CUBLAS

•  Extremely large data:
   –  ff: memory-mapped pages of binary flat files.
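
A minimal multicore sketch. It uses the parallel package, which has been part of base R since version 2.14.0 and builds on the multicore and snow packages listed above. The worker function and the core count are assumptions. Forking is not available on Windows, so the sketch falls back to a single core there.

```r
library(parallel)   # bundled with R since 2.14.0

# Hypothetical unit of work: a short pause stands in for real computation
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

# Forked workers are unavailable on Windows; use one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L

res_par <- unlist(mclapply(1:8, slow_square, mc.cores = cores))
res_seq <- sapply(1:8, slow_square)     # sequential reference
print(res_par)
```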


Page 33

3. Nothing is perfect…

•  Limits on individual objects: in versions of R before 3.0.0, the maximum number of elements of a vector is 2^31 – 1 (64-bit builds of later versions support longer vectors);

•  R will take all the RAM it can get (Linux only);

•  More information, type:

> help('Memory-limits')
> gc()                     # garbage collector
> object.size(your_obj)    # size of your object
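
A short sketch of these commands in use. The vector length is an arbitrary choice for illustration.

```r
x <- numeric(1e6)                       # one million doubles, 8 bytes each

size_bytes <- as.numeric(object.size(x))
print(format(object.size(x), units = "Mb"))   # human-readable size

max_int <- .Machine$integer.max         # 2^31 - 1, the classic vector limit
print(max_int)

rm(x)                                   # drop the reference ...
invisible(gc())                         # ... and let the collector reclaim it
```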


Page 34

Take home

Source: http://www.nettakeaway.com/tp/R/129/understanding-r

Page 35

References

•  Michael J. Crawley. Statistics: An Introduction using R. Wiley, 2005. ISBN 0-470-02297-3.
   –  data: http://www.bio.ic.ac.uk/research/mjcraw/therbook/

•  Robert H. Shumway and David S. Stoffer. Time Series Analysis and Its Applications With R Examples. Springer, New York, 2006. ISBN 978-0-387-29317-2.

•  Basics
   –  http://cran.r-project.org/doc/contrib/Short-refcard.pdf (cheat sheet)
   –  http://cran.r-project.org/doc/contrib/refcard.pdf (cheat sheet)
   –  http://cran.r-project.org/doc/contrib/Paradis-rdebuts_en.pdf
   –  http://www.manning.com/kabacoff/Kabacoff_MEAPCH1.pdf

•  Intermediate
   –  http://math.acadiau.ca/ACMMaC/Rmpi/basics.html
   –  User lists

Page 36

Acknowledgements

http://www.sciviews.org/Tinn-R/

