Post on 12-Apr-2017
Learning Apache Spark, Part 1
Presenter Introduction
• Tim Spann, Senior Solutions Architect, airis.DATA
• ex-Pivotal Senior Field Engineer
• DZone MVB and Zone Leader
• ex-Startup Senior Engineer / Team Lead
http://www.slideshare.net/bunkertor
http://sparkdeveloper.com/
http://www.twitter.com/PaasDev
airis.DATA
airis.DATA is a next generation system integrator that specializes in rapidly deployable machine learning and graph solutions.
Our core competencies involve providing modular, scalable Big Data products that can be tailored to fit use cases across industry verticals.
We offer predictive modeling and machine learning solutions at petabyte scale utilizing the most advanced, best-in-class technologies and frameworks including Spark, H2O, Mahout, and Flink.
Our data pipelining solutions can be deployed in batch, real-time or near-real-time settings to fit your specific business use case.
Agenda
• Overview
• What is MapReduce?
• Hands-On:
  • Installation
  • Spark MapReduce
  • Build with IntelliJ/SBT
  • Deploy Local
Overview
Spark is a fast cluster computing system that supports Java, Scala, Python and R APIs. It allows for multiple workloads using the same system and coding.
One-stop shopping for your big data processing at scale needs.
It works well with existing Hadoop clusters, with AWS, or on its own.
http://spark.apache.org/docs/latest/index.html
What is MapReduce?
TRANSFORMATION
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.
ACTION
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
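The two operations above can be tried without a cluster, because Scala's own collections offer map and reduce methods with the same shape. The sketch below uses plain Scala collections, not Spark; the object name and sample data are made up for illustration. It also shows why the reduce function should be commutative and associative: with subtraction, the grouping of the operands changes the result.

```scala
object MapReduceSketch {
  // map: apply a function to every element, producing a new collection.
  def lengths(words: Seq[String]): Seq[Int] = words.map(_.length)

  // reduce: combine elements pairwise with a two-argument function.
  def total(numbers: Seq[Int]): Int = numbers.reduce(_ + _)

  def main(args: Array[String]): Unit = {
    val words = Seq("spark", "map", "reduce")
    val sizes = lengths(words)        // Seq(5, 3, 6)
    println(total(sizes))             // 14

    // Subtraction is neither commutative nor associative, so the grouping
    // of operands changes the answer. A cluster may group partial results
    // in any order, which is why a function like + is the safe choice.
    val nums = List(1, 2, 3, 4)
    println(nums.reduceLeft(_ - _))   // ((1 - 2) - 3) - 4 = -8
    println(nums.reduceRight(_ - _))  // 1 - (2 - (3 - 4)) = -2
  }
}
```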
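Problem Definition
We have Apache logs from our website. They follow a standard pattern and we want to parse them to gain some insights on usage.
114.200.179.85 - - [24/Feb/2016:00:10:02 -0500] "GET /wp HTTP/1.1" 200 5279 "http://sparkdeveloper.com/" "Mozilla/5.0"
• IP Address: 114.200.179.85
• Client ID: -
• User ID: -
• Date Time Stamp: 24/Feb/2016:00:10:02 -0500
• Request String: GET /wp HTTP/1.1
• HTTP Status Code: 200
• Bytes Sent: 5279
• HTTP Referer: http://sparkdeveloper.com/
• User Agent: Mozilla/5.0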
Map Function
logFile.map(parseLogLine)
LogRecord(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), m.group(8).toInt, m.group(9).toLong, m.group(10), m.group(11))
Our mapping function is parseLogLine, which takes a log string and splits it into the fields of a case class using regular expressions.
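The slides show only the LogRecord constructor call, not the regular expression or the case class definition. The following is one possible sketch of parseLogLine, not the presenter's actual code. The pattern is an assumption chosen so that the group numbers match the slide: groups 6 and 7 are nested inside group 5. The field names other than bytesSent are also assumed, and returning an Option is a choice made for this sketch.

```scala
import scala.util.matching.Regex

// Field names other than bytesSent are illustrative; the slides only
// show the positions of the constructor arguments.
case class LogRecord(ipAddress: String, clientId: String, userId: String,
                     dateTime: String, request: String, status: Int,
                     bytesSent: Long, referer: String, userAgent: String)

object LogParsing {
  // Assumed pattern. Groups 6 and 7 (HTTP method and path) are nested
  // inside group 5 (the whole request string), so the status code and
  // byte count land in groups 8 and 9, as in the slide.
  val LogPattern: Regex =
    """^(\S+) (\S+) (\S+) \[([^\]]+)\] "((\S+) (\S+) \S+)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"$""".r

  // Returns None for lines that do not match, instead of throwing.
  def parseLogLine(line: String): Option[LogRecord] =
    LogPattern.findFirstMatchIn(line).map { m =>
      LogRecord(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5),
        m.group(8).toInt, m.group(9).toLong, m.group(10), m.group(11))
    }

  def main(args: Array[String]): Unit = {
    val sample = "114.200.179.85 - - [24/Feb/2016:00:10:02 -0500] " +
      "\"GET /wp HTTP/1.1\" 200 5279 \"http://sparkdeveloper.com/\" \"Mozilla/5.0\""
    println(parseLogLine(sample))
  }
}
```

Because this version returns an Option, a caller would need flatMap rather than map to drop lines that fail to parse. The version on the slide evidently returns a LogRecord directly.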
val contentSizes = accessLogs.map(log => log.bytesSent)
Our second mapping function maps each record to just its bytes-sent field.
Reduce
contentSizes.reduce(_ + _)
We reduce by summing up all the bytes in the dataset. The result is a final sum of all sizes.
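Put together, the second map and the reduce could run as a small local Spark job along the following lines. This is a sketch under assumptions: it needs spark-core 1.6.x on the classpath, it uses a one-field stand-in for LogRecord, and the sample byte counts other than 5279 are invented. The object name ContentSizeTotal and the helper totalBytes are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ContentSizeTotal {
  // Stand-in for the parsed records; only bytesSent matters here.
  case class LogRecord(bytesSent: Long)

  def totalBytes(sc: SparkContext, logs: Seq[LogRecord]): Long = {
    val accessLogs = sc.parallelize(logs)                    // RDD[LogRecord]
    val contentSizes = accessLogs.map(log => log.bytesSent)  // transformation (lazy)
    contentSizes.reduce(_ + _)                               // action (runs the job)
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM using all available cores.
    val conf = new SparkConf().setAppName("ContentSizeTotal").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Sample sizes; 1024 and 300 are invented for illustration.
      val logs = Seq(LogRecord(5279L), LogRecord(1024L), LogRecord(300L))
      println(totalBytes(sc, logs))
    } finally {
      sc.stop()
    }
  }
}
```

Nothing is computed when map is called. The job runs only when the reduce action asks for a result.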
Spark 1.6.1 Stack
Spark SQL  Spark Streaming  MLlib  GraphX
Spark Core
Standalone YARN Mesos
Hands-On
• Spark MapReduce
• Build with IntelliJ/SBT
• Deploy Local
• Run History Server
spark-1.6.1-bin-hadoop2.6/sbin/start-history-server.sh
Installation
• Install JDK
• Install Scala 2.10
• Install SBT
• Install Maven (Optional)
• Unzip Spark 1.6.1
Environment Variable Value (example)
Unix/Linux/Mac
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin
Windows
set SCALA_HOME=c:\Progra~1\Scala
set PATH=%PATH%;%SCALA_HOME%\bin
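With Scala 2.10 and SBT installed, a build definition for a Spark 1.6.1 project might look like the sketch below. The slides do not show the actual build file, so the project name, the version, and the exact Scala patch release are placeholders.

```scala
// build.sbt (project name and version are placeholders)
name := "learning-spark-part1"

version := "0.1.0"

// Spark 1.6.1's default binary distribution is built against Scala 2.10.
scalaVersion := "2.10.6"

// "provided": compile against Spark, but leave it out of the packaged jar,
// since the Spark runtime supplies these classes when the application runs.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
```

The provided scope suits jars launched through spark-submit. To run directly from sbt or the IDE, the provided qualifier would typically be dropped so that Spark is on the runtime classpath.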
Spark Resources
• https://courses.edx.org/courses/BerkeleyX/CS100.1x/1T2015/info
• http://airisdata.com/scala-spark-resources-setup-learning/
• http://spark.apache.org/docs/latest/monitoring.html
• http://spark.apache.org/docs/latest/submitting-applications.html
Spark Cluster
http://spark.apache.org/docs/latest/cluster-overview.html
Glossary
The following table summarizes terms you’ll see used to refer to cluster concepts:
Term: Meaning
Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however; these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.