Overview of Spark

Date post: 14-Apr-2018
Upload: romanzotti
Transcript
  • 7/29/2019 Overview of Spark

    Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

    Spark
    Fast, Interactive, Language-Integrated Cluster Computing

    UC BERKELEY
    www.spark-project.org

    Project Goals

    Extend the MapReduce model to better support two common classes of analytics apps:
      Iterative algorithms (machine learning, graphs)
      Interactive data mining

    Enhance programmability:
      Integrate into Scala programming language
      Allow interactive use from Scala interpreter

    Motivation

    Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

    [Figure: Input -> Map, Map, Map -> Reduce, Reduce -> Output]

    Motivation

    Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures

    Motivation

    Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
      Iterative algorithms (machine learning, graphs)
      Interactive data mining tools (R, Excel, Python)

    With current frameworks, apps reload data from stable storage on each query

    Solution: Resilient Distributed Datasets (RDDs)

    Allow apps to keep working sets in memory for efficient reuse

    Retain the attractive properties of MapReduce:
      Fault tolerance, data locality, scalability

    Support a wide range of applications

    Outline

    Spark programming model
    Implementation
    Demo
    User applications

    Programming Model

    Resilient distributed datasets (RDDs)
      Immutable, partitioned collections of objects
      Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
      Can be cached for efficient reuse

    Actions on RDDs
      Count, reduce, collect, save, ...
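
The split between transformations and actions can be illustrated without a cluster. The following is a toy, single-machine sketch in plain Scala, not Spark's RDD class: `ToyRDD`, `fromSeq`, and every method on it are hypothetical names chosen for illustration. Transformations only record a function, `cache()` only sets a flag, and nothing is computed until an action is called.

```scala
// Toy single-machine model of the RDD idea (hypothetical; not Spark's API).
// A ToyRDD wraps a function that produces its elements on demand.
final class ToyRDD[A](compute: () => Iterator[A]) {
  private var cached: Option[Vector[A]] = None
  private var cacheRequested = false

  // Elements come from the cache if present; otherwise they are recomputed,
  // and stored if caching was requested.
  private def iterator: Iterator[A] = cached match {
    case Some(data) => data.iterator
    case None =>
      if (cacheRequested) {
        val data = compute().toVector
        cached = Some(data)
        data.iterator
      } else compute()
  }

  // Transformations: lazily define a new ToyRDD.
  def map[B](f: A => B): ToyRDD[B] = new ToyRDD(() => iterator.map(f))
  def filter(p: A => Boolean): ToyRDD[A] = new ToyRDD(() => iterator.filter(p))

  // Mark this dataset to be kept in memory after its first computation.
  def cache(): ToyRDD[A] = { cacheRequested = true; this }

  // Actions: force evaluation and return a result to the caller.
  def count: Long = iterator.size.toLong
  def collect: Vector[A] = iterator.toVector
  def reduce(op: (A, A) => A): A = iterator.reduce(op)
}

object ToyRDD {
  // Stands in for loading a dataset from stable storage.
  def fromSeq[A](data: Seq[A]): ToyRDD[A] = new ToyRDD(() => data.iterator)
}
```

In this sketch, `ToyRDD.fromSeq(Seq(1, 2, 3, 4)).map(_ * 2).cache()` does no work by itself. The first action on it, or on a dataset derived from it, computes and stores the elements, and later actions reuse the stored copy.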

    Example: Log Mining

    Load error messages from a log into memory, then interactively search for various patterns

        val lines = spark.textFile("hdfs://...")            // base RDD
        val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
        val messages = errors.map(_.split("\t")(2))
        val cachedMsgs = messages.cache()

        cachedMsgs.filter(_.contains("foo")).count          // action
        cachedMsgs.filter(_.contains("bar")).count
        . . .

    [Figure: the Driver sends tasks to three Workers and receives results; each Worker has a Block (1-3) and a Cache (1-3)]

    Result: full-text search of Wikipedia in

    RDD Fault Tolerance

    RDDs maintain lineage information that can be used to reconstruct lost partitions

    Ex:

        messages = textFile(...).filter(_.startsWith("ERROR")).map(_.split("\t")(2))

    [Figure: HDFS File -> filter (func = _.contains(...)) -> Filtered RDD -> map (func = _.split(...)) -> Mapped RDD]
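
The idea of rebuilding a lost partition from lineage can be sketched in plain Scala. This is a toy model under stated assumptions, not Spark's internals: the types `Lineage`, `StableSource`, `Filtered`, `Mapped`, and `CachedDataset` are hypothetical names, and "stable storage" is simulated by an in-memory vector of partitions. Each dataset records its parent and the function that derives it, so one missing partition can be recomputed without touching the others.

```scala
// Toy lineage graph (hypothetical names; a sketch, not Spark's internals).
// Each dataset knows its parent and the function that derives it,
// so any single partition can be rebuilt from stable storage on demand.
sealed trait Lineage[A] {
  def numPartitions: Int
  def computePartition(i: Int): Vector[A]
}

// Stands in for a file in stable storage, already split into partitions.
final case class StableSource[A](partitions: Vector[Vector[A]]) extends Lineage[A] {
  def numPartitions: Int = partitions.length
  def computePartition(i: Int): Vector[A] = partitions(i)
}

final case class Filtered[A](parent: Lineage[A], p: A => Boolean) extends Lineage[A] {
  def numPartitions: Int = parent.numPartitions
  def computePartition(i: Int): Vector[A] = parent.computePartition(i).filter(p)
}

final case class Mapped[A, B](parent: Lineage[A], f: A => B) extends Lineage[B] {
  def numPartitions: Int = parent.numPartitions
  def computePartition(i: Int): Vector[B] = parent.computePartition(i).map(f)
}

// An in-memory copy of a dataset whose partitions may be lost.
final class CachedDataset[A](lineage: Lineage[A]) {
  private val store = scala.collection.mutable.Map.empty[Int, Vector[A]]

  def loadAll(): Unit =
    (0 until lineage.numPartitions).foreach(i => store(i) = lineage.computePartition(i))

  // Simulates a worker failure that drops one cached partition.
  def losePartition(i: Int): Unit = store.remove(i)

  def isCached(i: Int): Boolean = store.contains(i)

  // Returns the partition, rebuilding only that partition from lineage if missing.
  def partition(i: Int): Vector[A] =
    store.getOrElseUpdate(i, lineage.computePartition(i))
}
```

Recovery in this sketch touches only the parent partition with the same index, which is why the filter and map steps can be replayed cheaply. Operations that combine data across partitions would need more of the parent data to rebuild a single partition.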

    Example: Logistic Regression

    Goal: find best line separating two sets of points

    [Figure: two sets of points, with a random initial line and the target line]

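    Example: Logistic Regression

        val data = spark.textFile(...).map(readPoint).cache()

        var w = Vector.random(D)

        for (i <- 1 to ITERATIONS) {   // ITERATIONS: number of gradient steps (name assumed)
          val gradient = data.map(p =>
            (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
          ).reduce(_ + _)
          w -= gradient
        }

        println("Final w: " + w)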
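
The slide code depends on Spark and on helpers (`readPoint`, `Vector`, `D`) whose definitions are not shown. As a rough single-machine sketch of the same update rule, the version below uses plain arrays. The names `LogisticRegressionSketch`, `Point`, `dot`, `gradient`, and `train` are assumptions made for illustration, labels are assumed to be -1.0 or +1.0, and the step size is 1 because the slide subtracts the raw gradient.

```scala
import scala.math.exp
import scala.util.Random

object LogisticRegressionSketch {
  // A labeled point: features x and label y in {-1.0, +1.0} (assumed encoding).
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Same per-point term as the slide: (1 / (1 + exp(-y * (w . x))) - 1) * y * x,
  // summed over all points.
  def gradient(data: Seq[Point], w: Array[Double]): Array[Double] =
    data
      .map { p =>
        val scale = (1.0 / (1.0 + exp(-p.y * dot(w, p.x))) - 1.0) * p.y
        p.x.map(_ * scale)
      }
      .reduce((a, b) => a.indices.map(i => a(i) + b(i)).toArray)

  // Plain gradient descent from a random start, with a step size of 1.
  def train(data: Seq[Point], dims: Int, iterations: Int, seed: Long): Array[Double] = {
    val rng = new Random(seed)
    var w = Array.fill(dims)(rng.nextDouble())
    for (_ <- 1 to iterations) {
      val g = gradient(data, w)
      w = w.indices.map(i => w(i) - g(i)).toArray
    }
    w
  }
}
```

The `map` followed by `reduce` mirrors the structure on the slide: each point contributes one vector, and the vectors are summed. In Spark the `map` would run in parallel over the cached dataset, and only the summed gradient would return to the driver.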

    Logistic Regression Performance

    [Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]

    Hadoop: 127 s / iteration
    Spark: first iteration 174 s, further iterations 6 s
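
To see what these per-iteration figures imply for a whole run, the arithmetic can be written out. This is a small sketch that assumes every Hadoop iteration costs exactly 127 s and every Spark iteration after the first costs exactly 6 s. The object and method names are hypothetical.

```scala
object IterationCost {
  // Per-iteration timings reported on the slide, in seconds.
  val hadoopPerIteration = 127
  val sparkFirstIteration = 174
  val sparkLaterIteration = 6

  def hadoopTotal(iterations: Int): Int = hadoopPerIteration * iterations

  // The first iteration loads and caches the data; later ones reuse the cache.
  def sparkTotal(iterations: Int): Int =
    if (iterations <= 0) 0
    else sparkFirstIteration + sparkLaterIteration * (iterations - 1)
}
```

Under those assumptions, 30 iterations take 127 × 30 = 3810 s on Hadoop and 174 + 6 × 29 = 348 s on Spark, roughly an 11x difference. For a single iteration Spark is slower (174 s vs. 127 s), so the benefit comes entirely from reusing the cached data.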

    Spark Applications

    In-memory data mining on Hive data (Conviva)
    Predictive analytics (Quantifind)
    City traffic prediction (Mobile Millennium)
    Twitter spam classification (Monarch)
    Collaborative filtering via matrix factorization

    Conviva GeoReport

    Aggregations on many keys w/ same WHERE clause

    40x gain comes from:
      Not re-reading unused columns or filtered records
      Avoiding repeated decompression
      In-memory storage of deserialized objects

    [Chart: time (hours)]
      Spark: 0.5
      Hive: 20

    Frameworks Built on Spark

    Pregel on Spark (Bagel)
      Google message passing model for graph computation
      200 lines of code

    Hive on Spark (Shark)
      3000 lines of code
      Compatible with Apache Hive
      ML operators in Scala

    Implementation

    Runs on Apache Mesos to share resources with Hadoop & other apps

    Can read from any Hadoop input source (e.g. HDFS)

    No changes to Scala compiler

    [Figure: Spark, Hadoop, and MPI running on Mesos across four nodes]

    Spark Scheduler

    Dryad-like DAGs
    Pipelines functions within a stage
    Cache-aware work reuse & locality
    Partitioning-aware to avoid shuffles

    [Figure: DAG over RDDs A-G using groupBy, map, union, and join, divided into Stage 1, Stage 2, and Stage 3; the legend marks cached data partitions]
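
Pipelining functions within a stage can be pictured with ordinary Scala iterators. The sketch below is an analogy on a single in-memory partition, not the Spark scheduler; `PipeliningSketch` and `runPipelined` are hypothetical names. It records the order in which the two functions are called to show that each element passes through both before the next element is read.

```scala
object PipeliningSketch {
  // Runs map-then-filter over one partition and records the order of calls.
  // Iterators fuse the two steps: each element passes through both functions
  // before the next element is read, so no intermediate collection is built.
  def runPipelined(partition: Seq[Int]): (Vector[Int], Vector[String]) = {
    val trace = Vector.newBuilder[String]
    val result = partition.iterator
      .map { x => trace += s"map($x)"; x * 10 }
      .filter { y => trace += s"filter($y)"; y > 10 }
      .toVector
    (result, trace.result())
  }
}
```

For the input `Seq(1, 2, 3)` the recorded calls alternate between the map and filter steps rather than running all maps first. That interleaving is what lets a chain of element-wise operations run as one pass over a partition.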

    Interactive Spark

    Modified Scala interpreter to allow Spark to be used interactively from the command line

    Required two changes:
      Modified wrapper code generation so that each line typed has references to objects for its dependencies
      Distribute generated classes over the network

    Demo

    Conclusion

    Spark provides a simple, efficient, and powerful programming model for a wide range of apps

    Download our open source release:

    www.spark-project.org

    [email protected]

    Related Work

    DryadLINQ, FlumeJava
      Similar distributed collection API, but cannot reuse datasets efficiently across queries

    Relational databases
      Lineage/provenance, logical logging, materialized views

    GraphLab, Piccolo, BigTable, RAMCloud
      Fine-grained writes similar to distributed shared memory

    Iterative MapReduce (e.g. Twister, HaLoop)
      Implicit data sharing for a fixed computation pattern

    Caching systems (e.g. Nectar)
      Store data in files, no explicit control over what is cached

    Behavior with Not Enough RAM

    [Chart: iteration time (s) vs. % of working set in memory]

    % of working set in memory    Iteration time (s)
    Cache disabled                68.8
    25%                           58.1
    50%                           40.7
    75%                           29.7
    Fully cached                  11.5

    Fault Recovery Results

    [Chart: iteration time (s) per iteration, with series "No Failure" and "Failure in the 6th Iteration"]

    Iteration             1    2   3   4   5   6   7   8   9   10
    Iteration time (s)    119  57  56  58  58  81  57  59  57  59

    Spark Operations

    Transformations (define a new RDD):
      map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

    Actions (return a result to driver program):
      collect, reduce, count, save, lookupKey
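
Several of the listed transformations operate on key-value pairs. As a rough illustration of what three of them compute, the sketch below defines local stand-ins over ordinary Scala sequences. These are hypothetical helpers with assumed signatures, not Spark's distributed implementations, and they say nothing about how Spark partitions or shuffles the data.

```scala
object KeyedOpsSketch {
  // Local stand-in for groupByKey: collect all values for each key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Local stand-in for reduceByKey: combine the values for each key with op.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(op: (V, V) => V): Map[K, V] =
    groupByKey(pairs).map { case (k, vs) => k -> vs.reduce(op) }

  // Local stand-in for mapValues: transform values, keep keys unchanged.
  def mapValues[K, V, W](pairs: Seq[(K, V)])(f: V => W): Seq[(K, W)] =
    pairs.map { case (k, v) => (k, f(v)) }
}
```

For example, applying `reduceByKey` with `_ + _` to the pairs `("a", 1), ("b", 2), ("a", 3)` sums the values for each key, while `groupByKey` on the same input keeps the individual values grouped under each key.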

