
Budapest Spark Meetup - Basics of Spark coding

Transcript
Page 1: Budapest Spark Meetup  - Basics of Spark coding

Apache Spark
Mate Gulyas

Page 2: Budapest Spark Meetup  - Basics of Spark coding

GULYÁS MÁTÉ
CTO & Co-Founder

@gulyasm

Page 3: Budapest Spark Meetup  - Basics of Spark coding

Getting Started

Page 4: Budapest Spark Meetup  - Basics of Spark coding

UNIFIED STACK

Spark Core
Spark SQL
Spark Streaming
MLlib
GraphX
Cluster Managers

Page 5: Budapest Spark Meetup  - Basics of Spark coding

UNIFIED STACK

Spark Core

RDD API
DataFrame API
Dataset API
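For orientation, a minimal sketch of the first two APIs from Python (the Dataset API is available from Scala and Java only; the data and column names are made up, and sqlContext is assumed to be available as in the pyspark shell):

rdd = sc.parallelize([("spark", 2), ("meetup", 1)])    # RDD API: arbitrary objects, functional operators
df = sqlContext.createDataFrame(rdd, ["word", "n"])    # DataFrame API: named columns on top of the RDD
df.filter(df.n > 1).show()                             # runs through the DataFrame execution engine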

Page 6: Budapest Spark Meetup  - Basics of Spark coding

❏ Scala
❏ Java
❏ Python
❏ R

WHICH LANGUAGE TO SPARK ON?

Page 7: Budapest Spark Meetup  - Basics of Spark coding

SPARK INSTALL

Page 8: Budapest Spark Meetup  - Basics of Spark coding

DRIVER
SPARKCONTEXT

Page 9: Budapest Spark Meetup  - Basics of Spark coding

DRIVER PROGRAM

Your main function. This is what you write.

It launches parallel operations on the cluster. The driver accesses Spark through a SparkContext.

You access the computing cluster via the SparkContext.

Via the SparkContext you can create RDDs.
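A minimal PySpark sketch of the above (the application name and file path are made up for illustration):

from pyspark import SparkConf, SparkContext

# The driver program starts here: build a configuration and a SparkContext.
conf = SparkConf().setAppName("BasicsDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Via the SparkContext we create RDDs, either from a local collection...
numbers = sc.parallelize([1, 2, 3, 4, 5])

# ...or from an external source such as a text file.
lines = sc.textFile("twit1.txt")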

Page 10: Budapest Spark Meetup  - Basics of Spark coding

❏ INTERACTIVE

❏ STANDALONE

A “SPARK PROGRAM”
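A hedged sketch of the two modes (the script name wordcount.py is made up): in the interactive pyspark shell a SparkContext is already bound to sc, while a standalone program builds its own and is launched with spark-submit.

# wordcount.py -- a standalone "Spark program", run with: spark-submit wordcount.py
from pyspark import SparkConf, SparkContext

if __name__ == "__main__":
    sc = SparkContext(conf=SparkConf().setAppName("wordcount"))
    print(sc.textFile("twit1.txt").count())
    sc.stop()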

Page 11: Budapest Spark Meetup  - Basics of Spark coding

Resilient Distributed Dataset (RDD)

THE MAIN ATTRACTION

Page 12: Budapest Spark Meetup  - Basics of Spark coding

RDD

Page 13: Budapest Spark Meetup  - Basics of Spark coding

❏ TRANSFORMATION

❏ ACTION

OPERATIONS ON RDD

Page 14: Budapest Spark Meetup  - Basics of Spark coding

CREATES ANOTHER RDD

TRANSFORMATION

Page 15: Budapest Spark Meetup  - Basics of Spark coding

CALCULATES A VALUE AND RETURNS IT TO THE DRIVER PROGRAM

ACTION

Page 16: Budapest Spark Meetup  - Basics of Spark coding

LAZY EVALUATION
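A minimal sketch of how the three concepts fit together (file name as in the later examples); nothing actually executes until the last line:

lines = sc.textFile("twit1.txt")               # nothing is read yet
nonempty = lines.filter(lambda x: len(x) > 0)  # transformation: creates another RDD, still lazy
count = nonempty.count()                       # action: the work runs now, value returned to the driver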

Page 17: Budapest Spark Meetup  - Basics of Spark coding

INTERACTIVE

Page 18: Budapest Spark Meetup  - Basics of Spark coding

❏ The code: github.com/gulyasm/bigdata

❏ Apache Spark site: spark.apache.org

❏ User mailing list

❏ Spark books

MATERIALS

Page 19: Budapest Spark Meetup  - Basics of Spark coding

MATE GULYAS
[email protected]

@gulyasm
@enbritely

THANK YOU!

Page 20: Budapest Spark Meetup  - Basics of Spark coding

TRANSFORMATIONS
ACTIONS
LAZY EVALUATION

Page 21: Budapest Spark Meetup  - Basics of Spark coding

LIFECYCLE OF A SPARK PROGRAM

1. READ DATA FROM EXTERNAL SOURCE

2. CREATE LAZILY EVALUATED TRANSFORMATIONS

3. CACHE ANY INTERMEDIATE RDD TO REUSE

4. KICK IT OFF BY CALLING SOME ACTION
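The four steps as a small PySpark sketch (file name and thresholds are illustrative):

# 1. Read data from an external source
lines = sc.textFile("twit1.txt")

# 2. Create lazily evaluated transformations
words = lines.flatMap(lambda x: x.split(" ")).filter(lambda x: len(x) > 0)

# 3. Cache an intermediate RDD that will be reused
words.cache()

# 4. Kick it off by calling some action (later actions reuse the cached RDD)
total = words.count()
distinct_words = words.distinct().count()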

Page 22: Budapest Spark Meetup  - Basics of Spark coding

PARTITIONS

Page 23: Budapest Spark Meetup  - Basics of Spark coding

RDD INTERNALS

RDD INTERFACE

➔ set of PARTITIONS

➔ list of DEPENDENCIES on PARENT RDDs

➔ functions to COMPUTE a partition given parents

➔ preferred LOCATIONS (optional)

➔ PARTITIONER for K/V pairs (optional)

Page 24: Budapest Spark Meetup  - Basics of Spark coding

MULTIPLE RDDs

/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]

/** Implemented by subclasses to return the set of partitions in this RDD. */
protected def getPartitions: Array[Partition]

/** Implemented by subclasses to return how this RDD depends on parent RDDs. */
protected def getDependencies: Seq[Dependency[_]] = deps

/** Optionally overridden by subclasses to specify placement preferences. */
protected def getPreferredLocations(split: Partition): Seq[String] = Nil

/** Optionally overridden by subclasses to specify how they are partitioned. */
@transient val partitioner: Option[Partitioner] = None

Page 25: Budapest Spark Meetup  - Basics of Spark coding

INTERNALS

Page 26: Budapest Spark Meetup  - Basics of Spark coding

THE IMPORTANT PART

❏ HOW EXECUTION WORKS

❏ TERMINOLOGY

❏ WHAT SHOULD WE CARE ABOUT

Page 27: Budapest Spark Meetup  - Basics of Spark coding

PIPELINING

❏ Analogous to CPU pipelining
❏ Several steps executed at a time
❏ Recap: computation only kicks off when an action is called, due to lazy evaluation

Page 28: Budapest Spark Meetup  - Basics of Spark coding

PIPELINING

text = sc.textFile("twit1.txt")words = nonempty.flatMap(lambda x: x.split(" "))fwords = words.filter(lambda x: len(x) > 0)ones = fwords.map(lambda x: (x, 1))result = ones.reduceByKey(lambda l,r: r+l)result.collect()

Page 29: Budapest Spark Meetup  - Basics of Spark coding

PIPELINING

text = sc.textFile( )
words = text.flatMap( )
fwords = words.filter( )
ones = fwords.map( )
result = ones.reduceByKey( )
result.collect()

Page 30: Budapest Spark Meetup  - Basics of Spark coding

PIPELINING

sc.textFile( ) .flatMap( ) .filter( ) .map( ) .reduceByKey( )

Page 31: Budapest Spark Meetup  - Basics of Spark coding

PIPELINING

sc.textFile().flatMap().filter().map().reduceByKey()

Page 32: Budapest Spark Meetup  - Basics of Spark coding

[Diagram: RDD lineage: textFile() → text, flatMap() → words, filter() → fwords, map() → ones, reduceByKey() → result]

PIPELINING

Page 33: Budapest Spark Meetup  - Basics of Spark coding

PIPELINING

def runJob[T, U](
    rdd: RDD[T],
    partitions: Seq[Int],
    func: (Iterator[T]) => U
): Array[U]

Page 34: Budapest Spark Meetup  - Basics of Spark coding

[Diagram: the same RDD lineage (text, words, fwords, ones, result) with collect() called on result]

PIPELINING

Page 35: Budapest Spark Meetup  - Basics of Spark coding

JOB

❏ Basically an action

❏ An action creates a job

❏ A whole computation with all dependencies
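For example, every action on the same lineage launches its own job (sketch; recomputation happens unless an intermediate RDD is cached):

counts = sc.textFile("twit1.txt") \
           .flatMap(lambda x: x.split(" ")) \
           .map(lambda x: (x, 1)) \
           .reduceByKey(lambda l, r: l + r)

counts.collect()   # job 1: the whole computation with all its dependencies
counts.count()     # job 2: a second action creates a second job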

Page 36: Budapest Spark Meetup  - Basics of Spark coding

[Diagram: the same lineage with collect(); the whole chain of dependencies makes up one Job]

Page 37: Budapest Spark Meetup  - Basics of Spark coding

STAGE

❏ Unit of execution
❏ Named after the last transformation (the one runJob was called on)

❏ Transformations pipelined together into stages

❏ Stage boundary usually means shuffling
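One way to see the stage boundary is the RDD's lineage string: reduceByKey introduces a shuffle, so everything before it is pipelined into one stage (sketch; the exact output format depends on the Spark version):

result = sc.textFile("twit1.txt") \
           .flatMap(lambda x: x.split(" ")) \
           .filter(lambda x: len(x) > 0) \
           .map(lambda x: (x, 1)) \
           .reduceByKey(lambda l, r: l + r)

# The indentation in the printed lineage marks the shuffle boundary between stages.
print(result.toDebugString())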

Page 38: Budapest Spark Meetup  - Basics of Spark coding

[Diagram: the Job split into Stage 1 (textFile, flatMap, filter, map) and Stage 2 (reduceByKey, collect)]

Page 39: Budapest Spark Meetup  - Basics of Spark coding

STAGE

❏ Unit of execution
❏ Named after the last transformation (the one runJob was called on)

❏ Transformations pipelined together into stages

❏ Stage boundary usually means shuffling

Page 40: Budapest Spark Meetup  - Basics of Spark coding

[Diagram: the same Job with each RDD split into partitions (PT1, PT2); the transformations of Stage 1 run per partition, and the shuffle at the reduceByKey boundary starts Stage 2]

Page 41: Budapest Spark Meetup  - Basics of Spark coding

Repartitioning

text = sc.textFile("twit1.txt")words = nonempty.flatMap(lambda x: x.split(" "))fwords = words.filter(lambda x: len(x) > 1)ones = fwords.map(lambda x: (x, 1))rp = ones.repartition(6)result = rp.reduceByKey(lambda l,r: r+l)result.collect()

Page 42: Budapest Spark Meetup  - Basics of Spark coding

THE PROCESS

RDD Objects → DAG Scheduler → Task Scheduler → Executor

RDD Objects
- Build the DAG of operators, e.g. sc.textFile(...).map(...).groupBy(...).filter(...)

DAG Scheduler
- Splits the DAG into stages of tasks (a TaskSet per stage)
- Submits each stage when it is ready, i.e. when all the tasks it depends on have finished

Task Scheduler
- Launches tasks
- Retries failed tasks

Executor
- Executes tasks on task threads
- Stores and serves blocks via the block manager

Page 43: Budapest Spark Meetup  - Basics of Spark coding

MATE GULYAS
[email protected]

@gulyasm
@enbritely

THANK YOU!

