Anatomy of RDD: Deep dive into Spark RDD abstraction

Page 1

Anatomy of RDD
A deep dive into the RDD data structure

https://github.com/phatak-dev/anatomy-of-rdd

Page 2

● Madhukara Phatak

● Big data consultant and trainer at datamantra.io

● Consults in Hadoop, Spark and Scala

● www.madhukaraphatak.com

Page 3

Agenda

● What is RDD?
● Immutable and Distributed
● Partitions
● Laziness
● Caching
● Extending Spark API

Page 4

What is RDD?

Resilient Distributed Dataset: a big collection of data with the following properties:
- Immutable
- Distributed
- Lazily evaluated
- Type inferred
- Cacheable

Page 5

Immutable and Distributed

Page 6

Partitions

● Logical division of data
● Derived from Hadoop Map/Reduce
● All input, intermediate and output data is represented as partitions
● Partitions are the basic unit of parallelism
● RDD data is just a collection of partitions

Page 7

Partition from Input Data

Data in HDFS:  Chunk 1 | Chunk 2 | Chunk 3
                  (via Input Format)
RDD:           Partition 1 | Partition 2 | Partition 3

Page 8

Partition example
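
The code on this slide was a screenshot that did not survive extraction. A minimal sketch of inspecting partitions, assuming a local master and a hypothetical input file input.txt (the repository linked on Page 1 has the talk's actual examples):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partition-example"))

    // textFile delegates to the Hadoop InputFormat, which decides the splits;
    // the second argument is only a *minimum* number of partitions
    val dataRDD = sc.textFile("input.txt", 4)
    println(s"number of partitions = ${dataRDD.partitions.length}")

    sc.stop()
  }
}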

Page 9

Partition and Immutability

● All partitions are immutable
● Every transformation generates a new partition
● Partition immutability is driven by the underlying storage, like HDFS
● Partition immutability allows for fault recovery

Page 10

Partitions and Distribution

● Partitions derived from HDFS are distributed by default
● Partitions are also location aware
● Location awareness of partitions allows for data locality
● For computed data, caching lets us distribute it in memory as well

Page 11

Accessing partitions

● We can access a whole partition at a time rather than a single row at a time
● The mapPartitions API of RDD allows us to do that
● Accessing a partition at a time lets us do partition-wise operations that cannot be done by accessing a single row

Page 12

Map partition example
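
A sketch of the mapPartitions API described above, assuming a SparkContext named sc is already in scope (as in the spark-shell):

// Process one whole partition per call instead of one row at a time
val nums = sc.parallelize(1 to 100, 4)

val perPartitionSums = nums.mapPartitions { iter =>
  // iter walks a single partition; partition-wise setup such as
  // opening one connection per partition would go here
  Iterator(iter.sum)
}

perPartitionSums.collect().foreach(println) // one sum per partition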

Page 13

Partition for transformed Data

● Partitioning will be different for key/value pairs that are generated by a shuffle operation
● Partitioning is driven by the specified partitioner
● By default, HashPartitioner is used
● You can also use your own partitioner

Page 14

Hash Partitioning

Input RDD:   Partition 1 | Partition 2 | Partition 3

                hash(key) % numPartitions

Output RDD:  Partition 1 | Partition 2

Page 15

Hash partition example
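
A sketch of hash partitioning a pair RDD, assuming sc is in scope; partitionBy and HashPartitioner are standard Spark APIs:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("c", 4)))

// Redistributes by hash(key) % numPartitions, as in the diagram above
val hashed = pairs.partitionBy(new HashPartitioner(2))

println(hashed.partitioner)       // Some(HashPartitioner@...)
println(hashed.partitions.length) // 2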

Page 16

Custom Partitioner

● Partition the data according to your data structure
● Custom partitioning allows control over the number of partitions and the distribution of data when grouping or reducing is done

Page 17

Custom partition example
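
A sketch of a custom partitioner; FirstLetterPartitioner is a hypothetical example, not from the original deck:

import org.apache.spark.Partitioner

// Keys that share a first letter land in the same partition
class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val first = key.toString.headOption.getOrElse('?')
    math.abs(first.hashCode) % numPartitions
  }
}

val byLetter = sc.parallelize(Seq(("apple", 1), ("avocado", 2), ("banana", 3)))
  .partitionBy(new FirstLetterPartitioner(4))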

Page 18

Look up operation

● Partitioning allows faster lookups
● The lookup operation lets you look up the values for a given key
● Using the partitioner, lookup determines which partition to look in
● Then it only needs to search that one partition
● If no partitioner is specified, it falls back to a filter

Page 19

Lookup example
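
A sketch of lookup on a partitioned pair RDD, assuming sc is in scope:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(2))

// With a partitioner set, lookup hashes "a" to a single partition and
// scans only that partition; without one it falls back to a full filter
val values = pairs.lookup("a") // Seq(1, 3)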

Page 20

Laziness

Page 21

Parent (Dependency)

● Each RDD has access to its parent RDD
● Nil is the parent of the first RDD
● Before computing its value, an RDD always computes its parent
● This chain of computation allows for laziness

Page 22

Subclassing

● Each Spark operator creates an instance of a specific subclass of RDD
● The map operator results in a MappedRDD, flatMap in a FlatMappedRDD, etc.
● The subclass allows the RDD to remember the operation performed in the transformation

Page 23

RDD transformations

val dataRDD = sc.textFile(args(1))
val splitRDD = dataRDD.flatMap(value => value.split(" "))

Lineage chain: splitRDD (FlatMappedRDD) → dataRDD (MappedRDD) → HadoopRDD → Nil

Page 24

Laziness example
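
A minimal sketch of laziness, assuming sc is in scope and a hypothetical input.txt:

val dataRDD = sc.textFile("input.txt")        // nothing is read yet
val splitRDD = dataRDD.flatMap(_.split(" "))  // still nothing: only lineage is recorded

// toDebugString prints the parent chain discussed on the previous slides
println(splitRDD.toDebugString)

// Only an action walks the chain and actually evaluates the partitions
println(splitRDD.count())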

Page 25

Compute

● compute is the function that evaluates each partition of an RDD
● compute is an abstract method of RDD
● Each subclass of RDD, like MappedRDD and FilteredRDD, has to override this method

Page 26

RDD actions

val dataRDD = sc.textFile(args(1))
val flatMapRDD = dataRDD.flatMap(value => value.split(" "))
flatMapRDD.collect()

collect() calls runJob, which triggers compute on each RDD down the chain:
runJob → FlatMappedRDD.compute → MappedRDD.compute → HadoopRDD.compute → Nil

Page 27

runJob API

● The runJob API of RDD is the API used to implement actions
● runJob hands you each partition and lets you evaluate it
● All Spark actions internally use the runJob API

Page 28

Run job example
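
A sketch of using runJob directly, here as a hand-rolled count; sc.runJob is the standard SparkContext API:

val rdd = sc.parallelize(1 to 100, 4)

// The supplied function is applied to each partition's iterator,
// yielding one result per partition
val perPartition: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size)
println(perPartition.sum) // 100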

Page 29

Caching

● cache internally uses the persist API
● persist sets a specific storage level for a given RDD
● The Spark context tracks persistent RDDs
● When first evaluated, a partition is put into memory by the block manager

Page 30

Block manager

● Handles all in-memory data in Spark
● Responsible for:
○ Cached data (BlockRDD)
○ Shuffle data
○ Broadcast data
● A partition is stored in a block with id (RDD.id, partition_index)

Page 31

How caching works

● The partition iterator checks the storage level
● If a storage level is set, it calls cacheManager.getOrCompute(partition)
● Since the iterator runs for every RDD evaluation, caching is transparent to the user

Page 32

Caching example
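
A minimal caching sketch, assuming sc is in scope and a hypothetical input.txt:

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("input.txt").flatMap(_.split(" "))

words.persist(StorageLevel.MEMORY_ONLY) // cache() is shorthand for exactly this

println(words.count()) // first action: computes the partitions and caches them
println(words.count()) // second action: partitions are served by the block manager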

Page 33

Extending Spark API

Page 34

Why?

● Domain specific operators
○ Allow developers to express domain specific calculations in a cleaner way
○ Improve code readability
○ Easy to maintain
● Domain specific RDDs
○ A better way of expressing domain data
○ Control over partitioning and distribution

Page 35

DSL Example

● salesRecordRDD: RDD[SalesRecord]
● To compute the total sales:
○ In plain Spark: salesRecord.map(_.itemValue).sum
○ In our DSL: salesRecord.totalSales
● Our DSL hides the internal representation and improves readability

Page 36

How to Extend

● Custom operators on RDDs
○ Domain specific operators on specific RDDs
○ Uses the Scala implicit mechanism
○ Feels and works like a built-in operator
● Custom RDDs
○ Extend the RDD API to create our own RDD
○ Combined with custom operators, this is very powerful

Page 37

Implicits in Scala

● A way of extending types on the fly
● Implicits are also used to pass parameters to functions that are read from the environment
● In our example, we just use the type extension facility
● All implicits are compile time checked

Page 38

Implicits example
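
A plain-Scala sketch of the type extension facility mentioned above (no Spark involved):

object ImplicitsExample {
  // Extends Int "on the fly": the compiler rewrites 3.squared
  // into new RichInt(3).squared at compile time
  implicit class RichInt(val n: Int) extends AnyVal {
    def squared: Int = n * n
  }

  def main(args: Array[String]): Unit = {
    println(3.squared) // 9
  }
}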

Page 39

Adding operators to RDDs

● We use the Scala implicit facility to add custom operators to our RDDs
● These operators only show up on our RDDs
● All implicit conversions are handled by Scala, not by Spark
● Spark internally uses similar tricks for PairRDDs

Page 40

Custom operator example
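
A sketch of the totalSales operator from the DSL slide; the SalesRecord fields beyond itemValue are assumptions, and the snippet assumes sc is in scope (as in the spark-shell):

import org.apache.spark.rdd.RDD

case class SalesRecord(id: String, itemValue: Double)

object SalesDsl {
  // Implicit conversion: totalSales shows up only on RDD[SalesRecord]
  implicit class SalesRecordRDDFunctions(rdd: RDD[SalesRecord]) {
    def totalSales: Double = rdd.map(_.itemValue).sum()
  }
}

import SalesDsl._
val salesRecordRDD: RDD[SalesRecord] =
  sc.parallelize(Seq(SalesRecord("a", 10.0), SalesRecord("b", 5.5)))
println(salesRecordRDD.totalSales) // 15.5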

Page 41

Extending RDD

● Extending the RDD API allows us to create our own custom RDD structure
● Custom RDDs allow control over computation
● You can change partitioning, locality and evaluation depending on your requirements

Page 42

Discount RDD example
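
A sketch of what a DiscountRDD could look like, reusing the SalesRecord case class from the previous sketch; the discountPercentage parameter is an assumption, and this is not necessarily the code from the linked repository:

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class DiscountRDD(prev: RDD[SalesRecord], discountPercentage: Double)
  extends RDD[SalesRecord](prev) {

  // Reuse the parent's partitioning unchanged
  override def getPartitions: Array[Partition] = firstParent[SalesRecord].partitions

  // compute() is where one partition is actually evaluated
  override def compute(split: Partition, context: TaskContext): Iterator[SalesRecord] =
    firstParent[SalesRecord].iterator(split, context).map { record =>
      record.copy(itemValue = record.itemValue * (1 - discountPercentage))
    }
}

val discounted = new DiscountRDD(salesRecordRDD, 0.10)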

