Page 1

TRANSFORMATIONS AND ACTIONS
A Visual Guide of the API
http://training.databricks.com/visualapi.pdf

Page 2

Databricks would like to give special thanks to Jeff Thompson for contributing 67 visual diagrams depicting the Spark API under the MIT license to the Spark community.

Jeff’s original, creative work can be found here, and you can read more about Jeff’s project in his blog post.

After talking to Jeff, Databricks commissioned Adam Breindel to further evolve Jeff’s work into the diagrams you see in this deck.

LinkedIn
Blog: data-frack

Page 3

making big data simple

Databricks Cloud: “A unified platform for building Big Data pipelines – from ETL to Exploration and Dashboards, to Advanced Analytics and Data Products.”

• Founded in late 2013 by the creators of Apache Spark
• Original team from UC Berkeley AMPLab
• Raised $47 Million in 2 rounds
• ~55 employees
• We’re hiring! (http://databricks.workable.com)
• Level 2/3 support partnerships with Hortonworks, MapR, and DataStax

Page 4

Legend

[diagram: key to the visuals. RDD elements: original item, transformed type, object on driver; RDD partition(s) A and B; user functions, user input, input, and emitted value]

Page 5

Legend (continued)

[diagram: operation color coding: randomized operation, set theory / relational operation, numeric calculation]

Page 6

Operations = TRANSFORMATIONS + ACTIONS
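The split matters because transformations are lazy: they only record lineage, and nothing executes until an action is called. A minimal PySpark sketch of this behavior (the values are illustrative, not from the deck):

x = sc.parallelize([1, 2, 3])
y = x.map(lambda n: n * 10)  # transformation: only builds lineage, no job runs yet
print(y.count())             # action: triggers the actual computation and returns 3 to the driver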

Page 7

Essential Core & Intermediate Spark Operations
(the original slide color-codes each group as = easy or = medium)

TRANSFORMATIONS
• General: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy
• Math / Statistical: sample, randomSplit
• Set Theory / Relational: union, intersection, subtract, distinct, cartesian, zip
• Data Structure / I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe

ACTIONS
• General: reduce, collect, aggregate, fold, first, take, forEach, top, treeAggregate, treeReduce, forEachPartition, collectAsMap
• Math / Statistical: count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct
• Set Theory / Relational: takeOrdered
• Data Structure / I/O: saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile
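Most of these operations are illustrated one by one on the following pages. For two transformations that are only listed here, a brief hedged PySpark sketch (sample values are illustrative):

x = sc.parallelize([3, 1, 2])
print(x.sortBy(lambda n: n).collect())      # [1, 2, 3]
y = sc.parallelize(['a', 'b'])
print(x.cartesian(y).collect())             # all (number, letter) pairs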

Page 8

Essential Core & Intermediate PairRDD Operations
(the original slide color-codes each group as = easy or = medium)

TRANSFORMATIONS
• General: flatMapValues, groupByKey, reduceByKey, reduceByKeyLocally, foldByKey, aggregateByKey, sortByKey, combineByKey
• Math / Statistical: sampleByKey
• Set Theory / Relational: cogroup (= groupWith), join, subtractByKey, fullOuterJoin, leftOuterJoin, rightOuterJoin
• Data Structure: partitionBy, keys, values

ACTIONS
• Math / Statistical: countByKey, countByValue, countByValueApprox, countApproxDistinctByKey, countByKeyApprox, sampleByKeyExact
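Several of these appear later in the deck (groupByKey, partitionBy, countByKey); for one that does not, here is a hedged PySpark sketch of aggregateByKey computing a per-key average (sample data is illustrative):

pairs = sc.parallelize([('A', 1), ('A', 3), ('B', 4)])
sums = pairs.aggregateByKey(
    (0, 0),                                   # zero value: (running sum, running count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),  # seqOp: fold one value into the per-partition accumulator
    lambda a, b: (a[0] + b[0], a[1] + b[1]))  # combOp: merge accumulators across partitions
print(sums.mapValues(lambda t: t[0] / t[1]).collect())  # per-key averages: ('A', 2.0), ('B', 4.0); ordering may vary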

Page 9

narrow vs. wide

narrow: each partition of the parent RDD is used by at most one partition of the child RDD

wide: multiple child RDD partitions may depend on a single parent RDD partition

Page 10

“One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track lineage across a wide range of transformations.”

“The most interesting question in designing this interface is how to represent dependencies between RDDs.”

“We found it both sufficient and useful to classify dependencies into two types:
• narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD
• wide dependencies, where multiple child partitions may depend on it.”

Page 11

narrow vs. wide

narrow: each partition of the parent RDD is used by at most one partition of the child RDD
• examples: map, filter, union, join with co-partitioned inputs

wide: multiple child RDD partitions may depend on a single parent RDD partition
• examples: groupByKey, join with inputs not co-partitioned
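Whether a lineage contains narrow or wide dependencies can be checked from the driver; wide dependencies show up as extra stages in the debug output. A hedged PySpark sketch (the exact output format varies by Spark version):

x = sc.parallelize(range(10)).map(lambda n: (n % 2, n))
narrow = x.filter(lambda kv: kv[1] > 3)  # narrow: no shuffle needed
wide = x.groupByKey()                    # wide: introduces a shuffle boundary
print(narrow.toDebugString())
print(wide.toDebugString())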

Page 12

TRANSFORMATIONS: Core Operations

Pages 13–18: MAP

[diagram: RDD x holds 3 items; the user function is applied item by item, emitting one transformed value into RDD y for each input; before/after view once map() has been applied]

Return a new RDD by applying a function to each element of this RDD.

Page 19

MAP

map(f, preservesPartitioning=False)
Return a new RDD by applying a function to each element of this RDD

Python:
x = sc.parallelize(["b", "a", "c"])
y = x.map(lambda z: (z, 1))
print(x.collect())
print(y.collect())

Scala:
val x = sc.parallelize(Array("b", "a", "c"))
val y = x.map(z => (z, 1))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))

x: ['b', 'a', 'c']
y: [('b', 1), ('a', 1), ('c', 1)]

Pages 20–24: FILTER

[diagram: RDD x holds 3 items; the user function is applied to each item and emits true or false; items for which it emits true are kept in RDD y; before/after view once filter() has been applied]

Page 25

FILTER

filter(f)
Return a new RDD containing only the elements that satisfy a predicate

Python:
x = sc.parallelize([1,2,3])
y = x.filter(lambda x: x%2 == 1)  # keep odd values
print(x.collect())
print(y.collect())

Scala:
val x = sc.parallelize(Array(1,2,3))
val y = x.filter(n => n%2 == 1)
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))

x: [1, 2, 3]
y: [1, 3]

Pages 26–31: FLATMAP

[diagram: RDD x holds 3 items; the user function emits zero or more values for each input item, and the results are flattened into RDD y; before/after view once flatMap() has been applied]

Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

Page 32

FLATMAP

flatMap(f, preservesPartitioning=False)
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results

Python:
x = sc.parallelize([1,2,3])
y = x.flatMap(lambda x: (x, x*100, 42))
print(x.collect())
print(y.collect())

Scala:
val x = sc.parallelize(Array(1,2,3))
val y = x.flatMap(n => Array(n, n*100, 42))
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))

x: [1, 2, 3]
y: [1, 100, 42, 2, 200, 42, 3, 300, 42]

Pages 33–37: GROUPBY

[diagram: RDD x holds 4 items: "John", "Fred", "Anna", "James"; for each item the user function emits its first letter, and items accumulate under that key in RDD y: J → ["John", "James"], F → ["Fred"], A → ["Anna"]]

Page 38

GROUPBY

groupBy(f, numPartitions=None)
Group the data in the original RDD. Create pairs where the key is the output of a user function, and the value is all items for which the function yields this key.

Python:
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.groupBy(lambda w: w[0])
print([(k, list(v)) for (k, v) in y.collect()])

Scala:
val x = sc.parallelize(Array("John", "Fred", "Anna", "James"))
val y = x.groupBy(w => w.charAt(0))
println(y.collect().mkString(", "))

x: ['John', 'Fred', 'Anna', 'James']
y: [('A',['Anna']),('J',['John','James']),('F',['Fred'])]

Pages 39–41: GROUPBYKEY

[diagram: pair RDD x holds 5 items: (B,5), (B,4), (A,3), (A,2), (A,1); the values for each key are collected into RDD y: A → [2, 3, 1], B → [5, 4]]

Page 42

GROUPBYKEY

groupByKey(numPartitions=None)
Group the values for each key in the original RDD. Create a new pair where the original key corresponds to this collected group of values.

Python:
x = sc.parallelize([('B',5),('B',4),('A',3),('A',2),('A',1)])
y = x.groupByKey()
print(x.collect())
print(list((j[0], list(j[1])) for j in y.collect()))

Scala:
val x = sc.parallelize(Array(('B',5),('B',4),('A',3),('A',2),('A',1)))
val y = x.groupByKey()
println(x.collect().mkString(", "))
println(y.collect().mkString(", "))

x: [('B', 5),('B', 4),('A', 3),('A', 2),('A', 1)]
y: [('A', [2, 3, 1]),('B',[5, 4])]

Page 43

MAPPARTITIONS

[diagram: the user function is applied once to each partition (A and B) of RDD x, producing the corresponding partitions of RDD y]

Page 44

REDUCEBYKEY VS GROUPBYKEY

val words = Array("one", "two", "two", "three", "three", "three")
val wordPairsRDD = sc.parallelize(words).map(word => (word, 1))

val wordCountsWithReduce = wordPairsRDD.reduceByKey(_ + _).collect()

val wordCountsWithGroup = wordPairsRDD.groupByKey().map(t => (t._1, t._2.sum)).collect()
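The deck shows this comparison in Scala only; a rough PySpark equivalent of the same word count (assuming the same word list) might read:

words = ["one", "two", "two", "three", "three", "three"]
wordPairsRDD = sc.parallelize(words).map(lambda word: (word, 1))

wordCountsWithReduce = wordPairsRDD.reduceByKey(lambda a, b: a + b).collect()  # combines map-side before shuffling
wordCountsWithGroup = wordPairsRDD.groupByKey().mapValues(sum).collect()       # shuffles every pair, then sums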

Page 45: REDUCEBYKEY

[diagram: (a,1) and (b,1) pairs are combined within each partition first (e.g. (a,1) + (a,1) → (a,2)), and only these partial sums are shuffled, then merged into the final (a,6) and (b,5)]

Page 46: GROUPBYKEY

[diagram: every (a,1) and (b,1) pair is shuffled across the network with no map-side combining; all values for a key are gathered on one partition before being summed into (a,6) and (b,5)]

Page 47

MAPPARTITIONS

mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD

Python:
x = sc.parallelize([1,2,3], 2)

def f(iterator): yield sum(iterator); yield 42

y = x.mapPartitions(f)

# glom() flattens elements on the same partition
print(x.glom().collect())
print(y.glom().collect())

x: [[1], [2, 3]]
y: [[1, 42], [5, 42]]

Page 48

MAPPARTITIONS

mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD

Scala:
val x = sc.parallelize(Array(1,2,3), 2)

def f(i: Iterator[Int]) = { (i.sum, 42).productIterator }

val y = x.mapPartitions(f)

// glom() flattens elements on the same partition
val xOut = x.glom().collect()
val yOut = y.glom().collect()

x: Array(Array(1), Array(2, 3))
y: Array(Array(1, 42), Array(5, 42))

Page 49

MAPPARTITIONSWITHINDEX

[diagram: the user function receives the partition index along with the partition's contents for each partition (A and B) of RDD x, producing the partitions of RDD y]

Page 50

MAPPARTITIONSWITHINDEX

mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition

Python:
x = sc.parallelize([1,2,3], 2)

def f(partitionIndex, iterator): yield (partitionIndex, sum(iterator))

y = x.mapPartitionsWithIndex(f)

# glom() flattens elements on the same partition
print(x.glom().collect())
print(y.glom().collect())

x: [[1], [2, 3]]
y: [[(0, 1)], [(1, 5)]]

Page 51

MAPPARTITIONSWITHINDEX

mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.

Scala:
val x = sc.parallelize(Array(1,2,3), 2)

def f(partitionIndex: Int, i: Iterator[Int]) = {
  (partitionIndex, i.sum).productIterator
}

val y = x.mapPartitionsWithIndex(f)

// glom() flattens elements on the same partition
val xOut = x.glom().collect()
val yOut = y.glom().collect()

x: Array(Array(1), Array(2, 3))
y: Array(Array(0, 1), Array(1, 5))

Page 52

SAMPLE

[diagram: RDD x holds 5 items (1 through 5); a random statistical sample of them, here 1 and 3, lands in RDD y]

Page 53

SAMPLE

sample(withReplacement, fraction, seed=None)
Return a new RDD containing a statistical sample of the original RDD

Python:
x = sc.parallelize([1, 2, 3, 4, 5])
y = x.sample(False, 0.4, 42)
print(x.collect())
print(y.collect())

Scala:
val x = sc.parallelize(Array(1, 2, 3, 4, 5))
val y = x.sample(false, 0.4)
// omitting seed will yield different output
println(y.collect().mkString(", "))

x: [1, 2, 3, 4, 5]
y: [1, 3]

Page 54

UNION

[diagram: the partitions of RDD x (A, B) and RDD y (C) are concatenated unchanged into RDD z; duplicate values such as 3 are kept]

Page 55

UNION

union(otherRDD)
Return a new RDD containing all items from two original RDDs. Duplicates are not culled.

Python:
x = sc.parallelize([1,2,3], 2)
y = sc.parallelize([3,4], 1)
z = x.union(y)
print(z.glom().collect())

Scala:
val x = sc.parallelize(Array(1,2,3), 2)
val y = sc.parallelize(Array(3,4), 1)
val z = x.union(y)
val zOut = z.glom().collect()

x: [1, 2, 3]
y: [3, 4]
z: [[1], [2, 3], [3, 4]]

Pages 56–59: JOIN

[diagram: pair RDD x holds (A,1) and (B,2); pair RDD y holds (A,3), (A,4), (B,5); pairs with matching keys are combined into RDD z: (A,(1,3)), (A,(1,4)), (B,(2,5))]

Page 60

JOIN

join(otherRDD, numPartitions=None)
Return a new RDD containing all pairs of elements having the same key in the original RDDs

Python:
x = sc.parallelize([("a", 1), ("b", 2)])
y = sc.parallelize([("a", 3), ("a", 4), ("b", 5)])
z = x.join(y)
print(z.collect())

Scala:
val x = sc.parallelize(Array(("a", 1), ("b", 2)))
val y = sc.parallelize(Array(("a", 3), ("a", 4), ("b", 5)))
val z = x.join(y)
println(z.collect().mkString(", "))

x: [("a", 1), ("b", 2)]
y: [("a", 3), ("a", 4), ("b", 5)]
z: [('a', (1, 3)), ('a', (1, 4)), ('b', (2, 5))]

Pages 61–63: DISTINCT

[diagram: RDD x holds 1, 2, 3, 3, 4; the duplicate 3 is dropped, leaving RDD y with 1, 2, 3, 4]

Page 64

DISTINCT

distinct(numPartitions=None)
Return a new RDD containing distinct items from the original RDD (omitting all duplicates)

Python:
x = sc.parallelize([1,2,3,3,4])
y = x.distinct()
print(y.collect())

Scala:
val x = sc.parallelize(Array(1,2,3,3,4))
val y = x.distinct()
println(y.collect().mkString(", "))

x: [1, 2, 3, 3, 4]
y: [1, 2, 3, 4]

Pages 65–67: COALESCE

[diagram: the three partitions A, B, C of RDD x are reduced to two in RDD y; A and B merge into one partition while C stays on its own]

Page 68

COALESCE

coalesce(numPartitions, shuffle=False)
Return a new RDD which is reduced to a smaller number of partitions

Python:
x = sc.parallelize([1, 2, 3, 4, 5], 3)
y = x.coalesce(2)
print(x.glom().collect())
print(y.glom().collect())

Scala:
val x = sc.parallelize(Array(1, 2, 3, 4, 5), 3)
val y = x.coalesce(2)
val xOut = x.glom().collect()
val yOut = y.glom().collect()

x: [[1], [2, 3], [4, 5]]
y: [[1], [2, 3, 4, 5]]

Pages 69–72: KEYBY

[diagram: for each of "John", "Fred", "Anna", "James" in RDD x, the user function emits the first letter, which becomes the key of a pair in RDD y: (J,"John"), (F,"Fred"), (A,"Anna"), (J,"James")]

Page 73

KEYBY

keyBy(f)
Create a Pair RDD, forming one pair for each item in the original RDD. The pair’s key is calculated from the value via a user-supplied function.

Python:
x = sc.parallelize(['John', 'Fred', 'Anna', 'James'])
y = x.keyBy(lambda w: w[0])
print(y.collect())

Scala:
val x = sc.parallelize(Array("John", "Fred", "Anna", "James"))
val y = x.keyBy(w => w.charAt(0))
println(y.collect().mkString(", "))

x: ['John', 'Fred', 'Anna', 'James']
y: [('J','John'),('F','Fred'),('A','Anna'),('J','James')]

Pages 74–78: PARTITIONBY

[diagram: pair RDD x holds (J,"John"), (A,"Anna"), (F,"Fred"), (J,"James") spread over 3 partitions; the user function sends keys below 'H' to partition 0 and the rest to partition 1, so RDD y has (A,"Anna"), (F,"Fred") in partition 0 and (J,"John"), (J,"James") in partition 1]

Page 79

PARTITIONBY

partitionBy(numPartitions, partitioner=portable_hash)
Return a new RDD with the specified number of partitions, placing original items into the partition returned by a user-supplied function

Python:
x = sc.parallelize([('J','James'),('F','Fred'),('A','Anna'),('J','John')], 3)
y = x.partitionBy(2, lambda w: 0 if w[0] < 'H' else 1)
print(x.glom().collect())
print(y.glom().collect())

x: [[('J', 'James')], [('F', 'Fred')], [('A', 'Anna'), ('J', 'John')]]
y: [[('A', 'Anna'), ('F', 'Fred')], [('J', 'James'), ('J', 'John')]]

Page 80

PARTITIONBY

partitionBy(numPartitions, partitioner)
Return a new RDD with the specified number of partitions, placing original items into the partition returned by a user-supplied function.

Scala:
import org.apache.spark.Partitioner

val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),('A',"Anna"),('J',"John")), 3)

val y = x.partitionBy(new Partitioner() {
  val numPartitions = 2
  def getPartition(k: Any) = {
    if (k.asInstanceOf[Char] < 'H') 0 else 1
  }
})

val yOut = y.glom().collect()

x: Array(Array((A,Anna), (F,Fred)), Array((J,John), (J,James)))
y: Array(Array((F,Fred), (A,Anna)), Array((J,John), (J,James)))

Pages 81–84: ZIP

[diagram: RDD x holds 1, 2, 3 and RDD y holds their squares 1, 4, 9 in the same partitions A and B; items at the same partition and index are paired into RDD z: (1,1), (2,4), (3,9)]

Page 85

ZIP

zip(otherRDD)
Return a new RDD containing pairs whose key is the item in the original RDD, and whose value is that item’s corresponding element (same partition, same index) in a second RDD

Python:
x = sc.parallelize([1, 2, 3])
y = x.map(lambda n: n*n)
z = x.zip(y)
print(z.collect())

Scala:
val x = sc.parallelize(Array(1,2,3))
val y = x.map(n => n*n)
val z = x.zip(y)
println(z.collect().mkString(", "))

x: [1, 2, 3]
y: [1, 4, 9]
z: [(1, 1), (2, 4), (3, 9)]

Page 86

ACTIONS: Core Operations

Page 87

distributed vs. driver

distributed: occurs across the cluster
driver: result must fit in the driver JVM
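Since an action's result must fit in the driver JVM, prefer actions that bound how much comes back. A hedged PySpark sketch (sizes are illustrative):

big = sc.parallelize(range(1000000))
print(big.take(5))   # returns just 5 items to the driver
print(big.count())   # returns a single number
# big.collect() would pull all 1,000,000 items into the driver at once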

Page 88

GETNUMPARTITIONS

[diagram: an RDD with two partitions, A and B; the count 2 is returned to the driver]

Page 89

GETNUMPARTITIONS

getNumPartitions()
Return the number of partitions in RDD

Python:
x = sc.parallelize([1,2,3], 2)
y = x.getNumPartitions()
print(x.glom().collect())
print(y)

Scala:
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.partitions.size
val xOut = x.glom().collect()
println(y)

x: [[1], [2, 3]]
y: 2

Page 90

COLLECT

[diagram: the contents of partitions A and B are returned to the driver as a single list]

Page 91

COLLECT

collect()
Return all items in the RDD to the driver in a single list

Python:
x = sc.parallelize([1,2,3], 2)
y = x.collect()
print(x.glom().collect())
print(y)

Scala:
val x = sc.parallelize(Array(1,2,3), 2)
val y = x.collect()
val xOut = x.glom().collect()
println(y)

x: [[1], [2, 3]]
y: [1, 2, 3]

Pages 92–94: REDUCE

[diagram: RDD holds 1, 2, 3, 4; the user function is applied pairwise to elements and partial results (1+2 → 3, 3+3 → 6, 6+4 → 10), and the final value 10 is returned to the driver]

Page 95

REDUCE

reduce(f)
Aggregate all the elements of the RDD by applying a user function pairwise to elements and partial results, and return a result to the driver

Python:
x = sc.parallelize([1,2,3,4])
y = x.reduce(lambda a,b: a+b)
print(x.collect())
print(y)

Scala:
val x = sc.parallelize(Array(1,2,3,4))
val y = x.reduce((a,b) => a+b)
println(x.collect.mkString(", "))
println(y)

x: [1, 2, 3, 4]
y: 10

Pages 96–106: AGGREGATE

[diagram: RDD holds 1, 2, 3, 4 across partitions A and B; within each partition, seqOp folds items into the zero value ([], 0): partition A builds ([1], 1), then ([1,2], 3); partition B builds ([3], 3), then ([3,4], 7); combOp then merges the per-partition results into ([1,2,3,4], 10), which is returned to the driver]

Page 107

AGGREGATE

aggregate(identity, seqOp, combOp)
Aggregate all the elements of the RDD by:
- applying a user function to combine elements with user-supplied objects,
- then combining those user-defined results via a second user function,
- and finally returning a result to the driver.

Python:
seqOp = lambda data, item: (data[0] + [item], data[1] + item)
combOp = lambda d1, d2: (d1[0] + d2[0], d1[1] + d2[1])

x = sc.parallelize([1,2,3,4])
y = x.aggregate(([], 0), seqOp, combOp)
print(y)

x: [1, 2, 3, 4]
y: ([1, 2, 3, 4], 10)

Page 108

AGGREGATE

aggregate(identity, seqOp, combOp)
Aggregate all the elements of the RDD by:
- applying a user function to combine elements with user-supplied objects,
- then combining those user-defined results via a second user function,
- and finally returning a result to the driver.

Scala:
def seqOp = (data: (Array[Int], Int), item: Int) => (data._1 :+ item, data._2 + item)
def combOp = (d1: (Array[Int], Int), d2: (Array[Int], Int)) => (d1._1.union(d2._1), d1._2 + d2._2)

val x = sc.parallelize(Array(1,2,3,4))
val y = x.aggregate((Array[Int](), 0))(seqOp, combOp)
println(y)

x: [1, 2, 3, 4]
y: (Array(3, 1, 2, 4), 10)

Page 109

MAX

[diagram: RDD holds 2, 4, 1; the maximum, 4, is returned to the driver]

Page 110

MAX

max()
Return the maximum item in the RDD

Python:
x = sc.parallelize([2,4,1])
y = x.max()
print(x.collect())
print(y)

Scala:
val x = sc.parallelize(Array(2,4,1))
val y = x.max
println(x.collect().mkString(", "))
println(y)

x: [2, 4, 1]
y: 4

Page 111

SUM

[diagram: RDD holds 2, 4, 1; the sum, 7, is returned to the driver]

Page 112

SUM

sum()
Return the sum of the items in the RDD

Python:
x = sc.parallelize([2,4,1])
y = x.sum()
print(x.collect())
print(y)

Scala:
val x = sc.parallelize(Array(2,4,1))
val y = x.sum
println(x.collect().mkString(", "))
println(y)

x: [2, 4, 1]
y: 7

Page 113

MEAN

[diagram: RDD holds 2, 4, 1; the mean, 2.3333333, is returned to the driver]

Page 114

MEAN

mean()
Return the mean of the items in the RDD

Python:
x = sc.parallelize([2,4,1])
y = x.mean()
print(x.collect())
print(y)

Scala:
val x = sc.parallelize(Array(2,4,1))
val y = x.mean
println(x.collect().mkString(", "))
println(y)

x: [2, 4, 1]
y: 2.3333333

Page 115

STDEV

[diagram: RDD holds 2, 4, 1; the standard deviation, 1.2472191, is returned to the driver]

Page 116

STDEV

stdev()
Return the standard deviation of the items in the RDD

Python:
x = sc.parallelize([2,4,1])
y = x.stdev()
print(x.collect())
print(y)

Scala:
val x = sc.parallelize(Array(2,4,1))
val y = x.stdev
println(x.collect().mkString(", "))
println(y)

x: [2, 4, 1]
y: 1.2472191

Page 117

COUNTBYKEY

[diagram: pair RDD holds (J,"John"), (A,"Anna"), (F,"Fred"), (J,"James"); the map {'A': 1, 'J': 2, 'F': 1} is returned to the driver]

Page 118

COUNTBYKEY

countByKey()
Return a map of keys and counts of their occurrences in the RDD

Python:
x = sc.parallelize([('J','James'), ('F','Fred'), ('A','Anna'), ('J','John')])
y = x.countByKey()
print(y)

Scala:
val x = sc.parallelize(Array(('J',"James"),('F',"Fred"),('A',"Anna"),('J',"John")))
val y = x.countByKey()
println(y)

x: [('J', 'James'), ('F','Fred'), ('A','Anna'), ('J','John')]
y: {'A': 1, 'J': 2, 'F': 1}

Page 119

SAVEASTEXTFILE

Page 120

SAVEASTEXTFILE

saveAsTextFile(path, compressionCodecClass=None)
Save the RDD to the filesystem indicated in the path

Python:
dbutils.fs.rm("/temp/demo", True)
x = sc.parallelize([2,4,1])
x.saveAsTextFile("/temp/demo")
y = sc.textFile("/temp/demo")
print(y.collect())

Scala:
dbutils.fs.rm("/temp/demo", true)
val x = sc.parallelize(Array(2,4,1))
x.saveAsTextFile("/temp/demo")
val y = sc.textFile("/temp/demo")
println(y.collect().mkString(", "))

x: [2, 4, 1]
y: [u'2', u'4', u'1']

Page 121

LAB

Page 122

Q&A

