
APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

Page 1: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME


APACHE SPARK™

RDD JOIN TO REDUCE JOB RUN TIME

Insights from Imaginea

Page 2: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

THE PROBLEM

Over 1 TB of data was received from S3 as rows of tuples [ ID, v1, v2, … vn ]

The requirement was to aggregate the values by ID and store the result in HDFS

When we joined the existing data with the incremental data on Spark, there was a

HUGE AMOUNT OF DATA SHUFFLE across the cluster

We wanted to reduce this shuffle and thus reduce the job run time
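
For concreteness, a minimal sketch of the kind of per-ID aggregation involved (the key/value types and the element-wise sum are assumptions for illustration, not the original logic):

import org.apache.spark.rdd.RDD

// Rows arrive as [ID, v1, v2, ... vn]; aggregate the value columns per ID
// (assumed here to be an element-wise sum) before writing the result to HDFS.
def aggregateById(rows: RDD[(Long, Array[Double])]): RDD[(Long, Array[Double])] =
  rows.reduceByKey { (a, b) => a.zip(b).map { case (x, y) => x + y } }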

Page 3: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

THE APPROACH

SAME ID GOES TO SAME PARTITION

Since the aggregate for each ID in one dataset has to be matched with the same ID in the other dataset, we partitioned both datasets so that rows with the same ID land in the same partition, and therefore on the same Spark worker.

With this approach, rows would be joined locally and the costly network shuffle avoided.

Page 4: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

HASHPARTITIONER ON THE RDDs

THE HURDLE

The approach was to use a HashPartitioner on the RDDs, which partitions the data by the hash of the key (a minimal sketch follows the list below). However:

1. Even though the HashPartitioner divides the data based on keys, it does not enforce node affinity

2. Thus, the amount of data shuffle did not reduce
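
For reference, a sketch of this initial HashPartitioner-based approach, assuming pair RDDs keyed by ID (the names, value types, and partition count are illustrative, not taken from the original job):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Partition both datasets by ID with the same HashPartitioner, then join.
// The partitionBy calls themselves shuffle the data, and nothing here pins
// a given partition to a particular worker node, which is exactly the hurdle.
def joinById(existing: RDD[(Long, Double)],
             incremental: RDD[(Long, Double)]): RDD[(Long, (Double, Double))] = {
  val partitioner = new HashPartitioner(200) // illustrative partition count
  existing.partitionBy(partitioner).join(incremental.partitionBy(partitioner))
}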

Page 5: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

So, we dug into the Spark code to figure out a way to reduce the data shuffle

Page 6: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME


OUR SOLUTION

Page 7: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

STEP 1: OVERRIDE THE TASKSCHEDULER PROCESS TO ENSURE DATA ALWAYS GOES TO THE SAME NODE

This can be implemented as follows (delegate everything to the underlying RDD but re-implement getPreferredLocations):

import scala.reflect.ClassTag
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

// compute and getPartitions are delegated to the underlying RDD (omitted on the slide)
class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {

  // Worker nodes to pin partitions to (cluster-specific IPs)
  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")

  // Partition i always prefers node (i mod number of nodes)
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
}

The TaskScheduler assigns worker nodes to partitions. Overriding this behaviour in the wrapper RDD ensures that the data for a given partition always goes to the same node.
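
As a hypothetical usage sketch (rdd1 and rdd2 stand in for the two datasets being joined; they are not names from the original job):

// Wrap both datasets so that partition i of each RDD prefers the same worker node
val pinned1 = new NodeAffinityRDD(rdd1)
val pinned2 = new NodeAffinityRDD(rdd2)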

Page 8: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

STEP 2: WRITE SPARK CODE TO RUN A JOB AND EXECUTE A JOIN

1. Take a trial dataset

2. Move it to HDFS

3. Run a very simple job: a couple of transformations, ending with dsRdd.join(devRDD)

val r1 = sc.textFile("hdfs://192.168.2.145:9000/todelete/partitions/random1")
val r2 = sc.textFile("hdfs://192.168.2.145:9000/todelete/partitions/random2")

val dsRdd = r1.map(line => <some transformation>).map(tokens => <some more>)
val devRDD = r2.map(line => <some transformation>).map(tokens => <some more>)

// finally join and materialize
dsRdd.join(devRDD, dummy).count

Page 9: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME


UNDERSTANDING THE JOB RUN

Page 10: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

WHY A THREE STAGE PROCESS?

After the job ran, the Spark UI showed the following result.

There are 3 stages. Stages 0 and 1 produce shuffle data, which is consumed as a whole in stage 2.

Page 11: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

DAG FOR STAGE 0

Stage 0 starts by reading the “random1” file and then applies the two map functions to that RDD.

Stage 1 does the same with the “random2” file and has an identical DAG.

Page 12: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

DAG FOR STAGE 2

Stage 2 has only one block, corresponding to the “join” method call on dsRdd, i.e. dsRdd.join(devRDD, dummy).

Here we observe that shuffle boundaries are involved in the job when ideally there is no reason for them to be.

Hence we took a look at CoGroupedRDD to see what causes the shuffle boundary between the join and the previous stages.

Page 13: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

ANALYSING COGROUPED RDD

If the RDDs being joined do not have exactly the same partitioner as the CoGroupedRDD itself, they are marked as a ShuffleDependency (the else block below).

Clearly dsRdd and devRDD both went through this path, and each was marked as a separate stage feeding into the join. Hence we get three stages.

override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}

Page 14: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

WRAP TWO RDDs WITHOUT REPARTITIONING

The two RDDs can be wrapped without repartitioning them. The wrapper delegates everything to the underlying RDD and plugs in the dummy partitioner. This makes CoGroupedRDD report that there are no stage boundaries, so the DAGScheduler schedules everything locally on each worker.

Here’s the (very small) code for WrapRDD:

import scala.reflect.ClassTag
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.annotation.DeveloperApi
import org.apache.spark.rdd.RDD

class WrapRDD[T: ClassTag](rdd: RDD[T], part: Partitioner)
  extends RDD[T](rdd.sparkContext, rdd.dependencies) {

  // Delegate the actual computation and the partition list to the wrapped RDD
  @DeveloperApi
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    rdd.compute(split, context)

  override protected def getPartitions: Array[Partition] = rdd.partitions

  // ********* main thing/hack ******* ///
  override val partitioner = Some(part)
}
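
To tie this back to the dsRdd.join(devRDD, dummy) call from Step 2, here is a hypothetical sketch of how the wrapper and the dummy partitioner fit together (the HashPartitioner choice is an assumption; any partitioner instance shared by both wrappers would do):

import org.apache.spark.HashPartitioner

// Both wrappers report the same partitioner, so CoGroupedRDD adds
// OneToOneDependencies and no extra shuffle stage is created.
// Note: the local join is only correct if both RDDs are already laid out
// so that matching IDs share the same partition index.
val dummy = new HashPartitioner(dsRdd.partitions.length)
val wrappedDs = new WrapRDD(dsRdd, dummy)
val wrappedDev = new WrapRDD(devRDD, dummy)

wrappedDs.join(wrappedDev, dummy).count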

Page 15: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME


THE RESULT

Page 16: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

ONE STAGE JOB RUN

After the changes are made, there is only one stage. The whole data is read (51 × 2 = 102 MB) and nothing is written to or read from shuffle.

Page 17: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

EXECUTION TIME IMPROVED BY 25%, FROM 4 SECONDS TO 3 SECONDS

The two RDDs that were computed in separate stages (0 and 1) earlier are now part of the same stage because they are wrapped in WrapRDD.

Execution time also improved: 4 seconds before, 3 seconds now, a good 25% improvement. This is because the job no longer has to write the shuffle to disk and read it back immediately; it works on the intermediate data right away.

Page 18: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

IN SUMMARY …

A good workaround was found for processing a large amount of data that needs to be analysed and segregated

This in turn improves the quality of work and delivery time while using Spark RDDs

Execution time improved by 25%, from 4 seconds to 3 seconds

Page 19: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

EXPERIENCE THE POWER OF

APACHE SPARK WITH IMAGINEA

Imaginea is among the top contributors to Spark code

Building products on Spark since 2014

Open-source contributors to Apache Hadoop and Zeppelin

To find out more, visit http://www.imaginea.com/apache-spark

Page 20: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

ABOUT THE AUTHOR

SACHIN TYAGI
Head – Data Engineering, Imaginea

Sachin heads the Data Engineering & Analytics practice at Imaginea. With over 10 years of IT experience, he brings both Data Science and Data Engineering expertise to solving complex problems in Big Data and Machine Learning. At Imaginea, Sachin has been pivotal in implementing Apache Spark solutions for several FAST 500 companies in areas such as Predictive Recommendation, Anomaly Detection, and Contextual Search.

Page 21: APACHE SPARK RDD JOIN TO REDUCE JOB RUN TIME

Disclaimer

This document may contain forward-looking statements concerning products and strategies. These statements are based on management's current expectations and actual results may differ materially from those projected, as a result of certain risks, uncertainties and assumptions, including but not limited to: the growth of the markets addressed by our products and our customers' products, the demand for and market acceptance of our products; our ability to successfully compete in the markets in which we do business; our ability to successfully address the cost structure of our offerings; the ability to develop and implement new technologies and to obtain protection for the related intellectual property; and our ability to realize financial and strategic benefits of past and future transactions. These forward-looking statements are made only as of the date indicated, and the company disclaims any obligation to update or revise the information contained in any forward-looking statements, whether as a result of new information, future events or otherwise.

All Trademarks and other registered marks belong to their respective owners.

Copyright © 2012-2015, Imaginea Technologies, Inc. and/or its affiliates. All rights reserved.

Credits

Images under Creative Commons Zero license.


