Spark RDD Operations in Scala, Part 2

Date post: 14-Apr-2017
Upload: acadgild

ACADGILD

In our previous post, we discussed the basic RDD operations in Scala. Now, let’s discuss some of the advanced RDD operations in Scala.

Here we have taken two datasets, dept and emp, to work on these operations. The datasets look like this:

dept: [DeptNo DeptName]
emp: [Emp_no DOB FName Lname gender HireDate DeptNo]

Both datasets are tab-delimited.

Union:

The Union operation returns an RDD that contains the elements of both RDDs; duplicates are not removed. You can refer to the screenshot below to see how the Union operation performs.

Here, we have created two RDDs and loaded the two datasets into them. We then performed the Union operation on them and printed the first 10 records of the resulting RDD. From the result you can see that both datasets are combined: the 10th record is the first record of the second dataset.
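Since the screenshot is not reproduced here, the following is a minimal sketch of the same idea, not the original code. The sample lines are made up, and the `SparkContext.getOrCreate` setup is an assumption so that the snippet runs both in spark-shell and as a standalone script; with real files the RDDs would come from `sc.textFile`, whose paths the post does not give.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("union-sketch").setMaster("local[*]"))

// Made-up stand-ins for lines read from the dept and emp files.
val dept = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))
val emp = sc.parallelize(Seq(
  "10001\t1953-09-02\tGeorgi\tFacello\tM\t1986-06-26\td001"))

val combined = dept.union(emp)     // duplicates, if any, are kept
val firstTen = combined.take(10)   // dept lines come first, then emp lines
firstTen.foreach(println)
```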

Intersection:

Intersection returns only the elements that are present in both RDDs, with duplicates removed. Refer to the screenshot below to see how to perform intersection.

Here we have split the datasets by using the tab delimiter and extracted the 1st column from the first dataset and the 7th column from the second dataset (DeptNo in both cases). We then performed intersection on the two resulting RDDs, and the result is as displayed.
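The steps just described might look roughly like the sketch below. The records are invented for illustration, and the context setup is an assumption rather than part of the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("intersection-sketch").setMaster("local[*]"))

// Made-up sample lines in the dept and emp layouts.
val deptLines = Seq("d001\tMarketing", "d002\tFinance", "d003\tHuman Resources")
val empLines = Seq(
  "10001\t1953-09-02\tGeorgi\tFacello\tM\t1986-06-26\td001",
  "10002\t1964-06-02\tBezalel\tSimmel\tF\t1985-11-21\td001",
  "10003\t1959-12-03\tParto\tBamford\tM\t1986-08-28\td004")

// Column 1 of dept and column 7 of emp both hold DeptNo.
val deptNos = sc.parallelize(deptLines).map(_.split("\t")(0))
val empDeptNos = sc.parallelize(empLines).map(_.split("\t")(6))

// Only values found in both RDDs survive, and each appears once.
val common = deptNos.intersection(empDeptNos).collect().sorted
common.foreach(println)
```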

Cartesian:

The Cartesian operation returns an RDD containing the Cartesian product of the two RDDs, that is, every pair made of one element from the first RDD and one element from the second. You can refer to the screenshot below for the same.

Here we have split the datasets by using the tab delimiter and extracted the 1st column from the first dataset and the 7th column from the second dataset. Then, we performed the Cartesian operation on the RDDs, and the results are displayed.
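As a rough illustration of the same steps, here is a sketch on a few made-up records. The data and the context setup are assumptions. The result size is the product of the two input sizes, so the operation becomes expensive on large RDDs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("cartesian-sketch").setMaster("local[*]"))

// Made-up sample lines in the dept and emp layouts.
val deptNos = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))
  .map(_.split("\t")(0))
val empDeptNos = sc.parallelize(Seq(
  "10001\t1953-09-02\tGeorgi\tFacello\tM\t1986-06-26\td001",
  "10003\t1959-12-03\tParto\tBamford\tM\t1986-08-28\td004"))
  .map(_.split("\t")(6))

// Every element of the first RDD is paired with every element of the second.
val product = deptNos.cartesian(empDeptNos).collect()
product.foreach(println)
```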

Subtract:

The Subtract operation returns the elements of the first RDD that are not present in the second RDD. You can refer to the screenshot below for the same.

Here, we have split the datasets by using the tab delimiter and extracted the 1st column from the first dataset and the 7th column from the second dataset. Then we performed the Subtract operation on the RDDs, and the results are displayed.
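A small sketch of this, again on invented records and with an assumed context setup, shows that the order of the operands matters: only department numbers from the first RDD can appear in the result.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("subtract-sketch").setMaster("local[*]"))

// Made-up sample lines in the dept and emp layouts.
val deptNos = sc.parallelize(
  Seq("d001\tMarketing", "d002\tFinance", "d003\tHuman Resources"))
  .map(_.split("\t")(0))
val empDeptNos = sc.parallelize(Seq(
  "10001\t1953-09-02\tGeorgi\tFacello\tM\t1986-06-26\td001",
  "10003\t1959-12-03\tParto\tBamford\tM\t1986-08-28\td004"))
  .map(_.split("\t")(6))

// Department numbers that no employee record refers to.
val unused = deptNos.subtract(empDeptNos).collect().sorted
unused.foreach(println)
```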

Foreach:

The foreach operation applies a function to every element in the RDD; here it is used to print each element. You can refer to the screenshot below for the same.

In the above screenshot, you can see that every element in the RDD emp is printed on a separate line.
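A minimal sketch of such a call follows, with made-up records and an assumed local context. Note that `foreach` runs on the executors, so on a real cluster the printed lines end up in the executor logs; collecting first, as in the last line, prints on the driver instead.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("foreach-sketch").setMaster("local[*]"))

// Made-up sample lines in the emp layout.
val emp = sc.parallelize(Seq(
  "10001\t1953-09-02\tGeorgi\tFacello\tM\t1986-06-26\td001",
  "10002\t1964-06-02\tBezalel\tSimmel\tF\t1985-11-21\td001"))

// Prints where the tasks run (the console in local mode).
emp.foreach(println)

// Brings the records to the driver first, then prints them there.
val onDriver = emp.collect()
onDriver.foreach(println)
```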

Operations on Pair RDDs:

Creating Pair RDD:

Here, we will create a pair RDD, which consists of key-value pairs. The statement below imports the RDD type so that it can be referred to by name. (In Spark 1.3 and later, the pair operations themselves are available on any RDD of two-element tuples without a further import.)

import org.apache.spark.rdd.RDD

You can refer to the screenshot below for the same.

Here, we have split the dataset by using the tab as delimiter and made the key-value pairs as shown in the above screenshot.
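One way this could look is sketched below. The records are made up, the context setup is an assumption, and the choice of DeptNo as key and DeptName as value is only one plausible reading of the description.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("pair-rdd-sketch").setMaster("local[*]"))

// Made-up sample lines in the dept layout.
val deptLines = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))

// Each line becomes a (DeptNo, DeptName) tuple; an RDD of tuples is a pair RDD.
val deptPairs: RDD[(String, String)] = deptLines.map { line =>
  val fields = line.split("\t")
  (fields(0), fields(1))
}

val collectedPairs = deptPairs.collect()
collectedPairs.foreach(println)
```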

Keys:

The Keys operation returns an RDD containing all the keys of the pair RDD, which can then be printed. You can refer to the screenshot below for the same.

Values:

The Values operation returns an RDD containing all the values of the pair RDD, which can then be printed. You can refer to the screenshot below for the same.
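Both operations can be sketched on a small pair RDD as follows. The pairs are invented and the context setup is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("keys-values-sketch").setMaster("local[*]"))

// Made-up (DeptNo, DeptName) pairs.
val deptPairs = sc.parallelize(Seq(("d001", "Marketing"), ("d002", "Finance")))

val deptKeys = deptPairs.keys.collect()       // only the first tuple component
val deptValues = deptPairs.values.collect()   // only the second tuple component

deptKeys.foreach(println)
deptValues.foreach(println)
```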

SortByKey:

The SortByKey operation returns an RDD that contains the key-value pairs sorted by key. SortByKey accepts a true/false argument: ‘true’ (the default) sorts the keys in ascending order and ‘false’ sorts them in descending order. You can refer to the screenshot below for the same.
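The effect of the flag can be seen in a short sketch like this one, which uses made-up pairs and an assumed local context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("sortbykey-sketch").setMaster("local[*]"))

// Made-up (DeptNo, DeptName) pairs, deliberately out of order.
val deptPairs = sc.parallelize(Seq(
  ("d003", "Human Resources"), ("d001", "Marketing"), ("d002", "Finance")))

val ascending = deptPairs.sortByKey().keys.collect()        // same as sortByKey(true)
val descending = deptPairs.sortByKey(false).keys.collect()

println(ascending.mkString(", "))
println(descending.mkString(", "))
```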

RDDs holding Objects:

Here, we declare a case class and use it as the element type of the RDD, so that each record is held as an object. You can refer to the screenshot below for the same.
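A sketch of the idea follows. The case class name `Dept` and its field names are hypothetical (the post's own definitions were in the screenshot), the records are made up, and the context setup is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case class mirroring the dept layout.
case class Dept(deptNo: String, deptName: String)

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("case-class-sketch").setMaster("local[*]"))

// Made-up sample lines in the dept layout.
val deptLines = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))

// Each line is parsed into a Dept object, giving an RDD[Dept].
val depts = deptLines.map { line =>
  val fields = line.split("\t")
  Dept(fields(0), fields(1))
}

// Fields can now be accessed by name instead of by position.
val deptNames = depts.map(_.deptName).collect()
deptNames.foreach(println)
```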

Join:

The Join operation joins two pair RDDs on their keys. The default Join is an inner join, so only keys present in both RDDs appear in the result. You can refer to the screenshot below for the same.

Here, we have defined two case classes for the two datasets and created two pair RDDs from them, with the common element (DeptNo) as the key and the rest of the contents as the value. We then performed the Join operation on the RDDs, and the result is as displayed on the screen.
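The sketch below shows one way to set this up. The case classes `Dept` and `Emp` are hypothetical and `Emp` is cut down to three fields for brevity, the records are made up, and the context setup is an assumption. For simplicity the whole object is kept as the value via `keyBy`, rather than only the remaining fields.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical, simplified case classes for the two datasets.
case class Dept(deptNo: String, deptName: String)
case class Emp(empNo: String, firstName: String, deptNo: String)

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("join-sketch").setMaster("local[*]"))

// Made-up records, keyed by DeptNo.
val deptByNo = sc.parallelize(Seq(Dept("d001", "Marketing"), Dept("d002", "Finance")))
  .keyBy(_.deptNo)
val empByDept = sc.parallelize(Seq(
  Emp("10001", "Georgi", "d001"),
  Emp("10002", "Bezalel", "d001"),
  Emp("10003", "Parto", "d004")))
  .keyBy(_.deptNo)

// Inner join: d002 (no employees) and d004 (no department) are dropped.
val joined = empByDept.join(deptByNo)
val namePairs = joined
  .map { case (_, (emp, dept)) => (emp.firstName, dept.deptName) }
  .collect()
namePairs.foreach(println)
```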

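RightOuterJoin:

The RightOuterJoin operation returns the joined elements of both RDDs, keeping every key that is present in the second RDD (the one passed as the argument). Where the first RDD has no matching key, its side of the result is None. You can refer to the screenshot below for the same.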

Here, we have defined two case classes for the two datasets and created two pair RDDs from them, with the common element (DeptNo) as the key and the rest of the contents as the value. We then performed the rightOuterJoin operation on the RDDs, and the result is as displayed on the screen.
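To make the behaviour concrete, here is a sketch that uses plain tuples instead of case classes to stay short. The data is made up and the context setup is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("right-outer-join-sketch").setMaster("local[*]"))

// Made-up (DeptNo, FName) and (DeptNo, DeptName) pairs.
val empByDept = sc.parallelize(Seq(
  ("d001", "Georgi"), ("d001", "Bezalel"), ("d004", "Parto")))
val deptByNo = sc.parallelize(Seq(("d001", "Marketing"), ("d002", "Finance")))

// Every key of deptByNo is kept; the employee side is an Option.
val rightJoined = empByDept.rightOuterJoin(deptByNo).collect()
rightJoined.foreach(println)
```

In this sketch d002 appears with None on the employee side, while d004, which exists only in the first RDD, is dropped.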

LeftOuterJoin:

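The LeftOuterJoin operation returns the joined elements of both RDDs, keeping every key that is present in the first RDD (the one the operation is called on). Where the second RDD has no matching key, its side of the result is None. You can refer to the screenshot below for the same.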

Here, we have defined two case classes for the two datasets and created two pair RDDs from them, with the common element (DeptNo) as the key and the rest of the contents as the value. We then performed the LeftOuterJoin operation on the RDDs, and the result is as displayed on the screen.
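A sketch with plain tuples (made-up data, assumed context setup) illustrates which keys survive.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("left-outer-join-sketch").setMaster("local[*]"))

// Made-up (DeptNo, FName) and (DeptNo, DeptName) pairs.
val empByDept = sc.parallelize(Seq(
  ("d001", "Georgi"), ("d001", "Bezalel"), ("d004", "Parto")))
val deptByNo = sc.parallelize(Seq(("d001", "Marketing"), ("d002", "Finance")))

// Every key of empByDept is kept; the department side is an Option.
val leftJoined = empByDept.leftOuterJoin(deptByNo).collect()
leftJoined.foreach(println)
```

Here d004 appears with None on the department side, while d002, which exists only in the second RDD, is dropped.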

CountByKey:

The CountByKey operation returns the number of elements present for each key, as a map collected on the driver. You can refer to the screenshot below for the same.

Here, we have loaded the dataset, split the records by using tab as the delimiter, and created pairs of DeptNo and DeptName. Then, we performed the CountByKey operation, and the result is as displayed.
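These steps can be sketched as follows. The lines are made up, and one department number is repeated on purpose so that the counts differ; the context setup is an assumption.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("countbykey-sketch").setMaster("local[*]"))

// Made-up lines in the dept layout; d001 is repeated for illustration.
val deptLines = sc.parallelize(Seq(
  "d001\tMarketing", "d002\tFinance", "d001\tMarketing East"))

val deptPairs = deptLines.map { line =>
  val fields = line.split("\t")
  (fields(0), fields(1))
}

// countByKey is an action: it returns a local Map from key to count.
val counts = deptPairs.countByKey()
counts.foreach(println)
```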

SaveAsTextFile:

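The SaveAsTextFile operation stores the contents of the RDD as text in the given output path. The path is created as a directory containing one part file per partition, and it must not already exist. You can refer to the screenshot below for the same.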
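A minimal sketch follows, writing to a temporary directory chosen only for illustration. The output location, the sample lines, and the context setup are all assumptions. Reading the directory back with `textFile` is a simple way to check what was written.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the shell's context if there is one; otherwise start a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("save-sketch").setMaster("local[*]"))

// Made-up sample lines in the dept layout.
val dept = sc.parallelize(Seq("d001\tMarketing", "d002\tFinance"))

// The target itself must not exist yet, so use a fresh child of a temp directory.
val outDir = Files.createTempDirectory("rdd-sketch").resolve("dept-out").toString
dept.saveAsTextFile(outDir)

// Read all part files in the output directory back as one RDD of lines.
val readBack = sc.textFile(outDir).collect().sorted
readBack.foreach(println)
```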

Hope this post has been helpful in understanding the advanced RDD operations in Scala. In case of any queries, feel free to drop us a comment below or email us at [email protected].

Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.