Spark rdd part 2

Advanced RDD operations in Scala

In our previous blog we discussed the basic RDD operations; you can have a look at Spark RDD operations in Scala. In this blog we will discuss some advanced operations on RDDs.

Here we have taken two datasets, dept and emp, to work on these operations. The datasets look like:

[DeptNo DeptName] [Emp_no DOB FName Lname gender HireDate DeptNo]

Both datasets are tab-delimited.

Union: Union results in an RDD which contains the elements of both RDDs. Refer to the screenshot below to see how to perform a union.

Here we have created two RDDs and loaded the two datasets into them, then performed a union on them. In the result you can see that the two datasets have been combined; we have printed the first 10 records of the newly obtained RDD, and the 10th record is the first record of the second dataset.
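Since the screenshot is not reproduced here, a minimal spark-shell sketch of the same steps (the file paths are placeholders, and `sc` is the SparkContext that the spark-shell provides):

```scala
// Load the two tab-delimited datasets as RDDs of lines (paths are placeholders)
val dept = sc.textFile("dept.txt")
val emp  = sc.textFile("emp.txt")

// union keeps every element of both RDDs, duplicates included
val combined = dept.union(emp)

// Print the first 10 records of the new RDD
combined.take(10).foreach(println)
```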


Intersection: Intersection returns only the elements that are present in both RDDs. Refer to the screenshot below to see how to perform an intersection.

Here we have split the datasets using the tab delimiter, extracted the 1st column from the first dataset and the 7th column from the second dataset, and performed an intersection on the resulting RDDs; the result is as displayed.
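A minimal sketch of those steps (paths are placeholders; column 1 of dept and column 7 of emp both hold the DeptNo):

```scala
// Extract the DeptNo column from each tab-delimited dataset
val deptNos    = sc.textFile("dept.txt").map(_.split("\t")(0))
val empDeptNos = sc.textFile("emp.txt").map(_.split("\t")(6))

// intersection keeps only the DeptNos present in both RDDs (deduplicated)
val common = deptNos.intersection(empDeptNos)
common.collect().foreach(println)
```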

Cartesian: cartesian will return an RDD containing the Cartesian product of the elements of both RDDs, i.e. every possible pair. Refer to the screenshot below for the same.


Here we have split the datasets using the tab delimiter, extracted the 1st column from the first dataset and the 7th column from the second dataset, and performed a cartesian operation on the RDDs; the results are displayed.
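A sketch of the same, under the same placeholder paths and spark-shell `sc`:

```scala
val deptNos    = sc.textFile("dept.txt").map(_.split("\t")(0))
val empDeptNos = sc.textFile("emp.txt").map(_.split("\t")(6))

// cartesian pairs every element of the first RDD with every element of the second
val pairs = deptNos.cartesian(empDeptNos)   // RDD[(String, String)]
pairs.take(10).foreach(println)
```

Note that the result size is the product of the two RDD sizes, so cartesian can be very expensive on large datasets.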

Subtract: subtract returns the elements of the first RDD that are not present in the second RDD, i.e. it removes the common elements from the first. Refer to the screenshot below for the same.

Here we have split the datasets using the tab delimiter, extracted the 1st column from the first dataset and the 7th column from the second dataset, and performed a subtract operation on the RDDs; the results are displayed.
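A sketch of the subtract step (paths are placeholders; `sc` comes from the spark-shell):

```scala
val deptNos    = sc.textFile("dept.txt").map(_.split("\t")(0))
val empDeptNos = sc.textFile("emp.txt").map(_.split("\t")(6))

// subtract keeps the DeptNos of deptNos that do not appear in empDeptNos,
// e.g. departments that no employee belongs to
val unmatched = deptNos.subtract(empDeptNos)
unmatched.collect().foreach(println)
```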

Foreach: foreach is used to apply a function to every element in the RDD. Refer to the screenshot below for the same.

In the screenshot you can see that every element in the RDD emp is printed on a separate line.
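A sketch of that call (path is a placeholder):

```scala
val emp = sc.textFile("emp.txt")

// foreach applies the given function to every element; here it prints each record.
// In local mode (spark-shell) the output appears in the console; on a real cluster
// it goes to the executor logs, not the driver.
emp.foreach(println)
```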


Pair RDD: Here we will create a pair RDD, which consists of key-value pairs. To refer to the RDD type explicitly, we import the RDD class with the statement below:

import org.apache.spark.rdd.RDD

Refer to the screenshot below for the same.

Here we have split the dataset using tab as the delimiter and created the key-value pairs as shown in the screenshot.
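A minimal sketch of building the pair RDD from the dept dataset (path is a placeholder):

```scala
import org.apache.spark.rdd.RDD

// Key = DeptNo (column 1), value = DeptName (column 2)
val deptPairs: RDD[(String, String)] = sc.textFile("dept.txt")
  .map(_.split("\t"))
  .map(cols => (cols(0), cols(1)))
```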

Keys: keys returns all the keys in the pair RDD. Refer to the screenshot below for the same.

Values: values returns all the values in the pair RDD. Refer to the screenshot below for the same.

SortByKey: sortByKey returns an RDD of the key-value pairs sorted by key. It accepts a boolean argument: false sorts the keys in descending order and true sorts them in ascending order. Refer to the screenshot below for the same.
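The three operations above can be sketched on the dept pair RDD (path is a placeholder):

```scala
val deptPairs = sc.textFile("dept.txt")
  .map(_.split("\t"))
  .map(cols => (cols(0), cols(1)))   // (DeptNo, DeptName)

deptPairs.keys.collect().foreach(println)               // all DeptNos
deptPairs.values.collect().foreach(println)             // all DeptNames
deptPairs.sortByKey(false).collect().foreach(println)   // sorted by DeptNo, descending
```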


RDDs holding objects: Here, using a case class, we will declare an object type and build an RDD of instances of that case class. Refer to the screenshot below for the same.
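A sketch with the dept dataset (the case class name and fields follow the layout above; path is a placeholder):

```scala
// A case class describing one dept record
case class Dept(deptNo: String, deptName: String)

val depts = sc.textFile("dept.txt")
  .map(_.split("\t"))
  .map(cols => Dept(cols(0), cols(1)))   // RDD[Dept]
```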

Join: join is used to join two pair RDDs by key; the default join is an inner join. Refer to the screenshot below for the same.

Here we have taken a case class for each of the two datasets and created two pair RDDs, with the common field as the key and the rest of the record as the value, and we have performed a join operation on the RDDs; the result is as displayed on the screen.

RightOuterJoin: Returns the joined elements of both RDDs, keeping every key present in the second (right) RDD; keys missing from the first RDD yield None on the left side. Refer to the screenshot below for the same.

Here we have taken a case class for each of the two datasets and created two pair RDDs, with the common field as the key and the rest of the record as the value, and we have performed a rightOuterJoin operation on the RDDs; the result is as displayed on the screen.

LeftOuterJoin: Returns the joined elements of both RDDs, keeping every key present in the first (left) RDD; keys missing from the second RDD yield None on the right side. Refer to the screenshot below for the same.

Here we have taken a case class for each of the two datasets and created two pair RDDs, with the common field as the key and the rest of the record as the value, and we have performed a leftOuterJoin operation on the RDDs; the result is as displayed on the screen.
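All three joins can be sketched together by keying both datasets on the common field DeptNo (paths and the choice of FName as the emp value are placeholders):

```scala
// Key both RDDs by the common field, DeptNo
val deptByNo = sc.textFile("dept.txt").map(_.split("\t")).map(c => (c(0), c(1)))  // (DeptNo, DeptName)
val empByNo  = sc.textFile("emp.txt").map(_.split("\t")).map(c => (c(6), c(2)))   // (DeptNo, FName)

deptByNo.join(empByNo)            // inner join: only keys present in both RDDs
deptByNo.leftOuterJoin(empByNo)   // all dept keys; missing emp side becomes None
deptByNo.rightOuterJoin(empByNo)  // all emp keys; missing dept side becomes None
```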


CountByKey: Returns the number of elements present for each key, as a map on the driver. Refer to the screenshot below for the same.

Here we have loaded the dataset, split the records using tab as the delimiter, created the pairs as (DeptNo, DeptName), and performed the countByKey operation; the result is as displayed.
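A sketch of those steps (path is a placeholder):

```scala
val deptPairs = sc.textFile("dept.txt")
  .map(_.split("\t"))
  .map(cols => (cols(0), cols(1)))   // (DeptNo, DeptName)

// countByKey is an action: it returns a Map[String, Long] of counts to the driver
val counts = deptPairs.countByKey()
counts.foreach(println)
```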

SaveAsTextFile: It stores the contents of the RDD as a text file at the given output path. Refer to the screenshot below for the same.
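A sketch of saving an RDD (both paths are placeholders):

```scala
val combined = sc.textFile("dept.txt").union(sc.textFile("emp.txt"))

// Writes one part-XXXXX file per partition under the given directory.
// The output directory must not already exist, or Spark throws an error.
combined.saveAsTextFile("output/union_result")
```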


Hope this blog helped you understand RDD operations in Scala in more depth. Keep visiting our site www.acadgild.com for more updates on Big Data and other technologies.

