
Tulsa techfest Spark Core Aug 5th 2016

Transcript
Page 1: Tulsa techfest Spark Core Aug 5th 2016

Remove Duplicates: Basic Spark Functionality

Page 2: Tulsa techfest Spark Core Aug 5th 2016

Spark

Page 3: Tulsa techfest Spark Core Aug 5th 2016

Spark Core

• Spark Core is the base engine for large-scale parallel and distributed data processing (a minimal sketch follows). It is responsible for:
  • memory management and fault recovery
  • scheduling, distributing and monitoring jobs on a cluster
  • interacting with storage systems
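
To make this concrete, here is a minimal standalone sketch (not from the slides) of creating the SparkContext that Spark Core builds on; in spark-shell, as in the demos later in this deck, `sc` is already provided. The app name and local[*] master are illustrative assumptions.

// Minimal sketch; assumes Spark 1.x-era API as used in this deck.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("TulsaTechFestDemo")   // illustrative app name
  .setMaster("local[*]")             // illustrative: use all local cores
val sc = new SparkContext(conf)

// Spark Core distributes this collection across partitions and schedules
// the tasks that compute the sum in parallel.
val total = sc.parallelize(1 to 1000).sum()
println(total)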

Page 4: Tulsa techfest Spark Core Aug 5th 2016

Spark Core

• Spark introduces the concept of an RDD (Resilient Distributed Dataset):
  • an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel
  • it can contain any type of object and is created by loading an external dataset or distributing a collection from the driver program
• RDDs support two types of operations (see the sketch after this list):
  • Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and which yield a new RDD containing the result.
  • Actions are operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD.
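
A short sketch of that distinction, using rows shaped like the examples on the next slides (the values are illustrative; `sc` is the spark-shell context):

// Transformations are lazy: they only describe a new RDD.
val lines  = sc.parallelize(Seq("as,df,asf", "qw,e,qw", "mb,kg,o", "as,df,asf"))
val fields = lines.map(x => x.split(","))              // transformation -> RDD[Array[String]]
val firsts = fields.filter(a => a(0).startsWith("a"))  // transformation -> RDD[Array[String]]

// Actions trigger the computation and return a value to the driver.
firsts.count()          // Long
lines.distinct.first()  // String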

Page 5: Tulsa techfest Spark Core Aug 5th 2016

Spark DataFrames

• The DataFrames API is inspired by data frames in R and Python (Pandas), but designed from the ground up to support modern big data and data science applications (a minimal example follows this list):
  • ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
  • support for a wide array of data formats and storage systems
  • state-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
  • seamless integration with all big data tooling and infrastructure via Spark
  • APIs for Python, Java, Scala, and R (in development via SparkR)
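
For orientation, a minimal sketch of the 2016-era DataFrame API used in the demo at the end of this deck (assumes spark-shell with the com.databricks:spark-csv package; the path is the deck's example path):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/Users/marksmith/TulsaTechFest2016/colleges.csv")

df.printSchema()                      // inferred columns and types
df.select("INSTNM", "CITY").show(5)   // SQL-like projection
df.groupBy("STABBR").count().show(5)  // aggregation without writing SQL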

Page 6: Tulsa techfest Spark Core Aug 5th 2016

Remove Duplicates

Page 7: Tulsa techfest Spark Core Aug 5th 2016

val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String]

val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String]

college: RDD
  “as,df,asf”  “qw,e,qw”  “mb,kg,o”
  “as,df,asf”  “q3,e,qw”  “mb,kg,o”
  “as,df,asf”  “qw,e,qw”  “mb,k2,o”

cNoDups: RDD
  “as,df,asf”  “qw,e,qw”  “mb,kg,o”
  “q3,e,qw”    “mb,k2,o”

Page 8: Tulsa techfest Spark Core Aug 5th 2016

val cRows = college.map(x => x.split(",", -1))
cRows: org.apache.spark.rdd.RDD[Array[String]]

val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0), x(1), x(2), x(3)) -> x)
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])]

college: RDD
  “as,df,asf”  “qw,e,qw”  “mb,kg,o”
  “as,df,asf”  “q3,e,qw”  “mb,kg,o”
  “as,df,asf”  “qw,e,qw”  “mb,k2,o”

cRows: RDD
  Array(as,df,asf)  Array(qw,e,qw)  Array(mb,kg,o)
  Array(as,df,asf)  Array(q3,e,qw)  Array(mb,kg,o)
  Array(as,df,asf)  Array(qw,e,qw)  Array(mb,k2,o)

cKeyRows: RDD
  key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,kg,o)
  key->Array(as,df,asf)  key->Array(q3,e,qw)  key->Array(mb,kg,o)
  key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,k2,o)

Page 9: Tulsa techfest Spark Core Aug 5th 2016

val cGrouped = cKeyRows
  .groupBy(x => x._1)
  .map(x => (x._1, x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]

val cDups = cGrouped.filter(x => x._2.length > 1)

cKeyRows: RDD
  key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,kg,o)
  key->Array(as,df,asf)  key->Array(q3,e,qw)  key->Array(mb,kg,o)
  key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,k2,o)

cGrouped: RDD
  key->Array(as,df,asf) Array(as,df,asf) Array(as,df,asf)
  key->Array(mb,kg,o) Array(mb,kg,o)     key->Array(mb,k2,o)
  key->Array(qw,e,qw) Array(qw,e,qw)     key->Array(q3,e,qw)

Page 10: Tulsa techfest Spark Core Aug 5th 2016

val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]

val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]]

cGrouped: RDD
  key->Array(as,df,asf) Array(as,df,asf) Array(as,df,asf)
  key->Array(mb,kg,o) Array(mb,kg,o)     key->Array(mb,k2,o)
  key->Array(qw,e,qw) Array(qw,e,qw)     key->Array(q3,e,qw)

cNoDups: RDD
  “as,df,asf”  “qw,e,qw”  “mb,kg,o”
  “q3,e,qw”    “mb,k2,o”

cDups: RDD
  key->Array(as,df,asf) Array(as,df,asf) Array(as,df,asf)
  key->Array(mb,kg,o) Array(mb,kg,o)
  key->Array(qw,e,qw) Array(qw,e,qw)

Page 11: Tulsa techfest Spark Core Aug 5th 2016

Previously the RDD was the primary way to interact with Spark; today the Spark DataFrames API is considered the primary interaction point, but the RDD API is still available when you need it.

Page 12: Tulsa techfest Spark Core Aug 5th 2016

What is partitioning in Apache Spark? Partitioning is how Spark divides data so that a job can use all of the available hardware resources. More partitions = more parallelism. Conceptually, you should check how many task slots your hardware provides, i.e. how many tasks each executor can run at once. Different partitions can live on different executors, as sketched below.
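
A small sketch of inspecting and changing partition counts (the path is the deck's colleges.csv; the specific counts 16 and 4 are illustrative):

// Assumes spark-shell (`sc` provided).
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college.partitions.length               // how many partitions textFile created

val college16 = college.repartition(16) // full shuffle into 16 partitions -> up to 16 parallel tasks
val college4  = college16.coalesce(4)   // shrink the partition count without a full shuffle
college4.partitions.length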

Page 13: Tulsa techfest Spark Core Aug 5th 2016
Page 14: Tulsa techfest Spark Core Aug 5th 2016

DataFrames

• A DataFrame has a column structure, and each record is a row (a line of the source data).
• You can run statistics naturally, since it works much like SQL or a Python/R data frame.
• With an RDD, processing only the last 7 days of data means Spark has to scan the entire dataset; with a DataFrame that already has a time column, Spark can skip the data that is older than 7 days (see the sketch after this list).
• Easier to program.
• Better performance and storage in the executor heap.
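
A sketch of the "last 7 days" point above (the dataset path and the event_time column are hypothetical, not from the deck):

// Hypothetical Parquet dataset with an event_time column; the filter becomes
// a predicate Catalyst can push down, so older data can be skipped.
import org.apache.spark.sql.functions._

val events    = sqlContext.read.parquet("/tmp/events")
val last7Days = events.filter(col("event_time") >= date_sub(current_date(), 7))
last7Days.count()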

Page 15: Tulsa techfest Spark Core Aug 5th 2016

How does a DataFrame ensure it reads less data?

• You can skip partitions while reading the data using a DataFrame.
• Using Parquet.
• Skipping data using statistics (e.g. min, max).
• Using partitioning (e.g. year = 2015/month = 06 …; see the sketch after this list).
• Pushing predicates down into the storage system.
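
A sketch of directory partitioning plus pruning (the columns year/month/value, the sample rows, and the /tmp path are hypothetical):

// Write a small DataFrame partitioned by year/month directories.
val sample = sqlContext.createDataFrame(Seq(
  (2015, 5, "a"), (2015, 6, "b"), (2016, 1, "c")
)).toDF("year", "month", "value")

sample.write.partitionBy("year", "month").parquet("/tmp/demo_parquet")

// The filter on partition columns prunes directories (only year=2015/month=6 is read),
// and the select prunes columns (only 'value' is scanned).
val june2015 = sqlContext.read.parquet("/tmp/demo_parquet")
  .filter("year = 2015 AND month = 6")
  .select("value")
june2015.show()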

Page 16: Tulsa techfest Spark Core Aug 5th 2016

What is Parquet?

Page 17: Tulsa techfest Spark Core Aug 5th 2016

• Parquet should be the source format for any operation or ETL: if the data arrives in a different format, the preferred approach is to convert it to Parquet first and then process it.
• If a dataset is in JSON or a comma-separated file, first ETL it into Parquet (see the sketch after this list).
• It limits I/O, scanning/reading only the columns that are needed.
• Parquet uses a columnar layout, so it compresses better and saves space.
• Parquet stores each column's values together, so if a table has 3 columns and a SQL query touches only 2 of them, Parquet will not even read the third column.
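
A sketch of the CSV-to-Parquet ETL described above (the /tmp output path is hypothetical; the CSV path and column names come from the deck's demo):

val csv = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/Users/marksmith/TulsaTechFest2016/colleges.csv")

csv.write.parquet("/tmp/colleges_parquet")   // one-time conversion to columnar Parquet

// Later queries scan only the column chunks they need.
sqlContext.read.parquet("/tmp/colleges_parquet").select("INSTNM", "CITY").show(5)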

Page 18: Tulsa techfest Spark Core Aug 5th 2016

val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/CollegeScoreCard.csv")
college.count
res2: Long = 7805

val collegeNoDups = college.distinct
collegeNoDups.count
res3: Long = 7805

val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String] = /Users/marksmith/TulsaTechFest2016/colleges.csv MapPartitionsRDD[17] at textFile at <console>:27

val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at distinct at <console>:29

cNoDups.count
res7: Long = 7805

college.count
res8: Long = 9000

val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[21] at map at <console>:29

val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x )
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[22] at map at <console>:31

cKeyRows.take(2)
res11: Array[(String, Array[String])] = Array((UNITID_OPEID_opeid6_INSTNM,Array(

val cGrouped = cKeyRows.groupBy(x => x._1).map(x => (x._1,x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[27] at map at <console>:33

val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[28] at filter at <console>:35

cDups.count
res12: Long = 1195

val cNoDups = cGrouped.map(x => x._2(0))
cNoDups: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[29] at map at <console>:35

cNoDups.count
res13: Long = 7805

val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:35

cNoDups.take(5)
16/08/04 16:44:24 ERROR Executor: Managed memory leak detected; size = 41227428 bytes, TID = 28
res16: Array[Array[String]] = Array(Array(145725, 00169100, 001691, Illinois Institute of Technology, Chicago, IL, www.iit.edu, npc.collegeboard.org/student/app/iit, 0, 3, 2, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, NULL, 520, 640, 640, 740, 520, 640, 580, 690, 580, 25, 30, 24, 31, 26, 33, NULL, NULL, 28, 28, 30, NULL, 1252, 1252, 0, 0, 0.2026, 0, 0, 0, 0.1225, 0, 0, 0.4526, 0.0245

Demo RDD Code

Page 19: Tulsa techfest Spark Core Aug 5th 2016

import org.apache.spark.sql.SQLContext

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/Users/marksmith/TulsaTechFest2016/colleges.csv")
df: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, CITY: string, STABBR: string, INSTURL: string, NPCURL: string, HCM2: int, PREDDEG: int, CONTROL: int, LOCALE: string, HBCU: string, PBI: string, ANNHI: string, TRIBAL: string, AANAPII: string, HSI: string, NANTI: string, MENONLY: string, WOMENONLY: string, RELAFFIL: string, SATVR25: string, SATVR75: string, SATMT25: string, SATMT75: string, SATWR25: string, SATWR75: string, SATVRMID: string, SATMTMID: string, SATWRMID: string, ACTCM25: string, ACTCM75: string, ACTEN25: string, ACTEN75: string, ACTMT25: string, ACTMT75: string, ACTWR25: string, ACTWR75: string, ACTCMMID: string, ACTENMID: string, ACTMTMID: string, ACTWRMID: string, SAT_AVG: string, SAT_AVG_ALL: string, PCIP01: string, PCIP03: stri...

val dfd = df.distinct
dfd.count
res0: Long = 7804

df.count
res1: Long = 8998

val dfdd = df.dropDuplicates(Array("UNITID", "OPEID", "opeid6", "INSTNM"))
dfdd.count
res2: Long = 7804

val dfCnt = df.groupBy("UNITID", "OPEID", "opeid6", "INSTNM").agg(count("UNITID").alias("cnt"))
res8: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]

dfCnt.show
+--------+-------+------+--------------------+---+
|  UNITID|  OPEID|opeid6|              INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703|  1047|Troy University-P...|  2|
|11339705|3467309| 34673|Marinello Schools...|  2|
|  135276| 558500|  5585|Lively Technical ...|  2|
|  145682| 675300|  6753|Illinois Central ...|  2|
|  151111| 181300|  1813|Indiana Universit...|  1|

df.registerTempTable("colleges")
val dfCnt2 = sqlContext.sql("select UNITID, OPEID, opeid6, INSTNM, count(UNITID) as cnt from colleges group by UNITID, OPEID, opeid6, INSTNM")
dfCnt2: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]

dfCnt2.show
+--------+-------+------+--------------------+---+
|  UNITID|  OPEID|opeid6|              INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703|  1047|Troy University-P...|  2|
|11339705|3467309| 34673|Marinello Schools...|  2|
|  135276| 558500|  5585|Lively Technical ...|  2|
|  145682| 675300|  6753|Illinois Central ...|  2|
|  151111| 181300|  1813|Indiana Universit...|  1|
|  156921| 696100|  6961|Jefferson Communi...|  1|

Demo DataFrame Code

