Date posted: 12-Apr-2017
Category: Software
Uploaded by: mark-smith
Remove Duplicates: Basic Spark Functionality
Spark
Spark Core
• Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for:
• memory management and fault recovery
• scheduling, distributing, and monitoring jobs on a cluster
• interacting with storage systems
Spark Core
• Spark introduces the concept of an RDD (Resilient Distributed Dataset):
• an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel
• contains any type of object, and is created by loading an external dataset or distributing a collection from the driver program
• RDDs support two types of operations:
• Transformations are operations (such as map, filter, join, union, and so on) that are performed on an RDD and which yield a new RDD containing the result.
• Actions are operations (such as reduce, count, first, and so on) that return a value after running a computation on an RDD.
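As a minimal sketch of the two operation types (assuming a running spark-shell where `sc` is the SparkContext):

```scala
// Transformations are lazy: nothing executes until an action is called.
val nums    = sc.parallelize(1 to 10)   // RDD from a local collection
val evens   = nums.filter(_ % 2 == 0)   // transformation: yields a new RDD
val doubled = evens.map(_ * 2)          // transformation: still no work done

// Actions trigger the actual computation and return a value to the driver.
val total   = doubled.reduce(_ + _)     // action: 4+8+12+16+20 = 60
val howMany = doubled.count()           // action: 5
```

Chaining transformations builds up a lineage graph; Spark only schedules work when an action forces evaluation.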
Spark DataFrames
• The DataFrames API is inspired by data frames in R and Python (pandas), but designed from the ground up to support modern big data and data science applications:
• ability to scale from kilobytes of data on a single laptop to petabytes on a large cluster
• support for a wide array of data formats and storage systems
• state-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
• seamless integration with all big data tooling and infrastructure via Spark
• APIs for Python, Java, Scala, and R (in development via SparkR)
Remove Duplicates
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv")
college: org.apache.spark.rdd.RDD[String]
val cNoDups = college.distinct
cNoDups: org.apache.spark.rdd.RDD[String]
college: RDD
"as,df,asf"  "qw,e,qw"  "mb,kg,o"
"as,df,asf"  "q3,e,qw"  "mb,kg,o"
"as,df,asf"  "qw,e,qw"  "mb,k2,o"

cNoDups: RDD
"as,df,asf"  "qw,e,qw"  "mb,kg,o"
"q3,e,qw"    "mb,k2,o"
val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]]
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x)
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])]
college: RDD
"as,df,asf"  "qw,e,qw"  "mb,kg,o"
"as,df,asf"  "q3,e,qw"  "mb,kg,o"
"as,df,asf"  "qw,e,qw"  "mb,k2,o"

cRows: RDD
Array(as,df,asf)  Array(qw,e,qw)  Array(mb,kg,o)
Array(as,df,asf)  Array(q3,e,qw)  Array(mb,kg,o)
Array(as,df,asf)  Array(qw,e,qw)  Array(mb,k2,o)

cKeyRows: RDD
key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,kg,o)
key->Array(as,df,asf)  key->Array(q3,e,qw)  key->Array(mb,kg,o)
key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,k2,o)
val cGrouped = cKeyRows.groupBy(x => x._1).map(x => (x._1, x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]
val cDups = cGrouped.filter(x => x._2.length > 1)
cKeyRows: RDD
key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,kg,o)
key->Array(as,df,asf)  key->Array(q3,e,qw)  key->Array(mb,kg,o)
key->Array(as,df,asf)  key->Array(qw,e,qw)  key->Array(mb,k2,o)

cGrouped: RDD
key->(Array(as,df,asf), Array(as,df,asf), Array(as,df,asf))
key->(Array(mb,kg,o), Array(mb,kg,o))   key->(Array(mb,k2,o))
key->(Array(qw,e,qw), Array(qw,e,qw))   key->(Array(q3,e,qw))
val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])]
val cNoDups = cGrouped.map(x => x._2(0)._2)
cNoDups: org.apache.spark.rdd.RDD[Array[String]]
cGrouped: RDD
key->(Array(as,df,asf), Array(as,df,asf), Array(as,df,asf))
key->(Array(mb,kg,o), Array(mb,kg,o))   key->(Array(mb,k2,o))
key->(Array(qw,e,qw), Array(qw,e,qw))   key->(Array(q3,e,qw))

cNoDups: RDD
"as,df,asf"  "qw,e,qw"  "mb,kg,o"
"q3,e,qw"    "mb,k2,o"

cDups: RDD
key->(Array(as,df,asf), Array(as,df,asf), Array(as,df,asf))
key->(Array(mb,kg,o), Array(mb,kg,o))
key->(Array(qw,e,qw), Array(qw,e,qw))
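The group-and-take-first pattern above can be sketched on plain Scala collections, outside Spark, using hypothetical toy rows; the shape of the logic is the same as `cKeyRows.groupBy(...)` followed by `map(x => x._2(0)._2)` on the RDD:

```scala
// Toy rows standing in for the parsed CSV lines.
val rows = Seq(
  Array("as", "df", "asf"),
  Array("as", "df", "asf"),
  Array("qw", "e",  "qw"),
  Array("q3", "e",  "qw")
)

// Build a dedup key from the fields, group by it, and keep one row per group.
val keyed   = rows.map(r => r.mkString("_") -> r)
val grouped = keyed.groupBy(_._1)
val dups    = grouped.filter(_._2.length > 1)       // groups with duplicates
val noDups  = grouped.values.map(_.head._2).toSeq   // first row of every group
```

Here `dups` contains one entry (the repeated "as_df_asf" group) and `noDups` contains three distinct rows, mirroring `cDups` and `cNoDups` above.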
Previously the RDD API was the primary abstraction, but today the Spark DataFrames API is considered the primary interaction point with Spark. RDDs remain available if needed.
What is partitioning in Apache Spark?
Partitioning is the main mechanism for using all of your hardware resources while executing a job: more partitions means more parallelism. Conceptually, you should check the number of slots in your hardware, i.e. how many tasks each executor can handle. Each partition may live on a different executor.
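A minimal sketch of inspecting and changing partition counts, assuming a spark-shell where `sc` is available:

```scala
// Create an RDD with an explicit number of partitions.
val data = sc.parallelize(1 to 1000, numSlices = 4)
data.getNumPartitions                  // 4

// More partitions = more tasks that can run in parallel across executors.
val wider    = data.repartition(16)    // full shuffle into 16 partitions
val narrower = wider.coalesce(2)       // shrink without a full shuffle
narrower.getNumPartitions              // 2
```

A common rule of thumb is to size the partition count to a small multiple of the total number of executor cores, so every slot stays busy.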
DataFrames
• A DataFrame has a columnar structure, and each record is a row.
• You can run statistics naturally, since it works much like SQL or a Python/R data frame.
• With an RDD, to process the last 7 days of data, Spark needs to go through the entire dataset; with a DataFrame you already have a time column to handle the situation, so Spark won't even look at data older than 7 days.
• Easier to program.
• Better performance and storage in the executor heap.
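The "last 7 days" point can be sketched as follows, using a hypothetical DataFrame `df` with an `eventTime` date column (spark-shell assumed):

```scala
import org.apache.spark.sql.functions._

// Keep only rows from the last 7 days. With a partitioned or Parquet-backed
// source, Catalyst can push this filter down and skip older data entirely.
val recent = df.filter(col("eventTime") >= date_sub(current_date(), 7))
recent.count()
```

The filter is declarative, so the optimizer is free to turn it into partition pruning or a pushed-down predicate rather than a full scan.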
How does a DataFrame read less data?
• You can skip partitions while reading the data.
• Using Parquet.
• Skipping data using statistics (i.e. min, max).
• Using partitioning (e.g. year = 2015/month = 06…).
• Pushing predicates into storage systems.
What is Parquet?
• Parquet should be the source for any operation or ETL, so if the data is in a different format, the preferred approach is to convert the source to Parquet and then process it.
• If a dataset is in JSON or a comma-separated file, first ETL it to convert it to Parquet.
• It limits I/O, so it scans/reads only the columns that are needed.
• Parquet uses a columnar layout, so it compresses better and saves space.
• Parquet stores each column's data together, column by column. So if there are 3 column files and the SQL query touches only 2 of them, Parquet won't even read the 3rd file.
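A minimal CSV-to-Parquet conversion sketch in spark-shell, using `sqlContext` as in the demo below (the paths here are hypothetical):

```scala
// One-time ETL: convert the CSV source to Parquet.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/tmp/colleges.csv")
df.write.parquet("/tmp/colleges.parquet")

// Later reads scan only the requested columns, and the filter can be
// pushed down into the Parquet reader instead of scanning every row.
val pq = sqlContext.read.parquet("/tmp/colleges.parquet")
pq.select("INSTNM", "CITY").filter(pq("STABBR") === "OK").show()
```

Because Parquet is columnar, the `select` limits which column chunks are read at all, and the `filter` can use per-chunk min/max statistics to skip data.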
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/CollegeScoreCard.csv")
college.count
res2: Long = 7805
val collegeNoDups = college.distinct
collegeNoDups.count
res3: Long = 7805
val college = sc.textFile("/Users/marksmith/TulsaTechFest2016/colleges.csv") college: org.apache.spark.rdd.RDD[String] = /Users/marksmith/TulsaTechFest2016/colleges.csv MapPartitionsRDD[17] at textFile at <console>:27
val cNoDups = college.distinct cNoDups: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[20] at distinct at <console>:29
cNoDups.count res7: Long = 7805
college.count res8: Long = 9000
val cRows = college.map(x => x.split(",",-1))
cRows: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[21] at map at <console>:29
val cKeyRows = cRows.map(x => "%s_%s_%s_%s".format(x(0),x(1),x(2),x(3)) -> x)
cKeyRows: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[22] at map at <console>:31
cKeyRows.take(2) res11: Array[(String, Array[String])] = Array((UNITID_OPEID_opeid6_INSTNM,Array(
val cGrouped = cKeyRows.groupBy(x => x._1).map(x => (x._1,x._2.to[scala.collection.mutable.ArrayBuffer]))
cGrouped: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[27] at map at <console>:33
val cDups = cGrouped.filter(x => x._2.length > 1)
cDups: org.apache.spark.rdd.RDD[(String, scala.collection.mutable.ArrayBuffer[(String, Array[String])])] = MapPartitionsRDD[28] at filter at <console>:35
cDups.count res12: Long = 1195
val cNoDups = cGrouped.map(x => x._2(0)) cNoDups: org.apache.spark.rdd.RDD[(String, Array[String])] = MapPartitionsRDD[29] at map at <console>:35
cNoDups.count res13: Long = 7805
val cNoDups = cGrouped.map(x => x._2(0)._2) cNoDups: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:35
cNoDups.take(5) 16/08/04 16:44:24 ERROR Executor: Managed memory leak detected; size = 41227428 bytes, TID = 28 res16: Array[Array[String]] = Array(Array(145725, 00169100, 001691, Illinois Institute of Technology, Chicago, IL, www.iit.edu, npc.collegeboard.org/student/app/iit, 0, 3, 2, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, NULL, 520, 640, 640, 740, 520, 640, 580, 690, 580, 25, 30, 24, 31, 26, 33, NULL, NULL, 28, 28, 30, NULL, 1252, 1252, 0, 0, 0.2026, 0, 0, 0, 0.1225, 0, 0, 0.4526, 0.0245
Demo RDD Code
import org.apache.spark.sql.SQLContext
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/Users/marksmith/TulsaTechFest2016/colleges.csv")
df: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, CITY: string, STABBR: string, INSTURL: string, NPCURL: string, HCM2: int, PREDDEG: int, CONTROL: int, LOCALE: string, HBCU: string, PBI: string, ANNHI: string, TRIBAL: string, AANAPII: string, HSI: string, NANTI: string, MENONLY: string, WOMENONLY: string, RELAFFIL: string, SATVR25: string, SATVR75: string, SATMT25: string, SATMT75: string, SATWR25: string, SATWR75: string, SATVRMID: string, SATMTMID: string, SATWRMID: string, ACTCM25: string, ACTCM75: string, ACTEN25: string, ACTEN75: string, ACTMT25: string, ACTMT75: string, ACTWR25: string, ACTWR75: string, ACTCMMID: string, ACTENMID: string, ACTMTMID: string, ACTWRMID: string, SAT_AVG: string, SAT_AVG_ALL: string, PCIP01: string, PCIP03: stri...
val dfd = df.distinct
dfd.count
res0: Long = 7804
df.count res1: Long = 8998
val dfdd = df.dropDuplicates(Array("UNITID", "OPEID", "opeid6", "INSTNM"))
dfdd.count
res2: Long = 7804
val dfCnt = df.groupBy("UNITID", "OPEID", "opeid6", "INSTNM").agg(count("UNITID").alias("cnt"))
res8: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]
dfCnt.show
+--------+-------+------+--------------------+---+
|  UNITID|  OPEID|opeid6|              INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703|  1047|Troy University-P...|  2|
|11339705|3467309| 34673|Marinello Schools...|  2|
|  135276| 558500|  5585|Lively Technical ...|  2|
|  145682| 675300|  6753|Illinois Central ...|  2|
|  151111| 181300|  1813|Indiana Universit...|  1|
df.registerTempTable("colleges")
val dfCnt2 = sqlContext.sql("select UNITID, OPEID, opeid6, INSTNM, count(UNITID) as cnt from colleges group by UNITID, OPEID, opeid6, INSTNM")
dfCnt2: org.apache.spark.sql.DataFrame = [UNITID: int, OPEID: int, opeid6: int, INSTNM: string, cnt: bigint]
dfCnt2.show
+--------+-------+------+--------------------+---+
|  UNITID|  OPEID|opeid6|              INSTNM|cnt|
+--------+-------+------+--------------------+---+
|10236801| 104703|  1047|Troy University-P...|  2|
|11339705|3467309| 34673|Marinello Schools...|  2|
|  135276| 558500|  5585|Lively Technical ...|  2|
|  145682| 675300|  6753|Illinois Central ...|  2|
|  151111| 181300|  1813|Indiana Universit...|  1|
|  156921| 696100|  6961|Jefferson Communi...|  1|
Demo DataFrame Code