
CS 696 Intro to Big Data: Tools and Methods Fall Semester, 2017

Doc 8 Spark Intro Modified Sep 26, 2017

Copyright ©, All rights reserved. 2017 SDSU & Roger Whitney, 5500 Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent (http://www.opencontent.org/opl.shtml) license defines the copyright on this document.

Spark

Created at UC Berkeley’s AMPLab

2009       Project started
2014 May   Version 1.0
2016 July  Version 2.0.2
2017 July  Version 2.2.0

Programming interface for Java, Python, Scala, R

Interactive shell for Python, Scala, R (experimental)

Runs on Linux, Mac, Windows

Cluster managers: native Spark cluster, Hadoop YARN, Apache Mesos

File systems: HDFS, MapR File System, Cassandra, OpenStack Swift, S3

Pseudo-distributed mode: single machine, uses the local file system

Python vs Scala on Spark

Scala is faster than Python, but that is not so important here: most of the computation is done inside Spark itself.

Using Python with Spark, Python data has to be:

Converted between Python format and Scala/Java format
Sent between the Python process and the JVM

Timeline

1991 - Java project started

1995 - Java 1.0 released, Design Patterns book published

2000 - Java 1.3

2001 - Scala project started

2002 - Nutch started

2004 - Google MapReduce paper

Scala version 1 released

2005 - F# released

2006 - Hadoop split from Nutch

Scala version 2 released

2007 - Clojure released

2009 - Spark project started

2012 - Hadoop 1.0

2014 - Spark 1.0

Major Parts of Spark

Spark Core: Resilient Distributed Dataset (RDD)

Spark SQL: SQL, csv, json; DataFrame

Spark Streaming: near real-time response

MLlib machine learning library: statistics, regression, clustering, dimension reduction, feature extraction, optimization

GraphX

Spark

Ecosystem of packages, libraries, and systems on top of Spark Core

Unstructured API
  Resilient Distributed Datasets (RDD)
  Accumulators
  Broadcast variables

Structured API
  DataFrames, Datasets, Spark SQL
  Newer, faster, higher level; preferred over the Unstructured API

Basic Architecture

Local Mode

Diagram: a single driver process containing the SparkSession, the user code, and the worker threads.

We will start in local mode. Use local mode to develop Spark code.

SparkContext

Connection to the Spark cluster; runs on the master node
Used to create RDDs, accumulators, and broadcast variables
Only one SparkContext per JVM: stop() the current SparkContext before starting another

SparkContext (org.apache.spark.SparkContext) - Scala version

JavaSparkContext (org.apache.spark.api.java.JavaSparkContext) - Java version

Entry point for Unstructured API
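A minimal sketch of the lifecycle in a standalone program (the app name is made up); note the stop() call, required before another SparkContext can be created in the same JVM:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: create a SparkContext, use it, then stop it
val conf = new SparkConf().setAppName("ContextDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
val data = sc.parallelize(1 to 10)  // RDD from a local collection
println(data.sum())                 // 55.0
sc.stop()                           // only one SparkContext per JVM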

SparkSession

Contains a SparkContext

Entry point to use Dataset & DataFrame

Connection to the Spark cluster; runs on the master node

org.apache.spark.sql.SparkSession
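A short sketch, assuming local mode and a made-up app name, showing that a SparkSession wraps a SparkContext:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("SessionDemo")
  .master("local[*]")
  .getOrCreate()

val sc = spark.sparkContext  // the contained SparkContext (Unstructured API)
val ds = spark.range(5)      // Dataset entry point (Structured API)
spark.stop()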

Major Data Structures

Resilient Distributed Datasets (RDDs): fault-tolerant collections of elements that can be operated on in parallel

Datasets & DataFrames: fault-tolerant collections of elements that can be operated on in parallel; rows and columns; JSON, csv, SQL tables; part of Spark SQL

Partitions

RDDs and Datasets are divided into partitions; each partition can be on a different machine.
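A sketch, in the Spark shell where spark is provided, of inspecting and changing the partition count:

val df = spark.range(1000)
println(df.rdd.getNumPartitions)  // in local mode, defaults to the number of cores

val repartitioned = df.repartition(8)        // transformation: redistribute into 8 partitions
println(repartitioned.rdd.getNumPartitions)  // 8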

Resilient & Distributed

Distributed: partitions on different machines

Resilient: each partition can be replicated on multiple machines, and the data structure knows how to reproduce the operations that built it
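The recorded operations (the lineage) can be inspected with toDebugString; a sketch assuming the shell's sc:

val base = sc.parallelize(1 to 100, 4)            // 4 partitions
val derived = base.map(_ * 2).filter(_ % 3 == 0)  // recorded, not yet run
println(derived.toDebugString)  // prints the lineage used to recompute lost partitions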

Basic Operations

RDDs, DataFrames, and Datasets are immutable.

Transformations
  Create a new dataset (RDD) from an existing one
  Lazy: only executed when needed by an action
  Examples: map, filter, sample, union, distinct, groupByKey, repartition

Actions
  Return results to the driver program
  Examples: reduce, collect, count, first, take
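A sketch of the laziness: no job runs until the action on the last line.

val nums = spark.range(10)                    // transformation: nothing executes yet
val doubled = nums.selectExpr("id * 2 AS d")  // still nothing
val n = doubled.count()                       // action: Spark now runs the plan
println(n)                                    // 10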

Actions & Transformations on DataSet

org.apache.spark.sql.Dataset

View the Spark Scala API

Starting Spark Scala REPL

./bin/spark-shell

From Spark installation

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@508abc74

scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1c618295

Provided variables - Spark Scala REPL only

Starting spark shell

Al pro 9->spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/09/24 19:57:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:58:03 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.9:4040
Spark context available as 'sc' (master = local[*], app id = local-1506308274361).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

Sample Interaction

where - transformation
count - action

scala> val range = spark.range(100)
range: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val rangeWithLabel = range.toDF("number")
rangeWithLabel: org.apache.spark.sql.DataFrame = [number: bigint]

scala> val divisibleBy2 = rangeWithLabel.where("number % 2 = 0")
divisibleBy2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [number: bigint]

scala> divisibleBy2.count()
res2: Long = 50

Sample Interaction

scala> val filePath = "/Users/whitney/test/README.md"
filePath: String = /Users/whitney/test/README.md

scala> val textFile = spark.read.textFile(filePath)
textFile: org.apache.spark.sql.Dataset[String] = [value: string]

scala> textFile.count()
res3: Long = 103

scala> textFile.first()
res4: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]

scala> linesWithSpark.count()
res5: Long = 20

spark.read returns DataFrameReader

Can read: csv, jdbc, json, ORC, Parquet, text
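The shorthand reader methods are equivalent to the general format/load pair; a sketch with a hypothetical path:

val df1 = spark.read.json("flight-data/2015-summary.json")                  // shorthand
val df2 = spark.read.format("json").load("flight-data/2015-summary.json")  // general form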

Setting number of Workers

./bin/spark-shell --master local[4]
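Other master settings follow the same pattern (the standalone cluster host below is hypothetical):

./bin/spark-shell --master local            # one worker thread
./bin/spark-shell --master local[4]         # four worker threads
./bin/spark-shell --master local[*]         # one thread per core
./bin/spark-shell --master spark://host:7077   # standalone cluster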

Application using SBT

name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.11"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"

build.sbt
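With that build.sbt, one way to build and run is sbt package plus spark-submit; the jar name below follows sbt's naming convention for the name and version above, and the class name matches the sample program below:

sbt package

./bin/spark-submit \
  --class MasterConnect \
  --master "local[*]" \
  target/scala-2.11/simple-project_2.11-1.0.jar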

Sample Program Using RDDs

import org.apache.spark.{SparkConf, SparkContext}

object MasterConnect {
  def main(args: Array[String]): Unit = {
    val filePath = "/Users/whitney/test/README.md"
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val minPartitions = 2
    val logData = sc.textFile(filePath, minPartitions)
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}

Sample Program Using Spark SQL

import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Word Count")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val filePath = "/Users/whitney/test/README.md"
    val textFile = spark.read.textFile(filePath)
    val linesWithSpark = textFile.filter(line => line.contains("Spark"))
    val sparkCount = linesWithSpark.count()
    println(sparkCount)
  }
}

Output

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/09/24 19:53:17 INFO SparkContext: Running Spark version 2.2.0
17/09/24 19:53:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/09/24 19:53:19 INFO SparkContext: Submitted application: Datasets Test
17/09/24 19:53:19 INFO SecurityManager: Changing view acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls to: whitney
17/09/24 19:53:19 INFO SecurityManager: Changing view acls groups to:
17/09/24 19:53:19 INFO SecurityManager: Changing modify acls groups to:
17/09/24 19:53:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(whitney); groups with view permissions: Set(); users with modify permissions: Set(whitney); groups with modify permissions: Set()
17/09/24 19:53:20 INFO Utils: Successfully started service 'sparkDriver' on port 61753.
17/09/24 19:53:20 INFO SparkEnv: Registering MapOutputTracker
17/09/24 19:53:20 INFO SparkEnv: Registering BlockManagerMaster
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/09/24 19:53:20 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/09/24 19:53:20 INFO DiskBlockManager: Created local directory at /private/var/folders/br/q_fcsjqc8xj9qn0059bctj3h0000gr/T/blockmgr-45dac785-124b-4203-a39e-895866d2cca3
17/09/24 19:53:20 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/09/24 19:53:20 INFO SparkEnv: Registering OutputCommitCoordinator
17/09/24 19:53:20 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/09/24 19:53:21 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.9:4040
17/09/24 19:53:21 INFO Executor: Starting executor ID driver on host localhost
17/09/24 19:53:21 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 61754.

DataFrame, DataSet & RDD

What are they?

What is the difference?

When to use which one?

Which languages can use them?

DataFrame

+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Table with rows and columns

Schema: column labels, column types

Row: org.apache.spark.sql.Row

Partitioner: distributes the DataFrame among the cluster

Plan: series of transformations to perform on the DataFrame

Languages: Scala, Java, JVM languages, Python, R

Optimized: Spark Catalyst Optimizer

DataSet

Same as DataFrame except for the rows

Programmer defines the row class: Scala case class or Java Bean

Difference from DataFrame: the compiler knows the column names and column types in a DataSet

Compile-time error checking

Better data layout

Languages: Scala, Java, JVM languages
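A sketch, in the Spark shell, of a Dataset over a programmer-defined case class; Person is made up for illustration, and import spark.implicits._ supplies the encoders:

case class Person(name: String, age: Long)

import spark.implicits._
val people = Seq(Person("Andy", 30), Person("Justin", 19)).toDS()
val names = people.map(_.name)  // checked at compile time: Person has a name field
names.show()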

RDD

+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

Table with no information about types

No compile-time or runtime type checking

Far fewer optimizations: no Catalyst Optimizer, no space optimization

Example - the same data: RDD 33.3 MB, DataFrame 7.3 MB

Languages: Java, Scala; Python and R not recommended

Shares the same basic operations as DataFrames & DataSets

Spark Types

Java types are not space efficient: "abcd" takes 48 bytes

Spark has its own types

Special memory representation of each type: space efficient, cache aware

Spark        Scala  Python       Python API
ByteType     Byte   int or long  ByteType()
ShortType    Short  int or long  ShortType()
IntegerType  Int    int or long  IntegerType()
LongType     Long   int or long  LongType()

Structured versus Unstructured

Structured = DataSet, DataFrame

Unstructured = RDD

Typed versus Untyped

Typed = DataSet

Untyped = DataFrame

Some Sample Data

{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15} {"ORIGIN_COUNTRY_NAME":"Croatia","DEST_COUNTRY_NAME":"United States","count":1} {"ORIGIN_COUNTRY_NAME":"Ireland","DEST_COUNTRY_NAME":"United States","count":344} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Egypt","count":15} {"ORIGIN_COUNTRY_NAME":"India","DEST_COUNTRY_NAME":"United States","count":62} {"ORIGIN_COUNTRY_NAME":"Singapore","DEST_COUNTRY_NAME":"United States","count":1} {"ORIGIN_COUNTRY_NAME":"Grenada","DEST_COUNTRY_NAME":"United States","count":62} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Costa Rica","count":588} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Senegal","count":40} {"ORIGIN_COUNTRY_NAME":"United States","DEST_COUNTRY_NAME":"Moldova","count":1}

JSON flight data, 2015

United States Bureau of Transportation Statistics

Spark: The Definitive Guide, Chambers & Zaharia, O'Reilly Media, Inc., 2017-10-??

2015-summary.json


scala> val jsonFlightFile = "/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json"
jsonFlightFile: String = /Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summary.json

scala> val flightData2015 = spark.read.json(jsonFlightFile)
flightData2015: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> flightData2015.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([United States,Romania,15], [United States,Croatia,1])

Explain - Spark Plan

scala> flightData2015.explain()
== Physical Plan ==
*FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>


scala> val sortedFlightData2015 = flightData2015.sort("count")
sortedFlightData2015: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

scala> sortedFlightData2015.explain()
== Physical Plan ==
*Sort [count#46L ASC NULLS FIRST], true, 0
+- Exchange rangepartitioning(count#46L ASC NULLS FIRST, 200)
   +- *FileScan json [DEST_COUNTRY_NAME#44,ORIGIN_COUNTRY_NAME#45,count#46L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkBookData/flight-data/json/2015-summ..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:bigint>

scala> sortedFlightData2015.take(2)
res6: Array[org.apache.spark.sql.Row] = Array([United States,Singapore,1], [Moldova,United States,1])

Conceptual Plan

Lazy

Spark stores the plan in case it needs to recompute the result

Schema

scala> val jsonSchema = spark.read.json(jsonFlightFile).schema
jsonSchema: org.apache.spark.sql.types.StructType = StructType(
  StructField(DEST_COUNTRY_NAME,StringType,true),
  StructField(ORIGIN_COUNTRY_NAME,StringType,true),
  StructField(count,LongType,true))

scala> val flightData2015 = spark.read.json(jsonFlightFile)
flightData2015: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]

StructField
  name - the name of this field
  dataType - the data type of this field
  nullable - indicates if values of this field can be null
  metadata
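Instead of inferring the schema, the same StructType can be built by hand and handed to the reader; a sketch reusing the jsonFlightFile path from above:

import org.apache.spark.sql.types._

val flightSchema = StructType(Seq(
  StructField("DEST_COUNTRY_NAME", StringType, nullable = true),
  StructField("ORIGIN_COUNTRY_NAME", StringType, nullable = true),
  StructField("count", LongType, nullable = true)))

val flightData2015 = spark.read.schema(flightSchema).json(jsonFlightFile)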

Schema

Spark was able to infer the schema since JSON objects have labels and types

{"ORIGIN_COUNTRY_NAME":"Romania","DEST_COUNTRY_NAME":"United States","count":15}

Other data formats are less structured

Reading CSV

+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

name,age
Andy,30
Justin,19
Michael,

people.csv

root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

scala> val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"

scala> val reader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@288aaeaf

scala> reader.option("header", true)
scala> reader.option("inferSchema", true)

scala> val df = reader.csv(peopleFile)
df: org.apache.spark.sql.DataFrame = [name: string, age: int]

Reading CSV

scala> df.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+

scala> df.schema
res10: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))

scala> df.printSchema
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Some CSV options

encoding
sep(arator)
header
inferSchema
ignoreLeadingWhiteSpace
nullValue
dateFormat
timestampFormat

mode
  PERMISSIVE - sets the corrupt-record field on a corrupt record
  DROPMALFORMED - ignores whole corrupt records
  FAILFAST - throws an exception on a corrupt record
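A sketch combining several of these options on one reader, using the peopleFile path from the earlier example; the "NA" null marker is an assumed placeholder:

val df = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .option("nullValue", "NA")        // treat "NA" as null (assumed marker)
  .option("mode", "DROPMALFORMED")  // drop whole corrupt records
  .csv(peopleFile)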


import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf

object PeopleExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setAppName("Datasets Test")
    conf.setMaster("local[2]")
    val spark = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()

    val peopleFile = "/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv"
    val reader = spark.read
    reader.option("header", true)
    reader.option("inferSchema", true)
    val df = reader.csv(peopleFile)
    df.show
    df.printSchema
    spark.stop
  }
}

Type Inference Issues

val reader = spark.read

What type is reader?

scala> val reader: DataFrameReader = spark.read
<console>:23: error: not found: type DataFrameReader
       val reader: DataFrameReader = spark.read

scala> val reader: org.apache.spark.sql.DataFrameReader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@3982832f

scala> import org.apache.spark.sql.DataFrameReader
import org.apache.spark.sql.DataFrameReader

scala> val reader: DataFrameReader = spark.read
reader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@e6d4002

We can select columns

name,age
Andy,30
Justin,19
Michael,

people.csv

scala> val names = df.select("name")
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

We can select columns

name,age
Andy,30
Justin,19
Michael,

people.csv

scala> import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.col

scala> val names = df.select(col("name"))
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

If you Don’t lk Abrvtns

name,age
Andy,30
Justin,19
Michael,

people.csv

scala> import org.apache.spark.sql.functions.column
import org.apache.spark.sql.functions.column

scala> val names = df.select(column("name"))
names: org.apache.spark.sql.DataFrame = [name: string]

scala> names.show
+-------+
|   name|
+-------+
|   Andy|
| Justin|
|Michael|
+-------+

Column Operations

scala> val older = df.select(col("name"), col("age").plus(1))
older: org.apache.spark.sql.DataFrame = [name: string, (age + 1): int]

scala> older.show
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|   Andy|       31|
| Justin|       20|
|Michael|     null|
+-------+---------+

scala> older.printSchema
root
 |-- name: string (nullable = true)
 |-- (age + 1): integer (nullable = true)

scala> val older = df.select($"name", $"age" + 1)

Column Operations - Java vs Scala

df.select(col("name"), col("age").plus(1))   // Java or Scala

df.select($"name", $"age" + 1)               // Scala only


scala> val adult = older.filter($"age" > 21)
scala> val adult = older.filter(col("age") > 21)
scala> val adult = older.filter(col("age").gt(21))
adult: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [name: string, (age + 1): int]

scala> adult.show
+----+---------+
|name|(age + 1)|
+----+---------+
|Andy|       31|
+----+---------+

scala> adult.explain
== Physical Plan ==
*Project [name#104, (age#105 + 1) AS (age + 1)#123]
+- *Filter (isnotnull(age#105) && (age#105 > 21))
   +- *FileScan csv [name#104,age#105] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/whitney/Courses/696/Fall17/SparkExamples/people.csv], PartitionFilters: [], PushedFilters: [IsNotNull(age), GreaterThan(age,21)], ReadSchema: struct<name:string,age:int>


scala> df.groupBy("age").count.show
+----+-----+
| age|count|
+----+-----+
|null|    1|
|  19|    1|
|  30|    1|
+----+-----+

Saving DataFrames

{"name":"Andy","age":30} {"name":"Justin","age":19} {"name":"Michael"}

Formats: json, parquet, jdbc, orc, libsvm, csv, text

scala> df.write.format("json").save("people.json")

Produces a directory, people.json, with contents:

_SUCCESS (a 0-byte file)
part-00000-71516d50-2bcc-4830-ad61-554d1c107f51-c000.json
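A sketch of two common variations, with made-up output paths: overwriting an existing directory, and coalescing to a single part file:

df.write
  .format("json")
  .mode("overwrite")   // replace the directory if it already exists
  .save("people.json")

df.coalesce(1)         // one partition, so one part file
  .write
  .option("header", true)
  .csv("people-csv")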

Using SQL

scala> df.createOrReplaceTempView("people")

scala> val sqlExample = spark.sql("SELECT * FROM people")
sqlExample: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> sqlExample.show
+-------+----+
|   name| age|
+-------+----+
|   Andy|  30|
| Justin|  19|
|Michael|null|
+-------+----+
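Any SQL against the registered view returns a DataFrame; a small sketch:

val adults = spark.sql("SELECT name FROM people WHERE age >= 21")
adults.show()  // only Andy qualifies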

show

Action
No return value: it only prints the value, so you cannot use the result

What happens on a cluster?
Actions return their value to the master node
But jobs often run in batch mode

Collect - Returns a result

scala> val data = sqlExample.collect
data: Array[org.apache.spark.sql.Row] = Array([Andy,30], [Justin,19], [Michael,null])

scala> data(0)
res22: org.apache.spark.sql.Row = [Andy,30]

scala> data(0)(0)
res23: Any = Andy
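Rows also have typed accessors, which avoid the Any result above; a sketch:

val first = data(0)                       // [Andy,30]
val name = first.getString(0)             // "Andy" as a String
val age = first.getInt(1)                 // 30 as an Int
val byName = first.getAs[String]("name")  // lookup by column name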

