Agenda
● Brief Review of Spark (15 min)
● Intro to Spark SQL (30 min)
● Code session 1: Lab (45 min)
● Break (15 min)
● Intermediate Topics in Spark SQL (30 min)
● Code session 2: Quiz (30 min)
● Wrap up (15 min)
Spark Review
By Aaron Merlob
Apache Spark
● Open-source cluster computing framework
● “Successor” to Hadoop MapReduce
● Supports Scala, Java, and Python!
https://en.wikipedia.org/wiki/Apache_Spark
Spark Core + Libraries
https://spark.apache.org
Resilient Distributed Dataset
● Distributed collection
● Fault-tolerant
● Parallel operation - partitioned
● Many data sources

Implementation...
RDD - Main Abstraction
● Immutable
● Lazily Evaluated
● Cachable
● Type Inferred (Scala)
RDD Operations
● Transformations
● Actions
Cache & Persist
● Transformed RDDs are recomputed on each action
● Store RDDs in memory using cache (or persist)
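A minimal sketch of the difference (words.txt is a hypothetical input path):

import org.apache.spark.storage.StorageLevel

val words = sc.textFile("words.txt")
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

counts.cache()  // shorthand for persist(StorageLevel.MEMORY_ONLY)
// or, to spill to disk when memory is tight (pick one; a level
// cannot be changed after it is set):
// counts.persist(StorageLevel.MEMORY_AND_DISK)

counts.count()  // first action computes the lineage and caches the result
counts.count()  // second action reads from memory, no recomputation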
SparkContext
● Your way to get data into/out of RDDs
● Given as ‘sc’ when you launch the Spark shell

For example: sc.parallelize()
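A quick sketch of moving data into and out of an RDD through sc (data.txt is a hypothetical path):

val nums = sc.parallelize(1 to 10)   // local collection -> distributed RDD
val doubled = nums.map(_ * 2)        // transformation (lazy)
doubled.collect()                    // action: RDD -> local Array(2, 4, ..., 20)

val lines = sc.textFile("data.txt")  // file -> RDD of lines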
Transformation vs. Action?

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", ""))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2)
result.filter(kv => kv._1.contains("a")).count()
result.filter { case (k, v) => v > 2 }.count()
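For reference: parallelize creates the RDD; flatMap, map, reduceByKey, and filter are transformations (lazy; each returns a new RDD); count() is an action that triggers a job and returns a value to the driver. Without caching, each count() recomputes the whole lineage.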
Transformation vs. Action? (with caching)

val data = sc.parallelize(Seq(
  "Aaron Aaron", "Aaron Brian", "Charlie", ""))
val words = data.flatMap(d => d.split(" "))
val result = words.map(word => (word, 1)).
  reduceByKey((v1, v2) => v1 + v2).cache()
result.filter(kv => kv._1.contains("a")).count()
result.filter { case (k, v) => v > 2 }.count()

With cache(), result is computed once by the first count() and reused by the second.
Spark SQL
By Aaron Merlob
Spark SQL
RDDs with Schemas!
Schemas = Table Names + Column Names + Column Types = Metadata
Schemas
● Schema Pros
  ○ Enable column names instead of column positions
  ○ Queries using SQL (or DataFrame) syntax
  ○ Make your data more structured
● Schema Cons
  ○ Make your data more structured
  ○ Reduce future flexibility (the app becomes more fragile)
  ○ Y2K
HiveContext

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

FYI - a less preferred alternative: org.apache.spark.sql.SQLContext. HiveContext provides a superset of SQLContext’s functionality (a more complete SQL parser plus access to Hive UDFs), so it is the recommended starting point.
DataFrames
● Primary abstraction in Spark SQL
● Evolved from SchemaRDD
● Exposes functionality via SQL or DataFrame API
  ○ SQL for developer productivity (ETL, BI, etc.)
  ○ DF for data scientist productivity (R / Pandas)
Live Coding - Spark-Shell

Maven packages for CSV and Avro:
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1

spark-shell --packages $SPARK_PKGS

(where SPARK_PKGS holds the coordinates above, comma-separated)
Live Coding - Loading CSV

val path = "AAPL.csv"
val df = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load(path)
df.registerTempTable("stocks")
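With the table registered, SQL runs directly against it; a minimal sketch (the Date and Close column names assume a standard Yahoo-style AAPL.csv header):

sqlContext.sql("SELECT Date, Close FROM stocks LIMIT 5").show()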
Caching
If I run a query twice, how many times will the data be read from disk?

1. RDDs are lazy.
2. Therefore the data will be read twice.
3. Unless you cache the RDD, all transformations in the RDD’s lineage will execute on each action.
Caching Tables

sqlContext.cacheTable("stocks")

Particularly useful when using Spark SQL to explore data, and when your data is on S3.

sqlContext.uncacheTable("stocks")
Caching in SQL

SQL Command               | Speed
--------------------------|--------
`CACHE TABLE sales;`      | Eagerly
`CACHE LAZY TABLE sales;` | Lazily
`UNCACHE TABLE sales;`    | Eagerly
Caching Comparison
Caching Spark SQL DataFrames vs caching plain non-DataFrame RDDs:
● RDDs are cached at the level of individual records
● DataFrames know more about the data
● DataFrames are cached using an in-memory columnar format
Caching Comparison
What is the difference between these?
(a) sqlContext.cacheTable("df_table")
(b) df.cache
(c) sqlContext.sql("CACHE TABLE df_table")
Lab 1
Spark SQL Workshop
Spark SQL, the Sequel
By Aaron Merlob
Live Coding - Filetype ETL
● Read in a CSV
● Export as JSON or Parquet
● Read JSON
(see the sketch below)
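A minimal sketch of this round-trip, reusing the spark-csv package loaded earlier (output paths are illustrative):

val csv = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").
  load("AAPL.csv")

csv.write.json("aapl-json")        // export as line-delimited JSON
csv.write.parquet("aapl-parquet")  // or as Parquet

val back = sqlContext.read.json("aapl-json")  // read the JSON back in
back.printSchema()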
Live Coding - Common
● Show
● Sample
● Take
● First
(examples below)
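Quick illustrations of each, on the DataFrame df loaded above:

df.show(5)                    // print the first 5 rows as a table
df.sample(false, 0.1).show()  // ~10% random sample, without replacement
df.take(3)                    // Array of the first 3 Rows, on the driver
df.first()                    // just the first Row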
Read Formats

Format  | Read
--------|---------------------------------------------------------------
Parquet | sqlContext.read.parquet(path)
ORC     | sqlContext.read.orc(path)
JSON    | sqlContext.read.json(path)
CSV     | sqlContext.read.format("com.databricks.spark.csv").load(path)
Write Formats

Format  | Write
--------|---------------------------------------------------------------
Parquet | df.write.parquet(path)
ORC     | df.write.orc(path)
JSON    | df.write.json(path)
CSV     | df.write.format("com.databricks.spark.csv").save(path)

(Note: writing goes through a DataFrame’s write method, not the sqlContext.)
Schema Inference

Infer schema of JSON files:
● By default it scans the entire file.
● It finds the broadest type that will fit each field.
● Inference runs as a distributed RDD job, so even a full scan is reasonably fast.

Infer schema of CSV files:
● The CSV parser (with the inferSchema option) uses the same logic as the JSON parser.
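A sketch of both readers; inferSchema is the spark-csv flag used earlier, and samplingRatio is assumed to be the JSON reader’s option for inferring from a subset of records (events.json is a hypothetical path):

val csvDf = sqlContext.read.
  format("com.databricks.spark.csv").
  option("header", "true").
  option("inferSchema", "true").   // scan to find the broadest type per column
  load("AAPL.csv")
csvDf.printSchema()

val jsonDf = sqlContext.read.
  option("samplingRatio", "0.5").  // assumption: infer from ~half the records
  json("events.json")              // hypothetical path
jsonDf.printSchema()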
User Defined Functions
How do you apply a “UDF”?
● Import types (StringType, IntegerType, etc.)
● Create UDF (in Scala)
● Apply the function (in SQL)

Notes:
● UDFs can take single or multiple arguments
● Optional registerFunction arg2: ‘return type’
Live Coding - UDF
● Import types (StringType, IntegerType, etc.)
● Create UDF (in Scala)
● Apply the function (in SQL)
(a sketch follows below)
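A minimal sketch of the three steps (the len function and the Date column are illustrative; in Scala the return type is inferred, so no explicit type argument is needed):

// Step 1: type imports (used when declaring types explicitly)
import org.apache.spark.sql.types.{StringType, IntegerType}

// Step 2: create and register the UDF in Scala
sqlContext.udf.register("len", (s: String) => s.length)

// Step 3: apply it in SQL against the table registered earlier
sqlContext.sql("SELECT Date, len(Date) FROM stocks LIMIT 5").show()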
Live Coding - Autocomplete
Find all types available for SQL schemas + UDFs

Types and their meanings:
StringType  = String
IntegerType = Int
DoubleType  = Double
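The same types can also declare a schema up front instead of inferring one; a sketch (prices.csv and its columns are hypothetical):

import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

val schema = StructType(Seq(
  StructField("symbol", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true)))

val prices = sqlContext.read.
  format("com.databricks.spark.csv").
  schema(schema).      // skip inference; use the declared schema
  load("prices.csv")   // hypothetical path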
Spark UI on port 4040