
Atlanta Spark User Meetup 09 22 2016

Transcript
Page 1: Atlanta Spark User Meetup 09 22 2016

After Dark 2.0
End-to-End, Real-time, Advanced Analytics and ML Big Data Reference Pipeline

Atlanta Spark User Group
Sept 22, 2016

Thanks Emory Continuing Education!

Chris Fregly
Research Scientist @ PipelineIO

We're Hiring - Only Nice People!

pipeline.io | advancedspark.com

Page 2: Atlanta Spark User Meetup 09 22 2016

Who Am I?

Research Scientist @ PipelineIO
github.com/fluxcapacitor/pipeline

Meetup Founder: Advanced Spark and Tensorflow Meetup

Book Author: Advanced …

Page 3: Atlanta Spark User Meetup 09 22 2016

Who Was I?

Streaming Data Engineer, Netflix Open Source Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Technology Center

Page 4: Atlanta Spark User Meetup 09 22 2016

Advanced Spark and Tensorflow Meetup: Meetup Metrics

Top 5 Most Active Spark Meetup!
4000+ Members in just 1 year!!
6000+ Downloads of the Docker Image, with many Meetup Demos!!!
@ advancedspark.com

Meetup Goals
Code dive deep into Spark and related open source code bases.
Study integrations with Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
Surface and share the patterns and idioms of well-designed, distributed, big data processing systems.

Page 5: Atlanta Spark User Meetup 09 22 2016


Atlanta Hadoop User Meetup Last Night

http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016

Page 6: Atlanta Spark User Meetup 09 22 2016


MLconf ATL Tomorrow

Page 7: Atlanta Spark User Meetup 09 22 2016

Current PipelineIO Research

Model Deploying and Testing
Model Scaling and Serving
Online Model Training
Dynamic Model Optimizing

Page 8: Atlanta Spark User Meetup 09 22 2016

PipelineIO Deliverables: 100% Open Source!!

Github: https://github.com/fluxcapacitor/
DockerHub: https://hub.docker.com/r/fluxcapacitor
Workshop: http://pipeline.io

Page 9: Atlanta Spark User Meetup 09 22 2016

Topics of This Talk (20-30 mins each)

① Spark Streaming and Spark ML: Generating Real-time Recommendations
② Spark Core: Tuning and Profiling
③ Spark SQL: Tuning and Customizing

Page 10: Atlanta Spark User Meetup 09 22 2016

Live, Interactive Demo!

Kafka, Cassandra, ElasticSearch, Redis, Spark ML

Page 11: Atlanta Spark User Meetup 09 22 2016

Audience Participation Required!

Audience Instructions:
① Navigate to demo.pipeline.io
② Swipe software used in Production Only!

This is Totally Anonymous!!

Page 12: Atlanta Spark User Meetup 09 22 2016

Topics of This Talk (15-20 mins each)

① Spark Streaming and Spark ML: Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core: Tuning and Profiling
③ Spark SQL: Tuning and Customizing

Page 13: Atlanta Spark User Meetup 09 22 2016

Mechanical Sympathy

"Hardware and software working together."
- Martin Thompson, http://mechanical-sympathy.blogspot.com

"Whatever your data structure, my array will win."
- Scott Meyers (every C++ book, basically)

Page 14: Atlanta Spark User Meetup 09 22 2016

Spark and Mechanical Sympathy

Project Tungsten (Spark 1.4-1.6+): Minimize Memory & GC, Maximize CPU Cache
100TB GraySort Challenge (Spark 1.1-1.2): Saturate Network I/O, Saturate Disk I/O

Page 15: Atlanta Spark User Meetup 09 22 2016

CPU Cache Refresher

(Diagram: the CPU cache hierarchy of my laptop; the last-level cache is aka the "LLC".)

Page 16: Atlanta Spark User Meetup 09 22 2016

CPU Cache Sympathy (AlphaSort paper)

Pointer-only sort: Pointer (4 bytes) = 4 bytes.
  Must dereference to compare each key.

Pre-process & pull the key from the record:
  Key (10 bytes) + Pointer (4 bytes) = 14 bytes.
  Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes: padded to be CPU cache-line friendly!

Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes: 2x CPU cache-line friendly!
  Dereference the full key only to resolve prefix duplicates.
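
To make the trick concrete, here is a minimal Scala sketch (my illustration, not the deck's code; Entry, prefixSort, and the UTF-8 packing are made-up names, and real implementations work on raw bytes rather than Strings):

import java.lang.Long.compareUnsigned

case class Entry(prefix: Long, recordIndex: Int)  // fixed-width key-prefix + "pointer"

def prefixSort(records: Array[String]): Array[String] = {
  // Pre-process: pull an 8-byte prefix out of each record exactly once
  val entries = records.zipWithIndex.map { case (rec, i) =>
    val bytes  = rec.getBytes("UTF-8").padTo(8, 0.toByte)
    val prefix = bytes.take(8).foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xFFL))
    Entry(prefix, i)
  }
  // Compare the compact, cache-friendly prefixes; dereference the full
  // record only to resolve prefix duplicates
  val sorted = entries.sortWith { (a, b) =>
    if (a.prefix != b.prefix) compareUnsigned(a.prefix, b.prefix) < 0
    else records(a.recordIndex) < records(b.recordIndex)
  }
  sorted.map(e => records(e.recordIndex))
}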

Page 17: Atlanta Spark User Meetup 09 22 2016


Sort Performance Comparison


Page 18: Atlanta Spark User Meetup 09 22 2016


Sequential vs Random Cache Misses


Page 19: Atlanta Spark User Meetup 09 22 2016


Demo! Sorting

Page 20: Atlanta Spark User Meetup 09 22 2016

Instrumenting and Monitoring CPU

Use the Linux perf command!

http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html

Page 21: Atlanta Spark User Meetup 09 22 2016

Results of Random vs. Sequential Sort

Naive random-access pointer sort (must dereference to compare each key) vs. cache-friendly sequential key/pointer sort (pre-process & pull the key from the record):

(Chart: % change moving from the naive to the cache-friendly sort: roughly -26% to -90% across the measured cache counters.)

perf stat --event \
  L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses

Page 22: Atlanta Spark User Meetup 09 22 2016


Demo! Matrix Multiplication

Page 23: Atlanta Spark User Meetup 09 22 2016

CPU Cache Naive Matrix Multiplication

// Dot product of each row & column vector
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)

Bad: matB(k)(j) walks down a column (k varies fastest), so each access lands on a new cache line: not using the full CPU cache line, ineffective prefetching.

Page 24: Atlanta Spark User Meetup 09 22 2016

CPU Cache Friendly Matrix Multiplication

// Transpose B first (matBT is numColsB x numRowsB)
for (i <- 0 until numColsB)
  for (j <- 0 until numRowsB)
    matBT(i)(j) = matB(j)(i)

// Modify the algo to use the transposed B
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)   // OLD: matB(k)(j)

Good: full CPU cache line, effective prefetching; both matrices are now walked row-wise (k varies fastest in matA(i)(k) and matBT(j)(k)).

Page 25: Atlanta Spark User Meetup 09 22 2016

Results of Matrix Multiplication

Naive vs. cache-friendly matrix multiply:

(Chart: % change from naive to cache-friendly: roughly -53% to -96% across the cache-miss and stalled-cycle counters, plus one prefetch counter at +8543%(?); see the demo.)

perf stat --event \
  L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, \
  LLC-prefetch-misses,cache-misses,stalled-cycles-frontend

Page 26: Atlanta Spark User Meetup 09 22 2016


Demo! Thread Synchronization

Page 27: Atlanta Spark User Meetup 09 22 2016

Thread and Context Switch Sympathy

Problem: atomically increment 2 counters (each by a different increment) from 1000's of simultaneous threads.

Possible solutions (sketched on the following pages):
① Synchronized Immutable
② Synchronized Mutable
③ AtomicReference CAS
④ Volatile?

Context switches are expen$ive!!

Page 28: Atlanta Spark User Meetup 09 22 2016

Synchronized Immutable Counters

case class Counters(left: Int, right: Int)

object SynchronizedImmutableCounters {
  var counters = new Counters(0, 0)

  def getCounters(): Counters = {
    this.synchronized { counters }
  }

  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized {
      counters = new Counters(counters.left + leftIncrement,
                              counters.right + rightIncrement)
    }
  }
}

Locks the whole outer object!!

Page 29: Atlanta Spark User Meetup 09 22 2016

Synchronized Mutable Counters

class MutableCounters(left: Int, right: Int) {
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    this.synchronized { … }
  }

  def getCountersTuple(): (Int, Int) = {
    this.synchronized { (counters.left, counters.right) }
  }
}

object SynchronizedMutableCounters {
  val counters = new MutableCounters(0, 0)
  …
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    counters.increment(leftIncrement, rightIncrement)
  }
}

Locks just the MutableCounters instance.

Page 30: Atlanta Spark User Meetup 09 22 2016

Lock-Free AtomicReference Counters

case class Counters(left: Int, right: Int)

object LockFreeAtomicReferenceCounters {
  val counters = new AtomicReference[Counters](new Counters(0, 0))

  def getCounters(): Counters = counters.get()

  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var originalCounters: Counters = null
    var updatedCounters: Counters = null
    do {
      originalCounters = getCounters()
      updatedCounters = new Counters(originalCounters.left + leftIncrement,
                                     originalCounters.right + rightIncrement)
    } // Retry lock-free, optimistic compareAndSet() until the AtomicReference updates
    while (!counters.compareAndSet(originalCounters, updatedCounters))
  }
}

Lock Free!!

Page 31: Atlanta Spark User Meetup 09 22 2016

Lock-Free AtomicLong Counters

object LockFreeAtomicLongCounters {
  // A single Long (64-bit) will maintain 2 separate Ints (32-bits each)
  val counters = new AtomicLong()
  …
  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var originalCounters = 0L
    var updatedCounters = 0L
    do {
      originalCounters = counters.get()
      …
      // Store two 32-bit Ints into one 64-bit Long
      // Use >>> 32 and << 32 to set and retrieve each Int from the Long
    } // Retry lock-free, optimistic compareAndSet() until the AtomicLong updates
    while (!counters.compareAndSet(originalCounters, updatedCounters))
  }
}

Lock Free!!

Q: Why not use a @volatile Long?
A: The JVM does not guarantee atomic updates of 64-bit longs and doubles.
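
Filling in the elided bit-twiddling, a complete hedged sketch (my reconstruction, not the deck's exact code):

import java.util.concurrent.atomic.AtomicLong

object PackedAtomicLongCounters {
  val counters = new AtomicLong()  // left Int in the high 32 bits, right Int in the low 32 bits

  def increment(leftIncrement: Int, rightIncrement: Int): Unit = {
    var original = 0L
    var updated  = 0L
    do {
      original  = counters.get()
      val left  = (original >>> 32).toInt + leftIncrement  // retrieve the high Int
      val right = original.toInt + rightIncrement          // retrieve the low Int
      // Store the two 32-bit Ints back into one 64-bit Long
      updated = (left.toLong << 32) | (right.toLong & 0xFFFFFFFFL)
    } while (!counters.compareAndSet(original, updated))
  }

  def getCounters(): (Int, Int) = {
    val v = counters.get()
    ((v >>> 32).toInt, v.toInt)
  }
}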

Page 32: Atlanta Spark User Meetup 09 22 2016

Results of Thread Synchronization

Synchronized immutable case class:

case class Counters(left: Int, right: Int)
...
this.synchronized {
  counters = new Counters(counters.left + leftIncrement,
                          counters.right + rightIncrement)
}

vs. lock-free AtomicLong:

val counters = new AtomicLong()
…
do { … } while (!counters.compareAndSet(originalCounters, updatedCounters))

(Chart: % change moving from the synchronized immutable version to the lock-free AtomicLong: roughly -17% to -64% across the measured counters.)

perf stat --event \
  context-switches,L1-dcache-load-misses,L1-dcache-prefetch-misses, \
  LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend

Page 33: Atlanta Spark User Meetup 09 22 2016

Profile Visualizations: Flame Graphs

Example: Spark Word Count

Java stack traces are good! JDK 1.8+ (-XX:-Inline -XX:+PreserveFramePointer)
Plateaus are bad! I/O stalls, heavy CPU serialization, etc.
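
As a rough recipe (following the linked Brendan Gregg post; the PID placeholder and script paths are illustrative, and the fold/render scripts come from github.com/brendangregg/FlameGraph):

# Sample on-CPU stacks of the JVM at 99 Hz for 30 seconds
perf record -F 99 -g -p <executor-jvm-pid> -- sleep 30

# Fold the stacks and render the SVG
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg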

Page 34: Atlanta Spark User Meetup 09 22 2016

Project Tungsten: CPU and Memory

Create custom data structures & algorithms
  Operate on serialized and compressed ByteArrays!

Minimize garbage collection
  Reuse ByteArrays
  In-place updates for aggregations

Maximize CPU cache effectiveness
  8-byte alignment
  AlphaSort-based key-prefix

Utilize Catalyst dynamic code generation
  Dynamic optimizations using the entire query plan
  Developer implements genCode() to create Scala source code (String)
  Project Janino compiles the source code into JVM bytecode

Page 35: Atlanta Spark User Meetup 09 22 2016

Why is CPU the Bottleneck?

CPU is needed for serialization, hashing, & compression.
The Spark 1.2 updates already saturated network and disk I/O.
10x increase in I/O throughput relative to CPU.
More partitioning, pruning, and pushdown support.
Newer columnar file formats help reduce I/O.

Page 36: Atlanta Spark User Meetup 09 22 2016

Custom Data Structs & Algos: Aggs

UnsafeFixedWidthAggregationMap
  Uses BytesToBytesMap internally
  In-place updates of the serialized aggregation
  No object creation on the hot path

TungstenAggregate & TungstenAggregationIterator
  Operates directly on serialized, binary UnsafeRow
  2 steps to avoid single-key OOMs:
  ① Hash-based (grouping) agg spills to disk if needed
  ② Sort-based agg performs an external merge sort on the spills

Page 37: Atlanta Spark User Meetup 09 22 2016

Custom Data Structures & Algorithms

o.a.s.util.collection.unsafe.sort:

UnsafeSortDataFormat: SortDataFormat<RecordPointerAndKeyPrefix, Long[]>
  (Note: mixing multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining.)
UnsafeExternalSorter: in-place external sorting of spilled BytesToBytes data
UnsafeShuffleWriter: supports merging compressed records (if the compression CODEC supports it, i.e. LZF)
UnsafeInMemorySorter: in-place sorting of BytesToBytesMap data
RecordPointerAndKeyPrefix: AlphaSort-based, 8-byte aligned sort key (Ptr + Key-Prefix: 2x CPU cache-line friendly!)

Page 38: Atlanta Spark User Meetup 09 22 2016

Code Generation

Problem
  Boxing creates excessive objects.
  Expression tree evaluations are costly.
  The JVM can't inline polymorphic implementations.
  (Lack of polymorphism == poor code design.)

Solution
  Code generation enables inlining.
  Rewrite and optimize code using the overall plan, with 8-byte alignment.
  Defer source code generation to each operator, UDF, UDAF.
  Use Janino to compile the generated source code into bytecode (more IDE friendly than Scala quasiquotes).

Page 39: Atlanta Spark User Meetup 09 22 2016

Autoscaling Spark Workers (Spark 1.5+)

Scaling up is easy :)
  SparkContext.requestExecutors() until the max is reached

Scaling down is hard :(
  SparkContext.killExecutors()
  You lose the RDD cache inside the Executor JVM
  Must rebuild the active RDD partitions in another Executor JVM

Uses the External Shuffle Service from Spark 1.1-1.2
  If an Executor JVM dies/restarts, the shuffle keeps shufflin'!
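
In practice this is driven by the standard dynamic-allocation settings (a hedged sketch; the min/max values are arbitrary examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")    // example floor
  .set("spark.dynamicAllocation.maxExecutors", "20")   // example ceiling
  .set("spark.shuffle.service.enabled", "true")        // shuffle files survive executor removal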

Page 40: Atlanta Spark User Meetup 09 22 2016

"Hidden" Spark Submit REST API (the snitch)

http://arturmkrtchyan.com/apache-spark-hidden-rest-api

Submit Spark Job:
curl -X POST http://127.0.0.1:6066/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{"action" : "CreateSubmissionRequest",
           "mainClass" : "org.apache.spark.examples.SparkPi",
           "sparkProperties" : {
             "spark.jars" : "file:/spark/lib/spark-examples-1.5.1.jar",
             "spark.app.name" : "SparkPi",
             ...
           }}'

Get Spark Job Status:
curl http://127.0.0.1:6066/v1/submissions/status/<job-id-from-submit-request>

Kill Spark Job:
curl -X POST http://127.0.0.1:6066/v1/submissions/kill/<job-id-from-submit-request>

Page 41: Atlanta Spark User Meetup 09 22 2016

Outline

① Spark Streaming and Spark ML: Kafka, Cassandra, ElasticSearch, Redis, Docker
② Spark Core: Tuning and Profiling
③ Spark SQL: Tuning and Customizing

Page 42: Atlanta Spark User Meetup 09 22 2016

Parquet Columnar File Format

Based on the Google Dremel paper (~2010); a collaboration with Twitter and Cloudera.
Columnar storage format for fast columnar aggregations.
Supports evolving schemas.
Supports pushdowns.
Supports nested partitions.
Tight compression.
Min/max heuristics enable file and chunk skipping.

Page 43: Atlanta Spark User Meetup 09 22 2016

Partitions

Partition based on data access patterns:

/genders.parquet/gender=M/...
                /gender=F/...   <-- Use Case: Access Users by Gender
                /gender=U/...

Dynamic Partition Creation (Write)
Dynamically create partitions on write, based on a column (i.e. gender):
SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT ...
DF:  gendersDF.write.format("parquet").partitionBy("gender")
       .save("/genders.parquet")

Partition Discovery (Read)
Dynamically infer partitions on read, based on paths (i.e. /gender=F/...):
SQL: SELECT id FROM genders WHERE gender='F'
DF:  sqlContext.read.format("parquet").load("/genders.parquet/")
       .select($"id").where("gender='F'")

Page 44: Atlanta Spark User Meetup 09 22 2016

Pruning

Partition Pruning
Filter out rows by partition:
SELECT id, gender FROM genders WHERE gender = 'F'

Column Pruning
Filter out columns by column filter.
Extremely useful for columnar storage formats (Parquet): skips entire blocks of columns.
SELECT id, gender FROM genders

Page 45: Atlanta Spark User Meetup 09 22 2016

Pushdowns

aka Predicate or Filter Pushdowns.
A predicate returns true or false for a given function; pushing it down filters rows deep inside the data source and reduces the number of rows returned.
The data source must implement PrunedFilteredScan:

def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

Page 46: Atlanta Spark User Meetup 09 22 2016


Demo! File Formats, Partitions, Pushdowns, and Joins

Page 47: Atlanta Spark User Meetup 09 22 2016

Predicate Pushdowns & Filter Collapsing

Filter pushdown: no extra pass through the data.
Filter combining: only 1 extra pass.
Neither: 2 extra passes through the data after retrieval.

Page 48: Atlanta Spark User Meetup 09 22 2016


Join Between Partitioned & Unpartitioned


Page 49: Atlanta Spark User Meetup 09 22 2016


Join Between Partitioned & Partitioned


Page 50: Atlanta Spark User Meetup 09 22 2016


Broadcast Join vs. Normal Shuffle Join

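
For reference, the broadcast variant can be requested explicitly in DataFrame code (a minimal sketch, assuming Spark 1.5+; largeDF and smallDF are placeholder DataFrames sharing an "id" column):

import org.apache.spark.sql.functions.broadcast

// Ship the small side to every executor; the large side is never shuffled
val joined = largeDF.join(broadcast(smallDF), "id")
joined.explain()  // the physical plan should show a broadcast join, not a shuffle join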

Page 51: Atlanta Spark User Meetup 09 22 2016


Cartesian Join vs. Inner Join


Page 52: Atlanta Spark User Meetup 09 22 2016

Visualizing the Query Plan

(Annotated Spark UI query plan, highlighting:)

Effectiveness of the filter.
Cost-based join optimization.
Broadcast join (similar to a MapReduce map-side join using the distributed cache).
Peak memory for joins and aggs: UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()

Page 53: Atlanta Spark User Meetup 09 22 2016

Data Source API

Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class): provides the schema of the data
  TableScan (impl): read all data from the source
  PrunedFilteredScan (impl): column pruning & predicate pushdowns
  InsertableRelation (impl): insert/overwrite data based on SaveMode
  RelationProvider (trait/interface): handles options, BaseRelation factory

Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class): handles all filters supported by this source
  EqualTo (impl)
  GreaterThan (impl)
  StringStartsWith (impl)

Page 54: Atlanta Spark User Meetup 09 22 2016


Native Spark SQL Data Sources


Page 55: Atlanta Spark User Meetup 09 22 2016

JSON Data Source

DataFrame:
val ratingsDF = sqlContext.read.format("json")
  .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")

-- or --

val ratingsDF = sqlContext.read.json(      // json() convenience method
  "file:/root/pipeline/datasets/dating/ratings.json.bz2")

SQL Code:
CREATE TABLE genders USING json
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")

Page 56: Atlanta Spark User Meetup 09 22 2016

Parquet Data Source

Configuration:
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.mergeSchema=false (unless your schema is evolving)
spark.sql.parquet.cacheMetadata=true (requires sqlContext.refreshTable())
spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]

DataFrames:
val gendersDF = sqlContext.read.format("parquet")
  .load("file:/root/pipeline/datasets/dating/genders.parquet")
gendersDF.write.format("parquet").partitionBy("gender")
  .save("file:/root/pipeline/datasets/dating/genders.parquet")

SQL:
CREATE TABLE genders USING parquet
OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

Page 57: Atlanta Spark User Meetup 09 22 2016

ElasticSearch Data Source

Github: https://github.com/elastic/elasticsearch-hadoop
Maven: org.elasticsearch:elasticsearch-spark_2.10:2.1.0

Code:
val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>",
                   "es.port" -> "<port>")
df.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Overwrite)
  .options(esConfig)
  .save("<index>/<document-type>")

Page 58: Atlanta Spark User Meetup 09 22 2016

Cassandra Data Source

Github: https://github.com/datastax/spark-cassandra-connector
Maven: com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1

Code:
ratingsDF.write
  .format("org.apache.spark.sql.cassandra")
  .mode(SaveMode.Append)
  .options(Map("keyspace" -> "<keyspace>", "table" -> "<table>"))
  .save(…)

Page 59: Atlanta Spark User Meetup 09 22 2016

Tips for Cassandra Analytics

Bypass the Cassandra CQL "front door"; CQL is optimized for transactions.
Bulk read and write directly against the SSTables; check out the Netflix OSS project "Aegisthus".
Cassandra then becomes a first-class analytics option, and a separate replicated analytics cluster is no longer needed.

Page 60: Atlanta Spark User Meetup 09 22 2016

Creating a Custom Data Source

① Study existing implementations: o.a.s.sql.execution.datasources.jdbc.JDBCRelation
② Extend the base traits & implement the required methods: o.a.s.sql.sources.{BaseRelation, PrunedFilteredScan}

Spark JDBC (o.a.s.sql.execution.datasources.jdbc):
class JDBCRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation

DataStax Cassandra (o.a.s.sql.cassandra):
class CassandraSourceRelation extends BaseRelation
  with PrunedFilteredScan
  with InsertableRelation

A minimal sketch follows.
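
A hedged, minimal sketch against the Spark 1.x Data Source API (the relation name, schema, in-memory rows, and filter handling are illustrative only; the RelationProvider needed to wire it into .format() is omitted):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types._

class InMemoryGendersRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(
    StructField("id", IntegerType) :: StructField("gender", StringType) :: Nil)

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    val data = Seq(Row(1, "F"), Row(2, "M"), Row(3, "U"))
    // Predicate pushdown: drop rows deep in the source, before returning them
    val kept = data.filter { row =>
      filters.forall {
        case EqualTo("gender", value) => row.getString(1) == value
        case _                        => true  // unhandled filters are re-applied by Spark
      }
    }
    // Column pruning: return only the requested columns, in the requested order
    val indices = requiredColumns.map(schema.fieldNames.indexOf(_))
    sqlContext.sparkContext.parallelize(
      kept.map(row => Row.fromSeq(indices.map(i => row.get(i)))))
  }
}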

Page 61: Atlanta Spark User Meetup 09 22 2016


Demo! Create a Custom Data Source

Page 62: Atlanta Spark User Meetup 09 22 2016


Publishing Custom Data Sources


spark-packages.org

Page 63: Atlanta Spark User Meetup 09 22 2016

Spark SQL UDF Code Generation

100+ UDFs now generating code, with more to come in Spark 1.6+.
Details in SPARK-8159, SPARK-9571.

Every UDF must use Expressions and implement Expression.genCode() to participate in the fun.
Lambdas (RDD or Dataset API) and sqlContext.udf.registerFunction() are not enough!!
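
For contrast, a plain registered function like this one stays opaque to Catalyst, so it can never participate in code generation (illustrative example using the standard udf.register call):

// A Scala lambda UDF: no Expression tree, no genCode(), no inlining
sqlContext.udf.register("toUpperCase", (s: String) => s.toUpperCase)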

Page 64: Atlanta Spark User Meetup 09 22 2016

Creating a Custom UDF with Code Gen

① Study existing implementations: o.a.s.sql.catalyst.expressions.Substring
② Extend and implement the base trait: o.a.s.sql.catalyst.expressions.Expression.genCode
③ Don't forget about Python! python.pyspark.sql.functions.py

Page 65: Atlanta Spark User Meetup 09 22 2016


Demo! Creating a Custom UDF participating in Code Generation

Page 66: Atlanta Spark User Meetup 09 22 2016

Spark 1.6 and 2.0 Improvements

Adaptiveness, Metrics, Datasets, and Streaming State

Page 67: Atlanta Spark User Meetup 09 22 2016

Adaptive Query Execution

Adapt query execution using data from previous stages.
Dynamically choose spark.sql.shuffle.partitions (default 200); today it is a manual setting, as in the snippet below.

Adaptive hybrid join: a broadcast join for the popular keys, a shuffle join for the not-so-popular keys.
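
The manual knob, for reference (the value shown is an arbitrary example; the right setting is workload-specific):

// Default is 200 post-shuffle partitions; adaptive execution aims to pick this for you
sqlContext.setConf("spark.sql.shuffle.partitions", "400")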

Page 68: Atlanta Spark User Meetup 09 22 2016

Adaptive Memory Management

Spark <1.6
  Manually configure the two memory regions:
  Spark execution engine (shuffles, joins, sorts, aggs): spark.shuffle.memoryFraction
  RDD data cache: spark.storage.memoryFraction

Spark 1.6+
  Unified memory regions.
  Dynamically expand/contract the memory regions.
  Supports a minimum for RDD storage (LRU cache).
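
The unified region is governed by two new properties; a hedged sketch with what I believe are the Spark 1.6 defaults:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.75")        // unified execution + storage region
  .set("spark.memory.storageFraction", "0.5")  // minimum fraction protected for storage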

Page 69: Atlanta Spark User Meetup 09 22 2016

Metrics

Shows exact memory usage per operator & node.
Helps debugging and identifying skew.

Page 70: Atlanta Spark User Meetup 09 22 2016

Spark SQL API

Datasets: a type-safe API (similar to RDDs) utilizing Tungsten:

val ds = sqlContext.read.text("ratings.csv").as[String]
val df = ds.flatMap(_.split(",")).filter(_ != "").toDF()  // RDD-like API, convert to DF
val agg = df.groupBy($"rating").agg(count("*") as "ct").orderBy($"ct" desc)

Typed Aggregators used alongside UDFs and UDAFs:

val simpleSum = new Aggregator[Int, Int, Int] with Serializable {
  def zero: Int = 0
  def reduce(b: Int, a: Int) = b + a
  def merge(b1: Int, b2: Int) = b1 + b2
  def finish(b: Int) = b
}.toColumn
val sum = Seq(1, 2, 3, 4).toDS().select(simpleSum)

Query files directly without registerTempTable():

%sql SELECT * FROM json.`/datasets/movielens/ml-latest/movies.json`

Page 71: Atlanta Spark User Meetup 09 22 2016

Spark Streaming State Management

New trackStateByKey()
  Store deltas, compact later.
  More efficient per-key state updates.
  Session TTL.
  Integrated A/B Testing (?!)

Show failed output in the Admin UI, for better debugging.
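
A minimal sketch of the per-key state update (note: trackStateByKey() from the design discussions shipped as mapWithState() in the final Spark 1.6 release; the running word count and wordCountPairsDStream are illustrative):

import org.apache.spark.streaming.{State, StateSpec}

// Merge each new count into the state stored for its key
val spec = StateSpec.function((word: String, count: Option[Int], state: State[Int]) => {
  val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum)  // emitted downstream
})

val runningCounts = wordCountPairsDStream.mapWithState(spec)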

Page 72: Atlanta Spark User Meetup 09 22 2016

Thank You!!

Chris Fregly
Research Scientist @ PipelineIO (http://pipeline.io)
San Francisco, California, USA

advancedspark.com: sign up for the Meetup and Book.
Contribute on Github!
Run all the demos in Docker (~6000 Docker downloads!!)
Find me on LinkedIn, Twitter, Github, Email, Fax

