
Frustration-Reduced PySpark: Data engineering with DataFrames

Page 1: Frustration-Reduced PySpark: Data engineering with DataFrames

Frustration-Reduced PySpark

Data engineering with DataFrames
Ilya Ganelin

Page 2: Frustration-Reduced PySpark: Data engineering with DataFrames

Why are we here?
- Spark for quick and easy batch ETL (no streaming)
- Actually using data frames: creation, modification, access, transformation
- Lab!
- Performance tuning and operationalization

Page 3: Frustration-Reduced PySpark: Data engineering with DataFrames

What does it take to solve a data science problem?

Data Prep: Ingest, Cleanup, Error-handling & missing values, Data munging
Transformation: Formatting, Splitting
Modeling: Feature extraction, Algorithm selection, Data creation
Train/Test/Validate: Model building, Model scoring

Page 4: Frustration-Reduced PySpark: Data engineering with DataFrames

Why Spark?
- Batch/micro-batch processing of large datasets
- Easy to use, easy to iterate, wealth of common industry-standard ML algorithms
- Super fast if properly configured
- Bridges the gap between the old (SQL, single-machine analytics) and the new (declarative/functional distributed programming)

Page 5: Frustration-Reduced PySpark: Data engineering with DataFrames
Page 6: Frustration-Reduced PySpark: Data engineering with DataFrames

Why not Spark?
- Breaks easily with poor usage or improperly specified configs
- Scaling up to larger datasets (500 GB -> TB scale) requires deep understanding of internal configurations, garbage collection tuning, and Spark mechanisms
- While there are lots of ML algorithms, many simply don't work, don't work at scale, or have poorly defined interfaces/documentation

Page 7: Frustration-Reduced PySpark: Data engineering with DataFrames

Scala
- Yes, I recommend Scala
- The Python API is underdeveloped, especially for MLlib
- Java (until Java 8) is a second-class citizen compared to Scala in terms of convenience
- Spark is written in Scala, so understanding Scala helps you navigate the source
- You can leverage the spark-shell to rapidly prototype new code and constructs
- http://www.scala-lang.org/docu/files/ScalaByExample.pdf

Page 8: Frustration-Reduced PySpark: Data engineering with DataFrames

Why DataFrames?
- Iterate on datasets MUCH faster
- Column access is easier
- Data inspection is easier
- groupBy and join are faster due to under-the-hood optimizations
- Some chunks of MLlib are now optimized to use DataFrames

Page 9: Frustration-Reduced PySpark: Data engineering with DataFrames

Why not DataFrames?
- The RDD API is still much better developed
- Getting data into DataFrames can be clunky
- Transforming data inside DataFrames can be clunky
- Many of the algorithms in MLlib still depend on RDDs

Page 10: Frustration-Reduced PySpark: Data engineering with DataFrames

Creation
- Read in a file with an embedded header
- http://stackoverflow.com/questions/24718697/pyspark-drop-rows
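A minimal sketch of this pattern, assuming a SparkContext named sc and a hypothetical file data.csv whose first line is the header:

from pyspark import SparkContext

sc = SparkContext(appName="ingest")

# Read the raw text file; the first line is an embedded header
lines = sc.textFile("data.csv")
header = lines.first()

# Drop the header row and split the remaining lines into fields
rows = lines.filter(lambda line: line != header) \
            .map(lambda line: line.split(","))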

Page 11: Frustration-Reduced PySpark: Data engineering with DataFrames

DataFrame Creation
- Option A – Infer types from an RDD of Rows
- Option B – Specify the schema as strings
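A minimal sketch of both options, assuming a Spark 1.6-style SQLContext and the rows RDD from the previous slide; the column names are hypothetical:

from pyspark.sql import SQLContext, Row

sqlContext = SQLContext(sc)

# Option A - build Row objects and let Spark infer the column types
row_rdd = rows.map(lambda r: Row(name=r[0], age=int(r[1]), fare=float(r[2])))
df = sqlContext.createDataFrame(row_rdd)

# Option B - pass the schema as a list of column-name strings
# (every column comes through as a string, since the fields are unparsed)
df = sqlContext.createDataFrame(rows, ["name", "age", "fare"])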

Page 12: Frustration-Reduced PySpark: Data engineering with DataFrames

DataFrame Creation (continued)
- Option C – Define the schema explicitly
- Check your work with df.show()
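A minimal sketch of an explicit schema, again with hypothetical column names:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("fare", DoubleType(), True),
])

typed = rows.map(lambda r: (r[0], int(r[1]), float(r[2])))
df = sqlContext.createDataFrame(typed, schema)

# Check your work
df.show()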

Page 13: Frustration-Reduced PySpark: Data engineering with DataFrames

Column Manipulation
- Selection
- GroupBy: confusing! You get a GroupedData object, not an RDD or DataFrame
- Use agg or built-ins to get back to a DataFrame
- Can convert to an RDD with dataFrame.rdd
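A minimal sketch of selection and grouping, assuming the df built above (column names are hypothetical):

# Selection returns a new DataFrame
subset = df.select("name", "age")

# groupBy returns a GroupedData object, not an RDD or DataFrame
grouped = df.groupBy("age")

# agg and built-ins such as count() get you back to a DataFrame
avg_fare = grouped.agg({"fare": "avg"})
counts = df.groupBy("age").count()

# Drop down to the RDD API when you need to
as_rdd = df.rdd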

Page 14: Frustration-Reduced PySpark: Data engineering with DataFrames

Custom Column Functions
- Add a column with a custom function:
- http://stackoverflow.com/questions/33287886/replace-empty-strings-with-none-null-values-in-dataframe
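A minimal sketch using a UDF to add a column, mirroring the linked empty-string-to-null pattern (column names are hypothetical):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Map empty strings to None (null); everything else passes through
blank_to_null = udf(lambda v: None if v == "" else v, StringType())

df = df.withColumn("name_clean", blank_to_null(df["name"]))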

Page 15: Frustration-Reduced PySpark: Data engineering with DataFrames

Row Manipulation
- Filter
  - Range
  - Equality
- Column functions: https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.Column
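A minimal sketch of filtering, assuming the df above (column names and thresholds are hypothetical):

# Filter by range
adults = df.filter(df["age"] >= 18)
mid_fare = df.filter((df["fare"] > 10) & (df["fare"] < 50))

# Filter by equality
exact = df.filter(df["name"] == "Ilya")

# Column functions (isNull, between, like, ...) are documented at the link above
missing_age = df.filter(df["age"].isNull())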

Page 16: Frustration-Reduced PySpark: Data engineering with DataFrames

Joins
- Option A (inner join)
- Option B (explicit)
- Join types: inner, outer, left_outer, right_outer, leftsemi
- DataFrame joins benefit from Tungsten optimizations
- Note: PySpark will not drop columns for outer joins
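A minimal sketch of both styles, assuming a second hypothetical DataFrame fares that shares a name column with df:

# Option A - inner join on a shared column name (inner is the default)
joined = df.join(fares, "name")

# Option B - explicit join condition and join type
joined = df.join(fares, df["name"] == fares["name"], "left_outer")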

Page 17: Frustration-Reduced PySpark: Data engineering with DataFrames

Null Handling
- Built-in support for handling nulls/NA in DataFrames
- Drop, fill, replace
- https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
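A minimal sketch of the na functions, assuming the df above (fill and replacement values are hypothetical):

# Drop rows that contain any null value
dropped = df.na.drop()

# Fill nulls per column
filled = df.na.fill({"age": 0, "name": "unknown"})

# Replace specific values in a column
replaced = df.na.replace(["N/A"], ["unknown"], "name")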

Page 18: Frustration-Reduced PySpark: Data engineering with DataFrames

What does it take to solve a data science problem?

Data Prep: Ingest, Cleanup, Error-handling & missing values, Data munging
Transformation: Formatting, Splitting
Modeling: Feature extraction, Algorithm selection, Data creation
Train/Test/Validate: Model building, Model scoring

Page 19: Frustration-Reduced PySpark: Data engineering with DataFrames

Lab Rules
- Ask Google and StackOverflow before you ask me
- You do not have to use my code
- Use DataFrames until you can't
- Keep track of what breaks!
- There are no stupid questions

Page 20: Frustration-Reduced PySpark: Data engineering with DataFrames

Lab
- Ingest data
- Remove invalid entries or fill missing entries
- Split into test, train, validate (see the sketch below)
- Reformat a single column, e.g. map IDs or change the format
- Add a custom metric or feature based on other columns
- Run a classification algorithm on this data to figure out who will survive!
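A minimal sketch of the split step, assuming a cleaned df; the 60/20/20 weights and seed are arbitrary choices, not values from the deck:

# Randomly split into train, test, and validation sets
train, test, validate = df.randomSplit([0.6, 0.2, 0.2], seed=42)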

Page 21: Frustration-Reduced PySpark: Data engineering with DataFrames

What problems did you encounter?

What are you still confused about?

Page 22: Frustration-Reduced PySpark: Data engineering with DataFrames

Spark Architecture

Page 23: Frustration-Reduced PySpark: Data engineering with DataFrames
Page 24: Frustration-Reduced PySpark: Data engineering with DataFrames

Partitions, Caching, and Serialization
- Partitions: how data is split on disk; affects memory usage and shuffle size; Count ~ speed, Count ~ 1/memory
- Caching: persist RDDs in distributed memory; major speedup for repeated operations
- Serialization: efficient movement of data; Java vs. Kryo
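A minimal sketch touching all three knobs, assuming the df above; the partition count and serializer setting are illustrative choices, not values from the deck:

# Serialization: prefer Kryo over Java serialization when building the SparkConf
# conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

# Partitions: control how the data is split across the cluster
repartitioned = df.repartition(200)

# Caching: keep a repeatedly-used dataset in distributed memory
repartitioned.cache()
repartitioned.count()  # an action materializes the cache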

Page 25: Frustration-Reduced PySpark: Data engineering with DataFrames

Shuffle! (Map Reduce)
- All-to-all operations: reduceByKey, groupByKey
- Data movement: serialization, Akka
- Memory overhead: dumps to disk when OOM, garbage collection
- EXPENSIVE!

Page 26: Frustration-Reduced PySpark: Data engineering with DataFrames

What else?
- Save your work => write completed datasets to file
- Work on small data first, then go to big data
- Create test data to capture edge cases
- LMGTFY

Page 27: Frustration-Reduced PySpark: Data engineering with DataFrames

By popular demand:

screen pyspark \
--driver-memory 100g \
--num-executors 60 \
--executor-cores 5 \
--master yarn-client \
--conf "spark.executor.memory=20g" \
--conf "spark.io.compression.codec=lz4" \
--conf "spark.shuffle.consolidateFiles=true" \
--conf "spark.dynamicAllocation.enabled=false" \
--conf "spark.shuffle.manager=tungsten-sort" \
--conf "spark.akka.frameSize=1028" \
--conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops"

Page 28: Frustration-Reduced PySpark: Data engineering with DataFrames

Any Spark on YARN
- E.g. deploy Spark 1.6 on CDH 5.4
- Download your Spark binary to the cluster and untar it
- In $SPARK_HOME/conf/spark-env.sh:

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/conf
This tells Spark where Hadoop is deployed; it also gives Spark the link it needs to run on YARN.

export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath)
This defines the location of the Hadoop binaries used at run time.

