Frustration-Reduced PySpark
Data engineering with DataFrames
Ilya Ganelin
Why are we here?
- Spark for quick and easy batch ETL (no streaming)
- Actually using data frames: creation, modification, access, transformation
- Lab!
- Performance tuning and operationalization
What does it take to solve a data science problem?
- Data Prep: ingest, cleanup, error-handling & missing values, data munging
- Transformation: formatting, splitting
- Modeling: feature extraction, algorithm selection, data creation (train / test / validate), model building, model scoring
Why Spark?
- Batch/micro-batch processing of large datasets
- Easy to use, easy to iterate, wealth of common industry-standard ML algorithms
- Super fast if properly configured
- Bridges the gap between the old (SQL, single-machine analytics) and the new (declarative/functional distributed programming)
Why not Spark?
- Breaks easily with poor usage or improperly specified configs
- Scaling up to larger datasets (500 GB -> TB scale) requires a deep understanding of internal configurations, garbage-collection tuning, and Spark mechanisms
- While there are lots of ML algorithms, many of them simply don't work, don't work at scale, or have poorly defined interfaces / documentation
Scala
- Yes, I recommend Scala.
- The Python API is underdeveloped, especially for MLlib.
- Java (until Java 8) is a second-class citizen compared to Scala in terms of convenience.
- Spark is written in Scala, so understanding Scala helps you navigate the source.
- You can leverage the spark-shell to rapidly prototype new code and constructs.
- http://www.scala-lang.org/docu/files/ScalaByExample.pdf
Why DataFrames?
- Iterate on datasets MUCH faster
- Column access is easier
- Data inspection is easier
- groupBy and join are faster due to under-the-hood optimizations
- Some chunks of MLlib are now optimized to use data frames
Why not DataFrames?
- The RDD API is still much better developed
- Getting data into DataFrames can be clunky
- Transforming data inside DataFrames can be clunky
- Many of the algorithms in MLlib still depend on RDDs
Creation
- Read in a file with an embedded header
  http://stackoverflow.com/questions/24718697/pyspark-drop-rows
- Create a DF (see the sketch below):
  - Option A – Inferred types from Rows RDD
  - Option B – Specify schema as strings
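A minimal sketch of the above, using the Spark 1.6-era SQLContext API; the file name and column names are assumptions for illustration:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="frustration-reduced-pyspark")
sqlContext = SQLContext(sc)

# Read a text file whose first line is a header, then drop that header row.
lines = sc.textFile("people.csv")
header = lines.first()
rows = (lines.filter(lambda line: line != header)
             .map(lambda line: line.split(",")))

# Option A - build an RDD of Rows and let Spark infer the column types.
people = rows.map(lambda p: Row(name=p[0], age=int(p[1])))
df_a = sqlContext.createDataFrame(people)

# Option B - pass the schema as a list of column-name strings (all columns stay strings).
df_b = sqlContext.createDataFrame(rows, ["name", "age"])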
DataFrame Creation
- Option C – Define the schema explicitly (sketch below)
- Check your work with df.show()
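A sketch of the explicit-schema option, continuing the hypothetical example above:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Option C - define the schema explicitly so nothing has to be inferred.
schema = StructType([
    StructField("name", StringType(), True),   # True = nullable
    StructField("age", IntegerType(), True),
])
df_c = sqlContext.createDataFrame(rows.map(lambda p: (p[0], int(p[1]))), schema)

# Check your work.
df_c.show()
df_c.printSchema()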
Column Manipulation
- Selection (example below)
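For example, reusing the hypothetical df_a from the creation sketch:

# Select columns by name, by Column object, or with an expression.
df_a.select("name").show()
df_a.select(df_a["name"], df_a["age"] + 1).show()
df_a.selectExpr("name", "age + 1 AS age_next").show()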
GroupBy
- Confusing! You get a GroupedData object, not an RDD or DataFrame.
- Use agg or built-ins to get back to a DataFrame.
- Can convert to an RDD with dataFrame.rdd
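A small sketch, again reusing the hypothetical df_a:

from pyspark.sql import functions as F

grouped = df_a.groupBy("age")                      # GroupedData, not a DataFrame
counts = grouped.count()                           # built-in aggregate -> DataFrame
stats = grouped.agg(F.min("name"), F.count("name"))  # agg -> DataFrame

# Drop down to the RDD API when you need to.
rdd = counts.rdd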
Custom Column Functions
- Add a column with a custom function:
  http://stackoverflow.com/questions/33287886/replace-empty-strings-with-none-null-values-in-dataframe
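A sketch along the lines of the linked answer (replacing empty strings with nulls via a UDF); the column name is an assumption:

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Wrap a plain Python function as a UDF and use it to add or overwrite a column.
blank_as_null = F.udf(lambda s: None if s == "" else s, StringType())
df_clean = df_a.withColumn("name", blank_as_null(df_a["name"]))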
Row Manipulation
- Filter, for example by:
  - Range
  - Equality
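For example, with the columns assumed in the earlier sketch:

# Filter rows by range and by equality.
df_a.filter(df_a["age"] > 21)                           # range
df_a.filter((df_a["age"] >= 18) & (df_a["age"] <= 65))  # bounded range
df_a.filter(df_a["name"] == "Alice")                    # equality
df_a.where("age = 30")                                  # SQL-expression form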
Column functions
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.Column
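A few of the built-in Column methods from the linked docs, for illustration on the same hypothetical df_a:

df_a.select(
    df_a["age"].between(20, 30),
    df_a["name"].isNull(),
    df_a["name"].startswith("A"),
    df_a["name"].alias("passenger"),
).show()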
Joins
- Option A (inner join)
- Option B (explicit)
- Join types: inner, outer, left_outer, right_outer, leftsemi
- DataFrame joins benefit from Tungsten optimizations
- Note: PySpark will not drop columns for outer joins
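A sketch of both options, assuming two hypothetical DataFrames df1 and df2 that share an "id" column:

# Option A - inner join on a shared column name.
joined_a = df1.join(df2, "id")

# Option B - explicit join condition and join type.
joined_b = df1.join(df2, df1["id"] == df2["id"], "left_outer")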
Null Handling
- Built-in support for handling nulls/NA in data frames: drop, fill, replace
  https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions
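For example (fill values and column names are assumptions):

# DataFrameNaFunctions live under df.na
df_a.na.drop()                                  # drop rows containing any null
df_a.na.fill({"age": 0, "name": "unknown"})     # fill nulls per column
df_a.na.replace(["?"], ["unknown"], "name")     # replace sentinel values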
What does it take to solve a data science problem?
- Data Prep: ingest, cleanup, error-handling & missing values, data munging
- Transformation: formatting, splitting
- Modeling: feature extraction, algorithm selection, data creation (train / test / validate), model building, model scoring
Lab Rules
- Ask Google and StackOverflow before you ask me.
- You do not have to use my code.
- Use DataFrames until you can't.
- Keep track of what breaks!
- There are no stupid questions.
Lab
- Ingest data
- Remove invalid entries or fill missing entries
- Split into test, train, validate
- Reformat a single column, e.g. map IDs or change the format
- Add a custom metric or feature based on other columns
- Run a classification algorithm on this data to figure out who will survive!
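One possible shape of the lab flow, not the official solution: this sketch assumes a Titanic-style CSV with "age", "fare", and "survived" columns, and that the spark-csv package is on the classpath (all of these are assumptions).

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Ingest (spark-csv package syntax for Spark 1.x).
df = (sqlContext.read.format("com.databricks.spark.csv")
      .options(header="true", inferSchema="true")
      .load("titanic.csv"))

# Fill missing entries and make the label numeric, then split into train/test/validate.
df = df.na.fill({"age": 30.0, "fare": 0.0})
df = df.withColumn("survived", df["survived"].cast("double"))
train, test, validate = df.randomSplit([0.7, 0.2, 0.1], seed=42)

# Assemble features and fit a classifier.
assembler = VectorAssembler(inputCols=["age", "fare"], outputCol="features")
lr = LogisticRegression(labelCol="survived", featuresCol="features")
model = lr.fit(assembler.transform(train))
predictions = model.transform(assembler.transform(test))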
What problems did you encounter?
What are you still confused about?
Spark Architecture
Partitions
- How data is split on disk
- Affects memory usage and shuffle size
- Count ~ speed, Count ~ 1/memory

Caching
- Persist RDDs in distributed memory
- Major speedup for repeated operations

Serialization
- Efficient movement of data: Java vs. Kryo
Partitions, Caching, and Serialization
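A standalone sketch of how these knobs appear in code; the file name and partition counts are arbitrary assumptions:

from pyspark import SparkConf, SparkContext, StorageLevel

# Use Kryo instead of Java serialization.
conf = (SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)

# Control the partition count: more partitions -> smaller tasks, less memory per task.
rdd = sc.textFile("big_input.txt", minPartitions=200)
rdd = rdd.repartition(400)

# Cache for repeated operations.
rdd.persist(StorageLevel.MEMORY_AND_DISK)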
Shuffle!
- All-to-all operations: reduceByKey, groupByKey
- Data movement
- Serialization (Akka)
- Memory overhead
- Dumps to disk when OOM
- Garbage collection
- EXPENSIVE!
Map Reduce
What else?
- Save your work => write completed datasets to file
- Work on small data first, then go to big data
- Create test data to capture edge cases
- LMGTFY
By popular demand:

screen pyspark \
  --driver-memory 100g \
  --num-executors 60 \
  --executor-cores 5 \
  --master yarn-client \
  --conf "spark.executor.memory=20g" \
  --conf "spark.io.compression.codec=lz4" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.shuffle.manager=tungsten-sort" \
  --conf "spark.akka.frameSize=1028" \
  --conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops"
Any Spark on YARN
E.g. deploy Spark 1.6 on CDH 5.4:
- Download your Spark binary to the cluster and untar it.
- In $SPARK_HOME/conf/spark-env.sh:
  export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/conf
  This tells Spark where Hadoop is deployed and gives it the link it needs to run on YARN.
  export SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath)
  This defines the location of the Hadoop binaries used at run time.
References
- http://spark.apache.org/docs/latest/programming-guide.html
- http://spark.apache.org/docs/latest/sql-programming-guide.html
- http://tinyurl.com/leqek2d (Working With Spark, by Ilya Ganelin)
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-1/ (by Sandy Ryza)
- http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/ (by Sandy Ryza)
- Slides: http://www.slideshare.net/ilganeli/frustrationreduced-pyspark-data-engineering-with-dataframes