Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning

Zoltán Zvara, [email protected]
Transcript
Page 1

Handling data skew adaptively in Spark using Dynamic Repartitioning

Zoltán Zvara, [email protected]

Page 2

Introduction

• Hungarian Academy of Sciences, Institute for Computer Science and Control (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Cassandra, Hadoop, etc.
• Multiple telco use cases lately, with challenging data volume and distribution

Page 3

Agenda

• Our data-skew story
• Problem definition & aims
• Dynamic Repartitioning
  • Architecture
  • Component breakdown
  • Repartitioning mechanism
  • Benchmark results
• Visualization
• Conclusion

Page 4

Motivation

• We developed an application aggregating telco data that tested well on toy data
• When deployed against the real dataset, the application seemed healthy at first
• However, it could become surprisingly slow or even crash
• What went wrong?

Page 5

Our data-skew story

• We have use cases where map-side combine is not an option: groupBy, join (see the sketch below)
• 80% of the traffic is generated by 20% of the communication towers
• Most of our data follows this 80–20 rule
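To see where the skew lands, consider a grouping on tower id. This is our own illustration, not the talk's code, and the type parameters are hypothetical:

import org.apache.spark.rdd.RDD

// groupByKey performs no map-side combine: every record of a hot key must
// cross the shuffle, so the reducer that receives a hot tower id becomes
// the slow task.
def eventsByTower(events: RDD[(String, String)]): RDD[(String, Iterable[String])] =
  events.groupByKey() // all values of one tower id meet on a single reducer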

Page 6

The problem

• Using default hashing is not going to distribute the data uniformly
• Some unknown partition(s) will contain a lot of records on the reducer side
• Slow tasks will appear
• Data distribution is not known in advance
• "Concept drifts" are common

[Figure: evenly partitioning skewed data at the stage boundary & shuffle produces slow tasks on the reducer side]

Page 7

Aim

Generally, make Spark a data-aware distributed data-processing framework:

• Collect information about the data characteristics on the fly
• Partition the data as uniformly as possible on the fly
• Handle arbitrary data distributions
• Mitigate the problem of slow tasks
• Be lightweight and efficient
• Require no user guidance beyond a single switch:

spark.repartitioning = true
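In application code this is an ordinary Spark configuration entry; a minimal sketch, assuming the fork exposes exactly the key shown above:

import org.apache.spark.SparkConf

// Enable Dynamic Repartitioning (flag name taken from the slide; the
// application name is a hypothetical example).
val conf = new SparkConf()
  .setAppName("telco-aggregation")
  .set("spark.repartitioning", "true")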

Page 8

Architecture

[Diagram: one Master coordinating several Slaves]

1. Each slave approximates its local key distribution & statistics.
2. The local statistics are collected to the master.
3. The master merges them into a global key distribution & statistics.
4. The master redistributes a new hash function to the slaves.
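The message shapes behind these four steps might look as follows; this is a hypothetical sketch, and the case-class names and fields are ours, not the implementation's:

// Steps 1-2: a slave reports its approximate local key distribution.
case class LocalKeyDistribution(executorId: String, histogram: Map[Any, Double])
// Step 3: the master's merged global view.
case class GlobalKeyDistribution(histogram: Map[Any, Double])
// Step 4: the master ships a new partitioner to the slaves.
case class RedistributeHash(newPartitioner: org.apache.spark.Partitioner)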

Page 9

Driver perspective

• RepartitioningTrackerMaster (RTM) is part of the SparkEnv
• Listens to job & stage submissions
• Holds a variety of repartitioning strategies for each job & stage
• Decides when & how to (re)partition

Page 10

Executor perspective

• RepartitioningTrackerWorker is part of the SparkEnv
• Duties:
  • stores ScannerStrategies (Scanner included) received from the RTM,
  • instantiates and binds Scanners to TaskMetrics (where the data characteristics are collected),
  • defines an interface for Scanners to send DataCharacteristics back to the RTM.

Page 11

Scalable sampling

• Key distributions are approximated with a strategy that is
  • not sensitive to early or late concept drifts,
  • lightweight and efficient,
  • scalable, by using a backoff strategy.

[Figure: key-frequency histograms under backoff sampling. Counters are kept per key and increased by s_i while sampling at rate s_i; when the number of counters outgrows the threshold T_B, the histogram is truncated and the sampling rate backs off to s_i/b, then to s_{i+1}/b, and so on.]
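A minimal sketch of this backoff scheme, reconstructed from the figure rather than taken from the implementation (the weight-by-1/rate counting is our assumption):

import scala.collection.mutable

// Count keys while sampling at rate `rate`; weight each sampled hit by
// 1/rate so the counters approximate true frequencies. When the histogram
// outgrows the threshold tB, truncate it to the heaviest keys and divide
// the sampling rate by the back-off factor b.
class BackoffSampler[K](initialRate: Double, b: Double, tB: Int) {
  private val counters = mutable.Map.empty[K, Double]
  private var rate = initialRate
  private val rng = new scala.util.Random

  def observe(key: K): Unit =
    if (rng.nextDouble() < rate) {
      counters(key) = counters.getOrElse(key, 0.0) + 1.0 / rate
      if (counters.size > tB) { // truncate & back off
        val kept = counters.toSeq.sortBy(-_._2).take(tB / 2)
        counters.clear()
        counters ++= kept
        rate /= b
      }
    }

  def histogram: Map[K, Double] = counters.toMap
}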

Page 12

Complexity-sensitive sampling

• Reducer run-time can correlate strongly with the computational complexity of the values for a given key
• Calculating object size is costly on the JVM and in some cases shows little correlation to the computational complexity (of the function in the next stage)
• Solution (sketched below):
  • if the user object is Weightable, use complexity-sensitive sampling,
  • when increasing the counter for a specific value, consider its complexity.
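A hypothetical shape of that hook: the Weightable name comes from the slide, while the member and the increment logic are our guesses:

// User objects report a complexity weight, e.g. a collection size or an
// estimated processing cost for the next stage's function.
trait Weightable {
  def complexity: Double
}

import scala.collection.mutable

// Complexity-sensitive increment: weigh a key by its value's complexity
// instead of counting every record as 1.
def add[K](counters: mutable.Map[K, Double], key: K, value: Any): Unit = {
  val w = value match {
    case v: Weightable => v.complexity
    case _             => 1.0
  }
  counters(key) = counters.getOrElse(key, 0.0) + w
}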

Page 13

Scalable sampling in numbers

• In Spark, it has been implemented with Accumulators
• Not the most efficient route, but we wanted to implement it with minimal impact on the existing code
• Optimized with micro-benchmarking
• Main factors:
  • aggressiveness of the sampling strategy (initial sampling ratio, back-off factor, etc.),
  • complexity of the current stage (mapper).
• Effect on the current stage's runtime:
  • used throughout the execution of the whole stage, it adds 5-15% to the runtime,
  • after repartitioning we cut out the sampler's code path; in practice it then adds only 0.2-1.5% to the runtime.
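For reference, a per-key histogram can be gathered through Spark's accumulator API roughly like this; a sketch of ours in the spirit of the slide, written against the current AccumulatorV2 interface rather than the talk's actual code:

import scala.collection.mutable
import org.apache.spark.util.AccumulatorV2

// Accumulates a key -> count histogram across tasks.
class KeyHistogram extends AccumulatorV2[String, Map[String, Long]] {
  private val hist = mutable.Map.empty[String, Long]

  override def isZero: Boolean = hist.isEmpty
  override def copy(): KeyHistogram = { val c = new KeyHistogram; c.hist ++= hist; c }
  override def reset(): Unit = hist.clear()
  override def add(key: String): Unit = hist(key) = hist.getOrElse(key, 0L) + 1L
  override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit =
    other.value.foreach { case (k, n) => hist(k) = hist.getOrElse(k, 0L) + n }
  override def value: Map[String, Long] = hist.toMap
}

Registered via sparkContext.register(...) and fed from the map tasks, the driver can read the merged histogram during or after the stage.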

Page 14

Main new task-level metrics

• TaskMetrics
  • RepartitioningInfo: new hash function, repartitioning strategy
• ShuffleReadMetrics
  • (Weightable)DataCharacteristics: used for testing the correctness of the repartitioning & for execution-visualization tools
• ShuffleWriteMetrics
  • (Weightable)DataCharacteristics: scanned periodically by Scanners based on the ScannerStrategy
  • insertionTime
  • repartitioningTime
  • inMemory
  • number of spills

Page 15

Scanner

• Instantiated for each task before the executor starts it
• Different implementations: Throughput, Timed, Histogram
• The ScannerStrategy defines:
  • when to send to the RTM,
  • the histogram-compaction level.
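A hypothetical shape of that contract; the two responsibilities come from the slide, the signatures are our guesses:

trait ScannerStrategy {
  def shouldSend(recordsSeen: Long, elapsedMillis: Long): Boolean // when to report to the RTM
  def compactionLevel: Int                                        // histogram-compaction level
}

// e.g. a Throughput-style strategy that reports every n records:
class ThroughputStrategy(n: Long, override val compactionLevel: Int) extends ScannerStrategy {
  def shouldSend(recordsSeen: Long, elapsedMillis: Long): Boolean =
    recordsSeen > 0 && recordsSeen % n == 0
}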

Page 16

Decision to repartition

• The RepartitioningTrackerMaster can use different decision strategies, based on:
  • the number of local histograms needed,
  • the global histogram's distance from the uniform distribution,
  • preferences in the construction of the new hash function.
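The distance signal can be computed in a few lines; the metric chosen here (total variation distance) is our assumption, not necessarily the implementation's:

// Total variation distance between the normalized histogram and the
// uniform distribution: 0 means perfectly balanced, values near 1 mean
// nearly all mass sits on one key.
def distanceFromUniform(counts: Seq[Long]): Double = {
  val total = counts.sum.toDouble
  val uniform = 1.0 / counts.size
  counts.map(c => math.abs(c / total - uniform)).sum / 2.0
}

// A decision strategy might trigger repartitioning when this exceeds a threshold.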

Page 17

Construction of the new hash function

[Figure: the key histogram is split into three regions, bounded by a level l:
• cut_1, single-key cut: the top prominent keys are hashed to dedicated partitions;
• cut_2, uniform distribution: a key is hashed here with probability proportional to the remaining space to reach l;
• cut_3, probabilistic cut.]
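A much-simplified sketch of the single-key cut (our reconstruction; the real KeyIsolatorPartitioner also implements the uniform and probabilistic cuts, which are omitted here):

import org.apache.spark.Partitioner

// The most prominent keys get dedicated partitions; every other key falls
// back to plain hashing over the remaining partitions.
class KeyIsolatorPartitioner(override val numPartitions: Int, heavyKeys: Seq[Any])
    extends Partitioner {
  private val isolated: Map[Any, Int] =
    heavyKeys.take(numPartitions - 1).zipWithIndex.toMap
  private val remaining = numPartitions - isolated.size

  override def getPartition(key: Any): Int = isolated.get(key) match {
    case Some(p) => p
    case None    => isolated.size + (key.hashCode % remaining + remaining) % remaining
  }
}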

Page 18

New hash function in numbers

• More complex to evaluate than a plain hashCode
• We need to evaluate it for every record
• Micro-benchmark (for example on Strings):
  • number of partitions: 512,
  • HashPartitioner: AVG time to hash a record is 90.033 ns,
  • KeyIsolatorPartitioner: AVG time to hash a record is 121.933 ns.
• Since hashing is only a tiny fraction of the per-record work, in practice it adds negligible overhead, under 1%

Page 19

Repartitioning

• Reorganizes the output of the previous naive hashing
• Usually happens in memory
• In practice adds an additional 1-8% overhead (usually at the lower end), based on:
  • the complexity of the mapper,
  • the length of the scanning process.

Page 20

More numbers (groupBy)

MusicTimeseries: groupBy on tags from a listenings stream

[Chart: size of the biggest partition vs. number of partitions, naive hashing vs. Dynamic Repartitioning. Annotated partition sizes: 83M, 92M and 134M records. Dynamic Repartitioning cuts the biggest partition by 38-64% with an observed 3% map-side overhead, while naive over-partitioning brings its own overhead & blind scheduling.]

Page 21

More numbers (join)

• Joining tracks with tags
• The tracks dataset is skewed
• The number of partitions is set to 33
• Naive:
  • size of the biggest partition = 4.14M,
  • reducer stage runtime = 251 seconds.
• Dynamic Repartitioning:
  • size of the biggest partition = 2.01M,
  • reducer stage runtime = 124 seconds,
  • with a heavy map, only 0.9% overhead.

Page 22

Spark REST API

• New metrics are available through the REST API
• Added new queries to the REST API, for example: "what happened in the last 3 seconds?"
• BlockFetches are collected into ShuffleReadMetrics

Page 23

Execution visualization of Spark jobs

Page 24

Repartitioning in the visualization

[Figure: before repartitioning, heavy keys create heavy partitions (slow tasks); after, each heavy key is isolated and the size of the biggest partition is minimized.]

Page 25

Conclusion

• Our Dynamic Repartitioning can handle data skew dynamically, on the fly, on any workload and with arbitrary key distributions
• With very little overhead, data skew can be handled in a natural & general way
• Visualizations can help developers better understand the issues and bottlenecks of certain workloads
• Making Spark data-aware pays off

Page 26

Thank you for your attention

Zoltán Zvara, [email protected]

