Failing Gracefully
Aaron Davidson
07/01/2014
What does "failure" mean for Spark?
• Spark is a cluster-compute framework targeted at analytics workloads
• Supported failure modes:
  – Transient errors (e.g., network or HDFS outage)
  – Worker machine failures
• Unsupported failure modes:
  – Systemic exceptions (e.g., bad code, OOMs)
  – Driver machine failure
What makes a recovery model good?
• A good recovery model should:
  – Be simple
  – Consistently make progress towards completion
  – Always be in use ("fail constantly")
Outline of this talk
• Spark architecture overview
• Common failures
• Special considerations for fault tolerance
Example program
Goal: Find number of names per "first character"

sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _)
  .collect()

[Figure, built up over four slides: the input names うえしん, さいとう, うえだ are mapped to (う, 1), (さ, 1), (う, 1), reduced to (う, 2), (さ, 1), and collected as res0 = [(う, 2), (さ, 1)].]
Spark Execution Model
1. Create DAG of RDDs to represent computation
2. Create logical execution plan for DAG
3. Schedule and execute individual tasks
Step 1: Create RDDs

sc.textFile("hdfs:/names")
map(name => (name.charAt(0), 1))
reduceByKey(_ + _)
collect()
Step 1: Create RDDs
HadoopRDD
map()
reduceByKey()
collect()
Step 2: Create execution plan
• Pipeline as much as possible
• Split into "stages" based on the need to reorganize data
[Figure: Stage 1 = HadoopRDD → map(), pipelined; Stage 2 = reduceByKey() → collect(). The input names うえしん, さいとう, うえだ become (う, 1), (さ, 1), (う, 1) at the end of Stage 1, then (う, 2), (さ, 1) in Stage 2, yielding res0 = [(う, 2), (さ, 1)].]
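The stage boundary can also be inspected directly. A minimal sketch, assuming a live SparkContext `sc` and the same input path as above:

```scala
val counts = sc.textFile("hdfs:/names")
  .map(name => (name.charAt(0), 1))
  .reduceByKey(_ + _)

// toDebugString prints the RDD lineage; the ShuffledRDD produced by
// reduceByKey marks the boundary between Stage 1 (read + map, pipelined
// together) and Stage 2 (reduce).
println(counts.toDebugString)
```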
Step 3: Schedule tasks
• Split each stage into tasks
• A task is data + computation
• Execute all tasks within a stage before moving on
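One way to picture "a task is data + computation" in plain Scala (hypothetical types for illustration only, not Spark's internal classes):

```scala
// A task pairs one partition of input data with the closure to apply to it.
case class Task[T, U](partition: Seq[T], compute: Seq[T] => Seq[U])

val task0 = Task(Seq("うえしん"),
  (names: Seq[String]) => names.map(n => (n.charAt(0), 1)))

val out = task0.compute(task0.partition) // Seq(('う', 1))
```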
Step 3: Schedule tasks
[Figure, "Computation + Data": Stage 1 is split into one task per input file – Task 0 = HadoopRDD → map() over hdfs:/names/0.gz, Task 1 over 1.gz, Task 2 over 2.gz, Task 3 over 3.gz, … Each task pairs a data partition (one file) with the stage's computation.]
Step 3: Schedule tasks
[Animation frames: three worker nodes, each co-located with two HDFS blocks (/names/0.gz + /names/3.gz; /names/1.gz + /names/0.gz; /names/2.gz + /names/3.gz). Over time, a HadoopRDD → map() task is launched for each file – 0.gz, 1.gz, 2.gz, then 3.gz – on an executor holding a local copy of its block.]
The Shuffle
[Figure: Stage 1 = HadoopRDD → map(); Stage 2 = reduceByKey() → collect(), separated by the shuffle.]
The Shuffle
• Redistributes data among partitions
• Hash keys into buckets
• On the reduce side, build a hashmap within each partition:
  Reduce 0: { う => 137, さ => 86, … }
  Reduce 1: { な => 144, る => 12, … }
  …
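Both sides of the shuffle can be sketched in plain Scala (a simplification of what Spark's HashPartitioner and reduce-side aggregation do):

```scala
// Map side: assign each key to a reduce bucket by hash.
// The adjustment keeps the result non-negative for negative hashCodes.
def bucket(key: Any, numPartitions: Int): Int = {
  val h = key.hashCode % numPartitions
  if (h < 0) h + numPartitions else h
}

// Reduce side of reduceByKey(_ + _): fold the pulled (key, value)
// pairs into a per-partition hashmap.
def reducePartition(pairs: Iterator[(Char, Int)]): Map[Char, Int] =
  pairs.foldLeft(Map.empty[Char, Int]) { case (acc, (k, v)) =>
    acc.updated(k, acc.getOrElse(k, 0) + v)
  }

reducePartition(Iterator(('う', 1), ('さ', 1), ('う', 1))) // Map(う -> 2, さ -> 1)
```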
The Shuffle
• Pull-based, not push-based
• Write intermediate files to disk
[Figure: Stage 1 map tasks write their shuffle output to local disk; Stage 2 tasks pull it from there.]
Step 3: Schedule tasks
[Animation frames: after the four HadoopRDD → map() tasks of Stage 1 complete on the three workers, Stage 2 runs Reduce 0 through Reduce 3, each a reduceByKey → collect task.]
When things go wrong
• Task failure
• Task taking a long time
• Executor failure
Task Failure
• Task fails with an exception → retry it
• RDDs are immutable and "stateless", so rerunning should have the same effect
  – Special logic required for tasks that write data out (atomic rename)
  – Statelessness is not enforced by the programming model:

sc.parallelize(0 until 100).map { x =>
  val myVal = sys.props.getOrElse("foo", "0").toInt + x // reads mutable state
  sys.props("foo") = myVal.toString                     // and mutates it
  myVal // a retried task may observe a different "foo" and change the result
}
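By contrast, a map whose output depends only on its input is safe to retry. A sketch, assuming a live `sc`:

```scala
// Re-running any task of this job reproduces exactly the same partition,
// because no external mutable state is read or written.
val safe = sc.parallelize(0 until 100).map(x => x * 2)
safe.collect() // identical on every (re-)execution
```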
Task Failure
[Animation frames: the map task for /names/2.gz fails; the scheduler relaunches a HadoopRDD → map() task for 2.gz on another executor.]
Speculative Execution
• Try to predict slow or failing tasks, and restart the task on a different machine in parallel
• Also assumes immutability and statelessness
• Enable with "spark.speculation=true"
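A configuration sketch; `spark.speculation.multiplier` and `spark.speculation.quantile` are the real companion settings (defaults 1.5 and 0.75), shown here only to make the tuning knobs visible:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")            // enable speculative execution
  .set("spark.speculation.multiplier", "1.5")  // "slow" = 1.5x the median task time
  .set("spark.speculation.quantile", "0.75")   // start checking after 75% of tasks finish
```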
Speculative Execution
[Animation frames: the map task for /names/3.gz runs slowly; a speculative copy of the same HadoopRDD → map() task is launched in parallel on another executor, and whichever copy finishes first wins.]
Executor Failure
• Examine the tasks run on that executor:
  – If a task is from the final stage, we've already collected its results – don't rerun it
  – If a task is from an intermediate stage, it must be rerun
• May require re-executing a "finished" stage
Step 3: Schedule tasks
[Animation frames: one executor is lost mid-job. Completed tasks needing no rerun: the /names/0.gz and /names/3.gz HadoopRDD → map() tasks whose output survives, and Reduce 3 (reduceByKey → collect), whose final-stage result was already collected. Task to rerun: the lost executor's /names/3.gz HadoopRDD → map() task, because its intermediate shuffle output is gone.]
Other Failure Scenarios
What happens when:
1. We have a large number of stages?
2. Our input data is not immutable (e.g., streaming)?
3. Executors had cached data?
1. Dealing with many stages
Problem: Executor loss causes recomputation of all non-final stages.
Solution: Checkpoint the whole RDD to HDFS periodically.
[Figure, over two slides: a lineage of Stages 1–7; checkpointing an intermediate stage's RDD to HDFS truncates the lineage, so recovery restarts from the checkpoint instead of Stage 1.]
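A checkpointing sketch for an iterative job, assuming a live `sc` (the HDFS path is illustrative):

```scala
sc.setCheckpointDir("hdfs:/checkpoints") // must be a fault-tolerant filesystem

var rdd = sc.parallelize(0 until 1000)
for (i <- 1 to 50) {
  rdd = rdd.map(_ + 1)  // each iteration lengthens the lineage
  if (i % 10 == 0) {
    rdd.cache()         // avoid recomputing the RDD when the checkpoint writes
    rdd.checkpoint()    // persist to HDFS and truncate the lineage
  }
}
rdd.count() // materializes; recovery now replays from the latest
            // checkpoint rather than from iteration 1
```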
2. Dealing with lost input data
Problem: Input data is consumed when read (e.g., streaming), and re-execution is not possible.
Solution: No general solution today – either use an HDFS source or implement it yourself. The Spark 1.2 roadmap includes a general solution, which may trade throughput for safety.
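For readers on a later release: the Spark 1.2 work referred to above became the Spark Streaming receiver write-ahead log, enabled with a single setting (sketch):

```scala
import org.apache.spark.SparkConf

// Receiver write-ahead log: received data is written to a fault-tolerant
// filesystem before being acknowledged, trading some throughput for the
// ability to replay input after a failure.
val conf = new SparkConf()
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
```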
3. Loss of cached data
Problem: Executor loss causes the cache to become incomplete.
Solution: Do nothing – a task caches data locally while it runs, causing the cache to stabilize.
3. Loss of cached data

val file = sc.textFile("s3n://").cache() // 8 blocks
for (i <- 0 until 10) {
  file.count()
}

[Figure: on i = 0, two executors each compute and cache four of the eight blocks (Blocks 0/2/4/6 and Blocks 1/3/5/7); on i = 1, the count is served entirely from cache.]
3. Loss of cached data (continued)
[Animation frames: the executor caching Blocks 0/2/4/6 is lost after i = 0. On the following iterations the missing blocks are recomputed and re-cached as their tasks run, possibly redistributed across executors, so by i = 3 all eight blocks are cached again and the cache is stable.]
Conclusions
• Spark comes equipped to handle the most common forms of failure
• Special care must be taken in certain cases:
  – Highly iterative use cases (checkpointing)
  – Streaming (atomic data consumption)
  – Violating Spark's core immutability and statelessness assumptions