
AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks

Transcript
Page 1: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks

Ignacio Laguna, Saurabh Bagchi

Greg Bronevetsky, Bronis R. de Supinski, Dong H. Ahn, Martin Schulz

June 30th, 2010

40th IEEE/IFIP International Conference on Dependable Systems and Networks

Page 2: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Debugging Large-Scale Parallel Applications is Challenging

• Large systems will have millions of cores in the near future
– Increased difficulty of developing correct HPC applications
– Traditional debuggers don't perform well at this scale

• Faults come from various sources
– Hardware: soft errors, physical degradation, design bugs
– Software: coding bugs, misconfigurations

[Image: Sequoia (2011), 1.6 million cores]

Page 3: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Developer Steps When Debugging a Parallel Application

Questions a developer has to answer when an application fails:

• When did it fail?
• Which parallel task failed?
• Which code region?
• Which line of code?

[Diagram: AutomaDeD spans the first three questions, narrowing the search toward the root cause]

• Need for tools that help developers find the root cause quickly

Page 4: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


AutomaDeD’s Error Detection Approach

[Workflow diagram: developers annotate phases in the application (offline); while Task 1 … Task n run, a PNMPI profiler builds one model per task, Model 1 … Model n (online); the models are then clustered to identify (1) abnormal phases, (2) abnormal tasks, and (3) characteristic transitions (offline).]

Page 5: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Types of Behavioral Differences

The behavior of the tasks (1, 2, 3, …, n) of an MPI application can differ in three ways:

• Spatial: between tasks within one run
• Temporal: between time points within one run
• Between runs: across separate runs (run 1, run 2, run 3) of the same application

Page 6: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Semi-Markov Models (SMM)

• Like a Markov model, but edges also capture the time between transitions
– Nodes: application states
– Edges: transitions from one state to another

[Example SMM: states A, B, C, D; each edge is labeled with its transition probability and the time spent in the current state before the transition, e.g., 0.2 / 5μs, 0.7 / 15μs, 0.1 / 500μs]
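To make this concrete, here is a minimal sketch of how such an edge-labeled model might be represented. The types and field names are illustrative assumptions, not AutomaDeD's actual data structures, and the figure's three labels are read here as the out-edges of a single state (their probabilities sum to 1).

    #include <stdio.h>

    /* Hypothetical representation of one SMM edge: a transition between two
     * application states, annotated with its probability and a summary of the
     * time spent in the source state before the transition is taken. */
    typedef struct {
        const char *src, *dst;  /* source and destination application states */
        double prob;            /* transition probability (out-edges sum to 1) */
        double mean_time_us;    /* mean time in src before taking this edge */
    } smm_edge;

    int main(void) {
        /* The three labeled edges from the example figure above. */
        smm_edge edges[] = {
            { "A", "B", 0.2,   5.0 },
            { "A", "C", 0.7,  15.0 },
            { "A", "D", 0.1, 500.0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%s -> %s: p = %.1f, mean time = %.0f us\n",
                   edges[i].src, edges[i].dst,
                   edges[i].prob, edges[i].mean_time_us);
        return 0;
    }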

Page 7: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


SMM Represents Task Control Flow

• States correspond to:
– Calls to MPI routines
– Code between MPI routines

Application code:

    main() {
      MPI_Init();
      ...Computation...
      MPI_Send(..., 1, MPI_INTEGER, ...);
      for (...) foo();
      MPI_Recv(..., 1, MPI_INTEGER, ...);
      MPI_Finalize();
    }

    foo() {
      MPI_Send(..., 1024, MPI_DOUBLE, ...);
      ...Computation...
      MPI_Recv(..., 1024, MPI_DOUBLE, ...);
      ...Computation...
    }

Corresponding Semi-Markov Model states:

    main()/Init → Computation → main()/Send-INT
      → main()/foo()/Send-DBL → Computation → main()/foo()/Recv-DBL → Computation (loop)
      → main()/Recv-INT → main()/Finalize

Note the different state for a different calling context: MPI_Send called from main() (Send-INT) and from foo() (Send-DBL) map to distinct states.

Page 8: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Two Approaches for Time Density Estimation: Parametric and Non-parametric

[Two plots of a density function over time values, fitted to the same data samples:
• Gaussian distribution (parametric model): cheaper, lower accuracy
• Histograms (non-parametric model): bucket counts joined by line connectors, with a Gaussian tail; more expensive, greater accuracy]
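A rough sketch contrasting the two estimators on the same timing samples; the bucket count, range handling, and function names are assumptions, and AutomaDeD's actual estimators may differ (e.g., in how the Gaussian tail of the histogram is handled).

    #include <math.h>
    #include <stdio.h>

    #define NBUCKETS 8

    /* Parametric model: fit a Gaussian, i.e., keep only mean and std. dev. */
    static void fit_gaussian(const double *t, int n, double *mu, double *sigma) {
        double s = 0.0, ss = 0.0;
        for (int i = 0; i < n; i++) s += t[i];
        *mu = s / n;
        for (int i = 0; i < n; i++) ss += (t[i] - *mu) * (t[i] - *mu);
        *sigma = sqrt(ss / n);
    }

    /* Non-parametric model: keep normalized bucket counts over [lo, hi). */
    static void fit_histogram(const double *t, int n, double lo, double hi,
                              double buckets[NBUCKETS]) {
        double width = (hi - lo) / NBUCKETS;
        for (int b = 0; b < NBUCKETS; b++) buckets[b] = 0.0;
        for (int i = 0; i < n; i++) {
            int b = (int)((t[i] - lo) / width);
            if (b < 0) b = 0;
            if (b >= NBUCKETS) b = NBUCKETS - 1;
            buckets[b] += 1.0 / n;   /* normalized bucket mass */
        }
    }

    int main(void) {
        double t[] = { 5, 6, 5, 15, 14, 16, 15, 500 };  /* edge times in us */
        int n = (int)(sizeof t / sizeof t[0]);
        double mu, sigma, buckets[NBUCKETS];
        fit_gaussian(t, n, &mu, &sigma);       /* 2 parameters: cheap, coarse */
        fit_histogram(t, n, 0, 512, buckets);  /* NBUCKETS params: finer      */
        printf("Gaussian: mu = %.1f, sigma = %.1f\n", mu, sigma);
        return 0;
    }

The trade-off on the slide falls out of the parameter counts: the Gaussian stores two numbers per edge but smears multi-modal timing behavior, while the histogram stores NBUCKETS numbers and tracks the shape of the data more faithfully.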

Page 9: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


AutomaDeD's Error Detection Approach

[Recap of the workflow diagram from Page 4; the next slides detail the phase-annotation step.]

Page 10: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


User's Phase Annotations

• Phases denote regions of execution that repeat dynamically (phase 1, 2, 3, …)
• Developers annotate the phases in the code
– MPI_Pcontrol is intercepted by the wrapper library

Sample code:

    main() {
      MPI_Init();
      ...Computation...
      MPI_Send(..., MPI_INTEGER, ...);
      for (...) {
        MPI_Send(..., MPI_DOUBLE, ...);
        ...Computation...
        MPI_Recv(..., MPI_DOUBLE, ...);
        MPI_Pcontrol();   /* phase boundary: one loop iteration = one phase */
      }
      ...Computation...
      MPI_Recv(..., MPI_INTEGER, ...);
      MPI_Finalize();
    }
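As a sketch of the interception mechanism: MPI_Pcontrol is defined as a no-op inside the MPI library, so a tool library linked ahead of it can supply its own definition and treat each call as a phase boundary. AutomaDeD's profiler is PNMPI-based; the counter and print below are illustrative, not its actual implementation.

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative tool-side definition of MPI_Pcontrol. This is a wrapper
     * library (no main()); it is linked with the instrumented application. */
    static int current_phase = 0;

    int MPI_Pcontrol(const int level, ...)
    {
        /* close the SMM built for the phase that just ended, open the next */
        current_phase++;
        printf("phase boundary: entering phase %d\n", current_phase);
        return MPI_SUCCESS;
    }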

Page 11: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


A Semi-Markov Model per Task, per Phase

[Diagram: each of Task 1, Task 2, …, Task n builds one SMM per phase as time advances, yielding a grid of SMMs: n tasks × phases 1, 2, 3, 4, ….]

Page 12: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


AutomaDeD's Error Detection Approach

[Recap of the workflow diagram from Page 4; the next slides detail the offline steps: abnormal phases, abnormal tasks, and characteristic transitions.]

Page 13: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Faulty Phase Detection: Find the Time Period of Abnormal Behavior

• Goal: find the phase that differs the most from the other phases

[Diagram: two comparison strategies, each producing a per-phase deviation score over the tasks' models SMM 1 … SMM n:
• Sample runs available: compare each phase of the current run to its counterpart phase in the sample runs.
• Without sample runs: compare each phase to all other phases of the same run.]
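For the no-sample-runs case, a plausible sketch follows; the exact scoring rule is an assumption here (summed dissimilarity to every other phase), with the dissimilarity of Page 14 precomputed into a matrix.

    #include <stdio.h>

    /* Score each phase by its total dissimilarity to every other phase and
     * flag the highest-scoring phase as abnormal. diss is a num_phases x
     * num_phases matrix of pairwise phase dissimilarities (row-major). */
    int find_abnormal_phase(int num_phases, const double *diss) {
        int worst = 0;
        double worst_score = -1.0;
        for (int p = 0; p < num_phases; p++) {
            double score = 0.0;               /* deviation score of phase p */
            for (int q = 0; q < num_phases; q++)
                if (q != p) score += diss[p * num_phases + q];
            if (score > worst_score) { worst_score = score; worst = p; }
        }
        return worst;
    }

    int main(void) {
        /* Toy 3-phase example: phase 2 is far from phases 0 and 1. */
        double diss[9] = { 0, 1, 9,
                           1, 0, 8,
                           9, 8, 0 };
        printf("abnormal phase: %d\n", find_abnormal_phase(3, diss));
        return 0;
    }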

Page 14: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Clustering Tasks' Models: Hierarchical Agglomerative Clustering (HAC)

Diss(SMM1, SMM2) = L2 norm (transition probabilities) + L2 norm (time probabilities)

[Diagram: HAC over four tasks' SMMs.
• Step 1: each task's SMM starts in its own cluster.
• Step 2: the two least dissimilar SMMs are merged into one cluster.
• Step 3: merging continues with the next least dissimilar pair of clusters.
• Step 4: do we stop, or do we keep merging until we get one cluster? We need a dissimilarity threshold to decide when to stop.]
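A minimal sketch of that dissimilarity; the flattened edge representation, the assumption that both SMMs are aligned over the same edge set, and the use of the mean as the time parameter are illustrative choices, not AutomaDeD's exact implementation.

    #include <math.h>
    #include <stdio.h>

    /* One SMM edge, flattened to the two quantities the distance compares. */
    typedef struct {
        double trans_prob;   /* transition probability of this edge       */
        double mean_time;    /* parameter of the edge's time distribution */
    } smm_edge;

    /* Diss(a, b) = L2 norm over transition-probability differences
     *            + L2 norm over time-parameter differences. */
    double smm_diss(const smm_edge *a, const smm_edge *b, int n_edges) {
        double dp = 0.0, dt = 0.0;
        for (int e = 0; e < n_edges; e++) {   /* assumes aligned edge sets */
            double p = a[e].trans_prob - b[e].trans_prob;
            double t = a[e].mean_time  - b[e].mean_time;
            dp += p * p;
            dt += t * t;
        }
        return sqrt(dp) + sqrt(dt);
    }

    int main(void) {
        smm_edge task1[] = { {0.5, 10.0}, {0.5, 20.0} };
        smm_edge task2[] = { {0.9, 10.0}, {0.1, 90.0} };
        printf("Diss = %.3f\n", smm_diss(task1, task2, 2));
        return 0;
    }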

Page 15: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


How To Select The Number Of Clusters

• The user provides the application's natural cluster count k

• Sample runs are used to compute a clustering threshold τ that produces k clusters
– Use sample runs if available
– Otherwise, compute τ from the start of the execution
– The threshold is based on the highest increase in dissimilarity (see the sketch below)

• During real runs, tasks are clustered using threshold τ
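A sketch of the highest-increase rule; the details beyond "highest increase in dissimilarity" are assumptions (with sample runs available, AutomaDeD instead picks the τ that reproduces the user's k clusters). HAC is run to completion, recording the dissimilarity of each successive merge, and τ is placed at the largest jump in that sequence so clustering stops before fusing genuinely different behaviors.

    #include <stdio.h>

    /* merge_diss: dissimilarities of successive HAC merges, in merge order. */
    double pick_threshold(const double *merge_diss, int n_merges) {
        double tau = merge_diss[0];
        double best_jump = -1.0;
        for (int i = 1; i < n_merges; i++) {
            double jump = merge_diss[i] - merge_diss[i - 1];
            if (jump > best_jump) {
                best_jump = jump;
                tau = merge_diss[i];   /* refuse merges at or above this cost */
            }
        }
        return tau;
    }

    int main(void) {
        /* Hypothetical merge costs: the big jump is 0.9 -> 7.5, so tau = 7.5
         * and the clusters existing before that merge are kept. */
        double merges[] = { 0.2, 0.5, 0.9, 7.5, 8.1 };
        printf("tau = %.2f\n", pick_threshold(merges, 5));
        return 0;
    }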

Page 16: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Cluster Isolation Example

Cluster isolation separates the buggy task into an unusual cluster.

[Master-worker application example with nine tasks. In a normal execution the tasks fall into two behavioral clusters (e.g., cluster 1: tasks 1 and 2; cluster 2: tasks 3 through 9). In a buggy execution with a bug in task 9, task 9 splits off into its own cluster 3, isolating it as the abnormal task.]

Page 17: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Transition Isolation: Erroneous Code Region Detection

• Method 1:
– Find the edge that distinguishes the faulty cluster from the others
– Recall: SMM dissimilarity is based in part on the L2 norm of the SMMs' parameters, so edges can be ranked by their contribution to that norm

• Method 2:
– Find an unusual individual edge: one that takes an unusual amount of time compared to previously observed times

[Visualization of results: the SMM graph with the isolated transition of cluster 2 highlighted]
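A sketch of Method 1 under the assumption that edges are ranked by their squared contribution to the L2 distance between the faulty cluster's average SMM and the average SMM of the remaining tasks; the types and names are illustrative.

    #include <stdio.h>

    typedef struct { double trans_prob, mean_time; } edge_params;

    /* Rank edges by their share of the inter-cluster L2 gap and return the
     * index of the most distinguishing edge (the suspect code region). */
    int most_suspect_edge(const edge_params *faulty, const edge_params *others,
                          int n_edges) {
        int best_edge = 0;
        double best_contrib = -1.0;
        for (int e = 0; e < n_edges; e++) {
            double dp = faulty[e].trans_prob - others[e].trans_prob;
            double dt = faulty[e].mean_time  - others[e].mean_time;
            double contrib = dp * dp + dt * dt;  /* this edge's share of the gap */
            if (contrib > best_contrib) { best_contrib = contrib; best_edge = e; }
        }
        return best_edge;
    }

    int main(void) {
        edge_params faulty[] = { {0.5, 10.0}, {0.5, 500.0} };  /* edge 1 stalls */
        edge_params others[] = { {0.5, 10.0}, {0.5,  20.0} };
        printf("suspect edge: %d\n", most_suspect_edge(faulty, others, 2));
        return 0;
    }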

Page 18: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Fault Injections

• NAS Parallel Benchmarks (MPI programs):
– BT, CG, FT, MG, LU, and SP
– 16 tasks, Class A (input)

• 2,000 injection experiments per application:

  Name        Description
  FIN_LOOP    Transient stall (finite loop: delay of 1, 5, or 10 sec)
  INF_LOOP    Local livelock/deadlock (infinite loop)
  DROP_MESG   MPI message loss
  REP_MESG    MPI message duplication
  CPU_THR     CPU-intensive thread
  MEM_THR     Memory-intensive thread

Page 19: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Phase Detection Accuracy

• ~90% for loops and message drops; ~60% for extra threads
• Training (= sample runs available) is significantly better than no training
• Histograms are better than Gaussians

[Bar chart: phase-detection accuracy (0% to 100%) per fault type for six configurations, (Fault10, NoFault, NoSample) × (Gauss, Histogram), grouped to contrast training vs. no training, some-fault vs. no-fault samples, and Gaussian vs. histogram.]

Page 20: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Cluster Isolation Accuracy: Isolating the Abnormal Task(s)

• Results assume the phase was detected accurately
• Accuracy of cluster isolation is highly variable
– Up to 90% for extra threads
– Poor elsewhere because of fault propagation: the buggy task corrupts the behavior of normal task(s), so several tasks look abnormal

[Bar chart: cluster-isolation accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP) for each application (BT, CG, FT, LU, MG, SP).]

Page 21: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Transition Isolation Accuracy

• Measured as whether the erroneous transition lies in AutomaDeD's top 5 candidates
– Accuracy ~90% for loop faults
– Highly variable for the other faults
– Less variable if event-order information is used

[Bar chart: transition-isolation accuracy (0% to 100%) per fault type (CPU_THR, MEM_THR, DROP_MSG, REP_MSG, FIN_LOOP-1, FIN_LOOP-5, FIN_LOOP-10, INF_LOOP) for each application (BT, CG, FT, LU, MG, SP).]

Page 22: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


MVAPICH Bug

• The job execution script failed to clean up at job end
– In the MPI task launcher (mpirun, version 0.9.9)
– It left runaway processes on the nodes

• Simulation of the bug:
– Execute BT (the affected application)
– Concurrently run runaway applications (LU, MG, or SP)
– The runaway tasks interfere with BT's normal execution

Page 23: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


MVAPICH Bug Results: SMM Deviation Scores in the Affected Application

• Affected application: the BT benchmark (16 tasks)
• Interfering applications: the SP, LU, and MG benchmarks (16 tasks each)

[Line chart: SMM deviation score (log scale, 1E+1 to 1E+5) vs. phase (1 to 10), with one series per interfering application (concurrent SP, LU, MG) plus the average no-interference baseline.]

• The abnormal phase is detected in phase 1 with SP and LU, and in phase 2 with MG
• Regular BT runs show a roughly constant (average) SMM difference

Page 24: AutomaDeD: Automata-Based Debugging for Dissimilar Parallel Tasks


Concluding Remarks

• Contributions:
– A novel way to model and compare parallel tasks' behavior
– Focuses debugging effort on the time period, tasks, and code region where the bug first manifests
– Accuracy up to ~90% for phase detection, cluster isolation, and transition isolation (for delays and hangs)

• Ongoing work:
– Scaling the implementation to millions of tasks
– Improving accuracy with different statistical models (e.g., kernel density estimation, Gaussian mixture models)

