Source: synergy.cs.vt.edu/2015-nsf-xps-workshop/reports/...

Handling Production-Run Concurrency-Bug Failures

Shan Lu

University of Chicago

Karu Sankaralingam

University of Wisconsin - Madison


Reliability is crucial

• Production-run failures are costly

• Reliability affects other aspects of parallel systems

“We don’t use thread-level parallelism because that is difficult to get correct”


Reliability is challenging

• Concurrency bugs

– Synchronization problems in multi-threaded software

• Concurrency bugs widely exist in production

– In-house testing is ineffective for concurrency bugs


How to handle production-run failures caused by concurrency bugs?

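Example (MySQL), two racing threads:
    one thread:    if (proc) { tmp = *proc; }
    other thread:  proc = NULL;

Example (MySQL), two racing threads:
    one thread:    _state = mThd->state;
    other thread:  mThd = CreateThd();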


Rollback recovery

[Figure: Thread 1, Thread 2, Thread 3]


Problems of the traditional approach

[Figure: Thread 1, Thread 2, Thread 3]


Bug examples


Do we need to roll back all threads?

Do we need a memory checkpoint?

ConAir: Featherweight Concurrency Bug Recovery Via Single-Threaded Idempotent Execution [ASPLOS13]

Example (MySQL), two racing threads:
    one thread:    if (proc) { tmp = *proc; }
    other thread:  proc = NULL;

Example (MySQL), two racing threads:
    one thread:    _state = mThd->state;
    other thread:  mThd = CreateThd();


ConAir system

[Figure: Thread 1, Thread 2, Thread 3]

• Guarantees no change to program semantics

• No change to OS/hardware

• No prior bug knowledge required

• Negligible overhead (<0.2%)

• Works for 16 out of 26 real-world bugs

ConAir: Featherweight Concurrency Bug Recovery Via Single-Threaded Idempotent Execution [ASPLOS13]


The problems

• What if the error propagation distance is long?

• What if the failure thread was already too slow?

Example (MySQL), two racing threads:
    one thread:    tmp = *ptr;
    other thread:  free(ptr);


Proactive prevention

[Figure: Thread 1, Thread 2, Thread 3]



Bug examples


Example (MySQL), two racing threads:
    one thread:    if (proc) { tmp = *proc; }
    other thread:  proc = NULL;

Example (MySQL), two racing threads:
    one thread:    _state = mThd->state;
    other thread:  mThd = CreateThd();

Do we need to perturb multiple threads?

Do we perturb at a random place?

AI: A Lightweight System for Tolerating Concurrency Bugs [FSE14] (ACM SIGSOFT Distinguished Paper Award)


A simple and generic scheme

• The best perturbation point is right before a memory access that has an incorrect/abnormal remote predecessor

AI: A Lightweight System for Tolerating Concurrency Bugs [FSE14] (ACM SIGSOFT Distinguished Paper Award)


AI system

Example (MySQL), two racing threads:
    one thread:    if (proc) { tmp = *proc; }
    other thread:  proc = NULL;

Example (MySQL), two racing threads:
    one thread:    _state = mThd->state;
    other thread:  mThd = CreateThd();

• Guarantees no change to program semantics

• No change to OS/hardware

• No prior bug knowledge required

• Training required

• Overhead: 1% ~ 10X (ConAir: negligible, <0.2%)

• Works for 35 out of 35 real-world bugs (ConAir: 16 out of 26)

AI: A Lightweight System for Tolerating Concurrency Bugs [FSE14] (ACM SIGSOFT Distinguished Paper Award)


ConAir vs. AI

• Can we combine ConAir and AI?

                ConAir                                      AI
Performance     Great                                       Poor when there are intensive shared-memory accesses
Functionality   Poor when the failure thread is too slow;   Great, but requires training
                poor when error propagation is long


Summary & Other efforts

• Reactive approach to production-run failures

– ConAir: Featherweight Concurrency Bug Recovery Via Single-Threaded Idempotent Execution [ASPLOS13]

• Proactive approach to production-run failures

– AI: A Lightweight System for Tolerating Concurrency Bugs [FSE14] (ACM SIGSOFT Distinguished Paper Award)

• How much can developers contribute?

– What change history tells us about thread synchronization [FSE15]

• How much can hardware contribute?

Thank XPS!

