Demand-Driven Software Race Detection using Hardware Performance Counters

Joseph L. Greathouse†, Zhiqiang Ma‡, Matthew I. Frank‡

Ramesh Peri‡, Todd Austin†

†University of Michigan ‡Intel Corporation

CSCADS

Aug 2, 2011

Concurrency Bugs Still Matter

In spite of proposed hardware solutions (Bulk Memory Commits, Transactional Memory, AMD's ASF, Sun's Rock), their availability remains uncertain, and concurrency bugs still matter.

Concurrency Bugs Matter NOW

Nov. 2010 OpenSSL Security Flaw (Thread 1: mylen=small, Thread 2: mylen=large), interleaved in time:

Thread 1: if(ptr==NULL)
Thread 2: if(ptr==NULL)
Thread 2: len2=thread_local->mylen;
Thread 2: ptr=malloc(len2);
Thread 2: memcpy(ptr, data2, len2)   (ptr LEAKED)
Thread 1: len1=thread_local->mylen;
Thread 1: ptr=malloc(len1);
Thread 1: memcpy(ptr, data1, len1)

This Talk in One Sentence

Speed up software race detection with existing hardware support.

Software Data Race Detection

• Add checks around every memory access
• Find inter-thread sharing events
• Synchronization between write-shared accesses?
• No? Data race.

Example of Data Race Detection

Thread 1 (mylen=small) and Thread 2 (mylen=large), interleaved in time:

Thread 1: if(ptr==NULL)
Thread 1: len1=thread_local->mylen;
Thread 1: ptr=malloc(len1);
Thread 1: memcpy(ptr, data1, len1)
Thread 2: if(ptr==NULL)
Thread 2: len2=thread_local->mylen;
Thread 2: ptr=malloc(len2);
Thread 2: memcpy(ptr, data2, len2)

The detector asks: is ptr write-shared? Are the accesses interleaved? Is there synchronization between them?

SW Race Detection is Slow

[Chart: Race Detector Slowdown (x) for Phoenix benchmarks (histogram, kmeans, linear_reg, matrix_mul, pca, string_match, word_count) and PARSEC benchmarks (blackscholes, bodytrack, facesim, ferret, freqmine, raytrace, swaptions, fluidanimate, vips, x264, canneal, dedup, streamcluster), with per-suite geometric means; y-axis 0–300x.]

Goal of this Work

Accelerate Software Data Race Detection

• Technique #1 (Making it Fast): Demand-Driven Data Race Detection
• Technique #2 (Keeping it Real): Find sharing events with existing HW

Inter-thread Sharing is What's Important

"Data races ... are failures in programs that access and update shared data in critical sections" – Netzer & Miller, 1992

In the example, accesses to thread-local data (e.g., len1=thread_local->mylen;) involve NO SHARING, and shared accesses that are not interleaved with the other thread produce NO INTER-THREAD SHARING EVENTS. Only the interleaved, write-shared accesses need to be checked.

Very Little Inter-Thread Sharing

[Chart: % Write-Sharing Events for the Phoenix and PARSEC benchmarks (two panels with different y-axis scales); write-sharing is a small fraction of all memory accesses.]

Technique 1: Demand-Driven Analysis

[Diagram: the multi-threaded application's memory accesses pass through an inter-thread sharing monitor; local accesses proceed unchecked, while inter-thread sharing events are forwarded to the software race detector.]

Inter-thread Sharing Monitor

• Check each memory op. for write-sharing
• Signal software race detector on sharing
• Possible to do in software
  + Can be built now with instrumentation
  – Slow; may take as long as race detection itself

Ideal Hardware Sharing Detector

• Follow read/write sets of threads
• Fast user-level faults

[Diagram: T1 and T2 start with empty read/write sets (R: ∅, W: ∅). Thread 1's WRITE Y puts Y in T1's write set (W: {Y}); Thread 2's subsequent READ Y triggers a W→R sharing fault to the sharing monitor.]

Limitations of Existing Hardware

• Fast faults? NO
  • Solution: enable the detector for long periods of time
• Read/write sets? NO
  • Solution: detect sharing events with existing hardware performance counters (Technique 2)

Technique 2: Hardware Sharing Detector

• Hardware Performance Counters
  • Interrupt on cache coherency events
  • Intel's HITM event: W→R data sharing (a load that hits a line in Modified state in another core's cache)
• Limitations of this method:
  • SMT sharing can't be counted
  • Cache evictions
  • Others in paper

[Diagram: MESI coherence states; a read that hits another core's Modified (M) line raises HITM as the line transitions to Shared (S).]

Demand-Driven Analysis on Real HW

[Flowchart: Execute instruction → Analysis enabled?
  NO → HITM interrupt? NO → next instruction; YES → Enable analysis.
  YES → SW race detection → Sharing recently? YES → next instruction; NO → Disable analysis (return to HITM monitoring).]

Experimental Evaluation

• Modified Intel Inspector XE race detector
• Linux on 4-core Core i7, no Hyper-Threading
• Performance tests:
  • Phoenix suite
  • PARSEC
• Accuracy tests:
  • Phoenix suite
  • PARSEC
  • Pre-release version of RADBench

Performance Difference

[Chart: Race Detector Slowdown (x) across the Phoenix and PARSEC benchmarks with per-suite geometric means; y-axis 0–300x.]

Performance Increases

[Chart: Demand-driven Analysis Speedup (x) across the Phoenix and PARSEC benchmarks with per-suite geometric means; y-axis 0–20x, with one benchmark reaching 51x.]

Demand-Driven Analysis Accuracy

[Chart: Demand-driven Analysis Speedup (x) annotated with races found per benchmark relative to continuous analysis (1/1, 2/4, 3/3, 4/4, 3/3, 4/4, 4/4, 2/4, 4/4, 4/4, 2/4). Accuracy vs. continuous analysis: 97%.]

Future Directions

• Better Performance
  • Fast user-level faults
  • Application-specific hardware
• More Accuracy
  • Better performance counters
  • Inform SW on cache evictions/misses
• Smooth transition to ideal hardware
  • Combine sampling & demand-driven analysis