GKLEE: Concolic Verification and Test Generation for GPUs

Guodong Li 1,2, Peng Li 1, Geof Sawaya 1, Ganesh Gopalakrishnan 1, Indradeep Ghosh 2, Sreeranga P. Rajan 2

1 University of Utah
2 Fujitsu Labs of America

Feb. 2012

GPUs are widely used!

• About 40 of the top 500 machines are GPU based
• Personal supercomputers used for scientific research (biology, physics, …) are increasingly based on GPUs

(Images courtesy of AMD, Nvidia, www.engadget.com, and Intel)

In such application domains, it is important that GPU computations yield correct answers and are bug-free.

Existing GPU Testing Methods are Inadequate

• Insufficient branch-coverage and interleaving-coverage, leading to
  – Missed data races

(Figure: concurrent accesses Write(a), Write(a), Write(a), Read(a))

Existing GPU Testing Methods are Inadequate

• Data races are a huge problem
  – Testing is NEVER conclusive
  – One has to infer a data race's ill effects indirectly through corrupted values
  – Even instrumented race checking gives results only for a specific platform, and not for future validations
    • for example, for a different warp scheduling, e.g. the changeover from old Tesla to new Fermi

Existing GPU Testing Methods are Inadequate

• Insufficient branch-coverage and interleaving-coverage, leading to
  – Missed data races
  – Missed deadlocks
• Insufficient measurement of performance penalties due to
  – Warp divergence
  – Non-coalesced memory accesses
  – Bank conflicts

(Figures: threads at a __syncthreads() barrier; accesses to memory; accesses to memory banks)

Existing GPU Testing Methods are Inadequate

• CUDA GDB Debugger
  – Manually debug the code and check races and deadlocks
• CUDA Profiler
  – Reported numbers are difficult to read
  – Low coverage (i.e. not all possible inputs)
• GKLEE
  – Better tool for verification and testing
  – Can address all the previously mentioned points
  – e.g. has found bugs in real SDK kernels previously thought to be bug-free
  – Gives root causes of the bugs

Our Contributions

• GKLEE: a Symbolic Virtual GPU for Verification, Analysis, and Test-generation
• GKLEE reports Races, Deadlocks, Bank Conflicts, Non-Coalesced Accesses, Warp Divergences
• GKLEE generates Tests to Run on GPU Hardware

Architecture of GKLEE

(Figure: a C++ GPU program with symbolic inputs is compiled by the LLVM GCC compiler, using the CUDA syntax handler, into LLVMcuda bytecode. GKLEE (executor, scheduler, checker, test generator) takes this bytecode and a GPU configuration and produces statistics/bugs and test cases. The test cases are compiled with NVCC and replayed on a real GPU.)

Rest of the Talk

• Simple CUDA example
• Details of Symbolic Virtual GPU
• Analysis Details:
  – Races, Deadlocks
  – Degree of Warp divergences, Bank Conflicts, Non-Coalesced Accesses
  – Functional Correctness
• Automatic Test Generation
  – Coverage-directed test-case reduction

CUDA

• A simple dialect of C++ with CUDA directives
• Thread blocks / teams -- SIMD “warps”
• Synchronization through barriers / atomics (GKLEE being extended to handle atomics)

Example: Increment Array Elements

Increment N-element array A by scalar b

__global__ void inc_gpu(int *A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        A[idx] = A[idx] + b;
}

(Figure: thread t0 (tid 0) computes A[0]+b, thread t1 (tid 1) computes A[1]+b, and so on.)

Illustration of Race

Increment N-element vector A by scalar b (tid 0, 1, …, 63)

__global__ void inc_gpu(int *A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        A[idx] = A[(idx - 1 + N) % N] + b;   // + N keeps the index non-negative when idx == 0
}

RACE! t63 writes A[63] while t0 reads A[63].

Illustration of Deadlock

Increment N-element vector A by scalar b (tid 0, 1, …)

__global__ void inc_gpu(int *A, int b, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        A[idx] = A[idx] + b;
        __syncthreads();
    }
}

DEADLOCK! Threads with idx < N wait at the barrier, while threads with idx ≥ N never reach it.
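The barrier problem above can be checked by hand for a given launch configuration. The following is a minimal host-side sketch, not GKLEE code: `BarrierReport` and `check_block` are hypothetical names, and the model simply counts which threads of one block would enter the `if (idx < N)` branch that contains `__syncthreads()`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical host-side model of one thread block of inc_gpu.
struct BarrierReport {
    std::size_t arrived;  // threads that execute __syncthreads()
    std::size_t skipped;  // threads that bypass the barrier (idx >= N)
    // The barrier is problematic only if some threads reach it and others do not.
    bool divergent() const { return arrived != 0 && skipped != 0; }
};

BarrierReport check_block(std::size_t block_idx, std::size_t block_dim, std::size_t n) {
    assert(block_dim > 0);
    BarrierReport r{0, 0};
    for (std::size_t t = 0; t < block_dim; ++t) {
        std::size_t idx = block_idx * block_dim + t;  // same index as in the kernel
        if (idx < n) ++r.arrived; else ++r.skipped;
    }
    return r;
}
```

For example, a block of 64 threads with N = 64 is fine, while the same block with N = 50 has 50 threads waiting at the barrier and 14 that never arrive.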

Example of a Race Found by GKLEE

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
    const int threadPos = ((threadIdx.x & (~63)) >> 0)
                        | ((threadIdx.x & 15) << 2)
                        | ((threadIdx.x & 48) >> 4);
    ...
    __syncthreads();
    for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
         pos += IMUL(blockDim.x, gridDim.x)) {
        unsigned data4 = d_Data[pos];
        ...
        addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
    }
    __syncthreads();
    ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data) {
    s_Hist[threadPos + IMUL(data, THREAD_N)]++;
}

“GKLEE: Is there a race?”

GKLEE: Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0.

Example of Test Coverage due to GKLEE

__shared__ unsigned shared[NUM];

__device__ inline void swap(unsigned &a, unsigned &b) {
    unsigned tmp = a; a = b; b = tmp;
}

__global__ void Bitonic_Sort(unsigned *values) {
    unsigned int tid = threadIdx.x;
    shared[tid] = values[tid];
    __syncthreads();

    for (unsigned k = 2; k <= blockDim.x; k *= 2)
        for (unsigned j = k / 2; j > 0; j /= 2) {
            unsigned ixj = tid ^ j;
            if (ixj > tid) {
                if ((tid & k) == 0) {
                    // ascending half: keep the smaller value at the lower index
                    if (shared[tid] > shared[ixj]) swap(shared[tid], shared[ixj]);
                } else {
                    // descending half: keep the larger value at the lower index
                    if (shared[tid] < shared[ixj]) swap(shared[tid], shared[ixj]);
                }
            }
            __syncthreads();
        }
    values[tid] = shared[tid];
}

“How do we test this?”

Answer 1: “Random + ”

Answer 2: Ask GKLEE: “Here are 5 tests with 100% source code coverage and 79% avg. thread + barrier interval coverage.”

GKLEE: Symbolic Virtual GPU

(Figure: the CUDA execution hierarchy. The host launches Kernel 1 and Kernel 2 on the device as Grid 1 and Grid 2. Grid 1 contains Block(0, 0) through Block(2, 1); Block(1, 1) contains Thread(0, 0) through Thread(4, 2). GKLEE consists of a virtual CPU and a virtual GPU.)

• GKLEE models a GPU using software
  – The virtual GPU represents the CUDA Programming Model (and hence hides many hardware details)
  – Similar to the CUDA emulator in this respect, but with many unique features
  – Can simulate CPU+GPU

Concolic Execution on the Virtual GPU

• The values can be CONCrete or symbOLIC (CONCOLIC) in GKLEE
  – A value may be a complicated symbolic expression
  – Symbolic expressions are handled by constraint solvers
    • Determine satisfiability
    • Give concrete values as evidence
  – Constraint solving has become 1,000x faster over the last 10 years

Comparing Concrete and Symbolic Execution

All values are concrete:

Program           a     b     c
(initially)       10
b = a * 2;        10    20
c = a + b;        10    20    30
if (c > 100)
    assert(0);    unreachable

Comparing Concrete and Symbolic Execution

The values can be concrete or symbolic:

Program           a               b     c
(initially)       x ∈ (-∞, +∞)
b = a * 2;        x ∈ (-∞, +∞)    2x
c = a + b;        x ∈ (-∞, +∞)    2x    3x
if (c > 100)
    assert(0);    reachable, e.g. x = 40
else
    ...           reachable, e.g. x = 30; the path condition is now 3x <= 100
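The contrast between the two runs can be replayed in ordinary C++. This is only a sketch: `reaches_assert` mirrors the slide's three-line program, and `find_witness` is a hypothetical brute-force stand-in for the constraint solver that GKLEE actually relies on, searching a small input range for a value that drives execution down the requested branch.

```cpp
#include <cassert>

// The slide's program; returns true when execution would reach assert(0).
bool reaches_assert(long a) {
    long b = a * 2;
    long c = a + b;   // symbolically: c == 3 * a
    return c > 100;
}

// Brute-force stand-in for a solver (an assumption for illustration only):
// finds the smallest x in [lo, hi] that takes the requested branch.
bool find_witness(bool want_assert, long lo, long hi, long* out) {
    for (long x = lo; x <= hi; ++x) {
        if (reaches_assert(x) == want_assert) {
            *out = x;
            return true;
        }
    }
    return false;
}
```

With the concrete input a = 10 the assertion is never reached, whereas treating the input as unknown exposes both branches: x = 40 reaches the assertion and x = 30 satisfies 3x <= 100.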

GKLEE Works on LLVM Bytecode

• CUDA C++ programs are compiled to LLVM bytecode by LLVM-GCC with our CUDA syntax handler
• Our online technical report contains a detailed description
• GKLEE extends KLEE to handle CUDA features

(Figure: LLVMcuda Syntax and Semantics)

Thread Scheduling: In general, an Exponential Number of Schedules!

It is like shuffling decks of cards

> 13 trillion shuffles exist for 5 decks with 5 cards!!

> 13 trillion schedules exist for 5 threads with 5 instructions!!

More precisely, 25! / (5!)^5
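The count 25! / (5!)^5 is a multinomial coefficient and can be evaluated exactly with 64-bit integers. The sketch below (the function names are illustrative, not from the talk) builds it as a product of binomial coefficients: choose the slots for the first thread's instructions, then for the second, and so on.

```cpp
#include <cassert>
#include <cstdint>

// Exact binomial coefficient C(n, k); after step i the value is C(n-k+i, i),
// so every intermediate division is exact.
std::uint64_t binomial(unsigned n, unsigned k) {
    assert(k <= n);
    std::uint64_t r = 1;
    for (unsigned i = 1; i <= k; ++i)
        r = r * (n - k + i) / i;
    return r;
}

// Number of interleavings of `threads` threads with `instrs` instructions each:
// (threads * instrs)! / (instrs!)^threads. Intended for small inputs only;
// larger ones overflow 64 bits.
std::uint64_t count_schedules(unsigned threads, unsigned instrs) {
    std::uint64_t total = 1;
    unsigned remaining = threads * instrs;
    for (unsigned t = 0; t < threads; ++t) {
        total *= binomial(remaining, instrs);  // slots taken by thread t
        remaining -= instrs;
    }
    return total;
}
```

For 5 threads with 5 instructions each, this evaluates to 623,360,743,125,120 schedules, comfortably above the 13 trillion quoted on the slide.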

GKLEE Avoids Examining Exp. Schedules!!

Instead of considering all Schedules and All Potential Races…

Consider JUST THIS SINGLE CANONICAL SCHEDULE!!

Folk Theorem (proved in our paper): “We will find A RACE if there is ANY race”!!

Closer Look: canonical scheduling

Race-free operations can be exchanged

A valid schedule:
  t2:a2: write y
  t1:a1: read x
  t1:a3: write x
  t2:a4: write y
  t2:a6: read y
  t1:a5: read x

Another valid schedule (e.g. canonical schedule):
  t1:a1: read x
  t2:a2: write y
  t1:a3: write x
  t2:a4: write y
  t1:a5: read x
  t2:a6: read y

The scheduler:
(1) Applies the canonical schedule;
(2) Checks races upon the barriers;
(3) If no race then continues; otherwise reports the race and terminates.
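The three scheduler steps can be mimicked on the host. The sketch below is a simplified illustration under stated assumptions, not GKLEE's implementation: `Access`, `conflicts`, `has_race`, and `record_inc_gpu` are hypothetical names, addresses are plain array indices, and the recorded accesses model one barrier interval of the `inc_gpu` kernel from the earlier slides (with or without the shifted, wrapped-around read).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One memory access recorded while running the canonical schedule.
struct Access {
    int thread;
    std::size_t addr;
    bool is_write;
};

// Two accesses conflict when different threads touch the same address
// and at least one of them writes.
bool conflicts(const Access& a, const Access& b) {
    return a.thread != b.thread && a.addr == b.addr && (a.is_write || b.is_write);
}

// Step (2): at a barrier, compare all accesses recorded in the interval.
bool has_race(const std::vector<Access>& interval) {
    for (std::size_t i = 0; i < interval.size(); ++i)
        for (std::size_t j = i + 1; j < interval.size(); ++j)
            if (conflicts(interval[i], interval[j])) return true;
    return false;
}

// Step (1): run threads in a fixed order (t0, t1, ...) and log their accesses.
// shifted_read = true models A[idx] = A[(idx - 1 + N) % N] + b.
std::vector<Access> record_inc_gpu(std::size_t n, bool shifted_read) {
    std::vector<Access> log;
    for (std::size_t t = 0; t < n; ++t) {
        std::size_t src = shifted_read ? (t + n - 1) % n : t;
        log.push_back({static_cast<int>(t), src, false});  // read A[src]
        log.push_back({static_cast<int>(t), t, true});     // write A[t]
    }
    return log;
}
```

Because the check only compares the recorded read/write sets, it does not matter in which order the threads were run, which is the point of using a single canonical schedule.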

SIMD-aware Canonical Scheduling in GKLEE

SIMD/Barrier-aware canonical scheduling within warp/block

(Figure: two warps, threads t1, t2, …, t32 and threads t33, t34, …, t64. Each warp executes Instr. 1 to Instr. 3 in barrier interval BI1 and Instr. 4 to Instr. 6 in barrier interval BI2.)

• Record accesses in canonical schedule
• Check whether the accesses conflict (e.g. have the same address)

SIMD-aware Race Checking in GKLEE

Check races on the fly (in the canonical schedule)

(Figure: two warps, threads t1, t2, …, t32 and threads t33, t34, …, t64, executing Instr. 1 to Instr. 3 in barrier interval BI1 and Instr. 4 to Instr. 6 in barrier interval BI2. Annotations mark intra-warp races, and inter-warp and inter-block races.)

SDK Kernel Example: race checking

Each of two threads t1 and t2 of histogram64Kernel executes:

  threadPos = …
  data = (data4 >> 26) & 0x3FU
  s_Hist[threadPos + data * THREAD_N]++;

RW set:
t1: writes s_Hist((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64), …
t2: writes s_Hist((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64), …

Query:
∃ t1, t2, d_Data: (t1 ≠ t2) ∧ ((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64) == ((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64) ?

GKLEE indicates that these two addresses are equal when t1 = 5, t2 = 13, d_Data[5] = 0x04040404, and d_Data[13] = 0, indicating a Write-Write race.
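The reported witness can be re-evaluated by plugging the concrete values into the index expression. In the sketch below, `thread_pos` and `hist_index` are illustrative helper names, and `thread_n` is passed as a parameter because the slides do not state the value of THREAD_N. Taking 32 is an assumption here, chosen to match the 32 threads listed for Histogram64 in the results table; with it both threads write index 52. With the multiplier 64 that appears in the RW-set expressions, the same inputs would give 84 and 52.

```cpp
#include <cassert>
#include <cstdint>

// threadPos exactly as written in histogram64Kernel.
std::uint32_t thread_pos(std::uint32_t tid) {
    return ((tid & ~63u) >> 0) | ((tid & 15u) << 2) | ((tid & 48u) >> 4);
}

// Index written by addData64 for the (data4 >> 26) & 0x3F call.
// thread_n stands in for THREAD_N, whose value is an assumption.
std::uint32_t hist_index(std::uint32_t tid, std::uint32_t data4, std::uint32_t thread_n) {
    std::uint32_t data = (data4 >> 26) & 0x3Fu;
    return thread_pos(tid) + data * thread_n;
}
```

Thread 5 has threadPos 20 and data 1 (the top six bits of 0x04040404), while thread 13 has threadPos 52 and data 0, so the two indices collide exactly when THREAD_N equals 32.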

Experimental Results, Part I (check correctness and performance issues)

The results of running GKLEE on CUDA SDK 2.0 kernels. GKLEE checks (1) well synchronized barriers; (2) races; (3) functional correctness; (4) bank conflicts; (5) memory coalescing; (6) warp divergence; (7) required volatile keyword.

Kernels | Loc | Race | Func. Correct. | #T | Bank Conflict (Perf.), 1.x | Bank Conflict (Perf.), 2.x | Coalesced Accesses (Perf.), ≤1.1 | Coalesced Accesses (Perf.), 2.x | Warp Diverg. (Perf.) | Volatile Needed
Bitonic Sort | 30 | | yes | 4 | 0% | 0% | 100% | 100% | 60% | no
Scalar Prod. | 30 | | yes | 64 | 0% | 0% | 11% | 100% | 100% | yes
Matrix Mult | 61 | | yes | 64 | 0% | 0% | 100% | 100% | 0% | no
Histogram64 | 69 | WW | unknown | 32 | 66% | 66% | 100% | 100% | 0% | yes
Reduction (7) | 231 | | yes | 16 | 0% | 0% | 100% | 100% | 16-83% | yes
Scan Best | 78 | | yes | 32 | 71% | 71% | 100% | 100% | 71% | no
Scan Naïve | 28 | | yes | 32 | 0% | 0% | 50% | 100% | 85% | yes
Scan Effi. | 60 | | yes | 32 | 83% | 16% | 0% | 0% | 83% | no
Scan Large | 196 | | yes | 32 | 71% | 71% | 100% | 100% | 71% | no
Radix Sort | 750 | WW | unknown | 16 | 3% | 0% | 0% | 100% | 5% | yes
Bisect Small | 1,000 | ben. | _ | 16 | 38% | 0% | 97% | 100% | 43% | yes
Bisect Large | 1,400 | ben. | _ | 16 | 15% | 0% | 99% | 100% | 53% | yes

Automatic Test Generation

• GKLEE guarantees to explore all paths w.r.t. given inputs
• The path constraint at the end of each path is solved to generate concrete test cases
  – GKLEE supports many heuristic reduction techniques

(Figure: path trees for threads t1 and t2, each branching on conditions c1, c2, c3, c4 and their negations, and the combined tree for t1+t2. The constraint along one combined path is solved to give a concrete test.)

SDK Example: comprehensive testing

(Code: BitonicKernel, the bitonic sort kernel listed under “Example of Test Coverage due to GKLEE”.)

(Figure: a path tree over the comparisons made by the kernel. It branches first on shared[0] > shared[1] versus shared[0] ≤ shared[1], then on shared[1] < shared[2] versus shared[1] ≥ shared[2], then on shared[0] > shared[2] versus shared[0] ≤ shared[2].)

Unsat: shared[0] > shared[1] ∧ shared[1] ≥ shared[2] ∧ shared[0] ≤ shared[2]
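The infeasible path above follows from transitivity: shared[0] > shared[1] ≥ shared[2] forces shared[0] > shared[2]. As a sanity check, the sketch below searches a small bounded domain exhaustively. This is only an illustration with hypothetical function names; a constraint solver establishes the same fact for all values without enumeration.

```cpp
#include <cassert>

// The path condition reported as Unsat on the slide.
bool path_condition(unsigned s0, unsigned s1, unsigned s2) {
    return s0 > s1 && s1 >= s2 && s0 <= s2;
}

// Exhaustive search over values in [0, bound); returns true if any
// assignment satisfies the path condition.
bool satisfiable_up_to(unsigned bound) {
    for (unsigned s0 = 0; s0 < bound; ++s0)
        for (unsigned s1 = 0; s1 < bound; ++s1)
            for (unsigned s2 = 0; s2 < bound; ++s2)
                if (path_condition(s0, s1, s2)) return true;
    return false;
}
```

No test case is generated for such a path, since no input can drive the kernel along it.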

SDK Example: comprehensive verification

Functional correctness: the output values are sorted: values[0] ≤ values[1] ≤ … ≤ values[n]

(Figure: a tree of explored paths, each ending in a leaf labeled with its resulting values=….)
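The sortedness property can be tried out on concrete inputs with a host-side replay of the bitonic kernel. This sketch is an illustration, not GKLEE's checker: `bitonic_sort_host` and `is_sorted_output` are hypothetical names, the array length is assumed to be a power of two, and each inner loop over `tid` stands for one barrier interval in which every thread runs the loop body once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Replays the kernel's compare-and-swap steps for a single block on the host.
std::vector<unsigned> bitonic_sort_host(std::vector<unsigned> shared) {
    const std::size_t n = shared.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // length must be a power of two
    for (std::size_t k = 2; k <= n; k *= 2)
        for (std::size_t j = k / 2; j > 0; j /= 2)
            // One barrier interval: pairs (tid, ixj) are disjoint, so running
            // the "threads" one after another gives the same result.
            for (std::size_t tid = 0; tid < n; ++tid) {
                std::size_t ixj = tid ^ j;
                if (ixj > tid) {
                    if ((tid & k) == 0) {
                        if (shared[tid] > shared[ixj]) std::swap(shared[tid], shared[ixj]);
                    } else {
                        if (shared[tid] < shared[ixj]) std::swap(shared[tid], shared[ixj]);
                    }
                }
            }
    return shared;
}

// The functional-correctness property: values[0] <= values[1] <= ...
bool is_sorted_output(const std::vector<unsigned>& values) {
    return std::is_sorted(values.begin(), values.end());
}
```

Running this on a handful of inputs is ordinary testing; the verification described on the slide instead checks the property on every explored path with symbolic inputs.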

Experimental Results, Part II (Automatic Test Generation)

Coverage information about the generated tests for some CUDA kernels.

Kernels | src. code coverage | Avg. Covt | Max. Covt | Avg. CovBIt | Max. CovBIt | Exec. time
Bitonic Sort | 100%/100% | 78%/76% | 100%/94% | 79%/66% | 90%/76% | 1s
Merge Sort | 100%/100% | 88%/70% | 100%/85% | 93%/86% | 100%/100% | 1.6s
Word Search | 100%/100% | 100%/81% | 100%/85% | 100%/97% | 100%/100% | 0.1s
Suffix Tree Match | 100%/90% | 55%/49% | 98%/66% | 55%/49% | 98%/83% | 31s
Histogram64 | 100%/100% | 100%/75% | 100%/75% | 100%/100% | 100%/100% | 600s

Covt and CovTBt measure bytecode coverage w.r.t. threads. No test reductions were used in generating this table. Exec. time is measured on a typical workstation.

Experimental Results, Part II (Coverage Directed Test Reduction)

Results after applying reduction heuristics

RedTB and RedBI cut the paths according to the coverage information of Thread+Barrier and Barrier respectively. Basically, a path is pruned if it is unlikely to contribute new coverage.

Kernels | No Reductions, #path | No Reductions, Avg. CovBIt | RedTB, #path | RedTB, Avg. CovBIt | RedBI, #path | RedBI, Avg. CovBIt
Bitonic Sort | 28 | 79%/66% | 5 | 79%/66% | 5 | 79%/65%
Merge Sort | 34 | 93%/86% | 4 | 92%/84% | 4 | 92%/84%
Word Search | 8 | 100%/97% | 2 | 100%/97% | 2 | 94%/85%
Suffix Tree Match | 31 | 55%/49% | 6 | 55%/49% | 6 | 55%/49%
Histogram64 | 13 | 100%/100% | 5 | 100%/100% | 5 | 100%/100%

Additional GKLEE Features

• GKLEE employs an efficient memory organization
• Employs many expression evaluation optimizations
  – Simplify concolic expressions on the fly
  – Dynamically cache results
  – Apply dependency analysis before constraint solving
  – Use manually optimized C/C++ libraries
• GKLEE also handles all of the C++ syntax
• GKLEE never generates false alarms

Experimental Results, Part III (performance comparison of two tools)

Execution times (in seconds) of GKLEE and PUG [SIGSOFT FSE 2010] for functional correctness check. #T is the number of threads. Time is reported in the format GPU time(entire time); T.O means > 5 minutes.

Kernels | #T = 4, PUG | #T = 4, GKLEE | #T = 16, PUG | #T = 16, GKLEE | #T = 64, GKLEE | #T = 256, GKLEE | #T = 1,024, GKLEE
Simple Reduct. | 2.8 | <0.1(<0.1) | T.O | <0.1(<0.1) | <0.1(<0.1) | 0.2(0.3) | 2.3(2.9)
Matrix Transp. | 1.9 | <0.1(<0.1) | T.O | <0.1(0.3) | <0.1(3.2) | <0.1(63) | 0.9(T.O)
Bitonic Sort | 3.7 | 0.9(1) | T.O | T.O | T.O | T.O | T.O
Scan Large | _ | <0.1(<0.1) | _ | <0.1(<0.1) | 0.1(0.2) | 1.6(3) | 22(51)

Other Details

• Diverged warp scheduling, intra-warp and inter-warp/inter-block race checking, textually aligned barrier checking
• Checking performance issues
  – warp divergence, bank conflicts, global memory coalescing
• Path/Test reduction techniques
• Volatile declaration checking
• Handling symbolic aliasing and pointers
• Drivers for the kernels and replaying on the real GPU
• Other results, e.g. on CUDA SDK 4.0 programs
• CUDA’s relaxed memory model and semantics

Summary

• GKLEE: symbolic virtual GPU
  – Identify correctness and performance issues
  – Produce concrete tests with high code coverage
  – Enable symbolic parallel debugging for CUDA programs
  – Good for other CUDA applications (e.g. compiler optimization verification, regression testing, etc.)
• The tool is open source and available at:
  – www.cs.utah.edu/fv/GKLEE
  – with tutorial, manual, tech. report, liveDVD, etc.
• Future Work
  – Parameterized verification (e.g. equivalence checking)
  – Support for floating point numbers
  – Combination with runtime execution (on the real GPU)

Thank You!

Questions?

Obtain GKLEE from www.cs.utah.edu/fv/GKLEE