cpeg421-10-F/Topic-3-II-EARTH 1
Topic 2 -- II: Compilers and Runtime Technology:
Optimization Under Fine-Grain Multithreading: The EARTH Model (in more detail)
Guang R. Gao
ACM Fellow and IEEE Fellow
Endowed Distinguished Professor
Electrical & Computer Engineering
University of Delaware
Outline
• Overview
• Fine-grain multithreading
• Compiling for fine-grain multithreading
• The power of fine-grain synchronization - SSB
• The percolation model and its applications
• Summary
The EARTH Multithreaded Execution Model
Two levels of fine-grain threads:
- threaded procedures
- fibers
[Figure: EARTH execution model. Labels: fiber within a frame; async. function invocation; a sync operation; invoke a threaded func; signal token; total # signals; arrived # signals.]
EARTH vs. CILK
[Figure: side-by-side diagrams of the EARTH model and the CILK model. Labels: fiber within a frame; parallel function invocation frames; fork a procedure; SYNC ops.]
Note: EARTH has its origin in the static dataflow model
The “Fiber” Execution Model
[Animation over twelve slides: a signal token moves through the fiber graph. Each sync slot shows its arrived # signals against its total # signals. Across the sequence the arrived counts rise one signal at a time, from 0/2, 0/2, 0/1, 0/2, 0/4 on the first slide to 2/2, 2/2, 1/1, 2/2, 4/4 on the last.]
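The counting shown on these slides (arrived # signals against total # signals) can be sketched in a few lines of C. This is only an illustration under stated assumptions: the type `sync_slot` and the functions `slot_init` and `slot_signal` are hypothetical names chosen for this sketch and are not the EARTH runtime interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sync slot; field and function names are illustrative only
   and are not the EARTH runtime interface. */
typedef struct {
    int total;    /* total # signals the fiber waits for */
    int arrived;  /* arrived # signals so far */
} sync_slot;

void slot_init(sync_slot *s, int total)
{
    assert(total > 0);
    s->total = total;
    s->arrived = 0;
}

/* Deliver one signal token. Returns true exactly when this signal
   completes the count, i.e. when the guarded fiber becomes ready. */
bool slot_signal(sync_slot *s)
{
    assert(s->arrived < s->total);
    s->arrived++;
    return s->arrived == s->total;
}
```

A slot initialized with a total of 4 stays not-ready through its first three signals and enables its fiber on the fourth, which mirrors the slot that advances from 0/4 to 4/4 in the animation.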
A Loop Example
for (i = 1; i <= N; ++i) {
    S1: …
    S2: x[i] = …
    S3: y[i] = … + x[i-1] …
    . . .
    Sk: …
}
[Figure labels: S1, S2, S3, Sk; i = 1, i = 2, i = 3, i = N; T1, T2, T3.]
Note: How are loop-carried dependencies handled, and what are the implications for cross-core software pipelining?
Main Features of EARTH
* Fast thread context switching
• Efficient parallel function invocation
• Good support of fine grain dynamic load balancing
* Efficient support for split-phase transactions and fibers
*Features unique to the EARTH model in comparison to the CILK model
Compiling C for EARTH: Objectives
• Design simple high-level extensions for C that allow programmers to write programs that will run efficiently on multi-threaded architectures. (EARTH-C)
• Develop compiler techniques to automatically translate programs written in EARTH-C to multi-threaded programs. (EARTH-C, Threaded-C)
• Determine if EARTH-C + compiler can compete with hand-coded Threaded-C programs.
Summary of EARTH-C Extensions
• Explicit Parallelism
– Parallel versus Sequential statement sequences
– Forall loops
• Locality Annotation
– Local versus Remote Memory references (global, local, replicate, …)
• Dynamic Load Balancing
– Basic versus remote function and invocation sites
EARTH-C Compiler Environment
[Figure: EARTH compilation environment. Labels: C; EARTH-C; McCAT; EARTH-C Compiler; EARTH SIMPLE; Program Dependence Analysis; Thread Partitioning; Thread Generation; Threaded-C; Threaded-C Compiler.]
The McCAT/EARTH Compiler
[Figure: compiler phases between the EARTH-C input, the EARTH-SIMPLE-C intermediate form, and the THREADED-C output.]
• Simplify, goto elimination, local function inlining, points-to analysis, heap analysis, R/W set analysis, array dependence tester
• Forall loop detection, loop partitioning
• Build hierarchical DDG, thread generation
• Code generation
04/18/23 \Petaflop\Workshop98-7B.ppt 25
The Fibonacci Example
if (n < 2)
    DATA_RSYNC(1, result, done);
else {
    TOKEN(fib, n-1, &sum1, slot_1);
    TOKEN(fib, n-2, &sum2, slot_2);
}
END_THREAD();
THREAD_1:
DATA_RSYNC(sum1 + sum2, result, done);
END_THREAD();
END_FUNCTION
[Figure: frame of fib with parameters n, result, done and sync-slot counts 0 0 / 2 2.]
Matrix Multiplication: Sequential Version
void main()
{
    int i, j, k;
    float sum;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            sum = 0;
            for (k = 0; k < N; k++)
                sum = sum + a[i][k] * b[k][j];
            c[i][j] = sum;
        }
}
The Inner Product Example
BLKMOV_SYNC(a, row_a, N, slot_1);
BLKMOV_SYNC(b, column_b, N, slot_1);
sum = 0;
END_THREAD();
THREAD_1:
for (i = 0; i < N; i++)
    sum = sum + (row_a[i] * column_b[i]);
DATA_RSYNC(sum, result, done);
END_THREAD();
END_FUNCTION
[Figure: frame of inner with parameters a, b, result, done and sync-slot counts 0 0 / 2 2.]
EARTH-C to Threaded-C (Thread Generation)
Given a sequence of statements s1, s2, …, sn, we wish to create threads that:
– Maximize thread length (minimize thread-switching overhead)
– Retain sufficient parallelism
– Issue remote memory requests as early as possible (prefetching)
– Compile split-phase remote memory operations and remote function calls correctly
An Example
int f(int *x, int i, int j)
{
    int a, b, sum, prod, fact;
    int r1, r2, r3;
    a = x[i];
    fact = 1;
    fact = fact * a;
    b = x[j];
    sum = a + b;
    prod = a * b;
    r1 = g(sum);
    r2 = g(prod);
    r3 = g(fact);
    return (r1 + r2 + r3);
}
Example Partitioned into Four Fibers
Fiber-0: a = x[i]; fact = 1;
Fiber-1: fact = fact * a; b = x[j];
Fiber-2: sum = a + b; prod = a * b; r1 = g(sum); r2 = g(prod); r3 = g(fact);
Fiber-3: return (r1 + r2 + r3);
Better Strategy Using List Scheduling
• Put each instruction in the earliest possible thread.
• Within a thread, the remote operations are executed as early as possible.
Build a Data Dependence Graph (DDG), and use a list scheduling strategy, where the selection of instructions is guided by Earliest Thread Number and Statement Type.
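The "earliest thread number" that guides this scheduling can be illustrated with a small C sketch. It assumes one simple rule: a statement that depends on a split-phase operation (a remote read or a remote function call) must sit at least one thread later than that operation, while a dependence on a local statement allows the same thread. The `stmt` type, its fields, and `earliest_threads` are hypothetical names for this sketch. It covers only the thread-numbering step, not the priority selection by statement type.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_STMTS 16

/* Hypothetical DDG node; names are illustrative, not the compiler's own. */
typedef struct {
    bool split_phase;       /* remote read or remote function call */
    int  npreds;            /* number of DDG predecessors */
    int  preds[MAX_STMTS];  /* indices of the statements depended on */
} stmt;

/* Compute the earliest thread number of each statement.
   Statements must be listed so that every predecessor comes first. */
void earliest_threads(const stmt *s, int n, int *thread)
{
    assert(n <= MAX_STMTS);
    for (int i = 0; i < n; i++) {
        thread[i] = 0;
        for (int k = 0; k < s[i].npreds; k++) {
            int p = s[i].preds[k];
            assert(p >= 0 && p < i);
            /* A split-phase predecessor completes in a later thread. */
            int t = thread[p] + (s[p].split_phase ? 1 : 0);
            if (t > thread[i])
                thread[i] = t;
        }
    }
}
```

Applied to the ten statements of the example function f (two remote reads, `fact = 1`, three local computations, three calls to g, and the return), this rule gives thread numbers 0, 0, 0, 1, 1, 1, 1, 1, 1, 2.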
Instruction Types
• Schedule First
– remote_read, remote_write
– remote_fn_call
– local_simple
– remote_compound
– local_compound
– basic_fn_call
• Schedule Last
List Scheduling Previous Example
[Figure: DDG of the previous example; each node is labeled (earliest thread number, statement type).]
(0,RR) (0,RR) (0,LS)
(1,LS) (1,LS) (1,LC)
(1,RF) (1,RF) (1,RF)
(2,LS)
Resulting List Scheduled Threads
Thread 0: a=x[i]; b=x[j]; fact=1;
Thread 1: sum=a+b; r1=g(sum); prod=a*b; r2=g(prod); fact=fact*a; r3=g(fact);
Thread 2: return (r1+r2+r3);
Sync counts shown in the figure: 2 and 3.
Generating Threaded-C Code
THREADED f(int *ret_parm, SLOT *rsync_parm, int *x, int i, int j)
{
    SLOTS SYNC_SLOTS[2];
    int a, b, sum, prod, fact, r1, r2, r3;

    /* THREAD_0:; */
    INIT_SYNC(0, 2, 2, 1);
    INIT_SYNC(1, 3, 3, 2);
    GET_SYNC_L(&x[i], &a, 0);
    GET_SYNC_L(&x[j], &b, 0);
    fact = 1;
    END_THREAD();

    THREAD_1:;
    sum = a + b;
    TOKEN(g, &r1, SLOT_ADR(1), sum);
    prod = a * b;
    TOKEN(g, &r2, SLOT_ADR(1), prod);
    fact = fact * a;
    TOKEN(g, &r3, SLOT_ADR(1), fact);
    END_THREAD();

    THREAD_2:;
    DATA_RSYNC_L(r1 + r2 + r3, ret_parm, rsync_parm);
    END_FUNCTION();
}
Fine-Grain Synchronization: Two Types
Sync type: Enforce Mutual Exclusion
• Order: no specific order required
• Fine-grain sync. solutions: software fine-grained locks; lock-free concurrent data structures; full/empty bits
Sync type: Enforce Data Dependencies
• Order: uni-directional
• Fine-grain sync. solutions: I-structures; full/empty bits
Enforce Data Dependencies
• A DoAcross loop with positive and constant dependence distance.
for (i = D; i < N; ++i) {
    A[i] = …
    …
    … = A[i-D];
}
In parallel execution, iterations are assigned to different threads:
T0 (i = 2):     { A[2] = …  …  … = A[2-D] }
T1 (i = 2 + D): { A[2+D] = …  …  … = A[2] }
The data dependence needs to be enforced by synchronization
Memory Based Fine-Grain Synchronization:
• Full/Empty Bits (HEP, Tera MTA, etc) & I-Structures (dataflow based machines)
• Associate “state” to a memory location (fine-granularity). Fine-grain synchronization for the memory location is realized through “state transition” on such “state”.
I-Structure state transition [ArvindEtAl89 @ TOPLAS]
[Figure: state diagram with states Empty, Deferred, and Full; transitions labeled read, write, and reset.]
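The state-based synchronization described here can be sketched as a small transition function in C. The six legal transitions below are an assumption based on the usual I-structure semantics (a read of an empty cell is deferred, a write fills the cell and releases deferred readers, a reset empties a full cell). Anything else is treated as an error in this sketch. The names `istate`, `iop`, `istructure_next`, and `istructure_run` are hypothetical.

```c
#include <assert.h>

typedef enum { IS_EMPTY, IS_DEFERRED, IS_FULL, IS_ERROR } istate;
typedef enum { OP_READ, OP_WRITE, OP_RESET } iop;

/* Next state of one memory cell; IS_ERROR marks a transition that this
   sketch does not allow (for example, writing a cell that is already full). */
istate istructure_next(istate s, iop op)
{
    switch (s) {
    case IS_EMPTY:
        if (op == OP_READ)  return IS_DEFERRED; /* reader arrived early, is queued */
        if (op == OP_WRITE) return IS_FULL;
        break;
    case IS_DEFERRED:
        if (op == OP_READ)  return IS_DEFERRED; /* one more queued reader */
        if (op == OP_WRITE) return IS_FULL;     /* value arrives, readers released */
        break;
    case IS_FULL:
        if (op == OP_READ)  return IS_FULL;     /* value available immediately */
        if (op == OP_RESET) return IS_EMPTY;
        break;
    default:
        break;
    }
    return IS_ERROR;
}

/* Apply a sequence of operations, checking that each one is legal. */
istate istructure_run(istate s, const iop *ops, int n)
{
    for (int i = 0; i < n; i++) {
        s = istructure_next(s, ops[i]);
        assert(s != IS_ERROR);
    }
    return s;
}
```

For instance, two early reads followed by a write take a cell from Empty through Deferred to Full, and a later reset returns it to Empty.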
With Memory Based Fine-Grain Sync
• A single atomic operation completes a synchronized write/read directly in memory
• No need to implement synchronization with other resources, e.g., shared memory.
• Low overhead: just one memory transaction
Original loop:
for (i = D; i < N; ++i) {
    A[i] = …
    …
    … = A[i-D];
}
With synchronized accesses:
for (i = D; i < N; ++i) {
    write_sync(&(A[i]), …);
    …
    … = read_sync(&(A[i-D]));
}
T0 (i = 2):     { write_sync(&(A[2]), …);  …  … = read_sync(&(A[2-D])); }
T1 (i = 2 + D): { write_sync(&(A[2+D]), …);  …  … = read_sync(&(A[2])); }
An Alternative: control-flow based synchronizations
• The post/wait instructions need to be implemented in shared memory, in coordination with the underlying memory (consistency) model
for (i = D; i < N; ++i) {
    A[i] = …
    post(i);
    …
    wait(i-D);
    … = A[i-D];
}
• You may need to worry about this: there is no data dependency between A[i] = … and post(i), or between wait(i-D) and … = A[i-D], so fences are needed to keep each pair in order:
A[i] = …; fence; post(i);
wait(i-D); fence; … = A[i-D];
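One way to express the post/wait pattern with its ordering requirement in standard C is with C11 atomics, where a release store and an acquire load play the role of the explicit fences. The following is a minimal sketch under stated assumptions, not the mechanism used on any particular machine: `N`, `D`, the arrays, `wait_for` (standing in for wait), `init_boundary`, and `iteration` are all hypothetical names, and the values stored are placeholders.

```c
#include <assert.h>
#include <stdatomic.h>

#define N 8  /* hypothetical loop bound */
#define D 2  /* hypothetical dependence distance */

double A[N], B[N];
atomic_int posted[N];  /* static storage: starts at 0, nothing posted */

/* post(i): the release store keeps the earlier write to A[i] ahead of the flag. */
void post(int i)
{
    atomic_store_explicit(&posted[i], 1, memory_order_release);
}

/* wait(i): spin until posted; the acquire load keeps the later read of A[i] after it. */
void wait_for(int i)
{
    while (!atomic_load_explicit(&posted[i], memory_order_acquire)) {
        /* spin */
    }
}

/* Elements below D are inputs to the loop, so they are posted up front. */
void init_boundary(void)
{
    for (int j = 0; j < D; ++j) {
        A[j] = (double)j;
        post(j);
    }
}

/* Body of one DoAcross iteration. */
void iteration(int i)
{
    assert(i >= D && i < N);
    A[i] = 10.0 * i;        /* placeholder for "A[i] = ..." */
    post(i);
    wait_for(i - D);
    B[i] = A[i - D] + 1.0;  /* placeholder for "... = A[i-D]" */
}
```

In a parallel run each call to `iteration` would execute on a different thread. Calling `init_boundary()` and then the iterations in increasing order of i on one thread exercises the same code without ever spinning.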
For computation with more complicated data dependencies, memory-based fine-grain synchronization is more effective and efficient. [ArvindEtAl89 @ TOPLAS]
What is SSB?
• A small hardware buffer attached to the memory controller of each memory bank.
• Record and manage states of actively synchronized data units.
• Hardware Cost
– Each SSB is a small look-up table: easy to implement
– Independence of each SSB: hardware cost grows only linearly with the number of memory banks
SSB on Many-Core (IBM C64)
IBM Cyclops-64, Designed by Monty Denneau.
SSB Synchronization Functionalities
Data Synchronization: Enforce RAW data dependencies
• Support word-level
– Two single-writer-single-reader (SWSR) modes
– One single-writer-multiple-reader (SWMR) mode
Fine-Grain Locking: Enforce mutual exclusion
• Support word-level
– write lock (exclusive lock)
– read lock (shared lock)
– recursive lock
SSB is capable of supporting more functionality
Experimental Infrastructure
IBM Cyclops-64 Chip Architecture
• 160 thread units (500 MHz)
• Three-level explicitly addressable memory hierarchy
• Efficient thread-level execution support
• SSB for on-chip SRAM bank: 16-entry, 8-way associative
Software infrastructure
• Cyclops-64 Micro Kernel
• Simulation testbed: FAST Simulator (software), Ms. Clops hardware emulator
• C compiler (GCC/Open64), OpenMP compiler
• Binutils: assembler, linker
• Libraries: OpenMP RTS, TiNy Threads Library/RTS, Std C/Math lib
SSB Fine-Grain Sync. is Efficient
• For all the benchmarks, the SSB-based version shows significant performance improvement over the versions based on other synchronization mechanisms.
• For example, with up to 128 threads
– Livermore loop 6 (linear recurrence): a 312% improvement over the barrier-based version
– Ordered integer set (hash table): outperforms the software-based fine-grain methods by up to 84%
Research Layout
[Figure: layered research diagram with the following layers.]
• Future Programming Models
• Advanced Execution / Programming Model: Percolation, Location Consistency
• Base Execution Model: Fine-Grain Multithreading (e.g. EARTH, CARE)
• Infrastructure & Tools: System Software; Simulation / Emulation; Analytical Modeling
• Architectures: HTMT-like Architecture; Cellular Multithreaded Architecture (e.g. BG/c); High-End PIM Architecture
Percolation Model
[Figure: the percolation model from a user's perspective. Components: High Speed CPUs; SRAM PIM; DRAM PIM. Roles labeled in the figure: primary execution engine; prepare and percolate “parceled threads”; perform intelligent memory operations; global memory management.]
The Percolation Model
• What is percolation?
Dynamic, adaptive computation/data movement, migration, and transformation, in place or on the fly, to keep system resources usefully busy
• Features of percolation
– both data and threads may percolate
– computation reorganization and data layout reorganization
– asynchronous invocation
An Example of Percolation: Cannon’s Algorithm
[Figure: percolation across memory levels 0 to 3 of an HTMT-like architecture. Level 0: fast CPU; Level 1: PIM; Level 2: PIM; Level 3.]
• Cannon’s nearest-neighbor data transfer
• Data layout reorganization during percolation
Performance of SSCA2 Kernel 4
#threads   C64        SMPs      MTA2
4          2917082    5369740   752256
8          5513257    2141457   619357
16         9799661    915617    488894
32         17349325   362390    482681
• Reasonable scalability
– Scales well with # threads
– Linear speedup for #threads < 32
• Commodity SMPs have poor performance
• Competitive vs. MTA-2
Metric: TEPS (Traversed Edges Per Second)
SMPs: 4-way Xeon dual-core, 2MB L2 Cache