Slide 1
Compiler-directed Data Partitioning for Multicluster Processors
Michael Chu and Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan, Electrical Engineering and Computer Science
March 28, 2006
Slide 2: Multicluster Architectures
• Addresses the register file bottleneck
• Decentralizes the architecture
• Compilation focuses on partitioning operations
• Most previous work assumes a unified data memory
[Figure: two clusters, each with I, F, M units and its own register file and data memory (Data Mem 1, Data Mem 2), joined by an intercluster communication network]
Slide 3: Problem: Partitioning of Data
• Determine object placement into data memories
• Limited by:
  – Memory sizes/capacities
  – The computation operations related to the data
• Partitioning is relevant to both caches and scratchpad memories
[Figure: objects int x[100], struct foo, and int y[100] waiting to be placed into the two data memories]
Slide 4: Architectural Model
• This work focuses on scratchpad-like static local memories:
  – Each cluster has one local memory
  – Each object is placed in one specific memory
  – A data object is available in its memory throughout the lifetime of the program
[Figure: clusters 1 and 2 with int x[100] and foo in one local memory and int y[100] in the other]
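This model can be sketched as a tiny simulation. The class, the capacity, and the 1-cycle local / 10-cycle remote latencies below are illustrative assumptions for this sketch, not parameters from the talk:

```python
# Sketch of the architectural model: each cluster owns one scratchpad-like
# local memory; every data object lives in exactly one memory for the whole
# program, and remote accesses cross the intercluster network.
LOCAL_LATENCY = 1    # assumed cost of a local-memory access
REMOTE_LATENCY = 10  # assumed cost of going over the intercluster network

class Machine:
    def __init__(self, num_clusters, mem_capacity):
        self.capacity = [mem_capacity] * num_clusters  # free bytes per memory
        self.placement = {}                            # object -> home cluster

    def place(self, obj, size, cluster):
        """Statically assign an object to one cluster's local memory."""
        if size > self.capacity[cluster]:
            raise ValueError("object does not fit in local memory")
        self.capacity[cluster] -= size
        self.placement[obj] = cluster

    def access_cost(self, obj, from_cluster):
        """A memory operation pays extra latency when its object is remote."""
        home = self.placement[obj]
        return LOCAL_LATENCY if home == from_cluster else REMOTE_LATENCY

m = Machine(num_clusters=2, mem_capacity=1024)
m.place("x", 400, 0)          # int x[100] -> cluster 0
m.place("y", 400, 1)          # int y[100] -> cluster 1
print(m.access_cost("x", 0))  # local access  -> 1
print(m.access_cost("x", 1))  # remote access -> 10
```

The key property of the model is visible in `access_cost`: the partitioner's job is to make most accesses land in the `LOCAL_LATENCY` case.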
Slide 5: Data Unaware Partitioning
• On average, 30% of performance is lost by ignoring data placement
[Figure: per-benchmark performance of data-unaware partitioning]
Slide 6: Our Objective
• Goal: produce efficient code
• Strategy:
  – Partition both data objects and computation operations
  – Balance memory usage across clusters
• Improve memory bandwidth
• Maximize parallelism
[Figure: int x[100], struct foo, and int y[100] distributed across the two data memories]
Slide 7: First Try: Greedy Approach
• Computation-centric partition of data
  – Place data where the computation references it most often
• Greedy approach:
  – Pass 1: region-view computation partition, then greedy data cluster assignment
  – Pass 2: region-view computation repartition with full knowledge of data location

                 Data Partition                  Computation Partition
  Data Unaware   None, profile-based placement   Region-view
  Greedy         Greedy, profile-based           Region-view
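The greedy data cluster assignment of Pass 1 can be sketched as follows. The per-object reference counts stand in for the profile produced by the region-view computation partition; the objects, sizes, and counts are made up for illustration:

```python
def greedy_place(objects, capacity):
    """Greedy data cluster assignment: visit objects (largest first) and put
    each in the cluster whose computation references it most often, spilling
    to the next-best cluster when the preferred memory is full.
    `objects` maps name -> (size, [refs_from_cluster0, refs_from_cluster1]).
    """
    free = list(capacity)
    placement = {}
    for name, (size, refs) in sorted(objects.items(),
                                     key=lambda kv: -kv[1][0]):
        # Clusters ordered by how often their operations touch the object.
        for c in sorted(range(len(free)), key=lambda c: -refs[c]):
            if size <= free[c]:
                free[c] -= size
                placement[name] = c
                break
        else:
            raise ValueError(f"{name} fits in no memory")
    return placement

# Hypothetical profile: x is mostly read by cluster 0, y by cluster 1.
objs = {"x": (400, [90, 10]), "y": (400, [5, 95]), "foo": (200, [50, 40])}
print(greedy_place(objs, capacity=[1024, 1024]))  # {'x': 0, 'y': 1, 'foo': 0}
```

Because each object is placed in isolation, this pass can separate objects that are accessed together, which is the weakness the global data partition later addresses.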
Slide 8: Greedy Approach Results
• 2 clusters: one integer, float, memory, and branch unit per cluster
• Results relative to a unified, dual-ported memory
• An improvement over Data Unaware, but still room for improvement
[Figure: per-benchmark performance of the greedy approach]
Slide 9: Second Try: Global Data Partition
• Data-centric partition of computation
• Hierarchical technique:
  – Pass 1: global view for data
    • Considers memory relationships throughout the program
    • Locks memory operations to clusters
  – Pass 2: region view for computation
    • Partitions computation based on data location
[Figure: a global data partition feeding a regional computation partition]
Slide 10: Pass 1: Global Data Partitioning
• Determine memory relationships
  – Pointer analysis & profiling of memory
• Build a program-level graph representation of all operations
• Perform data object memory operation merging
  – Respect the correctness constraints of the program
• Steps:
  1. Interprocedural pointer analysis & memory profile
  2. Build the program data graph
  3. Merge memory operations
  4. METIS graph partitioner
Slide 11: Global Data Graph Representation
• Nodes: operations, either memory or non-memory
  – Memory operations: loads, stores, malloc call sites
• Edges: data flow between operations
• Node weight: data object size
  – The sum of the data sizes of the referenced objects
• Object size determined by:
  – Globals/locals: pointer analysis
  – Malloc call sites: memory profile
[Figure: example graph in which operations on int x[100] weigh 400 bytes, on struct foo 200 bytes, and on malloc site 1 1 Kbyte]
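The graph above can be represented directly. The object sizes follow the slide's example (int x[100] = 400 bytes, struct foo = 200 bytes, malloc site 1 = 1 Kbyte from the memory profile); the concrete operations and edges are invented for illustration:

```python
# A minimal program data graph: nodes are operations, memory operations
# carry the sizes of the objects they may reference, edges are dataflow.
object_size = {"x": 400, "foo": 200, "malloc1": 1024}  # analysis/profile

class Node:
    def __init__(self, op, objects=()):
        self.op = op                 # e.g. "load", "store", "add"
        self.objects = set(objects)  # objects this memory op may touch
        self.succs = []              # dataflow edges to consumers

    @property
    def weight(self):
        # Node weight = sum of referenced object sizes (0 for non-memory ops).
        return sum(object_size[o] for o in self.objects)

load_x = Node("load", ["x"])
alloc  = Node("malloc", ["malloc1"])
add    = Node("add")
store  = Node("store", ["x", "foo"])  # may-alias: references both objects
load_x.succs.append(add)
add.succs.append(store)

print([n.weight for n in (load_x, alloc, add, store)])  # [400, 1024, 0, 600]
```

A balance-constrained partitioner over these node weights then balances memory usage across clusters, while cutting few dataflow edges keeps related computation together.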
Slide 12: Global Data Partitioning Example
[Figure: program data graph over basic blocks BB1 and BB2, with memory and non-memory operation nodes; the objects (malloc sites 1 and 2, int x[100], struct foo, struct bar) form groups referencing 80 Kb, 100 Kb, and 200 Kb of data, which the partitioner splits between Cluster 0 and Cluster 1]
Slide 13: Pass 2: Computation Partitioning
• Observation: the global-level data partition is only half the answer:
  – It doesn't account for operation resource usage
  – It doesn't consider code scheduling regions
• Second pass of partitioning on each scheduling region
  – Memory operations from the first phase are locked in place
[Figure: BB1 repartitioned with its memory operations fixed]
Slide 14: Experimental Methodology
• 2 clusters: one integer, float, memory, and branch unit per cluster
• All results relative to a unified, dual-ported memory
• Compared to:

                  Data Partitioning               Computation Partition
  Global          Global-view, data-centric       Knows data location
  Greedy          Greedy, computation-centric     Region-view, knows data location
  Data Unaware    None, assumes unified memory    Assumes unified memory
  Unified Memory  N/A                             Unified memory
Slide 15: Performance: 1-cycle Remote Access
[Figure: per-benchmark performance relative to the unified-memory baseline]
Slide 16: Performance: 10-cycle Remote Access
[Figure: per-benchmark performance relative to the unified-memory baseline]
Slide 17: Case Study: rawcaudio
[Figure: schedules under the global data partition versus the greedy profile-based partition]
Slide 18: Summary
• Global Data Partitioning:
  – Data placement as a first-order design principle
  – Global, data-centric partition of computation
  – Phase-ordered approach:
    • Global view for decisions on data
    • Region view for decisions on computation
• Achieves 96% of the performance of a unified memory on partitioned memories
• Future work: apply to cache memories
Slide 19: Data Partitioning for Multicores
• Adapt global data partitioning to the cache memory domain
• Similar goals:
  – Increase data bandwidth
  – Maximize parallel computation
• Different goals:
  – Reduce coherence traffic
  – Keep the working set ≤ cache size
Slide 20: Questions?
http://cccp.eecs.umich.edu
Slide 21: Backup
Slide 22: Future Work: Cache Memories
• Adapt global data partitioning to the cache memory domain
• Similar goals:
  – Increase data bandwidth
  – Maximize parallel computation
• Different goals:
  – Reduce coherence traffic
  – Balance the working set
Slide 23: Memory Operation Merging

int *x;
int foo[100];
int bar[100];

void main() {
  int *a = malloc();
  int *b;
  int c;
  if (cond) {
    c = foo[1];
    b = a;
  } else {
    c = bar[1];
    b = &bar[1];
  }
  *b = 100;
  foo[0] = c;
}

[Figure: the resulting memory operations — malloc, load "foo", load "bar", store to "malloc or bar" (through b), and store "foo" — with operations that may touch the same object merged]
• Interprocedural pointer analysis determines memory relationships
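One way to realize this merging step is union-find over points-to sets: memory operations whose points-to sets overlap may touch the same object, so correctness forces them into the same memory, and they are unioned into a single graph node before partitioning. The points-to sets below mirror the slide's example; the helper itself is a sketch, not the paper's implementation:

```python
# Sketch of memory-operation merging: group operations whose points-to
# sets overlap, since they must be assigned to the same data memory.
def merge_ops(points_to):
    parent = {op: op for op in points_to}  # union-find forest over ops

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    owner = {}  # object -> first op seen referencing it
    for op, objs in points_to.items():
        for obj in objs:
            if obj in owner:
                parent[find(op)] = find(owner[obj])  # overlap: merge groups
            else:
                owner[obj] = op
    groups = {}
    for op in points_to:
        groups.setdefault(find(op), set()).add(op)
    return sorted(map(sorted, groups.values()))

# Points-to sets from the example: *b may be the malloc'd block or bar,
# so the store through b merges with the malloc and the load of bar.
ops = {
    "malloc":    {"heap1"},
    "load foo":  {"foo"},
    "load bar":  {"bar"},
    "store *b":  {"heap1", "bar"},  # may-alias forces a merge
    "store foo": {"foo"},
}
print(merge_ops(ops))
```

The result is two merged nodes: {malloc, load bar, store *b} and {load foo, store foo}, matching the groupings the pointer analysis implies for the snippet above.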
Slide 24: Multicluster Compilation
• Previous techniques focused on operation partitioning [cite some papers]
• These techniques ignore the issue of data object placement in memory
• They assume a shared memory accessible from each cluster
Slide 25: Phase 2: Computation Partitioning
• Observation: the global-level data partition is only half the solution:
  – It doesn't properly account for resource usage details
  – It doesn't consider code scheduling regions
• A second pass of partitioning is done locally on each basic block of the program
  – Memory operations are locked into specific clusters
• Uses the Region-based Hierarchical Operation Partitioner (RHOP)
Slide 26: Computation Partitioning Example
• Memory operations from the first phase are locked in place
• RHOP performs a detailed, resource-cognizant computation partition
• Modified multi-level Kernighan-Lin algorithm using schedule estimates
[Figure: the dataflow graph of BB1 (loads, a store, adds, multiplies, and address computations) shown at three stages of the partitioning]
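RHOP itself is multi-level and drives its moves with schedule estimates; as a stand-in, here is a single-level, Kernighan-Lin-style refinement over an edge-weighted dataflow graph in which memory operations stay locked to the clusters chosen by the global data partition. The graph, weights, and simple cut-weight gain function are illustrative assumptions:

```python
def refine(edges, side, locked, passes=2):
    """Greedy KL-style refinement: trial-move each unlocked node to the
    other cluster and keep the move only if it reduces the cut weight.
    `edges` maps (u, v) -> weight, `side` maps node -> 0/1, and `locked`
    holds memory ops fixed by the global data partition."""
    def cut():
        return sum(w for (u, v), w in edges.items() if side[u] != side[v])

    for _ in range(passes):
        for n in side:
            if n in locked:
                continue          # memory ops may not move
            before = cut()
            side[n] ^= 1          # trial move to the other cluster
            if cut() >= before:   # keep only strictly improving moves
                side[n] ^= 1
    return side, cut()

# Tiny BB1-like graph: two loads feeding an add that feeds a store.
edges = {("L1", "add"): 2, ("L2", "add"): 2, ("add", "S1"): 3}
side = {"L1": 0, "L2": 1, "add": 1, "S1": 0}
locked = {"L1", "L2", "S1"}       # memory ops placed by the data partition
print(refine(edges, side, locked))
```

With the loads and the store pinned, only the add can move; pulling it next to L1 and S1 drops the cut weight from 5 to 2, illustrating how the computation partition adapts around the fixed data placement. Real KL additionally allows tentative worsening moves and takes the best prefix of a move sequence, which this greedy sketch omits.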