How Can Computer Architecture Revolutionize Parallel Scientific Computing?
Thomas Sterling
Louisiana State University
and California Institute of Technology
February 24, 2006
Invited Presentation to SIAM 2006 Mini-Symposium (MS49):
Challenges to Petascale Scientific Computing that Architecture Can Help Address
• Bringing orders of magnitude greater sustained performance and memory capacity to real-world scientific applications
  – Many problems are exa(fl)ops scale or greater
• Exploiting ultra-massive parallelism at all levels
  – Either automatic discovery or ease of representation
  – Hardware use for reduced time or latency hiding
• Breaking the “barrier”
  – Moving away from global barrier synchronization
  – Over-constrained
• Removing the burden of explicit manual resource allocation
  – Locality management
  – Load balancing
• Memory wall
  – Accelerating memory-intensive problems
  – Hardware structures for latency tolerance
  – Enabling efficient manipulation of sparse, irregular data structures
• Greater availability and cheaper machines
Challenges to Computer Architecture
• Expose and exploit extreme fine-grain parallelism
  – Possibly multi-billion-way
  – Data-structure-driven
• State storage takes up much more space than logic
  – A 1:1 flops/byte ratio is infeasible
  – Memory access bandwidth is the critical resource
• Latency can approach a hundred thousand cycles
  – All actions are local
  – Contention due to inadequate bandwidth
• Overhead for fine-grain parallelism must be very small, or the system cannot scale
  – One consequence is that global barrier synchronization is untenable
• Reliability
  – Very high replication of elements
  – Uncertain fault distribution
  – Fault tolerance essential for good yield
• Design complexity
  – Impacts development time, testing, power, and reliability
Metric of Physical Locality
• Locality of operation depends on the amount of logic and state that can be accessed round-trip within a single clock cycle
• Defined as the ratio of the number of elements (e.g., gates, transistors) per chip to the number of elements accessible within a single clock cycle (formalized below)
• Not just a speed-of-light issue
• Also involves propagation through a sequence of elements
• When I was an undergrad, this ratio was 1
• Today, it is greater than 30
• For SFQ at 100 GHz, it is between 100 and 1,000
• At nano-scale, it exceeds 100,000
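The ratio can be written compactly; the symbol λ below is my own label for it (the slide's original symbol did not survive extraction), a sketch rather than the author's notation:

```latex
\lambda \;=\; \frac{N_{\text{elements per chip}}}{N_{\text{elements reachable round-trip in one clock cycle}}},
\qquad
\lambda = 1 \;\text{(then)}, \quad
\lambda > 30 \;\text{(today)}, \quad
100 < \lambda < 1000 \;\text{(SFQ at 100 GHz)}, \quad
\lambda > 10^{5} \;\text{(nano-scale)}
```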
A Sustainable Strategy for Long-Term Investment in Technology Trends
• Message-driven split-transaction parallel computing
  – Alternative parallel computing model
  – Parallel programming languages
  – Adaptable ultra-lightweight runtime kernel
  – Hardware co-processors that manage threads and “active messages”
• Memory accelerator
  – Exploits embedded DRAM technology
  – Lightweight MIND data-processing accelerator co-processors
• Heterogeneous system architecture
  – Very high speed numeric processor for high temporal locality
  – Plus eco-system co-processor for parcel handling/multithreading
  – Plus memory accelerator chips
  – Data pre-staging
Asynchronous Message-Driven Split-Transaction Computing
• Powerful mechanism for hiding system-wide latency
• All operations are local
• Transactions are performed on local data in response to incident messages (parcels)
• Results may include outgoing parcels
• No waiting for responses to remote requests (see the sketch below)
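A minimal C++ sketch of this execution style, under my own assumptions (the Parcel fields, the handler table, and the inbox/outbox queues are hypothetical names, not taken from the slides): an incoming parcel triggers a transaction on purely local state, and any remote work it implies becomes another outgoing parcel rather than a blocking request.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical parcel layout (assumed, not from the slides):
// destination locale, action id, and a payload of operands.
struct Parcel {
    uint32_t destination;
    uint32_t action;
    std::vector<uint64_t> payload;
};

// A locale reacts to incident parcels; results may include outgoing parcels.
class Locale {
public:
    using Handler = std::function<void(Locale&, const Parcel&)>;

    void register_action(uint32_t id, Handler h) { handlers_[id] = std::move(h); }
    void deliver(Parcel p) { inbox_.push(std::move(p)); }
    void send(Parcel p) { outbox_.push_back(std::move(p)); }  // picked up by the network layer

    // Drain the inbox: each transaction touches only local data and never blocks
    // on a remote request; remote work is expressed as another outgoing parcel.
    void run() {
        while (!inbox_.empty()) {
            Parcel p = std::move(inbox_.front());
            inbox_.pop();
            handlers_.at(p.action)(*this, p);
        }
    }

    std::vector<uint64_t> local_data;   // state owned by this locale

private:
    std::queue<Parcel> inbox_;
    std::vector<Parcel> outbox_;
    std::unordered_map<uint32_t, Handler> handlers_;
};
```

A handler for, say, a remote increment would update local_data and call send() with a return parcel addressed to the requester, instead of replying synchronously.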
Parcels for Latency Hiding in Multicore-Based Systems
[Figure: a parcel carries a destination, an action, and a payload. The source locale sends a remote-thread-create parcel naming a target operand at the destination locale; there, the target action code runs against local data, methods, and thread frames as a remote thread, and a return parcel carries the result back.]
Latency Hiding with Parcels with Respect to System Diameter in Cycles
[Figure: sensitivity to remote latency and remote access fraction, 16 nodes. X-axis: remote memory latency (cycles); Y-axis: total transactional work done / total process work done; one curve per remote access fraction (1/4%, 1/2%, 1%, 2%, 4%); degrees of parallelism (pending parcels per node at t=0) of 1, 2, 4, 16, 64, and 256 annotated in red.]
Latency Hiding with Parcels: Idle Time with Respect to Degree of Parallelism
[Figure: idle time per node (cycles) versus parallelism level (parcels per node at time=0); one curve per node count (1, 2, 4, 8, 16, 32, 64, 128, 256, shown in black); separate series for process work and transaction work.]
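A back-of-envelope reading of these two plots, not stated on the slides: a node stays busy only if its pending parcels supply enough local work to cover the latency of outstanding remote requests. With p parcels pending per node, t_work cycles of transactional work per parcel, and remote latency L cycles, a Little's-law-style condition is

```latex
p \;\gtrsim\; \frac{L}{t_{\text{work}}}
```

A larger remote access fraction f effectively shrinks t_work (roughly t_work ∝ 1/f), which is consistent with the higher-percentage curves requiring greater parallelism to sustain the same work ratio and low per-node idle time.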
Multi-Core Microprocessor with Memory Accelerator Co-processor
[Figure: a heavyweight multi-core microprocessor (control, registers, ALU, cache) coupled to an array of PIM memory-accelerator nodes (PIM node 1 … PIM node N), annotated with the model parameters listed below.]

Metrics (used in the model sketched below)
• mix_l/s – instruction mix of load and store operations
• P_miss – heavyweight cache miss rate
• T_ML – lightweight memory access time
• T_CH – heavyweight cache access time
• T_MH – heavyweight memory access time
• T_Lcycle – lightweight cycle time
• T_Hcycle – heavyweight cycle time
• %W_L – percent lightweight work
• %W_H – percent heavyweight work
• W – total work (W_H heavyweight, W_L lightweight)
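The slide gives the parameters but the time model itself did not survive extraction; the expression below is one plausible reconstruction and should be read as my assumption, with heavyweight work paying the cache-and-miss path and lightweight (PIM) work going directly to local memory:

```latex
T \;\approx\; \%W_H \, W \Big[ T_{Hcycle} + mix_{l/s}\,\big( T_{CH} + P_{miss}\, T_{MH} \big) \Big]
\;+\; \%W_L \, W \Big[ T_{Lcycle} + mix_{l/s}\, T_{ML} \Big]
```

Performance gain would then be the ratio of the all-heavyweight execution time (%W_L = 0) to this mixed time.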
DIVA (USC ISI)
[Figure: DIVA PIM node block diagram: node memory; memory control and arbiter; a 32-bit instruction pipeline with ICache; a 256-bit WideWord datapath with WideWord registers; a scalar datapath with scalar registers; data and header registers; a parcel buffer (“PBUF”); and a host “DRAM” interface carrying node-memory, host-memory, and ICache requests over the node main data bus.]
Simulation of Performance Gain
[Figure: performance gain (log scale, 1x to 1000x) versus PIM workload fraction (0.0 to 1.0), with one curve each for 1, 2, 4, 8, 16, 32, and 64 nodes.]
ParalleX: A Latency-Tolerant Parallel Computing Strategy
• Split-transaction programming model (Dally, Culler)
• Distributed shared memory – not cache coherent (Scott, UPC)
• Embodies copy semantics in the form of location consistency (Gao)
• Message-driven (Hewitt)
• Multithreaded (Smith/Callahan)
• Futures synchronization (Halstead)
• Local control objects (e.g., dataflow synchronization)
• In-memory synchronization for producer-consumer (sketched below)
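A small C++ illustration of the futures and producer-consumer styles listed above, using std::async/std::future purely as an analogy for ParalleX local control objects (the ParalleX runtime itself is not shown, and none of this code comes from the slides):

```cpp
#include <future>
#include <iostream>
#include <vector>

// Producer: computes a partial sum. The returned future stands in for an
// in-memory synchronization object (a "local control object") the consumer waits on.
long partial_sum(const std::vector<long>& data, std::size_t lo, std::size_t hi) {
    long s = 0;
    for (std::size_t i = lo; i < hi; ++i) s += data[i];
    return s;
}

int main() {
    std::vector<long> data(1000000, 1);

    // Launch producers; the consumer keeps going until it actually needs the values.
    auto left  = std::async(std::launch::async, partial_sum,
                            std::cref(data), std::size_t{0}, data.size() / 2);
    auto right = std::async(std::launch::async, partial_sum,
                            std::cref(data), data.size() / 2, data.size());

    // The consumer blocks only at the point of use (future-style synchronization),
    // not at a global barrier.
    std::cout << "sum = " << left.get() + right.get() << "\n";
    return 0;
}
```

The point is the synchronization shape: producers run ahead, and the consumer blocks only on the exact value it needs rather than at a global barrier.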
Data Vortex Network
[Figure (“Long Bow”): nodes 1 … N-2, N-1 connected by a Data Vortex network, with MIND memory-accelerator chips and a static data-flow accelerator at each node.]
Long Bow Architecture Attributes
• Latency-hiding architecture
• Heterogeneous, based on temporal locality
  – Low/no temporal locality: high-memory-bandwidth logic
  – High temporal locality: high-clock-rate, ALU-intensive structures
• Asynchronous
  – Remote: message-driven
  – Local: multithreaded, dataflow
  – Rich array of synchronization primitives, including in-memory and in-register
• Global shared memory
  – Not cache coherent
  – Location consistency
• Percolation for pre-staging
  – Flow control managed by the MIND array
  – Processors are dumb, memory is smart
• Graceful degradation
  – Isolation of replicated structures
High-Speed Computing Element
• A kind of streaming architecture
  – Merrimac
  – TRIPS
• Employs an array of ALUs in a static data-flow structure (toy example below)
• Temporary values passed directly between successive ALUs
• Data-flow synchronization for asymmetric flow graphs
• Packet switched (rather than line switched)
• More general than pure SIMD
• Initialized (and torn down) by MIND via percolation
• Multiple coarse-grain threads
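A toy C++ rendering of the static data-flow firing rule described above: a node fires once all of its operands have arrived and hands its result directly to its successors. The graph, node names, and operations are hypothetical, and percolation, packet switching, and the physical ALU array are not modeled.

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// One node of a static data-flow graph: it fires when all operands have arrived
// and forwards its result directly to its successor nodes.
struct Node {
    std::function<double(const std::vector<double>&)> op;  // the ALU operation
    std::vector<double> inputs;                             // operands received so far
    std::size_t arity;                                      // operands needed to fire
    std::vector<std::size_t> successors;                    // receivers of the result
};

// Deliver one value to a node; apply the firing rule and propagate results.
void deliver(std::vector<Node>& g, std::size_t id, double value,
             std::queue<std::pair<std::size_t, double>>& pending) {
    g[id].inputs.push_back(value);
    if (g[id].inputs.size() < g[id].arity) return;          // still waiting for operands
    double result = g[id].op(g[id].inputs);
    g[id].inputs.clear();
    if (g[id].successors.empty())
        std::printf("result = %g\n", result);               // sink node
    for (std::size_t s : g[id].successors)
        pending.push({s, result});
}

int main() {
    // Hypothetical 3-node graph computing (a + b) * (c + d).
    auto add = [](const std::vector<double>& v) { return v[0] + v[1]; };
    auto mul = [](const std::vector<double>& v) { return v[0] * v[1]; };
    std::vector<Node> g = {
        {add, {}, 2, {2}},   // node 0: a + b -> node 2
        {add, {}, 2, {2}},   // node 1: c + d -> node 2
        {mul, {}, 2, {}},    // node 2: product (sink)
    };

    std::queue<std::pair<std::size_t, double>> pending;
    pending.push({0, 1.0}); pending.push({0, 2.0});          // a = 1, b = 2
    pending.push({1, 3.0}); pending.push({1, 4.0});          // c = 3, d = 4

    while (!pending.empty()) {
        auto [id, v] = pending.front();
        pending.pop();
        deliver(g, id, v, pending);                          // prints "result = 21"
    }
    return 0;
}
```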
Concepts of the MIND Architecture
• Virtual-to-physical address translation in memory
  – Global distributed shared memory through a distributed directory table
  – Dynamic page migration
  – Wide registers serve as a context-sensitive TLB
• Multithreaded control
  – Unified dynamic mechanism for resource management
  – Latency hiding
  – Real-time response
• Parcel active-message-driven computing
  – Decoupled split-transaction execution
  – System-wide latency hiding
  – Move work to data instead of data to work
• Parallel atomic struct processing
  – Exploits direct access to wide rows of memory banks for fine-grain parallelism and guarded compound operations (see the sketch below)
  – Exploits parallelism for better performance
  – Enables very efficient mechanisms for synchronization
• Fault tolerance through graceful degradation
• Active power management
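A rough C++ analogy for a guarded compound operation on a wide row; this is my own illustration, not the MIND instruction set. A whole record (standing in for a wide row of a memory bank) is tested against a guard predicate and updated as one unit, so the synchronization lives with the data rather than with the processor:

```cpp
#include <array>
#include <cstdint>
#include <mutex>

// Stand-in for one wide row of an embedded-DRAM bank (e.g., 256 bits = 4 words).
struct WideRow {
    std::array<uint64_t, 4> words{};
    std::mutex guard;   // software lock here; MIND would use hardware guards in the bank
};

// Guarded compound operation: evaluate a predicate over the whole row and, only
// if it holds, apply a compound update to all of its words under the same guard.
template <class Pred, class Update>
bool guarded_update(WideRow& row, Pred pred, Update update) {
    std::lock_guard<std::mutex> lock(row.guard);
    if (!pred(row.words)) return false;   // guard failed; the row is unchanged
    update(row.words);                    // compound update, applied atomically
    return true;
}

int main() {
    WideRow row;
    // Example: use word 0 as a full/empty flag for producer-consumer synchronization.
    bool filled = guarded_update(
        row,
        [](const std::array<uint64_t, 4>& w) { return w[0] == 0; },  // still empty?
        [](std::array<uint64_t, 4>& w) { w = {1, 42, 43, 44}; });    // mark full + write data
    return filled ? 0 : 1;
}
```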
Chip Layout
[Figure: MIND chip floorplan: node pairs at the leaves of a fat-tree interconnect (fat-tree switches, fat-tree controllers, bit-1/bit-2/bit-3 junctions); local parcel buses with stage-2 routers (three 2x2 routers each); shared functional units; a shared external memory interface and memory I/O port; power/ground rails; and a block for timing, master configuration, and power control.]
MIND Memory Accelerator
[Figure: one MIND node: an embedded DRAM macro joined by wide datapaths to a wide ALU, memory manager, thread manager, frame cache, and parcel handler, with control interfaces to the local parcel interconnect and to the module control unit.]
Top 10 Challenges in HPC Architecture (concepts, semantics, mechanisms, and structures)
• 10: Inter-chip high-bandwidth data interconnect
• 9: Scalability through locality-oriented asynchronous control
• 8: Global name-space translation, including first-class processes for direct resource management
• 7: Multithreaded intra-process flow control
• 6: Graceful degradation for reliability and high yield
• 5: Accelerators for lightweight-object synchronization, including futures and other in-memory synchronization for mutual exclusion
• 4: Merger of data-links and go-to flow-control semantics for directed graph traversal
• 3: In-memory logic for a memory accelerator enabling effective no-locality execution
• 2: Message-driven split-transaction processing
• 1: A new execution model governing parallel computing
Some Revolutionary Changes Won’t Come from Architecture
• Debugging
  – Can be eased with hardware diagnostics
• Problem setup
  – e.g., mesh generation
• Data analysis
  – Science understanding
  – Computer visualization – an oxymoron
    • Computers don’t know how to abstract knowledge from data
    • Once people visualize, the insight can’t be computerized
• Problem specification – easier to program
  – Better architectures will be easier to program
  – But problem representation can be intrinsically hard
• New models for advanced phenomenology
  – New algorithms
  – New physics