How Can Computer Architecture Revolutionize Parallel Scientific Computing?
Thomas Sterling
Louisiana State University
and California Institute of Technology
February 24, 2006
Invited Presentation to SIAM 2006 Mini-Symposium (MS49):
Challenges to Petascale Scientific Computing that Architecture Can Help Address
• Bringing orders of magnitude greater sustained performance and memory capacity to real-world scientific applications
  – Many problems are exa(fl)ops scale or greater
• Exploiting ultra-massive parallelism at all levels
  – Either automatic discovery or ease of representation
  – Hardware use for reduced time or latency hiding
• Breaking the “barrier”
  – Moving away from global barrier synchronization
  – Over-constrained
• Removing the burden of explicit manual resource allocation
  – Locality management
  – Load balancing
• Memory wall
  – Accelerating memory-intensive problems
  – Hardware structures for latency tolerance
  – Enabling efficient manipulation of sparse, irregular data structures
• Greater availability and cheaper machines
Challenges to Computer Architecture
• Expose and exploit extreme fine-grain parallelism
  – Possibly multi-billion-way
  – Data-structure-driven
• State storage takes up much more space than logic
  – A 1:1 flops/byte ratio is infeasible
  – Memory access bandwidth is the critical resource
• Latency can approach a hundred thousand cycles
  – All actions are local
  – Contention due to inadequate bandwidth
• Overhead for fine-grain parallelism must be very small, or the system cannot scale
  – One consequence is that global barrier synchronization is untenable
• Reliability
  – Very high replication of elements
  – Uncertain fault distribution
  – Fault tolerance essential for good yield
• Design complexity
  – Impacts development time, testing, power, and reliability
Metric of Physical Locality
• Locality of operation depends on the amount of logic and state that can be accessed round-trip within a single clock cycle
• Defined as the ratio of the number of elements (e.g., gates, transistors) per chip to the number of elements accessible within a single clock cycle (formalized below)
• Not just a speed-of-light issue
• Also involves propagation through a sequence of elements
• When I was an undergrad, this ratio was 1
• Today, it is greater than 30
• For SFQ at 100 GHz, it is between 100 and 1,000
• At nano-scale, it exceeds 100,000
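The ratio can be written compactly; the symbol λ below is my own label for it (the slide's original symbol did not survive extraction), a sketch rather than the author's notation:

```latex
\lambda \;=\; \frac{N_{\text{elements per chip}}}{N_{\text{elements reachable round-trip in one clock cycle}}},
\qquad
\lambda = 1 \;\text{(then)}, \quad
\lambda > 30 \;\text{(today)}, \quad
100 < \lambda < 1000 \;\text{(SFQ at 100 GHz)}, \quad
\lambda > 10^{5} \;\text{(nano-scale)}
```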
A Sustainable Strategy for Long-Term Investment in Technology Trends
• Message-driven split-transaction parallel computing
  – Alternative parallel computing model
  – Parallel programming languages
  – Adaptable ultra-lightweight runtime kernel
  – Hardware co-processors that manage threads and “active messages”
• Memory accelerator
  – Exploits embedded DRAM technology
  – Lightweight MIND data-processing accelerator co-processors
• Heterogeneous system architecture
  – Very high speed numeric processor for high temporal locality
  – Plus eco-system co-processor for parcel handling/multithreading
  – Plus memory accelerator chips
  – Data pre-staging
Asynchronous Message-Driven Split-Transaction Computing
• Powerful mechanism for hiding system-wide latency
• All operations are local
• Transactions are performed on local data in response to incident messages (parcels)
• Results may include outgoing parcels
• No waiting for responses to remote requests (see the sketch below)
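A minimal C++ sketch of this execution style, under my own assumptions (the Parcel fields, the handler table, and the inbox/outbox queues are hypothetical names, not taken from the slides): an incoming parcel triggers a transaction on purely local state, and any remote work it implies becomes another outgoing parcel rather than a blocking request.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical parcel layout (assumed, not from the slides):
// destination locale, action id, and a payload of operands.
struct Parcel {
    uint32_t destination;
    uint32_t action;
    std::vector<uint64_t> payload;
};

// A locale reacts to incident parcels; results may include outgoing parcels.
class Locale {
public:
    using Handler = std::function<void(Locale&, const Parcel&)>;

    void register_action(uint32_t id, Handler h) { handlers_[id] = std::move(h); }
    void deliver(Parcel p) { inbox_.push(std::move(p)); }
    void send(Parcel p) { outbox_.push_back(std::move(p)); }  // picked up by the network layer

    // Drain the inbox: each transaction touches only local data and never blocks
    // on a remote request; remote work is expressed as another outgoing parcel.
    void run() {
        while (!inbox_.empty()) {
            Parcel p = std::move(inbox_.front());
            inbox_.pop();
            handlers_.at(p.action)(*this, p);
        }
    }

    std::vector<uint64_t> local_data;   // state owned by this locale

private:
    std::queue<Parcel> inbox_;
    std::vector<Parcel> outbox_;
    std::unordered_map<uint32_t, Handler> handlers_;
};
```

A handler for, say, a remote increment would update local_data and call send() with a return parcel addressed to the requester, instead of replying synchronously.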
Parcels for Latency Hiding in Multicore-Based Systems
[Figure: a parcel carries a destination, an action, and a payload. The source locale sends a remote-thread-create parcel naming a target operand at the destination locale; there, the target action code runs against local data, methods, and thread frames as a remote thread, and a return parcel carries the result back.]
Latency Hiding with Parcels with Respect to System Diameter in Cycles
[Figure: sensitivity to remote latency and remote access fraction, 16 nodes. X-axis: remote memory latency (cycles); Y-axis: total transactional work done / total process work done; one curve per remote access fraction (1/4%, 1/2%, 1%, 2%, 4%); degrees of parallelism (pending parcels per node at t=0) of 1, 2, 4, 16, 64, and 256 annotated in red.]
Latency Hiding with Parcels: Idle Time with Respect to Degree of Parallelism
[Figure: idle time per node (cycles) versus parallelism level (parcels per node at time=0); one curve per node count (1, 2, 4, 8, 16, 32, 64, 128, 256, shown in black); separate series for process work and transaction work.]
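A back-of-envelope reading of these two plots, not stated on the slides: a node stays busy only if its pending parcels supply enough local work to cover the latency of outstanding remote requests. With p parcels pending per node, t_work cycles of transactional work per parcel, and remote latency L cycles, a Little's-law-style condition is

```latex
p \;\gtrsim\; \frac{L}{t_{\text{work}}}
```

A larger remote access fraction f effectively shrinks t_work (roughly t_work ∝ 1/f), which is consistent with the higher-percentage curves requiring greater parallelism to sustain the same work ratio and low per-node idle time.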
Multi-Core Microprocessor with Memory Accelerator Co-processor
[Figure: a heavyweight multi-core microprocessor (control, registers, ALU, cache) coupled to an array of PIM memory-accelerator nodes (PIM node 1 … PIM node N), annotated with the model parameters listed below.]

Metrics (used in the model sketched below)
• mix_l/s – instruction mix of load and store operations
• P_miss – heavyweight cache miss rate
• T_ML – lightweight memory access time
• T_CH – heavyweight cache access time
• T_MH – heavyweight memory access time
• T_Lcycle – lightweight cycle time
• T_Hcycle – heavyweight cycle time
• %W_L – percent lightweight work
• %W_H – percent heavyweight work
• W – total work (W_H heavyweight, W_L lightweight)
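The slide gives the parameters but the time model itself did not survive extraction; the expression below is one plausible reconstruction and should be read as my assumption, with heavyweight work paying the cache-and-miss path and lightweight (PIM) work going directly to local memory:

```latex
T \;\approx\; \%W_H \, W \Big[ T_{Hcycle} + mix_{l/s}\,\big( T_{CH} + P_{miss}\, T_{MH} \big) \Big]
\;+\; \%W_L \, W \Big[ T_{Lcycle} + mix_{l/s}\, T_{ML} \Big]
```

Performance gain would then be the ratio of the all-heavyweight execution time (%W_L = 0) to this mixed time.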
DIVA (USC ISI)
[Figure: DIVA PIM node block diagram: node memory; memory control and arbiter; a 32-bit instruction pipeline with ICache; a 256-bit WideWord datapath with WideWord registers; a scalar datapath with scalar registers; data and header registers; a parcel buffer (“PBUF”); and a host “DRAM” interface carrying node-memory, host-memory, and ICache requests over the node main data bus.]
Simulation of Performance Gain
[Figure: performance gain (log scale, 1x to 1000x) versus PIM workload fraction (0.0 to 1.0), with one curve each for 1, 2, 4, 8, 16, 32, and 64 nodes.]
ParalleX: A Latency-Tolerant Parallel Computing Strategy
• Split-transaction programming model (Dally, Culler)
• Distributed shared memory – not cache coherent (Scott, UPC)
• Embodies copy semantics in the form of location consistency (Gao)
• Message-driven (Hewitt)
• Multithreaded (Smith/Callahan)
• Futures synchronization (Halstead)
• Local control objects (e.g., dataflow synchronization)
• In-memory synchronization for producer-consumer (sketched below)
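A small C++ illustration of the futures and producer-consumer styles listed above, using std::async/std::future purely as an analogy for ParalleX local control objects (the ParalleX runtime itself is not shown, and none of this code comes from the slides):

```cpp
#include <future>
#include <iostream>
#include <vector>

// Producer: computes a partial sum. The returned future stands in for an
// in-memory synchronization object (a "local control object") the consumer waits on.
long partial_sum(const std::vector<long>& data, std::size_t lo, std::size_t hi) {
    long s = 0;
    for (std::size_t i = lo; i < hi; ++i) s += data[i];
    return s;
}

int main() {
    std::vector<long> data(1000000, 1);

    // Launch producers; the consumer keeps going until it actually needs the values.
    auto left  = std::async(std::launch::async, partial_sum,
                            std::cref(data), std::size_t{0}, data.size() / 2);
    auto right = std::async(std::launch::async, partial_sum,
                            std::cref(data), data.size() / 2, data.size());

    // The consumer blocks only at the point of use (future-style synchronization),
    // not at a global barrier.
    std::cout << "sum = " << left.get() + right.get() << "\n";
    return 0;
}
```

The point is the synchronization shape: producers run ahead, and the consumer blocks only on the exact value it needs rather than at a global barrier.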
Data Vortex Network
[Figure (“Long Bow”): nodes 1 … N-2, N-1 connected by a Data Vortex network, with MIND memory-accelerator chips and a static data-flow accelerator at each node.]
Long Bow Architecture Attributes
• Latency-hiding architecture
• Heterogeneous, based on temporal locality
  – Low/no temporal locality: high-memory-bandwidth logic
  – High temporal locality: high-clock-rate, ALU-intensive structures
• Asynchronous
  – Remote: message-driven
  – Local: multithreaded, dataflow
  – Rich array of synchronization primitives, including in-memory and in-register
• Global shared memory
  – Not cache coherent
  – Location consistency
• Percolation for pre-staging
  – Flow control managed by the MIND array
  – Processors are dumb, memory is smart
• Graceful degradation
  – Isolation of replicated structures
High-Speed Computing Element
• A kind of streaming architecture
  – Merrimac
  – TRIPS
• Employs an array of ALUs in a static data-flow structure (toy example below)
• Temporary values passed directly between successive ALUs
• Data-flow synchronization for asymmetric flow graphs
• Packet switched (rather than line switched)
• More general than pure SIMD
• Initialized (and torn down) by MIND via percolation
• Multiple coarse-grain threads
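A toy C++ rendering of the static data-flow firing rule described above: a node fires once all of its operands have arrived and hands its result directly to its successors. The graph, node names, and operations are hypothetical, and percolation, packet switching, and the physical ALU array are not modeled.

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// One node of a static data-flow graph: it fires when all operands have arrived
// and forwards its result directly to its successor nodes.
struct Node {
    std::function<double(const std::vector<double>&)> op;  // the ALU operation
    std::vector<double> inputs;                             // operands received so far
    std::size_t arity;                                      // operands needed to fire
    std::vector<std::size_t> successors;                    // receivers of the result
};

// Deliver one value to a node; apply the firing rule and propagate results.
void deliver(std::vector<Node>& g, std::size_t id, double value,
             std::queue<std::pair<std::size_t, double>>& pending) {
    g[id].inputs.push_back(value);
    if (g[id].inputs.size() < g[id].arity) return;          // still waiting for operands
    double result = g[id].op(g[id].inputs);
    g[id].inputs.clear();
    if (g[id].successors.empty())
        std::printf("result = %g\n", result);               // sink node
    for (std::size_t s : g[id].successors)
        pending.push({s, result});
}

int main() {
    // Hypothetical 3-node graph computing (a + b) * (c + d).
    auto add = [](const std::vector<double>& v) { return v[0] + v[1]; };
    auto mul = [](const std::vector<double>& v) { return v[0] * v[1]; };
    std::vector<Node> g = {
        {add, {}, 2, {2}},   // node 0: a + b -> node 2
        {add, {}, 2, {2}},   // node 1: c + d -> node 2
        {mul, {}, 2, {}},    // node 2: product (sink)
    };

    std::queue<std::pair<std::size_t, double>> pending;
    pending.push({0, 1.0}); pending.push({0, 2.0});          // a = 1, b = 2
    pending.push({1, 3.0}); pending.push({1, 4.0});          // c = 3, d = 4

    while (!pending.empty()) {
        auto [id, v] = pending.front();
        pending.pop();
        deliver(g, id, v, pending);                          // prints "result = 21"
    }
    return 0;
}
```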
Concepts of the MIND Architecture
• Virtual-to-physical address translation in memory
  – Global distributed shared memory through a distributed directory table
  – Dynamic page migration
  – Wide registers serve as a context-sensitive TLB
• Multithreaded control
  – Unified dynamic mechanism for resource management
  – Latency hiding
  – Real-time response
• Parcel active-message-driven computing
  – Decoupled split-transaction execution
  – System-wide latency hiding
  – Move work to data instead of data to work
• Parallel atomic struct processing
  – Exploits direct access to wide rows of memory banks for fine-grain parallelism and guarded compound operations (see the sketch below)
  – Exploits parallelism for better performance
  – Enables very efficient mechanisms for synchronization
• Fault tolerance through graceful degradation
• Active power management
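A rough C++ analogy for a guarded compound operation on a wide row; this is my own illustration, not the MIND instruction set. A whole record (standing in for a wide row of a memory bank) is tested against a guard predicate and updated as one unit, so the synchronization lives with the data rather than with the processor:

```cpp
#include <array>
#include <cstdint>
#include <mutex>

// Stand-in for one wide row of an embedded-DRAM bank (e.g., 256 bits = 4 words).
struct WideRow {
    std::array<uint64_t, 4> words{};
    std::mutex guard;   // software lock here; MIND would use hardware guards in the bank
};

// Guarded compound operation: evaluate a predicate over the whole row and, only
// if it holds, apply a compound update to all of its words under the same guard.
template <class Pred, class Update>
bool guarded_update(WideRow& row, Pred pred, Update update) {
    std::lock_guard<std::mutex> lock(row.guard);
    if (!pred(row.words)) return false;   // guard failed; the row is unchanged
    update(row.words);                    // compound update, applied atomically
    return true;
}

int main() {
    WideRow row;
    // Example: use word 0 as a full/empty flag for producer-consumer synchronization.
    bool filled = guarded_update(
        row,
        [](const std::array<uint64_t, 4>& w) { return w[0] == 0; },  // still empty?
        [](std::array<uint64_t, 4>& w) { w = {1, 42, 43, 44}; });    // mark full + write data
    return filled ? 0 : 1;
}
```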
Chip Layout
[Figure: MIND chip floorplan: node pairs at the leaves of a fat-tree interconnect (fat-tree switches, fat-tree controllers, bit-1/bit-2/bit-3 junctions); local parcel buses with stage-2 routers (three 2x2 routers each); shared functional units; a shared external memory interface and memory I/O port; power/ground rails; and a block for timing, master configuration, and power control.]
MIND Memory Accelerator
[Figure: one MIND node: an embedded DRAM macro joined by wide datapaths to a wide ALU, memory manager, thread manager, frame cache, and parcel handler, with control interfaces to the local parcel interconnect and to the module control unit.]
Top 10 Challenges in HPC Architecture (concepts, semantics, mechanisms, and structures)
• 10: Inter-chip high-bandwidth data interconnect
• 9: Scalability through locality-oriented asynchronous control
• 8: Global name-space translation, including first-class processes for direct resource management
• 7: Multithreaded intra-process flow control
• 6: Graceful degradation for reliability and high yield
• 5: Accelerators for lightweight-object synchronization, including futures and other in-memory synchronization for mutual exclusion
• 4: Merger of data-links and go-to flow-control semantics for directed graph traversal
• 3: In-memory logic for a memory accelerator enabling effective no-locality execution
• 2: Message-driven split-transaction processing
• 1: A new execution model governing parallel computing
Some Revolutionary Changes Won’t Come from Architecture
• Debugging
  – Can be eased with hardware diagnostics
• Problem setup
  – e.g., mesh generation
• Data analysis
  – Science understanding
  – Computer visualization – an oxymoron
    • Computers don’t know how to abstract knowledge from data
    • Once people visualize, the insight can’t be computerized
• Problem specification – easier to program
  – Better architectures will be easier to program
  – But problem representation can be intrinsically hard
• New models for advanced phenomenology
  – New algorithms
  – New physics