INEL6067
Technology ---> Limitations & Opportunities
• Wires
- Area
- Propagation speed
• Clock
• Power
• VLSI
- I/O pin limitations
- Chip area
- Chip crossing delay
- Power
• Cannot make light go any faster
• KISS rule (Keep It Simple, Stupid)
Major theme
• Look at typical applications
• Understand physical limitations
• Make tradeoffs
[Figure: ARCHITECTURE balancing application requirements against technological constraints]
Unfortunately
° Requirements and constraints are often at odds with each other!
° Architecture ---> making tradeoffs
[Cartoon: “Full connectivity! Gasp!!!”]
Putting it all together
° The systems approach
• Lesson from RISCs
• Hardware software tradeoffs
• Functionality implemented at the right level
- Hardware
- Runtime system
- Compiler
- Language, Programmer
- Algorithm
Commercial Computing
° Relies on parallelism for high end
• Computational power determines scale of business that can be handled
° Databases, online-transaction processing, decision support, data mining, data warehousing ...
Scientific Computing Demand
Applications: Speech and Image Processing
[Figure: performance demanded by successive speech and image applications, 1980-1995, growing from 1 MIPS (sub-band speech coding, 200-word isolated speech recognition) through 10-100 MIPS (speaker verification, CELP speech coding, telephone number recognition, 1,000-word continuous speech recognition) and 1 GIPS (ISDN-CD stereo receiver, CIF video, 5,000-word continuous speech recognition) to 10 GIPS (HDTV receiver)]
• Also CAD, Databases, . . .
• 100 processors gets you 10 years, 1000 gets you 20!
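The last bullet above can be checked with compound-growth arithmetic. A minimal sketch: the annual growth rates below are back-solved from the slide's own figures (10 years for 100x, 20 years for 1000x) and are illustrative assumptions, not data from the deck.

```python
import math

# Back-of-the-envelope check of the "100 processors gets you 10 years"
# claim. The annual growth factors are assumptions, roughly the
# 40-60%/year uniprocessor improvement quoted elsewhere in these slides.
def years_of_headstart(effective_speedup, annual_growth):
    """Years of uniprocessor growth needed to match an n-fold speedup."""
    return math.log(effective_speedup) / math.log(annual_growth)

print(round(years_of_headstart(100, 1.585)))   # 10: a 100x machine buys ~10 years
print(round(years_of_headstart(1000, 1.413)))  # 20: 1000x, at a lower assumed rate
```

Note that the two claims imply different effective rates; presumably the 1000-processor figure assumes less-than-perfect speedup.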
Is better parallel arch enough?
° AMBER molecular dynamics simulation program
° Starting point was vector code for Cray-1
° 145 MFLOPS on Cray C90, 406 for final version on 128-processor Paragon, 891 on 128-processor Cray T3D
Summary of Application Trends
° Transition to parallel computing has occurred for scientific and engineering computing
° Rapid progress in commercial computing
• Database and transactions as well as financial
• Usually smaller-scale, but large-scale systems also used
° Desktop also uses multithreaded programs, which are a lot like parallel programs
° Demand for improving throughput on sequential workloads
• Greatest use of small-scale multiprocessors
° Solid application demand exists and will increase
Technology Trends
[Figure: performance (log scale 0.1 to 100) of supercomputers, mainframes, minicomputers, and microprocessors, 1965-1995; microprocessors improve fastest]
° Today the natural building-block is also fastest!
Technology: A Closer Look
[Figure: a generic node, processor and cache (“Proc”, “$”) attached to an interconnect]
° Basic advance is decreasing feature size (λ)
• Circuits become either faster or lower in power
° Die size is growing too
• Clock rate improves roughly proportional to improvement in λ
• Number of transistors improves like λ² (or faster)
° Performance > 100x per decade
• clock rate < 10x, rest is transistor count
° How to use more transistors?
• Parallelism in processing
- multiple operations per cycle reduces CPI
• Locality in data access
- avoids latency and reduces CPI
- also improves processor utilization
• Both need resources, so tradeoff
° Fundamental issue is resource distribution, as in uniprocessors
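The decade figures above can be restated as annual rates. A quick sketch, pure arithmetic on the slide's numbers:

```python
# Convert the slide's per-decade factors into implied annual growth
# rates (illustrative arithmetic only, nothing measured here).
def annual_rate(decade_factor):
    """Annual growth factor implied by a per-decade growth factor."""
    return decade_factor ** (1 / 10)

print(f"performance: {annual_rate(100):.3f}x per year")  # ~1.585x (100x/decade)
print(f"clock rate:  {annual_rate(10):.3f}x per year")   # ~1.259x (10x/decade)
# The gap between the two rates is what architecture must deliver by
# spending the growing transistor budget on parallelism and locality.
```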
Growth Rates
[Figure: microprocessor clock rate (MHz, log scale 0.1 to 1,000), 1970-2005, from i4004, i8008, and i8080 through i8086, i80286, and i80386 to Pentium 100 and R10000; growth is about 30% per year]
[Figure: transistor count (log scale 1,000 to 100,000,000) for the same processors plus R2000 and R3000, 1970-2005; growth is about 40% per year]
Architectural Trends
° Architecture translates technology’s gifts into performance and capability
° Resolves the tradeoff between parallelism and locality
• Current microprocessor: 1/3 compute, 1/3 cache, 1/3 off-chip connect
• Tradeoffs may change with scale and technology advances
° Understanding microprocessor architectural trends
=> Helps build intuition about design issues for parallel machines
=> Shows fundamental role of parallelism even in “sequential” computers
Phases in “VLSI” Generation
[Figure: transistor counts (log scale 1,000 to 100,000,000), 1970-2005, for i4004 through Pentium and R10000, annotated with three phases: bit-level parallelism, then instruction-level parallelism, then thread-level parallelism (?)]
Architectural Trends
° Greatest trend in VLSI generation is increase in parallelism
• Up to 1985: bit level parallelism: 4-bit -> 8 bit -> 16-bit
- slows after 32 bit
- adoption of 64-bit now under way, 128-bit far (not performance issue)
- great inflection point when 32-bit micro and cache fit on a chip
• Mid 80s to mid 90s: instruction level parallelism
- pipelining and simple instruction sets, + compiler advances (RISC)
- on-chip caches and functional units => superscalar execution
- greater sophistication: out of order execution, speculation, prediction
– to deal with control transfer and latency problems
• Next step: thread level parallelism
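The bit-level phase above can be illustrated with a toy limb-by-limb adder: a narrow datapath needs several dependent steps for one wide operation, a wider one finishes in a single step. The function and datapath model are invented for illustration, not from any real ISA.

```python
# Toy model of bit-level parallelism: add two 32-bit values on a narrow
# datapath by rippling limb additions with carries, versus one step on
# a full-width datapath. Purely illustrative.
def add_limbs(a, b, limb_bits):
    """Add two non-negative ints limb by limb; return (sum, steps)."""
    mask = (1 << limb_bits) - 1
    result, carry, shift, steps = 0, 0, 0, 0
    while a >> shift or b >> shift or carry:
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift
        carry = s >> limb_bits
        shift += limb_bits
        steps += 1
    return result, steps

x, y = 0xDEADBEEF, 0x01234567
print(add_limbs(x, y, 8))   # 4 limb additions on an 8-bit datapath
print(add_limbs(x, y, 32))  # 1 addition on a 32-bit datapath
```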
How far will ILP go?
[Figure: from an idealized ILP study; left, fraction of total cycles (%) by number of instructions issued per cycle (0 to 6+); right, speedup (0 to 3) versus instructions issued per cycle (0 to 15)]
° Infinite resources and fetch bandwidth, perfect branch prediction and renaming
– real caches and non-zero miss latencies
Thread-Level Parallelism “on board”
° Micro on a chip makes it natural to connect many to shared memory
– dominates server and enterprise market, moving down to desktop
° Faster processors began to saturate bus, then bus technology advanced
– today, range of sizes for bus-based systems, desktop to large servers
[Figure: four processors (“Proc”) sharing one bus to memory (“MEM”)]
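The shared-memory organization above can be mimicked with stock Python threads. A minimal sketch, with the counter, lock, and thread count invented for illustration; CPython's GIL interleaves the threads rather than running them on separate processors, but the programming model is the one the slide describes.

```python
import threading

# Threads update one counter through ordinary loads and stores on
# shared data, coordinated by a lock.
counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:           # synchronization operation
            counter += 1     # plain load-modify-store on shared data

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 4000: every update is visible through shared memory
```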
What about Multiprocessor Trends?
[Figure: number of processors (0 to 70) in bus-based multiprocessors, 1984-1998, including Sequent B8000 and B2100, Symmetry21 and Symmetry81, Power, SS690MP 120 and 140, SGI PowerSeries, SGI Challenge and PowerChallenge/XL, SS10, SS20, SS1000, SS1000E, SC2000, SC2000E, SE10 through SE70, AS2100, AS8400, HP K400, CRAY CS6400, P-Pro, Sun E6000, and Sun E10000]
What about Storage Trends?
° Divergence between memory capacity and speed even more pronounced
• Capacity increased by 1000x from 1980-95, speed only 2x
• Gigabit DRAM by c. 2000, but gap with processor speed much greater
° Larger memories are slower, while processors get faster
• Need to transfer more data in parallel
• Need deeper cache hierarchies
• How to organize caches?
° Parallelism increases effective size of each level of hierarchy, without increasing access time
° Parallelism and locality within memory systems too
• New designs fetch many bits within memory chip; follow with fast pipelined transfer across narrower interface
• Buffer caches most recently accessed data
° Disks too: Parallel disks plus caching
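Why the deeper hierarchies above pay off can be sketched with the standard average-memory-access-time model. Every latency and miss rate below is an invented round number, chosen only to illustrate the effect.

```python
# Average memory access time (AMAT) with one versus two cache levels,
# using made-up round figures for latencies and miss rates.
def amat_one_level(l1_hit, l1_miss_rate, mem_latency):
    return l1_hit + l1_miss_rate * mem_latency

def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency)

# 1-cycle L1 with 10% misses; 10-cycle L2 catching 80% of those misses;
# 100-cycle memory.
print(amat_one_level(1, 0.10, 100))            # ~11 cycles
print(amat_two_level(1, 0.10, 10, 0.20, 100))  # ~4 cycles
```

The wider the memory gap grows, the larger the win from the extra level.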
Economics
° Commodity microprocessors not only fast but CHEAP
• Development costs tens of millions of dollars
• BUT, many more are sold compared to supercomputers
• Crucial to take advantage of the investment, and use the commodity building block
° Multiprocessors being pushed by software vendors (e.g. database) as well as hardware vendors
° Standardization makes small, bus-based SMPs commodity
° Desktop: few smaller processors versus one larger one?
° Multiprocessor on a chip?
Consider Scientific Supercomputing
° Proving ground and driver for innovative architecture and techniques
• Market smaller relative to commercial as MPs become mainstream
• Dominated by vector machines starting in 70s
• Microprocessors have made huge gains in floating-point performance
- high clock rates
- pipelined floating point units (e.g., multiply-add every cycle)
- instruction-level parallelism
- effective use of caches (e.g., automatic blocking)
• Plus economics
° Large-scale multiprocessors replace vector supercomputers
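The blocking idea mentioned above can be sketched with a toy traversal: visit a matrix in B x B tiles so each tile can stay cache-resident while it is reused. A plain Python sum stands in for the real floating-point kernel, and the matrix and block size are made up; real automatic blocking is done by the compiler.

```python
# Toy stand-in for cache blocking: tile-by-tile traversal of a matrix.
def sum_blocked(matrix, block=2):
    n = len(matrix)
    total = 0
    for bi in range(0, n, block):            # tile row start
        for bj in range(0, n, block):        # tile column start
            for i in range(bi, min(bi + block, n)):
                for j in range(bj, min(bj + block, n)):
                    total += matrix[i][j]
    return total

m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(sum_blocked(m))  # 136, identical to a straight row-major sweep
```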
Raw Parallel Performance: LINPACK
[Figure: LINPACK GFLOPS (log scale 0.1 to 10,000), 1985-1996, CRAY peak versus MPP peak; Cray points include Xmp/416(4), Ymp/832(8), C90(16), and T932(32); MPP points include iPSC/860, nCUBE/2(1024), CM-2, CM-200, Delta, CM-5, Paragon XP/S, Paragon XP/S MP(1024) and MP(6768), T3D, and ASCI Red]
° Even vector Crays became parallel
• X-MP (2-4), Y-MP (8), C-90 (16), T94 (32)
° Since 1993, Cray produces MPPs too (T3D, T3E)
Where is Parallel Arch Going?
[Figure: divergent architecture families (systolic arrays, SIMD, dataflow, shared memory, message passing), each carrying its own application and system software]
• Uncertainty of direction paralyzed parallel software development!
Old view: Divergent architectures, no predictable pattern of growth.
Modern Layered Framework
[Figure: layered framework. Parallel applications (CAD, database, scientific modeling) sit on programming models (multiprogramming, shared address, message passing, data parallel); beneath them, the communication abstraction marks the user/system boundary and is implemented via compilation or library and operating systems support; communication hardware and the physical communication medium lie below the hardware/software boundary]
Summary: Why Parallel Architecture?
° Increasingly attractive
• Economics, technology, architecture, application demand
° Increasingly central and mainstream
° Parallelism exploited at many levels
• Instruction-level parallelism
• Multiprocessor servers
• Large-scale multiprocessors (“MPPs”)
° Focus of this class: multiprocessor level of parallelism
° Same story from memory system perspective
• Increase bandwidth, reduce average latency with many local memories
° Spectrum of parallel architectures makes sense
• Different cost, performance and scalability
History
[Figure: divergent architecture families (systolic arrays, SIMD, dataflow, shared memory, message passing), each with its own application and system software]
° Parallel architectures tied closely to programming models
• Divergent architectures, with no predictable pattern of growth.
• Mid 80s revival
Programming Model
° Look at major programming models
• Where did they come from?
• What do they provide?
• How have they converged?
° Extract general structure and fundamental issues
° Reexamine traditional camps from new perspective
[Figure: systolic arrays, SIMD, dataflow, shared memory, and message passing converging toward a generic architecture]
Programming Model
° Conceptualization of the machine that programmer uses in coding applications
• How parts cooperate and coordinate their activities
• Specifies communication and synchronization operations
° Multiprogramming
• no communication or synch. at program level
° Shared address space
• like bulletin board
° Message passing
• like letters or phone calls, explicit point to point
° Data parallel
• more regimented, global actions on data
• Implemented with shared address space or message passing
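The two dominant models above can be contrasted with a toy producer/consumer, using Python stdlib pieces as stand-ins: a shared variable as the “bulletin board” and a queue as the explicit “letter”. All names here are invented for illustration.

```python
import queue
import threading

# Shared address space: the writer posts to a shared location and the
# reader simply loads it; an event supplies the synchronization.
board = {}
posted = threading.Event()

def poster():
    board["msg"] = "hello"   # ordinary store to shared data
    posted.set()             # synchronization operation

t1 = threading.Thread(target=poster)
t1.start()
posted.wait()
t1.join()
print(board["msg"])          # reader loads the shared location

# Message passing: explicit point-to-point send and receive.
mbox = queue.Queue()

def sender():
    mbox.put("hello")        # explicit send

t2 = threading.Thread(target=sender)
t2.start()
received = mbox.get()        # explicit receive (blocks until delivery)
t2.join()
print(received)
```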
Adding Processing Capacity
° Memory capacity increased by adding modules
° I/O by controllers and devices
° Add processors for processing!
• For higher-throughput multiprogramming, or parallel programs
[Figure: memory modules (“Mem”), I/O controllers with devices, and processors all attached through an interconnect]
Historical Development
[Figure: left, “mainframe” organization: processors (P) and I/O ports connected to memory modules (M) through a crossbar (C); right, “minicomputer” organization: processors with caches ($) and I/O sharing one bus to memory]
° “Mainframe” approach
• Motivated by multiprogramming
• Extends crossbar used for Mem and I/O
• Processor cost-limited => crossbar
• Bandwidth scales with p
• High incremental cost
- use multistage instead
° “Minicomputer” approach
• Almost all microprocessor systems have bus
• Motivated by multiprogramming, TP
• Used heavily for parallel computing
• Called symmetric multiprocessor (SMP)
• Latency larger than for uniprocessor
• Bus is bandwidth bottleneck
- caching is key: coherence problem
• Low incremental cost
Shared Physical Memory
° Any processor can directly reference any memory location
° Any I/O controller - any memory
° Operating system can run on any processor, or all
• OS uses shared memory to coordinate
° Communication occurs implicitly as result of loads and stores
° What about application processes?
Shared Virtual Address Space
° Process = address space plus thread of control
° Virtual-to-physical mapping can be established so that processes share portions of address space
• User-kernel or multiple processes
° Multiple threads of control in one address space
• Popular approach to structuring OS’s
• Now standard application capability
° Writes to shared addresses are visible to other threads
• Natural extension of the uniprocessor model
• conventional memory operations for communication
• special atomic operations for synchronization
- also load/stores
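The “special atomic operations” bullet can be sketched with a test&set flag, the classic atomic synchronization primitive. Real hardware provides this as a single atomic instruction (test&set or compare&swap); in this hypothetical Python class, a Lock stands in for that hardware atomicity.

```python
import threading

# A software model of a test&set flag; the Lock models the atomicity
# that hardware would otherwise guarantee.
class TestAndSet:
    def __init__(self):
        self._flag = False
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def test_and_set(self):
        """Atomically return the old flag value and set the flag."""
        with self._guard:
            old = self._flag
            self._flag = True
            return old

    def clear(self):
        with self._guard:
            self._flag = False

lock_word = TestAndSet()
print(lock_word.test_and_set())  # False: the flag was free, now acquired
print(lock_word.test_and_set())  # True: already held, caller must retry
lock_word.clear()                # release
```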
Structured Shared Address Space
° Ad hoc parallelism used in system code
° Most parallel applications have structured SAS
° Same program on each processor
• shared variable X means the same thing to each thread
[Figure: virtual address spaces of processes P0 through Pn, each with a private portion and a shared portion; a store by P1 and a load by P2 meet in the shared portion, which maps to common physical addresses in the machine physical address space, while private portions map to disjoint physical memory]