SGI Multi-Paradigm Architecture
Michael Woodacre, Chief Engineer, Server Platform Group
SGI Proprietary
A History of Innovation in HPC
Events shown along the timeline (years marked: 1982, 1984, 1988, 1994, 1995, 1996, 1997, 1998, 2001, 2003, 2004):
• Jim Clark founded SGI on the vision of computer visualization
• IRIS® workstations become first integrated 3D graphics systems
• Power Series™ multi-processing systems provide compute power for high-end graphics applications
• SGI introduces its first 64-bit operating system
• Challenge® XL media server fuels Steven Spielberg’s Shoah project to document Holocaust survivor stories
• Dockside engineering analysis on Origin® 2000 and Indigo2™ helps Team New Zealand win America’s Cup
• First generation NUMA system: Origin® 2000
• First systems deployed in Stephen Hawking’s COSMOS system
• DOE deploys 6144p Origin 2000 to monitor and simulate nuclear stockpile
• Introduced modular NUMAflex™ architecture with Origin® 3000
• Altix®, first scalable 64-bit Linux® server
• NASA Ames and Altix® set world record for STREAMS benchmark
• First 512p Altix cluster drives ocean research at NASA Ames, +10000p upgrade!
Images courtesy of Team New Zealand and the University of Cambridge
Over Time, Problems Get More Complex, Data Sets Exploding
• Bumper, hood, engine, wheels
• Organ damage
• E-crash dummy
• Entire car
First Row Images: EAI, Lana Rushing, Engineering Animation, Inc, Volvo Car Corporation, Images courtesy of the SCI. Second Row Images: The MacNeal-Schwendler Corp, Manchester Visualization Center and University Department of Surgery, Paradigm Geophysical, the Laboratory for Atmospheres, NASA Goddard Space Flight Center.
• Improve design & manufacturing
• Improve hurricane prediction
• Improve oil exploration
• Improve patient safety
This Trend Continues Across SGI's Markets
SGI Scalable ccNUMA Architecture: Basic Node Structure and Interconnect
[Figure: two nodes joined by the NUMAlink™ Interconnect. Each node has two CPUs, each with its own cache, attached to an Interface Chip with local Physical Memory.]
SGI Scalable ccNUMA Architecture: Basic Node Structure and Interconnect
[Figure: two nodes, each with two CPUs (each with its own cache) and an Interface Chip, joined by the NUMAlink™ Interconnect. The memory is drawn as a single Global Shared Memory spanning both nodes.]
Logical Layout - 8TB
Altix 128 Processor 8TB - 1.6GB/s Uniform Memory Bandwidth
[Figure: dual-plane router topology (Plane A and Plane B). Level 1 Routers R1 to R32 are each drawn with four attached C elements, and Level 2 Routers connect the Level 1 Routers. Additional labels in the figure: RB2 to RB31, RC2A to RC9A, RC2B to RC9B.]
Interconnect Topology
Bi-Section Bandwidth Profiles (GB/s per CPU)
[Figure: line chart. X axis: Processors, 4 to 2048. Y axis: Raw Bi-Section Bandwidth (GB/s per CPU), 0.2 to 3.2. Series: Dual Plane, NL3 router, 8-port router bricks; Dual Plane, NL4 router, 8-port router bricks.]
Examples of Single-Paradigm Architectures
• Scalar: Intel Itanium, SGI MIPS, IBM Power, Sun SPARC, HP PA
• Vector: Cray X1, NEC SX
• App-Specific: Graphics (GPU), Signals (DSP), Programmable (FPGA), Other ASICs
Paradigms to Applications
[Figure: chart of Compute Intensity (Low to High) against Data Locality (Low to High), with regions labeled Vector, Scalar, and Application-specific (twice).]
Peer I/O: Increased I/O Flexibility & Performance
[Figure: two configurations side by side. XIO+: 1:1 ratio of C-brick to I/O. Peer I/O: diagram built from elements labeled C, R, I/O, and NL.]
SGI Scalable ccNUMA Architecture: Multi-Paradigm Computing Architecture
[Figure: four nodes (each with two CPUs and their caches, an Interface Chip, and Physical Memory) attached to the NUMAlink™ Interconnect Fabric. Three TIO attachments on the same fabric provide General Purpose I/O Interfaces, Scalable GPUs, and RASC™ (FPGA).]
Data-Centric Architecture
Very Large GAM. Globally Addressable. Low Latency. High Bandwidth. Many Ports
[Figure: CPUs, IO, FPGA-based APUs, and GPUs arranged as ports around the shared memory. GPU labels: Graphics GPU-0 to GPU-3 (with Composition) and APU GPU-0 to GPU-3.]
Big Data
Each box represents 1GB
1TB, 32*32=1024 elements
Big Datasets : 3D Interactive Visualization
1993: 100 MB dataset, 10% viewed / year, ~1 MB / month
2004: 400 GB dataset, 100% viewed / month, 400 GB / month
40,000x Productivity
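The slide does not say how the 40,000x figure was derived. As a rough check, the sketch below recomputes the viewing volumes from the numbers quoted above, assuming decimal units (1 GB = 1000 MB). The variable names are illustrative only.

```python
# Rough check of the data-viewing figures quoted on the slide.
# Assumption: decimal units, 1 GB = 1000 MB.
MB_PER_GB = 1000

dataset_1993_mb = 100
viewed_1993_mb_per_year = dataset_1993_mb * 10 / 100      # 10% viewed per year
viewed_1993_mb_per_month = viewed_1993_mb_per_year / 12

dataset_2004_mb = 400 * MB_PER_GB
viewed_2004_mb_per_month = dataset_2004_mb                # 100% viewed per month

# 2004 monthly volume compared with the 10 MB viewed in a whole year in 1993.
ratio_vs_yearly = viewed_2004_mb_per_month / viewed_1993_mb_per_year
# The same comparison with both sides expressed per month.
ratio_per_month = viewed_2004_mb_per_month / viewed_1993_mb_per_month

print(f"1993: {viewed_1993_mb_per_month:.2f} MB/month")
print(f"400 GB vs 10 MB: {ratio_vs_yearly:,.0f}x")
print(f"per-month ratio: {ratio_per_month:,.0f}x")
```

One possible reading is that 40,000x compares 400 GB with the 10 MB viewed over a year in 1993. On a strict per-month basis the ratio comes out larger.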
Commodity GPU systems 5X the price of a Scale-up System
March 17, 2005: nVIDIA visualizes large data set
• 473 million triangles
• 128 GPUs on Dell systems
• ~$1 million system
January 21, 2005: SGI visualizes large data set
• 350 million triangles
• 12P, 56GB memory
• Utilizes a ray tracer
• ~$180,000 system
Compliments of Boeing
Compliments of nVIDIA
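The "5X" headline can be checked against the two approximate system prices on the slide. The sketch below also derives a cost per million triangles, which is not a metric the slide itself uses; the dictionary layout and function name are illustrative.

```python
# Approximate figures as quoted on the slide (both prices are marked "~").
systems = {
    "nVIDIA, 128 GPUs on Dell systems": {"price_usd": 1_000_000, "million_triangles": 473},
    "SGI, 12P with 56GB memory": {"price_usd": 180_000, "million_triangles": 350},
}

def usd_per_million_triangles(system: dict) -> float:
    """Price divided by model size; a derived, illustrative metric."""
    return system["price_usd"] / system["million_triangles"]

prices = [s["price_usd"] for s in systems.values()]
price_ratio = max(prices) / min(prices)

print(f"price ratio: {price_ratio:.1f}x")
for name, system in systems.items():
    print(f"{name}: ${usd_per_million_triangles(system):,.0f} per million triangles")
```

The two demonstrations used different data sets and rendering methods, so the comparison is indicative only.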
Dynamic Load Balancing
[Figure: side-by-side comparison, Load Balancing ON vs. Load Balancing OFF; annotation: "Given Most Work".]
Dimensions of Scalability
• Processors
• Processor bandwidth
• Memory bandwidth
• Memory capacity
• Interconnect bandwidth
• IO bandwidth
• Graphics processing
• Reconfigurable processing
• Other acceleration elements
Origin 3000 Building Blocks (Bricks)
• C-brick: CPU Module
• D-brick: Disk Storage
• R-brick: Router Interconnect
• X-brick: XIO Expansion
• P-brick: PCI Expansion
• I-brick: Base I/O Module
SGI Altix™ 3700 Bx2 Platform Introduction: Building Blocks
• Itanium® 2 CR-brick: CPU and memory
• M-brick: Memory
• IX-brick: Base I/O module
• PA-brick, PX-brick: PCI-X expansion
• R-brick: Router interconnect
• D-brick2: Disk expansion
SGI® Advanced Linux Environment with SGI ProPack
High-End Servers – Moving Forward: Altix® 4700 Platform, Blade Packaging
• Innovative Blade-to-NUMAlink4 Concept: Provides Unprecedented Versatility, Density
• Blade Architecture Leads Next Wave of HPC Blade-Based Platforms, with Better Upgradeability, Expansion & Repair
• Investment Protection: Processor-Only Upgrade to Future Dual Core Processors
• Enables Flexible Multi-Paradigm Computing: Enhanced Integrated RASC, Graphics
Next Generation RASC™ Technology: Blade-Based Package
Standardized Blades, NUMAlink Backbone
Packaging hierarchy:
• Blade
• Individual Rack Unit (IRU), containing 10 blades
• Rack (small rack = 4 IRUs)
[Figure: front view of racked IRUs, each with an L1 Display, Filler Panels, and Blade Slots 1 to 10.]
Altix® 4700 Compute Blades
• Support for Madison9M Processors (Montecito/Montvale as Available)
• Two Compute Blade Options to Provide Different System Capabilities:
– Best $/FLOP, Best Density (Density Compute Blade), OR
– Best Performance, Memory BW (Bandwidth Compute Blade)
[Figure: IRU front view with L1 Display, Filler Panels, and Blade Slots 1 to 10, showing I/O Blades, a Compute Blade, a Graphics Blade, a RASC Blade, and a Memory Blade.]
Altix® 4700 Compute Blades
Highest Memory BW, Performance: Bandwidth Compute Blade
• 667MHz FSB Madison9M -> 10.7GB/s Local Memory Bandwidth
• 32 M9M Sockets / S-Rack
• Processors Supported: 1.66GHz/9M, 1.66GHz/6M Madison9M with 667MHz FSB
• Memory Sizes: 2GB – 48GB/core
[Figure: Bandwidth Compute Blade, top and front views of a single blade: one M9M Socket connected at 10.7GB/s to Shub2.0, 12 DDR2 DIMMs, and an NL4 link at 6.4GB/s.]
Best $/FLOP, Best Density: Density Compute Blade
• 533MHz FSB Madison9M -> 8.524GB/s Local Memory Bandwidth
• 64 M9M Sockets / S-Rack
• Processors Supported: 1.6GHz/9M, 1.6GHz/6M Madison9M with 533MHz FSB
• Memory Sizes: 1GB – 24GB/core
[Figure: Density Compute Blade, top and front views of a single blade: two M9M Sockets connected at 8.5GB/s to Shub2.0, 12 DDR2 DIMMs, and an NL4 link at 6.4GB/s.]
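The two local memory bandwidth figures follow from the front-side-bus rate. The sketch below shows the arithmetic, assuming a 128-bit (16-byte) Itanium 2 data bus; the slide gives only the FSB rate and the resulting bandwidth, so the bus width and the function name are assumptions.

```python
# Peak front-side-bus bandwidth = transfers per second * bytes per transfer.
# Assumption (not stated on the slide): a 128-bit, i.e. 16-byte, data bus.
BUS_WIDTH_BYTES = 16

def fsb_bandwidth_gb_s(fsb_mega_transfers: float,
                       bus_width_bytes: int = BUS_WIDTH_BYTES) -> float:
    """Return peak FSB bandwidth in GB/s (decimal) for a given FSB rate."""
    return fsb_mega_transfers * 1e6 * bus_width_bytes / 1e9

blades = {
    "Bandwidth Compute Blade": 667,  # MHz FSB, from the slide
    "Density Compute Blade": 533,    # MHz FSB, from the slide
}

for name, fsb_rate in blades.items():
    print(f"{name}: {fsb_bandwidth_gb_s(fsb_rate):.1f} GB/s")
```

Under that assumption the results round to the 10.7GB/s and roughly 8.5GB/s quoted for the two blades.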
Memory Blade (Q2 CY06)
• Scale Memory Independently with 12 DDR2 DIMM Slots Per Blade
• Up to 128TB
[Figure: top and front views of a single blade: Shub2.0 with an NL4 link at 6.4GB/s and 12 DDR2 DIMMs.]
Altix® 4700 RASC Blade
• RASC Blade
– Abacus Computation Blade
– Enhanced Performance, Tightly Integrated
[Figure: IRU front view with L1 Display, Filler Panels, and Blade Slots 1 to 10, showing I/O Blades, a CPU Blade, a Graphics Blade, a RASC Blade, and a Memory Blade.]
RASC Blades – Cont.
Abacus Computation Blade:
• New Levels of Performance:
– High Performance V4LX160 FPGA with 160K Logic Cells
– Increased Memory Sizes, 12 DIMM per Blade
• Optional Brick Packaging for Legacy Platforms
[Figure: top and front views of a single blade, with blocks labeled NL4 6.4GB/s, TIO (twice), Loader, PCI, FPGA (twice), SSP (twice), Selmap (twice), and SSRAM banks.]
How does RASC™ Technology Differ from Traditional CPUs?
Directly map computationally-intensive algorithms to hardware with RASC:
• Identify RASC-appropriate algorithm
• Compare application run-time percentages
• Export algorithm to RASC
[Figure: Application Run-Time Comparison. Stacked bars for App 1 to App 5; Y axis: % of Runtime, 0% to 100%; legend: Algorithm, Memory Calls, Branch inst.]
[Figure: Job Run Time. Traditional Method (CPU only): algorithm execution time. RASC Method (key algorithm running on FPGA): shorter algorithm execution time, with the difference marked Time Savings.]
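The run-time comparison above is an instance of Amdahl's law: only the fraction of job time spent in the exported algorithm gets faster. The sketch below is a minimal illustration, not an SGI tool; the fractions and the kernel speedup are hypothetical values chosen for the example.

```python
def overall_speedup(accel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: speedup of the whole job when only part of it is accelerated.

    accel_fraction: share of original run time spent in the exported algorithm (0..1).
    kernel_speedup: how much faster that algorithm runs on the accelerator.
    """
    if not 0.0 <= accel_fraction <= 1.0:
        raise ValueError("accel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    remaining = (1.0 - accel_fraction) + accel_fraction / kernel_speedup
    return 1.0 / remaining

# Hypothetical profiles: the algorithm takes 50%, 90%, or 99% of run time,
# and the FPGA runs it 50x faster (an illustrative number only).
for fraction in (0.50, 0.90, 0.99):
    print(f"{fraction:.0%} in algorithm -> {overall_speedup(fraction, 50.0):.1f}x overall")
```

This is why the first step is to identify applications in which a single algorithm dominates run time: a fast kernel helps little if it accounts for a small share of the job.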
Application Segments
Application segments and sample applications:
• Image and video processing: Transcoding (digital watermarks, format conversion), compression (JPEG, MPEG), color correction, ray-tracing, edge detection (Sobel)
• Digital Signal Processing: FFT, IFFT, Filtering (FIR and IIR)
• Database Acceleration: Query, sorting, pattern recognition, data compression
• Network and Communication: Interleaver/de-interleaver, coding/decoding (Reed Solomon, Viterbi), convolution encoders, encryption, error correction, packet processing (IPsec)
• HPC Algorithm Acceleration – Gov/Defense: MATLAB, STAR-P, random number generators, Sigint/Elint, image recognition (radar/vision/IR), DEM
• HPC Algorithm Acceleration – Bioinformatics: Blast, Smith-Waterman
Ease of Use
• Leverage 3rd Party Std Language Tools
– Celoxica, Mitrionics, Starbridge Systems, Nallatech
• Developed an FPGA-aware version of GDB
– Capable of debugging the FPGA and System Software
– Capable of multiple CPUs and multiple FPGAs
• Developed RASC Abstraction Layer (RASCAL)
• Provide for HDL modules
– Integrated environment with debugger
– Highest performance
3rd Party Tools
• Celoxica – http://www.celoxica.com
– Handel-C
• Mitrionics – http://www.mitrionics.com
– Mitrion C
• Starbridge Systems – http://www.starbridgesystems.com/
– Viva graphical development environment
• Nallatech – http://www.nallatech.com/
– SGI strategic partner
Ease of Use v. Efficiency
[Figure: chart of Efficiency (Low to High) against Ease of Use (Easy to Difficult). Labels: Verilog, VHDL; data points marked x.]
Bitstream Generation… HLL Tools
[Figure: tool flow from HLL source to FPGA bitstream.
Stages shown on the IA-32 Linux® machine: HLL Design Entry (Handel-C, Impulse C, Mitrion C, Viva); RTL Generation and Integration with Core Services; Design Verification, with Behavioral Simulation (VCS, Modelsim) and Static Timing Analysis (ISE Timing Analyzer); Design Synthesis (Synplify Pro, Amplify); Design Implementation (ISE); Metadata Processing (Python).
Stages shown on the Altix® server: Device Programming (RASC™ Abstraction Layer, Device Manager, Device Driver); Real-time Verification (gdb).
File types on the connecting arrows: .v, .vhd, .edf, .ncd, .pcf, .bin, .cfg, .c.]
Ease of Use
• Leverage 3rd Party Std Language Tools
– Celoxica, Impulse Acceleration, Mitrion, Starbridge Viva
– In discussions with other HLL tool vendors
• Developed an FPGA-aware version of GDB
– Capable of debugging the FPGA and System Software
– Capable of multiple CPUs and multiple FPGAs
• Developed RASC Abstraction Layer (RASCAL)
• Provide for HDL modules
– Integrated environment with debugger
– Highest performance
FPGA Aware Debugger
• Based on Open Source GNU Debugger (GDB)
• Uses extensions to current command set
• Can debug host application and FPGA
• Provides notification when FPGA starts or stops
• Supplies information on FPGA characteristics
• Can “single-step” or “run N steps” of the algorithm
• Can step line by line through the HLL (C) source
• Dumps data regarding the set of “registers” that are visible when the FPGA is active
GDB Debugging Environment
Algorithm.c:
tmp = a & b;
d = tmp | c;

Debugger session:
(gdb) fpgastep
(gdb) p/x $a
$6 = 0x444433
(gdb) p/x $b
$7 = 0x111122
(gdb) p/x $tmp
$8 = 0x555533
(gdb) fpgastep
(gdb) p/x $tmp
$9 = 0x555533
(gdb) p/x $c
$10 = 0x331222
(gdb) p/x $d
$11 = 0x111022

[Figure: dataflow on the COP FPGA. Inputs a and b feed the & operator to produce tmp; tmp and c feed the | operator to produce d. Debugger running in real time.]
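The register values printed in the session can be replayed on the host. The sketch below evaluates the two statements both as they appear in the Algorithm.c listing and with the two operators swapped. It is a host-side check only and does not model the fpgastep mechanism; all variable names other than a, b, c, tmp, and d are illustrative.

```python
# Input register values as printed in the debugger session.
a, b, c = 0x444433, 0x111122, 0x331222

# As written in the Algorithm.c listing: AND first, then OR.
tmp_listing = a & b
d_listing = tmp_listing | c

# Same statements with the operators swapped: OR first, then AND.
tmp_swapped = a | b
d_swapped = tmp_swapped & c

print("listing order:", hex(tmp_listing), hex(d_listing))
print("swapped order:", hex(tmp_swapped), hex(d_swapped))
```

The printed $tmp and $d values (0x555533 and 0x111022) match the swapped order, which suggests the listing and the session output may come from different revisions of the example.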
RASC™ Software Stack
• User Space: Application, Abstraction Layer Library, Device Manager, Download Utilities, SpeedShop™, Debugger (GDB)
• Linux® Kernel: Algorithm Device Driver, Download Driver
• Hardware: COP (TIO, Algorithm FPGA, Memory, Download FPGA)
Abstraction Layer: Algorithm API
The Abstraction Layer’s algorithm API mirrors the COP API with a few additions that enable:
• Wide Scaling
• Deep Scaling
Working with industry/customers (www.openfpga.org) on API stds…
[Figure: two data-flow diagrams in which an Application passes Input Data through an Algorithm running on several COPs to produce Output Data.]
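The slide names wide and deep scaling without defining them. The sketch below assumes that wide scaling means running the same algorithm on several devices, each taking a slice of the input, and that deep scaling means chaining devices so one stage's output feeds the next. It uses plain Python functions as stand-ins for COPs and is not the RASC Abstraction Layer API; every name in it is hypothetical.

```python
from typing import Callable, List, Sequence

# A stand-in for an algorithm loaded on one COP: takes a data block, returns a block.
Algorithm = Callable[[List[int]], List[int]]

def wide_scale(algorithm: Algorithm, data: Sequence[int], n_devices: int) -> List[int]:
    """Split the input across n_devices that all run the same algorithm."""
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    chunk = -(-len(data) // n_devices)  # ceiling division
    output: List[int] = []
    for i in range(n_devices):
        block = list(data[i * chunk:(i + 1) * chunk])
        output.extend(algorithm(block))  # each block would go to its own device
    return output

def deep_scale(stages: Sequence[Algorithm], data: Sequence[int]) -> List[int]:
    """Chain devices: the output of each stage is the input of the next."""
    result = list(data)
    for stage in stages:
        result = stage(result)
    return result

def double(block: List[int]) -> List[int]:
    return [2 * x for x in block]

def increment(block: List[int]) -> List[int]:
    return [x + 1 for x in block]

print(wide_scale(double, [1, 2, 3, 4, 5], n_devices=3))
print(deep_scale([double, increment], [1, 2]))
```

In this sketch the blocks are processed one after another; on real hardware the point of wide scaling would be that they run concurrently.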
FPGA Architecture Overview
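[Figure: block diagram of the FPGA. The SSP interface (3.2 GB/s shown twice) connects to the Core Services Block, which sits beside the Algorithm Block. RAM Bank 0 through RAM Bank N are reached through Readport 0 to Readport N and Writeport 0 to Writeport N.]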
Algorithm Block as Submodule
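[Figure: the Algorithm Block drawn as a submodule alongside the Algorithm controller and the SRAM controller (one bank shown), with a Debug port.]
Signals shown:
• Control: alg_clk, alg_rst, do_step, step_flag, alg_done
• SRAM write: sram_wr_req, sram_wr_gnt, sram_wr_addr[17:0], sram_wr_data[63:0], sram_wr_be[7:0], sram_wr_dvld
• SRAM read: sram_rd_req, sram_rd_gnt, sram_rd_addr[17:0], sram_rd_cmd_vld, sram_rd_data, sram_rd_dvld
• Debug port: debug0 to debug63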
Verilog / VHDL Module Support
• Templates for Verilog and VHDL
– Fast start to algorithm coding
• Provide a system simulation stub
– Allows both simulation debug and system debug
• Provide source code for core services
– Allows user to modify to meet special needs
• Extractor tools support GDB meta-data
– Application and FPGA debugging
RASC™ Technology: Demonstrated Application Speed-up
Bit Manipulation (Cryptography) [1]
• 79x 1.5GHz Intel® Itanium® 2 Processor (single RASC Unit)
• 119x 1.5GHz Intel® Itanium® 2 Processor (dual RASC Unit)
Customer Application
• 20,000x speedup on scalar microprocessor
• EXERGY – MAPLD 2005 paper 190
Graphics Edge Detection [1]
• 7.4x 1.5GHz Intel® Itanium® 2 Processor (single RASC Unit)
[1] Based on internal testing
RASC Platform Capabilities
• Direct Connection to NUMAlink4: 6.4GB/s/connection
• Fast System Level Reprogramming of FPGA: FPGA load at memory speeds
• Atomic Memory Operations: Same set as System CPUs
• Hardware Barriers: Dynamic Load Balancing
• Configurations to 8191 NUMA/FPGA Nodes: Scalability
Thank You
Strategy for Big Data
Heterogeneous: IRIX, Linux, Windows, Solaris, IBM AIX, HP-UX, Mac OS X
[Figure: two system diagrams. In each, PBs of Disk (Datasets) and an Open Source Scalable Filesystem connect through IO channels to a TBs Memory Dataset. The first diagram shows MPUs attached to the memory; the second adds APUs and an APU-GPU alongside the MPUs.]
SGI® RASC™ Technology Summary
• Tightly coupled, high bandwidth/low latency integration into NUMA fabric
– Significant bandwidth advantage (6.4GB/s)
– Coherent shared memory access
– Atomic memory operations
– Scalability (wide scaling and deep scaling)
• Orders-of-magnitude performance improvement and application speedup
– Beneficial when running data intensive applications critical to oil and gas exploration, defense and intelligence, bioinformatics, medical imaging, broadcast media, and other data-dependent industries
• Ease of programming: complete software stack
– RASClib (API and core services library) provides abstraction layer to support reconfigurable elements in a multi-processing, multi-user environment
– Fully integrated third-party HLL development tools
– FPGA-aware enhancements to GNU debugger (open-source)
• Add-in module that seamlessly operates with SGI® Altix® servers and Silicon Graphics Prism™ visualization systems
Multi-Paradigm Computing: Other Non-traditional Processing Initiatives
• GPU-based processing
– High potential performance (200-300GF peak today) and performance/price on single precision floating point applications…clear roadmap to future semiconductor process technologies
– SGI working with SI on scaling to multiple GPUs and on development environment/programming paradigms…initial focus on signal processing apps
• Specialized processors: ClearSpeed™ processors, custom processors (MD-GRAPE, classified chip)
– High potential performance/watt on certain apps
This slide contains forward-looking statements. The results and forecasts as stated may vary. Other risks and uncertainties relating to this slide may be found in the "Safe-Harbor" statement at the beginning of this presentation.