SGI Multi-Paradigm Architecture

Michael Woodacre, Chief Engineer, Server Platform Group

2

SGI Proprietary

A History of Innovation in HPC

1982 1984 1988 1994 1995 1996 2001 2003 2004

Power Series™, multi-processing systems provide compute power for high-end graphics applications

NASA Ames and Altix® set world record for STREAMS benchmark

First 512p Altix cluster drives ocean research at NASA Ames +10000p upgrade!

SGI introduces its first 64-bit operating system

First systems deployed in Stephen Hawking’s COSMOS system

First generation NUMA System: Origin® 2000

1997

Dockside engineering analysis on Origin® 2000 and Indigo2™ helps Team New Zealand win America’s Cup

Altix®, first scalable 64-bit Linux® Server

1998

DOE deploys 6144p Origin 2000 to monitor and simulate nuclear stockpile

Challenge® XL media server fuels Steven Spielberg’s Shoah project to document Holocaust survivor stories

Introduced modular NUMAflex™ architecture with Origin® 3000

Jim Clark founded SGI on the vision of Computer Visualization

IRIS® Workstations become first integrated 3D graphics systems

Images courtesy of Team New Zealand and the University of Cambridge

3

Over Time, Problems Get More Complex, Data Sets Exploding

Bumper, hood, engine, wheels; entire car; e-crash dummy; organ damage

First Row Images: EAI, Lana Rushing, Engineering Animation, Inc., Volvo Car Corporation, Images courtesy of the SCI. Second Row Images: The MacNeal-Schwendler Corp., Manchester Visualization Center and University Department of Surgery, Paradigm Geophysical, the Laboratory for Atmospheres, NASA Goddard Space Flight Center.

• Improve design & manufacturing
• Improve patient safety
• Improve oil exploration
• Improve hurricane prediction

This Trend Continues Across SGI's Markets

4

SGI Scalable ccNUMA Architecture: Basic Node Structure and Interconnect

[Diagram: two nodes joined by the NUMAlink™ interconnect. Each node has two CPUs (each with its own cache) and physical memory attached to an interface chip.]

5

SGI Scalable ccNUMA Architecture: Basic Node Structure and Interconnect

[Diagram: the same two nodes (two CPUs with caches and an interface chip each) joined by the NUMAlink™ interconnect, with the per-node physical memories shown as a single Global Shared Memory.]
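As a rough illustration of what a global shared memory built from per-node memories implies, the sketch below splits a global physical address into a home node and an offset within that node. The per-node memory size and the function names are hypothetical, chosen only for illustration; they are not SGI's actual address layout.

```python
# Illustrative model only: NODE_MEMORY_BYTES, home_node() and is_local() are
# hypothetical names and values, not taken from the SGI hardware.
NODE_MEMORY_BYTES = 64 * 2**30  # assume 64 GiB of physical memory per node


def home_node(global_address: int) -> tuple[int, int]:
    """Return (node index, offset within that node's physical memory)."""
    return divmod(global_address, NODE_MEMORY_BYTES)


def is_local(cpu_node: int, global_address: int) -> bool:
    """True when the address lives in the memory attached to the CPU's own node."""
    node, _ = home_node(global_address)
    return node == cpu_node


address = 3 * NODE_MEMORY_BYTES + 4096
print(home_node(address))    # (3, 4096)
print(is_local(3, address))  # True: served by the node's own memory
print(is_local(0, address))  # False: the access crosses the interconnect
```

Every CPU can use the same address for the same datum; what differs between nodes is whether the access is satisfied locally or routed over the interconnect.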

6

Logical Layout - 8TB

Altix 128 Processor 8TB - 1.6GB/s Uniform Memory Bandwidth

[Diagram: 32 level 1 routers (R1 to R32), each serving four nodes labeled C (128 in total). The level 1 routers connect upward to level 2 routers (the RB and RC series, with RC2A to RC9A and RC2B to RC9B), arranged in two planes, Plane A and Plane B.]

7

Interconnect Topology

Bi-Section Bandwidth Profiles (GB/s per CPU)

[Chart: raw bi-section bandwidth in GB/s per CPU (vertical axis, 0.2 to 3.2) against number of processors (horizontal axis, 4 to 2048) for two configurations:
• Dual plane, NL3 router, 8-port router bricks
• Dual plane, NL4 router, 8-port router bricks]

8

Examples of Single-Paradigm Architectures

Scalar

Intel Itanium

SGI MIPS

IBM Power

Sun SPARC

HP PA

Vector

Cray X1

NEC SX

App-Specific

Graphics - GPU

Signals - DSP

Prog’ble - FPGA

Other ASICs

9

Paradigms to Applications

[Chart: compute intensity (low to high, vertical axis) against data locality (low to high, horizontal axis), with regions labeled Vector, Scalar, and Application-specific (two regions).]

10

Peer I/O: Increased I/O Flexibility & Performance

[Diagram: two configurations compared. XIO+: 1:1 ratio of C-brick to I/O. Peer I/O: C-bricks (C), routers (R), and I/O attached to NUMAlink (NL).]

11

SGI Scalable ccNUMA Architecture: Multi-Paradigm Computing Architecture

[Diagram: four nodes, each with two CPUs (with caches), an interface chip, and physical memory, attached to the NUMAlink™ interconnect fabric. Three TIO chips attach to the same fabric:
• General purpose I/O interfaces
• Scalable GPUs
• RASC™ (FPGA), with one or more FPGAs]

12

Data-Centric Architecture

Very Large GAM. Globally Addressable. Low Latency. High Bandwidth. Many Ports

[Diagram: ports attached to the globally addressable memory: CPUs, IO, FPGA-based APUs, GPU-based APUs (GPU-0 to GPU-3), and graphics GPUs (GPU-0 to GPU-3) feeding a composition stage.]

13

Big Data

Each box represents 1GB

1TB, 32*32=1024 elements

16

Big Datasets: 3D Interactive Visualization

• 1993: 100 MB dataset, 10% viewed / year, ~1 MB / month
• 2004: 400 GB dataset, 100% viewed / month, 400 GB / month

40,000x Productivity
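The slide's figures can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 GB = 1000 MB); the variable names are illustrative only. Under these assumptions the 40,000x headline matches the ratio of 400 GB to the 10 MB viewed in a year in 1993, while dividing by the rounded ~1 MB / month figure gives 400,000x.

```python
MB_PER_GB = 1000  # decimal units, an assumption

# 1993: 10% of a 100 MB dataset viewed per year
viewed_1993_mb_per_year = 100 * 10 / 100
viewed_1993_mb_per_month = viewed_1993_mb_per_year / 12

# 2004: 100% of a 400 GB dataset viewed per month
viewed_2004_mb_per_month = 400 * MB_PER_GB

ratio_vs_yearly_volume = viewed_2004_mb_per_month / viewed_1993_mb_per_year
ratio_vs_rounded_monthly = viewed_2004_mb_per_month / 1.0  # slide's "~1 MB / month"

print(f"1993 volume: {viewed_1993_mb_per_month:.2f} MB / month")
print(f"400 GB vs 10 MB: {ratio_vs_yearly_volume:,.0f}x")
print(f"400 GB vs ~1 MB: {ratio_vs_rounded_monthly:,.0f}x")
```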

17

Commodity GPU systems 5X the price of a Scale-up System

March 17, 2005: nVIDIA visualizes large data set
• 473 million triangles
• 128 GPUs on Dell systems
• ~$1 million system

January 21, 2005: SGI visualizes large data set
• 350 million triangles
• 12P, 56GB memory
• Utilizes a ray tracer
• ~$180,000 system

Compliments of Boeing

Compliments of nVIDIA
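The price comparison can be worked through from the numbers quoted on the slide. The sketch below is illustrative only: the dictionary layout and the triangles-per-dollar metric are choices made here, not part of the original comparison, and the two data sets and rendering methods differ.

```python
# Figures as quoted on the slide; prices are the approximate system prices given there.
systems = {
    "nVIDIA on Dell (128 GPUs)": {"triangles": 473_000_000, "price_usd": 1_000_000},
    "SGI (12P, 56GB)": {"triangles": 350_000_000, "price_usd": 180_000},
}

# Triangles visualized per dollar of system price (metric chosen for illustration).
triangles_per_dollar = {
    name: spec["triangles"] / spec["price_usd"] for name, spec in systems.items()
}

price_ratio = (
    systems["nVIDIA on Dell (128 GPUs)"]["price_usd"]
    / systems["SGI (12P, 56GB)"]["price_usd"]
)

print(f"Price ratio: {price_ratio:.2f}x")
for name, value in triangles_per_dollar.items():
    print(f"{name}: {value:,.0f} triangles per dollar")
```

The price ratio comes out a little above the 5X in the slide title.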

18

Dynamic Load Balancing

[Images: Load Balancing ON compared with Load Balancing OFF. Annotation: Given Most Work.]

19

Dimensions of Scalability

• Processors
• Processor bandwidth
• Memory bandwidth
• Memory capacity
• Interconnect bandwidth
• IO bandwidth
• Graphics processing
• Reconfigurable processing
• Other acceleration elements

20

Origin3000 Building Blocks (Bricks)

• C-brick: CPU Module
• D-brick: Disk Storage
• R-brick: Router Interconnect
• X-brick: XIO Expansion
• P-brick: PCI Expansion
• I-brick: Base I/O Module

21

SGI Altix™ 3700 Bx2 Platform Introduction: Building Blocks

• Itanium® 2 CR-brick: CPU and memory
• M-brick: Memory
• IX-brick: Base I/O module
• PA-brick, PX-brick: PCI-X expansion
• R-brick: Router interconnect
• D-brick2: Disk expansion

SGI® Advanced Linux Environment with SGI ProPack

22

High-End Servers, Moving Forward: Altix® 4700 Platform, Blade Packaging

• Innovative Blade-to-NUMAlink4 Concept: Provides Unprecedented Versatility, Density

• Blade Architecture Leads Next Wave of HPC Blade-Based Platforms: With Better Upgradeability, Expansion & Repair

• Investment Protection: Processor-Only Upgrade to Future Dual Core Processors

• Enables Flexible Multi-Paradigm Computing: Enhanced Integrated RASC, Graphics

23

Next Generation RASC™ Technology: Blade Based Package

24

Standardized Blades, NUMAlink Backbone

• Blade
• Individual Rack Unit (IRU): contains 10 blades
• Small rack = 4 IRUs

[Diagram: rack front view showing the L1 displays, filler panels, and blade slots 1 to 10 of each IRU.]

25

Altix® 4700 Compute Blades

• Support for Madison9M processors (Montecito/Montvale as available)
• Two compute blade options to provide different system capabilities:
– Best $/FLOP, best density (Density Compute Blade)
OR
– Best performance, memory BW (Bandwidth Compute Blade)

[Diagram: IRU front view with L1 display, filler panels, and blade slots 1 to 10, populated with I/O blades, a compute blade, a graphics blade, a RASC blade, and a memory blade.]

26

Altix® 4700 Compute Blades

Bandwidth Compute Blade (highest memory BW, performance)

• 667MHz FSB Madison9M -> 10.7GB/s local memory bandwidth
• 32 M9M sockets / S-Rack
• Processors supported: 1.66GHz/9M, 1.66GHz/6M Madison9M with 667MHz FSB
• Memory sizes: 2G – 48G/core

[Diagram: single blade, top and front views. One M9M socket connects at 10.7GB/s to a Shub2.0, which has 12 DDR2 DIMMs and an NL4 link at 6.4GB/s.]

Density Compute Blade (best $/FLOP, best density)

• 533MHz FSB Madison9M -> 8.524GB/s local memory bandwidth
• 64 M9M sockets / S-Rack
• Processors supported: 1.6GHz/9M, 1.6GHz/6M Madison9M with 533MHz FSB
• Memory sizes: 1G – 24GB/core

[Diagram: single blade, top and front views. Two M9M sockets connect at 8.5GB/s to a Shub2.0, which has 12 DDR2 DIMMs and an NL4 link at 6.4GB/s.]
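The local memory bandwidth figures follow from the front-side bus rate. A minimal sketch, assuming the 128-bit (16-byte) wide Itanium 2 system bus and decimal gigabytes; the function name is illustrative:

```python
FSB_WIDTH_BYTES = 16  # assumption: 128-bit wide Itanium 2 front-side bus


def fsb_bandwidth_gb_per_s(transfer_rate_mhz: float) -> float:
    """Peak bus bandwidth in GB/s for a given transfer rate in MHz."""
    return transfer_rate_mhz * 1e6 * FSB_WIDTH_BYTES / 1e9


for blade, rate_mhz in [("Bandwidth Compute Blade", 667), ("Density Compute Blade", 533)]:
    print(f"{blade}: {rate_mhz} MHz FSB -> {fsb_bandwidth_gb_per_s(rate_mhz):.1f} GB/s")
```

With the nominal 667 and 533 MHz rates this gives roughly 10.7 GB/s and 8.5 GB/s, in line with the figures on the slide.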

27

Memory Blade (Q2 CY06)

• Scale memory independently with 12 DDR2 DIMM slots per blade
• Up to 128TB

[Diagram: single blade, top and front views. A Shub2.0 with 12 DDR2 DIMMs and an NL4 link at 6.4GB/s; no processor sockets.]

28

Altix® 4700 RASC Blade

• RASC Blade
– Abacus Computation Blade
– Enhanced performance, tightly integrated

[Diagram: IRU front view with L1 display, filler panels, and blade slots 1 to 10, populated with I/O blades, a CPU blade, a graphics blade, a RASC blade, and a memory blade.]

29

RASC Blades – Cont.

Abacus Computation Blade:

• New levels of performance:
– High performance V4LX160 FPGA with 160K logic cells
– Increased memory sizes, 12 DIMM per blade
• Optional brick packaging for legacy platforms

[Diagram: single blade, top and front views. Labeled components: two TIO chips with NL4 links at 6.4GB/s, two SSP links, two FPGAs each surrounded by SSRAM banks, and a loader with PCI and Selmap connections to the FPGAs.]

30

How does RASC™ Technology Differ from Traditional CPUs?

Directly map computationally-intensive algorithms to hardware with RASC

• Identify RASC-appropriate algorithm: compare application run time percentages
• Export algorithm to RASC

[Chart: Application Run-Time Comparison. Percent of runtime (0% to 100%) for App 1 to App 5, broken down by legend entries Algorithm (two entries), Memory Calls, and Branch inst.]

[Diagram: job run time for two methods. Traditional method (CPU only): algorithm execution time. RASC method (key algorithm running on FPGA): shorter algorithm execution time, with the difference marked as time savings.]
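The reason for comparing run-time percentages first is that the benefit of moving an algorithm to the FPGA is bounded by the share of run time that algorithm accounts for. A minimal sketch of this reasoning using Amdahl's law; the fractions and the kernel speedup below are hypothetical values, not measurements from the slides.

```python
def overall_speedup(accelerated_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-application speedup when only a fraction of the
    run time is accelerated by the given factor."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / kernel_speedup)


# Hypothetical applications: share of run time spent in the candidate algorithm.
candidate_fractions = {"App A": 0.30, "App B": 0.60, "App C": 0.95}
KERNEL_SPEEDUP = 50.0  # hypothetical FPGA speedup of the algorithm itself

for app, fraction in candidate_fractions.items():
    print(f"{app}: {fraction:.0%} in algorithm -> "
          f"{overall_speedup(fraction, KERNEL_SPEEDUP):.2f}x overall")
```

An application that spends most of its time in the exported algorithm gains far more than one where memory calls or branching dominate, however fast the FPGA implementation is.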

31

Application Segments

Application segment: sample applications

• Image and video processing: transcoding (digital watermarks, format conversion), compression (JPEG, MPEG), color correction, ray-tracing, edge detection (Sobel)
• Digital Signal Processing: FFT, IFFT, filtering (FIR and IIR)
• Database Acceleration: query, sorting, pattern recognition, data compression
• Network and Communication: interleaver/de-interleaver, coding/decoding (Reed Solomon, Viterbi), convolution encoders, encryption, error correction, packet processing (IPsec)
• HPC Algorithm Acceleration – Gov/Defense: MATLAB, STAR-P, random number generators, Sigint/Elint, image recognition (radar/vision/IR), DEM
• HPC Algorithm Acceleration – Bioinformatics: Blast, Smith-Waterman

32

Ease of Use

• Leverage 3rd Party Std Language Tools
– Celoxica, Mitrionics, Starbridge Systems, Nallatech

• Developed an FPGA aware version of GDB
– Capable of debugging the FPGA and system software
– Capable of multiple CPUs and multiple FPGAs

• Developed RASC Abstraction Layer (RASCAL)

• Provide for HDL modules
– Integrated environment with debugger
– Highest performance

33

3rd Party Tools

• Celoxica – http://www.celoxica.com
– Handel-C

• Mitrionics – http://www.mitrionics.com
– Mitrion C

• Starbridge Systems – http://www.starbridgesystems.com/
– Viva graphical development environment

• Nallatech – http://www.nallatech.com/
– SGI strategic partner

34

Ease of Use v. Efficiency

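[Chart: efficiency (low to high, vertical axis) against ease of use (easy to difficult, horizontal axis). Verilog and VHDL are labeled; four further points are marked with an unlabeled x.]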

35

Bitstream Generation… HLL Tools

[Flow diagram, split between an IA-32 Linux® machine and an Altix® server.

On the IA-32 Linux® machine:
• HLL design entry (Handel-C, Impulse C, Mitrion C, Viva)
• RTL generation and integration with core services
• Design synthesis (Synplify Pro, Amplify)
• Design implementation (ISE)
• Metadata processing (Python)
• Design verification: behavioral simulation (VCS, Modelsim) and static timing analysis (ISE Timing Analyzer)

On the Altix® server:
• Device programming (RASC™ Abstraction Layer, Device Manager, Device Driver)
• Real-time verification (gdb)

File types passed between stages: .v and .vhd, .edf, .ncd and .pcf, .bin, .cfg, .c]

36

Ease of Use

• Leverage 3rd Party Std Language Tools
– Celoxica, Impulse Acceleration, Mitrion, Starbridge Viva
– In discussions with other HLL tool vendors

37

FPGA Aware Debugger

• Based on Open Source GNU Debugger (GDB)

• Uses extensions to current command set

• Can debug host application and FPGA

• Provides notification when FPGA starts or stops

• Supplies information on FPGA characteristics

• Can “single-step” or “run N steps” of the algorithm

• Can HLL line step per C-line source

• Dumps data regarding the set of “registers” that are visible when the FPGA is active

38

GDB Debugging Environment

Algorithm.c:

tmp = a & b;
d = tmp | c;

GDB session:

(gdb) fpgastep
(gdb) p/x $a
$6 = 0x444433
(gdb) p/x $b
$7 = 0x111122
(gdb) p/x $tmp
$8 = 0x555533
(gdb) fpgastep
(gdb) p/x $tmp
$9 = 0x555533
(gdb) p/x $c
$10 = 0x331222
(gdb) p/x $d
$11 = 0x111022

[Diagram: COP FPGA dataflow. Inputs a and b feed an & operator to produce tmp; tmp and c feed a | operator to produce d. Caption: debugger running in real time.]
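The register values printed in the session can be replayed on the host to see which operations they correspond to. This is a plain Python check, not part of the RASC tooling. It shows that the printed values for tmp and d match tmp = a | b and d = tmp & c, that is, the two operators in the opposite order from the Algorithm.c listing; evaluating the listing as written gives different values.

```python
# Input values as printed by the debugger session on the slide.
a, b, c = 0x444433, 0x111122, 0x331222

# As written in the Algorithm.c listing: AND first, then OR.
tmp_listing = a & b
d_listing = tmp_listing | c
print(hex(tmp_listing), hex(d_listing))  # 0x22 0x331222

# Operators swapped: OR first, then AND. These match the session output.
tmp_session = a | b
d_session = tmp_session & c
print(hex(tmp_session), hex(d_session))  # 0x555533 0x111022
```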

40

RASC™ Software Stack

• User Space: Application, SpeedShop™, Debugger (GDB), Abstraction Layer Library, Device Manager, Download Utilities
• Linux® Kernel: Algorithm Device Driver, Download Driver
• Hardware: COP (TIO, Algorithm FPGA, Memory, Download FPGA)

41

Abstraction Layer: Algorithm API

The Abstraction Layer’s algorithm API mirrors the COP API with a few additions that enable:

Wide Scaling

- and -

Deep Scaling

Working with industry/customers (www.openfpga.org) on API stds…

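[Diagrams: two arrangements of application, algorithm, and COPs. In one, the application passes input data to an algorithm spread over three COPs and collects the output data. In the other, input data flows from the application through the algorithm on COPs to the output data.]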

43

FPGA Architecture Overview

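[Diagram: the Core Services Block connects to the SSP over two 3.2 GB/s paths and to the Algorithm Block. The Algorithm Block reaches RAM Bank 0 through RAM Bank N via Readport 0 to Readport N and Writeport 0 to Writeport N.]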

44

Algorithm Block as Submodule

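[Block diagram: the Algorithm Block with its connections to the algorithm controller, the SRAM controller (one bank shown), and the debug port.]

• Algorithm controller signals: alg_clk, alg_rst, do_step, step_flag, alg_done
• SRAM write signals: sram_wr_req, sram_wr_gnt, sram_wr_addr[17:0], sram_wr_data[63:0], sram_wr_be[7:0], sram_wr_dvld
• SRAM read signals: sram_rd_req, sram_rd_gnt, sram_rd_addr[17:0], sram_rd_cmd_vld, sram_rd_data, sram_rd_dvld
• Debug port signals: debug0 to debug63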
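The signal widths give a feel for how much memory one bank exposes to the algorithm block. A back-of-the-envelope sketch, assuming each address on sram_wr_addr[17:0] selects one 64-bit word (the slide does not state the addressing granularity); the constant names are illustrative:

```python
ADDR_BITS = 18        # sram_wr_addr[17:0] / sram_rd_addr[17:0]
DATA_BITS = 64        # sram_wr_data[63:0]
BYTE_ENABLE_BITS = 8  # sram_wr_be[7:0]

# One byte-enable bit per byte of the data word.
bytes_per_word = DATA_BITS // 8
assert bytes_per_word == BYTE_ENABLE_BITS

# Assumption: word addressing, so the bank holds 2**ADDR_BITS words.
addressable_words = 2 ** ADDR_BITS
bank_bytes = addressable_words * bytes_per_word

print(f"{addressable_words} words x {bytes_per_word} bytes = "
      f"{bank_bytes // 2**20} MiB per bank")
```

Under that assumption, each bank presents 2 MiB to the algorithm block.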

45

Verilog / VHDL Module Support

• Templates for Verilog and VHDL
– Fast start to algorithm coding

• Provide a system simulation stub
– Allows either simulation debug or system debug

• Provide source code for core services
– Allows user to modify to meet special needs

• Extractor tool supports GDB meta-data
– Application and FPGA debugging

46

RASC™ Technology: Demonstrated Application Speed-up

Bit Manipulation (Cryptography) [1]
• 79x 1.5GHz Intel® Itanium® 2 Processor (single RASC unit)
• 119x 1.5GHz Intel® Itanium® 2 Processor (dual RASC unit)

Graphics Edge Detection [1]
• 7.4x 1.5GHz Intel® Itanium® 2 Processor (single RASC unit)

Customer Application
• 20,000x speedup on scalar microprocessor
• EXERGY – MAPLD 2005 paper 190

[1] Based on internal testing
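The two cryptography figures also indicate how well the workload scaled from one RASC unit to two. A small calculation from the quoted speedups; the variable names are illustrative:

```python
single_unit_speedup = 79.0  # vs. one 1.5GHz Itanium 2, single RASC unit
dual_unit_speedup = 119.0   # vs. one 1.5GHz Itanium 2, dual RASC unit

# Gain from adding the second unit, and efficiency relative to ideal 2x scaling.
gain_from_second_unit = dual_unit_speedup / single_unit_speedup
scaling_efficiency = gain_from_second_unit / 2

print(f"Second unit adds {gain_from_second_unit:.2f}x "
      f"({scaling_efficiency:.0%} of ideal scaling)")
```

Adding the second unit gives about a 1.5x gain rather than 2x, roughly 75% of ideal scaling.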

47

RASC Platform Capabilities

• Direct connection to NUMAlink4: 6.4GB/s/connection

• Fast system level reprogramming of FPGA: FPGA load at memory speeds

• Atomic memory operations: same set as system CPUs

• Hardware barriers: dynamic load balancing

• Configurations to 8191 NUMA/FPGA nodes: scalability

Thank You

49

Strategy for Big Data

[Diagram: heterogeneous clients (IRIX, Linux, Windows, Solaris, IBM AIX, HP-UX, Mac OS X) share PBs of disk (datasets) through an open source scalable filesystem. Datasets are held in TBs of memory, with MPUs, APUs (including APU-GPU), and IO ports attached to that memory.]

50

SGI® RASC™ Technology Summary

• Tightly coupled, high bandwidth/low latency integration into NUMA fabric
– Significant bandwidth advantage (6.4GB/s)
– Coherent shared memory access
– Atomic memory operations
– Scalability (wide scaling and deep scaling)

• Orders-of-magnitude performance improvement and application speedup
– Beneficial when running data intensive applications critical to oil and gas exploration, defense and intelligence, bioinformatics, medical imaging, broadcast media, and other data-dependent industries

• Ease of programming: complete software stack
– RASClib (API and core services library) provides abstraction layer to support reconfigurable elements in a multi-processing, multi-user environment
– Fully integrated third-party HLL development tools
– FPGA-aware enhancements to GNU debugger (open-source)

• Add-in module that seamlessly operates with SGI® Altix® servers and Silicon Graphics Prism™ visualization systems

51

Multi-Paradigm Computing: Other Non-traditional Processing Initiatives

• GPU-based processing
– High potential performance (200-300GF peak today) and performance/price on single precision floating point applications…clear roadmap to future semiconductor process technologies
– SGI working with SI on scaling to multiple GPUs and on development environment/programming paradigms…initial focus on signal processing apps

• Specialized processors… ClearSpeed™ processors, custom processors (MD-GRAPE, classified chip)
– High potential performance/watt on certain apps

This slide contains forward-looking statements. The results and forecasts as stated may vary. Other risks and uncertainties relating to this slide may be found in the "Safe-Harbor" statement at the beginning of this presentation.
