The IBM Blue Gene/L System Architecture

Date post: 06-Jan-2016
Page 1: The IBM Blue Gene/L System Architecture

The IBM Blue Gene/L System Architecture

Presented by Sabri KANTAR

Page 2: The IBM Blue Gene/L System Architecture

What is Blue Gene/L?

Blue Gene is an IBM Research project dedicated to exploring the frontiers in supercomputing.

In November 2004, the IBM Blue Gene computer became the fastest supercomputer in the world.

This project is designed to scale to 65,536 dual-processor nodes, with a peak performance of 360 TeraFLOPS.

Example usage: hydrodynamics, quantum chemistry, molecular dynamics, climate modeling, financial modeling.

Page 3: The IBM Blue Gene/L System Architecture

A High-Level View of the BG/L Architecture

Within node: Low latency, high bandwidth memory system. Strong floating point performance: 4 floating point operations/cycle.

Across nodes: Low latency, high bandwidth networks.

Many nodes: Low power/node. Low cost/node. RAS (reliability, availability and serviceability).

Familiar SW API: C, C++, Fortran, MPI, POSIX subset, …

Page 4: The IBM Blue Gene/L System Architecture

Main Design Principles for Blue Gene/L

Some science & engineering applications scale up to and beyond 10,000 parallel processes.

Improve computing capability while holding total system cost constant.

Reduce cost/FLOP. Reduce complexity and size.

~25KW/rack is max for air-cooling in standard room. Need to improve performance/power ratio. 700MHz PowerPC440 for ASIC has excellent FLOP/Watt.

Maximize integration. On chip: ASIC with everything except main memory. Off chip: maximize the number of nodes in a rack.

Large systems require excellent reliability, availability, serviceability (RAS)

Page 5: The IBM Blue Gene/L System Architecture

Main Design Principles (cont’d)

Make cost/performance trade-offs considering the end-use: Applications <> Architecture <> Packaging.

Examples:

1 or 2 differential signals per torus link, i.e. 1.4 or 2.8 Gb/s.

Maximum of 3 or 4 neighbors on collective network, i.e. depth of network and thus global latency.

Maximize the overall system efficiency: a small team designed all of Blue Gene/L. Example: chose ASIC die and chip pin-out to ease circuit card routing.

Page 6: The IBM Blue Gene/L System Architecture

Reducing Cost and Complexity

Cables are bigger, costlier and less reliable than traces, so we want to minimize the number of cables.

So a 3-dimensional torus is chosen as the main BG/L network, with each node connected to 6 neighbors.

Maximize the number of nodes connected via circuit card(s) only. A BG/L midplane has 8*8*8 = 512 nodes.

(Number of cable connections) / (all connections) = (6 faces * 8 * 8 nodes) / (6 neighbors * 8 * 8 * 8 nodes) = 1/8
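The 1/8 ratio on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `cabled_fraction` and its parameters are made up for this example, and it assumes a cubic midplane in which every node on a face has exactly one link leaving through that face.

```python
from fractions import Fraction


def cabled_fraction(side: int, neighbors: int = 6) -> Fraction:
    """Fraction of torus link endpoints that leave a side x side x side midplane.

    Hypothetical helper: counts one off-midplane link per node per face,
    and `neighbors` link endpoints for every node in the midplane.
    """
    faces = 6
    off_midplane = faces * side * side   # link endpoints that need a cable
    total = neighbors * side ** 3        # all link endpoints in the midplane
    return Fraction(off_midplane, total)


if __name__ == "__main__":
    # For the 8 x 8 x 8 midplane described on the slide.
    print(cabled_fraction(8))  # 1/8
```

Under the same assumptions, a smaller midplane would need proportionally more cables (for example 1/4 for a side of 4), which is consistent with the slide's goal of putting as many nodes as possible on circuit cards.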

Page 7: The IBM Blue Gene/L System Architecture

Blue Gene/L Architecture

Up to 32*32*64=65536 nodes (3D torus).

Max 360 teraFLOPS computation power.

Each processor can perform 4 floating point operations per cycle (in the form of two 64-bit floating point multiply-adds per cycle).

5 networks connect nodes to each other and to the world.
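As a rough cross-check of the peak figure, the numbers given in the slides can be multiplied out. This is a back-of-the-envelope sketch, not an official calculation: it assumes the 700 MHz clock mentioned elsewhere in the deck, two processors per node, and that both processors do floating point work every cycle. The constant and function names are chosen for this example.

```python
NODES = 32 * 32 * 64          # full 3D torus from the slide
PROCESSORS_PER_NODE = 2       # dual-processor nodes
FLOPS_PER_CYCLE = 4           # two multiply-adds per cycle per processor
CLOCK_HZ = 700e6              # 700 MHz, as stated elsewhere in the slides


def peak_teraflops(nodes: int = NODES,
                   procs: int = PROCESSORS_PER_NODE,
                   flops_per_cycle: int = FLOPS_PER_CYCLE,
                   clock_hz: float = CLOCK_HZ) -> float:
    """Theoretical peak in TeraFLOPS, assuming every unit is busy every cycle."""
    return nodes * procs * flops_per_cycle * clock_hz / 1e12


if __name__ == "__main__":
    print(f"{NODES} nodes, about {peak_teraflops():.0f} TFLOPS peak")
```

The result is about 367 TFLOPS, which is in line with the roughly 360 TeraFLOPS quoted in the slides.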

Page 8: The IBM Blue Gene/L System Architecture

Node Architecture

IBM PowerPC embedded CMOS processors, embedded DRAM, and system-on-a-chip techniques are used.

11.1-mm square die size, allowing for a very high density of processing.

The ASIC uses IBM CMOS CU-11 0.13 micron technology.

700 MHz processor speed, close to memory speed.

Two processors per node. The second processor is intended primarily for handling message passing operations.

Page 9: The IBM Blue Gene/L System Architecture

The BG/L node ASIC includes:

Two processing cores, each a standard PowerPC 440 core paired with a PowerPC 440 FP2 core (an enhanced “Double” 64-bit Floating-Point Unit).

The two cores are not L1 cache coherent.

Each core has a small 2 KB L2 cache.

4 MB L3 cache made from embedded DRAM.

An integrated external DDR memory controller.

A gigabit Ethernet adapter.

A JTAG interface.

Page 10: The IBM Blue Gene/L System Architecture

BlueGene/L node diagram.

Page 11: The IBM Blue Gene/L System Architecture

Link ASIC

In addition to the compute ASIC, there is a “link” ASIC.

When crossing a midplane boundary, BG/L’s torus, global combining tree, and global interrupt signals pass through the BG/L link ASIC.

It redrives signals over the cables between BG/L midplanes.

The link ASIC can redirect signals between its different ports. This enables BG/L to be partitioned into multiple, logically separate systems in which there is no traffic interference between systems.

Page 12: The IBM Blue Gene/L System Architecture

The PowerPC 440 FP2 core

It consists of a primary side and a secondary side.

Each side has its own 64-bit by 32 element register file, a double-precision computational datapath, and a double-precision storage access datapath.

The primary side is capable of executing standard PowerPC floating-point instructions.

An enhanced set of instructions includes those that are executed solely on the secondary side, and those that are simultaneously executed on both sides.

The enhanced set includes SIMD operations.

Page 13: The IBM Blue Gene/L System Architecture

The FP2 core (cont’d)

This enhanced set goes beyond the capabilities of traditional SIMD architectures.

A single instruction can initiate a different but related operation on different data.

This is called Single Instruction Multiple Operation Multiple Data (SIMOMD).

Either of the sides can access data from the other side’s register file.

This saves a lot of swapping when working on complex arithmetic operations.
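To make the cross-register idea concrete, here is a small Python model of a complex multiply done in two paired operations. It is only a sketch: `Pair`, `parallel_multiply_by_primary` and `cross_multiply_add` are hypothetical names invented for this illustration and are not the real FP2 instruction mnemonics. A complex number is assumed to be stored with its real part on the primary side and its imaginary part on the secondary side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One paired register: a primary-side and a secondary-side value."""
    primary: float
    secondary: float


def parallel_multiply_by_primary(a: Pair, c: Pair) -> Pair:
    """Both sides multiply by a.primary; the secondary side reads it across."""
    return Pair(a.primary * c.primary, a.primary * c.secondary)


def cross_multiply_add(a: Pair, c: Pair, t: Pair) -> Pair:
    """One instruction, two different operations (the SIMOMD idea):
    primary computes   t.primary   - a.secondary * c.secondary
    secondary computes t.secondary + a.secondary * c.primary
    Each side reads an operand from the other side's register file.
    """
    return Pair(t.primary - a.secondary * c.secondary,
                t.secondary + a.secondary * c.primary)


def complex_multiply(a: Pair, c: Pair) -> Pair:
    """(a.re + i a.im) * (c.re + i c.im) in two paired operations."""
    t = parallel_multiply_by_primary(a, c)
    return cross_multiply_add(a, c, t)


if __name__ == "__main__":
    print(complex_multiply(Pair(1, 2), Pair(3, 4)))
```

In this model no explicit swap of real and imaginary parts is needed, which is the saving the slide refers to.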

Page 14: The IBM Blue Gene/L System Architecture

Memory System

It is designed for high bandwidth, low latency memory and cache accesses.

An L2 hit returns in 6 to 10 processor cycles.

An L3 hit returns in about 25 cycles.

An L3 miss returns in about 75 cycles.

The system has a 16 byte interface to nine 256Mb SDRAM-DDR devices.

The memory operates at one half or one third of the processor speed.
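The cycle counts above can be converted to wall-clock time. The sketch below assumes the 700 MHz processor clock given elsewhere in the slides; the dictionary keys and the helper name `cycles_to_ns` are chosen for this example only.

```python
CLOCK_HZ = 700e6  # processor clock assumed from the node architecture slide

# Latencies in processor cycles, as quoted on the memory system slide.
LATENCY_CYCLES = {
    "L2 hit (best case)": 6,
    "L2 hit (worst case)": 10,
    "L3 hit": 25,
    "L3 miss": 75,
}


def cycles_to_ns(cycles: float, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / clock_hz * 1e9


if __name__ == "__main__":
    for name, cycles in LATENCY_CYCLES.items():
        print(f"{name}: {cycles} cycles = {cycles_to_ns(cycles):.1f} ns")
```

At 700 MHz one cycle is about 1.43 ns, so an L3 miss at roughly 75 cycles corresponds to a little over 100 ns.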

Page 15: The IBM Blue Gene/L System Architecture

3D Torus Network

It is used for general-purpose, point-to-point message passing and multicast operations to a selected “class” of nodes.

The topology is a three-dimensional torus constructed with point-to-point, serial links between routers embedded within the BlueGene/L ASICs.

Each ASIC has six nearest-neighbor connections.

Virtual cut-through routing with multipacket buffering on collision.

Minimal, adaptive, deadlock free.
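The torus topology can be illustrated with a short model of node coordinates. This sketch only covers the geometry (which six nodes are neighbors, and how many hops a minimal route takes); it does not model the router, virtual cut-through, or adaptive path selection. The 32 x 32 x 64 dimensions come from the slides; the function names are invented for the example.

```python
DIMS = (32, 32, 64)  # full-machine torus dimensions from the slides


def neighbors(node, dims=DIMS):
    """Return the six nearest neighbors of `node`, with wraparound links."""
    result = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size  # wrap at the torus edge
            result.append(tuple(coord))
    return result


def min_hops(src, dst, dims=DIMS):
    """Hop count of a minimal route: per axis, take the shorter way round."""
    total = 0
    for s, d, size in zip(src, dst, dims):
        delta = abs(s - d)
        total += min(delta, size - delta)
    return total


if __name__ == "__main__":
    print(neighbors((0, 0, 0)))
    print(min_hops((0, 0, 0), (16, 16, 32)))  # farthest node from the origin
```

Because of the wraparound links, the node at (31, 31, 63) is only three hops from (0, 0, 0), and the largest minimal distance in a 32 x 32 x 64 torus is 16 + 16 + 32 = 64 hops.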

Page 16: The IBM Blue Gene/L System Architecture

Torus Network (cont’d)

Class routing capability (deadlock-free hardware multicast).

Packets can be deposited along the route to a specified destination.

This allows for efficient one-to-many communication in some instances.

Active messages allow for fast transposes as required in FFTs.

Independent on-chip network interfaces enable concurrent access.

Page 17: The IBM Blue Gene/L System Architecture

Other Networks

A global combining/broadcast tree for collective operations

A Gigabit Ethernet network for connection to other systems, such as hosts and file systems.

A global barrier and interrupt network

And another Gigabit Ethernet to JTAG network for machine control

Page 18: The IBM Blue Gene/L System Architecture

Collective Network

It has a tree structure.

One-to-all broadcast functionality.

Reduction operations functionality.

2.8 Gb/s of bandwidth per link; latency of tree traversal 2.5 µs.

~23TB/s total binary tree bandwidth (64k machine).

Interconnects all compute and I/O nodes (1024).

Page 19: The IBM Blue Gene/L System Architecture

Gb Ethernet Disk/Host I/O Network

I/O nodes are leaves on the collective network.

Compute and I/O nodes use the same ASIC, but:

An I/O node has Ethernet, not torus. This provides I/O separation from the application.

A compute node has torus, not Ethernet: no need for 65536 cables.

Configurable ratio of I/O to compute = 1:8, 16, 32, 64, 128.

The application runs on compute nodes, not I/O nodes.
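The configurable ratios translate directly into I/O node counts. The sketch below works this out for a full 65,536-node machine; the helper name `io_nodes` is made up for this example, and it assumes the ratio divides the compute node count evenly.

```python
COMPUTE_NODES = 65536
RATIOS = (8, 16, 32, 64, 128)  # compute nodes per I/O node, from the slide


def io_nodes(compute_nodes: int, ratio: int) -> int:
    """Number of I/O nodes needed for a 1:ratio I/O-to-compute configuration."""
    if ratio <= 0 or compute_nodes % ratio != 0:
        raise ValueError("ratio must be positive and divide the node count")
    return compute_nodes // ratio


if __name__ == "__main__":
    for ratio in RATIOS:
        print(f"1:{ratio} -> {io_nodes(COMPUTE_NODES, ratio)} I/O nodes")
```

The 1:64 case gives 1024 I/O nodes, which appears to match the figure in parentheses on the collective network slide.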

Page 20: The IBM Blue Gene/L System Architecture

Fast Barrier/Interrupt Network

Four independent barrier or interrupt channels. Independently configurable as "or" or "and".

Asynchronous propagation. Halt operation quickly (current estimate is 1.3 usec worst case round trip). 3/4 of this delay is time-of-flight.

Sticky bit operation. Allows global barriers with a single channel.

User space accessible. System selectable.

It is partitioned along the same boundaries as Tree and Torus. Each user partition contains its own set of barrier/interrupt signals.

Page 21: The IBM Blue Gene/L System Architecture

Control Network

JTAG interface to 100Mb Ethernet:

direct access to all nodes.

boot, system debug availability.

runtime non-invasive RAS support.

non-invasive access to performance counters.

direct access to shared SRAM in every node.

Control, configuration and monitoring:

Make all active devices accessible through JTAG, I2C, or other “simple” bus. (Only clock buffers & DRAM are not accessible.)

Page 22: The IBM Blue Gene/L System Architecture

Packaging

2 nodes per compute card.

16 compute cards per node board.

16 node boards per 512-node midplane.

Two midplanes in a 1024-node rack.

For compiling, diagnostics, and analysis, a host computer is required.

An I/O node handles communication between a compute node and other systems, including the host and file servers.
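The packaging hierarchy multiplies out as follows. This is a plain arithmetic sketch using the counts from the slide; the constant and function names are chosen for this example, and the rack count for the full machine is derived here rather than quoted from the slides.

```python
NODES_PER_COMPUTE_CARD = 2
COMPUTE_CARDS_PER_NODE_BOARD = 16
NODE_BOARDS_PER_MIDPLANE = 16
MIDPLANES_PER_RACK = 2


def nodes_per_midplane() -> int:
    """Compute nodes in one midplane."""
    return (NODES_PER_COMPUTE_CARD
            * COMPUTE_CARDS_PER_NODE_BOARD
            * NODE_BOARDS_PER_MIDPLANE)


def nodes_per_rack() -> int:
    """Compute nodes in one rack."""
    return nodes_per_midplane() * MIDPLANES_PER_RACK


def racks_needed(total_nodes: int) -> int:
    """Racks required for `total_nodes`, rounding up to whole racks."""
    per_rack = nodes_per_rack()
    return (total_nodes + per_rack - 1) // per_rack


if __name__ == "__main__":
    print(nodes_per_midplane(), nodes_per_rack(), racks_needed(65536))
```

With these counts a midplane holds 512 nodes and a rack 1024, so a 65,536-node system would occupy 64 racks.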

Page 23: The IBM Blue Gene/L System Architecture

BlueGene/L packaging.

Page 24: The IBM Blue Gene/L System Architecture

Science Application

Study of protein folding and dynamics.

Aim is to obtain a microscopic view of the thermodynamics and kinetics of the folding process.

Simulating longer and longer time-scales is the key challenge

Focus is on improving the speed of execution for a fixed size system by utilizing additional CPUs.

Understanding the logical limits to concurrency within the application is very important.
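One common way to reason about limits to concurrency for a fixed-size problem is Amdahl's law. The slides do not name this model, so the sketch below is offered only as a standard illustration, and the serial fraction used in the example is a made-up value, not a measurement from any Blue Gene/L application.

```python
def amdahl_speedup(serial_fraction: float, cpus: int) -> float:
    """Amdahl's law: speedup of a fixed-size problem on `cpus` processors
    when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cpus)


if __name__ == "__main__":
    # Illustrative value only: 0.1% of the work is assumed to be serial.
    for cpus in (512, 4096, 65536):
        print(f"{cpus} CPUs -> speedup {amdahl_speedup(0.001, cpus):.0f}x")
```

Under this model the speedup can never exceed 1 / serial_fraction, so even a 0.1% serial portion would cap the gain below 1000x regardless of how many CPUs are added.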

Page 25: The IBM Blue Gene/L System Architecture

Conclusion

The Blue Gene/L supercomputer is designed to improve cost/performance for a relatively broad class of applications with good scaling behavior.

This is achieved by using parallelism and System on Chip technology.

The functionality of a node is contained within a single ASIC chip.

BG/L has significantly lower cost in terms of power, space, and service, while performing no worse than competing systems.

Page 26: The IBM Blue Gene/L System Architecture

The End

Questions?

