
The Tera Computer System

The Tera Computer System. Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, Burton Smith. Tera Computer Company, Seattle, Washington, USA. ICS '90 Proceedings of the 4th international conference on Supercomputing.
Page 1: The  Tera  Computer System

The Tera Computer System

Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, Burton Smith

Tera Computer Company Seattle, Washington USA

ICS '90 Proceedings of the 4th international conference on Supercomputing

Ran Manevich - Computer Architecture and Parallel Systems seminar (236604) – Spring 2012

Page 2: The  Tera  Computer System

Tera Computer Company

1972 – Seymour Cray founds Cray Research, Inc.

1976 – Cray-1 – 250 MFlops, 1 Mword (8 MB) memory

1987 – James Rottsolk and Burton Smith found Tera Computer Company

2000 – Tera acquires the Cray Research assets and becomes Cray Inc.

[Photo: Seymour Cray standing next to the core unit of the Cray-1 computer, circa 1974]

Page 3: The  Tera  Computer System

Tera Computer Company (Cray)

Page 4: The  Tera  Computer System

Tera Computer Company (Cray)

Jaguar – World #3 – 224162 cores, 1759 TFlops, 6950 kW. Oak Ridge National Laboratory, US.

Page 5: The  Tera  Computer System

Tera Computer System

A shared-memory MIMD supercomputer introduced around 1990.

Resources:
256 processors
512 memory units
256 I/O cache units
256 I/O processors

Page 6: The  Tera  Computer System

Interconnection Network

Pipelined, packet-switched nodes (routers).

A packet consists of source and destination addresses, an opcode and 64 bits of data (164 bits total*).

Each link can transport one packet in each direction per clock cycle (i.e. packets are single-flit).

* George Davison, Constantine Pavlakos, Claudio Silva. Final Report for the Tera Computer TTI CRADA. Sandia National Labs Report SAND97-0134, January 1997.

Page 7: The  Tera  Computer System

Interconnection Network 3D 16x16x16 Torus:

Page 8: The  Tera  Computer System

Interconnection Network

1280 of the 4096 routers are attached to resources (256 processors + 512 memory units + 256 I/O caches + 256 I/O processors).

X links and Y links are missing on alternate Z layers in order to speed up the routers.

This reduces the router crossbar degree from 6 to 4 for routers without a resource and from 7 to 5 for routers with one.

Resources are distributed homogeneously across the layers, which reduces the average communication distance.
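
The counts on this slide can be checked with a few lines of Python. This is only a sketch: the function name `router_degree` is made up for illustration, and it assumes that every router in the sparse network lacks exactly one horizontal dimension (two links), which is what the quoted drop from 6 to 4 implies.

```python
# Resource counts taken from the slides.
PROCESSORS = 256
MEMORY_UNITS = 512
IO_CACHES = 256
IO_PROCESSORS = 256

ROUTERS = 16 * 16 * 16  # 3D 16x16x16 torus


def router_degree(has_resource: bool, sparse: bool = True) -> int:
    """Crossbar degree of one router (illustrative helper, not a Tera name).

    A full 3D torus router has 6 links (+/-X, +/-Y, +/-Z). In the sparse
    network we assume each router lacks one horizontal dimension, i.e. 2 links.
    A router with an attached resource needs one extra port.
    """
    links = 6 - (2 if sparse else 0)
    return links + (1 if has_resource else 0)


attached = PROCESSORS + MEMORY_UNITS + IO_CACHES + IO_PROCESSORS
print(ROUTERS, attached, attached / ROUTERS)
print(router_degree(False, sparse=False), router_degree(False))
print(router_degree(True, sparse=False), router_degree(True))
```

Under these assumptions, a little under a third of the routers (1280 of 4096) carry a resource.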

Page 9: The  Tera  Computer System

Interconnection Network Odd Z layers:

Page 10: The  Tera  Computer System

Interconnection Network Even Z layers:

Page 11: The  Tera  Computer System

Data Memory

512 data memory units of 128 MB each. Total: 64 GB.

Memory is byte addressable and organized in 64-bit words.

Four additional access state bits per word:
2 trap bits
1 invisible indirect addressing bit
1 full/empty bit for synchronization

Additional code bits provide single error correction and double error detection, separately for data and access state.

Page 12: The  Tera  Computer System

Data Memory

Virtual addresses are randomized to avoid hotspots.

Randomization for each processor can be limited to a subset of the 512 memory units to exploit physical locality.
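
To see why randomization helps, consider a strided access pattern. The sketch below is not the Tera hardware scheme, which the slides do not describe. It assumes a generic hash (BLAKE2 from Python's standard library) as the scrambling function, and `unit_for` is a hypothetical name.

```python
import hashlib

MEMORY_UNITS = 512


def unit_for(address: int, subset: range = range(MEMORY_UNITS)) -> int:
    """Map a word address to a memory unit (hypothetical scheme).

    The address is scrambled with a hash and reduced onto `subset`,
    the memory units this mapping is allowed to use.
    """
    digest = hashlib.blake2b(address.to_bytes(8, "little"), digest_size=8).digest()
    return subset[int.from_bytes(digest, "little") % len(subset)]


# A stride of 512 words: plain interleaving (address % 512) hits a single unit.
addresses = [i * MEMORY_UNITS for i in range(512)]
interleaved = {a % MEMORY_UNITS for a in addresses}
randomized = {unit_for(a) for a in addresses}
local = {unit_for(a, range(0, 32)) for a in addresses}  # restricted to 32 units

print(len(interleaved), len(randomized), len(local))
```

With plain interleaving every access lands on the same unit, while the scrambled mapping spreads the same addresses over many units. Passing a smaller `subset` keeps the traffic within a chosen group of units.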

Page 13: The  Tera  Computer System

Data Memory - Synchronization

4 Types of load/store access control for hardware based synchronization:

Page 14: The  Tera  Computer System

I/O Caches

“Disk speeds have not kept pace with advances in processor and memory performance in recent years.”

The Tera system needs up to 70 GB/s of sustained bandwidth between data memory and secondary storage (e.g. magnetic disks).

This bandwidth is supplied by 256 directly addressable I/O caches of 1 GB each (256 GB total). I/O cache units are functionally identical to data memory but slower.

Each processor fetches instructions from a neighboring I/O cache unit.

Page 15: The  Tera  Computer System

Processors

256 processors.

Each processor can execute up to 128 instruction streams (i.e. threads) simultaneously.

On every clock tick, one of the streams in the “ready” state is allowed to issue an instruction.

Page 16: The  Tera  Computer System

Processors

If there are enough streams, execution latency (70 ticks on average) can be hidden by parallelism.

[Figure: performance versus number of threads, with the maximum performance level and the bandwidth-limited region marked; timeline of execution and memory access phases. Source: Zvika Guz et al.]
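
The latency-hiding argument can be illustrated with a toy simulation. It is a deliberately simplified model and not a description of the real issue logic: a stream that issues at tick t is assumed to be ready again at tick t + latency, no lookahead is used, and the lowest-numbered ready stream always wins.

```python
def utilization(streams: int, latency: int = 70, ticks: int = 7000) -> float:
    """Fraction of ticks on which some stream issues an instruction.

    Simplified model: a stream that issues at tick t is ready again at
    tick t + latency; at most one ready stream may issue per tick.
    """
    ready_at = [0] * streams          # tick at which each stream is ready again
    issued = 0
    for tick in range(ticks):
        for s in range(streams):
            if ready_at[s] <= tick:
                ready_at[s] = tick + latency
                issued += 1
                break                 # only one issue per tick
    return issued / ticks


for n in (1, 16, 35, 70, 128):
    print(n, round(utilization(n), 2))
```

In this model the issue slot is busy half of the time with 35 streams and all of the time with 70 or more, so the 128 streams per processor leave headroom for latencies above the 70-tick average.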

Page 17: The  Tera  Computer System

Processors – stream state

Stream state is defined by the following registers:
1 64-bit Stream Status Word (SSW) – for the program counter and additional mode flags.
32 64-bit General Registers (R0-R31).
8 64-bit Target Registers (T0-T7) – for trap handler and branch targets.

To enable a rapid context switch (on every tick), there are 128 sets of context registers. Each processor has 128 SSWs, 4096 general registers and 1024 target registers.

With target registers, branch target addresses are prefetched in parallel with the branch decision calculation.

Page 18: The  Tera  Computer System

Instructions

To enable multiple operations to issue per tick, “mildly horizontal” VLIW (Very Long Instruction Word) instructions are used. These instructions typically specify three operations:
1. Memory reference operation (e.g. UNS_LOADB).
2. Arithmetic operation (e.g. FLOAT_ADD_MUL).
3. Control (e.g. JUMP) or second arithmetic operation.

Page 19: The  Tera  Computer System

Explicit-Dependence Lookahead

Each instruction contains a 3 bit lookahead field that specifies how many instructions from this stream will issue before encountering an instruction that depends on the current one.

A new instruction is issued only when the instructions whose lookahead values refer to it have completed. If instructions are independent (lookahead value is 7, the maximum), each stream can keep 8 instructions in flight, so 9 streams are enough to hide an instruction latency of 72 ticks.

INS.            LA
R0 = R0 + 1     1
R1 = R1 + 1     4
R0 = R0 + 1     2
R3 = R3 + 1     4
R4 = R4 + 1     4
R0 = R0 + 1     4
R1 = R1 + 1     4
R2 = R2 + 1     4
…
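
Both the stream count and the lookahead values of the example can be reproduced with a short sketch. The helper names are illustrative. The formula assumes that a lookahead of L lets a stream keep L + 1 instructions in flight, which matches the 9 streams quoted for 72 ticks. The dependence rule assumes each instruction depends only on the previous update of the same register.

```python
import math

MAX_LOOKAHEAD = 7  # largest value of a 3-bit field


def streams_needed(latency: int, lookahead: int) -> int:
    """Streams needed to keep the issue slot busy (assumed model).

    With lookahead L a stream may have L + 1 instructions in flight,
    so each stream covers L + 1 ticks out of every `latency` ticks.
    """
    return math.ceil(latency / (lookahead + 1))


def lookahead_values(dest_regs: list[str]) -> list[int]:
    """Lookahead per instruction for a sequence of `Rn = Rn + 1` updates.

    The lookahead is the number of instructions between an instruction
    and the next one that touches the same register, capped at 7.
    """
    values = []
    for i, reg in enumerate(dest_regs):
        later = dest_regs[i + 1:]
        gap = later.index(reg) if reg in later else MAX_LOOKAHEAD
        values.append(min(gap, MAX_LOOKAHEAD))
    return values


print(streams_needed(72, 7))   # fully independent instructions
print(streams_needed(72, 0))   # every instruction depends on the previous one
print(lookahead_values(["R0", "R1", "R0", "R3", "R4", "R0", "R1", "R2"])[:3])
```

Only the first three lookahead values are printed, because the later ones depend on instructions beyond the "…" that the slide does not show.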

Page 20: The  Tera  Computer System

Protection Domains (Processes)

Each processor supports as many as 16 active protection domains (processes/address spaces). A protection domain defines program memory, data memory and the mapping between physical and virtual addresses.

Each instruction stream (thread) is assigned to a protection domain. The exact domain is not known to the user program.

A protection domain can be seen as a virtual processor and can be moved from one physical processor to another.

Page 21: The  Tera  Computer System

Protection Domains (Processes)

Retry limit – defines, for each protection domain, how many times a memory reference can fail (when testing the full/empty bit) before it traps (raises an exception).
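
A minimal sketch of how a retry limit could interact with the full/empty bit. All names here (`Word`, `sync_load`, `sync_store`, `RetryLimitTrap`) are hypothetical. The semantics are assumptions too: a load waits for "full" and leaves the word empty, a store waits for "empty" and leaves it full, and one initial attempt is followed by up to `retry_limit` retries.

```python
from dataclasses import dataclass


class RetryLimitTrap(Exception):
    """Raised when a synchronized access exceeds the domain's retry limit."""


@dataclass
class Word:
    value: int = 0
    full: bool = False   # full/empty access state bit


def sync_load(word: Word, retry_limit: int) -> int:
    """Wait-for-full load that leaves the word empty (assumed semantics)."""
    for _ in range(retry_limit + 1):     # first attempt plus the retries
        if word.full:
            word.full = False
            return word.value
        # Real hardware would retry the reference later; this sketch just loops.
    raise RetryLimitTrap("word still empty after retry limit")


def sync_store(word: Word, value: int, retry_limit: int) -> None:
    """Wait-for-empty store that leaves the word full (assumed semantics)."""
    for _ in range(retry_limit + 1):
        if not word.full:
            word.value, word.full = value, True
            return
    raise RetryLimitTrap("word still full after retry limit")


w = Word()
sync_store(w, 42, retry_limit=3)
print(sync_load(w, retry_limit=3))
```

A second `sync_load` on `w` would raise `RetryLimitTrap`, since no other stream refills the word in this single-threaded sketch.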

Page 22: The  Tera  Computer System

Privilege Levels

Privilege levels are defined independently for each stream.

4 levels of privilege: user, supervisor, kernel and IPL.

IPL is the highest and is the only one that operates in absolute addressing mode.

Page 23: The  Tera  Computer System

Arithmetic

Operations supported directly by hardware: addition, subtraction, multiplication, conversion(?) and comparison.

Types that are directly supported:
64-bit 2’s-complement and unsigned integers.
64-bit floating point numbers.
64-bit complex numbers.

Types that are indirectly supported:
8-, 16- and 32-bit 2’s-complement and unsigned integers.
Arbitrary length integers.
32-bit floating point numbers.
128-bit “double precision” numbers.

Page 24: The  Tera  Computer System

Software*

Operating System - Custom, fully symmetric, distributed parallel version of UNIX.

Programming Model - Thread-based programming model that permits a mixture of implicit and explicit parallelism. The virtual machine has an unbounded number of processors with uniform access to all memory locations.

Tera’s compilers perform automatic parallelization of Fortran, C and C++ (loop unrolling, operations on vectors, etc.).

* George Davison, Constantine Pavlakos, Claudio Silva. Final Report for the Tera Computer TTI CRADA. Sandia National Labs Report SAND97-0134, January 1997.

Page 25: The  Tera  Computer System

Performance*

Nominal clock frequency: 333 MHz

Peak performance: 1 Gflops per processor, 256 Gflops total.

Data bandwidth per node: 2.67 GB/s

Processor power dissipation: 6 kW per processor, 1.536 MW total.

Efficiency: 167 Kflops/Watt

* George Davison, Constantine Pavlakos, Claudio Silva. Final Report for the Tera Computer TTI CRADA. Sandia National Labs Report SAND97-0134, January 1997.
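
The performance figures can be cross-checked against each other. The sketch below uses only the numbers on this slide. Reading the per-node bandwidth as one 64-bit word per clock tick is an assumption, as is interpreting the peak rate as floating-point operations per tick.

```python
CLOCK_HZ = 333e6            # nominal clock frequency
PROCESSORS = 256
PEAK_FLOPS_PER_CPU = 1e9    # 1 Gflops per processor
POWER_PER_CPU_W = 6e3       # 6 kW per processor
WORD_BYTES = 8              # 64-bit words

total_flops = PEAK_FLOPS_PER_CPU * PROCESSORS
total_power_w = POWER_PER_CPU_W * PROCESSORS
flops_per_watt = total_flops / total_power_w
flops_per_tick = PEAK_FLOPS_PER_CPU / CLOCK_HZ
bandwidth = CLOCK_HZ * WORD_BYTES   # assumption: one 64-bit word per tick

print(f"{total_flops / 1e9:.0f} Gflops total")
print(f"{total_power_w / 1e6:.3f} MW total")
print(f"{flops_per_watt / 1e3:.1f} Kflops/Watt")
print(f"{flops_per_tick:.2f} flops per tick per processor")
print(f"{bandwidth / 1e9:.3f} GB/s per node")
```

The totals and the efficiency reproduce the slide (256 Gflops, 1.536 MW, about 167 Kflops/Watt). The peak rate works out to roughly three floating-point operations per tick, and one word per tick at 333 MHz gives about 2.66 GB/s, close to the quoted 2.67 GB/s.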

Page 26: The  Tera  Computer System

Thank You!!!

