The Tera Computer System

Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, Burton Smith

Tera Computer Company Seattle, Washington USA

ICS '90 Proceedings of the 4th international conference on Supercomputing

Ran Manevich - Computer Architecture and Parallel Systems seminar (236604) – Spring 2012

Tera Computer Company

1972 – Seymour Cray founds Cray Research, Inc.

1976 – Cray 1 – 250 MFlops, 1MB Memory

1987 – James Rottsolk and Burton Smith found Tera Computer Company

2000 – Tera acquires the Cray Research assets and becomes Cray, Inc.

[Photo: Seymour Cray standing next to the core unit of the Cray 1 computer, circa 1974]

Tera Computer Company (Cray)

Jaguar – World #3 – 224162 cores, 1759 TFlops, 6950 kW. Oak Ridge National Lab, US.

Tera Computer System

A shared-memory MIMD supercomputer introduced around 1990.

Resources:
256 processors
512 memory units
256 I/O cache units
256 I/O processors

Interconnection Network

Pipelined, packet-switched nodes (routers).

A packet consists of source and destination addresses, an opcode and 64 bits of data (164 bits total*).

Each link can transport one packet in each direction on every clock cycle (i.e. single-flit packets).

* George Davison, Constantine Pavlakos, Claudio Silva. Final Report for the Tera Computer TTI CRADA. Sandia National Labs Report SAND97-0134, January 1997.

The network topology is a 3D 16x16x16 torus.

1280 of the 4096 routers are attached to resources (256 processors + 512 memory units + 256 I/O caches + 256 I/O processors).

X links and Y links are missing on alternate Z layers in order to speed up the routers.

This reduces the router crossbar degree from 6 to 4 in routers without a resource, and from 7 to 5 in routers with one.

Resources are distributed homogeneously across the layers, which reduces the average communication distance.
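The counts above can be checked with a short sketch. The resource numbers come from the slides; the function names and the choice of which Z parity drops X links and which drops Y links are assumptions made only for illustration.

```python
# Sketch of the sparse 3D torus described above.
# Assumption: even Z layers keep X links and odd Z layers keep Y links;
# the slides only say that X and Y links are missing on alternate layers.
DIM = 16
ROUTERS = DIM ** 3

RESOURCES = {
    "processors": 256,
    "memory_units": 512,
    "io_caches": 256,
    "io_processors": 256,
}
ATTACHED = sum(RESOURCES.values())


def neighbors(x, y, z, dim=DIM):
    """Routers directly linked to (x, y, z), with torus wraparound."""
    linked = [(x, y, (z - 1) % dim), (x, y, (z + 1) % dim)]  # Z links always exist
    if z % 2 == 0:
        linked += [((x - 1) % dim, y, z), ((x + 1) % dim, y, z)]
    else:
        linked += [(x, (y - 1) % dim, z), (x, (y + 1) % dim, z)]
    return linked


def crossbar_degree(x, y, z, has_resource):
    """Network links plus one extra port when a resource is attached."""
    return len(neighbors(x, y, z)) + (1 if has_resource else 0)


print(ROUTERS, ATTACHED)                  # 4096 1280
print(crossbar_degree(0, 0, 0, False))    # 4
print(crossbar_degree(0, 0, 1, True))     # 5
```

Dropping one dimension per layer keeps every router connected through the Z links while cutting two ports from each crossbar.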

[Figure: link layout on odd Z layers]

[Figure: link layout on even Z layers]

Data Memory

512 data memory units of 128 MB each. Total: 64 GB.

Memory is byte addressable and organized in 64-bit words.

Four additional access-state bits per word:
2 trap bits
1 invisible indirect addressing bit
1 full/empty bit for synchronization

Additional code bits provide single-error correction and double-error detection, separately for data and access state.

Virtual addresses are randomized to avoid hotspots.

Randomization for each processor can be limited to a subset of the 512 memory units to exploit physical locality.
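Why randomization helps can be shown with a small sketch. The multiplicative hash below is a stand-in chosen for illustration, not the hash used by the Tera hardware, and the names memory_unit, n_units and first_unit are hypothetical.

```python
# Sketch: spreading word addresses over memory units with a hash.
# Assumption: a simple 64-bit multiplicative hash stands in for the
# real (unspecified here) randomization function.
MASK64 = (1 << 64) - 1
MIX = 0x9E3779B97F4A7C15  # arbitrary odd mixing constant


def memory_unit(word_addr, n_units=512, first_unit=0):
    """Map a word address to one unit in [first_unit, first_unit + n_units)."""
    mixed = (word_addr * MIX) & MASK64
    top32 = mixed >> 32
    return first_unit + ((top32 * n_units) >> 32)


# A stride of 512 words is the worst case for plain interleaving:
strided = [k * 512 for k in range(1024)]
interleaved_units = {addr % 512 for addr in strided}
randomized_units = {memory_unit(addr) for addr in strided}

print(len(interleaved_units))   # 1: every access hits the same unit
print(len(randomized_units))    # many distinct units

# Limiting randomization to a subset of units (here 16 units from unit 32):
local_units = {memory_unit(addr, n_units=16, first_unit=32) for addr in strided}
```

With plain interleaving a strided loop hammers one memory unit, while the hashed mapping spreads the same accesses over many units. The subset form keeps a processor's data on nearby units at the cost of less spreading.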

Data Memory - Synchronization

Four types of load/store access control provide hardware-based synchronization.

I/O Caches

“Disk speeds have not kept pace with advances in processor and memory performance in recent years.”

The Tera system needs up to 70 GB/s of sustained bandwidth between data memory and secondary storage (e.g. magnetic disks).

This bandwidth is supplied by 256 directly addressable I/O caches of 1 GB each (256 GB total). I/O cache units are functionally identical to data memory units but slower.

Each processor fetches instructions from a neighboring I/O cache unit.

Processors

256 processors. Each processor can execute up to 128 instruction streams (i.e. threads) simultaneously.

On every clock tick, one of the streams in the “ready” state is allowed to issue an instruction.

If there are enough streams, execution latency (70 ticks on average) can be hidden by parallelism.
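The latency-hiding argument can be illustrated with a toy simulation. It assumes that every instruction takes exactly the average latency of 70 ticks and that a stream cannot issue again until its previous instruction completes (no lookahead); the round-robin choice among ready streams and the function name are also assumptions.

```python
# Toy model: one processor, one issue slot per tick, n_streams threads.
# Assumptions: fixed 70-tick latency, no lookahead, round-robin selection.
def utilization(n_streams, latency=70, ticks=10_000):
    """Fraction of ticks on which some ready stream issues an instruction."""
    ready_at = [0] * n_streams      # first tick at which each stream is ready
    issued = 0
    start = 0                       # round-robin starting point
    for tick in range(ticks):
        for i in range(n_streams):
            s = (start + i) % n_streams
            if ready_at[s] <= tick:
                ready_at[s] = tick + latency   # busy until the result returns
                issued += 1
                start = (s + 1) % n_streams
                break
    return issued / ticks


for n in (1, 35, 70, 128):
    print(n, round(utilization(n), 3))
```

Under these assumptions utilization grows linearly with the number of streams until it saturates at 70 streams, so the 128 hardware streams leave headroom for longer-than-average latencies.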

[Figure: performance versus number of threads, marking the maximum performance level and the bandwidth limitations region, with stream time split into execution and memory access. After Zvika Guz et al.]

Processors – stream state

Stream state is defined by the following registers:
1 64-bit Stream Status Word (SSW) – for the program counter and additional mode flags.
32 64-bit General Registers (R0-R31).
8 64-bit Target Registers (T0-T7) – for trap handler and branch targets.

To enable a rapid context switch (on every tick), there are 128 sets of context registers. Each processor has 128 SSWs, 4096 general registers and 1024 target registers.

With target registers, branch target addresses are prefetched in parallel with the branch decision calculation.
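The per-processor register totals follow directly from the per-stream state. A minimal sketch, with hypothetical class and field names:

```python
# Sketch of the per-stream register state and the per-processor totals.
# Class and field names are illustrative, not Tera terminology.
from dataclasses import dataclass, field

STREAMS_PER_PROCESSOR = 128


@dataclass
class StreamState:
    ssw: int = 0                                                    # Stream Status Word
    general: list = field(default_factory=lambda: [0] * 32)         # R0-R31
    target: list = field(default_factory=lambda: [0] * 8)           # T0-T7


@dataclass
class Processor:
    streams: list = field(
        default_factory=lambda: [StreamState() for _ in range(STREAMS_PER_PROCESSOR)]
    )

    def register_counts(self):
        """Return (number of SSWs, general registers, target registers)."""
        return (
            len(self.streams),
            sum(len(s.general) for s in self.streams),
            sum(len(s.target) for s in self.streams),
        )


print(Processor().register_counts())   # (128, 4096, 1024)
```

Because every stream owns a full register set, switching streams requires no saving or restoring of state.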

Instructions

To enable the issue of multiple operations per tick, “mildly horizontal” VLIW (Very Long Instruction Word) instructions are used. These instructions typically specify three operations:
1. Memory reference operation (e.g. UNS_LOADB).
2. Arithmetic operation (e.g. FLOAT_ADD_MUL).
3. Control (e.g. JUMP) or second arithmetic operation.

Explicit-Dependence Lookahead

Each instruction contains a 3-bit lookahead field that specifies how many instructions from this stream will issue before encountering an instruction that depends on the current one.

A new instruction is issued only when the instructions with lookahead values referring to it have completed. If instructions are independent (lookahead value is 7), each stream can have 8 instructions in flight, so 9 streams are enough to hide an instruction latency of 72 ticks (72 / 8 = 9).

INS.            LA
R0 = R0 + 1     1
R1 = R1 + 1     4
R0 = R0 + 1     2
R3 = R3 + 1     4
R4 = R4 + 1     4
R0 = R0 + 1     4
R1 = R1 + 1     4
R2 = R2 + 1     4
…
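The lookahead values in the example, and the estimate of 9 streams, can be reproduced with a small sketch. It treats an instruction as dependent only when it reads a register written by an earlier one, which is a simplification, and the function names are hypothetical.

```python
# Sketch: deriving lookahead values and the number of streams needed.
# Assumption: only read-after-write dependences on registers are considered.
import math

MAX_LOOKAHEAD = 7   # largest value a 3-bit field can hold


def lookahead(instrs):
    """instrs is a list of (dest, sources); returns one value per instruction."""
    values = []
    for i, (dest, _) in enumerate(instrs):
        la = MAX_LOOKAHEAD
        last = min(i + 1 + MAX_LOOKAHEAD, len(instrs))
        for j in range(i + 1, last):
            if dest in instrs[j][1]:
                la = j - i - 1      # independent instructions in between
                break
        values.append(la)
    return values


def streams_needed(latency, la):
    """Streams required if each one keeps la + 1 instructions in flight."""
    return math.ceil(latency / (la + 1))


# The listed instructions, each of the form Rn = Rn + 1.
regs = ["R0", "R1", "R0", "R3", "R4", "R0", "R1", "R2"]
program = [(r, [r]) for r in regs]

print(lookahead(program)[:3])       # [1, 4, 2]
print(streams_needed(72, 7))        # 9
print(streams_needed(72, 0))        # 72
```

Only the first three values are compared with the example, because the later ones depend on instructions that follow the listed ones.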

Protection Domains (Processes)

Each processor supports as many as 16 active protection domains (processes/address spaces). A protection domain defines program memory, data memory and the mapping between physical and virtual addresses.

Each instruction stream (thread) is assigned to a protection domain. The exact domain is not known to the user program.

A protection domain can be seen as a virtual processor and can be moved from one physical processor to another.

Retry limit: defines, for each protection domain, how many times a memory reference can fail (when testing the full/empty bit) before it traps (raises an exception).
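A minimal single-threaded sketch of how a retry limit might interact with the full/empty bit follows. The producer/consumer semantics (a store waits for empty and sets full, a load waits for full and sets empty) are an assumed access mode, and all names are hypothetical.

```python
# Sketch: full/empty synchronization with a retry limit.
# Assumptions: sync_store waits for "empty" then sets "full";
# sync_load waits for "full" then sets "empty"; a reference traps
# after failing retry_limit times.
class RetryLimitExceeded(Exception):
    """Stands in for the trap taken when the retry limit is reached."""


class Word:
    def __init__(self):
        self.value = 0
        self.full = False   # the full/empty bit


def sync_store(word, value, retry_limit):
    for _ in range(retry_limit):
        if not word.full:
            word.value = value
            word.full = True
            return
    raise RetryLimitExceeded("store: word stayed full")


def sync_load(word, retry_limit):
    for _ in range(retry_limit):
        if word.full:
            word.full = False
            return word.value
    raise RetryLimitExceeded("load: word stayed empty")


w = Word()
sync_store(w, 42, retry_limit=3)
print(sync_load(w, retry_limit=3))   # 42
try:
    sync_load(w, retry_limit=3)      # nothing was produced since the last load
except RetryLimitExceeded as trap:
    print("trap:", trap)
```

In a real system another stream would fill the word between retries; here the second load traps because no producer runs.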

Privilege Levels

Privilege levels are defined independently for each stream.

4 levels of privilege: user, supervisor, kernel and IPL.

IPL is the highest level and the only one that operates in absolute addressing mode.

Arithmetic

Operations supported directly by hardware: addition, subtraction, multiplication, conversion and comparison.

Types that are directly supported:
64-bit 2’s-complement and unsigned integers.
64-bit floating point numbers.
64-bit complex numbers.

Types that are indirectly supported:
8-, 16- and 32-bit 2’s-complement and unsigned integers.
Arbitrary-length integers.
32-bit floating point numbers.
128-bit “double precision” numbers.

Software*

Operating System - Custom, fully symmetric, distributed parallel version of UNIX.

Programming Model - Thread-based programming model that permits a mixture of implicit and explicit parallelism. The virtual machine has an unbounded number of processors with uniform access to all memory locations.

Tera’s compilers perform automatic parallelization of Fortran, C and C++ (loop unrolling, operations on vectors, etc.)

Performance*

Nominal clock frequency: 333 MHz

Peak performance: 1Gflop per processor, 256 Gflops total.


Data bandwidth per node: 2.67 GB/s

Processor power dissipation: 6 kW per processor, 1.536 MW total.

Efficiency: 167 Kflops/Watt
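The totals above follow from the per-processor figures, as this short check shows. It assumes that GB and Gflop denote powers of ten; the variable names are illustrative.

```python
# Arithmetic check of the performance figures.
# Assumption: G = 10**9 for both flops and bytes.
PROCESSORS = 256
CLOCK_HZ = 333e6
PEAK_FLOPS_PER_PROC = 1e9
POWER_W_PER_PROC = 6e3
BANDWIDTH_BYTES_PER_NODE = 2.67e9

total_flops = PROCESSORS * PEAK_FLOPS_PER_PROC
total_power_w = PROCESSORS * POWER_W_PER_PROC
flops_per_watt = total_flops / total_power_w
flops_per_tick = PEAK_FLOPS_PER_PROC / CLOCK_HZ
bytes_per_tick = BANDWIDTH_BYTES_PER_NODE / CLOCK_HZ

print(total_flops / 1e9)             # 256.0 Gflops
print(total_power_w / 1e6)           # 1.536 MW
print(round(flops_per_watt / 1e3))   # 167 Kflops/Watt
print(round(flops_per_tick))         # 3 floating point operations per tick
print(round(bytes_per_tick))         # 8 bytes, i.e. one 64-bit word, per tick
```

The last two figures suggest that peak performance corresponds to about three floating point operations and one 64-bit memory word per processor per clock tick.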

Thank You!!!
