
Tarantula A Vector Extension to the Alpha Architecture

Date post: 19-Mar-2016
Roger Espasa, Federico Ardanaz, Joel Emer, Stephen Felix, Julio Gago, Roger Gramunt, Isaac Hernandez, Toni Juan, Geoff Lowney, Matthew Mattina, André Seznec
Universitat Politècnica Catalunya, Barcelona, Spain; Compaq Computer Corporation, Shrewsbury, MA
Page 1: Tarantula A Vector Extension to the Alpha Architecture

Tarantula
A Vector Extension to the Alpha Architecture

Roger Espasa, Federico Ardanaz, Joel Emer, Stephen Felix, Julio Gago, Roger Gramunt, Isaac Hernandez, Toni Juan, Geoff Lowney, Matthew Mattina, André Seznec

Universitat Politècnica Catalunya, Barcelona, Spain
Compaq Computer Corporation, Shrewsbury, MA

Page 2: Tarantula A Vector Extension to the Alpha Architecture

State of the World
• CMOS technology progresses
  – More transistors, more functional units, more control overhead
• VLIW and wide superscalar
  – More individually controlled units
  – Amount of real estate for control logic grows non-linearly
• Vector ISA
  – Localization of parallelism, aggregation of control
  – Regular structures, simple control

Page 3: Tarantula A Vector Extension to the Alpha Architecture

Tarantula
• EV8 core + tightly integrated Vector Unit
  – Out-of-order execution, register renaming
  – Integrated in VM and cache coherence system
  – SMT support
• Targeted at scientific computing applications
• Requires compiler support and recompilation

Page 4: Tarantula A Vector Extension to the Alpha Architecture

Vector ISA
• New architectural state
  – 32 vector registers (v0-v31)
    • v31 wired to 0; used for prefetch
  – Vector length (vl), vector stride (vs), vector mask (vm)
• 45 new instructions
  – 5 groups
    • Vector-Vector, Vector-Scalar, Strided Memory Access, Random Memory Access, Vector Control

Page 5: Tarantula A Vector Extension to the Alpha Architecture

Vector Mask
• Allows conditional execution without EV8 scalar registers
• VM can be renamed

Example condition: A(i).ne.0.and.B(i).gt.2

vloadq A(i)    --> v0
vloadq B(i)    --> v1
vcmpne v0, #0  --> v6
vcmpgt v1, #2  --> v7
vand   v6, v7  --> v8
setvm  v8      --> vm
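The mask sequence above can be modeled roughly in Python. This is only a sketch: it assumes each vector instruction works element-wise on plain lists, and that a masked operation leaves the destination unchanged where the mask bit is clear. The function names mirror the slide's mnemonics, while `masked_add` and the sample data are hypothetical and not taken from the talk.

```python
def vcmpne(v, imm):
    """Element-wise 'not equal to immediate' compare."""
    return [x != imm for x in v]

def vcmpgt(v, imm):
    """Element-wise 'greater than immediate' compare."""
    return [x > imm for x in v]

def vand(a, b):
    """Element-wise logical AND of two compare results."""
    return [x and y for x, y in zip(a, b)]

def masked_add(dst, a, b, vm):
    """Hypothetical masked operation: write a+b only where vm is set."""
    return [x + y if m else d for d, x, y, m in zip(dst, a, b, vm)]

# Illustrative data only
A = [0, 5, 3, 0, 7]
B = [9, 1, 4, 8, 3]

v6 = vcmpne(A, 0)   # A(i) .ne. 0
v7 = vcmpgt(B, 2)   # B(i) .gt. 2
vm = vand(v6, v7)   # combined condition becomes the vector mask
result = masked_add([0] * len(A), A, B, vm)
```

Only elements 2 and 4 satisfy both conditions, so only those positions of `result` are updated.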

Page 6: Tarantula A Vector Extension to the Alpha Architecture

Tarantula Block Diagram

Page 7: Tarantula A Vector Extension to the Alpha Architecture

Vector Execution Unit
• 16 independent lanes
  – No communication, except for gather/scatter
• Each lane has
  – 2 functional units
  – Slice of register file and mask
    • Allows high bandwidth
  – Address generator and private TLB
• 32 functional units appear as only 2 issue ports
  – Simple scheduling
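A small sketch of how a vector might be spread over the lanes. Two things here are assumptions rather than statements from the slide: that elements are interleaved so element i lives in lane i mod 16, and that the maximum vector length is 128 elements (the figure used on the memory-system slides).

```python
NUM_LANES = 16   # from the slide
MAX_VL = 128     # inferred from the later memory-system slides

def lane_of(element):
    """Assumed interleaving: element i is held by lane i mod 16."""
    return element % NUM_LANES

def lane_assignment(vl=MAX_VL):
    """List the element indices each lane would process."""
    lanes = [[] for _ in range(NUM_LANES)]
    for element in range(vl):
        lanes[lane_of(element)].append(element)
    return lanes

def cycles_per_port(vl=MAX_VL):
    """Cycles one issue port is busy if every lane retires one element per cycle."""
    return -(-vl // NUM_LANES)  # ceiling division
```

Under these assumptions a full-length instruction keeps one issue port busy for eight cycles, which is why the scheduler only has to treat the 32 functional units as two ports.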

Page 8: Tarantula A Vector Extension to the Alpha Architecture

Vector Unit – Core Interface
• Vector Unit physically separate from core
  – Little modification to core
• Large bus prevented by routing space
  – Core to VBox
    • 3 Instruction Bus
    • 2 Data Buses for scalars from EV8 register file
    • 3 Instruction Kill Signal Bus for misspeculation
  – VBox to Core
    • 3 Instruction Completion Bus

Page 9: Tarantula A Vector Extension to the Alpha Architecture

Power Consumption

Page 10: Tarantula A Vector Extension to the Alpha Architecture

Vector Memory System
• Bound to EV8 VM and cache coherence architecture
• High load/store bandwidth required
  – Goal: one 64-bit datum per flop
  – Memory bus too slow
  – L1 cache too small for vector data
  – Direct connection to L2 cache
• Non-unit stride is the central problem
  – 20% of all accesses
  – Don't match cache lines

Page 11: Tarantula A Vector Extension to the Alpha Architecture

Non-Unit Strides
• EV8 4 MByte L2 cache in 128 banks
  – 8 ways, 16 banks per way
  – Read 8 ways, select correct one
• Non-unit stride accesses
  – Read 16 independent cache lines
  – Select one qword per line
• Requires
  – Conflict-free addresses
  – Conflict-free writes to 16 lanes
• One qword per lane per cycle
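The two requirements can be expressed as a simple check on a group of up to 16 requests. This sketch makes several assumptions that the slide does not state: 64-byte cache lines (inferred from 128 qwords filling 16 aligned lines), a bank chosen by the low-order bits of the line address, and element i being written to lane i mod 16.

```python
LINE_BYTES = 64   # inferred: 128 qwords * 8 bytes / 16 lines
NUM_BANKS = 16    # banks per way, from the slide
NUM_LANES = 16

def bank_of(addr):
    """Assumed bank mapping: low-order bits of the cache-line address."""
    return (addr // LINE_BYTES) % NUM_BANKS

def is_conflict_free(requests):
    """requests: list of (element_index, byte_address) pairs, at most 16.

    True when no two requests hit the same bank and no two results
    would be written to the same lane in the same cycle.
    """
    banks = [bank_of(addr) for _, addr in requests]
    lanes = [element % NUM_LANES for element, _ in requests]
    return len(set(banks)) == len(banks) and len(set(lanes)) == len(lanes)

# 16 consecutive cache lines: every request lands in a different bank
good = [(i, i * LINE_BYTES) for i in range(16)]
# Stride of 1024 bytes: every request maps to bank 0 under this mapping
bad = [(i, i * 1024) for i in range(16)]
```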

Page 12: Tarantula A Vector Extension to the Alpha Architecture

Conflict-Free Addresses
• Possible for any 128 consecutive elements
  – For stride S = σ × 2^s with σ odd and s ≤ 4
  – Order stored in ROM table
• Elements accessed out of order
  – Even for length < 128, full eight cycles for address generation
• Slice
  – Group of 16 conflict-free addresses
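A sketch of the stride classification, assuming the condition on the slide is the usual decomposition of a stride into an odd factor times a power of two, S = σ × 2^s with σ odd. The function names are hypothetical, and the unit of the stride is left as whatever the slide intends.

```python
def decompose_stride(stride):
    """Return (sigma, s) such that stride == sigma * 2**s and sigma is odd."""
    if stride <= 0:
        raise ValueError("this sketch handles positive strides only")
    s = 0
    while stride % 2 == 0:
        stride //= 2
        s += 1
    return stride, s

def reorderable(stride, max_s=4):
    """True when the stride meets the s <= 4 condition for conflict-free reordering."""
    _, s = decompose_stride(stride)
    return s <= max_s
```

For example, a stride of 24 decomposes as 3 × 2^3 and qualifies, while a stride of 96 decomposes as 3 × 2^5 and does not.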

Page 13: Tarantula A Vector Extension to the Alpha Architecture

PUMP
• Stride-1 accesses
  – 80% of all accesses
  – 128 qwords in 16 (aligned) or 17 (misaligned) cache lines
• Full cache lines read into PUMP latches
  – Two qwords per cycle sent to VBox
• Similar for writes
• Allows double bandwidth
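The 16-versus-17 line count follows from simple address arithmetic. The sketch below assumes 8-byte qwords, 64-byte cache lines (the size implied by 128 qwords filling 16 aligned lines), and a qword-aligned base address.

```python
QWORD_BYTES = 8
LINE_BYTES = 64   # implied by 128 qwords in 16 aligned lines
VL = 128

def lines_touched(base_addr, vl=VL):
    """Number of cache lines covered by vl consecutive qwords starting at base_addr."""
    first_line = base_addr // LINE_BYTES
    last_line = (base_addr + vl * QWORD_BYTES - 1) // LINE_BYTES
    return last_line - first_line + 1
```

A base address that is a multiple of 64 touches exactly 16 lines; any other qword-aligned base spills into a 17th.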

Page 14: Tarantula A Vector Extension to the Alpha Architecture

Gathers and Scatters
• Arbitrary address for every vector element
  – Reordering algorithm doesn't work
• Conflict Resolution Box (CR)
  – Find biggest subset of non-conflicting addresses, pack into slice
  – Add new addresses to remaining ones and repeat
• Worst case: 128 slices generated
• Same algorithm used for self-conflicting strides
  – Stride S = σ × 2^s with σ odd and s > 4
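A minimal sketch of the packing loop, under stated assumptions. It uses a greedy first-fit pass, which does not necessarily find the biggest conflict-free subset, and it sees all addresses at once instead of a sliding window. The bank mapping (64-byte lines, low-order line-address bits) and the lane mapping (element i to lane i mod 16) are also assumptions, not details from the slide.

```python
LINE_BYTES = 64
NUM_BANKS = 16
NUM_LANES = 16

def bank_of(addr):
    """Assumed bank mapping: low-order bits of the cache-line address."""
    return (addr // LINE_BYTES) % NUM_BANKS

def pack_slices(addresses):
    """Greedily split per-element addresses into conflict-free slices."""
    pending = list(enumerate(addresses))
    slices = []
    while pending:
        used_banks, used_lanes = set(), set()
        current, leftover = [], []
        for element, addr in pending:
            bank, lane = bank_of(addr), element % NUM_LANES
            if bank in used_banks or lane in used_lanes:
                leftover.append((element, addr))   # conflicts: retry in a later slice
            else:
                used_banks.add(bank)
                used_lanes.add(lane)
                current.append((element, addr))
        slices.append(current)
        pending = leftover
    return slices

# Every element in the same bank: one element per slice, the worst case
worst = pack_slices([i * 1024 for i in range(128)])
# Consecutive cache lines: eight full slices of 16
best = pack_slices([i * LINE_BYTES for i in range(128)])
```

When every address maps to the same bank, each slice holds a single element, which reproduces the 128-slice worst case.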

Page 15: Tarantula A Vector Extension to the Alpha Architecture

Vector Misses
• To handle L2 misses, consider slices as atomic
• On miss, slice moved to Miss Address File (MAF)
  – Wait for missing data
  – Go to retry queue
• Too many retries cause Panic Mode
  – MAF nacks all other L2 requests that might prevent progress

Page 16: Tarantula A Vector Extension to the Alpha Architecture

Scalar-Vector Coherency
• VBox bypasses L1 cache
  – Presence bit P indicates L2 cache line loaded by VCore
  – If P set, VBox invalidates L1
• Scalar write followed by vector read is not covered
  – Barrier command required
  – DrainM purges write buffer and causes replay trap

Page 17: Tarantula A Vector Extension to the Alpha Architecture

Evaluation
• No compiler support available
  – Hand-coded assembler cores
• Scientific benchmarks
• ASIM simulator
  – Cycle-accurate EV8 simulator
• Tarantula compared to
  – EV8
  – EV8 + Tarantula's memory system
  – Tarantula4: 1:4 ratio to RAMBUS frequency

Page 18: Tarantula A Vector Extension to the Alpha Architecture

Operations per Cycle

Page 19: Tarantula A Vector Extension to the Alpha Architecture

Speed Up over EV8

Page 20: Tarantula A Vector Extension to the Alpha Architecture

Conclusions
• Vector processor most efficient solution for many applications
• Vector unit can be added to standard microprocessor core
• Big bandwidth requirement can only be satisfied by L2 cache
• Potentially big performance gains
  – Speedup of 2 to 20 over EV8
• Performance depends on good code
  – Tiling + aggressive prefetching
• Very good power/performance ratio

Page 21: Tarantula A Vector Extension to the Alpha Architecture

Questions
• Can only scientific applications exploit vector processors?
  – Radix sort worked
  – Powerful memory access instructions
  – Masks allow logic execution
• Does anyone know more about PRAM algorithms?

• EV8/VBox coherency seems quirky. Does anyone see a better solution?

