
Pipe Lining 8


Transcript
  • 8/6/2019 Pipe Lining 8

Slide 1: Pipelining for Multi-Core Architectures

Slide 2: Multi-Core Technology

[Diagram: evolution of chip design. 2004: single core + cache. 2005: dual core (2 or more cores) + cache. 2007: multi-core (4 or more cores, 2X more cores) + cache.]

Slide 3: Why multi-core?

It is difficult to push single-core clock frequencies even higher. Deeply pipelined circuits suffer from:
- heat problems
- clock problems
- efficiency (stall) problems

Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, is extremely difficult; it would require the processor to:
- issue 3 or 4 data memory accesses per cycle,
- rename and access more than 20 registers per cycle, and
- fetch 12 to 24 instructions per cycle.

Many new applications are multithreaded, and the general trend in computer architecture is a shift towards more parallelism.

Slide 4: Instruction-level parallelism

Parallelism at the machine-instruction level: the processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc. Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years.

Slide 5: Thread-level parallelism (TLP)

This is parallelism on a coarser scale. A server can serve each client in a separate thread (Web server, database server). A computer game can do AI, graphics, and sound in three separate threads. Single-core superscalar processors cannot fully exploit TLP; multi-core architectures are the next step in processor evolution, explicitly exploiting TLP.
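The game example above can be sketched with ordinary threads. A minimal Python sketch (the task bodies are stand-ins for real AI, graphics, and sound work):

```python
import threading

results = {}

def ai():
    results["ai"] = "path computed"       # stand-in for AI work

def graphics():
    results["graphics"] = "frame rendered"  # stand-in for rendering

def sound():
    results["sound"] = "audio mixed"        # stand-in for audio

# Each task runs in its own thread; on a multi-core CPU the OS can
# schedule each thread on a separate core.
threads = [threading.Thread(target=f) for f in (ai, graphics, sound)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

After the joins, all three tasks have completed, regardless of how the OS interleaved or parallelized them.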

Slide 6: What applications benefit from multi-core?

- Database servers
- Web servers (Web commerce)
- Multimedia applications
- Scientific applications, CAD/CAM
- In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)

Each can run on its own core.

Slide 7: More examples

- Editing a photo while recording a TV show through a digital video recorder
- Downloading software while running an anti-virus program

Anything that can be threaded today will map efficiently to multi-core. BUT: some applications are difficult to parallelize.


Slide 9: Without SMT, only a single thread can run at any given time

[Diagram: superscalar pipeline: BTB and I-TLB, decoder, trace cache, uCode ROM, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, L2 cache and control, bus. Only the units used by Thread 1 (floating point) are active.]

Slide 10: Without SMT, only a single thread can run at any given time

[Same pipeline diagram; this time only the units used by Thread 2 (integer operation) are active.]

Slide 11: SMT processor: both threads can run concurrently

[Same pipeline diagram; Thread 1 (floating point) and Thread 2 (integer operation) occupy different functional units at the same time.]

Slide 12: But: can't simultaneously use the same functional unit

[Same pipeline diagram; Thread 1 and Thread 2 both target the integer unit, marked IMPOSSIBLE.] This scenario is impossible with SMT on a single core (assuming a single integer unit).


Slide 14: Multi-core: threads can run on separate cores

[Diagram: two complete pipelines (each with BTB and I-TLB, decoder, trace cache, uCode ROM, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, L2 cache and control, bus), one running Thread 3 and the other running Thread 4.]

Slide 15: Combining Multi-core and SMT

Cores can be SMT-enabled (or not). The different combinations:
- Single-core, non-SMT: standard uniprocessor
- Single-core, with SMT
- Multi-core, non-SMT
- Multi-core, with SMT: our fish machines

The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads. Intel calls them hyper-threads.
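The number of hardware thread contexts is just the product of the two dimensions above. A tiny sketch (the `logical_cpus` helper is illustrative, not a real API):

```python
# Total hardware thread contexts = physical cores x SMT threads per core.
def logical_cpus(cores, smt_threads_per_core):
    return cores * smt_threads_per_core

# A dual-core chip with 2-way SMT exposes 4 logical processors;
# a quad-core chip without SMT also exposes 4.
assert logical_cpus(2, 2) == 4
assert logical_cpus(4, 1) == 4
```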

Slide 16: SMT dual-core: all four threads can run concurrently

[Diagram: two SMT pipelines; Threads 1 and 2 share one core while Threads 3 and 4 share the other.]

Slide 17: Multi-core and cache coherence

[Diagram: two designs. Left: Core 0 and Core 1 each have a private L1 cache and a private L2 cache above shared memory; both L1 and L2 are private. Examples: AMD Opteron, AMD Athlon, Intel Pentium D. Right: a design that adds L3 caches. Example: Intel Itanium 2.]

Slide 18: The cache coherence problem

Since we have private caches: how to keep the data consistent across caches? Each core should perceive the memory as a monolithic array, shared by all the cores.

Slide 19: The cache coherence problem

Suppose variable x initially contains 15213.

[Diagram: a multi-core chip with Cores 1-4, each with one or more levels of cache, above main memory holding x=15213.]

Slide 20: The cache coherence problem

Core 1 reads x.

[Diagram: Core 1's cache now holds x=15213; main memory still holds x=15213.]

Slide 21: The cache coherence problem

Core 2 reads x.

[Diagram: both Core 1's and Core 2's caches now hold x=15213.]

Slide 22: The cache coherence problem

Core 1 writes to x, setting it to 21660 (assuming write-through caches).

[Diagram: Core 1's cache holds x=21660; Core 2's cache still holds x=15213; main memory holds x=21660.]

Slide 23: The cache coherence problem

Core 2 attempts to read x and gets a stale copy.

[Diagram: Core 2's cache returns x=15213 even though Core 1's cache and main memory hold x=21660.]
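The stale-copy sequence on slides 19-23 can be reproduced with a toy model: write-through caches with no coherence protocol. (The `Cache` class below is a deliberately naive sketch, not a real coherence simulator.)

```python
# Toy model: private write-through caches with NO coherence protocol.
main_memory = {"x": 15213}

class Cache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}                       # private cache contents
    def read(self, addr):
        if addr not in self.lines:            # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]               # hit: may be stale!
    def write(self, addr, value):             # write-through: update both
        self.lines[addr] = value
        self.memory[addr] = value

core1, core2 = Cache(main_memory), Cache(main_memory)

core1.read("x")                 # slide 20: core 1 caches x = 15213
core2.read("x")                 # slide 21: core 2 caches x = 15213
core1.write("x", 21660)         # slide 22: write-through updates memory...
stale = core2.read("x")         # slide 23: ...but core 2 hits its old copy

assert main_memory["x"] == 21660
assert stale == 15213           # the stale read
```

A coherence protocol (e.g. invalidating other cores' copies on a write) is exactly what this toy model is missing.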

Slide 24: The Memory Wall Problem

Slide 25: Memory Wall

[Chart: performance (log scale, 1 to 1000) vs. year, 1980-2000. CPU performance grows 60%/yr (2X/1.5yr, "Moore's Law"); DRAM performance grows 9%/yr (2X/10yrs). The processor-memory performance gap grows 50% / year.]
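Compounding the chart's two growth rates shows how fast the gap opens. A quick check (the "50% / year" figure is the slide's rounding of 1.60/1.09 - 1, which is about 47%):

```python
# Compound the slide's growth rates: processor ~60%/yr, DRAM ~9%/yr.
proc, dram = 1.0, 1.0
for year in range(1980, 2001):
    proc *= 1.60
    dram *= 1.09

gap = proc / dram                 # relative processor-memory gap by 2000

per_year = 1.60 / 1.09 - 1        # per-year gap growth, ~47%
assert 0.45 < per_year < 0.50
assert gap > 1000                 # three orders of magnitude over two decades
```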

Slide 26: Latency in a Single PC

[Chart: 1997-2009. Left axis: time (ns), 0.1 to 1000, plotting CPU clock period and memory system access time. Right axis: memory-to-CPU ratio, 0 to 500. The diverging ratio curve is labeled "THE WALL".]


Slide 28: Technology Trends

Technology   Capacity        Speed (latency)
Logic:       2x in 3 years   2x in 3 years
DRAM:        4x in 3 years   2x in 10 years
Disk:        4x in 3 years   2x in 10 years

DRAM Generations

Year   Size      Cycle Time
1980   64 Kb     250 ns
1983   256 Kb    220 ns
1986   1 Mb      190 ns
1989   4 Mb      165 ns
1992   16 Mb     120 ns
1996   64 Mb     110 ns
1998   128 Mb    100 ns
2000   256 Mb    90 ns
2002   512 Mb    80 ns
2006   1024 Mb   60 ns

Overall: 16000:1 (capacity), 4:1 (latency)
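The 16000:1 and 4:1 summary ratios can be checked directly from the first and last rows of the table:

```python
# First (1980) and last (2006) rows of the DRAM generations table.
first_size_bits = 64 * 1024            # 64 Kb
last_size_bits = 1024 * 1024 * 1024    # 1024 Mb

capacity_ratio = last_size_bits // first_size_bits
assert capacity_ratio == 16384         # the slide rounds this to 16000:1

latency_ratio = 250 / 60               # 250 ns (1980) vs. 60 ns (2006)
assert 4 < latency_ratio < 4.2         # roughly the slide's 4:1
```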

Slide 29: Processor-DRAM Performance Gap Impact: Example

To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory. The minimum cost of a full memory access, in wasted CPU cycles (or instructions):

Year   CPU speed (MHz)   CPU cycle (ns)   Memory access (ns)   Minimum CPU cycles wasted
1986   8                 125              190                  190/125 - 1 = 0.5
1989   33                30               165                  165/30 - 1 = 4.5
1992   60                16.6             120                  120/16.6 - 1 = 6.2
1996   200               5                110                  110/5 - 1 = 21
1998   300               3.33             100                  100/3.33 - 1 = 29
2000   1000              1                90                   90/1 - 1 = 89
2003   2000              0.5              80                   80/0.5 - 1 = 159
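The "wasted cycles" column is just the memory access time divided by the CPU cycle time, minus one (the cycle that would have done useful work anyway). A quick check of a few rows:

```python
# Minimum wasted CPU cycles per full memory access, for a CPI = 1 CPU.
def wasted_cycles(cycle_ns, access_ns):
    return access_ns / cycle_ns - 1

# Spot-check against the table above.
assert round(wasted_cycles(125, 190), 1) == 0.5    # 1986
assert round(wasted_cycles(30, 165), 1) == 4.5     # 1989
assert round(wasted_cycles(1, 90)) == 89           # 2000
assert round(wasted_cycles(0.5, 80)) == 159        # 2003
```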

Slide 30: Main Memory

Main memory generally uses Dynamic RAM (DRAM), which uses a single transistor to store a bit but requires a periodic data refresh (~every 8 msec). Cache uses SRAM (Static Random Access Memory): no refresh, but 6 transistors/bit vs. 1 transistor/bit for DRAM.
- Size: DRAM/SRAM ratio is about 4-8
- Cost and cycle time: SRAM/DRAM ratio is about 8-16

Main memory performance:
- Memory latency:
  - Access time: the time between a memory access request and the time the requested information is available to the cache/CPU.
  - Cycle time: the minimum time between requests to memory (greater than access time in DRAM, to allow address lines to be stable).
- Memory bandwidth: the maximum sustained data transfer rate between main memory and cache/CPU.
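Cycle time, not access time, bounds sustained bandwidth, since a new request cannot start until the previous cycle completes. A sketch (the 8-byte bus width is an assumed, illustrative value, not from the slides):

```python
# Peak sustained bandwidth = bus width / cycle time.
def peak_bandwidth_mb_s(bus_width_bytes, cycle_time_ns):
    transfers_per_s = 1e9 / cycle_time_ns    # at most one per cycle
    return bus_width_bytes * transfers_per_s / 1e6

# A DRAM with a 60 ns cycle time (2006 row above) and an assumed
# 8-byte bus sustains at most ~133 MB/s per bank.
bw = peak_bandwidth_mb_s(8, 60)
assert round(bw) == 133
```

This is why real memory systems interleave multiple banks and use burst transfers rather than issuing one full access at a time.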

Slide 31: Architects Use Transistors to Tolerate Slow Memory

Cache: a small, fast memory that holds information expected to be used soon. Mostly successful. Apply recursively: level-one cache(s), level-two cache. Most of a microprocessor's die area is cache!
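"Apply recursively" has a standard quantitative form: the average memory access time (AMAT) of a hierarchy is a level's hit time plus its miss rate times the AMAT of the next level. A sketch with illustrative numbers (not from the slides):

```python
# AMAT = hit_time + miss_rate * (AMAT of the next level), applied
# recursively down the cache hierarchy to main memory.
def amat(levels, memory_ns):
    # levels: list of (hit_time_ns, miss_rate) pairs, fastest level first
    if not levels:
        return memory_ns
    hit_time, miss_rate = levels[0]
    return hit_time + miss_rate * amat(levels[1:], memory_ns)

# Illustrative numbers: L1 1 ns / 10% miss, L2 10 ns / 20% miss,
# memory 100 ns. Most accesses never pay the 100 ns memory latency.
t = amat([(1, 0.10), (10, 0.20)], 100)
assert abs(t - 4.0) < 1e-9    # L2: 10 + 0.2*100 = 30; L1: 1 + 0.1*30 = 4
```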

