Duality Cache for Data Parallel Acceleration
Daichi Fujiki, Scott Mahlke, Reetuparna Das, University of Michigan, M-Bits Research Group
  • Duality Cache for Data Parallel Acceleration

    Daichi Fujiki
    Scott Mahlke
    Reetuparna Das

    University of Michigan, M-Bits Research Group

  • Duality = Storage + Compute

    Why compute in-cache?

  • Transforming caches into massively parallel vector ALUs

    [Figure: an 18-core Xeon processor with a 45 MB LLC. Each 2.5 MB LLC slice
    (Way 1 ... Way 20, C-BOX, TMU) contains 32 kB data banks built from 8 kB
    SRAM arrays. A bitline ALU attached to the sense amplifiers (BL/BLB, Vref,
    single-ended SAs) produces A&B, A^B, ~A&~B, the carry (Cin/Cout), and the
    sum S = A^B^C. Operands are stored bit-sliced along the wordlines, so
    Array A + Array B is computed bit-serially across the bitlines.]

    18 LLC slices, 360 ways, 5,760 arrays, 1,474,560 ALUs

  • Transforming caches into massively parallel vector ALUs (continued)

    [Figure: the same LLC slice / SRAM array / bitline ALU breakdown as on the
    previous slide.]

    45 MB LLC → 18 LLC slices → 360 ways → 5,760 arrays → 1,474,560 ALUs

    Passive Last Level Cache transformed into ~1 million bit-serial active ALUs
    ✓ Add  ✓ Multiply  ✓ Divide
    Bit-serial operation @ 2.5 GHz
    Fixed Point (Configurable Precision)

  • Neural Cache [ISCA '18]

    SRAM: digital bit-serial operations @ 2.5 GHz
    Supported operations: ✓ integer operations
    Programming model: ✓ manually mapped CNN kernels

    CPU + Neural Cache: 1 million bit-line ALUs, 35 MB LLC
    Cache Mode ↔ Accelerator Mode
    Target workload: neural networks

  • Duality Cache

    SRAM: digital bit-serial operations @ 2.5 GHz
    Supported operations:
    ✓ Integer operations
    ✓ Floating point operations
    ✓ Transcendental functions (sin, cos, log, etc.)
    Programming model:
    ✓ SIMT & VLIW execution
    ✓ CUDA / OpenACC programs

    CPU + Duality Cache: 1 million bit-line ALUs, 35 MB LLC
    Cache Mode ↔ Accelerator Mode
    Target workload: data parallel applications

  • Duality Cache Benefits

    Baseline: CPU + SRAM (LLC) + DDR4. A GPU system adds a GPU with GDDR5 over
    PCIe; a Duality Cache system stays CPU + DualityCache + DDR4.

    1. Reduced data movement: ✓ no memcpy; efficient serial/parallel interleaving.
    2. Cost: GPU +471 mm², +250 W  vs.  ✓ Duality Cache +30 mm², +6 W (~3.5%).
    3. On-chip memory capacity: ✓ ~10x thread capacity; flexible cache allocation.

  • Outline

    Background

    Integer / Logical Operations

    Floating Point Operations

    Transcendental Functions

    Execution Model

    Programming Model

    Compiler

    Methodology / Results


  • Logical Operations In-SRAM

    [Figure: an SRAM array with bitlines (BL0/BLB0 ... BLn/BLBn), wordlines, a
    row decoder, and differential sense amplifiers.]

    Changes:
    • Additional row decoder
    • Reconfigurable sense amplifiers (single-ended, referenced to Vref)

  • Logical Operations In-SRAM: A AND B

    [Figure: rows A and B are activated together; the single-ended sense
    amplifiers (referenced to Vref) sense the bitline directly, yielding
    A AND B.]

  • Logical Operations In-SRAM: A NOR B

    [Figure: with the same two rows active, the bitline yields A AND B while the
    complementary bitline yields A NOR B.]

  • Bit-Serial Integer Operations: A + B

    [Figure: arrays A and B hold transposed data; each word (Word 0 ... Word 3)
    lies vertically along a bitline, so one wordline holds the same bit position
    of every word. The row decoders (0-255) step through the bit positions while
    the bitline ALU latches Sum and Carry.]

    In-memory full adder, one bit position per cycle (sketched in code below):
      sᵢ = aᵢ ⊕ bᵢ ⊕ cᵢ₋₁
      cᵢ = aᵢ·bᵢ + (aᵢ ⊕ bᵢ)·cᵢ₋₁
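    To make the bit-serial dataflow concrete, here is a minimal host-side C++
    sketch (illustrative only, not the authors' hardware sequencing) that
    emulates one lane of the in-cache full adder:

        #include <cstdint>
        #include <cstdio>

        // One SIMD lane of the bit-serial adder: each loop iteration models one
        // row access of the transposed arrays and produces one sum bit.
        uint32_t bit_serial_add(uint32_t a, uint32_t b, int nbits = 32) {
            uint32_t sum = 0, carry = 0;                  // carry latch in the bitline ALU
            for (int i = 0; i < nbits; ++i) {             // one cycle per bit position
                uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
                uint32_t si = ai ^ bi ^ carry;            // s_i = a_i ^ b_i ^ c_{i-1}
                carry = (ai & bi) | ((ai ^ bi) & carry);  // c_i
                sum |= si << i;                           // written back to the result array
            }
            return sum;
        }

        int main() { printf("%u\n", bit_serial_add(9, 25)); }  // prints 34

    In the cache, the same loop runs simultaneously on all 256 bitlines of every
    array, which is where the ~1 million-lane parallelism comes from.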

  • Bit-Serial Integer Operations: A × B

    [Figure: the same transposed arrays A and B, plus a Tag row used for
    predication; the bitline ALU accumulates Sum and Carry.]

    Multiplication as predicated shift-and-add of partial products (sketched below):
      pp⁽ⁱ⁾ = aᵢ × b × 2ⁱ        (partial product; the bit aᵢ, copied to the tag, is the predicate)
      p⁽¹⁾ = pp⁽¹⁾
      p⁽²⁾ = p⁽¹⁾ + pp⁽²⁾
      ...
      p⁽ⁿ⁾ = p⁽ⁿ⁻¹⁾ + pp⁽ⁿ⁾

    Notation: aᵢ := i-th bit of bit vector a;  a⁽ⁱ⁾ := bit vector a after the i-th iteration.
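    A corresponding host-side sketch (again illustrative, not the in-cache
    microcode): the tag bit aᵢ predicates whether partial product b·2ⁱ is added
    in iteration i.

        #include <cstdint>
        #include <cstdio>

        // Bit-serial multiply by predicated shift-and-add; in hardware each
        // accumulation is itself an O(n) bit-serial addition pass.
        uint64_t bit_serial_mul(uint32_t a, uint32_t b, int nbits = 32) {
            uint64_t p = 0;                               // running product p^(i)
            for (int i = 0; i < nbits; ++i) {
                uint32_t tag = (a >> i) & 1;              // predication tag
                if (tag)                                  // a zero tag means nothing to add
                    p += (uint64_t)b << i;                // pp^(i) = a_i * b * 2^i
            }
            return p;
        }

        int main() { printf("%llu\n", (unsigned long long)bit_serial_mul(6, 7)); }  // prints 42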

  • Bit-Serial Floating Point Operations: A + B

    Worked example (decimal analogy, sign / exponent / mantissa per SIMD lane):
      SIMD lane 1:  +9.25000 × 10²  +  +2.50000 × 10¹    (|ediff| = 1)
      SIMD lane 2:  -1.02500 × 10⁴  +  +2.50000 × 10²    (|ediff| = 2)

    ① Mantissa denormalization: shift the smaller mantissa right by |ediff|
      Lane 1: +9.25000 × 10²  and  +0.25000 × 10²   (1-bit shift)
      Lane 2: -1.02500 × 10⁴  and  +0.02500 × 10⁴   (2-bit shift)

    ② Convert the mantissas into 2's complement format
      Lane 2's negative mantissa becomes 998.975 × 10⁴ (complement form in the decimal analogy).

    ! Up to mnt_bits × #lanes cycles for the shift operations.
    ! Conversion cost (signed ↔ 2's complement).

  • Bit-Serial Floating Point Operations: A + B (optimizations)

    Same example; ediff = 1 for lane 1, ediff = 2 for lane 2.

    ✓ Preemptive conversion to 2's complement format: mantissas are stored in
      complement form (msb + mnt), removing the per-operation sign conversion.

    ✓ Unique ediff enumeration using CAM search: rather than shifting every lane
      separately, search for the distinct ediff values across all lanes
      (Uniq ediffs = {1, 2}) and perform the ① denormalization shifts once per
      unique ediff.

  • Bit-Serial Floating Point Operations: A + B (alignment shift cost)

    Naive alignment across 256 SIMD lanes, each lane reading its own |ediff| and
    shifting independently: 8-cycle ediff read + 23-cycle mantissa shift per
    lane → 7,936 cycles in total (up to mnt_bits × #lanes cycles of shift ops).

    ✓ With unique ediff enumeration via CAM search: one 23-cycle mantissa shift
      per unique ediff (here {1, 2}) plus the CAM-search cycles.

  • Bit-Serial Floating Point Operations: A + B (continued)

    ③ Perform addition:
      Lane 1: 9.25000 + 0.25000 = 10.0000 × 10²
      Lane 2: 998.975 + 0.02500 = 999.000 × 10⁴
    ④ Convert back to sign expression: +10.0000 × 10²,  -1.00000 × 10⁴
    ⑤ Normalization: +1.0000 × 10³,  -1.00000 × 10⁴

  • Bit-Serial Floating Point Operations: A + B (reconversion)

    ③ Perform addition, ④ convert back to sign expression, ⑤ normalization, as
    on the previous slide.

    ✓ Compiler-directed reconversion (2's complement → signed) for step ④.

  • Bit-Serial Floating Point Operations: A + B (fused shift+add)

    ③ Perform shift+add, fusing the alignment shift into the addition:
        sᵢ = aᵢ + b[i + ediff]
      Lane 1: 10.0000 × 10²;   Lane 2: 999.000 × 10⁴
    ④ Partial normalization: 1.0000 × 10³;   999.000 × 10⁴

  • Bit-Serial Floating Point Addition: Full Flow

    [Figure: transposed arrays A and B with sgn / exp / mnt fields per word; the
    row decoders (0-255) and the bitline ALU's Sum / Carry latches produce
    S = A + B.]

    1. Convert into 2's complement.
    2. Swap operands: compute ediff = exp_a - exp_b; if ediff[i] < 0 then swap(A[i], B[i]).
    3. Enumerate the unique ediff values using search: e.g. uniq_ediff = {1}.
    4. ARSHADD: for each uniq_ediff, compute A[i] + (B[i] >> ediff).
    5. Normalize exp: if bit_overflow then exp_c = exp_a + 1; mnt_c >>= 1.

    (A software sketch of this flow follows below.)
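    The flow can be summarized in a small C++ sketch, a simplification under an
    assumed value = mnt × 2^exp representation; Lane and fp_add_lanes are
    illustrative names, not the authors' code. The point is that the alignment
    shift runs once per unique ediff rather than once per lane.

        #include <cstdint>
        #include <cstdio>
        #include <cmath>
        #include <set>
        #include <utility>
        #include <vector>

        struct Lane { int64_t mnt; int exp; };           // value = mnt * 2^exp, mnt already signed

        void fp_add_lanes(std::vector<Lane>& A, std::vector<Lane>& B, std::vector<Lane>& S) {
            std::vector<int> ediff(A.size());
            for (size_t i = 0; i < A.size(); ++i) {      // 2. swap so A has the larger exponent
                if (A[i].exp < B[i].exp) std::swap(A[i], B[i]);
                ediff[i] = A[i].exp - B[i].exp;
            }
            std::set<int> uniq(ediff.begin(), ediff.end());  // 3. unique ediffs (stand-in for CAM search)
            S.resize(A.size());
            for (int d : uniq)                           // 4. one ARSHADD pass per unique ediff
                for (size_t i = 0; i < A.size(); ++i)
                    if (ediff[i] == d)
                        S[i] = { A[i].mnt + (B[i].mnt >> d), A[i].exp };
            // 5. normalization of S (mnt/exp adjustment) omitted for brevity
        }

        int main() {
            std::vector<Lane> A = {{ 9, 4}, {-5, 3}};    // 144, -40
            std::vector<Lane> B = {{12, 2}, {16, 1}};    //  48,  32
            std::vector<Lane> S;
            fp_add_lanes(A, B, S);
            for (auto& s : S)                            // expect 12 x 2^4 = 192 and -1 x 2^3 = -8
                printf("%lld x 2^%d = %g\n", (long long)s.mnt, s.exp,
                       std::ldexp((double)s.mnt, s.exp));
        }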

  • Latency Optimizations (Integer, FP)

    Leading Zero Search (LZS): search the leading zeros across all SIMD lanes
    (e.g. operands 0000101101, 0000111101, ...) to skip inner-loop iterations
    and reduce inner-loop cycles.
      e.g. multiplication: O(n²) → O(na × nb), where na and nb are the
      significant bit lengths of the operands (001010 × 000101).

    Zero Tag Search (ZTS): check whether the tag bits of all lanes are zero and,
    if so, skip that inner-loop iteration. Useful for (constant) integer
    multiplications.

    (Both searches are sketched in software below.)
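    A software stand-in for the two searches (illustrative; in hardware both are
    search / wired-OR operations over the tag bits of all lanes):

        #include <cstdint>
        #include <cstdio>
        #include <vector>

        // LZS bounds the loop by the longest significant operand in the batch;
        // ZTS skips an iteration when every lane's tag bit (a_i) is zero.
        std::vector<uint64_t> simd_mul(const std::vector<uint32_t>& a,
                                       const std::vector<uint32_t>& b) {
            std::vector<uint64_t> p(a.size(), 0);
            uint32_t any_a = 0;
            for (uint32_t x : a) any_a |= x;
            int n = 0;
            for (uint32_t v = any_a; v; v >>= 1) ++n;        // Leading Zero Search: max bit length
            for (int i = 0; i < n; ++i) {                    // outer loop bounded by LZS
                uint32_t tags = 0;
                for (uint32_t x : a) tags |= (x >> i) & 1;   // Zero Tag Search: wired-OR of tags
                if (!tags) continue;                         // all tags zero -> skip this iteration
                for (size_t l = 0; l < a.size(); ++l)
                    if ((a[l] >> i) & 1) p[l] += (uint64_t)b[l] << i;
            }
            return p;
        }

        int main() {
            auto p = simd_mul({10, 5}, {5, 6});
            printf("%llu %llu\n", (unsigned long long)p[0], (unsigned long long)p[1]);  // 50 30
        }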

  • Optimizations with CAM Search

    [Figure: a search string (e.g. 110010) is applied against the transposed bit
    columns of the array; the sense amplifiers produce a per-column match tag.]

    Cycle 1: reduction AND over the matched bit positions → partial tag.
    A full search takes 2 cycles (2-cycle CAM).

  • Optimizations with CAM Search (continued)

    Cycle 2: reduction NOR; combined with the cycle-1 reduction AND it gives the
    exact-match tag for every column (2-cycle CAM).

    Ediff enumeration (sketched below):
    1. Perform a leading zero search to limit the search space for the CAM
       search (e.g. ∀ediff < 8 ⇒ 4-bit CAM search).
    2. Perform the CAM search over the ediff vector for values 0 ≤ i ≤ 2ⁿ.
    3. If there is any hit (wired-OR of the tags), perform ARSHADD for that ediff.
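    The enumeration loop itself is simple; the CAM search is what makes each
    membership test cheap. A software stand-in (illustrative values and names):

        #include <cstdio>
        #include <vector>

        int main() {
            std::vector<int> ediff = {1, 2, 2, 1, 2};    // per-lane exponent differences
            int nbits = 3;                               // LZS bounded the ediff values: keys 0..7
            for (int key = 0; key < (1 << nbits); ++key) {   // 2. CAM search for each candidate value
                bool any_hit = false;
                for (size_t i = 0; i < ediff.size(); ++i)
                    any_hit |= (ediff[i] == key);        // 2-cycle in-SRAM compare, modeled as ==
                if (any_hit)                             // 3. wired-OR of the tags
                    printf("ARSHADD pass for ediff = %d\n", key);   // runs on the matching lanes
            }
        }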

  • Supported In-Cache Operations

    Operation     Type          Algorithm          Latency   Optimizations
    add, sub      uint, int     [2], Bit-serial    O(n)      -
    mul           uint, int     [2], Bit-serial    O(n²)     LZS (multiplicand), ZTS
    div, rem      uint, int     [2], Bit-serial    O(n²)     LZS (dividend, divisor), ZTS, ZRS
    and, or, xor  uint          [1]                O(n)      -
    shl, shr      uint, int     Bit-serial         O(n²)     LZS, CAM Search
    add, sub      float         Bit-serial         O(n²)     LZS, CAM Search
    mul           float         Bit-serial         O(n²)     ZTS
    div           float         Bit-serial         O(n²)     ZTS, ZRS
    sin, cos      fixed point   CORDIC             O(nk)     Parallel CORDIC
    exp           fixed point   CORDIC             O(nk)     Parallel CORDIC
    log           fixed point   CORDIC             O(nk)     Parallel CORDIC
    sqrt          fixed point   CORDIC             O(nk)     Parallel CORDIC
    rsqrt         float         Fast Inv Sqrt      O(n²)     -

    Algorithm: [1] Compute Caches (HPCA'17), [2] Neural Cache (ISCA'18)
    Latency: n = data bit length, k = CORDIC iteration count
    Optimizations: LZS = Leading Zero Search, ZTS = Zero Tag Search, ZRS = Zero Residue Search

  • Transcendental functions using CORDIC

    Express an angle θ (0 ≤ θ < π/2) as a signed sum of a fixed series:
      θ = Σᵢ ±αᵢ

    Vector rotation requires multiplications with tan(αᵢ):
      [xᵢ]   [cos αᵢ   -sin αᵢ] [xᵢ₋₁]              [1        -tan αᵢ] [xᵢ₋₁]
      [yᵢ] = [sin αᵢ    cos αᵢ] [yᵢ₋₁]  = cos(αᵢ) · [tan αᵢ    1     ] [yᵢ₋₁]

    CORDIC performs pseudo-rotations with tan αᵢ = ±2⁻ⁱ:
      xᵢ = K (xᵢ₋₁ ∓ yᵢ₋₁ · 2⁻ⁱ)
      yᵢ = K (yᵢ₋₁ ± xᵢ₋₁ · 2⁻ⁱ),    θ = Σᵢ ±arctan(2⁻ⁱ)

    Benefits:
    • No multiplications (only shifts and adds).
    • Can be combined with the bit-serial algorithms.
    • Constant, static αᵢ series for all possible inputs: no LUT required.
    • Highly parallelizable for a large SIMD processor (operation-level parallelism).

    (A scalar sketch of the recurrence follows below.)
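    A scalar sketch of the rotation-mode recurrence, written with doubles for
    readability; the in-cache version runs the same shift/add recurrence in
    fixed point, bit-serially, across all lanes:

        #include <cmath>
        #include <cstdio>

        // CORDIC rotation mode: after enough iterations x -> cos(theta), y -> sin(theta).
        void cordic_sincos(double theta, int iters, double* s, double* c) {
            double x = 1.0, y = 0.0, z = theta, K = 1.0;
            for (int i = 0; i < iters; ++i) {
                double t = std::ldexp(1.0, -i);          // 2^-i : a shift in fixed point
                double alpha = std::atan(t);             // constant alpha_i series, same for every input
                int sigma = (z >= 0) ? 1 : -1;           // rotation direction
                double xn = x - sigma * y * t;           // pseudo-rotation: shifts and adds only
                double yn = y + sigma * x * t;
                z -= sigma * alpha;
                x = xn; y = yn;
                K *= 1.0 / std::sqrt(1.0 + t * t);       // accumulated cos(alpha_i) gain (a constant)
            }
            *c = x * K; *s = y * K;
        }

        int main() {
            double s, c;
            cordic_sincos(0.5, 24, &s, &c);
            printf("sin=%f cos=%f (ref %f %f)\n", s, c, std::sin(0.5), std::cos(0.5));
        }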

  • Outline

    Background

    Integer / Logical Operations

    Floating Point Operations

    Transcendental Functions

    Execution Model

    Programming Model

    Compiler

    Methodology / Results


  • Execution Model

    [Figure: a 2.5 MB LLC slice (Way 1 ... Way 20, C-BOX, TMU). One Duality
    Cache unit consists of a C-Box together with a way's tag and four data
    banks.]

  • Execution Model (continued)

    Cache subarrays: each subarray column provides 8 × 32-bit registers, and a
    subarray hosts 256 threads, with the 32-bit data stored bit-serially along
    the bitlines.

  • Execution Model (continued)

    One thread owns 32 × 32-bit registers (8 × 32-bit registers in each of a
    bank's four subarrays).

  • Execution Model (continued)

    256 threads / bank; 1 thread = 32 × 32-bit registers.
    4-wide VLIW-style instruction issue (e.g. bundle: add | sub | load | nop),
    one slot per bank (Banks 1-4).
    → A C-Box operates as a 1,024-thread, 4-wide VLIW SIMD processor.
    (A toy model of this issue scheme is sketched below.)
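    As a rough mental model only (illustrative types and names, not the authors'
    microarchitecture): one C-Box issues a 4-slot bundle per step, and every
    slot drives one bank's 256 threads in SIMD.

        #include <cstdio>
        #include <vector>

        enum Op { NOP, ADD, SUB, LOAD };
        struct Bundle { Op slot[4]; };                   // e.g. { ADD, SUB, LOAD, NOP }

        void issue(const Bundle& b, std::vector<std::vector<int>>& bank_regs) {
            for (int bank = 0; bank < 4; ++bank)         // one VLIW slot per bank
                for (int t = 0; t < 256; ++t)            // 256 threads per bank, SIMD
                    switch (b.slot[bank]) {
                        case ADD:  bank_regs[bank][t] += 1; break;   // stand-in for a real ALU op
                        case SUB:  bank_regs[bank][t] -= 1; break;
                        case LOAD: /* TMU transpose + register fill, omitted */ break;
                        case NOP:  break;
                    }
        }

        int main() {
            std::vector<std::vector<int>> regs(4, std::vector<int>(256, 0));
            issue({{ADD, SUB, LOAD, NOP}}, regs);        // one bundle across 1,024 threads
            printf("%d %d\n", regs[0][0], regs[1][0]);   // 1 -1
        }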

  • Execution Model (continued)

    Instruction issue: each of the four VLIW slots has an instruction window
    (Window 1-4) with its own FSM / command select. Instructions carry tags
    (e.g. add //!1, sub //!2, ...), and a tag compare in front of the decoder
    steers each instruction to its window.

  • Execution Model (continued)

    Program control: a program counter (PC+1 sequencing, jmp_en for branches)
    indexes a 2,048-entry instruction store (0-2047) that feeds the decoder and
    the four windows.

  • Execution Model (continued)

    The C-Box also houses the TMU (Transposing Memory Unit) and an MSHR, which
    connect the compute banks to the LLC / memory for loads and stores.

  • Programming Model

    SIMT: a grid of independent CTAs (thread blocks), each consisting of warps of threads.

    // kernel
    __global__ void vecadd(const float* A, const float* B, float* C, int n_el) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        if (tid < n_el)
            C[tid] = A[tid] + B[tid];
    }

    // kernel invocation (grid / block dimensions chosen by the caller)
    vecadd<<<n_blocks, n_threads>>>(A, B, C, n_el);

  • Programming Model: OpenACC

    A simple and seamless directive-based parallel programming paradigm for
    heterogeneous architectures; directives handle data movement management,
    parallel regions, and loop mapping optimization. (An example follows below.)

    Target architectures: native x86, NVIDIA GPU, OpenCL
    Compiler support: PGI, gcc 9.1, RIKEN Omni, OpenUH
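    For reference, the vecadd kernel from the previous slide written with
    OpenACC directives (a sketch of the directive-based model, not code from the
    talk):

        #include <cstdio>

        void vecadd(const float* A, const float* B, float* C, int n_el) {
            // data movement management + parallel region + loop mapping via one directive
            #pragma acc parallel loop copyin(A[0:n_el], B[0:n_el]) copyout(C[0:n_el])
            for (int i = 0; i < n_el; ++i)
                C[i] = A[i] + B[i];
        }

        int main() {
            float A[4] = {1, 2, 3, 4}, B[4] = {4, 3, 2, 1}, C[4];
            vecadd(A, B, C, 4);
            printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]);   // 5 5 5 5
        }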

  • Programming Model (continued)

    Independent CTAs (thread blocks) are distributed across Duality Cache units
    (C-Box + banks); inter-thread communication between units goes through a
    crossbar (XBar).

  • Resource Comparison

    CPU + Duality Cache (Xeon E5, 35 MB LLC) vs. GPU (NVIDIA GP100) SMs.

  • Resource Comparison (continued)

    [Figure: one Duality Cache unit (C-Box, tag, and the four banks of a way in
    the Xeon E5 35 MB LLC) next to an NVIDIA GP100 SM (register files, warp
    schedulers, two 32-lane core groups, shared memory / L1).]

    Capacity: 4 Duality Cache banks ≥ one SM register file.
    PU resources: 256 PUs per bank, 1,024 threads per Duality Cache unit with
    32 × 32-bit registers per thread, vs. the SM's 32-lane core groups.

  • Resource Comparison (continued)

    Per unit: 256 PUs × 4 banks and 1,024 threads (32 × 32-bit registers per
    thread) for a Duality Cache unit vs. 32-lane core groups per SM.

    System resources (2-socket CPU + GPU): scaled across the whole system
    (× 560 vs. × 28 units), Duality Cache offers 150x more PUs and 10x more
    threads.

  • Compiler

    CUDA programs: nvcc (NVIDIA CUDA compiler) → CUDA binary (ELF with PTX and
    SASS) → CUDA Runtime.
    OpenACC programs: ompcc (RIKEN Omni OpenACC compiler frontend).

    Duality Cache Compiler (built on the GPU Ocelot framework), with Kernel
    Analysis, PTX Optimizer, Instruction Scheduler, and Resource Allocator
    stages → DC binary (ELF with DC-PTX) → DC Runtime.

  • Compiler Optimization (1): Register Pressure Aware Instruction Scheduling

    Instruction scheduling first: maximizes parallelism by exploiting the
    abundant shared register resources of the VLIW architecture,
    ! but results in frequent register spills.

    Resource allocation first: optimizes for small resource usage and short
    reuse distances,
    ! but introduces many false dependencies (less parallelism).

    Duality Cache compiler: interactively performs resource allocation and
    instruction scheduling; a Bottom-Up Greedy (BUG) based scheduler works
    together with a Linear Scan Register Allocation based allocator.

  • Compiler Optimization (1): Register Pressure Aware Instruction Scheduling (continued)

    [Figure: the CFG (BB1-BB4) is lowered to per-basic-block DFGs (BB-DFG) and
    scheduled onto PUs. A schedule that ignores operand placement suffers
    network delay, register spills, and a large execution time ("Bad"); a
    schedule that accounts for the live-in / live-out values at each PU avoids
    them ("Good").]

  • Compiler Optimization (1): Bottom-Up Greedy [Ellis 1986]

    1. Collect candidate assignments.
    2. Make final assignments.

    Minimize data transfer latency by taking both operand and successor
    locations into consideration, while tracking register pressure (live-in /
    live-out values per PU over time).

  • Compiler Optimization (2): PTX Optimizations

    I. AST Balancing
       Folded associative expressions such as a + b + c + d are parsed as a
       left-leaning chain (((a + b) + c) + d). The optimizer regenerates a
       balanced AST subtree, e.g. (a + b) + (c + d), targeting the maximum
       ILP of 4. (Illustrated in code below.)

  • Compiler Optimization (2): PTX Optimizations (continued)

    I.  AST Balancing: folded associative expressions are regenerated as a
        balanced AST subtree targeting max ILP = 4.
    II. Thread Independent Variable Isolation

    __device__ void kernel() {
        int bid = blockIdx.x;
        for (int i = 0; i < ITER_CNT; ++i) {
            // Do something
        }
    }

    bid and i are thread-independent variables.
    ⇓
    Loop unroll / constant fold, and do not store them in thread-private
    registers, reducing register pressure.

  • Outline

    Background

    Integer / Logical Operations

    Floating Point Operations

    Transcendental Functions

    Execution Model

    Programming Model

    Compiler

    Methodology / Results


  • Evaluation Methodology

    ♦ Benchmarks
      - Rodinia: backprop, bfs, b+tree, dwt2d, hotspot, hotspot3D, hybridsort,
        nw, streamcluster, gaussian, heartwall, leukocyte, lud, nn
      - PathScale OpenACC benchmark: divergence, gradient, lapgsrb, laplacian,
        tricubic, tricubic2, uxx1, vecadd, wave13pt, gameoflife, gaussblur,
        matvec, whispering

                          CPU (2 sockets)           GPU (1 card, PCIe v3)     Duality Cache
    Processor             Intel Xeon E5-2697 v3,    NVIDIA GP100, 1.6 GHz,    35 MB Duality Cache,
                          2.6 GHz, 28 cores,        3840 CUDA cores           2.6 GHz
                          56 threads
    On-chip memory        78.96 MB                  9.14 MB                   78.96 MB
    Off-chip memory       64 GB DDR4                12 GB GDDR5 +             64 GB DDR4
                                                    64 GB DDR4 (host)
    Total system area     912 mm²                   1383 mm²                  942 mm²
    TDP                   290 W                     640 W                     296 W
    Profiler / Simulator  perf                      NVPROF                    GPU Ocelot + Ramulator
    (performance)
    Profiler / Simulator  Intel RAPL interface      NVIDIA System             Trace-based simulation
    (energy)                                        Management Interface

  • Area / Power Overhead

                          Area (mm²)   Power (W)   Area Overhead
    CPU                   456          145         -
    Compute Cache
      Peripheral          3.15         2.96        0.69%
      TMU                 5.32         0.06        1.17%
      Controller / FSM    6.16         0.33        1.35%
      MSHR                0.86         0.05        0.19%
      Local Crossbar      0.28         0.01        0.06%
    Total                 471.77       148.4       3.50%

    Duality Cache has only a 3.5% area overhead.

  • System Speedup

    CPU, DC: DDR4;  GPU: GDDR5 + memcpy (DDR4 ↔ PCIe v3 ↔ GDDR5).

    Key performance factors:
    • Reduced memcpy time
    • Massively parallel execution
    • Compute / memory access overlap
    • Flexible cache allocation

    [Figure: system speedup of CPU, GPU, and Duality Cache on the Rodinia and
    OpenACC suites (axis up to 5.0x, with callouts of 2.4x, 3.6x, 4.0x, and 20x).]

    Reduced data transfer (memcpy) cost over the external bus (PCIe): a small
    reuse factor and a large working set make memcpy costly for the GPU.

  • Massively Parallel Execution

    Key performance factor: massively parallel execution.

    [Figure: kernel size (#CTAs) per benchmark, with reference lines at
    560 CTAs (2 sockets) and 280 CTAs (1 socket); the speedup chart repeated
    from the previous slide.]

    Some kernels have a high level of parallelism (e.g. backprop, b+tree, nn,
    gaussian, gaussblur, etc.).

    PU utilization: 280 CTAs / socket (DC) vs. 60 CTAs / card (GPU), based on
    register capacity.

  • Kernel Speedup

    DC: GDDR5;  GPU: GDDR5 + no memcpy (kernel time only).

    [Figure: kernel speedup of CPU, GPU, and Duality Cache on Rodinia and OpenACC.]

    Kernels with a high level of parallelism (e.g. backprop, b+tree, nn,
    gaussian, gaussblur, etc.) significantly reduce execution time.
    Memory-bound applications (hotspot3D, streamcluster) would benefit from GDDR5.

  • Operation Latency

    [Figure: latency in cycles (base vs. optimized) for mul, div, fpadd, fpmul,
    and fpdiv.]

    Integer multiplication is often used to calculate addresses or values based
    on induction variables; such operands contain many leading zeros, which the
    leading-zero-search optimization skips (13x faster).

    Floating point addition has a small dynamic range; the number of unique
    ediffs usually peaks at 1 in the distribution (6.1x faster).

  • Conclusion

    Contributions: an in-cache computing framework for general purpose programming.
    • Used the SIMT programming model as the programming frontend.
    • Developed a compiler for in-cache computing.
    • Enhanced computation primitives (all digital).

    Results:
      Performance:          72x over CPU,    4x over GPU
      Energy:               20x over CPU,    5.8x over GPU
      Overall efficiency:   1450x over CPU,  52x over GPU
      Area:                 Duality Cache needs 15.7 mm² (3.5% over the CPU); GPU = 471 mm²


  • Backup Slides


  • DC + CPU Hybrid Execution

    [Figure: normalized execution time of lud for Duality Cache vs. Duality
    Cache + CPU, split into DC compute, DC memory, and CPU time; kernel launch
    patterns (time vs. #CTAs launched) for BFS and LUD.]

    When a kernel launch has enough parallelism, it keeps all PUs busy; small
    kernels are instead executed on CPU threads. For lud, hybrid execution is
    67% better.

  • System Energy

    Because of the reduced execution time, Duality Cache achieves 5.34x energy
    efficiency compared to the GPU. One CPU core remains active during execution
    to serve instructions, which makes the CPU and DRAM accesses dominant in the
    energy consumption.

  • Average Power

    [Figure: average power (W) per benchmark, 0-140 W range.]

    Mean: 88.8 W, Max: 120.1 W

  • Transposing Memory Unit (TMU)

    [Figure: the TMU sits next to the C-BOX in an LLC slice (Way 1 ... Way 20).
    It is built from 8-T transpose bit-cells with both a row decoder and a
    column decoder, and sense amplifiers on two edges, so the same data can be
    accessed with a regular read/write or with a transposed read/write
    (A0[MSB] ... A2[LSB], B0[MSB] ... B2[LSB]).]

  • TMU Transpose Example

    [Figure: words A, B, C enter the TMU in the normal bit-parallel layout
    (A2 A1 A0 | B2 B1 B0 | C2 C1 C0) and are written out transposed, with each
    word's bits laid out serially along its own bitline (A0 A1 A2, B0 B1 B2,
    C0 C1 C2).]

    (A software equivalent of the transpose is sketched below.)
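    Logically, the TMU performs a bit-matrix transpose between the two layouts.
    A software equivalent for 8-bit words (illustrative only; the hardware does
    this with the 8-T transpose bit-cells, not with a loop):

        #include <cstdint>
        #include <cstdio>

        // in[w] holds word w in the normal layout; out[bit] collects bit `bit`
        // of every word, i.e. one transposed wordline per bit position.
        void transpose_words(const uint8_t in[8], uint8_t out[8]) {
            for (int bit = 0; bit < 8; ++bit) {
                uint8_t row = 0;
                for (int w = 0; w < 8; ++w)
                    row |= ((in[w] >> bit) & 1) << w;
                out[bit] = row;
            }
        }

        int main() {
            uint8_t in[8] = {1, 2, 4, 8, 16, 32, 64, 128}, out[8];
            transpose_words(in, out);
            for (int i = 0; i < 8; ++i) printf("%02x ", out[i]);   // 01 02 04 ... (identity pattern)
            printf("\n");
        }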

  • Duality Cache Operation Modes

    [Figure: an 18-core / 18-slice processor shown in four configurations, with
    a legend distinguishing cores running non-DC threads, cores running
    DC-related threads, LLC used for non-DC threads, and LLC used for DC-related
    threads.]

    • CPU only mode
    • Duality Cache only mode
    • Hybrid mode, slice partitioned: the slices are split between normal LLC
      use and Duality Cache (e.g. Slices 1-6 vs. Slices 7-18).
    • Hybrid mode, way partitioned: each slice is split by ways (e.g. 18 ways
      vs. 2 ways); the way mask is implemented with CAT (Cache Allocation
      Technology).

