Anoop Thomas

Transcript
  • 1/42

    HIGH PERFORMANCE AND LOW POWER HYBRID CACHE

    ARCHITECTURES FOR CMPs

    Presented by,

    ANOOP THOMAS
    REG NO: 98911037
    VLSI & EMBEDDED SYSTEMS

    Seminar Guide: Sunith C K

  • 2/42

    Introduction

    A multi-core processor is a single component with two or more independent processor cores.

    Chip Multiprocessor (CMP): the cores are integrated onto a single integrated circuit die.

  • 3/42

    Need for Cache Memory

    Cache memories minimize the performance gap between high-speed processors and slow off-chip memory.

    Cache subsystems, particularly on-chip ones with multiple levels of large caches, are common in CMPs.

  • 4/42

    The processor-memory performance gap. [web]

  • 5/42

    Current Schemes

    Performance can be improved through Non-Uniform Cache Architecture (NUCA).

    A large cache is divided into multiple banks with different access latencies, determined by their physical locations relative to the source of the request.

    Static NUCA (SNUCA): a cache line is statically mapped into banks, with the low-order bits of the index determining the bank (sketched below).

    Dynamic NUCA (DNUCA): any given line can be mapped to one of several banks, based on the mapping policy.
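    A minimal sketch of the static (SNUCA-style) mapping idea. The line size and bank count here are illustrative assumptions, not parameters from the slides; the point is only that the low-order bits of the set index select the bank.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative SNUCA-style static mapping: assume 64-byte lines and
 * 16 banks, and let the low-order bits of the set index pick the bank. */
#define LINE_OFFSET_BITS 6   /* 64-byte cache line          */
#define BANK_BITS        4   /* 16 banks -> 4 selector bits */

static unsigned snuca_bank(uint64_t addr)
{
    uint64_t index = addr >> LINE_OFFSET_BITS;          /* drop the line offset    */
    return (unsigned)(index & ((1u << BANK_BITS) - 1)); /* low-order index bits    */
}

int main(void)
{
    uint64_t addr = 0x0001F2C0;
    printf("address 0x%llx maps to bank %u\n",
           (unsigned long long)addr, snuca_bank(addr));
    return 0;
}
```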

  • 6/42

    NUCA falls short for large caches

    It only exploits the varied access latency of cache banks, due to their physical location, to improve performance.

    All cache banks use the same size, process, and circuit technology.

    The overall cache capacity available is fixed for a given memory technology.

  • 7/42

    Comparison of different memory technologies. [4]

    >No single memory technology by itself is efficient.

  • 8/42

    Memory Hierarchy generally used. [web]

  • 9/42

    Hybrid Cache Memory Architectures

    A cache designed using different memory technologies performs better than one built from a single technology.

    Hybrid Cache Architectures allow levels in the cache hierarchy to be constructed from different memory technologies.

  • 10/42

    Inter-cache HCA (LHCA): the levels in a cache hierarchy can be made of disparate memory technologies.

    Region-based intra-cache HCA (RHCA): a single level of cache can be partitioned into multiple regions, each of a different memory technology.

    STT-RAM together with SRAM can be used to form a hybrid cache architecture for chip multiprocessors with low power consumption and high performance.

  • 11/42

    Overview of LHCA and RHCA. [4]

  • 12/42

    STT-RAM based HCA

    STT-RAM is non-volatile.

    Its read speed is comparable to that of SRAM (depending on the design).

    It has a higher density than SRAM.

    Its disadvantages are a long write latency and high dynamic write power consumption.

  • 13/42

    Background of STT-RAM

    The information carrier inside MRAM is the Magnetic Tunnel Junction (MTJ).

    A conceptual view of MTJ. [4]

  • 14/42

    An illustration of an MRAM cell. [4]

  • 15/42

    The MTJ is the storage element; an NMOS transistor is used as the access controller.

    They are connected in series.

    Write operation
    >A positive voltage difference between the bit line and the source line is applied for writing 0.
    >A negative voltage difference is applied for writing 1.

    Read operation
    >The NMOS is enabled and a voltage (Vbl - Vsl) is applied between BL and SL; it is usually negative and small (a toy behavioural model follows).
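    A toy behavioural model of the 1T-1MTJ access just described, purely for illustration. The type and function names are hypothetical, and the analog sensing path is not modelled; only the polarity convention from the slide (positive difference writes 0, negative writes 1) is captured.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical behavioural model of an STT-RAM (1T-1MTJ) cell. */
typedef struct { bool stored_bit; } mtj_cell;

/* Write: the sign of (V_BL - V_SL) selects the MTJ state. */
static void mtj_write(mtj_cell *c, double v_bl_minus_sl)
{
    c->stored_bit = (v_bl_minus_sl < 0.0);   /* negative difference -> write 1 */
}

/* Read: enable the NMOS, apply a small (usually negative) bias and sense
 * the MTJ resistance; here we simply return the stored bit. */
static bool mtj_read(const mtj_cell *c, double v_bl_minus_sl)
{
    (void)v_bl_minus_sl;                     /* sensing bias not modelled */
    return c->stored_bit;
}

int main(void)
{
    mtj_cell c = { false };
    mtj_write(&c, -0.8);
    printf("read back: %d\n", mtj_read(&c, -0.1));
    return 0;
}
```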

  • 16/42

    Why STT-RAM alone is not used as cache memory

    A large number of writes to the last-level cache (LLC) occurs for most CMP applications.

    Due to the long write latency and very high dynamic write power consumption, using STT-RAM alone is not advisable.

  • 17/42

    Hybrid Cache Architecture using STT-RAM

    STT-RAM and SRAM can be used together to form an HCA.

    STT-RAM has low leakage power and high density.

    With smart cache management policies, low power consumption and high performance can be obtained simultaneously.

  • 18/42

    Basic Architecture

    Each core is configured with private L1 instruction and data caches.

    The shared last-level cache consists of multiple cache banks connected through an interconnection network.

  • 19/42

    Each bank is either an STT-RAM bank or an SRAM bank.

    SRAM banks are shared by all cores.

    STT-RAM banks are logically divided into groups, each private to one core.

    Shared SRAM banks are organized as DNUCA.

  • 20/42

    Hybrid Cache Architecture. [2]

  • 21/42

    The 24 STT-RAM banks are logically divided into 8 groups.

    Each of these groups consists of 3 STT-RAM banks.

    Each core is privately allocated one logical STT-RAM group (a mapping sketch follows).

    SRAM is included to make write operations more efficient.

    SRAM banks are shared by all on-chip cores.
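    A minimal sketch of the bank-to-group-to-core mapping described above (8 cores, 24 STT-RAM banks, 3 banks per private group). The contiguous bank numbering is an assumption for illustration; the real assignment depends on the floorplan.

```c
#include <stdio.h>

#define NUM_CORES          8
#define BANKS_PER_GROUP    3
#define NUM_STTRAM_BANKS  (NUM_CORES * BANKS_PER_GROUP)   /* 24 */

/* Logical STT-RAM group privately owned by a core. */
static int group_of_core(int core) { return core; }

/* The banks that make up one logical group (assumed contiguous). */
static void banks_of_group(int group, int banks[BANKS_PER_GROUP])
{
    for (int i = 0; i < BANKS_PER_GROUP; i++)
        banks[i] = group * BANKS_PER_GROUP + i;
}

int main(void)
{
    for (int core = 0; core < NUM_CORES; core++) {
        int banks[BANKS_PER_GROUP];
        banks_of_group(group_of_core(core), banks);
        printf("core %d -> STT-RAM banks %d, %d, %d\n",
               core, banks[0], banks[1], banks[2]);
    }
    return 0;
}
```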

  • 22/42

    Each hybrid LLC bank is implemented with 4 sub-banks.

    Each STT-RAM sub-bank is configured with a sub-bank write buffer to speed up long-latency write operations (see the sketch below).

    Cache bank structure of the hybrid cache. [2]
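    A minimal sketch of a sub-bank write buffer: a small FIFO sitting in front of an STT-RAM sub-bank so requests do not stall on the slow STT-RAM write. The depth and field layout are illustrative assumptions, not taken from the slides.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WB_DEPTH   4
#define LINE_BYTES 64

typedef struct {
    uint64_t tag;                 /* line address of the pending write */
    uint8_t  data[LINE_BYTES];
    bool     valid;
} wb_entry;

typedef struct {
    wb_entry entries[WB_DEPTH];
    int head, count;
} write_buffer;

/* Accept a write; returns false (back-pressure) when the buffer is full. */
static bool wb_push(write_buffer *wb, uint64_t tag, const uint8_t *line)
{
    if (wb->count == WB_DEPTH)
        return false;
    wb_entry *e = &wb->entries[(wb->head + wb->count) % WB_DEPTH];
    e->tag   = tag;
    e->valid = true;
    memcpy(e->data, line, LINE_BYTES);
    wb->count++;
    return true;
}

/* Reads probe the buffer first so they observe data that is still
 * draining into the slow STT-RAM array. */
static const wb_entry *wb_lookup(const write_buffer *wb, uint64_t tag)
{
    for (int i = 0; i < wb->count; i++) {
        const wb_entry *e = &wb->entries[(wb->head + i) % WB_DEPTH];
        if (e->valid && e->tag == tag)
            return e;
    }
    return NULL;
}

int main(void)
{
    static write_buffer wb;                 /* zero-initialised */
    uint8_t line[LINE_BYTES] = { 0xAB };
    wb_push(&wb, 0x40, line);
    printf("pending write found: %s\n", wb_lookup(&wb, 0x40) ? "yes" : "no");
    return 0;
}
```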

  • 23/42

    Micro-Architectural Mechanisms

    Private STT-RAM groups are used to reduce the high power-consuming remote block accesses.

    For a core running memory-intensive workloads, the private STT-RAM group may not accommodate the large working set.

  • 24/42

    Neighborhood Group Caching (NGC)

    Neighboring cores share their private STT-RAM groups with each other on top of the basic HCA.

    E.g.: core 1 can share its STT-RAM banks with its one-hop neighbors, cores 0, 2 and 5 (see the sketch below).

    Neighborhood sharing obtains a more balanced capacity and access latency between the purely private and purely shared schemes.
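    A minimal sketch of the one-hop neighborhood. The 4x2 mesh of 8 cores is an assumption, chosen because it reproduces the example above (core 1's one-hop neighbors are cores 0, 2 and 5); the actual topology and neighborhood definition are design choices.

```c
#include <stdio.h>

#define MESH_COLS 4
#define MESH_ROWS 2
#define NUM_CORES (MESH_COLS * MESH_ROWS)

/* Collect the one-hop mesh neighbors of `core`; returns how many were found. */
static int one_hop_neighbors(int core, int out[4])
{
    int x = core % MESH_COLS, y = core / MESH_COLS, n = 0;
    if (x > 0)              out[n++] = core - 1;          /* west  */
    if (x < MESH_COLS - 1)  out[n++] = core + 1;          /* east  */
    if (y > 0)              out[n++] = core - MESH_COLS;  /* north */
    if (y < MESH_ROWS - 1)  out[n++] = core + MESH_COLS;  /* south */
    return n;
}

int main(void)
{
    int nbr[4];
    int n = one_hop_neighbors(1, nbr);
    printf("core 1 may borrow STT-RAM group capacity from cores:");
    for (int i = 0; i < n; i++)
        printf(" %d", nbr[i]);
    printf("\n");    /* prints 0, 2 and 5 for the assumed 4x2 mesh */
    return 0;
}
```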

  • 25/42

    NGC is scalable to future CMPs by carefully defining the neighborhood.

    The energy-aware read and write policies help the HCA to optimize power consumption without sacrificing performance.

    The flow graph for the whole micro-architectural mechanism is shown on the next slide.

  • 26/42

    Flow graph of proposed micro-architecture mechanisms. [2]

  • 27/42

    Energy-Aware Write

    When a write miss occurs, the target block is loaded from lower-level memory and placed into an SRAM bank.

    Write hits to SRAM are served directly by the corresponding SRAM bank.

    Write hits to STT-RAM banks are served by the block swapping mechanism (sketched below).
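    A minimal dispatch sketch of the write policy just described. The bank/lookup types and helper functions are hypothetical stubs, not an API from the cited papers; they only make the decision structure explicit.

```c
#include <stdio.h>

/* Hypothetical types for illustrating the write policy. */
typedef enum { BANK_SRAM, BANK_STTRAM } bank_kind;
typedef struct { int hit; bank_kind where; } lookup_result;

static void fill_into_sram(unsigned long a)   { printf("write miss: refill 0x%lx into SRAM\n", a); }
static void write_sram(unsigned long a)       { printf("write hit served by SRAM bank (0x%lx)\n", a); }
static void block_swap_write(unsigned long a) { printf("write hit in STT-RAM: block swapping for 0x%lx\n", a); }

/* Energy-aware write: misses and write-intensive lines are steered to SRAM. */
static void handle_write(unsigned long addr, lookup_result r)
{
    if (!r.hit)
        fill_into_sram(addr);            /* write miss -> place block in SRAM */
    else if (r.where == BANK_SRAM)
        write_sram(addr);                /* cheap SRAM write, serve directly  */
    else
        block_swap_write(addr);          /* STT-RAM write hit -> swap path    */
}

int main(void)
{
    lookup_result miss = { 0, BANK_SRAM }, stt_hit = { 1, BANK_STTRAM };
    handle_write(0x1000, miss);
    handle_write(0x2000, stt_hit);
    return 0;
}
```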

  • 28/42

    Energy-Aware Read

    When a read miss occurs, the target block is fetched from lower-level memory and placed into the local STT-RAM group.

    Read hits to STT-RAM are served by the local group or by the neighboring groups.

    On read hits to an SRAM bank, Active Block Migration is used to serve the request (sketched below).
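    A matching sketch for the read path, using the same kind of hypothetical types and stubs as the write sketch above; only the decision structure comes from the slides.

```c
#include <stdio.h>

typedef enum { BANK_SRAM, BANK_STTRAM } bank_kind;
typedef struct { int hit; bank_kind where; int in_local_group; } lookup_result;

static void fill_into_local_sttram(unsigned long a) { printf("read miss: fetch 0x%lx into local STT-RAM group\n", a); }
static void read_sttram(unsigned long a, int local) { printf("read hit in %s STT-RAM group (0x%lx)\n", local ? "local" : "neighboring", a); }
static void active_block_migration(unsigned long a) { printf("read hit in SRAM: active block migration for 0x%lx\n", a); }

/* Energy-aware read: read-mostly lines end up in dense, low-leakage STT-RAM. */
static void handle_read(unsigned long addr, lookup_result r)
{
    if (!r.hit)
        fill_into_local_sttram(addr);        /* read miss -> local STT-RAM group       */
    else if (r.where == BANK_STTRAM)
        read_sttram(addr, r.in_local_group); /* served by local or neighboring group   */
    else
        active_block_migration(addr);        /* SRAM read hit -> migrate toward STT-RAM */
}

int main(void)
{
    lookup_result miss = { 0, BANK_STTRAM, 1 }, sram_hit = { 1, BANK_SRAM, 0 };
    handle_read(0x3000, miss);
    handle_read(0x4000, sram_hit);
    return 0;
}
```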

  • 29/42

    Block Swapping

    Cache lines with intensive write operations are migrated from STT-RAM to SRAM.

    The migration causes an original line in SRAM to be replaced.

    If the replaced SRAM line is valid, the two lines in SRAM and STT-RAM are swapped.

    Future accesses to this line will then hit in STT-RAM, which reduces long-latency accesses to lower-level memory.

    An invalid line is directly written back to memory.

  • 30/42

    Swapping is activated when a block in STT-RAM is accessed by two consecutive writes or accumulatively accessed by three writes.

    Each cache line is extended with a 2-bit swapping counter and a 1-bit cross-access counter to control data swapping between STT-RAM and SRAM.

    State transitions of block swapping.

  • 31/42

    Once a block is loaded into STT-RAM, both counters are set to zero.

    A block swap occurs when the cross-access counter is 0 and the swapping counter reaches 10, or when the cross-access counter is 1 and the swapping counter reaches 11.

    When a read occurs while the swapping counter is 01, the cross-access counter is set to 1 to indicate that this block is accessed by both read and write operations (sketched below).
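    A sketch of the per-line counter logic described on this and the previous slide. Details the slides do not specify (what happens to the counters after a swap, reads at other counter values) are assumptions here.

```c
#include <stdbool.h>
#include <stdio.h>

/* Per-line state: 2-bit swapping counter, 1-bit cross-access counter. */
typedef struct {
    unsigned swap_cnt  : 2;   /* counts writes to this STT-RAM block */
    unsigned cross_cnt : 1;   /* set when reads mix with the writes  */
} line_state;

static void on_fill(line_state *s)        /* block loaded into STT-RAM */
{
    s->swap_cnt  = 0;
    s->cross_cnt = 0;
}

static void on_read(line_state *s)
{
    if (s->swap_cnt == 1)     /* swapping counter == 01 */
        s->cross_cnt = 1;     /* block sees both reads and writes */
}

/* Returns true when the block should be swapped into SRAM. */
static bool on_write(line_state *s)
{
    if (s->swap_cnt < 3)
        s->swap_cnt++;
    return (s->cross_cnt == 0 && s->swap_cnt == 2) ||   /* counters 0 / 10 */
           (s->cross_cnt == 1 && s->swap_cnt == 3);     /* counters 1 / 11 */
}

int main(void)
{
    line_state s;
    on_fill(&s);
    printf("write 1 -> swap? %d\n", on_write(&s));  /* 0                      */
    printf("write 2 -> swap? %d\n", on_write(&s));  /* 1: two consecutive writes */
    return 0;
}
```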

  • 32/42

    Active Block Migration

    Upon a read hit in SRAM, migration of the cache line from SRAM to STT-RAM occurs.

    Blocks in SRAM are divided into two types:
    >blocks fetched from lower-level memory, and
    >blocks swapped from STT-RAM banks.

    The cross-access counter is used to differentiate these blocks:
    >fetched from lower-level memory: set to 0
    >swapped from STT-RAM: set to 1

  • 33/42

    State transitions of Active Block Migration. [4]

    EAGER ACTIVE MIGRATION: a block fetched from lower-level memory is migrated into STT-RAM when a read request hits on this block.

    LAZY ACTIVE MIGRATION: a block swapped from STT-RAM is migrated back into STT-RAM when it has accumulatively been read twice more than it has been written (a sketch follows).
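    A sketch of the migration decision above. The cross-access bit tells the two SRAM block types apart; the per-line read/write counters used for the lazy condition are assumed bookkeeping, and "read twice more than written" is interpreted here as two more reads than writes.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    unsigned cross_cnt : 1;   /* 0: fetched from lower-level memory,
                                 1: swapped down from STT-RAM          */
    unsigned reads;
    unsigned writes;
} sram_line;

/* Called on a read hit in SRAM: decide whether to migrate to STT-RAM. */
static bool should_migrate(sram_line *b)
{
    b->reads++;
    if (b->cross_cnt == 0)
        return true;                      /* eager: migrate on first read hit  */
    return b->reads >= b->writes + 2;     /* lazy: read "twice more" than written */
}

int main(void)
{
    sram_line fetched = { 0, 0, 0 }, swapped = { 1, 1, 1 };
    printf("fetched block migrates on first read hit: %d\n", should_migrate(&fetched));
    printf("swapped block migrates yet: %d\n", should_migrate(&swapped));
    return 0;
}
```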

  • 34/42

    Results and Analysis

    Main simulation parameters considered. [4]

  • 35/42

    POWER ANALYSIS

    The main power components of STT-RAM are the dynamic power and the leakage power of the peripheral circuits.

    Using STT-RAM together with the low-power micro-architectural mechanisms, the hybrid scheme consumes less power than conventional SNUCA and DNUCA.

  • 36/42

    Power comparison normalized by SNUCA. [2]

  • 37/42

    PERFORMANCE ANALYSIS

    The performance of the hybrid scheme is better than that of conventional SNUCA and DNUCA.

    Block replication causes a large number of low-latency local hits in the private STT-RAM groups, and hence the IPC is improved.

    Due to the high density of STT-RAM and the capacity efficiency of the NGC scheme, the hybrid scheme reduces the massive long-latency on-chip remote accesses and off-chip accesses during execution.

  • 38/42

    Average IPC comparison normalized by SNUCA. [2]

  • 39/42

    Conclusion

    The HCA greatly reduces power and increases performance when compared to a conventional SRAM-only on-chip cache.

    By combining various memory technologies, a cache system with better performance can be built.

    With the help of the proposed micro-architectural mechanisms, the hybrid scheme is adaptive to variations in workloads.

  • 40/42

    References

    [1] Frank Vahid, Tony D. Givargis. Embedded System Design: A Unified Hardware/Software Introduction.

    [2] Jianhua Li, C. J. Xue, Yinlong Xu. STT-RAM based energy-efficiency hybrid cache for CMPs. In VLSI and System-on-Chip (VLSI-SoC), 2011 IEEE/IFIP 19th International Conference, pages 31-36, 2011.

    [3] M. Hosomi, H. Yamagishi, T. Yamamoto, K. Bessho, Y. Higo, K. Yamane, et al. A novel nonvolatile memory with spin torque transfer magnetization switching: spin-RAM. In IEEE International Electron Devices Meeting, pages 459-462, 2005.

    [4] Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ramakrishnan Rajamony, Yuan Xie. Hybrid Cache Architecture with disparate memory technologies. In ISCA '09: Proceedings of the 36th Annual International Symposium on Computer Architecture, 2009.
    Online available: isca09.cs.columbia.edu/pres/04.pdf

  • 41/42

    References contd.

    [5] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen. A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In The 16th IEEE International Symposium on High-Performance Computer Architecture, pages 239-249, 2009.

    [6] Video lecture on Digital Computer Organization, Lec-18: Cache Memory Architecture.
    Online available: http://nptel.iitm.ac.in/video.php?subjectId=117105078

  • 42/42

    THANK YOU
