Embedded Memories based on SOC (VLSI Seminar)

Dept. of ECE, RVCE Bangalore

    Introduction

    Embedded system functionality has three aspects:

    Processing: processors transform data.

    Storage: memory retains data.

    Communication: buses transfer data.

    Memory: basic concept

    Stores a large number of bits.

    m x n: m words of n bits each.

    k = log2(m) address input signals, or m = 2^k words.

    e.g., a 4,096 x 8 memory has:

    32,768 bits

    12 address input signals

    8 input/output data signals

    Memory access

    r/w: selects read or write.

    enable: a read or write occurs only when asserted.

    multiport: multiple accesses to different locations simultaneously (a sketch of these signals follows).
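    As a rough illustration of the access signals above (not from the slides; the function name mem_access and the behavior on a de-asserted enable are assumptions), here is a minimal C sketch of a 4,096 x 8 memory with r/w and enable inputs:

        #include <stdint.h>

        #define K 12                    /* k address bits               */
        #define M (1u << K)             /* m = 2^k = 4,096 words        */

        static uint8_t mem[M];          /* 4,096 x 8 = 32,768 bits      */

        /* One access: when 'enable' is asserted, rw = 1 reads the stored
         * word and rw = 0 writes 'data'. With enable de-asserted nothing
         * happens (0 is returned here as a stand-in for a floating bus). */
        uint8_t mem_access(uint16_t addr, uint8_t data, int rw, int enable)
        {
            if (!enable)
                return 0;
            addr &= M - 1u;             /* keep only the 12 address bits */
            if (rw)
                return mem[addr];       /* read  */
            mem[addr] = data;           /* write */
            return data;
        }

    A multiport memory would simply expose several such address/data ports that can index the same array in the same cycle.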


    1. Write Ability and Storage Permanence

    Traditional ROM/RAM distinctions:

    ROM: read only, bits stored without power.

    RAM: read and write, loses stored bits without power.

    The traditional distinctions have blurred:

    advanced ROMs can be written to, e.g., EEPROM;

    advanced RAMs can hold bits without power, e.g., NVRAM.

    Write ability: the manner and speed with which a memory can be written.

    Storage permanence: the ability of a memory to hold stored bits after they are written.


    Write ability

    Ranges of write ability:

    High end: the processor writes to memory simply and quickly, e.g., RAM.

    Middle range: the processor writes to memory, but more slowly, e.g., FLASH, EEPROM.

    Lower range: special equipment (a "programmer") must be used to write to memory, e.g., EPROM, OTP ROM.

    Low end: bits are stored only during fabrication, e.g., mask-programmed ROM.

    In-system programmable memory: can be written to by a processor in the embedded system using the memory; this covers memories in the high end and middle range of write ability.

    Storage permanence

    Ranges of storage permanence:

    High end: essentially never loses bits, e.g., mask-programmed ROM.

    Middle range: holds bits for days, months, or years after the memory's power source is turned off, e.g., NVRAM.

    Lower range: holds bits as long as power is supplied to the memory, e.g., SRAM.

    Low end: begins to lose bits almost immediately after they are written, e.g., DRAM.


    2. ROM: Read-Only Memory

    Nonvolatile memory: holds bits after power is no longer supplied (high end and middle range of storage permanence).

    Can be read from, but not written to, by a processor in an embedded system; traditionally written to ("programmed") before being inserted into the embedded system.

    Uses:

    Store the software program for a general-purpose processor (program instructions can occupy one or more ROM words).

    Store constant data needed by the system.

    Implement combinational circuits.

    Example: 8 x 4 ROM

    Horizontal lines = words; vertical lines = data.

    Lines are connected only at circles (programmed connections).

    The decoder sets word 2's line to 1 when the address input is 010.

    Data lines Q3 and Q1 are set to 1 because there is a programmed connection with word 2's line; word 2 is not connected to data lines Q2 and Q0.

    Output is 1010 (a software view of this ROM appears below).
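    Seen from software, the ROM-as-combinational-circuit idea collapses to a lookup table. Here is a small C sketch of the 8 x 4 ROM above; only word 2 (1010) comes from the example, and the remaining table entries are placeholder assumptions:

        #include <stdint.h>

        /* 8 x 4 ROM: eight words of 4 bits each, kept in the low nibble.
         * rom[2] = 0xA encodes the example output: Q3=1, Q2=0, Q1=1, Q0=0. */
        static const uint8_t rom[8] = {
            0x0, 0x0, 0xA, 0x0, 0x0, 0x0, 0x0, 0x0
        };

        /* The 3-to-8 decoder plus the programmed connections reduce to
         * array indexing: address 0b010 selects word 2 and yields 1010.  */
        uint8_t rom_read(uint8_t address)
        {
            return rom[address & 0x7u];
        }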


    Mask-programmed ROM

    Connections are "programmed" at fabrication using a set of masks.

    Lowest write ability: written only once.

    Highest storage permanence: bits never change unless damaged.

    Typically used for the final design of high-volume systems: the NRE cost is spread out for a low unit cost.

    OTP ROM: One-Time Programmable ROM

    Connections are programmed after manufacture by the user:

    the user provides a file of the desired ROM contents;

    the file is input to a machine called a ROM programmer;

    each programmable connection is a fuse;

    the ROM programmer blows fuses where connections should not exist.

    Very low write ability: typically written only once, and requires a ROM programmer device.

    Very high storage permanence: bits don't change unless the device is reconnected to the programmer and more fuses are blown.

    Commonly used in final products: cheaper, and harder to inadvertently modify.


    EPROM: Erasable Programmable ROM

    The programmable component is a MOS transistor with a floating gate surrounded by an insulator:

    (a) Negative charges form a channel between source and drain, storing a logic 1.

    (b) A large positive voltage at the gate causes negative charges to move out of the channel and become trapped in the floating gate, storing a logic 0.

    (c) (Erase) Shining UV light on the surface of the floating gate causes the negative charges to return to the channel from the floating gate, restoring the logic 1.

    (d) An EPROM package has a quartz window through which UV light can pass.

    Better write ability: can be erased and reprogrammed thousands of times.

    Reduced storage permanence: a program lasts about 10 years, but is susceptible to radiation and electrical noise.

    Typically used during design development.



    EEPROM: Electrically Erasable Programmable ROM

    Programmed and erased electronically, typically by using a higher-than-normal voltage; individual words can be programmed and erased.

    Better write ability:

    can be in-system programmable, with a built-in circuit to provide the higher-than-normal voltage (a built-in memory controller commonly hides these details from the memory user);

    writes are very slow due to erasing and programming (a "busy" pin indicates to the processor that the EEPROM is still writing, as sketched below);

    can be erased and programmed tens of thousands of times.

    Storage permanence similar to EPROM (about 10 years).

    Far more convenient than EPROM, but more expensive.
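    A hedged sketch of what the busy-pin handshake looks like to software; the register addresses and the BUSY bit position below are invented for illustration, not a real device's memory map:

        #include <stdint.h>

        /* Hypothetical memory-mapped EEPROM interface. */
        #define EEPROM_DATA   ((volatile uint8_t *)0x40000000u)
        #define EEPROM_STATUS (*(volatile uint8_t *)0x40001000u)
        #define EEPROM_BUSY   0x01u

        /* Write one word, then poll the busy flag: erasing and programming
         * are slow, so the processor must wait before the next write.     */
        void eeprom_write_byte(uint32_t offset, uint8_t value)
        {
            EEPROM_DATA[offset] = value;
            while (EEPROM_STATUS & EEPROM_BUSY)
                ;   /* EEPROM still writing */
        }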

    Flash Memory

    Extension of EEPROM: same floating-gate principle, same write ability and storage permanence.

    Fast erase: large blocks of memory are erased at once, rather than one word at a time; blocks are typically several thousand bytes.

    Writes to single words may be slower: the entire block must be read, the word updated, and then the entire block written back, as in the sketch below.

    Used in embedded systems storing large data items in nonvolatile memory, e.g., digital cameras, TV set-top boxes, cell phones.
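    The read-update-write-back cycle can be sketched in C against a simulated flash array (the block count and block size are assumptions; a real driver would talk to the device instead):

        #include <stdint.h>
        #include <string.h>

        #define BLOCK_SIZE 4096u   /* blocks are typically several KB */
        #define NUM_BLOCKS 16u

        static uint8_t flash[NUM_BLOCKS][BLOCK_SIZE];   /* simulated device */

        /* Updating a single word costs a whole-block read-modify-write:
         * read the block out, patch the word, erase the block (flash
         * erases to all 1s), then program the whole block back.          */
        void flash_write_byte(uint32_t block, uint32_t offset, uint8_t value)
        {
            static uint8_t buf[BLOCK_SIZE];

            memcpy(buf, flash[block], BLOCK_SIZE);   /* read entire block */
            buf[offset % BLOCK_SIZE] = value;        /* update one word   */
            memset(flash[block], 0xFF, BLOCK_SIZE);  /* erase block       */
            memcpy(flash[block], buf, BLOCK_SIZE);   /* write block back  */
        }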


    3. RAM

    Typically volatile memory: bits are not held without a power supply.

    Read and written easily by the embedded system during execution.

    Internal structure is more complex than ROM:

    a word consists of several memory cells, each storing 1 bit;

    each input and output data line connects to each cell in its column;

    rd/wr is connected to every cell;

    when a row is enabled by the decoder, each cell stores the input data bit when rd/wr indicates write, or outputs its stored bit when rd/wr indicates read.

    Basic types of RAM

    SRAM: Static RAM

    The memory cell uses a flip-flop to store the bit; requires 6 transistors.

    Holds data as long as power is supplied.

    DRAM: Dynamic RAM

    The memory cell uses a MOS transistor and a capacitor to store the bit; more compact than SRAM.

    Refresh is required because the capacitor leaks; a word's cells are refreshed when read. A typical refresh interval is 15.625 microseconds.

    Slower to access than SRAM.

    RAM variations

    PSRAM: Pseudo-static RAM

    DRAM with a built-in memory refresh controller; a popular low-cost, high-density alternative to SRAM.

    NVRAM: Nonvolatile RAM

    Holds data after external power is removed.

    Battery-backed RAM: SRAM with its own permanently connected battery; writes are as fast as reads; no limit on the number of writes, unlike nonvolatile ROM-based memory.

    SRAM with EEPROM or flash: stores the complete RAM contents in EEPROM or flash before power is turned off.


    4. Scratchpad Memory

    An embedded processor-based system consists of:

    > processor core

    > embedded memory: instruction and data caches, embedded SRAM, embedded DRAM

    Scratch-pad memory design problems:

    1. How much on-chip memory?

    2. How should on-chip memory be partitioned between cache and scratchpad?

    3. Which variables/arrays go in the scratchpad?

    Goals:

    > improve performance

    > save power

    Efficient Utilization of Scratch-Pad Memory in Embedded Processor Applications


    Abstract

    Efficient utilization of on-chip memory space is extremely important in modern embedded system applications based on microprocessor cores. In addition to a data cache that interfaces with slower off-chip memory, a fast on-chip SRAM, called Scratch-Pad memory, is often used in several applications. This section presents a technique for efficiently exploiting on-chip Scratch-Pad memory by partitioning the application's scalar and array variables into off-chip DRAM and on-chip Scratch-Pad SRAM, with the goal of minimizing the total execution time of embedded applications.

    > Introduction

    Complex embedded system applications typically use heterogeneous chips consisting of microprocessor cores, along with on-chip memory and co-processors. Flexibility and short design-time considerations drive the use of CPU cores as instantiable modules in system designs [5]. The integration of processor cores and memory on the same chip effects a reduction in the chip count, leading to cost-effective solutions. Examples of commercial microprocessor cores commonly used in system design are LSI Logic's CW33000 series [3] and the ARM series from Advanced RISC Machines [10].


    Typical examples of optional modules integrated with the processor on the same chip are: instruction cache, data cache, and on-chip SRAM. The instruction and data caches are fast local memories serving as an interface between the processor and the off-chip memory. The on-chip SRAM, termed Scratch-Pad memory, is a small, high-speed data memory that is mapped into an address space disjoint from the off-chip memory, but connected to the same address and data buses.

    Both the cache and the Scratch-Pad SRAM have a single-processor-cycle access latency, whereas an access to the off-chip memory (usually DRAM) takes several (typically 10-20) processor cycles.

    The main difference between the Scratch-Pad SRAM and the data cache is that the SRAM guarantees a single-cycle access time, whereas an access to the cache is subject to compulsory, capacity, and conflict misses.

    When an embedded application is compiled, the accessed data can be stored either in the Scratch-Pad memory or in off-chip memory; in the second case, it is accessed by the processor through the data cache. We present a technique for minimizing the total execution time of an embedded application by a careful partitioning of the scalar and array variables used in the application into off-chip DRAM (accessed through the data cache) and Scratch-Pad SRAM.

    Optimization techniques for improving the data cache performance of programs have been reported [4, 7, 9]. The analysis in [9] is limited to scalars and hence not generally applicable. Iteration-space blocking for improving data locality is studied in [4]. This technique is also


    limited to the type of code that yields naturally to blocking. In [7], a data layout strategy for avoiding conflict misses is presented. However, array access patterns in some applications are too complex to be statically analyzable using this method. The availability of an on-chip SRAM with a guaranteed fast access time creates an opportunity for overcoming some of the cache conflict problems (Section 2). The problem of partitioning data into SRAM and cache with the objective of maximizing performance has, to our knowledge, not been attempted before.

    > Problem Description

    Figure 1(a) shows the architectural block diagram of an application employing a typical embedded core processor (e.g., the LSI Logic CW33000 RISC microprocessor core [3]), where the parts enclosed in the dotted rectangle are implemented in one chip, which interfaces with an off-chip memory, usually realized with DRAM. The address and data buses from the CPU core connect to the Data Cache, Scratch-Pad memory, and External Memory Interface (EMI) blocks. On a memory access request from the CPU, the data cache indicates a cache hit to the EMI block through the C_HIT signal. Similarly, if the SRAM interface circuitry in the Scratch-Pad memory determines that the referenced memory address maps into the on-chip SRAM, it assumes control of the data bus and indicates this status to the EMI through the S_HIT signal. If both the cache and the SRAM report misses, the EMI transfers a block of data of the appropriate size (equal to the cache line size) between the cache and the DRAM.

    The data address space mapping is shown in Figure 1(b). The lowest block of addresses maps into the Scratch-Pad memory and has a single-processor-cycle access time; thus, in Figure 1(a), S_HIT is asserted whenever the processor attempts to access any address in that range. The remaining addresses map into the off-chip DRAM and are accessed by the CPU through the data cache. A cache hit for such an address results in a single-cycle delay, whereas a cache miss, which leads to a block transfer between off-chip memory and the cache, results in a delay of 10-20 processor cycles. (This address-range decode is sketched below.)
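    The S_HIT decision reduces to an address-range check. A minimal C sketch, with the scratch-pad and total memory sizes chosen arbitrarily for illustration (the paper leaves the sizes symbolic):

        #include <stdint.h>

        #define SPM_SIZE 0x00000800u   /* e.g. a 2 KB scratch-pad   */
        #define MEM_SIZE 0x01000000u   /* total data address space  */

        /* 1 when the reference falls in the scratch-pad range, i.e. the
         * SRAM interface would assert S_HIT and claim the data bus.      */
        int spm_hit(uint32_t addr)
        {
            return addr < SPM_SIZE;
        }

        /* Everything else goes to off-chip DRAM through the data cache:
         * ~1 cycle on a cache hit, 10-20 cycles on a miss.               */
        int dram_reference(uint32_t addr)
        {
            return addr >= SPM_SIZE && addr < MEM_SIZE;
        }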

    Suppose the histogram-evaluation code from the paper is executed on a processor configured with a data cache of size 1 KByte. Performance is degraded by conflict misses in the cache between elements of the two arrays, Hist and BrightnessLevel. Data layout techniques such as [7] are not effective in eliminating this type of conflict, because the accesses to Hist are data-dependent. Note that this problem occurs in both direct-mapped and set-associative caches.

    However, the conflict problem can be solved elegantly if we include a Scratch-Pad SRAM in the architecture. Since the Hist array is relatively small, we can store it in the SRAM so that it does not conflict with BrightnessLevel in the data cache. This storage assignment improves the performance of the histogram-evaluation code significantly, as in the sketch below.
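    In practice the assignment can be expressed by placing the small array in a linker section that maps to the scratch-pad range. A sketch under GCC-style toolchain assumptions; the section name ".scratchpad", the array sizes, and the lowercase identifiers are all illustrative, not from the paper:

        #include <stdint.h>

        /* hist is small and conflict-prone, so it goes to the scratch-pad
         * (requires a matching memory region in the linker script).      */
        static uint32_t hist[256] __attribute__((section(".scratchpad")));

        static uint8_t brightness_level[512u * 512u];  /* off-chip DRAM   */

        void histogram(void)
        {
            for (uint32_t i = 0; i < sizeof brightness_level; i++)
                hist[brightness_level[i]]++;  /* data-dependent accesses no
                                                 longer evict
                                                 brightness_level lines    */
        }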


    We present a strategy for partitioning the scalar and array variables in an application's code between Scratch-Pad memory and off-chip DRAM (accessed through the data cache), to maximize performance by selectively mapping to the SRAM those variables that are estimated to cause the maximum number of conflicts in the data cache.

    > The Partitioning Strategy

    The overall approach in partitioning program variables between Scratch-Pad memory and DRAM is to minimize the cross-interference between different variables in the data cache. We first outline the different features of the code affecting the partitioning.


    5. CACHE

    We want inexpensive, fast memory.

    Main memory: large, inexpensive, slow memory that stores the entire program and data.

    Cache: small, expensive, fast memory that stores a copy of the likely-to-be-accessed parts of the larger memory.

    There can be multiple levels of cache.

    > Introduction to the Memory Hierarchy

    A cache is usually designed with SRAM: faster but more expensive than DRAM.

    Usually on the same chip as the processor: space is limited, so it is much smaller than off-chip main memory, but access is faster (about 1 cycle vs. several cycles for main memory).

    Cache operation: on a request for main memory access (read or write), first check the cache for a copy.


    Cache hit: the copy is in the cache; access is quick.

    Cache miss: the copy is not in the cache; the addressed data (and possibly its neighbors) is read into the cache.

    Several cache design choices: cache mapping, replacement policies, and write techniques.

    > Different Mapping Techniques

    Direct mapping

    The main memory address is divided into two fields:

    Index: the cache address; the number of bits is determined by the cache size.

    Tag: compared with the tag stored in the cache at the address indicated by the index; if the tags match, check the valid bit.

    Valid bit: indicates whether the data in the slot has been loaded from memory.

    Offset: used to find a particular word in the cache line (see the sketch below).
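    The field extraction is pure bit manipulation. A C sketch with assumed field widths (a cache of 2^7 lines of 2^4 bytes; any real cache fixes these from its geometry):

        #include <stdint.h>

        #define OFFSET_BITS 4u   /* 16-byte cache line (assumption)  */
        #define INDEX_BITS  7u   /* 128 cache lines (assumption)     */

        typedef struct { uint32_t tag, index, offset; } cache_addr_t;

        /* Split a main-memory address into tag / index / offset fields. */
        cache_addr_t split_address(uint32_t addr)
        {
            cache_addr_t f;
            f.offset = addr & ((1u << OFFSET_BITS) - 1u);
            f.index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1u);
            f.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
            return f;
        }

    A lookup then compares f.tag with the tag stored at line f.index and checks that line's valid bit.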

    Fully associative mapping

    The complete main memory address is stored in each cache entry; all tags stored in the cache are simultaneously compared with the desired address.

    The valid bit and offset are the same as in direct mapping.


    Set-associative mapping

    A compromise between direct mapping and fully associative mapping.

    The index is the same as in direct mapping, but each cache address contains the content and tags of two or more memory locations (a set).

    The tags of the set are compared simultaneously, as in fully associative mapping.

    A cache with set size N is called N-way set-associative; 2-way, 4-way, and 8-way are common.

    > Cache Replacement Policy

    A technique for choosing which block to replace is needed when a fully associative cache is full, or when a set-associative cache's set is full. (A direct-mapped cache has no choice.)

    Random: replace a block chosen at random.

    LRU (least recently used):


    replace the block not accessed for the longest time.

    FIFO (first-in, first-out): push a block onto a queue when it is accessed; choose the block to replace by popping the queue. (A sketch of LRU victim selection follows.)
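    A sketch of LRU victim selection for one set, assuming a 4-way cache and per-line access timestamps (one common way to realize LRU; real hardware often uses cheaper approximations):

        #include <stdint.h>

        #define WAYS 4u   /* 4-way set-associative (assumption) */

        typedef struct {
            uint32_t tag;
            int      valid;
            uint64_t last_used;   /* updated on every access */
        } line_t;

        /* Prefer an invalid line; otherwise evict the line whose
         * last access is oldest (least recently used).           */
        unsigned lru_victim(const line_t set[WAYS])
        {
            unsigned victim = 0;
            for (unsigned w = 0; w < WAYS; w++) {
                if (!set[w].valid)
                    return w;                 /* free slot, no eviction */
                if (set[w].last_used < set[victim].last_used)
                    victim = w;
            }
            return victim;
        }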

    > Cache Write Techniques

    When written to, the data cache must eventually update main memory.

    Write-through: write to main memory whenever the cache is written to. Easiest to implement, but the processor must wait for the slower main memory write, and there is potential for unnecessary writes.

    Write-back: main memory is written only when a dirty block is replaced. An extra dirty bit for each block is set when the cache block is written to; this reduces the number of slow main memory writes.

    > Cache Impact on System Performance

    The most important parameters in terms of performance are:

    Total cache size: the total number of data bytes the cache can hold (tag, valid, and other housekeeping bits are not included in the total).

    Degree of associativity.

    Data block size.

    Larger caches achieve lower miss rates but higher access cost. For example, with a miss cost of 20 cycles:

    2 Kbyte cache: miss rate = 15%, hit cost = 2 cycles.
    Average cost of a memory access = (0.85 * 2) + (0.15 * 20) = 4.7 cycles.

    4 Kbyte cache: miss rate = 6.5%, hit cost = 3 cycles, miss cost unchanged.
    Average cost of a memory access = (0.935 * 3) + (0.065 * 20) = 4.105 cycles (an improvement).

    8 Kbyte cache: miss rate = 5.565%, hit cost = 4 cycles, miss cost unchanged.
    Average cost of a memory access = (0.94435 * 4) + (0.05565 * 20) = 4.8904 cycles (worse).

    (These numbers are reproduced by the short program below.)
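    The arithmetic above is just a weighted average of hit and miss costs; this small C program reproduces the three numbers from the example:

        #include <stdio.h>

        /* avg = hit_rate * hit_cost + miss_rate * miss_cost */
        static double avg_cost(double miss_rate, double hit_cost,
                               double miss_cost)
        {
            return (1.0 - miss_rate) * hit_cost + miss_rate * miss_cost;
        }

        int main(void)
        {
            printf("2 KB: %.4f cycles\n", avg_cost(0.15,    2.0, 20.0));
            printf("4 KB: %.4f cycles\n", avg_cost(0.065,   3.0, 20.0));
            printf("8 KB: %.4f cycles\n", avg_cost(0.05565, 4.0, 20.0));
            return 0;   /* prints 4.7000, 4.1050, 4.8904 */
        }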


    6. Advanced RAM

    DRAMs are commonly used as main memory in processor-based embedded systems: high capacity, low cost.

    Many variations of DRAM have been proposed, as DRAM needs to keep pace with processor speeds:

    FPM DRAM: fast page mode DRAM

    EDO DRAM: extended data out DRAM

    SDRAM/ESDRAM: synchronous and enhanced synchronous DRAM

    RDRAM: Rambus DRAM

    6.1 Basic DRAM

    The address bus is multiplexed between row and column components. Row and column addresses are latched in, sequentially, by strobing the ras (row address strobe) and cas (column address strobe) signals, respectively, as in the sketch below.

    Refresh circuitry can be external or internal to the DRAM device: it strobes consecutive memory addresses periodically, causing the memory contents to be refreshed. Refresh circuitry is disabled during read or write operations.
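    Conceptually, multiplexing just splits one address into a row half (latched on ras) and a column half (latched on cas). A C sketch with assumed bit widths:

        #include <stdint.h>

        #define ROW_BITS 12u   /* widths are assumptions; they depend  */
        #define COL_BITS 10u   /* on the particular DRAM organization  */

        typedef struct { uint32_t row, col; } dram_addr_t;

        dram_addr_t dram_split(uint32_t addr)
        {
            dram_addr_t a;
            a.col = addr & ((1u << COL_BITS) - 1u);               /* cas phase */
            a.row = (addr >> COL_BITS) & ((1u << ROW_BITS) - 1u); /* ras phase */
            return a;
        }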

    Fast Page Mode DRAM (FPM DRAM)

    Each row of the memory bit array is viewed as a page.


    A page contains multiple words; individual words are addressed by the column address.

    Timing: the row (page) address is sent once, then 3 words are read consecutively by sending a column address for each. An extra cycle is eliminated on each read/write of words from the same page.


    Extended Data Out DRAM (EDO DRAM)

    An improvement on FPM DRAM: an extra latch before the output buffer allows strobing of cas before the data read operation is completed, reducing read/write latency by an additional cycle.

    Synchronous and Enhanced Synchronous DRAM (SDRAM/ESDRAM)

    SDRAM latches data on the active edge of the clock, eliminating the time to detect the ras/cas and rd/wr signals. A counter is initialized to the column address and then incremented on each active clock edge to access consecutive memory locations.

    ESDRAM improves on SDRAM: added buffers enable overlapping of column addressing, making faster clocking and lower read/write latency possible.


    Rambus DRAM (RDRAM)

    More of a bus interface architecture than a DRAM architecture. Data is latched on both the rising and falling edges of the clock. The memory is broken into 4 banks, each with its own row decoder, so 4 pages can be open at a time. Capable of very high throughput.

    6.2 DRAM Integration Problem

    SRAM is easily integrated on the same chip as the processor; DRAM is more difficult, because DRAM and conventional logic use different chip-making processes:

    Goal of conventional logic (IC) designers: minimize parasitic capacitance to reduce signal propagation delays and power consumption.

    Goal of DRAM designers: create capacitor cells to retain stored information.

    Integration processes are beginning to appear.

    6.3 Memory Management Unit (MMU)

    Duties of the MMU:

    Handles DRAM refresh, bus interface, and arbitration.

    Takes care of memory sharing among multiple processors.

    Translates logical memory addresses from the processor into physical memory addresses of the DRAM, as sketched below.

    Modern CPUs often come with a built-in MMU; single-purpose processors can also be used.
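    The translation duty can be pictured as a table lookup. A toy single-level sketch (the 4 KB page size, table size, and function name are assumptions; real MMUs use multi-level tables and TLBs):

        #include <stdint.h>

        #define PAGE_BITS 12u    /* 4 KB pages (assumption) */
        #define NUM_PAGES 256u

        /* logical page number -> physical frame number */
        static uint32_t page_table[NUM_PAGES];

        uint32_t translate(uint32_t logical)
        {
            uint32_t page   = (logical >> PAGE_BITS) % NUM_PAGES;
            uint32_t offset = logical & ((1u << PAGE_BITS) - 1u);
            return (page_table[page] << PAGE_BITS) | offset;
        }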


    7. Cache Coherence Protocols

    The presence of caches in current-generation distributed shared-memory multiprocessors improves performance by reducing the processor's memory access time and by decreasing the bandwidth requirements of both the local memory module and the global interconnect. Unfortunately, the local caching of data introduces the cache coherence problem. Early distributed shared-memory machines left it to the programmer to deal with the cache coherence problem, and consequently these machines were considered difficult to program [5][38][54]. Today's multiprocessors solve the cache coherence problem in hardware by implementing a cache coherence protocol. This section outlines the cache coherence problem and describes how cache coherence protocols solve it.

    In addition, this section discusses several different varieties of cache coherence protocols, including their advantages and disadvantages, their organization, their common protocol transitions, and some examples of machines that implement each protocol. Ultimately a designer has to choose a protocol to implement, and this should be done carefully. Protocol choice can lead to differences in cache miss latencies and in the number of messages sent through the interconnection network, both of which can lead to differences in overall application performance. Moreover, some protocols have high-level properties such as automatic data distribution or distributed queueing that can help application performance. Before discussing specific protocols, however, let us examine the cache coherence problem in distributed shared-memory machines in detail.

    7.1 The Cache Coherence Problem

    Figure 2.1 depicts an example of the cache coherence problem. Memory initially contains the value 0 for location x, and processors 0 and 1 both read location x into their caches. If processor 0 writes location x in its cache with the value 1, then processor 1's cache now contains the stale value 0 for location x. Subsequent reads of location x by processor 1 will continue to return the stale, cached value of 0. This is likely not what the programmer expected when she wrote the program. The expected behavior is for a read by any processor to return the most up-to-date copy of the datum. This is exactly what a cache coherence protocol does: it ensures that requests for a certain datum always return the most recent value.


    The coherence protocol achieves this goal by taking action whenever a location is written. More precisely, since the granularity of a cache coherence protocol is a cache line, the protocol takes action whenever any cache line is written. Protocols can take two kinds of actions when a cache line L is written: they may either invalidate all copies of L in the other caches in the machine, or they may update those lines with the new value being written.

    Continuing the earlier example, in an invalidation-based protocol, when processor 0 writes x = 1, the line containing x is invalidated in processor 1's cache. The next time processor 1 reads location x it suffers a cache miss and goes to memory to retrieve the latest copy of the cache line. In systems with write-through caches, memory can supply the data because it was updated when processor 0 wrote x. In the more common case of systems with write-back caches, the cache coherence protocol has to ensure that processor 1 asks processor 0 for the latest copy of the cache line; processor 0 then supplies the line from its cache, and processor 1 places that line into its cache, completing its cache miss. In update-based protocols, when processor 0 writes x = 1, it sends the new copy of the datum directly to processor 1 and updates the line in processor 1's cache with the new value. In either case, subsequent reads by processor 1 now see the correct value of 1 for location x, and the system is said to be cache coherent.

    Most modern cache-coherent multiprocessors use the invalidation technique rather than the update technique because it is easier to implement in hardware. As cache line sizes continue to increase, invalidation-based protocols remain popular because of the increased


    number of updates required when writing a cache line sequentially under an update-based coherence protocol. There are times, however, when using an update-based protocol is superior, such as when accessing heavily contended lines and some types of synchronization variables. Typically designers choose an invalidation-based protocol and add special features to handle heavily contended synchronization variables. All the protocols presented here are invalidation-based cache coherence protocols.


    8. Directory-Based Coherence

    The previous section describes the cache coherence problem and introduces the cache coherence protocol as the agent that solves it. But the question remains: how do cache coherence protocols work?

    There are two main classes of cache coherence protocols: snoopy protocols and directory-based protocols. Snoopy protocols require the use of a broadcast medium in the machine and hence apply only to small-scale bus-based multiprocessors. In these broadcast systems, each cache snoops on the bus and watches for transactions that affect it. Any time a cache sees a write on the bus, it invalidates that line from its cache if it is present. Any time a cache sees a read request on the bus, it checks whether it has the most recent copy of the data and, if so, responds to the bus request. These snoopy bus-based systems are easy to build, but unfortunately, as the number of processors on the bus increases, the single shared bus becomes a bandwidth bottleneck, and the snoopy protocols' reliance on a broadcast mechanism becomes a severe scalability limitation.

    To address these problems, architects have adopted the distributed shared-memory (DSM) architecture. In a DSM multiprocessor, each node contains the processor and its caches, a portion of the machine's physically distributed main memory, and a node controller which manages communication within and between nodes (see Figure 2.2). Rather than being connected by a single shared bus, the nodes are connected by a scalable interconnection network. The DSM architecture allows multiprocessors to scale to thousands


    of nodes, but the lack of a broadcast medium creates a problem for the cache coherence protocol. Snoopy protocols are no longer appropriate, so designers must instead use a directory-based cache coherence protocol.

    The first description of directory-based protocols appears in Censier and Feautrier's 1978 paper [9]. The directory is simply an auxiliary data structure that tracks the caching state of each cache line in the system: which caches, if any, have read-only copies of the line, and which cache has the latest copy of the line if the line is held exclusively. A directory-based cache-coherent machine works by consulting the directory on each cache miss and taking the appropriate action based on the type of request and the current state of the directory.

    Figure 2.3 shows a directory-based DSM machine. Just as main memory is physically distributed throughout the machine to improve aggregate memory bandwidth, so the directory is distributed to eliminate the bottleneck that would be caused by a single monolithic directory. If each node's main memory is divided into cache-line-sized blocks, then the directory can be thought of as extra bits of state for each block of main memory. Any time


    a processor wants to read cache line L, it must send a request to the node that holds the directory for line L; this node is called the home node for L. The home node receives the request, consults the directory, and takes the appropriate action. On a cache read miss, for example, if the directory shows that the line is currently uncached or is cached read-only, the home node can supply the data directly from its memory.


    9. MESI Cache Coherence

    Abstract

    Modern computational systems (multiprocessor and uniprocessor) need to avoid the cache coherence problem. There are several techniques to solve this problem; the MESI cache coherence protocol is one of them. This paper presents a simulator of the MESI protocol, which is used for teaching cache memory coherence in computer systems with a hierarchical memory system and for explaining the process of cache memory location in multilevel cache memory systems. The paper describes the course in which the simulator is used, gives a short explanation of the MESI protocol and of how the simulator works, and then describes some experimental results in a real teaching environment.

    Keywords: Cache memory, Coherence protocol, MESI, Simulator, Teaching tool.

    9.1 Introduction

    In multiprocessor systems, the memory should provide a set of locations that hold values, and when a location is read it should return the latest value written to that location. This property must be established to communicate data between threads or processes running on different processors: a read returns the latest value written to the location regardless of which process wrote it. This requirement is known as the cache coherence problem.

    This kind of problem arises even in uniprocessors when I/O operations occur. Most I/O transfers are performed by direct memory access (DMA) devices that move data between memory and a peripheral component without involving the processor [5]. When the DMA device writes to a location in main memory, unless special action is taken, the processor may continue to see the old value if that location was previously present in its cache [1]. The techniques and support used to solve the multiprocessor cache coherence problem also solve the I/O coherence problem. Essentially all microprocessors today provide support for multiprocessor cache coherence.

    The MESI cache coherence protocol is a technique to maintain the coherence of cache memory contents in hierarchical memory systems [2], [7]. It is based on four possible states of the cache blocks: Modified, Exclusive, Shared, and Invalid. Each accessed block lies in one of these states, and the transitions among them define the MESI protocol. Nowadays, most processors (Intel, AMD) use this protocol or versions of it, so knowing how these processors maintain cache coherence is very important for students. This paper presents a simulator of the MESI cache coherence protocol [1], [6]: a software tool implemented in the Java language.


    10. MESI Protocol

    The MESI protocol makes it possible to maintain coherence in cached systems. It is based on the four states that a block in the cache memory can have, which give the protocol its name: Modified, Exclusive, Shared, and Invalid. The states are explained below:

    Invalid: a non-valid state. The data you are looking for is not in the cache, or the local copy of the data is not correct because another processor has updated the corresponding memory position.

    Shared: shared without having been modified. Another processor may have the data in its cache memory, and both copies are in their current version.

    Exclusive: exclusive without having been modified; this cache is the only one that has the correct value of the block, and the block matches the copy in main memory.

    Modified: actually an exclusive-modified state. The cache has the only copy that is correct in the whole system; the data in main memory is stale.

    The state of each cache memory block can change depending on the actions taken by the CPU [3]. Figure 1 presents these transitions; a sketch of the CPU-side transitions follows.
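    A minimal sketch of the CPU-initiated transitions in C, following the standard textbook MESI rules rather than any particular processor's implementation; bus-initiated transitions (Figure 2) are omitted:

        typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

        /* CPU read: a miss (Invalid) fetches the line, ending Shared if
         * another cache holds a copy and Exclusive otherwise; M/E/S read
         * hits keep their state.                                        */
        mesi_t on_cpu_read(mesi_t s, int others_have_copy)
        {
            if (s == INVALID)
                return others_have_copy ? SHARED : EXCLUSIVE;
            return s;
        }

        /* CPU write: from Invalid or Shared the other copies must first
         * be invalidated on the bus; in every case the written line ends
         * up Modified, the sole up-to-date copy in the system.          */
        mesi_t on_cpu_write(mesi_t s)
        {
            (void)s;          /* all four states converge on Modified */
            return MODIFIED;
        }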

    In brief: at the beginning, when the cache is empty and the processor loads a block of memory into the cache, the block enters the Exclusive state, because there are no other copies of that block in any cache. If this block is then written, it changes to the Modified state, because the block is in only one cache but has been modified, so the block in main memory differs from it.

    On the other hand, if a block is in the Exclusive state in one cache and another CPU tries to read it but does not find it in its own cache, that CPU has to fetch it from main memory and load it into its cache. The block is then in two different caches, so its state becomes Shared.

    If a CPU wants to write into a block that is held in the Modified state in another cache, that block first has to be cleared from the cache where it was and written back to main memory, because it was the most current copy of the block in the system. The writing CPU then loads the block, writes it, and holds it in its cache in the Modified state, since it now has the most current version.

    If a CPU wants to read a block and does not find it in its cache because a more recent copy exists elsewhere, the system clears the block from the cache where it was and writes it back to main memory. From there the block is read, and the new state is Shared, because there


    are two current copies in the system. Finally, a CPU may write into a Shared block; in this case the other copies are invalidated and the block changes its state to Modified.

    Figure 1: Transitions caused by CPU actions

    It should be taken into account that the state of a cache memory block can also change because of the actions of another CPU, an input/output interrupt, or a DMA transfer; these transitions are shown in Figure 2. Hence, the processor always uses valid data in its operations. We do not have to worry if a processor has changed data from main memory and holds the most current value of that data in its cache: with the MESI protocol, the processor obtains the most current value every time it is required.


    11. References

    [1] Culler, D.E., Singh, J.P., and Gupta, A. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, Inc., 1999.

    [2] Hamacher, C., Vranesic, Z., and Zaky, S. Computer Organization. McGraw-Hill, 2003.

    [3] Handy, J. The Cache Memory Book. Academic Press, 1998.

    [4] McGettrick, A., Theys, M.D., Soldan, D.L., and Srimani, P.K. Computer Engineering Curriculum in the New Millennium. IEEE Transactions on Education, vol. 46, no. 4, November 2003.

    [5] Patterson, D.A., and Hennessy, J.L. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufmann Publishers, Inc., 2004.

    [6] Stallings, W. Computer Organization and Architecture. Prentice-Hall, 2006.

    [7] Tanenbaum, A.S. Structured Computer Organization. Prentice-Hall, 2006.

