
Analysis of Memory Access Cycles for Core Power Gating

Pushkar Nandkar, Avani Deshpande, Namrata Date

Department of Electrical and Computer Engineering, University of Minnesota, Twin Cities

{nandk001,deshp058,datex008}@umn.edu

Abstract—For a multi-core processor, information about memory accesses by each core is important in understanding the relative state of the cores with respect to each other. A detailed cycle-by-cycle analysis shows the activities of the cores in each cycle. When a particular core accesses memory, because the memory access is long, leakage power can be reduced if the supply to this core is switched off. However, this decision depends not only on the memory access latency of the particular core but also on the relative states of the other cores in the system. If a core is power gated, it is important to consider the non-zero wakeup latency of the core. Also, the voltage noise due to the gating of this core may affect the cores in its proximity. Thus, our analysis of a multi-core system provides information about the cycles during which each core accesses memory and the number of cores that stall during such an access in each cycle. This information can then be used to make a decision about power gating any particular core.

Keywords: Memory stall, memory access latency, power gating, multicore, multiprogrammed, profiling

I. INTRODUCTION

Modern handheld devices demand more battery life with increased functionality while keeping thermal limits in check. As a consequence, the available power should be used efficiently. Leakage power is a major portion of the power consumed by a chip. A large amount of this power is dissipated when a core is waiting for resources to complete its execution.

Fig. 1. Leakage Power during Memory Stalls

In the absence of power gating, significant leakage power is dissipated even if the circuit is clock gated, as seen in Fig. 1 [8]. Fig. 2 [8] shows the realistic power profile when power gating is implemented. The leakage power gradually reduces to a significantly low level during the sleep state.

Fig. 2. Power Gated Memory Stalls

During a memory access, a core stalls due to the associated latency. If the core remains connected to the power grid (ON) during this long-latency memory access, leakage power is dissipated. This leakage power dissipation can be reduced by power gating the core during the period when it is stalling. If the power supply to the core is switched off during this period, the dissipated leakage power may be reduced significantly.

If a core stalls for a short duration, it cannot be power gated due to the associated wake-up latency. In order to exploit the benefits of power gating for shorter stalls, multiple wake-up modes can be employed. Each mode is based on the tradeoff between power saving and the latency associated with wake-up. For shorter stalls, a faster wake-up can be achieved at the expense of smaller power savings.
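The mode-selection tradeoff above can be sketched numerically. All mode latencies and per-cycle savings below are hypothetical, illustrative numbers, not values from this work:

```python
# Illustrative sketch: choosing a wake-up mode for an expected stall length.
# Each mode maps to (wakeup_latency_cycles, leakage_saved_per_sleep_cycle);
# all numbers are assumed for illustration.
MODES = {
    "deep":    (30, 1.00),  # slow wake-up, full leakage saving
    "medium":  (10, 0.60),
    "shallow": (3,  0.25),  # fast wake-up, small saving
}

def best_mode(stall_cycles):
    """Return (mode, net_saving) maximizing saving for a stall, or None."""
    best = None
    for name, (wakeup, save) in MODES.items():
        sleep = stall_cycles - wakeup   # core must be awake again by the end
        if sleep <= 0:
            continue                    # stall too short for this mode
        net = sleep * save              # ignores switching energy for brevity
        if best is None or net > best[1]:
            best = (name, net)
    return best

print(best_mode(8))    # short stall: only the shallow mode fits
print(best_mode(200))  # long stall: the deep mode wins
```

A short stall can only afford the shallow mode's quick wake-up; a long stall earns more by sleeping deeply despite the slower wake-up.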

Memory access latencies vary depending on the location of the required block in the memory hierarchy. If the memory access latency is sufficiently high, a core can be power gated. The power saving due to gating the core can be calculated from the number of stall cycles. Core wake-up latency is the time a particular core takes to wake up from power-down. The value of the core wake-up latency is critical to the decision of whether or not to power gate a core. If the core wake-up latency is less than the memory access latency, the core can be power gated and a significant amount of leakage power can be saved. The value of the core wake-up latency depends on various parameters. In the context of a multi-core processor, this value for a particular core also depends on the states of the other cores present on the chip. If the core wake-up latency, calculated accurately, turns out to be lower than the memory access time, the resulting power savings are considerable even after accounting for the overheads related to power gating, such as the state restoration logic, leakage from the switches, etc.
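A minimal sketch of this decision rule, with the overheads folded into an assumed breakeven term (the paper reports a 53 ns main-memory latency, which is 106 core cycles at the 2 GHz clock used later; the wake-up and overhead values below are illustrative assumptions):

```python
def should_power_gate(stall_cycles, wakeup_latency, overhead_cycles):
    """Gate only if the stall covers wake-up plus the breakeven overhead.

    overhead_cycles expresses state save/restore and switch-leakage costs
    as an equivalent number of leakage-saving cycles (illustrative model).
    """
    return stall_cycles > wakeup_latency + overhead_cycles

# A main-memory stall (53 ns at 2 GHz = 106 core cycles) vs. a short L2 hit:
print(should_power_gate(106, wakeup_latency=20, overhead_cycles=15))  # True
print(should_power_gate(12,  wakeup_latency=20, overhead_cycles=15))  # False
```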

The project is divided into four parts:

1) Simulate a multicore system with a multiprogrammed environment in gem5 [5].

2) Analyze the cache trace for the multicore system.

3) Design logic for determining the core idle time.

4) Parse the trace file and plot the stall and active cycles for each core.

II. DESIGN CONSIDERATIONS FOR POWER-GATED CIRCUITS

In order to formally introduce power gating in a chip in which multiple blocks can be power gated and powered up at different times, multiple power domains need to be defined for each of the blocks. This makes the following factors important to consider.

A. Retention of stored data

The current circuit state needs to be preserved when power gating a certain block, since power gating will lead to loss of data and affect correctness. Retention registers are used to preserve the current state. Latches or flip-flops can be used to retain the data in such a way that these registers are not part of the power-gated domain. However, the overheads associated with these techniques include large area, complexity in the generation of control signals, power consumption, etc.

B. Output Isolation

For signals crossing power domains, care must be taken at the crossings where a signal from an 'OFF' power domain drives inputs in the 'ON' power domain. These signals need to be isolated using isolation cells belonging to an 'Always ON' power domain before they reach another domain.

C. Design of current switch

Good design of the sleep transistor is essential for the success of power gating. Factors such as the delay due to the sleep transistor, IR drop, etc. need to be considered. A fine-grain or a coarse-grain design can be implemented. In fine grain, a sleep transistor is used for every standard cell, whereas in coarse grain, the sleep transistors are connected together between the Vdd and GND rails. In a coarse-grain implementation the current is shared and hence the IR drop is smaller. It also requires less area than fine grain. This makes the coarse-grain implementation the preferred one.

D. Wake up process

The overhead associated with the wake-up of a power-gated block is critical to the power gating decision. During wake-up, a large current flows through the current switches. Due to the inductance associated with the power rails, a noise voltage is generated. This leads to fluctuations in the supply voltage and ground bounce.

Careful design of the power gating network is necessary in order to avoid such effects. The design of the power gating network should also consider the effects of process variations.

E. Physical Design

The area overhead associated with power gating is due to the circuitry for retention as well as for isolation of the outputs. It depends on the amount of data that needs to be retained as well as the number of outputs. In the case of combinational circuits, the area overhead is dominated by the output isolation circuitry. The routing of all the control signals also adds to the area.

F. P/G noise

When a block is being power gated or woken up, a noise voltage appears in the power delivery network. Neighboring blocks of the power-gated block need to be protected from this noise. The problem of voltage noise can be addressed either by implementing adaptive run-time strategies or by techniques based on estimation of the noise.

A core which is to be power gated is considered an attacker, and the active cores which are affected by the noise are the victims. Not all cores on the chip are affected by this induced noise; only cores within a certain area are affected. Hence, safety measures need to be implemented only for the cores within the attacker's attacking range.

Fig. 3. Floorplan Examples

Fig. 3 describes how floorplan information can help to understand the effects of a state change of a core on its surrounding cores. In this particular case, in Fig. 3a, Core 1 is the attacker and cores 2, 3, 5, 6, 9 are in the unsafe zone. In Fig. 3b, Core 2 is the attacker and cores 1, 3, 4, 5, 6, 7, 8, 9, 10 are in the unsafe zone. In Fig. 3c, Core 6 is the attacker and all the other 15 cores are in the unsafe zone.

In order to account for this effect of the floorplan on the power gating decision for voltage noise considerations, we consider the states of all the cores of the chip at any instant. When floorplan information is provided, a valid decision logic can be implemented for power gating based on the relative positions of the cores.
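Such an unsafe-zone check can be sketched as a function of the floorplan grid. The actual attacking range depends on the power delivery network, so the Chebyshev-distance `radius` used here is an assumed parameter, not a value from this work:

```python
def unsafe_cores(attacker, rows, cols, radius):
    """Cores within `radius` grid steps (Chebyshev distance) of the attacker.

    Cores are numbered 1..rows*cols in row-major order. The real attacking
    range depends on the power delivery network; `radius` is an assumption.
    """
    ar, ac = divmod(attacker - 1, cols)
    victims = []
    for core in range(1, rows * cols + 1):
        if core == attacker:
            continue
        r, c = divmod(core - 1, cols)
        if max(abs(r - ar), abs(c - ac)) <= radius:
            victims.append(core)
    return victims

# In a 4x4 floorplan, a centrally placed attacker with radius 2 puts
# every other core in the unsafe zone (cf. the Fig. 3c scenario):
print(len(unsafe_cores(6, 4, 4, 2)))  # 15
```

Only the victims returned here would need noise-safety measures before the attacker is gated or woken.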

III. RELATED WORK

Power gating in order to control leakage power has been an area of research for quite some time now. Hu, Buyuktosunoglu et al. have discussed the power gating of execution units when the units are underutilized [1]. Idle time prediction has been a focus of work for many.

Many techniques have focused on power gating of SRAMs, along with techniques like cache decay, in which the power to a cache line is turned off when the line is not accessed for a specific interval of time [2]. Work by Ye-Jyun Lin et al. [4] focuses on power gating of an IP core, instead of a CPU core, while it is waiting for memory access.


The work by Jeong et al. [3] discusses the approach of memory access power gating for cores by implementing core wake-up in stages. Our project is based on the work of Jeong, Kwangok, et al., which involves power gating of active cores during long memory accesses.

IV. INFRASTRUCTURE

Gem5 [5] was used for the simulation of a system with 4 cores.

A. Simulator: Gem5

Gem5 is an object-oriented simulation platform, i.e., a group of objects interact with each other through function calls during simulation. It is implemented in Python and C++ and has two simulation modes: Full-system mode (FS mode) and Syscall emulation mode (SE mode). FS mode simulates a system with an operating system and devices, while in SE mode only the programs are simulated and the system components are provided by the simulator. Various ISAs such as Alpha, ARM, MIPS, Power, and SPARC are supported.

Among the available simulators, such as SimpleScalar, the support for multiple cores was found to be best in gem5. Gem5 also offers the flexibility to select among various cache coherence protocols, multilevel memory hierarchies, and blocking/non-blocking cache implementations. As a result, gem5 is a natural choice of simulator for this project.

1) Memory System in Gem5: Memory objects are connected through ports and exhibit a master/slave hierarchy. The CPU-side port is the slave port and the memory-side port is the master port. A memory access (read as well as write) is considered complete on receiving a response from the accessed memory object. A split memory access is implemented, i.e., a memory access takes place in two steps: first the effective address calculation, and then the access to that address. The cache hit time is configurable. On a cache miss, the request is forwarded to the MSHR (Miss Status and Handling Register). Evicted dirty cache lines are sent to the write buffer. A cache is blocked when the MSHR or write buffer is full. In the blocked state a cache rejects all requests from the CPU; however, it does not reject response messages from master ports or snoop requests. Block size and associativity are configurable. The flags associated with cache lines are Valid, Read, Write, and Dirty; depending on the status of these flags, a cache line can be read or written. Memory accesses that require an access to lower memory are kept in the MSHR, whose size is configurable. Cached memory is always write-back, write-allocate. The Coherent Bus object provides support for the snoopy protocol.

The message flow in the case of a cache miss is shown in Fig. 4 and Fig. 5 [6].

In the case of a read miss, a ReadReq message is sent to the lower levels. If the data is present in the private cache of another CPU, that cache sends the data and the corresponding request to main memory is aborted. The status flags are updated as required.

In the case of a write miss, a ReadExReq is broadcast. This invalidates the copies of the data in the other local caches. When a write miss occurs, the lower level needs to be accessed for the data. The copies of the block in other private caches need to be invalidated, and the newly written copy is marked as exclusive.

Fig. 4. Read Miss

Fig. 5. Write Miss

B. Benchmark: SPLASH2

SPLASH2 is a benchmark suite consisting of a set of parallel applications. Currently the SPLASH2 suite consists of 8 complete applications and 4 kernels. We have used the Syscall emulation (SE) mode for simulation.

C. Architecture used: ALPHA

D. Configuration

Table I shows the configuration used for running the RADIX SORT application from the SPLASH2 benchmark suite. Separate instruction and data caches are present at the first level (core level). L2 is a shared cache. We have implemented a blocking cache in order to achieve maximum potential power-gating intervals.

TABLE I. CONFIGURATION USED FOR SIMULATION

Core              ISA          ALPHA
                  CPU type     InOrder
                  Clock        2 GHz
                  Icache       32K, 2-way
                  Dcache       64K, 2-way
                  Cache type   Blocking
Memory hierarchy  L2           256K, 8-way

V. WORK DONE

Initially, to understand the working of gem5, we simulated a single-core system with the configuration shown in Table I. It has a private L1 cache, a shared L2 cache, and an in-order CPU. Separate L1 data and instruction caches are implemented, and L2 is unified. When the CPU requires data or an instruction, a request is sent to the L1 cache. If the data is present in the L1 cache, the request from the CPU is serviced. However, if the data is not available in the L1 cache, it is an L1 miss, and a request for the data is sent to L2. If the data is present in L2, the block is moved to the L1 cache. If an L2 miss occurs, an access to main memory is required to service the request from the CPU. The latency associated with a main memory access is very high; in our simulation it is 53 ns. We have implemented a blocking cache; for this we set the number of MSHRs equal to 1. By default, gem5 implements a non-blocking cache, i.e., reads to the cache are not blocked while a miss is being serviced, and the number of outstanding misses is configurable. We studied the memory trace file and understood the taxonomy related to the trace file. The SPLASH2 benchmark was run on the simulator; running RADIX, the statistics in Table II were obtained.
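The blocking-cache setup described above can be expressed in a gem5 configuration script along these lines. This is a sketch in the classic-cache style of gem5 releases of that era; parameter names vary across gem5 versions (newer releases use `Cache` with `tag_latency`/`data_latency`), so treat it as illustrative rather than the project's actual script:

```python
# Sketch of a gem5 (classic memory system) cache configuration with a
# single MSHR, which makes the cache blocking. Runs only inside gem5.
from m5.objects import BaseCache

class L1ICache(BaseCache):
    size = '32kB'
    assoc = 2
    mshrs = 1          # one MSHR: any outstanding miss blocks the cache
    tgts_per_mshr = 1  # one target per miss

class L1DCache(L1ICache):
    size = '64kB'      # data cache differs only in size (Table I)

class L2Cache(BaseCache):
    size = '256kB'
    assoc = 8
    mshrs = 1
    tgts_per_mshr = 1
```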

TABLE II. STATISTICS OBTAINED AFTER RUNNING RADIX ON GEM5 ON A SINGLE CORE

Parameter                          Value
Total simulation ticks             35416446000
Simulation frequency               1 THz
Simulation seconds                 0.035416
Simulation instruction rate        16061 instructions/sec
Total instructions simulated       61106777
Overall L2 hits                    108280
Overall L2 misses                  127466
L1 instruction cache misses        2564
L1 data cache misses               233273
Overall L1 instruction cache hits  6701177
Overall L1 data cache hits         9568540
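The raw counts in Table II can be cross-checked and turned into miss rates (the counts are copied from the table; the derived rates are our computation, not figures reported in the paper):

```python
# Derived metrics from Table II (counts copied from the table above).
ticks, sim_freq = 35416446000, 1e12          # 1 THz simulation frequency
l1d_hits, l1d_misses = 9568540, 233273
l1i_hits, l1i_misses = 6701177, 2564
l2_hits, l2_misses = 108280, 127466

# Cross-check: ticks / frequency reproduces the "Simulation seconds" row.
print(f"simulated seconds: {ticks / sim_freq:.6f}")  # 0.035416

print(f"L1D miss rate: {l1d_misses / (l1d_hits + l1d_misses):.2%}")
print(f"L1I miss rate: {l1i_misses / (l1i_hits + l1i_misses):.2%}")
print(f"L2 miss rate:  {l2_misses / (l2_hits + l2_misses):.2%}")
```

The L2 miss rate exceeds 50%, which is consistent with the long main-memory stalls this work targets for power gating.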

Table II gives the details of the number of stalled cycles during each miss for the first 100 instructions of the trace. Parsing the trace file obtained from the simulator, we obtain the number of stall cycles during a miss. From this file we extracted the data for the first 1000 CPU cycles, and the same is plotted in Fig. 6. Fig. 7 shows the stall cycles for the same CPU after the compulsory-miss phase has passed in the execution.

Fig. 6. CPU stalled cycles for the first 1000 CPU cycles when RADIX of the SPLASH2 benchmark is run on gem5 on a single core

The value of 1 for a particular cycle indicates that the core is stalled for a memory access during that period. This period can be utilized for power gating the core.

We then implemented a multicore system, with each core having a private L1 cache. Each core runs a program individually and there is no communication amongst the cores. L2

Fig. 7. CPU stalled cycles after the compulsory-miss phase for a single core

cache is shared. The size of the L2 cache is chosen to ensure inclusion in all cases.

A. Different Simulation Experiments

1) 4 cores, each running the same program: Configuration: L1 cache size (32K instruction cache, 64 kB data cache), L2 cache size (1 MB) to implement inclusion.

Fig. 8. Four CPUs: Same Workload : Memory access stall cycles

Fig. 8 shows the activity of each core against the cycle number. As before, a value of 1 indicates the core is accessing memory during that period. We can observe the variable memory access latency: whenever an L2 miss occurs, a long latency is seen compared to an L1 miss or an L2 hit.

2) 4 cores, each running a different program: A multiprogrammed environment with 4 different SPLASH kernels, namely RADIX, FFT, CHOLESKY, and LU, is simulated. Configuration: L1 cache size (32K instruction cache, 64 kB data cache), L2 cache size (1 MB) to implement inclusion. Number of instructions simulated: 120000 (to limit the simulation time). A long trace indicating the various memory operations is generated, which can then be analyzed focusing on the required region of interest. For convenient display, a representative number of cycles is selected.


Fig. 9. Four CPUs: Different Workloads : Memory access stall cycles

Along with the information about the state of each core, a plot is also generated from the system perspective to understand the number of cores stalling for memory access at any cycle. This profile is seen in Fig. 10.

Fig. 10. Four multiprogrammed CPUs : Number of CPUs Stalled

The profile indicates the number of cores stalling at any cycle.
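A profile of this kind is obtained by summing the per-core 0/1 stall signals across cores at each cycle; a sketch with made-up signals (not data from our simulations):

```python
def stalled_count_profile(stall_signals):
    """Sum per-core 0/1 stall signals into a per-cycle stalled-core count.

    stall_signals: list of equal-length lists, one per core, where a 1 in
    position t means the core is stalled on memory during cycle t.
    """
    return [sum(bits) for bits in zip(*stall_signals)]

# Hypothetical 4-core, 6-cycle example (illustrative values only):
signals = [
    [0, 1, 1, 1, 0, 0],   # core 0
    [0, 0, 1, 1, 1, 0],   # core 1
    [1, 1, 1, 0, 0, 0],   # core 2
    [0, 0, 1, 1, 1, 1],   # core 3
]
print(stalled_count_profile(signals))  # [1, 2, 4, 3, 2, 1]
```

Cycles where the count equals the core count are the windows where the whole system waits on memory and gating pays off most.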

3) 16 cores, 4x4 program distribution over cores: Configuration: 16 cores, L1 cache size (32K instruction cache, 64 kB data cache), L2 cache size (2 MB). Each of the four kernels is simulated on 4 cores for 120000 instructions.

An enlarged version is available in Appendix A for better visualization. During the majority of cycles, only 1 core is active and the others are stalling. This happens during L2 accesses, since L2 is implemented as a blocking cache: it blocks all read requests until the current request is serviced. As a result, a processor stalls for a long time. During these stalls the processors can be put into sleep mode, thus reducing the leakage power. As the number of cores increases, the duration of the stall cycles increases due to the blocking L2. These figures clearly show when each processor becomes active and when it is idle.

Fig. 11. 16 multiprogrammed CPUs: Memory access stall cycles

Fig. 12. 16 multiprogrammed CPUs : Number of CPUs Stalled

4) 64 cores, 16x4 program distribution over cores: Configuration: 64 cores, L1 cache size (32K instruction cache, 64 kB data cache), L2 cache size (8 MB). Each of the four kernels is simulated on 16 cores for 120000 instructions.

Fig. 13. 64 multiprogrammed CPUs: Memory access stall cycles


Fig. 14. 64 multiprogrammed CPUs : Number of CPUs OFF

B. Details about Profiling Implementation

To understand the energy gains that can be obtained by gating a core, one needs to understand how many cores access the memory at a particular time. To extract this information, the memory access stall cycles of each core need to be profiled, and the pattern obtained may further be useful in designing a predictor to accurately predict the memory accesses of a particular core. This will further aid the decision-making process regarding power gating of the core.
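One simple form such a predictor could take (our illustration, not a design proposed in this work) is a per-core moving average of recent stall lengths, gating only when the predicted stall exceeds an assumed breakeven point:

```python
from collections import deque

class StallPredictor:
    """Predict the next stall length as the mean of the last few stalls.

    An illustrative sketch: `history` and `breakeven_cycles` are assumed
    tuning parameters, not values from the paper.
    """
    def __init__(self, history=4, breakeven_cycles=35):
        self.recent = deque(maxlen=history)
        self.breakeven = breakeven_cycles

    def record(self, stall_cycles):
        self.recent.append(stall_cycles)

    def should_gate(self):
        if not self.recent:
            return False             # no history yet: stay conservative
        predicted = sum(self.recent) / len(self.recent)
        return predicted > self.breakeven

p = StallPredictor()
for s in (12, 10, 110, 120):         # two short stalls, then two long ones
    p.record(s)
print(p.should_gate())  # True: the mean of the last 4 stalls is 63 cycles
```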

The information about the memory stall cycles can be obtained from the cache trace, which is produced by configuring gem5 to dump the trace with the cache-related flags set, so as to capture the operations related to memory. This trace can be further analyzed to find out when an L1 miss is encountered and when the CPU handles the response for that miss. A few scripts were written in bash (shell) to extract this data and plot it in graphical format.

The scripts segregate the data from the obtained trace for each core. One can specify the clock cycles (start and end) between which the data should be extracted from the trace file. This data is then parsed for each core, and the related misses and corresponding L1 cache handles are obtained. The figures are then plotted based on the available data: a rising edge denotes a miss, and the falling edge denotes the handling by the L1 cache. GnuPlot [7] is used to plot the graphs representing these misses and handles for each CPU separately. The data is used to monitor the number of CPUs that are active or idle during a particular clock cycle.
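The extraction step can be sketched as follows. The real project used bash scripts over gem5's cache-flag trace, and gem5 trace formats vary, so the line format matched here is a simplified stand-in, not the actual trace syntax:

```python
import re

# Assumed line format: "<tick>: system.cpuN.dcache: <event>", where the
# events of interest are "miss" (rising edge) and "handle" (falling edge).
LINE = re.compile(r"^(\d+): system\.cpu(\d+)\.dcache: (miss|handle)")

def stall_intervals(trace_lines):
    """Return {core: [(miss_tick, handle_tick), ...]} per core."""
    open_miss, intervals = {}, {}
    for line in trace_lines:
        m = LINE.match(line)
        if not m:
            continue                 # skip unrelated trace lines
        tick, core, event = int(m.group(1)), int(m.group(2)), m.group(3)
        if event == "miss":
            open_miss[core] = tick
        elif core in open_miss:
            intervals.setdefault(core, []).append((open_miss.pop(core), tick))
    return intervals

trace = [
    "1000: system.cpu0.dcache: miss",
    "1500: system.cpu1.dcache: miss",
    "54000: system.cpu0.dcache: handle",
    "56000: system.cpu1.dcache: handle",
]
print(stall_intervals(trace))
# {0: [(1000, 54000)], 1: [(1500, 56000)]}
```

Each (miss, handle) pair is one stall interval; plotting them as a 0/1 signal per core reproduces the profiles in Figs. 6-9.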

The opportunities available for power gating also depend on the benchmark used. Applications involving frequent memory accesses may provide better opportunities for power gating, since a core is power gated when it stalls during long memory accesses.

VI. CONCLUSION AND FUTURE WORK

The current results indicate the stall cycles on a per-core basis as well as the state of the cores in the system as a whole. The results from the simulation of the 64-core system indicate a large number of cycles during which the majority of the cores are stalling. These cycles give a rough estimate of the potential of power gating in the system. This information can also aid the floorplan, where we can place cores with overlapping activity cycles away from each other in order to distribute the load well over the power delivery network. A predictor can now be designed to take a run-time decision on power gating each core. This predictor can use the floorplan information along with the tendency of each core to access memory to decide which core can be power gated during its stall for memory access. This analysis can also be extended to multicore systems in which multiple cores run multiple threads of a single program and communicate in order to perform tasks. The considerations in this case, however, will involve many aspects of the system, such as the memory system (cache coherence protocol), any other optimizations used to hide memory latency in the system (write buffer, larger MSHR), the interconnect model used, etc.

REFERENCES

[1] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson and P. Bose, "Microarchitectural Techniques for Power Gating of Execution Units," Proceedings of the 2004 International Symposium on Low Power Electronics and Design, pp. 32-37. ACM.

[2] Kaxiras, Stefanos, et al. "Cache-line decay: A mechanism to reduce cache leakage power." Power-Aware Computer Systems (2001), 82-96.

[3] Jeong, Kwangok, et al. "MAPG: Memory access power gating." Design, Automation and Test in Europe Conference and Exhibition (DATE), 2012. IEEE, 2012.

[4] Lin, Ye-Jyun, et al. "Memory access aware power gating for MPSoCs." Design Automation Conference (ASP-DAC), 2012 17th Asia and South Pacific. IEEE, 2012.

[5] Binkert, Nathan, et al. "The gem5 simulator." ACM SIGARCH Computer Architecture News, 39.2 (2011): 1-7.

[6] http://www.gem5.org/docs/html/gem5MemorySystem.html

[7] http://gnuplot.sourceforge.net/

[8] http://nanocad.ee.ucla.edu/pub/Main/SnippetTutorial/PG.pdf


APPENDIX A

Figure A.1: 4 cores: Memory stall cycles same workload


Figure A.2: 4 cores: Number of CPUs stalled


Figure A.3: 16 cores: Memory stall cycles


Figure A.4: 16 cores: Number of CPUs stalled


Figure A.5: 64 cores: Memory stall cycles


Figure A.6: 64 cores: Number of CPUs stalled


Figure A.7 Configuration Diagram as obtained from Gem5 for a single core

Figure A.8 Configuration for 4 cores

