Date post: 27-Jan-2021
  • EECS 151/251A Homework 8

    Due Monday, April 6th, 2020

    Problem 1: Power and Leakage

    Consider a 3-input “2-1 AOI” gate shown below with $V_{DD} = 1$ V, $C_L = 5$ fF, $C_D = 0.2$ fF/nm. Assume $R_{ON,n} = 10$ kΩ · nm, $R_{ON,p} = 20$ kΩ · nm, $R_{OFF,n} = 100$ MΩ · nm, $R_{OFF,p} = 500$ MΩ · nm for the given device length.

    a) Size the gate, using as a reference a symmetrically sized inverter with $W_n = 10$ nm. Your sized gate should have the same input capacitance as the reference inverter for all inputs.

    Solution: To keep the pull-up and pull-down delays the same, we size $W_{p,a} = 4W_{n,a}$ and $W_{p,b/c} = 2W_{n,b/c}$. To make the inputs have the same input capacitance as the reference inverter, we size the transistors as follows:

    • $a_n = \tfrac{3}{5} W_n = 6$ nm
    • $a_p = \tfrac{12}{5} W_n = 24$ nm
    • $b_n = c_n = W_n = 10$ nm
    • $b_p = c_p = 2W_n = 20$ nm

    Version: 1 - 2020-04-21 11:29:45-07:00
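The sizing arithmetic can be checked with a short sketch. It assumes that input capacitance is proportional to the total gate width on an input, and that the symmetric reference inverter has $W_p = 2W_n$ because $R_{ON,p} = 2R_{ON,n}$. The helper name `size_input` is hypothetical.

```python
from fractions import Fraction

W_N_REF = 10                  # nm, reference inverter NMOS width (from the problem)
W_P_REF = 2 * W_N_REF         # nm, assumed: symmetric inverter with R_ON,p = 2 * R_ON,n
C_IN_REF = W_N_REF + W_P_REF  # input capacitance expressed as total gate width (nm)

def size_input(p_to_n_ratio):
    """Return (W_n, W_p) in nm for one input, given W_p = ratio * W_n."""
    w_n = Fraction(C_IN_REF, 1 + p_to_n_ratio)
    return w_n, p_to_n_ratio * w_n

a_n, a_p = size_input(4)   # input a: W_p = 4 * W_n
b_n, b_p = size_input(2)   # inputs b and c: W_p = 2 * W_n
print(a_n, a_p, b_n, b_p)
```

Under these assumptions the script reproduces the widths listed in the solution (6, 24, 10, and 20 nm).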

    b) Assume that the probability of an input being high is 0.5 (i.e., on any given clock cycle, each input is equally likely to be a 0 or a 1) and that all inputs are independent. What is the probability that the output is high, $P(\text{Out} = 1)$? What is the probability that the output is low, $P(\text{Out} = 0)$? What is the gate activity factor (i.e., the probability that the output will transition from low to high, $P_{0 \to 1}$)?

    Solution: The truth table for the 2-1 AOI gate is as follows:

    a  b  c  out
    0  0  0  1
    0  0  1  1
    0  1  0  1
    0  1  1  0
    1  0  0  0
    1  0  1  0
    1  1  0  0
    1  1  1  0

    Since each input is equally likely to be 1 or 0 and all the inputs are independent, each of the eight input combinations has probability 1/8, and the probability of the output being 1 is the sum of the probabilities of the input combinations that yield an output of 1. Therefore

    $$P(\text{Out} = 1) = \frac{3}{8} = 0.375$$

    The same process is applied to finding the probability of the output being 0.

    $$P(\text{Out} = 0) = \frac{5}{8} = 0.625$$

    The probability that the output will transition from low to high is the joint probability that the output is 0 in one cycle and 1 in the next. Treating successive cycles as independent, this joint probability factors into a product.

    $$P_{0 \to 1} = P(\text{Out}_{t-1} = 0,\ \text{Out}_{t} = 1) = P(\text{Out} = 0) \cdot P(\text{Out} = 1) = \frac{15}{64} = 0.234375$$
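The same probabilities can be obtained by enumerating the truth table. This is a minimal sketch; the Boolean form `NOT(a OR (b AND c))` is inferred from the truth table in the solution, and the function name `aoi21` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def aoi21(a, b, c):
    """2-1 AOI gate: out = NOT(a OR (b AND c)), inferred from the truth table."""
    return int(not (a or (b and c)))

# All 8 input combinations are equally likely (each input is 0 or 1 with p = 0.5).
outputs = [aoi21(a, b, c) for a, b, c in product((0, 1), repeat=3)]
p_high = Fraction(sum(outputs), len(outputs))
p_low = 1 - p_high
p_0_to_1 = p_low * p_high   # assumes successive cycles are independent
print(p_high, p_low, p_0_to_1, float(p_0_to_1))
```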

    c) What is the dynamic power dissipation of the gate, if the clock frequency is 3 GHz? You may ignore the parasitic drain capacitance in the internal nodes of the PMOS stack, but not at the output.

    Solution: From lecture, we know that the dynamic power dissipation of a gate can be expressed as

    $$P_{dyn} = \frac{1}{2} \alpha C_{tot} V_{dd}^2 f_{clk}$$

    The total capacitance (from our sizing) is the drain capacitance from $a_n$, $a_p$, and $b_n$ along with the load capacitance. Writing $C_D$ below as shorthand for the drain capacitance of a device of width $W_n$ (that is, $0.2\ \text{fF/nm} \cdot 10\ \text{nm}$),

    $$C_{tot} = \frac{3}{5} C_D + \frac{12}{5} C_D + C_D + C_L = 4C_D + C_L$$

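    The activity factor is the probability of the output transitioning from low to high, which you found in part (b).

    $$\alpha = P_{0 \to 1}$$

    The other values are

    $$V_{dd} = 1\ \text{V}, \quad f_{clk} = 3\ \text{GHz}$$

    $$P_{dyn} = \frac{1}{2} (0.2344)(4 \cdot 0.2 \cdot 10 + 5) \cdot 10^{-15} \cdot (1)^2 \cdot (3 \cdot 10^9)$$

    $$P_{dyn} = 4.57\ \mu\text{W}$$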
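As a numerical check of the dynamic power calculation, the following sketch plugs in the values from the problem and the widths from part (a). It keeps the factor of 1/2 used in the solution's formula; the variable names are illustrative only.

```python
ALPHA = 15 / 64        # activity factor P(0->1) from part (b)
C_D_PER_NM = 0.2e-15   # F/nm, drain capacitance per nm of width
C_L = 5e-15            # F, load capacitance
V_DD = 1.0             # V
F_CLK = 3e9            # Hz

# Widths (nm) of the devices whose drains sit on the output node: a_n, a_p, b_n.
output_drain_widths_nm = [6, 24, 10]

c_tot = C_D_PER_NM * sum(output_drain_widths_nm) + C_L
p_dyn = 0.5 * ALPHA * c_tot * V_DD**2 * F_CLK
print(f"C_tot = {c_tot * 1e15:.1f} fF, P_dyn = {p_dyn * 1e6:.2f} uW")
```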

    d) 251A only - 151 Optional. For the following three cases, calculate the leakage current. An approximate expression is perfectly fine as long as you explain and justify your assumptions/simplifications.

    (a) All inputs are zero.

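    Solution: All PMOS devices are on, so ignore their resistance. All NMOS devices are off, so the leakage path is the parallel combination of a_nmos with the series pair b_nmos + c_nmos.

    $$R_{eq} \approx R_{off,n,a} \parallel (R_{off,n,b} + R_{off,n,c})$$

    $$I = \frac{V_{dd}}{R_{off,n,a} \parallel (R_{off,n,b} + R_{off,n,c})}$$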

    (b) All inputs are 1.

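    Solution: All NMOS devices are on, so ignore their resistance. All PMOS devices are off, so the leakage path is a_pmos in series with the parallel pair b_pmos and c_pmos.

    $$R_{eq} \approx R_{off,p,a} + (R_{off,p,b} \parallel R_{off,p,c})$$

    $$I = \frac{V_{dd}}{R_{off,p,a} + (R_{off,p,b} \parallel R_{off,p,c})}$$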

    (c) A = B = 1, C = 0.

    Solution: c_pmos (on) is in series with a_pmos (off), which is in series with a_nmos (on).

    $$R_{eq} \approx R_{on,p,c} + R_{off,p,a} + R_{on,n,a}$$

    $$I = \frac{V_{dd}}{R_{eq}} \approx \frac{V_{dd}}{R_{on,p,c} + R_{off,p,a} + R_{on,n,a}}$$
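For a rough numerical feel, the three cases can be evaluated as below. This is only a sketch under stated assumptions: each device resistance is taken as the width-normalised value (in Ω · nm) divided by the width from part (a), off devices contribute their off resistance, and on devices in the non-leaking network are ignored as in the solution. The helper names `r` and `parallel` are hypothetical.

```python
V_DD = 1.0  # V

# Width-normalised resistances from the problem statement, in ohm * nm.
R_ON = {"n": 10e3, "p": 20e3}
R_OFF = {"n": 100e6, "p": 500e6}

# Device widths in nm from part (a).
WIDTH = {"a_n": 6, "b_n": 10, "c_n": 10, "a_p": 24, "b_p": 20, "c_p": 20}

def r(device, on):
    """Resistance in ohms, assuming R = (ohm * nm value) / width."""
    kind = device[-1]  # "n" or "p"
    return (R_ON if on else R_OFF)[kind] / WIDTH[device]

def parallel(r1, r2):
    """Equivalent resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)

# (a) all inputs 0: every NMOS is off.
r_a = parallel(r("a_n", False), r("b_n", False) + r("c_n", False))
# (b) all inputs 1: every PMOS is off.
r_b = r("a_p", False) + parallel(r("b_p", False), r("c_p", False))
# (c) A = B = 1, C = 0: c_pmos on, a_pmos off, a_nmos on.
r_c = r("c_p", True) + r("a_p", False) + r("a_n", True)

for label, r_eq in (("a", r_a), ("b", r_b), ("c", r_c)):
    print(f"case ({label}): I = {V_DD / r_eq * 1e9:.1f} nA")
```

In case (c) the off resistance of a_pmos dominates, so the two on resistances change the result only slightly.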

    Problem 2: Energy Efficiency Improvements

    Consider the design of a vector add unit. As shown below, the unit has two input register banks and an output register bank. One of the input register banks holds the first vector [A3, A2, A1, A0] and the second holds [B3, B2, B1, B0]. A controller (not shown) passes the elements of the input vectors through the adder (one per clock cycle) and the result is stored in the output register bank [C3, C2, C1, C0]. As you can see, a 4-1 multiplexor is used by the controller for choosing the proper A and B register, and clock enable signals are given to select the proper C register. The circuit elements have the following delays: $\tau_{add} = 16$ ns, $\tau_{mux} = 2$ ns, and $\tau_{setup} = \tau_{clk-Q} = 1$ ns.

    On average, at the nominal $V_{dd}$ the energy for one data item passed through the adder block is 1 Joule, and 0.2 Joules for the multiplexor. The registers each consume 0.1 Joule on average for each new data word stored.

    Your application for this circuit requires a complete vector of 4 elements be computed every 80 ns. You can ignore the time and energy required to load new values into the A and B registers.

    For this problem assume that the adder operation cannot be pipelined.

    Devise a scheme that would improve the switching energy efficiency while meeting the application requirements. Compare the switching energy per result of the original circuit and your new one.

    Assume that a 1/n reduction in clock frequency can accommodate a 1/n reduction in Vdd.

    Solution: It takes 4 clock cycles to complete the vector addition. Each clock cycle requires going through 2 MUXes and an adder, and storing the result in the correct register. The total energy expenditure is

    $$E_{tot,old} = 4(2 \cdot 0.2 + 1 + 0.1) = 6\ \text{J}$$

    Since we cannot pipeline the adder operation, the other tradeoff we can make for energy efficiency while meeting application requirements is to spend more hardware on parallelism. We can have 4 adders running in parallel and remove the need for MUXes. The new total energy expenditure is then

    $$E_{tot,new} = 4 \cdot 1 + 4 \cdot 0.1 = 4.4\ \text{J}$$

    We can also run the clock slower, since the vector addition can now be completed in one clock cycle. The clock can be slowed by a factor of $\frac{80}{18} = 4.44$, where 18 ns is the new critical path ($\tau_{clk-Q} + \tau_{add} + \tau_{setup}$). This allows a factor of 4.44 reduction in $V_{dd}$, resulting in a further $4.44^2 = 19.8$ times reduction in energy.
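The energy comparison can be reproduced with a few lines of arithmetic. This sketch uses only the delays and energies given in the problem; the final figure (the parallel design's energy after voltage scaling) is derived here from the solution's factors and is not stated explicitly in the original.

```python
# Delays in ns and energies in J, taken from the problem statement.
T_ADD, T_MUX, T_SETUP, T_CLK_Q = 16, 2, 1, 1
E_ADD, E_MUX, E_REG = 1.0, 0.2, 0.1
N_ELEMENTS = 4
T_BUDGET = 80  # ns allowed per 4-element vector

# Original design: each element passes through 2 muxes, 1 adder, 1 register write.
e_old = N_ELEMENTS * (2 * E_MUX + E_ADD + E_REG)

# Parallel design: 4 adders, no muxes, all results stored in one cycle.
e_parallel = N_ELEMENTS * (E_ADD + E_REG)

# One cycle now only has to cover clk-Q, the adder, and setup.
t_critical = T_CLK_Q + T_ADD + T_SETUP
slowdown = T_BUDGET / t_critical      # allowed reduction in f, hence in Vdd
e_scaled = e_parallel / slowdown**2   # switching energy scales with Vdd^2

print(f"old = {e_old:.2f} J, parallel = {e_parallel:.2f} J")
print(f"slowdown = {slowdown:.2f}x, scaled = {e_scaled:.3f} J")
```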

    Problem 3: Race to Halt

    An effective scheme for improving energy efficiency when static power consumption is a significant component of total power consumption is a technique called “race to halt”. The basic idea is to run the hardware at maximum speed to quickly complete the necessary set of computations, then turn off the power, thus preventing leakage.

    Suppose you have a CPU that takes 4 seconds to run your application with an average power consumption of 8 Watts, where 50% of the power is dynamic and 50% is static. Assume that no other program is running on the CPU. You are willing to run your application slower if that could save energy.

    You would like to determine the most effective way to run your application to preserve the battery life. You have the ability to control the supply voltage ($V_{dd}$), the clock frequency ($f$), and if needed can put the CPU into a sleep mode where static power is essentially zero. The CPU’s $V_{dd}$ can be increased or decreased by at most 25%.

    Explore “race to halt” versus running longer at a lower $V_{dd}$. Which approach will be better at conserving your battery charge? For this problem, assume that when varying frequency $f$ and supply voltage $V_{dd}$, the static power usage remains constant. This is more or less true. Show your work and justify your answer.

    Assume that an n% increase/decrease in clock frequency can accommodate an n% increase/decrease in $V_{dd}$.

    Solution: Since the nominal average power consumption is $P_{tot,nom} = 8$ W, the nominal dynamic power is $P_{dyn,nom} = \frac{1}{2} C V^2 f = 4$ W and $P_{static,nom} = 4$ W.

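    Race to Halt: Increase $V_{dd}$ and $f_{clk}$ by the maximum 25%.

    $$P_{dyn,race} = \frac{1}{2} C (1.25 V_{dd})^2 (1.25 f_{clk})$$

    Taking the run time to shrink by the same 25%, the CPU now takes 3 seconds instead of 4 to run the application. With $1.25^3 \approx 1.95$, the total energy is

    $$E_{tot,race} = 1.95 \cdot \frac{1}{2} C V^2 f \cdot 3\ \text{s} + 4\ \text{W} \cdot 3\ \text{s}$$

    $$E_{tot,race} = 1.95 \cdot 4\ \text{W} \cdot 3\ \text{s} + 4\ \text{W} \cdot 3\ \text{s} = 35.4\ \text{J}$$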

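    Lower $V_{dd}$: Decrease $V_{dd}$ and $f_{clk}$ by the maximum 25%.

    $$P_{dyn,low} = \frac{1}{2} C (0.75 V_{dd})^2 (0.75 f_{clk})$$

    Taking the run time to grow by the same 25%, the CPU now takes 5 seconds instead of 4 to run the application. With $0.75^3 \approx 0.42$, the total energy is

    $$E_{tot,low} = 0.42 \cdot \frac{1}{2} C V^2 f \cdot 5\ \text{s} + 4\ \text{W} \cdot 5\ \text{s}$$

    $$E_{tot,low} = 0.42 \cdot 4\ \text{W} \cdot 5\ \text{s} + 4\ \text{W} \cdot 5\ \text{s} = 28.4\ \text{J}$$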

    Interestingly enough, the lower $V_{dd}$ scheme is more energy efficient.
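The comparison can be sketched numerically. The first pair of results uses the 3 s and 5 s run times from the solution. The second pair is an alternative assumption that is not part of the original solution: run time exactly inversely proportional to clock frequency (3.2 s and about 5.33 s). The function name `total_energy` is hypothetical.

```python
P_DYN_NOM = 4.0     # W, nominal dynamic power (50% of 8 W)
P_STATIC = 4.0      # W, assumed constant while the CPU is awake
T_NOM = 4.0         # s, nominal run time

def total_energy(scale, run_time):
    """Energy in J when Vdd and f are both multiplied by `scale`.

    Dynamic power follows C * V^2 * f, so it scales with scale**3.
    Static power is only paid during `run_time`; afterwards the CPU sleeps.
    """
    return (P_DYN_NOM * scale**3 + P_STATIC) * run_time

nominal = total_energy(1.0, T_NOM)

# Run times as used in the solution (3 s and 5 s).
race_rounded = total_energy(1.25, 3.0)
low_rounded = total_energy(0.75, 5.0)

# Alternative assumption: run time is exactly inversely proportional to f.
race_exact = total_energy(1.25, T_NOM / 1.25)
low_exact = total_energy(0.75, T_NOM / 0.75)

print(f"nominal: {nominal:.1f} J")
print(f"race to halt: {race_rounded:.1f} J (solution), {race_exact:.1f} J (1/f)")
print(f"lower Vdd:    {low_rounded:.1f} J (solution), {low_exact:.1f} J (1/f)")
```

Under either run-time assumption the ordering is the same: lowering $V_{dd}$ uses less energy than the nominal operating point, which in turn uses less than racing to halt.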

    Problem 4: Memory

    a) Suppose you want to design a 32-bit wide memory block with a capacity of 2K 32-bit words of storage (remember 1K = 1024). We would like to have the core of the block square (equal number of rows and columns). How many total address bits are needed for this memory? How many address bits are used by the row-decoder? How many address bits are used by the column-decoder?

    Solution:

    In brief: 2 × 1024 × 32 = 65536 bits, so the core is 256 × 256 with an 11-bit address. The column decoder requires 3 bits ($2^3 = 256/32 = 8$) and the row decoder requires 8 bits (11 − 3). In detail, the total memory size is

    $$2 \times 1024 \times 32 = 65536\ \text{bits}$$

    Since we want a square block, we take the square root of the memory size

    $$\sqrt{65536} = 256$$

    This means we have to design a 256 × 256 memory core. The total number of address bits required is

    $$L = \log_2(2 \cdot 1024) = 11\ \text{bits}$$

    Each row contains 256/32 = 8 32-bit words. So the column-decoder needs

    $$K = \log_2(8) = 3\ \text{bits}$$

    The row-decoder then requires $L - K = 11 - 3 = 8$ bits.
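The address-bit breakdown can be checked with a short calculation. This is a minimal sketch using only the numbers from the problem; the variable names are illustrative.

```python
import math

WORD_BITS = 32
NUM_WORDS = 2 * 1024

total_bits = NUM_WORDS * WORD_BITS
side = math.isqrt(total_bits)          # rows == columns for a square core
assert side * side == total_bits       # 65536 is a perfect square

address_bits = int(math.log2(NUM_WORDS))    # exact: NUM_WORDS is a power of two
words_per_row = side // WORD_BITS
column_bits = int(math.log2(words_per_row))
row_bits = address_bits - column_bits

print(total_bits, side, address_bits, column_bits, row_bits)
```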

    b) Now you want to design the row decoder using the predecoder technique presented in lecture. You can use only gates with no more than 4 inputs. Map out the scheme and describe the design of each decoder.

    Solution: Predecode the 8 row-address bits in groups of 2 bits (four 2-to-4 predecoders). Then, for each of the 256 word lines, combine one predecoded result from each of the four groups in a 4-input AND gate to decode the 8-bit address.
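A behavioral sketch of this two-level scheme is shown below. It assumes four 2-to-4 predecoders followed by one 4-input AND per word line; the function names `predecode` and `row_decode` are hypothetical, and the equality test in `predecode` stands in for the 2-input AND of each bit or its complement.

```python
def predecode(pair):
    """2-to-4 predecoder: one-hot list of 4 lines for a 2-bit value (0..3)."""
    return [int(pair == k) for k in range(4)]

def row_decode(address):
    """Return the 256 word lines for an 8-bit row address."""
    # Split the address into four 2-bit groups, least significant group first.
    groups = [predecode((address >> (2 * g)) & 0b11) for g in range(4)]
    word_lines = []
    for row in range(256):
        # One 4-input AND per word line: one predecoded line from each group.
        picks = [groups[g][(row >> (2 * g)) & 0b11] for g in range(4)]
        word_lines.append(picks[0] & picks[1] & picks[2] & picks[3])
    return word_lines

lines = row_decode(0xA7)
print(lines.index(1), sum(lines))
```

Exactly one word line is asserted for any address, and its index equals the address.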

    Problem 5: DRAM [4 pts]

    1-transistor DRAM designs usually include a “row buffer”, a register on the periphery that is used to register an entire row.

    a) Explain how this register could be used and why it’s a good idea.

    Solution: It reduces power and increases memory system speed. RAM accesses exhibit spatial locality to a high degree: an access to one word in a DRAM row is likely to be followed by another access to the same row. Buffering the row saves having to read the memory cells again, returning a value to the system faster and using less power. For writing: a row is opened (copied into the row buffer) and constituent bytes/words are updated before the entire buffer is written back.

    b) Explain how the inclusion of this buffer changes the detailed steps needed for a memory read and memory write operation.

    Solution: For read:

    • compulsory miss: slow access - open row, read data and move to row buffer, then move data to out
    • row buffer hit: fast access - only move data from buffer to out
    • row buffer conflict: slow access - write back existing row, open and read the new row, and update row buffer

    For write:

    • compulsory miss: slow access - open row, update row buffer, then move data to buffer where edits will take place
    • row buffer hit: fast access - write and make edits on row buffer
    • row buffer conflict: slow access - write back existing row, open and read the new row, and update row buffer
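The three access cases can be illustrated with a toy model. This is a behavioral sketch only, not a timing-accurate DRAM model: the class name `RowBufferDram`, the array sizes, and the choice to write back on every conflict are assumptions made for illustration.

```python
class RowBufferDram:
    """Toy DRAM bank with a single row buffer (sizes are illustrative)."""

    def __init__(self, rows=4, words_per_row=8):
        self.cells = [[0] * words_per_row for _ in range(rows)]
        self.open_row = None      # index of the row held in the buffer
        self.buffer = None        # copy of that row

    def _activate(self, row):
        """Bring `row` into the buffer and classify the access."""
        if self.open_row == row:
            return "hit"
        if self.open_row is None:
            kind = "compulsory miss"
        else:
            kind = "conflict"
            self.cells[self.open_row] = list(self.buffer)  # write back old row
        self.buffer = list(self.cells[row])                # open the new row
        self.open_row = row
        return kind

    def read(self, row, col):
        kind = self._activate(row)
        return self.buffer[col], kind

    def write(self, row, col, value):
        kind = self._activate(row)
        self.buffer[col] = value   # edits happen in the buffer
        return kind

dram = RowBufferDram()
print(dram.write(1, 3, 42))   # first access to the bank
print(dram.read(1, 3))        # same row again
print(dram.read(2, 0))        # different row
print(dram.read(1, 3))        # back to row 1, after its write-back
```

The last read returns the value written earlier, which shows that the edit made in the buffer reached the cell array during the write-back.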

    Problem 6: Memory Implementation

    a) Consider the design of a (very) small asynchronous-read register file block of 4 words by 4 bits each, with two read ports and one write port. You want to implement the register memory cells as positive edge-triggered flip-flops. Draw the circuit diagram for your design using the flip-flop cells, multiplexers, and logic gates.

    Solution:

    b) 251A only - 151 Optional. Now consider the redesign of the register file from part a) using latches instead of flip-flops. For this design, as above, the write operation occurs on the positive edge of the clock, but now the output data on a read becomes available after the falling edge of the clock.

    Solution:

    Problem 7: Memory Blocks [10pts]

    You are given a simple dual port (SDP) memory block that is 128x8. Show how you would use multiple instances to design a memory that has 2 independent read ports and is 256x8.

    Solution: Use 2 128x8 blocks to make a 256x8 memory (increasing depth), then use 2 copies of that arrangement (4 blocks total) to get 2 read ports. Every write goes to both copies so that each copy can serve one read port.
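A behavioral sketch of this composition follows. It assumes the top address bit selects between the two 128-deep blocks and that writes are broadcast to both copies; the class names `Sdp128x8` and `TwoRead256x8` are hypothetical.

```python
class Sdp128x8:
    """Simple dual port block: one write port and one read port, 128 x 8."""

    def __init__(self):
        self.mem = [0] * 128

    def write(self, addr, data):
        self.mem[addr] = data & 0xFF

    def read(self, addr):
        return self.mem[addr]

class TwoRead256x8:
    """256 x 8 memory with 2 read ports, built from four Sdp128x8 blocks."""

    def __init__(self):
        # copies[port][bank]: one 256-deep copy (2 banks) per read port.
        self.copies = [[Sdp128x8(), Sdp128x8()] for _ in range(2)]

    def write(self, addr, data):
        bank, offset = addr >> 7, addr & 0x7F   # top address bit picks the bank
        for copy in self.copies:                # keep both copies identical
            copy[bank].write(offset, data)

    def read(self, port, addr):
        bank, offset = addr >> 7, addr & 0x7F
        return self.copies[port][bank].read(offset)

mem = TwoRead256x8()
mem.write(5, 0x11)
mem.write(200, 0x22)
print(mem.read(0, 5), mem.read(1, 200))
```

Because each read port has its own copy of the data, the two ports can read different addresses in the same cycle without contending for a block's single read port.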
