
10T Dual-voltage Low Power SRAM Project Report

Date post: 23-Feb-2017
Upload: jie-song

10T Dual-voltage Low Power SRAM
Ruobai Feng, Zhuonan Li, Zhesheng Lou, Yimai Peng, Jie Song

ABSTRACT
With technology scaling, low power operation has become one of the crucial topics in VLSI. In memory design, it is vital to reduce the power consumption of the memory with little penalty in performance and area. In this paper, we propose a low power 10T SRAM bit cell to be implemented in a RISC processor, compare it with the 6T SRAM cell from several perspectives, and present the results of HSPICE simulation.

Concepts
• Hardware → Static Memory
• Hardware → Clock Generation and Timing

Keywords
SRAM; 10T Bit Cell SRAM; Voltage Scaling; Read Static Noise Margin

1. INTRODUCTION
As the demand for memory continues to grow, SRAM becomes increasingly important in modern VLSI design. However, SRAM occupies a large fraction of chip area and consumes a significant amount of dynamic and leakage power, and aggressive technology scaling makes the cooling and power issues even worse. As a result, power consumption has become the major concern in SRAM design. There are several approaches to reducing total power consumption: voltage scaling, multiple supply voltages, logic optimization, pipelining, and parallelism. Because the required performance usually varies between the components of an SRAM, using multiple supply voltages appears to be a good way to balance performance against overall power consumption.

The bit cell is the core storage structure of an SRAM and greatly influences its performance. The 6T SRAM cell is the conventional choice of memory cell. Because of its compact design and the voltage division between the access and driver transistors, the 6T cell has relatively small hold and read noise margins, and substantial problems occur when the supply voltage is low. To deal with these problems, we propose a non-conventional 10T SRAM cell that achieves higher read and write stability in a low-voltage environment and, at the same time, has lower overall power consumption.

In this paper, Section 2 compares 6T and 10T SRAM cells in terms of delay and power consumption during read and write operations, and in terms of noise margins. Section 3 explains the layout techniques we adopted when implementing the bit cell in an actual SRAM circuit, and Section 4 details the peripheral circuit design of the SRAM. The simulation results are presented and discussed in Section 5, and Section 6 describes the problems we encountered in design as well as possible future improvements.

The noise margin during read operation decreases because of the voltage division between the access and driver transistors. To find an SRAM cell that performs properly in read and write operations at low voltage and has higher stability, 8T and 10T SRAM cells have been proposed for comparison with the conventional 6T SRAM cell.

2. COMPARISON BETWEEN 6T AND 10T SRAM CELLS
After comparing and combining the designs in references [2]-[7], we propose a 10T SRAM cell that achieves higher read and write stability in a low-voltage environment and consumes less power.

The 10T SRAM bit cell we propose is controlled by three control signals: write, read, and footer. In a read operation, both bit lines are pre-charged high, and the read and footer signals are driven high. The inverter pair is grounded as in the 6T SRAM cell, and the data stored in the cell turns on one of the pass transistors, allowing the voltage on the corresponding bit line to drop. In a write operation, the write signal is turned on and the footer signal is turned off to float the nodes to be written. After the write driver alters the state of the cell, the footer is turned on again to finish the pull-down transition.
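The control sequence described above can be summarized in a small behavioral sketch. The read and footer levels during a read, and the write and footer levels during a write, follow the text. The levels of signals the text does not mention in a given phase are assumed to be low (or, for write in the second write phase, still high), and the names `PHASES` and `control_sequence` are illustrative only.

```python
# Behavioral sketch of the 10T cell control signals (1 = asserted, 0 = de-asserted).
# Levels marked "assumed" are not stated in the text.
PHASES = {
    "read": [
        # Bit lines are pre-charged high before this phase.
        {"phase": "evaluate", "write": 0, "read": 1, "footer": 1},  # write=0 assumed
    ],
    "write": [
        # Footer off: the internal nodes float so the write driver can flip them.
        {"phase": "float_and_flip", "write": 1, "read": 0, "footer": 0},  # read=0 assumed
        # Footer back on: completes the pull-down transition.
        {"phase": "restore_footer", "write": 1, "read": 0, "footer": 1},  # write=1, read=0 assumed
    ],
}


def control_sequence(operation):
    """Return the ordered list of control-signal phases for an operation."""
    if operation not in PHASES:
        raise ValueError(f"unknown operation: {operation!r}")
    return PHASES[operation]


for op in ("read", "write"):
    for step in control_sequence(op):
        print(op, step)
```

The two-phase write is the point of the sketch: the footer is the only signal that changes between the phases.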

Because low power is our ultimate goal, we lower the supply voltage as far as the performance remains acceptable, so robustness at low supply voltage is an important criterion in our design process. Figures 2 and 3 compare the read performance of the 6T and 10T SRAM cells when VDD is lowered to 570 mV. While the 6T cell fails to give the correct output, the 10T cell still works properly.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. EECS 427, Fall, 2015, Ann Arbor, Michigan, U.S.

Figure 1. 10T SRAM cell proposed.


In the 6T SRAM cell, the stored data not only controls the pull-down transistor but also sits in the bit-line discharge path. The contention between the pull-down transistor and the access transistor makes the data susceptible to noise. In contrast, the data stored in the 10T SRAM cell is connected to the gates of the lower pass transistors and is therefore isolated from the discharge path. The Monte Carlo simulation results for the read noise margin of both cells are presented in Figures 4.a and 4.b. While the read noise margin of the 6T cell reaches 0 at sigma = 4, the 10T cell retains a fair noise margin for sigma values below 6.

Our 10T SRAM cell design greatly decreases contention during both read and write operations, and thus lowers the overall dynamic power consumption. The footer contributes the most to reducing write time and write power. Before each write operation, the footer transistor is turned off so that the inverter pair is disconnected from ground; as a result, the floating internal nodes can be easily flipped by the write driver. Table 1 provides a quantitative comparison between the 6T and 10T SRAM cells in terms of delay and power.

Table 1. Comparison between 6T and 10T SRAM cells in delay and power consumption.

                     6T SRAM Cell    10T SRAM Cell
Write Time (ps)      165             113
Read Time (ps)       332             360
Write Power (uW)     5.73            2.675
Read Power (uW)      26.88           25.63
Static Power (nW)    0.46            0.898
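To make the trade-offs in Table 1 easier to read, the following sketch computes the change of each 10T figure relative to the 6T figure. The numbers are copied from the table; the helper name `percent_change` is illustrative and not part of the original work.

```python
# Values copied from Table 1 as (6T, 10T) pairs.
TABLE_1 = {
    "Write Time (ps)": (165, 113),
    "Read Time (ps)": (332, 360),
    "Write Power (uW)": (5.73, 2.675),
    "Read Power (uW)": (26.88, 25.63),
    "Static Power (nW)": (0.46, 0.898),
}


def percent_change(six_t, ten_t):
    """Change of the 10T value relative to the 6T value, in percent."""
    return (ten_t - six_t) / six_t * 100.0


changes = {name: percent_change(*pair) for name, pair in TABLE_1.items()}
for name, delta in changes.items():
    print(f"{name:18s} {delta:+7.1f} %")
```

Under these numbers the 10T cell writes about 31% faster and uses about 53% less write power, at the cost of about 8% slower reads and roughly double the static power.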

3. LAYOUT FLOOR PLAN
3.1 Layout of bit cell
The 10T cell layout uses the thin-cell style, with a body contact shared by every 32 rows to reduce area cost. The cell is actually not so thin, because we need extra M3 tracks for the control signals. The cell footer is shared between every two vertically adjacent cells, and the VSS of the read footer is also shared between the two vertically adjacent cells. Initially we also considered sharing the read transistor, but this is not a good idea because the two bit lines could be shorted through the connected read drains. Having to share the footer inside the cell causes great trouble in the layout, and we had to give the cell an additional 0.4 in height to avoid using the already-scarce M3 resource. The total area of the 10T cell is 2.4 um x 4 um, while the 6T cell we laid out is 1.2 um x 2.4 um. The area penalty of the 10T cell is therefore quite large.
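The size of the area penalty follows directly from the cell dimensions quoted above. A minimal sketch of the arithmetic, with an illustrative helper name:

```python
def cell_area_um2(side_a_um, side_b_um):
    """Rectangular cell area in square micrometres."""
    return side_a_um * side_b_um


area_10t = cell_area_um2(2.4, 4.0)  # 10T cell dimensions from the text
area_6t = cell_area_um2(1.2, 2.4)   # 6T cell dimensions from the text
area_ratio = area_10t / area_6t

print(f"10T: {area_10t:.2f} um^2, 6T: {area_6t:.2f} um^2, ratio: {area_ratio:.2f}x")
```

The 10T cell occupies 9.6 um^2 against 2.88 um^2 for the 6T cell, roughly 3.3 times the area.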

Figure 2. 6T SRAM cell fails at VDD = 570 mV.

Figure 3. 10T SRAM cell works at VDD = 570 mV.

Figure 4.a. (Left) 6T SRAM cell read noise margin. 4.b. (Right) 10T SRAM cell read noise margin.


3.2 Layout of single word column
Our SRAM is organized as 64 rows with 4 words per row, so it stores 256 words selected by an 8-bit address. Figure 6 shows the structure of a single word column. A control block is assigned to each word column. It combines the address information, the read and write signals, and the clock produced by the clock generator, locally selects the correct row, and issues the clocked read, write, or footer signal. Column controllers that pass on instructions and process signals are located below the bit cells. For power efficiency, all blocks except the level converter operate from a virtual VDD that is substantially lower than the normal VDD. The output signal of a read operation is then converted back to the normal VDD by the level converter.
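The organization above implies a simple mapping from an 8-bit address to a row and a word column. The sketch below uses the split stated in Section 4.4 (lowest two bits select the word column, upper six bits select the row); the function name and the returned tuple layout are illustrative assumptions.

```python
ROWS = 64           # rows in the array (from the text)
WORDS_PER_ROW = 4   # word columns per row (from the text)
ADDRESS_BITS = 8    # width of the address code (from the text)


def decode_address(address):
    """Split an 8-bit word address into (row, word_column)."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address must fit in 8 bits")
    word_column = address & 0b11  # lowest two bits: one of four column enables
    row = address >> 2            # upper six bits: one of 64 rows
    return row, word_column


# Sanity check: 64 rows x 4 words covers exactly the 8-bit address space.
assert ROWS * WORDS_PER_ROW == 1 << ADDRESS_BITS

print(decode_address(0b10110110))
```

With this split, consecutive addresses fall in the same row and cycle through the four word columns.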

Figure 7 shows the overall structure of the entire SRAM circuit. Four word columns are used, and peripheral circuits such as multiplexers, decoders, clock generators, drivers, and buffers are added to complete the functionality. The actual layout is shown in Figure 8.

4. PERIPHERAL CIRCUITS
4.1 Clock Generator
Timing must be precisely controlled for an SRAM to function correctly. A clock generator is therefore needed to derive clock signals with different skews from the global CLK. The reference timing is depicted in Figure 9; inverter chains are used to adjust the skew times. The precharge clock is generated by inverting the CLK signal and adding a 300 ps delay, which ensures that the previous evaluation cycle completes properly. When precharge has finished and the bit cells are ready to be read or written, the read and write instructions are sent to the bit cells 550 ps after the negative edge of the global CLK. The connection between the bit lines and the sense amplifier is turned off 400 ps after the read signal is received. By then the potential difference between the bit lines is expected to have reached 120 mV and is ready to be amplified by the sense amplifier in the following circuit.
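The skews in this section can be laid out on a single time axis. The sketch below measures every event from the falling edge of the global CLK and assumes that the 400 ps sense-amplifier isolation delay is counted from the moment the read command is issued. The constant and function names are illustrative, and the slew figure is only an average implied by the stated targets, not a simulated value.

```python
# Offsets in picoseconds, measured from the falling edge of the global CLK.
PRECHARGE_CLK_DELAY_PS = 300    # inverted CLK plus inverter-chain delay
RW_COMMAND_DELAY_PS = 550       # read/write command issued to the bit cells
SA_ISOLATE_AFTER_READ_PS = 400  # bit lines disconnected from the sense amplifier
TARGET_DIFFERENTIAL_MV = 120    # expected bit-line difference at isolation


def skewed_events():
    """Return event offsets (ps) relative to the falling edge of CLK."""
    return {
        "precharge_clk_edge": PRECHARGE_CLK_DELAY_PS,
        "rw_command": RW_COMMAND_DELAY_PS,
        # Assumption: the 400 ps is counted from the read command.
        "sa_isolate": RW_COMMAND_DELAY_PS + SA_ISOLATE_AFTER_READ_PS,
    }


def required_bitline_slew_mv_per_ps():
    """Average differential build-up rate needed to hit the target in time."""
    return TARGET_DIFFERENTIAL_MV / SA_ISOLATE_AFTER_READ_PS


events = skewed_events()
for name, offset in sorted(events.items(), key=lambda item: item[1]):
    print(f"{name:20s} +{offset} ps")
print(f"required average slew: {required_bitline_slew_mv_per_ps():.2f} mV/ps")
```

Under these assumptions the precharge clock edge precedes the read/write command by 250 ps, and the bit lines must develop their difference at an average of 0.3 mV/ps.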

Figure 9. Instruction timing for SRAM.

Figure 5. 10T SRAM bit cell layout.

Figure 6. 10T SRAM single word column structure.

Figure 7. 4K SRAM structure.

Figure 8. 4K SRAM layout.


4.2 Pre-charge Circuit
Both bit lines need to be pre-charged to the virtual VDD before a read or write. The pre-charge circuit we used is shown in Figure 10. M1 and M2 are the driver transistors that pre-charge the bit lines, and transistor M3 equalizes the voltages to ensure that both bit lines are at the same potential before a read.

4.3 Sense Amplifier
The sense amplifier design greatly impacts the read performance of an SRAM. A properly designed sense amplifier can reduce the voltage swing on the bit lines, improve the read speed, lower the power consumption, and avoid potential disturbance to the state stored in the cell. However, with technology scaling, SRAM circuits become denser and more bit cells are attached to each bit line. The parasitic capacitances increase and thus slow down the voltage sensing process.

In our SRAM, we used the conventional regenerative inverter-based sense amplifier shown in Figure 11.

The read time is determined by the SRAM cell drivability and the sense amplifier swing. The required sense amplifier swing was found by Monte Carlo simulation. 10k simulations show that, at a working voltage of 0.7 V, the sense amplifier works properly with a swing as small as 50 mV with a robustness of 3σ, as shown in Figure 12. However, we set the read time to provide an SA swing of 100 mV to allow for possible noise.

To achieve the best read delay and compensate for the large load capacitance, we ran parametric sweeps to optimize the size of each transistor. For example, Figure 13 shows a sweep of the width of the pull-down NMOS in the inverter pair; a wider NMOS results in a shorter propagation delay. After taking the area penalty into consideration, we chose 1.5 um as the final NMOS width.

4.4 Decoding Circuits
Decoding is required to select the proper cells for a read or write operation. Due to the large number of cells in modern SRAMs, the load on the decoders grows, so it becomes important to develop a fast decoder with good drivability. Meanwhile, it is tricky to lay out decoding circuits that match the cell height as the cell bank becomes denser. The 10T cell our group used has different working conditions for read and write, so it suffers from the half-select problem and the word lines cannot be shared. To solve this, we place the cells of the same word next to each other, and a control unit in front of each word is in charge of the three control lines. The word controls receive signals from the decoding circuits, which are implemented as a pre-decoder followed by separate column and row decoders. Input addresses are first processed by the pre-decoder. The lowest two bits of the address are decoded into four separate enable lines for the four column decoders, where the column enable signals are combined with CEN and WEN, and the resulting read/write enable is not sent to the cell controls until the clock signal arrives. The upper 6 bits of the address are sent to the row decoder together with their complements. The row decoder then completes a two-step decoding with a NAND-INV-NAND-INV chain. This multi-level decoding technique relieves the load pressure from the high branching and long wires, and also reduces the size of each part, which saves area and power. After adding post-PEX capacitance loads to the decoding path, we see a 260 ps delay from CLK to the READ word line, which is the critical one.

Figure 10. Pre-charge circuit.

Figure 11. Sense amplifier circuit.

Figure 12. Monte Carlo results for SA working at 0.7V with 50mV swing.

Figure 13. Sense amplifier sizing sweep.

5. RESULTS ANALYSIS
The timing sequences are shown in Figure 14.a and Figure 14.b, and the post-PEX simulation data for the SRAM are shown in Table 2. In Figure 14, the CLK2 signal is the chip clock with a 550 ps skew. CLK2 is generated because, in the baseline design, the cycle time is 3.8 ns and the decoder and register file take 1.7 ns. To send the SRAM control signals at the negative edge of the clock, which arrives 1.9 ns into the cycle, the decoding time in the SRAM would have to be smaller than 0.2 ns. It is therefore not safe to use the negative edge of CLK, and adding the 550 ps skew ensures correct operation.
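The timing argument for CLK2 can be checked with a short slack calculation. This is one reading of the paragraph above, under the assumption that the SRAM decoding must fit between the end of the processor decoder and register file (1.7 ns) and the clock edge that launches the SRAM control signals. With a 3.8 ns cycle, the negative edge of CLK arrives at 1.9 ns, which leaves 0.2 ns (200 ps) without skew. The decode delays of 180 ps and 260 ps are the two figures quoted in the text; the names below are illustrative.

```python
HALF_CYCLE_PS = 1900      # negative edge of CLK in a 3.8 ns cycle
DECODER_AND_RF_PS = 1700  # processor decoder plus register file
SKEW_PS = 550             # CLK2 lags CLK by this much


def decode_slack_ps(decode_ps, skew_ps=0):
    """Slack left for SRAM decoding before the (possibly skewed) negative edge."""
    budget = HALF_CYCLE_PS - DECODER_AND_RF_PS + skew_ps
    return budget - decode_ps


for decode_ps in (180, 260):  # decode delays quoted in the text
    print(
        f"decode {decode_ps} ps: "
        f"slack on CLK = {decode_slack_ps(decode_ps)} ps, "
        f"slack on CLK2 = {decode_slack_ps(decode_ps, SKEW_PS)} ps"
    )
```

On the unskewed clock the slack is 20 ps for a 180 ps decode and negative for a 260 ps decode, which is why the plain negative edge is considered unsafe; with the 550 ps skew both cases have several hundred picoseconds of margin.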

In the read cycle, at the negative edge of CLK2 the column decoder starts to generate the read control signal for the SRAM cell, and the decoding takes 0.18 ns. This time could be reduced with a better layout design. When the read signal arrives, the SRAM read starts. The read time equals the time needed to develop the differential voltage on the bit lines. In this design, the differential voltage on the bit lines is 216 mV at 25°C, and the differential voltage at the sense amplifier input is 154 mV. The corresponding read time is 0.3 ns. The test data show that a differential voltage close to 100 mV is enough for the sense amplifier to regenerate the signal in the presence of noise. The read time could therefore be reduced to speed up reads and decrease power consumption, which is one way to improve the design. After the data is regenerated, the output Sense is sent to the level converter, which converts the swing from 0.75 V to 1.2 V and takes 500 ps to produce the output Q. In the level converter, the inverter at the output port is made very large in order to drive the high capacitance at the output. However, the load capacitance is lower than expected, so using an inverter chain would be more efficient and would save approximately 200 ps of conversion time.

In the write cycle, at the negative edge of CLK2 the column decoder starts to generate the write control signal for the SRAM cell. The write window lasts from the negative edge of CLK2 to the rising edge of CLK, which is long enough for the data to be written into the cell, so we concentrate instead on the data write time shown in the figure. The data write time is 250 ps. The data hold time, which ensures that the data written into the SRAM cell is stable, is 120 ps.

Table 2. Post-PEX Simulation results.

Parameter                    Symbol       -55°C    25°C    125°C
                                          Min      Min     Min
Cycle time (ns)              Tclk         3.8      3.8     3.8
Clock high (ns)              Tclk,high    1.9      1.9     1.9
Clock low (ns)               Tclk,low     1.9      1.9     1.9
Read signal decoding (ps)    Tclk-r       218      250     260
Read time (ps)               Tread        388      437     485
SA regenerating (ps)         Tsense       60       67      70
Level converting (ps)        Tout         453      504     542
Write time (ps)              Twrite       206      250     220
Data setup (ps)              Ts           0        0       0
Data hold (ps)               Th           110      120     125
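The read-related rows of Table 2 can be added up to estimate the delay from CLK2 to the output Q at each temperature corner. The sketch assumes that the four stages (decoding, bit-line read, sense amplifier regeneration, level conversion) run strictly one after another with no overlap, which the report does not state explicitly. The dictionary and function names are illustrative.

```python
# Read-path stage delays in picoseconds, copied from Table 2.
READ_PATH_PS = {
    "-55C": {"Tclk-r": 218, "Tread": 388, "Tsense": 60, "Tout": 453},
    "25C": {"Tclk-r": 250, "Tread": 437, "Tsense": 67, "Tout": 504},
    "125C": {"Tclk-r": 260, "Tread": 485, "Tsense": 70, "Tout": 542},
}


def read_path_total_ps(corner):
    """Sum of the read-path stages, assuming they are strictly sequential."""
    return sum(READ_PATH_PS[corner].values())


def stage_share(corner, stage):
    """Fraction of the total read path spent in one stage."""
    return READ_PATH_PS[corner][stage] / read_path_total_ps(corner)


for corner in READ_PATH_PS:
    total = read_path_total_ps(corner)
    share = stage_share(corner, "Tout") * 100.0
    print(f"{corner:>5s}: total {total} ps, level converter {share:.0f}% of path")
```

Under this assumption the level converter is the largest single contributor at every corner, about 40% of the path, which is consistent with the suggestion above to replace its large output inverter with an inverter chain.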

Figure 14.a. Read Timing.

Figure 14.b. Write Timing.

6. REFERENCES
[1] Neil Weste and David Harris, CMOS VLSI Design: A Circuits and Systems Perspective, Fourth Edition, Addison Wesley, 2011.

[2] Vamsi Kiran, P.N.; Saxena, N., Design and Analysis of Different Types SRAM Cell, 2015 2nd International Conference on Electronics and Communication Systems (ICECS), pp. 1060-1065, 2015.

[3] Athe, P.; Dasgupta, S., A comparative study of 6T, 8T and 9T decanano SRAM cell, IEEE Symposium on Industrial Electronics & Applications (ISIEA 2009), vol. 2, pp. 889-894, 2009.

[4] Zamani, M.; Hassanzadeh, S.; Hajsadeghi, K.; Saeidi, R., A 32kb 90nm 9T-cell sub-threshold SRAM with improved read and write SNM, 2013 8th International Conference on Design & Technology of Integrated Systems in Nanoscale Era (DTIS), pp. 104-107, 2013.

[5] Ramani, A.R.; Ken Choi, A Novel 9T SRAM Design in Sub-Threshold Region, 2011 IEEE International Conference on Electro/Information Technology (EIT), pp. 1-6, 2011.

[6] Jinmo Kwon; Ik Joon Chang; Insoo Lee; Heemin Park; Jongsun Park, Heterogeneous SRAM Cell Sizing for Low-Power H.264 Applications, IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 59, pp. 2275-2284, 2012.

[7] Madiwalar, B.; Kariyappa, B.S., Single Bit-line 7T SRAM cell for low Power and High SNM, 2013 International Multi-Conference on Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), pp. 223-228, 2013.


APPENDIX
1. PROCESSOR INTEGRATION
1.1 Processor Floor Plan
The processor is fully integrated in APR. The floor plan is designed to minimize communication wire length by placing related blocks next to each other. We sized the decoder and PC to match the height of the RF-ALU and shifter to facilitate routing. The decoder is a crucial part of the processor: it contains much logic and accounts for a significant part of the total delay. To decrease the decoding delay and the decoder size, we allowed M4 and M5 to be used inside the decoder, at the cost of routing resources for final integration. Fortunately, the final routing was not much of a problem because we had planned the floor plan carefully. When implementing the power rings, we considered the power consumption of the different blocks. For example, the SRAM has only a small fraction of its cells active during an active cycle, so its power ring need not be very wide. On the other hand, the decoder and datapath have a much larger spatial activity factor, which requires more power and wider rings. In the final floor plan, however, the power rings are sized for a more compact floor plan.

Figure 1. Floor Plan of the Processor

Figure 2. Processor Image

1.2 Processor testability

We used 22 of the 40 generated pads for testability, including 5 inputs, 16 bits of D to be written back to the RF, and scan signals for the PC. We did not implement any scan in the decoder, since its logic is already complex. Instead, the D signal combined with the PC scan-in is enough to test the chip functionality.

1.3 Timing consideration and clock signal
The PC, RF, ROM, and RAM are timed by clock signals. The PC output is produced immediately after the control signal from the decoder, and the PC register is written at the rising edge. The RF holds data in the master latch when the clock is high and writes the data into the slave latch when the clock is low. The write address for the RF is refreshed every cycle on the falling edge of the clock, when decoding has already completed and before the RF write. If the write address were updated at the rising edge, violations could occur when the address arrives too late and D has already been written to the previous address. Our customized SRAM is designed to work at the negative edge of CLK, after the decoder and RF. Clock skew occurs. To reduce the skew, the clock inputs of the different blocks are placed close to each other. However, we did not realize how large the load on the clock signal was, and we did not make good use of the clock pad drivers. The clock wire we used is too narrow, which causes a large slew.

1.4 Processor performance
Before the final integration, we improved the performance of our ALU by pushing the layout and schematic to their limits, reducing the ALU delay from 2.08 ns to 1.75 ns without changing the architecture, which is still based on a carry-select adder. The delays of the other blocks are listed here. The RF has a setup time of 460 ps for the data to be written into the master latch, and the worst-case read time is 700 ps, corresponding to the CLK-Q delay when data are first written into the slave latch and sent to the output immediately. The delay of the shifter is 1.13 ns.

