CSE 477 / CMPEN 411. VLSI Systems Design
SRAM, DRAM
[Adapted from Rabaey’s Digital Integrated Circuits, Second Edition,
©2003 J. Rabaey, A. Chandrakasan, B. Nikolic]
Sp09 CMPEN 411 L23 S.*
Heads-up
IBM Kerry Bernstein’s talk Thursday 4 PM, IST 333
To prepare for his talk, go to ANGEL system, find the file “New
dimensions in performance”, under “interesting reading
materials”
To make up last cancelled lecture:
Kerry Bernstein’s talk – “Microarchitecture’s Race for Performance
and Power”, PSU talk, 11/2004, Slides and Videos are online in
ANGEL system “Interesting Reading Materials”
DAC Young Student Scholarship
Review: Basic Building Blocks
Multiplexers, decoders
Interconnect
2D 4x4 SRAM Memory Bank
[Figure: 2D 4x4 SRAM memory bank (labels: A0, BLi+1)]
To decrease the bit line delay for reads – use low swing bit lines,
i.e., bit lines don’t swing rail-to-rail, but precharge to Vdd (as
1) and discharge to Vdd – 10%Vdd (as 0). (So for 2.5V Vdd, 0 is
2.25V.) Requires sense amplifiers to restore to full swing.
Write circuitry – receives full swing from sense amps – or writes
full swing to the bit lines
6-Transistor SRAM Storage Cell
For lecture
Note that it is identical to the register cell from static
sequential circuit - cross-coupled inverters. The major job of the
pullups is to replenish loss due to leakage
Sizing of the transistors is critical
Does it consume standby power? Yes, in two forms (see if the
students can identify each): cell leakage and bit line
leakage.
Talk about the crow-bar effect of the cross-coupled inverters
SRAM Cell Analysis (Read)
Read-disturb (read-upset): must limit the voltage rise on !Q to
prevent read-upsets from occurring while simultaneously maintaining
acceptable circuit speed and area
M1 must be stronger than M5 when storing a 1 (as shown)
M3 must be stronger than M6 when storing a 0
First precharge both bit lines – BL and !BL – to 1 - Note that bit
line capacitance values for Cbit can be in the pF range for large
memories (bit line capacitance is from wiring and from diffusion
caps of M5’s of all the cells connected to the bit line and the
load presented by the sense amp)
Then discharge !BL through M5 and M1 (since the cell is holding 1).
The key is to ensure that the !Q node doesn’t rise too high before
Cbit is discharged, or the memory cell will change state
(read-disturb).
BL being 1 will help to keep the cell from toggling. Or can
precharge the bit lines to Vdd/2 so !Q could never reach the
switching point. This has performance benefits as well.
Read Voltage Ratios
Keep cell size minimal while maintaining read stability
Make M1 minimum size and increase the L of M5 (to make it
weaker)
increases load on WL
Make M5 minimum size and increase the W of M1 (to make it
stronger)
Similar constraints on (W3/L3)/(W6/L6) when storing a 0
To avoid read-disturb, the voltage on node !Q must remain below the
trip point of the inverter pair for all process, noise, and
operating conditions. This calls for a cell ratio, CR =
(W1/L1)/(W5/L5), of no less than 1.2 (most microprocessor
fabs use a minimum cell ratio of 1.25 to 2).
Simulations should be done at high Vdd and low VTn (and considering
process variations and misalignments)
SRAM Cell Analysis (Write)
The !Q side of the cell cannot be pulled high enough to ensure
writing of 0 (because M1 is on and sized to protect against read
upset). So, the new value of the cell has to be written through
M6.
M6 must be able to overpower M4 when storing a 1 and writing a
0
M5 must be able to overpower M2 when storing a 0 and writing a
1
State shown is that before write takes effect (1 is stored, trying
to write a 0)
Note that the !Q side of the cell cannot be pulled high enough to
ensure the writing of a 1 (because of the Q state holding M1 on for
a good path to ground). So writing has to occur through the Q side
of the RAM cell. In order to write the cell, the pass gate M6 must
be more conductive than the M4 to allow node Q to be pulled to a
value low enough for the inverter pair to begin amplifying the new
data.
The pull-up ratio, PR = (W4/L4)/(W6/L6), is the ratio of the pullup
size to that of the pass gate. Its maximum value is set by the need
to guarantee that the cell is writable (analysis assumes M6 in
linear, M4 in saturation).
Write Voltage Ratios
Make M4 and M6 minimum size
In order to write the cell, node Q must be pulled to a value low
enough to trip the inverter combination, so pulling Q below VTn
(0.4V) is required.
The lower the PR, the lower the value of VQ (PR has to be below
1.8).
The limiting case for the write operation occurs at high Vdd when
the pfet is strong (μp high, VTp low) and the nfet is weak (μn
low, VTn high)
Typically, the widths of the pullup devices are sized at or near
process minimum. Longer than minimum channel lengths may also be
employed to further reduce the pullup ratio. This is necessary
since the read sizing of the SRAM cell dictates that the pass gate
sizing should be minimized to prevent read disturbs.
Cell Sizing and Performance
Minimum sized pull down fets (M1 and M3)
Requires longer than minimum channel length, L, pass transistors
(M5 and M6) to ensure proper CR
But up-sizing L of the pass transistors increases capacitive load
on the word lines and limits the current discharged on the bit
lines both of which can adversely affect the speed of the read
cycle
Minimum width and length pass transistors
Boost the width of the pull downs (M1 and M3)
Reduces the loading on the word lines and increases the storage
capacitance in the cell – both are good! – but cell size may be
slightly larger
Performance is determined by the read operation
To accelerate the read time, SRAMs use sense amplifiers (so that
the bit line doesn’t have to make a full swing)
Read requires (dis)charging of the large bit-line capacitance
through the stack of two small transistors in the selected cell.
Write is dominated by the propagation delay of the cross-coupled
inverter pair.
SPEED – tp = CL x Vswing / Iav
Read – critical speed op since have large bit line capacitance to
discharge through small transistors
Write – speed determined by the propagation delay of the
cross-coupled inverters
tpLH > tpHL due to precharge
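As a rough illustration of tp = CL x Vswing / Iav, the sketch below compares a full-swing read with a low-swing read that stops at 10% of Vdd. The 1 pF bit line capacitance, the 100 uA average cell current, and the 2.5 V supply are assumed values chosen only to make the arithmetic concrete; they are not taken from a specific design.

```python
def bitline_delay(c_load, v_swing, i_avg):
    """First-order delay estimate: tp = C * Vswing / Iav (seconds)."""
    return c_load * v_swing / i_avg

# Assumed, illustrative values (not from a real process)
C_BIT = 1e-12      # bit line capacitance, 1 pF
I_CELL = 100e-6    # average discharge current through the cell, 100 uA
VDD = 2.5          # supply voltage in volts

full_swing_delay = bitline_delay(C_BIT, VDD, I_CELL)
low_swing_delay = bitline_delay(C_BIT, 0.10 * VDD, I_CELL)

print(f"full swing: {full_swing_delay * 1e9:.2f} ns")
print(f"10% swing:  {low_swing_delay * 1e9:.2f} ns")
```

With the same capacitance and current, the delay scales directly with the swing, which is why sensing a small bit line swing shortens the read.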
6-T SRAM Layout
Simple and reliable, but big
signal routing and connections to two bit lines, a word line, and
both supply rails
Area is dominated by the wiring and contacts
Other alternatives to the 6-T cell include the resistive load 4-T
cell and the TFT cell neither of which are available in a standard
CMOS logic process
[Figure: 6-T SRAM cell layout; labels: VDD, GND, Q and its complement, WL, BL and its complement, transistors M1 to M6]
Area is dominated by wiring and contacts – 11.5 contacts
Other considerations – size of cell (1092 lambda**2 = 582 micron
**2 by 0.7 micron design rules)
and power – standby/cell of 10**-15A
Multiple Read/Write Port Storage Cell
To avoid read upset, the widths of M1 and M3 will have to be sized
up by a factor equal to the number of simultaneously open read
ports
[Figure: two-port storage cell schematic; labels: transistors M1 to M8, nodes Q and !Q, word lines WL1 and WL2, bit lines BL1, !BL1, BL2, !BL2]
Allows multiple simultaneous reads of the same cell, so the cell
design must be stable for a case of multiple reads.
For the case of reads, with more than one pass gate open, the
voltage rise in the cell will be larger and thus the size of the
pulldown will have to be increased to maintain an acceptably low
level to keep from incurring a read upset (by a factor equal to the
number of simultaneous open read ports).
Resistance-load SRAM Cell
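[Figure: resistance-load SRAM cell schematic; labels: load resistors RL, VDD, WL, Q and its complement, transistors M1 to M4, BL and its complement]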
How to make R? Undoped poly: on the order of tera-ohms/square; poly
with silicide: 4-5 ohms/square
Remove R
[Figure: cell schematic with the load resistors removed; labels: WL, transistors M1 to M4, BL and its complement]
Remove R
Further remove one transistor
3-Transistor DRAM Cell
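[Figure: 3-T DRAM cell schematic; labels: transistors M1, M2, M3, storage node X, capacitance Cs, bit lines BL1 and BL2, word lines WWL and RWL]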
Write: Cs is charged (or discharged) by asserting WWL and BL1
Value stored at node X when writing a 1 is VWWL - VTn
Read: Cs is “sensed” by asserting RWL and observing BL2
Read is non-destructive and inverting (ratioless)
[Figure: write and read waveforms for WWL, BL1, X, RWL, and BL2; labeled levels VDD and VDD-VT]
Core of first popular MOS memories (e.g., first 1Kbit memory from
Intel). Cs is data storage (internal capacitance of wiring, M2
gate, and M1 diffusion capacitances)
No constraints on device sizes (ratioless)
Note threshold drop at point X which decreases the drive (gate
voltage) of M2 and slows down read time – some designs “bootstrap”
the word line voltage (raise the VWWL to a value higher than Vdd to
get around threshold drop).
Write – uses WWL and BL1. Data is retained as charge on CS once WWL
is lowered.
Read – uses RWL and BL2. Assume BL2 precharged to Vdd (or Vdd-Vt).
If cell is holding 1, then BL2 goes low – so reads are
inverting.
Refresh – read stored data, put its inverse on BL1 and assert WWL
(need to do this every 1 to 4 msec)
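The threshold drop at X and the effect of bootstrapping WWL can be sketched numerically. The model below is a simplification that treats M1 as an ideal nMOS pass transistor, so X settles at the lower of the bit line voltage and VWWL - VTn. It ignores body effect, and the 2.5 V supply and 0.4 V threshold are assumed values.

```python
def stored_high_level(v_wwl, v_bl1, v_tn):
    """Voltage left on node X after writing a 1 through an nMOS pass device.

    Simplified model: X charges until the pass transistor turns off,
    i.e. to min(V_BL1, V_WWL - V_Tn). Body effect is ignored.
    """
    return min(v_bl1, v_wwl - v_tn)

VDD = 2.5   # assumed supply voltage
VTN = 0.4   # assumed nMOS threshold voltage

# Word line driven to VDD: X loses one threshold voltage
x_plain = stored_high_level(VDD, VDD, VTN)

# Word line bootstrapped to VDD + VTn: X reaches the full VDD
x_boosted = stored_high_level(VDD + VTN, VDD, VTN)

print(f"X without bootstrapping: {x_plain:.2f} V")
print(f"X with bootstrapped WWL: {x_boosted:.2f} V")
```

A higher level on X means more gate drive on M2, which is the read-time benefit the note attributes to bootstrapping.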
3-Transistor DRAM Cell
Refresh: read stored data, put its inverse on BL1 and assert WWL
(need to do this every 1 to 4 msec)
Note Vt drop at x: how to fix it?
3-T DRAM Layout
Fewer contacts & wires
Total cell area is 576 lambda**2 (compared to 1,092 lambda**2 for
the 6-T SRAM cell)
No special processing steps are needed (so compatible with logic
CMOS process)
Can use bootstrapping (raise VWWL to a value higher than VDD) to
eliminate threshold drop when storing a “1”
[Figure: 3-T DRAM cell layout; labels: BL2, BL1, GND, RWL, WWL, transistors M3, M2, M1]
Note many fewer contacts (only 7) and wires than in 6-T SRAM
cell
576 lambda**2 as compared to 1092 lambda**2 for SRAM cell
1-Transistor DRAM Cell
[Figure: 1-T DRAM cell schematic (M1, storage node X, Cs, WL, BL, CBL) with write 1 and read 1 waveforms for WL, X (VDD-VT), and BL (VDD, VDD/2, sensing)]
Write: Cs is charged (or discharged) by asserting WL and BL
Read: Charge redistribution occurs between CBL and Cs
Read is destructive, so must refresh after read
Voltage swing is small
Most pervasive DRAM cell in commercial designs.
Write – set BL and activate WL. Once again could bootstrap WL so
that voltage drop at X doesn’t occur (to bring it up to Vdd) –
common practice
Read - BL precharged to Vpre – typically Vdd/2 – then assert WL and
sense state of BL that takes effect due to charge sharing between
CBL and Cs. Note that Read is destructive (steal charge from Cs) so
must follow with a refresh cycle. Note that Cs << CBL (1 to 2
orders of magnitude) so read voltage swings are typically very
small (around 250mV for 0.8 micron technology?). Charge transfer
ratios are between 1% and 10%
delta V = VBL – Vpre = (Vbit – Vpre) (Cs/(Cs + CBL))
REQUIRES a sense amp for each bit line for correct operation
(whereas before, sense amps were used only to improve performance
via reduced bit line swings on reads)
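The charge-sharing relation delta V = (Vbit - Vpre) x Cs/(Cs + CBL) can be evaluated directly. In the sketch below, the 50 fF storage capacitor, the 1 pF bit line, and the 2.5 V supply are assumed values, and the stored 1 is taken to be a full Vdd (that is, the word line is assumed to be bootstrapped).

```python
def read_swing(v_bit, v_pre, c_s, c_bl):
    """Bit line voltage change after charge sharing between Cs and CBL."""
    return (v_bit - v_pre) * c_s / (c_s + c_bl)

VDD = 2.5        # assumed supply voltage
V_PRE = VDD / 2  # bit line precharge level
C_S = 50e-15     # assumed storage capacitance, 50 fF
C_BL = 1e-12     # assumed bit line capacitance, 1 pF

transfer_ratio = C_S / (C_S + C_BL)
dv_one = read_swing(VDD, V_PRE, C_S, C_BL)   # cell stored a 1
dv_zero = read_swing(0.0, V_PRE, C_S, C_BL)  # cell stored a 0

print(f"charge transfer ratio: {transfer_ratio:.1%}")
print(f"delta V reading a 1: {dv_one * 1e3:+.1f} mV")
print(f"delta V reading a 0: {dv_zero * 1e3:+.1f} mV")
```

Precharging to Vdd/2 makes the two swings equal in magnitude and opposite in sign, so the sense amp sees a symmetric signal around the precharge level.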
Sense Amp Operation
1-T DRAM Cell Observations
Cell is single ended (complicates the design of the sense
amp)
Cell requires a sense amp for each bit line due to charge
redistribution based read
BL’s precharged to VDD/2 (not VDD as with SRAM design)
all previous designs used SAs for speed, not functionality
Cell read is destructive; refresh must follow to restore data
Cell requires an extra capacitor (CS) that must be explicitly
included in the design
May not be compatible with logic CMOS process
A threshold voltage is lost when writing a 1 (can be circumvented
by bootstrapping the word lines to a higher value than VDD)
1-T DRAM (3-D capacitor)
Peripheral Memory Circuitry
Sense amplifiers
Area – pitch matching
Address decoders have a substantial impact on the speed and power
consumption of the memory
When designing decoders, important to keep the complete memory
floorplan in perspective so that geometry matching between the
decoder cell dimensions and the core cell is done – pitch matching.
Otherwise, will have long lines affecting speed and power
consumption.
2D 4x4 __RAM Memory
2D 4x4 ___RAM Memory
Note refresh logic
Note a single bit line (no BLbar) – thus sense amps are required
for operation (and are more difficult to design)
Note that now sense amps are in front of the column decoder
(because one is needed for every bit line, not just for the word
being accessed) unlike in the SRAM case.
Row Decoders
Collection of 2**M complex logic gates organized in a regular, dense
fashion
(N)AND decoder for 8 address bits
WL(0) = !A7 & !A6 & !A5 & !A4 & !A3 & !A2 & !A1 & !A0
…
NOR decoder for 8 address bits
WL(0) = !(A7 | A6 | A5 | A4 | A3 | A2 | A1 | A0)
…
Goals: Pitch matched, fast, low power
note that addresses are represented as unsigned numbers (all bits
are used unlike in the book)
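The two decoder forms are logically equivalent by De Morgan's law, which a small behavioral model can confirm. The sketch below is a logic-level illustration only (no timing, sizing, or precharge), and the function names are made up for this example.

```python
def bits_of(value, n_bits):
    """Return the bits of value, least significant first."""
    return [(value >> i) & 1 for i in range(n_bits)]

def and_decoder(address, n_bits=8):
    """WL(row) = AND of A_i where the row bit is 1, and of !A_i where it is 0."""
    a = bits_of(address, n_bits)
    word_lines = []
    for row in range(2 ** n_bits):
        r = bits_of(row, n_bits)
        literals = [a[i] if r[i] else 1 - a[i] for i in range(n_bits)]
        word_lines.append(int(all(literals)))
    return word_lines

def nor_decoder(address, n_bits=8):
    """WL(row) = NOR of the complementary literals (A_i where the row bit is 0)."""
    a = bits_of(address, n_bits)
    word_lines = []
    for row in range(2 ** n_bits):
        r = bits_of(row, n_bits)
        literals = [1 - a[i] if r[i] else a[i] for i in range(n_bits)]
        word_lines.append(int(not any(literals)))
    return word_lines

wl = and_decoder(0)
print(wl[0], sum(wl))  # word line 0 is the only one asserted for address 0
```

Both functions assert exactly one word line per address, and the index of that line is the address read as an unsigned number.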
Dynamic Decoders
[Figure: dynamic decoder schematics; labels: precharge devices, VDD]
NOR is faster (only one transistor in the path to ground), but
larger (see ROM, each transistor has to connect to GND) and consumes
more power (3 WLs switch vs. 1 WL in NAND)
Pass Transistor Based Column Decoder
[Figure: pass transistor column decoder; labels: bit lines BL0 to BL3, data_out, address inputs A0 and A1, select lines S0 to S3]
Read: connect BLs to the Sense Amps (SA)
Writes: drive one of the BLs low to write a 0 into the cell
Fast since there is only one transistor in the signal path.
However, there is a large transistor count: (K+1) x 2**K + 2 x 2**K
For K = 2: 3 x 2**2 (decoder) + 2 x 2**2 (PTs) = 12 + 8 = 20
Essentially a 2**k input multiplexer
Can run the NOR decoder while the row decoder and core are working
– so only have 1 extra transistor in the signal path. Make sure the
select lines (S) go full swing (to VDD) so that full swing appears
on the BLs during write
Transistor count – (k+1) x 2**k + 2 x 2**k devices - so for k=10
(1024 to 1) it would require 11 x 1024 + 2 x 1024 = 13,312 transistors
Note that this is for 1-bit data lines. For multiple bit data
words, the cost of the predecoder can be amortized and the cost of
the pt design is less (in terms of transistor count). Note the
large load on the decoder outputs for multiple bit data words (2 *
# bits in the data word).
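The (K+1) x 2**K + 2 x 2**K count for this decoder can be tabulated for a few sizes. The sketch below only evaluates that expression. The split into a decoder term and a pass transistor term follows the K = 2 example on the slide, and the function name is made up for illustration.

```python
def pass_transistor_decoder_count(k):
    """Transistor count of a pass transistor column decoder with k address bits.

    decoder term:         (k + 1) * 2**k
    pass transistor term: 2 * 2**k  (one device each on BL and !BL)
    """
    decoder = (k + 1) * 2 ** k
    pass_transistors = 2 * 2 ** k
    return decoder + pass_transistors

for k in (2, 4, 8, 10):
    print(f"K = {k:2d}: {pass_transistor_decoder_count(k)} transistors")
```

The count grows as K x 2**K, so the predecoder dominates for wide column multiplexers.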
Tree Based Column Decoder
Number of transistors = 2 x 2 x (2**K - 1)
for K = 2: 2 x 2 x (2**2 - 1) = 4 x 3 = 12
Delay increases quadratically with the number of sections (K) (so
prohibitive for large decoders)
can fix with buffers, progressive sizing, combination of tree and
pass transistor approaches
Note no predecoder needed as with previous design – the reason for
the transistor count reduction
Number of transistors comes down to 2*2*(2**k – 1) – so for k=10
(1024 to 1) requires only 2*2,046 = 4,092 transistors
But not true (i.e., transistor count savings) for more than one bit
of data!
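One way to see where 2 x 2 x (2**K - 1) comes from is to sum the pass transistors level by level. The sketch below assumes a binary tree with 2**i devices at level i (i = 1..K) on each of the two bit line sides, and checks that sum against the closed form; the function names are made up for illustration.

```python
def tree_decoder_count(k):
    """Closed-form count from the slide: 2 x 2 x (2**k - 1)."""
    return 2 * 2 * (2 ** k - 1)

def tree_decoder_count_by_level(k):
    """Same count built up per tree level, for BL and !BL sides.

    Assumption: level i of the tree holds 2**i pass transistors,
    so one side needs 2 + 4 + ... + 2**k devices.
    """
    per_side = sum(2 ** level for level in range(1, k + 1))
    return 2 * per_side

for k in (2, 4, 10):
    print(k, tree_decoder_count(k), tree_decoder_count_by_level(k))
```

The signal path, however, passes through K series devices, which is the source of the delay penalty noted above.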
Bit Line Precharge Logic
[Figure: bit line precharge circuits; labels: !PC, BL, !BL]
First step of a Read cycle is to precharge (PC) the bit lines to
VDD
every differential signal in the memory must be equalized to the
same voltage level before Read
Turn off PC and enable the WL
the grounded PMOS load limits the bit line swing (speeding up the
next precharge cycle)
equalization transistor - speeds up equalization of the two bit
lines by allowing the capacitance and pull-up device of the
nondischarged bit line to assist in precharging the discharged
line
Static pullup scheme – advantage is that it does not require a
heavily loaded precharge clock signal to be routed across the
array; disadvantage is that it is always on, so is fighting against
the bit line discharge for the bit lines that are moving low
(consumes power)
Clocked scheme – allows the designer to use much larger precharge
devices so that bit line equalization happens more rapidly (note
equalization transistor to help even more); disadvantage is the
power consumption of the heavily loaded precharge clock
signal.
What purpose do the two pfets with their gates tied to ground
serve?
Sense Amplifiers
Amplification – resolves data with small bit line swings (in some
DRAMs required for proper functionality)
Delay reduction – compensates for the limited drive capability of
the memory cell to accelerate BL transition
Power reduction – eliminates a large part of the power dissipation
due to charging and discharging bit lines
Signal restoration – for DRAMs, need to drive the bit lines full
swing after sensing (read) to do data refresh
[Figure: sense amplifier (SA) block with input and output; labels: large, small]
tp = ( C * V ) / Iav
Differential Sense Amplifier
Directly applicable to
Differential Sensing SRAM
Reliability and Yield
Memories operate under low signal-to-noise conditions
word line to bit line coupling can vary substantially over the
memory array
folded bit line architecture (routing BL and !BL next to each other
ensures a closer match between parasitics and bit line
capacitances)
interwire bit line to bit line coupling
transposed (or twisted) bit line architecture (turn the noise into
a common-mode signal for the SA)
leakage (in DRAMs) requiring refresh operation
suffer from low yield due to high density and structural
defects
increase yield by using error correction (e.g., parity bits) and
redundancy
and are susceptible to soft errors due to alpha particles and
cosmic rays
we have only shown/considered folded bit line architectures
here.
Redundancy in the Memory Structure
[Figure: memory array with redundancy; labels: row address, column address, redundant row, redundant columns, fuse bank]
Replace bad row or column with “spare” – done by setting the fuse
bank
Helps to correct faults that affect a large section of the memory;
not good for scattered point errors or local errors (use error
correction (ECC) logic like parity bits for that)
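The replacement of a bad row by a spare can be pictured as an address remapping table. The sketch below is a behavioral illustration only: the class name, the way spares are numbered after the regular rows, and the dictionary standing in for the fuse bank are all assumptions, not a description of an actual fuse circuit.

```python
class RedundantRowMap:
    """Behavioral model of row redundancy: bad rows are steered to spares."""

    def __init__(self, n_rows, n_spares):
        self.n_rows = n_rows
        # Assumption: spare rows are numbered right after the regular rows
        self.free_spares = list(range(n_rows, n_rows + n_spares))
        self.fuses = {}  # bad row address -> spare row (the "fuse bank")

    def blow_fuse(self, bad_row):
        """Permanently map bad_row onto the next unused spare row."""
        if bad_row in self.fuses:
            return self.fuses[bad_row]
        if not self.free_spares:
            raise RuntimeError("no spare rows left")
        self.fuses[bad_row] = self.free_spares.pop(0)
        return self.fuses[bad_row]

    def physical_row(self, row):
        """Row actually accessed for a given row address."""
        return self.fuses.get(row, row)

bank = RedundantRowMap(n_rows=4, n_spares=1)
bank.blow_fuse(2)
print([bank.physical_row(r) for r in range(4)])
```

Each spare covers a whole row, which is why this scheme suits faults that span a row or column rather than scattered single-bit errors.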
Column Redundancy
Error-Correcting Codes
2**k >= m + k + 1, where m is the number of data bits and k the number of check bits
For 64 data bits, needs 7 check bits
e.g. If B3 flips
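The check-bit requirement 2**k >= m + k + 1 can be solved by simple search. The sketch below finds the smallest k for a given number of data bits m and reports the storage overhead k/m; the function name is made up for illustration.

```python
def min_check_bits(m):
    """Smallest k with 2**k >= m + k + 1 (single-error-correcting code)."""
    k = 1
    while 2 ** k < m + k + 1:
        k += 1
    return k

for m in (8, 16, 32, 64):
    k = min_check_bits(m)
    print(f"m = {m:2d} data bits -> k = {k} check bits ({k / m:.1%} overhead)")
```

The relative overhead shrinks as the word gets wider, which favors applying ECC to wide words.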
Performance and area overhead for ECC
A circuit failure occurs only when the voltage disturbance causes
the logic state of the circuit to change such that it cannot
automatically recover. This can happen before the disturbed node
completely charges or discharges. Once the node voltage reaches the
switching points of any associated logic gates, this false
transition will start to propagate along these signal paths.
Furthermore, since many circuits have feedback loops, positive
feedback can even accelerate the faulty transitions. Given the
physical mechanism of a soft error event, the following measures
can be taken in circuit design to reduce the particle induced
failure rates:
increase the storage node charge
Add devices to compensate for charge loss
Minimize the charge collecting efficiency at the storage
nodes
Redundancy and Error Correction
Soft Errors
alpha particles (from the packaging materials)
neutrons from cosmic rays
As feature size decreases, the charge stored at each node decreases
(due to a lower node capacitance and lower VDD) and thus Qcritical
(the charge necessary to cause a bit flip) decreases leading to an
increase in the soft error rate (SER)
From Semico Research Corp.
FIT= Failure In Time, one FIT is a single failure in 1 billion
(1e9) hours. Hence, a system that experiences 1 failure in 13,158
hours has a failure rate of 1E9/13,158 = 76,000 FITs.
Scary Fact
Avionics system in civilian aviation: an altitude of 30,000 feet and
a route crossing the North Pole both increase the neutron flux.
If the avionics board uses four 1M 130nm SRAM-based FPGAs, it would
be subject to 0.074 upsets per day, i.e., 324 hours between upsets
or about 3 million FITs. Assume one such system on board each
commercial aircraft, 4,000 civilian flights per day, and a 3-hour
average flight time. Nearly 37 aircraft per day will experience a
neutron-induced SRAM-based FPGA configuration failure during their
flight.
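The FIT arithmetic in these notes can be reproduced in a few lines. The sketch below uses only the numbers quoted in the notes (1 failure in 13,158 hours, 0.074 upsets per day, 4,000 flights per day, 3-hour flights); the variable and function names are made up for illustration.

```python
def fit_from_hours_between_failures(hours):
    """FIT = failures per 1e9 device-hours."""
    return 1e9 / hours

# Example from the FIT definition: 1 failure in 13,158 hours
fit_example = fit_from_hours_between_failures(13_158)

# Avionics example: 0.074 upsets per day
upsets_per_day = 0.074
hours_between_upsets = 24 / upsets_per_day
fit_avionics = fit_from_hours_between_failures(hours_between_upsets)

# Expected number of affected flights per day
flights_per_day = 4_000
flight_hours = 3
affected_flights = flights_per_day * flight_hours / hours_between_upsets

print(f"example FIT rate:     {fit_example:,.0f}")
print(f"hours between upsets: {hours_between_upsets:.0f}")
print(f"avionics FIT rate:    {fit_avionics:,.0f}")
print(f"affected flights/day: {affected_flights:.1f}")
```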
Modeling of a particle strike
See 2004 paper. Before I talk about the solution, I want to spend a
few minutes on the analysis of soft errors. The first question you
may have is how to model soft errors. We can model a transient
fault as a double exponential injection current source in SPICE,
where the rising slope depends on the collection time constant
for a junction, which in turn depends on the doping concentration
and process. I0, the peak current, depends on the process and the
charge intensity. The area under the current pulse, I*t, is the
collected charge Q.
Show Kerry’s video here…
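A double exponential current pulse of the kind described can be sketched outside SPICE to check the relation between the pulse area and the charge Q. The form I(t) = I0 x (exp(-t/tau_fall) - exp(-t/tau_rise)) used below is one common way to write such a pulse and is an assumption here, as are the 1 mA scale factor and the 200 ps and 50 ps time constants. Note that in this form I0 is a scale factor and the actual peak is somewhat lower.

```python
import math

def strike_current(t, i0, tau_fall, tau_rise):
    """Double exponential current pulse (amperes), zero before the strike."""
    if t < 0:
        return 0.0
    return i0 * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))

def collected_charge(i0, tau_fall, tau_rise):
    """Area under the pulse from 0 to infinity: Q = I0 * (tau_fall - tau_rise)."""
    return i0 * (tau_fall - tau_rise)

# Assumed, illustrative parameters
I0 = 1e-3           # scale current, 1 mA
TAU_FALL = 200e-12  # collection (decay) time constant, 200 ps
TAU_RISE = 50e-12   # rise time constant, 50 ps

# Numerical check of the area with the trapezoidal rule over 5 ns
dt = 1e-12
steps = 5000
samples = [strike_current(n * dt, I0, TAU_FALL, TAU_RISE) for n in range(steps + 1)]
q_numeric = dt * (sum(samples) - 0.5 * (samples[0] + samples[-1]))
q_closed = collected_charge(I0, TAU_FALL, TAU_RISE)

print(f"Q (closed form): {q_closed * 1e15:.1f} fC")
print(f"Q (numerical):   {q_numeric * 1e15:.1f} fC")
print(f"peak current:    {max(samples) * 1e3:.3f} mA")
```

Comparing Q against the critical charge of a storage node gives a first-order indication of whether a strike of this size would flip the cell.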
A SPICE simulation for SRAM
A particle strike
On-chip Memory: ITRS roadmap
State of Art