Date post: | 13-Apr-2017 |
Category: |
Engineering |
Upload: | indramani-yadav |
INTELLIGENT RAM (IRAM)
REPORT
ON
Intelligent RAM
As Partial Fulfilment for the Degree of
BACHELOR OF COMPUTER APPLICATION
Submitted to
Laxmi Institute of Commerce and Computer Applications (BCA)
Sarigam
Affiliated to
Veer Narmad South Gujarat University, Surat
Prepared By
Yadav Indramani Surendra
LAXMI INSTITUTE OF COMMERCE & COMPUTER APPLICATION, SARIGAM (BCA)
Laxmi Institute of Commerce and Computer Applications (BCA),
Sarigam
SEMINAR REPORT
As Partial Requirement for the Degree of
Bachelor of Computer Applications (BCA)
Academic Year 2015-2016
Submitted By:
Yadav Indramani Surendra
Guided By:
Internal Guide: Ms Sindhu Pandya
PREFACE
It is an exciting moment for me to present this seminar report. Care was taken while preparing the report so that it is easy to read and understand.
During the preparation of this seminar report, information technology concepts were applied. This seminar is part of the third-year study, the final step towards the completion of the BCA course.
This report describes the technology in an understandable manner. It consists of sections such as the specification of the technology, a study of it, its functions and its features, which will help the reader understand the technology in brief.
ACKNOWLEDGEMENT
Completing a seminar requires the co-operation and guidance of people who are prominent in the subject, especially for amateurs like us.
I am grateful to Dr Keyur Nayak (Principal) and to Ms. Sindhu Pandya (In-charge Principal and Internal Guide) for encouraging me to undertake this journey of knowledge gathering and for guiding me to improve my presentation skills.
I extend my heartfelt regards and respect to the teachers of the BCA department for their continued encouragement and support from the very beginning.
I extend my sincere thanks to all the non-teaching staff for providing the necessary facilities and help. Without their support, this seminar report would not have been a reality.
Sincerely,
Yadav Indramani Surendra
INDEX
1. ABSTRACT
2. INTRODUCTION
3. INSPIRATIONS OF IRAM
   3.1 PROCESSOR - MEMORY (DRAM) GAP OR LATENCY
4. IRAM - ARCHITECTURE
5. IRAM - BENCHMARKING
6. ADVANTAGES OF IRAM
7. DISADVANTAGES OF IRAM
8. CONCLUSION
9. REFERENCES
INTELLIGENT RAM (IRAM)
Abstract
As the name suggests, “Intelligent RAM” is the integration of intelligence and RAM. Intelligence stands for the microprocessor, and RAM for random access memory, the volatile memory that is an important part of computing systems from desktop computers to supercomputers.
Intelligent RAM, or IRAM, merges processing and memory into a
single chip to lower memory latency, increase memory bandwidth, and
improve energy efficiency as well as to allow more flexible selection of
memory size and organization. In addition, IRAM promises savings in
power and board area.
This seminar is a genuine effort to introduce this new idea. It covers the inspirations, architecture and advantages of IRAM and the technologies that make the revolutionary idea of Intelligent RAM possible.
Introduction
The division of the semiconductor industry into microprocessor and
memory camps provides many advantages. First and foremost, a
fabrication line can be tailored to the needs of the device. Microprocessor
fabrication lines offer fast transistors to make fast logic and many metal
layers to accelerate communication and simplify power distribution,
while DRAM fabrication lines offer many polysilicon layers to achieve both
small DRAM cells and low leakage current to reduce the DRAM refresh
rate.
Separate chips also mean separate packages, allowing microprocessors to
use expensive packages that dissipate high power (5 to 50 watts) and
provide hundreds of pins to make wide connections to external memory,
while allowing DRAMs to use inexpensive packages which dissipate low
power (1 watt) and use only a few dozen pins.
Separate packages in turn mean computer designers can scale the number
of memory chips independent of the number of processors: most desktop
systems have 1 processor and 4 to 32 DRAM chips, but most server
systems have 2 to 16 processors and 32 to 256 DRAMs. Memory systems
have standardized on the Single In-line Memory Module (SIMM) or Dual
In-line Memory Module (DIMM), which allows the end user to scale the
amount of memory in a system.
Quantitative evidence of the success of the industry is its size: in 1998
DRAMs were a $37B industry and microprocessors were a $20B
industry. In addition to financial success, the technologies of these
industries have improved at unparalleled rates.
DRAM capacity has quadrupled on average every 3 years since 1986,
while microprocessor speed has done the same since 1996. The split into
two camps has its disadvantages as well.
Two trends call into question the current practice of microprocessors and
DRAMs being fabricated as different chips on different fabrication lines:
1) the gap between processor and DRAM speed is growing at 50% per
year; and 2) the size and organization of memory on a single DRAM chip
is becoming awkward to use in a system, yet size is growing at 60% per
year.
These disadvantages led researchers to develop a new, integrated technology, and as a result “Intelligent RAM” came into existence. Intelligent RAM, or IRAM, merges processing and memory into a single chip to lower memory latency, increase memory bandwidth, and improve energy efficiency as well as to allow more flexible selection of memory size and organization. In addition, IRAM promises savings in power and board area. IRAM was proposed by David Patterson of the University of California, Berkeley, USA, as an alternative to the current processor-memory combination, which has many pitfalls. It was implemented by a group of postgraduate students led by Patterson.
Inspirations of IRAM
Processor - Memory (DRAM) Gap or Latency
This is the gap between the performance of the microprocessor and that of the RAM. Since the processor and the RAM are fabricated separately on two different fabrication lines, their speeds improve at two different rates, creating a performance gap between them. Figure 2.1 shows that while microprocessor performance has been improving at a rate of 60% per year, the access time to DRAM has been improving at less than 10% per year. Hence computer designers are faced with an increasing “Processor-Memory Performance Gap”, which is now the primary obstacle to improved computer system performance.
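The yearly growth of the gap follows from the two improvement rates, because the ratio of processor to DRAM performance is multiplied each year by (1 + processor rate) / (1 + DRAM rate). A minimal sketch of that arithmetic, assuming DRAM access time improves by 7% per year (a value chosen to be consistent with the "less than 10%" above, not a figure from this report):

```python
def gap_growth_rate(cpu_rate: float, dram_rate: float) -> float:
    """Yearly growth of the processor-memory performance gap.

    The CPU/DRAM performance ratio is multiplied each year by
    (1 + cpu_rate) / (1 + dram_rate).
    """
    return (1.0 + cpu_rate) / (1.0 + dram_rate) - 1.0


def gap_after_years(cpu_rate: float, dram_rate: float, years: int) -> float:
    """Factor by which the gap widens after `years` years of compounding."""
    return (1.0 + gap_growth_rate(cpu_rate, dram_rate)) ** years


if __name__ == "__main__":
    rate = gap_growth_rate(0.60, 0.07)  # 60% CPU, assumed 7% DRAM
    print(f"gap grows about {rate:.0%} per year")
    print(f"over 10 years it widens about {gap_after_years(0.60, 0.07, 10):.0f}x")
```

With these inputs the gap grows by about 50% per year, the figure quoted in the Introduction; a higher assumed DRAM rate gives a somewhat slower-growing gap.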
System architects have attempted to bridge the processor-memory performance gap by introducing deeper and deeper cache memory hierarchies; unfortunately, this makes the memory latency even longer in the worst case. In a typical high-performance system, the main memory latency can be a factor of four larger than the raw DRAM access time. This difference is due to the time to drive the address off the microprocessor, the time to multiplex the addresses to the DRAM, the time to turn around the bidirectional data bus, the overhead of the memory controller, the latency of the SIMM connectors, and the time to drive the DRAM pins first with the address and then with the return data.
Figure 2.1
Processor - Memory Performance Gap Penalty
To overcome the performance gap, designers add cache memory, which costs more money. The extra transistors also increase power consumption. Yet caches cannot fully hide the gap, because it is very large and is growing at 50% per year. The extraordinary delays in the memory hierarchy occur despite tremendous resources being spent trying to bridge the processor-memory performance gap. We call the percentage of die area and transistors dedicated to caches and other memory-latency-hiding hardware the “Memory Gap Penalty”.
Table 2.1 quantifies the penalty; it has grown to 60% of the area and almost 90% of the transistors in several microprocessors. Moreover, as the gap keeps growing, new generations may need a higher degree of integration and smaller transistors, which further increases the cost of production, so the penalty is likely to grow with new generations of microprocessors.
Year | Processor | On-Chip Cache Size | Memory Gap Penalty: % Die Area | Memory Gap Penalty: % Transistors | Die Area (mm2) | Total Transistors
1994 | Digital Alpha 21164 | I: 8 KB, D: 8 KB, L2: 96 KB | 37.4% | 77.4% | 298 | 9.3 M
1996 | Digital StrongARM SA-110 | I: 16 KB, D: 16 KB | 60.8% | 94.5% | 50 | 2.1 M
1993 | Intel Pentium | I: 8 KB, D: 8 KB | 31.9% | 32% | 300 | 3.1 M
1995 | Intel Pentium Pro | I: 8 KB, D: 8 KB, L2: 512 KB | P: 18.5%, +L2: 100% (Total: 52.2%) | P: 11.2%, +L2: 100% (Total: 76.5%) | P: 226, +L2: 186 | P: 3.5 M, +L2: 25.0 M
2000 | Intel Pentium 4 | I: 8 KB, D: 12 KB, L2: 512 KB | P: 22.5%, +L2: 100% (Total: 64.2%) | P: 18.2%, +L2: 100% (Total: 87.5%) | P: 242, +L2: 282 | P: 5.5 M, +L2: 31.0 M
2001 | AMD Athlon | I: 32 KB, D: 32 KB, L2: 512 KB | P: 20.2%, +L2: 100% (Total: 62.0%) | P: 20.2%, +L2: 100% (Total: 88.5%) | P: 268, +L2: 312 | P: 6.5 M, +L2: 33.0 M
Table 2.1 (P: processor die; +L2: separate L2 cache die)
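The “Total” entries combine the processor die and the separate L2 die. A minimal sketch, assuming each total is the share of the combined area (or transistor count) spent on caches and latency-hiding hardware, checked against the Pentium 4 row as printed in Table 2.1:

```python
def combined_penalty(proc_size: float, proc_penalty: float,
                     l2_size: float, l2_penalty: float = 1.0) -> float:
    """Fraction of processor + L2 resources spent on memory-latency hiding.

    proc_size / l2_size: die area (mm2) or transistor count of each die.
    proc_penalty / l2_penalty: fraction of each die used for caches
    (the separate L2 die is counted as 100% cache).
    """
    hidden = proc_size * proc_penalty + l2_size * l2_penalty
    return hidden / (proc_size + l2_size)


if __name__ == "__main__":
    # Pentium 4 row: processor 242 mm2 at 22.5%, L2 282 mm2 at 100%
    area = combined_penalty(242, 0.225, 282)
    # Transistors: processor 5.5 M at 18.2%, L2 31.0 M at 100%
    transistors = combined_penalty(5.5, 0.182, 31.0)
    print(f"area penalty {area:.1%}, transistor penalty {transistors:.1%}")
```

This reproduces the 64.2% area total; the transistor total comes out at about 87.7%, close to the 87.5% printed in the table.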
Memory Revenue
Memory revenue is decreasing rapidly. Even though the need for DRAM chips is increasing, DRAM manufacturers are not benefiting because of the high cost of production. This pushes RAM manufacturers to look for an alternative that can reduce the cost of production and so maintain both their revenue and their business.
Figure 2.2 shows that DRAM revenue has been falling continuously since the first quarter of 1999, after reaching a maximum of 16 billion US dollars. In the first quarter of 2000 it showed a slight rise, reaching 7 billion, after which it slid down continuously for three consecutive years, and the decline had not ceased at the time of writing.
Figure 2.2
I/O Bus Performance Lag
The parallel I/O bus is not efficient because it lags behind the processor and memory in bandwidth. If we scale the bus by increasing its clock speed and width to increase performance, the packaging cost increases, and scaling also increases the number of pins. So the performance lag of the parallel I/O bus points to the need for a more efficient implementation that overcomes the bandwidth shortage without raising the production cost and the pin count as it is scaled up for performance and efficiency.
PCI Bits | Pin Number
16 | ~20
32 | ~50
64 | ~90
Table 2.2
Table 2.2 shows the rapid increase in the number of pins, which in turn increases the cost of production, when the PCI bus width is scaled to increase the performance of the I/O system.
Database Demand for Processing Power and Memory
Databases and other software demand more and more processing power and memory, and both fall short of the actual demand. The demand is also increasing rapidly because of high-end applications, such as multimedia applications, that need high processing power and large amounts of RAM. Their requirements grow further with each new version, since each release tries to squeeze the last drop of performance from the computing system.
Figure 2.3
Figure 2.3 shows that, according to Greg's law, the database demand for processing power and memory doubles every 9 months. Microprocessor speed, by contrast, doubles every 18 months (following Moore's law), and DRAM speed doubles only every 120 months. So both the microprocessor and the memory fall short of the requirements of databases and other software applications, creating a Database-Processor performance gap and a Database-Memory performance gap. These gaps keep growing rapidly because new software applications are ever hungrier for processing power and memory. The database demand for more processing power and memory therefore led computer experts to look for a technology that narrows the gap between database demand and both the processor and the memory.
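The three doubling periods translate into diverging growth factors. A short sketch of the arithmetic, using the doubling periods stated above (9, 18 and 120 months); the 36-month horizon is an arbitrary choice for illustration:

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Growth over `months` of a quantity that doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def gap_factor(months: float, demand_doubling: float,
               supply_doubling: float) -> float:
    """Factor by which demand outgrows supply over `months`."""
    return growth_factor(months, demand_doubling) / growth_factor(months, supply_doubling)


if __name__ == "__main__":
    horizon = 36  # months, chosen only for illustration
    print("database vs processor gap:", gap_factor(horizon, 9, 18))
    print("database vs DRAM gap:", round(gap_factor(horizon, 9, 120), 1))
```

Over three years the demand outgrows processor speed by a factor of 4 and DRAM speed by a factor of about 13.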
Fewer DRAMs/System over Time
While the Processor-Memory Performance Gap has widened to the point where it dominates performance for many applications, the cumulative effect of two decades of 60% per year improvement in DRAM capacity has resulted in huge individual DRAM chips. This has put the DRAM industry in something of a bind: the capacity of each DRAM chip is growing at 60% per year, but the minimum memory required per system is growing at only 25%-30% per year.
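Because chip capacity grows faster than the minimum memory a system needs, the number of chips per system shrinks each year by the ratio of the two growth factors. A minimal sketch, assuming a starting point of 32 chips (the top of the desktop range given in the Introduction) and 27.5% per year for minimum memory (the midpoint of 25%-30%); chip counts are treated as continuous values:

```python
def years_until_single_chip(chips: float, memory_growth: float,
                            dram_growth: float) -> int:
    """Count years until the minimum memory fits in a single DRAM chip.

    Each year the required memory grows by (1 + memory_growth) while each
    chip grows by (1 + dram_growth), so the chip count scales by their ratio.
    """
    ratio = (1.0 + memory_growth) / (1.0 + dram_growth)
    if ratio >= 1.0:
        raise ValueError("chip count never shrinks at these rates")
    years = 0
    while chips > 1.0:
        chips *= ratio
        years += 1
    return years


if __name__ == "__main__":
    print(years_until_single_chip(32, 0.275, 0.60))
```

With these assumed inputs the loop reports 16 years, which illustrates why many PCs may soon need only a single DRAM chip.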
Figure 2.4
Figure 2.4 shows that over time the number of DRAM chips required for a
reasonably configured PC has been shrinking. The required minimum memory
size, reflecting application and operating system memory usage, has been growing
at only about half to three-quarters the rate of DRAM chip capacity. For example,
consider a word processor that requires 8MB; if its memory needs had increased at
the rate of DRAM chip capacity growth, that word processor would have had to fit
in 80KB in 1986 and 800 bytes in 1976. The result of the prolonged rapid
improvement in DRAM capacity is fewer DRAM chips needed per PC, to the
point where soon many PC customers may require only a single DRAM chip. Also, unused memory bits increase the effective cost.
So customers may no longer automatically switch to the larger capacity DRAM
as soon as the next generation matches the same cost per bit in the same
organization because 1) the minimum memory increment may be much larger
than needed, 2) the larger capacity DRAM will need to be in a wider
configuration that is more expensive per bit than the narrow version of the
smaller DRAM, or 3) the wider capacity does not match the width needed for
error checking and hence results in even higher costs.
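The word-processor example can be checked with the growth rate quoted in the Introduction, DRAM capacity quadrupling every 3 years. A short sketch that scales the 8MB requirement back one decade at a time, assuming 8MB means 8 × 2^20 bytes:

```python
BYTES_PER_MB = 2 ** 20


def scale_back(size_bytes: float, years: float,
               factor: float = 4.0, period_years: float = 3.0) -> float:
    """Undo capacity growth of `factor` every `period_years` over `years` years."""
    return size_bytes / factor ** (years / period_years)


if __name__ == "__main__":
    size = 8 * BYTES_PER_MB
    for decade in (1, 2):
        size = scale_back(size, 10)
        print(f"{decade * 10} years earlier: about {size:,.0f} bytes")
```

The two steps give roughly 82,600 bytes (about 80KB) and roughly 810 bytes, in line with the 80KB and 800 bytes quoted above.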
IRAM – Architecture
Key Technologies
The key technologies behind IRAM are,
1) Vector Processing
2) Embedded DRAM
3) Serial I/O
Vector Processing
High-speed microprocessors rely on instruction-level parallelism (ILP) in programs, which means the hardware looks for short instruction sequences that can execute in parallel. As mentioned above, these high-speed microprocessors rely on getting hits in the cache to supply instructions and operands at a sufficient rate to keep the processors busy.
An alternative model to exploiting ILP that does not rely on caches is vector
processing. It is a well established architecture and compiler model that was
popularized by supercomputers, and it is considerably older than superscalar.
Vector processors have high-level operations that work on linear arrays of
numbers.
Advantages of vector computers and the vectorized programs on them
include:
1. Each result is independent of previous results, which enables deep pipelines
and high clock rates in them.
2. A single vector instruction does a great deal of work, which means fewer
instruction fetches in general and fewer branch instructions and so fewer
mispredicted branches.
3. Vector instructions often access memory a block at a time, which allows
memory latency to be amortized over, say, 64 elements.
4. Vector instructions often access memory with regular (constant stride)
patterns, which allows multiple memory banks to simultaneously supply operands.
Figure 4.1
Figure 4.1 shows the “vector processing model”, in which the difference between scalar and vector instructions is represented schematically. In scalar processing, each instruction works on a single pair of operands and instructions are carried out sequentially, while in vector processing a single instruction operates on many elements in parallel, the number depending on the vector length of the processor. So vector processing is much faster than scalar processing on such data-parallel work.
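The difference shown in Figure 4.1 can be illustrated by counting instructions. The sketch below is only a model in plain Python: `vector_add` treats each chunk of up to `MVL` elements as one vector instruction (strip mining), with `MVL = 64` assumed to match the 64-element example above; it does not model pipelining or timing.

```python
MVL = 64  # assumed maximum vector length (elements per vector instruction)


def scalar_add(a, b):
    """Scalar model: one add instruction per element."""
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def vector_add(a, b, mvl=MVL):
    """Vector model: strip-mine the arrays into chunks of at most mvl elements.

    Each chunk stands for a single vector add instruction, so the loop runs
    ceil(n / mvl) times instead of n times.
    """
    result, instructions = [], 0
    for start in range(0, len(a), mvl):
        chunk = zip(a[start:start + mvl], b[start:start + mvl])
        result.extend(x + y for x, y in chunk)  # one vector add over the chunk
        instructions += 1
    return result, instructions


if __name__ == "__main__":
    a = list(range(1000))
    b = list(range(1000, 2000))
    s, s_count = scalar_add(a, b)
    v, v_count = vector_add(a, b)
    assert s == v
    print(s_count, v_count)  # 1000 16
```

Both versions produce the same sums; the vector model issues 16 instructions instead of 1000, which is the source of the fewer instruction fetches and branches listed above.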
IRAM - Vector Architecture
Since vector architecture deals with vector processing, it describes only the processor architecture of IRAM. It helps in studying the instruction-level parallelism, or parallel processing, of IRAM. Figure 4.2 shows the “vector architectural model” of IRAM. The parallel processing is carried out by the virtual processors VP0, VP1, ..., VP$vlr-1, where $vlr is the current value of the vector length register. The model has 32 general-purpose vector registers, vr0, vr1, ..., vr31, which hold the data for general vector instructions.
Figure 4.2
The 32 flag registers vf0, vf1, ..., vf31 hold one bit per element; they are used for masking (conditional execution) and for recording conditions such as floating-point exceptions. There are also 32 control registers, vcr0, vcr1, ..., vcr31, which control the execution of instructions by the processor.
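The register model can be sketched as a tiny simulator. This is a hypothetical Python model, not the VIRAM instruction set: `VectorUnit`, `vadd` and `vcmp_gt` are illustrative names, a maximum vector length of 64 is assumed, and the flag registers are modelled as per-element masks, so that virtual processor VPi updates element i only when i < vlr and its flag bit is set.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_REGS = 32  # vr0..vr31 and vf0..vf31, as in the text
MVL = 64       # assumed maximum vector length for this sketch


def _zeros() -> List[List[int]]:
    return [[0] * MVL for _ in range(NUM_REGS)]


def _ones() -> List[List[bool]]:
    return [[True] * MVL for _ in range(NUM_REGS)]


@dataclass
class VectorUnit:
    vlr: int = MVL                                       # vector length register
    vr: List[List[int]] = field(default_factory=_zeros)  # vector data registers
    vf: List[List[bool]] = field(default_factory=_ones)  # one flag bit per element

    def vadd(self, vd: int, va: int, vb: int, mask: Optional[int] = None) -> None:
        """vr[vd] = vr[va] + vr[vb] for elements 0..vlr-1 (VP0..VP$vlr-1).

        If `mask` names a flag register, elements whose flag is False keep
        their old value.
        """
        for i in range(self.vlr):
            if mask is None or self.vf[mask][i]:
                self.vr[vd][i] = self.vr[va][i] + self.vr[vb][i]

    def vcmp_gt(self, fd: int, va: int, scalar: int) -> None:
        """Set flag register vf[fd] element-wise to vr[va][i] > scalar."""
        for i in range(self.vlr):
            self.vf[fd][i] = self.vr[va][i] > scalar


if __name__ == "__main__":
    vu = VectorUnit(vlr=4)      # only VP0..VP3 are active
    vu.vr[1][:4] = [1, 5, 2, 8]
    vu.vr[2] = [10] * MVL
    vu.vcmp_gt(0, 1, 3)         # vf0 = [False, True, False, True, ...]
    vu.vadd(3, 1, 2, mask=0)    # masked add into vr3
    print(vu.vr[3][:5])         # [0, 15, 0, 18, 0]
```

Element 4 stays 0 because it lies beyond the vector length, and elements 0 and 2 stay 0 because their flag bits are clear.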
Advantages of Vector Processing
Advantages of vector processing are,
1. It offers high performance on demand for multimedia processing. Parallel processing speeds up multimedia workloads, which makes vector processing an ideal method, since multimedia processing plays a key role in today's computing.
2. It needs little power for instruction issue and control logic. The control logic voltage is 1.2 V, compared with the 3 V to 5 V of other processing methods.
3. Its design is not very complex. Because of the simple design, implementation is easy and cost-effective.
4. It has a well-understood programming model. The compiler and instruction set are simple and easy to understand, and the instructions remain efficient despite their simplicity when compared with other assembly-level programming languages.
Embedded DRAM
In IRAM the DRAM is embedded: it is fabricated on the same die as the processor instead of on a separate memory chip. The term comes from embedded technology in general, in which a chip is built into a device to control and carry out the operations of that device; such devices are usually used by people who never interact with the chip directly but operate the device through it.
Figure 4.3
Figure 4.3 shows how embedded technology is used in the manufacturing of IRAM. During fabrication, the memory is integrated on the same die as the microprocessor to produce IRAM. Thus IRAM is a single chip into which both memory and processor are integrated, and their coexistence gives high performance.
Advantages of Embedded DRAM
The advantages of embedded DRAM are,
1. It offers high bandwidth for vector processing. Because the on-chip DRAM has high memory bandwidth, it can feed the vector processor, which needs heavy memory use and high bandwidth because of the abundant parallelism in vector processing.
2. It has low latency, which makes memory accesses faster and more efficient.
3. The energy required for memory accesses is very low, because accesses stay on the chip instead of driving external buses and pins. Also, memory accesses are less frequent. So the power consumption of the embedded DRAM is lower than that of separate memory chips.
4. The memory flexibility of IRAM comes from the embedded technology used in its manufacturing process. The designers can specify the exact length and width of the DRAM, since it is not restricted to powers of 2. So embedded DRAM offers benefits in choosing the system memory size.
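The sizing benefit can be illustrated with a short calculation. A minimal sketch, assuming a conventional memory system can only be built in power-of-2 capacities, while embedded DRAM can be sized in whole 1 Mbit modules (the module size used in the IRAM floor plan); the 96 MB requirement is a hypothetical example:

```python
import math


def power_of_two_capacity(required_mbit: int) -> int:
    """Smallest power-of-2 capacity (Mbit) that holds the requirement."""
    return 1 << math.ceil(math.log2(required_mbit))


def module_capacity(required_mbit: int, module_mbit: int = 1) -> int:
    """Capacity (Mbit) when memory is built from whole modules of module_mbit."""
    return math.ceil(required_mbit / module_mbit) * module_mbit


def unused_fraction(capacity: int, required: int) -> float:
    """Share of the installed bits that the system never needs."""
    return (capacity - required) / capacity


if __name__ == "__main__":
    required = 96 * 8  # hypothetical requirement: 96 MB = 768 Mbit
    conventional = power_of_two_capacity(required)
    embedded = module_capacity(required)
    print(conventional, unused_fraction(conventional, required))  # 1024 0.25
    print(embedded, unused_fraction(embedded, required))          # 768 0.0
```

Under these assumptions a quarter of the power-of-2 memory would be unused bits, which is the effective-cost problem described in the Inspirations section.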
Serial I/O
Because parallel I/O performs poorly in both bandwidth and scaling, the I/O system of IRAM uses a more efficient and cost-effective technology, the “serial I/O system”. It improves the performance of IRAM without hindering the memory and processor, by offering a smooth and fast path for data transfer.
Figure 4.4
Figure 4.4 shows the schematic representation of the interaction between IRAM
and the I/O devices. The interaction is through the serial I/O lines implemented
in it, as shown in the figure. Because of the high bandwidth offered by the serial I/O lines, data transfer in IRAM is fast and efficient, which enhances its performance.
Advantages of Serial I/O
The advantages of serial I/O are,
1. Serial I/O offers very high bandwidth, which benefits both processing-intensive and memory-intensive operations. The typical bandwidth of serial I/O is of the order of gigabits per second.
2. Serial I/O needs far fewer pins than parallel I/O: only 1-2 pins per unidirectional link, whereas parallel I/O requires 5-10 pins per unidirectional link.
3. Serial I/O bandwidth can be scaled incrementally to increase bandwidth and efficiency, either by raising the rate of each link, which needs no extra pins, or by adding links of only 1-2 pins each. Scaling is therefore very cost-effective for serial I/O, while scaling parallel I/O increases both the number of pins and the cost of production.
4. The power consumption of a serial I/O system is much lower than that of a parallel I/O system, so serial I/O enhances the performance of IRAM while reducing power consumption. This makes it suitable for low-power devices.
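The pin-count argument can be made concrete. A minimal sketch, assuming each serial link carries 1 Gbit/s (the "order of gigabits per second" above) and uses 2 pins (the upper end of the 1-2 pins quoted); the bandwidth targets are hypothetical:

```python
import math


def serial_links_needed(target_gbps: float, link_gbps: float = 1.0) -> int:
    """Number of unidirectional serial links needed to reach target_gbps."""
    return math.ceil(target_gbps / link_gbps)


def serial_pins(target_gbps: float, link_gbps: float = 1.0,
                pins_per_link: int = 2) -> int:
    """Total pins for the serial links (assumed pins_per_link each)."""
    return serial_links_needed(target_gbps, link_gbps) * pins_per_link


if __name__ == "__main__":
    for target in (2, 4, 8):
        print(target, "Gbit/s ->", serial_pins(target), "pins")
```

Each extra gigabit per second costs only two more pins in this model, and doubling the per-link rate halves the pin count for the same bandwidth. For a sense of scale only, Table 2.2 lists roughly 90 pins for a 64-bit PCI bus.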
IRAM - Floor Plan
Every technological implementation needs a basic plan on which the device or circuit is built. Similarly, IRAM has a basic plan, or floor plan, for its implementation, which includes the design specification and the structure for implementation. Figure 4.5 shows the floor plan of IRAM. It consists of 1024 1-Mbit memory modules split into two memory zones of 512 1-Mbit modules each, giving a memory capacity of 64 MB per zone. The 8 vector lanes carry out multiple instances of parallel processing for the vector processor. The CPU and IO blocks are the central processing unit and input-output system of the IRAM chip. The crossbar switch links the memory, the central processing unit and the IO system.
Figure 4.5
Floor Plan Specification
The IRAM floor plan specification is as follows,
0.13 micron - the feature size of the transistors.
1 GHz - the clock frequency of the IRAM chip.
1 Gbit DRAM - the memory subsystem capacity in terms of memory modules.
128 MB - the total memory capacity.
16 GFLOPS (64b) - the peak number of 64-bit floating-point operations per second.
64 GOPS (16b) - the peak number of 16-bit operations per second.
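The capacity figures follow directly from the module counts, and the peak rates can be related to the 8 vector lanes and the 1 GHz clock. A sketch of that arithmetic; the figure of 2 operations per lane per cycle (for example a multiply and an add) and the split of each 64-bit lane into four 16-bit sub-lanes are assumptions used to reconcile the numbers, not details stated in this report:

```python
MODULES_PER_ZONE = 512
ZONES = 2
MODULE_MBIT = 1
LANES = 8
CLOCK_GHZ = 1.0
OPS_PER_LANE_PER_CYCLE = 2        # assumption: e.g. one multiply and one add
SUBWORDS_16B_PER_LANE = 64 // 16  # assumption: a 64-bit lane acts as four 16-bit lanes


def zone_mbytes() -> float:
    """Capacity of one memory zone in MB (8 bits per byte)."""
    return MODULES_PER_ZONE * MODULE_MBIT / 8


def total_mbytes() -> float:
    return ZONES * zone_mbytes()


def peak_gflops_64() -> float:
    return LANES * CLOCK_GHZ * OPS_PER_LANE_PER_CYCLE


def peak_gops_16() -> float:
    return peak_gflops_64() * SUBWORDS_16B_PER_LANE


if __name__ == "__main__":
    print(zone_mbytes(), total_mbytes())     # 64.0 128.0
    print(peak_gflops_64(), peak_gops_16())  # 16.0 64.0
```

The capacities (1024 Mbit = 1 Gbit = 128 MB) are exact; the throughput figures match the specification only under the stated assumptions.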
IRAM - Complete Architecture
IRAM is implemented from the floor plan, according to the floor plan
specifications. Figure 4.6 represents the complete architecture of IRAM
implementation. The 1024 1-Mbit modules are embedded in IRAM as shown in the figure. The 16 KB L1 cache is split into two: an 8 KB instruction cache and an 8 KB data cache. The instruction cache serves the processor's instruction fetches and the data cache serves the various memory “load and store” operations.
Figure 4.6
The 2-way superscalar processor handles scalar processing. It is called 2-way because it can issue up to two instructions per clock cycle, with separate issue for FPU (Floating Point Unit) operations and LSC (Load/Store/Coprocessor) operations.
The vector registers hold the data for the vector instructions issued by the vector processor. The vector
instructions are queued from the 2-way superscalar processor to the vector instruction queue unit. The arithmetic and logic unit executes the arithmetic and logic instructions issued by both the superscalar processor and the vector processor. The Load/Store unit performs the memory load and store operations. The serial I/O lines connect IRAM with the various input and output devices.
The memory crossbar switch links the memory, the processor and the input-output devices for their mutual interactions. The integrated architecture lets the memory and processor coexist and perform as a single unit. Because there is a high level of interaction between the memory, processor and I/O devices, the performance of IRAM is very high compared with a separate-chip processor-memory unit. This unified architecture, produced on a single fabrication line, is the key to the excellent performance of IRAM in both processing-intensive and memory-intensive operations.
IRAM - Benchmarking
Benchmarking is a process, or group of processes, by which one can make the right decision about the performance and efficiency of a product or technology compared with the other products or technologies taking part. Benchmarking helps us find the product or technology that is appropriate for our requirements, so it provides solid evidence of the performance and efficiency of a product or technology compared with others.
Benchmarking Environment
The benchmarking environment is the setting in which the whole process is carried out to arrive at the right decision. It may include various products based on the same technology or on different technologies, depending on the type and requirements of the benchmarking. Here, IRAM is benchmarked against other processor-memory combinations to demonstrate its real potential. So the benchmarking environment consists of processors from different manufacturers, each combined with memory modules from different sources to perform as a single unit, while IRAM is itself a single unit manufactured in a single fabrication process. Table 5.1 shows the various processors selected for the
benchmarking and the different memory modules used for the
benchmarking process.
μP | IRAM | SPARC | R10K | P III | P 4 | EV 6
Make | Berkeley | Sun | Origin | Intel | Intel | Alpha
Clock | 1 GHz | 833 MHz | 900 MHz | 950 MHz | 1.5 GHz | 966 MHz
L1 | 8+8 KB | 16+16 KB | 32+32 KB | 32 KB | 12+8 KB | 64+64 KB
L2 | NA | 2 MB | 1 MB | 256 KB | 256 KB | 2 MB
Memory | 128 MB | 256 MB | 1 GB | 256 MB | 1 GB | 512 MB
Table 5.1
The processors used for benchmarking are the SPARC from Sun, the R10K from Origin, the P III and P 4 from Intel, the EV6 from Alpha and IRAM from Berkeley. The clock frequencies of the processors, their L1 and L2 caches and the memory modules used with them are listed in Table 5.1. Note that IRAM has the smallest L1 cache, no L2 cache and the smallest memory capacity; all the other processors have larger L1 and L2 caches and more memory.
Benchmarking Processes
The various benchmarking processes carried out are,
1. Transitive Closure: The first benchmark problem is to compute the
transitive closure of a directed graph in a dense representation. The code taken
from the DIS reference implementation used non-unit stride, but was easily
changed to unit stride. This benchmark performs only 2 arithmetic operations
(an addition and a subtraction) at each step, while it executes 2 loads and 1
store.
2. Giga Updates per Second (GUPS): This benchmark is a synthetic
problem, which measures giga-updates-per-second. It repeatedly reads and
updates distinct, pseudo-random memory locations. The inner loop contains 1
arithmetic operation, 2 loads, and 1 store, but unlike transitive, the memory
accesses are random. It contains abundant data-parallelism because the
addresses are pre-computed and free of duplicates.
3. Sparse Matrix-Vector Multiplication (SPMV): This problem also
requires random memory access patterns and a low number of arithmetic
operations. It is common in scientific applications, and appears in both the DIS
and NPB suites in the form of a Conjugate Gradient (CG) solver. We have a CG
implementation for IRAM, which is dominated by SPMV, but here we focus on
the kernel to isolate the memory system issues. The matrices contain a pseudo-
random pattern of non-zeros using a construction algorithm from the DIS
specification, parameterized by the matrix dimension, n, and the number of non-zeros, m.
4. Histogram: Computing a histogram of a set of integers can be used for
sorting and in some image processing problems. Two important considerations
govern the algorithmic choice: the number of buckets, b, and the likelihood of
duplicates. For image processing, the number of buckets is large and collisions
are common because there are typically many occurrences of certain colors
(e.g.: white) in an image. Histogram is nearly identical to GUPS in its memory
behaviour, but differs due to the possibility of collisions, which limit parallelism
and are particularly challenging in a data-parallel model.
5. Mesh Adaptation: The final benchmark is a two-dimensional unstructured mesh adaptation algorithm based on triangular elements. This benchmark is more complex than the others, and there is no single inner loop to characterize. The memory accesses include both random and unit-stride accesses, and the key problem is the complex control structure, since there are several different cases when inserting a new point into an existing mesh. Starting with a coarse-grained task-parallel program, we performed significant code reorganization and data preprocessing to allow vectorization.
These benchmarks were selected specifically to test both the processing and the memory handling of the various processor-memory combinations and IRAM. They are both processing-intensive and memory-intensive, squeezing the last drop of performance from the different processor-memory systems.
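The memory behaviour of three of the kernels above can be seen in their inner loops. The Python sketches below are simplified illustrations, not the DIS reference code: the function names are hypothetical, GUPS uses `random.sample` to produce distinct pre-computed addresses, SPMV assumes the common compressed sparse row (CSR) format (the report does not state the storage format), and the histogram uses per-lane private copies, one standard way to avoid collisions in a data-parallel model that may differ from the method used in the benchmark.

```python
import random
from typing import List, Sequence


def gups(table: List[int], num_updates: int, seed: int = 0) -> None:
    """Update distinct pseudo-random locations: 2 loads, 1 add, 1 store each."""
    rng = random.Random(seed)
    indices = rng.sample(range(len(table)), num_updates)  # pre-computed, no duplicates
    for idx in indices:      # load the index
        table[idx] += 1      # load table[idx], add, store


def spmv_csr(values: Sequence[float], col_idx: Sequence[int],
             row_ptr: Sequence[int], x: Sequence[float]) -> List[float]:
    """y = A x with A in CSR form; x[col_idx[k]] is a random (indexed) load."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def histogram_privatized(data: Sequence[int], num_buckets: int,
                         lanes: int = 4) -> List[int]:
    """Element i goes to lane i % lanes; each lane counts into its own copy,
    so lanes never collide on a bucket. The copies are merged at the end."""
    private = [[0] * num_buckets for _ in range(lanes)]
    for i, value in enumerate(data):
        private[i % lanes][value] += 1
    return [sum(copy[b] for copy in private) for b in range(num_buckets)]


if __name__ == "__main__":
    t = [0] * 1000
    gups(t, 100)
    print(sum(t), max(t))  # 100 1

    # A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSR form, x = [1, 1, 1]
    print(spmv_csr([1, 2, 3, 4, 5], [0, 2, 1, 0, 2], [0, 2, 3, 5], [1, 1, 1]))
    print(histogram_privatized([0, 1, 1, 3, 3, 3], 4))  # [1, 2, 0, 3]
```

The private copies cost extra memory (lanes × buckets counters), which is why collisions are harder to handle when the number of buckets is large, as in image processing.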
Transitive Closure
The best-case scenario for both caches and vectors is a unit-stride memory
access pattern, as found in the transitive closure benchmark. In this case, the
main advantage for IRAM is the size of its on-chip memory, since DRAM is
denser than SRAM. IRAM has 12 MB of on-chip memory compared to tens of
KB for the L1 caches on the cache-based machines. IRAM is admittedly a large
chip, but this is partly due to being an academic research project with a very
small design team; the 2-3 orders of magnitude advantage in on-chip memory
size is primarily due to the memory technology. Figure 5.1 shows the
performance of the transitive closure benchmark.
The results confirm the expected advantage for IRAM on a problem
with abundant parallelism and a low arithmetic-to-memory operation ratio.
Performance is relatively insensitive to graph size, although IRAM performs
better on larger problems due to the longer average vector length. The Pentium 4
shows a similar effect, which may be due to improved branch prediction because
of the sparse graph structure in the test problem.
Figure 5.1
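The report does not reproduce the benchmark kernel. Purely as an illustration, the sketch below assumes a Warshall-style boolean formulation of transitive closure (an assumption; the DIS code may differ, for example by computing shortest paths instead of reachability). Its inner loop ORs one row into another element by element, which is the unit-stride pattern that suits both caches and vector units. The function name is hypothetical.

```python
def transitive_closure(adj):
    """Warshall-style boolean transitive closure (illustrative sketch).

    adj is an n x n list of lists of 0/1 ints; adj[i][j] == 1 means an edge
    i -> j. Returns reach, where reach[i][j] == 1 if j is reachable from i
    through one or more edges.
    """
    n = len(adj)
    reach = [list(row) for row in adj]  # copy so the input is not modified
    for k in range(n):
        row_k = reach[k]
        for i in range(n):
            if reach[i][k]:
                row_i = reach[i]
                # Unit-stride inner loop: rows i and k are walked element by
                # element, the access pattern a vector unit can stream.
                for j in range(n):
                    row_i[j] |= row_k[j]
    return reach


# Usage: a three-node chain 0 -> 1 -> 2.
print(transitive_closure([[0, 1, 0], [0, 0, 1], [0, 0, 0]]))
# prints [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```

Longer rows mean longer unit-stride vectors, which is consistent with IRAM doing better on larger graphs.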
Giga Updates per Second (GUPS)
A more challenging memory access pattern is one with either non-unit strides or
indexed loads and stores (scatter/gather operations). The first challenge for any
machine is generating the addresses, since each address needs to be checked for
validity and for collisions. IRAM can generate only 4 addresses per cycle,
independent of the data width. For 64-bit data, this is sufficient to load or store a
value on every cycle, but if the data width is halved to 32 bits, the four 64-bit lanes
perform arithmetic operations at the rate of eight 32-bit lanes, and the arithmetic
unit can more easily be starved for data. In addition, details of the memory bank
structure can become apparent, as multiple accesses to the same DRAM bank
require additional latency to precharge the DRAM. The frequency of these bank
conflicts depends on the memory access pattern and the number of banks in the
memory system.
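The address-generation bottleneck can be made concrete with a little arithmetic. The back-of-the-envelope model below assumes, as in the text, four 64-bit lanes that subdivide into more, narrower lanes, and 4 indexed addresses per cycle regardless of width; the 16-bit case simply extends the same assumption. It ignores bank conflicts and is not a model of the real chip's timing. All names are hypothetical.

```python
LANE_COUNT = 4          # 64-bit lanes, as described in the text
LANE_BITS = 64
ADDRESS_GENERATORS = 4  # indexed addresses per cycle, independent of width


def indexed_throughput(data_width_bits):
    """Return (elements the lanes can process per cycle,
    elements that can be addressed per cycle,
    fraction of the arithmetic rate usable on indexed data)."""
    lanes = LANE_COUNT * LANE_BITS // data_width_bits  # narrower data -> more lanes
    addressed = min(lanes, ADDRESS_GENERATORS)
    return lanes, addressed, addressed / lanes


for width in (64, 32, 16):
    lanes, addressed, usable = indexed_throughput(width)
    print(f"{width}-bit: {lanes} elements/cycle, {addressed} addresses/cycle, "
          f"{usable:.0%} of arithmetic rate usable")
```

At 64 bits the address generators keep pace with the lanes; at 32 bits only half of the arithmetic rate can be fed with indexed operands, which is why narrowing the data stops paying off for GUPS.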
The GUPS benchmark results, shown in Figure 5.2, highlight the
address generation issue. Although performance improves slightly when moving
from 64 to 32 bits, beyond that it stays constant because of the limit of 4 address
generators. Overall, though, IRAM does very well on this benchmark, nearly
doubling the performance of its nearest competitor, the Pentium 4, for 32- and 64-bit
data. In fairness, GUPS was one of the benchmarks for which the benchmark
authors hand-tuned the compiler-generated assembly code for the inner
loops, which produced a 20-60% speedup.
Figure 5.2
In addition to the MOPS (millions of operations per second) rate, it is interesting
to observe the memory bandwidth consumed in this problem. GUPS achieves 1.77,
2.36, 3.54, and 4.87 GB/s of memory bandwidth on IRAM at 8-, 16-, 32-, and 64-bit
data widths, respectively. The 64-bit figure, about 76% of the peak memory
bandwidth of 6.4 GB/s, is relatively close to the peak.
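The fraction of peak bandwidth can be checked directly from the figures quoted above. The measured values and the 6.4 GB/s peak come from the text; the helper name is hypothetical.

```python
PEAK_GBPS = 6.4  # peak memory bandwidth quoted in the text

# Measured GUPS bandwidth on IRAM (GB/s), keyed by data width in bits.
measured_gbps = {8: 1.77, 16: 2.36, 32: 3.54, 64: 4.87}


def fraction_of_peak(gbps, peak=PEAK_GBPS):
    """Share of the peak bandwidth actually consumed."""
    return gbps / peak


for width, gbps in measured_gbps.items():
    print(f"{width:2d}-bit: {gbps:.2f} GB/s = {fraction_of_peak(gbps):.0%} of peak")
```

Only the 64-bit run comes close to the peak (about 76%); the 8-bit run uses roughly 28% of it, because each access moves fewer bytes while the address rate stays fixed.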
Sparse Matrix-Vector Multiplication (SPMV)
For the SPMV benchmark, the matrix dimension was set to
10,000 and the number of nonzeros to 177,782, i.e., there were about 18
nonzeros per row. The computation is done in single-precision floating point. The
pseudorandom pattern of nonzeros is particularly challenging; many matrices
taken from real applications have some structure that gives better locality,
which would especially benefit cache-based machines.
Four different algorithms were considered for SPMV, reflecting
the best practice for both cache-based and vector machines. The performance
results are shown in Figure 5.3. Compressed Row Storage (CRS) is the most
common sparse matrix format; it stores an array of column indices and nonzero
values for each row, and SPMV is then performed as a series of sparse dot
products. The performance on IRAM is better than on some cache-based machines,
but it suffers from a lack of parallelism.
The dot product is performed by recursive halving, so vectors start with an
average of 18 elements and get shorter from there. Both the P4 and EV6 exceed IRAM
performance for this reason. CRS-banded uses the same format and algorithm as
CRS, but reflects a different nonzero structure that would likely result from
bandwidth-reduction orderings, such as reverse Cuthill-McKee (RCM). This has
little effect on IRAM, but improves the cache hit rate on some of the other
machines.
Figure 5.3
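A minimal sketch of the CRS layout and the row-by-row SPMV it implies follows. It is plain Python rather than vector code: each row's dot product is summed serially here, whereas on IRAM it is vectorized and reduced by recursive halving. Function and argument names are illustrative.

```python
def crs_spmv(row_ptr, col_idx, values, x):
    """Compute y = A x for a sparse matrix A in Compressed Row Storage.

    Row i's nonzeros are values[row_ptr[i]:row_ptr[i + 1]], and their
    column numbers are the matching entries of col_idx.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        total = 0.0
        # One sparse dot product per row; x is read through an index
        # (a gather), and the DIS matrix has only about 18 nonzeros per row.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            total += values[p] * x[col_idx[p]]
        y[i] = total
    return y


# Usage: A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] stored in CRS.
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(crs_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))  # [7.0, 6.0, 19.0]
```

The available parallelism per dot product is bounded by the row length, which is the weakness the other three formats try to address.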
The Ellpack (or Itpack) format forces all rows to have the same length by
padding them with zeros. It still has indexed memory operations, but increases
available data parallelism through vectorization across rows. The raw Ellpack
performance is excellent, and this format should be used on IRAM and the PIII for
matrices whose longest row length is close to the average. If we instead measure
the effective performance, which discounts operations performed on padded
zeros, the efficiency can be arbitrarily poor. Indeed, for the randomly generated
DIS matrix, padding causes an enormous increase in the matrix size and number of
operations, making the format impractical.
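The padding trade-off can be seen in a small sketch: converting CRS to an Ellpack-style slot-major layout makes the inner loop run across all rows (long vectors), but every row is stored at the width of the longest one. The names and the choice of column 0 for padded slots are assumptions for illustration.

```python
def crs_to_ellpack(row_ptr, col_idx, values):
    """Pad every row to the longest row length (Ellpack/Itpack layout).

    Returns (cols, vals) stored slot-major: cols[k][i] and vals[k][i] hold
    the k-th entry of row i. Padded slots use column 0 and value 0.0.
    """
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)), default=0)
    cols = [[0] * n_rows for _ in range(width)]
    vals = [[0.0] * n_rows for _ in range(width)]
    for i in range(n_rows):
        for k, p in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            cols[k][i] = col_idx[p]
            vals[k][i] = values[p]
    return cols, vals


def ellpack_spmv(cols, vals, x, n_rows):
    """y = A x; the inner loop runs across rows, so vectors are n_rows long."""
    y = [0.0] * n_rows
    for k in range(len(cols)):
        for i in range(n_rows):  # unit stride in vals[k], indexed gather from x
            y[i] += vals[k][i] * x[cols[k][i]]
    return y


def padding_overhead(row_ptr):
    """Ratio of stored (padded) entries to true nonzeros."""
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)), default=0)
    nnz = row_ptr[-1] - row_ptr[0]
    return width * n_rows / nnz if nnz else float("inf")


# Usage with A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]].
cols, vals = crs_to_ellpack([0, 2, 3, 5], [0, 2, 1, 0, 2], [1.0, 2.0, 3.0, 4.0, 5.0])
print(ellpack_spmv(cols, vals, [1.0, 2.0, 3.0], 3))  # [7.0, 6.0, 19.0]
print(padding_overhead([0, 2, 3, 5]))                # 1.2
```

A single unusually long row raises the width for every row, so `padding_overhead` grows with the gap between the longest and the average row length; that is the mechanism behind the poor effective performance on the random DIS matrix.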
The Segmented-sum algorithm was first
proposed for the Cray PVP. The data structure is an augmented form of the CRS
format, and the computational structure is similar to Ellpack, although there is
additional control complexity. The underlying Ellpack-style algorithm was modified
in a way that converts roughly 2/3 of the memory accesses from a large stride to
unit stride; the remaining 1/3 are still indexed references. This was important on
IRAM, because the benchmark uses 32-bit data and IRAM has only 4 address
generators, as discussed above.
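The sketch below shows only the generic segmented-sum idea: form every product in one long operation over all nonzeros, then add them up per row with a segmented scan whose segments start at row boundaries. It does not reproduce the Cray PVP data layout or the modified version with 2/3 unit-stride accesses described above; all names are hypothetical, and the Python loops stand in for vector operations.

```python
def segmented_inclusive_scan(vals, flags):
    """Inclusive sum scan that restarts wherever flags[p] is True.

    Uses log2(n) data-parallel steps (Hillis-Steele style): in each step,
    every position p >= d combines with position p - d.
    """
    v, f = list(vals), list(flags)
    n, d = len(v), 1
    while d < n:
        new_v, new_f = v[:], f[:]
        for p in range(d, n):  # every p could be processed in parallel
            if not f[p]:
                new_v[p] = v[p - d] + v[p]
            new_f[p] = f[p] or f[p - d]
        v, f = new_v, new_f
        d *= 2
    return v


def segmented_sum_spmv(row_ptr, col_idx, values, x):
    """y = A x with all products formed at once, then summed per row."""
    n_rows = len(row_ptr) - 1
    # One long vector operation over every nonzero, independent of row length.
    products = [values[p] * x[col_idx[p]] for p in range(len(values))]
    flags = [False] * len(values)
    for i in range(n_rows):
        if row_ptr[i] < row_ptr[i + 1]:
            flags[row_ptr[i]] = True  # a new row (segment) starts here
    sums = segmented_inclusive_scan(products, flags)
    # Each row's total sits at its last nonzero; empty rows give 0.0.
    return [sums[row_ptr[i + 1] - 1] if row_ptr[i + 1] > row_ptr[i] else 0.0
            for i in range(n_rows)]


# Usage with A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CRS form.
print(segmented_sum_spmv([0, 2, 3, 5], [0, 2, 1, 0, 2],
                         [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0]))
# prints [7.0, 6.0, 19.0]
```

Vector length here is set by the total number of nonzeros rather than by the length of a single row, which is why the approach suits short-row matrices without Ellpack's padding.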
Histogram
This benchmark builds a histogram for the pixels in a 500x500 image from the
DIS Specification. The number of buckets depends on the number of bits in each
pixel, so we use the base-2 logarithm of the number of buckets (i.e., the pixel depth)
as the parameter in our study. Performance results for pixel depths of 7, 11, and 15
are shown in Figure 5.4.
The first five sets are for IRAM; all but the second (Retry 0%) use this image data
set. The first set (Retry) uses the compiler's default vectorization algorithm, which
vectorizes while ignoring duplicates and corrects the duplicates in a serial phase at
the end.
This works well if there are few duplicates, but performs poorly here, where
duplicates are common. The second set (Retry 0%) shows the performance when the
same algorithm is used on data containing no duplicates. The third set (Privy) makes
several private copies of the buckets, which are merged at the end. It performs poorly
due to the large number of buckets and gets worse as this number increases with the
pixel depth.
Figure 5.4
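The first and third strategies can be sketched as follows. This is a simulation of the idea, not IRAM code: a Python loop over a chunk of `vl` elements stands in for one vector instruction, and "later lane wins" is an assumed model of what an unordered scatter does when two lanes hit the same bucket. Function names are hypothetical.

```python
def histogram_retry(data, num_buckets, vl=64):
    """Vectorize while ignoring duplicates, then repair lost updates serially."""
    counts = [0] * num_buckets
    lost = []  # buckets whose increments were overwritten by another lane
    for start in range(0, len(data), vl):
        chunk = data[start:start + vl]
        gathered = [counts[b] for b in chunk]  # vector gather
        winner = {}
        for lane, b in enumerate(chunk):       # vector scatter of count + 1
            counts[b] = gathered[lane] + 1     # a later lane overwrites an earlier one
            winner[b] = lane
        lost.extend(b for lane, b in enumerate(chunk) if winner[b] != lane)
    for b in lost:                             # serial correction phase at the end
        counts[b] += 1
    return counts


def histogram_privatized(data, num_buckets, vl=64):
    """Give every lane a private copy of the buckets, then merge the copies."""
    copies = [[0] * num_buckets for _ in range(vl)]
    for start in range(0, len(data), vl):
        for lane, b in enumerate(data[start:start + vl]):
            copies[lane][b] += 1  # lanes never share a copy, so no collisions
    # The merge touches vl * num_buckets counters, which grows with pixel depth.
    return [sum(copy[b] for copy in copies) for b in range(num_buckets)]


data = [3, 1, 3, 3, 0, 1, 2, 3] * 5
print(histogram_retry(data, 4, vl=4), histogram_privatized(data, 4, vl=4))
```

Both return correct counts, but the cost lands in different places: the length of `lost` (and hence the serial phase) grows with the number of duplicates, while the privatized merge grows with the number of buckets times the number of copies.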
The fourth and fifth algorithms use a more sophisticated sort-diff-find-diff
algorithm that performs in-register sorting. Bitonic sort was used because its
communication requirements are regular and it proved to be a good match for
IRAM's “butterfly” permutation instructions, designed primarily for reductions and
FFTs. The compiler automatically generates in-register permutation code for
reductions, but the sorting algorithm used here was hand-coded. The two sort
algorithms differ in the allowed data width: one works when the width is less than
16 bits and the other when it is up to 32 bits.
The narrower version takes advantage of the higher arithmetic
performance for narrow data on IRAM. Results show that on IRAM, the sort-based
and privatized optimization methods consistently give the best performance over
the range of bit depths. They also demonstrate the improvements that can be
obtained when the algorithm is tailored to shorter bit depths.
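One reading of "sort-diff-find-diff" is sketched below: sort each vector-sized chunk with a bitonic network, diff neighbouring values to find where runs of equal buckets end, then diff those end positions to get run lengths. After sorting, each bucket appears at most once among the run ends, so the final scatter-add has no collisions. This interpretation, the function names, and the padding scheme are assumptions; the Python loops only mimic the butterfly compare-exchange stages that IRAM performs in registers.

```python
def bitonic_sort(a):
    """Iterative bitonic sort; len(a) must be a power of two."""
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    a = list(a)
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            # One butterfly stage: element i is compared with partner i ^ j.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a


def histogram_sort_diff(data, num_buckets, vl=16):
    """Histogram built from sorted chunks; vl must be a power of two."""
    counts = [0] * (num_buckets + 1)  # extra slot absorbs padding values
    pad = num_buckets
    for start in range(0, len(data), vl):
        chunk = list(data[start:start + vl])
        chunk += [pad] * (vl - len(chunk))  # pad the last chunk to vl
        s = bitonic_sort(chunk)
        # diff: mark the last element of each run of equal values
        ends = [i for i in range(vl) if i == vl - 1 or s[i] != s[i + 1]]
        # diff again on the run-end positions to get the run lengths
        prev = -1
        for e in ends:
            counts[s[e]] += e - prev  # each bucket appears once: no collisions
            prev = e
    return counts[:num_buckets]


print(bitonic_sort([3, 1, 2, 0]))  # [0, 1, 2, 3]
```

Every compare-exchange pattern is fixed in advance by `k` and `j`, independent of the data, which is what makes bitonic sort a natural fit for permutation hardware.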
Overall, IRAM does not do as well here as on the other benchmarks, because
the presence of duplicates hurts vectorization but can actually improve cache
hit rates on cache-based machines. We therefore see excellent timings for the
histogram computation on these machines without any special optimizations. A
memory system advantage starts to be apparent for 15-bit pixels, where the
histograms do not fit in cache, and at this point IRAM's performance is comparable
to that of the faster microprocessors.
Mesh Adaptation
This benchmark performs a single level of refinement starting with
a mesh of 4802 triangular elements, 2500 vertices, and 7301 edges. In this
application, we use a different algorithm organization for the different machines:
the original code was designed for conventional processors and is used on those
machines, while the vector algorithm uses more memory bandwidth but contains
smaller loop bodies, which helps the compiler perform vectorization.
The vectorized code also presorts the mesh points to avoid branches in
the inner loop, as in Histogram. Although the branches hurt superscalar
performance, presorting is too expensive on those machines. Mesh adaptation also
requires indexed memory operations, so address generation again limits IRAM.
Figure 5.5
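The presorting idea can be illustrated generically. The report does not list the refinement cases, so `classify` and `handlers` below are hypothetical placeholders for "which insertion case applies to this point" and "how that case is processed". The branchy version decides per item; the presorted version pays for an extra grouping pass, then runs one uniform loop per case.

```python
from collections import defaultdict


def process_branchy(items, classify, handlers):
    """Conventional loop: one data-dependent decision per item."""
    return [handlers[classify(item)](item) for item in items]


def process_presorted(items, classify, handlers):
    """Presort items by case, then run one uniform loop per case."""
    groups = defaultdict(list)
    for item in items:                      # the extra presorting pass
        groups[classify(item)].append(item)
    results = []
    for case, group in groups.items():
        handler = handlers[case]            # chosen once per group, not per item
        results.extend(handler(item) for item in group)
    return results


# Usage with three made-up cases.
handlers = {0: lambda v: v * 10, 1: lambda v: v + 100, 2: lambda v: -v}
print(sorted(process_presorted(range(10), lambda v: v % 3, handlers)))
```

The results are the same multiset, only reordered. The extra pass and memory traffic are worthwhile on a vector machine, where each group becomes a long branch-free vector loop, but, as the text notes, too costly on the superscalar machines.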
Figure 5.5 shows the performance of the processors on Mesh Adaptation. It
indicates that IRAM performed well in mesh adaptation compared
to the other processors.
The only close competitor was the Intel Pentium 4, so IRAM emerged as a clear
winner on the processing-intensive Mesh Adaptation benchmark.
Summary of Benchmark Characteristics
An underlying goal of the benchmark study was to identify the limiting factor in these
memory-intensive benchmarks.
The graph in Figure 5.6 shows the memory bandwidth used on IRAM
and the MOPS rate achieved on each of the benchmarks using the best algorithm
on the most challenging input. GUPS uses the 64-bit version of the problem, SPMV
uses the segmented-sum algorithm, and Histogram uses the 16-bit sort. While all
of these problems have low operation counts per memory operation, the memory
and operation rates are quite different in practice. Of these benchmarks, GUPS is
the most memory-intensive, whereas Mesh Adaptation is the least.
Histogram, SPMV and Transitive Closure have roughly the same balance
between computation and memory, although their absolute performance varies
dramatically due to differences in parallelism. In particular, although GUPS and
Histogram have nearly identical characteristics, the difference in parallelism
results in very different absolute performance as well as a different ratio of
bandwidth to operation rate. Figure 5.7 summarizes the performance of each of the
benchmarks across machines.
The y-axis is a log scale, and IRAM is significantly faster than the other
machines on all applications except SPMV and Histogram. An even more dramatic
picture is seen from measuring the MOPS/Watt ratio, as shown in Figure 5.8. Most
of the cache-based machines use a small amount of parallelism, but spend a great
deal of power on a high clock rate. Indeed, a graph of Flops per machine cycle looks
very similar. Only the Pentium III, designed for portable machines, has a
comparable power consumption: 4 Watts compared to IRAM's 2 Watts. The
Pentium III cannot compete on performance, however, due to its lack of parallelism.
Figure 5.6
Figure 5.7
Figure 5.8
Advantages of IRAM
Low Latency
To reduce latency, the wire length should be kept as short as possible. This
suggests that the fewer bits per block, the better. In addition, the DRAM cells furthest
away from the processor will be slower than the closest ones. Rather than
restricting the access timing to accommodate the worst case, the processor could
be designed to be aware when it is accessing “slow” or “fast” memory. Some
additional reduction in latency can be obtained simply by not multiplexing the
address, as there is no reason to do so on an IRAM. Also, being on the same chip
as the DRAM, the processor avoids driving the off-chip wires, potentially turning
around the data bus, and accessing an external memory controller. In summary,
the access latency of an IRAM processor does not need to be limited by the same
constraints as a standard DRAM part.
Much lower latency may be obtained by intelligent floor planning, faster
circuit topologies, and redesigned address/data bussing schemes. A memory
latency of less than 30 ns for random addresses is possible for a
latency-oriented DRAM design on the same chip as the processor; this is as fast as
second-level caches. Recall that the memory latency on the AlphaServer 8400 is
253 ns.
IRAM offers performance opportunities for applications with unpredictable
memory accesses and very large memory “footprints”, such as databases, which
may take advantage of the potential 5X to 10X decrease in IRAM latency. The
lower latency of IRAM is due to the fact that it does not need separate DRAM chips,
an external memory controller, or a long bus that must be turned around, and it
also has fewer pins. The latency of an IRAM chip is 10-15 ns for a 64-128 MB
memory capacity, which is very low compared to other memory chips.
High Bandwidth
A DRAM naturally has extraordinary internal bandwidth, essentially fetching
the square root of its capacity each DRAM clock cycle; an on-chip processor
can tap that bandwidth. The potential bandwidth of the gigabit DRAM is even
greater than indicated by its logical organization. Since it is important to keep
the storage cell small, the normal solution is to limit the length of the bit lines,
typically with 256 to 512 bits per sense amp. This quadruples the number of
sense amplifiers.
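The "square root of its capacity" rule of thumb is easy to evaluate. The sketch below assumes an idealized single square array, so one access senses a full row; real gigabit parts are split into many sub-arrays, as the following paragraphs describe. The function name is hypothetical.

```python
import math


def bits_per_array_access(capacity_bits):
    """Bits sensed per access if the cells form an idealized square array:
    one full row, i.e. the square root of the capacity."""
    return math.isqrt(capacity_bits)


GIGABIT = 2 ** 30
row_bits = bits_per_array_access(GIGABIT)
print(row_bits, "bits =", row_bits // 8, "bytes per DRAM cycle")
# prints 32768 bits = 4096 bytes per DRAM cycle
```

A conventional DRAM part delivers only a few bytes of that row per cycle through its pins; an on-chip processor can, in principle, use far more of it.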
To save die area, each block has a small number of I/O lines, which reduces the
internal bandwidth by a factor of about 5 to 10 but still meets the external
demand. One IRAM goal is to capture a larger fraction of the potential on-chip
bandwidth. For example, two prototype 1-gigabit DRAMs were presented at
ISSCC in 1996. As mentioned above, to cope with the long wires inherent in the
600 mm² dies of the gigabit DRAMs, vendors are using more metal layers: 3 for
Mitsubishi and 4 for Samsung.
The two chips contain 512 2-Mbit modules and 1024 1-Mbit modules,
respectively. Thus a gigabit IRAM might have 1024 memory modules, each
1 Kbit wide. Not only would there be tremendous bandwidth at
the sense amps of each block, but the extra metal layers would also enable more
cross-chip bandwidth. Assuming a 1-Kbit metal bus needs just 1 mm, a 600 mm²
IRAM might have 16 busses running at 50 to 100 MHz.
Thus the internal IRAM bandwidth should be as high as 200-300 GBytes/sec.
For comparison, the sustained memory bandwidth of the AlphaServer 8400,
which includes a 75 MHz, 256-bit memory bus, is 1.2 GBytes/sec. A crossbar
switch in the IRAM architecture delivers only 1/3 to 2/3 of the theoretical
bandwidth, so the actual bandwidth will be 100-200 GB/sec. Applications with
predictable memory accesses, such as matrix manipulations, may take
advantage of the potential 50X to 100X increase in IRAM bandwidth.
High Energy Efficiency
Integrating a microprocessor and DRAM memory on the same die offers the
potential for improving the energy consumption of the memory system. DRAM is
much denser than SRAM, which is traditionally used for on-chip memory, so far
more memory fits on an IRAM chip. Therefore, an IRAM will make far fewer
external memory accesses, which consume a great deal of energy to drive
high-capacitance off-chip buses. Even on-chip accesses will be more energy
efficient, since DRAM consumes less energy than SRAM. Finally, an IRAM has the
potential for higher performance than a conventional approach. Since higher
performance at a fixed energy consumption can be translated into equal
performance at a lower amount of energy, the performance advantages of IRAM
can be translated into lower energy consumption.
Besides reducing the frequency of off-chip memory accesses, IRAM also reduces the
average energy per memory access, which is given by the equation

$$E_{\text{access}} = AE_{L1} + MR_{L1}\left(AE_{L2} + MR_{L2}\,AE_{\text{off-chip}}\right)$$

where AE = access energy and MR = miss rate.
The dominant term in this equation is the off-chip access energy. In an IRAM,
main memory is on chip, so the off-chip term disappears as long as the working
set fits on chip, and L1 misses are served by the on-chip DRAM rather than by an
L2 cache backed by off-chip memory. The energy per memory access therefore
reduces to the L1 access energy plus the L1 miss rate times the much smaller
on-chip DRAM access energy, and the L1 access energy itself is lower than on other
microprocessor chips. So the energy consumption of IRAM chips is very low, which
is very good for low-power and portable devices.
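The equation can be evaluated directly. The sketch below codes the conventional three-level form and an IRAM-style form in which L1 misses go to on-chip DRAM. The parameter values in the usage lines are arbitrary placeholders in arbitrary energy units, chosen only to show how the off-chip term dominates; they are not measurements from the report. Function names are hypothetical.

```python
def energy_per_access(ae_l1, mr_l1, ae_l2, mr_l2, ae_offchip):
    """Average energy per memory access: L1, then L2 on an L1 miss,
    then off-chip memory on an L2 miss (miss rates are local ratios)."""
    return ae_l1 + mr_l1 * (ae_l2 + mr_l2 * ae_offchip)


def energy_per_access_iram(ae_l1, mr_l1, ae_dram_onchip):
    """IRAM-style hierarchy: L1 misses are served by on-chip DRAM,
    so there is no off-chip term (assuming the working set fits on chip)."""
    return ae_l1 + mr_l1 * ae_dram_onchip


# Placeholder values for illustration only, not measured data.
conventional = energy_per_access(ae_l1=1.0, mr_l1=0.1, ae_l2=5.0,
                                 mr_l2=0.5, ae_offchip=100.0)
iram = energy_per_access_iram(ae_l1=1.0, mr_l1=0.1, ae_dram_onchip=5.0)
print(conventional, iram)
```

With these placeholders, the off-chip term contributes 5 of the 6.5 units in the conventional case, which is the term an IRAM removes.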
Memory Flexibility
Another advantage of IRAM over conventional designs is the ability to adjust
both the size and width of the on-chip DRAM. Rather than being limited by powers
of 2 in length or width, as conventional DRAM is, IRAM designers can specify
exactly the number of words and their width. This flexibility can improve the cost
of IRAM solutions versus memories made from conventional DRAMs.
Low Cost of Production
The RAM and the processor are fabricated on a single fabrication line, so unified
fabrication reduces cost. Taxes are also reduced, since the RAM and the processor
are no longer taxed as separate products. Since both the cost of production and
the tax are lower, the market price of IRAM will be less than the combined price of
a separate RAM and processor. Thus it is suitable for budget-conscious customers.
It should also be an attractive choice for performance-conscious customers,
because it delivers high performance at a low price.
Small Board Area
IRAM integrates several chips into “one chip”, so board area is greatly reduced.
It may therefore be attractive in applications where board area is precious, such as
cellular phones or portable computers. Since devices are rapidly becoming smaller
and more portable, IRAM offers a well-defined path toward these future goals.
Disadvantages of IRAM
No product or technology is perfect, and IRAM is no exception. The
disadvantages of the IRAM chip are:
1. Completely New Architecture: IRAM is a new technology which is
entirely different from current technological implementations, since it
integrates the processor and memory into a single chip. Accepting this new
technology means discarding current products and technologies; that is, the
complete system has to be redesigned from scratch. This may hinder wide
acceptance of the technology even though its performance is excellent. But once
it is widely accepted, this will no longer be a problem, as it becomes the current
technology and makes other technologies obsolete.
2. Non-Upgradeability of Memory: Since the DRAM is embedded in the
IRAM chip, the memory cannot be upgraded later. This may limit
the popularity of IRAM chips, since the demand for more memory capacity is
increasing rapidly. Researchers are working on a solution to this
problem, and next-generation chips may provide a way to
upgrade the memory capacity.
3. High Cost of Testing: Testing an IRAM chip costs more than testing a
conventional memory chip. The cost of testing during manufacturing is already
significant for DRAMs, and adding a processor would significantly increase the
test time on conventional DRAM testers. Once the technology is established,
however, the testing cost should matter less in the long run, since the revenue
will offset the higher cost of testing. A more cost-effective method of testing
the DRAMs may also emerge over time.
4. Overheating: The high level of integration reduces the total chip area
considerably. So even though less heat is produced than by current
processors, the chip may overheat because the heat is concentrated in a small
area. This can be mitigated by an efficient heat sink or cooling system that
carries the heat away to the external environment.
Conclusion
Merging a microprocessor and DRAM on the same chip presents opportunities
in performance, energy efficiency, and cost: a factor of 5 to 10 reduction in
latency, a factor of 50 to 100 increase in bandwidth, a factor of 2 to 4
advantage in energy efficiency, and an unquantified cost savings from removing
superfluous memory and reducing board area. The surprise is that these
claims are not based on some exotic, unproven technology; they are based instead
on tapping the potential of a technology that has been in use for the last 10 years.
The popularity of IRAM is limited only by the amount of memory on chip,
which should grow by about 60% per year. A best-case scenario would be for
IRAM to expand its beachhead in graphics, which requires about 10 Mbits, to
the game, embedded, and personal digital assistant markets, which require
about 32 Mbits of storage.
Such high volume applications could in turn justify creation of a process that is
more friendly to IRAM, with DRAM cells that are a little bigger than in a dedicated
DRAM process but much more amenable to logic and SRAM. As IRAM grows to 128
to 256 Mbits of storage, an IRAM might be adopted by the network computer
or portable PC markets. Such a success could in turn entice either
microprocessor manufacturers to include substantial DRAM on chip, or DRAM
manufacturers to include processors on chip.
Hence IRAM presents an opportunity to change the nature of the
semiconductor industry. From the current division into logic and memory
camps, a more homogeneous industry might emerge with historical
microprocessor manufacturers shipping substantial amounts of DRAM - just as
they ship substantial amounts of SRAM today - or historical DRAM
manufacturers shipping substantial numbers of microprocessors.
Both scenarios might even occur, with one set of manufacturers oriented
towards high performance and the other towards low cost. IRAM could also
create a new generation of computers with increased portability and
reduced size and power consumption, without compromising performance
or efficiency.
References
IRAM - Chips that remember and compute, IEEE International Solid-State Circuits Conference
A Case for Intelligent RAM, IEEE Micro
IRAM - the Industrial Setting, Applications, and Architectures, Computer Science Division, University of California, Berkeley
Vector IRAM - ISA and Micro-architecture, Computer Science Division, University of California, Berkeley
Vector IRAM - A Media-oriented Vector Processor with Embedded DRAM, Computer Science Division, University of California, Berkeley
A Media-enhanced vector architecture for embedded memory systems, Computer Science Division, University of California, Berkeley
Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines, Computer Science Division, University of California, Berkeley
IRAM - Overcoming the I/O Bus Bottleneck, Denver, CO, USA
The energy efficiency of IRAM architectures, 24th Annual International Symposium on Computer Architecture