R 402 Computer Organization
Computer Science & Engineering Dept. SJCET, Palai
COMPUTER ORGANIZATION
R 402 2+1+0
Module 1
Introduction: Organization and Architecture – Review of basic operational
concepts – CPU- single bus and two bus organization, Execution of a complete
instruction – interconnection structures – layered view of a computer system.
Module 2
CPU - Arithmetic: Signed addition and subtraction – serial and parallel adder –
BCD adder – Carry look-ahead adder, Multiplication – Array multiplier – Booth's
Algorithm, Division – Restoring and non-restoring division, floating point
arithmetic - ALU Design.
Module 3
Control Unit Organization: Processor Logic Design – Processor Organization –
Control Logic Design – Control Organization – Hardwired control –
Microprogram control – PLA control – Microprogram sequencer, Horizontal and
vertical micro instructions – Nano instructions.
Module 4
Memory: Memory hierarchy – RAM and ROM – Memory system considerations
– Associative memory, Virtual memory – Cache memory – Memory interleaving.
Module 5
Input – Output: Printers, Plotters, Displays, Keyboard, Mouse, OMR and OCR,
Device interface – I/O processor – Standard I/O interfaces – RS 232 C, IEEE
488.2 (GPIB).
References
1. Computer Organization - Hamacher, Vranesic and Zaky, McGraw Hill
2. Digital Logic and Computer Design - Morris Mano, PHI
3. Computer Organization and Architecture -William Stallings, Pearson Education
Asia.
4. Computer Organization and Design - Pal Chaudhuri, PHI
5. Computer Organization and Architecture -M Morris Mano, PHI
6. Computer Architecture and Organization - John P Hayes, McGraw Hill
1.1 Introduction to Computer organization and architecture
In describing computer systems, a distinction is often made between computer architecture and computer organization.
Computer architecture refers to those attributes of a system visible to a programmer, or
put another way, those attributes that have a direct impact on the logical execution of a
program.
Computer organization refers to the operational units and their interconnection that
realize the architecture specification.
Examples of architecture attributes include the instruction set, the number of bits used to
represent various data types (e.g., numbers and characters), I/O mechanisms, and
techniques for addressing memory.
Examples of organization attributes include those hardware details transparent to the
programmer, such as control signals, interfaces between the computer and peripherals,
and the memory technology used.
As an example, it is an architectural design issue whether a computer will have a multiply
instruction. It is an organizational issue whether that instruction will be implemented by a
special multiply unit or by a mechanism that makes repeated use of the add unit of the
system. The organization decision may be based on the anticipated frequency of use of
the multiply instruction, the relative speed of the two approaches, and the cost and
physical size of a special multiply unit.
Historically, and still today, the distinction between architecture and organization has
been an important one. Many computer manufacturers offer a family of computer models,
all with the same architecture but with differences in organization. Consequently, the
different models in the family have different price and performance characteristics.
Furthermore, an architecture may survive many years, but its organization changes with
changing technology.
Basic Structure of a Computer
Figure 1 shows the general structure of the IAS computer. It consists of:
A main memory, which stores both data and instructions.
An arithmetic-logical unit (ALU) capable of operating on binary data.
A control unit, which interprets the instructions in memory and causes them to be
executed.
Input and output (I/O) equipment operated by the control unit.
Fig.1 Basic structure of a computer.
1.2 Review of basic operational concepts
Now we focus on the processing unit, which executes machine instructions and
coordinates the activities of other units. This unit is often called the Instruction Set
Processor (ISP), or simply the processor. We examine its internal structure and how it
performs the tasks of fetching, decoding, and executing instructions of a program. The
processing unit used to be called the central processing unit (CPU). The term "central" is
less appropriate today because many modern computer systems include several
processing units.
The organization of processors has evolved over the years, driven by developments in
technology and the need to provide high performance. A common strategy in the
development of high-performance processors is to make various functional units operate
in parallel as much as possible. High-performance processors have a pipelined
organization where the execution of one instruction is started before the execution of the
preceding instruction is completed. In another approach, known as superscalar operation,
several instructions are fetched and executed at the same time. Pipelining and superscalar
architectures are discussed later.
A typical computing task consists of a series of steps specified by a sequence of machine
instructions that constitute a program. An instruction is executed by carrying out a
sequence of more rudimentary operations. These operations and the means by which they
are controlled are the main topic of this chapter.
1.3 CPU - single bus organization
To execute a program, the processor fetches one instruction at a time and performs the
operations specified. Instructions are fetched from successive memory locations until a
branch or a jump instruction is encountered. The processor keeps track of the address of
the memory location containing the next instruction to be fetched using the program
counter, PC. After fetching an instruction, the contents of the PC are updated to point to
the next instruction in the sequence. A branch instruction may load a different value into
the PC.
Another key register in the processor is the instruction register, IR. Suppose that
each instruction comprises 4 bytes, and that it is stored in one memory word. To execute
an instruction, the processor has to perform the following three steps:
1. Fetch the contents of the memory location pointed to by the PC. The contents of this
location are interpreted as an instruction to be executed. Hence, they are loaded into the
IR. Symbolically, this can be written as
IR ← [[PC]]
2. Assuming that the memory is byte addressable, increment the contents of the PC by 4,
that is,
PC ← [PC] + 4
3. Carry out the actions specified by the instruction in the IR.
In cases where an instruction occupies more than one word, steps 1 and 2 must be
repeated as many times as necessary to fetch the complete instruction. These two steps
are usually referred to as the fetch phase; step 3 constitutes the execution phase.
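The fetch-phase steps can be sketched as a toy Python model. This is only an illustration of the register transfers above; the word size, addresses, and memory contents are assumptions chosen for the example.

```python
# Toy model of the fetch phase: IR <- [[PC]], then PC <- [PC] + 4.
# Memory is modeled as a dict from byte address to instruction word;
# the 4-byte word size matches the assumption in the text.

WORD_SIZE = 4

def fetch(memory, pc):
    """Return (ir, new_pc) after one fetch phase."""
    ir = memory[pc]          # step 1: load the addressed word into IR
    new_pc = pc + WORD_SIZE  # step 2: point PC at the next instruction
    return ir, new_pc

memory = {0: "Add (R3),R1", 4: "Move (R1),R2"}
ir, pc = fetch(memory, 0)
print(ir, pc)   # the fetched instruction and the updated PC
```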
To study these operations in detail, we first need to examine the internal organization
of the processor. Its components can be organized and interconnected in a variety of
ways. We will start with a very simple organization. Later in this chapter and in Chapter
8 we will present more
complex structures that provide high performance. Figure 1.1 shows an organization in
which the arithmetic and logic unit (ALU) and all the registers are interconnected via a
single common bus. This bus is internal to the processor and should not be confused with
the external bus that connects the processor to the memory and I/O devices.
The data and address lines of the external memory bus are shown in Figure 1.1 connected
to the internal processor bus via the memory data register, MDR, and the memory address
register, MAR, respectively. Register MDR has two inputs and two outputs. Data may be
loaded into MDR either from the memory bus or from the internal processor bus. The
data stored in MDR may be placed on either bus. The input of MAR is connected to the
internal bus, and its output is connected to the external bus. The control lines of the
memory bus are connected to the instruction decoder and control logic block. This unit is
responsible for issuing the signals that control the operation of all the units inside the
processor and for interacting with the memory bus.
The number and use of the processor registers R0 through R(n - 1) vary considerably
from one processor to another. Registers may be provided for general-purpose use by the
programmer. Some may be dedicated as special-purpose registers, such as index registers
or stack pointers. Three registers, Y, Z, and TEMP in Figure 1.1, have not been
mentioned before. These registers are transparent to the programmer, that is, the
programmer need not be concerned with them because they are never referenced
explicitly by any instruction. They are used by the processor for temporary storage during
execution of some instructions. These registers are never used for storing data generated
by one instruction for later use by another instruction.
The multiplexer MUX selects either the output of register Y or a constant value 4 to be
provided as input A of the ALU. The constant 4 is used to increment the contents of the
program counter. We will refer to the two possible values of the MUX control input
Select as Select4 and SelectY for selecting the constant 4 or register Y, respectively.
As instruction execution progresses, data are transferred from one register to another,
often passing through the ALU to perform some arithmetic or logic operation. The
instruction decoder and control logic unit is responsible for implementing the actions
specified by the instruction loaded in the IR register. The decoder generates the control
signals needed to select the registers involved and direct the transfer of data. The
registers, the ALU, and the interconnecting bus are collectively referred to as the
datapath.
With few exceptions, an instruction can be executed by performing one or more of
the following operations in some specified sequence:
• Transfer a word of data from one processor register to another or to the ALU
• Perform an arithmetic or a logic operation and store the result in a processor register
• Fetch the contents of a given memory location and load them into a processor register
• Store a word of data from a processor register into a given memory location
We now consider in detail how each of these operations is implemented, using the
simple processor model in Figure 1.1.
Fig.1.1 Single bus organization of the data path inside a processor
1. 3 .1 REGISTER TRANSFERS
Instruction execution involves a sequence of steps in which data are transferred from one register
to another. For each register, two control signals are used to place the contents of that register on
the bus or to load the data on the bus into the register. This is represented symbolically in Figure
1.2. The input and output of register Ri are connected to the bus via switches controlled by the
signals Riin and Riout, respectively. When Riin is set to 1, the data on the bus are loaded into Ri.
Similarly, when Riout is
set to 1, the contents of register Ri are placed on the bus. While Riout is equal to 0, the
bus can be used for transferring data from other registers.
Suppose that we wish to transfer the contents of register R1 to register R4. This can be
accomplished as follows:
Enable the output of register R1 by setting R1out to 1. This places the contents of R1 on the
processor bus.
Enable the input of register R4 by setting R4in to 1. This loads data from the processor
bus into register R4.
All operations and data transfers within the processor take place within time periods
defined by the processor clock. The control signals that govern a particular transfer are
asserted at the start of the clock cycle. In our example, R1out and R4in are set to 1.
The registers consist of edge-triggered flip-flops. Hence, at the next active edge of the
clock, the flip-flops that constitute R4 will load the data present at their inputs. At the
same time, the control signals R1out and R4in will return to 0. We will use this simple
model of the timing of data transfers for the rest of this chapter. However, we should
point out that other schemes are possible. For example, data transfers may use both the
rising and falling edges of the clock. Also, when edge-triggered flip-flops are not used,
two or more clock signals may be needed to guarantee proper transfer of data. This is
known as multiphase clocking.
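The gating scheme described above can be modeled with a small Python function. This is an illustrative sketch, not a hardware simulator: real registers use tri-state drivers and edge-triggered flip-flops, and the register names and values here are made up.

```python
# One clock cycle on the single internal bus: exactly one register
# drives the bus (its Riout signal is 1), and every register whose
# Riin signal is 1 latches the bus value at the clock edge.

def clock_cycle(registers, out_signal, in_signals):
    """Simulate one bus transfer; mutates and returns `registers`."""
    bus = registers[out_signal]      # Riout = 1 places contents on the bus
    for dst in in_signals:           # each Riin = 1 loads from the bus
        registers[dst] = bus
    return registers

regs = {"R1": 25, "R4": 0}
clock_cycle(regs, out_signal="R1", in_signals=["R4"])   # R1out, R4in
print(regs["R4"])   # 25
```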
An implementation for one bit of register Ri is shown in Figure 1.3 as an example. A
two-input multiplexer is used to select the data applied to the input of an edge-triggered
D flip-flop. When the control input Riin is equal to 1, the multiplexer selects the data
on the bus. This data will be loaded into the flip-flop at the rising edge of the clock.
When Riin is equal to 0, the multiplexer feeds back the value currently stored in the
flip-flop.
Fig.1.2 Input and output gating for the registers in fig 1.1
Fig. 1.3 Input and output gating for one register bit
The Q output of the flip-flop is connected to the bus via a tri-state gate. When Riout
is equal to 0, the gate's output is in the high-impedance (electrically disconnected) state.
This corresponds to the open-circuit state of a switch. When Riout = 1, the gate drives
the bus to 0 or 1, depending on the value of Q.
1.3.2 PERFORMING AN ARITHMETIC OR LOGIC OPERATION
The ALU is a combinational circuit that has no internal storage. It performs arithmetic
and logic operations on the two operands applied to its A and B inputs. In Figures 1.1
and 1.2, one of the operands is the output of the multiplexer MUX and the other operand is
obtained directly from the bus. The result produced by the ALU is stored temporarily in
register Z. Therefore, a sequence of operations to add the contents of register R1 to those
of register R2 and store the result in register R3 is
1. R1out, Yin
2. R2out, SelectY, Add, Zin
3. Zout, R3in, End
The signals whose names are given in any step are activated for the duration of the clock
cycle corresponding to that step. All other signals are inactive. Hence, in step 1, the
output of register R1 and the input of register Y are enabled, causing the contents of R1 to
be transferred over the bus to Y. In step 2, the multiplexer's Select signal is set to SelectY,
causing the multiplexer to gate the contents of register Y to input A of the ALU. At the
same time, the contents of register R2 are gated onto the bus and, hence, to input B. The
function performed by the ALU depends on the signals applied to its control lines. In this
case, the Add line is set to 1, causing the output of the ALU to be the sum of the two
numbers at inputs A and B. This sum is loaded into register Z because its input control
signal is activated. In step 3, the contents of register Z are transferred to the destination
register, R3. This last transfer cannot be carried out during step 2, because only one
register output can be connected to the bus during any clock cycle.
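The three-step sequence can be traced in a short Python sketch. The register values are invented for illustration, and the Y and Z variables stand in for the ALU input and output latches of Figure 1.1.

```python
# Trace of the three control steps that compute R3 <- [R1] + [R2]
# on the single-bus datapath. Only one register drives the bus per step.

regs = {"R1": 10, "R2": 32, "R3": 0, "Y": 0, "Z": 0}

# Step 1: R1out, Yin -- copy R1 into the ALU input latch Y
bus = regs["R1"]
regs["Y"] = bus

# Step 2: R2out, SelectY, Add, Zin -- ALU adds Y (input A) and bus (input B)
bus = regs["R2"]
regs["Z"] = regs["Y"] + bus

# Step 3: Zout, R3in -- transfer the sum to the destination register
bus = regs["Z"]
regs["R3"] = bus

print(regs["R3"])   # 42
```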
In this introductory discussion, we assume that there is a dedicated signal for each
function to be performed. For example, we assume that there are separate control signals
to specify individual ALU operations, such as Add, Subtract, XOR, and so on. In reality,
some degree of encoding is likely to be used. For example, if the ALU can perform eight
different operations, three control signals would suffice to specify the required operation.
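The encoding argument can be checked with a quick calculation. The list of operation names below is hypothetical; only the count matters.

```python
# With 8 distinct ALU operations, ceil(log2(8)) = 3 control bits
# suffice to select any one of them.

ops = ["Add", "Subtract", "AND", "OR", "XOR", "NOT", "SHL", "SHR"]
bits_needed = (len(ops) - 1).bit_length()   # ceil(log2(8)) = 3
encoding = {op: format(i, f"0{bits_needed}b") for i, op in enumerate(ops)}

print(bits_needed)       # 3
print(encoding["XOR"])   # '100'
```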
1.3.3 FETCHING A WORD FROM MEMORY
To fetch a word of information from memory, the processor has to specify the address of
the memory location where this information is stored and request a Read operation. This
applies whether the information to be fetched represents an instruction in a program or an
operand specified by an instruction. The processor transfers the required address to the
MAR, whose output is connected to the address lines of the memory bus. At the same
time, the processor uses the control lines of the memory bus to indicate that a Read
operation is needed. When the requested data are received from the memory they are
stored in register MDR, from where they can be transferred to other registers in the
processor.
The connections for register MDR are illustrated in Figure 1.4. It has four control signals:
MDRin and MDRout control the connection to the internal bus, and MDRinE and
MDRoutE control the connection to the external bus. The circuit in Figure 1.3 is easily
modified to provide the additional connections. A three-input multiplexer can be used,
with the memory bus data line connected to the third input. This input is selected when
MDRinE = 1. A second tri-state gate, controlled by MDRoutE, can be used to connect the
output of the flip-flop to the memory bus.
During memory Read and Write operations, the timing of internal processor operations
must be coordinated with the response of the addressed device on the memory bus. The
processor completes one internal data transfer in one clock cycle. The speed of operation
of the addressed device, on the other hand, varies with the device. We saw in Chapter 5
that modern processors include a cache memory on the same chip as the processor.
Typically, a cache will respond to a memory read request in one clock cycle. However,
when a cache miss occurs, the request is forwarded to the main memory, which
introduces a delay of several clock cycles. A read or write request may also be intended
for a register in a memory-mapped I/O device.
Fig 1.4 Connection and control signals for register MDR
Such I/O registers are not cached, so their accesses always take a number of clock cycles.
To accommodate the variability in response time, the processor waits until it receives an
indication that the requested Read operation has been completed. We will assume that a
control signal called Memory-Function-Completed (MFC) is used for this purpose. The
addressed device sets this signal to 1 to indicate that the contents of the specified
location have been read and are available on the data lines of the memory bus.
As an example of a read operation, consider the instruction Move (R1),R2. The
actions needed to execute this instruction are:
1. MAR ← [R1]
2. Start a Read operation on the memory bus
3. Wait for the MFC response from the memory
4. Load MDR from the memory bus
5. R2 ← [MDR]
These actions may be carried out as separate steps, but some can be combined into
a single step. Each action can be completed in one clock cycle, except action 3 which
requires one or more clock cycles, depending on the speed of the addressed device.
For simplicity, let us assume that the output of MAR is enabled all the time. Thus, the
contents of MAR are always available on the address lines of the memory bus. This
is the case when the processor is the bus master. When a new address is loaded into
MAR, it will appear on the memory bus at the beginning of the next clock cycle, as
shown in Figure 1.5. A Read control signal is activated at the same time MAR is loaded.
This signal will cause the bus interface circuit to send a read command, MR, on the bus.
With this arrangement, we have combined actions 1 and 2 above into a single control
step. Actions 3 and 4 can also be combined by activating control signal MDRinE while
waiting for a response from the memory. Thus, the data received from the memory are
loaded into MDR at the end of the clock cycle in which the MFC signal is received. In
the next clock cycle, MDRout is activated to transfer the data to register R2. This means
that the memory read operation requires three steps, which can be described by the
signals being activated as follows:
1. R1out, MARin, Read
2. MDRinE, WMFC
3. MDRout, R2in
where WMFC is the control signal that causes the processor‘s control circuitry to wait
for the arrival of the MFC signal. Figure 1.5 shows that MDRinE is set to 1 for exactly the
same period as the read command, MR. Hence, in subsequent discussion, we will not
specify the value of MDRinE explicitly, with the understanding that it is always equal to
MR.
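The three-step read can be sketched in Python. The addresses, values, and latency are illustrative assumptions, and WMFC is modeled simply as a count of wait cycles rather than a real handshake.

```python
# Toy model of Move (R1),R2: load MAR, issue Read, wait for MFC
# (modeled as `latency` extra cycles), latch MDR, transfer to R2.

def memory_read(regs, memory, latency):
    """Mutates `regs`; returns the total clock cycles consumed."""
    regs["MAR"] = regs["R1"]            # step 1: R1out, MARin, Read
    cycles = 2 + latency                # steps 1 and 3, plus the WMFC wait
    regs["MDR"] = memory[regs["MAR"]]   # step 2: MDRinE latches when MFC = 1
    regs["R2"] = regs["MDR"]            # step 3: MDRout, R2in
    return cycles

regs = {"R1": 0x1000, "R2": 0, "MAR": 0, "MDR": 0}
mem = {0x1000: 99}
print(memory_read(regs, mem, latency=1), regs["R2"])   # 3 99
```

With a one-cycle (cache-like) response the whole read takes three cycles; a slower device stretches only the WMFC step.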
Fig.1.5 Timing of a memory Read operation.
1.3.4 STORING A WORD IN MEMORY
Writing a word into a memory location follows a similar procedure. The desired address
is loaded into MAR. Then, the data to be written are loaded into MDR, and a Write
command is issued. Hence, executing the instruction Move R2,(R1) requires the following
sequence:
1. R1out, MARin
2. R2out, MDRin, Write
3. MDRoutE, WMFC
As in the case of the read operation, the Write control signal causes the memory bus
interface hardware to issue a Write command on the memory bus. The processor remains
in step 3 until the memory operation is completed and an MFC response is received.
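The write sequence can be sketched the same way; the addresses and values below are illustrative, and the MFC wait is not modeled explicitly.

```python
# Toy model of Move R2,(R1): load MAR with the destination address,
# load MDR with the data, then drive the memory bus until MFC arrives.

def memory_write(regs, memory):
    regs["MAR"] = regs["R1"]            # step 1: R1out, MARin
    regs["MDR"] = regs["R2"]            # step 2: R2out, MDRin, Write
    memory[regs["MAR"]] = regs["MDR"]   # step 3: MDRoutE, WMFC
    return memory

regs = {"R1": 0x2000, "R2": 7, "MAR": 0, "MDR": 0}
mem = {}
memory_write(regs, mem)
print(mem[0x2000])   # 7
```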
1.4 MULTIPLE-BUS ORGANIZATION
We used the simple single-bus structure of Figure 1.1 to illustrate the basic ideas. The
resulting control sequences in Figures 1.6 and 1.7 are quite long because only one data
item can be transferred over the bus in a clock cycle. To reduce the number of steps
needed, most commercial processors provide multiple internal paths that enable several
transfers to take place in parallel.
Fig.1.8 Three-bus organization of the data path.
Figure 1.8 depicts a three-bus structure used to connect the registers and the ALU of a
processor. All general-purpose registers are combined into a single block called the
register file. The register file in Figure 1.8 is said to have three ports. There are two
outputs, allowing the contents of two different registers to be accessed simultaneously
and have their contents placed on buses A and B. The third port allows the data on bus C
to be loaded into a third register during the same clock cycle. Buses A and B are used to
transfer the source operands to the A and B inputs of the ALU, where an arithmetic or
logic operation may be performed. The result is transferred to the destination over bus C.
If needed, the ALU may simply pass one of its two input operands unmodified to bus C.
We will call the ALU control signals for such an operation R=A or R=B. The three-bus
arrangement obviates the need for registers Y and Z in Figure 1.1.
A second feature in Figure 1.8 is the introduction of the Incrementer unit, which is used
to increment the PC by 4. Using the Incrementer eliminates the need to add 4 to the PC
using the main ALU, as was done in Figures 1.6 and 1.7. The source for the constant 4 at
the ALU input multiplexer is still useful. It can be used to increment other addresses,
such as the memory addresses in LoadMultiple and StoreMultiple instructions.
Fig. 1.9 Control sequence for the instruction Add R4,R5,R6 for the three-bus
organization in Fig1.8
Consider the three-operand instruction
Add R4,R5,R6
The control sequence for executing this instruction is given in Figure 1.9. In step 1, the
contents of the PC are passed through the ALU, using the R=B control signal, and
loaded into the MAR to start a memory read operation. At the same time the PC is
incremented by 4. Note that the value loaded into MAR is the original contents of the PC.
The incremented value is loaded into the PC at the end of the clock cycle and will not
affect the contents of MAR. In step 2, the processor waits for MFC and loads the data
received into MDR, then transfers them to IR in step 3. Finally, the execution phase of
the instruction requires only one control step to complete, step 4. By providing more
paths for data transfer a significant reduction in the number of clock cycles needed to
execute an instruction is achieved.
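The single execution step for the three-operand add can be modeled as one call on a three-ported register file. This is an illustrative sketch; the register names follow the example above.

```python
# Three-bus datapath: ports A and B read two source registers in the
# same cycle, the ALU operates, and bus C writes the destination --
# the whole Add executes in one control step.

def three_bus_add(regfile, src1, src2, dst):
    bus_a = regfile[src1]          # output port A -> bus A -> ALU input A
    bus_b = regfile[src2]          # output port B -> bus B -> ALU input B
    regfile[dst] = bus_a + bus_b   # ALU result -> bus C -> input port
    return regfile

rf = {"R4": 5, "R5": 6, "R6": 0}
three_bus_add(rf, "R4", "R5", "R6")   # Add R4,R5,R6
print(rf["R6"])   # 11
```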
1.5 EXECUTION OF A COMPLETE INSTRUCTION
Let us now put together the sequence of elementary operations required to execute one
instruction. Consider the instruction
Add (R3),R1
which adds the contents of a memory location pointed to by R3 to register R1. Executing
this instruction requires the following actions:
1. Fetch the instruction.
2. Fetch the first operand (the contents of the memory location pointed to by R3).
3. Perform the addition.
4. Load the result into R1.
Figure 1.6 gives the sequence of control steps required to perform these operations for the
single-bus architecture of Figure 1.1. Instruction execution proceeds as follows. In step 1,
the instruction fetch operation is initiated by loading the contents of the PC into the MAR
and sending a Read request to the memory. The Select signal is set to Select4, which
causes the multiplexer MUX to select the constant 4. This value is added to the operand
at input B, which is the contents of the PC, and the result is stored in register Z. The
updated value is moved from register Z back into the PC during step 2, while waiting for
the memory to respond. In step 3, the word fetched from the memory is loaded into the
IR.
Steps 1 through 3 constitute the instruction fetch phase, which is the same for all
instructions. The instruction decoding circuit interprets the contents of the IR at the
beginning of step 4. This enables the control circuitry to activate the control signals for
steps 4 through 7, which constitute the execution phase. The contents of register R3 are
transferred to the MAR in step 4, and a memory read operation is initiated.
Fig. 1.6. Control signals for the execution of the instruction Add (R3),R1.
Then the contents of R1 are transferred to register Y in step 5, to prepare for the addition
operation. When the Read operation is completed, the memory operand is available in
register MDR, and the addition operation is performed in step 6. The contents of MDR
are gated to the bus, and thus also to the B input of the ALU, and register Y is selected as
the second input to the ALU by choosing SelectY. The sum is stored in register Z, then
transferred to R1 in step 7. The End signal causes a new instruction fetch cycle to begin
by returning to step 1.
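The seven control steps of Figure 1.6 can be traced in a small Python model. The addresses, operand values, and dictionary-based "memory" are assumptions made for the example; memory latency is folded into the WMFC steps.

```python
# Trace of Add (R3),R1 on the single-bus datapath.
# Steps 1-3: fetch phase; steps 4-7: execution phase.

def execute_add_indirect(regs, memory):
    # Step 1: PCout, MARin, Read, Select4, Add, Zin
    regs["MAR"] = regs["PC"]
    regs["Z"] = regs["PC"] + 4
    # Step 2: Zout, PCin, Yin, WMFC -- updated PC also copied into Y
    regs["PC"] = regs["Y"] = regs["Z"]
    # Step 3: MDRout, IRin -- fetched word becomes the instruction
    regs["IR"] = memory[regs["MAR"]]
    # Step 4: R3out, MARin, Read -- start fetching the memory operand
    regs["MAR"] = regs["R3"]
    # Step 5: R1out, Yin, WMFC
    regs["Y"] = regs["R1"]
    regs["MDR"] = memory[regs["MAR"]]
    # Step 6: MDRout, SelectY, Add, Zin
    regs["Z"] = regs["Y"] + regs["MDR"]
    # Step 7: Zout, R1in, End
    regs["R1"] = regs["Z"]
    return regs

regs = {"PC": 0, "R1": 30, "R3": 0x100, "Y": 0, "Z": 0,
        "MAR": 0, "MDR": 0, "IR": None}
mem = {0: "Add (R3),R1", 0x100: 12}
execute_add_indirect(regs, mem)
print(regs["R1"], regs["PC"])   # 42 4
```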
This discussion accounts for all control signals in Figure 1.6 except Yin in step 2. There is
no need to copy the updated contents of PC into register Y when executing the Add
instruction. But, in Branch instructions the updated value of the PC is needed to compute
the Branch target address. To speed up the execution of Branch instructions, this value is
copied into register Y in step 2. Since step 2 is part of the fetch phase, the same action
will be performed for all instructions. This does not cause any harm because register Y is
not used for any other purpose at that time.
Branch Instruction
A branch instruction replaces the contents of the PC with the branch target address. This
address is usually obtained by adding an offset X, which is given in the branch
instruction, to the updated value of the PC. Figure 1.7 gives a control sequence that
implements an unconditional branch instruction. Processing starts, as usual, with the fetch
phase. This phase ends when the instruction is loaded into the IR in step 3. The offset
value is extracted from the IR by the instruction decoding circuit, which will also perform
sign extension if required. Since the value of the updated PC is already available in
register Y, the offset X is gated onto the bus in step 4, and an addition operation is
performed. The result, which is the branch target address, is loaded into the PC in step 5.
The offset X used in a branch instruction is usually the difference between the branch
target address and the address immediately following the branch instruction.
Fig.1.7 Control sequence for an unconditional branch instruction.
For example, if the branch instruction is at location 2000 and if the branch target address
is 2050, the value of X must be 46. The reason for this can be readily appreciated from
the control sequence in Figure 1.7. The PC is incremented during the fetch phase, before
knowing the type of instruction being executed. Thus, when the branch address is
computed in step 4, the PC value used is the updated value, which points to the
instruction following the branch instruction in the memory.
Consider now a conditional branch. In this case, we need to check the status of the
condition codes before loading a new value into the PC. For example, for a Branch-on-
negative (Branch<0) instruction, step 4 in Figure 1.7 is replaced with
Offset-field-of-IRout, Add, Zin, If N = 0 then End
Thus, if N = 0 the processor returns to step 1 immediately after step 4. If N = 1, step 5 is
performed to load a new value into the PC, thus performing the branch operation.
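The offset arithmetic and the N-flag check can be verified with a short sketch. The function name and flag parameter are illustrative; 4-byte instructions are assumed, as in the example above.

```python
# The offset X is taken relative to the *updated* PC: the PC is
# incremented during the fetch phase, before the instruction type
# is known, so X = target - (branch_address + 4).

def branch(pc, offset, n_flag, conditional=True):
    updated_pc = pc + 4                # PC already incremented in fetch
    if conditional and n_flag == 0:
        return updated_pc              # condition false: fall through
    return updated_pc + offset         # step 5: PC <- [PC] + X

print(2050 - (2000 + 4))              # 46: the offset stored for a
                                      # branch at 2000 targeting 2050
print(branch(2000, 46, n_flag=1))     # 2050: branch taken
print(branch(2000, 46, n_flag=0))     # 2004: branch not taken
```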
1.6 Interconnection structures
A computer consists of a set of components or modules (processor, memory, I/O) that
communicate with each other. A computer is a network of modules. There must be paths
for connecting these modules. The collection of paths connecting the various modules is
called the interconnection structure.
• Memory
o Consists of N words of equal length
o Each word assigned a unique numerical address (0, 1, …, N-1)
o A word of data can be read or written
o Operation specified by control signals
o Location specified by address signals
• I/O Module
o Similar to memory from the computer's viewpoint
o Consists of M external device ports (0, 1, …, M-1)
o External data paths for input and output
o Sends interrupt signal to the processor
• Processor
o Reads in instructions and data
o Writes out data after processing
o Uses control signals to control overall operation of the system
o Receives interrupt signals
The preceding list defines the data to be exchanged. The interconnection structure must
support the following types of transfers:
• Memory to processor: processor reads an instruction or a unit of data from memory.
• Processor to memory: processor writes a unit of data to memory.
• I/O to processor: processor reads data from an I/O device via an I/O module.
• Processor to I/O: processor sends data to the I/O device via an I/O module.
• I/O to or from memory: an I/O module is allowed to exchange data directly with
memory, without going through the processor, using direct memory access (DMA).
Over the years, a number of interconnection structures have been tried. By far the most
common is the bus and various multiple-bus structures.
Bus Interconnection
A bus is a communication pathway connecting two or more devices. Multiple devices can
be connected to the same bus at the same time. Typically, a bus consists of multiple
communication pathways, or lines. Each line is capable of transmitting signals
representing binary 1 or binary 0. A bus that connects major computer components
(processor, memory, I/O) is called a system bus.
Bus Structure
Typically, a bus consists of 50 to hundreds of separate lines. On any bus the lines are
grouped into three main function groups: data, address, and control. There may also be
power distribution lines for attached modules.
• Data lines
o Path for moving data and instructions between modules.
o Collectively are called the data bus.
o Width of 8, 16, 32, 64, or more bits – a key factor in overall system performance
• Address lines
o Identifies the source or destination of the data on the data bus, e.g., when the
CPU needs to read an instruction or data from a given memory location.
o Bus width determines the maximum possible memory capacity for the system.
For example, the 8080 has 16-bit addresses, giving access to 64K addresses.
• Control lines
o Used to control the access to and the use of the data and address lines.
o Transmits command and timing information between modules.
Typical control lines include the following:
• Memory write: causes data on the bus to be written to the addressed memory location.
• Memory read: causes data from the addressed memory location to be placed on the bus.
• I/O write: causes data on the bus to be output to the addressed I/O port.
• I/O read: causes data from the addressed I/O port to be placed on the bus.
• Transfer ACK: indicates that data have been accepted from or placed on the bus.
• Bus request: indicates that a module needs to gain control of the bus.
• Bus grant: indicates that a requesting module has been granted control of the bus.
• Interrupt request: indicates that an interrupt is pending.
• Interrupt ACK: indicates that the pending interrupt has been recognized.
• Clock: used to synchronize operations.
• Reset: initializes all modules.
What does a bus look like?
• Parallel lines on a circuit board.
• Ribbon cables.
• Strip connectors of a circuit board.
o PCI, AGP, PCI Express, SCSI, etc…
• Sets of wires.
1.7 Layered view of a computer system.
PIPELINING
Pipelining is used in modern computers to achieve high performance. We begin by
explaining the basics of pipelining and how it can lead to improved performance. Then
we examine machine instruction features that facilitate pipelined execution, and we show
that the choice of instructions and instruction sequencing can have a significant effect on
performance. Pipelined organization requires sophisticated compilation techniques, and
optimizing compilers have been developed for this purpose. Among other things, such
compilers rearrange the sequence of operations to maximize the benefits of pipelined
execution.
BASIC CONCEPTS
The speed of execution of programs is influenced by many factors. One way to improve
performance is to use faster circuit technology to build the processor and the main
memory. Another possibility is to arrange the hardware so that more than one operation
can be performed at the same time. In this way, the number of operations performed per
second is increased even though the elapsed time needed to perform any one operation is
not changed.
We have encountered concurrent activities several times before. For example, in
multiprogramming, DMA devices make I/O transfers and simultaneous computational
activity possible, because they can perform I/O transfers independently once these
transfers are initiated by the processor.
Pipelining is a particularly effective way of organizing concurrent activity in a computer
system. The basic idea is very simple. It is frequently encountered in manufacturing
plants, where pipelining is commonly known as an assembly-line operation. Readers are
undoubtedly familiar with the assembly line used in car manufacturing. The first station
in an assembly line may prepare the chassis of a car, the next station adds the body, the
next one installs the engine, and so on. While one group of workers is installing the
engine on one car, another group is fitting a car body on the chassis of another car, and
yet another group is preparing a new chassis for a third car. It may take days to complete
work on a given car, but it is possible to have a new car rolling off the end of the
assembly line every few minutes.
Consider how the idea of pipelining can be used in a computer. The processor executes a
program by fetching and executing instructions, one after the other. Let Fi and Ei refer to
the fetch and execute steps for instruction Ii. Execution of a program consists of a
sequence of fetch and execute steps, as shown in Fig. 1.10a.
Now consider a computer that has two separate hardware units, one for fetching
instructions and another for executing them, as shown in Figure 1.10b. The instruction
fetched by the fetch unit is deposited in an intermediate storage buffer, B1. This buffer is
needed to enable the execution unit to execute the instruction while the fetch unit is
fetching the next instruction. The results of execution are deposited in the destination
location specified by the instruction. For the purposes of this discussion, we assume that
both the source and the destination of the data operated on by the instructions are inside
the block labeled "Execution unit."
The computer is controlled by a clock whose period is such that the fetch and execute
steps of any instruction can each be completed in one clock cycle. Operation of the
computer proceeds as in Figure 1.10c. In the first clock cycle, the fetch unit fetches an
instruction I1 (step F1) and stores it in buffer B1 at the end of the clock cycle. In the
second clock cycle, the instruction fetch unit proceeds with the fetch operation for
instruction I2 (step F2). Meanwhile, the execution unit performs the operation specified
by instruction I1, which is available to it in buffer B1 (step E1). By the end of the second
clock cycle, the execution of instruction I1 is completed and instruction I2 is available.
Instruction I2 is stored in B1, replacing I1, which is no longer needed. Step E2 is
performed by the execution unit during the third clock cycle, while instruction I3 is being
fetched by the fetch unit. In this manner, both the fetch and execute units are kept busy
all the time. If the pattern in Figure 1.10c can be sustained for a long time, the completion
rate of instruction execution will be twice that achievable by the sequential operation
depicted in Figure 1.10a.
In summary, the fetch and execute units in Figure 1.10b constitute a two-stage pipeline in
which each stage performs one step in processing an instruction. An inter-stage storage
buffer, B1, is needed to hold the information being passed from one stage to the next.
New information is loaded into this buffer at the end of each clock cycle.
Fig. 1.10 Basic idea of instruction pipelining
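The timing just described can be sketched in Python (an illustration of mine, not part of the original notes): in every cycle the fetch unit works on the next instruction while the execute unit finishes the previous one, so one instruction completes per cycle once the pipeline is full.

```python
# A sketch of the two-stage fetch/execute overlap of Fig. 1.10:
# in cycle k the fetch unit fetches Ik while the execute unit
# works on I(k-1).  Function name is illustrative.
def two_stage_schedule(n_instructions):
    """Return (cycle, fetching, executing) tuples for n instructions."""
    schedule = []
    for cycle in range(1, n_instructions + 2):
        fetching = f"F{cycle}" if cycle <= n_instructions else "-"
        executing = f"E{cycle - 1}" if cycle >= 2 else "-"
        schedule.append((cycle, fetching, executing))
    return schedule

for cycle, f, e in two_stage_schedule(4):
    print(f"cycle {cycle}: fetch={f} execute={e}")
```

Four instructions finish in five cycles instead of eight, which is the factor-of-two speedup mentioned above.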
The processing of an instruction need not be divided into only two steps. For example, a
pipelined processor may process each instruction in four steps, as follows:
F Fetch: read the instruction from the memory.
D Decode: decode the instruction and fetch the source operand(s).
E Execute: perform the operation specified by the instruction.
W Write: store the result in the destination location.
Fig 1.11 A 4-stage pipelining.
The sequence of events for this case is shown in Figure 1.11a. Four instructions are in
progress at any given time. This means that four distinct hardware units are needed, as
shown in Figure 1.11b. These units must be capable of performing their tasks
simultaneously and without interfering with one another. Information is passed from one
unit to the next through a storage buffer. As an instruction progresses through the
pipeline, all the information needed by the stages downstream must be passed along. For
example, during clock cycle 4, the information in the buffers is as follows:
• Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by
the instruction-decoding unit.
• Buffer B2 holds both the source operands for instruction I2 and the specification of the
operation to be performed. This is the information produced by the decoding hardware in
cycle 3. The buffer also holds the information needed for the write step of instruction I2
(step W2). Even though it is not needed by stage E, this information must be passed on to
stage W in the following clock cycle to enable that stage to perform the required Write
operation.
• Buffer B3 holds the results produced by the execution unit and the destination
information for instruction I1.
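The buffer occupancy described above follows a simple pattern: instruction Ii reaches pipeline stage s in clock cycle i + s. A small Python sketch (the helper name is mine, not from the notes):

```python
# A sketch of the 4-stage F/D/E/W pipeline of Fig. 1.11:
# instruction Ii occupies stage s (0-based) during cycle i + s.
STAGES = ["F", "D", "E", "W"]

def stage_contents(cycle, n_instructions):
    """Map each stage to the instruction it processes in a given cycle."""
    contents = {}
    for s, name in enumerate(STAGES):   # s = 0 for F, 3 for W
        i = cycle - s                   # instruction number in this stage
        contents[name] = f"I{i}" if 1 <= i <= n_instructions else "-"
    return contents

print(stage_contents(4, 8))   # {'F': 'I4', 'D': 'I3', 'E': 'I2', 'W': 'I1'}
```

During cycle 4 the output matches the description above: I3 is being decoded, I2 executed, and I1 written back.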
PIPELINE PERFORMANCE
The pipelined processor in Figure 1.11 completes the processing of one instruction in
each clock cycle, which means that the rate of instruction processing is four times that of
sequential operation. The potential increase in performance resulting from pipelining is
proportional to the number of pipeline stages. However, this increase would be achieved
only if pipelined operation as depicted in Figure 1.11a could be sustained without
interruption throughout program execution. Unfortunately, this is not the case.
For a variety of reasons, one of the pipeline stages may not be able to complete its
processing task for a given instruction in the time allotted. For example, stage E in the
four-stage pipeline of Figure 1.11b is responsible for arithmetic and logic operations, and
one clock cycle is assigned for this task. Although this may be sufficient for most
operations, some operations, such as divide, may require more time to complete. Figure
1.12 shows an example in which the operation specified in instruction I2 requires three
cycles to complete, from cycle 4 through cycle 6. Thus, in cycles 5 and 6, the Write stage
must be told to do nothing, because it has no data to work with. Meanwhile, the
information in buffer B2 must remain intact until the Execute stage has completed its
operation. This means that stage 2 and, in turn, stage I are blocked from accepting new
instructions because the information in B 1 cannot be overwritten. Thus, steps D4 and F5
must be postponed as shown.
Pipelined operation in Figure 1.12 is said to have been stalled for two clock cycles.
Normal pipelined operation resumes in cycle 7. Any condition that causes the pipeline to
stall is called a hazard. We have just seen an example of a data hazard. A data hazard is
any condition in which either the source or the destination operands of an instruction are
not available at the time expected in the pipeline. As a result some operation has to be
delayed, and the pipeline stalls.
Fig. 1.12 Effect of an execution operation taking more than one clock cycle.
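The stall can be reproduced with a small model. The sketch below (my own simplification, assuming a 4-stage pipeline with a configurable Execute latency) shows I2's three-cycle Execute step pushing every later completion back by two cycles:

```python
# A sketch of the stall in Fig. 1.12: Execute normally takes 1 cycle,
# but a longer E step (e.g. a divide) delays all later instructions.
def completion_cycle(i, execute_cycles):
    """Cycle in which instruction Ii finishes its Write step.

    execute_cycles maps an instruction number to its E-stage latency
    (default 1); stalls propagate to all following instructions.
    """
    finish_e = 0
    for k in range(1, i + 1):
        start_e = max(k + 2, finish_e + 1)     # E cannot start before D is done
        finish_e = start_e + execute_cycles.get(k, 1) - 1
    return finish_e + 1                        # W takes one more cycle

print(completion_cycle(4, {2: 3}), completion_cycle(4, {}))   # 9 vs. 7
```

With I2's Execute lasting cycles 4 to 6, W2 happens in cycle 7 and I4 completes in cycle 9 instead of 7 — a two-cycle stall, as in the figure.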
CPU-ARITHMETIC
Binary number system
The number system followed by computers
Base is two and any number is represented as an array containing 1's and 0's
representing coefficients of powers of two.
Used in computer systems because of the ease of representing 1 and 0 as two
levels of voltage/power – high and low.
To represent the decimal system directly, ten voltage levels would be required,
with correspondingly complex hardware.
Binary arithmetic
• Addition
• Subtraction
• Multiplication
• Division
Binary addition
• Four basic rules for elementary addition
• 0 + 0 = 0 ; 0 + 1 = 1; 1 + 0 = 1; 1 + 1 = 10;
• Carry-overs are performed in the same manner as in decimal addition
• 11001 + 1001 = ?
• How to add multiple (more than two) binary numbers?
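To answer the question above: multiple binary numbers are added two at a time, reusing the same four rules. A Python sketch of column-by-column addition with carry-over (the helper name `add_binary` is mine):

```python
# A sketch applying the four addition rules bit by bit,
# carrying over exactly as in decimal addition.
def add_binary(x, y):
    """Add two binary strings the long way, column by column."""
    width = max(len(x), len(y))
    x, y = x.zfill(width), y.zfill(width)
    carry, digits = 0, []
    for xi, yi in zip(reversed(x), reversed(y)):
        s = int(xi) + int(yi) + carry     # 0+0, 0+1, 1+0, or 1+1
        digits.append(str(s % 2))
        carry = s // 2                    # carry-over to the next column
    if carry:
        digits.append("1")
    return "".join(reversed(digits))

print(add_binary("11001", "1001"))   # 100010
```

More than two numbers are summed by repeated pairwise addition, e.g. `add_binary(add_binary(a, b), c)`.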
Binary subtraction
• Four rules for elementary subtraction
• 0 – 0 = 0; 1 – 0 = 1; 1 – 1 = 0;
• 0 – 1 = 1, but with a borrow of 1 from the next column of minuend
• 1101 – 1100 = 0001
• 1100 - 1001 =?
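A matching sketch for subtraction with borrows (again my own illustration, not from the notes):

```python
# A sketch of the borrow rule: subtract column by column, borrowing
# 1 from the next column of the minuend whenever the digit is 0 - 1.
def sub_binary(x, y):
    """Compute x - y for binary strings with x >= y."""
    width = max(len(x), len(y))
    x, y = x.zfill(width), y.zfill(width)
    borrow, digits = 0, []
    for xi, yi in zip(reversed(x), reversed(y)):
        d = int(xi) - int(yi) - borrow
        borrow = 1 if d < 0 else 0       # borrow from the next column
        digits.append(str(d % 2))
    return "".join(reversed(digits)).lstrip("0") or "0"

print(sub_binary("1100", "1001"))   # 11
```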
Signed binary numbers
• Like in decimal, we need to represent negative
numbers in binary system too
The decimal system uses a '-' sign, but computers understand only 1's and 0's.
A solution is to add a digit (sign bit) to represent the sign – this approach is called
Signed Magnitude Representation:
0 marks positive and 1 marks negative.
Problem! Have to specify the number of bits in the number to avoid
misinterpretation
Complements
With signed magnitude, representing a number is simple but arithmetic is complex.
It would be nice to have a representation in which addition is simple and other
operations can be done using addition:
multiplication is repeated addition; division is repeated subtraction.
Using complements we can perform subtraction using addition.
1’s complement
• Formed by complementing (changing 1 to 0 and 0 to 1) each bit in the number.
• The most significant bit tells us the sign of the number.
9 => 01001
-9 => 10110
Subtraction using 1’s complement
• To subtract, add the 1's complement of the subtrahend.
• If there is an overflow bit, add it to the remaining part (end-around carry).
0111 – 0101 => 0111 + 1010

  0111
+ 1010
------
1 0001    (overflow bit)

Add the overflow bit back to the remaining part (end-around carry):

  0001
+    1
------
  0010
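The same end-around-carry procedure in Python (the helper `ones_complement_sub` and its 4-bit default width are assumptions for illustration):

```python
# A sketch of 1's-complement subtraction with the end-around carry,
# using 4-bit words as in the 0111 - 0101 example.
def ones_complement_sub(x, y, bits=4):
    """Compute x - y (non-negative ints, x > y) via 1's complement."""
    mask = (1 << bits) - 1
    total = x + (~y & mask)          # add the 1's complement of y
    if total > mask:                 # an overflow bit appeared:
        total = (total & mask) + 1   # end-around carry
    return total & mask

print(format(ones_complement_sub(0b0111, 0b0101), "04b"))   # 0010
```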
2’s complement
• Add 1 to the 1's complement form
Addition
1. Represent both operands in signed-2's complement format (if operand X > 0, keep
its original binary form; if operand X < 0, take the 2's complement of X: 2^n - X).
2. Add the operands; discard any carry-out of the sign bit (MSB).
3. The result is automatically in signed-2's complement form.
Example (n = 6 bits):
 6 = 000110    -6 = 111010
 9 = 001001    -9 = 110111
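The three steps can be checked in Python; `to_twos` and `twos_add` below are illustrative names of mine, using the n = 6 values from the example:

```python
# A sketch of signed-2's-complement addition with n = 6 bits.
def to_twos(x, n=6):
    """Signed value -> n-bit 2's-complement pattern (2**n - |x| if x < 0)."""
    return x & ((1 << n) - 1)

def twos_add(x, y, n=6):
    """Add two signed values; the carry out of the MSB is discarded."""
    s = (to_twos(x, n) + to_twos(y, n)) & ((1 << n) - 1)
    return s - (1 << n) if s >= (1 << (n - 1)) else s   # decode sign bit

print(format(to_twos(-9), "06b"))   # 110111
print(twos_add(6, -9))              # -3
```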
ADDITION/SUBTRACTION
The 1 in the 7th bit is automatically dropped.
The MSB of the result is 1, indicating it is a negative result represented in signed 2's
complement form. Its value can be found by taking the 2's complement.
Subtraction
1. Represent both operands in signed-2's complement format.
2. Take 2's complement of the subtrahend B (which may be in complement form
already if it is negative).
3. Add it to the minuend A.
4. The result is automatically in signed-2's complement form.
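A worked sketch of these four steps in Python (my own example values; the helper `twos_sub` is illustrative):

```python
# A sketch of the four subtraction steps: take the 2's complement of
# the subtrahend and add it to the minuend, inside n-bit arithmetic.
def twos_sub(a, b, n=6):
    """Compute a - b by adding the 2's complement of b to a."""
    mask = (1 << n) - 1
    neg_b = (~b + 1) & mask              # step 2: 2's complement of B
    s = ((a & mask) + neg_b) & mask      # step 3: add; drop the carry-out
    return s - (1 << n) if s >= (1 << (n - 1)) else s

print(twos_sub(9, 6))    # 3
print(twos_sub(6, 9))    # -3
```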
Why does it work?
Consider the following three cases (where A > 0, B > 0):
1. A + B: this is a normal binary addition with a positive sum.
2. A - B (or A + (-B)): the negative value -B is represented by its 2's complement,
2^n - B, and
A + (2^n - B) = 2^n + (A - B)
If A - B > 0, the result is A - B in binary form, with the 2^n (the carry out of the
MSB) automatically dropped.
If A - B < 0, the result, 2^n - (B - A), is the 2's complement representation of the
negative value A - B.
3. -A - B: both negative values -A and -B are represented in 2's complement form as
2^n - A and 2^n - B, and
(2^n - A) + (2^n - B) = 2^n + (2^n - (A + B))
The first 2^n is automatically dropped, and the second term is the 2's complement
representation of the negative value -(A + B).
We see that signed 2's complement representation can properly deal with both addition
and subtraction with negative operands as well as positive ones.
Example (n = 4 bits):
Adding two positive operands can yield a result whose MSB is 1, indicating a negative
result represented in signed 2's complement form – a wrong result!
Likewise, adding two negative operands can produce a carry into the 5th bit, which is
dropped, leaving a result that looks positive – another wrong result!
The wrong results are caused by the overflow problem. Given n = 4 bits, the range of
valid values representable is -2^3 = -8 to 2^3 - 1 = +7.
The overflow problem can be detected by checking whether the carry-in Cin to and carry-
out Cout from the MSB are the same. Consider the sign bit of the following six cases of
addition:
It is obvious that when Cin ≠ Cout, the result is incorrect due to overflow.
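The Cin/Cout rule can be verified with a bit-serial addition sketch (an illustration of mine; the function name is hypothetical):

```python
# A sketch checking the overflow rule: add bit by bit and compare the
# carry into the MSB (Cin) with the carry out of it (Cout).
def add_with_overflow_check(a, b, n=4):
    """Return (n-bit sum pattern, overflow flag); overflow iff Cin != Cout."""
    carry = 0
    result = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai + bi + carry
        result |= (s & 1) << i
        cin, carry = carry, s >> 1   # carry into / out of this stage
    return result, cin != carry      # after the loop: MSB's Cin vs. Cout

print(add_with_overflow_check(0b0111, 0b0001))   # (8, True): 7 + 1 overflows
```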
Hardware Implementation: An n-bit adder can be built by concatenating n full
adders:
This n-bit adder can also carry out subtraction A - B as well as addition A + B.
A control signal is used to control a 2x1 MUX to select either B (for addition, when the
control is 0) or its bit-wise complement (for subtraction, when the control is 1). The
subtraction is carried out by adding the 2's complement of operand B to A. (Recall that
the 2's complement can be obtained by bit-wise complement and adding 1 to the LSB;
the control signal also supplies this 1 as the carry-in.)
ADDITION AND SUBTRACTION OF SIGNED NUMBERS
Figure shows the logic truth table for the sum and carry-out functions for adding equally
weighted bits xi and yi in two numbers X and Y. The figure also shows logic expressions
for these functions, along with an example of addition of the 4-bit unsigned numbers 7
and 6. Note that each stage of the addition process must accommodate a carry-in bit.
Logic specification for a stage of binary addition
We use ci to represent the carry-in to the ith stage, which is the same as the carry-out
from the (i – 1)st stage. The logic expression for si in the above figure can be
implemented with a 3-input XOR gate, used in the following figure (a) as part of the
logic required for a single stage of binary addition.
stage of binary addition.
Logic for addition of binary vectors
The carry-out function, ci+1, is implemented with a two-level AND-OR logic circuit. A
convenient symbol for the complete circuit for a single stage of addition, called a full
adder (FA), is also shown in the figure. A cascaded connection of n full adder blocks, as
shown in Figure b, can be used to add two n-bit numbers. Since the carries must
propagate, or ripple, through this cascade, the configuration is called an n-bit ripple-carry
adder.
The carry-in, c0, into the least-significant-bit (LSB) position provides a convenient
means of adding 1 to a number. For instance, forming the 2's-complement of a number
involves adding 1 to the 1's-complement of the number. The carry signals are also useful
for interconnecting k adders to form an adder capable of handling input numbers that are
kn bits long, as shown in Figure c.
The n-bit adder in Figure b can be used to add 2's-complement numbers X and Y, where
the xn-1 and yn-1 bits are the sign bits. In this case, the carry-out bit cn is not part
of the answer. Overflow can only occur when the signs of the two operands are the same.
In this case, overflow obviously occurs if the sign of the result is different. Therefore, a
circuit to detect overflow can be added to the n-bit adder by implementing the logic
expression
Overflow = xn-1 yn-1 sn-1' + xn-1' yn-1' sn-1
It can also be shown that overflow occurs when the carry bits cn and cn-1 are different.
Therefore, a simpler alternative circuit for detecting overflow can be obtained by
implementing the expression cn XOR cn-1 with an XOR gate.
In order to perform the subtraction operation X - Y on 2's-complement numbers X and
Y, we form the 2's-complement of Y and add it to X. The logic circuit network shown in
the following figure can be used to perform either addition or subtraction based on the
value applied to the Add/Sub input control line. This line is set to 0 for addition, applying
the Y vector unchanged to one of the adder inputs along with a carry-in signal, c0, of 0.
When the Add/Sub control line is set to 1, the Y vector is 1's-complemented (that is, bit
complemented) by the XOR gates and c0 is set to 1 to complete the 2's-complementation
of Y. Remember that 2's-complementing a negative number is done in exactly the same
manner as for a positive number. An XOR gate can be added to the following figure to
detect the overflow condition cn XOR cn-1.
Binary addition subtraction logic network
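In this network the XOR gates conditionally complement Y and the control line doubles as the carry-in c0. A behavioral sketch (illustrative only, not the actual gate-level circuit):

```python
# A sketch of the Add/Sub network: each yi is XORed with the control
# line, which also feeds c0, so subtraction = addition of the 2's
# complement of Y.
def add_sub(x, y, add_sub_control, n=8):
    """add_sub_control = 0 for X + Y, 1 for X - Y (2's-complement operands)."""
    mask = (1 << n) - 1
    y_in = (y ^ (mask if add_sub_control else 0)) & mask   # the XOR gates
    c0 = add_sub_control                       # carry-in completes the 2's comp
    s = ((x & mask) + y_in + c0) & mask
    return s - (1 << n) if s >= (1 << (n - 1)) else s

print(add_sub(5, 3, 1))   # 2
print(add_sub(3, 5, 1))   # -2
```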
Binary Coded Decimal (BCD)
Introduction:
Binary is the most efficient storage scheme: every bit pattern represents a unique, valid
value. However, for some applications it may not be desirable to work with binary data.
For instance, the internal components of digital clocks keep track of the time in binary,
and the binary value must be converted to decimal before it can be displayed.
For a digital clock it is preferable to store the value as a series of decimal digits, where
each digit is separately represented by its binary equivalent. The most common format
used to represent decimal data this way is called binary coded decimal, or BCD.
BCD Numeric Format
Every four bits represent one decimal digit.
Only the decimal values 0 to 9 are used;
4-bit values above 9 are not used in BCD.
The unused 4-bit values are:
BCD Decimal
1010 10
1011 11
1100 12
1101 13
1110 14
1111 15
Multi-digit decimal numbers are stored as multiple groups of 4 bits per digit.
BCD is a signed notation: numbers may be
positive or negative.
For example, +27 is stored as 0 (sign) 0010 0111
and -27 as 1 (sign) 0010 0111.
BCD does not store negative numbers in two's complement.
Values represented
b3b2b1b0 | Sign & magnitude | 1's complement | 2's complement
0111 +7 +7 +7
0110 +6 +6 +6
0101 +5 +5 +5
0100 +4 +4 +4
0011 +3 +3 +3
0010 +2 +2 +2
0001 +1 +1 +1
0000 +0 +0 +0
1000 -0 -7 -8
1001 -1 -6 -7
1010 -2 -5 -6
1011 -3 -4 -5
1100 -4 -3 -4
1101 -5 -2 -3
1110 -6 -1 -2
1111 -7 -0 -1
Algorithms for Addition

   0101      5
 + 1001    + 9
 ------
   1110      Invalid BCD digit
 + 0110      Add 6
 ------
 1 0100      Correct answer: 1 4
BCD adder
If the result, S3 S2 S1 S0, is not a valid BCD digit, the multiplexer causes 6 to be added
to the result.
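One digit stage of the BCD adder can be sketched as follows (the helper `bcd_digit_add` is my own illustration of the add-6 correction, not the notes' circuit):

```python
# A sketch of one BCD digit stage: binary-add two digits plus a carry,
# and add 6 whenever the 4-bit result would be an invalid BCD digit.
def bcd_digit_add(a, b, carry_in=0):
    """Add two BCD digits; return (sum_digit, carry_out)."""
    s = a + b + carry_in
    if s > 9:            # invalid digit (1010-1111) or binary carry-out
        s += 6           # the add-6 correction
        return s & 0b1111, 1
    return s, 0

print(bcd_digit_add(5, 9))   # (4, 1)  -> BCD 14, as in the example
```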
Carry-Look-ahead Adder
There are several factors that contribute to the delay in the digital adders. One is the
propagation delay due to the internal structure of the gates, another factor is the loading
of the output buffers (due to fanout and net delays), and a third factor is the logic circuit
itself.
The propagation delay (or gate delay) of a gate is the time difference between the change
of the input and output signals.
Ripple-carry vs. Carry-look-ahead Adders
One type of circuit where the effect of gate delays is particularly clear is an adder. In
a 4-bit ripple-carry adder the result of an addition of two bits depends on the carry
generated by the addition of the previous two bits. Thus, the sum of the most significant
bit is only available after the carry signal has rippled through the adder from the least
significant stage to the most significant stage. This can be easily understood if one
considers the addition of the two 4-bit words 1111 + 0001, as shown in Figure 3.
Figure 3: Addition of two 4-bit numbers illustrating the generation of the carry-out bit
In this case, the addition (1 + 1 = 10 in binary) in the least significant stage causes a carry bit to
be generated. This carry bit will consequently generate another carry bit in the next stage,
and so on, until the final carry-out bit appears at the output. This requires the signal to
travel (ripple) through all the stages of the adder as illustrated in Figure 4 below. As a
result, the final Sum and Carry bits will be valid after a considerable delay. The carry-out
bit of the first stage will be valid after 4 gate delays (2 associated with the XOR gate and
1 each associated with the AND and OR gates).
From the schematic of Figure 4, one finds that the next carry-out (C2) will be valid after
an additional 2 gate delays (associated with the AND and OR gates) for a total of 6 gate
delays. In general the carry-out of an N-bit adder will be valid after 2N+2 gate delays. The
Sum bit will be valid an additional 2 gate delays after the carry-in signal. Thus the sum of
the most significant bit SN-1 will be valid after 2(N-1) + 2 +2 = 2N +2 gate delays. This
delay may be in addition to any delays associated with interconnections. It should be
mentioned that in case one implements the circuit in a FPGA, the delays may be different
from the above expression depending on how the logic has been placed in the look up
tables and how it has been divided among different CLBs.
Figure 4: Ripple-carry adder, illustrating the delay of the carry bit.
Features of a ripple-carry adder:
- Multiple full adders with carry-ins and carry-outs chained together
- Small layout area
- Large delay time
The disadvantage of the ripple-carry adder is that it can get very slow when one needs to
add many bits. For instance, for a 32-bit adder, the delay would be about 66 ns if one
assumes a gate delay of 1 ns. That would imply that the maximum frequency one can
operate this adder would be only 15 MHz! For fast applications, a better design is
required. The carry-look-ahead adder solves this problem by calculating the carry signals
in advance, based on the input signals. It is based on the fact that a carry signal will be
generated in two cases: (1) when both bits Ai and Bi are 1, or (2) when one of the two bits
is 1 and the carry-in (carry of the previous stage) is 1. Thus, one can write,
COUT = Ci+1 = Ai.Bi + (Ai ⊕ Bi).Ci (1)
The "⊕" stands for exclusive OR (XOR). One can also write this expression as
Ci+1 = Gi + Pi.Ci (2)
in which Gi = Ai.Bi (3)
Pi = Ai ⊕ Bi (4)
are called the Generate and Propagate terms, respectively.
Let's assume that the delay through an AND gate is one gate delay and through an XOR
gate is two gate delays. Notice that the Propagate and Generate terms only depend on the
input bits and thus will be valid after two and one gate delay, respectively. If one uses the
above expression to calculate the carry signals, one does not need to wait for the carry to
ripple through all the previous stages to find its proper value. Let's apply this to a 4-bit
adder to make it clear.
C1 = G0 + P0.C0 (5)
C2 = G1 + P1.C1 = G1 + P1.G0 + P1.P0.C0 (6)
C3 = G2 + P2.G1 + P2.P1.G0 + P2.P1.P0.C0 (7)
C4 = G3 + P3.G2 + P3.P2.G1 + P3P2.P1.G0 + P3P2.P1.P0.C0 (8)
Notice that the carry-out bit, C4, of the last stage will be available after four delays (two
gate delays to calculate the Propagate signal and two delays as a result of the AND and
OR gate). The Sum signal can be calculated as follows,
Si = Ai ⊕ Bi ⊕ Ci = Pi ⊕ Ci. (9)
The Sum bit will thus be available after two additional gate delays (due to the XOR gate)
or a total of six gate delays after the input signals Ai and Bi have been applied. The
advantage is that these delays will be the same independent of the number of bits one
needs to add, in contrast to the ripple-carry adder.
The carry-lookahead adder can be broken up in two modules: (1) the Partial Full Adder,
PFA, which generates Si, Pi and Gi as defined by equations 3, 4 and 9 above; and (2) the
Carry Look-ahead Logic, which generates the carry-out bits according to equations 5 to
8. The 4-bit adder can then be built by using 4 PFAs and the Carry Look-ahead logic
block as shown in Figure 5.
Figure 5: Block diagram of a 4-bit CLA.
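Equations (1) to (9) can be checked directly with a behavioral sketch (my own function name; bits are listed LSB first):

```python
# A sketch of a 4-bit CLA: generate/propagate terms, look-ahead
# carries C1..C4, and the sums, per equations (2)-(4) and (9).
def cla_4bit(a_bits, b_bits, c0=0):
    """a_bits, b_bits: [A0..A3] LSB first; returns (sum_bits, c4)."""
    g = [ai & bi for ai, bi in zip(a_bits, b_bits)]   # Gi = Ai.Bi    (3)
    p = [ai ^ bi for ai, bi in zip(a_bits, b_bits)]   # Pi = Ai ⊕ Bi  (4)
    c = [c0]
    for i in range(4):
        c.append(g[i] | (p[i] & c[i]))                # Ci+1 = Gi + Pi.Ci
    s = [p[i] ^ c[i] for i in range(4)]               # Si = Pi ⊕ Ci  (9)
    return s, c[4]

# 7 + 6 = 13: A = 0111, B = 0110, LSB first below
print(cla_4bit([1, 1, 1, 0], [0, 1, 1, 0]))   # ([1, 0, 1, 1], 0)
```

Iterating Ci+1 = Gi + Pi.Ci here gives the same values as the fully expanded equations (5) to (8); the hardware computes the expanded forms so that all carries appear after a fixed delay.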
The disadvantage of the carry-lookahead adder is that the carry logic is getting quite
complicated for more than 4 bits. For that reason, carry-look-ahead adders are usually
implemented as 4-bit modules and are used in a hierarchical structure to realize adders
that have multiples of 4 bits. Figure 6 shows the block diagram for a 16-bit CLA adder.
The circuit makes use of the same CLA Logic block as the one used in the 4-bit adder.
Notice that each 4-bit adder provides a group Propagate and Generate Signal, which is
used by the CLA Logic block. The group Propagate PG of a 4-bit adder will have the
following expressions,
PG = P3.P2.P1.P0 ; (10)
GG = G3 + P3.G2 + P3.P2.G1 + P3.P2.P1.G0 (11)
The group Propagate PG and Generate GG will be available after 3 and 4 gate delays,
respectively (one or two additional delays than the Pi and Gi signals, respectively).
Figure 6: Block diagram of a 16-bit CLA Adder
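Equations (10) and (11) in the same style (again an illustrative helper of mine):

```python
# A sketch of the group propagate/generate signals a 4-bit block
# hands to the next-level CLA logic, per equations (10) and (11).
def group_pg(g, p):
    """g, p: [G0..G3], [P0..P3] of one 4-bit block -> (PG, GG)."""
    pg = p[3] & p[2] & p[1] & p[0]                         # (10)
    gg = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]))                   # (11)
    return pg, gg

print(group_pg([0, 1, 1, 0], [1, 0, 0, 1]))   # (0, 1)
```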
MULTIPLICATION
Algorithms for Multiplication
   1101      Multiplicand M
 x 1011      Multiplier Q
 ------
   1101
  1101
 0000
1101
--------
10001111     Product P
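The paper-and-pencil method above is shift-and-add: one shifted copy of M for every 1 bit in Q. A sketch for non-negative operands (the function name is mine):

```python
# A sketch of shift-and-add multiplication: add M, shifted into
# position, for each multiplier bit that is 1.
def multiply(m, q):
    """Shift-and-add multiplication of non-negative integers."""
    product = 0
    shift = 0
    while q:
        if q & 1:                    # this multiplier bit is 1
            product += m << shift    # add M shifted into position
        q >>= 1
        shift += 1
    return product

print(bin(multiply(0b1101, 0b1011))[2:])   # 10001111
```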
Array multiplier
Combinational circuit
Product generated in one micro operation
Requires large number of gates
Became feasible after integrated circuits developed
For j multiplier and k multiplicand bits, it needs
o j x k AND gates
o j - 1 k-bit adders, to produce a product of j + k bits
Multiply Signed-2’s Complement
Booth algorithm:
This algorithm serves two purposes:
Fast multiplication when there are consecutive 0's or 1's in the multiplier.
Can be used for signed multiplication.
QR multiplier
Qn least significant bit of QR
Qn+1 previous least significant bit of QR
BR multiplicand
AC= 0
SC number of bits in multiplier
Algorithm:
1. Do SC times:
2. If QnQn+1 = 10
AC ← AC + BR' + 1 (i.e., AC ← AC - BR)
3. If QnQn+1 = 01
AC ← AC + BR
4. Arithmetic shift right AC, QR, and Qn+1
5. SC ← SC - 1
Explanation:
1. Depending on the current and previous bits, do one of the following:
00: a. Middle of a string of 0s, so no arithmetic operations.
01: b. End of a string of 1s, so add the multiplicand to the left
half of the product.
10: c. Beginning of a string of 1s, so subtract the multiplicand
from the left half of the product.
11: d. Middle of a string of 1s, so no arithmetic operation.
2. As in the previous algorithm, shift the Product register right (arith) 1 bit.
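A register-level sketch of the algorithm (my own Python model of AC, QR, and the extra bit Qn+1; the 8-bit width is an assumption):

```python
# A sketch of Booth's algorithm with the AC/QR/Qn+1 registers above.
def booth_multiply(multiplicand, multiplier, n=8):
    mask = (1 << n) - 1
    br = multiplicand & mask
    neg_br = (-multiplicand) & mask          # BR' + 1, i.e. -BR
    ac, qr, q_extra = 0, multiplier & mask, 0
    for _ in range(n):                       # do SC times
        pair = ((qr & 1) << 1) | q_extra     # the bits Qn, Qn+1
        if pair == 0b10:                     # beginning of a string of 1s
            ac = (ac + neg_br) & mask        # AC <- AC - BR
        elif pair == 0b01:                   # end of a string of 1s
            ac = (ac + br) & mask            # AC <- AC + BR
        q_extra = qr & 1                     # arithmetic shift right AC, QR
        qr = ((qr >> 1) | ((ac & 1) << (n - 1))) & mask
        ac = ((ac >> 1) | (ac & (1 << (n - 1)))) & mask
    result = (ac << n) | qr
    return result - (1 << 2 * n) if result >> (2 * n - 1) else result

print(booth_multiply(-9, -13))   # 117
```

Note the arithmetic (sign-preserving) right shift of AC, which is what lets the same loop handle signed operands.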
Hardware
Example: -9 x -13 = 117
DIVISION
Algorithms for Division
Division can be implemented using either a restoring or a non-restoring algorithm. An
inner loop to perform multiple subtractions must be incorporated into the algorithm.
       10      Quotient
 11 ) 1000     Dividend
       11
      ----
        10     Remainder
A logic circuit arrangement implements the restoring-division technique
The restoring-division algorithm:
S1: DO n times
Shift A and Q left one binary position.
Subtract M from A, placing the answer back in A.
S2: If the sign of A is 1, set q0 to 0 and add M back to A (restore A); otherwise, set q0
to 1.
A restoring-division example (1000 ÷ 11, i.e., 8 ÷ 3):

Initially        A = 00000   Q = 1000
                 M = 00011
First cycle:
  Shift          A = 00001   Q = 000_
  Subtract M     A = 11110
  Set q0 = 0, restore (add M back)
                 A = 00001   Q = 0000
Second cycle:
  Shift          A = 00010   Q = 000_
  Subtract M     A = 11111
  Set q0 = 0, restore
                 A = 00010   Q = 0000
Third cycle:
  Shift          A = 00100   Q = 00__
  Subtract M     A = 00001
  Set q0 = 1     A = 00001   Q = 0001
Fourth cycle:
  Shift          A = 00010   Q = 001_
  Subtract M     A = 11111
  Set q0 = 0, restore
                 A = 00010   Q = 0010

Remainder A = 00010, Quotient Q = 0010
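The same register sequence in Python (an illustrative model of mine; A holds the partial remainder and Q the forming quotient):

```python
# A sketch of the restoring-division loop over registers A, Q, M,
# as in the 1000 / 11 (8 / 3) example above.
def restoring_divide(dividend, divisor, n=4):
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        # S1: shift A and Q left one position, then subtract M from A
        a = (a << 1) | ((q >> (n - 1)) & 1)
        q = (q << 1) & ((1 << n) - 1)
        a -= m
        # S2: if A went negative, set q0 = 0 and restore A; else q0 = 1
        if a < 0:
            a += m
        else:
            q |= 1
    return q, a      # quotient, remainder

print(restoring_divide(0b1000, 0b11))   # (2, 2)
```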
The non-restoring division algorithm:
S1: Do n times
If the sign of A is 0, shift A and Q left one binary position and subtract M from A;
otherwise, shift A and Q left and add M to A.
Then, if the sign of A is 0, set q0 to 1; otherwise, set q0 to 0.
S2: If the sign of A is 1, add M to A.
Assume the dividend and the divisor are 124 and 7, respectively. The non-restoring
division scheme would proceed as follows
(124 decimal = 01111100 binary). The M register contains the divisor 7 (M = 00000111).
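A sketch of the non-restoring scheme applied to 124 / 7 (my own model, including the q0-setting step that follows the sign of A):

```python
# A sketch of non-restoring division: add or subtract M depending on
# the sign of A, set q0 from the new sign, and restore once at the end.
def nonrestoring_divide(dividend, divisor, n=8):
    a, q, m = 0, dividend, divisor
    for _ in range(n):
        sign_negative = a < 0
        a = (a << 1) | ((q >> (n - 1)) & 1)      # shift A, Q left
        q = (q << 1) & ((1 << n) - 1)
        a = a + m if sign_negative else a - m    # add or subtract M
        if a >= 0:                               # set q0 from the sign of A
            q |= 1
    if a < 0:                                    # S2: final restoring add
        a += m
    return q, a      # quotient, remainder

print(nonrestoring_divide(124, 7))   # (17, 5)
```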
A few comparisons
Restoring division:
- most efficient for floating-point division, and for integer division when the divisor is
not small
- easy to implement
Non-restoring division:
- the main advantage is compatibility with 2's complement notation for the dividend
and divisor
Note that a single-precision floating-point number is normalized only if it can be
expressed in binary in the form 1.M × 2^E, where M is the 23-bit mantissa and the
exponent E is such that -126 ≤ E ≤ 127. A denormalized number requires an exponent less
than -126, in which case the number would be represented using the special pattern 0 for
the characteristic to denote an exponent of -126, with the significand expressed as a pure
fraction 0.M. Thus the value is 0.M × 2^-126.
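The normalized and denormalized interpretations above can be checked by unpacking the bit fields of a single-precision value; this is a sketch using Python's standard `struct` module, with `decode_single` as an illustrative helper name.

```python
import struct

def decode_single(x):
    """Decode an IEEE 754 single-precision value into (sign, E, significand)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    char = (bits >> 23) & 0xFF           # 8-bit characteristic (biased exponent)
    m = bits & 0x7FFFFF                  # 23-bit mantissa field M
    if char == 0:                        # denormalized: value = 0.M x 2^-126
        return sign, -126, m / 2**23
    return sign, char - 127, 1 + m / 2**23   # normalized: 1.M x 2^(char-127)

sign, e, sig = decode_single(6.5)        # 6.5 = 1.101 binary = 1.625 x 2^2
print(sign, e, sig)                      # -> 0 2 1.625
```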
The difference in exponents determines which of the significands is shifted: the significand of the number with the smaller exponent is shifted right by the exponent difference before the significands are added.
The result is then normalized and the exponent is adjusted if necessary.
The rounding hardware then creates the final result.
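The align/add/normalize steps above can be sketched as follows. This is a minimal sketch for adding two positive normalized numbers (significand in [1, 2)); subtraction, guard bits, and the rounding step are omitted.

```python
def fp_add(e1, s1, e2, s2):
    """Add two positive normalized floating-point numbers (e, significand)."""
    if e1 < e2:                          # make operand 1 the larger exponent
        e1, s1, e2, s2 = e2, s2, e1, s1
    s2 /= 2 ** (e1 - e2)                 # align: shift smaller significand right
    e, s = e1, s1 + s2                   # add the aligned significands
    while s >= 2:                        # normalize: shift right, bump exponent
        s /= 2
        e += 1
    return e, s

print(fp_add(2, 1.625, 0, 1.5))          # 6.5 + 1.5 = 8.0  -> (3, 1.0)
```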
ALU Design
A One Bit ALU
• This 1-bit ALU will perform AND, OR, and ADD
A One-bit Full Adder
This is also called a (3, 2) adder
Half Adder: has no CarryIn or CarryOut
Truth Table:
Logic Equation for CarryOut
• CarryOut = (!A & B & CarryIn) | (A & !B & CarryIn) | (A & B & !CarryIn) | (A & B & CarryIn)
• CarryOut = B & CarryIn | A & CarryIn | A & B
Logic Equation for Sum
• Sum = (!A & !B & CarryIn) | (!A & B & !CarryIn) | (A & !B & !CarryIn) | (A & B & CarryIn)
• Sum = A XOR B XOR CarryIn
• Truth Table for XOR:
Logic Diagrams for CarryOut and Sum
CarryOut = B & CarryIn | A & CarryIn | A & B
• Sum = A XOR B XOR CarryIn
A 4-bit ALU
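A 4-bit ALU can be built by replicating the 1-bit ALU and chaining the carries (ripple carry). The sketch below implements the equations above directly; the op encoding (0 = AND, 1 = OR, 2 = ADD) is illustrative.

```python
def alu_1bit(a, b, carry_in, op):
    """One-bit ALU: op 0 = AND, 1 = OR, 2 = ADD (equations from the text)."""
    s = a ^ b ^ carry_in                                    # Sum
    carry_out = (b & carry_in) | (a & carry_in) | (a & b)   # CarryOut
    result = [a & b, a | b, s][op]
    return result, carry_out

def alu_4bit(a, b, op):
    """Four 1-bit ALUs chained through the carry (ripple-carry)."""
    carry, result = 0, 0
    for i in range(4):
        bit, carry = alu_1bit((a >> i) & 1, (b >> i) & 1, carry, op)
        result |= bit << i
    return result, carry

print(alu_4bit(0b0110, 0b0101, 2))   # 6 + 5 -> (11, 0)
```

As in the hardware, the carry chain is always active; it is simply ignored for the AND and OR operations.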
CONTROL UNIT ORGANIZATION
Instruction Execution
The CPU executes a sequence of instructions.
The execution of an instruction is organized as an instruction cycle: it is performed as a
succession of several steps;
• Each step is executed as a set of several microoperations.
• The task performed by any microoperation falls in one of the following categories:
- Transfer data from one register to another;
- Transfer data from a register to an external interface (system bus);
- Transfer data from an external interface to a register;
- Perform an arithmetic or logic operation, using registers for input and output.
Microoperations and Control Signals
In order to allow the execution of a microoperation, one or several control signals have to
be issued; they allow the corresponding data transfer and/or computation to be
performed.
Examples:
a) signals for transferring content of register R0 to R1:
R0out, R1in
b) signals for adding content of Y to that of R0 (result in Z):
R0out, Add, Zin
c) signals for reading a memory location; address in R3:
R3out, MARin, Read
• The CPU executes an instruction as a sequence of control steps. In each control step one
or several microoperations are executed.
• One clock pulse triggers the activities corresponding to one control step: for each
clock pulse, the control unit generates the control signals corresponding to the
microoperations to be executed in the respective control step.
Microoperations and Control Signals (cont’d)
Instruction:
ADD R1, R3 R1 ← R1 + R3
control steps and control signals:
instruction:
ADD R1, (R3) R1 ← R1 + [R3]
control steps and control signals:
instruction:
BR target unconditional branch (with relative addressing)
control steps and control signals:
• The first (three) control steps are identical for each instruction; they perform instruction
fetch and increment the PC. The following steps depend on the actual instruction (stored
in the IR).
• If a control step issues a read, the value will be available in the MBR after one
additional step.
• Several microoperations can be performed in the same control step if they don't conflict
(for example, only one of them is allowed to output on the bus).
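The control sequences for the instructions above (elided figures) can be represented as data: one set of control signals per control step. The sequence below is a sketch for ADD R1, R3 on a single-bus organization; signal names such as IncPC and the exact step ordering are illustrative, not taken verbatim from the figures.

```python
# Control sequence for ADD R1, R3 (R1 <- R1 + R3), one signal set per step.
# The first three steps are the common fetch phase described in the text.
ADD_R1_R3 = [
    {"PCout", "MARin", "Read", "IncPC"},   # 1: send PC to memory, start read
    {"WMFC"},                              # 2: wait for memory to respond
    {"MDRout", "IRin"},                    # 3: load the instruction into IR
    {"R3out", "Yin"},                      # 4: first operand to Y
    {"R1out", "Add", "Zin"},               # 5: Z <- Y + R1
    {"Zout", "R1in", "End"},               # 6: result to R1, end instruction
]

for step, signals in enumerate(ADD_R1_R3, start=1):
    print(step, sorted(signals))
```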
Control Unit
The basic task of the control unit:
- For each instruction the control unit causes the CPU to go through a sequence of control
steps;
- in each control step the control unit issues a set of signals which cause the
corresponding microoperations to be executed.
• The control unit is driven by the processor clock.
The signals to be generated at a certain moment depend on:
- the actual step to be executed;
- the condition and status flags of the processor;
- the actual instruction executed;
- external signals received on the system bus
(e.g. interrupt signal)
Control Unit
• Techniques for implementation of the control unit:
1. Hardwired control
2. Microprogrammed control
HARDWIRED CONTROL
To execute instructions, the processor must have some means of generating the control
signals needed in the proper sequence. Computer designers use a wide variety of
techniques to solve this problem. The approaches used fall into one of two categories:
hardwired control and microprogrammed control. We discuss each of these techniques in
detail, starting with hardwired control in this section.
Consider the sequence of control signals given in Figure 1.6. Each step in this sequence is
completed in one clock period. A counter may be used to keep track of the control steps,
as shown in Figure 1.10. Each state, or count, of this counter corresponds to one control
step. The required control signals are determined by the following information:
• Contents of the control step counter
• Contents of the instruction register
• Contents of the condition code flags
• External input signals, such as MFC and interrupt requests
Fig. 1.10 Control unit Organization
To gain insight into the structure of the control unit, we start with a simplified view of the
hardware involved. The decoder/encoder block in Figure 1.10 is a combinational circuit
that generates the required control outputs, depending on the state of all its inputs. By
separating the decoding and encoding functions, we obtain the more detailed block
diagram in Figure 1.11. The step decoder provides a separate signal line for each step, or
time slot, in the control sequence. Similarly, the output of the instruction decoder consists
of a separate line for each machine instruction. For any instruction loaded in the IR, one
of the output lines INS1 through INSm is set to 1, and all other lines are set to 0. The input
signals to the encoder block in Figure 1.11 are combined to generate the individual
control signals Yin, PCout, Add, End, and so on. An example of how the encoder generates
the Zin control signal for the processor organization in Figure 1.1 is given in Figure 1.12.
This circuit implements the logic function

Zin = T1 + T6 · ADD + T4 · BR + ...

This signal is asserted during time slot T1 for all instructions, during T6 for an Add
instruction, during T4 for an unconditional branch instruction, and so on. The logic
function for Zin is derived from the control sequences in Figures 1.6 and 1.7. As another
example, Figure 1.13 gives a circuit that generates the End control signal from the logic
function

End = T7 · ADD + T5 · BR + (T5 · N + T4 · N') · BRN + ...

The End signal starts a new instruction fetch cycle by resetting the control step counter
to its starting value. Figure 1.11 contains another control signal called RUN.
Fig.1.11 Separation of the decoding and encoding functions
Fig.1.12 Generation of the Zin control signal for the processor in Fig 1.1
Fig. 1.13 Generation of the End control signal.
When set to 1, RUN causes the counter to be incremented by one at the end of every
clock cycle. When RUN is equal to 0, the counter stops counting. This is needed
whenever the WMFC signal is issued, to cause the processor to wait for the reply from
the memory.
The control hardware shown in Figure 1.10 or 1.11 can be viewed as a state machine that
changes from one state to another in every clock cycle, depending on the contents of the
instruction register, the condition codes, and the external inputs. The outputs of the state
machine are the control signals. The sequence of operations carried out by this machine is
determined by the wiring of the logic elements, hence the name "hardwired." A controller
that uses this approach can operate at high speed. However, it has little flexibility, and the
complexity of the instruction set it can implement is limited.
• In the case of hardwired control, the control unit is a combinational circuit; it gets a set
of inputs (from the IR, flags, clock, and system bus) and transforms them into a set of
control signals.
Generation of signal Zin:
- first step of all instructions (fetch instruction)
- step 5 of ADD with register addressing
- step 5 of BR
- step 6 of ADD with register-indirect addressing
Zin =T1 +T5 ⋅ (ADDreg + BR) + T6 ⋅ ADDreg_ind +...
Generation of signal End:
- step 6 of ADD with register addressing
- step 7 of ADD with register-indirect addressing
- step 6 of BR
End = T6 ⋅ (ADDreg + BR) + T7 ⋅ ADDreg_ind + . . .
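The two signal equations above can be expressed directly as Boolean functions of the control step and the decoded instruction; the sketch below uses string instruction names purely for illustration.

```python
def zin(t, ins):
    """Zin = T1 + T5 . (ADDreg + BR) + T6 . ADDreg_ind  (equation above)."""
    return t == 1 or (t == 5 and ins in ("ADDreg", "BR")) or \
           (t == 6 and ins == "ADDreg_ind")

def end(t, ins):
    """End = T6 . (ADDreg + BR) + T7 . ADDreg_ind  (equation above)."""
    return (t == 6 and ins in ("ADDreg", "BR")) or \
           (t == 7 and ins == "ADDreg_ind")

print(zin(1, "BR"), zin(5, "ADDreg"), end(6, "BR"))   # -> True True True
```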
Advantages:
Hardwired control provides highest speed.
RISCs are implemented with hardwired control.
If the instruction set becomes very complex (CISCs) implementing hardwired
control is very difficult. In this case microprogrammed control units are used.
In order to allow execution of register-to-register operations in a single clock
cycle, RISCs (and other modern processors) use three-bus CPU structures.
MICROPROGRAMMED CONTROL
Microprogram
- Program stored in memory that generates all the control signals required to execute
the instruction set correctly
- Consists of microinstructions
Microinstruction
- Contains a control word and a sequencing word
Control Word - All the control information required for one clock cycle.
o a sequence of Nsig bits, where Nsig is the total number of control signals;
each bit in a CW corresponds to one control signal.
o Each control step during execution of an instruction defines a certain CW;
it represents a combination of 1s and 0s corresponding to the active and
non-active control signals
Sequencing Word - Information needed to decide the next microinstruction address
- Vocabulary to write a microprogram
Microprogrammed control - basic idea:
All microroutines corresponding to the machine instructions are stored in the
control store.
The control unit generates the sequence of control signals for a certain machine
instruction by reading from the control store the CWs of the microroutine
corresponding to the respective instruction.
The control unit is implemented just like another very simple CPU, inside the CPU,
executing microroutines stored in the control store.
Control Memory (Control Storage: CS)
- Storage in the microprogrammed control unit to store the microprogram
Writeable Control Memory (Writeable Control Storage: WCS)
- CS whose contents can be modified
-> Allows the microprogram to be changed
-> Instruction set can be changed or modified
Dynamic Microprogramming
- Computer system whose control unit is implemented with a micro program in WCS
- Microprogram can be changed by a systems programmer or a user
MICROPROGRAMMED CONTROL
In hardwired control, we saw how the control signals required inside the processor can be
generated using a control step counter and a decoder/encoder circuit. Now we discuss
an alternative scheme, called microprogrammed control, in which control signals are
generated by a program similar to machine language programs.
Fig.1.15 An example of micro instructions for the Fig. 1.6
First, we introduce some common terms. A control word (CW) is a word whose
individual bits represent the various control signals in Figure 1.11. Each of the control
steps in the control sequence of an instruction defines a unique combination of 1s and 0s
in the CW. The CWs corresponding to the 7 steps of Figure 1.6 are shown in Figure 1.15.
We have assumed that SelectY is represented by Select = 0 and Select4 by Select = 1. A
sequence of CWs corresponding to the control sequence of a machine instruction
constitutes the microroutine for that instruction, and the individual control words in this
microroutine are referred to as microinstructions.
The microroutines for all instructions in the instruction set of a computer are stored in a
special memory called the control store. The control unit can generate the control signals
for any instruction by sequentially reading the CWs of the corresponding microroutine
from the control store. This suggests organizing the control unit as shown in Figure 1.16.
To read the control words sequentially from the control store, a microprogram counter
(µPC) is used. Every time a new instruction is loaded into the IR, the output of the block
labeled "starting address generator" is loaded into the µPC. The µPC is then
automatically incremented by the clock, causing successive microinstructions to be read
from the control store. Hence, the control signals are delivered to various parts of the
processor in the correct sequence.
One important function of the control unit cannot be implemented by the simple
organization in Figure 1.16. This is the situation that arises when the control unit is
required to check the status of the condition codes or external inputs to choose between
alternative courses of action. In the case of hardwired control, this situation is handled by
including an appropriate logic function, as in Equation 1.2, in the encoder circuitry. In
microprogrammed control, an alternative approach is to use conditional branch
microinstructions. In addition to the branch address, these microinstructions specify
which of the external inputs, condition codes, or, possibly, bits of the instruction register,
should be checked as a condition for branching to take place.
The instruction Branch<0 may now be implemented by a microroutine such as that
shown in Figure 1.17. After loading this instruction into the IR, a branch microinstruction
transfers control to the corresponding microroutine, which is assumed to start at location
25 in the control store. This address is the output of the starting address generator block
in Figure 1.16. The microinstruction at location 25 tests the N bit of the condition
codes. If this bit is equal to 0, a branch takes place to location 0 to fetch a new machine
instruction. Otherwise, the microinstruction at location 26 is executed to put the branch
target address into register Z, as in step 4 in Figure 1.7. The microinstruction at location
27 loads this address into the PC.
Fig.1.16 Basic organization of a microprogrammed control unit
Fig.1.17 Micro routine for the instruction BRANCH<0
Fig. 1.18 Organization of the control unit to allow conditional branching in the
microprogram.
To support microprogram branching, the organization of the control unit should be
modified as shown in Figure 1.18. The starting address generator block of Figure 1.16
becomes the starting and branch address generator. This block loads a new address into
the µPC when a microinstruction instructs it to do so. To allow implementation of a
conditional branch, inputs to this block consist of the external inputs and condition codes
as well as the contents of the instruction register. In this control unit, the µPC is
incremented every time a new microinstruction is fetched from the microprogram
memory, except in the following situations:
1. When a new instruction is loaded into the IR, the µPC is loaded with the starting
address of the microroutine for that instruction.
2. When a Branch microinstruction is encountered and the branch condition is satisfied,
the µPC is loaded with the branch address.
3. When an End microinstruction is encountered, the µPC is loaded with the address of
the first CW in the microroutine for the instruction fetch cycle (this address is 0 in Figure
1.17).
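The µPC behavior described above (sequential fetch, conditional branch, End) can be sketched as a small interpreter over a control store. The store contents below follow the Branch<0 microroutine at locations 25-27; the microinstruction encoding and signal names such as Offsetout are illustrative assumptions.

```python
# Sketch of the control unit of Fig. 1.18: the control store holds CWs and
# branch microinstructions; the uPC advances sequentially except on a new
# instruction, a taken branch, or End.
CONTROL_STORE = {
    25: ("branch_if_N0", 0),              # Branch<0: if N == 0, go fetch (addr 0)
    26: ("cw", {"Offsetout", "Add", "Zin"}),
    27: ("cw", {"Zout", "PCin", "End"}),
}

def run_microroutine(start, n_flag):
    upc, issued = start, []
    while True:
        kind, arg = CONTROL_STORE[upc]
        if kind == "branch_if_N0":
            if n_flag == 0:
                return issued             # uPC loaded with fetch address
            upc += 1                      # condition false: fall through
            continue
        issued.append(arg)                # issue this control word
        if "End" in arg:
            return issued                 # uPC reloaded with fetch address
        upc += 1

print(len(run_microroutine(25, 1)))       # N set: two CWs issued -> 2
```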
Control Store Organization
• The control store contains the microprogram (sometimes called firmware).
Microroutine Executed for Conditional Branch
The microroutines contain, besides CWs, also branches, which have to be
interpreted by the microprogrammed controller.
The sequencer controls the correct execution sequence of microinstructions.
The sequencer is a small control unit of the control unit.
The greater ease and speed of designing a microprogrammed control unit versus the
design of a control unit based on a random logic implementation of the finite state
machine and next state function resulted in a significant reduction in design costs. It
was also much easier to correct errors in the microprogrammed system than in the
hardwired system.
With the lower cost and higher availability of fast RAM, some systems stored the
microcode in RAM, producing what is sometimes called a writable control store
(WCS) machine. This allowed corrections or changes to the microcode even after the
machine had been delivered.
It also made possible the loading of completely different instruction sets on the
same machine for different applications.
The main advantage of hardwired systems is their greater speed. This greater speed,
coupled with their much higher cost, tended to restrict their use to high-performance
computers.
With the trend toward simpler instructions and control, and the advent of computer-
aided design (CAD) tools, the design of hardwired control units has become much
easier and less prone to errors. RISC machines, with their goal of executing one or
more instructions per cycle, are becoming much more prevalent. These developments
tend to lead away from the use of microprogrammed control.
PLA CONTROL
SEQUENCER (MICROPROGRAM SEQUENCER)
The part of a microprogrammed control unit that determines the address of the
microinstruction to be executed in the next clock cycle. Sequencing functions include:
- In-line Sequencing
- Branch
- Conditional Branch
- Subroutine
- Loop
- Instruction OP-code mapping
MICROINSTRUCTION SEQUENCING
Sequencing Capabilities Required in a Control Storage
- Incrementing of the control address register
- Unconditional and conditional branches
(Sequencer block diagram, flattened in extraction: the instruction code feeds mapping logic; multiplexers select the next-address source among the incrementer, the branch address, the mapping logic output, and the subroutine register (SBR); the MUX-select and status-bit-select fields steer this choice via the branch logic; the selected address is loaded into the control address register (CAR), which addresses the control memory (ROM); the control memory outputs the microoperations.)
- A mapping process from the bits of the machine instruction to an address for control
memory
- A facility for subroutine call and return
Conditional Branch
If Condition is true, then Branch (address from
the next address field of the current microinstruction)
else Fall Through
Conditions to Test: O(overflow), N(negative),
Z(zero), C(carry), etc.
Unconditional Branch
Fixing the value of one status bit at the input of the multiplexer to 1
(Conditional-branch hardware, flattened in extraction: the status/condition bits enter a multiplexer; the condition-select field of the microinstruction selects one bit; depending on that bit, the control address register is either loaded with the next-address field of the microinstruction or incremented; the control memory outputs the microoperations.)
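The next-address selection described above can be sketched as one function of the MUX-select field; the select values (`increment`, `branch`, `map`, `return`) are illustrative names for the four sources.

```python
def next_address(car, mux_select, branch_addr, map_addr, sbr, status_bit):
    """Next value of the control address register (CAR) for the sequencer.

    mux_select chooses among: in-line sequencing (increment), conditional
    branch, instruction op-code mapping, and subroutine return from SBR.
    """
    if mux_select == "increment":
        return car + 1
    if mux_select == "branch":                 # conditional branch
        return branch_addr if status_bit else car + 1
    if mux_select == "map":                    # op-code mapping
        return map_addr
    if mux_select == "return":                 # subroutine return
        return sbr
    raise ValueError(mux_select)

print(next_address(40, "branch", 25, 0, 0, 1))   # condition true -> 25
```

An unconditional branch is obtained, as the text notes, by fixing the selected status bit to 1.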
MICROINSTRUCTION FORMAT
Information in a Microinstruction
- Control Information
- Sequencing Information
- Constant: information which is useful when fed into the system
This information needs to be organized in some way for:
- Efficient use of the microinstruction bits
- Fast decoding
Field Encoding
- Encoding the microinstruction bits
- Encoding slows down the execution speed due to the decoding delay
- Encoding also reduces the flexibility due to the decoding hardware
Horizontal Microinstructions
Each bit directly controls each micro-operation or each control point
Horizontal implies a long microinstruction word
Advantages: Can control a variety of components operating in parallel.
--> Advantage of efficient hardware utilization
Disadvantages: Control word bits are not fully utilized
CS becomes large --> Costly
In general, the number of bits in a microinstruction ranges from around a dozen to over a
hundred. The exact number depends on the complexity of the datapath and on the
number and types of instructions, as well as the number of allowed instruction operands
and their addressing modes.
A horizontal microcode system uses minimal encoding to specify the control
information. For example, if there are 32 registers that might be used as an operand, then
a separate bit would signal whether the corresponding register is selected. Or if there
were 128 different operations that could be specified, then a separate bit would be used
for each. A disadvantage of this approach is that relatively few of the actions specified by
bits in the microinstruction can occur in parallel and only one register at a time can be
selected as a source or destination operand. This leads to the presence of many zeros in
the microinstruction, and creates a lot of wasted space in the memory.
Vertical Microinstructions
A microinstruction format that is not horizontal
Vertical implies a short microinstruction word
Encoded Microinstruction fields
--> Needs decoding circuits for one or two levels of decoding
In a vertical microcode system, the widths of fields such as the register number and
ALU operation are reduced by encoding the information in a shorter form. For example,
any one of 32 registers can be specified using a 5-bit field, or 7 bits could be used to encode up to
128 different operations. The main disadvantage of this approach when compared to the
horizontal microcode system is the slower operation due to the need to decode the fields.
One-level decoding (diagram flattened in extraction): a 2-bit field A feeds a 2 x 4 decoder, activating 1 of 4 lines, and a 3-bit field B feeds a 3 x 8 decoder, activating 1 of 8 lines.
Two-level decoding (diagram flattened in extraction): a 2-bit field A feeds a 2 x 4 decoder, and a 6-bit field B feeds a 6 x 64 decoder; the decoder outputs pass through further decoder and selection logic.
Nanostorage and Nanoinstruction
Nanoinstructions are used to drive a lookup table of microinstructions in a machine
where a nanostore is used. This is appropriate where many of the microinstructions occur
several times throughout the microprogram. In this case, the distinct microinstructions are
placed in a small nanostore, and the control store then contains (in order) the index in the
nanostore of the appropriate microinstruction.
Usually, the microprogram consists of a large number of short microinstructions, while
the nanoprogram contains fewer words with longer nanoinstructions.
The decoder circuits in a vertical microprogram storage organization can be replaced by a
ROM
=> Two levels of control storage
First level - Control Storage
Second level - Nano Storage
Two-level microprogram
First level
-Vertical format Microprogram
Second level
-Horizontal format Nanoprogram
-Interprets the microinstruction fields, thus converts a vertical
microinstruction format into a horizontal nanoinstruction format.
Two-Level Microprogramming - Example
Microprogram: 2048 microinstructions of 200 bits each
With 1-Level Control Storage: 2048 x 200 = 409,600 bits
Assumption: 256 distinct microinstructions among 2048
With 2-Level Control Storage:
o Nano Storage: 256 x 200 bits to store 256 distinct nanoinstructions
o Control storage: 2048 x 8 bits
o To address 256 nano storage locations 8 bits are needed
Total 1-Level control storage: 409,600 bits
Total 2-Level control storage: 67,584 bits (256 x 200 + 2048 x 8)
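The storage arithmetic above can be checked directly:

```python
n_micro, width = 2048, 200            # 2048 microinstructions of 200 bits each
n_distinct = 256                      # distinct microinstructions among them
index_bits = 8                        # 2^8 = 256 nanostore addresses

one_level = n_micro * width                          # flat control storage
two_level = n_distinct * width + n_micro * index_bits  # nano + index store

print(one_level, two_level)           # -> 409600 67584
```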
(Diagram flattened in extraction: an 11-bit control address register addresses the 2048 x 8 control memory; each 8-bit microinstruction serves as the nanomemory address into the 256 x 200 nanomemory, which outputs the 200-bit nanoinstruction.)
THE MEMORY SYSTEM
Programs and the data they operate on are held in the memory of the computer. In this
chapter, we discuss how this vital part of the computer operates. By now, the reader
appreciates that the execution speed of programs is highly dependent on the speed with
which instructions and data can be transferred between the processor and the memory. It
is also important to have a large memory to facilitate execution of programs that are large
and deal with huge amounts of data.
Ideally, the memory would be fast, large, and inexpensive. Unfortunately, it is impossible
to meet all three of these requirements simultaneously. Increased speed and size are
achieved at increased cost. To solve this problem, much work has gone into developing
clever structures that improve the apparent speed and size of the memory, yet keep the
cost reasonable.
First, we describe the most common components and organizations used to implement
the memory. Then we examine memory speed and discuss how the apparent speed of the
memory can be increased by means of caches. Next, we present the virtual memory
concept, which increases the apparent size of the memory. Finally, we discuss the
secondary storage devices, which provide much larger storage capability.
SOME BASIC CONCEPTS
The maximum size of the memory that can be used in any computer is determined by
the addressing scheme. For example, a 16-bit computer that generates 16-bit addresses is
capable of addressing up to 2^16 = 64K memory locations. Similarly, machines whose
instructions generate 32-bit addresses can utilize a memory that contains up to 2^32 = 4G
(giga) memory locations, whereas machines with 40-bit addresses can access up to 2^40 =
1T (tera) locations. The number of locations represents the size of the address space of
the computer.
Most modern computers are byte addressable. Figure 2.7 shows the possible address
assignments for a byte-addressable 32-bit computer. The big-endian arrangement is used
in the 68000 processor. The little-endian arrangement is used in Intel processors. The
ARM architecture can be configured to use either arrangement. As far as the memory
structure is concerned, there is no substantial difference between the two schemes.
The memory is usually designed to store and retrieve data in word-length quantities. In
fact, the number of bits actually stored or retrieved in one memory access is the most
common definition of the word length of a computer. Consider, for example, a byte-
addressable computer whose instructions generate 32-bit addresses. When a 32-bit
address is sent from the processor to the memory unit, the high-order 30 bits determine
which word will be accessed. If a byte quantity is specified, the low-order 2 bits of the
address specify which byte location is involved. In a Read operation, other bytes may be
fetched from the memory, but they are ignored by the processor. If the byte operation is a
Write, however, the control circuitry of the memory must ensure that the contents of
other bytes of the same word are not changed.
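The word/byte address split described above is simple bit slicing; for a byte-addressable machine with 4-byte words:

```python
def split_address(addr):
    """Split a 32-bit byte address: the high-order 30 bits select the word,
    the low-order 2 bits select the byte within that word."""
    return addr >> 2, addr & 0b11     # (word address, byte offset)

print(split_address(0x1007))          # -> (1025, 3)
```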
Modern implementations of computer memory are rather complex and difficult to
understand on first encounter. To simplify our introduction to memory structures, we will
first present a traditional architecture. Then, in later sections, we will discuss the latest
approaches.
From the system standpoint, we can view the memory unit as a black box. Data transfer
between the memory and the processor takes place through the use of two processor
registers, usually called MAR (memory address register) and MDR (memory data
register), as introduced in Section 1.2. If MAR is k bits long and MDR is n bits long, then
the memory unit may contain up to 2^k addressable locations. During a memory cycle, n
bits of data are transferred between the memory and the processor. This transfer takes
place over the processor bus, which has k address lines and n data lines. The bus also
includes the control lines Read/Write (R/W) and Memory Function Completed (MFC) for
coordinating data transfers. Other control lines may be added to indicate the number of
bytes to be transferred. The connection between the processor and the memory is shown
schematically in Figure 4.1.
The processor reads data from the memory by loading the address of the required
memory location into the MAR register and setting the R/W line to 1. The memory
responds by placing the data from the addressed location onto the data lines, and
confirms this action by asserting the MFC signal. Upon receipt of the MFC signal, the
processor loads the data on the data lines into the MDR register.
The processor writes data into a memory location by loading the address of this
location into MAR and loading the data into MDR. It indicates that a write operation is
involved by setting the R/W line to 0.
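The read and write protocols just described can be sketched as a toy model of the processor-memory interface; the class and method names are illustrative.

```python
class Memory:
    """Toy model of the interface above: the address goes in MAR, the data
    in MDR, and the R/W line selects the operation."""
    def __init__(self, n_words):
        self.cells = [0] * n_words
        self.mar = self.mdr = 0

    def cycle(self, rw):
        """One memory cycle; rw = 1 means Read, rw = 0 means Write."""
        if rw:
            self.mdr = self.cells[self.mar]   # memory drives the data lines
        else:
            self.cells[self.mar] = self.mdr   # data lines written to memory
        return True                           # MFC: operation completed

mem = Memory(16)
mem.mar, mem.mdr = 5, 42
mem.cycle(0)                          # write 42 to location 5
mem.mar = 5
mem.cycle(1)                          # read it back into MDR
print(mem.mdr)                        # -> 42
```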
If read or write operations involve consecutive address locations in the main memory,
then a ―block transfer‖ operation can be performed in which the only address sent to the
memory is the one that identifies the first location.
Memory accesses may be synchronized using a clock, or they may be controlled using
special signals that control transfers on the bus, using the bus signaling schemes. Memory
read and write operations are controlled as input and output bus transfers, respectively.
A useful measure of the speed of memory units is the time that elapses between the
initiation of an operation and the completion of that operation, for example, the time
between the Read and the MFC signals. This is referred to as the memory access time.
Another important measure is the memory cycle time, which is the minimum time delay
required between the initiation of two successive memory operations, for example, the
time between two successive Read operations. The cycle time is usually slightly longer
than the access time, depending on the implementation details of the memory unit.
A memory unit is called random-access memory (RAM) if any location can be accessed
for a Read or Write operation in some fixed amount of time that is independent of the
location's address. This distinguishes such memory units from serial, or partly serial,
access storage devices such as magnetic disks and tapes. Access time on the latter devices
depends on the address or position of the data.
The basic technology for implementing the memory uses semiconductor integrated
circuits. The sections that follow present some basic facts about the internal structure and
operation of such memories. We then discuss some of the techniques used to increase the
effective speed and size of the memory.
The processor of a computer can usually process instructions and data faster than they
can be fetched from a reasonably priced memory unit. The memory cycle time, then, is
the bottleneck in the system. One way to reduce the memory access time is to use a cache
memory. This is a small, fast memory that is inserted between the larger, slower main
memory and the processor. It holds the currently active segments of a program and their
data.
Virtual memory is another important concept related to memory organization. So far, we
have assumed that the addresses generated by the processor directly specify physical
locations in the memory. This may not always be the case. For reasons that will become
apparent later in this chapter, data may be stored in physical memory locations that have
addresses different from those specified by the program. The memory control circuitry
translates the address specified by the program into an address that can be used to access
the physical memory. In such a case, an address generated by the processor is referred to
as a virtual or logical address. The virtual address space is mapped onto the physical
memory where data are actually stored. The mapping function is implemented by a
special memory control circuit, often called the memory management unit. This mapping
function can be changed during program execution according to system requirements.
Virtual memory is used to increase the apparent size of the physical memory. Data are
addressed in a virtual address space that can be as large as the addressing capability of the
processor. But at any given time, only the active portion of this space is mapped onto
locations in the physical memory. The remaining virtual addresses are mapped onto the
bulk storage devices used, which are usually magnetic disks. As the active portion of the
virtual address space changes during program execution, the memory management unit
changes the mapping function and transfers data between the disk and the memory. Thus,
during every memory cycle, an address-processing mechanism determines whether the
addressed information is in the physical memory unit. If it is, then the proper word is
accessed and execution proceeds. If it is not, a page of words containing the desired word
is transferred from the disk to the memory. This page displaces some page in the memory
that is currently inactive.
Because of the time required to move pages between the disk and the memory, there is a
speed degradation if pages are moved frequently. By judiciously choosing which page to
replace in the memory, however, there may be reasonably long periods when the
probability is high that the words accessed by the processor are in the physical memory
unit.
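The translation and page-fault mechanism described above can be sketched in a few lines of Python. This is an illustrative model only: the page size, page-table contents, and function names are invented for the example, not taken from any particular system.

```python
# Sketch of virtual-to-physical address translation by a memory management
# unit. PAGE_SIZE and the table contents are assumed values for illustration.
PAGE_SIZE = 4096  # bytes per page (a common, but assumed, figure)

def translate(virtual_addr, page_table, memory_pages):
    """Return the physical address, or None to signal a page fault."""
    page_number = virtual_addr // PAGE_SIZE
    offset = virtual_addr % PAGE_SIZE
    frame = page_table.get(page_number)      # mapping maintained by the OS
    if frame is None or frame not in memory_pages:
        return None                          # page fault: page must come from disk
    return frame * PAGE_SIZE + offset
```

With `page_table = {0: 5, 1: 9}` and frames 5 and 9 resident, address 4100 (page 1, offset 4) maps to 9 × 4096 + 4, while any address on page 2 causes a page fault.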
This section has briefly introduced several organizational features of memory systems.
These features have been developed to help provide a computer system with as large and
as fast a memory as can be afforded in relation to the overall cost of the system. We do
not expect the reader to grasp all the ideas or their implications now; more detail is given
later. We introduce these terms together to establish that they are related; a study of their
interrelationships is as important as a detailed study of their individual features.
4.1 Memory Hierarchy
We have already stated that an ideal memory would be fast, large, and inexpensive. It is
clear that a very fast memory can be implemented if SRAM chips are used. But these
chips are expensive because their basic cells have six transistors, which precludes packing a
very large number of cells onto a single chip. Thus, for cost reasons, it is impractical to
build a large memory using SRAM chips. The alternative is to use Dynamic RAM chips,
which have much simpler basic cells and thus are much less expensive. But such
memories are significantly slower.
Although dynamic memory units in the range of hundreds of megabytes can be
implemented at a reasonable cost, the affordable size is still small compared to the
demands of large programs with voluminous data. A solution is provided by using
secondary storage, mainly magnetic disks, to implement large memory spaces. Very large
disks are available at a reasonable price, and they are used extensively in computer
systems. However, they are much slower than the semiconductor memory units. So we
conclude the following: A huge amount of cost-effective storage can be provided by
magnetic disks. A large, yet affordable, main memory can be built with dynamic RAM
technology. This leaves SRAMs to be used in smaller units where speed is of the essence,
such as in cache memories.
All of these different types of memory units are employed effectively in a computer. The
entire computer memory can be viewed as the hierarchy depicted in Figure 5.13. The
fastest access is to data held in processor registers. Therefore, if we consider the registers
to be part of the memory hierarchy, then the processor registers are at the top in terms of
the speed of access. Of course, the registers provide only a minuscule portion of the
required memory.
At the next level of the hierarchy is a relatively small amount of memory that can be
implemented directly on the processor chip. This memory, called a processor cache,
holds copies of instructions and data stored in a much larger memory that is provided
externally. There are often two levels of caches. A primary cache is always located on the
processor chip. This cache is small because it competes for space on the processor chip,
which must implement many other functions. The primary cache is referred to as level 1
(L1) cache. A larger, secondary cache is placed between the primary cache and the rest of
the memory. It is referred to as level 2 (L2) cache. It is usually implemented using SRAM
chips.
Including a primary cache on the processor chip and using a larger, off-chip, secondary
cache is currently the most common way of designing computers. However, other
arrangements can be found in practice. It is possible not to have a cache on the processor
chip at all. Also, it is possible to have both L1 and L2 caches on the processor chip.
The next level in the hierarchy is called the main memory. This rather large memory
is implemented using dynamic memory components, typically in the form of SIMMs,
DIMMs, or RIMMs. The main memory is much larger but significantly slower than the
cache memory. In a typical computer, the access time for the main memory is about ten
times longer than the access time for the L1 cache.
Disk devices provide a huge amount of inexpensive storage. They are very slow
compared to the semiconductor devices used to implement the main memory.
During program execution, the speed of memory access is of utmost importance. The key
to managing the operation of the hierarchical memory system in Figure 5.13 is to bring
the instructions and data that will be used in the near future as close to the processor as
possible. This can be done by using the mechanisms presented in the sections that follow.
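The payoff of keeping soon-to-be-used data close to the processor can be quantified with the usual average-access-time formula. The figures below are illustrative assumptions (a 1 ns L1 cache, a main memory ten times slower as stated above, and a 95% hit rate), not measurements from the text.

```python
def average_access_time(hit_rate, cache_time, memory_time):
    """Average time per access when hits are served by the cache
    and misses fall through to the slower main memory."""
    return hit_rate * cache_time + (1 - hit_rate) * memory_time

# Assumed figures: L1 access 1 ns, main memory 10 ns, 95% hit rate.
# 0.95 * 1 + 0.05 * 10 = 1.45 ns -- far closer to the cache speed
# than to the memory speed, which is the point of the hierarchy.
t = average_access_time(0.95, 1.0, 10.0)
```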
4.2 SEMICONDUCTOR RAM MEMORIES
Semiconductor memories are available in a wide range of speeds. Their cycle times range
from 100 ns to less than 10 ns. When first introduced in the late 1960s, they were much
more expensive than the magnetic-core memories they replaced. Because of rapid
advances in VLSI (Very Large Scale Integration) technology, the cost of semiconductor
memories has dropped dramatically. As a result, they are now used almost exclusively in
implementing memories. In this section, we discuss the main characteristics of
semiconductor memories. We start by introducing the way that a number of memory cells
are organized inside a chip.
4.2.1 INTERNAL ORGANIZATION OF MEMORY CHIPS
Memory cells are usually organized in the form of an array, in which each cell is capable
of storing one bit of information. A possible organization is illustrated in Figure 5.2. Each
row of cells constitutes a memory word, and all cells of a row are connected to a common
line referred to as the word line, which is driven by the address decoder on the chip. The
cells in each column are connected to a Sense/Write circuit by two bit lines. The
Sense/Write circuits are connected to the data input/output lines of the chip. During a
Read operation, these circuits sense, or read, the information stored in the cells selected
by a word line and transmit this information to the output data lines. During a Write
operation, the Sense/Write circuits receive input information and store it in the cells of
the selected word.
Figure 5.2 is an example of a very small memory chip consisting of 16 words of 8 bits
each. This is referred to as a 16 x 8 organization. The data input and the data output of
each Sense/Write circuit are connected to a single bidirectional data line that can be
connected to the data bus of a computer. Two control lines, R/W and CS, are provided in
addition to address and data lines. The R/W (Read/Write) input specifies the required
operation, and the CS (Chip Select) input selects a given chip in a multichip memory
system.
The memory circuit in Figure 5.2 stores 128 bits and requires 14 external connections for
address, data, and control lines. Of course, it also needs two lines for power supply and
ground connections. Consider now a slightly larger memory circuit, one that has 1K
(1024) memory cells. This circuit can be organized as a 128 x 8 memory, requiring a total
of 19 external connections. Alternatively, the same number of cells can be organized into
a 1K x 1 format. In this case, a 10-bit address is needed, but there is only one data line,
resulting in 15 external connections. Figure 5.3 shows such an organization. The required
10-bit address is divided into two groups of 5 bits each to form the row and column
addresses for the cell array. A row address selects a row of 32 cells, all of which are
accessed in parallel. However, according to the column address, only one of these cells is
connected to the external data line by the output multiplexer and input demultiplexer.
Commercially available memory chips contain a much larger number of memory cells
than the examples shown in Figures 5.2 and 5.3. We use small examples to make the
figures easy to understand. Large chips have essentially the same organization as Figure
5.3 but use a larger memory cell array and have more external connections. For example,
a 4M-bit chip may have a 512K x 8 organization, in which case 19 address and 8 data
input/output pins are needed. Chips with a capacity of hundreds of megabits are now
available.
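The pin counts quoted above follow from simple arithmetic, which the sketch below reproduces. Note one assumption in reading the text: the totals of 19 (for 128 x 8) and 15 (for 1K x 1) appear to include the two power and ground lines, while the figure of 14 for the 16 x 8 chip excludes them.

```python
import math

def pin_count(words, bits_per_word):
    """External connections for a memory chip: address lines + bidirectional
    data lines + 2 control (R/W, CS) + 2 for power supply and ground."""
    address_lines = math.ceil(math.log2(words))
    data_lines = bits_per_word          # one bidirectional line per bit
    return address_lines + data_lines + 2 + 2

# 128 x 8: 7 + 8 + 2 + 2 = 19 connections; 1K x 1: 10 + 1 + 2 + 2 = 15.
```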
STATIC MEMORIES
Memories that consist of circuits capable of retaining their state as long as power is applied are
known as static memories. Figure 5.4 illustrates how a static RAM (SRAM) cell may be
implemented. Two inverters are cross-connected to form a latch. The latch is connected to two bit
lines by transistors T1 and T2. These transistors act as switches that can be opened or closed
under control of the word line. When the word line is at ground level, the transistors are turned
off and the latch retains its state. For example, let us assume that the cell is in state 1 if the logic
value at point X is 1 and at point Y is 0. This state is maintained as long as the signal on the word
line is at ground level.
Read Operation
In order to read the state of the SRAM cell, the word line is activated to close switches T1 and
T2. If the cell is in state 1, the signal on bit line b is high and the signal on bit line b‘ is low. The
opposite is true if the cell is in state 0. Thus, b and b‘ are complements of each other. Sense/Write
circuits at the end of the bit lines monitor the state of b and b‘ and set the output accordingly.
Write Operation
The state of the cell is set by placing the appropriate value on bit line b and its
complement on b’, and then activating the word line. This forces the cell into the
corresponding state. The required signals on the bit lines are generated by the
Sense/Write circuit.
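The Read and Write behavior just described can be captured in a small behavioral model. This is a sketch of the cell's logic, not of its circuit: the class and method names are invented, and the word line is reduced to a single enable flag.

```python
class SRAMCell:
    """Behavioral sketch of the static cell of Figure 5.4: a latch isolated
    from the bit lines whenever the word line is at ground level."""

    def __init__(self):
        self.state = 0                  # value held by the cross-coupled inverters

    def read(self, word_line):
        """Return the pair (b, b') on the bit lines, or None if isolated."""
        if not word_line:               # T1/T2 open: cell keeps its state, no output
            return None
        return self.state, 1 - self.state   # b and b' are complements

    def write(self, word_line, b):
        """Bit lines force the latch only while the word line is active."""
        if word_line:
            self.state = b
```

A write with the word line inactive has no effect, mirroring how the latch retains its state while T1 and T2 are turned off.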
4.2.2 ASYNCHRONOUS DRAMS
Static RAMs are fast, but they come at a high cost because their cells require several
transistors. Less expensive RAMs can be implemented if simpler cells are used.
However, such cells do not retain their state indefinitely; hence, they are called dynamic
RAMs (DRAMs).
Information is stored in a dynamic memory cell in the form of a charge on a capacitor,
and this charge can be maintained for only tens of milliseconds. Since the cell is required
to store information for a much longer time, its contents must be periodically refreshed
by restoring the capacitor charge to its full value.
An example of a dynamic memory cell that consists of a capacitor, C, and a transistor, T,
is shown in Figure 5.6. In order to store information in this cell, transistor
T is turned on and an appropriate voltage is applied to the bit line. This causes a known
amount of charge to be stored in the capacitor. After the transistor is turned off, the
capacitor begins to discharge. This is caused by the capacitor‘s own leakage resistance
and by the fact that the transistor continues to conduct a tiny amount of current, measured
in picoamperes, after it is turned off. Hence, the information stored in the cell can be
retrieved correctly only if it is read before the charge on the capacitor drops below some
threshold value. During a Read operation, the transistor in a selected cell is turned on. A
sense amplifier connected to the bit line detects whether the charge stored on the
capacitor is above the threshold value. If so, it drives the bit line to a full voltage that
represents logic value 1. This voltage recharges the capacitor to the full charge that
corresponds to logic value 1. If the sense amplifier detects that the charge on the
capacitor is below the threshold value, it pulls the bit line to ground level, which ensures
that the capacitor will have no charge, representing logic value 0. Thus, reading the
contents of the cell automatically refreshes its contents. All cells in a selected row are
read at the same time, which refreshes the contents of the entire row.
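The leak-then-refresh behavior of a dynamic cell can be sketched as follows. The charge levels, leak rate, and threshold are invented numbers chosen only to make the mechanism visible; a real cell's decay is continuous and measured in tens of milliseconds, as stated above.

```python
class DRAMCell:
    """Sketch of a one-transistor dynamic cell: stored charge leaks away,
    and a Read restores it to full value (refresh). Numbers are illustrative."""
    FULL, THRESHOLD = 1.0, 0.5

    def __init__(self):
        self.charge = 0.0

    def write(self, bit):
        self.charge = self.FULL if bit else 0.0

    def leak(self, fraction=0.1):
        self.charge *= (1 - fraction)   # capacitor leakage between accesses

    def read(self):
        """Sense amplifier: compare against the threshold, then restore."""
        bit = 1 if self.charge > self.THRESHOLD else 0
        self.write(bit)                 # reading automatically refreshes the cell
        return bit
```

A cell read before its charge falls below the threshold is restored to full charge; one read too late returns 0 and is "refreshed" to the wrong value, which is why every row must be accessed periodically.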
A 16-megabit DRAM chip, configured as 2M x 8, is shown in Figure 5.7. The cells are
organized in the form of a 4K x 4K array. The 4096 cells in each row are divided into 512
groups of 8, so that a row can store 512 bytes of data. Therefore, 12 address bits are
needed to select a row. Another 9 bits are needed to specify a group of 8 bits in the
selected row. Thus, a 21-bit address is needed to access a byte in this memory. The high-
order 12 bits and the low-order 9 bits of the address constitute the row and column
addresses of a byte, respectively. To reduce the number of pins needed for external
connections, the row and column addresses are multiplexed on 12 pins. During a Read or
a Write operation, the row address is applied first. It is loaded into the row address latch
in response to a signal pulse on the Row Address Strobe (RAS) input of the chip. Then a
Read operation is initiated, in which all cells on the selected row are read and refreshed.
Shortly after the row address is loaded, the column address is applied to the address pins
and loaded into the column address latch under control of the Column Address Strobe
(CAS) signal. The information in this latch is decoded and the appropriate group of 8
Sense/Write circuits are selected. If the R/W control signal indicates a Read operation,
the output values of the selected circuits are transferred to the data lines, D7−0. For a
Write operation, the information on the D7−0 lines is transferred to the selected circuits.
This information is then used to overwrite the contents of the selected cells in the
corresponding 8 columns. We should note that in commercial DRAM chips, the RAS and
CAS control signals are active low so that they cause the latching of addresses when they
change from high to low. To indicate this fact, these signals are shown on diagrams with an
overbar, as RAS and CAS.
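The 21-bit address split described above is plain bit manipulation, sketched below for the 2M x 8 chip of Figure 5.7 (12 row bits, 9 column bits, sharing 12 address pins).

```python
def split_dram_address(addr):
    """Split the 21-bit byte address of the 2M x 8 chip into the 12-bit row
    address and 9-bit column address that are multiplexed on 12 pins."""
    assert 0 <= addr < 2**21
    row = addr >> 9              # high-order 12 bits, latched by RAS
    column = addr & 0x1FF        # low-order 9 bits, latched by CAS
    return row, column
```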
Applying a row address causes all cells on the corresponding row to be read and
refreshed during both Read and Write operations. To ensure that the contents of a DRAM
are maintained, each row of cells must be accessed periodically. A refresh circuit usually
performs this function automatically. Many dynamic memory chips incorporate a refresh
facility within the chips themselves. In this case, the dynamic nature of these memory
chips is almost invisible to the user.
In the DRAM described in this section, the timing of the memory device is controlled
asynchronously. A specialized memory controller circuit provides the necessary control
signals, RAS and CAS, that govern the timing. The processor must take into account the
delay in the response of the memory. Such memories are referred to as asynchronous
DRAMs.
Because of their high density and low cost, DRAMs are widely used in the memory units
of computers. Available chips range in size from 1M to 256M bits, and even larger chips
are being developed. To reduce the number of memory chips needed in a given computer,
a DRAM chip is organized to read or write a number of bits in parallel, as indicated in
Figure 5.7. To provide flexibility in designing memory systems, these chips are
manufactured in different organizations. For example, a 64-Mbit chip may be organized
as 16M x 4, 8M x 8, or 4M x 16.
4.2.3 Synchronous DRAMs
More recent developments in memory technology have resulted in DRAMs whose
operation is directly synchronized with a clock signal. Such memories are known as
synchronous DRAMs (SDRAMs). Figure 5.8 indicates the structure of an SDRAM. The
cell array is the same as in asynchronous DRAMs. The address and data connections are
buffered by means of registers. We should particularly note that the output of each
sense amplifier is connected to a latch. A Read operation loads the contents of all cells
in the selected row into these latches; an access made only for refreshing purposes
merely refreshes the contents of the cells without changing the latches. Data held in the
latches that correspond to the selected column(s) are transferred into the data output
register, thus becoming available on the data output pins.
SDRAMs have several different modes of operation, which can be selected by
writing control information into a mode register. For example, burst operations of
different lengths can be specified. The burst operations use the block transfer capability
described earlier, known as the fast page mode feature. In SDRAMs, it is not necessary to
provide externally generated pulses on the CAS line to select successive columns. The
necessary control signals are provided internally using a column counter and the clock
signal. New data can be placed on the data lines in each clock cycle. All actions
are triggered by the rising edge of the clock.
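A burst transfer can be sketched as follows: once the row and starting column are latched, an internal counter supplies successive column addresses, one per clock cycle, without further external CAS pulses. The function below is a behavioral sketch with invented names, operating on a cell array represented as a list of rows.

```python
def burst_read(row, start_column, burst_length, array):
    """Sketch of an SDRAM burst read: the internal column counter steps
    through consecutive columns, delivering one data item per clock cycle."""
    data = []
    column = start_column
    for _ in range(burst_length):       # one transfer per rising clock edge
        data.append(array[row][column])
        column += 1                     # internal column counter, no new CAS needed
    return data
```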
STRUCTURE OF LARGER MEMORIES
We have discussed the basic organization of memory circuits as they may be
implemented on a single chip. Next, we should examine how memory chips may be
connected to form a much larger memory.
Static Memory Systems
Consider a memory consisting of 2M (2,097,152) words of 32 bits each. Figure 5.10
shows how we can implement this memory using 512K x 8 static memory chips. Each
column in the figure consists of four chips, which implement one byte position. Four of
these sets provide the required 2M x 32 memory. Each chip has a control input called
Chip Select. When this input is set to 1, it enables the chip to accept data from or to place
data on its data lines. The data output for each chip is of the three-state type. Only the
selected chip places data on the data output line, while all other outputs are in the high-
impedance state. Twenty-one address bits are needed to select a 32-bit word in this
memory. The high-order 2 bits of the address are decoded to determine which of the four
Chip Select control signals should be activated, and the remaining 19 address bits are
used to access specific byte locations inside each chip of the selected row. The R/W
inputs of all chips are tied together to provide a common Read/Write control (not shown
in the figure).
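The address decoding for the 2M x 32 memory of Figure 5.10 can be sketched directly: the high-order 2 bits of the 21-bit address pick one of the four Chip Select lines, and the remaining 19 bits address a byte location inside each 512K x 8 chip of the selected row.

```python
def decode(addr):
    """Return (chip_select, within_chip) for the 2M x 32 memory built
    from rows of four 512K x 8 chips."""
    assert 0 <= addr < 2**21
    chip_select = addr >> 19            # high-order 2 bits: which row of chips
    within_chip = addr & (2**19 - 1)    # 19-bit address inside each chip
    return chip_select, within_chip
```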
Dynamic Memory Systems
The organization of large dynamic memory systems is essentially the same as the
memory shown in Figure 5.10. However, physical implementation is often done more
conveniently in the form of memory modules.
Modern computers use very large memories; even a small personal computer is likely to
have at least 32M bytes of memory. Typical workstations have at least 128M bytes of
memory. A large memory leads to better performance because more of the programs and
data used in processing can be held in the memory, thus reducing the frequency of
accessing the information in secondary storage. However, if a large memory is built by
placing DRAM chips directly on the main system printed-circuit board that contains the
processor, often referred to as a motherboard, it will occupy an unacceptably large
amount of space on the board. Also, it is awkward to provide for future expansion of the
memory, because space must be allocated and wiring provided for the maximum
expected size. These packaging considerations have led to the development of larger
memory units known as SIMMs (Single In-line Memory Modules) and DIMMs (Dual In-
line Memory Modules). Such a module is an assembly of several memory chips on a
separate small board that plugs vertically into a single socket on the motherboard. SIMMs
and DIMMs of different sizes are designed to use the same size socket. For example, 4M
x 32, 16M x 32, and 32M x 32 bit DIMMs all use the same 100-pin socket. Similarly, 8M
x 64, 16M x 64, 32M x 64, and 64M x 72 DIMMs use a 168-pin socket. Such modules
occupy a smaller amount of space on a motherboard, and they allow easy expansion by
replacement if a larger module uses the same socket as the smaller one.
4.3 READ-ONLY MEMORIES (ROMs)
Both SRAM and DRAM chips are volatile, which means that they lose the stored
information if power is turned off. There are many applications that need memory
devices which retain the stored information if power is turned off. For example, in
a typical computer a hard disk drive is used to store a large amount of information,
including the operating system software. When a computer is turned on, the operating
system software has to be loaded from the disk into the memory. This requires execution
of a program that ―boots‖ the operating system. Since the boot program is quite large,
most of it is stored on the disk. The processor must execute some instructions that load
the boot program into the memory. If the entire memory consisted of only volatile
memory chips, the processor would have no means of accessing these instructions. A
practical solution is to provide a small amount of nonvolatile memory that holds the
instructions whose execution results in loading the boot program from the disk.
Nonvolatile memory is used extensively in embedded systems. Such systems typically do
not use disk storage devices. Their programs are stored in nonvolatile semiconductor
memory devices.
Different types of nonvolatile memory have been developed. Generally, the contents of
such memory can be read as if they were SRAM or DRAM memories. But, a special
writing process is needed to place the information into this memory. Since its normal
operation involves only reading of stored data, a memory of this type is called read-only
memory (ROM).
4.3.1 ROM
Figure 5.12 shows a possible configuration for a ROM cell. A logic value 0 is stored in
the cell if the transistor is connected to ground at point P; otherwise, a 1 is stored. The bit
line is connected through a resistor to the power supply. To read the state of the cell, the
word line is activated. Thus, the transistor switch is closed and the voltage on the bit line
drops to near zero if there is a connection between the transistor and ground. If there is no
connection to ground, the bit line remains at the high voltage, indicating a 1. A sense
circuit at the end of the bit line generates the proper output value. Data are written into a
ROM when it is manufactured.
4.3.2 PROM
Some ROM designs allow the data to be loaded by the user, thus providing a
programmable ROM (PROM). Programmability is achieved by inserting a fuse at point P
in Figure 5.12. Before it is programmed, the memory contains all 0s. The user can insert
1s at the required locations by burning out the fuses at these locations using high-current
pulses. Of course, this process is irreversible. PROMs provide flexibility and convenience
not available with ROMs. The latter are economically attractive for storing fixed
programs and data when high volumes of ROMs are produced. However, the cost of
preparing the masks needed for storing a particular information pattern in ROMs makes
them very expensive when only a small number are required. In this case, PROMs
provide a faster and considerably less expensive approach because they can be
programmed directly by the user.
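The defining property of fuse-based programming is that writes are one-way: a location can go from 0 to 1 when its fuse is burned, but never back. The sketch below models that constraint (class and method names are invented for illustration).

```python
class PROM:
    """Sketch of fuse-based PROM programming: the device ships as all 0s,
    and burning a fuse irreversibly changes a 0 to a 1."""

    def __init__(self, size):
        self.bits = [0] * size          # unprogrammed state: all fuses intact

    def burn(self, location):
        """High-current pulse blows the fuse -- there is no way to undo this."""
        self.bits[location] = 1

    def read(self, location):
        return self.bits[location]
```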
4.3.3 EPROM
Another type of ROM chip allows the stored data to be erased and new data to be loaded.
Such an erasable, reprogrammable ROM is usually called an EPROM. It provides
considerable flexibility during the development phase of digital systems. Since EPROMs
are capable of retaining stored information for a long time, they can be used in place of
ROMs while software is being developed. In this way, memory changes and updates
can be easily made.
An EPROM cell has a structure similar to the ROM cell in Figure 5.12. In an EPROM
cell, however, the connection to ground is always made at point P, and a special
transistor is used, which has the ability to function either as a normal transistor or as a
disabled transistor that is always turned off. This transistor can be programmed to
behave as a permanently open switch, by injecting charge into it that becomes trapped
inside. Thus, an EPROM cell can be used to construct a memory in the same way as the
previously discussed ROM cell.
The important advantage of EPROM chips is that their contents can be erased and
reprogrammed. Erasure requires dissipating the charges trapped in the transistors of
memory cells; this can be done by exposing the chip to ultraviolet light. For this reason,
EPROM chips are mounted in packages that have transparent windows.
4.3.4 EEPROM
A significant disadvantage of EPROMs is that a chip must be physically removed from
the circuit for reprogramming and that its entire contents are erased by the ultraviolet
light. It is possible to implement another version of erasable PROMs that can be both
programmed and erased electrically. Such chips, called EEPROMs, do not have to be
removed for erasure. Moreover, it is possible to erase the cell contents selectively. The
only disadvantage of EEPROMs is that different voltages are needed for erasing, writing,
and reading the stored data.
FLASH MEMORY
An approach similar to EEPROM technology has more recently given rise to flash
memory devices. A flash cell is based on a single transistor controlled by trapped charge,
just like an EEPROM cell. While similar in some respects, there are also substantial
differences between flash and EEPROM devices. In EEPROM it is possible to read and
write the contents of a single cell. In a flash device it is possible to read the contents
of a single cell, but it is only possible to write an entire block of cells. Prior to writing,
the previous contents of the block are erased. Flash devices have greater density, which
leads to higher capacity and a lower cost per bit. They require a single power supply
voltage, and consume less power in their operation.
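The read/write asymmetry of flash can be made concrete with a short sketch: single cells can be read individually, but a write replaces a whole block, which must be erased first. The block size of 4 cells is an invented figure chosen only for readability; real flash blocks are far larger.

```python
class Flash:
    """Sketch of the flash constraint above: cell-level reads, but writes
    only at block granularity, with an erase before each program step."""
    BLOCK = 4                           # cells per block (assumed, for illustration)

    def __init__(self, blocks):
        self.cells = [0] * (blocks * self.BLOCK)

    def read(self, addr):
        """Reading a single cell is allowed, as in EEPROM."""
        return self.cells[addr]

    def write_block(self, block_no, data):
        """Writing replaces an entire block: erase first, then program."""
        assert len(data) == self.BLOCK
        base = block_no * self.BLOCK
        self.cells[base:base + self.BLOCK] = [0] * self.BLOCK   # erase
        self.cells[base:base + self.BLOCK] = list(data)         # program
```

Each erase/program cycle wears the cells, which is why the number of times a block can be rewritten is bounded, as noted below.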
The low power consumption of flash memory makes it attractive for use in portable
equipment that is battery driven. Typical applications include hand-held computers, cell
phones, digital cameras, and MP3 music players. In hand-held computers and cell
phones, flash memory holds the software needed to operate the equipment, thus obviating
the need for a disk drive. In digital cameras, flash memory is used to store picture image
data. In MP3 players, flash memory stores the data that represent sound.
Cell phones, digital cameras, and MP3 players are good examples of embedded systems.
Single flash chips do not provide sufficient storage capacity for the applications
mentioned above. Larger memory modules consisting of a number of chips are needed.
There are two popular choices for the implementation of such modules: flash cards and
flash drives.
Flash Cards
One way of constructing a larger module is to mount flash chips on a small card.
Such flash cards have a standard interface that makes them usable in a variety of
products. A card is simply plugged into a conveniently accessible slot. Flash cards
come in a variety of memory sizes. Typical sizes are 8, 32, and 64 Mbytes. A minute
of music can be stored in about 1 Mbyte of memory, using the MP3 encoding format.
Hence, a 64-MB flash card can store an hour of music.
Flash Drives
Larger flash memory modules have been developed to replace hard disk drives.
These flash drives are designed to fully emulate the hard disks, to the point that they
can be fitted into standard disk drive bays. However, the storage capacity of flash drives
is significantly lower. Currently, the capacity of flash drives is less than one gigabyte. In
contrast, hard disks can store many gigabytes.
The fact that flash drives are solid-state electronic devices that have no movable parts
provides some important advantages. They have shorter seek and access times, which
results in faster response. They have lower power consumption, which makes them
attractive for battery driven applications, and they are also insensitive to vibration.
The disadvantages of flash drives relative to hard disk drives are their smaller capacity and
higher cost per bit. Disks provide an extremely low cost per bit. Another disadvantage is that the
flash memory will deteriorate after it has been written a number of times. Fortunately,
this number is high, typically at least one million times.
4.4 MEMORY SYSTEM CONSIDERATIONS
The choice of a RAM chip for a given application depends on several factors. Foremost
among these factors are the cost, speed, power dissipation, and size of the chip.
Static RAMs are generally used only when very fast operation is the primary
requirement. Their cost and size are adversely affected by the complexity of the circuit
that realizes the basic cell. They are used mostly in cache memories. Dynamic RAMs are
the predominant choice for implementing computer main memories. The high densities
achievable in these chips make large memories economically feasible.
Memory Controller
To reduce the number of pins, the dynamic memory chips use multiplexed address inputs.
The address is divided into two parts. The high-order address bits, which select a row in
the cell array, are provided first and latched into the memory chip under control of the
RAS signal. Then, the low-order address bits, which select a column, are provided on the
same address pins and latched using the CAS signal.
A typical processor issues all bits of an address at the same time. The required
multiplexing of address bits is usually performed by a memory controller circuit, which is
interposed between the processor and the dynamic memory as shown in Figure 5.11.
The controller accepts a complete address and the R/W signal from the processor, under
control of a Request signal which indicates that a memory access operation is needed.
The controller then forwards the row and column portions of the address to the memory
and generates the RAS and CAS signals. Thus, the controller provides the RAS-CAS
timing, in addition to its address multiplexing function. It also sends the R/W and CS
signals to the memory. The CS signal is usually active low; hence it is shown with an overbar in
Figure 5.11. Data lines are connected directly between the processor and the memory.
Note that the clock signal is needed in SDRAM chips.
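The address-splitting step the controller performs can be sketched as follows. The 24-bit address width and the equal 12/12 row/column split are illustrative assumptions, not values from the text; real chips use whatever widths their cell array requires.

```python
# Sketch of the address multiplexing a memory controller performs.
# The high-order bits (the row) are presented first and latched by RAS;
# the low-order bits (the column) follow on the same pins, latched by CAS.

ROW_BITS = 12   # assumed row-address width
COL_BITS = 12   # assumed column-address width

def split_address(addr):
    """Return (row, column) in the order the controller presents them."""
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)
    col = addr & ((1 << COL_BITS) - 1)
    return row, col

row, col = split_address(0xABC123)   # row = 0xABC, column = 0x123
```

The key point is that both halves travel over the same physical address pins, which is what halves the pin count.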
When used with DRAM chips, which do not have self-refreshing capability, the memory
controller has to provide all the information needed to control the refreshing process. It
contains a refresh counter that provides successive row addresses. Its function is to cause
the refreshing of all rows to be done within the period specified for a particular device.
Refresh Overhead
All dynamic memories have to be refreshed. In older DRAMs, a typical period for
refreshing all rows was 16 ms. In typical SDRAMs, this period is 64 ms.
Consider an SDRAM whose cells are arranged in 8K (=8 192) rows. Suppose that it takes
four clock cycles to access (read) each row. Then, it takes 8192 x 4 = 32,768 cycles to
refresh all rows. At a clock rate of 133 MHz, the time needed to refresh all rows is
32,768/(133 x 10^6) = 246 x 10^-6 seconds. Thus, the refreshing process occupies 0.246 ms in
each 64-ms time interval. Therefore, the refresh overhead is 0.246/64 ≈ 0.0038, which is
less than 0.4 percent of the total time available for accessing the memory.
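The refresh-overhead arithmetic above can be checked with a short calculation:

```python
# Refresh overhead for the SDRAM example in the text:
# 8K rows, 4 clock cycles to refresh each row, 133 MHz clock, 64 ms period.

rows = 8192
cycles_per_row = 4
clock_hz = 133e6
refresh_period_s = 64e-3

refresh_time_s = rows * cycles_per_row / clock_hz   # about 246 microseconds
overhead = refresh_time_s / refresh_period_s        # about 0.0038
```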
4.5 ASSOCIATIVE MEMORY
Associative memory (content-addressable memory, CAM) is a memory that is capable
of determining whether a given datum, the search word, is contained in one of its
addresses or locations. This may be accomplished by a number of mechanisms. In some
cases parallel combinational logic is applied at each word in the memory and a test is
made simultaneously for coincidence with the search word. In other cases the search
word and all of the words in the memory are shifted serially in synchronism; a single bit
of the search word is then compared to the same bit of all of the memory words using as
many single-bit coincidence circuits as there are words in the memory. Extensions of
the associative memory technique allow for masking the search word or requiring only a
"close" match as opposed to an exact match. Small parallel associative memories are
used in cache memory and virtual memory mapping applications.
Since parallel operations on many words are expensive (in hardware), a variety of
stratagems are used to approximate associative memory operation without actually
carrying out the full test described here. One of these uses hashing to generate a "best
guess" for a conventional address followed by a test of the contents of that address.
Some associative memories have been built to be accessed conventionally (by words in
parallel) and as serial comparison associative memories; these have been called
orthogonal memories. See also associative addressing, associative processor.
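A behavioral sketch of an associative search with masking, as described above. The hardware compares every stored word against the search word in parallel; software can only model that parallelism with a loop. The word width and stored contents are made up for illustration.

```python
# Model of a content-addressable search. mask selects which bit
# positions participate in the comparison (1 = compare this bit),
# which is the masking feature mentioned in the text.

def cam_search(words, search_word, mask=0xF):
    """Return the indices of all stored words that match search_word
    in the bit positions selected by mask."""
    return [i for i, w in enumerate(words)
            if (w & mask) == (search_word & mask)]

memory = [0b1010, 0b1011, 0b0110, 0b1010]

exact = cam_search(memory, 0b1010)                # full 4-bit match
masked = cam_search(memory, 0b1010, mask=0b1110)  # ignore lowest bit
```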
4.6 VIRTUAL MEMORIES
In most modern computer systems, the physical main memory is not as large as the
address space spanned by an address issued by the processor. For example, a processor
that issues 32-bit addresses has an addressable space of 4G bytes. The size of the main
memory in a typical computer ranges from a few hundred megabytes to 1G bytes. When
a program does not completely fit into the main memory, the parts of it not currently
being executed are stored on secondary storage devices, such as magnetic disks. Of
course, all parts of a program that are eventually executed are first brought into the
main memory. When a new segment of a program is to be moved into a full memory, it
must replace another segment already in the memory. In modern computers, the operating
system moves programs and data automatically between the main memory and secondary
storage. Thus, the application programmer does not need to be aware of limitations
imposed by the available main memory.
Techniques that automatically move program and data blocks into the physical main
memory when they are required for execution are called virtual-memory techniques.
Programs, and hence the processor, reference an instruction and data space that is
independent of the available physical main memory space. The binary addresses that the
processor issues for either instructions or data are called virtual or logical addresses.
These addresses are translated into physical addresses by a combination of hardware and
software components. If a virtual address refers to a part of the program or data space that
is currently in the physical memory, then the contents of the appropriate location in the
main memory are accessed immediately. On the other hand, if the referenced address is
not in the main memory, its contents must be brought into a suitable location in the
memory before they can be used.
Figure 5.26 shows a typical organization that implements virtual memory. A special
hardware unit, called the Memory Management Unit (MMU), translates virtual addresses
into physical addresses. When the desired data (or instructions) are in the main memory,
these data are fetched as described in our presentation of the cache mechanism. If the
data are not in the main memory, the MMU causes the operating system to bring the data
into the memory from the disk. Transfer of data between the disk and the main memory is
performed using the DMA scheme.
ADDRESS TRANSLATION
A simple method for translating virtual addresses into physical addresses is to assume
that all programs and data are composed of fixed-length units called pages, each of which
consists of a block of words that occupy contiguous locations in the main memory. Pages
commonly range from 2K to 16K bytes in length. They constitute the basic unit of
information that is moved between the main memory and the disk whenever the
translation mechanism determines that a move is required. Pages should not be too small,
because the access time of a magnetic disk is much longer (several milliseconds) than the
access time of the main memory. The reason for this is that it takes a considerable
amount of time to locate the data on the disk, but once located, the data can be transferred
at a rate of several megabytes per second. On the other hand, if pages are too large it is
possible that a substantial portion of a page may not be used, yet this unnecessary data
will occupy valuable space in the main memory.
This discussion clearly parallels the concepts introduced in Section 5.5 on cache memory.
The cache bridges the speed gap between the processor and the main memory and is
implemented in hardware. The virtual-memory mechanism bridges the size and speed
gaps between the main memory and secondary storage and is usually implemented in part
by software techniques. Conceptually, cache techniques and virtual-memory techniques
are very similar. They differ mainly in the details of their implementation.
A virtual-memory address translation method based on the concept of fixed-length pages
is shown schematically in Figure 5.27. Each virtual address generated by the processor,
whether it is for an instruction fetch or an operand fetch/store operation, is interpreted as
a virtual page number (high-order bits) followed by an offset (low-order bits) that
specifies the location of a particular byte (or word) within a page. Information about the
main memory location of each page is kept in a page table. This information includes the
main memory address where the page is stored and the current status of the page. An area
in the main memory that can hold one page is called a page frame. The starting address of
the page table is kept in a page table base register. By adding the virtual page number to
the contents of this register, the address of the corresponding entry in the page table is
obtained. The contents of this location give the starting address of the page if that page
currently resides in the main memory.
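The lookup just described can be sketched in a few lines. The page size, the page-table contents, and the addresses are illustrative values, and the table is modeled as a dictionary rather than a region of main memory indexed from a base register.

```python
# Sketch of page-based address translation: split the virtual address
# into (virtual page number, offset), look the page up, and add the
# offset to the page's starting address in main memory.

PAGE_SIZE = 4096          # assumed 4K-byte pages -> 12-bit offset
OFFSET_BITS = 12

# page_table[virtual page number] = page-frame start address (page-aligned)
page_table = {0: 0x8000, 1: 0x3000, 5: 0xC000}

def translate(vaddr):
    vpage = vaddr >> OFFSET_BITS
    offset = vaddr & (PAGE_SIZE - 1)
    if vpage not in page_table:
        raise KeyError("page fault")   # page not resident in main memory
    return page_table[vpage] + offset

paddr = translate(0x1234)   # vpage 1, offset 0x234 -> 0x3234
```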
Each entry in the page table also includes some control bits that describe the status of the
page while it is in the main memory. One bit indicates the validity of the page, that is,
whether the page is actually loaded in the main memory. This bit allows the operating
system to invalidate the page without actually removing it. Another bit indicates whether
the page has been modified during its residency in the memory. As in cache memories,
this information is needed to determine whether the page should be written back to the
disk before it is removed from the main memory to make room for another page. Other
control bits indicate various restrictions that may be imposed on accessing the page. For
example, a program may be given full read and write permission, or it may be restricted
to read accesses only.
The page table information is used by the MMU for every read and write access, so
ideally, the page table should be situated within the MMU. Unfortunately, the page table
may be rather large, and since the MMU is normally implemented as part of the processor
chip (along with the primary cache), it is impossible to include a complete page table on
this chip. Therefore, the page table is kept in the main memory. However, a copy of a
small portion of the page table can be accommodated within the MMU.
This portion consists of the page table entries that correspond to the most recently
accessed pages. A small cache, usually called the Translation Lookaside Buffer (TLB), is
incorporated into the MMU for this purpose. The operation of the TLB with respect to the
page table in the main memory is essentially the same as the operation we have discussed
in conjunction with the cache memory. In addition to the information that constitutes a
page table entry, the TLB must also include the virtual address of the entry. Figure 5.28
shows a possible organization of a TLB where the associative-mapping technique is
used. Set-associative mapped TLBs are also found in commercial products.
An essential requirement is that the contents of the TLB be coherent with the contents of
page tables in the memory. When the operating system changes the contents of page
tables, it must simultaneously invalidate the corresponding entries in the TLB. One of the
control bits in the TLB is provided for this purpose. When an entry is invalidated, the
TLB will acquire the new information as part of the MMU‘s normal response to access
misses.
Address translation proceeds as follows. Given a virtual address, the MMU looks in the
TLB for the referenced page. If the page table entry for this page is found in the TLB, the
physical address is obtained immediately. If there is a miss in the TLB, then the required
entry is obtained from the page table in the main memory and the TLB is updated.
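This try-the-TLB-first, fall-back-to-memory flow can be modeled directly. The table contents and the hit/miss bookkeeping are illustrative; a real TLB also has limited capacity and an eviction policy, which this sketch omits.

```python
# Sketch of the TLB lookup flow: consult the small TLB first; on a miss,
# read the entry from the page table in main memory and update the TLB.

page_table = {0: 10, 1: 22, 2: 7, 3: 15}   # vpage -> frame number
tlb = {}                                    # small on-chip copy of entries
stats = {"hits": 0, "misses": 0}

def lookup(vpage):
    if vpage in tlb:
        stats["hits"] += 1
    else:
        stats["misses"] += 1
        tlb[vpage] = page_table[vpage]      # slow path: page table in memory
    return tlb[vpage]

frames = [lookup(v) for v in [1, 1, 2, 1]]  # second and fourth are TLB hits
```

Because of locality of reference, successive accesses tend to hit the same few pages, which is why even a small TLB captures most translations.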
When a program generates an access request to a page that is not in the main memory, a
page fault is said to have occurred. The whole page must be brought from the disk into
the memory before access can proceed. When it detects a page fault, the MMU asks the
operating system to intervene by raising an exception (interrupt). Processing of the active
task is interrupted, and control is transferred to the operating system. The operating
system then copies the requested page from the disk into the main memory and returns
control to the interrupted task. Because a long delay occurs while the page transfer takes
place, the operating system may suspend execution of the task that caused the page fault
and begin execution of another task whose pages are in the main memory.
It is essential to ensure that the interrupted task can continue correctly when it resumes
execution. A page fault occurs when some instruction accesses a memory operand that is
not in the main memory, resulting in an interruption before the execution of this
instruction is completed. Hence, when the task resumes, either the execution of the
interrupted instruction must continue from the point of interruption, or the instruction
must be restarted. The design of a particular processor dictates which of these options
should be used.
If a new page is brought from the disk when the main memory is full, it must replace one
of the resident pages. The problem of choosing which page to remove is just as critical
here as it is in a cache, and the idea that programs spend most of their time in a few
localized areas also applies. Because main memories are considerably larger than cache
memories, it should be possible to keep relatively larger portions of a program in the
main memory. This will reduce the frequency of transfers to and from the disk. Concepts
similar to the LRU replacement algorithm can be applied to page replacement, and the
control bits in the page table entries can indicate usage. One simple scheme is based on a
control bit that is set to 1 whenever the corresponding page is referenced (accessed). The
operating system occasionally clears this bit in all page table entries, thus providing a
simple way of determining which pages have not been used recently.
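The referenced-bit scheme in the last sentence can be sketched as follows; the page names are illustrative.

```python
# Sketch of the referenced-bit approximation to LRU: the bit is set on
# every access, and the operating system periodically clears all bits.
# Pages whose bit is still 0 at sweep time have not been used recently.

ref_bit = {p: 0 for p in ["A", "B", "C", "D"]}

def access(page):
    ref_bit[page] = 1

def os_sweep():
    """Report unreferenced pages, then clear all bits for the next interval."""
    unused = [p for p, bit in ref_bit.items() if bit == 0]
    for p in ref_bit:
        ref_bit[p] = 0
    return unused

access("A")
access("C")
not_recently_used = os_sweep()   # candidates for replacement: B and D
```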
A modified page has to be written back to the disk before it is removed from the main
memory. It is important to note that the write-through protocol, which is useful in the
framework of cache memories, is not suitable for virtual memory. The access time of the
disk is so long that it does not make sense to access it frequently to write small amounts
of data.
The address translation process in the MMU requires some time to perform, mostly
dependent on the time needed to look up entries in the TLB. Because of locality of
reference, it is likely that many successive translations involve addresses on the same
page. This is particularly evident in fetching instructions. Thus, we can reduce the
average translation time by including one or more special registers that retain the virtual
page number and the physical page frame of the most recently performed translations.
The information in these registers can be accessed more quickly than the TLB.
4.7 CACHE MEMORIES
The speed of the main memory is very low in comparison with the speed of modern
processors. For good performance, the processor cannot spend much of its time waiting
to access instructions and data in main memory. Hence, it is important to devise a scheme
that reduces the time needed to access the necessary information. Since the speed of the
main memory unit is limited by electronic and packaging constraints, the solution must
be sought in a different architectural arrangement. An efficient solution is to use a fast
cache memory which essentially makes the main memory appear to the processor to be
faster than it really is.
The effectiveness of the cache mechanism is based on a property of computer programs
called locality of reference. Analysis of programs shows that most of their execution time
is spent on routines in which many instructions are executed repeatedly. These
instructions may constitute a simple loop, nested loops, or a few procedures that
repeatedly call each other. The actual detailed pattern of instruction sequencing is not
important — the point is that many instructions in localized areas of the program are
executed repeatedly during some time period, and the remainder of the program is
accessed relatively infrequently. This is referred to as locality of reference. It manifests
itself in two ways: temporal and spatial. The first means that a recently executed
instruction is likely to be executed again very soon. The spatial aspect means that
instructions in close proximity to a recently executed instruction (with respect to the
instructions‘ addresses) are also likely to be executed soon.
If the active segments of a program can be placed in a fast cache memory, then the total
execution time can be reduced significantly. Conceptually, operation of a cache memory
is very simple. The memory control circuitry is designed to take advantage of the
property of locality of reference. The temporal aspect of the locality of reference suggests
that whenever an information item (instruction or data) is first needed, this item should be
brought into the cache where it will hopefully remain until it is needed again. The spatial
aspect suggests that instead of fetching just one item from the main memory to the cache,
it is useful to fetch several items that reside at adjacent addresses as well. We will use the
term block to refer to a set of contiguous address locations of some size. Another term
that is often used to refer to a cache block is cache line.
Consider the simple arrangement in Figure 5.14. When a Read request is received from
the processor, the contents of a block of memory words containing the location
specified are transferred into the cache one word at a time. Subsequently,
when the program references any of the locations in this block, the desired contents are
read directly from the cache. Usually, the cache memory can store a reasonable number
of blocks at any given time, but this number is small compared to the total number of
blocks in the main memory. The correspondence between the main memory blocks and
those in the cache is specified by a mapping function. When the cache is full and a
memory word (instruction or data) that is not in the cache is referenced, the cache control
hardware must decide which block should be removed to create space for the new block
that contains the referenced word. The collection of rules for making this decision
constitutes the replacement algorithm.
The processor does not need to know explicitly about the existence of the cache. It simply
issues Read and Write requests using addresses that refer to locations in the memory. The
cache control circuitry determines whether the requested word currently exists in the
cache. If it does, the Read or Write operation is performed on the appropriate cache
location. In this case, a read or write hit is said to have occurred. In a Read operation, the
main memory is not involved. For a Write operation, the system can proceed in two
ways. In the first technique, called the write-through protocol, the cache location and the
main memory location are updated simultaneously. The second technique is to update
only the cache location and to mark it as updated with an associated flag bit, often called
the dirty or modified bit. The main memory location of the word is updated later, when
the block containing this marked word is to be removed from the cache to make room for
a new block. This technique is known as the write-back, or copy-back, protocol. The
write-through protocol is simpler, but it results in unnecessary Write operations in the
main memory when a given cache word is updated several times during its cache
residency. Note that the write-back protocol may also result in unnecessary Write
operations because when a cache block is written back to the memory all words of the
block are written back, even if only a single word has been changed while the block was
in the cache.
When the addressed word in a Read operation is not in the cache, a read miss occurs. The
block of words that contains the requested word is copied from the main memory into the
cache. After the entire block is loaded into the cache, the particular word requested is
forwarded to the processor. Alternatively, this word may be sent to the processor as soon
as it is read from the main memory. The latter approach, which is called load-through, or
early restart, reduces the processor‘s waiting period somewhat, but at the expense of
more complex circuitry.
During a Write operation, if the addressed word is not in the cache, a write miss occurs.
Then, if the write-through protocol is used, the information is written directly into the
main memory. In the case of the write-back protocol, the block containing the addressed
word is first brought into the cache, and then the desired word in the cache is overwritten
with the new information.
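The behavioral difference between the two write policies can be sketched for a single cached word (this is not a full cache model; addresses and values are illustrative):

```python
# Minimal contrast of write-through vs. write-back for one cached word.

main_memory = {0x10: 5}
cache = {0x10: {"data": 5, "dirty": False}}

def write(addr, value, policy):
    cache[addr]["data"] = value
    if policy == "write-through":
        main_memory[addr] = value        # memory updated immediately
    else:                                # write-back
        cache[addr]["dirty"] = True      # memory updated only on eviction

def evict(addr):
    if cache[addr]["dirty"]:
        main_memory[addr] = cache[addr]["data"]   # deferred write-back
    del cache[addr]

write(0x10, 9, "write-back")
stale = main_memory[0x10]   # still 5: the memory copy is temporarily stale
evict(0x10)                 # now main memory holds 9
```

The stale intermediate value is exactly why a write-back system must flush dirty blocks before a DMA device reads main memory directly.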
4.8 MAPPING FUNCTIONS
To discuss possible methods for specifying where memory blocks are placed in the cache,
we use a specific small example. Consider a cache consisting of 128 blocks of 16 words
each, for a total of 2048 (2K) words, and assume that the main memory is addressable by
a 16-bit address. The main memory has 64K words, which we will view as 4K blocks of
16 words each. For simplicity, we will assume that consecutive addresses refer to
consecutive words.
4.8.1 Direct Mapping
The simplest way to determine cache locations in which to store memory blocks
is the direct-mapping technique. In this technique, block j of the main memory maps onto
block j modulo 128 of the cache, as depicted in Figure 5.15. Thus, whenever one of the
main memory blocks 0, 128, 256, ... is loaded in the cache, it is stored in cache block 0.
Blocks 1, 129, 257, ... are stored in cache block 1, and so on. Since more than one
memory block is mapped onto a given cache block position, contention may arise for that
position even when the cache is not full. For example, instructions of a program may start
in block 1 and continue in block 129, possibly after a branch. As this program is
executed, both of these blocks must be transferred to the block-1 position in the cache.
Contention is resolved by allowing the new block to overwrite the currently resident
block. In this case, the replacement algorithm is trivial.
Placement of a block in the cache is determined from the memory address. The memory
address can be divided into three fields, as shown in Figure 5.15. The low-order 4 bits
select one of 16 words in a block. When a new block enters the cache, the 7-bit cache
block field determines the cache position in which this block must be stored. The high-
order 5 bits of the memory address of the block are stored in 5 tag bits associated with its
location in the cache. They identify which of the 32 blocks that are mapped into this
cache position are currently resident in the cache. As execution proceeds, the 7-bit cache
block field of each address generated by the processor points to a particular block
location in the cache. The high-order 5 bits of the address are compared with the tag bits
associated with that cache location. If they match, then the desired word is in that block
of the cache. If there is no match, then the block containing the required word must first
be read from the main memory and loaded into the cache. The direct-mapping technique
is easy to implement, but it is not very flexible.
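The field extraction for this example (16-bit address = 5-bit tag, 7-bit block, 4-bit word) can be written out directly:

```python
# Address-field extraction for the direct-mapped cache in the text:
# 128 cache blocks of 16 words, 16-bit main-memory addresses.

def fields(addr):
    word = addr & 0xF             # low-order 4 bits: word within block
    block = (addr >> 4) & 0x7F    # next 7 bits: cache block position
    tag = (addr >> 11) & 0x1F     # high-order 5 bits: tag
    return tag, block, word

# Memory blocks 1 and 129 contend for cache block 1, as in the text's
# example (a block's first word is at address block_number * 16).
t1, b1, _ = fields(1 * 16)
t2, b2, _ = fields(129 * 16)
```

Blocks 1 and 129 yield the same block field (1) but different tags (0 and 1), which is how the cache tells them apart.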
4.8.2 Associative Mapping
Figure 5.16 shows a much more flexible mapping method, in which a main memory
block can be placed into any cache block position. In this case, 12 tag bits are required to
identify a memory block when it is resident in the cache. The tag bits of an address
received from the processor are compared to the tag bits of each block of the cache to see
if the desired block is present. This is called the associative-mapping technique. It gives
complete freedom in choosing the cache location in which to place the memory block.
Thus, the space in the cache can be used more efficiently. A new block that has to be
brought into the cache has to replace (eject) an existing block only if the cache is full. In
this case, we need an algorithm to select the block to be replaced. Many replacement
algorithms are possible. The cost of an associative cache is higher than the cost of a
direct-mapped cache because of the need to search all 128 tag patterns to determine
whether a given block is in the cache. A search of this kind is called an associative
search. For performance reasons, the tags must be searched in parallel.
4.8.3 Set-Associative Mapping
A combination of the direct- and associative-mapping techniques can be used. Blocks of
the cache are grouped into sets, and the mapping allows a block of the main memory to
reside in any block of a specific set. Hence, the contention problem of the direct method
is eased by having a few choices for block placement. At the same time, the hardware
cost is reduced by decreasing the size of the associative search. An example of this set-
associative-mapping technique is shown in Figure 5.17 for a cache with two blocks per
set. In this case, memory blocks 0, 64, 128, ..., 4032 map into cache set 0, and they can
occupy either of the two block positions within this set. Having 64 sets means that the 6-
bit set field of the address determines which set of the cache might contain the desired
block. The tag field of the address must then be associatively compared to the tags of the
two blocks of the set to check if the desired block is present. This two-way associative
search is simple to implement.
The number of blocks per set is a parameter that can be selected to suit the requirements
of a particular computer. For the main memory and cache sizes in Figure 5.17, four
blocks per set can be accommodated by a 5-bit set field, eight blocks per set by a 4-bit set
field, and so on. The extreme condition of 128 blocks per set requires no set bits and
corresponds to the fully associative technique, with 12 tag bits. The other extreme of one
block per set is the direct-mapping method. A cache that has k blocks per set is referred to
as a k-way set-associative cache.
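The trade-off described in this paragraph, between set-field width and tag width as the associativity k varies, can be tabulated for the example cache:

```python
# Set-field and tag-field widths as a function of blocks per set (k),
# for the example cache: 128 blocks, 16 words/block, 16-bit addresses.

import math

CACHE_BLOCKS = 128
WORD_BITS = 4        # 16 words per block
ADDR_BITS = 16

def field_widths(k):
    """For a k-way set-associative cache, return (set_bits, tag_bits)."""
    sets = CACHE_BLOCKS // k
    set_bits = int(math.log2(sets))
    tag_bits = ADDR_BITS - WORD_BITS - set_bits
    return set_bits, tag_bits
```

As the text states, k = 1 gives direct mapping (7 set bits, 5 tag bits), k = 2 gives a 6-bit set field, and k = 128 gives the fully associative case (no set bits, 12 tag bits).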
One more control bit, called the valid bit, must be provided for each block. This bit
indicates whether the block contains valid data. It should not be confused with the
modified, or dirty, bit mentioned earlier. The dirty bit, which indicates whether the block
has been modified during its cache residency, is needed only in systems that do not use
the write-through method. The valid bits are all set to 0 when power is initially applied to
the system or when the main memory is loaded with new programs and data from the
disk. Transfers from the disk to the main memory are carried out by a DMA mechanism.
Normally, they bypass the cache for both cost and performance reasons. The valid bit of a
particular cache block is set to 1 the first time this block is loaded from the main memory.
Whenever a main memory block is updated by a source that bypasses the cache, a check
is made to determine whether the block being loaded is currently in the cache. If it is, its
valid bit is cleared to 0. This ensures that stale data will not exist in the cache.
A similar difficulty arises when a DMA transfer is made from the main memory
to the disk, and the cache uses the write-back protocol. In this case, the data in the
memory might not reflect the changes that may have been made in the cached copy.
One solution to this problem is to flush the cache by forcing the dirty data to be written
back to the memory before the DMA transfer takes place. The operating system can do
this easily, and it does not affect performance greatly, because such disk transfers do not
occur often. This need to ensure that two different entities (the processor and DMA
subsystems in this case) use the same copies of data is referred to as a cache-coherence
problem.
4.9 REPLACEMENT ALGORITHMS
In a direct-mapped cache, the position of each block is predetermined; hence, no
replacement strategy exists. In associative and set-associative caches there exists some
flexibility. When a new block is to be brought into the cache and all the positions that it
may occupy are full, the cache controller must decide which of the old blocks to
overwrite. This is an important issue because the decision can be a strong determining
factor in system performance. In general, the objective is to keep blocks in the cache that
are likely to be referenced in the near future. However, it is not easy to determine which
blocks are about to be referenced. The property of locality of reference in programs gives
a clue to a reasonable strategy. Because programs usually stay in localized areas for
reasonable periods of time, there is a high probability that the blocks that have been
referenced recently will be referenced again soon. Therefore, when a block is to be
overwritten, it is sensible to overwrite the one that has gone the longest time without
being referenced. This block is called the least recently used (LRU) block, and the
technique is called the LRU replacement algorithm.
To use the LRU algorithm, the cache controller must track references to all blocks as
computation proceeds. Suppose it is required to track the LRU block of a four-block set
in a set-associative cache. A 2-bit counter can be used for each block. When a hit occurs,
the counter of the block that is referenced is set to 0. Counters with values originally
lower than the referenced one are incremented by one, and all others remain unchanged.
When a miss occurs and the set is not full, the counter associated with the new block
loaded from the main memory is set to 0, and the values of all other counters are
increased by one. When a miss occurs and the set is full, the block with the counter value
3 is removed, the new block is put in its place, and its counter is set to 0. The other three
block counters are incremented by one. It can be easily verified that the counter values of
occupied blocks are always distinct.
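The 2-bit-counter scheme just described can be simulated for one four-block set; the block names are illustrative.

```python
# Simulation of the 2-bit-counter LRU scheme for a four-block set:
# counter 0 = most recently used, counter 3 = least recently used.

counters = {}   # block -> counter value

def reference(block):
    """Apply the counter-update rules; return the evicted block, if any."""
    if block in counters:                          # hit
        old = counters[block]
        for b in counters:                         # bump only lower counters
            if counters[b] < old:
                counters[b] += 1
        counters[block] = 0
        return None
    if len(counters) == 4:                         # miss, set full: evict LRU
        victim = next(b for b, c in counters.items() if c == 3)
        del counters[victim]
    else:
        victim = None                              # miss, set not full
    for b in counters:
        counters[b] += 1
    counters[block] = 0
    return victim

for blk in ["A", "B", "C", "D"]:
    reference(blk)
evicted = reference("E")   # set is full; "A" has counter 3 and is evicted
```

Note that after every reference the occupied blocks hold the distinct counter values 0 through 3, as the text observes.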
The LRU algorithm has been used extensively. Although it performs well for many
access patterns, it can lead to poor performance in some cases. For example, it produces
disappointing results when accesses are made to sequential elements of an array that is
that is slightly too large to fit into the cache (see Section 5.5.3 and Problem 5.12).
The performance of the LRU algorithm can be improved by introducing a small amount
of randomness in deciding which block to replace.
Several other replacement algorithms are also used in practice. An intuitively reasonable
rule would be to remove the "oldest" block from a full set when a new block must be
brought in. However, because this algorithm does not take into account the recent pattern
of access to blocks in the cache, it is generally not as effective as the LRU algorithm in
choosing the best blocks to remove. The simplest algorithm is to randomly choose the
block to be overwritten. Interestingly enough, this simple algorithm has been found to be
quite effective in practice.
4.10 MEMORY INTERLEAVING
If the main memory of a computer is structured as a collection of physically separate
modules, each with its own address buffer register (ABR) and data buffer register (DBR),
memory access operations may proceed in more than one module at the same time.
Thus, the aggregate rate of transmission of words to and from the main memory system
can be increased.
How individual addresses are distributed over the modules is critical in determining the
average number of modules that can be kept busy as computations proceed. Two methods
of address layout are indicated in Figure 5.25. In the first case, the memory address
generated by the processor is decoded as shown in Figure 5.25a. The high- order k bits
name one of n modules, and the low-order m bits name a particular word in that module.
When consecutive locations are accessed, as happens when a block of data is
transferred to a cache, only one module is involved. At the same time, however, devices
with direct memory access (DMA) ability may be accessing information in other memory
modules.
The second and more effective way to address the modules is shown in Figure 5.25b. It
is called memory interleaving. The low-order k bits of the memory address select a
module, and the high-order m bits name a location within that module. In this way,
consecutive addresses are located in successive modules. Thus, any component of the
system that generates requests for access to consecutive memory locations can keep
several modules busy at any one time. This results in both faster accesses to a block of
data and higher average utilization of the memory system as a whole. To implement the
interleaved structure, there must be 2^k modules; otherwise, there will be gaps of
nonexistent locations in the memory address space.
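The two address layouts can be sketched as follows. This is a toy illustration with assumed field widths (K module-select bits, M word bits), not a reproduction of Figure 5.25 itself:

```python
# Toy model of the two address layouts for 2**K modules of 2**M words
# each (K = 2 and M = 4 are assumed sizes, not from the text).

K, M = 2, 4                       # 4 modules, 16 words per module

def consecutive_layout(addr):
    """Figure 5.25a style: high-order K bits select the module."""
    return addr >> M, addr & ((1 << M) - 1)

def interleaved_layout(addr):
    """Figure 5.25b style: low-order K bits select the module."""
    return addr & ((1 << K) - 1), addr >> K

# Consecutive addresses 0..3 fall in one module under the first scheme,
# but are spread across all four modules when interleaved.
```

The final comment is the point of interleaving: a block transfer touching consecutive addresses can keep several modules busy at once, instead of queueing up on a single module.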
HIT RATE AND MISS PENALTY
An excellent indicator of the effectiveness of a particular implementation of the memory
hierarchy is the success rate in accessing information at various levels of the hierarchy.
Recall that a successful access to data in a cache is called a hit. The number of hits stated
as a fraction of all attempted accesses is called the hit rate, and the miss rate is the
number of misses stated as a fraction of attempted accesses.
Ideally, the entire memory hierarchy would appear to the processor as a single memory
unit that has the access time of a cache on the processor chip and the size of a magnetic
disk. How close we get to this ideal depends largely on the hit rate at different levels of
the hierarchy. High hit rates, well over 0.9, are essential for high-performance computers.
Performance is adversely affected by the actions that must be taken after a miss. The
extra time needed to bring the desired information into the cache is called the miss
penalty. This penalty is ultimately reflected in the time that the processor is stalled
because the required instructions or data are not available for execution. In general, the
miss penalty is the time needed to bring a block of data from a slower unit in the memory
hierarchy to a faster unit. The miss penalty is reduced if efficient mechanisms for
transferring data between the various units of the hierarchy are implemented. The
previous section shows how an interleaved memory can reduce the miss penalty
substantially.
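The combined effect of hit rate and miss penalty is often summarized by an average access time, t_avg = h*C + (1 - h)*M, where h is the hit rate, C the cache access time, and M the miss penalty. The function below is our own illustrative sketch of this standard formula (the symbol names are assumptions chosen to match the discussion):

```python
def average_access_time(h, cache_time, miss_penalty):
    """Average memory access time for a single cache level:
    t_avg = h*C + (1 - h)*M."""
    return h * cache_time + (1 - h) * miss_penalty

# With a 0.95 hit rate, a 1-cycle cache and a 17-cycle miss penalty,
# the average access is 0.95*1 + 0.05*17 = 1.8 cycles.
```

Even a 5% miss rate nearly doubles the effective access time in this example, which is why hit rates well over 0.9 and small miss penalties are both essential.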
Input –Output
5.1 Printers
In computing, a printer is a peripheral which produces a hard copy (permanent human-
readable text and/or graphics) of documents stored in electronic form, usually on physical
print media such as paper or transparencies. Many printers are primarily used as local
peripherals, and are attached by a printer cable or, in most newer printers, a USB cable to
a computer which serves as a document source. Some printers, commonly known as
network printers, have built-in network interfaces (typically wireless or Ethernet), and
can serve as a hardcopy device for any user on the network. Individual printers are often
designed to support both local and network connected users at the same time.
In addition, a few modern printers can directly interface to electronic media such as
memory sticks or memory cards, or to image capture devices such as digital cameras,
scanners; some printers are combined with scanners and/or fax machines in a single
unit, and can function as photocopiers. Printers that include non-printing features are
sometimes called Multifunction Printers (MFP), Multi-Function Devices (MFD), or All-
In-One (AIO) printers. Most MFPs include printing, scanning, and copying among their
features. A Virtual printer is a piece of computer software whose user interface and API
resemble that of a printer driver, but which is not connected with a physical computer
printer.
Printers are designed for low-volume, short-turnaround print jobs, requiring virtually no
setup time to achieve a hard copy of a given document. However, printers are generally
slow devices (30 pages per minute is considered fast; and many inexpensive consumer
printers are far slower than that), and the cost per page is actually relatively high.
However this is offset by the on-demand convenience and project management costs
being more controllable compared to an out-sourced solution. The printing press naturally
remains the machine of choice for high-volume, professional publishing. However, as
printers have improved in quality and performance, many jobs which used to be done by
professional print shops are now done by users on local printers; see desktop publishing.
The world's first computer printer was a 19th century mechanically driven apparatus
invented by Charles Babbage for his Difference Engine.
Printing technology
Printers are routinely classified by the underlying print technology they employ;
numerous such technologies have been developed over the years. The choice of print
engine has a substantial effect on what jobs a printer is suitable for, as different
technologies are capable of different levels of image/text quality, print speed, low cost,
noise; in addition, some technologies are inappropriate for certain types of physical
media (such as carbon paper or transparencies).
Another aspect of printer technology that is often forgotten is resistance to alteration:
liquid ink such as from an inkjet head or fabric ribbon becomes absorbed by the paper
fibers, so documents printed with liquid ink are more difficult to
alter than documents printed with toner or solid inks, which do not penetrate below the
paper surface.
Checks should either be printed with liquid ink or on special "check paper with toner
anchorage".
For similar reasons carbon film ribbons for IBM Selectric typewriters bore
labels warning against using them to type negotiable instruments such as checks. The
machine-readable lower portion of a check, however, must be printed using MICR toner
or ink. Banks and other clearing houses employ automation equipment that relies on the
magnetic flux from these specially printed characters to function properly.
Types of Printers:
1. Dot Matrix Printer
2. Inkjet Printer
3. Laser Printer
Dot matrix Printer Ink jet Printer Laser Printer
Modern print technology
The following printing technologies are routinely found in modern printers:
Toner-based printers
Laser printer
Toner-based printers work using the Xerographic principle that is used in most
photocopiers: by adhering toner to a light-sensitive print drum, then using static
electricity to transfer the toner to the printing medium to which it is fused with heat and
pressure.
The most common type of toner-based printer is the laser printer, which uses precision
lasers to cause toner adherence. Laser printers are known for high quality prints, good
print speed, and a low (Black and White) cost-per-copy. They are the most common
printer for many general-purpose office applications, but are much less common as
consumer printers due to their high initial cost - although this cost is dropping.
Laser printers are available in both color and monochrome varieties.
Another toner based printer is the LED printer which uses an array of LEDs instead of a
laser to cause toner adhesion to the print drum.
Recent research has also indicated that Laser printers emit potentially dangerous ultrafine
particles, possibly causing health problems associated with respiration, and cause
pollution equivalent to cigarettes.
The degree of particle emissions varies with age,
model and design of each printer but is generally proportional to the amount of toner
required. Furthermore, a well ventilated workspace would allow such ultrafine particles
to disperse thus reducing the health side effects.
Liquid inkjet printers
Inkjet printer
Inkjet printers operate by propelling variably-sized droplets of liquid or molten material
(ink) onto almost any sized page. They are the most common type of computer printer for
the general consumer
due to their low cost, high quality of output, capability
of printing in vivid color, and ease of use.
Solid ink printers
Solid ink
Solid Ink printers, also known as phase-change printers, are a type of thermal transfer
printer. They use solid sticks of CMYK colored ink (similar in consistency to candle
wax), which are melted and fed into a piezo crystal operated print-head. The printhead
sprays the ink on a rotating, oil coated drum. The paper then passes over the print drum,
at which time the image is transferred, or transfixed, to the page.
Solid ink printers are most commonly used as color office printers, and are excellent at
printing on transparencies and other non-porous media. Solid ink printers can produce
excellent results. Acquisition and operating costs are similar to laser printers. Drawbacks
of the technology include high power consumption and long warm-up times from a cold
state.
Also, some users complain that the resulting prints are difficult to write on (the wax tends
to repel inks from pens), and are difficult to feed through Automatic Document Feeders,
but these traits have been significantly reduced in later models. In addition, this type of
printer is only available from one manufacturer, Xerox, manufactured as part of their
Xerox Phaser office printer line. Previously, solid ink printers were manufactured by
Tektronix, but Tek sold the printing business to Xerox in 2001.
Dye-sublimation printers
Dye-sublimation printer
A dye-sublimation printer (or dye-sub printer) is a printer which employs a printing
process that uses heat to transfer dye to a medium such as a plastic card, paper or canvas.
The process is usually to lay one color at a time using a ribbon that has color panels. Dye-
sub printers are intended primarily for high-quality color applications, including color
photography; and are less well-suited for text. While once the province of high-end print
shops, dye-sublimation printers are now increasingly used as dedicated consumer photo
printers.
Inkless printers
Thermal printers
Thermal printer
Thermal printers work by selectively heating regions of special heat-sensitive paper.
Monochrome thermal printers are used in cash registers, ATMs, gasoline dispensers and
some older inexpensive fax machines. Colors can be achieved with special papers and
different temperatures and heating rates for different colors. One example is the ZINK
technology.
UV printers
Xerox is working on an inkless printer which will use a special reusable paper coated
with a few micrometres of UV light sensitive chemicals. The printer will use a special
UV light bar which will be able to write and erase the paper. As of early 2007 this
technology is still in development and the text on the printed pages can only last between
16-24 hours before fading.
Printing speed
The speed of early printers was measured in units of characters per second. More
modern printers are measured in pages per minute. These measures are used primarily as
a marketing tool, and are not well standardised. Usually pages per minute refers to sparse
monochrome office documents, rather than dense pictures which usually print much more
slowly. PPM figures usually refer to A4 paper in Europe and letter paper in the
US, resulting in a 5-10% difference.
5.2 Plotters
A plotter is a vector graphics printing device that connects to a computer and prints
graphical plots. It draws images with ink pens, tracing point-to-point lines directly
from vector graphics files. The plotter was the first computer
output device that could print graphics as well as accommodate full-size engineering and
architectural drawings. Using different colored pens, it was also able to print in color long
before inkjet printers became an alternative. There are different types of plotters:
Drum Plotters
Electrostatic plotters
Flat Bed Plotters
Inkjet Plotters
Pen Plotters
Drum Plotters
A type of pen plotter that wraps the paper around a drum with a pin feed
attachment. The drum turns to produce one direction of the plot, and the pens move to
provide the other. The plotter was the first output device to print graphics and large
engineering drawings. Using different colored pens, it could draw in color long before
color inkjet printers became viable.
Electrostatic Plotters
This plotter uses an electrostatic method of printing. Liquid toner models use a
positively charged toner that is attracted to paper which is negatively charged by passing
by a line of electrodes (tiny wires or nibs). Models print in black and white or color, and
some handle paper up to six feet wide. Newer electrostatic plotters are really large-format
laser printers and focus light onto a charged drum using lasers or LEDs.
Flatbed Plotters
This is a graphics plotter that contains a flat surface that the paper is placed on.
The size of this surface (bed) determines the maximum size of the drawing.
Inkjet Plotters
This is a printer that propels droplets of ink directly onto the medium. Today,
almost all inkjet printers produce color. Low-end inkjets use three ink colors (cyan,
magenta and yellow), but produce a composite black that is often muddy. Four-color
inkjets (CMYK) use black ink for pure black printing. Inkjet printers run the gamut from
less than a hundred to a couple hundred dollars for home use to tens of thousands of
dollars for commercial poster printers
Pen Plotters
Pen plotters print by moving a pen across the surface of a piece of paper. This
means that plotters are restricted to line art, rather than raster graphics as with other
printers. Pen plotters can draw complex line art, including text, but do so very slowly
because of the mechanical movement of the pens. Pen plotters are incapable of creating a
solid region of color, but can hatch an area by drawing a number of close, regular lines.
When computer memory was very expensive and processors were very slow, this
was often the fastest way to produce color high-resolution vector-based artwork, or very
large drawings efficiently.
5.3 Displays
A display device is an output device for presentation of information for visual or
tactile reception, acquired, stored, or transmitted in various forms. When the input
information is supplied as an electrical signal, the display is called an electronic display.
A display device is anything that puts images on a screen, giving the user visual
confirmation of input and actions. The most common display is the default monitor,
which means that by default any monitor should work if it is installed before the power
is turned on. The screen has dials that can make the display seem blank, and sometimes
adjustments must be made to the display itself.
CRT MONITOR [cathode-ray tube]
Features
1. High Voltage Device
2. Two connections present in a CRT:
i) To the AC power outlet
ii) To the System Unit (DB-15, DVI)
Disadvantages of CRT
They have a big back and take up space on desk.
The electromagnetic fields emitted by CRT monitors constitute a health hazard to
the functioning of living cells.
CRTs emit a small amount of X-ray band radiation which can result in a health
hazard.
Constant refreshing of CRT monitors can result in headache.
CRTs operate at very high voltage which can overheat system or result in an
implosion
Within a CRT a strong vacuum exists in it and can also result in a implosion
They are heavy to pick up and carry around
Advantages of CRT
The cathode rayed tube can easily increase the monitor‘s brightness by reflecting
the light.
They produce more colours
The Cathode Ray Tube monitors have lower price rate than the LCD display or
Plasma display.
The quality of the image displayed on a Cathode Ray Tube is superior to the
LCD and Plasma monitors.
The contrast features of the cathode ray tube monitor are considered highly
excellent.
How CRTs work & display?
A CRT monitor contains millions of tiny red, green, and blue phosphor dots that
glow when struck by an electron beam that travels across the screen to create a visible
image. In a CRT monitor tube, the cathode is a heated filament. The heated filament is
in a vacuum created inside a glass tube. The electrons are negatively charged and the
screen is positively charged, so the electrons are attracted to the screen, making it glow.
LCD Monitor
Flat panel display
Features
1) HVD in Desktops and LVD in Laptops
2) Two connections present in a HVD LCD Monitor
i) To the AC power outlet
ii) To the System Unit
3) Highly energy efficient.
Flat panel displays encompass a growing number of technologies enabling
video displays that are lighter and much thinner than traditional television and video
displays that use cathode ray tubes, and are usually less than 4 inches (100 mm) thick.
They can be divided into two general categories: Volatile or Static. Flat panel displays
balance their smaller footprint and trendy modern look with high production costs and in
many cases inferior images compared with traditional CRTs. In many applications,
specifically modern portable devices such as laptops, cellular phones, and digital
cameras, whatever disadvantages exist are overcome by the portability requirements.
Volatile
Volatile displays require pixels be periodically refreshed to retain their state,
even for a static image. This refresh typically occurs many times a second. If this is not
done, the pixels will gradually lose their coherent state, and the image will "fade" from
the screen.
Examples of volatile flat panel displays
Plasma displays
Liquid crystal displays (LCDs)
Organic light-emitting diode displays (OLEDs)
Light-emitting diode display (LED)
Electroluminescent displays (ELDs)
Surface-conduction electron-emitter displays (SEDs)
Field emission displays (FEDs)
Nano-emissive display (NEDs)
Static
Static flat panel displays rely on materials whose color states are bistable. This
means that the image they hold requires no energy to maintain, but instead requires
energy to change. This results in a much more energy-efficient display, but with a
tendency towards slow refresh rates which are undesirable in an interactive display.
Examples of static flat panel displays
electrophoretic displays (e.g. E Ink's electrophoretic imaging film)
bichromal ball displays (e.g. Xerox's Gyricon)
Interferometric modulator displays (e.g. Qualcomm's iMod, a MEMS
display.)
Cholesteric displays (e.g. MagInk, Kent Displays)
Bistable nematic liquid crystal displays (e.g. ZBD)
5.4 Keyboard
Keyboards are designed for the input of text and characters and also to control the
operation of a computer.
Types of Keyboards
1. Based on Layout
-QWERTY layout
-DVORAK layout
2. Ergonomic – Based on comfort
Standard Keyboards
The number of keys on a keyboard varies from the original standard of 101 keys to the
104-key Windows keyboards.
Qwerty Layout
Dvorak Keyboard
Ergonomic Keyboard
Differences between Dvorak and Qwerty
Typing a 62-word paragraph, Dvorak used between 20% and 35% less movement, and
saved almost 6 feet of the 16 feet of finger movement needed to type these short
paragraphs with Qwerty. This is the minimum difference. In actual practice, nearly
all typists would show more savings, with a range of up to about 50%.
LAYOUT OF A KEYBOARD
Layout of a keyboard can be divided into five sections:-
i) Typical Keys
These keys include the letter and number keys [1, 2, A, B, etc.], which are generally
laid out in the same style that was common for typewriters.
ii) Numeric Keypad:-
Numeric keys are located on the right hand side of the keyboard.
Generally it consists of a set of 17 keys that are laid out in the same
configuration used by adding machines and calculators.
iii) Function keys:-
The function keys [F1,F2,F3 etc] are arranged in a row along the top of the
keyboard and could be assigned specific commands by current application or
the operating system.
iv) Control keys:-
These keys provide cursor and screen control. They include 4 directional
arrow keys, arranged in an inverted T formation between the typing
keys and the numeric keypad. Control keys also include Home, End, Insert,
Delete, Page up, Control [ctrl], Page down, Alternate [alt] & Escape [Esc].
The Windows keyboard also consists of two windows or start keys and an
Application key.
v) Special Purpose Keys:-
Apart from the above discussed keys, a keyboard contains some special
purpose keys such as Enter, Shift, Caps lock, Num lock, Space bar, Tab and
Print Screen.
WORKING
When the user presses a key, a code corresponding to that key press is sent to
the operating system. A copy of this code is also stored in the keyboard‘s memory. When
the operating system reads the scan code, it informs the keyboard, and the scan
code stored in the keyboard‘s memory is then erased. The action corresponding to the
code is then performed. If the user holds down a key, the processor determines that the
user wishes to send that character repeatedly to the computer. In this process, the delay
between each instance of the character can normally be set in the operating system, and
typically ranges from 2 to 30 characters per second (cps).
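The auto-repeat behaviour described above can be sketched as a small calculation. The initial-delay and repeat-rate defaults here are assumptions for illustration, not values given in the text:

```python
def events_sent(hold_ms, delay_ms=500, rate_cps=30):
    """Number of scan-code transmissions for a key held for hold_ms
    milliseconds: one for the key press, then, after an initial delay,
    repeats at rate_cps characters per second."""
    if hold_ms < delay_ms:
        return 1                                   # the key press only
    # press + first repeat at delay_ms + further repeats thereafter
    return 2 + (hold_ms - delay_ms) * rate_cps // 1000
```

For example, tapping a key sends one code, while holding it for 1.5 seconds at 30 cps with a 500 ms delay sends 32.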
Changing the Keyboard Layout
Start > Control Panel > Regional & Language Options >Language > Details > Add >
Enable Keyboard Layout > United States - Dvorak
5.5 Mouse
A mouse (plural mice or mouses) functions as a pointing device by detecting two-
dimensional motion relative to its supporting surface. It can be used only with a GUI
(Graphical User Interface) based OS, e.g. Windows.
Types of mouse based on mechanism
1. Mechanical Mouse
2. Optical Mouse
Mechanical Mouse
A mouse that uses a rubber ball that makes contact with wheels inside the unit when it is
rolled on a pad or desktop.
Optical Mouse
A mouse that uses light to detect movement. It emits a light and senses its reflection as it
is moved. Early optical mice required a special mouse pad, but today's devices can be
rolled over traditional pads like a mechanical mouse as well as over almost any surface
other than glass or mirror.
5.6 Optical Mark Reader
The Optical Mark Reader is a device that "reads" pencil marks on NCS
compatible scan forms such as surveys or test answer forms. Optical Mark Reader is a
scanning device that can read marks such as pencil marks on a page; used to read forms
and multiple-choice questionnaires. Think of it as the machine that checks multiple
choice computer forms. In this document The Optical Mark Reader will be referred to as
the scanner or OMR. The computer test forms designed for the OMR are known as NCS
compatible scan forms. Tests and surveys completed on these forms are read in by the
scanner, checked, and the results are saved to a file. This data file can be converted into
an output file of several different formats, depending on which type of output you desire.
The OMR is a powerful tool that has many features. While using casstat
(grading tests), the OMR will print the number of correct answers and the percentage of
correct answers at the bottom of each test. It will also record statistical data about each
question. This data is recorded in the output file created when the forms are scanned.
5.7 Optical character recognition
Optical character recognition, usually abbreviated to OCR, is the mechanical or
electronic translation of
images of handwritten, typewritten or printed text (usually captured by a scanner) into
machine-editable text.
OCR is a field of research in pattern recognition, artificial intelligence and
machine vision. Though academic research in the field continues, the focus on OCR has
shifted to implementation of proven techniques. Optical character recognition (using
optical techniques such as mirrors and lenses) and digital character recognition (using
scanners and computer algorithms) were originally considered separate fields. Because
very few applications survive that use true optical techniques, the OCR term has now
been broadened to include digital image processing as well.
Early systems required training (the provision of known samples of each
character) to read a specific font. "Intelligent" systems with a high degree of recognition
accuracy for most fonts are now common. Some systems are even capable of reproducing
formatted output that closely approximates the original scanned page including images,
columns and other non-textual components.
5.8 Device interface
An interface device (IDF) is a hardware component or system of components that allows
a human being to interact with a computer, a telephone system, or other electronic
information system. The term is often encountered in the mobile communication industry
where designers are challenged to build the proper combination of portability, capability,
and ease of use into the interface device. The overall set of characteristics provided by an
interface device is often referred to as the user interface (and, for computers - at least, in
more academic discussions - the human-computer interface or HCI ). Today's desktop
and notebook computers have what has come to be called a graphical user interface
(GUI) to distinguish it from earlier, more limited interfaces such as the command line
interface (CLI).
The Graphics Device Interface (GDI) is a Microsoft Windows application programming
interface and core operating system component responsible for representing graphical
objects and transmitting them to output devices such as monitors and printers.
GDI is responsible for tasks such as drawing lines and curves, rendering fonts and
handling palettes. It is not directly responsible for drawing windows, menus, etc.; that
task is reserved for the user subsystem, which resides in user32.dll and is built atop GDI.
GDI is similar to Macintosh's QuickDraw.
Perhaps the most significant capability of GDI over more direct methods of accessing the
hardware is its scaling capabilities, and abstraction of target devices. Using GDI, it is
very easy to draw on multiple devices, such as a screen and a printer, and expect proper
reproduction in each case. This capability is at the center of all What You See Is What
You Get applications for Microsoft Windows.
A human interface device or HID is a type of computer device that interacts directly
with, and most often takes input from, humans and may deliver output to humans. The
term "HID" most commonly refers to the USB-HID specification.
5.9 I/O processor
An I/O processor (IOP) is a specialized computer that permits autonomous handling of data
between I/O devices and a central computer or the central memory of the computer. It can
be a programmable computer in its own right; in earlier forms, as a wired-program
computer, it was called a channel controller. See also direct memory access. Many
storage, networking, and embedded applications require fast I/O throughput for optimal
performance. Intel® I/O processors allow servers, workstations and storage subsystems
to transfer data faster, reduce communication bottlenecks, and improve overall system
performance by offloading I/O processing functions from the host CPU.
5.10 Standard I/O Interfaces
Interface circuitry designed for one computer does not support another; a separate
interface would have to be designed for each computer, resulting in a large number of
interfaces. To avoid this, a number of standards have been developed for the expansion
bus, e.g. SCSI, PCI and USB.
Small Computer Systems Interface (SCSI)
Parallel interface
8, 16, 32 bit data lines
Daisy chained
Devices are independent
Devices can communicate with each other as well as host
SCSI - 1
Early 1980s
8 bit
5MHz
Data rate 5 Mbytes/s
Seven devices
o Eight including host interface
SCSI - 2
1991
16 and 32 bit
10MHz
Data rate 20 or 40 Mbytes/s
o Check out Ultra/Wide SCSI
SCSI -3
1993
16 bits
Data rate 40 Mbytes/s over a 68-pin connector
Supports up to 16 devices
PCI BUS
PCI Local Bus (usually shortened to PCI), or Conventional PCI, specifies a
computer bus for attaching peripheral devices to a computer motherboard. These devices
can take either the form of an integrated circuit fitted onto the motherboard itself, called a
planar device in the PCI specification or an expansion card that fits into a socket. The
name PCI is an initialism formed from Peripheral Component Interconnect. The PCI bus is
common in modern PCs, where it has displaced ISA and VESA Local Bus as the standard
expansion bus, and it also appears in many other computer types. Despite the availability
of faster interfaces such as PCI-X and PCI Express, conventional PCI remains a very
common interface.
The PCI specification covers the physical size of the bus (including wire
spacing), electrical characteristics, bus timing, and protocols. The specification can be
purchased from the PCI Special Interest Group (PCI-SIG).
Typical PCI cards used in PCs include: network cards, sound cards, modems,
extra ports such as USB or serial, TV tuner cards and disk controllers. Historically video
cards were typically PCI devices, but growing bandwidth requirements soon outgrew the
capabilities of PCI. PCI video cards remain available for supporting extra monitors and
upgrading PCs that do not have any AGP or PCI Express slots.
USB
Universal Serial Bus (USB) is a serial bus standard to interface devices to a
host computer. USB was designed to allow many peripherals to be connected using a
single standardized interface socket and to improve the plug-and-play capabilities by
allowing hot swapping, that is, by allowing devices to be connected and disconnected
without rebooting the computer or turning off the device. Other convenient features
include providing power to low-consumption devices without the need for an external
power supply and allowing many devices to be used without requiring manufacturer
specific, individual device drivers to be installed.
USB is intended to replace many legacy varieties of serial and parallel ports. USB
can connect computer peripherals such as mice, keyboards, PDAs, gamepads and
joysticks, scanners, digital cameras, printers, personal media players, and flash drives.
For many of those devices USB has become the standard connection method. USB was
originally designed for personal computers, but it has become commonplace on other
devices such as PDAs and video game consoles, and as a bridging power cord between a
device and an AC adapter plugged into a wall plug for charging purposes. As of 2008,
there are about 2 billion USB devices in the world.
RS 232 C
RS-232C is a long-established standard ("C" is the current version) that describes the
physical interface and protocol for relatively low-speed serial data communication
between computers and related devices. It was defined by an industry trade group, the
Electronic Industries Association (EIA), originally for teletypewriter devices.
RS-232C is the interface that your computer uses to talk to and exchange data with your
modem and other serial devices. Somewhere in your PC, typically on a Universal
Asynchronous Receiver/Transmitter (UART) chip on your motherboard, the data from
your computer is transmitted to an internal or external modem (or other serial device)
from its Data Terminal Equipment (DTE) interface. Since data in your computer flows
along parallel circuits and serial devices can handle only one bit at a time, the UART chip
converts the groups of bits in parallel to a serial stream of bits. As your PC's DTE agent,
it also communicates with the modem or other serial device, which, in accordance with
the RS-232C standard, has a complementary interface called the Data Communications
Equipment (DCE) interface.
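The parallel-to-serial conversion the UART performs can be sketched in a few lines. The frame format shown here (one start bit, eight data bits sent LSB first, one stop bit, commonly written "8N1") is a typical asynchronous configuration, not something mandated by RS-232 itself:

```python
def uart_frame_8n1(byte):
    """Serialize one byte as an 8N1 asynchronous frame:
    start bit (0), eight data bits LSB first, stop bit (1)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("need a single byte")
    data_bits = [(byte >> i) & 1 for i in range(8)]   # LSB first
    return [0] + data_bits + [1]

def uart_deframe_8n1(bits):
    """Recover the byte from a 10-bit 8N1 frame, checking framing."""
    if len(bits) != 10 or bits[0] != 0 or bits[9] != 1:
        raise ValueError("framing error")
    return sum(b << i for i, b in enumerate(bits[1:9]))

frame = uart_frame_8n1(ord('A'))   # 'A' = 0x41
print(frame)                       # [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]
print(hex(uart_deframe_8n1(frame)))  # 0x41
```

A real UART also handles baud-rate timing and optional parity; this sketch shows only the bit-level framing.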
In telecommunications, RS-232 (Recommended Standard 232) is a standard for serial
binary data signals connecting between a DTE (Data Terminal Equipment) and a DCE
(Data Circuit-terminating Equipment). It is commonly used in computer serial ports. A
similar ITU-T standard is V.24.
RS-232C is short for Recommended Standard 232C, a standard interface approved by the
Electronic Industries Alliance (EIA) for connecting serial devices. In 1987, the EIA released
a new version of the standard and changed the name to EIA-232-D, and in 1991 the EIA teamed
up with the Telecommunications Industry Association (TIA) and issued a new version called
EIA/TIA-232-E. Many people, however, still refer to the standard as RS-232C, or just RS-232.
Almost all modems conform to the EIA-232 standard and most personal computers have an EIA-
232 port for connecting a modem or other device. In addition to modems, many display screens,
mice, and serial printers are designed to connect to an EIA-232 port. In EIA-232 parlance, the device
that connects to the interface is called a Data Communications Equipment (DCE) and the device to
which it connects (e.g., the computer) is called a Data Terminal Equipment (DTE).
The EIA-232 standard supports two types of connectors -- a 25-pin D-type connector (DB-25) and
a 9-pin D-type connector (DB-9). The type of serial communications used by PCs requires only 9
pins so either type of connector will work equally well.
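The nine signals that the DB-9 connector carries can be tabulated directly; these are the standard EIA/TIA-574 assignments for the DTE side:

```python
# Standard DB-9 serial-port signal assignments (EIA/TIA-574, DTE side).
DB9_PINS = {
    1: "DCD  Data Carrier Detect",
    2: "RxD  Received Data",
    3: "TxD  Transmitted Data",
    4: "DTR  Data Terminal Ready",
    5: "GND  Signal Ground",
    6: "DSR  Data Set Ready",
    7: "RTS  Request To Send",
    8: "CTS  Clear To Send",
    9: "RI   Ring Indicator",
}

# Transmit and receive use separate pins (3 and 2), which is why the
# interface can operate in full duplex.
for pin in (2, 3):
    print(pin, DB9_PINS[pin])
```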
Although EIA-232 is still the most common standard for serial communication, the EIA has
recently defined successors to EIA-232 called RS-422 and RS-423. The new standards are
backward compatible so that RS-232 devices can connect to an RS-422 port.
Role in modern personal computers
In the book PC 97 Hardware Design Guide,[3] Microsoft deprecated support for the RS-232
compatible serial port of the original IBM PC design. Today, RS-232 is gradually
being replaced in personal computers by USB for local communications. Compared with
RS-232, USB is faster, uses lower voltages, and has connectors that are simpler to
connect and use. Both standards have software support in popular operating systems.
USB is designed to make it easy for device drivers to communicate with hardware.
However, there is no direct analog to the terminal programs used to let users
communicate directly with serial ports. USB is more complex than the RS-232 standard
because it includes a protocol for transferring data to devices. This requires more
software to support the protocol used. RS-232 only standardizes the voltage of signals
and the functions of the physical interface pins. Serial ports of personal computers are
also often used to directly control various hardware devices, such as relays or lamps,
since the control lines of the interface could be easily manipulated by software. This isn't
feasible with USB, which requires some form of receiver to decode the serial data.
As an alternative, USB docking ports are available which can provide connectors for a
keyboard, mouse, one or more serial ports, and one or more parallel ports. Corresponding
device drivers are required for each USB-connected device to allow programs to access
these USB-connected devices as if they were the original directly-connected peripherals.
Devices that convert USB to RS-232 may not work with all software on all personal
computers and may cause a reduction in bandwidth along with higher latency.
Personal computers may use the control pins of a serial port to interface to devices such
as uninterruptible power supplies. In this case, serial data is not sent, but the control lines
are used to signal conditions such as loss of power or low battery alarms.
Certain industries, in particular marine survey, provide a continued demand for RS-232
I/O due to sustained use of very expensive but aging equipment. It is far cheaper to
continue to use RS-232 than it is to replace the equipment. Some manufacturers have
responded to this demand: Toshiba re-introduced a DB-9 male connector on the Tecra
laptop, and companies such as Digi specialise in RS-232 I/O cards.
Standard details
In RS-232, user data is sent as a time-series of bits. Both synchronous and asynchronous
transmissions are supported by the standard. In addition to the data circuits, the standard
defines a number of control circuits used to manage the connection between the DTE and
DCE. Each data or control circuit only operates in one direction, that is, signaling from a
DTE to the attached DCE or the reverse. Since transmit data and receive data are separate
circuits, the interface can operate in a full duplex manner, supporting concurrent data
flow in both directions. The standard does not define character framing within the data
stream, or character encoding.
IEEE 488.2 (GPIB)
IEEE-488 is a short-range, digital communications bus specification that has been in use
for over 30 years. Originally created for use with automated test equipment, the standard
is still in wide use for that purpose. IEEE-488 is also commonly known as HP-IB
(Hewlett-Packard Interface Bus) and GPIB (General Purpose Interface Bus).
IEEE-488 allows up to 15 devices to share a single eight-bit parallel electrical bus by
daisy chaining connections. The slowest device participates in control and data transfer
handshakes to determine the speed of the transaction. The maximum data rate is about
one Mbyte/s in the original standard, and about 8 Mbyte/s with later extensions.
The IEEE-488 connector has 24 pins. The bus employs 16 signal lines — eight
bidirectional lines used for data transfer, three for handshaking, and five for bus
management — plus eight ground return lines.
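The consequence of the handshake described above — every listener must accept each byte before the talker proceeds, so the slowest device sets the pace — can be illustrated with a toy model. The device speeds in the example are invented for illustration:

```python
def gpib_transfer_time(num_bytes, listener_rates):
    """Toy model of a GPIB transfer: the handshake completes only when
    the slowest listener has accepted each byte, so the effective data
    rate is the minimum of the listeners' rates (bytes per second)."""
    if not listener_rates:
        raise ValueError("need at least one listener")
    effective_rate = min(listener_rates)
    return num_bytes / effective_rate

# A fast logic analyser (1 MB/s) and a slow plotter (10 kB/s)
# listening together: the plotter sets the pace.
print(gpib_transfer_time(1_000_000, [1_000_000, 10_000]))   # 100.0 seconds
```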
In 1975 the bus was standardized by the Institute of Electrical and Electronics Engineers
as the IEEE Standard Digital Interface for Programmable Instrumentation, IEEE-
488-1975 (now 488.1). IEEE-488.1 formalized the mechanical, electrical, and basic
protocol parameters of GPIB, but said nothing about the format of commands or data.
The IEEE-488.2 standard, Codes, Formats, Protocols, and Common Commands for
IEEE-488.1 (June 1987), provided for basic syntax and format conventions, as well as
device-independent commands, data structures, error protocols, and the like. IEEE-488.2
built on -488.1 without superseding it; equipment can conform to -488.1 without
following -488.2.
While IEEE-488.1 defined the hardware, and IEEE-488.2 defined the syntax, there was
still no standard for instrument-specific commands.
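Among the common commands IEEE-488.2 defines is `*IDN?`, whose response is four comma-separated fields: manufacturer, model, serial number, and firmware revision. A small sketch of parsing such a response; the instrument string in the example is invented for illustration:

```python
def parse_idn(response):
    """Split an IEEE-488.2 *IDN? response into its four comma-separated
    fields: manufacturer, model, serial number, firmware revision."""
    fields = [f.strip() for f in response.split(",")]
    if len(fields) != 4:
        raise ValueError("*IDN? response must have exactly four fields")
    keys = ("manufacturer", "model", "serial", "firmware")
    return dict(zip(keys, fields))

# Hypothetical response string, for illustration only:
idn = parse_idn("ACME Instruments,Model 123,SN0042,1.07")
print(idn["model"])   # Model 123
```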
Applications
At the outset, HP-IB's designers did not specifically plan for IEEE-488 to be a standard
peripheral interface for general-purpose computers. By 1977 the Commodore PET/CBM
range of educational/home/personal computers connected its disk drives, printers,
modems, etc., by the IEEE-488 bus. All of Commodore's post-PET/CBM 8-bit machines,
from the VIC-20 to the C128, utilized a proprietary 'serial IEEE-488' for peripherals, with
round DIN connectors instead of the heavy-duty HP-IB plugs or a card-edge connector
plugging into the motherboard (as on the PET computers).
Hewlett-Packard and Tektronix also used IEEE-488 as a peripheral interface to connect
disk drives, tape drives, printers, plotters etc. to their workstation products and HP's
HP 2100[4] and HP 3000[5] minicomputers. While the bus speed was increased to 10 MB/s
for such applications, the lack of command protocol standards limited third-party
offerings and interoperability, and later, faster, open standards such as SCSI eventually
superseded IEEE-488 for peripheral access.
Additionally, some of HP's advanced pocket calculators/computers of the 1980s, such as
the HP-41 and HP-71B series, could work with various instrumentation via an optional
HP-IB interface. The interface would connect to the calculator via an optional HP-IL
module.