
The Study and Comparison of

Pentium family Processors

Calin Ciordas

Zhang Lei

Yingbo Zhu

2001 02

Calin Ciordas: the introduction and Part II.3
Zhang Lei: Part II.2 and Part II.4
Yingbo Zhu: Part II.1 and the summary

CONTENTS

PART I Introduction
Part II Study Issues

1. Caches
1.1 Consistency Protocol (MESI Protocol)
1.2 The 'least recently used' (LRU) Mechanism
1.3 The Pentium Processor
1.4 The Pentium Pro /Pentium II/Pentium III
1.5 The Pentium 4 Processor

2 Pipeline
2.1 Pentium & Pentium with MMX
2.2 The Pentium Pro /Pentium II/Pentium III
2.2 Pentium 4
2.3 Pipeline summary

3 Parallel and superscalar aspects of Pentium processor family
3.1 Superscalar aspects
3.2 SIMD

4 Branch prediction
4.1 Pentium
4.2 The Pentium Pro /Pentium II/Pentium III
4.3 Pentium 4

PART III SUMMARY
Appendix 1: Comparison of Pentium Family Processors Specifications
Appendix 2: References


PART I Introduction

The main task of this paper is to offer a view of the entire Intel Pentium processor family from an architectural point of view. It is interesting to observe the evolution of the design issues, the features the processors have in common, and the improvements that Intel engineers added over the years. The Pentium family is a family of CISC processors that incorporate advanced RISC concepts. Except for the first member (the Pentium), which has a modest superscalar design, all members have a full superscalar design. Architectural extensions such as MMX, SSE and SSE2 are also important improvements. The paper also compares the cache policies and the branch prediction strategies of the different processors.

In our opinion the Pentium family is a successful design family. The Pentium, Pentium II and Pentium 4 were studied in more detail. Because the Pentium Pro, Pentium II and Pentium III are based on similar designs, the other processors were studied only partially, with regard to certain interesting aspects.

Part II Study Issues

1. Caches

All caches use the following model. Main memory is divided up into fixed-size blocks called cache lines. A cache with n possible entries for each address is called an n-way set-associative cache. A 'two-way set-associative' organisation is shown in Figure 1.1.

Figure 1.1 A 'two-way set-associative' Cache Organization

1.1 Consistency Protocol (MESI Protocol)

The Pentium processor Cache Consistency Protocol is a set of rules by which states are assigned to cached entries (lines). The rules apply for memory read/write cycles only. I/O and special cycles are not run through the data cache. Every line in the Pentium processor data cache is assigned a state dependent on both Pentium processor generated activities and activities generated by other bus masters (snooping). The Pentium processor Data Cache Protocol consists of four states that define whether a line is valid (HIT/MISS), if it is available in other caches, and if it has been MODIFIED. The four states are the M (Modified), E (Exclusive), S (Shared) and the I (Invalid) states and the protocol is referred to as the MESI protocol. A definition of the states is given below:


M - Modified: An M-state line is available in ONLY one cache and it is also MODIFIED (different from main memory). An M-state line can be accessed (read/written to) without sending a cycle out on the bus.

E - Exclusive: An E-state line is also available in ONLY one cache in the system, but the line is not MODIFIED (i.e., it is the same as main memory). An E-state line can be accessed (read/written to) without generating a bus cycle. A write to an E-state line will cause the line to become MODIFIED.

S - Shared: This state indicates that the line is potentially shared with other caches (i.e. the same line may exist in more than one cache). A read to an S-state line will not generate bus activity, but a write to a SHARED line will generate a write through cycle on the bus. The write through cycle may invalidate this line in other caches. A write to an S-state line will update the cache.

I - Invalid: This state indicates that the line is not available in the cache. A read to this line will be a MISS and may cause the Pentium processor to execute a LINE FILL (fetch the whole line into the cache from main memory). A write to an INVALID line will cause the Pentium processor to execute a write-through cycle on the bus.
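The processor-side behaviour described in the four definitions above can be summarised as a small transition table. The following Python sketch is illustrative only: the names `State` and `access` are hypothetical, and three outcomes that the definitions leave open are assumptions marked in the comments (a written S-state line is kept in S, a read miss fills the line into E, and a write miss does not allocate a line). Transitions caused by snooping are not modelled.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


def access(state, op):
    """Return (next_state, bus_cycle) for a processor-side 'read' or 'write'.

    bus_cycle is None when the access is satisfied without bus activity.
    """
    if op not in ("read", "write"):
        raise ValueError(f"unknown operation: {op}")
    if state is State.M:
        # Reads and writes hit in the cache; no bus cycle is needed.
        return State.M, None
    if state is State.E:
        # A write silently upgrades the line to Modified.
        return (State.M if op == "write" else State.E), None
    if state is State.S:
        if op == "read":
            return State.S, None
        # Assumption: the line stays Shared after the write-through cycle.
        return State.S, "write-through"
    # state is State.I
    if op == "read":
        # Assumption: no other cache holds the line, so it is filled as Exclusive.
        return State.E, "line fill"
    # Assumption: a write miss goes to the bus without allocating a line.
    return State.I, "write-through"
```

For example, `access(State.E, "write")` yields `(State.M, None)`: the line becomes Modified without any bus cycle, which is the case the E-state definition singles out.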

1.2 The 'least recently used' (LRU) Mechanism

The 'least recently used' bit indicates which line in each set was used last; the other line will be replaced if neither of them hits and both are valid. The 'least recently used' (LRU) algorithm keeps an ordering of each set of locations that could be accessed from a given memory location. Whenever any of the present lines is accessed, it updates the list, marking that entry as the most recently accessed. When it comes time to replace an entry, the one at the end of the list, the least recently accessed, is the one discarded. This decision tree is shown in Figure 1.2.

Figure 1.2 LRU Cache Replacement Strategy
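For a two-way organisation the LRU ordering reduces to a single bit per set. The sketch below is a minimal model of that idea, not the processor's actual logic; the class name `TwoWaySet` and its fields are hypothetical, and invalid lines are assumed to be filled before any valid line is replaced.

```python
class TwoWaySet:
    """One set of a two-way cache with a single LRU bit."""

    def __init__(self):
        self.tags = [None, None]   # None marks an invalid line
        self.lru = 0               # index of the least recently used line

    def access(self, tag):
        """Return True on a hit; on a miss fill an invalid line or evict the LRU line."""
        if tag in self.tags:
            way = self.tags.index(tag)
            hit = True
        else:
            # Assumption: an invalid line is preferred over evicting a valid one.
            way = self.tags.index(None) if None in self.tags else self.lru
            self.tags[way] = tag
            hit = False
        # The line just touched is most recent, so the other one is now LRU.
        self.lru = 1 - way
        return hit
```

Accessing tags 1, 2, 1, 3 in that order leaves the set holding tags 1 and 3: tag 2 was the least recently used line when the miss on tag 3 occurred.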

1.3 The Pentium Processor

1.3.1 On-Chip Caches

The Pentium processor implements two internal caches for a total integrated cache size of 16 Kbytes: an 8 Kbyte data cache and a separate 8 Kbyte code cache. These caches are transparent to application software to maintain compatibility with previous Intel Architecture generations.

The data cache fully supports the MESI (modified/exclusive/shared/invalid) writeback cache consistency protocol. The code cache is inherently write protected to prevent code from being inadvertently corrupted, and as a consequence supports a subset of the MESI protocol, the S (shared) and I (invalid) states. The caches have been designed for maximum flexibility and performance. The data cache is configurable as writeback or writethrough on a line-by-line basis. Memory areas can be defined as non-cacheable by software and external hardware. Cache writeback and invalidations can be initiated by hardware or software. Protocols for cache consistency and line replacement are implemented in hardware, easing system design.

1.3.2 Cache Organization

On the Pentium processor, each of the caches is 8 Kbytes in size and each is organized as a 2-way set associative cache. There are 128 sets in each cache, each set containing 2 lines (each line has its own tag address). Each cache line is 32 bytes wide. In the Pentium processor, replacement in both the data and instruction caches is handled by the LRU mechanism, which requires one bit per set in each of the caches.
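The figures above fix how an address is divided when a cache is looked up: 32-byte lines need 5 offset bits and 128 sets need 7 index bits, leaving the remaining upper bits as the tag. The sketch below works through that arithmetic. The function name `split_address` is hypothetical, and the use of consecutive low-order address bits for offset and index is an assumption (the conventional layout), not something stated here.

```python
LINE_SIZE = 32    # bytes per cache line
NUM_SETS = 128    # sets per cache
WAYS = 2          # lines per set

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1     # 7 bits select the set

# Sanity check: 128 sets x 2 lines x 32 bytes is the stated 8 Kbytes.
assert LINE_SIZE * NUM_SETS * WAYS == 8 * 1024


def split_address(addr):
    """Split a physical address into (tag, set_index, byte_offset)."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

With a 32-bit physical address this leaves a 20-bit tag; for instance `split_address(0x12345678)` gives tag `0x12345`, set 51 and byte offset 24.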

The data cache consists of eight banks interleaved on 4-byte boundaries. The data cache can be accessed simultaneously from both pipes, as long as the references are to different cache banks. A conceptual diagram of the organization of the data and code caches is shown in Figure 1-3. Note that the data cache supports the MESI writeback cache consistency protocol, which requires 2 state bits, while the code cache supports the S and I states only and therefore requires only one state bit.

Figure 1-3 Conceptual Organization of Code and Data Caches
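The bank rule for the data cache can be checked with simple address arithmetic. This is a simplified sketch that applies only the stated rule (eight banks, interleaved on 4-byte boundaries); the function names are hypothetical and any further conditions the real hardware applies are ignored.

```python
NUM_BANKS = 8
BANK_WIDTH = 4   # bytes; banks are interleaved on 4-byte boundaries


def bank_of(addr):
    """Return the data cache bank (0-7) that holds the given byte address."""
    return (addr // BANK_WIDTH) % NUM_BANKS


def can_access_together(addr_u, addr_v):
    """True if the u-pipe and v-pipe references fall in different banks."""
    return bank_of(addr_u) != bank_of(addr_v)
```

Two references 4 bytes apart (`0x100` and `0x104`) fall in adjacent banks and can proceed together, while references 32 bytes apart (`0x100` and `0x120`) map to the same bank and conflict.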

1.3.3 Cache Structure

The instruction and data caches can be accessed simultaneously. The instruction cache can provide up to 32 bytes of raw opcodes and the data cache can provide data for two data references all in the same clock. This capability is implemented partially through the tag structure. The tags in the data cache are triple ported. One of the ports is dedicated to snooping while the other two are used to look up two independent addresses corresponding to data references from each of the pipelines. The instruction cache tags of the Pentium processor are also triple ported. Again, one port is dedicated to support snooping and the other two ports facilitate split line accesses (simultaneously accessing the upper half of one line and the lower half of the next line).

The storage array in the data cache is single ported but interleaved on 4-byte boundaries to be able to provide data for two simultaneous accesses to the same cache line. Each of the caches is parity protected. In the instruction cache, there are parity bits on a quarter-line basis and there is one parity bit for each tag. The data cache contains one parity bit for each tag and a parity bit per byte of data. Each of the caches is accessed with physical addresses and each cache has its own TLB (translation lookaside buffer) to translate linear addresses to physical addresses. The TLBs associated with the instruction cache are single ported, whereas the data cache TLBs are fully dual ported to be able to translate two independent linear addresses for two data references simultaneously. The tag and data arrays of the TLBs are parity protected, with a parity bit associated with each of the tag and data entries in the TLBs.


The data cache of the Pentium processor has a 4-way set associative, 64-entry TLB for 4-Kbyte pages and a separate 4-way set associative, 8-entry TLB to support 4-Mbyte pages. The code cache has one 4-way set associative, 32-entry TLB for 4-Kbyte pages and 4-Mbyte pages which are cached in 4-Kbyte increments. Replacement in the TLBs is handled by a pseudo LRU mechanism (similar to the Intel486 CPU) that requires 3 bits per set.

1.4 The Pentium Pro /Pentium II/Pentium III

1.4.1 The Pentium Pro

The Pentium Pro Processor on-chip level one (L1) caches consist of one 8-Kbyte four-way set associative instruction cache unit with a cache line length of 32 bytes and one 8-Kbyte two-way set associative data cache unit. Not all misses in the L1 cache expose the full memory latency. The level two (L2) cache masks the full latency caused by an L1 cache miss. The minimum delay for a L1 and L2 cache miss is between 11 and 14 cycles based on DRAM page hit or miss. The data cache can be accessed simultaneously by a load instruction and a store instruction, as long as the references are to different cache banks.

Figure 1.4 The Pentium Pro, II, III Processor Micro-Architecture with Advanced Transfer Cache Enhancement: the first and second level caches

1.4.2 The Pentium II /Pentium III

The on-chip cache subsystem of Pentium II and Pentium III processors consists of two 16-Kbyte four-way set associative caches with a cache line length of 32 bytes. The caches employ a write-back mechanism and a pseudo-LRU (least recently used) replacement algorithm. The data cache consists of eight banks interleaved on four-byte boundaries.

Level two (L2) caches have been off chip but in the same package. They are 128K or more in size. L2 latencies are in the range of 4 to 10 cycles. An L2 miss initiates a transaction across the bus to memory chips. Such an access requires on the order of at least 11 additional bus cycles, assuming a DRAM page hit. A DRAM page miss incurs another three bus cycles. Each bus cycle equals several processor cycles; for example, one bus cycle for a 100 MHz bus is equal to four processor cycles on a 400 MHz processor. The speed of the bus and the sizes of L2 caches are implementation dependent, however. Check the specifications of a given system to understand the precise characteristics of the L2 cache.

Figure 1-5 The Intel NetBurst Micro-Architecture: the First-Level and Second-Level Caches and the Trace Cache

1.5 The Pentium 4 Processor

The Intel Pentium 4 processor is the latest IA-32 processor, and the first based on the Intel NetBurst micro-architecture (Figure 1.5). The Intel NetBurst micro-architecture can support up to three levels of on-chip cache. Only two levels of on-chip cache are implemented in the Pentium 4 processor, but it introduces a new concept: the trace cache. The level nearest to the execution core of the processor, the first level, contains separate caches for instructions and data: a first-level data cache and the trace cache, which is an advanced first-level instruction cache. All other levels of caches are shared. The levels in the cache hierarchy are not inclusive, that is, the fact that a line is in level i does not imply that it is also in level i+1. All caches use a pseudo-LRU (least recently used) replacement algorithm.

1.5.1 Execution Trace Cache

The execution trace cache (TC) is the primary instruction cache in the Intel NetBurst micro-architecture. The TC stores decoded IA-32 instructions, or µops. This removes decoding costs on frequently-executed code, such as template restrictions and the extra latency to decode instructions upon a branch misprediction.

In the Pentium 4 processor implementation, the TC can hold up to 12K µops and can deliver up to three µops per cycle. The TC does not hold all of the µops that need to be executed in the execution core. In some situations, the execution core may need to execute a microcode flow instead of the µop traces that are stored in the trace cache. The Pentium 4 processor is optimized so that most frequently-executed IA-32 instructions come from the trace cache, efficiently and continuously, while only a few instructions involve the microcode ROM.

1.5.2 The Second-level Cache

A second-level cache miss initiates a transaction across the system bus interface to the memory sub-system. The system bus interface supports using a scalable bus clock and achieves an effective speed that quadruples the speed of the scalable bus clock. It takes on the order of 12 processor cycles to get to the bus and back within the processor, and 6-12 bus cycles to access memory if there is no bus congestion. Each bus cycle equals several processor cycles. The ratio of processor clock speed to the scalable bus clock speed is referred to as the bus ratio. For example, one bus cycle for a 100 MHz bus is equal to 15 processor cycles on a 1.50 GHz processor.
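The bus ratio makes it easy to express a memory access in processor cycles. The sketch below is a rough estimate only: the function names are hypothetical, and simply adding the in-processor overhead to the bus time is an assumption of this example, not a statement about how the hardware overlaps the two.

```python
def bus_ratio(cpu_mhz, bus_mhz):
    """Processor cycles that elapse during one cycle of the scalable bus clock."""
    return cpu_mhz / bus_mhz


def memory_access_cycles(cpu_mhz, bus_mhz, bus_cycles, core_overhead=12):
    """Estimate the processor cycles for a second-level cache miss.

    core_overhead is the roughly 12 processor cycles spent getting to the
    bus and back; bus_cycles is the 6-12 bus cycles needed to access memory.
    Assumption: the two costs simply add up, with no bus congestion.
    """
    return core_overhead + bus_cycles * bus_ratio(cpu_mhz, bus_mhz)
```

For a 1.50 GHz processor on a 100 MHz bus, the bus ratio is 15, so under these assumptions a miss costs between `memory_access_cycles(1500, 100, 6)` = 102 and `memory_access_cycles(1500, 100, 12)` = 192 processor cycles.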

2 Pipeline

Pipelining is an architectural technique for increasing the throughput of complex, multiple-cycle instructions. Each instruction is divided into a series of smaller stages, each of which can be completed within a single clock cycle, so that the frequency and throughput of the system can be improved.

2.1 Pentium & Pentium with MMX

The Pentium processor has a five-stage pipeline for integer instructions, while the Pentium processor with MMX has an additional pipeline stage. The pipeline stages are shown below:

PF Prefetch
F Fetch (Pentium processor with MMX technology only)
D1 Instruction Decode
D2 Address Generate
EX Execute (ALU and Cache Access)
WB Write back

The Pentium processor is a superscalar machine with two pipelines, called the “u” and the “v” pipes. Figure 2.1 shows the instruction flow in the Pentium processor.
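The flow of instructions through the five integer stages can be tabulated clock by clock. The sketch below models a single pipe under ideal conditions; the function name `occupancy` is hypothetical, and the absence of stalls, branches and pairing between the two pipes is an assumption made to keep the illustration small.

```python
STAGES = ["PF", "D1", "D2", "EX", "WB"]


def occupancy(num_instructions, stages=STAGES):
    """Return one row per clock: the instruction index in each stage, or None."""
    total_cycles = num_instructions + len(stages) - 1
    table = []
    for cycle in range(total_cycles):
        row = []
        for stage_index in range(len(stages)):
            # Instruction i enters stage s at clock i + s when nothing stalls.
            i = cycle - stage_index
            row.append(i if 0 <= i < num_instructions else None)
        table.append(row)
    return table


for clock, row in enumerate(occupancy(3), start=1):
    cells = [
        f"{stage}:{'i' + str(i) if i is not None else '--'}"
        for stage, i in zip(STAGES, row)
    ]
    print(clock, " ".join(cells))
```

Three instructions finish after 3 + 5 - 1 = 7 clocks in this model, and from the fifth clock onward one instruction completes per clock.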

The Pentium processor also has a floating-point pipeline. The floating-point unit (FPU) is integrated with the integer unit on the same chip and has 8 pipeline stages, the first five of which are shared with the integer unit. Integer instructions pass through only the first 5 stages and use the fifth (X1) stage as a WB (write-back) stage. The 8 FP pipeline stages and the activities performed in them are shown below:

PF Prefetch
F Fetch (Pentium processor with MMX technology only)
D1 Instruction Decode
D2 Address Generate
EX Memory and register read; conversion of FP data to external memory format and memory write.
X1 Floating Point Execute stage one.
X2 Floating Point Execute stage two.
WF Perform rounding and write the floating-point result to the register file.
ER Error Reporting/Update Status Word.

Figure 2.1 Pentium processor pipeline

The Pentium processor with MMX has an extra stage, obtained by dividing Prefetch into two stages, Prefetch and Fetch; its pipeline is thus 6 stages deep, which yields higher throughput.

2.2 The Pentium Pro /Pentium II/Pentium III

The Pentium Pro, Pentium II and Pentium III processors have the same pipeline architecture. The Pentium Pro and Pentium II processors have an in-order front end, an out-of-order execution path, and an in-order back end. In effect, the Pentium Pro/Pentium II processor consists of an outer CISC shell with an inner RISC core. Figure 2.2 shows a block diagram of the Pentium Pro/Pentium II processor. The operation of the Pentium Pro/Pentium II processor can be summarized as follows:

A. The processor fetches instructions from memory in the order of the static program.
B. Each instruction is translated into one or more fixed-length RISC instructions, known as micro-operations, or micro-ops.
C. The processor executes the micro-ops on a superscalar pipeline organization, so that the micro-ops may be executed out of order.
D. The processor commits the results of each micro-op execution to the processor’s register set in the order of the original flow.

The Pentium Pro/Pentium II processor pipeline has 13 stages (Figure 2.3), as shown below:

BTB0 Branch Target Buffer 0
BTB1 Branch Target Buffer 1
IFU0 Instruction Fetch
IFU1 Scan the bytes to determine instruction boundaries
IFU2 Instruction predecode
ID0 Instruction Decode
ID1 Instruction Decode
RAT Register Alias Table (register renaming)
ROB Reorder Buffer read, up to two register reads per cycle
RS Reservation Station, micro-ops wait for operands and functional pipelines in ports 0-4 to become available.
EX Execute Stage, 5 ports are available for the execute stage
ROB(wb) Reorder Buffer (writeback)
RRF Retirement Register File

Figure 2.2 Pentium pro/Pentium II block diagram

2.2 Pentium 4

The Pentium 4 has a different architecture from its predecessors, named the NetBurst Micro-Architecture. The Pentium 4 pipeline uses Hyper Pipelined Technology, which reaches a comparatively great depth: 20 stages. The execution of each instruction is divided into smaller parts, each of which is easier and faster to execute than the entire instruction, so the developers can raise the CPU frequency. While today's 0.18 micron technology allows the Pentium III processor to reach only 1 GHz, future Pentium 4 processors will be able to support working frequencies of up to 2 GHz.

The pipeline of the Intel NetBurst Micro-Architecture contains three sections:

A. the in-order issue front end
B. the out-of-order superscalar execution core
C. the in-order retirement unit.

The front end supplies instructions in program order to the out-of-order core. It fetches and decodes IA-32 instructions into micro-operations. The out-of-order core can issue multiple micro-operations per cycle and aggressively reorders them so that those micro-operations that are ready for execution can execute as soon as possible. The retirement section ensures that the results of execution of the micro-operations are processed according to the original program order and that the proper architectural state is updated. Figure 1.5 shows the block diagram of the Intel NetBurst Micro-Architecture.

2.3 Pipeline summary

Intel has developed its processor series from the Pentium to the Pentium 4. The architecture of the processors has changed a lot: to improve performance and throughput, the pipeline has become longer and longer and its operation more complex, as can be seen in Figure 2.4.


Figure 2.3 Pentium pro/Pentium II pipeline

To measure the pipeline performance, we can develop a speedup factor for the instruction pipeline compared to execution without the pipeline (based on a discussion in [HWAN93]). This model supposes that instructions are processed without branches.

$$S_k = \frac{T_1}{T_k} = \frac{nk\tau}{[k + (n - 1)]\tau} = \frac{nk}{k + (n - 1)}$$

$S_k$: speedup factor

$T_1$: the time to execute n instructions without a pipeline

$T_k$: the time to execute n instructions with a k-stage pipeline

$k$: the number of stages in the instruction pipeline
$n$: the number of instructions
$\tau$: the cycle time of the instruction pipeline

As $n \to \infty$ the speedup approaches $k$, so the larger the number of pipeline stages, the greater the potential for speedup. However, a deeper pipeline is not free of drawbacks. The first one is evident: since there are more stages to execute before an operation is completed, the overall time required for each operation increases. That is why, to make sure that the first Pentium 4 models prove faster than the older Pentium III CPUs, Intel starts its new processor family at 1.4 GHz. If Intel launched a 1 GHz Pentium 4, it would undoubtedly be beaten by a 1 GHz Pentium III CPU.
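The speedup model can be evaluated numerically. Under its assumptions (no branches, one cycle per stage), n instructions need n·k cycles without pipelining and k + (n − 1) cycles on a k-stage pipeline, which gives a speedup of n·k / (k + n − 1). The sketch below tabulates this for the pipeline depths discussed in this part (5, 13 and 20 stages); the function name `speedup` and the chosen instruction counts are illustrative.

```python
def speedup(n, k):
    """Ideal speedup of a k-stage pipeline over unpipelined execution of n instructions."""
    # Without a pipeline: n * k cycles. With a pipeline: k cycles to fill,
    # then one instruction completes per cycle for the remaining n - 1.
    return (n * k) / (k + n - 1)


for k in (5, 13, 20):
    print(k, [round(speedup(n, k), 2) for n in (1, 10, 100, 10_000)])
```

A single instruction gains nothing (`speedup(1, k)` is 1 for any k), and the speedup approaches k only for long, branch-free instruction sequences, which is why the idealised figure overstates what a deep pipeline achieves on real code.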

The second drawback of a deeper pipeline comes to light in case a branch prediction error occurs. The Pentium series processor is capable of executing instructions in succession as well as in parallel. In the latter case the instructions do not always follow the order they are listed in the program and the branches aren't always correctly predicted. In order to choose the right branch for further execution the CPU predicts the results judging by the collected stats. However, if the processor mis-predicts a branch, all the speculatively executed instructions must be flushed from the processor pipeline in order to restart the instruction execution down the correct program branch. On more deeply pipelined designs, more instructions must be flushed from the pipeline, resulting in a longer recovery time from a branch mis-predict. The net result is that applications that have many, difficult to predict, branches will tend to have a lower average level of instructions per clock.
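The effect of the flush penalty on average throughput can be illustrated with a common back-of-the-envelope model: each mispredicted branch adds the flush penalty to the cycle count. The sketch below is only an illustration of that reasoning. The function name and every number used in the usage note (branch fraction, misprediction rate, penalties) are hypothetical values chosen for the example, not measurements of any Pentium processor.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, flush_penalty):
    """Average cycles per instruction once misprediction flushes are included.

    base_cpi:        cycles per instruction with perfect prediction
    branch_fraction: fraction of instructions that are branches
    mispredict_rate: fraction of branches that are mispredicted
    flush_penalty:   cycles lost per misprediction (grows with pipeline depth)
    """
    return base_cpi + branch_fraction * mispredict_rate * flush_penalty
```

With an assumed base CPI of 1.0, 20% branches and a 10% misprediction rate, a 10-cycle penalty gives a CPI of about 1.2, while a 20-cycle penalty gives about 1.4: doubling the recovery time doubles the share of cycles lost to mispredictions, which is the trade-off a deeper pipeline has to win back through clock frequency.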

3 Parallel and superscalar aspects of Pentium processor family.

Since the functionality of these processors was explained previously, this part takes a closer look only at the superscalar and parallel aspects of this processor family that were not mentioned before. It is beyond the purpose of this part to explain the functionality of the individual processors again.


Figure 2.4 The pipeline of Intel Pentium series processor

The SIMD aspects of this processor family are also presented: MMX, SSE and SSE2. We try to present the architectural details of these implementations and the reasons for which they were added, rather than enumerate the instructions they added.

3.1 Superscalar aspects

3.1.1 Pentium

The original Pentium had a superscalar component consisting of two separate integer execution units capable of executing 2 instructions in parallel. The pipelines are called the “u” and “v” pipelines. The floating-point unit shares the first 5 stages with the integer pipeline. In the decode stage D1, the Pentium has 2 decoders working in parallel.

3.1.2 Pentium Pro/Pentium II

The Pentium II has basically the same superscalar organization as the Pentium Pro, with the addition of the MMX execution units.

The essential components of the superscalar organization are the instruction fetch and decode units, the dispatch and execute unit, and the retire unit.

The ID1 stage (instruction decode 1) contains 3 decoders which can work in parallel: one complex decoder and two simple ones. The complex decoder can handle Pentium instructions that translate into up to four micro-ops. The second and third decoders can handle only simple Pentium instructions that map into a single micro-op. A few instructions require more than four micro-ops. These are transferred to the MIS (microcode instruction sequencer), a microcode ROM which contains the series of micro-ops associated with the complex machine instructions.

The output of ID1 or the MIS is fed to ID2 (instruction decode 2) in blocks of 6 micro-ops at a time. Operations queued in ID2 pass through a renaming phase in the register alias table (RAT), which remaps references to the 16 architectural registers into a set of 40 physical registers. In this way false dependencies are removed.
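How renaming removes false dependencies can be seen on a three-operation example. The sketch below is a simplified model of the idea, not of the processor's RAT: the function `rename`, the tuple format `(dest, src1, src2)` and the physical register names `p0`, `p1`, ... are hypothetical, and physical registers are never recycled here.

```python
def rename(micro_ops, num_physical=40):
    """Rename (dest, src1, src2) triples so that every write gets a fresh physical register."""
    alias = {}                          # architectural name -> current physical register
    free = iter(range(num_physical))    # assumption: registers are never freed in this sketch
    renamed = []
    for dest, *srcs in micro_ops:
        # Sources read the most recent mapping; names never written stay as they are.
        new_srcs = [alias.get(s, s) for s in srcs]
        phys = f"p{next(free)}"
        alias[dest] = phys
        renamed.append((phys, *new_srcs))
    return renamed


program = [
    ("eax", "ebx", "ecx"),   # eax <- ebx op ecx
    ("edx", "eax", "ebx"),   # reads eax: a true dependency on the first operation
    ("eax", "esi", "edi"),   # rewrites eax: a false dependency on the first two
]
print(rename(program))
```

After renaming, the third operation writes `p2` instead of reusing the register behind `eax`, so it no longer has to wait for the second operation to read the old value; only the true dependency (the second operation reading `p0`) remains.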

The RAT feeds the reorder buffer (ROB) with the revised micro-ops. The ROB is a circular buffer which can hold up to 40 micro-ops. Micro-ops enter the ROB in order and are dispatched to the execution units out of order, the only criteria for dispatch being the availability of the appropriate execution unit and of the necessary data items. Micro-ops are retired from the ROB in order.

The RS (reservation station) is responsible for retrieving micro-ops from the ROB. The RS looks for micro-ops whose status indicates that they are ready for execution (all operands are available) and dispatches them to the appropriate execution unit. Up to 5 micro-ops can be dispatched in one cycle.

Five ports connect the RS to the execution units. Port 0 is used for both integer and floating-point instructions, with the exception of simple integer operations and the handling of branch mispredictions, which are allocated to port 1. The MMX execution units are divided between these two ports. The other ports are for memory loads and stores.

Once an execution is completed, the entry in ROB is updated and the execution unit is free for the next micro-op.

The RU (retire unit) works to commit the result of instruction execution. Once it is determined that a micro-op is not vulnerable to branch misprediction, it is marked as ready for retirement. When the previous Pentium instruction has been retired and all the micro-ops of the next instruction have been marked as ready for retirement, the RU deletes the micro-ops from the ROB and updates all the registers affected by this instruction.
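The interplay of in-order entry, out-of-order completion and in-order retirement can be sketched with a small queue model. This is an illustration under simplifying assumptions: the class `ReorderBuffer` and its methods are hypothetical, retirement is tracked per micro-op rather than per instruction, and the limit of three retirements per cycle is an assumed parameter.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, capacity=40):
        self.capacity = capacity
        self.entries = deque()   # oldest micro-op at the left

    def allocate(self, uop):
        """Enter a micro-op in program order; return False if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"uop": uop, "done": False})
        return True

    def complete(self, uop):
        """Mark a micro-op as executed; completions may arrive in any order."""
        for entry in self.entries:
            if entry["uop"] == uop:
                entry["done"] = True
                return
        raise KeyError(uop)

    def retire(self, max_per_cycle=3):
        """Retire completed micro-ops from the head only, preserving program order."""
        retired = []
        while (self.entries and self.entries[0]["done"]
               and len(retired) < max_per_cycle):
            retired.append(self.entries.popleft()["uop"])
        return retired
```

If micro-ops a, b and c are allocated in that order and c completes first, `retire()` returns nothing until a has completed; once a and then b are done, b and c retire together, still in program order.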

3.1.4 Pentium 4

Instructions are fetched and decoded by a translation engine. There is only one decoder, which can decode instructions at a maximum rate of 1 per clock cycle. Some complex instructions must use the microcode ROM (as in the Pentium II/Pentium Pro). The translation engine builds the decoded instructions into sequences of micro-ops called traces, which are stored in the trace cache. The execution trace cache stores these micro-ops in the path of program execution flow, where the results of branches in the code are integrated into the same cache line. The trace cache can deliver up to 3 micro-ops per clock to the core. The core is designed to facilitate parallel execution. It can dispatch up to 6 micro-ops per cycle through the 4 issue ports shown in Figure 3.1. Six micro-ops per cycle exceeds the trace cache and retirement micro-op bandwidth.


Figure 3.1 Pentium 4 Execution Unit

Most execution units can start executing a new micro-op every cycle, so that several instructions can be in flight at a time for each pipeline. A number of arithmetic logical unit (ALU) instructions can start two per cycle, and many floating-point instructions can start one every two cycles. Micro-ops can begin execution, out of order, as soon as their data inputs are ready and resources are available (the same concept as Pentium II/Pentium Pro).

Port 0. In the first half of the cycle, port 0 can dispatch either one floating-point move micro-op (including floating-point stack move, floating-point exchange or floating-point store data), or one arithmetic logical unit (ALU) micro-op (including arithmetic, logic or store data). In the second half of the cycle, it can dispatch one similar ALU micro-op.

Port 1. In the first half of the cycle, port 1 can dispatch either one floating-point execution (all floating-point operations except moves, all SIMD operations) micro-op or normal-speed integer (multiply, shift and rotate) micro-op, or one ALU (arithmetic, logic or branch) micro-op. In the second half of the cycle, it can dispatch one similar ALU micro-op.

Port 2. Port 2 supports the dispatch of one load operation per cycle.

Port 3. Port 3 supports the dispatch of one store address operation per cycle.

Thus the total issue bandwidth can range from zero to six micro-ops per cycle.

When a micro-op completes and writes its result to the destination, it is retired. Up to 3 micro-ops may be retired per cycle. The Reorder Buffer (ROB) is the unit in the processor which buffers completed micro-ops, updates the architectural state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and sends updated branch target information to the branch target buffer (BTB) to update branch history.

3.2 SIMD

3.2.1 Pentium Pro

Pentium Pro does not implement the MMX (Matrix Math Extensions) execution unit or the MMX register set and therefore does not support the MMX instruction set.

3.2.2 Pentium II

SIMD computations were introduced into Intel IA-32 architecture with the Intel MMX technology. Pentium II processor implements MMX support. The Intel MMX technology allows SIMD computations to be performed on packed byte, word and doubleword integers that are contained in a set of eight registers called MMX registers.

The eight general-purpose registers are used along with the existing IA-32 addressing modes to address operands in memory. (The MMX registers cannot be used to address memory). The general-purpose registers are also used to hold operands for some MMX technology operations.


These MMX registers are mapped over the FPU registers. FPU registers are 80 bits wide but MMX registers are 64 bits wide. MMX registers are aliased onto the 64-bit mantissa portion of the FP registers. When a value is written to one of the MMX registers, it also appears in the mantissa portion of the respective FP register. The reverse is also true. When a value is written to an MMX register, bits 79-64 of the corresponding FP register are all set to one. The MMX registers are explicitly addressed by name (FP registers are addressable as stack locations).
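The aliasing can be pictured by treating the 80-bit FP register as an integer whose low 64 bits are the mantissa field. The sketch below models only the two effects described above; the function names `write_mmx` and `read_mmx` are hypothetical, and effects on the FPU tag word are not modelled.

```python
MANTISSA_MASK = (1 << 64) - 1      # bits 63-0 of the 80-bit FP register
EXPONENT_ONES = 0xFFFF << 64       # bits 79-64, all set to one


def write_mmx(value64):
    """Return the 80-bit FP register image after a write to the aliased MMX register."""
    return EXPONENT_ONES | (value64 & MANTISSA_MASK)


def read_mmx(fp_register80):
    """An MMX read sees only the low 64 bits (the mantissa field)."""
    return fp_register80 & MANTISSA_MASK
```

Writing `0x0123456789ABCDEF` through the MMX name leaves the FP register holding `0xFFFF0123456789ABCDEF`, which shows why x87 code that runs afterwards without any clean-up would see meaningless floating-point values.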

An application can contain both x87 FPU floating-point and MMX instructions. However, because the MMX registers are aliased to the x87 FPU register stack, care must be taken when making transitions between x87 FPU instructions and MMX instructions to prevent the loss of data in the x87 FPU and MMX registers and to prevent incoherent or unexpected results.

When the first MMX instruction is executed, two things occur: the FPU registers are renamed as MMX registers and the FPU tag word is marked valid. Because of this it is necessary that an EMMS (Empty MMX State) instruction be executed after completion of the MMX code and before any FPU code is executed. The MMX instruction set is divided into the following groups of instructions: arithmetic, comparison, conversion, logical, shift, data transfer, and the Empty MMX State (EMMS) instruction.

The MMX execution units are connected to the Reservation Station (RS) at the first 2 ports. At port 0 is the MMX ALU Unit and MMX Multiplier Unit and at the port 1 there is an MMX ALU unit and MMX Shifter Unit.

3.2.3 Pentium III

The Intel MMX technology introduced single-instruction multiple-data (SIMD) capability into the IA-32 architecture, with the 64-bit MMX registers, 64-bit packed integer data types, and instructions that allowed SIMD operations to be performed on packed integers. The SSE extensions extend this SIMD execution model, by adding facilities for handling packed and scalar single-precision floating-point values contained in 128-bit registers.

The SSE extensions introduced one data type, the 128-bit packed single-precision floating-point data type, to the IA-32 architecture. This data type consists of four IEEE 32-bit single-precision floating-point values packed into a double quadword. The 32-bit MXCSR register, which provides control and status bits for operations performed on the XMM registers is also added to Intel IA-32 architecture.

The Intel Pentium III offers the Internet Streaming SIMD Extensions (SSE), which add 70 new instructions enabling advanced imaging, 3D, streaming audio and video, and speech recognition for an enhanced Internet experience. These include SIMD floating-point instructions, additional SIMD integer instructions, and cacheability control instructions.

SSE allows SIMD computations to be performed on operands that contain four packed single-precision floating-point data elements. The operands can be either in memory or in a set of eight 128-bit registers called XMM registers. SSE also extends SIMD computational capability with additional 64-bit MMX instructions. The XMM registers can be addressed directly using the names XMM0 to XMM7, and they can be accessed independently of the x87 FPU and MMX registers and the general-purpose registers (they are not aliased to any other of the processor's registers). The XMM registers can only be used to perform calculations on data; they cannot be used to address memory. Addressing memory is accomplished by using the general-purpose registers.
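The difference between a packed and a scalar single-precision operation can be illustrated with a small software model. This is a sketch only: the function names are hypothetical, each XMM register is represented as a list of four values, and the behaviour modelled is that of the ADDPS (packed) and ADDSS (scalar) instructions.

```python
import struct


def to_f32(x):
    """Round a Python float to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def packed_add_ps(a, b):
    """Add all four single-precision lanes element-wise (the ADDPS pattern)."""
    assert len(a) == 4 and len(b) == 4
    return [to_f32(to_f32(x) + to_f32(y)) for x, y in zip(a, b)]


def scalar_add_ss(a, b):
    """Add only the lowest lane; the upper three lanes of a pass through (the ADDSS pattern)."""
    assert len(a) == 4 and len(b) == 4
    return [to_f32(to_f32(a[0]) + to_f32(b[0]))] + [to_f32(x) for x in a[1:]]
```

One packed instruction therefore does the work of four scalar additions, which is the source of the SIMD speed-up.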

Data can be loaded into the XMM registers or written from the registers to memory in 32-bit, 64-bit, and 128-bit increments. When storing the entire contents of an XMM register in memory (128-bit store), the data is stored in 16 consecutive bytes, with the low-order byte of the register being stored in the first byte in memory.
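The byte order of a 128-bit store can be checked with a short sketch. The helper names are hypothetical; the register is modelled as a 128-bit Python integer and memory as a bytes object.

```python
def store_xmm(value128):
    """Return the 16 bytes a 128-bit store writes, lowest memory address first."""
    return value128.to_bytes(16, byteorder="little")


def load_xmm(memory):
    """Rebuild the 128-bit register value from 16 consecutive bytes."""
    return int.from_bytes(memory[:16], byteorder="little")
```

For example, storing the value 0x0F0E0D0C0B0A09080706050403020100 places byte 0x00 at the first address and byte 0x0F at the last.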

The 32-bit MXCSR register contains control and status information for SSE and SSE2 SIMD floating-point operations. This register contains the flag and mask bits for the SIMD floating-point exceptions, the rounding control field for SIMD floating-point operations, the flush-to-zero flag that provides a means of controlling underflow conditions on SIMD floating-point operations, and the denormals-are-zeros flag that controls how SIMD floating-point instructions handle denormal source operands.
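The fields named above can be pulled out of an MXCSR value as in the sketch below. The bit positions are not given in this report; they are assumed here from the Intel IA-32 manuals (exception flags in bits 0-5, denormals-are-zeros in bit 6, exception masks in bits 7-12, rounding control in bits 13-14, flush-to-zero in bit 15), and the function name is hypothetical.

```python
ROUNDING_MODES = ["nearest", "down", "up", "toward zero"]


def decode_mxcsr(value):
    """Split a 32-bit MXCSR value into its documented fields (assumed bit layout)."""
    return {
        "exception_flags": value & 0x3F,              # bits 0-5
        "denormals_are_zeros": bool(value & (1 << 6)),  # bit 6
        "exception_masks": (value >> 7) & 0x3F,       # bits 7-12
        "rounding_control": ROUNDING_MODES[(value >> 13) & 0x3],  # bits 13-14
        "flush_to_zero": bool(value & (1 << 15)),     # bit 15
    }
```

With the power-on value 0x1F80, all exceptions are masked, rounding is to nearest, and both the flush-to-zero and denormals-are-zeros modes are off.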

The contents of this register can be loaded from memory with the LDMXCSR and FXRSTOR instructions and stored in memory with the STMXCSR and FXSAVE instructions.

The SSE instructions are divided into four functional groups:
• Packed and scalar single-precision floating-point instructions.
• 64-bit SIMD integer instructions.
• State management instructions.
• Cacheability control, prefetch, and memory ordering instructions.

The packed and scalar single-precision floating-point instructions are divided into the following subgroups:
• Data movement instructions
• Arithmetic instructions
• Logical instructions
• Comparison instructions
• Shuffle instructions
• Conversion instructions

The SSE data movement instructions move single-precision floating-point data between XMM registers and between an XMM register and memory.

The SSE arithmetic instructions perform addition, subtraction, multiply, divide, reciprocal, square root, reciprocal of square root, and maximum/minimum operations on packed and scalar single-precision floating-point values.

The SSE logical instructions perform AND, AND NOT, OR, and XOR operations on packed single-precision floating-point values.

The compare instructions compare packed and scalar single-precision floating-point values and return the results of the comparison either to the destination operand or to the EFLAGS register.

The SSE shuffle and unpack instructions shuffle or interleave the contents of two packed single-precision floating-point values and store the results in the destination operand.

The SSE conversion instructions support packed and scalar conversions between single-precision floating-point and doubleword integer formats.

The SSE extensions also add 64-bit packed integer instructions to the IA-32 architecture. These instructions operate on data in MMX registers and 64-bit memory locations.

The MXCSR state management instructions (LDMXCSR and STMXCSR) load and save the state of the MXCSR register, respectively. The LDMXCSR instruction loads the MXCSR register from memory, while the STMXCSR instruction stores the contents of the register to memory.

The SSE extensions introduce several new instructions to give programs more control over the caching of data.

The SSE extensions are fully compatible with all software written for IA-32 processors. All existing software continues to run correctly, without modification, on processors that incorporate the SSE extensions, as well as in the presence of existing and new applications that incorporate these extensions.

The XMM registers are independent of the x87 FPU and MMX registers, so SSE and SSE2 operations performed on the XMM registers can be performed in parallel with operations on the x87 FPU and MMX registers.

3.2.4 Pentium 4

The Pentium 4 upgrades the SSE of the P6 processors to SSE2 (Streaming SIMD Extensions 2), with seventy-six new SIMD instructions and enhancements to sixty-eight integer SIMD instructions. That makes 144 SIMD instructions to manage floating-point, application, and multimedia performance. From a programmer's perspective, the model for the new Pentium 4 CPU is not very different from the MMX technology and SSE models of the Pentium II and III. The new SSE2 instructions add much more flexibility, as they allow SIMD computations to be performed on floating-point, integer, and packed integer data types in the XMM registers.

SSE2 uses the same registers as SSE and is backward compatible with the SSE of the Pentium III processor.

The new SIMD instructions aim to remove one of the major bottlenecks found in today's x86 CPUs: the x87 FPU, or floating-point unit. The performance of the x87 FPU is severely restricted by its aging design, and improving it within that design would not be easy. Using SSE2 for floating-point work bypasses the x87 FPU and so circumvents the bottleneck. If Intel can find enough support among software developers to start using SSE2 for floating-point operations, the Pentium 4's SSE2 unit will be a lot faster than an equivalently clocked x87 FPU. SSE2 extends SIMD computations to operate on packed double-precision floating-point data elements and 128-bit packed integers. Depending on the instruction, the 144 SSE2 instructions operate on two packed double-precision floating-point data elements, or on 16 packed byte, 8 packed word, 4 packed doubleword, or 2 packed quadword integers.
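The packed data types listed above are simply different views of the same 128 bits. The sketch below illustrates this with the standard struct module; the function name is hypothetical and the register is modelled as 16 raw bytes in little-endian order.

```python
import struct


def views_of_xmm(raw16):
    """Interpret the same 16 bytes as each SSE2 packed data type."""
    assert len(raw16) == 16
    return {
        "bytes": struct.unpack("<16B", raw16),        # 16 packed bytes
        "words": struct.unpack("<8H", raw16),         # 8 packed words
        "doublewords": struct.unpack("<4I", raw16),   # 4 packed doublewords
        "quadwords": struct.unpack("<2Q", raw16),     # 2 packed quadwords
        "doubles": struct.unpack("<2d", raw16),       # 2 packed double-precision values
    }
```

Which view applies is decided by the instruction, not by the register: the hardware stores no type information alongside the 128 bits.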

The SSE2 instructions are divided into four functional groups:
• Packed and scalar double-precision floating-point instructions.
• 64-bit and 128-bit SIMD integer instructions.
• 128-bit extensions of SIMD integer instructions introduced with the MMX technology and the SSE extensions.
• Cacheability-control and instruction-ordering instructions.

The SSE2 extensions add several 128-bit packed integer instructions to the IA-32 architecture. Where appropriate, a 64-bit version of each of these instructions is also provided. The 128-bit versions of the instructions operate on data in the XMM registers, and the 64-bit versions operate on data in the MMX registers.

3.2.5 Summary of SIMD

The full set of IA-32 SIMD technologies (the Intel MMX technology, the SSE extensions, and the SSE2 extensions) gives the programmer the ability to develop algorithms that combine operations on packed 64-bit and 128-bit integer operands and on single- and double-precision floating-point operands.

All these technologies are architectural extensions in the IA-32 Intel architecture. All SIMD instructions are accessible from all IA-32 execution modes: protected mode, real address mode and Virtual 8086 mode.

A summary of the data types used by MMX, SSE and SSE2 can be seen in Figure 3.2.

Figure 3.2 MMX, SSE and SSE2 Data types

4 Branch prediction

In order to achieve high throughput and performance, Intel has made the pipeline longer and longer from the Pentium to the Pentium 4 processor, so the problem of how to handle branches has become increasingly important. A variety of approaches have been taken to predict whether a branch will be taken:

A. Predict never taken
B. Predict always taken
C. Predict by opcode
D. Taken/not taken switch
E. Branch history table


The first three are static prediction algorithms, and the last two are dynamic prediction algorithms. The Pentium series processors use several of these methods to predict branches, ensuring a steady flow of instructions into the initial stages of the pipelines.

4.1 Pentium

The Pentium processor uses a dynamic branch prediction strategy based on the history of recent executions of a branch instruction. A Branch Target Buffer (BTB) caches information about recently encountered branch instructions and is used to predict their outcome, which minimizes pipeline stalls due to prefetch delays. The Pentium processor accesses the BTB with the address of the instruction in the D1 stage. Each entry is governed by a branch prediction state machine with four states:

A. Strongly not taken
B. Weakly not taken
C. Weakly taken
D. Strongly taken

If an entry already exists in the BTB, the instruction unit is guided by the history information for that entry in determining whether to predict that the branch is taken. If a branch is predicted taken, the branch destination address associated with the entry is used for prefetching the branch target instruction. Once the instruction is executed, the history portion of the entry is updated to reflect the result of the branch instruction. If the instruction is not represented in the BTB, its address is loaded into an entry in the BTB; if necessary, an old entry is deleted.
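The four-state scheme and the BTB lookup described above can be sketched as follows. This is a simplified model, not the Pentium's actual implementation: the class name is hypothetical, and the initial state of a new entry and the oldest-first replacement policy are assumptions made only to keep the example small.

```python
STATES = ["strongly not taken", "weakly not taken", "weakly taken", "strongly taken"]


class BranchTargetBuffer:
    """Toy BTB: branch address -> [2-bit state index, last taken target]."""

    def __init__(self, capacity=256):
        self.capacity = capacity
        self.entries = {}

    def predict(self, branch_addr):
        """Return the predicted target, or None when predicting not taken or on a miss."""
        entry = self.entries.get(branch_addr)
        if entry is not None and entry[0] >= 2:
            return entry[1]
        return None

    def update(self, branch_addr, taken, target):
        """Record the real outcome once the branch has executed."""
        entry = self.entries.get(branch_addr)
        if entry is None:
            if len(self.entries) >= self.capacity:
                # Assumed policy: delete the oldest entry to make room.
                self.entries.pop(next(iter(self.entries)))
            # Assumed initial state: weakly not taken.
            entry = self.entries[branch_addr] = [1, target]
        if taken:
            entry[0] = min(entry[0] + 1, 3)  # move toward strongly taken
            entry[1] = target
        else:
            entry[0] = max(entry[0] - 1, 0)  # move toward strongly not taken
```

Because the counter saturates, a branch in the strongly taken state must be mispredicted twice before its prediction flips, which suits loop branches that are not taken only on exit.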

4.2 The Pentium Pro/Pentium II/Pentium III

The Pentium Pro/Pentium II/Pentium III have much longer pipelines, so the penalty for misprediction is greater. Accordingly, the Pentium Pro and the Pentium II use a more elaborate branch prediction scheme to reduce the misprediction rate.

The Pentium Pro/Pentium II BTB is organized as a four-way set-associative cache with 512 lines. Each entry uses the address of the branch as a tag. The entry also includes the branch destination address from the last time the branch was taken and a 4-bit history field. This use of four history bits contrasts with the 2 bits used in the original Pentium processor. With 4 bits, the Pentium Pro/Pentium II mechanism can take a longer history into account when predicting branches. The algorithm, referred to as Yeh's algorithm, provides a significant reduction in mispredictions compared with algorithms that use only 2 bits of history [EVER98].
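The benefit of a 4-bit history can be illustrated with a two-level adaptive predictor for a single branch. This is a sketch under assumptions, not the P6 implementation: the class name is hypothetical, and each of the 16 possible history patterns is given its own 2-bit saturating counter, starting in the weakly-not-taken state.

```python
class TwoLevelEntry:
    """One BTB entry: a 4-bit outcome history selecting one of 16 two-bit counters."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = 0
        # One 2-bit saturating counter per history pattern (assumed start value 1).
        self.counters = [1] * (1 << history_bits)

    def predict(self):
        """Predict taken when the counter for the current history is 2 or 3."""
        return self.counters[self.history] >= 2

    def update(self, taken):
        """Train the selected counter, then shift the outcome into the history."""
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

A branch that repeats the pattern taken, taken, not taken is learned perfectly by this scheme after a short warm-up, because each point in the pattern is preceded by a distinct 4-bit history; a single 2-bit counter would keep mispredicting the not-taken case.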

4.3 Pentium 4

The Pentium 4 processor with the Intel NetBurst Micro-Architecture predicts all near branches, including conditional branches, unconditional calls and returns, and indirect branches. It does not predict far transfers, for example far calls, irets, and software interrupts. Several mechanisms are implemented to aid in predicting branches more accurately and in reducing the cost of taken branches:

A. Dynamically predict the direction and target of branches based on the instruction's linear address, using the branch target buffer (BTB). The branch prediction buffer, which stores more detail on the history of past branches, has been enlarged to 4K entries, whereas the buffer of the P6 family holds only 512 entries.
B. If no dynamic prediction is available or if it is invalid, statically predict the outcome based on the offset of the target: a backward branch is predicted to be taken, and a forward branch is predicted to be not taken.
C. Return addresses are predicted using the 16-entry return address stack.
D. Traces of instructions are built across predicted taken branches to avoid branch penalties.

With a larger branch target buffer and a more advanced branch prediction algorithm, the Pentium 4 processor reduces the number of branch mispredictions by about 33% relative to the Pentium III processor's branch prediction capability. This is a very good result, because it means that the Pentium 4 predicts roughly 90-95% of branches correctly.
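The fallback from dynamic to static prediction (mechanisms A and B) can be sketched as below. The function names are hypothetical, and the dynamic predictor is represented only by its result, with None standing for "no valid BTB prediction".

```python
def static_predict_taken(displacement):
    """Backward branch (negative offset) -> taken; forward branch -> not taken."""
    return displacement < 0


def predict_branch(dynamic_prediction, displacement):
    """Prefer the dynamic prediction; fall back to the static rule when none exists."""
    if dynamic_prediction is not None:
        return dynamic_prediction
    return static_predict_taken(displacement)
```

The static rule works well in practice because backward branches usually close loops, which are taken on every iteration except the last.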


PART III SUMMARY

As discussed so far, all Pentium family processors, including the Pentium, Pentium Pro, Pentium II, Pentium III, and the latest Pentium 4, are based on the IA-32 Intel Architecture. The computing power and the complexity (or roughly, the number of transistors per processor) of Intel architecture processors have grown over the years in close relation to Moore's law. By taking advantage of new process technology and new micro-architecture designs, each new generation of IA-32 processors has demonstrated frequency-scaling headroom and new performance levels over the previous generation. We summarize the key features of the Pentium family processors in Table 1; more detailed comparisons are attached in Appendix 1.

Table 1. Key Features of Pentium Family Processors

Intel Processor | Date Introduced | Max. Clock Shipped | Transistors per Die | Register Sizes (bits) | Ext. Data Bus | Max. Extern. Addr. | Caches
Pentium | 1993 | 60 MHz | 3.1 M | GP: 32, FPU: 80 | 64 | 4 GB | L1: 16KB
Pentium Pro | 1995 | 200 MHz | 5.5 M | GP: 32, FPU: 80 | 64 | 64 GB | L1: 16KB; L2: 256KB or 512KB
Pentium II | 1997 | 266 MHz | 7 M | GP: 32, FPU: 80, MMX: 64 | 64 | 64 GB | L1: 32KB; L2: 256KB or 512KB
Pentium III | 1999 | 500 MHz | 8.2 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 64 | 64 GB | L1: 32KB; L2: 512KB
Pentium 4 (Intel NetBurst microarchitecture) | 2000 | 1.50 GHz | 42 M | GP: 32, FPU: 80, MMX: 64, XMM: 128 | 3.2 GB/s | 64 GB | L1: 8KB; Trace Cache: 12K µop; L2: 256KB

The IA-32 Intel Architecture has been at the forefront of the computer revolution and is today clearly the preferred computer architecture, as measured by the number of computers in use and the total computing power available in the world. Two major factors behind the popularity of the IA-32 architecture are the compatibility of software written to run on IA-32 processors, and the fact that each generation of IA-32 processors delivers significantly higher performance than the previous generation.

The IA-32 architecture has been and remains committed to maintaining backward compatibility at the object code level, to preserve Intel customers' large investment in software. At the same time, in each generation of the architecture, the latest and most effective micro-architecture and silicon fabrication technologies have been used to produce high-performance processors. In each generation of IA-32 processors, Intel has conceived and incorporated increasingly sophisticated techniques into its micro-architecture in pursuit of ever faster computers.

What is the future of the Pentium family? Perhaps Intel has recognized the disadvantages of the IA-32 architecture: because of its emphasis on backward compatibility, ever more sophisticated techniques have to be implemented to support obscure features. Intel and HP have cooperated on a new architecture, IA-64. IA-64 may be the best way forward for next-generation processors, but it is a revolutionary approach: we would have to give up all programs written for IA-32 and move to this newest generation.


Appendix 1: Comparison of Pentium Family Processors Specifications

General Details

Name: Pentium | Pentium Pro | Pentium II | Pentium III | Pentium 4

Family/Generation: 80586, 5th Generation | 80686, 6th Generation | 80686, 6th Generation, MMX | 80686, 6th Generation, MMX, SSE | Intel NetBurst Micro-Architecture, 42M transistors

Clock Frequencies

CPU Core Speed: 75, 90, 100, 120, 133, 150, 166, 200 MHz | 133, 150, 166, 180, 200 MHz | 333 MHz | 600E, 650E, ..., 850E MHz | 1.4 GHz

External Bus Speed: 50, 60, or 66 MHz | 60 or 66 MHz, GTL+ | 66 MHz, GTL+ | 100 MHz GTL+ (Slot 1), AGTL+ (Slot 2) | 400 MHz

Processor Core

Generic Details: CISC, In-order and Pipelined Execution | RISC, Out-of-order and Speculative Execution | | RISC, Out-of-order and Speculative Execution

Specific Details: Dual Pipeline Design | 20 Entry RS, 40 Entry ROB | 20 Entry RS, 40 Entry ROB | 20 Entry RS, 40 Entry ROB

Registers: 32 Bit Integer, 80 Bit FP | 32 Bit Integer, 80 Bit FP, 40 Entry RAT | 32 Bit Integer, 80 Bit FP, 64 Bit MM, 40 Entry RAT | 32 Bit Integer, 80 Bit FP, 64 Bit MM, 128 Bit SSE, 40 Entry RAT | GP: 32, FPU: 80, MMX: 64, XMM: 128

Pipeline Depth: 2 (Shared) plus 2x 3 (Dual Pipeline) Stages | 12 (In-order) plus 2 (Out-of-order) Stages | | 20 stages

Execution Units: 2x Integer, Pipelined FPU | 2x ALU, Load, Store Address, Store Data, Pipelined FPU | 2x ALU/MMX, Load, Store Address, Store Data, Pipelined FPU | 2x ALU/MMX/SSE, Load, Store Address, Store Data, Pipelined FPU | 2x DDR ALU/MMX/SSE2, Load, Store Address, Store Data, Pipelined FPU

Processor Buses

Address Bus Width: 32 Bit | 36 Bit | 36 Bit | 36 Bit | 36 Bit

Data Bus Width: 64 Bit | 64 Bit, separate 64 Bit Backside L2 Cache Bus | 64 Bit, separate 64 Bit Backside L2 Cache Bus | 64 Bit, separate 64+8 Bit Backside L2 Cache Bus with ECC (0.25 µm), separate 256+32 Bit Backside L2 Cache Bus with ECC (0.18 µm) | 64 Bit

Physical Memory: 2^32 bytes = 4 GB | 2^36 bytes = 64 GB | 2^36 bytes = 64 GB | 2^36 bytes = 64 GB | 2^36 bytes = 64 GB

Virtual Memory: (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB) | (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB) | (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB) | (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB) | (8,190 + 8,192) x 4 GB = 65,528 GB (~64 TB)

Multiprocessing: SMP, 2 Processors, using integrated local APICs | SMP, 4 Processors, using integrated local APICs | | SMP, 2 Processors, APICs

Processor Caches

Level 0: N/A | N/A | N/A | N/A | N/A

Level 1 Code: 8 KB, 2-Way, 32 Byte/Line, SI, 2x Fetch Port (supports Split-line Access), Snoop Port (for SMC), LRU | 8 KB, 4-Way, 32 Byte/Line, SI, Fetch Port, Internal and External Snoop Port (for SMC/XMC), LRU | | 8KB, 4-Way, 64B/line

Level 1 Data: 8 KB, 2-Way, 32 Byte/Line, MESI, Non-blocking, Dual-ported, Snoop Port, 8 Banks, LRU | 8 KB, 2-Way, 32 Byte/Line, MESI, Non-blocking, Dual-ported, Snoop Port, Write Allocate, 8 Banks, LRU | | Trace Cache: 12,000 µops

Level 2: External, depends on Motherboard | 256 KB..1 MB, 4-Way, 32 Byte/Line, Non-blocking, 64 GB cacheable, using 1 or 2 Dies inside Package | 256 KB, 4-Way, 32 Byte/Line, Non-blocking, 64 GB cacheable | Unified, 256 KB, 8-Way, 32 Byte/Line, MESI, Non-blocking, 64 GB cacheable, LRU | 256KB, 8-Way, 128B/line

Processor Buffers

Read Buffer: 32 Byte for Code Cache, 32 Byte for Data Cache | 4x 32 Byte | 4x 32 Byte | 4x 32 Byte (Shared)

Write Buffer: 2x 8 Byte (supports Dual Pipeline Design), 3x 32 Byte (Line Replacement Write Buffer, Internal and External Snoop Write Buffer) | 32 Byte | 32 Byte

Prefetch Queue: 2x 32 Byte (supports Dual Pipeline Design), SMC can be observed up to 94 Byte ahead | 32 Byte | 32 Byte | 32 Byte

Branch Prediction

Static: Yes | Yes | Yes | Yes | Yes

Dynamic: 256 Entries, 4-Way, 4-State | 512 Entries, 4-Way, providing 16x 4-State Pattern Recognition

RSB: N/A | 4 Entries | 4 Entries | 4 Entries

TLB

4KB CODE: 32 Entries, 4-Way, LRU | 4KB CODE: 32 Entries, 4-Way, LRU | | 4KB CODE

4MB CODE: N/A (uses 4 KB Code Entries) | LARGE CODE: 2 Entries, Full, LRU | | LARGE CODE

4KB DATA: 64 Entries, 4-Way, LRU | 4KB DATA: 64 Entries, 4-Way, LRU | 4KB DATA: 64 Entries, 4-Way, LRU | 4KB DATA: 64 Entries, 4-Way, LRU | 4KB DATA

4MB DATA: 8 Entries, 4-Way, LRU | LARGE DATA: 8 Entries, 4-Way, LRU | LARGE DATA: 8 Entries, 4-Way, LRU | LARGE DATA: 8 Entries, 4-Way, LRU | LARGE DATA

Instruction Set

Regular: IA-32 | IA-32 | IA-32 | IA-32 | IA-32

Floating Point: Integrated | Integrated | Integrated | Integrated | Integrated

Multi Media: N/A | N/A | MMX, FXSAVE/FXRSTOR | MMX, SSE | MMX, SSE2

Processor Modes: Real, Protected, Virtual, Paging, SMM, Probe Mode | Real, Protected, Virtual, Paging, SMM, Probe Mode





Appendix 2: References

ASTA99 Tanenbaum, A. S. Structured Computer Organization. Prentice Hall, 1999.
EVER98 Evers, M., et al. "An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work." Proceedings, 25th Annual International Symposium on Microarchitecture, July 1998.
HWAN93 Hwang, K. Advanced Computer Architecture. New York: McGraw-Hill, 1993.
Will96 Stallings, W. Computer Organization and Architecture. Prentice-Hall, Inc., 1996.
INTEL98 Intel. P6 Family of Processors Hardware Developer's Manual, 244001-001, 1998.
INTEL01 Intel. Intel Pentium 4 Processor and Intel 850 Performance Brief, 249240-003, 2001.
INTEL00 Intel. A Detailed Look Inside the Intel NetBurst Micro-Architecture of the Intel Pentium 4 Processor, 2000.
