SHAKTI Tutorial at the VLSI Design Conference

SHAKTI Processors
An Open Source Hardware Initiative
RISE LAB, IIT Madras

VLSI Design Conf. 2016, Kolkata

Who Are We?
• A crew of academicians and industrial consultants aimed at creating a platform for open source hardware in India.
• Active Members of the initiative:
  • Prof. V. Kamakoti: Professor, CSE Dept., IIT-Madras
  • G.S. Madhusudan: Principal Scientist, IIT-Madras
  • Neel T. Gala: PhD Scholar, CSE Dept., IIT-Madras
  • Arjun C. Menon: MS Scholar, CSE Dept., IIT-Madras
  • Rahul Bodduna: MS Scholar, CSE Dept., IIT-Madras
• The initiative also receives a continuous flow of M.Techs, interns and Project associates contributing to various aspects.

Our Motivation?
• Academia vs. Industry
  • Publications
  • Licensed IP
  • Lack of support
• Simulators
  • Limited accuracy
  • HW limitations are abstracted
  • What it takes to code in HW?
• Open RTL?
  • Design Space Exploration
  • Generalised
  • Minimal Support
  • Abstraction

Where to Start?
• Selecting a good ISA.
  • ISA (Instruction Set Architecture) forms the backbone of any general purpose processor.
  • ISA dictates the final Cycles per instruction and underlying micro-architecture.
  • Also defines the length of a program.
  • Affects compiler innovation.
• Should I go for RISC or CISC?
  • CISC is dead. No new commercial CISC ISA in 30+ years.
  • RISC is widespread and agreed upon for general purpose ISAs.
• VLIW?
  • A Fiasco so far!!
  • Unpredictable branches, Too complex compilers, variable memory latency.

Why the wait?
• ISAs are proprietary, for business reasons.
• Most companies driving dominant ISAs in the market lack the necessary expertise/experience to develop a proficient ISA.
• ISAs have become unnecessarily large and complex.
• Custom modification or expansion is not easy.

Are there any good open ISAs?
• RISC-V from UC-Berkeley.
  • Modest aim: “to create an industrial standard ISA with relevant hardware and software ecosystem for use in all computing devices”.
  • Based on RISC methodology.
  • Really simple and minimal ISA.
  • Enough scope for expansion.

The RISC-V ISA !!
• Offers 3 Base Integer ISAs.
  • RV32I, RV64I, and RV128I – one per address width
  • Compact: ~50 Instructions needed.
• Extensions Available:
  • M: Integer Multiply/Divide.
  • A: Atomic memory operations.
  • F: Single Precision Floating Point.
  • D: Double Precision Floating Point.
  • C: Compressed instruction encoding.
• Provision of extra space for Unique SoC based instructions

The RISC-V ISA !!
• Support for variable length instructions
  • Each instruction can be any number of 16-bit parcels
  • Jumps are 2-byte aligned

* Andrew Waterman, Yunsup Lee, David A. Patterson, and Krste Asanovic. The RISC-V instruction set manual, Volume I: User-level ISA. Technical Report UCB/EECS-2014-54., EECS Department,University of California, Berkeley, May 2014.

RISC-V INSTRUCTION FORMAT*


Other Challenges in HW Design
• Time to market

Challenges in Hardware Design
• Hardware Software co-simulation
[Figure: H.264 Decoder simulation speed: HDL (50 KHz), C++ (200 KHz), Transactors (tens of MHz)]

Ref: http://www.eve-team.com/demos/mandelbrot_demo.html

Challenges in Hardware Design
• Design Space Exploration
  • Globally optimal solution
  • More accurate compared to an architectural simulator
    • Some of them are not cycle accurate
    • Estimate models
    • No power analysis
    • Ideal modelling of some modules

Image Ref: http://www.academia.edu/4594082/Accuracy_Evaluation_of_GEM5_Simulator_System

Bluespec System Verilog (BSV) to Rescue!
• Design time
  • Reduced by providing a higher level of abstraction
  • Generic Interfaces
  • Extensive parameterization

Bluespec System Verilog (BSV) to Rescue!
• Design Space Exploration
  • Extensive parameterization
  • Library components
    • RegFile
    • Different types of FIFOs
    • Bus Fabric: AXI and AHB with TLM2 interface
    • Vectors
    • Block RAMs
    • Clock generators and Synchronizers
    • Functions: lead zero detector, gcd calculator, etc.
• Hardware Software co-simulation
  • Transactor library

Shakti Processors
• E Class
  – 16/32-bit 3-stage Micro-controller
  – Stripped ISA
• C Class
  – 32-bit 5-stage Micro-controller
  – Stripped ISA

• I-Class variant

– 64-bit Industrial purpose

– Multi-threading, Quad Issue, etc.

• S-Class

– Targeting Server Applications

– Multi-threaded Variant of I-Class

– Hybrid Memory Cube.

• M-Class

– Multi-core version of I Class for high-performance and embedded applications.

– Targeting complex SoC systems

• H-Class

– 32+ Cores

– SIMD Support

• T-Class

– Tagged ISA for security.

Tutorial Outline
• E-class
  • Architecture
  • Debug-environment
  • Fault-tolerant version
• I-class
  • Architecture
  • Verification Suite
  • Performance evaluation
• Other works
  • Rapid IO.
  • Server fabrics.
  • SSD controller: NVMe variant

E Class Microcontroller

Target Applications
• Targeted for low end embedded applications.
  • GPS navigation systems.
  • Storage controllers.
  • FPGA based controllers.
  • Designs running at 50-100MHz Max Frequency.
• Minimal computation applications
  • Customized firmware
• Security
  • Used as a co-processor to ensure security and privileged access to main processor peripherals.
• As a reference RTL model for teaching basics of computer architecture in academia.

Instruction Support
• Base Integer ISA – RV32IMA
  • Basic arithmetic instructions such as Add, Sub, Mul, Div, etc.
  • Basic logical instructions such as AND, OR, XOR, etc.
  • Atomic instructions
  • Basic Branch, Jumps, Load/Store instructions.
• Also supports RVC (RISC-V Compressed ISA)
  • 16-bit instructions for efficient resource utilization for low-end applications.

Why RVC*
• Code redundancy
  • Experimentally observed: about 30 frequent instructions occur 70% of the time
• Many instructions have few unique operands
  • The destination register and one of the source registers are the same 36% of the time.
• Registers exhibit substantial locality of reference
• Immediate operands are usually small
  • About two-thirds of immediate operands fit within 6 bits.


* Andrew Waterman. Improving energy efficiency and reducing code size with RISC-V compressed. Master's thesis, University of California, Berkeley, 2011.

RVC INSTRUCTION FORMAT*

EXAMPLE*
• Examples of encoding
• Here,
  • If the instruction is addi with the same source and destination register and a small immediate (within 6 bits), encode it using only one register field and 6 bits for the immediate operand.
  • If a branch-if-equal (beq) instruction with a small branch offset (within 6 bits) is encountered, encode it using a 5-bit opcode, two 3-bit register fields and a 6-bit immediate field.
  • If a subroutine return is encountered with x0 as the destination register and no immediate operand, encode it using a 5-bit opcode and 5 bits for ra.
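The addi rule above can be sketched as a small eligibility check. This is an illustrative Python sketch and not the RVC specification: the function names are hypothetical, and treating the 6-bit immediate as a signed two's-complement value is an assumption.

```python
def fits_signed(value: int, bits: int) -> bool:
    """True if `value` is representable as a two's-complement number of `bits` bits."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def addi_is_compressible(rd: int, rs1: int, imm: int) -> bool:
    """Slide rule: same source/destination register and an immediate within 6 bits.

    Assumption: the immediate is signed, so the accepted range is -32..31.
    """
    return rd == rs1 and fits_signed(imm, 6)


# addi x5, x5, 12 qualifies; addi x5, x6, 12 and addi x5, x5, 100 do not.
print(addi_is_compressible(5, 5, 12))
print(addi_is_compressible(5, 6, 12))
print(addi_is_compressible(5, 5, 100))
```

A real encoder would apply further restrictions from the RVC specification; only the rule stated on the slide is modelled here.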

RVC Pros and Cons
PROS:
• Reduced instruction fetch traffic
  • Improved code density (smaller code size)
• Cache performance improvement
  • Fewer cache misses
• Suitability for systems with smaller address space
  • Fewer cycles to fetch instructions
CONS:
• Power of expression is lost
• Compilers need to be optimized for this feature

OUR DESIGN CHOICES
• Three-stage Pipeline, in-order execution only
  • Instruction Fetch Unit
  • Decode and Execute Unit
  • Commit Unit
  • Five-stage Pipeline design also planned
• Memory-mapped I/O
  • Less internal logic, Smaller Chip, Less power consumption
• AHB Bus as the interconnect Fabric
  • Uses 33% less power compared to Axi4
  • Latency and throughput are comparable
• CPU’s Architectural choices
  • No. of GPRs: 32
  • Width of GPR: 32 bits
  • Width of Address Bus: 32 bits
  • Width of Data Bus: 32 bits

Three-stage Pipeline Core

Hardware Support for RVC
• All 32 registers are accessed.
• The decoder is slightly more complex.
• Since only the lower 16 bits are used in RVC mode, we implement the entire register file as 2 independent arrays, where the higher 16-bit array is power gated when operating in Compressed mode.

What are Transactors?
• Developing synthesizable buses/transactors can be hectic and diverts focus from the main project.
• Bluespec provides a range of transactors and parameterized modules for faster development.
  • AMBA based AXI, AHB, APB buses.
  • UART
  • Memory models, etc.
• We use the AHB transactor for our E-class micro-controller.

Generic Peripheral Interfaces

Proposed SoC Layout

COMPARISON – AHB v/s Axi4 - LATENCY

* taken from Design & Reuse at http://www.design-reuse.com/articles/24123/amba-ahb-to-axi-bus-comparison.html

COMPARISON – AHB v/s Axi4 - THROUGHPUT


COMPARISON – AHB v/s Axi4 - UTILIZATION


Debugger
• Why?
  • To step through programs and trace architectural state changes.
  • Runs slower than executing the program directly.
  • The debugger typically shows the location where a “trap” has occurred – an invalid instruction or an unknown state in the program.
• Offers sophisticated features such as
  • Step-by-step execution.
  • Break point insertion.
  • Modifying program state on the fly.
• Debuggers can also be used for fault coverage and performance analysis.

Hardware Support for Debugger
• Advanced JTAG debugging interface.
• Protocol yet to be finalised by the RISC-V foundation.
• A hardware implementation for OpenRISC is available which supports the RISC-V ISA.
[Figure: Host Workstation with GDB, JTAG Cable, JTAG TAP, Adv. Debug interface, System Bus, CPU]

CPU Soft Debug Interface

CPU DEBUGGER INTERFACE
• reset: Reset CPU state to initial
• run_continue: Run CPU starting from the inputted/current PC to completion
• run_step: Run CPU for one step starting from inputted/current PC
• stop: Halt the execution of the CPU
• stop_reason: Return why CPU stopped
• read_pc: Read the current value of the PC
• write_pc: Write into the PC
• read_gpr: Read a value from the Register File
• write_gpr: Write into the Register File
• req_read_memW: Request a memory read
• rsp_read_memW: Read the response from previous memory read request
• write_memW: Request a memory write
• read_instret: Return the number of instructions retired by the CPU
• read_cycle: Read the number of CPU cycles executed so far

Verification Environment
• Generating all corner test cases to test a processor is a difficult task. Some sort of automation is helpful.
• AAPG – Automatic Assembly Program Generator
  • Python based executable.
  • Constraints can be provided to generate specific targeted test cases
    • Number of total instructions to be generated.
    • Percentage of Branches, ALU, Load/Stores etc.
    • Various dependency generation.
    • Random Data memory generation.
    • Programmable support for generating/omitting certain instructions
  • Test cases are universal and can be run on the riscv-GCC assembler.
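The idea of constraint-driven random program generation can be illustrated with a short Python sketch. This is not the actual AAPG code: the function name `generate_program`, the instruction templates and the `mix` dictionary format are all hypothetical, and only forward branches are emitted so that the generated program always terminates.

```python
import random


def generate_program(total, mix, seed=0):
    """Return `total` labelled assembly lines (plus a final nop) whose
    instruction classes are drawn according to the percentages in `mix`."""
    rng = random.Random(seed)

    def reg(lo=0):
        return f"x{rng.randint(lo, 31)}"

    def make(kind, index):
        if kind == "alu":
            return f"add {reg(1)}, {reg()}, {reg()}"
        if kind == "load":
            return f"lw {reg(1)}, {4 * rng.randint(0, 63)}({reg()})"
        if kind == "store":
            return f"sw {reg()}, {4 * rng.randint(0, 63)}({reg()})"
        if kind == "branch":
            # Forward-only target, so no infinite loops are generated.
            target = rng.randint(index + 1, total)
            return f"beq {reg()}, {reg()}, L{target}"
        raise ValueError(f"unknown instruction class: {kind}")

    kinds = list(mix)
    picks = rng.choices(kinds, weights=[mix[k] for k in kinds], k=total)
    lines = [f"L{i}: {make(kind, i)}" for i, kind in enumerate(picks)]
    lines.append(f"L{total}: nop")  # landing pad for the last branch targets
    return lines


for line in generate_program(8, {"alu": 50, "load": 20, "store": 20, "branch": 10}, seed=1):
    print(line)
```

A fixed seed makes a failing test case reproducible, which matters when the same program has to be replayed on both the simulator and the RTL.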

Verification Environment
• Instruction Set Simulator
  • A C/C++ based single cycle processor simulator.
  • Supports the entire RV32/64 IMAFD.
  • Creates a register dump after every instruction.
  • A universal simulator which can be used across any micro-architecture for verification.
  • The functionality of the ISS has been verified against UCB’s Spike Simulator.

Verification Environment
• Instruction Set Simulator
  • Creating and managing memory elements within the ISS is a challenge for fast operation.
  • We have used two data structures to mimic Data memory
    • A contiguous memory for addresses in the range 0 to 2^20.
    • Loads/Stores beyond 2^20 are stored in a dynamic lookup table along with addresses and corresponding data.
  • Backward jumps are tricky to handle
  • Challenges in file I/O operations.
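The two-level data memory described above can be sketched as follows. The ISS itself is written in C/C++; this Python sketch only illustrates the data structure. The class name `TwoTierMemory`, the byte-addressed little-endian layout and the exact boundary at 2^20 bytes are assumptions.

```python
LOW_LIMIT = 1 << 20  # assumed size of the contiguous region, in bytes


class TwoTierMemory:
    """Flat array for low addresses, sparse lookup table for everything above."""

    def __init__(self):
        self.low = bytearray(LOW_LIMIT)  # fast, contiguous storage
        self.high = {}                   # address -> byte, grows on demand

    def store_byte(self, addr, value):
        if addr < LOW_LIMIT:
            self.low[addr] = value & 0xFF
        else:
            self.high[addr] = value & 0xFF

    def load_byte(self, addr):
        if addr < LOW_LIMIT:
            return self.low[addr]
        return self.high.get(addr, 0)  # untouched memory reads as zero

    def store_word(self, addr, value):
        for i in range(4):  # little-endian, one byte at a time
            self.store_byte(addr + i, value >> (8 * i))

    def load_word(self, addr):
        return sum(self.load_byte(addr + i) << (8 * i) for i in range(4))


mem = TwoTierMemory()
mem.store_word(0x8000_0000, 0xDEADBEEF)  # lands in the lookup table
print(hex(mem.load_word(0x8000_0000)))
```

The design choice is the usual speed/space trade-off: the common low addresses get constant-time array access, while rare far addresses cost a table lookup but no preallocated storage.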

Verification Environment
[Figure: verification flow. Labels: AAPG, Instructions, Initialised memory, ISS, Processor, I-Cache, D-Cache, Match?, no, modify]
• The Automatic Assembly Program Generator (AAPG), written in Python, generates random assembly instructions.
• The Instruction Set Simulator (ISS) is the functional equivalent of a single stage processor, written in C.
• The register dumps after every instruction are matched against each other for verification.

Fault Tolerant Microcontroller

Micro-Architecture
• 32-bit 5 stage pipeline with branch prediction.
• Supports all integer instructions: RV32I and ‘M’ extension.
• Resilient against hard and soft errors.
  • Tolerates a single bit error from fetch stage to writeback stage.
  • Tolerates one ALU failure in the execution unit.

FT Technique Implemented for Memories/Registers
• The Single Error Correction and Double Error Detection (SEC-DED) technique has been used to mitigate single bit errors in the Instruction Memory (IMEM), Data Memory (DMEM), Architectural Register File (ARF) and Program Counter register (PC).
• Data from IMEM, DMEM, ARF and PC flows along with its check bits all through the pipeline stages. With this approach, SEEs on the ISBs are also covered as far as the data path is concerned.
• A (32,7) Hamming code has been used for SEC-DED.
• SEC-DED logic is placed in the design after the inter stage buffer (ISB) and before the combinational logic at appropriate pipeline stages.
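A SEC-DED code of this size can be sketched in Python as an extended Hamming code. This is a minimal illustration, not the RTL used in the design: it assumes that "(32,7)" means 32 data bits protected by 7 check bits (6 Hamming check bits plus 1 overall parity bit), and the bit layout and the names `encode` and `decode` are hypothetical.

```python
CODE_LEN = 39                                   # 32 data + 6 Hamming + 1 overall parity
PARITY_POS = [1, 2, 4, 8, 16, 32]               # Hamming check-bit positions
DATA_POS = [p for p in range(1, CODE_LEN) if p not in PARITY_POS]  # 32 positions


def encode(data):
    """Return the codeword as a list of bits; index 0 holds the overall parity."""
    code = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POS):
        code[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Each check bit makes the parity of its covered positions even.
        code[p] = sum(code[j] for j in range(1, CODE_LEN) if j & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (data, status); status is 'ok', 'corrected' or 'detected'."""
    code = list(code)
    syndrome = 0
    for p in PARITY_POS:
        if sum(code[j] for j in range(1, CODE_LEN) if j & p) % 2:
            syndrome |= p
    parity_error = sum(code) % 2 == 1
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error and syndrome < CODE_LEN:
        code[syndrome] ^= 1          # single error: the syndrome is its position
        status = "corrected"
    else:
        return None, "detected"      # even number of flips: uncorrectable
    data = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return data, status


word = encode(0xCAFEBABE)
word[13] ^= 1                         # inject a single-bit upset
print(decode(word))
```

A single flipped bit makes the overall parity odd and the syndrome points at the flipped position; two flipped bits leave the overall parity even with a non-zero syndrome, which is reported as detected but not corrected.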

Simplified View of Proposed Architecture

FT Technique Implemented for ALU Design
• The fault tolerant ALU comprises 5 types of functional units (FUs), namely adder, comparator, multiplier, shifter and logical unit.
• All types of FUs are in dual modular redundant (DMR) configuration with fault handling logic, together called a Fault Tolerant FU (FTFU).
• A faulty FU will be isolated using a time redundancy technique.
• Different recomputation techniques have been proposed for different types of FUs.

[Figure: fault tolerant ALU. Labels: ISB-ID/EX, DE-MUX, ADD1/ADD2, MUL1/MUL2, LU1/LU2, Shifter1/Shifter2, Fault handling Logic & Data Posting, MUX, ISB-ID/EX]

Fault Handling Logic
• No Error (NE): A match between the outputs of the Prime FU (FUP) and the Redundant FU (FUR) is considered as No Error.
• Transient Error (TE): A mismatch between the outputs of FUP and FUR is considered as a Transient Error.
• Permanent Error (PE): If a transient error persists for 3 consecutive cycles, it is considered as a Permanent Error.

Fault Handling Table

Equality Check | Error Type | Action Taken
Match | NE | Posted from FUP
Mismatch for less than 3 consecutive cycles | TE | No posting (Redo the operation)
Mismatch for 3 consecutive cycles | PE | Recomputation to identify faulty FU and update the corresponding PE Flag

PE Flag Table

PE Flag of FUP | PE Flag of FUR | Conclusion/Action Taken
0 | 0 | Not able to detect faulty FU. Post data as ‘0’. Call to Interrupt Service Routine (ISR).
0 | 1 | FUR faulty. Isolate FUR and post from FUP here onwards.
1 | 0 | FUP faulty. Isolate FUP and post from FUR here onwards.
1 | 1 | Both FUP & FUR faulty. Post data as ‘0’. Call to ISR.
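The two tables can be read as a small state machine plus a lookup. The following Python sketch is one possible behavioural model, under the assumption that the mismatch counter resets on any matching cycle; the class and function names are hypothetical.

```python
PE_THRESHOLD = 3  # consecutive mismatching cycles before an error counts as permanent


class FaultClassifier:
    """Classifies each cycle's comparison of FUP and FUR outputs as NE, TE or PE."""

    def __init__(self):
        self.mismatch_streak = 0

    def step(self, fup_out, fur_out):
        if fup_out == fur_out:
            self.mismatch_streak = 0   # assumption: a match clears the streak
            return "NE"
        self.mismatch_streak += 1
        return "PE" if self.mismatch_streak >= PE_THRESHOLD else "TE"


def pe_action(pe_flag_fup, pe_flag_fur):
    """Action from the PE Flag Table once recomputation has set the flags."""
    table = {
        (0, 0): "post 0, call ISR",            # faulty FU could not be identified
        (0, 1): "isolate FUR, post from FUP",
        (1, 0): "isolate FUP, post from FUR",
        (1, 1): "post 0, call ISR",            # both units faulty
    }
    return table[(pe_flag_fup, pe_flag_fur)]


clf = FaultClassifier()
print([clf.step(a, b) for a, b in [(7, 7), (7, 5), (7, 5), (7, 5)]])
print(pe_action(1, 0))
```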

Block Diagram of Fault Tolerant FU (FTFU)

Re-Computation Techniques for ALU
1. Addition and Subtraction
Recomputing with complemented operands (RECOMPO) scheme. Sum and Carry are self-dual functions.
In Normal Computation: Fn = OP1 + OP2 (Carryin = 0)
In Recomputation Step: Fr = (~OP1 + 1) + ~OP2 (+1 in OP1 to absorb Carryin = 1)
Fault Detection Capability: As Fn and Fr are complementary, a stuck-at failure at any bit position will be known.
Acts as a basic block for the Comparator Functional Unit also.
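The complementarity of Fn and Fr follows from (~OP1 + 1) + ~OP2 = ~(OP1 + OP2) modulo 2^n. The Python sketch below checks this and shows how a stuck-at fault breaks it. The fault model is a simplifying assumption: the fault is injected on one bit of the adder output rather than inside the adder, and all names are hypothetical.

```python
N = 32
MASK = (1 << N) - 1


def adder(a, b, stuck_bit=None, stuck_value=0):
    """N-bit adder; optionally forces one result bit to model a stuck-at fault."""
    result = (a + b) & MASK
    if stuck_bit is not None:
        if stuck_value:
            result |= 1 << stuck_bit
        else:
            result &= ~(1 << stuck_bit)
    return result


def recompo_check(op1, op2, **fault):
    """True when the normal and recomputed results are bitwise complements."""
    fn = adder(op1, op2, **fault)                        # Fn = OP1 + OP2
    fr = adder((~op1 + 1) & MASK, ~op2 & MASK, **fault)  # Fr = (~OP1 + 1) + ~OP2
    return fn == (~fr & MASK)


print(recompo_check(1234, 5678))                              # fault-free adder
print(recompo_check(1234, 5678, stuck_bit=4, stuck_value=1))  # stuck-at-1 on bit 4
```

With an output bit stuck, both Fn and Fr carry the same value in that position, so they can no longer be complements and the check fails regardless of the operands.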

2. Multiplication
In Normal Computation: Fn = OP1 * OP2
In Recomputation Step: Fr = (2’s complement of OP1) * OP2
Fault Detection Capability:
1. Under no-fault condition: Fn = -Fr = 2^n - Fr.
2. Any stuck-at failure at a particular bit position ‘i’:
   Result of normal computation: Rnormal = Fn ± 2^i.
   Result of recomputation step: Rrecomp = Fr or Fr ± 2^i.
   Fault handling logic will check whether Rnormal and the 2’s complement of Rrecomp are equal or not.
   2’s complement of Rrecomp = 2^n - Rrecomp = 2^n - (Fr or Fr ± 2^i) = (2^n - Fr) or (2^n - Fr ∓ 2^i) = Fn or Fn ∓ 2^i.
   Therefore 2’s complement of Rrecomp ≠ Rnormal. In this way, we can detect single or multiple bit stuck-at faults in the multiplier circuitry.

3. Shift Related Operation
Recomputing with shifted operands (RESO) technique has been used. RESO-2 is implemented here.
In Normal Computation: The operand is given to the shifter, which computes the output as Fn.
In Recomputation Step: The operand left shifted by 2 bit positions is fed to the shifter circuitry, which computes the output as Fr.
Fault Detection Capability:
1. Under no-fault condition: Fn[n-1:0] = Fr[n+1:2]
2. Any stuck-at failure at a particular bit position ‘i’:
   Result of normal computation: Rnormal[n-1:0] = Fn ± 2^i.
   Result of recomputation step: Rrecomp[n+1:2] = Fn or Fn ± 2^(i-2).
   Fault handling logic will check whether Rnormal[n-1:0] and Rrecomp[n+1:2] are equal or not.
   As the error causes different results in the two computation steps, the error is detected.
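The RESO-2 comparison can be sketched in Python for the left-shift case. This is an illustrative model only: the shifter is assumed to be n+2 bits wide, only logical left shifts are modelled, the stuck-at fault is injected on one output bit, and all names are hypothetical.

```python
N = 32
WIDE = N + 2                      # two extra bits hold the pre-shifted operand
MASK_N = (1 << N) - 1
MASK_WIDE = (1 << WIDE) - 1


def shifter(x, k, stuck_bit=None, stuck_value=0):
    """(N+2)-bit logical left shifter; optionally one output bit is stuck."""
    out = (x << k) & MASK_WIDE
    if stuck_bit is not None:
        if stuck_value:
            out |= 1 << stuck_bit
        else:
            out &= ~(1 << stuck_bit)
    return out


def reso2_check(x, k, **fault):
    """True when Fn[n-1:0] equals Fr[n+1:2]."""
    fn = shifter(x, k, **fault) & MASK_N   # normal computation
    fr = shifter(x << 2, k, **fault)       # operand pre-shifted left by 2
    return fn == ((fr >> 2) & MASK_N)


print(reso2_check(0x12345678, 5))                               # fault-free shifter
print(reso2_check(0x12345678, 5, stuck_bit=0, stuck_value=1))   # stuck-at-1 on bit 0
```

Because the recomputation uses the faulty bit for a different result position (i-2 instead of i), a fault that corrupts the normal result cannot corrupt the recomputed result in the same way, so the comparison fails.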

4. Logical Operation
Recomputing with Swapped Operands technique has been used. Lower halves and upper halves of operands are swapped during re-computation.

The remaining arithmetic computation units, like dividers, need to be hardened.

Verification Methodology
Module Level Verification (this involves regression verification of the following modules)
• Fault Tolerant Adder
• Fault Tolerant Multiplier
• Fault Tolerant Logical Unit
• Fault Tolerant Shifter
• Fault Tolerant Comparator
• Error Correction Module (SEC-DED module)
System Level Verification
• This involves verification at the processor level.
• Aim is to validate the system under different fault rates.

Module Level Verification
Systematic Testing: FIM simulates stuck-at-‘1’ and stuck-at-‘0’ faults at every bit position.
Random Testing: FIM simulates multiple bit errors randomly at the output of FU-1.

System Level Verification
• Fault injection at the ISB between the decode and execute stages. Injecting faults at the ID/EX ISB simulates faults of various components of the processor like IMEM, DMEM, ARF, PC register and previous ISBs in the pipeline flow.
• Fault injection at the Fault Tolerant FUs.
• Total 8 fault injection locations.
• LFSR_SEL and LFSR_EN control the enable input of the LFSRs used for FIM. This scenario can simulate errors for 1/2/3 consecutive clock cycles.

ASIC Implementation
The design is targeted for 55 nm technology
Worst Corner Condition (PVT): Slow, 1.08 V, 125 °C.
Best Corner Condition (PVT): Fast, 1.32 V, -40 °C.

Processor version | Comb. instances | Seq. instances | Total instances | Static power (mW) | Dynamic power (mW) | Total power (mW) | Core area (mm^2) | Min clock period (ns)
Base | 22542 | 2634 | 25176 | 0.58 | 20.97 | 21.55 | 0.2704 | 2.4
Fault tolerant | 28259 | 3507 | 31766 | 1.16 | 9.19 | 10.35 | 0.3249 | 3.7
Overhead | 25.36% | 33.14% | 26.17% | 100% | -56.17% | -51.97% | 20.15% | 54.17%

I Class Processor
Out-Of-Order Architecture

• Necessity: The need for Out-of-Order Processor
• Concept: Architecture of Out-of-Order Processor
• Hands-on: Putting the code to use.
• Benchmark Results

Why Out-of-Order (OoO) Processor?
● Fills stalls created due to variable latencies in execution.

● Complex method for fine-grained Data prefetching

● Balances out poor compilers and lazily written code.

● Exploits Instruction Level Parallelism.

PERFORMANCE!!

Academia and OoO Research
● General lack of effort to build and evaluate OoO designs.
● Proprietary limitations at each step.
● Use of software simulators like GEM5
  ○ Cannot evaluate area and power consumption accurately.
  ○ Cannot be relied on for accuracy.
  ○ Too slow.
  ○ Actual hardware overheads of optimizations are not easily identified.
● Advent of open ISAs like RISC-V, OpenRISC.
● Introduction of high-level Hardware Description Languages: Chisel, Bluespec etc.

Instruction Level Parallelism
Pipeline CPI = Ideal CPI + Data Stalls + Control Stalls + Structural Stalls

Technique | Reduces
Forwarding and Bypass | Potential Data Hazard Stalls
Delayed Branches and simple branch scheduling | Control Hazard Stalls
Basic Dynamic Scheduling (Score-boarding) | Stalls from True Dependencies
Dynamic Scheduling with renaming | Stalls from Output and Anti dependencies
Branch Prediction | Control Hazard Stalls
Multiple Issue | Ideal CPI

Dynamic Scheduling
● Scoreboarding
  ○ Allows instructions to be issued out of order only when there are sufficient resources and no data dependencies.
  ○ No forwarding
● Tomasulo Algorithm
  ○ Operand forwarding by register renaming, using a ReOrder Buffer, avoids data hazards
  ○ Registers are written in the commit stage.
● Merged Register File approach
  ○ Operand forwarding by register renaming, using extra registers
  ○ Registers are written right after execution.


The Case of the I Class Processor
● A 64-bit processor providing Out-of-Order Execution.
● Based on RISC-V ISA.
  ● RV64I + RV64M + RV64A + RV64FD
● Coded in Bluespec System Verilog (BSV).
● Architecture implements the Merged Register File approach to achieve high instruction level parallelism.

I Class Features
● Based on RISC-V ISA – RV64I and ‘M’ extension
● Dual Issue.
● Parameterized pipeline and Issue Queue.
● CAM based speculative load store unit.
● Prioritized for selecting instructions from issue queue.
● Inter-functional unit bypass network.
● Tournament Branch Predictor.
● 32KB I-Cache and Non-Blocking Cache.

Architecture Overview
Pipeline Stages
1. FETCH
2. DECODE
3. MAP - Register Renaming.
4. WAKEUP
5. SELECT AND GRANT - Selection from Instruction Queue.
6. DATA READ AND DRIVE
7. EXECUTE
8. COMMIT
[Figure legend: Storage structures, Pipeline stages, Execution Unit]

Fetch and Decode
[Figure: Fetch (BPU, Cache) followed by Decode]
● Next instruction is predicted for every instruction in the Fetch Unit
● Direct jump instructions like JAL are executed in the Decode stage itself.

Branch Prediction Unit
• Tournament Branch Predictor is being used.
  • Tournament between Bimodal and Global Branch Predictor.
  • Both the predictor tables have 16 entries.
  • Total storage used by the predictor is ~2 Kbits.
• State-of-the-art branch predictor model based on Prediction by Partial Matching (PPM).
  • Comprises 1 Bimodal bank and 4 Global banks for prediction.
  • Total memory required for it is 64 Kbits

MAP
• New source register addresses for instructions from the decode stage are assigned from fRAM.
• Destination registers are mapped to new addresses using FRQ.
• Dependencies are checked between the instructions, which are then dispatched into the instruction queue.
[Figure: Instruction-1 (rd, rs1, rs2) and Instruction-2 (rd, rs1, rs2) renamed using FRQ and fRAM]
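The renaming step can be sketched in Python. The slides do not expand the acronyms, so this sketch assumes that fRAM is a front-end register alias map (architectural to physical) and that FRQ is a queue of free physical registers. The register counts and the class name `Renamer` are hypothetical.

```python
from collections import deque

NUM_ARCH = 32   # architectural registers
NUM_PHYS = 64   # hypothetical size of the merged physical register file


class Renamer:
    def __init__(self):
        # Initially architectural register i lives in physical register i.
        self.fram = list(range(NUM_ARCH))
        # The remaining physical registers are free.
        self.frq = deque(range(NUM_ARCH, NUM_PHYS))

    def rename(self, rd, rs1, rs2):
        """Return (physical rd, physical rs1, physical rs2) for one instruction."""
        # Sources are looked up before the destination mapping changes, so an
        # instruction such as add x3, x3, x4 reads the old copy of x3.
        prs1, prs2 = self.fram[rs1], self.fram[rs2]
        prd = self.frq.popleft()   # an empty queue would stall the MAP stage
        self.fram[rd] = prd
        return prd, prs1, prs2


r = Renamer()
print(r.rename(3, 1, 2))   # add x3, x1, x2
print(r.rename(4, 3, 3))   # add x4, x3, x3 reads the renamed x3
print(r.rename(3, 3, 4))   # add x3, x3, x4 gets a fresh destination
```

Giving every destination a fresh physical register removes output and anti dependencies, while true dependencies are preserved because consumers read the producer's new mapping.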

Selection from Issue Queue
• Instructions eligible for selection for execution
  • Instructions which have all their operands ready.
  • Corresponding execution units are predicted to be free.
• Selection of an instruction from the set of eligible ones.
  • Age based selection policy.
  • Position based selection policy.
• Priority encoders for the selection policy.
  • Tree-shaped priority encoders for the position based policy.
  • Barrel shifted tree based priority encoders for the age based selection policy.
• Selected instructions are enqueued into the inter-stage buffer.
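The difference between the two selection policies can be shown with a behavioural Python sketch. It assumes the issue queue is circular and that `head` points at the oldest entry, so the barrel shift amounts to rotating the ready vector by `head` before priority encoding. The function names are hypothetical, and the tree structure of the hardware encoders is not modelled.

```python
def position_select(ready):
    """Position based policy: pick the lowest-index ready entry."""
    for index, is_ready in enumerate(ready):
        if is_ready:
            return index
    return None


def age_select(ready, head):
    """Age based policy: rotate so the oldest entry comes first, then encode."""
    rotated = ready[head:] + ready[:head]      # the barrel shift
    pos = position_select(rotated)
    return None if pos is None else (pos + head) % len(ready)


ready = [True, False, True, True]
print(position_select(ready))     # entry 0 wins on position
print(age_select(ready, head=2))  # entry 2 is the oldest ready entry
```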

Priority Encoders
[Figure: Serial Priority Encoder and Tree-shaped Priority Encoder]

Data Read and Drive
• In this stage, instructions are dequeued from the inter-stage buffer and operands are read from the register file.
• The I-class processor can issue up to 5 instructions per cycle, which results in 10 read ports for the physical register file.
• Immediate fields of instructions are stored in a separate location (immediate buffer) and the pointer to the immediate buffer location is stored in the issue queue entry.
• The inter-stage buffer can accommodate a single element. This is because all the execution units in the I-Class processor are pipelined.

Instruction Execute
• The calculated result is broadcast to the Map stage and the status of source operands is updated.
• The Branch Unit broadcasts a training packet to the Branch Predictor Unit.
  • The branch training packet consists of the Squash PC (jump address) and the branch result (taken or not).
  • The branch result is also updated in the Squash Buffer, which is useful at the time of committing.
• All arithmetic operations are single cycle operations.
• A separate execution unit for the Multiplier and Divider.
  • The Multiplier takes 6 cycles and the Divider takes 36 cycles; both are pipelined.
• The arrival of the result for load instructions cannot be predicted. Hence the dependent instructions are notified accordingly.

CAM based Load Store Unit
[Figure: load store unit. Labels: EAC, Store Queue, Load Queue, CAM SEARCH, Memory Access, Cache, Broadcast Load result, Store Commit, Load Commit, Flush Wire]
• Each memory access instruction is allotted an entry in one of the LS queues.
• The value from the store is forwarded in case of an address match.
• The alias bit is set in case of wrong speculation and the pipeline is flushed at the time of commit.

CAM based Load Store Unit
[Flowchart labels: CAM search on load queue, Match (Yes/No), Forwarded set? (Yes/No), Forward acknowledge set? (Yes/No), Set forward acknowledge, Set alias bit, Send store request to D-Cache, Invalidate Store Queue Entry]
● The alias bit is set in two cases
  ○ When there is no forwarding from the corresponding store
  ○ When there is forwarding from the wrong store.
● If neither is the case, forward acknowledge is set.
● The store instruction proceeds to the D-Cache anyway.

Instruction Wake-up
• The result from the execution units is broadcast to the Instruction Queue (IQ).
• Each instruction compares the destination tag with its source operand tags in the IQ and updates operand ready.
• During the same cycle, the register file is also updated.

Bypass Network
• Dependent instructions have a 3 cycle bubble between them.
[Figure: pipeline timing of Producer (Select, Drive, Execute, Broadcast) and Consumer (Wakeup, Select, Drive, Execute), shown for two cases]
● In the bypass network, instructions are predicted to finish in a certain number of cycles.
● Dependent instructions are woken up accordingly.

Implementation of Bypass Network
• Instead of having registers for operand ready, every instruction is assigned a Delay register.
• The Delay register contents are moved to a “Shift register” at the time of broadcast.
• The contents of the “Shift registers” are right shifted every cycle. When the rightmost bit in a “Shift register” is set, the corresponding instruction is released for execution.
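The delay and shift register mechanism can be modelled in a few lines of Python. The one-hot encoding of the delay (bit `latency - 1` set) is an assumption made for this sketch, as are the class and method names.

```python
class DelayedWakeup:
    """Models one instruction's Delay register and Shift register."""

    def __init__(self, latency):
        # Assumed encoding: a single set bit that reaches bit 0 after
        # `latency - 1` right shifts.
        self.delay = 1 << (latency - 1)
        self.shift = 0

    def broadcast(self):
        """At broadcast time, the Delay register is copied into the Shift register."""
        self.shift = self.delay

    def tick(self):
        """Advance one cycle; return True if the instruction is released now."""
        released = bool(self.shift & 1)   # rightmost bit set means release
        self.shift >>= 1                  # right shift every cycle
        return released


w = DelayedWakeup(latency=3)
w.broadcast()
print([w.tick() for _ in range(4)])
```

The appeal of the scheme is that the consumer needs no tag comparison at wake-up time: the producer's predicted latency is counted down locally by the shift.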

Commit Stage
• All instructions are committed in order. This is done by maintaining a head pointer for the oldest instruction in the instruction queue.
• The I-Class processor supports multiple commits, and dependencies between the instructions are checked before commit.
• The architectural destination registers of instructions are added back to the FRQ.
• The destination register mapping of the committing instruction is updated in the rRAM.
• The following are the cases of exception in the Commit Stage:
  • Mis-predicted branches, whose decision can be determined from the Squash Buffer.
  • Wrongly speculated loads.

Handling Exceptions During Commit
• Exceptions being supported:
  • Instruction address misaligned exception
  • Instruction access fault.
  • Illegal instruction.
  • Load address misaligned exception
  • Load access fault.
• The interrupt controller takes care of exceptions from the commit stage.
• The interrupt controller has logic to transfer control to machine mode and update the program counter.

Flushing Pipeline
• In case of an exception, branch misprediction or wrongly speculated load execution, the whole pipeline is flushed.
• Flushing the pipeline:
  • Clear all the inter-stage buffers.
  • Clear the Load Queue, Store Queue and the Instruction Queue.
  • Copy the contents of rRAM to fRAM.
  • Mark all the contents of FRQ valid and reset Head and Tail to zero.
  • Change the program counter to the Squash Program Counter of the corresponding instruction.

Cache Architecture
• Totally parameterized for all the buffers and the cache.
• Non-blocking: services requests unless the load buffer is full or there are more than 2 requests for the same cache line address.
• The Load Buffer is implemented as a vector of concurrent registers, and the wait buffer mimics a searchable FIFO.
• The Completion Buffer issues tokens and is responsible for in-order completion of the requests.
[Figure: Cache, Completion Buffer, Load Buffer, Wait Buffer, Higher levels of memory]

Cache Architecture
• In case of a cache miss, the entries are stored in the Load Buffer (LB).
• The LB is searched fully-associatively; if the entry is already in the LB, then the wait buffer is updated.
• While servicing entries in the wait buffer, the CPU requests are stalled.
[Flowchart labels: Cache hit? (Yes/No), Already in LB? (Yes/No), Response to CPU, Request to memory, Update wait buffer, Stall requests from CPU]

Execution Modes
• Privilege levels are used to provide protection between different components of the software stack.
• At any time, the core runs in one of the following privilege modes.
• Machine level is the highest privilege level, inherently trusted and mandatory.

Level | Encoding | Name | Abbreviation
0 | 00 | User | U
1 | 01 | Supervisor | S
2 | 10 | Hypervisor | H
3 | 11 | Machine | M

Execution Modes
• User mode (U-mode): conventional applications
• Supervisor mode (S-mode): operating system
• Hypervisor mode (H-mode): virtual machine monitors
• There are Control and Status Registers (CSRs) associated with each privilege level.
• Privileged instructions are used to access the CSRs.
  • The “csr” field is used to address CSRs. Fields rs1 and rd denote integer registers.
  • The CSRs are addressed based on privilege level access and read/write permissions.
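How a CSR address encodes privilege and read/write permission can be illustrated in Python. The sketch follows the convention of the RISC-V privileged specification, where bits [9:8] of the 12-bit CSR address give the lowest privilege level allowed to access the register and bits [11:10] equal to 11 mark it read-only. The helper names are hypothetical, and the level names follow the table on the earlier slide.

```python
PRIVILEGE_NAMES = {0: "U", 1: "S", 2: "H", 3: "M"}


def decode_csr_address(csr):
    """Split a 12-bit CSR address into its access-control fields."""
    if not 0 <= csr < (1 << 12):
        raise ValueError("CSR addresses are 12 bits wide")
    return {
        "min_privilege": PRIVILEGE_NAMES[(csr >> 8) & 0b11],
        "read_only": ((csr >> 10) & 0b11) == 0b11,
    }


def can_access(csr, current_level, write):
    """True if code running at `current_level` (0..3) may perform the access."""
    min_level = (csr >> 8) & 0b11
    read_only = ((csr >> 10) & 0b11) == 0b11
    return current_level >= min_level and not (write and read_only)


print(decode_csr_address(0x300))                      # mstatus
print(decode_csr_address(0xC00))                      # cycle
print(can_access(0x300, current_level=0, write=False))
```

An access that fails this check is expected to raise an illegal instruction exception rather than silently succeed.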

RTL Code
• Written in Bluespec System Verilog.
  • Total number of lines of code: ~11k
  • Cache LoC: ~3k
  • Core LoC: ~8k
• 24 modules: highly modularized.
  • Facilitates design space exploration
  • Able to develop the Bypass Network by plugging in two extra modules.
• 35 parameterizable variables
  • Enabled quick trade-off analysis between IPC and clock speed.

IPC for AAPG Test Cases
• IPC variation based on issue queue size.

Dhrystone Results
[Figure: IPC and recorded frequency versus issue queue size]
A 65 nm UMC IP standard cell library with operating conditions of 1.32 V supply voltage and 110 °F is used.

Performance
• Fully synthesizable
• Runs at 110 MHz on a Xilinx Artix-7 board

Benchmarks | IPC
COREMARKS | 1.18

Code Access and Use
• The code is publicly available at https://bitbucket.org/casl/shakti_public.
• Under the BSD license.
• The Verification Environment can be used to verify the code.
• To run benchmarks of any kind, generate the list of instructions into the input.hex file and initialize memory in the rtl_mem_init.txt file.
• Parameters can be changed in the defined_parameters.bsv file in BSV_source.
• Use the makefile to compile and simulate files.

Other Works

RapidIO interconnect
• The Serial RapidIO Gen3+ standard is proposed to be used as the CPU + I/O interconnect
• It is proposed to use the 10/25G SERDES version of this standard. A port will consist of 4, 8 or 16 lanes of 10/25 G each, transported over electrical/optical links.
  • CC links for multi-socket configurations will need all 16 lanes
  • I/O interconnects will probably need 100 Gbits/sec (4x25 G)
• Cache coherency
  • A 5+ state MOESI/MESIF like directory based protocol will be the CC protocol
  • Mapped to each core's AMBA/CHI interconnect
  • Scalable up to 128 sockets (16 clusters of 8 sockets/cluster)
• Optical configurations
  • 10 lanes of 10G or 4 lanes of 25G muxed onto the 802.3bm standard's 100G optical link
  • Intel CX optical interconnect can accommodate larger lane count
• Extending max packet size of SRIO to 4k
  • TBD based on performance of 256 byte packets
• IIT-M IP
  • Complete implementation of 3.x standard except analog components
  • 4 x 4 switch (4 lane ports, max of 100 Gbit/sec per port at 25G lanes) being developed
  • Critical for non-SHAKTI projects also

Server Fabrics
• Workload parallelism involves access to shared resources.
  • The core should be fast and exploit ILP.
  • The interconnection fabric should also support throughput requirements.
• It’s the complete system’s performance which needs to improve.
• All applications do not have similar memory access patterns
  • Hadoop
  • RDBMS
• Need for an Adaptive Server Framework
  • Hybrid interconnects which speed up resource sharing.
  • Fast memories such as Hybrid Memory Cube.
• Observed that 8 physical threads per process is sufficient
  • Use a bi-directional ring for up to 8 cores.
  • MESI+GOLS to take care of cache coherence.
  • Easily scalable.
  • More power and area efficient than Mesh.
  • Dynamic Clustering.

Server Fabrics

[Figure: eight cores, each paired with an L2, together with HMC / DDR and RIO - CC blocks]

• Hybrid interconnect schemes
• Rings scale well for up to 8 cores.
• For more than 16 cores, a mesh provides better performance.
• CCNoC for cache coherence.
• Mesh is power and area inefficient
• Need for hybrid structures
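One rough way to see why rings suit small core counts while meshes win at larger ones is to compare average hop counts. The sketch below is an illustrative model only, not the analysis behind these slides: it assumes uniform traffic between all pairs of distinct nodes, shortest-path routing, a bi-directional ring, and a 2D mesh without wrap-around links.

```python
from itertools import product


def ring_avg_hops(n: int) -> float:
    """Average shortest-path hops between distinct nodes on a bi-directional ring."""
    # By symmetry it is enough to measure distances from node 0.
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)


def mesh_avg_hops(rows: int, cols: int) -> float:
    """Average Manhattan distance between distinct nodes of a rows x cols mesh."""
    nodes = list(product(range(rows), range(cols)))
    total = sum(
        abs(r1 - r2) + abs(c1 - c2)
        for (r1, c1) in nodes
        for (r2, c2) in nodes
    )
    ordered_pairs = len(nodes) * (len(nodes) - 1)
    return total / ordered_pairs


# Compare a ring against a mesh with the same number of nodes.
for rows, cols in [(2, 4), (4, 4), (8, 8)]:
    n = rows * cols
    print(
        f"{n:2d} nodes: ring {ring_avg_hops(n):5.2f} hops, "
        f"{rows}x{cols} mesh {mesh_avg_hops(rows, cols):5.2f} hops"
    )
```

Under this model the two topologies are close at 8 nodes, while the ring's average distance grows linearly with node count and the mesh's grows roughly with its square root. Hop count ignores link count, power and area, which are the reasons the slides give for preferring rings at small scale and hybrids beyond that.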

Server Fabrics

[Figure panels: Mesh of Rings with 2 bridges; Mesh of Rings with 4 bridges; Hybrid with hierarchical rings]

• Use RapidIO for socket-to-socket connection
• A 5+ state MOESI/MESIF-like directory-based protocol will be the CC protocol
• Scalable up to 128 sockets (16 clusters of 8 sockets)
• Proposed to use the 10/25G SERDES version of this standard.

Server Fabrics

[Figure: Proc-1, Proc-2, Proc-3 and Proc-4 with RIO]

LightNVM

● Specification for Open-channel SSD

● Extension to NVMe specification

● Adapts the Linux kernel's NVMe driver to provide the LightNVM interface

● Allows SSD to expose internal organization & control the flow of data to it

● In-kernel FTL

MULTI-CHANNEL STORAGE CONTROLLER

TOP LEVEL DATA FLOW OF NVM CONTROLLER

RESULTS

● In the speed test using PCIe Gen-2 x4, the data rate is 14.76 Gb/s

● In the simulation test, the max. data rate is 15.83 Gb/s and the max. bandwidth utilization is approx. 98.9%.

● Normalized NAND interface utilization is 99.9% for a single-channel controller

● Arbiter scales well with an increase in IO queues

● Synthesis results show an ideal proportional increase in combinational and sequential area with respect to the number of channels
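The utilization figure can be cross-checked against the usable line rate of a PCIe Gen-2 x4 link. PCIe Gen-2 signals at 5 GT/s per lane with 8b/10b encoding, which leaves 4 Gb/s of payload-capable bandwidth per lane and 16 Gb/s over four lanes. The sketch assumes utilization was measured against that 16 Gb/s figure; the slides do not say so explicitly, and higher-level protocol overhead is ignored.

```python
GEN2_RAW_GT_PER_LANE = 5.0       # PCIe Gen-2 signalling rate per lane (GT/s)
ENCODING_EFFICIENCY = 8 / 10     # 8b/10b line encoding
LANES = 4

# Usable link bandwidth after line encoding, in Gb/s.
link_gbps = GEN2_RAW_GT_PER_LANE * ENCODING_EFFICIENCY * LANES


def utilization(measured_gbps: float) -> float:
    """Fraction of the encoded link bandwidth that the measured rate uses."""
    return measured_gbps / link_gbps


SIMULATED_GBPS = 15.83   # max. data rate in simulation, from the slide
MEASURED_GBPS = 14.76    # data rate in the speed test, from the slide

print(f"link bandwidth:        {link_gbps:.1f} Gb/s")
print(f"simulation utilization: {utilization(SIMULATED_GBPS):.2%}")
print(f"speed-test utilization: {utilization(MEASURED_GBPS):.2%}")
```

Under this assumption the simulated rate corresponds to roughly 98.9% of the link, which agrees with the utilization quoted above, and the hardware speed test to roughly 92%.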

Thank You

Contact Details
Neel Gala – [email protected]
Arjun Menon – [email protected]
Rahul Bodduna – [email protected]
Site: www.rise.iitm.ac.in/shakti/

Biography of Authors

Prof. V. Kamakoti

• V. Kamakoti currently holds the post of professor in the Department of Computer Science and Engineering, Indian Institute of Technology, Madras. He has more than 15 years of experience in computer systems development and specializes in the areas of Computer Architecture, CAD for VLSI and High Performance Computing. Professor Kamakoti holds a Master of Science degree and a Doctorate of Philosophy in computer science and engineering from the Indian Institute of Technology, Madras. He has authored a number of research papers that have been published in various international journals and in the proceedings of many scientific conferences.

Neel T. Gala

• Neel T. Gala completed his B.Tech at NIT-Warangal in 2010 and is currently pursuing his PhD at IIT-Madras. His primary area of research is low-power approximate circuit designs and techniques. He also holds a strong passion for computer architecture and digital design. During his PhD, Neel has published up to 6 papers in international conferences and journals and holds a US patent in collaboration with Texas Instruments. He has also collaborated with various government bodies on processor design and verification.

Arjun C. Menon

• Arjun C. Menon is an MS student in the Computer Science Department, IIT Madras. His primary area of research is providing hardware support for eliminating software attacks. He is one of the lead designers of the SHAKTI processors. During the course of his master's, he has been involved in designing various processors for the Government of India.

Rahul Bodduna

• Rahul Bodduna completed his B.Tech at IIT-Mandi in 2013. He has since been associated with IIT-Madras as a Project Associate contributing to the SHAKTI processors initiative. He has a strong base in designing out-of-order cores and memory management units from scratch.

Date post:	20-Mar-2017
Category:	Technology
Upload:	g-s-madhusudan