Date post: | 20-Mar-2017 |
Category: |
Technology |
Upload: | g-s-madhusudan |
SHAKTI Processors
An Open Source Hardware Initiative
RISE LAB, IIT Madras
VLSI Design Conf. 2016, Kolkata
Who Are We?
• A crew of academicians and industrial consultants aiming to create a platform for open source hardware in India.
• Active members of the initiative:
  • Prof. V. Kamakoti: Professor, CSE Dept., IIT-Madras
  • G.S. Madhusudan: Principal Scientist, IIT-Madras
  • Neel T. Gala: PhD Scholar, CSE Dept., IIT-Madras
  • Arjun C. Menon: MS Scholar, CSE Dept., IIT-Madras
  • Rahul Bodduna: MS Scholar, CSE Dept., IIT-Madras
• The initiative also receives a continuous flow of M.Tech students, interns and project associates contributing to various aspects.
Our Motivation?
• Academia vs. Industry
  • Publications
  • Licensed IP
  • Lack of support
• Simulators
  • Limited accuracy
  • HW limitations are abstracted
  • What does it take to code in HW?
• Open RTL?
  • Design Space Exploration
  • Generalised
  • Minimal Support
  • Abstraction
Where to Start?
• Selecting a good ISA.
  • The ISA (Instruction Set Architecture) forms the backbone of any general purpose processor.
  • The ISA dictates the final cycles per instruction and the underlying micro-architecture.
  • It also defines the length of a program.
  • It affects compiler innovation.
• Should I go for RISC or CISC?
  • CISC is dead. No new commercial CISC ISA in 30+ years.
  • RISC is widespread and agreed upon for general purpose ISAs.
• VLIW?
  • A fiasco so far!
  • Unpredictable branches, overly complex compilers, variable memory latency.
Why the wait?
• ISAs are proprietary, for business reasons.
• Most companies driving dominant ISAs in the market lack the necessary expertise/experience to develop a proficient ISA.
• ISAs have become unnecessarily large and complex.
• Custom modification or expansion is not easy.
Are there any good open ISAs?
• RISC-V from UC-Berkeley.
  • Modest aim: “to create an industrial standard ISA with relevant hardware and software ecosystem for use in all computing devices”.
  • Based on RISC methodology.
  • Really simple and minimal ISA.
  • Enough scope for expansion.
The RISC-V ISA !!
• Offers 3 base integer ISAs.
  • RV32I, RV64I, and RV128I: one per address width
  • Compact: ~50 instructions needed.
• Extensions available:
  • M: Integer Multiply/Divide.
  • A: Atomic memory operations.
  • F: Single Precision Floating Point.
  • D: Double Precision Floating Point.
  • C: Compressed instruction encoding.
• Provision of extra encoding space for custom, SoC-specific instructions
The RISC-V ISA !!
• Support for variable-length instructions
  • Each instruction can be any number of 16-bit parcels
  • Jump targets are 2-byte aligned
* Andrew Waterman, Yunsup Lee, David A. Patterson, and Krste Asanovic. The RISC-V instruction set manual, Volume I: User-level ISA. Technical Report UCB/EECS-2014-54., EECS Department,University of California, Berkeley, May 2014.
RISC-V INSTRUCTION FORMAT*
Other Challenges in HW Design
• Time to market
Challenges in Hardware Design
• Hardware Software co-simulation
[Figure: simulation speed for an H.264 decoder: HDL (50 kHz), C++ (200 kHz), transactors (tens of MHz)]
Ref: http://www.eve-team.com/demos/mandelbrot_demo.html
Challenges in Hardware Design
• Design Space Exploration
  • Globally optimal solution
  • More accurate compared to an architectural simulator
    • Some of them are not cycle accurate
    • Estimate models
    • No power analysis
    • Ideal modelling of some modules
Image Ref: http://www.academia.edu/4594082/Accuracy_Evaluation_of_GEM5_Simulator_System
Bluespec System Verilog (BSV) to Rescue!
• Design time
  • Reduced by providing a higher level of abstraction
  • Generic interfaces
  • Extensive parameterization
Bluespec System Verilog (BSV) to Rescue!
• Design Space Exploration
  • Extensive parameterization
  • Library components
    • RegFile
    • Different types of FIFOs
    • Bus fabric: AXI and AHB with TLM2 interface
    • Vectors
    • Block RAMs
    • Clock generators and synchronizers
    • Functions: leading zero detector, gcd calculator, etc.
• Hardware Software co-simulation
  • Transactor library
Shakti Processors
• E-Class
– 16/32-bit 3-stage Micro-controller
– Stripped ISA
• C-Class
– 32-bit 5-stage Micro-controller
– Stripped ISA
• I-Class variant
– 64-bit Industrial purpose
– Multi-threading, Quad Issue, etc.
• S-Class
– Targeting Server Applications
– Multi-threaded Variant of I-Class
– Hybrid Memory Cube.
• M-Class
– Multi-core version of I-Class for high-performance and embedded applications.
– Targeting complex SoC systems
• H-Class
– 32+ Cores
– SIMD Support
• T-Class
– Tagged ISA for security.
Tutorial Outline
• E-class
  • Architecture
  • Debug environment
  • Fault-tolerant version
• I-class
  • Architecture
  • Verification Suite
  • Performance evaluation
• Other works
  • Rapid IO
  • Server fabrics
  • SSD controller: NVMe variant
E Class Microcontroller
Target Applications
• Targeted for low-end embedded applications.
  • GPS navigation systems.
  • Storage controllers.
  • FPGA based controllers.
  • Designs running at a maximum frequency of 50-100 MHz.
• Minimal computation applications
  • Customized firmware
• Security
  • Used as a co-processor to ensure security and privileged access to main processor peripherals.
• As a reference RTL model for teaching the basics of computer architecture in academia.
Instruction Support
• Base Integer ISA: RV32IMA
  • Basic arithmetic instructions such as Add, Sub, Mul, Div, etc.
  • Basic logical instructions such as AND, OR, XOR, etc.
  • Atomic instructions.
  • Basic branch, jump and load/store instructions.
• Also supports RVC (RISC-V Compressed ISA)
  • 16-bit instructions for efficient resource utilization in low-end applications.
Why RVC*
• Code redundancy
  • Experimentally observed: about 30 frequent instructions occur 70% of the time.
• Many instructions have few unique operands
  • The destination register and one of the source registers are the same 36% of the time.
• Registers exhibit substantial locality of reference
• Immediate operands are usually small
  • About two-thirds of them fit within 6 bits.
* Andrew Waterman. Improving energy efficiency and reducing code size with RISC-V compressed. Master's thesis, University of California, Berkeley, 2011.
RVC INSTRUCTION FORMAT*
EXAMPLE*
• Examples of encoding
• Here,
  • If the instruction is an addi with the same source and destination register and a small immediate (less than 6 bits), encode it using only one register field and 6 bits for the immediate operand.
  • If a branch-if-equal (beq) instruction with a small branch offset (less than 6 bits) is encountered, encode it using a 5-bit opcode, two 3-bit register fields and a 6-bit immediate operand field.
  • If a subroutine return is encountered with x0 as the destination register and no immediate operand, encode it using a 5-bit opcode and 5 bits for ra.
RVC Pros and Cons
PROS:
• Reduced instruction fetch traffic
  • Improved code density (smaller code size)
• Cache performance improvement
  • Fewer cache misses
• Suitability for systems with smaller address space
  • Fewer cycles to fetch instructions
CONS:
• Power of expression is lost
• Compilers need to be optimized for this feature
OUR DESIGN CHOICES
• Three-stage pipeline, in-order execution only
  • Instruction Fetch Unit
  • Decode and Execute Unit
  • Commit Unit
  • Five-stage pipeline design also planned
• Memory-mapped I/O
  • Less internal logic, smaller chip, less power consumption
• AHB bus as the interconnect fabric
  • Uses 33% less power compared to AXI4
  • Latency and throughput are comparable
• CPU's architectural choices
  • No. of GPRs: 32
  • Width of GPR: 32 bits
  • Width of address bus: 32 bits
  • Width of data bus: 32 bits
Three-stage Pipeline Core
Hardware Support for RVC
• All 32 registers are accessed.
• The decoder is slightly more complex.
• Since only the lower 16 bits are used in RVC mode, we implement the entire register file as 2 independent arrays, where the higher 16-bit array is power gated when operating in Compressed mode.
What are Transactors?
• Developing synthesizable buses/transactors can be hectic and diverts focus from the main project.
• Bluespec provides a range of transactors and parameterized modules for faster development.
  • AMBA based AXI, AHB, APB buses.
  • UART
  • Memory models, etc.
• We use the AHB transactor for our E-class micro-controller.
Generic Peripheral Interfaces
Proposed SoC Layout
COMPARISON – AHB v/s Axi4 - LATENCY
* taken from Design & Reuse at http://www.design-reuse.com/articles/24123/amba-ahb-to-axi-bus-comparison.html
COMPARISON – AHB v/s Axi4 - THROUGHPUT
COMPARISON – AHB v/s Axi4 - UTILIZATION
Debugger
• Why?
  • To step through programs and trace architectural state changes.
  • Runs slower than executing the program directly.
  • The debugger typically shows the location where a “trap” has occurred: an invalid instruction or an unknown state in the program.
• Offers sophisticated features such as
  • Step-by-step execution.
  • Breakpoint insertion.
  • Modifying program state on the fly.
• Debuggers can also be used for fault coverage and performance analysis.
Hardware Support for Debugger
• Advanced JTAG debugging interface.
• Protocol yet to be finalised by the RISC-V Foundation.
• A hardware implementation for OpenRISC is available which supports the RISC-V ISA.
[Figure: debug setup. Host workstation with GDB, JTAG cable, JTAG TAP, Adv. Debug interface, CPU Soft Debug Interface, CPU, System Bus]
CPU DEBUGGER INTERFACE
• reset: Reset CPU state to initial
• run_continue: Run CPU starting from the supplied/current PC to completion
• run_step: Run CPU for one step starting from the supplied/current PC
• stop: Halt the execution of the CPU
• stop_reason: Return why the CPU stopped
• read_pc: Read the current value of the PC
• write_pc: Write into the PC
• read_gpr: Read a value from the Register File
• write_gpr: Write into the Register File
• req_read_memW: Request a memory read
• rsp_read_memW: Read the response from the previous memory read request
• write_memW: Request a memory write
• read_instret: Return the number of instructions retired by the CPU
• read_cycle: Read the number of CPU cycles executed so far
Verification Environment
• Generating all corner test cases to test a processor is a difficult task. Some sort of automation is helpful.
• AAPG: Automatic Assembly Program Generator
  • Python based executable.
  • Constraints can be provided to generate specific targeted test cases:
    • Total number of instructions to be generated.
    • Percentage of branches, ALU operations, loads/stores, etc.
    • Various dependency generation.
    • Random data memory generation.
    • Programmable support for generating/omitting certain instructions.
• Test cases are universal and can be assembled with the riscv-GCC assembler.
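The constraint-driven generation described above can be illustrated with a minimal sketch. This is not the actual AAPG: the function name `generate_program`, the `mix` dictionary and the label scheme are assumptions made for illustration. Branches only jump forward so that the generated program always terminates; register and address values are random and are not guaranteed to be meaningful on real hardware.

```python
import random

# Small, assumed subset of RV32I mnemonics used by this sketch.
ALU_OPS = ["add", "sub", "and", "or", "xor"]
MEM_OPS = ["lw", "sw"]
BRANCH_OPS = ["beq", "bne"]

def generate_program(num_instructions, mix, seed=0):
    """Return a list of labelled assembly lines.

    mix maps the classes 'alu', 'mem' and 'branch' to relative weights
    (for example percentages). The seed makes the output reproducible.
    """
    rng = random.Random(seed)
    classes = list(mix)
    weights = [mix[c] for c in classes]
    lines = []
    for idx in range(num_instructions):
        kind = rng.choices(classes, weights=weights)[0]
        rd, rs1, rs2 = (rng.randrange(1, 32) for _ in range(3))
        if kind == "alu":
            body = f"{rng.choice(ALU_OPS)} x{rd}, x{rs1}, x{rs2}"
        elif kind == "mem":
            offset = 4 * rng.randrange(0, 16)  # word-aligned offset
            body = f"{rng.choice(MEM_OPS)} x{rd}, {offset}(x{rs1})"
        else:
            # Forward-only branch target, so no infinite loops are created.
            target = rng.randrange(idx + 1, num_instructions + 1)
            body = f"{rng.choice(BRANCH_OPS)} x{rs1}, x{rs2}, i{target}"
        lines.append(f"i{idx}: {body}")
    lines.append(f"i{num_instructions}: nop")  # landing pad for the last branch
    return lines

if __name__ == "__main__":
    for line in generate_program(8, {"alu": 60, "mem": 25, "branch": 15}, seed=1):
        print(line)
```

Changing the weights in `mix` shifts the instruction profile, which mirrors the "percentage of branches, ALU, load/stores" constraint listed above.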
Verification Environment
• Instruction Set Simulator
  • A C/C++ based single-cycle processor simulator.
  • Supports the entire RV32/64 IMAFD.
  • Creates a register dump after every instruction.
  • A universal simulator which can be used across any micro-architecture for verification.
  • The functionality of the ISS has been verified against UCB's Spike simulator.
Verification Environment
• Instruction Set Simulator
  • Creating and managing memory elements within the ISS for fast operation is a challenge.
  • We have used two data structures to mimic data memory:
    • A contiguous memory for addresses in the range 0 to 2^20.
    • Loads/stores beyond 2^20 are kept in a dynamic lookup table of addresses and corresponding data.
  • Backward jumps are tricky to handle.
  • Challenges in file I/O operations.
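The two-structure data memory described above can be sketched as follows. The actual ISS is written in C/C++; this Python version only illustrates the idea, and the class name `DataMemory`, the byte granularity and the exact boundary (addresses below 2^20 go to the contiguous array) are assumptions.

```python
LOW_LIMIT = 1 << 20  # assumed boundary between the two structures

class DataMemory:
    """Contiguous array for low addresses, lookup table for the rest."""

    def __init__(self):
        self.low = bytearray(LOW_LIMIT)  # fast path: direct indexing
        self.high = {}                   # sparse path: address -> byte

    def store_byte(self, addr, value):
        if addr < LOW_LIMIT:
            self.low[addr] = value & 0xFF
        else:
            self.high[addr] = value & 0xFF

    def load_byte(self, addr):
        if addr < LOW_LIMIT:
            return self.low[addr]
        return self.high.get(addr, 0)    # untouched memory reads as zero

    def store_word(self, addr, value):
        # RISC-V is little-endian: least significant byte at the lowest address.
        for i in range(4):
            self.store_byte(addr + i, value >> (8 * i))

    def load_word(self, addr):
        return sum(self.load_byte(addr + i) << (8 * i) for i in range(4))

if __name__ == "__main__":
    mem = DataMemory()
    mem.store_word(0x8000_0000, 0x12345678)
    print(hex(mem.load_word(0x8000_0000)))
```

The design choice is a common one: most test programs touch a small, dense address range, so direct indexing keeps the common case fast while the dictionary handles the occasional far access without allocating the full address space.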
Verification Environment
[Figure: verification flow. AAPG, instructions and initialised memory, ISS, processor with I-Cache and D-Cache, "Match?" check, "no" leads to "modify"]
• The Automatic Assembly Program Generator (AAPG), written in Python, generates random assembly instructions.
• The Instruction Set Simulator (ISS) is the functional equivalent of a single-stage processor, written in C.
• The register dumps taken after every instruction from the ISS and from the processor are matched against each other for verification.
Fault Tolerant Microcontroller
Micro-Architecture
• 32-bit 5-stage pipeline with branch prediction.
• Supports all integer instructions: RV32I and the 'M' extension.
• Resilient against hard and soft errors.
  • Tolerates a single-bit error from the fetch stage to the writeback stage.
  • Tolerates one ALU failure in the execution unit.
FT Technique Implemented for Memories/Registers
• The Single Error Correction and Double Error Detection (SEC-DED) technique has been used to mitigate single-bit errors in the Instruction Memory (IMEM), Data Memory (DMEM), Architectural Register File (ARF) and Program Counter register (PC).
• Data from IMEM, DMEM, ARF and PC flows along with its check bits through all the pipeline stages. With this approach, SEEs on the inter-stage buffers (ISBs) are also covered as far as the data path is concerned.
• A Hamming code with 32 data bits and 7 check bits has been used for SEC-DED.
• SEC-DED logic is placed after the ISB and before the combinational logic at the appropriate pipeline stages.
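To make the SEC-DED scheme above concrete, here is a minimal sketch of an extended Hamming code over 32 data bits: 6 position-based parity bits plus one overall parity bit, giving 7 check bits. The bit layout, the function names `encode`/`decode` and the status strings are assumptions chosen for readability; they are not taken from the SHAKTI RTL.

```python
PARITY_POS = [1, 2, 4, 8, 16, 32]          # Hamming parity bit positions
CODE_LEN = 39                              # index 0 = overall parity, 1..38 = Hamming code
DATA_POS = [p for p in range(1, CODE_LEN) if p not in PARITY_POS]  # 32 positions

def _syndrome(word):
    """XOR of the positions of all set bits in positions 1..38."""
    s = 0
    for pos in range(1, CODE_LEN):
        if word[pos]:
            s ^= pos
    return s

def encode(data):
    """Encode a 32-bit integer into a list of 39 bits."""
    word = [0] * CODE_LEN
    for k, pos in enumerate(DATA_POS):
        word[pos] = (data >> k) & 1
    s = _syndrome(word)
    for p in PARITY_POS:                   # choose parity bits so the syndrome becomes 0
        word[p] = 1 if s & p else 0
    word[0] = sum(word[1:]) & 1            # overall parity makes the total even
    return word

def decode(word):
    """Return (data, status); status is 'ok', 'corrected', 'double' or 'uncorrectable'."""
    word = list(word)
    s = _syndrome(word)
    parity_odd = sum(word) & 1
    if parity_odd:                         # odd number of flips: assume a single error
        if s >= CODE_LEN:
            return None, "uncorrectable"
        word[s] ^= 1                       # s == 0 means the overall parity bit itself
        status = "corrected"
    elif s:                                # even number of flips but non-zero syndrome
        return None, "double"
    else:
        status = "ok"
    data = sum(word[pos] << k for k, pos in enumerate(DATA_POS))
    return data, status

if __name__ == "__main__":
    cw = encode(0xDEADBEEF)
    cw[10] ^= 1                            # inject a single-bit error
    print(decode(cw))
```

Any single flipped bit is corrected, and any two flipped bits are reported rather than silently miscorrected, which is the behaviour the pipeline relies on when check bits travel with the data.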
Simplified View of Proposed Architecture
FT Technique Implemented for ALU Design
• The fault tolerant ALU comprises 5 types of functional units (FUs): adder, comparator, multiplier, shifter and logical unit.
• All types of FUs are in a dual modular redundant (DMR) configuration with fault handling logic, called a Fault Tolerant FU (FTFU).
• A faulty FU is isolated using a time redundancy technique.
• Different recomputation techniques have been proposed for different types of FUs.
[Figure: fault tolerant ALU. ISB-ID/EX, DE-MUX, duplicated units ADD1/ADD2, MUL1/MUL2, LU1/LU2 and Shifter1/Shifter2, each pair with "Fault handling Logic & Data Posting", MUX, ISB]
Fault Handling Logic
• No Error (NE): a match between the outputs of the Prime FU (FUP) and the Redundant FU (FUR).
• Transient Error (TE): a mismatch between the outputs of FUP and FUR.
• Permanent Error (PE): a transient error that persists for 3 consecutive cycles.
Fault Handling Table
Equality Check | Error Type | Action Taken
Match | NE | Posted from FUP
Mismatch for less than 3 consecutive cycles | TE | No posting (redo the operation)
Mismatch for 3 consecutive cycles | PE | Recomputation to identify the faulty FU and update the corresponding PE flag
PE Flag Table
PE Flag of FUP | PE Flag of FUR | Conclusion/Action Taken
0 | 0 | Not able to detect the faulty FU. Post data as '0'. Call the Interrupt Service Routine (ISR).
0 | 1 | FUR faulty. Isolate FUR and post from FUP from here onwards.
1 | 0 | FUP faulty. Isolate FUP and post from FUR from here onwards.
1 | 1 | Both FUP and FUR faulty. Post data as '0'. Call the ISR.
Block Diagram of Fault Tolerant FU (FTFU)
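The two tables above amount to a small state machine plus a lookup. A minimal behavioural sketch follows; the class name `FaultHandler`, the function `pe_action` and the returned strings are illustrative assumptions, not signals from the actual design.

```python
class FaultHandler:
    """Classifies each comparison of FUP and FUR outputs as NE, TE or PE."""

    PE_THRESHOLD = 3  # consecutive mismatches that make an error permanent

    def __init__(self):
        self.mismatches = 0

    def step(self, fup_out, fur_out):
        if fup_out == fur_out:
            self.mismatches = 0
            return "NE"                    # post the result from FUP
        self.mismatches += 1
        if self.mismatches >= self.PE_THRESHOLD:
            return "PE"                    # recompute to identify the faulty FU
        return "TE"                        # no posting, redo the operation

def pe_action(pe_fup, pe_fur):
    """Action for each combination of PE flags, following the PE Flag Table."""
    if pe_fup and pe_fur:
        return "post 0, call ISR"          # both units faulty
    if pe_fup:
        return "isolate FUP, post from FUR"
    if pe_fur:
        return "isolate FUR, post from FUP"
    return "post 0, call ISR"              # faulty unit could not be identified

if __name__ == "__main__":
    fh = FaultHandler()
    print([fh.step(7, 7), fh.step(7, 5), fh.step(7, 5), fh.step(7, 5)])
```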
Re-Computation Techniques for ALU
1. Addition and Subtraction
Recomputing with complemented operands (RECOMPO) scheme. Sum and Carry are self-dual functions.
In Normal Computation: Fn = OP1 + OP2 (Carryin = 0)
In Recomputation Step: Fr = (~OP1 + 1) + ~OP2 (+1 on OP1 to absorb Carryin = 1)
Fault Detection Capability: As Fn and Fr are complementary, a stuck-at failure at any bit position will be detected.
This also acts as the basic block for the Comparator Functional Unit.
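The RECOMPO check for the adder can be tried out with a small behavioural model. The sketch assumes a 4-bit adder whose 5-bit output (sum plus carry) has a single stuck-at fault; the fault model and the helper names are assumptions made for illustration only.

```python
N = 4                                   # operand width of the toy adder
OP_MASK = (1 << N) - 1
RES_MASK = (1 << (N + 1)) - 1           # N sum bits plus the carry-out

def faulty_adder(a, b, carry_in, stuck_bit=None, stuck_val=0):
    """Adder whose output bit `stuck_bit` may be stuck at `stuck_val`."""
    result = (a + b + carry_in) & RES_MASK
    if stuck_bit is not None:
        if stuck_val:
            result |= 1 << stuck_bit
        else:
            result &= ~(1 << stuck_bit)
    return result

def recompo_detects(a, b, stuck_bit=None, stuck_val=0):
    """Return True when the RECOMPO comparison flags an error."""
    fn = faulty_adder(a, b, 0, stuck_bit, stuck_val)
    # Recompute with complemented operands and carry-in = 1.
    fr = faulty_adder(~a & OP_MASK, ~b & OP_MASK, 1, stuck_bit, stuck_val)
    # In a fault-free adder, Fr is the bitwise complement of Fn.
    return fn != (~fr & RES_MASK)

if __name__ == "__main__":
    print(recompo_detects(3, 5))                              # fault-free
    print(recompo_detects(3, 5, stuck_bit=2, stuck_val=1))    # output bit 2 stuck at 1
```

Because the fault-free results are bitwise complements, a stuck output bit has the same value in both runs and therefore always breaks the complement relation in this model.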
2. Multiplication
In Normal Computation: Fn = OP1 * OP2
In Recomputation Step: Fr = (2's complement of OP1) * OP2
Fault Detection Capability:
1. Under no-fault condition: Fn = -Fr = 2^n - Fr.
2. Any stuck-at failure at a particular bit position i:
Result of normal computation: Rnormal = Fn ± 2^i.
Result of recomputation step: Rrecomp = Fr or Fr ± 2^i.
The fault handling logic checks whether Rnormal and the 2's complement of Rrecomp are equal.
2's complement of Rrecomp = 2^n - Rrecomp = 2^n - (Fr or Fr ± 2^i) = (2^n - Fr) or (2^n - Fr ∓ 2^i) = Fn or Fn ∓ 2^i.
Therefore the 2's complement of Rrecomp ≠ Rnormal. In this way, single or multiple stuck-at bit faults in the multiplier circuitry can be detected.
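The multiplier check can be exercised with a toy model in the same spirit. The sketch assumes an 8-bit result taken modulo 2^8 and a single stuck-at fault on one output bit; these modelling choices and the function names are assumptions. The exhaustive loop in the usage example is deliberately restricted to bit positions below the most significant bit, where the argument above carries over directly to this simplified model.

```python
W = 8                                   # result width of the toy multiplier
MASK = (1 << W) - 1

def neg(x):
    """Two's complement negation modulo 2^W."""
    return (-x) & MASK

def faulty_mul(a, b, stuck_bit=None, stuck_val=0):
    """Multiplier whose output bit `stuck_bit` may be stuck at `stuck_val`."""
    result = (a * b) & MASK
    if stuck_bit is not None:
        if stuck_val:
            result |= 1 << stuck_bit
        else:
            result &= ~(1 << stuck_bit)
    return result

def mul_check(a, b, stuck_bit=None, stuck_val=0):
    """Return True when Rnormal differs from the 2's complement of Rrecomp."""
    r_normal = faulty_mul(a, b, stuck_bit, stuck_val)
    r_recomp = faulty_mul(neg(a), b, stuck_bit, stuck_val)
    return r_normal != neg(r_recomp)

if __name__ == "__main__":
    missed = 0
    for a in range(16):
        for b in range(16):
            for bit in range(W - 1):
                for val in (0, 1):
                    wrong = faulty_mul(a, b, bit, val) != (a * b) & MASK
                    if wrong and not mul_check(a, b, bit, val):
                        missed += 1
    print("undetected wrong results:", missed)
```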
3. Shift Related Operation
Recomputing with shifted operands (RESO) technique has been used. RESO-2 is implemented here.
In Normal Computation: The operand is given to the shifter, which computes the output Fn.
In Recomputation Step: The operand left-shifted by 2 bit positions is fed to the shifter circuitry, which computes the output Fr.
Fault Detection Capability:
1. Under no-fault condition: Fn[n-1:0] = Fr[n+1:2]
2. Any stuck-at failure at a particular bit position i:
Result of normal computation: Rnormal[n-1:0] = Fn ± 2^i.
Result of recomputation step: Rrecomp[n+1:2] = Fn or Fn ± 2^(i-2).
The fault handling logic checks whether Rnormal[n-1:0] and Rrecomp[n+1:2] are equal. As the error causes different results in the two computation steps, it is detected.
4. Logical Operation
Recomputing with Swapped Operands technique has been used.
The lower and upper halves of the operands are swapped during re-computation.
The remaining arithmetic computation units, such as dividers, still need to be hardened.
Verification Methodology
Module Level Verification
(This involves regression verification of the following modules)
• Fault Tolerant Adder
• Fault Tolerant Multiplier
• Fault Tolerant Logical Unit
• Fault Tolerant Shifter
• Fault Tolerant Comparator
• Error Correction Module (SEC-DED module)
System Level Verification
• This involves verification at the processor level.
• The aim is to validate the system under different fault rates.
Module Level Verification
• Systematic Testing: FIM simulates stuck-at-'1' and stuck-at-'0' faults at every bit position.
• Random Testing: FIM simulates multiple-bit errors randomly at the output of FU-1.
System Level Verification
• Fault injection at the ISB between the decode and execute stages. Injecting faults at the ID/EX ISB simulates faults in various components of the processor such as IMEM, DMEM, ARF, the PC register and previous ISBs in the pipeline flow.
• Fault injection at the Fault Tolerant FUs.
• A total of 8 fault injection locations.
• LFSR_SEL and LFSR_EN control the enable input of the LFSRs used for FIM. This scenario can simulate errors for 1/2/3 consecutive clock cycles.
ASIC Implementation
The design is targeted for 55 nm technology.
Worst Corner Condition (PVT): Slow, 1.08 V, 125 °C.
Best Corner Condition (PVT): Fast, 1.32 V, -40 °C.
Processor version | Comb. instances | Seq. instances | Total instances | Static power (mW) | Dynamic power (mW) | Total power (mW) | Core area (mm^2) | Min clock period (ns)
Base | 22542 | 2634 | 25176 | 0.58 | 20.97 | 21.55 | 0.2704 | 2.4
Fault tolerant | 28259 | 3507 | 31766 | 1.16 | 9.19 | 10.35 | 0.3249 | 3.7
Overhead | 25.36% | 33.14% | 26.17% | 100% | -56.17% | -51.97% | 20.15% | 54.17%
I Class Processor
Out-Of-Order Architecture
1. Necessity: The need for an Out-of-Order Processor
2. Concept: Architecture of the Out-of-Order Processor
3. Benchmark Results
4. Hands-on: Putting the code to use.
Why Out-of-Order(OoO) Processor?● Fills stalls created due to variable latencies in execution.
● Complex method for fine-grained Data prefetching
● Balances out poor compilers and lazily written code.
● Exploits Instruction Level Parallelism.
PERFORMANCE!!
Academia and OoO Research
● General lack of effort to build and evaluate OoO designs.
● Proprietary limitations at each step.
● Use of software simulators like GEM5
○ Cannot evaluate area and power consumption accurately.
○ Cannot be relied upon for accuracy.
○ Too slow.
○ Actual hardware overheads of optimizations are not easily identified.
● Advent of open ISAs like RISC-V and OpenRISC.
● Introduction of high-level hardware description languages: Chisel, Bluespec, etc.
Instruction Level Parallelism
Pipeline CPI = Ideal CPI + Data Stalls + Control Stalls + Structural Stalls
Technique | Reduces
Forwarding and Bypass | Potential Data Hazard Stalls
Delayed Branches and simple branch scheduling | Control Hazard Stalls
Basic Dynamic Scheduling (Score-boarding) | Stalls from True Dependencies
Dynamic Scheduling with renaming | Stalls from Output and Anti dependencies
Branch Prediction | Control Hazard Stalls
Multiple Issue | Ideal CPI
Dynamic Scheduling
● Scoreboarding
○ Allows instructions to be issued out of order only when there are sufficient resources and no data dependencies.
○ No forwarding.
● Tomasulo Algorithm
○ Operand forwarding by register renaming, using a ReOrder Buffer; avoids data hazards.
○ Registers are written in the commit stage.
● Merged Register File approach
○ Operand forwarding by register renaming, using extra registers.
○ Registers are written right after execution.
The Case of the I Class Processor
● A 64-bit processor providing Out-of-Order execution.
● Based on the RISC-V ISA.
● RV64I + RV64M + RV64A + RV64FD
● Coded in Bluespec System Verilog (BSV).
● The architecture implements the Merged Register File approach to achieve high instruction level parallelism.
I Class Features
● Based on the RISC-V ISA: RV64I and the 'M' extension
● Dual issue.
● Parameterized pipeline and issue queue.
● CAM based speculative load store unit.
● Priority-based selection of instructions from the issue queue.
● Inter-functional unit bypass network.
● Tournament Branch Predictor.
● 32 KB I-Cache and non-blocking cache.
Architecture Overview
Pipeline Stages
1. FETCH
2. DECODE
3. MAP
- Register renaming.
4. WAKEUP
5. SELECT AND GRANT
- Selection from the instruction queue.
6. DATA READ AND DRIVE
7. EXECUTE
8. COMMIT
[Figure: architecture overview. Legend: storage structures, pipeline stages, execution unit]
Fetch and Decode
[Figure: Fetch with BPU and cache, followed by Decode]
● The next instruction is predicted for every instruction in the Fetch Unit.
● Direct jump instructions like JAL are executed in the Decode stage itself.
Branch Prediction Unit
• A Tournament Branch Predictor is being used.
  • Tournament between a Bimodal and a Global Branch Predictor.
  • Both the predictor tables have 16 entries.
  • Total storage used by the predictor is ~2 Kbits.
• State-of-the-art branch predictor based on Prediction by Partial Matching (PPM).
  • Comprises 1 Bimodal bank and 4 Global banks for prediction.
  • Total memory required for it is 64 Kbits.
MAP
• New source register addresses for instructions from the decode stage are assigned from the fRAM.
• Destination registers are mapped to new addresses using the FRQ.
• Dependencies between the instructions are checked, and the instructions are dispatched into the instruction queue.
[Figure: Instruction-1 and Instruction-2 (fields rd, rs1, rs2) with the fRAM and the FRQ]
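A minimal sketch of the MAP step: sources are looked up in the fRAM (architectural to physical map) and the destination gets a fresh physical register from the FRQ (free register queue). The sizes (32 architectural, 64 physical registers), the treatment of x0 as never renamed and the class name `RenameMap` are assumptions for illustration.

```python
from collections import deque

class RenameMap:
    def __init__(self, num_arch=32, num_phys=64):
        # Initially architectural register i lives in physical register i.
        self.fram = list(range(num_arch))
        # The remaining physical registers are free.
        self.frq = deque(range(num_arch, num_phys))

    def rename(self, rd, rs1, rs2):
        """Return (pd, ps1, ps2), or None if no free register is available."""
        # Sources are read before the destination is remapped, so an
        # instruction like add x1, x1, x2 sees the old mapping of x1.
        ps1, ps2 = self.fram[rs1], self.fram[rs2]
        if rd == 0:
            return 0, ps1, ps2           # x0 is hardwired to zero, never renamed
        if not self.frq:
            return None                  # the pipeline would stall here
        pd = self.frq.popleft()
        self.fram[rd] = pd
        return pd, ps1, ps2

if __name__ == "__main__":
    rm = RenameMap()
    print(rm.rename(1, 2, 3))   # x1 gets a fresh physical register
    print(rm.rename(4, 1, 1))   # sources follow the new mapping of x1
    print(rm.rename(1, 4, 0))   # a second write to x1 gets yet another register
```

The second call shows a true dependency being tracked through the map, while the third shows how a write-after-write hazard on x1 disappears because each write receives its own physical register.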
Selection from Issue Queue
• Instructions eligible for selection for execution:
  • Instructions which have all their operands ready.
  • Corresponding execution units are predicted to be free.
• Selection of an instruction from the set of eligible ones:
  • Age based selection policy.
  • Position based selection policy.
• Priority encoders for the selection policy:
  • Tree-shaped priority encoders for the position based policy.
  • Barrel-shifted tree based priority encoders for the age based selection policy.
• Selected instructions are enqueued into the inter-stage buffer.
Priority Encoders
[Figure: Serial Priority Encoder; Tree-shaped Priority Encoder]
Data Read and Drive
• In this stage, instructions are dequeued from the inter-stage buffer and operands are read from the register file.
• The I-Class processor can issue up to 5 instructions per cycle, which results in 10 read ports for the physical register file.
• Immediate fields of instructions are stored in a separate location (immediate buffer), and the pointer to the immediate buffer location is stored in the issue queue entry.
• The inter-stage buffer can accommodate a single element. This is because all the execution units in the I-Class processor are pipelined.
Instruction Execute
• The calculated result is broadcast to the Map stage and the status of source operands is updated.
• The Branch Unit broadcasts a training packet to the Branch Predictor Unit.
  • The branch training packet consists of the Squash PC (jump address) and the branch result (taken or not).
  • The branch result is also updated in the Squash Buffer, which is useful at the time of committing.
• All arithmetic operations are single-cycle operations.
• A separate execution unit for the Multiplier and Divider.
  • The Multiplier takes 6 cycles and the Divider takes 36 cycles; both are pipelined.
• The arrival of the result for load instructions cannot be predicted. Hence the dependent instructions are notified accordingly.
CAM based Load Store Unit
[Figure: load store unit. EAC, Store Queue and Load Queue with CAM search, memory access, cache, broadcast of load result, store commit, load commit, flush wire]
Each memory access instruction is allotted an entry in one of the LS queues.
The value from the store is forwarded in case of an address match.
The alias bit is set in case of wrong speculation, and the pipeline is flushed at the time of commit.
CAM based Load Store Unit
[Figure: flowchart. CAM search on load queue; "Match" decision; "Forwarded set?" and "Forward acknowledge set?" decisions; actions: set forward acknowledge, set alias bit, send store request to D-Cache, invalidate Store Queue entry]
● The alias bit is set in two cases:
○ When there is no forwarding from the corresponding store.
○ When there is forwarding from the wrong store.
● If neither case applies, forward acknowledge is set.
● The store instruction proceeds to the D-Cache in any case.
Instruction Wake-up
• The result from the execution units is broadcast to the Instruction Queue (IQ).
• Each instruction compares the destination tag with the source operand tags in the IQ and updates operand ready.
• During the same cycle, the register file is also updated.
Bypass Network
• Dependent instructions have a 3-cycle bubble between them.
[Figure: pipeline timing of a producer (Select, Drive, Execute, Broadcast) and its consumer (Wakeup, Select, Drive, Execute), shown twice]
● In the bypass network, instructions are predicted to finish in a certain number of cycles.
● Dependent instructions are woken up accordingly.
Implementation of Bypass Network
• Instead of having registers for operand ready, every instruction is assigned a Delay register.
• The contents of the Delay register are moved to a “Shift register” at the time of broadcast.
• The contents of the “Shift register” are right-shifted every cycle. When the rightmost bit of the “Shift register” is set, the corresponding instruction is released for execution.
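The delay/shift register mechanism can be sketched as a tiny timer. The one-hot encoding of the delay (a single bit at position `delay`) and the names `WakeupTimer`, `broadcast`, `tick` and `ready` are assumptions; the slide does not specify the exact encoding.

```python
class WakeupTimer:
    """Releases a dependent instruction a fixed number of cycles after broadcast."""

    def __init__(self, delay):
        self.delay_reg = 1 << delay      # one-hot: the bit position encodes the delay
        self.shift_reg = 0

    def broadcast(self):
        # At broadcast time the delay value is copied into the shift register.
        self.shift_reg = self.delay_reg

    def tick(self):
        # One clock cycle: shift right by one position.
        self.shift_reg >>= 1

    @property
    def ready(self):
        # The instruction is released when the rightmost bit is set.
        return bool(self.shift_reg & 1)

if __name__ == "__main__":
    timer = WakeupTimer(delay=2)
    timer.broadcast()
    for cycle in range(4):
        print(cycle, timer.ready)
        timer.tick()
```

With `delay=2`, the consumer becomes ready exactly two ticks after the broadcast, which is how a predicted completion time can replace an explicit operand-ready flag.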
Commit Stage
• All instructions are committed in order. This is done by maintaining a head pointer for the oldest instruction in the instruction queue.
• The I-Class processor supports multiple commits per cycle, and dependencies between the instructions are checked before commit.
• The architectural destination registers of instructions are added back to the FRQ.
• The destination register mapping of the committing instruction is updated into the rRAM.
• The following are the cases of exception in the Commit Stage:
  • Mis-predicted branches, whose decision can be determined from the Squash Buffer.
  • A wrongly speculated load instruction.
Handling Exceptions During Commit
• Exceptions being supported:
  • Instruction address misaligned exception.
  • Instruction access fault.
  • Illegal instruction.
  • Load address misaligned exception.
  • Load access fault.
• The interrupt controller takes care of exceptions from the commit stage.
• The interrupt controller has logic to transfer control to machine mode and update the program counter.
Flushing Pipeline
• In case of an exception, a branch misprediction or a wrongly speculated load, the whole pipeline is flushed.
• Flushing the pipeline:
  • Clear all the inter-stage buffers.
  • Clear the Load Queue, the Store Queue and the Instruction Queue.
  • Copy the contents of the rRAM to the fRAM.
  • Mark all the contents of the FRQ valid and reset Head and Tail to zero.
  • Change the program counter to the Squash Program Counter of the corresponding instruction.
Cache Architecture
• Fully parameterized for all the buffers and the cache.
• Non-blocking: services requests unless the load buffer is full or there are more than 2 requests for the same cache line address.
• The Load Buffer is implemented as a vector of concurrent registers, and the wait buffer mimics a searchable FIFO.
• The Completion Buffer issues tokens and is responsible for in-order completion of the requests.
[Figure: cache with Completion Buffer, Load Buffer and Wait Buffer, and higher levels of memory]
Cache Architecture
• In case of a cache miss, the entries are stored in the Load Buffer (LB).
• The LB is searched fully associatively; if the entry is already in the LB, the wait buffer is updated.
• While servicing entries in the wait buffer, the CPU requests are stalled.
[Figure: flowchart with decisions "Cache hit?" and "Already in LB?" and actions: response to CPU, request to memory, update wait buffer, stall requests from CPU]
Execution Modes
• Privilege levels are used to provide protection between different components of the software stack.
• At any time, the core runs in one of the following privilege modes.
• Machine level is the highest privilege level, is inherently trusted and is mandatory.
Level | Encoding | Name | Abbreviation
0 | 00 | User | U
1 | 01 | Supervisor | S
2 | 10 | Hypervisor | H
3 | 11 | Machine | M
Execution Modes
• User-mode (U-mode): conventional applications
• Supervisor-mode (S-mode): operating system
• Hypervisor-mode (H-mode): virtual machine monitors
• There are Control and Status Registers (CSRs) associated with each privilege level.
• Privileged instructions are used to access the CSRs.
• The “csr” field is used to address CSRs. Fields rs1 and rd denote integer registers.
• CSRs are addressed based on privilege level access and read/write permissions.
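How the CSR address itself encodes access rights can be shown with a short check. It relies on the RISC-V privileged specification's convention that bits [11:10] of the 12-bit CSR address indicate read/write (11 means read-only) and bits [9:8] give the lowest privilege level that may access the register. The function name `csr_access_ok` is an illustrative assumption, and the core's real logic may perform additional checks.

```python
def csr_access_ok(csr_addr, priv_level, is_write):
    """Return True if an access to `csr_addr` is allowed.

    priv_level uses the encoding from the table above:
    0 = User, 1 = Supervisor, 2 = Hypervisor, 3 = Machine.
    """
    read_only = ((csr_addr >> 10) & 0b11) == 0b11   # csr[11:10] == 11
    min_priv = (csr_addr >> 8) & 0b11               # csr[9:8]
    if priv_level < min_priv:
        return False          # insufficient privilege
    if is_write and read_only:
        return False          # writes to read-only CSRs are illegal
    return True

if __name__ == "__main__":
    MSTATUS = 0x300   # machine-level, read/write
    CYCLE = 0xC00     # user-level, read-only counter
    print(csr_access_ok(MSTATUS, 3, True))    # machine mode may write mstatus
    print(csr_access_ok(MSTATUS, 0, False))   # user mode may not read it
    print(csr_access_ok(CYCLE, 0, False))     # user mode may read cycle
    print(csr_access_ok(CYCLE, 3, True))      # nobody may write cycle
```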
RTL Code
• Written in Bluespec System Verilog.
• Total number of lines of code: ~11k
  • Cache LoC: ~3k
  • Core LoC: ~8k
• 24 modules: highly modularized.
  • Facilitates design space exploration.
  • The Bypass Network could be added by plugging in two extra modules.
• 35 parameterizable variables
  • Enabled quick trade-off analysis between IPC and clock speed.
IPC for AAPG Test Cases
• IPC variation based on issue queue size.
Dhrystone Results
[Figure: IPC versus issue queue size; frequency recorded]
A 65 nm UMC IP standard cell library with operating conditions of 1.32 V supply voltage and 110 °F is used.
Performance
• Fully synthesizable
• Runs at 110 MHz on a Xilinx Artix-7 board
Benchmark | IPC
CoreMark | 1.18
Code Access and Use
• The code is publicly available at https://bitbucket.org/casl/shakti_public.
• Released under the BSD license.
• The Verification Environment can be used to verify the code.
• To run benchmarks of any kind, generate the list of instructions into the input.hex file and initialize memory in the rtl_mem_init.txt file.
• Parameters can be changed in the defined_parameters.bsv file in BSV_source.
• Use the makefile to compile and simulate the files.
Other Works
RapidIO interconnect
• The Serial RapidIO Gen3+ standard is proposed to be used as the CPU + I/O interconnect
• It is proposed to use the 10/25G SERDES version of this standard. A port will consist of 4, 8 or 16 lanes of 10/25G each, transported over electrical/optical links.
  • CC links for multi-socket configurations will need all 16 lanes
  • I/O interconnects will probably need 100 Gbit/s (4 x 25G)
• Cache coherency
  • A 5+ state MOESI/MESIF-like directory based protocol will be the CC protocol
  • Mapped to each core's AMBA/CHI interconnect
  • Scalable up to 128 sockets (16 clusters of 8 sockets/cluster)
• Optical configurations
  • 10 lanes of 10G or 4 lanes of 25G muxed onto the 802.3bm standard's 100G optical link
  • Intel CX optical interconnect can accommodate a larger lane count
• Extending the max packet size of SRIO to 4k
  • TBD based on performance of 256-byte packets
• IIT-M IP
  • Complete implementation of the 3.x standard except analog components
  • 4 x 4 switch (4-lane ports, max of 100 Gbit/s per port at 25G lanes) being developed
  • Critical for non-SHAKTI projects also
Server Fabrics
• Workload parallelism involves access to shared resources.
  • The core should be fast and exploit ILP.
  • The interconnection fabric should also support throughput requirements.
• It is the complete system's performance which needs to improve.
• Not all applications have similar memory access patterns
  • Hadoop
  • RDBMS
• Need for an Adaptive Server Framework
  • Hybrid interconnects which speed up resource sharing.
  • Fast memories such as Hybrid Memory Cube.
• Observed that 8 physical threads per process are sufficient
  • Use a bi-directional ring for up to 8 cores.
  • MESI+GOLS to take care of cache coherence.
  • Easily scalable.
  • More power and area efficient than a mesh.
  • Dynamic clustering.
[Figure: eight Core/L2 pairs, with HMC / DDR and RIO - CC blocks]
• Hybrid interconnect schemes
• Rings scale well for up to 8 cores.
• For more than 16 cores, a mesh provides better performance.
• CCNoC for cache coherence.
• A mesh is power and area inefficient.
• Need for hybrid structures.
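One way to see why rings suit small core counts while meshes win at larger ones is to compare average hop counts. The sketch below is a simplified model, not the analysis behind the slide: it assumes uniform traffic between all pairs of nodes, shortest-path routing, and ignores contention and router delay. The function names are illustrative.

```python
from itertools import product


def ring_avg_hops(n):
    """Average shortest-path hops between distinct nodes on a bidirectional ring."""
    total = sum(min(d, n - d) for d in range(1, n))
    return total / (n - 1)


def mesh_avg_hops(rows, cols):
    """Average Manhattan-distance hops between distinct nodes on a 2D mesh."""
    nodes = list(product(range(rows), range(cols)))
    total = sum(
        abs(r1 - r2) + abs(c1 - c2)
        for (r1, c1) in nodes
        for (r2, c2) in nodes
    )
    return total / (len(nodes) * (len(nodes) - 1))


def ring_links(n):
    """Number of bidirectional links in a ring of n nodes."""
    return n


def mesh_links(rows, cols):
    """Number of bidirectional links in a rows x cols mesh."""
    return rows * (cols - 1) + cols * (rows - 1)


if __name__ == "__main__":
    for rows, cols in [(2, 4), (4, 4), (8, 8)]:
        n = rows * cols
        print(
            f"{n:2d} nodes: ring {ring_avg_hops(n):.2f} hops / {ring_links(n)} links, "
            f"mesh {mesh_avg_hops(rows, cols):.2f} hops / {mesh_links(rows, cols)} links"
        )
```

Under this model the hop-count gap is small at 8 nodes, where the ring also needs fewer links, and grows quickly with node count, which is consistent with using rings locally and a mesh or hybrid structure globally.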
[Figure panels: Mesh of Rings with 2 bridges; Mesh of Rings with 4 bridges; Hybrid with hierarchical rings]
• Use RapidIO for socket-to-socket connection.
• A 5+ state MOESI/MESIF-like directory-based protocol will be the CC protocol.
• Scalable up to 128 sockets (16 clusters of 8 sockets).
• It is proposed to use the 10/25G SERDES version of this standard.
[Figure: four sockets, Proc-1 to Proc-4, connected through RIO]
LightNVM
● Specification for Open-channel SSD
● Extension to NVMe specification
● Adapts the Linux kernel's NVMe driver to provide the LightNVM interface
● Allows SSD to expose internal organization & control the flow of data to it
● In-kernel FTL
MULTI-CHANNEL STORAGE CONTROLLER
TOP LEVEL DATA FLOW OF NVM CONTROLLER
RESULTS
● In the speed test using PCIe Gen-2 x4, the data rate is 14.76 Gb/s.
● In the simulation test, the max. data rate is 15.83 Gb/s and the max. bandwidth utilization is approx. 98.9%.
● Normalized NAND interface utilization is 99.9% for a single-channel controller.
● The arbiter scales well as the number of IO queues increases.
● Synthesis results show an ideal proportional increase in combinational and sequential area with respect to the number of channels.
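The utilization figure can be sanity-checked against the usable bandwidth of the link. The sketch below assumes the baseline is the PCIe Gen-2 x4 rate after 8b/10b encoding (5 GT/s per lane, 80% efficiency), which is not stated on the slide but reproduces the quoted 98.9%. Packet-level overhead is ignored and the names are illustrative.

```python
def pcie_gen2_effective_gbps(lanes):
    """Usable bandwidth of a PCIe Gen-2 link after 8b/10b line coding."""
    raw_gtps_per_lane = 5.0        # PCIe Gen-2 signalling rate per lane
    encoding_efficiency = 8 / 10   # 8b/10b: 8 data bits per 10 line bits
    return lanes * raw_gtps_per_lane * encoding_efficiency


link_gbps = pcie_gen2_effective_gbps(4)

# Data rates quoted on the slide, in Gb/s.
sim_utilization = 15.83 / link_gbps   # simulation test
hw_utilization = 14.76 / link_gbps    # speed test on hardware

if __name__ == "__main__":
    print(f"Usable x4 Gen-2 bandwidth: {link_gbps:.1f} Gb/s")
    print(f"Simulation utilization: {sim_utilization:.1%}")
    print(f"Speed-test utilization: {hw_utilization:.1%}")
```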
Thank You
Contact Details
Neel Gala – [email protected]
Arjun Menon – [email protected]
Rahul Bodduna – [email protected]
Site: www.rise.iitm.ac.in/shakti/
Biography of Authors
Prof. V. Kamakoti
• V. Kamakoti currently holds the post of professor in the Department of Computer Science and Engineering, Indian Institute of Technology, Madras. He has more than 15 years of experience in computer systems development and specializes in Computer Architecture, CAD for VLSI and High Performance Computing. Professor Kamakoti holds a Master of Science degree and a doctorate in computer science and engineering from the Indian Institute of Technology, Madras. He has authored a number of research papers that have been published in various international journals and in the proceedings of many scientific conferences.
Neel T. Gala
• Neel T. Gala completed his B.Tech at NIT-Warangal in 2010 and is currently pursuing his PhD at IIT-Madras. His primary area of research is low-power approximate circuit designs and techniques. He also has a strong passion for computer architecture and digital design. During his PhD, Neel has so far published 6 papers in international conferences and journals and holds a US patent in collaboration with Texas Instruments. He has also collaborated with various government bodies on processor design and verification.
Arjun C. Menon
• Arjun C. Menon is an MS student in the Computer Science Department, IIT Madras. His primary area of research is providing hardware support for eliminating software attacks. He is one of the lead designers of the SHAKTI processors. During the course of his master's, he has been involved in designing various processors for the Government of India.
Rahul Bodduna
• Rahul Bodduna completed his B.Tech at IIT-Mandi in 2013. He has since been associated with IIT-Madras as a Project Associate contributing to the SHAKTI processors initiative. He has a strong background in designing out-of-order cores and memory management units from scratch.