OpenPiton+Ariane: The First Open-Source, SMP Linux-booting RISC-V System Scaling From One to Many Cores
Jonathan Balkind, Michael Schaffner, Katie Lim, Florian Zaruba, Fei Gao, Jinzheng Tu, Luca Benini, David Wentzlaff
Princeton University, ETH Zurich
openpiton.org pulp-platform.org
Who are we?
• Jonathan Balkind
  • Lead architect of OpenPiton
• OpenPiton Team
  • Led by Prof. David Wentzlaff
  • Princeton Parallel Research Group
  • Open source HW since 2015
  • 13 PhD students
  • 1 Postdoc
  • N undergraduates
• Michael Schaffner
  • Responsible for OpenPiton+Ariane integration
• PULP Team
  • Led by Prof. Luca Benini
  • ETHZ / Università di Bologna
  • Open source HW since 2013
  • Leaders in RISC-V development
  • Ariane dev: Florian Zaruba, Michael Schaffner and others
Support
This material is based on research sponsored by the NSF under Grants No. CNS-1823222, CCF-1217553, CCF-1453112, CCF-1823032, and CCF-1438980, AFOSR under Grant No. FA9550-14-1-0148, Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement No. FA8650-18-2-7846, FA8650-18-2-7852, and FA8650-18-2-7862 and DARPA under Grants No. N66001-14-1-4040 and HR0011-13-2-0005. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA), the NSF, AFOSR, or the U.S. Government.
Project Overview
• Collaboration between Princeton University and ETH Zurich
• Goal is to develop a permissively licensed, Linux capable manycore research platform based on RISC-V
• Based on mature, extensible designs
• Booted SMP Linux in <6 months
• The world's first open-source, SMP Linux-booting, RISC-V manycore
• Ariane
  • RV64GC Core (with extensions)
  • Linux capable
• OpenPiton
  • Manycore research platform
  • Distributed cache coherence and NoC
Ariane RV64GC Core
• Application class processor
• Written in SystemVerilog
• Linux Capable
  • Tightly integrated D$ and I$
  • M, S and U privilege modes
  • TLB, SV39
  • Hardware PTW
• Optimized for performance
  • Frequency: 1.5 GHz (22 FDX)
  • Area: ~ 175 kGE
  • Critical path: ~ 25 logic levels
• Scoreboarding
  • 6-stage pipeline
  • In-order (single) issue
  • Out-of-order write-back
  • In-order commit
• Designed for extensibility
• Branch-prediction
  • Return Address Stack (RAS)
  • Branch Target Buffer (BTB)
  • Branch History Table (BHT)
Ariane
Silicon Proven Designs: Ariane
[Figure: Poseidon layout and Kosmodrom layout. Labels: Quentin, Kerbin, Hyperdrive, Ariane, Ariane LP, Ariane HP, L2, NTX]
• Ariane taped-out in GlobalFoundries 22nm FDX twice
• 16kB instruction and 32kB data caches
• Poseidon:
  • Area: 0.23 mm² - 175 kGE
  • 0.2 - 1.7 GHz (0.5 V - 1.15 V)
• Kosmodrom:
  • RV64GCXsmallFloat, Transprecision / Vector FPU
  • Ariane HP
    • 8T library, 0.8V, 1.3 GHz
    • 55 mW @ 1 GHz
  • Ariane LP
    • 7.5T ULP library, 0.5V, 250 MHz
    • 5 mW @ 200 MHz
OpenPiton
• Open source manycore
• Written in Verilog RTL
• P-Mesh coherence scales to ½ billion cores
• Configurable core, uncore
• Simulation in VCS, ModelSim, Incisive, Verilator, Icarus
• Includes synthesis and back-end flow
• ASIC & FPGA verified
• ASIC power and energy fully characterized [HPCA 2018]
• Runs full stack multi-user Debian Linux
• Used for Architecture, Programming Language, Compilers, Operating Systems, Security, EDA research
[Figure: Tile, Chip, Chipset]
OpenPiton Tile
[Figure: tile block diagram. Components: Modified OpenSPARC T1 Core, FPU, CCX Arbiter, L1.5 Cache, L2 Cache Slice + Directory Cache, MITTS (Traffic Shaper), P-Mesh Routers (3) to other tiles]
System Overview
[Figure, built up over several slides: Tile; Chip; Chip Bridge; Chipset with P-Mesh Off-Chip Routers (3), P-Mesh Chipset Crossbars (3), DRAM, Wishbone SDHC, AXI, I/O]
Silicon Proven Designs: Piton
• 25-core
  • 2 Threads per core
  • Modified 64 bit OpenSPARC T1 Core
• 3 P-Mesh NoCs
  • 64 bit, 2D Mesh
  • Extend off-chip enabling multichip systems
• P-Mesh Directory-Based Cache System
  • 64kB L2 Cache per core (Shared)
  • 8kB L1.5 & L1 Data Caches
  • 16kB L1 Instruction Cache
• IBM 32nm SOI Process
  • 6mm x 6mm
  • 460 Million Transistors - Among largest chips built in academia
• Target: 1 GHz Clock @ 900 mV
• Received silicon and runs full-stack Debian in lab
OpenPiton+Ariane
OpenPiton+Ariane Cache Modifications
• New write-through cache subsystem with invalidations and the TRI interface
• LR/SC in L1.5 cache
• Fetch-and-op in L2 cache
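The LR/SC bullet above refers to RISC-V load-reserved / store-conditional pairs. As a reminder of the semantics the cache has to provide, here is a minimal software sketch. It is a toy model under stated assumptions, not the OpenPiton L1.5 logic: the class and method names are hypothetical, each hart holds at most one reservation, and a write by any other hart to the reserved address clears it.

```python
class ReservationModel:
    """Toy model of LR/SC reservations (hypothetical, not the L1.5 RTL)."""

    def __init__(self):
        self.memory = {}        # address -> value
        self.reservations = {}  # hart id -> reserved address

    def load_reserved(self, hart, addr):
        # LR: read the value and register a reservation for this hart.
        self.reservations[hart] = addr
        return self.memory.get(addr, 0)

    def store(self, hart, addr, value):
        # Any write clears the reservations other harts hold on this address.
        self.memory[addr] = value
        for other, reserved in list(self.reservations.items()):
            if other != hart and reserved == addr:
                del self.reservations[other]

    def store_conditional(self, hart, addr, value):
        # SC: returns 0 on success and 1 on failure (RISC-V uses zero for
        # success and a nonzero code for failure).
        if self.reservations.get(hart) != addr:
            self.reservations.pop(hart, None)
            return 1
        self.store(hart, addr, value)
        del self.reservations[hart]
        return 0


model = ReservationModel()
model.load_reserved(0, 0x80)
model.load_reserved(1, 0x80)
first = model.store_conditional(0, 0x80, 5)   # hart 0 still holds its reservation
second = model.store_conditional(1, 0x80, 7)  # hart 1 lost its reservation
```

In this sketch the second store-conditional fails because hart 0's successful write cleared hart 1's reservation, which is the property that makes LR/SC usable for lock-free updates.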
OpenPiton+Ariane Platform Support
• Bootrom auto-generated with device tree from configuration
• RISC-V Debug
  • OpenOCD + GDB
  • Bootloading
• CLINT
• PLIC
  • lowRISC rv_plic
Configurability Options

Component              Configurability Options
---------------------  ------------------------------------------------------
Cores (per chip)       Up to 65,536
Cores (per system)     Up to 500 million
Core Type              OpenSPARC T1         | Ariane 64 bit RISC-V
Threads per Core       1/2/4                | 1
Floating-Point Unit    FP64, FP32           | FP64, FP32, FP16, FP8, BFLOAT16
TLBs                   8/16/32/64 entries   | Number of entries (16 entries)
L1 I-Cache             Number of Sets, Ways (16kB, 4-way)
L1 D-Cache             Number of Sets, Ways (8kB, 4-way)
L1.5 Cache             Number of Sets, Ways (8kB, 4-way)
L2 Cache               Number of Sets, Ways (64kB, 4-way)
Intra-chip Topologies  2D Mesh, Crossbar
Inter-chip Topologies  2D Mesh, 3D Mesh, Crossbar, Butterfly Network
Bootloading            SD/SDHC Card, UART, RISC-V JTAG Debug
FPGA Prototyping Platforms
Available:
• Digilent Genesys2
  • $999 ($600 academic)
  • 1-2 cores at 66MHz
• Xilinx VC707
  • $3500
  • 1-4 cores at 60MHz
• Digilent Nexys Video
  • $500 ($250 academic)
  • 1 core at 30MHz
In progress:
• Xilinx VCU118, BittWare XUPP3R
  • $7000-8000
  • >100MHz
• Amazon AWS F1
  • Rent by the hour
  • 1-N cores
  • Live demo at tomorrow's tutorial!
Roadmap
• Testing:
  • Memory consistency testing with litmus/herd/diy
  • Randomised testing with riscv-torture, RISC-V DV
• Bootloading:
  • OpenSBI
  • U-Boot/Coreboot/...
• Debian/Fedora Linux distro
• Performance enhancements
  • Multi-level TLBs
  • Branch Prediction improvements*
  • Increasing TRI/L1.5 line size
  • Multi-issue Ariane*
• Open backend flow for Ariane
• Tapeouts!
* GSoC project with FOSSi Foundation
Boot SMP Linux Today!
• Clone from:
  • https://github.com/PrincetonUniversity/openpiton
  • Simulation with Modelsim, VCS, Verilator
  • FPGA implementation with Vivado 2018.2 or newer
• RV64GC Demo
  • 2 cores on Genesys2 at 66MHz
  • Play Tetris, browse the web!
• Tutorial tomorrow afternoon! (in this room)
  • Hands-on with Verilator simulation
  • Boot SMP Linux on FPGA
  • http://openpiton.org/ISCA19_tutorial.html
QUESTIONS?
@OpenPiton http://openpiton.org
@pulp_platform http://pulp-platform.org
Balkind and Schaffner, et al.
Table 3: Some of the supported FPGA build configurations. Both cores have the same default cache configuration (see Table 1). The results have been generated with Vivado 2018.2, using OpenPiton r11 / Ariane v4.1 including additional development patches that will be part of upcoming releases.

  Clock  Config  Core Type      FPU    LUTs         Registers   RAM Tiles  DSPs
  [MHz]  X × Y                  [y/n]  [k]          [k]         [#]        [#]

Digilent Nexys Video (Artix 7, 7a200tsbg484)
  30     1 × 1   Ariane         no     95 (71%)     72 (27%)    66 (18%)   16 (2%)
  30     1 × 1   Ariane         yes    110 (82%)    75 (28%)    66 (18%)   27 (4%)
  30     1 × 1   OpenSPARC T1   yes    115 (86%)    96 (36%)    59 (16%)   13 (2%)

Digilent Genesys2 (Kintex 7, 7k325tffg900-2)
  67     1 × 1   Ariane         no     86 (42%)     72 (17%)    66 (15%)   16 (2%)
  67     1 × 1   Ariane         yes    99 (49%)     75 (18%)    66 (15%)   27 (3%)
  67     1 × 1   OpenSPARC T1   yes    105 (52%)    91 (22%)    59 (13%)   16 (2%)
  67     2 × 1   Ariane         no     141 (69%)    113 (28%)   124 (28%)  16 (4%)
  67     2 × 1   Ariane         yes    167 (82%)    120 (30%)   124 (28%)  54 (6%)
  67     2 × 1   OpenSPARC T1†  yes    160 (79%)    137 (33%)   112 (25%)  32 (4%)

Xilinx VC707 (Virtex 7, 7vx485tffg1761-2)
  60     1 × 1   Ariane         no     99 (33%)     73 (12%)    63 (6%)    16 (<1%)
  60     1 × 1   Ariane         yes    114 (37%)    77 (13%)    63 (6%)    27 (1%)
  60     1 × 1   OpenSPARC T1   yes    119 (39%)    97 (16%)    53 (5%)    16 (<1%)
  60     2 × 2   Ariane         no     284.1 (94%)  202 (33%)   237 (23%)  64 (2%)
  60     3 × 1   Ariane         yes    268 (88%)    169 (28%)   179 (17%)  81 (3%)
  60     3 × 1   OpenSPARC T1†  yes    255 (84%)    208 (34%)   158 (15%)  48 (2%)

Xilinx VCU118 (Virtex US+, xcvu9pflga2104-2L)
  100    1 × 1   Ariane         no     90 (8%)      81 (3%)     88 (4%)    19 (<1%)
  100    1 × 1   Ariane         yes    103 (9%)     84 (4%)     89 (4%)    30 (<1%)
  100    1 × 1   OpenSPARC T1   yes    108 (9%)     100 (4%)    79 (4%)    19 (<1%)
  100    4 × 4   Ariane         no     923 (78%)    704 (30%)   963 (45%)  259 (4%)
  100    4 × 2   Ariane         yes    583 (49%)    399 (17%)   495 (23%)  219 (3%)

† Without Coherence Domain Restriction [8] in caches.
• CLINT: The Core Local Interrupt Controller (CLINT) provides Inter Processor Interrupts (IPI) and a common time-base. Each core has its own timer compare register which triggers an external timer interrupt when it matches the global time-base.
• PLIC: The Platform Level Interrupt Controller (PLIC) is an interrupt controller which manages external peripheral interrupts. It provides a context for each privilege level and core. The software can configure different priority thresholds for each context. The PLIC is still subject to official standardisation. However, there is already an agreed-upon implementation, including a Linux driver.
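The CLINT behaviour described above can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the OpenPiton+Ariane RTL: the class name is hypothetical, the register names `mtime`, `mtimecmp` and `msip` follow the RISC-V privileged specification, the timer condition uses that specification's "greater than or equal" rule, and the reset value of the compare registers is an assumption.

```python
class ClintModel:
    """Toy behavioural model of a CLINT (hypothetical, for illustration)."""

    def __init__(self, num_harts):
        self.mtime = 0                            # global time-base shared by all harts
        self.mtimecmp = [2**64 - 1] * num_harts   # per-hart compare (assumed reset value)
        self.msip = [False] * num_harts           # per-hart software interrupt (IPI) bits

    def tick(self, cycles=1):
        # Advance the common time-base, wrapping at 64 bits.
        self.mtime = (self.mtime + cycles) % 2**64

    def timer_pending(self, hart):
        # The timer interrupt is pending once the time-base reaches the compare value.
        return self.mtime >= self.mtimecmp[hart]

    def send_ipi(self, hart):
        # Another hart writes this bit to raise an inter-processor interrupt.
        self.msip[hart] = True

    def clear_ipi(self, hart):
        # The target hart clears the bit in its interrupt handler.
        self.msip[hart] = False


clint = ClintModel(num_harts=2)
clint.mtimecmp[1] = 10
clint.tick(9)
before = clint.timer_pending(1)
clint.tick()
after = clint.timer_pending(1)
clint.send_ipi(0)
```

Software typically re-arms the timer by writing a new value to its own compare register, which also clears the pending interrupt in this model.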
2.5 Automatic Device Tree Generation
In order to capture the different platform configurations that OpenPiton+Ariane provides, we added an automatic device tree generation script to the PyHP preprocessor from OpenPiton. This script parses an XML description of the system address map and platform peripherals (which is also used to generate the chipset crossbar), and together with the information about the number of cores and the clock frequency it generates a device tree that is compiled into a bootrom attached to the peripheral space. The "zero-stage" bootloader stored in that bootrom initialises the cores and loads a pointer to the device tree blob into register a1 as per RISC-V convention. With this automatic device tree generation, the same Linux image can be booted on differently parameterised instances, automatically adapting to the platform at runtime.
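The generation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the OpenPiton script: the XML layout, the addresses, the function names and the emitted properties are all hypothetical, and the output is a simplified device tree source fragment rather than a bootable one.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout and addresses, for illustration only.
EXAMPLE_XML = """
<devices>
  <port><name>mem</name><base>0x80000000</base><length>0x40000000</length></port>
  <port><name>uart</name><base>0xfff0c2c000</base><length>0x1000</length></port>
</devices>
"""


def cells(value):
    """Split a 64-bit value into two 32-bit device tree cells."""
    return f"0x{value >> 32:08x} 0x{value & 0xFFFFFFFF:08x}"


def generate_dts(xml_text, num_cores, clock_hz):
    """Emit a simplified device tree source string for the given platform."""
    out = ["/dts-v1/;", "/ {", "  #address-cells = <2>;", "  #size-cells = <2>;"]
    out += ["  cpus {", "    #address-cells = <1>;", "    #size-cells = <0>;"]
    # One cpu node per core, parameterised by the core count and clock.
    for hart in range(num_cores):
        out += [f"    cpu@{hart} {{", '      device_type = "cpu";',
                f"      reg = <{hart}>;",
                f"      clock-frequency = <{clock_hz}>;", "    };"]
    out.append("  };")
    # One node per entry in the address map.
    for port in ET.fromstring(xml_text).findall("port"):
        name = port.findtext("name")
        base = int(port.findtext("base"), 16)
        length = int(port.findtext("length"), 16)
        out += [f"  {name}@{base:x} {{",
                f"    reg = <{cells(base)} {cells(length)}>;", "  };"]
    out.append("};")
    return "\n".join(out)


dts = generate_dts(EXAMPLE_XML, num_cores=2, clock_hz=66_000_000)
print(dts)
```

Because the core count, clock and address map are inputs, changing the configuration changes only the generated tree, which is what lets the same Linux image adapt to differently parameterised instances.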
3 SIMULATION & EMULATION PLATFORMS
Ariane plugs into the sims simulation infrastructure provided in OpenPiton. This handles the building of simulation models with each of the supported simulators (at present, Mentor QuestaSim, Synopsys VCS and Verilator), as well as running one test or an entire test suite against the compiled model. We have enhanced sims to support compilation of RISC-V assembly and C tests, and the direct use of pre-compiled binaries. The primary bare-metal test suite is the publicly available riscv-tests repository [20]. Beyond bare-metal testing, we also simulate Linux boot for debugging, which takes approximately 4 days to boot for a single core (DRAM reduced to 128MB to speed up the memory initialisation phase in simulation).
3.1 FPGA Flows
The Ariane core option has been integrated into the OpenPiton protosyn build flow and is available for the Digilent Nexys Video and Genesys2 boards, as well as the Xilinx VC707 and VCU118 development boards. The resource consumption of a set of builds with the standard cache configuration and different numbers of cores is shown in Table 3. Since the Ariane FPU pipeline registers have not been optimised for FPGA mapping, enabling the FPU will result in a somewhat lower core clock frequency. The LUT distribution for single-core Genesys2 builds is shown in Figure 3. The core amounts to around 22%-41% of the total resources, depending on the actual configuration (Ariane with or without FPU, OpenSPARC T1 with FPU). Further, we note that the T1 is around 23% and 93% larger than Ariane with and without FPU, respectively. This area