Post on 22-Jan-2018
Tackling 400 MHz Timing-Closure for 25/50/100 GbE
Shep Siegel Atomic Rules LLC
1 ©2015 Atomic Rules LLC
Shepard.Siegel@atomicrules.com Tackling 400 MHz Timing Closure 2015-09-22
Introduction Transcript 1/3
• “I don’t know of any reason why you would have (timing closure) issues with the V-US fabric at 400 MHz, why don’t you try it and see how it goes?” – Gordon Brebner
(personal correspondence, Fall 2014)
Introduction Transcript 2/3
• “Sure, sounds great, we are putting our best engineers right on it. The 25/50/100 GbE work you are doing sounds exciting!” – Shep Siegel
(personal response to Gordon, Fall 2014)
Introduction Transcript 3/3
• { Sound of Impact } – Unknown
(somewhere in Vivado 2015.1, January 2015)
Architecture Matters!
• This talk teases Fmax and Timing Closure
• It is really about how to avoid getting painted into that problem-corner in the first place
• And that requires good Architecture
Why Architecture?
• Architecture impacts many aspects of design; timing closure is but one of them
• Architectural choices are Strategic – Expensive if you get it wrong
• Domain-Specific Languages (DSLs) make Architectural Investigation easier than ever
What Architecture?
• In what way do you wish to express the design?
  – World of choice of DSLs, legacy RTLs
  – You can mix and match with Vivado IPI
• What choices are you making?
  – Language Choices
    • ‘C’ or other imperative expression
    • Your DSL of choice
  – What is Appropriate?
    • “Pick and Shovel” – Sometimes a legacy RTL is just fine
  – Device-Centric, Structural Choices
    • MSLICE vs. LSLICE
    • CARRY8 vs. DSP48E2
    • Distributed RAM vs. BRAM
• Are you aware that you are viewing the problem as top-down, bottom-up, or middle-out?
How Architecture?
• All this can be overwhelming
• Suggest a divide-and-conquer approach – Iterative Refinement is one way
• Don’t delay, start experimenting at once! – Small Failures often yield rich insights
This Talk – Problem Statement
• The 20 nm UltraScale fabric is fast
• 25/50/100 GbE suggests a natural ~400 MHz
  – Area and Cost concerns to keep packet data paths as narrow and occupied as is practical
• But 400 MHz in a V-US-2 is challenging
  – What can we do to close timing?
  – How do we avoid negative setup slack?
  – How did we close timing with 25 GbE UDP/IP?
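The ~400 MHz figure follows from simple arithmetic: a 25 Gb/s stream carried on the 8-byte (64-bit) data path mentioned later in this talk needs 25e9 / 64 ≈ 390.6 MHz. The Python sketch below reproduces that calculation. The function name and the 16- and 32-byte widths used for 50 and 100 GbE are illustrative assumptions, not figures from the slides, and the model ignores line-coding and framing overhead.

```python
def fabric_clock_mhz(line_rate_gbps: float, datapath_bytes: int) -> float:
    """Clock rate (MHz) needed to carry line_rate_gbps on a data path
    that moves datapath_bytes every cycle. Overheads are ignored."""
    bits_per_cycle = datapath_bytes * 8
    return line_rate_gbps * 1e3 / bits_per_cycle


# (line rate in Gb/s, data-path width in bytes).
# Only the 8-byte width comes from the talk; the wider ones are assumed.
for rate_gbps, width_bytes in [(10, 8), (25, 8), (50, 16), (100, 32)]:
    mhz = fabric_clock_mhz(rate_gbps, width_bytes)
    print(f"{rate_gbps:>3} GbE on {width_bytes:>2} bytes: {mhz:.3f} MHz")
```

Under these assumptions, doubling the width each time the line rate doubles keeps the fabric clock at the same ~390 MHz, which is one way to read "a natural ~400 MHz" for all three rates.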
Olivebridge Manifesto
• AR First-Mover/Early-Adopter in 25 GbE IP – Have Product ready to meet market needs
• L2/L3/L4 Packet Processing at Line Rates – Ethernet 802.3 / Internet Protocols at 25 Gbps
• 400 MHz Fabric Operation on 20 nm FPGAs – Requires Specialized Circuit and Physical Design
• 2.5x Under-Clocking for 10 GbE on 28 nm – Broader Market While 25 Gb Adoption Grows
Olivebridge Focus
• UDP/IP Datagram Service for 10 and 25 GbE
  – Well-defined function and interfaces
  – Serve existing customer needs in 10 GbE space
  – Be the early-to-market in nascent 25 GbE space
  – Be well positioned with 50/100 GbE variations
• L2 802.3 Packet Validation
  – Initially use FPGA Vendor MAC/PCS/PMA/PHY IP
  – Self-Synchronizing Generators, Mungers, Checkers
Olivebridge Platforms (partial)
• BittWare A5PL – 28 nm Altera Arria V GZ
• BittWare S5PEDS – 28 nm 2x Altera Stratix V
• Xilinx ZC706 – 28 nm Xilinx Zynq-7
• Xilinx KCU105 – 20 nm Xilinx Kintex-UltraScale (K-US)
• Xilinx VCU107 – 20 nm Xilinx Virtex-UltraScale (V-US)
• BittWare Jasper – 20 nm Xilinx Virtex-UltraScale (V-US)
• BittWare Mustang – 20 nm Xilinx Kintex-UltraScale (K-US)
• BittWare A10PS4 – 20 nm Altera Arria 10 (vaporware)
• Xilinx TBD – 16 nm Xilinx Virtex-UltraScale+ (V-US+)
Where to Start?
• Business Unit identified the 25 GbE UDP/IP baked into the Olivebridge Manifesto – Made a business case for investment
• We started sketching top-down, but know from experience that bottom-up will come into play
25 GbE UDP/IP 1/2
• We bet that the 28 Gb GTY would be a game-changer
• We knew that we didn’t have the depth or resources of the Sarance team at Xilinx
  – Would rather buy than build the PMA/PCS/MAC
  – Ride Xilinx’ coat-tails of silicon, tools, IP
• We wanted our first IP offering to be unambiguous in function, but distinctive in positioning
  – UDP/IP is ubiquitous (also the market need)
  – 8B data paths at 400 MHz are not common (yet)
Application Drives Architecture 1/2
• Facts on the ground were that we were going to use the Xilinx/Sarance PHY/PMA/PCS/MAC stack. – Allowed us to quickly kick off a “show me” demo of 25 GbE
• From L2 down to the wire, we had to trust Xilinx
Application Drives Architecture 2/2
• But from the MAC L2 interface on up, the freedom to innovate was in our hands – Our triumph if our choices are good – Our failure if our choices are bad
• IPI enables heterogeneous architectures within a single application – One approach does not need to fit all!
Off to Work
• We love our DSLs!
• Away we code in Bluespec SystemVerilog (BSV), and in a few weeks we have functional sim of key elements
  – ARP Cache
  – Segmentation and Reassembly
  – IGMP Join/Leave Machinery
  – PCAP files as sinks and sources of packet streams
Architecture drives Implementation
• As good and bad ideas about architecture anneal, it is a good time not to lose sight of basic facts on the ground
• Remember
  – Architecture is the key Fmax driver
  – Experiment early and often
  – Otherwise you may become a victim of the tools
No Magic – Just Sound Practice
• Mature 20 nm Silicon – Study the data sheets, use DocNav
• Mature FPGA CAD Tools (e.g. Vivado 2015.x) – Run out-of-context builds early and often
• Mature Engineers – Frequency Scaling ended a decade ago
Crawl, Walk, Run
• Observing your code run CORRECTLY in Verilog sim is a valuable pre-condition for architectural innovation! – You can then automate the tests, so you can watch your innovation break the regressions, then refine your innovation to be correct
• Functional-Correctness First, Performance Correctness Iteratively
Synthesis, Out-of-Context
• Software Engineers say “compile early and often”
• Circuit Designers can do the same by running Vivado Synthesis, out-of-context, on RTL circuit fragments (sub-modules) – Getting feedback in minutes as to the approximate area and Fmax of a module is one of the most-exploitable objective measures at your disposal
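As a rough illustration of that loop, the Python sketch below writes a small Vivado Tcl script that synthesizes one sub-module out-of-context against a 2.5 ns clock and dumps utilization and timing reports. The module name `mkArpCache`, the file name, the part string, and the clock port name `CLK` are placeholders assumed for the example, not details from the talk; substitute your own.

```python
def ooc_synth_tcl(rtl_files, top, part, clock_port="CLK", period_ns=2.5):
    """Return the text of a Vivado Tcl script that runs out-of-context
    synthesis on one module and reports area and timing."""
    lines = [f"read_verilog {path}" for path in rtl_files]
    lines += [
        # Out-of-context mode synthesizes the module without I/O buffers.
        f"synth_design -top {top} -part {part} -mode out_of_context",
        # Constrain the module clock so the timing report is meaningful.
        f"create_clock -name {clock_port} -period {period_ns} "
        f"[get_ports {clock_port}]",
        f"report_utilization -file {top}_util.rpt",
        f"report_timing_summary -file {top}_timing.rpt",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical module, file, and part names, for illustration only.
script = ooc_synth_tcl(
    rtl_files=["mkArpCache.v"],
    top="mkArpCache",
    part="xcku040-ffva1156-2-e",
)
print(script)
```

Saved to a file, a script like this could be run with `vivado -mode batch -source <file>.tcl`. Post-synthesis timing is only an estimate, so the numbers are best used to compare architectural alternatives rather than to sign off.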
What Happened?
• Our first architectural choices gave us a taste of correct function, but missed timing miserably. – We had over 50% negative setup slack (negative setup slack of more than 1.25 ns on a 2.5 ns timing arc)
• Panic or Progress – Progress of course! – We call this a “Happy Mistake”
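To put that miss in frequency terms, a common rule of thumb is that the worst path needs roughly the clock period minus the worst setup slack. The Python sketch below applies that approximation to the numbers on this slide. It ignores clock skew and uncertainty, and the function name is an illustrative assumption.

```python
def achievable_fmax_mhz(period_ns: float, worst_slack_ns: float) -> float:
    """Approximate Fmax (MHz) from the target period and the worst setup
    slack. Negative slack means the worst path is longer than the period."""
    worst_path_ns = period_ns - worst_slack_ns
    return 1e3 / worst_path_ns


target_period_ns = 2.5   # 400 MHz target
worst_slack_ns = -1.25   # figure quoted on the slide

fmax = achievable_fmax_mhz(target_period_ns, worst_slack_ns)
miss = abs(worst_slack_ns) / target_period_ns
print(f"worst path ~{target_period_ns - worst_slack_ns:.2f} ns, "
      f"Fmax ~{fmax:.1f} MHz, slack miss {miss:.0%}")
```

By this estimate the first architecture would have closed somewhere near 267 MHz rather than 400 MHz, which is too far off for tool settings alone to recover.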
Two Paths
• Our initial architecture was coming up short at 25 GbE (400 MHz). The architecture-lead handed the design over to the implementation-lead with a simple task:
  – Under-clock the RTL IP at 156.25 MHz instead of 400 MHz to realize the same function at 10 GbE, not 25 GbE
• The architecture-‐lead went back to looking at topologies with 6 or fewer levels of 6LUTs
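A back-of-the-envelope view of why the logic-level count matters: dividing the clock period evenly across the LUT levels gives the delay each level (LUT plus routing) can afford. The Python sketch below does this for the two clock rates on this slide. The even split and the zero default for flip-flop and clock overhead are simplifying assumptions, and the function names are illustrative.

```python
import math


def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1e3 / freq_mhz


def per_level_budget_ns(period: float, levels: int,
                        overhead_ns: float = 0.0) -> float:
    """Delay each logic level may use if the period, less a fixed
    overhead (clock-to-out, setup, skew), is split evenly."""
    return (period - overhead_ns) / levels


def levels_that_fit(period: float, per_level_ns: float,
                    overhead_ns: float = 0.0) -> int:
    """Whole logic levels that fit in a period at a given per-level delay."""
    return math.floor((period - overhead_ns) / per_level_ns)


budget = per_level_budget_ns(period_ns(400.0), levels=6)
print(f"400 MHz, 6 levels: ~{budget:.3f} ns per LUT level incl. routing")

slow_levels = levels_that_fit(period_ns(156.25), budget)
print(f"156.25 MHz at the same per-level delay: {slow_levels} levels fit")
```

Under these assumptions each level gets only about 0.42 ns at 400 MHz, while the 10 GbE clock tolerates far deeper logic, which is consistent with the same RTL working when under-clocked.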
10 GbE Success
• Before we settled on the correct architecture choices for 25 GbE, we had a functional demo at 10 GbE to show our stakeholders
• And since we had yet to make some of the hard architectural decisions, we had parallelized some of the architecture work with the implementation work
400 MHz or Bust
• As much as this talk stresses architectural innovation at the early stage, it’s worthwhile to run through P&R to be sure – We found physical design issues, essentially independent of the architectural choices
– Addressing some of them made reasoning about the architecture choices easier
– In the end the numbers from the backend of Vivado are the ones that count
Elastic Pipelines and Atomic Rules
• Unsurprisingly, one of our go-to design patterns is elastic pipelines that are produced and consumed by atomic rules
  – In the end we achieved our throughput and area goals by adding some latency jitter through the use of a cascade of shallow (2 deep) FIFOs implemented out of fabric, distributed RAM, or SRL16/32s.
  – Since “lowest latency” was not one of the current requirements, we stopped short of creating a static schedule.
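The behavior of such a cascade can be sketched with a small cycle-based model in Python. This is a behavioral illustration only, not the BSV design: each stage decides whether to fire from the FIFO occupancies registered at the start of the cycle, so there is no combinational path from a downstream dequeue to an upstream enqueue. The function name, its parameters, and the random consumer stall are assumptions made for the example.

```python
import random
from collections import deque


def simulate(n_items, n_stages, stall_prob=0.0, depth=2, seed=0):
    """Push n_items through a cascade of n_stages FIFOs.
    Returns (items in arrival order, cycles taken)."""
    rng = random.Random(seed)
    fifos = [deque() for _ in range(n_stages)]
    received = []
    next_item = 0
    cycles = 0
    while len(received) < n_items:
        cycles += 1
        # Every rule looks only at the state registered at the clock edge.
        occupancy = [len(f) for f in fifos]
        # Consumer rule: take from the last FIFO unless stalled this cycle.
        if occupancy[-1] > 0 and rng.random() >= stall_prob:
            received.append(fifos[-1].popleft())
        # Transfer rules: move one item from FIFO i-1 to FIFO i.
        for i in range(n_stages - 1, 0, -1):
            if occupancy[i - 1] > 0 and occupancy[i] < depth:
                fifos[i].append(fifos[i - 1].popleft())
        # Producer rule: inject a new item when the first FIFO has room.
        if next_item < n_items and occupancy[0] < depth:
            fifos[0].append(next_item)
            next_item += 1
    return received, cycles


for label, kwargs in [("2-deep, no stalls", {}),
                      ("2-deep, 30% consumer stalls", {"stall_prob": 0.3}),
                      ("1-deep, no stalls", {"depth": 1})]:
    items, cycles = simulate(n_items=1000, n_stages=4, **kwargs)
    in_order = items == list(range(1000))
    print(f"{label}: {cycles} cycles, in order = {in_order}")
```

In this model the 2-deep FIFOs sustain one item per cycle when nothing stalls, and a stalled consumer changes only how long items wait, which is the latency jitter mentioned above. With `depth=1` the same registered-state rule cuts throughput roughly in half, which is one reason the second entry is worth its area.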
Hoplite to the Rescue
• We’ve been closely following the work of Jan Gray and Nachiket Kapre on their Hoplite NoC
  – Best paper award at FPL 2015
  – Architecture-driven design of an austere NoC
  – Spatial Programming taken to an extreme
  – Harmonizes with other 400 MHz IPs (Olivebridge)
KU040-2, 4x6, 2.4 ns, 6600 LUTs (2.7%), 16K DFF (3.4%), 0-injection latency 31 ns, 100 Gb/node
Image courtesy of Jan Gray. See fpga.org/hoplite
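The figures quoted for this Hoplite build can be turned into a few derived numbers. The Python sketch below does so under two assumptions that are not stated on the slide: that "4x6" means a 4 by 6 array of 24 routers, and that "100 Gb/node" is the per-node bandwidth at the 2.4 ns clock. The variable names are illustrative.

```python
# Figures quoted on the slide.
clock_period_ns = 2.4
total_luts = 6600
zero_injection_latency_ns = 31.0
node_bandwidth_gbps = 100.0

# Assumption: "4x6" is a 4 by 6 array of routers.
routers = 4 * 6

clock_mhz = 1e3 / clock_period_ns
luts_per_router = total_luts / routers
latency_cycles = zero_injection_latency_ns / clock_period_ns
# Bits that must move per cycle to reach the quoted per-node bandwidth.
bits_per_cycle = node_bandwidth_gbps * clock_period_ns

print(f"clock          ~{clock_mhz:.1f} MHz")
print(f"LUTs/router    ~{luts_per_router:.0f}")
print(f"latency        ~{latency_cycles:.1f} cycles")
print(f"bits per cycle ~{bits_per_cycle:.0f}")
```

Read this way, the NoC clocks slightly above the 400 MHz target discussed in this talk at a few hundred LUTs per router. The slide's bandwidth figure is rounded, so the bits-per-cycle value is only a lower-bound estimate of the link width.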
Conclusions and Summary
• 400 MHz fabric logic is achievable in -2 grade 20 nm V/K-US with plenty of up-front thinking
• Failure to understand the timing costs early can impact the schedule, or stop the show!
• Vivado empowers the designer at several levels to successfully reach these goals
  – In IPI to allow heterogeneous, best-tools DSLs
  – In Synthesis to get timing and area early and often
  – In P&R to, more times than not, better Synthesis
About Atomic Rules
• Digital Systems Consultancy based in Auburn, NH
• Specializing in FPGA Programming, Systems Integration for Commercial and Defense
• In business for 7 years – Actively recruiting new engineers
Lead Talent
• Shep Siegel (CTO and Founder): – ex-Mercury Systems, ex-Datacube, ex-Ampex, author and speaker, graduate of RIT, Senior Member IEEE, Senior Member ACM
• David Wright (VP Strategy): – ex-IBM, ex-Perot Systems, ex-Datacube, graduate of UNH, Agile methods ScrumMaster
Lead Talent (continued)
• John Miller (Embedded Software):
  – RDMA, ARM, POSIX, RTOS, Ethernet, PCIe, Linux KMD
  – 20 Years Bridging H/W and S/W for Embedded
  – John.Miller@atomicrules.com
• Hadar Agam (EE/CS Digital Design):
  – Complex Concurrency, Bluespec SystemVerilog, Digital Design
  – 20 Years of industry-leading design innovation
  – Hadar.Agam@atomicrules.com
• Ed Czeck (EE/CS Digital Design):
  – Complex Concurrency, Functional Programming, Digital Design
  – 20 Years of industry-leading design innovation
  – Ed.Czeck@atomicrules.com
Lead Talent (continued)
• Bach Long (EE/CS Digital Design):
  – Verilog, BSV, Deep Quartus/Vivado
  – 10 Years Reconfigurable Computing
  – Bach.Long@atomicrules.com
• Aaron Severance (EE/CS Digital Design):
  – Vector Processing, Verilog, System Design
  – Recent UBC PhD, VectorBlox Co-Founder
  – Aaron.Severance@atomicrules.com
• Steve Gabriel (EE/DSP/Math Control/Signal System Design):
  – Quad8, Evans&Sutherland, Ampex, Microsoft
  – Quaternion rotations and Galois fields
  – Steve.Gabriel@atomicrules.com
About Atomic Rules 1/2
• A typical engagement is to create codes to operationalize a platform, create/refresh an application, or both
• We sell rights to use the IP we own
• We sell IP and product development services
• We sell support services around code we write
About Atomic Rules 2/2
• We know Signals and Systems
• We know Complex Concurrency
• We know Middleware and IP Integration
to offer our clients…
• “More with More”, in less time
• Fewer defects: Correct-by-Construction
• Greater productivity, reduced time-to-solution
• Reduced cost of reuse and tech refresh
Core Beliefs and Axioms
• Separation of Concerns
• Divide and Conquer
• Automate or Die
• Write Things Once
• Interface Before Implementation
• Functional Correctness First
• Components Must Compose
• Components Work as Expected
• IP Should be Portable, Vendor-Agnostic if possible
Agile and Iterative
• Rapidly “Going Deep” is highly valued – Proof points that can be seen and measured
• Agile and Iterative design
  – Achieve Functional Correctness quickly
  – Achieve Performance Correctness iteratively
Client Roster (partial list)
• BAE Systems
• CSPi / Myricom
• DRS Technologies
• Maxim Integrated
• Mercury Federal Systems
• Skreens Entertainment
• Stanford University
• US Air Force (AFRL)
• Xilinx
Partner Roster
• 25G / 50G Ethernet Consortium
• 25-50-100 Ethernet Alliance
• Accellera/OCP-IP Community Member
• ARM Connected Community Member
• BittWare Solution Partner
• Bluespec Technology Partner
• MathWorks Connections Partner
• NetFPGA Infrastructure Developer
• OpenCPI Infrastructure Developer
• P4 Language Consortium Member
• PCI-SIG Member
• VITA Trade Association Member
• Xilinx Alliance Member Partner
Intern Program
Atomic Rules Intern Emeritus, U-Ark MSCE Graduate (2013), Christina Smith