University of California
Santa Barbara

Programming Language Techniques for Improving ISA and HDL Design

A dissertation submitted in partial satisfaction of the requirements for the degree

Doctor of Philosophy in Computer Science
by
Michael Alexandre Christensen
Committee in charge:
Professor Ben Hardekopf, Chair
Professor Timothy Sherwood
Professor Jonathan Balkind
December 2021
The Dissertation of Michael Alexandre Christensen is approved.
Professor Timothy Sherwood
Professor Jonathan Balkind
Professor Ben Hardekopf, Committee Chair
December 2021
Programming Language Techniques for Improving ISA and HDL Design
Copyright © 2021
by
Michael Alexandre Christensen
Dedicated to my wife, Crystal, and my children, Brooklyn, Rhys, and Ira, for the joy they bring to my life.
Acknowledgements
The past six years have been the best of my life. They were also a lot of work, and I
have many people to thank for helping me get here:
Ben Hardekopf For teaching me how to do research, giving me the freedom to discover
my interests, and being supremely patient as I had three children along the way.
Tim Sherwood For your unparalleled advice and encouragement. You helped me
believe that my ideas were good and that I could finish my PhD.
Jonathan Balkind For your invaluable insights into where the field is and where it’s
headed. You expanded my vision of what’s possible when we use PL to solve
computer architecture problems.
Rich Wolski For your incredible passion for systems. I’m grateful to have been able to
learn from, TA for, and work with you during the first years of my PhD.
Joseph McMahan For being a great role model and even greater friend. I feel extremely
lucky to have been able to work alongside you on Zarf starting my first year—
thanks for everything.
Kyle Dewey For being a mentor with a work ethic like no other. You were an anchor
and example, always willing to lend a commiserating ear through the lows and
rejoice in the highs of my PhD experience.
George Tzimpragos For all the time you spent teaching me about the world of su-
perconductor electronics. Your creativity and the depth and breadth of your
knowledge continually amaze me.
Mehmet Emre For your enviable ability to grok it all. We never have a conversation
where I don’t learn something new from you. Project Neptune forever.
Lawton Nichols For teaching me the importance of consistency and perseverance. You
motivate me to continue running, learning, and trying new things.
Miroslav (Mika) Gavrilov For being the sweetest, most caring person I know. You’re
still Uncle Mika to my children, and they now know the cosmic entities of Lovecraft
(and their alphabet) thanks to you.
The PL Lab Harlan Kringen, Zach Sisco, Madhukar Kedlaya, Davina Zamanzadeh, and
everyone not already mentioned, past and present, for the energy and creativity
you brought to the lab.
The Arch Lab Deeksha Dangwal, Jennifer Volk, Weilong Cui, and everyone not already
mentioned, past and present, for helping a PL person feel welcome and for guiding
me through the details.
John Shalf and George Michelogiannakis For the opportunity to work on the inter-
esting computer architecture problems of superconductor electronics.
Christophe Giraud-Carrier, Scott Burton, and Ryan Farrell For introducing me to re-
search and guiding my decision to attend graduate school.
Janis Smith For being the best mother-in-law anyone could ask for.
My Parents For your examples of hard work and for the many sacrifices you made that
enabled me to pursue my passions.
Crystal For your unwavering love and support through all of this. You’re my best
friend and continual source of strength, and I’m grateful to get to navigate this
life with you.
Curriculum Vitæ
Michael Alexandre Christensen
Education
2015 – 2021    Ph.D. in Computer Science, University of California, Santa Barbara
               • M.S. in Computer Science awarded in March 2021
2007 – 2013    B.S. in Computer Science, Brigham Young University, Provo
               • Minor in Mathematics
Courses TA’d
Winter 2020    UCSB CS 64: Computer Organization and Logic Design
Fall 2019      UCSB CS 170: Operating Systems
Spring 2019    UCSB CS 138: Formal Languages and Automata
Winter 2017    UCSB CS 162: Programming Languages
Winter 2016    UCSB CS 162: Programming Languages
Fall 2015      UCSB CS 64: Computer Organization and Logic Design
Winter 2013    BYU CS 478: Machine Learning and Data Mining
Experience
2015 – 2021    Research Assistant, UCSB (Programming Languages Lab)
2020 – 2021    Research Assistant, Lawrence Berkeley National Lab (Computer Architecture Group)
Summer 2020    Software Engineer Intern, Facebook (Hack Language Team)
Summer 2019    Software Engineer Intern, Facebook (Disaster Recovery Team)
2014 – 2015    Software Engineer II, Dell EMC (NetWorker Team)
2013           Research Assistant, BYU Data Mining Lab
Summer 2012    Software Engineer Intern, SerialTek
Publications

Michael Christensen, Georgios Tzimpragos, Harlan Kringen, Jennifer Volk, Timothy Sherwood, Ben Hardekopf. “PyLSE: A Pulse-Transfer Level Language for Superconductor Electronics,” Under review.

Michael Christensen, Timothy Sherwood, Jonathan Balkind, Ben Hardekopf. “Wire Sorts: A Language Abstraction for Safe Hardware Composition,” Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI), June 2021. Virtual, Canada.
Michael Christensen, Joseph McMahan, Lawton Nichols, Jared Roesch, Timothy Sherwood, Ben Hardekopf. “Safe Functional Systems through Integrity Types and Verified Assembly,” Theoretical Computer Science, 2021.

Joseph McMahan, Michael Christensen, Kyle Dewey, Ben Hardekopf, Timothy Sherwood. “Bouncer: Static Program Analysis in Hardware,” Proceedings of the 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), June 2019. Phoenix, AZ, USA.

Joseph McMahan, Michael Christensen, Lawton Nichols, Jared Roesch, Sung-Yee Guo, Ben Hardekopf, and Timothy Sherwood. “An Architecture for Analysis,” IEEE Micro: Top Picks from the 2017 Computer Architecture Conferences (IEEE Micro – Top Picks), vol. 38, no. 3, pp. 107–115, May/June 2018.

Joseph McMahan, Michael Christensen, Lawton Nichols, Jared Roesch, Sung-Yee Guo, Ben Hardekopf, and Timothy Sherwood. “An Architecture Supporting Formal and Compositional Binary Analysis,” Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2017. Xi’an, China.
Awards and Service
June 2021      Outstanding Teaching Assistant, Computer Science Department, UCSB (2020 – 2021 Academic Year)
June 2016      Student Volunteer Co-Captain, PLDI
2008 – 2010    Volunteer Church Representative, The Church of Jesus Christ of Latter-day Saints (San José, Costa Rica)
Abstract
Programming Language Techniques for Improving ISA and HDL Design
by
Michael Alexandre Christensen
Despite all the effort spent in testing, analyzing, and formally verifying software,
a program is ultimately only as correct as the underlying hardware on which it runs.
As processors become more performant, their microarchitectures become increasingly
complex; this complexity often manifests in instruction set architectures (ISAs) that
are bloated, imprecise, and therefore unamenable to formal verification. The ISA itself
is realized as a hardware implementation written in a hardware description language
(HDL). Unfortunately, modern HDLs lack the expressive, composable programming
abstractions we’ve come to expect of traditional high-level programming languages,
hampering innovation and correct-by-construction hardware design. Furthermore, the
unique characteristics of emerging technologies like superconductor electronics (SCE)
require us to rethink the HDLs we use and retool the entire design, simulation, and
verification stack.
This thesis shows how various programming language techniques, applied to the
realm of computer architecture and hardware design, help address these issues. I show
that abstraction, formal semantics, and type theory can be used to create an ISA that
is precise, concise, and amenable to formal reasoning by both human and machine.
I also show how HDLs can better support composability via formalized notions of
intermodular communication and dependency. Finally, I show that we can improve
SCE behavioral modeling and system design using a new automata-based pulse-transfer
level language.
Contents
Curriculum Vitae vii
Abstract ix
List of Figures xii
List of Tables xiv
1 Introduction 1
   1.1 Thesis Statement 3
   1.2 Organization of This Document 3

2 Zarf: An Architecture Supporting Formal and Compositional Binary Analysis 6
   2.1 Introduction 6
   2.2 Related Work 10
   2.3 Hardware Architecture and ISA 15
   2.4 System Software 24
   2.5 ISA Semantics 29
   2.6 Verification 38
   2.7 Evaluation 58
   2.8 Conclusion 60

3 Bouncer: Static Program Analysis in Hardware 62
   3.1 Introduction 62
   3.2 Hardware Static Analysis 65
   3.3 Static Analysis Strategy 71
   3.4 Algorithm for Analysis 92
   3.5 BEU Implementation 96
   3.6 Provable Non-bypassability 101
   3.7 Evaluation 103
   3.8 Related Work 108
   3.9 Conclusion 111

4 Wire Sorts: A Language Abstraction for Safe Hardware Composition 113
   4.1 Introduction 113
   4.2 Motivation and Related Work 117
   4.3 Wire Sorts and Well-Connectedness 124
   4.4 Implementation of Modular Well-Connectedness Checks 137
   4.5 Evaluation 138
   4.6 Conclusion 148

5 PyLSE: A Pulse-Transfer Level Language for Superconductor Electronics 149
   5.1 Introduction 149
   5.2 Defining Computation on Pulses 151
   5.3 A Language Abstraction for Superconductor Electronics 155
   5.4 PyLSE Language Design 164
   5.5 Evaluation 177
   5.6 Related Work 195
   5.7 Conclusion 196

6 Conclusions and Future Work 198

A Zarf and Bouncer 200
   A.1 Small-Step Semantics 200
   A.2 Big-Step Lazy Semantics for Typed Zarf 206

Bibliography 211
List of Figures
2.1 High-level Zarf system architecture 9
2.2 Abstract syntax of Zarf’s functional ISA 18
2.3 Compiling high-level assembly into a Zarf binary 22
2.4 ECG filter process 25
2.5 Big-step semantics for Zarf’s functional ISA 30
2.6 Big-step semantics helpers for Zarf’s functional ISA 31
2.7 Coq extraction of verified application components 39
2.8 Integrity typing rules 49
2.9 Integrity typing rules helpers 50
2.10 Joining two types 50
2.11 Subtyping rules 51

3.1 The Binary Exclusion Unit as a gatekeeper 70
3.2 Typed Zarf abstract syntax 72
3.3 Zarf static semantic domains 73
3.4 Zarf static semantics (typing rules) 74
3.5 An example Type Reference Table (TRT) for the function map 95
3.6 State machine of the Binary Exclusion Unit 98
3.7 BEU evaluation for a set of sample MiBench programs 104

4.1 Normal FIFO queue 118
4.2 Forwarding FIFO queue 119
4.3 Forwarding FIFO connection causing a combinational loop 120
4.4 Example for computing the output-port-set and input-port-set of a module 126
4.5 Connections between to-sync or from-sync wires 126
4.6 Connections between from-port or to-port wires 130
4.7 Illustration of the Wire Well-Connectedness definition 131
4.8 Wire sorts for synchronous memories 136
4.9 Path through a parallel-in serial-out (PISO) shift register 141

5.1 Abstraction levels in semiconducting and superconducting 150
5.2 Comparing information in CMOS and SFQ 152
5.3 Schematic and Mealy machine of the Synchronous And Element 153
5.4 Waveform with timing constraints of the Synchronous And Element 154
5.5 Anatomy of a PyLSE Machine transition 156
5.6 PyLSE Machine for the Synchronous And Element 157
5.7 Transition, Dispatch, Trace, and Network relations of the PyLSE Machine 160
5.8 Synchronous And Element PyLSE code 165
5.9 Hole description of a memory 166
5.10 Simulating the memory Functional class 167
5.11 Min-max pair PyLSE code and block diagram 169
5.12 Simulation of the Synchronous And Element in PyLSE 171
5.13 Example of error reporting during a PyLSE simulation 172
5.14 Expanding a PyLSE Machine transition into TA transitions 174
5.15 Block diagram of an eight-input bitonic sorter 179
5.16 N-input bitonic sorter written in PyLSE 180
5.17 8-input bitonic sorter implementation written in PyLSE 181
5.18 SPICE vs. PyLSE simulation results for the C Element 182
5.19 SPICE vs. PyLSE simulation results for the Inverted C Element 183
5.20 SPICE vs. PyLSE simulation results for the min-max pair 184
5.21 SPICE vs. PyLSE simulation results for the eight-input bitonic sorter 185
5.22 PyLSE in the SCE Design Flow 197

A.1 Abstract syntax for the small-step semantics of Zarf’s functional ISA 200
A.2 Semantic domains for the small-step semantics of Zarf’s functional ISA 201
A.3 Semantic domains for the big-step semantics 206
A.4 Big-step lazy semantics of Zarf’s functional ISA 207
List of Tables
2.1 Resource usage of Zarf and basic MicroBlaze 59

3.1 21 conditions requiring dynamic checks absent static type checking 68
3.2 Breakdown of states in the Binary Exclusion Unit’s state machine 99
3.3 Examples of how the Binary Exclusion Unit identifies erroneous code 106

4.1 Wire sorts of module ports for a subset of BaseJump STL 140
4.2 Size, wire sort inference time, and number of IO ports of 17 OPDB modules 143
4.3 Cycle detection time (synthesis vs. wire sorts on OPDB designs) 144
4.4 Number of annotations per sort 147

5.1 PyLSE encapsulating and support functions used in example code 172
5.2 Simulation times of PyLSE vs. SPICE-level models 184
5.3 Comparison of PyLSE Machines and UPPAAL-flavored Timed Automata 191

A.1 Small-step state transition rules of Zarf’s functional ISA 203
Chapter 1
Introduction
Our programs are only as correct as the machines on which they run. This correctness
is often taken for granted, with the software realm siloed off from the hardware realm
and often oblivious to the intricacies of the processor running the code. However, if we
are to truly tackle the spectre of full-stack program verification, we must adequately
include our hardware in the discussion. In this thesis, I argue that to do so, we must
change how we approach hardware design both above and below the microarchitecture as
well for emerging technologies beyond CMOS.
Above the Microarchitecture The separation of concerns between hardware and
software has given the hardware realm the freedom to focus on creating machines that
meet particular power, efficiency, area, and security targets, all while interacting with
software via the same interface. This interface, the instruction set architecture (ISA),
serves as both an abstraction and a contract by presenting a set of assembly instructions
and a specification (often in thousands of pages of plain prose and pseudocode) of
how each instruction affects machine behavior. As processors have become increasingly
performant, their microarchitectures have become increasingly large and complex;
unfortunately, this complexity has permeated the ISA abstraction boundary, manifesting
as ISAs that are buggy, imprecise, ill-defined, and hard to reason about, model, and
simulate.
Below the Microarchitecture ISAs are in turn implemented as processor microar-
chitectures using a hardware description language (HDL); these processors are just
one example of intellectual property (IP) blocks—reusable hardware components, pack-
aged into IP catalogs, that are then connected together. Systems on chip (SOCs) and
manycore processors integrate hundreds of these components, and as design teams
become larger and more remote, it becomes more important that these IPs, especially
proprietary IPs, have detailed and precise interface specifications. Naively connecting
hardware blocks together leads to bugs which are difficult to diagnose when looking at
the interfaces alone and which are especially difficult to debug after synthesis. Thus,
composability of IP becomes a major issue, and modern HDLs lack effective abstractions
for dealing with this intermodular composability and communication beyond simple
abstractions like the “module” associating “input” and “output” wires with each other
via gates. Modern high-level programming languages, on the other hand, have many
mechanisms supporting effective modularity, abstraction, and dependency specifica-
tion, and we can use these techniques to be more precise about the surprisingly complex
requirements imposed on the use of data and what compositions lead to well-defined
digital designs.
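The kind of composition bug alluded to above, such as the combinational loop a forwarding FIFO can close when naively connected (the subject of Figure 4.3), can be illustrated with a small sketch. Here modules are abstracted as a directed graph of combinational port-to-port dependencies and a standard depth-first search looks for a cycle; the port names are hypothetical, and the real analysis (Chapter 4) works at the level of interface annotations rather than whole netlists.

```python
# Illustrative sketch: model inter-module combinational dependencies as a
# directed graph {port: [ports it combinationally depends on]} and detect
# the cycles that naive composition can introduce.

def has_combinational_cycle(edges):
    """Return True if the dependency graph contains a cycle (DFS coloring)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in edges}
    def visit(n):
        color[n] = GRAY
        for m in edges.get(n, []):
            if color.get(m, WHITE) == GRAY:
                return True              # back edge: combinational loop
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in list(edges))

# A "forwarding" queue whose dequeue-valid output depends combinationally on
# the consumer's ready signal, wired to a consumer with the mirrored
# dependency (hypothetical port names):
edges = {
    "queue.deq_valid": ["consumer.ready"],
    "consumer.ready":  ["queue.deq_valid"],
}
print(has_combinational_cycle(edges))   # True: ill-formed composition
```

The point of the wire-sort abstraction is precisely that this check can be made without access to the full internal dependency graph of each (possibly proprietary) IP block.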
And Beyond As we enter a post-Moore, post-Dennard scaling era in search of addi-
tional power, speed, and efficiency, we must look beyond CMOS toward emerging
technologies like superconductor electronics (SCE). The unique characteristics of SCE,
such as its pulse-based information encoding and stateful gate primitives, require im-
proved language abstractions and that we rethink the entire design, simulation, and
verification stack. Programming language theory again provides us insight into how
we can use better mathematical foundations, such as automata theory, for improved
SCE languages and simulation frameworks.
We need to ensure the correctness of the entire stack, from source code to silicon
(or, for SCE, niobium) die, and this thesis seeks to show that the precision of programming language theory
can be effectively applied on the hardware side. This brings me to my thesis statement:
1.1 Thesis Statement
The application of programming language principles like abstraction and formal
semantics can make it easier to write verifiably correct and secure hardware, tangibly
improving both the interface which software uses to communicate with the machine
(the instruction set architecture) and the means by which we design the machine
itself (the hardware description language).
1.2 Organization of This Document
This thesis is structured as follows:
Chapter 2 This chapter discusses Zarf, an ISA resembling the untyped lambda cal-
culus designed to bring the assembly closer to frameworks used for formal program
analysis and reasoning, and which has been implemented as an FPGA prototype run-
ning an embedded medical application. I show that the architecture allows for the
formal verification of multiple properties of the end-to-end system, including a proof
of correctness of the assembly-level implementation of the core algorithm, the integrity
of trusted data via a non-interference proof, and a guarantee that our prototype meets
critical timing requirements. It is based on work published in ASPLOS 2017 (Cita-
tion: [1]; DOI: 10.1145/3037697.3037733; © 2017 ACM), IEEE Micro Top Pick 2018
(Citation: [2]; DOI: 10.1109/MM.2018.032271067; © 2018 IEEE), and the Journal of
Theoretical Computer Science 2021 (Citation: [3]; DOI: 10.1016/j.tcs.2020.09.039;
© 2021 Elsevier) and which appeared as a portion of the PhD thesis of Joseph McMahan
[4], co-author on the aforementioned publications. It is based on work supported by
the National Science Foundation under Grant No. 1239567, 1162187, and 1563935.
Chapter 3 This chapter discusses Bouncer, which extends Zarf with let-polymorphism
to make it easier to correctly write and reason about Zarf binaries. Using special-purpose
hardware, I show that we can use static analysis to prevent all program binaries with
memory errors, invalid control flow, and other undesirable properties from ever being
loaded onto the embedded program store. We can check this in a streaming and
verifiably non-bypassable way, directly in hardware, resulting in a system that is small,
efficient, and which guarantees freedom from many security and safety concerns. It is
based on work published in ISCA 2019 (Citation: [5]; DOI: 10.1145/3307650.3322256;
© 2019 ACM) and which also appeared as a portion of the aforementioned thesis by
Joseph McMahan. It is based on work supported by the National Science Foundation
under Grants No. 1740352, 1730309, 1717779, 1563935, 1350690, and a gift from Cisco
Systems.
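The gatekeeping discipline described above can be sketched in a few lines: a binary is checked one instruction at a time as it streams in, and nothing is committed to the program store unless every check passes. The instruction format and checks below are purely hypothetical stand-ins; the actual analysis is the Zarf type system run by the Binary Exclusion Unit in hardware (Chapter 3).

```python
# Minimal sketch of a streaming, all-or-nothing binary gatekeeper.
# Instruction encoding and checks are illustrative, not Zarf's.

def check_instruction(ins, declared_functions):
    op, *args = ins
    if op == "call":
        target, argc = args
        arity = declared_functions.get(target)
        return arity is not None and argc == arity   # known target, right arity
    if op == "lit":
        return isinstance(args[0], int)
    return False                                     # unknown opcode: reject

def load_binary(stream, declared_functions):
    staged = []
    for ins in stream:                 # streaming: one instruction at a time
        if not check_instruction(ins, declared_functions):
            return None                # reject whole binary; nothing committed
        staged.append(ins)
    return staged                      # commit only after every check passes

funcs = {"filter": 2}
good = [("lit", 3), ("call", "filter", 2)]
bad  = [("call", "filter", 1)]         # wrong arity
print(load_binary(good, funcs) is not None)   # True
print(load_binary(bad, funcs))                # None
```

What the hardware adds beyond this sketch is non-bypassability: because the checker sits physically between the input channel and the program store, there is no path by which an unchecked binary can reach execution.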
Chapter 4 This chapter discusses Wire Sorts, an abstraction that makes it impossible to
create certain classes of erroneous digital design and makes it easier to express interface
requirements and compose IP. Using a taxonomy of sorts to soundly abstract even
4
Introduction Chapter 1
complex combinational dependencies of arbitrary hardware modules, I show that we
can facilitate modularity in digital design by escalating problematic aspects of module
input/output interaction to the language-level interface specification. I also show how
these sorts have been formalized and proven sound and demonstrate that they can be
applied and even inferred automatically at scale via an examination of a variety of large
open-source digital designs and libraries. It is based on work published in PLDI 2021
(Citation: [6]; DOI: 10.1145/3453483.3454037; © 2021 ACM) and supported by the
National Science Foundation under Grants No. 1763699 and 1717779. Source code for
this is available as an accompanying artifact at https://zenodo.org/record/4695169.
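The flavor of the interface-level check can be suggested with a simplified rendering in Python; the four sort names come from Chapter 4, but the rule below is a deliberately coarse approximation of the actual well-connectedness conditions. The idea is that when either endpoint of a connection terminates at state (a "sync" sort), safety is decidable from the interface alone, while a port-to-port connection is the case that can close a combinational loop and needs further checking.

```python
# Simplified, illustrative sort check (not the precise Chapter 4 rules).
# Outputs carry "from" sorts; inputs carry "to" sorts.

def connection_needs_path_check(output_sort, input_sort):
    assert output_sort in {"from-sync", "from-port"}, "outputs are 'from' sorts"
    assert input_sort in {"to-sync", "to-port"}, "inputs are 'to' sorts"
    # If either endpoint is backed by state, the connection can be declared
    # safe from the interface annotation alone; only a combinational
    # port-to-port connection can participate in a loop.
    return output_sort == "from-port" and input_sort == "to-port"

print(connection_needs_path_check("from-sync", "to-port"))   # False: safe
print(connection_needs_path_check("from-port", "to-port"))   # True: check paths
```

Because the sorts soundly summarize a module's internal combinational dependencies, this check never requires reopening the module's implementation.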
Chapter 5 This chapter discusses PyLSE, a new pulse-transfer level language for
superconductor electronics built on the theory of automata. PyLSE enables the precise
specification of gate semantics via a transition system-based Python embedded domain-
specific language (DSL). I show that it is an effective framework for easily composing
cells into larger designs and that it facilitates the identification of timing and logic errors
through a mix of dynamic checks and sound static analysis. I also show how PyLSE
can be formalized mathematically, demonstrating its capabilities through the creation,
simulation, and verification of a selection of SCE designs. It is based on work that has
been submitted to a conference and is currently under review. I am the first author on
this work, and my co-authors are Georgios Tzimpragos, Harlan Kringen, Jennifer Volk,
Timothy Sherwood, and Ben Hardekopf.
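The transition-system view of a pulse-processing cell can be conveyed with a plain-Python sketch of the Synchronous And Element (the cell used as a running example in Chapter 5). This is not the PyLSE API, just an illustration of describing a gate as a state machine over input pulses; transitions for repeated pulses in the same state are omitted for brevity, as are the timing constraints that PyLSE attaches to each transition.

```python
# Illustrative pulse-level state machine for a synchronous AND element:
# emit an output pulse on the clock pulse only if pulses arrived on both
# inputs since the last clock. (Not the actual PyLSE encoding.)

TRANSITIONS = {
    # (state, input pulse) -> (next state, emitted output pulses)
    ("idle",  "a"):   ("got_a", []),
    ("idle",  "b"):   ("got_b", []),
    ("got_a", "b"):   ("armed", []),
    ("got_b", "a"):   ("armed", []),
    ("idle",  "clk"): ("idle",  []),        # nothing seen: emit nothing
    ("got_a", "clk"): ("idle",  []),
    ("got_b", "clk"): ("idle",  []),
    ("armed", "clk"): ("idle",  ["q"]),     # both seen: fire output pulse
}

def simulate(pulses):
    state, out = "idle", []
    for p in pulses:
        state, emitted = TRANSITIONS[(state, p)]
        out.extend(emitted)
    return out

print(simulate(["a", "b", "clk"]))          # ['q']
print(simulate(["a", "clk", "b", "clk"]))   # []
```

Note how unlike a CMOS AND gate, the element is inherently stateful: each clock pulse destructively reads and resets the inputs seen so far, which is exactly the behavior that makes automata a better foundation than Boolean netlists for SCE.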
Chapter 2
Zarf: An Architecture Supporting
Formal and Compositional Binary
Analysis
2.1 Introduction
Embedded devices are ubiquitous, with many now playing roles that support human
health, well-being, and safety. The critical nature of these systems — automotive, medi-
cal, cryptographic, avionic — is at odds with the increasing complexity of embedded
software overall: even simple devices can easily include an HTTP server for monitoring
purposes. Traditional processor interfaces are inherently global and stateful, making
the task of isolating and verifying critical application subsets a significant challenge.
Architectural extensions have been proposed that enhance the power, performance, and
functionality of systems, but in contrast, we cover the first architecture designed with
formal program analysis as a core motivating principle.
High-level, functional languages offer a remarkable ability to reason about the behav-
ior of programs, but are often unsuited to low-level embedded systems, where reasoning
must be done at the assembly level to give a full picture of the code that will actually
execute. At a high level, in a language designed for verification, reasoning typically
requires relying on a language run-time that can be prohibitive for resource-constrained
or real-time embedded systems, and/or require the assumption that thousands of lines
of untrusted code in the language stack are correct.
Another approach is to directly model the processor interface by giving formal
semantics to the ISA. However, reasoning about binary behavior on traditional archi-
tectures is difficult and often left incomplete. Unless all program components and
architectural behaviors are included, any piece outside the expected model could mu-
tate a piece of machine state and violate the assumptions of the verification effort. Even
assuming that never happens, using a verified compiler, assuming other modules are
correct, using only a subset of the ISA and assuming the rest is unused, program-specific
reasoning is still difficult — i.e., reasoning about C still means reasoning about pointers,
memory mutation, and countless imperative, effectful behaviors.
We propose a system where the critical code can execute at the assembly level in a
way that is very similar to the underlying computational model upon which proof and
reasoning systems are already built. Under such a mode of computation, properties
such as isolation, composition, and correctness can be reasoned about incrementally,
rather than monolithically. However, instead of requiring a complete reprogramming
of all software in a system, we instead examine a novel system architecture consisting of
two cooperating layers: one built around a traditional imperative ISA, which can execute
arbitrary, untrusted code, and one built around a novel, complete, purely functional ISA
designed specifically to enable reasoning about behavior at the binary level. Application
behaviors that are mission critical can be hoisted piecemeal from the imperative to the
functional world as needed.
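To suggest why such an assembly-level functional model is close to the structures proof assistants already manipulate, consider a toy evaluator for a lambda-calculus-like instruction form. The expression encoding here is illustrative only, not Zarf's actual binary format (given in Section 2.3); the point is that with no mutable or global state, each form's meaning is a pure function of its environment.

```python
# Toy evaluator for a lambda-calculus-like instruction form. All state is
# an immutable environment, so equational reasoning about results is direct.

def eval_expr(expr, env):
    kind = expr[0]
    if kind == "lit":                    # ("lit", n)
        return expr[1]
    if kind == "var":                    # ("var", name)
        return env[expr[1]]
    if kind == "let":                    # ("let", name, bound, body)
        _, name, bound, body = expr
        return eval_expr(body, {**env, name: eval_expr(bound, env)})
    if kind == "prim":                   # ("prim", op, e1, e2)
        _, op, e1, e2 = expr
        a, b = eval_expr(e1, env), eval_expr(e2, env)
        return a + b if op == "add" else a - b
    raise ValueError(f"unknown form: {kind}")

# let x = 40 in x + 2
prog = ("let", "x", ("lit", 40), ("prim", "add", ("var", "x"), ("lit", 2)))
print(eval_expr(prog, {}))   # 42
```

An interpreter of this shape is essentially a big-step semantics transcribed into code, which is why the same definition can be rendered in Coq and used to verify properties of programs directly at the binary level.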
Our proposed system, the Zarf Architecture for Recursive Functions, observes the
following properties:
1. The functional ISA, “Zarf,” is devoid of all global or mutable state, and provides a
compact, complete, and mathematical semantics for the behavior of instructions;
2. The imperative ISA is strictly separated from the functional ISA, connected only
via a communication channel through which the system components can pass
values;
3. The subset of the application which operates on Zarf can be verified and reasoned
about without regard to the operation of the imperative components, meaning
that only the critical components need to be ported and modeled;
4. Reasoning on the functional ISA is provably composable — i.e., two separate
pieces can be statically shown to never interfere with each other.
To demonstrate the usefulness of this platform, we develop, model, and test a
sample application which implements an Implantable Cardio-Defibrillator (ICD) — an
embedded medical device which is implanted in a patient’s chest cavity, monitors the
heart, and administers shocks under certain conditions to prevent or counter cardiac
arrest. Though ICDs provide life-saving treatment for patients with serious arrhythmia,
these devices, along with other embedded medical devices, have seen thousands of
recalls due to dangerous software bugs [7, 8]. By leveraging this two-layer approach,
we are able to formally verify the correctness of a low-level implementation of the
core functions in Coq and directly extract executable assembly code without needing
software runtimes. The ISA semantics allow us to construct an integrity type system and
formally prove that the rest of the code never corrupts the inputs or outputs of the critical
functions. Furthermore, the functional abstraction built into the binary code allows
Figure 2.1: High-level Zarf system architecture: by dividing the system into two hardware realms — one that provides a precise, mathematical semantics for reasoning about program behavior, and the other a standard imperative core for legacy software — we can formally verify and otherwise reason about critical subsets of applications without needing to model and verify the entire program.
us to bound worst-case execution time, even in the face of garbage collection. Taken
altogether, we have an embedded medical application whose core components have
been proven correct, where non-interference is guaranteed, where real-time deadlines
are assured to be met, and where C code can execute arbitrary auxiliary functions in
parallel for monitoring. The high-level system architecture is shown in Figure 2.1.
Given the significant amount of related efforts in verification and ISA design, we
begin by summarizing how our work differs from previous efforts in the fields of
verification and architecture (Section 2.2). We then describe the Zarf platform in more
detail, including a hardware implementation that runs the application on an
FPGA (Section 2.3). Details of our embedded ICD software application and the ways
it can leverage the properties of the Zarf platform are described next (Section 2.4),
followed by a precise definition of Zarf’s semantics (Section 2.5). We then discuss the
verification of multiple properties of the critical sub-components of the ICD, covering
correctness, timing, and non-interference (Section 2.6). Finally, we evaluate this system
architecture and approach, presenting the hardware resource requirements of the novel
ISA and examining the performance loss of the verified components compared to
an unverified C alternative (Section 2.7). We conclude in Section 2.8.
2.2 Related Work
2.2.1 Verification
Our dual ISA approach, where one is untrusted and the other trusted, draws in part
on Rushby’s work on security kernels [9]. He separates machine components into virtual
“regimes” and proves isolation. Having done so, Rushby can then show that security is
maintained when introducing clearly defined and limited channels of communication
whose information flow can be tracked. Our imperative and functional ISAs behave as
separate components, communicating only through a specified, dedicated channel, thus
eliminating any insecure information flow — via memory contamination, for example.
Our security type system draws from the work done on the Secure Lambda (SLam)
calculus by Heintze and Riecke [10] and its further development by Abadi et al. in
their Core Calculus of Dependency [11]. It also draws inspiration from Volpano et
al. [12], who created a type system for secure information flow for an imperative block-
structured language. By showing that their type system is sound, they show the absence
of flow from high-security data to lower-security output, or similarly, that low-security
data does not affect the integrity of higher-security data. Other seminal work on secure
information flow via the formulation of a type system includes Denning [13], Goguen
[14], Pottier’s information flow for ML [15], and Sabelfeld and Myers’s survey on
language-based information flow security [16].
Productive, expressive high-level languages that are also purely functional are
excellent source platforms for Zarf. Even languages like Haskell, though, can have
occasional weaknesses that can lead to runtime type errors. Subsets such as Safe
Haskell [17] shore up these loopholes and provide extensions for sandboxing arbitrary
untrusted code. Zarf provides isolation guarantees at the ISA level and does not require
runtimes, but relies on languages like Safe Haskell for source code development.
Previous work on ISA-level verification has often involved either simplified or
incomplete models of the architecture. These can be in the form of new “idealized”
assembly-like languages: Yu et al. [18] use Coq to apply Hoare-style reasoning for
assembly programs written in a simple RISC-like language. They also provide a certified
memory management library for the machine they describe. Chlipala presents Bedrock,
a framework that facilitates the implementation and verification of low-level programs
[19], but limits available memory structures.
Kami [20] is a platform for specifying, implementing, and verifying hardware
designs in Coq. By making the language of design the same as the language of verification,
the gap that traditionally exists between high-level specification and actual hardware
implementation is minimized. A key distinction between Kami and our work is that
Zarf focuses on building a processor specifically to make software verification easier,
while Kami focuses on the task of specifying and building verifiable hardware. The
two are very complementary — you can build a traditional imperative machine using
Kami (e.g. their pipelined RISC-V processor implementation), and you can build a Zarf
machine using traditional hardware design.
Verification has also been done for subsets of existing machines. For example,
a “substantial” subset of the Motorola MC68020 interface is modeled and used to
mechanically prove the correctness of quicksort, GCD, and binary search [21]; other
examples include a formalization of the SPARC instruction set, including some of
the more complex properties, such as branch delay slots [22]; and subsets of x86
[23]. One of the biggest efforts to date has been a formal model of the ARMv7 ISA
using a monadic specification in HOL4 [24]. Moore developed Piton, a high-level
assembly language and a verified compiler for the FM8502 microprocessor [25], which
complemented the verification work done on the FM8502 implementation [26]. These
are large efforts because of the difficulty in reasoning about imperative systems. At
higher levels of abstraction, entire journal issues have been devoted to works on Java
bytecode verification [27].
In addition to proofs on machine code for existing machines, it is also possible to
define new assembly abstractions that carry useful information. Typed assembly as an
intermediate representation was previously identified as a method for Proof-Carrying
Code [28], where machine-checked proofs guarantee properties of a program [29].
Typed assemblies and intermediate representations have seen extensive use in the
verification community [30, 19, 31, 32] and have been extended with dependent types [33],
allowing for more expressive programs and proofs at the assembly level.
Verified compilers are a popular topic in the verification community [34, 35, 36, 37],
the most well-known example being CompCert [38], a verified C compiler. Verified
compilers are usually equipped with a proof of semantics preservation, demonstrating
that for every output program, the semantics match those of the corresponding input
program. A verified compiler does not provide tools for, nor simplify the process of
doing, program-specific reasoning. One needs a secondary tool-chain for reasoning
about source programs, such as the Verified Software Toolchain (VST) [39] for
CompCert. These frameworks often have a great cost, mandating the use of sophisticated
program logics, such as higher-order separation logic in VST, in order to fully reason
about possible program behaviors.
Further, in many systems, it’s possible that not all source code is available; without
being able to reason about binary programs, guarantees made on a piece of the source
program (and preserved by the verified compiler) may be violated by other components.
Extensions to support combining the output of verified compilers, such as separate
compilation and linking, are still an active research area [40, 41]. As work on verified
compilers requires a semantic model of the ISA, it is complemented by our work, which
gives complete and formal semantics for an ISA.
Previous work at the intersection of verification and biological systems has attempted
to improve device reliability through modeling efforts. This includes work that
formulates real-time automata models of the heart for device testing [42], formal models of
pacing systems in Z notation [43], quantitative and automated checking of the
interaction of heart-pacemaker automata to verify pacemaker properties [44], and semi-formal
verification by combining platform-dependent and independent model checking to
exhaustively check the state space of an embedded system [45]. Our work is
complemented by verification works such as these that refine device specification by taking
into account device-environment interactions.
2.2.2 Architecture
The SECD Machine [46] is an abstract machine for evaluating arithmetic expressions
based on the lambda calculus, designed in 1963 as a target for functional language
compilers. It describes the concept of “state” (consisting of a Stack, Environment,
Control, and Dump) and transitions between states during said evaluation. Interpreters
for SECD run on standard, imperative hardware. Hardware implementations of the
SECD Machine have been produced [47], which explore the implementation of SECD
at the RTL and transistor level, but present the same high-level interface. The SECD
hardware provides an abstract-machine semantics, indicating how the machine state
changes with each instruction. Our verification layer makes machine components
fully transparent, presenting a higher-level small-step operational semantics, where
instructions affect an abstract environment, and a big-step semantics, which immediately
reduces each operation to a value. These latter two versions of the semantics are more
compact, precise, and useful for typical program-level reasoning.
The SKI Reduction Machine [48] was a hardware platform whose machine code
was specially designed to perform reductions on simple combinators, which form the basis of
computation. Like our verification layer, it was garbage-collected and its language was
purely applicative. The goal was to create a machine with a fast, simple, and complete
ISA. The choice to use the “simpler” SKI model means that machine instructions are a
step removed from the typically function-based, mathematical methods of reasoning
about programs. Our functional ISA, while also simple and complete, chooses somewhat
more robust instructions based on function application; though the implementation is
more complicated, modern hardware resources can easily handle the resulting state
machine, giving a simple ISA that is sufficiently high-level for program reasoning.
The most famous work on hardware support for functional programming was on
Lisp Machines [49, 50, 51]. Lisp machines provided a specialized instruction set and
data format to efficiently implement the most common list operations used in functional
programming. For example, Knight [51] describes a machine with instructions for Lisp
primitives such as CAR and CADR, and also for complex operations like CALL and
MOVE. While these machines partially inspired this work, Lisp Machines are
not directly applicable to the problem at hand. Side-effects on global state at the ISA
level are critical to the operation of these machines, and while fast function calls are
supported, the stepwise register-memory-update model common to more traditional
ISAs is still a foundation of these Lisp Machine ISAs. In fact, several commercial Lisp
Machine efforts attempted to capitalize on this fact by building Lisp Machines as a thin
translation layer on top of other processors.
Flicker also dealt with architectural support for a smaller TCB in the presence of
untrusted, imperative code, but did so with architectural extensions that could create
small, independent, trusted bubbles within untrusted code [52]. Our architecture is
almost inverted, with a trusted region providing the main control, calling out to an
untrusted core as needed. Previous works such as NoHype [53] dealt with raising the
level of abstraction of the ISA and factoring software responsibilities into the hardware.
Our verification layer shares some of these characteristics, but deals with verification
instead of virtualization, as well as being a complete, self-contained, functional ISA.
Previous work has explored the security vulnerabilities present in many embedded
medical devices, as well as zero-power defenses against them [54, 55, 56]. The focus of
our work is analysis and correctness properties, and we do not deal with security.
2.3 Hardware Architecture and ISA
Our system relies on two separate layers, running two different ISAs, connected
only by a data channel. This allows one of the layers to be specialized to the execution
of machine code with 1) a compact, precise, and complete semantics highly amenable
to proofs, and 2) the ability to compose verified pieces safely. It is entirely possible for
all code in the system to be written purely functionally and run on Zarf: the ISA for
this layer is complete. However, embedded devices often contain a mix of software,
including legacy code or nice-to-have features that do not affect the application’s
behavior, such as relaying data and diagnostic information to outside receivers. With a
two-layer approach, we can run imperative code that is orthogonal to the operation of
critical application components while still connecting with the vetted, functional code
in a structured way. This, in turn, allows code to be formally verified piecemeal, with
functions “raised” into Zarf as deemed necessary.
The following subsections describe the interface and construction of Zarf, including
the reasons we take an approach much closer to the lambda calculus underlying most
software proof techniques, how we capture this style of execution in an instruction set,
the semantics for that instruction set, and more practical considerations such as I/O,
errors, and ALU functions.
2.3.1 Design Goals
Normal, imperative architectures have been difficult to model, and the task of
composing verified components is still an open problem [40, 41]. We identify the
following features as undesirable and counterproductive to the goal of assembly-level
verification:
1. Large amounts of global machine state (memory, stack, registers, etc.) directly
accessible to instructions, all of which must be modeled and managed in every
proof, and which inhibit modularity: state may be modified by code you haven’t
seen.
2. The mutable nature of machine state, which prevents abstraction and composition
when reasoning about functions or sets of instructions.
3. A large number of instructions and features: a complete model must incorporate
all of them (e.g., fully modeling the behavior of the ARMv7 took 6,500 lines of
HOL4 [24]).
4. Arbitrary control flow, which often requires complex and approximate analyses
to soundly determine possible control flows [57].
5. Unenforced function call conventions, meaning onemust prove that every function
respects the convention.
6. Implicit instruction semantics, such as exceptions where “jump” becomes “jump
and update registers on certain conditions.”
To avoid these traits, we design an interface that is small, explicit in all arguments,
and completely free of state manipulation and side effects — with the exception of
I/O, which is necessary for programs to be useful. Without explicit state to reference
(memory and registers), standard imperative operations become impossible, and we
must raise the level of abstraction. Instead of imperative instructions acting as the
building blocks of a program, our basic unit is the function. This is a major departure
from a typical imperative assembly, where the notion of a “function” is a higher-level
construct consisting of a label, control flow operations, and a calling convention enforced
by the compiler — but which has no definition in the machine itself. By bringing the
definition of functions to the ISA level, they become not just callable “methods” that
serve to separate out independent routines, but are actually strict functions in the
mathematical sense: they have no side effects, never mutate state, and simply map
inputs to outputs. This change allows us to attach precise and formal semantics to the
ISA operations.
2.3.2 Description and Semantics
Zarf’s functional ISA is effectively an a) untyped, b) lambda-lifted, c)
administrative normal form (ANF) lambda calculus. Those limitations are a result of the
x ∈ Variable    n ∈ Z    fn, cn ∈ Name

p ∈ Program ::= decl⃗ fun main = e
decl ∈ Declaration ::= cons | func
cons ∈ Constructor ::= con cn x⃗
func ∈ Function ::= fun fn x⃗ = e
e ∈ Expression ::= let | case | res
let ∈ Let ::= let x = id arg⃗ in e
case ∈ Case ::= case arg of br⃗ else e
res ∈ Result ::= result arg
br ∈ Branch ::= cn x⃗ ⇒ e | n ⇒ e
id ∈ Identifier ::= x | fn | cn | op
arg ∈ Argument ::= n | x
op ∈ PrimOp ::= + | − | × | ÷ | = | < | ≤ | ∧ | ∨ | ∧̄ | ⊻ | ⊕ | ≪ | ≫ | sra | ¬ | getint | putint

Figure 2.2: Abstract syntax of Zarf’s functional ISA. A program is a set of function and constructor declarations, where functions are composed solely of let, case, and result expressions, and constructors are tuples with unique names. Case expressions contain branches and serve as the mechanism for both control flow and deconstruction of constructor forms. An arrow over any metavariable (e.g. x⃗) signifies a list of zero or more elements. op refers to a function that is implemented in hardware (such as ALU operations); though the execution of the function invokes a hardware unit instead of a piece of software, the functional interface is identical to program-defined functions.
implementation being done in real hardware: a) to avoid the complexity of a hardware
typechecker, the assembly is untyped1; b) because every function must live somewhere
in the global instruction memory, only top-level declarations of functions are allowed
(lambda-lifted); c) because the instruction words are fixed-width with a static number
of operands, nested expressions are not allowed and every sub-expression must be
bound to its own variable (ANF). The abstract syntax of Zarf assembly is given in
Figure 2.2.
All words in the machine are 32 bits. Each binary program starts with a magic word,
a word-length integer N stating how many functions are contained in the program,
and then a sequence of N functions. Each function starts with an informational word
that lets the machine know the “fingerprint” of the function (including the number of
arguments expected and how many locals will be used) and a word-length integer M
to specify that the body of the function is M words long. The remaining M words of the
function are then composed entirely of the individual instructions of the machine.
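To make the container format concrete, the following Python sketch packs and unpacks a program image with this shape. The magic value and the bit-packing of the info word are invented for illustration; only the overall layout described above (magic word, function count N, then a per-function info word, length M, and M body words) is taken from the text.

```python
import struct

MAGIC = 0x5A415246  # hypothetical magic word; the real value is not specified here

def pack_program(functions):
    """functions: list of (num_args, num_locals, body_words) tuples."""
    words = [MAGIC, len(functions)]
    for num_args, num_locals, body in functions:
        # Info word "fingerprint": illustrative packing of arg/local counts.
        words.append((num_args << 16) | num_locals)
        words.append(len(body))          # M: body length in words
        words.extend(body)               # the M instruction words
    return struct.pack("<%dI" % len(words), *words)

def unpack_program(blob):
    words = list(struct.unpack("<%dI" % (len(blob) // 4), blob))
    assert words[0] == MAGIC
    n, i, funcs = words[1], 2, []
    for _ in range(n):
        info, m = words[i], words[i + 1]
        body = words[i + 2 : i + 2 + m]
        funcs.append((info >> 16, info & 0xFFFF, body))
        i += 2 + m
    return funcs

prog = [(2, 3, [7, 8, 9]), (0, 1, [42])]
assert unpack_program(pack_program(prog)) == prog  # round-trip check
```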
Each function, as it is loaded, is given a unique and sequential identifier. These
function identifiers are the only globally visible state in the system and serve as both a
kind of name and a kind of pointer back to the code. Other functions can refer to, test,
and apply arguments to function identifiers. There are two varieties of function identi-
fiers: those that refer to full functions that contain a body of code, and “constructors,”
which have no body at all. Constructors are essentially stub functions and cannot be
executed. However, just like other functions, you can apply arguments to them. These
special function identifiers thus can serve as a “name” for software data types, where
arguments are the composed data elements. (In more formal terms, you can use our
constructors to implement algebraic data types.)

1The original ISA definition as presented in the paper was untyped; in later work, we extended the ISA to include type annotations and a hardware typechecker.
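The behavior of constructors can be illustrated with a small sketch (in Python, since Zarf is programmed in its own assembly; the names Nil and Cons are ours, standing in for numeric constructor identifiers): applying arguments to a body-less identifier simply packages the tag together with the data.

```python
# A constructor is a function identifier with no body: applying arguments
# to it yields a tagged record rather than running code. (Names are
# illustrative; real Zarf uses numeric function identifiers.)
def make_constructor(name, arity):
    def apply(*args):
        assert len(args) == arity
        return (name, args)          # tag + composed data elements
    return apply

Nil  = make_constructor("Nil", 0)    # the empty list
Cons = make_constructor("Cons", 2)   # head, tail

xs = Cons(1, Cons(2, Nil()))         # encodes the list [1, 2]
tag, fields = xs
assert tag == "Cons" and fields[0] == 1
```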
The words defining the body of a function are built out of just three instructions:
let, case, and result, which we will describe below. Unlike RISC instructions, let
and case can be multiple words long (depending on the number of arguments and
branches, respectively). However, unlike most CISC instructions, each piece of the
variable length instruction is also word-aligned and trivial to decode.
Zarf has no programmer-visible registers or memory addresses, but instructions
will still need to reference particular data elements. Instructions can refer to data by
its source and index, where the source is one of a predefined set — e.g., local and arg,
which serve a purpose similar to the stack on a traditional machine. The local and arg
indices might be analogous to stack offsets, while the actual addresses themselves are
never visible.
The primary ways of generating Zarf assembly are via extraction from Coq and
writing it by hand. We also have a Haskell compiler that supports a subset of basic
Haskell constructs. In our experience, Zarf assembly code resembles a typical functional
programming language like desugared Haskell or OCaml, and the resultant
expressibility makes directly writing assembly relatively easy; the user doesn’t need to worry
about memory address calculations, maintaining register or stack state across function
calls, or the myriad other things that make programming traditional ISAs tedious and
error-prone. For more information on automatic Coq extraction, see our discussion of
the ICD implementation in Section 2.6.
Figure 2.5 gives the complete ISA behavior using a big-step semantics, which ex-
plains how each instruction reduces to a value. This semantics uses eager evaluation
for simplicity; though the current hardware implementation uses lazy semantics, the
difference is not observable in our application because I/O interactions are localized
to a specific function and always evaluated immediately. The semantics use assembly
keywords for readability; Figure 2.3 shows how the assembly maps one-to-one with
the binary encoding, and Figure 2.7 shows how low-level Coq code can be directly
converted to our assembly.
2.3.3 Instruction Set
The let instruction applies a function to arguments and assigns it a local identifier.
The first word in the let instruction indicates a function identifier or closure object
and the number of argument words that follow. Note that unlike a function “call”, let
does not immediately change the control flow or force evaluation of arguments; rather
it creates a new structure in memory (closure) tying the code (function identifier) to
the data (arguments), which, when finally needed, can actually be evaluated (using
lazy evaluation semantics). Additionally, the let instruction allows partial application,
meaning that new functions (but not function identifiers) can be dynamically produced
by applying a function identifier to some, but not all, of its arguments.
The case instruction provides pattern-matching for control flow. It takes a value,
then makes a set of equality comparisons, one for each “pattern” provided. The first
word of the case instruction indicates a piece of data to evaluate. As we need an actual
value, this is the point in execution that forces evaluation of structures created with
let— however, it is evaluated only enough to get a value with which comparisons can
be made; specifically, until it results in either an integer or a constructor object2. The
words following the instruction encode patterns (pattern_literal and pattern_cons)
against which to match the case value. If the case value exactly equals the literal value
or function (i.e. constructor) identifier, execution proceeds with the next instruction;
otherwise, it skips the number of words specified in the pattern argument. A matching
pattern_else is required for every case, which will be executed when no other matches
are found (and demarcates the end of the case instruction encoding). Case/pattern
sequences not adhering to the encoding described are malformed and invalid — e.g.,
you cannot skip to the middle of a branch, or have a case without an else branch.

2More precisely, evaluation of that argument always produces a result in Weak Head-Normal Form (WHNF), but never a lambda abstraction.

Figure 2.3: How the high-level assembly instructions are directly compiled into a Zarf binary for execution. This example shows the map function, along with the list constructors, in (a) high-level untyped assembly, (b) machine assembly, and (c) binary. (a) The standard linked-list definition requires just two constructors: a list is either empty or a 2-element struct containing a head (a value) and a tail (a list) [lines 1-2]. The function map takes a function and a list as arguments [line 3]; it builds a new list, applying the function to each list element. If the argument list is empty, it returns an empty list [lines 5-6]. Otherwise, if the list matches against the head/tail constructor [line 7], it applies the function it was given to the list element [lines 8-9], calls map recursively on the list tail [lines 10-12], builds a new list [lines 13-15], and yields that new list [line 16]. The else branch is not shown. (b) In the lowering to machine assembly, names are replaced with local indices, addressing a value on the locals stack (e.g., list′ becomes local 2 [line 13]). Function allocations are broken up so that each argument occupies its own word. (c) The binary is a direct mapping from the assembly in (b), simply translating ops to opcodes and data sources to integer identifiers. ‘x’ indicates an unused field. (d) Binary encoding. Each word of the binary is either the start of a function, the start of an instruction, or an argument word in a let instruction. With no architecturally visible state, data is accessed with a scoped system where the program identifies source and index; all data references use the same source/index pattern.
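The branching behavior can be sketched as follows (again in Python, with invented names; real patterns compare against numeric constructor identifiers and skip word offsets rather than looking up dictionary keys):

```python
def case(scrutinee, branches, default):
    """Sketch of `case`: force the scrutinee just enough to get an integer
    or a constructor form, then take the first matching branch.
    (Branch bodies receive the constructor's fields as arguments.)"""
    value = scrutinee() if callable(scrutinee) else scrutinee  # force a thunk
    if isinstance(value, tuple):            # constructor form: (tag, fields)
        tag, fields = value
        if tag in branches:
            return branches[tag](*fields)
    elif value in branches:                 # integer literal pattern
        return branches[value]()
    return default()                        # the required else branch

xs = ("Cons", (1, ("Nil", ())))            # the list [1]... with a 2 appended:
xs = ("Cons", (1, ("Cons", (2, ("Nil", ())))))
length = lambda l: case(l, {"Nil": lambda: 0,
                            "Cons": lambda h, t: 1 + length(t)},
                        lambda: -1)
assert length(xs) == 2
```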
The result instruction is a single word, indicating a single piece of data that the
current function should yield. Every branch of every function must terminate with a
result instruction (disallowing re-convergent branches means the simple pattern-skip
mechanism is all that is necessary for control flow). Functions that do not produce a
value do not make sense in an environment without side effects, and so are disallowed.
After a result, control flow passes to the case instruction where the function result
was required.
We realize that this is a departure from traditional hardware instructions and suggest
referring to Figure 2.3 to help ground our descriptions in a concrete example. Figure
2.3 shows a small function, map, written in high-level assembly, machine assembly, and
encoded as a binary. A more thorough description of the semantics of each of these
instructions is found in Section 2.5.
2.3.4 Built-In Functions, I/O, and Errors
ALU operations are, for the most part, already purely mathematical functions —
they just map inputs to an output. The Zarf functional ISA is built around the notion of
function calls, so no new mechanism or instructions are needed to use the hardware ALU.
Invoking a hardware “add” is the same as invoking a program-supplied function. In our
prototype, function indices less than 256 (0x100) are reserved for hardware operations;
the first program-supplied function, main, is 0x100, with functions numbered up from
there. During evaluation, if the machine encounters a function with an index less than
0x100, it knows to invoke the ALU instead of jumping to a space in instruction memory.
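A sketch of this dispatch rule, with invented operation indices below 0x100 standing in for the real hardware operation numbering:

```python
FIRST_USER_FN = 0x100   # main; indices below this invoke hardware

# Illustrative hardware operation table (a subset; the indices are invented).
HW_OPS = {0x01: lambda a, b: a + b,    # add
          0x02: lambda a, b: a - b}    # sub

def invoke(fn_index, args, program_functions):
    """Sketch of call dispatch: small indices go to the ALU, the rest
    jump into instruction memory (modeled here as a dict of functions)."""
    if fn_index < FIRST_USER_FN:
        return HW_OPS[fn_index](*args)
    return program_functions[fn_index](*args)

prog = {0x100: lambda: invoke(0x01, (40, 2), prog)}  # main = add 40 2
assert invoke(0x100, (), prog) == 42
```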
The only two functions with side-effects in the system, input and output, are also
primitive functions. The input function takes one argument (a port number) and
returns a single word from that port; the output function takes two arguments, a
port and a value, and writes its result to the port, returning the value written. Since
data dependencies are never violated in function evaluation, software can ensure I/O
operations always occur in the right order even in a pure functional environment by
introducing artificial data dependencies; this is the principle underlying the I/O monad
[58, 59], used prominently in languages like Haskell.
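The idea can be demonstrated with a small sketch (Python stands in for the pure language, and output is modeled as appending to a log): because the second write consumes the first write's return value, any evaluation order that respects data dependencies must perform the writes in program order.

```python
log = []

def output(port, value):
    """Model of the `output` primitive: performs the write and returns
    the value written, as described above."""
    log.append((port, value))
    return value

# Pure-looking code can order its effects by threading each result into
# the next call: the second output *needs* the first's result, so it
# cannot be evaluated first, regardless of evaluation strategy.
first  = output(0, 10)
second = output(0, first + 1)   # artificial data dependency on `first`
assert log == [(0, 10), (0, 11)] and second == 11
```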
In a purely functional system there are no side effects, and thus no notion of an
“exception”. For program-defined functions, this just requires that every branch of
every case return a value (that value could be a program-defined error). However,
some invalid conditions resulting from a malformed program can still occur at runtime.
To respect the purely functional system, these must cause a non-effectful result that
is still distinguishable from valid results. Our solution is to define a “runtime error
constructor” in the space of reserved functions. Every function, both hardware- and
software-defined, can potentially return an instance of the error constructor. The ISA
semantics are undefined in these error cases, because they are easy to avoid — compiling
from any Hindley-Milner typechecked language will guarantee the absence of runtime
type errors [60, 61].
2.4 System Software
This section describes the software architecture across the two realms (functional
and imperative) of the system, and provides an overview of the ICD and the functional
coroutines.
Figure 2.4: The ECG takes input signals sampled at 200 Hz and filters them multiple times, after which the peaks are classified and the rate of heartbeat determined. These values are fed to an ATP (antitachycardia pacing) procedure, which decides if ventricular tachycardia is occurring based on current and previous heart rate, and administers pacing shocks to prevent acceleration and ventricular fibrillation (a form of cardiac arrest).
2.4.1 Functional vs. Imperative
As our system is composed of two small and separate computational layers, the
software is split across two different ISAs. For existing applications, or applications
prototyped for existing platforms, the decision of which components to migrate to Zarf
represents a trade-off of increased abstraction and verification capability for additional
development effort and some decrease in performance. Section 2.7 provides some
quantitative worst-case bounds for this trade-off.
Zarf runs a small microkernel based on cooperative coroutines [62, 63] to handle
the scheduling and communication of different software components. This allows
us to more easily group and reason about code in terms of higher-level behaviors
— i.e., the small surface area of each coroutine means they can be considered (and
occasionally verified) in blocks, as collections of functions with a single specification
and interface. The cooperative nature of the system is a design choice that allows us
to avoid interrupts, which would complicate proofs of a single coroutine’s behavior.
Timing analysis (section 2.6.2) ensures each coroutine always returns control.
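As an illustrative sketch of such a cooperative microkernel (Python generators stand in for Zarf coroutines; the coroutine names and the round-robin policy are ours for illustration, not the verified scheduler):

```python
def icd_core():
    while True:
        yield "processed heart sample"    # do one unit of work, then yield

def io_routine():
    while True:
        yield "read sensor / wrote shock signal"

def monitor_bridge():
    while True:
        yield "sent value to imperative layer"

def microkernel(coroutines, steps):
    """Round-robin over coroutines; each must return control by yielding
    (there are no interrupts), which the timing analysis guarantees."""
    trace = []
    for i in range(steps):
        trace.append(next(coroutines[i % len(coroutines)]))
    return trace

trace = microkernel([icd_core(), io_routine(), monitor_bridge()], 6)
assert trace[0] == "processed heart sample" and len(trace) == 6
```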
Zarf enables reasoning about these coroutines at the assembly and binary level.
Section 2.6 demonstrates different properties that can be verified. The integrity type
system allows a developer to statically prove that a given set of coroutines (and the
microkernel itself) will execute in cooperation without one coroutine corrupting values
important to another. This composability of verification is extremely difficult on
traditional architectures, as the global and mutable nature of all state makes it quite easy for
any software component to affect any other.
The imperative layer — which can be any embedded CPU, but for our purposes is a
Xilinx MicroBlaze processor — runs whatever pieces of the software are not placed on
Zarf. This allows for monitoring software, low-level drivers, communication protocols,
and other complex, imperative code to exist and run without requiring modeling or
pure-functional implementations. As this area of the system is untrusted and unverified,
anything on which the critical components depend should be rewritten to run on Zarf.
In our sample application, three application coroutines are run on Zarf: one that
handles the core ICD application, an I/O routine that handles the timing of reading
the values from the patient’s heart and outputting when shocks should occur, and a
routine that sends values to the monitoring software on the imperative layer. The system
operates in real-time, reading a single value from the heart, running ECG and ICD
processing, and communicating the resulting value back out. In our application, the
monitoring software tracks the number of times treatment occurs, and, when prompted
from its communication channel, will output that number. This imperative software
could be arbitrarily complex and handle more complicated monitoring and diagnosis,
communication drivers to communicate with the outside world, or other features; as it
is a standard imperative core, any embedded C code can be easily compiled for it with
an off-the-shelf compiler.
2.4.2 ICD
ICDs are small, battery-powered, embedded systems which are implanted in a
patient’s chest cavity and connect directly with the heart. For patients with arrhythmia
and at risk for heart failure, an ICD is a potentially life-saving device. Currently, the
primary use of ICDs is to detect dangerous arrhythmias (such as ventricular tachycardia,
or VT) and administer pacing shocks (anti-tachycardia pacing, or ATP). These shocks
help prevent the acceleration in heart rate leading to ventricular fibrillation, a form of
cardiac arrest.
From 1990 to 2000, over 200,000 ICDs and pacemakers were recalled due to software
issues [7]. Between 2001 and 2015, over 150,000 implanted medical devices were
recalled by the FDA because of life-threatening software bugs [8]. However, ICDs are
credited with saving thousands of lives; for patients who have survived life-threatening
arrhythmia, ICDs decrease mortality rates by 20-30% over medication [64, 65, 66].
Currently, around 10,000 new patients have an ICD implanted each month [67], and
around 800,000 people are living with ICDs [68].
The core of our ICD is an embedded, real-time ECG algorithm that performs QRS³
detection on raw electrocardiogram data to determine the timing between heartbeats.
We work off of an established real-time QRS detection algorithm [69], which has seen
wide use and been the subject of studies examining its performance and efficacy [70].
An open-source update of several versions of the algorithm [71] is available; we use
the results of this open-source work as the basis of our algorithm’s specification as well
as the C alternative. After the ECG algorithm detects the pacing between heartbeats,
the ATP function checks for signs of ventricular tachycardia and, if found, administers
a series of pacing shocks. We implement the VT test and ATP treatment published in
[72].
The I/O coroutine is passed the output of the previous iteration of the ICD coroutine.
A hardware timer is used to ensure that I/O events occur at the correct frequency. When
the correct time has elapsed (5 ms), the I/O coroutine outputs the given value and
reads the next input value. It yields this value to the microkernel.
This input is then passed through to the ICD coroutine, which implements a series
of filter passes to detect the spacing between QRS complexes (Figure 2.4 illustrates
the ECG filter passes). If 18 of the last 24 beats had periods less than 360 ms (corresponding
to a heart rate greater than 167 bpm), the ICD coroutine moves into a

³The "QRS complex" is made up of the rapid sequence of Q, R, and S waves corresponding to the
depolarization of the left and right ventricles of the heart, forming the distinctive peak in an ECG.
treatment-administering state, where it outputs three sequences of eight pulses at 88%
of the current heart rate, with a 20 ms decrement between sequences. This is designed
to prevent continued acceleration and restore a safe rhythm.
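A minimal sketch of this detection and treatment logic, using only the thresholds stated above (18 of 24 beats under 360 ms; three sequences of eight pulses at 88% with a 20 ms decrement). The function names, and the reading of "88% of the current heart rate" as a pacing interval of 88% of the current beat period, are our assumptions for illustration, not the verified Zarf implementation:

```python
# Illustrative sketch of the VT test and ATP pacing schedule.

def detects_vt(periods_ms):
    """VT test: 18 of the last 24 beat periods under 360 ms
    (i.e., a sustained rate above ~167 bpm)."""
    recent = periods_ms[-24:]
    return len(recent) == 24 and sum(p < 360 for p in recent) >= 18

def atp_schedule(current_period_ms):
    """ATP: three sequences of eight pacing pulses, paced at 88% of
    the current beat period, 20 ms shorter for each later sequence."""
    base = 0.88 * current_period_ms
    return [[base - 20 * seq] * 8 for seq in range(3)]

print(detects_vt([350] * 24))   # -> True (every beat period under 360 ms)
print(atp_schedule(300)[1][0])  # second sequence paces at 244.0 ms
```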
The monitoring software, which runs on the MicroBlaze, receives the output of the
ICD coroutine each cycle. A command can be given on the diagnostic input channel for
the software to output the number of times treatment has occurred.
I/O events occur at a fixed frequency of 200 Hz. Timing analysis in Section 2.6.2
confirms that, after an input event, the entire cycle of each coroutine running and
yielding, including garbage collection, is able to conclude well within the 5 ms window,
meaning that the entire system is always able to meet its real-time deadline.
2.5 ISA Semantics
Zarf has the core goal of providing concise, mathematical semantics for its hardware
ISA. These can be found in Figure 2.5, which gives the complete ISA behavior using a
big-step semantics, explaining how each instruction reduces to a value. This semantics
uses eager evaluation for simplicity; though the current hardware implementation
uses lazy semantics, the difference is not observable in our application because I/O
interactions are localized to a specific function and always evaluated immediately.
The semantics are discussed in more detail in the following subsections. Note that
terms introduced in the abstract syntax (Figure 2.2) are used in the semantics. Each
rule (or helper function) is applicable in a different case, depending on what is under
evaluation; the scenarios are all mutually exclusive, meaning that there is always exactly
one rule that can (and should) be applied at every step.
c ∈ Constructor = Name × Value⃗        clo ∈ Closure = (λx⃗.e) × Value⃗
v ∈ Value = ℤ ⊎ Constructor ⊎ Closure        ρ ∈ Env = Variable → Value

(program)     ⊢ e ⇓ v
              ────────────────────────
              ⊢ decl⃗ fun main = e ⇓ v

(result)      v = ρ(arg)
              ────────────────────
              ρ ⊢ result arg ⇓ v

(let-con)     v⃗1 = ρ(arg⃗)     v2 = applyCn(cn, v⃗1)     ρ[x ↦ v2] ⊢ e ⇓ v3
              ────────────────────────────────────────────────────────────
              ρ ⊢ let x = cn arg⃗ in e ⇓ v3

(let-fun)     fn ∉ {getint, putint}     (fun fn x⃗2 = e2) ∈ decl⃗
              v⃗1 = ρ(arg⃗)     v2 = applyFn((λx⃗2.e2, []), v⃗1, ρ)     ρ[x1 ↦ v2] ⊢ e1 ⇓ v3
              ──────────────────────────────────────────────────────────────────────────
              ρ ⊢ let x1 = fn arg⃗ in e1 ⇓ v3

(let-var)     v1 = ρ(x2)     v⃗2 = ρ(arg⃗)     v3 = applyFn(v1, v⃗2, ρ)     ρ[x1 ↦ v3] ⊢ e ⇓ v4
              ──────────────────────────────────────────────────────────────────────────────
              ρ ⊢ let x1 = x2 arg⃗ in e ⇓ v4

(let-prim)    v⃗1 = ρ(arg⃗)     v2 = applyPrim(op, v⃗1)     ρ[x ↦ v2] ⊢ e ⇓ v3
              ──────────────────────────────────────────────────────────────
              ρ ⊢ let x = op arg⃗ in e ⇓ v3

(case-con)    (cn, v⃗1) = ρ(arg)     (cn x⃗ ⇒ e1) ∈ br⃗     ρ[x⃗ ↦ v⃗1] ⊢ e1 ⇓ v2
              ─────────────────────────────────────────────────────────────────
              ρ ⊢ case arg of br⃗ else e2 ⇓ v2

(case-lit)    n = ρ(arg)     (n ⇒ e1) ∈ br⃗     ρ ⊢ e1 ⇓ v
              ────────────────────────────────────────────
              ρ ⊢ case arg of br⃗ else e2 ⇓ v

(case-else)   ((cn, v⃗1) = ρ(arg) ∧ (cn x⃗ ⇒ e1) ∉ br⃗) ∨ (n = ρ(arg) ∧ (n ⇒ e1) ∉ br⃗)
              ρ ⊢ e2 ⇓ v2
              ──────────────────────────────────────────
              ρ ⊢ case arg of br⃗ else e2 ⇓ v2

(getint)      n2 is input from port n1     ρ[x ↦ n2] ⊢ e ⇓ v
              ───────────────────────────────────────────────
              ρ ⊢ let x = getint n1 in e ⇓ v

(putint)      n1 is a port     n2 = ρ(arg)     ρ[x ↦ n2] ⊢ e ⇓ v
              ───────────────────────────────────────────────────
              ρ ⊢ let x = putint n1 arg in e ⇓ v

Figure 2.5: Big-step semantics for Zarf's functional ISA. It is a ternary relation on an
environment; a let, case, or result expression; and the value to which it evaluates.
Evaluation begins with the main function's body. ρ[x ↦ v] returns an updated copy of
the environment with x mapped to v. getint gets an integer from a specified port, and
putint puts an integer onto a specified port; both are the only mechanisms for I/O.
applyFn((λx⃗1.e, v⃗1), v⃗2, ρ) =
    v                                              if |v⃗2| = 0, |v⃗1| = |x⃗1|, and ρ[x⃗1 ↦ v⃗1] ⊢ e ⇓ v
    (λx⃗1.e, v⃗1)                                   if |v⃗2| = 0 and |v⃗1| < |x⃗1|
    applyFn((λx⃗1.e, v⃗1 :+ hd(v⃗2)), tl(v⃗2), ρ)    if |v⃗2| > 0 and |v⃗1| < |x⃗1|
    applyFn((λx⃗2.e′, v⃗3), v⃗2, ρ)                  if |v⃗2| > 0, |x⃗1| = |v⃗1|, and ρ[x⃗1 ↦ v⃗1] ⊢ e ⇓ (λx⃗2.e′, v⃗3)

applyCn(cn, v⃗) =
    (cn, v⃗)                                        if (con cn x⃗) ∈ decl⃗ and |v⃗| = |x⃗|
    (λx⃗.let c = cn x⃗ in result c, v⃗)               if (con cn x⃗) ∈ decl⃗ and |v⃗| < |x⃗|

ρ(arg) =
    n                                              if arg = n
    v                                              if arg = x and (x ↦ v) ∈ ρ

applyPrim(op, v⃗1) =
    v                                              if |v⃗1| = arity(op) and v = eval(op, v⃗1)
    (λx⃗1.let x2 = op x⃗1 in result x2, v⃗1)          if |v⃗1| < arity(op) and |x⃗1| = arity(op)

Figure 2.6: Big-step semantics helpers for Zarf's functional ISA. applyFn (applyCn)
performs function (constructor) application. x⃗ :+ y appends y to the end of x⃗, creating a
new list. |x⃗| means the length of the list x⃗. eval returns the value of applying a primitive
operation to arguments. Because functions are lambda-lifted, our version of closures
tracks the list of values to be applied upon saturation, rather than an entire environment
like normal closures.
2.5.1 Names and Programs
A Constructor is a unique name and a list of zero or more values. Constructors
serve as software data types, as a simple system for building up more complex data
objects. The name indicates the “type” of the constructor, encoded statically as a unique
integer, which the machine uses at runtime to distinguish constructors of different
types.
A Closure is a function object, tying a function to a list of zero or more values, which
are the arguments that have already been supplied. Closures allow for the dynamic
construction of function objects from statically defined functions: e.g., applying the
argument 1 to the static binary function add creates a new closure, which expects one
argument, that performs the function λx.x + 1.
A Value is either an integer, a constructor, or a closure. The machine uses one bit at
runtime to track which values are primitives and which are objects (either constructors
or closures), and identifies constructor types with their name (unique type integer),
but is otherwise untyped.
An Environment is a semantic entity mapping variables (local names) to values
(integers, constructors, and closures). The semantics are high-level and do not specify
how the machine should implement the behavior of the environment, but function
arguments and local variables fall into the environment of each function. The set of
functions (declared at the top level) are stored in a list of declarations decl⃗, which is
essentially a map from the function’s name to its parameter list and body.
The PROGRAM rule states that there should be a set of zero or more function and
constructor declarations, and one function main with a body expression e. Given the
declarations and main function, and application of the semantic rules, we can reduce e
to a value.
2.5.2 Result
The RESULT rule states that, given the current function environment and a result
instruction with argument arg, we can reduce the current function execution to a single
value v using the environment to look up arg, if arg is a variable, or simply returning it,
if arg is a number.
2.5.3 Let
A let instruction will be reduced using one of four rules: LET-FUN, LET-CON, LET-VAR,
or LET-PRIM. The first is used for static program function application; i.e., applying
arguments to a program-defined function (which excludes I/O and hardware func-
tions). Similarly, the second is used for static constructor application; i.e., applying
arguments to a program-defined constructor. The third is used to apply arguments
to a runtime value, which will be a closure expecting additional arguments. The final
is application on primitive (hardware ALU) functions. I/O functions (getint and
putint) have separate rules.
LET-FUN is used when the instruction under evaluation is a let instruction applying
zero or more arguments to a program-defined function fn. Its premises state that the
function should not be getint or putint, that the function should be defined in the
program declarations, that the arguments should all be reducible to a sequence of values
v⃗1 using the current function environment, that application of the applyFn helper rule
on the body of fn with arguments v⃗1 will result in a value v2, and finally, that binding
a new local variable to v2 will allow us to reduce the remainder of the instructions
in the current function to a value.⁴ The final premise of the rule (ρ[x1 ↦ v2] ⊢ e1 ⇓ v3)
⁴As these are big-step semantics, rules indicate how expressions reduce directly to values, rather than
giving step-by-step instructions for performing the reduction. In evaluating LET-FUN, for example, the rule
does not instruct you to "call" into the indicated function, but rather just states that the function reduces
to a value (as it must, eventually), then uses the value; as we are writing mathematical expressions, the
continues the execution: e1 is the remainder of the instructions, where the environment
now includes a mapping from x1 to the newly calculated value v2. A premise of this
format occurs in every rule for non-terminal instructions (everything but PROGRAM and
RESULT); the semantics treat the instructions as recursive, such that each instruction
“points” to the rest of the instructions.
The premises for LET-CON are very similar to those of LET-FUN, with the primary
difference coming in applyCn. While function applications can be oversaturated, we
disallow oversaturation of constructor applications for simplicity; we haven’t found this
overly restrictive at all in practice. Otherwise, the rule behaves similarly, storing the
value that results from applying arguments to a constructor into the environment ρ
before continuing to evaluate the next expression e to a value v3. In this way, closures
and constructors are structurally the same; both are function identifiers with a sequence
of arguments. The difference is that closures are already fully evaluated.
LET-VAR is used when a let instruction applies arguments to a dynamic (runtime)
value x2. The premises state that x2 should be reducible via the environment to a value
v1, that the sequence of arguments is reducible to a sequence of values v⃗2, that applying
the arguments v⃗2 to the object v1 with the applyFn operation will result in a value v3,
and that binding a local variable to that value will allow us to reduce the remainder
of the instructions to a value. applyFn is written to only accept closures as its
first argument; in calling it, there is an implicit premise that v1 is a closure object. x2
reducing to any other type of value (an integer or constructor) is a runtime error, and
can occur only if the program is not well-typed.
LET-PRIM is similar to LET-FUN, but is used only when the static function is a primi-
tive operation. These functions must be treated differently because there is no program
reduction is assumed to occur immediately. One can still "execute" the semantics by stepping through,
evaluating operands as necessary using the rules in places that the semantics simply reduce immediately
to a value.
definition for add, or sub, or mult; during machine execution, the hardware ALU
handles them. The semantics define ALU operations the same way as the machine (as 32-bit
modular mathematical operations). The premises of LET-PRIM contain no function
declaration; in addition, they invoke applyPrim instead of applyFn. Otherwise, the
application is similar: the function call reduces to a value, and binding that value to a
local variable will allow us to reduce the remainder of the instructions in the function
under evaluation.
2.5.4 I/O
GETINT is the rule for the hardware-defined input function; it is invoked when a let
instruction uses getint in a program, which takes a single argument n1. The premises
state that reading a value from port n1 should return an integer n2; binding that value
to a local variable, we can proceed with evaluation.
PUTINT is the rule for the output function (also hardware-defined); it takes two
values: a port number n1, and an arg that should be output. The premises use the
environment to reduce arg to a value; not stated (as it is a side effect) is that the
hardware sends this integer value to the indicated port. The integer sent is also used
as the value to which the instruction reduces, so it is bound to a local variable, and
evaluation proceeds.
2.5.5 Case
We use two rules for the evaluation of case instructions (CASE-CON and CASE-LIT,
for constructor and integer scrutinees, respectively); in addition, the rule CASE-ELSE
handles the “else” branch for each of the two case varieties.
These CASE rules are invoked when a case instruction is under evaluation; which
rule is used depends on what the scrutinee reduces to: if it is a constructor, CASE-CON
is used, while CASE-LIT handles integers. A case instruction includes a series of zero
or more branches, each of which has a constructor name or integer as a guard, and
one else branch. Exactly one branch will always execute. Since constructor “names”
become unique integers, constructor and integer matches must be encoded differently
to distinguish which variety each branch is meant to match.
The first premise of CASE-CON indicates that the scrutinee arg must reduce to a
constructor; the second says to take the expression from the branch with the matching
constructor name and use that for the remainder of the function, which will reduce to a
value. CASE-LIT is similar, but requires the arg to reduce to an integer, and takes the
expression from the branch that attempts to match exactly that integer.
The CASE-ELSE rule is invoked when no matching branch is found for the case in-
struction (indicated in the disjuncted premises); there, it indicates to take the expression
from the else branch in the instruction and use that for the remainder of the function.
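The mutual exclusivity of these three rules can be sketched as a small dispatch function. This is a toy model (our assumption, not the hardware's encoding): constructors are represented as (name, fields) pairs, and branch guards are either integers for literal matches or constructor names for constructor matches:

```python
# Sketch of case-instruction dispatch: exactly one of CASE-CON,
# CASE-LIT, or CASE-ELSE applies, depending on the scrutinee.

def eval_case(scrutinee, branches, else_body):
    """branches: list of (guard, body) pairs.  Returns (body, bindings)
    for the single branch that fires; bindings are the constructor
    fields bound in a CASE-CON branch."""
    if isinstance(scrutinee, tuple):            # CASE-CON
        name, fields = scrutinee
        for guard, body in branches:
            if guard == name:
                return body, fields
    else:                                       # CASE-LIT
        for guard, body in branches:
            if guard == scrutinee:
                return body, []
    return else_body, []                        # CASE-ELSE

# A constructor scrutinee matching the "cons" branch:
body, binds = eval_case(("cons", [1, ("nil", [])]),
                        [("nil", "nil-body"), ("cons", "cons-body")],
                        "else-body")
print(body)   # -> cons-body
```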
2.5.6 apply Helper Functions
applyFn is perhaps the most complicated rule, because it is where currying is
handled — different actions must be taken if too few, too many, or just enough arguments
are supplied to a function application. The helper rule takes a closure and a list of zero
or more values, which will be arguments to the closure.
1. If zero arguments are supplied, and there are exactly enough values already
saved and ready to be applied in the closure, then simply feed those values as the
arguments to the function, reducing the function body to a value, and return that.
2. If zero arguments are supplied, and there are not enough saved values in the
closure, return the same closure.
3. If at least one argument is supplied, and there are not enough saved values in the
closure, recursively call into applyFn, taking the first argument from the list and
appending it to the list of saved values. Either the argument list will run out before
the function is saturated (resulting in case 2), or the function will eventually be
saturated (resulting in case 1 or 4).
4. If at least one argument is supplied, and there are exactly enough saved values
already in the closure, then we evaluate the closure (which must, if the program
is well-typed, result in another closure), then recursively call applyFnwith the
new closure and the same argument list.
applyCn handles applications of arguments to constructors. The first rule is if exactly
enough arguments are applied to the constructor application; in that case, a constructor
containing those values is built and returned. The second handles the case where
fewer values were supplied than expected (partially saturating a constructor); in this
case, a closure is returned to capture the values already supplied, and when additional
values are applied, it will invoke applyCn again to check if the constructor has all fields
necessary to be built. We note that this appears to be dynamically creating syntax
(the rule indicates placing a let instruction into the created closure), but as every
constructor has a finite number of fields, these functions are all known statically.
applyPrim handles evaluation of primitive operations. The first case is invoked
when the correct arguments have been supplied for that particular operation, and
simply evaluates the operation according to the rule for 32-bit modular arithmetic,
returning the answer. Similar to applyCn, we must account for the case where the
function did not receive enough arguments; as there, we create a new closure to capture
the arguments supplied, and the operation can be evaluated once all arguments are
received (the second case).
The ρ(arg) = ... helper function is for convenience, simply stating that if arg is an
integer, return that value; otherwise, it must be a name mapped to some value in the
environment, in which case that value should be returned.
2.6 Verification
We separate the verification of the embedded ICD application into three parts:
verification of the correctness of the ICD coroutine, a timing analysis to show that the
assembly meets timing requirements in the worst case, and a proof of non-interference
between the trusted ICD coroutine and untrusted code outside of it.
2.6.1 Correctness
We first implement a high-level version of the application’s critical algorithms (the
ECG filters and ATP procedure) in Gallina, the specification language of the Coq
theorem prover [73], using this version as our specification of functional correctness.
This specification operates on streams — a datatype that represents an infinite list —
by taking a stream as input and transforming it into an output stream. By sticking to a
high-level, abstract specification, we can be more confident that we have specified the
algorithm correctly. An ICD implementation cannot operate on streams, as not all data is
immediately available; instead, it takes a single value, yields a single value, and
then repeats the process.
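The relationship between the two forms can be illustrated in Python. A toy running sum stands in for the ECG filter chain (our assumption for illustration); the spec is a transformer over infinite streams, while the implementation is a one-value-in, one-value-out step function with explicit state:

```python
from itertools import islice

# High-level "stream" specification: a generator transformer.
def spec(stream):
    acc = 0
    for x in stream:
        acc += x            # toy stand-in for the ECG filter chain
        yield acc

# Low-level step implementation: one value in, one value out, plus
# explicit state -- the form an embedded device can actually run.
def step(state, x):
    state = state + x
    return state, state

def check_refinement(n=100):
    """Check the first n outputs of the stream spec against repeated
    application of the step function.  (The Coq proof establishes this
    for all inputs and all n, by induction and coinduction.)"""
    inputs = list(range(n))
    expected = list(islice(spec(iter(inputs)), n))
    state, actual = 0, []
    for x in inputs:
        state, out = step(state, x)
        actual.append(out)
    return expected == actual

print(check_refinement())   # -> True
```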
The form of the correctness proof is by refinement: first, we create a Coq imple-
mentation of the ICD algorithm that is “lower-level” than the Coq specification. This
lower-level implementation operates on machine values rather than streams, isolates
function applications to let expressions, and avoids the use of “if-then-else” expressions,
among other trivially-resolved differences. We then create an extractor that converts
Figure 2.7: Extraction of verified application components, summarized for a small
excerpt. (a) The high-level Coq specification is written to operate on Streams (infinite
lists); values are pulled from the front of the stream. (b) An intermediate version is
written in Coq which operates on integers instead of streams, and unfolds nested
operations so each function call and arithmetic operation takes one line. This
intermediate version is proven equivalent in Coq to the high-level specification —
meaning that repeated recursive application of (b) will always output the same sequence
of values as (a). (c) A simple extractor just replaces the keywords in (b) to produce
valid assembly code that can run on Zarf.
this lower-level Coq code directly into executable Zarf functional assembly code (see
Figure 2.7). If, for all possible input streams, we can prove that the output stream
produced by the high-level Coq specification is the same sequence of values produced
by the lower-level implementation, we can conclude that the program we run on Zarf
is faithful to the high-level Coq specification. This proof of equivalence between the
two Coq implementations is done by induction and coinduction over the program,
showing that if output has matched up to point N, and the computation of value N is
equivalent, then value N + 1 will be equivalent as well. As compared to extracting for
an imperative architecture, we avoid needing to compile functional operations to an
imperative ISA and do not require a large software runtime — or any software runtime
at all. The translation simply replaces Coq keywords with Zarf assembly keywords,
which is possible because the low-level Coq specification is in the A-normal form that
Zarf requires. For example, the Coq keyword CoFixpoint would be textually replaced
with fun, match replaced with case, etc.
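A toy version of such an extractor is shown below. The two-entry keyword table reflects only the substitutions named above; the real extractor's table is larger, and a robust version would respect token boundaries rather than using plain substring replacement:

```python
# Toy sketch of the keyword-substitution extractor.  Because the
# low-level Coq code is already in the A-normal form Zarf requires,
# extraction reduces to a textual mapping of keywords.

KEYWORDS = {
    "CoFixpoint": "fun",
    "match": "case",
}

def extract(coq_line):
    for coq_kw, zarf_kw in KEYWORDS.items():
        coq_line = coq_line.replace(coq_kw, zarf_kw)
    return coq_line

print(extract("CoFixpoint lpf_rec y2 y1 ..."))   # -> fun lpf_rec y2 y1 ...
```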
We begin constructing our proof by first defining the relevant datatypes and expres-
sions of the ISA as inductively defined mathematical objects:
Inductive data : Set :=
| local : nat -> data
| arg : nat -> data
| caseValue : data
| caseField : nat -> data
| literal : Z -> data
| function : string -> data
| functionApplication : string -> list data -> data.
Inductive zarf_inst : Set :=
| apply' : data -> list data -> zarf_inst
| case : data -> list alt -> zarf_inst
| ret : data -> zarf_inst
with alt : Set :=
| literalAlt : Z -> list zarf_inst -> alt
| constructorAlt : string -> list zarf_inst -> alt.
Inductive zarf_table : Set :=
| function_table : string -> nat -> list zarf_inst -> zarf_table
| constructor_table : string -> nat -> zarf_table.
We’re then able to define a small interpreter that executes the instruction semantics
(several constructors have been omitted for brevity and clarity of presentation):
Inductive zarf_run_function' : string -> list data -> list data -> data ->
list zarf_inst -> data -> Prop :=
| H_apply' x res ys zs (args:list data) f locals caseVal :
zarf_run_function' f args (locals ++ execApply x ys args locals caseVal) caseVal
zs res -> zarf_run_function' f args locals caseVal (apply' x ys :: zs) res
...
| H_matchesLiteral xrep z zs alts d res (l:list zarf_inst) f args locals caseVal x :
getData x args locals caseVal = Some xrep ->
matchesCase xrep (literalAlt d l) = true ->
zarf_run_function' f args locals xrep l res ->
zarf_run_function' f args locals caseVal
(case x (literalAlt d l :: z :: alts) :: zs) res
| H_ret x args locals caseVal f xrep zs : getData x args locals caseVal = Some xrep
-> zarf_run_function' f args locals caseVal (ret x :: zs) xrep .
Inductive zarf_run' (t : zarf_table) (args : list data) : data -> Prop :=
| Run_function f arity insts res : t = function_table f arity insts ->
zarf_run_function' f args [] (function "error") insts res ->
zarf_run' t args res.
The zarf_run_function' interpreter is made up of several cases, each of which
corresponds to a rule in the big-step semantics defined in Figure 2.5. For example, the
first case, H_apply', defines what function application means. It says that if we add
a thunk to our local environment that is the result of executing the apply operation
(locals ++ execApply x ys args locals caseVal), and then execute the rest of the
instructions zs, we have successfully executed an apply statement followed by the rest
of the instructions zs.
We declare several axioms asserting the correctness of the ISA’s built-in functions and
their equivalence to built-in Coq operations. These built-in functions (like add, multiply,
etc., labelled PrimOp in Figure 2.2) are ultimately the base operations employed by all
user-defined functions, and we reference them during the proof of correctness of the
higher-level ECG algorithms later on. We also define a few axioms related to list and
pair construction, whose correspondence to the user-made Zarf function equivalents
is trivial. Here are a few assorted examples:
Axiom cons_rep : forall l' A (x':A) xs', l' == x' :: xs' -> exists x xs,
l' = (functionApplication "cons" [x; xs]) /\ x == x' /\ xs == xs'.
Axiom snd_same : forall (A B : Type) x (x' : A*B), x == x' ->
functionApplication "snd" [x] == snd x'.
Axiom add_same : forall x y x' y', x == x' -> y == y' -> x + y ==
functionApplication "add" [x'; y'].
We can then begin defining and proving things about non-builtin functions, like
append, reverse, index, and various matrix multiplication operations. Here's an example
of defining the simplest user-defined function, id, which is used often in Zarf assembly
for assigning a constant numeric value to a variable (because let expressions purely
operate over function identifiers (see Figure 2.2)). Proof bodies are omitted for brevity:
Definition zarf_id : zarf_table :=
function_table "id" 1 [ret (arg 0)].
Axiom id_eta : forall x res, zarf_run' zarf_id [x] res ->
functionApplication "id" [x] = res.
Theorem equivalence_id :
forall A (x:A) x', x == x' ->
exists res, zarf_run' zarf_id [x'] res /\ res == x.
Lemma id_same :
forall (A : Type) x (x' : A), x == x' ->
functionApplication "id" [x] = x.
We then proceed to define several ECG helper functions, like the low pass filter, at
both a high-level stream-based abstraction and the lower-level implementation which
operates on machine values, and verify their equivalence (and thus the correctness of
the machine’s implementation of the specification):
(* Define Zarf implementation of the low pass filter as a series of Zarf instr. *)
Definition zarf_lpf : zarf_table :=
function_table "lpf" 0 [ apply' (function "lpf_rec") ...<arguments>...].
Definition zarf_lpf_rec : zarf_table :=
function_table "lpf_rec" 13 [
apply' (function "mult") [arg 1; literal 2];
...
apply' (function "tuple2") [local 7; local 6];
ret (local 8)
].
(* Define the high-level Coq stream-based specification of the low pass filter *)
CoFixpoint low_pass_filter_rec (xs: Stream Z) (ys: list Z) : (Stream Z) := ...
(* Declare that the result of running the "lpf_rec" Zarf function
is the same as the one that results from running the low-pass filter function
through our Coq-defined Zarf interpreter *)
Axiom lpf_rec_eta : forall xs' res, zarf_run' zarf_lpf_rec xs' res ->
functionApplication "lpf_rec" xs' = res.
(* Define the non-stream-based LPF function *)
CoFixpoint low_pass_filter_rec2
(y2 y1 x10 x9 x8 x7 x6 x5 x4 x3 x2 x1 x0: Z) : Tuple2 Z :=
let y1m2 := Z.mul 2 y1 in
let x5m2 := Z.mul 2 x5 in
let t0 := Z.sub y1m2 y2 in
...
(* Prove the two implementations are equivalent *)
Theorem same_low_pass_filter_rec_and_rec2 :
forall xs ret1 ret2
y2 y1 x10' x9' x8' x7' x6' x5' x4' x3' x2' x1' x0',
(forall i, i >= 0 -> i <= 10 -> Str_nth i xs =
nth i [x10';x9';x8';x7';x6';x5';x4';x3';x2';x1';x0'] 0%Z)%nat ->
low_pass_filter_rec xs [y2;y1] = ret1 ->
low_pass_filter_rec2 y2 y1 x10' x9' x8' x7' x6' x5' x4' x3' x2' x1' x0' = ret2
-> StreamTuple2Same 10 xs ret1 ret2.
...
After we’ve proven the same_low_pass_filter_rec_and_rec2 theorem, we can
convert the low-level low pass filter implementation directly into Zarf assembly. In this
example, the assembly version of low_pass_filter_rec2 would look like:
fun lpf_rec y2 y1 x10 x9 x8 x7 x6 x5 x4 x3 x2 x1 x0 =
let y1mult2 = mult y1 2 in
let x5mult2 = mult x5 2 in
let t0 = sub y1mult2 y2 in
...
The preceding snippets of code have been just a sample of the entire set of proofs
we wrote. The full proofs of correctness of the assembly-level critical ECG and ATP
functions take under 2,500 lines of Coq. The implementations are converted line-for-line
into Zarf assembly code, which is combined with assembly for the microkernel and
other coroutines.
In total, the Trusted Code Base for the correctness proof includes: the hardware, the
Coq proof assistant, and the small extractor that converts the low-level Coq code into
Zarf functional assembly code. All other code is untrusted and may be incorrect, and
the proof will still hold. The high-level ISA and clearly-defined semantics make this
very small TCB possible, allowing the exclusion of language runtimes, compilers, and
associated tooling that is frequently present in the TCB in verification efforts.
2.6.2 Timing
With knowledge of how the Zarf hardware executes each instruction, we create
worst-case timing bounds for each operation. In general, in a functional setting, un-
bounded recursion makes it impossible to statically predict execution time of routines.
Though our application uses infinite recursion to loop indefinitely, the goal is to show
that each iteration of the loop meets the real-time deadline; within that loop, each
coroutine is executed only once, and no functions call into themselves. This allows
us to compute a total worst-case execution time for the sum of all the instructions by
extracting the worst-case route through the hardware state machine to execute each
possible operation. For example, applying two arguments to a primitive ALU function
and evaluating it has a maximum runtime of 30 cycles — this includes the overhead of
constructing an object in memory for the call, performing a function call, fetching the
values of the operands, performing the operation, marking the reference as “evaluated”
and saving the result, etc. In an average case, only a fraction of the possible overhead
will actually be invoked (see Section 2.7 for CPI averages).
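Since each coroutine runs once per iteration and no function calls into itself, the loop-level bound is simply the sum of per-instruction worst cases. A minimal Python sketch of this accounting follows; only the 30-cycle ALU-application figure comes from the text, and the other cycle counts and the instruction trace are illustrative placeholders:

```python
# Worst-case cycles for each operation, extracted from the worst route
# through the hardware state machine. Only the 30-cycle ALU application
# is from the text; the other entries are hypothetical.
WORST_CASE_CYCLES = {
    "apply_alu_prim": 30,   # apply two args to a primitive ALU function
    "case_scrutinee": 12,   # hypothetical
    "result": 14,           # hypothetical
}

def loop_wcet(instruction_trace):
    """Sum worst-case cycles over one loop iteration.

    Valid only because each coroutine executes once per iteration and no
    function is self-recursive, so the instruction count is statically known.
    """
    return sum(WORST_CASE_CYCLES[op] for op in instruction_trace)

# One hypothetical iteration: two ALU applications, a case, and a result.
print(loop_wcet(["apply_alu_prim", "apply_alu_prim",
                 "case_scrutinee", "result"]))  # 86
```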
Hardware garbage collection is a complicating factor on timing. GC can be config-
ured to run at specific intervals or when memory usage reaches a certain limit; for our
application, to guarantee real-time execution, the microkernel calls a hardware function
to invoke the garbage collector once each iteration. To reason about how long the
garbage collection takes, we bound the worst-case memory usage of a single iteration
of the application loop. The hardware implements a semispace-based trace collector, so
collection time is based on the live set, not how much memory was used in all. For the
trace-collector state machine, each live object takes N+4 cycles to copy (for N memory
words in the object), and it takes 2 cycles to check a reference to see if it’s already been
collected. We bound the worst-case by conservatively assuming that all the memory
that is allocated for one loop through the application might be simultaneously live at
collection time, and that every argument in each function object may be a reference
which the collector will have to spend 2 cycles checking.
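The collector's cost model (N + 4 cycles to copy an N-word object, 2 cycles per reference check) turns into a bound mechanically. A sketch in Python, using a hypothetical allocation profile rather than the application's actual worst-case allocations:

```python
def gc_worst_case_cycles(objects, references_checked):
    """Bound collection time for the semispace trace collector.

    objects: list of object sizes in memory words; conservatively, every
    object allocated during one loop iteration is assumed to be live.
    references_checked: number of arguments conservatively assumed to be
    references, each costing 2 cycles to check.
    """
    copy_cycles = sum(n + 4 for n in objects)   # N + 4 cycles per object
    check_cycles = 2 * references_checked       # 2 cycles per reference
    return copy_cycles + check_cycles

# Hypothetical profile: ten 3-word objects and 25 reference checks.
print(gc_worst_case_cycles([3] * 10, 25))  # (3+4)*10 + 2*25 = 120
```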
From the static analysis, we determine that the worst execution of the entire loop
is 4,686 cycles, not including garbage collection. Garbage collection is bounded by a
worst-case of 4,379 cycles, making a total of 9,065 cycles to run one iteration of the
system — or 181.3 µs on our FPGA-synthesized prototype running at 50 MHz, falling well within
the real-time deadline of 5 ms.
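These figures are easy to check arithmetically; for instance, in Python:

```python
loop_cycles = 4_686          # worst-case application loop (from the analysis)
gc_cycles = 4_379            # worst-case garbage collection
total = loop_cycles + gc_cycles
assert total == 9_065

cycle_time_s = 1 / 50e6      # 50 MHz -> 20 ns per cycle
iteration_s = total * cycle_time_s
print(round(iteration_s * 1e6, 1))         # 181.3 (microseconds)

deadline_s = 5e-3            # 5 ms real-time deadline
print(round(deadline_s / iteration_s, 1))  # ~27.6x margin
```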
2.6.3 Non-Interference
Because the ICD coroutine has been proven correct (Section 2.6.1), we treat its output
as trusted. This output must then travel through the rest of the cooperative microkernel
until it reaches the outside world via the I/O coroutine’s putint primitive. In order to
guarantee the integrity of this data (meaning it is never corrupted nor influenced by
less-trusted data), we rely on a proof of non-interference. Non-interference means that
“values of variables at a given security level ℓ ∈ L can only influence variables at any
security level that is greater than or equal to ℓ in the security lattice L” [74]. In a standard
security lattice, L (low-security) ⊑ H (high-security), meaning that high-security data
does not flow to (or affect) low-security output. In our application, however, we are
concerned with integrity; our lattice is composed of two labels, T (trusted) and U
(untrusted), organized such that T ⊑ U. Therefore, our integrity non-interference
property is that untrusted values cannot affect trusted values [16].
To prove this about Zarf, we create a simple integrity type system that provides a set
of typing rules to determine and verify the integrity type of each expression, function,
and constructor in a program. After providing trust-level annotations in a few places
and constraining the normal Zarf semantics slightly to make type-checking much easier,
we can run a type-checker over the resulting Zarf code to know whether it maintains
data integrity. We extend the original Zarf syntax to allow for these type annotations,
as follows:
ℓ, pc ∈ Label ::= T | U
τ ∈ Type ::= numℓ | (cn, τ⃗) | (τ⃗ → τ)
func ∈ Function ::= fun fn x1 :τ1, . . . , xn :τn :τ = e
cons ∈ Constructor ::= con cn x1 :τ1, . . . , xn :τn
Specifically, following the spirit of Abadi et al. [11] and Simonet [75], types are
inductively defined as either labeled numbers, or functions and constructors composed
of other types. Our proof of soundness on this type system follows the approach done
in work by Volpano et al. [12]. We show that if an expression e has some specific type
τ and evaluates to some value v, then changing any value whose type is less-trusted
than e’s type results in e evaluating to the same value v; thus, we show that arbitrarily
changing untrusted data cannot affect trusted data. We prove soundness case-wise over
the three types of expressions in our language, combining our evaluation semantics
with our security typing rules.
Integrity Type System
The integrity type system is found in Figure 2.8. Our integrity lattice is composed of
two elements, T and U (trusted and untrusted, respectively), such that T < U (opposite
of a normal security lattice). We extended the Zarf ISA by requiring function and
constructor type annotations. Constructors, which previously were untyped, are now
singleton types: each constructor declaration defines a type, but that constructor is
the sole inhabitant of the type. This restriction eliminates case expressions as sources
of control flow when casing on a constructor type (since we know statically that the
case expression’s scrutinee will be a single unique value, and therefore also statically
know which branch will be taken); note that this also eliminates the consideration of
the else branch in a case expression on a constructor type. Instead, case expressions in
this lightly-typed Zarf are solely for binding a constructor’s internal values to variables
(via deconstruction). Though this causes a loss in expressive power in the general case
(constructors must be singleton types), our microkernel was designed without the need
for this type of control flow.
case expressions whose scrutinee is a number, however, still allow for control flow
ℓ, pc ∈ Label ::= T | U    τ ∈ Type ::= numℓ | (cn, τ⃗) | (τ⃗ → τ)    Γ ∈ Env = Identifier → Type

    Γ[i1 ↦ τ1, . . . , in ↦ τn] ⊢ e : τ
  ─────────────────────────────────────────────────────────────  (func)
    Γ ⊢ (fun fn i1 :τ1, . . . , in :τn :τ = e) : (τ1, . . . , τn) → τ

    τ1 = Γ(id)    τ⃗2 = Γ(arg⃗)    τ3 = applyType(τ1, τ⃗2)    Γ[x ↦ τ3] ⊢ e : τ4
  ─────────────────────────────────────────────────────────────  (let)
    Γ ⊢ let x = id arg⃗ in e : τ4

    (cn, τ⃗1) = Γ(arg)    (cn x⃗ ⇒ e1) ∈ br⃗    Γ[x⃗ ↦ τ⃗1] ⊢ e1 : τ2
  ─────────────────────────────────────────────────────────────  (case-cons)
    Γ ⊢ case arg of br⃗ else e0 : τ2

    numℓ = Γ(arg)    Γ ⊢ e1 : τ1  . . .  Γ ⊢ en : τn    Γ ⊢ e0 : τ0    τ = (τ0 ⊔ τ1 ⊔ . . . ⊔ τn) • ℓ
  ─────────────────────────────────────────────────────────────  (case-lit)
    Γ ⊢ case arg of n1 ⇒ e1 . . . nn ⇒ en else e0 : τ

    τ = Γ(arg)
  ─────────────────────  (result)
    Γ ⊢ result arg : τ

    Γ[x ↦ numT] ⊢ e : τ
  ─────────────────────────────────  (getint)
    Γ ⊢ let x = getint n in e : τ

    numℓ = Γ(arg)    Γ[x ↦ numℓ] ⊢ e : τ
  ─────────────────────────────────────  (putint)
    Γ ⊢ let x = putint n arg in e : τ

Figure 2.8: Integrity typing rules. A type is inductively defined as either a labelled number, a singleton constructor, or a function constructed of these types. The type environment maps variables, function, and constructor names to types. Since all functions are annotated with their types, type checking proceeds by ensuring that the return type of a function is the same as the type deduced by checking the function’s body expression with the function’s parameter types added to the type environment. ⊔ denotes the join of two types, and • denotes the joining of a type’s integrity label with another.
applyType((τ⃗1 → τ), τ⃗2) =
    τ                           if |τ⃗1| = 0 and |τ⃗2| = 0
    (τ⃗1 → τ)                    if |τ⃗1| > 0 and |τ⃗2| = 0
    applyType((τ⃗3 → τ), τ⃗4)     if τ⃗1 = τ1 :: τ⃗3, τ⃗2 = τ2 :: τ⃗4, and τ2 ≤ τ1
    applyType((τ⃗3 → τ4), τ⃗2)    if |τ⃗1| = 0, |τ⃗2| > 0, and τ = (τ⃗3 → τ4)

Γ(n) = numpc

Γ(id) =
    (τ⃗ → (cn, τ⃗))    if id = cn and con cn x⃗ : τ⃗ ∈ decl⃗
    (τ⃗ → τ)          if id = fn and fun fn x⃗ : τ⃗ :τ = e ∈ decl⃗
    τ                if id = x and (x ↦ τ) ∈ Γ

Figure 2.9: Integrity typing rule helpers. Γ is a helper function that gets the type of an argument, and applyType applies a function type to argument types. Applying a helper function that takes one argument to a list of arguments is shorthand for mapping that function over the list.
numℓ ⊔ numℓ′ = numℓ⊔ℓ′
(cn, τ⃗) ⊔ (cn, τ⃗) = (cn, τ⃗)
(τ⃗ → τ) ⊔ (τ⃗′ → τ′) = (τ⃗ ⊔ τ⃗′ → τ ⊔ τ′)
numℓ • ℓ′ = numℓ⊔ℓ′

Figure 2.10: Joining two types. The • operator is used to join a type’s label with another label; if the type that the label is being joined with is not a num, the label will be joined with each of the type’s inner types until a base num is reached. Joining two lists of types is equal to the pairwise join of their elements. Constructor join is trivial because constructors are singletons whose type never changes, and only equal constructors can be compared.
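The join rules of Figure 2.10 can be transcribed directly. Below is a sketch in Python; the tuple encoding of types (("num", label), ("cons", cn, args), ("fun", args, ret)) is my own, not the dissertation’s:

```python
def join_label(l1, l2):
    """Join in the T < U integrity lattice: any U operand taints the result."""
    return "U" if "U" in (l1, l2) else "T"

def join(t1, t2):
    """Join two types per the rules above. Types are modeled as tuples:
    ("num", label), ("cons", cn, args), or ("fun", args, ret)."""
    if t1[0] == "num" and t2[0] == "num":
        return ("num", join_label(t1[1], t2[1]))
    if t1[0] == "cons" and t1 == t2:
        # Singleton constructors: only equal constructors can be joined.
        return t1
    if t1[0] == "fun" and t2[0] == "fun":
        args = tuple(join(a, b) for a, b in zip(t1[1], t2[1]))
        return ("fun", args, join(t1[2], t2[2]))
    raise TypeError("join is only defined on equal base types")

print(join(("num", "T"), ("num", "U")))  # ('num', 'U')
```

Because each branch of a case must type-check to the same base type, the final `TypeError` branch corresponds exactly to the restriction described in the text.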
      ℓ ⊑ ℓ′
  ────────────────  (num)
   numℓ ≤ numℓ′

   τ⃗′ ≤ τ⃗    τ ≤ τ′
  ───────────────────────  (func)
   (τ⃗ → τ) ≤ (τ⃗′ → τ′)

  ────────────────────  (cons)
   (cn, τ⃗) ≤ (cn, τ⃗)

   τ1 ≤ τ2    τ2 ≤ τ3
  ─────────────────────  (tran)
   τ1 ≤ τ3

Figure 2.11: Subtyping rules. One type is a subtype of another if their base types are equal and, in the case of the base num type, the first’s label is lower in the integrity lattice than the other’s. A list is a subtype of another if, pairwise, each element of the first is a subtype of the corresponding element in the other list.
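The subtyping rules of Figure 2.11 translate similarly; note the contravariance of function arguments (τ⃗′ ≤ τ⃗ in the premise of (func)). Another Python sketch, again with an assumed tuple encoding of types:

```python
def label_leq(l1, l2):
    """ℓ ⊑ ℓ' in the integrity lattice, where T ⊑ U."""
    return l1 == l2 or (l1 == "T" and l2 == "U")

def subtype(t1, t2):
    """τ1 ≤ τ2; types are ("num", label), ("cons", cn, args),
    or ("fun", args, ret) — an assumed encoding, not the dissertation's."""
    if t1[0] == "num" and t2[0] == "num":
        return label_leq(t1[1], t2[1])
    if t1[0] == "cons":
        return t1 == t2                  # constructors relate only to themselves
    if t1[0] == "fun" and t2[0] == "fun":
        return (len(t1[1]) == len(t2[1])
                and all(subtype(b, a) for a, b in zip(t1[1], t2[1]))  # contravariant args
                and subtype(t1[2], t2[2]))                            # covariant result
    return False

print(subtype(("num", "T"), ("num", "U")))  # True: trusted usable as untrusted
```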
(since the value of a num is not known ahead of time); therefore, the type of this form
of case expression is the join of all of its branch types. The type of the scrutinee, which
is significant in a security analysis, is here irrelevant — there are no implicit flows for
integrity. Because we do not use union types, another small restriction we enforce is that
each branch in a case expression must result in the same base type (i.e. all must either
type-check to a num, (cn, τ⃗), or τ⃗ → τ), such that we may join them together properly
(see Figure 2.10).
The integrity label associated with a num depends on the integrity level of the code
that created it: untrusted code can only create numbers of type numU, while trusted
code can create trusted numbers (which can be treated as untrusted numbers via
subtyping; see Figure 2.11). Primitive operations (add, subtract, etc.) are treated as
named functions contained within the set of declarations decl⃗. The type of primitive
operators is dependent on the trust level of the caller: for example, the type of add is
numℓ1 → numℓ2 → numℓ1⊔ℓ2⊔pc, where pc represents the trust level of the current program
location (we assume its value can be tracked and changed outside of the type system
proper). This all implies that untrusted code cannot use the primitive operations to
create any type of trusted value (regardless of the types of the numbers an untrusted
caller uses), thus restricting untrusted code’s ability to obtain trusted values to (1) the
getint function (which in our application is data straight from the heart monitor) and
(2) by calling trusted functions which return trusted values.
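To illustrate the primitive typing rule, the result label of add is the join of both operand labels with pc, so untrusted code can never manufacture a trusted number. A small Python sketch (labels as strings, an assumed encoding):

```python
def join_label(l1, l2):
    """Join in the T < U integrity lattice."""
    return "U" if "U" in (l1, l2) else "T"

def add_result_label(l1, l2, pc):
    """Label of add : num_l1 -> num_l2 -> num_(l1 ⊔ l2 ⊔ pc)."""
    return join_label(join_label(l1, l2), pc)

# Trusted operands in untrusted code still yield an untrusted result:
print(add_result_label("T", "T", pc="U"))  # U
# Only trusted code operating on trusted data produces trusted data:
print(add_result_label("T", "T", pc="T"))  # T
```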
This system will verify integrity for a value with singular endpoints — i.e., for the
code being checked, it is received at one point and sent at one point. More complex
annotations and treatment of values, like an arbitrary number of mutually untrusted but
critical values passing through an arbitrary number of trusted and untrusted regions,
can be guaranteed with this type-system via piecewise checking. By guaranteeing each
link in the chain one-at-a-time, the integrity of the chain is verified.
The soundness proof of the integrity type system proceeds by cases on the three
forms of expressions.
Lemma 1 (Case Expression Soundness). If ∀x e0 e1 arg1 arg2 τ0 τ1 cn1 τ⃗1 v1 v2 ρ Γ, where

1. e0 = (case arg1 of br⃗ else e1)

2. Γ ⊢ e0 :τ0 ∧ ρ ⊢ e0 ⇓ v1 ∧ ρ(x) = arg1

3. Γ ⊢ arg2 :τ1 ∧ τ1 > τ0 ∧ ρ ⊢ arg2 ⇓ v2

then ρ[x ↦ v2] ⊢ e0 ⇓ v1.
Proof. 1. If Γ ⊢ arg1 : numℓ, then ∀n e2 . . . em , e0 = (case n of n2 ⇒ e2 . . . nm ⇒
em else e1), and either ℓ = T or ℓ = U. We show that regardless of arg2’s level when
it is of type num, it cannot be changed and therefore e0’s value doesn’t change.
(a) If ∃ℓ1 ∈ τ0 s.t. ℓ1 = T, then by typing rule case-lit and the rule for join, n’s
integrity label is T. Therefore, arg1 cannot both equal n and be arbitrarily
changed to some expression arg2 because it is not an expression whose type
label is less trusted than the type of the entire expression (i.e. numT ̸≥ τ0).
Thus we cannot replace arg1 with arg2, so in this case the value of e0 remains
the same, as desired. Since e1 through em+1 are expressions whose soundness
with respect to the type system can be considered separately through Lemmas
1, 2, and 3, we do not consider them here.
(b) If ∃ℓ1 ∈ τ0 s.t. ℓ1 = U, then by our definition of the T−U integrity lattice, there
can be no values whose type is greater than τ0 (arg1 included) that we can
change. Therefore, e0 remains unchanged, satisfying our conclusion.
2. If Γ ⊢ arg1 : (cn, τ⃗1), then ∀cn3 . . . cnn x⃗3 . . . x⃗n e3 . . . en,
e0 = (case (cn, τ⃗1) of cn3 x⃗3 ⇒ e3 . . . cnn x⃗n ⇒ en else e1).
• We know by the operational semantics (restricted to accommodate this type
system, with singleton constructor types) that which branch we case on is
determined entirely by the constructor that arg1 evaluates to, and not the val-
ues contained within that constructor. Therefore, changing the expressions
within any constructor will result in the same branch being taken, such that
e0 evaluates to the branch’s right-hand-side expression. Therefore, we cannot
choose to replace arg1 with another arbitrary arg2 when Γ ⊢ arg1 : (cn, τ⃗1).
• Let (cn3 x⃗3 ⇒ e3) be the matching branch (where cn = cn3). Based on the
previous bullet point, we know that changing the expressions of any other
branches will not change the value of the entire case expression, so we focus
on this particular branch as an example. We must show that ∀τ3,∃x3 ∈
x⃗3 s.t. Γ ⊢ x3 :τ3 > τ0, changing the value that x3 maps to in ρ does not change
the value that e3 evaluates to; that is, ρ[x3 ↦ v3] ⊢ e3 ⇓ v, where ρ(arg2) = v3.
Since e3 is an expression, its soundness is either covered by Lemma 1 (by
induction) or Lemmas 2 or 3.
□
Lemma 2 (Result Expression Soundness). If ∀x e0 arg1 arg2 τ1 τ2 v1 ρ Γ, where
1. e0 = (result arg1) ∧ ρ ⊢ e0 ⇓ v1
2. Γ ⊢ arg1 :τ1 ∧ arg2 = ρ(arg1)
3. ρ ⊢ arg2 :τ2 ∧ τ2 > τ1 ∧ ρ ⊢ arg2 ⇓ v2
then ρ[x ↦ v2] ⊢ e0 ⇓ v1.
Proof. The result expression is used for wrapping a value into a single expression con-
taining that value. Therefore, changing the value of arg1 to arg2 would change the
resultant value v1 that e0 is given, contradicting our result. As another point, by the
typing rule result, result’s type is precisely the type of arg1, meaning there are no values
within e0 to change that would not cause us to violate (3) above. Therefore, the value
of arg1 must equal the value of arg2, such that the value of e0 cannot change. □
Lemma 3 (Let Expression Soundness). If ∀e0 e1 x id arg1 arg2 arg⃗3 arg⃗4 τ1 τ2 v1 v2 v3 v⃗3 v⃗4 v5 ρ Γ, where

1. e0 = (let x = id arg⃗3 in e1)

2. Γ ⊢ e0 :τ1 ∧ ρ ⊢ e0 ⇓ v1

3. Γ ⊢ id : τ⃗ → τ

4. ρ(arg⃗3) = v⃗3

5. Γ(arg1) = τ2 ∧ τ2 > τ1

6. arg0 ∈ arg⃗3 ∧ arg⃗4 = arg⃗3 − arg0 + arg1

7. ρ(arg⃗4) = v⃗4
8. (id ∈ cons⃗ ∧ applyCn(id, v⃗3) = v2) ∨ (applyFn(id, v⃗3, ρ) = v2)

9. (id ∈ cons⃗ ∧ applyCn(id, v⃗4) = v3) ∨ (applyFn(id, v⃗4, ρ) = v3)
10. v2 = v3
11. ρ[x 7→ v3] ⊢ e1 ⇓ v5
then v1 = v5.
Proof. By cases on id:
1. If id is a primitive function (add, multiply, etc.), then v2 ≠ v3 ⟺ arg⃗3 ≠ arg⃗4. By
the typing rule of primitives, the type τ that the function returns is the least upper
bound of all of its arguments, including arg1, meaning by definition both the
value and type of the primitive operation are entirely dependent on all arguments.
Therefore, there cannot exist an arg2 that allows us to substitute it for arg1 whose
type is less trusted than τ without changing the entire value v1.
2. If id is a constructor, then id has the type τ⃗ → (cn, τ⃗). id’s return type is determined
statically and does not change throughout program execution. Therefore, there
does not exist a subexpression in arg⃗3, or more generally in e0, that can be changed
without changing the type of the constructor, which would contradict our having
the same values after evaluation.
3. If id is a non-recursive function composed solely of case and result expressions
and applications of primitive functions and constructors used in let expressions,
then by (1), (2), Lemmas 1 and 2 and induction on Lemma 3, we know id must
be sound. By extension, if id calls a function that fulfills these requirements, one
can unfold the called function’s contents in order to see that the resultant value v2
satisfies this case.
4. If id is a recursive function or calls a function which calls id (i.e. mutual recursion),
it is possible that the function call never terminates and therefore never results
in a single value. The soundness of e0 must then be guaranteed via induction on
possible expressions, proven in the previous lemmas. We know statically that the
type of id is of the form τ⃗ → τ, so we are guaranteed via simplification rules in
the apply helper functions that the types of arg⃗3 must be equal to or subtypes of τ⃗, or
otherwise our operational semantics would get stuck. By induction, any recursive
calls made in e1 must also satisfy this lemma, meaning that the actual arguments arg⃗3 are used properly, otherwise e0 wouldn’t type-check to type τ1 by getting
stuck.
By proving that v2’s value does not change when less-trusted values change, we can
safely continue with the evaluation of e1, which will be a case, result, or let, all of which
are handled in Lemma 1, Lemma 2, and Lemma 3, respectively. □
Theorem 1 (Integrity Type System Soundness). Our integrity type system is sound if, given
some expression e of type τ which evaluates to some value v, we can show that we can arbitrarily
change any (or all) expressions in e which are less trusted than τ so that e still evaluates to v;
i.e., untrusted data does not affect trusted data.
Formally, if ∀e1 e2 e3 e4 τ1 τ2 τ3 v ρ Γ, where
1. ρ ⊢ e1 ⇓ v ∧ Γ ⊢ e1 :τ1
2. e2 ∈ subexprs(e1) ∧ Γ ⊢ e2 :τ2 ∧ τ2 > τ1
3. Γ ⊢ e3 :τ3 ∧ τ3 ≥ τ2
then ρ ⊢ e1[e3/e2] ⇓ v
Proof. There are just three types of expressions: let, case, and result. By Lemma 1, we
show that case expressions (the vehicle for control-flow) are sound. By Lemma 2,
we show that result expressions are sound. Likewise, by Lemma 3, we show that let
expressions (the vehicle for function application) are sound. Thus, we have exhaus-
tively shown soundness of all expressions. Furthermore, we can see that when these
expressions are composed according to the abstract syntax, with the additional typing
annotations and a few restrictions, any well-typed Zarf program has the property of
non-interference with respect to integrity, even while using a simplistic type system
such as that explained here. □
2.6.4 Programmer Responsibility
We have demonstrated that there are varying degrees of responsibility a Zarf pro-
grammer can take when writing their application, each involving greater effort. The
first is doing the minimum: the programmer writes their program in Zarf assembly.
A major advantage of Zarf is that the application automatically gains the benefits of
memory and control-flow safety inherent in the ISA, properties that other ISAs don’t
easily offer. Any well-formed application that runs on Zarf gets these properties without
any additional programmer involvement.
The second degree of responsibility that can be taken is writing the application’s
specification in Coq and automatically lowering it to Zarf to prove its correctness. This
approach involves a non-trivial amount of proof-writing, but since the ISA resembles
the language of verification very closely, we argue that the amount of work involved
relative to doing so over other imperative ISAs is significantly less. Since high-level
specification and verification of critical applications is common practice, this level of
programmer responsibility is not unusual. Fortunately, any future proof efforts might
not need to be entirely application-specific. Given the exercise of proving the correctness
of the Zarf implementation of the ICD algorithm, we now have a set of theorems and
proofs showing the equivalence between common user-made Zarf functions and Coq
versions. It is conceivable that verification in Coq of future Zarf applications could
reuse this underlying work.
The third degree of responsibility involves proving additional properties over the
system, beyond the aforementioned safety and correctness. We demonstrated this by
laying a security type system over the ISA, somewhat restricting it (like all type systems
are wont to do) in exchange for the added property of non-interference. Because
the process of writing a type system and checker are sufficiently general, we can see
additional type systems or analyses being made over the base Zarf ISA relatively easily.
Finally, there is the issue of determining which parts of an application should go
into each hardware execution realm. Zarf has two execution realms due in part to the
assumption that users might want to include legacy or high performance, non-critical
code; this code can run on the imperative ISA. However, any activity providing critical
functionality for safe operation should happen in the functional processor. Veridrone
[76] is an example of another project beyond our ICD that might benefit from this
approach; that project uses both a lower-performance core safety control system and
a higher-performance unverified version that is more energy-efficient and allows for
smoother flying.
2.7 Evaluation
To validate our designs, we download the Zarf hardware specification onto a Xil-
inx Artix-7 FPGA and run our sample application. For comparison, we also run a
completely unverified C version of the application on a Xilinx Microblaze on the same
Resource     Zarf             MicroBlaze
LUTs         4,337            1,840
FFs          2,779            1,556
Cycle Time   20 ns (50 MHz)   10 ns (100 MHz)

Table 2.1: Resource usage of Zarf and a basic MicroBlaze (3-stage pipeline), the two layers of Zarf, when synthesized for a Xilinx Artix-7 FPGA. In total, the logic of Zarf uses 29,980 gates.
FPGA. Hardware synthesis results are summarized in Table 2.1.
The hardware description of Zarf is more complex than a simple embedded CPU,
with 66 total states of control logic (4 deal with program loading, 15 with function
application, 18 with function evaluation, and 29 with garbage collection). In all, the
combinational logic takes 29,980 primitive gates (roughly the size of a MIPS R3000), or
4,337 LUTs when synthesized for an Artix-7 FPGA (less than 7% of the available logic
resources). Estimated at 130 nm, the combinational logic takes up 0.274 mm². Though
larger than very simple embedded CPUs, Zarf is still quite a bit smaller than many
common embedded microcontrollers.
From a dynamic trace of several million cycles, the ICD application exhibited the
following average CPI for each instruction type. Let instructions had an average of
5.16 arguments and took on average 10.36 cycles. Case instructions averaged at 10.59
cycles; each branch head in a case takes exactly 1 cycle to check if the branch matches.
Results took 11.01 cycles on average. The total dynamic CPI across the trace was
7.46 (or 11.86 if garbage collection time is included). Approximately one third of the
dynamic instructions were branch heads.
The C version of the ICD application on the MicroBlaze takes fewer than one thou-
sand cycles for each iteration of the application. The analysis in section 2.6.2 discusses
the worst-case runtime of the Zarf application, which is around 9,000 cycles or 180
µs (though much faster in the typical case). This is in addition to a longer cycle time
(see Table 2.1). When compared to the carefully optimized and tiny MicroBlaze, our
experimental prototype uses approximately twice the hardware resources, and the
application is around 20x slower in the worst case than the MicroBlaze in the common
case — but is still over 25 times faster than it needs to be to meet the critical real-time
deadlines, all while adding invaluable guarantees about the correctness of the most
critical application components and assurance of non-interference between separate
functions.
2.8 Conclusion
As computing continues to automate and improve the control of life-critical systems,
new techniques which ease the development of formally trustworthy systems are sorely
needed. The system approach demonstrated in this work shows that deep and composable
reasoning directly on machine instructions is possible when the architecture is amenable
to such reasoning. Our prototype implementation of this concept uses Zarf to control the
operation of critical components in a way that allows assembly-level verified versions of
critical code to operate safely in close partnership with more traditional and less-verified
system components without the need to include run-times and compilers in the TCB.
We take a holistic approach to the evaluation of this idea, not only demonstrating its
practicality through an FPGA-implemented prototype, but furthermore showing the
successful application of three different forms of static analysis at the assembly level of
Zarf.
As we move to increasingly diverse systems on chip, heterogeneity in semantic
complexity is an interesting new dimension to consider. A very small core supporting
highly critical workloads might help ameliorate critical bugs, vulnerabilities, and/or
excessive high-assurance costs. A core executing the Zarf ISA would take up roughly
0.002% of a modern SoC. Our hope is that this work will begin a broader discussion
about the role of formal methods in computer architecture design and how it might be
embraced as a part, rather than an afterthought, of the design process.
Chapter 3
Bouncer: Static Program Analysis in
Hardware
3.1 Introduction
The demand for more connectivity and richer interactions in everyday objects means
that everything from light bulbs to thermostats now contains general-purpose micro-
processors for carrying out fairly straightforward and low-performance tasks. Left
unanalyzed, these systems and their associated software stacks can be expected to hold
a seemingly endless collection of opportunities for attack. Static analysis provides pow-
erful tools to those wishing to understand or limit the set of behaviors some software
might exhibit. By facilitating sound reasoning over the set of all possible executions,
this type of analysis can identify important classes of behavior and prevent them from
ever happening. If embedded system developers simply never released software that
failed, such that those well-analyzed applications were the only things to ever execute
on platforms under our control, many of the bugs and vulnerabilities that plague our
lives would be eliminated. Unfortunately, realizing this in practice has proven incredibly
hard due to pressure to market, pressure to reduce cost, and the delayed and stochastic
cost associated with vulnerabilities and bugs.
While larger software companies might be more trusted to rigorously verify their
software releases, the embedded systems market has a long and heavy tail of providers
with a much wider distribution of expertise and resources at their disposal. When we
bring an embedded device into our home or business, how can we have confidence that
the software running there (which depends on chains of control well outside our ability
to observe) is “above the bar” for us? Seemingly innocuous issues, for example passing
a string instead of an integer, can open the door for an attacker to gain root privileges
and serve as a base for other attacks (exactly this happened already in a class of WiFi
routers [77]). Similar attacks targeting embedded devices and firmware updates have
succeeded on everything from printers [78] to thermostats [79].
The basic research question we ask in this paper is: is it possible to make forms
of static analysis an intrinsic part of executing on a microprocessor? In other words,
we examine a machine that will guarantee at the hardware level that any and all code
executing on it is bound to the constraints imposed by a given static program analysis.
This moves the decision to do a proper analysis away from those that push software
updates (who may be making decisions about updates many years removed from the
original purchase) to the decision to purchase and deploy a particular hardware device
itself.
Such a machine would reject any attempt to load it with code that fails to meet
the specified “bar,” independent of who wrote it, who signed it, how it was managed,
or where the software came from. The trust one could put in aspects of execution on
such a processor could be independent of measurement, attestation, or other active
third-party evaluation. By doing the checks in hardware, we can make them intrinsic to
the device’s functionality: the checks will be fully live right from power-up; the checks
will require no dependency on other software on the system functioning correctly (zero
TCB); and if properly designed, they will be directly wired into the operation of the
system, making them provably impossible to bypass.
As this is the first approach to propose and evaluate fully-hardware implemented
static analysis, there are two big open questions: a) is it even possible to do a useful
static analysis in hardware, and b) what would the costs of such an analysis be in terms
of time or area? We answer these questions through the hardware development of a
new module, the Binary Exclusion Unit (which we call “the bouncer” more informally),
capable of scanning and rejecting program binaries right as they are streamed onto the
device. Specifically, we make the following contributions:
• We introduce hardware static binary analysis and show that it can be implemented
in a way that can never be circumvented through some clever manipulation of
software (e.g. a compromised set of keys, a bug in the operating system, or a
change in the boot ordering).
• We describe a method of static analysis co-design where the checking algorithm
is modified to be more amenable to hardware implementation while maintaining
correctness and efficiency.
• We demonstrate that the analysis, in conjunction with the functional ISA, ensures
all executions are free of memory and type errors and have guaranteed control
flow integrity.
• We evaluate the functioning of the system with a complete RTL implementation
(synthesizable Verilog) of the checker and processor interoperating with gate-level
simulation.
• Finally, we show that the resulting system is efficient both in terms of hardware
resources required and performance, and describe how program transformations
can make it even more so.
We elaborate on the motive of our work (Section 3.2), present our hardware static
analysis in the form of a new hardware/software co-designed type system and prove
its soundness (Section 3.3), outline the checking algorithm implementing the type
system (Section 3.4), and design type annotations that can be easily encoded into the
machine binary and provide a hardware implementation of the typechecker (Section
3.5). We prove the non-bypassability of the circuit in Section 3.6, something that would
be extremely difficult to achieve for a software solution. Next, we provide hardware
synthesis figures, evaluate update-time overhead, and show how to manage worst-case
examples (Section 3.7). Finally, we discuss related work (Section 3.8) and conclude.
3.2 Hardware Static Analysis
In building a static analysis hardware engine directly into an embedded microcontroller,
one of the big advantages of customization is that at the hardware level
we can see, either physically through inspection or through analysis at the gate or
RTL level, exactly how information flows through the system, letting us introduce
safety and security mechanisms that are truly non-bypassable. No software can change
the functioning of the system at that level. However, doing static analysis at the level
of machine code is no easy task — even for software.
Fortunately, there are some great works to draw inspiration from. Previous work
has used types to aid in assembly-level analysis; specifically TAL [80] and TALx86 [81]
have created systems where source properties are preserved and represented in an
idealized assembly language (the former) or directly on a subset of x86 (the latter).
Working up the stack from assembly, other prior works attempt to prove properties
and guarantee software safety at even higher levels of abstractions. We seek to take
these software ideas and find a way to make them intrinsic properties of the physical
hardware for embedded systems where needed.
In this work we draw upon the opportunity afforded by architectures that have
already been designed with ease of analysis in mind. Specifically, we leverage the Zarf
ISA, a purely functional, immutable, high-level ISA and hardware platform used for
binary reasoning, which is suitable for execution of the most critical portions of a system
[1]. At a high level, the Zarf ISA consists of three instructions: Let performs function
application and object allocation, applying arguments to a function and creating an
object that represents the result of the call. Case is used for both pattern-matching and
control flow. One cases on a variable, then gives a series of patterns as branch heads;
only the branch with the matching pattern is executed. Patterns can be constructors
(datatypes) or integer values, depending on what was cased on. Result is the return
instruction; it indicates what value is returned at the end of a function. Branches in
case statements are non-reconvergent, so each must end in a result instruction.
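To make the three instructions concrete, the sketch below shows a factorial function in an invented human-readable rendering of the Zarf abstract syntax; the primitive names sub and mul are assumptions for illustration, not necessarily the actual primitive set.

```text
fun fact (n : Int) Int =
  case n of
    0 ⇒ result 1
    else
      let m = sub n 1 in
      let r = fact m in
      let p = mul n r in
      result p
```

Note that the two branches do not reconverge, and each ends in a result instruction, as the semantics require.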
A big advantage of this ISA for static analysis is that it has a compact and precise
semantics. If we could guarantee the physical machine would always execute
only according to these semantics (e.g. always respecting call/return behavior, using
the proper number of arguments from the stack, etc.) we would end up with a system
that has some very desirable properties. In Section 3.7 we show that these include
verifiable control flow integrity, type safety, memory safety, and others; e.g., ROP [82]
is impossible, programs never encounter type errors, and buffer overruns can never
happen.
Unfortunately, the semantics of any language govern the behavior of execution only
for “well-formed” programs. When we are talking about machine code, as opposed to
programming languages, things are a little trickier, because machines are expected to
read instruction bits from memory and execute them faithfully as they arrive. As we
describe in more detail below, checking membership in the language of well-formed
Zarf programs is actually something that requires some sophistication and would be
difficult to do at run-time. Even though there are just three instructions, Zarf binaries
support casing, constructors, datatypes, functions, and other higher-level concepts as
first-class citizens in the architecture. Our goal is to correctly implement these checks
statically and show that the only binaries that can ever execute on this machine pass
this static analysis.
3.2.1 The Analysis Implemented
While one could, in theory, capture every possible deviation from the Zarf semantics
with a set of run-time checks in hardware, actually catching every possible thing that
can go wrong quickly grows in complexity. An advantage of static checking over
dynamic checks is that once the binaries are analyzed, no additional energy and time
costs are required during execution. For an embedded system that runs the same code
continuously, any small static cost is amortized rather quickly. In fact, as we will show
later, the static analysis can be done in a single streaming pass over the executable.
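As a rough illustration of the streaming flavor of the check, a single-pass scanner might look like the sketch below. The word layout, opcode field, and opcode values are invented for this example; the real analysis also threads a typing environment through the pass rather than merely decoding opcodes.

```python
# Hypothetical 32-bit instruction words with the opcode in the top two
# bits; these values and this layout are assumptions for illustration.
LET, CASE, RESULT = 0, 1, 2

def check_stream(words):
    """Scan the binary's words in one forward pass, rejecting on the
    first word whose opcode field is not a recognized instruction."""
    for w in words:
        opcode = (w >> 30) & 0b11
        if opcode not in (LET, CASE, RESULT):
            return False  # reject the whole binary
    return True  # every word decoded to a valid instruction
```

The real Binary Exclusion Unit performs typechecking, not just opcode decoding, but the control structure is the same: one pass, rejecting on the first violation.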
However, just to see the scope of the problem it is useful to enumerate some of the
dynamic checks that would be required to achieve the same objective as our hardware
static analysis.
Table 3.1 lists ways that programs can fail and costs that are incurred if one were
to dynamically check for errors on the platform. There are 21 different ways for the
hardware to throw errors, the great majority of which require keeping some significant
bookkeeping to actually check. At the very least, we would need to keep extra
information on the number of arguments, number of local variables, number of recently cased
Possible failure: Meaning:
malformed instruction: Bit sequence does not correspond to a valid instruction.
fetch out-of-bounds arg: Accessing argument N when there are fewer than N arguments.
fetch out-of-bounds local: Accessing local N when there are fewer than N locals allocated.
fetch out-of-bounds field: Accessing field N when there are fewer than N fields in the case’d constructor.
fetch invalid source: Bit sequence does not correspond to a valid source.
apply arguments to literal: Treating a literal value as a function and passing arguments to it.
apply arguments to constructor: Treating a saturated constructor as a function and passing arguments to it.
application with too many args: Passing more arguments than a function can handle, even if it returns other functions.
application on invalid source: Invalid source designation for function in application.
oversaturated error closure: Passing arguments to an error closure.
oversaturated primitive: Passing more arguments than a primitive operation can handle.
passing non-literal into primitive op: Passing an object (constructor or closure) into a primitive operation.
case on undersaturated closure: Trying to branch on the result of a function that cannot be evaluated.
unused arguments on stack: Oversaturating a function and branching on the result when not all arguments have been consumed.
matching a literal instead of a pattern: Branching on a function that returns a constructor, but trying to match an integer.
invalid skip on literal match: Instruction says to skip N words on a failed match, but that location is not a branch head.
no else branch on literal match: Incomplete case statement because of lack of else branch.
matching a pattern instead of a literal: Branching on a function that returns an integer, but trying to match a constructor.
incomplete constructor set in case statement: Incomplete case statement because not all possible constructors are present.
invalid skip on pattern match: Instruction says to skip N words on a failed match, but that location is not a branch head.
no else branch on pattern match: Incomplete case statement because of lack of else branch.
Table 3.1: Summary of 21 conditions that require dynamic checks in the absence of static type checking. With our approach, checking is achieved ahead of time, in a single pass through the program; energy and time are not wasted with repeated error checking. No information needs to be tracked at runtime, and the only runtime hardware check is for out-of-memory errors. All of the listed errors are guaranteed by our type system to not occur.
constructor fields, and runtime tags on heap objects to distinguish between closures
and constructors — all of which the hardware would need to track at runtime. Crucially,
this information must be incorruptible and inaccessible to the software for the dynamic
checks to be sound. If software is able to access and corrupt this information, it com-
promises the integrity of the dynamic checks. In general, guaranteeing that the set of
dynamic checks are always occurring, i.e. not bypassed, can be very difficult. With a
hardware-implemented static analysis, we are able to formally prove that our checks
cannot be bypassed (outlined in Section 3.6). In addition to the hardware implementa-
tion overhead of these checks, reasoning about software behavior in the face of dynamic
checks becomes more difficult as well if error states are returned. Programmers who
wish to handle errors due to code that fails such checks are forced to reason about every
situation that can arise (e.g. what if this function encounters an oversaturated primitive,
or cases on an undersaturated closure, and so on). Instead, by performing the checks
statically, all software components understand that any other component with which
they might interact on the system is subject to the same analysis as their own code.
3.2.2 The Bouncer Architecture
Given that we can develop a unit to actually perform the desired static analysis, a
big question is where it fits into the actual micro-controller design. Figure 3.1 shows
how a static analysis engine (the Binary Exclusion Unit) fits into an embedded system
at a high level: all incoming programs are vetted by the checker before being written
to program storage, ensuring that all code that the core executes conforms to the type
system’s proven high-level guarantees. During programming mode, as a binary image
is loaded into the core, the checker has write access to the program store and can use
data memory as a working space. The Binary Exclusion Unit can thus be used as a
Figure 3.1: The Binary Exclusion Unit works as a gatekeeper, only allowing input binary programs if they pass the static analysis. When in “Programming” mode, the core is halted while the program is fed to the checker; if it passes, it is written to the system instruction memory. The checker makes use of the core’s data memory, which is otherwise unused during system programming. At run-time, the checker is disabled and consumes no resources. Programs that pass static analysis are guaranteed to be free of memory errors, type errors, and control flow vulnerabilities. The checker is non-bypassable; all input binaries are subject to the inspection.
runtime guard, checking programs right before execution when they are loaded into
memory, or as a program-time guard, checking programs when they are placed into
program storage (flash, NVM, etc.).
Only once the programming mode is complete do the instruction and data memory
become visible. The upshot of catching errors this way is that the developer gets
feedback at programming time, before a device is deployed, that the binary contains
errors. It further ensures that when reprogramming occurs in the field, malicious or
malformed code that exploits interactions outside of the ISA semantics will never be loaded.
In either case, checking works the same way: each word of the binary is examined
one at a time in a linear pass over the program as it is fed through the Binary Exclusion
Unit. It is trivial to verify that the BEU is the only unit given access to write to the
memory — the more interesting discussion, covered later, is the verification that the
only way through the BEU is via a static analysis.
3.3 Static Analysis Strategy
While many different static analysis approaches might be implemented in hardware
in the way we described in the sections above, to embody these ideas in a hardware
prototype we need a specific analysis specification and implementation. Here we draw
inspiration from TAL [80], and use types to clearly and completely specify allowed
behavior. By extending the Zarf ISA with types, passing a portion of that type informa-
tion along with the binary, and then performing the static analysis to check those types,
we know the program conforms to the allowed behaviors. This new type-extended Zarf
ISA is, unlike untyped Zarf, based on the polymorphic lambda calculus. Figure 3.2
describes the abstract syntax of the typed ISA; note that there are four types: integers,
functions, type variables, and datatypes (which are similar to algebraic datatypes found
x ∈ Variable    n ∈ Z    fn, tn, cn ∈ Name    ⊕ ∈ PrimOp
α ∈ GenericTypeVariable    β ∈ RigidTypeVariable

P ∈ Program ::= data⃗ func⃗
data ∈ Datatype ::= data tn α⃗ = cons⃗
cons ∈ Constructor ::= con cn τ⃗
func ∈ Function ::= fun fn (x : τ)⃗ τ = e
e ∈ Expression ::= let | case | res
let ∈ Let ::= let x = n in e  |  let x = id arg⃗ in e
case ∈ Case ::= case x of br⃗  |  case x of br⃗ else e
res ∈ Result ::= result arg
br ∈ Branch ::= cn x⃗ ⇒ e  |  n ⇒ e
id ∈ Identifier ::= fn | cn | ⊕ | x
arg ∈ Argument ::= n | x
τ ∈ Type ::= Int | dt | ft | T
dt ∈ Datatype ::= tn τ⃗
ft ∈ FuncType ::= τ⃗ → τ
T ∈ TypeVar ::= α | β

Figure 3.2: Typed Zarf abstract syntax. An arrow over any metavariable signifies a list of zero or more elements, except for a datatype’s constructor list, which must be non-empty.
Γ ∈ Env = Variable → Type        C ∈ ConstraintSet = P(Type × Type)
σ ∈ Substitution = TypeVar → Type    b ∈ Bool = true + false

Figure 3.3: Semantic domains for the Zarf static semantics. See Figure 3.4 for the typing rules.
in languages like Haskell and ML). Both functions and datatypes are declared at the
top level; since the ISA is lambda-lifted, the introduction of universally-quantified type
variables ranging over a function body or datatype is limited to the top level as well,
simplifying the ISA’s type system.
Our static analysis requires that type information be encoded into the binary, but we
note specifically that the Binary Exclusion Unit discards these annotations when finished,
leaving a (safe and certified) standard binary program in protected core memory. To
qualify as a typed Zarf program, a binary must declare types of all top-level functions
and make all (data) constructors members of a datatype. With this, all types will be
tracked and checked, including type variables for polymorphism, facilitating local type
inference within the bodies of functions.
3.3.1 Static Semantic Rules
The type system in Figure 3.4 describes, using a set of inference rules, what it means
for a Zarf binary to be well-typed; the associated static semantic domains are found in
Figure 3.3. In this section we define what each inference rule means formally.
func-ret
Rule func-ret checks functions that have zero parameters. It does so by determining
the type τr2 of the function body via the inference rule over e, comparing it against the
expected return type τr1 via the call to princType. If the result of that function call
Functions    ⊢ func : τ

    τr1 = makeRigid(τr)    ⊢ e : τr2    princType([τr1, τr2]) = τr1
    ──────────────────────────────────────────────────────────────  (func-ret)
    ⊢ fun fn [] τr = e : τr

    (τ⃗p1 → τr1) = makeRigid(τ⃗p → τr)    x⃗ ↦ τ⃗p1 ⊢ e : τr2    princType([τr1, τr2]) = τr1
    ──────────────────────────────────────────────────────────────  (func-params)
    ⊢ fun fn (x : τp)⃗ τr = e : τ⃗p → τr

Expressions    Γ ⊢ e : τ

    idTy(Γ, id) = τi    α = freshGenTV    Γ1 = Γ[x1 ↦ α]
    map(argTy(Γ1), arg⃗) = τ⃗a    applyType(τi, τ⃗a, [], α) = τ1    Γ[x1 ↦ τ1] ⊢ e : τ
    ──────────────────────────────────────────────────────────────  (let-var)
    Γ ⊢ let x1 = id arg⃗ in e : τ

    Γ[x ↦ Int] ⊢ e : τ
    ────────────────────────  (let-int)
    Γ ⊢ let x = n in e : τ

    argTy(Γ, arg) = τ
    ────────────────────  (result)
    Γ ⊢ result arg : τ

    Γ(x) = dt    cons⃗ = getCons(dt)    allConsPres(cons⃗, br⃗) = true
    τ⃗ = brTypes(Γ, br⃗, cons⃗)    princType(τ⃗) = τ
    ──────────────────────────────────────────────────────────────  (case-con)
    Γ ⊢ case x of br⃗ : τ

    Γ(x) = dt    cons⃗ = getCons(dt)    τ⃗ = brTypes(Γ, br⃗, cons⃗)
    Γ ⊢ e : τe    princType(τe :: τ⃗) = τ
    ──────────────────────────────────────────────────────────────  (case-con-else)
    Γ ⊢ case x of br⃗ else e : τ

    Γ(x) = Int    (ni ⇒ ei)⃗ ∈ br⃗    Γ ⊢ ei : τi    Γ ⊢ e : τe    princType(τe :: τ⃗i) = τ
    ──────────────────────────────────────────────────────────────  (case-int)
    Γ ⊢ case x of br⃗ else e : τ

Figure 3.4: Zarf static semantics (typing rules). See Figure 3.2 for the abstract syntax and Figure 3.3 for the static semantic domains. map, filter, and concatMap refer to their standard definitions. delete removes the first instance of an element from a list. We use the notation Γ[x ↦ τ] to represent the creation of an updated map Γ where the new entry x ↦ τ has been added to the old map; similarly, Γ[x⃗ ↦ τ⃗] denotes mapping the first variable in x⃗ to the first type in τ⃗, etc. for all point-wise pairs. Unless otherwise stated, the lengths of both lists must be the same. We use the notation O(·) to indicate an “Option” type: essentially a set guaranteed to contain zero or one values. We use the symbol • to represent the absence of a value where an “Option” type is required. We differentiate between metavariables with subscripts that can be a combination of letters or numbers; for example, τ and τ1 represent distinct metavariables denoting types. See Sections 3.3.1 and 3.3.2 for descriptions of each rule and helper function, respectively.
doesn’t equal the expected type, the inference rule fails, indicating a type error.
func-params
Rule func-params checks functions that have one or more parameters. It does so
by mapping each of the function’s parameters to its declared type before checking the body.
makeRigid universally quantifies all type variables in the type declaration across the
body. Like rule func-ret, it uses princType to check that the function body’s expected
type matches its inferred type, with failure indicating a type error.
let-var
Rule let-var applies a type to zero or more arguments using the helper applyType
to get the principal type of the application. Functions may be partially-applied, and
mapping the bound variable to a fresh type variable allows for recursive definitions. It
determines the type of the next subexpression e in an updated environment
Γ[x1 ↦ τ1].
let-int
Rule let-int performs constant assignment, simply updating the environment with
a binding from identifier x to the type Int.
case-con
Rule case-con is used when scrutinizing a datatype. It gets the list of constructors
associated with a particular datatype, replacing all type variables in its constructors’
fields with any type variable instantiations found in the datatype. It uses the helper
brTypes to get the type of each branch.
case-con-else
Rule case-con-else is similar to case-con, but is used when not all constructors of
the datatype are present. In this case, an else branch is required so that the entire
expression can always evaluate to a value with a type τ.
case-int
Rule case-int is used when scrutinizing an integer. It compares branch body types
for equality, and like rule case-con-else, an else branch is required, as it is not feasible
to have branches for every possible integer value.
result
Rule result is the base case, simply producing the type of a bound variable or integer.
As result is the only expression without a subexpression, it is guaranteed to be the last
expression of all branches of a function.
3.3.2 Static Semantic Helper Functions
The rules of the type system in Figure 3.4, described in the previous section,
use a variety of auxiliary functions for clarity in defining the semantics. In this section
we define what each helper function does formally.
applyType
Performs constraint generation, unification (unify), and substitution (substitute)
to get the principal type of an application. When no arguments are applied, the type of
this helper’s first parameter is returned, thus allowing the Let instruction to apply an
integer, datatype, or generic type as well as function types. Note that because the type
returned by applyType is the principal type of the variable to which it is bound in the
Let instruction, no constraints are propagated to any instructions that follow, limiting
the amount of information that needs to be tracked throughout typechecking, as well
as making error reporting of ill-typed applications more accurate.
allConsPres
allConsPres checks if all the constructors have a matching branch. Note that it is
not ill-typed for the pattern branch to provide more binders than the constructor has
fields; it’s only ill-typed if those extraneous binders are used in the branch body. In the
helper brTypes, when the added binders are mapped to their types, it is implied that the
mapping is guaranteed to be no larger than the number of types in the corresponding
constructor. If an extraneous binder is used in the branch body, it will not appear in the
environment because it wasn’t added, and the type system will determine the branch
body to be ill-typed, as expected. This relaxation is needed here because the machine
replaces all the binder references with a field index; if the field is never used in the body
of the branch, then it would not appear in the binary, and therefore we cannot try to
over-constrain it here because the binary would not produce an error during execution.
Also note that there is no requirement that br⃗ contain exactly the same number of
pattern branches as the number of constructors; it is okay for there to be more branches
than constructors for the given datatype. In the helper function brTypes, we ensure
that each pattern branch matches a constructor in the cased-on datatype; therefore,
the implication is that having duplicate pattern branches is acceptable, as long as they
match a constructor of the scrutinized datatype.
allConsPres ∈ Constructor⃗ × Branch⃗ → Bool

allConsPres([], _) = true
allConsPres((con cn τ⃗) :: cons⃗, br⃗) = ((cn _⃗ ⇒ _) ∈ br⃗) ∧ allConsPres(cons⃗, br⃗)
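The definition above amounts to a membership check. A minimal Python sketch, under the simplifying assumption that constructors and branch patterns are represented by their names alone:

```python
def all_cons_pres(cons_names, branch_pattern_names):
    """True iff every declared constructor has a matching pattern branch.
    Extra or duplicate branches are allowed here; brTypes separately
    checks that each branch matches some constructor of the datatype."""
    return all(cn in branch_pattern_names for cn in cons_names)
```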
applyHelper
applyHelper generates constraints between a function application’s parameters
and arguments, taking care to handle over-application appropriately. It does so by
recursively iterating through a list of parameter and argument types, generating the
constraint that the current parameter must equal the current argument. As function
types are uncurried in this formalism, it is used to determine when to continue applying
extra arguments to the current function’s return type, which must also be a function
type.
applyHelper ∈ FuncType × Type⃗ × ConstraintSet × TypeVar → Type

applyHelper(τp :: τ⃗p → τr, τa :: τ⃗a, C1, α) = applyType(τf, τ⃗a, C2, α)
  where
    C2 = {τp = τa} ∪ C1
    τf = τr          if τ⃗p = []
    τf = τ⃗p → τr    otherwise
applyType
applyType describes application of a type to a (possibly empty) list of argument
types, performing constraint generation, unification (unify), and substitution (substitute)
to get the principal type of an application. When applying a function type to a non-empty list of argument types, it calls applyHelper, which performs the step of constraint
generation so that this function can verify that the application is valid via unification.
This and the associated helper functions are needed because of the uncurried presen-
tation of functions in the abstract syntax, which reflects their representation in the
machine more closely. Note that it is considered an error to try to apply a non-function
type to a non-empty list of argument types. When no arguments are applied, the type
of this helper’s first parameter is returned, thus allowing the Let instruction to apply
an integer, datatype, or generic type as well as function types.
The argument α is used to denote the type variable that the variable in the current
let binding was bound to before calling this function. This is necessary to handle the
case where an argument’s type is unknown when application begins because it is being
recursively defined. For example, in let ones = Cons 1 ones in ..., ones is being
recursively defined and therefore will not have a type fully determined until after
application; upon a call to applyHelper, the parameter ones will have type α (where α
is fresh) in the environment. After the process of constraint generation and unification
is completed during this application, if α is found in the substitution, that means it was
used as an argument during application and we can then check that its use corresponds
correctly with the result of the entire application.
applyType ∈ Type × Type⃗ × ConstraintSet × TypeVar → Type

applyType(τ1, τ⃗a, C, α) = τ2                                 if τ⃗a = []
applyType(τ1, τ⃗a, C, α) = applyHelper(τ⃗p → τr, τ⃗a, C, α)    if τ⃗a ≠ [] ∧ τ1 = τ⃗p → τr
  where
    σ = unify(C)
    τ2 = substitute(σ, τ1)
    true = (α ∉ dom(σ)) ∨ ((α ↦ τα) ∈ σ ∧ substitute(σ, τα) = τ2)
argTy
argTy determines the type of an argument; if the argument is a variable, it looks
up its mapping in the type environment. All arguments must be literals (integers) or
previously bound in the environment.
argTy ∈ Env × Argument → Type

argTy(Γ, arg) = Int    if arg = n
argTy(Γ, arg) = τ      if arg = x ∧ (x ↦ τ) ∈ Γ
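In code, argTy is a two-way branch on the argument's form. A sketch over a toy representation (an assumption: Python ints for literals, strings for variables, a dict for Γ):

```python
def arg_ty(env, arg):
    """Integer literals have type Int; a variable must already be bound
    in the environment, and a failed lookup models an ill-typed use."""
    if isinstance(arg, int):
        return "Int"
    return env[arg]  # raises KeyError for an unbound variable
```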
brTypes
brTypes typechecks a list of branch bodies, mapping the branch’s binders to the
matching constructor’s field types. It does so by iterating over a list of pattern branches,
evaluating the branch’s body expression in an environment where the pattern’s binders
have been mapped to the matching constructor’s field types. Note that the number
of variables in a pattern branch need not equal the number of fields in the matching
constructor; any variables without a matching field (because too many variables were
supplied in the pattern) are ignored. This is okay because any reference to those omitted
variables (which shouldn’t be allowed) will be caught during typechecking of the
branch’s body expression, giving an ill-typed result due to a bad environment lookup, as
desired. This rule also shows that each pattern branch must have a matching constructor
in the list of constructors; it is considered ill-typed for a pattern matching a constructor
of a different datatype to be present in the list of branches of the current case.
brTypes ∈ Env × Constructor⃗ × PatCons⃗ → Type⃗

brTypes(Γ, cons⃗, []) = []
brTypes(Γ, cons⃗, (cn x⃗ ⇒ e) :: br⃗) = τ :: τ⃗
  where
    true = ((con cn τ⃗c) ∈ cons⃗)
    Γ[x⃗ ↦ τ⃗c] ⊢ e : τ
    τ⃗ = brTypes(Γ, cons⃗, br⃗)
BuiltinTypes
BuiltinTypes maps primitive operator identifiers (⊕ as shown in the abstract syntax)
to the types of the primitive operations they represent. For example, + maps to
the function type [Int, Int] → Int. The set of builtin operators and their types is fixed,
straightforward, and thus omitted here for brevity’s sake.
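A fragment of such a table, sketched in Python with function types written as ("fun", params, ret); only the type of + is taken from the text, and the other entries are assumed examples:

```python
# Builtin operator types; ("fun", parameter_types, return_type) encodes
# the uncurried function type [Int, Int] -> Int from the text.
BUILTIN_TYPES = {
    "+": ("fun", ["Int", "Int"], "Int"),
    "-": ("fun", ["Int", "Int"], "Int"),  # assumed entry
    "*": ("fun", ["Int", "Int"], "Int"),  # assumed entry
}
```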
Datatypes
Datatypes is simply the list of datatypes extracted out of the top-level program
definition.
Functypes
Functypes is a map from function and constructor identifiers to their respective
types, created from the list of function and datatype declarations that constitute a program. data⃗ and func⃗ are the lists of datatypes and functions, respectively, that constitute
the contents of a program (see Figure 3.2).
Functypes ∈ Operator → Type

Functypes = τ⃗c ++ τ⃗f ++ τ⃗b
  where
    τ⃗c = concatMap(createConsTys, data⃗)
    τ⃗f = map(createFuncTy, func⃗)
    τ⃗b = BuiltinTypes
createConsTys
createConsTys is a helper function for Functypes, creating types for each of the
constructors that are part of a datatype. If a constructor has fields, its type is a function
type where those fields are the parameter types and the datatype of which it is a member
is the return type. If the constructor doesn’t have fields, its type is just the datatype of
which it is a member.
createConsTys ∈ Datatype → (Name → Type)

createConsTys(data tn α⃗ = cons⃗) = map(createConTy(tn α⃗), cons⃗)
createConTy
createConTy is a helper function for createConsTys, creating a type for a constructor
that is part of a datatype.
createConTy ∈ Datatype × Constructor → (Name → Type)

createConTy(dt, con cn τ⃗c) = cn ↦ (τ⃗c → dt)    if |τ⃗c| ≥ 1
createConTy(dt, con cn τ⃗c) = cn ↦ dt           otherwise
createFuncTy
createFuncTy is a helper function for Functypes, creating a type for each function
defined in the program.
createFuncTy ∈ Function → (Name → Type)

createFuncTy(fun fn (x : τ)⃗ τ = e) = fn ↦ (τ⃗ → τ)    if |(x : τ)⃗| ≥ 1
createFuncTy(fun fn (x : τ)⃗ τ = e) = fn ↦ τ           otherwise
freshTypes
freshTypes is a helper function for idTy, replacing all of a type’s generic type
variables with fresh generic type variables, consistently across the type.
freshTypes ∈ Type → Type

freshTypes(τ1) = τ2
  where
    α⃗ = filter(isGeneric, getTyVars(τ1))
    σ = α⃗ ↦ freshGenTV⃗
    τ2 = substitute(σ, τ1)
freshGenTV
freshGenTV gets a fresh uniquely-identifiable generic type variable.
freshRigTV
freshRigTV gets a fresh uniquely-identifiable rigid type variable.
getCons
getCons retrieves the constructors associated with a datatype and replaces any
type parameters in those constructors’ field types with any type variable instantiations
recorded in the datatype.
getCons ∈ Datatype → Constructor⃗

getCons(tn τ⃗t) = cons⃗2
  where
    (data tn α⃗t = cons⃗1) ∈ Datatypes
    σ = zip(α⃗t, τ⃗t)
    cons⃗2 = map(instCon(σ), cons⃗1)
getTyVars
getTyVars gets the set of type variables used in a type. unions takes the union of
each set in a list.
getTyVars ∈ Type → P(TypeVar)

getTyVars(τ) = {T}                                           if τ = T
getTyVars(τ) = {T⃗}                                           if τ = tn T⃗
getTyVars(τ) = unions(map(getTyVars, τ⃗p)) ∪ getTyVars(τr)    if τ = τ⃗p → τr
idTy
idTy allows let-polymorphism by replacing non-rigid type variables with fresh ones.
It does so by getting the type associated with an identifier from the type environment,
if present; otherwise, it looks up the identifier in the set of declared function types.
Afterwards, it uses freshTypes to replace all generic type variables in the type with
fresh new ones. This function is used during the let expression for getting unique
instances of types so that type variables do not inadvertently clash during the process
of unification and substitution (let-polymorphism).
idTy ∈ Env × Identifier → Type

idTy(Γ, id) = freshTypes(τ)    if id = x ∧ (x ↦ τ) ∈ Γ
idTy(Γ, id) = freshTypes(τ)    if id = op ∧ τ = Functypes(op)
instCon
instCon instantiates a constructor by replacing the type variables in its field types
with their mappings in the substitution.
instCon ∈ Substitution × Constructor → Constructor

instCon(σ, con cn τ⃗) = con cn substitute(σ, τ⃗)
makeRigid
makeRigid converts all generic type variables in a type into consistently-renamed
rigid type variables.
makeRigid ∈ Type → Type

makeRigid(τ1) = τ2
  where
    α⃗ = getTyVars(τ1)
    σ = α⃗ ↦ freshRigTV⃗
    τ2 = substitute(σ, τ1)
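The consistent-renaming aspect is the important part: every occurrence of the same generic variable must map to the same rigid variable. A Python sketch over a toy type representation (an assumption: tuples ("var", kind, n) for type variables, ("fun", params, ret) for function types):

```python
from itertools import count

_rigid_ids = count()  # supply of fresh rigid-variable identifiers

def make_rigid(t, ren=None):
    """Replace every generic type variable in t with a fresh rigid one,
    renaming consistently so repeated occurrences stay equal."""
    ren = {} if ren is None else ren
    if isinstance(t, tuple) and t[0] == "var" and t[1] == "generic":
        if t[2] not in ren:
            ren[t[2]] = ("var", "rigid", next(_rigid_ids))
        return ren[t[2]]
    if isinstance(t, tuple) and t[0] == "fun":
        return ("fun", [make_rigid(p, ren) for p in t[1]],
                make_rigid(t[2], ren))
    return t  # base types such as "Int" are unchanged
```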
princType
princType determines whether a list of types all refer to the same type, computing the
most general type that matches all of them. For example, it is used at the end of the
case typing rule to check that all branch bodies have a compatible type. This check is
different from a normal check for equality because it takes into account type variables.
princType ∈ Type⃗ → Type
princType(τh :: []) = τh
princType(τ1 :: τ2 :: τ⃗tl) = princType(τ3 :: τ⃗tl)
    where
        σ = unify({τ1 = τ2})
        τ3 = substitute(σ, τ2)
substitute
substitute takes a constraint set and a type, recursively replacing all type variables
present as keys in the constraint set with their associated values.
substitute ∈ ConstraintSet × Type → Type
substitute(C, τ) =
    τ2           if τ = T ∧ (T = τ1) ∈ C ∧ τ2 = substitute(C, τ1)
    tn τ⃗2        if τ = tn τ⃗1 ∧ τ⃗2 = map(substitute(C), τ⃗1)
    τ⃗p2 → τr2    if τ = τ⃗p1 → τr1 ∧ τ⃗p2 = map(substitute(C), τ⃗p1) ∧
                     τr2 = substitute(C, τr1)
    τ            otherwise
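The recursion performed by substitute can be sketched in Python. The tuple encoding of types here, ("var", name), ("data", tyname, args), and ("fun", params, ret), is an illustrative assumption of ours, not the dissertation's actual representation:

```python
def substitute(constraints, ty):
    """Recursively replace type variables using a dict mapping
    type-variable names to types, following chains of mappings."""
    if ty[0] == "var" and ty[1] in constraints:
        # the mapped type may itself contain mapped variables
        return substitute(constraints, constraints[ty[1]])
    if ty[0] == "data":
        return ("data", ty[1], [substitute(constraints, a) for a in ty[2]])
    if ty[0] == "fun":
        return ("fun", [substitute(constraints, p) for p in ty[1]],
                substitute(constraints, ty[2]))
    return ty  # unmapped variable or base type

# a -> c, with a = List b and b = Int, becomes (List Int) -> c
C = {"a": ("data", "List", [("var", "b")]), "b": ("int",)}
result = substitute(C, ("fun", [("var", "a")], ("var", "c")))
```

Note that chained mappings (a to List b, b to Int) are resolved in one call, while the unmapped variable c is left alone, matching the "otherwise" case above.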
unify
unify performs the standard process of unification, iterating over each constraint
in the constraint set and creating a substitution that contains a mapping from type
variables to types. Any case not listed below causes unification to fail. Regarding
rigid type variables: the rigid types β1 and β2 are successfully unified if
and only if β1 equals β2. This is because in the context of checking a function, rigid type
variables cannot be replaced or unified with any other concrete type or polymorphic
type variable. For compound types like functions and data types, unification proceeds
recursively.
Similar to map creation, we use the notation {τ⃗1 = τ⃗2} to denote the creation of a
constraint set where the first element of τ⃗1 is paired with the first element of τ⃗2, the
second element of τ⃗1 with the second element of τ⃗2, and so on.
unify ∈ ConstraintSet → Substitution
unify(∅) = {}
unify({τ1 = τ2} ∪ C1) =
    unify(C1)     if τ1 = τ2
    σ[α ↦ τ2]     if τ1 = α ∧ α ∉ getTyVars(τ2) ∧
                      C2 = updateConstraints(C1, {α ↦ τ2}) ∧ σ = unify(C2)
    σ[α ↦ τ1]     if τ2 = α ∧ α ∉ getTyVars(τ1) ∧
                      C2 = updateConstraints(C1, {α ↦ τ1}) ∧ σ = unify(C2)
    unify(C2)     if τ1 = tn τ⃗1 ∧ τ2 = tn τ⃗2 ∧ |τ⃗1| = |τ⃗2| ∧
                      C2 = C1 ∪ {τ⃗1 = τ⃗2}
    unify(C2)     if τ1 = τ⃗p1 → τr1 ∧ τ2 = τ⃗p2 → τr2 ∧ |τ⃗p1| = |τ⃗p2| ∧
                      C2 = C1 ∪ {τ⃗p1 = τ⃗p2} ∪ {τr1 = τr2}
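A worklist formulation of this unification procedure can be sketched in Python. This is our own sketch using a tuple encoding of types, and it uses an iterative worklist rather than the recursive equational style above; the occurs check and the convention that any unlisted case fails are preserved:

```python
def ty_vars(ty):
    """Free type variables of a type (tuple encoding, see unify docstring)."""
    if ty[0] == "var":
        return {ty[1]}
    if ty[0] == "data":
        return set().union(set(), *(ty_vars(a) for a in ty[2]))
    if ty[0] == "fun":
        return set().union(set(), *(ty_vars(t) for t in list(ty[1]) + [ty[2]]))
    return set()

def apply_sub(sub, ty):
    """Apply a substitution (dict: var name -> type), following chains."""
    if ty[0] == "var" and ty[1] in sub:
        return apply_sub(sub, sub[ty[1]])
    if ty[0] == "data":
        return ("data", ty[1], [apply_sub(sub, a) for a in ty[2]])
    if ty[0] == "fun":
        return ("fun", [apply_sub(sub, p) for p in ty[1]], apply_sub(sub, ty[2]))
    return ty

def unify(constraints):
    """Types are tuples: ("var", n) variable, ("rigid", n) rigid variable,
    ("data", tn, args) datatype, ("fun", params, ret) function. Any pair not
    handled below (including two distinct rigid variables) fails."""
    sub = {}
    work = list(constraints)
    while work:
        t1, t2 = work.pop()
        t1, t2 = apply_sub(sub, t1), apply_sub(sub, t2)
        if t1 == t2:
            continue
        if t1[0] == "var":
            if t1[1] in ty_vars(t2):          # occurs check
                raise TypeError("infinite type")
            sub[t1[1]] = t2
        elif t2[0] == "var":
            work.append((t2, t1))             # re-orient and retry
        elif (t1[0] == "data" and t2[0] == "data"
              and t1[1] == t2[1] and len(t1[2]) == len(t2[2])):
            work.extend(zip(t1[2], t2[2]))    # unify arguments pointwise
        elif t1[0] == "fun" and t2[0] == "fun" and len(t1[1]) == len(t2[1]):
            work.extend(zip(t1[1], t2[1]))
            work.append((t1[2], t2[2]))
        else:
            raise TypeError(f"cannot unify {t1} and {t2}")
    return sub

# a = List b, b = Int  =>  a resolves to List Int
sub = unify([(("var", "a"), ("data", "List", [("var", "b")])),
             (("var", "b"), ("int",))])
```

Because rigid variables carry their own tag, two distinct rigid variables fall through to the failure case, matching the rule that β1 unifies with β2 only when they are equal.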
updateConstraints
updateConstraints iterates through a constraint set, replacing all type variables in
each constraint by their mapped types, if present in the substitution. This is a helper
function used only by unify.
updateConstraints ∈ ConstraintSet × Substitution → ConstraintSet
updateConstraints(∅, σ) = ∅
updateConstraints({τ1 = τ2} ∪ C1, σ) = {τ3 = τ4} ∪ C2
    where
        τ3 = substitute(σ, τ1)
        τ4 = substitute(σ, τ2)
        C2 = updateConstraints(C1, σ)
3.3.3 Properties and Proofs
Two formal properties, when combined, can guarantee that the machine never has
to create and return an error object. The first is progress, which says that if a term
is well-typed, then there is always a way to continue evaluating it according to the
semantic rules; the second is preservation, which says that if a term is well-typed,
evaluating it will result in a well-typed term. Taken together, we have a guarantee that
there will always be an applicable semantic rule to evaluate each step of the program,
which means that we never encounter anything outside of our semantic definitions and
never run into type or memory errors.
We prove progress and preservation in a straightforward way, via induction on the
typing rules and the dynamic semantics, giving a brief overview below.
Lemma 4 (Apply Type). applyType(τi, τ⃗a, C, α) returns the principal type of an application
of a type to zero or more arguments.
Proof. applyType generates a constraint for each parameter and argument until the
list of arguments τ⃗a is exhausted. Unifying these constraints to produce a substitution,
it then determines the principal type of the application; this proof relies on standard
proofs on principal type generation.
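The shape of application typing, including partial application, can be illustrated with a simplified Python sketch. This is our own simplification that handles concrete types only; the real applyType generates constraints and unifies them, which is omitted here:

```python
def apply_type(fun_ty, arg_tys):
    """Peel one parameter per argument; a partial application yields the
    residual function type. Types are tuples; ("fun", params, ret) is a
    function type. Type variables and unification are omitted: arguments
    must match parameters exactly in this simplification."""
    ty = fun_ty
    for a in arg_tys:
        if ty[0] != "fun":
            raise TypeError("application on non-function type")
        params, ret = ty[1], ty[2]
        if params[0] != a:
            raise TypeError("not expected type")
        # drop the consumed parameter; the last one exposes the return type
        ty = ("fun", params[1:], ret) if len(params) > 1 else ret
    return ty

# add : Int -> Int -> Int, applied to one argument, leaves Int -> Int
add2 = ("fun", [("int",), ("int",)], ("int",))
partial = apply_type(add2, [("int",)])
full = apply_type(add2, [("int",), ("int",)])
```

Applying zero arguments returns the type unchanged, mirroring the "application of a type to zero or more arguments" phrasing of the lemma.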
Lemma 5 (Progress of Functions). Assuming the correct arguments are given, executing the
body of a well-typed function fun fn x⃗ : τ⃗ τ = e produces a value of type τ ≠ Error when
the body terminates.
Proof. The rule func-params checks a function body e in an environment Γ that maps
the parameters in x⃗ to their declared types in τ⃗. Any type variables in those parameter
types and the return type are replaced with fresh rigid type variables, implying that
those parameters cannot be specialized within the function body (i.e., they are universally
quantified across the entire function). Rule func-ret follows similarly, except that the
initial environment used to typecheck the body e is empty. We must show that Γ ⊢ e : τ,
that is, that the function body evaluates to a non-error value of type τ. The proof
proceeds by induction on the derivation of e and using Lemma 4:
Case (let-var: e = let x1 = id arg⃗ in e1). Let τi be the type of id; the helper function idTy
looks up its entry in Γ if present; otherwise id is a function or constructor name and its
type is statically known by referring to the program's function type list. If arg⃗ = [], then
by induction Γ[x1 ↦ τi] ⊢ e1 : τ. Otherwise, by Lemma 4, applyType(τi, τ⃗a, [], α) = τ1
returns x1's principal type (where τ⃗a are the types of the arguments and α is the initial
type of x1). By induction then, Γ[x1 ↦ τ1] ⊢ e1 : τ.
Case (let-int: e = let x = n in e1). Then Γ[x ↦ Int] ⊢ e1 : τ by induction, since n : Int.
Case (case-con: e = case x of br⃗). Let (x ↦ dt) ∈ Γ, so that the value of x is a constructor
named cn with fields of types τ⃗. By definition, all constructors that are members of
type dt must have a matching branch cni x⃗i ⇒ ei in br⃗. Additionally, each matching branch
must contain at least as many binders x⃗i as fields in its constructor, such that all variable
references in the branch body ei are valid. (All of these requirements are handled
in the helper function allConsPres.) The helper function brTypes then proceeds to
get the type of each branch body, in the environment Γ extended with bindings from
the pattern binders to the constructor field types for that particular constructor. By
induction, those bodies e⃗i return types τ⃗i, all of which must be unifiable as a type τ.
As case is the only source of control flow, requiring case completeness ensures that
the machine cannot attempt to go to an invalid location or reach an error state.
Case (case-con-else: e = case x of br⃗ else ee). This case proceeds as in case-con above,
except the restriction that all branches for the scrutinized datatype be present is relaxed.
Instead, an else expression ee is required to maintain control-flow safety.
Case (case-int: e = case x of br⃗ else ee). Let (x ↦ Int) ∈ Γ, so that the value of x is n.
Since any branch ni ⇒ ei ∈ br⃗ may be the match, the typing rule requires that all branch
bodies result in a unifiable type. Additionally, since the set of possible integer values is
far too large to cover exhaustively, the rule requires that an else expression ee be provided
to handle the non-matching case. As the case expression is the only source of control
flow in the system, and because we require an else branch when scrutinizing an integer,
the machine cannot go to an invalid location or produce an error at this stage. By induction,
Γ ⊢ ee : τe and Γ ⊢ ei : τi for all branch bodies, which must be unifiable to the type τ.
Case (result: e = result arg). If arg = n, then τ = Int. If arg = x, then (x ↦ τ) ∈ Γ, stored
previously via the let-* rules or because it was an argument to the function. As this
is the base instruction, it is always the last instruction in each branch of a function.
Therefore its type will be the type of the function itself (and must match all other result
types of the function, checked in the func-* rules).
Theorem 2 (Progress of Programs). Let P be a well-typed program composed of a list of
datatypes data⃗ and functions func⃗. Let (fun main x⃗ : τ⃗ τ = e) ∈ func⃗ be the entry point to P,
where execution begins. Then P either halts and returns a value of type τ ≠ Error, or it
continues execution indefinitely.
Proof. By Lemma 5 and rule func-params, we know that fun main x⃗ : τ⃗ τ = e has type τ
(similarly for functions without parameters, using rule func-ret). Since a hardware error
value of type Error is created when the machine encounters an invalid state during
evaluation, and Lemma 5 says that a well-typed function does not lead to an invalid
state, P returns a value of type τ ≠ Error when it terminates.
3.4 Algorithm for Analysis
As mentioned earlier, the Binary Exclusion Unit (BEU) can be used as a runtime
guard, checking programs right before execution when they are loaded into memory,
or as a program-time guard, checking programs when they are placed into program
storage (flash, NVM, etc.). In either case, checking works the same way: each word of
the binary is examined one at a time as it streams through. Central to this process is
the embedded Type Reference Table (TRT), which is copied from the binary into the
checker’s memory and contains the type information for the binary. This serves as a
reference during all stages of the checking process and will be extended during the
checking of each function as local variables are introduced. Later, when the BEU arrives
at a new function, it consults the function signature, which provides type information
for the arguments and the return type of the function. Each instruction in the function
is then scanned word-by-word, guaranteeing type safety of each instruction according
to the static semantics (Figure 3.4). Checking can fail at any step of the process: e.g.,
a function might expect an Integer but is passed a List, or the add function, which
expects two Integer arguments, is given three. A single type violation causes the entire
program to be rejected. The steps required to check each instruction class are described
in more detail below:
Let —When a Let instruction is encountered, we first check for special-case operations:
applying zero arguments to something always yields the same thing, so we can simply
give the result that same type and do no further checking. Assuming the Let
does have arguments, the checker then gets the type of the function and creates an alias
of it in a new TRT entry. The point of the alias is to make each type variable unique
— e.g., the same type List a (a list of elements of type “a”) used in two places may
not be using the same type for “a”, so the separate usages should have separate type
variables. In order to allow recursive Let operations, a type variable is assigned to the
result of the operation; when all the arguments have been processed, that variable will
be set equal to what’s left. The checker goes through each argument, one at a time, and
unifies its type with the function’s expected type. This creates a list of constraints that,
along with the constraint on the resulting type variable, are checked altogether as the
last step. If there are no inconsistencies in the constraint set, the operation was valid,
and a new valid type is produced for the local variable.
Because type inference is relatively simple, we chose to forgo type annotations
on each function application that indicate the result of the operation. Instead, the
checker uses function-local type inference to figure out the return type of each function
application. Because function calls (Let instructions) make up the majority of the
instructions in a binary, the absence of annotations on each one results in much smaller
binary sizes for typed binaries.
Special care must be taken in Let instructions when the resulting type is a function,
and when the function being applied has a function in its return type. The former re-
quires creating a new TRT entry for the function; the latter requires a special “unfolding”
routine to begin applying arguments to the function in the return type. Both of these
are reasons that the Let section of the hardware checker has so many states (Table 3.2).
Case — Case instructions are much more straightforward. The checker simply saves
some type information on what the program is casing on, which is used in later in-
structions. Specifically, the primary task is to get the type of the scrutinee (the thing
being cased on) and save a reference both to the particular variable’s type and the root
program datatype (assuming the variable is a constructor, not an integer). For example,
this way branches will know that a List was cased on, not a Tuple, and know that the
particular variable was a List Int as opposed to a List Char.
Pattern_literal branch heads are quite simple: the case head must be an integer,
and the value specified in the instruction must be an integer.
Pattern_con branch heads are one of the more complex things to check. We have
to reconcile the generic type of the indicated pattern (constructor) with the specific
type of the variable that we’re matching against. To do this, the checker must get the
function type specified in the pattern head, then alias it in a new TRT entry. Then we
must generate the constraint that the return type of the function is the same as the type
of the scrutinee — this ensures that the type variables in this entry will be constrained to
be the same as those in the original scrutinee. Constraints can then be checked, yielding
a map with which the variables can be recursively replaced with the correct types. Finally,
a pointer is set to where the fields of the constructor begin (if applicable). When we
are done, we have direct, usable information on the type of each field in the constructor,
which can be used by following instructions.
In addition, we must keep track of which constructors we’ve seen in this case state-
ment; that way, when we get to the end of the Case, we’ll know if all of the constructors
of that type were present or not. A Case statement must either contain an else branch
or use all constructors of the scrutinee’s type.
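The completeness bookkeeping described above can be sketched in Python. The function name case_is_safe is our own; the dissertation's actual helper is allConsPres, and the hardware performs this check as a linear scan over a small per-datatype list:

```python
def case_is_safe(all_cons, seen_cons, has_else):
    """all_cons: constructor names of the scrutinee's datatype.
    seen_cons: constructor names seen in this Case's branch heads.
    A Case is safe if it has an else branch or covers every constructor."""
    if has_else:
        return True
    for c in all_cons:           # linear scan, as the hardware would do
        if c not in seen_cons:
            return False
    return True

# a List datatype with constructors Cons and Nil
complete = case_is_safe(["Cons", "Nil"], ["Nil", "Cons"], False)
partial_no_else = case_is_safe(["Cons", "Nil"], ["Cons"], False)
partial_with_else = case_is_safe(["Cons", "Nil"], ["Cons"], True)
```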
Figure 3.5: An example Type Reference Table (TRT) for the function map. The original program is shown in (a), while (b) shows the actual binary type information that the assembler produces (annotated for human understanding). This type information is included at the head of the binary file, leaving the program unchanged. The first section lists types used in the signatures of the program, while the second section contains type information for the parameters and return type of each function. The type system is polymorphic and uses function-local type inference.
3.5 BEU Implementation
At a high level, the BEU is a hardware implementation of a pushdown automaton
(PDA) and is structured as a state-machine with explicit support for subroutine calls.
While numerous bookkeeping structures are required, we must take care to access
only a single structure at a time to ensure we do not create structural hazards. The final
analysis hardware is the result of a chain of successive lowerings from a high-level
static semantics ending with a concrete state chart that we could then implement with
minimal and straightforward hardware. First, a bit-accurate software checker was made
that checked binary files. Then, a cycle-accurate software pushdown automaton was
written from that refined specification. From that program, the leap to real hardware
was somewhat straightforward (see Section 3.7 for synthesis results). The full details
of the checker cannot hope to fit in this paper, so we concentrate here on the strategy
used at a high level and a couple of details to give a better sense for the full design.
The first challenge in implementing this analysis is how to encode the type informa-
tion into the binary. As discussed in the prior section, we put this information at the
head of each binary in the form of the TRT. To get a sense of what that actually looks
like in a real implementation, Figure 3.5 shows an example TRT for the function map.
This information is discarded after checking, leaving a standard, untyped binary, which
executes with normal performance.
At the bit-level, we see only a sequential series of bytes. Therefore, all type infor-
mation must be encoded into a single list. To avoid unnecessary complexity, we make
all entries in the TRT fixed-width 32-bit words. An entry can be either 1) a program-
specified datatype or built-in type (the Zarf ISA includes integers and an error datatype
built-in), or 2) a derived type based on another type. Entries of the second kind can have
one or more argument words, which we refer to as "typewords."
“Derived” here means that the entry contains references to other types in the table.
This manifests as either a type applied to some type variables or as a function. For
example, List is specified as a program datatype with one type variable, then derived
type entries can create the types List a, List Int, etc., where a and Int are typewords
following the derived type entry.
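A decoder for such fixed-width entries can be sketched in Python. The bit layout below (tag in the top byte, typeword count in the next byte, payload in the low half-word) is a hypothetical layout of our own; the dissertation does not specify the real field widths:

```python
def decode_entry(words, i):
    """Decode one TRT entry at index i in a flat list of 32-bit words.
    Hypothetical layout: bits 31-24 tag (0 = datatype/built-in,
    1 = derived), bits 23-16 count of trailing typewords, bits 15-0
    payload (e.g., arity, or the index of the type being applied).
    Returns (tag, payload, typeword args, index of the next entry)."""
    w = words[i]
    tag = (w >> 24) & 0xFF
    argc = (w >> 16) & 0xFF
    payload = w & 0xFFFF
    args = [words[i + 1 + k] & 0xFFFF for k in range(argc)]
    return tag, payload, args, i + 1 + argc

# a tiny table: List (one type parameter), Int, and the derived type List Int
words = [
    (0 << 24) | (0 << 16) | 1,   # entry 0: datatype List, 1 type parameter
    (0 << 24) | (0 << 16) | 0,   # entry 1: built-in Int
    (1 << 24) | (1 << 16) | 0,   # entry 2: derived, applies entry 0 ...
    1,                           # ... to one typeword referencing entry 1
]
tag, payload, args, nxt = decode_entry(words, 2)
```

The typewords are simply references to other indices in the table, which is how derived entries like List Int point back at List and Int.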
The second challenge in bringing the typechecker to a low level is dealing with
recursive types. Implicitly, types in the system may be arbitrarily nested: for example,
one could declare a List of Tuples of Lists of Ints. During the checking process, the
hardware typechecker must be able to recursively descend through a type in order to
make copies, do comparisons, and validate types. Because of this, the Binary Exclusion
Unit cannot be expressed as a simple state machine — a stack is required for recursive
operations (and hence the pushdown automaton).
Data structures used in the higher-level checking, like maps, need to be converted
to structures native to hardware: they must be flattened into a list, which can be stored
in memory. In some cases, this requires a linear scan to check for the presence of
some elements, such as checking case completeness — but those lists tend to be small,
containing just one entry for each constructor of a given datatype. We found that all of
the structures could be represented as lists with stack pointers, except in the case of the
type variable map used in the recursive replace procedure, which required two lists
(one to check for membership in the map, the second with values at the appropriate
indices).
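The two-list map representation can be sketched in Python. The function names here are our own; the point is that membership is a linear scan over the first list and the value lives at the matching index in the second:

```python
def map_lookup(keys, values, k):
    """Parallel-list map: scan the key list; the value sits at the
    matching index in the value list."""
    for i, key in enumerate(keys):   # linear scan, as the hardware does
        if key == k:
            return values[i]
    return None

def map_insert(keys, values, k, v):
    """Update in place if the key exists, otherwise append to both lists,
    so a variable is never mapped twice to different values."""
    for i, key in enumerate(keys):
        if key == k:
            values[i] = v
            return
    keys.append(k)
    values.append(v)

keys, values = [], []
map_insert(keys, values, "a", "Int")
map_insert(keys, values, "b", "c")
map_insert(keys, values, "a", "d")   # updates, does not duplicate
```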
To create the control structure of the PDA, we started by implementing a software-
level checker, broken into a set of functions implemented with discrete steps, where
each step cannot access more than one array in sequence (in hardware, the arrays will
become memories, which we do not want strung together in a single cycle). While,
given our space constraints, it is difficult to describe the system in detail, the number of
Figure 3.6: The full state machine portion of the hardware Binary Exclusion Unit, which typechecks a binary, polymorphic type system. The machine makes use of several subroutines, so some states are missing incoming or outgoing edges; these correspond to jumps to and from a subroutine.
Purpose                          Number of States
Initialization                   21
Function signatures              15
Dispatch                         6
Let checking                     37
Return checking                  3
Case checking                    7
Literal pattern checking         1
Constructor pattern checking     21
Following references             6
Type variable (TV) counting      12
Recursive TV replacement         12
Recursive TV aliasing            26
Generating constraints           19
Checking constraints             21
Total                            207
Table 3.2: Number of states devoted to the various parts of the Binary Exclusion Unit's state machine. Checking function calls, allowing for polymorphic functions with type variables, and constraint checking were the most complex behaviors, making up most of the states.
states for each part of the analysis is a reasonable proxy for complexity. The resulting
state machine has 207 states and they are broken down by purpose in Table 3.2. We
summarize them briefly here, with number of states denoted in parentheses. The ini-
tialization stage reads the program and prepares the type table (21 states). Function
heads are checked to ensure the argument count matches the provided function sig-
nature, and bookkeeping is done to note the types of each argument and the return
type (15). Dispatch decides which instruction is executed next and handles saving and
restoring state as necessary for Case statements (6). Let (37), Result (3), Case (7),
Pattern_literal (1), and Pattern_con (21) are checked as outlined in Section 3.4.
Because types can be recursively nested, a type entry in the TRT can reference other
types; a set of states is devoted to following references to find root types as needed (6
states). To handle this, the state machine implements something akin to subroutines.
A routine executes at the beginning of each function that counts the number of type
variables used in the signature (these type variables are “rigid” within the scope of the
function and cannot be forced to equal a concrete type) (12). Another routine recursively
replaces type variables to make one type entry match the variables in another; it allows
pattern heads to be forced to the same type as the variable in the Case instruction (12).
The aliasing subroutine recursively walks a type and maps its type variables to a “fresh”
set (26). This allows, for example, each usage of List a to have “a” refer to a different
type. Part of the complexity of this task is keeping track of the variables already seen
and what they map to so that a variable is not accidentally mapped twice to different
values. Constraint generation takes two type entries and, based on the entries, branches
and generates the appropriate constraint for the constraint set indicating that the entries
should be equal (19).
Finally, we have the constraint checking routine (21). This is invoked at the end
of each Let instruction, as well as after a Result. Constraint propagation proceeds by
taking one constraint from the set, which consists of a pair of types, then walking all
the remaining constraints in the set and replacing all occurrences of the first type with
the second. In this way, for each unique constraint, one type is eliminated from the
constraint set. If at some point two different concrete types (like Int and List) are
found to be equal, the set is inconsistent and typechecking fails, rejecting the program.
Similarly, if ever a rigid type variable (a type variable used in the function signature)
is found to be equal to a concrete type, typechecking fails. This second fail condition
ensures that functions with polymorphic type signatures are, in fact, polymorphic.
Without it, one could write a function that takes "a" and returns "a", which should work
for all types, but in fact only works for integers.
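The elimination loop of the constraint checking routine can be sketched in Python. This sketch is our own and flattens types to strings for brevity (concrete types start uppercase, type variables lowercase, rigid variables prefixed with "!"); recursion into compound types is omitted:

```python
def check_constraints(constraints):
    """Take one constraint at a time; rewrite all occurrences of the
    eliminated type in the remaining constraints, removing one type per
    step. Fails on two different concrete types, on two distinct rigid
    variables, or on a rigid variable forced to a concrete type."""
    def concrete(t):
        return t[0].isupper()
    def rigid(t):
        return t.startswith("!")
    work = list(constraints)
    while work:
        t1, t2 = work.pop()
        if t1 == t2:
            continue
        if concrete(t1) and concrete(t2):
            raise TypeError(f"inconsistent constraint set: {t1} = {t2}")
        if rigid(t1) and rigid(t2):
            raise TypeError(f"distinct rigid variables: {t1} = {t2}")
        if (rigid(t1) and concrete(t2)) or (rigid(t2) and concrete(t1)):
            raise TypeError("rigid variable forced to a concrete type")
        if concrete(t1) or rigid(t1):
            t1, t2 = t2, t1       # ensure t1 is the plain variable
        # eliminate t1 everywhere in the remaining constraints
        work = [(t2 if a == t1 else a, t2 if b == t1 else b)
                for a, b in work]
    return True
```

Note that a plain type variable may still be equated with a rigid variable; only forcing a rigid variable to a concrete type (or to a different rigid variable) fails, which is the condition that keeps polymorphic signatures honest.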
As we developed our software and hardware checkers, we used a software fuzzing
technique to generate 200,184 test cases based on prior techniques in program
testing [83]. Rather than generating random bits, which would not meaningfully exercise
the checker, we encode the type system’s semantics with logic programming and run
them “in reverse” to generate, rather than check, well-typed programs. By performing
mutations on half of these programs, we also generate ill-typed benchmarks. In all
200,184 generated test cases, the simulated hardware RTL has 100% agreement with
the high-level checker semantics. The tests provide 100% coverage of all 207 states of
the checker.
While the resulting analysis engine is complex, one could certainly reuse parts
of the analysis for other sets of properties, and automated translation would be an
interesting direction for future work. The software model is 1,593 lines of Python,
while the hardware RTL is 1,786 lines (requiring extra code for declarations and the
simulation test harness). Synthesis results are found in Section 3.7.
3.6 Provable Non-bypassability
The hardware static analysis we developed has a variety of states governing when
it is active, how it initializes, and so on. An important point of this paper is the non-
bypassability of these checks, so we must ensure that no sequence of inputs can be given
to the checker that causes outputs to be written to memory without having been checked
by the analysis. To solve this problem, we can create an assertion and employ the
Z3 SMT solver [84] to check it for us. Z3 is well-suited to our task because of its ability
to represent logical constructs and solve propositional queries. In addition, because
we can directly represent the circuit in Z3 at the logic level (gates), we do not have
to operate at higher levels of abstraction and risk the proof not holding for the real
hardware.
We actually translate our entire analysis circuit into a large Z3 expression. Then, we
add two constraints: the first says that, at some point in the operation of the circuit,
it output the “passed” (meaning well-typed) signal, while the second says that at no
point did the hardware enter the checking states. If the conjunction of the expressions
is unsatisfiable, there is no way to get a “pass” signal without undergoing checking
(and the program will never be loaded if it fails checking). Around 30 of the states deal
with program loading, initialization, etc., and perform no checking; our proof guards
against, for example, situations in which some clever manipulation of the state machine
moves it from initialization directly to passing, or otherwise manages to circumvent the
checking behavior of the state machine.
In the most direct strategy, we use the built-in bitvec Z3 type for wires in the circuit,
with gates acting as logical operations on those bitvectors. Memories are represented as
arrays. Arrays in Z3 are unbounded, but because we address the array with a bitvector,
there is an implicit bound enforced that makes the practical array non-infinite.
A straightforward approach to handling sequential operation of the analysis is
to duplicate the circuit once for each cycle we wish to explore. The cycle number is
appended to the name of each variable to ensure they are unique. Obviously, because
the entire circuit is duplicated for each cycle, this method does not scale well — both
in terms of memory usage and the time it takes to determine satisfiability. Checking
non-bypassability for numbers of cycles up to 32 took under 2 minutes and used less
than 1 GB of RAM. Checking for 64 cycles used almost 16 GB and did not terminate
within four days.
To make the SMT query approach scalable, we employ Z3’s theory of arrays. Instead
of representing each wire as a bitvector, duplicated once for each cycle, we represent it
as an array mapping integers to bitvectors: the integer index indicates the cycle, while
the value at the index is the value the wire takes in that cycle. There is then one array for
each wire in the circuit, and one array of arrays for each memory in the circuit (the first
array represents the memory in each cycle, while the internal array gives the state of the
memory in that cycle). Logical expressions (gates) can then be represented as universal
quantifiers over the arrays. For example, an AND expression might look like, ForAll(i,
wire3[i] == wire1[i] & wire2[i]). This constrains the value of wire3 for all cycles.
Sequential operations are easy, simply referring to the previous index where necessary
for register operations, e.g. ForAll(i, reg1[i] == reg1_input[i-1]). To bound the
number of cycles, we add constraints to each universal quantifier that i is always less
than the bound; this prevents Z3 from trying to reason about the circuit for steps beyond
i.
Solving satisfiability with arrays took under two minutes and under one GB of RAM,
no matter what bound we placed on the cycle count — in fact, even when unbounded,
Z3 was still able to demonstrate that our hardware-analysis bypassability assertion was
unsatisfiable — i.e., the circuit is non-bypassable.
3.7 Evaluation
3.7.1 Checking Benchmarks
To understand if real-world programs can be efficiently typed and checked with our
system, we implement a subset of the benchmarks from MiBench [85]. These tended
to be much longer and more complex programs when compared to the randomly-
generated ones. While the fuzzer’s programs averaged 50-65 instructions per program,
the embedded benchmarks range from 500 to over 7,000 instructions and represent code
structures
drawn from real-world applications, such as hashes, error detection, sorting, and IP
lookup. In addition to the MiBench programs, a standard library of functions was
checked, as well as a synthetic program combining all the other programs (to see the
Figure 3.7: BEU evaluation for a set of sample programs drawn from MiBench, an embedded benchmark suite. For most programs, complete binary checking will take 150-160 cycles per instruction. LEFT: Time for hardware checker to complete, in cycles, as a function of the input program's file size. RIGHT: The same checking time, divided over the number of instructions in each program. Though the stock CRC32 has the longest typecheck time, an automatic procedure can modify the program to lower the checking time while preserving program semantics, noted as CRC-short.
characteristics of longer programs).
Figure 3.7 shows how long typechecking took for the benchmark programs as a
function of their code size. A linear trend is clearly visible for most of the programs,
but one stands out from the pack: the CRC32 error detection function. The default
CRC32 implementation is, in fact, a pathological case for our checking method as it is
dominated by a single large function in the program. This function constructs a lookup
table used elsewhere and is fully unrolled in the code. No other benchmark had a
function nearly as large. The typecheck algorithm, while linear in program length (it
checks in a single pass), is quadratic in function length and type complexity. ("Type
complexity" refers to how many base types are in a type; i.e., the length of its member
types, recursively.) This insight not only explains the anomalous behavior of the initial
CRC32 program, but provides a clear solution: break up the large function.
We test this hypothesis by breaking up CRC32 and re-checking it. While the task of
breaking up a function in a traditional imperative programming language is complicated
by the large amounts of global and implicit state, and would be even harder to perform
at the level of assembly, in a pure functional environment every piece of state is explicit.
This makes the process not only easier, but even possible to fully automate. When
we look at CRC32 specifically, the state, passed directly from one instruction to the
next for table composition, can be captured in a single argument. We perform this
transformation on our CRC32 program to break table construction across 26 single-
argument functions, producing the CRC-short data point in the graphs in Figure 3.7.
It still stands slightly above average because the table-construction functions are still
above the average function length; recursively applying the breaking procedure could
easily reduce the gap further.
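The state-threading transformation can be illustrated with a small Python analogue. The real transformation operates on Zarf assembly, and the table math, chunk size, and chunk count below are arbitrary stand-ins of our own:

```python
def build_table_big():
    """Analogue of a single fully-unrolled table-building function."""
    return [i ^ 0x5A for i in range(256)]   # stand-in for the real CRC math

def make_chunk(start):
    """A single-argument function that extends the table state it is given."""
    def chunk(table):
        return table + [(start + j) ^ 0x5A for j in range(16)]
    return chunk

# split the one big function into a chain of small single-argument functions
chunks = [make_chunk(16 * k) for k in range(16)]

def build_table_split():
    table = []
    for f in chunks:        # thread the state through the chain
        table = f(table)
    return table
```

Because each chunk function receives the whole state as its single argument and returns the extended state, the split version computes exactly the same table while keeping every individual function short.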
While function length is an important aspect of checking time, with some care it can
be effectively managed, and in the end all of the programs examined can be statically
analyzed in hardware at a rate greater than 1 instruction per 100 cycles. This rate is
more than fast enough to allow checking to happen at software-update-time, and could
perhaps even be used at load-time, depending on the sensitivity of the application to
startup latency.
3.7.2 Practical Application to an ICD
In addition to the benchmarks described above, we provide results for
a complete embedded medical application that was typed and checked; specifically, an
ICD, or implantable cardioverter-defibrillator. (An ICD is a device implanted in a patient’s
chest cavity that monitors the heart for life-threatening arrhythmias; when one is detected,
a series of pacing shocks is administered to the heart to restore a safe rhythm.) The ICD
code was the largest single program examined (only the synthetic, combined program
was larger). Its complexity
required the use of multiple cooperating coroutines, managed by a small microkernel
that handled scheduling and communication. Despite its length and complexity, it
had the best typecheck characteristics of any of our test programs, with its cycles-per-
instruction figure falling just below the average at 55.2. The process of adding types to
the application was relatively simple, taking approximately 2 hours by hand.
Attempted attack: Binary that reads past the end of an object to access arbitrary memory.
Result: Hardware refuses to load binary due to type error “field count mismatch”.

Attempted attack: Binary that passes an argument to a function of the wrong type to cause unexpected behavior.
Result: Hardware refuses to load binary due to type error “not expected type”.

Attempted attack: Binary that writes past the end of an object to corrupt memory.
Result: Hardware refuses to load binary due to “application on non-function type”.

Attempted attack: Binary that passes too few arguments to a function to attempt to corrupt the stack.
Result: Hardware refuses to load binary due to “undersaturated call”.

Attempted attack: Binary that uses an invalid branch head to try and make an arbitrary jump.
Result: Hardware refuses to load binary due to type error “branch type mismatch”.

Attempted attack: Binary that jumps past the end of a case statement to enable creation of ROP gadgets.
Result: Hardware refuses to load binary due to “invalid branch target”.

Attempted attack: Jump past the end of a function to create ROP gadgets.
Result: Hardware refuses to load binary due to “invalid branch target”.

Table 3.3: A list of some of the erroneous code that may be present in a binary (tested in our ICD application) and how the BEU identifies it as an error. Some of these errors, such as reading off the end of an object, writing beyond the end of an object, and jumping to arbitrary code points, are sufficient to thwart common attacks, like buffer overflow and ROP.
Since the ICD represents the largest and most complex program, as well as the exact
type of program the BEU is designed to protect, we attempt to introduce a set of errors
in the program to demonstrate the ability of the BEU to ensure integrity and security.
Some of the errors are designed to crash the program; some are designed to hijack
control flow; others are designed to read privileged data. The list of attempted attacks
and how the BEU caught them are shown in Table 3.3.
In an unchecked system, passing an invalid function argument, writing past the end
of an object, and passing an invalid number of function arguments could all lead to
undefined behavior or system crashes. While past work could establish that a specific
piece of code would not do these things independent of the device, this work establishes
these properties for the device itself, applying to all programs that can potentially execute
— it is simply impossible to load a binary that will allow these errors to manifest.
To establish that this was indeed the case, Table 3.3 shows the result of our attempts
to produce these behaviors: a type error, a function application error, and an
undersaturated call error, respectively. Reading past the end of an object in an attempt to
snoop privileged data was thwarted by detecting a type error dealing with field count
mismatches. Control-flow hijacks, like using an invalid branch head, jumping past the
end of a case statement, and jumping past the end of a function, were caught by a type
mismatch in the first case and the detection of an invalid branch target in the latter two.
Though not exhaustive, these attacks show the resilience of the system to injected
errors when compared to an unchecked alternative and demonstrate its practicality in
the face of real errors and attempted attacks.
3.7.3 Synthesis Results
Synthesized with Yosys, the hardware typechecker logic uses 21,285 cells (of which
829 are D flip-flops, the equivalent of approximately 26 32-bit registers). Mapped
to the open-source VSC 130nm library, it is 0.131 mm², with a clock rate of 140.8 MHz.
Scaled to 32nm, it is approximately 0.0079 mm². As an addition to an embedded system
or SoC, it provides only a tiny increase in chip area, and requires no power at run-time
(having already checked the loaded program).
Assuming the checker can use the system memory, it requires no additional mem-
ory blocks; if not, it needs a memory space at least as large as the input binary type
information, and space linear in the size of the program’s functions.
The worst-case checking rate was 301 cycles per instruction for a pathological pro-
gram; even a program of 450,000 lines with worst-case checking performance can be
checked in under a second at the computed clock speed of 140 MHz on 130nm.
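Both the scaled area figure and the sub-second claim above can be sanity-checked with simple arithmetic. This sketch assumes ideal quadratic area scaling with feature size and treats one line as one instruction:

```python
# Ideal area scaling: area shrinks with the square of the
# feature-size ratio.
area_130nm = 0.131                      # mm^2, from synthesis
area_32nm = area_130nm * (32 / 130) ** 2
print(round(area_32nm, 4))              # 0.0079 mm^2

# Worst-case checking time: 450,000 instructions at 301 cycles
# per instruction, on a 140 MHz clock.
cycles = 450_000 * 301
seconds = cycles / 140e6
print(round(seconds, 2))                # 0.97 s, i.e., under a second
```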
3.8 Related Work
Typed Assembly
When dealing with typed assembly, the most prominent works are TAL [80] and
its extensions TALx86 [81], DTAL [86], STAL [87], and TALT [88]. In TAL, the authors
demonstrate the ability to safely convert high-level languages based on System F (e.g.
ML) into a typed target assembly language, maintaining type information through
the entire compilation process. Their target typed assembly provides several high-
level abstractions like integers, tuples, and code labels, as well as type constructors for
building new abstractions.
TALx86 is a version of IA32, extending TAL to handle additional basic type construc-
tors (like records and sums), recursive types, arrays, and higher-order type constructors.
They use dependent types to better support arrays; the size of an array becomes part of
its type, and they introduce singleton types to track integer values of arbitrary registers
or memory words. TAL provides a way to ensure that high-level properties like type-
and memory-safety are preserved after compiler transformations and optimizations
have taken place.
Unlike TAL, our type system was co-designed with hardware checking in mind — a
distinction that greatly impacts the type system design. It allows for binary encoding of
types and empowers the target machine, rather than the program authors, to decide if
a program is malformed. TAL requires a complex, compile-time software typechecker,
as opposed to our small, load-time hardware checker. Our type system operates on an
actual machine binary and not an intermediate language.
The eventual target of TALx86 is untyped assembly code (assembled by their MASM
assembler into x86). The types are not carried in the binary and are not visible to the
device that ultimately runs the code. Though the types are useful, a device cannot trust that the
program it has been given has been vetted; therefore, bad binaries can still run on TAL’s
target machines.
Our work’s most significant contribution, the Binary Exclusion Unit (BEU), over-
comes this problem. The BEU, a hardware typechecker for the system capable of
rejecting malformed programs, is an integral, non-bypassable part of the machine; if
typechecking fails, execution cannot begin. To our knowledge, this is the only hard-
ware module that performs typechecking on binary programs. We leave expansion of
the BEU to other ISAs for future work, but note that the complexity of the TAL type
system indicates that a hardware implementation would be significantly more work
and overhead on an imperative ISA.
Architecture and Programming Languages
In SAFE [89], the authors develop a machine design that dynamically tracks types
at the hardware level. Using these types along with hardware tags assigned to each
word, their system works to prove properties about information-flow control and non-
interference. They claim that the generic architecture of their system could facilitate
efforts related to memory and control-flow safety in further work.
There has also been important work in binary analysis, which seeks to recover infor-
mation from arbitrary binaries to make sound and useful observations. For example,
Code Surfer [90] is a tool that analyzes executables to observe run-time and memory
usage patterns and determine whether a binary may be malicious. Work on binary type
reconstruction in particular seeks to recover type information from binaries. In one work
[91], they recover high-level C types from binaries via a conservative inference-based
algorithm. In Retypd [92], Noonan et al. develop a technique for inferring complex
types from binaries, including polymorphic and recursive structures, as well as pointer,
subtyping, and type qualifier information. Caballero et al. [93] provide a survey of the
many approaches to binary type inference and reconstruction.
Static safety via on-card bytecode verification in a JavaCard [94] is an interesting
line of work with a similar goal to our approach. However, a hardware implementation
can be verified non-bypassable in a way that is much harder to guarantee for software.
The Java type system is known to both violate safety [95, 96] and be undecidable [97],
which makes it a far more difficult target for static analysis and, we would argue, nearly
impossible to implement in hardware directly.
At the intersection of hardware and functional programming, previous works have
synthesized hardware directly from high-level Haskell programs [98], even incorporat-
ing pipelined dataflow parallelism [99]. Run-time solutions to help enforce memory
management for C programs have been proposed at the software level [100], as well as
in hardware-enforced implementations [101, 102]; these provide run-time, rather than
static, checks.
Other work has used formal methods to find and enforce properties at the hardware
level to help ensure hardware and software security [103], while others have shown the
effectiveness of hardware-software co-analysis for exploring and verifying information
flow properties in IoT software [104].
3.9 Conclusion
While the micro-controller design in this paper might be an extremely non-traditional
example, going so far as to have proofs of the properties that hold and rejecting non-
conforming programs outright, it opens the door to other work that limits hardware
functionality in meaningful and helpful ways without entirely giving up programmabil-
ity. The result of our effort is a Binary Exclusion Unit that can easily fit into embedded
systems or perhaps even serve as an element in a heterogeneous system-on-chip, provid-
ing a hardware-based solution that cannot be circumvented by software. Our approach
prevents all malformed binaries from ever being loaded (let alone run), and ensures
that all code loaded onto the machine is free from memory errors, type errors, and
erroneous control flow. It requires neither special keys/attestation nor trust in any part
of the system stack (a size zero TCB), providing its guarantees with static checks alone
(no dynamic run-time checking is needed).
This approach has many non-traditional moving parts, from the function-oriented
microprocessor at its heart, to the higher-level instruction set semantics, to the engine
that performs static analysis in hardware. Rather than work on an architecture simulator,
we built both the processor and the hardware checking engine in RTL, both to provide
synthesis results and to demonstrate the feasibility of actually building such a thing.
We have proofs of correctness for our approach at the algorithm level and gate-level
proofs of non-bypassability. We coded and typed not just a set of benchmarks, but
also a more complete medical application, which we then tried to break in order to
show that such an approach works in practice as well as in theory. The final design is
surprisingly small, taking no more than 0.0079 mm², and is capable of performing our
static analysis on binaries at an average throughput of around 60 cycles per instruction.
We believe this is the first time any binary static analysis has been implemented in
the machine hardware, and we think it opens an interesting new door for exploration
where properties of the software running on a physical platform are enforced by the
platform itself.
Chapter 4
Wire Sorts: A Language Abstraction for
Safe Hardware Composition
4.1 Introduction
In our current era of diminished transistor scaling, the need for higher levels of
energy efficiency and performance is greater than ever. The quest to achieve these
goals calls for more people to be able to participate in the creation of accelerators
and other digital hardware designs. It has become common for hardware designers
to utilize commercial libraries (known as Intellectual Property or IP catalogs) to get
hold of the most efficient or performant hardware components. At the same time,
open-source hardware has begun to emerge as a viable development strategy, drawing
parallels to open-source software, due to the commercial benefits of exploiting free
and open components. This new development paradigm raises questions of how
hardware developers can best compose their components and treat their underlying
implementations as opaque.
Modern high-level programming languages have many mechanisms that work in
support of effective modularity and abstraction; for example, one might place require-
ments on data (e.g. arguments) at an interface (e.g. function call) through a type
system. Most hardware description languages (HDLs), in contrast, have comparatively
little support for these features. The interface of the primary unit of abstraction, a
module, is typically described simply as “wires” which, in turn, may be refined as
“input” or “output.” However, we find experimentally across hundreds of designs that
these interfaces actually carry surprisingly complex requirements not just on how the
data are to be used or interpreted but even on which compositions lead to well-defined
digital designs. The goal of our work is to turn a programming language eye to this
problem: to be mathematically precise in the definition of wired interfaces and ulti-
mately give more support to hardware designers seeking modularity, abstraction, and
better compositional guarantees at the HDL level.
We wish to support a scenario where (1) separate hardware designers can inde-
pendently create a set of hardware modules according to some connection protocol
using an HDL, and the HDL can automatically infer relevant properties about the input
and output wires for each module in isolation; (2) a hardware designer can treat these
modules as opaque components without knowledge of their internals, wiring them
together into a circuit such that the HDL provides guarantees based on the properties of
the modules’ input and output wires; and (3) the number of design “surprises” discov-
ered late in the development cycle due to intermodular incompatibilities is significantly
reduced.
Such a scenario is increasingly not just desirable but strictly necessary. In the tradi-
tional design methodology where a whole chip may be designed by a single company
or team who can agree on interfaces in advance and readily inspect modules’ internals,
establishing modules’ compositionality was straightforward. However, this is a much
stickier problem in a world of IP-driven design where the user of a module may have
no knowledge of the module’s internals, perhaps even working with an obfuscated or
encrypted design [105]. IP catalog designers today lack any clear specification of the
module-level connection properties needed to ensure well-composed designs. Thus,
it is incredibly easy to create a design that assumes something about an up- or down-
stream interface which only becomes apparent after the full design has been completed
at the RTL level. Discovering such an issue late in the process can be serious
because the exact cycle a data value is produced might need to change to accommodate
a different interface. While this sounds easy in theory, traditional RTL design practices
are fragile to timing changes, and fixing problems might mean significant surgery to
control state machines, the recoordination of multiple producers or consumers, or even
failure to meet a latency goal. As we ask a broader set of engineers to engage in the
hardware design process, whether to understand tradeoffs in an AI accelerator design
or deploy computation into an FPGA in the cloud, we need languages that help steer
effort towards realizable designs and reduce the number of “surprises” (i.e. failures)
typically only found at the very last stages of implementation (at synthesis time).
The specific property that we focus on in this work is what we are calling well-
connectedness; we formally specify the property in Section 4.3.4 but informally, it implies
that the final circuit does not contain any combinational loops. Combinational loops
are a sign of a broken design (except in certain rare circumstances) and must be avoided.
Such loops are easy to spot once all components have been fully implemented and then
synthesized into a netlist (one need only look for cycles in the netlist graph) but are
hidden through the entire process of design at the HDL level, especially when they cross
module boundaries and require reasoning about multiple modules’ internal structures.
(Avoiding such loops is a necessary but not sufficient condition for overall correctness.
For example, we are not concerned in this paper with checking that a specific protocol
is being correctly followed. Our techniques could potentially also reason about properties
related to timing and circuit layout, but we leave these for future work.)
This is a real problem we have encountered in our experiences writing digital hardware
designs, motivating us to find better ways via programming language abstraction and
enforcement.
This problem of avoiding combinational loops at the HDL module level is surpris-
ingly subtle, requiring that designers reason about a number of non-obvious corner
cases. Well-connectedness cannot always be guaranteed by looking at pair-wise module
interconnections but is in fact a property of the entire circuit requiring information about
all modules at once. Nevertheless, we show that it is possible to annotate module inter-
faces at the HDL level for each module independently such that the well-connectedness
of a given combination of modules can be automatically proven by only looking at
these interface annotations. We further show that if a full implementation of the design
is already available, such as for legacy code, we can automatically infer annotations
directly from the design. These annotations in turn radically lessen the number of
interfaces where “surprises” might occur, allowing designers to focus their attention
more effectively. The specific contributions of this paper are:
• We are the first to apply a modular static analysis to the problem of ensuring
the correct compositionality of hardware modules in arbitrary RTL, via a global
property which we define as well-connectedness.
• We prove this property is achievable in a modular way via a mathematical spec-
ification of wire dependencies, developing a novel taxonomy of sorts: to-sync,
to-port, from-sync, and from-port.
• We embody these properties, and the analysis they enable, in a usable and scalable
tool that completely prevents the late discovery of combinational loops. We further
propose an extension to the analysis to protect synchronous memory semantics
through composition.
• We analyze more than 500 parameterized hardware modules to quantify, for the
first time, the diversity of expectations placed on module interfaces found in the
wild. Across three independent projects (BaseJump STL, OpenPiton, and a RISC-V
implementation) our analysis is able to automatically infer the correct wire sorts
to enable composability in less than 31 seconds. Our analysis is 2.6–33.9x faster at
finding intermodular loops than standard cycle detection during synthesis.
In Section 4.2 we describe related approaches to modular hardware design and
why they do not solve the problem addressed in this paper. In Section 4.3 we describe
the formalisms related to well-connectedness and the wire properties of to-sync, to-
port, from-sync, and from-port, and discuss how to apply these formalisms to analyze
module connections. In Section 4.4 we discuss details of the added annotation language
and checker tool implementation. In Section 4.5 we evaluate our technique and tool by
applying the checker to a series of standalone modules as well as modularly-constructed
circuits, and we conclude in Section 4.6.
4.2 Motivation and Related Work
To demonstrate the problem, we use the example of a simple first-in first-out (FIFO)
queue using the ready-valid protocol, as shown in Figure 4.1. The role of the FIFO
queue is to accept input data from one module (at the consumer endpoint), buffer
that data inside its internal state, and then send the data to another module (at the
producer endpoint) upon request. The consumer endpoint consists of a set of wires:
datain contains the data being sent to the FIFO; validin determines whether the incoming
signals on datain represent valid input from the connected module; and readyout is an
outgoing signal indicating whether the FIFO is ready to accept input (i.e., it isn’t full).
Similarly, the producer endpoint consists of another set of wires: dataout contains the
Figure 4.1: Normal FIFO queue. The consumer endpoint receives data from one module, and the producer endpoint sends data to another module.
data being produced by the FIFO and read from another connected module; validout
determines whether the outgoing signals on dataout represent valid data from the FIFO
(i.e., it isn’t empty); and readyin is an incoming signal indicating whether the connected
module is ready to receive data from this FIFO.
We have left the internals of the FIFO opaque (as they may realistically be to a user);
the details do not matter for our purposes except to note that each FIFO endpoint is
combinationally independent of the other. In other words, every path between the
endpoints is interrupted by some state inside the FIFO, so that an action at one endpoint
cannot affect the other endpoint within a single cycle.
A FIFO queue of this kind is often called a “universal interface” because it can be
placed between any two modules without danger of ill effects due to timing issues.
However, for various reasons (such as efficiency) a normal FIFO queue may not be
appropriate. A forwarding FIFO improves efficiency by allowing data entering in one
clock cycle to be immediately sent out in the same clock cycle if the FIFO is empty. An
abstract depiction of this module is shown in Figure 4.2.
The important points for our purposes are that: (1) the module interface (i.e., the
ready-valid endpoint specification) is unchanged from the normal FIFO, so that from a
Figure 4.2: Forwarding FIFO queue.
module connection standpoint the two are indistinguishable; and (2) the endpoints are
no longer combinationally independent because there is a combinational path from one
endpoint to the other, enabling the data forwarding that is the whole point of the new
module. Here’s a closer look at the combinational logic used for assigning to validout
across the two FIFO modules (where countreg is a register containing the number of
enqueued elements):
• Normal:
validout = (countreg > 0)
• Forwarding:
validout = (countreg > 0) ∨ (validin ∧ readyout)
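The difference between the two assignments can be made concrete with a small executable model. This is a Python sketch: the names mirror the wires above, the model ignores data movement entirely, and it exists only to show which inputs the output depends on.

```python
def normal_validout(count_reg):
    # Output depends only on internal state (the element count).
    return count_reg > 0

def forwarding_validout(count_reg, valid_in, ready_out):
    # Output now also depends combinationally on the consumer-side
    # inputs, enabling same-cycle forwarding when the FIFO is empty.
    return count_reg > 0 or (valid_in and ready_out)

# Empty FIFO: the normal FIFO cannot assert validout this cycle,
# but the forwarding FIFO can if valid data is arriving right now.
print(normal_validout(0))                  # False
print(forwarding_validout(0, True, True))  # True
```

The extra disjunct is precisely the combinational input-to-output path that makes the two modules behave differently when composed, despite their identical interfaces.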
This combinational dependence between the endpoints means that designers may
inadvertently cause a combinational loop when they wire modules together. In fact,
the problem may not even arise due to direct interactions between the queue and the
modules connected to its endpoints, but rather due to indirect interactions mediated by
yet other modules. We show an example of a problematic circuit in Figure 4.3.
Figure 4.3: Forwarding FIFO connected to other modules causing a combinational loop (in yellow). Only pertinent IO ports have been shown for each module.
Here we have three modules: a normal FIFO, a forwarding FIFO, and some module
X. In this contrived example, the normal FIFO sends a signal to module X that is some
combinational function of its validin wire (here, a direct connection); module X sends
some combinational function of its input to the forwarding FIFO’s validin, which (as
previously discussed) is a combinational input to the forwarding FIFO’s validout, which
in turn is wired to the normal FIFO’s validin. If the forwarding FIFO were instead a
normal FIFO (which at a module connection level looks the same) then this would
be fine, but since it is not this circuit contains a combinational loop. Detecting and
understanding the cause of the loop requires reasoning about the internal details of
three different modules.
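The loop in Figure 4.3 can be made concrete by listing each wire's combinational driver and following the chain. This is a hypothetical sketch: the wire names are ours, and each wire is assumed to have exactly one driver.

```python
# Each entry: wire -> the wire that combinationally drives it.
drives = {
    "x.in":           "normal.out",    # normal FIFO output feeds module X
    "fwd.validin":    "x.out",         # X's combinational function
    "normal.validin": "fwd.validout",  # wired back to the normal FIFO
    # Intra-module combinational paths:
    "normal.out":     "normal.validin",  # direct connection inside normal FIFO
    "x.out":          "x.in",            # combinational function inside X
    "fwd.validout":   "fwd.validin",     # the forwarding path
}

# Follow drivers from any wire; revisiting a wire means a loop.
wire, seen = "normal.validin", set()
while wire not in seen:
    seen.add(wire)
    wire = drives[wire]
print("combinational loop through", wire)  # loops back to normal.validin
```

No single module contains a cycle; the loop only closes once all three dependency summaries are combined, which is why pairwise inspection misses it.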
We note that detecting the existence of the combinational loop is simple once the
HDL program has been synthesized to a netlist: simply perform a standard cycle
detection algorithm. Verilog [106] synthesis tools such as the linter in Verilator [107]
and the Yosys synthesis suite [108], and HDLs such as Chisel [109] and PyRTL [110, 111],
can provide warnings about loops during synthesis. However, relying on loop detection
at synthesis time has several drawbacks. First, gate-level netlists take a long time to
produce and are significantly larger (47X larger in one example we studied), since high-
level and multi-bit operations have been transformed into sets of simple 1-bit primitive
gates. Second, these detection systems aren’t infallible: under certain combinations
of flags or optimizations, tools like Yosys fail to detect loops or silently delete them,
“successfully” synthesizing the offending circuit. Third, once the loop is detected after
synthesis, it is entirely up to the designer to trace the synthesized loop back to the
relevant modules and interactions in the original HDL program.
The RTL, on the other hand, has fewer gate dependencies to analyze while still
representing the same dataflow graph. Going up one level of abstraction, the behavioral
level describes the same system algorithmically, making it even easier to take advantage
of high-level constructs for determining dependency. Thus, our goal is to raise the
level of abstraction for detecting loops up to the HDL module level in order to give
the designer maximum information and context, to avoid loops more easily, and to
detect loops sooner in the design process. An apt analogy is the difficulty in trying to
determine the cause of an assembly-level link time error versus one presented at the
source level; we aim to do the latter for HDLs.
There do exist HDL-level tools to check certain kinds of properties, for example Sys-
temVerilog Assertions (SVA) [112], Property Specification Language (PSL)/Sugar [113,
114], and Open Verification Library (OVL) [115]. These frameworks facilitate the spec-
ification of temporal relationships between wires, which are checked via simulation or
model checking rather than statically at design time. These tools can express properties
about the relative order in which things occur but not the reasons why they occur.
Since our analysis is concerned with the exact causes of events (i.e., combinational
dependencies between wires), we believe from our experience using these tools that
they are not suitable for our purpose.
There is additionally a long history of using higher-level abstractions to describe
hardware formally [116, 117] and of using richer type systems [118] and functional
programming techniques [119, 120, 121, 122]. DSLs like Murφ [123] and Dahlia [124]
target specific use cases like protocol descriptions or improved accelerator design, while
high-level synthesis (HLS) techniques [125, 126] translate subsets of C/C++ to RTL.
Other HDLs [127] like PyMTL [128], Cλash [129], Pi-Ware [130], HardCaml [131],
BlueSpec [132], and Kami [20] also use modern programming language techniques
to overcome some of the issues that arise when writing in traditional HDLs [133, 134];
like many of them, we focus on improving the register-transfer level design process by
creating better and more expressive abstractions.
4.2.1 BaseJump STL
The closest work to our own is BaseJump STL [135, 136]. Their work discusses the
requirements for creating a library of hardware modules (analogous, in their words, to
the C++ standard template library) and introduces some informal terminology to help
describe module interfaces and promote properties such as well-connectedness. They
draw upon the principles of latency-insensitive hardware design [137, 138, 139, 140]
but aim for a less restrictive model.
BaseJump STL informally defines the notions of helpful and demanding module
interface endpoints (such as the ready-valid endpoints from the previous FIFO example).
The distinction is based on whether an endpoint is able to offer up data without “waiting”
for input. For the ready-valid protocol, a helpful producer offers validout upfront while a
demanding producer waits for readyin before computing and emitting validout. Similarly,
a helpful consumer offers readyout upfront while a demanding consumer waits for
validin before computing and emitting readyout. BaseJump STL creates a taxonomy of
interface connections based on the various combinations of helpful and demanding
endpoints. They note that the only unsafe combination is a demanding-demanding
connection, which would directly lead to a combinational loop.
The problem with BaseJump STL’s approach is that it considers module endpoint
connections in isolation: the notion of dependence inherent in the demanding and
helpful classifications only considers wires that directly participate in the connection.
However, this isn’t sufficient to guarantee detection of combinational loops, as we
have shown with our previous example of a problematic circuit in Figure 4.3. In that
example, the forwarding FIFO’s producer endpoint is considered helpful because
validout is offered without needing to wait on readyin. The normal FIFO’s consumer
endpoint is considered helpful because readyout doesn’t wait on validin. According
to BaseJump STL’s model, these modules have a helpful-helpful connection and are
therefore safe. But as we have demonstrated, the design is actually faulty due to the third
module in the circuit and how it interacts with the connection between the forwarding
and normal FIFOs.
We discovered the issues with BaseJump STL’s notions of helpful and demanding endpoints
when we attempted to formalize them and prove that they were adequate to detect combinational
loops at the HDL module level of abstraction. Our experience led us to conclude that in
order to guarantee well-connectedness, we need to: (1) be able to reason about module
endpoints based on wire dependencies between the input and output wires within a
module; and (2) using only the resulting endpoint annotations, reason about an entire
circuit at the module level to resolve possible loops introduced by interactions between
multiple modules.
4.3 Wire Sorts and Well-Connectedness
In this section we define our notion of wire sorts, formalize the property of well-
connectedness using these sorts, and prove a set of properties that can be used to demon-
strate that a circuit composed of independently designed modules is well-connected.
Finally, we show exactly how our definitions contrast to BaseJump STL’s notions of
helpful and demanding endpoints and how our approach avoids the problems that
BaseJump STL encounters.
4.3.1 Defining Basic Domains
We formally define a set of basic domains that collectively comprise a circuit com-
posed of independent modules, so that we can precisely define wire sorts and well-
connectedness and prove that a well-connected circuit has no combinational loops. Our
formalisms and techniques apply to synchronous digital designs, and we assume for
simplicity that there is a single clock driving all stateful elements (both properties hold for the most commonly found designs).
A wire is denoted by wσ where σ ∈ {const, reg, in, out,
basic}. A constant wire wconst produces a 0 or 1, an input wire win serves as input
into a module, and an output wire wout serves as output from a module. Registers are
stateful elements that are latched each cycle according to the same shared clock; the wreg
wires represent the outputs of these registers. Basic wires wbasic are used to connect or
combine these wires together via nets. A net is a tuple (−→wσ,wσ, op) representing a gate,
with multiple wires −→wσ coming into the gate, a single wire wσ coming out of the gate,
and a bitwise logical operation op denoting the type of gate such that wσ = op(−→wσ).
A module M is a tuple (−−→win,−−−→wout,−−→net) composed of sets of input wires, output wires,
and nets representing a directed acyclic graph (DAG); in this DAG, the nets are nodes,
and the outputs of the nets are the forward edges in the graph. The input and output
wires form the module’s external interface. Given a module M = (−−→win,−−−→wout,−−→net) we will
use the shorthand M.inputs, M.outputs, and M.nets to mean −−→win, −−−→wout, and −−→net, respectively.
A circuit C is a tuple (−→M, −−−−−−−−−→(wout,win)) composed of a set of modules C.modules and
the connections C.conns between their inputs and outputs. Given M1,M2 ∈ C.modules
and two wires wout ∈ M1.outputs and win ∈ M2.inputs, we use wout →C win to mean
that wout is directly connected to win, i.e., (wout,win) ∈ C.conns. We define the function
module(win,C) = M iff win ∈ M.inputs ∧ M ∈ C.modules. Without loss of expressiveness, we assume that one module’s outputs are always connected directly to another module’s inputs (if there is any extra-modular logic between modules, one can wrap that logic into its own module to trivially meet this condition). Note that a circuit C and its set of modules C.modules can essentially
define a larger module composed of submodules. A circuit composed of many of these
“supermodules” connected together in turn makes an even larger module, ad infinitum.
Thus the intra- and intermodular analyses we discuss in the following sections are fully
generalizable to the notion of submodules common in popular HDLs.
4.3.2 Defining Combinational Reachability
We define two different levels of combinational reachability: one intra-modular that
can be computed for each module independently and one inter-modular that involves
the entire circuit.
Given a module M containing a wire wσ, we define the combinationally reachable set
reachable(M,wσ) as the set of wires reachable from wσ in M.nets without going through
any wire wreg; in other words, the transitively reachable wires that don’t go through
any registers (state).
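As a concrete illustration, the reachable computation can be rendered in a few lines of Python. The encoding below is a toy, hypothetical one (wires are name strings; combinational nets are (inputs, output) pairs; register outputs appear only as net sources, so traversal stops at state automatically); it is not PyRTL's actual representation.

```python
def reachable(module, wire):
    """Combinationally reachable set: follow the module's combinational nets
    forward from `wire`. Register outputs are never a net's destination in
    this encoding, so the traversal naturally stops at state."""
    seen, frontier = set(), [wire]
    while frontier:
        w = frontier.pop()
        for ins, out in module["nets"]:
            if w in ins and out not in seen:
                seen.add(out)
                frontier.append(out)
    return seen

# Toy module: in1 feeds a gate chain to out1; in2 feeds a register (implicit
# in this encoding) whose output wire r drives out2.
toy = {
    "inputs": {"in1", "in2"},
    "outputs": {"out1", "out2"},
    "nets": [(("in1",), "a"), (("a",), "out1"), (("r",), "out2")],
}
assert reachable(toy, "in1") == {"a", "out1"}  # crosses gates freely
assert reachable(toy, "in2") == set()          # blocked by the register
assert reachable(toy, "r") == {"out2"}         # a new combinational segment
```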
We can now define two terms that will be important for determining combinational
Figure 4.4: Example for computing the output-port-set and input-port-set of a module M. The output-port-set of input w4in is {w2out} and ∅ for the other inputs. The input-port-set of w2out is {w4in} and ∅ for w1out.
(a) A from-sync wire w1out connected to a to-sync wire w2in.
(b) A from-sync wire w1out connected to a to-port wire w2in.
(c) A from-port wire w1out connected to a to-sync wire w2in.
Figure 4.5: Connections between to-sync or from-sync wires cannot result in combinational loops.
reachability at the module level without needing the internal details of the relevant
modules: output-port-set and input-port-set. The output-port-set is relevant for mod-
ule inputs: given module M and input win, the output-port-set output-ports(M,win) is
the set of module output wires that are combinationally reachable from that input wire.
In other words, output-ports(M,win) = reachable(M,win) ∩ M.outputs. Similarly, the
input-port-set is relevant for module outputs: for an output wire wout of module M,
the input-port-set input-ports(M,wout) is the set of module input wires that combi-
nationally reach that output wire. In other words, input-ports(M,wout) = {win | win ∈
M.inputs,wout ∈ output-ports(M,win)}. These sets need only be computed once per
module definition (regardless of how many instantiations are used in a circuit).
To illustrate these definitions consider the module diagram in Figure 4.4. In this
module, which we’ll call M, the output-port-set of input w4in is output-ports(M,w4in) = {w2out}, while the output-port-set of each of the inputs w1in, w2in, w3in is ∅. The input-
port-set of output w1out is ∅, while the input-port-set of w2out is input-ports(M,w2out) =
{w4in}.
Given a circuit composed of multiple modules along with the output-port-set and
input-port-set for each input and output wire of each module, we can compute inter-
modular combinational loops without needing to inspect the internals of any module.
The transitive forward reachability of any output wire amounts to a fixpoint computation involving the output-port-sets of the modules in the circuit; while tracing a path between wires, if a module input wire is encountered, skip over its module’s internal logic by continuing with the output wires in its output-port-set. We use w1 ⤳C w2 to denote that wire w1 transitively affects wire w2 in circuit C and call ⤳C the TransitivelyAffects relation. This computation is shown in Algorithm 1.
Algorithm 1: TransitivelyAffects
Data: wout, win, C
Result: True if output wire wout of one module affects input wire win of another module in circuit C, otherwise False.
begin
  a ← ∅
  c ← {w1in | (wout, w1in) ∈ C.conns}
  while c ≠ ∅ do
    w1in ← pop(c)
    if w1in = win then return True
    if w1in ∈ a then continue
    M ← module(w1in, C)
    forall w1out ∈ output-ports(M, w1in) do
      c ← c ∪ {w2in | (w1out, w2in) ∈ C.conns}
    a ← a ∪ {w1in}
  return False
4.3.3 Wire Sorts
We can now formally define the novel set of sorts for module input and output wires,
a key contribution of this paper. An input wire win is to-sync if output-ports(M,win) =
∅ and is to-port otherwise. An output wire wout is from-sync if input-ports(M,wout) = ∅
and is from-port otherwise. The to-sync, to-port, from-sync, or from-port designation
of a wire is its sort, and this set of sorts is sufficient to label all module ports. In Figure 4.4,
the sort of input wires w1in,w2in,w3in is to-sync while the sort of w4in is to-port. Of the
outputs, the sort of w1out is from-sync while the sort of w2out is from-port.
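Sort inference follows directly from the port-set definitions; the sketch below reuses the toy encoding from the reachability definition (combinational nets as (inputs, output) pairs, register outputs never produced by a net) and gives the module hypothetical internals that reproduce the port sets of Figure 4.4.

```python
def reachable(module, wire):
    """Forward combinational reachability over the module's nets."""
    seen, frontier = set(), [wire]
    while frontier:
        w = frontier.pop()
        for ins, out in module["nets"]:
            if w in ins and out not in seen:
                seen.add(out)
                frontier.append(out)
    return seen

def output_ports(module, win):
    return reachable(module, win) & module["outputs"]

def sorts(module):
    """Assign to-sync/to-port to inputs and from-sync/from-port to outputs."""
    s = {}
    for win in module["inputs"]:
        s[win] = "to-port" if output_ports(module, win) else "to-sync"
    for wout in module["outputs"]:
        awaited = {w for w in module["inputs"] if wout in output_ports(module, w)}
        s[wout] = "from-port" if awaited else "from-sync"
    return s

# Shaped like Figure 4.4: only w4in combinationally reaches w2out, and
# w1out is driven by a register output r1 (hypothetical internals).
m = {
    "inputs": {"w1in", "w2in", "w3in", "w4in"},
    "outputs": {"w1out", "w2out"},
    "nets": [(("w4in",), "w2out"), (("r1",), "w1out")],
}
assert sorts(m)["w4in"] == "to-port" and sorts(m)["w2out"] == "from-port"
assert sorts(m)["w1in"] == "to-sync" and sorts(m)["w1out"] == "from-sync"
```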
Note that an input wire of sort to-sync cannot be involved in a combinational loop,
nor can an output wire of sort from-sync. By definition, these wires terminate or
originate in some stateful or constant-valued element, and therefore module interface
wires of these sorts can be freely connected to other modules safely without regard to
the connected module’s interface wire sorts or the rest of the circuit. This leads us to
our first property.
Property 1. Two connected wires wout and win cannot be involved in a combinational loop if
wout is from-sync or win is to-sync.
Summary. Given a module M1 such that wout ∈ M1.outputs, if wout is from-sync, then
input-ports(M1,wout) = ∅, meaning it does not combinationally depend on anymodule
input. Similarly, given a module M2 such that win ∈ M2.inputs, if win is to-sync, then
output-ports(M2,win) = ∅, meaning it does not combinationally affect any module
output. □
In Figure 4.5a, from-sync wire w1out is connected to to-sync wire w2in, while in
Figure 4.5b, it is connected to to-port wire w2in. We can see that it doesn’t matter what
sort of input w1out connects to, since there is at least one stateful element shielding
w1out from being fed into itself combinationally: the stateful elements of M1 in both
figures and additionally the stateful elements of M2 in Figure 4.5a. In both cases, it
doesn’t matter what modules M3 . . .Mn may do or any other output M1 may have that
could possibly feed into them. Similarly, in Figure 4.5c, because from-port wire w1out is
connected to to-sync wire w2in, we can know even without analyzing the entire circuit
that this particular connection is safe.
4.3.4 Defining Well-Connectedness
There are cases, like our previous example of a forwarding FIFO queue in Section 4.2,
where it doesn’t make sense to require that module interface wires be only to-sync or
from-sync. Relaxing this requirement means we cannot rely solely on Property 1 for
establishing safety between wires, and so we must more precisely define our notion of
inter-wire safety as follows:
(a) A from-port wire w1out connected to a to-port wire w2in, which through interactions with other modules can be shown to not result in a combinational loop.
(b) A from-port wire w1out connected to a to-port wire w2in, resulting in a combinational loop.
(c) A from-port wire w1out connected to a to-port wire w2in, which cannot be determined to be well-connected until the entire circuit has been completed.
Figure 4.6: Connections between from-port or to-port wires might result in combinational loops.
Definition 1 (Wire Well-Connectedness). Given a circuit C and two modules M1,M2 ∈
C.modules (where M1 may be M2), an output wire wout ∈ M1.outputs, and an input wire win ∈
M2.inputs such that wout →C win, wout is well-connected to win iff ∀w1 ∈ input-ports(M1,wout), ∀w2 ∈ output-ports(M2,win), w2 ̸⤳C w1.
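Definition 1 reads directly as a predicate over precomputed port sets. The sketch below uses the same illustrative circuit encoding as the Algorithm 1 rendering (`conns`, `oports`) plus an `iports` map giving each output wire's input-port-set; the wire names are hypothetical, loosely echoing the ready-valid example.

```python
def transitively_affects(wout, win, circuit):
    """TransitivelyAffects relation, skipping module internals via oports."""
    visited, worklist = set(), set(circuit["conns"].get(wout, set()))
    while worklist:
        w1in = worklist.pop()
        if w1in == win:
            return True
        if w1in in visited:
            continue
        visited.add(w1in)
        for w1out in circuit["oports"][w1in]:
            worklist |= circuit["conns"].get(w1out, set())
    return False

def well_connected(wout, win, circuit):
    """Definition 1: no output in win's output-port-set may transitively
    affect any input in wout's input-port-set."""
    return all(not transitively_affects(w2, w1, circuit)
               for w1 in circuit["iports"][wout]
               for w2 in circuit["oports"][win])

# A forwarding-FIFO-shaped hazard: M1's output v awaits its input r_in;
# M2's input v_in combinationally reaches r_out, wired back to r_in.
loopy = {
    "conns": {"v": {"v_in"}, "r_out": {"r_in"}},
    "oports": {"v_in": {"r_out"}, "r_in": set()},
    "iports": {"v": {"r_in"}, "r_out": set()},
}
safe = {**loopy, "conns": {"v": {"v_in"}}}   # break the back edge
print(well_connected("v", "v_in", loopy))    # False: would close a loop
print(well_connected("v", "v_in", safe))     # True
```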
It is straightforward to show that well-connectedness satisfies our desired safety property:
Property 2. The connection between two wires wout and win that are well-connected to one
another does not introduce a combinational loop.
Summary. By definition, all of the input wires w1 in M1 that combinationally affect wout
are present in its input-port-set. Likewise, by definition, all of the output wires w2 in M2
that are combinationally affected by win are in its output-port-set. If it is impossible to
transitively trace any output wire w2 through the nets it combinationally affects to any
input wire w1 that wout is awaiting, then no combinational loop has been introduced by
wout → win. □
We illustrate this property in Figure 4.7 below.
Any wires of sort to-port or from-port are potential problems, so we cannot in general determine safety without inspecting the entire circuit.

Figure 4.7: Illustration of the Wire Well-Connectedness definition. Given a circuit C, well-connectedness for a connection (wout, win) ∈ C.conns holds when there does not exist an output port w2 in win’s output-port-set that is transitively connected (⤳C) to any wire w1 in wout’s input-port-set.

For example, Figure 4.6a and Figure 4.6b both show two connected modules with a from-port output wire connected to a to-port input wire, but in the former case it does not result in a combinational loop while in the latter it does. Note, however, that we still do not need to inspect the internals of any modules as long as we know the sorts of their interface wires.
We can distinguish between the examples in Figure 4.6a and Figure 4.6b by defining
a safe class of connections to from-port sorts, called safely from-port:
Definition 2 (Safely From-Port Wires). A from-port output wire wout connected to a
to-port input wire win is called safely from-port with respect to win if wout and win are
well-connected according to Definition 1.
A safely from-port output wire combinationally depends on some module input
wires (and hence its value is not valid until those inputs have propagated to the output
wire later in the clock cycle) but still guarantees the absence of combinational loops
with respect to certain connected wires. In Figure 4.6a, the dependent output wire
w1out is safely from-port with respect to w2in, and hence the overall circuit is well-
connected since w1out is not connected to anything else. In contrast, in Figure 4.6b the
from-port output wire w1out is not safely from-port and hence the overall circuit is not
well-connected.
Determining whether a wire is safely from-port or not requires the complete circuit
in order to compute the TransitivelyAffects relation. Figure 4.6c demonstrates this fact.
We define a circuit composed of a set of modules such that all module interface wires
are connected to be a complete circuit. A well-connected circuit is a complete circuit
that has no combinational loops. This definition brings us to our final property:
Property 3 (Circuit Well-Connectedness). A complete circuit is well-connected if and only
if all from-port output wires in the circuit are safely from-port with respect to the to-port
input wires to which they are connected.
Summary. The forward implication is that in a complete, well-connected circuit C, all
from-port output wires are safely from-port. By definition, a well-connected circuit
does not contain any combinational loops. If there exists some module M1’s from-port
output wire wout that is not safely from-port, then by the definition of safely from-port
(Definition 2) either:
1. Wire wout is not connected to any other wire. But this contradicts the fact that the
circuit must be complete.
2. Wire wout is connected to wire win of some module M2 and there exist wires
w1 ∈ input-ports(M1,wout), w2 ∈ output-ports(M2,win) such that w2 ⤳C w1. By
the definition of ⤳C this means that there is a combinational loop in the circuit.
But this contradicts that the circuit is well-connected.
Therefore by contradiction the forward implication holds. The reverse implication is
that if all from-port output wires are safely from-port, then the complete circuit is well-
connected. Since the circuit is complete, every input and output wire is connected to
some output or input wire, respectively. For a given connection, if either the output wire
is from-sync or the input wire is to-sync then they cannot be part of a combinational
loop. So the only case that we need to worry about is if the output wire wout is from-port
and the input wire win is to-port. Assuming that wout is safely from-port, this means
that by Definition 2 it must be true that wout and win are well-connected according
to Definition 1. This property directly implies that these wires cannot be part of a
combinational loop. Therefore the reverse direction holds. □
4.3.5 Putting It All Together
Given the definitions and properties stated above, we can divide checking a circuit
for well-connectedness into three stages:
• Stage 1. At the time each module is designed, automatically compute the sort of
each input and output wire. Annotate each wire with its sort and, for a from-port
or to-port wire, its input-port-set and output-port-set, respectively.
• Stage 2. When modules are connected during circuit design, any connections
involving a from-sync or to-syncwire can be marked as safe immediately.
• Stage 3. Either periodically during circuit construction (useful when using interac-
tive HDLs with a tight feedback loop) or only once when the circuit is completed:
for each from-port output wire connected to a to-port input wire, check whether
the output wire is safely from-port with respect to the input wire.
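The Stage 2 triage above can be summarized as a per-connection classifier; the sketch below represents sorts as plain strings, which is an illustrative simplification rather than the tool's actual annotation mechanism.

```python
def check_connection(out_sort, in_sort):
    """Stage 2 triage: any connection touching a -sync wire is safe by
    Property 1; from-port -> to-port pairs must wait for the Stage 3
    whole-circuit safely-from-port check."""
    if out_sort == "from-sync" or in_sort == "to-sync":
        return "safe"
    assert out_sort == "from-port" and in_sort == "to-port"
    return "defer-to-stage-3"

print(check_connection("from-sync", "to-port"))   # safe
print(check_connection("from-port", "to-sync"))   # safe
print(check_connection("from-port", "to-port"))   # defer-to-stage-3
```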
This process neatly encapsulates the necessary information about the module’s internal details into its interface and allows for checking well-connectedness in the final circuit while treating each module as opaque.
4.3.6 Comparison to BaseJump STL
We can relate the informal notions given by BaseJump STL (described in Section 4.2)
to our more precise definitions given here and thereby pin down exactly where the
BaseJump STL notions become problematic. BaseJump STL says that an endpoint
is demanding if it needs the other endpoint’s input signal (validin for the consumer
endpoint, readyin for the producer endpoint) before computing its own output signal
(readyout for the consumer endpoint, validout for the producer endpoint) and is helpful
otherwise.
Using our definitions, we can formulate these notions precisely. We are given a
module M with producer endpoint (readyin, validout, dataout) and consumer endpoint
(readyout, validin, datain). The producer endpoint is helpful iff readyin ∉ input-ports(M, validout),
otherwise it is demanding. This says nothing about the presence or absence of M’s
other inputs in input-ports(M, validout), meaning validout could be from-port and thus
potentially cause a loop due to other module connections. The consumer endpoint is
helpful iff validin ∉ input-ports(M, readyout), otherwise it is demanding; again, this
does not preclude readyout from being from-port.
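This formulation can be stated as two small predicates over input-port-sets (here a hypothetical map from each output wire name to the set of inputs it combinationally awaits); the wire names follow the ready-valid protocol described in the text.

```python
def producer_is_helpful(iports):
    """Producer endpoint (ready_in, valid_out, data_out): helpful iff
    valid_out does not combinationally await ready_in."""
    return "ready_in" not in iports["valid_out"]

def consumer_is_helpful(iports):
    """Consumer endpoint (ready_out, valid_in, data_in): helpful iff
    ready_out does not combinationally await valid_in."""
    return "valid_in" not in iports["ready_out"]

# A module whose valid_out awaits a third input (here called yumi_i): the
# producer endpoint is classified helpful, yet valid_out is still from-port,
# so connections through other modules can still close a loop.
m = {"valid_out": {"yumi_i"}, "ready_out": set()}
print(producer_is_helpful(m))   # True: helpful by BaseJump's test
print(consumer_is_helpful(m))   # True
print(len(m["valid_out"]) > 0)  # True: valid_out is nonetheless from-port
```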
According to the BaseJump STL work, the only potentially problematic connection
is between two demanding endpoints; all other types of connections are safe while
demanding-demanding connections should be forbidden. However, according to
our analysis above this is not a sufficient condition for correctness. It is possible (as
demonstrated in Section 4.2) for a helpful-helpful connection to create a combinational
loop; this is because the helpful and demanding endpoint classifications focus only
on direct connections between two modules and do not consider the possibility of
combinational loops via other modules not directly involved in the connection.
4.3.7 Extension to Synchronous Memory Reads
The basic set of domains described in Section 4.3.1 omits mention of memories. Mem-
ories are a special case in digital logic; their semantics partially depend upon whether
they are synchronous or asynchronous. Synchronous memories are often preferable in
order to be able to synthesize a design into efficient hardware, but using them imposes
additional conditions on the design. For example, one class of synchronous memories
requires that the read operations are able to start at the beginning of the clock cycle. What
this often means is that the designer must make sure that the read address port, raddrin,
is fed directly from a register.
Take as an example the module-memory interconnection in Figure 4.8a. At first
glance, this condition requires that any external module’s output wire wout connected
to raddrin be from-sync. However, this still doesn’t meet the required conditions for
synchronous memories; our definition of from-sync allows combinational logic to
exist between the source register from which the from-sync data originates and its
destination. In order for the data on wout to be available immediately at the beginning of
the clock cycle, it must not go through any combinational logic at all (since all gates have
propagation delay), and so we find that we must create a from-sync subsort, which
we’ll call from-sync-direct.
A from-sync-direct output wire wout is simply one where reachable(M,wout) = ∅.
By our definition of reachable in Section 4.3.2, this means that wout is connected only
directly to registers, with no intermediate combinational logic. In Figure 4.4, wire w1out
could thus be labelled from-sync-direct and qualify as being able to be connected to a
synchronous memory’s input wires. Its data is available at the start of the clock cycle
because its signal doesn’t need to propagate through any attached combinational logic.
A from-syncwire that isn’t from-sync-direct is said to be from-sync-indirect.
(a) The read address line raddrin of certain synchronous memories must be connected to from-sync-direct wires like wout.
(b) Other synchronous memories require dataout to be connected to to-sync-direct input wires like win feeding directly into a register.
Figure 4.8: Wire sorts for synchronous memories.
There are other forms of memories where synchronous requirements are placed on
certain outputs, rather than inputs. In these memories, the designer must ensure that
the dataout wire is fed directly into a register, as shown in Figure 4.8b. This naturally
leads to an input subsort for describing such conditions, which we call to-sync-direct; a
to-sync-direct input wire win is one where reachable(M,win) = ∅. A to-sync wire that
isn’t to-sync-direct is said to be to-sync-indirect.
By providing these additional sorts, designers can communicate the interface re-
quirements of modules using synchronous memories, making libraries of hardware
components more accessible and easier to use. Furthermore, the particular aspect of
timing information provided by these new subsorts can be used as part of clock cycle determination and post-synthesis place-and-route in standard EDA workflows. Thus the full sort taxonomy (for inputs: to-sync, with its subsorts to-sync-direct and to-sync-indirect, and to-port; for outputs: from-sync, with its subsorts from-sync-direct and from-sync-indirect, and from-port) has a wide range of applications and can potentially be expanded even further.
4.4 Implementation ofModularWell-ConnectednessChecks
We augmented the PyRTL HDL [110] to implement lightweight annotations and
design-time checks according to the formal properties that we have described of the
original four wire sorts (to-sync, to-port, from-sync, and from-port). PyRTL does not
natively support a module abstraction, so we first modified the language by adding a
Module class that isolates a modular design and exposes an interface consisting of input and output wires. (Our PyRTL modifications and the complete implementation of our tool are available at https://github.com/pllab/PyRTL/tree/pldi-2021.)
Our formalism made two simplifying assumptions. First, it assumed that all logic
is contained inside modules. For developer convenience, we eased this restriction to
allow for arbitrary logic to exist between modules. We tweaked the TransitivelyAffects
relation (⤳C) to account for combinational paths through this extra-modular logic.
Second, it treated all wires as one bit in width. At the HDL level, it is much more
convenient to group related one-bit wires, especially input and output ports, into single
n-bit wire vectors. For native PyRTL designs (but not BLIF import), the output-port-set
or input-port-set of each port wire vector becomes the union of the output-port-set or
input-port-set of its constituent wires; thus we are overly conservative because single-bit
dependencies are not tracked, but maintain soundness by continuing to be able to detect
all combinational loops.
The well-connectedness implementation itself consists of (1) a sort inferencer that
automatically computes the sorts of a module’s input and output wires at module
design time; (2) lightweight syntactic annotations that allow a designer to (optionally)
specify what they believe the sorts should actually be; and (3) a whole-circuit checker
that automatically triggers when needed to verify that a circuit composed of multiple
modules is well-connected.
The computed sorts are then checked against any existing designer annotations
to ensure that the computed sorts match the designer’s expectations; any unascribed
ports are labeled with their computed sorts. We require sort ascriptions, where the
output-port-set or input-port-set are fully specified by the user, for all the ports of
opaque modules, since there is no internal logic to use for calculating these sorts. Once
a module has its wire sorts, these sorts make it quicker to determine intermodular
connections because they facilitate re-use: every instantiation of the same module in
the larger design reuses the same wire sort information.
An interesting question during circuit design, as modules are being composed, is
when exactly to check well-connectedness. We would like to highlight problems as early
as possible instead of waiting until the entire circuit is complete. However, we also
want to minimize the cost of constantly checking well-connectedness during the design
process. As such, our tool can either check for well-connectedness after all modules have
been connected or instead whenever a newly formed connection between two modules
meets the following condition: the connection’s forward combinational reachability set
includes a to-port input, and its backward combinational reachability set contains a
from-port output. This condition is cheap to track by saving and updating information
about each wire’s reachability as wires are added to the design, and it guarantees that
(1) a check is never done unless a problem could potentially be found; and (2) an actual
problem is found as soon as possible.
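The trigger condition can be expressed as a small predicate over per-wire reachability sets maintained incrementally as the design grows; the encoding below (precomputed `fwd_reach`/`bwd_reach` maps and string sorts) is an illustrative simplification of what the tool tracks.

```python
def needs_check(wout, win, fwd_reach, bwd_reach, sort_of):
    """Re-run the whole-circuit check only if the new connection (wout, win)
    can reach a to-port input going forward AND a from-port output going
    backward; otherwise no loop can have been introduced."""
    fwd = fwd_reach[win] | {win}
    bwd = bwd_reach[wout] | {wout}
    return (any(sort_of.get(w) == "to-port" for w in fwd)
            and any(sort_of.get(w) == "from-port" for w in bwd))

sort_of = {"a_out": "from-port", "b_in": "to-port",
           "c_in": "to-sync", "d_out": "from-sync"}
fwd_reach = {"b_in": set(), "c_in": set()}
bwd_reach = {"a_out": set(), "d_out": set()}
print(needs_check("a_out", "b_in", fwd_reach, bwd_reach, sort_of))  # True
print(needs_check("d_out", "b_in", fwd_reach, bwd_reach, sort_of))  # False
print(needs_check("a_out", "c_in", fwd_reach, bwd_reach, sort_of))  # False
```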
4.5 Evaluation
We evaluate our tool in five parts: (1) an application to a number of SystemVerilog
modules provided by the BaseJump STL; (2) an application to several components from
the OpenPiton Design Benchmark; (3) a case study applying the tool to the design of a
multithreaded RISC-V CPU; (4) a comparison of our tool to standard cycle detection
during synthesis; and (5) a discussion of the scalability and asymptotic complexity of
the tool.
4.5.1 BaseJump STL Modules
We begin with an evaluation of a number of the SystemVerilog modules provided
by the industrial-strength BaseJump STL library. This evaluation serves as a baseline
sanity check allowing us to verify that we can successfully assign the correct sorts to
module interfaces.
We ran our annotation framework successfully on 144 unique modules from the
BaseJump STL as found in the BSG Micro Designs repository [141], which contains a large number of BaseJump STL modules parameterized and converted into Verilog. Each module was instantiated one to four times to test various combinations of
its parameters (e.g. data bit width, address width, queue size, etc.), so that we analyzed
533 modules in total. Since our technique is currently only applicable to synchronous,
single-clock designs, we were unable to analyze 5 modules that relied on asynchronous
or multi-clock constructs.
We converted each top-level module and its submodules into the flattened BLIF format [142] via Yosys version 0.9 and imported the result into PyRTL. The average
size of each module in BLIF was 1.7 MB, with an average number of primitive gates of
19,981, an average number of inputs and outputs per module of 6, and an average time
for inferring all of the interface sorts for each module of 361 milliseconds. We ran all
of these experiments using PyRTL on a computer with a 1.9 GHz Intel Xeon E5-2420
processor and 32 GB 1333 MT/s DDR3 memory.
A representative subset of these BaseJump STL modules is shown in Table 4.1:
Table 4.1: Wire sorts of module ports for a subset of BaseJump STL; TS = to-sync, TP = to-port, FS = from-sync, FP = from-port. Every module also has a reset input wire whose sort is to-sync. The time listed is cumulative time to annotate all the wire sorts.

First-In First-Out Queue (148,272 prim. gates; 2.669 s)
  Inputs:  data_i TS ∅; yumi_i TS ∅; v_i TS ∅
  Outputs: data_o FS ∅; ready_o FS ∅; v_o FS ∅

Parallel-In Serial-Out Shift Reg. (53,637 prim. gates; 0.606 s)
  Inputs:  valid_i TS ∅; data_i TS ∅; yumi_i TP {ready_o}
  Outputs: valid_o FS ∅; data_o FS ∅; ready_o FP {yumi_i}

Serial-In Parallel-Out SR (1,617,698 prim. gates; 18.752 s)
  Inputs:  yumi_cnt_i TS ∅; valid_i TP {valid_o}; data_i TP {data_o}
  Outputs: ready_o FS ∅; valid_o FP {valid_i}; data_o FP {data_i}

Cache DMA (4,440 prim. gates; 0.051 s)
  Inputs:  data_mem_data_i TS ∅; dma_data_i TS ∅; dma_data_v_i TS ∅;
           dma_data_yumi_i TS ∅; dma_pkt_yumi_i TP {done_o};
           dma_way_i TP {data_mem_w_mask_o};
           dma_addr_i TP {data_mem_addr_o, dma_pkt_o};
           dma_cmd_i TP {done_o, dma_pkt_o, dma_pkt_v_o, data_mem_v_o}
  Outputs: data_mem_data_o FS ∅; dma_data_o FS ∅; dma_data_v_o FS ∅;
           dma_data_ready_o FS ∅; dma_pkt_v_o FP {dma_cmd_i};
           data_mem_addr_o FP {dma_addr_i}; data_mem_v_o FP {dma_cmd_i};
           data_mem_w_mask_o FP {dma_way_i};
           dma_pkt_o FP {dma_addr_i, dma_cmd_i};
           done_o FP {dma_cmd_i, dma_pkt_yumi_i};
           data_mem_w_o FS ∅; dma_evict_o FS ∅; snoop_word_o FS ∅
Figure 4.9: Tracing a path through the parallel-in serial-out (PISO) shift register shows the dependency between yumi_i and ready_o, making connections involving either of their respective endpoints potentially unsafe.
a first-in first-out queue, a parallel-in serial-out shift register, a serial-in parallel-out
shift register, and cache DMA. Information on the sizes, wire sort annotations, and
annotation time for all 533 modules is found in the supplementary material.
We highlight in particular the parallel-in serial-out (PISO) shift register as an inter-
esting case (see Figure 4.9). Three of its four inputs are to-sync, while yumi_i is to-port
(specifically, its output-port-set contains the output ready_o). We can see the details
by looking at the logic for output ready_o:
ready_oout ≜ (statereg = stateRcv) ∨ ((statereg = stateTsmt) ∧ (shiftCtrreg = nSlots − 1) ∧ yumi_iin)
According to the BaseJump STL paper, the consumer endpoint of this module
(to which ready_oout belongs) is helpful because ready_oout does not combinationally
depend on valid_iin (an input wire in the consumer endpoint). Thus, according to
them, this module can safely be connected to any other module. However, our own
analysis more precisely shows that while it is true that ready_oout doesn’t depend on
other consumer endpoint wires, it does depend on another module input (in particular yumi_iin, which is part of the producer endpoint). This fact means that the PISO module
connections may or may not be safe depending on the sorts of the interfaces to which it
is directly or transitively connected.
Notably, after personal correspondence in which we reported the issue, the authors
of the BaseJump STL PISO module updated it so that, according to our terms, yumi_iin
is now to-sync and ready_oout is now from-sync.4 This shows that designers care about
the precise behavior of these interfaces and that an analysis that annotates wire sorts
and verifies their interconnections is a useful thing to have.
4.5.2 OpenPiton Modules
We also used our analysis on a completely separate body of work: the OpenPiton
Design Benchmark (OPDB), based on the OpenPiton manycore research platform [143,
144]. OPDB is interesting because it provides modules of a variety of scales and with
different configuration options pre-generated per module. We were also interested in
these OpenPiton designs due to anecdotes from the developers of OpenPiton related
to issues they experienced with compositionality. In one instance, developing a like-
for-like replacement for an existing component led to combinational loops that went
undetected until final integration and synthesis due to minor mismatches in interfaces
and test configurations. In another, a hardware generator produced combinational
loops for only particular values of a parameter designed to change the size of a module,
and those loops would require the composition of as many as seven modules to come
into existence.

4 See https://github.com/bespoke-silicon-group/basejump_stl/commit/67830f05ffce1333c7b790600530da0681af74fe

Table 4.2: Size (in primitive gates), wire sort inference time (in seconds), and number
of IO ports of 17 OPDB modules.

Module             Prim. Gates   Time (s)   Ports
dynamic_node       29,918        0.759      35
fpu                168,525       1.456      16
ifu_esl            15,602        1.362      40
ifu_esl_counter    310           0.001      5
ifu_esl_fsm        2,299         0.040      34
ifu_esl_htsm       524           0.012      30
ifu_esl_lfsr       213           0.001      6
ifu_esl_rtsm       170           0.005      24
ifu_esl_shiftreg   208           0.001      4
ifu_esl_stsm       267           0.016      26
l2                 1,088,384     15.128     16
l15                1,518,073     30.176     71
pico               36,479        0.245      24
sparc_ffu          104,966       0.723      77
sparc_mul          20,702        0.260      7
sparc_exu          320,397       10.203     132
sparc_tlu          650,364       8.753      214
To process the OPDB designs, we followed the same Yosys Verilog-to-BLIF synthesis
step as with the BaseJump STL designs, excluding some with asynchronous or multi-
clock constructs. Our selected OPDB designs include a floating-point unit, network-
on-chip router, and two caches, among others. Table 4.2 shows the OPDB designs we
selected, their sizes in number of primitive gates, the time taken to infer the wire sorts
of the design, and the number of input/output ports. Of the 17 designs we processed,
the average number of gates was 232,788, while the smallest (ifu_esl_rtsm) had just
170 gates and the largest (l15) had more than 1.5 million gates. The designs had an
average of 44 ports, with the fewest (ifu_esl_shiftreg) being just 4, while the design
Table 4.3: A comparison of cycle detection during synthesis (Yosys) versus our tool
using wire sorts on large OPDB designs. Each unique module type only needs to be
analyzed once; additional (non-unique) instantiations reuse the calculated sorts. The
number of primitive gates differs from Table 4.2 because these are unflattened, and thus
unoptimized, designs.

Module      Prim. gates    Cycle det. time (s)   Speedup   Sort infer.   Submodules
            (hier. BLIF)   Yosys      Ours                 time (s)      Total   Unique
fpu         189,452        46.42      3.11       14.92x    0.845         3530    118
sparc_ffu   105,688        11.30      1.00       11.30x    0.397         208     51
sparc_exu   331,452        22.81      8.65       2.63x     0.989         737     92
sparc_tlu   761,538        108.54     5.82       18.64x    0.813         777     128
l2          1,176,219      361.04     10.64      33.93x    13.80         157     45
l15         1,549,475      643.45     20.81      30.92x    7.86          68      26
with the most (sparc_tlu) had 214. The larger scale of these designs also skews to a
longer average wire sort inference time, at 4.067 seconds, with a minimum of 0.001s
and a maximum of 30.176s. We describe the asymptotic complexity of this operation in
Section 4.5.5.
4.5.3 RISC-V CPU
For a more holistic case study, we implemented a multithreaded single-cycle RISC-
V [145, 146] CPU (RV32I base integer instruction set) in PyRTL. The CPU consists of 11
modules in total; the total number of primitive gates for the entire design, configured
for five threads and five pipeline stages, is 229,011 gates. Our tool spent an average of
13.5 milliseconds on each module inferring its interface sorts; it took on average 162.7
milliseconds to determine all of the sorts, with a lower bound of 148.9 milliseconds
and an upper bound of 194.2 milliseconds, at a rate of 298 nanoseconds per primitive
gate. Once all the modules were connected, it was able to correctly check all the inter-
module connections in an average of 67.1 milliseconds, with a lower bound of 62.5
milliseconds and an upper bound of 77.3 milliseconds. Information on the sizes, wire
sort annotations, and annotation times for the RISC-V submodules can be found in the
supplementary material.
4.5.4 Comparison to Loop Detection During Synthesis
In our final analysis, we compared the efficiency of doing cycle detection at the
HDL level via wire sorts versus at the netlist level during synthesis. Finding broken
designs in the wild is difficult because most designers don’t publish broken designs. So
instead, we altered the OPDB Verilog designs slightly by introducing multi-module
loops, importing the largest of them in their hierarchical BLIF format into PyRTL where
our intermodular analysis is done. We then timed how long (1) Yosys takes to find the
cycle during synthesis, (2) our tool takes to determine all interface sorts, and (3) our
tool takes to check for intermodular loops given these sorts. We found that Yosys took
longer to synthesize and find loops than our tool. It was also not straightforward to get
Yosys to tell us these loops exist: depending on the options given, it would optimize
them out or convert them to something else entirely without warning. Our results are
found in Table 4.3.
In actual use, we expect the user to write their designs in a modular fashion in a
high-level HDL that can be analyzed directly to begin with and to provide wire sort
ascriptions if wanted. This experiment favored synthesis over our technique because
it relied on importing a BLIF file, which has a few downsides. The Verilog-to-BLIF
process converts N-bit ports into N 1-bit ports, meaning the number of ports increased
by a factor equal to the average port bitwidth. The conversion also creates a module
instance for each unique set of parameters used; since BLIF doesn’t offer information
that a module instantiation differs from another only by some parameter, those count
as additional unique modules whose sorts must be calculated.
Despite this, annotating all modules with their I/O sorts was relatively quick, and
detecting loops via intermodular connections using these sorts was 2.6–33.9x faster
than trying to find them during synthesis at the pure netlist level. We expect that by
analyzing the design in its original form (e.g. Verilog or PyRTL), where the wires stay
bundled together and parameterized module instances can be abstracted over, this
speedup would increase significantly. This is exemplified by our RISC-V case study
mentioned in Section 4.5.3, which was written entirely in PyRTL.
4.5.5 Complexity and Scalability
We describe the asymptotic complexity of the two analysis phases in order to demon-
strate their scalability.
Module Wire Sort Inference
Sort inference takes place once per module definition. For a given module
M = (⇀win, ⇀wout, ⇀net), we must compute the transitive closure of combinationally
reachable output wires for each win ∈ ⇀win. Thus the total complexity of computing
the sorts for all input wires is O(|⇀win| · |edges|), where
edges = ⋃ net∈M.nets {⇀wσ | (⇀wσ, wσ, op) = net} ∪ M.outputs. Since the to-port/from-port
relationship is symmetric, the wire sorts for outputs can be computed using the
previously computed input wire sorts without traversing the module's internal wires
again.
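As an illustrative sketch of this inference (using a hypothetical adjacency-list netlist representation, not our actual data structures), a breadth-first traversal from each input yields its combinationally reachable outputs:

```python
from collections import deque

def infer_input_sorts(inputs, outputs, nets):
    """For each input wire, compute the set of outputs it combinationally
    reaches; an input reaching no output is to-sync, otherwise to-port.
    `nets` maps each wire to the wires it combinationally drives (a register
    simply has no outgoing combinational edge).  O(|inputs| * |edges|)."""
    sorts = {}
    for w_in in inputs:
        seen, frontier = {w_in}, deque([w_in])
        while frontier:                      # standard BFS over the net graph
            w = frontier.popleft()
            for succ in nets.get(w, ()):
                if succ not in seen:
                    seen.add(succ)
                    frontier.append(succ)
        port_set = seen & set(outputs)       # the input's output-port-set
        sorts[w_in] = ("to-port", port_set) if port_set else ("to-sync", set())
    return sorts

# Toy module: input `a` reaches output `q` through an AND gate; input `b`
# only reaches a register, so it is to-sync.
nets = {"a": ["and1"], "and1": ["q"], "b": ["reg1"]}  # reg1: no comb. fan-out
sorts = infer_input_sorts(["a", "b"], ["q"], nets)
assert sorts["a"] == ("to-port", {"q"})
assert sorts["b"] == ("to-sync", set())
```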
Circuit Well-Connectedness
The phase to check circuit well-connectedness uses the wire sorts computed by
the module wire sort inference, and it operates only on the module interfaces without
caring how large or complex any individual module might be. It only needs to be run
Table 4.4: The number of annotations per sort. TS = To-Sync, TP = To-Port, FS =
From-Sync, FP = From-Port.

Source         Modules   Inputs (TS)   Inputs (TP)   Outputs (FS)   Outputs (FP)
BaseJump STL   144       233           211           178            197
OpenPiton DB   17        347           113           245            56
RISC-V         11        14            33            3              33
Total          172       594           357           426            286
once, after the circuit is complete. The algorithm iterates over each pair of inter-module
input-output connections, checking them against the TransitivelyAffects relationship
(⇝C).
Since each input port is connected to only one incoming output wire from another
module, the number of connections is equal to the total number of input ports across
the circuit. Given a circuit C and arbitrary wires wout, win in the circuit, the worst-case
scenario is when the path from wout to win traces through every inter-module connection
before finally reaching the combinational loop. Thus, the TransitivelyAffects computation
has a worst-case complexity of O(|C.conns|). Since we do this check for each connection
pair in C.conns, the total worst-case complexity is O(|C.conns|²).
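To make the checking phase concrete, here is a minimal sketch (a hypothetical interface-level representation, not our implementation) that follows inter-module connections through each input's output-port-set looking for a combinational cycle:

```python
def has_combinational_loop(conns, output_port_sets):
    """`conns` maps each module input port to the output port driving it;
    `output_port_sets[inp]` is the set of outputs that input `inp`
    combinationally affects (empty for a to-sync input). For each driving
    output we follow TransitivelyAffects, asking whether we can get back to
    it: O(|conns|) per check, O(|conns|^2) overall."""
    def transitively_affects(out, target, seen):
        for inp, driver in conns.items():
            if driver != out or inp in seen:
                continue
            seen.add(inp)
            for out2 in output_port_sets.get(inp, ()):
                if out2 == target or transitively_affects(out2, target, seen):
                    return True
        return False
    return any(transitively_affects(out, out, set())
               for out in set(conns.values()))

# Two modules wired head-to-tail: m1.out -> m2.in and m2.out -> m1.in, where
# each input combinationally affects its module's output -- a loop.
conns = {"m2.in": "m1.out", "m1.in": "m2.out"}
affects = {"m2.in": {"m2.out"}, "m1.in": {"m1.out"}}
assert has_combinational_loop(conns, affects)

# Making m2.in to-sync (no combinational path through m2) breaks the loop.
affects_safe = {"m2.in": set(), "m1.in": {"m1.out"}}
assert not has_combinational_loop(conns, affects_safe)
```

Note how a to-sync input contributes an empty set, which is exactly what lets the check skip wires (or whole modules) during cycle detection.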
Distribution of Wire Sorts
We found that the sort annotations our tool assigned to module ports were widely
distributed, as shown in Table 4.4. Across all 172 modules, to-sync inputs make up
62.5% of module inputs, compared to 37.5% for to-port inputs. from-sync outputs make
up 59.8% of module outputs, compared to 40.2% for from-port outputs.
The foremost goal of this work was to reduce the number of "late surprises" in the
design process. In these designs, 38.7% of the ports raise the possibility of a "late
surprise" loop because they are to-port or from-port. For the remaining 61.3%, our
technique has the additional advantage of making the checking process faster, by
eliminating individual wires (or, in the case of modules with entirely to-sync/from-sync
IO, entire modules) from the cycle detection analysis.
4.6 Conclusion
We have presented an approach to creating hardware modules in isolation while
tracking enough information to make checking their well-connectedness in an entire
design feasible and user-friendly. BaseJump STL's informal approach of commenting
ready-valid endpoints as helpful or demanding is a step in the right direction toward
classifying modules with information that helps connect them at circuit design time
in a plug-and-play fashion, but as we show, it falls short of preventing combinational
loops.
Our solution is to provide wire-level information via a taxonomy of sorts (to-sync,
to-port, from-sync, and from-port) that allows modules to be written effectively in
isolation and still be safely connected without knowing their internals. We implemented
our approach in a hardware description language and analyzed real-world designs
(BaseJump STL and the OpenPiton Design Benchmark) as well as a multithreaded
RISC-V CPU implementation, showing that our approach is feasible, effective, and
efficient.
Chapter 5
PyLSE: A Pulse-Transfer Level Language
for Superconductor Electronics
5.1 Introduction
Superconductor electronics (SCE) are a promising emerging technology for the
post-Moore era due to their low power dissipation, energy-efficient interconnects, and
sub-attojoule ultra-high-speed switching [147]. However, the physical properties that
make SCE so promising also make them difficult to design for. In particular, SCE use a
pulse-based, rather than a voltage level-based, information encoding. This, along with the
stateful nature of superconducting cells [148] and the lack of a uniformly agreed-upon ef-
ficient translation from design to implementation, makes it necessary to develop unique
logic gates and design rules [149, 150, 151]. As a result, the superconducting realm
lacks tools and workflows for the rapid prototyping and testing of microarchitectures
and SCE-based applications (see Figure 5.1).
The primary question we seek to answer in this paper is: what is a suitable abstraction
for precisely defining the functional and timing behavior of SCE designs? Our solution is
PyLSE: A Pulse-Transfer Level Language for Superconductor Electronics Chapter 5
Figure 5.1: Unlike the semiconducting realm, superconducting lacks standard cell libraries and standardized digital design rules, making it difficult to rapidly build and test microarchitectures and larger applications.
to completely depart from existing hardware description languages (HDLs), taking a
bottom-up approach to build a new Python [152] embedded domain-specific language
(DSL), called PyLSE (Python Language for Superconductor Electronics). We argue
that PyLSE is well tailored to the unique needs of SCE, making it easier to create and
compose cells into correct, scalable systems.
Inspired by the theory of automata [153], we propose a custom finite state machine
(FSM) abstraction, which we call a PyLSE Machine, to precisely describe the func-
tional and timing behavior of SCE cells. This FSM abstraction eliminates the need for
complex and error-prone conditional assignments, commonly found in state-of-the-art
approaches [154], and forms the core of our PyLSE language. Through this abstraction,
we also develop a new link between SCE and the theory of Timed Automata [155],
which enables the integration of PyLSE with modern formal verification tools like the
UPPAAL model checker [156]. Overall, the main contributions of this paper are:
• We create the PyLSE Machine, a language abstraction for the formalization of the
functional and timing semantics of pulse-based circuits (Section 5.3).
• We create PyLSE, a lightweight transition system-based Python DSL for the rapid
prototyping of pulse processing systems, modeled as networks of PyLSEMachines
(Section 5.4).
• We automate the translation of PyLSE Machines to Timed Automata (Section 5.4).
• We build a multi-level framework for the simulation and analysis of PyLSE Ma-
chine systems, which also allows for the integration of abstract behavioral software
models, fostering agile development (Section 5.4).
• We evaluate PyLSE’s capabilities through a series of comparisonswith state-of-the-
art approaches, dynamic checks of SCE designs with stochastic timing behaviors,
and formal verification using UPPAAL (Section 5.5).
5.2 Defining Computation on Pulses
5.2.1 Functional Behavior
Complementary metal-oxide-semiconductor (CMOS) is the dominant technology
of today’s computing landscape. The majority of the tools used to design and fabricate
modern hardware assume an underlying digital logic basically corresponding to the
functionality of basic gates and n- and p-type field-effect transistors [157]. Because the
CMOS fabrication process is so well-understood, engineers encapsulate the functional,
timing, and physical layout characteristics of these gates into standard cell libraries [158].
In turn, digital designers and computer architects use abstractions of these cells to
create consistent digital design rules that underpin the semantics of hardware description languages [106, 159] and which are used to create microarchitectures and larger
(a) CMOS: Steady voltage levels encode information; thus, wires are considered stateful and the gates can be stateless.
(b) SFQ: Transient pulses encode information; thus, wires are considered stateless and the gates should be stateful.
Figure 5.2: Comparing information in CMOS and SFQ.
applications (see the left side of Figure 5.1).
Superconductor electronics, on the other hand, exploit the unique properties of su-
perconductivity [160, 161] to perform computation through the carefully orchestrated
consumption and emission of tiny pulses of magnetic flux. In this realm, Josephson
junctions (JJs), rather than transistors, serve as the basic switching device. While the
quantum nature of such flux exchange is central to the device operation, the computa-
tion performed is strictly classical. Information is moved from logic element to logic
element in the same “feed-forward” way as traditional digital logic. However, the use of
picosecond-scale pulses of single flux quanta (SFQ), rather than sustained voltage levels,
to carry information between logic elements has myriad downstream effects. In CMOS,
information is “held” in the wires connecting gates and registers together; that is, the
voltage level corresponding to a logical 0 or 1 persists as long as the input remains
unchanged (see Figure 5.2a). In SFQ [162], information is encoded via short-lived
pulses and does not persist on the wires; instead, cells1 must be designed to “remember”
that a particular input has arrived (see Figure 5.2b).
The SCE community has traditionally relied on low-level analog models for the
design and analysis of basic SCE cells. Figure 5.3a is one such example, showing the

1 We use "cell" and "gate" interchangeably throughout.
(a) Schematic (b) Mealy machine
Figure 5.3: Schematic and Mealy machine description of the Synchronous And Element. Labels a arrived and b arrived in the schematic are locations of superconducting loops holding state and roughly correspond to states in the Mealy machine.
low-level circuit schematic of the Synchronous And Element.2 The mechanism these
SCE cells use to "remember" input arrival by storing a flux quantum is the interferometer,
also known as the superconducting quantum interference device (SQUID) [163]; labels
a arrived and b arrived indicate the locations of these SQUIDs. At a high level, this
cell functions as follows. The arrival of a pulse on A causes flux to be stored in the
SQUID labelled a arrived; similarly, a pulse on B stores flux in SQUID b arrived.
When a pulse arrives on CLK, the flux in either SQUID is read out to another loop
involving J7; if both a arrived and b arrived have been set, the combined flux causes
J7 to switch, producing a pulse on output Q. The flux in the SQUIDs has been expended,
causing the cell to return to its initial state and receive pulses anew.
With the burgeoning interest of digital designers in SCE, the area is leaving the

2 While the details of its construction using inductances and bias currents are beyond the scope of this paper, we include it to show the physical manifestation of state being held in each cell, which informs our discussion of automata.
Figure 5.4: Waveform for the Synchronous And Element showing the timing constraints that must be met for correct operation. Pulses arriving during the hold time ① or setup time ② are erroneous. Assuming their absence, a pulse is produced some propagation delay ③ after a clock pulse.
confines of domain experts, increasing the need for new abstractions more suitable
for the scalable design and analysis of SCE systems. One abstraction commonly used
to explain the stateful behavior of SCE cells is the Mealy machine [164]. which are
deterministic finite state transducers mapping state-input symbol tuples to new states
and output symbols [165]. This model has been used extensively to model SCE cells
[166, 167, 150, 151, 168, 162] like the Synchronous And Element in Figure 5.3, allowing
digital designers to hide the low-level circuit details of Figure 5.3a in favor of the simpler,
easier-to-use functional model of Figure 5.3b.
5.2.2 Timing Behavior
The schematic shown in Figure 5.3a comes with particular timing behavior as well
as constraints on said behavior, which we visualize using the waveform in Figure 5.4.
One such behavior is propagation delay, defined as the time it takes after a particular
pulse for an output to appear (see moment ③). Two such constraints are setup time
and hold time [169, 170], defined as the intervals before and after the clock, respectively,
in which no pulses are expected to arrive. The combination of these latter two times helps
define an "input window" for legal inputs; pulses arriving outside this window (such
as at moments ① or ②) run the risk of getting lost or dropped, or of causing the cell to
enter a metastable state.
Because Mealy machines lack an explicit notion of time, they fall short when con-
straints on the relative arrival times of inputs must be part of the functional description.
These and other timing restrictions need to be carefully thought through and should
be captured as early in the design process as possible. The time it takes for pulses to
propagate through a system is a first class concern for not only analog circuit designers,
but digital designers and architects as well. A good language abstraction must therefore
be centered around a notion of time and provide mechanisms for (1) easily defining
timing constraints and (2) verifying the absence of violations in the system.
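As a worked example of what such a timing check looks like, the following sketch (our own illustration, not a tool API; the times and parameter names are assumptions) flags data pulses that land inside the setup/hold window around a clock pulse:

```python
# Minimal sketch of checking the "input window" constraints of Figure 5.4:
# data pulses must not arrive within t_setup before or t_hold after any
# clock pulse. All parameter names and timing values here are illustrative.

def check_input_window(data_times, clk_times, t_setup, t_hold):
    """Return the data-pulse times that violate setup or hold."""
    return [t for t in data_times
            if any(clk - t_setup <= t <= clk + t_hold for clk in clk_times)]

# With setup = 2.0 and hold = 1.0 around a clock pulse at t = 10.0, the
# illegal window is [8.0, 11.0]:
violations = check_input_window([5.0, 9.0, 10.5, 12.0], [10.0], 2.0, 1.0)
assert violations == [9.0, 10.5]  # 9.0 is inside setup; 10.5 is inside hold
```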
5.3 A Language Abstraction for Superconductor Electronics
5.3.1 Overview of the PyLSE Machine
There are four key pieces of information that must be captured in a new SCE ab-
straction:
1. The time it takes to transition between states
2. The priority order in which a given input should be handled if more than one arrives
simultaneously
3. How long it takes for an output to appear once fired
4. Constraints on when it is legal to receive inputs
We believe that the Mealy machine should be the base of this new abstraction,
but with the timing deficiencies covered in Section 5.2 resolved by augmenting the
machine’s edges. These augmented edges are composed of three parts: the Trigger
(in turn composed of an input, priority, and transition time), the Firing Outputs
(associating each output with its firing delay), and Past Constraints (for specifying
conditions on a cell’s past history of inputs); the details of each are found in Figure
5.5. We require that the machine be fully-specified (i.e. that for all states, there are edges
labelled for all inputs) and call a machine composed of states and these augmented
edges a PyLSE Machine.
Figure 5.5: Anatomy of a PyLSE Machine transition. The arrival of an input pulse on wire A Triggers the transition from the source to dest state. This transition has priority N over other simultaneously-triggered transitions originating from source and takes τtran time to complete; during this period, receiving any inputs is illegal. A pulse for each output Q in the Firing Outputs set appears on its associated output wire some τfire time units later. Finally, according to the Past Constraints, if it's been less than τdist since the last time an input B was received during a previous transition, it is an error. ANτtran is shorthand for ANτtran/∅/∅.
To show how the proposed extension transforms a plain Mealy machine, we will
focus on the Synchronous And Element. Figure 5.6 shows the PyLSE Machine for the
Figure 5.6: PyLSE Machine for the Synchronous And Element. Using transition time, we can model the hold time τhold, and using the past constraints, we can model the setup time τsetup. Firing delay directly models the propagation delay τprop.
Synchronous And Element, expanded from the original Mealy machine representa-
tion in Figure 5.3b. We’ll dissect the edge (highlighted in gold) moving from state
a and b arrived to idle (CLK0τhold/{Qτprop}/{*τsetup}) and show how we can use partic-
ularly subtle and useful features of Figure 5.5 to represent the setup and hold time
constraints and propagation delay of this particular cell.
Transition Time The Trigger portion shows that this transition occurs when CLK is
received and that it takes τhold time units to complete. We make it illegal to receive
other inputs during a transitionary period, so that an edge’s transition time can be used
directly to model the hold time constraint of Figure 5.4, by setting τtran ≔ τhold.
Priorities: The example edge has the highest priority (with a priority of 0 shown in the
Trigger), meaning that if the machine received A, B, and CLK simultaneously, it would
handle the transition associated with CLK first, transition to idle, and then make an
arbitrary choice between A and B, since both have the same labeled priority of 1 leaving
idle. While it is practically impossible to arrange for SFQ pulses to purposefully arrive
“simultaneously,” it is not uncommon to consider idealized models of gates which have
a delay of either zero, or some small integer. Priorities let the designer identify and
explicitly handle cases of simultaneous arrival in a deterministic manner when desired.
Multi-Output: We require a set of outputs, rather than a single one, in order to
accurately model SFQ cells with more than one output. In the Firing Outputs portion
of the example edge being discussed, the singleton set {Qτprop} indicates that a single
output should fire and take τprop time units to appear; thus, we can model the cell's
propagation delay by an edge's firing delay, setting τfire ≔ τprop.
Constraints on the Past: The Past Constraints portion says that when a CLK pulse
arrives, it is an error to have received a pulse on any input (indicated by the * in this
example) within the last τsetup time units. We note that the hold time constraint (repre-
sented as transition time) could also have instead been placed in the past constraints
section of the edges leaving idle, i.e. A10.0 leaving idle could have instead been written
as A10.0/∅/{CLKτhold} and the transition times involving τhold replaced with 0.0; however,
we feel that using a transition time, which imposes a future-looking constraint on what
inputs can’t be received, sometimes more easily reflects the mental model of the digital
designer, and so offer both forms. Thus, we can model the setup time constraint shown
in Figure 5.4 by setting τdist ≔ τsetup on this and all other edges whose triggering input
is CLK.
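The edges of Figure 5.6 can also be written down directly as data. The following plain-Python sketch (an illustrative encoding with assumed example timings, not actual PyLSE syntax) lists the Synchronous And Element's edges and checks that the machine is fully specified:

```python
# Illustrative encoding of the PyLSE Machine for the Synchronous And Element
# (not actual PyLSE syntax). Each edge carries its trigger, priority,
# transition time, firing outputs (with delays), past constraints, and
# destination. The timing values below are assumed for the example.
T_HOLD, T_SETUP, T_PROP = 1.0, 2.0, 5.0

edges = [
    # (source, trigger, priority, t_tran, firing, past_constraints, dest)
    ("idle",            "A",   1, 0.0,    {},             {},             "a_arrived"),
    ("idle",            "B",   1, 0.0,    {},             {},             "b_arrived"),
    ("idle",            "CLK", 0, T_HOLD, {},             {"*": T_SETUP}, "idle"),
    ("a_arrived",       "A",   1, 0.0,    {},             {},             "a_arrived"),
    ("a_arrived",       "B",   1, 0.0,    {},             {},             "a_and_b_arrived"),
    ("a_arrived",       "CLK", 0, T_HOLD, {},             {"*": T_SETUP}, "idle"),
    ("b_arrived",       "A",   1, 0.0,    {},             {},             "a_and_b_arrived"),
    ("b_arrived",       "B",   1, 0.0,    {},             {},             "b_arrived"),
    ("b_arrived",       "CLK", 0, T_HOLD, {},             {"*": T_SETUP}, "idle"),
    ("a_and_b_arrived", "A",   1, 0.0,    {},             {},             "a_and_b_arrived"),
    ("a_and_b_arrived", "B",   1, 0.0,    {},             {},             "a_and_b_arrived"),
    ("a_and_b_arrived", "CLK", 0, T_HOLD, {"Q": T_PROP},  {"*": T_SETUP}, "idle"),
]

# Full specification: every (state, input) pair has exactly one edge.
states = {e[0] for e in edges} | {e[-1] for e in edges}
inputs = {"A", "B", "CLK"}
assert all(sum(1 for e in edges if e[0] == s and e[1] == i) == 1
           for s in states for i in inputs)

# The highlighted edge: CLK out of "a_and_b_arrived" fires Q after T_PROP
# and takes T_HOLD to complete.
gold = next(e for e in edges if e[0] == "a_and_b_arrived" and e[1] == "CLK")
assert gold[3] == T_HOLD and gold[4] == {"Q": T_PROP}
```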
5.3.2 Formalization of the PyLSE Machine
We will now precisely define PyLSE Machines, their semantics, and how they interact
in larger designs:

Definition 3 (PyLSE Machine). A finite state machine with timed, prioritized transitions, an
output set, and past constraints, which we call a PyLSE Machine, is a tuple
M = ⟨Q, qinit, Σ, Λ, δ, µ, θ⟩ where

    Q is a set of states, with q ∈ Q
    qinit ∈ Q is the initial state
    Σ is a set of input symbols, with σ ∈ Σ
    Λ is a set of output symbols, with λ ∈ Λ
    δ : Q × Σ → Q × N × R is the transition function
    µ : Q × Σ → P(Λ × R) is the output function
    θ : Q × Σ → P(Σ × R) is the past constraints function

We write M.Σ to extract Σ, and likewise M.Λ for Λ. We call a system modeled using a PyLSE
Machine a pulse-based system.
The first three domains, Q, Σ, and Λ, are similar to a typical Mealy machine definition.
The transition function δ maps a state and input symbol to the next state it should
transition to, a natural number corresponding to the priority of that transition, and a
real number corresponding to the physical time it takes to complete. The output function
µ maps tuples of states and inputs to sets of tuples consisting of output symbols and
the time it takes for them to appear (i.e. a firing delay). The past constraints function θ
maps the current state and input to a set of input/real-number tuples, each indicating
Transition Relation  →tran ⊆ K × Σ × R × (K ∪ {qerr})

  ⟨qnext, _, τtran⟩ = δ(qcurr, σ)    τarr ≥ τdone    ∀⟨σ′, τdist⟩ ∈ θ(qcurr, σ). τarr ≥ Θ[σ′] + τdist
  ─────────────────────────────────────────────────────────────  (Normal-κ)
  κ⟨qcurr, τdone, Θ⟩ —⟨σ, τarr⟩→tran κ⟨qnext, τtran + τarr, Θ[σ ↦ τarr]⟩

  τarr < τdone
  ──────────────────────────────────  (Error-κ Tran)
  κ⟨qcurr, τdone, _⟩ —⟨σ, τarr⟩→tran qerr

  ∃⟨σ′, τdist⟩ ∈ θ(qcurr, σ). τarr < Θ[σ′] + τdist
  ──────────────────────────────────  (Error-κ Cons)
  κ⟨qcurr, τdone, Θ⟩ —⟨σ, τarr⟩→tran qerr

Dispatch Relation  →disp ⊆ K × (P(Σ) × R) × K × (P(Σ) × R) × P(Λ × R)

  σ ∈ argmin σ′∈⇀σ π2(δ(qcurr, σ′))    outs = µ(qcurr, σ)
  κ⟨qcurr, _, _⟩ —⟨σ, τarr⟩→tran κnext    ⇀σrest = ⇀σ \ {σ}
  ─────────────────────────────────────────────────  (Disp)
  ⟨κ⟨qcurr, _, _⟩, ⟨⇀σ, τarr⟩⟩ →disp ⟨κnext, ⟨⇀σrest, τarr⟩⟩ | outs

Trace Relation  ⇓trace ⊆ K × (P(Σ) × R)* × (P(Λ × R))*

  ⟨κ, ⟨⇀σ, τarr⟩⟩ →disp ⟨κ′, xs⟩ | outs    ⟨κ′, xs⟩ ⇓trace ⟨κ′′, outs′⟩    outs′′ = outs + outs′
  ─────────────────────────────────────────────  (Trace-Cont)
  ⟨κ, ⟨⇀σ, τarr⟩⟩ ⇓trace ⟨κ′′, outs′′⟩

  ────────────────────────  (Trace-Done)
  ⟨κ, ⟨∅, _⟩⟩ ⇓trace ⟨κ, ∅⟩

Network Relation  →net ⊆ P(K) × (Σ × R)* × P(K) × (Σ × R)*

  ⟨⟨⇀σ, τarr⟩M, ps′⟩ = getSimPulses(ps)    κM ∈ ⇀κ    ⟨κM, ⟨⇀σ, τarr⟩M⟩ ⇓trace ⟨κ′M, outs⟩
  ⇀κ′ = ⇀κ[κ′M/κM]    ps′′ = ps′ + outs
  ─────────────────────────────  (Network-Cont)
  ⟨⇀κ, ps⟩ →net ⟨⇀κ′, ps′′⟩

  ∀⟨σ, τarr⟩ ∈ ps. σ ∈ C.Λ
  ────────────────────────  (Network-Done)
  ⟨_, ps⟩ →net ⟨_, ps⟩

Figure 5.7: Semantics of the Transition, Dispatch, and Trace relations of the PyLSE
Machine ⟨Q, qinit, Σ, Λ, δ, µ, θ⟩, as well as the Network relation for larger composite
designs. πi(⟨..., xi, ...⟩) = xi is standard tuple projection. Θ[σ ↦ τ] produces an updated
mapping where σ now maps to τ. We use S[y/x] to denote y replacing x in S. The helper
function getSimPulses extracts from the pulse heap ps the earliest set of simultaneous
pulses destined for the same PyLSE Machine, returning the rest for later use. If both x
and y are heaps of pulses, we use x + y to denote merging them into a single ordered
heap.
a precondition needed in order for the given transition to be allowed to proceed.
The transition semantics of our PyLSE Machine is found in Figure 5.7. To help define
the semantics, we define the configuration κ ∈ K = Q × R × Θ, parameterized over a
current state q ∈ Q, a real-valued time τdone, and a mapping Θ : Σ → R that maps each
input to the last time it was seen; this is written as κ⟨q, τdone, Θ⟩, with τdone being used
to represent the end of the unstable period during which the machine is transitioning.
The initial configuration is κMinit = κ⟨qinit, 0, {σ ↦ −∞ | σ ∈ M.Σ}⟩.
Transition Relation   Given the current configuration κ⟨qcurr, τdone, Θ⟩, the Transition
Relation can be interpreted as follows. If the machine receives an input σ at time τarr,
and it has been long enough to have finished the process of entering state qcurr (i.e.
τarr ≥ τdone), it proceeds to a new configuration κ⟨qnext, τ′done, Θ′⟩ by remembering (1) the
next state qnext, (2) the time at which the new transition should be completed, τ′done =
τtran + τarr, and (3) the time it saw this current input, via Θ′ = Θ[σ ↦ τarr] (see Normal-κ).
Otherwise, if it is not yet ready to receive inputs because τarr < τdone (see Error-κ Tran),
or because some input σ′ with ⟨σ′, τdist⟩ ∈ θ(qcurr, σ) was received less than τdist time
units ago (see Error-κ Cons), it proceeds to the special qerr state, which is the target
state of any transition whose timing conditions can't be satisfied.
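These three rules can be transcribed almost directly into code. The following is a sketch based on the definitions above (our own illustration, not the actual PyLSE implementation):

```python
import math

Q_ERR = "q_err"

def step(config, sigma, t_arr, delta, theta):
    """One application of the Transition Relation. `config` is (q_curr,
    t_done, Theta); `delta(q, s)` returns (q_next, priority, t_tran);
    `theta(q, s)` returns a set of (sigma', t_dist) past constraints."""
    q_curr, t_done, Theta = config
    if t_arr < t_done:                                   # Error-κ Tran
        return Q_ERR
    if any(t_arr < Theta[s2] + t_dist                    # Error-κ Cons
           for (s2, t_dist) in theta(q_curr, sigma)):
        return Q_ERR
    q_next, _prio, t_tran = delta(q_curr, sigma)         # Normal-κ
    return (q_next, t_arr + t_tran, {**Theta, sigma: t_arr})

# Tiny machine: one state "s" self-looping on input "a" with a transition
# time of 1.0 and a past constraint of 3.0 on "a" itself.
delta = lambda q, s: ("s", 0, 1.0)
theta = lambda q, s: {("a", 3.0)}
init = ("s", 0.0, {"a": -math.inf})

c1 = step(init, "a", 5.0, delta, theta)
assert c1 == ("s", 6.0, {"a": 5.0})
assert step(c1, "a", 5.5, delta, theta) == Q_ERR   # arrives mid-transition
assert step(c1, "a", 7.0, delta, theta) == Q_ERR   # violates 3.0 past constraint
assert step(c1, "a", 8.0, delta, theta) == ("s", 9.0, {"a": 8.0})
```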
Dispatch Relation The Dispatch Relation is used to retrieve the highest priority
transition that leaves qcurr for all the inputs σ in the set of simultaneous inputs ⇀σ
arriving at τarr (choosing one non-deterministically if multiple candidate transitions
have the same priority); this allows the machine to continue processing inputs.
Trace Relation The Trace Relation is used to determine the outputs that result from
running the Dispatch Relation over the entirety of the inputs. This sequence of outputs
produced by a PyLSE Machine is defined as follows:
Definition 4 (Output Trace for a PyLSE Machine). Given a set of simultaneous inputs
⟨⇀σ, τ⟩, an output trace t is the sequence of sets of tagged outputs produced by the Transition
Relation over these input sets, one by one via the Dispatch Relation. Given PyLSE Machine M
and an initial configuration κinit = κ⟨qinit, 0, {σ ↦ −∞ | σ ∈ M.Σ}⟩, we write κinit ↠trace ⟨κ, t⟩;
the PyLSE Machine M ends in the configuration κ.
5.3.3 Formalizing a Network of PyLSE Machines
While each individual PyLSE Machine models a particular type of SCE cell, a
network of communicating PyLSE Machines models a larger design:
Definition 5 (Network Domain of PyLSE Machines). A network of PyLSE Machines,
which we call a circuit, is a tuple C = ⟨⇀M, ⇀w, Σ, Λ⟩ composed of a set of PyLSE Machines ⇀M
(accessed as C.machines), a set of connective wires ⇀w (accessed as C.wires), and a set of circuit
inputs C.Σ and outputs C.Λ. A wire is a tuple w = ⟨α, β⟩ such that α ∈ M′.Λ ∪ C.Σ and
β ∈ M′′.Σ ∪ C.Λ for some M′, M′′ ∈ ⇀M.
Network Relation The Network Relation of Figure 5.7 shows the semantics of how a
sequence of externally derived time-tagged pulses ts propagates through the network.
We define an initial circuit configuration κCinit, composed of (1) all individual PyLSE
Machine initial configurations ⇀κ and (2) a list of input pulses ps tagged with the
wires where they are headed, i.e. κCinit = ⟨⇀κ, ps⟩, where ⇀κ = {κMinit | M ∈ C.machines} and
ps = {⟨σ′, τarr⟩ | ⟨σ, τarr⟩ ∈ ts ∧ ⟨σ, σ′⟩ ∈ C.wires}. The network proceeds until there is no
more work to do (meaning all pending pulses in ps are directed toward the circuit
output), i.e. ⟨κCinit, ps⟩ ↠net ⟨κC′, ps′⟩. Non-determinism occurs when there are multiple
simultaneous pending pulses on the heap ps going to different PyLSE Machines; the
helper function getSimPulses chooses one to update before proceeding with the next.
5.3.4 Example: PyLSE Machine Definition of the Synchronous And Element
We conclude our discussion of PyLSE Machines by mathematically defining the
Synchronous And Element PyLSE Machine as shown in Figure 5.6. Formally,
MAnd = ⟨QAnd, q0And, ΣAnd, ΛAnd, δAnd, µAnd, θAnd⟩
where
QAnd = {qidle, qa arrived, qb arrived, qa and b arrived}
q0And = qidle
ΣAnd = {σA, σB, σCLK}
ΛAnd = {λQ}
µ(q, σ) =
    {⟨λQ, τprop⟩}   if q = qa and b arrived ∧ σ = σCLK
    ∅               otherwise

θ(q, σ) =
    {⟨σA, τsetup⟩, ⟨σB, τsetup⟩, ⟨σCLK, τsetup⟩}   if q ∈ {qidle, qa arrived, qb arrived, qa and b arrived} ∧ σ = σCLK
    ∅                                             otherwise
δ σA σB σCLK
qidle ⟨qa arrived, 1, 0.0⟩ ⟨qb arrived, 1, 0.0⟩ ⟨qidle, 0, τhold⟩
qa arrived ⟨qa arrived, 1, 0.0⟩ ⟨qa and b arrived, 1, 0.0⟩ ⟨qidle, 0, τhold⟩
qb arrived ⟨qa and b arrived, 1, 0.0⟩ ⟨qb arrived, 1, 0.0⟩ ⟨qidle, 0, τhold⟩
qa and b arrived ⟨qa and b arrived, 1, 0.0⟩ ⟨qa and b arrived, 1, 0.0⟩ ⟨qidle, 0, τhold⟩
5.4 PyLSE Language Design
We use the above PyLSE Machine formalisms to develop a practical embedded
DSL that eases the description and analysis of SCE designs at multiple levels. By being
embedded in Python, we lower the barrier of entry for new users and gain the enormous
productivity benefits of using Python’s libraries. The language will be open-sourced
upon publication of this work.
5.4.1 Design Levels
Cell Definition Level: Given that there is still no dominant logic scheme for SCE
designs, the ability to easily define new cells is crucial for the advancement of the field.
We enable this by providing a Transitional Python abstract class; each SCE cell is
modeled as an implementing class by defining the set of input and output names as well
as a list of transitions. Each transition in this list is represented as a Python dictionary,
storing key-value pairs exactly corresponding to the information found in Figure 5.5.
PyLSE comes with a library containing all the basic SCE cells and provides templates
for the creation of custom ones.
Let’s take the example Synchronous And Element, as shown in the state diagram
in Figure 5.6 and the semantics given at the end of Subsection 5.3.4. The PyLSE code for
this cell is found in Figure 5.8. It implements an abstract class SFQ, which is a child
of the Transitional class mentioned previously; its purpose is to require additional
properties specific to SFQ cell design from its implementing classes. For our purposes,
it only requires that the jjs (the number of Josephson junctions) and firing_delay
properties exist on the class.
The priorities of transitions can be given explicitly, via the priority key in each
transition dictionary, or implied by the order in which they are listed. For example,
class AND(SFQ):
    _setup_time = 2.8
    _hold_time = 3.0

    name = 'AND'
    inputs = ['a', 'b', 'clk']
    outputs = ['q']
    transitions = [
        {'id': '0', 'source': 'idle', 'trigger': 'clk', 'dest': 'idle',
         'transition_time': _hold_time, 'past_constraints': {'*': _setup_time}},
        {'id': '1', 'source': 'idle', 'trigger': 'a', 'dest': 'a_arrived'},
        {'id': '2', 'source': 'idle', 'trigger': 'b', 'dest': 'b_arrived'},
        {'id': '3', 'source': 'a_arrived', 'trigger': 'b', 'dest': 'a_and_b_arrived'},
        {'id': '4', 'source': 'a_arrived', 'trigger': 'a', 'dest': 'a_arrived'},
        {'id': '5', 'source': 'a_arrived', 'trigger': 'clk', 'dest': 'idle',
         'transition_time': _hold_time, 'past_constraints': {'*': _setup_time}},
        {'id': '6', 'source': 'b_arrived', 'trigger': 'a', 'dest': 'a_and_b_arrived'},
        {'id': '7', 'source': 'b_arrived', 'trigger': 'clk', 'dest': 'idle',
         'transition_time': _hold_time, 'past_constraints': {'*': _setup_time}},
        {'id': '8', 'source': 'b_arrived', 'trigger': 'b', 'dest': 'b_arrived'},
        {'id': '9', 'source': 'a_and_b_arrived', 'trigger': 'clk', 'dest': 'idle',
         'transition_time': _hold_time, 'past_constraints': {'*': _setup_time},
         'firing': 'q'},
        {'id': '10', 'source': 'a_and_b_arrived', 'trigger': ['a', 'b'],
         'dest': 'a_and_b_arrived'},
    ]
    jjs = 11
    firing_delay = 9.2
Figure 5.8: Synchronous And Element PyLSE code.
mem = defaultdict(lambda: 0)
raddr = waddr = wenable = data = 0

@pylse.hole(delay=5.0,
            inputs=['ra3', 'ra2', 'ra1', 'ra0', 'wa3', 'wa2', 'wa1', 'wa0',
                    'd1', 'd0', 'we', 'clk'],
            outputs=['q1', 'q0'])
def memory(ra3, ra2, ra1, ra0,
           wa3, wa2, wa1, wa0,
           d1, d0, we, clk, time):
    nonlocal raddr, waddr, wenable, data
    raddr |= ra3*8 + ra2*4 + ra1*2 + ra0
    waddr |= wa3*8 + wa2*4 + wa1*2 + wa0
    data |= d1*2 + d0
    wenable |= we
    if clk:
        if wenable:
            mem[waddr] = data
        value = mem[raddr]
        raddr = waddr = wenable = data = 0
    else:
        value = 0
    return ((value >> 1) & 1), value & 1
Figure 5.9: An example Functional (“hole”) element modeling a memory with 16 addresses, each storing 2 bits.
in the Synchronous And Element, the transition leaving idle on clk is given before the transition
leaving idle on a; thus, the former’s trigger has priority over the latter’s. This priority
order is isolated to transitions with the same source state; for example, the first and
fourth transitions have different source states (idle and a_arrived, respectively), and
thus their relative order in this list of transitions is irrelevant. Finally, all transitions
have a default transition_time of 0, and all firing delays for SFQ cells use the class’s
firing_delay for its value (for even greater flexibility, Transitional allows subclasses
to set the firing delay individually per transition and output).
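The implicit, per-source-state priority ordering described above can be sketched in a few lines of Python (a simplified model of the rule; the helper name assign_priorities is ours and this is not PyLSE's actual implementation):

```python
def assign_priorities(transitions):
    """Assign implicit priorities by list order, counted separately per source
    state, to any transition lacking an explicit 'priority' key."""
    counters = {}
    out = []
    for t in transitions:
        t = dict(t)  # avoid mutating the caller's dictionaries
        n = counters.setdefault(t['source'], 0)
        t.setdefault('priority', n)  # explicit 'priority' keys are kept as-is
        counters[t['source']] = n + 1
        out.append(t)
    return out
```

Under this model, two transitions with different source states can share a priority number without ambiguity, matching the observation that their relative list order is irrelevant.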
Hole Description Level: To facilitate the rapid prototyping and exploration of more
complicated designs without the need to describe every single block via interacting tran-
Figure 5.10: Graphical results of simulating the memory Functional class.
sition systems, PyLSE provides the Hole Description Level. At this level, pure Python
code is wrapped behind a specialized interface (by implementing a Functional abstract
class), allowing non-transition-based abstract “holes” to communicate via pulses with
the rest of the system. The Functional class takes as initialization parameters (1) a
Callable function mapping time-tagged input to output pulses, (2) the list of input
and output names, and (3) the firing delay for each output. The user can also simply
wrap a Python function (with the appropriate signature) with the hole decorator. Note
that these holes do not abide by the formal semantics of Section 5.3.
An example functional element is found in Figure 5.9, which shows how to create a
memory by wrapping a Python dictionary behind a function with a pulse-communicating
interface. This function, memory(), takes in twelve boolean-valued arguments and a
thirteenth argument, time, which is implicitly passed as the last argument to all
functional elements. The read and write addresses, ra* and wa*, respectively, are
split into four 1-bit inputs, and a pair of nonlocal variables raddr and waddr are used
to accumulate which address bits have been seen since the last clock pulse. When a
clock pulse arrives, the memory is updated if write is enabled, the newly read value
is produced as a tuple of 1-bit values, and the accumulators are reset, ready for the
next period. The arguments are internally connected to PyLSE Wires in the network,
and the framework automatically converts the presence of a pulse on one or more of
them at a particular instant into a call to memory(), passing a value of 1 for each of the
corresponding arguments plus the current time in time.
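This pulse-to-call conversion can be sketched as follows (a simplified model of the behavior just described; the helper name pulses_to_call and its argument layout are ours, not part of PyLSE):

```python
def pulses_to_call(func, input_names, pulsed, time):
    """Convert the presence of pulses on named wires at one instant into a call
    on the wrapped hole function: pass 1 for each input whose wire pulsed, 0
    otherwise, plus the current time as the final argument."""
    args = [1 if name in pulsed else 0 for name in input_names]
    return func(*args, time)
```

For example, if only a and clk pulse at time 100.0, a three-input hole is called with (1, 0, 1, 100.0).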
Figure 5.10 shows the result of simulating the memory hole against a variety of
inputs.
Full-CircuitDesign Level: Nodes of Transitional and Functional class instances are
interconnected with Wires and added to the circuit workspace at the Full-Circuit Design
(a) Block diagram.
from pylse import s, c, c_inv, jtl
def min_max(a, b):
    a1, a2 = s(a)
    b1, b2 = s(b)
    low = c_inv(a1, b1)
    high = c(a2, b2)
    high = jtl(high, firing_delay=2.0)
    return low, high
(b) PyLSE code.
Figure 5.11: Min-max pair. Inputs a and b are duplicated by the splitters. a1 and b1 enter the Inverted C Element, which propagates an output pulse on low some delay after the first of its inputs arrives. a2 and b2 are fed into the C Element, whose output is delayed via a Josephson transmission line (for path balancing) before being emitted as the high output. The 2.0 JTL delay has been calculated based on the difference in delays between the paths to low and to high, assuming a splitter delay of 11, C Element delay of 12, and Inverted C Element delay of 14. Thus, given low, high = min_max(a, b), the earlier input pulse propagates to low after 11 + 14 = 25 ps and the later to high (likewise after 11 + 12 + 2 = 25 ps).
level. The code in Figure 5.11b provides an example of a Min-Max pair implemented
with two splitters, a C Element, an Inverted C Element, and a JTL (all basic SFQ
cells [171]) following recently introduced temporal conventions [150]. A Min-Max
pair circuit is an attractive application for pulse-based computation because it sorts its
input according to arrival times by using race logic primitives like First Arrival and Last
Arrival (implemented by the Inverted C Element and C Element, respectively).
Calling the function min_max(a,b) causes its constituent cells and connective wires
to be instantiated via the calls to the encapsulating functions s, c, c_inv, and jtl. These
functions take in wire objects and return one or more output wire objects as a result; by
updating the circuit workspace automatically, this encapsulation enables basic cells
to resemble Python operators and improve language usability. These functions can
also take in optional arguments, making it easy to override properties like firing delay,
transition time for arbitrary transitions, and the number of Josephson junctions used in a
particular element instance (the latter of which is an area metric based on the number of
switching elements used by the design). At this level, full application implementations
can be realized through the technique of elaboration-through-execution [119, 172, 173],
although here, unlike traditional HDLs, the underlying primitives used by higher level
generators are inherently stateful and pulse-based.
5.4.2 Syntactic and Semantic Checks
PyLSE provides several syntactic and semantic checks to alert the user if a design
is ill-formed. At the Cell Definition level, PyLSE ensures the list of a cell’s transitions
constitutes a well-formed transition system. These include simple checks such as the use
of recognized field names, references to valid input triggers and output signal names,
and the inclusion of an idle starting state. More advanced checks include the complete
specification of transitions for every possible trigger, and that at least one transition has
been defined that fires an output.
At the Circuit Design level, we currently check that all circuit outputs have a “fanout”
of one. In SCE, the outputs of arbitrary cells cannot immediately be shared by multiple
inputs; instead, a splitter cell must be used, which is specifically designed to forward an
incoming pulse to two different outgoing wires. The example in Figure 5.11b includes
two splitter cells (lines 3 and 4) to allow a and b to be used in two different places;
PyLSE reports an error on instantiation if, for example, input a is used in both lines 5
and 6.
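The fanout check can be sketched as follows (an illustrative model; the function check_fanout and its input format are ours, not PyLSE's API):

```python
def check_fanout(wire_uses):
    """Verify each wire is consumed in exactly one place. In SCE, sharing an
    output requires an explicit splitter cell, so a fanout greater than one is
    reported as an error. `wire_uses` maps a wire name to the list of places
    where it is consumed."""
    errors = []
    for wire, uses in wire_uses.items():
        if len(uses) > 1:
            errors.append(
                f"wire '{wire}' has fanout {len(uses)}; insert a splitter")
    return errors
```

Applied to the erroneous variant of Figure 5.11b in which a feeds both the Inverted C Element and the C Element directly, such a check flags the doubly-used wire.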
5.4.3 Simulation
PyLSE’s built-in simulator can be used to validate designs against a given set of
input signals. It follows the principles of other discrete-event simulation frameworks
1 from pylse import inp_at, inp, and_s, Simulation
2 a = inp_at(125, 175, 225, 275, name='A')
3 b = inp_at(75, 185, 225, 265, name='B')
4 clk = inp(start=50, period=50, n=6, name='CLK')
5 out = and_s(a, b, clk, name='Q')
6 sim = Simulation()
7 events = sim.simulate()
8 assert events['Q'] == [209.2, 259.2, 309.2]
9 sim.plot()
(a) Exhaustively simulating a Synchronous And Element.
(b) Simulation result in graphical form.
Figure 5.12: Simulation of the Synchronous And Element in PyLSE. Pulses occur on wires A at 125.0, 175.0, 225.0, and 275.0; B at 75.0, 185.0, 225.0, and 265.0; and CLK every 50.0 ps, six times starting at 50.0. The cell’s default firing delay is 9.2, so we check for expected pulses on Q 9.2 ps after a clock period in which both A and B were received.
Table 5.1: Functions used in the code in Figure 5.12a. The first four return a named wire, while simulate() returns a mapping from each named wire to the ordered list of pulses that appeared on it. The last two are methods on the Simulation class.

Function                                    Description
inp_at(*times, name=None)                   Produce pulses at each time in *times.
inp(start=0, period=0, n=1, name=None)      Produce pulses starting at start, occurring n more times every period picoseconds.
split(wire, n=2, names=None, **overrides)   Split a wire n ways, creating n-1 splitter elements in a binary tree.
inspect(wire, name)                         Give a wire a name for observation during simulation.
simulate(self, until, variability=False)    Simulate the circuit until a certain time or until all pulses are processed.
plot(self)                                  Produce a graph plotting the pulses against time.
b = inp_at(99, 185, 225, 265, name='B')
...
events = sim.simulate()

pylse.pylse_exceptions.PylseError: Error while sending
input(s) 'clk' to the node with output wire '_0':
Prior input violation on FSM 'AND'. A constraint on
transition '7', triggered at time 100.0, given via the
'past_constraints' field says it is an error to trigger
this transition if input 'b' was seen as recently as
2.8 time units ago. It was last seen at 99.0, which is
1.7999999999999998 time units too soon.
Figure 5.13: Changing the first time at which a pulse is produced on B in the simulationof Figure 5.12a rightfully results in a past constraint error due to the setup time.
[174] and, according to the semantics provided in Figure 5.7, maintains a priority heap
of pending pulses tagged with their destination cells. When their turn comes, these
pulses get popped off the heap and propagate through the PyLSE circuit under test. Any
newly generated pulses get pushed onto the heap, and the process continues iteratively
until the heap is empty or a user-defined target time is reached (needed when there are
loops in the system).
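The event loop just described can be sketched in a few lines (a minimal model of the discrete-event strategy, not PyLSE's simulator; the event tuple layout and the callback name propagate are our assumptions):

```python
import heapq

def simulate(initial_pulses, propagate, until=float('inf')):
    """Minimal discrete-event loop: pop the earliest pending pulse, let the
    circuit react, and push any newly generated pulses back onto the heap.
    `initial_pulses` is a list of (time, destination, signal) events and
    `propagate` returns the new events caused by processing one pulse."""
    heap = list(initial_pulses)
    heapq.heapify(heap)  # tuples order by time first
    processed = []
    while heap and heap[0][0] <= until:  # target time bounds looping circuits
        time, dest, sig = heapq.heappop(heap)
        processed.append((time, dest, sig))
        for new_event in propagate(time, dest, sig):
            heapq.heappush(heap, new_event)
    return processed
```

The `until` bound plays the role of the user-defined target time mentioned above, ensuring termination when feedback loops keep generating pulses.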
Figure 5.12a shows how a single instance of the Synchronous And Element gets
instantiated and simulated, using many of the functions described in Table 5.1. In lines
2 and 3, we create two inputs named A and B, producing four pulses on each. Line 4
creates a periodic clock signal, while lines 6 and 7 create and start a simulation object.
Line 8 verifies the correctness of pulses appearing on output Q; in this example, the
first appears at 209.2 ps, exactly firing_delay after the input pulse on CLK that ended
the first clock period in which both A and B appeared. Line 9 produces the graph in
Figure 5.12b. Finally, Figure 5.13 shows the PyLSE simulator catching a past constraints
violation (the setup time constraint); the first pulse produced on B arrives too soon
before the next pulse that arrives on CLK.
5.4.4 Correspondence with Timed Automata
Timed Automata (TA) are a related formalism with a rich theoretical foundation, used
extensively to model real-time systems with timing constraints. A Timed Automaton
[155] is a finite state machine whose state transitions are guarded by conditions on a
set of resettable clocks, defined formally as follows:
Definition 6 (Timed Automata). A Timed Automaton A = ⟨L, linit, Σ, C, E, I⟩ is a tuple where
(l ∈) L is a set of locations, linit ∈ L is the initial location, (α ∈) Σ is the set of actions, (c ∈) C is a
set of clocks, I : L → Φ(C) assigns a clock invariant to each location, and

(e ∈) E ⊆ L × Σ × Φ(C) × P(C) × L

is the set of transitions. A transition e = ⟨l, α, φ, λ, l′⟩ ∈ E moves from location l to l′ on action α,
where φ is the guard specifying conditions that must hold on the clocks, and λ is the set of clocks
to be reset after the transition.
Semantics The operational semantics are a transition relation on snapshots of the
timed automaton in time; we call these snapshots configurations and define them as
(a) Original PyLSE Machine transition, moving from both to idle on CLK only if τhold time has passed, producing output Q after τprop time. It is an error if any inputs (i.e. *) arrived in the last τsetup time units prior to starting this transition. The priority 0 is ignored here in isolation.

(b) (Intermediary step) Expanding the original transition into two intermediate TA transitions that handle receiving a message on channel CLK (corresponding to original symbol CLK) (left edge), checking for the transition time to have passed (right edge), and erroring out (to errAs, errBs, or errCLKs) if τsetup is violated.

(c) (Final step, part 1) Further expanding the transition to include transitions for firing output f and erroring out (to errAh, errBh, or errCLKh) if unexpected inputs are received during a transitionary period.

(d) (Final step, part 2) Auxiliary TA created for modeling firing delay. The TA in Figure 5.14c sends a message on channel f, which is received here. After τprop time units, output finally appears on output channel Q. This firing TA, including a fresh clock cp, is duplicated by a soaking factor s = ⌈τprop/τhold⌉ to allow the network to fire again if needed during the transition.

Figure 5.14: Expanding a PyLSE Machine transition into its corresponding TA transitions, using an edge from the Synchronous And Element (for brevity, we’ve replaced the state named a and b arrived with both). We assume clocks ch, cs, and cp and channels A, B, CLK, f, and Q. Shaded states (or “...” edges) indicate old states (edges) repeated from the previous figure.
location/clock-valuation tuples, such that the invariant at that location is satisfied:
Conf(A) = {⟨l, ν⟩ | l ∈ L, ν : C → Time, ν ⊨ I(l)}. A configuration ⟨l, ν⟩ moves to a
configuration ⟨l′, ν′⟩ by one of two transitions: an action transition or a delay transition.
In an action transition, ⟨l, ν⟩ moves to ⟨l′, ν′⟩ by either sending or receiving a message
on a channel; formally, ⟨l, ν⟩ −λ = α?!→ ⟨l′, ν′⟩ ⟺ ∃⟨l, α, φ, λ, l′⟩ ∈ E s.t. ν ⊨ φ ∧ ν′ = ν[λ :=
0] ∧ ν′ ⊨ I(l′). In a delay transition, ⟨l, ν⟩ moves to ⟨l, ν + t⟩ by way of time passing; formally,
⟨l, ν⟩ −λ = t→ ⟨l, ν + t⟩ ⟺ ∀t′ ∈ [0, t] : ν + t′ ⊨ I(l). The silent transition is represented by
σ. Finally, the initial configuration is defined as the set:

Confinit = {⟨linit, ν0⟩} ∩ Conf(A) with ν0(c) = 0, ∀c ∈ C
Using these definitions gives us:
Definition 7 (Timed Automata Semantics). The semantics S of a Timed Automaton are
defined by
S (A) = ⟨Conf(A),Time ∪ Σ?!, {λ | λ ∈ Time ∪ Σ?!},Confinit⟩
We now present two examples to better explain the different transition types:
Action Transition Example From a TA’s initial configuration, assuming there exists
another TA playing the role of an “environment” that sends a message on the a channel,
the TA is able to make an action transition:
⟨0, ν(x) ↦ 0⟩ −λ = a?→ ⟨1, ν(x) ↦ x⟩
The above transition simply indicates that the TA can receive a message on channel
a and move from location 0 to 1 instantaneously. In the first configuration, our clock
valuation sets the local clock x to 0, and in the second configuration, we set x to its
current value, indicating no time has passed.
Delay Transition Example By moving our imaginary token forward a few steps, the
TA can additionally make a delay transition. Assuming the following configurations:
⟨3, ν(x) ↦ t⟩ −λ = 2.3→ ⟨4, ν(x) ↦ t + 2.3⟩
This transition indicates that from location 3 the TA can’t move to location 4 without
moving time forward by 2.3 units (e.g. picoseconds). This concomitantly increases our
local clock, which incidentally also increases the global clock.
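The two transition kinds can be sketched over a configuration represented as a (location, clock-valuation) pair (a simplified model of the semantics above; invariant and guard checking are omitted, and the function names are ours):

```python
def delay(config, t):
    """Delay transition: stay in the same location while every clock
    advances uniformly by t time units."""
    loc, nu = config
    return loc, {c: v + t for c, v in nu.items()}

def reset(config, clocks):
    """Clock resets applied after taking an action transition: each clock
    in `clocks` returns to 0 while the rest keep their values."""
    loc, nu = config
    return loc, {c: (0.0 if c in clocks else v) for c, v in nu.items()}
```

For instance, delaying a configuration at location 3 by 2.5 units advances its local clock x from 1.0 to 3.5 without changing the location.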
Conversion to TA To directly obtain the benefits of TA, we can convert a PyLSE
Machine to a network of Timed Automata running in parallel. Figure 5.14 graphically
shows this conversion process for a single edge of the Synchronous And Element. This
is the same edge highlighted in Figure 5.6 and dissected in Section 5.3.1, but we’ve
replaced the state named a and b arrived with both for space constraints. At a high
level, this process works by expanding the edges from the original PyLSE Machine
into TA transition sequences. We first create TA clocks for each PyLSE Machine input
(cA, cB, and cCLK), in addition to a clock measuring the time elapsed on transitions (ch);
these clocks are available to all edges of this TA. Given the edge CLK0 τhold/{Qτprop}/{*τsetup}
emerging from state both, translation proceeds incrementally. The input symbol CLK of
the PyLSE Machine becomes a TA channel CLK on which messages are only received
by this automaton. The time it takes to complete the transition, τhold, becomes part of
the inequality in both location q0’s invariant and in the guard involving clock ch as part
of the final edge to idle. In addition, clocks cA, cB, and cCLK are compared against the past
constraint value, τsetup, in the first edge’s guard, going to an error state if violated and
otherwise permitting the transition to q0 to be taken. Figure 5.14b is the result of this
first conversion, producing an intermediate TA.
To handle detecting inputs while in the transitional period, Figure 5.14c inserts three
additional states, errAh, errBh, and errCLKh, one per possible input message that can be
received. Figure 5.14c also adds intermediate state q1, with the dual purpose of sending
a firing message f to an auxiliary TA created in Figure 5.14d, and for setting up the
clock that is used for checking that the transitional timing period has been satisfied
before going to state idle. The auxiliary TA in Figure 5.14d is created entirely alongside
the previous TA; when it receives a message f to fire, it waits the designated firing
delay time τprop before sending a message on channel Q. Here, producing output Q in
the original PyLSE Machine corresponds to sending a message on the channel Q created
solely for sending, allowing an output action and transition to occur in parallel. Finally,
UPPAAL admits a notion of priority by allowing channels to be declared with the
priority keyword along with an associated ordering of priorities. Given two channels
with two different priorities, only the higher priority action will be enabled.
There is a significant increase in complexity as one moves down from PyLSE Machine
to TA; the example in Figure 5.14 shows that at least 12 TA locations and 11 edges had
to be created for a single PyLSE Machine transition. The entire resultant TA network for
a single Synchronous And Element PyLSE Machine has 102 locations and 110 edges.
PyLSE properly encapsulates this complexity, allowing this much larger TA network
to instead be represented by the four states and twelve edges of the original PyLSE
Machine, as shown in Figure 5.6. The user can succinctly write the transition system in
our DSL, and it gets automatically converted to TA.
5.5 Evaluation
The goal of our evaluation is to prove the following claims:
Claim 1. PyLSE can be used to accurately model the functional and timing behavior of basic
SCE cells and larger designs.
Claim 2. PyLSE offers significant productivity gains over state-of-the-art HDLs for designing and
simulating basic SCE cells and larger designs.
Claim 3. PyLSE can be used in conjunction with a state-of-the-art model checker to formally
verify properties of basic SCE cells and larger designs.
To evaluate these claims, we implemented 16 basic cells (constituting the PyLSE
standard library) plus six larger designs as listed in Table 5.3. While these designs
appear relatively small, we note that each basic cell takes 15-100+ lines to represent
in Verilog (the most commonly used approach). In [175], for example, the OR cell
autogenerated from their analog model takes 58 lines of Verilog. As far as we know,
there are no open source cell libraries available containing all the basic cells we list in
Table 5.3, making direct comparison difficult.
5.5.1 SPICE Simulation Comparison
Circuit designers perform simulationswith low-level languages like SPICE [176] and
WRSpice [177] to create analog gate models using fundamental electrical components.
However, this process can be time-consuming and requires significant domain expertise.
SPICE and PyLSE occupy different levels of abstraction, with each complementing
the other; the information on delays produced by highly accurate SPICE simulation
can inform models of the gates via their respective PyLSE Machines. It is through
this abstraction that PyLSE can improve productivity by making it easier to scale
and simulate larger designs before physically implementing them. However, SPICE
simulations also highlight other parasitic effects that can change delays when two or
Figure 5.15: An eight-input bitonic sorter, composed of twenty-four comparators (see Figure 5.11a). It takes eight individual input wires (i0 through i7), which are produced in arrival-time order, after some network delay, on o0 through o7.
more gates are connected together – thus a key to developing a successful simulator
at a different level of abstraction is to verify that the two match in spite of design size.
Small discrepancies in delays are expected to be found as a result of both loading effects
and additional buffering stages used to improve signal fidelity.
For a more detailed discussion, we will focus on an 8-input bitonic sorter and the
cells that compose it. A bitonic sorter [178] is a parallel sorting network made up of
many Min-Max pair blocks (Figure 5.11), connected like in Figure 5.15. Figure 5.16
shows the PyLSE code for creating an arbitrary N-input bitonic sorter (where N is a
power of two). By setting N = 8, we get the bitonic sorter presented in Figure 5.15.
For line count comparison against SPICE code later, we use a version where N has been
hard-coded to 8, meaning we manually connect the Min-Max pairs together, as shown
in the PyLSE code in Figure 5.17; the two versions are equivalent.
To validate the accuracy of PyLSE, we compare simulation results of running the
four designs shown in Table 5.2 in both SPICE via Cadence [179] and PyLSE.
Figures 5.18, 5.19, 5.20, and 5.21 compare simulating these designs in PyLSE and
SPICE. These SPICE simulations operate on the level of the analog design (for example,
def split(*args):
    mid = len(args) // 2
    return args[:mid], args[mid:]

def cleaner(*args):
    upper, lower = split(*args)
    res = [min_max(*t) for t in zip(upper, lower)]
    new_upper = tuple(t[0] for t in res)
    new_lower = tuple(t[1] for t in res)
    return new_upper, new_lower

def crossover(*args):
    upper, lower = split(*args)
    res = [min_max(*t) for t in zip(upper, lower[::-1])]
    new_upper = tuple(t[0] for t in res)
    new_lower = tuple(t[1] for t in res[::-1])
    return new_upper, new_lower

def merge_network(*args):
    if len(args) == 1: return args
    upper, lower = cleaner(*args)
    return merge_network(*upper) + merge_network(*lower)

def block(*args):
    upper, lower = crossover(*args)
    if len(upper + lower) == 2: return upper + lower
    return merge_network(*upper) + merge_network(*lower)

def bitonic_helper(*args):
    if len(args) == 1: return args
    else:
        upper, lower = split(*args)
        new_upper = bitonic_helper(*upper)
        new_lower = bitonic_helper(*lower)
        return block(*new_upper + new_lower)

def bitonic_sort(*args):
    if len(args) == 0:
        raise pylse.PylseError("bitonic_sort requires at least one argument to sort")
    if len(args) & (len(args) - 1) != 0:
        raise pylse.PylseError("number of arguments to bitonic_sort must be a power of 2")
    return bitonic_helper(*args)
Figure 5.16: An N-input bitonic sorter implementation written in PyLSE. The only SFQ-specific cells are created in the calls within cleaner() and crossover() to min_max(), whose definition is listed in Figure 5.11b. Given N unordered input wires, the sorter returns N ordered output wires. The “value” of a wire is determined by the time at which the pulse arrives, such that the comparison operation between two wires a and b is true if a arrives earlier than b. The associated block diagram is found in Figure 5.15.
def bitonic_sort_8(a1, a2, a3, a4, a5, a6, a7, a8):
    c1l, c1h = min_max(a1, a2)
    c2l, c2h = min_max(a3, a4)
    c3l, c3h = min_max(c1h, c2l)
    c4l, c4h = min_max(c1l, c2h)
    c5l, c5h = min_max(c4l, c3l)
    c6l, c6h = min_max(c3h, c4h)

    c7l, c7h = min_max(a5, a6)
    c8l, c8h = min_max(a7, a8)
    c9l, c9h = min_max(c7h, c8l)
    c10l, c10h = min_max(c7l, c8h)
    c11l, c11h = min_max(c10l, c9l)
    c12l, c12h = min_max(c9h, c10h)

    c13l, c13h = min_max(c5l, c12h)
    c14l, c14h = min_max(c5h, c12l)
    c15l, c15h = min_max(c6l, c11h)
    c16l, c16h = min_max(c6h, c11l)
    c17l, c17h = min_max(c14l, c16l)
    c18l, c18h = min_max(c16h, c14h)
    c19l, c19h = min_max(c13l, c15l)
    c20l, c20h = min_max(c15h, c13h)
    c21l, c21h = min_max(c19l, c17l)
    c22l, c22h = min_max(c19h, c17h)
    c23l, c23h = min_max(c18l, c20l)
    c24l, c24h = min_max(c18h, c20h)
    return c21l, c21h, c22l, c22h, c23l, c23h, c24l, c24h
Figure 5.17: An 8-input bitonic sorter implementation written in PyLSE. The SFQ-specific cells are created in the calls to comp(), whose definition is listed in Figure 5.11b. Simulation results are in Figure 5.21a. This function is an alternative way of writing the bitonic sorter produced by passing eight arguments to the function in Figure 5.16.
(a) PyLSE simulation (C Element).
(b) SPICE simulation (C Element).
Figure 5.18: SPICE vs. PyLSE simulation results for the C Element.
(a) PyLSE simulation (Inverted C Element).
(b) SPICE simulation (Inverted C Element).
Figure 5.19: SPICE vs. PyLSE simulation results for the Inverted C Element.
Table 5.2: Simulation times of PyLSE vs. SPICE-level models. For the C and InvCelements, size refers to the number of transitions in the DSL (≈ the number of lines),and for the rest, the number of non-whitespace lines within the function. The numberof SPICE lines reported is the size of the unflattened netlist.
Name             SPICE (Cadence)        PyLSE
                 Lines     Time (s)     Size    Time (s)
C                81        2.840        6       0.000298
InvC             87        2.987        6       0.000336
Min-Max Pair     140       4.608        5       0.000617
Bitonic Sort 8   250       52.565       24      0.003857
(a) PyLSE simulation (min-max).
(b) SPICE simulation (min-max).
Figure 5.20: SPICE vs. PyLSE simulation results for the min-max pair.
(a) PyLSE simulation (8-input bitonic sort).
(b) SPICE simulation (8-input bitonic sort).
Figure 5.21: SPICE vs. PyLSE simulation results for the eight-input bitonic sorter.
Figure 5.3a) while the PyLSE simulations operate on the level of the PyLSE Machine
(Figure 5.3b). Figures 5.18a (PyLSE) and 5.18b (SPICE) simulate the C Element; given
identical inputs and C cell propagation delay (12 ps), the output times of both simulations match exactly: inputs on A arrive at 80.0, 230.0, 380.0, and 530.0 ps and on B at 50.0, 220.0, 390.0, and 560.0 ps, and output Q pulses at 92.0, 242.0, 402.0, and 572.0 ps. Figures 5.20a (PyLSE) and 5.20b (SPICE) simulate the min-max
pair, tested with three pairs of (A, B) inputs: (115, 64), (184, 215), and (304, 315) ps.
The SPICE model is balanced, with a propagation delay along all paths of 22 ps; however, the PyLSE model’s propagation delay is 25 ps. As a result, for the SPICE model, LOW = min(A, B) + 22 ps and HIGH = max(A, B) + 22 ps, such that the first LOW pulse appears at 64 + 22 = 86 ps and the first HIGH pulse appears at 115 + 22 = 137 ps. The PyLSE model’s output pulses appear 3 ps later, e.g. for this first pair, at 89 ps and 140 ps.
The discrepancy arises because the given PyLSE design was created as a pure
composition of the individual cells. When combined together in SPICE, however, the
entire system exhibits a smaller total propagation delay than what would be assumed
from the sum of its parts, due to the parasitic effects mentioned previously. Note that
the delay of each individual cell in PyLSE can be tuned, or variability added (see Section
5.5.2), to match the SPICE behavior more closely if desired.
Figure 5.21a (PyLSE) and 5.21b (SPICE) show the waveforms for the 8-input bitonic
sorter given inputs 80, 80, 130, 130, 180, 230, 280, 330 ps. The composability issue
creeps up here as well: the SPICE model’s entire propagation delay is between 100 and
110 ps, while a purely compositional delay would equal the min-max SPICE model’s
delay (22 ps) multiplied by the depth of the network (6 according to Figure 5.15), i.e.
22 ∗ 6 = 132 ps. Despite this discrepancy, we can see that the PyLSE design functions correctly according to its own total delay, i.e. 25 ∗ 6 = 150 ps: here the pulse arriving on input IN4 (the earliest input pulse) is produced 150 ps later on OUT0, and, generally, the output pulses appear in rank order, behaving correctly. Table 5.2 compares sizes
and simulation times of these designs; the PyLSE versions are an average of 16.6×
smaller than their SPICE counterparts and take several orders of magnitude less time to
simulate (average 9879× less). These example simulations demonstrate an important
tradeoff: the extremely high accuracy of the analog design level (SPICE) versus the
scalability and rapid prototyping of PyLSE.
5.5.2 Simulation and Dynamic Correctness Checks
Harnessing the rich features of Python, we can quickly validate our designs for basic
correctness using the events dictionary returned from a simulation run; in this section
we give some examples. We note that these and similar tests have been performed on
all 22 designs from Table 5.3.
2x2 Join The 2x2 Join element is a dual-rail logic primitive that takes in two pairs of
inputs, AT and AF , and BT and BF , and produces one of four outputs Q00, Q01, Q10, and
Q11 depending on the last pair of A∗, B∗ inputs seen. This element has been implemented
in PyLSE and takes 12 transitions to fully specify. A precondition for this cell to function
properly is that AT and AF never arrive consecutively without an interleaving BT or BF
(and vice versa). This can be written succinctly as follows:
inputs = sorted(((w, p) for w, evs in events.items()
                 for p in evs if w in ('A_T', 'A_F', 'B_T', 'B_F')),
                key=lambda x: x[1])
zipped = list(zip(inputs[0::2], inputs[1::2]))
assert all(x[0] != y[0] for x, y in zipped)
The cell behaves correctly if, given this precondition, Q00 only pulses if AF and BF arrived, in any order (similarly for Q01 being produced only when AF and BT arrived, etc.). This condition can be written as follows:
outputs = sorted(((w, p) for w, evs in events.items()
                  for p in evs if w in ('Q00', 'Q01', 'Q10', 'Q11')),
                 key=lambda x: x[1])

def to_output(*names):
    x, y = sorted(names)
    return {
        ('A_F', 'B_F'): 'Q00', ('A_F', 'B_T'): 'Q01',
        ('A_T', 'B_F'): 'Q10', ('A_T', 'B_T'): 'Q11',
    }[(x, y)]

assert [w[0] for w in outputs] == \
    [to_output(x[0], y[0]) for x, y in zipped]
Race Tree A race tree [180] is a decision tree that uses race logic to produce a label
based on a set of internal decision branches. We implemented a race tree in PyLSE by
composing 18 basic SFQ cells together in a total of 20 lines of code. A fundamental
correctness property of these trees is that one and only one output label pulses for a
given set of input pulses. The following assertion encodes this condition using the events dictionary from before:
assert sum(len(l) for o, l in events.items()
           if o in ('a', 'b', 'c', 'd')) == 1
8-input Bitonic Sorter A bitonic sorter is correct if, given a single pulse on each input
at an arbitrary time (spaced far enough apart to satisfy transition time constraints), the
outputs appear in rank order. This property can be expressed as follows, assuming the
first output that should appear is named O0, followed by O1, etc. until the last output ON for some power-of-two N:
out_events = {w: evs for w, evs in events.items()
              if w.startswith('o')}
ordered_names = sorted(out_events.keys())
ranked = [es for _, es in sorted(out_events.items(),
          key=lambda x: ordered_names.index(x[0]))]
assert all(len(es) == 1 for es in ranked)
assert all(x[0] <= y[0] for x, y
           in zip(ranked, ranked[1:]))
Evaluating Robustness Given Timing Variability In the physical world, the propagation delays of the basic cells vary somewhat from their nominal values due to variability;
we saw this in the bitonic sort example of Section 5.5.1, where the SPICE delay varied
between 100 and 110 ps. Such variance can lead to pulses arriving at their destination
cells too early or late, causing the design to fail unexpectedly. At a PyLSE Machine
level, these failures are detected by violations of transition and past constraint times
during simulation or by erroneous outputs seen after simulation, and might signify
that the network needs to be redesigned to make it less sensitive to variability. PyLSE
makes it easy to add variability to existing designs and evaluate their robustness in the
presence of these variations; simply pass the flag variability=True to simulate().
Every individual propagation delay that occurs during the simulation will then have
a small amount of delay, by default taken from a Gaussian distribution, added to or
subtracted from it. The variability argument can be used to specify the cell types or
the individual cell instances where the default variability should be added, or it can be
set to a user-defined function for even greater fine-tuning.
The following demonstrates how to simulate the 8-input bitonic sorter with added
variability. The variation from the original delay of each cell has been capped to +/-
20%.
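The mechanism behind this flag can be sketched independently of PyLSE's API: each nominal propagation delay receives a Gaussian perturbation that is then clamped to the cap. In the following illustrative helper, the function name, the standard-deviation choice, and the cap fraction are assumptions for exposition, not PyLSE's actual interface.

```python
import random

def perturbed_delay(nominal, rng, stddev_frac=0.05, cap_frac=0.20):
    """Nominal delay plus a Gaussian perturbation, clamped so the
    variation never exceeds +/- cap_frac of the nominal value."""
    delta = rng.gauss(0.0, stddev_frac * nominal)
    cap = cap_frac * nominal
    return nominal + max(-cap, min(cap, delta))

# Example: perturb the 25 ps nominal delay of the min-max pair
# (Section 5.5.1) across many simulated pulses.
rng = random.Random(0)
delays = [perturbed_delay(25.0, rng) for _ in range(1000)]
assert all(20.0 <= d <= 30.0 for d in delays)
```

A user-defined variability function in PyLSE plays the role of perturbed_delay here, deciding per cell type or per cell instance how each delay is sampled.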
5.5.3 Model Checking in UPPAAL
Model checking [181] is a formal verification technique used to check that a particular property, typically written in a temporal logic, holds for certain states on a given
model of the system. Before it can be used, however, a model of the system must be
created. Timed Automata is one such model, and as we have shown in Section 5.4,
PyLSE can automatically transform PyLSE Machines into a network of communicating
Timed Automata; in this way, designs written in PyLSE are the models themselves, and
immediately amenable to formal verification.
We have chosen to integrate with UPPAAL, a state-of-the-art framework for modeling real-time systems based on TA [156]. The conversion process is straightforward: the PyLSE circuit is traversed, with every transition of every element being converted according to the steps in Figure 5.14 into a network of UPPAAL-flavored TA. The result is saved to an XML file, which can then be simulated in UPPAAL or verified against certain properties on the command line via the verifyta program included in their distribution.
A Note on Converting Input Sequences Input sequences in PyLSE are lists of timestamps with an associated name; during simulation, a pulse is sent along the wire with
the given name at the given time to its destination element. This pattern can be readily
converted to a Timed Automaton. For each timestamped input pulse, we create a state
and an invariant on the state indicating that time may not pass longer than the time
indicated by the timestamp. For transitions, we place a guard indicating time may
not pass less than that indicated by the timestamp, and then we issue an output pulse
on the prescribed channel. Time may be indicated either in an absolute fashion or by
Table 5.3: Basic cells (first 16 rows) and larger designs (last six rows) implemented in PyLSE. Each has been validated via PyLSE simulation for functional correctness and timing constraint violation detection, and automatically converted into TA that have been simulated and verified in UPPAAL. The PyLSE columns display counts for size, cells, states, and transitions; for basic cells, these are numbers for an individual cell, while for the larger designs, it is the accumulation of every instantiated cell in the network. The size corresponds to the number of transitions in the DSL (roughly equal to the number of lines) for basic cells, and the number of lines for the larger designs. The first four UPPAAL columns are the number of TA, locations, transitions, and channels in the cell’s generated TA network, while the latter two columns contain the time to verify the Queries 1 and 2 listed in Section 5.5.3 and the number of total states explored (only one number is listed in each column if the results for Queries 1 and 2 were the same). It took less than 1 second to simulate all of these designs in PyLSE.

[Table columns: Name | PyLSE (Size, Cells, States, Tran.) | UPPAAL (TA, Locs., Tran., Chan., Time (s), States) | Comparison (TA/Cells, Locs./States, Tran. (U)/Tran. (P)). Rows cover the 16 basic cells (C, InvC, M, S, JTL, And, Or, Nand, Nor, Xor, Xnor, Inv, DRO, DRO SR, DRO C, 2x2 Join) and the six larger designs (Min-Max, Race Tree, Adder (Sync), Adder (xSFQ), Bitonic Sort 4, Bitonic Sort 8); the per-row values are not legible in this copy.]
duration since the last pulse, since clocks in UPPAAL are resettable. Typically, with
well-designed circuits, this will permit only one possible enabled transition, which
matches the determinism found in PyLSE.
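As a rough illustration of this conversion (the names and textual syntax below are invented for exposition; the actual translator emits UPPAAL's XML format), the chain of states, invariants, and guards for one timestamped input wire can be generated mechanically:

```python
def input_wire_to_ta(wire, timestamps):
    """Illustrative only: render the TA chain for one timestamped input
    wire as text. Each pulse time t becomes a state whose invariant
    stops time from passing beyond t, plus an outgoing edge guarded on
    t that fires the pulse on the wire's channel."""
    lines = []
    for i, t in enumerate(timestamps):
        lines.append(f"state s{i}: invariant global <= {t}")
        lines.append(f"  s{i} -> s{i + 1}: guard global >= {t}, sync {wire}!")
    lines.append(f"state s{len(timestamps)}: terminal")
    return "\n".join(lines)

# The A input of the min-max pair from Section 5.5.3, pulsing at
# 115, 215, and 315 (upscaled to integers, as UPPAAL requires):
print(input_wire_to_ta("A", [115, 215, 315]))
```

Because the invariant and the guard on each state pin time to exactly the given timestamp, only one transition is enabled at a time, matching the determinism of the PyLSE simulation.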
Query 1: Correctness To verify that our translation process works, we automatically
converted all 16 basic cells and six larger designs into UPPAAL, as shown in Table 5.3,
where we note the resulting size of the TA network. Once in UPPAAL, we checked that
their internal simulator agrees with ours from an input/output perspective. We also
automatically generate a correctness formula in UPPAAL-flavored timed computation
tree logic (TCTL) [182, 183] for each, based on a given PyLSE simulation’s events, to
formally verify that the given design generates the expected output. For example, here
is a PyLSE-generated TCTL formula for the correctness of min-max pair, given pulses
on A at 115, 215, and 315, on B at 64, 184, and 304, and a network delay of 25 ps:
A[] (((firingauto3.fta_end imply ((global == 890) ||
(global == 2090) || (global == 3290))) &&
(firingauto4.fta_end imply ((global == 890) ||
(global == 2090) || (global == 3290))) &&
(firingauto5.fta_end imply ((global == 890) ||
(global == 2090) || (global == 3290)))) &&
((firingauto12.fta_end imply ((global == 1400) ||
(global == 2400) || (global == 3400)))))
At the top of this formula, A is a path quantifier that expresses “for all subsequent
time points”, while [] is a branch quantifier meaning “for all possible branches.” The
firingauto* correspond to firing TA instances, and fta_end is the location in that
instance that immediately follows sending a fire message to a particular network output
sink. As many firing TA may be associated with each network output (see Figure 5.14d),
there are multiple states to check for each time. This says that it is only possible to
produce a pulse at the given output at the given time. These times have been upscaled
to integers to meet the requirements UPPAAL places on numbers involved in clock
constraints; thus global == 2090 corresponds to a simulated time of 209.0 ps in PyLSE.
As another example, here is a PyLSE-generated TCTL formula for the correctness of the
8-input bitonic sorter, given inputs occurring at times 230, 130, 180, 280, 80, 80, 130, and
330 ps on inputs I0 through I7, respectively, and a network delay of 150 ps:
A[] (((firingauto185.fta_end imply ((global == 2300))) &&
(firingauto186.fta_end imply ((global == 2300))) &&
(firingauto187.fta_end imply ((global == 2300)))) &&
((firingauto333.fta_end imply ((global == 2300)))) &&
((firingauto107.fta_end imply ((global == 2800))) &&
...9 more lines...
(firingauto263.fta_end imply ((global == 4300)))) &&
((firingauto321.fta_end imply ((global == 4800)))))
In Table 5.3, we also show the time it took to verify this property (customized to each
cell). For the basic cells and the min-max pair, verification consistently took less than 1
second. The race tree, with 440 locations, took 127 seconds and explored 262,559 states, while the synchronous full adder, with nearly 43% more locations, took 669 seconds (5.26×) and visited 7.077× more states. Model checking becomes infeasible due to the
state explosion as we reach the bitonic sorters and xSFQ [151] full adder, which failed
to finish in a day. Table 5.3 also shows how much larger the network of TA is compared
to the original PyLSE Machines. On average, each cell (i.e. PyLSE Machine) requires
3.02 UPPAAL TA, each PyLSE Machine state requires 18.99 UPPAAL locations, and
each PyLSE Machine transition requires 9.05 UPPAAL transitions.
Query 2: Unreachable Error States Our translation process inserts error states that
are entered when transition time or past constraint violations occur (for example, errAh
and errAs, respectively, from Figure 5.14). Since these states have no outgoing edges,
they cannot respond to additional input nor allow time to pass and so are terminal.
Entering such a state would deadlock the TA, and verifying that no deadlock occurs (i.e.
A[] not deadlock) would normally be sufficient to show that the inputs to a design
meet timing constraints. Unfortunately, this form of deadlock detection is not useful for
our purposes, since “good” deadlock also occurs when the sequence of user-defined
inputs has been exhausted and no more cells can progress. Instead, we automatically
generate an UPPAAL verification query that checks that it is impossible to reach any
error state in the network:
A[] not (c0.C_err_a_1 || c0.C_err_a_11 || c0.C_err_a_16 ||
...18 more lines...
c_inv0.C_INV_err_b_8 || c_inv0.C_INV_err_b_9 ||
s0.S_err_a_1 || s0.S_err_a_2 || jtl0.JTL_err_a_1 ||
jtl0.JTL_err_a_2 || s1.S_err_a_1 || s1.S_err_a_2)
UPPAAL explores the same number of states as for Query 1 in under one second
for all basic cells, with the larger designs similarly encountering exponential blowup
difficulties. If the above property is not satisfied, UPPAAL will return a trace showing
the path that led to the particular error state.
Likewise, here is the query we automatically generate for the 8-input bitonic sorter:
A[] not
(jtl0.JTL_err_a_1 || jtl0.JTL_err_a_2 ||
jtl1.JTL_err_a_1 || jtl1.JTL_err_a_2 ||
c0.C_err_a_1 || c0.C_err_a_11 ||
c0.C_err_a_16 || c0.C_err_a_21 ||
c0.C_err_a_26 || c0.C_err_a_6 ||
...
s47.S_err_a_1 || s47.S_err_a_2)
As of this writing, additional properties must be explicitly written out in UPPAAL’s
DSL for expressing TCTL formulas. As far as we know, we are the first to use timed
automata-based model checking to check the correctness of SFQ circuits.
5.6 Related Work
Existing HDLs Existing HDLs like Verilog [106] model the timing constraints of
SCE by coupling asynchronously-updated registers to latch incoming signals with a complicated series of conditionals to track whether these constraints are satisfied [184, 166, 185, 186]. Designs using this approach have many downsides:
• They tend to be extremely verbose, spanning tens to hundreds of lines per cell module; for example, in [187], 90 lines of code were needed to model a destructive readout (DRO) cell, while the PyLSE Machine equivalent takes four lines. Similarly, a model of the OR cell in [154] takes 18 lines of Verilog.
• A number of ambiguous internal signals must be generated for synchronization purposes; for example, for the implementation of said DRO cell, five edge-triggered always blocks and three artificial synchronization signals were required.
• There are no clear boundaries between functional and timing specification, leading
to obfuscated code and an enlarged surface for programming bugs.
• They rely on the peculiar semantics of Verilog or the chosen simulator, instead of
being based on a suitable formal foundation.
Recent approaches [188, 189] are more modular and compact, but the resemblance
of their proposed coding scheme to multithreaded socket programming raises the
barrier to entry and again makes them more prone to bugs. Finally, [190, 191, 192, 175]
automatically extract state machine models and timing characteristics of SFQ cells from
SPICE files, but in the end, still use them to generate Verilog HDL code that must be
integrated with the rest of the user-coded design.
Verification There have been many attempts at formally checking the correctness
of SCE designs at the HDL level. Recent work [170] uses a delay-based time frame
model, which assumes that pulses arrive periodically according to a unique clock period.
This assumption allows them to discretize the behavior of these pulse-based systems
into a verifiable synchronous model. PyLSE instead makes no requirements about
clock periodicity and is able to model systems that include asynchronous cells or no
clock. VeriSFQ [193] is a semi-formal verification framework that uses UVM [194] to
check that their designs are properly path-balanced, have correct fanout, and that all
synchronous SFQ gates have a clock signal. PyLSE, on the other hand, is an entirely
new DSL for SCE design, statically preventing the creation of designs with these basic
issues, and so a formal framework for checking them is unneeded. qMC [154] develops
a framework that uses SMT-based model checkers to check the correct functionality of
post-synthesis netlists via SystemVerilog assertions. However, their gate models do
not include information on hold or setup times or propagation delay, such that they
assume that pulses are delayed by one cycle. PyLSE instead represents these timing constraints and model checks against them via a Timed Automata-based model checker like UPPAAL.
5.7 Conclusion
We present PyLSE, a language for the design and simulation of pulse-based sys-
tems like superconductor electronics (SCE). PyLSE simplifies the process of precisely
defining the functional and timing behavior of SCE cells using a new transition-system-based abstraction we call the PyLSE Machine. It facilitates a multi-level approach to
design, making it easy to compose basic cells alongside abstract design models and
create large, scalable SCE systems. As an alternative to Verilog, we argue that it is
Figure 5.22: PyLSE’s place in the superconductor electronics design flow. PyLSE is used to model, simulate, and verify SFQ cells at the Behavioral Level, using timing information from the Josephson Junction Level, similar to approaches using Verilog and System Verilog.
an effective behavioral-level modeling framework that can fit within SCE design flow
(see Figure 5.22). We evaluate PyLSE by simulating and dynamically checking the
correctness of 22 different designs, comparing these simulations against analog SPICE
models, and verifying their timing constraints using the UPPAAL model checker. Compared to SPICE, PyLSE designs take 16.6× fewer lines of code and several orders
of magnitude less time to simulate, all while maintaining the needed level of timing
accuracy. Compared with specification directly as Timed Automata, PyLSE requires
18.9× fewer states and 9.0× fewer transitions. We believe, with the end of traditional
transistor scaling, pulse-based logic systems will only continue to grow in importance.
PyLSE, with its expressive timing, composable abstractions, and connection to well-understood theory, has the potential to provide a new foundation for that growth for
years to come.
Chapter 6
Conclusions and Future Work
The correctness of the programs we write ultimately depends on the correctness of the
machines on which they run. In this thesis, I have shown that programming language
principles can guide the design of more precise, verifiably correct, and secure ISAs and
HDLs, resulting in hardware with better correctness guarantees. In summary:
1. Provable correctness and ease-of-reasoning deserve a place alongside performance
and efficiency as first-class goals in the computer architecture design space.
2. Programming language principles that have traditionally benefited the software
world can be applied toward and improve hardware design and IP integration.
3. Emerging technologies like SCE should be modeled and designed using new
HDLs based on mathematical formalism like automata theory.
This work can be extended in the future in many ways. Practically, Zarf might find
more widespread use if an operating system was developed for it; the process of making
one may reveal fundamental aspects that need to change to make it POSIX-compliant,
for example. In addition, I want to formally verify that the Zarf microarchitecture
implements the ISA and apply additional programming language techniques like linear
types towards reducing Zarf’s dependence on garbage collection. At the HDL level, I
want to integrate Wire Sorts into other HDLs, like Verilog or Chisel, and experiment
with applying more type-theoretic approaches to improving the composability of IP.
My experience using and augmenting PyRTL has given me many ideas toward new and
improved HDL design, in addition to a list of improvements I’d like to make in PyRTL
itself. Finally, I want to further develop the PyLSE ecosystem so that it can integrate
more fully with existing EDA workflows and become a transformative and impactful
language in the SCE landscape.
Appendix A
Zarf and Bouncer
A.1 Small-Step Semantics
To create an accurate Coq interpreter that behaves similarly to our ISA’s implementation, we use an abstract machine small-step semantics description of Zarf’s functional ISA, as described in Chapter 2. In the following sections we define its abstract syntax, domains, and transition rules.
n ∈ Z    PC ∈ N

P ∈ Program ::= −→it −→pw
it ∈ ITable ::= FunTable n n PC | ConTable n
pw ∈ ProgramWord ::= let src n n | case src n | result src n |
                     patlit n n | patcons n n | arg src n
src ∈ Source ::= LiteralSrc | ArgSrc | LocalSrc | FieldSrc | ITableSrc
Figure A.1: Abstract syntax for the small-step semantics of Zarf’s functional ISA.
i, ι ∈ N    ⊕ ∈ PrimOp    θ, α, a ∈ Address = N

c ∈ Constructor = Con N −−−−−−−→Wrapper
t ∈ Thunk = Thunk N −−−−−−−→Wrapper
v ∈ Value = Constructor + Thunk + Z
ϵ, arg ∈ Wrapper = Address + Z
κ ∈ Kont = caseK Address Environment N PC +
           argsK −−−−−−−→Wrapper +
           updateK Address +
           rightK N Address O(Wrapper) +
           leftK N Address Wrapper
ρ ∈ Environment = N → Address
σ ∈ Store = Address → Value
Λ ∈ ITableMap = N → (ITable + PrimOp)
Π ∈ ProgramMap = N → ProgramWord
Figure A.2: Semantic domains for the small-step semantics of Zarf’s functional ISA.
A.1.1 Small-Step Abstract Syntax
Figure A.1 defines the abstract syntax for the small-step semantics of Zarf. A program
is composed of a list of information tables (itables), which are either functions or
constructors, and a list of 32-bit program words, which contain the instructions that
define the bodies of the function itables. Like the big-step abstract syntax presented
in Figure 2.2, there are similar constructs like the let, case, and result program words.
patlit and patcons are program words corresponding to branches, and one of more
arg words are used when the preceding instruction needs an argument (i.e. a function
call).
A.1.2 Small-Step Semantics Domains
Figure A.2 defines the domains over which the small-step semantics operate. All
functions and constructors (ITable) are uniquely numbered and put in an ITableMap. We
define various continuations (Kont) needed for tracking the current execution context
as well as the notion of an environment and store for use in the transition rules defined
below.
A.1.3 Small-Step State Transition Rules
Execution begins by setting the initial program state to initState. The itable map
(Λ) and program words map (Π) are also part of the program state but, since they do
not change during the course of execution, are elided from the state transition rules as
defined below for simplicity. Notably, the first itable in the program’s −→it list is always
assumed to be the program’s entry point (the main function) and is assigned itable
number 256 in the itable map (Λ); builtin functions have itable numbers 0 through
255, and user-defined functions are assigned unique numbers greater than 256. Upon
termination, the program’s final value will be found in the heap at address 0x0.
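Abstractly, execution iterates the transition function F from initState until no rule applies. As a hypothetical sketch of that driver loop (run and step are invented names for exposition; the actual artifact is the Coq interpreter), the shape is:

```python
def run(step, init_state):
    """Iterate a small-step transition function until it reports no
    successor (a terminal state), returning the final state."""
    state = init_state
    while True:
        nxt = step(state)
        if nxt is None:        # no rule applies: execution has halted
            return state
        state = nxt

# Toy example: a countdown "machine" whose state is a single number.
final = run(lambda n: n - 1 if n > 0 else None, 5)
assert final == 0
```

In the real semantics, state is the full tuple of PC, evaluation object, environment, store, and continuation stack, and step dispatches on which rule's premises hold.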
The following table formally defines the transition function F. Each rule describes a transition relation from one state to the next.
A.1.4 Small-Step Semantics Helper Functions
The following describes several helper functions used in the state transition rules
above. We use the notation −−→arg[i] to represent getting the ith element of the −−→arg list.
The notation J⊕, nK and J⊕, n1, n2K represents performing a unary or binary primitive
operation ⊕ on one or two arguments, respectively. The function builtins is a map
that associates ITable numbers (less than 256) with primitive operations.
Table A.1: Small-step state transition rules of Zarf’s functional ISA.

[Table rows: each rule (LetPrim, StaticApp, LetAssign, ThunkApp, Case, Result, PatLit1, PatLit2, PatCons1, PatCons2, ThunkUnd1, ThunkUnd2, ThunkOver, ThunkPrim1, ThunkPrim2, ThunkCons, ThunkExact, EndCaseK, LeftK, RightK1, RightK2, Update1, Update2) lists its premises and the resulting PCnew, O(θnew), ϵnew, ρnew, ιnew, σnew, αnew, and κ components of the successor state; the per-rule entries are not legible in this copy.]
203
getArgs

getArgs returns a list of the n argument words found starting at address PC.

getArgs ∈ PC × N → −−−−→Wrapper

getArgs(PC, n) =
    []                             if n = 0
    arg :: getArgs(PC + 1, n − 1)  otherwise
where
    arg src i = Π(PC)
    arg = getSrc(src, i)
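The recursion above can be modeled concretely. In this sketch, the program Π is a dict from addresses to decoded `arg` words `(src, i)`, and getSrc is passed in as a function — both simplifying assumptions:

```python
def get_args(prog, pc, n, get_src):
    """Return the n argument words found starting at address pc."""
    if n == 0:
        return []
    src, i = prog[pc]          # models: arg src i = Pi(PC)
    arg = get_src(src, i)      # models: arg = getSrc(src, i)
    return [arg] + get_args(prog, pc + 1, n - 1, get_src)
```

With two literal argument words at addresses 10 and 11, `get_args(prog, 10, 2, ...)` collects both.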
getSrc

getSrc interprets the Source encoding to get a literal value stored in i, a local variable stored in the current environment, a thunk argument, or a constructor field.

getSrc ∈ Source × N → Wrapper

getSrc(src, i) =
    i          if src = LiteralSrc
    −−→arg[i]  if src = ArgSrc, Thunk −−→arg = θ
    ρ(i)       if src = LocalSrc
    −−→arg[i]  if src = FieldSrc, Con −−→arg = ϵ
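The four-way dispatch can be sketched in Python; the string tags and the way θ, ρ, and ϵ are threaded in as plain arguments are assumptions of this sketch, not the hardware encoding:

```python
def get_src(src, i, theta_args, rho, eps_fields):
    """Dispatch on the Source tag, mirroring getSrc's four cases."""
    if src == "LiteralSrc":
        return i                  # the literal value itself
    if src == "ArgSrc":
        return theta_args[i]      # i-th argument of the current thunk theta
    if src == "LocalSrc":
        return rho[i]             # local variable in the environment
    if src == "FieldSrc":
        return eps_fields[i]      # i-th field of the constructor in eps
    raise ValueError(f"unknown Source tag: {src}")
```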
arity
arity returns the arity of primitive operations, constructors, and user-defined functions, based on a given ITable number.
arity ∈ N → N

arity(i) =
    i₁ if i < 256
    i₂ otherwise
where
    i₁ =
        0 if builtins(i) ∈ {error}
        1 if builtins(i) ∈ {¬, getint}
        2 if builtins(i) ∈ {+, −, ×, ÷, =, ≤, ∧, ∨, ∧̄, ∨̄, ˆ, ≪, ≫, ≫, <, putint}
    i₂ =
        n if ConTable n = Λ(i)
        n if FunTable n = Λ(i)
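A Python sketch of this lookup: numbers below 256 index the builtins, everything else indexes the Λ tables. The sample builtin names and table contents here are illustrative, not the ISA's actual tables:

```python
# Assumed arities for a few builtins (the real set is fixed by the ISA).
BUILTIN_ARITY = {"error": 0, "not": 1, "getint": 1, "add": 2, "putint": 2}

def arity(i, builtins, itable):
    """ITable numbers below 256 are builtins; the rest index Lambda(i)."""
    if i < 256:
        return BUILTIN_ARITY[builtins[i]]
    kind, n = itable[i]        # models: ConTable n = Lambda(i) or FunTable n = Lambda(i)
    return n
```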
η
η is a convenience function for getting the value associated with the current evaluation object. If it is a number, it simply returns the number; otherwise it looks up the address in the store.
η ∈ Wrapper → Value
η(ϵ) =
n if n = ϵ
σ(a) if a = ϵ
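A minimal model of η: wrapper values are either numbers or addresses, and the two are distinguished here by a small `Addr` tag type — an assumption of this sketch, since the ISA distinguishes them by encoding:

```python
class Addr(int):
    """Tag type marking store addresses, to tell them apart from numbers."""

def eta(eps, store):
    """A number evaluates to itself; an address is looked up in the store."""
    return store[eps] if isinstance(eps, Addr) else eps
```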
initState
initState creates the initial state used for beginning execution. The store is initialized with a mapping from address 0x0 to the thunk representing the main function, and the next usable store address is set to 0x1.
c ∈ Constructor = Con Name −−−−→Wrapper
t ∈ Thunk = Thunk Operator −−−−→Wrapper
v ∈ Value = Z + Constructor + Thunk
a ∈ Address = N
w ∈ Wrapper = Address + Z
ρ ∈ Environment = Variable → Wrapper
σ ∈ Store = Address → Value

Figure A.3: Semantic domains for the big-step semantics
initState ∈ ⟨N, O(Address), Wrapper, Environment, N, Store, Address, −−−→Kont⟩
initState = ⟨0, •, 0x0, {}, 0, {0x0 ↦ Thunk 256 []}, 0x1, []⟩
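The initial state tuple can be transcribed directly. Field names below are descriptive labels for the eight components, an assumption of this sketch:

```python
from collections import namedtuple

State = namedtuple("State", "pc theta eps rho iota sigma alpha kont")
Thunk = namedtuple("Thunk", "op args")

# Mirrors: initState = <0, bullet, 0x0, {}, 0, {0x0 -> Thunk 256 []}, 0x1, []>
init_state = State(
    pc=0,
    theta=None,                   # bullet: no current thunk
    eps=0x0,                      # evaluation object: main's address
    rho={},
    iota=0,
    sigma={0x0: Thunk(256, [])},  # main's thunk; 256 is the first non-builtin ITable number
    alpha=0x1,                    # next usable store address
    kont=[],
)
```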
A.2 Big-Step Lazy Semantics for Typed Zarf
The following sections define the big-step lazy semantics for the type-extended Zarf
functional ISA found in Chapter 3.
A.2.1 Big-Step Dynamic Semantics Domains
Figure A.3 shows the semantic domains for the big-step lazy semantics of Zarf. There are three possible values: an integer, a constructor (i.e., a data type instance), and a thunk. The primary way in which these domains differ from the eager domains presented in Figure 2.5 is the use of the thunk, which stores a function and its arguments, unapplied, until needed by a case statement.
A.2.2 Big-Step Dynamic Semantics Rules
Figure A.4 defines the big-step lazy semantics of the Zarf ISA. The primary difference between these semantics and those found in Figure 2.5 is that in let instructions,
    ρ[x ↦ n], σ ⊢ e ⇓ (v, σ′)
    ──────────────────────────────── (let-int-1)
    ρ, σ ⊢ let x = n in e ⇓ (v, σ′)

    varLookup(ρ, σ, x₂) = n    ρ[x₁ ↦ n], σ ⊢ e ⇓ (v, σ′)
    ──────────────────────────────────────────────────── (let-int-2)
    ρ, σ ⊢ let x₁ = x₂ [] in e ⇓ (v, σ′)

    a is a fresh address    ρ′ = ρ[x ↦ a]    (w⃗, σ′) = getArgs(−−→arg, ρ′, σ)
    σ′′ = σ′[a ↦ Thunk op w⃗]    ρ′, σ′′ ⊢ e ⇓ (v, σ′′′)
    ──────────────────────────────────────────────────── (let-con-fun)
    ρ, σ ⊢ let x = op −−→arg in e ⇓ (v, σ′′′)

    ρ(x₂) = a    σ(a) = Con cn w⃗    ρ[x₁ ↦ a], σ ⊢ e ⇓ (v, σ′)
    ──────────────────────────────────────────────────── (let-var-1)
    ρ, σ ⊢ let x₁ = x₂ [] in e ⇓ (v, σ′)

    a is a fresh address    varLookup(ρ, σ, x₂) = Thunk op w⃗₁    ρ′ = ρ[x₁ ↦ a]
    (w⃗₂, σ′) = getArgs(−−→arg, ρ′, σ)    σ′′ = σ′[a ↦ Thunk op w⃗₁ ++ w⃗₂]    ρ′, σ′′ ⊢ e ⇓ (v, σ′′′)
    ──────────────────────────────────────────────────── (let-var-2)
    ρ, σ ⊢ let x₁ = x₂ −−→arg in e ⇓ (v, σ′′′)

    a = ρ(x)    valueOf(σ(a), σ) = (Con cn w⃗, σ′)    σ′′ = σ′[a ↦ Con cn w⃗]
    (cn x⃗ ⇒ e₁) ∈ −→br    ρ′ = ρ[⃗x ↦ w⃗]    ρ′, σ′′ ⊢ e₁ ⇓ (v, σ′′′)
    ──────────────────────────────────────────────────── (case-con)
    ρ, σ ⊢ case x of −→br else e₂ ⇓ (v, σ′′′)

    a = ρ(x)    valueOf(σ(a), σ) = (Con cn w⃗, σ′)    σ′′ = σ′[a ↦ Con cn w⃗]
    (cn x⃗ ⇒ e₁) ∉ −→br    ρ, σ′′ ⊢ e₂ ⇓ (v, σ′′′)
    ──────────────────────────────────────────────────── (case-con-else)
    ρ, σ ⊢ case x of −→br else e₂ ⇓ (v, σ′′′)

    ((n = ρ(x)) ∨ (a = ρ(x)    valueOf(σ(a), σ) = (n, σ′)    σ′′ = σ′[a ↦ n]))
    (n ⇒ e₁) ∈ −→br    ρ, σ′′ ⊢ e₁ ⇓ (v, σ′′′)
    ──────────────────────────────────────────────────── (case-lit)
    ρ, σ ⊢ case x of −→br else e₂ ⇓ (v, σ′′′)

    ((n = ρ(x)) ∨ (a = ρ(x)    valueOf(σ(a), σ) = (n, σ′)    σ′′ = σ′[a ↦ n]))
    (n ⇒ e₁) ∉ −→br    ρ, σ′′ ⊢ e₂ ⇓ (v, σ′′′)
    ──────────────────────────────────────────────────── (case-lit-else)
    ρ, σ ⊢ case x of −→br else e₂ ⇓ (v, σ′′′)

    v = varLookup(ρ, σ, x)
    ─────────────────────────── (result-1)
    ρ, σ ⊢ result x ⇓ (v, σ)

    ─────────────────────────── (result-2)
    ρ, σ ⊢ result n ⇓ (n, σ)

Figure A.4: Big-step lazy semantics of Zarf's functional ISA.
the function application is postponed by storing the function and argument identifiers in a thunk, which is evaluated as needed via the valueOf helper during case instructions. Note that when a case expression does not include an else expression (i.e., rules case-con and case-lit), the list of branches must contain a match (which is checked statically). Execution of these forms of case expressions proceeds like case-con-else and case-lit-else.
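The lazy discipline described above — postpone the application in a thunk, force it to a value only when a case inspects it, and write the result back to the store — can be modeled in a few lines. The tuple encoding of thunks and the `funs` table are assumptions of this sketch:

```python
def force(store, a, funs):
    """valueOf-style forcing: evaluate the postponed application at address a
    and overwrite the store entry with the result (like sigma'[a -> n])."""
    v = store[a]
    if isinstance(v, tuple) and v[0] == "thunk":
        _, fn, args = v
        store[a] = funs[fn](*args)    # evaluate the postponed application once
    return store[a]

funs = {"add": lambda x, y: x + y}
store = {0: ("thunk", "add", (2, 3))}
```

After `force(store, 0, funs)`, address 0 holds the number rather than the thunk, so later case instructions reuse the memoized result.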
A.2.3 Big-Step Dynamic Semantics Helper Functions
The following section defines the helper functions used in Figure A.4. We omit formalization of the helpers whose definitions are straightforward.
getArgs
getArgs folds over the list of arguments to return a list of the addresses associated with those arguments; each literal argument is assigned an address that maps to the semantic value of that literal in the newly returned store.
interpret
interpret(e, ρ, σ) is shorthand for ρ, σ ⊢ e, which evaluates to a (v, σ′) tuple per the
big-step operational rules listed in Figure A.4.
paramsOf
paramsOf consults the list of function declarations to extract the parameter names
of a function given its name.
bodyOf
bodyOf consults the list of function declarations to extract the body of a function
given its name.
arity
arity consults the list of constructor and function declarations and builtin operators
associated with a given name and returns the number of fields or parameters it accepts,
respectively.
eval
eval applies a thunk’s primitive operator (e.g. + or ∗) to the thunk’s numeric
arguments, returning a number and a new store (such that this new store remembers
the values of any postponed thunks that had to be evaluated during the primitive
application) as a result.
varLookup
varLookup looks up a variable in the environment, returning its entry if it maps to
an integer. If it maps to an address, it looks up the address in the store.
varLookup ∈ Environment × Store × Variable→ Value
varLookup(ρ, σ, x) =
n if ρ(x) = n
σ(a) if ρ(x) = a
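varLookup translates directly; in this sketch, addresses are modeled as strings so they are distinguishable from integer bindings (an assumption, since the ISA distinguishes them by encoding):

```python
def var_lookup(rho, sigma, x):
    """Return an integer binding directly, or dereference an address
    binding through the store."""
    w = rho[x]
    return sigma[w] if isinstance(w, str) else w
```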
valueOf
valueOf reduces a fully- or over-saturated thunk to weak-head normal form; for any
other value, it just returns that value.
valueOf ∈ Value × Store → Value × Store

valueOf(v, σ) =
    (n, σ)                   if v = n
    (Con cn w⃗, σ)            if v = Con cn w⃗
    (Thunk op w⃗, σ)          if v = Thunk op w⃗, |w⃗| < arity(op)
    overAppHelper(op, w⃗, σ)  if v = Thunk op w⃗, |w⃗| ≥ arity(op)
overAppHelper
overAppHelper reduces a thunk to a value by either evaluating a primitive operation
to an integer, creating a saturated constructor, or evaluating the thunk’s function body
to a value.
overAppHelper ∈ Operator × −−−−→Wrapper × Store → Value × Store

overAppHelper(op, w⃗, σ) =
    eval(⊕, w⃗, σ)  if op = ⊕, arity(⊕) = |w⃗|
    (Con cn w⃗, σ)  if op = cn, arity(cn) = |w⃗|
    v′             if op = fn
where
    (v, σ′) = interpret(bodyOf(fn), paramsOf(fn) zip w⃗, σ)
    v′ =
        valueOf(Thunk op (w⃗′ ++ (w⃗ drop arity(fn))), σ′)  if v = Thunk op w⃗′
        (v, σ′)                                            otherwise
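The over-application case — evaluate the function on its first arity(fn) arguments, then feed any leftover arguments to the resulting function value — can be sketched with plain Python callables (a simplification: real thunks carry argument lists, not closures):

```python
def apply_over(fn, fn_arity, args):
    """Apply fn to its first fn_arity arguments; if arguments remain, the
    result must itself be a function, which consumes the leftovers one by one."""
    res = fn(*args[:fn_arity])
    for extra in args[fn_arity:]:
        res = res(extra)
    return res

curried_add = lambda x: (lambda y: x + y)   # arity-1 function returning a function
```

Applying `curried_add` to two arguments at once over-saturates it, so the second argument is pushed into the returned closure.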
Bibliography
[1] J. McMahan, M. Christensen, L. Nichols, J. Roesch, S.-Y. Guo, B. Hardekopf, andT. Sherwood, An architecture supporting formal and compositional binary analysis, inProceedings of the Twenty-Second International Conference on Architectural Support forProgramming Languages and Operating Systems, ASPLOS ’17, (New York, NY,USA), p. 177–191, Association for Computing Machinery, 2017.
[2] J. McMahan, M. Christensen, L. Nichols, J. Roesch, S.-Y. Guo, B. Hardekopf, andT. Sherwood, An architecture for analysis, IEEE Micro 38 (2018), no. 3 107–115.
[3] M. Christensen, J. McMahan, L. Nichols, J. Roesch, T. Sherwood, andB. Hardekopf, Safe functional systems through integrity types and verified assembly,Theoretical Computer Science 851 (2021) 39–61.
[4] J. E. McMahan, The ZARF Architecture for Recursive Functions. PhD thesis, UCSanta Barbara, Santa Barbara, CA, June, 2019.
[5] J. McMahan, M. Christensen, K. Dewey, B. Hardekopf, and T. Sherwood,Bouncer: Static program analysis in hardware, in Proceedings of the 46th InternationalSymposium on Computer Architecture, ISCA ’19, (New York, NY, USA), p. 711–722,Association for Computing Machinery, 2019.
[6] M. Christensen, T. Sherwood, J. Balkind, and B. Hardekopf,Wire sorts: Alanguage abstraction for safe hardware composition, in Proceedings of the 42nd ACMSIGPLAN International Conference on Programming Language Design andImplementation, PLDI 2021, (New York, NY, USA), p. 175–189, Association forComputing Machinery, 2021.
[7] R. Mangharam, H. Abbas, M. Behl, K. Jang, M. Pajic, and Z. Jiang, Threechallenges in cyber-physical systems, in 2016 8th International Conference onCommunication Systems and Networks (COMSNETS), pp. 1–8, Jan, 2016.
[8] S. Shuja, S. K. Srinivasan, S. Jabeen, and D. Nawarathna, A formal verificationmethodology for ddd mode pacemaker control programs, Journal of Electrical andComputer Engineering (2015).
211
[9] J. M. Rushby, Proof of separability: A verification technique for a class of a securitykernels, in Proceedings of the 5th Colloquium on International Symposium onProgramming, (London, UK, UK), pp. 352–367, Springer-Verlag, 1982.
[10] N. Heintze and J. G. Riecke, The slam calculus: Programming with secrecy andintegrity, in Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium onPrinciples of Programming Languages, POPL ’98, (New York, NY, USA),pp. 365–377, ACM, 1998.
[11] M. Abadi, A. Banerjee, N. Heintze, and J. G. Riecke, A core calculus of dependency,in Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles ofProgramming Languages, POPL ’99, (New York, NY, USA), pp. 147–160, ACM,1999.
[12] D. Volpano, C. Irvine, and G. Smith, A sound type system for secure flow analysis, J.Comput. Secur. 4 (Jan., 1996) 167–187.
[13] D. E. Denning and P. J. Denning, Certification of programs for secure informationflow, Commun. ACM 20 (July, 1977) 504–513.
[14] J. A. Goguen and J. Meseguer, Security policies and security models, in Security andPrivacy, 1982 IEEE Symposium on, pp. 11–11, April, 1982.
[15] F. Pottier and V. Simonet, Information flow inference for ml, ACM Trans. Program.Lang. Syst. 25 (Jan., 2003) 117–158.
[16] A. Sabelfeld and A. C. Myers, Language-based information-flow security, IEEE J.Sel.A. Commun. 21 (Sept., 2006) 5–19.
[17] D. Terei, S. Marlow, S. Peyton Jones, and D. Mazières, Safe haskell, in Proceedingsof the 2012 Haskell Symposium, Haskell ’12, (New York, NY, USA), pp. 137–148,ACM, 2012.
[18] D. Yu, N. A. Hamid, and Z. Shao, Building certified libraries for pcc: dynamic storageallocation, in Proceedings of the 12th European conference on Programming,pp. 363–379, Springer-Verlag, 2003.
[19] A. Chlipala,Mostly-automated verification of low-level programs in computationalseparation logic, in Proceedings of the 32Nd ACM SIGPLAN Conference onProgramming Language Design and Implementation, PLDI ’11, (New York, NY,USA), pp. 234–245, ACM, 2011.
[20] J. Choi, M. Vijayaraghavan, B. Sherman, A. Chlipala, and Arvind, Kami: Aplatform for high-level parametric hardware specification and its modular verification,Proc. ACM Program. Lang. 1 (Aug., 2017).
212
[21] R. S. Boyer and Y. Yu, Automated correctness proofs of machine code programs for acommercial microprocessor, in Proceedings of the 11th International Conference onAutomated Deduction: Automated Deduction, CADE-11, (London, UK, UK),pp. 416–430, Springer-Verlag, 1992.
[22] N. G. Michael and A. W. Appel, Machine instruction syntax and semantics in higherorder logic, in Proceedings of the 17th International Conference on AutomatedDeduction, CADE-17, (London, UK, UK), pp. 7–24, Springer-Verlag, 2000.
[23] A. Kennedy, N. Benton, J. B. Jensen, and P.-E. Dagand, Coq: The world’s best macroassembler?, in Proceedings of the 15th Symposium on Principles and Practice ofDeclarative Programming, PPDP ’13, (New York, NY, USA), pp. 13–24, ACM, 2013.
[24] A. Fox and M. O. Myreen, A trustworthy monadic formalization of the armv7instruction set architecture, in Proceedings of the First International Conference onInteractive Theorem Proving, ITP’10, (Berlin, Heidelberg), pp. 243–258,Springer-Verlag, 2010.
[25] J. S. Moore, A mechanically verified language implementation, Journal of AutomatedReasoning 5 (1989), no. 4 461–492.
[26] W. A. Hunt Jr,Microprocessor design verification, Journal of Automated Reasoning 5(1989), no. 4 429–460.
[27] Journal of automated reasoning, 2003.
[28] G. C. Necula, Proof-carrying code. design and implementation. Springer, 2002.
[29] A. W. Appel, Foundational proof-carrying code, in Proceedings of the 16th AnnualIEEE Symposium on Logic in Computer Science, LICS ’01, (Washington, DC, USA),pp. 247–, IEEE Computer Society, 2001.
[30] J. Yang and C. Hawblitzel, Safe to the last instruction: Automated verification of atype-safe operating system, in Proceedings of the 31st ACM SIGPLAN Conference onProgramming Language Design and Implementation, PLDI ’10, (New York, NY,USA), pp. 99–110, ACM, 2010.
[31] T. Maeda and A. Yonezawa, Typed assembly language for implementing os kernels insmp/multi-core environments with interrupts, in Proceedings of the 5th InternationalConference on Systems Software Verification, SSV’10, (Berkeley, CA, USA), pp. 1–1,USENIX Association, 2010.
[32] M. Barnett, B.-Y. E. Chang, R. DeLine, B. Jacobs, and K. R. M. Leino, Boogie: Amodular reusable verifier for object-oriented programs, in Proceedings of the 4thInternational Conference on Formal Methods for Components and Objects, FMCO’05,(Berlin, Heidelberg), pp. 364–387, Springer-Verlag, 2006.
213
[33] H. Xi and R. Harper, A dependently typed assembly language, in Proceedings of theSixth ACM SIGPLAN International Conference on Functional Programming, ICFP ’01,(New York, NY, USA), pp. 169–180, ACM, 2001.
[34] A. Chlipala, A verified compiler for an impure functional language, in Proceedings ofthe 37th Annual ACM SIGPLAN-SIGACT Symposium on Principles of ProgrammingLanguages, POPL ’10, (New York, NY, USA), pp. 93–106, ACM, 2010.
[35] G. C. Necula, Translation validation for an optimizing compiler, in Proceedings of theACM SIGPLAN 2000 Conference on Programming Language Design andImplementation, PLDI ’00, (New York, NY, USA), pp. 83–94, ACM, 2000.
[36] P. Curzon and P. Curzon, A verified compiler for a structured assembly language, inIn proceedings of the 1991 international workshop on the HOL theorem Proving Systemand its applications. IEEE Computer, pp. 253–262, Society Press, 1991.
[37] M. Strecker, Formal verification of a java compiler in isabelle, in AutomatedDeduction—CADE-18, pp. 63–77. Springer, 2002.
[38] X. Leroy, A formally verified compiler back-end, Journal of Automated Reasoning 43(2009), no. 4 363–446.
[39] A. W. Appel, Verified software toolchain, in Proceedings of the 20th EuropeanConference on Programming Languages and Systems: Part of the Joint EuropeanConferences on Theory and Practice of Software, ESOP’11/ETAPS’11, (Berlin,Heidelberg), pp. 1–17, Springer-Verlag, 2011.
[40] G. Neis, C.-K. Hur, J.-O. Kaiser, C. McLaughlin, D. Dreyer, and V. Vafeiadis,Pilsner: A compositionally verified compiler for a higher-order imperative language, inProceedings of the 20th ACM SIGPLAN International Conference on FunctionalProgramming, ICFP 2015, (New York, NY, USA), pp. 166–178, ACM, 2015.
[41] T. Ramananandro, Z. Shao, S.-C. Weng, J. Koenig, and Y. Fu, A compositionalsemantics for verified separate compilation and linking, in Proceedings of the 2015Conference on Certified Programs and Proofs, CPP ’15, (New York, NY, USA),pp. 3–14, ACM, 2015.
[42] Z. Jiang, M. Pajic, and R. Mangharam, Cyber-physical modeling of implantablecardiac medical devices, Proceedings of the IEEE 100 (Jan, 2012) 122–137.
[43] A. O. Gomes and M. V. M. Oliveira, Formal specification of a cardiac pacing system,in FM 2009: Formal Methods (A. Cavalcanti and D. R. Dams, eds.), (Berlin,Heidelberg), pp. 692–707, Springer Berlin Heidelberg, 2009.
[44] T. Chen, M. Diciolla, M. Kwiatkowska, and A. Mereacre, Quantitative verificationof implantable cardiac pacemakers, in Real-Time Systems Symposium (RTSS), 2012IEEE 33rd, pp. 263–272, IEEE, 2012.
214
[45] L. Cordeiro, B. Fischer, H. Chen, and J. Marques-Silva, Semiformal verification ofembedded software in medical devices considering stringent hardware constraints, in2009 International Conference on Embedded Software and Systems, pp. 396–403, May,2009.
[46] P. J. Landin, The Mechanical Evaluation of Expressions, The Computer Journal 6 (Jan.,1964) 308–320.
[47] B. Graham, Secd: Design issues, tech. rep., University of Calgary, 1989.
[48] T. J. Clarke, P. J. Gladstone, C. D. MacLean, and A. C. Norman, Skim - the s, k, ireduction machine, in Proceedings of the 1980 ACM Conference on LISP andFunctional Programming, LFP ’80, (New York, NY, USA), pp. 128–135, ACM, 1980.
[49] L. P. Deutsch, A lisp machine with very compact programs, in Proceedings of the 3rdinternational joint conference on Artificial intelligence, pp. 697–703, MorganKaufmann Publishers Inc., 1973.
[50] P. M. Kogge, "The Architecture of Symbolic Computers". McGraw-Hill, Inc., NewYork, New York, 1991.
[51] T. F. Knight, Implementation of a list processing machine. PhD thesis, MassachusettsInstitute of Technology, 1979.
[52] J. M. McCune, B. J. Parno, A. Perrig, M. K. Reiter, and H. Isozaki, Flicker: Anexecution infrastructure for tcb minimization, SIGOPS Oper. Syst. Rev. 42 (Apr.,2008) 315–328.
[53] E. Keller, J. Szefer, J. Rexford, and R. B. Lee, Nohype: Virtualized cloudinfrastructure without the virtualization, SIGARCH Comput. Archit. News 38 (June,2010) 350–361.
[54] D. Halperin, T. S. Heydt-Benjamin, B. Ransford, S. S. Clark, B. Defend,W. Morgan, K. Fu, T. Kohno, and W. H. Maisel, Pacemakers and implantable cardiacdefibrillators: Software radio attacks and zero-power defenses, in 2008 IEEE Symposiumon Security and Privacy (sp 2008), pp. 129–142, IEEE, 2008.
[55] S. Gollakota, H. Hassanieh, B. Ransford, D. Katabi, and K. Fu, They can hear yourheartbeats: non-invasive security for implantable medical devices, in Proc. ACM Conf.SIGCOMM, pp. 2–13, 2011.
[56] T. Denning, K. Fu, and T. Kohno, Absence makes the heart grow fonder: Newdirections for implantable medical device security, in Proceedings of the 3rd Conferenceon Hot Topics in Security, HOTSEC’08, (Berkeley, CA, USA), pp. 5:1–5:7, USENIXAssociation, 2008.
215
[57] B. Hardekopf and C. Lin, Flow-sensitive pointer analysis for millions of lines of code,in Proceedings of the 9th Annual IEEE/ACM International Symposium on CodeGeneration and Optimization, CGO ’11, (Washington, DC, USA), pp. 289–298,IEEE Computer Society, 2011.
[58] E. Moggi, Notions of computation and monads, Information and computation 93(1991), no. 1 55–92.
[59] P. Hudak, J. Hughes, S. Peyton Jones, and P. Wadler, A history of haskell: being lazywith class, in Proceedings of the third ACM SIGPLAN conference on History ofprogramming languages, pp. 12–1, ACM, 2007.
[60] R. Hindley, The principal type-scheme of an object in combinatory logic, Transactions ofthe American Mathematical Society 146 (1969) 29–60.
[61] R. Milner, A theory of type polymorphism in programming, Journal of Computer andSystem Sciences 17 (1978) 348–375.
[62] M. E. Conway, Design of a separable transition-diagram compiler, Commun. ACM 6(July, 1963) 396–408.
[63] A. L. D. Moura and R. Ierusalimschy, Revisiting coroutines, ACM Trans. Program.Lang. Syst. 31 (Feb., 2009) 6:1–6:31.
[64] S. J. Connolly, M. Gent, R. S. Roberts, P. Dorian, D. Roy, R. S. Sheldon, L. B.Mitchell, M. S. Green, G. J. Klein, and B. O’Brien, Canadian implantable defibrillatorstudy (cids), Circulation 101 (2000), no. 11 1297–1302,[http://circ.ahajournals.org/content/101/11/1297.full.pdf].
[65] T. A. versus Implantable Defibrillators (AVID) Investigators, A comparison ofantiarrhythmic-drug therapy with implantable defibrillators in patients resuscitatedfrom near-fatal ventricular arrhythmias, New England Journal of Medicine 337 (1997),no. 22 1576–1584, [http://dx.doi.org/10.1056/NEJM199711273372202]. PMID:9411221.
[66] J. Siebels, K.-H. Kuck, and C. Investigators, Implantable cardioverter defibrillatorcompared with antiarrhythmic drug treatment in cardiac arrest survivors (the cardiacarrest study hamburg), American Heart Journal 127 (April, 1994) 1139–1144.
[67] “Living with your implantable cardioverter defibrillator (ICD).”http://www.heart.org/HEARTORG/Conditions/Arrhythmia/PreventionTreatmentofArrhythmia/Living-With-Your-Implantable-Cardioverter-Defibrillator-ICD_UCM_448462_Article.jsp, 09, 2016.
216
[68] “How many people have ICDs?.” http://asktheicd.com/tile/106/english-implantable-cardioverter-defibrillator-icd/how-many-people-have-icds/, Accessed October 24, 2019.
[69] J. Pan and W. J. Tompkins, A real-time qrs detection algorithm, IEEE Transactions onBiomedical Engineering BME-32 (March, 1985) 230–236.
[70] R. A. Álvarez, A. J. M. Penín, and X. A. V. Sobrino, A comparison of three qrsdetection algorithms over a public database, Procedia Technology 9 (2013) 1159 – 1165.CENTERIS 2013 - Conference on ENTERprise Information Systems / ProjMAN2013 - International Conference on Project MANagement/ HCIST 2013 -International Conference on Health and Social Care Information Systems andTechnologies.
[71] “Open source ECG analysis software.”http://www.eplimited.com/confirmation.htm, Accessed October 24, 2019.
[72] M. S. Wathen, P. J. DeGroot, M. O. Sweeney, A. J. Stark, M. F. Otterness, W. O.Adkisson, R. C. Canby, K. Khalighi, C. Machado, D. S. Rubenstein, and K. J.Volosin, Prospective randomized multicenter trial of empirical antitachycardia pacingversus shocks for spontaneous rapid ventricular tachycardia in patients with implantablecardioverter-defibrillators, Circulation 110 (2004), no. 17 2591–2596,[http://circ.ahajournals.org/content/110/17/2591.full.pdf].
[73] “The Coq proof assistant.” https://coq.inria.fr, Accessed October 24, 2019.
[74] V. Kashyap, B. Wiedermann, and B. Hardekopf, Timing- and termination-sensitivesecure information flow: Exploring a new approach, in 2011 IEEE Symposium onSecurity and Privacy, pp. 413–428, May, 2011.
[75] V. Simonet, Fine-grained information flow analysis for a λ calculus with sum types, inProceedings of the 15th IEEE Workshop on Computer Security Foundations, CSFW ’02,(Washington, DC, USA), pp. 223–, IEEE Computer Society, 2002.
[76] D. Ricketts, G. Malecha, M. M. Alvarez, V. Gowda, and S. Lerner, Towardsverification of hybrid systems in a foundational proof assistant, in 2015 ACM/IEEEInternational Conference on Formal Methods and Models for Codesign(MEMOCODE), pp. 248–257, 2015.
[77] Z. Cutlip, “Dlink dir-815 upnp command injection.” http://shadow-file.blogspot.com/2013/02/dlink-dir-815-upnp-command-injection.html, February, 2013.[Online; accessed 01-November-2017].
[78] A. Cui, M. Costello, and S. J. Stolfo, When firmware modification attack: A case studyof embedded exploitation, in NDSS Symposium ’13, 2013.
217
[79] G. Hernandez, O. Arias, D. Buentello, and Y. Jin, Smart nest thermostat: A smartspy in your home, in Black Hat Briefings, 2014.
[80] G. Morrisett, D. Walker, K. Crary, and N. Glew, From System F to Typed AssemblyLanguage, ACM Trans. Program. Lang. Syst. 21 (May, 1999) 527–568.
[81] K. Crary, N. Glew, D. Grossman, R. Samuels, F. Smith, D. Walker, S. Weirich, andS. Zdancewic, TALx86: A realistic typed assembly language, in 1999 ACM SIGPLANWorkshop on Compiler Support for System Software Atlanta, GA, USA, pp. 25–35,1999.
[82] E. Buchanan, R. Roemer, and S. Savage, “Return-oriented programming:Exploits without code injection.” https://hovav.net/ucsd/talks/blackhat08.html.
[83] K. Dewey, J. Roesch, and B. Hardekopf, Fuzzing the rust typechecker using clp, inProceedings of the 2015 30th IEEE/ACM International Conference on AutomatedSoftware Engineering (ASE), ASE ’15, (Washington, DC, USA), pp. 482–493, IEEEComputer Society, 2015.
[84] L. De Moura and N. Bjørner, Z3: An efficient smt solver, in Proceedings of the Theoryand Practice of Software, 14th International Conference on Tools and Algorithms for theConstruction and Analysis of Systems, TACAS’08/ETAPS’08, (Berlin, Heidelberg),pp. 337–340, Springer-Verlag, 2008.
[85] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B.Brown, Mibench: A free, commercially representative embedded benchmark suite, inProceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE InternationalWorkshop, WWC ’01, (Washington, DC, USA), pp. 3–14, IEEE Computer Society,2001.
[86] H. Xi and R. Harper, A Dependently Typed Assembly Language, in Proceedings of theSixth ACM SIGPLAN International Conference on Functional Programming, ICFP ’01,(New York, NY, USA), pp. 169–180, ACM, 2001.
[87] G. Morrisett, K. Crary, N. Glew, and D. Walker, Stack-based typed assemblylanguage, Journal of Functional Programming 12 (Jan., 2002).
[88] K. Crary, Toward a Foundational Typed Assembly Language, in Proceedings of the 30thACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages,POPL ’03, (New York, NY, USA), pp. 198–212, ACM, 2003.
[89] A. Azevedo de Amorim, N. Collins, A. DeHon, D. Demange, C. Hriţcu,D. Pichardie, B. C. Pierce, R. Pollack, and A. Tolmach, A Verified Information-flowArchitecture, in Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium onPrinciples of Programming Languages, POPL ’14, (New York, NY, USA),pp. 165–178, ACM, 2014.
218
[90] G. Balakrishnan, R. Gruian, T. Reps, and T. Teitelbaum, CodeSurfer/x86—APlatform for Analyzing x86 Executables, in Compiler Construction, Lecture Notes inComputer Science, pp. 250–254, Springer, Berlin, Heidelberg, Apr., 2005.
[91] J. Lee, T. Avgerinos, and D. Brumley, Tie: Principled reverse engineering of types inbinary programs, in In Proceedings of the Network and Distributed System SecuritySymposium, 2011.
[92] M. Noonan, A. Loginov, and D. Cok, Polymorphic Type Inference for Machine Code,arXiv:1603.05495 [cs] (Mar., 2016). arXiv: 1603.05495.
[93] J. Caballero and Z. Lin, Type Inference on Executables, ACM Comput. Surv. 48 (May,2016) 65:1–65:35.
[94] Z. Chen, Java card technology for smart cards: architecture and programmer’s guide.Addison-Wesley Professional, 2000.
[95] W. Coekaerts, “The java typesystem is broken.”http://wouter.coekaerts.be/2018/java-type-system-broken.
[96] N. Amin and R. Tate, Java and scala’s type systems are unsound: The existential crisisof null pointers, in Proceedings of the 2016 ACM SIGPLAN International Conferenceon Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA2016, (New York, NY, USA), pp. 838–848, ACM, 2016.
[97] R. Grigore, Java generics are turing complete, CoRR abs/1605.05274 (2016)[arXiv:1605.0527].
[98] K. Zhai, R. Townsend, L. Lairmore, M. A. Kim, and S. A. Edwards, Hardwaresynthesis from a recursive functional language, in Proceedings of the 10th InternationalConference on Hardware/Software Codesign and System Synthesis, CODES ’15,(Piscataway, NJ, USA), pp. 83–93, IEEE Press, 2015.
[99] R. Townsend, M. A. Kim, and S. A. Edwards, From functional programs to pipelineddataflow circuits, in Proceedings of the 26th International Conference on CompilerConstruction, CC 2017, (New York, NY, USA), pp. 76–86, ACM, 2017.
[100] S. Nagarakatte, J. Zhao, M. M. Martin, and S. Zdancewic, Softbound: Highlycompatible and complete spatial memory safety for c, in Proceedings of the 30th ACMSIGPLAN Conference on Programming Language Design and Implementation, PLDI’09, (New York, NY, USA), pp. 245–258, ACM, 2009.
[101] J. Devietti, C. Blundell, M. M. K. Martin, and S. Zdancewic, Hardbound:Architectural support for spatial safety of the c programming language, in Proceedingsof the 13th International Conference on Architectural Support for ProgrammingLanguages and Operating Systems, ASPLOS XIII, (New York, NY, USA),pp. 103–114, ACM, 2008.
219
[102] S. Nagarakatte, M. M. K. Martin, and S. Zdancewic,Watchdog: Hardware for safeand secure manual memory management and full memory safety, in Proceedings of the39th Annual International Symposium on Computer Architecture, ISCA ’12,(Washington, DC, USA), pp. 189–200, IEEE Computer Society, 2012.
[103] R. Zhang, N. Stanley, C. Griggs, A. Chi, and C. Sturton, Identifying security criticalproperties for the dynamic verification of a processor, in Proceedings of the 22ndInternational Conference on Architectural Support for Programming Languages andOperating Systems (ASPLOS), ACM, 2017.
[104] H. Cherupalli, H. Duwe, W. Ye, R. Kumar, and J. Sartori, Software-based gate-levelinformation flow security for iot systems, in Proceedings of the 50th AnnualIEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, (NewYork, NY, USA), pp. 328–340, ACM, 2017.
[105] R. S. Chakraborty and S. Bhunia, Harpoon: An obfuscation-based soc designmethodology for hardware protection, IEEE Transactions on Computer-Aided Design ofIntegrated Circuits and Systems 28 (2009), no. 10 1493–1502.
[106] Ieee standard for verilog hardware description language, IEEE Std 1364-2005 (Revisionof IEEE Std 1364-2001) (2006) 1–590.
[107] W. Snyder, P. Wasson, and D. Galbi, “Verilator-convert Verilog code toC++/SystemC.” http://www.veripool.org/wiki/verilator, 2020.
[108] C. Wolf, “Yosys open synthesis suite.” http://www.clifford.at/yosys/.
[109] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek,and K. Asanović, Chisel: Constructing hardware in a scala embedded language, inProceedings of the 49th Annual Design Automation Conference, DAC ’12, (New York,NY, USA), p. 1216–1225, Association for Computing Machinery, 2012.
[110] J. Clow, G. Tzimpragos, D. Dangwal, S. Guo, J. McMahan, and T. Sherwood, Apythonic approach for rapid hardware prototyping and instrumentation, in 2017 27thInternational Conference on Field Programmable Logic and Applications (FPL),(Ghent, Belgium), pp. 1–7, IEEE, 2017.
[111] D. Dangwal, G. Tzimpragos, and T. Sherwood, Agile hardware development andinstrumentation with PyRTL, IEEE Micro 40 (2020), no. 4 76–84.
[112] IEEE standard for SystemVerilog–unified hardware design, specification, and verificationlanguage, IEEE Std 1800-2017 (Revision of IEEE Std 1800-2012) (2018) 1–1315.
[113] IEEE standard for property specification language (PSL), IEEE Std 1850-2010(Revision of IEEE Std 1850-2005) (2010) 1–182.
220
[114] D. Geist, The PSL/Sugar specification language: a language for all seasons, in Correct Hardware Design and Verification Methods (D. Geist and E. Tronci, eds.), (Berlin, Heidelberg), pp. 3–3, Springer Berlin Heidelberg, 2003.
[115] O. W. Group, Accellera standard OVL v2 library reference manual, tech. rep., Accellera Systems Initiative, 2014.
[116] M. Sheeran, muFP, a language for VLSI design, in Proceedings of the 1984 ACM Symposium on LISP and Functional Programming, LFP ’84, (New York, NY, USA), pp. 104–112, Association for Computing Machinery, 1984.
[117] K. Claessen, Embedded Languages for Describing and Verifying Hardware. PhD thesis, Chalmers University of Technology and Göteborg University, 2001.
[118] A. Gill, T. Bull, A. Farmer, G. Kimmell, and E. Komp, Types and type families for hardware simulation and synthesis, in Trends in Functional Programming (R. Page, Z. Horváth, and V. Zsók, eds.), (Berlin, Heidelberg), pp. 118–133, Springer Berlin Heidelberg, 2011.
[119] P. Bjesse, K. Claessen, M. Sheeran, and S. Singh, Lava: Hardware design in Haskell, in Proceedings of the Third ACM SIGPLAN International Conference on Functional Programming, ICFP ’98, (New York, NY, USA), pp. 174–184, Association for Computing Machinery, 1998.
[120] A. Gill, T. Bull, G. Kimmell, E. Perrins, E. Komp, and B. Werling, Introducing Kansas Lava, in Implementation and Application of Functional Languages (M. T. Morazán and S.-B. Scholz, eds.), (Berlin, Heidelberg), pp. 18–35, Springer Berlin Heidelberg, 2010.
[121] A. Mycroft and R. Sharp, Higher-level techniques for hardware description and synthesis, International Journal on Software Tools for Technology Transfer 4 (2003), no. 3 271–297.
[122] J. O’Donnell, Overview of Hydra: a concurrent language for synchronous digital circuit design, International Journal of Information 9 (March, 2006) 249–264.
[123] D. L. Dill, A. J. Drexler, A. J. Hu, and C. H. Yang, Protocol verification as a hardware design aid, in Proceedings 1992 IEEE International Conference on Computer Design: VLSI in Computers & Processors, (Cambridge, MA, USA), pp. 522–525, IEEE, 1992.
[124] R. Nigam, S. Atapattu, S. Thomas, Z. Li, T. Bauer, Y. Ye, A. Koti, A. Sampson, and Z. Zhang, Predictable accelerator design with time-sensitive affine types, in Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, (New York, NY, USA), pp. 393–407, Association for Computing Machinery, 2020.
[125] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang, High-level synthesis for FPGAs: From prototyping to deployment, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 30 (2011), no. 4 473–491.
[126] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. Anderson, S. Brown, and T. Czajkowski, LegUp: High-level synthesis for FPGA-based processor/accelerator systems, in Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, FPGA ’11, (New York, NY, USA), pp. 33–36, Association for Computing Machinery, 2011.
[127] L. Truong and P. Hanrahan, A golden age of hardware description languages: Applying programming language techniques to improve design productivity, in 3rd Summit on Advances in Programming Languages, SNAPL 2019, May 16-17, 2019, Providence, RI, USA (B. S. Lerner, R. Bodík, and S. Krishnamurthi, eds.), vol. 136 of LIPIcs, (Providence, RI, USA), pp. 7:1–7:21, Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2019.
[128] D. Lockhart, G. Zibrat, and C. Batten, PyMTL: A unified framework for vertically integrated computer architecture research, in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, (Cambridge, United Kingdom), pp. 280–292, IEEE, 2014.
[129] C. Baaij, M. Kooijman, J. Kuper, A. Boeijink, and M. Gerards, Cλash: Structural descriptions of synchronous hardware using Haskell, in 2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, (Lille, France), pp. 714–721, IEEE, 2010.
[130] J. P. P. Flor, W. Swierstra, and Y. Sijsling, Pi-Ware: Hardware Description and Verification in Agda, in 21st International Conference on Types for Proofs and Programs (TYPES 2015) (T. Uustalu, ed.), vol. 69 of Leibniz International Proceedings in Informatics (LIPIcs), (Dagstuhl, Germany), pp. 9:1–9:27, Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2018.
[131] Jane Street, “Hardcaml: Register transfer level hardware design in OCaml.” https://github.com/janestreet/hardcaml.
[132] T. Bourgeat, C. Pit-Claudel, A. Chlipala, and Arvind, The essence of Bluespec: A core language for rule-based hardware design, in Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2020, (New York, NY, USA), pp. 243–257, Association for Computing Machinery, 2020.
[133] S. Sutherland and D. Mills, Standard gotchas: Subtleties in the Verilog and SystemVerilog standards that every engineer should know, 2006.
[134] S. Sutherland, D. Mills, and C. Spear, Gotcha again: More subtleties in the Verilog and SystemVerilog standards that every engineer should know, 2007.
[135] M. B. Taylor, BaseJump STL: SystemVerilog needs a standard template library for hardware design, in Proceedings of the 55th Annual Design Automation Conference, DAC ’18, (New York, NY, USA), Association for Computing Machinery, 2018.
[136] S. Xie and M. B. Taylor, The BaseJump manycore accelerator network, 2018.
[137] L. P. Carloni, From latency-insensitive design to communication-based system-level design, Proceedings of the IEEE 103 (2015), no. 11 2133–2151.
[138] L. P. Carloni, K. L. McMillan, and A. L. Sangiovanni-Vincentelli, Theory of latency-insensitive design, Trans. Comp.-Aided Des. Integ. Cir. Sys. 20 (Nov., 2006) 1059–1076.
[139] L. P. Carloni, K. L. McMillan, and A. L. Sangiovanni-Vincentelli, Latency Insensitive Protocols, in Computer Aided Verification (N. Halbwachs and D. Peled, eds.), Lecture Notes in Computer Science, (Berlin, Heidelberg), pp. 123–133, Springer, 1999.
[140] B. Cao, K. A. Ross, M. A. Kim, and S. A. Edwards, Implementing latency-insensitive dataflow blocks, in 2015 ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE), (Austin, TX, USA), pp. 179–187, IEEE, 2015.
[141] L. Tang and S. Davidson, “BSG Micro Designs.” https://github.com/bsg-idea/bsg_micro_designs, 2019.
[142] U. Berkeley, Berkeley logic interchange format (BLIF), Oct Tools Distribution 2 (1992) 197–247.
[143] J. Balkind, M. McKeown, Y. Fu, T. Nguyen, Y. Zhou, A. Lavrov, M. Shahrad, A. Fuchs, S. Payne, X. Liang, M. Matl, and D. Wentzlaff, OpenPiton: An open source manycore research framework, in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’16, (New York, NY, USA), pp. 217–232, Association for Computing Machinery, 2016.
[144] Princeton University, “OpenPiton Design Benchmark.” https://github.com/PrincetonUniversity/OPDB, 2020.
[145] A. Waterman, Y. Lee, D. Patterson, K. Asanović, and CS Division, The RISC-V Instruction Set Manual Volume I: User-Level ISA, 2016.
[146] J. Lowe-Power and C. Nitta, The Davis In-Order (DINO) CPU: A teaching-focused RISC-V CPU design, in Proceedings of the Workshop on Computer Architecture Education, WCAE ’19, (New York, NY, USA), Association for Computing Machinery, 2019.
[147] D. S. Holmes, A. L. Ripple, and M. A. Manheimer, Energy-efficient superconducting computing—power budgets and requirements, IEEE Transactions on Applied Superconductivity 23 (2013), no. 3 1701610–1701610.
[148] I. I. Soloviev, N. V. Klenov, S. V. Bakurskiy, M. Y. Kupriyanov, A. L. Gudkov, and A. S. Sidorenko, Beyond Moore’s technologies: operation principles of a superconductor alternative, Beilstein Journal of Nanotechnology 8 (2017), no. 1 2689–2710.
[149] R. Cai, A. Ren, O. Chen, N. Liu, C. Ding, X. Qian, J. Han, W. Luo, N. Yoshikawa, and Y. Wang, A stochastic-computing based deep learning framework using adiabatic quantum-flux-parametron superconducting technology, in Proceedings of the 46th International Symposium on Computer Architecture, ISCA ’19, (New York, NY, USA), pp. 567–578, Association for Computing Machinery, 2019.
[150] G. Tzimpragos, D. Vasudevan, N. Tsiskaridze, G. Michelogiannakis, A. Madhavan, J. Volk, J. Shalf, and T. Sherwood, A computational temporal logic for superconducting accelerators, in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’20, (New York, NY, USA), pp. 435–448, Association for Computing Machinery, 2020.
[151] G. Tzimpragos, J. Volk, A. Wynn, J. E. Smith, and T. Sherwood, Superconducting computing with alternating logic elements, in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), (Valencia, Spain), pp. 651–664, IEEE, 2021.
[152] T. E. Oliphant, Python for scientific computing, Computing in Science & Engineering 9 (2007), no. 3 10–20.
[153] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to automata theory, languages, and computation, ACM SIGACT News 32 (2001), no. 1 60–65.
[154] M. Munir, A. Gopikanna, A. Fayyazi, M. Pedram, and S. Nazarian, QMC: A formal model checking verification framework for superconducting logic, in Proceedings of the 2021 on Great Lakes Symposium on VLSI, GLSVLSI ’21, (New York, NY, USA), pp. 259–264, Association for Computing Machinery, 2021.
[155] R. Alur and D. L. Dill, A theory of timed automata, Theoretical Computer Science 126 (Apr., 1994) 183–235.
[156] J. Bengtsson, K. Larsen, F. Larsson, P. Pettersson, and W. Yi, Uppaal — a tool suite for automatic verification of real-time systems, in Hybrid Systems III (R. Alur, T. A. Henzinger, and E. D. Sontag, eds.), (Berlin, Heidelberg), pp. 232–243, Springer Berlin Heidelberg, 1996.
[157] R. J. Baker, CMOS: circuit design, layout, and simulation. John Wiley & Sons, 2019.
[158] L.-T. Wang, Y.-W. Chang, and K.-T. T. Cheng, Electronic Design Automation. Elsevier, 2009.
[159] IEEE standard for VHDL language reference manual, IEEE Std 1076-2019 (2019) 1–673.
[160] L. N. Cooper, Bound Electron Pairs in a Degenerate Fermi Gas, Physical Review 104 (Nov., 1956) 1189–1190.
[161] B. Josephson, Possible new effects in superconductive tunnelling, Physics Letters 1 (1962), no. 7 251–253.
[162] K. Likharev and V. Semenov, RSFQ logic/memory family: a new Josephson-junction technology for sub-terahertz-clock-frequency digital systems, IEEE Transactions on Applied Superconductivity 1 (1991), no. 1 3–28.
[163] J. Clarke, Principles and applications of SQUIDs, Proceedings of the IEEE 77 (1989), no. 8 1208–1223.
[164] G. H. Mealy, A method for synthesizing sequential circuits, The Bell System Technical Journal 34 (Sept., 1955) 1045–1079.
[165] M. Sipser, Introduction to the theory of computation, ACM SIGACT News 27 (1996), no. 1 27–29.
[166] Q. Xu, C. L. Ayala, N. Takeuchi, Y. Yamanashi, and N. Yoshikawa, HDL-based modeling approach for digital simulation of adiabatic quantum flux parametron logic, IEEE Transactions on Applied Superconductivity 26 (2016), no. 8 1–5.
[167] G. Tzimpragos, J. Volk, D. Vasudevan, N. Tsiskaridze, G. Michelogiannakis, A. Madhavan, J. Shalf, and T. Sherwood, Temporal computing with superconductors, IEEE Micro 41 (2021), no. 3 71–79.
[168] K. Gaj, E. G. Friedman, and M. J. Feldman, Timing of Multi-Gigahertz Rapid Single Flux Quantum Digital Circuits, in High Performance Clock Distribution Networks (E. G. Friedman, ed.), pp. 135–164. Springer US, Boston, MA, 1997.
[169] A. Krasniewski, Logic simulation of RSFQ circuits, IEEE Transactions on Applied Superconductivity 3 (1993), no. 1 33–38.
[170] T. Kawaguchi, K. Takagi, and N. Takagi, A verification method for single-flux-quantum circuits using delay-based time frame model, IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 98-A (2015) 2556–2564.
[171] R. S. Bakolo, Design and implementation of an RSFQ superconductive digital electronics cell library, Master’s thesis, University of Stellenbosch, 2011.
[172] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis, J. Wawrzynek, and K. Asanović, Chisel: Constructing hardware in a Scala embedded language, in Proceedings of the 49th Annual Design Automation Conference, DAC ’12, (New York, NY, USA), pp. 1216–1225, Association for Computing Machinery, 2012.
[173] J. Clow, G. Tzimpragos, D. Dangwal, S. Guo, J. McMahan, and T. Sherwood, A pythonic approach for rapid hardware prototyping and instrumentation, in 2017 27th International Conference on Field Programmable Logic and Applications (FPL), (Ghent, Belgium), pp. 1–7, IEEE, 2017.
[174] N. Matloff, Introduction to discrete-event simulation and the SimPy language, Dept. of Computer Science, University of California at Davis, Davis, CA (2008) 1–33. Retrieved on August 2, 2009.
[175] L. Schindler, The Development and Characterisation of a Parameterised RSFQ Cell Library for Layout Synthesis. PhD thesis, Stellenbosch University, 2021.
[176] L. W. Nagel, SPICE2: A Computer Program to Simulate Semiconductor Circuits. PhD thesis, EECS Department, University of California, Berkeley, May, 1975.
[177] Whiteley Research Inc., “WRspice.” http://wrcad.com. Accessed: 2021-10-22.
[178] K. E. Batcher, Sorting networks and their applications, in Proceedings of the April 30–May 2, 1968, Spring Joint Computer Conference, AFIPS ’68 (Spring), (New York, NY, USA), pp. 307–314, Association for Computing Machinery, 1968.
[179] A. F. Kirichenko, I. V. Vernik, M. Y. Kamkar, J. Walter, M. Miller, L. R. Albu, and O. A. Mukhanov, ERSFQ 8-bit parallel arithmetic logic unit, IEEE Transactions on Applied Superconductivity 29 (Aug, 2019) 1–7.
[180] G. Tzimpragos, A. Madhavan, D. Vasudevan, D. Strukov, and T. Sherwood, Boosted race trees for low energy classification, in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’19, (New York, NY, USA), pp. 215–228, Association for Computing Machinery, 2019.
[181] E. M. Clarke and E. A. Emerson, Design and synthesis of synchronization skeletons using branching time temporal logic, in Logics of Programs (D. Kozen, ed.), (Berlin, Heidelberg), pp. 52–71, Springer Berlin Heidelberg, 1982.
[182] T. Henzinger, X. Nicollin, J. Sifakis, and S. Yovine, Symbolic model checking for real-time systems, in [1992] Proceedings of the Seventh Annual IEEE Symposium on Logic in Computer Science, (Santa Cruz, CA, USA), pp. 394–406, IEEE, 1992.
[183] G. Behrmann, A. David, and K. G. Larsen, A tutorial on Uppaal, in Formal Methods for the Design of Real-Time Systems: International School on Formal Methods for the Design of Computer, Communication, and Software Systems, Bertinoro, Italy, September 13–18, 2004, Revised Lectures (M. Bernardo and F. Corradini, eds.), pp. 200–236. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
[184] V. Adler, C.-H. Cheah, K. Gaj, D. K. Brock, and E. G. Friedman, A Cadence-based design environment for single flux quantum circuits, IEEE Transactions on Applied Superconductivity 7 (1997), no. 2 3294–3297.
[185] F. Matsuzaki, N. Yoshikawa, M. Tanaka, A. Fujimaki, and Y. Takai, A behavioral-level HDL description of SFQ logic circuits for quantitative performance analysis of large-scale SFQ digital systems, Physica C: Superconductivity 392-396 (2003) 1495–1500. Proceedings of the 15th International Symposium on Superconductivity (ISS 2002): Advances in Superconductivity XV. Part II.
[186] N. Katam, S. N. Shahsavani, T.-R. Lin, G. Pasandi, A. Shafaei, and M. Pedram, “SPORT Lab SFQ logic circuit benchmark suite.” https://ceng.usc.edu/techreports/2017/Pedram%20CENG-2017-1.pdf, 2017. Accessed: 2021-10-22.
[187] K. Gaj, C.-H. Cheah, E. Friedman, and M. Feldman, Functional modeling of RSFQ circuits using Verilog HDL, IEEE Transactions on Applied Superconductivity 7 (1997), no. 2 3151–3154.
[188] R. N. Tadros, A. Fayyazi, M. Pedram, and P. A. Beerel, SystemVerilog modeling of SFQ and AQFP circuits, IEEE Transactions on Applied Superconductivity 30 (2020), no. 2 1–13.
[189] R. N. Tadros, A. Fayyazi, M. Pedram, and P. A. Beerel, SystemVerilog modeling of SFQ and AQFP circuits, IEEE Transactions on Applied Superconductivity 30 (2020), no. 2 1–13.
[190] L. C. Müller and C. J. Fourie, Automated state machine and timing characteristic extraction for RSFQ circuits, IEEE Transactions on Applied Superconductivity 24 (2014), no. 1 3–12.
[191] L. C. Müller and C. J. Fourie, Automated state machine and timing characteristic extraction for RSFQ circuits, IEEE Transactions on Applied Superconductivity 24 (2014), no. 1 3–12.
[192] C. J. Fourie, Extraction of DC-biased SFQ circuit Verilog models, IEEE Transactions on Applied Superconductivity 28 (2018), no. 6 1–11.
[193] A. D. Wong, K. Su, H. Sun, A. Fayyazi, M. Pedram, and S. Nazarian, VeriSFQ: A semi-formal verification framework and benchmark for single flux quantum technology, in 20th International Symposium on Quality Electronic Design (ISQED), (Santa Clara, CA, USA), pp. 224–230, IEEE, 2019.
[194] IEEE standard for Universal Verification Methodology language reference manual, IEEE Std 1800.2-2020 (Revision of IEEE Std 1800.2-2017) (2020) 1–458.