8/2/2019 COMS 6998 Final Paper
Compiling Python to C: An Introduction to RPython
[my name redacted] for Alfred Aho
Advanced Topics in Programming Languages and Compilers
Columbia University
Part 1 - A Brief Introduction to Python
Python is a bytecode-interpreted language, initially designed by Guido van Rossum in 1991 and presented to the alt.sources discussion board as
a language meant to interface with the Amoeba operating system. Since
then, it has grown to be one of the most popular languages in common
use today, featuring extensive library support and a massive developer
community. (Authors, 2011)
Python programs can either be compiled and executed as bytecode, or
interpreted in a console shell. For the purposes of this paper, we will only
consider the portion of the toolchain that deals with compilation and
bytecode interpretation.
The Python Language
A Demonstration Program
Here is a complete Python program consisting of an implementation of
Euclid's algorithm, followed by a short driver.
def euclid(a, b):
    if a < b:
        return euclid(b, a)
    while b != 0:
        t = b
        b = a % b
        a = t
    return a

if __name__ == '__main__':
    print euclid(549, 129)
While language grammar is not the topic of this paper, a few points stand
out because they demonstrate the philosophy of simplicity that runs
throughout the reference CPython implementation.
Firstly, there are no visible line delimiters. In the place of semicolons, we
find newline characters dividing the program. Secondly, there are no
braces to mark blocks. Instead, a standard indentation scheme is used, in
this case four spaces. Finally, the parentheses typically seen around loop
and conditional checks in C and friends are absent.
Together, these points make for a language that is meant to be simple to
write and simple to comprehend.
Notable Language Semantics
In Python, almost everything happens at runtime. All decisions about
name binding and object creation occur when the program is run, and as
a result allow for some very interesting semantics.
No Declarations
Unlike C, C++, Java, and a great many other languages, we are not
required to declare our variables. Instead, we simply assign a name to a
value, and use it as we see fit. If the name is assigned again after it has
been created, it is simply rebound to the new value.
In fact, all variable assignments occur in this way. When a variable is not
present in the current namespace, the interpreter creates a new name
and assigns to it whatever value is being assigned. A name can also be
deleted using the del keyword, removing it from the current namespace.
In a sense, the state of a Python program can be thought of as the set of
name-to-object mappings, in which case a program is effectively a means
of ensuring that the desired name is mapped to the appropriate object.
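The view of program state as a set of name-to-object mappings can be made concrete with a short sketch. Here the namespace is an ordinary dictionary handed to exec, so that each binding step can be inspected; the names namespace and answer are illustrative only and do not appear elsewhere in this paper.

```python
# A namespace modelled as a plain dictionary. The names `namespace` and
# `answer` are illustrative assumptions, not part of the demonstration program.
namespace = {}

exec("answer = 41", namespace)            # the name does not exist yet: create it
created = namespace['answer']

exec("answer = answer + 1", namespace)    # the name exists: rebind it
rebound = namespace['answer']

exec("del answer", namespace)             # remove the name from the namespace
still_bound = 'answer' in namespace

try:
    exec("answer", namespace)             # looking the name up now fails
    lookup_failure = None
except NameError as error:
    lookup_failure = error                # raised at runtime, not at compile time
```

Note that the failed lookup is only discovered when the statement runs, which is consistent with the claim that name binding is decided at runtime.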
No Types
Just as we never declared any variables, we also never specify the type of
any variables. The euclid function takes two parameters, but we do not
specify their types at compile time. Instead, the types are implied by the
operations we perform on the variables.
For instance, the euclid function clearly lends itself to numerical types,
since it features operations that are typically associated with numbers.
We see comparison, comparison with zero, and the modulo operation.
The precise type of the numerical arguments is not known. We could pass
in ints, longs, or even integer-valued floats, and the operations would
succeed.
This type-freedom goes further, however. If we were to pass in instances
of a class that was jury-rigged to implement all of those operations, the
algorithm would run. In fact, any variable that supports these operations
will happily pass through this algorithm.
The crucial point is that the entire Python type system relies on runtime
discovery of what operations are available. If the above function were
passed a list, it would happily accept the variable, push the function
object onto the runtime stack, and begin executing it. Only when it
attempts to perform the modulo operation would it discover that the
object passed is not valid and raise an exception. (Incidentally, while one
might expect the comparisons to be the first operations to fail, Python
actually implements comparisons between lists.)
This runtime discovery of object capabilities will become very important
in a few sections, when we discuss Object Spaces.
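To illustrate the point about jury-rigged classes, the following sketch passes a non-numeric wrapper type through the same euclid function, and then passes lists to show where the failure surfaces. The class Wrapped is hypothetical and exists only for this illustration.

```python
def euclid(a, b):
    if a < b:
        return euclid(b, a)
    while b != 0:
        t = b
        b = a % b
        a = t
    return a


class Wrapped(object):
    """Hypothetical type: not a number, but it answers <, != and %."""

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        return self.value < other.value

    def __ne__(self, other):
        # euclid compares b against the plain integer 0
        other_value = other.value if isinstance(other, Wrapped) else other
        return self.value != other_value

    def __mod__(self, other):
        return Wrapped(self.value % other.value)


# The algorithm runs unchanged on the wrapper type.
result = euclid(Wrapped(549), Wrapped(129))

# Lists survive both comparisons and fail only at the modulo operation.
try:
    euclid([549], [129])
    list_failure = None
except TypeError as error:
    list_failure = error
```

The wrapper never declares that it is "numeric"; it is accepted purely because the three operations the function needs happen to exist when they are requested.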
Everything is an Object
Upon close inspection of the program, a reader might object to the
statement of no declarations. After all, what is the def euclid(a, b):
doing if not declaring a function?
While the def syntax does borrow from languages that declare their
functions, in Python it is actually a statement whose side effect is the
creation of a function object containing the relevant code, and the
binding of the name euclid to that object in the containing namespace.
This is a crucial distinction, because it means that function objects can be
passed around as arguments, assigned to variables, and even deleted.
For instance, the following code can be appended to the above example
to cause all subsequent calls to euclid to be executed recursively rather
than iteratively:
def euclid_rec(a, b):
    if a < b:
        return euclid_rec(b, a)
    if b == 0:
        return a
    else:
        return euclid_rec(b, a % b)

euclid = euclid_rec

# now performs the recursive implementation
print euclid(100, 20)
In a similar way, class declarations are in fact statements that result in
the creation and naming of a class object:
class hello:
    def __init__(self):
        self.message = 'Hello World'

    def print_message(self):
        print self.message
These semantics of classes-and-functions-as-objects make for some
interesting and powerful capabilities. Functions can return custom-made
classes for specific purposes. Operating system-specific initialization
typically uses this mechanism to construct file and system-handling
interfaces composed of the methods supported by the running
system. For instance, a file handling module might attempt to load a
Windows-specific filesystem interface, fail, and load a Linux-specific one
instead.
Everything is a Namespace
In addition to being passed around, objects support the same namespace
operations as the global namespace. They support name binding,
rebinding, and deletion. For instance, in the above class declaration
demonstration, the __init__ method is attaching the message member
to that instance's namespace. In turn, each of the def statements is
nothing more than the familiar creation of a function object with a
binding to the enclosing namespace.
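A brief sketch of these namespace operations on an instance follows. It reuses the shape of the hello class from above under the name Hello; the member language is an illustrative addition.

```python
class Hello(object):
    def __init__(self):
        self.message = 'Hello World'      # binds a name in the instance namespace

    def print_message(self):
        return self.message


greeter = Hello()
names_after_init = sorted(vars(greeter))  # the instance namespace as a mapping

greeter.language = 'en'                   # bind a new member at runtime
greeter.message = 'Hi'                    # rebind an existing member
del greeter.language                      # delete a member

final_namespace = dict(vars(greeter))
methods_live_in_class = 'print_message' in vars(Hello)
```

The built-in vars function exposes the underlying mapping, which makes the parallel with the global namespace easy to see.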
As such, object members can be modified at will. One common design
pattern that utilizes this capability involves replacing an object's handlers
with hooks that wrap the original function. For instance, logging of
critical function calls can be implemented as follows:
# log and event_handler are assumed to be defined elsewhere
def log_function(f):
    def wrapper():
        log('About to call ' + str(f))
        return f()
    return wrapper

handler = event_handler()
handler.f = log_function(handler.f)
The handler object will now log all calls to the f function. To disable this
behavior, the f function can be replaced with its original value.
Built-ins
There is a notable exception to this namespace convention, however.
Built-in objects do not support the same namespace operations as other
classes. These are objects that are implemented at the interpreter level,
and as a result have limited functionality by design.
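The difference can be observed directly. The sketch below compares an ordinary class with the built-in int type; the exception types shown are those raised by CPython, and the class Plain and attribute extra are illustrative names.

```python
class Plain(object):
    pass


Plain.extra = 1                  # an ordinary class accepts a new member
plain_accepts = Plain.extra == 1

try:
    int.extra = 1                # the built-in type refuses the binding
    builtin_class_error = None
except TypeError as error:
    builtin_class_error = error

try:
    (5).extra = 1                # instances of built-ins have no open namespace
    builtin_instance_error = None
except AttributeError as error:
    builtin_instance_error = error
```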
Conclusion
The dynamism of Python's object system makes it very flexible. Since
objects are, at their heart, collections of name-to-object mappings, they
can be manipulated at runtime to perform any functionality desired. This
allows for some very interesting usage patterns, but does pose some
problems with translation, as we will soon see.
Part 2 - Python Interpreter Semantics
Having seen the flexibility of Python's objects, we might wonder how
these constructs are implemented. We have seen that there are some
limitations to the object model, namely the rigidity of the built-in objects
and methods, and that the ban on modifying operations on these objects
is an implementation detail of the interpreter and virtual machine.
The CPython Interpreter
As we make our way toward a discussion of the features and techniques
used to implement RPython, let us first discuss notable implementation
details of the CPython interpreter. The features and design patterns we
find here will again reveal themselves when we arrive at our destination.
Interpreter and VM Basics
After a Python program has been compiled to bytecodes by the compiler,
the output is passed to the virtual machine. Control flow is handled by
the interpreter, which is in essence a large switch statement that grabs a
bytecode from the program and chooses the appropriate handler. These
handlers manipulate the state of the VM itself.
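The shape of such a dispatch loop can be sketched in a few lines. This is a toy model and not CPython's actual loop: the instruction format (a pair of opcode name and argument) and the function run are assumptions made for illustration, and only three opcodes are handled.

```python
def run(program, constants):
    """Execute a list of (opcode, argument) pairs on a value stack."""
    stack = []
    pc = 0                                    # index of the next instruction
    while True:
        opcode, arg = program[pc]             # grab a bytecode from the program
        pc += 1
        # The chain below plays the role of the large switch statement.
        if opcode == 'LOAD_CONST':
            stack.append(constants[arg])
        elif opcode == 'BINARY_MODULO':
            right = stack.pop()
            left = stack.pop()
            stack.append(left % right)        # the objects decide what % means
        elif opcode == 'RETURN_VALUE':
            return stack.pop()
        else:
            raise ValueError('unknown opcode: %r' % (opcode,))


program = [
    ('LOAD_CONST', 0),
    ('LOAD_CONST', 1),
    ('BINARY_MODULO', None),
    ('RETURN_VALUE', None),
]
result = run(program, constants=(549, 129))
```

Each handler touches only the stack and the program counter, which is the VM state referred to above.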
Separation of Control Flow and Object Semantics
At the language level, there is a sort of implicit distinction between the
control flow of the program and the operations of the objects themselves.
Let us take a closer look at the implementation of Euclid's algorithm
introduced in the previous section:
def euclid(a, b):
    if a < b:
        return euclid(b, a)
    while b != 0:
        t = b
        b = a % b
        a = t
    return a
Let us forget what we intuitively know about numerical operations, and
remember that any object can be made to support these operations
(regardless of whether the operation makes sense or not). When we look
closely at this function, we find that we can treat the objects themselves
as black boxes whose operations are unknown to us. For each operation,
suppose we give the objects the benefit of the doubt and assume that
they will respond favorably to our attempts to perform operations.
Once we make this assumption, we are left with nothing more than naked
control flow. Let us strip away the syntax surrounding binary operations
and reveal what the proper semantics of the language see:
def euclid(a, b):
    if a.operation1(b):
        return euclid.__call__(b, a)
    # suppose __zero is an object with value zero
    while b.operation3(__zero):
        t = b
        b = a.operation4(b)
        a = t
    return a
This semantic understanding perfectly captures the runtime binding
semantics of the language. Each of the blandly named operation methods
belongs to an object, whose type we do not know until the last possible
moment. The interpreter is responsible for maintaining control flow,
implementing assignment by manipulating the namespace, and
instructing objects to perform operations.
The objects themselves, on the other hand, are responsible for either
performing those operations, or issuing an exception in the event that
they cannot. It is worth noting that the exceptions raised by the objects
when they do not support an operation are no different from exceptions
thrown from user code. As such, they can be caught and handled by the
interpreter.
Example with Disassembly
To demonstrate the ignorance with which the interpreter approaches
operations, here is a disassembly of the original euclid function as
provided by the CPython implementation. The particular disassembly
varies between Python versions and implementations, but the theme of
the division between flow control and object semantics remains.
In the disassembly below, the leftmost numbers correspond to
the line in the original Python code that generated the opcodes, and the
opcodes are listed in uppercase. Names are compiled down to numeric
handles, which appear to the right of the opcodes. The parenthesized
annotations indicate which name in the original program the numeric
identifiers correspond to.
The originating lines are included before each block for clarity, although
a raw disassembly would not contain this information. Additionally, the
>> indicates an instruction to which control may jump. These can be
considered the delimiters of basic blocks.
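Also note that, as can be expected of a stack-based machine, instructions
are compact. An instruction that takes an argument occupies three bytes (a
one-byte opcode followed by a two-byte operand), so the instruction
address increments by 3 bytes instead of 4, as is commonly seen in
register-based machines. Instructions without an argument, such as
BINARY_MODULO, occupy a single byte.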
if a < b:
  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 COMPARE_OP               0 (<)
              9 POP_JUMP_IF_FALSE       25
return euclid(b, a)
  3          12 LOAD_GLOBAL              0 (euclid)
             15 LOAD_FAST                1 (b)
             18 LOAD_FAST                0 (a)
             21 CALL_FUNCTION            2
             24 RETURN_VALUE
while b != 0:
  4     >>   25 SETUP_LOOP              38 (to 66)
        >>   28 LOAD_FAST                1 (b)
             31 LOAD_CONST               1 (0)
             34 COMPARE_OP               3 (!=)
             37 POP_JUMP_IF_FALSE       65
t = b
  5          40 LOAD_FAST                1 (b)
             43 STORE_FAST               2 (t)
b = a % b
  6          46 LOAD_FAST                0 (a)
             49 LOAD_FAST                1 (b)
             52 BINARY_MODULO
             53 STORE_FAST               1 (b)
a = t
  7          56 LOAD_FAST                2 (t)
             59 STORE_FAST               0 (a)
             62 JUMP_ABSOLUTE           28
        >>   65 POP_BLOCK
return a
  8     >>   66 LOAD_FAST                0 (a)
             69 RETURN_VALUE
This disassembly clearly illustrates the ignorance of the interpreter to the
objects' internals. The parameters and all objects created in the function
are included in the function's stack frame as a list of references indexed
by an integer offset. The interpreter does not know their types, and
neither does it care. The only scenario that could perturb it is an
uncaught exception, which is handled by quitting with an error.
For instance, note that the LOAD_FAST opcode takes as a parameter an
offset into the object array of the function frame. Once those objects are
pushed onto the stack, the COMPARE_OP opcode is responsible for
instructing the objects on the stack to be compared to one another, again
without any awareness of type.
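A listing of this kind can be produced with the standard dis module. The sketch below disassembles the euclid function and reads the table of local names from its code object; the exact opcodes and offsets printed depend on the interpreter version, so the output will not necessarily match a listing taken from another release.

```python
import dis


def euclid(a, b):
    if a < b:
        return euclid(b, a)
    while b != 0:
        t = b
        b = a % b
        a = t
    return a


# Print a human-readable listing of the compiled function.
dis.dis(euclid)

# The code object records the local names in slot order; the numeric
# argument of a fast-local load or store is an index into this tuple.
code = euclid.__code__
local_names = code.co_varnames
argument_count = code.co_argcount
```

Nothing in the code object mentions a type for a, b, or t, which is the ignorance described above.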
This division yields the notion of the Object Space. An object space can
be thought of as an object-level implementation of the application
interface that is presented to the interpreter. This split between
interpreter space and object space is a crucial one, because various
object spaces can be used for various purposes, allowing the interpreter
to drive either an actual execution, or a more abstract implementation, as
we will see when we reach RPython's translation framework.
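The split can be sketched by writing the control flow of euclid once and delegating every object operation to a space object. This is a simplified model rather than PyPy's actual interface: the classes ConcreteSpace and TracingSpace, the function euclid_via_space, and the method names are assumptions chosen for illustration.

```python
class ConcreteSpace(object):
    """Hypothetical object space that really performs each operation."""

    def wrap(self, value):
        return value

    def is_true(self, obj):
        return bool(obj)

    def lt(self, a, b):
        return a < b

    def ne(self, a, b):
        return a != b

    def mod(self, a, b):
        return a % b


class TracingSpace(ConcreteSpace):
    """Same operations, but every modulo request is also written to a log."""

    def __init__(self):
        self.trace = []

    def mod(self, a, b):
        self.trace.append(('mod', a, b))
        return ConcreteSpace.mod(self, a, b)


def euclid_via_space(space, a, b):
    """Interpreter side: pure control flow, no knowledge of the objects."""
    if space.is_true(space.lt(a, b)):
        return euclid_via_space(space, b, a)
    zero = space.wrap(0)
    while space.is_true(space.ne(b, zero)):
        t = b
        b = space.mod(a, b)
        a = t
    return a


plain_result = euclid_via_space(ConcreteSpace(), 549, 129)
tracing_space = TracingSpace()
traced_result = euclid_via_space(tracing_space, 549, 129)
```

Swapping the space changes what the operations do without touching the control flow, which is the property the translation framework relies on.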
Garbage Collection
Python is a garbage collected language, meaning that the runtime must
devote resources to maintaining an awareness of the liveness of its
objects. Up until version 2.0 of the CPython interpreter, this task was
handled by using reference counting. However, reference counting
suffered from a critical weakness in the form of an inability to detect
reference cycles. For instance, consider the following code:
lst = []          # lst count is 1
lst.append(lst)   # lst count is 2
del lst           # lst count is 1 - no deletion
In this snippet, lst is a list containing a reference to itself. When the del
statement is executed, the reference count is decremented to one, so the list is not
garbage collected. However, there is now no name that points to the
object, either directly or indirectly, which means the object is now
garbage.
On the face of it, it would be natural to simply implement a traditional
garbage collector, such as mark and sweep. These approaches work by
finding the roots of the object reference graph, traversing the graph and
marking the found objects as alive, and garbage collecting the rest.
However, CPython supports extension modules written in C, which means
that determining the root of the object graph is not possible for those
extensions, since there is no C interface for reporting objects created by
C-language extensions. As a result, the CPython garbage collection
scheme is a combination of reference counting and a periodic cycle
detector. (Schemenauer, 2000)
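The interplay of the two mechanisms can be observed from Python itself using the standard gc and weakref modules. The sketch below assumes CPython; it uses a small class instead of a list because lists cannot be weakly referenced, and the names Node and probe are illustrative.

```python
import gc
import weakref


class Node(object):
    """Plain container; unlike a list, its instances can be weakly referenced."""


was_enabled = gc.isenabled()
gc.disable()                      # keep the cycle detector from running on its own

node = Node()
node.myself = node                # the object now holds a reference to itself
probe = weakref.ref(node)         # observes the object without keeping it alive

del node                          # the last name is gone, but the cycle remains
alive_after_del = probe() is not None

gc.collect()                      # run the cycle detector explicitly
alive_after_collect = probe() is not None

if was_enabled:
    gc.enable()                   # restore the previous collector state
```

Under reference counting alone the object survives the del statement, and it is only reclaimed once the cycle detector runs.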
Part 3 - Enter RPython
Now that we have discussed the features of the CPython implementation,
let us turn our focus toward the RPython compiler.
RPython is a strict subset of the Python language. The goal of the project
is to develop a dialect of Python that can support whole-program static
analysis. This effort was launched in order to develop a toolkit for the
construction of virtual machines for dynamic languages, such as Python.
With this toolkit, a developer could specify a program in a high-level
language, namely RPython, and have it compiled down to some lower-level
language for fast execution.
While RPython currently supports backends for the Java Virtual Machine
and Common Language Runtime, the C backend is the most stable and
well-developed. Because of this and the universality and approachability
of C, we will restrict ourselves to discussion of the C backend for the
purposes of this paper.
Before we describe the proper translation toolchain, let us first consider
the obstacles to translating Python to C. (Rigo, Hudson, & Pedroni)
Dynamic Features Make Static Analysis Difficult
In general, it is nearly impossible to prove anything about a Python
program. As we have seen, any construct that seems to resemble a
feature of a language designed to support static analysis in fact
generates a dynamically changing object. For instance, functions are
actually objects that contain code, and any object can be made to behave
as a function by adding a __call__ method to its namespace. In order to
compile to C, we would have to be able to determine a class's members,
which can change at any time.
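A short sketch shows how late this decision can be made. The class Greeter and function greet are illustrative names; the method is attached to the class namespace because that is where the interpreter looks up __call__ for new-style objects.

```python
class Greeter(object):
    def __init__(self, name):
        self.name = name


greeter = Greeter('world')
callable_before = callable(greeter)   # instances are not callable yet


def greet(self, greeting):
    return '%s, %s' % (greeting, self.name)


# Bind the function under the name __call__ at runtime. Existing instances
# are affected too, since the lookup goes through the class.
Greeter.__call__ = greet

callable_after = callable(greeter)
message = greeter('Hello')
```

A static analysis that inspected Greeter before the assignment would conclude that its instances cannot be called, which is no longer true one statement later.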
There is more trouble with classes, however. C is a statically typed
language, meaning that the type of every expression in the
program must be known at compile time. Meanwhile, Python is a
language where the only thing that is known at compile time is control
flow.
What Python calls classes are actually namespaces which contain
references to objects. Any of these references can be mutated at runtime,
possibly as a result of some exponential computation. This, however,
could be remedied by use of type inference. The real trouble lies not so
much in the exact contents of every class, but rather in the number of
potential classes.
The Number of Types is Unbounded
Since classes can be created at runtime, consider the following function:
def make_types(num):
    ret = []
    for i in range(0, num):
        name = str(i)
        ret.append(type('class_' + name,
                        (object,),
                        {'a_' + name: i}))
    return ret
This function creates some number of distinct classes using the type
function, and returns them in a list. Any caller of this function
would have at its disposal any number of types from which it could
choose.
If we are performing static analysis with the intention of translating to C,
then we need to generate struct declarations for every structure our code
will use. However, this is clearly impossible in the case of code involving
this function.
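Calling the function makes the problem tangible. The sketch below repeats make_types so that it is self-contained and inspects the classes produced for one particular argument; the argument value 3 is arbitrary.

```python
def make_types(num):
    ret = []
    for i in range(0, num):
        name = str(i)
        ret.append(type('class_' + name,
                        (object,),
                        {'a_' + name: i}))
    return ret


classes = make_types(3)

class_names = [cls.__name__ for cls in classes]
all_distinct = len(set(classes)) == len(classes)

# Each class carries a differently named member, so each would need its
# own struct layout in a C translation.
last_member = classes[2].a_2
```

Since the argument could just as well come from user input, the set of struct declarations a translator would need cannot be enumerated ahead of time.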
Python functions enjoy the same sort of dynamic creation as classes. In C,
the prototype of a function defines the type of function pointers that may
point to the function. Furthermore, the addition of syntax for calling
objects via the __call__ method introduces new complexity.
On the face of it, a function pointer could be kept in the object. However,
since any function can be assigned to the __call__ method, the
prototype of the function pointer must be either inferred, which poses a
significant challenge, or null, which introduces the risk of a translated
program receiving a segmentation fault on improper function invocation
rather than exiting gracefully.
However, if we restrict our code to only generate a provably-bounded
number of classes, we can identify the classes that would be created by
using data flow analysis. For instance, the PyPy interpreter must create
wrappers for various objects in the interpreter. Instead of manually
specifying a wrapper class for every object, the code instead defines a
function that generates a class for that object. The crucial point is that
the number of these objects is bounded, and therefore the class-generating
function is also entered a bounded number of times, ensuring
that the translator does not spin through the function forever.
In addition to not creating new classes, existing classes cannot have their
contents modified after startup. In other words, the class's namespace is
considered to be constant after a certain point. Otherwise, each
successive alteration would have to be captured and modeled by a new
type in C, assuming the results of the alterations are decidable in the first
place.
A Special Note on Functions
In Python, any variable can point to any function. If a function call with an
improper number of arguments is attempted, the runtime would detect it
and issue an error. This is not the case in C, however. Only function
pointers of the appropriate type can point to a function, and an error of
mismatched arguments must be discovered at compile time, not at
runtime.
To solve this, RPython places a restriction on the use of function objects.
Functions are first class objects, but variables that hold them must only
hold functions that are deemed similar enough by the type inferencer.
The documentation does not currently define "similar enough", although
it does promise that the toolchain will emit explicit errors and not
obscure crashes.
Globals
Global definitions correspond roughly to C's static declarations. As such,
their type and value must be known at compile time. A simple
restriction is placed on globals, namely that they are considered constant,
and cannot be modified after they are defined.
The Translation Path
Now that we have described the limitations placed on the language, we
can delve into the process of the translation itself. The RPython compiler
is written in Python, and is interpreted by a standard interpreter. For the
purposes of this paper, we assume that this interpreter is CPython.
For reference, here is a graphic representing the RPython translation path
(Krekel & Bolz, 2005):
Parsing
The RPython compiler is interesting in that it does not perform parsing
on the Python source file. Instead, on receiving an input file, it uses the
interpreter's compilation functionality to interpret the input file.
Eventually, the interpreter reaches the ready point, which consists of a
call into the entry point of the RPython compiler.
That call inspects the objects produced by the interpretation of the
initialization portion of the input code. This consists of code objects in
the form of functions, which can be disassembled using the built-in dis
module, and live Python objects, whose members can be retrieved by
Python's native introspection features.
In essence, a standard Python interpreter is used as a preprocessor for the
RPython language. The RPython compiler's input is not the program
itself, but rather the partially-executed state of the program, as
generated by the CPython interpreter. This state takes the form of class
and function objects in memory, as well as any global variables whose
values are to be compiled to static declarations.
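The idea of treating executed program state as the compiler's input can be sketched with the standard library alone. This is not the RPython toolchain's own code: the source string, the names TABLE, lookup and Point, and the way the namespace is scanned are all assumptions made for illustration.

```python
import inspect

# Hypothetical input program. Its initialization code runs in the host
# interpreter before any analysis takes place.
SOURCE = '''
TABLE = [n * n for n in range(4)]      # computed during initialization

def lookup(i):
    return TABLE[i]

class Point(object):
    def __init__(self, x):
        self.x = x
'''

namespace = {}
exec(compile(SOURCE, '<input>', 'exec'), namespace)   # the preprocessing step

# What an analysis sees afterwards: live objects, not source text.
functions = sorted(name for name, obj in namespace.items()
                   if inspect.isfunction(obj))
classes = sorted(name for name, obj in namespace.items()
                 if inspect.isclass(obj))
table = namespace['TABLE']                             # already a concrete value
globals_read_by_lookup = namespace['lookup'].__code__.co_names
```

The list comprehension that builds TABLE never needs to be analyzed, because only its result survives into the state being inspected.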
The Flow Object Space
At this point, the program's state constitutes an intermediate
representation amenable to abstract interpretation. The RPython compiler
is part of the PyPy Python interpreter project, and shares some code with
the proper PyPy interpreter. In particular, it borrows the interpreter to
handle interpretation of the newly-minted Python bytecodes.
However, while PyPy uses a concrete object space to implement the full
spectrum of interaction between the interpreter and Python objects,
RPython uses a much simpler space called the Flow Object Space. This is
an object space that contains placeholder objects instead of fully featured
objects, and yet still receives the relevant requests for operations from
the interpreter.
The aim of the flow object space is to generate flow graphs for the
program by way of interaction with the abstract interpreter. The abstract
interpreter goes over the entire program bytecode by bytecode, and
sends off requests for operations to the flow object space. The flow
object space, rather than servicing these requests, records the operations,
gradually building up a flow graph of the operations of the program from
the live code objects.
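The recording behavior can be sketched for a branch-free fragment. This is a simplified model and not the actual Flow Object Space: the classes Variable and RecordingSpace and the function fragment are hypothetical, and branching is deliberately left out.

```python
class Variable(object):
    """Placeholder standing in for a value that is unknown during analysis."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


class RecordingSpace(object):
    """Hypothetical space that records requested operations instead of
    performing them."""

    def __init__(self):
        self.operations = []
        self._count = 0

    def new_variable(self):
        self._count += 1
        return Variable('v%d' % self._count)

    def _record(self, opname, *args):
        result = self.new_variable()          # placeholder for the outcome
        self.operations.append((opname, args, result))
        return result

    def mod(self, a, b):
        return self._record('mod', a, b)

    def add(self, a, b):
        return self._record('add', a, b)


def fragment(space, a, b):
    """The requests an interpreter would issue for the expression (a % b) + a."""
    remainder = space.mod(a, b)
    return space.add(remainder, a)


space = RecordingSpace()
a = space.new_variable()
b = space.new_variable()
result = fragment(space, a, b)

listing = ['%r = %s(%s)' % (res, op, ', '.join(map(repr, args)))
           for op, args, res in space.operations]
```

The recorded list is a straight-line piece of a flow graph: each entry names an operation, its placeholder inputs, and the placeholder that carries its result forward.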
One might expect that branching is a problem with this scheme. After all,
if the interpreter sees a branch, it will only choose one direction in which
to go. The RPython documentation claims that the interpreter is tricked
into interpreting two sides of a branch at once, without going into detail.
With such a paucity of description, let us take the documentation at its
word.
Type Inference
Once the control flow graph of the entire program is available, the type
inferencer, called the annotator, can pass over the entire program and
infer the type of each variable. While the details of this type inferencer are
far beyond the scope of this paper, it suffices to say that the inferencer
works by forward propagation, starting with the types of the input arguments of
the entry point function as a base case.
The annotator begins with specific types for each variable, and gradually
works up to the most general. The annotation lattices are shown in [big].
Variables can change their type, but at no point may the types
diverge. In other words, after every branch is merged at a join point, the
types of each variable must be the same. For instance, the following code
is forbidden: (Rigo, Hudson, & Pedroni)
if a == 1:
    b = 10
else:
    b = 'a string'
# b has conflicting types here
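The merge rule at a join point can be sketched as follows. This is a deliberately simplified stand-in for the annotator, which uses its own lattice of annotations: here ordinary Python classes play the role of types, generalization only follows a direct subclass chain, and the names AnnotationError, join and merge_branches are hypothetical.

```python
class AnnotationError(Exception):
    """Raised when two branches leave a variable with incompatible types."""


def join(type_a, type_b):
    """Merge the types one variable has on two incoming branches."""
    if type_a is type_b:
        return type_a
    if issubclass(type_a, type_b):
        return type_b                  # generalize to the broader type
    if issubclass(type_b, type_a):
        return type_a
    raise AnnotationError('conflicting types: %s and %s'
                          % (type_a.__name__, type_b.__name__))


def merge_branches(then_types, else_types):
    """Merge the variable-to-type maps of both sides of a branch."""
    merged = {}
    for name in then_types:
        merged[name] = join(then_types[name], else_types[name])
    return merged


# bool on one side and int on the other generalizes to int.
compatible = merge_branches({'b': bool}, {'b': int})

# The forbidden example: an int on one side, a string on the other.
try:
    merge_branches({'b': int}, {'b': str})
    conflict = None
except AnnotationError as error:
    conflict = error
```

The first merge moves from a specific type to a more general one, while the second has no common type to move to under this simplified rule and is rejected.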
Specialization
From this point forward, the program undergoes direct compilation down
to the target environment. From the flow graph, a low level flow graph is
generated, conforming to either a low-level, C-like type system for the C
backend called lltypesystem, or an object-oriented type system for the
JVM and CLI backends, called ootypesystem. Given this lower level flow
graph, the appropriate code generator can generate target code.
Conclusion
The RPython language was developed to serve as a framework for the
specification of dynamic language virtual machines. The project itself is
called PyPy, as is the flagship Python interpreter.
The PyPy interpreter is written entirely in RPython, and can be either
interpreted by standard CPython, or translated to C for release. As of this
writing, the PyPy interpreter is considered to be the fastest Python
implementation available today, boasting speed increases over the
reference CPython implementation in excess of ten times for some tests.
(Authors, PyPy Speed, 2011) In addition, it is very compliant, including
almost all features of the CPython implementation.
The use of RPython as an implementation language and framework allows
the PyPy project to be written in a high-level language with concise
features, but be compiled to a low-level language for fast execution. The
high level specification has allowed for a very flexible architecture. For
instance, while CPython's garbage collection scheme consists of manually
written reference counts, PyPy's scheme can be chosen at compile time as
a flag.
In addition to Python, the VM specification framework is flexible enough
to allow specification of other languages. For instance, JS-PyPy is a
Javascript interpreter written in RPython and compiled using the RPython
compiler and VM toolkit. (Santagada)
Compilation of Python to C is a common question among Python
beginners, and while such a translation is made impossible by the
semantics of the general Python language, RPython shows that simple
restrictions can be placed on the language to make the translation
possible.
Bibliography
Authors, P. (2011, December 10). PyPy Homepage. Retrieved December 10, 2011, from General Python FAQ: http://docs.python.org/faq/general#why-was-python-created-in-the-first-place
Authors, P. (2011, December 10). PyPy Speed. Retrieved December 10, 2011, from PyPy Speed: http://speed.pypy.org/
Krekel, H., & Bolz, C. F. (2005, December 28). PyPy - The new Python implementation on the block. Retrieved December 8, 2011, from PyPy Homepage: http://codespeak.net/pypy/extradoc/talk/22c3/hpk-tech.html
Rigo, A., Hudson, M., & Pedroni, S. Compiling Dynamic Language Implementations. European Commission within the Sixth Framework Programme.
Santagada, L. (n.d.). PyPy Homepage. Retrieved December 8, 2011, from JS-PyPy: PyPy's Javascript interpreter: http://codespeak.net/svn/pypy/lang/javascript/trunk/js/javascript-interpreter.txt
Schemenauer, N. (2000, December 6). Arctrix. Retrieved December 8, 2011, from Garbage Collection for Python: http://arctrix.com/nas/python/gc/