MEAP Edition Manning Early Access Program
Deep Learning with PyTorch Version 6
Copyright 2019 Manning Publications
For more information on this and other Manning titles go to https://www.manning.com/
©Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders.
https://forums.manning.com/forums/deep-learning-with-pytorch
welcome
Welcome to Deep Learning with PyTorch!
Eli and Luca here. We’re ecstatic to have you with us. No, really — it’s a big deal for us, both terrifying and exhilarating. So, thanks!
Our best wish for this book is that it’ll help you develop your own intuition and stimulate your curiosity. PyTorch is an amazing library; it will give you new powers if you give it a few hours of your time.
We’re having a lot of fun writing this book, but it’d be pretty lame if we were the only ones having fun. We’re looking forward to hearing directly from you about what you like about the book, and what still needs work. We’re adamant that the manuscript be as clear and as practically useful as possible, so please reach out. We want to know how you feel about the book, both good and bad. The good will give us fuel for the journey, while the bad keeps us out of the weeds.
One note: at the time of this writing, the released version of PyTorch is 0.3.1, while the pending 0.4 release makes a fairly major change to the relationship between Tensors and Variables in PyTorch. We decided to target the currently released version, rather than rely on functionality you could only get by compiling the master branch. Be aware that if you’re trying to run the examples against PyTorch 0.4, you might run into some issues. We’ll get those cleared up as soon as we can after the release is final.
In the meantime, enjoy the book, say "hi!" on the forum, and we’ll chat again soon!
— Eli and Luca
brief contents
PART 1: CORE PYTORCH
1 PyTorch from 1 Mile Away
2 A Whirlwind Tour of PyTorch
3 It Starts with a Tensor
4 The World as Tensors
5 The Mechanics of Learning
6 Telling Birds from Airplanes - Learning from Images
PART 2: LEARNING FROM IMAGES IN THE REAL-WORLD: EARLY DETECTION OF LUNG CANCER
7 How hard can curing cancer be?
8 Classifying Suspected Tumors
9 Monitoring Metrics: Precision, Recall, and Pretty Pictures
10 Determining What to Classify With Segmentation And Clustering
11 Data Improvements, Augmentation
12 PyTorch in Production
13 Revisiting CycleGAN’s zebras
Part 1: Core PyTorch
1 PyTorch from 1 Mile Away
This chapter covers
the key concepts behind deep learning
an overview of the tooling landscape
where PyTorch comes from
why one should choose PyTorch
the anatomy of PyTorch
1.1 Introduction
We are living through exciting times. The landscape of what computers can do is
changing by the week. Tasks that only a few years ago were thought to require higher
cognition are getting solved by machines at super-human levels of performance. For
example:
describing a photographic image with a sentence in current English
playing complex strategy games
diagnosing a tumor from a radiological scan
are all now approachable by a computer. Even more impressively, the ability to solve
such tasks is acquired by computers through examples, rather than encoded by a human
as a set of hand-crafted rules.
It would be disingenuous to assert that machines are learning to "think" in any human
sense of the word. Rather, we’ve discovered a general class of algorithms that are able to
approximate complicated, non-linear processes very effectively. In a way, we’re learning
that intelligence as we subjectively perceive it, a notion that we often conflate with
self-awareness, is not required to successfully carry out the aforementioned tasks and
many others.
That general class of algorithms we’re talking about falls under the definition of deep
learning, which deals with training mathematical entities named deep neural networks on
the basis of examples. Knowing how to choose and train a neural network makes it
possible to solve very complicated problems in an ever-increasing number of
domains, from computer vision to natural language processing, provided that training
data are available.
It’s hard to overstate how large an impact deep learning has had and will continue to
have on the world. From computer vision and natural language processing all the
way to scene understanding and autonomous agents, established benchmarks set by
hand-crafted systems or by humans themselves have been repeatedly blown away by
algorithms that carried very little prior knowledge about the task at hand except for
what was provided through examples.
In the medical image analysis literature, for instance, elaborate pipelines of ad-hoc
algorithms used to identify an organ from a CT scan have been increasingly supplanted
by the use of convolutional neural networks [REF Section]. Figure [REF] shows the
number of citations including the keyword deep learning in PubMed, a popular index of
peer-reviewed medical articles, over the years, as a testimony to this trend. In the 2017
edition of MICCAI, the largest international medical image analysis conference, over
50% of the papers submitted used deep learning in one way or another.
Figure 1.1 Number of publications mentioning deep learning on PubMed, the leading index of peer-reviewed biomedical journal articles, year over year.
Finding the right sequence of hard-coded steps to go from a CT scan to the outlines of a
liver, and doing so robustly in the face of variability in the anatomy and in the imaging
process, requires anatomical knowledge, algorithmic expertise and the right mix of
experience and trial and error. With the advent of deep learning, this task has turned into
feeding data representative of such variability into an appropriately designed system that
learns how to obtain the desired outlines. If we exclude the effort required to create the
outlines of the liver used as examples for learning, the practitioner is required to know a
lot less about medical imaging and processing pipelines than in the past. This
initially caused some frustration outside the deep learning community in this and other
fields, but the compelling results obtained through the use of deep learning have
driven a wave of adoption and experimentation during the first years of the current
decade.
Figure 1.2 CT scan of a liver (top left) and its corresponding outline (top right). The bottom row offers an example of such variability: cross-sections of the liver in CT scans acquired from different subjects.
Surprisingly, the mathematical concepts behind deep learning are not prohibitively
advanced and the programming tools to train a deep neural network are very accessible.
This book focuses on one such tool, PyTorch, with the aim of covering enough
ground to allow the reader to solve practical problems with deep learning or explore new
models as they pop up on arXiv.
PyTorch is an amazing library for a deep learning practitioner. It doesn’t get in the way
and it minimizes cognitive overhead, ultimately allowing one to focus on what matters the
most, building and training deep learning models, without giving up performance
and scalability.
We believe that lowering the barrier to tools like PyTorch represents more than just a
way to facilitate the acquisition of new technical skills. It is a step towards equipping a
new generation of computer scientists, data scientists and students from a wide range of
disciplines with a working knowledge of the tools that will run the world behind the
scenes during the coming decades. It is imperative that such knowledge be widespread and
accessible, owing to the great power that lies in these methodologies. This is what
motivates us to make this book as useful as possible.
We’ll start off by taking a look at the ongoing revolution brought in by deep learning
over the last few years. This will help us understand how PyTorch fits in the big picture
and why it’s a great idea to pick it up at this point in time.
1.1.1 The Deep Learning Revolution
Until the advent of deep learning in the late 2000s, machine learning practitioners
typically faced the following task: given a dataset of samples and desired outcomes,
define transformations on the data until the resulting numbers, or features, allow a
downstream algorithm, like a classifier, to produce correct outcomes on new data. This
process, called feature engineering and still very much in use today, is aimed at taking the
original data and coming up with representations of the same data that can then be fed to
an algorithm to solve a problem. For instance, in order to tell 1’s from 0’s in images of
handwritten digits, one would come up with a set of filters to estimate the direction of
edges over the image, and then train a classifier to predict the correct digit given a
distribution of directions at multiple locations.
Deep learning, on the other hand, deals with finding such representations automatically,
from raw data, in order to successfully perform a task. In the 1’s vs 0’s example, filters
would be estimated during training, by iteratively looking at pairs of examples and target
labels. This is not to say that feature engineering has no place with deep learning, in one
form or another: we often need to inject some form of prior knowledge into a learning
system. However, the ability of a neural network to ingest data and extract useful
representations on the basis of examples is what makes deep learning so powerful.
Successful practitioners are not required to hard-code transformations of the data to
extract those representations, but rather to know how to make a mathematical entity
discover them from training data. As with many disruptive technologies, this fact has led to a
change in perspective (Figure 1.3).
Figure 1.3 The change in perspective brought in by deep learning. Left: the practitioner is busy defining engineering features and feeding them to a learning algorithm; the results on a task will be as good as the features he or she engineers. Right: with deep learning, the raw data is fed to an algorithm that extracts hierarchical features automatically, based on optimizing the performance of the algorithm on the task; the results will be as good as the ability of the practitioner to drive the algorithm towards its goal.
An important factor that determined the success of deep learning is its
underlying simplicity. At the basis of deep learning are neural networks, mathematical
entities capable of representing complicated functions, i.e. transformations from inputs to
outputs, through a composition of simpler functions. The shape of those simpler
functions and the way such composition is realized make it possible for a neural network
to learn how to approximate functions whose inputs and outputs are far from each other,
like an image and an English sentence that describes it, from a large number of
input/output pairs.
The term neural network is obviously suggestive of a link to the way our brain works. As
a matter of fact, although the initial models were inspired by neuroscience [REF
Perceptron], modern artificial neural networks bear very little resemblance to the
mechanisms at the basis of neural networks in the brain. It seems however likely that
both artificial and physiological neural networks have selected similar mathematical
strategies for approximating complicated functions that happen to work very effectively.
The unit at the basis of such decomposition of complicated functions into simpler units
is the neuron. At its core, it is nothing but a linear transformation of the input (e.g.
multiplication of the input by a number, the weight, and addition of a constant, the bias)
followed by the application of a fixed non-linear function (referred to as the "activation
function"). Mathematically we can write this out as
y = f(w * x + b)
In general, x and y can be simple scalars, or vector-valued (meaning holding many scalar
values), in which case w is a matrix and b is a vector. In this case, the expression above is
referred to as a layer of neurons, as represented in Figure 1.4.
NOTE For more detail on matrix and vector multiplication, please see Appendix 1, Linear Algebra.
Figure 1.4 An artificial neuron: a linear transformation enclosed in a non-linear function.
Both parts of the neuron, the linear transformation and the subsequent non-linear
function, called the activation function, are crucial. Without the activation function,
successive linear transformations can be simplified to a single linear
transformation, which is not capable of the complex behavior we desire. Likewise,
without the linear transformations, the activation function (typically) has no parameters,
and so the network is unable to learn or change. The ability of an ensemble of neurons to
act as universal approximators, that is, to be able to approximate a very wide range of
functions, depends on the combination of the linear and non-linear behavior inherent to
each neuron.
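To make the argument above concrete, here is a minimal sketch in plain Python, assuming scalar inputs. The parameter values and the helper names (neuron, identity, collapsed) are ours, chosen only for illustration. It suggests why both halves of the neuron matter: two stacked linear transformations with no activation reduce to a single linear transformation, while putting tanh in between produces something that is no longer linear.

```python
import math

def neuron(x, w, b, f=math.tanh):
    """A scalar neuron: linear transformation w * x + b followed by activation f."""
    return f(w * x + b)

def identity(z):
    """Stand-in for 'no activation function'."""
    return z

# Hypothetical parameter values, picked only for this illustration.
w0, b0 = 2.0, -1.0
w1, b1 = 0.5, 3.0

def two_linear_layers(x):
    # Two stacked neurons with the activation removed.
    return neuron(neuron(x, w0, b0, f=identity), w1, b1, f=identity)

def collapsed(x):
    # The same function, written as one linear transformation.
    return (w1 * w0) * x + (w1 * b0 + b1)

def two_tanh_layers(x):
    # Two stacked neurons with the tanh activation in place.
    return neuron(neuron(x, w0, b0), w1, b1)

for x in (-3.0, 0.0, 1.5):
    print(x, two_linear_layers(x), collapsed(x), two_tanh_layers(x))
```

For any straight line, the values at 0 and 2 average to the value at 1. The stacked tanh neurons do not satisfy that property, which is one quick way to check that the composition is genuinely non-linear.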
Activation functions generally help accomplish this by mapping a wide range of inputs to a
narrow range of outputs. The tanh function accomplishes this by having arbitrarily large
positive inputs asymptotically map to 1.0, while large negative inputs map to -1.0.
This allows the neuron to be sensitive to a narrow range of input values, while treating
anything outside of that range as equivalent.
>>> import math
>>> math.tanh(1)
0.7615941559557649
>>> math.tanh(2)
0.9640275800758169
>>> math.tanh(3)
0.9950547536867305
Inputs above 3.0 all produce outputs approximately equal to 1.0, at which point the
neuron is considered "saturated" (as it would be for inputs below -3.0).
The activation function is typically fixed, as it doesn’t depend on parameters that need to
be optimized. The range of input values for which a neuron’s response is not saturated
then depends on how the preceding linear transformation shifts and scales the input.
Therefore, neural networks can learn by changing the individual w and b values of the
linear transformation for all their neurons. By measuring the error made by the
network on a specific task, like a classification or a regression, one can define the
learning process as the process of changing w and b throughout the network so that the
error decreases.
A multi-layer neural network is made up of a composition of the above functions, that is
x_1 = f(w_0 * x + b_0)
x_2 = f(w_1 * x_1 + b_1)
...
y = f(w_n * x_n + b_n)
where the output of a layer of neurons is used as an input for the following layer. Again,
training consists in finding good values for w_0, w_1, … w_n and b_0, b_1, … b_n
so that the resulting network correctly carries out a task, such as predicting likely
temperatures given geographic coordinates and time of the year.
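The layer-by-layer composition described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: every layer holds a single scalar neuron, the activation is tanh, and the weights and biases are hypothetical, untrained values. The function name forward is ours.

```python
import math

def forward(x, weights, biases, f=math.tanh):
    """Apply x_{k+1} = f(w_k * x_k + b_k) for each layer in turn."""
    for w, b in zip(weights, biases):
        x = f(w * x + b)
    return x

# Hypothetical, untrained parameters for a three-layer scalar network.
weights = [0.5, -1.0, 2.0]
biases = [0.0, 0.5, -0.5]

y = forward(1.0, weights, biases)
print(y)
```

Training, in this picture, amounts to searching for entries of weights and biases that make forward return the desired outputs.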
Figure 1.5 Our mental model of an artificial neural network: composition of neurons, where the output of a neuron is used as input argument for other neurons. By estimating the values of w_n and b_n of all neurons, the network learns how to approximate complicated functions. The bottom row shows a neural network organized in layers, whereby neurons have multiple inputs and feed their outputs to layers downstream. By increasing the width of layers, we increase the capacity of the network, i.e. its ability to hold larger intermediate representations of the input data.
By carrying out a task successfully we mean obtaining a correct output on unseen data
produced by the same data-generating process. In this sense, the network is not allowed
to learn the data by heart, as it would typically perform badly on new data that is similar but
not identical to the data used during training, an effect known as overfitting. Instead, a
successfully trained network, through the value of its weights and biases, will capture the
inherent structure of the data in the form of meaningful numerical representations that
work correctly for previously unseen data.
Figure 1.6 Our mental model of the learning process: given input data and the corresponding desired outputs (ground truth), as well as initial values for the weights, the network is fed input data (forward pass) and the errors are evaluated by comparing the resulting outputs to the ground truth. In order to optimize the weights, the change in the error following a unit change in weights (the gradient of the error with respect to the parameters) is computed using the chain rule for the derivative of a composite function (backward pass). The value of the weights is then updated in the direction that leads to a decrease in the error. The procedure is repeated until the error, evaluated on unseen data, falls below an acceptable level.
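The loop described in the caption can be sketched for the smallest possible case, a single tanh neuron with one weight and one bias. Everything specific here is an assumption made for illustration: the toy data is generated by a hypothetical "true" neuron, the error is the mean squared error, the learning rate and number of steps are arbitrary, and for brevity the error is measured on the training data rather than on unseen data.

```python
import math

def model(x, w, b):
    """Forward pass of a single neuron."""
    return math.tanh(w * x + b)

def loss_and_grads(xs, ys, w, b):
    """Mean squared error and its gradient with respect to w and b."""
    n = len(xs)
    loss = grad_w = grad_b = 0.0
    for x, y in zip(xs, ys):
        out = model(x, w, b)            # forward pass
        err = out - y                   # compare with the ground truth
        loss += err ** 2 / n
        # Backward pass via the chain rule; d tanh(z)/dz = 1 - tanh(z)**2.
        d_out = 2.0 * err / n
        d_z = d_out * (1.0 - out ** 2)
        grad_w += d_z * x
        grad_b += d_z
    return loss, grad_w, grad_b

# Toy ground truth produced by a hypothetical neuron with w=1.5, b=-0.5.
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
ys = [math.tanh(1.5 * x - 0.5) for x in xs]

w, b = 0.0, 0.0                          # initial values for the parameters
learning_rate = 0.5
initial_loss, _, _ = loss_and_grads(xs, ys, w, b)
for step in range(2000):
    loss, grad_w, grad_b = loss_and_grads(xs, ys, w, b)
    w -= learning_rate * grad_w          # move against the gradient
    b -= learning_rate * grad_b
final_loss, _, _ = loss_and_grads(xs, ys, w, b)
print(initial_loss, final_loss, w, b)
```

In a real setting the stopping criterion would be evaluated on held-out data, and the gradients would be computed by the framework rather than by hand.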
The description so far is necessarily a simplification, but it captures the basic
mechanisms through which neural networks operate. In later chapters we will go in depth
into typical neural network architectures that have been devised to carry out specific
tasks, like recognizing objects in images.
We encourage the reader to build up an intuitive understanding of deep learning before
proceeding. To that end, Grokking Deep Learning [REF] is a great resource for
developing a strong mental model of, and intuition for, the mechanisms underlying deep neural
networks. For a thorough introduction and reference, we direct the reader to Deep
Learning by Goodfellow et al. [REF].
Deep learning allows us to carry out a very wide range of complicated tasks, like
machine translation, playing strategy games or identifying objects in cluttered scenes,
solely on the basis of experience. In order to do so in practice we need tools that are
flexible, so they can be adapted to our specific problem, and efficient, to allow training to
occur over large amounts of data and the network to perform correctly in the presence of
uncertainty in the inputs. In the following section we will take a look at the evolution of
tools for deep learning and understand where PyTorch stands in that landscape.
1.1.2 The Tools Behind the Revolution
We just learned that deep learning is a revolutionary technology, allowing computers to
be programmed through examples. The revolution came about thanks to a few factors.
First off, methodologies: we discovered how to train deep networks effectively. Second,
technological advances: we discovered that we could use Graphics Processing Units
(GPUs, specialized processors dedicated to rendering 2D and 3D graphics) to massively
speed up the computations involved in training a network. Third, the availability of large
amounts of data, such as very large collections of natural images collected from the web.
A last, key ingredient to the adoption of deep learning across the board has been the
availability of tools that made experimenting with neural network architectures accessible
to users with only a basic training in mathematics and programming (PyTorch is one such
tool, and a really good one at that). Writing programs that work on GPUs requires a
certain amount of expertise. Deep learning frameworks, however, typically hide that
complexity from users, allowing practitioners to build complicated models out of simple
building blocks using high-level mental models.
Compared to the previous waves of artificial intelligence, the complexity of building an
"intelligent" system using deep learning is largely delegated to the training process. The
amount of code that typically needs to be written to solve a problem that would have
previously required hundreds of thousands of lines of highly tailored code is now two orders
of magnitude lower.
There’s more. In the previous section we caught a glimpse of the inherent simplicity
of neural networks: simple functions inside other simple functions. With a high-level
language, such as Python, equipped with an expressive numerical library, such as
NumPy, one can code a basic neural network engine in a few hundred lines of code -
although writing a full-fledged deep learning framework requires considerably more
effort.
The takeaway is that code has become less of an asset, and companies and research
institutions have started to see open-source tools as a way to compete, establish
dominance and attract talent, rather than something to jealously keep behind closed
doors. It is not uncommon today to read about a new paper freshly posted on arXiv and
see it implemented on GitHub for the main deep learning frameworks the next day. The
low barriers to the implementation of models and subsequent experimentation, as well
as to the adoption of such models in production, have been a key ingredient of the deep
learning revolution.
It would be overkill to list all frameworks that have gained popularity in recent history. If
you started getting into deep learning during the first half of this decade, Theano, Torch
and Caffe would have been the natural choices, along with higher-level frameworks
using a lower-level framework as a backend, such as Keras.
A really quick peek into these frameworks will give us a chance to appreciate what
makes PyTorch unique in this landscape.
MODULAR FRAMEWORKS
Caffe was first released in 2013 as a C++ library with GPU support. In Caffe, networks
are typically built by declaratively configuring sequences of layers. Each layer knows
how to compute outputs given inputs (forward pass) and how to compute the change in
the outputs given a unit change in inputs and parameters (backward pass). The latter pass
is used during training to determine how parameters in a network need to change in order
to minimize the errors. We’ll look at this mechanism in detail in Chapter 5.
While using Caffe mainly entails creating configuration files, extending it requires
expertise in C++, which led a few authors to maintain their own separate forks of Caffe for
their experiments. Torch7, on the other hand, was targeted at researchers, allowing them
to create new modules and architectures from a high-level language and with high
performance. Torch7 is written in Lua, a high-level scripting language, and C, with
support for GPUs. Thanks to its simple, modular design it allowed heavy
experimentation. Like layers in Caffe, modules in Torch are defined by explicitly coding
their forward and backward passes. Although Lua is a clean, lightweight language with a very
fast interpreter, it has been a limiting factor in the adoption of Torch7, both because of
its limited popularity and the absence of a data science environment beyond Torch itself.
STATIC GRAPH FRAMEWORKS
Both Caffe and Torch greedily execute computations: forward and backward functions
are evaluated numerically as soon as their results are needed. Theano, on the other hand,
is a symbolic computation engine, written in Python and C++. With Theano, the user
only has to specify the forward pass, as one would write a mathematical expression using
symbols on paper. Rather than being evaluated immediately, these expressions are
compiled into a symbolic computation graph. By leveraging symbolic differentiation,
Theano can compute derivatives of functions built out of its building blocks, thus
generating backward passes automatically. In addition, it can optimize the evaluation of
expressions, e.g. simplifying terms appearing in both the numerator and the denominator of
an expression, or handling numerically unstable situations. The drawbacks are a codebase
that is harder to develop and models that are harder to debug.
Figure 1.7 Static graph for a simple computation corresponding to a single neuron. The mathematical expression (first row) is compiled into a symbolic graph where each node represents individual operations (second row), using placeholders for inputs and outputs. The graph is then evaluated numerically (third row). The gradient of the output with respect to the weights is constructed symbolically by traversing the graph backwards and multiplying the gradients at individual nodes (fourth row). The corresponding mathematical expression is shown in the fifth row.
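The define-then-run idea behind static graphs can be illustrated with a toy symbolic engine in plain Python. This is a sketch under our own assumptions, not how Theano or TensorFlow are implemented: the classes Node, Placeholder, Const, Add, Mul and Tanh are hypothetical, derivatives are built by recursive symbolic differentiation, and no graph optimization is attempted. The point is only that the expression for a single neuron, and the expression for its gradient, exist as graphs before any number is fed in.

```python
import math

class Node:
    """A symbolic expression; building one performs no numerical work."""
    def __add__(self, other):
        return Add(self, other)
    def __mul__(self, other):
        return Mul(self, other)

class Placeholder(Node):
    def __init__(self, name):
        self.name = name
    def evaluate(self, feed):
        return feed[self.name]
    def grad(self, wrt):
        return Const(1.0 if self is wrt else 0.0)

class Const(Node):
    def __init__(self, value):
        self.value = value
    def evaluate(self, feed):
        return self.value
    def grad(self, wrt):
        return Const(0.0)

class Add(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b
    def evaluate(self, feed):
        return self.a.evaluate(feed) + self.b.evaluate(feed)
    def grad(self, wrt):
        return self.a.grad(wrt) + self.b.grad(wrt)

class Mul(Node):
    def __init__(self, a, b):
        self.a, self.b = a, b
    def evaluate(self, feed):
        return self.a.evaluate(feed) * self.b.evaluate(feed)
    def grad(self, wrt):
        # Product rule, expressed as a new symbolic graph.
        return self.a.grad(wrt) * self.b + self.a * self.b.grad(wrt)

class Tanh(Node):
    def __init__(self, a):
        self.a = a
    def evaluate(self, feed):
        return math.tanh(self.a.evaluate(feed))
    def grad(self, wrt):
        # Chain rule: (1 - tanh(a)**2) * da/dwrt, still symbolic.
        return (Const(1.0) + Const(-1.0) * (self * self)) * self.a.grad(wrt)

# Step 1: define the graph. No numbers are involved yet.
x, w, b = Placeholder("x"), Placeholder("w"), Placeholder("b")
y = Tanh(w * x + b)
dy_dw = y.grad(w)           # a second graph, derived from the first

# Step 2: run the graphs by feeding numbers into the placeholders.
feed = {"x": 2.0, "w": 0.5, "b": -1.0}
y_value = y.evaluate(feed)
dy_dw_value = dy_dw.evaluate(feed)
print(y_value, dy_dw_value)
```

Note how a condition or a loop over the data would have to be represented as yet another node type in such an engine, since the host language only runs while the graph is being defined.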
At the end of 2015, the Google Brain Team released TensorFlow, a library that took a
similar approach to Theano, that is, compiling symbolic graphs and evaluating backward
passes by leveraging symbolic differentiation. In addition, TensorFlow allowed
training to happen in parallel on clusters of machines, making it the go-to framework for
large workloads and arguably the leading deep learning framework in general. At the
same time Keras gained a TensorFlow backend, so users writing their models in Keras
could use Theano or TensorFlow interchangeably.
Building a network symbolically in the form of a static graph requires expressions as well
as constructs (like conditions and iterations) to be built on top of the host language.
Frameworks using symbolic graphs become in some form a language on top of the host
language, which adds cognitive overhead and typically makes them harder to extend. As
a reaction to this overhead, 2016 saw the rise of dynamic graph frameworks, like
Chainer, DyNet and, ultimately, PyTorch.
DYNAMIC GRAPH FRAMEWORKS
We’ve seen that TensorFlow and Theano build the graph statically: the
computation graph is first compiled as prescribed by the symbolic code, and it is then
executed by an engine which replaces symbols with numbers. In dynamic graph
frameworks, on the other hand, computations are still recorded in a graph and
gradients computed automatically, but the graph is defined and evaluated greedily, as
prescribed by the host language. The catch phrase here is "Define by run": the
computation graph grows dynamically as individual computations are executed.
PyTorch was one of the first dynamic graph frameworks to gain considerable popularity.
Its rapid growth in the months following its first release quickly allowed PyTorch to
become the leading dynamic graph framework and one of the two go-to frameworks,
together with TensorFlow, for the AI research community.
Dynamic graphs can change during successive forward passes. For instance, different
nodes can be instantiated based on the outputs of the preceding nodes, without a need for
the decision-making logic to be represented in the graph itself. Conditionals and loops,
for instance, are not encoded in the computation graph: they are evaluated in the host
language and the resulting codepath is then computed by the engine. This strategy strikes
a good balance between the need for automatic differentiation and code optimization, and
the need for a tool that is easy to program and debug (stack traces correspond to stack
traces of the host language) and integrates nicely with the rest of the ecosystem - it feels
and behaves just like a library.
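The "Define by run" idea can be illustrated with a small tape-style sketch in plain Python. It is a toy under our own assumptions, not PyTorch's autograd: the Value class and the run function are hypothetical, and only scalar addition, multiplication and tanh are supported. Each operation is evaluated immediately and, at the same time, records its inputs, so the graph is whatever the Python control flow happened to execute.

```python
import math

class Value:
    """A number that remembers which operation produced it."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents            # values this one was computed from
        self.local_grads = local_grads    # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Order the recorded graph so every node follows the nodes it uses.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)
        visit(self)
        # Traverse backwards, accumulating gradients with the chain rule.
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad

def run(x_val, w_val, b_val, max_layers=3):
    x, w, b = Value(x_val), Value(w_val), Value(b_val)
    y, n_layers = x, 0
    # Ordinary Python control flow decides how many nodes the graph gets.
    while n_layers < max_layers and abs(y.data) > 0.1:
        y = (w * y + b).tanh()
        n_layers += 1
    y.backward()
    return y, w, b, n_layers

y, w, b, n_layers = run(2.0, 0.5, 0.0)
print(n_layers, y.data, w.grad, b.grad)
```

Calling run with different inputs can produce graphs with a different number of nodes, yet the while loop itself never appears in the graph; only the operations that were actually executed do.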
Figure 1.8 Dynamic graph for a simple computation corresponding to a single neuron. The upper half of the figure shows the computation broken down in individual statements, which are greedily evaluated as they are defined. The program has no notion of the interconnection between computations. The lower half of the figure shows the construction of a dynamic computation graph for the same expression: the expression is still broken down in individual statements that are greedily evaluated, while the graph is built incrementally. Automatic differentiation is achieved by traversing the graph backwards, the same way as for static computation graphs. Dynamic graphs can change during successive forward passes, for instance different nodes can be instantiated according to conditions on the outputs of the preceding nodes, without a need for such conditions to be represented in the graph itself, as is needed for static graphs.
Many frameworks deserve a mention that we have not given them. To make up for it, we
collected a few, each with a reference, in the following figure.
Figure 1.9 An (incomplete) map of deep learning frameworks as of 2017.
With so much on offer, why should a practitioner choose PyTorch? It was released
fairly recently, so it is presumably less mature; it has less tooling than other frameworks;
and one can find fewer pre-implemented models around, although that’s changing quickly.
Here’s a figure showing the adoption of deep learning tools over time.
Figure 1.10 Trends in the adoption of deep learning frameworks over time, measured as number of mentions, number of contributors, number of stars and number of forks.
We can see that the adoption of PyTorch has been steep, far beyond what other dynamic
computation graph frameworks, like Chainer and DyNet, have experienced. Arguably,
one of the reasons why PyTorch immediately gained popularity lies in its heritage.
Despite its young age, PyTorch stands on very solid shoulders. Taking a moment to trace
its history will help us appreciate the mindset that has motivated several talented
researchers to build Torch first and PyTorch today.
1.2 Where PyTorch Comes From
A good place to dig into where PyTorch comes from is its LICENSE file
(github.com/pytorch/pytorch). Here we can see the history of maintainers of PyTorch and
its parent project Torch over the years.
Copyright (c) 2016- Facebook, Inc (Adam Paszke)
Copyright (c) 2014- Facebook, Inc (Soumith Chintala)
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Figure 1.11 Timeline of Torch projects, with the year of release and the programming languages used in the release.
One can follow the history of Torch from the web page of Ronan Collobert
(ronan.collobert.com), who is now a Research Scientist at FAIR and who wrote the very
first Torch when he was a PhD student at the "Istituto Dalle Molle di Intelligenza
Artificiale Percettiva" (Dalle Molle Institute for Perceptive Artificial Intelligence, now
IDIAP Research Institute) in Martigny, Switzerland.
The first ever Torch project was released in 2001 under the name of SVMTorch. It was a
C++ library focused on Support Vector Machines for classification and regression
problems. It can still be found at bengio.abracadoudou.com/SVMTorch.html, for those
who like software archeology. It didn’t deal with neural networks and doesn’t share any
code with PyTorch, but at least it gave a name to its successors.
An early Torch that can still be found online is Torch3 (torch.ch/torch3). It was released
in 2001 under a liberal BSD license, and it focused on training neural networks, including
convolutional neural networks. It was written in C++, a language that the author didn’t
particularly like: "I hate C++. Too much complicated", Ronan wrote in the manual. The
author liked the terseness and efficiency of C, but resorted to C++ classes, like Matrix
and Vector, to provide some degree of modularity and composition to the system.
In the early 80’s, as C++ was being conceived as an object-oriented language on top of
C, another language made its appearance. It drew object-orientation concepts from
Smalltalk, and unlike C++ it was implemented as a strict superset of C. The language was
Objective-C, which would then be adopted by Steve Jobs for his NeXT workstations and
later for Mac OS X. The language has an extremely dynamic object system on top of a
plain C core. This makes it easy to write efficient, stateless C code glued together by
loosely-coupled, high-level objects.
When he mentioned complication, Ronan probably felt that C++ could lead developers to
trade efficiency for abstractions and end up with a rigid design. For software that had to
focus on computing things as fast as possible, this trade-off had its downsides. He was an
Objective-C admirer, so he went for it, and created Torch4 (github.com/andresy/torch4).
Unfortunately, in 2004, the number of developers using Objective-C was small, even more
so in scientific computing, so Torch4 did not gain a lot of traction, as developers were not
willing to learn a new, different-looking language to make the switch. Torch4 offered
more or less what Torch3 offered in terms of features, but delivered in a simpler design.
Looking at the Torch4 code, one can find the traces of what would become the kernels at
the basis of PyTorch.
In Torch5 (torch5.sourceforge.net), released in 2008, the role of Objective-C was
taken over by Lua, a high-level language with a very lightweight interpreter. Lua is fast
and features straightforward interoperability with C. The basic idea was similar to Torch4
and has characterized Torch to date: provide a high-level layer of lightweight objects
(like matrices) manipulated by stateless, optimized C functions. While Torch4 defined
matrices and vectors as its basic data structures, Torch5 introduced Tensors, the
generalization of matrices to multiple dimensions, which are still at the basis of PyTorch
today.
Torch7 (torch.ch), released in 2011, was a direct evolution of Torch5. Several low-level
kernels were re-implemented to take advantage of parallelism, by leveraging OpenMP,
for multi-threaded programming on the CPU, and CUDA, for execution of massively
parallel computations on the GPU. In addition, the Lua interpreter was changed to
LuaJIT, a highly optimized interpreter featuring a just-in-time compiler that made Lua
one of the fastest, if not the fastest, high-level scripting languages. As a result of such a
lean and extremely careful design, Torch was regarded as one of the fastest deep learning
frameworks on the market [REF]. Research institutions and R&D companies or divisions,
such as DeepMind and FAIR, started to adopt Torch as their tool of choice for research
and experimentation on large-scale problems. Torch was not a library for beginners, but
if you knew what you were doing it would offer you a lot of flexibility and performance.
While Lua provided Torch with a simple design and high efficiency, the world of data
science was increasingly adopting Python as its favorite language. This took place
thanks to the maturity of NumPy and SciPy as well as the advent of dedicated packages
like Scikit-learn and Pandas, in addition to Theano, TensorFlow and Keras. This growing
ecosystem made Python popular in data science despite the fact that the standard Python
interpreter (CPython) is not particularly fast compared to JIT-enabled interpreters (like
LuaJIT for Lua or V8 for JavaScript).
As a further limiting factor, Torch did not provide automatic differentiation out of the
box. Writing a new layer or a new objective function required manually hard-coding
derivatives for all expressions, which can be tedious or unwieldy for a practitioner who
is experimenting with new architectures. In contrast, the symbolic computation engines at
the core of Theano and TensorFlow are capable of automatically computing derivatives
given a forward expression. In 2015 Twitter Cortex released an implementation of an
automatic differentiation engine for Torch that alleviated this limitation up to a certain
point, but the appeal of other libraries written in Python and featuring symbolic
computation capabilities posed strong competition to Torch, especially after the launch
of TensorFlow.
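To make the contrast concrete, here is a small sketch comparing a derivative written out by hand with one obtained through automatic differentiation. It uses the autograd API of recent PyTorch releases (0.4 or later) rather than Lua Torch, and the single tanh neuron and both function names are hypothetical choices made for illustration.

```python
import torch

def loss_fn(w, x, t):
    # Squared error of a single tanh neuron (illustrative choice).
    return (torch.tanh(w * x) - t) ** 2

def manual_grad(w, x, t):
    # Derivative worked out by hand with the chain rule:
    # d/dw (tanh(w x) - t)^2 = 2 (tanh(w x) - t) (1 - tanh(w x)^2) x
    y = torch.tanh(w * x)
    return 2 * (y - t) * (1 - y ** 2) * x

x = torch.tensor(0.5)
t = torch.tensor(1.0)
w = torch.tensor(0.3, requires_grad=True)

loss = loss_fn(w, x, t)
loss.backward()                          # autograd fills in w.grad
by_hand = manual_grad(w.detach(), x, t)  # same quantity, derived manually
```

The two results agree up to floating-point error. The difference is that `manual_grad` has to be rewritten every time the forward expression changes, while the autograd version follows the forward code automatically.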
PyTorch finally spun off from Torch in 2016 to address the limitations we just
mentioned. The idea to create a "Python Torch" by wrapping the same low-level C and
CUDA libraries as in Torch and adding autograd functionality to it had been in Soumith
Chintala’s mind for a while. Soumith had joined FAIR a couple of years earlier and, on
top of being a researcher there, working on generative adversarial networks, he was the
maintainer of Lua Torch. The Python Torch idea was sitting on his stack of things to try
out when Adam Paszke, a young and talented intern from Warsaw, Poland, arrived at
FAIR in the summer of 2016. Within mere months PyTorch took shape at a rapid
pace, and by the end of the year it was ready for its first public appearance.
PyTorch retained the same basic libraries for tensors and neural network operations, but it
was very Pythonic on the surface, integrating seamlessly with NumPy and the rest of the
ecosystem. On top of that, an engine provided automatic differentiation enabled by a
dynamic computation graph; it was initially written in Python and then migrated to C++
for performance. While the complexity of the system has increased compared to Torch7,
it has retained the same clean design and extensibility. After the first release and the
wave of adoption that followed, the core team has been extended and multiple work
streams have been established, always with an eye to speed and lean design.
1.3 Why PyTorch
We just recounted some of the history behind PyTorch, which hopefully led us to
appreciate the main design goals that have characterized the project since its inception.
However, why should one choose PyTorch today, given the growing number of very
capable tools we learned about only a couple of sections back?
A design driver for PyTorch is expressivity, that is, allowing a developer to implement
complicated models without extra complexities imposed by the framework. When a new
paper comes out and a practitioner sets out to implement it, the most desirable thing for a
tool is for it to stay out of the way. The less overhead there is in the process, the quicker
and more successful the implementation and the experimentation that eventually
follows will be. PyTorch arguably offers one of the most seamless translations of ideas
into Python code available in the deep learning landscape, and it does so without
sacrificing performance. While featuring an expressive and user-friendly high-level layer,
PyTorch is not a high-level wrapper on top of a lower-level library, so it does not require
the beginner to learn another tool, like Theano or TensorFlow, when models become
complicated. Even in case new low-level kernels need to be introduced, say
convolutions on hexagonal lattices, PyTorch offers a low-overhead pathway to achieve
that goal.
Directly linked to the previous point is the ability to debug PyTorch code. Debugging is
currently one of the main pain points of frameworks relying on static computation
graphs. In these frameworks, execution happens after the model has been defined in its
entirety and the code has been compiled by the symbolic graph engine. This creates some
disconnect between a bug in the code and its effect on the execution of the entire graph.
In PyTorch execution is greedy: statements are executed at the time they are invoked in
Python. After the execution of a statement, the data it generated is immediately available
for inspection. This makes debugging more direct.
In other words, its greedy execution model makes PyTorch behave just like any other
Python library, such as NumPy, only with GPU acceleration, neural network
kernels and automatic differentiation. This applies to debugging as well as to integrating
PyTorch with other libraries - like writing a neural network operation using SciPy, for
instance.
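A tiny sketch of what greedy execution means in practice, assuming a standard PyTorch installation (the variable names are arbitrary): each statement produces a concrete result that can be checked right away, with the same tools one would use for any other Python object.

```python
import torch

x = torch.ones(2, 3)
h = x * 2.0          # executed right here, not deferred to a later run

# The result is an ordinary object we can inspect on the spot,
# for instance from a debugger or with a plain assert.
assert h.shape == (2, 3)

total = h.sum().item()   # a regular Python float
```

Setting a breakpoint after the second statement would show `h` already holding its values, which is the property that makes debugging more direct.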
From an ecosystem perspective, PyTorch embraces Python, the emergent programming
language for data science. PyTorch compensates for the impact of the Python interpreter on
performance through an advanced execution engine, but it does so in a way that is fully
transparent to the user, both during development and during debugging. PyTorch also
features seamless interoperation with NumPy. On the CPU, NumPy arrays and Torch
tensors can even share the same underlying memory and be converted back and forth at
no cost.
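The memory sharing can be observed directly. The sketch below assumes NumPy and PyTorch are both installed and uses `torch.from_numpy` and `Tensor.numpy`, which convert between the two types on the CPU without copying the data.

```python
import numpy as np
import torch

a = np.zeros(3, dtype=np.float32)
t = torch.from_numpy(a)   # no copy: the tensor views the array's buffer
t[0] = 1.0                # writing through the tensor changes the array too

b = t.numpy()             # converting back is also copy-free
b[1] = 2.0                # this write is visible through both a and t
```

After these statements `a`, `t` and `b` all refer to the same three numbers. The flip side is that an in-place change made through one of them is seen by the others, so an explicit copy is needed when independent data is required.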
An important aspect is the ability of a deep learning model to be deployed in production
on a number of architectures, from GPU clusters to low-footprint devices, even to
mobile devices. PyTorch can be deployed on clusters thanks to its distributed computing
capabilities, but it is not designed to be deployed on a phone. However, computation
graphs can be exported to a neural network interoperability representation, namely the
Open Neural Network Exchange (ONNX, github.com/onnx/onnx). This allows a model
defined and trained with PyTorch to be deployed to ONNX-compliant frameworks
optimized for inference, like Caffe2 (caffe2.ai), which runs on iOS and Android as well
as a host of other architectures, provided that the model satisfies a few basic
requirements.
These and many other advantages that we will discover throughout the book make
PyTorch one of the most interesting deep learning frameworks available, and possibly
one of the leading tools for deep learning in the near future.
Before we finally set out for our journey with PyTorch, we will spend the last section of
this chapter mapping out its structure, in terms of components and how they interoperate.
This mental map will help us understand what happens and where it is happening when
we run our first lines of PyTorch.
1.4 The Anatomy of PyTorch
We have already hinted at a few components in PyTorch. Let’s now take some time to
formalize a high-level map of the main architectural components. In fact this is going to
be the only time we’ll look at what’s under the hood. This book will mostly deal with the
top-most, user-facing layer - the Python module. We won’t go into too much detail about
it now - we will have a chance to enrich this description along the way.
Figure 1.12 Anatomy of PyTorch, showing a high-level Python API (top), the C++ autograd/JIT engine (mid), and the C/CUDA low-level libraries (bottom). Each level is exposed to the upper levels through automatic wrapping. The result is a loosely-coupled system, with stateless low-level building blocks, a high performance engine and an expressive high-level API.
We just mentioned that at the top-most level PyTorch is a Python library. It exposes a
very convenient API for dealing with tensors and performing operations over them, as
well as building neural networks and training them via optimizers. In Torch tradition, the
Python layer is actually pretty thin: it is designed to prescribe computations, but not to
compose them or execute them. This is delegated to lower layers for performance
reasons.
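As a rough illustration of what the Python layer looks like from the user's side, the sketch below fits a one-parameter linear model to toy data using tensors, a neural network module and an optimizer. It assumes a recent PyTorch release; the data, learning rate and number of steps are arbitrary choices for the example.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Toy data for y = 2x (illustrative only).
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * x

model = nn.Linear(1, 1)                        # a one-weight "network"
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

first_loss = None
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # the Python layer prescribes the computation
    loss.backward()               # the engine computes the gradients
    optimizer.step()              # the optimizer updates the parameters
    if first_loss is None:
        first_loss = loss.item()
final_loss = loss.item()
```

Every line here is plain Python, yet the heavy lifting (the tensor arithmetic and the gradient computation) happens in the layers underneath.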
Right under the Python layer we find an execution engine written in C++. The engine
includes autograd, which manages the dynamic computation graph and provides
automatic differentiation, and a jit (just-in-time) compiler that traces computation steps
as they are performed and optimizes them for repeated executions. We’ll
talk about this feature later in the book. For now, it is worth mentioning that many of the
features that make PyTorch unique, such as very fast automatic differentiation, come
from this layer.
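The tracing idea can be sketched with `torch.jit.trace`, which is available in PyTorch 1.0 and later (it is not part of the 0.3.1 release). The single-neuron function below is a hypothetical example: tracing runs it once on sample inputs, records the operations performed, and returns a callable that replays them.

```python
import torch

def neuron(x, w, b):
    # Hypothetical single-neuron function used only for illustration.
    return torch.tanh(w * x + b)

# Example inputs: tracing records the operations executed on them.
example = (torch.tensor(1.0), torch.tensor(0.5), torch.tensor(0.1))
traced = torch.jit.trace(neuron, example)

x = torch.tensor(2.0)
out_eager = neuron(x, example[1], example[2])    # regular Python execution
out_traced = traced(x, example[1], example[2])   # replay of the recorded steps
```

Both calls return the same value. Since a trace only records the operations that ran on the example inputs, data-dependent Python control flow is not captured by it, which is one reason to keep functions passed to the tracer free of such branches.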
At the lowest layers, we find all the core libraries doing the actual computing. A series of
plain C libraries provide very efficient data structures, the tensors (a.k.a.
multi-dimensional arrays), for CPU and GPU (TH and THC, respectively), as well as
stateless functions that implement neural network operations and kernels (THNN and
THCUNN) or wrap optimized libraries such as NVIDIA’s cuDNN. Other libraries deal
with distributed (multi-machine) and sparse (multi-dimensional arrays where most of the
entries are zero) tensor implementations. A lot of the code in this layer comes from
Torch7 and Torch5 before it.
A library named ATen automatically wraps the low-level C functions in a convenient
C++ API. ATen provides its tensor classes to the engine and it is automatically wrapped
and exposed to Python. Similarly, the neural network function libraries are automatically
wrapped towards the engine and Python API. Such automatic wrapping of low-level code
contributes to keeping the code loosely coupled, decreasing the overall complexity of the
system and encouraging further development.
Despite this layered structure, the Python API is all a practitioner needs to use PyTorch
proficiently. Still, awareness of the anatomy of the whole system will help us
understand API design and error messages to a greater extent.
1.5 Wrapping up
In this chapter we introduced where the world stands with deep learning and what tools
one can use to be part of the revolution. We have taken a peek at what PyTorch has to
offer and why it is worth investing time and energy in it. Just prior to that, we have
looked at its origins, with the intent of explaining the underlying motivations and design
decisions behind Torch first and PyTorch now. Last, we have described what PyTorch
looks like from a bird’s-eye view.
As with any good story, wouldn’t it be great to take a peek at the amazing things PyTorch
will enable us to do once we’ve completed our journey? Hold tight, the next chapter is
aimed at exactly that.
1.6 Summary
Deep learning is about automatically learning representations from examples using deep neural networks
Neural networks consist of a composition of simple operations
Neural networks learn through weight updates by back-propagation of errors
Libraries like PyTorch allow us to build and train neural networks efficiently, moving computations to the GPU and automatically computing derivatives for back-propagating errors
PyTorch focuses on minimizing cognitive overhead, while retaining a focus on flexibility and speed