
MEAP Edition Manning Early Access Program

Deep Learning with PyTorch Version 6

Copyright 2019 Manning Publications

For more information on this and other Manning titles go to https://www.manning.com/

©Manning Publications Co. We welcome reader comments about anything in the manuscript - other than typos and other simple mistakes. These will be cleaned up during production of the book by copyeditors and proofreaders.

https://forums.manning.com/forums/deep-learning-with-pytorch


Welcome to Deep Learning with PyTorch!

Eli and Luca here. We’re ecstatic to have you with us. No, really — it’s a big deal for us, both terrifying and exhilarating. So, thanks!

Our best wish for this book is that it’ll help you develop your own intuition and stimulate your curiosity. PyTorch is an amazing library; it will give you new powers if you give it a few hours of your time.

We’re having a lot of fun writing this book, but it’d be pretty lame if we were the only ones having fun. We’re looking forward to hearing directly from you about what you like about the book, and what still needs work. We’re adamant that the manuscript be as clear and of as much practical utility as possible, so please reach out. We want to know how you feel about the book, both good and bad. The good will give us fuel for the journey, while the bad keeps us out of the weeds.

One note: at the time of this writing the released version of PyTorch is 0.3.1, while the pending 0.4 release has a fairly major change to the relationship between Tensors and Variables in PyTorch. We decided to target the currently released version, rather than rely on functionality you could only get by compiling the master branch. Be aware that if you’re trying to run the examples against PyTorch 0.4 you might run into some issues. We’ll get those cleared up as soon as we can after the release is final.

In the meantime, enjoy the book, say "hi!" on the forum, and we’ll chat again soon!

— Eli and Luca





brief contents

PART 1: CORE PYTORCH

1 PyTorch from 1 Mile Away
2 A Whirlwind Tour of PyTorch
3 It Starts with a Tensor
4 The World as Tensors
5 The Mechanics of Learning
6 Telling Birds from Airplanes - Learning from Images

PART 2: LEARNING FROM IMAGES IN THE REAL WORLD: EARLY DETECTION OF LUNG CANCER

7 How hard can curing cancer be?
8 Classifying Suspected Tumors
9 Monitoring Metrics: Precision, Recall, and Pretty Pictures
10 Determining What to Classify With Segmentation And Clustering
11 Data Improvements, Augmentation
12 PyTorch in Production
13 Revisiting CycleGAN’s zebras




Part 1: Core PyTorch



1 PyTorch from 1 Mile Away

This chapter covers

the key concepts behind deep learning

an overview of the tooling landscape

where PyTorch comes from

why one should choose PyTorch

the anatomy of PyTorch

1.1 Introduction

We are living through exciting times. The landscape of what computers can do is changing by the week. Tasks that only a few years ago were thought to require higher cognition are getting solved by machines at super-human levels of performance. For example:

describing a photographic image with a sentence in current English

playing complex strategy games

diagnosing a tumor from a radiological scan

are all now approachable by a computer. Even more impressively, the ability to solve such tasks is acquired by computers through examples, rather than encoded by a human as a set of hand-crafted rules.

It would be disingenuous to assert that machines are learning to "think" in any human sense of the word. Rather, we’ve discovered a general class of algorithms that are able to approximate complicated, non-linear processes very effectively. In a way, we’re learning that intelligence as we subjectively perceive it, a notion that we often conflate with self-awareness, is not required to successfully carry out the aforementioned tasks and many others.

That general class of algorithms we’re talking about falls under the definition of deep

learning, which deals with training mathematical entities named deep neural networks on

the basis of examples. Knowing how to choose and train a neural network gives access to

the ability to solve very complicated problems in an ever-increasing number of

domains, from computer vision to natural language processing, provided that training

data are available.

It’s not easy to overstate how large an impact deep learning has had and will continue to

have on the world. Starting from computer vision and natural language processing all the

way to scene understanding and autonomous agents, established benchmarks obtained by

hand-crafted systems or humans themselves have been repeatedly blown away by

algorithms that carried very little prior knowledge about the task at hand apart from

what was provided through examples.

In the medical image analysis literature, for instance, elaborate pipelines of ad-hoc

algorithms used to identify an organ from a CT scan have been increasingly supplanted

by the use of convolutional neural networks [REF Section]. Figure 1.1 shows the

number of citations including the keyword deep learning in PubMed, a popular index of

medical peer-reviewed articles, over the years, as a testimony of this trend. In the 2017

edition of MICCAI, the largest international medical image analysis conference, over

50% of the papers submitted used deep learning one way or another.

Figure 1.1 Number of publications mentioning deep learning on PubMed, the leading index of peer-reviewed biomedical journal articles, year over year.

Finding the right sequence of hard-coded steps to go from a CT scan to the outlines of a





liver, and do so robustly in the face of variability in the anatomy and in the imaging

process requires anatomical knowledge, algorithmic expertise and the right mix of

experience and trial and error. With the advent of deep learning, this task has turned into

feeding data representative of such variability into an appropriately designed system that

learns how to obtain the desired outlines. If we exclude the effort required to create the

outlines of the liver used as examples for learning, the practitioner is required to know a

lot less about medical imaging and processing pipelines compared to the past. This

initially caused some frustration outside the deep learning community in this and other

fields, but the compelling results obtained through the use of deep learning have

led to a wave of adoption and experimentation during the first years of the current

decade.

Figure 1.2 CT scan of a liver (top left) and its corresponding outline (top right). The bottom row offers an example of such variability: cross-sections of the liver in CT scans acquired from different subjects.

Surprisingly, the mathematical concepts behind deep learning are not prohibitively

advanced and the programming tools to train a deep neural network are very accessible.

This book focuses on one such tool, PyTorch, with the aim of covering enough

ground to allow the reader to solve practical problems with deep learning or explore new

models as they pop up on arXiv.

PyTorch is an amazing library for a deep learning practitioner. It doesn’t get in the way,

it minimizes cognitive overhead, ultimately allowing one to focus on what matters the

most - building and training deep learning models - without giving up performance

and scalability.

We believe that lowering the barrier to tools like PyTorch represents more than just a

way to facilitate the acquisition of new technical skills. It is a step towards equipping a

new generation of computer scientists, data scientists or students from a wide range of

disciplines, with a working knowledge of the tools that will run the world behind the

scenes during the next decades. It is imperative that such knowledge is widespread and

accessible, owing to the great power that lies in these methodologies. This is what

motivates us towards making this book as useful as possible.

We’ll start off by taking a look at the ongoing revolution brought in by deep learning

over the last few years. This will help us understand how PyTorch fits in the big picture

and why it’s a great idea to pick it up at this point in time.

1.1.1 The Deep Learning Revolution

Until the advent of deep learning in the late 2000’s, machine learning practitioners typically faced the following task: given a dataset of samples and desired outcomes, define transformations on the data until the resulting numbers, or features, allow a downstream algorithm, like a classifier, to produce correct outcomes on new data. This process, called feature engineering, very much in use today, is aimed at taking the original data and coming up with representations of the same data that can then be fed to an algorithm to solve a problem. For instance, in order to tell 1’s from 0’s in images of handwritten digits, one would come up with a set of filters to estimate the direction of edges over the image, and then train a classifier to predict the correct digit given a distribution of directions at multiple locations.

Deep learning, on the other hand, deals with finding such representations automatically, from raw data, in order to successfully perform a task. In the 1’s vs 0’s example, filters would be estimated during training, by iteratively looking at pairs of examples and target labels. This is not to say that feature engineering has no place with deep learning, in one form or another: we often need to inject some form of prior knowledge in a learning system. However, the ability of a neural network to ingest data and extract useful representations on the basis of examples is what makes deep learning so powerful.

Successful practitioners are not required to hard-code transformations of the data to extract those representations, but rather know how to make a mathematical entity discover them from training data. As with many disruptive technologies, this fact has led to a change in perspective [Figure 1.3].


Figure 1.3 The change in perspective brought in by deep learning. Left: the practitioner is busy defining engineering features and feeding them to a learning algorithm; the results on a task will be as good as the features he or she engineers. Right: with deep learning, the raw data is fed to an algorithm that extracts hierarchical features automatically, based on optimizing the performance of the algorithm on the task; the results will be as good as the ability of the practitioner to drive the algorithm towards its goal.

An important factor that determined the success of deep learning is its inherent

underlying simplicity. At the basis of deep learning are neural networks, mathematical

entities capable of representing complicated functions, i.e. transformations from inputs to

outputs, through a composition of simpler functions. The shape of those simpler

functions and the way such composition is realized makes it possible for a neural network

to learn how to approximate functions whose inputs and outputs are far from each other,

like an image and an English sentence that describes it, from a large number of

input/output pairs.

The term neural network is obviously suggestive of a link to the way our brain works. As

a matter of fact, although the initial models were inspired by neuroscience [REF

Perceptron], modern artificial neural networks bear very little resemblance to the

mechanisms at the basis of neural networks in the brain. It seems however likely that

both artificial and physiological neural networks have selected similar mathematical

strategies for approximating complicated functions that happen to work very effectively.

The unit at the basis of such decomposition of complicated functions into simpler units is the neuron. At its core, it is nothing but a linear transformation of the input (e.g. multiplication of the input by a number, the weight, and addition of a constant, the bias) followed by the application of a fixed non-linear function (referred to as the "activation function"). Mathematically we can write this out as

y = f(w * x + b)

In general, x and y can be simple scalars, or vector-valued (meaning holding many scalar values), in which case w is a matrix and b is a vector. In this case, the expression above is referred to as a layer of neurons, as represented in Figure 1.4.

NOTE For more detail on matrix and vector multiplication, please see Appendix 1, Linear Algebra.

Figure 1.4 An artificial neuron: a linear transformation enclosed in a non-linear function.

Both parts of the neuron, the linear transformation and the subsequent non-linear function, called the activation function, are crucial. Without the activation function, successive linear transformations can be simplified to a single linear transformation, which is nothing that’s capable of the complex behavior we desire. Likewise, without the linear transformations, the activation function (typically) has no parameters, and so the network is unable to learn or change. The ability of an ensemble of neurons to act as universal approximators, that is, to be able to approximate a very wide range of functions, depends on the combination of the linear and non-linear behavior inherent to each neuron.
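To make the point about combining linear and non-linear behavior concrete, here is a minimal Python sketch of a scalar neuron. It is our own illustration, not PyTorch code: the parameter values and the helper names neuron, identity, two_linear, collapsed and two_tanh are made up. It shows that two stacked neurons without an activation function reduce to one linear transformation, while the same two neurons with tanh in between do not.

```python
import math


def neuron(x, w, b, activation=math.tanh):
    """A scalar neuron: linear transformation, then a fixed non-linearity."""
    return activation(w * x + b)


def identity(z):
    """Stand-in for 'no activation function'."""
    return z


# Made-up parameters for two stacked neurons (illustrative only).
w0, b0 = 0.5, -1.0
w1, b1 = -2.0, 0.3


def two_linear(x):
    # Two stacked neurons with the activation removed.
    return neuron(neuron(x, w0, b0, identity), w1, b1, identity)


def collapsed(x):
    # The same map written as ONE linear transformation:
    # w1 * (w0 * x + b0) + b1 == (w1 * w0) * x + (w1 * b0 + b1)
    return (w1 * w0) * x + (w1 * b0 + b1)


def two_tanh(x):
    # With tanh in between, no single linear transformation reproduces this.
    return neuron(neuron(x, w0, b0), w1, b1)


for x in (-2.0, 0.0, 1.5):
    print(x, two_linear(x), collapsed(x), two_tanh(x))
```

For every input, two_linear and collapsed print the same value, whereas the values of two_tanh do not lie on a straight line.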

Activation functions generally help accomplish this by mapping a wide range of inputs to a narrow range of outputs. The tanh function accomplishes this by having arbitrarily large positive inputs asymptotically map to 1.0, while large negative inputs map to -1.0. This allows the neuron to be sensitive to a narrow range of input values, while treating anything outside of that range as equivalent.

>>> import math
>>> math.tanh(1)
0.7615941559557649
>>> math.tanh(2)
0.9640275800758169
>>> math.tanh(3)
0.9950547536867305

Inputs above 3.0 are all approximately equal to 1.0, which is considered "saturated" (as would be inputs below -3.0).

The activation function is typically fixed, as it doesn’t depend on parameters that need to be optimized. The range of input values for which a neuron’s response is not saturated then depends on how the preceding linear transformation shifts and scales the input. Therefore, neural networks can learn by changing the individual w and b values of the linear transformation for all their neurons. By measuring the error made by the network on a specific task, like a classification or a regression, one can define the learning process as the process of changing w and b throughout the network so that the error decreases.

A multi-layer neural network is made up of a composition of the above functions, that is

x_1 = f(w_0 * x + b_0)
x_2 = f(w_1 * x_1 + b_1)
...
y = f(w_n * x_n + b_n)

where the output of a layer of neurons is used as an input for the following layer. Again, training consists in finding good values for w_0, w_1, … w_n and b_0, b_1, … b_n so that the resulting network correctly carries out a task, such as predicting likely temperatures given geographic coordinates and time of the year.
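The layer-by-layer composition can be sketched in a few lines of plain Python for the vector-valued case, where each w is a matrix (a list of rows) and each b is a vector. This is only an illustration under our own assumptions: the function names layer and network, the layer sizes and all parameter values are made up, and a real framework would use an optimized tensor library instead of nested lists.

```python
import math


def layer(x, W, b, f=math.tanh):
    """One layer of neurons: y_i = f(sum_j W[i][j] * x[j] + b[i])."""
    return [
        f(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


def network(x, params):
    """Feed the output of each layer to the following one."""
    for W, b in params:
        x = layer(x, W, b)
    return x


# Made-up parameters: 2 inputs -> 3 hidden units -> 1 output.
params = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]

y = network([1.0, 2.0], params)
print(y)
```

Training would then consist of adjusting the numbers stored in params, which is exactly the role of the w_0 … w_n and b_0 … b_n above.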



Figure 1.5 Our mental model of an artificial neural network: composition of neurons, where the output of a neuron is used as input argument for other neurons. By estimating the values of w_n and b_n of all neurons, the network learns how to approximate complicated functions. The bottom row shows a neural network organized in layers, whereby neurons have multiple inputs and feed their outputs to layers downstream. By increasing the width of layers, we increase the capacity of the network, i.e. its ability to hold larger intermediate representations of the input data.

By carrying out a task successfully we mean obtaining a correct output on unseen data

produced by the same data-generating process. In this sense, the network is not allowed

to learn the data by heart, as it will typically perform badly on new data that is similar but

not identical to the data used during training, an effect known as overfitting. Instead, a

successfully trained network, through the value of its weights and biases, will capture the

inherent structure of the data in the form of meaningful numerical representations that

work correctly for previously unseen data.



Figure 1.6 Our mental model of the learning process: given input data and the corresponding desired outputs (ground truth), as well as initial values for the weights, the network is fed input data (forward pass) and the errors are evaluated by comparing the resulting outputs to the ground truth. In order to optimize the weights, the change in the error following a unit change in weights (the gradient of the error with respect to the parameters) is computed using the chain rule for the derivative of a composite function (backward pass). The value of the weights is then updated in the direction that leads to a decrease in the error. The procedure is repeated until the error, evaluated on unseen data, falls below an acceptable level.
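The loop of forward pass, error evaluation, backward pass and weight update can be sketched for a single tanh neuron in plain Python. This is a toy under stated assumptions: the data are generated from made-up parameters (w = 1.5, b = -0.5), the gradient is derived by hand with the chain rule, the learning rate and number of steps are arbitrary choices, and for brevity the error is measured on the training data rather than on unseen data.

```python
import math


def predict(x, w, b):
    return math.tanh(w * x + b)


def mean_squared_error(data, w, b):
    return sum((predict(x, w, b) - t) ** 2 for x, t in data) / len(data)


def gradients(data, w, b):
    """Gradient of the mean squared error with respect to w and b."""
    grad_w = grad_b = 0.0
    for x, t in data:
        y = predict(x, w, b)                 # forward pass
        # Chain rule: d(err)/dz = 2 * (y - t) * (1 - tanh(z)**2)
        dz = 2.0 * (y - t) * (1.0 - y * y)   # backward pass
        grad_w += dz * x
        grad_b += dz
    n = len(data)
    return grad_w / n, grad_b / n


# Toy ground truth generated with made-up parameters w=1.5, b=-0.5.
data = [(k / 4.0, math.tanh(1.5 * (k / 4.0) - 0.5)) for k in range(-8, 9)]

w, b = 0.0, 0.0                  # initial values for the weights
learning_rate = 0.5              # arbitrary choice for this toy problem
initial_error = mean_squared_error(data, w, b)

for step in range(2000):
    grad_w, grad_b = gradients(data, w, b)
    # Move the parameters in the direction that decreases the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_error = mean_squared_error(data, w, b)
print(initial_error, final_error, w, b)
```

On this toy problem the fitted w and b should end up close to the values used to generate the data; with real data one would also track the error on held-out samples, as the caption describes.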

The description so far is necessarily a simplification, but it captures the basic

mechanisms through which neural networks operate. In later chapters we will go in depth

into typical neural network architectures that have been devised to carry out specific

tasks, like recognizing objects in images.

We encourage the reader to build up an intuitive understanding of deep learning before

proceeding. To that effect, Grokking Deep Learning [REF] is a great resource for

developing a strong mental model and intuition on the mechanism underlying deep neural

networks. For a thorough introduction and reference, we direct the reader to Deep

Learning by Goodfellow et al [REF].

Deep learning allows us to carry out a very wide range of complicated tasks, like machine translation, playing strategy games or identifying objects in cluttered scenes, solely on the basis of experience. In order to do so in practice we need tools that are flexible, so they can be adapted to our specific problem, and efficient, to allow training to occur over large amounts of data and the network to perform correctly in the presence of uncertainty in the inputs. In the following section we will take a look at the evolution of tools for deep learning and understand where PyTorch stands in that landscape.

1.1.2 The Tools Behind the Revolution

We just learned that deep learning is a revolutionary technology, allowing computers to be programmed through examples. The revolution came about thanks to a few factors. First off, methodologies: we discovered how to train deep networks effectively. Second, technological advances: we discovered that we could use Graphics Processing Units (GPUs, specific processors dedicated to rendering 2D and 3D graphics) to massively speed up the computations involved in training a network. Third, the availability of large amounts of data, such as very large collections of natural images collected from the web.

A last, key ingredient to the adoption of deep learning across the board has been the availability of tools that made experimenting with neural network architectures accessible to users with only a basic training in mathematics and programming (PyTorch is one such tool, and a really good tool at that). Writing programs that work on GPUs requires a certain amount of expertise. Deep learning frameworks, however, typically hide that away from users, allowing practitioners to build complicated models out of simple building blocks and utilizing high-level mental models.

Compared to the previous waves of artificial intelligence, the complexity of building an "intelligent" system using deep learning is largely delegated to the training process. The amount of code that typically needs to be written to solve a problem that would have previously required hundreds of thousands of lines of highly tailored code is now two orders of magnitude lower.

There’s more. In the previous section we have taken a glimpse of the inherent simplicity of neural networks: simple functions inside other simple functions. With a high-level language, such as Python, equipped with an expressive numerical library, such as NumPy, one can code a basic neural network engine in a few hundred lines of code - although writing a full-fledged deep learning framework requires considerably more effort.

The takeaway is that code has become less of an asset, and companies and research institutions have started to see open-source tools as a way to compete, establish dominance and attract talent, rather than something to jealously keep behind closed doors. It is not uncommon today to read about a new paper freshly posted on arXiv and see it implemented on GitHub for the main deep learning frameworks the next day. The low barriers for the implementation of models and subsequent experimentation, as well as for the adoption of such models in production, have been a key ingredient of the deep learning revolution.

It would be overkill to list all frameworks that have gained popularity in recent history. If

you started getting into deep learning during the first half of this decade, Theano, Torch

and Caffe would have been the natural choices, along with higher-level frameworks

using a lower-level framework as a backend, such as Keras.

A really quick peek into these frameworks will give us a chance to appreciate what

makes PyTorch unique in this landscape.

MODULAR FRAMEWORKS

Caffe was first released in 2013 as a C++ library with GPU support. In Caffe, networks are typically built by declaratively configuring sequences of layers. Each layer knows how to compute outputs given inputs (forward pass) and how to compute the change in the outputs given a unit change in inputs and parameters (backward pass). The latter pass is used during training to determine how parameters in a network need to change in order to minimize the errors. We’ll look at this mechanism in detail in Chapter 5.

While using Caffe mainly entails creating configuration files, extending it requires expertise in C++, which led a few authors to maintain their own separate fork of Caffe for their experiments. Torch7, on the other hand, was targeted at researchers, allowing them to create new modules and architectures from a high-level language and with high performance. Torch7 is written in Lua, a high-level scripting language, and C, with support for GPUs. Thanks to its simple, modular design it allowed heavy experimentation. Like layers in Caffe, modules in Torch are defined by explicitly coding their forward and backward passes. Although Lua is a clean, lightweight language with a very fast interpreter, it has been a limiting factor in the adoption of Torch7, both for its limited popularity and the absence of a data science environment beyond Torch itself.
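The idea of a module that explicitly codes its forward and backward passes can be sketched in plain Python. The classes Linear and Tanh below are hypothetical and scalar-valued; they only mimic the contract described for Caffe layers and Torch modules and are not the API of either library.

```python
import math


class Linear:
    """Hypothetical scalar layer y = w * x + b with hand-written passes."""

    def __init__(self, w, b):
        self.w, self.b = w, b

    def forward(self, x):
        self.x = x  # remember the input, it is needed in the backward pass
        return self.w * x + self.b

    def backward(self, grad_output):
        # Gradients with respect to the parameters, kept for an update step.
        self.grad_w = grad_output * self.x
        self.grad_b = grad_output
        # Gradient with respect to the input, handed to the previous layer.
        return grad_output * self.w


class Tanh:
    """Hypothetical activation layer with no parameters."""

    def forward(self, x):
        self.y = math.tanh(x)
        return self.y

    def backward(self, grad_output):
        # d tanh(x) / dx = 1 - tanh(x)**2
        return grad_output * (1.0 - self.y ** 2)


layers = [Linear(0.5, 0.1), Tanh()]

# Forward pass: run the layers in order.
out = 2.0
for current in layers:
    out = current.forward(out)

# Backward pass: run the layers in reverse, starting from d(out)/d(out) = 1.
grad_input = 1.0
for current in reversed(layers):
    grad_input = current.backward(grad_input)

print(out, layers[0].grad_w, layers[0].grad_b, grad_input)
```

Every new kind of layer written this way needs its own hand-derived backward method, which is the work that the graph-based frameworks discussed next take over.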



STATIC GRAPH FRAMEWORKS

Both Caffe and Torch greedily execute computations: forward and backward functions are evaluated numerically as soon as their results are needed. Theano, on the other hand, is a symbolic computation engine, written in Python and C++. With Theano, the user only has to specify the forward pass, as one would write a mathematical expression using symbols on paper. Rather than being evaluated immediately, these expressions are compiled into a symbolic computation graph. By leveraging symbolic differentiation, Theano can compute derivatives of functions built out of its building blocks, thus generating backward passes automatically. In addition, it can optimize the evaluation of expressions, e.g. simplifying terms appearing both at the numerator and denominator of an expression, or handling numerically unstable situations. The drawbacks are a codebase that is harder to develop and models that are harder to debug.

Figure 1.7 Static graph for a simple computation corresponding to a single neuron. The mathematical expression (first row) is compiled into a symbolic graph where each node represents individual operations (second row), using placeholders for inputs and outputs. The graph is then evaluated numerically (third row). The gradient of the output with respect to the weights is constructed symbolically by traversing the graph backwards and multiplying the gradients at individual nodes (fourth row). The corresponding mathematical expression is shown in the fifth row.
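The two-step workflow of a static graph framework, first build a symbolic graph and then evaluate it with numbers, can be imitated with a toy expression engine. Everything below is a hypothetical sketch of ours (the classes Placeholder, Const, Add, Mul and Tanh are not Theano's API), and for simplicity the derivative graph is built by applying differentiation rules recursively rather than by the backward traversal shown in the figure.

```python
import math


class Placeholder:
    """A named input whose value is supplied only at evaluation time."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, feed):
        return feed[self.name]

    def diff(self, wrt):
        return Const(1.0 if self is wrt else 0.0)


class Const:
    def __init__(self, value):
        self.value = value

    def evaluate(self, feed):
        return self.value

    def diff(self, wrt):
        return Const(0.0)


class Add:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def evaluate(self, feed):
        return self.a.evaluate(feed) + self.b.evaluate(feed)

    def diff(self, wrt):
        return Add(self.a.diff(wrt), self.b.diff(wrt))


class Mul:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def evaluate(self, feed):
        return self.a.evaluate(feed) * self.b.evaluate(feed)

    def diff(self, wrt):
        # Product rule, expressed as new graph nodes.
        return Add(Mul(self.a.diff(wrt), self.b), Mul(self.a, self.b.diff(wrt)))


class Tanh:
    def __init__(self, a):
        self.a = a

    def evaluate(self, feed):
        return math.tanh(self.a.evaluate(feed))

    def diff(self, wrt):
        # d/du tanh(u) = 1 - tanh(u)**2, times the derivative of u.
        one_minus_sq = Add(Const(1.0), Mul(Const(-1.0), Mul(self, self)))
        return Mul(one_minus_sq, self.a.diff(wrt))


# Step 1: define the graph. No numbers are involved yet.
w, x, b = Placeholder("w"), Placeholder("x"), Placeholder("b")
y = Tanh(Add(Mul(w, x), b))
dy_dw = y.diff(w)  # a second symbolic graph, derived from the first

# Step 2: evaluate both graphs by replacing placeholders with numbers.
feed = {"w": 0.5, "x": 2.0, "b": 0.1}
y_value = y.evaluate(feed)
dy_dw_value = dy_dw.evaluate(feed)
print(y_value, dy_dw_value)
```

Note that nothing in step 1 can be inspected with ordinary print statements or a debugger in terms of actual values, which hints at why models built this way are harder to debug.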



At the end of 2015, the Google Brain Team released TensorFlow, a library that took a similar approach to Theano, that is, compiling symbolic graphs and evaluating backward passes by leveraging symbolic differentiation. In addition, TensorFlow allowed training to happen in parallel on clusters of machines, making it the go-to framework for large workloads and arguably the leading deep learning framework in general. At the same time Keras gained a TensorFlow backend, so users writing their models in Keras could use Theano or TensorFlow interchangeably.

Building a network symbolically in the form of a static graph requires expressions as well as constructs (like conditions, iterations) to be built on top of the hosting language. Frameworks using symbolic graphs become in some form a language on top of the host language, which requires cognitive overhead and typically makes it harder to extend. As a reaction to this overhead, the year 2016 saw the rise of dynamic graph frameworks, like Chainer, DyNet and, ultimately, PyTorch.

DYNAMIC GRAPH FRAMEWORKS

We’ve seen that TensorFlow and Theano build the graph statically, that is, the computation graph is first compiled as prescribed by the symbolic code, and it is then executed by an engine which replaces symbols with numbers. In dynamic graph frameworks, on the other hand, computations are still evaluated symbolically and gradients computed automatically, but the graph is defined and evaluated greedily, as prescribed by the host language. The catch phrase here is "Define by run": the computation graph grows dynamically as individual computations are executed.

PyTorch was one of the first dynamic graph frameworks to gain considerable popularity. Its rapid growth in the months following its first release quickly allowed PyTorch to become the leading dynamic graph framework and one of the two go-to frameworks, together with TensorFlow, for the AI research community.

Dynamic graphs can change during successive forward passes: for instance, different nodes can be instantiated based on the outputs of the preceding nodes, without a need for the decision-making logic to be represented in the graph itself. Conditionals and loops, for instance, are not encoded in the computation graph: they are evaluated in the host language and the resulting codepath is then computed by the engine. This strategy strikes a good balance between the need for automatic differentiation and code optimization, and the need for a tool that is easy to program and debug (stack traces correspond to stack traces of the host language) and integrates nicely with the rest of the ecosystem - it feels and behaves just like a library.
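"Define by run" can be illustrated with a very small reverse-mode automatic differentiation engine in plain Python. The Value class and the model function below are hypothetical and are not PyTorch's API; the sketch only shows a graph being recorded while ordinary Python code executes, including a Python if statement that never becomes part of the graph.

```python
import math


class Value:
    """A number that records how it was computed (toy autograd)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Order the recorded graph so each node comes after its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        # Traverse the graph backwards, applying the chain rule.
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


def model(x, w, b):
    y = (w * x + b).tanh()
    # Plain Python control flow decides which nodes get created.
    if y.data > 0.5:
        y = y * y
    return y


w, b = Value(0.5), Value(0.1)
out = model(Value(2.0), w, b)   # the graph is built while this line runs
out.backward()
print(out.data, w.grad, b.grad)
```

Calling model again with a different input may take the other branch and record a different graph, and any error raised along the way shows up as a regular Python stack trace.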



Figure 1.8 Dynamic graph for a simple computation corresponding to a single neuron. The upper half of the figure shows the computation broken down in individual statements, which are greedily evaluated as they are defined. The program has no notion of the interconnection between computations. The lower half of the figure shows the construction of a dynamic computation graph for the same expression: the expression is still broken down in individual statements that are greedily evaluated, while the graph is built incrementally. Automatic differentiation is achieved by traversing the graph backwards, the same way as for static computation graphs. Dynamic graphs can change during successive forward passes, for instance different nodes can be instantiated according to conditions on the outputs of the preceding nodes, without a need for such conditions to be represented in the graph itself, as is needed for static graphs.

There’s a large number of frameworks that have not been mentioned, although they

would greatly deserve it. To make up for it, we collected a few, together with a reference,

in the following Figure.



Figure 1.9 An (incomplete) map of deep learning frameworks as of 2017.

With so many options on offer, why should a practitioner choose PyTorch? It was released

fairly recently, so it is presumably less mature, it has less tooling than other frameworks,

and one can find fewer pre-implemented models around, although that’s changing quickly.

Here’s a figure showing the adoption of deep learning tools over time.

Figure 1.10 Trends in the adoption of deep learning frameworks over time, measured as number of mentions, number of contributors, number of stars and number of forks.

We can see that the adoption of PyTorch has been steep, far beyond what other dynamic

computation graph frameworks, like Chainer and DyNet, have experienced. Arguably,

one of the reasons why PyTorch immediately gained popularity lies in its heritage.

The thing is that, despite its young age, PyTorch stands on very solid shoulders. Actually,

taking a moment to trace its history will help us appreciate the mindset that has motivated

several talented researchers into building Torch first and PyTorch today.

1.2 Where PyTorch Comes From



A good place to dig into where PyTorch comes from is its LICENSE file (github.com/pytorch/pytorch). Here we can see the history of maintainers of PyTorch and its parent project Torch over the years.

Copyright (c) 2016- Facebook, Inc (Adam Paszke)
Copyright (c) 2014- Facebook, Inc (Soumith Chintala)
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)

Figure 1.11 Timeline of Torch projects, with the year of release and the programming languages used in the release.

One can follow the history of Torch from the web page (ronan.collobert.com) of Ronan Collobert, who is now a Research Scientist at FAIR and who wrote the very first Torch when he was a PhD student at the "Istituto Dalle Molle di Intelligenza Artificiale Percettiva" (Dalle Molle Institute for Perceptive Artificial Intelligence, now IDIAP Research Institute) in Martigny, Switzerland.

The first ever Torch project was released in 2001 under the name of SVMTorch. It was a C++ library focused on Support Vector Machines for classification and regression problems. It can still be found at bengio.abracadoudou.com/SVMTorch.html, for those who like software archeology. It didn’t deal with neural networks and doesn’t share any code with PyTorch, but at least it gave a name to its successors.

An early Torch that can still be found online is Torch3 (torch.ch/torch3). It was released

in 2001 under a liberal BSD license, and it focused on training neural networks, including

convolutional neural networks. It was written in C++, a language that the author didn't

particularly like: "I hate C++. Too much complicated", Ronan wrote in the manual. The

author liked the terseness and efficiency of C, but resorted to C++ classes, like Matrix

and Vector, to provide some degree of modularity and composition to the system.

In the early 80’s, as C++ was being conceived as an object-oriented language on top of

C, another language made its appearance. It drew object-orientation concepts from

Smalltalk, and unlike C++ it was implemented as a strict superset of C. The language was

Objective-C, which would then be adopted by Steve Jobs for his NeXT workstations and

later for Mac OS X. The language has an extremely dynamic object system on top of a

plain C core. It is easy to write efficient, stateless C code glued together by

loosely-coupled, high-level objects.

When he mentioned complication, Ronan probably felt that C++ could lead developers to

trade efficiency for abstractions and lead to a rigid design. For software that had to

focus on computing things as fast as possible, this trade-off had its downsides. He was an

Objective-C admirer, so he went for it, and created Torch4 (github.com/andresy/torch4).

Unfortunately, in 2004, the number of developers using Objective-C was small, and even

smaller in scientific computing, so Torch4 did not gain a lot of traction, as developers were not

willing to learn a new, different-looking language to make the switch. Torch4 offered

more or less what Torch3 offered in terms of features, but delivered in a simpler design.

Looking at the Torch4 code, one can find the traces of what would become the kernels at

the basis of PyTorch.

In Torch5 (torch5.sourceforge.net), released in 2008, the role of Objective-C was

replaced by Lua, a high-level language with a very lightweight interpreter. Lua features

straightforward interoperability with C and speed. The basic idea was similar to Torch4

and has characterized Torch to date: provide a high-level layer of lightweight objects

(like matrices) manipulated by stateless, optimized C functions. While Torch4 defined

matrices and vectors as its basic data structures, Torch5 introduced Tensors, the

generalization of matrices to multiple dimensions, which are still at the basis of PyTorch

today.

Torch7 (torch.ch), released in 2011, was a direct evolution of Torch5. Several low-level





kernels were re-implemented to take advantage of parallelism, by leveraging OpenMP,

for multi-threaded programming on the CPU, and CUDA, for execution of massively

parallel computations on the GPU. In addition, the Lua interpreter was changed to

LuaJIT, a highly optimized interpreter featuring a just-in-time compiler that made Lua

one of the fastest, if not the fastest, high-level scripting languages. As a result of such a

lean and extremely careful design, Torch was regarded as one of the fastest deep learning

frameworks on the market [REF]. Research institutions and R&D companies or divisions,

such as DeepMind and FAIR, started to adopt Torch as their tool of choice for research

and experimentation on large-scale problems. Torch was not a library for beginners, but

if you knew what you were doing it would offer a lot of flexibility and performance to a

practitioner.

While Lua provided Torch with a simple design and high efficiency, the world of data

science was increasingly adopting Python as its favorite language. This took place

thanks to the maturity of NumPy and SciPy as well as the advent of dedicated packages

like Scikit-learn and Pandas, in addition to Theano, TensorFlow and Keras. This growing

ecosystem made Python popular in data science despite the fact that the standard Python

interpreter (CPython) is not particularly fast compared to JIT-enabled interpreters (like

LuaJIT for Lua or V8 for JavaScript).

As a further limiting factor, Torch did not provide automatic differentiation out of the

box. Writing a new layer or a new objective function required manually hard-coding

derivatives for all expressions, which can be tedious or unwieldy for a practitioner who

is experimenting with new architectures. In contrast, the symbolic computation engines at

the core of Theano and TensorFlow are capable of automatically computing derivatives

given a forward expression. In 2015 the Twitter Cortex team released an implementation of an

automatic differentiation engine for Torch that alleviated this limitation up to a certain

point, but the appeal of other libraries written in Python and featuring symbolic

computation capabilities posed strong competition to Torch, especially after the launch

of TensorFlow.
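To make the difference concrete, here is a minimal sketch contrasting a hand-derived gradient with one computed by automatic differentiation. It assumes a recent PyTorch release (0.4 or later), where tensors created with `requires_grad=True` track gradients directly. The one-parameter model and the names `loss_fn` and `manual_grad` are illustrative choices, not taken from Torch or from this book.

```python
import torch

def loss_fn(w, x, y):
    # Sum of squared errors for a one-parameter linear model (illustrative).
    return ((w * x - y) ** 2).sum()

def manual_grad(w, x, y):
    # Hand-derived derivative: d(loss)/dw = sum(2 * (w*x - y) * x).
    # This is the kind of code one had to write for every new expression.
    return (2 * (w * x - y) * x).sum()

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor(0.5, requires_grad=True)

# Automatic differentiation: run the forward expression, then ask for gradients.
loss = loss_fn(w, x, y)
loss.backward()
auto = w.grad

# Manual differentiation on the same values (detached from the graph).
manual = manual_grad(w.detach(), x, y)

print(auto, manual)
```

Both values should agree. The point of the sketch is that changing `loss_fn` requires no change on the automatic side, while `manual_grad` would have to be re-derived by hand.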

PyTorch finally spun off from Torch in 2016 to address the limitations we just

mentioned. The idea to create a "Python Torch" by wrapping the same low-level C and

CUDA libraries as in Torch and adding autograd functionality to it had been in Soumith

Chintala’s mind for a while. Soumith had joined FAIR a couple of years earlier and, on

top of being a researcher there, working on generative adversarial networks, he was the

maintainer of Lua Torch. The Python Torch idea was sitting on his stack of things to try





out when Adam Paszke, a young and talented intern from Warsaw, Poland, arrived at

FAIR in the summer of 2016. In mere months PyTorch started to take form at a very

sustained speed and by the end of the year it was ready for its first public appearance.

PyTorch retained the same basic libraries for tensors and neural network operations, but it

was very Pythonic on the surface, integrating seamlessly with NumPy and the rest of the

ecosystem. On top of that, a C++ engine provided automatic differentiation enabled by a

dynamic computation graph engine, initially written in Python and then migrated to C++

for performance. While the complexity of the system has increased compared to Torch7,

it has retained the same clean design and extensibility. After the first release and the

wave of adoption that followed, the core team has been extended and multiple work

streams have been established, always with an eye to speed and lean design.

1.3 Why PyTorch

We just recounted some of the history behind PyTorch, which hopefully led us to

appreciate the main design goals that have characterized the project since its inception.

However, why should one choose PyTorch today, given the growing number of very

capable tools we learned about only a couple of sections back?

A design driver for PyTorch is expressivity, that is, allowing a developer to implement

complicated models without extra complexities imposed by the framework. When a new

paper comes out and a practitioner sets out to implement it, the most desirable thing for a

tool is for it to stay out of the way. The less overhead there is in the process, the quicker

and more successful will be the implementation and the experimentation that will

eventually follow. PyTorch arguably offers one of the most seamless translations of ideas

into Python code available in the deep learning landscape, and it does so without

sacrificing performance. While featuring an expressive and user-friendly high-level layer,

PyTorch is not a high-level wrapper on top of a lower-level library, so it does not require

the beginner to learn another tool, like Theano or TensorFlow, when models become

complicated. Even when new low-level kernels need to be introduced, say

convolutions on hexagonal lattices, PyTorch offers a low-overhead pathway to achieve

that goal.

Directly linked to the previous point is the ability to debug PyTorch code. Debugging is

currently one of the main pain points of frameworks relying on static computation

graphs. In these frameworks, execution happens after the model has been defined in its

entirety and the code has been compiled by the symbolic graph engine. This creates some

disconnect between a bug in the code and its effect on the execution of the entire graph.


In PyTorch execution is greedy: statements are executed at the time they are invoked in

Python. After the execution of a statement, the data it generated is immediately available

for inspection. This makes debugging more direct.

In other words, its greedy execution model makes PyTorch behave just like another

Python library, just like NumPy, only with GPU acceleration, neural network

kernels and automatic differentiation. This applies to debugging as well as integrating

PyTorch with other libraries - like writing a neural network operation using SciPy, for

instance.
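As a small illustration of this execution model, the sketch below runs a couple of tensor operations and inspects the intermediate result with ordinary Python, exactly as one would with any other library. It assumes a recent PyTorch release, and the variable names are arbitrary.

```python
import torch

x = torch.tensor([[1.0, -2.0], [3.0, -4.0]])

# This statement is executed right here; h already holds concrete values.
h = x * 2

# The intermediate result can be printed, sliced, or examined in a debugger
# before the rest of the computation is even written.
print(h)
neg = int((h < 0).sum())   # count the negative entries with plain Python
print("negative entries:", neg)

# Carry on with the computation using the values we just inspected.
a = torch.relu(h)
print(a)
```

There is no separate graph-compilation step between writing `h = x * 2` and seeing its values, which is why a standard Python debugger or a `print` call is usually enough to locate a problem.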

From an ecosystem perspective, PyTorch embraces Python, the emergent programming

language for data science. PyTorch compensates for the impact of the Python interpreter on

performance through an advanced execution engine, but it does so in a way that is fully

transparent to the user, both during development and during debugging. PyTorch also

features seamless interoperation with NumPy. On the CPU, NumPy arrays and Torch

tensors can even share the same underlying memory and be converted back and forth at

no cost.
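The memory sharing described above can be observed directly. The following is a minimal sketch, assuming NumPy and a recent PyTorch release are installed. It uses `torch.from_numpy` and `Tensor.numpy()`, which convert between the two types on the CPU without copying the data.

```python
import numpy as np
import torch

a = np.ones(3, dtype=np.float32)

# Wrap the NumPy array as a tensor; no data is copied.
t = torch.from_numpy(a)

# An in-place change on the tensor is visible from the NumPy side.
t[0] = 5.0
print(a)

# Going back to NumPy also shares the same buffer.
b = t.numpy()
b[1] = 7.0
print(t)

print(np.shares_memory(a, b))
```

Because the buffer is shared, in-place modifications propagate in both directions. When an independent copy is needed, it has to be requested explicitly, for instance with `t.clone()` or `a.copy()`.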

An important aspect is the ability of a deep learning model to be deployed in production

on a number of architectures, from GPU clusters to low-footprint devices, even to

mobile devices. PyTorch can be deployed on clusters thanks to its distributed computing

capabilities, but it is not designed to be deployed on a phone. However, computation

graphs can be exported to a neural network interoperability representation, namely the

Open Neural Network Exchange (ONNX, github.com/onnx/onnx). This allows a model

defined and trained with PyTorch to be deployed to ONNX-compliant frameworks

optimized for inference, like Caffe2 (caffe2.ai), which runs on iOS and Android as well

as a host of other architectures, provided that the model satisfies a few basic

requirements.

These and many other advantages that we will discover throughout the book make

PyTorch one of the most interesting deep learning frameworks available, and possibly

one of the leading tools for deep learning in the near future.

Before we finally set out for our journey with PyTorch, we will spend the last section of

this chapter mapping out its structure, in terms of components and how they interoperate.

This mental map will help us understand what happens and where it is happening when

we run our first lines of PyTorch.





1.4 The Anatomy of PyTorch

We have already hinted at a few components in PyTorch. Let’s now take some time to

formalize a high-level map of the main architectural components. In fact this is going to

be the only time we’ll look at what’s under the hood. This book will mostly deal with the

top-most, user-facing layer - the Python module. We won’t go into too much detail about

it now - we will have a chance to enrich this description along the way.

Figure 1.12 Anatomy of PyTorch, showing a high-level Python API (top), the C++ autograd/JIT engine (mid), and the C/CUDA low-level libraries (bottom). Each level is exposed to the upper levels through automatic wrapping. The result is a loosely-coupled system, with stateless low-level building blocks, a high performance engine and an expressive high-level API.

We just mentioned that at the top-most level PyTorch is a Python library. It exposes a

very convenient API for dealing with tensors and performing operations over them, as

well as building neural networks and training them via optimizers. In Torch tradition, the

Python layer is actually pretty thin: it is designed to prescribe computations, but not to

compose them or execute them. This is delegated to lower layers for performance

reasons.


Right under the Python layer we find an execution engine written in C++. The engine

includes autograd, which manages the dynamic computation graph and provides

automatic differentiation, and a jit (just-in-time) compiler that traces computation steps

as they are performed and optimizes them for performance for repeated executions. We’ll

talk about this feature later in the book. For now, it is worth mentioning that many of the

features that make PyTorch unique, such as very fast automatic differentiation, come

from this layer.
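To give a rough idea of what tracing looks like from the Python side, here is a minimal sketch using `torch.jit.trace`. It assumes PyTorch 1.0 or later, where this function is part of the public API. The function `scaled_tanh` is a made-up example, not something provided by PyTorch.

```python
import torch

def scaled_tanh(x):
    # An ordinary Python function built from tensor operations (hypothetical).
    return 2.0 * torch.tanh(x) + 1.0

example = torch.randn(4)

# Run the function once on an example input and record the tensor operations
# that were performed; the result is a callable backed by the recorded trace.
traced = torch.jit.trace(scaled_tanh, example)

out_eager = scaled_tanh(example)
out_traced = traced(example)

print(out_eager)
print(out_traced)
```

The traced callable should return the same values as the original function. Note that tracing records only the operations executed for the example input, so Python control flow that depends on the data is not captured by this approach.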

At the lowest layers, we find all the core libraries doing the actual computing. A series of

plain C libraries provide very efficient data structures, the tensors (a.k.a.

multi-dimensional arrays), for CPU and GPU (TH and THC, respectively), as well as

stateless functions that implement neural network operations and kernels (THNN and

THCUNN) or wrap optimized libraries such as NVIDIA’s cuDNN. Other libraries deal

with distributed (multi-machine) and sparse (multi-dimensional arrays where most of the

entries are zero) tensor implementations. A lot of the code in this layer comes from

Torch7 and Torch5 before it.

A library named ATen automatically wraps the low-level C functions in a convenient

C++ API. ATen provides its tensor classes to the engine and it is automatically wrapped

and exposed to Python. Similarly, the neural network function libraries are automatically

wrapped towards the engine and Python API. Such automatic wrapping of low-level code

contributes to keeping the code loosely coupled, decreasing the overall complexity of the

system and encouraging further development.

Despite such layered structure, the Python API is all a practitioner needs to use PyTorch

proficiently. Still, awareness of the anatomy of the whole system will help us to

understand API design and error messages to a greater extent.

1.5 Wrapping up

In this chapter we introduced where the world stands with deep learning and what tool

one can use to be part of the revolution. We have taken a peek at what PyTorch has to

offer and why it is worth investing time and energy in it. Just prior to that, we have

looked at its origins, with the intent of explaining the underlying motivations and design

decisions behind Torch first and PyTorch now. Last, we have described what PyTorch

looks like from a bird’s-eye view.

As with any good story, wouldn’t it be great to take a peek at the amazing things PyTorch

will enable us to do once we’ve completed our journey? Hold tight, the next chapter is

aimed at exactly that.

1.6 Summary

Deep learning is about automatically learning representations from examples using deep neural networks.

Neural networks consist of a composition of simple operations.

Neural networks learn through weight updates by back-propagation of errors.

Libraries like PyTorch make it possible to build and train neural networks efficiently, moving computations to the GPU and automatically computing derivatives for back-propagating errors.

PyTorch focuses on minimizing cognitive overhead, while also focusing on flexibility and speed.