arXiv:2001.10554v2 [quant-ph] 5 May 2020


Intel Quantum Simulator: A cloud-ready high-performance simulator of quantum circuits

Gian Giacomo Guerreschi,1, ∗ Justin Hogaboam,1 Fabio Baruffa,2 and Nicolas P. D. Sawaya1, †

1 Intel Labs
2 Intel Deutschland GmbH, Feldkirchen, Germany

(Dated: May 7, 2020)

Classical simulation of quantum computers will continue to play an essential role in the progress of quantum information science, both for numerical studies of quantum algorithms and for modeling noise and errors. Here we introduce the latest release of Intel Quantum Simulator (IQS), formerly known as qHiPSTER. The high-performance computing (HPC) capability of the software allows users to leverage the hardware resources provided by supercomputers, as well as available public cloud computing infrastructure. To take advantage of the latter platform, together with the distributed simulation of each separate quantum state, IQS allows users to subdivide the computational resources to simulate a pool of related circuits in parallel. We highlight the technical implementation of the distributed algorithm and details about the new pool functionality. We also include some basic benchmarks (up to 42 qubits) and performance results obtained using HPC infrastructure. Finally, we use IQS to emulate a scenario in which many quantum devices are running in parallel to implement the quantum approximate optimization algorithm, using particle swarm optimization as the classical subroutine. The results demonstrate that the hyperparameters of this classical optimization algorithm depend on the total number of quantum circuit simulations one has the bandwidth to perform. Intel Quantum Simulator has been released open-source with permissive licensing and is designed to simulate a large number of qubits, to emulate multiple quantum devices running in parallel, and/or to study the effects of decoherence and other hardware errors on calculation results.

I. INTRODUCTION

In the past decade there has been steady progress toward building a viable quantum computer that can be used to solve problems that classical computers cannot. Because quantum hardware is still in its infancy, the simulation of quantum algorithms on classical computers will continue to be an important and useful endeavor. This is because many technologically and scientifically relevant questions are too difficult or impractical to be answered analytically. Though many useful conclusions can be demonstrated analytically, such as the proven speedup of Shor’s algorithm compared to classical factoring algorithms [1] or of Grover’s algorithm for unstructured database search [2], most algorithmic research benefits from numerical experiments.

The first area where numerical simulation is useful is the evaluation of the parameters and hyperparameters used in quantum algorithms. For instance, the simulation of physics and chemistry problems involves many choices regarding how to encode the problem into a set of qubits [3], for which it is usually not obvious which approach will be most efficient without performing numerics. Further, most variational algorithms, whether the variational quantum eigensolver for finding Hamiltonian eigenvalues [4] or variants of the quantum approximate optimization algorithm for solving combinatorial problems [5], involve a classical heuristic optimization routine that, for all intents and purposes, must be analyzed numerically.

The other primary reason for numerical simulation of quantum algorithms is to study the effects of errors. Despite enormous progress in reducing the effects of environmental noise and in perfecting the fidelities of gate operations, it appears certain that all near-term quantum devices will exhibit errors that cannot be corrected without sophisticated error correction schemes such as the surface code [6]. This highlights the need for numerical simulations of quantum algorithms running on error-prone hardware [7, 8]. Such simulations not only help draw conclusions about the robustness of particular algorithmic choices, but can also guide hardware design and gate compilation [9, 10], since different choices may lead to qualitatively different errors.

Quantum circuits are hard to simulate classically since the computational cost scales exponentially with the number of qubits. Notably, though there are classes of algorithms that scale more favorably for some sets of quantum circuits, such as tensor network [11–15] or path integral methods, these methods still scale exponentially in the general case. Several high-performance quantum circuit simulators have been reported, including full state vector codes built for CPUs [16–23] and/or graphics processing units [19, 24–27], and those that use a mix of algorithm types [28, 29].

∗ [email protected]
† [email protected]

In this manuscript we present a new version of Intel Quantum Simulator (IQS) and use it to emulate a hybrid quantum-classical algorithm. Also known as qHiPSTER or the Quantum High-Performance Software Testing Environment, IQS is a massively parallel simulator of quantum algorithms expressed in the form of quantum circuits. Its original version was coded for High-Performance Computing environments, with the goal of allowing large scale simulations. In 2016 it was used to simulate the full state of 40 [18] and later 42 qubits [30].

The second version of Intel Quantum Simulator has just been released, open source, at the address https://github.com/iqusoft/intel-qs. While preserving the HPC core of the original implementation, it includes several new features that extend its application to cloud computing environments. In particular, Intel Quantum Simulator can divide the allocated computational resources into groups, each dedicated to the simulation of a distinct quantum circuit. The communication between these groups of processes is minimal, and each separate group uses the distributed implementation from the original release to store and manipulate its quantum state. We expect that at least two use cases will profit massively from this extension. The first arises when multiple circuits or variants of the same circuit have to be run in parallel (think for example of variational algorithms in conjunction with classical optimizers like the genetic algorithm or particle swarm optimizers). The second arises when stochastic methods are used to include noise and decoherence in the simulation. One can easily envision situations such as those just mentioned in which a pool of hundreds or thousands of states is required to accelerate simulation. Cloud computing platforms, with (tens of) thousands of nodes, are an ideal choice to run these workloads.

In addition to the new features just discussed, the new release includes robust unit testing to verify the proper installation and functioning of IQS and, for developers, to test the compatibility of novel features with the released code. Finally, we focused on lowering the user’s learning barrier with an automatic installation process and extended tutorials, and on improving ease of use by providing Python integration and a Docker container option. Our goal in releasing this version of the software is that IQS may be used as a standalone program or as a backend to other quantum computing frameworks like Xanadu’s Pennylane [31], IBM’s Qiskit [32], Rigetti’s Forest [33], Google’s Cirq [34], Microsoft’s Azure Quantum [35], ProjectQ [36], Zapata’s Orquestra [37], Amazon’s Braket [38], and others.

In this article, we begin by describing the basic usage of IQS and its software structure. In Section III, we present benchmarks for large scale simulations of up to 42 qubits. Section IV describes two situations that take advantage of simulating a pool of circuits: the emulation of a variational protocol that uses many quantum processors in parallel and the simulation of circuits exposed to noise. Finally we draw some conclusions and provide an outlook.

II. SOFTWARE DESCRIPTION

Intel Quantum Simulator, both in its initial version and latest release, takes advantage of the full resources of an HPC system thanks to its shared- and distributed-memory implementation, which exploits two distinct opportunities for parallelism. The first arises when several processors, or a processor with multiple computing cores, have access to the same memory and the operations do not need to be performed in a specific sequential order. This opportunity for parallelism is best exploited with OpenMP. The second arises when a relatively small amount of memory requires a lot of computation or, as in the case of storing quantum states, a large amount of memory cannot fit in a single machine or node. In this case, one needs to explicitly consider the communication pattern between the different processes, and adopting the Message Passing Interface (MPI) is a necessity.

In the new release of IQS we have set up the MPI environment to allow multiple quantum circuits to be simulated in parallel. We divide the computing processes into groups, each dedicated to storing a single quantum state and updating it according to the action of a specific circuit. Each state can still profit from the shared and distributed implementation of the code (MPI+OpenMP) inherited from the original version of the simulator. The use cases are illustrated in Fig. 1, which shows that a single state can be stored using all nodes, a subset of them, or even part of a single computing node. In this section we first introduce the basic methods that allow IQS to initialize, evolve and extract information from quantum states, then we discuss the distributed implementation of a single state, and finally we explain the parallel simulation of multiple circuits.


FIG. 1: Depending on the nature of the research question being considered, there are several ways to run IQS. If one needs to simulate a single instance of a circuit for the highest possible qubit count, then one would distribute the quantum state across all available computing nodes or sockets (left panel). If one needs to consider a pool of states, either for simulating many circuits in parallel or for describing noise effects via stochastic simulations (or possibly both at the same time), then one may simulate multiple distributed (center panel) or local (right panel) quantum states in parallel.

Scope and basic use

IQS is designed to simulate the dynamics of multi-qubit states in quantum circuits. The state is assumed to be pure and its unitary dynamics a consequence of the application of one- and two-qubit operations. IQS provides methods to extract information from the state at any intermediate point or at the end of quantum circuits. There are three fundamental parts in the simulation of quantum circuits: state initialization, evolution, and measurement. We illustrate the basic methods of IQS with short snippets of code which are part of a longer C++ program contained in the release among the examples. The same syntax applies to the Python interface of IQS¹.

The central object of IQS is called QubitRegister and can be thought of as the quantum state of the qubits composing the system of interest. In its declaration, we need to specify the number of qubits in order to allocate a sufficient amount of memory to describe their state. Then we can initialize the state to any computational basis state, uniquely identified by its index. Other methods allow for initializing the state randomly or, for example, as the balanced superposition of all computational states.

    int num_qubits = 4;
    QubitRegister<std::complex<double>> psi(num_qubits);
    std::size_t index = 0;
    psi.Initialize("base", index);

The dynamics is generated by applying one- and two-qubit gates. IQS gives the option of defining custom gates, but also provides a large choice of the most common gates like single-qubit rotations or the Hadamard gate. The two-qubit gates are in the form of controlled one-qubit gates, meaning that the desired operation is applied to the target qubit conditionally on the control qubit being in |1⟩. The special case of the conditional Pauli X, also called the CNOT gate, clarifies why there is no need for arbitrary two- or multi-qubit gates: any unitary evolution can be approximated to arbitrary precision by a sequence of one-qubit and CNOT gates. IQS is suitable for implementing multi-qubit operations², but the definition of custom multi-qubit gates requires a very good understanding of its internal implementation (see next subsections).

    // One-qubit gates: Pauli X on qubit 1 and Hadamard on qubit 0
    psi.ApplyPauliX(1);
    psi.ApplyHadamard(0);
    // Two-qubit gate of conditional form: apply Pauli X on qubit 0 conditioned on qubit 1
    int control_qubit = 1;
    int target_qubit = 0;
    psi.ApplyCPauliX(control_qubit, target_qubit);

1 The Python interface of IQS is currently limited to single-process execution (i.e. no MPI).
2 For example, the latest IQS release includes methods specialized to the emulation of circuits for the quantum approximate optimization algorithm. These circuits are reduced to one-qubit gates and a single global operation per step.


Finally, the qubits can be measured in the computational basis one at a time or in larger groups. While in an actual quantum experiment a measurement returns only one of the possible outcomes according to a state-dependent probability distribution, with simulators one can compute the full statistics of outcomes without re-running the experiment. For example, IQS provides methods to compute the probability of finding a certain qubit in state |1⟩ or to evaluate the expectation value of multi-qubit observables like products of Pauli matrices (not shown below).

    // Compute the probability of observing qubit 1 in state |1>
    int measured_qubit = 1;
    double probability = psi.GetProbability(measured_qubit);
    // The expectation value of <Z1> can be computed from the above probability
    double expectation = -1 * probability + 1 * (1 - probability);
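To make explicit what such a probability means in terms of the stored amplitudes, the following is a minimal standalone sketch that does not use IQS at all. The function names ProbabilityOfOne and ExpectationZ are hypothetical, the state is assumed to fit in a single std::vector, and bit q of the amplitude index is assumed to label qubit q.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of IQS): probability of finding qubit q in |1>,
// i.e. the sum of |alpha_i|^2 over all indices i whose bit q is set.
double ProbabilityOfOne(const std::vector<std::complex<double>>& state, unsigned q) {
  assert((std::size_t{1} << q) < state.size());
  double prob = 0.0;
  for (std::size_t i = 0; i < state.size(); ++i) {
    if ((i >> q) & 1u) prob += std::norm(state[i]);  // std::norm gives |alpha_i|^2
  }
  return prob;
}

// <Z_q> = (+1) * P(0) + (-1) * P(1) = 1 - 2 * P(1), as in the listing above.
double ExpectationZ(const std::vector<std::complex<double>>& state, unsigned q) {
  return 1.0 - 2.0 * ProbabilityOfOne(state, q);
}
```

For a normalized state the result of ExpectationZ always lies between -1 and +1, which offers a quick sanity check on any simulation output.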

Distributed implementation

Here we describe how the quantum state is defined and stored inside the IQS objects. The current algorithm is based on the original implementation of Intel Quantum Simulator [18], so here we summarize it in order to provide a self-contained description of our simulator.

To fully describe an arbitrary state of $n$ qubits, one needs to store $2^n$ complex numbers corresponding to the probability amplitudes with respect to the computational basis. For $n$ as low as 30, simply storing all the amplitudes fills up $2^{3+1+30}$ bytes $\simeq 17$ GB of memory ($2^3 = 8$ bytes per double-precision number and a factor 2 since the probability amplitudes are complex). To enable the fast simulation of circuits involving more than 30 qubits, one needs to divide the state between multiple processes, each with its own dedicated memory. IQS assumes that $P = 2^p$ processes are used, each storing $2^{n-p}$ amplitudes and satisfying $p < n$. If $P$ is not a power of two, IQS considers an effective number of nodes equal to $2^p$ with $p = \lfloor \log_2(P) \rfloor$.
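The arithmetic of this paragraph can be checked with a few lines of standalone C++. This is only a sketch of the bookkeeping: the function names are hypothetical and not taken from the IQS source, and only the memory of the state vector itself is counted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bytes needed for all 2^n complex double-precision
// amplitudes, at 16 bytes (two 8-byte doubles) per amplitude.
std::uint64_t StateBytes(unsigned n) {
  assert(n < 60);  // keeps 16 * 2^n within a 64-bit integer
  return (std::uint64_t{1} << n) * 16u;
}

// Largest p such that 2^p <= P, i.e. p = floor(log2(P)).
unsigned FloorLog2(std::uint64_t P) {
  assert(P >= 1);
  unsigned p = 0;
  while (p < 63 && (P >> (p + 1)) != 0) ++p;
  return p;
}

// Amplitudes stored by each of the 2^p effective processes.
std::uint64_t AmplitudesPerProcess(unsigned n, std::uint64_t P) {
  const unsigned p = FloorLog2(P);
  assert(p < n);  // the text requires p < n
  return std::uint64_t{1} << (n - p);
}
```

With n = 30, StateBytes gives 2^34 = 17,179,869,184 bytes, the 17 GB quoted above, and FloorLog2(80) gives 6, so 80 processes are treated as 64.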

As we explain in the next paragraphs, all operations involving only the first $m = n - p$ qubits do not require communication between processes, while MPI communication is needed when performing operations on the last $p = n - m$ qubits. Therefore we refer to the qubits with index $0 \le q < m$ as “local” and those with index $m \le q < n$ as “global”. However, it is important to realize that even the partial state of a local qubit can be fully known only by accessing all $2^n$ amplitudes distributed among all $2^p$ processes.

It is informative to analyze how one-qubit gates are implemented in IQS. Any quantum state can be written as a vector with complex entries $\{\alpha_i\}_{i=0,1,\ldots,2^n-1}$, so it is convenient to express the index $i$ in binary notation as $i_{n-1} \ldots i_2 i_1 i_0$ with $i_q \in \{0, 1\}$ and such that $i = \sum_{q=0}^{n-1} i_q 2^q$. In this way it is straightforward to obtain both the process number $p(i)$ and the index of the local memory $\ell(i)$ corresponding to the $i$-th amplitude $\alpha_i$:

$$p(i) = \sum_{q=m}^{n-1} i_q \, 2^{q-m} , \qquad \ell(i) = \sum_{q=0}^{m-1} i_q \, 2^q \qquad (1)$$
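In terms of bit operations, Eq. (1) says that the process number is given by the upper $n - m$ bits of $i$ and the local index by its lower $m$ bits. A minimal sketch, with hypothetical function names that are not part of IQS:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: process holding amplitude i, i.e. p(i) in Eq. (1).
// Shifting right by m drops the m local bits and keeps the global ones.
std::uint64_t ProcessOf(std::uint64_t i, unsigned m) {
  assert(m < 64);
  return i >> m;
}

// Hypothetical helper: position of amplitude i in that process's local
// memory, i.e. l(i) in Eq. (1). The mask keeps only the low m bits.
std::uint64_t LocalIndexOf(std::uint64_t i, unsigned m) {
  assert(m < 64);
  return i & ((std::uint64_t{1} << m) - 1);
}
```

The two values determine $i$ uniquely, since $i = p(i) \, 2^m + \ell(i)$.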

Consider the one-qubit gate acting on qubit $q$ and defined by the $2 \times 2$ unitary matrix $U$:

$$U = \begin{pmatrix} U_{00} & U_{01} \\ U_{10} & U_{11} \end{pmatrix} . \qquad (2)$$

Its action on the quantum state can be written as

$$\alpha'_{*\cdots*0_q*\cdots*} = U_{00} \, \alpha_{*\cdots*0_q*\cdots*} + U_{01} \, \alpha_{*\cdots*1_q*\cdots*}$$
$$\alpha'_{*\cdots*1_q*\cdots*} = U_{10} \, \alpha_{*\cdots*0_q*\cdots*} + U_{11} \, \alpha_{*\cdots*1_q*\cdots*} \qquad (3)$$

where each $*$ refers to any bit value. The expression above means that the entries are updated in pairs, independently of the values of the other amplitudes. From Eq. (1) it is clear that when $q < m$ the connected pairs of entries are stored in the memory of the same process and can be updated without inter-process communication, as illustrated in Fig. 2.
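The pairwise update of Eq. (3) can be sketched for the case in which all amplitudes sit in one array, which is the situation of a local qubit within a single process. This is an illustrative loop only: it is not the IQS kernel (which is parallelized and optimized), and the names Amplitude, Gate and ApplyOneQubitGate are assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amplitude = std::complex<double>;
using Gate = std::array<std::array<Amplitude, 2>, 2>;  // U[row][col] as in Eq. (2)

// Hypothetical helper: apply U to qubit q of a state stored in a single array.
// Entries whose indices differ only in bit q form the pairs of Eq. (3).
void ApplyOneQubitGate(std::vector<Amplitude>& state, unsigned q, const Gate& U) {
  const std::size_t stride = std::size_t{1} << q;
  assert(state.size() >= 2 * stride && state.size() % (2 * stride) == 0);
  for (std::size_t block = 0; block < state.size(); block += 2 * stride) {
    for (std::size_t k = block; k < block + stride; ++k) {
      const Amplitude a0 = state[k];           // entry with i_q = 0
      const Amplitude a1 = state[k + stride];  // paired entry with i_q = 1
      state[k] = U[0][0] * a0 + U[0][1] * a1;
      state[k + stride] = U[1][0] * a0 + U[1][1] * a1;
    }
  }
}
```

Every pair is independent of all the others, which is why the loop iterations can be executed in any order or shared among threads.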

The situation differs when $q \ge m$. In this case the two entries belong to the memory of two distinct processes, specifically to those with index $p(i)$ and $p(i + 2^q) = p(i) + 2^{q-m}$ respectively (here $i$ is such that $i_q = 0$). Inter-process communication is therefore required and we adopt the same scheme as in the original IQS implementation [18, 39]. It is briefly summarized in Fig. 3 and its caption.
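Since the two processes of a pair differ only in bit $q - m$ of their index, the communication partner can be obtained by flipping that bit. A one-function sketch, where the name PartnerRank is hypothetical and process indices are assumed to coincide with $p(i)$ of Eq. (1):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: index of the process that holds the paired amplitudes
// for a gate on global qubit q (q >= m). Flipping bit q-m maps p(i) to
// p(i) + 2^(q-m) when that bit is 0, and back again when it is 1.
std::uint64_t PartnerRank(std::uint64_t rank, unsigned q, unsigned m) {
  assert(q >= m && q - m < 64);
  return rank ^ (std::uint64_t{1} << (q - m));
}
```

The relation is symmetric: applying PartnerRank twice returns the original index, so each process has exactly one partner per global qubit.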


FIG. 2: Left: Quantum state of three qubits stored as a vector of $2^3 = 8$ complex amplitudes $\alpha_{i_2 i_1 i_0}$: $|\psi\rangle = [\alpha_{000}, \alpha_{001}, \alpha_{010}, \ldots, \alpha_{111}]^T$. The state is distributed over 2 processes, each storing half of the amplitudes. Right: Illustration of the computation scheme to simulate one-qubit gates. Observe the qualitative change depending on the qubit involved in the operation: at a critical qubit number, communication between memory spaces (i.e. processes or MPI ranks) is required. In this 3-qubit case, communication between memory spaces is required for $q = 2$ but not for $q = 0, 1$.

FIG. 3: Communication scheme for implementing a one-qubit gate on a global qubit $q$, with $q \ge m$. Time flows from left to right. Step 1: Each MPI task sends half of its local memory to its communication partner, identified by an index difference of $2^{q-m}$. Step 2: The computation is equally split between the two processes and involves only local memory. Step 3: At the end the updated information is sent back and the state is updated. This follows the original IQS implementation [18] that was first described in [39].

In addition to one-qubit gates, IQS implements distributed two-qubit gates of the controlled form, meaning that the one-qubit gate $U$ is applied to target qubit $t$ conditionally on the control qubit $c$ being in state |1⟩. The communication pattern depends on whether the control and target qubits are local or global and on whether $t > c$.

Pool of multiple states

The most consequential change in the IQS implementation compared to its original release is the ability to divide the processes into groups using the MPI function MPI_Comm_create_group. Each group can be used to store a quantum state, possibly in a distributed way if the group itself is composed of more than one process. Now, when a QubitRegister object is created, it actually initializes a state in each group of processes: we call “pool” the collection of such states. In addition, when a method of the form ApplyGate is called, the gate is actually applied to each and every state of the pool.

We clarify this concept with a concrete example. Consider that we have 80 processes and want to work with 10 quantum states. One option is to group all processes together and declare 10 QubitRegister objects: here, each state is distributed over $64 = 2^6$ processes (since 80 is not a power of 2) and the circuit’s gates must be specified for each of the 10 states separately. The second option is to divide the processes into 10 groups of 8 processes each and create a single QubitRegister object: here, each state is distributed over $80/10 = 8 = 2^3$ processes and each gate is by default applied to every state.
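The two options of this example can be compared with a short standalone sketch. It only reproduces the counting of processes and makes no MPI calls. The struct and function names are hypothetical, and the assignment of consecutive ranks to the same group is an assumption made here for illustration, not a statement about how IQS orders its groups.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of how processes are shared among the pool's states.
struct PoolLayout {
  std::uint64_t group_size;      // processes assigned to each group
  std::uint64_t effective_size;  // largest power of two <= group_size
};

PoolLayout MakePoolLayout(std::uint64_t num_processes, std::uint64_t num_groups) {
  assert(num_groups >= 1 && num_processes >= num_groups);
  PoolLayout layout{};
  layout.group_size = num_processes / num_groups;
  layout.effective_size = 1;
  while (layout.effective_size * 2 <= layout.group_size) layout.effective_size *= 2;
  return layout;
}

// Assumption: ranks are split into contiguous blocks, one block per group.
std::uint64_t GroupOfRank(std::uint64_t rank, const PoolLayout& layout) {
  return rank / layout.group_size;
}
```

Calling MakePoolLayout(80, 1) yields an effective size of 64, matching the first option, while MakePoolLayout(80, 10) yields groups of 8 processes in which every process is used.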

There are two important observations. The first is that defining a non-trivial pool of states may take advantage of the available processes in a more effective way. The second is that each state of the pool is naturally subjected to the same quantum circuit. The latter characteristic, if strictly enforced, would make the simulations redundant: we would simulate the same identical evolution over and over. However, it is possible to differentiate the applied circuit for each of the states in the pool, and the relevant commands are discussed in Appendix D. Moreover, there are cases in which simulating closely related circuits is required and what seemed a limitation actually becomes a beneficial feature.

Here we discuss two of these situations and present the corresponding results in Section IV. Code snippets are discussed in Appendix D.

• In Variational Quantum Algorithms (VQA) a quantum circuit composed of parametric gates is optimized to prepare states with desired properties, often related to having large overlap with the ground state of certain observables. During the optimization, the same circuit is simulated over and over with the only difference being the value of its parameters (think of them as the angles of one-qubit rotations). Within the pool functionality, it is easy to assign different parameter values to the circuit simulated by the distinct states in the pool. This approach greatly speeds up the overall simulation of VQA protocols based on several classes of optimizers, like genetic algorithms, particle swarm optimization, or gradient-based methods.

• IQS is a simulator of unitary dynamics in which each state is pure. Nonetheless it is possible to use IQS to simulate the effect of noise and decoherence during the circuit by introducing stochastic perturbations to the ideal circuit and averaging over the ensemble of “perturbed” circuits [18]. Formally, this approach is based on the unraveling of master equations into stochastic Schrödinger equations in the circuit-model formalism [40] and corresponds to the introduction of additional “noise gates” in the form of one-qubit rotations with stochastic rotation angles. IQS provides specialized methods to apply these noise gates that automatically vary their rotation angles over the pool’s states.
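The idea of averaging over an ensemble of perturbed pure states can be illustrated on a single qubit, without any reference to the IQS noise-gate methods. In the toy sketch below every member of the pool starts in |+⟩, receives a Z-rotation with an independent Gaussian-distributed angle, and contributes its own value of ⟨X⟩ to the average. The function name, the Gaussian noise model, and the parameter sigma are assumptions chosen for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>

// Hypothetical toy model (not the IQS API): average <X> over a pool of
// single-qubit pure states, each subject to a random Z-rotation.
double AverageXUnderRandomZRotations(std::size_t pool_size, double sigma, unsigned seed) {
  assert(pool_size > 0 && sigma > 0.0);
  std::mt19937 rng(seed);
  std::normal_distribution<double> angle(0.0, sigma);
  const double inv_sqrt2 = 1.0 / std::sqrt(2.0);
  double sum = 0.0;
  for (std::size_t s = 0; s < pool_size; ++s) {
    const double theta = angle(rng);
    // Rz(theta)|+> = (e^{-i theta/2}|0> + e^{+i theta/2}|1>) / sqrt(2)
    const std::complex<double> a0 = inv_sqrt2 * std::polar(1.0, -theta / 2.0);
    const std::complex<double> a1 = inv_sqrt2 * std::polar(1.0, +theta / 2.0);
    sum += 2.0 * std::real(std::conj(a0) * a1);  // <X> of this pure state, equal to cos(theta)
  }
  return sum / static_cast<double>(pool_size);
}
```

Each individual state remains pure, yet the pool average of ⟨X⟩ decays toward exp(-sigma^2 / 2) as the pool grows, which is the behavior of a dephased mixed state under this noise model.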

III. SCALING EXPERIMENTS

In Ref. [18], Smelyanskiy and coauthors analyzed the simulator performance on the distributed system Stampede provided by the Texas Advanced Computing Center (TACC). They demonstrated the weak and the strong scalability of the code up to 40 qubits using 1024 compute nodes. They also performed single and multi-node performance measurements of one- and two-qubit gates, and of a complete quantum circuit, namely the Quantum Fourier Transform. Since the core implementation of IQS did not change for the latest release, we consider those results still valid.

In the current release of IQS, we provide sample codes to run the simulation of one-qubit operations by varying the total number of qubits in the state or the index of the qubit involved in the gate. This allows users to benchmark IQS execution times on HPC systems with the aim of analyzing the strong and the weak scaling of the simulator. We have used such sample scripts to run the following experiments on the SuperMUC-NG³ HPC system hosted by the Leibniz Supercomputing Center of the Bavarian Academy of Science (LRZ). SuperMUC-NG consists of 6,480 compute nodes, each equipped with two sockets of Intel® Xeon® Scalable Processor 8174 CPUs. The total number of CPU cores is 311,040 and the total distributed memory is 719 terabytes. Each node is a two-socket system with 24 cores per socket and 96 GB of shared memory. In the next sections, we present strong and weak scaling results up to 2048 nodes, which corresponds to 98,304 CPU cores and a total memory of 196 TB.

3 https://doku.lrz.de/display/PUBLIC/Hardware+of+SuperMUC-NG


Strong scaling

In the strong scaling analysis, we fixed the problem size (the number of qubits to simulate) and scaled up the computational resources. The total speedup is limited by the fraction of the code that is serial and therefore not amenable to parallelization. In Fig. 4 we show the speedup of single-qubit operations for simulations of a 32-qubit system. We have implemented one-qubit gates defined by a random $2 \times 2$ matrix. The gate is then applied to all the qubits involved in the simulation.
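The limit set by the serial fraction is commonly quantified by Amdahl's law. As a reminder, with $s$ denoting the serial fraction of the runtime and $P$ the number of processes (both symbols are introduced here for illustration and are not measured quantities from these benchmarks):

```latex
S(P) = \frac{T(1)}{T(P)} = \frac{1}{s + \dfrac{1 - s}{P}},
\qquad
\lim_{P \to \infty} S(P) = \frac{1}{s}.
```

For instance, a code with $s = 0.01$ cannot exceed a speedup of 100 regardless of how many processes are added. The formula ignores communication costs, which in a distributed simulation grow with $P$ and lower the achievable speedup further.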

FIG. 4: Strong scaling of 32-qubit simulations using from 32 to 512 MPI processes. Each MPI process is running on one socket of SuperMUC-NG, which means 2 MPI tasks per node. Each socket is fully populated by 24 OpenMP threads. Left panel: Time to execute a random one-qubit gate as a function of the qubit involved. Different colors correspond to different numbers of MPI processes. When the gate is executed on qubit $q = n - p$ (with $2^p$ being the number of processes and $n = 32$), the communication between the MPI tasks happens within the nodes (intra-node) and not between the nodes (inter-node). For the later qubits the communication is mostly inter-node. This explains the peak we observe in our measurements, which has also been confirmed by our MPI ping-pong test benchmark for messages of that size. Right panel: Time to execute a random one-qubit gate on the first ($q = 0$) or last ($q = 31$) qubit. The difference in computational time is due to the additional communication required for the last qubit to be updated.

Weak scaling

In Section II we described the internal representation of quantum states used by IQS and highlighted the fact that simulating an extra qubit implies doubling the allocated memory. This consideration can be included in the numerical analysis of the so-called weak scaling. The idea is that, while the number of processes increases, the simulation size also increases in such a way that the amount of memory and the computing effort per process stay (ideally) constant.

We launched simulations of systems from 32 to 42 qubits using from 4 to 4096 processes. The largest job used2048 nodes of the SuperMUC-NG system. The expectation of a scale-invariant behavior is confirmed by our studyand presented in Fig. 5.
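The constant memory per process can be checked with a short calculation. The sketch below assumes 16 bytes per amplitude (a double-precision complex number) and an even split of the state among processes; it ignores any temporary communication buffers, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes each process needs to store its share of an n-qubit state vector.
// ASSUMPTIONS: 16 bytes per amplitude (std::complex<double>), amplitudes
// split evenly among processes, no extra buffers counted.
std::uint64_t StateBytesPerProcess(int num_qubits, std::uint64_t num_processes)
{
  assert(num_qubits >= 0 && num_qubits < 60);
  assert(num_processes > 0);
  const std::uint64_t bytes_per_amplitude = 16;
  const std::uint64_t total_bytes =
      (std::uint64_t{1} << num_qubits) * bytes_per_amplitude;
  assert(total_bytes % num_processes == 0);
  return total_bytes / num_processes;
}
```

Under these assumptions, both ends of the range quoted above (32 qubits on 4 processes and 42 qubits on 4096 processes) correspond to $2^{34}$ bytes, i.e. 16 GiB, per process.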

IV. PARTICLE SWARM / QAOA SIMULATION

Having discussed the functionality and implementation of IQS, we now consider an illustrative application. As quantum hardware improves, it will be possible to run many small-scale quantum computers at the same time. Though there may not be entanglement across the devices, one may run the devices in a parallel fashion in order to more quickly solve variational quantum problems. Each quantum device would be calculating an objective function for a different set of parameters, with a classical optimization step using the results of all the devices. The behavior of such an algorithm may be analyzed numerically using IQS.


FIG. 5: Weak scaling study of the time needed to apply one-qubit gates depending on the involved qubit. Different colors correspond to different numbers of MPI processes and numbers of qubits. For any additional qubit in the simulation, we increase the computational resources by a factor of 2 since the memory required by the state representation also increases by a factor of 2. For small qubit indices, we observe exactly the same computational time using larger numbers of qubits and resources. For larger qubit indices, the difference is mostly due to the communication overhead. The peak we observe at qubit index 30 is due to the intra-node communication between the 2 sockets of one node.

Besides studying the behavior of a specific algorithm, a purpose of this section is to demonstrate the IQS ‘pool of states’ functionality. Because the classical optimization loop we use has low overhead, the computational speedup of this distributed IQS simulation, compared to a single-process simulation, will be approximately equal to the number of processes used in the simulation.

QAOA with swarm particle optimization

We used IQS to perform this task of simulating many quantum computers running in parallel. The simulation demonstrates one example of the variety of simulation types that may be performed with the software. The variational quantum algorithm we chose is the quantum approximate optimization algorithm (QAOA) [5] for the Max-Cut problem on 3-regular graphs, an extensively studied problem in the quantum algorithms community [5, 8, 41–45]. For the classical optimization procedure we use the particle swarm optimization (PSO) algorithm [46, 47], where we implement each ‘particle’ as one virtual quantum device.

The particular PSO implementation we used was taken from reference [47]. For each particle we first set random initial positions drawn uniformly from $[0, 2\pi)$. These positions form a set of parameter vectors $\{\vec{\theta}_0, \vec{\theta}_1, \cdots, \vec{\theta}_{R-1}\}$ for $R$ particles. Each position $\vec{\theta}_k$ yields a value of the objective function $L(\vec{\theta}) = \langle H_{\rm MaxCut} \rangle$, where $H_{\rm MaxCut}$ is the Max-Cut Hamiltonian. Each particle is given an initial velocity $\vec{v}_k$, also drawn uniformly from $[0, 2\pi)$. The particles are propagated for one time step based on their velocities, after the velocities have been updated with the formula

$$\vec{v}_k \leftarrow \omega \vec{v}_k + \phi_p r_p [\vec{\theta}_k - \vec{\theta}_{k,{\rm (best)}}] + \phi_g r_g [\vec{\theta}_k - \vec{\theta}_{\rm (global)}] \qquad (4)$$

where $\omega$, $\phi_p$, and $\phi_g$ are arbitrary constants; $r_p$ and $r_g$ are random numbers drawn each step from the uniform distribution on $[0,1]$; $\vec{\theta}_{k,{\rm (best)}}$ is the best position that particle $k$ has discovered thus far; and $\vec{\theta}_{\rm (global)}$ is the best position discovered by the entire swarm. For this work, we set the parameters to $\omega = 0.66$, $\phi_p = 1.6$, and $\phi_g = 0.62$, as there is numerical evidence that these parameters perform well for some optimization landscapes and hyperparameters [47]. The algorithm is naturally parallelizable, as the only cooperation between quantum devices involves broadcasting each device’s $L$ in order to modify the velocities. The formula shows that after each time step, each new velocity $\vec{v}_k$ is determined by three terms: a damping term ($\omega$), an acceleration term based on the best previous position of the $k$th particle ($\phi_p$), and a second acceleration term based on the best position found so far by the entire swarm ($\phi_g$).
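A single PSO update in the spirit of Eq. (4) can be sketched as below. This is not the code of reference [47] nor of IQS. One point is an explicit assumption: the sketch orients the differences as (best position minus current position), the usual convention in the PSO literature [46], so that both acceleration terms pull a particle toward good positions; Eq. (4) as printed writes the differences with the opposite orientation. The function name `PsoStep` is hypothetical, and the random numbers $r_p$ and $r_g$ are supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One PSO update for a single particle: refresh the velocity, then move the
// particle for one time step.
// ASSUMPTION: differences are taken as (best - current), the conventional
// PSO orientation, so the two acceleration terms attract the particle.
// r_p and r_g are the per-step uniform random numbers in [0,1].
void PsoStep(std::vector<double>& theta, std::vector<double>& velocity,
             const std::vector<double>& particle_best,
             const std::vector<double>& global_best,
             double r_p, double r_g,
             double omega = 0.66, double phi_p = 1.6, double phi_g = 0.62)
{
  assert(theta.size() == velocity.size());
  assert(theta.size() == particle_best.size());
  assert(theta.size() == global_best.size());
  for (std::size_t i = 0; i < theta.size(); ++i)
  {
    velocity[i] = omega * velocity[i]                            // damping
                + phi_p * r_p * (particle_best[i] - theta[i])    // own best
                + phi_g * r_g * (global_best[i] - theta[i]);     // swarm best
    theta[i] += velocity[i];                                     // propagate
  }
}
```

In a parallel run, each virtual device would evaluate $L$ at its new position, and only the resulting values (and best positions) would need to be shared before the next update.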


FIG. 6: Approximation ratio versus the total number of function evaluations, while running PSO for QAOA/Max-Cut on 3-regular graphs of 18 vertices (18 qubits) for varying particle counts. Results are averaged over 300 different graphs and random initial conditions. The horizontal axis gives the number of function evaluations, i.e. the number of quantum circuit emulations and $\langle H_{\rm MaxCut} \rangle$ evaluations run on IQS. The vertical axis gives the approximation ratio, equal to $\langle 0| U_{\rm circ}^\dagger H_{\rm MaxCut} U_{\rm circ} |0\rangle$ divided by the exact MaxCut solution. The higher the particle count, the larger the number of function evaluations per PSO time step. Note that the appropriate choice of particle count depends on how many function evaluations one wishes to perform. Standard deviations (not shown in the figure) calculated over the set of graphs are large compared to the typical difference between means, often higher than 0.05.

FIG. 7: Results of varying particle count, while running PSO for QAOA/Max-Cut on 3-regular graphs of 18 vertices (18 qubits). To compare directly between particle counts, results are shown for a fixed number of function evaluations, i.e. a fixed number of quantum circuit emulations run on IQS. Results are averaged over 300 different graphs and random initial conditions. Error bars show the standard deviation of the distribution (not the uncertainty in the mean). If one is limited in the total number of function evaluations one has the bandwidth to perform, then the utility of adding particles might be non-monotonic. Notably, standard deviations show substantial overlap between different particle counts.

In the procedure described above, we make the implicit assumption that systematic error does not differ greatly between devices. One feature of variational quantum algorithms is that they are robust to any constant systematic errors inherent to a given device. This is because $\max_{\vec{\theta}} L(\vec{\theta}) = \max_{\vec{\theta}} L(\vec{\theta} + \vec{\varepsilon})$, i.e. the maximum of the objective function is independent of any systematic error $\vec{\varepsilon}$. It will not in general be true that a collection of quantum devices will have nearly equal $\vec{\varepsilon}$, but as hardware improves the difference in error between devices is likely to decrease.

As the total number of function evaluations $N_{\rm eval}$ increases, Fig. 6 shows the improvement in the Max-Cut approximation ratio, equal to the quotient of $\langle 0| U_{\rm circ}^\dagger H_{\rm MaxCut} U_{\rm circ} |0\rangle$ and the exact MaxCut solution $E_{\rm exact}$. $H_{\rm MaxCut}$ denotes the quantum observable that counts the number of edges cut by a given graph bipartition, and we are therefore interested in finding its largest eigenvalue. Results for several particle counts between 4 and 64 are shown, with the horizontal axis giving the number of function evaluations. Note that the number of function evaluations per PSO step is equal to the number of PSO particles, meaning that having more particles results in fewer swarm steps for the same total number of function evaluations.
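To make the quantities entering the approximation ratio concrete, the following sketch counts the edges cut by a bipartition and finds the exact Max-Cut value by exhaustive search. It is a classical illustration only, with hypothetical function names; it is not how IQS evaluates $\langle H_{\rm MaxCut} \rangle$, and the exhaustive search is viable only for small graphs such as the 18-vertex instances considered here.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Number of edges cut by the bipartition encoded in `bits`:
// bit v of `bits` says on which side vertex v lies.
int CutValue(const std::vector<Edge>& edges, std::uint32_t bits)
{
  int cut = 0;
  for (const Edge& e : edges)
  {
    if (((bits >> e.first) & 1u) != ((bits >> e.second) & 1u)) ++cut;
  }
  return cut;
}

// Exact Max-Cut value by enumerating all 2^num_vertices bipartitions.
int ExactMaxCut(const std::vector<Edge>& edges, int num_vertices)
{
  assert(num_vertices > 0 && num_vertices < 31);
  int best = 0;
  for (std::uint32_t bits = 0; bits < (std::uint32_t{1} << num_vertices); ++bits)
  {
    const int cut = CutValue(edges, bits);
    if (cut > best) best = cut;
  }
  return best;
}

// Approximation ratio: expectation value divided by the exact solution.
double ApproximationRatio(double expectation, int exact_max_cut)
{
  assert(exact_max_cut > 0);
  return expectation / exact_max_cut;
}
```

As a small check, the complete graph on 4 vertices is 3-regular and has an exact Max-Cut value of 4, so an expectation value of 3 would correspond to an approximation ratio of 0.75.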

The results show mean approximation values at each step, averaged over 300 random 3-regular graphs of 18 vertices (qubits), each with randomly selected initial conditions for the swarm. A QAOA depth [5] of $p = 4$ was used for all circuits, leading to $2p = 8$ parameters and hence an 8-dimensional position vector. The standard deviations (from the 300 random graph instances) are omitted in Fig. 6, though they are appreciable (often higher than 0.05) and usually larger than the difference between means. Note that we are referring to the standard deviation of the distribution of $L(\vec{\theta}) = \langle 0| U_{\rm circ}^\dagger H_{\rm MaxCut} U_{\rm circ} |0\rangle / E_{\rm exact}$ over many graph instances, not the uncertainty in the mean. For the initial step, the reported values for a given $N_{\rm eval}$ accurately reflect the number of random positions chosen at that point. For example, for 64 particles, the value plotted at $N_{\rm eval} = 8$ reflects the best value for the first 8 randomly chosen positions.

Though the best strategy is to choose a number of particles with the best mean behavior for a given $N_{\rm eval}$, the large overlaps between the standard deviations suggest that a clearly superior particle count choice would appear only after many problem instances. For lower numbers of function evaluations, lower particle counts perform slightly better, because fewer function evaluations are spent on the first step of choosing many random positions. For example, at $10^2$ function evaluations, using fewer particles is always a slightly better strategy. This is because there are fewer function evaluations per time step, allowing for faster convergence in the short term. The trend is reversed well before $10^4$ function evaluations, because more space is explored by the higher particle counts. In between these extremes, one can find an $N_{\rm eval}$ for which an arbitrary particle count in this range performs best.

For two snapshots taken at 512 and 5000 total function evaluations, Fig. 7 shows the mean as well as the standard deviations of the distribution. Both Fig. 6 and Fig. 7 show that, if one is limited by the total number of function evaluations, the optimal number of particles is not necessarily the largest number. The reason is that, though more particles allow for exploring more of the parameter space, this comes at the cost of needing more function evaluations per swarm step. However, as $N_{\rm eval}$ increases, more particles ought to strictly produce better approximations to the solution, matching the observed trend. Fig. 7 shows more clearly that the optimal particle count depends on $N_{\rm eval}$. For instance, if one may only perform $N_{\rm eval} \sim 500$ evaluations, $\sim$10-14 particles are best, but the optimal number of particles increases as the number of allowed function evaluations increases. Stated differently, the fewer total function evaluations are available, the less utility is gained from adding more particles. This relationship between total allowed function evaluations and particle count is problem-dependent, but it highlights the usefulness of classical software results when running real quantum algorithms. If one can determine optimal hyperparameters (such as particle count) for a quantum problem using classical software, it may inform the choice of hyperparameters when these problems are scaled up on real quantum hardware.
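The trade-off between particle count and swarm steps follows from simple bookkeeping, sketched below with a hypothetical helper: each step costs one function evaluation per particle, so a fixed budget buys fewer steps for larger swarms. The initial random placement is counted as a step, an assumption consistent with the description of the first step given earlier.

```cpp
#include <cassert>

// Number of complete swarm steps (including the initial random placement)
// affordable with `num_evaluations` function evaluations, at one evaluation
// per particle per step.
int SwarmSteps(int num_evaluations, int num_particles)
{
  assert(num_evaluations >= 0);
  assert(num_particles > 0);
  return num_evaluations / num_particles;
}
```

With a budget of 512 evaluations, 64 particles complete only 8 steps while 4 particles complete 128; at 5000 evaluations the 64-particle swarm reaches 78 steps.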

Convergence of noisy simulations

QAOA is one of the leading candidates to achieve quantum advantage on noisy near-term quantum devices. To evaluate its performance in realistic experiments, it is fundamental to include the effect of noise and decoherence in its protocol. There are two distinct effects of noise that combine in QAOA protocols. First, the expected state $U_{\rm circ} |0\rangle$ is not achieved in practice; instead, a (hopefully related) mixed state is obtained, and the estimate of $L$ on this state may differ from that of the noiseless case. Second, the imprecise value of $L$ is transmitted to the classical optimization loop and affects the next choice of the circuit’s parameters. Fig. 8 illustrates the utility of simulating multiple states in parallel to speed up noisy simulations. The QAOA/MaxCut simulations were performed on 3-regular graphs of 16 vertices, with a QAOA depth of $p = 4$. We used $T_1 = 500\,T_g$ and $T_2 = T_1/2$, where $T_g$ is the gate time and $T_1$ and $T_2$ are respectively the time constants related to relaxation and dephasing. In this example, an optimized set of parameters reached a converged value more quickly than a randomly selected set of parameters.


FIG. 8: Variational algorithms are based on the iterative improvement of quantum circuits based on the estimation of the expectation value of a specific observable. In the presence of noise, the quantum state differs from the result of ideal simulations and so does the expectation value. Here we show the convergence of an ensemble of stochastic simulations to the result including noise effects. The circuit corresponds to a QAOA instance of Max-Cut on 3-regular graphs ($n = 16$ qubits and 4 QAOA steps). We compiled the circuit for a device with all-to-all connectivity via a simple greedy scheduler. The data are taken for the decoherence timescales $T_1 = 500\,T_g$ and $T_2 = T_1/2$, where $T_g$ is a typical gate duration. Upper panel: Convergence of $\langle 0| U_{\rm circ}^\dagger H_{\rm MaxCut} U_{\rm circ} |0\rangle$ to its noisy value by means of an incoherent average over an increasing number of states in the ensemble. Each ensemble state is obtained by simulating the circuit with the addition of the noise gates according to the schedule. The circuit’s parameters have been optimized with the PSO method. The different lines show the same averaging procedure but with different streams of random numbers and suggest that convergence requires hundreds of states in the ensemble, for the parameters used here. For reference, the red dashed line indicates the expectation value $L$ for noiseless simulations. Lower panel: As above but with the circuit’s parameters initialized randomly from a uniform distribution.
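The incoherent average over an increasing number of ensemble states amounts to a running mean of the per-state estimates. A minimal sketch follows, with a hypothetical function name; the input values stand for expectation values computed on individual stochastic simulations and are not data from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Running mean of per-state estimates: entry k is the average over the first
// k+1 ensemble states, i.e. one point of a convergence curve.
std::vector<double> RunningAverage(const std::vector<double>& samples)
{
  std::vector<double> means;
  means.reserve(samples.size());
  double sum = 0.0;
  for (std::size_t k = 0; k < samples.size(); ++k)
  {
    sum += samples[k];
    means.push_back(sum / static_cast<double>(k + 1));
  }
  return means;
}
```

Plotting the returned values against the ensemble size, for several independent random streams, would give curves of the kind described in the caption.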

V. CONCLUSION AND OUTLOOK

We have demonstrated the functionality of Intel Quantum Simulator (IQS), a high-performance software package for simulating quantum algorithms on single workstations, supercomputers, or the cloud. Depending on the platform of choice and the problem at hand, IQS can take advantage of three operation modes: (1) all resources devoted to simulating the highest possible number of qubits, (2) processes divided into separate groups to simulate a pool of distinct circuits, or (3) using the pool of states as the stochastic ensemble needed to model noise and decoherence.

In this work we explored all three operation modes: first we launched 42-qubit simulations on the SuperMUC-NG supercomputer at LRZ and characterized the strong and weak scaling of the one-qubit gate execution. Then, in order to study the performance of many quantum devices operating in parallel, we used IQS to investigate the performance of particle swarm optimization (PSO) for the quantum approximate optimization algorithm (QAOA). Analyzing the results allowed us to estimate the optimal number of PSO particles for the class of problem instances studied. Finally, we performed a convergence study using hundreds of ensembles of stochastic circuits to describe the noise effects for systems of dimension $2^{16} \simeq 65{,}000$. The relatively small overhead provides a remarkable advantage over methods based on density matrix simulations, which require quadratically more memory. We conclude by emphasizing that the two applications just discussed are particularly suitable to run on cloud platforms, where they can take full advantage of the tens of thousands of available nodes without being limited by the communication bandwidth or latency. Whether one simulates one state of many qubits or a pool of many smaller states, IQS is suitable as a standalone program or as a back-end to other quantum simulation software.
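The memory advantage over density-matrix methods can be quantified with a rough estimate. The sketch below assumes 16 bytes per complex entry and dense storage for both representations; the function names are hypothetical and the numbers are back-of-the-envelope figures rather than measurements.

```cpp
#include <cassert>
#include <cstdint>

// ASSUMPTION: every complex entry occupies 16 bytes (double precision).
constexpr std::uint64_t kBytesPerEntry = 16;

// Dense state vector of n qubits: 2^n entries.
std::uint64_t StateVectorBytes(int num_qubits)
{
  assert(num_qubits >= 0 && num_qubits < 60);
  return (std::uint64_t{1} << num_qubits) * kBytesPerEntry;
}

// Dense density matrix of n qubits: 2^n x 2^n entries.
std::uint64_t DensityMatrixBytes(int num_qubits)
{
  assert(num_qubits >= 0 && num_qubits < 30);
  return (std::uint64_t{1} << (2 * num_qubits)) * kBytesPerEntry;
}

// Memory for an ensemble of state vectors held at the same time.
std::uint64_t EnsembleBytes(int num_qubits, std::uint64_t num_states)
{
  return num_states * StateVectorBytes(num_qubits);
}
```

For 16 qubits this gives 1 MiB per state vector against 64 GiB for the density matrix, so even 512 ensemble states stored simultaneously (512 MiB) remain far smaller.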

ACKNOWLEDGMENTS

The authors thank Aastha Grover who helped in releasing Intel Quantum Simulator open source at https://github.com/iqusoft/intel-qs. The authors acknowledge the Leibniz Supercomputing Center of the Bavarian Academy of Science (LRZ) for providing HPC resources and Luigi Iapichino for useful discussions about the results.

Appendix A: Installation

Intel Quantum Simulator builds as a static library and can be used via an API, which corresponds to the quantum gate operations. The following packages must be present on the system before installing the library:
• CMake tools version 3.15+
• MPICH3 for distributed communication
• optional: Intel Math Kernel Library (MKL) for distributed random number generation
• optional: PyBind11 (installed via conda, not pip), required by the Python binding of IQS

The code is hosted as an open-source project in a public GitHub repository and can be cloned via:

    git clone https://github.com/iqusoft/intel-qs
    cd intel-qs

The preferred installation for best performance takes advantage of Intel Parallel Studio compilers and is documented in the GitHub page of the project [48]. Here we provide instructions on how to use the standard GNU toolchain. The installation follows the out-of-source building approach and requires the creation of the directory build. This directory is used to collect all the files generated by the build process. The appropriate makefile is generated with CMake:

    mkdir build
    cd build
    CXX=g++ cmake -DIqsMPI=OFF -DIqsUTest=ON ..
    make

By default, MKL is not required when GNU compilers are used. The commands above install the single-node version of IQS, while to install the distributed version one must set the option -DIqsMPI=ON instead. In this case, at least version 3.1 of MPICH is required for the build to be successful. The result of the building process is twofold: on one hand the static C++ library of IQS is created as build/lib/libintel_qs.a, and on the other hand the executables of the unit test and several examples are saved in the folder build/bin/.

Appendix B: Python bindings

In the last few years, the scientific community has adopted Python as a central language for numerical tools. In the field of quantum computing, several of the most popular frameworks have a Python interface [31–36, 38]. To facilitate the integration with those and other tools, we provide Python bindings of the IQS code for the single-node implementation. By default, whenever MPI is disabled, the building process creates a Python library containing the classes and methods of IQS. The library can be found in build/lib/intelqs_py.cpython-36m-x86_64-linux-gnu.so or in an equivalent file. The binding code itself uses the Pybind11 library, which needs to be installed via conda (not simply with pip) to include the relevant information in CMake. To disable the Python wrapper, even without MPI, set the CMake option -DIqsPython=OFF.


Appendix C: Docker file

The number of qubits that it is possible to simulate with Intel Quantum Simulator is constrained by the amount of memory available to hold the quantum state vector.

Cloud computing platforms make it possible for high-performance computing applications to be run on small temporary clusters of compute nodes that provide more memory than is possible on a single user’s laptop or workstation. One simply allocates the compute nodes with the required memory sizes, configures them to talk to each other in a cluster, and then uses Kubernetes, SWARM, SLURM, BEEs, or some other container or cluster job orchestration package to allocate and use a Docker container or Singularity instance on each node.

In order to facilitate the use of Intel Quantum Simulator in a cloud computing multi-node configuration, we provide a Docker file that can be built to create an image that can be run on each compute node. This Docker build file downloads all of the required software packages, including a host OS, necessary to build and execute the IQS software platform.

To use the multi-node capability of IQS in the cloud environment, it is necessary to compile the Dockerfile to run as a Singularity image. This is due to restrictions on using SSH from within a native Docker image. Compiling down to a Singularity instance works around this restriction.

The Docker container also provides a pre-built environment for compiling and building IQS if researchers do not have access to the correct OS version and software tools required by the platform.

Appendix D: Parallel simulations of a pool of states

The multi-state functionality is better described by providing a practical example. Here we want to simulate a 10-qubit system exposed to dissipation and decoherence, characterized by the $T_1$ and $T_2$ times respectively. The effect of noise is included by performing multiple simulations of the circuit with the addition of stochastic noise gates [40, 49].

The code below is a simplified version of the tutorial provided in the IQS repository (released under the Apache 2 license) and requires MPI. The circuit is trivial: for each qubit a rotation around the X axis is performed, by angles that were randomly chosen. One has control over the gate schedule, and we assume that all gates are performed sequentially starting from qubit 0 until qubit 9. This is controlled by increasing the second argument of the noise gates, corresponding to the duration of the simulated noise.

    #include <cmath>    // M_PI (POSIX extension)
    #include <complex>
    #include <vector>
    #include "qureg.hpp"  // IQS header file (additional header files may need to be included)

    int main(int argc, char **argv)
    {
      // Create the MPI environment, passing the same arguments to all the processes.
      qhipster::mpi::Environment env(argc, argv);
      // One pool state per process. For accurate noise effects, there should be hundreds.
      int num_pool_states;
      MPI_Comm_size(MPI_COMM_WORLD, &num_pool_states);
      // Partition the MPI environment into groups of processes. One group per pool state.
      env.UpdateStateComm(num_pool_states);
      // Number of qubits, here 10.
      int num_qubits = 10;
      // Random number generator, provided by IQS.
      qhipster::RandomNumberGenerator<double> rng;
      rng.SetSeedStreamPtrs(777777);
      // Choose the angles of the circuit, randomly in [0, pi[.
      // They have the same value across all pool states.
      std::vector<double> angles(num_qubits);
      rng.UniformRandomNumbers(angles.data(), angles.size(), 0., M_PI, "pool");
      // Initialize the qubit register state to |00...0>.
      QubitRegister<std::complex<double>> psi(num_qubits);
      psi.Initialize("base", 0);
      // Associate the random number generator to the qubit register.
      // This is required by the stochastic noise gates.
      psi.SetRngPtr(&rng);
      // Set the T_1 and T_2 timescales.
      double T_1 = 30., T_2 = 15.;
      psi.SetNoiseTimescales(T_1, T_2);
      // -- Circuit simulation.
      // Each gate is preceded and followed by noise gates according to the schedule.
      for (int qubit = 0; qubit < num_qubits; ++qubit)
      {
        double duration = double(1 + qubit);
        psi.ApplyNoiseGate(qubit, duration);
        psi.ApplyRotationX(qubit, angles[qubit]);
        duration = double(num_qubits - qubit);
        psi.ApplyNoiseGate(qubit, duration);
      }
      // Compute the probability of qubit 0 to be in |1>.
      double probability = psi.GetProbability(0);
      // Incoherent average across the pool to get the noisy expectation.
      probability = env.IncoherentSumOverAllStatesOfPool<double>(probability);
      probability /= double(num_pool_states);
      return 0;
    }

Notice that the listing declares rng, the (pseudo-)random number generator included in the IQS software. It is advantageous for various scenarios that it can generate three kinds of random numbers:
• local: different for each pool rank (not used in the code above)
• state: common to all ranks of the same state (automatically used by the noise gates)
• pool: common to all ranks of the pool (used for the rotation angles of the circuit)
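To visualize what partitioning the processes into one group per pool state means, the sketch below maps a global rank to a pool state and to a rank within that state. It is purely illustrative: the contiguous, equally sized grouping is an assumption, the names `PoolPlacement` and `PlaceRank` are hypothetical, and the actual assignment performed by `env.UpdateStateComm` is not specified in this appendix.

```cpp
#include <cassert>

struct PoolPlacement
{
  int state_id;    // index of the pool state this rank helps to simulate
  int state_rank;  // rank within the group of processes sharing that state
};

// ASSUMPTION: ranks are split into contiguous groups of equal size, one group
// per pool state. This is an illustration, not the documented IQS behavior.
PoolPlacement PlaceRank(int world_rank, int world_size, int num_pool_states)
{
  assert(num_pool_states > 0);
  assert(world_size > 0 && world_size % num_pool_states == 0);
  assert(world_rank >= 0 && world_rank < world_size);
  const int ranks_per_state = world_size / num_pool_states;
  return PoolPlacement{world_rank / ranks_per_state,
                       world_rank % ranks_per_state};
}
```

In this picture, 'local' random numbers would differ for every `world_rank`, 'state' random numbers would be shared by all ranks with the same `state_id`, and 'pool' random numbers would be shared by every rank.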

[1] Peter W. Shor. Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer. SIAM Review, 41(2):303–332, 1999.

[2] Lov K. Grover. Quantum mechanics helps in searching for a needle in a haystack. Physical Review Letters, 79(2):325–328, 1997.

[3] Nicolas P. D. Sawaya, Tim Menke, Thi Ha Kyaw, Sonika Johri, Alan Aspuru-Guzik, and Gian Giacomo Guerreschi. Resource-efficient digital quantum simulation of d-level systems for photonic, vibrational, and spin-s Hamiltonians. arXiv:1909.12847, 2019.

[4] Alberto Peruzzo, Jarrod R. McClean, Peter J. Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Alan Aspuru-Guzik, and Jeremy L. O’Brien. A variational eigenvalue solver on a photonic quantum processor. Nature Communications, 5(May):4213, jul 2014.

[5] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm. arXiv:1411.4028, 2014.

[6] A. Yu. Kitaev. Fault-tolerant quantum computation by anyons. Annals of Physics, 303:2–30, 2002.

[7] R. Sagastizabal, X. Bonet-Monroig, M. Singh, M. Adriaan Rol, C. C. Bultink, Xiang Fu, C. H. Price, V. P. Ostroukh, N. Muthusubramanian, A. Bruno, M. Beekman, N. Haider, Thomas E. O’Brien, and Leo DiCarlo. Experimental error mitigation via symmetry verification in a variational quantum eigensolver. Physical Review A, 100(1):010302(R), 2019.

[8] Gian Giacomo Guerreschi and Anne Y. Matsuura. QAOA for Max-Cut requires hundreds of qubits for quantum speed-up. Scientific Reports, 9:6903, 2019.

[9] Swamit S Tannu and Moinuddin K Qureshi. Not All Qubits Are Created Equal: A Case for Variability-Aware Policies for NISQ-Era Quantum Computers. ASPLOS ’19, pages 987–999, 2019.

[10] Lingling Lao, Daniel M. Manzano, Hans van Someren, Imran Ashraf, and Carmina G. Almudever. Mapping of quantum circuits onto NISQ superconducting processors. arXiv:1908.04226, 2019.

[11] Igor L. Markov and Yaoyun Shi. Simulating quantum computation by contracting tensor networks. SIAM Journal on Computing, 38(3):963–981, jan 2008.

[12] E. Schuyler Fried, Nicolas P. D. Sawaya, Yudong Cao, Ian D. Kivlichan, Jhonathan Romero, and Alan Aspuru-Guzik. qTorch: The quantum tensor contraction handler. PLOS ONE, 13(12):e0208510, dec 2018.

[13] Alexander J. McCaskey. Tensor Network Quantum Virtual Machine (TNQVM), 2016.

[14] Johnnie Gray. quimb: a python library for quantum information and many-body calculations. Journal of Open Source Software, 3(29):819, 2018.

[15] Benjamin Villalonga, Sergio Boixo, Bron Nelson, Christopher Henze, Eleanor Rieffel, Rupak Biswas, and Salvatore Mandrà. A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware. npj Quantum Information, 5(1):1–16, 2019.

[16] Jumpei Niwa, Keiji Matsumoto, and Hiroshi Imai. General-purpose parallel simulator for quantum computing. Physical Review A, 66(6), dec 2002.

[17] K. De Raedt, K. Michielsen, H. De Raedt, B. Trieu, G. Arnold, M. Richter, Th. Lippert, H. Watanabe, and N. Ito. Massively parallel quantum computer simulator. Computer Physics Communications, 176(2):121–136, jan 2007.

[18] Mikhail Smelyanskiy, Nicolas P. D. Sawaya, and Alán Aspuru-Guzik. qHiPSTER: The quantum high performance software testing environment, 2016.

[19] Thomas Häner and Damian S. Steiger. 0.5 petabyte simulation of a 45-qubit quantum circuit. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC ’17. ACM Press, 2017.

[20] N. Khammassi, I. Ashraf, X. Fu, C.G. Almudever, and K. Bertels. QX: A high-performance quantum computer simulation platform. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017. IEEE, mar 2017.

[21] Ryan LaRose. Distributed memory techniques for classical simulation of quantum circuits, 2018.

[22] Tyson Jones, Anna Brown, Ian Bush, and Simon C. Benjamin. QuEST and high performance simulation of quantum computers. Scientific Reports, 9(1), jul 2019.

[23] Hans De Raedt, Fengping Jin, Dennis Willsch, Madita Willsch, Naoki Yoshioka, Nobuyasu Ito, Shengjun Yuan, and Kristel Michielsen. Massively parallel quantum computer simulator, eleven years later. Computer Physics Communications, 237:47–61, apr 2019.

[24] Eladio Gutierrez, Sergio Romero, María A. Trenas, and Emilio L. Zapata. Quantum computer simulation using the CUDA programming model. Computer Physics Communications, 181(2):283–300, feb 2010.

[25] A. Amariutei and S. Caraiman. Parallel quantum computer simulation on the GPU. In 15th International Conference on System Theory, Control and Computing, pages 1–6, Oct 2011.

[26] Pei Zhang, Jiabin Yuan, and Xiangwen Lu. Quantum computer simulation on multi-GPU incorporating data locality. In Algorithms and Architectures for Parallel Processing, pages 241–256. Springer International Publishing, 2015.

[27] I. Savran, M. Demirci, and A. H. Yılmaz. Accelerating Shor’s factorization algorithm on GPUs. Canadian Journal of Physics, 96(7):759–761, 2018.

[28] Edwin Pednault, John A. Gunnels, Giacomo Nannicini, Lior Horesh, Thomas Magerlein, Edgar Solomonik, Erik W. Draeger, Eric T. Holland, and Robert Wisnieff. Breaking the 49-qubit barrier in the simulation of quantum circuits, 2017.

[29] Zhao-Yun Chen, Qi Zhou, Cheng Xue, Xia Yang, Guang-Can Guo, and Guo-Ping Guo. 64-qubit quantum circuit simulation. Science Bulletin, 63(15):964–971, 2018.

[30] Sergio Boixo, Sergei V. Isakov, Vadim N. Smelyanskiy, Ryan Babbush, Nan Ding, Zhang Jiang, Michael J. Bremner, John M. Martinis, and Hartmut Neven. Characterizing quantum supremacy in near-term devices. Nature Physics, 14(6):595–600, apr 2018.

[31] Pennylane. https://pennylane.ai/. Accessed: 2020-01-10.

[32] Qiskit. https://qiskit.org/. Accessed: 2020-01-10.

[33] Forest. https://www.rigetti.com/forest-api. Accessed: 2020-01-10.

[34] Cirq. https://cirq.readthedocs.io/en/stable/.

[35] Azure quantum. https://azure.microsoft.com/en-us/services/quantum/. Accessed: 2020-01-10.

[36] Projectq. https://projectq.ch/. Accessed: 2020-01-10.

[37] Orquestra. https://www.zapatacomputing.com/orquestra. Accessed: 2020-01-10.

[38] Braket. https://aws.amazon.com/braket/. Accessed: 2020-01-10.

[39] Doan Binh Trieu. Large-scale simulations of error-prone quantum computation devices. PhD thesis, University of Wuppertal, 2009.

[40] Angelo Bassi and Dirk Andre Deckert. Noise gates for decoherent quantum circuits. Physical Review A, 77:032323, 2008.

[41] E. Farhi, J. Goldstone, and S. Gutmann. A quantum approximate optimization algorithm applied to a bounded occurrence constraint problem. arXiv:1412.6062, 2014.

[42] E. Farhi and A. W Harrow. Quantum supremacy through the quantum approximate optimization algorithm. arXiv:1602.07674, 2016.

[43] D. Wecker, M. B. Hastings, and M. Troyer. Training a quantum optimizer. Phys. Rev. A, 94(2):022309, 2016.

[44] C. Yen-Yu Lin and Y. Zhu. Performance of QAOA on typical instances of constraint satisfaction problems with bounded degree. arXiv:1601.01744, 2016.

[45] Gian Giacomo Guerreschi and M. Smelyanskiy. Practical optimization for hybrid quantum-classical algorithms. arXiv:1701.01450, 2017.

[46] James Kennedy and Russell Eberhart. Particle Swarm Optimization. Proceedings of the IEEE International Conference on Neural Networks, pages 1942–1945, 1995.

[47] Magnus Erik Pedersen. Good parameters for particle swarm optimization. Technical Report HL1001, Hvass Laboratories, 2010.

[48] Intel quantum simulator. https://github.com/intel/Intel-QS. Accessed: 2020-01-10.

[49] Nicolas P. D. Sawaya, Mikhail Smelyanskiy, Jarrod R. McClean, and Alan Aspuru-Guzik. Error sensitivity to environmental noise in quantum circuits for chemical state preparation. Journal of Chemical Theory and Computation, 12(7):3097–3108, 2016.

