Distributed Automatic Differentiation for Ptychography

Youssef S. G. Nashed∗, Tom Peterka, Junjing Deng, and Chris Jacobsen

Argonne National Laboratory
Argonne, IL 60439, USA

Abstract

Synchrotron radiation light source facilities are leading the way to ultrahigh resolution X-ray imaging. High resolution imaging is essential to understanding the fundamental structure and interaction of materials at the smallest length scale possible. Diffraction based methods achieve nanoscale imaging by replacing traditional objective lenses with pixelated area detectors and computational image reconstruction. Among these methods, ptychography is quickly becoming the standard for sub-30 nanometer imaging of extended samples, but at the expense of increasingly high data rates and volumes.

This paper presents a new distributed algorithm for solving the ptychographic image reconstruction problem based on automatic differentiation. Input datasets are subdivided between multiple graphics processing units (GPUs); each subset of the problem is then solved either entirely independently of other subsets (asynchronously) or through sharing gradient information with other GPUs (synchronously). The algorithm was evaluated on simulated and real data acquired at the Advanced Photon Source, scaling up to 192 GPUs. The synchronous variant of our method outperformed an existing multi-GPU implementation in terms of accuracy while running at a comparable execution time.

Keywords: inverse problems, image reconstruction, gradient methods, distributed algorithms, X-ray scattering

1 Introduction

In order to understand the behavior of heterogeneous materials at nanometer length scales, one must see their structure. This applies to integrated circuits where the as-manufactured structure may depart from the design, to the interaction of metals with organics in contaminated soil, and to molecular compartmentalization and transport within cells. While super-resolution light microscopy is providing new insights in biology when using fluorescence from one or a few specific molecule types, only electron and X-ray microscopy offer the possibility of nanoscale imaging of a material in its entirety, and it is only with X-rays that one can image specimens much thicker than a micrometer. However, in spite of demonstrations of about 10 nm resolution X-ray lens-based imaging in high contrast test structures [7] or in inferred wavefront profiles [31], most lens-based X-ray microscopes are limited to 20-30 nm resolution for practical studies. An alternative approach is to collect the far-field X-ray diffraction pattern at large angles and use iterative phase retrieval to obtain higher resolution than X-ray lenses permit [30], but this basic approach of coherent diffraction imaging or CDI requires samples of very limited extent. X-ray ptychography [11, 20, 37], where far field coherent X-ray diffraction patterns are collected as a finite-sized coherent beam is scanned across the specimen with significant illumination spot overlaps, provides an alternative approach which is compatible with both large imaging fields of view and freedom from lens-imposed resolution limits.

∗email: [email protected]

[Figure 1 diagram labels: Beam, Optic or Pinhole, Sample, Raster Scan Direction, Detector, Diffraction Pattern]

Figure 1: Simplified ptychography experiment setup. A Cartesian grid is used for the overlapping raster scan positions.

While the set of far-field X-ray diffraction patterns recorded in ptychography captures the Fourier plane magnitudes of the scattered light, the phase is lost, so these diffraction patterns cannot be directly inverted to reveal the sample's structure within each illumination spot. Mathematically, the phase inverse problem is ill-posed, meaning its solution is underdetermined and nonunique [21].

Phase retrieval algorithms [12] are designed to solve the phase problem by iteratively trying to find phases for the measured magnitudes that satisfy a set of constraints. In ptychography, the constraints are derived from diffraction data redundancy. Data redundancy is achieved by scanning the coherent illumination spot across the sample, and collecting a different diffraction pattern at each partially overlapping scan position (see Figure 1). Ptychography has been successful in imaging frozen-hydrated cells at 30 nm resolution [10], bacteria at 20 nm resolution [45], and integrated circuits at 41 nm resolution [18]. Ptychographic tomography has been used to image nanoporous glass to 16 nm 3D resolution [19].

Many methods have been proposed for solving the phase problem. The most commonly used algorithms follow an alternating projections scheme [39], where a randomly initialized object wave function guess is iteratively improved by replacing the magnitude of the predicted wave with the measurement. Variants of this method for ptychography include the Ptychographical Iterative Engine (PIE) [37], later extended and named ePIE [28], and the Difference Map algorithm [42]. The alternating projections approach follows naturally from the underlying physics, starting with a model that is sequentially updated using the forward problem formulation and heuristics about the experimental setup. There is, however, no guarantee that these algorithms converge to the optimal solution. Moreover, it is inherently difficult to measure their computational complexity [35].

Alternatively, phase retrieval can be formulated as a nonlinear optimization problem [17]. The forward problem is mathematically described using a cost function, taking into account different noise models and regularization; the inverse problem is then solved by finding the global minimum of this cost function. This requires calculating the gradient of the cost function to direct the search algorithm. Typically, a gradient expression is explicitly derived by symbolically differentiating the cost function. This has the advantage of using ‘out-of-the-box’ optimization methods, like Gauss-Newton [49], conjugate gradient [17], or quasi-Newton [26], and being more computationally efficient than finite-difference methods, when appropriate approximations are used [35]. The major drawback of this method is the manual derivation of the gradient expression. Besides requiring assumptions and approximations in the forward model cost function, for which there exist symbolic derivatives, it also keeps the forward model tightly coupled with the gradient calculation. With any updates to the forward model, for example due to new experimental capabilities, the gradient expression needs to be derived again.

Automatic differentiation (AD) [36], also known as Algorithmic Differentiation, offers the simple expressibility of alternating projections methods, along with the power of gradient directed optimization methods. AD calculates partial derivatives of a function with respect to each of its input parameters by using the chain rule. The chain rule states that a derivative of a complex function can be automatically computed by combining derivatives of elementary operations, like arithmetic and trigonometric functions, that make up this function. The application of AD to the phase problem was highlighted by Jurling and Fienup who derived the complex-valued elementary operations specific to phase retrieval algorithms [24]. They applied ‘manual automatic differentiation’ by computing the gradient by hand, but using the principles of AD to convert forward model code to a series of gradient calculations. A related approach was recently presented to retrieve the 3D structure of a thick specimen [22], but in optical microscopy, employing a different forward model, error metric, and optimization strategy than what is presented here.

This paper investigates the use of AD for ptychographic phase retrieval. We focus on method accuracy when the gradient calculation is shared among distributed many-core computing resources and execution time on these resources. To summarize, the main contributions of this paper are:

• a fully automatic gradient calculation from source code for ptychography as will be discussed in Section 2.2;

• an algorithm for distributing a ptychographic dataset and gradient synchronization among multiple GPUs;

• a comparison of the new algorithm, in terms of convergence and performance, with an existing method.

2 Background

This section defines the notation and terminology used throughout the paper. It gives an overview of the ptychographic phase retrieval forward model, and introduces the AD basics and tools utilized in calculating gradients for this model.

2.1 Ptychography

Ptychography solves the phase problem using interfering diffraction patterns from overlapping scan areas in the object space. The idea itself is decades old [20], but its adoption accelerated in the last decade [8, 15, 23, 38, 42, 47]. This is mainly due to increased robustness of the inversion process, compared to previous CDI techniques [2, 48], and the ability to image objects much larger than the focused beam at a resolution that is, theoretically, only limited by the beam wavelength.

Figure 1 shows a simplified diagram of a ptychography experiment. A focused beam is used to raster scan an object in a predefined arrangement of spatially overlapping beam locations, generating a set of $J$ ‘far-field’ diffraction patterns at the detector plane. The forward model for ptychography can be written as

$$\psi_j(r) = P(r - r_j) \circ O(r). \tag{1}$$

The complex-valued object wave function $O(r)$ interacts with the beam wave function (termed Probe hereafter) $P(r)$ at position $r_j$. This interaction is approximated by the complex element-wise Hadamard product operator $\circ$ of the two wave functions, generating an exit wave function $\psi_j(r)$ that is propagated to the detector plane, and then measured as a real-valued diffraction pattern $I_j(q)$ defined as

$$I_j(q) = \left| \mathcal{F}[\psi_j(r)] \right|^2, \tag{2}$$

where $\mathcal{F}[\cdot]$ denotes the Fourier transform from real space $r$ to reciprocal space $q$.

Alternating projections reconstruction algorithms start with an initial guess of $O(r)$, and sometimes $P(r)$, then proceed with calculating the exit wave function as in Equation (1). A new estimate $\Psi_j(q)$ is then computed by replacing the modulus of the Fourier transform of the $j$-th exit wave with the square root of the measured diffraction pattern intensity, such that

$$\Psi_j(q) = \sqrt{I_j(q)} \, \frac{\mathcal{F}[\psi_j(r)]}{\left| \mathcal{F}[\psi_j(r)] \right|}. \tag{3}$$

A new exit wave ψ′j(r) can then be computed by means of an inverse Fourier transform, as in

$$\psi'_j(r) = \mathcal{F}^{-1}[\Psi_j(q)]. \tag{4}$$

This can be done for all $J$ relative shifts in $r$ at once [42], or sequentially and in random order [28]. The residual $\{\psi'_j(r) - \psi_j(r)\}$ is then used to refine the object guess $O(r)$ in an iterative fashion.
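Equations (1) to (4) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function names, the flat test probe, the array sizes, and the small `eps` guard against division by zero are all assumptions made for the example.

```python
import numpy as np

def exit_wave(probe, obj, pos):
    """Equation (1): psi_j = P(r - r_j) * O(r), restricted to the probe window."""
    y, x = pos
    n = probe.shape[0]
    return probe * obj[y:y + n, x:x + n]

def diffraction_intensity(psi):
    """Equation (2): I_j(q) = |F[psi_j(r)]|^2."""
    return np.abs(np.fft.fft2(psi)) ** 2

def replace_modulus(psi, measured_intensity, eps=1e-12):
    """Equations (3)-(4): keep the Fourier phase, impose the measured modulus."""
    far_field = np.fft.fft2(psi)
    corrected = np.sqrt(measured_intensity) * far_field / (np.abs(far_field) + eps)
    return np.fft.ifft2(corrected)

rng = np.random.default_rng(0)
true_obj = np.exp(1j * rng.uniform(-0.5, 0.5, (12, 12)))    # pure-phase test object
probe = np.ones((8, 8), dtype=complex)                       # flat illustrative probe
pos = (2, 3)                                                 # one scan position r_j

measured = diffraction_intensity(exit_wave(probe, true_obj, pos))
guess = np.exp(1j * rng.uniform(-np.pi, np.pi, true_obj.shape))  # random initial O(r)
psi = exit_wave(probe, guess, pos)
psi_new = replace_modulus(psi, measured)
residual = psi_new - psi                                     # drives the object update
```

After the replacement step, the Fourier modulus of `psi_new` agrees with the measurement while its Fourier phase is still that of the current guess; how `residual` is fed back into the object is what distinguishes PIE, ePIE, and the Difference Map.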

As was stated in Section 1, gradient methods formulate an error metric based on the given information and forward model specification. The simplest form of this error metric for ptychography can be written as

$$E = \frac{1}{J} \sum_{j=1}^{J} \left\{ \left| \mathcal{F}[\psi_j(r)] \right|^2 - I_j(q) \right\}^2, \tag{5}$$

essentially the Mean Squared Error (MSE) estimator measuring how well a forward model agrees with measured data. More sophisticated error metrics can be defined, including regularization and weighting functions to account for noise, numerical instability, bad detector pixels, or beam stops [13, 17]. An analytic gradient expression is manually derived from Equation (5) by expanding and differentiating the error metric function. Guided by the gradient information, finding the minimum of this error function often yields a good estimate of the object wave function.
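To make concrete what such a manual derivation involves, the following is an illustrative sketch of one, not a derivation taken from the paper. It assumes that the sum over detector pixels $q$ is implicit in Equation (5), that $\mathcal{F}$ is a discrete unnormalized Fourier transform with adjoint $\mathcal{F}^{\dagger}$, and it introduces the shorthand $\hat{\psi}_j$ and $\Delta_j$, which do not appear in the original text.

```latex
\begin{aligned}
\hat{\psi}_j(q) &= \mathcal{F}[\psi_j(r)], \qquad
\Delta_j(q) = |\hat{\psi}_j(q)|^2 - I_j(q), \qquad
E = \frac{1}{J} \sum_{j=1}^{J} \sum_{q} \Delta_j(q)^2, \\
dE &= \frac{2}{J} \sum_{j,q} \Delta_j \, d|\hat{\psi}_j|^2
    = \frac{4}{J} \, \mathrm{Re} \sum_{j,q} \Delta_j \, \hat{\psi}_j^{*} \, d\hat{\psi}_j,
\qquad d\hat{\psi}_j = \mathcal{F}[P(r - r_j) \circ dO(r)], \\
\frac{\partial E}{\partial\, \mathrm{Re}\, O} + i \, \frac{\partial E}{\partial\, \mathrm{Im}\, O}
  &= \frac{4}{J} \sum_{j=1}^{J} P^{*}(r - r_j) \circ
     \mathcal{F}^{\dagger}\!\left[ \Delta_j(q) \, \hat{\psi}_j(q) \right](r).
\end{aligned}
```

Every change to the forward model or to the error metric would require redoing these steps by hand.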

2.2 Automatic Differentiation

AD solves many problems with symbolic and numerical differentiation, such as computational inefficiency and errors introduced by domain discretization. It can automatically provide derivatives, high-order derivatives, and partial derivatives with respect to many input parameters for functions defined in computer source code. The majority of modern programming languages have tools for AD [3, 5, 16, 46], which work either by source code transformation (SCT) or operator overloading (OO). SCT tools run before the language compiler, generating source code for derivative calculation from the source code of existing functions. OO tools provide the datatypes and elementary operators to compute partial derivatives, apply the chain rule, and define the function to be differentiated. Both approaches typically use a computational graph [43], a directed acyclic graph in which the vertices are operators or independent variables, and the edge weights equal the partial derivative of a node with respect to the edge source node. The computational graph is traversed to compute the root node derivative by aggregating the partial derivatives along all paths to leaf nodes, applying the chain rule for every edge weight [6].
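The operator overloading approach and the graph traversal can be illustrated with a toy reverse-mode implementation in plain Python. The `Var` class, the `sin` helper, and `backward` are hypothetical names chosen for this sketch; production tools are far more elaborate, but the edge weights and the chain-rule accumulation follow the description above.

```python
import math

class Var:
    """A node of the computational graph that records its parents."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples of (parent node, local partial derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(x):
    """Elementary operation with a known derivative, cos(x)."""
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def backward(root):
    """Reverse sweep: accumulate d(root)/d(node) from the root to the leaves."""
    order, seen = [], set()

    def visit(node):                      # depth-first topological ordering
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(root)
    root.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local   # chain rule along one edge

x, y = Var(2.0), Var(3.0)
f = x * y + sin(x)      # f(x, y) = x*y + sin(x)
backward(f)             # x.grad = y + cos(x), y.grad = x
```

A single reverse sweep yields the partial derivatives with respect to all inputs at once, which is why this mode suits problems with many parameters and one scalar error value.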

The past few years have seen a growing interest in AD from both academia and industry, fueled by a need for generic, user-friendly deep learning toolkits. The backpropagation learning algorithm used to train deep neural networks is a special case of reverse mode AD [27, 40]. Consequently, several libraries available for deep learning also include high performance routines for constructing computational graphs and traversing them, from top to bottom, computing partial derivatives along the way to tune ‘neurons’ in a multilayer neural network [1, 4]. TensorFlow [1] is a Python based deep learning API provided by Google with OO AD and GPU implementation. In this paper, we use TensorFlow to automatically calculate gradients for the ptychography forward model and error metric previously defined.

3 AD for Ptychography

Ptychographical reconstruction algorithms typically try to estimate the Fourier phases lost during measurement of the diffraction patterns (Fourier magnitudes). Once the two components, phase and magnitude, of the Fourier function are known, the object wave function is obtained by the inverse Fourier transform in Equation (4). In our case, however, the error metric comparison defined in Equation (5) is performed directly on the Fourier magnitude values. Instead of solving for the Fourier phases first (in order to obtain the object function), the object function is updated iteratively using gradient information given by AD. AD computes the partial derivative of $E$ with respect to each pixel value in $O$. In practice, the complex-valued object function is decomposed into its real and imaginary components, $O_r$ and $O_i$ respectively, and a partial derivative is computed for each component separately, $\partial E / \partial O_r$ and $\partial E / \partial O_i$. Other independent variables, such as the probe function $P$, can also be added to this framework. Currently, our algorithm requires a known probe function and retrieves the object function.
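The two partial derivatives per pixel can be checked numerically on a toy problem. The NumPy sketch below is not the TensorFlow-based implementation: it assumes the error of Equation (5) is summed over detector pixels, uses a hand-derived gradient as a stand-in for what an AD tool would return, and compares it with central finite differences. All function names, array sizes, and the random test data are assumptions of the example.

```python
import numpy as np

def error_metric(obj, probe, positions, intensities):
    """Equation (5), summed over detector pixels and averaged over J patterns."""
    n = probe.shape[0]
    total = 0.0
    for (y, x), measured in zip(positions, intensities):
        far_field = np.fft.fft2(probe * obj[y:y + n, x:x + n])
        total += np.sum((np.abs(far_field) ** 2 - measured) ** 2)
    return total / len(positions)

def hand_derived_gradient(obj, probe, positions, intensities):
    """dE/dO_r + i*dE/dO_i from a manual derivation, used only as a reference."""
    n = probe.shape[0]
    grad = np.zeros_like(obj)
    for (y, x), measured in zip(positions, intensities):
        far_field = np.fft.fft2(probe * obj[y:y + n, x:x + n])
        diff = np.abs(far_field) ** 2 - measured
        # The adjoint of the unnormalized FFT is (number of pixels) * ifft2.
        back = far_field.size * np.fft.ifft2(diff * far_field)
        grad[y:y + n, x:x + n] += 4.0 * np.conj(probe) * back
    return grad / len(positions)

def numerical_partials(obj, probe, positions, intensities, pixel, h=1e-4):
    """Central differences for dE/dO_r and dE/dO_i at one object pixel."""
    partials = []
    for step in (h, 1j * h):              # perturb the real, then the imaginary part
        plus, minus = obj.copy(), obj.copy()
        plus[pixel] += step
        minus[pixel] -= step
        partials.append((error_metric(plus, probe, positions, intensities)
                         - error_metric(minus, probe, positions, intensities)) / (2 * h))
    return tuple(partials)

rng = np.random.default_rng(1)
probe = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))
true_obj = np.exp(1j * rng.uniform(-0.5, 0.5, (12, 12)))
positions = [(0, 0), (0, 4), (4, 0), (4, 4)]              # overlapping scan windows
intensities = [np.abs(np.fft.fft2(probe * true_obj[y:y + 8, x:x + 8])) ** 2
               for y, x in positions]
guess = np.exp(1j * rng.uniform(-np.pi, np.pi, (12, 12)))

pixel = (5, 5)                                            # covered by all four windows
grad = hand_derived_gradient(guess, probe, positions, intensities)
d_re, d_im = numerical_partials(guess, probe, positions, intensities, pixel)
grad_at_truth = hand_derived_gradient(true_obj, probe, positions, intensities)
```

The real part of `grad[pixel]` should match `d_re` and the imaginary part `d_im`, and the gradient vanishes at the true object, where the error is zero.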

Once the partial derivatives are computed, minimizing $E$ can be achieved by any gradient-based optimization method. In this paper we employ the Adam algorithm for stochastic optimization [25], with the following hyperparameter values: 0.8 for the learning rate, 0.9 as the exponential decay rate of the first moment estimates, and 0.99 for the exponential decay rate of the second moment estimates.
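For reference, the Adam update with the hyperparameters quoted above can be written as a short plain-Python sketch. The scalar quadratic test function, the function name `adam_minimize`, and the value of `eps` are assumptions made for illustration; the paper applies the optimizer to the object pixels through TensorFlow.

```python
import math

def adam_minimize(grad_fn, x0, steps=200, lr=0.8, beta1=0.9, beta2=0.99, eps=1e-8):
    """Minimize a scalar function of one variable given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g          # first moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g      # second moment estimate
        m_hat = m / (1.0 - beta1 ** t)             # bias corrections
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy problem: minimize (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Because the step is normalized by the running gradient magnitude, a learning rate as large as 0.8 sets the initial step size in units of the optimized variable rather than scaling the raw gradient.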

3.1 Distributed Algorithm

Input datasets for ptychography consist of diffraction pattern images measured from spatially overlapping scan positions. Therefore, there exists a decomposition of the input dataset that maintains mutual spatial information among diffraction patterns, and in which the inverse problem is locally and independently solvable. In the case of Cartesian grid scans, a regular 2D decomposition suffices [18, 29, 33]. Solving the ptychographic phase retrieval problem for each sub-dataset will result in a part of the reconstructed object image. Those parts can be merged to form the final reconstruction since the original sub-datasets overlap at the decomposition borders.

Similar to our previous multi-GPU implementation of the PIE and ePIE reconstruction algorithms [33], we implemented two variants of the new AD-based methods. An asynchronous version, AD async, runs the computations independently on separate GPUs without any communication except at the end of the reconstruction, when all partial results are stitched back together using a parallel reduction-with-merge algorithm [34]. A synchronous version, AD sync, communicates local gradient information globally to all running GPUs at every iteration, employing a Radix-k communication algorithm [34], effectively sharing a synchronized object function among all GPUs.

AD async delivers the best scaling performance in a multi-GPU distributed environment, due to limited communication between computing resources. Post-reconstruction stitching, however, is not as simple as a parallel gather operation. Each GPU’s partial reconstruction exhibits spatial offsets and phase shifts relative to all other GPU reconstructions. 2D registration and phase modulation methods are required to merge all partial reconstructions into one coherent object wave function. We use a phase correlation algorithm for 2D rigid registration and a gain compensation technique, such as is used for photographic image stitching, for finding a common phase offset for the final reconstruction [33].
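The registration step can be illustrated with a minimal phase correlation sketch in NumPy. It recovers an integer, circular shift between two images; the second helper estimates a single constant phase between two overlapping complex patches and is only a simple stand-in for the gain compensation technique cited above. Both function names and the test data are assumptions of this example.

```python
import numpy as np

def phase_correlation_shift(reference, moved, eps=1e-12):
    """Integer (dy, dx) such that moved ~ np.roll(reference, (dy, dx), axis=(0, 1))."""
    cross_power = np.fft.fft2(moved) * np.conj(np.fft.fft2(reference))
    cross_power /= np.abs(cross_power) + eps         # keep only the phase
    correlation = np.abs(np.fft.ifft2(cross_power))  # peak sits at the shift
    dy, dx = np.unravel_index(np.argmax(correlation), correlation.shape)
    return int(dy), int(dx)

def common_phase_offset(part_a, part_b):
    """Constant phase that best aligns part_b to part_a on their overlap."""
    return float(np.angle(np.sum(part_a * np.conj(part_b))))

rng = np.random.default_rng(2)
reference = rng.normal(size=(32, 32))
moved = np.roll(reference, (3, 5), axis=(0, 1))
shift = phase_correlation_shift(reference, moved)

patch_a = np.exp(1j * rng.uniform(-0.5, 0.5, (16, 16)))
patch_b = patch_a * np.exp(-0.7j)                    # same patch, shifted phase
offset = common_phase_offset(patch_a, patch_b)
aligned_b = patch_b * np.exp(1j * offset)
```

Shifts are returned modulo the image size, so a negative offset appears as a large positive one; handling that wrap-around and sub-pixel offsets is left out of the sketch.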

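[Figure 2 diagram: the scan positions $r_j$, probe $P$, and object $O$ feed $N$ parallel branches; branch $n$ forms the exit waves $\psi_{j+d_n}$, applies $|\mathrm{FFT}|^2$ to obtain $I_{j+d_n}$, subtracts the corresponding part of the measured data $D_j$ to give the local error $E_n$ and gradient $g_n$; the gradients $g_1, \ldots, g_N$ are summed into $G$.]

Figure 2: Block diagram of the AD sync algorithm running on N GPUs. Input data are in gray, independent and intermediate variables are in blue, and operators are in orange.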

Figure 2 shows a data flow chart of the AD sync algorithm. The object function $O$ is randomly initialized and shared among all GPUs. The data subdivision scheme is common to both synchronous and asynchronous versions of the algorithm. Starting with scan area extents as the domain to be decomposed, regular 2D offset sets $[d_1, d_2, \ldots, d_N]$ are computed based on the number $N$ of GPUs used. Those offsets are used to define subsets in the diffraction pattern dataset $D_j$ and scan positions list $r_j$. Each GPU $n$ calculates its current diffraction pattern estimate $I_{j+d_n}$ for a specific 2D region of $O$ indexed by a different $r_j$ subset. A local MSE $E_n$ is calculated from the estimate and the measured data subset $D_{j+d_n}$, for which the AD tool derives a local gradient $g_n$. In AD async, $g_n$ is used to update each GPU's local copy of $O$, while in AD sync, local gradients are aggregated to form a global gradient $G$, which is then used to update the shared object function. This process is repeated for a predefined number of iterations, after which the AD sync algorithm is done. AD async then merges the local copies of $O$ into one final reconstruction using the stitching algorithm introduced above.
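The data subdivision and the two update styles can be sketched as follows. This is a single-process illustration under stated assumptions: `decompose_scan_grid`, `synchronous_step`, and `asynchronous_step` are hypothetical names, the `halo` overlap width is an assumed parameter, a plain gradient step replaces the Adam update used in the paper, and the local gradients are placeholder arrays rather than outputs of an AD tool.

```python
import numpy as np

def decompose_scan_grid(rows, cols, blocks_y, blocks_x, halo=1):
    """Split a rows x cols scan grid into blocks that overlap by `halo` positions."""
    y_edges = np.linspace(0, rows, blocks_y + 1).astype(int)
    x_edges = np.linspace(0, cols, blocks_x + 1).astype(int)
    subsets = []
    for by in range(blocks_y):
        for bx in range(blocks_x):
            y0, y1 = max(y_edges[by] - halo, 0), min(y_edges[by + 1] + halo, rows)
            x0, x1 = max(x_edges[bx] - halo, 0), min(x_edges[bx + 1] + halo, cols)
            subsets.append([(y, x) for y in range(y0, y1) for x in range(x0, x1)])
    return subsets

def synchronous_step(obj, local_gradients, learning_rate):
    """AD sync style: sum local gradients into G and update the shared object."""
    global_gradient = np.sum(local_gradients, axis=0)
    return obj - learning_rate * global_gradient

def asynchronous_step(local_objs, local_gradients, learning_rate):
    """AD async style: every worker updates only its own copy of the object."""
    return [o - learning_rate * g for o, g in zip(local_objs, local_gradients)]

# 160 x 160 scan grid split among 4 workers arranged as 2 x 2 blocks.
subsets = decompose_scan_grid(160, 160, 2, 2, halo=1)

# Placeholder gradients, one array per worker, all defined on the full object.
gradients = [np.full((4, 4), n + 1.0) for n in range(4)]
shared = synchronous_step(np.zeros((4, 4)), gradients, learning_rate=0.1)
copies = asynchronous_step([np.zeros((4, 4))] * 4, gradients, learning_rate=0.1)
```

In a distributed run the sum inside `synchronous_step` would be a collective reduction across processes, which is where the per-iteration communication cost of the synchronous variant comes from.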

3.2 Implementation Details

Our software stack is highlighted in Figure 3. Diffraction pattern data are stored in HDF5 format [14]. TensorFlow’s Python API is used to construct the computational graph for our forward model, derive gradients for its error metric, and adjust the model parameters accordingly. Each GPU runs a local copy of TensorFlow where distributed memory parallelism is explicitly handled using MPI. Domain decomposition is achieved using the DIY parallel programming library [32] that is written on top of MPI to facilitate communication between parallel processes. In DIY terminology, we assign a DIY block, managed by one MPI rank, to each GPU. DIY also provides the parallel reduction algorithm used in AD async and the Radix-k communication pattern employed at every iteration of AD sync. To bridge between TensorFlow and DIY, a wrapper layer, pyDIY, is developed. It is responsible for marshalling Python data structures to C++, without the need for copying or serialization.

Figure 3: Software stack for distributed AD-based ptychography.

4 Evaluation

Evaluation was performed using two datasets: a synthetic sample simulating the diffraction patterns from a known image, and real data from the Bionanoprobe [9] at beamline 21-ID-D of the Advanced Photon Source at Argonne National Laboratory. Experiments were run on the Cooley cluster at the Argonne Leadership Computing Facility (ALCF). Cooley is a visualization platform consisting of 126 compute nodes; each node has 12 CPU cores and one NVIDIA Tesla K80 dual-GPU card. Each GPU has 12 GB of memory and a CUDA compute capability of 3.7; the code was built and run with CUDA v7.5.

Figure 4: Simulated dataset performance evaluation plots. The time reported is the total running time of the algorithms in seconds, including I/O time. NRMSE is the Normalized Root Mean Squared Error calculated per-pixel between object function estimates and the ground truth.

4.1 Simulated Data

We compared AD sync, AD async, and our previous implementations of the PIE algorithm, termed PIE sync and PIE async, in terms of accuracy and performance. A synthetic pure phase object was raster scanned using a regular 160×160 Cartesian grid, generating a total of 25,600 far-field diffraction patterns, each of 128×128 pixels, for a total of 1.56 GB of single precision floating point raw data. The diffraction patterns were generated with a single-pixel detector point-spread function, with no added noise and 90% overlap between adjacent scan points in the horizontal and vertical directions. Results were obtained for different numbers of GPUs after 200 iterations of all algorithms.

Performance and accuracy plots of different GPU configurations [1-128] are reported in Figure 4. Our new algorithm has a higher memory requirement than the previous implementation, because TensorFlow creates auxiliary arrays for storing the computational graph and computing the gradients. Therefore, time was only reported for AD sync and AD async running on 4 or more GPUs, while PIE sync and PIE async were able to fit the entire simulated dataset on one GPU. The top left plot shows mean runtime of 5 independent runs of all algorithms. It is clear that scaling is almost linear for the asynchronous algorithms, AD async and PIE async, while synchronous ones suffer from communication overhead. The top right plot shows the algorithms' scaling efficiency using a 4 GPU baseline. Again, asynchronous algorithms exhibit better scaling efficiency than synchronous algorithms. The average scaling efficiency for AD sync is 65.7%, AD async is 77.7%, PIE sync is 61.8%, and PIE async is 95.9%. The bottom left plot is of a weak scaling test for the new algorithms, in which the workload assigned to each GPU was kept constant by doubling the input dataset size with the number of assigned GPUs.
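The text does not spell out how scaling efficiency is computed; a common definition of strong scaling efficiency relative to a baseline configuration is sketched below as an assumption. The timings in the example are hypothetical placeholders, not measurements from the paper.

```python
def strong_scaling_efficiency(timings, baseline_gpus=4):
    """Efficiency relative to the baseline: (T_base * N_base) / (T_N * N)."""
    base_time = timings[baseline_gpus]
    return {n: (base_time * baseline_gpus) / (t * n) for n, t in timings.items()}

# Hypothetical wall-clock times in seconds, keyed by GPU count.
timings = {4: 400.0, 8: 210.0, 16: 115.0, 32: 70.0}
efficiency = strong_scaling_efficiency(timings)
average = sum(efficiency.values()) / len(efficiency)
```

An efficiency of 1.0 means that doubling the number of GPUs halves the running time; communication overhead pushes the value below 1.0 as the GPU count grows.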

The bottom right plot of Figure 4 depicts the convergence of all algorithms, measured employing a Normalized Root Mean Squared Error (NRMSE). NRMSE was calculated pixel-wise between object estimates and their corresponding 2D regions in the ground truth, for each GPU and per iteration. Individual GPU errors are aggregated for all GPUs of a certain configuration and averaged across 5 independent runs. The plotted quantity is the mean NRMSE of all GPU configurations of each algorithm. Both versions of our new algorithm, AD sync and AD async, outperform their PIE counterparts in terms of final reconstruction quality and NRMSE standard deviation between runs. It is also evident from the plot that synchronous versions of the algorithm have better convergence than the asynchronous versions. This is mainly because of the increased statistics, found along scan region decomposition boundaries, that are only available to communicating GPUs in the synchronous approach.

4.2 Experimental Data

In order to evaluate our algorithm’s applicability and performance, we tested it on real data acquired at a synchrotron radiation facility. The data was acquired at the Bionanoprobe [9] at the 21-ID-D beamline of the Advanced Photon Source. The sample is a CMOS integrated circuit (IC) fabricated in a 65 nm technology with eight copper interconnect layers. This IC was imaged using a 140×300 Cartesian grid of scan points (30% overlap), 10 keV X-rays, and a Pilatus 300K detector (Dectris Inc.) with 619×487 pixels placed 2 meters downstream to collect the diffraction patterns. The central 256×256 pixels of the detector data were selected for the reconstruction, yielding a reconstructed image pixel size of 5.6 nm and 10.25 GB of raw input data. Such large datasets usually contain noise resulting from finite photon counts, positional errors in the scanning stages, fluctuations in the beam intensity, and distortions caused by air scattering, changes in sample temperature, and bad or missing detector pixels.
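The quoted data volumes and pixel size can be reproduced with a short calculation. The sketch assumes 4 bytes per single precision value and binary gigabytes (2^30 bytes); the 172 µm detector pixel pitch is not stated in the text and is taken here from the published specifications of the Pilatus detector family, so it should be treated as an outside assumption. The function names are illustrative.

```python
def raw_size_gib(scan_rows, scan_cols, pattern_pixels, bytes_per_value=4):
    """Raw data volume for a grid scan of square diffraction patterns."""
    return scan_rows * scan_cols * pattern_pixels ** 2 * bytes_per_value / 2 ** 30

def reconstruction_pixel_nm(energy_kev, distance_m, pattern_pixels, detector_pixel_um):
    """Far-field pixel size: wavelength * distance / (N * detector pixel pitch)."""
    wavelength_nm = 1.239842 / energy_kev            # lambda = hc / E, hc in keV*nm
    detector_extent_nm = pattern_pixels * detector_pixel_um * 1e3
    return wavelength_nm * distance_m * 1e9 / detector_extent_nm

simulated = raw_size_gib(160, 160, 128)      # simulated dataset of Section 4.1
experimental = raw_size_gib(140, 300, 256)   # IC dataset described above
pixel = reconstruction_pixel_nm(10.0, 2.0, 256, 172.0)
```

Under these assumptions the simulated dataset comes to about 1.56 GB, the experimental dataset to about 10.25 GB, and the reconstruction pixel size to about 5.6 nm, in agreement with the figures quoted in the text.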

Figure 5 shows the 6.64 Terapixel reconstruction of the IC using AD sync running for 200 iterations on 32 GPUs, along with plots for estimated resolution and performance scaling up to 192 GPUs. Because phase contrast is much stronger than absorption contrast at the X-ray energy used, we show the phase contrast image, after phase unwrapping, from the ptychographic reconstruction of the IC complex transmission function.

5 Conclusion and Future Work

In this paper we presented a new parallel multi-GPU algorithm for ptychographic reconstruction based on automatic differentiation. The algorithm utilizes a production deep learning software library, TensorFlow, for adjusting an estimate of the ptychography forward model to fit with the measured data. In order to distribute the computation among multiple GPUs, our distributed algorithm splits raw input data into spatially contiguous partitions that are allocated to each GPU. Sub-problems are then either solved independently on each GPU or by communicating current solution estimates to other GPUs. The new algorithm was evaluated on synthetic and real data acquired at the APS. The synchronous version of the new algorithm was found to have superior convergence properties and scaling performance when compared to an existing implementation.

[Figure 5 panels: (a) phase image; (b) wall time (s) versus number of GPUs; (c) gray value versus distance (nm), with a 22 nm annotation]

Figure 5: Ptychographic reconstruction of a CMOS IC fabricated in 65 nm technology. (a) The phase retrieved for the IC with a wafer thickness of 300 µm. In this case, the image shows an overlay of features at the chip wiring and gate level, along with variations in the overall wafer thickness which are presumably due to scratches on the surface of the wafer; (b) Performance plot for the synchronous and asynchronous algorithms, scaling up to 192 GPUs; (c) Line profile plot along the red line in (a) showing 22 nm resolution.

Thanks to decoupling the forward model from its optimization, one can easily amend the forward model, tune the optimizer hyperparameters, or experiment with different gradient-based optimizers and error metrics. Future work includes adding more input parameters to the current forward model, such as the probe function and experimental instabilities. Expanding the gradient calculation to these new parameters with the existing error metric, and comparing the results with various modifications to the error metric, is an interesting study to conduct. Additionally, we consider the work presented here a steppingstone towards a 3D ptychographic reconstruction method, owing to similarities between multilayer neural networks and the multislice propagation theory. Multislice techniques are being incorporated into many emerging 3D X-ray ptychographic methods [41, 44].

Acknowledgment

We gratefully acknowledge the use of the resources of the Argonne Leadership Computing Facility and the Advanced Photon Source at Argonne National Laboratory. This work was supported by Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-FC02-06ER25777. The Bionanoprobe is funded by NIH/NCRR High End Instrumentation (HEI) grant (1S10RR029272-01) as part of the American Recovery and Reinvestment Act (ARRA). Development of ptychography with the Bionanoprobe has been supported in part by NIH grant R01 GM104530. Use of the Advanced Photon Source, an Office of Science User Facility operated for the U.S. Department of Energy (DOE) Office of Science by Argonne National Laboratory, was supported by the U.S. DOE under Contract No. DE-AC02-06CH11357.

References

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

[2] P. Bao, F. Zhang, G. Pedrini, and W. Osten. Phase retrieval using multiple illumination wavelengths. Optics letters, 33(4):309–311, 2008.

[3] B. M. Bell. Cppad: a package for c++ algorithmic differentiation. Computational Infrastructure for Operations Research, 57, 2012.

[4] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: A cpu and gpu math compiler in python. In Proc. 9th Python in Science Conf, pages 1–7, 2010.

[5] C. Bischof, A. Carle, G. Corliss, A. Griewank, and P. Hovland. Adifor–generating derivative codes from fortran programs. Scientific Programming, 1(1):11–29, 1992.

[6] C. H. Bischof, P. D. Hovland, and B. Norris. On the implementation of automatic differentiation tools. Higher-Order and Symbolic Computation, 21(3):311–331, 2008.

[7] W. Chao, B. Harteneck, J. Liddle, E. Anderson, and D. Attwood. Soft x-ray microscopy at a spatial resolution better than 15 nm. Nature, 435:1210–1213, 2005.

[8] S. Chen, J. Deng, D. Vine, Y. Nashed, Q. Jin, T. Peterka, C. Jacobsen, and S. Vogt. Simultaneous x-ray nano-ptychographic and fluorescence microscopy at the bionanoprobe. In SPIE Optical Engineering + Applications, pages 95920I–95920I. International Society for Optics and Photonics, 2015.

[9] S. Chen, J. Deng, Y. Yuan, C. Flachenecker, R. Mak, B. Hornberger, Q. Jin, D. Shu, B. Lai, J. Maser, et al. The bionanoprobe: hard x-ray fluorescence nanoprobe with cryogenic capabilities. Journal of synchrotron radiation, 21(1):66–75, 2014.

[10] J. Deng, D. J. Vine, S. Chen, Y. S. Nashed, Q. Jin, N. W. Phillips, T. Peterka, R. Ross, S. Vogt, and C. J. Jacobsen. Simultaneous cryo x-ray ptychographic and fluorescence microscopy of green algae. Proceedings of the National Academy of Sciences, 112(8):2314–2319, 2015.

[11] H. M. L. Faulkner and J. M. Rodenburg. Movable Aperture Lensless Transmission Microscopy: A Novel Phase Retrieval Algorithm. Physical Review Letters, 93(2):023903, July 2004.

[12] J. R. Fienup. Phase retrieval algorithms: a comparison. Applied optics, 21(15):2758–2769, 1982.

[13] J. R. Fienup. Invariant error metrics for image reconstruction. Applied optics, 36(32):8352–8357, 1997.

[14] M. Folk, A. Cheng, and K. Yates. HDF5: A file format and I/O library for high performance computing applications. In Proceedings of Supercomputing, volume 99, 1999.

[15] K. Giewekemeyer, P. Thibault, S. Kalbfleisch, A. Beerlink, C. M. Kewish, M. Dierolf, F. Pfeiffer, and T. Salditt. Quantitative biological imaging by ptychographic x-ray diffraction microscopy. Proceedings of the National Academy of Sciences, 107(2):529–534, 2010.

[16] A. Griewank, D. Juedes, and J. Utke. Algorithm 755: Adol-c: a package for the automatic differentiation of algorithms written in c/c++. ACM Transactions on Mathematical Software (TOMS), 22(2):131–167, 1996.

[17] M. Guizar-Sicairos and J. R. Fienup. Phase retrieval with transverse translation diversity: a nonlinear optimization approach. Opt. Express, 16(10):7264–7278, May 2008.

[18] M. Guizar-Sicairos, I. Johnson, A. Diaz, M. Holler, P. Karvinen, H.-C. Stadler, R. Dinapoli, O. Bunk, and A. Menzel. High-throughput ptychography using eiger: scanning x-ray nano-imaging of extended regions. Opt. Express, 22(12):14859–14870, Jun 2014.

[19] M. Holler, A. Diaz, M. Guizar-Sicairos, P. Karvinen, E. Farm, E. Harkonen, M. Ritala, A. Menzel, J. Raabe, and O. Bunk. X-ray ptychographic computed tomography at 16 nm isotropic 3d resolution. Scientific reports, 4, 2014.

[20] W. Hoppe. Beugung im inhomogenen primarstrahlwellenfeld. i. prinzip einer phasenmessung von elektronenbeugungsinterferenzen. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 25(4):495–501, 1969.

[21] A. Huiser and P. Van Toorn. Ambiguity of the phase-reconstruction problem. Optics letters, 5(11):499–501, 1980.

[22] X. Jiang, W. Van den Broek, and C. T. Koch. Inverse dynamical photon scattering (IDPS): an artificial neural network based algorithm for three-dimensional quantitative imaging in optical microscopy. Optics express, 24(7):7006–7018, 2016.

[23] M. W. Jones, N. W. Phillips, G. A. van Riessen, B. Abbey, D. J. Vine, Y. S. Nashed, S. T. Mudie, N. Afshar, R. Kirkham, B. Chen, et al. Simultaneous x-ray fluorescence and scanning x-ray diffraction microscopy at the australian synchrotron xfm beamline. Journal of Synchrotron Radiation, 23(5), 2016.

[24] A. S. Jurling and J. R. Fienup. Applications of algorithmic differentiation to phase retrieval algorithms. JOSA A, 31(7):1348–1359, 2014.

[25] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[26] J. Li and T. Zhou. Numerical optimization algorithm of wavefront phase retrieval from multiple measurements. arXiv preprint arXiv:1607.01861, 2016.

[27] S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Mathematics, 16(2):146–160, 1976.

[28] A. M. Maiden and J. M. Rodenburg. An improved ptychographical phase retrieval algorithm for diffractive imaging. Ultramicroscopy, 109(10):1256–1262, 2009.

[29] S. Marchesini, H. Krishnan, B. J. Daurer, D. A. Shapiro, T. Perciano, J. A. Sethian, and F. R. N. C. Maia. SHARP: a distributed GPU-based ptychographic solver. Journal of Applied Crystallography, 49(4):1245–1252, Aug 2016.

[30] J. Miao, P. Charalambous, J. Kirz, and D. Sayre. An extension of the methods of x-ray crystallography to allow imaging of micron-size non-crystalline specimens. Nature, 400:342–344, 1999.

[31] H. Mimura, S. Handa, T. Kimura, H. Yumoto, D. Yamakawa, H. Yokoyama, S. Matsuyama, K. Inagaki, K. Yamamura, Y. Sano, K. Tamasaku, Y. Nishino, M. Yabashi, T. Ishikawa, and K. Yamauchi. Breaking the 10 nm barrier in hard-x-ray focusing. Nature Physics, 6(2):122–125, 2010.

[32] D. Morozov and T. Peterka. Block-parallel data analysis with diy2. 2016.

[33] Y. S. Nashed, D. J. Vine, T. Peterka, J. Deng, R. Ross, and C. Jacobsen. Parallel Ptychographic Reconstruction. Optics Express, 22(26):32082–32097, 2014.

[34] T. Peterka, D. Goodell, R. Ross, H.-W. Shen, and R. Thakur. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, page 4. ACM, 2009.

[35] J. Qian, C. Yang, A. Schirotzek, F. Maia, and S. Marchesini. Efficient algorithms for ptychographic phase retrieval. Inverse Problems and Applications, Contemp. Math, 615:261–280, 2014.

[36] L. B. Rall. Automatic differentiation: Techniques and applications. 1981.

[37] J. M. Rodenburg and H. M. Faulkner. A phase retrieval algorithm for shifting illumination. Applied physics letters, 85(20):4795–4797, 2004.

[38] J. M. Rodenburg, A. Hurst, A. Cullis, B. Dobson, F. Pfeiffer, O. Bunk, C. David, K. Jefimovs, and I. Johnson. Hard-X-Ray Lensless Imaging of Extended Objects. Physical Review Letters, 98(3):034801, 2007.

[39] R. W. Gerchberg and W. Saxton. Phase determination from image and diffraction plane pictures in electron-microscope. Optik, 34(3):275, 1971.

[40] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.

[41] A. Suzuki, S. Furutaku, K. Shimomura, K. Yamauchi, Y. Kohmura, T. Ishikawa, and Y. Takahashi. High-resolution multislice x-ray ptychography of extended thick objects. Physical review letters, 112(5):053903, 2014.

[42] P. Thibault, M. Dierolf, A. Menzel, O. Bunk, C. David, and F. Pfeiffer. High-resolution scanning x-ray diffraction microscopy. Science, 321(5887):379–382, 2008.

[43] G. Tinhofer, E. Mayr, H. Noltemeier, and M. Syslo. Computational graph theory, volume 7. Springer, 1990.

[44] E. H. R. Tsai, I. Usov, A. Diaz, A. Menzel, and M. Guizar-Sicairos. X-ray ptychography with extended depth of field. Opt. Express, 24(25):29089–29108, Dec 2016.

[45] D. Vine, D. Pelliccia, C. Holzner, S. B. Baines, A. Berry, I. McNulty, S. Vogt, A. G. Peele, and K. A. Nugent. Simultaneous x-ray fluorescence and ptychographic microscopy of cyclotella meneghiniana. Optics express, 20(16):18287–18296, 2012.

[46] S. F. Walter and L. Lehmann. Algorithmic differentiation in python with algopy. Journal of Computational Science, 4(5):334–344, 2013.

[47] R. Wilke, M. Priebe, M. Bartels, K. Giewekemeyer, A. Diaz, P. Karvinen, and T. Salditt. Hard x-ray imaging of bacterial cells: nano-diffraction and ptychographic reconstruction. Optics express, 20(17):19232–19254, 2012.

[48] F. Zhang, G. Pedrini, and W. Osten. Phase retrieval of arbitrary complex-valued fields through aperture-plane modulation. Physical Review A, 75(4):043805, 2007.

[49] J. Zhong, L. Tian, P. Varma, and L. Waller. Nonlinear optimization algorithm for partially coherent phase retrieval and source recovery.

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
