
GPU-based shear–shear correlation calculation

Miguel Cárdenas-Montes a,∗, Miguel A. Vega-Rodríguez b, Christopher Bonnett c, Ignacio Sevilla-Noarbe a, Rafael Ponce a, Eusebio Sánchez Alvaro a, Juan José Rodríguez-Vázquez a

a CIEMAT, Department of Fundamental Research, Avda. Complutense 40, 28040, Madrid, Spain
b University of Extremadura, ARCO Research Group, Department Technologies of Computers and Communications, Escuela Politécnica, Campus Universitario s/n, 10003, Cáceres, Spain
c Institut de Ciencies de l'Espai, CSIC/IEEC, F. de Ciencies, Torre C5 par-2, Barcelona 08193, Spain

Article info

Article history: Received 31 January 2013; Received in revised form 7 July 2013; Accepted 10 August 2013; Available online xxxx

Keywords: Gravitational weak lensing; Shear–shear correlation function; GPU computing; Heterogeneous computing; Optimization

Abstract

Light rays are deflected when travelling through a gravitational potential: this phenomenon is known as gravitational lensing. It causes the observed shapes of distant galaxies to be very slightly distorted by the intervening matter in the Universe, as their light travels towards us. This distortion is called cosmic shear. By measuring this component it is possible to derive the properties of the mass distribution causing the distortion. This in turn can lead to the measurement of the accelerated expansion of the Universe, as matter clumps together differently depending on its dynamics at each cosmological epoch. The measurement of the cosmic shear requires the statistical analysis of the ellipticities of millions of galaxies using very large astronomical surveys. In the past, due to the computational cost of the problem, this kind of analysis was performed by introducing simplifications in the estimation of such statistics. With the advent of scientific computing using graphics processing units, analysis of the shear can be addressed without approximations, even for very large surveys, while maintaining an affordable execution time. In this work, we present the creation and optimization of such a graphics processing unit code to compute the so-called shear–shear correlation function.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Current cosmological observations focus on surveying very large and deep regions of the sky, in order to be able to study the large-scale structure of the Universe and its evolution.

Several observational probes [1] have been identified to tackle the study of the accelerated expansion of the Universe [2]. Among them, one of the most promising turns out to be the analysis of the small deflections that large masses produce on the light travelling from distant galaxies. This phenomenon is known as gravitational lensing. Given the very small distortions involved, usually it is the statistical properties of the observed distribution of galaxy shapes which are studied [3]. The overall effect created by the gravitational lenses on these shapes is the cosmic shear. For many years this observational technique has been burdened by very large instrumental errors. Nevertheless, first results were possible in the late 1990s [4–7]. Only recently have the first measurements in wide areas been carried out, delivering promising results for the future of the field [8]. This has paved the way for present and future surveys, such as the Dark Energy Survey [9], the Kilo-Degree Survey [10] and Euclid [11,12], to exploit this observational channel by increasing statistics by almost two orders of magnitude by the next decade as well as reducing systematic errors.

∗ Corresponding author. Tel.: +34 91 346 6281; fax: +34 91 346 6068. E-mail address: [email protected] (M. Cárdenas-Montes).

In this context, cosmologists will have to deal with very large amounts of data (10^8 objects) to extract these measurements. In particular, the so-called shear–shear correlation function estimation requires computing the auto- and cross-correlation functions of the ellipticities of galaxies in different samples at varying redshifts. This algorithm has a high computational cost, which goes with O(N^2) in the case that one wants to achieve the best precision.

The calculation of correlation function estimators has already been addressed in the case of large-scale structure studies, for the computation of the auto-correlation function of positions of galaxies and clusters (instead of shapes). In this case, systematic errors play a lesser role, but the probe in itself is less sensitive to the determination of cosmological parameters. Several codes have harnessed the computational power of graphics processing units (GPUs) and other hardware platforms to carry out the job (Section 2). In the case of shear–shear correlations (involving the shape of the galaxies, Section 3), several codes have been implemented using the kd-trees approach, which simplifies the problem at the cost of precision (Section 2).


To the authors' knowledge, up to now, no code has taken advantage of the capabilities of GPUs to handle this specific problem. This calculation is of particular relevance in the short term with the rapid advent of larger datasets in the next few years. GPUs are able to deal with the shear correlation calculation of very large surveys of galaxies within a reasonable execution time.

Beyond the initial adaptation of the problem to the GPU platform and the subsequent verification of the results, the code passes through a series of optimization processes. Among other techniques, a more efficient use of on-chip memory with an increment of data locality, and the study of the compilation options modifying the use of the cache memory, are checked. Finally, a concurrent computing scenario (hybrid OpenMP–CUDA implementation) is presented.

This paper is organized as follows. Section 2 summarizes the related work and previous efforts. In Section 3.1, the concepts of gravitational lensing and shear are described, together with the equations to be implemented in the code. The statistical support to the analysis is described in Section 3.2. The hardware used in this work is presented in Section 3.3. Results are presented and analysed in Section 4. Finally, Section 5 contains the conclusions of this work. Details about the GPU architecture and the CUDA programming model are included in the Appendix.

2. Related work

Previous efforts implement some kind of mechanism to reduce the computational cost of the point-to-point correlation estimation, for example, the widely used ATHENA code.1 This is a powerful tool based on kd-trees [13], which allows controlling the precision of the estimation by means of a parameter termed the opening angle, measured in radians (OA hereafter). This parameter regulates the minimum angle at which two kd-tree nodes must 'see' each other for the full point-to-point correlation to be estimated. If the nodes are far away from each other, an averaging of the values is performed, and these averages are then correlated. Smaller opening angles make the required block size smaller, and the precision achieved is higher at the expense of a higher execution time (see Section 4 for a quantification of this effect). A similar approach is used in [14].

More generally, GPU computation is becoming increasingly used in cosmology to deal with the analysis of the large-scale structure of the Universe. Some examples include the two-point angular correlation function [15,16] and the aperture mass statistic (in which a filter is applied to shear maps, in order to detect mass structures) [17]. However, to the best of our knowledge, we present the first GPU-based shear–shear correlation function estimator implementation.

3. Methods and materials

3.1. The shear–shear correlation function

Cosmological information, such as the dark matter distribution at different epochs, the amount of matter, and the expansion history, is contained in the so-called shear–shear correlation function. A thorough review on the topic of gravitational lensing can be found in [18]. The value of the shear field γ can be conveniently estimated from the ellipticity, ϵ, of a particular galaxy. Here, |ϵ| is defined such that, for an ellipse with semi-axes a ≥ b,

|ϵ| = (a − b) / (a + b). (1)

1 http://www2.iap.fr/users/kilbinge/athena/.

Fig. 1. Angles and coordinates on a sphere for two galaxies i = (1, 2) located at (αi, δi). Source: Figure taken from [19], used with the author's permission.

Given that each galaxy has an orientation φ with respect to the local coordinate frame, two ellipticity components can be defined:

ϵ = ϵx + iϵy = |ϵ| e^(2iφ). (2)

The correlation function of these ellipticities, as a function of the separation angle between galaxies, encodes cosmological information about the mass distribution at different redshifts; see [19] for a recent interpretation of the shear correlation function. In reality, we need to extract shear fields by averaging the ellipticities of many galaxies in every region. Only the computational problem of calculating the correlation functions is addressed here, by assuming that the ellipticity ϵ has been measured to the best of our ability. The actual measurement of the shear from galaxy ellipticities is beyond the scope of this paper. The reader can consult [20,21] for a detailed explanation of systematic effects and the steps for the extraction of shear from observations. In the rest of the paper, the shear notation and not the ellipticity notation is used, assuming that this process has already taken place.

The great circle distance θ between two galaxies i = (1, 2), necessary for the correlation function binning, is calculated using the position vectors of both on a unit sphere, which are obtained from their spherical sky coordinates (αi, δi):

v⃗i = (cos αi · cos δi, sin αi · cos δi, sin δi) (3)

cos(θ) = v⃗1 · v⃗2. (4)

The shear at a particular galaxy position is defined in a local Cartesian coordinate system with the y-axis pointing towards the north pole and the x-axis going along the line of constant declination in a plane tangent to the sphere at the galaxy's position. Given a pair of galaxies (1, 2), γt1 is the tangential projection of the shear of galaxy 1 along the geodesic that connects galaxies 1 and 2, and γ×1 is the cross component. To be able to calculate these shear components, the angle β1 (see Fig. 1) must be known: this is the angle between the great circle at declination δ and the right ascension of the galaxy α1. Then the angle (known as the course angle) that we need to use to project is given by Φ1 = π/2 − β1. Using the sine and cosine rules on a sphere, the calculation is done as follows:

cos Φ1 = sin(α2 − α1) cos δ2 / sin θ

sin Φ1 = (cos δ2 sin δ1 − sin δ2 cos δ1 cos(α2 − α1)) / sin θ. (5)

The corresponding angle Φ2 can be found by exchanging the indices.
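As an illustration of Eqs. (3)–(5), the following sketch computes the separation angle and the course angles for one pair of galaxies. It is our own example, not the authors' implementation: the function and structure names (pair_angles, PairAngles) are hypothetical, and no special handling of the degenerate case sin θ = 0 is shown.

// Illustrative sketch (not the authors' code): separation and course angles
// for a galaxy pair, following Eqs. (3)-(5). Compiles as CUDA host/device code.
#include <math.h>

struct PairAngles { double theta, cosPhi1, sinPhi1, cosPhi2, sinPhi2; };

__host__ __device__ PairAngles pair_angles(double alpha1, double delta1,
                                           double alpha2, double delta2)
{
    // Eq. (3): unit vectors on the sphere; Eq. (4): cos(theta) from the dot product.
    double v1[3] = { cos(alpha1) * cos(delta1), sin(alpha1) * cos(delta1), sin(delta1) };
    double v2[3] = { cos(alpha2) * cos(delta2), sin(alpha2) * cos(delta2), sin(delta2) };
    double ctheta = v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];

    PairAngles p;
    p.theta = acos(ctheta);
    double stheta = sin(p.theta);

    // Eq. (5): course angle of galaxy 1; galaxy 2 follows by exchanging the indices.
    p.cosPhi1 = sin(alpha2 - alpha1) * cos(delta2) / stheta;
    p.sinPhi1 = (cos(delta2) * sin(delta1) - sin(delta2) * cos(delta1) * cos(alpha2 - alpha1)) / stheta;
    p.cosPhi2 = sin(alpha1 - alpha2) * cos(delta1) / stheta;
    p.sinPhi2 = (cos(delta1) * sin(delta2) - sin(delta1) * cos(delta2) * cos(alpha1 - alpha2)) / stheta;
    return p;
}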


After projecting the measured shears (γx, γy) to (γt, γ×), the following correlation functions can be defined:

ξ+(θ) = Σ_ij wi wj (γt(θi) · γt(θj) + γ×(θi) · γ×(θj)) / Σ_ij wi wj

ξ−(θ) = Σ_ij wi wj (γt(θi) · γt(θj) − γ×(θi) · γ×(θj)) / Σ_ij wi wj

ξ×(θ) = Σ_ij wi wj (γt(θi) · γ×(θj)) / Σ_ij wi wj, (6)

where wi, wj are the weights associated with the measurement of the galaxy ellipticities. These take into account measurement errors; see [22]. These are the three correlation functions that are calculated by the code.
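The structure of Eq. (6) is simply a weighted, binned accumulation over galaxy pairs. The sketch below shows a serial version of that accumulation for one angular bin; it is our own illustration with hypothetical names (BinSums, accumulate_pair), assuming the tangential and cross components have already been projected, and is not the GPU kernel described later.

// Illustrative serial accumulation of Eq. (6) for one angular bin (not the authors' kernel).
struct BinSums { double xi_p, xi_m, xi_x, w; long long npairs; };

void accumulate_pair(BinSums* bin, double w1, double w2,
                     double gt1, double gx1, double gt2, double gx2)
{
    double w = w1 * w2;
    bin->xi_p   += w * (gt1 * gt2 + gx1 * gx2);   // numerator of xi_plus
    bin->xi_m   += w * (gt1 * gt2 - gx1 * gx2);   // numerator of xi_minus
    bin->xi_x   += w * (gt1 * gx2);               // numerator of xi_cross
    bin->w      += w;                             // common denominator: sum of pair weights
    bin->npairs += 1;                             // pair counter
}
// After all pairs are processed, xi_plus(theta) = bin.xi_p / bin.w, and likewise for xi_minus and xi_cross.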

3.2. Statistics

In order to ascertain whether the proposed modifications applied to the code improve the execution time, two different types of test can be applied: parametric and non-parametric. The difference between the two lies in the assumption, made by parametric tests, that the data are normally distributed, whereas no explicit conditions are assumed in non-parametric tests. For this reason, the latter are recommended when the statistical model of the data is unknown [23].

The Kruskal–Wallis test [24] is one such non-parametric test, which is used to compare three or more groups of sample data. For this test, the null hypothesis assumes that the samples are from identical populations.

When a multiple-comparison test (e.g. the Kruskal–Wallis test) rejects the null hypothesis, a post-hoc test is used to determine which sample makes the difference. The most typical post-hoc test is the Wilcoxon signed-rank test with the Bonferroni or Holm correction [25].

The Wilcoxon signed-rank test also belongs to the non-parametric category. It is a pairwise test that aims to detect significant differences between two sample means, that is, the behaviour – execution time in our study – of two codes before and after a modification.

On the other hand, the Bonferroni correction aims to control the family-wise error rate (FWER). The FWER is the cumulative error when more than one pairwise comparison (e.g. more than one Wilcoxon signed-rank test) is performed. Therefore, when multiple pairwise comparisons are performed, the Bonferroni correction keeps the FWER under control.
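As a concrete illustration (our numbers, not taken from the paper): comparing three code versions pairwise requires m = 3 Wilcoxon tests, so for a family-wise level of α = 0.05 the Bonferroni correction demands that each individual test be judged at α/m = 0.05/3 ≈ 0.017, keeping the probability of at least one false rejection across the family below 0.05.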

3.3. Hardware

The creation and optimization process of the code, as well as the numerical experiments, were executed on an NVIDIA C2075 (Fermi architecture; see the Appendix). The CPU numerical experiments were executed on a computer with two Intel Xeon X5570 processors at 2.93 GHz and with 8 GB of RAM.

4. Results and analysis

4.1. Baseline implementation

In the Appendix, an overview of the GPU architecture and the CUDA programming model is presented. In the following, the CUDA baseline implementation of the shear–shear correlation code is described, while mentioning the technical aspects that affect the performance.

4.1.1. General description of the program flow

The code consists in the calculation of the quantities ξ+, ξ−, ξ× (Eq. (6), Algorithm 1) as a function of the separation angle θ between the galaxies. This initial version of the code focuses on the calculation rather than on the performance. However, it has been coded keeping in mind the most general recommendations for reaching the highest efficiency.

Algorithm 1: The shear–shear correlation algorithm pseudocode

foreach pair of galaxies do
    Calculate the separation angle θ between the galaxies on the sphere (dot product);
    if θ is in the user's range then
        Calculate all products of the local shear components of both galaxies (γx;1, γy;1 with γx;2, γy;2);
        Calculate the course angles Φ1 and Φ2 for both galaxies (Eq. (5)), i.e. the angle with respect to the line of equal declination;
        Calculate ξ+, ξ−, ξ× (Eq. (6)) after projecting the shears onto the tangential and cross components γt, γ×;
        Populate the histograms held in shared memory with the computed values of ξ+, ξ−, ξ×, the number of pairs, and the sum of the pair weights;
Load the histograms held in shared memory into global memory.

Once the separation angle θ of a galaxy pair is calculated, and if it is within the histogram range (user defined), five values have to be computed and incorporated into an equal number of histograms. These values correspond to the number of pairs of galaxies, ξ+, ξ−, ξ× (Eq. (6)), and the sum of the weights of all the pairs. For the calculation of these values, and in order to avoid slow access to global memory, intermediate reusable values are stored in shared memory.

4.1.2. Memory management

The baseline code implements a coalesced access pattern to global memory. Input data are sorted by components rather than by galaxies: first the x-coordinates for all galaxies, and successively the y-coordinates, the z-coordinates, γx, γy (values of the measured shear field in the local reference frame), and the weight values. By implementing this layout, adjacent threads in a block request contiguous data from global memory. Coalesced access maximizes global memory bandwidth by reducing the number of bus transactions.
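A minimal sketch of such a structure-of-arrays layout is shown below. It is an assumption-laden illustration (the names GalaxyCatalogue and read_coalesced are ours, and the authors' actual layout may differ in detail); the point is that thread i touches element i of each component array, so a warp reads contiguous addresses.

// Illustrative structure-of-arrays layout for coalesced reads (hypothetical names).
struct GalaxyCatalogue {
    double *x, *y, *z;   // unit-sphere coordinates, one device array per component
    double *gx, *gy;     // measured shear components in the local frame
    double *w;           // per-galaxy weights
    int     n;           // number of galaxies
};

__global__ void read_coalesced(const GalaxyCatalogue cat, double *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < cat.n)
        out[i] = cat.x[i] + cat.y[i] + cat.z[i];   // adjacent threads -> adjacent addresses
}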

Furthermore, this baseline code pays special attention to making an intensive use of shared memory for intermediate calculations and for the construction of the correlation function histograms (Algorithm 1). The dot product and the arc-cosine calculation, necessary to get the angle subtended by each pair of galaxies, are executed on shared memory. The same is true for the calculation of the course angles for both galaxies and the projection of the shears to the tangential and crossed components. This avoids the use of global memory, which is much slower than shared memory, for any intermediate calculation which requires frequent read and write accesses.

An expected bottleneck is the construction of the histograms. Until this point, a multithreaded calculation has operated over the pairs of galaxies calculating the dot product, followed by the arc-cosine, and finally the bin in the angle histogram where the value has to be incremented.2 Due to the multithreaded nature of the kernel, simultaneous updates of the same bin in the histogram must be avoided in order not to miss any count. This led to the usage of atomic functions3 to create the histograms. Alternatives to atomic functions exist, for example waterfall if–elseif structures. However, they imply less flexibility for modifications to the angular range of the supported histogram, as well as diminishing the readability of the kernel due to the larger number of code lines, making the code less compact.

2 For the sake of simplicity, only the angle histogram is considered in this reasoning, but the solution is equally applied to the other histograms.
3 The atomic operation for float on shared memory is supported for compute capability 1.3 and higher [26].

A known drawback of atomic operations is that, when two threads are trying to update a value in the same bin, the operations are not parallel but sequential. Therefore, if millions of threads are accessing at most a hundred bins, then the serialization of the access will severely impact the performance. In order to overcome this bottleneck, the histogram construction can be parallelized by means of constructing partial histograms on shared memory, and later gathering them on global memory. This mechanism increments the parallelism of the kernel and diminishes the impact of sequential operations on performance.
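The per-block partial-histogram pattern can be sketched as follows. This is our own minimal example, not the authors' kernel: the kernel name, the fixed bin count, and the input arrays are hypothetical, and it assumes a Fermi-class device (compute capability 2.0) so that atomicAdd on float is available in both shared and global memory.

// Illustrative partial-histogram pattern: each block accumulates into a
// shared-memory histogram, then flushes it to global memory.
#define NBINS 100

__global__ void histo_partial(const float *value, const int *bin, int n, float *histo_global)
{
    __shared__ float histo_shared[NBINS];
    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        histo_shared[b] = 0.0f;                        // zero the per-block histogram
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&histo_shared[bin[i]], value[i]);    // contention limited to one block
    __syncthreads();

    for (int b = threadIdx.x; b < NBINS; b += blockDim.x)
        atomicAdd(&histo_global[b], histo_shared[b]);  // one gather per block and bin
}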

The initial description of the code focused on an intensive use of shared memory for intermediate operations. In order to avoid overloading it, registers are used to store relevant data for the calculation in process. Registers have a higher bandwidth than shared memory, but their size is smaller. Data frequently accessed4 for readout, such as galaxy coordinates, ellipticities, and the weight value, are stored in registers.

4 This technique is termed increment of data locality: data frequently used are stored locally to the thread.

Unfortunately, this strategy is not exempt from drawbacks. The increment in the usage of registers can force a reduction of the occupancy: fewer streaming processors are active at the same time. Therefore the volume of information migrated towards the registers should be fitted carefully in order to avoid any harm to the performance of the code. Several tests were performed with incremental use of registers until the performance reached its optimum.

The baseline code has a consumption of 15.36 kB of shared memory and 63 registers per thread, achieving an occupancy of each multiprocessor of 25%.

Due to the fact that the correlation function needs to be studied at small separation scales, double precision is used. The only exceptions are the quantities added to the histograms ξ+, ξ−, and ξ×, because atomic operations in double precision in shared memory are still not supported. In all cases the numerical experiments were performed with the CUDA 5.0 release.
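For completeness, a widely used software emulation of a double-precision atomicAdd built on atomicCAS is reproduced below (it follows the standard pattern from the CUDA documentation). This is not what the paper does, since the authors keep the histogram quantities in single precision instead; it also relies on 64-bit atomicCAS being available for the memory space targeted (compute capability 2.0 on Fermi for shared memory).

// Common software emulation of double-precision atomicAdd via atomicCAS
// (not the authors' choice; shown only as the usual workaround).
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // retry until no other thread modified the value
    return __longlong_as_double(old);
}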

4.1.3. Comparison with ATHENA input reference

For comparison purposes, ξ+, ξ− and ξ× obtained with the GPU implementation and ATHENA version 1.54 are plotted in Fig. 2. Despite the slightly different binning schemes, the results are in excellent agreement, proving that the GPU-based code presented here is completely compatible with a standard analysis code used in cosmology. Unfortunately, the sample ATHENA input file used as reference has only 40,546 galaxies. In order to obtain more realistic execution times, a test with one million galaxies with real data is detailed in the next section.

4.1.4. Comparison with the one million galaxies input reference

Data from the Canada–France–Hawaii Lensing Survey [8] is used, hereafter referred to as CFHTLenS. The CFHTLenS survey analysis combined weak lensing data processing with THELI [27], shear measurement with lensfit [28], and photometric redshift measurement with PSF-matched photometry [29]. A full systematic error analysis of the shear measurements in combination with the photometric redshifts is presented in [8], with additional error analyses of the photometric redshift measurements presented in [30].

A query was done on the CFHTLenS catalogue query page5 for right ascension, declination, the ellipticities (as proxies of the shear), and the weight, without any selection cuts, except the requirement that the measured ellipticities are non-zero. The purpose here is to run the code on a catalogue with ellipticities even if they are not accurate or contain contamination from non-galaxy components. An area was selected from one of the four fields to hold exactly one million galaxies (randomly selected).

On the resulting catalogue the GPU code is executed, as well as ATHENA with varying opening angles, to compare the precision and execution time performance. For this purpose, the binning was tuned to obtain values for ξ+, ξ−, ξ× at the exact same angle separation values θ. The results for the execution time are shown in Fig. 3.

The GPU implementation takes 3650.0 ± 1.4 s to analyse the catalogue. It should be noted that the execution times are tightly bound to the number of bins in the histogram. Variations in this will produce a different amount of sequential updates on the values stored in the bins, and consequently a different execution time. The GPU implementation speed is comparable to that of ATHENA when using an opening angle of 0.01 rad for this dataset: 3723.3 ± 8.4 s (see Fig. 3).

When the opening angle approaches zero radians, the code makes fewer approximations and becomes equivalent to a brute-force method. Unfortunately, the reduction of the opening angle leads to a critical increment in the execution time. For OA = 0.005 rad, the execution time is 12,688.7 ± 122.7 s; this is 3.5 times slower than the GPU processing time. Finally, for OA equal to zero the execution time increases to 247,681 s; this is a factor 68 slower than the GPU code execution time.

Concerning precision, Fig. 4 shows how using an OA with an execution time equivalent to the GPU code (OA = 0.01) can induce large errors (a few per cent in relative terms). The exact required precision will differ from survey to survey and is still a topic of debate in the cosmological community (Tim Eifler, private communication). On the other hand, the GPU code shows differences smaller than 0.001% with respect to the (much slower) OA = 0.0 execution. It is worthwhile noting that for larger datasets in terms of angular range the required OAs to reach the same overall precision will be smaller, thus pushing the case for a fast brute-force implementation to tackle future shear surveys.

Fig. 4. Deviations of the GPU code results with respect to ATHENA results at different opening angle settings, for the computation of ξ+ (panel a) and ξ− (panel b). As the OA becomes smaller, fewer approximations are made by the ATHENA implementation, and the result converges to the GPU computed values (to levels below 0.001%).

4.2. Code optimization

4.2.1. L1 memory optimization

Once the performance has reached a satisfactory level, and considering that most of the additional code modifications degrade the performance, a second phase of optimization based on the compiler options was performed.

The most successful test corresponds to the modification of the L1 configuration. The Fermi architecture of the GPU distributes 64 kB between shared memory and L1 cache memory. Three configurations are possible: the default configuration with 48 kB of shared memory and 16 kB of L1 cache memory, a second configuration with 16 kB of shared memory and 48 kB of L1 cache memory, and finally it is possible to turn off the L1 cache memory.

Both the L1 cache and the L2 cache are queried when looking up a memory location. First of all, L1 is queried, and only when the memory location is not found is L2 queried. Finally, if the memory location is not found in either of the two caches, then the main memory is accessed. Through main memory accesses, the L1 and L2 caches are populated with memory addresses, which might avoid future main memory accesses.

5 http://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/community/CFHTLens/query.html.


Fig. 2. Comparison of the results obtained with the GPU implementation and ATHENA v1.54 with OA = 0.02 for the ATHENA input reference (40,546 galaxies): (a) ξ+, (c) ξ−, and (e) ξ×; and the differences (b) 10^6 × (ξ+^GPU − ξ+^ATH), (d) 10^6 × (ξ−^GPU − ξ−^ATH), and (f) 10^6 × (ξ×^GPU − ξ×^ATH).

The configuration implementing 48 kB as cache memory and 16 kB as shared memory is recommended when intense data reuse exists or when the memory access pattern is misaligned, unpredictable, or irregular. If applications need to share data among the threads of a thread block, 48 kB as shared memory and 16 kB as cache memory is recommended. If the kernel has a simple enough memory access pattern, the explicit caching of global memory into shared memory, through turning off L1, may increase the performance of the code.
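In practice, the shared-memory/L1 split can be selected per kernel through the CUDA runtime API, while caching of global loads in L1 can be disabled at compile time. The sketch below is our own illustration, not the authors' build configuration; the kernel name shearKernel is hypothetical.

// Illustrative only: selecting the Fermi shared-memory/L1 split for a kernel.
#include <cuda_runtime.h>

__global__ void shearKernel(/* ... */) { }   // hypothetical kernel

int main()
{
    // 48 kB shared memory / 16 kB L1 for this kernel (the paper's default);
    // cudaFuncCachePreferL1 would instead select 16 kB shared / 48 kB L1.
    cudaFuncSetCacheConfig(shearKernel, cudaFuncCachePreferShared);

    // Caching of global loads in L1 can be turned off at compile time with:
    //   nvcc -Xptxas -dlcm=cg shear.cu
    return 0;
}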

As the baseline code implements the L1 default configuration, the two other options were tested in order to check potential reductions in the execution time. The numerical experiments demonstrate that only the configuration with L1 turned off improves the efficiency of the code. By turning off the L1 cache, the achieved speedup is 1.009, which is equivalent to a 0.9% improvement (Table 1).

Table 1
Mean execution time (s) for the original code and when applying the L1 optimization.

Baseline code                    3650.0 ± 1.4
Baseline code + L1 turned off    3618.7 ± 0.6
Reduction                        31.3
Speedup                          1.009 (0.9%)

The statistical analysis of the execution times (Table 1) is performed using the Wilcoxon signed-rank test. The results of this test (p-value = 8 · 10^−5) indicate that the differences are statistically significant for a confidence level of 95% (p-value under 0.05).

Fig. 3. Mean execution time (s) for the ATHENA code at various opening angles (radians) and for the GPU code, for the one million galaxies input reference (CFHTLenS). The execution time of the GPU code is roughly equivalent to that of the ATHENA code for an opening angle of 0.01 rad. For the same precision (brute force), the GPU implementation is a factor 68 faster than a CPU-based code such as ATHENA.

4.3. Heterogeneous computing

In the previous section, optimization focused on the reduction of the execution time by modifying the code and the compilation options. The baseline code and the later L1 memory optimization have produced a competitive implementation in which a high accuracy and an affordable execution time are achieved.

However, part of the computational resources is underused, because during the kernel execution the CPU stays idle. In order to balance the computational load between the GPU and the CPU, a concurrent computing scenario is proposed. Concurrent computing allows distributing tasks between the CPU and the GPU. The tasks involved should not have dependencies between them. Concurrency is applied to the shear correlation calculation by dividing the input data into two chunks, which are assigned to the CPU and to the GPU respectively. In the CPU part, a parallel implementation based on OpenMP is applied. Due to the fact that the optimal scenario is when both executions take the same time, the choice of the chunk size is critical. Initially, diverse chunk sizes were tested in order to select the most appropriate one for the reduction of the execution time (Fig. 5).

This naive trial-and-error method shows that, on our test setup, the optimal chunk size is to compute 11% of the galaxies on the CPU and the rest on the GPU. This may differ for different machines. For lower percentages, the CPU-part analysis finishes faster than the GPU-part analysis does. As a larger volume of data is supplied to the CPU part, the execution time diminishes progressively to a point at which the minimum is reached. When a larger than optimal amount of data is supplied for CPU processing, the execution time grows significantly.

Fig. 5. Execution time (s) for diverse CPU-processed percentages in the concurrent computing model. The dotted line is the reference to the baseline code execution time, whereas the dashed line is the execution time after the L1 memory optimization.

When balancing CPU and GPU processing, the code achieves an integrated speedup of 1.11, which means a reduction in the processing time of 11% in relation to the baseline execution time (Table 2), including the previous code optimization.
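One possible shape of such a hybrid OpenMP–CUDA driver is sketched below. It is an assumption-laden outline, not the authors' code: the helper functions launch_gpu_part and correlate_galaxy_cpu are hypothetical placeholders, and the exact way the pair workload is split between the two chunks may differ from the paper's implementation.

// Illustrative CPU/GPU split (sketch only). Compile with: nvcc -Xcompiler -fopenmp
#include <cuda_runtime.h>
#include <omp.h>

void launch_gpu_part(int first, int last);   // hypothetical: enqueues the CUDA kernel for galaxies [first, last)
void correlate_galaxy_cpu(int i, int n);     // hypothetical: CPU-side pair loop for galaxy i

void correlate_concurrent(int nGalaxies, double cpuFraction /* e.g. 0.11 */)
{
    int nCPU = (int)(cpuFraction * nGalaxies);

    launch_gpu_part(nCPU, nGalaxies);          // GPU chunk runs asynchronously

    #pragma omp parallel for schedule(dynamic) // CPU chunk processed with OpenMP meanwhile
    for (int i = 0; i < nCPU; ++i)
        correlate_galaxy_cpu(i, nGalaxies);

    cudaDeviceSynchronize();                   // wait for the GPU chunk before merging histograms
}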

The statistical analysis of the successive versions of the code with the Kruskal–Wallis test shows that the differences in the execution time are significant at a significance level of more than 95%. The Wilcoxon signed-rank test with the Bonferroni correction indicates that the differences between the three sets of execution times (baseline, L1 memory optimization, and concurrent implementation) are also significant at a significance level of more than 95%. This result reinforces that the modifications applied produce a net improvement in the application productivity.

Table 2
Mean execution time (s) for the original code and when implementing concurrent computing.

Baseline code                                            3650.0 ± 1.4
Baseline code + L1 turned off + concurrent computing     3287.80 ± 0.03
Reduction                                                362.20
Speedup                                                  1.11 (11%)

5. Conclusions

In this paper, the first GPU code for the computation of the shear–shear correlation function is presented. In the past, the computational cost of the problem has prevented a brute-force implementation, and approximations (kd-trees) were used, aiming to reduce the execution time. By using GPU computing, the shear–shear correlation function estimation without any simplifications can be achieved in a reasonable timescale. In this work, an implementation is shown where a 68-fold improvement in execution time is reached with respect to the same algorithm running on a CPU, while obtaining the same precision on the results (differences smaller than 0.001%). A kd-tree code taking the same amount of time by averaging the shears will induce errors of the order of a few per cent, which are very relevant in the new era of increased statistics from very large surveys. A natural follow-up of this work is a multi-GPU approach to be run on a GPU farm to handle even larger datasets (O(10^8) objects). In addition to this, an additional optimization process consisting of a GPU–CPU concurrent computing implementation achieves a reduction of 11% in execution time.

The code is being made available (http://wwwae.ciemat.es/cosmo/gp2pcf/) for cosmologists to integrate it within their portfolio of analysis codes.

Acknowledgements

The authors would like to thank Martin Kilbinger for permission to reproduce a figure from their paper.

IS would like to thank Tim Eifler for useful comments regarding the cosmology-related aspects of this work.

The authors would like to thank the Spanish Ministry of Science and Innovation (MICINN) for funding support through grant AYA2009-13936 and through the Consolider Ingenio-2010 program, under project CSD2007-00060.

CB is also supported by project 2009SGR1398 from Generalitat de Catalunya and by the European Commission's Marie Curie Initial Training Network CosmoComp (PITN-GA-2009-238356).

The CFHTLenS data is based on observations obtained with MegaPrime/MegaCam, a joint project of CFHT and CEA/DAPNIA, at the Canada–France–Hawaii Telescope (CFHT), which is operated by the National Research Council (NRC) of Canada, the Institut National des Sciences de l'Univers of the Centre National de la Recherche Scientifique (CNRS) of France, and the University of Hawaii. This research used the facilities of the Canadian Astronomy Data Centre operated by the National Research Council of Canada with the support of the Canadian Space Agency. CFHTLenS data processing was made possible thanks to significant computing support from the NSERC Research Tools and Instruments grant program.

Appendix. Overview of GPU architecture and programming model

During the last two decades, the semiconductor industry has followed two alternative paths to increase the performance of its products. On the one hand, the number of cores has grown, evolving from a single-core processor to a two-core processor, then to a four-core one, and so on. This has generated the multi-core architecture. On the other hand, the many-core architecture follows a different strategy by implementing many small cores. The NVIDIA GPU is an example of this kind of architecture.

The main differences between these two types of architecture emerge from the purposes for which they are designed. Cores in the multi-core architecture have to deal with a wide portfolio of sequential general-purpose codes. On the contrary, the many-core architecture comes from the game industry, where a massive number of floating-point calculations per time unit is required.

Scientific computing might benefit from this high capacity for simulation and analysis. For this purpose, NVIDIA introduced the CUDA (Compute Unified Device Architecture) programming model. CUDA can be seen as a set of C extensions to handle code on a GPU. A CUDA code embodies two differentiated parts: the sequential code, which is executed on the CPU, and the parallel code, which is executed on the GPU. This piece of the code is termed the kernel. The compiler separates the two parts during the compilation.
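A minimal, self-contained example of this host/kernel split (our illustration, not taken from the paper) is the following: the host allocates and copies data, launches a kernel over a grid of thread blocks, and copies the result back.

// Minimal CUDA example of the host/kernel split described above (illustrative only).
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float factor, int n)   // kernel: runs on the GPU
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()                                                 // sequential part: runs on the CPU
{
    const int n = 1024;
    float host[n], *dev;
    for (int i = 0; i < n; ++i) host[i] = (float)i;

    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);         // kernel launch: grid of thread blocks
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("%f\n", host[1]);                               // prints 2.000000
    return 0;
}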

From the architecture point of view, a GPU is composed of an array of highly threaded streaming multiprocessors (SMs). Each SM is in turn composed of several streaming processors (SPs), which share control logic and the instruction cache and are able to support many threads. This architecture is especially recommendable for single-instruction multiple-data (SIMD) problems.

Concerning data storage, the architecture implements diverse types of memory covering a wide range of capacities, latencies, and bandwidths. Global memory is the main memory of the GPU card. Unfortunately, it also has the lowest bandwidth and the largest latency. Data stored in global memory is accessible by all threads on the card. Registers have the highest bandwidth and the lowest latency, but their size is smaller. Registers are tightly bound to a thread, so that the data in the registers are only accessible by the corresponding thread. A third type of memory is that termed shared memory. Regarding the latency and the bandwidth, it is an intermediate case between the two previous types. Another difference is that it is accessible by all the threads belonging to a block of threads.6
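The three memory spaces appear in kernel code roughly as follows (a sketch with a hypothetical kernel name, assuming a launch with 256 threads per block):

// Illustrative declarations of the three memory spaces discussed above.
__global__ void memory_spaces(const double *input)   // `input` points to global memory
{
    __shared__ double tile[256];                      // shared: visible to the whole thread block
    double local = input[threadIdx.x];                // register: private to this thread
    tile[threadIdx.x] = local * local;
    __syncthreads();
}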

In spite of the fact that a CUDA kernel is executed correctly on any CUDA device, its performance will differ depending on the particular architecture and the code adaptations. For this reason, it is necessary to know the particularities of the architecture to profit from the capabilities offered by the hardware.

In NVIDIA Tesla architectures, each streaming multiprocessor (SM) had only 16k registers, whereas in the Fermi architecture this on-chip memory has grown to 32k. Another feature that has been incremented in the Fermi architecture is the maximum number of threads per block, from 512 to 1024.

NVIDIA Fermi architectures introduce a two-level transparent cache-memory hierarchy. Each SM has 64 kB of on-chip memory distributed between shared memory and L1 cache memory. Users can select diverse configurations of shared memory and L1.

When implementing coalesced or non-coalesced access to global memory, the memory transaction segment size becomes an important factor in the final performance. In the Tesla architecture, the available memory transaction segment sizes are 32, 64, and 128 bytes. The selected value depends on the amount of memory needed and the memory access pattern. The selection is automatic, in order to avoid wasting bandwidth.

In the Fermi architecture, the memory transaction segment size follows a different rule. When L1 cache memory is enabled, the hardware always issues transactions of 128 bytes (cache-line size); otherwise, 32-byte transactions are issued.

6 A block of threads is a logical group of threads which are executed on an SM.


References

[1] Andreas Albrecht, Gary Bernstein, Robert Cahn, Wendy L. Freedman, Jacqueline Hewitt, Wayne Hu, John Huth, Marc Kamionkowski, Edward W. Kolb, Lloyd Knox, John C. Mather, Suzanne Staggs, Nicholas B. Suntzeff, Report of the dark energy task force, 2006, arXiv:astro-ph/0609591.

[2] Joshua Frieman, Michael Turner, Dragan Huterer, Dark energy and the accelerating universe, Annual Review of Astronomy and Astrophysics 46 (2008) 385–432.

[3] L. Fu, E. Semboloni, H. Hoekstra, M. Kilbinger, L. van Waerbeke, I. Tereno, Y. Mellier, C. Heymans, J. Coupon, K. Benabed, J. Benjamin, E. Bertin, O. Dore, M.J. Hudson, O. Ilbert, R. Maoli, C. Marmo, H.J. McCracken, B. Menard, Very weak lensing in the CFHTLS wide: cosmology from cosmic shear in the linear regime, Astronomy and Astrophysics 479 (1) (2008) 9–25.

[4] David Bacon, Alexandre Refregier, Richard Ellis, Detection of weak gravitational lensing by large-scale structure, Monthly Notices of the Royal Astronomical Society 318 (2) (2000) 625–640.

[5] Nick Kaiser, Gillian Wilson, Gerard A. Luppino, Large-scale cosmic shear measurements, 2000, arXiv:astro-ph/0003338.

[6] L. Van Waerbeke, Y. Mellier, T. Erben, J.C. Cuillandre, F. Bernardeau, R. Maoli, E. Bertin, H.J. McCracken, O. Le Fevre, B. Fort, M. Dantel-Fort, B. Jain, P. Schneider, Detection of correlated galaxy ellipticities on CFHT data: first evidence for gravitational lensing by large-scale structures, Astronomy and Astrophysics 358 (2000) 30–44.

[7] David M. Wittman, J. Anthony Tyson, David Kirkman, Ian Dell'Antonio, Gary Bernstein, Detection of weak gravitational lensing distortions of distant galaxies by cosmic dark matter at large scales, Nature 405 (2000) 143–148.

[8] C. Heymans, L. Van Waerbeke, L. Miller, T. Erben, H. Hildebrandt, H. Hoekstra, T.D. Kitching, Y. Mellier, P. Simon, C. Bonnett, J. Coupon, L. Fu, J. Harnois-Déraps, M.J. Hudson, M. Kilbinger, K. Kuijken, B. Rowe, T. Schrabback, E. Semboloni, E. van Uitert, S. Vafaei, M. Velander, CFHTLenS: the Canada–France–Hawaii telescope lensing survey, Monthly Notices of the Royal Astronomical Society 427 (1) (2012) 146–166.

[9] The Dark Energy Survey Collaboration, The dark energy survey, 2005, arXiv:astro-ph/0510346.

[10] Jelte T.A. de Jong, Gijs A. Verdoes Kleijn, Konrad H. Kuijken, Edwin A. Valentijn, The kilo-degree survey, 2012, arXiv:1206.1254.

[11] L. Amendola, et al., Cosmology and fundamental physics with the Euclid satellite, 2012, arXiv:1206.1225.

[12] R. Laureijs, et al., Euclid definition study report, Report Number: ESA/SRE(2011)12, 2011, arXiv:1110.3193.

[13] Andrew Moore, Andy Connolly, Chris Genovese, Alex Gray, Larry Grone, Nick Kanidoris, Robert Nichol, Jeff Schneider, Alex Szalay, Istvan Szapudi, Larry Wasserman, Fast algorithms and efficient statistics: N-point correlation functions, ESO Astrophysics Symposia (2001) 71–82.

[14] Mike Jarvis, Gary Bernstein, Bhuvnesh Jain, The skewness of the aperture mass statistic, Monthly Notices of the Royal Astronomical Society 352 (2004) 338–352.

[15] R. Ponce, M. Cárdenas-Montes, J.J. Rodríguez-Vázquez, E. Sanchez, I. Sevilla, Application of GPUs for the calculation of two point correlation functions in cosmology, in: P. Ballester, D. Egret, N.P.F. Lorente (Eds.), Astronomical Data Analysis Software and Systems XXI, in: ASP Conference Series, vol. 461, Astronomical Society of the Pacific, 2012, p. 73.

[16] Dylan W. Roeh, Volodymyr V. Kindratenko, Robert J. Brunner, Accelerating cosmological data analysis with graphics processors, in: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, ACM, 2009, pp. 1–8.

[17] Deborah Bard, Matthew Bellis, Mark T. Allen, Hasmik Yepremyan, Jan M. Kratochvil, Cosmological calculations on the GPU, Astronomy and Computing 1 (2013) 17–22.

[18] Matthias Bartelmann, Peter Schneider, Weak gravitational lensing, Physics Reports 340 (4–5) (2001) 291–472.

[19] Martin Kilbinger, Liping Fu, Catherine Heymans, Fergus Simpson, Jonathan Benjamin, Thomas Erben, Joachim Harnois-Déraps, Henk Hoekstra, Hendrik Hildebrandt, Thomas D. Kitching, Yannick Mellier, Lance Miller, Ludovic Van Waerbeke, Karim Benabed, Christopher Bonnett, Jean Coupon, Michael J. Hudson, Konrad Kuijken, Barnaby Rowe, Tim Schrabback, Elisabetta Semboloni, Sanaz Vafaei, Malin Velander, CFHTLenS: combined probe cosmological model comparison using 2D weak gravitational lensing, Monthly Notices of the Royal Astronomical Society 430 (3) (2013) 2200–2220.

[20] G. Bernstein, M. Jarvis, Shapes and shears, stars and smears: optimal measurements for weak lensing, The Astronomical Journal 123 (2) (2002) 583–618.

[21] Henk Hoekstra, Bhuvnesh Jain, Weak gravitational lensing and its cosmological applications, Annual Review of Nuclear and Particle Science 58 (2008) 99–123.

[22] H. Hoekstra, M. Franx, K. Kuijken, P.G. van Dokkum, Monthly Notices of the Royal Astronomical Society 333 (2002) 911–922.

[23] Salvador García, Daniel Molina, Manuel Lozano, Francisco Herrera, A study on the use of non-parametric tests for analyzing the evolutionary algorithms' behaviour: a case study on the CEC'2005 special session on real parameter optimization, Journal of Heuristics 15 (6) (2009) 617–644.

[24] D. Sheskin, Handbook of Parametric and Non-Parametric Statistical Procedures, CRC Press, 2004.

[25] Salvador García, Alberto Fernández, Julián Luengo, Francisco Herrera, A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability, Soft Computing 13 (10) (2009) 959–977.

[26] Jason Sanders, Edward Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, 2010.

[27] T. Erben, H. Hildebrandt, L. Miller, L. van Waerbeke, C. Heymans, H. Hoekstra, T.D. Kitching, Y. Mellier, J. Benjamin, C. Blake, C. Bonnett, O. Cordes, J. Coupon, L. Fu, R. Gavazzi, B. Gillis, E. Grocutt, S.D.J. Gwyn, K. Holhjem, M.J. Hudson, M. Kilbinger, K. Kuijken, M. Milkeraitis, B.T.P. Rowe, T. Schrabback, E. Semboloni, P. Simon, M. Smit, O. Toader, S. Vafaei, E. van Uitert, M. Velander, CFHTLenS: the Canada–France–Hawaii telescope lensing survey—imaging data and catalogue products, 2012, arXiv:1210.8156.

[28] L. Miller, C. Heymans, T.D. Kitching, L. van Waerbeke, T. Erben, H. Hildebrandt, H. Hoekstra, Y. Mellier, B.T.P. Rowe, J. Coupon, J.P. Dietrich, L. Fu, J. Harnois-Déraps, M.J. Hudson, M. Kilbinger, K. Kuijken, T. Schrabback, E. Semboloni, S. Vafaei, M. Velander, Bayesian galaxy shape measurement for weak lensing surveys—III. Application to the Canada–France–Hawaii telescope lensing survey, Monthly Notices of the Royal Astronomical Society 429 (4) (2013) 2858–2880.

[29] H. Hildebrandt, T. Erben, K. Kuijken, L. van Waerbeke, C. Heymans, J. Coupon, J. Benjamin, C. Bonnett, L. Fu, H. Hoekstra, T.D. Kitching, Y. Mellier, L. Miller, M. Velander, M.J. Hudson, B.T.P. Rowe, T. Schrabback, E. Semboloni, N. Benítez, CFHTLenS: improving the quality of photometric redshifts with precision photometry, Monthly Notices of the Royal Astronomical Society 421 (3) (2012) 2355–2367.

[30] Jonathan Benjamin, Ludovic Van Waerbeke, Catherine Heymans, Martin Kilbinger, Thomas Erben, Hendrik Hildebrandt, Henk Hoekstra, Thomas D. Kitching, Yannick Mellier, Lance Miller, Barnaby Rowe, Tim Schrabback, Fergus Simpson, Jean Coupon, Liping Fu, Joachim Harnois-Déraps, Michael J. Hudson, Konrad Kuijken, Elisabetta Semboloni, Sanaz Vafaei, Malin Velander, CFHTLenS tomographic weak lensing: quantifying accurate redshift distributions, Monthly Notices of the Royal Astronomical Society 431 (2013) 1547–1564.

