LogCA: A High-Level Performance Model for Hardware Accelerators

Muhammad Shoaib Bin Altaf∗

AMD Research, Advanced Micro Devices, Inc.

shoaibaltaf@amd.com

David A. Wood
Computer Sciences Department

University of Wisconsin-Madison
david@cs.wisc.edu

ABSTRACT

With the end of Dennard scaling, architects have increasingly turned to special-purpose hardware accelerators to improve the performance and energy efficiency for some applications. Unfortunately, accelerators don't always live up to their expectations and may under-perform in some situations. Understanding the factors which affect the performance of an accelerator is crucial for both architects and programmers early in the design stage. Detailed models can be highly accurate, but often require low-level details which are not available until late in the design cycle. In contrast, simple analytical models can provide useful insights by abstracting away low-level system details.

In this paper, we propose LogCA, a high-level performance model for hardware accelerators. LogCA helps both programmers and architects identify performance bounds and design bottlenecks early in the design cycle, and provides insight into which optimizations may alleviate these bottlenecks. We validate our model across a variety of kernels, ranging from sub-linear to super-linear complexities, on both on-chip and off-chip accelerators. We also describe the utility of LogCA using two retrospective case studies. First, we discuss the evolution of interface design in SUN/Oracle's encryption accelerators. Second, we discuss the evolution of memory interface design in three different GPU architectures. In both cases, we show that the adopted design optimizations for these machines are similar to LogCA's suggested optimizations. We argue that architects and programmers can use insights from these retrospective studies for improving future designs.

CCS CONCEPTS

• Computing methodologies → Modeling methodologies; • Computer systems organization → Heterogeneous (hybrid) systems; • Hardware → Hardware accelerators;

KEYWORDS

Analytical modeling, Performance, Accelerators, Heterogeneous architectures

∗This work was done while a Ph.D. student at Wisconsin.


[Figure 1: Executing Advanced Encryption Standard (AES) [30]. (a) Execution time (ms) versus offloaded data (bytes) on UltraSPARC T2, for unaccelerated and accelerated runs, with the break-even point marked. (b) Variation in speedup for different crypto accelerators (SPARC T4, UltraSPARC T2, GPU), with break-even points marked.]


1 INTRODUCTION

The failure of Dennard scaling [12, 49] over the last decade has inspired architects to introduce specialized functional units such as accelerators [6, 36]. These accelerators have shown considerable performance and energy improvement over general-purpose cores for some applications [14, 16, 23, 25, 26, 50, 51, 55]. Commercial processors already incorporate a variety of accelerators, ranging from encryption to compression, from video streaming to pattern matching, and from database query engines to graphics processing [13, 37, 45].

Unfortunately, accelerators do not always live up to their name or potential. Offloading a kernel to an accelerator incurs latency and overhead that depends on the amount of offloaded data, the location of the accelerator, and its interface with the system. In some cases these factors may outweigh the potential benefits, resulting in lower than expected or, in the worst case, no performance gains. Figure 1 illustrates such an outcome for the crypto accelerator in UltraSPARC T2 running the Advanced Encryption Standard (AES) kernel [30].

Figure 1 provides two key observations. First, accelerators can under-perform as compared to a general-purpose core, e.g., the accelerated version on UltraSPARC T2 outperforms the unaccelerated one only after crossing a threshold block size, i.e., the break-even point (Figure 1-a). Second, different accelerators, while executing the same kernel, have different break-even points, e.g., SPARC T4 breaks even for smaller offloaded data, while UltraSPARC T2 and the GPU break even for large offloaded data (Figure 1-b).

Understanding the factors which dictate the performance of an accelerator is crucial for both architects and programmers. Programmers need to be able to predict when offloading a kernel will be performance efficient. Similarly, architects need to understand how the accelerator's interface, and the resulting latency and overheads to offload a kernel, will affect the achievable accelerator performance. Considering the AES encryption example, programmers and architects would greatly benefit from understanding: What bottlenecks cause UltraSPARC T2 and the GPU to under-perform for small data sizes? Which optimizations on UltraSPARC T2 and the GPU result in similar performance to SPARC T4? Which optimizations are programmer dependent and which are architect dependent? What are the trade-offs in selecting one optimization over the other?

To answer these questions, programmers and architects can employ either complex or simple modeling techniques. Complex modeling techniques and full-system simulation [8, 42] can provide highly accurate performance estimates. Unfortunately, they often require low-level system details which are not available until late in the design cycle. In contrast, analytical models, simpler ones in particular, abstract away these low-level system details and provide key insights early in the design cycle that are useful for experts and non-experts alike [2, 5, 19, 27, 48, 54].

As an insightful model for hardware accelerators, this paper presents LogCA. LogCA derives its name from five key parameters (Table 1). These parameters characterize the communication latency (L) and overheads (o) of the accelerator interface, the granularity or size (g) of the offloaded data, the complexity (C) of the computation, and the accelerator's performance improvement (A) as compared to a general-purpose core.

LogCA is inspired by LogP [9], the well-known parallel computation model. LogP sought to find the right balance between overly simple models (e.g., PRAM) and the detailed reality of modern parallel systems. LogCA seeks to strike the same balance for hardware accelerators, providing sufficient simplicity such that programmers and architects can easily reason with it. Just as LogP was not the first model of parallel computation, LogCA is not the first model for hardware accelerators [28]. With LogCA, our goal is to develop a simple model that supports the important implications (§2) of our analysis and uses as few parameters as possible while providing sufficient accuracy. In Einstein's words, we want our model to be as simple as possible and no simpler.

LogCA helps programmers and architects reason about an accelerator by abstracting the underlying architecture. It provides insights about the accelerator's interface by exposing the design bounds and bottlenecks, and suggests optimizations to alleviate these bottlenecks. The visually identifiable optimization regions help both experts and non-experts quantify the trade-offs in favoring one optimization over the other. While the general trend may not be surprising, we argue that LogCA is accurate enough to answer important what-if questions very early in the design cycle.

We validate our model across on-chip and off-chip accelerators for a diverse set of kernels, ranging from sub-linear to super-linear complexities. We also demonstrate the utility of our model using two retrospective case studies (§5). In the first case study, we consider the evolution of the interface in the cryptographic accelerator on Sun/Oracle's SPARC T-series processors. For the second case, we consider the memory interface design in three different GPU architectures: a discrete, an integrated, and a heterogeneous system architecture (HSA) [38] supported GPU. In both case studies, we show that the adopted design optimizations for these machines are similar to LogCA's suggested optimizations. We argue that architects and programmers can use insights from these retrospective studies for improving future designs.

This paper makes the following contributions:

• We propose a high-level visual performance model providing insights about the interface of hardware accelerators (§2).

• We formalize performance metrics for predicting the "right" amount of offloaded data (§2.2).

• Our model identifies the performance bounds and bottlenecks associated with an accelerator design (§3).

• We provide an answer to what-if questions for both programmers and architects at an early design stage (§3).

• We define various optimization regions and the potential gains associated with these regions (§3.2).

• We demonstrate the utility of our model on five different cryptographic accelerators and three different GPU architectures (§5).

2 THE LogCA MODEL

LogCA assumes an abstract system with three components (Figure 2 (a)): Host is a general-purpose processor; Accelerator is a hardware device designed for the efficient implementation of an algorithm; and Interface connects the host and accelerator, abstracting away system details including the memory hierarchy.

Our model uses the interface abstraction to provide intuition for the overhead and latency of dispatching work to an accelerator. This abstraction enables modeling of different paradigms for attaching accelerators: directly connected, system bus, or PCIe. This also gives the flexibility to use our model for both on-chip and off-chip accelerators. This abstraction can also be trivially mapped to shared memory systems or other memory hierarchies in heterogeneous architectures. The model further abstracts the underlying architecture using the five parameters defined in Table 1.

Figure 2 (b) illustrates the overhead and latency model for an un-pipelined accelerator, where computation 'i' is returned before requesting computation 'i+1'. Figure 2 (b) also shows the breakdown of time for an algorithm on the host and accelerator. We assume that the algorithm's execution time is a function of granularity, i.e., the size of the offloaded data.


Table 1: Description of the LogCA parameters.

Parameter           | Symbol | Description                                                                                                                        | Units
Latency             | L      | Cycles to move data from the host to the accelerator across the interface, including the cycles data spends in the caches or memory | Cycles
Overhead            | o      | Cycles the host spends in setting up the algorithm                                                                                 | Cycles
Granularity         | g      | Size of the offloaded data                                                                                                         | Bytes
Computational Index | C      | Cycles the host spends per byte of data                                                                                            | Cycles/Byte
Acceleration        | A      | The peak speedup of an accelerator                                                                                                 | N/A

[Figure 2: Top-level description of the LogCA model. (a) The three components: Host, Accelerator, and the Interface connecting them. (b) Time-line for the computation performed on the host system (above), T0(g) = C0(g), and on an accelerator (below), T1(g) = o1(g) + L1(g) + C1(g) with C1(g) = C0(g)/A; the difference is the gain.]

With this assumption, the unaccelerated time T0 (time with zero accelerators) to process data of granularity g will be T0(g) = C0(g), where C0(g) is the computation time on the host.

When the data is offloaded to an accelerator, the new execution time T1 (time with one accelerator) is T1(g) = O1(g) + L1(g) + C1(g), where O1(g) is the host overhead time in offloading 'g' bytes of data to the accelerator, L1(g) is the interface latency, and C1(g) is the computation time on the accelerator to process data of granularity g.

To make our model more concrete, we make several assumptions. We assume that an accelerator with acceleration 'A' can decrease, in the absence of overheads, the algorithm's computation time on the host by a factor of 'A', i.e., the accelerator and host use algorithms with the same complexity. Thus, the computation time on the accelerator will be C1(g) = C0(g)/A. This reduction in the computation time results in performance gains, and we quantify these gains with speedup, the ratio of the un-accelerated and accelerated time:

Speedup(g) = T0(g) / T1(g) = C0(g) / (O1(g) + L1(g) + C1(g))    (1)

We assume that the computation time is a function of the computational index 'C' and granularity, i.e., C0(g) = C * f(g), where f(g) signifies the complexity of the algorithm. We also assume that f(g) is a power function of 'g', i.e., O(g^β). This assumption results in a simple closed-form model and bounds the performance for a majority of the prevalent algorithms in the high-performance computing community [4], ranging from sub-linear (β < 1) to super-linear (β > 1) complexities. However, this assumption may not work well for logarithmic complexity algorithms, i.e., O(log(g)), O(g log(g)). This is because, asymptotically, there is no power function which grows slower than a logarithmic function. Despite this limitation, we observe that, in the granularity range of our interest, LogCA can also bound the performance for logarithmic functions (§5).

For many algorithms and accelerators, the overhead is independent of the granularity, i.e., O1(g) = o. Latency, on the other hand, will often be granularity dependent, i.e., L1(g) = L * g. Latency may be granularity independent if the accelerator can begin operating when the first byte (or block) arrives at the accelerator, i.e., L1(g) = L. Thus, LogCA can also model pipelined interfaces using the granularity independent latency assumption.

We define computational intensity¹ as the ratio of computational index to latency, i.e., C/L, and it signifies the amount of work done on a host per byte of offloaded data. Similarly, we define the accelerator's computational intensity as the ratio of computational intensity to acceleration, i.e., (C/A)/L, and it signifies the amount of work done on an accelerator per byte of offloaded data.

For simplicity, we begin with the assumption of granularity independent latency. We revisit granularity dependent latencies later (§2.3). With these assumptions,

Speedup(g) = C * f(g) / (o + L + C * f(g)/A) = C * g^β / (o + L + C * g^β / A)    (2)

The above equation shows that the speedup depends on the LogCA parameters, and these parameters can be changed by architects and programmers through algorithmic and design choices. An architect can reduce the latency by integrating an accelerator more closely with the host, for example, placing it on the processor die rather than on an I/O bus. An architect can also reduce the overheads by designing a simpler interface, i.e., limited OS intervention and address translations, lower initialization time, and reduced data copying between buffers (memories). A programmer can increase the computational index by increasing the amount of work per byte offloaded to an accelerator. For example, kernel fusion [47, 52], where multiple computational kernels are fused into one, tends to increase the computational index. Finally, an architect can typically increase the acceleration by investing more chip resources or power in an accelerator.
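To make the model concrete, the following is a minimal Python sketch of equation (2), assuming granularity independent latency. The parameter values are the UltraSPARC T2 AES measurements from Table 5, with β taken from Table 4; the chosen granularities are arbitrary.

```python
# A minimal sketch of equation (2): speedup under granularity
# independent latency. Parameters follow Table 5 (UltraSPARC T2, AES),
# with beta from Table 4.
def speedup(g, C, beta, o, L, A):
    """Speedup(g) = C*g^beta / (o + L + C*g^beta/A)."""
    work = C * g ** beta              # host computation time C0(g), in cycles
    return work / (o + L + work / A)

# UltraSPARC T2, AES: L = 1500, o = 2.9e4, C = 90 cycles/B, A = 19
for g in (16, 256, 4096, 65536, 1 << 25):
    s = speedup(g, C=90, beta=1.01, o=2.9e4, L=1500, A=19)
    print(f"g = {g:>10} B -> speedup = {s:6.2f}")
```

As expected, the speedup is negligible at 16B, crosses 1 near the break-even granularity, and approaches A = 19 at very large granularities.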

2.1 Effect of Granularity

A key aspect of LogCA is that it captures the effect of granularity on the accelerator's speedup. Figure 3 shows this behavior, i.e., speedup increases with granularity and is bounded by the acceleration 'A'.

¹Not to be confused with operational intensity [54], which signifies operations performed per byte of DRAM traffic.


[Figure 3: A graphical description of the performance metrics: speedup versus granularity, with g1 and g_A/2 marked on the granularity axis and speedups of 1, A/2, and A marked on the vertical axis.]

At one extreme, for large granularities, equation (2) becomes

lim(g→∞) Speedup(g) = A    (3)

while for small granularities, equation (2) reduces to

lim(g→0) Speedup(g) ≃ C / (o + L + C/A) < C / (o + L)    (4)

Equation (4) is simply Amdahl's Law [2] for accelerators, demonstrating the dominating effect of overheads at small granularities.

2.2 Performance Metrics

To help programmers decide when and how much computation to offload, we define two performance metrics. These metrics are inspired by the vector machine metrics Nv and N1/2 [18], where Nv is the vector length needed to make vector mode faster than scalar mode, and N1/2 is the vector length needed to achieve half of the peak performance. Since vector length is an important parameter in determining performance gains for vector machines, these metrics characterize the behavior and efficiency of vector machines with reference to scalar machines. Our metrics serve the same purpose in the accelerator domain.

g1: The granularity to achieve a speedup of 1 (Figure 3). It is the break-even point where the accelerator's performance becomes equal to the host's. Thus, it is the minimum granularity at which an accelerator starts providing benefits. Solving equation (2) for g1 gives

g1 = [ (A / (A − 1)) * ((o + L) / C) ]^(1/β)    (5)

IMPLICATION 1. g1 is essentially independent of acceleration for large values of 'A'.

For reducing g1, the above implication guides an architect to invest resources in improving the interface.

IMPLICATION 2. Doubling the computational index reduces g1 by a factor of 2^(1/β).

The above implication demonstrates the effect of algorithmic complexity on g1 and shows that varying the computational index has a profound effect on g1 for sub-linear algorithms. For example, for a sub-linear algorithm with β = 0.5, doubling the computational index decreases g1 by a factor of four. However, for linear (β = 1) and quadratic (β = 2) algorithms, g1 decreases by factors of two and √2, respectively.
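A quick numerical check of this implication, in the same Python sketch style (the interface values o, L, A are the UltraSPARC T2 numbers from Table 5; the β values are illustrative):

```python
# Equation (5) and Implication 2: doubling C scales g1 by 2^(-1/beta),
# i.e., reduces it by a factor of 2^(1/beta).
def g1(C, beta, o, L, A):
    return ((A / (A - 1)) * (o + L) / C) ** (1 / beta)

o, L, A = 2.9e4, 1500, 19            # UltraSPARC T2-like interface (Table 5)
for beta in (0.5, 1.0, 2.0):
    ratio = g1(180, beta, o, L, A) / g1(90, beta, o, L, A)
    print(f"beta = {beta}: doubling C scales g1 by {ratio:.3f}")
# Prints 0.250, 0.500, 0.707: factors of 4, 2, and sqrt(2), respectively.
```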

g_A/2: The granularity to achieve a speedup of half of the acceleration. This metric provides information about a system's behavior after the break-even point and shows how quickly the speedup can ramp towards acceleration. Solving equation (2) for g_A/2 gives

g_A/2 = [ A * ((o + L) / C) ]^(1/β)    (6)

Using equations (5) and (6), g1 and g_A/2 are related as

g_A/2 = (A − 1)^(1/β) * g1    (7)

IMPLICATION 3. Doubling the acceleration 'A' increases the granularity to attain A/2 by a factor of 2^(1/β).

The above implication demonstrates the effect of acceleration on g_A/2 and shows that this effect is more pronounced for sub-linear algorithms. For example, for a sub-linear algorithm with β = 0.5, doubling the acceleration increases g_A/2 by a factor of four. However, for linear and quadratic algorithms, g_A/2 increases by factors of two and √2, respectively.

For architects, equation (7) also exposes an interesting design trade-off between acceleration and the performance metrics. Typically, an architect may prefer higher acceleration and lower g1 and g_A/2. However, equation (7) shows that increasing acceleration also increases g_A/2. This presents a dilemma for an architect: favor either higher acceleration or reduced granularity, especially for sub-linear algorithms. LogCA helps by exposing these trade-offs at an early design stage.
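The relation in equation (7) and the effect in implication 3 can be checked numerically as well; a short sketch (same illustrative parameters as before):

```python
# Equations (6) and (7): g_{A/2}, and its relation to g1.
def g_half_A(C, beta, o, L, A):
    return (A * (o + L) / C) ** (1 / beta)

def g1(C, beta, o, L, A):
    return ((A / (A - 1)) * (o + L) / C) ** (1 / beta)

C, o, L, A = 90, 2.9e4, 1500, 19
for beta in (0.5, 1.0, 2.0):
    grow = g_half_A(C, beta, o, L, 2 * A) / g_half_A(C, beta, o, L, A)
    check = g_half_A(C, beta, o, L, A) / ((A - 1) ** (1 / beta) * g1(C, beta, o, L, A))
    print(f"beta = {beta}: doubling A scales g_A/2 by {grow:.2f} (eq. 7 ratio = {check:.3f})")
# Doubling A scales g_A/2 by 2^(1/beta): 4.00, 2.00, 1.41; the eq. (7) ratio is 1.000.
```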

In our model, we also use g1 to determine the complexity of the system's interface. A lower g1 (towards the left side of the plot in Figure 3) is desirable, as it implies a system with lower overheads and thus a simpler interface. Likewise, g1 increases with the complexity of the interface or when an accelerator moves further away from the host.

2.3 Granularity Dependent Latency

The previous section assumed latency is granularity independent, but we have observed granularity dependent latencies in GPUs. In this section, we discuss the effect of granularity on speedup and derive performance metrics assuming granularity dependent latency.

Assuming granularity dependent latency, equation (1) reduces to

Speedup(g) = C * g^β / (o + L * g + C * g^β / A)    (8)

For large granularities, equation (8) reduces to

lim(g→∞) Speedup(g) = A / ( (A / (C * g^β)) * (L * g) + 1 ) < (C/L) * g^(β−1)    (9)

Unlike equation (3), speedup in the above equation approaches (C/L) * g^(β−1) at large granularities. Thus, for linear algorithms with granularity dependent latency, instead of acceleration, speedup is limited by C/L. However, for super-linear algorithms this limit increases by a factor of g^(β−1), whereas for sub-linear algorithms this limit decreases by a factor of g^(β−1).


IMPLICATION 4. With granularity dependent latency, the speedup for sub-linear algorithms asymptotically decreases with the increase in granularity.

The above implication suggests that, for sub-linear algorithms on systems with granularity dependent latency, speedup may decrease for some large granularities. This happens because, for large granularities, the communication latency (a linear function of granularity) may be higher than the computation time (a sub-linear function of granularity) on the accelerator, resulting in a net de-acceleration. This implication is surprising, as earlier we observed that, for systems with granularity independent latency, speedup for all algorithms increases with granularity and approaches acceleration for very large granularities.
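A short sketch of equation (8) makes this non-monotonic behavior visible; all parameter values below are hypothetical, chosen so the kernel first breaks even and then de-accelerates:

```python
# Equation (8): speedup with granularity dependent latency, L1(g) = L*g.
# For a sub-linear kernel (beta < 1) the speedup peaks and then decays,
# as Implication 4 predicts. All parameter values are hypothetical.
def speedup_dep(g, C, beta, o, L, A):
    work = C * g ** beta
    return work / (o + L * g + work / A)

C, beta, o, L, A = 5000, 0.5, 1e4, 2, 30
for g in (256, 4096, 1 << 20, 1 << 25):
    print(f"g = {g:>9} B -> speedup = {speedup_dep(g, C, beta, o, L, A):7.2f}")
# Rises to about 11x near g = beta*o / (L*(1 - beta)) = 5000 B,
# then falls below 1 as the L*g term dominates.
```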

For very small granularities, equation (8) reduces to

lim(g→0) Speedup(g) ≃ A * C / (A * (o + L) + C)    (10)

Similar to equation (4), the above equation exposes the increasing effects of overheads at small granularities. Solving equation (8) for g1 using Newton's method [53] gives

g1 = ( C * (β − 1) * (A − 1) + A * o ) / ( C * β * (A − 1) − A * L )    (11)

For a positive value of g1, equation (11) must satisfy C/L > 1/β. Thus, for achieving any speedup for linear algorithms, C/L should be at least 1. However, for super-linear algorithms a speedup of 1 can be achieved at values of C/L smaller than 1, whereas for sub-linear algorithms C/L must be greater than 1.

IMPLICATION 5. With granularity dependent latency, the computational intensity for sub-linear algorithms should be greater than 1 to achieve any gains.

Thus, for sub-linear algorithms, the computational index has to be greater than the latency to justify offloading the work. However, for higher-complexity algorithms, the computational index can be quite small and offloading can still be potentially useful.

Similarly, solving equation (8) using Newton's method for g_A/2 gives

g_A/2 = ( C * (β − 1) + A * o ) / ( C * β − A * L )    (12)

For a positive value of g_A/2, equation (12) must satisfy (C/A)/L > 1/β. Thus, for achieving a speedup of A/2, C/L should be at least 'A' for linear algorithms. However, for super-linear algorithms a speedup of A/2 can be achieved at values of C/L smaller than 'A', whereas for sub-linear algorithms C/L must be greater than 'A'.

IMPLICATION 6. With granularity dependent latency, the accelerator's computational intensity for sub-linear algorithms should be greater than 1 to achieve a speedup of half of the acceleration.

The above implication suggests that, for achieving half of the acceleration with sub-linear algorithms, the computation time on the accelerator must be greater than the latency. However, for super-linear algorithms, that speedup can be achieved even if the computation time on the accelerator is lower than the latency. Programmers can use the above implications to determine, early in the design cycle, whether to put time and effort into porting a code to an accelerator.
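These existence conditions are easy to check before investing porting effort; a sketch using the binary search values from Tables 4 and 5 for the discrete GPU (the unit interpretation of L here is an assumption):

```python
# Implications 5 and 6 as feasibility checks, under granularity
# dependent latency: g1 exists only if C/L > 1/beta, and g_{A/2}
# exists only if (C/A)/L > 1/beta.
def offload_feasibility(C, L, A, beta):
    return {
        "g1 exists":    C / L > 1 / beta,
        "g_A/2 exists": (C / A) / L > 1 / beta,
    }

# Binary search on the discrete GPU (Tables 4 and 5):
# beta = 0.14, C = 116 cycles/B, L = 3e3, A = 30.
print(offload_feasibility(C=116, L=3e3, A=30, beta=0.14))
# Both False: the kernel never breaks even, matching Section 5.3.
```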

[Figure 4: LogCA helps in visually identifying (a) compute-bound and (b) latency-bound kernels. Each panel plots speedup versus granularity, marking g1, C/L, and A; for compute-bound kernels lim(g→∞) Speedup(g) = A (whether C/L > A or C/L < A), while for latency-bound kernels lim(g→∞) Speedup(g) is at most C/L.]

For example, consider a system with a minimum desirable speedup of one half of the acceleration, but with a computational intensity less than the acceleration. With the above implication, architects and programmers can infer, early in the design stage, that the desired speedup can not be achieved for sub-linear and linear algorithms. However, the desired speedup can be achieved with super-linear algorithms.

We are also interested in quantifying the limits on achievable speedup due to overheads and latencies. To do this, we assume a hypothetical accelerator with infinite acceleration and calculate the granularity (gA) to achieve the peak speedup of 'A'. With this assumption, the desired speedup of 'A' is only limited by the overheads and latencies. Solving equation (8) for gA gives

gA = ( C * (β − 1) + A * o ) / ( C * β − A * L )    (13)

Surprisingly, we find that the above equation is similar to equation (12), i.e., gA equals g_A/2. This observation shows that, with a hypothetical accelerator, the peak speedup can now be achieved at the same granularity as g_A/2. This observation also demonstrates that if g_A/2 is not achievable on a system, i.e., (C/A)/L < 1/β as per equation (12), then despite increasing the acceleration, gA will not be achievable, and the speedup will still be bounded by the computational intensity.

IMPLICATION 7. If a speedup of A/2 is not achievable on an accelerator with acceleration 'A', then despite increasing the acceleration to Ã (where Ã > A), the speedup is bounded by the computational intensity.

The above implication helps architects in allocating more resources for an efficient interface instead of increasing acceleration.


[Figure 5: The effect on speedup of a 10x improvement in each LogCA parameter: (a) latency, (b) overheads, (c) computational index, and (d) acceleration. The base case is the speedup of AES [30] on UltraSPARC T2, plotted against granularities from 16B to 32MB.]

3 APPLICATIONS OF LogCA

In this section, we describe the utility of LogCA for visually identifying the performance bounds, design bottlenecks, and possible optimizations to alleviate these bottlenecks.

3.1 Performance Bounds

Earlier, we observed that the speedup is bounded by either the acceleration (equation 3) or the product of the computational intensity and g^(β−1) (equation 9). Using these observations, we classify kernels as either compute-bound or latency-bound. For compute-bound kernels, the achievable speedup is bounded by the acceleration, whereas for latency-bound kernels the speedup is bounded by the computational intensity. Based on this classification, a compute-bound kernel can either be running on a system with granularity independent latency or have super-linear complexity while running on a system with granularity dependent latency. Figure 4-a illustrates these bounds for compute-bound kernels. On the other hand, a latency-bound kernel is running on a system with granularity dependent latency and has either linear or sub-linear complexity. Figure 4-b illustrates these bounds for latency-bound kernels.

Programmers and architects can visually identify these bounds and use this information to invest their time and resources in the right direction. For example, for compute-bound kernels, depending on the operating granularity, it may be beneficial to invest more resources in either increasing acceleration or reducing overheads. However, for latency-bound kernels, optimizing acceleration and overheads is not that critical; decreasing latency and increasing the computational index may be more beneficial.

3.2 Sensitivity Analysis

To identify the design bottlenecks, we perform a sensitivity analysis of the LogCA parameters. We consider a parameter a design bottleneck if a 10x improvement in it provides at least a 20% improvement in speedup. A 'bottlenecked' parameter also provides an optimization opportunity. To visually identify these bottlenecks, we introduce optimization regions. As an example, we identify design bottlenecks in UltraSPARC T2's crypto accelerator by varying its individual parameters² in Figure 5 (a)-(d).

²We elaborate our methodology for measuring LogCA parameters later (§4).
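This bottleneck test is mechanical and can be scripted; a sketch under the granularity independent latency assumption, reusing the UltraSPARC T2 AES parameters from Table 5 (β from Table 4):

```python
# Section 3.2's sensitivity test: a parameter is a bottleneck at
# granularity g if a 10x improvement in it yields at least a 20% gain.
def speedup(g, C, beta, o, L, A):                 # equation (2)
    work = C * g ** beta
    return work / (o + L + work / A)

def bottlenecks(g, C, beta, o, L, A, factor=10, threshold=1.2):
    base = speedup(g, C, beta, o, L, A)
    tweaked = {
        "L": speedup(g, C, beta, o, L / factor, A),
        "o": speedup(g, C, beta, o / factor, L, A),
        "C": speedup(g, C * factor, beta, o, L, A),
        "A": speedup(g, C, beta, o, L, A * factor),
    }
    return [p for p, s in tweaked.items() if s / base >= threshold]

# UltraSPARC T2, AES (Table 5): o and C are bottlenecks at small
# granularities, A at large ones, reproducing the regions of Figure 6.
for g in (64, 1 << 11, 1 << 16, 1 << 25):
    print(g, bottlenecks(g, C=90, beta=1.01, o=2.9e4, L=1500, A=19))
```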

Figure 5 (a) shows the variation (or lack thereof) in speedup with the decrease in latency. The resulting gains are negligible and independent of the granularity, as it is a closely coupled accelerator.

Figure 5 (b) shows the resulting speedup after reducing overheads. Since the overheads are a one-time initialization cost and independent of granularity, the per-byte setup cost is high at small granularities. Decreasing these overheads considerably reduces the per-byte setup cost and results in significant gains at these smaller granularities. Conversely, for larger granularities the per-byte setup cost is already amortized, so reducing overheads does not provide much gain. Thus, overhead is a bottleneck at small granularities and provides an opportunity for optimization.

Figure 5 (c) shows the effect of increasing the computational index. The results are similar to optimizing overheads in Figure 5 (b), i.e., significant gains for small granularities and a gradual decrease in the gains with increasing granularity. With constant overheads, increasing the computational index increases the computation time of the kernel and decreases the per-byte setup cost. For smaller granularities, the reduced per-byte setup cost results in significant gains.

Figure 5 (d) shows the variation in speedup with increasing peak acceleration. The gains are negligible at small granularities and become significant for large granularities. As mentioned earlier, the per-byte setup cost is high at small granularities and it reduces for large granularities. Since increasing peak acceleration does not reduce the per-byte setup cost, optimizing peak acceleration provides gains only at large granularities.

We group these individual sensitivity plots in Figure 6 to build the optimization regions. As mentioned earlier, each region indicates the potential for 20% gains with a 10x variation of one or more LogCA parameters. For ease of understanding, we color these regions and label them with their respective LogCA parameters. For example, the blue colored region labelled 'oC' (16B to 2KB) indicates an optimization region where optimizing overheads and computational index is beneficial. Similarly, the red colored region labelled 'A' (32KB to 32MB) represents an optimization region where only optimizing peak acceleration is beneficial. The granularity range occupied by a parameter also identifies the scope of optimization for an architect and a programmer. For example, for UltraSPARC T2, overheads occupy most of the lower granularities, suggesting an opportunity for improving the interface. Similarly, the absence of the latency parameter suggests little benefit from optimizing latency.

We also add horizontal arrows to the optimization regions in Figure 6 to demarcate the start and end of the granularity range for each parameter.


Table 2: Description of the cryptographic accelerators.

Crypto Accelerator | PCI Crypto       | UltraSPARC T2     | SPARC T3          | SPARC T4            | Sandy Bridge
Processor          | AMD A8-3850      | S2                | S2                | S3                  | Intel Core i7-2600
Frequency          | 2.9 GHz          | 1.16 GHz          | 1.65 GHz          | 3 GHz               | 3.4 GHz
OpenSSL version    | 0.9.8o           | 0.9.8o            | 0.9.8o            | 1.0.2, 1.0.1k       | 0.9.8o
Kernel             | Ubuntu 3.13.0-55 | Oracle Solaris 11 | Oracle Solaris 11 | Oracle Solaris 11.2 | Linux 2.6.32-504

[Figure 6: Optimization regions for UltraSPARC T2 across granularities from 16B to 32MB. The presence of a parameter in an optimization region indicates that it can provide at least 20% gains. The horizontal arrows indicate the cut-off granularities at which a parameter provides 20% gains.]

For example, optimizing acceleration starts providing benefits from 2KB, while optimizing overheads or the computational index is beneficial up till 32KB. These arrows also indicate the cut-off granularity for each parameter. These cut-off granularities provide insights to architects and programmers about the design bottlenecks. For example, the high cut-off granularity of 32KB suggests high overheads and thus a potential for optimization.

4 EXPERIMENTAL METHODOLOGY

This section describes the experimental setup and benchmarks for validating LogCA on real machines. We also discuss our methodology for measuring LogCA parameters and performance metrics.

Our experimental setup comprises on-chip and off-chip crypto accelerators (Table 2) and three different GPUs (Table 3). The on-chip crypto accelerators include cryptographic units on Sun/Oracle UltraSPARC T2 [40], SPARC T3 [35], and SPARC T4 [41], and AES-NI (AES New Instructions) [15] on Sandy Bridge, whereas the off-chip accelerator is a Hifn 7955 chip connected through the PCIe bus [43]. The GPUs include a discrete NVIDIA GPU, an integrated AMD GPU (APU), and an HSA supported integrated GPU.

For the on-chip crypto accelerators, each core in UltraSPARC T2 and SPARC T3 has a physically addressed crypto unit which requires privileged DMA calls. However, the crypto unit on SPARC T4 is integrated within the pipeline and does not require privileged DMA calls. SPARC T4 also provides non-privileged crypto instructions to access the crypto unit. Similar to SPARC T4, Sandy Bridge provides a non-privileged crypto instruction, AES-NI.

Considering the GPUs, the discrete GPU is connected through the PCIe bus, whereas for the APU the GPU is co-located with the host processor on the same die. For the APU, the system memory is partitioned between host and GPU memory. This eliminates the PCIe bottleneck of data copying, but it still requires copying data between memories. Unlike the discrete GPU and APU, the HSA supported GPU provides a unified and coherent view of the system memory. With the host and GPU sharing the same virtual address space, explicit copying of data between memories is not required.

Our workloads consist of encryption, hashing, and GPU kernels. For encryption and hashing, we use the Advanced Encryption Standard (AES) [30] and Secure Hashing Algorithm (SHA) [31], respectively, from OpenSSL [34], an open source cryptography library. For GPU kernels, we use matrix multiplication, radix sort, FFT, and binary search from the AMD OpenCL SDK [1]. In Table 4 we list the complexities of each kernel, both in terms of the number of elements n and the granularity g. We expect these complexities to remain the same in both cases, but we observe that they differ for matrix multiplication. For example, for a square matrix of size n, matrix multiplication has complexity O(n^3), whereas the complexity in terms of granularity is O(g^1.7). This happens because for matrix multiplication, unlike the others, computations are performed on matrices and not vectors. So offloading a square matrix of size n corresponds to offloading n^2 elements, which results in the apparent discrepancy in the complexities. We also observe that, for the granularity range of 16B to 32MB, β = 0.11 provides a close approximation for log(g).

Table 3: Description of the GPUs.

Platform        | Discrete GPU     | Integrated APU | AMD HSA
Name            | Tesla C2070      | Radeon HD 6550 | Radeon R7
Architecture    | Fermi            | Beaver Creek   | Kaveri
Cores           | 16               | 5              | 8
Compute Units   | 448              | 400            | 512
Clock Freq.     | 1.5 GHz          | 600 MHz        | 720 MHz
Peak FLOPS      | 1 T              | 480 G          | 856 G
Host Processor  | Intel Xeon E5520 | AMD A8-3850    | AMD A10-7850K
Frequency (GHz) | 2.27             | 2.9            | 1.7

For calculating execution times, we use Linux utilities on the crypto accelerators, whereas for the GPUs we use the NVIDIA and AMD OpenCL profilers to compute the setup, kernel, and data transfer times, and we report the average of one hundred executions. For verifying the usage of the crypto accelerators, we use built-in counters in UltraSPARC T2 and T3 [46]. SPARC T4, however, no longer supports these counters, so we use Linux utilities to trace the execution of the crypto instructions [3].


Table 4: Algorithmic complexity of various kernels with number of elements and granularity. The power of g represents β for each kernel.

Kernel                             | Algorithmic Complexity
Advanced Encryption Standard (AES) | O(n), O(g^1.01)
Secure Hashing Algorithm (SHA)     | O(n), O(g^0.97)
Matrix Multiplication (GEMM)       | O(n^3), O(g^1.7)
Fast Fourier Transform (FFT)       | O(n log n), O(g^1.2)
Radix Sort                         | O(kn), O(g^0.94)
Binary Search                      | O(log n), O(g^0.14)

Table 5: Calculated values of the LogCA parameters.

Device          | Benchmark     | L (cycles) | o (cycles) | C (cycles/B) | A
Discrete GPU    | AES           | 3×10³      | 2×10⁸      | 174          | 30
                | Radix Sort    |            |            | 290          |
                | GEMM          |            |            | 2            |
                | FFT           |            |            | 290          |
                | Binary Search |            |            | 116          |
APU             | AES           | 15         | 4×10⁸      | 174          | 7
                | Radix Sort    |            |            | 290          |
                | GEMM          |            |            | 2            |
                | FFT           |            |            | 290          |
                | Binary Search |            |            | 116          |
UltraSPARC T2   | AES           | 1500       | 2.9×10⁴    | 90           | 19
                | SHA           |            | 1.05×10³   | 72           | 12
SPARC T3        | AES           | 1500       | 2.7×10⁴    | 90           | 12
                | SHA           |            | 1.05×10³   | 72           | 10
SPARC T4        | AES           | 500        | 435        | 32           | 12
                | SHA           |            | 1.6×10³    | 32           | 10
SPARC T4 instr. | AES           | 4          | 111        | 32           | 12
                | SHA           |            | 1638       | 32           | 10
Sandy Bridge    | AES           | 3          | 10         | 35           | 6

We use these execution times to determine the LogCA parameters. We calculate these parameters once per system, and they can later be reused for different kernels on the same system.

For the computational index and β, we profile the CPU code on the host by varying the granularity from 16B to 32MB. At each granularity, we measure the execution time and use regression analysis to determine C and β. For overheads, we use the observation that for very small granularities the execution time for a kernel on an accelerator is dominated by the overheads, i.e., lim(g→0) T1(g) ≃ o. For acceleration, we use different methods for the on-chip accelerators and the GPUs. For on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches the acceleration for very large granularities. However, for the GPUs we do not use equation (3), as it would require computing the acceleration for each kernel, since each application has a different access pattern which affects the speedup. So we bound the maximum performance using the peak FLOPS from the device specifications, and use the ratio of peak GFLOPS on the GPU and CPU, i.e., A = Peak GFLOPS_GPU / Peak GFLOPS_CPU. Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and accelerators. On the other hand, for the GPUs we compute latency using the peak memory bandwidth of the GPU. Similar to Meswani et al. [29], we use L = 1 / BW_peak for measuring the data copying time for the GPUs.
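The regression step is an ordinary least-squares fit in log space, since T0(g) = C * g^β becomes a straight line after taking logarithms; a sketch with made-up measurements for illustration:

```python
# Fitting C and beta (Section 4): log T0 = beta*log g + log C.
import numpy as np

# Hypothetical host-side measurements: granularity (bytes) and cycles.
g = np.array([16, 256, 4096, 65536, 1048576], dtype=float)
t = np.array([1.5e3, 2.4e4, 3.8e5, 6.1e6, 9.8e7], dtype=float)

beta, log_C = np.polyfit(np.log(g), np.log(t), deg=1)
C = np.exp(log_C)
print(f"beta = {beta:.2f}, C = {C:.0f} cycles/B")   # ~linear: beta ~ 1, C ~ 94
```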

Earlier, we developed our model using the assumptions of granularity independent and granularity dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity independent latency, while the off-chip crypto accelerator and the discrete GPU/APU represent granularity dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).

5 EVALUATION

In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We also list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we also present two case studies. In these studies, we consider the evolution of the interface in Sun/Oracle's crypto accelerators and three different GPU architectures. In both cases, we elaborate the design changes using the insights LogCA provides.

5.1 Linear-Complexity Kernels (β = 1)

Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through different interfaces ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by the computational intensity for off-chip accelerators and by the acceleration for on-chip accelerators. This observation supports the earlier implications on the limits of speedup for granularity independent and dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but it breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with powerful processing cores, has the highest acceleration among the others. However, its observed speedup is less than the others due to the high overheads and latencies involved in communicating through the PCIe bus.

We have also marked g1 and g_A/2 for each accelerator in Figure 7, which helps programmers and architects identify the complexity of the interface. For example, g1 for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators g1 lies on the far right. It is worth mentioning that we have marked g_A/2 for the on-chip accelerators but not for the off-chip accelerators. For off-chip accelerators, the computational intensity is less than the acceleration, and as we have noted in equation (12), g_A/2 for these designs does not exist.

We also observe that g1 for the crypto-card connected through the PCIe bus does not exist, showing that this accelerator does not break even even for large granularities. Figure 7 also shows that g1 for the GPU and APU is comparable. This observation shows that, despite being an integrated GPU and not connected to the PCIe bus, the APU spends considerable time copying data from the host to device memory.


[Figure 7: Speedup curve fitting plots comparing LogCA with the observed values of AES [30] on (a) PCIe crypto, (b) NVIDIA discrete GPU, (c) AMD integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, and (h) AES-NI on Sandy Bridge. Each plot marks A, C/L, g1, and g_A/2 where they exist; for the SPARC T4 instruction g1 < 16B, and for Sandy Bridge both g1 and g_A/2 < 16B.]

[Figure 8: Speedup curve fitting plots comparing LogCA with the observed values of SHA256 [31] on (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, and (d) SPARC T4 instruction (for which g1 < 16B). LogCA starts following the observed values after 64B.]


Figure 8 shows the curve fitting for SHA on various on-chip crypto accelerators. We observe that g1 and g_A/2 do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values before 64B. This happens because SHA requires a block size of 64B for hash computation. If the block size is less than 64B, it pads extra bits to make the block size 64B. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.

Figure 9-a shows the speedup curve fitting plots for Radix sort. We observe that LogCA does not follow the observed values for smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g1 and g_A/2. We also observe that g_A/2 for the GPU is higher than for the APU, and this observation supports equation (7): increasing acceleration increases g_A/2.

5.2 Super-Linear Complexity Kernels (β > 1)

Figures 9-b and 9-c show the speedup curve fitting plots for super-linear complexity kernels on the discrete GPU and APU. We observe that matrix multiplication, with higher complexity (O(g^1.7)), achieves higher speedup than sort and FFT, with their lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms is higher than that of lower-complexity algorithms. We also observe that g_A/2 does not exist for FFT. This happens because, as we note in equation (12), for g_A/2 to exist for FFT, C/L should be greater than A/1.2. However, Figure 9-c shows that C/L is smaller than A/1.2 for both the GPU and APU.

5.3 Sub-Linear Complexity Kernels (β < 1)

Figure 9-d shows the curve fitting for binary search, which is a sub-linear algorithm (β = 0.14). We make three observations. First, g1 does not exist even for very large granularities, and C/L < 1. This observation supports implication (5), that for a sub-linear algorithm with β = 0.14, C/L should be greater than 7 to provide any speedup.


[Figure 9: Speedup curve fitting plots comparing LogCA with the observed values of (a) Radix Sort, (b) Matrix Multiplication, (c) FFT, and (d) Binary Search, each on the GPU and the APU, marking A, C/L, g1, and g_A/2 where they exist.]

Second, for large granularities, speedup starts decreasing with an increase in granularity. This observation supports our earlier claim in implication (4) that, for systems with granularity dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches. As mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search behavior, it still provides an upper bound on the achievable performance.

5.4 Case Studies

Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara (Figure 10 (a)) to accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent in the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, representing the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand, from Figure 10 (a-e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads at 32KB suggests a complex interface, indicating that high overhead OS calls create a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, and it is also evident in the smaller g1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not even provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the maximum range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low overhead access which LogCA shows is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.

Figure 11 shows the evolution of memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an AMD integrated GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication for all three architectures is compute bound (§3.1).


[Figure 10: LogCA for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, and (e) SPARC T4 instruction. LogCA identifies the design bottlenecks through the LogCA parameters in each optimization region. The bottlenecks which LogCA suggests in each design are optimized in the next design.]

[Figure 11: Various optimization regions for matrix multiplication over a range of granularities on (a) NVIDIA discrete GPU, (b) AMD APU, and (c) HSA supported AMD integrated GPU.]

We also observe that the computational index occupies most of the regions, which signifies maximum optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads at 32KB indicates high overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)). There are a few notable differences as compared to the discrete GPU: the cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and, with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent in all regions and the cut-off granularity for overhead reduces to 8KB. These reductions in overheads and latencies signify a simpler interface as compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques in two categories and discuss the most relevant work.

Analytical Models: There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost into early-stage modeling of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in its complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models: In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture specific, for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerator's interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we currently assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967 Spring Joint Computer Conference (AFIPS '67 Spring), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11), 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the International Symposium on High-Performance Computer Architecture, 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report, Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (Jul. 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12), 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46), 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15), 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (Jul. 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (Jul. 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient SPARC SOC. In IEEE Asian Solid-State Circuits Conference (A-SSCC '07), 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. SPARC T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC) 2012, 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14), 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. PhD Dissertation, Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture, 382-393. https://doi.org/10.1109/HPCA.2011.5749745



accelerator and its interface with the system. In some cases, these factors may outweigh the potential benefits, resulting in lower than expected or, in the worst case, no performance gains. Figure 1 illustrates such an outcome for the crypto accelerator in UltraSPARC T2 running the Advanced Encryption Standard (AES) kernel [30].

Figure 1 provides two key observations. First, accelerators can under-perform as compared to a general-purpose core, e.g., the accelerated version on UltraSPARC T2 outperforms the unaccelerated one only after crossing a threshold block size, i.e., the break-even point (Figure 1-a). Second, different accelerators, while executing the same kernel, have different break-even points, e.g., SPARC T4 breaks even for smaller offloaded data, while UltraSPARC T2 and the GPU break even for large offloaded data (Figure 1-b).

Understanding the factors which dictate the performance of an accelerator is crucial for both architects and programmers. Programmers need to be able to predict when offloading a kernel will be performance efficient. Similarly, architects need to understand how the accelerator's interface, and the resulting latency and overheads to offload a kernel, will affect the achievable accelerator performance. Considering the AES encryption example, programmers and architects would greatly benefit from understanding: What bottlenecks cause UltraSPARC T2 and the GPU to under-perform for small data sizes? Which optimizations on UltraSPARC T2 and the GPU result in similar performance to SPARC T4? Which optimizations are programmer dependent and which are architect dependent? What are the trade-offs in selecting one optimization over the other?

To answer these questions, programmers and architects can employ either complex or simple modeling techniques. Complex modeling techniques and full-system simulation [8, 42] can provide highly accurate performance estimates. Unfortunately, they often require low-level system details which are not available until late in the design cycle. In contrast, analytical models, simpler ones in particular, abstract away these low-level system details and provide key insights early in the design cycle that are useful for experts and non-experts alike [2, 5, 19, 27, 48, 54].

For an insightful model for hardware accelerators, this paper presents LogCA. LogCA derives its name from five key parameters (Table 1). These parameters characterize the communication latency (L) and overheads (o) of the accelerator interface, the granularity (g) of the offloaded data, the complexity (C) of the computation, and the accelerator's performance improvement (A) as compared to a general-purpose core.

LogCA is inspired by LogP [9], the well-known parallel computation model. LogP sought to find the right balance between overly simple models (e.g., PRAM) and the detailed reality of modern parallel systems. LogCA seeks to strike the same balance for hardware accelerators, providing sufficient simplicity such that programmers and architects can easily reason with it. Just as LogP was not the first model of parallel computation, LogCA is not the first model for hardware accelerators [28]. With LogCA, our goal is to develop a simple model that supports the important implications (§2) of our analysis and uses as few parameters as possible while providing sufficient accuracy. In Einstein's words, we want our model to be as simple as possible, and no simpler.

LogCA helps programmers and architects reason about an accelerator by abstracting the underlying architecture. It provides insights about the accelerator's interface by exposing the design bounds and bottlenecks, and suggests optimizations to alleviate these bottlenecks. The visually identifiable optimization regions help both experts and non-experts quantify the trade-offs in favoring one optimization over the other. While the general trend may not be surprising, we argue that LogCA is accurate enough to answer important what-if questions very early in the design cycle.

We validate our model across on-chip and off-chip accelerators for a diverse set of kernels, ranging from sub-linear to super-linear complexities. We also demonstrate the utility of our model using two retrospective case studies (§5). In the first case study, we consider the evolution of the interface in the cryptographic accelerators on Sun/Oracle's SPARC T-series processors. For the second case, we consider the memory interface design in three different GPU architectures: a discrete GPU, an integrated GPU, and a heterogeneous system architecture (HSA) [38] supported GPU. In both case studies, we show that the adopted design optimizations for these machines are similar to LogCA's suggested optimizations. We argue that architects and programmers can use insights from these retrospective studies for improving future designs.

This paper makes the following contributions:

• We propose a high-level visual performance model providing insights about the interface of hardware accelerators (§2).
• We formalize performance metrics for predicting the "right" amount of offloaded data (§2.2).
• Our model identifies the performance bounds and bottlenecks associated with an accelerator design (§3).
• We provide an answer to what-if questions for both programmers and architects at an early design stage (§3).
• We define various optimization regions and the potential gains associated with these regions (§3.2).
• We demonstrate the utility of our model on five different cryptographic accelerators and three different GPU architectures (§5).

2 THE LogCA MODEL

LogCA assumes an abstract system with three components (Figure 2 (a)): the Host is a general-purpose processor; the Accelerator is a hardware device designed for the efficient implementation of an algorithm; and the Interface connects the host and accelerator, abstracting away system details including the memory hierarchy.

Our model uses the interface abstraction to provide intuition for the overhead and latency of dispatching work to an accelerator. This abstraction enables modeling of different paradigms for attaching accelerators: directly connected, system bus, or PCIe. This also gives the flexibility to use our model for both on-chip and off-chip accelerators. This abstraction can also be trivially mapped to shared memory systems or other memory hierarchies in heterogeneous architectures. The model further abstracts the underlying architecture using the five parameters defined in Table 1.

Figure 2 (b) illustrates the overhead and latency model for an un-pipelined accelerator, where computation 'i' is returned before requesting computation 'i+1'. Figure 2 (b) also shows the breakdown of time for an algorithm on the host and accelerator.


Table 1: Description of the LogCA parameters

Parameter | Symbol | Description | Units
Latency | L | Cycles to move data from the host to the accelerator across the interface, including the cycles data spends in the caches or memory | Cycles
Overhead | o | Cycles the host spends in setting up the algorithm | Cycles
Granularity | g | Size of the offloaded data | Bytes
Computational Index | C | Cycles the host spends per byte of data | Cycles/Byte
Acceleration | A | The peak speedup of an accelerator | N/A

Figure 2: Top-level description of the LogCA model. (a) shows the various components (host, interface, and accelerator); (b) shows the time-line for the computation performed on the host system (above, T0(g) = C0(g)) and on an accelerator (below, T1(g) = o1(g) + L1(g) + C1(g), with C1(g) = C0(g)/A), along with the resulting gain.

We assume that the algorithm's execution time is a function of granularity, i.e., the size of the offloaded data. With this assumption, the un-accelerated time T0 (time with zero accelerators) to process data of granularity g will be T0(g) = C0(g), where C0(g) is the computation time on the host.

When the data is offloaded to an accelerator, the new execution time T1 (time with one accelerator) is T1(g) = O1(g) + L1(g) + C1(g), where O1(g) is the host overhead time in offloading 'g' bytes of data to the accelerator, L1(g) is the interface latency, and C1(g) is the computation time in the accelerator to process data of granularity g.

To make our model more concrete, we make several assumptions. We assume that an accelerator with acceleration 'A' can decrease, in the absence of overheads, the algorithm's computation time on the host by a factor of 'A', i.e., the accelerator and host use algorithms with the same complexity. Thus the computation time on the accelerator will be C1(g) = C0(g)/A. This reduction in the computation time results in performance gains, and we quantify these gains with speedup, the ratio of the un-accelerated and accelerated time:

Speedup(g) = T0(g) / T1(g) = C0(g) / (O1(g) + L1(g) + C1(g))    (1)

We assume that the computation time is a function of the computational index 'C' and granularity, i.e., C0(g) = C * f(g), where f(g) signifies the complexity of the algorithm. We also assume that f(g) is a power function of 'g', i.e., O(g^β). This assumption results in a simple closed-form model and bounds the performance for a majority of the prevalent algorithms in the high-performance computing community [4], ranging from sub-linear (β < 1) to super-linear (β > 1) complexities. However, this assumption may not work well for logarithmic-complexity algorithms, i.e., O(log(g)) or O(g log(g)), because asymptotically there is no power function which grows more slowly than a logarithmic function. Despite this limitation, we observe that, in the granularity range of our interest, LogCA can also bound the performance for logarithmic functions (§5).

For many algorithms and accelerators, the overhead is independent of the granularity, i.e., O1(g) = o. Latency, on the other hand, will often be granularity dependent, i.e., L1(g) = L * g. Latency may be granularity independent if the accelerator can begin operating when the first byte (or block) arrives at the accelerator, i.e., L1(g) = L. Thus LogCA can also model pipelined interfaces using the granularity independent latency assumption.

We define computational intensity (not to be confused with operational intensity [54], which signifies operations performed per byte of DRAM traffic) as the ratio of computational index to latency, i.e., C/L; it signifies the amount of work done on a host per byte of offloaded data. Similarly, we define the accelerator's computational intensity as the ratio of computational intensity to acceleration, i.e., (C/A)/L; it signifies the amount of work done on an accelerator per byte of offloaded data.

For simplicity, we begin with the assumption of granularity independent latency; we revisit granularity dependent latencies later (§2.3). With these assumptions,

Speedup(g) = C * f(g) / (o + L + C * f(g) / A) = C * g^β / (o + L + C * g^β / A)    (2)

The above equation shows that the speedup is dependent on the LogCA parameters, and these parameters can be changed by architects and programmers through algorithmic and design choices. An architect can reduce the latency by integrating an accelerator more closely with the host, for example, placing it on the processor die rather than on an I/O bus. An architect can also reduce the overheads by designing a simpler interface, i.e., limited OS intervention and address translations, lower initialization time, and reduced data copying between buffers (memories). A programmer can increase the computational index by increasing the amount of work per byte offloaded to an accelerator. For example, kernel fusion [47, 52], where multiple computational kernels are fused into one, tends to increase the computational index. Finally, an architect can typically increase the acceleration by investing more chip resources or power in an accelerator.

2.1 Effect of Granularity

A key aspect of LogCA is that it captures the effect of granularity on the accelerator's speedup. Figure 3 shows this behavior, i.e., speedup increases with granularity and is bounded by the acceleration 'A'. At one extreme, for large granularities, equation (2) becomes

lim g→∞ Speedup(g) = A    (3)

while for small granularities, equation (2) reduces to

lim g→0 Speedup(g) ≈ C / (o + L + C/A) < C / (o + L)    (4)

Figure 3: A graphical description of the performance metrics: speedup versus granularity, marking g1 (speedup of 1), g_{A/2} (speedup of A/2), and the bound A.

Equation (4) is simply Amdahl's Law [2] for accelerators, demonstrating the dominating effect of overheads at small granularities.
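To make equations (2)-(4) concrete, the following minimal Python sketch (ours, for illustration; the function and variable names are not from the original artifact) evaluates the speedup under the granularity independent latency assumption, using the parameters measured for AES on UltraSPARC T2 in Table 5:

```python
# A minimal sketch (ours) of equation (2): speedup under the
# granularity independent latency assumption. Parameter values are
# those measured for AES on UltraSPARC T2 (Table 5).

def logca_speedup(g, L, o, C, A, beta=1.0):
    """Speedup(g) = C*g^beta / (o + L + C*g^beta/A), equation (2)."""
    compute_host = C * g ** beta      # C0(g): host computation time
    t1 = o + L + compute_host / A     # overhead + latency + accelerated compute
    return compute_host / t1

# AES on UltraSPARC T2 (Table 5): L = 1500, o = 2.9e4, C = 90, A = 19
for g in [16, 1024, 64 * 1024, 32 * 1024 * 1024]:
    print(g, round(logca_speedup(g, L=1500, o=2.9e4, C=90, A=19, beta=1.01), 2))
```

At 16B the overheads dominate and the speedup stays well below 1, as equation (4) predicts; at 32MB the speedup approaches A = 19, as equation (3) predicts.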

2.2 Performance Metrics

To help programmers decide when and how much computation to offload, we define two performance metrics. These metrics are inspired by the vector machine metrics N_v and N_{1/2} [18], where N_v is the vector length needed to make vector mode faster than scalar mode and N_{1/2} is the vector length needed to achieve half of the peak performance. Since vector length is an important parameter in determining performance gains for vector machines, these metrics characterize the behavior and efficiency of vector machines with reference to scalar machines. Our metrics serve the same purpose in the accelerator domain.

g1: The granularity to achieve a speedup of 1 (Figure 3). It is the break-even point where the accelerator's performance becomes equal to the host's. Thus, it is the minimum granularity at which an accelerator starts providing benefits. Solving equation (2) for g1 gives

g1 = [ (A / (A - 1)) * ((o + L) / C) ]^(1/β)    (5)

IMPLICATION 1. g1 is essentially independent of acceleration for large values of 'A'.

For reducing g1, the above implication guides an architect to invest resources in improving the interface.

IMPLICATION 2. Doubling the computational index reduces g1 by 2^(-1/β).

The above implication demonstrates the effect of algorithmic complexity on g1 and shows that varying the computational index has a profound effect on g1 for sub-linear algorithms. For example, for a sub-linear algorithm with β = 0.5, doubling the computational index decreases g1 by a factor of four. However, for linear (β = 1) and quadratic (β = 2) algorithms, g1 decreases by factors of two and √2, respectively.

g_{A/2}: The granularity to achieve a speedup of half of the acceleration. This metric provides information about a system's behavior after the break-even point and shows how quickly the speedup can ramp towards acceleration. Solving equation (2) for g_{A/2} gives

g_{A/2} = [ A * ((o + L) / C) ]^(1/β)    (6)

Using equations (5) and (6), g1 and g_{A/2} are related as

g_{A/2} = (A - 1)^(1/β) * g1    (7)

IMPLICATION 3. Doubling acceleration 'A' increases the granularity to attain A/2 by 2^(1/β).

The above implication demonstrates the effect of acceleration on g_{A/2} and shows that this effect is more pronounced for sub-linear algorithms. For example, for a sub-linear algorithm with β = 0.5, doubling acceleration increases g_{A/2} by a factor of four. However, for linear and quadratic algorithms, g_{A/2} increases by factors of two and √2, respectively.

For architects, equation (7) also exposes an interesting design trade-off between acceleration and the performance metrics. Typically, an architect may prefer higher acceleration and lower g1 and g_{A/2}. However, equation (7) shows that increasing acceleration also increases g_{A/2}. This presents a dilemma for an architect: favor either higher acceleration or reduced granularity, especially for sub-linear algorithms. LogCA helps by exposing these trade-offs at an early design stage.

In our model, we also use g1 to determine the complexity of the system's interface. A lower g1 (on the left side of the plot in Figure 3) is desirable, as it implies a system with lower overheads and thus a simpler interface. Likewise, g1 increases with the complexity of the interface or when an accelerator moves further away from the host.
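Both metrics reduce to a few lines of code. The sketch below (ours, illustrative) computes g1 and g_{A/2} from equations (5) and (6) for the Table 5 parameters of AES on UltraSPARC T2, and cross-checks the relation in equation (7):

```python
# A sketch (ours) of equations (5)-(7), using the Table 5 values for
# AES on UltraSPARC T2; names are illustrative.

def g1(L, o, C, A, beta=1.0):
    """Equation (5): break-even granularity."""
    return ((A / (A - 1)) * ((o + L) / C)) ** (1 / beta)

def g_half(L, o, C, A, beta=1.0):
    """Equation (6): granularity for a speedup of A/2."""
    return (A * ((o + L) / C)) ** (1 / beta)

L, o, C, A, beta = 1500, 2.9e4, 90, 19, 1.01
print(g1(L, o, C, A, beta))                          # a few hundred bytes
print(g_half(L, o, C, A, beta))                      # a few kilobytes
print((A - 1) ** (1 / beta) * g1(L, o, C, A, beta))  # equation (7) cross-check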

2.3 Granularity dependent latency

The previous section assumed latency is granularity independent, but we have observed granularity dependent latencies in GPUs. In this section, we discuss the effect of granularity on speedup and derive the performance metrics assuming granularity dependent latency.

Assuming granularity dependent latency, equation (1) reduces to

Speedup(g) = C * g^β / (o + L*g + C * g^β / A)    (8)

For large granularities, equation (8) reduces to

lim g→∞ Speedup(g) = A / ( (A / (C * g^β)) * (L * g) + 1 ) < (C/L) * g^(β-1)    (9)

Unlike equation (3), the speedup in the above equation approaches (C/L) * g^(β-1) at large granularities. Thus, for linear algorithms with granularity dependent latency, instead of acceleration, the speedup is limited by C/L. However, for super-linear algorithms this limit increases by a factor of g^(β-1), whereas for sub-linear algorithms this limit decreases by a factor of g^(β-1).


IMPLICATION 4. With granularity dependent latency, the speedup for sub-linear algorithms asymptotically decreases with the increase in granularity.

The above implication suggests that, for sub-linear algorithms on systems with granularity dependent latency, speedup may decrease for some large granularities. This happens because, for large granularities, the communication latency (a linear function of granularity) may be higher than the computation time (a sub-linear function of granularity) on the accelerator, resulting in a net de-acceleration. This implication is surprising, as earlier we observed that, for systems with granularity independent latency, the speedup for all algorithms increases with granularity and approaches acceleration for very large granularities.
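This de-acceleration is easy to see numerically. The short sketch below (ours; the parameter values and β = 0.5 are assumed, illustrative choices) evaluates equation (8) over increasing granularities:

```python
# Sketch (ours) of equation (8) and Implication 4: with granularity
# dependent latency (L1(g) = L*g), sub-linear kernels eventually
# de-accelerate. beta = 0.5 and all values here are illustrative.

def speedup_dep(g, L, o, C, A, beta):
    """Equation (8): Speedup(g) = C*g^beta / (o + L*g + C*g^beta/A)."""
    return (C * g ** beta) / (o + L * g + (C * g ** beta) / A)

L, o, C, A, beta = 2, 1e4, 500, 10, 0.5
for g in [64, 1024, 4096, 1 << 20]:
    print(g, round(speedup_dep(g, L, o, C, A, beta), 3))
# The speedup rises, peaks, then falls once the L*g term (linear in g)
# outgrows the sub-linear C*g^beta computation.
```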

For very small granularities, equation (8) reduces to

lim g→0 Speedup(g) ≈ A * C / (A * (o + L) + C)    (10)

Similar to equation (4), the above equation exposes the increasing effect of overheads at small granularities. Solving equation (8) for g1 using Newton's method [53] gives

g1 = [ C * (β - 1) * (A - 1) + A*o ] / [ C * β * (A - 1) - A*L ]    (11)

For a positive value of g1, equation (11) must satisfy C/L > 1/β. Thus, to achieve any speedup for linear algorithms, C/L should be at least 1. However, for super-linear algorithms a speedup of 1 can be achieved at values of C/L smaller than 1, whereas for sub-linear algorithms C/L must be greater than 1.

IMPLICATION 5. With granularity dependent latency, the computational intensity for sub-linear algorithms should be greater than 1 to achieve any gains.

Thus, for sub-linear algorithms, the computational index has to be greater than the latency to justify offloading the work. However, for higher-complexity algorithms, the computational index can be quite small and the kernel still be potentially useful to offload.

Similarly, solving equation (8) using Newton's method for g_{A/2} gives

g_{A/2} = [ C * (β - 1) + A*o ] / [ C * β - A*L ]    (12)

For a positive value of g_{A/2}, equation (12) must satisfy (C/A)/L > 1/β. Thus, to achieve a speedup of A/2, C/L should be at least 'A' for linear algorithms. However, for super-linear algorithms a speedup of A/2 can be achieved at values of C/L smaller than 'A', whereas for sub-linear algorithms C/L must be greater than 'A'.

IMPLICATION 6. With granularity dependent latency, the accelerator's computational intensity for sub-linear algorithms should be greater than 1 to achieve a speedup of half of the acceleration.

The above implication suggests that, for achieving half of the acceleration with sub-linear algorithms, the computation time on the accelerator must be greater than the latency. However, for super-linear algorithms, that speedup can be achieved even if the computation time on the accelerator is lower than the latency. Programmers can use the above implications to determine, early in the design cycle, whether to put time and effort into porting a code to an accelerator.
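Since equations (11) and (12) come from solving equation (8) with Newton's method [53], the same numbers can be reproduced with a few lines of root finding. The sketch below (ours; assumed parameter values, a numerical derivative) cross-checks the closed form of equation (11) for the linear case:

```python
# Sketch (ours): solve Speedup(g) = 1 via Newton's method [53] and
# cross-check against equation (11) for beta = 1. All parameter
# values are illustrative assumptions.

def t_diff(g, L, o, C, A, beta):
    # T1(g) - T0(g); its positive root is the break-even granularity g1
    return (o + L * g + (C * g ** beta) / A) - C * g ** beta

def newton_g1(L, o, C, A, beta, g=1.0, iters=50):
    for _ in range(iters):
        f = t_diff(g, L, o, C, A, beta)
        h = max(g * 1e-6, 1e-9)                  # numerical derivative step
        df = (t_diff(g + h, L, o, C, A, beta) - f) / h
        g = g - f / df
    return g

L, o, C, A, beta = 2, 1e4, 500, 10, 1.0          # satisfies C/L > 1/beta
print(newton_g1(L, o, C, A, beta))
# Closed form, equation (11):
print((C * (beta - 1) * (A - 1) + A * o) / (C * beta * (A - 1) - A * L))
```

Both lines print the same break-even granularity, as expected for the linear case, where T1(g) - T0(g) is itself linear in g.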

Figure 4: LogCA helps in visually identifying (a) compute-bound and (b) latency-bound kernels. The plots show speedup versus granularity (bytes), marking g1, the acceleration A, and the computational intensity C/L; for compute-bound kernels lim g→∞ Speedup(g) = A, while for latency-bound kernels the speedup is limited by C/L.

For example, consider a system with a minimum desirable speedup of one half of the acceleration, but with a computational intensity less than the acceleration. With the above implication, architects and programmers can infer, early in the design stage, that the desired speedup cannot be achieved for sub-linear and linear algorithms. However, the desired speedup can be achieved with super-linear algorithms.

We are also interested in quantifying the limits on achievable speedup due to overheads and latencies. To do this, we assume a hypothetical accelerator with infinite acceleration and calculate the granularity (gA) to achieve the peak speedup of 'A'. With this assumption, the desired speedup of 'A' is only limited by the overheads and latencies. Solving equation (8) for gA gives

gA = [ C * (β - 1) + A*o ] / [ C * β - A*L ]    (13)

Surprisingly, we find that the above equation is similar to equation (12), i.e., gA equals g_{A/2}. This observation shows that, with a hypothetical accelerator, the peak speedup can now be achieved at the same granularity as g_{A/2}. This observation also demonstrates that if g_{A/2} is not achievable on a system, i.e., (C/A)/L < 1/β as per equation (12), then despite increasing the acceleration, gA will not be achievable, and the speedup will still be bounded by the computational intensity.

IMPLICATION 7. If a speedup of A/2 is not achievable on an accelerator with acceleration 'A', then despite increasing the acceleration to Ã (where Ã > A), the speedup is bounded by the computational intensity.

The above implication helps architects in allocating more resources for an efficient interface instead of increasing acceleration.


Figure 5: The effect on speedup of a 10x improvement in each LogCA parameter: (a) latency, (b) overheads, (c) computational index, and (d) acceleration. The base case is the speedup of AES [30] on UltraSPARC T2; varying latency shows no variation in the speedup.

3 APPLICATIONS OF LogCA

In this section, we describe the utility of LogCA for visually identifying the performance bounds, design bottlenecks, and possible optimizations to alleviate these bottlenecks.

3.1 Performance Bounds

Earlier, we observed that the speedup is bounded by either acceleration (equation (3)) or the product of computational intensity and g^(β-1) (equation (9)). Using these observations, we classify kernels either as compute-bound or latency-bound. For compute-bound kernels, the achievable speedup is bounded by acceleration, whereas for latency-bound kernels the speedup is bounded by computational intensity. Based on this classification, a compute-bound kernel can either be running on a system with granularity independent latency or have super-linear complexity while running on a system with granularity dependent latency. Figure 4-a illustrates these bounds for compute-bound kernels. On the other hand, a latency-bound kernel is running on a system with granularity dependent latency and has either linear or sub-linear complexity. Figure 4-b illustrates these bounds for latency-bound kernels.

Programmers and architects can visually identify these bounds and use this information to invest their time and resources in the right direction. For example, for compute-bound kernels, depending on the operating granularity, it may be beneficial to invest more resources in either increasing acceleration or reducing overheads. However, for latency-bound kernels, optimizing acceleration and overheads is not that critical; decreasing latency and increasing the computational index may be more beneficial.
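This classification follows mechanically from β and the latency behavior of the interface. A minimal sketch (ours; the β values are those from Table 4, and kernels with β within noise of 1 behave like the linear, latency-bound case):

```python
# A minimal sketch (ours) of the Section 3.1 classification.

def classify(beta, latency_granularity_dependent):
    if not latency_granularity_dependent or beta > 1:
        return "compute-bound: speedup limited by acceleration A"
    return "latency-bound: speedup limited by (C/L) * g^(beta - 1)"

print(classify(1.7, True))    # GEMM on a discrete GPU: compute-bound
print(classify(0.94, True))   # radix sort on a discrete GPU: latency-bound
print(classify(1.01, False))  # AES on an on-chip crypto unit: compute-bound
```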

3.2 Sensitivity Analysis

To identify the design bottlenecks, we perform a sensitivity analysis of the LogCA parameters. We consider a parameter a design bottleneck if a 10x improvement in it provides at least a 20% improvement in speedup. A 'bottlenecked' parameter also provides an optimization opportunity. To visually identify these bottlenecks, we introduce optimization regions. As an example, we identify the design bottlenecks in UltraSPARC T2's crypto accelerator by varying its individual parameters in Figure 5 (a)-(d); we elaborate our methodology for measuring the LogCA parameters later (§4).

Figure 5 (a) shows the variation (or lack thereof) in speedup with the decrease in latency. The resulting gains are negligible and independent of the granularity, as it is a closely coupled accelerator.

Figure 5 (b) shows the resulting speedup after reducing overheads. Since the overheads are a one-time initialization cost and independent of granularity, the per-byte setup cost is high at small granularities. Decreasing these overheads considerably reduces the per-byte setup cost and results in significant gains at these smaller granularities. Conversely, for larger granularities the per-byte setup cost is already amortized, so reducing overheads does not provide much gain. Thus, overhead is a bottleneck at small granularities and provides an opportunity for optimization.

Figure 5 (c) shows the effect of increasing the computational index. The results are similar to optimizing overheads in Figure 5 (b), i.e., significant gains for small granularities and a gradual decrease in the gains with increasing granularity. With constant overheads, increasing the computational index increases the computation time of the kernel and decreases the per-byte setup cost. For smaller granularities, the reduced per-byte setup cost results in significant gains.

Figure 5 (d) shows the variation in speedup with increasing peak acceleration. The gains are negligible at small granularities and become significant for large granularities. As mentioned earlier, the per-byte setup cost is high at small granularities, and it reduces for large granularities. Since increasing peak acceleration does not reduce the per-byte setup cost, optimizing peak acceleration provides gains only at large granularities.

We group these individual sensitivity plots in Figure 6 to build the optimization regions. As mentioned earlier, each region indicates the potential for 20% gains with a 10x variation of one or more LogCA parameters. For ease of understanding, we color these regions and label them with their respective LogCA parameters. For example, the blue colored region labelled 'oC' (16B to 2KB) indicates an optimization region where optimizing overheads and computational index is beneficial. Similarly, the red colored region labelled 'A' (32KB to 32MB) represents an optimization region where optimizing peak acceleration alone is beneficial. The granularity range occupied by a parameter also identifies the scope of optimization for an architect and a programmer. For example, for UltraSPARC T2, overheads occupy most of the lower granularities, suggesting an opportunity for improving the interface. Similarly, the absence of the latency parameter suggests little benefit from optimizing latency.
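The optimization regions can be generated programmatically: at each granularity, improve one parameter by 10x and mark it if the speedup improves by at least 20%. A sketch (ours) under the granularity independent latency assumption, reusing logca_speedup() from the earlier sketch and the Table 5 parameters for AES on UltraSPARC T2:

```python
# Sketch (ours) that rebuilds the optimization regions of Figure 6:
# a parameter is marked at a granularity if a 10x change in it
# improves speedup by at least 20%.

base = dict(L=1500, o=2.9e4, C=90, A=19, beta=1.01)
factors = {"L": 0.1, "o": 0.1, "C": 10, "A": 10}   # 10x better, per parameter

g = 16
while g <= 32 * 2 ** 20:                           # 16B ... 32MB
    s0 = logca_speedup(g, **base)
    region = "".join(
        name for name, f in factors.items()
        if logca_speedup(g, **{**base, name: base[name] * f}) >= 1.2 * s0)
    print(f"{g:>9} B: {region or 'none'}")
    g *= 8
```

For these parameters, the sweep prints 'oC' at 16B and only 'A' at the largest granularities, mirroring Figure 6.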

We also add horizontal arrows to the optimization regions in Figure 6 to demarcate the start and end of the granularity range for each parameter. For example, optimizing acceleration starts providing benefits from 2KB, while optimizing overheads or computational index is beneficial up till 32KB. These arrows also indicate the cut-off granularity for each parameter. These cut-off granularities provide insights to architects and programmers about the design bottlenecks. For example, the high cut-off granularity of 32KB suggests high overheads and thus a potential for optimization.

Figure 6: Optimization regions for UltraSPARC T2. The presence of a parameter in an optimization region indicates that it can provide at least 20% gains. The horizontal arrow indicates the cut-off granularity at which a parameter provides 20% gains.

Table 2: Description of the cryptographic accelerators

Crypto Accelerator | PCIe Crypto | UltraSPARC T2 | SPARC T3 | SPARC T4 | Sandy Bridge
Processor | AMD A8-3850 | S2 | S2 | S3 | Intel Core i7-2600
Frequency | 2.9 GHz | 1.16 GHz | 1.65 GHz | 3 GHz | 3.4 GHz
OpenSSL version | 0.9.8o | 0.9.8o | 0.9.8o | 1.0.2, 1.0.1k | 0.9.8o
Kernel | Ubuntu 3.13.0-55 | Oracle Solaris 11 | Oracle Solaris 11 | Oracle Solaris 11.2 | Linux 2.6.32-504

4 EXPERIMENTAL METHODOLOGY

This section describes the experimental setup and benchmarks for validating LogCA on real machines. We also discuss our methodology for measuring the LogCA parameters and performance metrics.

Our experimental setup comprises on-chip and off-chip crypto accelerators (Table 2) and three different GPUs (Table 3). The on-chip crypto accelerators include the cryptographic units on Sun/Oracle UltraSPARC T2 [40], SPARC T3 [35], and SPARC T4 [41], and AES-NI (AES New Instruction) [15] on Sandy Bridge, whereas the off-chip accelerator is a Hifn 7955 chip connected through the PCIe bus [43]. The GPUs include a discrete NVIDIA GPU, an integrated AMD GPU (APU), and an HSA supported integrated GPU.

For the on-chip crypto accelerators, each core in UltraSPARC T2 and SPARC T3 has a physically addressed crypto unit which requires privileged DMA calls. However, the crypto unit on SPARC T4 is integrated within the pipeline and does not require privileged DMA calls. SPARC T4 also provides non-privileged crypto instructions to access the crypto unit. Similar to SPARC T4, Sandy Bridge provides a non-privileged crypto instruction, AES-NI.

Considering the GPUs, the discrete GPU is connected through the PCIe bus, whereas for the APU the GPU is co-located with the host processor on the same die. For the APU, the system memory is partitioned between host and GPU memory. This eliminates the PCIe bottleneck of data copying, but it still requires copying data between memories. Unlike the discrete GPU and APU, the HSA supported GPU provides a unified and coherent view of the system memory. With the host and GPU sharing the same virtual address space, explicit copying of data between memories is not required.

Our workloads consist of encryption, hashing, and GPU kernels. For encryption and hashing, we have used the Advanced Encryption Standard (AES) [30] and the Secure Hash Algorithm (SHA) [31], respectively, from OpenSSL [34], an open-source cryptography library. For GPU kernels, we use matrix multiplication, radix sort, FFT, and binary search from the AMD OpenCL SDK [1]. In Table 4 we list the complexities of each kernel, both in terms of the number of elements n and the granularity g. We expect these complexities to remain the same in both cases, but we observe that they differ for matrix multiplication. For example, for a square matrix of size n, matrix multiplication has complexity O(n^3), whereas the complexity in terms of granularity is O(g^1.7). This happens because for matrix multiplication, unlike the others, computations are performed on matrices and not vectors, so offloading a square matrix of size n corresponds to offloading n^2 elements, which results in the apparent discrepancy in the complexities. We also observe that, for the granularity range of 16B to 32MB, β = 0.11 provides a close approximation for log(g).
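The closeness of that approximation can be checked numerically. The toy sketch below (ours) fits C*g^β to a purely log(g)-shaped runtime in log-log space, as our regression does, and recovers a slope close to 0.11 over this range:

```python
# A toy check (ours): fitting a power law to a log(g)-shaped runtime
# in log-log space over 16B-32MB yields a slope close to the
# beta = 0.11 quoted above. The "runtime" here is synthetic.
import math

ks = range(4, 26)                                  # 2^4 = 16B ... 2^25 = 32MB
xs = [k * math.log(2) for k in ks]                 # ln g
ys = [math.log(math.log(2 ** k)) for k in ks]      # ln of a log(g)-shaped time

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs)
print(round(beta, 3))                              # ~0.115, near 0.11
```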

Table 3: Description of the GPUs

Platform | Discrete GPU | Integrated APU | AMD HSA
Name | Tesla C2070 | Radeon HD 6550 | Radeon R7
Architecture | Fermi | Beaver Creek | Kaveri
Cores | 16 | 5 | 8
Compute Units | 448 | 400 | 512
Clock Freq. | 1.5 GHz | 600 MHz | 720 MHz
Peak FLOPS | 1 T | 480 G | 856 G
Host Processor | Intel Xeon E5520 | AMD A8-3850 | AMD A10-7850K
Frequency (GHz) | 2.27 | 2.9 | 1.7

For calculating execution times, we have used Linux utilities on the crypto accelerators, whereas for the GPUs we have used the NVIDIA and AMD OpenCL profilers to compute the setup, kernel, and data transfer times, and we report the average of one hundred executions. For verifying the usage of the crypto accelerators, we use built-in counters in UltraSPARC T2 and T3 [46]. SPARC T4, however, no longer supports these counters, so we use Linux utilities to trace the execution of the crypto instructions [3]. We use these execution times to determine the LogCA parameters. We calculate these parameters once, and they can later be used for different kernels on the same system.

Table 4: Algorithmic complexity of various kernels with number of elements and granularity. The power of g represents β for each kernel.

Kernel | Algorithmic Complexity
Advanced Encryption Standard (AES) | O(n), O(g^1.01)
Secure Hashing Algorithm (SHA) | O(n), O(g^0.97)
Matrix Multiplication (GEMM) | O(n^3), O(g^1.7)
Fast Fourier Transform (FFT) | O(n log n), O(g^1.2)
Radix Sort | O(kn), O(g^0.94)
Binary Search | O(log n), O(g^0.14)

Table 5: Calculated values of the LogCA parameters

Device | Benchmark | L (cycles) | o (cycles) | C (cycles/B) | A
Discrete GPU | AES / Radix Sort / GEMM / FFT / Binary Search | 3×10^3 | 2×10^8 | 174 / 290 / 2 / 290 / 116 | 30
APU | AES / Radix Sort / GEMM / FFT / Binary Search | 15 | 4×10^8 | 174 / 290 / 2 / 290 / 116 | 7
UltraSPARC T2 | AES | 1500 | 2.9×10^4 | 90 | 19
UltraSPARC T2 | SHA | 1500 | 1.05×10^3 | 72 | 12
SPARC T3 | AES | 1500 | 2.7×10^4 | 90 | 12
SPARC T3 | SHA | 1500 | 1.05×10^3 | 72 | 10
SPARC T4 engine | AES | 500 | 435 | 32 | 12
SPARC T4 engine | SHA | 500 | 1.6×10^3 | 32 | 10
SPARC T4 instr. | AES | 4 | 111 | 32 | 12
SPARC T4 instr. | SHA | 4 | 1638 | 32 | 10
Sandy Bridge | AES | 3 | 10 | 35 | 6

For the computational index and β, we profile the CPU code on the host by varying the granularity from 16B to 32MB. At each granularity, we measure the execution time and use regression analysis to determine C and β. For overheads, we use the observation that for very small granularities the execution time for a kernel on an accelerator is dominated by the overheads, i.e., lim g→0 T1(g) ≈ o. For acceleration, we use different methods for the on-chip accelerators and the GPUs. For the on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches acceleration for very large granularities. However, for the GPUs we do not use equation (3), as it would require computing acceleration for each kernel, and each application has a different access pattern which affects the speedup. Instead, we bound the maximum performance using the peak flops from the device specifications and use the ratio of peak GFLOPS, i.e., A = Peak GFLOPS_GPU / Peak GFLOPS_CPU. Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and accelerators. On the other hand, for the GPUs we compute latency using the peak memory bandwidth of the GPU. Similar to Meswani et al. [29], we use the following equation for measuring the data copying time for the GPUs: L = 1 / BW_peak.
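The regression step can be sketched as an ordinary least-squares fit in log-log space, since T0(g) = C*g^β implies ln T0 = β*ln g + ln C. In the sketch below (ours), the profiled times are synthetic stand-ins, generated rather than measured:

```python
# Sketch (ours) of the regression above: recover C and beta from host
# execution times via ln T = beta*ln g + ln C. The "measured" times
# are synthetic (C = 90 cycles/B, beta = 1.01, plus noise).
import math, random

random.seed(0)
gs = [2 ** k for k in range(4, 26)]                 # 16B ... 32MB
times = [90 * g ** 1.01 * random.uniform(0.95, 1.05) for g in gs]

xs = [math.log(g) for g in gs]
ys = [math.log(t) for t in times]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
    (x - xbar) ** 2 for x in xs)
C = math.exp(ybar - beta * xbar)
print(beta, C)     # recovers roughly beta = 1.01 and C = 90
```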

Earlier, we developed our model using the assumptions of granularity independent and granularity dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity independent latency, while the off-chip crypto accelerator and the discrete GPU/APU represent granularity dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).

5 EVALUATION

In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We also list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we also present two case studies: the evolution of the interface in Sun/Oracle's crypto accelerators and three different GPU architectures. In both cases, we elaborate the design changes using the insights LogCA provides.

5.1 Linear-Complexity Kernels (β = 1)

Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through different interfaces, ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by computational intensity for the off-chip accelerators and by acceleration for the on-chip accelerators. This observation supports the earlier implications on the limits of speedup for granularity independent and dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but it breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with powerful processing cores, has the highest acceleration among the others. However, its observed speedup is less than the others due to the high overheads and latencies involved in communicating through the PCIe bus.

We have also marked g1 and g_{A/2} for each accelerator in Figure 7, which helps programmers and architects identify the complexity of the interface. For example, g1 for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators g1 lies on the far right. It is worth mentioning that we have marked g_{A/2} for the on-chip accelerators but not for the off-chip accelerators: for the off-chip accelerators, the computational intensity is less than the acceleration, and as we have noted in equation (12), g_{A/2} does not exist for these designs.

We also observe that g1 for the crypto card connected through the PCIe bus does not exist, showing that this accelerator does not break even, even at large granularities. Figure 7 also shows that g1 for the GPU and the APU is comparable. This observation shows that, despite being an integrated GPU that is not connected to the PCIe bus, the APU spends considerable time copying data from the host to the device memory.

[Figure 7: Speedup curve-fitting plots comparing LogCA with the observed values of AES [30], over granularities of 16B–32MB: (a) PCIe crypto accelerator, (b) NVIDIA discrete GPU, (c) AMD integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, (h) AESNI on Sandy Bridge. Each panel marks A, C/L, g1, and g_{A/2} where they exist; for the SPARC T4 instruction g1 < 16B, and for AESNI both g1 and g_{A/2} < 16B.]

[Figure 8: Speedup curve-fitting plots comparing LogCA with the observed values of SHA256 [31]: (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, (d) SPARC T4 instruction (g1 < 16B). LogCA starts following the observed values after 64B.]


Figure 8 shows the curve fitting for SHA on the various on-chip crypto accelerators. We observe that g1 and g_{A/2} do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values; however, it deviates from them below 64B. This happens because SHA requires a 64B block size for hash computation: if the input is smaller than 64B, it is padded with extra bits to reach a 64B block. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.
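A simple way to account for this effect, which we note only as an illustration and which is not part of the LogCA model itself, is to feed LogCA the effective granularity the hash unit actually processes:

def sha_effective_granularity(g, block_size=64):
    # SHA pads inputs up to a multiple of its 64B block, so the
    # accelerator effectively computes on ceil(g / 64) * 64 bytes.
    return ((g + block_size - 1) // block_size) * block_size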

Figure 9-a shows the speedup curve-fitting plots for radix sort. We observe that LogCA does not follow the observed values at smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g1 and g_{A/2}. We also observe that g_{A/2} for the GPU is higher than for the APU, an observation that supports equation (7): increasing acceleration increases g_{A/2}.

5.2 Super-Linear Complexity Kernels (β > 1)

Figures 9-b and 9-c show the speedup curve-fitting plots for super-linear-complexity kernels on the discrete GPU and the APU. We observe that matrix multiplication, with higher complexity (O(g^1.7)), achieves higher speedup than sort and FFT, with their lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms is greater than that of lower-complexity algorithms. We also observe that g_{A/2} does not exist for FFT. This happens because, as we note in equation (12), for g_{A/2} to exist for FFT, C/L should be greater than A/1.2; however, Figure 9-c shows that C/L is smaller than A/1.2 for both the GPU and the APU.
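To spell out the reasoning, the existence condition comes from requiring a positive denominator in equation (12); in LaTeX, with FFT's β = 1.2:

% g_{A/2} from eq. (12), under granularity-dependent latency:
g_{A/2} = \frac{C(\beta - 1) + A\,o}{C\beta - A\,L}
% requires a positive denominator:
C\beta - A L > 0 \;\Longleftrightarrow\; \frac{C}{L} > \frac{A}{\beta} = \frac{A}{1.2}

Since Figure 9-c shows C/L below A/1.2 on both platforms, no granularity can reach a speedup of A/2.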

5.3 Sub-Linear Complexity Kernels (β < 1)

Figure 9-d shows the curve fitting for binary search, a sub-linear algorithm (β = 0.14). We make three observations. First, g1 does not exist, even at very large granularities, and C/L < 1. This observation supports implication (5) that, for a sub-linear algorithm with β = 0.14, C/L should be greater than 7 to provide any speedup.

[Figure 9: Speedup curve-fitting plots comparing LogCA with the observed values of (a) Radix Sort, (b) Matrix Multiplication, (c) FFT, and (d) Binary Search, each on the discrete GPU and the APU. Panels mark A, C/L, g1, and g_{A/2} where they exist.]

Second, at large granularities the speedup starts decreasing as granularity increases. This observation supports our earlier claim in implication (4) that, for systems with granularity-dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches: as mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search's behavior, it still provides an upper bound on the achievable performance.
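Both quantitative claims can be read off equations (11) and (9) directly; as a worked check with β = 0.14:

% Break-even requires a positive g_1 in eq. (11), i.e. C/L > 1/beta:
\frac{C}{L} > \frac{1}{\beta} = \frac{1}{0.14} \approx 7.1
% and eq. (9) bounds the asymptotic speedup by
\lim_{g\to\infty} \mathrm{Speedup}(g) < \frac{C}{L}\, g^{\beta-1} = \frac{C}{L}\, g^{-0.86}

The first inequality explains why g1 never appears (the measured C/L is below 1), and the negative exponent in the second explains the downward slope at large granularities.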

5.4 Case Studies

Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara designs (Figure 10 (a)) to the accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent from the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, representing the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand, from Figure 10 (a) to (e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high-overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads, at 32KB, suggests a complex interface, indicating that high-overhead OS calls create a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, which is also evident from a smaller g1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the widest range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA exposes is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.
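The cut-off granularities quoted in these case studies follow from the sensitivity criterion of §3.2: a parameter is a bottleneck at granularity g if improving it by 10x yields at least 20% additional speedup. A minimal sketch of that test, reusing the speedup() helper sketched earlier with params = dict(C=..., beta=..., o=..., L=..., A=...); the names are ours:

def is_bottleneck(params, name, g, factor=10.0, gain=1.2):
    # Overheads and latency improve by dividing; C and A by multiplying.
    better = dict(params)
    if name in ("o", "L"):
        better[name] = params[name] / factor
    else:  # "C" or "A"
        better[name] = params[name] * factor
    return speedup(g, **better) >= gain * speedup(g, **params)

Sweeping g and each of the four parameters through this test reproduces optimization regions and cut-off granularities of the kind shown in Figures 10 and 11.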

Figure 11 shows the evolution of the memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an AMD integrated GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication is compute-bound on all three architectures (§3.1).

[Figure 10: LogCA optimization regions for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, (e) SPARC T4 instruction. LogCA identifies the design bottlenecks through its parameters in each optimization region; the bottleneck LogCA exposes in each design is optimized in the next design. Curves show LogCA with L/10, o/10, C×10, and A×10.]

[Figure 11: Optimization regions for matrix multiplication over a range of granularities on (a) an NVIDIA discrete GPU, (b) an AMD APU, and (c) an HSA-supported AMD integrated GPU. Curves show LogCA with L/10, o/10, C×10, and A×10.]

We also observe that the computational index occupies most of the regions, which signifies the maximum optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus the maximum optimization potential. The high cut-off granularity for overheads, at 32KB, indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)), but with a few notable differences compared to the discrete GPU. The cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent from all regions, and the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface compared to the discrete GPU and the APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine future trends in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture-specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs; they later extend their model into an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) relative to GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in its complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes the bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture-specific, for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights, early in the design stage, to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility through retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES

[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11), 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the International Symposium on High-Performance Computer Architecture, 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report, Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (Jul. 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37, 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38, 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (Jul. 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (Jul. 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient SPARC SOC. In IEEE Asian Solid-State Circuits Conference (A-SSCC '07), 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. SPARC T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. PhD Dissertation, Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture, 382-393. https://doi.org/10.1109/HPCA.2011.5749745


LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

Table 1 Description of the LogCA parameters

Parameter Symbol Description Units

Latency L Cycles to move data from the host to the accelerator across the interface including the cycles dataspends in the caches or memory

Cycles

Overhead o Cycles the host spends in setting up the algorithm Cycles

Granularity g Size of the offloaded data Bytes

Computational Index C Cycles the host spends per byte of data CyclesByte

Acceleration A The peak speedup of an accelerator NA

Host Accelerator

Interface

time

Co(g)

o1(g) L1(g)C1(g) =

Co(g)A

Gain

T0(g)

T1(g)

(a) (b)

Figure 2 Top level description of the LogCA model (a) Showsthe various components (b) Time-line for the computation per-formed on the host system (above) and on an accelerator (be-low)

the algorithmrsquos execution time is a function of granularity ie thesize of the offloaded data With this assumption the unacceleratedtime T0 (time with zero accelerators) to process data of granularityg will be T0 (g) =C0 (g) where C0 (g) is the computation time on thehost

When the data is offloaded to an accelerator the new executiontime T1 (time with one accelerator) is T1 (g) =O1 (g)+L1 (g)+C1 (g)where O1 (g) is the host overhead time in offloading lsquogrsquo bytes ofdata to the accelerator L1 (g) is the interface latency and C1 (g) is thecomputation time in the accelerator to process data of granularity g

To make our model more concrete we make several assumptionsWe assume that an accelerator with acceleration lsquoArsquo can decreasein the absence of overheads the algorithmrsquos computation time onthe host by a factor of lsquoArsquo ie the accelerator and host use algo-rithms with the same complexity Thus the computation time on theaccelerator will be C1 (g) =

C0 (g)A This reduction in the computation

time results in performance gains and we quantify these gains withspeedup the ratio of the un-accelerated and accelerated time

Speedup(g) =T0 (g)T1 (g)

=C0 (g)

O1 (g)+L1 (g)+C1 (g)(1)

We assume that the computation time is a function of the com-putational index lsquoCrsquo and granularity ie C0 (g) =C lowast f (g) wheref (g) signifies the complexity of the algorithm We also assume thatf (g) is power function of rsquogrsquo ie O (gβ ) This assumption resultsin a simple closed-form model and bounds the performance for amajority of the prevalent algorithms in the high-performance comput-ing community [4] ranging from sub-linear (β lt 1) to super-linear(β gt 1) complexities However this assumption may not work wellfor logarithmic complexity algorithms ie O (log(g))O (g log(g))This is because asymptotically there is no function which grows

slower than a logarithmic function Despite this limitation we ob-serve thatmdashin the granularity range of our interestmdashLogCA can alsobound the performance for logarithmic functions (sect5)

For many algorithms and accelerators the overhead is indepen-dent of the granularity ie O1 (g) = o Latency on the other handwill often be granularity dependent ie L1 (g) = Llowastg Latency maybe granularity independent if the accelerator can begin operatingwhen the first byte (or block) arrives at the accelerator ie L1 (g) = LThus LogCA can also model pipelined interfaces using granularityindependent latency assumption

We define computational intensity1 as the ratio of computationalindex to latency ie C

L and it signifies the amount of work done ona host per byte of offloaded data Similarly we define acceleratorrsquoscomputational intensity as the ratio of computational intensity toacceleration ie CA

L and it signifies the amount of work done onan accelerator per byte of offloaded data

For simplicity we begin with the assumption of granularity in-dependent latency We revisit granularity dependent latencies later(sect 23) With these assumptions

Speedup(g) =C lowast f (g)

o+L+ Clowast f (g)A

=C lowastgβ

o+L+ Clowastgβ

A

(2)

The above equation shows that the speedup is dependent on LogCAparameters and these parameters can be changed by architects andprogrammers through algorithmic and design choices An architectcan reduce the latency by integrating an accelerator more closelywith the host For example placing it on the processor die ratherthan on an IO bus An architect can also reduce the overheads bydesigning a simpler interface ie limited OS intervention and ad-dress translations lower initialization time and reduced data copyingbetween buffers (memories) etc A programmer can increase thecomputational index by increasing the amount of work per byteoffloaded to an accelerator For example kernel fusion [47 52]mdashwhere multiple computational kernels are fused into onemdashtends toincrease the computational index Finally an architect can typicallyincrease the acceleration by investing more chip resources or powerto an accelerator

21 Effect of GranularityA key aspect of LogCA is that it captures the effect of granularity onthe acceleratorrsquos speedup Figure 3 shows this behavior ie speedupincreases with granularity and is bounded by the acceleration lsquoArsquo At

1not to be confused with operational intensity [54] which signifies operations performedper byte of DRAM traffic

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

g1 gA2

1

A2

A

Granularity (Bytes)

Sp

eed

up

(g)

Figure 3 A graphical description of the performance metrics

one extreme for large granularities equation (2) becomes

limgrarrinfin

Speedup(g) = A (3)

While for small granularities equation (2) reduces to

limgrarr0

Speedup(g) ≃ Co+L+ C

Alt

Co+L

(4)

Equation (4) is simply Amdahlrsquos Law [2] for accelerators demon-strating the dominating effect of overheads at small granularities

22 Performance MetricsTo help programmers decide when and how much computation tooffload we define two performance metrics These metrics are in-spired by the vector machine metrics Nv and N12[18] where Nvis the vector length to make vector mode faster than scalar modeand N12 is the vector length to achieve half of the peak perfor-mance Since vector length is an important parameter in determiningperformance gains for vector machines these metrics characterizethe behavior and efficiency of vector machines with reference toscalar machines Our metrics tend to serve the same purpose in theaccelerator domain

g1 The granularity to achieve a speedup of 1 (Figure 3) It isthe break-even point where the acceleratorrsquos performance becomesequal to the host Thus it is the minimum granularity at which anaccelerator starts providing benefits Solving equation (2) for g1gives

g1 =

[(A

Aminus1

)lowast(

o+LC

)] 1β

(5)

IMPLICATION 1 g1 is essentially independent of accelerationfor large values of lsquoArsquo

For reducing g1 the above implication guides an architect toinvest resources in improving the interface

IMPLICATION 2 Doubling computational index reduces g1 by

2minus1β

The above implication demonstrates the effect of algorithmiccomplexity on g1 and shows that varying computational index has aprofound effect on g1 for sub-linear algorithms For example for asub-linear algorithm with β = 05 doubling the computational indexdecreases g1 by a factor of four However for linear (β = 1) andquadratic (β = 2) algorithms g1 decreases by factors of two and

radic2

respectively

g A2 The granularity to achieve a speedup of half of the acceler-

ation This metric provides information about a systemrsquos behaviorafter the break-even point and shows how quickly the speedup canramp towards acceleration Solving equation (2) for g A

2gives

g A2=

[Alowast(

o+LC

)] 1β

(6)

Using equation (5) and (6) g1 and g A2are related as

g A2= (Aminus1)

1β lowastg1 (7)

IMPLICATION 3 Doubling acceleration lsquoArsquo increases the gran-

ularity to attain A2 by 2

The above implication demonstrates the effect of accelerationon g A

2and shows that this effect is more pronounced for sub-linear

algorithms For example for a sub-linear algorithm with β = 05doubling acceleration increases g A

2by a factor of four However for

linear and quadratic algorithms g A2increases by factors of two and

radic2 respectivelyFor architects equation (7) also exposes an interesting design

trade-off between acceleration and performance metrics Typicallyan architect may prefer higher acceleration and lower g1 g A

2 How-

ever equation (7) shows that increasing acceleration also increasesg A

2 This presents a dilemma for an architect to favor either higher

acceleration or reduced granularity especially for sub-linear algo-rithms LogCA helps by exposing these trade-offs at an early designstage

In our model we also use g1 to determine the complexity of thesystemrsquos interface A lower g1 (on the left side of plot in Figure 3)is desirable as it implies a system with lower overheads and thus asimpler interface Likewise g1 increases with the complexity of theinterface or when an accelerator moves further away from the host

23 Granularity dependent latencyThe previous section assumed latency is granularity independent butwe have observed granularity dependent latencies in GPUs In thissection we discuss the effect of granularity on speedup and deriveperformance metrics assuming granularity dependent-latency

Assuming granularity dependent latency equation (1) reduces to

Speedup(g) =C lowastgβ

o+Llowastg+ Clowastgβ

A

(8)

For large granularities equation (8) reduces to

limgrarrinfin

Speedup(g) =

(A

AClowastgβ

lowast (Llowastg)+1

)lt

CLlowastgβminus1 (9)

Unlike equation (3) speedup in the above equation approachesCL lowastgβminus1 at large granularities Thus for linear algorithms with gran-ularity dependent latency instead of acceleration speedup is limitedby C

L However for super-linear algorithms this limit increases by afactor of gβminus1 whereas for sub-linear algorithms this limit decreasesby a factor of gβminus1

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

IMPLICATION 4 With granularity dependent latency the speedupfor sub-linear algorithms asymptotically decreases with the increasein granularity

The above implication suggests that for sub-linear algorithms onsystems with granularity dependent latency speedup may decreasefor some large granularities This happens because for large granu-larities the communication latency (a linear function of granularity)may be higher than the computation time (a sub-linear function ofgranularity) on the accelerator resulting in a net de-accelerationThis implication is surprising as earlier we observed thatmdashfor sys-tems with granularity independent latencymdashspeedup for all algo-rithms increase with granularity and approaches acceleration forvery large granularities

For very small granularities equation (8) reduces to

limgrarr 0

Speedup(g) ≃ Alowast CAlowast (o+L)+C

(10)

Similar to equation (4) the above equation exposes the increasingeffects of overheads at small granularities Solving equation (8) forg1 using Newtonrsquos method [53]

g1 =C lowast (β minus1) lowast (Aminus1)+Alowasto

C lowastβ lowast (Aminus1)minusAlowastL(11)

For a positive value of g1 equation (11) must satisfy CL gt 1

β

Thus for achieving any speedup for linear algorithms CL should

be at least 1 However for super-linear algorithms a speedup of 1can achieved at values of C

L smaller than 1 whereas for sub-linearalgorithms algorithms C

L must be greater than 1

IMPLICATION 5 With granularity dependent latency computa-tional intensity for sub-linear algorithms should be greater than 1to achieve any gains

Thus for sub-linear algorithms computational index has to begreater than latency to justify offloading the work However forhigher-complexity algorithms computational index can be quitesmall and still be potentially useful to offload

Similarly solving equation (8) using Newtonrsquos method for g A2

gives

g A2=

C lowast (β minus1)+AlowastoC lowastβ minusAlowastL

(12)

For a positive value of g A2 equation (12) must satisfy CA

L gt 1β

Thus for achieving a speedup of A2 CL should be at least lsquoArsquo for

linear algorithms However for super-linear algorithms a speedupof A

2 can achieved at values of CL smaller than lsquoArsquo whereas for

sub-linear algorithms CL must be greater than lsquoArsquo

IMPLICATION 6 With granularity dependent latency accelera-torrsquos computational intensity for sub-linear algorithms should begreater than 1 to achieve speedup of half of the acceleration

The above implication suggests that for achieving half of theacceleration with sub-linear algorithms the computation time on theaccelerator must be greater than latency However for super-linearalgorithms that speedup can be achieved even if the computationtime on accelerator is lower than latency Programmers can usethe above implications to determinemdashearly in the design cyclemdashwhether to put time and effort in porting a code to an accelerator

g1

1

A

CL

limgrarrinfin Speedup(g) = A

CL gt A

Sp

eed

up

g1

1

CL

A

limgrarrinfin Speedup(g) = A

CL lt A

g1

1

CL

A

limgrarrinfin Speedup(g) = CL

CL lt A

Granularity (Bytes)

Sp

eed

up

g1

1

CL

A

limgrarrinfin Speedup(g) lt CL

CL lt A

Granularity (Bytes)

(a) Performance bounds for compute-bound kernels

(b) Performance bounds for latency-bound kernels

Figure 4 LogCA helps in visually identifying (a) compute and(b) latency bound kernels

For example consider a system with a minimum desirable speedupof one half of the acceleration but has a computational intensity ofless than the acceleration With the above implication architectsand programmers can infer early in the design stage that the desiredspeedup can not be achieved for sub-linear and linear algorithmsHowever the desired speedup can be achieved with super-linearalgorithms

We are also interested in quantifying the limits on achievablespeedup due to overheads and latencies To do this we assume ahypothetical accelerator with infinite acceleration and calculate thegranularity (gA) to achieve the peak speedup of lsquoArsquo With this as-sumption the desired speedup of lsquoArsquo is only limited by the overheadsand latencies Solving equation (8) for gA gives

gA =C lowast (β minus1)+Alowasto

C lowastβ minusAlowastL(13)

Surprisingly we find that the above equation is similar to equa-tion (12) ie gA equals g A

2 This observation shows that with a

hypothetical accelerator the peak speedup can now be achieved atthe same granularity as g A

2 This observation also demonstrates that

if g A2is not achievable on a system ie CA

L lt 1β

as per equation(12) then despite increasing the acceleration gA will not be achiev-able and the speedup will still be bounded by the computationalintensity

IMPLICATION 7 If a speedup of A2 is not achievable on an ac-

celerator with acceleration lsquoArsquo despite increasing acceleration toAtilde (where Atilde gt A) the speedup is bounded by the computationalintensity

The above implication helps architects in allocating more re-sources for an efficient interface instead of increasing acceleration

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

No variation

Granularity (Bytes)

Sp

eed

up

(a) Latency

LogCAL110x

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

Granularity (Bytes)

(b) Overheads

LogCAo110x

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

Granularity (Bytes)

(c) Computational Index

LogCAC10x

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

Granularity (Bytes)

(d) Acceleration

LogCAA10x

Figure 5 The effect on speedup of 10x improvement in each LogCA parameter The base case is the speedup of AES [30] on Ultra-SPARC T2

3 APPLICATIONS OF LogCAIn this section we describe the utility of LogCA for visually iden-tifying the performance bounds design bottlenecks and possibleoptimizations to alleviate these bottlenecks

31 Performance BoundsEarlier we have observed that the speedup is bounded by eitheracceleration (equation 3) or the product of computational intensityand gβminus1 (equation 9) Using these observations we classify kernelseither as compute-bound or latency-bound For compute-bound ker-nels the achievable speedup is bounded by acceleration whereas forthe latency-bound kernels the speedup is bounded by computationalintensity Based on this classification a compute-bound kernel caneither be running on a system with granularity independent latencyor has super-linear complexity while running on a system with gran-ularity dependent latency Figure 4-a illustrates these bounds forcompute-bound kernels On the other hand a latency-bound kernelis running on a system with granularity dependent latency and haseither linear or sub-linear complexity Figure 4-b illustrates thesebounds for latency-bound kernels

Programmers and architects can visually identify these boundsand use this information to invest their time and resources in the rightdirection For example for compute-bound kernelsmdashdependingon the operating granularitymdashit may be beneficial to invest moreresources in either increasing acceleration or reducing overheadsHowever for latency-bound kernels optimizing acceleration andoverheads is not that critical but decreasing latency and increasingcomputational index maybe more beneficial

32 Sensitivity AnalysisTo identify the design bottlenecks we perform a sensitivity analysisof the LogCA parameters We consider a parameter a design bottle-neck if a 10x improvement in it provides at lest 20 improvement inspeedup A lsquobottleneckedrsquo parameter also provides an optimizationopportunity To visually identify these bottlenecks we introduceoptimization regions As an example we identify design bottlenecksin UltraSPARC T2rsquos crypto accelerator by varying its individualparameters 2 in Figure 5 (a)-(d)

2We elaborate our methodology for measuring LogCA parameters later (sect 4)

Figure 5 (a) shows the variation (or the lack of) in speedup withthe decrease in latency The resulting gains are negligible and inde-pendent of the granularity as it is a closely coupled accelerator

Figure 5 (b) shows the resulting speedup after reducing overheadsSince the overheads are one-time initialization cost and independentof granularity the per byte setup cost is high at small granularitiesDecreasing these overheads considerably reduces the per byte setupcost and results in significant gains at these smaller granularitiesConversely for larger granularities the per byte setup cost is alreadyamortized so reducing overheads does not provide much gainsThus overhead is a bottleneck at small granularities and provide anopportunity for optimization

Figure 5 (c) shows the effect of increasing the computationalindex The results are similar to optimizing overheads in Figure 5 (b)ie significant gains for small granularities and a gradual decreasein the gains with increasing granularity With the constant overheadsincreasing computational index increases the computation time of thekernel and decreases the per byte setup cost For smaller granularitiesthe reduced per byte setup cost results in significant gains

Figure 5 (d) shows the variation in speedup with increasing peakacceleration The gains are negligible at small granularities andbecome significant for large granularities As mentioned earlierthe per byte setup cost is high at small granularities and it reducesfor large granularities Since increasing peak acceleration does notreduce the per byte setup cost optimizing peak acceleration providesgains only at large granularities

We group these individual sensitivity plots in Figure 6 to buildthe optimization regions As mentioned earlier each region indicatesthe potential of 20 gains with 10x variation of one or more LogCAparameters For the ease of understanding we color these regionsand label them with their respective LogCA parameters For exam-ple the blue colored region labelled lsquooCrsquo (16B to 2KB) indicatesan optimization region where optimizing overheads and computa-tional index is beneficial Similarly the red colored region labelledlsquoArsquo (32KB to 32MB) represents an optimization region where opti-mizing peak acceleration is only beneficial The granularity rangeoccupied by a parameter also identifies the scope of optimizationfor an architect and a programmer For example for UltraSPARCT2 overheads occupy most of the lower granularity suggesting op-portunity for improving the interface Similarly the absence of thelatency parameter suggests little benefits for optimizing latency

We also add horizontal arrows to the optimization regions inFigure 6 to demarcate the start and end of granularity range for each

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

Table 2 Description of the Cryptographic accelerators

Crypto Accelerator PCI Crypto UltraSPARC T2 SPARC T3 SPARC T4 Sandy BridgeProcessor AMD A8-3850 S2 S2 S3 Intel Core i7-2600Frequency 29 GHz 116 GHz 165 GHz 3 GHz 34 GHzOpenSSL version 098o 098o 098o 102 101k 098oKernel Ubuntu 3130-55 Oracle Solaris 11 Oracle Solaris 11 Oracle Solaris 112 Linux2632-504

16 128 g1 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

oC AoCA

oC

A

Granularity (Bytes)

Sp

eed

up

LogCA L110xo110x C10x A10x

Figure 6 Optimization regions for UltraSPARC T2 The pres-ence of a parameter in an optimization region indicates thatit can at least provides 20 gains The horizontal arrow in-dicates the cut-off granularity at which a parameter provides20 gains

parameter For example optimizing acceleration starts providingbenefits from 2KB while optimizing overheads or computationalindex is beneficial up till 32KB These arrows also indicate thecut-off granularity for each parameter These cut-off granularitiesprovide insights to architects and programmers about the designbottlenecks For example high cut-off granularity of 32KB suggestshigh overheads and thus a potential for optimization

4 EXPERIMENTAL METHODOLOGYThis section describes the experimental setup and benchmarks forvalidating LogCA on real machines We also discuss our methodol-ogy for measuring LogCA parameters and performance metrics

Our experimental setup comprises of on-chip and off-chip cryptoaccelerators (Table 2) and three different GPUs (Table 3) The on-chip crypto accelerators include cryptographic units on SunOracleUltraSPARC T2 [40] SPARC T3 [35] SPARC T4 [41] and AES-NI(AES New Instruction) [15] on Sandy Bridge whereas the off-chipaccelerator is a Hifn 7955 chip connected through the PCIe bus [43]The GPUs include a discrete NVIDIA GPU an integrated AMDGPU (APU) and HSA supported integrated GPU

For the on-chip crypto accelerators each core in UltraSPARC T2and SPARC T3 has a physically addressed crypto unit which requiresprivileged DMA calls However the crypto unit on SPARC T4 isintegrated within the pipeline and does not require privileged DMAcalls SPARC T4 also provides non-privileged crypto instructions toaccess the crypto unit Similar to SPARC T4 sandy bridge providesnon-privileged crypto instructionmdashAESNI

Considering the GPUs the discrete GPU is connected throughthe PCIe bus whereas for the APU the GPU is co-located with thehost processor on the same die For the APU the system memoryis partitioned between host and GPU memory This eliminates thePCIe bottleneck of data copying but it still requires copying databetween memories Unlike discrete GPU and APU HSA supportedGPU provides a unified and coherent view of the system memoryWith the host and GPU share the same virtual address space explicitcopying of data between memories is not required

Our workloads consist of encryption hashing and GPU kernelsFor encryption and hashing we have used advanced encryptionstandard (AES) [30] and standard hashing algorithm (SHA) [31]respectively from OpenSSL [34]mdashan open source cryptography li-brary For GPU kernels we use matrix multiplication radix sortFFT and binary search from AMD OpenCL SDK [1] Table 4 we listthe complexities of each kernel both in terms of number of elementsn and granularity g We expect these complexities to remain same inboth cases but we observe that they differ for matrix multiplicationFor example for a square matrix of size n matrix multiplication hascomplexity of O (n3) whereas the complexity in terms of granularityis O (g17) This happens because for matrix multiplicationmdashunlikeothersmdashcomputations are performed on matrices and not vectorsSo offloading a square matrix of size n corresponds to offloading n2

elements which results in the apparent discrepancy in the complexi-ties We also observe that for the granularity range of 16B to 32MBβ = 011 provides a close approximation for log(g)

Table 3 Description of the GPUs

Platform Discrete GPU Integrated APU AMD HSAName Tesla C2070 Radeon HD 6550 Radeon R7Architecture Fermi Beaver Creek KaveriCores 16 5 8Compute Units 448 400 512Clock Freq 15 GHz 600 MHz 720 MHzPeak FLOPS 1 T 480 G 856 GHostProcessor Intel AMD AMD

Xeon E5520 A8-3850 A10-7850KFrequency GHz 227 29 17

For calculating execution times we have used Linux utilities onthe crypto accelerators whereas for the GPUs we have used NVIDIAand AMD OpenCL profilers to compute the setup kernel and datatransfer times and we report the average of one hundred executionsFor verifying the usage of crypto accelerators we use built-in coun-ters in UltraSPARC T2 and T3 [46] SPARC T4 however no longer

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

Table 4: Algorithmic complexity of various kernels in terms of the number of elements (n) and granularity (g). The power of g represents β for each kernel.

Kernel                              | Complexity in n | Complexity in g
Advanced Encryption Standard (AES)  | O(n)            | O(g^1.01)
Secure Hashing Algorithm (SHA)      | O(n)            | O(g^0.97)
Matrix Multiplication (GEMM)        | O(n³)           | O(g^1.7)
Fast Fourier Transform (FFT)        | O(n log n)      | O(g^1.2)
Radix Sort                          | O(kn)           | O(g^0.94)
Binary Search                       | O(log n)        | O(g^0.14)

Table 5: Calculated values of the LogCA parameters. Per-device values (L; and o and C for the GPUs) are shared across that device's kernels.

Device          | Benchmark     | L (cycles) | o (cycles) | C (cycles/B) | A
Discrete GPU    | AES           | 3×10³      | 2×10⁸      | 2            | 174
Discrete GPU    | Radix Sort    | 3×10³      | 2×10⁸      | 2            | 290
Discrete GPU    | GEMM          | 3×10³      | 2×10⁸      | 2            | 30
Discrete GPU    | FFT           | 3×10³      | 2×10⁸      | 2            | 290
Discrete GPU    | Binary Search | 3×10³      | 2×10⁸      | 2            | 116
APU             | AES           | 15         | 4×10⁸      | 2            | 174
APU             | Radix Sort    | 15         | 4×10⁸      | 2            | 290
APU             | GEMM          | 15         | 4×10⁸      | 2            | 7
APU             | FFT           | 15         | 4×10⁸      | 2            | 290
APU             | Binary Search | 15         | 4×10⁸      | 2            | 116
UltraSPARC T2   | AES           | 1500       | 2.9×10⁴    | 90           | 19
UltraSPARC T2   | SHA           | 1500       | 1.05×10³   | 72           | 12
SPARC T3        | AES           | 1500       | 2.7×10⁴    | 90           | 12
SPARC T3        | SHA           | 1500       | 1.05×10³   | 72           | 10
SPARC T4        | AES           | 500        | 435        | 32           | 12
SPARC T4        | SHA           | 500        | 1.6×10³    | 32           | 10
SPARC T4 instr. | AES           | 4          | 111        | 32           | 12
SPARC T4 instr. | SHA           | 4          | 1638       | 32           | 10
Sandy Bridge    | AES           | 3          | 10         | 35           | 6

SPARC T4, however, no longer supports these counters, so we use Linux utilities to trace the execution of the crypto instructions [3]. We use these execution times to determine the LogCA parameters. These parameters are calculated once per system and can later be reused for different kernels on the same system.

For the computational index and β, we profile the CPU code on the host by varying the granularity from 16B to 32MB. At each granularity, we measure the execution time and use regression analysis to determine C and β. For overheads, we use the observation that for very small granularities the execution time for a kernel on an accelerator is dominated by the overheads, i.e., lim_{g→0} T1(g) ≃ o. For acceleration, we use different methods for the on-chip accelerators and the GPUs. For the on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches the acceleration for very large granularities. For the GPUs, however, we do not use equation (3), as it would require computing acceleration for each kernel: each application has a different access pattern, which affects the speedup. Instead, we bound the maximum performance using the peak FLOPS from the device specifications, i.e., A = Peak GFLOPS_GPU / Peak GFLOPS_CPU. Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and the accelerators. For the GPUs, we compute latency using the peak memory bandwidth of the GPU; similar to Meswani et al. [29], we measure the data copying time for the GPUs as L = 1/BW_peak.
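To make the fitting step concrete, the following Python sketch (our own illustration, not the exact scripts used in our setup) recovers C and β by least-squares regression in log-log space; the measurements here are synthetic stand-ins for the profiled host timings.

    import numpy as np

    def fit_c_beta(granularities_bytes, host_times_cycles):
        # Model T(g) = C * g**beta becomes linear after taking logs:
        # log T = beta * log g + log C.
        log_g = np.log(granularities_bytes)
        log_t = np.log(host_times_cycles)
        beta, log_c = np.polyfit(log_g, log_t, 1)
        return np.exp(log_c), beta

    # Hypothetical sweep from 16B to 32MB, mimicking an AES-like kernel.
    g = np.array([16, 256, 4096, 65536, 1 << 20, 32 << 20], dtype=float)
    t = 90.0 * g ** 1.01                  # synthetic host timings (cycles)
    C, beta = fit_c_beta(g, t)
    print(f"C ~ {C:.1f} cycles/B, beta ~ {beta:.2f}")   # C ~ 90.0, beta ~ 1.01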

Earlier, we developed our model under the assumptions of granularity independent and granularity dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity independent latency, while the off-chip crypto accelerator, the discrete GPU, and the APU represent granularity dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).
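For reference, the two speedup forms used in this validation can be written down in a few lines; this is a minimal sketch assuming the model equations of §2, with parameter names following Table 5.

    def speedup_gi(g, o, L, C, A, beta):
        # Granularity-independent latency: o and L are flat, one-time costs.
        work = C * g ** beta
        return work / (o + L + work / A)

    def speedup_gd(g, o, L, C, A, beta):
        # Granularity-dependent latency: communication scales linearly with g.
        work = C * g ** beta
        return work / (o + L * g + work / A)

    # UltraSPARC T2 running AES (Table 5): speedup at a 4KB granularity.
    print(speedup_gi(4096, o=2.9e4, L=1500, C=90, A=19, beta=1.01))   # ~7.8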

5 EVALUATION
In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We also list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we present two case studies, considering the evolution of the interface in Sun/Oracle's crypto accelerators and in three different GPU architectures. In both cases, we elaborate on the design changes using the insights LogCA provides.

5.1 Linear-Complexity Kernels (β = 1)
Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through interfaces ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by computational intensity for the off-chip accelerators and by acceleration for the on-chip accelerators. This observation supports the earlier implications on the limits of speedup for granularity independent and dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with powerful processing cores, has the highest acceleration among them. However, its observed speedup is lower than the others due to the high overheads and latencies involved in communicating through the PCIe bus.

We have also marked g1 and g_{A/2} for each accelerator in Figure 7; these help programmers and architects identify the complexity of the interface. For example, g1 for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators g1 lies on the far right. It is worth mentioning that we have marked g_{A/2} for the on-chip accelerators but not for the off-chip accelerators: for the off-chip accelerators, computational intensity is less than acceleration, and as we have noted in equation (12), g_{A/2} for these designs does not exist.

We also observe that g1 for the crypto card connected through the PCIe bus does not exist, showing that this accelerator does not break even even for large granularities. Figure 7 also shows that g1 for the GPU and the APU is comparable. This observation shows that, despite being an integrated GPU that is not connected to the PCIe bus, the APU spends considerable time copying data from the host to device memory.

Figure 7: Speedup curve-fitting plots comparing LogCA with the observed values of AES [30] over granularities of 16B to 32MB: (a) PCIe crypto, (b) NVIDIA discrete GPU, (c) AMD integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, (h) AES-NI on Sandy Bridge. Each plot marks A, C/L, g1, and g_{A/2} where they exist (for the SPARC T4 instruction g1 < 16B, and for AES-NI both g1 and g_{A/2} < 16B).

Figure 8: Speedup curve-fitting plots comparing LogCA with the observed values of SHA256 [31]: (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, (d) SPARC T4 instruction (g1 < 16B). LogCA starts following the observed values after 64B.

Figure 8 shows the curve fitting for SHA on the various on-chip crypto accelerators. We observe that g1 and g_{A/2} do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values below 64B. This happens because SHA requires a block size of 64B for hash computation; if the input is smaller than 64B, it is padded to 64B. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.

Figure 9(a) shows the speedup curve-fitting plots for radix sort. We observe that LogCA does not follow the observed values at smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g1 and g_{A/2}. We also observe that g_{A/2} for the GPU is higher than for the APU; this observation supports equation (7), i.e., increasing acceleration increases g_{A/2}.

5.2 Super-Linear Complexity Kernels (β > 1)
Figures 9(b) and 9(c) show the speedup curve-fitting plots for super-linear kernels on the discrete GPU and the APU. We observe that matrix multiplication, with its higher complexity (O(g^1.7)), achieves a higher speedup than radix sort and FFT, with their lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms is greater than that of lower-complexity algorithms. We also observe that g_{A/2} does not exist for FFT. This happens because, as we note in equation (12), for g_{A/2} to exist for FFT, C/L should be greater than A/1.2; however, Figure 9(c) shows that C/L is smaller than A/1.2 for both the GPU and the APU.

5.3 Sub-Linear Complexity Kernels (β < 1)
Figure 9(d) shows the curve fitting for binary search, a sub-linear algorithm (β = 0.14). We make three observations. First, g1 does not exist, even for very large granularities, and C/L < 1. This observation supports implication (5): for a sub-linear algorithm with β = 0.14, C/L should be greater than 1/0.14 ≈ 7 to provide any speedup.

Figure 9: Speedup curve-fitting plots comparing LogCA with the observed values of (a) radix sort, (b) matrix multiplication, (c) FFT, and (d) binary search, each on the discrete GPU and the APU.

Second, at large granularities the speedup starts decreasing with increasing granularity. This observation supports our earlier claim in implication (4) that, for systems with granularity dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches: as mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search's behavior, it still provides an upper bound on the achievable performance.

5.4 Case Studies
Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerator in the pre-Niagara era (Figure 10(a)) to the accelerator integrated within the pipeline in SPARC T4 (Figure 10(e)). We observe that latency is absent from the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, representing the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand from Figure 10(a) to (e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for SPARC T4, suggesting a much simpler interface.

Figure 10(a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating that high-overhead OS calls and high-latency data copying over the PCIe bus are the bottlenecks.

Figure 10(b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads, at 32KB, suggests a complex interface, with high-overhead OS calls creating a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10(d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, which is also evident from a smaller g1. We also note that the cut-off granularity for acceleration decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10(e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not provide any gains at this granularity. We also observe that the cut-off granularity for overheads is further reduced to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the widest range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA exposes is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.

Figure 11 shows the evolution of memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an integrated AMD GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication is compute bound for all three architectures (§3.1).

Figure 10: Optimization regions for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, (e) SPARC T4 instruction. LogCA identifies the design bottlenecks through its parameters in each optimization region; the bottleneck LogCA exposes in each design is optimized in the next design.

Figure 11: Optimization regions for matrix multiplication over a range of granularities on (a) an NVIDIA discrete GPU, (b) an AMD integrated GPU (APU), and (c) an HSA-supported AMD integrated GPU.

We also observe that the computational index occupies most of the regions, signifying the greatest optimization potential.

The discrete GPU has four optimization regions (Figure 11(a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads, at 32KB, indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11(b)), with a few notable differences compared to the discrete GPU: the cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11(c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent from all regions, and the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface compared to the discrete GPU and the APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.

6 RELATED WORK
We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine future trends in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose Navigo, an early-stage model that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose incorporating communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs; they later extend their model into an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in its complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture specific, for example targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK
With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators, and have shown its utility through retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS
We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. IEEE Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. International Journal of Reconfigurable Computing 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM Press, New York, NY, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the International Symposium on High-Performance Computer Architecture, 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37, 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38, 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Computer Architecture Letters 13, 2 (July 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. IEEE Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In IEEE Asian Solid-State Circuits Conference (A-SSCC '07), 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. SPARC T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. PhD Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture, 382-393. https://doi.org/10.1109/HPCA.2011.5749745



Figure 3: A graphical description of the performance metrics: the speedup curve crosses 1 at g1, reaches A/2 at g_{A/2}, and approaches the acceleration A at large granularities.

At one extreme, for large granularities, equation (2) becomes

\[ \lim_{g \to \infty} \mathrm{Speedup}(g) = A \qquad (3) \]

while for small granularities, equation (2) reduces to

\[ \lim_{g \to 0} \mathrm{Speedup}(g) \simeq \frac{C}{o + L + \frac{C}{A}} < \frac{C}{o + L} \qquad (4) \]

Equation (4) is simply Amdahl's Law [2] for accelerators, demonstrating the dominating effect of overheads at small granularities.
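One way to see this correspondence is to divide through by C:

\[ \frac{C}{o + L + \frac{C}{A}} = \frac{1}{\frac{o+L}{C} + \frac{1}{A}}, \]

which has Amdahl's familiar 1/((1-f) + f/A) shape, with the setup cost (o+L)/C playing the role of the non-accelerated fraction; no matter how large A grows, the speedup at small granularities is capped by C/(o+L).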

2.2 Performance Metrics
To help programmers decide when and how much computation to offload, we define two performance metrics. These metrics are inspired by the vector machine metrics N_v and N_{1/2} [18], where N_v is the vector length needed to make vector mode faster than scalar mode and N_{1/2} is the vector length needed to achieve half of the peak performance. Since vector length is an important parameter in determining performance gains for vector machines, these metrics characterize the behavior and efficiency of vector machines with reference to scalar machines. Our metrics serve the same purpose in the accelerator domain.

g1: The granularity to achieve a speedup of 1 (Figure 3). It is the break-even point where the accelerator's performance becomes equal to the host's. Thus, it is the minimum granularity at which an accelerator starts providing benefits. Solving equation (2) for g1 gives

\[ g_1 = \left[ \left( \frac{A}{A-1} \right) \left( \frac{o+L}{C} \right) \right]^{1/\beta} \qquad (5) \]

IMPLICATION 1. g1 is essentially independent of acceleration for large values of A.

For reducing g1, the above implication guides an architect to invest resources in improving the interface.

IMPLICATION 2. Doubling the computational index reduces g1 by 2^{-1/β}.

The above implication demonstrates the effect of algorithmic complexity on g1 and shows that varying the computational index has a profound effect on g1 for sub-linear algorithms. For example, for a sub-linear algorithm with β = 0.5, doubling the computational index decreases g1 by a factor of four, whereas for linear (β = 1) and quadratic (β = 2) algorithms, g1 decreases by factors of two and √2, respectively.

g_{A/2}: The granularity to achieve a speedup of half of the acceleration. This metric provides information about a system's behavior after the break-even point and shows how quickly the speedup can ramp towards the acceleration. Solving equation (2) for g_{A/2} gives

\[ g_{A/2} = \left[ A \left( \frac{o+L}{C} \right) \right]^{1/\beta} \qquad (6) \]

Using equations (5) and (6), g1 and g_{A/2} are related as

\[ g_{A/2} = (A-1)^{1/\beta} \, g_1 \qquad (7) \]
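As a worked example (our own arithmetic, plugging in the UltraSPARC T2 AES parameters reported later in Table 5), both metrics follow in a few lines of Python:

    def g1(o, L, C, A, beta):
        # Equation (5): break-even granularity.
        return ((A / (A - 1)) * ((o + L) / C)) ** (1 / beta)

    def g_half(o, L, C, A, beta):
        # Equation (6): granularity for a speedup of A/2.
        return (A * ((o + L) / C)) ** (1 / beta)

    params = dict(o=2.9e4, L=1500, C=90, A=19, beta=1.01)
    print(g1(**params), g_half(**params))   # ~338 B and ~5.9 KB, respectively

These values land roughly where the UltraSPARC T2 curve crosses 1 and A/2 in Figure 7(d).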

IMPLICATION 3. Doubling the acceleration A increases the granularity to attain A/2 by 2^{1/β}.

The above implication demonstrates the effect of acceleration on g_{A/2} and shows that this effect is more pronounced for sub-linear algorithms. For example, for a sub-linear algorithm with β = 0.5, doubling the acceleration increases g_{A/2} by a factor of four, whereas for linear and quadratic algorithms g_{A/2} increases by factors of two and √2, respectively.

For architects, equation (7) also exposes an interesting design trade-off between acceleration and the performance metrics. Typically, an architect may prefer higher acceleration and a lower g1 and g_{A/2}. However, equation (7) shows that increasing acceleration also increases g_{A/2}. This presents a dilemma for the architect: favor either higher acceleration or reduced granularity, especially for sub-linear algorithms. LogCA helps by exposing these trade-offs at an early design stage.

In our model, we also use g1 to determine the complexity of the system's interface. A lower g1 (on the left side of the plot in Figure 3) is desirable, as it implies a system with lower overheads and thus a simpler interface. Likewise, g1 increases with the complexity of the interface, or when an accelerator moves further away from the host.

2.3 Granularity Dependent Latency
The previous section assumed latency is granularity independent, but we have observed granularity dependent latencies in GPUs. In this section, we discuss the effect of granularity on speedup and derive the performance metrics assuming granularity dependent latency.

Assuming granularity dependent latency, equation (1) reduces to

\[ \mathrm{Speedup}(g) = \frac{C \, g^{\beta}}{o + L \, g + \frac{C \, g^{\beta}}{A}} \qquad (8) \]

For large granularities, equation (8) reduces to

\[ \lim_{g \to \infty} \mathrm{Speedup}(g) = \left( \frac{A}{\frac{A}{C \, g^{\beta}} \, (L \, g) + 1} \right) < \frac{C}{L} \, g^{\beta - 1} \qquad (9) \]

Unlike equation (3), the speedup in the above equation approaches (C/L)·g^{β-1} at large granularities. Thus, for linear algorithms with granularity dependent latency, speedup is limited by C/L instead of acceleration. For super-linear algorithms this limit increases by a factor of g^{β-1}, whereas for sub-linear algorithms it decreases by a factor of g^{β-1}.


IMPLICATION 4. With granularity dependent latency, the speedup for sub-linear algorithms asymptotically decreases with increasing granularity.

The above implication suggests that for sub-linear algorithms on systems with granularity dependent latency, speedup may decrease at some large granularities. This happens because, at large granularities, the communication latency (a linear function of granularity) may exceed the computation time (a sub-linear function of granularity) on the accelerator, resulting in a net de-acceleration. This implication is surprising, as we observed earlier that for systems with granularity independent latency, the speedup for all algorithms increases with granularity and approaches acceleration for very large granularities.

For very small granularities, equation (8) reduces to

\[ \lim_{g \to 0} \mathrm{Speedup}(g) \simeq \frac{A \, C}{A (o + L) + C} \qquad (10) \]

Similar to equation (4), the above equation exposes the increasing effect of overheads at small granularities. Solving equation (8) for g1 using Newton's method [53] gives

\[ g_1 = \frac{C (\beta - 1)(A - 1) + A \, o}{C \, \beta \, (A - 1) - A \, L} \qquad (11) \]

For a positive value of g1, equation (11) must satisfy C/L > 1/β. Thus, for achieving any speedup for linear algorithms, C/L should be at least 1. However, for super-linear algorithms a speedup of 1 can be achieved at values of C/L smaller than 1, whereas for sub-linear algorithms C/L must be greater than 1.

IMPLICATION 5. With granularity dependent latency, the computational intensity for sub-linear algorithms should be greater than 1 to achieve any gains.

Thus, for sub-linear algorithms, the computational index has to be greater than the latency to justify offloading the work. However, for higher-complexity algorithms the computational index can be quite small and offloading can still be potentially useful.

Similarly, solving equation (8) using Newton's method for g_{A/2} gives

\[ g_{A/2} = \frac{C (\beta - 1) + A \, o}{C \, \beta - A \, L} \qquad (12) \]

For a positive value of g_{A/2}, equation (12) must satisfy C/(A·L) > 1/β. Thus, for achieving a speedup of A/2, C/L should be at least A for linear algorithms. However, for super-linear algorithms a speedup of A/2 can be achieved at values of C/L smaller than A, whereas for sub-linear algorithms C/L must be greater than A.

IMPLICATION 6. With granularity dependent latency, the accelerator's computational intensity for sub-linear algorithms should be greater than 1 to achieve a speedup of half of the acceleration.

The above implication suggests that for achieving half of the acceleration with sub-linear algorithms, the computation time on the accelerator must be greater than the latency. However, for super-linear algorithms that speedup can be achieved even if the computation time on the accelerator is lower than the latency. Programmers can use the above implications to determine, early in the design cycle, whether to put time and effort into porting a code to an accelerator.
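These existence conditions are easy to screen for programmatically; a minimal sketch (our own) of equations (11) and (12) with their positivity checks:

    def g1_gd(o, L, C, A, beta):
        # Equation (11); no break-even point when the denominator is non-positive.
        denom = C * beta * (A - 1) - A * L
        return (C * (beta - 1) * (A - 1) + A * o) / denom if denom > 0 else None

    def g_half_gd(o, L, C, A, beta):
        # Equation (12); g_{A/2} exists only if C/(A*L) > 1/beta.
        denom = C * beta - A * L
        return (C * (beta - 1) + A * o) / denom if denom > 0 else None

    # Binary search on the discrete GPU (Table 5): C/L is far below 1/beta,
    # so neither metric exists.
    print(g1_gd(o=2e8, L=3e3, C=2, A=116, beta=0.14))   # None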

Figure 4: LogCA helps in visually identifying (a) compute-bound and (b) latency-bound kernels. Compute-bound kernels approach A at large granularities (whether C/L > A or C/L < A), while latency-bound kernels (C/L < A) are capped at or below C/L.

For example, consider a system with a minimum desirable speedup of one half of the acceleration, but with a computational intensity less than the acceleration. With the above implication, architects and programmers can infer early in the design stage that the desired speedup cannot be achieved for sub-linear and linear algorithms; it can, however, be achieved with super-linear algorithms.

We are also interested in quantifying the limits on achievable speedup due to overheads and latencies. To do this, we assume a hypothetical accelerator with infinite acceleration and calculate the granularity (g_A) needed to achieve the peak speedup of A. With this assumption, the desired speedup of A is limited only by the overheads and latencies. Solving equation (8) for g_A gives

\[ g_A = \frac{C (\beta - 1) + A \, o}{C \, \beta - A \, L} \qquad (13) \]

Surprisingly, we find that the above equation is identical to equation (12), i.e., g_A equals g_{A/2}. This observation shows that, with a hypothetical accelerator, the peak speedup can be achieved at the same granularity as g_{A/2}. It also demonstrates that if g_{A/2} is not achievable on a system, i.e., C/(A·L) < 1/β as per equation (12), then despite increasing the acceleration, g_A will not be achievable, and the speedup will still be bounded by the computational intensity.

IMPLICATION 7. If a speedup of A/2 is not achievable on an accelerator with acceleration A, then despite increasing the acceleration to Ã (where Ã > A), the speedup is bounded by the computational intensity.

The above implication helps architects allocate more resources to an efficient interface instead of increasing acceleration.


Figure 5: The effect on speedup of a 10x improvement in each LogCA parameter: (a) latency (no variation), (b) overheads, (c) computational index, (d) acceleration. The base case is the speedup of AES [30] on UltraSPARC T2.

3 APPLICATIONS OF LogCA
In this section, we describe the utility of LogCA for visually identifying performance bounds, design bottlenecks, and possible optimizations to alleviate these bottlenecks.

3.1 Performance Bounds
Earlier, we observed that the speedup is bounded either by acceleration (equation (3)) or by the product of computational intensity and g^{β-1} (equation (9)). Using these observations, we classify kernels as either compute-bound or latency-bound. For compute-bound kernels the achievable speedup is bounded by acceleration, whereas for latency-bound kernels the speedup is bounded by computational intensity. Based on this classification, a compute-bound kernel is either running on a system with granularity independent latency or has super-linear complexity while running on a system with granularity dependent latency. Figure 4(a) illustrates these bounds for compute-bound kernels. A latency-bound kernel, on the other hand, is running on a system with granularity dependent latency and has either linear or sub-linear complexity. Figure 4(b) illustrates these bounds for latency-bound kernels.

Programmers and architects can visually identify these bounds and use this information to invest their time and resources in the right direction. For example, for compute-bound kernels, depending on the operating granularity, it may be beneficial to invest more resources in either increasing acceleration or reducing overheads. However, for latency-bound kernels, optimizing acceleration and overheads is not as critical; decreasing latency and increasing the computational index may be more beneficial.
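This classification translates directly into an upper bound on speedup; a minimal sketch (our own) combining equations (3) and (9):

    def speedup_bound(g, C, L, A, beta, granularity_dependent_latency):
        # Compute-bound: granularity-independent latency, or super-linear
        # complexity on a granularity-dependent system.
        if not granularity_dependent_latency or beta > 1:
            return A
        # Latency-bound: capped by computational intensity times g**(beta - 1).
        return min(A, (C / L) * g ** (beta - 1))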

3.2 Sensitivity Analysis
To identify design bottlenecks, we perform a sensitivity analysis of the LogCA parameters. We consider a parameter a design bottleneck if a 10x improvement in it provides at least a 20% improvement in speedup. A bottlenecked parameter also provides an optimization opportunity. To visually identify these bottlenecks, we introduce optimization regions. As an example, we identify the design bottlenecks in UltraSPARC T2's crypto accelerator by varying its individual parameters in Figure 5(a)-(d); we elaborate our methodology for measuring the LogCA parameters in §4.
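The sweep itself is mechanical; a sketch (our own, using the UltraSPARC T2 AES parameters from Table 5 and the granularity-independent speedup form) that flags which parameters clear the 20% threshold at each granularity:

    import numpy as np

    def speedup(g, o, L, C, A, beta=1.01):
        work = C * g ** beta
        return work / (o + L + work / A)

    base = dict(o=2.9e4, L=1500, C=90, A=19)
    tweaks = {"o": dict(base, o=base["o"] / 10),   # 10x lower overheads
              "L": dict(base, L=base["L"] / 10),   # 10x lower latency
              "C": dict(base, C=base["C"] * 10),   # 10x higher computational index
              "A": dict(base, A=base["A"] * 10)}   # 10x higher acceleration

    for g in 2.0 ** np.arange(4, 26):              # 16B ... 32MB
        hits = [p for p, kw in tweaks.items()
                if speedup(g, **kw) >= 1.2 * speedup(g, **base)]
        print(f"{int(g):>9} B: {''.join(hits) or '-'}")

Running this prints 'oC' at small granularities and 'A' at large ones, with latency never qualifying, mirroring the regions in Figure 6.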

Figure 5(a) shows the variation (or lack thereof) in speedup with decreasing latency. The resulting gains are negligible and independent of granularity, as this is a closely coupled accelerator.

Figure 5(b) shows the resulting speedup after reducing overheads. Since the overheads are a one-time initialization cost and independent of granularity, the per-byte setup cost is high at small granularities. Decreasing these overheads considerably reduces the per-byte setup cost and results in significant gains at these smaller granularities. Conversely, for larger granularities the per-byte setup cost is already amortized, so reducing overheads does not provide much gain. Thus, overhead is a bottleneck at small granularities and provides an opportunity for optimization.

Figure 5(c) shows the effect of increasing the computational index. The results are similar to optimizing overheads in Figure 5(b), i.e., significant gains at small granularities and a gradual decrease in gains with increasing granularity. With constant overheads, increasing the computational index increases the computation time of the kernel and decreases the per-byte setup cost. For smaller granularities, the reduced per-byte setup cost results in significant gains.

Figure 5(d) shows the variation in speedup with increasing peak acceleration. The gains are negligible at small granularities and become significant at large granularities. As mentioned earlier, the per-byte setup cost is high at small granularities and reduces at large granularities. Since increasing peak acceleration does not reduce the per-byte setup cost, optimizing peak acceleration provides gains only at large granularities.

We group these individual sensitivity plots in Figure 6 to build the optimization regions. As mentioned earlier, each region indicates the potential for 20% gains with a 10x variation of one or more LogCA parameters. For ease of understanding, we color these regions and label them with their respective LogCA parameters. For example, the blue region labelled 'oC' (16B to 2KB) indicates an optimization region where optimizing overheads and the computational index is beneficial. Similarly, the red region labelled 'A' (32KB to 32MB) represents an optimization region where only optimizing peak acceleration is beneficial. The granularity range occupied by a parameter also identifies the scope of optimization for an architect and a programmer. For example, for UltraSPARC T2, overheads occupy most of the lower granularities, suggesting an opportunity for improving the interface. Similarly, the absence of the latency parameter suggests little benefit from optimizing latency.

We also add horizontal arrows to the optimization regions in Figure 6 to demarcate the start and end of the granularity range for each parameter.

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

Table 2: Description of the cryptographic accelerators.

Crypto Accelerator | PCI Crypto       | UltraSPARC T2     | SPARC T3          | SPARC T4            | Sandy Bridge
Processor          | AMD A8-3850      | S2                | S2                | S3                  | Intel Core i7-2600
Frequency          | 2.9 GHz          | 1.16 GHz          | 1.65 GHz          | 3 GHz               | 3.4 GHz
OpenSSL version    | 0.9.8o           | 0.9.8o            | 0.9.8o            | 1.0.2, 1.0.1k       | 0.9.8o
Kernel             | Ubuntu 3.13.0-55 | Oracle Solaris 11 | Oracle Solaris 11 | Oracle Solaris 11.2 | Linux 2.6.32-504

Figure 6: Optimization regions for UltraSPARC T2 running AES. The presence of a parameter in an optimization region indicates that improving it provides at least 20% gains; the horizontal arrows indicate the cut-off granularity at which each parameter provides 20% gains.

For example, optimizing acceleration starts providing benefits from 2KB, while optimizing overheads or the computational index is beneficial up to 32KB. These arrows also indicate the cut-off granularity for each parameter. These cut-off granularities provide insights to architects and programmers about the design bottlenecks; for example, the high cut-off granularity of 32KB suggests high overheads and thus a potential for optimization.

4 EXPERIMENTAL METHODOLOGYThis section describes the experimental setup and benchmarks forvalidating LogCA on real machines We also discuss our methodol-ogy for measuring LogCA parameters and performance metrics

Our experimental setup comprises of on-chip and off-chip cryptoaccelerators (Table 2) and three different GPUs (Table 3) The on-chip crypto accelerators include cryptographic units on SunOracleUltraSPARC T2 [40] SPARC T3 [35] SPARC T4 [41] and AES-NI(AES New Instruction) [15] on Sandy Bridge whereas the off-chipaccelerator is a Hifn 7955 chip connected through the PCIe bus [43]The GPUs include a discrete NVIDIA GPU an integrated AMDGPU (APU) and HSA supported integrated GPU

For the on-chip crypto accelerators each core in UltraSPARC T2and SPARC T3 has a physically addressed crypto unit which requiresprivileged DMA calls However the crypto unit on SPARC T4 isintegrated within the pipeline and does not require privileged DMAcalls SPARC T4 also provides non-privileged crypto instructions toaccess the crypto unit Similar to SPARC T4 sandy bridge providesnon-privileged crypto instructionmdashAESNI

Considering the GPUs the discrete GPU is connected throughthe PCIe bus whereas for the APU the GPU is co-located with thehost processor on the same die For the APU the system memoryis partitioned between host and GPU memory This eliminates thePCIe bottleneck of data copying but it still requires copying databetween memories Unlike discrete GPU and APU HSA supportedGPU provides a unified and coherent view of the system memoryWith the host and GPU share the same virtual address space explicitcopying of data between memories is not required

Our workloads consist of encryption hashing and GPU kernelsFor encryption and hashing we have used advanced encryptionstandard (AES) [30] and standard hashing algorithm (SHA) [31]respectively from OpenSSL [34]mdashan open source cryptography li-brary For GPU kernels we use matrix multiplication radix sortFFT and binary search from AMD OpenCL SDK [1] Table 4 we listthe complexities of each kernel both in terms of number of elementsn and granularity g We expect these complexities to remain same inboth cases but we observe that they differ for matrix multiplicationFor example for a square matrix of size n matrix multiplication hascomplexity of O (n3) whereas the complexity in terms of granularityis O (g17) This happens because for matrix multiplicationmdashunlikeothersmdashcomputations are performed on matrices and not vectorsSo offloading a square matrix of size n corresponds to offloading n2

elements which results in the apparent discrepancy in the complexi-ties We also observe that for the granularity range of 16B to 32MBβ = 011 provides a close approximation for log(g)

Table 3 Description of the GPUs

Platform Discrete GPU Integrated APU AMD HSAName Tesla C2070 Radeon HD 6550 Radeon R7Architecture Fermi Beaver Creek KaveriCores 16 5 8Compute Units 448 400 512Clock Freq 15 GHz 600 MHz 720 MHzPeak FLOPS 1 T 480 G 856 GHostProcessor Intel AMD AMD

Xeon E5520 A8-3850 A10-7850KFrequency GHz 227 29 17

For calculating execution times we have used Linux utilities onthe crypto accelerators whereas for the GPUs we have used NVIDIAand AMD OpenCL profilers to compute the setup kernel and datatransfer times and we report the average of one hundred executionsFor verifying the usage of crypto accelerators we use built-in coun-ters in UltraSPARC T2 and T3 [46] SPARC T4 however no longer

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

Table 4 Algorithmic complexity of various kernels with num-ber of elements and granularity The power of g represents β

for each kernel

Kernel Algorithmic ComplexityAdvanced Encryption Standard (AES) O (n) O (g101)Secure Hashing Algorithm (SHA) O (n) O (g097)Matrix Multiplication (GEMM) O (n3) O (g17)Fast Fourier Transform (FFT) O (n logn) O (g12)Radix Sort O (kn) O (g094)Binary Search O (logn) O (g014)

Table 5 Calculated values of LogCA Parameters

LogCA ParametersDevice Benchmark L o C A

(cycles) (cycles) (cyclesB)

Discrete GPU

AES 174Radix Sort 290GEMM 3times103 2times108 2 30FFT 290Binary Search 116

APU

AES 174Radix Sort 290GEMM 15 4times108 2 7FFT 290Binary Search 116

UltraSPARC T2 AES 1500 29times104 90 19SHA 105times103 72 12

SPARC T3 AES 1500 27times104 90 12SHA 105times103 72 10

SPARC T4 AES 500 435 32 12SHA 16times103 32 10

SPARC T4 instr AES 4 111 32 12SHA 1638 32 10

Sandy Bridge AES 3 10 35 6

SPARC T4, however, no longer supports these counters, so we use Linux utilities to trace the execution of the crypto instructions [3]. We use these execution times to determine the LogCA parameters. We calculate these parameters once for each system; they can later be reused for different kernels on the same system.

For the computational index and β, we profile the CPU code on the host by varying the granularity from 16B to 32MB. At each granularity, we measure the execution time and use regression analysis to determine C and β. For overheads, we use the observation that for very small granularities the execution time for a kernel on an accelerator is dominated by the overheads, i.e., lim_{g→0} T_1(g) ≃ o. For acceleration, we use different methods for the on-chip accelerators and the GPUs. For on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches the acceleration for very large granularities. However, for the GPUs we do not use equation (3), as it would require computing the acceleration for each kernel, because each application has a different access pattern which affects the speedup. So we bound the maximum performance using the peak FLOPS from the device specifications and use the ratio of peak GFLOPS on the GPU and CPU, i.e., A = Peak GFLOPS_GPU / Peak GFLOPS_CPU. Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and the accelerators. On the other hand, for the GPUs we compute latency using the peak memory bandwidth of the GPU. Similar to Meswani et al. [29], we use the following equation for measuring the data copying time for the GPUs: L = 1 / BW_peak.
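The sketch below (ours; the sample timings, peak-FLOPS figure, and bandwidth are illustrative placeholders, not measured values) shows this parameter-extraction procedure end to end: a degree-1 least-squares fit on log-transformed host timings yields β and C, the smallest-granularity accelerator timing approximates o, and A and L come from device specifications:

import numpy as np

# Hypothetical host timings (cycles) at each profiled granularity (bytes);
# real measurements would span 16B to 32MB as described above.
g = np.array([16, 64, 256, 1024, 4096, 16384, 65536], dtype=float)
t_host = np.array([1.5e3, 5.9e3, 2.4e4, 9.5e4, 3.8e5, 1.5e6, 6.1e6])

# T_host(g) = C * g**beta, so log T = log C + beta * log g and a
# degree-1 fit on log-log data gives beta (slope) and C (intercept).
beta, log_c = np.polyfit(np.log(g), np.log(t_host), 1)
C = np.exp(log_c)

# Overheads: lim_{g -> 0} T_1(g) ~= o, so the accelerator timing at the
# smallest granularity serves as the overhead estimate (value hypothetical).
o = 2.9e4

# Acceleration for GPUs: ratio of peak FLOPS from device specifications,
# e.g. a 1 TFLOPS GPU over a ~36 GFLOPS host CPU.
A = 1e12 / 36e9

# Latency for GPUs: L = 1 / BW_peak (time per byte); multiplying by the
# host clock converts it to cycles per byte. Both numbers are placeholders.
bw_peak = 144e9                      # bytes/s
host_clock = 2.27e9                  # Hz
L = host_clock / bw_peak             # cycles per byte

print(f"C = {C:.0f} cycles/B, beta = {beta:.2f}, o = {o:.0f} cycles, "
      f"A = {A:.0f}, L = {L:.4f} cycles/B")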

Earlier, we developed our model using the assumptions of granularity-independent and granularity-dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity-independent latency, while the off-chip crypto accelerator and the discrete GPU/APU represent granularity-dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).

5 EVALUATION

In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We also list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we also present two case studies. In these studies, we consider the evolution of the interface in SUN/Oracle's crypto accelerators and in three different GPU architectures. In both cases, we elaborate the design changes using the insights LogCA provides.
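To make the curve fits below concrete, the following sketch (ours) evaluates the speedup the model implies, assuming the execution-time forms from §2, T_host(g) = C·g^β and T_acc(g) = o + L + C·g^β / A (with L·g in place of L for granularity-dependent designs), and plugging in the UltraSPARC T2 AES parameters from Table 5:

import numpy as np

def speedup(g, C, beta, o, A, L, dependent=False):
    """Modeled speedup at granularity g (bytes): T_host(g) / T_acc(g)."""
    t_host = C * g**beta
    latency = L * g if dependent else L     # granularity-(in)dependent latency
    t_acc = o + latency + t_host / A
    return t_host / t_acc

# UltraSPARC T2 running AES, parameters from Table 5
# (granularity-independent latency, so dependent=False).
t2_aes = dict(C=90.0, beta=1.01, o=2.9e4, A=19.0, L=1500.0)

for g in [16, 1024, 64 * 1024, 32 * 2**20]:
    print(f"g = {g:>9d} B  speedup = {speedup(g, **t2_aes):6.2f}")

# Break-even granularity g_1 (first g where speedup reaches 1),
# located numerically on a 16B-32MB grid.
grid = np.logspace(4, 25, 4096, base=2.0)
s = speedup(grid, **t2_aes)
g1 = grid[np.argmax(s >= 1.0)]
print(f"g_1 is roughly {g1:.0f} B")

At large g the printed speedup flattens toward A = 19, and at small g the overhead term dominates, which is exactly the shape of the observed curves in Figure 7.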

5.1 Linear-Complexity Kernels (β = 1)

Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through different interfaces ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by the computational intensity for off-chip accelerators and by the acceleration for on-chip accelerators. This observation supports the earlier implications on the limits of speedup for granularity-independent and granularity-dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but it breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with powerful processing cores, has the highest acceleration among them. However, its observed speedup is less than the others due to the high overheads and latencies involved in communicating through the PCIe bus.

We have also marked g_1 and g_{A/2} for each accelerator in Figure 7, which help programmers and architects identify the complexity of the interface. For example, g_1 for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators g_1 lies on the far right. It is worth mentioning that we have marked g_{A/2} for the on-chip accelerators but not for the off-chip accelerators: for the off-chip accelerators the computational intensity is less than the acceleration, and as we have noted in equation (12), g_{A/2} for these designs does not exist. We also observe that g_1 for the crypto card connected through the PCIe bus does not exist, showing that this accelerator does not break even, even for large granularities. Figure 7 also shows that g_1 for the GPU and APU is comparable. This observation shows that, despite being an integrated GPU that is not connected to the PCIe bus, the APU spends considerable time copying data from the host to the device memory.


[Figure 7: Speedup curve-fitting plots comparing LogCA with the observed values of AES [30] on (a) PCIe crypto, (b) NVIDIA discrete GPU, (c) AMD integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, and (h) AES-NI on Sandy Bridge. Each panel plots observed and modeled speedup against granularity (16B to 32MB), marking A, C/L, g_1, and g_{A/2} where they exist.]

[Figure 8: Speedup curve-fitting plots comparing LogCA with the observed values of SHA256 [31] on (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, and (d) SPARC T4 instruction; speedup vs. granularity (bytes). LogCA starts following the observed values after 64B.]

Figure 8 shows the curve fitting for SHA on various on-chip crypto accelerators. We observe that g_1 and g_{A/2} do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values before 64B. This happens because SHA requires a block size of 64B for hash computation; if the block size is less than 64B, it pads extra bits to make the block size 64B. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.

Figure 9-a shows the speedup curve-fitting plots for radix sort. We observe that LogCA does not follow the observed values for smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g_1 and g_{A/2}. We also observe that g_{A/2} for the GPU is higher than for the APU, and this observation supports equation (7): increasing acceleration increases g_{A/2}.

5.2 Super-Linear Complexity Kernels (β > 1)

Figures 9-b and 9-c show the speedup curve-fitting plots for super-linear complexity kernels on the discrete GPU and APU. We observe that matrix multiplication, with higher complexity (O(g^1.7)), achieves higher speedup than sort and FFT, with lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms is greater than that of lower-complexity algorithms. We also observe that g_{A/2} does not exist for FFT. This happens because, as we note in equation (12), for g_{A/2} to exist for FFT, C/L should be greater than A/1.2. However, Figure 9-c shows that C/L is smaller than A/1.2 for both the GPU and APU.

5.3 Sub-Linear Complexity Kernels (β < 1)

Figure 9-d shows the curve fitting for binary search, which is a sub-linear algorithm (β = 0.14). We make three observations. First, g_1 does not exist even for very large granularities, and C/L < 1. This observation supports implication (5) that for a sub-linear algorithm with β = 0.14, C/L should be greater than 7 (since 1/β = 1/0.14 ≈ 7.1) to provide any speedup.


[Figure 9: Speedup curve-fitting plots comparing LogCA with the observed values of (a) Radix Sort, (b) Matrix Multiplication, (c) FFT, and (d) Binary Search, each shown for the discrete GPU and the APU; speedup vs. granularity (bytes), marking A, C/L, g_1, and g_{A/2} where they exist.]

Second, for large granularities the speedup starts decreasing with an increase in granularity. This observation supports our earlier claim in implication (4) that for systems with granularity-dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches. As mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture the behavior of binary search, it still provides an upper bound on the achievable performance.

5.4 Case Studies

Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara (Figure 10 (a)) to the accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent in the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, representing the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand from Figure 10 (a) to (e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high-overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads at 32KB suggests a complex interface, indicating that high-overhead OS calls create a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, and it is also evident from a smaller g_1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not even provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the maximum range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA shows is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.

Figure 11 shows the evolution of memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an AMD integrated GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication for all three architectures is compute-bound (§3.1).


[Figure 10: Optimization regions for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, and (e) SPARC T4 instruction; speedup vs. granularity (bytes). LogCA identifies the design bottlenecks through the LogCA parameters in each optimization region; the bottleneck LogCA suggests in each design is optimized in the next design.]

[Figure 11: Optimization regions for matrix multiplication over a range of granularities on (a) NVIDIA discrete GPU, (b) AMD APU, and (c) HSA-supported AMD integrated GPU; speedup vs. granularity (bytes).]

We also observe that the computational index occupies most of the regions, which signifies maximum optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads at 32KB indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)). There are a few notable differences compared to the discrete GPU: the cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and, with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent in all regions and the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.
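The optimization regions in Figures 10 and 11 can be reproduced mechanically from the model parameters. The sketch below (ours) applies the bottleneck test from §3.2, i.e., a parameter belongs to the region at granularity g if a 10x improvement in it raises the modeled speedup by at least 20%; the time model is the same as in the §5 sketch, and the sample parameters are the discrete-GPU GEMM values from Table 5:

# Same execution-time model as the Section 5 sketch, repeated so this
# snippet runs standalone.
def speedup(g, C, beta, o, A, L, dependent=False):
    t_host = C * g**beta
    t_acc = o + (L * g if dependent else L) + t_host / A
    return t_host / t_acc

def optimization_region(g, C, beta, o, A, L, dependent=True):
    """Parameters whose 10x improvement yields >= 20% gain at granularity g."""
    base = speedup(g, C, beta, o, A, L, dependent)
    tweaked = {
        "o": dict(C=C, beta=beta, o=o / 10, A=A, L=L),   # 10x lower overhead
        "L": dict(C=C, beta=beta, o=o, A=A, L=L / 10),   # 10x lower latency
        "C": dict(C=C * 10, beta=beta, o=o, A=A, L=L),   # 10x computational index
        "A": dict(C=C, beta=beta, o=o, A=A * 10, L=L),   # 10x acceleration
    }
    return [p for p, kw in tweaked.items()
            if speedup(g, dependent=dependent, **kw) >= 1.2 * base]

# Discrete-GPU GEMM parameters from Table 5 (granularity-dependent latency).
gpu_gemm = dict(C=2.0, beta=1.7, o=2e8, A=30.0, L=3e3)
for g in [1024, 64 * 1024, 4 * 2**20, 32 * 2**20]:
    print(f"g = {g:>9d} B  bottlenecks = {optimization_region(g, **gpu_gemm)}")

With these illustrative numbers, overheads and the computational index dominate at small granularities while latency and acceleration take over at large ones, mirroring the region shifts discussed above.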

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture-specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in its complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes the bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture-specific, for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights, early in the design stage, to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES

[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 Spring), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11), 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the International Symposium on High-Performance Computer Architecture, 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report, Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12), 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46), 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15), 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (July 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient SPARC SOC. In IEEE Asian Solid-State Circuits Conference (A-SSCC '07), 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14), 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. PhD Dissertation, Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture, 382-393. https://doi.org/10.1109/HPCA.2011.5749745


LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

IMPLICATION 4 With granularity dependent latency the speedupfor sub-linear algorithms asymptotically decreases with the increasein granularity

The above implication suggests that for sub-linear algorithms onsystems with granularity dependent latency speedup may decreasefor some large granularities This happens because for large granu-larities the communication latency (a linear function of granularity)may be higher than the computation time (a sub-linear function ofgranularity) on the accelerator resulting in a net de-accelerationThis implication is surprising as earlier we observed thatmdashfor sys-tems with granularity independent latencymdashspeedup for all algo-rithms increase with granularity and approaches acceleration forvery large granularities

For very small granularities equation (8) reduces to

limgrarr 0

Speedup(g) ≃ Alowast CAlowast (o+L)+C

(10)

Similar to equation (4) the above equation exposes the increasingeffects of overheads at small granularities Solving equation (8) forg1 using Newtonrsquos method [53]

g1 =C lowast (β minus1) lowast (Aminus1)+Alowasto

C lowastβ lowast (Aminus1)minusAlowastL(11)

For a positive value of g1 equation (11) must satisfy CL gt 1

β

Thus for achieving any speedup for linear algorithms CL should

be at least 1 However for super-linear algorithms a speedup of 1can achieved at values of C

L smaller than 1 whereas for sub-linearalgorithms algorithms C

L must be greater than 1

IMPLICATION 5 With granularity dependent latency computa-tional intensity for sub-linear algorithms should be greater than 1to achieve any gains

Thus for sub-linear algorithms computational index has to begreater than latency to justify offloading the work However forhigher-complexity algorithms computational index can be quitesmall and still be potentially useful to offload

Similarly solving equation (8) using Newtonrsquos method for g A2

gives

g A2=

C lowast (β minus1)+AlowastoC lowastβ minusAlowastL

(12)

For a positive value of g A2 equation (12) must satisfy CA

L gt 1β

Thus for achieving a speedup of A2 CL should be at least lsquoArsquo for

linear algorithms However for super-linear algorithms a speedupof A

2 can achieved at values of CL smaller than lsquoArsquo whereas for

sub-linear algorithms CL must be greater than lsquoArsquo

IMPLICATION 6 With granularity dependent latency accelera-torrsquos computational intensity for sub-linear algorithms should begreater than 1 to achieve speedup of half of the acceleration

The above implication suggests that for achieving half of theacceleration with sub-linear algorithms the computation time on theaccelerator must be greater than latency However for super-linearalgorithms that speedup can be achieved even if the computationtime on accelerator is lower than latency Programmers can usethe above implications to determinemdashearly in the design cyclemdashwhether to put time and effort in porting a code to an accelerator

g1

1

A

CL

limgrarrinfin Speedup(g) = A

CL gt A

Sp

eed

up

g1

1

CL

A

limgrarrinfin Speedup(g) = A

CL lt A

g1

1

CL

A

limgrarrinfin Speedup(g) = CL

CL lt A

Granularity (Bytes)

Sp

eed

up

g1

1

CL

A

limgrarrinfin Speedup(g) lt CL

CL lt A

Granularity (Bytes)

(a) Performance bounds for compute-bound kernels

(b) Performance bounds for latency-bound kernels

Figure 4 LogCA helps in visually identifying (a) compute and(b) latency bound kernels

For example consider a system with a minimum desirable speedupof one half of the acceleration but has a computational intensity ofless than the acceleration With the above implication architectsand programmers can infer early in the design stage that the desiredspeedup can not be achieved for sub-linear and linear algorithmsHowever the desired speedup can be achieved with super-linearalgorithms

We are also interested in quantifying the limits on achievablespeedup due to overheads and latencies To do this we assume ahypothetical accelerator with infinite acceleration and calculate thegranularity (gA) to achieve the peak speedup of lsquoArsquo With this as-sumption the desired speedup of lsquoArsquo is only limited by the overheadsand latencies Solving equation (8) for gA gives

gA =C lowast (β minus1)+Alowasto

C lowastβ minusAlowastL(13)

Surprisingly we find that the above equation is similar to equa-tion (12) ie gA equals g A

2 This observation shows that with a

hypothetical accelerator the peak speedup can now be achieved atthe same granularity as g A

2 This observation also demonstrates that

if g A2is not achievable on a system ie CA

L lt 1β

as per equation(12) then despite increasing the acceleration gA will not be achiev-able and the speedup will still be bounded by the computationalintensity

IMPLICATION 7 If a speedup of A2 is not achievable on an ac-

celerator with acceleration lsquoArsquo despite increasing acceleration toAtilde (where Atilde gt A) the speedup is bounded by the computationalintensity

The above implication helps architects in allocating more re-sources for an efficient interface instead of increasing acceleration

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

No variation

Granularity (Bytes)

Sp

eed

up

(a) Latency

LogCAL110x

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

Granularity (Bytes)

(b) Overheads

LogCAo110x

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

Granularity (Bytes)

(c) Computational Index

LogCAC10x

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

Granularity (Bytes)

(d) Acceleration

LogCAA10x

Figure 5 The effect on speedup of 10x improvement in each LogCA parameter The base case is the speedup of AES [30] on Ultra-SPARC T2

3 APPLICATIONS OF LogCAIn this section we describe the utility of LogCA for visually iden-tifying the performance bounds design bottlenecks and possibleoptimizations to alleviate these bottlenecks

31 Performance BoundsEarlier we have observed that the speedup is bounded by eitheracceleration (equation 3) or the product of computational intensityand gβminus1 (equation 9) Using these observations we classify kernelseither as compute-bound or latency-bound For compute-bound ker-nels the achievable speedup is bounded by acceleration whereas forthe latency-bound kernels the speedup is bounded by computationalintensity Based on this classification a compute-bound kernel caneither be running on a system with granularity independent latencyor has super-linear complexity while running on a system with gran-ularity dependent latency Figure 4-a illustrates these bounds forcompute-bound kernels On the other hand a latency-bound kernelis running on a system with granularity dependent latency and haseither linear or sub-linear complexity Figure 4-b illustrates thesebounds for latency-bound kernels

Programmers and architects can visually identify these boundsand use this information to invest their time and resources in the rightdirection For example for compute-bound kernelsmdashdependingon the operating granularitymdashit may be beneficial to invest moreresources in either increasing acceleration or reducing overheadsHowever for latency-bound kernels optimizing acceleration andoverheads is not that critical but decreasing latency and increasingcomputational index maybe more beneficial

32 Sensitivity AnalysisTo identify the design bottlenecks we perform a sensitivity analysisof the LogCA parameters We consider a parameter a design bottle-neck if a 10x improvement in it provides at lest 20 improvement inspeedup A lsquobottleneckedrsquo parameter also provides an optimizationopportunity To visually identify these bottlenecks we introduceoptimization regions As an example we identify design bottlenecksin UltraSPARC T2rsquos crypto accelerator by varying its individualparameters 2 in Figure 5 (a)-(d)

2We elaborate our methodology for measuring LogCA parameters later (sect 4)

Figure 5 (a) shows the variation (or the lack of) in speedup withthe decrease in latency The resulting gains are negligible and inde-pendent of the granularity as it is a closely coupled accelerator

Figure 5 (b) shows the resulting speedup after reducing overheadsSince the overheads are one-time initialization cost and independentof granularity the per byte setup cost is high at small granularitiesDecreasing these overheads considerably reduces the per byte setupcost and results in significant gains at these smaller granularitiesConversely for larger granularities the per byte setup cost is alreadyamortized so reducing overheads does not provide much gainsThus overhead is a bottleneck at small granularities and provide anopportunity for optimization

Figure 5 (c) shows the effect of increasing the computationalindex The results are similar to optimizing overheads in Figure 5 (b)ie significant gains for small granularities and a gradual decreasein the gains with increasing granularity With the constant overheadsincreasing computational index increases the computation time of thekernel and decreases the per byte setup cost For smaller granularitiesthe reduced per byte setup cost results in significant gains

Figure 5 (d) shows the variation in speedup with increasing peakacceleration The gains are negligible at small granularities andbecome significant for large granularities As mentioned earlierthe per byte setup cost is high at small granularities and it reducesfor large granularities Since increasing peak acceleration does notreduce the per byte setup cost optimizing peak acceleration providesgains only at large granularities

We group these individual sensitivity plots in Figure 6 to buildthe optimization regions As mentioned earlier each region indicatesthe potential of 20 gains with 10x variation of one or more LogCAparameters For the ease of understanding we color these regionsand label them with their respective LogCA parameters For exam-ple the blue colored region labelled lsquooCrsquo (16B to 2KB) indicatesan optimization region where optimizing overheads and computa-tional index is beneficial Similarly the red colored region labelledlsquoArsquo (32KB to 32MB) represents an optimization region where opti-mizing peak acceleration is only beneficial The granularity rangeoccupied by a parameter also identifies the scope of optimizationfor an architect and a programmer For example for UltraSPARCT2 overheads occupy most of the lower granularity suggesting op-portunity for improving the interface Similarly the absence of thelatency parameter suggests little benefits for optimizing latency

We also add horizontal arrows to the optimization regions inFigure 6 to demarcate the start and end of granularity range for each

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

Table 2 Description of the Cryptographic accelerators

Crypto Accelerator PCI Crypto UltraSPARC T2 SPARC T3 SPARC T4 Sandy BridgeProcessor AMD A8-3850 S2 S2 S3 Intel Core i7-2600Frequency 29 GHz 116 GHz 165 GHz 3 GHz 34 GHzOpenSSL version 098o 098o 098o 102 101k 098oKernel Ubuntu 3130-55 Oracle Solaris 11 Oracle Solaris 11 Oracle Solaris 112 Linux2632-504

16 128 g1 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

oC AoCA

oC

A

Granularity (Bytes)

Sp

eed

up

LogCA L110xo110x C10x A10x

Figure 6 Optimization regions for UltraSPARC T2 The pres-ence of a parameter in an optimization region indicates thatit can at least provides 20 gains The horizontal arrow in-dicates the cut-off granularity at which a parameter provides20 gains

parameter For example optimizing acceleration starts providingbenefits from 2KB while optimizing overheads or computationalindex is beneficial up till 32KB These arrows also indicate thecut-off granularity for each parameter These cut-off granularitiesprovide insights to architects and programmers about the designbottlenecks For example high cut-off granularity of 32KB suggestshigh overheads and thus a potential for optimization

4 EXPERIMENTAL METHODOLOGYThis section describes the experimental setup and benchmarks forvalidating LogCA on real machines We also discuss our methodol-ogy for measuring LogCA parameters and performance metrics

Our experimental setup comprises of on-chip and off-chip cryptoaccelerators (Table 2) and three different GPUs (Table 3) The on-chip crypto accelerators include cryptographic units on SunOracleUltraSPARC T2 [40] SPARC T3 [35] SPARC T4 [41] and AES-NI(AES New Instruction) [15] on Sandy Bridge whereas the off-chipaccelerator is a Hifn 7955 chip connected through the PCIe bus [43]The GPUs include a discrete NVIDIA GPU an integrated AMDGPU (APU) and HSA supported integrated GPU

For the on-chip crypto accelerators each core in UltraSPARC T2and SPARC T3 has a physically addressed crypto unit which requiresprivileged DMA calls However the crypto unit on SPARC T4 isintegrated within the pipeline and does not require privileged DMAcalls SPARC T4 also provides non-privileged crypto instructions toaccess the crypto unit Similar to SPARC T4 sandy bridge providesnon-privileged crypto instructionmdashAESNI

Considering the GPUs the discrete GPU is connected throughthe PCIe bus whereas for the APU the GPU is co-located with thehost processor on the same die For the APU the system memoryis partitioned between host and GPU memory This eliminates thePCIe bottleneck of data copying but it still requires copying databetween memories Unlike discrete GPU and APU HSA supportedGPU provides a unified and coherent view of the system memoryWith the host and GPU share the same virtual address space explicitcopying of data between memories is not required

Our workloads consist of encryption hashing and GPU kernelsFor encryption and hashing we have used advanced encryptionstandard (AES) [30] and standard hashing algorithm (SHA) [31]respectively from OpenSSL [34]mdashan open source cryptography li-brary For GPU kernels we use matrix multiplication radix sortFFT and binary search from AMD OpenCL SDK [1] Table 4 we listthe complexities of each kernel both in terms of number of elementsn and granularity g We expect these complexities to remain same inboth cases but we observe that they differ for matrix multiplicationFor example for a square matrix of size n matrix multiplication hascomplexity of O (n3) whereas the complexity in terms of granularityis O (g17) This happens because for matrix multiplicationmdashunlikeothersmdashcomputations are performed on matrices and not vectorsSo offloading a square matrix of size n corresponds to offloading n2

elements which results in the apparent discrepancy in the complexi-ties We also observe that for the granularity range of 16B to 32MBβ = 011 provides a close approximation for log(g)

Table 3 Description of the GPUs

Platform Discrete GPU Integrated APU AMD HSAName Tesla C2070 Radeon HD 6550 Radeon R7Architecture Fermi Beaver Creek KaveriCores 16 5 8Compute Units 448 400 512Clock Freq 15 GHz 600 MHz 720 MHzPeak FLOPS 1 T 480 G 856 GHostProcessor Intel AMD AMD

Xeon E5520 A8-3850 A10-7850KFrequency GHz 227 29 17

For calculating execution times we have used Linux utilities onthe crypto accelerators whereas for the GPUs we have used NVIDIAand AMD OpenCL profilers to compute the setup kernel and datatransfer times and we report the average of one hundred executionsFor verifying the usage of crypto accelerators we use built-in coun-ters in UltraSPARC T2 and T3 [46] SPARC T4 however no longer

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

Table 4 Algorithmic complexity of various kernels with num-ber of elements and granularity The power of g represents β

for each kernel

Kernel Algorithmic ComplexityAdvanced Encryption Standard (AES) O (n) O (g101)Secure Hashing Algorithm (SHA) O (n) O (g097)Matrix Multiplication (GEMM) O (n3) O (g17)Fast Fourier Transform (FFT) O (n logn) O (g12)Radix Sort O (kn) O (g094)Binary Search O (logn) O (g014)

Table 5 Calculated values of LogCA Parameters

LogCA ParametersDevice Benchmark L o C A

(cycles) (cycles) (cyclesB)

Discrete GPU

AES 174Radix Sort 290GEMM 3times103 2times108 2 30FFT 290Binary Search 116

APU

AES 174Radix Sort 290GEMM 15 4times108 2 7FFT 290Binary Search 116

UltraSPARC T2 AES 1500 29times104 90 19SHA 105times103 72 12

SPARC T3 AES 1500 27times104 90 12SHA 105times103 72 10

SPARC T4 AES 500 435 32 12SHA 16times103 32 10

SPARC T4 instr AES 4 111 32 12SHA 1638 32 10

Sandy Bridge AES 3 10 35 6

supports these counters so we use Linux utilities to trace the execu-tion of the crypto instructions [3] We use these execution times todetermine LogCA parameters We calculate these parameters onceand can be later used for different kernels on the same system

For computational index and β we profile the CPU code on thehost by varying the granularity from 16B to 32MB At each granu-larity we measure the execution time and use regression analysisto determine C and β For overheads we use the observation thatfor very small granularities the execution time for a kernel on anaccelerator is dominated by the overheads ie limgrarr0 T1 (g) ≃ oFor acceleration we use different methods for the on-chip accelera-tors and GPUs For on-chip accelerators we calculate accelerationusing equation (3) and the observation that the speedup curve flat-tens out and approaches acceleration for very large granularitiesHowever for the GPUs we do not use equation (3) as it requirescomputing acceleration for each kernel as each application has adifferent access pattern which affects the speedup So we boundthe maximum performance using the peak flops from the devicespecifications We use the ratio of peak GFLOPs on CPU and GPUie A = Peak GFLOPGPU

Peak GFLOPCPU Similar to acceleration we use two different

techniques for calculating latency For the on-chip accelerators we

run micro-benchmarks and use execution time on host and acceler-ators On the other hand for the GPUs we compute latency usingpeak memory bandwidth of the GPU Similar to Meswani et al [29]we use the following equation for measuring data copying time forthe GPUs L = 1

BWpeak

Earlier we develop our model using assumptions of granularityindependent and dependent latencies In our setup we observe thatthe on-chip crypto accelerators and HSA-enabled GPU representaccelerators with granularity independent latency while the off-chipcrypto accelerator and discrete GPUAPU represent the granular-ity dependent accelerators For each accelerator we calculate thespeedup and performance metrics using the respective equations(sect2)

5 EVALUATIONIn this section we show that LogCA closely captures the behavior forboth off and on-chip accelerators We also list the calculate LogCAparameters in Table 5 To demonstrate the utility of our modelwe also present two case studies In these studies we consider theevolution of interface in SUNOraclersquos crypto accelerators and threedifferent GPU architectures In both cases we elaborate the designchanges using the insights LogCA provides

51 Linear-Complexity Kernels (β = 1)Figure 7 shows the curve-fitting of LogCA for AES We considerboth off-chip and on-chip accelerators connected through differentinterfaces ranging from PCIe bus to special instructions We observethat the off-chip accelerators and APU unlike on-chip acceleratorsprovide reasonable speedup only at very large granularities We alsoobserve that the achievable speedup is limited by computationalintensity for off-chip accelerators and acceleration for on-chip accel-erators This observation supports earlier implication on the limitsof speedup for granularity independent and dependent latencies inequation (3) and (9) respectively

Figure 7 also shows that UltraSPARC T2 provides higher speedupsthan Sandy Bridge but it breaks-even at a larger granularity SandyBridge on the other hand breaks-even at very small granularitybut provides limited speedup The discrete GPU with powerful pro-cessing cores has the highest acceleration among others Howeverits observed speedup is less than others due to high overheads andlatencies involved in communicating through the PCIe bus

We have also marked g1 and g A2for each accelerator in Figure 7

which help programmers and architects identify the complexity ofthe interface For example g1 for crypto instructions ie SPARCT4 and Sandy Bridge lies on the extreme left while for the off-chipaccelerators g1 lies on the far right It is worth mentioning that wehave marked g a

2for on-chip accelerators but not for the off-chip

accelerators For off-chip accelerators computational intensity isless than acceleration and as we have noted in equation (12) thatg A

2for these designs does not existWe also observe that g1 for the crypto-card connected through

the PCIe bus does not exist showing that this accelerator does notbreak-even even for large granularities Figure 7 also shows thatg1 for GPU and APU is comparable This observation shows thatdespite being an integrated GPU and not connected to the PCIe bus

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

CL

Sp

eed

up

(a) PCIe crypto

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100A

g1

CL

(b) NVIDIA Discrete GPU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

(c) AMD Integrated GPU (APU)

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

(d) UltraSPARC T2

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

Sp

eed

up

(e) SPARC T3

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

(f) SPARC T4 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

gA2

CL

g1 lt 16B

Granularity (Bytes)

(g) SPARC T4 instruction

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

CL

g1 gA2lt 16B

Granularity (Bytes)

(h) AESNI on Sandy Bridge

observed LogCA

Figure 7 Speedup curve fittings plots comparing LogCA with the observed values of AES [30]

[Figure 8 appears here: four log-log panels of speedup (0.1-100) vs. granularity (16B-32M), marking A, C/L, g_1, and g_{A/2}: (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, (d) SPARC T4 instruction (g_1 < 16B). Legend: observed vs. LogCA.]

Figure 8: Speedup curve-fitting plots comparing LogCA with the observed values of SHA256 [31]. LogCA starts following the observed values after 64B.


Figure 8 shows the curve fitting for SHA on various on-chip crypto accelerators. We observe that g_1 and g_{A/2} do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values below 64B. This happens because SHA requires a block size of 64B for hash computation; if the block size is less than 64B, it pads extra bits to bring the block size to 64B. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.

Figure 9-a shows the speedup curve-fitting plots for radix sort. We observe that LogCA does not follow the observed values for smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g_1 and g_{A/2}. We also observe that g_{A/2} for the GPU is higher than for the APU, and this observation supports equation (7): increasing acceleration increases g_{A/2}.

5.2 Super-Linear Complexity Kernels (β > 1)
Figures 9-b and 9-c show the speedup curve-fitting plots for super-linear complexity kernels on the discrete GPU and APU. We observe that matrix multiplication, with its higher complexity (O(g^1.7)), achieves higher speedup than sort and FFT, with their lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms is greater than that of lower-complexity algorithms. We also observe that g_{A/2} does not exist for FFT. This happens because, as we note in equation (12), for g_{A/2} to exist for FFT, C/L should be greater than A^(1/2). However, Figure 9-c shows that C/L is smaller than A^(1/2) for both the GPU and APU.
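
As a rough check with the calculated parameters from Table 5: the discrete GPU has C/L = 2/(3×10^3) ≈ 7×10^-4 against A^(1/2) = √30 ≈ 5.5, and the APU has C/L = 2/15 ≈ 0.13 against √7 ≈ 2.6, so C/L falls short of the threshold in both cases.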

5.3 Sub-Linear Complexity Kernels (β < 1)
Figure 9-d shows the curve fitting for binary search, a sub-linear algorithm (β = 0.14). We make three observations. First, g_1 does not exist even for very large granularities, as C/L < 1. This observation supports implication (5) that, for a sub-linear algorithm with β = 0.14, C/L should be greater than 7 to provide any speedup.

[Figure 9 appears here: log-log panels of speedup vs. granularity (16B-32M) on the GPU and APU for each kernel, marking A, C/L, and, where they exist, g_1 and g_{A/2}; g_{A/2} is absent for FFT, and g_1 is absent for binary search. Legend: observed vs. LogCA.]

Figure 9: Speedup curve-fitting plots comparing LogCA with the observed values of (a) Radix Sort, (b) Matrix Multiplication, (c) FFT, and (d) Binary Search.

Second, at large granularities the speedup starts decreasing with further increases in granularity. This observation supports our earlier claim in implication (4) that, for systems with granularity-dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches. As mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search's behavior, it still provides an upper bound on the achievable performance.

5.4 Case Studies
Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara machines (Figure 10 (a)) to accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent in the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, which represents the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand from Figure 10 (a) to (e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for the SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high-overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads, at 32KB, suggests a complex interface, indicating that high-overhead OS calls create a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, which is also evident from a smaller g_1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the widest range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA shows is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.

Figure 11 shows the evolution of memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an AMD integrated GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication is compute-bound on all three architectures (§3.1).

[Figure 10 appears here: five log-log panels of speedup (0.1-1000) vs. granularity (16B-32M) with optimization regions labeled by the parameters (o, L, C, A) that yield at least 20% gains, and g_1 marked: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, (e) SPARC T4 instruction. Legend: LogCA, L -> L/10, o -> o/10, C -> 10C, A -> 10A.]

Figure 10: LogCA for performing the Advanced Encryption Standard on various crypto accelerators. LogCA identifies the design bottlenecks through the LogCA parameters in an optimization region. The bottleneck which LogCA identifies in each design is optimized in the next design.

[Figure 11 appears here: three log-log panels of speedup (0.1-1000) vs. granularity (16B-32M) with optimization regions labeled by the parameters (o, L, C, A) and g_1 marked: (a) NVIDIA discrete GPU, (b) AMD integrated GPU (APU), (c) HSA-supported AMD integrated GPU. Legend: LogCA, L -> L/10, o -> o/10, C -> 10C, A -> 10A.]

Figure 11: Various optimization regions for matrix multiplication over a range of granularities on (a) the NVIDIA discrete GPU, (b) the AMD APU, and (c) the HSA-supported GPU.

We also observe that the computational index occupies most of the regions, which signifies maximum optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads, at 32KB, indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)). There are a few notable differences compared to the discrete GPU: the cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and, with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent in all regions and the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.
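
The cut-off granularities quoted throughout these case studies follow mechanically from the model: a parameter's optimization region is wherever a 10x improvement in that parameter alone buys at least 20% more speedup (the §3.2 criterion). The sketch below, again our own illustration under the same model assumptions as the earlier sketches, recovers such regions for the UltraSPARC T2 parameters.

    # Sketch (ours; same model assumptions as the earlier sketches): find the
    # granularities at which a 10x improvement in a single LogCA parameter
    # yields at least a 20% speedup gain, i.e., that parameter's
    # "optimization region" in the sense of Section 3.2.

    def speedup(g, o, L, C, A, beta, dep):
        t0 = C * g**beta
        return t0 / (o + (L * g if dep else L) + t0 / A)

    def optimization_region(param, base, beta, dep, gain=1.2, factor=10.0):
        tweaked = dict(base)
        # o and L improve by shrinking; C and A improve by growing.
        tweaked[param] = base[param] / factor if param in ("o", "L") else base[param] * factor
        region = []
        for exp in range(4, 26):                           # 16B .. 32MB
            g = float(2**exp)
            if (speedup(g, beta=beta, dep=dep, **tweaked)
                    >= gain * speedup(g, beta=beta, dep=dep, **base)):
                region.append(g)
        return region

    base = dict(o=2.9e4, L=1500.0, C=90.0, A=19.0)         # AES on UltraSPARC T2
    for p in ("o", "L", "C", "A"):
        r = optimization_region(p, base, beta=1.01, dep=False)
        print(p, "region:", (min(r), max(r)) if r else "no 20% gain in range")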

6 RELATED WORK
We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage modeling of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture specific, for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK
With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should be extended to handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS
We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. Advanced Micro Devices. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, Spring Joint Computer Conference - AFIPS '67 (Spring) (1967), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (oct 2009), 56-67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1-5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (dec 2010), 225-236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design. Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1-12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (jan 2013), 77-77. https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141-149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceeding of the 38th annual international symposium on Computer architecture - ISCA '11. ACM Press, New York, New York, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1-3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings - International Symposium on High-Performance Computer Architecture. 503-514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. Proceedings of the 37th annual international symposium on Computer architecture - ISCA '10 (2010), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (jul 2008), 33-38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37. 152-163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38. 280-289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par'12). Springer-Verlag, Berlin, Heidelberg, 920-932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468-479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129-140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361-372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (jul 2014), 57-60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building intuition. Springer, 81-100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. Parallel and Distributed Systems, IEEE Transactions on 26, 1 (jan 2015), 272-281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89-108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (jan 2013), 25-28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203-212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chip Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (jul 2008), 4-12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chip Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. Proceedings of the 10th MEDEA workshop on MEmory performance: DEaling with Applications, systems and architecture - MEDEA '09 (2009), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits Conference, 2007. ASSCC '07. IEEE Asian. 22-25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8-19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401, for Std. PCI-sockets. Soekris Engineering. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium, IPDPS 2013. 673-686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chip Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (nov 2014), 577-587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. 1131-1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS. 205-218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 '11. 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom). 344-350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65-76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255-268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. Ph.D. Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings - International Symposium on High-Performance Computer Architecture. 382-393. https://doi.org/10.1109/HPCA.2011.5749745

The applicability of LogCA can be limited by our simplifying as-sumptions and for more realistic analysis we plan to overcome theselimitations in our future work For example We also assume a singleaccelerator system and do not explicitly model contention amongresources Our model should handle multi-accelerator and pipelinedscenarios For fixed function accelerators our design space is cur-rently limited to encryption and hashing kernels To overcome thiswe are expanding our design space with compression and databaseaccelerators in Oracle M7 processor We also plan to complementLogCA with an energy model as energy efficiency is a prime designmetric for accelerators

ACKNOWLEDGEMENTSWe thank our anonymous reviewers Arkaprava Basu and TonyNowatzki for their insightful comments and feedback on the paperThanks to Mark Hill Michael Swift Wisconsin Computer Archi-tecture Affiliates and other members of the Multifacet group fortheir valuable discussions We also thank Brian Wilson at Universityof Wisconsin DoIT and Eric Sedlar at Oracle Labs for providingaccess to SPARC T3 and T4 servers respectively Also thanks toMuhammad Umair Bin Altaf for his help in the formulation Thiswork is supported in part by the National Science Foundation (CNS-1302260 CCF-1438992 CCF-1533885 CCF- 1617824) Googleand the University of Wisconsin-Madison (Amar and Balindar SohiProfessorship in Computer Science) Wood has a significant financialinterest in AMD and Google

REFERENCES[1] Advanced Micro Devices 2016 APP SDK - A Complete Development Platform

Advanced Micro Devices httpdeveloperamdcomtools-and-sdksopencl-zoneamd-accelerated-parallel-processing-app-sdk

[2] Gene M Amdahl 1967 Validity of the single processor approach to achievinglarge scale computing capabilities Proceedings of the April 18-20 1967 springjoint computer conference on - AFIPS rsquo67 (Spring) (1967) 483 httpsdoiorg10114514654821465560

[3] Dan Anderson 2012 How to tell if SPARC T4 crypto is being used httpsblogsoraclecomDanXentryhow_to_tell_if_sparc

[4] Krste Asanovic Rastislav Bodik James Demmel Tony Keaveny Kurt KeutzerJohn Kubiatowicz Nelson Morgan David Patterson Koushik Sen John

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

Wawrzynek David Wessel and Katherine Yelick 2009 A View of the Par-allel Computing Landscape Commun ACM 52 10 (oct 2009) 56ndash67 httpsdoiorg10114515627641562783

[5] Nathan Beckmann and Daniel Sanchez 2016 Cache Calculus Modeling Cachesthrough Differential Equations Computer Architecture Letters PP 99 (2016)1 httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=7366753$delimiter026E30F$npapers3publicationdoi101109LCA20152512873

[6] C Cascaval S Chatterjee H Franke K J Gildea and P Pattnaik 2010 Ataxonomy of accelerator architectures and their programming models IBMJournal of Research and Development 54 (2010) 51ndash510 httpsdoiorg101147JRD20102059721

[7] Eric S Chung Peter a Milder James C Hoe and Ken Mai 2010 Single-ChipHeterogeneous Computing Does the Future Include Custom Logic FPGAs andGPGPUs 2010 43rd Annual IEEEACM International Symposium on Microar-chitecture (dec 2010) 225ndash236 httpsdoiorg101109MICRO201036

[8] Jason Cong Zhenman Fang Michael Gill and Glenn Reinman 2015 PARADEA Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Archi-tectural Design and Exploration In 2015 IEEEACM International Conference onComputer-Aided Design Austin TX

[9] D Culler R Karp D Patterson and A Sahay 1993 LogP Towards a realisticmodel of parallel computation In Proceedings of the Fourth ACM SIGPLANSymposium on Principles and Practice of Parallel Programming 1ndash12 httpdlacmorgcitationcfmid=155333

[10] Bruno da Silva An Braeken Erik H DrsquoHollander and Abdellah Touhafi 2013Performance Modeling for FPGAs Extending the Roofline Model with High-level Synthesis Tools Int J Reconfig Comput 2013 (jan 2013) 77mdash-77httpsdoiorg1011552013428078

[11] Mayank Daga Ashwin M Aji and Wu-chun Feng 2011 On the Efficacy of aFused CPU+GPU Processor (or APU) for Parallel Computing In 2011 Symposiumon Application Accelerators in High-Performance Computing Ieee 141ndash149httpsdoiorg101109SAAHPC201129

[12] Hadi Esmaeilzadeh Emily Blem Renee St Amant Karthikeyan Sankaralingamand Doug Burger 2011 Dark silicon and the end of multicore scaling InProceeding of the 38th annual international symposium on Computer archi-tecture - ISCA rsquo11 ACM Press New York New York USA 365 httpsdoiorg10114520000642000108

[13] H Franke J Xenidis C Basso B M Bass S S Woodward J D Brown andC L Johnson 2010 Introduction to the wire-speed processor and architectureIBM Journal of Research and Development 54 (2010) 31ndash311 httpsdoiorg101147JRD20092036980

[14] Venkatraman Govindaraju Chen Han Ho and Karthikeyan Sankaralingam 2011Dynamically specialized datapaths for energy efficient computing In Proceedings- International Symposium on High-Performance Computer Architecture 503ndash514httpsdoiorg101109HPCA20115749755

[15] Shay Gueron 2012 Intel Advanced Encryption Standard (AES) Instructions SetTechnical Report Intel Corporation httpssoftwareintelcomsitesdefaultfilesarticle165683aes-wp-2012-09-22-v01pdf

[16] Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex SolomatnikovBenjamin C Lee Stephen Richardson Christos Kozyrakis and Mark Horowitz2010 Understanding sources of inefficiency in general-purpose chips Proceed-ings of the 37th annual international symposium on Computer architecture - ISCA

rsquo10 (2010) 37 httpsdoiorg10114518159611815968[17] Mark Hempstead Gu-Yeon Wei and David Brooks 2009 Navigo An early-

stage model to study power-constrained architectures and specialization In ISCAWorkshop on Modeling Benchmarking and Simulations (MoBS)

[18] John L Hennessy and David A Patterson 2006 Computer Architecture FourthEdition A Quantitative Approach 704 pages httpsdoiorg10111151881

[19] Mark D Hill and Michael R Marty 2008 Amdahlrsquos Law in the Multicore EraComputer 41 7 (jul 2008) 33ndash38 httpsdoiorg101109MC2008209

[20] Sunpyo Hong and Hyesoon Kim 2009 An analytical model for a GPU architec-ture with memory-level and thread-level parallelism awareness In Proceedingsof the 36th Annual International Symposium on Computer Architecture Vol 37152ndash163 httpsdoiorg10114515558151555775

[21] Sunpyo Hong and Hyesoon Kim 2010 An integrated GPU power and perfor-mance model In Proceedings of the 37th Annual International Symposium onComputer Architecture Vol 38 280mdash-289 httpsdoiorg10114518160381815998

[22] Haipeng Jia Yunquan Zhang Guoping Long Jianliang Xu Shengen Yan andYan Li 2012 GPURoofline A Model for Guiding Performance Optimizations onGPUs In Proceedings of the 18th International Conference on Parallel Processing(Euro-Parrsquo12) Springer-Verlag Berlin Heidelberg 920ndash932 httpsdoiorg101007978-3-642-32820-6_90

[23] Onur Kocberber Boris Grot Javier Picorel Babak Falsafi Kevin Lim andParthasarathy Ranganathan 2013 Meet the Walkers Accelerating Index Traver-sals for In-memory Databases In Proceedings of the 46th Annual IEEEACMInternational Symposium on Microarchitecture (MICRO-46) ACM New YorkNY USA 468ndash479 httpsdoiorg10114525407082540748

[24] Karthik Kumar Jibang Liu Yung Hsiang Lu and Bharat Bhargava 2013 Asurvey of computation offloading for mobile systems Mobile Networks andApplications 18 (2013) 129ndash140 httpsdoiorg101007s11036-012-0368-0

[25] Snehasish Kumar Naveen Vedula Arrvindh Shriraman and Vijayalakshmi Srini-vasan 2015 DASX Hardware Accelerator for Software Data Structures InProceedings of the 29th ACM on International Conference on Supercomputing(ICS rsquo15) ACM New York NY USA 361ndash372 httpsdoiorg10114527512052751231

[26] Maysam Lavasani Hari Angepat and Derek Chiou 2014 An FPGA-based In-Line Accelerator for Memcached IEEE Comput Archit Lett 13 2 (jul 2014)57ndash60 httpsdoiorg101109L-CA201317

[27] John D C Little and Stephen C Graves 2008 Littlersquos law In Building intuitionSpringer 81ndash100

[28] U Lopez-Novoa A Mendiburu and J Miguel-Alonso 2015 A Survey of Perfor-mance Modeling and Simulation Techniques for Accelerator-Based ComputingParallel and Distributed Systems IEEE Transactions on 26 1 (jan 2015) 272ndash281httpsdoiorg101109TPDS20142308216

[29] M R Meswani L Carrington D Unat A Snavely S Baden and S Poole2013 Modeling and predicting performance of high performance computingapplications on hardware accelerators International Journal of High Perfor-mance Computing Applications 27 (2013) 89ndash108 httpsdoiorg1011771094342012468180

[30] National Institute of Standards and Technology 2001 Advanced EncryptionStandard (AES) National Institute of Standards and Technology httpsdoiorg106028NISTFIPS197

[31] National Institute of Standards and Technology 2008 Secure Hash StandardNational Institute of Standards and Technology httpcsrcnistgovpublicationsfipsfips180-3fips180-3_finalpdf

[32] S Nilakantan S Battle and M Hempstead 2013 Metrics for Early-Stage Model-ing of Many-Accelerator Architectures Computer Architecture Letters 12 1 (jan2013) 25ndash28 httpsdoiorg101109L-CA20129

[33] Cedric Nugteren and Henk Corporaal 2012 The Boat Hull Model EnablingPerformance Prediction for Parallel Computing Prior to Code Development Cate-gories and Subject Descriptors In Proceedings of the 9th Conference on Comput-ing Frontiers ACM 203mdash-212

[34] OpenSSL Software Foundation 2015 OpenSSL Cryptography and SSLTLSToolkit OpenSSL Software Foundation httpsopensslorg

[35] Sanjay Patel 2009 Sunrsquos Next-Generation Multithreaded Processor RainbowFalls In 21st Hot Chip Symposium httpwwwhotchipsorgwp-contentuploadshc

[36] Sanjay Patel and Wen-mei W Hwu 2008 Accelerator Architectures IEEE Micro28 4 (jul 2008) 4ndash12 httpsdoiorg101109MM200850

[37] Stephen Phillips 2014 M7 Next Generation SPARC In 26th Hot Chip Sympo-sium

[38] Phil Rogers 2013 Heterogeneous system architecture overview In Hot Chips[39] Yoshiei Sato Ryuichi Nagaoka Akihiro Musa Ryusuke Egawa Hiroyuki Tak-

izawa Koki Okabe and Hiroaki Kobayashi 2009 Performance tuning andanalysis of future vector processors based on the roofline model Proceedingsof the 10th MEDEA workshop on MEmory performance DEaling with Applica-tions systems and architecture - MEDEA rsquo09 (2009) 7 httpsdoiorg10114516219601621962

[40] M Shah J Barren J Brooks R Golla G Grohoski N Gura R Hetherington PJordan M Luttrell C Olson B Sana D Sheahan L Spracklen and A Wynn2007 UltraSPARC T2 A highly-treaded power-efficient SPARC SOC InSolid-State Circuits Conference 2007 ASSCC rsquo07 IEEE Asian 22ndash25 httpsdoiorg101109ASSCC20074425786

[41] Manish Shah Robert Golla Gregory Grohoski Paul Jordan Jama Barreh JeffreyBrooks Mark Greenberg Gideon Levinsky Mark Luttrell Christopher OlsonZeid Samoail Matt Smittle and Thomas Ziaja 2012 Sparc T4 A dynamicallythreaded server-on-a-chip IEEE Micro 32 (2012) 8ndash19 httpsdoiorg101109MM20121

[42] Yakun Sophia Shao Brandon Reagen Gu-Yeon Wei and David Brooks 2014Aladdin A Pre-RTL Power-Performance Accelerator Simulator Enabling LargeDesign Space Exploration of Customized Architectures In International Sympo-sium on Computer Architecture (ISCA)

[43] Soekris Engineering 2016 vpn 1401 for Std PCI-sockets Soekris Engineeringhttpsoekriscomproductsvpn-1401html

[44] Shuaiwen Song Chunyi Su Barry Rountree and Kirk W Cameron 2013 Asimplified and accurate model of power-performance efficiency on emergent GPUarchitectures In Proceedings - IEEE 27th International Parallel and DistributedProcessing Symposium IPDPS 2013 673ndash686 httpsdoiorg101109IPDPS201373

[45] Jeff Stuecheli 2013 POWER8 In 25th Hot Chip Symposium[46] Ning Sun and Chi-Chang Lin 2007 Using the Cryptographic Accelerators in the

UltraSPARC T1 and T2 processors Technical Report httpwwworaclecomtechnetworkserver-storagesolarisdocumentation819-5782-150147pdf

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

[47] S Tabik G Ortega and E M Garzoacuten 2014 Performance evaluation of ker-nel fusion BLAS routines on the GPU iterative solvers as case study TheJournal of Supercomputing 70 2 (nov 2014) 577ndash587 httpsdoiorg101007s11227-014-1102-4

[48] Y C Tay 2013 Analytical Performance Modeling for Computer Systems (2nded) Morgan amp Claypool Publishers

[49] MB Taylor 2012 Is dark silicon useful harnessing the four horsemen of thecoming dark silicon apocalypse In Design Automation Conference (DAC) 201249th ACMEDACIEEE 1131ndash1136 httpsdoiorg10114522283602228567

[50] G Venkatesh J Sampson N Goulding S Garcia V Bryksin J Lugo-MartinezS Swanson and M B Taylor 2010 Conservation cores Reducing the energyof mature computations In International Conference on Architectural Supportfor Programming Languages and Operating Systems - ASPLOS 205ndash218 httpsdoiorg10114517360201736044

[51] Ganesh Venkatesh Jack Sampson Nathan Goulding-Hotta Sravanthi KotaVenkata Michael Bedford Taylor and Steven Swanson 2011 QsCores Trad-ing Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores InProceedings of the 44th Annual IEEEACM International Symposium on Microar-chitecture - MICRO-44 rsquo11 163 httpsdoiorg10114521556202155640

[52] Guibin Wang Yisong Lin and Wei Yi 2010 Kernel Fusion An EffectiveMethod for Better Power Efficiency on Multithreaded GPU In Green Computingand Communications (GreenCom) 2010 IEEEACM Intrsquol Conference on CyberPhysical and Social Computing (CPSCom) 344ndash350 httpsdoiorg101109GreenCom-CPSCom2010102

[53] Eric W Weisstein 2015 Newtonrsquos Method From MathWorld ndash A Wolfram WebResource httpmathworldwolframcomNewtonsMethodhtml

[54] Samuel Williams Andrew Waterman and David Patterson 2009 Roofline aninsightful visual performance model for multicore architectures Commun ACM52 (2009) 65ndash76 httpsdoiorg10114514987651498785

[55] Lisa Wu Andrea Lottarini Timothy K Paine Martha A Kim and Kenneth ARoss 2014 Q100 The Architecture and Design of a Database Processing UnitIn Proceedings of the 19th International Conference on Architectural Supportfor Programming Languages and Operating Systems (ASPLOS rsquo14) ACM NewYork NY USA 255ndash268 httpsdoiorg10114525419402541961

[56] Moein Pahlavan Yali 2014 FPGA-Roofline An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems PhD DissertationVirginia Polytechnic Institute and State University

[57] Yao Zhang and John D Owens 2011 A quantitative performance analysismodel for GPU architectures In Proceedings - International Symposium on High-Performance Computer Architecture 382ndash393 httpsdoiorg101109HPCA20115749745

  • Abstract
  • 1 Introduction
  • 2 The LogCA Model
    • 21 Effect of Granularity
    • 22 Performance Metrics
    • 23 Granularity dependent latency
      • 3 Applications of LogCA
        • 31 Performance Bounds
        • 32 Sensitivity Analysis
          • 4 Experimental Methodology
          • 5 Evaluation
            • 51 Linear-Complexity Kernels (= 1)
            • 52 Super-Linear Complexity Kernels (gt 1)
            • 53 Sub-Linear Complexity Kernels (lt 1)
            • 54 Case Studies
              • 6 Related Work
              • 7 Conclusion and Future Work
              • References

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

Table 2: Description of the cryptographic accelerators.

Crypto Accelerator   PCI Crypto        UltraSPARC T2      SPARC T3           SPARC T4             Sandy Bridge
Processor            AMD A8-3850       S2                 S2                 S3                   Intel Core i7-2600
Frequency            2.9 GHz           1.16 GHz           1.65 GHz           3 GHz                3.4 GHz
OpenSSL version      0.9.8o            0.9.8o             0.9.8o             1.0.2, 1.0.1k        0.9.8o
Kernel               Ubuntu 3.13.0-55  Oracle Solaris 11  Oracle Solaris 11  Oracle Solaris 11.2  Linux 2.6.32-504

[Figure 6: Optimization regions for UltraSPARC T2, plotting speedup against granularity (16B-32MB) with the marked values A and C/L and curves for LogCA, L (1/10x), o (1/10x), C (10x), and A (10x). The presence of a parameter in an optimization region indicates that it can provide at least 20% gains. The horizontal arrow indicates the cut-off granularity at which a parameter provides 20% gains.]

parameter. For example, optimizing acceleration starts providing benefits from 2KB, while optimizing overheads or the computational index is beneficial up to 32KB. These arrows also indicate the cut-off granularity for each parameter. These cut-off granularities provide insights to architects and programmers about the design bottlenecks. For example, the high cut-off granularity of 32KB suggests high overheads and thus a potential for optimization.
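The optimization regions of Figure 6 can be recomputed from the LogCA parameters alone. The sketch below is one way to do so, assuming the granularity-independent speedup form Speedup(g) = Cg^β / (o + L + Cg^β/A) (our reading of the §2 model) and the 20% criterion from the caption; the parameter values are the UltraSPARC T2 AES entries of Table 5.

```python
import numpy as np

def speedup(g, C, beta, o, L, A):
    # Granularity-independent latency form sketched from Section 2:
    # T_0 = C*g**beta, T_1 = o + L + C*g**beta / A.
    return (C * g**beta) / (o + L + C * g**beta / A)

def optimization_region(params, vary, factor, gain=1.2):
    """Granularities where scaling parameter `vary` by `factor` improves
    speedup by at least 20% -- the criterion used in Figure 6."""
    gs = np.logspace(4, 25, num=2000, base=2)          # 16B .. 32MB
    tweaked = dict(params, **{vary: params[vary] * factor})
    mask = speedup(gs, **tweaked) / speedup(gs, **params) >= gain
    return gs[mask]

# UltraSPARC T2 AES parameters from Table 5.
p = dict(C=90, beta=1.01, o=2.9e4, L=1500, A=19)
for vary, factor in [('o', 0.1), ('L', 0.1), ('C', 10), ('A', 10)]:
    region = optimization_region(p, vary, factor)
    if region.size:
        print(f"{vary}: 20% gains between {region.min():.0f}B and {region.max():.0f}B")
```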

4 EXPERIMENTAL METHODOLOGY

This section describes the experimental setup and benchmarks for validating LogCA on real machines. We also discuss our methodology for measuring LogCA parameters and performance metrics.

Our experimental setup comprises on-chip and off-chip crypto accelerators (Table 2) and three different GPUs (Table 3). The on-chip crypto accelerators include the cryptographic units on Sun/Oracle UltraSPARC T2 [40], SPARC T3 [35], and SPARC T4 [41], and AES-NI (AES New Instructions) [15] on Sandy Bridge, whereas the off-chip accelerator is a Hifn 7955 chip connected through the PCIe bus [43]. The GPUs include a discrete NVIDIA GPU, an integrated AMD GPU (APU), and an HSA supported integrated GPU.

For the on-chip crypto accelerators, each core in UltraSPARC T2 and SPARC T3 has a physically addressed crypto unit which requires privileged DMA calls. However, the crypto unit on SPARC T4 is integrated within the pipeline and does not require privileged DMA calls; SPARC T4 also provides non-privileged crypto instructions to access the crypto unit. Similar to SPARC T4, Sandy Bridge provides a non-privileged crypto instruction, AES-NI.

Considering the GPUs, the discrete GPU is connected through the PCIe bus, whereas for the APU the GPU is co-located with the host processor on the same die. For the APU, the system memory is partitioned between host and GPU memory. This eliminates the PCIe bottleneck of data copying, but it still requires copying data between memories. Unlike the discrete GPU and the APU, the HSA supported GPU provides a unified and coherent view of the system memory. With the host and GPU sharing the same virtual address space, explicit copying of data between memories is not required.

Our workloads consist of encryption, hashing, and GPU kernels. For encryption and hashing, we have used the Advanced Encryption Standard (AES) [30] and the Secure Hashing Algorithm (SHA) [31], respectively, from OpenSSL [34], an open source cryptography library. For GPU kernels, we use matrix multiplication, radix sort, FFT, and binary search from the AMD OpenCL SDK [1]. In Table 4 we list the complexities of each kernel, both in terms of the number of elements n and the granularity g. We expect these complexities to remain the same in both cases, but we observe that they differ for matrix multiplication. For example, for a square matrix of size n, matrix multiplication has complexity O(n^3), whereas the complexity in terms of granularity is O(g^1.7). This happens because for matrix multiplication, unlike the others, computations are performed on matrices and not vectors: offloading a square matrix of size n corresponds to offloading n^2 elements (so g is proportional to n^2, and O(n^3) work corresponds to roughly O(g^1.5) in bytes), which results in the apparent discrepancy in the complexities. We also observe that for the granularity range of 16B to 32MB, β = 0.11 provides a close approximation for log(g).

Table 3: Description of the GPUs.

Platform          Discrete GPU      Integrated APU   AMD HSA
Name              Tesla C2070       Radeon HD 6550   Radeon R7
Architecture      Fermi             Beaver Creek     Kaveri
Cores             16                5                8
Compute Units     448               400              512
Clock Freq.       1.5 GHz           600 MHz          720 MHz
Peak FLOPS        1 T               480 G            856 G
Host Processor    Intel Xeon E5520  AMD A8-3850      AMD A10-7850K
Frequency (GHz)   2.27              2.9              1.7

For calculating execution times, we have used Linux utilities on the crypto accelerators, whereas for the GPUs we have used the NVIDIA and AMD OpenCL profilers to compute the setup, kernel, and data transfer times, and we report the average of one hundred executions. For verifying the usage of the crypto accelerators, we use built-in counters in UltraSPARC T2 and T3 [46]. SPARC T4, however, no longer


Table 4: Algorithmic complexity of various kernels with number of elements and granularity. The power of g represents β for each kernel.

Kernel                               In elements n   In granularity g
Advanced Encryption Standard (AES)   O(n)            O(g^1.01)
Secure Hashing Algorithm (SHA)       O(n)            O(g^0.97)
Matrix Multiplication (GEMM)         O(n^3)          O(g^1.7)
Fast Fourier Transform (FFT)         O(n log n)      O(g^1.2)
Radix Sort                           O(kn)           O(g^0.94)
Binary Search                        O(log n)        O(g^0.14)

Table 5: Calculated values of LogCA parameters.

Device           Benchmark      L (cycles)  o (cycles)  C (cycles/B)  A
Discrete GPU     AES            3×10^3      2×10^8      174           30
                 Radix Sort                             290
                 GEMM                                   2
                 FFT                                    290
                 Binary Search                          116
APU              AES            15          4×10^8      174           7
                 Radix Sort                             290
                 GEMM                                   2
                 FFT                                    290
                 Binary Search                          116
UltraSPARC T2    AES            1500        2.9×10^4    90            19
                 SHA                        1.05×10^3   72            12
SPARC T3         AES            1500        2.7×10^4    90            12
                 SHA                        1.05×10^3   72            10
SPARC T4         AES            500         435         32            12
                 SHA                        1.6×10^3    32            10
SPARC T4 instr.  AES            4           111         32            12
                 SHA                        1638        32            10
Sandy Bridge     AES            3           10          35            6

supports these counters, so we use Linux utilities to trace the execution of the crypto instructions [3]. We use these execution times to determine the LogCA parameters. We calculate these parameters once; they can later be used for different kernels on the same system.

For the computational index and β, we profile the CPU code on the host by varying the granularity from 16B to 32MB. At each granularity, we measure the execution time and use regression analysis to determine C and β. For overheads, we use the observation that for very small granularities the execution time for a kernel on an accelerator is dominated by the overheads, i.e., lim_{g→0} T_1(g) ≃ o. For acceleration, we use different methods for the on-chip accelerators and the GPUs. For on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches the acceleration for very large granularities. However, for the GPUs we do not use equation (3), as it would require computing acceleration for each kernel: each application has a different access pattern, which affects the speedup. Instead, we bound the maximum performance using the peak FLOPS from the device specifications and use the ratio of peak GFLOPS on the GPU and CPU, i.e., A = Peak GFLOPS_GPU / Peak GFLOPS_CPU.

Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and the accelerators. For the GPUs, on the other hand, we compute latency using the peak memory bandwidth of the GPU. Similar to Meswani et al. [29], we use the following equation for measuring the data copying time for the GPUs: L = 1 / BW_peak.
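As a concrete illustration of this measurement flow, the following sketch fits C and β by least squares in log space and derives o, A, and L as described above. The timing arrays and peak numbers are hypothetical stand-ins, not measured values.

```python
import numpy as np

# Hypothetical host timings: granularity (bytes) -> execution time (cycles).
g = np.array([16, 64, 256, 1024, 4096, 16384, 65536], dtype=float)
t_host = np.array([1.5e3, 6.2e3, 2.5e4, 1.0e5, 4.1e5, 1.7e6, 6.7e6])

# C and beta from regression in log space: log t = log C + beta * log g.
beta, logC = np.polyfit(np.log(g), np.log(t_host), 1)
C = np.exp(logC)

# o: for g -> 0 the accelerated time is overhead-dominated, so the
# smallest-granularity accelerated timing approximates o.
o = 2.9e4                      # e.g., T_1(16B) measured on the accelerator

# A for GPUs: ratio of peak FLOPS from the device specifications.
A = 1.0e12 / 33.0e9            # hypothetical GPU vs. host peak

# L for GPUs: copying time per byte from peak bandwidth, L = 1/BW_peak
# (multiplied by the clock frequency to express it in cycles per byte).
L = 1.5e9 / 8.0e9              # cycles per byte at 1.5 GHz over ~8 GB/s
print(C, beta, o, A, L)
```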

Earlier, we developed our model using the assumptions of granularity independent and granularity dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity independent latency, while the off-chip crypto accelerator and the discrete GPU/APU represent granularity dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).
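For reference, a minimal sketch of the two speedup forms we evaluate against, assuming the §2 model T_0(g) = Cg^β and T_1(g) = o + latency + Cg^β/A:

```python
def speedup(g, C, beta, o, L, A, dep_latency):
    """Speedup T_0/T_1. The latency term is a fixed L for granularity-
    independent designs (on-chip crypto units, HSA-enabled GPU) and L*g
    for granularity-dependent ones (PCIe crypto card, discrete GPU, APU)."""
    latency = L * g if dep_latency else L
    return (C * g**beta) / (o + latency + C * g**beta / A)
```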

5 EVALUATION

In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We also list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we also present two case studies. In these studies, we consider the evolution of the interface in Sun/Oracle's crypto accelerators and in three different GPU architectures. In both cases, we elaborate the design changes using the insights LogCA provides.

5.1 Linear-Complexity Kernels (β = 1)

Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through interfaces ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by the computational intensity for off-chip accelerators and by the acceleration for on-chip accelerators. This observation supports our earlier implications on the limits of speedup for granularity independent and dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but it breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with its powerful processing cores, has the highest acceleration among these designs. However, its observed speedup is lower than the others' due to the high overheads and latencies involved in communicating through the PCIe bus.

We have also marked g_1 and g_{A/2} for each accelerator in Figure 7, which help programmers and architects identify the complexity of the interface. For example, g_1 for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators g_1 lies on the far right. It is worth mentioning that we have marked g_{A/2} for the on-chip accelerators but not for the off-chip accelerators: for the off-chip accelerators, the computational intensity is less than the acceleration, and as we have noted in equation (12), g_{A/2} for these designs does not exist. We also observe that g_1 for the crypto card connected through the PCIe bus does not exist, showing that this accelerator does not break even even for large granularities. Figure 7 also shows that g_1 for the GPU and the APU is comparable. This observation shows that, despite being an integrated GPU and not connected to the PCIe bus, the APU spends considerable time in copying data from the host to device memory.

[Figure 7: Speedup curve fitting plots comparing LogCA with the observed values of AES [30] on (a) PCIe crypto, (b) NVIDIA discrete GPU, (c) AMD integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, and (h) AES-NI on Sandy Bridge. Each panel plots speedup against granularity (16B-32MB), marking A, C/L, g_1, and g_{A/2} where they exist; for the SPARC T4 instruction g_1 < 16B, and for AES-NI both g_1 and g_{A/2} are < 16B.]

[Figure 8: Speedup curve fitting plots comparing LogCA with the observed values of SHA256 [31] on (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, and (d) SPARC T4 instruction. LogCA starts following the observed values after 64B.]
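The marked points g_1 and g_{A/2} can also be located numerically from the fitted parameters. A rough sketch, using the granularity-independent speedup form sketched earlier and the UltraSPARC T2 AES parameters of Table 5 (the resulting g_1 of a few hundred bytes and g_{A/2} of a few KB are consistent with Figure 7 (d)):

```python
import numpy as np

def cutoffs(C, beta, o, L, A):
    """First granularities on a 16B..32MB sweep where speedup reaches
    1 (g_1) and A/2 (g_{A/2}); None if never reached in range."""
    gs = np.logspace(4, 25, num=4000, base=2)
    s = (C * gs**beta) / (o + L + C * gs**beta / A)   # independent latency
    def first(target):
        hit = np.nonzero(s >= target)[0]
        return gs[hit[0]] if hit.size else None
    return first(1.0), first(A / 2)

g1, gA2 = cutoffs(C=90, beta=1.01, o=2.9e4, L=1500, A=19)
```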

Figure 8 shows the curve fitting for SHA on the various on-chip crypto accelerators. We observe that g_1 and g_{A/2} do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values below 64B. This happens because SHA requires a block size of 64B for hash computation: if the input is smaller than 64B, it is padded with extra bits to make the block size 64B. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.
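One way to fold this effect into the model, as a hypothetical correction that LogCA itself does not apply, is to evaluate the curve at the padded size:

```python
import math

SHA_BLOCK = 64  # bytes per SHA-256 block

def padded_granularity(g: int) -> int:
    # Messages are padded to a whole number of 64B blocks, so a 16B
    # offload is hashed as one full 64B block.
    return max(SHA_BLOCK, math.ceil(g / SHA_BLOCK) * SHA_BLOCK)

assert padded_granularity(16) == 64
assert padded_granularity(100) == 128
```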

Figure 9-a shows the speedup curve fitting plots for radix sort. We observe that LogCA does not follow the observed values at smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g_1 and g_{A/2}. We also observe that g_{A/2} for the GPU is higher than for the APU; this observation supports equation (7), which shows that increasing acceleration increases g_{A/2}.

5.2 Super-Linear Complexity Kernels (β > 1)

Figures 9-b and 9-c show the speedup curve fitting plots for super-linear complexity kernels on the discrete GPU and the APU. We observe that matrix multiplication, with its higher complexity (O(g^1.7)), achieves higher speedup than sort and FFT, with their lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms is greater than that of lower-complexity algorithms. We also observe that g_{A/2} does not exist for FFT. This happens because, as we note in equation (12), for g_{A/2} to exist for FFT, C/L should be greater than A^{1/2}. However, Figure 9-c shows that C/L is smaller than A^{1/2} for both the GPU and the APU.

[Figure 9: Speedup curve fitting plots comparing LogCA with the observed values of (a) radix sort, (b) matrix multiplication, (c) FFT, and (d) binary search. Each kernel is shown on both the GPU and the APU, with speedup plotted against granularity (16B-32MB) and A, C/L, g_1, and g_{A/2} marked where they exist.]

5.3 Sub-Linear Complexity Kernels (β < 1)

Figure 9-d shows the curve fitting for binary search, which is a sub-linear algorithm (β = 0.14). We make three observations. First, g_1 does not exist even for very large granularities, and C/L < 1. This observation supports implication (5) that for a sub-linear algorithm of β = 0.14, C/L should be greater than 7 to provide any speedup. Second, for large granularities, the speedup starts decreasing with an increase in granularity. This observation supports our earlier claim in implication (4) that for systems with granularity dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches. As mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search behavior, it still provides an upper bound on the achievable performance.

5.4 Case Studies

Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara systems (Figure 10 (a)) to the accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent from the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, representing the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand from Figure 10 (a) to (e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high-overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads at 32KB suggests a complex interface, indicating that high-overhead OS calls create a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, which is also evident from a smaller g_1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B. In contrast, UltraSPARC T2 and SPARC T3 do not provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the maximum range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low overhead access which LogCA shows is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.
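A back-of-the-envelope check, assuming the granularity-independent form of §2 with the AES parameters of Table 5, reproduces this shrinking break-even point (the exact figure values may differ slightly):

```python
# g_1 from Speedup(g) = 1 with T_1 = o + L + C*g/A (beta ~ 1 for AES):
# C*g = o + L + C*g/A  =>  g_1 = (o + L) / (C * (1 - 1/A))
designs = {  # name: (L, o, C, A) from Table 5
    "UltraSPARC T2":   (1500, 2.9e4, 90, 19),
    "SPARC T3":        (1500, 2.7e4, 90, 12),
    "SPARC T4 engine": ( 500, 435,   32, 12),
    "SPARC T4 instr":  (   4, 111,   32, 12),
}
for name, (L, o, C, A) in designs.items():
    g1 = (o + L) / (C * (1 - 1 / A))
    print(f"{name:16s} g_1 ~ {g1:6.0f} B")
# Trend: ~360B -> ~345B -> ~32B -> ~4B (below the 16B minimum offload).
```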

Figure 11 shows the evolution of memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an AMD integrated GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication for all three architectures is compute bound


[Figure 10: LogCA for performing the Advanced Encryption Standard on various crypto accelerators: (a) PCIe crypto accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, and (e) SPARC T4 instruction. Each panel plots speedup against granularity (16B-32MB) and shades the optimization regions for L (1/10x), o (1/10x), C (10x), and A (10x). LogCA identifies the design bottlenecks through the LogCA parameters in an optimization region; the bottleneck which LogCA identifies in each design is optimized in the next design.]

[Figure 11: Optimization regions for matrix multiplication over a range of granularities on (a) an NVIDIA discrete GPU, (b) an AMD APU, and (c) an HSA supported AMD integrated GPU. Each panel plots speedup against granularity (16B-32MB) and shades the optimization regions for L (1/10x), o (1/10x), C (10x), and A (10x).]

(§3.1). We also observe that the computational index occupies most of the regions, which signifies maximum optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus maximum optimization potential. The high cut-off granularity for overheads, at 32KB, indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)). There are a few notable differences compared to the discrete GPU: the cut-off granularity for latency


reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and, with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent from all regions, and the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface as compared to the discrete GPU and the APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage modeling of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture specific, for example targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers; it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators and have shown its utility using retrospective studies describing the evolution of the accelerator's interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should be extended to handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK – A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)). 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct 2009), 56–67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7366753
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1–5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec 2010), 225–236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design. Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1–12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141–149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceeding of the 38th Annual International Symposium on Computer Architecture - ISCA '11. ACM Press, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1–3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings - International Symposium on High-Performance Computer Architecture. 503–514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture - ISCA '10 (2010), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (Jul 2008), 33–38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37. 152–163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38. 280–289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par'12). Springer-Verlag, Berlin, Heidelberg, 920–932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468–479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129–140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361–372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (Jul 2014), 57–60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81–100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan 2015), 272–281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89–108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan 2013), 25–28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203–212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chip Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (Jul 2008), 4–12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chip Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture - MEDEA '09 (2009), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barren, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits Conference, 2007. ASSCC '07. IEEE Asian. 22–25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8–19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401, for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium, IPDPS 2013. 673–686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chip Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov 2014), 577–587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. 1131–1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS. 205–218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 '11. 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom). 344–350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld – A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65–76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255–268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. Ph.D. Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings - International Symposium on High-Performance Computer Architecture. 382–393. https://doi.org/10.1109/HPCA.2011.5749745

• Abstract
• 1 Introduction
• 2 The LogCA Model
  • 2.1 Effect of Granularity
  • 2.2 Performance Metrics
  • 2.3 Granularity dependent latency
• 3 Applications of LogCA
  • 3.1 Performance Bounds
  • 3.2 Sensitivity Analysis
• 4 Experimental Methodology
• 5 Evaluation
  • 5.1 Linear-Complexity Kernels (β = 1)
  • 5.2 Super-Linear Complexity Kernels (β > 1)
  • 5.3 Sub-Linear Complexity Kernels (β < 1)
  • 5.4 Case Studies
• 6 Related Work
• 7 Conclusion and Future Work
• References


Table 4: Algorithmic complexity of various kernels with number of elements (n) and granularity (g). The power of g represents β for each kernel.

Kernel                               Complexity in n   Complexity in g
Advanced Encryption Standard (AES)   O(n)              O(g^1.01)
Secure Hashing Algorithm (SHA)       O(n)              O(g^0.97)
Matrix Multiplication (GEMM)         O(n^3)            O(g^1.7)
Fast Fourier Transform (FFT)         O(n log n)        O(g^1.2)
Radix Sort                           O(kn)             O(g^0.94)
Binary Search                        O(log n)          O(g^0.14)
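To see how a kernel's complexity in n translates into a β exponent over granularity, consider GEMM. The derivation below is a sketch that assumes square n × n operands, so the offloaded bytes satisfy g ∝ n²; it is not taken from the paper's own derivation.

```latex
% Sketch: relating O(n^3) work to granularity g for GEMM,
% assuming square n x n matrices so that g \propto n^2.
\[
  g \propto n^2 \;\Rightarrow\; n \propto g^{1/2}, \qquad
  \text{work} = O(n^3) = O\!\left((g^{1/2})^{3}\right) = O(g^{1.5}).
\]
% The measured exponent beta = 1.7 in Table 4 is close to this
% idealized 1.5; host-side cache effects can push it higher.
```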

Table 5: Calculated values of the LogCA parameters (L and o in cycles, C in cycles/B; A is dimensionless).

Discrete GPU (L = 3×10^3, o = 2×10^8, A = 30):
  AES C = 174; Radix Sort C = 290; GEMM C = 2; FFT C = 290; Binary Search C = 116
APU (L = 15, o = 4×10^8, A = 7):
  AES C = 174; Radix Sort C = 290; GEMM C = 2; FFT C = 290; Binary Search C = 116
UltraSPARC T2 (L = 1500):
  AES: o = 2.9×10^4, C = 90, A = 19; SHA: o = 10.5×10^3, C = 72, A = 12
SPARC T3 (L = 1500):
  AES: o = 2.7×10^4, C = 90, A = 12; SHA: o = 10.5×10^3, C = 72, A = 10
SPARC T4 engine (L = 500):
  AES: o = 435, C = 32, A = 12; SHA: o = 1.6×10^3, C = 32, A = 10
SPARC T4 instruction (L = 4):
  AES: o = 111, C = 32, A = 12; SHA: o = 1638, C = 32, A = 10
Sandy Bridge (AES-NI): L = 3, o = 10, C = 35, A = 6

supports these counters, so we use Linux utilities to trace the execution of the crypto instructions [3]. We use these execution times to determine the LogCA parameters. We calculate these parameters once per system; they can then be reused for different kernels on the same system.

For the computational index and β, we profile the CPU code on the host, varying the granularity from 16B to 32MB. At each granularity we measure the execution time and use regression analysis to determine C and β. For the overheads, we use the observation that for very small granularities the execution time of a kernel on an accelerator is dominated by the overheads, i.e., lim_{g→0} T1(g) ≃ o.

For acceleration, we use different methods for the on-chip accelerators and the GPUs. For the on-chip accelerators, we calculate acceleration using equation (3) and the observation that the speedup curve flattens out and approaches the acceleration at very large granularities. For the GPUs, however, we do not use equation (3), since it would require computing acceleration separately for each kernel: each application has a different access pattern, which affects the speedup. Instead, we bound the maximum performance using the peak flops from the device specifications and take the ratio of peak GFLOPS on the GPU and CPU:

    A = PeakGFLOPS_GPU / PeakGFLOPS_CPU

Similar to acceleration, we use two different techniques for calculating latency. For the on-chip accelerators, we run micro-benchmarks and use the execution times on the host and the accelerators. For the GPUs, on the other hand, we compute latency from the peak memory bandwidth of the GPU. Similar to Meswani et al. [29], we use the following equation for measuring the data copying time for the GPUs:

    L = 1 / BW_peak
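The regression step reduces to a linear fit in log space, since T0(g) = C·g^β implies log T0 = log C + β·log g. The following sketch shows one way to recover C, β, o, A, and L from such measurements; the timing arrays and peak-spec numbers are hypothetical placeholders, not values from the evaluated systems.

```python
import numpy as np

# Hypothetical host execution times (cycles) at each granularity.
granularities = np.array([16, 256, 4096, 65536, 1048576])   # bytes
host_cycles   = np.array([2.9e3, 4.8e4, 7.9e5, 1.3e7, 2.1e8])

# T0(g) = C * g**beta  =>  log T0 = log C + beta * log g,
# so a least-squares line in log-log space yields beta (slope)
# and C (exp of the intercept).
beta, log_C = np.polyfit(np.log(granularities), np.log(host_cycles), 1)
C = np.exp(log_C)

# Overhead o: accelerator time at the smallest granularities is
# dominated by o, i.e., lim_{g->0} T1(g) ~= o.
accel_cycles_small_g = np.array([3.1e4, 2.8e4, 2.9e4])  # hypothetical
o = accel_cycles_small_g.mean()

# Acceleration A for a GPU from peak-FLOPS specs (hypothetical numbers),
# and latency L from peak memory bandwidth, L = 1 / BW_peak.
peak_gflops_gpu, peak_gflops_cpu = 4600.0, 150.0
A = peak_gflops_gpu / peak_gflops_cpu
bw_peak = 0.5                    # hypothetical peak bandwidth, B/cycle
L = 1.0 / bw_peak                # cycles per byte

print(f"C = {C:.1f} cycles/B, beta = {beta:.2f}, o = {o:.2e} cycles")
print(f"A = {A:.1f}, L = {L:.2f} cycles/B")
```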

Earlier, we developed our model under the assumptions of granularity-independent and granularity-dependent latencies. In our setup, we observe that the on-chip crypto accelerators and the HSA-enabled GPU represent accelerators with granularity-independent latency, while the off-chip crypto accelerator and the discrete GPU/APU represent granularity-dependent accelerators. For each accelerator, we calculate the speedup and performance metrics using the respective equations (§2).
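As a concrete illustration, the sketch below evaluates a speedup curve and numerically locates g_1 and g_{A/2}. It assumes the functional form implied by the surrounding text, Speedup(g) = C·g^β / (o + L(g) + C·g^β/A), with L(g) = L for granularity-independent latency and L(g) = L·g for granularity-dependent latency; the exact equations (1)-(12) live in §2 and are not reproduced here.

```python
import numpy as np

def speedup(g, C, beta, o, L, A, latency_dependent=False):
    """LogCA-style speedup at granularity g (bytes).

    Assumes T0(g) = C * g**beta on the host and
    T1(g) = o + L(g) + T0(g)/A on the accelerator, where
    L(g) = L*g if latency scales with granularity, else L.
    """
    lat = L * g if latency_dependent else L
    t0 = C * g**beta
    return t0 / (o + lat + t0 / A)

def crossing(target, C, beta, o, L, A, latency_dependent=False):
    """Smallest granularity on a log grid where speedup >= target,
    or None if it is never reached in the scanned 16B..32MB range."""
    for g in np.logspace(np.log10(16), np.log10(32 * 2**20), 2000):
        if speedup(g, C, beta, o, L, A, latency_dependent) >= target:
            return g
    return None

# UltraSPARC T2 running AES, parameters from Table 5 (on-chip, so
# granularity-independent latency).
C, beta, o, L, A = 90.0, 1.01, 2.9e4, 1500.0, 19.0
g1  = crossing(1.0,   C, beta, o, L, A)   # break-even granularity
gA2 = crossing(A / 2, C, beta, o, L, A)   # half-of-peak granularity
print(f"g_1 ~= {g1:.0f} B, g_A/2 ~= {gA2:.0f} B")
```

With these parameters the scan finds a break-even point of a few hundred bytes and a g_{A/2} of a few kilobytes, roughly consistent with the markers in Figure 7(d).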

5 EVALUATION
In this section, we show that LogCA closely captures the behavior of both off-chip and on-chip accelerators. We list the calculated LogCA parameters in Table 5. To demonstrate the utility of our model, we also present two case studies: the evolution of the interface in SUN/Oracle's crypto accelerators and in three different GPU architectures. In both cases, we elaborate the design changes using the insights LogCA provides.

5.1 Linear-Complexity Kernels (β = 1)
Figure 7 shows the curve fitting of LogCA for AES. We consider both off-chip and on-chip accelerators, connected through different interfaces ranging from the PCIe bus to special instructions. We observe that the off-chip accelerators and the APU, unlike the on-chip accelerators, provide reasonable speedup only at very large granularities. We also observe that the achievable speedup is limited by the computational intensity for off-chip accelerators and by the acceleration for on-chip accelerators. This observation supports our earlier implications on the limits of speedup under granularity-independent and granularity-dependent latencies in equations (3) and (9), respectively.

Figure 7 also shows that UltraSPARC T2 provides higher speedups than Sandy Bridge, but it breaks even at a larger granularity. Sandy Bridge, on the other hand, breaks even at a very small granularity but provides limited speedup. The discrete GPU, with its powerful processing cores, has the highest acceleration among these designs. However, its observed speedup is lower than the others due to the high overheads and latencies involved in communicating through the PCIe bus.
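The break-even contrast is easy to check with the Table 5 parameters, assuming the granularity-independent form with β ≈ 1, under which setting Speedup(g_1) = 1 yields g_1 = (o + L)/(C(1 − 1/A)). This closed form is our simplification of the model's break-even equation, not a formula quoted from §2.

```latex
% Break-even granularity from Speedup(g_1) = 1 with T0 = Cg and
% T1 = o + L + Cg/A (granularity-independent latency, beta ~ 1):
\[
  g_1 = \frac{o + L}{C\,(1 - 1/A)}
\]
% UltraSPARC T2 (AES): o = 2.9e4, L = 1500, C = 90, A = 19:
\[
  g_1 \approx \frac{2.9\times10^{4} + 1500}{90\,(1 - 1/19)} \approx 358\ \mathrm{B}
\]
% Sandy Bridge (AES-NI): o = 10, L = 3, C = 35, A = 6:
\[
  g_1 \approx \frac{13}{35\,(1 - 1/6)} \approx 0.45\ \mathrm{B} \;(< 16\ \mathrm{B})
\]
```

The second value is consistent with the g_1 < 16B annotation for Sandy Bridge in Figure 7(h), while T2's break-even lands in the hundreds of bytes.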

We have also marked g_1 and g_{A/2} for each accelerator in Figure 7, which help programmers and architects identify the complexity of the interface. For example, g_1 for the crypto instructions, i.e., SPARC T4 and Sandy Bridge, lies on the extreme left, while for the off-chip accelerators g_1 lies on the far right. It is worth mentioning that we have marked g_{A/2} for the on-chip accelerators but not for the off-chip accelerators: for off-chip accelerators, the computational intensity is less than the acceleration and, as we note in equation (12), g_{A/2} does not exist for these designs.

We also observe that g_1 for the crypto card connected through the PCIe bus does not exist, showing that this accelerator does not break even even at large granularities. Figure 7 also shows that g_1 for the GPU and APU is comparable. This observation shows that, despite being an integrated GPU and not connected to the PCIe bus, the APU spends considerable time copying data from the host to device memory.


[Figure 7: Speedup curve-fitting plots comparing LogCA with the observed values of AES [30]. Panels: (a) PCIe crypto, (b) NVIDIA Discrete GPU, (c) AMD Integrated GPU (APU), (d) UltraSPARC T2, (e) SPARC T3, (f) SPARC T4 engine, (g) SPARC T4 instruction, (h) AESNI on Sandy Bridge. Each panel plots speedup against granularity (bytes), marking A, C/L, and, where present, g_1 and g_{A/2}.]

[Figure 8: Speedup curve-fitting plots comparing LogCA with the observed values of SHA256 [31]; LogCA starts following the observed values after 64B. Panels: (a) UltraSPARC T2 engine, (b) SPARC T3 engine, (c) SPARC T4 engine, (d) SPARC T4 instruction. Each panel plots speedup against granularity (bytes), marking A, C/L, g_1, and g_{A/2}.]


Figure 8 shows the curve fitting for SHA on the various on-chip crypto accelerators. We observe that g_1 and g_{A/2} do exist, as all of these are on-chip accelerators. We also observe that the LogCA curve mostly follows the observed values. However, it deviates from the observed values below 64B. This happens because SHA requires a block size of 64B for hash computation: if the input is smaller than 64B, it is padded with extra bits to fill a 64B block. Since LogCA does not capture this effect, it does not follow the observed speedup for granularities smaller than 64B.
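The padding effect is mechanical: SHA-256 always processes whole 64B blocks, appending a 1 bit, zero padding, and an 8-byte length field. The helper below is a small illustration of why the work is flat below 64B; it is not part of the measurement harness.

```python
def sha256_padded_bytes(msg_len: int) -> int:
    """Bytes actually hashed by SHA-256 for a msg_len-byte message.

    SHA-256 appends 0x80, then zero bytes, then an 8-byte length,
    rounding the total up to a multiple of the 64B block size.
    """
    return ((msg_len + 1 + 8 + 63) // 64) * 64

# Every message below 56B costs one full 64B block, so the effective
# work per byte is inflated at small granularities.
for n in (16, 32, 55, 56, 64, 128):
    print(n, "->", sha256_padded_bytes(n))
```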

Figure 9-a shows the speedup curve-fitting plots for radix sort. We observe that LogCA does not follow the observed values at smaller granularities on the GPU. Despite this inaccuracy, LogCA accurately predicts g_1 and g_{A/2}. We also observe that g_{A/2} for the GPU is higher than for the APU, and this observation supports equation (7): increasing acceleration increases g_{A/2}.
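The dependence of g_{A/2} on A can be seen by setting Speedup(g) = A/2 in the model. The closed forms below are our sketch for β = 1; the paper's equation (7) covers the general case, which we do not reproduce here.

```latex
% Setting Speedup(g) = A/2 with T1 = o + L(g) + Cg/A and beta = 1.
% Granularity-independent latency, L(g) = L:
\[
  \frac{Cg}{o + L + Cg/A} = \frac{A}{2}
  \;\Rightarrow\;
  g_{A/2} = \frac{A\,(o + L)}{C};
\]
% granularity-dependent latency, L(g) = Lg:
\[
  g_{A/2} = \frac{A\,o}{C - A\,L} \quad (\text{requires } C/L > A).
\]
% In both forms g_{A/2} grows with A, matching the GPU-vs-APU gap.
```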

5.2 Super-Linear Complexity Kernels (β > 1)
Figures 9-b and 9-c show the speedup curve-fitting plots for super-linear kernels on the discrete GPU and APU. We observe that matrix multiplication, with its higher complexity (O(g^1.7)), achieves higher speedup than sort and FFT, with their lower complexities of O(g) and O(g^1.2), respectively. This observation corroborates the result from equation (9) that the achievable speedup of higher-complexity algorithms exceeds that of lower-complexity algorithms. We also observe that g_{A/2} does not exist for FFT. This happens because, as we note in equation (12), for g_{A/2} to exist for FFT, C/L should be greater than A^{1/2}; however, Figure 9-c shows that C/L is smaller than A^{1/2} for both the GPU and APU.
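One way to see why g_{A/2} can be out of reach is to write out the A/2 crossing for the granularity-dependent form. This derivation is our sketch, not a restatement of equation (12).

```latex
% Setting Speedup(g) = A/2 with T1 = o + Lg + Cg^beta/A
% (granularity-dependent latency) gives the crossing condition:
\[
  \frac{C g^{\beta}}{o + Lg + C g^{\beta}/A} = \frac{A}{2}
  \quad\Longleftrightarrow\quad
  C g^{\beta} = A\,(o + Lg).
\]
% Ignoring o, the compute term overtakes A L g only around
\[
  g_{A/2} \;\approx\; \left(\frac{A L}{C}\right)^{\tfrac{1}{\beta-1}} .
\]
% For beta = 1.2 the exponent is 5, so when C/L is small relative to A
% this granularity explodes and the A/2 point falls far beyond any
% practical offload size.
```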

5.3 Sub-Linear Complexity Kernels (β < 1)
Figure 9-d shows the curve fitting for binary search, a sub-linear algorithm (β = 0.14). We make three observations. First, g_1 does not exist even at very large granularities, and C/L < 1. This observation supports implication (5) that, for a sub-linear algorithm with β = 0.14, C/L should be greater than 7 to provide any speedup.


[Figure 9: Speedup curve-fitting plots comparing LogCA with the observed values of (a) Radix Sort, (b) Matrix Multiplication, (c) FFT, and (d) Binary Search, each on the discrete GPU and the APU. Each panel plots speedup against granularity (bytes), marking A, C/L, and, where present, g_1 and g_{A/2}.]

Second, at large granularities the speedup starts decreasing with increasing granularity. This observation supports our earlier claim in implication (4) that, for systems with granularity-dependent latencies, the speedup for sub-linear algorithms asymptotically decreases. Third, LogCA deviates from the observed values at large granularities. This deviation occurs because LogCA does not model caches: as mentioned earlier, LogCA abstracts the caches and memories with a single latency parameter, which does not capture the memory-access pattern accurately. Even though LogCA does not accurately capture binary search's behavior, it still provides an upper bound on the achievable performance.

5.4 Case Studies
Figure 10 shows the evolution of crypto accelerators in SPARC architectures, from the off-chip accelerators in pre-Niagara machines (Figure 10 (a)) to the accelerators integrated within the pipeline in SPARC T4 (Figure 10 (e)). We observe that latency is absent from the on-chip accelerators' optimization regions, as these accelerators are closely coupled with the host. We also note that the optimization region with overheads, which represents the complexity of an accelerator's interface, shrinks, while the optimization regions with acceleration expand, from Figure 10 (a) to (e). For example, for the off-chip crypto accelerator the cut-off granularity for overheads is 256KB, whereas it is 128B for SPARC T4, suggesting a much simpler interface.

Figure 10 (a) shows the optimization regions for the off-chip crypto accelerator connected through the PCIe bus. We note that overheads and latencies occupy most of the optimization regions, indicating high-overhead OS calls and high-latency data copying over the PCIe bus as the bottlenecks.

Figure 10 (b) shows the optimization regions for UltraSPARC T2. The large cut-off granularity for overheads, at 32KB, suggests a complex interface, indicating a high-overhead OS call creating a bottleneck at small granularities. The cut-off granularity of 2KB for acceleration suggests that optimizing acceleration is beneficial at large granularities.

Figure 10 (d) shows the optimization regions for the on-chip accelerator on SPARC T4. There are three optimization regions, with the cut-off granularity for overheads now reduced to only 512B. This observation suggests a considerable improvement in the interface design over SPARC T3, which is also evident from a smaller g_1. We also note that the cut-off granularity for acceleration now decreases to 32B, showing an increase in the opportunity for optimizing acceleration.

Figure 10 (e) shows the optimization regions for the crypto instructions on SPARC T4. We observe that, unlike the earlier designs, it has only two optimization regions, and the speedup approaches the peak acceleration at a small granularity of 128B; in contrast, UltraSPARC T2 and SPARC T3 do not provide any gains at this granularity. We also observe that the cut-off granularity for overheads further reduces to 128B, suggesting some opportunity for optimization at very small granularities. The model also shows that acceleration occupies the widest range for optimization; for example, optimizing acceleration provides benefits for granularities greater than 16B. The low-overhead access which LogCA exposes is due to the non-privileged instruction SPARC T4 uses to access the cryptographic unit, which is integrated within the pipeline.
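Optimization regions like these can be reproduced with a simple sensitivity sweep: at each granularity, perturb each parameter by the factor shown in the figure legends (L and o scaled by 1/10, C and A by 10) and record which perturbations move the speedup appreciably. The sketch below is our illustration of that procedure, not the paper's plotting code; the parameter values and the 20% inclusion threshold are arbitrary assumptions.

```python
import numpy as np

def speedup(g, C, beta, o, L, A, lat_dep=True):
    """LogCA-style speedup, granularity-dependent latency by default."""
    t0 = C * g**beta
    return t0 / (o + (L * g if lat_dep else L) + t0 / A)

def optimization_region(g, params, threshold=0.2):
    """Parameters whose legend-style perturbation (o, L / 10; C, A * 10)
    improves the speedup at g by more than `threshold` (fractional)."""
    base = speedup(g, **params)
    tweaks = {"o": 0.1, "L": 0.1, "C": 10.0, "A": 10.0}
    region = []
    for name, factor in tweaks.items():
        p = dict(params)
        p[name] *= factor
        if speedup(g, **p) / base - 1.0 > threshold:
            region.append(name)
    return region

# Hypothetical off-chip-style parameters, for illustration only.
params = dict(C=174.0, beta=1.0, o=2.9e4, L=2.0, A=30.0)
for g in (64, 4096, 262144, 16 * 2**20):
    print(f"g = {g:>9} B -> optimize {optimization_region(g, params)}")
```

Sweeping g then partitions the granularity axis into regions analogous to those shaded in Figures 10 and 11.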

Figure 11 shows the evolution of the memory interface design in GPU architectures. It shows the optimization regions for matrix multiplication on a discrete NVIDIA GPU, an integrated AMD GPU (APU), and an integrated AMD GPU with HSA support. We observe that matrix multiplication is compute bound for all three architectures (§3.1).


[Figure 10: LogCA for performing Advanced Encryption Standard on various crypto accelerators: (a) PCIe Crypto Accelerator, (b) UltraSPARC T2, (c) SPARC T3, (d) SPARC T4 engine, (e) SPARC T4 instruction. Each panel plots speedup against granularity (bytes), with curves for LogCA and for L·1/10x, o·1/10x, C·10x, and A·10x. LogCA identifies the design bottlenecks through its parameters in each optimization region; the bottlenecks LogCA exposes in each design are optimized in the next design.]

[Figure 11: Optimization regions for matrix multiplication over a range of granularities on (a) an NVIDIA discrete GPU, (b) an AMD APU, and (c) an HSA-supported AMD integrated GPU. Each panel plots speedup against granularity (bytes), with curves for LogCA and for L·1/10x, o·1/10x, C·10x, and A·10x.]

We also observe that the computational index occupies most of the regions, which signifies the greatest optimization potential.

The discrete GPU has four optimization regions (Figure 11 (a)). Among these, latency dominates most of the regions, signifying high-latency data copying over the PCIe bus and thus the greatest optimization potential. The high cut-off granularity for overheads, at 32KB, indicates high-overhead OS calls to access the GPU. Similarly, with highly aggressive cores, acceleration has a high cut-off granularity of 256KB, indicating less optimization potential for acceleration.

Similar to the discrete GPU, the APU also has four optimization regions (Figure 11 (b)), with a few notable differences from the discrete GPU: the cut-off granularity for latency reduces to 512KB with the elimination of data copying over the PCIe bus; the overheads are still high, suggesting high-overhead OS calls to access the APU; and, with less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.

Figure 11 (c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent from all regions and that the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface compared to the discrete GPU and the APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.

6 RELATED WORK
We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose an early-stage model, Navigo, that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage models of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs; they later extend their model into an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high-performance applications. Daga et al. [11] analyze the effectiveness of accelerated processing units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in its complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we deliberately limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most of them require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], some of them architecture specific: for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK
With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. To this end, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators, and we have shown its utility through retrospective studies describing the evolution of the accelerator interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analyses we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS
We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK – A Complete Development Platform. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)), 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56–67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7366753
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1–5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225–236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1–12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing, IEEE, 141–149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA '11), ACM Press, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1–3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings of the International Symposium on High-Performance Computer Architecture, 503–514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report, Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33–38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37, 152–163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38, 280–289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12), Springer-Verlag, Berlin, Heidelberg, 920–932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46), ACM, New York, NY, USA, 468–479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129–140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15), ACM, New York, NY, USA, 361–372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (July 2014), 57–60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81–100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272–281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89–108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25–28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers, ACM, 203–212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4–12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09), 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In IEEE Asian Solid-State Circuits Conference (A-SSCC '07), 22–25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. SPARC T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8–19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401, for Std. PCI-sockets. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings of the IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013), 673–686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577–587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, 1131–1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 205–218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44), 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 344–350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld – A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65–76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14), ACM, New York, NY, USA, 255–268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. Ph.D. Dissertation, Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings of the International Symposium on High-Performance Computer Architecture, 382–393. https://doi.org/10.1109/HPCA.2011.5749745

  • Abstract
  • 1 Introduction
  • 2 The LogCA Model
    • 21 Effect of Granularity
    • 22 Performance Metrics
    • 23 Granularity dependent latency
      • 3 Applications of LogCA
        • 31 Performance Bounds
        • 32 Sensitivity Analysis
          • 4 Experimental Methodology
          • 5 Evaluation
            • 51 Linear-Complexity Kernels (= 1)
            • 52 Super-Linear Complexity Kernels (gt 1)
            • 53 Sub-Linear Complexity Kernels (lt 1)
            • 54 Case Studies
              • 6 Related Work
              • 7 Conclusion and Future Work
              • References

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

CL

Sp

eed

up

(a) PCIe crypto

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100A

g1

CL

(b) NVIDIA Discrete GPU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

(c) AMD Integrated GPU (APU)

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

(d) UltraSPARC T2

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

Sp

eed

up

(e) SPARC T3

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

(f) SPARC T4 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

gA2

CL

g1 lt 16B

Granularity (Bytes)

(g) SPARC T4 instruction

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

CL

g1 gA2lt 16B

Granularity (Bytes)

(h) AESNI on Sandy Bridge

observed LogCA

Figure 7 Speedup curve fittings plots comparing LogCA with the observed values of AES [30]

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

Sp

eed

up

(a) UltraSPARC T2 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

(b) SPARC T3 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

(c) SPARC T4 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

gA2

g1 lt 16B

CL

Granularity (Bytes)

(d) SPARC T4 instruction

observed LogCA

Figure 8 Speedup curve fittings plots comparing LogCA with the observed values of SHA256 [31] LogCA starts following observedvalues after 64B

APU spends considerable time in copying data from the host todevice memory

Figure 8 shows the curve fitting for SHA on various on-chipcrypto accelerators We observe that g1 and g A

2do exist as all of

these are on-chip accelerators We also observe that the LogCAcurve mostly follows the observed value However it deviates fromthe observed value before 64B This happens because SHA requiresblock size of 64B for hash computation If the block size is less than64B it pads extra bits to make the block size 64B Since LogCAdoes not capture this effect it does not follow the observed speedupfor granularity smaller than 64B

Figure 9-a shows the speedup curve fitting plots for Radix sortWe observe that LogCA does not follow observed values for smallergranularities on GPU Despite this inaccuracy LogCA accuratelypredicts g1 and g A

2 We also observe that g A

2for GPU is higher than

APU and this observation supports equation (7) that increasingacceleration increases g A

2

52 Super-Linear Complexity Kernels (β gt 1)Figures 9-b and 9-c show the speedup curve fitting plots for super-complexity kernels on discrete GPU and APU We observe that ma-trix multiplication with higher complexity (O (g17)) achieves higherspeedup than sort and FFT with lower complexities of O (g) andO (g12) respectively This observation corroborates results fromequation (9) that achievable speedup of higher-complexity algo-rithms is higher than lower-complexity algorithms We also observethat g A

2does not exist for FFT This happens because as we note in

equation (12) that for g A2to exist for FFT C

L should be greater thanA

12 However Figure 9-c shows that CL is smaller than A

12 for bothGPU and APU

53 Sub-Linear Complexity Kernels (β lt 1)Figure 9-d shows the curve fitting for binary search which is asub-linear algorithm (β = 014) We make three observations Firstg1 does not exist even for very large granularities and C

L lt 1 Thisobservation supports implication (5) that for a sub-linear algorithm

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

Sp

eed

up

GPU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

APU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

GPU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1 gA2

CL

APU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100A

g1

CL

Granularity (Bytes)

Sp

eed

up

GPU

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

A

g1

CL

Granularity (Bytes)

APU

16 128 1K 8K 64

K51

2K 4M 32M

001

01

1

10

100A

g1 gA2

CL

Granularity (Bytes)

GPU

16 128 1K 8K 64

K51

2K 4M 32M

001

01

1

10

100

A

g1 gA2

CL

Granularity (Bytes)

APU

(a) Radix Sort (b) Matrix Multiplication

(c) FFT (d) Binary Search

observed LogCA

Figure 9 Speedup curve fittings plots comparing LogCA with the observed values of (a) Radix Sort (b) Matrix Multiplication (c) FFTand (d) Binary Search

of β = 014 CL should be greater than 7 to provide any speedup

Second for large granularities speedup starts decreasing with anincrease in granularity This observation supports our earlier claimin implication (4) that for systems with granularity dependent la-tencies speedup for sub-linear algorithms asymptotically decreasesThird LogCA deviates from the observed value at large granularitiesThis deviation occurs because LogCA does not model caches Asmentioned earlier LogCA abstracts the caches and memories witha single parameter of latency which does not capture the memory-access pattern accurately Even though LogCA does not accuratelycaptures binary search behavior it still provides an upper bound onthe achievable performance

54 Case StudiesFigure 10 shows the evolution of crypto accelerators in SPARCarchitectures from the off-chip accelerators in pre-Niagara (Figure 10(a)) to accelerators integrated within the pipeline in SPARC T4(Figure 10 (e)) We observe that latency is absent in the on-chipacceleratorsrsquo optimization regions as these accelerators are closelycoupled with the host We also note that the optimization regionwith overheadsmdashrepresenting the complexity of an acceleratorrsquosinterfacemdashshrinks while the optimization regions with accelerationexpand from Figure 10 (a-e) For example for the off-chip cryptoaccelerator the cut-off granularity for overheads is 256KB whereasit is 128B for the SPARC T4 suggesting a much simpler interface

Figure 10 (a) shows the optimization regions for the off-chipcrypto accelerator connected through the PCIe bus We note thatoverheads and latencies occupy most of the optimization regionsindicating high overhead OS calls and high-latency data copyingover the PCIe bus as the bottlenecks

Figure 10 (b) shows the optimization regions for UltraSPARCT2 The large cut-off granularity for overheads at 32KB suggestsa complex interface indicating high overhead OS call creating abottleneck at small granularities The cut-off granularity of 2KB foracceleration suggests that optimizing acceleration is beneficial atlarge granularities

Figure 10 (d) shows optimization regions for on-chip acceleratoron SPARC T4 There are three optimization regions with the cut-offgranularity for overhead now reduced to only 512B This observationsuggests a considerable improvement in the interface design overSPARC T3 and it is also evident by a smaller g1 We also note thatcut-off granularity for acceleration now decreases to 32B showingan increase in the opportunity for optimizing acceleration

Figure 10 (e) shows optimization regions for crypto instructionson SPARC T4 We observe that unlike earlier designs it has only twooptimization regions and the speedup approaches the peak accelera-tion at a small granularity of 128B In contrast UltraSPARC T2 andSPARC T3 do not even provide any gains at this granularity We alsoobserve that the cut-off granularity for overheads further reduces to128B suggesting some opportunity for optimization at very smallgranularities The model also shows that the acceleration occupiesthe maximum range for optimization For example optimizing accel-eration provides benefits for granularities greater than 16B The lowoverhead access which LogCA shows is due to the non-privilegedinstruction SPARC T4 uses to access the cryptographic unit whichis integrated within the pipeline

Figure 11 shows the evolution of memory interface design inGPU architectures It shows the optimization regions for matrixmultiplication on a discrete NVIDIA GPU an AMD integrated GPU(APU) and an integrated AMD GPU with HSA support We observethat matrix multiplication for all three architectures is compute bound

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

oCLoC

LC

oL

Granularity (Bytes)

Sp

eed

up

(a) PCIe Crypto Accelerator

16 128 g1 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

oC AoCA

oA

Granularity (Bytes)

(b) UltraSPARC T2

16 128 g1 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000CL

A

oC AoCA

oA

Granularity (Bytes)

(c) SPARC T3

g112

8 1K 8K 64K

512K 4M 32

M

01

1

10

100

1000

A

oCA A

CL

oA

Granularity (Bytes)

Sp

eed

up

(d) SPARC T4 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

oCA A

CL

oA

Granularity (Bytes)

(e) SPARC T4 instruction

LogCA L110xo110x C10x A10x

Figure 10 LogCA for performing Advanced Encryption Standard on various crypto accelerators LogCA identifies the design bottle-necks through LogCA parameters in an optimization region The bottlenecks which LogCA suggests in each design is optimized inthe next design

16 128 1K 8K g1

64K

512K 4M 32

M

01

1

10

100

1000

A

LoC LCALC A

ALC

o

Granularity (Bytes)

Sp

eed

up

(a) NVIDIA Discrete GPU

16 128 1K 8K g1

64K

512K 4M 32

M

01

1

10

100

1000

A

LoCLo

CA

ACA

AoL

C

Granularity (Bytes)

(b) AMD Integrated GPU (APU)

16 128 g11K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

oC o

CA

CA

o

AC

Granularity (Bytes)

(c) HSA supported AMD Integrated GPU

LogCA L110xo110x C10x A10x

Figure 11 Various Optimization regions for matrix multiplication over a range of granularities on (a) NVIDIA discrete GPU (b)AMD APU and (c) HSA Supported GPU

(sect31) We also observe that the computational index occupies mostof the regions which signifies maximum optimization potential

The discrete GPU has four optimization regions (Figure 11 (a))Among these latency dominates most of the regions signifyinghigh-latency data copying over the PCIe bus and thus maximumoptimization potential The high cut-off granularity for overheads at

32KB indicates high overhead OS calls to access the GPU Similarlywith highly aggressive cores acceleration has high cut-off granular-ity of 256KB indicating less optimization potential for acceleration

Similar to the discrete GPU the APU also has four optimiza-tion regions (Figure 11 (b)) There are few notable differences ascompared to the discrete GPU The cut-off granularity for latency

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

reduces to 512KB with the elimination of data copying over thePCIe bus the overheads are still high suggesting high overhead OScalls to access the APU with less aggressive cores the cut-off granu-larity for acceleration reduces to 64KB implying more optimizationpotential for acceleration

Figure 11 (c) shows three optimization regions for the HSA en-abled integrated GPU We observe that latency is absent in all regionsand the cut-off granularity for overhead reduces to 8KB These re-ductions in overheads and latencies signify a simpler interface ascompared to the discrete GPU and APU We also observe that thecut-off granularity for acceleration drops to 2KB suggesting higherpotential for optimizing acceleration

6 RELATED WORKWe compare and contrast our work with prior approaches Lopez-Novoa et al [28] provide a detailed survey of various acceleratormodeling techniques We broadly classify these techniques in twocategories and discuss the most relevant work

Analytical Models There is a rich body of work exploring ana-lytical models for performance prediction of accelerators For somemodels the motivation is to determine the future trend in heteroge-neous architectures Chung et al [7] in a detailed study predict thefuture landscape of heterogeneous computing Hempstead et al [17]propose an early-stage model Navigo that determines the fraction ofarea required for accelerators to maintain the traditional performancetrend Nilakantan et al [32] propose to incorporate communicationcost for early-stage model of accelerator-rich architectures For oth-ers the motivation is to determine the right amount of data to offloadand the potential benefits associated with an accelerator [24]

Some analytical models are architecture specific For examplea number of studies [20 21 44 57] predict performance of GPUarchitectures Hong et al [20] present an analytical performancemodel for predicting execution time on GPUs They later extendtheir model and develop an integrated power and performance modelfor the GPUs [21] Song et al [44] use a simple counter basedapproach to predict power and performance Meswani et al [29]explore such models for high performance applications Daga etal [11] analyze the effectiveness of Accelerated processing units(APU) over GPUs and describe the communication cost over thePCIe bus as a major bottleneck in exploiting the full potential ofGPUs

In general our work is different from these studies because ofthe complexity These models use a large number of parameters toaccurately predict the power andor performance whereas we limitthe number of parameters to reduce the complexity of our modelThey also require deep understanding of the underlying architectureMost of these models also require access to GPU specific assemblyor PTX codes Unlike these approaches we use CPU code to providebounds on the performance

Roofline Models In terms of simplicity and motivation our workclosely matches the Roofline model [54]mdasha visual performancemodel for multi-core architectures Roofline exposes bottlenecks fora kernel and suggests several optimizations which programmers canuse to fine tune the kernel on a given system

A number of extensions of Roofline have been proposed [1022 33 56] and some of these extensions are architecture specific

For example targeting GPUs [22] vector processors [39] and FP-GAs [10 56]

Despite the similarities roofline and its extensions cannot be usedfor exposing design bottlenecks in an acceleratorrsquos interface Theprimary goal of roofline models has been to help programmers andcompiler writer while LogCA provides more insights for architects

7 CONCLUSION AND FUTURE WORKWith the recent trend towards heterogeneous computing we feelthat the architecture community lacks a model to reason about theneed of accelerators In this respect we propose LogCAmdashan insight-ful visual performance model for hardware accelerators LogCAprovides insights early in the design stage to both architects andprogrammers and identifies performance bounds exposes interfacedesign bottlenecks and suggest optimizations to alleviate these bot-tlenecks We have validated our model across a range of on-chip andoff-chip accelerators and have shown its utility using retrospectivestudies describing the evolution of acceleratorrsquos interface in thesearchitectures

The applicability of LogCA can be limited by our simplifying as-sumptions and for more realistic analysis we plan to overcome theselimitations in our future work For example We also assume a singleaccelerator system and do not explicitly model contention amongresources Our model should handle multi-accelerator and pipelinedscenarios For fixed function accelerators our design space is cur-rently limited to encryption and hashing kernels To overcome thiswe are expanding our design space with compression and databaseaccelerators in Oracle M7 processor We also plan to complementLogCA with an energy model as energy efficiency is a prime designmetric for accelerators

ACKNOWLEDGEMENTSWe thank our anonymous reviewers Arkaprava Basu and TonyNowatzki for their insightful comments and feedback on the paperThanks to Mark Hill Michael Swift Wisconsin Computer Archi-tecture Affiliates and other members of the Multifacet group fortheir valuable discussions We also thank Brian Wilson at Universityof Wisconsin DoIT and Eric Sedlar at Oracle Labs for providingaccess to SPARC T3 and T4 servers respectively Also thanks toMuhammad Umair Bin Altaf for his help in the formulation Thiswork is supported in part by the National Science Foundation (CNS-1302260 CCF-1438992 CCF-1533885 CCF- 1617824) Googleand the University of Wisconsin-Madison (Amar and Balindar SohiProfessorship in Computer Science) Wood has a significant financialinterest in AMD and Google

REFERENCES
[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. Advanced Micro Devices. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk
[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, Spring Joint Computer Conference (AFIPS '67 (Spring)). 483. https://doi.org/10.1145/1465482.1465560
[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc
[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56–67. https://doi.org/10.1145/1562764.1562783
[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873
[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1–5:10. https://doi.org/10.1147/JRD.2010.2059721
[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?. In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. 225–236. https://doi.org/10.1109/MICRO.2010.36
[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design. Austin, TX.
[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1–12. http://dl.acm.org/citation.cfm?id=155333
[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013). https://doi.org/10.1155/2013/428078
[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141–149. https://doi.org/10.1109/SAAHPC.2011.29
[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceeding of the 38th Annual International Symposium on Computer Architecture (ISCA '11). ACM Press, New York, NY, USA, 365. https://doi.org/10.1145/2000064.2000108
[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1–3:11. https://doi.org/10.1147/JRD.2009.2036980
[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings - International Symposium on High-Performance Computer Architecture. 503–514. https://doi.org/10.1109/HPCA.2011.5749755
[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). 37. https://doi.org/10.1145/1815961.1815968
[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).
[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.
[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33–38. https://doi.org/10.1109/MC.2008.209
[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37. 152–163. https://doi.org/10.1145/1555815.1555775
[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38. 280–289. https://doi.org/10.1145/1816038.1815998
[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par '12). Springer-Verlag, Berlin, Heidelberg, 920–932. https://doi.org/10.1007/978-3-642-32820-6_90
[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468–479. https://doi.org/10.1145/2540708.2540748
[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129–140. https://doi.org/10.1007/s11036-012-0368-0
[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361–372. https://doi.org/10.1145/2751205.2751231
[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (July 2014), 57–60. https://doi.org/10.1109/L-CA.2013.17
[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81–100.
[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems 26, 1 (Jan. 2015), 272–281. https://doi.org/10.1109/TPDS.2014.2308216
[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89–108. https://doi.org/10.1177/1094342012468180
[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). National Institute of Standards and Technology. https://doi.org/10.6028/NIST.FIPS.197
[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. National Institute of Standards and Technology. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf
[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25–28. https://doi.org/10.1109/L-CA.2012.9
[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203–212.
[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. OpenSSL Software Foundation. https://openssl.org
[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc
[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4–12. https://doi.org/10.1109/MM.2008.50
[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.
[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.
[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. In Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture (MEDEA '09). 7. https://doi.org/10.1145/1621960.1621962
[40] M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits Conference, 2007 (ASSCC '07), IEEE Asian. 22–25. https://doi.org/10.1109/ASSCC.2007.4425786
[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. SPARC T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8–19. https://doi.org/10.1109/MM.2012.1
[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).
[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. Soekris Engineering. http://soekris.com/products/vpn-1401.html
[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium (IPDPS 2013). 673–686. https://doi.org/10.1109/IPDPS.2013.73
[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.
[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf
[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577–587. https://doi.org/10.1007/s11227-014-1102-4
[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.
[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. 1131–1136. https://doi.org/10.1145/2228360.2228567
[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 205–218. https://doi.org/10.1145/1736020.1736044
[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-44). 163. https://doi.org/10.1145/2155620.2155640
[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom). 344–350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102
[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld – A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html
[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65–76. https://doi.org/10.1145/1498765.1498785
[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255–268. https://doi.org/10.1145/2541940.2541961
[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. Ph.D. Dissertation. Virginia Polytechnic Institute and State University.
[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings - International Symposium on High-Performance Computer Architecture. 382–393. https://doi.org/10.1109/HPCA.2011.5749745

  • Abstract
  • 1 Introduction
  • 2 The LogCA Model
    • 21 Effect of Granularity
    • 22 Performance Metrics
    • 23 Granularity dependent latency
      • 3 Applications of LogCA
        • 31 Performance Bounds
        • 32 Sensitivity Analysis
          • 4 Experimental Methodology
          • 5 Evaluation
            • 51 Linear-Complexity Kernels (= 1)
            • 52 Super-Linear Complexity Kernels (gt 1)
            • 53 Sub-Linear Complexity Kernels (lt 1)
            • 54 Case Studies
              • 6 Related Work
              • 7 Conclusion and Future Work
              • References

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

oCLoC

LC

oL

Granularity (Bytes)

Sp

eed

up

(a) PCIe Crypto Accelerator

16 128 g1 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

CL

oC AoCA

oA

Granularity (Bytes)

(b) UltraSPARC T2

16 128 g1 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000CL

A

oC AoCA

oA

Granularity (Bytes)

(c) SPARC T3

g112

8 1K 8K 64K

512K 4M 32

M

01

1

10

100

1000

A

oCA A

CL

oA

Granularity (Bytes)

Sp

eed

up

(d) SPARC T4 engine

16 128 1K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

oCA A

CL

oA

Granularity (Bytes)

(e) SPARC T4 instruction

LogCA L110xo110x C10x A10x

Figure 10 LogCA for performing Advanced Encryption Standard on various crypto accelerators LogCA identifies the design bottle-necks through LogCA parameters in an optimization region The bottlenecks which LogCA suggests in each design is optimized inthe next design

16 128 1K 8K g1

64K

512K 4M 32

M

01

1

10

100

1000

A

LoC LCALC A

ALC

o

Granularity (Bytes)

Sp

eed

up

(a) NVIDIA Discrete GPU

16 128 1K 8K g1

64K

512K 4M 32

M

01

1

10

100

1000

A

LoCLo

CA

ACA

AoL

C

Granularity (Bytes)

(b) AMD Integrated GPU (APU)

16 128 g11K 8K 64

K51

2K 4M 32M

01

1

10

100

1000

A

oC o

CA

CA

o

AC

Granularity (Bytes)

(c) HSA supported AMD Integrated GPU

LogCA L110xo110x C10x A10x

Figure 11 Various Optimization regions for matrix multiplication over a range of granularities on (a) NVIDIA discrete GPU (b)AMD APU and (c) HSA Supported GPU

(sect31) We also observe that the computational index occupies mostof the regions which signifies maximum optimization potential

The discrete GPU has four optimization regions (Figure 11 (a))Among these latency dominates most of the regions signifyinghigh-latency data copying over the PCIe bus and thus maximumoptimization potential The high cut-off granularity for overheads at

32KB indicates high overhead OS calls to access the GPU Similarlywith highly aggressive cores acceleration has high cut-off granular-ity of 256KB indicating less optimization potential for acceleration

Similar to the discrete GPU the APU also has four optimiza-tion regions (Figure 11 (b)) There are few notable differences ascompared to the discrete GPU The cut-off granularity for latency

ISCA rsquo17 June 24-28 2017 Toronto ON Canada M S B Altaf et al

reduces to 512KB with the elimination of data copying over thePCIe bus the overheads are still high suggesting high overhead OScalls to access the APU with less aggressive cores the cut-off granu-larity for acceleration reduces to 64KB implying more optimizationpotential for acceleration

Figure 11 (c) shows three optimization regions for the HSA en-abled integrated GPU We observe that latency is absent in all regionsand the cut-off granularity for overhead reduces to 8KB These re-ductions in overheads and latencies signify a simpler interface ascompared to the discrete GPU and APU We also observe that thecut-off granularity for acceleration drops to 2KB suggesting higherpotential for optimizing acceleration

6 RELATED WORKWe compare and contrast our work with prior approaches Lopez-Novoa et al [28] provide a detailed survey of various acceleratormodeling techniques We broadly classify these techniques in twocategories and discuss the most relevant work

Analytical Models There is a rich body of work exploring ana-lytical models for performance prediction of accelerators For somemodels the motivation is to determine the future trend in heteroge-neous architectures Chung et al [7] in a detailed study predict thefuture landscape of heterogeneous computing Hempstead et al [17]propose an early-stage model Navigo that determines the fraction ofarea required for accelerators to maintain the traditional performancetrend Nilakantan et al [32] propose to incorporate communicationcost for early-stage model of accelerator-rich architectures For oth-ers the motivation is to determine the right amount of data to offloadand the potential benefits associated with an accelerator [24]

Some analytical models are architecture specific For examplea number of studies [20 21 44 57] predict performance of GPUarchitectures Hong et al [20] present an analytical performancemodel for predicting execution time on GPUs They later extendtheir model and develop an integrated power and performance modelfor the GPUs [21] Song et al [44] use a simple counter basedapproach to predict power and performance Meswani et al [29]explore such models for high performance applications Daga etal [11] analyze the effectiveness of Accelerated processing units(APU) over GPUs and describe the communication cost over thePCIe bus as a major bottleneck in exploiting the full potential ofGPUs

In general our work is different from these studies because ofthe complexity These models use a large number of parameters toaccurately predict the power andor performance whereas we limitthe number of parameters to reduce the complexity of our modelThey also require deep understanding of the underlying architectureMost of these models also require access to GPU specific assemblyor PTX codes Unlike these approaches we use CPU code to providebounds on the performance

Roofline Models In terms of simplicity and motivation our workclosely matches the Roofline model [54]mdasha visual performancemodel for multi-core architectures Roofline exposes bottlenecks fora kernel and suggests several optimizations which programmers canuse to fine tune the kernel on a given system

A number of extensions of Roofline have been proposed [1022 33 56] and some of these extensions are architecture specific

For example targeting GPUs [22] vector processors [39] and FP-GAs [10 56]

Despite the similarities roofline and its extensions cannot be usedfor exposing design bottlenecks in an acceleratorrsquos interface Theprimary goal of roofline models has been to help programmers andcompiler writer while LogCA provides more insights for architects

7 CONCLUSION AND FUTURE WORKWith the recent trend towards heterogeneous computing we feelthat the architecture community lacks a model to reason about theneed of accelerators In this respect we propose LogCAmdashan insight-ful visual performance model for hardware accelerators LogCAprovides insights early in the design stage to both architects andprogrammers and identifies performance bounds exposes interfacedesign bottlenecks and suggest optimizations to alleviate these bot-tlenecks We have validated our model across a range of on-chip andoff-chip accelerators and have shown its utility using retrospectivestudies describing the evolution of acceleratorrsquos interface in thesearchitectures

The applicability of LogCA can be limited by our simplifying as-sumptions and for more realistic analysis we plan to overcome theselimitations in our future work For example We also assume a singleaccelerator system and do not explicitly model contention amongresources Our model should handle multi-accelerator and pipelinedscenarios For fixed function accelerators our design space is cur-rently limited to encryption and hashing kernels To overcome thiswe are expanding our design space with compression and databaseaccelerators in Oracle M7 processor We also plan to complementLogCA with an energy model as energy efficiency is a prime designmetric for accelerators

ACKNOWLEDGEMENTSWe thank our anonymous reviewers Arkaprava Basu and TonyNowatzki for their insightful comments and feedback on the paperThanks to Mark Hill Michael Swift Wisconsin Computer Archi-tecture Affiliates and other members of the Multifacet group fortheir valuable discussions We also thank Brian Wilson at Universityof Wisconsin DoIT and Eric Sedlar at Oracle Labs for providingaccess to SPARC T3 and T4 servers respectively Also thanks toMuhammad Umair Bin Altaf for his help in the formulation Thiswork is supported in part by the National Science Foundation (CNS-1302260 CCF-1438992 CCF-1533885 CCF- 1617824) Googleand the University of Wisconsin-Madison (Amar and Balindar SohiProfessorship in Computer Science) Wood has a significant financialinterest in AMD and Google

REFERENCES[1] Advanced Micro Devices 2016 APP SDK - A Complete Development Platform

Advanced Micro Devices httpdeveloperamdcomtools-and-sdksopencl-zoneamd-accelerated-parallel-processing-app-sdk

[2] Gene M Amdahl 1967 Validity of the single processor approach to achievinglarge scale computing capabilities Proceedings of the April 18-20 1967 springjoint computer conference on - AFIPS rsquo67 (Spring) (1967) 483 httpsdoiorg10114514654821465560

[3] Dan Anderson 2012 How to tell if SPARC T4 crypto is being used httpsblogsoraclecomDanXentryhow_to_tell_if_sparc

[4] Krste Asanovic Rastislav Bodik James Demmel Tony Keaveny Kurt KeutzerJohn Kubiatowicz Nelson Morgan David Patterson Koushik Sen John

LogCA A High-Level Performance Model for Hardware Accelerators ISCA rsquo17 June 24-28 2017 Toronto ON Canada

Wawrzynek David Wessel and Katherine Yelick 2009 A View of the Par-allel Computing Landscape Commun ACM 52 10 (oct 2009) 56ndash67 httpsdoiorg10114515627641562783

[5] Nathan Beckmann and Daniel Sanchez 2016 Cache Calculus Modeling Cachesthrough Differential Equations Computer Architecture Letters PP 99 (2016)1 httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=7366753$delimiter026E30F$npapers3publicationdoi101109LCA20152512873

[6] C Cascaval S Chatterjee H Franke K J Gildea and P Pattnaik 2010 Ataxonomy of accelerator architectures and their programming models IBMJournal of Research and Development 54 (2010) 51ndash510 httpsdoiorg101147JRD20102059721

[7] Eric S Chung Peter a Milder James C Hoe and Ken Mai 2010 Single-ChipHeterogeneous Computing Does the Future Include Custom Logic FPGAs andGPGPUs 2010 43rd Annual IEEEACM International Symposium on Microar-chitecture (dec 2010) 225ndash236 httpsdoiorg101109MICRO201036

[8] Jason Cong Zhenman Fang Michael Gill and Glenn Reinman 2015 PARADEA Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Archi-tectural Design and Exploration In 2015 IEEEACM International Conference onComputer-Aided Design Austin TX

[9] D Culler R Karp D Patterson and A Sahay 1993 LogP Towards a realisticmodel of parallel computation In Proceedings of the Fourth ACM SIGPLANSymposium on Principles and Practice of Parallel Programming 1ndash12 httpdlacmorgcitationcfmid=155333

[10] Bruno da Silva An Braeken Erik H DrsquoHollander and Abdellah Touhafi 2013Performance Modeling for FPGAs Extending the Roofline Model with High-level Synthesis Tools Int J Reconfig Comput 2013 (jan 2013) 77mdash-77httpsdoiorg1011552013428078

[11] Mayank Daga Ashwin M Aji and Wu-chun Feng 2011 On the Efficacy of aFused CPU+GPU Processor (or APU) for Parallel Computing In 2011 Symposiumon Application Accelerators in High-Performance Computing Ieee 141ndash149httpsdoiorg101109SAAHPC201129

[12] Hadi Esmaeilzadeh Emily Blem Renee St Amant Karthikeyan Sankaralingamand Doug Burger 2011 Dark silicon and the end of multicore scaling InProceeding of the 38th annual international symposium on Computer archi-tecture - ISCA rsquo11 ACM Press New York New York USA 365 httpsdoiorg10114520000642000108

[13] H Franke J Xenidis C Basso B M Bass S S Woodward J D Brown andC L Johnson 2010 Introduction to the wire-speed processor and architectureIBM Journal of Research and Development 54 (2010) 31ndash311 httpsdoiorg101147JRD20092036980

[14] Venkatraman Govindaraju Chen Han Ho and Karthikeyan Sankaralingam 2011Dynamically specialized datapaths for energy efficient computing In Proceedings- International Symposium on High-Performance Computer Architecture 503ndash514httpsdoiorg101109HPCA20115749755

[15] Shay Gueron 2012 Intel Advanced Encryption Standard (AES) Instructions SetTechnical Report Intel Corporation httpssoftwareintelcomsitesdefaultfilesarticle165683aes-wp-2012-09-22-v01pdf

[16] Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex SolomatnikovBenjamin C Lee Stephen Richardson Christos Kozyrakis and Mark Horowitz2010 Understanding sources of inefficiency in general-purpose chips Proceed-ings of the 37th annual international symposium on Computer architecture - ISCA

rsquo10 (2010) 37 httpsdoiorg10114518159611815968[17] Mark Hempstead Gu-Yeon Wei and David Brooks 2009 Navigo An early-

stage model to study power-constrained architectures and specialization In ISCAWorkshop on Modeling Benchmarking and Simulations (MoBS)

[18] John L Hennessy and David A Patterson 2006 Computer Architecture FourthEdition A Quantitative Approach 704 pages httpsdoiorg10111151881

[19] Mark D Hill and Michael R Marty 2008 Amdahlrsquos Law in the Multicore EraComputer 41 7 (jul 2008) 33ndash38 httpsdoiorg101109MC2008209

[20] Sunpyo Hong and Hyesoon Kim 2009 An analytical model for a GPU architec-ture with memory-level and thread-level parallelism awareness In Proceedingsof the 36th Annual International Symposium on Computer Architecture Vol 37152ndash163 httpsdoiorg10114515558151555775

[21] Sunpyo Hong and Hyesoon Kim 2010 An integrated GPU power and perfor-mance model In Proceedings of the 37th Annual International Symposium onComputer Architecture Vol 38 280mdash-289 httpsdoiorg10114518160381815998

[22] Haipeng Jia Yunquan Zhang Guoping Long Jianliang Xu Shengen Yan andYan Li 2012 GPURoofline A Model for Guiding Performance Optimizations onGPUs In Proceedings of the 18th International Conference on Parallel Processing(Euro-Parrsquo12) Springer-Verlag Berlin Heidelberg 920ndash932 httpsdoiorg101007978-3-642-32820-6_90

[23] Onur Kocberber Boris Grot Javier Picorel Babak Falsafi Kevin Lim andParthasarathy Ranganathan 2013 Meet the Walkers Accelerating Index Traver-sals for In-memory Databases In Proceedings of the 46th Annual IEEEACMInternational Symposium on Microarchitecture (MICRO-46) ACM New YorkNY USA 468ndash479 httpsdoiorg10114525407082540748

[24] Karthik Kumar Jibang Liu Yung Hsiang Lu and Bharat Bhargava 2013 Asurvey of computation offloading for mobile systems Mobile Networks andApplications 18 (2013) 129ndash140 httpsdoiorg101007s11036-012-0368-0

[25] Snehasish Kumar Naveen Vedula Arrvindh Shriraman and Vijayalakshmi Srini-vasan 2015 DASX Hardware Accelerator for Software Data Structures InProceedings of the 29th ACM on International Conference on Supercomputing(ICS rsquo15) ACM New York NY USA 361ndash372 httpsdoiorg10114527512052751231

[26] Maysam Lavasani Hari Angepat and Derek Chiou 2014 An FPGA-based In-Line Accelerator for Memcached IEEE Comput Archit Lett 13 2 (jul 2014)57ndash60 httpsdoiorg101109L-CA201317

[27] John D C Little and Stephen C Graves 2008 Littlersquos law In Building intuitionSpringer 81ndash100

[28] U Lopez-Novoa A Mendiburu and J Miguel-Alonso 2015 A Survey of Perfor-mance Modeling and Simulation Techniques for Accelerator-Based ComputingParallel and Distributed Systems IEEE Transactions on 26 1 (jan 2015) 272ndash281httpsdoiorg101109TPDS20142308216

[29] M R Meswani L Carrington D Unat A Snavely S Baden and S Poole2013 Modeling and predicting performance of high performance computingapplications on hardware accelerators International Journal of High Perfor-mance Computing Applications 27 (2013) 89ndash108 httpsdoiorg1011771094342012468180

[30] National Institute of Standards and Technology 2001 Advanced EncryptionStandard (AES) National Institute of Standards and Technology httpsdoiorg106028NISTFIPS197

[31] National Institute of Standards and Technology 2008 Secure Hash StandardNational Institute of Standards and Technology httpcsrcnistgovpublicationsfipsfips180-3fips180-3_finalpdf

reduces to 512KB. With the elimination of data copying over the PCIe bus, the overheads are still high, suggesting high-overhead OS calls to access the APU. With less aggressive cores, the cut-off granularity for acceleration reduces to 64KB, implying more optimization potential for acceleration.
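These cut-off points are the model's break-even granularities. As a sketch (our own illustration, assuming the linear-complexity case, i.e., $\beta = 1$, with granularity-independent latency), setting the speedup to one and solving for $g$ gives

\[ \mathit{Speedup}(g) \;=\; \frac{C\,g}{o + L + C\,g/A} \;=\; 1 \quad\Longrightarrow\quad g_1 \;=\; \frac{(o+L)\,A}{C\,(A-1)}, \]

so larger overheads $o$ and latencies $L$ push the cut-off granularity up, while a larger computational index $C$ pulls it down.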

Figure 11(c) shows three optimization regions for the HSA-enabled integrated GPU. We observe that latency is absent in all regions and the cut-off granularity for overheads reduces to 8KB. These reductions in overheads and latencies signify a simpler interface as compared to the discrete GPU and APU. We also observe that the cut-off granularity for acceleration drops to 2KB, suggesting higher potential for optimizing acceleration.
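To make such cut-off granularities concrete, the following is a minimal numerical sketch, not code from the paper: it assumes the granularity-dependent-latency form of the model, Speedup(g) = C·g^β / (o + L·g + C·g^β/A), and uses purely illustrative parameter values to locate the break-even granularity by bisection.

```python
# Hedged sketch of the LogCA speedup curve; all parameter values are
# illustrative placeholders, not measurements from the paper.

def speedup(g, o, L, C, A, beta=1.0):
    """Speedup at granularity g (bytes), assuming
    T_unacc(g) = C*g**beta and T_acc(g) = o + L*g + C*g**beta/A."""
    unaccelerated = C * g ** beta
    accelerated = o + L * g + unaccelerated / A
    return unaccelerated / accelerated

def break_even(o, L, C, A, beta=1.0, lo=1.0, hi=2.0**40):
    """Smallest granularity with speedup >= 1, found by bisection.
    Assumes the speedup curve is monotonically increasing in g."""
    if speedup(hi, o, L, C, A, beta) < 1.0:
        return None  # offloading never pays off for these parameters
    while hi - lo > 0.5:
        mid = (lo + hi) / 2.0
        if speedup(mid, o, L, C, A, beta) >= 1.0:
            hi = mid
        else:
            lo = mid
    return hi

# A high-overhead interface pushes the break-even point to large offloads;
# cutting o by 100x pulls it back down (cf. the APU vs. HSA discussion).
print(break_even(o=1e6, L=10.0, C=50.0, A=20.0))  # ~2.7e4 bytes
print(break_even(o=1e4, L=10.0, C=50.0, A=20.0))  # ~2.7e2 bytes
```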

6 RELATED WORK

We compare and contrast our work with prior approaches. Lopez-Novoa et al. [28] provide a detailed survey of various accelerator modeling techniques. We broadly classify these techniques into two categories and discuss the most relevant work.

Analytical Models. There is a rich body of work exploring analytical models for performance prediction of accelerators. For some models, the motivation is to determine the future trend in heterogeneous architectures. Chung et al. [7], in a detailed study, predict the future landscape of heterogeneous computing. Hempstead et al. [17] propose Navigo, an early-stage model that determines the fraction of area required for accelerators to maintain the traditional performance trend. Nilakantan et al. [32] propose to incorporate communication cost in early-stage modeling of accelerator-rich architectures. For others, the motivation is to determine the right amount of data to offload and the potential benefits associated with an accelerator [24].

Some analytical models are architecture specific. For example, a number of studies [20, 21, 44, 57] predict the performance of GPU architectures. Hong et al. [20] present an analytical performance model for predicting execution time on GPUs. They later extend their model and develop an integrated power and performance model for GPUs [21]. Song et al. [44] use a simple counter-based approach to predict power and performance. Meswani et al. [29] explore such models for high performance applications. Daga et al. [11] analyze the effectiveness of Accelerated Processing Units (APUs) over GPUs and describe the communication cost over the PCIe bus as a major bottleneck in exploiting the full potential of GPUs.

In general, our work differs from these studies in complexity. These models use a large number of parameters to accurately predict power and/or performance, whereas we limit the number of parameters to reduce the complexity of our model. They also require a deep understanding of the underlying architecture, and most require access to GPU-specific assembly or PTX code. Unlike these approaches, we use CPU code to provide bounds on the performance.
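As a hedged illustration of what fitting such CPU-side bounds could look like (our own sketch, not code from the paper), the computational index C and complexity exponent β can be recovered from host-only timings by a log-log least-squares fit, assuming T0(g) = C·g^β; the granularities and timings below are placeholders.

```python
# Hedged sketch: estimating C and beta from CPU-only timings,
# assuming T0(g) = C * g**beta. Data below are illustrative.
import math

def fit_C_beta(granularities, times):
    """Least-squares fit of log T0 = log C + beta * log g."""
    xs = [math.log(g) for g in granularities]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)
    C = math.exp(my - beta * mx)
    return C, beta

# e.g., an AES-like linear kernel: time roughly proportional to bytes
gs = [2 ** k for k in range(10, 21)]
ts = [5e-8 * g for g in gs]      # placeholder measurements
print(fit_C_beta(gs, ts))        # ~ (5e-8, 1.0)
```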

Roofline Models. In terms of simplicity and motivation, our work closely matches the Roofline model [54], a visual performance model for multi-core architectures. Roofline exposes bottlenecks for a kernel and suggests several optimizations which programmers can use to fine-tune the kernel on a given system.

A number of extensions of Roofline have been proposed [10, 22, 33, 56], and some of these extensions are architecture specific: for example, targeting GPUs [22], vector processors [39], and FPGAs [10, 56].

Despite the similarities, Roofline and its extensions cannot be used for exposing design bottlenecks in an accelerator's interface. The primary goal of Roofline models has been to help programmers and compiler writers, while LogCA provides more insights for architects.

7 CONCLUSION AND FUTURE WORK

With the recent trend towards heterogeneous computing, we feel that the architecture community lacks a model to reason about the need for accelerators. In this respect, we propose LogCA, an insightful visual performance model for hardware accelerators. LogCA provides insights early in the design stage to both architects and programmers: it identifies performance bounds, exposes interface design bottlenecks, and suggests optimizations to alleviate these bottlenecks. We have validated our model across a range of on-chip and off-chip accelerators, and have shown its utility using retrospective studies describing the evolution of the accelerator's interface in these architectures.

The applicability of LogCA can be limited by our simplifying assumptions, and for more realistic analysis we plan to overcome these limitations in future work. For example, we assume a single-accelerator system and do not explicitly model contention among resources; our model should be extended to handle multi-accelerator and pipelined scenarios. For fixed-function accelerators, our design space is currently limited to encryption and hashing kernels; to overcome this, we are expanding our design space with the compression and database accelerators in the Oracle M7 processor. We also plan to complement LogCA with an energy model, as energy efficiency is a prime design metric for accelerators.

ACKNOWLEDGEMENTS

We thank our anonymous reviewers, Arkaprava Basu, and Tony Nowatzki for their insightful comments and feedback on the paper. Thanks to Mark Hill, Michael Swift, the Wisconsin Computer Architecture Affiliates, and other members of the Multifacet group for their valuable discussions. We also thank Brian Wilson at University of Wisconsin DoIT and Eric Sedlar at Oracle Labs for providing access to SPARC T3 and T4 servers, respectively. Also thanks to Muhammad Umair Bin Altaf for his help in the formulation. This work is supported in part by the National Science Foundation (CNS-1302260, CCF-1438992, CCF-1533885, CCF-1617824), Google, and the University of Wisconsin-Madison (Amar and Balindar Sohi Professorship in Computer Science). Wood has a significant financial interest in AMD and Google.

REFERENCES

[1] Advanced Micro Devices. 2016. APP SDK - A Complete Development Platform. Advanced Micro Devices. http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk

[2] Gene M. Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference - AFIPS '67 (Spring) (1967), 483. https://doi.org/10.1145/1465482.1465560

[3] Dan Anderson. 2012. How to tell if SPARC T4 crypto is being used. https://blogs.oracle.com/DanX/entry/how_to_tell_if_sparc

[4] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. 2009. A View of the Parallel Computing Landscape. Commun. ACM 52, 10 (Oct. 2009), 56–67. https://doi.org/10.1145/1562764.1562783

[5] Nathan Beckmann and Daniel Sanchez. 2016. Cache Calculus: Modeling Caches through Differential Equations. Computer Architecture Letters PP, 99 (2016), 1. https://doi.org/10.1109/LCA.2015.2512873

[6] C. Cascaval, S. Chatterjee, H. Franke, K. J. Gildea, and P. Pattnaik. 2010. A taxonomy of accelerator architectures and their programming models. IBM Journal of Research and Development 54 (2010), 5:1–5:10. https://doi.org/10.1147/JRD.2010.2059721

[7] Eric S. Chung, Peter A. Milder, James C. Hoe, and Ken Mai. 2010. Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Dec. 2010), 225–236. https://doi.org/10.1109/MICRO.2010.36

[8] Jason Cong, Zhenman Fang, Michael Gill, and Glenn Reinman. 2015. PARADE: A Cycle-Accurate Full-System Simulation Platform for Accelerator-Rich Architectural Design and Exploration. In 2015 IEEE/ACM International Conference on Computer-Aided Design, Austin, TX.

[9] D. Culler, R. Karp, D. Patterson, and A. Sahay. 1993. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1–12. http://dl.acm.org/citation.cfm?id=155333

[10] Bruno da Silva, An Braeken, Erik H. D'Hollander, and Abdellah Touhafi. 2013. Performance Modeling for FPGAs: Extending the Roofline Model with High-level Synthesis Tools. Int. J. Reconfig. Comput. 2013 (Jan. 2013), 77–77. https://doi.org/10.1155/2013/428078

[11] Mayank Daga, Ashwin M. Aji, and Wu-chun Feng. 2011. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing. In 2011 Symposium on Application Accelerators in High-Performance Computing. IEEE, 141–149. https://doi.org/10.1109/SAAHPC.2011.29

[12] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceeding of the 38th annual international symposium on Computer architecture - ISCA '11. ACM Press, New York, New York, USA, 365. https://doi.org/10.1145/2000064.2000108

[13] H. Franke, J. Xenidis, C. Basso, B. M. Bass, S. S. Woodward, J. D. Brown, and C. L. Johnson. 2010. Introduction to the wire-speed processor and architecture. IBM Journal of Research and Development 54 (2010), 3:1–3:11. https://doi.org/10.1147/JRD.2009.2036980

[14] Venkatraman Govindaraju, Chen-Han Ho, and Karthikeyan Sankaralingam. 2011. Dynamically specialized datapaths for energy efficient computing. In Proceedings - International Symposium on High-Performance Computer Architecture. 503–514. https://doi.org/10.1109/HPCA.2011.5749755

[15] Shay Gueron. 2012. Intel Advanced Encryption Standard (AES) Instructions Set. Technical Report. Intel Corporation. https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf

[16] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. Proceedings of the 37th annual international symposium on Computer architecture - ISCA '10 (2010), 37. https://doi.org/10.1145/1815961.1815968

[17] Mark Hempstead, Gu-Yeon Wei, and David Brooks. 2009. Navigo: An early-stage model to study power-constrained architectures and specialization. In ISCA Workshop on Modeling, Benchmarking, and Simulations (MoBS).

[18] John L. Hennessy and David A. Patterson. 2006. Computer Architecture, Fourth Edition: A Quantitative Approach. 704 pages.

[19] Mark D. Hill and Michael R. Marty. 2008. Amdahl's Law in the Multicore Era. Computer 41, 7 (July 2008), 33–38. https://doi.org/10.1109/MC.2008.209

[20] Sunpyo Hong and Hyesoon Kim. 2009. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proceedings of the 36th Annual International Symposium on Computer Architecture, Vol. 37. 152–163. https://doi.org/10.1145/1555815.1555775

[21] Sunpyo Hong and Hyesoon Kim. 2010. An integrated GPU power and performance model. In Proceedings of the 37th Annual International Symposium on Computer Architecture, Vol. 38. 280–289. https://doi.org/10.1145/1816038.1815998

[22] Haipeng Jia, Yunquan Zhang, Guoping Long, Jianliang Xu, Shengen Yan, and Yan Li. 2012. GPURoofline: A Model for Guiding Performance Optimizations on GPUs. In Proceedings of the 18th International Conference on Parallel Processing (Euro-Par'12). Springer-Verlag, Berlin, Heidelberg, 920–932. https://doi.org/10.1007/978-3-642-32820-6_90

[23] Onur Kocberber, Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, and Parthasarathy Ranganathan. 2013. Meet the Walkers: Accelerating Index Traversals for In-memory Databases. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 468–479. https://doi.org/10.1145/2540708.2540748

[24] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. 2013. A survey of computation offloading for mobile systems. Mobile Networks and Applications 18 (2013), 129–140. https://doi.org/10.1007/s11036-012-0368-0

[25] Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman, and Vijayalakshmi Srinivasan. 2015. DASX: Hardware Accelerator for Software Data Structures. In Proceedings of the 29th ACM on International Conference on Supercomputing (ICS '15). ACM, New York, NY, USA, 361–372. https://doi.org/10.1145/2751205.2751231

[26] Maysam Lavasani, Hari Angepat, and Derek Chiou. 2014. An FPGA-based In-Line Accelerator for Memcached. IEEE Comput. Archit. Lett. 13, 2 (July 2014), 57–60. https://doi.org/10.1109/L-CA.2013.17

[27] John D. C. Little and Stephen C. Graves. 2008. Little's law. In Building Intuition. Springer, 81–100.

[28] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso. 2015. A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. Parallel and Distributed Systems, IEEE Transactions on 26, 1 (Jan. 2015), 272–281. https://doi.org/10.1109/TPDS.2014.2308216

[29] M. R. Meswani, L. Carrington, D. Unat, A. Snavely, S. Baden, and S. Poole. 2013. Modeling and predicting performance of high performance computing applications on hardware accelerators. International Journal of High Performance Computing Applications 27 (2013), 89–108. https://doi.org/10.1177/1094342012468180

[30] National Institute of Standards and Technology. 2001. Advanced Encryption Standard (AES). https://doi.org/10.6028/NIST.FIPS.197

[31] National Institute of Standards and Technology. 2008. Secure Hash Standard. http://csrc.nist.gov/publications/fips/fips180-3/fips180-3_final.pdf

[32] S. Nilakantan, S. Battle, and M. Hempstead. 2013. Metrics for Early-Stage Modeling of Many-Accelerator Architectures. Computer Architecture Letters 12, 1 (Jan. 2013), 25–28. https://doi.org/10.1109/L-CA.2012.9

[33] Cedric Nugteren and Henk Corporaal. 2012. The Boat Hull Model: Enabling Performance Prediction for Parallel Computing Prior to Code Development. In Proceedings of the 9th Conference on Computing Frontiers. ACM, 203–212.

[34] OpenSSL Software Foundation. 2015. OpenSSL: Cryptography and SSL/TLS Toolkit. https://openssl.org

[35] Sanjay Patel. 2009. Sun's Next-Generation Multithreaded Processor: Rainbow Falls. In 21st Hot Chips Symposium. http://www.hotchips.org/wp-content/uploads/hc

[36] Sanjay Patel and Wen-mei W. Hwu. 2008. Accelerator Architectures. IEEE Micro 28, 4 (July 2008), 4–12. https://doi.org/10.1109/MM.2008.50

[37] Stephen Phillips. 2014. M7: Next Generation SPARC. In 26th Hot Chips Symposium.

[38] Phil Rogers. 2013. Heterogeneous system architecture overview. In Hot Chips.

[39] Yoshiei Sato, Ryuichi Nagaoka, Akihiro Musa, Ryusuke Egawa, Hiroyuki Takizawa, Koki Okabe, and Hiroaki Kobayashi. 2009. Performance tuning and analysis of future vector processors based on the roofline model. Proceedings of the 10th MEDEA Workshop on MEmory performance: DEaling with Applications, systems and architecture - MEDEA '09 (2009), 7. https://doi.org/10.1145/1621960.1621962

[40] M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Sana, D. Sheahan, L. Spracklen, and A. Wynn. 2007. UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In Solid-State Circuits Conference, 2007. ASSCC '07. IEEE Asian. 22–25. https://doi.org/10.1109/ASSCC.2007.4425786

[41] Manish Shah, Robert Golla, Gregory Grohoski, Paul Jordan, Jama Barreh, Jeffrey Brooks, Mark Greenberg, Gideon Levinsky, Mark Luttrell, Christopher Olson, Zeid Samoail, Matt Smittle, and Thomas Ziaja. 2012. Sparc T4: A dynamically threaded server-on-a-chip. IEEE Micro 32 (2012), 8–19. https://doi.org/10.1109/MM.2012.1

[42] Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, and David Brooks. 2014. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures. In International Symposium on Computer Architecture (ISCA).

[43] Soekris Engineering. 2016. vpn1401 for Std. PCI-sockets. Soekris Engineering. http://soekris.com/products/vpn-1401.html

[44] Shuaiwen Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In Proceedings - IEEE 27th International Parallel and Distributed Processing Symposium, IPDPS 2013. 673–686. https://doi.org/10.1109/IPDPS.2013.73

[45] Jeff Stuecheli. 2013. POWER8. In 25th Hot Chips Symposium.

[46] Ning Sun and Chi-Chang Lin. 2007. Using the Cryptographic Accelerators in the UltraSPARC T1 and T2 Processors. Technical Report. http://www.oracle.com/technetwork/server-storage/solaris/documentation/819-5782-150147.pdf


[47] S. Tabik, G. Ortega, and E. M. Garzón. 2014. Performance evaluation of kernel fusion BLAS routines on the GPU: iterative solvers as case study. The Journal of Supercomputing 70, 2 (Nov. 2014), 577–587. https://doi.org/10.1007/s11227-014-1102-4

[48] Y. C. Tay. 2013. Analytical Performance Modeling for Computer Systems (2nd ed.). Morgan & Claypool Publishers.

[49] M. B. Taylor. 2012. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE. 1131–1136. https://doi.org/10.1145/2228360.2228567

[50] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor. 2010. Conservation cores: Reducing the energy of mature computations. In International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS. 205–218. https://doi.org/10.1145/1736020.1736044

[51] Ganesh Venkatesh, Jack Sampson, Nathan Goulding-Hotta, Sravanthi Kota Venkata, Michael Bedford Taylor, and Steven Swanson. 2011. QsCores: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-44 '11. 163. https://doi.org/10.1145/2155620.2155640

[52] Guibin Wang, Yisong Lin, and Wei Yi. 2010. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. In Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on Cyber, Physical and Social Computing (CPSCom). 344–350. https://doi.org/10.1109/GreenCom-CPSCom.2010.102

[53] Eric W. Weisstein. 2015. Newton's Method. From MathWorld - A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsMethod.html

[54] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52 (2009), 65–76. https://doi.org/10.1145/1498765.1498785

[55] Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, and Kenneth A. Ross. 2014. Q100: The Architecture and Design of a Database Processing Unit. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). ACM, New York, NY, USA, 255–268. https://doi.org/10.1145/2541940.2541961

[56] Moein Pahlavan Yali. 2014. FPGA-Roofline: An Insightful Model for FPGA-based Hardware Accelerators in Modern Embedded Systems. Ph.D. Dissertation. Virginia Polytechnic Institute and State University.

[57] Yao Zhang and John D. Owens. 2011. A quantitative performance analysis model for GPU architectures. In Proceedings - International Symposium on High-Performance Computer Architecture. 382–393. https://doi.org/10.1109/HPCA.2011.5749745
