ParaFormance™: Democratising Parallel Software
Chris Brown
A Scottish Startup
• £600k of Scottish Enterprise grant money so far…
• … built on over £7M of EU funding
• Looking to spin out from the University of St Andrews
• A team of 4 full-time software developers
• Looking for pre-revenue investment
• Looking for triallists!
“KiloCore”: the world’s first 1000-core processor
• 1,000 independent programmable processors
• Designed by a team at the University of California, Davis
• Executes up to 1.78 trillion instructions per second and contains 621 million transistors
• Each processor is independently clocked and can shut itself down to further save energy
• 1,000 processors execute 115 billion instructions per second using only 0.7 Watts
• Could be powered by a single AA battery
Parallel Libraries
1. OpenMP: pragma based
2. Intel TBB: pattern based
3. Others: MPI, PThreads, FastFlow, …
Parallelism Discovery
• Profiles the execution of the application
• Locates “hot spots” of computation
• Goal is to find instances of patterns and inform the user of the “best” pattern to choose
Safety Checking
Checks code for potential thread-safety violations using static analysis:
• Race conditions
• Array collisions
• Variable accesses
• Private variables
• Critical regions
Automatic Repairing
• Repairs code to make it “thread safe”
• Refactors code to remove potential sources of thread-safety violations:
  • Introduces local variables
  • Resolves array collisions
Refactoring
• Rewrites code into a parallel version
• Portable across a range of different types of parallelism: TBB, OpenMP, Pthreads, etc.
Examples of ParaFormance
ParaFormance™ is designed to be general, and we have tried it on many different types of application:
machine learning, ant colony optimisation, linear programming, image processing, CFD, …
Weather Forecasting
Initial results of ParaFormance™ on a weather forecasting application:
• 2.5 million lines of code
• 300+ files
• 1200+ potential sources of parallelism
• ParaFormance narrows these down to 27 possible parallelism sites
• 1 month of manual effort reduced to only 5 minutes!
Comparison of Development Times
Application      | Manual Time | Refactoring Time
Convolution      | 24 hours    | 3 hours
Ant Colony       | 8 hours     | 1 hour
BasicN2          | 40 hours    | 5 hours
Graphical Lasso  | 15 hours    | 2 hours
Comparable Performance
[Figure: two speedup plots. Left: speedups for Convolution as the number of workers in the second pipeline stage varies (1 to 16), for several sizes of the first stage (1 to 10). Right: speedups for Ant Colony Optimisation, BasicN2 and Graphical Lasso (up to 24 workers), with refactored versions plotted alongside their manual implementations.]
Figure 3. Refactored Use Case Results in FastFlow
code and simply points the refactoring tool towards them. The actual parallelisation is then performed by the refactoring tool, supervised by the programmer. This can give significant savings in effort, of about one order of magnitude. This is achieved without major performance losses: as desired, the speedups achieved with the refactoring tool are approximately the same as for full-scale manual implementations by an expert. In the future, we expect to develop this work in a number of new directions, including adding advanced performance models to the refactoring process, allowing the user to accurately predict the parallel performance of applying a particular refactoring with a specified number of threads. This may be particularly useful when porting applications to different architectures, including adding refactoring support for GPU programming in OpenCL. Also, once sufficient automation of the refactoring tool is achieved, the best parametrisation for parallel efficiency can be determined via optimisation, further facilitating this approach. In addition, we plan to implement more skeletons, particularly in the fields of computer algebra and physics, and to demonstrate the refactoring approach with these new skeletons on a wide range of realistic applications. This will add to the evidence that our approach is general, usable and scalable. Finally, we intend to investigate the limits of scalability that we have observed for some of our use cases, aiming to determine whether the limits are hardware artefacts or algorithmic.
REFERENCES
[1] M. Aldinucci, M. Danelutto, P. Kilpatrick and M. Torquati. FastFlow: High-Level and Efficient Streaming on Multi-Core. In Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, Chap. 13, Wiley, 2013.
[2] M. P. Allen. Introduction to Molecular Dynamics Simulation. Computational Soft Matter: From Synthetic Polymers to Proteins, 23:1–28, 2004.
[3] M. den Besten, T. Stuetzle and M. Dorigo. Ant Colony Optimization for the Total Weighted Tardiness Problem. PPSN 6, pp. 611–620, Sept. 2000.
[4] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick and A. Elliott. Cost-Directed Refactoring for Parallel Erlang Programs. International Journal of Parallel Programming, HLPP 2013 Special Issue, Springer, Paris, September 2013. DOI 10.1007/s10766-013-0266-5.
[5] C. Brown, K. Hammond, M. Danelutto and P. Kilpatrick. A Language-Independent Parallel Refactoring Framework. In Proc. of the Fifth Workshop on Refactoring Tools (WRT ’12), pp. 54–58, ACM, New York, USA, 2012.
[6] C. Brown, H. Li and S. Thompson. An Expression Processor: A Case Study in Refactoring Haskell Programs. Eleventh Symp. on Trends in Func. Prog., May 2010.
[7] C. Brown, H. Loidl and K. Hammond. Paraforming: Forming Haskell Programs using Novel Refactoring Techniques. 12th Symp. on Trends in Func. Prog., Spain, May 2011.
[8] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick, H. Schöner and T. Breddin. Paraphrasing: Generating Parallel Programs Using Refactoring. In 10th International Symposium, FMCO 2011, Turin, Italy, October 3–5, 2011, Revised Selected Papers, pp. 237–256, Springer, Berlin/Heidelberg.
[9] R. M. Burstall and J. Darlington. A Transformation System for Developing Recursive Programs. J. of the ACM, 24(1):44–67, 1977.
[10] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computations. Research Monographs in Par. and Distrib. Computing, Pitman, 1989.
[11] M. Cole. Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming. Par. Computing, 30(3):389–406, 2004.
[12] D. Dig. A Refactoring Approach to Parallelism. IEEE Softw., 28:17–22, January 2011.
[13] J. Friedman, T. Hastie and R. Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 9(3):432–441, July 2008.
[14] R. Loogen, Y. Ortega-Mallén and R. Peña-Marí. Parallel Func. Prog. in Eden. J. of Func. Prog., 15(3):431–475, 2005.
[15] T. Mens and T. Tourwé. A Survey of Software Refactoring. IEEE Trans. Softw. Eng., 30(2):126–139, 2004.
[16] H. Partsch and R. Steinbruggen. Program Transformation Systems. ACM Comput. Surv., 15(3):199–236, 1983.
[17] K. Hammond, M. Aldinucci, C. Brown, F. Cesarini, M. Danelutto, H. Gonzalez-Velez, P. Kilpatrick, R. Keller, T. Natschlager and G. Shainer. The ParaPhrase Project: Parallel Patterns for Adaptive Heterogeneous Multicore Systems. FMCO, Feb. 2012.
[18] K. Hammond, J. Berthold and R. Loogen. Automatic Skeletons in Template Haskell. Parallel Processing Letters, 13(3):413–424, September 2003.
[19] W. Opdyke. Refactoring Object-Oriented Frameworks. PhD Thesis, Dept. of Comp. Sci., University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1992.
[20] T. Sheard and S. P. Jones. Template Meta-Programming for Haskell. SIGPLAN Not., 37:60–75, December 2002.
[21] D. B. Skillicorn and W. Cai. A Cost Calculus for Parallel Functional Programming. J. Parallel Distrib. Comput., 28(1):65–83, 1995.
[22] J. Wloka, M. Sridharan and F. Tip. Refactoring for Reentrancy. In ESEC/FSE ’09, pp. 173–182, Amsterdam, ACM, 2009.
Image Convolution – 20 Cores!
[Figure: three speedup plots for Image Convolution, on titanic (1 to 24 threads), xookik (1 to 12 threads) and power8 (1 to 160 threads), with curves for OpenMP, TBB and FastFlow (FF) in the (s | m), (m | m) and m configurations.]
Fig. 10. Image Convolution speedups on titanic, xookik and power. Here, | is a parallel pipeline, m is a parallel map and s is a sequential stage.
1 for (j = 0; j < num_iter; j++) {
2   for (i = 0; i < num_ants; i++)
3     cost[i] = solve(i, p, d, w, t);
4   best_t = pick_best(&best_result);
5   for (i = 0; i < n; i++)
6     t[i] = update(i, best_t, best_result);
7 }
Since pick_best in Line 4 cannot start until all of the ants have computed their solutions, and the for loop that updates t cannot start until pick_best finishes, we have an implicit ordering in the code above. Therefore, the structure can be described in RPL with:
seq (solve) ; pick_best ; seq (update)
where ; denotes the ordering between computations. Due to the ordering between solve, pick_best and update, the only way to parallelise the sequential code is to convert seq (solve) and/or seq (update) into maps. Therefore, the possible parallelisations are:
1) map (solve) ; pick_best ; update
2) solve ; pick_best ; map (update)
3) map (solve) ; pick_best ; map (update)
Since solve dominates the computing time, we consider only parallelisations 1) and 3). Speedups for these two parallelisations, on titanic, xookik and power, with a varying number of CPU threads, are given in Figure 11. In the figure, we denote a map by m and a sequential stage by s. Therefore, (m ; s ; s) denotes that solve is a parallel map, while pick_best and update are sequential. From Figure 11, we can observe (similarly to the Image Convolution example) that the speedups are similar for all parallel libraries. The only exception is FastFlow on power, which gives slightly better speedup than the other libraries. Furthermore, the two parallelisations give approximately the same speedups, with the (m ; s ; m) parallelisation using more resources (threads) overall. This indicates that it is not always the best idea to parallelise everything that can be parallelised. Finally, we note that none of the libraries is able to achieve linear speedups, and on each system the speedups tail off after a certain number of threads. This is due to the fact that a lot of data is shared between threads, and data access is slower for cores that are farther from the data. The maximum speedups achieved are 12, 11 and 16 on titanic, xookik and power, respectively.
VI. RELATED WORK
Early work in refactoring is described in [22]. A good survey (as of 2004) can be found in [17]. There has so far been only a limited amount of work on refactoring for parallelism [4]. In [5], a parallel refactoring methodology for Erlang programs, including a refactoring tool, is introduced for skeletons in Erlang. Unlike the work presented here, the technique is limited to Erlang and does not evaluate reductions in development time. Other work on parallel refactoring has mostly considered loop parallelisation in Fortran [21] and Java [9]. However, these approaches are limited to concrete and simple structural changes (e.g. loop unrolling).
Parallel design patterns are provided as algorithmic skeletons in a number of different parallel programming frameworks [11], and several authors have advocated the massive use of patterns for writing parallel applications [16], [20] after the well-known Berkeley report [6] indicated parallel design patterns as a viable way to solve the problems related to developing parallel applications with traditional (low-level) parallel programming frameworks.
In algorithmic skeleton research frameworks, there is a lot of work on improving extra-functional features of parallel programs using pattern rewriting rules [15], [2], [13]. We use these rules to support design-space exploration in our system. Other authors use rewriting/refactoring to support efficient code generation from skeletons/patterns [12], which is a similar concept to our approach. Finally, [14] proposed a “parallel” embedded DSL exploiting annotations, unlike the external DSL that we use here. Other authors have proposed DSL approaches to parallel programming [18], [25] similar to what we propose here, although the DSLs proposed are embedded DSLs and mostly aim at targeting heterogeneous CPU/GPU hardware.
VII. CONCLUSIONS AND FUTURE WORK
In this paper, we presented a high-level domain-specific language, the Refactoring Pattern Language (RPL), that can be used to concisely and efficiently capture parallel patterns, and therefore describe the parallel structure of an application. RPL can
ParaFormance…
• Saves time and money
• Gets products to market faster
• De-risks for multi-core
• Requires less specialised software teams
• Increases developer team productivity
• Produces reliable software/products
• Allows more easily maintained projects
Why not give ParaFormance a free trial today?
www.paraformance.com