Evolution of Parallel Programming in HEP
F. Rademakers – CERN
International Workshop on Large Scale Computing (IWLSC), VECC, Kolkata, 9 Feb 2006
Outline
Why use parallel computing
Parallel computing concepts
Typical parallel problems
Amdahl’s law
Parallelism in HEP
Parallel data analysis in HEP
PIAF
PROOF
Conclusions
Why Parallelism
Two primary reasons:
  Save time – wall-clock time
  Solve larger problems

Other reasons:
  Taking advantage of non-local resources – the Grid
  Cost saving – using multiple “cheap” machines instead of paying for a supercomputer
  Overcoming memory constraints – single computers have finite memory resources; use many machines to create a very large memory

Limits to serial computing:
  Transmission speeds – the speed of a serial computer is directly dependent on how much data can move through the hardware; limits: the speed of light (30 cm/ns) and the transmission limit of copper wire (9 cm/ns)
  Limits to miniaturization
  Economic limitations

Ultimately, parallel computing is an attempt to maximize the infinite but seemingly scarce commodity called time
Parallel Computing Concepts
Parallel hardware:
  A single computer with multiple (possibly multi-core) processors
  An arbitrary number of computers connected by a network (LAN/WAN)
  A combination of both

Parallelizable computational problems:
  Can be broken apart into discrete pieces of work that can be solved simultaneously
  Can execute multiple program instructions at any moment in time
  Can be solved in less time with multiple compute resources than with a single compute resource
Parallel Computing Concepts
There are different ways to classify parallel computers (Flynn’s Taxonomy):

SISD – Single Instruction, Single Data
SIMD – Single Instruction, Multiple Data
MISD – Multiple Instruction, Single Data
MIMD – Multiple Instruction, Multiple Data
SISD
A serial (non-parallel) computer
Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
Single data: only one data stream is being used as input during any one clock cycle
Deterministic execution
Examples: most classical PC’s, single-CPU workstations and mainframes
SIMD
A type of parallel computer
Single instruction: all processing units execute the same instruction at any given clock cycle
Multiple data: each processing unit can operate on a different data element
This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network and a very large array of very small-capacity CPU’s
Best suited for specialized problems with a high degree of regularity, e.g. image processing
Synchronous and deterministic execution
Two varieties: processor arrays and vector pipelines
Examples (some extinct):
  Processor arrays: Connection Machine, MasPar MP-1, MP-2
  Vector pipelines: CDC 205, IBM 9000, Cray C90, Fujitsu, NEC SX-2
MISD
Multiple instruction: several instruction streams operate on a single data stream
Few actual examples of this class of parallel computer have ever existed
Some conceivable examples might be:
  Multiple frequency filters operating on a single signal stream
  Multiple cryptography algorithms attempting to crack a single coded message
MIMD
Currently the most common type of parallel computer
Multiple instruction: every processor may be executing a different instruction stream
Multiple data: every processor may be working with a different data stream
Execution can be synchronous or asynchronous, deterministic or non-deterministic
Examples: most current supercomputers, networked parallel computer “grids” and multi-processor SMP computers – including multi-CPU and multi-core PC’s
Relevant Terminology
Observed speedup = wall-clock time of serial execution / wall-clock time of parallel execution

Granularity:
  Coarse: relatively large amounts of computational work are done between communication events
  Fine: relatively small amounts of computational work are done between communication events

Parallel overhead: the amount of time required to coordinate parallel tasks, as opposed to doing useful work; typically:
  Task start-up time
  Synchronizations
  Data communications
  Software overhead imposed by parallel compilers, libraries, tools, OS, etc.
  Task termination time

Scalability: refers to a parallel system’s ability to demonstrate a proportional increase in parallel speedup with the addition of more processors

Embarrassingly parallel: a problem that splits into many independent tasks that need little or no communication between them
Typical Parallel Problems
Traditionally, parallel computing has been considered to be “the high-end of computing”:
  Weather and climate
  Chemical and nuclear reactions
  Biological: human genome
  Geological: seismic activity
  Electronic circuits

Today commercial applications are the driving force:
  Parallel databases, data mining
  Oil exploration
  Web search engines
  Computer-aided diagnosis in medicine
  Advanced graphics and virtual reality

The future: trends over the past 10 years – ever faster networks, distributed systems, and multi-processor, and now multi-core, computer architectures – suggest that parallelism is the future
Amdahl’s Law
Amdahl’s law states that the potential speedup is defined by the fraction of code (P) that can be parallelized:

  speedup = 1 / (1 - P)

If none of the code can be parallelized, P = 0 and the speedup = 1 (no speedup)
If all the code is parallelized, P = 1 and the speedup is infinite (in theory)
If 50% of the code can be parallelized, maximum speedup = 2, meaning the code will run twice as fast

Introducing the number of processors performing the parallel fraction of work, the relationship can be written as:

  speedup = 1 / (P/N + S)

where P = parallel fraction, N = number of processors and S = serial fraction
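As a quick worked illustration (the numbers are chosen for this example, they do not come from the slides): with P = 0.95 and N = 100,

  speedup = 1 / (0.95/100 + 0.05) = 1 / 0.0595 ≈ 16.8

so even a 95% parallel program can never be sped up by more than 1/S = 20, however many processors are added.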
Parallelism in HEP
Main areas of processing in HEP:

DAQ
  Typically highly parallel
  Processes in parallel a large number of detector modules or sub-detectors

Simulation
  No need for fine-grained track-level parallelism; a single event is not the end product
  Some attempts were made to introduce track-level parallelism in G3
  Typically job-level parallelism, resulting in a large number of files

Reconstruction
  Same as for simulation

Analysis
  Run over many events in parallel to obtain the final analysis results quickly
  Embarrassingly parallel, event-level parallelism
  Preferably interactive, for better control of and feedback on the analysis
  Main challenge: efficient data access
Parallel Data Analysis in HEP
Most parallel data analysis systems designed in the past and present are based on job-splitting scripts and batch queues:
  When the queues are full there is no parallelism
  Explicit parallelism
  Turn-around time dictated by the batch system scheduler and resource availability

Remarkably few attempts at truly interactive, implicitly parallel systems:
  PIAF
  PROOF
Classical Parallel Data Analysis

“Static” use of resources
Jobs frozen, 1 job / CPU
“Manual” splitting, merging
Limited monitoring (end of single job)

[Diagram: the data files are located via a catalog and split “by hand”; one job per file (running myAna.C) is submitted to the batch farm queues; the queue manager schedules the jobs against storage; the job outputs are then merged manually for the final analysis.]
Interactive Parallel Data Analysis

Farm perceived as extension of local PC
More dynamic use of resources
Automated splitting and merging
Real-time feedback

[Diagram: the client sends a query (data file list + myAna.C) to the master of the interactive farm; a scheduler and catalog resolve the files in storage; the master distributes the work, sends back merged feedback while the query runs, and returns the merged final outputs.]
PIAF
The Parallel Interactive Analysis Facility
First attempt at an interactive parallel analysis system
Extension of and based on the PAW system
Joint project between CERN/IT and Hewlett-Packard
Development started in 1992
Small production service opened for LEP users in 1993
  Up to 30 concurrent users
CERN PIAF cluster consisted of:
  8 HP PA-RISC machines
  FDDI interconnect
  512 MB RAM
  Few hundred GB disk
First observation of hyper-speedup using column-wise n-tuples
PIAF Architecture
Two-tier push architecture: Client → Master → Workers
  Master divides the total number of events by the number of workers and assigns each worker 1/n of the events to process

Pros
  Transparent

Cons
  Slowest node determines the time of completion
  Not adaptable to varying node loads
  No optimized data access strategies
  Required a homogeneous cluster
  Not scalable
PIAF Push Architecture

[Sequence diagram: the client issues Process(“ana.C”); the master pushes the macro and 1/n of the events to each of the N slaves via SendEvents(); each slave initializes, processes its share, and returns its histograms with SendObject(histo); the master adds the histograms, the client displays them, and the slaves wait for the next command.]
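To make the push model concrete, here is a minimal sketch of the static event partitioning it implies; this is illustrative C++ only, not PIAF code, and all names and numbers are hypothetical.

// Static "push" partitioning (illustrative only, not PIAF code):
// the master divides the events evenly over the workers up front,
// so the slowest worker determines when the whole query finishes.
#include <cstdio>

int main()
{
   const long long totalEvents = 1000000;   // hypothetical dataset size
   const int       nWorkers    = 8;         // hypothetical number of slaves

   const long long perWorker = totalEvents / nWorkers;

   for (int w = 0; w < nWorkers; ++w) {
      long long first = w * perWorker;
      // the last worker also absorbs the remainder of the integer division
      long long last  = (w == nWorkers - 1) ? totalEvents - 1 : first + perWorker - 1;
      // in PIAF this fixed range would be pushed to the worker once, at start-up
      std::printf("worker %d processes events [%lld, %lld]\n", w, first, last);
   }
   return 0;
}

Because each range is fixed at start-up, a slow or loaded node keeps the whole query waiting, which is exactly the weakness listed under “Cons” above.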
PROOF
Parallel ROOT Facility
Second-generation interactive parallel analysis system
Extension of and based on the ROOT system
Joint project between ROOT, LCG, ALICE and MIT
Proof of concept in 1997
Development picked up in 2002
PROOF in production in Phobos/BNL (with up to 150 CPU’s) since 2003
Second wave of developments started in 2005, following interest by the LHC experiments
PROOF Original Design Goals
Interactive parallel analysis on a heterogeneous cluster

Transparency
  Same selectors, same chain Draw(), etc. on PROOF as in a local session

Scalability
  Good and well understood (1000 nodes most extreme case)
  Extensive monitoring capabilities
  MLM (Multi-Level Master) improves scalability on wide-area clusters

Adaptability
  Partly achieved; the system handles varying load on cluster nodes
  MLM allows much better latencies on wide-area clusters
  No support yet for coming and going of worker nodes
PROOF Multi-Tier Architecture

[Diagram: client → master → sub-masters → workers across physically separated domains; the multi-level setup adapts to clusters of clusters or wide-area virtual clusters; a good connection is very important between some tiers and less important between others; data access is optimized for data locality or for efficient data server access.]
PROOF Pull Architecture

[Sequence diagram: the client issues Process(“ana.C”); each of the N slaves initializes and then repeatedly calls GetNextPacket() on the master’s packet generator, receiving work as (first event, number of events) packets such as (0,100), (100,100), (200,100), (300,40), (440,50), (590,60), with packet sizes varying per worker; each slave processes packet after packet, returns its histograms with SendObject(histo), and waits for the next command; the master adds the histograms and the client displays them.]
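A minimal sketch of the pull idea (illustrative C++ only, not the actual PROOF packetizer; names and packet sizes are hypothetical): workers ask for the next packet whenever they finish one, so faster workers automatically process more of the data.

// Dynamic "pull" scheduling (illustrative only, not PROOF code):
// the master hands out small packets of events on request, so the
// load balances itself over workers of different speed.
#include <cstdio>

class PacketGenerator {
public:
   PacketGenerator(long long totalEvents, long long packetSize)
      : fNext(0), fTotal(totalEvents), fPacket(packetSize) { }

   // Returns false once all events have been handed out.
   bool GetNextPacket(long long &first, long long &num)
   {
      if (fNext >= fTotal) return false;
      first  = fNext;
      num    = (fTotal - fNext < fPacket) ? fTotal - fNext : fPacket;
      fNext += num;
      return true;
   }

private:
   long long fNext, fTotal, fPacket;
};

int main()
{
   PacketGenerator gen(1000, 100);   // hypothetical: 1000 events, packets of 100
   long long first, num;
   // a real worker would issue this call over the network each time it
   // finishes its current packet
   while (gen.GetNextPacket(first, num))
      std::printf("packet: first event %lld, %lld events\n", first, num);
   return 0;
}

The real PROOF packetizer is more sophisticated (it also takes data location and per-worker performance into account), but the load-balancing principle is the same.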
PROOF New Features
Support for “interactive batch” mode
  Allow submission of long-running queries
  Allow client/master disconnect and reconnect

Powerful, friendly and complete GUI

Work in grid environments
  Startup of agents via the Grid job scheduler
  Agents calling out to the master (firewalls, NAT)
  Dynamic master-worker setup
Interactive/Batch queries

[Diagram: three kinds of work, driven from the GUI, the command line, scripts or batch, ranging from stateful to stateless sessions:
  Interactive analysis using local resources, e.g. end-analysis calculations, visualization
  Analysis jobs with well-defined algorithms, e.g. production of personal trees
  Medium-term jobs, e.g. analysis design and development, using also non-local resources]

Goal: bring these to the same level of perception
Analysis Session Example

Monday at 10h15 – ROOT session on my desktop
  AQ1: 1s query produces a local histogram
  AQ2: a 10mn query submitted to PROOF1
  AQ3->AQ7: short queries
  AQ8: a 10h query submitted to PROOF2

Monday at 16h25 – ROOT session on my laptop
  BQ1: browse results of AQ2
  BQ2: browse temporary results of AQ8
  BQ3->BQ6: submit 4 10mn queries to PROOF1

Wednesday at 8h40 – ROOT session on my laptop in Kolkata
  CQ1: browse results of AQ8, BQ3->BQ6
New PROOF GUI

[Four slides of screenshots of the new PROOF GUI.]
TGrid – Abstract Grid Interface
class TGrid : public TObject {
public:
   virtual Int_t        AddFile(const char *lfn, const char *pfn) = 0;
   virtual Int_t        DeleteFile(const char *lfn) = 0;
   virtual TGridResult *GetPhysicalFileNames(const char *lfn) = 0;
   virtual Int_t        AddAttribute(const char *lfn, const char *attrname,
                                     const char *attrval) = 0;
   virtual Int_t        DeleteAttribute(const char *lfn, const char *attrname) = 0;
   virtual TGridResult *GetAttributes(const char *lfn) = 0;
   virtual void         Close(Option_t *option = "") = 0;
   virtual TGridResult *Query(const char *query) = 0;

   static TGrid *Connect(const char *grid, const char *uid = 0,
                         const char *pw = 0);

   ClassDef(TGrid, 0)   // ABC defining interface to GRID services
};
PROOF on the Grid

[Diagram: a PROOF user session talks, through the Grid service interfaces (Grid/ROOT authentication, Grid access control service, TGrid UI/queue UI, proofd startup, Grid file/metadata catalogue), to a PROOF master server; the master reaches PROOF slave servers at remote sites via PROOF sub-master servers.]

Guaranteed site access through PROOF sub-masters calling out to the master (agent technology)
Client retrieves the list of logical files (LFN + MSN)
Slave servers access data via xrootd from local disk pools
Running PROOF
TGrid *alien = TGrid::Connect("alien");

TGridResult *res;
res = alien->Query("lfn:///alice/simulation/2001-04/V0.6*.root");

TChain *chain = new TChain("AOD");
chain->Add(res);

gROOT->Proof("master");
chain->Process("myselector.C");

// plot/save objects produced in myselector.C
. . .
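For completeness, a minimal sketch of the shape a selector such as myselector.C typically has; the histogram and the branch handling shown here are hypothetical, and a real selector is normally generated from the tree with TTree::MakeSelector and then filled in.

#include "TSelector.h"
#include "TH1F.h"

class MySelector : public TSelector {
public:
   TH1F *fHist;

   MySelector() : fHist(0) { }

   void SlaveBegin(TTree *) {
      // runs on every worker: book the output objects
      fHist = new TH1F("hpt", "pT distribution", 100, 0, 100);
      fOutput->Add(fHist);          // objects in fOutput are merged on the master
   }

   Bool_t Process(Long64_t entry) {
      // runs for every event handed to this worker by the packetizer:
      // read the branches needed for 'entry' and fill the histograms, e.g.
      // fHist->Fill(pt);
      return kTRUE;
   }

   void Terminate() {
      // runs on the client after the per-worker results have been merged
      if (fHist) fHist->Draw();
   }

   ClassDef(MySelector, 0)
};

The same selector runs unchanged in a local ROOT session or on PROOF, which is the transparency goal mentioned in the original design goals.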
Conclusions
Amdahl’s law shows that making truly scalable parallel applications is very hard
Parallelism in HEP off-line computing is still lagging
To solve the LHC data analysis problem, parallelism is the only solution
To make good use of the current and future generations of multi-core CPU’s, parallel applications are required