High Performance Statistical Computing with Applications in the Social Sciences
Micah Altman
Senior Research Scientist
“Introduction to the RCE” by
Earl Robert Kinney
Manager, Research Computing Environment
Institute for Quantitative Social Science
Harvard University
Goals for Today

Analysis
- Describe performance goals
- Identify resource use patterns
- Identify resource bottlenecks
- Identify performance hot-spots
- Select problem decomposition

Application
- Connect to the RCE
- Use the RCE to analyze larger data sets
- Use the RCE to run interactive analyses more quickly
- Use the RCE to run large numbers of analyses independently
M. Altman & B. Kinney, High Perf. Stat. Computing (v. 9/10/11)
[Source: Wikimedia Commons]
Organization of this Workshop
- Motivation
- Principles
- Introduction to RCE
- Measuring Resource Use
- Scaling Up
- Tuning Up
- Scaling Out (Parallelization)
- Additional Resources
Nine Steps to Faster Results
1. Predict your resource needs through benchmarks, models, and algorithmic analysis
2. Select alternate algorithms when resource needs grow very rapidly with problem size
3. Identify resource bottlenecks using systems performance analysis tools
4. Address bottlenecks by increasing resources and/or changing program resource management
5. Discover hot-spots in programs using profiling tools
6. Adapt hot-spots to system architecture
7. Decompose the problem into independent subproblems
8. Distribute subproblems across pools of resources
9. Repeat analysis after making any changes
FREE! With every first class!
Coffee! Chocolate!! Consulting!!! Time off for good behavior!!!!
And now, a word from our sponsor…
IQSS (and affiliates) offer you support across all stages of your quantitative research:
- Research design, including: design of surveys, selection of statistical methods
- Primary and secondary data collection, including: the collection of geospatial and survey data
- Data management, including: storage, cataloging, permanent archiving, and distribution
- Data analysis, including: survey consulting, statistical software training, GIS consulting, high performance research computing
http://iq.harvard.edu/
But wait … there’s more!
The IQSS grants administration team helps with every aspect of the grant process. Contact us when you are planning your proposal.
- Assisting in identifying research funding opportunities
- Consulting on writing proposals
- Assisting IQSS affiliates with:
  - preparation, review, and submission of all grant applications (“pre-award support”)
  - management of their sponsored research portfolio (“post-award support”)
- Interpreting sponsor policies
- Coordinating with FAS Research Administration and the Central Office for Sponsored Programs
… And, of course, support seminars like this!
“One’s Reach Should Exceed One’s Grasp”
- Leading-edge statistical methods (such as MCMC) can require lots of computing power
- Ensuring robust results can multiply (and re-multiply) the number of analyses done: sensitivity analysis, parameterization studies, alternative models, Bayesian model averaging
- Performance benchmarks provide information for budgeting computing $$$
“I Want it Now!”
Deadlines abound: conferences, trials, publication dates
New observations, variables, corrections, or model specifications may necessitate speedy reanalysis
"My strength is as the strength of ten because my heart is pure." Selection of algorithms can change the
nature of the computational resource usage Tuning for a particular system can increase
performance approximately ten-fold In some circumstances work can be split
across thousands of systems.
Principles
- Goals matter
- Problems matter
- Algorithms matter
- Answers matter
- Architecture matters
Types of Performance Goals
- Task completion time – wait time to finish
- Efficiency – resource use for task
- Throughput – work done by system overall
- Latency – delay before response
- Responsiveness – perception of response
- Reliability – probability task/system will fail during time interval
“If you don’t know where you’re going, any road will take you.” – Proverb “If you come to a fork in the road, take it.” – Yogi Berra
Performance Goals – Rules of Thumb
- Completion time: work(i)/resource(i)
- Throughput: maximize(work/resource) over all jobs
- Latency: time elapsed before first response to input
- Real-time: complete task within a fixed interval
- “Responsiveness”: perceived latency, task completion time, task progress indicators

Rules of thumb:
- Users of interactive software want responsiveness
- Users of batch jobs want small completion times
- Systems administrators want maximum throughput and reliability
Size of Factors Affecting Performance

If the runtime to solve a small instance of a problem (n=10) on a single system is one minute, how long will it take to solve a larger instance of n=1000?

Run time for the large instance (n=1000):
- NP-Hard (worst case): 10^292 years
- Very inefficient algorithm, O(N^3): 1.6 years
- Inefficient algorithm, O(N^2): 16 hours
- Very poor memory access patterns: 11 hours
- Un-optimized code: 67 minutes
- Optimized code: 7 minutes
- Local multiprocessing: 2 minutes
- Fully parallel / full cluster: 4 seconds
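The scaling question above is simple arithmetic once you assume the asymptotic term dominates. A minimal sketch (illustrative only — the constants implied by the slide's figures differ, so these numbers will not reproduce the table exactly):

```python
# Rough extrapolation of runtime from a small benchmark, assuming the
# asymptotic growth term dominates.  Numbers are illustrative, not measurements.

def extrapolate(base_seconds, n_small, n_large, growth):
    """Scale a measured runtime by the ratio of growth-function values."""
    return base_seconds * growth(n_large) / growth(n_small)

base = 60.0  # one minute at n=10

quadratic = extrapolate(base, 10, 1000, lambda n: n**2)
cubic = extrapolate(base, 10, 1000, lambda n: n**3)

print(f"O(N^2): {quadratic / 3600:.0f} hours")      # 167 hours
print(f"O(N^3): {cubic / 3600 / 24:.0f} days")      # 694 days
```

The point survives the rough constants: a 100x larger instance costs 10,000x under O(N^2) and 1,000,000x under O(N^3).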
Hig
h P
erfo
rman
ce S
tatis
tical
Com
putin
g: P
rin
cip
les
Problem Complexity Classes
- Problem complexity class: the set of problems that can be solved in O(f(n)) for some f
- More general than algorithmic complexity – encompasses all possible algorithms to solve the given problem
- A polynomial-time algorithm is necessary for large problem instances

[Diagram: decision problems split into decidable and undecidable; nested complexity classes EXPSPACE ⊇ EXPTIME ⊇ PSPACE ⊇ NP, co-NP ⊇ P = BPP(?), with BQP and the NP-complete problems marked]
Some Problems Are HARD
- Traveling Salesperson Problem (weighted Hamiltonian cycle): plot a route through N locations, visiting each once, that minimizes cost
- NP-Hard: worst-case instances require exponential time for an optimal, certain solution
- NP-Complete: equivalent to a large class of hard problems
[Source: Applegate, Bixby, Chvátal, and Cook (1998)]
How to “Solve” the Unsolvable
- Think small: use only a small number of cities; aggregate to regions and treat them as quasi-cities
- Restrict the problem: Euclidean distances are easier than travel cost
- Solve a different problem: minimum spanning tree
- Approximate the solution: for Euclidean distances, there is an algorithm based on the minimum spanning tree that is at most 50% longer
- Randomize: can a randomized algorithm find the solution with probability p? (No one knows… probably not)
- Be lucky: maybe the “average” problem isn’t that hard?
- Heuristics: apply simulated annealing (etc.), cross fingers
How to Recognize Hard Problems…
- Is the problem routinely solved by existing systems?
- Are efficient algorithms known?
- Does it appear in lists of hard problems?
- Is the problem universal? (Any computing problem, sufficiently generalized, is hard [Papadimitriou 1994])
- Is run time growing exponentially in practice?
Algorithmic Complexity
- Measures the complexity of a particular solution to a problem
- Resource complexity: a measure of the resources used to solve a problem, as a function of input size
- Common resource measures:
  - Time, usually represented as the number of operations executed
  - Space, usually represented as the number of discrete scalar values stored
Algorithmic Complexity: Sorting

bubbleSort(list)
  while (not finished) {
    finished <- true
    for i in (1 to length(list)-1) {
      if (list[i] > list[i+1]) { swap(list[i], list[i+1]); finished <- false }
    }
  }
Number of operations: O(n^2)

quicksort(list)
  if (length(list) <= 1) return(list)
  pivot <- select from (list)
  for x in (list) {
    if x = pivot, add x to pivotList
    if x > pivot, add x to greaterList
    if x < pivot, add x to lessList
  }
  return(quicksort(lessList) + pivotList + quicksort(greaterList))
Number of operations: O(n log n) on average

*Illustrations courtesy of Wikipedia
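A runnable version of the two sorts above, sketched in Python for concreteness (the original slides give them as pseudocode):

```python
def bubble_sort(items):
    """O(n^2): repeatedly swap adjacent out-of-order pairs until none remain."""
    items = list(items)
    finished = False
    while not finished:
        finished = True
        for i in range(len(items) - 1):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                finished = False
    return items

def quicksort(items):
    """O(n log n) on average: partition around a pivot and recurse."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    less = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    greater = [x for x in items if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

data = [5, 2, 9, 1, 5, 6]
print(bubble_sort(data))  # [1, 2, 5, 5, 6, 9]
print(quicksort(data))    # [1, 2, 5, 5, 6, 9]
```

Both return the same answer; the difference the slides care about is the growth in the number of operations as the list gets longer.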
Sorting Complexity Continued

Tally sort (items in a fixed range, no duplicates):
  inlist = logical(length = max - min)
  for (i in 1:length(items)) { inlist[items[i]] = TRUE }
  for (i in min:max) if (inlist[i]) dowork(i)
How fast is this?

Algorithm Recurse_sort(array L, i = 0, j = length(L)-1)
  if L[j] < L[i] then L[i] ↔ L[j]
  if j - i > 1 then
    t = (j - i + 1)/3
    Recurse_sort(L, i,   j-t)
    Recurse_sort(L, i+t, j)
    Recurse_sort(L, i,   j-t)
  return L
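The tally sort above runs in time linear in the number of items plus the size of the value range — it trades memory for speed. A minimal Python sketch, assuming distinct integers in a known range:

```python
def tally_sort(items, lo, hi):
    """O(n + k) sort for distinct integers in a known range [lo, hi]."""
    present = [False] * (hi - lo + 1)  # one flag per possible value
    for x in items:                    # O(n): mark each value seen
        present[x - lo] = True
    # O(k) scan over the range, emitting values in order
    return [lo + i for i, seen in enumerate(present) if seen]

print(tally_sort([7, 3, 9, 1], 0, 10))  # [1, 3, 7, 9]
```

No comparisons at all — which is why it can beat the O(n log n) comparison-sort bound, but only for this restricted kind of input.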
Answers Matter
- Before optimization, verify the answer
- Right can mean “right enough” if well-defined
- Correct code may have different performance characteristics than incorrect code
- Returning the wrong answer can always be done quickly
Simple von Neumann Architecture

[Diagram: Processor ↔ Memory ↔ Input/Output]
More Modern

[Diagram: multiple processors (each with cores, L1/L2 caches, and FPUs) connected to memory, a GPU, a RAID controller with several disks, and a network card]
Inside the Core

[Diagram: processor die showing two cores, L1 and L2 caches, and FPUs. © Intel]
Deep Inside the Core
Resource Hierarchy: Big, Fast, Cheap*
- Big, fast, cheap – pick 2
- Latency increases with each step down
- Storage increases
- Throughput decreases (except with some offline storage)

Registers (<1 KB)
Cache (1 MB)
RAM (10 gigabytes)
Local storage (10's of terabytes)
ONLINE storage (100's of petabytes)
OFFLINE storage (10's of exabytes)
Reading One Byte: x <- m[1,3]
- CPU: 8 bytes loaded into a register
- Cache: 256-byte cache line fetched
- RAM: 4 KB page fetched
- Disk: 8 KB block fetched from NFS (networked file system)
General Performance Implications of Architecture
- Talking to external devices can cause waits (latency)
- Information transmitted to the CPU is limited by the bus (throughput)
- In practice, expect 80% of theoretical data-path bandwidth at best
- Some optimizations are highly specific to architectural details
- Hidden parallelism at low levels
- Information travels in chunks (at least bus size)
- Complexity makes theoretical performance analysis difficult – use benchmarks
From Principles to Practice

Practice = Principles × Optimization Goals × Problem Type × Computing Environment
- Optimization goals: throughput, latency, reliability, scaling up, scaling out
- Problem decomposition: independent data, independent calculations, coupled calculations
Principled Preparation Checklist
- Verify that your problem is tractable: substitute an easier problem; restrict or limit the problem; be lucky or clever
- Establish performance goals
- Identify possible algorithms: What is their resource complexity? Are better algorithms known?
- Identify potential system characteristics: communications costs, systems resources
Lab 0: Problem Definition
- Define your computing problem as formally as you can
- What algorithms are you using to solve the problem?
- What are your performance goals?
[Source: http://andreymath.wikidot.com/ . Creative Commons ShareAlike License]
An Introduction to the IQSS RCE
- What is it?
- Why use it?
- How does it work?
- How do we use it?
What is the RCE?
Virtual Desktop
- Full virtual desktop environment – connect anywhere
- Many research software packages available
- Persistent session – connect anytime

Interactive Nodes
- For large interactive jobs
- Large amounts of memory available on demand
- Stata, Matlab, Mathematica

Batch Processing
- Run hundreds of jobs at once
- Optimized for non-interactive, independent work
- Easy to run from your virtual desktop
Why Use the RCE?

For research:
- An environment customized for quantitative social science research
- A wide variety of research software packages are available

For convenience:
- The RCE enables you to access a research desktop from almost any computer
- Sessions are persistent – disconnect from your office, reconnect from home
- File storage is central; never worry about which computer has your files

For resources:
- Large analysis jobs are offloaded to high-powered servers
- Large resource pools: 800 processors, 3.3 TB of memory, 40 TB of disk storage
- Regularly updated software

For collaboration:
- Offers an ideal environment for collaborative research projects
- Share project files, desktops, software

For reliability:
- System performance and availability are constantly monitored
- Research files are regularly backed up and stored securely
- IQSS has full-time staff dedicated to supporting the RCE
RCE Architecture

[Diagram: client connects to virtual desktop sessions on login nodes; work is dispatched to interactive nodes and batch nodes; all nodes share central disk storage]
RCE Architecture Rules of Thumb
- Connect to the interactive pool
- Small problems: run directly (on an interactive node)
- Large-memory problems: use interactive nodes
- Interactive problems: use interactive nodes
- Large-compute jobs: use batch submit – but the problem must be decomposed
RCE Powered Apps – How It Works
1. User clicks on an application from the menu
2. RCE checks for availability of interactive nodes
3. If a node is available: RCE submits a special condor job to the interactive master node (~30s); a window appears on the RCE desktop and the application runs on the node
4. If no node is available: RCE submits a special condor job to the batch master node; the user receives a notice and is offered a batch node to run the job; if the user hits “yes”, the application runs there (~120s)
RCE Desktop
- Application Menu – application launching
- HMDC Outage Notifier – updates to reflect the status of the environment
- File Browser – graphical view of your home directory and files
- Quick Launch – quick access to e-mail, web, and office applications
- Status Bar – shows open applications
- Desktop Shortcuts – contains shortcuts to home directory and trash
Login Nodes
- Number of servers: 8
- Number of processors: 32
- RAM per session: ~6 GB
Apps on Login Nodes

Features:
- Easiest way to launch applications

Limitations:
- Smaller amounts of RAM
- Competition for resources with interactive processes
Interactive Nodes
- Number of servers: 13
- Number of processors: 84
- RAM per job: 1–64 GB
Apps on Interactive Nodes

Features:
- More memory available for the application
- Dedicated processor reduces competition for resources
- Multiple cores available (e.g. for Stata-MP)

Limitations:
- Interactive nodes are limited in number
- Time limit on applications (currently 72 hours); time can be extended by request
Batch Nodes
- Number of servers: 61
- Number of processors: 258
- RAM per job: 2–4 GB
Running Statistical Apps on Batch Nodes

Features:
- Nearly 400 nodes can run at the same time
- Well suited for loosely-coupled parallel problems

Limitations:
- Memory is more limited
- Application must be designed to harness the power of all nodes
- No failover to other pools
Memory Limitations

Login nodes:
- Each user on the machine is allowed to use a portion of available memory
- No enforcement of login limits (can be oversubscribed)

Interactive/batch nodes:
- Each node has a share of memory based on the request
- Physical hardware will only run a number of jobs equal to processor cores (not oversubscribed)
Get Started with the RCE: Checklist
- Apply for an RCE account: [email protected]
- Install the free NX software
- Connect to rce.hmdc.harvard.edu
- Run interactive programs with menus
- Run large interactive jobs with the “RCE Powered” menu
- Run large batch jobs using a simple launcher script
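The batch pool is driven by Condor (condor_submit is mentioned in the timing notes later). As a rough sketch of what a launcher might generate, a minimal submit description file looks like the following — the file names, R invocation, and queue count here are hypothetical, and the RCE's own launcher script may produce something different:

```
# mysim.submit -- hypothetical Condor submit description file
universe   = vanilla
executable = /usr/local/bin/R
arguments  = --vanilla -f mysim.R --args $(Process)
output     = mysim.$(Process).out
error      = mysim.$(Process).err
log        = mysim.log
queue 100
```

`queue 100` launches 100 independent copies, each receiving a distinct `$(Process)` number (0–99) it can use to pick its slice of the work.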
Lab 1: Connecting to the RCE

In this lab, we will log in to the RCE and launch Stata on an interactive node.
[Source: http://andreymath.wikidot.com/ . Creative Commons ShareAlike License]
Systems Resource Use
- Benchmarks
- Timing
- System resource monitoring
- System resource limits
Benchmarks
- What patterns of usage are likely to occur?
  - What are the 80% cases?
  - Are there 10% cases that have unusual patterns of data access, or unusual input?
  - Can you construct a plausible worst case?
- Parameterize benchmarks: parameterize problem size; vary order of magnitude
- Create benchmarks based on real cases: use real problems for full benchmarking; miniaturize real problems for quick tests
Common Benchmarks
- Artificial benchmarks
- Simple “unit” benchmarks
- Real application + random data
- Real application + real data
- Real application + worst-case data
- Mix of applications
Timing

Why measure timings:
- Direct or indirect measure of performance
- Establish a baseline for changes
- Empirical measure of scaling

Limitations:
- Timers are often imprecise for brief events
- Other activity on the system adds “noise”
- Many tools aggregate all phases of execution
- Many tools aggregate all areas of resource use
- CPU timings may exclude system resource use
- Must use condor_submit to run these on non-interactive nodes
- Heisenbugs
Alternative: Queuing Models
- Formalist alternative to benchmarks
- Can be useful for capacity planning
- Model services as a network of queues: different classes of “customers”; resources with different delay characteristics; transition probabilities
- Distribution of “service events”: Poisson events are discrete, independent, and memoryless; the number of events is Poisson distributed; interarrival time is exponentially distributed
- Little’s law: length of queue = arrival rate * time in queue
- Limitations: heroic assumptions are often required; state-space explosion; only the simplest models are solvable in closed form
Source: Takefusa, et al. 1999
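Little's law as stated above is just a product; a quick sketch with hypothetical numbers:

```python
# Little's law: L = lambda * W
# L = mean number of jobs in the system, lambda = arrival rate,
# W = mean time a job spends in the system.  Numbers are hypothetical.

arrival_rate = 4.0    # jobs per minute submitted to the queue
time_in_system = 2.5  # minutes from submission to completion

queue_length = arrival_rate * time_in_system
print(queue_length)   # 10.0 jobs in the system on average
```

Its appeal for capacity planning is that it holds regardless of the arrival or service distributions, as long as the system is stable.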
Wall-Clock Time
- Measure completion time
- Show phases of execution by inserting calls

Linux: date
OS X: date
Windows: DATE
R: Sys.time()
Stata: display "$S_TIME $S_DATE"
Matlab: clock; tic
C: time(), getitimer()

> print(Sys.time())
[1] "2010-04-28 10:21:45 EDT"
> res <- optim(sq, distance, genseq, method="SANN",
+              control = list(maxit=30000, temp=2000))
> print(Sys.time())
[1] "2010-04-28 10:21:55 EDT"
CPU Time
- Measure CPU time used by a program
- Shows “system”-state and “user”-state time
- Some tools show other resources

Linux: /usr/bin/time -v
OS X: /usr/bin/time -l
Windows: timeit.exe*
R: system.time()
Stata: timer
Matlab: cputime
C: getrusage()

$ /usr/bin/time -v \
  /usr/local/stata11/stata -b mycommand.do
User time (seconds): 0.00
System time (seconds): 0.01
Percent of CPU this job got: 64%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
...
*Optional tool, may require installation on your system
Interpreting CPU Time
- If (system)/(system + user) > 0.1: possibly inefficient use of system calls or I/O
- If elapsed time >> (system + user): possible resource bottleneck, possible sleep
- If CPU percent is low: possible CPU contention
User time (seconds): 0.00
System time (seconds): 0.01
Percent of CPU this job got: 64%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
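Applying the rules of thumb to the sample /usr/bin/time output, as a small Python sketch — the 0.1 threshold is the slide's, while the "much greater than" factor and the "low CPU percent" cutoff are judgment calls, not fixed rules:

```python
# Apply the CPU-time rules of thumb to the sample /usr/bin/time output.
user, system = 0.00, 0.01  # seconds, from the sample output
elapsed = 0.03
cpu_percent = 64

warnings = []
if system / (system + user) > 0.1:
    warnings.append("possibly inefficient system calls or I/O")
if elapsed > 2 * (system + user):  # "elapsed >> user+sys"; factor is arbitrary
    warnings.append("possible resource bottleneck or sleep")
if cpu_percent < 80:               # "low" cutoff is a judgment call
    warnings.append("possible CPU contention")

print(warnings)
```

For this tiny 0.03-second run all three flags fire — which mostly illustrates the earlier caveat that timers are imprecise for brief events.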
Monitoring Running Processes
- Show a list of running processes
- See current and accumulated CPU usage
- See CPU utilization

Linux: top; gnome-system-monitor
OS X: top; Utilities -> “Activity Monitor”; atMonitor (3rd party, highly recommended)
Windows: taskmgr.exe; top.exe*

$ gnome-system-monitor &
*Optional tool, may require installation on your system
Interpreting Process Monitor Results
- Show a list of running processes
- See current and accumulated CPU & memory usage
- See CPU utilization

$ gnome-system-monitor &
[Screenshot annotations: show processes; sort by number of processes waiting to use the CPU; sort processes by CPU use]
Sample Performance Curves
- Best case: linear in the size of the problem
- Nonlinearities could mean…
  - inefficient algorithm (case 2)
  - hard problem (case 3)
  - poor data access patterns (case 4)
System Resource Monitoring

Why monitor system resources?
- Identify bottlenecks
- Identify processes using resources – may affect overall throughput and capacity
- Identify processes actively using resources – may affect performance

Limitations:
- Tools are often imprecise for brief events
- Other activity on the system adds “noise”
- Many tools aggregate all phases of execution
- Many tools aggregate all system use
- Many tools aggregate sub-resource use
- Must use condor_submit to run these on non-interactive nodes
- Heisenbugs
Monitoring System Resources
- See system-aggregated use and activity for memory, disk, network
- See memory use by process
- See resource use by process (varies by platform)

Linux: gnome-system-monitor; /usr/bin/time -v; sar; iostat; vmstat
OS X: Utilities -> “Activity Monitor”; /usr/bin/time -v; sar; iostat
Windows: perfmon.exe; taskmgr.exe

$ gnome-system-monitor &
$ sar -A 1 10
$ /usr/bin/time -v stata -b somefile.do
Detailed System Resource Tracing
- See system use/calls for a process as it runs

Linux: strace; SystemTap (add-on)
OS X: dtrace
Windows: procmon.exe (add-on)

$ strace -o strace.log myProgram
$ sudo dtrace -n 'syscall:::entry { @[execname] = count() }' -c ls
Interpreting Process Memory Use
- Memory – amount of virtual memory requested
- Resident memory – amount of memory currently in RAM for the process

$ gnome-system-monitor &
[Screenshot annotations: use Monitor -> Preferences to add the “Resident Memory” column; “Memory” shows requested memory, “Resident Memory” shows memory in residence]
Interpreting System Activity
- Page faults – indicate memory activity or resource contention
- File I/O – indicates file activity

$ sar -bB 1 10

01:37:57 PM  pgpgin/s  pgpgout/s  fault/s  majflt/s
01:37:58 PM      0.00       0.00    14.71      0.00

01:37:57 PM   tps  rtps  wtps  bread/s  bwrtn/s
01:37:58 PM  0.00  0.00  0.00     0.00     0.00
(First sar block: system memory activity. Second block: system disk activity.)
Interpreting System Activity (Windows)
- Page faults – indicate memory activity or resource contention
- File I/O – indicates file activity

$ perfmon
Interpreting Process Resource Use
- Page faults – indicate memory activity or resource contention
- Voluntary context switches – indicate waiting on I/O or memory
- Swaps – indicate a severe system memory shortage
- File I/O – indicates file activity
- If a number is always 0 – it’s a lie

$ /usr/bin/time -v stata -b command

Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 149
Voluntary context switches: 1280
Involuntary context switches: 460
Swaps: 0
File system inputs: 0
File system outputs: 0
(Context switches: often memory related. File system inputs/outputs: process disk I/O.)
Symptoms of a CPU-Bound System/Problem
- CPU user+sys activity near 100% while there are active processes (if # of procs > # of CPUs)
- Performance curve for your problem is continuous
- This is usually good:
  - CPU is the most expensive resource
  - You can trust code profiling reports
  - More likely to have gains from parallelization
- However, if CPU %sys is high, suspect inefficient use of system calls, or borderline I/O or memory bottlenecks
Symptoms of Resource Bottleneck

Memory bottlenecks:
- Severe: processes in the swap queue (or waiting on swap); lots of swap space in use (see swap -m); swapping activity; free memory low
- Moderate: high context switches + high page (validity) faults + active processes with memory >> resident memory

I/O bottlenecks:
- Moderate: high %sys activity in CPU; high number of system calls and interrupts
- Severe: I/O rate high; context switches, wait on I/O, or processes sleeping on I/O; physical disk activity high

Performance curve:
- Discontinuous regions of accelerated performance decline
Tune Against Bottlenecks
- Typically, a single resource will be the bottleneck point: CPU, memory, or I/O (graphics, network, disk)
- If you don’t address the bottleneck, optimizations elsewhere won’t matter
- Bottlenecks may depend on usage scenario and phase of operation
- Fixing one bottleneck may reveal others
- Don’t expect speedup of the entire program to be proportional to the code you just tuned!
- Programs interact; try to profile on a quiet system first
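The "don't expect proportional speedup" point is Amdahl's law: the untuned fraction of the runtime bounds the overall gain. A quick illustration (the 40% hot-spot figure is hypothetical):

```python
def overall_speedup(fraction_tuned, local_speedup):
    """Amdahl's law: speedup of the whole program when a portion of it
    (fraction_tuned of total runtime) is made local_speedup times faster."""
    return 1.0 / ((1.0 - fraction_tuned) + fraction_tuned / local_speedup)

# Tuning a hot-spot that is 40% of runtime to be 10x faster...
print(round(overall_speedup(0.40, 10), 2))  # 1.56x overall, not 10x
```

Even an infinite speedup of that 40% would cap the whole program at 1/(1-0.4) ≈ 1.67x, which is why finding the actual bottleneck comes first.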
Resource Analysis: Checklist
- Identify benchmarks: small instances of your problem; can vary size
- Target an isolated system: minimize other activity
- Time benchmarks at various sizes
- Monitor systems resources
- Look for non-linearities in the performance curve
- Look for bottlenecks
Lab: Analyzing Resource Use

In this lab, we will:
- log in to the RCE and run a simple set of benchmarks
- use timing tools and performance analysis
- identify bottlenecks and performance curves
[Source: http://andreymath.wikidot.com/ . Creative Commons ShareAlike License]
Scaling Up
- Addressing resource bottlenecks
- System and application limits
- Storing/accessing large datasets
- Visualizing large datasets
When to Scale Up
- If resource analysis identifies a memory bottleneck
- If resource analysis identifies an I/O bottleneck (maybe…)
- If problem size prevents the program from starting
- If the program crashes or hangs in the middle of solving large problems (maybe…)
- If planning ahead for significant usage changes: size of problem data > ~1/2 of available physical memory (RAM); change of algorithm; change of data structure
Addressing Memory Bottlenecks
Review: symptoms of a memory bottleneck
- Discontinuity in the performance curve
- Memory size of the process increasing
- Resident memory size of the process relatively large
- System activity shows memory activity
Principles for addressing memory bottlenecks
- Memory hierarchy
- Locality of reference
Programming patterns
- Add more resources
- Modify data types
- Modify data structures
- Modify algorithms
Memory Hierarchy
- Registers (<1 KB)
- Cache (1 MB)
- RAM (10 GB)
- Local storage (10s of terabytes)
- Online storage (100s of petabytes)
- Offline storage (10s of exabytes)
If a register access took one second, a tape access would take a few centuries…
Buy one, get 8192 free!
Locality of reference
- Temporal locality: reuse the same data elements
- Spatial locality: use elements that are "near" each other in memory
What is "near"?
- For vectors and files: sequential ordering
- For matrices: either row or column ordering, depending on the language
- For complex data structures: use experimentation and analysis
Row-Major Order
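The effect of access order is easy to demonstrate. A hedged sketch in Python (illustrative, not from the slides; the matrix and loops are invented for the example): in a row-major layout, row-wise traversal touches consecutive elements, while column-wise traversal jumps to a different row on every step.

```python
import time

N = 1000
m = [[float(i + j) for j in range(N)] for i in range(N)]  # each inner list is one row

def sum_row_major(mat):
    # Inner loop walks along a row: consecutive elements, good spatial locality
    total = 0.0
    for row in mat:
        for x in row:
            total += x
    return total

def sum_col_major(mat):
    # Inner loop walks down a column: jumps between rows on every step
    total = 0.0
    for j in range(len(mat[0])):
        for i in range(len(mat)):
            total += mat[i][j]
    return total

t0 = time.perf_counter(); a = sum_row_major(m); t_row = time.perf_counter() - t0
t0 = time.perf_counter(); b = sum_col_major(m); t_col = time.perf_counter() - t0
# Same total either way; the row-major traversal is usually the faster one
```

The same experiment in R or Matlab would favor column-wise traversal, since those languages store matrices in column-major order.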
Adding More Resources
- The "$$$" optimization: buy more memory, or use the RCE to request a larger share
- This is effective if the local set size < share size
System and Application Resource Limits
Limits imposed by the system or application:
- Virtual memory: the logical memory space for a process. Virtual memory limits the maximum size of memory requested, and can prevent a program from starting or from loading large data.
- Physical memory: the physical RAM installed in the system. Usually smaller than virtual memory, but not always; it bounds the maximum efficient local set.
- Resident size limits: also affect the maximum efficient local set, though not as severely as physical limits.
Limits in Linux and OS X
Where limits are set:
- At bootup
- By the system at login (group/user-level total memory limits)
- In the shell at process creation: request a new limit (up to the user maximum)
- In code, via setrlimit
- In the application
Know your limits:
- Linux/OS X: ulimit -a
- R: none on Linux
- Stata: query memory
Limits on 32- vs. 64-bit systems:
- A 32-bit OS has a 4 GB limit for virtual and physical memory
- A 64-bit OS has no practical limit on virtual memory; physical memory is still limited by hardware configuration and design; data structures may require more memory to store, since pointers and default data types are larger
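On Unix systems the setrlimit route is also scriptable. A small sketch using Python's standard resource module (illustrative; it only queries, and harmlessly re-applies, the process's own address-space limit):

```python
import resource

# Soft limit: currently enforced; hard limit: the ceiling the soft limit may be raised to
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("virtual memory limit:",
      "unlimited" if soft == resource.RLIM_INFINITY else soft)

# A process may tighten its own soft limit; here we simply re-apply the current values
resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```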
Limits in Windows Systems
Where limits are set:
- Implied by the configuration at boot; virtual memory typically depends on the paging space configured on disk (the pagefile)
- R: memory.limit()
Limits on 32- vs. 64-bit systems:
- Most 32-bit Windows OSes limit physical memory to 3 GB: 32-bit addressing allows 4 GB, but 1 GB is reserved for memory-mapped hardware, leaving 3 GB in most Windows configurations
- A 64-bit OS has no practical limit on virtual memory (8 TB); physical memory is still limited by hardware configuration and design; data structures may require more memory to store, since pointers and default data types are larger
- Some Windows applications are 32-bit versions, and so are still limited to 4 GB of virtual memory
Basic Memory Management in Statistical Software

                             Matlab    R                                              Stata
    Memory limit             ---       memory.size() [Windows only]                   set memory
    Remove objects           CLEAR     rm()                                           clear
    Shrink data types        ---       as.integer(real_val), as.factor(string_val)    compress
    Measure data size        ---       object.size(), gc()                            memory
    Order for virtual mem.   PACK      gc()                                           set virtual
Data Structures and Memory
In some circumstances, few data points × lots of dimensions = huge matrices.
Alternate representations can help:
- Sparse matrix
- Network/graph
- Linked list
- Hash
Matlab and R offer these; Stata doesn't offer much control over this.
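The sparse-matrix idea can be sketched in a few lines (illustrative, not from the slides; real work would use a package such as Matrix in R or scipy.sparse in Python). A dictionary keyed on (row, col) stores only the nonzeros, so memory grows with the data rather than with rows × columns:

```python
# Minimal dictionary-of-keys (DOK) sparse matrix sketch
class SparseMatrix:
    def __init__(self, nrow, ncol):
        self.shape = (nrow, ncol)
        self.data = {}  # (i, j) -> value; absent keys are implicitly zero

    def __setitem__(self, ij, value):
        if value != 0:
            self.data[ij] = value
        else:
            self.data.pop(ij, None)  # never store explicit zeros

    def __getitem__(self, ij):
        return self.data.get(ij, 0.0)

    def nnz(self):
        return len(self.data)  # number of stored (nonzero) entries

m = SparseMatrix(1_000_000, 1_000_000)  # dense storage would be impossible
m[3, 7] = 2.5
m[999_999, 0] = -1.0
# Only two entries are actually stored
```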
Algorithmic Approaches for Memory Bottlenecks
Know your software's default memory model:
- R works in memory by default; you can work around this through DBI interfaces or by chunking the data
- SAS/SPSS: cross-tabs, descriptive stats, and correlations can be done without large memory; GLM etc. is done in memory
- Numerical Python / Perl Data Language / LAPACK libraries: for large matrix manipulation
- MySQL/Postgres: most analyses are done streaming, with a small memory footprint
Avoid in-memory duplication (R examples):

    dataFrame.new <- bigData; dataFrame[x, y] <- 0   # duplicate storage
    x <- 1:1000; x[2000] <- 0                        # duplication by extension
    xhuge <- xhuge * 2                               # temporary copy needed

Use memory-efficient options for MLE and optimization (e.g., the conjugate gradient method for optim(); avoid Hessians).
Use memory-friendly techniques: blocking, divide and conquer.
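A minimal sketch of the blocking idea (illustrative, not from the slides): compute a mean over a dataset far larger than memory by reading and reducing one fixed-size block at a time, so memory use is bounded by the block size.

```python
import io

def blocked_mean(lines, block_size=4):
    # Reduce each block to a running (count, sum); only one block is in memory at a time
    n, total = 0, 0.0
    block = []
    for line in lines:
        block.append(float(line))
        if len(block) == block_size:
            n += len(block)
            total += sum(block)
            block = []
    if block:  # leftover partial block
        n += len(block)
        total += sum(block)
    return total / n

# Stand-in for a large file on disk
fake_file = io.StringIO("\n".join(str(i) for i in range(10)))
print(blocked_mean(fake_file))  # → 4.5 (mean of 0..9)
```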
Example: Computing Variance
Source: Wikipedia
Two-pass algorithm:

    def two_pass_variance(data):
        n = 0
        sum1 = 0
        for x in data:
            n = n + 1
            sum1 = sum1 + x
        mean = sum1 / n
        sum2 = 0
        for x in data:
            sum2 = sum2 + (x - mean) ** 2
        variance = sum2 / (n - 1)
        return variance

One-pass, numerically unstable:

    def naive_variance(data):
        n = 0
        Sum = 0
        Sum_sqr = 0
        for x in data:
            n = n + 1
            Sum = Sum + x
            Sum_sqr = Sum_sqr + x * x
        mean = Sum / n
        variance = (Sum_sqr - Sum * mean) / (n - 1)
        return variance

One-pass, accurate (Welford's online algorithm):

    def online_variance(data):
        n = 0
        mean = 0
        M2 = 0
        for x in data:
            n = n + 1
            delta = x - mean
            mean = mean + delta / n
            M2 = M2 + delta * (x - mean)
        variance = M2 / (n - 1)
        return variance
Storing Large Data: Compression
Advantages:
- Reduces size on disk
- Can decrease load times (disk is slow; the CPU can keep up with decompression)
Disadvantages:
- Generally does not change size in working memory (compressed data structures are rare)
- Less efficient for data that will be updated (a separate copy of the file must be written)
Tools:
- Separate stage: gzip/zip
- Pipes avoid creating separate copies of extracted files: gzip/gunzip -c
- Access compressed data directly in R: gzfile()
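The same direct-access pattern exists in Python's standard library (a sketch, with an invented throwaway file): gzip.open decompresses on the fly, so no uncompressed copy ever lands on disk.

```python
import gzip
import os
import tempfile

# Write a small compressed dataset, then stream it back without extracting
path = os.path.join(tempfile.mkdtemp(), "data.txt.gz")
with gzip.open(path, "wt") as f:
    for i in range(5):
        f.write(f"{i}\n")

with gzip.open(path, "rt") as f:   # decompression happens inside the read pipe
    values = [int(line) for line in f]

print(values)  # → [0, 1, 2, 3, 4]
```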
Storing Large Data: Data Formats
Making the most of formats:
- Use storage features: binary storage, internal compression
- Choose efficient data types: integer, numeric, factor; avoid character and custom types
- Partition sets of variables into different data frames by hand
Key-value data storage:
- B-tree-based or hash file (e.g., gdbm, Berkeley DB)
- Key-value pairs used for storage and quick retrieval
- Most efficient for multiple readers of arbitrary values; less efficient for frequent updates and continuous reads
- Tools: see the R task view
Scientific data formats:
- Common formats: HDF, NetCDF
- Built for complex hierarchical data structures
- Efficient storage/updates of entire variables within datasets (uses B-trees)
- Tools: require external support libraries; R packages for ncdf/cdf/hdf
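A minimal key-value sketch using Python's standard dbm module (illustrative; the slides mention gdbm/Berkeley DB, and dbm simply picks whichever backend is available). The invented keys show the typical pattern: each lookup is a single hash/B-tree probe, with no need to load the whole dataset.

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "store")

# Keys and values are byte strings; "c" creates the file if needed
with dbm.open(path, "c") as db:
    db[b"subject:001:age"] = b"34"
    db[b"subject:002:age"] = b"41"

with dbm.open(path, "r") as db:   # reopen read-only and fetch one value
    age = int(db[b"subject:002:age"])

print(age)  # → 41
```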
Large Data: Analysis
Virtual memory:
- Configure the system with plenty of virtual memory (swap space); not currently an RCE option
- Use R/Stata settings for better VM management; locality of reference is critical
Subsampling:
- Good for means and moments
- May miss outliers; stratify on outliers if possible
Data reduction with external tools:
- Partition data horizontally/vertically
- Reduce measurements (e.g., code text)
- Convert data structures
Data streaming / data blocking:
- Can the algorithm compute in one pass?
- Can the algorithm compute in blocks of arbitrary size?
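The subsampling trade-off above can be sketched with invented numbers (illustrative, not from the slides): a small random subsample estimates the mean well, but will almost certainly miss a single extreme outlier.

```python
import random

random.seed(7)
population = [random.gauss(50, 5) for _ in range(100_000)]
population.append(10_000.0)  # one extreme outlier

sample = random.sample(population, 1_000)  # ~1% subsample

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

# The subsample mean is close to the population mean...
print(round(pop_mean, 1), round(sample_mean, 1))
# ...but the outlier is very unlikely to appear in the subsample at all
print(max(population), max(sample))
```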
Storing Large Data: Databases
Systematic storage:
- Studies are organized into databases
- A database is partitioned into multiple tables
- Tables contain sets of variables
- Data from multiple tables are join'ed on extraction
Tools:
- Open source: MySQL, PostgreSQL
- Commercial: Oracle
- Desktop: MS Access
- R: DBI package; Perl: DBI package; Java: JDBC
Optimizations for Databases: Basics
Table layout:
- Layout of data in tables
- Appropriate normalization
SQL statements:
- SQL is declarative, not procedural
- Stick to the relational model as much as possible (e.g., no functions or ordering in selection rules, only in final results)
- Put the database query optimizer to work for you
Database structures affecting performance:
- Indexes: optimize ad-hoc row-wise retrieval
- Cursors: optimize retrieving a subset of data incrementally
- Materialized views: a predefined subset of data for retrieval
Performance tuning:
- Data types: native types are fastest; variable-length character fields and large objects can affect performance
- Reads vs. updates: reads do not (generally) require exclusive locks; writes do. If there are lots of writes, use row-level locks. Indexes help reads and hurt writes.
- Some summary statistics can be computed in SQL, regardless of main-memory limitations in R
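For instance (a sketch using Python's built-in SQLite, standing in for the server databases named above; the table and values are invented): count, mean, min, and max can all be pushed into SQL, so only a single summary row ever enters the analysis process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (id INTEGER, income REAL)")
conn.executemany(
    "INSERT INTO survey VALUES (?, ?)",
    [(i, 1000.0 * i) for i in range(1, 11)],
)

# The database streams over the table; R (etc.) would only see the one-row summary
row = conn.execute(
    "SELECT COUNT(*), AVG(income), MIN(income), MAX(income) FROM survey"
).fetchone()

print(row)  # → (10, 5500.0, 1000.0, 10000.0)
```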
Optimizations for Databases: Gotchas
- Does not save disk space (often more space is needed for indices)
- Difficult to index ad-hoc retrieval of variables; efficiency depends on a fixed table decomposition
- Limited data types and numeric precision
- Can be much slower than storing in R (etc.) for data that would fit in main memory
- If doing analysis in some other program, you still need to manage the size of extracted subsets to avoid memory limitations
- Overhead for access and indexing: indexing always slows down updates to the database
Large Data Analysis: Graphics
Types of graphical issues:
- Too many points
- Too many categories
Some useful tools:
- CRAN Graphics task view: cran.r-project.org/web/views/Graphics.html
- Mondrian: rosuda.org/Mondrian/
- GGobi: www.ggobi.org/
Large Data Analysis: Graphics
- Use subsamples: good for moments of the distribution; not good for tail behavior or outliers
- Interactive graphics: brushing and logical zooming are essential
- Workarounds for performance problems: avoid real-time binning; reduce the number of simultaneous displays; avoid overlapping display windows; pause dynamic graphics and brushing while positioning
- Overplotting: pixel-size glyphs; alpha-blending; data aggregation (binning, (pre-)clustering, squashing, tableplots); screen binning; redlining; kernel smoothing
- Areal plots: display is independent of the number of observations; use logical zooming for large numbers of categories; use redlining. Examples: histograms, mosaic plots, fluctuation diagrams
Scaling Up: Checklist
- Perform resource usage analysis first
- Pay attention to locality of reference and the memory hierarchy
- Examine resource limits
- Consider adding memory resources
- Consider data types and program memory management
- Consider data structures
- Modify algorithms: blocking, divide and conquer
- Consider a database / data-streaming algorithms
- Use visualization techniques suitable for large data
BREAK
[Source: http://andreymath.wikidot.com/. Creative Commons ShareAlike license]
Tuning Up
- Profiling
- Analysis
- Tuning strategies
Review: Analysis Strategies
- Get the answer right (or as right as you need) first
- Use appropriate algorithms
- Identify use-case-based benchmarks
- Look for system bottlenecks
- Examine processes
- Profile applications
- Tune small, expensive regions of code
Profiling Methodologies: Wall Clock
- Usually instrumented ad hoc
- Very easy to do for very small programs
- Very coarse-grained; selects on the dependent variable
- Stata: timer
- R:

    trace(glm,
          tracer = quote(print(Sys.time())),
          exit   = quote(print(Sys.time())))
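The same ad-hoc wall-clock idea in Python (a sketch; the function being timed is invented, and time.perf_counter is the standard monotonic clock for this):

```python
import time

def slow_square(xs):
    return [x * x for x in xs]

start = time.perf_counter()
result = slow_square(range(100_000))
elapsed = time.perf_counter() - start
# Coarse-grained: one number for the whole call, with no hint of where the time went
print(f"slow_square took {elapsed:.4f} s for {len(result)} elements")
```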
PC-Counter Sampling
How it works:
- Automatically inserts an interrupt handler into the code
- A timer fires interrupts regularly
- The interrupt handler builds up a statistical picture of runtime behavior
- Uses the call graph to determine where the program is spending time
- Useful for identifying CPU-intensive functions
Limitations:
- Does not capture bottlenecks that cause the process to sleep/block/swap (e.g., waiting for a file open or network read to complete, or swapping out large chunks of memory)
- Somewhat imprecise: needs larger runtimes (tens of seconds should be fine)
- Needs representative input data
- May be limited in reporting details of the calling sequence; usually limited to function calls, and won't show details on blocks of code
- Heisenbugs
Tools:
- C: gprof
- General: oprofile
- R: Rprof(), Rprofmem(), proftools
- Stata: profiler
- Matlab: profile
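Python's standard counterpart is cProfile (a sketch with invented functions; cProfile is deterministic/tracing rather than sampling-based, but it yields the same per-function picture the profilers above provide):

```python
import cProfile
import pstats
import io

def hot_spot():
    return sum(i * i for i in range(100_000))

def analysis():
    return [hot_spot() for _ in range(10)]

prof = cProfile.Profile()
prof.enable()
analysis()
prof.disable()

# Print the functions where the program spent its time, sorted by cumulative time
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
# hot_spot should dominate the report
```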
Other Profiling Methods
Basic block counts:
- Pros: precise
- Cons: very limited availability; doesn't include system latencies or system-call times
Hardware counters:
- Pros: precise
- Cons: limited availability
Example (step 1: benchmark)
- Report that analysis of a drug-abuse study was "slow"
- Formulate some possible cases: select {1, 10, 100, 1000} variables, with {no, simple, complex} row-selection criteria, on a {10 MB, 100 MB, 1 GB} dataset, for {descriptives, boxplots, q-q plots, tab-delimited, Stata, R}
- Clock a simple reduced benchmark case: {1 var, 100 MB, all rows, descriptives}
- With a stopwatch: 5 minutes, and the browser timed out
Example (step 2.1: bottleneck)

    [maltman@login ~/rtest]$ vmstat -n 5
    procs            memory           swap        io   system        cpu
     r b w swpd  free   buff   cache  si so  bi bo  in  cs   us sy id
     1 0 0   60  2064 111036 489164    0  0  45  1  39  19    3  1 41
     1 0 0   60  2052 111036 489164    0  0   0  1 121 101  100  0  0
     1 0 0   60  2052 111036 489164    0  0   0  1 121 101  100  0  0
     1 0 0   60  2052 111036 489164    0  0   0  1 120  98  100  0  0
    …

    [maltman@login ~/rtest]$ netstat -iac
    Kernel Interface table
    Iface MTU  Met RX-OK    RX-ERR RX-DRP RX-OVR TX-OK    TX-ERR TX-DRP TX-OVR Flg
    eth0  1500 0   12284461 0      0      0      57704048 0      0      0      BRU
    lo    3924 0   896898   0      0      0      896898   0      0      0      LRU
    eth0  1500 0   12284464 0      0      0      57704049 0      0      0      BRU
    lo    3924 0   896898   0      0      0      896898   0      0      0      LRU

'vmstat' shows that the CPU is the bottleneck, although interrupts and context switches are suspiciously high as well.
Example (step 3: processes)
I used 'top' to look at the progress of the request:

     2:27pm up 121 days, 1:23, 4 users, load average: 1.14, 0.42, 0.15
    313 processes: 311 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states: 0.2% user, 22.6% system, 39.0% nice, 38.0% idle
    Mem:  1048092K av, 887652K used, 160440K free, 0K shrd, 98628K buff
    Swap:  530104K av,   4848K used, 525256K free,          402456K cached

      PID  USER   PRI NI SIZE   RSS  SHARE STAT LIB %CPU %MEM TIME  COMMAND
    18336  nobody 19  19 786700 700M 1584  RN   0   96.6 53.7 30:08 R.bin
Example (step 4: single-process timing)
I used 'time -v R …' to look at a smaller example:

    User time (seconds): 357.04
    System time (seconds): 3.51
    Percent of CPU this job got: 94%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 6:23.02
    Average shared text size (kbytes): 0
    Average unshared data size (kbytes): 0
    Average stack size (kbytes): 0
    Average total size (kbytes): 0
    Maximum resident set size (kbytes): 0
    Average resident set size (kbytes): 0
    Major (requiring I/O) page faults: 907
    Minor (reclaiming a frame) page faults: 229112
    Voluntary context switches: 0
    Involuntary context switches: 0
    Swaps: 0
    File system inputs: 0
    File system outputs: 0
    Socket messages sent: 0
    Socket messages received: 0
    Signals delivered: 0
    Page size (bytes): 4096
    Exit status: 0

- The 'time -v' output shows the process is CPU-bound.
- There is no discrepancy between wall-clock and CPU time.
Example (step 4.1: mini-profile)
I cooked up a short R test program, using a subset of the original data, timing calls, and a simpler stat analysis:

    date()
    y <- read.table("da6", header=TRUE, row.names=, sep="\t")
    date()
    summary(y$CLIDNO)
    date()
    q()

    % /usr/bin/time -v R --no-save < R.example
    [1] "Wed Dec 26 14:47:22 2001"
    > y <- read.table("da6", header=TRUE, row.names=, sep="\t")
    > date()
    [1] "Wed Dec 26 14:53:44 2001"
    > summary(y$CLIDNO)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
         11    5807    9999    8119    9999    9999
    > date()
    [1] "Wed Dec 26 14:53:44 2001"
    > q()

The internal timing calls alone show that 99%+ of the time was spent in read.table().
Example (step 4.2: test)
- Pre-slicing the data using 'cut' to produce only the variable being analyzed reduced read.table() time to a few seconds.
- But the real code differed from my benchmark:
  - the real problem was reduced from >30 minutes to 5 minutes, almost all of it still in R
  - summary() in R took only 0.1 seconds on the same data, so something else was going on in the R code
Example (step 4.3: profile)
'Rprof' showed that the mode() operation was taking an inordinate amount of time. We decided to eliminate that statistic. (R developers later fixed this by using a more efficient sort algorithm.)

    % total  total seconds  % self  self seconds  name
      99.29         272.92    0.01          0.02  "univarStat"
      98.74         271.40    0.04          0.12  "statMode"
    …
      98.56         270.90    0.01          0.02  "table"
      98.49         270.70    0.17          0.46  "factor"
      60.63         166.64   60.63        166.64  "match"
      37.93         104.26    0.23          0.64  "sort"
      37.62         103.40   37.62        103.40  "unique"
      37.62         103.40    0.00          0.00  "inherits"
      37.62         103.40    0.00          0.00  "is.factor"
       0.39           1.08    0.00          0.00  "[<-"
Identify Small and Expensive Regions of Code
- To double the speed of a program, the code you tune has to account for more than 50% of the execution time
- 'Optimizing' large amounts of code is ineffective and hard to maintain
- If the code takes only a small percentage of run time, there is a risk of 'overtuning' (tuning to the specific situation in a way that has no general benefit)
Tuning Strategies: General
- Waste not, want not…
- Put off until tomorrow…
- Pre-processing
- Efficient/approximate/randomized/optimistic algorithms
- Appropriate data structures
- Find the smallest chunk of code that takes the biggest share of time
- Never tune without profiling
- It's hard to beat the professionals
Tuning Strategies: CPU
- Interpreted code is slower than compiled code
- Use compiler optimizations / select a better compiler:
  - the compiler can take advantage of low-level memory-hierarchy details
  - the compiler can take advantage of instruction pipelining (overlapping instructions in the CPU)
- Mind cache size
- Mind arithmetic precision
- In-line code
- Pre-compute tables / cache computed values
- Avoid numerous floating-point exceptions
Tuning Strategies: Resources
I/O:
- Correct order
- Correct chunk size
- Locality of reference
- Pre-fetching
Memory:
- Avoid duplication
- Efficient access order
- Locality of reference: process in contiguous blocks
Network:
- Correct chunk size
- Avoid frequent name resolution
- Use an appropriate protocol
- Use a local address if you've got it…
Tuning Tips: R
- Use Rprof/Rprofmem/proftools
- R is optimized for whole data objects:
  - vector/matrix operations are much faster than explicit loops
  - use replacement methods to update objects
- R performance tips:
  - As in previous slides, avoid unnecessary duplication of objects through assignments, growing objects, etc. In R, changing any object copies it.
  - Remove attributes from large objects if possible
  - Manage intermediate results: save a result if you need it; reuse/remove it if not (or avoid naming results at all)
- Compiled code for performance gains:
  - Built-in functions based on .Internal() are much faster than plain R code
  - Use C++ for small, time-consuming operations such as inner loops
  - Link with optimized libraries (such as ATLAS) for matrix algebra
Example: Matrix Margins

    se1 <- function(m) {
      sumvector <- numeric(dim(m)[2])
      for (i in 1:dim(m)[1]) {
        for (j in 1:dim(m)[2]) {
          sumvector[j] <- sumvector[j] + m[i, j]
        }
      }
      return(sumvector)
    }

    se2 <- function(m) {
      sumvector <- numeric(dim(m)[2])
      for (j in 1:dim(m)[2]) {
        for (i in 1:dim(m)[1]) {
          sumvector[j] <- sumvector[j] + m[i, j]
        }
      }
      return(sumvector)
    }

    se3 <- function(m) {
      sumvector <- numeric(dim(m)[2])
      for (j in 1:dim(m)[2]) {
        sumvector[j] <- sum(m[, j])
      }
      return(sumvector)
    }

    se4 <- function(m) {
      apply(m, 2, sum)
    }

    se1a <- function(m) {
      sumvector <- numeric(dim(m)[1])
      for (i in 1:dim(m)[2]) {
        for (j in 1:dim(m)[1]) {
          sumvector[j] <- sumvector[j] + m[j, i]
        }
      }
      return(sumvector)
    }

    se2a <- function(m) {
      sumvector <- numeric(dim(m)[1])
      for (j in 1:dim(m)[1]) {
        for (i in 1:dim(m)[2]) {
          sumvector[j] <- sumvector[j] + m[j, i]
        }
      }
      return(sumvector)
    }

    se3a <- function(m) {
      sumvector <- numeric(dim(m)[1])
      for (j in 1:dim(m)[1]) {
        sumvector[j] <- sum(m[j, ])
      }
      return(sumvector)
    }

    se4a <- function(m) {
      apply(m, 1, sum)
    }

Timings:

    > mm <- matrix(runif(5000^2), nrow=5000)
    > system.time(res1 <- se1(mm))
       user  system elapsed
     437.48    0.36  512.83
    > system.time(res2 <- se2(mm))
       user  system elapsed
     412.16    0.29  464.53
    > system.time(res3 <- se3(mm))
       user  system elapsed
       2.09    0.24    8.77
    > system.time(res4 <- se4(mm))
       user  system elapsed
       2.99    0.38    3.68
    > system.time(res1a <- se1a(mm))
       user  system elapsed
     427.87    0.34  478.53
    > system.time(res2a <- se2a(mm))
       user  system elapsed
     436.04    0.36  473.20
    > system.time(res3a <- se3a(mm))
       user  system elapsed
       2.03    0.35    2.46
    > system.time(res4a <- se4a(mm))
       user  system elapsed
       3.54    0.45    4.17
Tradeoffs
- Efficiency vs. generality
- Efficiency vs. programming effort (bugs)
- Time vs. space
- Throughput vs. completion time vs. latency
Premature Optimization Is the Root of (Much) Evil
- Does it run at all?
- Does it produce the right answer?
- Which resource is the bottleneck?
- What part of the program accesses that resource most?
- Is the algorithm efficient?
- Is it easy to buy new hardware, or to run pieces independently on lots of systems?
… then optimize
… then parallelize
Tuning Up: Checklist
First:
- Perform resource analysis
- Select algorithms
Profile:
- Use representative test cases
- Look for small regions of expensive code
- Replace those regions with compiled, professionally optimized code, such as a built-in method from your stat package
Tune:
- Reduce expensive operations
- Trade time for space with indices
Lab: Scaling Up
- Profile the previous example
- What could you do to tune it?
[Source: http://andreymath.wikidot.com/. Creative Commons ShareAlike license]
Scaling Out: Basics
- Why parallelize?
- Sources of parallelism
- Limits: Amdahl's Law, communication costs, contention
- Decomposing tasks correctly: arbitrary interleaving, safety, deadlock avoidance, starvation
Why Parallelize?
- Time vs. money
- Most computers are idle most of the time
- Be the first on the block to have your own compute cluster
but…
- Parallelization can decrease throughput
- It often takes considerable effort
- It is difficult to ensure correct operation
Sources of Parallelism
Intra-core:
- Pipelining: execute a different instruction at each pipeline stage
- Hyperthreading: very fast switching between two processes when one is "stalled" in the pipeline
- SSE: specialized vector operations on 8 data points simultaneously
- FPU: floating-point math done in a separate unit
Intra-system:
- GPU: specialized simultaneous updates of large matrices
- Multi-core: shared memory, independent programs
Distributed:
- Separate computers, connected by a network, "shared nothing"
[Diagram: anatomy of a single system — processors containing cores with L1/L2 caches and FPUs, shared memory, a GPU, a RAID controller with disks, and a network card]
Amdahl's "Law"
- Most code is not 100% parallelizable/tunable
- Speedup is limited by the percentage of time the non-tunable part requires
- Implication 1: it may not be worth the effort of tuning if the tunable code takes up less than 80% of current runtime
- Implication 2: most programs will not benefit significantly from running on more than 4 processors
- Implication 3: if communication and synchronization costs are high, the problem may take longer to solve in parallel
[Plot: Amdahl's Law speedup, 1 / ((1 - p) + p/n), against the parallelizable fraction p, shown for n = 4 and n = 32 processors; speedup stays modest unless p is close to 1]
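Amdahl's bound can be computed directly (a small sketch; p is the parallelizable fraction of runtime and n the processor count, so speedup = 1 / ((1 - p) + p/n)):

```python
def amdahl_speedup(p, n):
    # p: fraction of runtime that parallelizes; n: number of processors
    return 1.0 / ((1.0 - p) + p / n)

# Even with 90% parallelizable code, 32 processors give less than an 8x speedup
print(round(amdahl_speedup(0.90, 32), 2))  # → 7.8
```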
Resource Contention
- Contention = too many requests for the same resources
- E.g., 10 processors writing to the same block of memory, disk file, etc.
- Use caching (for reads) and data blocking
- Most common on SMP/multi-core architectures
Synchronization
- When processes run in parallel, their operations can be arbitrarily interleaved
- Isolate code that cannot be interleaved in critical regions
- Use synchronization to coordinate access to critical regions
- A race condition results when the final answer depends on the order in which operations execute
- Race conditions are almost impossible to debug by trial and error
Race Condition Example

    ATM_Withdrawal_loop() {
      while (1) {
        // wait for login
        accountNum = waitForLogin()
        // get account balance from database
        curBalance = get_balance(accountNum)
        // ask user for amount
        amount = get_withdrawal(curBalance)
        newBalance = curBalance - amount
        set_balance(accountNum, newBalance)
        dispense_cash(amount)
      }
    }

Meanwhile, at some other location, the identical account is used:

    // wait for login
    accountNum = waitForLogin()
    // get account balance from database
    curBalance = get_balance(accountNum)
    // ask user for amount
    amount = get_withdrawal(curBalance)
    newBalance = curBalance - amount
    set_balance(accountNum, newBalance)
    dispense_cash(amount)

Free money!
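A runnable counterpart of the fix (a Python sketch; the account and amounts are invented): two threads hammer one shared balance, and a lock makes each read-modify-write a critical region, so no update is lost.

```python
import threading

balance = 100
lock = threading.Lock()

def withdraw(amount, times):
    global balance
    for _ in range(times):
        with lock:                 # critical region: read, modify, write
            current = balance
            balance = current - amount

t1 = threading.Thread(target=withdraw, args=(1, 30))
t2 = threading.Thread(target=withdraw, args=(1, 20))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)  # → 50: all 50 withdrawals applied, no updates lost
```

Without the lock, the `current = balance` / `balance = current - amount` pair could interleave across threads, silently dropping withdrawals.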
Synchronization by Hand: First Try

    while (1) {
      // DO NON-CRITICAL STUFF

      // SYNCHRONIZE
      while (flag == TRUE);
      flag = TRUE

      // DO CRITICAL STUFF

      // SYNCHRONIZE
      flag = FALSE
    }

Meanwhile...

    // SYNCHRONIZE
    while (flag == TRUE);  // OOPS
    // DO CRITICAL STUFF

OK in serial, but not safe in parallel: both processes can pass the test before either one sets the flag.
Synchronization by Hand: Second Try

    // PROCESS 0
    while (1) {
      // DO NON-CRITICAL STUFF

      // SYNCHRONIZE
      while (turn != 0);

      // DO CRITICAL STUFF

      // SYNCHRONIZE
      turn = 1
    }

    // PROCESS 1
    while (1) {
      // DO NON-CRITICAL STUFF

      // SYNCHRONIZE
      while (turn != 1);

      // DO CRITICAL STUFF

      // SYNCHRONIZE
      turn = 0
    }

Safe, but…
- The processes rigidly alternate
- If one process never needs to enter its critical region, the other process starves
Synchronization: Peterson's Algorithm
Use multiple flags, all initialized to 0 (a simplification of Dekker's algorithm). Safe, and fair.

    INITIALIZATION: flag[0] = 0; flag[1] = 0; turn = 0

    // PROCESS 0
    while (1) {
      // DO NON-CRITICAL STUFF

      // SYNCHRONIZE
      flag[0] = 1
      turn = 1
      while (flag[1] && turn == 1);

      // DO CRITICAL STUFF

      // SYNCHRONIZE
      flag[0] = 0
    }

    // PROCESS 1
    while (1) {
      // DO NON-CRITICAL STUFF

      // SYNCHRONIZE
      flag[1] = 1
      turn = 0
      while (flag[0] && turn == 0);

      // DO CRITICAL STUFF

      // SYNCHRONIZE
      flag[1] = 0
    }
Synchronization Building Blocks
- Tasks: independent execution sequences
- Test-and-set: an atomic test-and-assign makes the flag approach safe:

      // SYNCHRONIZATION
      while (testAndSet(flag));
      // Critical region
      // SYNCHRONIZATION
      flag = FALSE

- Locks: test-and-set plus an automatic wait loop
- Semaphores: a countdown (N processes allowed in the critical region) plus a queue (when one process releases the semaphore, a waiting process gets a turn)
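These building blocks map directly onto Python's threading module (an illustrative sketch with invented workers): a Semaphore(2) admits at most two threads to the critical region at once, and a high-water-mark counter verifies it.

```python
import threading

sem = threading.Semaphore(2)      # countdown: at most 2 threads inside at once
state_lock = threading.Lock()     # protects the two counters below
active = 0
max_active = 0

def worker():
    global active, max_active
    with sem:                     # acquire decrements the count; release on exit
        with state_lock:
            active += 1
            max_active = max(max_active, active)
        # ... critical work would happen here ...
        with state_lock:
            active -= 1

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads: t.start()
for t in threads: t.join()

print(max_active <= 2)  # → True: the semaphore capped concurrency
```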
No Race Condition: Synchronized

    ATM_Withdrawal_loop() {
      while (1) {
        // wait for login
        accountNum = waitForLogin()

        // SYNCHRONIZATION
        waitForLock(accountNum)
        // BEGIN CRITICAL REGION
        // get account balance from database
        curBalance = get_balance(accountNum)
        // ask user for amount
        amount = get_withdrawal(curBalance)
        newBalance = curBalance - amount
        set_balance(accountNum, newBalance)
        // SYNCHRONIZATION
        releaseLock(accountNum)

        dispense_cash(amount)
      }
    }

Meanwhile, the second session blocks until the lock is released:

    accountNum = waitForLogin()

    // SYNCHRONIZATION
    // Waiting ............
    waitForLock(accountNum)
    // BEGIN CRITICAL REGION
    curBalance = get_balance(accountNum)
    amount = get_withdrawal(curBalance)
    newBalance = curBalance - amount
    set_balance(accountNum, newBalance)
    // SYNCHRONIZATION
    releaseLock(accountNum)

    dispense_cash(amount)
Deadlock
Deadlock occurs when all processes are waiting for resources held by other processes.

Starvation occurs when one or more processes receive no resources over time.
[Diagram: Process1, Process2, and Process3 each waiting on a resource held by the next -- a circular wait]
Deadlock Example
Try simultaneously: A->B, B->C, C->A
ATM_transfer_loop () {
  while (1) {
    // wait for login
    accountNum = waitForLogin()
    accountNum2 = askForAccount()
    // SYNCHRONIZATION
    waitForLock(accountNum)
    waitForLock(accountNum2)
    // BEGIN CRITICAL REGION
    // get account balances from database
    curBalance = get_balance(accountNum)
    curBalance2 = get_balance(accountNum2)
    // ask user for amount
    amount = get_withdrawal(curBalance)
    newBalance = curBalance - amount
    newBalance2 = curBalance2 + amount
    set_balance(accountNum, newBalance)
    set_balance(accountNum2, newBalance2)
    // SYNCHRONIZATION
    releaseLock(accountNum)
    releaseLock(accountNum2)
  }
}
Avoiding Deadlock
- Timeouts
  - Release locks after waiting for some set period
  - Simple to implement, pragmatic
  - Cons: inefficient; programs need to acquire all locks at once before entering any critical region, or releasing a lock may be unsafe; timeout values may need tuning and should be somewhat randomized (to avoid cycling)
- Resource graph analysis
  - Can be computationally expensive
  - Not generally practical for distributed systems
- Algorithmic design
  - Most elegant and efficient solution
  - Examples:
    - Waiting room -- allow only N-1 of N processes to compete for the resource at once
    - Resource ordering -- assign an unambiguous ordering to resources; attempt to acquire locks in order
    - Ticket -- give out the right to attempt to acquire resources
  - Requires careful analysis -- best to use an existing design
Verification and Debugging
- Difficult to trace parallel code -- need special tools
- Impossible to "debug" parallel algorithms: Heisenbugs are the rule! Debugging serializes execution
- Must verify models instead
Two Easy Cases
Explicit parallelization is hard! Look for:
- An existing parallel implementation
- An "embarrassingly parallel" implementation of the problem

Failing that, look for an existing parallel algorithm for the problem.
Easy Parallelization in STATA

Stata-MP: Stata professionally parallelized
- Dozens of models
- Completely automatic
- Well documented: www.stata.com/statamp/report.pdf

Limitations:
- Parallelization applies only to built-in features
- Most models are only partially parallelized
- Some overhead for parallelization
- Practically limited to 8 processes in most cases
Easy Parallelization in STATA

Best cases (>95% parallelized):
by; cluster (some); cnsreg; correlate; count; ctset; eivreg; frontier; generate; gprobit; hetprob; hotelling; logit; markout; marksample; matrix accum/score/syminv; mlevel; mlogit; mlmatsum; mlvecsum; ologit; oprobit; pcorr; poisson; predict (some); probit; regress; replace; scobit; streg (gamma); summarize; tobit; treatreg; xtintreg; xtnbreg; xtprobit; _rmcoll

Negligible speedup (<10%):
biplot; centile; cluster (single); cumul; dstdize; fcast compress; ktau; levelsof; manova (oneway); matrix eigen/svd/symeigen; mkmat; mprobit; oneway; pergram; pkexamine; pperron; reshape; rotatemat; scoreplot; stcox; stcurve; svmat; tabulate; total; twoway (lfitci); vargranger; varstable; vec; vecstable; wntestb; xcorr; xtmixed; xtpcse

Limitations:
- Parallelization applies only to built-in features
- Most models are only partially parallelized
- Some overhead for parallelization
- Practically limited to 8 processes in most cases
- Does not help for memory issues
Somewhat Easier Parallelization in R
- R offers a wide range of frameworks at widely varying levels of maturity
- The most mature and best-supported framework is snow, which works on top of sockets, NWS, MPI, or PVM
- For single systems with multiple cores, multicore is straightforward to use
- Some packages that use snow automatically:
  - BARD: redistricting analysis
  - GAMBoost: b-spline boosting
  - Matching: propensity score matching
  - bnlearn: Bayesian learning
  - rgenoud: genetic optimization
  - latentnet: latent cluster models
  - ... a dozen more packages (so far)
Parallelizing R code through lapply()
- The *apply() functions are core operations in R
- Each runs a function on every element of a list, or slice of a matrix/array
- Returns a processed output list, which may be of reduced dimension
- Like a FOR loop over a list, but practically guaranteed to have no dependencies
- A natural construct for parallelization
[Diagram: prepare input -> apply() -> process input -> apply() -> manipulate output -> process more; e.g., applying sum over ((1,2,3),(4,5,6),(7,8,9)) yields (6,15,24)]
Tips on parallelizing R simply
- For embarrassingly parallel and data-replication-only jobs: condor_submit
- For coarse-grained, more complex parallelization, parallelize around the apply() operation:
  - Massively parallel, distributed: use snow
  - Single system, multicore: use multicore
- For fine-grained parallelization: there is too much overhead in R from interpreted code and from copying data structures on alteration -- use parallelized C through R's foreign language interfaces
- Seek out packages that are already parallelized by consulting the R HPC task view
Snow (Simple Network of Workstations)
- Simplified pragmatic approach for R
- Uses sockets (homegrown), PVM, or MPI
- Can be used remotely or locally, but higher overhead on local systems
- Well-suited for embarrassingly or coarse-grained parallelism
- Parallelizes the sapply(), apply(), tapply(), rapply() operators
- Not particularly fault tolerant -- if hosts go down, you have problems
- For the socket interface: use ssh for tunneling; create ssh keys and use an ssh key manager to avoid passwords
- Not supported by RCE (yet...)
> library(snow)
# set up a cluster on the local machine
> cl <- makeSOCKcluster(c("localhost","localhost"))
> matrix(1:100,10)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1   11   21   31   41   51   61   71   81    91
 [2,]    2   12   22   32   42   52   62   72   82    92
 [3,]    3   13   23   33   43   53   63   73   83    93
 [4,]    4   14   24   34   44   54   64   74   84    94
 [5,]    5   15   25   35   45   55   65   75   85    95
 [6,]    6   16   26   36   46   56   66   76   86    96
 [7,]    7   17   27   37   47   57   67   77   87    97
 [8,]    8   18   28   38   48   58   68   78   88    98
 [9,]    9   19   29   39   49   59   69   79   89    99
[10,]   10   20   30   40   50   60   70   80   90   100
# run matrix sum across two cores
> parApply(cl, matrix(1:100,10), 2, sum)
 [1]  55 155 255 355 455 555 655 755 855 955
Distributed Computing in R: Tips
- snow is the most mature; other approaches are documented in the R HPC task view
- snowfall and snowFT build fault tolerance, management, and single-threaded testing on top of snow
- Communication is often the limiting performance factor: pay attention to lexical scoping, function serialization, and use of environments to avoid transmitting unnecessary data implicitly
- Standard random number generators are not recommended for parallel use
  - Preferred: use snow with rsprng or rlecuyer
  - Minimally: set different seeds in each thread
R multicore
- Simplified pragmatic approach for R
- Uses processes on a single system
- Well-suited for coarse-grained parallelism
- Parallelizes the lapply() operator only
- Gives some explicit parallelization options using the fork/join idiom
> matrix(1:100,10)
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1   11   21   31   41   51   61   71   81    91
 [2,]    2   12   22   32   42   52   62   72   82    92
 [3,]    3   13   23   33   43   53   63   73   83    93
 [4,]    4   14   24   34   44   54   64   74   84    94
 [5,]    5   15   25   35   45   55   65   75   85    95
 [6,]    6   16   26   36   46   56   66   76   86    96
 [7,]    7   17   27   37   47   57   67   77   87    97
 [8,]    8   18   28   38   48   58   68   78   88    98
 [9,]    9   19   29   39   49   59   69   79   89    99
[10,]   10   20   30   40   50   60   70   80   90   100
> M <- matrix(1:100,10)
# convert to list form, one column per element
> lM <- lapply(1:(dim(M)[1]), function(x) (M[,x]))
# set number of threads to number of cores
> options(cores=multicore:::detectCores(all.tests=TRUE))
# run across multiple cores
> unlist(mclapply(lM, sum))
# if not in console mode, use:
# unlist(mclapplySafer(lM, sum))
 [1]  55 155 255 355 455 555 655 755 855 955
$ condorInteractiveSubmit -x -c 4 /usr/bin/R
R multicore tips
- Use where communication costs are too high for distributed computation
- Be cautious when running from a GUI (not thread safe):

mclapplySafer <- function(X, FUN, ...) {
  newFUN <- function(X, ...) {
    if (multicore:::isChild()) {
      multicore:::closeAll()
    }
    FUN(X, ...)
  }
  mclapply(X, newFUN, ...)
}
Look at packages that use it for examples and pre-parallelized functions:
ADaCGH2, BayesPeak, BARD, batch, cMonkey, codelink, CoxBoost, Daim, dcemri, dcemriS4, ddepn, doMC, doMPI, expectreg, flexclust, flowClust, GAMBoost, GGtools, hgam, les, LVSmiRNA, mapReduce, mboost, MeDiChI, MI_Interfaces, Modalclust, pathClass, phangorn, PICS, Rsolnp, Shortread, survey, TWIX
GPU-Powered Statistics -- Simply
- Simple, limited alternative to complicated GPU programming: magma, gputools
- Can offer 10x-100x speedups
- Requires Linux and a compatible NVIDIA GPU
- Not available in the RCE, but available on AWS
- Supported models: linear algebra; glm models; clustering; ICA

> chooseGpu(deviceId = 0)
> clotting <- data.frame(
+   u = c(5,10,15,20,30,40,60,80,100),
+   lot1 = c(118,58,42,35,27,25,21,19,18),
+   lot2 = c(69,35,26,21,18,16,13,12,12))
> summary(gpuGlm(lot1 ~ log(u), data=clotting, family=Gamma))
See: http://bioinformatics.oxfordjournals.org/content/26/1/134.full
Embarrassingly Parallel Tasks
- Jobs that require no communication, with separate data
- Easiest jobs to parallelize
- Variant: replicated data
  - Requires replication of data at the beginning
  - Summarize all results at the very end
  - Otherwise no communication
- Social science has a lot of these
Embarrassingly Parallel Social Science
- Model/parameter exploration
- Independent runs of agent-based simulations
- Computational sensitivity analysis
- MCMC chains
What do I need to get started?
- Executable
  - Many of our users make use of R, a statistical programming language
  - Just about any application can be used
- Input file
  - If you're using R, your input file is your R script file
  - Some applications may not need input files
Running your job
condor_submit_util is your friend! Running it with no options defaults to interactive mode; use this until you feel comfortable with the process.

Example of running non-interactively:

condor_submit_util -i bootstrap.R -a '--vanilla --no-save --args $(Process)' -n 20 -N
Batch Processing Programs
- condor_status -- shows the status of all of the nodes in the pool
- condor_q -- shows the status of all jobs in the queue
- condor_submit -- submits a cluster of jobs to the queue
- condor_submit_util -- RCE helper application that automates the submission process
- condor_userprio -- shows usage statistics and priorities for users who are actively using pool resources
Monitoring the RCE Batch Cluster
http://cod-head.hmdc.harvard.edu http://batch-head.hmdc.harvard.edu
Parallel Batch Jobs in the RCE
- Easy to use with embarrassingly parallel problems
- If resources are available, can use hundreds of cores
- Supports R, Stata, Matlab, C, Java, ... lots more
How Jobs are Scheduled
- Condor uses a "fair use" algorithm: users who have used fewer resources lately have a better opportunity to use resources in the near future
- Condor priority
  - Lower priority value = better chance of getting runtime
  - Priority factor -- defaults to 10; may increase or decrease depending on your affiliation
  - Average usage over time * priority factor = effective priority
- When nodes are free: Condor looks at the balance of nodes and figures out who gets the next "fair share"
- When no nodes are free, preemption can occur. A job will only preempt if:
  - The job is within the preemption window
  - The preempting user has a priority 1/10th that of the user currently running
- Preemption windows
  - Short (100 nodes): < 1 hour run time
  - Medium (200 nodes): 1-4 hour run time
  - Long (200 nodes): > 4 hour run time
RCE Condor Example

1. Create a directory for your jobs to run under. Move your input data there.

$ mkdir condor
$ cd condor
$ cp PATHTODATA/data .

2. Create N batch scripts, batch.R.[N]. Use the full path in your script when loading files.

3. Create a "submit" description, test1.submit (you can skip this and follow the prompts from condor_submit_util):

Universe = vanilla
Executable = /usr/bin/R
Arguments = --no-save --vanilla
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_output_files = out.$(PROCESS)
NUM_VIRTUAL_MACHINES = 2
input = batch.R.$(Process)
output = out.$(Process)
error = error.$(Process)
Log = log.$(Process)
Queue 2

4. Submit your job:

$ condor_submit test1.submit
Submitting job(s)..........
2 job(s) submitted to cluster 24.

5. Check with condor_status and condor_q.

Variations:
- RCE provides condor_submit_util and condor_watch utilities to simplify access
- Use condor_submit_util -k -i to submit; this uses interactive prompts and keeps the submit file, which you can then reuse/modify
- Run one input file; use $(PROCESS) in the arguments to have it do different things
- Add notification parameters, or use condor_watch_util to set up an email notification
- Chain Condor jobs together with dependencies through Condor DAGs
http://www.hmdc.harvard.edu/images/stories/hmdc/StatisticalComputing/ResearchComputing/Tutorial/Condor/getstarted_batchproc_in_rce.pdf
Scaling Out, Basics: Checklist
- Estimate potential gains from parallelization
- Assess the need for synchronization
- Avoid deadlocks
- Writing parallel code is hard! Look for two easy cases:
  - Someone else has done the parallelization already
  - Completely decomposable into lots of independent jobs
Lab: Scaling Out, Basics
- Compare Stata MP/SE mlogit and mprobit performance
- Launch batch jobs in the RCE
[Source: http://andreymath.wikidot.com/ . Creative Commons ShareAlike License]
Scaling Out: Frameworks and Patterns
- Parallel architectures
- Parallelization frameworks
- Parallel programming patterns
Parallel Architectures
MIMD variants:
- Uniform shared memory
- Non-uniform shared memory
- Distributed computing
[Table: Flynn's taxonomy -- Single/Multiple Instruction x Single/Multiple Data]
Parallel Architectures in Use
- MIMD
  - Uniform (almost) shared memory: multi-core chips
  - Distributed, shared-nothing: clusters and clouds
- SIMD ("compilers tread where angels fear...")
  - GPU processing
  - Intra-core SSE instructions
Parallelization Frameworks

Memory:
- Shared memory -- single address space; writes protected by synchronization
- Distributed -- message passing

Parallelization control:
- Explicit -- programmer responsible for synchronization details
- Implicit -- programmer responsible for model/algorithm; synchronization implied
Some Parallelization Tools

Implicit:
- Shared: parallelizing compilers; data-flow languages; parallel functional languages
- Distributed: single system image (MOSIX); DAGMan*

Explicit:
- Shared: R multicore*; Matlab parallel toolbox; OpenMP
- Distributed: MPI
* RCE capable
Frameworks for Parallelization

Single host:
- Implicit parallelism: parallelized libraries; parallelizing compilers
- Explicit parallelism: system-call level; parallelization frameworks

Distributed computing:
- Single system image
- Map-reduce
- Message passing
Problem Decomposition

Types of decomposition:
- Independent tasks
- Independent data chunks
- Independent phases / data flow (pipelining)

Example: simple data decomposition
- N tasks: split the data into N chunks, one per worker; then return to the main task
- Basis for loop-level parallelization
- Useful for sapply()-type operations
- Can be inefficient if the data are too small, or if chunks require varying computation time
Parallel Programming Patterns
Task decomposition:
- Linear: task parallelism
- Recursive: divide and conquer

Data partition:
- Linear: geometric decomposition
- Recursive: recursive data

Information flow:
- Regular: pipeline
- Irregular: producer-consumer; blackboard; monitor; agent

* Modified from Mattson et al.
Examples by directionality
- Divide the work/resources needed for the problem
- Minimize dependencies among pieces of the problem
- For distributed work: minimize communication among pieces

Data decomposition:
- Independent ("embarrassingly parallel"): analysis across separate datasets; parameterization studies; sensitivity analysis; MCMC chains; ...
- Uni-directional dependency: preprocessing; image processing; ...
- Complex dependency: parallel linear algebra; partitioned sorting; MCMC iterations; ...

Task/function decomposition:
- Independent: multi-user systems -- independent jobs
- Uni-directional dependency: pipelines
- Complex dependency: loop parallelization; client-server; producer-consumer

Object decomposition:
- Independent: ---
- Complex dependency: monitors; blackboard; agents; peer-to-peer; ...
More Parallelization Patterns
- Monitors: object-oriented; build synchronization into the data structure. Any program designed around data-structure access (e.g., a synchronized queue) automatically obtains consistency.
- Producer-consumer: multiple producers put tasks/data chunks into a queue/buffer; multiple consumers take tasks and process them. Can be pipelined -- consumers of one queue can be producers for another stage of processing.
Parallel Programming Idioms
Used in the implementation of larger patterns.
- Flow of control: single program multiple data; fork-join; master/worker; map/reduce; loop parallelism
- Data: shared data; shared queue; distributed array
- Synchronization: mutual exclusion (locks, semaphores); barriers; collective communication (broadcast, reduction)
Example: DAG-man
[Diagram: a single DAG node and a dependency graph]
- You model the dependencies; DAGMan sequences your batch jobs
DagMan Details
DAGMan file syntax:
JOB id file
PARENT id CHILD id1 id2 ...
Optional commands: RETRY, ABORT-DAG-ON, VARS, SCRIPT PRE/POST

Example:
JOB A A.condor
JOB B B.condor
JOB C C.condor
JOB D D.condor
PARENT A CHILD B C
PARENT B C CHILD D

Submission:
$ condor_submit_dag DAGInputFile
Integrating Analysis and Batch Phases
- Post-process R jobs for next steps
- Use DAGMan PRE/POST scripts to set up phases
- Externally manipulate output files with Perl or Python
- Save model output in machine-readable form, for use in other stages
- Use command-line arguments to control scripts (e.g., parseCommandArgs in the R "batch" package)
Lab: Scaling Out
- Create and launch DAG jobs
[Source: http://andreymath.wikidot.com/ . Creative Commons ShareAlike License]
Additional Resources
- Free computing
- Cloud computing
- Software
- Reading
High Performance Statistical Computing: Additional Resources
Computing Grants
- TeraGrid www.teragrid.org
- SDSC www.sdsc.edu
- ScienceClouds www.scienceclouds.org
- Amazon EC2 grants aws.amazon.com/education/
- Google computing grants research.google.com/university/
- Microsoft/NSF computing grants www.nsf.gov/dir/index.jsp?org=CISE
Insta-Clusters
- Grid Appliance www.grid-appliance.org
  - Self-configuring Linux grid
  - Runs on "harvested cycles" on Windows machines
  - Uses virtual machines, Condor, P2P networking
- BOINC boinc.berkeley.edu
  - Large-scale, independent computations; like SETI@home
  - Must compile your application with the BOINC libraries
- Rocks Cluster www.rocksclusters.org
  - Linux cluster installer
  - Many "rolls" -- choices of configurations
- Nimbus www.nimbusproject.org
  - Turn a local cluster into a cloud accessible through the Amazon EC2 APIs
- StarCluster web.mit.edu/stardev/cluster
  - Build auto-expanding clusters on Amazon EC2
Cloud Computing Resources

General platform:
- Amazon EC2 aws.amazon.com/ec2/ [most mature & richly functional]
- GoGrid www.gogrid.com
- AT&T Synaptic www.synaptic.att.com
- FlexiScale www.flexiant.com
- Rackspace Cloud www.rackspacecloud.com

Restricted computing models:
- Google App Engine code.google.com/appengine
- Microsoft Azure microsoft.com/windowsazure/
Software Profiling Tools

Timing:
- Windows: timeit.exe
- Linux: /usr/bin/time; getitimer()
- OS X: /usr/bin/time; getitimer()

Process/systems analysis:
- Windows: top.exe; taskmgr.exe; perfmon.exe
- Windows (advanced): procmon.exe [http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx]
- Linux: top; gnome-system-monitor; sar; vmstat; iostat
- OS X: top; Utilities -> Activity Monitor; sar; vmstat; atMonitor

System trace tools:
- Linux: strace; SystemTap (add-on)
- OS X: dtrace

Profilers:
- Matlab: profile
- R: Rprof(); Rprofmem()
- Stata: profile
- C: gprof
- Linux: oprofile (oprofile.sourceforge.net)
- OS X: DTrace (dtrace.org)
- Intel: VTune software.intel.com/en-us/intel-vtune/
- Parallel: TAU www.cs.uoregon.edu/research/tau/
M. Altman & B. KinneyHigh Perf. Stat. Computing
(v.9/10/11) 175
Additional Software Resources
- Lots of scientific and HPC libraries: maltman.hmdc.harvard.edu/numal/resources
- Lots of statistical and data analysis packages: maltman.hmdc.harvard.edu/socsci.shtml
- CRAN HPC task view for R: cran.r-project.org/web/views/HighPerformanceComputing.html
- Large-scale visualization:
  - Mondrian -- large categorical data: rosuda.org/Mondrian/
  - GGobi -- interactive data visualization: www.ggobi.org/
  - VisIt -- parallel visualization: wci.llnl.gov/codes/visit/
  - ParaView -- parallel visualization: www.paraview.org
- Large data manipulation:
  - PDL, the Perl Data Language -- pdl.perl.org
  - NumPy, numerical Python -- numpy.scipy.org
  - Embed R in PostgreSQL -- www.joeconway.com/plr/
Further Reading -- Computing

Hard problems:
- Garey & Johnson, 1979. Computers and Intractability.
- Papadimitriou, 1993. Computational Complexity.
- Arora & Barak, 2009. Computational Complexity: A Modern Approach.

Algorithms -- general:
- T. Cormen et al., 2001. Introduction to Algorithms (3rd ed.).
- S. Skiena, 1998. The Algorithm Design Manual.

Performance optimization:
- G.-P. D. Musumeci & M. Loukides, 2002. System Performance Tuning. O'Reilly.
- D. Shasha & P. Bonnet, 2003. Database Tuning. Morgan Kaufmann.
- Faroult & Robson, 2006. The Art of SQL. O'Reilly.
- Menasce et al., 2004. Performance by Design.

Parallelization:
- M. Ben-Ari. Principles of Concurrent and Distributed Programming. Springer.
- Miller & Boxer, 2005. Algorithms Sequential and Parallel.
- Mattson et al., 2004. Patterns for Parallel Programming.
Further Reading -- Statistical Computation
- S. Goedecker & A. Hoisie, 2001. Performance Optimization of Numerically Intensive Codes. SIAM Press.
- J. Nocedal & S.J. Wright, 1999. Numerical Optimization. Springer.
- A. Unwin, M. Theus, H. Hofmann, 2006. Graphics of Large Datasets. Springer.
- G. Givens & J. Hoeting, 2005. Computational Statistics. Wiley.
- Kontoghiorghes, 2006. Handbook of Parallel Computing and Statistics.
- Gentle et al., 2004. Handbook of Computational Statistics.
- Gentle, 2009. Computational Statistics.
- Higham, 2005. Accuracy and Stability of Numerical Algorithms.
- Aggarwal, 2007. Data Streams: Models and Algorithms.
Review: Nine Steps to Faster Results

1. Predict your resource needs through benchmarks, models, algorithmic analysis
2. Select alternate algorithms when resource needs grow very rapidly with problem size
3. Identify resource bottlenecks using systems performance analysis tools
4. Address bottlenecks by increasing resources and/or changing program resource management
5. Discover hot-spots in programs using profiling tools
6. Adapt hot-spots to system architecture
7. Decompose the problem into independent subproblems
8. Distribute subproblems across pools of resources
9. Repeat analysis after making any changes
Further Questions?
Statistical RCE accounts & assistance:
Questions about this talk, etc.:[email protected]