My general research area: System-level support for parallel and distributed computing
•User: run my_prog
•System: load my_prog onto nodes, run it, collect results
Efficiency: make the program run as fast as possible; use system resources (CPU, RAM, disk) efficiently
User-level abstractions that make the system easy to use
[Diagram: my_prog spread across many machines, CPU 1 through CPU n, each with its own memory and disk]
Nswap: Network RAM for cluster computers
Cluster computer: multiple inexpensive, "independent" machines connected by a network, running system software that makes them look and act like a single parallel computer
[Diagram: cluster nodes connected by a network, with cluster system software layered on top]
General Purpose Cluster
Multiple parallel programs share the cluster
•Assigned to some, possibly overlapping, machines
•Share network, memory, CPU, and disk resources
Program workload changes over time
•New programs may enter the system
•Existing programs may complete and leave
==> imbalances in RAM and CPU usage across nodes: some nodes don't have enough RAM, some have unused RAM
[Diagram: programs P1, P2, and P3 assigned to overlapping subsets of cluster nodes]
What happens when a node doesn't have enough RAM?
When data cannot all fit into memory, the OS moves data to/from disk as needed (swapping)
Time to access data:
•CPU: ~0.000000005 secs
•RAM: ~100 x slower than CPU
•Disk: ~10 million x slower
-> Swapping is really, really slow
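The gap these figures imply can be seen with a quick back-of-the-envelope calculation. This uses only the rounded numbers on the slide above; they are order-of-magnitude estimates, not measurements:

```python
cpu = 0.000000005            # ~5 ns for the CPU to access its own data
ram = cpu * 100              # RAM is ~100x slower than the CPU
disk = cpu * 10_000_000      # disk is ~10 million x slower

print(disk)                  # ~0.05 seconds per disk access
print(disk / ram)            # disk is ~100,000x slower than RAM
```

So a single disk access costs as much as roughly a hundred thousand RAM accesses, which is why a program that swaps heavily slows to a crawl.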
Network Swapping in a Cluster
[Diagram: Node 1 bypasses its disk and swaps pages of RAM over the network to remote idle memory on Nodes 2 and 3]
• Network Swapping: expand Node 1’s memory using idle RAM of other cluster nodes rather than local disk
Why Nswap?
            Swapping to Disk    Nswap
Run time    3561.66 seconds     105.50 seconds (speed-up of 33.8)
The network is much faster than disk, so swapping data over the network to remote RAM is much faster than swapping to local disk
Nswap Architecture
Divided into two parts that run on each node:
1) Nswap client: device driver for the network swap "device"
•The OS makes swap-in & swap-out requests to it
2) Nswap server: manages part of RAM for caching remotely swapped pages (the Nswap Cache)
[Diagram: Node A's Nswap client (OS space) swaps a page out, through the Nswap communication layer and network, to Node B's Nswap server (user space), which stores it in the Nswap Cache; every node runs both a client and a server]
This summer
Answer questions having to do with policies for growing/shrinking the RAM available for Nswap, and implement solution(s):
•How do we know when idle RAM is available?
•Can we predict when idle RAM will be available for a long enough time to make it useful for Nswap?
•How much idle RAM should we take for Nswap?
•How much RAM should we "give back" to the system when it needs it?
Investigate incorporating Flash memory into the memory hierarchy and using it with Nswap to speed up swap-ins
System-level support for computation involving massive amounts of globally dispersed data (cluster computing on steroids)
•Internet-scale distributed/parallel computing
•Caching, prefetching, programming interface?
How Pages Move Around the Cluster
Swap Out: Node A (not enough RAM) sends a page to Node B (has idle RAM)
Swap In: Node A fetches its page back from Node B
Migrate from node to node with changes in workload: when Node B needs more RAM, it migrates A's page to Node C
Reliable Network RAM
Automatically restore remotely swapped page data lost in a node crash
How: Need redundancy
•Extra space to store redundant info
•Avoid using slow disk
•Use idle RAM in the cluster to store redundant data
•Minimize use of idle RAM for redundant data
•Extra computation to compute redundant data
•Minimize extra computation overhead
Soln 1: Mirroring
When a page is swapped out, send it to be stored in the idle RAM of 2 nodes
•If the first node fails, can fetch a copy from the second node
+ Easy to implement
- Requires twice as much idle RAM space for the same amount of data
- Requires twice as much network bandwidth
•Two page sends across the network vs. one
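A minimal sketch of the mirroring idea, in Python rather than the kernel-level C the real system would use. The Server and CrashedNode names are hypothetical stand-ins for this sketch:

```python
class CrashedNode(Exception):
    """Raised when a fetch targets a node that has failed."""

class Server:
    """Stand-in for a remote node's idle-RAM page cache."""
    def __init__(self):
        self.cache = {}
        self.alive = True
    def store(self, pid, page):
        self.cache[pid] = page
    def fetch(self, pid):
        if not self.alive:
            raise CrashedNode()
        return self.cache[pid]

def swap_out_mirrored(pid, page, primary, backup):
    # Two page sends across the network instead of one.
    primary.store(pid, page)
    backup.store(pid, page)

def swap_in_mirrored(pid, primary, backup):
    try:
        return primary.fetch(pid)
    except CrashedNode:
        return backup.fetch(pid)    # first node failed: use the mirror

a, b = Server(), Server()
swap_out_mirrored(7, b"page data", a, b)
a.alive = False                     # simulate a crash of the first node
assert swap_in_mirrored(7, a, b) == b"page data"
```

The cost is visible in the sketch: every swap-out does two stores, which is where the doubled RAM and bandwidth overhead comes from.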
Soln 2: Centralized Parity
Encode redundant info for a set of pages across different nodes in a single parity page
If data is lost, it can be recovered using the parity page and the other data pages in the set
[Diagram: example parity computation — the bits of data pages 0100, 0001, and 1001 are XORed into a parity page, and a lost page is recovered by XORing the parity page with the remaining data pages]
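The XOR arithmetic behind the picture can be sketched in a few lines of Python. The page contents here are made-up single-byte "pages" standing in for real multi-kilobyte pages:

```python
def xor_pages(pages):
    """XOR a set of equal-length pages together, byte by byte."""
    result = bytearray(len(pages[0]))
    for page in pages:
        for i, b in enumerate(page):
            result[i] ^= b
    return bytes(result)

# Three data pages, stored on three different nodes.
pages = [bytes([0b0100]), bytes([0b0001]), bytes([0b1001])]
parity = xor_pages(pages)       # parity page, stored elsewhere

# If the node holding pages[1] crashes, recover its page by XORing
# the parity page with the surviving data pages in the set.
recovered = xor_pages([parity, pages[0], pages[2]])
assert recovered == pages[1]
```

Because XOR is its own inverse, any single lost page in the set can be rebuilt this way, at the cost of storing one parity page per set rather than one full copy per page.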
Centralized Parity (cont.)
A single dedicated cluster node is the parity server
•Stores all parity pages
•Implements page recovery on a crash
Parity Logging: regular nodes compute a parity page locally as they swap out pages; only when the parity page is full is it sent to the parity server
•One extra page send to the parity server on every N page swap-outs (vs. 2 sends on every swap-out for mirroring)
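A hedged sketch of the parity-logging bookkeeping on a regular node. ParityLogger and send_to_server are hypothetical names for this sketch, and the pages are a few bytes rather than real page-sized buffers:

```python
class ParityLogger:
    """XOR each swapped-out page into a local parity buffer; send the
    buffer to the parity server only once every group_size swap-outs."""
    def __init__(self, page_size, group_size, send_to_server):
        self.parity = bytearray(page_size)
        self.group_size = group_size
        self.count = 0
        self.send_to_server = send_to_server   # network send, assumed

    def swap_out(self, page):
        for i, b in enumerate(page):
            self.parity[i] ^= b                # cheap local XOR
        self.count += 1
        if self.count == self.group_size:      # parity page is full
            self.send_to_server(bytes(self.parity))
            self.parity = bytearray(len(self.parity))
            self.count = 0

sent = []
log = ParityLogger(page_size=2, group_size=3, send_to_server=sent.append)
for page in (b"\x0f\x00", b"\x01\x01", b"\x02\x00"):
    log.swap_out(page)
# Three swap-outs, but only one extra send to the parity server.
assert len(sent) == 1
```

This is the source of the advantage over mirroring: one extra network send per group of N swap-outs instead of one extra send per swap-out.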
Soln 3: Decentralized Parity
[Diagram: pages striped across four cluster nodes, with the parity pages distributed among the nodes rather than held on one server]
No Dedicated Parity Server
Parity pages are distributed across cluster nodes
Centralized vs. Decentralized
Results
Future Work
Acknowledgments
Sequential Programming
Designed to run on computers with one processor (CPU)
•The CPU knows how to do a small number of simple things (instructions)
A sequential program is an ordered set of instructions the CPU executes to solve a larger problem
(ex) Compute 3^4:
1. Multiply 3 and 3
2. Multiply result and 3
3. Multiply result and 3
4. Print out result
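The four steps above, written as a short sequential program:

```python
result = 3 * 3          # 1. multiply 3 and 3
result = result * 3     # 2. multiply result and 3
result = result * 3     # 3. multiply result and 3
print(result)           # 4. print out result: 81
```

The CPU runs these one at a time, in order, which is exactly what makes the program sequential.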
Sequential Algorithm
For each time step do:
  For each grid element X do:
    compute X's new value: X = f(old X, neighbor 1, neighbor 2, …)
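The nested loops above can be sketched in Python. The update function f here is a hypothetical stand-in that averages a cell with its two neighbors on a small 1-D grid:

```python
def step(grid):
    """One time step: compute every element's new value from its old
    value and its neighbors, working on a copy so every update sees
    only old values."""
    new = grid[:]
    for i in range(1, len(grid) - 1):   # boundary cells keep their value
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3
    return new

grid = [0.0, 9.0, 0.0, 0.0]
for _ in range(2):                      # run two time steps
    grid = step(grid)
```

A real simulation would use a 2-D or 3-D grid and a physics-based f, but the structure — an outer loop over time steps and an inner loop over every grid element — is the same.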
The Earth Simulator
Japan Agency for Marine-Earth Science and Technology
How Computer Executes Program
1. OS loads program code & data from disk into RAM
2. OS loads the CPU with the first instruction to run
3. CPU starts executing instructions one at a time
4. OS may need to move data between RAM & disk as the program runs
How Fast is this?
CPU speed determines the max number of instructions it can execute
•Upper bound: 1 clock cycle ≈ 1 instruction
•1 GHz clock: ~1 billion instructions/sec
The max is never achieved:
•When the CPU needs to access RAM: takes ~100 cycles
•If the OS needs to bring in more data from disk (RAM is fixed-size, so not all program data can fit): takes ~1,000,000 cycles
Fast desktop machine
~1 GigaHertz processor
•Takes ~.000000005 seconds to access data
2 GigaBytes of memory (2^31 bytes)
•Takes ~.000001 seconds to access data
80 GB of disk space
•Takes ~.01 seconds to access data: 1 million times slower than if data is on the CPU
Requirements of Simulation
Petabytes of data
•1 petabyte is 2^50 bytes (1,125,899,906,842,624 bytes)
Billions of computations at each time step
We need help:
•A single computer cannot do one time step in real time
•Need a supercomputer: lots of processors running the simulation program in parallel, lots of memory space, lots of disk storage space
Parallel Programming
Divide data and computation into several pieces and let several processors simultaneously compute their pieces
[Diagram: the data (3.6, 1.2, 2.3, 2.6, …) divided among Processors 1 through n]
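The divide-and-compute idea can be sketched with threads standing in for the processors. This is a toy illustration of the data decomposition, not how a real message-passing supercomputer program is written, and the last two data values are made up to fill in the elided "…":

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Each 'processor' computes over only its own piece of the data."""
    return sum(chunk)

data = [3.6, 1.2, 2.3, 2.6, 5.0, 4.1]
n = 3                                    # pretend we have 3 processors
chunks = [data[i::n] for i in range(n)]  # deal the data out round-robin
with ThreadPoolExecutor(max_workers=n) as pool:
    total = sum(pool.map(partial_sum, chunks))
```

Each worker touches only its own chunk, so the pieces can genuinely proceed at the same time; the final sum combines the partial results.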
Supercomputers of the 90's
Massively parallel
•1,000's of processors
Custom, state-of-the-art
•Hardware
•Operating system
•Specialized programming languages and programming tools
Fastest Computer*
[Chart: GFlops/sec of the world's fastest computer, 1990–2005, on a log scale from 1 to 1,000,000; machines include the Cray Y-MP, TMC CM-5, Intel Paragon, ASCI Blue, ASCI White, the Earth Simulator, and Blue Gene]
*(www.top500.org & Jack Dongarra)
A computation that took 1 year in 1980 took 16 minutes in 1995 and 27 seconds in 2000
Problems with Supercomputers of the 90's
•Expensive
•Time to delivery: ~2 years
•Out of date soon
Cluster: Supercomputer of the 00's
Massively parallel supercomputer built out of a network of unimpressive PCs
•Each node is off-the-shelf hardware running an off-the-shelf OS
Are Clusters Good?
+ Inexpensive
•Parallel computing for the masses
+ Easy to upgrade
•Individual components can be easily replaced
•Off-the-shelf parts, HW and SW
•Can constantly and cheaply build a faster parallel computer
- Using off-the-shelf parts
•Lag time between the latest advances and their availability outside the research lab
•Using parts that are not designed for parallel systems
Currently 7 of the world's 10 fastest computers are clusters
System-level Support for Clusters
Implement the view of a single large parallel machine on top of separate machines
•A single, big, shared memory on top of n small, separate ones
•A single, big, shared disk on top of n small, separate ones
Nswap: Network Swapping
Implements a view of a single, large, shared memory on top of cluster nodes' individual RAM (physical memory)
•When one cluster node needs more memory space than it has, Nswap enables it to use the idle remote RAM of other cluster node(s) to increase its "memory" space
Traditional Memory Management
[Diagram: the OS moves pages of Program 1 and Program 2 between RAM and the swap partition on disk, which also holds files]
The OS moves parts (pages) of running programs in/out of RAM
•RAM: limited-size, expensive, fast storage
•Disk: larger, inexpensive, slow (1,000,000 x slower) storage
•Swap: virtual memory that is really on disk; expands memory using disk
Network Swapping in a Cluster
[Diagram: Node 1 swaps pages over the network into the idle RAM of Nodes 2 and 3 instead of to its disk, which still holds files]
•Network Swapping: expand memory using the RAM of other cluster nodes
Nswap Goals
Transparency
•Processes don't need to do anything special to use Nswap
Efficiency and Scalability
•Point-to-point model (rather than a central server)
•Don't require complete state info to make swapping decisions
Adaptability
•Adjusts to local processing needs on each node
•Grow/shrink the portion of a node's RAM used for remote swapping as its memory use changes
Nswap Architecture
Divided into two parts that run on each node:
1) Nswap client: device driver for the network swap "device"
•The OS makes swap-in & swap-out requests to it
2) Nswap server: manages part of RAM for caching remotely swapped pages (the Nswap Cache)
[Diagram: Node A's Nswap client (OS space) swaps a page out, through the Nswap communication layer and network, to Node B's Nswap server (user space), which stores it in the Nswap Cache; every node runs both a client and a server]
How Pages Move Around the Cluster
Swap Out: Client A sends a page to Server B
Swap In: Client A fetches its page back from Server B
Migrate from server to server: when Server B is full, it migrates A's page to Server C
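A toy sketch of the bookkeeping implied above, in Python with hypothetical Client and Server classes (the real Nswap client is a Linux device driver, and how the client learns a page's new location is an assumption of this sketch):

```python
class Client:
    """Tracks which server currently holds each remotely swapped page."""
    def __init__(self):
        self.page_map = {}                 # page id -> server holding it

    def swap_out(self, pid, page, server):
        server.store(pid, page)
        self.page_map[pid] = server

    def swap_in(self, pid):
        return self.page_map[pid].fetch(pid)

    def moved(self, pid, new_server):      # notification after a migration
        self.page_map[pid] = new_server

class Server:
    """A remote node's Nswap cache."""
    def __init__(self):
        self.cache = {}

    def store(self, pid, page):
        self.cache[pid] = page

    def fetch(self, pid):
        return self.cache.pop(pid)         # swap-in reclaims the cached copy

    def migrate(self, pid, target, owner):
        """Free local RAM by pushing a cached page to another server."""
        target.store(pid, self.cache.pop(pid))
        owner.moved(pid, target)           # tell the client where it went

a = Client()
b, c = Server(), Server()
a.swap_out(42, b"page 42", b)       # Swap Out: A -> B
b.migrate(42, c, a)                 # B is full: migrate A's page to C
assert a.swap_in(42) == b"page 42"  # Swap In now fetches from C
```

The sketch also hints at the complications discussed next: a migration and a swap-in racing on the same page id would need synchronization that this single-threaded toy omits.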
Complications
Simultaneous conflicting operations
•Ex. a migration and a swap-in for the same page
Garbage pages in the system
•When a program terminates, its remotely swapped pages need to be removed from the servers
Node failure
•Can lose remotely swapped page data
Currently, our project…
Implemented on a Linux cluster of 8 nodes connected by a switched 100 Mb/sec Ethernet network
•All nodes have disk that is faster than the network
•Disk: up to 500 Mb/sec
•Network: up to 100 Mb/sec
-> We expect to be slower than swapping to disk
Experiments
Workload 1: sequential reads & writes to a large chunk of memory
•Best case for swapping to disk
Workload 2: random reads & writes to memory
•Disk arm seeks within the swap partition
Workload 3: Workload 1 + file I/O
•Disk arm seeks between the swap and file partitions
Workload 4: Workload 2 + file I/O
Workload Execution Times
[Chart: execution times (0–800 seconds) of WL1–WL4 for Disk (500 Mb/sec) vs. Nswap (100 Mb/sec)]
•Nswap is faster than swapping to the much faster disk for workloads 2, 3, and 4
Nswap on Faster Networks
Measured on disk, 10 Mb/s, and 100 Mb/s; calculated speed-up values for 1,000 & 10,000 Mb/s
(times in seconds; parentheses give speed-up over the 10 Mb/s network)

Workload | Disk    | 10 Mb/s | 100 Mb/s       | 1,000 Mb/s   | 10,000 Mb/s
(1)      | 12.27   | 306.69  | 56.8 (5.4)     | 28.9 (10.6)  | 26.3 (11.6)
(2)      | 266.79  | 847.74  | 153.5 (5.5)    | 77.3 (10.9)  | 70.3 (12.1)
(4)      | 6265.39 | 9605.91 | 1733.9 (5.54)  | 866.2 (11.1) | 786.7 (12.2)
Conclusions
Nswap: scalable, adaptable, transparent network swapping for Linux clusters
Results show Nswap is
•comparable to swapping to disk on a slow network
•much faster than disk on faster networks
•Based on network vs. disk speed trends, Nswap will be even better in the future
Acknowledgements
Students:
Sean Finney '03, Matti Klock '03, Kuzman Ganchev '03, Gabe Rosenkoetter '02, Michael Spiegel '03, Rafael Hinojosa '01
Michener Fellowship for Second Semester Leave Support
More information, results, details:
EuroPar'03 paper, CCSCNE'03 poster
http://www.cs.swarthmore.edu/~newhall/