Date posted: 21-Feb-2017
Category: Science
Uploaded by: soumya-banerjee
A Multi-Agent System Approach to Load Balancing and Resource Allocation for Distributed Computing
Soumya Banerjee & Joshua Hecker
• The age of distributed computing
• A trend of moving computation onto inexpensive but geographically distributed computers
• Examples: SETI@home, LHC@home
• The need for efficient allocation algorithms
Motivation
Decentralized Computing
• Can alleviate the computing load on centralized monitors
• Robust to single points of failure
• Can achieve application-level resource management (nodes can manage resources better than a global monitor)
• Can scale more gracefully, since as the system grows a centralized monitor has to communicate with more and more nodes
• Can better respond to fluctuations in process requirements
• Handles the scenario where the system has to "forget" past process requirements and completely rebuild new clusters after servicing one process, i.e. no locality
• An agent is a computing node; agents join together to form a cluster
• Multi-agent systems have emergent properties
• They have been used to model biological phenomena and real-life problems (left: Keepaway soccer, right: ant foraging):
Multi-Agent Systems
• A huge number of distributed nodes or agents
• There are advantages to computing with geographically proximal computers, due to network latency, bandwidth limitations, etc.
• A global data structure holds a large number of tasks/processes
• A new process entering the system declares a priori the number of threads it can be parallelized into and its resource requirement (CPUreq)
• A cluster is a network of computers which together can completely service the resource requirements of a single task
• Over time, clusters are created, dissolved and created again dynamically in order to serve the resource requirements of the tasks in the queue
Problem Statement and Assumptions
• dRAP: Distributed Resource Allocation Procedure
• Mode 1: an agent/node that is currently not part of a cluster and has no task assigned to it
1. the agent looks at queue Q, examines unallocated tasks and takes on the task which minimizes |1 − CPUreq|
• Mode 2: an agent/node that is currently not part of a cluster and has a task assigned to it
1. keep executing the task
2. if the task requirements are not completely satisfied, i.e. CPUreq > 1, keep querying your neighbors and try to form a cluster such that CPUcluster = CPUreq
3. when the task completes, go to Mode 1
dRAP Algorithm
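Mode 2's cluster growth can be sketched as a greedy recruitment loop. This is an illustrative reconstruction, not the authors' implementation: it assumes each node contributes one unit of CPU, and it stops once the cluster's capacity covers CPUreq (the name `form_cluster` is ours).

```python
# Greedy sketch of Mode 2's cluster growth: recruit neighbors until
# the cluster's combined CPU covers the task's CPUreq.
# Assumption (not from the slides): every node contributes node_cpu = 1.0.
def form_cluster(cpu_req, neighbors, node_cpu=1.0):
    """Return (recruited nodes, total capacity) once capacity covers cpu_req."""
    cluster, capacity = [], node_cpu          # the recruiting agent itself counts
    for n in neighbors:
        if capacity >= cpu_req:
            break
        cluster.append(n)
        capacity += node_cpu
    return cluster, capacity

cluster, cap = form_cluster(cpu_req=3.5, neighbors=["n1", "n2", "n3", "n4"])
print(cluster, cap)   # -> ['n1', 'n2', 'n3'] 4.0
```

With unit-capacity nodes the loop can only guarantee CPUcluster ≥ CPUreq; hitting CPUcluster = CPUreq exactly requires the requirement to be an integer multiple of the node capacity.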
• Mode 3: an agent/node that is currently part of a cluster and has no task assigned to it
1. the agent looks at queue Q, examines unallocated tasks and takes on the task which minimizes |CPUreq − CPUcluster|
• Mode 4: an agent/node that is currently part of a cluster and has a task assigned to it
1. keep executing the task
2. when the task completes, break up the cluster and go to Mode 1
dRAP Algorithm
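The task-selection rule shared by Modes 1 and 3 (minimize the gap between available capacity and CPUreq) can be sketched as follows; the `Task` fields and the helper name are assumptions for illustration, with capacity = 1 for a lone node (Mode 1) and capacity = CPUcluster (Mode 3).

```python
# Sketch of the best-fit task selection used in Modes 1 and 3.
# Task fields and helper names are illustrative, not from the original code.
from collections import namedtuple

Task = namedtuple("Task", ["task_id", "cpu_req"])

def pick_task(tasks, capacity):
    """Return the task minimizing |capacity - CPUreq|, or None if empty.
    Mode 1 uses capacity = 1 (a single node); Mode 3 uses CPUcluster."""
    return min(tasks, key=lambda t: abs(capacity - t.cpu_req), default=None)

tasks = [Task(0, 3.5), Task(1, 0.8), Task(2, 2.0)]
print(pick_task(tasks, capacity=1.0).task_id)   # -> 1 (CPUreq = 0.8 is closest to 1)
print(pick_task(tasks, capacity=3.0).task_id)   # -> 0 (CPUreq = 3.5 is closest to 3)
```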
• Caveat: task-list traversal requires O(nm) time per timestep, where n = number of tasks and m = number of clusters
• For the entire simulation: ∑_{i=0}^{n} (n − i)·m ≈ O(n²m)
• Compare to FIFO scheduling, which drops to O(nm)
• Does our algorithm's increased complexity per timestep provide enough of a decrease in scheduling time to be effective?
dRAP Algorithm
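A quick numerical check of the cost estimate above, under the assumption (implicit in the sum) that the queue shrinks by one task per timestep; this is a sketch of the counting argument, not the simulator itself.

```python
# Numerical check that sum_{i=0}^{n} (n - i) * m grows as O(n^2 * m):
# doubling n roughly quadruples the total traversal cost.
def total_cost(n, m):
    """Total task-list traversal cost over a run that starts with n tasks
    and m clusters, assuming one task leaves the queue per timestep."""
    return sum((n - i) * m for i in range(n + 1))

print(total_cost(100, 5), total_cost(200, 5))   # -> 25250 100500 (ratio ~4)
```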
• Example screenshots of the implementation (lines show clusters, red symbolizes task execution):
Simulation
• Comparisons with a null model (the FIFO scheduling algorithm)
• Time to empty the queue (of 1000 tasks) = Tcomplete
• Average waiting time (averaged over 1000 tasks) = Twait
• Values are given in simulation time steps:
Experiments

        Tcomplete   Twait
dRAP    845.60      342.54
FIFO    1071.20     475.31
• Utilization experiments
• We compared the cluster utilization achieved by our algorithm vs. the FIFO scheduling algorithm
• Calculation for each task (averaged over the total number of tasks): Utilization = Nodes_req / Nodes_cluster
• The optimal value is 100% (our algorithm always achieves this):
Experiments

        Utilization
dRAP    100%
FIFO    56%
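The utilization metric, as reconstructed from the slide (Nodes_req / Nodes_cluster, averaged over tasks), can be computed as follows; the function name and input format are illustrative.

```python
# Sketch of the utilization metric: per task, the ratio of nodes actually
# required to nodes in the serving cluster, averaged over all tasks.
def mean_utilization(tasks):
    """tasks: list of (nodes_required, nodes_in_cluster) pairs."""
    return sum(req / cluster for req, cluster in tasks) / len(tasks)

# A cluster sized exactly to demand scores 1.0; over-provisioning lowers it.
print(mean_utilization([(4, 4), (2, 4), (3, 3)]))  # (1.0 + 0.5 + 1.0)/3 ≈ 0.833
```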
• Lastly, we looked at how the average waiting time and the time to completion scale with the number of nodes in the system
Experiments
[Figure: Scaling of Tcomplete vs. Nodes; Scaling of Twait vs. Nodes]
� Same data using log2 on axes and a power curve fit:
Experiments
[Figure: Scaling of Tcomplete (log2 axes): y = 63630·x^(−0.927), R² = 0.9976; Scaling of Twait (log2 axes): y = 47010·x^(−1.075), R² = 0.9992]
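Power-curve fits like those reported above can be obtained with a least-squares line fit in log-log space. The sketch below uses synthetic data (not the experiment's measurements) to show the method.

```python
# Fit y = a * x**b by linear least squares on (log x, log y):
# log y = log a + b * log x, so the slope is the exponent b.
import math

def power_fit(xs, ys):
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    b = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
         / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - b * mx)
    return a, b

# Synthetic data drawn from y = 1000 * x**-1 recovers a and b exactly:
xs = [40, 80, 160, 320, 640]
a, b = power_fit(xs, [1000 / x for x in xs])
print(round(a), round(b, 3))   # -> 1000 -1.0
```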
Optimizations Inspired by the Natural Immune System
• Operates under constraints of physical space
• Resource constrained (metabolic input, number of immune system cells)
• Performance scalability is an important concern (mice to horses) (Banerjee and Moses, 2010, in review)
Search Problem
• Immune system cells have to search throughout the whole body to locate small quantities of pathogen
Response Problem
• Have to respond by producing antibodies
Nearly Scale-Invariant Search and Response
• How does the immune system search and respond in almost the same time irrespective of the size of the search space?
Crivellato et al. 2004
Solution?
Lymph Nodes (LN)
• A place in which IS cells and the pathogen can encounter each other in a small volume
• Form a decentralized detection network
Decentralized Detection Network
Lymph Node Dynamics
Summary
• There are increasing costs to global communication as organisms grow bigger
• Semi-modular architecture balances the opposing goals of detecting pathogen (local communication) and recruiting IS cells (global communication)
• Can we emulate this modular RADAR strategy in distributed systems?
Optimizations inspired by the immune system
• The move towards distributed computing necessitates efficient scheduling algorithms
• Decentralized scheduling across a large number of nodes is robust, reduces the load on a centralized monitor, and responds better to fluctuations in task queue requirements
• Multi-agent systems have emergent properties and were used here to adaptively create and allocate clusters to match task demand
• Our algorithm outperforms the null model (FIFO scheduling) on average waiting time, time to empty the task queue, and utilization
• Further, our algorithm is robust to adversarial attack (fluctuations in the processor requirements of tasks in the queue)
Conclusions
• Value of immune system inspired approaches
• General theory of scaling of artificial immune systems
Conclusions
• Compare with more null-model algorithms
• Compare with algorithms used in industry, e.g. SLURM, which uses static allocation of nodes to clusters known as partitions
• Compare with the cluster allocation algorithm used by Google in MapReduce (our algorithm could improve on their locality optimization, since it seeks to form clusters with its neighbors)
• … and sell to the highest bidder!
Future Work
• Dr. Dorian Arnold
Acknowledgements