MONARC Simulation Framework
Corina Stratan, Ciprian Dobre (UPB)
Iosif Legrand, Harvey Newman (CALTECH)
December 2003
I.C. Legrand 2
The GOALS of the Simulation Framework
The aim of this work is to continue and improve the development of the MONARC simulation framework:
To perform realistic simulation and modelling of large scale distributed computing systems, customised for specific HEP applications.
To offer a dynamic and flexible simulation environment to be used as a design tool for large distributed systems
To provide a design framework to evaluate the performance of a range of possible computer systems, as measured by their ability to provide the physicists with the requested data in the required time, and to optimise the cost.
A Global View for Modelling
[Diagram: layered architecture. The Simulation Engine is the base; Basic Components (CPU, DB, LAN/WAN, Job, Scheduler, Catalog) are built on it; Specific Components (Analysis, Distributed Scheduler, MetaData, Jobs) sit above; Computing Models are at the top. Monitoring of real systems and testbeds validates the models.]
Design Considerations
This simulation framework is not intended to be a detailed simulator for basic components such as operating systems, database servers or routers. Instead, based on realistic mathematical models and on parameters measured on testbed systems for all the basic components, it aims to correctly describe the performance and limitations of large distributed systems with complex interactions.
Simulation Engine
Design Considerations of the Simulation Engine
A process oriented approach for discrete event simulation is well suited to describe concurrent running programs.
“Active objects” (having an execution thread, a program counter, stack...) provide an easy way to map the structure of a set of distributed running programs into the simulation environment.
The simulation engine supports an "interrupt" scheme. This allows effective and correct simulation of concurrent processes with very different time scales, by using a DES approach with a continuous process flow between events.
The Simulation Engine – Tasks and Events
[State diagram: a task moves Created → Ready (assigned to a worker thread) → Running (semaphore.v()); Running → Waiting (semaphore.p()) and back to Ready when the awaited event happens or the sleeping period is over; Running → Finished.]
Task – simulates an entity with time-dependent behavior (active object, server, …).
A task can be in one of five states: CREATED, READY, RUNNING, WAITING, FINISHED.
Each task maintains an internal semaphore used for switching between states.
Event - used for communication and synchronization between tasks: when a task must notify another task about something that happened or will happen in the future, it creates an event addressed to that task.
The events are queued and sent to the destination tasks by the engine’s scheduler.
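A minimal sketch of how such a semaphore-based task/event scheme can fit together. The class names and the scheduler loop are illustrative assumptions, not MONARC's actual API: each task owns a semaphore, and the engine's scheduler pops queued events in time order and wakes the addressed task.

```java
import java.util.PriorityQueue;
import java.util.concurrent.Semaphore;

public class EngineSketch {
    enum State { CREATED, READY, RUNNING, WAITING, FINISHED }

    static class SimEvent implements Comparable<SimEvent> {
        final double time; final Task target;
        SimEvent(double time, Task target) { this.time = time; this.target = target; }
        public int compareTo(SimEvent o) { return Double.compare(time, o.time); }
    }

    static class Task {
        State state = State.CREATED;
        final Semaphore sem = new Semaphore(0); // internal semaphore for state switches
        void deliver(SimEvent e) { state = State.READY; sem.release(); } // semaphore.v()
    }

    // Scheduler: pop events in time order and wake the destination tasks.
    static double run(PriorityQueue<SimEvent> queue) {
        double clock = 0;
        while (!queue.isEmpty()) {
            SimEvent e = queue.poll();
            clock = e.time;       // simulation time jumps to the event's timestamp
            e.target.deliver(e);
        }
        return clock;
    }

    public static void main(String[] args) {
        Task a = new Task(), b = new Task();
        PriorityQueue<SimEvent> q = new PriorityQueue<>();
        q.add(new SimEvent(5.0, b));
        q.add(new SimEvent(2.0, a));
        System.out.println("final clock = " + run(q)); // events processed in time order
    }
}
```

In the real engine the woken task runs on a worker thread until it blocks again; here the delivery simply flips its state.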
Tests of the Engine
Processing a total of 100,000 simple jobs on 1, 10, 100, 1,000, 2,000, 4,000 and 10,000 CPUs, using the same number of parallel threads.
More tests: http://monarc.cacr.caltech.edu/
[Plot: execution time (s) versus number of threads (10 to 100,000), log–log scale, measured on three machines: 2×2.4 GHz Linux, 2×450 MHz Solaris, 2×3 GHz Windows.]
Basic Components
Basic Components
These basic components can simulate the core functionality of general distributed computing systems. They are built on the simulation engine and make efficient use of the interrupt functionality implemented for active objects.
These components should be considered the base classes from which specific components can be derived and constructed.
Basic Components
• Computing nodes
• Network links and routers, I/O protocols
• Data containers
• Servers: database servers; file servers (FTP, NFS, …)
• Jobs: processing jobs; FTP jobs
• Scripts & graph execution schemes
• Basic scheduler
• Activities (a time sequence of jobs)
Multitasking Processing Model
Concurrent running tasks share resources (CPU, memory, I/O)
"Interrupt"-driven scheme: for each new task, or when a task finishes, an interrupt is generated and all "processing times" are recomputed.
It provides:
• handling of concurrent jobs with different priorities;
• an efficient mechanism to simulate multitask processing;
• an easy way to apply different load-balancing schemes.
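The interrupt-driven recomputation above can be sketched in a few lines. An equal CPU share per job is assumed here for simplicity; the class names are illustrative, and the priority handling mentioned above is omitted.

```java
import java.util.ArrayList;
import java.util.List;

public class CpuShareSketch {
    static class Job { double workLeft; Job(double w) { workLeft = w; } }

    // Equal-share model: on every "interrupt" (a job arrives or finishes),
    // each job's estimated completion time is recomputed under the new share.
    static List<Double> completionEstimates(List<Job> running, double cpuPower) {
        double share = cpuPower / running.size();   // power per job after the interrupt
        List<Double> eta = new ArrayList<>();
        for (Job j : running) eta.add(j.workLeft / share);
        return eta;
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>(List.of(new Job(100), new Job(50)));
        System.out.println(completionEstimates(jobs, 10)); // [20.0, 10.0] at 5 units/job
        jobs.add(new Job(30));                             // new job triggers an interrupt
        System.out.println(completionEstimates(jobs, 10)); // shares drop to 10/3
    }
}
```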
LAN/WAN Simulation Model
[Diagram: three LANs of nodes, each behind a router, connected through Internet links.]
"Interrupt"-driven simulation: for each new message an interrupt is created, and for all active transfers the speed and the estimated time to complete the transfer are recalculated.
Continuous flow between events! An efficient and realistic way to simulate concurrent transfers having different sizes / protocols.
Network model
Data traffic is simulated for both local and wide area networks.
A simulation at the packet level is practically impossible at this scale, so we adopted a larger-scale approach based on an "interrupt" mechanism.
Components of the network model
Network Entity:
• LAN, WAN, LinkPort
• main attribute: bandwidth
• keeps track of the messages that traverse it
Simulating the network transfers
[Diagram: a new message travels from a CPU LinkPort at CERN through the CERN LAN, CERN router and CERN WAN to the Caltech WAN, Caltech router and Caltech LAN; interrupts (INT) are raised for the messages already in transit.]
1. The route and the available bandwidth for the new message are determined.
2. The messages on the route are interrupted and their speeds are recalculated.
An interrupt mechanism similar to the one used for job execution simulation.
The initial speed of a message is determined by evaluating the bandwidth that each entity on the route can offer.
Different network protocols can be modelled.
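The two recalculations described above (initial speed as the minimum bandwidth offered along the route, and re-splitting a link's bandwidth when a transfer starts) can be sketched as follows. The fair-share split is a simplifying assumption of this illustration, not an exact protocol model.

```java
public class LinkShareSketch {
    // The speed a message gets is limited by the most loaded entity on its
    // route (LAN, router, WAN link, ...): take the minimum offered bandwidth.
    static double routeSpeed(double[] offeredBandwidthMbps) {
        double speed = Double.MAX_VALUE;
        for (double b : offeredBandwidthMbps) speed = Math.min(speed, b);
        return speed;
    }

    // Remaining time of a transfer after an interrupt, with the link's
    // bandwidth now split equally among the active transfers.
    static double remainingTime(double megabitsLeft, double linkMbps, int activeTransfers) {
        return megabitsLeft / (linkMbps / activeTransfers);
    }

    public static void main(String[] args) {
        // 400 Mb left on a 100 Mbps link: alone, the transfer needs 4 s ...
        System.out.println(remainingTime(400, 100, 1));
        // ... but when a second transfer starts, both slow to 50 Mbps: 8 s.
        System.out.println(remainingTime(400, 100, 2));
        // Route LAN (1000) -> WAN (155) -> LAN (1000): the WAN limits, 155 Mbps.
        System.out.println(routeSpeed(new double[]{1000, 155, 1000}));
    }
}
```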
Job Scheduling and Execution
```java
class Activity1 extends Activity {
    …
    public void pushJobs() {
        …
        Job newJob = new Job(…);
        addJob(newJob);
        …
    }
    …
}

class Activity2 extends Activity {
    …
}
```
[Diagram: a farm with three CPUs running jobs at various CPU shares.]
1. The activity class creates a job and submits it to the farm.
2. The job scheduler sends the new job to a CPU unit; all the jobs executing on that CPU are interrupted.
3. CPU power is reallocated on the unit where the new job was scheduled; the interrupted jobs re-estimate their completion times.
[After step 3, Job 6, Job 7 and the new job each run at 33% CPU.]
Output of the simulation
[Diagram: simulation engine components (node, DB, router, user) emit results through output listener filters to log files, Excel and graphics clients.]
Any component in the system can generate generic result objects. Any client can subscribe with a filter and will receive only the results it is interested in. The structure is very similar to that of MonALISA; we will soon integrate the output of the simulation framework into MonALISA.
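The subscribe-with-a-filter scheme is essentially a publish/subscribe dispatcher. A minimal sketch, with illustrative names rather than the framework's real classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OutputSketch {
    record Result(String source, String name, double value) {}

    // Any component can publish generic result objects; each client
    // subscribes with a filter and only receives the matching results.
    static class OutputDispatcher {
        record Listener(Predicate<Result> filter, List<Result> received) {}
        final List<Listener> listeners = new ArrayList<>();

        List<Result> subscribe(Predicate<Result> filter) {
            List<Result> inbox = new ArrayList<>();
            listeners.add(new Listener(filter, inbox));
            return inbox;
        }

        void publish(Result r) {
            for (Listener l : listeners)
                if (l.filter().test(r)) l.received().add(r);
        }
    }

    public static void main(String[] args) {
        OutputDispatcher d = new OutputDispatcher();
        List<Result> cpuOnly = d.subscribe(r -> r.name().equals("cpu_load"));
        d.publish(new Result("Node1", "cpu_load", 0.7));
        d.publish(new Result("Router", "queue_len", 12));
        System.out.println(cpuOnly.size()); // only the cpu_load result was delivered
    }
}
```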
Specific Components
Specific Components
These components should be derived from the basic components and must implement their specific characteristics and the way they operate.
Major parts:
• Data model
• Data flow diagrams for production and, especially, for analysis jobs
• Scheduling / pre-allocation policies
• Data replication strategies
Data Model
[Diagram: a generic data container (attributes: size, event type, event range, access count, instance) can live in a plain file, a database or a custom data server, served over the network by FTP, NFS or database server nodes; a metadata catalog and a replication catalog track the containers, with export/import between sites.]
Data Model (2)
[Diagram: a data processing job issues a data request; the metadata / replication catalog resolves it to one of several candidate data containers, producing a list of I/O transactions for the job.]
Database Functionality
Client–server model.
Automatic storage management is possible, with data being sent to mass storage units.
Three kinds of requests for the database server:
• write
• read
• get (read the data and erase it from the server)
Automatic storage management example:
1. A job wants to write a container into database DB1, but the server is out of storage space.
2. The least frequently used container is moved to a mass storage unit, and the new container is written to the database.
[Diagram: a database server holding DB1 and DB2 with several DContainers; on writeData(), a container is migrated to one of the mass storage units.]
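The automatic storage management described above (write requests that push the least frequently used container out to mass storage) can be sketched like this; the class and method names are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DbStorageSketch {
    final int capacity;
    int used = 0;
    final Map<String, Integer> containers = new LinkedHashMap<>(); // name -> size
    final Map<String, Integer> accessCount = new HashMap<>();
    final List<String> massStorage = new ArrayList<>();            // evicted containers

    DbStorageSketch(int capacity) { this.capacity = capacity; }

    void read(String name) { accessCount.merge(name, 1, Integer::sum); }

    // write request: evict least-frequently-used containers until the new one fits.
    void write(String name, int size) {
        while (used + size > capacity && !containers.isEmpty()) {
            String lfu = Collections.min(containers.keySet(),
                Comparator.comparingInt((String k) -> accessCount.getOrDefault(k, 0)));
            used -= containers.remove(lfu);
            massStorage.add(lfu);            // container migrates to mass storage
        }
        containers.put(name, size);
        used += size;
        accessCount.put(name, 0);
    }

    public static void main(String[] args) {
        DbStorageSketch db = new DbStorageSketch(100);
        db.write("DContainer1", 60);
        db.write("DContainer2", 40);
        db.read("DContainer2");              // DContainer1 is now least used
        db.write("DContainer3", 30);         // out of space: forces an eviction
        System.out.println(db.massStorage);  // [DContainer1]
    }
}
```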
Data Flow Diagrams for JOBS
[Diagram: a job as a data-flow graph of processing steps (Processing 1 feeds Processing 2 and Processing 3, which feed Processing 4), each step with its own inputs and outputs; one branch is executed 10×.]
Input and output are collections of data, described by type and range. A process is described by its name. A fine-granularity decomposition into processes that can be executed independently, together with the way they communicate, can be very useful for optimisation and parallel execution!
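Executing such a data-flow graph means running each processing step once all of its inputs are ready; a plain topological-order executor sketches the idea. The dependency map below is an illustration following the diagram, not MONARC code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DataFlowSketch {
    // Each processing step lists the steps whose output it consumes; steps
    // with no unfinished dependencies are ready (and could run in parallel).
    static List<String> executionOrder(Map<String, List<String>> deps) {
        Map<String, Integer> pending = new HashMap<>();
        deps.forEach((step, in) -> pending.put(step, in.size()));
        Deque<String> ready = new ArrayDeque<>();
        pending.forEach((s, n) -> { if (n == 0) ready.add(s); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String s = ready.poll();
            order.add(s);
            // A finished step may unblock the steps that consume its output.
            deps.forEach((t, in) -> {
                if (in.contains(s) && pending.merge(t, -1, Integer::sum) == 0) ready.add(t);
            });
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("Processing1", List.of());
        deps.put("Processing2", List.of("Processing1"));
        deps.put("Processing3", List.of("Processing1"));
        deps.put("Processing4", List.of("Processing2", "Processing3"));
        System.out.println(executionOrder(deps));
    }
}
```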
Job Scheduling: Centralized Scheme
[Diagram: two sites (A and B), each a CPU farm with a local job scheduler, coordinated by a global job scheduler implemented as a dynamically loadable module.]
Job Scheduling: Distributed Scheme – Market Model
[Diagram: each site's job scheduler sends a request to the other sites, receives their cost estimates, and takes the scheduling decision locally (market model).]
Computing Models
Activities: Arrival Patterns
A flexible mechanism to define the stochastic process of how users perform data processing tasks.
Dynamic loading of "Activity" tasks, which are threaded objects controlled by the simulation scheduling mechanism.
Physics activities inject "jobs": each "Activity" thread generates data processing jobs.
```java
for (int k = 0; k < jobs_per_group; k++) {
    Job job = new Job(this, Job.ANALYSIS, "TAG", 1, events_to_process);
    farm.addJob(job);  // submit the job
    sim_hold(1000);    // wait 1000 s
}
```
[Diagram: Activity objects inject Jobs into a regional centre farm.]
These dynamic objects are used to model the users' behaviour.
Regional Centre Model
[Diagram: a regional centre modelled as a complex composite object: a farm of CPUs, each with a LinkPort and running AJobs; database servers with a DB index, each with a LinkPort; a job scheduler fed by Activity objects injecting jobs; all connected by a LAN, with WAN links in a simplified topology of centres A–E.]
MONARC - Main Classes
[Class diagram; the grouping below is approximate:]
• Engine: Task, Event, EventQueue, Pool, WorkerThread
• Network: NetworkEntity (LAN, WAN, LinkPort), Message (TCPMessage, UDPMessage), Protocol (TCPProtocol, UDPProtocol)
• Computing: AbstractCPUUnit, CPUUnit, CPUCluster, Farm, RegionalCenter, Activity, AJob
• Data: Database, DatabaseEntity, DatabaseServer, DatabaseIndex, MassStorage, DContainer
• Jobs and scheduling: Job, JobProcessData, JobDatabase, JobFTP, MetaJob, Scheduler, JobScheduler, DistribScheduler, QScheduler
Monitoring
Real Need for Flexible Monitoring Systems
It is important to measure and monitor the key applications in a well-defined test environment, and to extract the parameters we need for modelling.
Monitor the farms used today, try to understand how they work, and simulate such systems.
This requires a flexible monitoring system able to dynamically add new parameters and provide access to historical data.
Interfacing monitoring tools to obtain the parameters we need in simulations in a nearly automatic way.
MonALISA was designed and developed based on the experience with these simulation problems.
EXAMPLES
FTP and NFS clusters
[Diagram: an FTP (NFS) server and clients 1…n exchanging requests and events over the LAN.]
This example evaluates the performance of a local area network with a server and several worker stations. The server stores events used by the processing nodes.
NFS example: the server delivers the events concurrently, one by one, to the clients.
FTP example: the server sends a whole file of events in a single transfer.
FTP Cluster
• 50 CPU units × 2 jobs per unit
• 100 events per job, event size 1 MB
• LAN bandwidth 1 Gbps, server's effective bandwidth 60 Mbps
NFS Cluster
Distributed Scheduling
[Diagram: regional centres (CERN, Caltech, KEK, FNAL) exporting jobs to one another via export().]
• Job migration: when a regional centre is assigned too many jobs, it sends part of them to other centres with more free resources.
• A new job scheduler was implemented, which supports job migration, applying load-balancing criteria.
We tested different configurations, with 1, 2 and 4 regional centres and with different numbers of CPUs per regional centre. The number of jobs submitted is kept constant, with the job arrival rate varying during the day.
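The market-model migration decision can be sketched as follows; the cost formula is an invented illustration, not MONARC's actual scheduler logic:

```java
public class MigrationSketch {
    // Rough queueing-cost estimate for one centre: expected wait (in hours)
    // for a new job, given the queue length, CPU count and mean job time.
    static double queueCost(int queuedJobs, int cpus, double meanJobTimeH) {
        return (queuedJobs + 1) * meanJobTimeH / cpus;
    }

    // Each centre answers the request with a cost estimate; the job goes to
    // the cheapest centre overall, counting the data-transfer penalty.
    static int chooseCenter(double[] queueCost, double[] transferCostH) {
        int best = 0;
        for (int i = 1; i < queueCost.length; i++)
            if (queueCost[i] + transferCostH[i] < queueCost[best] + transferCostH[best])
                best = i;
        return best;
    }

    public static void main(String[] args) {
        // Index 0 = local centre (no transfer cost); 1 and 2 = remote centres.
        double[] queue = { queueCost(119, 20, 3.0),   // heavily loaded: 18 h
                           queueCost(19, 20, 3.0),    // lightly loaded: 3 h
                           queueCost(59, 20, 3.0) };  // 9 h
        double[] xfer  = { 0.0, 1.5, 6.0 };
        System.out.println("export to centre " + chooseCenter(queue, xfer));
    }
}
```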
Distributed Scheduling (2)
Average processing time and CPU usage for 1, 2, 4 and 6 centers.
Test case:
• 4 regional centers, 20 CPUs per center
• average job processing time 3 h, approx. 500 jobs per day submitted in a center
Distributed Scheduling (3)
• Similar to the previous example, but the jobs are more complex, involving network transfers.
• Centres connected in a chain configuration (chain WAN connection): CERN – Caltech – KEK – FNAL.
• Every job submitted to a regional centre needs an amount of data located in that centre.
• If the job is exported to another centre, would the benefits be great enough to compensate for the cost of the data transfer?
Distributed Scheduling (4)
The network transfers are more intense in the centres in the middle of the chain (like Caltech).
The average processing time increases significantly when reducing the bandwidth and the number of CPUs.
Distributed Scheduling (5)
Local Data Replication
Evaluates the performance improvements that can be obtained by replicating data.
We simulated a regional centre with a number of database servers, and four other centres hosting jobs that process the data on those database servers.
Better performance can be obtained if the data from the servers is replicated to the other regional centres.
Local Data Replication (2)
WAN Data Replication
[Diagram: two replica servers sharing a common link, with jobs at several satellite regional centres pulling data from both.]
• Similar to the previous example, but now with two central servers, each holding an equal amount of replicated data, and eight satellite regional centres hosting worker jobs.
• A worker job gets a number of events from one of the central regional centres (one event at a time) and processes them locally.
Workers choose the "best" server to get the data from, using a replication load-balancing service (which knows the load of the network and of the servers), versus: the server is chosen randomly.
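The load-balanced strategy can be sketched with a small simulation; the combined load score and the per-request load increment are assumptions of this illustration:

```java
import java.util.Arrays;

public class ReplicaChoiceSketch {
    // The load-balancing service picks the replica with the lowest combined
    // server + network-path load (the baseline strategy would pick at random).
    static int bestServer(double[] serverLoad, double[] linkLoad) {
        int best = 0;
        for (int i = 1; i < serverLoad.length; i++)
            if (serverLoad[i] + linkLoad[i] < serverLoad[best] + linkLoad[best])
                best = i;
        return best;
    }

    // Assign a stream of event requests; each one adds load to the chosen
    // server, so the balanced strategy keeps the replicas evenly used.
    static int[] assign(int requests, double[] serverLoad, double[] linkLoad,
                        double loadPerRequest) {
        int[] count = new int[serverLoad.length];
        for (int r = 0; r < requests; r++) {
            int s = bestServer(serverLoad, linkLoad);
            serverLoad[s] += loadPerRequest;
            count[s]++;
        }
        return count;
    }

    public static void main(String[] args) {
        double[] srv  = { 9, 4 };   // jobs currently served by each replica
        double[] link = { 0, 0 };   // network-path load (equal links here)
        // The lightly loaded replica absorbs most of the new requests.
        System.out.println(Arrays.toString(assign(10, srv, link, 1.0))); // [3, 7]
    }
}
```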
WAN Data Replication
Two cases were compared: both servers have the same bandwidth and support the same maximum load; or one server has half of the other's bandwidth and supports half of its maximum load.
The average response time is better, and the total execution time smaller, when decisions are taken based on load balancing.
Summary
Modelling and understanding current systems, their performance and limitations, is essential for the design of large-scale distributed processing systems. This will require continuous iteration between modelling and monitoring.
Simulation and modelling tools must provide the functionality to help design complex systems and to evaluate different strategies and algorithms for the decision-making units and for data flow management.
http://monarc.cacr.caltech.edu/