MONARC Simulation Framework
Corina Stratan, Ciprian Dobre (UPB)
Iosif Legrand, Harvey Newman (CALTECH)
December 2003
I.C. Legrand 2
The GOALS of the Simulation Framework
The aim of this work is to continue and improve the development of the MONARC simulation framework:
To perform realistic simulation and modelling of large scale distributed computing systems, customised for specific HEP applications.
To offer a dynamic and flexible simulation environment to be used as a design tool for large distributed systems
To provide a design framework to evaluate the performance of a range of possible computer systems, as measured by their ability to provide the physicists with the requested data in the required time, and to optimise the cost.
A Global View for Modelling
[Diagram: layered architecture. The Simulation Engine is the base; Basic Components (CPU, DB, LAN/WAN, Job, Scheduler, Catalog) are built on it; Specific Components (Analysis, Distributed Scheduler, MetaData, Jobs) sit above; Computing Models are at the top. Monitoring of real systems and testbeds validates the models.]
Design Considerations
This simulation framework is not intended to be a detailed simulator for basic components such as operating systems, database servers or routers. Instead, based on realistic mathematical models and on parameters measured on testbed systems for all the basic components, it aims to correctly describe the performance and limitations of large distributed systems with complex interactions.
Simulation Engine
Design Considerations of the Simulation Engine
A process oriented approach for discrete event simulation is well suited to describe concurrent running programs.
“Active objects” (having an execution thread, a program counter, stack...) provide an easy way to map the structure of a set of distributed running programs into the simulation environment.
The simulation engine supports an "interrupt" scheme. This allows effective and correct simulation of concurrent processes with very different time scales, by using a DES approach with a continuous process flow between events.
The Simulation Engine – Tasks and Events
[State diagram: a task moves Created → Ready (assigned to a worker thread) → Running (semaphore.v()); Running → Waiting (semaphore.p()) and back to Ready when the awaited event happens or the sleeping period is over; Running → Finished.]
Task – simulates an entity with time-dependent behavior (active object, server, …).
A task can be in one of five states: CREATED, READY, RUNNING, WAITING, FINISHED.
Each task maintains an internal semaphore used for switching between states.
Event - used for communication and synchronization between tasks: when a task must notify another task about something that happened or will happen in the future, it creates an event addressed to that task.
The events are queued and sent to the destination tasks by the engine’s scheduler.
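A minimal sketch of how such a semaphore-based task/event scheme can fit together. The class names and the scheduler loop are illustrative assumptions, not MONARC's actual API: each task owns a semaphore, and the engine's scheduler pops queued events in time order and wakes the addressed task.

```java
import java.util.PriorityQueue;
import java.util.concurrent.Semaphore;

public class EngineSketch {
    enum State { CREATED, READY, RUNNING, WAITING, FINISHED }

    static class SimEvent implements Comparable<SimEvent> {
        final double time; final Task target;
        SimEvent(double time, Task target) { this.time = time; this.target = target; }
        public int compareTo(SimEvent o) { return Double.compare(time, o.time); }
    }

    static class Task {
        State state = State.CREATED;
        final Semaphore sem = new Semaphore(0); // internal semaphore for state switches
        void deliver(SimEvent e) { state = State.READY; sem.release(); } // semaphore.v()
    }

    // Scheduler: pop events in time order and wake the destination tasks.
    static double run(PriorityQueue<SimEvent> queue) {
        double clock = 0;
        while (!queue.isEmpty()) {
            SimEvent e = queue.poll();
            clock = e.time;       // simulation time jumps to the event's timestamp
            e.target.deliver(e);
        }
        return clock;
    }

    public static void main(String[] args) {
        Task a = new Task(), b = new Task();
        PriorityQueue<SimEvent> q = new PriorityQueue<>();
        q.add(new SimEvent(5.0, b));
        q.add(new SimEvent(2.0, a));
        System.out.println("final clock = " + run(q)); // events processed in time order
    }
}
```

In the real engine the woken task runs on a worker thread until it blocks again; here the delivery simply flips its state.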
Tests of the Engine
Processing a total of 100,000 simple jobs on 1, 10, 100, 1,000, 2,000, 4,000 and 10,000 CPUs, using the same number of parallel threads.
More tests: http://monarc.cacr.caltech.edu/
[Plot: execution time (s) versus number of threads (10 to 100,000), log–log scale, measured on three machines: 2×2.4 GHz Linux, 2×450 MHz Solaris, 2×3 GHz Windows.]
Basic Components
Basic Components
These basic components can simulate the core functionality of general distributed computing systems. They are built on the simulation engine and make efficient use of the interrupt functionality implemented for active objects.
These components should be considered the base classes from which specific components can be derived and constructed.
Basic Components
• Computing nodes
• Network links and routers, I/O protocols
• Data containers
• Servers: database servers; file servers (FTP, NFS, …)
• Jobs: processing jobs; FTP jobs
• Scripts & graph execution schemes
• Basic scheduler
• Activities (a time sequence of jobs)
Multitasking Processing Model
Concurrent running tasks share resources (CPU, memory, I/O)
"Interrupt"-driven scheme: for each new task, or when a task finishes, an interrupt is generated and all "processing times" are recomputed.
It provides:
• handling of concurrent jobs with different priorities;
• an efficient mechanism to simulate multitask processing;
• an easy way to apply different load-balancing schemes.
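The interrupt-driven recomputation above can be sketched in a few lines. An equal CPU share per job is assumed here for simplicity; the class names are illustrative, and the priority handling mentioned above is omitted.

```java
import java.util.ArrayList;
import java.util.List;

public class CpuShareSketch {
    static class Job { double workLeft; Job(double w) { workLeft = w; } }

    // Equal-share model: on every "interrupt" (a job arrives or finishes),
    // each job's estimated completion time is recomputed under the new share.
    static List<Double> completionEstimates(List<Job> running, double cpuPower) {
        double share = cpuPower / running.size();   // power per job after the interrupt
        List<Double> eta = new ArrayList<>();
        for (Job j : running) eta.add(j.workLeft / share);
        return eta;
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>(List.of(new Job(100), new Job(50)));
        System.out.println(completionEstimates(jobs, 10)); // [20.0, 10.0] at 5 units/job
        jobs.add(new Job(30));                             // new job triggers an interrupt
        System.out.println(completionEstimates(jobs, 10)); // shares drop to 10/3
    }
}
```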
LAN/WAN Simulation Model
[Diagram: three LANs of nodes, each behind a router, connected through Internet links.]
"Interrupt"-driven simulation: for each new message an interrupt is created, and for all active transfers the speed and the estimated time to complete the transfer are recalculated.
Continuous flow between events! An efficient and realistic way to simulate concurrent transfers having different sizes / protocols.
Network model
Data traffic is simulated for both local and wide area networks.
A simulation at the packet level is practically impossible at this scale, so we adopted a larger-scale approach based on an "interrupt" mechanism.
Components of the network model
Network Entity:
• LAN, WAN, LinkPort
• main attribute: bandwidth
• keeps track of the messages that traverse it
Simulating the network transfers
[Diagram: a new message travels from a CPU LinkPort at CERN through the CERN LAN, CERN router and CERN WAN to the Caltech WAN, Caltech router and Caltech LAN; interrupts (INT) are raised for the messages already in transit.]
1. The route and the available bandwidth for the new message are determined.
2. The messages on the route are interrupted and their speeds are recalculated.
An interrupt mechanism similar to the one used for job execution simulation.
The initial speed of a message is determined by evaluating the bandwidth that each entity on the route can offer.
Different network protocols can be modelled.
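The two recalculations described above (initial speed as the minimum bandwidth offered along the route, and re-splitting a link's bandwidth when a transfer starts) can be sketched as follows. The fair-share split is a simplifying assumption of this illustration, not an exact protocol model.

```java
public class LinkShareSketch {
    // The speed a message gets is limited by the most loaded entity on its
    // route (LAN, router, WAN link, ...): take the minimum offered bandwidth.
    static double routeSpeed(double[] offeredBandwidthMbps) {
        double speed = Double.MAX_VALUE;
        for (double b : offeredBandwidthMbps) speed = Math.min(speed, b);
        return speed;
    }

    // Remaining time of a transfer after an interrupt, with the link's
    // bandwidth now split equally among the active transfers.
    static double remainingTime(double megabitsLeft, double linkMbps, int activeTransfers) {
        return megabitsLeft / (linkMbps / activeTransfers);
    }

    public static void main(String[] args) {
        // 400 Mb left on a 100 Mbps link: alone, the transfer needs 4 s ...
        System.out.println(remainingTime(400, 100, 1));
        // ... but when a second transfer starts, both slow to 50 Mbps: 8 s.
        System.out.println(remainingTime(400, 100, 2));
        // Route LAN (1000) -> WAN (155) -> LAN (1000): the WAN limits, 155 Mbps.
        System.out.println(routeSpeed(new double[]{1000, 155, 1000}));
    }
}
```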
Job Scheduling and Execution
```java
class Activity1 extends Activity {
    …
    public void pushJobs() {
        …
        Job newJob = new Job(…);
        addJob(newJob);
        …
    }
    …
}

class Activity2 extends Activity {
    …
}
```
[Diagram: a farm with three CPUs running jobs at various CPU shares.]
1. The activity class creates a job and submits it to the farm.
2. The job scheduler sends the new job to a CPU unit; all the jobs executing on that CPU are interrupted.
3. CPU power is reallocated on the unit where the new job was scheduled; the interrupted jobs re-estimate their completion times.
[After step 3, Job 6, Job 7 and the new job each run at 33% CPU.]
Output of the simulation
[Diagram: simulation engine components (node, DB, router, user) emit results through output listener filters to log files, Excel and graphics clients.]
Any component in the system can generate generic result objects. Any client can subscribe with a filter and will receive only the results it is interested in. The structure is very similar to that of MonALISA; we will soon integrate the output of the simulation framework into MonALISA.
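The subscribe-with-a-filter scheme is essentially a publish/subscribe dispatcher. A minimal sketch, with illustrative names rather than the framework's real classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class OutputSketch {
    record Result(String source, String name, double value) {}

    // Any component can publish generic result objects; each client
    // subscribes with a filter and only receives the matching results.
    static class OutputDispatcher {
        record Listener(Predicate<Result> filter, List<Result> received) {}
        final List<Listener> listeners = new ArrayList<>();

        List<Result> subscribe(Predicate<Result> filter) {
            List<Result> inbox = new ArrayList<>();
            listeners.add(new Listener(filter, inbox));
            return inbox;
        }

        void publish(Result r) {
            for (Listener l : listeners)
                if (l.filter().test(r)) l.received().add(r);
        }
    }

    public static void main(String[] args) {
        OutputDispatcher d = new OutputDispatcher();
        List<Result> cpuOnly = d.subscribe(r -> r.name().equals("cpu_load"));
        d.publish(new Result("Node1", "cpu_load", 0.7));
        d.publish(new Result("Router", "queue_len", 12));
        System.out.println(cpuOnly.size()); // only the cpu_load result was delivered
    }
}
```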
Specific Components
Specific Components
These components should be derived from the basic components and must implement their specific characteristics and the way they operate.
Major parts:
• Data model
• Data flow diagrams for production and, especially, for analysis jobs
• Scheduling / pre-allocation policies
• Data replication strategies
Data Model
[Diagram: a generic data container (attributes: size, event type, event range, access count, instance) can live in a plain file, a database or a custom data server, served over the network by FTP, NFS or database server nodes; a metadata catalog and a replication catalog track the containers, with export/import between sites.]
Data Model (2)
[Diagram: a data processing job issues a data request; the metadata / replication catalog resolves it to one of several candidate data containers, producing a list of I/O transactions for the job.]
Database Functionality
Client–server model.
Automatic storage management is possible, with data being sent to mass storage units.
Three kinds of requests for the database server:
• write
• read
• get (read the data and erase it from the server)
Automatic storage management example:
1. A job wants to write a container into database DB1, but the server is out of storage space.
2. The least frequently used container is moved to a mass storage unit, and the new container is written to the database.
[Diagram: a database server holding DB1 and DB2 with several DContainers; on writeData(), a container is migrated to one of the mass storage units.]
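The automatic storage management described above (write requests that push the least frequently used container out to mass storage) can be sketched like this; the class and method names are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DbStorageSketch {
    final int capacity;
    int used = 0;
    final Map<String, Integer> containers = new LinkedHashMap<>(); // name -> size
    final Map<String, Integer> accessCount = new HashMap<>();
    final List<String> massStorage = new ArrayList<>();            // evicted containers

    DbStorageSketch(int capacity) { this.capacity = capacity; }

    void read(String name) { accessCount.merge(name, 1, Integer::sum); }

    // write request: evict least-frequently-used containers until the new one fits.
    void write(String name, int size) {
        while (used + size > capacity && !containers.isEmpty()) {
            String lfu = Collections.min(containers.keySet(),
                Comparator.comparingInt((String k) -> accessCount.getOrDefault(k, 0)));
            used -= containers.remove(lfu);
            massStorage.add(lfu);            // container migrates to mass storage
        }
        containers.put(name, size);
        used += size;
        accessCount.put(name, 0);
    }

    public static void main(String[] args) {
        DbStorageSketch db = new DbStorageSketch(100);
        db.write("DContainer1", 60);
        db.write("DContainer2", 40);
        db.read("DContainer2");              // DContainer1 is now least used
        db.write("DContainer3", 30);         // out of space: forces an eviction
        System.out.println(db.massStorage);  // [DContainer1]
    }
}
```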
Data Flow Diagrams for JOBS
[Diagram: a job as a data-flow graph of processing steps (Processing 1 feeds Processing 2 and Processing 3, which feed Processing 4), each step with its own inputs and outputs; one branch is executed 10×.]
Input and output are collections of data, described by type and range. A process is described by its name. A fine-granularity decomposition into processes that can be executed independently, together with the way they communicate, can be very useful for optimisation and parallel execution!
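Executing such a data-flow graph means running each processing step once all of its inputs are ready; a plain topological-order executor sketches the idea. The dependency map below is an illustration following the diagram, not MONARC code:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DataFlowSketch {
    // Each processing step lists the steps whose output it consumes; steps
    // with no unfinished dependencies are ready (and could run in parallel).
    static List<String> executionOrder(Map<String, List<String>> deps) {
        Map<String, Integer> pending = new HashMap<>();
        deps.forEach((step, in) -> pending.put(step, in.size()));
        Deque<String> ready = new ArrayDeque<>();
        pending.forEach((s, n) -> { if (n == 0) ready.add(s); });
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String s = ready.poll();
            order.add(s);
            // A finished step may unblock the steps that consume its output.
            deps.forEach((t, in) -> {
                if (in.contains(s) && pending.merge(t, -1, Integer::sum) == 0) ready.add(t);
            });
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("Processing1", List.of());
        deps.put("Processing2", List.of("Processing1"));
        deps.put("Processing3", List.of("Processing1"));
        deps.put("Processing4", List.of("Processing2", "Processing3"));
        System.out.println(executionOrder(deps));
    }
}
```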
Job Scheduling: Centralized Scheme
[Diagram: two sites (A and B), each a CPU farm with a local job scheduler, coordinated by a global job scheduler implemented as a dynamically loadable module.]
Job Scheduling: Distributed Scheme – Market Model
[Diagram: each site's job scheduler sends a request to the other sites, receives their cost estimates, and takes the scheduling decision locally (market model).]
Computing Models
Activities: Arrival Patterns
A flexible mechanism to define the stochastic process of how users perform data processing tasks.
Dynamic loading of "Activity" tasks, which are threaded objects controlled by the simulation scheduling mechanism.
Physics activities inject "jobs": each "Activity" thread generates data processing jobs.
```java
for (int k = 0; k < jobs_per_group; k++) {
    Job job = new Job(this, Job.ANALYSIS, "TAG", 1, events_to_process);
    farm.addJob(job);  // submit the job
    sim_hold(1000);    // wait 1000 s
}
```
[Diagram: Activity objects inject Jobs into a regional centre farm.]
These dynamic objects are used to model the users' behaviour.
Regional Centre Model
[Diagram: a regional centre modelled as a complex composite object: a farm of CPUs, each with a LinkPort and running AJobs; database servers with a DB index, each with a LinkPort; a job scheduler fed by Activity objects injecting jobs; all connected by a LAN, with WAN links in a simplified topology of centres A–E.]
MONARC - Main Classes
[Class diagram; the grouping below is approximate:]
• Engine: Task, Event, EventQueue, Pool, WorkerThread
• Network: NetworkEntity (LAN, WAN, LinkPort), Message (TCPMessage, UDPMessage), Protocol (TCPProtocol, UDPProtocol)
• Computing: AbstractCPUUnit, CPUUnit, CPUCluster, Farm, RegionalCenter, Activity, AJob
• Data: Database, DatabaseEntity, DatabaseServer, DatabaseIndex, MassStorage, DContainer
• Jobs and scheduling: Job, JobProcessData, JobDatabase, JobFTP, MetaJob, Scheduler, JobScheduler, DistribScheduler, QScheduler
Monitoring
Real Need for Flexible Monitoring Systems
It is important to measure and monitor the key applications in a well-defined test environment, and to extract the parameters we need for modelling.
Monitor the farms used today, try to understand how they work, and simulate such systems.
This requires a flexible monitoring system able to dynamically add new parameters and provide access to historical data.
Interfacing monitoring tools to obtain the parameters we need in simulations in a nearly automatic way.
MonALISA was designed and developed based on the experience with these simulation problems.
EXAMPLES
FTP and NFS clusters
[Diagram: an FTP (NFS) server and clients 1…n exchanging requests and events over the LAN.]
This example evaluates the performance of a local area network with a server and several worker stations. The server stores events used by the processing nodes.
NFS example: the server delivers the events concurrently, one by one, to the clients.
FTP example: the server sends a whole file of events in a single transfer.
FTP Cluster
• 50 CPU units × 2 jobs per unit
• 100 events per job, event size 1 MB
• LAN bandwidth 1 Gbps, server's effective bandwidth 60 Mbps
NFS Cluster
Distributed Scheduling
[Diagram: regional centres (CERN, Caltech, KEK, FNAL) exporting jobs to one another via export().]
• Job migration: when a regional centre is assigned too many jobs, it sends part of them to other centres with more free resources.
• A new job scheduler was implemented, which supports job migration, applying load-balancing criteria.
We tested different configurations, with 1, 2 and 4 regional centres and with different numbers of CPUs per regional centre. The number of jobs submitted is kept constant, with the job arrival rate varying during the day.
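The market-model migration decision can be sketched as follows; the cost formula is an invented illustration, not MONARC's actual scheduler logic:

```java
public class MigrationSketch {
    // Rough queueing-cost estimate for one centre: expected wait (in hours)
    // for a new job, given the queue length, CPU count and mean job time.
    static double queueCost(int queuedJobs, int cpus, double meanJobTimeH) {
        return (queuedJobs + 1) * meanJobTimeH / cpus;
    }

    // Each centre answers the request with a cost estimate; the job goes to
    // the cheapest centre overall, counting the data-transfer penalty.
    static int chooseCenter(double[] queueCost, double[] transferCostH) {
        int best = 0;
        for (int i = 1; i < queueCost.length; i++)
            if (queueCost[i] + transferCostH[i] < queueCost[best] + transferCostH[best])
                best = i;
        return best;
    }

    public static void main(String[] args) {
        // Index 0 = local centre (no transfer cost); 1 and 2 = remote centres.
        double[] queue = { queueCost(119, 20, 3.0),   // heavily loaded: 18 h
                           queueCost(19, 20, 3.0),    // lightly loaded: 3 h
                           queueCost(59, 20, 3.0) };  // 9 h
        double[] xfer  = { 0.0, 1.5, 6.0 };
        System.out.println("export to centre " + chooseCenter(queue, xfer));
    }
}
```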
Distributed Scheduling (2)
Average processing time and CPU usage for 1, 2, 4 and 6 centers.
Test case:
• 4 regional centers, 20 CPUs per center
• average job processing time 3 h, approx. 500 jobs per day submitted in a center
Distributed Scheduling (3)
• Similar to the previous example, but the jobs are more complex, involving network transfers.
• Centres connected in a chain configuration (chain WAN connection): CERN – Caltech – KEK – FNAL.
• Every job submitted to a regional centre needs an amount of data located in that centre.
• If the job is exported to another centre, would the benefits be great enough to compensate for the cost of the data transfer?
Distributed Scheduling (4)
The network transfers are more intense in the centres in the middle of the chain (like Caltech).
The average processing time increases significantly when reducing the bandwidth and the number of CPUs.
Distributed Scheduling (5)
Local Data Replication
Evaluates the performance improvements that can be obtained by replicating data.
We simulated a regional centre with a number of database servers, and four other centres hosting jobs that process the data on those database servers.
Better performance can be obtained if the data from the servers is replicated to the other regional centres.
Local Data Replication (2)
WAN Data Replication
[Diagram: two replica servers sharing a common link, with jobs at several satellite regional centres pulling data from both.]
• Similar to the previous example, but now with two central servers, each holding an equal amount of replicated data, and eight satellite regional centres hosting worker jobs.
• A worker job gets a number of events from one of the central regional centres (one event at a time) and processes them locally.
Workers choose the "best" server to get the data from, using a replication load-balancing service (which knows the load of the network and of the servers), versus: the server is chosen randomly.
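The load-balanced strategy can be sketched with a small simulation; the combined load score and the per-request load increment are assumptions of this illustration:

```java
import java.util.Arrays;

public class ReplicaChoiceSketch {
    // The load-balancing service picks the replica with the lowest combined
    // server + network-path load (the baseline strategy would pick at random).
    static int bestServer(double[] serverLoad, double[] linkLoad) {
        int best = 0;
        for (int i = 1; i < serverLoad.length; i++)
            if (serverLoad[i] + linkLoad[i] < serverLoad[best] + linkLoad[best])
                best = i;
        return best;
    }

    // Assign a stream of event requests; each one adds load to the chosen
    // server, so the balanced strategy keeps the replicas evenly used.
    static int[] assign(int requests, double[] serverLoad, double[] linkLoad,
                        double loadPerRequest) {
        int[] count = new int[serverLoad.length];
        for (int r = 0; r < requests; r++) {
            int s = bestServer(serverLoad, linkLoad);
            serverLoad[s] += loadPerRequest;
            count[s]++;
        }
        return count;
    }

    public static void main(String[] args) {
        double[] srv  = { 9, 4 };   // jobs currently served by each replica
        double[] link = { 0, 0 };   // network-path load (equal links here)
        // The lightly loaded replica absorbs most of the new requests.
        System.out.println(Arrays.toString(assign(10, srv, link, 1.0))); // [3, 7]
    }
}
```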
WAN Data Replication
Two cases were compared: both servers have the same bandwidth and support the same maximum load; or one server has half of the other's bandwidth and supports half of its maximum load.
The average response time is better, and the total execution time smaller, when decisions are taken based on load balancing.
Summary
Modelling and understanding current systems, their performance and limitations, is essential for the design of large-scale distributed processing systems. This will require continuous iteration between modelling and monitoring.
Simulation and modelling tools must provide the functionality to help design complex systems and to evaluate different strategies and algorithms for the decision-making units and for data flow management.
http://monarc.cacr.caltech.edu/