The OptorSim Archive of Questions Asked
Caitriana Nicholson, March 2008
This is an edited archive of user questions submitted to the OptorSim mailing lists, with developers'
and other users' responses. It is intended as a resource for other users, who may not receive ready
responses from the original developers now that they have all moved on to other things.
Questions are in plain font and answers are in italics. Some editing of grammar and spelling has
been done, but not extensively – so don't blame the editor for those!
Contents:
Current State of the Project
Running OptorSim in Windows
Running OptorSim in MacOS
Configuration File Questions
Netbeans and OptorSim
Compilation Problems
Class File Documentation
Various: Initial replica placement, CEs and worker nodes, file pinning, access cost, job processing
Simulating Security Functions
Timing Model
Adding New Replication / Scheduling Strategies
Statistics Output
Resource Monitoring
State of the project
What is the current state of this simulator? Is it still being developed, and will there be any new versions?
The simulator is not being actively used by people within the EDG project (the project under which it was created). In
fact the EDG project finished a number of years ago. However, others are using and extending the code-base. The
project is maintained in a repository at SourceForge (http://sourceforge.net/projects/optorsim) and new developers are
welcome to join there, but the original developers are all working elsewhere now and no longer have time to make new
releases. Any new questions should be addressed to the mailing list at [email protected] where they
will be answered on a best-effort basis.
OptorSim in Windows
Windows Path Instructions for UserGuide
I am trying to learn about grid simulation tools, and am excited by OptorSim. However, I am stuck using a Windows
XP system, and I would recommend adding (on page 4 of the OptorSim v2.0 Installation and User Guide):
for Windows users:
My Computer > Properties > Advanced > Environment Variables, then highlight
the Path in the System Variables box, and click "Edit", and add to the end of
the path: %OptorSim2.0 Directory%\bin
where %OptorSim2.0 Directory% in my case was C:\optorsim2.0
Running OptorSim in Windows
I can't find anybody who knows how to run OptorSim in Windows, and I am not familiar with the Unix environment. Can
you tell me how to run OptorSim using Windows? The user guide, I think, focuses more on Unix.
Running OptorSim in Windows is pretty much the same as running in Unix. In the optorsim-2.0\bin directory there is a
Windows executable called OptorSim.bat. Start up a command prompt, go into the optorsim-2.0 directory and run
bin\OptorSim.bat. Edit the examples\parameters.conf file to set the parameters you want. There are instructions for
running in Windows in the user guide, on pages 4 and 5 - all other instructions are the same as for unix.
How to execute OptorSim Simulator in windows OS?
I downloaded the OptorSim simulator, but it is not working. I am running it under Windows. Whenever I
use OptorSim.bat, the following error appears:
Exception in thread "main"
java.lang.NoClassDefFoundError: org/edg/data/replication/optorsim/OptorSimMain
If you are using the OptorSim 2.0 downloaded from the website, and installed it according to the instructions in the
user guide, it should work. As Paul said, the classpath set in the OptorSim.bat file assumes you are running from within
the optorsim-2.0 directory; if you want to run it from a different directory, please modify the paths in the file so that it
can find lib/edg-optorsim.jar, etc, from wherever it is running.
OptorSim with MacOS
I would like to find out if the simulation tool "OptorSim" can be used on a Macintosh Operating System.
In principle, OptorSim can be used on any system that has Java. Getting OptorSim working for Macs involved getting
Java working.
You will need two parts: the build environment (java compiler and the build tool "ant") and the runtime environment
(JRE). The web pages http://developer.apple.com/java/ and http://www.pepsan.com/javamac/ seem to be good
places to start.
A few wrapper scripts are included with OptorSim (in the directory "bin"). These will probably not work for Macs, but
it should be fairly easy to develop Mac equivalents.
Configuration Files
CMS testbed topology
I want to do an evaluation about our strategy with a promising topology of Grid like "Grid topology for CMS world
wide data production challenge in spring 2002" introduced in a paper, "Evaluation of an Economy-Based File
Replication Strategy for a DataGrid". As an undesirable case, I can configure the topology, which might be undesirable
in my point. Can you help me to obtain the configuration files of "Grid topology for CMS world wide data production
challenge in spring 2002"?
The CMS testbed configuration files are included in the examples/ directory of OptorSim:
cms_testbed_grid.conf
cms_testbed_jobs.conf
cms_testbed_bandwidths.conf
Job probabilities
Hi, I'm trying to understand the extra examples that are on the web, and I don't know how the percentages are calculated. Do
you know where the percentages come from?
If I understand your question correctly, you're asking about the following part of the configuration file:
\begin{cescheduletable}
0 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0
3 jpsijob 0.14 highptlepjob 0.44 incelecjob 0.58 incmuonjob 0.72 highptphotjob 0.86 zbbbarjob 1.0
7 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0
#8 jpsijob 0.29 highptlepjob 0.43 incelecjob 0.57 incmuonjob 0.71 highptphotjob 0.86 zbbbarjob 1.0
11 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0
12 jpsijob 0.14 highptlepjob 0.28 incelecjob 0.58 incmuonjob 0.72 highptphotjob 0.86 zbbbarjob 1.0
13 jpsijob 0.14 highptlepjob 0.28 incelecjob 0.42 incmuonjob 0.72 highptphotjob 0.86 zbbbarjob 1.0
14 jpsijob 0.14 highptlepjob 0.28 incelecjob 0.42 incmuonjob 0.56 highptphotjob 0.7 zbbbarjob 1.0
15 jpsijob 0.3 highptlepjob 0.44 incelecjob 0.58 incmuonjob 0.72 highptphotjob 0.86 zbbbarjob 1.0
16 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0
17 jpsijob 0.14 highptlepjob 0.28 incelecjob 0.42 incmuonjob 0.56 highptphotjob 0.86 zbbbarjob 1.0
\end{cescheduletable}
The probabilities are cumulative. For example, the meaning of the row
0 jpsijob 0.17 highptlepjob 0.34 incelecjob 0.5 incmuonjob 0.67 highptphotjob 0.84 zbbbarjob 1.0
is that Computing Element 0 can run:
jpsijob with probability 0.17
highptlepjob with probability 0.34 - 0.17 = 0.17
incelecjob with probability 0.5 - 0.34 = 0.16
incmuonjob with probability 0.67 - 0.5 = 0.17
highptphotjob with probability 0.84 - 0.67 = 0.17
zbbbarjob with probability 1.0 - 0.84 = 0.16
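The conversion from cumulative values to per-job probabilities is just a subtraction of neighbouring entries. A minimal Java sketch of that arithmetic follows; the class and method names are made up for illustration and are not part of OptorSim.

```java
/** Illustrative helper (not an OptorSim class): turns a cumulative row into per-job probabilities. */
final class CumulativeRow {

    /** Each per-job probability is the cumulative value minus the previous cumulative value. */
    static double[] toIndividual(double[] cumulative) {
        double[] individual = new double[cumulative.length];
        double previous = 0.0;
        for (int i = 0; i < cumulative.length; i++) {
            individual[i] = cumulative[i] - previous;
            previous = cumulative[i];
        }
        return individual;
    }

    public static void main(String[] args) {
        // Job names and values taken from the row for Computing Element 0.
        String[] jobs = {"jpsijob", "highptlepjob", "incelecjob",
                         "incmuonjob", "highptphotjob", "zbbbarjob"};
        double[] cumulative = {0.17, 0.34, 0.5, 0.67, 0.84, 1.0};
        double[] individual = toIndividual(cumulative);
        for (int i = 0; i < jobs.length; i++) {
            System.out.printf("%s %.2f%n", jobs[i], individual[i]);
        }
    }
}
```

The last cumulative value in a row should be 1.0, so that the per-job probabilities sum to one.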
Job running
I have a question about OptorSim. In the job config file I defined ten jobs. In the parameter config file I declared
number.jobs = 100. Will it run each job 10 times? What is the relation with the job selection
probability?
It will run 100 jobs. The jobs it chooses will depend on the job selection probability which you define in the job
configuration file. If you have given all your 10 jobs the same probability, it will run each job 10 times (on average; it
will not be exact). If they have different probabilities, they will run a different number of times.
For example, suppose you have the following in your job configuration file:
\begin{jobselectionprobability}
jobA 0.5
jobB 0.25
jobC 0.15
[...]
jobJ 0.05
\end{jobselectionprobability}
Then, for 100 jobs, jobA would run about 50 times, jobB about 25 times, jobC about 15 times, and so on.
Based upon the selection probability, it will run each job that many times. My question is whether all jobs will execute
sequentially. That is, if job1 has to run 10 times, will the next job (job2) run only after job1 has run 10 times? Is it like that?
No, it is not like that. The jobs are chosen "at random", but weighted by their selection probability. A random number
between 0 and 1 is generated. This is then compared to the job probability; for example, if job1 has probability 0.5, if
the random number is less than 0.5 job1 is chosen to run and if it is bigger than 0.5 job1 is not chosen and another job
is considered. You can see the code for it in the randomJob() method of GridContainer.
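One common way to implement this kind of weighted choice is to walk through the jobs, accumulating their probabilities, and stop at the first job whose running total exceeds the random number. The Java sketch below shows that idea only; it is not the code of randomJob(), and the class name, method name and job list are assumptions made for illustration.

```java
import java.util.Random;

/** Illustrative weighted selection (not the OptorSim randomJob() implementation). */
final class WeightedJobPicker {

    /**
     * Returns the index of the chosen job for a random number r in [0, 1).
     * The probabilities are per-job values that should sum to 1.
     */
    static int pickIndex(double[] probabilities, double r) {
        double runningTotal = 0.0;
        for (int i = 0; i < probabilities.length; i++) {
            runningTotal += probabilities[i];
            if (r < runningTotal) {
                return i;
            }
        }
        // Only reached if rounding leaves the total slightly below 1.
        return probabilities.length - 1;
    }

    public static void main(String[] args) {
        String[] jobs = {"jobA", "jobB", "jobC"};      // hypothetical job names
        double[] probabilities = {0.5, 0.25, 0.25};    // hypothetical weights
        int[] counts = new int[jobs.length];
        Random random = new Random();
        for (int n = 0; n < 100; n++) {
            counts[pickIndex(probabilities, random.nextDouble())]++;
        }
        for (int i = 0; i < jobs.length; i++) {
            System.out.println(jobs[i] + " ran " + counts[i] + " times");
        }
    }
}
```

Because each draw is independent, the order of the jobs is interleaved and the counts only approximate the configured proportions.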
How do I calculate the number of files required for a particular job, and the file names? Is there any function for this
implemented in OptorSim? What is needed for getBestFile()? If the initial file distribution covers more than one site, will
only those sites be used? How is it related to access cost?
These are set in the job configuration file. In the jobtable, you define the set of files for a particular job. Then in the
filesetfraction table, you define the fraction of the total fileset which one job needs. So if you have 100 files defined in a
fileset for job type jobA, and have jobA filesetfraction set at 0.25, each individual instance of jobA will process 25 files.
getBestFile() takes an array of lfns (logical file names) and an array of the corresponding file fractions. Then for each
file in the array, it tries to replicate it according to the chosen replication strategy. Each Optimiser class therefore has
its own implementation of getBestFile().
If the initial file distribution is for only one site, only that site will be used for the first replications, clearly. If files are
on more than one site, all the sites will be considered as sources of replicas. The site that gives the lowest access cost
(or wins the auction in the economic model, or whatever your optimiser does) is chosen to replicate from.
kSI2000
What is the meaning of kSI(2000)?
SI2000 (SPECint2000, also known as CINT2000) is a standard benchmark for measuring CPU performance on
different machines; kSI2000 is the same unit with the kilo prefix. See http://www.spec.org/cpu/ for more information, e.g. results for different machines at
http://www.spec.org/cpu2000/results/res2007q1/
It is the unit that the LCG project uses to calculate its resource requirements.
CMS testbed grid
In cms_testbed_grid the number of sites is 27, but when running it shows only 19 sites. Why? Similarly, the
initial file distribution is at site 14. The site bandwidths in the grid conf file show that site 14 is connected to site 15 and
there is no other connection. How, then, are the files transferred to other sites for job processing?
Some of the sites are router sites, so they have no SE or CE; they just pass files through and do not
appear in the simulation output.
For cms_testbed_grid, site 14 is actually connected to both site 15 (Lyon) and site 23 (a router). Even if a site has only
one connection, files can be transferred to other sites as long as they are all connected in the network. Files can go
*through* other sites on the way between the source and destination sites.
Netbeans and OptorSim
I am trying to use NetBeans 4.0 (based on Apache Ant) for compiling modified code. Has anybody tried that? If yes,
could you tell me if any changes are needed? It seems that it does not see some of the packages and complains about
some packages, saying "package does not exist" at the import statements. It also shows a warning "deprecation: show()
in java.awt.Window has been deprecated" for this.show().
I think I remember someone having this problem before, and if I remember correctly, the solution was to put all the jar
files for the external packages together with the optorsim jar file in the lib/ directory. Or else all the jar files had to be
in the Netbeans working directory, or in the top optorsim directory... it was certainly something to do with not finding
the correct paths to all the libraries. Try playing around with the location of the jar files and see if it works.
Compilation Problems
Problem with Compiling OptorSimV2
I have a problem compiling the files with the ant command. I didn't add any code of my own; the problem just pops
up when I install and run it using ant.
Buildfile: build.xml
init:
prepare:
build:
[javac] Compiling 95 source files to /home/rony/optorsim2.0/build/classes
[javac] /home/rony/optorsim2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:68: error: Type `JTextArea' not found in declaration of field `tabTerm'.
[javac] private static JTextArea tabTerm;
[javac] ^
[javac] /home/rony/optorsim2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:892: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] /home/rony/optorsim2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:912: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] /home/rony/optorsim2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:931: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] /home/rony/optorsim2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:951: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] /home/rony/optorsim2.0/src/org/edg/data/replication/optorsim/OptorSimGUI.java:1101: error: Type `JPEGImageEncoder' not found in the declaration of the local variable `encoder'.
[javac] JPEGImageEncoder encoder = JPEGCodec.createJPEGEncoder(out);
[javac] ^
[javac] 6 errors
BUILD FAILED
file:/home/rony/optorsim2.0/build.xml:33: Compile failed; see the compiler error output for details.
Both JTextArea (part of Swing) and JPEGImageEncoder (part of the com.sun.image.codec.jpeg package) are included
in standard JREs (or SDKs) these days. For some reason, it looks like your version of Java doesn't have access to
them. Which version of Java are you using?
A problem building OptorSim with ant!
Several times I tried to re-build OptorSim with ant as described in the user guide, but the build fails
with the following error message:
Build failed
G:\optorsim-2.1\build.xml:74: execute failed: java.io.IOException:createprocess:bin\optorsimTests.sh error=193
It looks like you are running on a Windows machine, is that correct? From the section of output you sent, it also looks
like you were trying to run the functional test suite, i.e. the command:
ant func-test
I think this may be the problem - the functional tests, which this command runs, only work in a UNIX operating system.
Everything else should be fine on Windows, but we only wrote the functional tests for UNIX, as page 4 of
the user guide mentions. Building the source for OptorSim itself, just by running 'ant', should work.
Documentation - Class explanations
I want to use OptorSim for my PhD thesis simulation, how can I find an explanation for each class code ?? I need to
understand how each class works before I make any change.
After installing OptorSim, you can get the JavaDoc explanation of each class by doing:
ant doc
which will generate the documentation in html format in the doc/api directory of your optorsim installation directory.
Page 4 of the User Guide also outlines this procedure. You can then read these html files with your web browser. This
would be the best way for you to get an idea of how the different parts of the simulation work.
If you want to see more details of the code itself, please go into the src/ directory and open up the source files to read
them. The level of commenting in the code is somewhat variable, however.
Initial Replica Placement
I have also seen that in OptorSim the initial file and replica placement is made randomly using uniform distribution and
I want to know if I can change this by implementing my initial placement strategy?
To do this, you would have to change the assignFilesToSites() method in the class JobConfFileReader. You could just
extend this class and override the method in the subclass.
CEs and Worker Nodes
I am a little unclear about the number of worker nodes and computing elements. The user manual says that there is a
maximum of 1 CE per site. In the code of GridSite you have a vector for the CEs at the site (Vector
_computingElementCollection = new Vector();). Is this just for future work?
It was intended to further develop the model to allow more than 1 CE per site, so the CEs at each site were
implemented as a Vector, but the extension to having >1 CE/site was never actually implemented, so
yes, it is 'future work'.
I am also not clear about the function returning the total number of computing elements in the GridContainer class, is
this the total number in the whole Grid system?
Yes, it is the total number of CEs in the whole grid.
What is the need for worker nodes in OptorSim? Only one job at a time can be processed by a CE. How, then, are worker nodes
involved in job processing?
The time to process the job is divided by the number of worker nodes, so if there are more worker nodes in a CE it
processes the job faster. This is a very simple model; Antoine Vernois from Lyon developed a more sophisticated
model, but I don't think it's included in the release.
Access cost
I am trying to compute the cost of accessing a file if stored at a certain storage element or site. This is in terms of the
bandwidth not the number of hops.
The access cost is currently calculated as
(file size) / (available bandwidth)
This is in the NetworkClient class, so if you want to modify it that is where you should make the changes.
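As a rough illustration of that formula, the Java sketch below computes the cost for one file against several candidate source sites and picks the cheapest. The class and method names, and the choice of units, are assumptions for the example; this is not the NetworkClient code.

```java
/** Illustrative access-cost calculation (not the OptorSim NetworkClient class). */
final class AccessCostSketch {

    /** Cost = file size / available bandwidth; units here are assumed to be MB and MB/s. */
    static double cost(double fileSize, double availableBandwidth) {
        if (availableBandwidth <= 0.0) {
            return Double.POSITIVE_INFINITY;   // no usable route to this source
        }
        return fileSize / availableBandwidth;
    }

    /** Returns the index of the source with the lowest cost, or -1 if there are none. */
    static int cheapestSource(double fileSize, double[] bandwidthToEachSource) {
        int best = -1;
        double bestCost = Double.POSITIVE_INFINITY;
        for (int i = 0; i < bandwidthToEachSource.length; i++) {
            double c = cost(fileSize, bandwidthToEachSource[i]);
            if (best == -1 || c < bestCost) {
                best = i;
                bestCost = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] bandwidths = {10.0, 155.0, 45.0};   // hypothetical values
        int best = cheapestSource(1000.0, bandwidths);
        System.out.println("Cheapest source index: " + best
                + ", cost: " + cost(1000.0, bandwidths[best]));
    }
}
```

Since the file size is the same for every candidate, the cheapest source is simply the one reachable with the highest available bandwidth.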
Also, is there any function that returns the best route, that is, how do I figure out the maximum bandwidth available on the
route? I hope the question is clear, because I wrote it in a hurry.
The best route is calculated at the beginning of the simulation using a Dijkstra search algorithm - see the
GridConfFileReader and GridContainer classes for details. For each pair of grid sites, the best route between them,
based on the maximum bandwidth, is calculated.
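A Dijkstra-style search for the route with the highest bottleneck bandwidth (sometimes called a "widest path" search) can be sketched as below. This is a generic version written for illustration, assuming a symmetric bandwidth matrix in which 0 means "no direct link"; how OptorSim itself scores routes should be checked in the GridConfFileReader and GridContainer sources.

```java
/** Illustrative widest-path search (a generic sketch, not the OptorSim implementation). */
final class WidestPathSketch {

    /**
     * bandwidth[i][j] > 0 is the bandwidth of a direct link between sites i and j; 0 means no link.
     * Returns, for each site, the best achievable bottleneck bandwidth from the source
     * (0 if the site is unreachable, infinity for the source itself).
     */
    static double[] bottleneckFrom(int source, double[][] bandwidth) {
        int n = bandwidth.length;
        double[] best = new double[n];
        boolean[] done = new boolean[n];
        best[source] = Double.POSITIVE_INFINITY;
        for (int step = 0; step < n; step++) {
            // Pick the unfinished site with the largest bottleneck found so far.
            int u = -1;
            for (int i = 0; i < n; i++) {
                if (!done[i] && best[i] > 0.0 && (u == -1 || best[i] > best[u])) {
                    u = i;
                }
            }
            if (u == -1) {
                break;   // remaining sites are unreachable
            }
            done[u] = true;
            // A route through u is limited by the narrower of the route to u and the link u-v.
            for (int v = 0; v < n; v++) {
                if (done[v] || bandwidth[u][v] <= 0.0) {
                    continue;
                }
                double through = Math.min(best[u], bandwidth[u][v]);
                if (through > best[v]) {
                    best[v] = through;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical three-site network: the direct 0-2 link is slower than going via site 1.
        double[][] bandwidth = {
            {0, 10, 5},
            {10, 0, 100},
            {5, 100, 0}
        };
        double[] best = bottleneckFrom(0, bandwidth);
        for (int i = 1; i < best.length; i++) {
            System.out.println("site 0 -> site " + i + ": " + best[i]);
        }
    }
}
```

Router sites fit this picture naturally: they are ordinary nodes in the matrix that routes may pass through, even though they hold no files and run no jobs.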
File Pinning
What is the meaning of "pinned" (the pin status of a file)?
If a file is pinned, it can't be deleted from the SE until it is unpinned. This is so that if an Optimiser decides to replicate
a particular file, it can prevent it from being deleted until the replication is finished.
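The idea can be shown with a very small storage model in which a delete request is refused while a file is pinned. The Java sketch below is an illustration only: the class is hypothetical, and the use of a pin counter (so that several replications can pin the same file) is an assumption rather than a description of the OptorSim StorageElement classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative storage with pinning (hypothetical class, not an OptorSim StorageElement). */
final class PinnedStoreSketch {

    // Maps each stored logical file name to its current pin count.
    private final Map<String, Integer> pinCounts = new HashMap<>();

    void add(String lfn) {
        pinCounts.putIfAbsent(lfn, 0);
    }

    /** Pins a stored file; has no effect if the file is not stored here. */
    void pin(String lfn) {
        pinCounts.computeIfPresent(lfn, (name, count) -> count + 1);
    }

    /** Releases one pin; the count never goes below zero. */
    void unpin(String lfn) {
        pinCounts.computeIfPresent(lfn, (name, count) -> Math.max(0, count - 1));
    }

    /** Deletes the file only if it is stored and unpinned; returns whether it was deleted. */
    boolean delete(String lfn) {
        Integer count = pinCounts.get(lfn);
        if (count == null || count > 0) {
            return false;
        }
        pinCounts.remove(lfn);
        return true;
    }

    public static void main(String[] args) {
        PinnedStoreSketch se = new PinnedStoreSketch();
        se.add("file1");
        se.pin("file1");                                   // replication starts
        System.out.println("delete while pinned: " + se.delete("file1"));
        se.unpin("file1");                                 // replication finished
        System.out.println("delete after unpin: " + se.delete("file1"));
    }
}
```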
RB job processing
Users submit jobs to the RB, and the RB submits jobs to the CEs based upon the scheduling algorithm. OptorSim starts
processing jobs only after all the jobs have been submitted. Why?
Actually, the RB starts processing jobs as soon as the Users have started submitting them, so there can still be users
submitting jobs while the RB is processing them. If there is not a large number of jobs and it goes quickly, however, it
might *look* like the RB has not started until all the jobs have been submitted.
Simulating security functions
Can we simulate grid security functions by using OptorSim? The description of OptorSim at the DataGRID website
only describes data access optimization algorithm simulations. I wonder if it can be used for simulating the security
features.
Could you explain what kind of grid security functions you want to simulate? What level of detail are you looking at? It
is currently possible to simulate different site policies (of which jobs to accept) using OptorSim, but investigating
security in a more detailed way would require extending the code. As you say, it is designed for looking at
data replication algorithms, so the implementation of things like networking and security is quite high-level, though you
could of course modify the code to your own requirements if you wish.
We are currently working on pluggable security services. Initially, we are using the set of services defined in the
OGSA document. The idea behind this effort is to enable VO members to invoke the set of security services that
adapts to their requirements (rather than a 'standard' set of security services). To avoid any mismatch in the set of
services invoked by the various members of the same VO, we are working on conflict-management paradigms and
will require some mechanism to adequately simulate our propositions. Besides this aspect, we need to carry out a
number of other simulations, such as scalability, real-time invocation of these services, invocation by users as well as by
services, ... It is evident that our simulation requirements are quite different from the main intended use of OptorSim; I
don't know how much modification of its code would be required to suit us!
It looks like it would require substantial code modification to enable OptorSim to match your requirements. I would
suggest looking for some lower-level simulators, although you can download the code if you want to examine it more
closely.
Timing Model
OptorSim v. SimGrid
I am working on replication and caching optimization algorithms. I would like to know if OptorSim would be fast
enough. Some people claim that since it is written in Java it is not going to be as good as Simgrid (written in C). Can
you please comment and advise me if OptorSim will be efficient enough? Secondly, if my idea involves adding other
components to the simulator, is this doable; i.e., adding my own work and use it as part of the simulator?
[Antoine Vernois] To my mind, it's true that OptorSim is not as fast as Simgrid, but that is not due to the language. It's due to
the fact that Simgrid is event-driven, i.e. time is advanced in calculated steps, while OptorSim is a kind of real-time
simulator. So, for example, simulating a grid for 10 hours might take only 1 hour with Simgrid (but it depends on what you
simulate; it can also take 10 minutes, or 4 hours or more), while it will take 10 hours with Optorsim. Fortunately, there is a
scale factor in OptorSim that allows you to speed up the time (by dividing all sleep times by this factor).
[Editor's Note: The above was true for OptorSim 1.0 – in version 2.0 and above it also uses a more event-driven
model and no longer goes in real time (although the option to do so is still there).]
But I think that the choice of Simgrid or Optorsim should not be made on execution performance, but on the tools
they offer you. For example, I chose to use Optorsim (while the main part of Simgrid is developed in my lab :-)
because it includes all the mechanisms to manage data and their replication. It gives me routines to locate, retrieve and delete
data, and gives me quite a good estimate of access time. Moreover, the global architecture (following the EDG
architecture) is already implemented and is fine for my needs.
In Simgrid, you have all the tools to do that, but you have to do it yourself! You will have to implement the replica manager,
storage element and so on...
Another point to look at is the way bandwidth sharing is simulated. In OptorSim the model is quite simple but efficient
enough if you manage lots of transfers. But I think that improving this point is on the OptorSim developers' to-do list.
So, the choice of OptorSim or Simgrid mainly depends on your own needs. As a user of OptorSim with particular
requirements, I added lots of things to the OptorSim core to match my needs. It's quite easy, as the code is well
commented and quite easy to understand.
[David Cameron] As you say Antoine, the difference in speed between the two simulations is the time model used, not
the language. I think the main consideration for which simulation to use should be how easy it is to adapt for your
particular purpose. In my experience anyway most of the time is spent developing and testing simulation code rather
than getting results. Since you say you are interested in replication caching strategies I think OptorSim already has all
the features necessary for this and should be easy to expand and implement your own algorithms.
Of course since I am one of the developers of OptorSim maybe I am biased towards it ;) but I think Simgrid would involve
more work on your part, since OptorSim was designed to test replication strategies. As for adding your own code, the
license means you are free to do what you like to the code as long as you acknowledge the original authors and keep
the copyright headers in each java file.
Timing bug?
While using the simulator, I have encountered many strange bugs. For example, you claim that the time measured by
OptorSim does not depend on the computer on which it runs (the time you use does not depend on system time).
However, by doing several simulations, I found out that the times differ significantly (sometimes order of magnitude)
when run on different machines. I double checked the input parameters and the "time.advance" parameter is set to
"yes". Is this a known problem, or am I interpreting this parameter wrongly?
For the time.advance, you are right: it should be independent of the underlying CPU speed. Essentially, time (within
the simulation) is frozen whilst anything is being calculated. Once all simulation work is finished there is a step-wise
jump in time to the next time that something would "happen". This should be independent of actual CPU effort,
although the time taken to simulate any given grid configuration will obviously depend on the machine's CPU.
Just to be sure the parser has picked up the time.advance option, could you check whether the CPU utilisation stays
high (over 90%) for the duration of the simulation? If it doesn't, then it's using "linear" time and the discrepancy
would be expected. If the CPU utilisation does stay high, then there's a bug somewhere.
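The step-wise jump in simulated time can be pictured as a clock that keeps a queue of future wake-up times and, once all current work is done, jumps straight to the earliest one instead of sleeping. The Java sketch below shows only that idea; the class is hypothetical and much simpler than OptorSim's own time handling, which coordinates many simulation threads.

```java
import java.util.PriorityQueue;

/** Illustrative stepped clock (hypothetical class, not OptorSim's time model). */
final class SteppedClockSketch {

    // Pending wake-up times, earliest first.
    private final PriorityQueue<Double> wakeUps = new PriorityQueue<>();
    private double now = 0.0;

    /** Registers that something will "happen" after the given simulated delay. */
    void scheduleAfter(double delaySeconds) {
        wakeUps.add(now + delaySeconds);
    }

    /** Jumps to the next wake-up time; returns false when nothing is left to happen. */
    boolean advance() {
        Double next = wakeUps.poll();
        if (next == null) {
            return false;
        }
        now = next;
        return true;
    }

    double now() {
        return now;
    }

    public static void main(String[] args) {
        SteppedClockSketch clock = new SteppedClockSketch();
        clock.scheduleAfter(3600.0);   // e.g. a long job
        clock.scheduleAfter(10.0);     // e.g. a short file transfer
        while (clock.advance()) {
            System.out.println("simulated time is now " + clock.now() + " s");
        }
    }
}
```

With a clock like this the simulated time depends only on the scheduled delays, which is why the results should not vary with the speed of the host machine.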
Adding new replication or scheduling strategies
New replication strategy
My work is about initial replica placement in Grid. In fact, my goal is to place initially in the grid (when the file f is
created) a number R of replicas of the file f to improve fault tolerance. Unfortunately, our faculty doesn't have the
necessary infrastructure for my experiments. I am very interested by your simulator OptorSim. I have read the user
guide and I have some questions. Is it possible for me to add new replication strategies?
Yes; by extending the Optimiser classes appropriately, you can add your own replication strategy fairly easily. You
would probably also have to extend the StorageElement classes according to your strategy.
New scheduling algorithm
I am proposing a new scheduling and replication algorithm, and I am using optorsim2.1 for my project. How do I add my new
scheduling algorithm to OptorSim?
OK, so you mention two things: a scheduling algorithm and a replication algorithm. OptorSim's main focus is on replica
optimisation, so that is in a more advanced state; if I were you, I would start with that.
In the /src/org/edg/data/replication/optorsim/optor directory you should see the available optimisation algorithms. The
key thing is that they all implement the Optimisable interface: this is how the rest of the software interacts with the
replica optimisation strategy.
The replica optimisation algorithms already in OptorSim form a strong class-hierarchy. At the bottom is a skeleton
class (SkelOptor) that implements some very basic functionality. All the others extend either this class or some other
(abstract) class.
To create your own replica optimisation algorithm, you must create a new class in this directory. Your new class can
either implement the Optimisable interface directly or extend one of the existing classes; it just depends on how your
algorithm is going to work.
For job scheduling, have a look at the ResourceBrokerFactory class. This is a singleton class that is used to return a
singleton object, an object that implements the ResourceBroker interface. To begin with, you probably want to write a
new class that extends the skeleton implementation (SkelResourceBroker). Have a look at
RandomCEResourceBroker to get the idea.
Thanks for your suggestion. Following it, I am going to start by implementing my replication strategy
(Best Client). I have some questions about that, listed below.
1. In the job config file (cms_testbed_jobs.conf), the third row is the schedule table. It contains the sites and the jobs they
are willing to run. My question is: is it only those sites which will run jobs while executing the program?
Yes.
And the sites which are not specified will be idle at that time. Is it so?
Yes, indeed. This is a Grid paradigm. Although Grid computing is about providing access to a large computing facility,
a key aspect is that each site can choose which "virtual organisation" (VO) they wish to support. You can see this with
live WLCG data here: http://gridmap.cern.ch/gm/ Click on the different VOs and notice that some sites turn white,
indicating they don't support that VO (all sites support the OPS VO, though).
When someone submits a job (or, more likely, a series of jobs) to the Grid, they identify themselves as being a member
of a VO (in the HEP world ATLAS, CMS and LHCb are examples of VOs). This user can run their jobs, access and
store data (etc.) because of their membership of that VO.
Likewise, each site can choose which VOs they wish to support and how much they want to support them. So, a site may
choose to dedicate itself to a particular VO, whereas other sites may strongly favour one VO but allow
work from many other VOs.
OptorSim attempts to simulate this effect. The cescheduletable describes which jobs a site is willing to run. In reality,
a site does not decide job-by-job (instead, the decision is based on a number of factors, the most prominent being the
VO membership of the person submitting the job), but this should allow us to simulate real job-submission patterns.
And, yes, this might result in idle CPUs. But it's important to allow the sites their autonomy, and in practice this is
unlikely to be a problem. The computer hardware is bought to match predicted demand, so it's fairly unlikely that
computers will be idle.
2. Do I have to create the config files first and then start coding?
No, you should be able to use the existing files. The "simple" ones are a good place to start: simple_grid.conf and
simple_job.conf. Just update the parameters.conf file. You will need to add support for a new scheduler (i.e. number 5).
3. Shall I use the existing config files for my proposed algorithm or do I have to create my own?
Whichever you feel more comfortable with; either will work. Personally, I would copy the existing one and edit it. That
way you have a local copy that hasn't been altered, allowing you to see what you've changed.
4. When developing my own scheduling algorithm in OptorSim, do I have to change any existing code in
OptorSim, or can I just inherit from the existing classes?
Again, that depends on how you are going to implement your new class. Fundamentally, as long as your class
implements ResourceBroker (with correct semantics) it'll work.
However, there's a pretty useful skeleton class SkelResourceBroker that does much of the boring work, especially with
thread priorities. I would recommend writing the new RB that extends SkelResourceBroker. You then need only
implement the findCE() method.
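The pattern of a skeleton class that does the common work and leaves only the choice of CE to the subclass can be sketched as follows. Every type here is a stand-in defined for this example: none of them is an OptorSim class, and the real findCE() signature, arguments and return type must be taken from the SkelResourceBroker source rather than from this sketch. The "shortest queue" policy is likewise just an example.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in for a computing element (hypothetical; not OptorSim's ComputingElement). */
final class SketchCE {
    final String name;
    final int queuedJobs;

    SketchCE(String name, int queuedJobs) {
        this.name = name;
        this.queuedJobs = queuedJobs;
    }
}

/** Stand-in skeleton broker: the shared submission logic lives here. */
abstract class SketchSkelBroker {

    /** Common work: ask the subclass for a CE and report where the job goes. */
    final String submit(String jobName, List<SketchCE> candidates) {
        SketchCE chosen = findCE(jobName, candidates);
        return chosen == null ? null : chosen.name;
    }

    /** The only method a new scheduling strategy has to supply. */
    protected abstract SketchCE findCE(String jobName, List<SketchCE> candidates);
}

/** Example strategy: send each job to the CE with the fewest queued jobs. */
final class ShortestQueueBrokerSketch extends SketchSkelBroker {

    @Override
    protected SketchCE findCE(String jobName, List<SketchCE> candidates) {
        SketchCE best = null;
        for (SketchCE ce : candidates) {
            if (best == null || ce.queuedJobs < best.queuedJobs) {
                best = ce;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<SketchCE> ces = Arrays.asList(new SketchCE("siteA", 3), new SketchCE("siteB", 1));
        System.out.println("job goes to " + new ShortestQueueBrokerSketch().submit("jobA", ces));
    }
}
```

The benefit of the skeleton approach is that a new strategy stays small, and bookkeeping such as thread priorities is inherited instead of rewritten.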
5. I am going to implement the Best Client replication strategy, in which the best client is the site from which the most requests come. I have
to replicate the files to that site. From where, and how, can I get the required data?
You haven't said how you plan to decide *when* to replicate a file: you'll need to limit this somehow, otherwise a site
can (potentially) attempt to transfer so many files that none of the transfers will proceed. Also, the simulator was
designed so the SEs pulled files they wanted, rather than an external agent pushing them. You can, of course,
implement a coordinating agent that decides which files to replicate.
In general, this is one of the problems with distributed computing: how to collect information from many sites without
introducing a single-point-of-failure or performance bottleneck. In the simulator it is easy to cheat: one has
(potentially) complete access to any object's information and the cost of accessing this information is small (it's all
within the computer's memory). You could record the access patterns locally (on the CE, for example) and provide a
method for accessing those values.
However, in real life, it becomes more complex. There is a very large number of files being stored, with jobs
requesting them in complex patterns. One cannot record all file requests centrally as they happen far too frequently.
Registering the files would become a bottleneck and single-point-of-failure.
The solution used in the simulator is to hold an auction for each file request over a peer-to-peer (P2P) network. Sites
can choose to participate in an auction (they do by default, but they can time-out without affecting the process). The
auction has two purposes: it selects the best available copy of a file and it also allows "nearby" sites to know that a
particular file was requested. The second purpose allows a decentralised knowledge of file requests without imposing
a heavy-weight solution (such as registering each file request).
The site itself will initiate the transfer, so it can (potentially) know the access patterns and how "hot" any particular
file that isn't held locally is. You will need to figure out how it can determine whether file X is "sufficiently hot" that it is
worth replicating it to the local storage (the site's SE).
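One simple way to decide whether a file is "sufficiently hot" is to count the requests a site sees for it and to trigger
a replication once a threshold is reached. The sketch below is a stand-alone illustration under that assumption and is
not OptorSim code: the class HotFileTracker, its methods and the threshold value are all hypothetical names chosen for
the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not an OptorSim class: counts requests per logical file name (LFN). */
class HotFileTracker {
    private final Map<String, Integer> requestCounts = new HashMap<>();
    private final int threshold;

    HotFileTracker(int threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.threshold = threshold;
    }

    /**
     * Records one request for the file. Returns true only on the request that
     * reaches the threshold, so each file triggers at most one replication.
     */
    boolean recordRequest(String lfn) {
        int count = requestCounts.merge(lfn, 1, Integer::sum);
        return count == threshold;
    }

    /** Number of requests seen so far for the file (0 if never requested). */
    int requestsFor(String lfn) {
        return requestCounts.getOrDefault(lfn, 0);
    }
}

class HotFileTrackerDemo {
    public static void main(String[] args) {
        HotFileTracker tracker = new HotFileTracker(3); // threshold chosen arbitrarily
        for (int i = 1; i <= 4; i++) {
            boolean replicate = tracker.recordRequest("file1");
            System.out.println("request " + i + " -> replicate now? " + replicate);
        }
    }
}
```

Returning true only once per file is one way of limiting the number of transfers a site starts, which is the concern
raised at the start of this answer. A real strategy would probably also need to forget old requests over time.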
I am expecting your reply for the following three questions mentioned below:
(i) Could you please explain how the two scheduling algorithms (Access Cost and Queue Access Cost) work, i.e. how
they are executed in OptorSim?
You will find the implementation of the Access Cost algorithm in the AccessCostResourceBroker class, and Queue
Access Cost in the CombinedCostResourceBroker class. Both of them extend SkelResourceBroker with a different
findCE method. So for your algorithm, you should write a new class (e.g. QueueExecutionTimeResourceBroker) which
also extends SkelResourceBroker with a different findCE method.
AccessCostResourceBroker, when it is given a job to allocate, iterates through all the available CEs. First it checks
whether the CE will accept that job and whether it has space in the job handler queue. If these are ok, it then calls the
getAccessCost method of the optimiser. This calculates the cost of accessing the files, depending on the optimiser
method selected (e.g. LFU, LRU, economic model). The CE which has the lowest access cost for the job is then selected
and the job sent there.
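The selection loop described above can be sketched as follows. This is a simplified illustration only: CandidateSite,
FixedSite and LowestAccessCostSelector are stand-in names invented for the example, and the real
AccessCostResourceBroker, ComputingElement and optimiser classes in OptorSim have their own signatures, which are
not reproduced here.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-in for a computing element; OptorSim's real classes have different signatures. */
interface CandidateSite {
    String name();
    boolean acceptsJob();     // would this CE run the job at all?
    boolean hasQueueSpace();  // is there room in its job handler queue?
    double accessCost();      // estimated cost of accessing the job's files from here
}

/** Fixed-value implementation used only for the demonstration. */
class FixedSite implements CandidateSite {
    private final String name;
    private final boolean accepts;
    private final boolean space;
    private final double cost;

    FixedSite(String name, boolean accepts, boolean space, double cost) {
        this.name = name;
        this.accepts = accepts;
        this.space = space;
        this.cost = cost;
    }

    public String name() { return name; }
    public boolean acceptsJob() { return accepts; }
    public boolean hasQueueSpace() { return space; }
    public double accessCost() { return cost; }
}

class LowestAccessCostSelector {
    /** Returns the eligible site with the lowest access cost, or null if no site is eligible. */
    static CandidateSite select(List<? extends CandidateSite> sites) {
        CandidateSite best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (CandidateSite site : sites) {
            if (!site.acceptsJob() || !site.hasQueueSpace()) {
                continue; // skip sites that cannot take the job
            }
            double cost = site.accessCost(); // evaluate the cost once per site
            if (best == null || cost < bestCost) {
                best = site;
                bestCost = cost;
            }
        }
        return best;
    }
}

class AccessCostDemo {
    public static void main(String[] args) {
        List<CandidateSite> sites = Arrays.asList(
                new FixedSite("siteA", true, true, 40.0),
                new FixedSite("siteB", true, false, 5.0),   // cheapest, but its queue is full
                new FixedSite("siteC", true, true, 12.5));
        System.out.println("chosen: " + LowestAccessCostSelector.select(sites).name());
    }
}
```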
CombinedCostResourceBroker works in a similar way, but as well as calculating the access cost for the job in
question, it also accesses the job handler of the CE it is looking at and gets the access cost of each job in the queue (you
can see this in the getQueueAccessCost method of the JobHandler class). It combines the two costs, and the CE with
the lowest total cost is chosen.
(iii) The scheduling algorithm which I am trying to implement is Queue Execution Time.
The explanation of that algorithm is as follows:
Queue Execution Time = Execution time of current job + execution time of all the jobs in the queue
where
Execution time of current job = Access cost of remote files + execution time of all the files
We calculate our algorithm, QueueExecutionTime, in a similar way to Queue Access Cost. In QueueAccessCost, the access
cost of the current job and all the jobs in the queue is calculated. In addition to the access cost, we calculate the
execution time of the files needed to run the job.
My question is: where do I have to start to implement this algorithm?
I think you should start, as I mentioned above, by writing a new extension of SkelResourceBroker. You could copy most
of what is in CombinedCostResourceBroker, and simply add the execution times to the total cost for each CE. If you
are adding the execution time for all jobs in the queue, you would also need to modify the getQueueAccessCost method
in JobHandler, or add a new method such as getQueueExecutionCost, which would add the execution time for each job
as well as the access time. So, in fact, I think it is not too difficult for you to implement your algorithm.
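As a rough illustration of the calculation discussed above, the sketch below adds the execution time to the access cost
for the new job and for every queued job, then picks the CE with the lowest total. It assumes that access cost and
execution time are expressed in the same units. JobEstimate, QueueState and QueueExecutionTimeSelector are
hypothetical names for this example and do not correspond to OptorSim classes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical per-job estimate; both fields are assumed to be in the same time units. */
class JobEstimate {
    final double accessCost;
    final double executionTime;

    JobEstimate(double accessCost, double executionTime) {
        this.accessCost = accessCost;
        this.executionTime = executionTime;
    }

    double total() {
        return accessCost + executionTime;
    }
}

/** Hypothetical view of one CE: the new job's estimate at that CE plus its queued jobs. */
class QueueState {
    final String siteName;
    final JobEstimate newJob;
    final List<JobEstimate> queuedJobs;

    QueueState(String siteName, JobEstimate newJob, List<JobEstimate> queuedJobs) {
        this.siteName = siteName;
        this.newJob = newJob;
        this.queuedJobs = queuedJobs;
    }

    /** Cost of the new job plus the cost of every job already in the queue. */
    double queueExecutionTime() {
        double sum = newJob.total();
        for (JobEstimate queued : queuedJobs) {
            sum += queued.total();
        }
        return sum;
    }
}

class QueueExecutionTimeSelector {
    /** Returns the candidate with the lowest queue execution time, or null for an empty list. */
    static QueueState select(List<QueueState> candidates) {
        QueueState best = null;
        for (QueueState candidate : candidates) {
            if (best == null || candidate.queueExecutionTime() < best.queueExecutionTime()) {
                best = candidate;
            }
        }
        return best;
    }
}

class QueueExecutionTimeDemo {
    public static void main(String[] args) {
        QueueState busy = new QueueState("siteA", new JobEstimate(10, 20),
                Arrays.asList(new JobEstimate(5, 5), new JobEstimate(1, 9)));
        QueueState idle = new QueueState("siteB", new JobEstimate(25, 20),
                Collections.<JobEstimate>emptyList());
        System.out.println("chosen: " + QueueExecutionTimeSelector.select(Arrays.asList(busy, idle)).siteName);
    }
}
```

In the demo, siteA has the cheaper file access for the new job but a longer queue, so siteB ends up with the lower
total.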
Will a new optimisation strategy affect schedulers?
All the scheduling algorithms are based on the optimisation strategy. I am also going to implement a new optimisation
strategy, BestClient, in which files are replicated only once the number of file requests has reached a threshold
value. Where do I have to start to implement this algorithm, and will it affect the existing access cost
function?
The scheduling algorithms work independently of the file replication algorithms so your new strategy shouldn't affect
the existing scheduling algorithms. You will need to write a new Optimiser class, e.g. BestClientOptimiser, which
extends ReplicatingOptimiser with a new getBestFile() method. If you don't put a getAccessCost method in it, it will use
the existing getAccessCost method from SkelOptor so it won't be affected. You will also need a corresponding
StorageElement class, e.g. BestClientStorageElement, which defines which file(s) to delete when the SE is full. Have a
look at the existing Optimiser and StorageElement classes to see how it works.
Neighbour gridsites information
I have a problem in OptorSim programming. The problem is how to get the data files stored in the SEs of
neighbouring GridSites.
Can you explain your problem in some more detail?
If sites choose not to replicate a data file to their own SE, they can read it remotely from another site using the
simulateRemoteIO method in the SimpleComputingElement class. Do you mean this, or do you mean replicating the
data file from another site to its own SE? Perhaps I can give a better explanation if you can tell us some more about
what you want to do.
Suppose that many jobs are submitted to each gridsite. The jobs at a given gridsite will then read data files either from
its local SE or replicated from a remote gridsite. The neighbour gridsites may be a good choice for replicating data
files from. Therefore, the question is how to get, at run-time, the logical names of the data files stored in the SEs at
the neighbour gridsites.
Let me see if I understand you correctly. You want to write some code for OptorSim which will look at the neighbouring
grid sites and get the list of LFNs (logical filenames) of files in these sites.
First, to get the list of neighbouring sites for a particular site, you can use the method neighbouringSites() in the class
GridSite. Second, to get the files at these sites, you will have to use GridSite.getSE() to get the SE at each site, and
StorageElement.listFiles() to get the list of files in a human-readable format. You can also use
StorageElement.getAllFiles() to return them in the form of a hashtable.
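The two steps described above (walk over the neighbouring sites, then list the files held at each one) can be sketched
as follows. The sketch deliberately uses a stand-in class, SketchSite, rather than OptorSim's GridSite and
StorageElement, because the exact return types of neighbouringSites(), getSE() and getAllFiles() are not shown in
this archive. All names in the code are assumptions made for the example. It builds a map from each LFN to the names
of the neighbours that hold it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Stand-in for a grid site; it only mimics the roles of GridSite and its SE. */
class SketchSite {
    final String name;
    final Set<String> storedLfns = new TreeSet<>();         // plays the role of the SE's file list
    final List<SketchSite> neighbours = new ArrayList<>();  // plays the role of neighbouringSites()

    SketchSite(String name, String... lfns) {
        this.name = name;
        storedLfns.addAll(Arrays.asList(lfns));
    }
}

class NeighbourFileIndex {
    /** Maps each LFN held by a neighbour to the names of the neighbours holding it. */
    static Map<String, List<String>> build(SketchSite site) {
        Map<String, List<String>> holders = new TreeMap<>();
        for (SketchSite neighbour : site.neighbours) {
            for (String lfn : neighbour.storedLfns) {
                holders.computeIfAbsent(lfn, key -> new ArrayList<>()).add(neighbour.name);
            }
        }
        return holders;
    }
}

class NeighbourFileIndexDemo {
    public static void main(String[] args) {
        SketchSite a = new SketchSite("siteA");
        SketchSite b = new SketchSite("siteB", "file1", "file2");
        SketchSite c = new SketchSite("siteC", "file2");
        a.neighbours.add(b);
        a.neighbours.add(c);
        System.out.println(NeighbourFileIndex.build(a));
    }
}
```

Keeping the site name next to each LFN means the result records which site each copy belongs to, not just which files
exist.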
Yes, that's my purpose. I tried these methods.
When I use neighbouringSites(), GridSite.getSE() and StorageElement.listFiles(), it seems I can get the neighbouring
GridSites and their SEs, but no LFNs.
So I tried ReplicaManager.listReplicas() to get the names of all the replicas. This method works, but it cannot tell me
which gridsite each replica belongs to.
Statistics Output
Reading Output
Is there any other tool or software for evaluating the algorithm? I have to compare my scheduling algorithm with the
existing ones implemented in OptorSim with the help of charts. How do I do that?
We normally used the statistics output which OptorSim gives at the end, writing all the OptorSim output to a file and
then using some scripts to extract the information we were interested in. The plots were then drawn using some
separate software, inputting the data 'by hand'.
If you have the statistics level in the parameters file set to 3, it gives the maximum information. For example, if you are
interested in evaluating your algorithm by comparing the job times with other algorithms, you can collect the
totalJobTime information for all the sites from the statistics output and get the mean to give you the 'mean job time'
variable that we used. If there is some information you need which is not output, you can modify the getStatistics
method of the various grid elements (SE, CE, GridSite, GridContainer) to output what you want. I've attached the
script we used to extract the mean job time, to give you some idea, but you can probably come up with a solution
yourself to meet your own needs!
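The original script is not reproduced in this archive. As a substitute, here is a minimal sketch of the same idea: scan
the saved output for totalJobTime lines and average them. It assumes the per-site lines use the same
"| totalJobTime = value" layout as the GridContainer sample shown later in this section. The class name MeanJobTime is
hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MeanJobTime {
    // Matches e.g. "| totalJobTime = 7.2084216E7" and captures the number.
    private static final Pattern TOTAL_JOB_TIME =
            Pattern.compile("totalJobTime\\s*=\\s*([-+]?\\d+(?:\\.\\d+)?(?:[eE][-+]?\\d+)?)");

    /** Averages every totalJobTime value found in the given lines; NaN if there is none. */
    static double mean(List<String> lines) {
        double sum = 0.0;
        int count = 0;
        for (String line : lines) {
            Matcher m = TOTAL_JOB_TIME.matcher(line);
            if (m.find()) {
                sum += Double.parseDouble(m.group(1));
                count++;
            }
        }
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MeanJobTime <saved-output-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("mean job time = " + mean(lines));
    }
}
```

Note that the sketch averages every totalJobTime line it is given. The GridContainer block also prints a totalJobTime,
so that block would need to be filtered out first if only the per-site values are wanted.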
Otherwise, you can use the GUI (see section "Using the Graphical User Interface" in the user guide), but this is not so
useful if you are running a lot of experiments in batch mode.
CE usage
How do I calculate the CE usage for the entire grid in OptorSim?
When OptorSim finishes running and the Statistics tree is printed out at the end, you will see that the first element listed
there is the GridContainer. This is the whole grid; you will see there is an item there called ceUsage, which is what I
think you are looking for. It looks like this:
ResourceBroker> all jobs finished, shutting down P2P network ...
Statistics for the GridContainer taken Fri Sep 09 12:21:49 BST 2005
| remoteReads = 0
| localReads = 74746
| ENU = 0.9941268
| replications = 74307
| ceUsage = 56.221153
| totalJobTime = 7.2084216E7
|
In this example, there was a CE usage of 56.22% for the whole grid.
Using the GUI
I am viewing the grid output using the GUI. After I run the simulator, the percentage of CE usage in the summary table
shows zero. How can I see or calculate the CE usage using the GUI? I am not able to see the whole output
in the command prompt, so please answer for the GUI option.
I think that the easiest way for you to get the information, even if you are using the GUI, is to get it from the terminal
output. OptorSim still outputs to the terminal even if you are using the GUI. If you redirect the output to a file, you will
then be able to examine it at your leisure, e.g.
bin\OptorSim.bat > myResults.txt
should do it (although I am not so familiar with using the Windows command prompt). The GUI should still open up as
usual, and at the end you can read the output as well as using any results you have saved from the GUI.
Otherwise, if you really wanted a CE usage tab to appear in the GUI for the 'Grid' node, you would have to do
some modification to the GUI code, which probably isn't worth it.
Memory used by Statistics
When the simulation is running the memory usage is constantly increasing. I suspect this is because of all the statistics
OptorSim is collecting. Is it possible to turn all the statistics off? I know there is a parameter which specifies how
detailed the statistics are, but this parameter just defines what the simulator prints and not what is being collected during
the simulation. The problem arises when the simulated topology is quite big. In such instances the memory usage
becomes really high.
Unfortunately, there aren't any options to switch off collecting statistics. However, it should be fairly simple to disable
the code that stores the statistics you're not interested in. Simply commenting out the relevant parts in (for example)
SimpleComputingElement.java should do the trick.
Remote reads per file
Each site in the grid should maintain the number of times a file is accessed from a remote site. [For example, job1 needs
file1 and file2, and job2 needs file1. If job1 and job2 run at site1, then file1 is accessed twice from the remote site
and file2 is accessed once from the remote site.] Is there any function implemented in OptorSim for that? If not, how
could it be implemented?
The getStatistics method for CEs returns the total number of remote reads and local reads by that CE, but it doesn't
store the number per file. You could add some more instrumentation to getStatistics for the CE to store that information
if you like.
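A per-file counter of the kind suggested here could look like the sketch below. It is a stand-alone illustration and not
part of OptorSim: the class RemoteReadCounter and its methods are hypothetical. Where exactly to call
recordRemoteRead inside the CE code is left open. A ConcurrentHashMap is used on the assumption that several threads
might report reads at once.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-site counter of remote reads, keyed by logical file name (LFN). */
class RemoteReadCounter {
    // ConcurrentHashMap.merge is atomic, so concurrent increments are not lost.
    private final Map<String, Integer> remoteReads = new ConcurrentHashMap<>();

    /** Call once each time the given file is read from a remote site. */
    void recordRemoteRead(String lfn) {
        remoteReads.merge(lfn, 1, Integer::sum);
    }

    /** Remote reads recorded so far for the file (0 if none). */
    int remoteReadsOf(String lfn) {
        return remoteReads.getOrDefault(lfn, 0);
    }

    /** Sorted copy of the counts, suitable for adding to a statistics printout. */
    Map<String, Integer> snapshot() {
        return new TreeMap<>(remoteReads);
    }
}

class RemoteReadCounterDemo {
    public static void main(String[] args) {
        RemoteReadCounter site1 = new RemoteReadCounter();
        // job1 needs file1 and file2; job2 needs file1 (the example from the question)
        site1.recordRemoteRead("file1");
        site1.recordRemoteRead("file2");
        site1.recordRemoteRead("file1");
        System.out.println(site1.snapshot());
    }
}
```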
Resource Monitoring
In OptorSim, the resource availability is determined, which means the available resource (CE, SE and network
bandwidth) can be either known beforehand or be calculated when scheduling decisions are made. Is my
understanding right?
Yes that's right. The resource broker has all information about the load at each CE and the network.
Realistically, the resource availability should be fed to the broker, especially when resources are not dedicated and/or there
are multiple brokers. Therefore, the quality and freshness of the information reported by the Grid monitoring function are
very important.
True, we have ignored simulating any monitoring system and assume the resource broker has perfect information,
which is easy of course in a simulation but not realistic in real life!
I intend to add a monitoring module to OptorSim to feed the scheduler/resource broker. The monitoring function is
responsible for monitoring local resources (CE, SE, etc.) and reporting the monitoring information to the consumer
(broker). Could you suggest whether it's feasible to extend OptorSim to have the monitoring function, and where I
should start?
This sounds like a good idea, if your aim is to simulate the effects of monitoring efficiencies on the efficiency of running
jobs. Maybe you want to implement a P2P agent which sends monitoring information running at each site, similar to
the ones we have for the auction protocol, and a scheduling algorithm which uses information gathered from these
agents. The OptorSim code should (I hope!) make this relatively easy to do.
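To make the idea of imperfect monitoring information concrete, the sketch below shows one possible shape for it:
sites publish timestamped load reports, and the broker only trusts reports that are recent enough. Nothing here is
OptorSim code. LoadReport, MonitoringBoard, the use of queue length as the load measure and the maxAge parameter
are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical monitoring record: a site's queue length at a given simulation time. */
class LoadReport {
    final int queuedJobs;
    final long reportedAt;

    LoadReport(int queuedJobs, long reportedAt) {
        this.queuedJobs = queuedJobs;
        this.reportedAt = reportedAt;
    }
}

/** Hypothetical store of the latest report per site, as seen by a broker. */
class MonitoringBoard {
    private final Map<String, LoadReport> latest = new TreeMap<>();

    /** Called by a site's monitoring agent; replaces any earlier report from that site. */
    void publish(String site, int queuedJobs, long now) {
        latest.put(site, new LoadReport(queuedJobs, now));
    }

    /**
     * Returns the site with the fewest queued jobs among reports no older than maxAge,
     * or null if every report is stale. Ties go to the alphabetically first site.
     */
    String leastLoadedSite(long now, long maxAge) {
        String best = null;
        int bestLoad = 0;
        for (Map.Entry<String, LoadReport> entry : latest.entrySet()) {
            LoadReport report = entry.getValue();
            if (now - report.reportedAt > maxAge) {
                continue; // too old to trust
            }
            if (best == null || report.queuedJobs < bestLoad) {
                best = entry.getKey();
                bestLoad = report.queuedJobs;
            }
        }
        return best;
    }
}

class MonitoringDemo {
    public static void main(String[] args) {
        MonitoringBoard board = new MonitoringBoard();
        board.publish("siteA", 5, 0);
        board.publish("siteB", 2, 0);
        System.out.println("at t=1:  " + board.leastLoadedSite(1, 5));
        board.publish("siteA", 4, 10); // siteB has stopped reporting
        System.out.println("at t=12: " + board.leastLoadedSite(12, 5));
    }
}
```

Varying how often sites publish, and how old a report may be before it is ignored, would be one way to study how
monitoring quality affects scheduling decisions.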