Online Monitoring with MonALISA
Dan Protopopescu Glasgow, UK
MonALISA
Is a distributed service able to:
- collect any type of information from different systems
- analyze this information in real time
- take automated decisions and perform actions based on it
- optimize work flows in complex environments
Read more at http://monalisa.caltech.edu
Uses
- Monitoring distributed computing, i.e. Grids
- Optimizing flow in complex systems (VRVS, optical cable networks)
- ALICE also uses ML for monitoring online reconstruction
Some benchmark figures for the service:
- ~800k monitored parameters at 50k updates/second
- >10k running (AliEn) jobs monitored simultaneously
- >100 WAN links
We propose ML as a high-level monitoring and possibly control system, alongside (or on top of) existing slow controls systems such as EPICS, PVSS, etc.
Advantages
- MonALISA is simple to install, configure and use
- ApMon APIs are available in C, C++, Java, Python and Perl
- ROOT plugin allows macros to send data directly to MonALISA
- Can easily interface with (or sit on top of) any existing or future slow controls subsystem (EPICS, PVSS)
- Data is stored in a standard PgSQL (or MySQL) database that can be accessed by other applications, independently of ML
- Automatic data summarizing
- Several data repositories (and hence DBs) can exist (local and remote)
- Easy access via WebService (WS) from service and/or repository
- Fully supported by development team; work is being done in this direction
Capabilities
Based on monitored information, actions can be taken in:
- ML Service
- ML Repository
Actions can be triggered by:
- Values above/below given thresholds
- Absence/presence of values
- Correlations between several values
Possible action types:
- External command
- Plain event logging
- Annotation of repository charts; RSS feeds
- Email
- Instant messaging
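The three trigger kinds listed above (thresholds, absence/presence, correlations) can be illustrated with a small stand-alone sketch. It does not use MonALISA or its configuration format; the `Trigger` class, the helper names and the parameter names (`rate`, `beam_current`, `temp`, `load`) are all hypothetical, and a plain list stands in for the "plain event logging" action.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Latest known value per parameter; a missing key or None means "no value".
Snapshot = Dict[str, Optional[float]]
Condition = Callable[[Snapshot], bool]


@dataclass
class Trigger:
    """Hypothetical trigger: a condition plus the action to run when it fires."""
    name: str
    condition: Condition
    action: Callable[[str], None]  # called with the trigger name


def above(param: str, limit: float) -> Condition:
    """Threshold trigger: fires when a value is present and above the limit."""
    return lambda s: s.get(param) is not None and s[param] > limit


def absent(param: str) -> Condition:
    """Absence trigger: fires when no value is available for the parameter."""
    return lambda s: s.get(param) is None


def both(a: Condition, b: Condition) -> Condition:
    """Correlation trigger: fires only when both conditions hold together."""
    return lambda s: a(s) and b(s)


def evaluate(triggers: List[Trigger], snapshot: Snapshot) -> List[str]:
    """Run the action of every trigger that fires; return the fired names."""
    fired = []
    for t in triggers:
        if t.condition(snapshot):
            t.action(t.name)
            fired.append(t.name)
    return fired


log: List[str] = []  # stand-in for an event log
triggers = [
    Trigger("high_rate", above("rate", 100.0), log.append),
    Trigger("no_beam", absent("beam_current"), log.append),
    Trigger("hot_and_loaded", both(above("temp", 60.0), above("load", 8.0)), log.append),
]
fired = evaluate(triggers, {"rate": 150.0, "temp": 70.0, "load": 2.0})
print(fired)
```

In this run the rate threshold and the missing `beam_current` value fire, while the correlated trigger does not, because only one of its two conditions holds.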
Components
[Architecture diagram. Components shown: ApMon, Service, LUS/Proxies, Repository, Web Server, GUI. Annotations: actions based on aggregated information; actions based on local information; quick actions.]
Service setup
ML Service setup:
    wget http://nuclear.gla.ac.uk/~protopop/ML/MonaLisa.tar.gz
    tar -zxvf MonaLisa.tar.gz
    cd MonaLisa/
    ./install.sh
    cd ../MonaLisa/Service/CMD/
    ./MLD start
Repository setup
ML Repository setup:
    wget http://nuclear.gla.ac.uk/~protopop/ML/MLrepository.tgz
    tar -zxvf MLrepository.tgz
    [configure it]
    cd MLrepository
    ./start.sh
ApMon setup
ApMon setup:
    wget http://nuclear.gla.ac.uk/~protopop/ML/ApMon_perl.tar.gz
    tar -xzvf ApMon_perl.tar.gz
    cd ApMon_perl
    [create your script, say mysend.pl]
    perl mysend.pl
Simple monitoring script
[monalisa@glasgow]$ cat mysend.pl
    use ApMon;
    # Destination ML Service; ApMon's own system monitoring is switched off
    my $apm = new ApMon({"glasgow.jlab.org:8884" =>
        {"sys_monitoring" => 0, "general_info" => 0}});
    my @pair;
    while (1) {    # loop forever
        # get values from somewhere; getmypar() is user-supplied and
        # returns a (name, value) pair
        @pair = getmypar("pspec_logic_ai_0");
        $apm->sendParameters("Detector", "MOR", @pair);
        sleep(20);
    }
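The same read-and-send loop can be sketched in Python, one of the languages for which ApMon APIs are listed. This is only an illustration of the loop structure under stated assumptions: `monitor`, `read_value` and `send` are hypothetical names, and `send` is injected so that the sketch runs without a detector or an ML Service. In real use it would wrap whatever send call the ApMon binding provides.

```python
import time
from typing import Callable, List, Tuple


def monitor(read_value: Callable[[str], float],
            send: Callable[[str, str, str, float], None],
            cluster: str, node: str, param: str,
            period_s: float, iterations: int) -> None:
    """Read one parameter and forward it `iterations` times.

    The Perl script above loops forever; a finite count is used here
    so the sketch terminates.
    """
    for i in range(iterations):
        value = read_value(param)          # "get values from somewhere"
        send(cluster, node, param, value)  # hand over to the sender
        if i + 1 < iterations:
            time.sleep(period_s)


# Stand-ins (assumptions): fake readings and an in-memory "sender".
sent: List[Tuple[str, str, str, float]] = []
fake_readings = iter([1200.0, 1180.0, 1215.0])

monitor(read_value=lambda name: next(fake_readings),
        send=lambda c, n, p, v: sent.append((c, n, p, v)),
        cluster="Detector", node="MOR", param="pspec_logic_ai_0",
        period_s=0.0, iterations=3)
print(sent)
```

Keeping the reader and the sender as parameters makes it easy to test the loop offline and to swap in a different data source later.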
Time history
Time history example:
[monalisa@glasgow]$ cat mor.properties
    page=hist
    Farms=JlabML
    Clusters=Detector
    Nodes=MOR
    Functions=pspec_logic_ai_0
    ylabel=Tagger rate
    title=MOR
    annotation.groups=2
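To make the link between this properties file and the monitoring script explicit, the sketch below parses the same key=value pairs and extracts the farm/cluster/node/parameter selection. The values `Detector`, `MOR` and `pspec_logic_ai_0` are the ones the script passes when sending data. The parser is a generic stand-in written for illustration, not the repository's own loader, and `parse_properties` is a hypothetical name.

```python
from typing import Dict

# Contents of mor.properties, one key=value pair per line.
MOR_PROPERTIES = """\
page=hist
Farms=JlabML
Clusters=Detector
Nodes=MOR
Functions=pspec_logic_ai_0
ylabel=Tagger rate
title=MOR
annotation.groups=2
"""


def parse_properties(text: str) -> Dict[str, str]:
    """Minimal key=value parser; skips blank lines and '#' comments."""
    props: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


props = parse_properties(MOR_PROPERTIES)
# The series plotted on the history page: farm / cluster / node / parameter.
series = (props["Farms"], props["Clusters"], props["Nodes"], props["Functions"])
print(series)
```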
Web interface
Java GUI
Application control
- ML Clients: TCP based subscribe mechanism; serialized, compressed objects with optional encryption
- ML Proxies: application commands are encrypted
- ML Services: standard and/or user's sensors and/or application modules
[Diagram. Elements shown: Keystore, LUS, ML Service, ML Repository, ApMon, AppMonC, bash, GUI client, your application, your custom Java client, your custom view, your mon module, your app module.]
Alert-based Actions
- MySQL daemon is automatically restarted when it runs out of memory
  Trigger: threshold on VSZ memory usage
- ALICE production jobs queue is automatically kept full by the automatic resubmission
  Trigger: threshold on the number of aliprod waiting jobs
- Administrators are kept up-to-date on the services' status via instant messaging, RSS feeds, toolbar alerts etc.
  Trigger: presence/absence of monitored information
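Two of the triggers above reduce to very small calculations, sketched here under stated assumptions. The function names, the target queue size and the timeout are hypothetical. The slides do not give the actual values or the way the ALICE setup implements these checks.

```python
from typing import Dict, List


def jobs_to_resubmit(waiting: int, target: int) -> int:
    """Queue top-up: how many jobs to submit to bring the number of
    waiting jobs back up to the target (never negative)."""
    return max(0, target - waiting)


def silent_services(last_seen: Dict[str, float], now: float,
                    timeout_s: float) -> List[str]:
    """Absence trigger: services whose last report is older than the timeout."""
    return sorted(name for name, ts in last_seen.items() if now - ts > timeout_s)


# Hypothetical numbers, chosen only for illustration.
print(jobs_to_resubmit(waiting=120, target=500))
print(silent_services({"CE": 1000.0, "SE": 940.0}, now=1010.0, timeout_s=60.0))
```

With these inputs the first call asks for 380 new jobs, and the second reports only `SE`, whose last report is 70 seconds old.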
Summary
- MonALISA is a very promising tool for online experiment monitoring and for interfacing with a variety of slow control subsystems; GlueX is seriously considering ML for this task
- Easy to configure, understand and use
- Experience from Grid monitoring and more
- Support from the developers group for implementation of new modules/features
- Online experiment monitoring tests of CLAS@JLab were recently carried out; demo repository is at http://mlr1.gla.ac.uk:7002
More examples / Extras
Integrated Pie Charts
History Plots, Annotations
AliEn Services Monitoring
AliEn services
- Periodically checked
- PID check + SOAP call
- Simple functional tests
- SE space usage
- Efficiency
Job Network Traffic Monitoring
- Based on the xrootd transfer from every job
- Aggregated statistics for
  - Sites (incoming, outgoing, site to site, internal)
  - Storage Elements (incoming, outgoing)
- Of read and written files
- Transferred MB/s
Individual Job Tracking
- Based on AliEn shell cmds: top, ps, spy, jobinfo, masterjob
- Using the GUI ML Client
- Status, resource usage, per job
Head Node Monitoring
- Machine parameters, real-time & history, load, memory & swap usage, processes, sockets
MonALISA in AliEn
The MonALISA framework has been used as a primary monitoring tool for the ALICE Grid since 2004
Presently the system is used for monitoring all (identified) services, jobs and network parameters necessary for Grid operation and debugging
The number of concurrently monitored and stored parameters today is ~300,000 in 75 ML Services
The add-on tools for automatic event notification allow for more efficient reaction to problems
The framework's design and flexibility meet all requirements for a monitoring system
The accumulated information makes it possible to construct and implement automated decision-making algorithms, thus further increasing the efficiency of Grid operations