The new Fabric Management Tools in Production at CERN
Thorsten Kleinwort for CERN IT/FIO
HEPiX Autumn 2003, TRIUMF, Vancouver
Monday, October 20, 2003
20 October 2003 Thorsten Kleinwort
IT/FIO/FS2
Contents
• Introduction to CERN’s Fabric Management: Concepts
• Framework for CERN’s Fabric Management: Tools
  • Configuration Mgmt
  • Software Mgmt
  • State Mgmt
  • Monitoring
Concepts: The Node
The Node is the manageable unit:
• Autonomous:
  • Local configuration files
  • Programs work locally
  • No external dependencies
  • No remote management scripts
• Adheres to LSB (Linux Standard Base):
  • Init scripts in /etc/init.d/ start daemons
  • Logfile directory /var/log, logrotate
  • Config directory /etc
  • (System) programs in /(s)bin/, /usr/(s)bin
Concepts: Node -> Cluster
• Same functionality of nodes -> cluster (but not necessarily same HW)
• Management tools enforce uniform setup
• Cluster size varies:
  • LXBATCH > 1000 nodes
  • LXPLUS ~ 70 nodes
  • LXMASTER (batch master) = 2 nodes
• Critical servers replaced by service clusters with redundant nodes
Concepts: Principles
• Software installs/updates through RPM
• Configuration through one tool
• Configuration information through one interface
• Configuration information stored centrally
• Installation, configuration and maintenance automated, but steerable
• Reproducibility
Framework
[Diagram: the node runs a Mon Agent, a Cfg Agent and a SW Agent; the other boxes are Monitoring Manager, Config Manager, Config Cache, SW Manager, SW Cache, Hardware Manager and State Manager.]
Framework
[Diagram: the node runs a SW Agent, a Cfg Agent and a Mon Agent; the other boxes are CDB, CCM, Monitoring Manager, SW Manager, SW Cache, Hardware Manager and State Manager.]
Configuration (CDB & CCM)
CDB (Configuration Data Base):
• Development of EU Data Grid (WP4)
• CDB is the configuration data base
• Now ~ 1500 nodes, ~ 15 clusters
• ~ 3200 configuration templates to describe the nodes
• Creates one (XML) profile per node
• All information that is needed to install & run the nodes now included
• Currently 2 Linux versions: RH 7.3 & ES 2.1
CDB (cont’d)
Additional information to be added (merged from other sources):
• State information (-> SMS)
• Monitoring information (-> MSA)
• Vendor/Contract/Purchase information:
  • Need for encryption to store secure data
New, high-level interfaces are provided:
• “Add/Rename Node”
• Change node state
CDB (cont’d)
• Local caching on the node, CCM (Configuration Cache Manager):
  • In test phase, deployed on a few nodes
  • Runs a local daemon, which is notified on modification of the node's configuration information
  • Avoids peaks on CDB web servers
• Besides XML profiles, new SQL interface:
  • Allows SQL queries on CDB
  • Needed for cross-machine view (e.g. give me all nodes that belong to cluster X)
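The cross-machine query mentioned above ("all nodes that belong to cluster X") can be sketched as follows. This is only an illustration: the talk does not describe the CDB SQL schema, so the `nodes` table, its columns and the node names are hypothetical, and an in-memory SQLite database stands in for the real back end.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a HYPOTHETICAL node/cluster table.

    The real CDB SQL layout is not given in the slides; this schema and
    the node names exist only to make the query runnable.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (name TEXT PRIMARY KEY, cluster TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO nodes (name, cluster) VALUES (?, ?)",
        [
            ("lxb0001", "lxbatch"),
            ("lxb0002", "lxbatch"),
            ("lxplus001", "lxplus"),
            ("lxplus002", "lxplus"),
            ("lxmaster01", "lxmaster"),
        ],
    )
    return conn


def nodes_in_cluster(conn, cluster):
    """Return the sorted names of all nodes that belong to `cluster`."""
    rows = conn.execute(
        "SELECT name FROM nodes WHERE cluster = ? ORDER BY name", (cluster,)
    )
    return [name for (name,) in rows]
```

Calling `nodes_in_cluster(build_demo_db(), "lxplus")` returns the two demo LXPLUS nodes. A per-node XML profile cannot answer this kind of question without reading every profile, which is presumably why the SQL view is needed.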
Framework
[Diagram: the node runs SPMA, a Cfg Agent and a Mon Agent; the other boxes are CDB, CCM, Monitoring Manager, SWRep, SWRep Cache, Hardware Manager and State Manager.]
Software distribution (SPMA & SWRep)
SPMA (Software Package Management Agent):
• Development of EU Data Grid (WP4)
• The tool to install all software on the nodes
  • Uses RPM for SW distribution on Linux
  • Version for Solaris PKG package manager exists
• We install between 700 and 1000 RPMs per node
• Based on RPMT (enhancement of RPM)
• Crucial part of the framework
SPMA (cont’d)
• SPMA runs on every node (on demand)
• Can manage either a subset or all packages:
  • We manage all packages on all clusters but one, which is for development
  • Missing packages are added and
  • Unknown packages are removed
• Package list created from CDB, but SPMA is independent of CDB
• SPMA allows rolling back versions
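The add/remove behaviour described above amounts to comparing a desired package list with what is installed. The sketch below illustrates that idea only; it is not SPMA's actual algorithm or interface. The function name, the name-to-version dictionaries and the `manage_all` flag are assumptions, as is the choice to leave unknown packages untouched when only a subset is managed.

```python
def plan_changes(desired, installed, manage_all=True):
    """Compare a desired package list with the installed one.

    `desired` and `installed` map package name -> version (a simplification:
    one version per package). Returns (to_install, to_remove).

    ASSUMPTION: when only a subset is managed (manage_all=False), packages
    that are not in the desired list are left alone.
    """
    # Missing packages, or packages at a different version, get (re)installed.
    # A lower desired version therefore acts as a rollback.
    to_install = {
        name: version
        for name, version in desired.items()
        if installed.get(name) != version
    }
    # Unknown packages are removed only when the whole node is managed.
    to_remove = sorted(set(installed) - set(desired)) if manage_all else []
    return to_install, to_remove
```

For example, with `desired = {"openssh": "3.6", "lsf": "5.1"}` and `installed = {"openssh": "3.5", "kde": "3.1"}` (made-up versions), the plan installs both desired packages and removes `kde`; with `manage_all=False` nothing is removed.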
SPMA & SWRep
SWRep (Software Repository):
• Client-server tool suite for storage of software packages
• Universal:
  • Linux RPM/Solaris PKG
  • Multiple versions: RH 7.3, RH ES 2.1, RH 10
• Management interface:
  • ACL mechanism to add packages
  • Package list automatically kept up-to-date in CDB
SPMA & SWRep (cont’d)
Addresses scalability:
• HTTP as SW distribution protocol
• Load-balanced server cluster
• SPMA run is randomly time-delayed within 10 minutes
• Pre-caching of SW packages on the node possible
• Currently installed on 1500 nodes
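The random start delay can be illustrated with a small simulation. This is a sketch of the general technique, not SPMA code: the uniform distribution, the function names and the per-minute bucketing are assumptions chosen to show how a 10-minute window spreads the load of many nodes on the repository servers.

```python
import random

MAX_DELAY_SECONDS = 10 * 60  # the 10-minute window from the slide


def start_delay(rng=random):
    """Pick a delay in seconds before a node starts its run.

    ASSUMPTION: the delay is drawn uniformly; the slides only say the run
    is randomly delayed within 10 minutes.
    """
    return rng.uniform(0, MAX_DELAY_SECONDS)


def runs_per_minute(num_nodes, seed=0):
    """Simulate `num_nodes` nodes and count how many start in each minute."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    buckets = [0] * 10
    for _ in range(num_nodes):
        minute = min(int(start_delay(rng) // 60), 9)  # clamp the upper edge
        buckets[minute] += 1
    return buckets
```

Without the delay all nodes would contact the servers at the same moment; `runs_per_minute(1500)` shows the same 1500 runs spread over ten one-minute buckets instead.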
Framework
[Diagram: the node runs SPMA, NCM and a Mon Agent; the other boxes are CDB, CCM, Monitoring Manager, SWRep, SWRep Cache, Hardware Manager and State Manager.]
Configuration Tool (NCM)
NCM (Node Configuration Manager):
• Local configuration tool
• EU Data Grid (WP4) development
• First components have been (re-)written and are tested on production nodes
• Uses CDB for configuration information
• Has its first public release:
  • We have to transform all our SUE features into NCM components (~50)
  • Plan is to do this while migrating to next Linux release
Framework
[Diagram: the node runs SPMA, NCM and MSA; the other boxes are CDB, CCM, OraMon, SWRep, SWRep Cache, Hardware Manager and State Manager.]
Monitoring (MSA & OraMon)
LEMON (LHC Era Monitoring):
• EU Data Grid (WP4) development
• Client (MSA):
  • ~ 100 metrics are measured
  • Deployed on > 1500 nodes (more than currently managed by CDB)
  • Configuration to be put into CDB
• Server (OraMon):
  • ORACLE database as back end
  • Stores current values as well as history
  • User API (in C, PERL, PHP, TCL) in test phase
Framework
[Diagram: the node runs SPMA, NCM and MSA; the other boxes are CDB, CCM, OraMon, SWRep, SWRep Cache, HMS and SMS.]
State Management (SMS & HMS)
LEAF (LHC Era Automated Fabric):
• HMS (Hardware Management System), controls & tracks:
  • Node installation
  • Node move & reinstall (rename)
  • Node retirement
  • Node repairs (vendor calls)
• Remedy workflow application
• Will interface to CDB
HMS & SMS
SMS (State Management System):
• Allows setting node states (in CDB)
• Validates state transitions
• Handles new machine arrivals (~400 in Nov)
• Uses SOAP to interface to CDB
• Working prototype
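Validating a state transition can be sketched as a lookup in a table of allowed moves. The slides do not list the SMS states or rules, so every state name and transition below is hypothetical and serves only to show the mechanism; the real SMS talks to CDB over SOAP, which this sketch leaves out.

```python
# HYPOTHETICAL states and transitions: not taken from SMS itself.
ALLOWED_TRANSITIONS = {
    "new": {"install"},
    "install": {"production"},
    "production": {"standby", "maintenance", "retired"},
    "standby": {"production", "maintenance"},
    "maintenance": {"production", "install", "retired"},
    "retired": set(),  # terminal state in this sketch
}


class InvalidTransition(ValueError):
    """Raised when a requested state change is not in the table."""


def change_state(current, target):
    """Return `target` if the move from `current` is allowed, else raise."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {target} is not allowed")
    return target
```

With this table `change_state("production", "maintenance")` succeeds, while `change_state("retired", "production")` raises `InvalidTransition`. Keeping the rules in one table means a new state is a data change, not a code change.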
Tools:
[Diagram: the framework boxes (node with SPMA, NCM and MSA; CDB, CCM, OraMon, SWRep, SWRep Cache, HMS, SMS) grouped under the three tool suite names QUATTOR, LEMON and LEAF.]
Tools: Examples
• Batch system LSF:
  • Upgrade 4.2 -> 5.1 on > 1000 nodes within 15 min, without stopping batch (with pre-caching)
• Kernel upgrade:
  • SPMA can handle multiple versions of the same package:
    • Allows installation and reboot of a new kernel to be separated in time
• Security upgrades:
  • All security upgrades are done by SPMA (~once a week):
    • SSH security upgrade
    • KDE upgrade (~400 MB per node)
References
• EU Data Grid: http://www.eu-datagrid.org
• EDG WP4: http://cern.ch/hep-proj-grid-fabric
• QUATTOR web page: http://quattor.org
• LEMON web page: http://cern.ch/lemon
• LEAF web page: http://cern.ch/leaf
• CERN IT/FIO: http://cern.ch/it-div-fio