Operational computing environment at EARS
Jure Jerman
Meteorological Office
Environmental Agency of Slovenia (EARS)

Outline
- Linux cluster at the Environmental Agency of Slovenia: history and present state
- Operational experiences
- Future requirements for limited area modelling
- Needed ingredients for a future system?

History & background
- EARS: small service, limited resources for NWP
- Small NWP group, research & operations
- First research Alpha-Linux cluster (1996): 20 nodes
- First operational Linux cluster at EARS (1997)
  - 5 x Alpha CPU
  - Among the first operational clusters in Europe in the field of meteorology

Tuba – current cluster system
- Installed 3 years ago, already outdated
- Important for gathering experience
- Hardware:
  - 13 compute nodes, 1 master node
  - Dual Xeon 2.4 GHz, 28 GB memory
  - Gigabit Ethernet
  - Storage: 4 TB IDE2SCSI disk array, xfs filesystem

Tuba software
- Open source, whenever possible
- Cluster management software:
  - OS: RH Linux + SCore (5.8.2) (www.pccluster.org)
  - Mature parallel environment
  - Lower latency MPI implementation
  - Transparent to user
  - Gang scheduling
  - Pre-empting
  - Checkpointing
  - Parallel shell
  - Automatic fault recovery (hardware or SCore)
  - FIFO scheduler
  - Capability of integration with OpenPBS and SGE
- Lahey and Intel compilers

Ganglia - Cluster Health monitoring

Operational experiences
- In production for almost 3 years
- Unmonitored suite
- Minimal hardware-related problems so far!
- Some problems with SCore (mainly related to buffers in MPI)
- NFS-related problems
- ECMWF's SMS solves the majority of problems

Reliability

Operational setup
- ALADIN model
- 290 x 240 x 37 domain
- 9.3 km resolution
- 54 h integration
- Target: 1 h
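
The setup figures above imply a problem size and a required speed relative to real time. A minimal sketch of that arithmetic follows. It assumes that 290 x 240 are horizontal grid points and 37 is the number of vertical levels (the slide does not label the three numbers), that the grid spacing is uniform, and that the 1 h target is wall-clock time for the full 54 h forecast. The function name `setup_summary` is made up for this illustration.

```python
# Back-of-envelope numbers for the operational setup.
# Assumption: NX x NY are horizontal grid points, NLEV is vertical levels.
NX, NY, NLEV = 290, 240, 37
DX_KM = 9.3               # horizontal grid spacing
FORECAST_H = 54           # forecast range in hours
TARGET_WALLCLOCK_H = 1    # assumed to be the wall-clock budget


def setup_summary():
    """Return grid size, approximate domain extent and required speed-up."""
    columns = NX * NY                  # horizontal grid columns
    grid_points = columns * NLEV       # total 3-D grid points
    extent_km = (NX * DX_KM, NY * DX_KM)  # approximate domain size
    # How many times faster than real time the model has to run.
    speedup = FORECAST_H / TARGET_WALLCLOCK_H
    return {
        "columns": columns,
        "grid_points": grid_points,
        "extent_km": extent_km,
        "speedup_vs_real_time": speedup,
    }


if __name__ == "__main__":
    for key, value in setup_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions the domain holds roughly 2.6 million grid points and the forecast has to run 54 times faster than real time.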

Optimizations
- Not everything is in the hardware
- Code optimizations:
  - B-level parallelization (up to 20 % at greater numbers of processors)
  - Load balancing of grid-point computations (depending on the number of processors)
- Parameter tuning:
  - NPROMA cache tuning
  - MPI message size
- Improvement in compilers (Lahey -> Intel 8.1: 20 - 25 %)
- Still to work on: OpenMP (better efficiency of memory usage)
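
Load balancing of grid-point work and NPROMA tuning both come down to how grid-point columns are partitioned. The sketch below is illustrative only and is not taken from the ALADIN code. `split_evenly` and `nproma_blocks` are hypothetical helper names, the processor count of 26 and the block length of 64 are arbitrary example values, and treating NPROMA as the length of the block of grid points processed together is an assumption about its role here.

```python
def split_evenly(n_items, n_parts):
    """Return per-part item counts that differ by at most one."""
    base, extra = divmod(n_items, n_parts)
    # The first `extra` parts take one additional item each.
    return [base + 1 if i < extra else base for i in range(n_parts)]


def nproma_blocks(n_points, nproma):
    """Return (start, length) pairs covering n_points in blocks of at most nproma."""
    return [
        (start, min(nproma, n_points - start))
        for start in range(0, n_points, nproma)
    ]


if __name__ == "__main__":
    columns = 290 * 240                  # horizontal grid columns of the domain
    n_procs = 26                         # hypothetical processor count
    per_proc = split_evenly(columns, n_procs)
    print("columns per processor:", min(per_proc), "to", max(per_proc))

    nproma = 64                          # hypothetical block length
    blocks = nproma_blocks(per_proc[0], nproma)
    print("blocks on first processor:", len(blocks))
    print("last block length:", blocks[-1][1])
```

A shorter block keeps the working set of one block closer to the cache size, while a longer one reduces loop overhead, so the best value is typically found by experiment on the target processor.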

Non-operational use
- Downscaling of ERA-40 reanalysis with the ALADIN model
- Estimation of wind energy potential over Slovenia
- Multiple nesting of target computational domain into ERA-40 data
- 10-year period, 8 years / month
- Major question: how to ensure coexistence with the operational suite

Foreseen developments in limited area modelling
- Currently ALADIN 9 km
- 2008-2009: Arome, 2.5 km: ALADIN NH solver + Meso NH physics
- 3 times more expensive per grid point
- Target Arome: ~200 x - 300 x more expensive (same computational domain, same time range)
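
The order of magnitude of the Arome cost estimate can be checked with a rough scaling argument. The sketch below is not the author's calculation. It assumes that the number of horizontal grid points grows with the square of the refinement ratio and that the time step shrinks in proportion to the grid spacing. The 9.3 km figure is the current operational resolution, and `level_factor` is a hypothetical knob for any change in the number of vertical levels, which the slides do not state.

```python
def relative_cost(dx_old_km, dx_new_km, per_point_factor, level_factor=1.0):
    """Rough cost ratio of a finer model on the same domain and time range."""
    refine = dx_old_km / dx_new_km
    horizontal = refine ** 2   # more grid points in both x and y
    timestep = refine          # assumption: time step scales with grid spacing
    return horizontal * timestep * per_point_factor * level_factor


if __name__ == "__main__":
    # 9.3 km -> 2.5 km, 3 times more expensive per grid point (from the slide).
    same_levels = relative_cost(9.3, 2.5, per_point_factor=3.0)
    print(f"same number of levels: about {same_levels:.0f}x")

    # Hypothetical: 50 % more vertical levels in the finer model.
    more_levels = relative_cost(9.3, 2.5, per_point_factor=3.0, level_factor=1.5)
    print(f"with 1.5x the levels:  about {more_levels:.0f}x")
```

With these assumptions alone the ratio comes out somewhat below the quoted 200 x - 300 x range. An extra factor of roughly 1.3 to 1.9, for example from more vertical levels, would close the gap, but the slides do not say how the quoted range was derived.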

How to get there (if?)
- Linux commodity cluster at EARS?
- First upgrade in mid-2006
- 5 times the current system (if possible, below 64 processors)
- Tests going on with:
  - New processors: AMD Opteron, Intel Itanium-2
  - Interconnection: InfiniBand, Quadrics?
  - Compilers: PathScale (AMD Opteron)
- Crucial: parallel file system (TerraGrid), already installed, replacement of NFS

How to stay at the open side of the fence?
- Linux and other open-source projects are evolving
- Great number of more and more complex software projects
- Specific (operational) requirements in meteorology
- Space for system integrators
- Price/performance gap between commodity and brand-name systems gets smaller as system size grows
- Pioneer time of Beowulf clusters seems to be over
- Importance of extensive testing of all cluster components

Conclusions
- Positive experiences with a small commodity Linux cluster, great price/performance ratio
- Our present way of developing a new cluster works for a small cluster, might work for a medium-sized one, and doesn't work for big systems
- The future is probably Linux clusters, but branded