Date post: 13-Jan-2016
Performance Monitoring of SLAC Blackbox Nodes Using Perl, Nagios, and Ganglia

Roxanne Martinez

Mentor: Yemi Adesanya

United States Department of Energy

Stanford, CA 94305

SCCS

The Scientific Computing and Computing Services at SLAC:

• Provides computing power, technical support, communications capabilities.

• Core services include Unix systems, Windows, networking, network operations, telecommunications.

• Supplies departmental support, science applications, and network security.

• Houses thousands of servers.

The High Performance Computing Group of SCCS

• All of these servers must be monitored to ensure optimal computing performance. This is the responsibility of the HPC group.

• The group watches data storage, electrical service to servers, and cooling system capacity.

• This is made possible through the use of monitoring software: Nagios and Ganglia.

SCCS Task

• Until last year, all computing capacity at SLAC was located within the SCCS computing building.

• By then the datacenter had reached its maximum electrical service and cooling system capacities.

• New experiments meant the need for more computing power.

• A new datacenter would take years and a lot of funding to complete.

The Solution: Blackboxes

• A blackbox is a Sun Modular Datacenter produced by Sun Microsystems.

• It is a portable computing center built into a standard 8 foot by 20 foot shipping container.

• It is painted white for energy efficiency and is tightly sealed, insulated, and cooled.

• Today, SLAC maintains 2 blackboxes.

Blackbox Contents

• Blackbox 1

– 252 bali machines (Sun X2200 servers)

• Blackbox 2

– 156 yili machines (Sun X4100 servers)

– 139 boer machines (Sun X2200 servers)

The operating system on these machines is Red Hat Enterprise Linux (RHEL) version 4.

Current Monitoring of the Blackboxes

The High Performance Computing Group currently uses Nagios and Ganglia to monitor:

• Percentage of CPU in use,

• Amount of memory in use, and

• Input/output rates.

The software periodically calls on utilities to extract monitoring data for the machines, displaying the info in graphs, storing the info in databases, and – in the case of Nagios – alerting administrators if machines reach warning or critical states.

Nagios

• User specifies items to be monitored by providing external plugins that return the status of machines to Nagios.

• If a warning or critical status is returned, Nagios can alert via email, IM, text, etc.

• Admins and users can view current status and history using a web browser.

– MySQL runs as a server to provide multi-user access to multiple databases. Interface: PerfParse.

– Round robin database (RRD) provides useful graphs of broad historical data. Popular because the database files do not increase in size over time.
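
The fixed-size property of a round robin database can be illustrated with a small sketch. It is written in Python for brevity, not with RRDtool itself, and the class name and slot count are hypothetical. Real RRD files also consolidate older samples into coarser archives, which this toy version leaves out.

```python
from collections import deque


class RoundRobinArchive:
    """Toy fixed-size archive: keeps only the newest `slots` samples."""

    def __init__(self, slots):
        # A deque with maxlen drops the oldest entry once it is full,
        # so storage never grows beyond `slots` samples.
        self._samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        self._samples.append((timestamp, value))

    def fetch(self):
        return list(self._samples)


archive = RoundRobinArchive(slots=4)
for step in range(6):
    # Hypothetical samples taken every 900 seconds (15 minutes).
    archive.update(step * 900, 40.0 + step)

print(archive.fetch())
```

After six updates only the four newest samples remain, which is why the storage footprint stays constant however long monitoring runs.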

Ganglia

• Robust scalable distributed monitoring system designed for clusters and grids.

• Based on a hierarchical design: uses a tree of connections to representative nodes for each cluster, reducing overheads.

• Updates the RRD.

• Has a web frontend like Nagios but does not have an alerting feature.

Additional Monitoring Needed

• Temperature

• Fan speed

• Power supply voltage

“Materials”

• Baseboard management controller (BMC)

– Service processor that monitors the physical state of the machine.

– Located on the motherboard.

– Performs monitoring through the machine's sensors.

– Part of the Intelligent Platform Management Interface (IPMI), which provides a set of interfaces to manage and monitor a system.

• IPMItool

– Open source utility.

– Can be used to extract physical parameters and parameter thresholds. These are important in determining the status.

– The thresholds are Lower Non-Recoverable, Lower Critical, Lower Non-Critical, Upper Non-Critical, Upper Critical, and Upper Non-Recoverable.
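
One way the six thresholds could be turned into a monitoring status is sketched below in Python. The mapping is an assumption, not taken from the project's scripts: crossing a non-critical limit is treated as WARNING, and crossing a critical or non-recoverable limit as CRITICAL. The function name, the short keys (lnr, lcr, lnc, unc, ucr, unr), and the example limits are all hypothetical.

```python
def classify(reading, thresholds):
    """Map a sensor reading to a Nagios-style status string.

    `thresholds` is a dict with any of the keys
    lnr, lcr, lnc, unc, ucr, unr; missing keys are ignored.
    """
    if reading is None:
        return "UNKNOWN"

    def below(key):
        limit = thresholds.get(key)
        return limit is not None and reading <= limit

    def above(key):
        limit = thresholds.get(key)
        return limit is not None and reading >= limit

    # Assumption: critical and non-recoverable limits both map to CRITICAL.
    if below("lnr") or below("lcr") or above("ucr") or above("unr"):
        return "CRITICAL"
    # Assumption: non-critical limits map to WARNING.
    if below("lnc") or above("unc"):
        return "WARNING"
    return "OK"


# Hypothetical CPU temperature limits in degrees C.
cpu_limits = {"unc": 85.0, "ucr": 90.0, "unr": 95.0}
print(classify(49.0, cpu_limits))
print(classify(87.5, cpu_limits))
print(classify(92.0, cpu_limits))
```

Sensors that report only upper limits (typical for temperature) or only lower limits (typical for fan speed) are handled by the same function, because absent thresholds are simply skipped.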

“Materials” continued

sudo ipmitool -c sdr

sudo ipmitool sensor list

Both commands were run while connected to the Sun X2200 server boer0113; the output shown on the slide is not reproduced in this transcript.
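
As a rough illustration of how the sensor listing could be consumed, the Python sketch below parses pipe-separated text of the kind `ipmitool sensor list` prints (name, reading, unit, status, then the six thresholds from lower non-recoverable to upper non-recoverable). The sample lines, sensor names, and limits are hypothetical and were not captured from boer0113.

```python
def to_number(raw):
    """Return a float, or None for entries such as 'na'."""
    try:
        return float(raw)
    except ValueError:
        return None


def parse_sensor_list(text):
    """Parse pipe-separated sensor lines into a dict keyed by sensor name."""
    fields = ("value", "unit", "status", "lnr", "lcr", "lnc", "unc", "ucr", "unr")
    numeric = {"value", "lnr", "lcr", "lnc", "unc", "ucr", "unr"}
    sensors = {}
    for line in text.splitlines():
        columns = [col.strip() for col in line.split("|")]
        if len(columns) != len(fields) + 1:
            continue  # skip blank or unexpected lines
        record = {}
        for key, raw in zip(fields, columns[1:]):
            record[key] = to_number(raw) if key in numeric else raw
        sensors[columns[0]] = record
    return sensors


# Hypothetical sample in the expected column layout.
SAMPLE = """\
CPU_0_Temp | 49.000 | degrees C | ok | na | na | na | 85.000 | 90.000 | 95.000
CPU_1_Temp | 51.000 | degrees C | ok | na | na | na | 85.000 | 90.000 | 95.000
"""

sensors = parse_sensor_list(SAMPLE)
print(sensors["CPU_0_Temp"]["value"], sensors["CPU_0_Temp"]["unc"])
```

Thresholds reported as `na` become `None`, so downstream code can tell "no limit defined" apart from a numeric limit.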

“Materials” continued

• Cron (Chronograph)

– Time-based scheduling service in Unix.

– Used for security reasons since the root user is needed to collect data.

• Perl

– Ideal Unix scripting language for the task.

– Interpreted language; no separate compile step.

– Efficient programming language that is powerful for file input and output because of its text manipulation capabilities and fast development cycle.

Task

Create three Perl scripts (temperature, fan speed, voltage) that can be used on any machine regardless of the specific BMC.

– Work first with yili0113, bali0113, and boer0113.

– Cron will run as the root user to call IPMItool and will store the data every 15 minutes in a readable file.

– The scripts will read the data every 15 minutes from the file to produce the current machine parameters and interpret the current status of the machine (OK, WARNING, CRITICAL, UNKNOWN).

– For Nagios, the scripts will return the current status and parameters.

– For Ganglia, the scripts will call the Ganglia command, passing in the parameters.
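
The Ganglia step might look like the following Python sketch. It assumes the "Ganglia command" is `gmetric`, Ganglia's command-line tool for publishing a custom metric; the slides do not name it. The helper name, metric name, and units string are hypothetical. The sketch only builds the argument list and does not execute anything.

```python
def gmetric_command(name, value, units, value_type="float"):
    """Build the argument list for one gmetric call (not executed here)."""
    return [
        "gmetric",
        "--name", name,
        "--value", f"{value:.3f}",
        "--type", value_type,
        "--units", units,
    ]


cmd = gmetric_command("CPU_0_Temp", 49.0, "Celsius")
print(" ".join(cmd))
```

On a node running the Ganglia monitoring daemon, the list could then be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems if a sensor name contains spaces.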

Results

• In a test of the check_cpu_temp.pl script on the bali0113 machine, the following results were produced using the Perl interpreter:

“Temperature OK - CPU_0_Temp=49.000, CPU_1_Temp=51.000 | CPU_0_Temp=49.000 CPU_1_Temp=51.000”
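
The result line follows the usual Nagios plugin layout: human-readable text, a `|` separator, then performance data. A plugin also reports its status through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The Python sketch below reproduces that layout; it is an illustration only, not the check_cpu_temp.pl source, and the function name is hypothetical.

```python
# Standard Nagios plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def plugin_output(label, status, readings):
    """Return (text, exit_code) in the 'text | perfdata' plugin format."""
    pairs = [f"{name}={value:.3f}" for name, value in readings]
    text = f"{label} {status} - {', '.join(pairs)} | {' '.join(pairs)}"
    return text, EXIT_CODES[status]


line, code = plugin_output(
    "Temperature", "OK", [("CPU_0_Temp", 49.0), ("CPU_1_Temp", 51.0)]
)
print(line)
# A real plugin would finish by exiting with `code`, e.g. sys.exit(code).
```

Nagios reads the text before the `|` for display and alerts, while the part after it is what graphing and database tools pick up.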

The Scripts as Nagios Plugins


Ganglia work is still underway!

Conclusions

• Perl scripts, Nagios monitoring, and graphics tools work successfully.

• All three test machines are running with acceptable temperatures, fan speeds, and power supply voltages. This suggests that current cooling systems and electrical supplies in blackboxes are effective. The monitoring must be done on all servers, however, for a complete evaluation to be possible.

• The HPC group is much closer to ensuring optimal computing performance for the lab.

Future Work

• The scripts are portable.

– 3 test machines

– KIPAC machines

– All blackbox machines upon approval

– Possibly more to come

• The scripts can also be edited to monitor different parameters.

Acknowledgements

Thank you to the U.S. Department of Energy Office of Science and the Stanford Linear Accelerator Center for the opportunity to participate in the Science Undergraduate Laboratory Internships program. Thank you to Steve, Susan, and Farah. Thank you to my mentor, Yemi Adesanya, for his mentorship throughout the project.

