Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips

Caitie McCaffrey, Yemi Adesanya

August 2006

“The SLAC Computing Services Group is dedicated to providing leadership and support in computing and communications to the laboratory as a whole, and to physics research, in particular”

Major Concerns
• Power consumption
• Cooling
• Monitoring

What Is My Computer Doing???
• I/O Rate
• CPU usage
• Memory Usage
• Temperature
• Fan Speed
• Load

Monitoring Software
- low overhead
- scalable
- low impact on individual machines

“Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids”

• Scalable: overhead increases with the number of clusters, not the number of nodes
• Works on multiple operating systems
• Stores data in a Round Robin Database (RRD)
• Measures metrics like CPU usage, load, I/O rate, and memory usage

Components: GMOND, GMETAD, GMETRIC

Ganglia Architecture
http://www.slac.stanford.edu/comp/unix/ganglia/index.html

[Figure: Ganglia architecture diagram with nodes labeled A, B, C and 1, 2, 3, 4]
• Cluster One: all machines know the state of the entire cluster
• Cluster Two: machines 1 and 3 know the state of the entire cluster
• Collector: updates the RRD, polls the clusters periodically

GMETRIC

Allows users to monitor additional metrics beyond the core set collected by the gmond daemon. Each metric has a:
• Name
• Value
• Type
• Units

gmetric --conf=/var/ganglia/gmond.conf -n CPUTemp1 -v 75 -t uint8 -u Celsius

Useful because it lets us be more machine-specific: we can monitor temperature and fan speed.
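The gmetric call above could be assembled from a script along the following lines. This is only a sketch: the helper names gmetric_command and publish are hypothetical, and the long option names and the list of value types are assumed to match gmetric's own usage text, so they should be checked against `gmetric --help` on the target machine.

```python
import shlex
import subprocess

# Value types assumed from gmetric's usage text; verify on your installation.
VALID_TYPES = {
    "string", "int8", "uint8", "int16", "uint16",
    "int32", "uint32", "float", "double",
}


def gmetric_command(name, value, metric_type, units,
                    conf="/var/ganglia/gmond.conf"):
    """Build the argument list for one gmetric call (hypothetical helper)."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported gmetric type: {metric_type}")
    return [
        "gmetric",
        f"--conf={conf}",
        f"--name={name}",
        f"--value={value}",
        f"--type={metric_type}",
        f"--units={units}",
    ]


def publish(name, value, metric_type, units):
    """Run gmetric; raises CalledProcessError if the command fails."""
    subprocess.run(gmetric_command(name, value, metric_type, units), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this works without Ganglia.
    print(shlex.join(gmetric_command("CPUTemp1", 75, "uint8", "Celsius")))
```

Building the command as a list rather than a single string avoids shell quoting problems when a metric name or unit contains spaces.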

A little bit on hardware

Noma - batch machines
• Tyan Thunder LE-T motherboard
• Winbond w83782d (lm_sensors compatible)
• 2 Pentium III processors

Why is temperature important?
• Chip specifications give a temperature range
• Behavior is unpredictable outside the temperature range
• Clues to weird machine behavior
• Pentiums have a max temp of 77°-82° C

[Figure: Tyan Thunder LE-T]

What’s a Noma?
• Horse from Noma County, Japan
• Smallest native Japanese pony: 10.1-10.3 hands
• Super rare: 27 purebred Nomas left (1988)

Some more machines
COB, DON, TORI, MORAB, ORLOV, NOMA

caitiem@noma0449 $ sensors
w83782d-i2c-0-29
Adapter: SMBus PIIX4 adapter at 0580
Algorithm: Non-I2C SMBus adapter
VCore 1: +1.48 V (min = +4.08 V, max = +4.08 V)
VCore 2: +1.26 V (min = +4.08 V, max = +4.08 V)
+3.3V: +3.37 V (min = +2.97 V, max = +3.63 V)
+5V: +4.97 V (min = +4.50 V, max = +5.48 V)
+12V: +12.08 V (min = +10.79 V, max = +13.11 V)
-12V: -1.03 V (min = -13.21 V, max = -10.90 V)
-5V: +2.84 V (min = -5.51 V, max = -4.51 V)
V5SB: +5.12 V (min = +4.50 V, max = +5.48 V)
VBat: +3.34 V (min = +2.70 V, max = +3.29 V)
fan1: 8231 RPM (min = 3000 RPM, div = 2)
fan2: 8333 RPM (min = 3000 RPM, div = 2)
fan3: 0 RPM (min = 3000 RPM, div = 2)
temp1: +77°C (limit = +60°C) sensor = thermistor
    ALARM
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
    ALARM
temp3: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
    ALARM
vid: +1.450 V
alarms: Chassis intrusion detection ALARM
beep_enable:
    Sound alarm disabled

Perl

Fills the gap between low-level languages like C and C++ and high-level languages like shell.
- mostly fast
- basically unlimited
- good for working with text
- portable

Regular Expressions
/^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)/

matches
temp1: +77°C (limit = +60°C) sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
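To show what the regular expression captures, here is a small sketch that applies the same pattern to a few lines of sensors output. It is written in Python rather than the Perl used in the project, purely for illustration. The fan pattern FAN_RE and the function name parse_sensors are assumptions that do not appear on the slides.

```python
import re

# Same pattern as on the slide: group 1 is the sensor number,
# group 2 is the reading in degrees Celsius.
TEMP_RE = re.compile(r"^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)")

# Hypothetical companion pattern for the fan lines (not from the slides).
FAN_RE = re.compile(r"^fan([0-9]):\s+([0-9]+) RPM")


def parse_sensors(text):
    """Return ({temp number: degrees C}, {fan number: RPM}) from sensors output."""
    temps, fans = {}, {}
    for line in text.splitlines():
        line = line.strip()
        match = TEMP_RE.match(line)
        if match:
            temps[int(match.group(1))] = float(match.group(2))
            continue
        match = FAN_RE.match(line)
        if match:
            fans[int(match.group(1))] = int(match.group(2))
    return temps, fans


SAMPLE = """\
fan1: 8231 RPM (min = 3000 RPM, div = 2)
fan3: 0 RPM (min = 3000 RPM, div = 2)
temp1: +77°C (limit = +60°C) sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
"""

if __name__ == "__main__":
    print(parse_sensors(SAMPLE))
```

Because the pattern requires a leading "+", a negative temperature reading would not match; that is harmless for CPU temperatures but worth knowing if the pattern is reused.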

Sample Time - Decreasing

• Time interval = 12.15 minutes
• Fri Aug 11 03:04:05 PDT 2006
• FanSpeed1 8035
• FanSpeed2 7941
• Temp 1: 77
• Change: 0
• Temp 2: 64.0
• Change: 0
• Temp 3: 64.0
• Change: 1
• Time interval = 9.8415 minutes
• Fri Aug 11 03:16:15 PDT 2006

Parameters
• Trigger = 0.5 degrees
• Decrement = 0.9
• MaxTime = 15 minutes
• MinTime = 1 minute

NewTime = OldTime * Decrement ^ (Change / Trigger)

* If NewTime < MinTime, then NewTime = MinTime

NewTime = 12.15 * 0.9 ^ (1 / 0.5) = 9.8415

We want the sample time to decrease faster when temperatures are changing faster.
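The shrinking rule can be written out as a short function. This is a minimal sketch using the parameter values from the slide. The function name decreased_interval is hypothetical, and how a single Change value is chosen when several sensors are read is not stated in the slides, so the caller supplies it.

```python
TRIGGER = 0.5     # degrees
DECREMENT = 0.9   # multiplier applied once per Trigger-sized step of change
MIN_TIME = 1.0    # minutes


def decreased_interval(old_minutes, change_degrees):
    """NewTime = OldTime * Decrement ^ (Change / Trigger), floored at MinTime."""
    new_minutes = old_minutes * DECREMENT ** (change_degrees / TRIGGER)
    return max(new_minutes, MIN_TIME)


if __name__ == "__main__":
    # Slide example: a 1 degree change is two trigger steps, so the
    # interval is multiplied by 0.9 twice.
    print(round(decreased_interval(12.15, 1.0), 4))
```

Putting Change / Trigger in the exponent is what makes the interval shrink faster for larger changes: each additional 0.5 degrees multiplies the interval by another factor of 0.9.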

Sample Time - Increasing

• Time interval = 12.15 minutes
• Fri Aug 11 08:25:18 PDT 2006
• Found FanSpeed1 8035
• Found FanSpeed2 7941
• Temp 1: 77
• Change: 0
• Temp 2: 64.0
• Change: 0
• Temp 3: 64.0
• Change: 0
• Time interval = 13.5 minutes
• Fri Aug 11 08:37:28 PDT 2006

Parameters
• Trigger = 0.5 degrees
• Decrement = 0.9
• MaxTime = 15 minutes
• MinTime = 1 minute

NewTime = OldTime / Decrement

NewTime = 12.15 / 0.9 = 13.5

We want the sample time to increase when the temperature is changing slowly or not at all.

* If we increase by large amounts, we could miss valuable data
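The growth rule can be sketched the same way. The slide gives MaxTime = 15 minutes but does not spell out the cap, so clamping at MaxTime is an assumption here, as are the function names increased_interval and quiet_period.

```python
DECREMENT = 0.9
MAX_TIME = 15.0   # minutes


def increased_interval(old_minutes):
    """NewTime = OldTime / Decrement, capped at MaxTime (the cap is assumed)."""
    return min(old_minutes / DECREMENT, MAX_TIME)


def quiet_period(start_minutes, samples):
    """Intervals over a run of samples in which the temperature does not change."""
    intervals = [start_minutes]
    for _ in range(samples):
        intervals.append(increased_interval(intervals[-1]))
    return intervals


if __name__ == "__main__":
    # Starting from the slide's 12.15 minutes, the interval grows by about
    # 11% per quiet sample until it reaches the 15 minute ceiling.
    print([round(minutes, 2) for minutes in quiet_period(12.15, 3)])
```

Dividing by 0.9 grows the interval gently, which fits the slide's warning that large increases could miss valuable data.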

[Figure: noma0450 and noma0449]

Currently up and running on two Nomas:
• Noma0449
• Noma0450

Will be installed on all Nomas

Can be used on any Ganglia-monitored machine with a compatible Winbond chip

Many thanks to the DOE, the SCCS systems group, and especially Yemi Adesanya, John Goebel, & Karl Amrhein for all their help throughout the summer.

Smartmontools for SCSI devices

• Command: smartctl -l error /dev/sda

Error counter log:
          Errors Corrected    Total      Total   Correction     Gigabytes    Total
              delay:       [rereads/    errors   algorithm      processed    uncorrected
            minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0

Non-medium error count: 0

http://smartmontools.sourceforge.net/smartmontools_scsi.html
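One way to pull the numbers out of this log is sketched below. It assumes the seven-column layout shown on the slide; other smartmontools versions may print different headers or extra rows, and the names ErrorCounters and parse_error_log are hypothetical.

```python
from typing import NamedTuple


class ErrorCounters(NamedTuple):
    minor_delay: int            # corrected with minor (fast) delay
    major_delay: int            # corrected with major (slow) delay
    rereads_rewrites: int       # corrected by retrying
    total_corrected: int
    algorithm_invocations: int
    gigabytes_processed: float  # units of 10^9 bytes
    uncorrected: int


def parse_error_log(text):
    """Return {"read": ErrorCounters, "write": ErrorCounters} for rows found."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.strip().partition(":")
        fields = rest.split()
        # Header lines also contain a colon, so check the label and field count.
        if sep and label in ("read", "write") and len(fields) == 7:
            rows[label] = ErrorCounters(
                int(fields[0]), int(fields[1]), int(fields[2]),
                int(fields[3]), int(fields[4]), float(fields[5]),
                int(fields[6]),
            )
    return rows


SAMPLE = """\
Error counter log:
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0
Non-medium error count: 0
"""

if __name__ == "__main__":
    for name, counters in parse_error_log(SAMPLE).items():
        print(name, counters)
```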

Corrected Errors

• Minor / Fast
- Correction algorithm works successfully
- No delay to reading later sectors
- These are OK

• Major / Slow
- Correction algorithm works successfully
- Delay in reading later sectors
- Not so good

Uncorrected Errors
• Correction algorithm fails
• Very bad
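The three severity levels above can be expressed as a simple check. This is a sketch only: the function name classify_errors is hypothetical, and reporting just the worst level present is an assumption about how the categories would be combined.

```python
def classify_errors(minor, major, uncorrected):
    """Return the worst severity among the three counters, in the slide's words."""
    if uncorrected > 0:
        return "very bad"      # correction algorithm failed
    if major > 0:
        return "not so good"   # corrected, but later sectors were delayed
    return "ok"                # only fast corrections, or no errors at all


if __name__ == "__main__":
    # Read row from the smartctl example: many fast corrections, nothing worse.
    print(classify_errors(minor=234237, major=0, uncorrected=0))
```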

Other Information
• Total [rereads/rewrites]: errors corrected by applying retries
• Total errors corrected: number of all correctable errors
• Correction algorithm invocations: number of times the algorithm is used
• Gigabytes processed: amount of data (in units of 10^9 bytes) successfully and unsuccessfully read or written

[Figure: annotated error counter log with three callouts]
• This indicates there might be a problem
• This should be a flag as well
• This is OK; it's correcting the errors and not losing any time doing so

errorsWatch

Monitors
• Read Uncorrected Errors
• Read Delayed Errors
• Read No Delay Errors
• Write Uncorrected Errors
• Write Delayed Errors
• Write No Delay Errors
• Total Uncorrected Errors
• Total Delayed Errors

Collects Data Once a Day

- Noma
- Don
- Tori
- Cob
- Morab
- Orlov
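The eight monitored values could be derived from the read and write rows of the error counter log roughly as follows. This is a sketch under two assumptions that the slides do not confirm: that "Delayed" and "No Delay" correspond to the major and minor delay columns, and that each total is the sum of the read and write counts. The function name errors_watch_metrics and the dictionary keys are hypothetical.

```python
def errors_watch_metrics(read, write):
    """Build the eight errorsWatch values from per-direction counters.

    Each argument is a dict with the keys "no_delay", "delayed" and
    "uncorrected" (hypothetical names for the smartctl columns).
    """
    return {
        "Read Uncorrected Errors": read["uncorrected"],
        "Read Delayed Errors": read["delayed"],
        "Read No Delay Errors": read["no_delay"],
        "Write Uncorrected Errors": write["uncorrected"],
        "Write Delayed Errors": write["delayed"],
        "Write No Delay Errors": write["no_delay"],
        # Totals assumed to be read + write.
        "Total Uncorrected Errors": read["uncorrected"] + write["uncorrected"],
        "Total Delayed Errors": read["delayed"] + write["delayed"],
    }


if __name__ == "__main__":
    # Counts taken from the smartctl example earlier in the slides.
    read_row = {"no_delay": 234237, "delayed": 0, "uncorrected": 0}
    write_row = {"no_delay": 0, "delayed": 0, "uncorrected": 0}
    for name, value in errors_watch_metrics(read_row, write_row).items():
        print(f"{name}: {value}")
```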

