Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips



Caitie McCaffrey, Yemi Adesanya
August 2006
“The SLAC Computing Services Group is dedicated to
providing leadership and support in computing and
communications to the laboratory as a whole, and to physics
research, in particular”

Major Concerns
• Power consumption
• Cooling
• Monitoring
What Is My Computer Doing???

•   I/O Rate
•   CPU usage
•   Memory Usage
•   Temperature
•   Fan Speed
•   Load

Monitoring Software
-low overhead
-scalable
-low impact on individual machines
    “Ganglia is a scalable distributed monitoring system for
     high-performance computing systems such as clusters
                            and Grids”

•    Scalable: overhead grows with the number of clusters, not the number of nodes
•    Works on multiple operating systems
•    Stores data in a Round Robin Database (RRD)
•    Measures metrics like CPU usage, load, I/O rate, and memory usage


     GMOND, GMETAD, GMETRIC
Ganglia Architecture
http://www.slac.stanford.edu/comp/unix/ganglia/index.html

[Diagram: one node updates the RRD and polls the clusters periodically]
•    Cluster One (machines A, B, C): all machines know the state of the entire cluster
•    Cluster Two (machines 1, 2, 3, 4): machines 1 and 3 know the state of the entire cluster
GMETRIC
Lets users add their own metrics to the core set monitored by
           the gmond daemon

•    Name
•    Value
•    Type
•    Units

 gmetric --conf=/var/ganglia/gmond.conf -n CPUTemp1 -v 75 -t uint8 -u Celsius

Useful because it lets us be more machine specific:
       we can monitor temperature and fan speed
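
A minimal sketch of how a script might assemble the gmetric call shown above. The helper name `gmetric_argv` is hypothetical and not from the slides; it only builds the argument list (name, value, type, units) and does not run anything.

```python
def gmetric_argv(name, value, mtype, units, conf="/var/ganglia/gmond.conf"):
    """Build the argument list for one gmetric call (hypothetical helper)."""
    return [
        "gmetric",
        f"--conf={conf}",   # path to the gmond configuration file
        "-n", name,         # metric name
        "-v", str(value),   # metric value
        "-t", mtype,        # metric type, e.g. uint8
        "-u", units,        # units label
    ]

argv = gmetric_argv("CPUTemp1", 75, "uint8", "Celsius")
print(" ".join(argv))
```

On a machine where gmetric is installed, the list could be handed to `subprocess.run(argv, check=True)`; passing a list rather than a shell string avoids quoting problems in metric names.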
A little bit on hardware
Noma - batch machines
• Tyan Thunder LE-T motherboard
• Winbond w83782d (lm_sensor compatible)
• 2 Pentium III processors




Why is temperature important?
• Chip specifications give a temperature range
• Behavior is unpredictable outside that range
• Temperature gives clues to weird machine behavior
• Pentiums have a maximum temperature of 77-82 °C
What’s a Noma?

•    Horse from Noma County, Japan
•    Smallest native Japanese pony, 10.1-10.3 hands
•    Very rare: 27 purebred Nomas left (1988)

Some more machines

•    DON
•    COB
•    TORI
•    ORLOV
•    MORAB
caitiem@noma0449 $ sensors
w83782d-i2c-0-29
Adapter: SMBus PIIX4 adapter at 0580
Algorithm: Non-I2C SMBus adapter
VCore 1:   +1.48 V  (min =  +4.08 V, max =  +4.08 V)
VCore 2:   +1.26 V  (min =  +4.08 V, max =  +4.08 V)
+3.3V:     +3.37 V  (min =  +2.97 V, max =  +3.63 V)
+5V:       +4.97 V  (min =  +4.50 V, max =  +5.48 V)
+12V:     +12.08 V  (min = +10.79 V, max = +13.11 V)
-12V:      -1.03 V  (min = -13.21 V, max = -10.90 V)
-5V:       +2.84 V  (min =  -5.51 V, max =  -4.51 V)
V5SB:      +5.12 V  (min =  +4.50 V, max =  +5.48 V)
VBat:      +3.34 V  (min =  +2.70 V, max =  +3.29 V)
fan1:     8231 RPM  (min = 3000 RPM, div = 2)
fan2:     8333 RPM  (min = 3000 RPM, div = 2)
fan3:        0 RPM  (min = 3000 RPM, div = 2)
temp1:       +77°C  (limit = +60°C)                       sensor = thermistor   ALARM
temp2:     +65.0°C  (limit = +60°C, hysteresis = +50°C)   sensor = thermistor   ALARM
temp3:     +65.0°C  (limit = +60°C, hysteresis = +50°C)   sensor = thermistor   ALARM
vid:      +1.450 V
alarms:   Chassis intrusion detection                     ALARM
beep_enable:
          Sound alarm disabled
Perl

Fills the gap between low-level languages like C and C++ and
   high-level languages like shell.

-mostly fast
-basically unlimited
-good for working with text
-portable

Regular Expressions
   /^temp([0-9]):\s+\+([0-9]+\.?[0-9]*)/
matches
   temp1:    +77°C (limit = +60°C)               sensor = thermistor
   temp2:    +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
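
To illustrate what the pattern captures, here is a small Python port of it, offered as a sketch and not as the authors' Perl script. The function name `parse_temps` is made up for this example, and the port allows at most one decimal point in the reading.

```python
import re

# Group 1 is the sensor number, group 2 the reading in degrees Celsius.
TEMP_RE = re.compile(r"^temp([0-9]):\s+\+([0-9]+\.?[0-9]*)")

def parse_temps(sensors_output):
    """Return {sensor_number: degrees_C} for every tempN line (illustrative)."""
    temps = {}
    for line in sensors_output.splitlines():
        # Leading whitespace is stripped so indented lines still match ^temp.
        m = TEMP_RE.match(line.strip())
        if m:
            temps[int(m.group(1))] = float(m.group(2))
    return temps

sample = """\
temp1:       +77°C (limit = +60°C)             sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
fan1: 8231 RPM (min = 3000 RPM, div = 2)
"""
print(parse_temps(sample))  # {1: 77.0, 2: 65.0}
```

The fan line is ignored because it does not start with `temp`; fan speeds would need a separate pattern.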
Sample Time - Decreasing
Want sample time to decrease faster when temperatures are changing faster

• Time interval = 12.15 minutes
•   Fri Aug 11 03:04:05 PDT 2006
•   FanSpeed1 8035
•   FanSpeed2 7941
•   Temp 1: 77
•   Change: 0
•   Temp 2: 64.0
•   Change: 0
•   Temp 3: 64.0
•   Change: 1
• Time interval = 9.8415 minutes
•   Fri Aug 11 03:16:15 PDT 2006

       NewTime = OldTime * Decrement ^ (Change / Trigger)
       *if NewTime < MinTime then NewTime = MinTime

       NewTime = 12.15 * 0.9 ^ (1 / 0.5) = 9.8415

Parameters
•Trigger = 0.5 degrees
•Decrement = 0.9
•MaxTime = 15 minutes
•MinTime = 1 minute
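
The decreasing rule can be checked with a few lines of Python. This is a sketch under assumptions: the function name `decreased_interval` is hypothetical, and it takes a single `change` value (the slide's example uses the 1-degree change on Temp 3, so the largest change is presumably the one that counts). With Trigger = 0.5 the exponent is 1 / 0.5 = 2.

```python
def decreased_interval(old_minutes, change, trigger=0.5, decrement=0.9,
                       min_minutes=1.0):
    """Shrink the sample interval; larger changes shrink it faster."""
    new_minutes = old_minutes * decrement ** (change / trigger)
    # Never sample more often than MinTime allows.
    return max(new_minutes, min_minutes)

# Slide example: 12.15 * 0.9 ^ (1 / 0.5)
print(round(decreased_interval(12.15, 1), 4))  # 9.8415
```

Because the change appears in the exponent, a 2-degree change applies the decrement four times, so the interval drops quickly during fast temperature swings.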
Sample Time - Increasing
Want sample time to increase when temperature is changing slowly or not at all
*If we increase by large amounts we could miss valuable data

•   Time interval = 12.15 minutes
•   Fri Aug 11 08:25:18 PDT 2006
•   Found FanSpeed1 8035
•   Found FanSpeed2 7941
•   Temp 1: 77
•   Change: 0
•   Temp 2: 64.0
•   Change: 0
•   Temp 3: 64.0
•   Change: 0
•   Time interval = 13.5 minutes
•   Fri Aug 11 08:37:28 PDT 2006

                  NewTime = OldTime / Decrement
                  NewTime = 12.15 / 0.9 = 13.5

Parameters
•Trigger = 0.5 degrees
•Decrement = 0.9
•MaxTime = 15 minutes
•MinTime = 1 minute
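
A matching sketch for the increasing rule. The function name `increased_interval` is hypothetical, and capping the result at MaxTime is an assumption: the slide lists MaxTime as a parameter but does not spell out the clamp.

```python
def increased_interval(old_minutes, decrement=0.9, max_minutes=15.0):
    """Grow the sample interval by one gentle step when readings are steady."""
    new_minutes = old_minutes / decrement
    # Assumed clamp: never wait longer than MaxTime between samples.
    return min(new_minutes, max_minutes)

# Slide example: 12.15 / 0.9
print(round(increased_interval(12.15), 4))  # 13.5
```

Dividing by 0.9 lengthens the interval by only about 11 percent per step, which fits the slide's caution that large increases could miss valuable data.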
[Figures: noma0449 and noma0450]
Up and running on two Nomas currently
• Noma0449
• Noma0450


Will be installed on all Nomas

Can be used on any Ganglia monitored machine with a
 compatible Winbond chip



 Many thanks to the DOE, the SCCS systems group, and especially
 Yemi Adesanya, John Goebel, & Karl Amrhein for all their help
 throughout the summer.
Smartmontools for SCSI devices
• Command: smartctl -l error /dev/sda
  Error counter log:

           Errors Corrected     Total       Total     Correction    Gigabytes      Total
               delay:         [rereads/    errors     algorithm     processed   uncorrected
           minor | major      rewrites]   corrected  invocations  [10^9 bytes]    errors
  read:   234237       0              0     234237       234237       605.516          0
  write:       0       0              0          0            0      1457.589          0

  Non-medium error count:       0




  http://smartmontools.sourceforge.net/smartmontools_scsi.html
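
The read and write rows of the error counter log can be turned into named fields with a short parser. This is an illustrative sketch: the field names in `FIELDS` and the function `parse_error_counters` are invented here, and the column order is taken from the log shown above.

```python
# Hypothetical field names, in the column order of the log above.
FIELDS = ("minor_delay", "major_delay", "rereads_rewrites",
          "errors_corrected", "algorithm_invocations",
          "gigabytes_processed", "uncorrected")

def parse_error_counters(text):
    """Return {'read': {...}, 'write': {...}} from smartctl -l error output."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        # A data row is a label followed by exactly seven numeric columns.
        if len(parts) == 8 and parts[0] in ("read:", "write:"):
            values = [float(p) if "." in p else int(p) for p in parts[1:]]
            rows[parts[0].rstrip(":")] = dict(zip(FIELDS, values))
    return rows

sample = """\
read: 234237 0 0 234237 234237 605.516 0
write: 0 0 0 0 0 1457.589 0
"""
counters = parse_error_counters(sample)
print(counters["read"]["minor_delay"], counters["write"]["gigabytes_processed"])
```

Header lines and the "Non-medium error count" line are skipped because they do not start with `read:` or `write:`.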
Corrected Errors
• Minor / Fast
  • Correction algorithm works successfully
  • No delay to reading later sectors
  • These are OK

• Major / Slow
  • Correction algorithm works successfully
  • Delay in reading later sectors
  • Not so good

• Uncorrected Errors
  • Correction algorithm fails
  • Very bad
Other Information
• Total [rereads/rewrites] – errors corrected by applying retries

• Total errors corrected – number of all correctable errors

• Correction Algorithm Invocations – number of times the correction
  algorithm is used

• Gigabytes Processed – amount of data (in units of 10^9 bytes) read or
  written, whether successfully or not
• This indicates there might be a problem
• This should be a flag as well
• This is OK: it's correcting the errors and not losing any time doing so
errorsWatch
Monitors
  •   Read Uncorrected Errors
  •   Read Delayed Errors
  •   Read No Delay Errors
  •   Write Uncorrected Errors
  •   Write Delayed Errors
  •   Write No Delay Errors
  •   Total Uncorrected Errors
  •   Total Delayed Errors

Machines
  •   Noma
  •   Don
  •   Tori
  •   Cob
  •   Morab
  •   Orlov

  Collects Data Once a Day
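
One way the eight monitored values could be derived from the smartctl counters is sketched below. It is not the actual errorsWatch script. It assumes that "No Delay" corresponds to the minor-delay column, "Delayed" to the major-delay column, and that the totals are read plus write; the function name `errors_watch_metrics` is hypothetical.

```python
def errors_watch_metrics(read, write):
    """read and write are (minor_delay, major_delay, uncorrected) tuples."""
    metrics = {}
    for label, (minor, major, uncorrected) in (("Read", read), ("Write", write)):
        metrics[f"{label} Uncorrected Errors"] = uncorrected
        metrics[f"{label} Delayed Errors"] = major     # assumed: major delay
        metrics[f"{label} No Delay Errors"] = minor    # assumed: minor delay
    # Assumed: totals are the sum of the read and write counters.
    metrics["Total Uncorrected Errors"] = read[2] + write[2]
    metrics["Total Delayed Errors"] = read[1] + write[1]
    return metrics

# Values from the sample error counter log for /dev/sda.
metrics = errors_watch_metrics(read=(234237, 0, 0), write=(0, 0, 0))
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Each name and value pair could then be reported once a day as a custom Ganglia metric.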

								