Administration Tools for Managing Large Scale Linux Cluster


CRC, KEK, Japan
S. Kawabata, A. Manabe
Linux PC Clusters in KEK
PC Cluster 1
Pentium III Xeon 500MHz
144 CPUs (36 nodes)
PC Cluster 2
Pentium III 800MHz
80 CPUs (40 nodes)
  2002/6/26        ACAT2002              3
PC Cluster 3
Pentium III Xeon 700MHz
320 CPUs (80 nodes)
PC Cluster 4
1U server
Pentium III 1.2GHz
256 CPUs (128 nodes)
3U
PC Cluster 5
Blade server
LP Pentium III 700MHz
40 CPUs (40 nodes)
PC clusters

Already more than 400 nodes installed.
     Counting only middle-size and larger PC clusters.
     All PC clusters are managed by individual user groups.
A major experiment group plans to install several hundred
 blade server nodes this year.



Center Machine

KEK Computer Center plans to have >1000 nodes in the near future (~2004).
They will be installed under a `~4 year rental' contract.
The system will be shared among many user groups (not dedicated to one group).
     The system partitioning will vary according to each group's demand for CPU power.
PC cluster for system R&D
            Fujitsu TS225 50 nodes
                PentiumIII 1GHz x 2CPU
                512MB memory
                31GB disk
                100BaseTX x 2
                1U rack mount model
                RS232C x2
                       Remote BIOS setting
                       Remote reset/power-off



Necessary admin tools

Installation /update

Command Execution

Configuration

Status Monitoring


Installation tool
   Image Cloning
      Install system/applications, then copy the disk partition image to the nodes.



Installation tool
   Package server
      A package server (package information DB and package archive) serves the clients.




Remote Installation via NW

Cloning disk image
     SystemImager (VA) http://systemimager.sourceforge.net/
     CATS-i (Soongsil Univ.)
     CloneIt http://www.ferzkopp.net/Software/CloneIt/
     Commercial: ImageCast, Ghost, ...
Packages/Applications installation
     Kickstart + rpm (RedHat)
     LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui
     Lucie (TiTech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/

Dolly+

We developed `dolly+', an installer that clones disk images over the network.
WHY ANOTHER?
     We install/update
             frequently (according to user needs),
             100~1000 nodes at a time.
     Traditional server/client software suffers from a server bottleneck.
     Multicast copy of a ~GB image seems unstable.
      (No free software?)


(Few) Server - (Many) Client model
       The server can be a daemon process.
        (You don't need to start it by hand.)
       Performance does not scale with the number of nodes.
          • Server bottleneck. Network congestion.

Multicasting or Broadcasting
       No server bottleneck.
       Gets the maximum performance out of a network whose
        switch fabric supports multicasting.
       A node failure does not affect the whole process very
        much, so it could be robust.
       Since a failed node needs a re-transfer, the speed is governed
        by the slowest node, as in the RING topology.
       Uses UDP, not TCP, so the application must
        take care of transfer reliability.
Dolly and Dolly+
Dolly
 A Linux application to copy/clone files and/or
  disk images among many PCs through a network.
 Dolly was originally developed by the CoPs project at ETH
  (Switzerland) and is free software.
Dolly+ features
 Transfers/copies sequential files (no 2GB size limit) and/or normal files
  over a TCP/IP network (optional: decompress and untar on the fly).
 Virtual RING network connection topology.
 Pipelining and multi-threading mechanism for speed-up.
 Fail recovery mechanism for robust operation.
Dolly: Virtual Ring Topology
Server = host having the original image
 • The physical network connection can be whatever you like.
 • Logically, `Dolly' makes a ring chain of nodes, which is specified
   in dolly's config file.
 • Although each transfer is only between two adjacent nodes, it can
   utilize the maximum performance of a switched network with
   full duplex ports.
 • Good for a network composed of many switches.
[Figure legend: node PC; network hub/switch; physical connection; logical (virtual) connection]
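The logical chain described above can be illustrated with a few lines of Python. This is only a sketch of the idea, not dolly+ code: the function name `ring_neighbors` and the host names are hypothetical, and whether the last node connects back to the server is not stated on the slide, so the sketch leaves the last node without a downstream neighbour.

```python
def ring_neighbors(server, clients):
    """Return {host: (upstream, downstream)} for a logical chain.

    The chain starts at the server and visits the clients in the listed
    order, independent of how the hosts are physically cabled.
    """
    chain = [server] + list(clients)
    links = {}
    for i, host in enumerate(chain):
        upstream = chain[i - 1] if i > 0 else None
        downstream = chain[i + 1] if i + 1 < len(chain) else None
        links[host] = (upstream, downstream)
    return links


# Hypothetical host names, for illustration only.
links = ring_neighbors("n000", ["n001", "n002", "n003"])
for host, (up, down) in links.items():
    print(f"{host}: receives from {up}, sends to {down}")
```

Each host talks to at most two neighbours, which is why a full duplex switch port can receive and send at full speed at the same time.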
Cascade Topology

The server bottleneck can be overcome.
Cannot get the maximum network performance, but better than
 the topology with many clients on only one server.
Weak against a node failure.
 A failure spreads down the cascade as well and is
 difficult to recover from.
PIPELINING & multi threading
[Figure: the file (BOF ... EOF) is split into numbered chunks 1, 2, 3, ... (file chunk = 4MB). Chunks flow from the Server over the network to Node 1, Node 2 and the next node; 3 threads run in parallel on a node.]
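The chunk pipeline in the figure can be mimicked in a small Python simulation. This is a sketch under assumptions, not the dolly+ implementation: in-memory queues stand in for the network, each node runs a single forwarding thread (the slide mentions 3 threads per node, which are not modelled here), and the chunk size is shrunk to a few bytes for the demo. All names are hypothetical.

```python
import queue
import threading

CHUNK_SIZE = 4  # bytes, for the demo only; the slide uses 4MB chunks


def split_chunks(data, size):
    """Cut the image into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def node_worker(inbox, outbox, store):
    """Receive chunks, pass each one downstream, and keep a local copy."""
    while True:
        chunk = inbox.get()
        if outbox is not None:
            outbox.put(chunk)   # forward first so the next node can start early
        if chunk is None:       # None is the end-of-file marker
            break
        store.append(chunk)


def clone(data, n_nodes):
    """Push `data` through a chain of n_nodes and return every node's copy."""
    queues = [queue.Queue() for _ in range(n_nodes)]
    stores = [[] for _ in range(n_nodes)]
    threads = []
    for i in range(n_nodes):
        outbox = queues[i + 1] if i + 1 < n_nodes else None
        t = threading.Thread(target=node_worker,
                             args=(queues[i], outbox, stores[i]))
        t.start()
        threads.append(t)
    for chunk in split_chunks(data, CHUNK_SIZE):   # the server side
        queues[0].put(chunk)
    queues[0].put(None)
    for t in threads:
        t.join()
    return [b"".join(s) for s in stores]


image = b"0123456789abcdefghij"
copies = clone(image, n_nodes=3)
print(all(copy == image for copy in copies))
```

Because a node forwards a chunk as soon as it arrives, several chunks are in flight along the chain at once, which is the point of pipelining.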
Fail recovery mechanism
• A single node failure could be a "show stopper" in the RING (= series
  connection) topology.

• Dolly+ provides an automatic `short cut' mechanism for node problems.
     • When a node is in trouble, the upstream node detects it
       by a send time-out.
     • The upstream node negotiates with the downstream node for
       reconnection and re-transfer of a file chunk.
     • The RING topology makes this easy to implement.
[Figure labels: S; time out; short cutting]
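The short-cut step can be sketched as a pure bookkeeping function. This is an illustration of the idea only: the name `short_cut`, the `received` counters and the host names are assumptions, and the real negotiation protocol between nodes is not described on the slide.

```python
def short_cut(chain, failed, received):
    """Drop `failed` from the chain and work out where to resume.

    chain    -- ordered host list, server first
    received -- {host: number of chunks that host has fully received}
    Returns (new_chain, upstream, downstream, resume_from), where
    resume_from is the index of the first chunk the downstream node
    still needs (None if the failed node was the last one).
    """
    idx = chain.index(failed)
    if idx == 0:
        raise ValueError("the server itself cannot be skipped")
    upstream = chain[idx - 1]
    new_chain = chain[:idx] + chain[idx + 1:]
    if idx + 1 < len(chain):
        downstream = chain[idx + 1]
        resume_from = received[downstream]
    else:
        downstream, resume_from = None, None
    return new_chain, upstream, downstream, resume_from


# Hypothetical state: n002 stops responding while chunks are in flight.
chain = ["server", "n001", "n002", "n003"]
received = {"n001": 8, "n002": 6, "n003": 5}
print(short_cut(chain, "n002", received))
```

In this sketch the upstream node must still hold the chunks from `resume_from` onward, so some buffering of recently sent chunks is assumed.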
Re-transfer in short cutting
[Figure: the chunk pipeline (BOF ... EOF, file chunk = 4MB; Server, Node 1, Node 2, next node) with numbered chunks in flight, illustrating re-transfer when a node is short-cut.]
Dolly+: How do you start it on Linux

Server side (which has the original file)
  % dollyS [-v] -f config_file

Nodes side
  % dollyC [-v]

Config file example
  iofiles 3                       # of files to Xfer
  /dev/hda1 > /tmp/dev/hda1
  /data/file.gz >> /data/file
  boot.tar.Z >> /boot
  server n000.kek.jp              server name
  firstclient n001.kek.jp
  lastclient n020.kek.jp
  client 20                       # of client nodes
  n001                            client names
  n002
    :
  n020
  endconfig                       end code

The left side of '>' is the input file on the server. The right side is the output file on the clients.
'>' means dolly+ does not modify the image. '>>' indicates that dolly+ should cook (decompress, untar, ...)
the file according to the name of the file.
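For a cluster with many nodes, a config file like the example above could be generated rather than typed. The following is a sketch that only mimics the layout of the slide's example; it has not been checked against the dolly+ parser, and the function name `make_config` and its parameters are hypothetical.

```python
def make_config(transfers, server, clients, domain=""):
    """Build config text in the layout of the slide's example.

    transfers -- list of (source, op, destination), op is ">" or ">>"
    clients   -- short host names, in chain order
    domain    -- optional suffix for server/firstclient/lastclient lines
    """
    def fqdn(host):
        return f"{host}.{domain}" if domain else host

    lines = [f"iofiles {len(transfers)}"]
    for src, op, dst in transfers:
        if op not in (">", ">>"):
            raise ValueError(f"unknown operator: {op!r}")
        lines.append(f"{src} {op} {dst}")
    lines.append(f"server {fqdn(server)}")
    lines.append(f"firstclient {fqdn(clients[0])}")
    lines.append(f"lastclient {fqdn(clients[-1])}")
    lines.append(f"client {len(clients)}")
    lines.extend(clients)
    lines.append("endconfig")
    return "\n".join(lines) + "\n"


# Values taken from the slide's example.
transfers = [
    ("/dev/hda1", ">", "/tmp/dev/hda1"),
    ("/data/file.gz", ">>", "/data/file"),
    ("boot.tar.Z", ">>", "/boot"),
]
clients = [f"n{i:03d}" for i in range(1, 21)]
config_text = make_config(transfers, "n000", clients, domain="kek.jp")
print(config_text)
```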
Performance of dolly+
[Figure: Elapsed time for cloning vs number of nodes. X axis: number of hosts (1 to 500); Y axis: elapsed time (min), 0 to 15. Curves: total 4GB disk image cloning and total 2GB disk image cloning, with points marked `measured by TS225' and `expected'. 4MB chunk size, ~10MB/s transfer speed. Annotation: less than 5 min for 100 nodes!]
HW: Fujitsu TS225, PenIII 1GHz x2, SCSI disk, 512MB mem, 100BaseT NW
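A back-of-the-envelope model helps to see why the elapsed time grows so slowly with the number of hosts. The formulas below are a simplifying assumption, not taken from the slides: in a pipelined chain each extra node adds roughly one chunk transfer of delay, while a single server sending a full copy to every node over one link takes time proportional to the node count. The 4MB chunk size and ~10MB/s speed come from the plot annotation; 1GB is taken as 1024MB.

```python
def ring_time_sec(image_mb, n_nodes, chunk_mb=4.0, speed_mb_s=10.0):
    """Pipelined chain: the last node lags by about one chunk per hop."""
    return (image_mb + (n_nodes - 1) * chunk_mb) / speed_mb_s


def star_time_sec(image_mb, n_nodes, speed_mb_s=10.0):
    """One server link carrying a separate full copy for every node."""
    return image_mb * n_nodes / speed_mb_s


for n in (1, 10, 100, 500):
    ring_2gb = ring_time_sec(2048, n) / 60
    ring_4gb = ring_time_sec(4096, n) / 60
    star_2gb = star_time_sec(2048, n) / 60
    print(f"{n:4d} nodes: ring 2GB {ring_2gb:6.1f} min, "
          f"ring 4GB {ring_4gb:6.1f} min, star 2GB {star_2gb:8.1f} min")
```

Under these assumptions the 2GB image reaches 100 nodes in roughly four minutes, which is in line with the "less than 5 min for 100 nodes" annotation.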
Dolly+ transfer speed scalability with size of image
[Figure: transferred bytes (MB) vs elapsed time (sec), with an inset.]
PC hardware spec. (server & nodes): 1GHz PentiumIII x 2, IDE-ATA/100 disk, 100BASE-TX net, 256MB memory

setup                  elapsed time   speed
1 server - 1 node      230sec         8.2MB/s
1 server - 2 nodes     252sec         7.4MB/s x2
1 server - 7 nodes     266sec         7.0MB/s x7
1 server - 10 nodes    260sec         7.2MB/s x10
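The "x2", "x7" and "x10" in the speed column indicate that every node receives at the quoted rate, so the aggregate throughput grows with the node count. The arithmetic can be spelled out as follows; the numbers are the ones in the table, while the function name `summarize` is a hypothetical helper.

```python
# (nodes, elapsed seconds, per-node speed in MB/s), copied from the table
ROWS = [(1, 230, 8.2), (2, 252, 7.4), (7, 266, 7.0), (10, 260, 7.2)]


def summarize(rows):
    """Return (nodes, aggregate MB/s, elapsed time relative to one node)."""
    base_elapsed = rows[0][1]
    return [(nodes, round(speed * nodes, 1), round(elapsed / base_elapsed, 2))
            for nodes, elapsed, speed in rows]


for nodes, aggregate, relative in summarize(ROWS):
    print(f"{nodes:2d} nodes: aggregate {aggregate:5.1f} MB/s, "
          f"elapsed time x{relative:.2f} of the 1-node case")
```

Going from 1 to 10 nodes multiplies the delivered data by ten while the elapsed time rises by only about 13%.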
How does dolly+ start after rebooting?
Nodes broadcast over the LAN in search of an
 installation server.
The PXE server responds to the nodes with information
 about each node's IP address and the kernel download server.
The kernel and `ram disk / FS' are multicast
 TFTP'ed to the nodes and the kernel starts.
The kernel hands off to an installation
 script which runs a disk tool and `dolly+'.

How does dolly+ start after rebooting? (continued)
The installation script partitions the hard drive, creates
 file systems and starts the `dolly+' client on the node.
You start the `dolly+' master on the master
 host to start the disk clone process.
The script then configures the individual nodes
 (host name, IP address, etc.).
Each node is then ready to boot from its hard drive for the
 first time.
Remote execution

Administrators sometimes need to issue a
 command to all nodes urgently.
Remote execution could be done with
  rsh/ssh/pikt/cfengine/SUT (mpich)* ...
The key points are
     To make the execution result (fail or success)
      easy to see at a glance.
     Parallel execution among nodes.
  *) Scalable Unix tools for cluster http://www-unix.mcs.anl.gov/sut/
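The two points above (parallel execution and a result visible at a glance) can be sketched with the Python standard library. This is not how any of the listed tools work internally; it is a minimal illustration that assumes password-less ssh to each host. The helper names are hypothetical, and the demo at the bottom runs a local Python process instead of ssh so that it works without a cluster.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def ssh_argv(host, command):
    """Command line for a real run; BatchMode avoids password prompts."""
    return ["ssh", "-o", "BatchMode=yes", host, command]


def local_argv(host, command):
    """Dry-run stand-in: treats `command` as Python code and runs it locally."""
    return [sys.executable, "-c", command]


def run_on_host(host, command, make_argv):
    proc = subprocess.run(make_argv(host, command),
                          capture_output=True, text=True)
    return host, proc.returncode, proc.stdout, proc.stderr


def run_parallel(hosts, command, make_argv=ssh_argv, max_workers=32):
    """Run `command` on all hosts concurrently; results keep the host order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(
            lambda host: run_on_host(host, command, make_argv), hosts))


def print_summary(results):
    """One line per host, so failures stand out at a glance."""
    for host, code, _out, _err in results:
        print(f"{'OK  ' if code == 0 else 'FAIL'} {host} (exit code {code})")


# Hypothetical host names; local dry run instead of ssh.
results = run_parallel(["n001", "n002", "n003"], "print('up')",
                       make_argv=local_argv)
print_summary(results)
```

Threads are sufficient here because each worker spends its time waiting on a subprocess rather than computing.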
WANI

Web-based remote command executor.
     Easy to select the nodes concerned.
     Easy to specify a script or type in commands.
     Issues the commands to nodes in parallel.
     Collects the results after error/failure detection.
Currently, the software is a prototype built by
 combining existing protocols and tools.
 (Anyway, it works!)

WANI is implemented on the `Webmin' GUI
[Screenshot labels: Start; Command input; Node selection]
Command execution result
[Screenshot labels: Switch to another page; Host name; Results from 200 nodes in 1 page]
Error detection

  Exit code
  "fail/failure/error" word check (`grep -i`)
  *sys_errlist[] (perror) list check
  `strings /bin/sh` output check
Frame color represents:
White: initial
Yellow: command starts
Black: finished
[Screenshot labels: 1 2 3 4 BG color]
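The first three checks listed above can be approximated in Python as follows. This is a sketch, not WANI's detector: the function name `detect_error` is hypothetical, `os.strerror` over the `errno` table is used as a stand-in for the C `sys_errlist[]` array, and the `strings /bin/sh` check is left out.

```python
import errno
import os
import re

# Same words as the case-insensitive grep check on the slide.
FAIL_WORDS = re.compile(r"fail|failure|error", re.IGNORECASE)

# Stand-in for sys_errlist[]: every message the OS has for a known errno.
ERRNO_MESSAGES = {os.strerror(code) for code in errno.errorcode
                  if os.strerror(code)}


def detect_error(exit_code, output):
    """Return the list of checks that flag this result (empty means clean)."""
    reasons = []
    if exit_code != 0:
        reasons.append("exit code")
    if FAIL_WORDS.search(output):
        reasons.append("fail word")
    if any(message in output for message in ERRNO_MESSAGES):
        reasons.append("errno message")
    return reasons


print(detect_error(0, "all good"))
print(detect_error(1, "mount: operation FAILED"))
print(detect_error(0, "ls: /x: " + os.strerror(errno.ENOENT)))
```

Returning the list of triggered checks, rather than a single flag, would let a front end pick a different marking per check.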
[Screenshot labels: Stdout output (click here); Stderr output (click here)]
[Diagram labels: WEB Browser; Webmin server; PIKT server; Piktc; Piktc_svc; Node hosts (execution); error detector; print_filter; lpr; Lpd; Command; Result; Result Pages; Error marked Result]
Status Monitoring

Cfengine /Pikt*1/Pica*2
Ganglia*3




 *1 http://pikt.org *2 http://pica.sourceforge.net/wtf.html
 *3 http://ganglia.sourceforge.net


Conclusion
Installation: dolly+
     Installs/updates/switches systems very quickly.
Remote Command Execution
     `Result at a glance' is important for quick iteration.
     Parallel execution is important.
Status monitor

Configuration manager
     Not mature yet.
Software is available from ...
Synchronizing time by rsync
(dir = 4096, file size ~20kB, # of files = 43680, total size = 1.06GB)
[Figure: elapsed time (sec), 0 to 80, vs aggregate of modified file size (MB), 0 to 200. Series: total 1GB, 43680 files, ~20kB/file; total 2GB, ~50kB/file. Fit: y = Σ a_n x^n with a0 = 8.68524263e-01, a1 = 4.24465056e-01, an unlabeled value 2.04576224e+00, and |r| = 9.97385098e-01.]


