ARC @ HOME

							WWW.UNI-C.DK




                                 ARC@HOME
                     - Running compute elements on distributed
                                 virtual machines


                                Michael Grønager, PhD
                              UNI-C / Virtual Reality Center
                              Present: Niels Bohr Institute

               www.uni-c.dk                                      1


                Talk outline

                • Why Screen Saver Science ?

                • First approach: A distributed cluster

                • Second approach: Sandboxing grid jobs for BOINC






                Idle time computing (not the complete list)

                • Condor (dates back to 1986): Runs on Unix and
                  Windows. Best suited for LANs – can run arbitrary
                  programs.

                • SETI (1996-1999): Mainly Windows, uses pull model so
                  well suited for internet. Runs only one program.

                • BOINC (2004): A generalization of SETI (same team),
                  Windows/Linux. Mainly added an API for other SETI
                  programs.



                SETI@HOME

                •   3.8M users in 226 countries
                •   1200 CPU years/day
                •   1.7 ZETAflop over last 3 years (10^21)
                •   38 TF sustained performance
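
As a rough consistency check, the quoted totals can be converted into average rates. The sketch below assumes a 365.25-day year and that the 1.7 zettaflop figure counts floating-point operations accumulated over exactly three years. Both are assumptions, not statements from the slides, and the function names are illustrative.

```python
# Back-of-the-envelope check of the SETI@HOME figures quoted on the slide.
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year

total_flop = 1.7e21          # 1.7 zettaflop accumulated (from the slide)
period_years = 3             # "over last 3 years" (from the slide)
cpu_years_per_day = 1200     # from the slide


def average_tflops(flop, years):
    """Average rate in TFLOPS for `flop` operations spread over `years`."""
    return flop / (years * SECONDS_PER_YEAR) / 1e12


def equivalent_cpus(cpu_years_per_day):
    """Number of CPUs that, running nonstop, deliver this many CPU years per day."""
    return cpu_years_per_day * 365.25


print(f"average rate: {average_tflops(total_flop, period_years):.1f} TFLOPS")
print(f"equivalent always-on CPUs: {equivalent_cpus(cpu_years_per_day):,.0f}")
```

Under these assumptions the three-year average comes out near 18 TFLOPS, the same order of magnitude as the 38 TF sustained figure, and 1200 CPU years/day corresponds to roughly 438,000 CPUs running around the clock.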






                Motivation

                • BOINC/Condor is on to something…
                • However:
                   • We don’t want to boinc’ify all our applications
                     (including porting to Windows!).
                   • Condor has a Vanilla universe, but again we need to
                     port our code to Windows – and do we trust arbitrary
                     code to run on our desktop?

                • We need secure Linux cycles to run on idle Windows
                  machines – for standard batch jobs (vanilla)!


                I: The distributed cluster -
                            Setup and components

                • How to distribute a cluster?
                     • Front-end centralized, worker nodes distributed (geographically)
                     • Use an intelligent batch submission system
                • How to contact worker nodes on other networks (behind
                  firewalls etc.)
                     • Office grids (do it within a trusted organization)
                     • VPN
                • How to change Windows cycles into Linux cycles and
                  “jail” the grid jobs.
                     • Reboots
                     • Machine virtualization




                Changing Windows cycles into Linux cycles = coLinux

                • Support for Linux on Windows and Linux on Linux
                • Kernel runs in NT ring 0 = cooperative mode
                   • High performance compared to other VMs
                • Runs several distros
                   • 2.4 and 2.6 support through a kernel patch
                   • Gentoo, Debian, Fedora etc.
                • Several network modes
                   • Bridged network (through WinPCAP)
                   • Routed network (through TAP device)
                   • Shared device (through Slirp driver)



                Building a distributed private network = VPN

                • OpenVPN
                  • Open Source
                  • Supports one server to many clients
                  • Supports full RSA security
                  • The Windows host becomes invisible to the virtual
                    machine and vice versa






                The batch submission system

                • Condor
                   • + support for checkpointing and migration
                   • - above not supported for vanilla jobs
                   • + handles dynamic nodes well
                   • - rather big worker node (WN) installation
                • Torque (PBS)
                   • + small footprint
                   • + ok support for dynamic nodes
                   • - no checkpointing and migration



                Adding the grid component

                • NorduGRID/ARC:
                   • Minimally intrusive
                   • Uses a well-suited frontend model
                   • Supports Condor and Torque as LRMS






                ARC@HOME cluster

                [Diagram: ARC@HOME cluster layout]
                • NG GIIS
                • ARC@HOME Frontend (FQHN server): OpenVPN, Nordugrid, Torque
                • Windows Desktop (XP): Service, Slirp network, and coLinux
                  running OpenVPN and PBS mom
                • Six Windows Desktop nodes are shown






                The Windows Desktop

                • coLinux
                   • 384 MB partition (a plain Windows file) with distro + Torque
                   • NFS automount for /scratch/grid
                • NT service
                   • net start/stop ARCHOME
                   • Monitor for controlling the service (timed start/stop)
                • Setup.exe file for deployment: 43 MB, automatic setup
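
The timed start/stop behaviour of the monitor could look roughly like the sketch below. It assumes a hypothetical idle window of 18:00 to 07:00, and the function names are illustrative rather than taken from the ARC@HOME installer. Only the `net start/stop ARCHOME` commands come from the slide.

```python
import datetime
import subprocess

SERVICE_NAME = "ARCHOME"  # service name from the slide


def should_run(now, start_hour=18, stop_hour=7):
    """True if `now` falls inside the idle window (hypothetical default: 18:00 to 07:00)."""
    hour = now.hour
    if start_hour <= stop_hour:
        return start_hour <= hour < stop_hour
    # The window wraps past midnight.
    return hour >= start_hour or hour < stop_hour


def service_command(running, now, **window):
    """Return the `net` command needed to reach the desired state, or None."""
    wanted = should_run(now, **window)
    if wanted and not running:
        return ["net", "start", SERVICE_NAME]
    if running and not wanted:
        return ["net", "stop", SERVICE_NAME]
    return None


def apply(cmd):
    """Run the command on a Windows host (needs administrator rights)."""
    if cmd is not None:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: only print the decision instead of touching the service.
    print(service_command(running=False, now=datetime.datetime(2005, 1, 1, 22, 0)))
```

Keeping the decision (`service_command`) separate from the side effect (`apply`) lets the schedule be checked on any machine, while the actual `net` call only makes sense on the Windows desktop itself.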






                Thoughts on the distributed cluster
                • coLinux together with NorduGRID and Torque makes
                  it possible for the first time to utilize Windows PCs in
                  a production grid environment.
                • However, some problems still exist:
                   • Need for checkpointing
                   • The queuing systems (Condor/Torque) accept, but
                      really don’t like, nodes just dying all the time.
                   • When to mark a job failed because it takes too long to
                      finish?
                   • We only keep the VPN line open to please the queuing
                      systems
                   • Deployment! How do we get people to offer us CPUs?



                II: Sandboxing grid jobs for BOINC

                • What is BOINC?
                  • Essentially a batch system for running many similar
                    completely sandboxed jobs.
                  • Has a lot of users in several projects (= well deployed)
                  • New projects can join (latest: LHC@HOME).
                  • Also supports big jobs (one Climate Modeling job takes
                    half a year and turns over gigabytes of data…)

                     • But how to run general grid jobs within BOINC?




                BOINC’ifying grid jobs

                [Diagram: ARC@BOINC layout]
                • ARC@BOINC Frontend: Nordugrid server, BOINC project server
                • Windows Desktop running BOINC:
                   • Project EXE: SETI, coLinux++, Climate…
                   • Data: sandboxed Nordugrid job (sessiondir tar)
                • Six Windows Desktop nodes are shown






               Workflow
                • Nordugrid server receives a job (session dir)
                • BOINC queuing system makes a tar-ball of the job
                • Tar-ball is added to the list of data packages of the
                  ARC-BOINC server
                • Client connects and downloads some ARC data
                • The ARC-BOINC job starts:
                    • The tar-ball is placed at a specific place
                    • coLinux starts, mounts the tar-ball and executes the job
                    • (Running jobs can be checkpointed by LSS2)
                    • Generated data is wrapped as another tar-ball
                    • coLinux finishes, and the output data is sent back to the server
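
The packaging steps of this workflow can be sketched as follows. This is a minimal illustration, not the ARC@BOINC implementation: it uses no BOINC, NorduGrid or coLinux API, a plain subprocess in a temporary directory stands in for the coLinux sandbox, and the names `pack` and `run_job` are hypothetical.

```python
import subprocess
import sys
import tarfile
import tempfile
from pathlib import Path


def pack(directory, tarball):
    """Wrap the contents of `directory` into a gzip-compressed tar-ball."""
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(directory, arcname=".")
    return tarball


def run_job(job_tarball, out_tarball, command):
    """Unpack the job, run `command` inside it, and wrap the result."""
    with tempfile.TemporaryDirectory() as workdir:
        with tarfile.open(job_tarball, "r:gz") as tar:
            tar.extractall(workdir)  # acceptable only for tar-balls from a trusted server
        subprocess.run(command, cwd=workdir, check=True)
        return pack(workdir, out_tarball)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Server side: a session directory becomes a job tar-ball.
        session = Path(tmp, "sessiondir")
        session.mkdir()
        (session / "input.txt").write_text("hello grid\n")
        job = pack(session, Path(tmp, "job.tar.gz"))

        # Client side: stand-in job that upper-cases input.txt into output.txt.
        cmd = [sys.executable, "-c",
               "open('output.txt', 'w').write(open('input.txt').read().upper())"]
        out = run_job(job, Path(tmp, "out.tar.gz"), cmd)

        with tarfile.open(out) as tar:
            print(sorted(member.name for member in tar.getmembers()))
```

The resulting tar-ball contains both the original session files and whatever the job generated, which is what would be sent back to the server in the last step.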


                Conclusions

                • Cycle scavenging can be used in a production grid
                • The important components are:
                   • Virtualization layer
                   • Checkpointing
                   • Queuing system
                • Sandboxed grid jobs in BOINC are good for bigger runs
                • The distributed cluster with OpenVPN will still be superior
                  for small, tutorial-like jobs.


