ARC @ HOME


WWW.UNI-C.DK
ARC@HOME
- Running compute elements on distributed
virtual machines
Michael Grønager, PhD
UNI-C / Virtual Reality Center
Present: Niels Bohr Institute
Talk outline
• Why Screen Saver Science?
• First approach: A distributed cluster
• Second approach: Sandboxing grid jobs for BOINC
Idle time computing (not the complete list)
• Condor (dates back to 1986): Runs on Unix and
Windows. Best suited for LANs – can run arbitrary
programs.
• SETI (1996-1999): Mainly Windows, uses pull model so
well suited for internet. Runs only one program.
• BOINC (2004): A generalization of SETI (same team),
Windows/Linux. Mainly added an API so that other
SETI-like programs can use it.
SETI@HOME
• 3.8M users in 226 countries
• 1200 CPU years/day
• 1.7 ZettaFLOP over the last 3 years (10^21)
• 38 TF sustained performance
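As a rough sanity check of these figures, the sketch below converts the quoted numbers into an equivalent count of continuously running CPUs and an average per-CPU throughput. Only the two constants come from the slide; the variable names and the 365-day year are assumptions made for illustration.

```python
# Figures quoted on the slide.
CPU_YEARS_PER_DAY = 1200
SUSTAINED_FLOPS = 38e12  # 38 TF sustained

DAYS_PER_YEAR = 365  # assumption: non-leap year

# Delivering 1200 CPU years per day is equivalent to this many CPUs
# running around the clock.
equivalent_cpus = CPU_YEARS_PER_DAY * DAYS_PER_YEAR

# Average sustained throughput per equivalent CPU, in MFLOPS.
mflops_per_cpu = SUSTAINED_FLOPS / equivalent_cpus / 1e6

print(f"{equivalent_cpus} equivalent CPUs")
print(f"{mflops_per_cpu:.1f} MFLOPS per CPU on average")
```

Under these assumptions the sustained performance corresponds to several hundred thousand always-on CPUs, each contributing well under 100 MFLOPS on average.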
Motivation
• BOINC/Condor are on to something…
• However:
• We don’t want to boinc’ify all our applications
(including porting to Windows!).
• Condor has a Vanilla universe, but again we need to
port our code to Windows – and do we trust arbitrary
code to run on our desktop?
• We need secure Linux cycles to run on idle Windows
machines, for standard batch jobs (vanilla)!
I: The distributed cluster -
Setup and components
• How to distribute a cluster?
• Front-end centralized, worker nodes distributed (geographically)
• Use an intelligent batch submission system
• How to contact worker nodes on other networks (behind
firewalls etc.)?
• Office grids (do it within a trusted organization)
• VPN
• How to change Windows cycles into Linux cycles and
“jail” the grid jobs?
• Reboots
• Machine virtualization
Changing Windows cycles into Linux cycles = coLinux
• Support for Linux on Windows and Linux on Linux
• Kernel runs in NT ring 0 = cooperative mode
• High performance compared to other VMs
• Runs several distros
• 2.4 and 2.6 support through a kernel patch
• Gentoo, Debian, Fedora etc.
• Several network modes
• Bridged network (through WinPcap)
• Routed network (through TAP device)
• Shared device (through Slirp driver)
Building a distributed private network = VPN
• OpenVPN
• Open Source
• Supports one server to many clients
• Supports full RSA security
• The Windows host becomes invisible to the virtual
machine and vice versa
The batch submission system
• Condor
• + support for checkpointing and migration
• - above not supported for vanilla jobs
• + handles dynamic nodes well
• - rather big worker node (WN) installation
• Torque (PBS)
• + small footprint
• + ok support for dynamic nodes
• - no checkpointing and migration
Adding the grid component
• NorduGRID/ARC:
• Minimally intrusive
• Uses a well-suited frontend model
• Supports Condor and Torque as LRMS
ARC@HOME cluster
[Diagram: NG GIIS at the top. ARC@HOME Frontend (FQHN Server): OpenVPN service, Nordugrid, Torque. Windows Desktop (XP): coLinux running OpenVPN and PBS mom over a Slirp network. Six Windows desktops are shown below the frontend.]
The Windows Desktop
• coLinux
• 384MB partition (plain Windows file) with Distro+Torque
• NFS automount for /scratch/grid
• NT service
• net start/stop ARCHOME
• Monitor for controlling the service (timed start/stop)
• Setup.exe file for deployment: 43MB, automatic setup
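The timed start/stop performed by the monitor could look roughly like the sketch below. The service name ARCHOME and the `net start` / `net stop` commands come from the slide; the idle window (18:00 to 07:00) and all function names are assumptions for illustration, not a description of the actual monitor.

```python
import subprocess
from datetime import datetime, time
from typing import List, Optional

SERVICE_NAME = "ARCHOME"  # service name from the slide
# Hypothetical idle window: run the virtual machine outside office hours.
START_AT = time(18, 0)
STOP_AT = time(7, 0)


def should_run(now: time, start: time = START_AT, stop: time = STOP_AT) -> bool:
    """Return True if `now` lies inside the window, which may wrap past midnight."""
    if start <= stop:
        return start <= now < stop
    return now >= start or now < stop


def service_command(run: bool, service: str = SERVICE_NAME) -> List[str]:
    """Build the `net start` / `net stop` command line for the NT service."""
    return ["net", "start" if run else "stop", service]


def apply_schedule(now: Optional[datetime] = None) -> None:
    """Start or stop the service according to the current time (Windows only)."""
    now = now or datetime.now()
    cmd = service_command(should_run(now.time()))
    # `net` exits non-zero when the service is already in the requested
    # state, so the result is deliberately not treated as fatal.
    subprocess.run(cmd, check=False)
```

Calling `apply_schedule()` periodically, for example from a scheduled task, would keep the service aligned with the chosen window.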
Thoughts on the distributed cluster
• coLinux together with NorduGRID and Torque make
it possible for the first time to utilize Windows PCs in
a production grid environment.
• However, some problems still exist:
• Need for checkpointing
• The queuing systems (Condor/Torque) accept, but
really don’t like, nodes just dying all the time.
• When to mark a job failed, because it takes too long to
finish?
• We only keep the VPN line open to please the queuing
systems
• Deployment! How do we get people to offer us CPUs?
II: Sandboxing grid jobs for BOINC
• What is BOINC?
• Essentially a batch system for running many similar
completely sandboxed jobs.
• Has a lot of users in several projects (=Well deployed)
• New projects can join (latest: LHC@HOME).
• Also supports big jobs (one Climate Modeling job takes
half a year and turns over gigabytes of data…)
• But how to run general grid jobs within BOINC?
BOINC’ifying grid jobs
[Diagram: ARC@BOINC Frontend: Nordugrid server and BOINC project server. Windows Desktop: BOINC client with a table of Project / EXE / Data. Rows: SETI; Nordugrid / coLinux++ / Sandboxed job (sessiondir tar); Climate… Six Windows desktops are shown below the frontend.]
Workflow
• Nordugrid server receives job (session dir)
• BOINC queuing system makes a tar-ball of the job
• Tar-ball is added to the list of data packages of the ARC-BOINC
server
• Client connects and downloads some ARC data
• The ARC-BOINC job starts:
• The tar-ball is placed at a specific place
• coLinux starts, mounts the tar-ball and executes the job
• (Running jobs can be checkpointed by LSS2)
• Generated data is wrapped as another tar-ball
• coLinux finishes, and output data is sent back to the server
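The tar-ball handling in this workflow can be sketched as follows. This is a minimal illustration using Python's standard `tarfile` module; the function names, the directory layout and the `run_job` callback standing in for the coLinux execution step are all hypothetical, and neither the BOINC nor the NorduGrid APIs are involved.

```python
import tarfile
import tempfile
from pathlib import Path
from typing import Callable


def pack_dir(src_dir: Path, tarball: Path) -> Path:
    """Wrap a directory (e.g. a session dir) as a gzip-compressed tar-ball."""
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(src_dir, arcname=".")
    return tarball


def unpack(tarball: Path, dest_dir: Path) -> Path:
    """Unpack a tar-ball into dest_dir (only use with trusted archives)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as tar:
        tar.extractall(dest_dir)
    return dest_dir


def run_sandboxed(job_tar: Path, out_tar: Path,
                  run_job: Callable[[Path], None]) -> Path:
    """Client side: unpack the job, run it, wrap the results as a new tar-ball."""
    with tempfile.TemporaryDirectory() as work:
        session = unpack(job_tar, Path(work) / "session")
        run_job(session)  # hypothetical stand-in for the coLinux execution step
        return pack_dir(session, out_tar)
```

On the server side, `pack_dir` would be applied to the received session dir, and `unpack` to the returned output tar-ball. Keeping both directions as plain tar-balls means the client never needs to know anything about the job beyond how to unpack and run it.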
Conclusions
• Cycle-scavenging can be used in a production grid
• The important components are:
• Virtualization layer
• Checkpointing
• Queuing system
• Sandboxed grid jobs in BOINC are good for bigger runs
• The distributed cluster with OpenVPN will still be superior
for small, tutorial-like jobs.