Deploying a High Throughput Computing Cluster
Kyung Wook Ye
(Dept of Computer Science, KAIST)
Deploying a High Throughput Computing Cluster
HTC strives to provide large amounts of processing capacity to
customers over long periods of time by exploiting existing
computing resources on the network.
This requires a solution that includes a resource management framework.
The solution must be reliable and maintainable.
The HTC software must be robust and feature-rich to meet the needs of
resource owners, customers, and system administrators.
This talk describes some of the challenges faced by software developers
and system administrators when deploying an HTC cluster, and
some of the approaches for meeting those challenges.
Condor Overview
Condor resource management architecture
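[Figure: the customer agent sends a resource request and the resource
owner agent sends a resource offer to the matchmaker; the matchmaker
sends a match notification to both agents; the customer agent then
claims the resource from the resource owner agent through the
claiming protocol.]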
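The matchmaking step in the figure can be sketched as follows. This is a minimal illustration, not Condor's implementation: the dictionary-based ads, the `requirements` callables, the attribute names, and the function `match` are all assumptions made for this example.

```python
# Hypothetical sketch of matchmaking: ads are plain dicts, and each side
# carries a "requirements" predicate evaluated against the other side's ad.

def match(requests, offers):
    """Pair each request with the first unclaimed offer that both sides accept."""
    matches = []
    free = list(offers)
    for req in requests:
        for offer in free:
            if req["requirements"](offer) and offer["requirements"](req):
                # In a real system this is where both agents would be notified.
                matches.append((req["owner"], offer["name"]))
                free.remove(offer)
                break
    return matches


requests = [
    {"owner": "alice", "memory_needed": 64,
     "requirements": lambda o: o["arch"] == "INTEL" and o["memory"] >= 64},
    {"owner": "bob", "memory_needed": 256,
     "requirements": lambda o: o["memory"] >= 256},
]
offers = [
    {"name": "node1", "arch": "SPARC", "memory": 128,
     "requirements": lambda r: r["memory_needed"] <= 128},
    {"name": "node2", "arch": "INTEL", "memory": 128,
     "requirements": lambda r: r["memory_needed"] <= 128},
]

notifications = match(requests, offers)
```

Here only alice's request is matched (with node2); bob's request stays queued because no offer has enough memory. Claiming would then happen directly between the two agents, without the matchmaker.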
Software Development
Four primary challenges:
Utilization of heterogeneous resources (system portability)
Evolution of network protocols
Remote file access
Utilization of non-dedicated resources
(preempt and resume using checkpointing)
Software Development
Layered Software Architecture
portability
Software Development
Layered Resource Management Architecture
Modular system design
Separates the advertising, matchmaking, and claiming
protocols
Protocol Flexibility
use a general-purpose data format
add new parameters (Older agents ignore new parameters)
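The protocol-flexibility idea above can be sketched as follows. The `Name = value` line format, the attribute names, and both helper functions are assumptions chosen for illustration; they are not the actual Condor wire format.

```python
# Hypothetical general-purpose message format: one "Name = value" pair per line.

def parse_ad(text):
    """Parse 'Name = value' lines into a dict, keeping every attribute."""
    ad = {}
    for line in text.splitlines():
        if "=" in line:
            name, value = line.split("=", 1)
            ad[name.strip()] = value.strip()
    return ad


def old_agent_view(ad):
    """An older agent reads only the attributes it knows and ignores the rest."""
    known = ("Name", "Arch", "Memory")
    return {k: ad[k] for k in known if k in ad}


# A newer agent adds CheckpointServer; the older agent still parses the message.
message = "Name = node1\nArch = INTEL\nMemory = 128\nCheckpointServer = ckpt1\n"
ad = parse_ad(message)
legacy = old_agent_view(ad)
```

Because the parser never rejects unknown names, new parameters can be introduced without upgrading every agent at once.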
Software Development
Remote File Access
Use existing distributed file system
Authenticate the customer’s application
redirect file I/O -> interposition system
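As a rough illustration of redirecting file I/O through an interposition layer, the sketch below routes every open call through a shim that fetches the data from elsewhere. The class `RemoteFileShim`, the `fetch` callable, and the in-memory `home_files` store are hypothetical stand-ins; a real interposition system would intercept the application's system calls rather than require it to call a wrapper.

```python
import io


class RemoteFileShim:
    """Stands in for the layer that intercepts an application's open() calls."""

    def __init__(self, fetch):
        self._fetch = fetch   # callable: path -> bytes (e.g. a request to the home machine)
        self.trace = []       # record of redirected calls

    def open(self, path):
        self.trace.append(path)
        return io.BytesIO(self._fetch(path))


# Hypothetical "home" file system on the customer's machine.
home_files = {"/home/user/input.dat": b"1 2 3\n"}
shim = RemoteFileShim(lambda path: home_files[path])

with shim.open("/home/user/input.dat") as f:
    data = f.read()
```

The application sees an ordinary file object, while the bytes come from the customer's side and each access is recorded.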
Software Development
Checkpointing
A snapshot of the executing program's state
Can be used to restart the program
preemptive-resume scheduling
Challenge: portable, robust user-level checkpointing
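The preempt-and-resume idea can be sketched with an application-level checkpoint. Note the substitution: this example saves an explicit state dictionary with `pickle`, which is far simpler than the transparent user-level process checkpointing the slide refers to. The function `run` and its parameters are assumptions for illustration.

```python
import os
import pickle
import tempfile


def run(total_steps, ckpt_path, preempt_at=None):
    """Sum 0..total_steps-1, resuming from ckpt_path if it exists.

    If preempt_at is given, save a checkpoint and stop at that step.
    Returns the final sum, or None if the job was preempted.
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)          # restart from the snapshot
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)       # snapshot before leaving the machine
            return None
        state["acc"] += state["step"]
        state["step"] += 1
    return state["acc"]


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run(10, ckpt, preempt_at=4)   # preempted at step 4, state saved
second = run(10, ckpt)                # resumes at step 4 and finishes
```

The second call does not repeat steps 0 to 3, which is the point of preemptive-resume scheduling: work done before preemption is not lost.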
System Administration
Enforce the access policies of resource owners
Handle failures (detecting them, investigating their causes, avoiding them)
Account for system usage and availability
Access Policies
who may use a resource, how they may use it, when they may use it
Define a set of expressions which specify when an application may
begin using a resource and when and how an application must stop
using a resource
Requirements
Rank (preference of the resource owner)
Suspend
Continue
Vacate
Kill
System Administration
Access Policies
Requirements = (KeyboardIdle > 15*Minute) && (LoadAvg 2*Minute)
Vacate = (SuspendTime > 5*Minute)
Kill = (VacateTime > 5*Minute)
vacate: save intermediate results (checkpoint)
kill: do not save any results
periodic checkpoint
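A minimal sketch of how a resource agent might evaluate such policy expressions is given below. It models only the keyboard-idle clause of Requirements together with the Vacate and Kill expressions; the job states (`idle`, `suspended`, `vacating`), the action names, and the function `next_action` are assumptions for this example, not part of the original policy language.

```python
MINUTE = 60  # seconds

# Policy expressions as predicates over a dict of machine and job attributes.
POLICY = {
    "Requirements": lambda a: a["KeyboardIdle"] > 15 * MINUTE,
    "Vacate": lambda a: a["SuspendTime"] > 5 * MINUTE,
    "Kill": lambda a: a["VacateTime"] > 5 * MINUTE,
}


def next_action(state, attrs):
    """Return the action the resource agent would take in the given job state."""
    if state == "idle" and POLICY["Requirements"](attrs):
        return "start"
    if state == "suspended" and POLICY["Vacate"](attrs):
        return "vacate"   # checkpoint, then leave the machine
    if state == "vacating" and POLICY["Kill"](attrs):
        return "kill"     # stop immediately, nothing saved
    return "wait"


a1 = next_action("idle", {"KeyboardIdle": 20 * MINUTE})
a2 = next_action("suspended", {"SuspendTime": 6 * MINUTE})
a3 = next_action("vacating", {"VacateTime": 2 * MINUTE})
a4 = next_action("vacating", {"VacateTime": 6 * MINUTE})
```

The escalation from vacate to kill gives the application a bounded window to checkpoint before the owner's machine is reclaimed unconditionally.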
System Administration
Reliability
Many possible sources of failure (network, hardware, OS, the HTC software itself)
System processes must not fail and leave running applications
unattended -> monitor the HTC processes
Master process
detect failure and invoke recovery mechanism
cache executable files on the local disk
serve as an administrative module (gathering statistics)
distinguish between normal and abnormal termination of application
monitor system calls
ask the customer
choose correct checkpoint
must decide when it is safe to restart the application
problem of one bad node
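The master's detect-and-recover role can be sketched as a small supervisor. This is a simplified, sequential illustration under stated assumptions: the function `supervise`, the restart limit, and the convention that exit code 0 means normal termination are choices made for this example; a real master would monitor long-running agents concurrently.

```python
import os
import subprocess
import sys
import tempfile


def supervise(commands, max_restarts=3):
    """Run each command to completion, restarting it after an abnormal exit.

    Returns {name: number_of_restarts}. Exit code 0 counts as normal termination.
    """
    restarts = {}
    for name, cmd in commands.items():
        restarts[name] = 0
        while True:
            code = subprocess.run(cmd).returncode
            if code == 0 or restarts[name] >= max_restarts:
                break
            restarts[name] += 1   # abnormal exit: invoke recovery (restart)
    return restarts


# A child that fails on its first run (no marker file yet) and succeeds afterwards.
marker = os.path.join(tempfile.mkdtemp(), "seen")
flaky = [sys.executable, "-c",
         "import os, sys; p = sys.argv[1]; ok = os.path.exists(p); "
         "open(p, 'w').close(); sys.exit(0 if ok else 1)",
         marker]
stable = [sys.executable, "-c", "pass"]

result = supervise({"flaky": flaky, "stable": stable})
```

The restart cap reflects the "one bad node" concern: a process that keeps failing should eventually be left stopped rather than restarted forever.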
System Administration
Problem Diagnosis via System Logs
System log files: primary tools for diagnosing system failures
application log: system call trace, checkpoint information and
statistics, remote I/O trace with statistics, errors occurring during the
allocation
customer log: allocation information and statistics, application arrival
and termination, matchmaking and claiming errors
resource log: allocation information and statistics, policy action trace
master log: HTC agent (re-)starts, administrative commands, agent
upgrades
scheduling log: record of all matches, allocation history (accounting)
security log: record of all rejected requests, record of all authenticated
actions
log file management: history, distributed or centralized
System Administration
Security
protecting the resource from unauthorized access
user authentication mechanism
the resource agent maintains control over the application and
monitors its activity
set resource limits
intercept system calls
Remote Customers
provide remote access to the HTC cluster
HTC account
Summary
HTC software
portable, reliable, and maintainable
A layered architecture with flexible network protocol
To provide remote file access and checkpoint mechanism
To balance the needs of resource owners and HTC customers
To provide reliable, secure services with effective logging and
accounting tools for monitoring resource usage and diagnosing
problems
HTC challenge
Effectively managing complexities such as heterogeneity, distributed
ownership, etc. for HTC customers, resource owners, and
administrators