Deploying a High Throughput Computing Cluster









Kyung Wook Ye

(Dept of Computer Science, KAIST)




Deploying a High Throughput Computing Cluster

HTC strives to provide large amounts of processing capacity to customers over long periods of time by exploiting existing computing resources on the network.





This requires a solution that includes a resource management framework.

The solution must be reliable and maintainable.





The HTC software must be robust and feature-rich to meet the needs of resource owners, customers, and system administrators.





This presentation describes some of the challenges faced by software developers and system administrators when deploying an HTC cluster, and some of the approaches for meeting those challenges.




Condor Overview

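Condor resource management architecture

[Figure: a Matchmaker at the top, with a Customer Agent and a Resource Owner Agent below it. The Customer Agent side is labeled Resource Request, the Resource Owner Agent side is labeled Resource Offer, and the Matchmaker issues a Match Notification. The Customer Agent and the Resource Owner Agent are connected by the Claiming Protocol.]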
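The matchmaking step in the architecture above can be sketched in a few lines. This is a minimal illustration, not Condor's implementation: the `Ad` class, the `matchmake` function, and all attribute names (`arch`, `memory_mb`, `owner`) are hypothetical. It assumes each side publishes attributes plus a requirements predicate, and that the customer's rank breaks ties among acceptable offers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Ad:
    """A resource request or a resource offer (hypothetical structure)."""
    name: str
    attrs: Dict[str, object]
    requirements: Callable[[dict], bool]  # evaluated on the other party's attributes
    rank: Callable[[dict], float]         # preference; higher is better


def matchmake(request: Ad, offers: List[Ad]) -> Optional[Ad]:
    """Return the offer the request ranks highest among mutually acceptable ones."""
    compatible = [
        offer for offer in offers
        if request.requirements(offer.attrs) and offer.requirements(request.attrs)
    ]
    if not compatible:
        return None
    return max(compatible, key=lambda offer: request.rank(offer.attrs))


job = Ad(
    "job-1",
    {"owner": "alice", "memory_mb": 512},
    requirements=lambda m: m["arch"] == "x86" and m["memory_mb"] >= 512,
    rank=lambda m: m["memory_mb"],
)
machines = [
    Ad("ws-a", {"arch": "x86", "memory_mb": 1024},
       requirements=lambda j: True, rank=lambda j: 0.0),
    Ad("ws-b", {"arch": "sparc", "memory_mb": 4096},
       requirements=lambda j: True, rank=lambda j: 0.0),
    # This owner's policy refuses jobs from alice.
    Ad("ws-c", {"arch": "x86", "memory_mb": 2048},
       requirements=lambda j: j["owner"] != "alice", rank=lambda j: 0.0),
]
match = matchmake(job, machines)
print(match.name if match else "no match")
```

In this sketch the matchmaker only selects a pair. Claiming the resource would then happen directly between the two agents, which is why the claiming protocol is drawn separately in the figure.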










Software Development

Four primary challenges:

Utilization of heterogeneous resources (system portability)

Evolution of network protocols

Remote file access

Utilization of non-dedicated resources (preempt and resume using checkpoints)










Software Development

Layered Software Architecture

portability










Software Development

Layered Resource Management Architecture

Modular system design

Separates the advertising, matchmaking, and claiming protocols

Protocol Flexibility

Use a general-purpose data format

New parameters can be added later (older agents ignore parameters they do not recognize)
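The protocol flexibility point can be illustrated with a small sketch. The `Key = Value` text format, the attribute names, and the `KNOWN_ATTRIBUTES` set are assumptions made for this example only. The idea is that an older agent parses the whole message but acts only on the attributes it knows about, so newer agents can add parameters without breaking it.

```python
# Attributes that a hypothetical "old" agent understands.
KNOWN_ATTRIBUTES = {"Name", "Arch", "Memory"}


def parse_ad(text):
    """Parse 'Key = Value' lines into a dict, skipping blank or malformed lines."""
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        ad[key.strip()] = value.strip()
    return ad


def read_known(ad, known=KNOWN_ATTRIBUTES):
    """Keep only the attributes this agent version understands."""
    return {key: value for key, value in ad.items() if key in known}


# A newer agent advertises an extra parameter (GPUCount) unknown to old agents.
new_style_ad = """
Name = ws-a
Arch = x86
Memory = 1024
GPUCount = 2
"""

old_agent_view = read_known(parse_ad(new_style_ad))
print(old_agent_view)
```

Because unknown keys are dropped rather than rejected, old and new agents can coexist in the same cluster during a rolling upgrade.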










Software Development

Remote File Access

Use an existing distributed file system

Authenticate the customer’s application

Redirect file I/O through an interposition system
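The interposition idea can be shown with a toy example. This sketch temporarily replaces Python's built-in `open` so that reads are served by a callback standing in for the customer's home machine. The names `interposed_open`, `fetch_remote`, `home_files`, and the file path are hypothetical. A real system would interpose at the system call or library level rather than in the interpreter.

```python
import builtins
import io
from contextlib import contextmanager


@contextmanager
def interposed_open(fetch_remote):
    """Temporarily route text reads through fetch_remote(path)."""
    real_open = builtins.open

    def remote_open(path, mode="r", *args, **kwargs):
        if mode == "r":
            # Serve the file contents from the remote side as an in-memory file.
            return io.StringIO(fetch_remote(path))
        # Anything else falls through to the local file system.
        return real_open(path, mode, *args, **kwargs)

    builtins.open = remote_open
    try:
        yield
    finally:
        builtins.open = real_open  # always restore the original


def application():
    """Unmodified application code: it just calls open()."""
    with open("/home/alice/input.txt") as f:
        return int(f.read())


# Stand-in for files that live on the customer's home machine.
home_files = {"/home/alice/input.txt": "42\n"}

with interposed_open(home_files.__getitem__):
    result = application()
print(result)
```

The application itself is unchanged, which is the main attraction of interposition: file access is redirected without requiring a shared file system on the execution machine.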










Software Development

Checkpointing

A snapshot of the executing program’s state

Can be used to restart the program

Enables preempt-resume scheduling

Challenge: portable, robust user-level checkpointing
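A small sketch of preempt-resume with checkpoints follows. It is application-level checkpointing using `pickle`, which is much simpler than the transparent user-level process checkpointing discussed above. The function names, the state layout, and the `stop_after` parameter (which simulates preemption) are assumptions for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never corrupts the last good checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return default


def sum_squares(n, path, stop_after=None):
    """Sum i*i for i in range(n), checkpointing after every step.

    stop_after simulates preemption after that many steps in this run.
    Returns None if preempted, otherwise the final total.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    steps = 0
    while state["i"] < n:
        if stop_after is not None and steps >= stop_after:
            return None  # preempted; progress is already on disk
        state["total"] += state["i"] ** 2
        state["i"] += 1
        save_checkpoint(path, state)
        steps += 1
    return state["total"]


checkpoint_path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first_run = sum_squares(10, checkpoint_path, stop_after=4)  # preempted early
second_run = sum_squares(10, checkpoint_path)               # resumes at i == 4
print(first_run, second_run)
```

The second run does not repeat the first four steps. That saved work is what makes it practical to run long jobs on resources that may be reclaimed at any time.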










System Administration

Enforce the access policies of resource owners

Handle failures (detecting them, investigating their causes, avoiding them)

Account for system usage and availability

Access Policies

Who may use a resource, how they may use it, and when they may use it

Define a set of expressions which specify when an application may begin using a resource and when and how an application must stop using a resource

Requirements

Rank (preference of the resource owner)

Suspend

Continue

Vacate

Kill






System Administration

Access Policies

Requirements = (KeyboardIdle > 15*Minute) && (LoadAvg [...] 2*Minute)

Vacate = (SuspendTime > 5*Minute)

Kill = (VacateTime > 5*Minute)



vacate: save intermediate results (checkpoint)

kill: do not save any results

periodic checkpoint
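The policy expressions above can be read as transitions of a small state machine. The sketch below is one possible interpretation, not the actual policy engine. The `Vacate` and `Kill` thresholds and the keyboard-idle part of `Requirements` come from the slide. The load-average condition is omitted because it is illegible in the source, and the `suspend` and `resume` thresholds are assumptions introduced only to make the example complete.

```python
MINUTE = 60  # seconds


def requirements(m):
    # From the slide; the load-average clause is illegible there and left out.
    return m["KeyboardIdle"] > 15 * MINUTE


def vacate(m):
    return m["SuspendTime"] > 5 * MINUTE  # from the slide


def kill(m):
    return m["VacateTime"] > 5 * MINUTE   # from the slide


def suspend(m):
    return m["KeyboardIdle"] < MINUTE      # ASSUMED threshold, not in the source


def resume(m):
    return m["KeyboardIdle"] > 2 * MINUTE  # ASSUMED threshold, not in the source


def next_state(state, m):
    """Advance the resource's state given the current machine attributes m."""
    if state == "idle":
        return "running" if requirements(m) else "idle"
    if state == "running":
        return "suspended" if suspend(m) else "running"
    if state == "suspended":
        if vacate(m):
            return "vacating"  # application gets a chance to checkpoint
        return "running" if resume(m) else "suspended"
    if state == "vacating":
        return "killed" if kill(m) else "vacating"
    return state


# The owner returns to the keyboard while a job is running.
print(next_state("running", {"KeyboardIdle": 10}))
# The job has been suspended for six minutes, so it must vacate.
print(next_state("suspended", {"KeyboardIdle": 10, "SuspendTime": 6 * MINUTE}))
```

The two-stage shutdown (vacate, then kill) gives the application time to write a checkpoint while still guaranteeing that the owner gets the machine back within a bounded time.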










System Administration

Reliability

A variety of failure risks (network, hardware, OS, the HTC software itself)

System processes must not fail and leave running applications unattended, so the HTC processes themselves must be monitored

Master process

detect failure and invoke recovery mechanism

cache executable files on the local disk

serve as an administrative module (gathering statistics)

distinguish between normal and abnormal termination of application

monitor system calls

ask the customer

choose correct checkpoint

must decide when it is safe to restart the application

problem of one bad node
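The master's detect-and-recover role can be sketched as a supervision loop with a restart cap. This is a simplified illustration under stated assumptions: `supervise`, `run_agent`, and `max_restarts` are hypothetical names, and a real master would also handle logging, backoff, and upgrades. The demonstration uses a fake agent with scripted exit codes so that it is deterministic.

```python
import subprocess


def run_agent(cmd):
    """Run one agent process to completion and return its exit code.

    On POSIX, subprocess reports a negative code when a signal killed the child.
    """
    return subprocess.run(cmd).returncode


def supervise(run_once, max_restarts=3):
    """Call run_once() until it exits cleanly or the restart budget is used up.

    Returns the list of exit codes seen, oldest first.
    """
    history = []
    for _ in range(max_restarts + 1):
        exit_code = run_once()
        history.append(exit_code)
        if exit_code == 0:
            break  # clean exit: nothing to recover
    return history


# Fake agent: fails with an error, is then killed by a signal, then exits cleanly.
scripted_exits = iter([1, -9, 0])
history = supervise(lambda: next(scripted_exits))
print(history)

# A real agent could be supervised like this (hypothetical command name):
# supervise(lambda: run_agent(["my_htc_agent", "--foreground"]))
```

Capping restarts keeps a persistently failing agent from being restarted forever, and the recorded exit codes give the administrator a starting point for diagnosis.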






System Administration

Problem Diagnosis via System Logs

System log files: primary tools for diagnosing system failures

application log: system call trace, checkpoint information and statistics, remote I/O trace with statistics, errors occurring during the allocation

customer log: allocation information and statistics, application arrival and termination, matchmaking and claiming errors

resource log: allocation information and statistics, policy action trace

master log: HTC agent (re-)starts, administrative commands, agent upgrades

scheduling log: record of all matches, allocation history (accounting)

security log: record of all rejected requests, record of all authenticated actions

log file management: history, distributed or centralized






System Administration

Security

protecting the resource from unauthorized access

user authentication mechanism

resource agent maintains control over the application and monitors its activity

sets resource limits

intercepts system calls





Remote Customers

provide remote access to the HTC cluster

HTC account








Summary

HTC software

portable, reliable, and maintainable

A layered architecture with flexible network protocol

To provide remote file access and checkpoint mechanism

To balance the needs of resource owners and HTC customers

To provide reliable, secure services with effective logging and accounting tools for monitoring resource usage and diagnosing problems

HTC challenge

Effectively managing complexities such as heterogeneity, distributed ownership, etc. for the HTC customers, resource owners, and administrators









