What is SAM-Grid?
Job Handling
Data Handling
Monitoring and Information
Problems To Solve
How can a large, geographically distributed, dynamic physics collaboration work together?
How can this collaboration make use of the available distributed computing resources?
How can it handle the huge amount of data (PBs) generated by the experiment?
Answers – The GRID & SAM-Grid
GRID
A network of middleware services that tie together distributed resources (Fabric – processors, storage).
SAM-Grid
Integrates the standard middleware to achieve a complete job, data, and information management infrastructure, thereby enabling fully distributed computing.
SAM-Grid Architecture
Job Management
Grid-level (global) job scheduling (selection of a cluster to run on) is distinguished from local scheduling (distribution of the job within the cluster).
We distinguish structured jobs from unstructured jobs.
  Structured jobs have their details known to the Grid middleware.
  Unstructured jobs are mapped as a whole onto a cluster.
For data-intensive jobs, sites are ranked by the amount of data cached at the site.
The scheduler is interfaced with the data handling system.
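The ranking rule for data-intensive jobs can be sketched as follows. This is a minimal illustration, not the SAM-Grid scheduler: the Site class, the cached_files mapping, and the function names are hypothetical, and the cache contents are assumed to be reported by the data handling system.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """Hypothetical view of an execution site as seen by the scheduler."""
    name: str
    # Assumed shape: file name -> size in bytes, as reported by data handling.
    cached_files: dict = field(default_factory=dict)


def cached_bytes(site, requested_files):
    """Total size of the requested files already present in the site's cache."""
    return sum(size for name, size in site.cached_files.items()
               if name in requested_files)


def rank_sites(sites, requested_files):
    """Order sites from most to least cached data for this job."""
    wanted = set(requested_files)
    return sorted(sites, key=lambda s: cached_bytes(s, wanted), reverse=True)


sites = [
    Site("site_a", {"f1": 100, "f2": 200}),
    Site("site_b", {"f2": 200, "f3": 300}),
    Site("site_c", {}),
]
ranking = [s.name for s in rank_sites(sites, ["f2", "f3"])]
print(ranking)
```

Ranking by bytes rather than by file count is one possible reading of "amount of data"; either measure fits the same structure.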
Job Handling
[Figure: job handling architecture. Labels: User Interface, Submission Service, Information Collector, JOB, Match Making Service, Resource Selection, external algorithm, and Execution Sites #1 to #n, each with a Grid/Fabric Interface, Generic Service, Computing Element, and Grid Sensors.]
Data Handling - SAM
SAM is a distributed data movement and management service.
SAM stations are resources pooled together to enable data management.
Data replication is achieved by the use of disk caches during file routing.
SAM is a fully functional metadata catalog.
A station can access a remote resource via the services offered by other connected stations.
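The idea of replication through caches during routing can be sketched as below. This is a toy model under stated assumptions, not the SAM protocol: the Station class, its fetch method, the peers list, and the lookup order (own cache, then local mass storage, then connected stations) are all hypothetical.

```python
class Station:
    """Hypothetical station with a disk cache, optional mass storage, and peers."""

    def __init__(self, name, mass_storage=None):
        self.name = name
        self.cache = {}                   # file name -> file contents
        self.peers = []                   # other connected stations
        self.mass_storage = mass_storage  # dict standing in for an MSS, or None

    def fetch(self, file_name, _visited=None):
        """Return the file contents, caching a copy at every station on the route."""
        visited = _visited if _visited is not None else set()
        visited.add(self.name)
        if file_name in self.cache:
            return self.cache[file_name]
        data = None
        if self.mass_storage is not None:
            data = self.mass_storage.get(file_name)
        if data is None:
            # Ask connected stations, skipping any already on this route.
            for peer in self.peers:
                if peer.name in visited:
                    continue
                data = peer.fetch(file_name, visited)
                if data is not None:
                    break
        if data is not None:
            self.cache[file_name] = data  # the replica is a side effect of routing
        return data


mss = {"raw_001": b"events"}
remote = Station("remote", mass_storage=mss)
local = Station("local")
local.peers.append(remote)
data = local.fetch("raw_001")
print(data, sorted(local.cache), sorted(remote.cache))
```

After the call, both stations hold a cached copy, so a later request at either station is served without touching mass storage.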
[Figure: SAM stations and file routing. Labels: Local Station 1 (Cache1, Cache2), Local Station 2 (Cache1), Remote Station (Cache1, Cache2), MSS1, MSS2. Legend: MSS = Mass Storage System; arrows show control flow and data flow.]
Data Handling
[Figure: data handling services. Labels: Global Resource Manager(s), Database Server(s) (Central Database), Name Server, Log Server, Station 1 to Station n Servers, Mass Storage System(s). Grouping labels: Shared Globally, Local To Site, Shared Locally. Arrows indicate control and data flow.]
Monitoring and Information
This includes:
  configuration framework
  resource description for job brokering
  infrastructure for monitoring
Main features
  Monitoring of sites (resources), services, and jobs
  Distributed knowledge about jobs, etc.
  Incremental knowledge building
  Grid Monitoring Architecture for current-state inquiries, logging for recent-history studies
  All Web based
Monitoring and Information
[Figure: monitoring and information architecture. Labels: Web Browser, Web Server 1 to Web Server N, Site 1 to Site N Information System, IP.]
Challenges with Grid/Fabric Interface
The Globus toolkit Grid/Fabric interfaces are not sufficiently…
  …flexible: they expect a “standard” batch system configuration.
  …scalable: a process per grid job is started up at the gateway machine. We want/need aggregation.
  …comprehensive: they interface to the batch system only. What about data handling, local monitoring, databases, etc.?
  …robust: if the batch system forgets about the jobs, they cannot react.
Flexibility
Addressing the peculiarities of each batch system's configuration requires modifying the Globus toolkit job-manager.
We address the problem by writing job-managers that use a level of abstraction on top of the batch systems.
Each batch system adapter can be locally configured to conform to the local batch system interface.
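A locally configurable adapter of this kind might look like the sketch below. It is an illustration only: the BatchAdapter class and its command templates are hypothetical, and the templates stand in for whatever each site's administrator would configure. The code only builds command lines; it does not run them.

```python
import shlex


class BatchAdapter:
    """Hypothetical adapter that builds batch-system commands from local templates."""

    def __init__(self, submit_template, status_template):
        # Templates are site configuration; placeholders are {script}, {queue}, {job_id}.
        self.submit_template = submit_template
        self.status_template = status_template

    def submit_command(self, script, queue):
        """Return the argument list a job-manager would execute to submit a job."""
        line = self.submit_template.format(script=shlex.quote(script),
                                           queue=shlex.quote(queue))
        return shlex.split(line)

    def status_command(self, job_id):
        """Return the argument list used to query one job's status."""
        return shlex.split(self.status_template.format(job_id=shlex.quote(job_id)))


# Illustrative site configurations; a real site would adjust these locally.
pbs = BatchAdapter("qsub -q {queue} {script}", "qstat {job_id}")
condor = BatchAdapter("condor_submit {script}", "condor_q {job_id}")

print(pbs.submit_command("job.sh", "short"))
print(condor.submit_command("job.sub", "short"))
```

The job-manager calls the same two methods at every site, so a non-standard batch configuration is handled by changing a template rather than the job-manager code.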
Scalability
The Globus gatekeeper starts up a process at the gateway node for every job entering the site.
This limits the number of grid jobs at a site to around 300 for a typical commodity computer.
We split single grid jobs into multiple batch processes in the SAM-Grid job-managers.
Not only does this increase scalability, it also increases the manageability of the job.
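The splitting step can be pictured with the sketch below. The split criterion is an assumption made for illustration: the slides do not say how a grid job is divided, so this example partitions a hypothetical list of input files; split_grid_job and the batch_id naming scheme are likewise invented.

```python
def split_grid_job(job_id, input_files, files_per_process):
    """Split one grid job into batch processes, each handling a slice of the inputs."""
    if files_per_process < 1:
        raise ValueError("files_per_process must be at least 1")
    processes = []
    for start in range(0, len(input_files), files_per_process):
        processes.append({
            # Hypothetical naming: grid job id plus a running index.
            "batch_id": f"{job_id}.{len(processes)}",
            "files": input_files[start:start + files_per_process],
        })
    return processes


batch = split_grid_job("job42", ["f1", "f2", "f3", "f4", "f5"], 2)
for process in batch:
    print(process["batch_id"], process["files"])
```

Here the gateway handles a single grid job while the batch system runs three processes, which is the aggregation that the standard per-job gateway process lacks.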
Comprehensiveness
The standard job-managers interface only to the local batch system.
We notify other fabric services when a job enters a site:
  Data handling: for data pre-staging
  Monitoring: to monitor a non-running job
  Database: to aggregate queries
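One way to picture the notification step is a simple callback registry, sketched below. The SiteJobManager class, the service names, and the callbacks are hypothetical stand-ins for the real fabric services.

```python
class SiteJobManager:
    """Hypothetical job-manager that fans out a job-arrival event to fabric services."""

    def __init__(self):
        self._services = []  # (name, callback) pairs, in registration order

    def register(self, name, callback):
        self._services.append((name, callback))

    def job_entered(self, job_id):
        """Notify every registered service; collect failures instead of aborting."""
        failures = []
        for name, callback in self._services:
            try:
                callback(job_id)
            except Exception as exc:
                failures.append((name, exc))
        return failures


log = []
manager = SiteJobManager()
manager.register("data_handling", lambda job: log.append(("prestage", job)))
manager.register("monitoring", lambda job: log.append(("watch", job)))
manager.register("database", lambda job: log.append(("batch_queries", job)))
failures = manager.job_entered("job42")
print(log)
```

Collecting failures rather than raising keeps one unavailable service from blocking the others, which is a design choice of this sketch and not something the slides specify.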
Robustness
The standard job-managers cannot react to temporary failures of the local batch systems.
In our experience, PBS, Condor, and BQS have each failed to report the status of a job.
We write wrappers around the batch systems that implement extra robustness. We call them “idealizers”.
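The kind of robustness an idealizer adds can be sketched as a retrying status wrapper. This is an assumption about one plausible behavior, not the SAM-Grid implementation: the Idealizer class, the StatusUnavailable exception, the retry count, and the fallback to the last known state are all hypothetical.

```python
import time


class StatusUnavailable(Exception):
    """Raised by the (hypothetical) batch query when the job cannot be found."""


class Idealizer:
    """Wraps a batch status query so temporary failures are retried, not reported."""

    def __init__(self, query_status, retries=3, delay=0.0):
        self.query_status = query_status
        self.retries = retries
        self.delay = delay        # seconds to wait between attempts
        self._last_known = {}     # job id -> last state successfully reported

    def status(self, job_id):
        for _ in range(self.retries):
            try:
                state = self.query_status(job_id)
            except StatusUnavailable:
                state = None
            if state is not None:
                self._last_known[job_id] = state
                return state
            time.sleep(self.delay)
        # All attempts failed: fall back to the last state seen, if any.
        return self._last_known.get(job_id, "unknown")


calls = {"n": 0}


def flaky_status(job_id):
    """Simulated batch system that loses track of the job for two queries."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise StatusUnavailable(job_id)
    return "running"


ideal = Idealizer(flaky_status, retries=3)
result = ideal.status("job42")
print(result)
```

The job-manager sees a consistent answer even though the underlying batch system failed twice to report the job.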