ARTS – Project Proposal
(ALICE Offline)

 1. Project Aim and Context
 2. Project Impact, Related Fields and Usage of the Apple Computing Solution
 3. Project Timescale

1 Project Aim and Context

CERN, the European Organization for Nuclear Research, is the world's
largest particle physics research centre. Its scientific program is aimed
at exploring the properties of matter’s smallest building blocks and the
forces holding them together. The newest addition to the CERN research
tools is the Large Hadron Collider (LHC), a particle accelerator designed
to collide protons and heavy ions at very high energies. The LHC will start
operation in the summer of 2008.

One of the experiments operating at the LHC is ALICE (A Large Ion Collider
Experiment). ALICE is purpose-built to test the Big Bang cosmological
theory, which postulates that the Universe, born from the Big Bang, went
through a stage lasting no more than a few millionths of a second, during
which matter existed as an extremely hot, dense soup of elementary
particles called Quark-Gluon Plasma (QGP). The LHC artificially re-creates
QGP by triggering “Little Bangs”, and the ALICE detectors observe the
properties of the particles created under these extreme conditions and how
they aggregate to form ordinary matter.

    Fig.1 – ALICE experiment at CERN

Once in operation, ALICE will collect data at the unprecedented rate of up
to 4 PB per year. During its design lifetime of 20 years, ALICE will
produce more than 10^9 data files per year, and will continuously use tens
of thousands of CPUs to process and analyze them.

Since the very conception of the LHC, it was evident that the computing
power needed for the storage and processing of the data collected by the
experiments could not be provided by a single ‘super computing’ centre: the
costs would be too high. Enter the Grid computing paradigm. The concept of
distributed computing follows naturally from the collaborative structure of
the LHC experiments, which are formed by thousands of physicists from
hundreds of institutes and universities around the world, who build and
operate the detectors. Similarly, it was agreed that the participating
countries will also provide the computing power needed for the distributed
storage, processing and analysis of the data. These resources,
heterogeneous and geographically distributed, would be harnessed by the
Grid software.

In addition, this software would allow physicists around the world to use
the raw computing power transparently, without the need to adapt their
programs to the various operating system, security and storage standards in
existence.

Following these general concepts, the ALICE experiment has developed a set
of Grid computing services, the AliEn toolkit. It provides a single user
interface for the ALICE physicists and inter-operates with the various
computational Grids in existence around the world today, such as gLite in
Europe, OSG in the US and ARC in the Scandinavian countries. AliEn has been
in operation for more than 6 years and simultaneously operates over 5000
CPUs and thousands of TB of storage at 60 computing centres worldwide. The
development of AliEn is ongoing, following the latest technology trends in
distributed computing.

The Apple Research & Technology Support (ARTS) program would support a data
analysis project based on three related tasks in ALICE Offline devoted to
enhancing the analysis activities of the collaboration. They consist of a
computing cluster running as a computational node of the AliEn Grid, the
same cluster exploited for parallel analysis using PROOF, and a
general-purpose interactive distributed analysis framework.

2 Project Impact, Related Fields and Usage of the Apple Computing Solution

2.1 A Mac computational Grid node

The first goal is to set up a Mac/Intel cluster running as a computational
node of the Grid in ALICE. Over the years, AliEn (ALICE Environment) has
been the main distributed computing environment developed by the ALICE
collaboration as a production environment for the simulation,
reconstruction and analysis of physics data. AliEn is a lightweight Grid
framework, interfaced to larger Grids, built around open source components
using the combination of web service and a distributed agent model.

The system was put into production for ALICE users at the end of 2001.
Currently ALICE is using AliEn mainly for distributed production of
Monte-Carlo data, detector simulation and reconstruction at over 60 sites
located on four continents. Around half a million ALICE jobs have been run
under AliEn control worldwide during the annual Physics Data Challenge
(PDC) exercises.

    Fig.2 – Sites collaborating in the ALICE experiment using the AliEn Grid

The effort to configure and run jobs on a Mac cluster will be an
interesting proof of the fact that:
    • the whole ALICE Offline computing environment can work on
      heterogeneous platforms (currently mostly Linux);
    • Apple’s solution can integrate seamlessly into the computing
      environment of the largest Grid in the world.

The Mac node should run the AliEn services and the ALICE Offline frameworks
and applications. Although the AliRoot and ROOT frameworks have already
been ported and tested on a Mac, the porting and testing of the AliEn
services is ongoing. AliRoot and ROOT are the OO frameworks used in ALICE
to handle and analyze large amounts of data in an efficient way.

On a computing node, the AliEn Grid runs the following services (see
Fig.3):
    • the Cluster Monitor, which routes messages from the AliEn central
      servers to the remote sites;
    • the Computing Element (CE), which interfaces with the local batch
      system and the Worker Nodes (WNs);
    • the Storage Element (SE), which provides the interface to the local
      Mass Storage System (MSS);
    • the File Transfer Daemon (FTD), which transfers files between
      different sites using gridftp, with host certificates used for mutual
      authentication.

    Fig.3 – AliEn framework and interfaces to the remote

AliEn’s loosely coupled resources architecture, based on a pull-mode
mechanism, will facilitate the addition of a Mac CE/SE node, providing a
good opportunity to benchmark a different platform in a real functioning
Grid system.

2.2 A Mac cluster for parallel analysis

The Apple cluster will also be interfaced to and integrated into the CERN
Analysis Facility (CAF). The CAF is a local computing cluster at CERN that
aims to run and test PROOF, a framework for interactive parallel analysis
based on ROOT.

In its final configuration it will consist of 500 CPUs and at least 50 TB
of data storage. The data sets will be retrieved from the Grid through an
interface to the AliEn middleware. Simulated data, produced during Physics
Data Challenge 2006, are currently used on the prototype CAF system for
tests.

The analysis on CAF using PROOF is conceptually different from the analysis
on the Grid. Due to the limited amount of CPU and storage, CAF cannot be
used to analyze all data collected by the ALICE experiment. However, it
will allow one to obtain results almost immediately, after a few minutes or
even seconds.

This framework is adopted by the ALICE collaboration to do time-critical
analysis of specific portions of the data, for example calibration and
alignment work as well as physics analysis requiring fast feedback. The
PROOF framework, combined with the other ALICE Offline frameworks (ROOT,
AliRoot and AliEn), will allow the ALICE physicists to access the computing
facilities and the experimental data through a coherent and self-contained
set of software tools.
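
As a rough illustration of the kind of parallel analysis PROOF performs on
the CAF, the sketch below splits a data set into small portions, processes
them concurrently and merges the partial results. It is a toy model in
plain Python, not the PROOF or ROOT API: the event records, the function
names and the use of threads in place of worker nodes are all assumptions
made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_packets(events, packet_size):
    """Divide the event list into small consecutive portions."""
    return [events[i:i + packet_size]
            for i in range(0, len(events), packet_size)]


def analyze_packet(packet):
    """Toy analysis: histogram the track multiplicity of each event."""
    return Counter(event["n_tracks"] for event in packet)


def run_parallel_analysis(events, packet_size=100, n_workers=4):
    """Master role: hand packets to workers, then merge partial results."""
    packets = split_into_packets(events, packet_size)
    merged = Counter()
    # Threads stand in for worker nodes here (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(analyze_packet, packets):
            merged.update(partial)  # Counter.update adds the counts
    return merged


# Hypothetical data: 1000 events with a made-up track multiplicity.
events = [{"n_tracks": i % 5} for i in range(1000)]
histogram = run_parallel_analysis(events)
```

Because the merge step only adds counts, the final histogram does not
depend on the number of workers or on the order in which the portions are
processed.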

    Fig.4 – Parallel analysis using the PROOF architecture

The principles of a distributed parallel data analysis project are derived
from a programming paradigm that is a natural consequence of the use of
computers connected through a network. The PROOF design focuses on a
distributed, open, scalable, transparent and fault tolerant system.
Additionally, PROOF aims to exploit the underlying Grid technologies, which
offer access to distributed computing and storage resources.

The primary objective of PROOF is to allow the user to do an interactive
analysis of large amounts of data. To that end, it will parallelize the
analysis task by splitting it into many small portions, each of them
dealing with a specific sub-set of the requested data.

    Fig.5 – CAF Schema

‘Interactive’ in this context means that the user is able to see the
results right away, contrary to a batch job where it is necessary to wait
for the job to finish. ‘Parallel’ means that the split tasks will be
executed on many computing nodes at the same time.

The PROOF system is currently under intensive development and improvement:
various high-performance parallel algorithms and techniques for developing
concurrent processes are being implemented. At the same time it is
necessary to perform intensive tests to establish the scalability and
proportionality of the PROOF algorithms with respect to the hardware in
use. To accomplish this task a significant number of speed-up tests (only
one user exploiting the whole cluster with an increasing number of nodes)
and realistic tests (several users at the same time running different types
of analysis) are being performed.

By adding a new local Mac-based CAF it will be possible to test PROOF
intensively on Apple architecture and proceed with the ambitious project of
realizing a virtual parallel distributed interactive analysis. A single
virtual PROOF master will talk to different “local” masters at each
physical CAF and send them jobs depending on the loads, in order to
reproduce the AliEn architecture in a parallel analysis environment.

Since PROOF allows for interactive analysis in a local cluster environment,
two changes are necessary to enable its usage in a Grid environment: create
a hierarchy of PROOF slave/master servers and use AliEn to allow for job
splitting. In each local CAF cluster, PROOF workers are handled by a PROOF
master server, which distributes tasks and collects results.

2.3 A general-purpose framework for interactive distributed analysis

Exploiting the results from the above activities (PROOF working on a Mac
local cluster, heterogeneous environment, Grid computing on the Mac using
AliEn) requires as a last step the construction of a general-purpose
framework for batch and interactive distributed analysis.

The development of such a framework and its GUI will allow for interaction
with the Grid and CAF nodes described in the previous sections.

The project aims to develop a user interface that will provide
functionality to browse the AliEn catalogue, upload macros for any kind of
analysis (interactive, distributed, batch), specify tag-based filters to
characterize different levels of physics data abstraction and, finally,
transparently run the analysis itself.

    Fig.6 – A prototype of the GUI for interactive analysis

The underlying distributed application wrapped by the GUI essentially
consists of four steps.

1.   The analysis framework has to extract a subset of the datasets from
     the AliEn file catalogue using metadata conditions provided by the
     user. The result is an XML file containing the list of matching logical

     files. For example, you could ask for a collection of ESD files
     corresponding to a given 2007 run, with proton-proton collisions,
     created in a specified period of time.

2.   As a second step, the user can impose event selection criteria to
     reduce the amount of data to be analyzed. ALICE implements an event
     tag system that reduces the time needed for analysis by providing just
     the events of interest as the users themselves specify them.

3.   The user can specify PAR (PROOF ARchive) archives containing all the
     classes that are necessary to do the analysis. Once selected, PAR
     archives are uploaded, built and enabled automatically within the
     distributed analysis environment.

4.   The final step is to steer the analysis transparently towards the
     right distributed (AliEn or CAF) or local

To accomplish this final goal, tasks have to be split according to the
location of data sets. The Grid system must achieve a balanced operation
between optimal use of the available resources and minimal data movements.
Ideally jobs should be executed where the data are stored. Since one cannot
expect a uniform storage location distribution for every subset of data,
the analysis framework has to negotiate with dedicated Grid services the
balancing between local data access and data

The batch distributed analysis will be achieved by submitting sub-jobs to
the Resource Broker with precise job descriptions (JDL).

The interactive distributed analysis will be achieved by batch jobs that
will register themselves to the local PROOF master. This PROOF master will
register to a SuperPROOF master that, finally, will start the job
distribution throughout the PROOF workers hierarchy.

This is the scenario of a multi-site configuration, where each PROOF
cluster can be seen as a SuperPROOF slave server of a SuperPROOF master
server running on the user machine. Each PROOF master server therefore has
to implement functionality to act as both a PROOF master server and a
SuperPROOF slave server at the same time.

    Fig.7 – Multi-site configuration of PROOF

Task splitting is achieved by using AliEn classes for asynchronous
analysis. The SuperPROOF master server assigns the input data sets for each
site that runs PROOF locally: these data sets are locally readable, hence
each local PROOF master can distribute them to the PROOF slaves as in a
conventional local cluster.

A progress bar within the GUI should show the user the level of job
completion while data are being processed. At the end, the distributed
analysis framework collects and merges available results from all
terminated sub-jobs on request, and provides persistency facilities in the
Grid environment to allow the users to go offline and reload an analysis.

3 Project Timescale
