Hadoop File System as part of a CMS Storage Element

   1. Introduction

In the last year, several new storage technologies have matured to the point where they have become
viable candidates for use at CMS Tier2 sites. One particular technology is HDFS, the distributed
filesystem used by the Hadoop data processing system.

   2. The HDFS SE
HDFS is a file system; as such, it must be complemented by other components in order to build a grid
SE. We consider a minimal set of additional components to be:
   a) FUSE / FUSE-DFS: FUSE, a standard Linux kernel module, allows filesystems to be written in
       userspace. The FUSE-DFS library exposes HDFS as a FUSE filesystem. This provides a
       POSIX-like interface to HDFS, which is necessary for user applications.
   b) Globus GridFTP: Provides a WAN transfer protocol. Using Globus GridFTP with HDFS
       requires a plugin developed by Nebraska.
   c) BeStMan: Provides an SRMv2 interface. UCSD and Nebraska have implemented plugins for
       BeStMan to allow smarter selection of GridFTP servers.
There are optional components that may be layered on top of HDFS, including XrootD, Apache HTTP,
and FDT.

   3. Requirements
This document aims to show that the combination of HDFS, FUSE, Globus GridFTP, BeStMan, and
some plugins developed by the CMS Tier-2 teams at Nebraska, Caltech, and UCSD meets the SE
requirements set forth by USCMS Tier-2 management. The requirements are typeset in italics, and the
responses are given below them in normal typesetting. Throughout this document, we use “Hadoop”
and “HDFS” interchangeably; strictly speaking, “Hadoop” refers to the entire data processing system,
but in this context we take it to refer only to the filesystem components.

   4. Management of the SE

Requirement 1. A SE technology must have a credible support model that meets the reliability,
availability, and security expectations consistent with the area in the CMS computing infrastructure in
which the SE will be deployed.

Support for this SE solution is provided by a combination of OSG, LBNL, Globus, the Apache
Software Foundation (ASF), DISUN, and possibly the US CMS Tier-2 program as follows.

BeStMan is supported by LBNL and GridFTP by Globus; both are part of the OSG portfolio of storage
solutions. HDFS is supported by the ASF, as elaborated below, and FUSE is part of the standard Linux
distribution. This leaves two types of plugins that are required to integrate all the pieces into a
system, as well as the packaging support. The first is a plugin to BeStMan to pick a GridFTP server
from a list of available GridFTP servers; this list is reloaded every 30 seconds from disk and servers are
randomly selected (otherwise, the default policy is a simple round-robin, and altering the GridFTP
server list requires an SRM restart). The second is a GridFTP plugin to interface to HDFS. We propose that
both of these types of plugins continue to be supported by their developers from DISUN and US CMS
until OSG has had an opportunity to gain sufficient experience to adopt the source and own them.
Similarly, packaging support in the form of RPMs is to be provided initially by DISUN/Caltech, and later
by US CMS/Caltech, until OSG has had an opportunity to adopt it as part of a larger migration towards
providing native packaging in the form of RPMs. OSG ownership of these software artifacts has been
agreed upon in principle, but not yet formally. OSG support for this solution will start with Year 4 of
OSG (October 1st, 2009), and will initially be restricted to:

   a) Pick a set of RPMs twice a year, verify that this set is completely consistent, providing a well
      integrated system. We refer to this as the “golden set”.
   b) Document installation instructions for this golden set.
   c) Do a simple validation test on supported platforms (with the validation preferably automatic).
   d) Performance test the golden set, and document that test. This performance testing will be in-depth.
   e) Provide operations support for two golden sets at a time. This means that there is a staff person
      on OSG who is responsible for tracking support requests, answering simple questions, and
      finding solutions to difficult questions via the community support group organized in osg-
      hadoop@opensciencegrid.org listserv.
   f) OSG will provide updates to a golden set only for important bug and security fixes; these
      critical patches will go through validation test, but not performance tests.

Support will be official for RHEL4- and RHEL5-derived distributions on both 32-bit and 64-bit platforms. The
core HDFS software (the namenode and datanode) is usable on any platform providing the Java 1.6
JDK. Currently, Caltech and Nebraska both run datanodes on Solaris. The main limitation for FUSE
clients is support for the FUSE kernel module; this is supported on any Linux 2.6 series kernel, and
FUSE was merged into the kernel itself in 2.6.14.

Upgrades in HDFS are covered in this wiki document:
http://wiki.apache.org/hadoop/Hadoop_Upgrade. On our supported version, upgrades within the same
major version will only require a “yum update”. For major version upgrades, the procedure is:
    a) Shutdown the cluster.
    b) Upgrade via yum or RPMs.
    c) Start the namenode manually with the “-upgrade” flag.
    d) Start the cluster. The cluster will stay in read-only mode.
    e) Once the cluster’s health has been verified, issue the “hadoop dfsadmin -finalizeUpgrade”
        command. After the command has been issued, no rollback may be performed.
The wiki explains additional recommended safety precautions.
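
For concreteness, the major-version procedure above might look roughly as follows on an RPM-based
install. This is a hedged sketch; the service and package names are assumptions that depend on the
local (DISUN/Caltech) packaging, and sites should follow the wiki for the authoritative steps.

         # Hedged sketch of the major-version upgrade above; names are illustrative.
         service hadoop stop                  # a) shut down the cluster daemons on every node
         yum update hadoop                    # b) upgrade via yum/RPM
         hadoop namenode -upgrade             # c) start the namenode manually with the -upgrade flag
         service hadoop start                 # d) start the remaining daemons; HDFS stays read-only
         hadoop fsck /                        # verify the cluster's health
         hadoop dfsadmin -finalizeUpgrade     # e) finalize; no rollback is possible after this point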

Hadoop is an open-source project hosted by the Apache Software Foundation. The ASF hosts multiple
mailing lists for both users and developers, which are actively watched by both the Hadoop developers
and the larger Hadoop community. Members of this community include Yahoo employees who use
Hadoop in their workplace, as well as employees of Cloudera, which provides commercial packaging
and support for Hadoop. As it is a top-level Apache project, it has contributors from at least three
companies and strong project management. Yahoo has stated that it has invested millions of dollars
into the project and intends to continue doing so (http://developer.yahoo.net/blogs/hadoop).

Hadoop depends heavily upon its JIRA instance, http://issues.apache.org/jira/browse/HADOOP, for
bug reporting and tracking. As it is an open-source project, there’s no guarantee of response time to
issues. However, we have found high-priority issues (those that may lead to data loss) get solved
quickly because bugs affecting a T2 site have a high probability of affecting Yahoo’s production
infrastructure.

Requirement 2. The SE technology must demonstrate the ability to interface with the global data
transfer system PhEDEx and the transfer technologies of SRM tools and FTS as well as demonstrate
the ability to interface to the CMSSW application locally through ROOT.

Caltech is using Hadoop exclusively for OSG RSV tests, PhEDEx load tests, user stageouts via SRM,
ProdAgent stageout via SRM and POSIX-like access via FUSE. User analysis jobs submitted to
Caltech with CRAB have been running against data stored in Hadoop for several months.

UNL has been using HDFS at large scale since approximately late February 2009. It is used for:
   • PhEDEx data transfers for all links (transfers are done with SRM and FTS).
   • OSG RSV tests (used to meet the WLCG MoU availability requirements).
   • Monte Carlo simulation and merging.
   • User analysis via CMSSW (both grid-based submissions using CRAB and local interactive
     analysis).

UCSD has been using HDFS for serving /store/user.

Caltech, UNL, and UCSD have been leading the effort to demonstrate the scalability of the BeStMan
SRM server. This has recently resulted in an srmLs processing rate of 200 Hz on a single server; this is 4
times the rate required by CMS for FY2010, and is in fact the most scalable SRM solution
deployed in global CMS.

Requirement 3. There must be sufficient documentation of the SE so that it can be installed and
operated by a site with minimal support from the original developers (i.e. nothing more than "best
effort"). This documentation should be posted on the OSG Web site, and any specific issues in
interfacing the external product to CMS product should be highlighted.

Installation, operation, and troubleshooting directions can be found at
http://twiki.grid.iu.edu/bin/view/Storage/Hadoop. This has already been discussed in Requirement 1,
and we elaborate further here. It should be feasible for any site admin to install HDFS and the
corresponding grid components without support from the original developers. The packaging provides a coherent
install experience; all components, including BeStMan and GridFTP, are available as RPMs. Admins
experienced with the RedHat tool “yum” will find that the SE is installable via a simple “yum install
hadoop”. The DISUN/Caltech packaging also provides useful logging defaults that allow errors
occurring in HDFS to be logged centrally; this greatly aids admins in troubleshooting.

With the OSG 1.2 release, there are no specific issues pertaining to using HDFS with CMS.

Experience shows that operational overhead at Caltech has been equivalent to approximately 1 FTE,
and that includes R&D activities for packaging and testing. Going forward, we believe the overhead
will decrease, as the R&D portions will be greatly reduced. Nebraska, which has been running Hadoop
in a stable state for several months, reports that operational overhead is less than 1 FTE. UCSD, which is
presently supporting HDFS for /store/user (113TB) and dCache for everything else (273TB), reports the same
experience. The main reason this solution is less costly to operate is that it has
far fewer moving parts; this reduced complexity results in significantly lower operational
overhead.

Requirement 4. There must be a documented procedure for how problems are reported to the
developers of those products, and how these problems are subsequently fixed.

Starting with Year 4 of OSG, all problems are reported via OSG. OSG then uses the support
mechanisms as discussed in Requirement 1.

In particular, Hadoop has an online ticket system called JIRA. JIRA is heavily used by the developers
to receive and track bug and feature requests for Hadoop and Hadoop-related projects. JIRA is open
for viewing by everyone, and requires a simple account registration for posting comments and new
tickets. All commits in the project must be traceable via JIRA and go through the quality control
process (which includes code review by a different developer and passing automated tests).

The JIRA system can be found at http://issues.apache.org/jira/browse/HADOOP. The Hadoop
community has also written a guide to filing bug reports,
http://wiki.apache.org/hadoop/HowToContribute.

Requirement 5. Source code required to interface the external product to CMS products must be
made available so that site operators can understand what they are operating. If at all possible, source
code for the external product itself should also be available.

All software components are open source. In particular, Hadoop source code can be downloaded from
http://www.apache.org/dyn/closer.cgi/hadoop/core/. Patches for specific problems in a release can be
downloaded from JIRA (see above). Any patches currently applied to the Caltech Hadoop distribution
have been submitted to JIRA, and we have tried to make sure that they get committed in a timely
manner. This helps us minimize the costs of maintaining our own RPM installs.

Yahoo has publicly committed to releasing its “stable patch set” to the world beginning with 0.20.0.
They are committed to keeping all patches used publicly available in the JIRA; the only added
knowledge is which patches are stable enough to be applied to current releases. When we upgrade to
this version, we will be able to tap into this significant resource; Yahoo has a committed QA team and a
test cluster an order of magnitude larger than our production cluster.

The BeStMan source code is open source for academic users; the OSG is working on clearing up the
licensing of this software and making the code more freely available. Each of the plug-ins used (for
BeStMan and GridFTP) is available in the Nebraska SVN repository, which allows anonymous access.

   5. Reliability of the SE

Requirement 6. The SE must have well-defined and reliable behavior for recovery from the failure
of any hardware components. This behavior should be tested and documented.

We broadly classify the HDFS grid SE into three parts: metadata components, data components, and
grid components. Below, we document the risks a failure in each poses and the suggested recovery
mechanisms.
     • Metadata components: The two major metadata components of HDFS are the namenode
       and the “secondary namenode” (backup) server. The namenode is the single point of failure
       for normal user operation. Because of this, there are several built-in protections each site is
       recommended to take:
          o Write out multiple copies of the journal and file system image. HDFS relies heavily on
            the metadata journal as a log of all the operations that alter the namespace. The HDFS
            configuration allows the journal to be written onto multiple partitions (see the
            configuration sketch after this list). It is recommended to write the journal to two
            separate physical disks in the namenode, and suggested that a third copy be written to
            an NFS server.
          o The secondary namenode / backup server allows the site admin to create checkpoint files
            at regular intervals (the default is every hour, or whenever the journal reaches 64MB in
            size). It is strongly recommended to run the secondary namenode on a separate
            physical host. The last two checkpoints are automatically kept by the secondary
            namenode.
          o The checkpoints should be archived, preferably off-site.
       Future versions of HDFS (0.21.x) plan to provide an offline checkpoint verification tool and
       to stream journal information to the backup node in real time, rather than relying on an hourly
       checkpoint.
       In the case of namenode failure, the following remedies are suggested:
          1. Restart the namenode from the file system image and journal found on the namenode’s
             disk. The only resulting file loss will be the files being written at the time of namenode
             failure. This action will work as long as the image and journal are not corrupted.
          2. Copy the checkpoint file from the secondary namenode to the namenode. Any files
             created between the time the checkpoint was made and the namenode failure will be
             lost. This action will work as long as the checkpoint has not been corrupted.
          3. Use an archived checkpoint. Any files created between the checkpoint creation and the
             namenode failure will be lost. This action will work as long as one good checkpoint
             exists.
       Note that all the namenode information is kept in two files written directly to the Linux
       filesystem.
       If none of these actions work, the entire file system will be lost; this is why we place such
       importance on backup creation. In addition to the normal preventative measures, the following
       can be done:
          1. Have hot-spare hardware available. HDFS is offline when the namenode is offline. If
             the failure does not have an immediate, obvious cause, we recommend using the standby
             hardware instead of prolonging the downtime by troubleshooting the issue.
          2. Create a high-availability (HA) setup. It is possible, using DRBD and Heartbeat, to
             completely automate recovery from a namenode failure using a primary-secondary
             setup. This would allow services to be restored automatically on the order of a minute,
             with the loss of only the files which were being created at the time of the
             failure. As most of the CMS tools automatically retry writing into the SE, the actual
             file loss would be minimal. Such high-availability setups have been
             used by external companies, but not yet by a grid site; the extra setup complexity is
             not yet perceived as worth the time.
     • Data components: Datanode loss is an expected occurrence in HDFS, and there are multiple
       layers of protection against it.
          1. The first layer of protection is block replication. Hadoop has a robust block replication
             feature that ensures that duplicate file blocks are placed on separate nodes, and even
             separate racks, in the cluster. This helps ensure that complete copies of every file remain
             available if a single node, or even an entire rack, becomes unavailable.
          2. Hadoop periodically requests an entire block report from every datanode. This protects
             against synchronization bugs where the namenode’s view of a datanode’s contents
             differs from reality.
          3. It is possible for a bad hard drive to cause corruption without the drive failing outright.
             Each datanode schedules each block to be checksummed once every 2 weeks. If a block
             fails the checksum, it is deleted and the namenode is notified, allowing automatic healing
             from hard drive errors.
     • Grid components: Grid components require the least amount of failure planning because they
       are stateless in the HDFS SE. Multiple instances of the grid components (GridFTP, BeStMan
       SRM) can be installed and used in a failover fashion using Linux LVS or round-robin DNS.
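
Referring back to the first metadata bullet above, the multiple-journal-copy recommendation maps onto
a single HDFS configuration property. A hedged hdfs-site.xml sketch (the paths are illustrative) is:

         <!-- Hedged hdfs-site.xml sketch: the namespace image and edit journal are
              written to every directory listed, here two local disks plus an NFS
              mount (paths are illustrative). -->
         <property>
           <name>dfs.name.dir</name>
           <value>/disk1/hadoop/name,/disk2/hadoop/name,/mnt/nfs-backup/hadoop/name</value>
         </property>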

We are aware of one significant failure mode. BeStMan SRM is known to lock up under very heavy
load (over 1000 concurrent SRM requests, at least tenfold what has been observed in
production), and requires a restart when this happens. We believe this to be a problem with the Globus
Java container. BeStMan is scheduled to transition to a different container technology in Fall 2009 with
BeStMan 2. OSG is committed to following this issue and validating the new version when it is released.
However, this issue manifests itself only under extreme loads at the existing deployments, and is thus
presently not an operational problem.

In addition, there is a minor issue with the way HDFS handles deployments in which server nodes
serve widely varying amounts of space. HDFS needs to be routinely re-balanced, which is typically done
via crontab in extreme cases (e.g. Nebraska), and manually once every two weeks or so for
deployments where disk space differs by no more than a factor of 2-4 between nodes (e.g. UCSD).
There is no significant operational impact from routinely running the balancer beyond network traffic;
HDFS allows the site to throttle the per-node transfer rate used by the balancer. The imbalance arises
because the selection of servers for writing files is mostly random, while the distribution of free space
is not necessarily random.
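
As an illustration, the routine re-balancing described above can be driven from cron. The threshold
(allowed percent deviation from the cluster-average utilization) below is an illustrative value, and the
per-node throttle mentioned above is, to the best of our knowledge, the dfs.balance.bandwidthPerSec
setting in hdfs-site.xml.

         # Hedged sketch of a nightly re-balancing cron entry (threshold value illustrative).
         30 2 * * *  hadoop balancer -threshold 10 >> /var/log/hadoop/balancer.log 2>&1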


Requirement 7. The SE must have a well-defined and reliable method of replicating files to protect
against the loss of any individual hardware system. Alternately there should be a documented hardware
requirement of using only systems that protect against data loss at the hardware level.

Hadoop uses block decomposition to store its data, breaking each file up into blocks of 64MB apiece
(this block size is a per-file setting; the default is 64MB, but most sites have increased this to 128MB
for new files). Each file has a replication factor, and the HDFS namenode attempts to keep the correct
number of replicas of each of the blocks in the file. The replication policy is:
    a) First replica goes to the “closest datanode” – the local node is the highest priority, followed by a
        datanode on the same rack, followed by any data node.
    b) Second replica goes to a random datanode on a different rack
    c) Third replica (if requested) goes to a random datanode on the second rack.
If two replicas are requested, they will end up on two separate racks; if three replicas are requested,
they will also span two separate racks (two replicas on one rack and one on the other).

The replication level is set by the client at the time it creates the block. It may be
increased or decreased by the admin at any time on a per-file basis, and this can be done recursively. The namenode
attempts to satisfy the client’s request; as long as the number of successfully created replicas is between
the namenode’s configured minimum and maximum, HDFS considers the write a success. Because
the client requests a replication level for each write, one cannot set a default replication for a directory
tree.
For example, at Caltech, a cron script automatically sets the replication level on known directories in
order to ensure clients request the desired replication level. They currently use:

         hadoop fs -setrep -R 3 /store/user
         hadoop fs -setrep -R 3 /store/unmerged
         hadoop fs -setrep -R 1 /store/data/CRUZET09
         hadoop fs -setrep -R 1 /store/data/CRAFT09
It is not possible to pin files to specific datanodes, or to set replication based on the datanodes where
the files will be located. Hadoop treats all datanodes as equally unreliable.

The namenode keeps track of all block locations in the system, and will automatically delete replicas or
create new copies when needed. Replicas may be deleted either when the associated file is removed or
when a block has more replicas than desired. Datanodes send a heartbeat signal once every 3
seconds, allowing the namenode to stay up to date with the system status.

For example, when a datanode fails, the namenode allows it to miss up to 10 minutes of heartbeats (this
setting is configurable). Once the node is declared dead, the namenode starts to make inter-node transfer requests to
bring the blocks that were on the datanode back up to the desired replication level. This is often quite quick;
the desired replication level can be restored in around an hour per TB (the majority of the transfers go faster than
that; a good portion of the hour is “transfer tails”). When a node is decommissioned instead, the system does
not preferentially copy data off the decommissioned node. Assuming the number of replicas is greater than
one, the replication load is distributed randomly throughout the cluster. The percentage of transfers
the decommissioned node will serve is (# of TB of files on the datanode)/(# of TB in the entire system). Because
older nodes typically have smaller disks, they comparatively receive less load.

If a dead node reappears (host rebooted/fixed, disk is physically moved to a different node), the blocks
it previously hosted will now be overreplicated. The namenode will then reduce the number of replicas
in the system, starting with replicas located on nodes with the least amount of free space (as a
percentage of total space on the node). If the blocks belonged to files which have been deleted, the
node will be instructed to delete them in the response to its block report.

The number of under-replicated blocks can be seen by viewing the system report using fsck or by
looking at the namenode’s Ganglia statistics.
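
For example, a hedged one-liner for pulling the under-replication summary out of the fsck report:

         # The fsck summary includes an "Under-replicated blocks" line; grep it out.
         hadoop fsck / | grep -i "under-replicated"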

The success of Hadoop's automatic block replication was seen when Caltech suffered a simultaneous
failure of 3 large (6TB) datanodes on the evening of Sunday, Jul. 12:

       “Within about an hour of installing a faulty Nagios probe, 3 of the 2U datanodes had crashed,
       all within minutes of each other. Each of the 2U datanodes hosts just over 6TB of raid-5 data.
       Nagios started sending alarms indicating that we had lost our datanodes. We
       had seen kernel panics caused by this nagios probe before, so we had no trouble locating the
       cause of the problem. The immediate corrective action was to disable this Nagios probe. This
       was done immediately to avoid further loss of datanodes.

       Ganglia showed Hadoop reporting ~80k underreplicated blocks, and Hadoop started replicating
       them after the datanodes failed to check in after ~10 minutes. The network activity on the
       cluster jumped from ~200MB/s to 2GB/s. Since we run Rocks, 2 of the machines went into a
       reinstall immediately after dumping the kernel core file. Within an hour they were back up and
       running. Hadoop did not start automatically, however, due to a Rocks misconfiguration. I
       started Hadoop manually on these two nodes, which caused our underreplicated block count to
       drop from ~20k to ~4k. Within ~90 minutes the number of underreplicated blocks was back to
       zero.

       At this point we still had one datanode that had not recovered. I checked the system console
       and discovered that the system disk had also died. Since it was late on a Sunday evening, and
       we run 2X replication in Hadoop, I decided we could leave the datanode offline and wait until
       the following morning to replace the disk.

       After the replication was done, I ran “hadoop fsck /” to check the health of HDFS. To my
       surprise, hadoop was reporting 40 missing blocks and 33 corrupted files. This seemed strange
       because all of our files (aside from LoadTest data) is replicated 2x, so the
       loss of a single datanode should not have caused the loss of any files (aside from LoadTest
       data). After parsing the output of “hadoop fsck /”, I found that we had been accidentally setting
       the replication on /store/relval to 1, instead of leaving it at the default of 2. This was fixed.

       The next morning we replaced the system disk in the dead datanode and waited for it to reinstall
       (which took a few hours longer than it ought have). Almost immediately after starting Hadoop
       on this last datanode, hadoop fsck reported the filesystem was clean again. By 2:15 the
       following afternoon everything had returned to normal and hadoop was healthy again. ”

Requirement 8. The SE must have a well-defined and reliable procedure for decommissioning
hardware which is being removed from the cluster; this procedure should ensure that no files are lost
when the decommissioned hardware is removed. This procedure should be tested and documented.

The process of decommissioning hardware is documented in the Hadoop twiki under the Operations
guide. The process goes approximately like this:
   1. Edit the hosts exclude file to exclude the to-be-decommissioned host from the cluster.
   2. Issue the “refreshNodes” command in the Hadoop CLI to get the namenode to re-read the file.
       The node should show up as “Decommissioning” in the web interface at this point.
   3. Watch the web interface or the “report” command in the Hadoop CLI and wait until the node is
       listed as “dead”.

This process is not only straightforward but also routine at each site. At Nebraska, decommissioning is
done whenever a node needs to be taken offline for any upgrade lasting more than 10 minutes.
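
As a concrete illustration, the steps above reduce to something like the following on the namenode;
the exclude-file path is whatever dfs.hosts.exclude points to at the site, and the hostname is illustrative.

         # Hedged sketch of the decommissioning procedure (run on the namenode).
         echo "node042.example.edu" >> /etc/hadoop/hosts.exclude    # 1) add the host to the exclude file
         hadoop dfsadmin -refreshNodes                              # 2) make the namenode re-read the file
         hadoop dfsadmin -report                                    # 3) watch until the node is listed as dead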

Requirement 9. The SE must have well-defined and reliable procedure for site operators to
regularly check the integrity of all files in the SE. This should include basic file existence tests as well
as the comparison against a registered checksum to avoid data corruption. The impact of this operation
(e.g. load on system) should be documented.

Hadoop’s command line utility allows site admins to regularly check the file integrity of the system. It
can be viewed using “hadoop fsck /”. At the end of the output, it will either say the file system is
“HEALTHY” or “CORRUPTED”. If it is corrupted, it provides the outputs necessary to repair or
remove broken files.
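
A hedged sketch of how a site might run this check routinely from cron (the notification address is
illustrative):

         #!/bin/bash
         # Nightly integrity check: mail the admins if fsck does not report a healthy filesystem.
         OUT=$(mktemp)
         hadoop fsck / > "$OUT" 2>&1
         if ! grep -q "HEALTHY" "$OUT"; then
             mail -s "HDFS fsck reports corruption" hadoop-admins@example.edu < "$OUT"
         fi
         rm -f "$OUT"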

HDFS registers a checksum at the block level within the block’s metadata. HDFS automatically
schedules background checksum verifications (default is to have every block scanned once every 2
weeks) and automatically invalidates any block with the incorrect checksum. The checksumming
interval can be adjusted downward at the cost of increased background activity on the cluster. We do
not currently have statistics on the rate of failures avoided by checksumming.

Whenever a file is read by a client (even partially – checksums are kept for every 4KB), the client
receives both the data and the checksum and verifies the validity of the data on the client side. Similarly,
when a block is transferred (for example, through rebalancing), the checksum is computed by the
receiving node and compared to the sender’s data.

Note about catastrophic loss:
We have emphasized that with 2 replicas, file loss is very rare because:
     • Failures that are well separated in time (more than about 2 hours between failures) cause little
       to no loss, because re-replication in the file system is extremely fast; a rule of thumb is to expect
       about 1TB per hour to be re-replicated.
     • Multi-disk failures happening within an hour are usually due to some common piece of
       equipment (such as the rack switch or PDU). Rack-awareness prevents an entire rack
       disappearing from causing file loss.
However, what happens if we assume that all safeguards are bypassed and 2 disks are
lost? This is not without precedent; at Caltech, a misconfiguration told Hadoop that 2 nodes on the same
rack were on different racks. This bypassed the normal protections from rack awareness. The rack’s
PDU failed and two disks failed to come back up. Caltech lost 54 file blocks.
Using the binomial distribution, the expected number of blocks lost is:
                         (# of blocks lost) = (# of blocks) * P(single block loss)
The binomial distribution is appropriate because the loss of one block does not affect the probability of
another block loss. The standard deviation is approximately the square root of the number of blocks
lost. The probability of a single block loss is
                     P(single block loss) = P(block on node 1) * P(block on node 2)
The probability a block is on a given node is approximately:
                 P(block on node 1) = (replication level)*(size of node)/(size of HDFS)
assuming that the cluster is well-balanced and blocks are randomly distributed. Both assumptions
appear to be safe in currently-deployed clusters.

Plugging in Caltech’s numbers (1,540,263 blocks; 342.64 TB in the system, each lost disk was 1TB),
the expected number of lost blocks was 52.4 with a standard deviation of 7.2. This is strikingly close to
the actual loss, 54 blocks.
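
As a cross-check, the calculation can be reproduced directly from the formulas above (awk is used here
only for the arithmetic; the inputs are Caltech’s figures quoted above):

         # Expected block loss for two simultaneously lost 1TB disks.
         awk 'BEGIN {
             nblocks = 1540263;                    # blocks in the system
             repl    = 2;                          # replication level
             node_tb = 1.0;  hdfs_tb = 342.64;     # size of each lost disk / of HDFS (TB)
             p_node  = repl * node_tb / hdfs_tb;   # P(a given block has a replica on a given node)
             p_loss  = p_node * p_node;            # P(both replicas were on the two lost disks)
             mean    = nblocks * p_loss;
             sigma   = sqrt(nblocks * p_loss * (1 - p_loss));
             printf "expected blocks lost: %.1f +/- %.1f\n", mean, sigma;   # ~52 +/- 7
         }'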

If only complete files were written (i.e., no block decomposition), then the expected loss would be

                            (# of files lost) = (# of files) * P(single block loss)

So, assuming a 128MB block size and experiment files of around 1GB (roughly 8 blocks per file), the
number of files lost would be about 10x lower. In other words, a CMS site would lose roughly 10x more
files using HDFS block decomposition. We believe this is an acceptable
risk, especially as the recovery procedure for 5 files versus 50 files is similar. In the case of
a simultaneous triple-disk failure on triple-replicated files, the expected loss would be less than 1 file for
Caltech’s HDFS instance.

Requirement 10. The SE must have well-defined interfaces to monitoring systems such as Nagios so
that site operators can be notified if there are any hardware or software failures.
HDFS integrates with Ganglia; provided that the site admin points HDFS to the right Ganglia endpoint,
many relevant statistics for the namenode and datanodes appear in the Ganglia gmetad webpages.
Many monitoring and notification applications can set up alerts based on this.
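
Pointing HDFS at Ganglia is done through the hadoop-metrics.properties file shipped with Hadoop; a
hedged sketch follows (the collector host and port are illustrative, and the same pattern applies to the
other metrics contexts):

         # Hedged hadoop-metrics.properties sketch: send dfs metrics to the site's
         # Ganglia collector every 10 seconds.
         dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
         dfs.period=10
         dfs.servers=ganglia.example.edu:8649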

Caltech has also contributed several HDFS Nagios plugins to the public that monitor various aspects of
the health of the system directly. They have released a Tcl-based desktop application, “gridftpspy”,
which monitors the health and activity of the Globus GridFTP servers. Some of these are based on the
JMX (Java Management eXtensions) interface into HDFS; JMX can integrate with a wider range of
monitoring systems. There is also an external project providing Cacti templates for monitoring HDFS.
The Nagios and gridftpspy components are packaged in the Caltech yum repository, but not officially
integrated; we foresee labeling them experimental for the first OSG-supported release.

Finally, Caltech has developed the “Hadoop Chronicle”, a nightly email that sends administrators the
basic Hadoop usage statistics. This has an appropriate level of detail to inform site executives about
Hadoop’s usage. The Hadoop Chronicle is now part of the OSG Storage Operations toolkit. It is
currently in use at Caltech and in testing at Nebraska.

Note about admin intervention:
The previous two requirements start to cover the question of what HDFS activities site admins engage
in, and at what intervals. We have the following feedback from Nebraska and Caltech site admins,
respectively:
     • Nebraska:
            o Daily tasks: check the Hadoop Chronicle, look at RSV monitoring.
            o Once a week: clean up dead hardware, restart dead components. The component which
                crashes most often is BeStMan, at about once every 2 weeks.
            o Once every 2 months: some sort of data recovery or in-depth maintenance. Examples
                include debugging an under-replicated block or recovering a corrupted file.
     • Caltech (note: Caltech runs an experimental kernel, which may explain why there is more
        kernel-related maintenance than at Nebraska):
            o Continuously: wait for Nagios alerts.
            o Hourly tasks: check namenode web pages and GridFTP logs via gridftpspy (admittedly a
                bit excessive).
            o Daily tasks: read the Hadoop Chronicle, browse PhEDEx rate/error pages.
            o Weekly tasks: reboot nodes due to kernel panics, adjust the GridFTP server list (BeStMan
                plugin currently not used), track down lost blocks (for datasets replicated once),
                maintain the Rocks configuration.
            o Once a month: reboot the namenode with a new kernel, reinstall data nodes with bugfix
                updates.

   6. Performance of the SE

All aspects of performance must be documented.

Requirement 11. The SE must be capable of delivering at least 1 MB/s/batch slot for CMS
applications such as CMSSW. If at all possible, this should be tested in a cluster on the scale of a
current US CMS Tier-2 system.

To test this requirement, Caltech ran a test using dd to read from HDFS through the fuse mount on each
of the 89 worker nodes on the Tier2 cluster. dd was used to maximize the throughput from the storage
system. We acknowledge that the IO characteristics from dd are not identical to that of CMSSW
applications, which tend to read smaller chunks of data in random patterns. Each worker node ran 8 dd
processes in parallel, one per core. Each dd process/batch slot on a single worker node read a different
2.6GB file from HDFS 10 times in sequence. The same 8 files were read from each of the 89 worker
nodes. At the end of each file read, dd reported the rate at which the file was read. A total of 18.1TB
was read during this test. The final dd was finished approximately 4.25 hours after the test was started.
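
A hedged sketch of the per-node test just described (the mount point and file names are illustrative):

         #!/bin/bash
         # One dd reader per core (8 total); each reads its own ~2.6GB file from the
         # FUSE mount 10 times and lets dd report the rate for every pass.
         MOUNT=/mnt/hadoop/store/test              # illustrative FUSE mount point and directory
         for core in $(seq 0 7); do
             (
                 for pass in $(seq 1 10); do
                     dd if="$MOUNT/file${core}.root" of=/dev/null bs=1M 2>&1 | tail -n 1
                 done
             ) &
         done
         wait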




The average read rate reported by dd was 2.3MB/s ± 1.5MB/s. The fastest read was 22.8MB/s and the
slowest was 330KB/s.

The aggregate rate delivered from HDFS was 18.1TB / 4.25 hours ≈ 1238MB/s, or approximately 155MB/s
(roughly 1.2Gbps) per HDFS file.

It should be noted that this test was run while the cluster’s batch slots were also 100% occupied by CMS
production and CMS analysis jobs, most of which were reading from and writing to Hadoop at the same
time. The background HDFS traffic from this CMS activity is not included in these results.

UCSD ran a separate test with a standard CMSSW application consuming physics data. The same
application has been used for computing challenges and scalability exercises, and it is very I/O
intensive. Here we mainly focused on the application reading data located locally in
Hadoop.

During the tests, 15 datanodes held the data files, each 1GB in size. The block size in UCSD's
Hadoop is 128MB and the replication of the data files is set to 2, so for each file there are 16 block replicas well
distributed across all the datanodes. The application was configured to run against 1 file or 10 files per
job slot. The number of jobs running simultaneously ranged from 20 to 200, with a maximum of
250, roughly a quarter of the available job slots at UCSD at that
time. The rest of the slots were running production or user analysis jobs, so the test was run under
very typical Tier-2 conditions. The test application itself did not significantly change the overall
condition of the cluster.

The ratio of the average number of job slots running the tests to the number of Hadoop datanodes ranged from 10 to 20.
Eventually this ratio will be 8 if all the WNs are configured as Hadoop datanodes and each WN runs 8
slots. This will increase the I/O capability per job slot by 50-100% relative to the results measured in this
test.
The average processing time per job was 200 and 4000 seconds for the application processing 1 and 10
GB of data, respectively. The average I/O rates while reading the data are shown in the figures below: average I/O
for the application consuming 1GB (left) and 10GB (right). The test shows that the 1MB/s per slot requirement
is at the low end of the rate actually delivered by HDFS; the average is ~2-3MB/s per job.




Requirement 12. The SE must be capable of writing files from the wide area network at a
performance of at least 125MB/s while simultaneously writing data from the local farm at an average
rate of 20MB/s.


Below is a graph for the Nebraska worker node cluster.
During this time, HDFS was servicing user requests at a rate of about 2500/sec (as determined by
syslog monitoring using the HadoopViz application). Each user request is a minimum of 32KB, so this
corresponds to at least 80MB/s of internal traffic. At the same time, we were writing in excess of 100MB/s as
measured by PhEDEx.

Below is an example of HDFS serving data to a CRAB-based analysis launched by an external user. At
the time (December 2008), the read-ahead was set to 10MB. This provided an impressive amount of
network bandwidth (about 8GB/s) to the local farm, but is not an everyday occurrence. The currently
recommended read-ahead size is 32KB.

Requirement 13. The SE must be capable of serving as an SRM endpoint that can send and receive
files across the WAN to/from other CMS sites. The SRM must meet all WLCG and/or CMS requirements
for such endpoints. File transfer rates within the PhEDEx system should reach at least 125MB/s
between the two endpoints for both inbound and outbound transfers.




During Aug. 20-24, Caltech and Nebraska ran inter-site load tests using PhEDEx to exercise the
gridftp-hdfs servers.
During this time period, PhEDEx recorded a 48-hour average of 171MB/s coming into the Caltech
Hadoop SE, with files primarily originating from UNL. Peak rates of up to 300MB/s were observed.
There was a temporary drop to zero at ~23:00 Aug. 24 due to an expired CERN CRL.




During this same time period, Caltech was exporting files at an average rate of 140MB/s, with files
primarily destined for UNL. For several hours during this time period the transfer rates exceeded
200MB/s.

It must be noted that the PhEDEx import/export load tests were not run in isolation. While these
PhEDEx load tests were running, Caltech was downloading multi-TB datasets from FNAL, CNAF, and
other sites at an average rate of 115MB/s, with peaks reaching almost 200MB/s.
UCSD has additionally been working on an in-depth study of the scalability of BeStMan, especially at
different levels of concurrency. The graph below shows how the effective processing rate has scaled
with the increasing number of concurrent clients.




This demonstrates processing rates well above the levels currently needed for USCMS. It is sufficient
for high-rate transfers of gigabyte-sized files and uncontrolled chaotic analysis.
   7. Site-specific Requirements

Requirement 14. A candidate SE should be subject to all of the regular, low-stress tests that are
performed by CMS. These include appropriate SAM tests, job-robot submissions, and PhEDEx load
tests. The SE should pass these tests 80% of the time over a period of two weeks. (This is also the level
needed to maintain commissioned status.)

The chart below shows the status of the CMS site commissioning tests, which combine
all the regular low-stress tests performed.




Additionally, Caltech, while using a Hadoop SE, maintained a 100% Commissioned site status for the
two weeks prior to Aug. 17:

http://lhcweb.pic.es/cms/SiteReadinessReports/SiteReadinessReport_20090817.html#T2_US_Caltech.

Requirement 15. The new storage element should be filled to 90% with CMS data. These datasets
should be chosen such that they are currently "popular" in CMS and will thus attract a significant
number of user jobs. Failures of jobs due to failure to open the file or deliver the data products from the
storage systems (as opposed to user error, CE issues, etc.) should be at the level of less than 1 in 10^5
level.
     A suggested test would be a simple "bomb" of scripts that repeatedly opens random files and
       reads a few bytes from them with a high parallelism; for the 10^5 test, it's not necessary to do it
       through CMSSW or CRAB. An example would be to have 200 worker nodes open 500 random
       files each and read a few bytes from the middle of the file.

This was performed using the “se_punch.py” tool found in Nebraska’s se_testkit. There were no file
access failures. This script implemented the suggested test – all worker nodes in the Nebraska cluster
simultaneously started opening random files and reading a few bytes from the middle of each.
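
A hedged sketch of the kind of test se_punch.py performs (this is not the actual script; the mount point
is illustrative and the shuf utility is assumed to be available):

         #!/bin/bash
         # On each worker node: open N random files through the FUSE mount and read a
         # few KB from roughly the middle of each, reporting any failures.
         MOUNT=/mnt/hadoop/store                   # illustrative FUSE mount point
         N=500
         find "$MOUNT" -type f | shuf -n "$N" | while read -r f; do
             size=$(stat -c %s "$f")
             dd if="$f" of=/dev/null bs=4k count=1 skip=$(( size / 8192 )) 2>/dev/null \
                 || echo "READ FAILURE: $f"
         done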

Nebraska is now working on a script utilizing PyROOT (which is distributed with CMSSW) that opens
all files on the SE with ROOT. This not only verifies files can be opened, but demonstrates a minimal
level of validity of the contents of the file. Opening with ROOT should fully protect against truncation
(as the metadata required to open the file is written at the end of the file) and whole-file corruption. It
does not detect corruptions in the middle of the file, but built-in HDFS protections should detect these.

Nebraska ran with HDFS over 90% full during May 2009 and encountered no significant problems
other than writes failing when all space was exhausted. Caltech also experienced some corrupted
blocks when HDFS was filled to 96.8% and certain datanodes reached 100% capacity. Some
combination of failed writes, rebalancing, and failing disks resulted in two corrupted blocks and two
corrupted files. These files had to be invalidated and retransferred to the site. This is the only time that
Caltech has lost data in HDFS since putting it into production 6 months ago. There are a few
recommendations to help avoid this situation in the future:
    1) Run the balancer often enough to prevent any datanode from reaching 100%
    2) Don't allow HDFS to fill up enough that an individual datanode partition reaches 100%
    3) If using multiple data partitions on a single datanode, make them of equal size, or merge them
        into a single raid device so that hadoop sees only a single partition.
Future versions of Hadoop (0.20) have a more robust API to help manage datanode partitions that have
been completely filled to 100%.




Requirement 16. In addition, there should be a stress test of the SE using these same files. Over the
course of two weeks, priority should be given to skimming applications that will stress the IO system.

Specific CMS skim workflows were run at Nebraska on June 6. However, the results of these were not
interesting as the workflows only lasted 8 hours (no significant failures occurred).

However, the “stress” of the skim tests is far less than the stress of user jobs (especially PAT-based
analysis) due to the number of active branches in ROOT; see CMS Internal Note 2009-18. Many active
branches in ROOT result in a large number of small reads; a CMS job on an idle system will typically read
no more than 32KB per read and achieve about 1MB/s, i.e. roughly 30 read operations per second per
job. Hence, 1000 jobs will generate around 30,000 IOPS if they are not bound by the underlying disk
system. Because the HDFS installs have relatively high
bandwidth due to the large number of data nodes, but the same number of hard drives as other systems,
bandwidth is usually not a concern, while I/O operations per second (IOPS) are. See the graphs below
demonstrating a large number of IOPS; even at the maximum request rate, the corresponding bandwidth
required is only 5Gbps. For the hard drives deployed at the time the graph was generated, this
represented about 60 IOPS per hard drive, which matched independent benchmarks of the hard drives.
The bandwidth usage of 5Gbps represents only a fraction of the bandwidth available to HDFS.

Because HDFS approaches the underlying hardware limits of the system during production, we
consider typical user jobs to be the best stressor of the system. Such “stress tests” occur in large batches
on a weekly basis at both Nebraska and Caltech.

Requirement 17. As part of the stress tests, the site should intentionally cause failures in various
parts of the storage system, to demonstrate the recovery mechanisms.

As noted in Requirement 16, an HDFS instance in large-scale production is sufficient for demonstrating
stress. During production at Nebraska and at Caltech, we have observed failures of the following
components:
     • Namenode: When a namenode dies, the only currently used recovery mechanism is to replace
        the server (or fix the existing server) and copy a checkpoint file into the appropriate directory.
        A high-availability setup has not yet been investigated by our production sites, mostly due to
        the perceived complexity for little perceived benefit (namenode failure is rare). Recovery has been
        demonstrated in production at Nebraska and Caltech. When the namenode fails, writes will not
        continue and reads will fail if the client had not yet cached the block locations for open files.
     • Datanode: Datanode failures are designed to be an everyday occurrence, and they have indeed
        occurred at both Nebraska and Caltech. The largest operational impact is the amount of traffic
        generated by the system while it is re-replicating blocks to new hosts.
     • Globus GridFTP servers: Each transfer is spawned as a separate process on the host by xinetd.
        This makes the server extremely reliable in the face of failures or bugs in the GridFTP
        server. When a GridFTP host dies, others may be used by SRM. Nebraska and UCSD have
        implemented schemes where the SRM server stops sending new transfers to the failed GridFTP
        server. Caltech has also implemented a GridFTP appliance integrated with the Rocks cluster
        management software that can be used to install and configure a new GridFTP server in 10
        minutes.
     • SRM server: When the SRM server fails, all SRM-based transfers will fail until it has been
        restarted manually (the service health is monitored via RSV). This happens infrequently
        enough in production that no automated system has been implemented, although LVS-based
        failover and load-balancing is plausible because BeStMan is stateless. Caltech has
        implemented a BeStMan appliance integrated with the Rocks cluster management software that
        can be used to install and configure a new BeStMan server in 10 minutes.

   8. Security Concerns

HDFS
HDFS has Unix-like user/group authorization, but no strict authentication. HDFS should only be
exposed to a secure internal network which only non-malicious users are able to access. For users with
unrestricted access to the local cluster, it is not difficult at all to bypass authentication. There is no
encryption or strong authentication between the client and server, meaning that one must have
both a trusted server and a trusted client. This is the primary reason why HDFS must be segregated onto an
internal network.
It is possible to reasonably lock down access by:
     1. Preventing unknown or untrusted machines from accessing the internal network (see the
         firewall sketch below). This requirement could be removed by turning on SSL sockets in lieu
         of regular sockets for inter-process communication; we have not pursued this method due to
         the perceived performance penalty.
             a. By “untrusted machines”, we include end-users’ laptops or desktops. Such access could
                 instead be allowed via Xrootd redirectors (for ROOT-based analysis)
                 or by exporting the file system via HTTPS (allowing whole-file download).
     2. Preventing non-FUSE users from accessing HDFS ports on the known machines on the network.
         This means only the HDFS FUSE process will be able to access the datanodes and
         namenode; this allows the Linux filesystem interface to sanitize requests and prevents users
         from having TCP-level access to HDFS.
It is important to point out that in (2), we are relying on the security of the clients on the network. If a
host is compromised at the root level, the attacker can perform any arbitrary action with sufficient
effort.
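A hedged firewall sketch for point 1 above, to be applied on the HDFS nodes: the subnet is illustrative
and the ports shown are the commonly used defaults (namenode RPC and HTTP, datanode
data/IPC/HTTP), which sites should replace with their own configured values.

         # Allow only the trusted cluster subnet to reach the HDFS service ports; drop the rest.
         iptables -A INPUT -s 10.1.0.0/16 -p tcp -m multiport \
                  --dports 8020,50010,50020,50070,50075 -j ACCEPT
         iptables -A INPUT -p tcp -m multiport \
                  --dports 8020,50010,50020,50070,50075 -j DROP
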
Security concerns are actively being worked on by Yahoo. The progress can be followed on this master
JIRA issue:
     https://issues.apache.org/jira/browse/HADOOP-4487
In release 0.21.0, access tokens issued by the namenode prevent clients from accessing arbitrary data
on the datanode (currently, one only needs to know the block ID to access it). Also in 0.21.0, the
transition to the Java Authentication and Authorization Service has begun; this will provide the building
blocks for Kerberos-based access (Yahoo’s eventual end goal). Judging by current progress,
transitioning to Kerberos-based components could happen during 2010.
If a vulnerability is discovered, we would release updated RPMs within one workweek (sooner if the
packaging is handled by the VDT). This probably will not be necessary as the security model is
already very permissive. Security vulnerabilities are one of the few reasons we will update the “golden
set” of RPMs.

Grid Components (GridFTP and BeStMan)
Globus GridFTP and BeStMan both use standard GSI security with VOMS extensions; we assume this
is familiar to both CMS and FNAL. Because both components are well-known, we do not examine
their security models here.
If a vulnerability is discovered in any of these components, we would release an RPM update once our
upstream source (the VDT) has this update. The target response time would be one workweek while
packaging is done at Caltech, and in lockstep with the VDT update once that team does the packaging.

   9. Risk Analysis
In this section, we analyze different risks that are posed to the different pieces of the HDFS-based SE.
We attempt to present the most pressing risks in the proposed solution (both technical and
organizational), and point out any mitigating factors.

HDFS
HDFS is both the core component and a component external to grid computing. Hence, its risk must be
examined most closely.
   1. Health of Hadoop project: HDFS is completely dependent on the existence and continued
      maintenance of the Hadoop project. Continued development and growth of this project is
      critical. Hadoop is a top-level project of the Apache Software Foundation; in order to achieve
      this status, the following requirements were necessary:
           a. Legal
                    i. All code ASL'ed (Apache Software License, a highly permissive open-source
                       license).
                   ii. The code base must contain only ASL or ASL-compatible dependencies.
                  iii. License grant complete.
                   iv. Contributor License Agreement on file.
                    v. Check of project name for trademark issues.
               This legal legwork protects us from code licensing issues and various other legal issues.
           b. Meritocracy / Community
                    i. Demonstrate an active and diverse development community.
                   ii. The project is not highly dependent on any single contributor (there are at
                       least 3 legally independent committers and there is no single company or entity
                       that is vital to the success of the project).
                  iii. The above implies that new committers are admitted according to ASF
                       practices.
                   iv. ASF-style voting has been adopted and is standard practice.
                    v. Demonstrate ability to tolerate and resolve conflict within the community.
                   vi. Release plans are developed and executed in public by the community.
                  vii. The ASF Board has voted for final acceptance as a Top Level Project.
             The ASF has shown that these community guidelines and requirements are hallmarks of
             a good open source project.
    The fact that HDFS is an ASF project and not a Yahoo corporate project means that it is not
    tied to the health of Yahoo. The current HDFS lead is employed by Facebook, not Yahoo. At
    this point in the project’s life, about 40% of the patches come from non-Yahoo employees.
    Regarding the recent change to Microsoft as the company’s search engine provider, Yahoo has
    made public statements that:
         • Hadoop is used for almost every piece of the Yahoo infrastructure, including spam
           fighting, ads, news, and analytics.
         • Hadoop is critical to Yahoo as a company, and is not a subproject of the search engine.
           It is possible that money previously invested into the search engine technology will now
           be invested into Hadoop.
   Cloudera has received about $16 million in start-up capital and employs several key developers,
   including Doug Cutting, the original author of the system. Hadoop maintains a listing of web
   sites and companies utilizing its technology, http://wiki.apache.org/hadoop/PoweredBy.

   Condor currently funds a developer working on Hadoop, and is investigating the use of HDFS
   as a core component.

   While we believe these reasons mitigate the risk of HDFS development becoming stagnant, we
   believe this is the top long-term risk associated with the project.
2. Hadoop support / resolution of bugs: There is no direct monetary support for large-scale
   HDFS development, nor is the success of HDFS dependent upon WLCG usage. We have no
   paid support for HDFS (although it can be purchased). This is mitigated by:
       a. Paid support is available: We have good contacts with the Cloudera technical staff, and
          would be able to purchase development support as needed. Several project committers
          are on Cloudera staff.
       b. Critical bugs affect large corporations: Any bug we are exposed to affects Yahoo and
          Facebook, whose businesses depend on HDFS. Hence, any data loss bug we discover
          will be of immediate interest to their development teams. When Nebraska started with
          HDFS, we had issues with blocks truncated by ext3 file system recovery. This triggered
          a long investigation by a member of the Yahoo HDFS team, resulting in many patches
          for 0.19.0. Since that version, we have not seen the truncation issue again.
       c. Acceptance of patches: Nebraska has contributed on the order of 5 patches to HDFS,
          and has not had issues with getting patches accepted by the upstream project. The major
          issue has been passing the acceptance criteria – each patch must meet coding guidelines,
          pass code review from a different coder, and come with a unit test (or an explanation of
          why a new unit test is not needed).
        d. Large number of unit tests: HDFS core has good unit test coverage (Clover coverage of
           76%, http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/clover/). All
           nontrivial commits require a unit test to be committed along with them. Because of the
              initial difficulties in getting completely safe sync/append functionality, a large set of
              new unit tests was developed for 0.21.0 based on a fault-injection framework. The fault
              injection framework provides developers with the ability to better demonstrate not only
              correct behaviors, but correct behaviors under a variety of fault conditions.

              The unit tests are run nightly using Apache Hudson
              (http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/) and take several hours.
       Each point helps to mitigate the issue, but does not completely remove it. In the extreme
       case, we are prepared to run with locally developed patches that are not accepted by the
       upstream project. This would hurt our efforts to keep support costs under control, so we
       would avoid this situation.
   3. Hadoop feature set: We believe HDFS currently has all the features necessary for adoption,
      and we do not believe that any new features are required in the core. However, it should be
      pointed out that the system is not fully POSIX compliant. Specifically, the following features
      are missing:
          a. File update support: Once a file is closed, it cannot be altered. In HDFS 0.21.0,
              append support will be enabled. We do not believe this will ever be necessary for
              USCMS.
          b. Multiple write streams / random writes: Only a single stream of data can write to an
              open file, and seek() during a write is not supported. This means that one may not
              write a TFile directly to HDFS using ROOT; the file must first be written to local disk
              and then copied to HDFS. We developed in-memory stream reordering in the GridFTP
              plugin to work around this limitation (see the sketch after this list). If USCMS decides
              to write files directly to the SE rather than to local scratch, HDFS will not immediately
              support that workflow. We believe this to be a low risk.
          c. Flush and append: A file is not guaranteed to be fully visible until it has been closed;
              until then, it is undefined how much data a reader attempting to read the file will see.
              Flush and append support will be available in HDFS 0.21.0, which will provide
              guaranteed semantics for when data becomes visible to readers. We do not believe this
              will be an issue for CMS.
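
   To make the sequential-write constraint concrete, the sketch below shows, using the Hadoop Java
   FileSystem API, how data blocks arriving out of order (as they can in a multi-stream GridFTP
   transfer) can be buffered and reordered in memory before being written to HDFS. This is only an
   illustrative sketch: the class name and structure are hypothetical, and the production GridFTP
   plugin implements the same idea in C inside the Globus DSI module.

        import java.util.TreeMap;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Illustrative sketch: HDFS permits only a single, sequential writer per file
        // (no seek() during writes), so out-of-order blocks must be reordered in memory.
        public class SequentialReorderWriter {
            private final FSDataOutputStream out;
            private final TreeMap<Long, byte[]> pending = new TreeMap<Long, byte[]>(); // offset -> block
            private long nextOffset = 0; // next byte offset that may be written

            public SequentialReorderWriter(FileSystem fs, String path) throws Exception {
                // Files are created once and written front-to-back; they cannot be updated later.
                this.out = fs.create(new Path(path));
            }

            // Accept a block at an arbitrary offset and flush any data that is now contiguous.
            public synchronized void receiveBlock(long offset, byte[] data) throws Exception {
                pending.put(offset, data);
                while (!pending.isEmpty() && pending.firstKey() == nextOffset) {
                    byte[] block = pending.pollFirstEntry().getValue();
                    out.write(block);           // sequential write only
                    nextOffset += block.length;
                }
            }

            public synchronized void close() throws Exception {
                if (!pending.isEmpty()) {
                    throw new IllegalStateException("gap in data stream at offset " + nextOffset);
                }
                out.close();                    // contents become fully visible once the file is closed
            }
        }

   A caller would obtain a FileSystem via FileSystem.get(new Configuration()), construct the writer,
   and feed it blocks as they arrive from the transfer streams.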


FUSE/FUSE-DFS
FUSE-DFS, as a contributed project in Hadoop, shares many risks with HDFS. There are a few
concerns we believe are significant enough to merit their own category.
   1. FUSE support: FUSE has no commercial company providing support. However, it is part of
       the mainline Linux kernel, is over 5 years old, and has had a stable interface for quite a while.
       We have never seen an issue originate from FUSE itself. We do consider the FUSE kernel
       module a risk because OSG has less experience packaging kernel modules, and kernel
       modifications often result in support issues. This is mitigated by the fact that the OSG-
       supported Xrootd also requires the FUSE kernel module (so HDFS is not unique in this
       respect) and that the ATrpms repository provides a FUSE kernel module and tools to build the
       RPM for non-standard kernels. UCSD and Caltech both build their own kernel modules;
       Nebraska uses the ATrpms ones.
   2. FUSE-DFS support: FUSE-DFS is the userland library that implements the FUSE filesystem.
       It was originally implemented by Facebook and lives in the HDFS SVN repository as a
       contributed module. It does not have the same level of support as HDFS core because it is a
       contributed module; it also does not have as many companies using it in production. Through
       the process of adopting HDFS, we have discovered critical bugs and have submitted, and had
       accepted, several FUSE-DFS-related patches. We have recently found memory leaks in the
       libhdfs C wrappers (these had not been discovered before because they only become noticeable
       in continuous production). We believe this component is the highest short-term software risk
       of the entire solution. The mitigating factors for FUSE-DFS are:
           a. Small, stable codebase: FUSE-DFS is essentially a small layer of glue between FUSE
               and the libhdfs library (a core HDFS component, used by Yahoo). The entire code base
               is around 2,000 lines, about 4% of the total HDFS size. During our usage of HDFS,
               neither the libhdfs nor the FUSE API has changed. This limits both the number of
               undiscovered bugs and the rate at which new bugs are introduced. We believe the
               majority of possible issues have already been fixed.
           b. Production experience: We have been running FUSE for more than 8 months and feel
               we have a good understanding of possible production issues. As of the latest release,
               the largest outstanding issue is that FUSE must be remounted whenever users are added
               to or removed from groups (user-to-group mappings are currently cached indefinitely).
               This is well understood and possible to work around. The bug may be fixed in future
               versions of HDFS, as a fix will be necessary for the planned Kerberos-based
               authentication and authorization.
           c. Extensive debugging experience available: The most recent FUSE-DFS memory leak
               required in-depth debugging at Nebraska. We believe we have the experience and tools
               necessary to handle any future bugs, and we intend to make sure that any locally
               developed patches are upstreamed to the HDFS project.

BeStMan
BeStMan is an already supported component of the OSG. We have identified the associated risks:
   1. BeStMan runs out of funding: As BeStMan is quickly becoming an essential OSG package,
      we believe it will continue to meet the needs of USLHC even if it is no longer funded at LBNL.
   2. BeStMan currently uses the Globus 3 container: The Globus 3 web services container was
      never in large-scale use, and currently suffers from debilitating bugs and an unmaintained
      architecture. The BeStMan team is devoting most of its effort to replacing it with an industry-
      standard Tomcat webapp container, expected to be delivered in fall or winter 2009. We believe
      this will remove many bugs and improve the overall source code, and it would also make it
      easier for external parties to submit improvements.

Globus GridFTP
Globus GridFTP is an already-supported component of the OSG. We have identified the associated
risks:
    1. Globus GridFTP runs out of funding: Globus GridFTP is an essential component of the OSG.
       If it runs out of funding, we will use whatever future solution the OSG adopts.
    2. Globus GridFTP model possibly not satisfactory: The Globus GridFTP model is based on
       processes launched by xinetd. Because each transfer is a separate process, problems affecting
       one transfer are well isolated from other transfers. However, this makes it extremely hard to
       enforce limits on the number of active transfers per node, which can lead to either instability
       (if no limit is set) or odd errors (globus-url-copy does not report gracefully when xinetd
       refuses to start new servers). We would like to investigate multi-threaded, daemon-mode
       Globus GridFTP, but have not yet identified the effort. Current T2 sites mitigate this by
       controlling the number of concurrent transfers where they can (CRAB stageouts being the
       exception) and by providing sufficient hardware to accommodate an influx of transfers.
Component Plug-ins
Both BeStMan and GridFTP require plug-ins in order to achieve the desired level of functionality in
this SE. We have identified the associated risks:
    1. Future changes in versions of underlying components: We may have to update plugin code if
        a related component changes its interface. For example, BeStMan2 may require a new Java
        interface for implementing GridFTP selector plugins (a sketch of the selection logic such a
        plugin wraps is given after this list). Even if the API remains the same, the underlying
        assumptions may change; for example, the GridFTP plug-in might need to become thread-safe.
    2. Original author leaves USCMS: If the original author leaves USCMS, much knowledge
        would be lost even if the effort is replaced. This is why the focus is on clean packaging,
        documentation, and ownership by an organization (OSG) rather than by a single person. The
        BeStMan component is relatively simple and straightforward, which mitigates this concern.
        The GridFTP component is not, because of the complexity of the Globus DSI interface (by
        far the most complex interface in the SE). This is high-performance C code and difficult to
        change. If the original author left and the Globus DSI module changed significantly, USCMS
        would need to invest about 1 man-month of effort to perform the upgrade. This is mitigated
        by the fact that the current system does not require any GridFTP feature upgrades; USCMS
        can run on the same plugin for a significant amount of time.
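
To illustrate what the BeStMan GridFTP selector plug-in encapsulates, the following standalone
sketch re-reads a server list from a plain-text file when it becomes stale and picks a server at
random. This is a hypothetical example, not the plugin itself: the class name, file format, and
refresh interval are assumptions, and a real plugin must implement whatever Java interface the
deployed BeStMan version defines.

        import java.io.IOException;
        import java.nio.charset.Charset;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.Random;

        // Hypothetical sketch of GridFTP server selection logic: reload a server list
        // from disk when it is stale, then return a randomly chosen entry.
        public class GridFtpServerSelector {
            private static final long REFRESH_MS = 30000L;  // assumed refresh interval
            private final Path listFile;
            private final Random random = new Random();
            private List<String> servers = new ArrayList<String>();
            private long lastLoaded = 0L;

            public GridFtpServerSelector(String listPath) {
                this.listFile = Paths.get(listPath);
            }

            // Return one GridFTP server URL, refreshing the list from disk if needed.
            public synchronized String selectServer() throws IOException {
                long now = System.currentTimeMillis();
                if (now - lastLoaded > REFRESH_MS || servers.isEmpty()) {
                    List<String> fresh = new ArrayList<String>();
                    for (String line : Files.readAllLines(listFile, Charset.forName("UTF-8"))) {
                        String trimmed = line.trim();
                        if (!trimmed.isEmpty()) {
                            fresh.add(trimmed);     // one server URL per non-blank line
                        }
                    }
                    servers = fresh;
                    lastLoaded = now;
                }
                if (servers.isEmpty()) {
                    throw new IllegalStateException("no GridFTP servers listed in " + listFile);
                }
                return servers.get(random.nextInt(servers.size()));
            }
        }

Keeping the policy in a small plug-in like this makes it straightforward to change the selection
strategy (for example, weighting servers by load) without touching BeStMan itself.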


Packaging
We have worked hard to provide packaging for the entire solution. The current packaging does present
a few pitfalls:
   1. Original author leaves USCMS: The setup at Caltech is based on “mock”, the standard
        Fedora/Red Hat build tool. The VDT currently does not have the processes in place to
        package RPMs effectively, but this is planned development for Year 4. Until the packaging
        duties can be transferred from Caltech to the VDT (perhaps late in Year 4), we will be
        dependent on the setup there. We are working to document that setup better in order to
        mitigate the risk.
   2. Patches fail to get upstreamed: It is crucial to send patches upstream and to keep the number
        of changes from the base install to a minimum. We must remain diligent about committing
        fixes for any bugs upstream.
   3. Rate of change: Even with only bug-fix updates for “golden releases”, the rate of updates is a
        standing concern. Most recent updates have been related to packaging issues, especially for
        platforms not present at any production T2 cluster. We hope that the added OSG effort in
        Year 4 will let us drastically reduce the rate of change.
   4. Update mechanisms for ROCKS clusters: Currently, a “yum install” is the correct way to
        install the latest version of the software. However, when an administrator adds the RPMs to a
        ROCKS roll, they become locked to that specific version and must take manual action to
        upgrade the RPMs. This means there will always be significant resistance to changing
        versions, which makes decreasing the rate of updates even more important.

Experts and Funding
Much of this work was done by several CMS experts. We outline two risks:
  a) Loss of experts: As mentioned above, we take a significant hit if our experts leave the
      organization. We are focusing heavily on documentation, packaging, and “finishing off”
      development (in fact, preparation for this review has prompted us to clear several long-standing
      issues). This will not only allow us to produce the first “golden set”, but also increase the length
      of time HDFS can be maintained between experts.
    b) Loss of OSG: Much of the risk and effort is being shared with the OSG in order to leverage
        their packaging expertise. Having HDFS in the OSG taps into a pool of human resources beyond
        the experts in USCMS. However, the current funding for the OSG runs out in 2 years (and is
        reduced in 1 year). If OSG funding is lost, we will again have to rely internally on USCMS
        personnel, as in FY2009.
The catastrophe scenario for HDFS adoption is both funding loss in the OSG and loss of the experts. In
this case, the survival plan would be:
     •  Identify funding for new experts (from experience, it takes about 6 months to train a new expert
        once they are in place). New experts can be drawn from the pool of HDFS sysadmins; as HDFS
        gains wider use, that pool broadens.
     •  No new “golden set” until a packaging, testing, and integration program can be re-established.
        If this becomes a chronic problem, we would focus on switching entirely to Cloudera’s
        distribution in order to offload the Q/A testing of major changes to an external organization.
     •  No new USCMS-specific features. We believe that HDFS has all the major features necessary
        for CMS adoption, but we do occasionally find small useful additions (an example is the
        development of Ganglia 3.1 compatibility). Without a local expert, developing these for CMS
        would not be possible, and running with any patches not accepted by the upstream project
        becomes increasingly dangerous.