Hadoop File System as part of a CMS Storage Element by zhouwenjuan


									     Hadoop File System as part of a CMS Storage Element

   1. Introduction

In the last year, several new storage technologies have matured to the point where they have
become viable candidates for use at CMS Tier2 sites. One particular technology is HDFS, the
distributed filesystem used by the Hadoop data processing system.

   2. The HDFS SE
HDFS is a file system; as such, it must be complemented by other components in order to build a
grid SE. We consider a minimal set of additional components to be:
    a) FUSE / FUSE-DFS: FUSE, a standard Linux kernel module, allows filesystems to be written
       in userspace. The FUSE-DFS library makes HDFS into a FUSE filesystem. This allows a
       POSIX-like interface to HDFS, necessary for user applications.
    b) Globus GridFTP: Provides a WAN transfer protocol. Using Globus GridFTP with HDFS
       requires a plugin developed by Nebraska.
    c) BeStMan: Provides an SRMv2 interface. UCSD and Nebraska have implemented plugins for
       BeStMan to allow smarter selection of GridFTP servers.
There are optional components that may be layered on top of HDFS, including XrootD, Apache
HTTP, and FDT.

   3. Requirements
This document aims to show that the combination of HDFS, FUSE, Globus GridFTP, BeStMan, and
some plugins developed by CMS Tier-2 teams at Nebraska, Caltech, and UCSD meets the SE
requirements set forth by USCMS Tier 2 management. The requirements are typeset in italics, and
the responses are given below them in normal typesetting. Throughout this document, we use
“Hadoop” and “HDFS” interchangeably; generally, “Hadoop” refers to the entire data processing
system. In this context, we take it to only refer to the filesystem components.

   4. Management of the SE

Requirement 1. A SE technology must have a credible support model that meets the reliability,
availability, and security expectations consistent with the area in the CMS computing infrastructure
in which the SE will be deployed.

Support for this SE solution is provided by a combination of OSG, LBNL, Globus, the Apache
Software Foundation (ASF), DISUN, and possibly the US CMS Tier-2 program, as follows.
BeStMan is supported by LBNL and GridFTP by Globus. Both are part of the OSG portfolio of
storage solutions. HDFS is supported by ASF, as elaborated below, and FUSE is part of the
standard Linux distribution. This leaves two types of plugins that are required to integrate
all the pieces into a system, as well as the packaging support. The first is a plugin to BeStMan that
picks a GridFTP server from a list of available GridFTP servers; this list is reloaded from disk every
30 seconds and servers are selected at random (without the plugin, the default policy is a simple
round-robin, and altering the GridFTP server list requires an SRM restart). The second is a
GridFTP plugin to interface to HDFS. We propose that both types of plugins continue to
be supported by their developers from DISUN and US CMS until OSG has had an opportunity to
gain sufficient experience to adopt the source and own them. Similarly, packaging support in the form
of RPMs is to be provided initially by DISUN/Caltech, and later by US CMS/Caltech, until OSG has
had an opportunity to adopt it as part of a larger migration towards providing native packaging in
the form of RPMs. OSG ownership of these software artifacts has been agreed upon in principle, but
not yet formally. OSG support for this solution will start with Year 4 of OSG (October 1st, 2009),
and will initially be restricted to:

   a) Pick a set of RPMs twice a year, verify that this set is completely consistent, providing a well
      integrated system. We refer to this as the “golden set”.
   b) Document installation instructions for this golden set.
   c) Do a simple validation test on supported platforms (with the validation preferably automatic).
   d) Performance test the golden set, and document that test. This performance will be in-depth.
   e) Provide operations support for two golden sets at a time. This means that there is a staff
      person at OSG who is responsible for tracking support requests, answering simple questions,
      and finding solutions to difficult questions via the community support group organized on
      the osg-hadoop@opensciencegrid.org listserv.
   f) OSG will provide updates to a golden set only for important bug and security fixes; these
      critical patches will go through validation test, but not performance tests.
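
The GridFTP server selection policy described earlier under this requirement (a server list re-read from disk every 30 seconds, with servers chosen at random) can be illustrated with a minimal sketch. This is not the BeStMan plugin's code; the class name, the one-server-per-line file format, and the comment-line handling are assumptions made for illustration.

```python
import random
import time


class GridFtpSelector:
    """Pick a random GridFTP server from a list file that is re-read periodically."""

    def __init__(self, list_path, reload_seconds=30.0, clock=time.monotonic, rng=None):
        self.list_path = list_path            # hypothetical file: one server per line
        self.reload_seconds = reload_seconds  # the text quotes a 30-second reload
        self._clock = clock
        self._rng = rng or random.Random()
        self._servers = []
        self._loaded_at = None

    def _reload_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self.reload_seconds:
            with open(self.list_path) as handle:
                lines = [line.strip() for line in handle]
            # Assumed convention: blank lines and lines starting with '#' are ignored.
            self._servers = [s for s in lines if s and not s.startswith("#")]
            self._loaded_at = now

    def pick(self):
        """Return one server, reloading the list first if it is older than the interval."""
        self._reload_if_stale()
        if not self._servers:
            raise RuntimeError("no GridFTP servers listed in %s" % self.list_path)
        return self._rng.choice(self._servers)
```

Because the list is re-read on a timer, a server can be added or removed by editing the file, with no restart of the SRM service.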

Support will be official for RHEL4 and RHEL5 derivatives on both 32-bit and 64-bit platforms. The
core HDFS software (the namenode and datanode) is usable on any platform providing the Java 1.6
JDK. Currently, Caltech and Nebraska both run datanodes on Solaris. The main limitation for
FUSE clients is support for the FUSE kernel module; this is supported on any Linux 2.6 series
kernel, and FUSE was merged into the kernel itself in 2.6.14.

Upgrades in HDFS are covered in this wiki document:
http://wiki.apache.org/hadoop/Hadoop_Upgrade. On our supported version, upgrades on the same
major version will only require a “yum update”. For major version upgrades, the procedure is:
    a) Shut down the cluster.
    b) Upgrade via yum or RPMs.
    c) Start the namenode manually with the “-upgrade” flag.
    d) Start the cluster. The cluster will stay in read-only mode.
    e) Once the cluster’s health has been verified, issue the “hadoop dfsadmin -finalizeUpgrade”
       command. After the command has been issued, no rollback may be performed.
The wiki explains additional recommended safety precautions.
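
The major-version procedure above can be written down as an ordered plan, which makes the point of no return explicit. The sketch below only builds the list of steps and runs nothing. The stop, package-upgrade, and start commands are site-specific and are passed in by the caller; using "hadoop fsck /" as the health check is an assumption, since the procedure does not say how health is verified.

```python
def major_upgrade_plan(stop_cmd, upgrade_cmd, start_cmd):
    """Return the major-version upgrade as ordered (description, command) pairs.

    stop_cmd, upgrade_cmd and start_cmd are site-specific argument lists
    supplied by the caller (placeholders, not defined by this document).
    """
    return [
        ("Shut down the cluster", stop_cmd),
        ("Upgrade via yum or RPMs", upgrade_cmd),
        ("Start the namenode with the upgrade flag", ["hadoop", "namenode", "-upgrade"]),
        ("Start the rest of the cluster (stays read-only)", start_cmd),
        # Assumption: fsck is used here as the health check.
        ("Verify cluster health", ["hadoop", "fsck", "/"]),
        ("Finalize the upgrade; no rollback after this", ["hadoop", "dfsadmin", "-finalizeUpgrade"]),
    ]


if __name__ == "__main__":
    # Hypothetical site commands, shown only to illustrate the call.
    plan = major_upgrade_plan(["site-stop-hdfs"], ["yum", "update"], ["site-start-hdfs"])
    for number, (description, command) in enumerate(plan, start=1):
        print("%d. %s: %s" % (number, description, " ".join(command)))
```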

Hadoop is an open-source project hosted by the Apache Software Foundation. ASF hosts multiple
mailing lists for both users and developers, which are actively watched by both the Hadoop
developers and the larger Hadoop community. Members of this community include Yahoo
employees who use Hadoop in their workplace, as well as employees of Cloudera, which provides
commercial packaging and support for Hadoop. As it is a top-level Apache project, it has
contributors from at least three companies and strong project management. Yahoo has stated that
it has invested millions of dollars into the project and intends to continue doing so.

Hadoop depends heavily upon its JIRA instance, http://issues.apache.org/jira/browse/HADOOP,
for bug reporting and tracking. As it is an open-source project, there’s no guarantee of response
time to issues. However, we have found high-priority issues (those that may lead to data loss) get
solved quickly because bugs affecting a T2 site have a high probability of affecting Yahoo’s
production infrastructure.

Requirement 2. The SE technology must demonstrate the ability to interface with the global data
transfer system PhEDEx and the transfer technologies of SRM tools and FTS as well as demonstrate
the ability to interface to the CMSSW application locally through ROOT.

Caltech is using Hadoop exclusively for OSG RSV tests, PhEDEx load tests, user stageouts via
SRM, ProdAgent stageout via SRM and POSIX-like access via FUSE. User analysis jobs submitted
to Caltech with CRAB have been running against data stored in Hadoop for several months.

UNL has been using HDFS at large scale since approximately late February 2009. It is used for:
   PhEDEx data transfers for all links (transfers are done with SRM and FTS).
   OSG RSV tests (used to meet the WLCG MoU availability requirements)
   Monte Carlo simulation and merging.
   User analysis via CMSSW (both grid-based submissions using CRAB and local interactive

UCSD has been using HDFS for serving /store/user.

Caltech, UNL, and UCSD have been leading the effort to demonstrate the scalability of the
BeStMan SRM server. This has recently resulted in an srmLs processing rate of 200Hz on a single
server; this is 4 times larger than the rate required by CMS for FY2010, and is in fact the most
scalable SRM solution deployed in global CMS.

Requirement 3. There must be sufficient documentation of the SE so that it can be installed and
operated by a site with minimal support from the original developers (i.e. nothing more than "best
effort"). This documentation should be posted on the OSG Web site, and any specific issues in
interfacing the external product to CMS product should be highlighted.

Installation, operation, and troubleshooting directions can be found at
http://twiki.grid.iu.edu/bin/view/Storage/Hadoop. This has already been discussed in Requirement
1, and we elaborate further here. It should be plausible for any site admin to install HDFS and the
corresponding grid components without support from the original developers. This provides a
coherent install experience; all components, including BeStMan and GridFTP, are available as
RPMs. Admins experienced with the RedHat tool “yum” will find that the SE is installable via a
simple “yum install hadoop”. The DISUN/Caltech packaging also provides useful logging defaults
that enable one to easily centrally log errors happening in HDFS; this greatly aids admins in

With the OSG 1.2 release, there are no specific issues pertaining to using HDFS with CMS.

Experience shows that operational overhead at Caltech has been equivalent to approximately 1 FTE,
including R&D activities for packaging and testing. Going forward, we believe the overhead
will decrease, as the R&D portions will be greatly reduced. Nebraska, which has been in a stable
state with Hadoop for several months, shows that operational overhead is less than 1 FTE. UCSD,
which is presently supporting HDFS for /store/user (113TB) and dCache for all else (273TB),
reports the same experience. The main reason this solution is less costly to operate is that it has
far fewer moving parts; the reduced complexity results in significantly lower operational overhead.

Requirement 4. There must be a documented procedure for how problems are reported to the
developers of those products, and how these problems are subsequently fixed.

Starting with Year 4 of OSG, all problems are reported via OSG. OSG then uses the support
mechanisms as discussed in Requirement 1.

In particular, Hadoop has an online ticket system called JIRA. JIRA is heavily used by the
developers to receive and track bugs and feature requests for Hadoop and Hadoop-related
projects. JIRA is open for viewing by everyone, and requires a simple account registration for
posting comments and new tickets. All commits in the project must be traceable via JIRA and go
through the quality control process (which includes code review by a different developer and passing
automated tests).

The JIRA system can be found at http://issues.apache.org/jira/browse/HADOOP. The Hadoop
community has also written a guide to filing bug reports,

Requirement 5. Source code required to interface the external product to CMS products must be
made available so that site operators can understand what they are operating. If at all possible,
source code for the external product itself should also be available.

All software components are open source. In particular, Hadoop source code can be downloaded
from http://www.apache.org/dyn/closer.cgi/hadoop/core/. Patches for specific problems in a
release can be downloaded from JIRA (see above). Any patches currently applied to the Caltech
Hadoop distribution have been submitted to JIRA, and we have tried to make sure that they get
committed in a timely manner. This helps us minimize the costs of maintaining our own RPM

Yahoo has publicly committed to releasing its “stable patch set” to the world beginning with 0.20.0.
They are committed to keeping all patches used publicly available in the JIRA; the only added
knowledge is which patches are stable enough to be applied to current releases. When we upgrade
to this version, we will be able to tap into this significant resource; Yahoo has a committed QA team
and a test cluster an order of magnitude larger than our production cluster.

The BeStMan source code is open source for academic users; the OSG is working on clearing up the
licensing of this software and making the code more freely available. All of the plug-ins used (for
BeStMan and GridFTP) are available in the Nebraska SVN repository, which allows anonymous

   5. Reliability of the SE

Requirement 6. The SE must have well-defined and reliable behavior for recovery from the failure
of any hardware components. This behavior should be tested and documented.

We broadly classify the HDFS grid SE into three parts: metadata components, data components,
and grid components. Below, we document the risks a failure in each poses and the suggested
recovery mechanisms.
    Metadata components: The two major metadata components for HDFS are the
       namenode and the “secondary namenode” (backup) servers. The namenode is the single
       point of failure for normal user operation. Because of this, there are several built-in
       protections each site is recommended to take:
           o Write out multiple copies of the journal and file system image. HDFS heavily relies
              on the metadata journal as a log of all the operations that alter the namespace.
              HDFS config files allow for writing the journal onto multiple partitions. It is
              recommended to write the journal on two separate physical disks in the namenode and
              suggested that a third copy be written to an NFS server.
           o The secondary namenode / backup server allows the site admin to create checkpoint
              files at regular intervals (default is every hour or whenever the journal reaches 64MB
              in size). It is strongly recommended to run the secondary namenode on a separate
              physical host. The last two checkpoints are automatically kept by the secondary
           o The checkpoints should be archived, preferably off-site.
  Future versions of HDFS (0.21.x) plan on having an offline checkpoint verification tool and
  stream journal information to the backup node in real-time, as opposed to an hourly
  In the case of namenode failure, the following remedies are suggested:
       1. Restart the namenode from the file system image and journal found on the
          namenode’s disk. The only resulting file loss will be the files being written at the
          time of namenode failure. This action will work as long as the image and journal are
          not corrupted.
       2. Copy the checkpoint file from the secondary namenode to the namenode. Any files
          created between the time the checkpoint was made and the namenode failure will be
          lost. This action will work as long as the checkpoint has not been corrupted.
       3. Use an archived checkpoint. Any files created between the checkpoint creation and
          the namenode failure will be lost. This action will work as long as one good
          checkpoint exists.
  Note that all the namenode information is kept on two files written directly to the Linux
  If none of these actions work, the entire file system will be lost; this is why we place such
  importance on backup creation. In addition to the normal preventative measures, the
  following can be done:
       1. Have hot-spare hardware available. HDFS is offline when the namenode is offline. If
          the failure does not have an immediate, obvious cause, we recommend using the
          standby hardware instead of prolonging the downtime by troubleshooting the issue.
       2. Create a high-availability (HA) setup. It is possible, using DRBD and Heartbeat to
          completely automate recovery from a namenode failure using a primary-secondary
          setup. This would allow services to automatically be restored on the order of a
          minute with the loss of only the files which were currently being created at the time of
          the failure. As most of the CMS tools automatically retry writing into the SE, this
          means that the actual file loss would be minimal. These high-availability setups have
          been used by external companies, but not yet done by a grid site; the extra setup
          complexity is not yet perceived as worth the time.
 Data components: Datanode loss is an expected occurrence in HDFS and there are multiple
  layers of protection against this.
       1. The first layer of protection is block replication. Hadoop has a robust block
          replication feature that ensures that duplicate file blocks are placed on separate
          nodes and even separate racks in the cluster. This helps ensure that complete copies
          of every file are available in the case that a single node becomes unavailable, and
          even if an entire rack becomes unavailable.
       2. Hadoop periodically requests an entire block report from every datanode. This
          protects against synchronization bugs where the namenode’s view of the datanode’s
          contents is different from reality.
       3. It is often possible to have a bad hard drive causing corruption issues, but not
          have that hard drive fail. Each datanode schedules each block to be checksummed
          once every 2 weeks. If the block fails the checksum, it will be deleted and the
          namenode notified, allowing automatic healing from hard drive errors.
    Grid components: Grid components require the least amount of failure planning because
     they are stateless for the HDFS SE. Multiple instances of the grid components (GridFTP,
     BeStMan SRM) can be installed and used in a failover fashion using Linux LVS or round-robin
We are aware of one significant failure mode. BeStMan SRM is known to lock up under very heavy
load (over 1000 concurrent SRM requests, which is at least 10-fold what has been observed in
production), and requires a restart when this happens. We believe this to be a problem with the
Globus Java container. BeStMan is scheduled to transition to a different container technology in Fall
2009 in BeStMan 2. OSG is committed to following this issue and validating the new version when it is
released. However, this issue manifests itself only under extreme loads at the existing deployments,
and is thus presently not an operational problem.

In addition, there is a minor issue with the way HDFS handles deployments whose server nodes
provide widely varying amounts of space. HDFS needs to be routinely re-balanced, which is typically
done via crontab in extreme cases (e.g. Nebraska), and manually once every two weeks or so for
deployments where disk space differs by no more than a factor of 2-4 or so between nodes (e.g.
UCSD). There is no significant operational impact from routinely running the balancer beyond
network traffic; HDFS allows the site to throttle the per-node rate going to the balancer. The
imbalance arises because the selection of servers for writing files is mostly random, while the
distribution of free space is not necessarily random.
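
A small calculation shows why random server selection leads to imbalance on heterogeneous hardware. The sketch below is a deliberate simplification, not HDFS's placement code: it assumes every write picks a node uniformly at random, ignores replication and the local-node preference, and uses made-up capacities.

```python
def utilization_after_random_writes(capacities_tb, written_tb):
    """Expected fill fraction per node if each write picks a node uniformly at random."""
    # With uniform selection, every node receives the same expected volume,
    # regardless of how much space it actually has.
    share_tb = written_tb / len(capacities_tb)
    return [min(share_tb / capacity, 1.0) for capacity in capacities_tb]


if __name__ == "__main__":
    # Illustrative cluster: three nodes of 1, 2 and 4 TB receiving 3 TB in total.
    for capacity, fill in zip([1, 2, 4], utilization_after_random_writes([1, 2, 4], 3)):
        print("%d TB node: %.0f%% full" % (capacity, fill * 100))
```

In this toy case the smallest node is full while the largest is a quarter full, which is the situation the balancer corrects.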

Requirement 7. The SE must have a well-defined and reliable method of replicating files to protect
against the loss of any individual hardware system. Alternately there should be a documented
hardware requirement of using only systems that protect against data loss at the hardware level.

Hadoop uses block decomposition to store its data, breaking each file up into blocks (the block
size is a per-file setting; the default is 64MB, but most sites have increased this
to 128MB for new files). Each file has a replication factor, and the HDFS namenode attempts to
keep the correct number of replicas of each of the blocks in the file. The replication policy is:
     a) First replica goes to the “closest datanode” – the local node is the highest priority, followed
        by a datanode on the same rack, followed by any datanode.
     b) Second replica goes to a random datanode on a different rack.
     c) Third replica (if requested) goes to a random datanode on the same rack as the second replica.
If two replicas are requested, they will end up on two separate racks; if three replicas are
requested, they will also end up on two separate racks.
Replication level is set by the client at the time it creates the block. The replication level may be
increased or decreased by the admin at any time per-file, and can be done recursively. The
namenode attempts to satisfy the client’s request; as long as the number of successfully created
replicas is between the namenode’s configured minimum and maximum, HDFS considers the
write a success. Because the client requests a replication level for each write, one cannot set a
default replication for a directory tree.
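
The three-step replica placement policy above can be made concrete with a short simulation. This is a sketch of the policy as described in this document, not the namenode's implementation. It assumes at least two racks and at least two datanodes on the rack chosen for the second replica, and it omits the "same rack as the client" fallback for the first replica. The rack and node names are invented.

```python
import random


def place_replicas(nodes_by_rack, writer_node, replication, rng=None):
    """Choose datanodes for one block following the three-step policy.

    nodes_by_rack maps a rack name to the list of datanode names on that rack.
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in nodes_by_rack.items() for node in nodes}
    chosen = []

    # a) First replica: the local node if the writer is a datanode, else any datanode.
    first = writer_node if writer_node in rack_of else rng.choice(sorted(rack_of))
    chosen.append(first)

    if replication >= 2:
        # b) Second replica: a random datanode on a different rack.
        remote = [node for node in sorted(rack_of) if rack_of[node] != rack_of[first]]
        second = rng.choice(remote)
        chosen.append(second)

    if replication >= 3:
        # c) Third replica: another datanode on the same rack as the second replica.
        same_rack = [node for node in nodes_by_rack[rack_of[second]] if node != second]
        chosen.append(rng.choice(same_rack))

    return chosen


if __name__ == "__main__":
    racks = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5", "n6"]}
    print(place_replicas(racks, "n2", 3))
```

With three replicas the result always spans exactly two racks, so the loss of any single rack leaves at least one copy of the block.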

For example, at Caltech, a cron script automatically sets the replication level on known directories
in order to ensure clients request the desired replication level. They currently use:

        hadoop fs -setrep -R 3 /store/user
        hadoop fs -setrep -R 3 /store/unmerged
        hadoop fs -setrep -R 1 /store/data/CRUZET09
        hadoop fs -setrep -R 1 /store/data/CRAFT09

It is not possible to pin files to specific datanodes, or to set replication based on the datanodes
where the files will be located. Hadoop treats all datanodes as equally unreliable.

The namenode keeps track of all block locations in the system, and will automatically delete replicas
or create new copies when needed. Replicas may be deleted either when the associated file is
removed or when the block has more replicas than desired. Datanodes send in a heartbeat signal
once every 3 seconds, allowing the namenode to keep up-to-date with the system status.

For example, when a datanode fails, the namenode allows it to miss up to 10 minutes of heartbeats
(this setting is configurable). Once it is declared dead, the namenode starts to make inter-node
transfer requests to bring the blocks that were on the datanode back up to the desired replication
level. This is often quite quick; all the desired replicas can be created in around an hour per TB (the
majority of the transfers go faster than that; a good portion of the hour is “transfer tails”). When
decommissioning is started, the system does not prefer to copy data off the decommissioned node.
Assuming the number of replicas is greater than one, the load of replication is distributed randomly
throughout the cluster. The percentage of transfers the decommissioned node will get is (# of TB of
files on datanode)/(# of TB in entire system). Because older nodes typically have smaller disks, they
will get comparatively less load.
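
The timing figures in this passage can be collected into a few helper functions. This is a back-of-the-envelope sketch using the numbers quoted here (3-second heartbeats, a 10-minute grace period, roughly one hour per TB); the function names are invented, and the example node and system sizes are taken from figures quoted elsewhere in this document.

```python
HEARTBEAT_INTERVAL_S = 3      # datanodes send a heartbeat every 3 seconds
DEAD_AFTER_S = 10 * 60        # configurable grace period before a node is declared dead


def missed_heartbeats_before_dead(dead_after_s=DEAD_AFTER_S, interval_s=HEARTBEAT_INTERVAL_S):
    """Number of consecutive heartbeats a datanode may miss before it is declared dead."""
    return dead_after_s // interval_s


def is_dead(last_heartbeat_s, now_s, dead_after_s=DEAD_AFTER_S):
    """True once the node has been silent for longer than the grace period."""
    return now_s - last_heartbeat_s > dead_after_s


def rereplication_hours(node_tb, hours_per_tb=1.0):
    """Rule-of-thumb time to restore replicas after losing node_tb of data."""
    return node_tb * hours_per_tb


def decommission_transfer_share(node_tb, system_tb):
    """Fraction of replication transfers served by the node being decommissioned."""
    return node_tb / system_tb


if __name__ == "__main__":
    print(missed_heartbeats_before_dead())                 # heartbeats missed before "dead"
    print(rereplication_hours(6.0))                        # a 6 TB datanode
    print(round(decommission_transfer_share(6.0, 342.64), 4))
```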

If a dead node reappears (host rebooted/fixed, disk is physically moved to a different node), the
blocks it previously hosted will now be overreplicated. The namenode will then reduce the number
of replicas in the system, starting with replicas located on nodes with the least amount of free space
(as a percentage of total space on the node). If the blocks belonged to files which have been
deleted, the node will be instructed to delete them in the response to its block report.

The number of under-replicated blocks can be seen by viewing the system report using fsck or by
looking at the namenode’s Ganglia statistics.

The success of Hadoop's automatic block replication was seen when Caltech suffered a simultaneous
failure of 3 large (6TB) datanodes on the evening of Sunday, Jul. 12:

       “Within about an hour of installing a faulty Nagios probe, 3 of the 2U datanodes had
       crashed, all within minutes of each other. Each of the 2U datanodes hosts just over 6TB of
       raid-5 data. Nagios started sending alarms indicating that we had lost our datanodes. We
       had seen kernel panics caused by this nagios probe before, so we had no trouble locating the
       cause of the problem. The immediate corrective action was to disable this Nagios probe.
       This was done immediately to avoid further loss of datanodes.

       Ganglia showed Hadoop reporting ~80k underreplicated blocks, and Hadoop started
       replicating them after the datanodes failed to check in after ~10 minutes. The network
       activity on the cluster jumped from ~200MB/s to 2GB/s. Since we run Rocks, 2 of the
       machines went into a reinstall immediately after dumping the kernel core file. Within an hour
       they were back up and running. Hadoop did not start automatically, however, due to a
       Rocks misconfiguration. I started Hadoop manually on these two nodes, which caused our
       underreplicated block count to drop from ~20k to ~4k. Within ~90 minutes the number of
       underreplicated blocks was back to zero.

       At this point we still had one datanode that had not recovered. I checked the system
       console and discovered that the system disk had also died. Since it was late on a Sunday
       evening, and we run 2X replication in Hadoop, I decided we could leave the datanode offline
       and wait until the following morning to replace the disk.

       After the replication was done, I ran “hadoop fsck /” to check the health of HDFS. To my
       surprise, hadoop was reporting 40 missing blocks and 33 corrupted files. This seemed
       strange because all of our files (aside from LoadTest data) is replicated 2x, so the
       loss of a single datanode should not have caused the loss of any files (aside from LoadTest
       data). After parsing the output of “hadoop fsck /”, I found that we had been accidentally
       setting the replication on /store/relval to 1, instead of leaving it at the default of 2. This
       was fixed.

       The next morning we replaced the system disk in the dead datanode and waited for it to
       reinstall (which took a few hours longer than it ought have). Almost immediately after
       starting Hadoop on this last datanode, hadoop fsck reported the filesystem was clean again.
       By 2:15 the following afternoon everything had returned to normal and hadoop was healthy
       again. ”

Requirement 8. The SE must have a well-defined and reliable procedure for decommissioning
hardware which is being removed from the cluster; this procedure should ensure that no files are
lost when the decommissioned hardware is removed. This procedure should be tested and
documented.

The process of decommissioning hardware is documented in the Hadoop twiki under the Operations
guide. The process goes approximately like this:
    1. Edit the hosts exclude file to exclude the to-be-decommissioned host from the cluster.
    2. Issue the “refreshNodes” command in the Hadoop CLI to get the namenode to re-read the
       file. The node should show up as “Decommissioning” in the web interface at this point.
    3. Watch the web interface or the “report” command in the Hadoop CLI and wait until the
       node is listed as “dead”.
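
The first two decommissioning steps lend themselves to a small helper. The sketch below edits an exclude file and returns the Hadoop CLI commands to run next, without running them. The exclude file's location is site configuration, so the path is a parameter; the helper names are invented for illustration.

```python
def add_to_exclude_file(exclude_path, hostname):
    """Append hostname to the hosts exclude file; return False if it was already listed."""
    try:
        with open(exclude_path) as handle:
            content = handle.read()
    except FileNotFoundError:
        content = ""
    listed = {line.strip() for line in content.splitlines() if line.strip()}
    if hostname in listed:
        return False
    with open(exclude_path, "a") as handle:
        # Guard against a file whose last line lacks a trailing newline.
        if content and not content.endswith("\n"):
            handle.write("\n")
        handle.write(hostname + "\n")
    return True


def decommission_commands():
    """Commands to run after editing the exclude file (steps 2 and 3)."""
    return [
        ["hadoop", "dfsadmin", "-refreshNodes"],  # make the namenode re-read the file
        ["hadoop", "dfsadmin", "-report"],        # poll until the node's state changes
    ]
```

Keeping the file edit idempotent means the helper can be re-run safely if the procedure is interrupted.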

This process is not only straightforward but also routine at each site. Decommissioning
is done whenever a node needs to be taken offline for any upgrade lasting more than 10 minutes at

Requirement 9. The SE must have well-defined and reliable procedure for site operators to
regularly check the integrity of all files in the SE. This should include basic file existence tests as
well as the comparison against a registered checksum to avoid data corruption. The impact of this
operation (e.g. load on system) should be documented.

Hadoop’s command line utility allows site admins to regularly check the file integrity of the system.
The check is run with “hadoop fsck /”. At the end of the output, it will either say the file system
is “HEALTHY” or “CORRUPTED”. If it is corrupted, the output provides the information necessary to
repair or remove broken files.
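
A site could wrap this check in a script for cron or a monitoring probe. The sketch below is one possible shape, assuming only what is stated above: the verdict appears at the end of the output and contains the word HEALTHY when all is well. The function names are invented, and the parser deliberately keys on HEALTHY alone rather than on the exact wording of the failure case.

```python
import subprocess


def fsck_is_healthy(fsck_output):
    """True if the last non-empty line of the fsck output reports HEALTHY."""
    lines = [line for line in fsck_output.splitlines() if line.strip()]
    return bool(lines) and "HEALTHY" in lines[-1]


def run_fsck(path="/"):
    """Run 'hadoop fsck <path>' and report whether the file system is healthy."""
    result = subprocess.run(["hadoop", "fsck", path], capture_output=True, text=True)
    return fsck_is_healthy(result.stdout)


if __name__ == "__main__":
    # Exit code 0 when healthy, 1 otherwise, so a cron job or probe can alert on it.
    raise SystemExit(0 if run_fsck("/") else 1)
```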

HDFS registers a checksum at the block level within the block’s metadata. HDFS automatically
schedules background checksum verifications (default is to have every block scanned once every 2
weeks) and automatically invalidates any block with the incorrect checksum. The checksumming
interval can be adjusted downward at the cost of increased background activity on the cluster. We
do not currently have statistics on the rate of failures avoided by checksumming.

Whenever a file is read by a client (even partially; checksums are kept for every 4KB), the client
receives both data and checksum and verifies the data on the client side. Similarly,
when a block is transferred (for example, through rebalancing), the checksum is computed by the
receiving node and compared to the sender’s checksum.
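
Keeping a checksum per fixed-size chunk is what makes verification of partial reads possible: only the chunks overlapping the requested byte range need to be checked. The following sketch illustrates the idea with CRC32 from Python's zlib module and the 4KB granularity quoted above. It is an illustration of the technique, not HDFS's on-disk format or client code.

```python
import zlib

CHUNK_BYTES = 4096  # checksum granularity quoted in the text


def chunk_checksums(data, chunk_bytes=CHUNK_BYTES):
    """Compute one CRC32 per chunk, as a writer would when storing a block."""
    return [zlib.crc32(data[i:i + chunk_bytes]) for i in range(0, len(data), chunk_bytes)]


def verify_range(data, checksums, offset, length, chunk_bytes=CHUNK_BYTES):
    """Verify only the chunks overlapping [offset, offset + length).

    Returns the indices of chunks whose recomputed checksum does not match.
    """
    if length <= 0 or not checksums:
        return []
    first = offset // chunk_bytes
    last = min((offset + length - 1) // chunk_bytes, len(checksums) - 1)
    bad = []
    for index in range(first, last + 1):
        chunk = data[index * chunk_bytes:(index + 1) * chunk_bytes]
        if zlib.crc32(chunk) != checksums[index]:
            bad.append(index)
    return bad
```

A reader that finds a non-empty result would discard the data and fetch the block from another replica.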

Note about catastrophic loss:
We have emphasized that with 2 replicas, file loss is very rare because:
    Failures spaced more than 2 hours apart cause little to no loss because re-replication
      in the file system is extremely fast; one guidance is to expect 1TB per hour to be
      re-replicated.
    Multi-disk failures happening within an hour are usually due to some common piece of
      equipment (such as the rack switch or PDU). Rack-awareness prevents the disappearance
      of an entire rack from causing file loss.
However, what happens if we make the assumption that all safeguards are bypassed and 2 disks are
lost? This is not without precedent; at Caltech, a misconfiguration told Hadoop 2 nodes on the
same rack were on different racks. This bypassed the normal protections from rack awareness. The
rack’s PDU failed and two disks failed to come back up. Caltech lost 54 file blocks.
Using the binomial distribution, the expected number of blocks lost is:
                        (# of blocks lost) = (# of blocks) * P(single block loss)
The binomial distribution is appropriate because the loss of one block does not affect the probability
of another block loss. The standard deviation is approximately the square root of the number of
blocks lost. The probability of a single block loss is
                    P(single block loss) = P(block on node 1) * P(block on node 2)
The probability a block is on a given node is approximately:
                P(block on node 1) = (replication level)*(size of node)/(size of HDFS)
assuming that the cluster is well-balanced and blocks are randomly distributed. Both assumptions
appear to be safe in currently-deployed clusters.

Plugging in Caltech’s numbers (1,540,263 blocks; 342.64 TB in the system, each lost disk was
1TB), the expected number of lost blocks was 52.4 with a standard deviation of 7.2. This is
strikingly close to the actual loss, 54 blocks.
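
The estimate can be reproduced in a few lines. The sketch below applies the formulas given above to the Caltech figures; the function name is invented, and the standard deviation uses the exact binomial form, which for such a small per-block probability is effectively the square root of the expected value.

```python
import math


def expected_block_loss(n_blocks, system_tb, lost_disk_tb, replication=2):
    """Expected number of blocks lost, and its standard deviation, when two disks fail.

    Assumes a well-balanced cluster with randomly distributed blocks.
    """
    # P(block on a given disk) = (replication level) * (size of disk) / (size of HDFS)
    p_on_disk = replication * lost_disk_tb / system_tb
    # A block is lost only if it had a replica on both failed disks.
    p_loss = p_on_disk ** 2
    expected = n_blocks * p_loss
    std_dev = math.sqrt(n_blocks * p_loss * (1.0 - p_loss))
    return expected, std_dev


if __name__ == "__main__":
    expected, std_dev = expected_block_loss(1_540_263, 342.64, 1.0)
    print("expected loss: %.1f blocks, standard deviation: %.1f" % (expected, std_dev))
```

With these inputs the result is roughly 52.5 blocks with a standard deviation near 7.2, matching the figures quoted above to within rounding.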

If only complete files were written (i.e., no block decomposition), then the expected loss would be

                          (# of files lost) = (# of files) * P(single block loss)

So, assuming 128MB blocks and experimental files of around 1GB, there are roughly 10x fewer files
than blocks, and the number of files lost would be about 10x lower. Put differently, a CMS site would
lose about 10x more files using HDFS than it would with whole-file storage. We believe this is an
acceptable risk, especially as the recovery procedure for 5 files versus 50 files is similar. In the
case of simultaneous triple-disk-failure on triple-replicated files, the expected loss would be less
than 1 file for Caltech’s HDFS instance.

Requirement 10. The SE must have well-defined interfaces to monitoring systems such as Nagios
so that site operators can be notified if there are any hardware or software failures.

HDFS integrates with Ganglia; provided that the site admin points HDFS to the right Ganglia
endpoint, many relevant statistics for the namenode and datanodes appear in the Ganglia gmetad
webpages. Many monitoring and notification applications can set up alerts based on this.

Caltech has also contributed several HDFS-Nagios plugins to the public that monitor various
aspects of the health of the system directly. They have also released a Tcl-based desktop application,
“gridftpspy”, which monitors the health and activity of the Globus GridFTP servers. Some of these
are based on the JMX (Java Management eXtensions) interface into HDFS. JMX can integrate with
a wider range of monitoring systems. There is also an external project providing Cacti templates for
monitoring HDFS. The Nagios and gridftpspy components are packaged in the Caltech yum
repository, but not officially integrated; we foresee labeling them experimental for the first
OSG-supported release.

Finally, Caltech has developed the “Hadoop Chronicle”, a nightly email that sends administrators
the basic Hadoop usage statistics. It has an appropriate level of detail to inform site executives
about Hadoop’s usage. The Hadoop Chronicle is now part of the OSG Storage Operations toolkit
and is currently in use at Caltech and in testing at Nebraska.

Note about admin intervention:
The previous two requirements start to cover the topic of “what HDFS activities do site admins
engage in?” and at what interval. We have the following feedback from Nebraska and Caltech site
admins, respectively:
    Nebraska:
           o Daily tasks: Check Hadoop Chronicle, look at RSV monitoring
           o Once a week: Clean up dead hardware, restart dead components. The component
              which crashes most often is BeStMan at about once every 2 weeks.
           o Once every 2 months: Some sort of data recovery or in-depth maintenance.
              Examples include debugging an underreplicated block or recovering a corrupted file.
    Caltech (note: Caltech runs an experimental kernel, which may explain why there is
       more kernel-related maintenance than at Nebraska):
           o Continuously: wait for Nagios alerts
           o Hourly tasks: Check namenode web pages and gridftp logs via gridftpspy (admittedly a
              bit excessive)
           o Daily tasks: Read Hadoop Chronicle, browse PhEDEx rate/error pages
           o Weekly tasks: reboot nodes due to kernel panic, adjust gridftp server list (BeStMan
              plugin currently not used), track down lost blocks (for datasets replicated once),
              maintain ROCKS configuration.
           o Once a month: Reboot namenode with new kernel, reinstall data nodes with bugfix

   6. Performance of the SE

All aspects of performance must be documented.

Requirement 11. The SE must be capable of delivering at least 1 MB/s/batch slot for CMS
applications such as CMSSW. If at all possible, this should be tested in a cluster on the scale of a
current US CMS Tier-2 system.

To test this requirement, Caltech ran a test using dd to read from HDFS through the fuse mount on
each of the 89 worker nodes on the Tier2 cluster. dd was used to maximize the throughput from
the storage system. We acknowledge that the IO characteristics of dd are not identical to those of
CMSSW applications, which tend to read smaller chunks of data in random patterns. Each worker
node ran 8 dd processes in parallel, one per core. Each dd process/batch slot on a single worker
node read a different 2.6GB file from HDFS 10 times in sequence. The same 8 files were read from
each of the 89 worker nodes. At the end of each file read, dd reported the rate at which the file was
read. A total of 18.1TB was read during this test. The final dd was finished approximately 4.25
hours after the test was started.

The average read rate reported by dd was 2.3MB/s ± 1.5MB/s. The fastest read was 22.8MB/s
and the slowest was 330KB/s.

The aggregate rate delivered from HDFS was 18.1TB/4.25 hours = 1238MB/s, or approximately
155MB/s (roughly 1Gbps) per HDFS file, since the same 8 files were read throughout. The test was
designed mainly to see how the system behaves under the 'hot file' problem. As such, it shows
that HDFS can deliver even 'hot files' to the batch slots at the required rates.
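
The figures quoted for the Caltech dd test can be reproduced from the test parameters. The sketch
below is only a cross-check: the names are illustrative, and it assumes binary units (1TB = 1024GB,
1GB = 1024MB), which is the assumption under which the quoted 18.1TB and 1238MB/s agree.

```python
NODES = 89            # worker nodes in the test
SLOTS_PER_NODE = 8    # one dd process per core
READS_PER_SLOT = 10   # each file was read 10 times in sequence
FILE_SIZE_GB = 2.6    # size of each file read
ELAPSED_HOURS = 4.25  # wall-clock duration of the whole test
DISTINCT_FILES = 8    # the same 8 files were read on every node


def summarize():
    """Recompute total volume and rates from the test parameters (binary units assumed)."""
    slots = NODES * SLOTS_PER_NODE
    total_gb = slots * READS_PER_SLOT * FILE_SIZE_GB
    aggregate_mb_s = total_gb * 1024 / (ELAPSED_HOURS * 3600)
    return {
        "total_tb": total_gb / 1024,
        "aggregate_mb_s": aggregate_mb_s,
        "per_file_mb_s": aggregate_mb_s / DISTINCT_FILES,
        "per_slot_mb_s": aggregate_mb_s / slots,
    }


for name, value in summarize().items():
    print(f"{name}: {value:.2f}")
```

Averaged over the full 4.25 hours, the per-slot figure comes out above the 1MB/s requirement; it is
lower than the dd-reported mean because it also counts time when some slots had already finished.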

It should be noted that this test was run while the cluster was also 100% full with CMS production
and CMS analysis jobs, most of which were also reading and writing to Hadoop at the same time.
The background HDFS traffic from this CMS activity was not included in these results.

UCSD ran a separate test with a standard CMSSW application consuming physics data. The same
application has been used for computing challenges and scalability exercises. The application is very
I/O intensive. Here we mainly focused on the application reading data that is stored locally in
Hadoop.

During the tests, 15 datanodes held the data files, each 1GB in size. The block size in
UCSD's Hadoop is 128 MB. The replication of the data files is set to 2. For each file, there are 16
blocks (8 blocks, each stored twice) well distributed across all the datanodes. The application was
configured to run against 1 file or 10 files per job slot. The number of jobs running simultaneously
ranged from 20 to 200. The maximum number of jobs running simultaneously was 250, which was
roughly a quarter of the available job slots at UCSD at that time. The rest of the slots were running
production or user analysis jobs, so the test ran under very typical Tier-2 conditions. The test
application itself did not significantly change the overall condition of the cluster.

The ratio of average job slots running the tests to the number of Hadoop datanodes ranged from
10-20. Eventually this ratio will be 8 if all the WNs are configured as Hadoop datanodes, and each
WN runs 8 slots. This will increase the I/O capability per job slot by 50-100% from the results we
measured in the test.

The average processing time per job is 200 and 4000 seconds for the application processing 1 and 10
GB of data respectively. The average I/O rates for reading the data are shown in the following
figure: average I/O for the application consuming 1GB (left) and 10GB (right). The test shows the
1MB/s per slot requirement is at the low end of the rate that is actually delivered by HDFS. The
average is ~2-3 MB/s per job.

Requirement 12. The SE must be capable of writing files from the wide area network at a
performance of at least 125MB/s while simultaneously writing data from the local farm at an average
rate of 20MB/s.

Below is a graph for the Nebraska worker node cluster.
During this time, HDFS was servicing user requests at a rate of about 2500/sec (as determined by
syslog monitoring using the HadoopViz application). Each user request is a minimum of 32KB, so
this is at least 80MB/s of internal traffic (2500 requests/sec x 32KB). At the same time, we were
writing in excess of 100MB/s as measured by PhEDEx.

Below is an example of HDFS serving data to a CRAB-based analysis launched by an external user.
At the time (December 2008), the read-ahead was set to 10MB. This provided an impressive
amount of network bandwidth (about 8GB/s) to the local farm, but is not an everyday occurrence.
The currently recommended read-ahead size is 32KB.

Requirement 13. The SE must be capable of serving as an SRM endpoint that can send and
receive files across the WAN to/from other CMS sites. The SRM must meet all WLCG and/or CMS
requirements for such endpoints. File transfer rates within the PhEDEx system should reach at least
125MB/s between the two endpoints for both inbound and outbound transfers.

During Aug. 20-24, Caltech and Nebraska ran inter-site load tests using PhEDEx to exercise the
gridftp-hdfs servers.
During this time period, PhEDEx recorded a 48-hour average of 171MB/s coming into the Caltech
Hadoop SE, with files primarily originating from UNL. Peak rates of up to 300MB/s were observed.
There was a temporary drop to zero at ~23:00 Aug. 24 due to an expired CERN CRL.

During this same time period, Caltech was exporting files at an average rate of 140MB/s, with files
primarily destined for UNL. For several hours during this time period the transfer rates exceeded

It must be noted that the PhEDEx import/export load tests were not run in isolation. While these
PhEDEx load tests were running Caltech was downloading multi-TB datasets from FNAL, CNAF,
and other sites with an average rate of 115MB/s and peaks reaching almost 200MB/s.

UCSD has additionally been working on an in-depth study of the scalability of BeStMan, especially
at different levels of concurrency. The graph below shows how the effective processing rate has
scaled with the increasing number of concurrent clients.
The operation used was srmLs without full details; this causes a “stat” operation on the file system,
but reduces the amount of XML generated by the BeStMan server. This demonstrates processing
rates well above the levels currently needed for USCMS. It is sufficient for high-rate transfers of
gigabyte-sized files and uncontrolled chaotic analysis.

   7. Site-specific Requirements
Note: We believe the requirements set out here cover a subset of the functionality required at a
CMS T2 site. We believe that the better test has been putting the storage elements into
production at several sites – the combination of all activities and chaotic loads appears to be better
than artificial tests. An additional test that we recommend below is replacing the skims (which are
bandwidth-heavy and IOPS-light, unlike most T2 activities) with a few analysis jobs (which are
bandwidth-medium and IOPS-heavy).

Note: We've done the best we could without owning more storage (by the end of 2010, each site will
probably double in size). We believe we have demonstrated that the potential bottlenecks (the
namenode) scale out for what we'll need in the next three years. As long as the ratio of cores to
usable terabytes stays on the order of 1 to 1 and not 1 to 10, we believe IOPS will scale as
demonstrated. We believe the fact that Yahoo has demonstrated multi-petabyte clusters shows the
number of raw terabytes will scale.

Note: We believe that the architectures deployed at the current T2 sites (UCSD, Caltech, and
Nebraska) can be repeated at others – in particular, any site that does not rely entirely on a small
number of RAID arrays. It is applicable for sites having issues with reliability or site admin

Requirement 14. A candidate SE should be subject to all of the regular, low-stress tests that are
performed by CMS. These include appropriate SAM tests, job-robot submissions, and PhEDEx load
tests. The SE should pass these tests 80% of the time over a period of two weeks. (This is also the
level needed to maintain commissioned status.)

The chart below shows the status of the site commissioning tests from CMS, which is a combination
of all the regular low-stress tests performed.
Additionally, Caltech's use of a Hadoop SE has maintained a 100% Commissioned site status for the
two weeks prior to Aug. 17:


Requirement 15. The new storage element should be filled to 90% with CMS data. These datasets
should be chosen such that they are currently "popular" in CMS and will thus attract a significant
number of user jobs. Failures of jobs due to failure to open the file or deliver the data products from
the storage systems (as opposed to user error, CE issues, etc.) should be at the level of less than 1
in 10^5 level.
     A suggested test would be a simple "bomb" of scripts that repeatedly opens random files and
        reads a few bytes from them with a high parallelism; for the 10^5 test, it's not necessary to
        do it through CMSSW or CRAB. An example would be to have 200 worker nodes open 500
        random files each and read a few bytes from the middle of the file.

This was performed using the “se_punch.py” tool found in Nebraska’s se_testkit. There were no file
access failures. This script implemented the suggested test – all worker nodes in the Nebraska
cluster simultaneously started opening random files and reading a few bytes from the middle of each.
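
The suggested test can be sketched in a few lines. The code below is not the se_punch.py tool
referenced above; it is a minimal illustration, with hypothetical function names, of what a single
worker node would do: pick random files, seek to the middle of each, read a few bytes, and record
any failure to open or read. It assumes the SE is visible as an ordinary mounted path (for example
through the FUSE mount).

```python
import os
import random


def read_middle(path, nbytes=16):
    """Open a file, seek to its midpoint, and read a few bytes."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        handle.seek(size // 2)
        return handle.read(nbytes)


def punch(paths, sample_size=500, nbytes=16, seed=None):
    """Read from the middle of randomly chosen files.

    Returns (number of attempts, list of (path, error message) failures).
    """
    rng = random.Random(seed)
    chosen = [rng.choice(paths) for _ in range(sample_size)]
    failures = []
    for path in chosen:
        try:
            read_middle(path, nbytes)
        except OSError as err:
            # Failure to open or read counts against the 1-in-10^5 target.
            failures.append((path, str(err)))
    return len(chosen), failures
```

Running `punch` with 500 samples on each of 200 worker nodes gives the 10^5 file opens that the
requirement asks for; the failure lists from all nodes would then be collected and counted.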

Nebraska is now working on a script utilizing PyROOT (which is distributed with CMSSW) that
opens all files on the SE with ROOT. This not only verifies files can be opened, but demonstrates a
minimal level of validity of the contents of the file. Opening with ROOT should fully protect against
truncation (as the metadata required to open the file is written at the end of the file) and whole-file
corruption. It does not detect corruptions in the middle of the file, but built-in HDFS protections
should detect these.

Nebraska ran with HDFS over 90% full during May 2009 and encountered no significant problems
other than writes failing when all space was exhausted. Caltech, however, experienced some corrupted
blocks when HDFS was filled to 96.8% and certain datanodes reached 100% capacity. Some
combination of failed writes, rebalancing, and failing disks resulted in two corrupted blocks and two
corrupted files. These files had to be invalidated and retransferred to the site. This is the only
time that Caltech has lost data in HDFS since putting it into production 6 months ago. There are a
few recommendations to help avoid this situation in the future:
    1) Run the balancer often enough to prevent any datanode from reaching 100%
    2) Don't allow HDFS to fill up enough that an individual datanode partition reaches 100%
    3) If using multiple data partitions on a single datanode, make them of equal size, or merge
       them into a single RAID device so that Hadoop sees only a single partition.
Future versions of Hadoop (0.20) have a more robust API to help manage datanode partitions that
have been completely filled to 100%.
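
One way to act on recommendations 1 and 2 is to watch the fullness of each datanode data partition
locally. The sketch below is an assumption-laden illustration rather than part of any released
tooling: the function names and the 90% threshold are hypothetical, and it uses only Python's
standard shutil.disk_usage call on paths supplied by the site admin.

```python
import shutil


def partition_fullness(paths):
    """Map each path to the used fraction of the filesystem that holds it."""
    result = {}
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total > 0:  # skip pseudo-filesystems that report zero size
            result[path] = usage.used / usage.total
    return result


def over_threshold(fullness, threshold=0.90):
    """Return the paths whose used fraction is at or above the threshold, sorted by name."""
    return sorted(path for path, frac in fullness.items() if frac >= threshold)


if __name__ == "__main__":
    # "." stands in for the datanode data directories, which are site-specific.
    report = partition_fullness(["."])
    for path in over_threshold(report):
        print(f"WARNING: {path} is {report[path]:.0%} full")
```

A site could run such a check from cron or wrap it as a Nagios probe, and trigger the balancer
before any single partition fills.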

Requirement 16. In addition, there should be a stress test of the SE using these same files. Over
the course of two weeks, priority should be given to skimming applications that will stress the IO

Specific CMS skim workflows were run at Nebraska on June 6. However, the results of these were
not interesting as the workflows only lasted 8 hours (no significant failures occurred).

However, the “stress” of the skim tests is far less than the stress of user jobs (especially PAT-based
analysis) due to the number of active branches in ROOT; see CMS Internal Note 2009-18.
Many active branches in ROOT result in a large number of small reads; a CMS job on an idle system
will typically read no more than 32KB per read and achieve 1MB/s, or about 30 reads per second.
Hence, 1000 jobs will achieve 30,000 IOPS if they are not bound by the underlying disk system.
Because the HDFS installs have relatively high bandwidth due to the large number of data nodes,
but the same number of hard drives as other systems, bandwidth is usually not a concern while I/O
operations per second (IOPS) is. See the graphs below demonstrating a large number of IOPS; even
at the max request rate, the corresponding bandwidth required is only 5Gbps. For the hard drives
deployed at the time the graph was generated, this represented about 60 IOPS per hard drive, which
matched independent benchmarks of the hard drives. The bandwidth usage of 5Gbps represents only
a fraction of the bandwidth available to HDFS.
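
The IOPS estimate above follows directly from the read size and the per-job rate. The sketch below
only restates that arithmetic; the function names are illustrative, and it assumes 1MB = 1024KB,
which gives 32 reads per second per job (rounded to 30 in the text).

```python
def iops_per_job(rate_mb_s=1.0, read_size_kb=32):
    """Reads per second needed for one job to sustain rate_mb_s with reads of read_size_kb."""
    return rate_mb_s * 1024 / read_size_kb


def cluster_iops(jobs, rate_mb_s=1.0, read_size_kb=32):
    """Total reads per second when every job sustains the same rate (no disk bottleneck assumed)."""
    return jobs * iops_per_job(rate_mb_s, read_size_kb)


print(iops_per_job())       # reads per second for one job at 1MB/s with 32KB reads
print(cluster_iops(1000))   # reads per second for 1000 such jobs
```

Because the read size is small, the IOPS count grows much faster than the bandwidth as jobs are
added, which is why the text treats IOPS rather than bandwidth as the limiting resource.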

Because HDFS approaches the underlying hardware limits of the system during production, we
consider typical user jobs to be the best stressor of the system. Such “stress tests” occur in large
batches on a weekly basis at both Nebraska and Caltech. During the tests in this requirement and
others, Nebraska and Caltech’s systems were in full production for CMS – simulation, analysis, and
WAN transfer – and often the batch slots were 100% utilized. By default, data went to HDFS and
only a few datasets were kept on dCache. UCSD’s system was smaller and shared the CMS
activities with a dCache instance.

Requirement 17. As part of the stress tests, the site should intentionally cause failures in various
parts of the storage system, to demonstrate the recovery mechanisms.

As noted in Requirement 16, an HDFS instance in large-scale production is sufficient for
demonstrating stress. During production at Nebraska and at Caltech, we have observed failures of
the following components:
     Namenode: When a namenode dies, the only currently used recovery mechanism is to replace
        the server (or fix the existing server) and copy a checkpoint file into the appropriate
        directory. A high-availability setup has not yet been investigated by our production sites,
        mostly due to the perceived complexity for little perceived benefit (namenode failure is rare).
        This has been demonstrated in production at Nebraska and Caltech. When the namenode
        fails, writes will not continue and reads will fail if the client had not yet cached the block
        locations for open files.
     Datanode: Datanode failures are designed to be an everyday occurrence, and they have
        indeed occurred at both Nebraska and Caltech. The largest operational impact is the
        amount of traffic generated by the system while it is re-replicating blocks to new hosts.
     Globus GridFTP servers: Each transfer is spawned as a separate process on the host by
        xinetd. This results in the server being extremely reliable in the face of failures or bugs in
        the GridFTP server. When a GridFTP host dies, others may be used by SRM. Nebraska
        and UCSD have implemented schemes where the SRM server stops sending new transfers to
        the failed GridFTP server. Caltech has also implemented a GridFTP appliance integrated with
        the Rocks cluster management software that can be used to install and configure a new
        GridFTP server in 10 minutes.
     SRM server: When the SRM server fails, all SRM based transfers will fail until it has been
        restarted manually (the service health is monitored via RSV). This happens infrequently
        enough in production that no automated system has been implemented, although LVS-based
        failover and load-balancing is plausible because BeStMan is stateless. Caltech has
        implemented a BeStMan appliance integrated with the Rocks cluster management software
        that can be used to install and configure a new BeStMan server in 10 minutes.

   8. Security Concerns

HDFS has unix-like user/group authorization, but no strict authentication. HDFS should only be
exposed to a secure internal network which only non-malicious users are able to access. For users
with unrestricted access to the local cluster, it is not difficult at all to bypass authentication. There
is no encryption or strong authentication between the client and server, meaning that both the
server and the client must be trusted. This is the primary reason why HDFS must be segregated onto
an internal network.
It is possible to reasonably lock down access by:
     1. Preventing unknown or untrusted machines from accessing the internal network. This
        requirement can be removed by turning on SSL sockets in lieu of regular sockets for
        inter-process communication. We have not pursued this method due to the perceived
        performance penalty.
             a. “Untrusted machines” include end users’ laptops and desktops. Access from such
                machines could be allowed via Xrootd redirectors (for ROOT-based analysis) or by
                exporting the file system via HTTPS (allowing whole-file download).
     2. Preventing non-FUSE users from accessing HDFS ports on the known machines on the
        network. This means only the HDFS FUSE process will be able to access the datanodes and
        namenode; this allows the Linux filesystem interface to sanitize requests and prevents users
        from having TCP-level access to HDFS.
It’s important to point out that in (2), we are relying on the security of the clients on the network.
If a host is compromised at the root level, the attacker can perform any arbitrary action with
sufficient effort. During the various tests outlined above, the sites’ security was based on either
the internal NAT (Caltech and Nebraska) or firewalls eliminating access to the outside world.
Security concerns are actively being worked on by Yahoo. The progress can be followed on this
master JIRA issue:
In release 0.21.0, access tokens issued by the namenode prevent clients from accessing arbitrary
data on the datanode (currently, one only needs to know the block ID to access it). Also in 0.21.0,
the transition to the Java Authentication and Authorization Service has begun; this will provide the
building blocks for Kerberos-based access (Yahoo’s eventual end goal). Judging by current
progress, transitioning to Kerberos-based components could happen during 2010.
If a vulnerability is discovered, we would release updated RPMs within one workweek (sooner if the
packaging is handled by the VDT). This probably will not be necessary as the security model is
already very permissive. Security vulnerabilities are one of the few reasons we will update the
“golden set” of RPMs.

Note: Example damage a rogue batch job could do
To demonstrate the security model, we give a few examples of what a rogue job could do:
    Excessive memory usage by the rogue job could starve the datanode process and cause it to
       crash. Most sites limit the amount of memory allowed for individual batch jobs, so this is not
       a big concern.
    If the rogue job has write access to the datanode partition, then it could fill up the partition
       with garbage which would prevent the datanode from writing any further blocks. This will not
       cause the datanode to fail, but will cause a loss of usable space in the SE.
           o Most sites use Unix file system permissions to prevent this.
    A malicious batch job with telnet access to the Hadoop datanode could request any block of
       data if it knows the block ID. This is fixed in the HDFS 0.21.0 branch (to be released
       approximately in November).
    A malicious batch job with telnet access to the Hadoop namenode could perform arbitrary file
       system commands. This could result in a lot of damage to the storage system, which is why
       we recommend client-side firewalls.
           o This is a known weakness in the current security model and is being addressed in
              current Hadoop development.

Grid Components (GridFTP and BeStMan)
Globus GridFTP and BeStMan both use standard GSI security with VOMS extensions; we assume
this is familiar to both CMS and FNAL. Because both components are well-known, we do not
examine their security models here.
If a vulnerability is discovered in any of these components, we would release an RPM update once our
upstream source (the VDT) has this update. The target response time would be one workweek
while packaging is done at Caltech, and in lockstep with the VDT update when that team does

   9. Risk Analysis
In this section, we analyze different risks that are posed to the different pieces of the HDFS-based
SE. We attempt to present the most pressing risks in the proposed solution (both technical and
organizational), and point out any mitigating factors.

HDFS is both the core component and a component external to grid computing. Hence, its risk
must be examined most closely.
   1. Health of Hadoop project: HDFS is completely dependent on the existence and continued
      maintenance of the Hadoop project. Continued development and growth of this project is
      critical. Hadoop is a top-level project of the Apache Software Foundation; in order to
      achieve this status, the following requirements were necessary:
          a. Legal
                    i.   All code ASL'ed (Apache Software License, a highly permissive open-source
                    ii.  The code base must contain only ASL or ASL-compatible
                    iii. License grant complete.
                    iv.  Contributor License Agreement on file.
                    v.   Check of project name for trademark issues.
              This legal legwork protects us from code licensing issues and various other legal
          b. Meritocracy / Community
                    i.   Demonstrate an active and diverse development community
                    ii.  The project is not highly dependent on any single contributor (there
                         are at least 3 legally independent committers and there is no single company
                         or entity that is vital to the success of the project)
                    iii. The above implies that new committers are admitted according to ASF
                    iv.  ASF style voting has been adopted and is standard practice
                    v.   Demonstrate ability to tolerate and resolve conflict within the
                    vi.  Release plans are developed and executed in public by the community.
                    vii. ASF Board for a Top Level Project, has voted for final acceptance.
              The ASF has shown that these community guidelines and requirements are hallmarks
              of a good open source project.
   The fact that HDFS is an ASF project and not a Yahoo corporation project means that it is
   not tied to the health of Yahoo. The current HDFS lead is employed by Facebook, not
   Yahoo. At this point in the project’s life, about 40% of the patches come from non-Yahoo
   employees. Regarding the recent switch to Microsoft as the company’s search engine
   provider, Yahoo has made public statements that:
         Hadoop is used for almost every piece of the Yahoo infrastructure, including: spam
           fighting, ads, news, and analytics.
         Hadoop is critical to Yahoo as a company, and is not a subproject of the search
           engine. It is possible that money previously invested into the search engine
           technology will now be invested into Hadoop.
   Cloudera has received about $16 million in start-up capital and employs several key
   developers, including Doug Cutting, the original author of the system. Hadoop maintains a
   listing of web sites and companies utilizing its technology,

   Condor currently funds a developer working on Hadoop, and is investigating the use of
   HDFS as a core component.

   While we believe these reasons mitigate the risk of HDFS development becoming stagnant,
   we believe this is the top long-term risk associated with the project.
2. Hadoop support / resolution of bugs: There is no direct monetary support for large-scale
   HDFS development, nor is the success of HDFS dependent upon WLCG usage. We have no
   paid support for HDFS (although it can be purchased). This is mitigated by:
       a. Paid support is available: We have good contacts with the Cloudera technical staff,
          and would be able to purchase development support as needed. Several project
          committers are on Cloudera staff.
       b. Critical bugs affect large corporations: Any bug we are exposed to affects Yahoo and
          Facebook, whose businesses depend on HDFS. Hence, any data loss bug we discover
          will be of immediate interest to their development teams. When Nebraska started
          with HDFS, we had issues with blocks truncated by ext3 file system recovery. This
          triggered a long investigation by a member of the Yahoo HDFS team, resulting in
          many patches for 0.19.0. Since that version, we have not seen the truncation issue
       c. Acceptance of patches: Nebraska has contributed on the order of 5 patches to HDFS,
          and has not had issues with getting patches accepted by the upstream project. The
          major issue has been passing the acceptance criteria – each patch must meet coding
          guidelines, pass code review from a different coder, and come with a unit test (or an
          explanation of why a new unit test is not needed).
                    i. We have opened 30 issues. 10 of these issues have been fixed. 4 have been
                       closed as duplicate. 4 have been closed as invalid. 12 remain open; 6 of
                       these have a patch available, but have not been committed. Of the remaining
                       open issues, only 1 is applied to our local distribution (the same patch is also
                       applied to the Cloudera distribution).
          d. Large number of unit tests: HDFS core has good unit test coverage (Clover coverage
             of 76% http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/clover/).
             Every nontrivial commit requires a unit test to be committed along with it. Because of
             the initial difficulties in getting completely safe sync/append functionality, a large set
             of new unit tests was developed for 0.21.0 based on a fault-injection framework. The
             fault injection framework provides developers with the ability to better demonstrate
             not only correct behaviors, but correct behaviors under a variety of fault conditions.

             The unit tests are run nightly using Apache Hudson
             (http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/) and take several
      Each point helps to mitigate the issue, but does not completely remove the issue. In the
      extreme case, we are prepared to run on locally-developed patches that are not accepted by
      the upstream project. This would hurt our efforts to keep support costs under control, so
      we would avoid this situation.
   3. Hadoop feature set: We believe HDFS currently has all the features necessary for adoption.
      We do not believe that any new features are required in the core. However, it should be
      pointed out the system is not fully POSIX compliant. Specifically, the following is missing:
          a. File update support: Once a file is closed, it cannot be altered. In HDFS 0.21.0,
             append support will be enabled. We do not believe this will ever be necessary for
          b. Multiple write streams / random writes: Only a single stream of data can write to an
             open file and doing a seek() during write is not supported. This means that one may
             not write a TFile directly to HDFS using ROOT; first, the file must be written to
             disk, then copied to HDFS. We developed in-memory stream reordering in GridFTP
             in order to avoid this limitation. If USCMS decides to write files directly to the SE
             and not use local scratch, HDFS will not be immediately supported. We believe this
             to be a low risk.
          c. Flush and append: A file is not guaranteed to be fully visible until it has been closed.
             Until it is closed, it is not defined how much data a reader will see if they attempt to
             read the file. Flush and append support will be available for HDFS 0.21.0, which will
             provide guaranteed semantics about when data will be available to readers. We do
             not believe this will be an issue for CMS.

FUSE-DFS, as a contributed project in Hadoop, shares many risks with HDFS. There are a few
concerns we believe relevant enough to merit their own category.
   1. FUSE support: FUSE has no commercial company providing support. However, it is a part of
      the mainline Linux kernel, is over 5 years old, and has had a stable interface for quite a while.
      We have never seen any issue from FUSE itself. We believe the FUSE kernel module is a
      risk because OSG has less experience packaging kernel modules, and kernel modifications
      often result in support issues. This is mitigated by the fact that OSG-supported Xrootd
      requires the FUSE kernel module (meaning that HDFS isn’t unique in this situation) and that the
      ATrpms repository provides a FUSE kernel module and tools to build the RPM module for
      non-standard kernels. UCSD and Caltech both build their own kernel modules; Nebraska
      uses the ATrpms ones.
   2. FUSE-DFS support: FUSE-DFS is the name of the userland library that implements the
      FUSE filesystem. This was originally implemented by Facebook and is in the HDFS SVN
      repository as a contributed module. It does not have the same level of support as HDFS
      Core because it is contributed; it also does not have as many companies using it in
      production. Through the process of adopting HDFS, we have discovered critical bugs and
      have submitted several FUSE-DFS related patches, which were accepted. We have even recently
      found memory leak bugs in the libhdfs C wrappers (these had not been previously discovered
      because they are only noticeable when things are in continuous production). We believe this
      component is the short-term highest-risk software component of the entire solution. The
      mitigating factors for FUSE-DFS are:
          a. Small, stable codebase: FUSE-DFS is basically a small layer of glue between FUSE
              and the libhdfs library (a core HDFS component, used by Yahoo). The entire code
              base is around 2,000 lines, about 4% of the total HDFS size. During our usage of
              HDFS, neither the libhdfs nor the FUSE API has changed. This limits the number of
              undiscovered bugs and rate of bug introduction. We believe the majority of the
              possible issues are fixed.
          b. Production experience: We have been running FUSE for more than 8 months, and feel
              that we have a good understanding of possible production issues. As of the latest
              release, the largest outstanding issue is that FUSE must be remounted
              whenever users are added to or removed from groups (user-to-group mappings are
              currently cached indefinitely). This is well understood and can be worked around.
              This bug may be mitigated in future versions of HDFS, as a fix will be necessary for the
              future Kerberos-based authentication and authorization.
          c. Extensive debugging experience available: The last FUSE-DFS memory leak bug
              tackled required in-depth debugging at Nebraska. We believe we have the
              experience and tools necessary to handle any future bugs. We intend to make sure
              that any locally developed patches are upstreamed to the HDFS project.
      There have been other FUSE binding attempts, but this is the only one that has been
      supported or developed by a major company (Facebook) and committed as part of the
      HDFS project. The other attempts appear never to have been completed or kept up to
      date with HDFS.
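
The group-caching issue described in item 2b can be illustrated with a small sketch. The
following Python is purely illustrative and is not FUSE-DFS code (FUSE-DFS is written in C);
the class name GroupCache, the ttl parameter, and the lookup callback are all hypothetical.
It contrasts a cache that never expires, which needs the equivalent of a remount to see a
group change, with one whose entries expire after a fixed time.

```python
import time


class GroupCache:
    """Toy user-to-group cache. ttl=None mimics caching indefinitely."""

    def __init__(self, lookup, ttl=None, clock=time.monotonic):
        self._lookup = lookup    # hypothetical callback: username -> list of groups
        self._ttl = ttl          # lifetime of an entry in seconds, or None for no expiry
        self._clock = clock      # injectable clock, so expiry can be tested
        self._entries = {}       # username -> (groups, time the entry was cached)

    def groups(self, user):
        now = self._clock()
        hit = self._entries.get(user)
        if hit is not None:
            groups, cached_at = hit
            # With no TTL the cached answer is returned forever.
            if self._ttl is None or now - cached_at < self._ttl:
                return groups
        # Miss or expired entry: ask the authoritative source again.
        groups = list(self._lookup(user))
        self._entries[user] = (groups, now)
        return groups

    def clear(self):
        """Drop every entry; comparable in effect to remounting."""
        self._entries.clear()
```

With ttl=None, a change in group membership is only seen after clear() is called; with a
finite ttl, it is picked up on the first lookup after the entry expires, at the cost of
more frequent lookups against the authoritative source.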

BeStMan is an already supported component of the OSG. We have identified the associated risks:
   1. BeStMan runs out of funding: As BeStMan is quickly becoming an essential OSG package,
      we believe that it will always meet the needs of USLHC, even if it is not funded at LBNL.
   2. BeStMan currently uses the Globus 3 container: The Globus 3 web services container was never
      in large-scale use, and currently suffers from debilitating bugs and an unmaintained
      architecture. The BeStMan team is currently devoting most of its effort to replacing this with
      an industry-standard Tomcat webapp container. This should be delivered fall-winter 2009.
      We believe this will remove many bugs and improve the overall source code. It would also
      make it possible for external parties to submit improvements.

Globus GridFTP
Globus GridFTP is an already-supported component of the OSG. We have identified the associated risks:
    1. Globus GridFTP runs out of funding: Globus GridFTP is an essential component to the
       OSG. If it runs out of funding, we will use whatever future solution the OSG adopts.
    2. Globus GridFTP model possibly not satisfactory: The Globus GridFTP model is based on
       processes being launched by xinetd. Because each transfer is a separate process, issues
       affecting one transfer are isolated from other transfers. However, this makes it
       extremely hard to enforce limits on the number of active transfers per node. This can lead
       to either instability (by having no limit) or odd errors (globus-url-copy does not
       gracefully report when xinetd refuses to start new servers). We would like to investigate
       multi-threaded daemon-mode Globus GridFTP, but have not yet identified the effort. Current
       T2 sites mitigate this mostly by controlling the number of concurrent transfers (except
       CRAB stageouts) and providing sufficient hardware to accommodate an influx of transfers.
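
To illustrate why a single multi-threaded daemon makes a per-node transfer limit easier to
enforce than one process per transfer, here is a minimal Python sketch. It is not Globus
GridFTP code and does not model xinetd; the class name TransferLimiter and its run method
are hypothetical. A shared semaphore counts active transfers, and a request that would
exceed the limit is refused with an explicit error rather than failing obscurely.

```python
import threading


class TransferLimiter:
    """Toy per-node limit on concurrently active transfers."""

    def __init__(self, max_active):
        # One slot per allowed concurrent transfer, shared by all threads.
        self._slots = threading.BoundedSemaphore(max_active)

    def run(self, transfer, *args):
        """Run transfer(*args) if a slot is free; otherwise refuse clearly."""
        if not self._slots.acquire(blocking=False):
            # Explicit refusal that a client could report and retry on.
            raise RuntimeError("transfer limit reached; retry later")
        try:
            return transfer(*args)
        finally:
            # Always free the slot, even if the transfer raised.
            self._slots.release()
```

The trade-off noted above still applies: sharing state in one process gives up some of the
isolation that separate processes provide.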

Component Plug-ins
Both BeStMan and GridFTP require plug-ins in order to achieve the desired level of functionality in
this SE. We have identified the associated risks:
    1. Future changes in versions of underlying components: We may have to update plugin code if
       the related component changes its interface. For example, BeStMan2 may require a new
       Java interface to implement GridFTP selector plugins. Even if the API remains the same,
       the underlying assumptions could change; for example, the GridFTP plug-in might need to
       become thread-safe.
    2. Original author leaves USCMS: If the original author leaves USCMS, then much knowledge
       would be lost, even if the effort is replaced. This is why focus is being put into clean
       packaging, documentation, and ownership by an organization (OSG) as opposed to just one
       person. The BeStMan component is relatively simple and straightforward, mitigating this
       concern. The GridFTP component is not, due to the complexity of the Globus DSI interface
       (by far, the most complex interface in the SEs). This is high-performance C code and
       difficult to change. If the original author left and the Globus DSI module changed
       significantly, USCMS would need to invest about 1 man-month of effort to perform the
       upgrade. This is mitigated by the fact that the current system has no outstanding
       GridFTP feature upgrades that are required; USCMS can run on the same plugin for a
       significant amount of time.

We have worked hard to provide packaging for the entire solution. The current packaging does
offer a few pitfalls:
    1. Original author leaves USCMS: The setup at Caltech is based on “mock”, the standard
        Fedora/Redhat build tool. The VDT currently does not have the processes in place
        to package RPMs effectively, but this is a planned development for Year 4. Until the
        packaging duties can be transferred from Caltech to VDT (perhaps late Year 4), we will be
        dependent on the setup there. We are attempting to get it better documented in order to
        mitigate risk.
    2. Patches fail to get upstreamed: It is crucial to send patches upstream and maintain the
        minimum number of changes from the base install. We must remain diligent in making sure
        to commit upstream fixes for any bugs.
    3. Rate of change: Even with only bug fix updates for “golden releases”, the rate of updates is
        always worrying. Most of the updates recently have been related to packaging issues,
        especially for platforms not present at any production T2 cluster. We hope that the added
        OSG effort in Year 4 will enable us to drastically reduce the rate of change.
    4. Update mechanisms for ROCKS clusters: Currently, doing a “yum install” is the correct way
        to install the latest version of the software. However, when an administrator adds the RPMs
        to a ROCKS roll, they get locked into that specific version and must manually take action to
        upgrade the RPMs. This means there will always be significant resistance to changing
        versions. This makes decreasing the rate of updates even more important.

Experts and Funding
Much of this work was done using several CMS experts. We outline two risks:
   a) Loss of experts: As mentioned above, we take a significant hit if our experts leave the
       organization. We are focusing heavily on documentation, packaging, and “finishing off”
       development (in fact, preparation for this review has prompted us to clear several
       long-standing issues). This will allow us to produce the first “golden set”, and will also
       increase the length of time HDFS can be maintained between experts.
           a. A significant amount of CMS T2 funding comes from the DISUN project, which ends in
               Spring 2010. DISUN personnel contribute to the HDFS effort. This is an ongoing
               concern for the HDFS effort and the CMS T2 program as a whole.
   b) Loss of OSG: Much of the risk and effort is being shared with the OSG to leverage their
       packaging expertise. Having HDFS in the OSG taps into an additional pool of human
       resources outside the experts in USCMS. However, the current funding for the OSG runs
       out in 2 years (and is reduced in 1 year). If the OSG funding is lost, then we will have to
       again rely internally on USCMS personnel, similar to FY2009.
The catastrophe scenario for HDFS adoption is both funding loss in the OSG and loss of the
experts. In this case, the survival plan would be:
   • Identify funding for new experts (from experience, it takes about 6 months to train a new
     expert once they are in place). These can be drawn from the pool of HDFS sysadmins; as
     HDFS gains wider use, the pool of potential experts is broadened.
   • No new “golden set” until a packaging, testing, and integration program can be
     re-established. If this becomes a chronic problem, a strong focus would be placed on switching
     entirely to Cloudera’s distribution in order to offload the QA testing of major changes to
     an external organization.
   • No new USCMS-specific features. We believe that HDFS has all the major features
     necessary for CMS adoption, but we do find small useful ones (an example would be the
     development of Ganglia 3.1 compatibility). Without a local expert, developing these for CMS
     would not be possible. Without a local expert, running with any patches not accepted by the
     upstream project also becomes increasingly dangerous.
