Hadoop File System as part of a CMS Storage Element
In the last year, several new storage technologies have matured to the point where they have
become viable candidates for use at CMS Tier2 sites. One particular technology is HDFS, the
distributed filesystem used by the Hadoop data processing system.
2. The HDFS SE
HDFS is a file system; as such, it must be complemented by other components in order to build a
grid SE. We consider a minimal set of additional components to be:
a) FUSE / FUSE-DFS: FUSE, a standard Linux kernel module, allows filesystems to be written
in userspace. The FUSE-DFS library makes HDFS into a FUSE filesystem. This allows a
POSIX-like interface to HDFS, necessary for user applications.
b) Globus GridFTP: Provides a WAN transfer protocol. Using Globus GridFTP with HDFS
requires a plugin developed by Nebraska.
c) BeStMan: Provides an SRMv2 interface. UCSD and Nebraska have implemented plugins for
BeStMan to allow smarter selection of GridFTP servers.
There are optional components that may be layered on top of HDFS, including XrootD, Apache
HTTP, and FDT.
This document aims to show that the combination of HDFS (as filesystem), FUSE (as local
“POSIX-like” interface), Globus GridFTP (as WAN transfer protocol), BeStMan (as SRM), and some
plugins developed by CMS Tier-2 teams at Nebraska, Caltech, and UCSD meets the SE
requirements set forth by USCMS Tier 2 management. The requirements are typeset in italics, and
the responses are given below them in normal typesetting. Throughout this document, we use
“Hadoop” and “HDFS” interchangeably; generally, “Hadoop” refers to the entire data processing
system, but in this context we take it to refer only to the filesystem components.
4. Management of the SE
Requirement 1. A SE technology must have a credible support model that meets the reliability,
availability, and security expectations consistent with the area in the CMS computing infrastructure
in which the SE will be deployed.
Support for this SE solution is provided by a combination of OSG, LBNL, Globus, the Apache
Software Foundation (ASF), DISUN, and possibly the US CMS Tier-2 program as follows.
BeStMan is supported by LBNL and GridFTP by Globus. Both are part of the OSG
portfolio of storage solutions. HDFS is supported by ASF, as elaborated below, and FUSE
is part of the standard Linux distribution. This leaves us with two types of plugins that are required
to integrate all the pieces into a system, as well as the packaging support. The first is a plugin to
BeStMan to pick a GridFTP server from a list of available GridFTP servers; this
list is reloaded every 30 seconds from disk and servers are randomly selected (otherwise, the default
policy is a simple round-robin and it requires an SRM restart to alter the GridFTP server list). The
second is a GridFTP plugin to interface to HDFS. We propose that both of these types of plugins
continue to be supported by their developers from DISUN and US CMS, until OSG has had an
opportunity to gain sufficient experience to adopt the source and make it its own. Similarly,
packaging support in the form of RPMs is to be provided initially by DISUN/Caltech, and later by
US CMS/Caltech, until OSG has had an opportunity to adopt it as part of a larger migration
towards providing native packaging in the form of RPMs. OSG ownership of these software artifacts has
been agreed upon in principle, but not yet formally. OSG support for this solution will start with
Year 4 of OSG (October 1st, 2009), and will initially be restricted to:
a) Pick a set of RPMs twice a year and verify that this set is completely consistent, providing a
well-integrated system. We refer to this as the “golden set”.
b) Document installation instructions for this golden set.
c) Do a simple validation test on supported platforms (with the validation preferably automatic).
d) Performance test the golden set, and document that test. This performance test will be in-depth.
e) Provide operations support for two golden sets at a time. This means that there is a staff
person on OSG who is responsible for tracking support requests, answering simple questions,
and finding solutions to difficult questions via the community support group organized in
f) OSG will provide updates to a golden set only for important bug and security fixes; these
critical patches will go through validation tests, but not performance tests.
Support will be official for RHEL4 and RHEL5 derivatives on both 32-bit and 64-bit platforms. The
core HDFS software (the namenode and datanode) is usable on any platform providing the Java 1.6
JDK. Currently, Caltech and Nebraska both run datanodes on Solaris. The main limitation for
FUSE clients is support for the FUSE kernel module; this is supported on any Linux 2.6 series
kernel, and FUSE was merged into the kernel itself in 2.6.14.
Upgrades in HDFS are covered in this wiki document:
http://wiki.apache.org/hadoop/Hadoop_Upgrade. On our supported version, upgrades on the same
major version will only require a “yum update”. For major version upgrades, the procedure is:
a) Shut down the cluster.
b) Upgrade via yum or RPMs.
c) Start the namenode manually with the “-upgrade” flag.
d) Start the cluster. The cluster will stay in read-only mode.
e) Once the cluster’s health has been verified, issue the “hadoop dfsadmin -finalizeUpgrade”
command. After the command has been issued, no rollback may be performed.
The wiki explains additional recommended safety precautions.
Hadoop is an open-source project hosted by the Apache Software Foundation. ASF hosts multiple
mailing lists for both users and developers, which are actively watched by both the Hadoop
developers and the larger Hadoop community. Members of this community include Yahoo
employees who use Hadoop in their workplace, as well as employees of Cloudera, which provides
commercial packaging and support for Hadoop. As it is a top-level Apache project, it has
contributors from at least three companies and strong project management. Yahoo has
stated that it has invested millions of dollars into the project and intends to continue doing so.
Hadoop depends heavily upon its JIRA instance, http://issues.apache.org/jira/browse/HADOOP,
for bug reporting and tracking. As it is an open-source project, there’s no guarantee of response
time to issues. However, we have found high-priority issues (those that may lead to data loss) get
solved quickly because bugs affecting a T2 site have a high probability of affecting Yahoo’s
Requirement 2. The SE technology must demonstrate the ability to interface with the global data
transfer system PhEDEx and the transfer technologies of SRM tools and FTS as well as demonstrate
the ability to interface to the CMSSW application locally through ROOT.
Caltech is using Hadoop exclusively for OSG RSV tests, PhEDEx load tests, user stageouts via
SRM, ProdAgent stageout via SRM and POSIX-like access via FUSE. User analysis jobs
submitted to Caltech with CRAB have been running against data stored in Hadoop for several
UNL has been using HDFS at large scale since approximately late February 2009. It is used for:
PhEDEx data transfers for all links (transfers are done with SRM and FTS).
OSG RSV tests (used to meet the WLCG MoU availability requirements)
Monte Carlo simulation and merging.
User analysis via CMSSW (both grid-based submissions using CRAB and local interactive
UCSD has been using HDFS for serving /store/user.
Caltech, UNL, and UCSD have been leading the effort to demonstrate the scalability of the
BeStMan SRM server. This has recently resulted in an srmLs processing rate of 200 Hz on a single
server; this is 4 times larger than the rate required by CMS for FY2010, and is in fact the most
scalable SRM solution deployed in global CMS.
Requirement 3. There must be sufficient documentation of the SE so that it can be installed and
operated by a site with minimal support from the original developers (i.e. nothing more than "best
effort"). This documentation should be posted on the OSG Web site, and any specific issues in
interfacing the external product to CMS product should be highlighted.
Installation, operation, and troubleshooting directions can be found at
http://twiki.grid.iu.edu/bin/view/Storage/Hadoop. This has already been discussed in Requirement
1, and we elaborate further here. It should be plausible for any site admin to install HDFS
and the corresponding grid components without support from the original developers. This provides
a coherent install experience; all components, including BeStMan and GridFTP, are available as
RPMs. Admins experienced with the Red Hat tool “yum” will find that the SE is installable via a
simple “yum install hadoop”. The DISUN/Caltech packaging also provides useful logging defaults
that enable one to easily centrally log errors happening in HDFS; this greatly aids admins in
With the OSG 1.2 release, there are no specific issues pertaining to using HDFS with CMS.
Experience shows that operational overhead at Caltech has been approximately 1 FTE, and that
includes R&D activities for packaging and testing. Going forward, we believe the overhead will
decrease, as the R&D portions will be greatly reduced. Nebraska, which has been in a stable state
with Hadoop for several months, shows that operational overhead is less than 1 FTE. UCSD, which
is presently supporting HDFS for /store/user (113TB) and dCache for all else (273TB), reports the
same experience. The main reason this solution is experienced as less costly to operate is that it
has many fewer moving parts. This much reduced complexity results in significantly lower
operational overhead.
Requirement 4. There must be a documented procedure for how problems are reported to the
developers of those products, and how these problems are subsequently fixed.
Starting with Year 4 of OSG, all problems are reported via OSG. OSG then uses the support
mechanisms as discussed in Requirement 1.
In particular, Hadoop has an online ticket system called JIRA, which is heavily used by the
developers to receive and track bugs and feature requests for Hadoop and Hadoop-related
projects. JIRA is open for viewing by everyone, and requires a simple account registration for
posting comments and new tickets. All commits in the project must be traceable via JIRA and go
through the quality control process (which includes code review by a different developer and passing
The JIRA system can be found at http://issues.apache.org/jira/browse/HADOOP. The Hadoop
community has also written a guide to filing bug reports,
Requirement 5. Source code required to interface the external product to CMS products must be
made available so that site operators can understand what they are operating. If at all possible,
source code for the external product itself should also be available.
All software components are open source. In particular, Hadoop source code can be downloaded
from http://www.apache.org/dyn/closer.cgi/hadoop/core/. Patches for specific problems in a
release can be downloaded from JIRA (see above). Any patches currently applied to the Caltech
Hadoop distribution have been submitted to JIRA, and we have tried to make sure that they get
committed in a timely manner. This helps us minimize the costs of maintaining our own RPM
Yahoo has publicly committed to releasing its “stable patch set” to the world beginning with 0.20.0.
They are committed to keeping all patches used publicly available in the JIRA; the only added
knowledge is which patches are stable enough to be applied to current releases. When we upgrade
to this version, we will be able to tap into this significant resource; Yahoo has a committed QA team
and a test cluster an order of magnitude larger than our production cluster.
The BeStMan source code is open source for academic users; the OSG is working on clearing up the
licensing of this software and making the code more freely available. Each of the plug-ins used (for
BeStMan and GridFTP) is available in the Nebraska SVN repository, which allows anonymous
5. Reliability of the SE
Requirement 6. The SE must have well-defined and reliable behavior for recovery from the failure
of any hardware components. This behavior should be tested and documented.
We broadly classify the HDFS grid SE into three parts: metadata components, data components,
and grid components. Below, we document the risks a failure in each poses and the suggested
Metadata components: The two major metadata components for HDFS are the
namenode and the “secondary namenode” (recently renamed to the “backup”) servers. The
namenode is the single point of failure for normal user operation. Because of this, there are
several built-in protections each site is recommended to take:
o Write out multiple copies of the journal and file system image. HDFS heavily relies
on the metadata journal as a log of all the operations that alter the namespace.
HDFS config files allow for writing the journal onto multiple partitions. It is
recommended to write the journal on two separate physical disks in the namenode and
suggested that a third copy be written to an NFS server.
o The secondary namenode / backup server allows the site admin to create checkpoint
files at regular intervals (default is every hour or whenever the journal reaches 64MB
in size). It is strongly recommended to run the secondary namenode on a separate
physical host. The last two checkpoints are automatically kept by the secondary
o The checkpoints should be archived, preferably off-site.
Future versions of HDFS (0.21.x) plan to provide an offline checkpoint verification tool and to
stream journal information to the backup node in real-time, as opposed to an hourly
In the case of namenode failure, the following remedies are suggested:
1. Restart the namenode from the file system image and journal found on the
namenode’s disk. The only resulting file loss will be the files being written at the
time of namenode failure. This action will work as long as the image and journal are
2. Copy the checkpoint file from the secondary namenode to the namenode. Any files
created between the time the checkpoint was made and the namenode failure will be
lost. This action will work as long as the checkpoint has not been corrupted.
3. Use an archived checkpoint. Any files created between the checkpoint creation and
the namenode failure will be lost. This action will work as long as one good
Note that all the namenode information is kept in two files written directly to the Linux
If none of these actions work, the entire file system will be lost; this is why we place such
importance on backup creation. In addition to the normal preventative measures, the
following can be done:
1. Have hot-spare hardware available. HDFS is offline when the namenode is offline. If
the failure does not have an immediate, obvious cause, we recommend using the
standby hardware instead of prolonging the downtime by troubleshooting the issue.
2. Create a high-availability (HA) setup. It is possible, using DRBD and Heartbeat, to
completely automate recovery from a namenode failure using a primary-secondary
setup. This would allow services to automatically be restored on the order of a
minute with the loss of only the files which were currently being created at the time of
the failure. As most of the CMS tools automatically retry writing into the SE, this
means that the actual file loss would be minimal. These high-availability setups have
been used by external companies, but not yet done by a grid site; the extra setup
complexity is not yet perceived as worth the time.
Data components: Datanode loss is an expected occurrence in HDFS and there are multiple
layers of protection against this.
1. The first layer of protection is block replication. Hadoop has a robust block
replication feature that ensures that duplicate file blocks are placed on separate
nodes and even separate racks in the cluster. This helps ensure that complete copies
of every file are available in the case that a single node becomes unavailable, and
even if an entire rack becomes unavailable.
2. Hadoop periodically requests an entire block report from every datanode. This
protects against synchronization bugs where the namenode’s view of the datanode’s
contents is different from reality.
3. It is often possible for a bad hard drive to cause corruption issues without
failing outright. Each datanode schedules each block to be
checksummed once every 2 weeks. If the block fails the checksum, it will be deleted
and the namenode notified – allowing automatic healing from hard drive errors.
Grid components: Grid components require the least amount of failure planning because
they are stateless for the HDFS SE. Multiple instances of the grid components (GridFTP,
BeStMan SRM) can be installed and used in a failover fashion using Linux LVS or round-robin
We are aware of one significant failure mode. BeStMan SRM is known to lock up under very heavy
load (over 1000 concurrent SRM requests, which is at least 10-fold what has been observed in
production), and requires a restart when this happens. We believe this to be a problem with the
Globus Java container. BeStMan is scheduled to transition to a different container technology in
Fall 2009 in BeStMan 2. OSG is committed to following this issue and validating the new version
when it is released. However, this issue manifests itself only under extreme loads at the existing
deployments, and is thus presently not an operational problem.
In addition, there is a minor issue with the way HDFS handles deployments with server nodes that
serve space of largely varying sizes. HDFS needs to be routinely re-balanced, which is typically done
via crontab in extreme cases (e.g. Nebraska), and manually once every two weeks or so for
deployments where disk space differs by no more than a factor of 2-4 or so between nodes (e.g.
UCSD). There is no significant operational impact from routinely running the balancer beyond
network traffic; HDFS allows the site to throttle the per-node rate going to the balancer. This
imbalance is caused because the selection of servers for writing files is mostly random, while the
distribution of free space is not necessarily random.
Requirement 7. The SE must have a well-defined and reliable method of replicating files to protect
against the loss of any individual hardware system. Alternately there should be a documented
hardware requirement of using only systems that protect against data loss at the hardware level.
Hadoop uses block decomposition to store its data, breaking each file up into blocks of 64MB
apiece (this block size is a per-file setting; the default is 64MB, but most sites have increased this
to 128MB for new files). Each file has a replication factor, and the HDFS namenode attempts to
keep the correct number of replicas of each of the blocks in the file. The replication policy is:
a) First replica goes to the “closest datanode” – the local node is the highest priority, followed
by a datanode on the same rack, followed by any datanode.
b) Second replica goes to a random datanode on a different rack.
c) Third replica (if requested) goes to a random datanode on the second rack.
If two replicas are requested, they will end up on two separate racks; if three replicas are
requested, they will also end up on two separate racks.
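The placement policy above can be sketched in a few lines of Python. This is an illustrative model only, not HDFS code: the `Node` tuple, the `place_replicas` function, and the sample host and rack names are all hypothetical, and the sketch assumes the writing client runs on a datanode so that the "closest datanode" is the client itself.

```python
import random
from collections import namedtuple

# Hypothetical model of a datanode: a host name and the rack it sits in.
Node = namedtuple("Node", ["host", "rack"])


def place_replicas(client, nodes, replication, rng=random):
    """Pick target nodes following the policy described in the text.

    Assumption: the client is itself a datanode, so the first replica
    is written locally.
    """
    targets = [client]
    if replication >= 2:
        # Second replica: a random datanode on a different rack.
        off_rack = [n for n in nodes if n.rack != client.rack]
        targets.append(rng.choice(off_rack))
    if replication >= 3:
        # Third replica: another random datanode on the second replica's rack.
        second = targets[1]
        same_rack = [n for n in nodes if n.rack == second.rack and n != second]
        targets.append(rng.choice(same_rack))
    return targets


# Illustrative cluster: two racks with three datanodes each.
cluster = [Node("r%d-n%d" % (r, i), "rack%d" % r) for r in (1, 2) for i in (1, 2, 3)]
chosen = place_replicas(cluster[0], cluster, replication=3)
racks_used = {n.rack for n in chosen}
print([n.host for n in chosen], "racks used:", len(racks_used))
```

Whichever random nodes are drawn, both the two-replica and three-replica cases span exactly two racks, which is the property the paragraph above states.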
Replication level is set by the client at the time it creates the block. The replication level may be
increased or decreased by the admin at any time per-file, and can be done recursively. The
namenode attempts to satisfy the client’s request; as long as the number of successfully created
replicas is between the namenode’s configured minimum and maximum, HDFS considers the
write a success. Because the client requests a replication level for each write, one cannot set a
default replication for a directory tree.
For example, at Caltech, a cron script automatically sets the replication level on known directories
in order to ensure clients request the desired replication level. They currently use:
hadoop fs -setrep -R 3 /store/user
hadoop fs -setrep -R 3 /store/unmerged
hadoop fs -setrep -R 1 /store/data/CRUZET09
hadoop fs -setrep -R 1 /store/data/CRAFT09
It is not possible to pin files to specific datanodes, or to set replication based on the datanodes
where the files will be located. Hadoop treats all datanodes as equally unreliable.
The namenode keeps track of all block locations in the system, and will automatically delete
replicas or create new copies when needed. Replicas may be deleted either when the associated file
is removed or when the block has more replicas than desired. Datanodes send in a heartbeat signal
once every 3 seconds, allowing the namenode to keep up-to-date with the system status.
For example, when a datanode fails, the namenode allows it to miss up to 10 minutes of heartbeats
(this setting is configurable). Once it is declared dead, the namenode starts to make inter-node
transfer requests to bring the blocks that were on the datanode back up to the desired replication
level. This often is quite quick; all the desired replicas can be created in around an hour per TB
(the majority of the transfers go faster than that; a good portion of the hour is “transfer tails”).
When decommissioning is started, the system does not prefer to copy data off the decommissioned
node. Assuming the number of replicas is greater than one, the load of replication is distributed
randomly throughout the cluster. The fraction of transfers the decommissioned node will get is
(# of TB of files on datanode)/(# of TB in entire system). Because older nodes typically have a
smaller disk size, they will get comparatively less load.
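The timing and load figures in this paragraph can be turned into a small back-of-the-envelope calculator. This is a sketch under stated assumptions: the function names are hypothetical, the constants are the defaults quoted in the text (3-second heartbeats, 10-minute timeout, roughly one hour per TB), and the 6TB node and 342.64TB system used in the example are Caltech figures quoted elsewhere in this document, chosen purely for illustration.

```python
HEARTBEAT_INTERVAL_S = 3     # datanodes send a heartbeat every 3 seconds
DEAD_TIMEOUT_S = 10 * 60     # default grace period before a node is declared dead
HOURS_PER_TB = 1.0           # rule of thumb from the text: about an hour per TB


def is_dead(last_heartbeat_s, now_s, timeout_s=DEAD_TIMEOUT_S):
    """True once a datanode has been silent for longer than the timeout."""
    return (now_s - last_heartbeat_s) > timeout_s


def missed_heartbeats(timeout_s=DEAD_TIMEOUT_S, interval_s=HEARTBEAT_INTERVAL_S):
    """Number of heartbeats a node misses before it is declared dead."""
    return timeout_s // interval_s


def transfer_share(node_tb, system_tb):
    """Fraction of transfers sourced from one node: (TB on node) / (TB in system)."""
    return node_tb / system_tb


def rereplication_hours(lost_tb, hours_per_tb=HOURS_PER_TB):
    """Rough time to restore the desired replica count after losing lost_tb of data."""
    return lost_tb * hours_per_tb


# Illustrative numbers only: a 6TB node in a 342.64TB system.
print("heartbeats missed before dead:", missed_heartbeats())
print("share of transfers: %.2f%%" % (100 * transfer_share(6.0, 342.64)))
print("estimated re-replication time: %.0f h" % rereplication_hours(6.0))
```

The share works out to under 2% of transfers, which illustrates why a decommissioning node sees little extra load.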
If a dead node reappears (host rebooted/fixed, disk is physically moved to a different node), the
blocks it previously hosted will now be overreplicated. The namenode will then reduce the number
of replicas in the system, starting with replicas located on nodes with the least amount of free space
(as a percentage of total space on the node). If the blocks belonged to files which have been
deleted, the node will be instructed to delete them in the response to its block report.
The number of under-replicated blocks can be seen by viewing the system report using fsck or by
looking at the namenode’s Ganglia statistics.
The success of Hadoop's automatic block replication was seen when Caltech suffered a simultaneous
failure of 3 large (6TB) datanodes on the evening of Sunday, Jul. 12:
“Within about an hour of installing a faulty Nagios probe, 3 of the 2U datanodes had
crashed, all within minutes of each other. Each of the 2U datanodes hosts just over 6TB of
raid-5 data. Nagios started sending alarms indicating that we had lost our datanodes. We
had seen kernel panics caused by this nagios probe before, so we had no trouble locating the
cause of the problem. The immediate corrective action was to disable this Nagios probe.
This was done immediately to avoid further loss of datanodes.
Ganglia showed Hadoop reporting ~80k underreplicated blocks, and Hadoop started
replicating them after the datanodes failed to check in after ~10 minutes. The network
activity on the cluster jumped from ~200MB/s to 2GB/s. Since we run Rocks, 2 of the
machines went into a reinstall immediately after dumping the kernel core file. Within an hour
they were back up and running. Hadoop did not start automatically, however, due to a
Rocks misconfiguration. I started Hadoop manually on these two nodes, which caused our
underreplicated block count to drop from ~20k to ~4k. Within ~90 minutes the number of
underreplicated blocks was back to zero.
At this point we still had one datanode that had not recovered. I checked the system
console and discovered that the system disk had also died. Since it was late on a Sunday
evening, and we run 2x replication in Hadoop, I decided we could leave the datanode
offline and wait until the following morning to replace the disk.
After the replication was done, I ran “hadoop fsck /” to check the health of HDFS. To my
surprise, hadoop was reporting 40 missing blocks and 33 corrupted files. This seemed
strange because all of our files (aside from LoadTest data) are replicated 2x, so the
loss of a single datanode should not have caused the loss of any files (aside from LoadTest
data). After parsing the output of “hadoop fsck /”, I found that we had been accidentally
setting the replication on /store/relval to 1, instead of leaving it at the default of 2. This
The next morning we replaced the system disk in the dead datanode (bug #213) and waited
for it to reinstall (which took a few hours longer than it ought to have). Almost immediately
after starting Hadoop on this last datanode, hadoop fsck reported the filesystem was clean
again. By 2:15 the following afternoon everything had returned to normal and hadoop was
healthy again.”
Requirement 8. The SE must have a well-defined and reliable procedure for decommissioning
hardware which is being removed from the cluster; this procedure should ensure that no files are
lost when the decommissioned hardware is removed. This procedure should be tested and
The process of decommissioning hardware is documented in the Hadoop twiki under the Operations
guide. The process goes approximately like this:
1. Edit the hosts exclude file to exclude the to-be-decommissioned host from the cluster.
2. Issue the “refreshNodes” command in the Hadoop CLI to get the namenode to re-read the
file. The node should show up as “Decommissioning” in the web interface at this point.
3. Watch the web interface or the “report” command in the Hadoop CLI and wait until the
node is listed as “dead”.
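A site could script step 1 and keep the follow-up commands in one place. The sketch below is a hypothetical helper, not part of the packaged SE: the file path, host name, and function names are illustrative, and it assumes the "refreshNodes" and "report" commands are the `hadoop dfsadmin -refreshNodes` and `hadoop dfsadmin -report` subcommands. It only edits a file and prints the commands; it does not contact a cluster.

```python
import os
import tempfile


def add_to_exclude_file(path, host):
    """Add host to the exclude file unless it is already listed.

    Returns True if the file was changed, False if the host was present.
    """
    hosts = []
    if os.path.exists(path):
        with open(path) as handle:
            hosts = [line.strip() for line in handle if line.strip()]
    if host in hosts:
        return False
    hosts.append(host)
    with open(path, "w") as handle:
        handle.write("\n".join(hosts) + "\n")
    return True


def decommission_commands():
    """Commands for steps 2 and 3, to be run by the admin."""
    return [
        ["hadoop", "dfsadmin", "-refreshNodes"],  # namenode re-reads the exclude file
        ["hadoop", "dfsadmin", "-report"],        # repeat until the node has drained
    ]


# Demonstration against a temporary file rather than a real exclude file.
demo_path = os.path.join(tempfile.mkdtemp(), "hosts_exclude")
changed_first = add_to_exclude_file(demo_path, "node042.example.org")
changed_again = add_to_exclude_file(demo_path, "node042.example.org")
print(changed_first, changed_again)
for cmd in decommission_commands():
    print(" ".join(cmd))
```

Making the file edit idempotent means the helper can be re-run safely if the procedure is interrupted.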
This process is not only straightforward, but a very routine process at each site. Decommissioning
is done whenever a node needs to be taken offline for any upgrade lasting more than 10 minutes at
Requirement 9. The SE must have well-defined and reliable procedure for site operators to
regularly check the integrity of all files in the SE. This should include basic file existence tests as
well as the comparison against a registered checksum to avoid data corruption. The impact of this
operation (e.g. load on system) should be documented.
Hadoop’s command line utility allows site admins to regularly check the file integrity of the system.
The check is run with “hadoop fsck /”. At the end of the output, it will either say the file system
is “HEALTHY” or “CORRUPTED”. If it is corrupted, it provides the outputs necessary to repair
or remove broken files.
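For automated monitoring, the final status can be extracted from captured fsck output. The following is a minimal sketch: the `fsck_status` function is hypothetical, the two sample outputs are made up for illustration, and the exact wording of real fsck output may differ between Hadoop versions, so the check only looks for the status keywords near the end of the text.

```python
def fsck_status(fsck_output, tail=5):
    """Classify fsck output by scanning its last few non-empty lines.

    Returns "HEALTHY", "CORRUPT", or "UNKNOWN" if neither keyword is found.
    Matching on "CORRUPT" also covers a status printed as "CORRUPTED".
    """
    lines = [line.strip() for line in fsck_output.splitlines() if line.strip()]
    for line in reversed(lines[-tail:]):
        if "CORRUPT" in line:
            return "CORRUPT"
        if "HEALTHY" in line:
            return "HEALTHY"
    return "UNKNOWN"


# Made-up sample outputs; real output would come from running, for example,
# subprocess.run(["hadoop", "fsck", "/"], capture_output=True, text=True).stdout
sample_ok = "...per-file details...\nThe filesystem under path '/' is HEALTHY\n"
sample_bad = "...per-file details...\nThe filesystem under path '/' is CORRUPTED\n"

print(fsck_status(sample_ok), fsck_status(sample_bad))
```

Returning "UNKNOWN" rather than guessing lets a cron job or Nagios probe treat truncated or unexpected output as its own alert condition.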
HDFS registers a checksum at the block level within the block’s metadata. HDFS automatically
schedules background checksum verifications (default is to have every block scanned once every 2
weeks) and automatically invalidates any block with the incorrect checksum. The checksumming
interval can be adjusted downward at the cost of increased background activity on the cluster. We
do not currently have statistics on the rate of failures avoided by checksumming.
Whenever a file is read by a client (even partially – checksums are kept for every 4KB), the client
receives both data and checksum and computes the validity of the data on the client side. Similarly,
when a block is transferred (for example, through rebalancing), the checksum is computed by the
receiving node and compared to the sender’s data.
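The idea of per-chunk verification on the receiving side can be illustrated as follows. This is a sketch, not the HDFS implementation: CRC32 from Python's `zlib` is used as a stand-in checksum, the 4KB chunk size is the figure quoted above, and the function names and sample data are hypothetical.

```python
import zlib

CHUNK_SIZE = 4 * 1024  # the text quotes one checksum per 4KB of data


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Compute one CRC32 per fixed-size chunk of data."""
    return [zlib.crc32(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def first_bad_chunk(data, checksums, chunk_size=CHUNK_SIZE):
    """Return the index of the first chunk whose checksum differs, or None."""
    actual = chunk_checksums(data, chunk_size)
    if len(actual) != len(checksums):
        raise ValueError("checksum count does not match data length")
    for index, (got, want) in enumerate(zip(actual, checksums)):
        if got != want:
            return index
    return None


block = bytes(range(256)) * 64     # 16KB of sample data, i.e. four chunks
sums = chunk_checksums(block)      # what the sender would ship alongside the data
damaged = bytearray(block)
damaged[5000] ^= 0xFF              # flip bits inside the second chunk
print(first_bad_chunk(block, sums), first_bad_chunk(bytes(damaged), sums))
```

Because checksums are kept per chunk, a partial read only needs the checksums for the chunks it touches, which is what makes verification of partial reads possible.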
Note about catastrophic loss:
We have emphasized that with 2 replicas, file loss is very rare due to:
Failures separated in time (>2 hours between failures) cause little to no loss because the
re-replication in the file system is extremely fast; one guidance is to expect 1TB per hour to be
Multi-disk failures happening within an hour usually are due to some common piece of
equipment (such as the rack switch or PDU). Rack-awareness prevents an entire rack
disappearing from causing file loss.
However, what happens if we make the assumption that all safeguards are bypassed and 2 disks are
lost? This is not without precedent; at Caltech, a misconfiguration told Hadoop that 2 nodes on the
same rack were on different racks. This bypassed the normal protections from rack awareness. The
rack’s PDU failed and two disks failed to come back up. Caltech lost 54 file blocks.
Using the binomial distribution, the expected number of blocks lost is:
(# of blocks lost) = (# of blocks) * P(single block loss)
The binomial distribution is appropriate because the loss of one block does not affect the probability
of another block loss. The standard deviation is approximately the square root of the expected
number of blocks lost. The probability of a single block loss is
P(single block loss) = P(block on node 1) * P(block on node 2)
The probability a block is on a given node is approximately:
P(block on node 1) = (replication level)*(size of node)/(size of HDFS)
assuming that the cluster is well-balanced and blocks are randomly distributed. Both assumptions
appear to be safe in currently-deployed clusters.
Plugging in Caltech’s numbers (1,540,263 blocks; 342.64 TB in the system, each lost disk was
1TB), the expected number of lost blocks was 52.4 with a standard deviation of 7.2. This is
strikingly close to the actual loss, 54 blocks.
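The estimate can be reproduced with a few lines of Python. This is a sketch of the document's own approximation (well-balanced cluster, randomly distributed blocks, independent placement on the two failed nodes); the function name is hypothetical, and the replication level of 2 is taken from the surrounding discussion of files with 2 replicas.

```python
import math


def expected_block_loss(n_blocks, replication, node1_tb, node2_tb, system_tb):
    """Expected number of lost blocks, and its standard deviation, when two nodes fail.

    Follows the formulas in the text:
      P(block on node) = replication * (size of node) / (size of HDFS)
      P(single block loss) = P(block on node 1) * P(block on node 2)
    """
    p_node1 = replication * node1_tb / system_tb
    p_node2 = replication * node2_tb / system_tb
    p_loss = p_node1 * p_node2
    expected = n_blocks * p_loss
    # Binomial standard deviation; close to sqrt(expected) because p_loss is tiny.
    std_dev = math.sqrt(n_blocks * p_loss * (1.0 - p_loss))
    return expected, std_dev


# Caltech's numbers from the text: 1,540,263 blocks, 342.64TB, two 1TB disks lost.
expected, std_dev = expected_block_loss(1540263, 2, 1.0, 1.0, 342.64)
print("expected loss: %.1f blocks, standard deviation: %.1f" % (expected, std_dev))
```

The result agrees with the figures quoted above to within rounding, and the observed loss of 54 blocks lies well within one standard deviation of it.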
If only complete files were written (i.e., no block decomposition), then the expected loss would be
(# of files lost) = (# of files) * P(single block loss)
So, assuming 128MB blocks and experimental files of around 1GB, the number of files lost would be
roughly 10x lower. In other words, a CMS site would lose roughly 10x more files using HDFS than
it would with whole-file storage. We believe this is an
acceptable risk, especially as the recovery procedure for 5 files versus 50 files is similar. In the
case of simultaneous triple-disk-failure on triple-replicated files, the expected loss would be less
than 1 file for Caltech’s HDFS instance.
Requirement 10. The SE must have well-defined interfaces to monitoring systems such as Nagios
so that site operators can be notified if there are any hardware or software failures.
HDFS integrates with Ganglia; provided that the site admin points HDFS to the right Ganglia
endpoint, many relevant statistics for the namenode and datanodes appear in the Ganglia gmetad
webpages. Many monitoring and notification applications can set up alerts based on this.
Caltech has also contributed several HDFS-Nagios plugins to the public that monitor various
aspects of the health of the system directly. They have released a Tcl-based desktop application,
“gridftpspy”, which monitors the health and activity of the Globus GridFTP servers. Some of these
are based on the JMX (Java Management eXtensions) interface into HDFS. JMX can integrate with
a wider range of monitoring systems. There is also an external project providing Cacti templates for
monitoring HDFS. The Nagios and gridftpspy components are packaged in the Caltech yum
repository, but not officially integrated; we foresee labeling them experimental for the OSG-
supported first release.
Finally, Caltech has developed the “Hadoop Chronicle”, a nightly email that sends administrators
the basic Hadoop usage statistics. This has an appropriate level of detail to inform site executives
about Hadoop’s usage. The Hadoop Chronicle is now part of the OSG Storage Operations toolkit.
This is currently in use at Caltech and in testing at Nebraska.
Note about admin intervention:
The previous two requirements start to cover the topic of “what HDFS activities do site admins
engage in?” and at what interval. We have the following feedback from Nebraska and Caltech site
Nebraska: Formatted: Bullets and
o Daily tasks: Check Hadoop Chronicle, look at RSV monitoring
o Once a week: Clean up dead hardware, restart dead components. The component
which crashes most often is BeStMan at about once every 2 weeks.
o Once every 2 months: Some sort of data recovery or in-depth maintenance.
Examples include debugging an underreplicated block or recovering a corrupted file.
Caltech (note: Caltech runs an experimental kernel, which may explain why there is
more kernel-related maintenance than at Nebraska):
o Continuously: wait for Nagios alerts
o Hourly tasks: Check namenode web pages and gridftp logs via gridftpspy (admittedly a
o Daily tasks: Read Hadoop Chronicle, browse PhEDEx rate/error pages
o Weekly tasks: reboot nodes due to kernel panic, adjust gridftp server list (BeStMan
plugin currently not used), track down lost blocks (for datasets replicated once),
maintain ROCKS configuration.
o Once a month: Reboot namenode with new kernel, reinstall data nodes with bugfix
6. Performance of the SE
All aspects of performance must be documented.
Requirement 11. The SE must be capable of delivering at least 1 MB/s/batch slot for CMS
applications such as CMSSW. If at all possible, this should be tested in a cluster on the scale of a
current US CMS Tier-2 system.
To test this requirement, Caltech ran a test using dd to read from HDFS through the FUSE mount on
each of the 89 worker nodes on the Tier2 cluster. dd was used to maximize the throughput from
the storage system. We acknowledge that the IO characteristics of dd are not identical to those of
CMSSW applications, which tend to read smaller chunks of data in random patterns. Each worker
node ran 8 dd processes in parallel, one per core. Each dd process/batch slot on a single worker
node read a different 2.6GB file from HDFS 10 times in sequence. The same 8 files were read from
each of the 89 worker nodes. At the end of each file read, dd reported the rate at which the file
was read. A total of 18.1TB was read during this test. The final dd was finished approximately 4.25
hours after the test was started.
The average read rate reported by dd was 2.3MB/s ± 1.5MB/s. The fastest read was 22.8MB/s
and the slowest was 330KB/s.
The aggregate rate delivered from HDFS was 18.1TB/4.25hours = 1238MB/s, or approximately
155MB/s (roughly 1Gbps) per HDFS file, since the same 8 files were read throughout. The test was
intended mainly to show how the system behaves under the 'hot file' problem. As such, this test
shows that HDFS can deliver even 'hot files' to the batch slots at the required rates.
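The arithmetic behind these figures can be cross-checked with a short sketch. The inputs are the numbers quoted in the test description; the use of binary units (1TB = 1024GB, 1GB = 1024MB) is an assumption, chosen because it reproduces the 18.1TB total, and the variable names are illustrative only.

```python
# Figures quoted in the description of the Caltech dd test.
worker_nodes = 89
slots_per_node = 8       # one dd process per core
file_size_gb = 2.6       # size of each file read
reads_per_slot = 10      # each dd process read its file 10 times
duration_hours = 4.25
distinct_files = 8       # the same 8 files were read on every worker node

total_slots = worker_nodes * slots_per_node
total_gb = total_slots * file_size_gb * reads_per_slot

# Assumption: binary units (1 TB = 1024 GB, 1 GB = 1024 MB).
total_tb = total_gb / 1024
duration_s = duration_hours * 3600
aggregate_mb_s = total_gb * 1024 / duration_s

per_file_mb_s = aggregate_mb_s / distinct_files   # load on each 'hot' file
per_slot_mb_s = aggregate_mb_s / total_slots      # compare with the 1 MB/s requirement

print(f"total read:     {total_tb:.1f} TB")
print(f"aggregate rate: {aggregate_mb_s:.0f} MB/s")
print(f"per HDFS file:  {per_file_mb_s:.0f} MB/s")
print(f"per batch slot: {per_slot_mb_s:.2f} MB/s")
```

Under these assumptions the aggregate rate comes out within a few MB/s of the quoted 1238MB/s. The per-slot value is a wall-clock average over the whole test, so it is lower than the 2.3MB/s mean of the individual dd reports, but it still exceeds the 1MB/s requirement.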
It should be noted that this test was run while the cluster was also 100% full with CMS
production and CMS analysis jobs, most of which were also reading from and writing to
Hadoop at the same time. The background HDFS traffic from this CMS activity was not included in
UCSD ran a separate test with a standard CMSSW application consuming physics data. The same
application has been used for computing challenges and scalability exercises. The application is very
I/O intensive. Here we mainly focused on the application reading the data that is located locally in
During the tests, 15 datanodes held the data files, each 1GB in size. The block size in
UCSD's Hadoop is 128 MB. The replication of the data files is set to 2. Each file therefore has 16
blocks (8 blocks, each stored twice), well distributed across all the datanodes. The application
was configured to run against 1 file or 10 files per job slot. The number of jobs running
simultaneously ranged from 20 to 200. The maximum number of jobs running simultaneously was
250, which is roughly a quarter of the available job slots at UCSD at that time. The rest of the
slots were running production or user analysis jobs, so the test ran under very typical Tier-2
conditions. The test application itself did not significantly change the overall condition of the cluster.
The ratio of average job slots running the tests to the number of Hadoop datanodes ranged from
10 to 20. Eventually this ratio will be 8 if all the WNs are configured as Hadoop datanodes, and each
WN runs 8 slots. This will increase the I/O capability per job slot by 50-100% over the results we
measured in the test.
The average processing time per job is 200 and 4000 seconds for the application processing 1 and 10
GB of data respectively. The average I/O rates for reading the data are shown in the following
figure, for the application consuming 1GB (left) and 10 GB (right). The test shows the 1MB/s per slot
requirement is at the low end of the rate that is actually delivered by HDFS. The average is
~2-3 MB/s per job.
<It would be nice to have a comparison of CPU/wall clock time for the same typical CMS application
running on a data set in dCache vs one in hadoop. Maybe we can quickly do this at UCSD and/or
Requirement 12. The SE must be capable of writing files from the wide area network at a
performance of at least 125MB/s while simultaneously writing data from the local farm at an average
rate of 20MB/s.
[TODO: Can Caltech provide the appropriate Ganglia/PhEDEx plots showing this simultaneously
[Response from Mike: Our gridftp servers can run at up to ~300MB/s each in a controlled
environment. But the memory usage from reordering the gridftp-hdfs streams is preventing us from
getting decent rates at the moment]
[Comment from fkw: I recall James showing me plots that easily beat this number. He’s on vacation
right now. Let’s look at this when he’s back. Unless you guys know how to dial up the LoadTest to
UCSD, and are willing to fix things if we break them while James is gone. I believe all of our
LoadTest is running on HDFS.]
Below is a graph for the Nebraska worker node cluster.
During this time, HDFS was servicing user requests at a rate of about 2500/sec (as determined by
syslog monitoring using the HadoopViz application). Each user request is a minimum of 32KB, so
this is at least 80MB/s of internal traffic. At the same time, we were writing in excess of 100MB/s as
measured by PhEDEx.
Below is an example of HDFS serving data to a CRAB-based analysis launched by an external user.
At the time (December 2008), the read-ahead was set to 10MB. This provided an impressive
amount of network bandwidth (about 8GB/s) to the local farm, but is not an everyday occurrence.
The currently recommended read-ahead size is 32KB.
Requirement 13. The SE must be capable of serving as an SRM endpoint that can send and
receive files across the WAN to/from other CMS sites. The SRM must meet all WLCG and/or CMS
requirements for such endpoints. File transfer rates within the PhEDEx system should reach at least
125MB/s between the two endpoints for both inbound and outbound transfers.
Nebraska currently uses HDFS to serve a majority of its files; it is used for the CMS site readiness
plots, which Nebraska passes. This demonstrates HDFS passes CMS requirements for such
endpoints.
[TODO: Do we have load test plots for Caltech/UCSD/Nebraska transfers?]
[Comment fkw: Yes, but they are not very impressive because they are dialed to the minimum.]
During Aug. 20-24, Caltech and Nebraska ran inter-site load tests using PhEDEx to exercise the
During this time period, PhEDEx recorded a 48-hour average of 171MB/s coming into the Caltech
Hadoop SE, with files primarily originating from UNL. Peak rates of up to 300MB/s were observed.
There was a temporary drop to zero at ~23:00 Aug. 24 due to an expired CERN CRL.
During this same time period, Caltech was exporting files at an average rate of 140MB/s, with files
primarily destined for UNL. For several hours during this time period the transfer rates exceeded
It must be noted that the PhEDEx import/export load tests were not run in isolation. While these
PhEDEx load tests were running, Caltech was downloading multi-TB datasets from FNAL, CNAF,
and other sites with an average rate of 115MB/s and peaks reaching almost 200MB/s.
UCSD has additionally been working on an in-depth study of the scalability of BeStMan, especially
at different levels of concurrency. The graph below shows how the effective processing rate has
scaled with the increasing number of concurrent clients.
The operation used was srmLs without full details; this causes a “stat” operation on the file system,
but reduces the amount of XML generated by the BeStMan server. This demonstrates processing
rates well above the levels currently needed for USCMS. It is sufficient for high-rate transfers of
gigabyte-sized files and uncontrolled chaotic analysis.
7. Site-specific Requirements
Note: We believe the requirements set out here cover a subset of the functionality required at a
CMS T2 site. We believe that the better test has been putting the storage elements into
production at several sites – the combination of all activities and chaotic loads appears to be better
than artificial tests. An additional test that we recommend below is replacing the skims (which are
bandwidth-heavy and IOPS-light, unlike most T2 activities) with a few analysis jobs (which are
bandwidth-medium and IOPS-heavy).
Note: We've done the best we could without owning more storage (by the end of 2010, each site will
probably double in size). We believe we have demonstrated that the potential bottlenecks (the
namenode) scale out for what we'll need in the next three years. As long as the ratio of cores to
usable terabytes stays on the order of 1 to 1 and not 1 to 10, we believe IOPS will scale as
demonstrated. We believe the fact that Yahoo has demonstrated multi-petabyte clusters shows the
number of raw terabytes will scale.
Note: We believe that the architectures deployed at the current T2 sites (UCSD, Caltech,
and Nebraska) can be repeated at others – in particular, any site that does not rely entirely on a
small number of RAID arrays. It is applicable for sites having issues with reliability or site admin
Requirement 14. A candidate SE should be subject to all of the regular, low-stress tests that are
performed by CMS. These include appropriate SAM tests, job-robot submissions, and PhEDEx load
tests. The SE should pass these tests 80% of the time over a period of two weeks. (This is also the
level needed to maintain commissioned status.)
The below chart shows the status of the site commissioning tests from CMS, which is a combination
of all the regular low-stress tests performed.
Additionally, Caltech's use of a Hadoop SE has maintained a 100% Commissioned site status for the
two weeks prior to Aug. 17:
Requirement 15. The new storage element should be filled to 90% with CMS data. These datasets
should be chosen such that they are currently "popular" in CMS and will thus attract a significant
number of user jobs. Failures of jobs due to failure to open the file or deliver the data products from
the storage systems (as opposed to user error, CE issues, etc.) should be at the level of less than 1
in 10^5.
A suggested test would be a simple "bomb" of scripts that repeatedly opens random files and
reads a few bytes from them with a high parallelism; for the 10^5 test, it's not necessary to
do it through CMSSW or CRAB. An example would be to have 200 worker nodes open 500
random files each and read a few bytes from the middle of the file.
This was performed using the “se_punch.py” tool found in Nebraska’s se_testkit. There were no file
access failures. This script implemented the suggested test – all worker nodes in the Nebraska
cluster simultaneously started opening random files and reading a few bytes from the middle of each.
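The suggested test can be sketched as follows. This is not the se_punch.py tool itself, only a minimal illustration of the same idea under stated assumptions: the SE is reachable through a POSIX-like mount (for example the FUSE mount), and the function names, sample size, and thread count are hypothetical.

```python
import os
import random
from concurrent.futures import ThreadPoolExecutor


def read_middle(path, nbytes=16):
    """Open one file and read a few bytes from its middle; return True on success."""
    try:
        size = os.path.getsize(path)
        with open(path, "rb") as handle:
            handle.seek(size // 2)
            data = handle.read(nbytes)
        # An empty read is only acceptable for an empty file.
        return len(data) > 0 or size == 0
    except OSError:
        return False


def punch(root, samples=500, workers=16, seed=None):
    """Sample files under root (with replacement); return (attempts, failures)."""
    paths = [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    ]
    if not paths:
        return 0, 0
    rng = random.Random(seed)
    chosen = [rng.choice(paths) for _ in range(samples)]
    # Threads give parallel opens from a single node; a real test would also
    # start this on many worker nodes at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(read_middle, chosen))
    return len(chosen), results.count(False)
```

A call such as `punch("/mnt/hadoop", samples=500)` (the mount point is a placeholder) on each of 200 worker nodes would give the 10^5 opens suggested in the requirement; the failure count summed over all nodes is the quantity to compare against the 1 in 10^5 limit.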
Nebraska is now working on a script utilizing PyROOT (which is distributed with CMSSW) that
opens all files on the SE with ROOT. This not only verifies files can be opened, but demonstrates a
minimal level of validity of the contents of the file. Opening with ROOT should fully protect against
truncation (as the metadata required to open the file is written at the end of the file) and whole-file
corruption. It does not detect corruptions in the middle of the file, but built-in HDFS protections
should detect these.
Nebraska ran with HDFS over 90% full during May 2009 and encountered no significant problems
other than writes failing when all space was exhausted. Caltech also experienced some
corrupted blocks when HDFS was filled to 96.8% and certain datanodes reached 100% capacity.
Some combination of failed writes, rebalancing, and failing disks resulted in two corrupted blocks
and two corrupted files. These files had to be invalidated and retransferred to the site. This is the
only time that Caltech has lost data in HDFS since putting it into production 6 months ago. There
are a few recommendations to help avoid this situation in the future:
1) Run the balancer often enough to prevent any datanode from reaching 100%
2) Don't allow HDFS to fill up enough that an individual datanode partition reaches 100%
3) If using multiple data partitions on a single datanode, make them of equal size, or merge
them into a single RAID device so that Hadoop sees only a single partition.
Future versions of Hadoop (0.20) have a more robust API to help manage datanode partitions that
have been completely filled to 100%.
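Recommendations (1) and (2) depend on noticing a nearly full datanode partition before it reaches 100%. A minimal sketch of such a check is given below; it is not part of HDFS or of the Caltech Nagios plugins, and the partition paths and the 95% threshold are hypothetical values that a site would replace with its own.

```python
import shutil


def fill_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total


def partitions_over(paths, threshold=0.95):
    """Return (path, fraction) pairs for partitions filled above the threshold."""
    report = []
    for path in paths:
        fraction = fill_fraction(path)
        if fraction > threshold:
            report.append((path, fraction))
    return report
```

On a datanode this could be run from cron or wrapped as a monitoring probe, for example `partitions_over(["/hadoop/data1", "/hadoop/data2"])` with placeholder paths; a non-empty result would be the signal to run the balancer or stop filling the SE.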
Caltech has been running with HDFS over 90% full for the past week and has not observed any problems
that were not already present when the space usage was closer to 50%. During this > 90% usage
period, several datanodes filled up to 100%, but were still able to serve data to the cluster.
Requirement 16. In addition, there should be a stress test of the SE using these same files. Over
the course of two weeks, priority should be given to skimming applications that will stress the IO
Specific CMS skim workflows were run at Nebraska on June 6. However, the results of these were
not interesting as the workflows only lasted 8 hours (no significant failures occurred).
However, the “stress” of the skim tests is far less than the stress of user jobs (especially PAT-
based analysis) due to the number of active branches in ROOT; see CMS Internal Note 2009-18.
Many active branches in ROOT result in a large number of small reads; a CMS job on an idle system
will read typically no more than 32KB per read and achieve 1MB/s. Hence, 1000 jobs will achieve
30,000 IOPS if they are not bound by the underlying disk system. Because the HDFS installs have
relatively high bandwidth due to the large number of data nodes, but the same number of hard
drives as other systems, bandwidth is usually not a concern while I/O operations per second (IOPS)
is. See the below graphs demonstrating a large number of IOPS; even at the max request rate, the
corresponding bandwidth required is only 5Gbps. For the hard drives deployed at the time the
graph was generated, this represented about 60 IOPS per hard drive, which matched independent
benchmarks of the hard drives. The bandwidth usage of 5Gbps represents only a fraction of the
bandwidth available to HDFS.
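The IOPS estimate in this paragraph follows from simple arithmetic, sketched below with the figures from the text. Taking 1MB as 1024KB is an assumption; with decimal units the result is about 31,000 instead, and either way it rounds to the 30,000 IOPS quoted.

```python
# Figures from the text: each job reads at most 32KB per request and
# achieves about 1MB/s on an idle system.
jobs = 1000
rate_per_job_mb_s = 1.0
read_size_kb = 32

# Assumption: 1 MB = 1024 KB.
reads_per_job_per_s = rate_per_job_mb_s * 1024 / read_size_kb
total_iops = jobs * reads_per_job_per_s

print(f"reads per job per second: {reads_per_job_per_s:.0f}")
print(f"aggregate IOPS for {jobs} jobs: {total_iops:.0f}")
```

Because 32KB is an upper bound on the typical read size, smaller reads at the same data rate would push the request rate higher, which is why IOPS rather than bandwidth is the limiting resource.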
Because HDFS approaches the underlying hardware limits of the system during production, we
consider typical user jobs to be the best stressor of the system. Such “stress tests” occur in large
batches on a weekly basis at both Nebraska and Caltech. During the tests in this requirement and
others, Nebraska and Caltech’s systems were in full production for CMS – simulation, analysis, and
WAN transfer – and often the batch slots were 100% utilized. By default, data went to HDFS and
only a few datasets were kept on dCache. UCSD’s system was smaller and shared the CMS
activities with a dCache instance.
Requirement 17. As part of the stress tests, the site should intentionally cause failures in various
parts of the storage system, to demonstrate the recovery mechanisms.
As noted in Requirement 16, an HDFS instance in large-scale production is sufficient for
demonstrating stress. During production at Nebraska and at Caltech, we have observed failures of
the following components:
Namenode: When a namenode dies, the only currently used recovery mechanism is to replace
the server (or fix the existing server) and copy a checkpoint file into the appropriate
directory. A high-availability setup has not yet been investigated by our production sites,
mostly due to the perceived complexity for little perceived benefit (namenode failure is rare).
This has been demonstrated in production at Nebraska and Caltech. When the namenode
fails, writes will not continue and reads will fail if the client had not yet cached the block
locations for open files.
Datanode: Datanode failures are designed to be an everyday occurrence, and they have
indeed occurred at both Nebraska and Caltech. The largest operational impact is the
amount of traffic generated by the system while it is re-replicating blocks to new hosts.
Globus GridFTP servers: Each transfer is spawned as a separate process on the host by
xinetd. This results in the server being extremely reliable in the face of failures or bugs in
the GridFTP server. When a GridFTP host dies, others may be used by SRM. Nebraska
and UCSD have implemented schemes where the SRM server stops sending new transfers to
a failed GridFTP server. Caltech has also implemented a GridFTP appliance integrated with the
Rocks cluster management software that can be used to install and configure a new GridFTP
server in 10 minutes.
SRM server: When the SRM server fails, all SRM based transfers will fail until it has been
restarted manually (the service health is monitored via RSV). This happens infrequently
enough in production that no automated system has been implemented, although LVS-based
failover and load-balancing is plausible because BeStMan is stateless. Caltech has
implemented a BeStMan appliance integrated with the Rocks cluster management software
that can be used to install and configure a new BeStMan server in 10 minutes.
8. Security Concerns
HDFS has Unix-like user/group authorization, but no strict authentication. HDFS should only be
exposed to a secure internal network which only non-malicious users are able to access. For users
with unrestricted access to the local cluster, it is not difficult at all to bypass authentication. There
is no encryption or strong authentication between the client and server, meaning that one must
have both a trusted server and a trusted client. This is the primary reason why HDFS must be
segregated onto an internal network.
It is possible to reasonably lock down access by:
1. Preventing unknown or untrusted machines from accessing the internal network. This
requirement can be removed by turning on SSL sockets in lieu of regular sockets for
inter-process communication. We have not pursued this method due to the perceived
a. “Untrusted machines” include end users’ laptops and desktops, which should not
access HDFS directly. Such access could instead be allowed via Xrootd redirectors (for ROOT-
based analysis) or exporting the file system via HTTPS (allowing whole-file download).
2. Preventing non-FUSE users from accessing HDFS ports on the known machines on the network.
This will mean only the HDFS FUSE process will be able to access the datanodes and
namenode; this allows the Linux filesystem interface to sanitize requests and prevents users
from TCP-level access to HDFS.
It’s important to point out that in (2), we are relying on the security of the clients on the network.
If a host is compromised at the root-level, the attacker can perform any arbitrary action with
sufficient effort. During the various tests outlined above, the sites’ security was based on either
the internal NAT (Caltech and Nebraska) or firewalls eliminating access to the outside world
Security concerns are actively being worked on by Yahoo. The progress can be followed on this
master JIRA issue:
In release 0.21.0, access tokens issued by the namenode prevent clients from accessing arbitrary
data on the datanode (currently, one only needs to know the block ID to access it). Also in 0.21.0,
the transition to the Java Authentication and Authorization Service has begun; this will provide the
building blocks for Kerberos-based access (Yahoo’s eventual end goal). Judging by current
progress, transitioning to Kerberos-based components could happen during 2010.
If a vulnerability is discovered, we would release updated RPMs within one workweek (sooner if the
packaging is handled by the VDT). This probably will not be necessary as the security model is
already very permissive. Security vulnerabilities are one of the few reasons we will update the
“golden set” of RPMs.
Note: Example damage a rogue batch job could do
To demonstrate the security model, we give a few examples of what a rogue job could do:
Excessive memory usage by the rogue job could starve the datanode process and cause it to
crash. Most sites limit the amount of memory allowed for individual batch jobs, so this is not
a big concern.
If the rogue job has write access to the datanode partition, then it could fill up the partition
with garbage which would prevent the datanode from writing any further blocks. This will not
cause the datanode to fail, but will cause a loss of usable space in the SE.
o Most sites use Unix file system permissions to prevent this.
A malicious batch job with telnet access to the Hadoop datanode could request any block of
data if it knows the block ID. This is fixed in the HDFS 0.21.0 branch (to be released
approximately in November).
A malicious batch job with telnet access to the Hadoop namenode could perform arbitrary file
system commands. This could result in a lot of damage to the storage system, which is why we
recommend client-side firewalls.
o This is a known weakness in the current security model and is being addressed in
current Hadoop development.
Grid Components (GridFTP and BeStMan)
Globus GridFTP and BeStMan both use standard GSI security with VOMS extensions; we assume
this is familiar to both CMS and FNAL. Because both components are well-known, we do not
examine their security models here.
If a vulnerability is discovered in any of these components, we would release an RPM update once our
upstream source (the VDT) has this update. The target response time would be one workweek
while packaging is done at Caltech, and in lockstep with the VDT update when that team does
9. Risk Analysis
In this section, we analyze different risks that are posed to the different pieces of the HDFS-based
SE. We attempt to present the most pressing risks in the proposed solution (both technical and
organizational), and point out any mitigating factors.
HDFS is both the core component and a component external to grid computing. Hence, its risk
must be examined most closely.
1. Health of Hadoop project: HDFS is completely dependent on the existence and continued
maintenance of the Hadoop project. Continued development and growth of this project is
critical. Hadoop is a top-level project of the Apache Software Foundation; in order to
achieve this status, the following requirements were necessary:
a. Legal
i. All code ASL'ed (Apache Software License, a highly permissive open-source
ii. The code base must contain only ASL or ASL-compatible
iii. License grant complete.
iv. Contributor License Agreement on file.
v. Check of project name for trademark issues.
This legal legwork protects us from code licensing issues and various other legal
b. Meritocracy / Community
i. Demonstrate an active and diverse development community
ii. The project is not highly dependent on any single contributor (there
are at least 3 legally independent committers and there is no single company
or entity that is vital to the success of the project)
iii. The above implies that new committers are admitted according to ASF
iv. ASF style voting has been adopted and is standard practice
v. Demonstrate ability to tolerate and resolve conflict within the
vi. Release plans are developed and executed in public by the community.
vii. ASF Board for a Top Level Project, has voted for final acceptance.
The ASF has shown that these community guidelines and requirements are hallmarks
of a good open source project.
The fact that HDFS is an ASF project and not a Yahoo corporation project means that it is
not tied to the health of Yahoo. The current HDFS lead is employed by Facebook, not
Yahoo. At this point in the project’s life, about 40% of the patches come from non-Yahoo
employees. Regarding the recent switch to Microsoft as the company’s search engine
provider, Yahoo has made public statements that:
Hadoop is used for almost every piece of the Yahoo infrastructure, including: spam
fighting, ads, news, and analytics.
Hadoop is critical to Yahoo as a company, and is not a subproject of the search
engine. It is possible that money previously invested into the search engine
technology will now be invested into Hadoop.
Cloudera has received about $16 million in start-up capital and employs several key
developers, including Doug Cutting, the original author of the system. Hadoop maintains a
listing of web sites and companies utilizing its technology,
Condor currently funds a developer working on Hadoop, and is investigating the use of
HDFS as a core component.
While we believe these reasons mitigate the risk of HDFS development becoming stagnant,
we believe this is the top long-term risk associated with the project.
2. Hadoop support / resolution of bugs: There is no direct monetary support for large-scale
HDFS development, nor is the success of HDFS dependent upon WLCG usage. We have no
paid support for HDFS (although it can be purchased). This is mitigated by:
a. Paid support is available: We have good contacts with the Cloudera technical staff,
and would be able to purchase development support as needed. Several project
committers are on Cloudera staff.
b. Critical bugs affect large corporations: Any bug we are exposed to affects Yahoo and
Facebook, whose businesses depend on HDFS. Hence, any data loss bug we discover
will be of immediate interest to their development teams. When Nebraska started
with HDFS, we had issues with blocks truncated by ext3 file system recovery. This
triggered a long investigation by a member of the Yahoo HDFS team, resulting in
many patches for 0.19.0. Since that version, we have not seen the truncation issue
c. Acceptance of patches: Nebraska has contributed on the order of 5 patches to HDFS,
and has not had issues with getting patches accepted by the upstream project. The
major issue has been passing the acceptance criteria – each patch must meet coding
guidelines, pass code review from a different coder, and come with a unit test (or an
explanation of why a new unit test is not needed).
i. We have opened 30 issues. 10 of these issues have been fixed. 4 have been
closed as duplicate. 4 have been closed as invalid. 12 remain open; 6 of
these have a patch available, but have not been committed. Of the remaining
open issues, only 1 is applied to our local distribution (the same patch is also
applied to the Cloudera distribution).
d. Large number of unit tests: HDFS core has good unit test coverage (Clover coverage
of 76% http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/clover/).
All nontrivial commits require a unit test to be committed along with them. Because of
the initial difficulties in getting completely safe sync/append functionality, a large set
of new unit tests was developed for 0.21.0 based on a fault-injection framework. The
fault injection framework provides developers with the ability to better demonstrate
not only correct behaviors, but correct behaviors under a variety of fault conditions.
The unit tests are run nightly using Apache Hudson
(http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/) and take several
Each point helps to mitigate the issue, but does not completely remove it. In the
extreme case, we are prepared to run on locally-developed patches that are not accepted by
the upstream project. This would hurt our efforts to keep support costs under control, so
we would avoid this situation.
3. Hadoop feature set: We believe HDFS currently has all the features necessary for adoption.
We do not believe that any new features are required in the core. However, it should be
pointed out the system is not fully POSIX compliant. Specifically, the following is missing:
a. File update support: Once a file is closed, it cannot be altered. In HDFS 0.21.0,
append support will be enabled. We do not believe this will ever be necessary for
b. Multiple write streams / random writes: Only a single stream of data can write to an
open file and doing a seek() during write is not supported. This means that one may
not write a TFile directly to HDFS using ROOT; first, the file must be written to
disk, then copied to HDFS. We developed in-memory stream reordering in GridFTP
in order to avoid this limitation. If USCMS decides to write files directly to the SE
and not use local scratch, HDFS will not be immediately supported. We believe this
to be a low risk.
c. Flush and append: A file is not guaranteed to be fully visible until it has been closed.
Until it is closed, it is not defined how much data a reader will see if they attempt to
read the file. Flush and append support will be available for HDFS 0.21.0, which will
provide guaranteed semantics about when data will be available to readers. We do
not believe this will be an issue for CMS.
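The single-stream restriction in (b) is why out-of-order data has to be reordered before it reaches HDFS. The idea of in-memory stream reordering can be sketched as follows. This is an illustration only, not the code of the GridFTP plugin; the class and method names are hypothetical, and the sink stands in for any append-only output such as a file opened for sequential writing.

```python
import io


class ReorderingWriter:
    """Buffer out-of-order (offset, data) chunks and emit them sequentially."""

    def __init__(self, sink):
        self.sink = sink        # any object with a write(bytes) method
        self.next_offset = 0    # next byte offset the sink expects
        self.pending = {}       # offset -> data held in memory

    def write_chunk(self, offset, data):
        # Only chunks that start before the written region, or that repeat a
        # buffered offset, are rejected; partial overlaps are not checked.
        if offset < self.next_offset or offset in self.pending:
            raise ValueError(f"duplicate or stale chunk at offset {offset}")
        self.pending[offset] = data
        # Flush every chunk that is now contiguous with the data already written.
        while self.next_offset in self.pending:
            chunk = self.pending.pop(self.next_offset)
            self.sink.write(chunk)
            self.next_offset += len(chunk)

    def buffered_bytes(self):
        """Memory currently held while waiting for earlier chunks."""
        return sum(len(chunk) for chunk in self.pending.values())

    def close(self):
        if self.pending:
            raise IOError("gap in stream: chunks remain buffered")


# Small demonstration with an in-memory sink.
demo_sink = io.BytesIO()
demo_writer = ReorderingWriter(demo_sink)
demo_writer.write_chunk(4, b"efgh")   # arrives early, stays buffered
demo_writer.write_chunk(0, b"abcd")   # fills the gap, both chunks are flushed
demo_writer.close()
```

The amount returned by `buffered_bytes()` grows whenever later chunks arrive well ahead of earlier ones, which illustrates why reordering many parallel streams can become memory-hungry.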
FUSE-DFS, as a contributed project in Hadoop, shares many risks with HDFS. There are a few
concerns we believe relevant enough to merit their own category.
1. FUSE support: FUSE has no commercial company providing support. However, it is a part of
the mainline Linux kernel, is over 5 years old, and has had a stable interface for quite a while.
We have never seen any issue from FUSE itself. We believe the FUSE kernel module is a
risk because OSG has less experience packaging kernel modules, and kernel modifications
often result in support issues. This is mitigated by the fact that OSG-supported Xrootd
requires the FUSE kernel (meaning that HDFS isn’t unique in this situation) and that the
ATrpms repository provides a FUSE kernel module and tools to build the RPM module for
non-standard kernels. UCSD and Caltech both build their own kernel modules; Nebraska
uses the ATrpms ones.
2. FUSE-DFS support: FUSE-DFS is the name of the userland library that implements the
FUSE filesystem. This was originally implemented by Facebook and is in the HDFS SVN
repository as a contributed module. It does not have the same level of support as HDFS
Core because it is contributed; it also does not have as many companies using it in
production. Through the process of adopting HDFS, we have discovered critical bugs and have
submitted, and had accepted, several FUSE-DFS-related patches. We have even recently
found memory leak bugs in the libhdfs C wrappers (this had not been previously discovered
because it only is noticeable when things are in continuous production). We believe this
component is the short-term highest-risk software component of the entire solution. The
mitigating factors for FUSE-DFS are:
a. Small, stable codebase: FUSE-DFS is basically a small layer of glue between FUSE
and the libhdfs library (a core HDFS component, used by Yahoo). The entire code
base is around 2,000 lines, about 4% of the total HDFS size. During our usage of
HDFS, neither the libhdfs nor the FUSE API has changed. This limits the number of
undiscovered bugs and rate of bug introduction. We believe the majority of the
possible issues are fixed.
b. Production experience: We have been running FUSE for more than 8 months, and feel
like we have a good understanding of possible production issues. As of the latest
release, the largest outstanding issue is the fact that FUSE must be remounted
whenever users are added or removed from groups (user-to-group mappings are
currently cached indefinitely). This is well understood and possible to work around.
This bug may be mitigated in future versions of HDFS, as it will be necessary to the
future Kerberos-based authorization / authentication.
c. Extensive debugging experience available: The last FUSE-DFS memory leak bug
tackled required in-depth debugging at Nebraska. We believe we have the
experience and tools necessary to handle any future bugs. We intend to make sure
that any locally developed patches are upstreamed to the HDFS project.
There have been other FUSE binding attempts, but this is the only one that has been
supported or developed by a major company (Facebook) and committed as a part of the
HDFS project. The other attempts appear to have never been completed or kept
up-to-date with HDFS.
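The group-mapping issue described under "Production experience" can be illustrated with a minimal Python sketch. This is not FUSE-DFS code; the class name GroupCache, its parameters, and the lookup function are all hypothetical. It only shows why a cache without expiry keeps serving a stale user-to-group mapping until the process is restarted (the analogue of a remount), while a cache with a time-to-live picks up the change on its own.

```python
import time


class GroupCache:
    """Hypothetical user-to-group cache; ttl=None means entries never expire."""

    def __init__(self, lookup, ttl=None, clock=time.monotonic):
        self._lookup = lookup    # function mapping a user name to a list of groups
        self._ttl = ttl          # seconds an entry stays valid, or None for forever
        self._clock = clock      # injectable clock, useful for testing
        self._entries = {}       # user -> (groups, time the entry was cached)

    def groups(self, user):
        now = self._clock()
        entry = self._entries.get(user)
        if entry is not None:
            groups, cached_at = entry
            # Serve the cached answer if it never expires or is still fresh.
            if self._ttl is None or now - cached_at < self._ttl:
                return groups
        # Cache miss or expired entry: ask the authoritative source again.
        groups = list(self._lookup(user))
        self._entries[user] = (groups, now)
        return groups
```

With ttl=None, the only way to see a changed mapping is to discard the cache object, which corresponds to the remount described above.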
BeStMan is an already supported component of the OSG. We have identified the associated risks:
1. BeStMan runs out of funding: As BeStMan is quickly becoming an essential OSG package,
we believe that it will always meet the needs of USLHC, even if it is not funded at LBNL.
2. BeStMan currently uses the Globus 3 container: The Globus 3 web services container was never
in large-scale use, and currently suffers from debilitating bugs and an unmaintained
architecture. The BeStMan team is currently devoting most of its effort to replacing this with
an industry-standard Tomcat webapp container. This should be delivered fall-winter 2009.
We believe this will remove many bugs and improve the overall source code. This would
make it possible for external parties to submit improvements.
Globus GridFTP is an already-supported component of the OSG. We have identified the associated
risks:
1. Globus GridFTP runs out of funding: Globus GridFTP is an essential component to the
OSG. If it runs out of funding, we will use whatever future solution the OSG adopts.
2. Globus GridFTP model possibly not satisfactory: The Globus GridFTP model is based on
processes being launched by xinetd. Because each transfer is a separate process, issues
affecting one transfer are isolated from other transfers. However, this makes it
extremely hard to enforce limits on the number of active transfers per node. This can lead
to either instability issues (by having no limit) or odd errors (globus-url-copy does not
gracefully report when xinetd refuses to start new servers). We would like to investigate
multi-threaded daemon-mode Globus GridFTP, but have not identified effort yet. Current
T2 sites mitigate this by mostly controlling the number of concurrent transfers (except
CRAB stageouts) and providing sufficient hardware to accommodate an influx of transfers.
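As an illustration of what a per-node limit on active transfers could look like inside a single multi-threaded daemon, here is a minimal Python sketch. It is not Globus GridFTP code and does not reflect how a daemon-mode server is actually implemented; the class TransferLimiter and its methods are hypothetical. The point is only that a shared counter lets the server refuse a transfer explicitly when the limit is reached, which is difficult when every transfer is an independent process.

```python
import threading


class TransferLimiter:
    """Hypothetical cap on the number of simultaneously active transfers."""

    def __init__(self, max_active):
        # BoundedSemaphore raises an error if released more often than acquired.
        self._slots = threading.BoundedSemaphore(max_active)

    def try_start(self):
        # Non-blocking: returns True if a slot was free, False if the node is full.
        return self._slots.acquire(blocking=False)

    def finish(self):
        # Must be called exactly once for every successful try_start().
        self._slots.release()
```

A caller that receives False from try_start() can return a clear "server busy" response to the client instead of failing in an unspecified way.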
Both BeStMan and GridFTP require plug-ins in order to achieve the desired level of functionality in
this SE. We have identified the associated risks:
1. Future changes in versions of underlying components: We may have to update plugin code if
the related component changes its interface. For example, BeStMan2 may require a new
Java interface to implement GridFTP selector plugins. Even if the API remains the same,
it’s possible for the underlying assumptions to change – i.e., if GridFTP plug-in needed to
2. Original author leaves USCMS: If the original author leaves USCMS, then much knowledge
would be lost, even if the effort is replaced. This is why focus is being put into clean
packaging, documentation, and ownership by an organization (OSG) as opposed to just one
person. The BeStMan component is relatively simple and straightforward, mitigating this
concern. The GridFTP component is not, due to the complexity of the Globus DSI interface
(by far, the most complex interface in the SEs). This is high-performance C code and
difficult to change. If the original author left and the Globus DSI module changed
significantly, USCMS would need to invest about 1 man-month of effort to perform the
upgrade. This is mitigated by the fact that the current system does not have any necessary
GridFTP feature upgrades – USCMS can run on the same plugin for a significant amount of
time.
We have worked hard to provide packaging for the entire solution. The current packaging does
offer a few pitfalls:
1. Original author leaves USCMS: The setup at Caltech is based on “mock”, the standard
Fedora/Redhat build tool. The VDT currently does not have the processes in place
to package RPMs effectively, but this is a planned development for Year 4. Until the
packaging duties can be transferred from Caltech to VDT (perhaps late Year 4), we will be
dependent on the setup there. We are attempting to get it better documented in order to
2. Patches fail to get upstreamed: It is crucial to send patches upstream and maintain the
minimum number of changes from the base install. We must remain diligent in making sure
to commit upstream fixes for any bugs.
3. Rate of change: Even with only bug fix updates for “golden releases”, the rate of updates is
always worrying. Most of the updates recently have been related to packaging issues,
especially for platforms not present at any production T2 cluster. We hope that the added
OSG effort in Year 4 will enable us to drastically reduce the rate of change.
4. Update mechanisms for ROCKS clusters: Currently, doing a “yum install” is the correct way
to install the latest version of the software. However, when an administrator adds the RPMs
to a ROCKS roll, they get locked into that specific version and must manually take action to
upgrade the RPMs. This means there will always be significant resistance to changing
versions. This makes decreasing the rate of updates even more important.
Experts and Funding
Much of this work was done using several CMS experts. We outline two risks:
a) Loss of experts: As mentioned above, we take a significant hit if our experts leave the
organization. We are focusing heavily on documentation, packaging, and “finishing off”
development (in fact, preparation for this review has prompted us to clear several long-
standing issues). This will not only allow us to produce the first “golden set”, but also
increase the length of time HDFS can be maintained between experts.
a. A significant amount of CMS T2 funding comes from the DISUN project, which ends in
Spring 2010. DISUN personnel contribute to the HDFS effort. This is an ongoing
concern for the HDFS effort and the CMS T2 program as a whole.
b) Loss of OSG: Much of the risk and effort is being shouldered with the OSG to leverage their
packaging expertise. Having HDFS in the OSG taps into an additional pool of human
resources outside the experts in USCMS. However, the current funding for the OSG runs
out in 2 years (and is reduced in 1 year). If the OSG funding is lost, then we will have to
again rely internally on USCMS personnel, similar to FY2009.
The catastrophe scenario for HDFS adoption is both funding loss in the OSG and loss of the
experts. In this case, the survival plan would be:
Identify funding for new experts (from experience, it takes about 6 months to train a new
expert once they are in place). This can be taken from the pool of HDFS sysadmins; as
HDFS gains wider use, the pool of potential experts is broadened.
No new “golden set” until a packaging, testing, and integration program can be re-
established. If this becomes a chronic problem, a hard focus would be made on switching
entirely to Cloudera’s distribution in order to offload the Q/A testing of major changes to
an external organization.
No new USCMS-specific features. We believe that HDFS has all the necessary major
features for CMS adoption, but we do find small useful ones (an example would be the
development of Ganglia 3.1 compatibility). Without a local expert, developing these for CMS
would not be possible. Without a local expert, running with any patches not accepted by the
upstream project becomes increasingly dangerous.