Hadoop File System as part of a CMS Storage Element

1. Introduction

In the last year, several new storage technologies have matured to the point where they have become viable candidates for use at CMS Tier2 sites. One particular technology is HDFS, the distributed filesystem used by the Hadoop data processing system.

2. The HDFS SE

HDFS is a file system; as such, it must be complemented by other components in order to build a grid SE. We consider a minimal set of additional components to be:

a) FUSE / FUSE-DFS: FUSE, a standard Linux kernel module, allows filesystems to be written in userspace. The FUSE-DFS library makes HDFS into a FUSE filesystem. This allows a POSIX-like interface to HDFS, necessary for user applications.
b) Globus GridFTP: Provides a WAN transfer protocol. Using Globus GridFTP with HDFS requires a plugin developed by Nebraska.
c) BeStMan: Provides an SRMv2 interface. UCSD and Nebraska have implemented plugins for BeStMan to allow smarter selection of GridFTP servers.

There are optional components that may be layered on top of HDFS, including XrootD, Apache HTTP, and FDT.

3. Requirements

This document aims to show that the combination of HDFS, FUSE, Globus GridFTP, BeStMan, and some plugins developed by CMS Tier-2 teams at Nebraska, Caltech, and UCSD meets the SE requirements set forth by USCMS Tier 2 management. The requirements are typeset in italics, and the responses are given below them in normal typesetting. Throughout this document, we use “Hadoop” and “HDFS” interchangeably; generally, “Hadoop” refers to the entire data processing system. In this context, we take it to only refer to the filesystem components.

4. Management of the SE

Requirement 1. A SE technology must have a credible support model that meets the reliability, availability, and security expectations consistent with the area in the CMS computing infrastructure in which the SE will be deployed.
Support for this SE solution is provided by a combination of OSG, LBNL, Globus, the Apache Software Foundation (ASF), DISUN, and possibly the US CMS Tier-2 program as follows. BeStMan is supported by LBNL, GridFTP by Globus. Both are part of the OSG portfolio of storage solutions. HDFS is supported by ASF, as elaborated below, and FUSE is part of the standard Linux distribution.

This leaves us with two types of plugins that are required to integrate all the pieces into a system, as well as the packaging support. The first is a plugin to BeStMan to pick a GridFTP server from a list of available GridFTP servers; this list is reloaded every 30 seconds from disk and servers are randomly selected (otherwise, the default policy is a simple round-robin and it requires an SRM restart to alter the GridFTP server list). The second is a GridFTP plugin to interface to HDFS. We propose that both of these types of plugins continue to be supported by their developers from DISUN and US CMS, until OSG has had an opportunity to gain sufficient experience to adopt the source and own them. Similarly, packaging support in the form of RPMs is to be provided initially by DISUN/Caltech, and later by US CMS/Caltech, until OSG has had an opportunity to adopt it as part of a larger migration towards providing native packaging in the form of RPMs. OSG ownership of these software artifacts has been agreed upon in principle, but not yet formally.

OSG support for this solution will start with Year 4 of OSG (October 1st, 2009), and will initially be restricted to:

a) Pick a set of RPMs twice a year, verify that this set is completely consistent, providing a well integrated system. We refer to this as the “golden set”.
b) Document installation instructions for this golden set.
c) Do a simple validation test on supported platforms (with the validation preferably automatic).
d) Performance test the golden set, and document that test. This performance test will be in-depth.
e) Provide operations support for two golden sets at a time. This means that there is a staff person on OSG who is responsible for tracking support requests, answering simple questions, and finding solutions to difficult questions via the community support group organized on the firstname.lastname@example.org listserv.
f) OSG will provide updates to a golden set only for important bug and security fixes; these critical patches will go through validation tests, but not performance tests.

Support will be official for RHEL4 and RHEL5 derivatives on both 32-bit and 64-bit platforms. The core HDFS software (the namenode and datanode) is usable on any platform providing the Java 1.6 JDK. Currently, Caltech and Nebraska both run datanodes on Solaris. The main limitation for FUSE clients is support for the FUSE kernel module; this is supported on any Linux 2.6 series kernel, and FUSE was merged into the kernel itself in 2.6.14.

Upgrades in HDFS are covered in this wiki document: http://wiki.apache.org/hadoop/Hadoop_Upgrade. On our supported version, upgrades on the same major version will only require a “yum update”. For major version upgrades, the procedure is:

a) Shut down the cluster.
b) Upgrade via yum or RPMs.
c) Start the namenode manually with the “-upgrade” flag.
d) Start the cluster. The cluster will stay in read-only mode.
e) Once the cluster’s health has been verified, issue the “hadoop dfsadmin -finalizeUpgrade” command. After the command has been issued, no rollback may be performed.

The wiki explains additional recommended safety precautions.

Hadoop is an open-source project hosted by the Apache Software Foundation. ASF hosts multiple mailing lists for both users and developers, which are actively watched by both the Hadoop developers and the larger Hadoop community. Members of this community include Yahoo employees who use Hadoop in their workplace, as well as employees of Cloudera, which provides commercial packaging and support for Hadoop.
As it is a top-level Apache project, it has contributors from at least three companies and strong project management. Yahoo has stated that it has invested millions of dollars into the project and intends to continue doing so (http://developer.yahoo.net/blogs/hadoop).

Hadoop depends heavily upon its JIRA instance, http://issues.apache.org/jira/browse/HADOOP, for bug reporting and tracking. As it is an open-source project, there is no guarantee of response time to issues. However, we have found that high-priority issues (those that may lead to data loss) get solved quickly because bugs affecting a T2 site have a high probability of affecting Yahoo’s production infrastructure.

Requirement 2. The SE technology must demonstrate the ability to interface with the global data transfer system PhEDEx and the transfer technologies of SRM tools and FTS as well as demonstrate the ability to interface to the CMSSW application locally through ROOT.

Caltech is using Hadoop exclusively for OSG RSV tests, PhEDEx load tests, user stageouts via SRM, ProdAgent stageout via SRM and POSIX-like access via FUSE. User analysis jobs submitted to Caltech with CRAB have been running against data stored in Hadoop for several months.

UNL has been using HDFS at large scale since approximately late February 2009. It is used for:

o PhEDEx data transfers for all links (transfers are done with SRM and FTS).
o OSG RSV tests (used to meet the WLCG MoU availability requirements).
o Monte Carlo simulation and merging.
o User analysis via CMSSW (both grid-based submissions using CRAB and local interactive analysis).

UCSD has been using HDFS for serving /store/user.

Caltech, UNL, and UCSD have been leading the effort to demonstrate the scalability of the BeStMan SRM server. This has recently resulted in an srmLs processing rate of 200Hz in a single server; this is 4 times larger than the rate required by CMS for FY2010, and is in fact the most scalable SRM solution deployed in global CMS.

Requirement 3.
There must be sufficient documentation of the SE so that it can be installed and operated by a site with minimal support from the original developers (i.e. nothing more than "best effort"). This documentation should be posted on the OSG Web site, and any specific issues in interfacing the external product to CMS product should be highlighted.

Installation, operation, and troubleshooting directions can be found at http://twiki.grid.iu.edu/bin/view/Storage/Hadoop. This has already been discussed in Requirement 1, and we elaborate further here.

It should be plausible for any site admin to install HDFS and the corresponding grid components without support from the original developers. The packaging provides a coherent install experience; all components, including BeStMan and GridFTP, are available as RPMs. Admins experienced with the RedHat tool “yum” will find that the SE is installable via a simple “yum install hadoop”. The DISUN/Caltech packaging also provides useful logging defaults that make it easy to centrally log errors happening in HDFS; this greatly aids admins in troubleshooting. With the OSG 1.2 release, there are no specific issues pertaining to using HDFS with CMS.

Experience shows that operational overhead at Caltech has been equivalent to approximately 1 FTE, and that includes R&D activities for packaging and testing. Going forward, we believe the overhead will decrease, as the R&D portions will be greatly reduced. Nebraska, which has been in a stable state with Hadoop for several months, shows that operational overhead is less than 1 FTE. UCSD, which is presently supporting HDFS for /store/user (113TB) and dCache for all else (273TB), reports the same experience. The main reason why this solution is experienced as less costly to operate is that it has many fewer moving parts. This much reduced complexity results in significantly lower operational overhead.

Requirement 4.
There must be a documented procedure for how problems are reported to the developers of those products, and how these problems are subsequently fixed.

Starting with Year 4 of OSG, all problems are reported via OSG. OSG then uses the support mechanisms as discussed in Requirement 1. In particular, Hadoop has an online ticket system called JIRA. JIRA is heavily used by the developers to receive and track bugs and feature requests for Hadoop and Hadoop-related projects. JIRA is open for viewing by everyone, and requires a simple account registration for posting comments and new tickets. All commits in the project must be traceable via JIRA and go through the quality control process (which includes code review by a different developer and passing automated tests). The JIRA system can be found at http://issues.apache.org/jira/browse/HADOOP. The Hadoop community has also written a guide to filing bug reports, http://wiki.apache.org/hadoop/HowToContribute.

Requirement 5. Source code required to interface the external product to CMS products must be made available so that site operators can understand what they are operating. If at all possible, source code for the external product itself should also be available.

All software components are open source. In particular, Hadoop source code can be downloaded from http://www.apache.org/dyn/closer.cgi/hadoop/core/. Patches for specific problems in a release can be downloaded from JIRA (see above). Any patches currently applied to the Caltech Hadoop distribution have been submitted to JIRA, and we have tried to make sure that they get committed in a timely manner. This helps us minimize the costs of maintaining our own RPM installs.

Yahoo has publicly committed to releasing its “stable patch set” to the world beginning with 0.20.0. They are committed to keeping all patches used publicly available in the JIRA; the only added knowledge is which patches are stable enough to be applied to current releases.
When we upgrade to this version, we will be able to tap into this significant resource; Yahoo has a committed QA team and a test cluster an order of magnitude larger than our production cluster.

The BeStMan source code is open source for academic users; the OSG is working on clearing up the licensing of this software and making the code more freely available. Each of the plug-ins used (for BeStMan and GridFTP) is available in the Nebraska SVN repository, which allows anonymous access.

5. Reliability of the SE

Requirement 6. The SE must have well-defined and reliable behavior for recovery from the failure of any hardware components. This behavior should be tested and documented.

We broadly classify the HDFS grid SE into three parts: metadata components, data components, and grid components. Below, we document the risks a failure in each poses and the suggested recovery mechanisms.

Metadata components: The two major metadata components for HDFS are the namenode and the “secondary namenode” (backup) servers. The namenode is the single point of failure for normal user operation. Because of this, there are several built-in protections each site is recommended to take:

o Write out multiple copies of the journal and file system image. HDFS heavily relies on the metadata journal as a log of all the operations that alter the namespace. HDFS config files allow for writing the journal onto multiple partitions. It is recommended to write the journal on two separate physical disks in the namenode, and suggested that a third copy be written to an NFS server.
o The secondary namenode / backup server allows the site admin to create checkpoint files at regular intervals (default is every hour or whenever the journal reaches 64MB in size). It is strongly recommended to run the secondary namenode on a separate physical host. The last two checkpoints are automatically kept by the secondary namenode.
o The checkpoints should be archived, preferably off-site.
Future versions of HDFS (0.21.x) plan on having an offline checkpoint verification tool and on streaming journal information to the backup node in real time, as opposed to an hourly checkpoint.

In the case of namenode failure, the following remedies are suggested:

1. Restart the namenode from the file system image and journal found on the namenode’s disk. The only resulting file loss will be the files being written at the time of namenode failure. This action will work as long as the image and journal are not corrupted.
2. Copy the checkpoint file from the secondary namenode to the namenode. Any files created between the time the checkpoint was made and the namenode failure will be lost. This action will work as long as the checkpoint has not been corrupted.
3. Use an archived checkpoint. Any files created between the checkpoint creation and the namenode failure will be lost. This action will work as long as one good checkpoint exists.

Note that all the namenode information is kept in two files written directly to the Linux filesystem. If none of these actions work, the entire file system will be lost; this is why we place such importance on backup creation.

In addition to the normal preventative measures, the following can be done:

1. Have hot-spare hardware available. HDFS is offline when the namenode is offline. If the failure does not have an immediate, obvious cause, we recommend using the standby hardware instead of prolonging the downtime by troubleshooting the issue.
2. Create a high-availability (HA) setup. It is possible, using DRBD and Heartbeat, to completely automate recovery from a namenode failure using a primary-secondary setup. This would allow services to be restored automatically on the order of a minute, with the loss of only the files which were being created at the time of the failure. As most of the CMS tools automatically retry writing into the SE, this means that the actual file loss would be minimal.
These high-availability setups have been used by external companies, but not yet by a grid site; the extra setup complexity is not yet perceived as worth the time.

Data components: Datanode loss is an expected occurrence in HDFS and there are multiple layers of protection against this.

1. The first layer of protection is block replication. Hadoop has a robust block replication feature that ensures that duplicate file blocks are placed on separate nodes and even separate racks in the cluster. This helps ensure that complete copies of every file are available in the case that a single node becomes unavailable, and even if an entire rack becomes unavailable.
2. Hadoop periodically requests an entire block report from every datanode. This protects against synchronization bugs where the namenode’s view of the datanode’s contents is different from reality.
3. It is often possible to have a bad hard drive causing corruption issues, but not have that hard drive fail. Each datanode schedules each block to be checksummed once every 2 weeks. If the block fails the checksum, it will be deleted and the namenode notified, allowing automatic healing from hard drive errors.

Grid components: Grid components require the least amount of failure planning because they are stateless for the HDFS SE. Multiple instances of the grid components (GridFTP, BeStMan SRM) can be installed and used in a failover fashion using Linux LVS or round-robin DNS.

We are aware of one significant failure mode. BeStMan SRM is known to lock up under very heavy load (over 1000 concurrent SRM requests, which is at least tenfold what has been observed in production), and requires a restart when this happens. We believe this to be a problem with the Globus Java container. BeStMan is scheduled to transition to a different container technology in Fall 2009 in BeStMan 2. OSG is committed to follow this issue, and validate the new version when it is released.
However, this issue manifests itself only under extreme loads at the existing deployments, and is thus presently not an operational problem.

In addition, there is a minor issue with the way HDFS handles deployments with server nodes that serve space of largely varying sizes. HDFS needs to be routinely re-balanced, which is typically done via crontab in extreme cases (e.g. Nebraska), and manually once every two weeks or so for deployments where disk space differs by no more than a factor of 2-4 or so between nodes (e.g. UCSD). There is no significant operational impact from routinely running the balancer beyond network traffic; HDFS allows the site to throttle the per-node rate going to the balancer. This imbalance arises because the selection of servers for writing files is mostly random, while the distribution of free space is not necessarily random.

Requirement 7. The SE must have a well-defined and reliable method of replicating files to protect against the loss of any individual hardware system. Alternately there should be a documented hardware requirement of using only systems that protect against data loss at the hardware level.

Hadoop uses block decomposition to store its data, breaking each file up into blocks of 64MB apiece (this block size is a per-file setting; the default is 64MB, but most sites have increased this to 128MB for new files). Each file has a replication factor, and the HDFS namenode attempts to keep the correct number of replicas of each of the blocks in the file. The replication policy is:

a) First replica goes to the “closest datanode”: the local node is the highest priority, followed by a datanode on the same rack, followed by any datanode.
b) Second replica goes to a random datanode on a different rack.
c) Third replica (if requested) goes to a random datanode on the same rack as the second replica.

If two replicas are requested, they will end up on two separate racks; if three replicas are requested, they will also end up on two separate racks.
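The placement policy above can be sketched in a few lines of Python. This is an illustrative model only, not HDFS code: the function name place_replicas, the mapping of datanode names to rack names, and the fallback to any unused datanode when the preferred rack has none available are all assumptions made for the example.

```python
import random


def place_replicas(datanodes, replication, writer=None, writer_rack=None, rng=random):
    """Choose datanodes for the replicas of one block.

    datanodes   -- dict mapping datanode name -> rack name (hypothetical layout)
    replication -- number of replicas requested by the client
    writer      -- name of the writing host, if it is itself a datanode
    writer_rack -- rack of the writing host when it is not a datanode
    """
    chosen = []

    def pick(candidates):
        # Random choice among candidates that do not hold a replica yet.
        pool = sorted(c for c in candidates if c not in chosen)
        return rng.choice(pool) if pool else None

    # a) First replica: local node, else same rack as the writer, else any node.
    if replication >= 1 and datanodes:
        if writer in datanodes:
            first = writer
        else:
            same_rack = (n for n, r in datanodes.items() if r == writer_rack)
            first = pick(same_rack) or pick(datanodes)
        chosen.append(first)

    # b) Second replica: random node on a different rack than the first.
    if replication >= 2 and chosen:
        first_rack = datanodes[chosen[0]]
        other_racks = (n for n, r in datanodes.items() if r != first_rack)
        second = pick(other_racks) or pick(datanodes)  # fallback is an assumption
        if second is not None:
            chosen.append(second)

    # c) Third replica: random node on the same rack as the second.
    if replication >= 3 and len(chosen) == 2:
        second_rack = datanodes[chosen[1]]
        same_as_second = (n for n, r in datanodes.items() if r == second_rack)
        third = pick(same_as_second) or pick(datanodes)  # fallback is an assumption
        if third is not None:
            chosen.append(third)

    # Any further replicas go to random unused nodes (assumption).
    while len(chosen) < min(replication, len(datanodes)):
        chosen.append(pick(datanodes))
    return chosen


if __name__ == "__main__":
    cluster = {"a1": "rackA", "a2": "rackA", "b1": "rackB", "b2": "rackB"}
    print(place_replicas(cluster, 3, writer="a1"))
```

With two nodes per rack, three requested replicas always land on exactly two racks, matching the statement above.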
Replication level is set by the client at the time it creates the block. The replication level may be increased or decreased by the admin at any time per-file, and can be done recursively. The namenode attempts to satisfy the client’s request; as long as the number of successfully created replicas is between the namenode’s configured minimum and maximum, HDFS considers the write a success.

Because the client requests a replication level for each write, one cannot set a default replication for a directory tree. As a workaround, Caltech runs a cron script that automatically sets the replication level on known directories so that files there end up at the desired replication level. They currently use:

    hadoop fs -setrep -R 3 /store/user
    hadoop fs -setrep -R 3 /store/unmerged
    hadoop fs -setrep -R 1 /store/data/CRUZET09
    hadoop fs -setrep -R 1 /store/data/CRAFT09

It is not possible to pin files to specific datanodes, or to set replication based on the datanodes where the files will be located. Hadoop treats all datanodes as equally unreliable.

The namenode keeps track of all block locations in the system, and will automatically delete replicas or create new copies when needed. Replicas may be deleted either when the associated file is removed or when the block has more replicas than desired. Datanodes send in a heartbeat signal once every 3 seconds, allowing the namenode to keep up to date with the system status. When a datanode fails, the namenode allows it to miss up to 10 minutes of heartbeats (this setting is configurable) before declaring it dead. Once it is declared dead, the namenode starts to make inter-node transfer requests to bring the blocks that were on the datanode back up to the desired level. This often is quite quick; all the desired replicas can be done in around an hour per TB (the majority of the transfers go faster than that; a good portion of the hour is “transfer tails”).

When decommissioning is started, the system does not prefer to copy data off the decommissioned node.
Assuming the number of replicas is greater than one, the load of replication is distributed randomly throughout the cluster. The percentage of transfers the decommissioned node will get is (# of TB of files on datanode)/(# of TB in entire system). Because older nodes typically have smaller disks, they will get comparatively less load.

If a dead node reappears (host rebooted/fixed, disk is physically moved to a different node), the blocks it previously hosted will now be overreplicated. The namenode will then reduce the number of replicas in the system, starting with replicas located on nodes with the least amount of free space (as a percentage of total space on the node). If the blocks belonged to files which have been deleted, the node will be instructed to delete them in the response to its block report.

The number of under-replicated blocks can be seen by viewing the system report using fsck or by looking at the namenode’s Ganglia statistics.

The success of Hadoop's automatic block replication was seen when Caltech suffered a simultaneous failure of 3 large (6TB) datanodes on the evening of Sunday, Jul. 12:

“Within about an hour of installing a faulty Nagios probe, 3 of the 2U datanodes had crashed, all within minutes of each other. Each of the 2U datanodes hosts just over 6TB of raid-5 data. Nagios started sending alarms indicating that we had lost our datanodes. We had seen kernel panics caused by this nagios probe before, so we had no trouble locating the cause of the problem. The immediate corrective action was to disable this Nagios probe. This was done immediately to avoid further loss of datanodes. Ganglia showed Hadoop reporting ~80k underreplicated blocks, and Hadoop started replicating them after the datanodes failed to check in after ~10 minutes. The network activity on the cluster jumped from ~200MB/s to 2GB/s. Since we run Rocks, 2 of the machines went into a reinstall immediately after dumping the kernel core file.
Within an hour they were back up and running. Hadoop did not start automatically, however, due to a Rocks misconfiguration. I started Hadoop manually on these two nodes, which caused our underreplicated block count to drop from ~20k to ~4k. Within ~90 minutes the number of underreplicated blocks was back to zero. At this point we still had one datanode that had not recovered. I checked the system console and discovered that the system disk had also died. Since it was late on a Sunday evening, and we run 2X replication in Hadoop, I decided we could leave the datanode offline and wait until the following morning to replace the disk. After the replication was done, I ran “hadoop fsck /” to check the health of HDFS. To my surprise, hadoop was reporting 40 missing blocks and 33 corrupted files. This seemed strange because all of our files (aside from LoadTest data) is replicated 2x, so the loss of a single datanode should not have caused the loss of any files (aside from LoadTest data). After parsing the output of “hadoop fsck /”, I found that we had been accidentally setting the replication on /store/relval to 1, instead of leaving it at the default of 2. This was fixed. The next morning we replaced the system disk in the dead datanode and waited for it to reinstall (which took a few hours longer than it ought have). Almost immediately after starting Hadoop on this last datanode, hadoop fsck reported the filesystem was clean again. By 2:15 the following afternoon everything had returned to normal and hadoop was healthy again.”

Requirement 8. The SE must have a well-defined and reliable procedure for decommissioning hardware which is being removed from the cluster; this procedure should ensure that no files are lost when the decommissioned hardware is removed. This procedure should be tested and documented.

The process of decommissioning hardware is documented in the Hadoop twiki under the Operations guide. The process goes approximately like this:

1.
Edit the hosts exclude file to exclude the to-be-decommissioned host from the cluster.
2. Issue the “refreshNodes” command in the Hadoop CLI to get the namenode to re-read the file. The node should show up as “Decommissioning” in the web interface at this point.
3. Watch the web interface or the “report” command in the Hadoop CLI and wait until the node is listed as “dead”.

This process is not only straightforward, but also very routine at each site. At Nebraska, decommissioning is done whenever a node needs to be taken offline for any upgrade lasting more than 10 minutes.

Requirement 9. The SE must have well-defined and reliable procedure for site operators to regularly check the integrity of all files in the SE. This should include basic file existence tests as well as the comparison against a registered checksum to avoid data corruption. The impact of this operation (e.g. load on system) should be documented.

Hadoop’s command line utility allows site admins to regularly check the file integrity of the system. It can be viewed using “hadoop fsck /”. At the end of the output, it will either say the file system is “HEALTHY” or “CORRUPTED”. If it is corrupted, it provides the outputs necessary to repair or remove broken files.

HDFS registers a checksum at the block level within the block’s metadata. HDFS automatically schedules background checksum verifications (default is to have every block scanned once every 2 weeks) and automatically invalidates any block with an incorrect checksum. The checksumming interval can be adjusted downward at the cost of increased background activity on the cluster. We do not currently have statistics on the rate of failures avoided by checksumming.

Whenever a file is read by a client (even partially; checksums are kept for every 4KB), the client receives both data and checksum and verifies the validity of the data on the client side.
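The client-side check described above can be illustrated with a short Python sketch. It is not the HDFS implementation: zlib.crc32 stands in for whatever checksum HDFS actually uses, the 4KB chunk size is taken from the text, and the function names chunk_checksums and verify_range are hypothetical.

```python
import zlib

CHUNK_SIZE = 4 * 1024  # 4KB per checksum, as stated in the text


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Compute one CRC32 per fixed-size chunk of a block (writer side)."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def verify_range(data, checksums, start, length, chunk_size=CHUNK_SIZE):
    """Verify every chunk touched by a read of `length` bytes at `start`.

    Returns the indexes of chunks whose recomputed CRC32 does not match the
    stored checksum; an empty list means the read is clean (reader side).
    """
    if length <= 0:
        return []
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    bad = []
    for idx in range(first, last + 1):
        chunk = data[idx * chunk_size:(idx + 1) * chunk_size]
        if zlib.crc32(chunk) != checksums[idx]:
            bad.append(idx)
    return bad


if __name__ == "__main__":
    block = bytes(range(256)) * 64          # 16KB of sample data, 4 chunks
    sums = chunk_checksums(block)
    damaged = bytearray(block)
    damaged[5000] ^= 0xFF                   # flip one byte inside chunk 1
    print(verify_range(bytes(damaged), sums, 4500, 1000))
```

Because a checksum is kept per chunk, a partial read only needs to verify the chunks it touches rather than the whole block.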
Similarly, when a block is transferred (for example, through rebalancing), the checksum is computed by the receiving node and compared to the sender’s data.

Note about catastrophic loss: We have emphasized that with 2 replicas, file loss is very rare due to:

o Failures that are well separated in time (>2 hours between failures) cause little to no loss because re-replication in the file system is extremely fast; one guideline is to expect 1TB per hour to be re-replicated.
o Multi-disk failures happening within an hour usually are due to some common piece of equipment (such as the rack switch or PDU). Rack-awareness prevents an entire rack disappearing from causing file loss.

However, what happens if we make the assumption that all safeguards are bypassed and 2 disks are lost? This is not without precedent; at Caltech, a misconfiguration told Hadoop that 2 nodes on the same rack were on different racks. This bypassed the normal protections from rack awareness. The rack’s PDU failed and two disks failed to come back up. Caltech lost 54 file blocks.

Using the binomial distribution, the expected number of blocks lost is:

    (# of blocks lost) = (# of blocks) * P(single block loss)

The binomial distribution is appropriate because the loss of one block does not affect the probability of another block loss. The standard deviation is approximately the square root of the expected number of blocks lost. The probability of a single block loss is

    P(single block loss) = P(block on node 1) * P(block on node 2)

The probability a block is on a given node is approximately:

    P(block on node 1) = (replication level)*(size of node)/(size of HDFS)

assuming that the cluster is well-balanced and blocks are randomly distributed. Both assumptions appear to be safe in currently-deployed clusters. Plugging in Caltech’s numbers (1,540,263 blocks; 342.64 TB in the system, each lost disk was 1TB), the expected number of lost blocks was 52.4 with a standard deviation of 7.2.
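The arithmetic behind this estimate can be reproduced with a short Python sketch that applies the formulas above as written. The function name and parameter names are hypothetical, and the model carries the same assumptions as the text (a well-balanced cluster with randomly distributed blocks).

```python
import math


def expected_block_loss(n_blocks, hdfs_size_tb, node1_tb, node2_tb, replication=2):
    """Expected number of lost blocks when two storage units fail together.

    Follows the text: P(block on node) = replication * node size / HDFS size,
    and a block is lost when it has a replica on both failed units.
    Returns (expected losses, binomial standard deviation).
    """
    p_node1 = replication * node1_tb / hdfs_size_tb
    p_node2 = replication * node2_tb / hdfs_size_tb
    p_loss = p_node1 * p_node2
    expected = n_blocks * p_loss
    # Binomial standard deviation; close to sqrt(expected) because p_loss is tiny.
    std_dev = math.sqrt(n_blocks * p_loss * (1.0 - p_loss))
    return expected, std_dev


if __name__ == "__main__":
    # Caltech's numbers from the text: 1,540,263 blocks, 342.64 TB, two 1TB disks.
    expected, std_dev = expected_block_loss(1_540_263, 342.64, 1.0, 1.0)
    print(f"expected lost blocks: {expected:.1f} +/- {std_dev:.1f}")
```

With Caltech's inputs the sketch gives roughly 52.5 expected lost blocks with a standard deviation of about 7.2, in line with the figures quoted above.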
The estimate is strikingly close to the actual loss of 54 blocks.

If only complete files were written (i.e., no block decomposition), then the expected loss would be

    (# of files lost) = (# of files) * P(single block loss)

So, assuming 128MB blocks and experimental files of around 1GB (roughly 10 blocks per file), the number of files lost would be about 10x lower. In the end, a CMS site would lose about 10x more files using HDFS than with whole-file storage. We believe this is an acceptable risk, especially as the recovery procedure for 5 files versus 50 files is similar. In the case of simultaneous triple-disk-failure on triple-replicated files, the expected loss would be less than 1 file for Caltech’s HDFS instance.

Requirement 10. The SE must have well-defined interfaces to monitoring systems such as Nagios so that site operators can be notified if there are any hardware or software failures.

HDFS integrates with Ganglia; provided that the site admin points HDFS to the right Ganglia endpoint, many relevant statistics for the namenode and datanodes appear in the Ganglia gmetad webpages. Many monitoring and notification applications can set up alerts based on this.

Caltech has also contributed several HDFS-Nagios plugins to the public that monitor various aspects of the health of the system directly. Some of these plugins are based on the JMX (Java Management eXtensions) interface into HDFS. JMX can integrate with a wider range of monitoring systems. Caltech has also released a TCL-based desktop application, “gridftpspy”, which monitors the health and activity of the Globus GridFTP servers. There is also an external project providing Cacti templates for monitoring HDFS. The Nagios and gridftpspy components are packaged in the Caltech yum repository, but not officially integrated; we foresee labeling them experimental for the OSG-supported first release.

Finally, Caltech has developed the “Hadoop Chronicle”, a nightly email that sends administrators the basic Hadoop usage statistics.
This has an appropriate level of detail to inform site executives about Hadoop’s usage. The Hadoop Chronicle is now part of the OSG Storage Operations toolkit. It is currently in use at Caltech and in testing at Nebraska.

Note about admin intervention: The previous two requirements start to cover the topic of “what HDFS activities do site admins engage in, and at what interval?” We have the following feedback from Nebraska and Caltech site admins, respectively:

Nebraska:
o Daily tasks: Check Hadoop Chronicle, look at RSV monitoring.
o Once a week: Clean up dead hardware, restart dead components. The component which crashes most often is BeStMan, at about once every 2 weeks.
o Once every 2 months: Some sort of data recovery or in-depth maintenance. Examples include debugging an underreplicated block or recovering a corrupted file.

Caltech (note: Caltech runs an experimental kernel, which may explain why there is more kernel-related maintenance than at Nebraska):
o Continuously: Wait for Nagios alerts.
o Hourly tasks: Check namenode web pages and GridFTP logs via gridftpspy (admittedly a bit excessive).
o Daily tasks: Read Hadoop Chronicle, browse PhEDEx rate/error pages.
o Weekly tasks: Reboot nodes due to kernel panic, adjust GridFTP server list (BeStMan plugin currently not used), track down lost blocks (for datasets replicated once), maintain ROCKS configuration.
o Once a month: Reboot namenode with new kernel, reinstall datanodes with bugfix update.

6. Performance of the SE

All aspects of performance must be documented.

Requirement 11. The SE must be capable of delivering at least 1 MB/s/batch slot for CMS applications such as CMSSW. If at all possible, this should be tested in a cluster on the scale of a current US CMS Tier-2 system.

To test this requirement, Caltech ran a test using dd to read from HDFS through the FUSE mount on each of the 89 worker nodes on the Tier2 cluster. dd was used to maximize the throughput from the storage system.
We acknowledge that the IO characteristics of dd are not identical to those of CMSSW applications, which tend to read smaller chunks of data in random patterns. Each worker node ran 8 dd processes in parallel, one per core. Each dd process/batch slot on a single worker node read a different 2.6GB file from HDFS 10 times in sequence. The same 8 files were read from each of the 89 worker nodes. At the end of each file read, dd reported the rate at which the file was read. A total of 18.1TB was read during this test. The final dd finished approximately 4.25 hours after the test was started. The average read rate reported by dd was 2.3MB/s ± 1.5MB/s. The fastest read was 22.8MB/s and the slowest was 330KB/s. The aggregate rate delivered from HDFS was 18.1TB/4.25hours = 1238MB/s, or approximately 155MB/s (roughly 1Gbps) for each of the 8 HDFS files. The test was run mainly to see how the system behaves under the 'hot file' problem. As such, this test shows that HDFS can deliver even 'hot files' to the batch slots at the required rates. It should be noted that this test was run while the cluster was also 100% full with CMS production and CMS analysis jobs, most of which were also reading and writing to Hadoop at the same time. The background HDFS traffic from this CMS activity was not included in these results.

UCSD ran a separate test with a standard CMSSW application consuming physics data. The same application has been used for computing challenges or scalability exercises. The application is very I/O intensive. Here we mainly focused on the application reading data that is stored locally in Hadoop. During the tests, 15 datanodes held the data files, each 1GB in size. The block size in UCSD's Hadoop is 128 MB. The replication of the data files is set to 2. For each file, there are 16 blocks well distributed across all the datanodes. The application was configured to run against 1 file or 10 files per job slot.
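The figures quoted for the two read tests above can be cross-checked with a few lines of arithmetic. The following Python sketch is illustrative only: the function names are ours, it assumes binary units (1 GB = 1024 MB, 1 TB = 1024 GB), and it reads the "16 blocks" per 1GB UCSD file as 8 blocks of 128 MB, each stored twice at replication 2. These are assumptions, not statements taken from the test logs.

```python
import math

# Caltech dd test, figures as quoted in the text.
WORKER_NODES = 89
DD_PER_NODE = 8           # one dd process per core
FILE_SIZE_GB = 2.6
READS_PER_FILE = 10
DISTINCT_FILES = 8        # the same 8 files were read on every node
ELAPSED_HOURS = 4.25

MB_PER_GB = 1024          # assumption: binary units


def total_read_gb():
    """Total volume read by all dd processes, in GB."""
    return WORKER_NODES * DD_PER_NODE * FILE_SIZE_GB * READS_PER_FILE


def aggregate_rate_mb_s():
    """Aggregate read rate over the whole test, in MB/s."""
    return total_read_gb() * MB_PER_GB / (ELAPSED_HOURS * 3600)


def per_file_rate_mb_s():
    """Aggregate rate divided over the distinct 'hot' files."""
    return aggregate_rate_mb_s() / DISTINCT_FILES


def block_replicas(file_mb, block_mb, replication):
    """Number of block replicas HDFS stores for one file (assumed reading)."""
    return math.ceil(file_mb / block_mb) * replication


if __name__ == "__main__":
    print(f"total read: {total_read_gb() / 1024:.1f} TB")
    print(f"aggregate:  {aggregate_rate_mb_s():.0f} MB/s")
    print(f"per file:   {per_file_rate_mb_s():.0f} MB/s")
    print(f"UCSD block replicas per file: {block_replicas(1024, 128, 2)}")
```

Under these assumptions the sketch gives about 18.1 TB in total, roughly 1239 MB/s in aggregate, and about 155 MB/s per file, in line with the quoted 1238MB/s and 155MB/s.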
The number of jobs running simultaneously ranged from 20 to 200. The maximal number of jobs running simultaneously is 250, which is roughly a quarter of the available job slots at UCSD at that time. The rest of the slots were running production or user analysis jobs, so the test was running under very typical Tier-2 conditions. The test application itself didn't significantly change the overall condition of the cluster. The ratio of average job slots running the tests to the number of Hadoop datanodes ranged from 10 to 20. Eventually this ratio will be 8 if all the WNs are configured as Hadoop datanodes, and each WN runs 8 slots. This will increase the I/O capability per job slot by 50-100% over the results we measured in the test. The average processing time per job is 200 and 4000 seconds for the application processing 1 and 10 GB of data respectively. The average I/O rates in reading the data are shown in the following figure: average I/O for the application consuming 1GB (left) and 10 GB (right).

The test shows that the 1MB/s per slot requirement is at the low end of the rate that is actually delivered by HDFS. The average is ~2-3 MB/s per job.

Requirement 12. The SE must be capable of writing files from the wide area network at a performance of at least 125MB/s while simultaneously writing data from the local farm at an average rate of 20MB/s.

Below is a graph for the Nebraska worker node cluster. During this time, HDFS was servicing user requests at a rate of about 2500/sec (as determined by syslog monitoring using the HadoopViz application). Each user request is a minimum of 32KB, so this is at least 80MB/s of internal traffic. At the same time, we were writing in excess of 100MB/s as measured by PhEDEx.

Below is an example of HDFS serving data to a CRAB-based analysis launched by an external user. At the time (December 2008), the read-ahead was set to 10MB. This provided an impressive amount of network bandwidth (about 8GB/s) to the local farm, but is not an everyday occurrence.
The currently recommended read-ahead size is 32KB.

Requirement 13. The SE must be capable of serving as an SRM endpoint that can send and receive files across the WAN to/from other CMS sites. The SRM must meet all WLCG and/or CMS requirements for such endpoints. File transfer rates within the PhEDEx system should reach at least 125MB/s between the two endpoints for both inbound and outbound transfers.

During Aug. 20-24, Caltech and Nebraska ran inter-site load tests using PhEDEx to exercise the gridftp-hdfs servers. During this time period, PhEDEx recorded a 48-hour average of 171MB/s coming into the Caltech Hadoop SE, with files primarily originating from UNL. Peak rates of up to 300MB/s were observed. There was a temporary drop to zero at ~23:00 Aug. 24 due to an expired CERN CRL.

During this same time period, Caltech was exporting files at an average rate of 140MB/s, with files primarily destined for UNL. For several hours during this time period the transfer rates exceeded 200MB/s.

It must be noted that the PhEDEx import/export load tests were not run in isolation. While these PhEDEx load tests were running Caltech was downloading multi-TB datasets from FNAL, CNAF, and other sites with an average rate of 115MB/s and peaks reaching almost 200MB/s.

UCSD has additionally been working on an in-depth study of the scalability of BeStMan, especially at different levels of concurrency. The graph below shows how the effective processing rate has scaled with the increasing number of concurrent clients. The operation used was srmLs without full details; this causes a “stat” operation on the file system, but reduces the amount of XML generated by the BeStMan server. This demonstrates processing rates well above the levels currently needed for USCMS. It is sufficient for high-rate transfers of gigabyte-sized files and uncontrolled chaotic analysis.

7. Site-specific Requirements

Note: We believe the requirements set out here cover a subset of the functionality required at a CMS T2 site. We believe that the better test has been putting the storage elements into production at several sites – the combination of all activities and chaotic loads appears to be better than artificial tests. An additional test that we recommend below is replacing the skims (which are bandwidth-heavy and IOPS-light, unlike most T2 activities) with a few analysis jobs (which are bandwidth-medium and IOPS-heavy).

Note: We've done the best we could without owning more storage (by the end of 2010, each site will probably double in size). We believe we have demonstrated that the potential bottlenecks (the namenode) scale out for what we'll need in the next three years. As long as the ratio of cores to usable terabytes stays on the order of 1 to 1 and not 1 to 10, we believe IOPS will scale as demonstrated. We believe the fact that Yahoo has demonstrated multi-petabyte clusters shows the number of raw terabytes will scale.

Note: We believe that the architectures deployed at the current T2 sites (UCSD, Caltech, and Nebraska) can be repeated at others – in particular, any site that does not rely entirely on a small number of RAID arrays. It is applicable for sites having issues with reliability or site admin availability.

Requirement 14. A candidate SE should be subject to all of the regular, low-stress tests that are performed by CMS. These include appropriate SAM tests, job-robot submissions, and PhEDEx load tests. The SE should pass these tests 80% of the time over a period of two weeks. (This is also the level needed to maintain commissioned status.)

The below chart shows the status of the site commissioning tests from CMS, which is a combination of all the regular low-stress tests performed. Additionally, Caltech's use of a Hadoop SE has maintained a 100% Commissioned site status for the two weeks prior to Aug.
17: http://lhcweb.pic.es/cms/SiteReadinessReports/SiteReadinessReport_20090817.html#T2_US_Caltech.

Requirement 15. The new storage element should be filled to 90% with CMS data. These datasets should be chosen such that they are currently "popular" in CMS and will thus attract a significant number of user jobs. Failures of jobs due to failure to open the file or deliver the data products from the storage systems (as opposed to user error, CE issues, etc.) should be at the level of less than 1 in 10^5. A suggested test would be a simple "bomb" of scripts that repeatedly opens random files and reads a few bytes from them with a high parallelism; for the 10^5 test, it's not necessary to do it through CMSSW or CRAB. An example would be to have 200 worker nodes open 500 random files each and read a few bytes from the middle of the file.

This was performed using the “se_punch.py” tool found in Nebraska’s se_testkit. This script implemented the suggested test – all worker nodes in the Nebraska cluster simultaneously started opening random files and reading a few bytes from the middle of each. There were no file access failures.

Nebraska is now working on a script utilizing PyROOT (which is distributed with CMSSW) that opens all files on the SE with ROOT. This not only verifies that files can be opened, but demonstrates a minimal level of validity of the contents of the file. Opening with ROOT should fully protect against truncation (as the metadata required to open the file is written at the end of the file) and whole-file corruption. It does not detect corruptions in the middle of the file, but built-in HDFS protections should detect these.

Nebraska ran with HDFS over 90% full during May 2009 and encountered no significant problems other than writes failing when all space was exhausted. Caltech, however, experienced some corrupted blocks when HDFS was filled to 96.8% and certain datanodes reached 100% capacity.
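The random-read test described under Requirement 15 could be sketched as follows. This is not the actual se_punch.py; it is a minimal Python illustration with hypothetical function names (probe_file, punch), and it assumes the files are reachable through an ordinary POSIX path such as a FUSE mount of HDFS. With the suggested 200 worker nodes each probing 500 files, a run makes 10^5 open attempts in total.

```python
import os
import random


def probe_file(path, nbytes=16):
    """Open `path`, seek to its midpoint and read a few bytes.

    Returns True on success, False if the open, seek or read fails.
    """
    try:
        size = os.path.getsize(path)
        with open(path, "rb") as handle:
            handle.seek(size // 2)
            data = handle.read(nbytes)
        # A non-empty file must yield at least one byte from its midpoint.
        return size == 0 or len(data) > 0
    except OSError:
        return False


def punch(paths, samples=500, nbytes=16, seed=None):
    """Probe `samples` randomly chosen files; return (attempts, failures)."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(samples):
        if not probe_file(rng.choice(paths), nbytes):
            failures += 1
    return samples, failures
```

Each worker node would run punch() over a list of file paths under the mount point and report its failure count; summing the failures over all nodes and dividing by the total number of attempts gives the failure rate to compare against the 1 in 10^5 target.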
Some combination of failed writes, rebalancing, and failing disks resulted in two corrupted blocks and two corrupted files. These files had to be invalidated and retransferred to the site. This is the only time that Caltech has lost data in HDFS since putting it into production 6 months ago. There are a few recommendations to help avoid this situation in the future:
1) Run the balancer often enough to prevent any datanode from reaching 100%.
2) Don't allow HDFS to fill up enough that an individual datanode partition reaches 100%.
3) If using multiple data partitions on a single datanode, make them of equal size, or merge them into a single raid device so that hadoop sees only a single partition.
Future versions of Hadoop (0.20) have a more robust API to help manage datanode partitions that have been completely filled.

Requirement 16. In addition, there should be a stress test of the SE using these same files. Over the course of two weeks, priority should be given to skimming applications that will stress the IO system.

Specific CMS skim workflows were run at Nebraska on June 6. The results of these were not interesting, as the workflows only lasted 8 hours and no significant failures occurred. Moreover, the “stress” of the skim tests is far less than the stress of user jobs (especially PAT-based analysis) due to the number of active branches in ROOT; see CMS Internal Note 2009-18. Many active branches in ROOT result in a large number of small reads; a CMS job on an idle system will typically read no more than 32KB per read and achieve 1MB/s. Hence, 1000 jobs will generate about 30,000 IOPS if they are not bound by the underlying disk system. Because the HDFS installs have relatively high bandwidth due to the large number of data nodes, but the same number of hard drives as other systems, bandwidth is usually not a concern while I/O operations per second (IOPS) is.
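The IOPS estimate above follows from dividing each job's data rate by its read size. A minimal Python sketch, assuming 1 MB = 1024 KB (the function names are illustrative and not part of any CMS tool):

```python
def reads_per_second(rate_mb_s, read_kb, kb_per_mb=1024):
    """Reads per second one job issues at a given data rate and read size."""
    return rate_mb_s * kb_per_mb / read_kb


def aggregate_iops(jobs, rate_mb_s=1.0, read_kb=32):
    """Total read operations per second for `jobs` concurrent jobs."""
    return jobs * reads_per_second(rate_mb_s, read_kb)


if __name__ == "__main__":
    # Figures from the text: 1000 jobs, 1MB/s each, 32KB per read.
    print(aggregate_iops(1000))
```

With these assumptions, one job issues 32 reads per second and 1000 jobs issue 32,000, which the text rounds to 30,000 IOPS.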
See the below graphs demonstrating a large number of IOPS; even at the max request rate, the corresponding bandwidth required is only 5Gbps. For the hard drives deployed at the time the graph was generated, this represented about 60 IOPS per hard drive, which matched independent benchmarks of the hard drives. The bandwidth usage of 5Gbps represents only a fraction of the bandwidth available to HDFS. Because HDFS approaches the underlying hardware limits of the system during production, we consider typical user jobs to be the best stressor of the system. Such “stress tests” occur in large batches on a weekly basis at both Nebraska and Caltech.

During the tests in this requirement and others, Nebraska and Caltech’s systems were in full production for CMS – simulation, analysis, and WAN transfer – and often the batch slots were 100% utilized. By default, data went to HDFS and only a few datasets were kept on dCache. UCSD’s system was smaller and shared the CMS activities with a dCache instance.

Requirement 17. As part of the stress tests, the site should intentionally cause failures in various parts of the storage system, to demonstrate the recovery mechanisms.

As noted in Requirement 16, an HDFS instance in large-scale production is sufficient for demonstrating stress. During production at Nebraska and at Caltech, we have observed failures of the following components:

Namenode: When a namenode dies, the only currently used recovery mechanism is to replace the server (or fix the existing server) and copy a checkpoint file into the appropriate directory. A high-availability setup has not yet been investigated by our production sites, mostly due to the perceived complexity for little perceived benefit (namenode failure is rare). This has been demonstrated in production at Nebraska and Caltech. When the namenode fails, writes will not continue and reads will fail if the client had not yet cached the block locations for open files.
Datanode: Datanode failures are designed to be an everyday occurrence, and they have indeed occurred at both Nebraska and Caltech. The largest operational impact is the amount of traffic generated by the system while it is re-replicating blocks to new hosts.

Globus GridFTP servers: Each transfer is spawned as a separate process on the host by xinetd. This results in the server being extremely reliable in the face of failures or bugs in the GridFTP server. When a GridFTP host dies, others may be used by SRM. Nebraska and UCSD have implemented schemes where the SRM server stops sending new transfers to a GridFTP server that has died. Caltech has also implemented a GridFTP appliance integrated with the Rocks cluster management software that can be used to install and configure a new GridFTP server in 10 minutes.

SRM server: When the SRM server fails, all SRM-based transfers will fail until it has been restarted manually (the service health is monitored via RSV). This happens infrequently enough in production that no automated system has been implemented, although LVS-based failover and load-balancing is plausible because BeStMan is stateless. Caltech has implemented a BeStMan appliance integrated with the Rocks cluster management software that can be used to install and configure a new BeStMan server in 10 minutes.

8. Security Concerns

HDFS

HDFS has unix-like user/group authorization, but no strict authentication. HDFS should only be exposed to a secure internal network which only non-malicious users are able to access. For users with unrestricted access to the local cluster, it is not difficult at all to bypass authentication. There is no encryption or strong authentication between the client and server, meaning that both the server and the client must be trusted. This is the primary reason why HDFS must be segregated onto an internal network. It is possible to reasonably lock down access by:
1. Preventing unknown or untrusted machines from accessing the internal network.
This requirement can be removed by turning on SSL sockets in lieu of regular sockets for inter-process communication. We have not pursued this method due to the perceived performance penalty.
a. “Untrusted machines” include end users’ laptops and desktops, which should not access HDFS directly. Such access could instead be allowed via Xrootd redirectors (for ROOT-based analysis) or by exporting the file system via HTTPS (allowing whole-file download).
2. Preventing non-FUSE users from accessing HDFS ports on the known machines on the network. This means only the HDFS FUSE process will be able to access the datanodes and namenode; this allows the Linux filesystem interface to sanitize requests and prevents users from having TCP-level access to HDFS.

It’s important to point out that in (2), we are relying on the security of the clients on the network. If a host is compromised at the root level, the attacker can perform any arbitrary action with sufficient effort.

During the various tests outlined above, the sites’ security was based on either the internal NAT (Caltech and Nebraska) or firewalls eliminating access to the outside world (UCSD).

Security concerns are actively being worked on by Yahoo. The progress can be followed on this master JIRA issue: https://issues.apache.org/jira/browse/HADOOP-4487

In release 0.21.0, access tokens issued by the namenode prevent clients from accessing arbitrary data on the datanode (currently, one only needs to know the block ID to access it). Also in 0.21.0, the transition to the Java Authentication and Authorization Service has begun; this will provide the building blocks for Kerberos-based access (Yahoo’s eventual end goal). Judging by current progress, transitioning to Kerberos-based components could happen during 2010.

If a vulnerability is discovered, we would release updated RPMs within one workweek (sooner if the packaging is handled by the VDT). This probably will not be necessary as the security model is already very permissive.
Security vulnerabilities are one of the few reasons we will update the “golden set” of RPMs.

Note: Example damage a rogue batch job could do

To demonstrate the security model, we give a few examples of what a rogue job could do:

Excessive memory usage by the rogue job could starve the datanode process and cause it to crash. Most sites limit the amount of memory allowed for individual batch jobs, so this is not a big concern.

If the rogue job has write access to the datanode partition, then it could fill up the partition with garbage, which would prevent the datanode from writing any further blocks. This will not cause the datanode to fail, but will cause a loss of usable space in the SE.
  o Most sites use Unix file system permissions to prevent this.

A malicious batch job with telnet access to the Hadoop datanode could request any block of data if it knows the block ID. This is fixed in the HDFS 0.21.0 branch (to be released approximately in November).

A malicious batch job with telnet access to the Hadoop namenode could perform arbitrary file system commands. This could result in a lot of damage to the storage system, which is why we recommend client-side firewalls.
  o This is a known weakness in the current security model and is being addressed in current Hadoop development.

Grid Components (GridFTP and BeStMan)

Globus GridFTP and BeStMan both use standard GSI security with VOMS extensions; we assume this is familiar to both CMS and FNAL. Because both components are well-known, we do not examine their security models here. If a vulnerability is discovered in any of these components, we would release an RPM update once our upstream source (the VDT) has this update. The target response time would be one workweek while packaging is done at Caltech, and in lockstep with the VDT update when that team does packaging.

9. Risk Analysis

In this section, we analyze different risks that are posed to the different pieces of the HDFS-based SE.
We attempt to present the most pressing risks in the proposed solution (both technical and organizational), and point out any mitigating factors.

HDFS

HDFS is both the core component and a component external to grid computing. Hence, its risk must be examined most closely.

1. Health of Hadoop project: HDFS is completely dependent on the existence and continued maintenance of the Hadoop project. Continued development and growth of this project is critical. Hadoop is a top-level project of the Apache Software Foundation; in order to achieve this status, the following requirements had to be met:
a. Legal
   i. All code ASL'ed (Apache Software License, a highly permissive open-source license).
   ii. The code base must contain only ASL or ASL-compatible dependencies.
   iii. License grant complete.
   iv. Contributor License Agreement on file.
   v. Check of project name for trademark issues.
   This legal legwork protects us from code licensing issues and various other legal issues.
b. Meritocracy / Community
   i. Demonstrate an active and diverse development community.
   ii. The project is not highly dependent on any single contributor (there are at least 3 legally independent committers and there is no single company or entity that is vital to the success of the project).
   iii. The above implies that new committers are admitted according to ASF practices.
   iv. ASF style voting has been adopted and is standard practice.
   v. Demonstrate ability to tolerate and resolve conflict within the community.
   vi. Release plans are developed and executed in public by the community.
   vii. The ASF Board has voted for final acceptance as a Top Level Project.

The ASF has shown that these community guidelines and requirements are hallmarks of a good open source project. The fact that HDFS is an ASF project and not a Yahoo corporate project means that it is not tied to the health of Yahoo. The current HDFS lead is employed by Facebook, not Yahoo.
At this point in the project’s life, about 40% of the patches come from non-Yahoo employees.

Regarding Yahoo's recent change to Microsoft as its search engine provider, Yahoo has made public statements that:
o Hadoop is used for almost every piece of the Yahoo infrastructure, including: spam fighting, ads, news, and analytics.
o Hadoop is critical to Yahoo as a company, and is not a subproject of the search engine.
o It is possible that money previously invested into the search engine technology will now be invested into Hadoop.

Cloudera has received about $16 million in start-up capital and employs several key developers, including Doug Cutting, the original author of the system.

Hadoop maintains a listing of web sites and companies utilizing its technology, http://wiki.apache.org/hadoop/PoweredBy.

Condor currently funds a developer working on Hadoop, and is investigating the use of HDFS as a core component.

While we believe these reasons mitigate the risk of HDFS development stagnating, we believe this is the top long-term risk associated with the project.

2. Hadoop support / resolution of bugs: There is no direct monetary support for large-scale HDFS development, nor is the success of HDFS dependent upon WLCG usage. We have no paid support for HDFS (although it can be purchased). This is mitigated by:
a. Paid support is available: We have good contacts with the Cloudera technical staff, and would be able to purchase development support as needed. Several project committers are on Cloudera staff.
b. Critical bugs affect large corporations: Any bug we are exposed to affects Yahoo and Facebook, whose businesses depend on HDFS. Hence, any data loss bug we discover will be of immediate interest to their development teams. When Nebraska started with HDFS, we had issues with blocks truncated by ext3 file system recovery. This triggered a long investigation by a member of the Yahoo HDFS team, resulting in many patches for 0.19.0.
Since that version, we have not seen the truncation issue again.
c. Acceptance of patches: Nebraska has contributed on the order of 5 patches to HDFS, and has not had issues with getting patches accepted by the upstream project. The major issue has been passing the acceptance criteria – each patch must meet coding guidelines, pass code review from a different coder, and come with a unit test (or an explanation of why a new unit test is not needed).
   i. We have opened 30 issues. 10 of these issues have been fixed. 4 have been closed as duplicate. 4 have been closed as invalid. 12 remain open; 6 of these have a patch available that has not been committed. Of the remaining open issues, only 1 has a patch applied to our local distribution (the same patch is also applied to the Cloudera distribution).
d. Large number of unit tests: HDFS core has good unit test coverage (Clover coverage of 76%; see http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/clover/). All nontrivial commits require a unit test to be committed along with them. Because of the initial difficulties in getting completely safe sync/append functionality, a large set of new unit tests was developed for 0.21.0 based on a fault-injection framework. The fault-injection framework gives developers the ability to demonstrate not only correct behaviors, but correct behaviors under a variety of fault conditions. The unit tests are run nightly using Apache Hudson (http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/) and take several hours.

Each point helps to mitigate the issue, but does not completely remove it. In the extreme case, we are prepared to run on locally-developed patches that are not accepted by the upstream project. This would hurt our efforts to keep support costs under control, so we would avoid this situation where possible.

3. Hadoop feature set: We believe HDFS currently has all the features necessary for adoption. We do not believe that any new features are required in the core.
However, it should be pointed out that the system is not fully POSIX compliant. Specifically, the following are missing:
a. File update support: Once a file is closed, it cannot be altered. In HDFS 0.21.0, append support will be enabled. We do not believe this will ever be necessary for USCMS.
b. Multiple write streams / random writes: Only a single stream of data can write to an open file, and doing a seek() during a write is not supported. This means that one may not write a TFile directly to HDFS using ROOT; first, the file must be written to local disk, then copied to HDFS. We developed in-memory stream reordering in GridFTP in order to avoid this limitation. If USCMS decides to write files directly to the SE and not use local scratch, HDFS will not immediately support this. We believe this to be a low risk.
c. Flush and append: A file is not guaranteed to be fully visible until it has been closed. Until it is closed, it is not defined how much data a reader will see if it attempts to read the file. Flush and append support will be available in HDFS 0.21.0, which will provide guaranteed semantics about when data will be available to readers. We do not believe this will be an issue for CMS.

FUSE/FUSE-DFS

FUSE-DFS, as a contributed project in Hadoop, shares many risks with HDFS. There are a few concerns we believe relevant enough to merit their own category.

1. FUSE support: FUSE has no commercial company providing support. However, it is a part of the mainline Linux kernel, is over 5 years old, and has had a stable interface for quite a while. We have never seen any issue from FUSE itself. We believe the FUSE kernel module is a risk because OSG has less experience packaging kernel modules, and kernel modifications often result in support issues.
This is mitigated by the fact that OSG-supported Xrootd requires the FUSE kernel module (meaning that HDFS isn’t unique in this situation) and that the ATrpms repository provides a FUSE kernel module and tools to build the RPM module for non-standard kernels. UCSD and Caltech both build their own kernel modules; Nebraska uses the ATrpms ones.

2. FUSE-DFS support: FUSE-DFS is the name of the userland library that implements the FUSE filesystem. This was originally implemented by Facebook and is in the HDFS SVN repository as a contributed module. It does not have the same level of support as HDFS Core because it is contributed; it also does not have as many companies using it in production. Through the process of adopting HDFS, we have discovered critical bugs and have submitted several FUSE-DFS related patches, which were accepted. We have even recently found memory leak bugs in the libhdfs C wrappers (these had not been previously discovered because they are only noticeable in continuous production). We believe this component is the highest-risk software component of the entire solution in the short term. The mitigating factors for FUSE-DFS are:
a. Small, stable codebase: FUSE-DFS is basically a small layer of glue between FUSE and the libhdfs library (a core HDFS component, used by Yahoo). The entire code base is around 2,000 lines, about 4% of the total HDFS size. During our usage of HDFS, neither the libhdfs nor the FUSE API has changed. This limits the number of undiscovered bugs and the rate of bug introduction. We believe the majority of the possible issues are fixed.
b. Production experience: We have been running FUSE for more than 8 months, and feel that we have a good understanding of possible production issues. As of the latest release, the largest outstanding issue is the fact that FUSE must be remounted whenever users are added or removed from groups (user-to-group mappings are currently cached indefinitely). This is well understood and possible to work around.
This bug may be mitigated in future versions of HDFS, as a fix will be necessary for the future Kerberos-based authorization / authentication.
c. Extensive debugging experience available: The last FUSE-DFS memory leak bug tackled required in-depth debugging at Nebraska. We believe we have the experience and tools necessary to handle any future bugs. We intend to make sure that any locally developed patches are upstreamed to the HDFS project.

There have been other FUSE binding attempts, but this is the only one that has been supported or developed by a major company (Facebook) and committed as a part of the HDFS project. The other attempts appear to have never been completed or kept up to date with HDFS.

BeStMan

BeStMan is an already supported component of the OSG. We have identified the associated risks:
1. BeStMan runs out of funding: As BeStMan is quickly becoming an essential OSG package, we believe that it will always meet the needs of USLHC, even if it is not funded at LBNL.
2. BeStMan currently uses Globus 3 container: The Globus 3 web services container was never in large-scale use, and currently suffers from debilitating bugs and an unmaintained architecture. The BeStMan team is currently devoting most of its effort to replacing this with an industry-standard Tomcat webapp container. This should be delivered in fall-winter 2009. We believe this will remove many bugs and improve the overall source code. This would make it possible for external parties to submit improvements.

Globus GridFTP

Globus GridFTP is an already-supported component of the OSG. We have identified the associated risks:
1. Globus GridFTP runs out of funding: Globus GridFTP is an essential component of the OSG. If it runs out of funding, we will use whatever future solution the OSG adopts.
2. Globus GridFTP model possibly not satisfactory: The Globus GridFTP model is based on processes being launched by xinetd.
Because each transfer is a separate process, issues affecting one transfer are well isolated from other transfers. However, this makes it extremely hard to enforce limits on the number of active transfers per node. This can lead to either instability issues (by having no limit) or odd errors (globus-url-copy does not gracefully report when xinetd refuses to start new servers). We would like to investigate multi-threaded daemon-mode Globus GridFTP, but have not identified effort yet. Current T2 sites mitigate this by mostly controlling the number of concurrent transfers (except CRAB stageouts) and providing sufficient hardware to accommodate an influx of transfers.

Component Plug-ins

Both BeStMan and GridFTP require plug-ins in order to achieve the desired level of functionality in this SE. We have identified the associated risks:
1. Future changes in versions of underlying components: We may have to update plugin code if the related component changes its interface. For example, BeStMan2 may require a new Java interface to implement GridFTP selector plugins. Even if the API remains the same, it’s possible for the underlying assumptions to change – e.g., if the GridFTP plug-in needed to become thread-safe.
2. Original author leaves USCMS: If the original author leaves USCMS, then much knowledge would be lost, even if the effort is replaced. This is why focus is being put into clean packaging, documentation, and ownership by an organization (OSG) as opposed to just one person. The BeStMan component is relatively simple and straightforward, mitigating this concern. The GridFTP component is not, due to the complexity of the Globus DSI interface (by far, the most complex interface in the SEs). This is high-performance C code and difficult to change. If the original author left and the Globus DSI module changed significantly, USCMS would need to invest about 1 man-month of effort to perform the upgrade.
This is mitigated by the fact that the current system does not require any GridFTP feature upgrades; USCMS can run on the same plug-in for a significant amount of time.
Packaging
We have worked hard to provide packaging for the entire solution. The current packaging does present a few pitfalls:
1. Original author leaves USCMS: The setup at Caltech is based on “mock”, the standard Fedora/Red Hat build tool. The VDT does not currently have the processes in place to package RPMs effectively, but this is a planned development for Year 4. Until the packaging duties can be transferred from Caltech to VDT (perhaps late Year 4), we will be dependent on the setup there. We are attempting to get it better documented in order to mitigate risk.
2. Patches fail to get upstreamed: It is crucial to send patches upstream and maintain the minimum number of changes from the base install. We must remain diligent in making sure to commit upstream fixes for any bugs.
3. Rate of change: Even with only bug fix updates for “golden releases”, the rate of updates is always worrying. Most of the updates recently have been related to packaging issues, especially for platforms not present at any production T2 cluster. We hope that the added OSG effort in Year 4 will enable us to drastically reduce the rate of change.
4. Update mechanisms for ROCKS clusters: Currently, doing a “yum install” is the correct way to install the latest version of the software. However, when an administrator adds the RPMs to a ROCKS roll, they get locked into that specific version and must manually take action to upgrade the RPMs. This means there will always be significant resistance to changing versions. This makes decreasing the rate of updates even more important.
Experts and Funding
Much of this work was done using several CMS experts. We outline two risks:
a) Loss of experts: As mentioned above, we take a significant hit if our experts leave the organization.
We are focusing heavily on documentation, packaging, and “finishing off” development (in fact, preparation for this review has prompted us to clear several long-standing issues). This will not only allow us to produce the first “golden set”, but also increase the length of time HDFS can be maintained between experts.
   a. A significant amount of CMS T2 funding comes from the DISUN project, which ends in Spring 2010. DISUN personnel contribute to the HDFS effort. This is an ongoing concern for the HDFS effort and the CMS T2 program as a whole.
b) Loss of OSG: Much of the risk and effort is being shared with the OSG to leverage their packaging expertise. Having HDFS in the OSG taps into an additional pool of human resources outside the experts in USCMS. However, the current funding for the OSG runs out in 2 years (and is reduced in 1 year). If the OSG funding is lost, then we will again have to rely internally on USCMS personnel, similar to FY2009.
The catastrophe scenario for HDFS adoption is the loss of both OSG funding and the experts. In this case, the survival plan would be:
1. Identify funding for new experts (from experience, it takes about 6 months to train a new expert once they are in place). New experts can be drawn from the pool of HDFS sysadmins; as HDFS gains wider use, the pool of potential experts is broadened.
2. No new “golden set” until a packaging, testing, and integration program can be re-established. If this becomes a chronic problem, a hard focus would be placed on switching entirely to Cloudera’s distribution in order to offload the Q/A testing of major changes to an external organization.
3. No new USCMS-specific features. We believe that HDFS has all the major features necessary for CMS adoption, but we do find small useful ones (an example would be the development of Ganglia 3.1 compatibility). Without a local expert, developing these for CMS would not be possible.
Without a local expert, running with any patches not accepted by the upstream project becomes increasingly dangerous.