Sandia National Laboratories
Scalable Computing Systems
Fifth NASA/DOE Joint PC Cluster
October 6-8, 1999
Conceptual Partition Model
[Diagram: compute partition and file I/O partition]
File I/O Model
• Support large-scale unstructured grid applications.
– Manipulate a single file per application, not one per processor.
• Support collective I/O libraries.
– Require fast concurrent writes to a single file.
• Need a file system NOW!
• Need scalable, parallel I/O.
• Need file management infrastructure.
• Need to present the I/O subsystem as a single parallel file system both
internally and externally.
• Need production-quality code.
• Provide independent access to file systems on each I/O node.
– Can’t stripe across multiple I/O nodes to get better performance.
• Add a file management layer to “glue” the independent file systems so
as to present a single file view.
– Requires users (both on and off Cplant) to differentiate between this
“special” file system and other “normal” file systems.
– Lots of special utilities are required.
• Build our own parallel file system from scratch.
– A lot of work just to reinvent the wheel, let alone the right wheel.
• Port other parallel file systems into Cplant.
– Also a lot of work with no immediate payoff.
• Build our I/O partition as a scalable nexus between Cplant and external storage servers.
+ Leverage off existing and future parallel file systems.
+ Allow immediate payoff with Cplant accessing existing file systems.
+ Reduce data storage, copies, and management.
– Expect lower performance with non-local file systems.
– Waste external bandwidth when accessing scratch files.
Building the Nexus
– How can and should the compute partition use this service?
– What are the components and protocols between them?
– What do we have now, and what do we hope to achieve in the future?
Compute Partition Semantics
– Allow users to be in a familiar environment.
• No support for ordered operations (e.g., no O_APPEND).
• No support for data locking.
– Enable fast non-overlapping concurrent writes to a single file.
– Prevent a job from slowing down the entire system for others.
• Additional call to invalidate buffer cache.
– Allow file views to synchronize when required.
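These semantics can be illustrated with POSIX `pwrite`: because each writer touches only its own disjoint byte range, the file system needs no locking or ordered-append support. This is a sketch of the access pattern the model enables, not Cplant code; the record size and process count are assumptions.

```python
import os
import tempfile
import threading

RECORD = 4096   # assumed per-process record size
NPROCS = 8      # simulated compute processes

def write_region(fd, rank):
    # Each rank writes only its own disjoint byte range, so no
    # locking or ordering guarantee (e.g., O_APPEND) is needed.
    os.pwrite(fd, bytes([rank]) * RECORD, rank * RECORD)

fd, path = tempfile.mkstemp()
threads = [threading.Thread(target=write_region, args=(fd, r))
           for r in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Read each region back; every range should be intact despite the
# concurrent writers.
regions = [os.pread(fd, RECORD, r * RECORD) for r in range(NPROCS)]
os.close(fd)
os.unlink(path)
```

Threads stand in for compute nodes here; the point is that non-overlapping `pwrite` calls compose safely without any coordination in the file system.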
[Diagram: compute nodes connected through a row of I/O nodes to Enterprise Storage Services]
• I/O nodes present a symmetric view.
– Every I/O node behaves the same (except for the cache).
– Without any control, a compute node may open a file with one I/O node,
and write that file via another I/O node.
• I/O partition is fault-tolerant and scalable.
– Any I/O node can go down without the system losing jobs.
– An appropriate number of I/O nodes can be added to scale with the compute partition.
• I/O partition is the nexus for all file I/O.
– It provides our POSIX-like semantics to the compute nodes and
accomplishes tasks on their behalf outside the compute partition.
• Links/protocols to external storage servers are server dependent.
– External implementation hidden from the compute partition.
Compute -- I/O node protocol
• Base protocol is NFS version 2.
– Stateless protocols allow us to repair faulty I/O nodes without aborting jobs.
– Inefficiency/latency between the two partitions is currently moot; the
bottleneck is not here.
– Larger I/O requests.
– Propagation of a call to invalidate cache on the I/O node.
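The value of a stateless, NFSv2-style base protocol can be sketched as follows: every request names the file handle and offset explicitly, so the server holds no per-client state and any server instance can satisfy a retry. The class and field names below are illustrative, not the actual Cplant protocol.

```python
class StatelessIONode:
    """Holds no per-client state; each request is self-describing."""

    def __init__(self, backing_store):
        self.store = backing_store   # maps file handle -> bytes

    def read(self, fhandle, offset, count):
        # Idempotent: repeating the same request returns the same bytes,
        # which is what makes transparent retry after a fault possible.
        return self.store[fhandle][offset:offset + count]

store = {"fh-42": b"unstructured grid data"}
node_a = StatelessIONode(store)
first = node_a.read("fh-42", 13, 4)

# node_a "fails"; the client simply retries against a replacement node
# and gets an identical answer, so the job need not abort.
node_b = StatelessIONode(store)
retry = node_b.read("fh-42", 13, 4)
```

This is why a faulty I/O node can be repaired or replaced mid-run: no open-file state dies with it.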
• Basic implementation of the I/O nodes
• Have straight NFS inside Linux with ability to invalidate cache.
• I/O nodes have no cache.
• I/O nodes are dumb proxies knowing only about one server.
• Credentials are rewritten by the I/O nodes and sent to the server as if
the requests came from the I/O nodes.
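The credential-rewriting step can be sketched like this: the proxy strips the compute node's identity and resubmits the request under its own, so the external server only ever has to trust the I/O nodes. The uid/gid values and field names are invented for illustration.

```python
IO_NODE_UID = 500   # assumed identity the I/O node presents upstream
IO_NODE_GID = 500

def rewrite_credentials(request):
    """Return a copy of the request as the external server will see it."""
    rewritten = dict(request)
    rewritten["uid"] = IO_NODE_UID       # compute node's uid is replaced
    rewritten["gid"] = IO_NODE_GID
    rewritten["origin"] = "io-node"      # server never sees the client
    return rewritten

incoming = {"op": "write", "uid": 1234, "gid": 100, "origin": "compute-17"}
outgoing = rewrite_credentials(incoming)
```

The operation itself passes through untouched; only the identity fields change.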
• I/O nodes are attached via 100BaseT links to a Gigabit Ethernet with an
SGI O2K as the (XFS) file server on the other end.
• Don’t have jumbo packets.
• Bandwidth is about 30 MB/s with 18 clients driving 3 I/O nodes, each
using about 15% of CPU.
– That is roughly 10 MB/s per I/O node, near the ~12.5 MB/s ceiling of a
100BaseT link.
• Put a VFS infrastructure into the I/O node daemon.
– Allow access to multiple servers.
– Allow a Linux /proc interface to tune individual I/O nodes quickly and
easily.
– Allow vnode identification to associate buffer cache with files.
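What a VFS layer buys the daemon can be sketched as a mount table dispatching each path to filesystem-specific operations, so one daemon reaches several servers. The class names and mount points below are assumptions for illustration, not the planned implementation.

```python
class NFSOps:
    """Operations for paths served by an external NFS server."""
    name = "nfs"

class LocalOps:
    """Operations for a local (scratch) file system."""
    name = "local"

# Mount table: path prefix -> filesystem-specific operations table.
mounts = {"/enterprise": NFSOps(), "/scratch": LocalOps()}

def lookup(path):
    """Pick the ops table whose mount prefix matches, longest first,
    and return it with the server-relative (vnode-style) remainder."""
    for prefix in sorted(mounts, key=len, reverse=True):
        if path.startswith(prefix):
            return mounts[prefix], path[len(prefix):]
    raise FileNotFoundError(path)

ops, vnode = lookup("/scratch/job7/out.dat")
```

The per-file remainder returned by `lookup` is also what lets a buffer cache be associated with individual files rather than with a single server.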
• Experiment with a multi-node server (SGI/CXFS).
• Stop retries from going out of the network.
• Put in jumbo packets.
• Put in read cache.
• Put in write cache.
• Port over Portals 3.0.
• Put in bulk data services.
• Allow dynamic compute-node-to-I/O-node mapping.
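One way dynamic compute-node-to-I/O-node mapping could work is a deterministic map over the current set of live I/O nodes; this is an assumed strategy for illustration, not the Cplant design.

```python
def io_node_for(compute_rank, live_io_nodes):
    """Deterministically map a compute rank onto the current I/O set.
    All nodes that agree on the live set compute the same mapping."""
    return live_io_nodes[compute_rank % len(live_io_nodes)]

io_nodes = ["io0", "io1", "io2"]
before = io_node_for(7, io_nodes)

# io1 goes down; ranks remap across the survivors with no coordination
# beyond agreeing on the live set, so jobs keep running.
io_nodes = ["io0", "io2"]
after = io_node_for(7, io_nodes)
```

A dynamic map like this is what makes the symmetric-I/O-node property (any node can serve any request) pay off for fault tolerance.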
Looking for Collaborations