

                              Cplant I/O

         Pang Chen
         Lee Ward
Sandia National Laboratories
Scalable Computing Systems

Fifth NASA/DOE Joint PC Cluster
      Computing Conference
        October 6-8, 1999

        Conceptual Partition Model

   [Diagram: the machine divided into Compute, File I/O, and Net I/O
    partitions]
                           File I/O Model

• Support large-scale unstructured grid applications.
    – Manipulate single file per application, not per processor.
• Support collective I/O libraries.
    – Require fast concurrent writes to a single file.


                          Requirements

• Need a file system NOW!
• Need scalable, parallel I/O.
• Need file management infrastructure.
• Need to present the I/O subsystem as a single parallel file system both
  internally and externally.
• Need production-quality code.


                      Possible Approaches

• Provide independent access to file systems on each I/O node.
    – Can’t stripe across multiple I/O nodes to get better performance.
• Add a file management layer to “glue” the independent file systems so
  as to present a single file view.
    – Require users (both on and off Cplant) to differentiate between this
      “special” file system and other “normal” file systems.
    – Lots of special utilities are required.
• Build our own parallel file system from scratch.
    – A lot of work just to reinvent the wheel, let alone the right wheel.
• Port other parallel file systems into Cplant.
    – Also a lot of work with no immediate payoff.

                       Current Approach

• Build our I/O partition as a scalable nexus between Cplant and external
  file systems.
    +   Leverage off existing and future parallel file systems.
    +   Allow immediate payoff with Cplant accessing existing file systems.
    +   Reduce data storage, copies, and management.
    –   Expect lower performance with non-local file systems.
    –   Waste external bandwidth when accessing scratch files.

                    Building the Nexus

• Semantics
   – How can and should the compute partition use this service?
• Architecture
   – What are the components and protocols between them?
• Implementation
   – What do we have now, and what do we hope to achieve in the future?

            Compute Partition Semantics

• POSIX-like.
    – Allow users to be in a familiar environment.
• No support for ordered operations (e.g., no O_APPEND).
• No support for data locking.
    – Enable fast non-overlapping concurrent writes to a single file.
    – Prevent a job from slowing down the entire system for others.
• Additional call to invalidate buffer cache.
    – Allow file views to synchronize when required.

                         Cplant I/O

   [Diagram: compute partition connected to a row of I/O nodes, which
    connect to Enterprise Storage Services]

• I/O nodes present a symmetric view.
    – Every I/O node behaves the same (except for the cache).
    – Without any control, a compute node may open a file with one I/O node,
      and write that file via another I/O node.
• I/O partition is fault-tolerant and scalable.
    – Any I/O node can go down without the system losing jobs.
    – Appropriate numbers of I/O nodes can be added to scale with the compute
      partition.
• I/O partition is the nexus for all file I/O.
    – It provides our POSIX-like semantics to the compute nodes and
      accomplishes tasks on their behalf outside the compute partition.
• Links/protocols to external storage servers are server dependent.
    – External implementation hidden from the compute partition.

            Compute -- I/O node protocol

• Base protocol is NFS version 2.
    – Stateless protocols allow us to repair faulty I/O nodes without aborting
      jobs.
    – Inefficiency/latency between the two partitions is currently moot; the
      bottleneck is not here.
• Extension/modifications:
    – Larger I/O requests.
    – Propagation of a call to invalidate cache on the I/O node.

                Current Implementation

• Basic implementation of the I/O nodes
• Have straight NFS inside Linux with ability to invalidate cache.
• I/O nodes have no cache.
• I/O nodes are dumb proxies knowing only about one server.
• Credentials are rewritten by the I/O nodes and sent to the server as if
  the requests came from the I/O nodes.
• I/O nodes are attached via 100BaseT links to a Gb Ethernet with an SGI
  O2K as the (XFS) file server on the other end.
• Don’t have jumbo packets.
• Bandwidth is about 30MB/s with 18 clients driving 3 I/O nodes, each
  using about 15% of CPU.

                 Current Improvements

• Put a VFS infrastructure into I/O node daemon.
    – Allow access to multiple servers.
    – Allow a Linux /proc interface to tune individual I/O nodes quickly.
    – Allow vnode identification to associate buffer cache with files.
• Experiment with a multi-node server (SGI/CXFS).

                 Future Improvements

•   Stop retries from going out of network.
•   Put in jumbo packets.
•   Put in read cache.
•   Put in write cache.
•   Port over Portals 3.0.
•   Put in bulk data services.
•   Allow dynamic compute-node-to-I/O-node mapping.

Looking for Collaborations

           Lee Ward

           Pang Chen

