					Isilon Clustered Storage
OneFS
Nick Kirsch
Introduction




•   Who is Isilon?
•   What Problems Are We Solving? (Market Opportunity)
•   Who Has These Problems? (Our Customers)
•   What Is Our Solution? (Our Product)
•   How Does It Work? (The Cool Stuff)
Who is Isilon Systems?

•   Founded in 2000
•   Located in Seattle (Queen Anne)
•   IPO’d in 2006 (ISLN)
•   ~400 employees
•   Q3 2008 Revenue: $30 million, 40% Y/Y
•   Co-founded by Paul Mikesell, UW/CSE

• I’ve been at the company for 6+ years
What Problems Are We Solving?

Structured Data               Unstructured Data




•   Small files               •   Larger files
•   Modest-size data stores   •   Very large data stores
•   I/O intensive             •   Throughput intensive
•   Transactional             •   Sequential
•   Steady capacity growth    •   Explosive capacity growth
Traditional Architectures
  •   Data Organized in Layers of Abstraction
      • File System, Volume Manager, RAID
  •   Server/Storage Architecture - “Head” and “Disk”
  •   Scale Up (vs Scale Out)


  •   Islands of Storage
  •   Hard to Scale
  •   Performance Bottlenecks
  •   Not Highly Available
  •   Overly Complex
  •   Cost Prohibitive

  [Diagram: three separate storage devices (#1, #2, #3), each its own island]
Who Has These Problems?
            Worldwide File and Block Disk Storage Systems, 2005-2011* (capacity in PB)

            [Chart: by 2011, 75% of all storage capacity sold will be for file-based data]
            •   File Based: 79.3% CAGR
            •   Block Based: 31% CAGR

                                                             * Source: IDC, 2007
•          Isilon has over 850 customers today.
What is Our Solution?
 OneFS™ intelligent software + enterprise-class hardware = Isilon IQ Clustered Storage

 [Image: a 3-node Isilon IQ cluster]

 •   Scales to 96 nodes
 •   2.3 PB in a single file system
 •   20 GB/s aggregate throughput
Clustered Storage Consists Of “Nodes”
 •   Largely Commodity Hardware
 •   Quad-core 2.3 GHz CPU
 •   4 GB memory read cache
 •   GbE and 10GbE for front-end network
 •   12 disks per node

 •   InfiniBand for intra-cluster communication
 •   High-speed NVRAM journal
 •   Hot-swappable disks, power supplies, and fans

 •   NFS, CIFS, HTTP, FTP
 •   Integrates with Windows and UNIX

 •   OneFS operating system
Isilon Network Architecture

    [Diagram: CIFS and NFS clients (or either) connect over standard Ethernet to the cluster]

•      Drop-in replacement for any NAS device
•      No client-side drivers required (unlike Andrew FS/Coda or Lustre)
•      No application changes required (unlike Google FS or Amazon S3)
•      No changes required to adopt.
How Does It Work?

• Built on FreeBSD 6.x (originally 5.x)
   •   New kernel module for OneFS
   •   Modifications to the kernel proper
   •   User space applications
   •   Leverage open-source where possible
   •   Almost all of the heavy-lifting is in the kernel
• Commodity Hardware
   • A few exceptions:
       •   We have a high-speed NVRAM journal for data consistency
        •   We have an InfiniBand low-latency cluster interconnect
       •   We have a close-to-commodity SAS card (commodity chips)
       •   A custom monitoring board (fans, temps, voltages, etc.)
       •   SAS and SATA disks
OneFS architecture
• Fully Distributed

    • Top Half (“Initiator”)
        •   Network operations (TCP, NFS, CIFS)
        •   FEC calculations, block reconstruction
        •   VFS layer, locking, etc.
        •   File-indexed cache

    • Bottom Half (“Participant”)
        •   Journal and disk operations
        •   Block-indexed cache




•   The OneFS architecture is basically an InfiniBand SAN (see the sketch after this slide)
    •   All data access across the back-end network is block-level
    •   The participants act as very smart disk drives
    •   Much of the back-end data traffic can be RDMA
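To make the initiator/participant split concrete, here is a minimal user-space C sketch of the idea: the initiator maps a file block to a (participant node, disk block) pair and issues a block-level request, while the participant serves raw blocks the way a very smart disk drive would. The striping rule and the names (locate, participant_read) are assumptions for illustration, not OneFS interfaces.

    /* Toy model of the initiator/participant split: block-level access only.
     * This is a sketch, not OneFS code; the striping rule and names are invented. */
    #include <stdio.h>
    #include <string.h>

    #define NODES           3
    #define BLOCK_SIZE      8192
    #define BLOCKS_PER_NODE 16

    /* Each "participant" is just an array of raw blocks. */
    static char disks[NODES][BLOCKS_PER_NODE][BLOCK_SIZE];

    /* Participant side: serve a raw block, no file-system knowledge. */
    static void participant_read(int node, int blkno, char *out)
    {
        memcpy(out, disks[node][blkno], BLOCK_SIZE);
    }

    static void participant_write(int node, int blkno, const char *in)
    {
        memcpy(disks[node][blkno], in, BLOCK_SIZE);
    }

    /* Initiator side: it knows the layout (here: trivial round-robin striping)
     * and turns file blocks into (node, disk block) pairs. */
    static void locate(long file_block, int *node, int *blkno)
    {
        *node  = (int)(file_block % NODES);
        *blkno = (int)(file_block / NODES);
    }

    int main(void)
    {
        char buf[BLOCK_SIZE];
        int node, blkno;

        /* Write file block 4 through the initiator. */
        locate(4, &node, &blkno);
        memset(buf, 0, sizeof(buf));
        snprintf(buf, sizeof(buf), "file block 4 lives on node %d, disk block %d", node, blkno);
        participant_write(node, blkno, buf);

        /* Read it back the same way. */
        locate(4, &node, &blkno);
        participant_read(node, blkno, buf);
        printf("%s\n", buf);
        return 0;
    }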
OneFS architecture
•   OneFS started from UFS (aka FFS)
    •   Generalized for a distributed system.
    •   Little resemblance in code today, but concepts are there.
    •   Almost all data structures are trees


•   OneFS Knows Everything – no volume manager, no RAID
    •   Lack of abstraction allows us to do interesting things, but forces
        the file system to know a lot – everything.


•   Cache/Memory Architecture Split (see the sketch after this slide)
    •   “Level 1” – file cache (cached as part of the vnode)
    •   “Level 2” – block cache (local or remote disk blocks)
    •   Memory used for high-speed write coalescer


•   Much more resource intensive than a local FS
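A minimal sketch of the L1/L2 split above, under invented names: the file-indexed (Level 1) cache is keyed by (file, file block), the block-indexed (Level 2) cache by (disk, disk block), and a miss in both falls through to the "disk". The direct-mapped layout, sizes, and keys are placeholders, not OneFS's actual structures.

    /* Two-level cache sketch: L1 is file-indexed, L2 is block-indexed.
     * Direct-mapped toy caches; nothing here reflects real OneFS data structures. */
    #include <stdio.h>

    #define L1_SLOTS 8
    #define L2_SLOTS 8

    struct l1_entry { int valid, file_id, file_block, data; };
    struct l2_entry { int valid, disk_id, disk_block, data; };

    static struct l1_entry l1[L1_SLOTS];
    static struct l2_entry l2[L2_SLOTS];

    /* Pretend disk: data is derived from the address. */
    static int disk_read(int disk_id, int disk_block)
    {
        return disk_id * 1000 + disk_block;
    }

    /* Assumed toy layout: file block N of file F lives on disk F % 3, block N. */
    static int read_block(int file_id, int file_block)
    {
        int s1 = (file_id * 31 + file_block) % L1_SLOTS;
        int disk_id, disk_block, s2, data;

        if (l1[s1].valid && l1[s1].file_id == file_id && l1[s1].file_block == file_block) {
            printf("L1 hit\n");
            return l1[s1].data;
        }

        disk_id = file_id % 3;
        disk_block = file_block;
        s2 = (disk_id * 17 + disk_block) % L2_SLOTS;

        if (l2[s2].valid && l2[s2].disk_id == disk_id && l2[s2].disk_block == disk_block) {
            printf("L2 hit\n");
            data = l2[s2].data;
        } else {
            printf("miss, reading disk\n");
            data = disk_read(disk_id, disk_block);
            l2[s2] = (struct l2_entry){ 1, disk_id, disk_block, data };
        }

        l1[s1] = (struct l1_entry){ 1, file_id, file_block, data };
        return data;
    }

    int main(void)
    {
        read_block(7, 2);   /* miss, reading disk */
        read_block(7, 2);   /* L1 hit */
        return 0;
    }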
Atomicity/Consistency Guarantees
•   POSIX file system
    •   Namespace operations are atomic
    •   fsync/sync operations are guaranteed synchronous


•   FS data is either mirrored or FEC-protected (see the worked example after this slide)
    •   Meta-data is always mirrored; up to 8x
    •   User-data can be mirrored (up to 8x) or FEC-protected up to +4
        •   We use Reed-Solomon codes for FEC
    •   Protection level can be chosen on a per-file or per-directory
        basis.
        •   Some files can be at 1x (no protection) while others can be at +4
            (survive four failures).
        •   Meta-data must be protected at least as high as anything it refers to.


•   All writes go to the NVRAM first as part of a distributed
    transaction – guaranteed to commit or abort.
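As a back-of-the-envelope illustration of the protection choices above, the sketch below compares the raw-capacity efficiency of mirroring (2x through 8x) with an N+M FEC stripe (+1 through +4). The 8-block stripe width is an assumption made only for the arithmetic; OneFS chooses layout per file.

    /* Storage efficiency of mirroring vs. N+M FEC protection.
     * Illustrative arithmetic only; the 8-block stripe width is assumed. */
    #include <stdio.h>

    int main(void)
    {
        int m, n = 8;   /* assumed data blocks per protection group */

        for (m = 2; m <= 8; m++)            /* 2x .. 8x mirroring */
            printf("%dx mirror: %5.1f%% usable, survives %d failures\n",
                   m, 100.0 / m, m - 1);

        for (m = 1; m <= 4; m++)            /* +1 .. +4 FEC */
            printf("%d+%d FEC:   %5.1f%% usable, survives %d failures\n",
                   n, m, 100.0 * n / (n + m), m);
        return 0;
    }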
Group Management
•   Transactional way to handle state changes
•   All nodes need to agree on their peers
•   Group changes: split, merge, add, remove (a minimal sketch follows this slide)
•   Group changes don’t “scale”, but they are rare


              [Diagram: a four-node group (nodes 1-4), with a “+” marking a node being added]
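The group-change sketch below is a toy illustration of the bookkeeping only: a group is an agreed-upon, sorted list of node IDs, and add/remove/merge produce a new list that every node switches to at the same transaction boundary. The representation and function names are assumptions; the real transactional agreement protocol is not shown.

    /* Toy group management: a group is a sorted set of node ids.
     * Real group changes are transactional and cluster-wide; this only shows the bookkeeping. */
    #include <stdio.h>

    #define MAX_NODES 96

    struct group { int n; int ids[MAX_NODES]; };

    static void group_print(const char *label, const struct group *g)
    {
        printf("%s: {", label);
        for (int i = 0; i < g->n; i++)
            printf("%s%d", i ? "," : " ", g->ids[i]);
        printf(" }\n");
    }

    /* Insert a node id, keeping the list sorted (add / merge one peer). */
    static void group_add(struct group *g, int id)
    {
        int i = g->n;
        while (i > 0 && g->ids[i - 1] > id) { g->ids[i] = g->ids[i - 1]; i--; }
        g->ids[i] = id;
        g->n++;
    }

    /* Remove a node id (split / node removed). */
    static void group_remove(struct group *g, int id)
    {
        int i, j = 0;
        for (i = 0; i < g->n; i++)
            if (g->ids[i] != id)
                g->ids[j++] = g->ids[i];
        g->n = j;
    }

    int main(void)
    {
        struct group g = { 0, {0} };
        for (int id = 1; id <= 4; id++) group_add(&g, id);
        group_print("initial group", &g);

        group_add(&g, 5);            /* "+": a fifth node joins */
        group_print("after add    ", &g);

        group_remove(&g, 2);         /* node 2 splits away */
        group_print("after split  ", &g);
        return 0;
    }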
Distributed Lock Manager
•   Textbook-ish DLM
    •   Anyone requesting a lock is an initiator.
    •   Coordinator knows the definitive owner for the lock.
        •   Controls access to locks.
        •   Coordinator is chosen by a hash of the resource (see the hashing sketch after this slide).


•   Split/Merge behavior
    •   Locks are lost at merge time, not split time.
    •   Since POSIX has no lock-revoke mechanism, advisory locks are
        silently dropped.
    •   Coordinator renegotiates on split/merge.


•   Locking optimizations – “lazy locks”
    •   Locks are cached.
    •   Lock-lost callbacks.
    •   Lock-contention callbacks.
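A minimal sketch of "coordinator is chosen by a hash of the resource": hash the resource name, take it modulo the current group size, and index into the node list everyone agrees on. The FNV-1a hash and the resource strings are illustrative assumptions; the deck does not describe OneFS's actual hash or resource encoding.

    /* Picking a lock coordinator by hashing the resource name over the group.
     * FNV-1a is used purely for illustration. */
    #include <stdio.h>
    #include <stdint.h>

    static uint64_t fnv1a(const char *s)
    {
        uint64_t h = 1469598103934665603ULL;
        while (*s) {
            h ^= (unsigned char)*s++;
            h *= 1099511628211ULL;
        }
        return h;
    }

    /* group[] is the sorted list of node ids every node agrees on. */
    static int coordinator_for(const char *resource, const int *group, int group_size)
    {
        return group[fnv1a(resource) % (uint64_t)group_size];
    }

    int main(void)
    {
        int group[] = { 1, 2, 3, 4, 5 };
        const char *resources[] = { "resource:1042", "resource:1043", "resource:77" };

        for (int i = 0; i < 3; i++)
            printf("%s -> coordinator node %d\n",
                   resources[i], coordinator_for(resources[i], group, 5));
        return 0;
    }

Because the modulo depends on the group size, a split or merge moves resources to different coordinators, which is consistent with the renegotiation described above.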
RPC Mechanism

• Uses SDP on InfiniBand
• Batch System
    • Allows you to put dependencies on the remote side.
      • e.g. send 20 messages, checkpoint, send 20 more messages.
      • Messages run in parallel, then synchronize, etc.
    • Coalesces errors (see the batching sketch after this slide).
•   Async messages (callback)
•   Sync messages
•   Update message (no response)
•   Used by DLM, RBM, etc. (everything)
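A sketch of the batching idea: queue messages, insert checkpoints, let everything between two checkpoints run "in parallel" before synchronizing, and fold any per-message errors into one result for the batch. The structures, and the sequential loop standing in for the real parallel dispatch, are assumptions for illustration.

    /* Batched RPC sketch: messages run between checkpoints, errors are coalesced.
     * Sequential execution stands in for the real parallel dispatch. */
    #include <stdio.h>

    enum op { OP_MSG, OP_CHECKPOINT };

    struct batch_item { enum op op; int msg_id; };

    /* Pretend remote handler: message 7 fails, everything else succeeds. */
    static int send_message(int msg_id)
    {
        return msg_id == 7 ? -1 : 0;
    }

    /* Run the batch: all messages before a checkpoint may proceed concurrently;
     * the checkpoint waits for them and carries the first (coalesced) error forward. */
    static int run_batch(const struct batch_item *items, int n)
    {
        int coalesced_error = 0;

        for (int i = 0; i < n; i++) {
            if (items[i].op == OP_CHECKPOINT) {
                printf("checkpoint: error so far = %d\n", coalesced_error);
                continue;
            }
            int rc = send_message(items[i].msg_id);
            if (rc != 0 && coalesced_error == 0)
                coalesced_error = rc;          /* keep only one error for the batch */
        }
        return coalesced_error;
    }

    int main(void)
    {
        struct batch_item items[] = {
            { OP_MSG, 1 }, { OP_MSG, 2 }, { OP_MSG, 3 },
            { OP_CHECKPOINT, 0 },
            { OP_MSG, 7 }, { OP_MSG, 8 },
            { OP_CHECKPOINT, 0 },
        };
        int rc = run_batch(items, sizeof(items) / sizeof(items[0]));
        printf("batch result: %d\n", rc);
        return 0;
    }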
Writing a file to OneFS

• Writes occur via NFS, CIFS, etc. to a single node
• That node coalesces data and initiates transactions

• Optimizing for write performance is hard
   • Lots of variables
   • Each node might have a different load
   • Unusual scenarios, e.g. degraded writes
• Asynchronous Write Engine (see the DAG sketch after this slide)
   • Build a directed acyclic graph (DAG)
   • Do work as soon as its dependencies are satisfied
   • Prioritize and pipeline work for efficiency
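The DAG sketch below shows only the core mechanic named above: each work item records how many dependencies it still waits on, items with a zero count are runnable immediately, and finishing an item decrements its dependents. The item names mirror the write-plan slides that follow (layout, allocate, FEC, block writes), but the scheduler itself is an invented toy, not the OneFS write engine.

    /* Toy asynchronous write engine: run DAG items as soon as their deps are done.
     * Names follow the write-plan slides; the scheduler itself is invented. */
    #include <stdio.h>

    #define MAX_ITEMS 8
    #define MAX_DEPS  4

    struct work {
        const char *name;
        int pending;                  /* unfinished dependencies */
        int dependents[MAX_DEPS];     /* items waiting on this one */
        int ndependents;
        int done;
    };

    static struct work items[MAX_ITEMS] = {
        /* 0 */ { "compute layout",  0, {1},       1, 0 },
        /* 1 */ { "allocate blocks", 1, {2, 3, 4}, 3, 0 },
        /* 2 */ { "compute FEC",     1, {5},       1, 0 },
        /* 3 */ { "write block 0",   1, {0},       0, 0 },
        /* 4 */ { "write block 1",   1, {0},       0, 0 },
        /* 5 */ { "write FEC block", 1, {0},       0, 0 },
    };
    static int nitems = 6;

    int main(void)
    {
        int progressed = 1;

        while (progressed) {
            progressed = 0;
            for (int i = 0; i < nitems; i++) {
                if (items[i].done || items[i].pending > 0)
                    continue;
                printf("run: %s\n", items[i].name);   /* would be queued asynchronously */
                items[i].done = 1;
                for (int d = 0; d < items[i].ndependents; d++)
                    items[items[i].dependents[d]].pending--;
                progressed = 1;
            }
        }
        return 0;
    }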
Writing a file to OneFS



  [Diagram: client servers connect via NFS, CIFS, FTP, or HTTP; the write lands on a single
   node, and the cluster nodes are joined by a back-end switch (optional 2nd switch for
   redundancy)]
Writing a file to OneFS
• Break the write into regions (see the sketch after this slide)
• Regions are protection-group aligned
• For each region:
       • Create a layout
       • Use the layout to generate a plan
       • Execute the plan asynchronously

          [Diagram: example plan DAG – compute layout, then allocate blocks, then compute FEC
           and write the data blocks and the FEC block]
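A sketch of the region step above: given a byte range, round it out to protection-group boundaries and emit one region per protection group; each region would then get its own layout and plan. The 128 KB protection-group size and the struct names are assumptions made only for the example.

    /* Split a write into protection-group-aligned regions.
     * The 128 KB group size is an assumed value for illustration. */
    #include <stdio.h>

    #define PG_SIZE (128 * 1024L)   /* assumed protection-group size in bytes */

    struct region { long start; long len; };

    /* Fill regions[] with protection-group-aligned pieces covering [offset, offset+len). */
    static int make_regions(long offset, long len, struct region *regions, int max)
    {
        long end = offset + len;
        long cur = offset - (offset % PG_SIZE);      /* round down to a group boundary */
        int n = 0;

        while (cur < end && n < max) {
            long rstart = cur > offset ? cur : offset;
            long rend   = cur + PG_SIZE < end ? cur + PG_SIZE : end;
            regions[n].start = rstart;
            regions[n].len   = rend - rstart;
            n++;
            cur += PG_SIZE;
        }
        return n;
    }

    int main(void)
    {
        struct region r[16];
        int n = make_regions(100 * 1024L, 300 * 1024L, r, 16);

        for (int i = 0; i < n; i++)
            printf("region %d: offset %ld, length %ld\n", i, r[i].start, r[i].len);
        /* Each region would then get a layout (which nodes and disks), a plan (a DAG),
         * and an asynchronous execution pass, as on the previous slides. */
        return 0;
    }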
Writing a file to OneFS

• Plan executes and transaction commits
• Data and parity blocks are now on disks


      [Diagram: data and parity blocks now reside on three nodes; inode mirror 0 and
       inode mirror 1 live on two of them]
Reading a file from OneFS



  [Diagram: client servers issue reads via NFS, CIFS, FTP, or HTTP to a single node, which
   gathers the file's blocks from the other nodes across the back-end switch (optional 2nd
   switch)]
Handling Failures

• What could go wrong during a single
  transaction?
       •   A block-level I/O request fails
       •   A drive goes down
       •   A node runs out of space
       •   A node disconnects or crashes
• In a distributed system, things are expected
  to fail.
  • Most of our system calls automatically restart (see the retry sketch after this slide).
  • Have to be able to gracefully handle all of the
    above, plus much more!
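A sketch of "most of our system calls automatically restart": wrap the transactional operation in a loop that retries when the failure is transient (a group change, a failed block request that protection can route around) and only gives up on hard errors or after too many attempts. The error codes, retry limit, and helper names are assumptions.

    /* Restartable operation sketch: retry on transient, cluster-level failures.
     * The error classification and limits are invented for illustration. */
    #include <stdio.h>

    enum result { OK, ERR_GROUP_CHANGE, ERR_IO_RETRYABLE, ERR_NO_SPACE };

    static int attempt;

    /* Pretend transaction: fails twice with transient errors, then succeeds. */
    static enum result try_transaction(void)
    {
        attempt++;
        if (attempt == 1) return ERR_GROUP_CHANGE;   /* a node dropped out mid-write */
        if (attempt == 2) return ERR_IO_RETRYABLE;   /* a block request failed */
        return OK;
    }

    static int restartable_write(void)
    {
        for (int tries = 0; tries < 10; tries++) {
            enum result r = try_transaction();
            switch (r) {
            case OK:
                printf("committed on attempt %d\n", attempt);
                return 0;
            case ERR_GROUP_CHANGE:
            case ERR_IO_RETRYABLE:
                printf("attempt %d failed transiently, restarting\n", attempt);
                continue;                 /* re-plan against the new group / layout */
            case ERR_NO_SPACE:
                return -1;                /* hard error: surface to the caller */
            }
        }
        return -1;
    }

    int main(void)
    {
        return restartable_write();
    }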
Handling Failures
• When a node goes “down”:
   • New files will use effective protection levels (if necessary)
   • Affected files will be reconstructed automatically per
     request.
   • That node’s IP addresses are migrated to another node.
   • Some data is orphaned and later garbage collected.
• When a node “fails”:
   • New files will use effective protection levels (if necessary)
   • Affected files will be repaired automatically across the
     cluster.
   • AutoBalance will automatically rebalance data.
• We can safely, proactively SmartFail nodes/drives:
   • Reconstruct data without removing the device.
   • If a multiple-component failure occurs, use the original
     device – minimizes WOR.
SmartConnect

 [Diagram: CIFS and NFS clients (or either) connect over Ethernet to a single cluster IP address]

• Clients must connect to a single IP address.
• SmartConnect is a DNS server that runs on the cluster (see the sketch after this slide)
    • The customer delegates a zone to the cluster’s DNS server
    • SmartConnect responds to DNS queries with only available nodes
    • SmartConnect can also be configured to respond with nodes
      based on load, connection count, throughput, etc.
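The sketch below shows the selection step SmartConnect performs when it answers a DNS query: filter the node list down to available nodes, then pick one according to the configured policy (round-robin here, or least connections). Only the "answer with an available node" behaviour comes from the slide; the specific policies and the data layout are assumptions.

    /* SmartConnect-style node selection for a DNS answer.
     * Round-robin and least-connections policies are illustrative stand-ins. */
    #include <stdio.h>

    struct node { const char *ip; int available; int connections; };

    static struct node nodes[] = {
        { "10.0.0.1", 1, 12 },
        { "10.0.0.2", 0,  0 },   /* down: never returned */
        { "10.0.0.3", 1,  4 },
        { "10.0.0.4", 1,  9 },
    };
    static int nnodes = 4;
    static int rr_cursor;

    static const char *pick_round_robin(void)
    {
        for (int i = 0; i < nnodes; i++) {
            struct node *n = &nodes[(rr_cursor + i) % nnodes];
            if (n->available) {
                rr_cursor = (rr_cursor + i + 1) % nnodes;
                return n->ip;
            }
        }
        return NULL;                      /* no node available */
    }

    static const char *pick_least_connections(void)
    {
        struct node *best = NULL;
        for (int i = 0; i < nnodes; i++)
            if (nodes[i].available && (!best || nodes[i].connections < best->connections))
                best = &nodes[i];
        return best ? best->ip : NULL;
    }

    int main(void)
    {
        printf("round-robin answers: %s, %s, %s\n",
               pick_round_robin(), pick_round_robin(), pick_round_robin());
        printf("least-connections answer: %s\n", pick_least_connections());
        return 0;
    }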
We've got Lego Pieces
• Accelerator Nodes
  • Top-Half Only
  • Adds CPU and Memory – no disks or journal
  • Only has Level 1 cache… high single-stream throughput


• Storage Nodes
  • Both Top and Bottom Halves
  • In Some Workloads, Bottom Half Only Makes Sense


• Storage Expansion Nodes
  • Just a dumb extension of a Storage Node – add disks
  • Grow Capacity Without Performance
SmartConnect Zones
  hpc.tx.com (Processing)
  •   10 GigE dedicated
  •   Accelerator X nodes
  •   NFS failover required

  gg.tx.com
  •   Storage nodes
  •   NFS clients, no failover

  bizz.tx.com
  •   Renamed sub-domain
  •   CIFS clients (static IP)

  eng.tx.com
  •   Shared subnet
  •   Separate sub-domain
  •   NFS failover

  fin.tx.com
  •   VLAN (confidential traffic, isolated)
  •   Same physical LAN

  it.tx.com
  •   Full access, maintenance interface
  •   Corporate DNS, no SmartConnect
  •   Static (well-known) IPs required

  [Diagram also shows the client groups (Processing, Interpreters, BizDev, Eng, Finance, IT),
   subnets 10.10, 10.20, 10.30, and the cluster interfaces 10gige-1 and ext-1]
Initiator Software Block Diagram
  [Block diagram, top to bottom:]
      Front-end Network
      Protocol heads: NFS, CIFS, HTTP, NDMP, FTP, ?
      Initiator Cache
      DFM, IFM, LIN, STF, Btree/MDS, BAM, Layout, BSW
      RBM
      Back-end Network
Participant Software Block Diagram
  [Block diagram, top to bottom:]
      Back-end Network
      RBM
      LBM
      Participant Cache
      Journal (NVRAM) and DRV
      Disk Subsystem
System Software Block Diagram

  [Diagram: two initiator stacks – one on an Accelerator, one on a Storage Node – each with a
   front-end network, protocol heads (NFS, CIFS, HTTP, NDMP, FTP, iSCSI), an Initiator Cache,
   the DFM/IFM/LIN/STF/Btree-MDS/BAM/Layout/BSW modules, and RBM. Both connect over the
   InfiniBand back-end network to the Storage Node's participant stack (RBM, LBM, Participant
   Cache, NVRAM Journal, DRV, Disk Subsystem).]
Too much to talk about…
 •   Snapshots
 •   Quotas
 •   Replication
 •   Bit Error Protection
 •   Rebalancing Data
 •   Handling Slow Drives
 •   Statistics Gathering
 •   I/O Scheduling
 •   Network Failover
 •   Native Windows Concepts (ACLs, SIDs, etc.)
 •   Failed Drive Reconstruction
 •   Distributed Deadlock Detection
 •   On-the-fly Filesystem Upgrade
 •   Dynamic Sector Repair
 •   Globally Coherent Cache
Thank You!

Questions?

				