Isilon Clustered Storage
OneFS
Nick Kirsch
Introduction
• Who is Isilon?
• What Problems Are We Solving? (Market Opportunity)
• Who Has These Problems? (Our Customers)
• What Is Our Solution? (Our Product)
• How Does It Work? (The Cool Stuff)
Who is Isilon Systems?
• Founded in 2000
• Located in Seattle (Queen Anne)
• IPO’d in 2006 (ISLN)
• ~400 employees
• Q3 2008 Revenue: $30 million, 40% Y/Y
• Co-founded by Paul Mikesell, UW/CSE
• I’ve been at the company for 6+ years
What Problems Are We Solving?
Structured Data
• Small files
• Modest-size data stores
• I/O intensive
• Transactional
• Steady capacity growth
Unstructured Data
• Larger files
• Very large data stores
• Throughput intensive
• Sequential
• Explosive capacity growth
Traditional Architectures
• Data Organized in Layers of Abstraction
• File System, Volume Manager, RAID
• Server/Storage Architecture - “Head” and “Disk”
• Scale Up (vs Scale Out)
• Islands of Storage
• Hard to Scale
• Performance Bottlenecks
• Not Highly Available
• Overly Complex
• Cost Prohibitive
[Diagram: Storage Device #1, Storage Device #2, Storage Device #3]
Who Has These Problems?
Worldwide File and Block Disk Storage Systems, 2005-2011 (PB)*
• By 2011, 75% of all storage capacity sold will be for file-based data
• File Based: 79.3% CAGR
• Block Based: 31% CAGR
* Source: IDC, 2007
• Isilon has over 850 customers today.
What is Our Solution?
• OneFS™ intelligent software
• Enterprise-class hardware
• Isilon IQ Clustered Storage
[Diagram: a 3-node Isilon IQ cluster]
• Scales to 96 nodes
• 2.3 PB (single file system)
• 20 GB/s (aggregate)
Clustered Storage Consists Of “Nodes”
• Largely Commodity Hardware
• Quad-core 2.3 GHz CPU
• 4 GB memory read cache
• GbE and 10GbE for front-end network
• 12 disks per node
• InfiniBand for intra-cluster communication
• High-speed NVRAM journal
• Hot-swappable disks, power supplies, and fans
• NFS, CIFS, HTTP, FTP
• Integrates with Windows and UNIX
• OneFS operating system
Isilon Network Architecture
[Diagram: clients reach the cluster over Ethernet using CIFS, NFS, or either]
• Drop-in replacement for any NAS device
• No client-side drivers required, unlike Andrew FS (Coda) or Lustre
• No application changes required, unlike Google FS or Amazon S3
• No changes required to adopt.
How Does It Work?
• Built on FreeBSD 6.x (originally 5.x)
• New kernel module for OneFS
• Modifications to the kernel proper
• User space applications
• Leverage open-source where possible
• Almost all of the heavy-lifting is in the kernel
• Commodity Hardware
• A few exceptions:
• We have a high-speed NVRAM journal for data consistency
• We have an InfiniBand low-latency cluster interconnect
• We have a close-to-commodity SAS card (commodity chips)
• A custom monitoring board (fans, temps, voltages, etc.)
• SAS and SATA disks
OneFS architecture
• Fully Distributed
• Top Half (Initiator): Network Operations (TCP, NFS, CIFS); FEC Calculations, Block Reconstruction; VFS layer, Locking, etc.; File-Indexed Cache
• Bottom Half (Participant): Journal and Disk Operations; Block-Indexed Cache
• The OneFS architecture is basically an InfiniBand SAN
• All data access across the back-end network is block-level
• The participants act as very smart disk drives
• Much of the back-end data traffic can be RDMA
OneFS architecture
• OneFS started from UFS (aka FFS)
• Generalized for a distributed system.
• Little resemblance in code today, but concepts are there.
• Almost all data structures are trees
• OneFS Knows Everything – no volume manager, no RAID
• Lack of abstraction allows us to do interesting things, but forces
the file system to know a lot – everything.
• Cache/Memory Architecture Split
• “Level 1” – file cache (cached as part of the vnode)
• “Level 2” – block cache (local or remote disk blocks)
• Memory used for high-speed write coalescer
• Much more resource intensive than a local FS
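The Level 1 / Level 2 split can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name, the key shapes, and the `read_disk` callable are hypothetical, and both levels live in one process here, whereas the slides place the file-indexed cache on the initiator and the block-indexed cache on the participant.

```python
from typing import Callable, Dict, Tuple

# Hypothetical key shapes, chosen only for this sketch.
FileKey = Tuple[int, int]         # (file id, logical block index)
BlockAddr = Tuple[int, int, int]  # (node, drive, block number)


class TwoLevelCache:
    """Level 1 is indexed by file position, Level 2 by disk block address."""

    def __init__(self, read_disk: Callable[[BlockAddr], bytes]):
        self.l1: Dict[FileKey, bytes] = {}    # file-indexed cache
        self.l2: Dict[BlockAddr, bytes] = {}  # block-indexed cache
        self.read_disk = read_disk
        self.disk_reads = 0

    def read(self, key: FileKey, addr: BlockAddr) -> bytes:
        # A Level 1 hit never needs the block address at all.
        if key in self.l1:
            return self.l1[key]
        # A Level 1 miss falls through to the block cache, then to disk.
        if addr not in self.l2:
            self.l2[addr] = self.read_disk(addr)
            self.disk_reads += 1
        self.l1[key] = self.l2[addr]
        return self.l1[key]
```

Reading the same file position twice touches the disk once; a second file key that maps to the same block is served from Level 2 without another disk read.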
Atomicity/Consistency Guarantees
• POSIX file system
• Namespace operations are atomic
• fsync/sync operations are guaranteed synchronous
• FS data is either mirrored or FEC-protected
• Meta-data is always mirrored; up to 8x
• User-data can be mirrored (up to 8x) or FEC up to +4
• We use Reed-Solomon codings for FEC
• Protection level can be chosen on a per-file or per-directory
basis.
• Some files can be at 1x (no protection) while others can be at +4
(survive four failures).
• Meta-data must be protected at least as high as anything it refers to.
• All writes go to the NVRAM first as part of a distributed
transaction – guaranteed to commit or abort.
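To make the idea of FEC protection concrete, here is a minimal sketch that uses a single XOR parity block. This is a deliberately simpler technique than the Reed-Solomon codings named above: it survives only one lost block (roughly a "+1" level), and it is not the OneFS implementation. All function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def protect(data_blocks: List[bytes]) -> List[bytes]:
    """Return the stripe: the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing block (marked None) from the survivors."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("single parity can only survive one failure")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        # XOR of all surviving blocks (data and parity) yields the lost one.
        repaired[missing[0]] = xor_blocks(survivors)
    return repaired
```

For example, protecting three data blocks and then dropping any one of the four stored blocks still lets `reconstruct` return the full stripe. Surviving up to four failures, as described above, needs a code such as Reed-Solomon rather than plain XOR.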
Group Management
• Transactional way to handle state changes
• All nodes need to agree on their peers
• Group changes: split, merge, add, remove
• Group changes don’t “scale”, but are rare
[Diagram: group change among nodes 1, 2, 3, and 4]
Distributed Lock Manager
• Textbook-ish DLM
• Anyone requesting a lock is an initiator.
• Coordinator knows the definitive owner for the lock.
• Controls access to locks.
• Coordinator is chosen by a hash of the resource.
• Split/Merge behavior
• Locks are lost at merge time, not split time.
• Since POSIX has no lock-revoke mechanism, advisory locks are
silently dropped.
• Coordinator renegotiates on split/merge.
• Locking optimizations – “lazy locks”
• Locks are cached.
• Lock-lost callbacks.
• Lock-contention callbacks.
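The "coordinator is chosen by a hash of the resource" idea can be sketched as follows. This is an illustration only: the choice of SHA-256, the modulo mapping, the resource naming, and the `Coordinator` class (exclusive locks only, no caching or callbacks) are all assumptions, not details taken from OneFS.

```python
import hashlib
from typing import Dict, List


def coordinator_for(resource: str, group: List[int]) -> int:
    """Map a resource name to one node of the current group via a hash."""
    members = sorted(group)  # every node must order the group identically
    digest = hashlib.sha256(resource.encode("utf-8")).digest()
    return members[int.from_bytes(digest[:8], "big") % len(members)]


class Coordinator:
    """Tracks the definitive owner of each lock it coordinates."""

    def __init__(self) -> None:
        self.owners: Dict[str, int] = {}

    def acquire(self, resource: str, initiator: int) -> bool:
        # The first requester becomes the owner; others are refused.
        owner = self.owners.setdefault(resource, initiator)
        return owner == initiator

    def release(self, resource: str, initiator: int) -> None:
        if self.owners.get(resource) == initiator:
            del self.owners[resource]
```

Because the mapping depends on the group membership, a split or merge can change which node coordinates a given resource, which fits the renegotiation step listed above.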
RPC Mechanism
• Uses SDP on InfiniBand
• Batch System
• Allows you to put dependencies on the remote side.
• e.g. Send 20 messages, checkpoint, send 20 messages.
• Messages run in parallel, then synchronize, etc.
• Coalesces errors.
• Async messages (callback)
• Sync messages
• Update message (no response)
• Used by DLM, RBM, etc. (everything)
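The batch-with-checkpoints pattern can be sketched with Python's `asyncio`. This is a local simulation under stated assumptions: the `Batch` class and its methods are hypothetical, messages are plain coroutines rather than RPCs over SDP, and the sketch keeps running later stages after an error, which may differ from the real system.

```python
import asyncio
from typing import Awaitable, Callable, List

Message = Callable[[], Awaitable[None]]


class Batch:
    """Messages within a stage run in parallel; a checkpoint is a barrier."""

    def __init__(self) -> None:
        self.stages: List[List[Message]] = [[]]

    def send(self, message: Message) -> "Batch":
        self.stages[-1].append(message)
        return self

    def checkpoint(self) -> "Batch":
        self.stages.append([])
        return self

    async def run(self) -> List[Exception]:
        errors: List[Exception] = []
        for stage in self.stages:
            # Wait for every message in the stage before starting the next.
            results = await asyncio.gather(
                *(message() for message in stage), return_exceptions=True
            )
            # Coalesce errors into one list instead of failing one by one.
            errors.extend(r for r in results if isinstance(r, Exception))
        return errors


async def demo():
    log: List[str] = []

    def msg(name: str, fail: bool = False) -> Message:
        async def run() -> None:
            await asyncio.sleep(0)
            if fail:
                raise RuntimeError(name)
            log.append(name)
        return run

    batch = Batch().send(msg("a")).send(msg("b", fail=True)).checkpoint().send(msg("c"))
    return log, await batch.run()
```

Running `asyncio.run(demo())` completes "a" before the checkpoint, runs "c" after it, and returns the single error from "b" in the coalesced list.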
Writing a file to OneFS
• Writes occur via NFS, CIFS, etc. to a single node
• That node coalesces data and initiates transactions
• Optimizing for write performance is hard
• Lots of variables
• Each node might have different load
• Unusual scenarios, e.g. degraded writes
• Asynchronous Write Engine
• Build a directed acyclic graph (DAG)
• Do work as soon as dependencies are satisfied
• Prioritize and pipeline work for efficiency
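The "do work as soon as dependencies are satisfied" rule can be sketched as a small DAG executor. This is a single-threaded illustration: the function names, the step names, and the dependency edges in `demo_write_plan` are assumptions, and prioritization and pipelining are left out.

```python
from collections import deque
from typing import Callable, Dict, List, Set


def run_dag(tasks: Dict[str, Callable[[], None]],
            deps: Dict[str, Set[str]]) -> List[str]:
    """Run each task once all of its dependencies have completed."""
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    dependents: Dict[str, List[str]] = {name: [] for name in tasks}
    for name, needs in remaining.items():
        for need in needs:
            dependents[need].append(name)

    ready = deque(sorted(n for n, needs in remaining.items() if not needs))
    order: List[str] = []
    while ready:
        name = ready.popleft()
        tasks[name]()
        order.append(name)
        # Completing a task may unblock the tasks that were waiting on it.
        for child in dependents[name]:
            remaining[child].discard(name)
            if not remaining[child]:
                ready.append(child)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle: not a DAG")
    return order


def demo_write_plan() -> List[str]:
    steps = ["layout", "allocate_blocks", "fec_compute", "write_data", "write_fec"]
    tasks = {step: (lambda: None) for step in steps}
    deps = {
        "allocate_blocks": {"layout"},
        "fec_compute": {"layout"},
        "write_data": {"allocate_blocks"},
        "write_fec": {"allocate_blocks", "fec_compute"},
    }
    return run_dag(tasks, deps)
```

In the demo plan, `write_data` becomes runnable as soon as blocks are allocated, without waiting for the FEC computation.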
Writing a file to OneFS
[Diagram: servers connect over NFS, CIFS, FTP, or HTTP through a switch (optional 2nd switch)]
Writing a file to OneFS
• Break the write into regions
• Regions are protection-group aligned
• For each region:
• Create a layout
• Use layout to generate a plan
• Execute the plan asynchronously
[Diagram: execution plan with steps including compute layout, allocate blocks, FEC compute, and write block]
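Breaking a write into protection-group-aligned regions can be sketched as below. The sketch assumes that a protection group covers a fixed-size, contiguous byte range of the file; that assumption, the `group_size` parameter, and the function name are hypothetical.

```python
from typing import List, Tuple


def split_into_regions(offset: int, length: int,
                       group_size: int) -> List[Tuple[int, int]]:
    """Split a write into (offset, length) regions that never cross a
    protection-group boundary."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")

    regions: List[Tuple[int, int]] = []
    end = offset + length
    while offset < end:
        # Next boundary strictly after the current offset.
        boundary = (offset // group_size + 1) * group_size
        stop = min(boundary, end)
        regions.append((offset, stop - offset))
        offset = stop
    return regions
```

With a 128-byte group size, a 300-byte write at offset 100 yields four regions: a short head up to the first boundary, two full groups, and a short tail. Each region could then get its own layout and plan.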
Writing a file to OneFS
• Plan executes and transaction commits
• Data and parity blocks are now on disks
[Diagram: three sets of data and parity blocks, plus inode mirror 0 and inode mirror 1]
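A "commit or abort" distributed transaction can be sketched with a simple two-phase commit. The slides do not say which commit protocol OneFS uses, so two-phase commit is an assumption here, as are the class and function names; the `journal` dictionary merely stands in for the NVRAM journal.

```python
from typing import Dict, List, Tuple


class Participant:
    """Stages writes in a journal, then applies or discards them."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.journal: Dict[int, Dict[int, bytes]] = {}  # txn id -> staged writes
        self.disk: Dict[int, bytes] = {}                # block number -> data

    def prepare(self, txn_id: int, writes: Dict[int, bytes]) -> bool:
        if not self.can_commit:
            return False
        self.journal[txn_id] = dict(writes)
        return True

    def commit(self, txn_id: int) -> None:
        self.disk.update(self.journal.pop(txn_id))

    def abort(self, txn_id: int) -> None:
        self.journal.pop(txn_id, None)


def run_transaction(txn_id: int,
                    plan: List[Tuple[Participant, Dict[int, bytes]]]) -> bool:
    """Commit on every participant, or on none of them."""
    prepared: List[Participant] = []
    for participant, writes in plan:
        if participant.prepare(txn_id, writes):
            prepared.append(participant)
        else:
            # One refusal aborts everything that was already staged.
            for p in prepared:
                p.abort(txn_id)
            return False
    for p in prepared:
        p.commit(txn_id)
    return True
```

If any participant refuses to prepare, no participant's `disk` changes; otherwise every staged write is applied.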
Reading a file from OneFS
[Diagram: servers connect over NFS, CIFS, FTP, or HTTP through a switch (optional 2nd switch)]
Handling Failures
• What could go wrong during a single
transaction?
• A block-level I/O request fails
• A drive goes down
• A node runs out of space
• A node disconnects or crashes
• In a distributed system, things are expected
to fail.
• Most of our system calls automatically restart.
• Have to be able to gracefully handle all of the
above, plus much more!
Handling Failures
• When a node goes “down”:
• New files will use effective protection levels (if necessary)
• Affected files will be reconstructed automatically per
request.
• That node’s IP addresses are migrated to another node.
• Some data is orphaned and later garbage collected.
• When a node “fails”:
• New files will use effective protection levels (if necessary)
• Affected files will be repaired automatically across the
cluster.
• AutoBalance will automatically rebalance data.
• We can safely, proactively SmartFail nodes/drives:
• Reconstruct data without removing the device.
• If a multiple-component failure occurs, use the original device, which minimizes WOR.
SmartConnect
[Diagram: clients reach the cluster over Ethernet using CIFS, NFS, or either]
• Client must connect to a single IP address.
• SmartConnect - DNS server which runs on the cluster
• Customer delegates zone to the cluster DNS server
• SmartConnect responds to DNS queries with only available nodes
• SmartConnect can also be configured to respond with nodes
based on load, connection, throughput, etc.
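The node-selection behavior described above can be sketched without any real DNS machinery. Everything here is hypothetical: the `Node` and `ZoneResolver` names, the two policy names, and the example addresses; the sketch only shows answering with available nodes and optionally preferring the least-connected one.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class Node:
    ip: str
    available: bool = True
    connections: int = 0


class ZoneResolver:
    """Pick one node address per query, skipping unavailable nodes."""

    def __init__(self, nodes: List[Node], policy: str = "round_robin") -> None:
        self.nodes = nodes
        self.policy = policy
        self._counter = count()

    def resolve(self) -> Optional[str]:
        candidates = [node for node in self.nodes if node.available]
        if not candidates:
            return None
        if self.policy == "connection_count":
            chosen = min(candidates, key=lambda node: node.connections)
        else:
            # Round robin over the nodes that are currently available.
            chosen = candidates[next(self._counter) % len(candidates)]
        return chosen.ip
```

With three nodes of which the second is down, successive round-robin queries alternate between the first and third addresses; switching the policy to `"connection_count"` returns the available node with the fewest connections.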
We've got Lego Pieces
• Accelerator Nodes
• Top-Half Only
• Adds CPU and Memory – no disks or journal
• Only has Level 1 cache… high single-stream throughput
• Storage Nodes
• Both Top and Bottom Half
• In Some Workloads, Bottom Half Only Makes Sense
• Storage Expansion Nodes
• Just a dumb extension of a Storage Node – add disks
• Grow Capacity Without Performance
SmartConnect Zones
• hpc.tx.com (Processing): 10 GigE dedicated; Accelerator X nodes; NFS failover required
• gg.tx.com (Interpreters): Storage nodes; NFS clients, no failover
• bizz.tx.com (BizDev): Renamed sub-domain; CIFS clients (static IP)
• eng.tx.com (Eng): Shared subnet; Separate sub-domain; NFS failover
• fin.tx.com (Finance): VLAN (confidential traffic, isolated); Same physical LAN
• it.tx.com (IT): Full access, maintenance interface; Corporate DNS, no SC; Static (well-known) IPs required
[Diagram labels: interfaces 10gige-1 and ext-1; subnets 10.10, 10.20, 10.30]
Initiator Software Block Diagram
Front-end Network
NFS CIFS HTTP NDMP FTP ?
Initiator Cache
DFM IFM LIN STF
BAM Layout BSW
Btree
MDS
RBM
Back-end Network
Participant Software Block Diagram
Back-end Network
RBM
LBM
Participant Cache
Journal
DRV
NVRAM
Disk Subsystem
System Software Block Diagram
[Diagram: two initiator stacks and one participant stack joined by the InfiniBand back-end network]
• Initiator stack (shown twice): Front-end Network; NFS, CIFS, FTP, HTTP, NDMP, iSCSI; Initiator Cache; DFM, IFM, LIN, STF, BAM, Btree, Layout, BSW; MDS; RBM; Back-end Network
• Participant stack: Back-end Network; RBM; LBM; Participant Cache; Journal; DRV; NVRAM; Disk Subsystem
• Node labels: Accelerator, Storage Node
Too much to talk about…
• Snapshots
• Quotas
• Replication
• Bit Error Protection
• Rebalancing Data
• Handling Slow Drives
• Statistics Gathering
• I/O Scheduling
• Network Failover
• Native Windows Concepts (ACLs, SIDs, etc.)
• Failed Drive Reconstruction
• Distributed Deadlock Detection
• On-the-fly Filesystem Upgrade
• Dynamic Sector Repair
• Globally Coherent Cache
Thank You!
Questions?