SUN STORAGETEK™ 5800 SYSTEM ARCHITECTURE
White Paper, December 2007
Sun Microsystems, Inc.

Table of Contents

Introduction
Structured and Unstructured Data
Requirements for Large, Fixed Content Storage Systems
Conventional Solutions for Unstructured Static Data
Fixed Content Aware Storage
Introducing the Sun StorageTek 5800 System
The Sun StorageTek 5800 System and Honeycomb
Architectural Overview
Key Architecture Principles
Fully Integrated Reliable Storage System
Symmetric Cluster for Reliability
Cell Configurations
Reliable Storage
Reed Solomon Encoding
Placement Algorithm
Self Healing
Capacity Balancing
Background Integrity Checking
Integrated High Availability Metadata Database
Large Object Support
Network Design
Scalability
Low TCO
Application Program Interfaces (APIs)
Overview of Data Management
Process for Storing an Object
Process for Retrieving an Object
Self Healing
Recovering From a Failed Disk
Re-distributing Content to Replaced Disk
Multi-Cell Operation
Adding Capacity
Software Interface Overview
Key Concepts
Object Archive
Metadata
Write Once Read Many (WORM) Storage
Object Identifier (OID)
Primary Operations
Create (Store) an Object
Retrieve Object
Adding Metadata to an existing Object
Querying Metadata
Deleting an Object
Retrieving the Schema
Sun StorageTek 5800 System Query Language
Java API
C API
Virtualized Views via WebDAV
Emulator
System Software and RAS Features
Software Components Overview
Remote Services
Protocol Server
Metadata Service
Object Archive Service
Software RAID Service
Placement Service
High Availability Database Client
Local Services
Disk Access Service
Disk Management and Monitoring Service
Healing Service
High Availability Database Server
Cluster Management
Node Manager
Cluster Membership Management
Switch Manager
Service State Advertisements (Mailboxes)
Clustered IPC
Administrative Interfaces
Command Line Administration
Administration GUI
Notifications
Background Integrity Checking
Backup and Restore via NDMP
IP Multi-Pathing
Hardware Details
Storage Nodes
Cell Configurations
Service Processor Node
Integrated Load Balancing Ethernet Switches
Network Patch Panel
System Rack
Bundled Software
Cell Wide System Monitoring
Futures
Compliance Features
Extensibility Through Storage Beans
Example Uses of Synchronous Storage Beans
Example Uses of Asynchronous Storage Beans
Upcoming Features
Summary
For More Information
White Papers and Articles
Glossary

Chapter 1 Introduction

The Sun StorageTek 5800 system is the first in a new line of third generation fixed content storage systems. It is an online, highly reliable storage system that was developed to solve the unique problems associated with large-scale file storage applications. These applications, whose data is typically classified as fixed, static, or reference data, consume more than 20 terabytes (TB) of storage and scale over time as data collections grow. First generation fixed content storage offered a basic object store; the second generation added user-defined metadata about each object to enable easy retrieval.
The Sun StorageTek 5800 system adds the ability to run data services inside the storage device, which can manipulate objects as they are stored or retrieved.

Large-scale fixed content applications drive a new set of expectations for data storage, requiring greater scalability and much better reliability, availability, and serviceability (RAS) than traditional systems, but at much lower cost. Furthermore, the challenges of organizing and finding data objects in large-scale applications require new thinking about how data can be better indexed, discovered, and accessed.

Structured and Unstructured Data

Data types and applications can be classified along two vectors: file versus structured data, and fixed versus dynamic data.

Figure 1. Data types
  Unstructured (files, objects), dynamic: Office documents, Media production, CAD
  Unstructured (files, objects), fixed: Digital repositories, Medical imaging, DAM, Broadcast, Compliance archival, Media, Internet & archive
  Structured (databases), dynamic: Transactional systems, ERP, CRM, E-mail
  Structured (databases), fixed: BI, Scientific, Data warehouses, Transaction archives

• Structured, dynamic data is typically the data created by relational databases and online transaction processing (OLTP) applications. These applications often run on large servers with either direct attached storage (DAS) disk arrays or storage area network (SAN) disk arrays to store data. The requirements for this type of data are high throughput, transaction performance, and availability, which are adequately provided by DAS or SAN solutions.
• Unstructured, dynamic data, such as office documents and computer-aided design (CAD) files, is usually created by departmental file sharing applications. This data has been supported over the years by a variety of storage architectures. However, many IT departments are moving to network-attached storage (NAS) systems because they are easy to deploy, support heterogeneous clients, and include advanced data protection features such as snapshot capabilities.
• Structured, static data created by append-only applications is a relatively new type of data previously served by mainframe and storage archive management solutions. However, the increasing amount of data associated with the digital supply chain, enterprise resource planning (ERP), and radio frequency ID (RFID) applications is creating the opportunity for new approaches to storage.
• Unstructured, static data (or fixed content), such as digital repositories, medical imaging, broadcast, compliance, media, Internet archives, or dark archives (data that must be retained for legal reasons and is only retrieved in the event of a legal dispute), represents the category with the largest and fastest growing storage requirements. Fixed content shares the long-term storage requirement of structured, static data, but it does not change and often requires the ability to access the data quickly and frequently (random reads). Near-line tape is widely deployed to reduce costs, but it is limited in performance and reliability. Large-scale commodity RAID arrays are economically attractive until the added complexity, lack of reliability, and limited scalability become evident. This is an emerging field with opportunities for innovation, one of which is object-based, or fixed content aware, storage.

Requirements for Large, Fixed Content Storage Systems

The key characteristics of fixed content (non-changing data that must be stored long term, yet be quickly and readily accessible) yield a number of requirements for storage systems.

• Data and metadata: to optimally locate and retrieve stored data, the storage system should have the capability to store the files as well as system and user-defined metadata. Metadata describes the content and the attributes for locating the data within the storage system. The system needs to be optimized to manage different kinds of metadata and to run queries against that metadata, as well as to store data, which typically requires a separate database.
  In addition, metadata needs to be managed so that it can be leveraged to support new data services such as complex queries and data validation.
• Data integrity: once written, the data must be protected from accidental damage or intentional tampering, and the system must provide assurances that the data is not corrupted or suffering from bit rot over its stored life.
• Reliability: the storage system must manage data reliably over the entire lifetime of the data. This is particularly critical for large fixed content, as these files are typically so large that backup and recovery from tape is too time consuming and thus impractical.
• Scalability: the storage system must have the capability to scale from entry-level systems of a few TB to large multi-petabyte (PB) collections, without having to remove and rebuild the data on larger systems. Scalability should be non-disruptive and seamless.
• Open architecture: the requirement to archive data for years to decades means it will outlive a number of generations of hardware and software. Open standards are mandatory to ensure data can migrate across technology generations.
• Affordability: fixed content storage systems must be affordable both in terms of initial acquisition cost and long-term running costs. They need to be largely self-managing in order to minimize the cost of human intervention.

Conventional Solutions for Unstructured Static Data

Conventional solutions fail to meet all of the requirements for unstructured static data. Three alternatives using current technology are:

• DAS or SAN with an application server and metadata database
• NAS with an application server and metadata database
• Tape or optical disk with an application server and metadata database

In all three cases, the application server and metadata database are used to manage the content-specific functions such as indexing and searching the content.

The DAS or SAN option, with either direct attached disks or Fibre Channel (FC) disks, is excessive for this type of data. SANs typically use high-end SCSI or FC 10K or 15K RPM disks that are geared toward a higher number of I/O operations per second (IOPS) for OLTP, databases, and other I/O-intensive applications. In order to gain speed, the drives spin faster, require more power, and generate more heat. Storage capacity is compromised for speed. Though ideal for I/O-intensive applications, SCSI and FC disks represent a price/performance ratio that is unnecessary for fixed content storage.

Static, unstructured data applications tend to consume vast amounts of storage and scale steadily over time. Typical storage systems do not meet the transparent scalability and pay-as-you-grow model required to cost-effectively store data. Scaling often creates complexity and manageability problems because new file systems must be created, clients must be redirected, and the data must be manually redistributed.

Though affordable because they generally use high-density serial ATA (SATA) disks, NAS systems are unsuitable because they are difficult to scale due to the complexity of managing multiple file system mounts.

Another issue with current disk-based solutions is reliability. Large-scale disk-based systems with RAID 5 are not adequate because of the increasing risk of dual drive failures in a RAID 5 array. In fact, RAID 5 is becoming less reliable as disk capacity increases, because rebuilding data after a disk failure takes longer. Furthermore, in large fixed content systems the data is frequently not backed up simply because it is too large. The alternatives are to tolerate permanent data loss or to implement mirroring schemes that both decrease the density and increase the overall cost of the solution.
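The rebuild-time argument above can be made concrete with a small calculation. The sketch below assumes that a rebuild must read or write an entire drive once at a sustained transfer rate; the 60 MB/s rate and the 250 GB and 500 GB capacities are illustrative assumptions, not measurements from this paper.

```python
def full_drive_pass_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to read or write an entire drive once at a sustained rate."""
    capacity_mb = capacity_gb * 1000  # decimal units, as drive capacities are quoted
    return capacity_mb / rate_mb_per_s / 3600


# Assumed sustained transfer rate; real rebuilds under load are usually slower.
ASSUMED_RATE_MB_PER_S = 60.0

small = full_drive_pass_hours(250, ASSUMED_RATE_MB_PER_S)
large = full_drive_pass_hours(500, ASSUMED_RATE_MB_PER_S)
print(f"250 GB: {small:.2f} h, 500 GB: {large:.2f} h, ratio {large / small:.1f}x")
```

With the transfer rate held constant, doubling the capacity doubles the time during which a second failure in the same RAID 5 group would lose data. The figure is a lower bound, since a rebuild on a system that is also serving requests rarely runs at the full sustained rate.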

Tape and optical disk systems are intended for long-term storage, but the access time is much too slow for fixed content application requirements. In addition, fixed content applications generally (with the exception of dark archives) require the data to be always available and online, which can reduce the durability of tape or optical disks. Finally, tapes are often not reliable over time, so many administrators duplicate tapes and then copy them to new tapes every few years to protect the data.

All of these solutions require custom integration of many discrete hardware and software components, as well as application development, which generally leads to proprietary solutions that do not endure over time. Custom solutions also contribute to complexity and increase management and service costs that surpass the expense of acquiring the technology. The added complexity of databases, application servers, NAS, SAN, volume managers, high availability, and hierarchical storage management (HSM), as illustrated in Figure 2, all combine to make these solutions infeasible.

Figure 2. Today's common architecture (diagram labels: Application Servers; Data; Metadata; NAS or SAN; File system; Database; HA Cluster; RAID 1; RAID 5)

Fixed Content Aware Storage

In response to the requirements of fixed content, a new architecture referred to as fixed content aware storage is being developed, both independently and through a working group (Fixed Content Aware Storage TWG) at the Storage Networking Industry Association (SNIA). Fixed content aware storage is a mechanism for storing information that is retrieved based on its content rather than on its storage location, which is how typical file systems retrieve data. Fixed content aware systems combine low-cost, high-density SATA drives in a clustered system architecture. SATA-based systems are well suited to store fixed content for a number of reasons.
First, the systems use low-cost hardware components such as SATA disks rather than SCSI or FC, and Gigabit Ethernet instead of FC or InfiniBand. Second, the systems are modular: the clustered design makes it easy to add new storage resources. Finally, fixed content aware systems provide integrated functions, such as integrity and verifiability checking, that are customized to support fixed content applications.

Introducing the Sun StorageTek 5800 System

The Sun StorageTek 5800 system shares many of the elements of existing fixed content aware storage solutions but goes beyond them by offering innovative and unique features that are designed to fully meet the requirements of fixed content storage and to maximize the business value of fixed content applications.

The Sun StorageTek 5800 system is an online, disk-based, highly reliable storage system featuring a fully integrated hardware and software architecture with storage nodes arranged in a symmetric cluster:

• Each node contains a standard processor, memory, and storage disks. All nodes are identical, yet operate independently
• The clustered and redundant design provides high availability, good performance, and exceptional data persistence
• The system includes an integrated metadata database for storing and querying properties of the fixed content
• APIs are provided to develop applications that store and read fixed content
• Virtual file system views are available through WebDAV (HTTP)
• A Java™ technology based software solution manages the cluster and storage resources

The Sun StorageTek 5800 system is designed to reduce the cost and complexity associated with more traditional storage. It is IP/Ethernet based and composed of commodity hardware components to keep the cost of acquiring and expanding the system to a minimum. Advanced reliability algorithms enable the system to heal itself from multiple simultaneous component failures without the need for immediate servicing.
The Sun StorageTek 5800 system is designed to scale seamlessly, allowing an administrator to manage a PB or more behind a single management interface. The storage functionality is widely distributed across the hardware platform, so no specialized high availability or premium hardware (such as RAID controllers, SCSI, or Fibre Channel drives) is required. In addition, the system is designed to allow for multiple component failures and rebuild around them, which means there is no need to rush to replace the failed components. Because of this fail-in-place, deferred maintenance model, failed disks can be replaced during regularly scheduled maintenance periods. In essence, the management associated with traditional storage, such as LUN/RAID/volume/spares management, is removed, greatly reducing administrative costs.

Figure 3. Sun StorageTek 5800 system (Two Cell Configuration - 32 nodes, 64 TB raw storage)

The Sun StorageTek 5800 System and Honeycomb

The original code name for the project that grew into the Sun StorageTek 5800 system was Project Honeycomb. The Honeycomb name lives on as the name of an OpenSolaris community that is bringing the Honeycomb software stack into the world of open source. The first realization of the Honeycomb storage model as a real product is the Sun StorageTek 5800 system as described in this paper.

Chapter 2 Architectural Overview

The Sun StorageTek 5800 system is intended to archive critical digital assets on a long-term (years to decades) basis. The system meets all of the requirements for fixed content storage: integrated data and metadata, data integrity, reliability, scalability, adherence to open standards, and affordability.

Key Architecture Principles

The Sun StorageTek 5800 system is designed with a number of architectural principles in mind:

• Object based: Files are stored with tightly integrated application-defined descriptive data.
  For example, an MRI might have descriptions such as doctor's name, patient's name, date, and technician. Treating files as complete objects with metadata eliminates the cost and complexity of building and maintaining a separate relational database for metadata storage.
• Ability to tolerate multiple simultaneous failures: Loss of either data or metadata is unacceptable, making data integrity more important than system availability. The system should be able to handle the loss of multiple disks or nodes without data loss. Routine hardware failures should not require emergency service calls. The Sun StorageTek 5800 system handles multiple simultaneous failures, and heals itself from individual failures without operator intervention.
• No volume manager: Users should not need to manage how the storage is physically or logically laid out. The system appears to the user and the applications as a single logical entity even if multiple system units are added over time. Managing RAID, volumes, LUNs, and hot spares in traditional systems requires specialized administrative skills. The system should be as self-managing as possible in order to minimize TCO.
• Distributed and fully symmetric cluster design: The operation of the cluster is fully distributed across all nodes in the system. Designs that would require specialized nodes, such as dedicated master nodes or database nodes, limit the flexibility of the design in several ways, for example by creating bottlenecks and single points of failure at the specialized components. In the Sun StorageTek 5800 system, each node of the system is identically configured in terms of hardware as well as software. This allows each node to share the load of all the tasks. When a node fails, the system gracefully continues at slightly reduced capacity.
• Any node can coordinate any client request: Each node can run the algorithm to determine where objects are placed in the system and produce the same results. There is no centralized lock that needs to be used to avoid conflicts when allocating storage on the various nodes. The benefit is that all operations can proceed in parallel, as there is no contention waiting for a central coordination point.
• Scalability: The system can grow from 16 TB seamlessly and non-disruptively.
• Balanced design: As the demand for capacity increases, each node added to the system brings CPU and memory as well as additional storage. This provides more memory for caching the metadata database and more CPUs to execute parallel operations such as searches.
• Gracefully handle failures: The cluster is loosely coupled and the components are stateless. When a component fails, the load is spread over the remaining components in the system. This permits failures to be handled gracefully.
• Focus on overall system reliability: The system design allows for components to fail and for those failures to be handled by the rest of the system. The system's reliability does not depend on component-level redundancy to prevent failures. This keeps the hardware costs low by allowing the use of commodity components. For example, there is no need to add redundant power supplies to each node. If a power supply fails, the remaining nodes in the system take over for the failed node. The failed node can be left in place until a regularly scheduled service period.
• Open architecture: The Sun StorageTek 5800 system represents the only commercially available object storage system with a commitment to open source its code, to help ensure data is accessible for years to come.

Fully Integrated Reliable Storage System

The Sun StorageTek 5800 system is a fully integrated system for the long-term reliable storage of large quantities of fixed content.
A traditional system would consist of file systems and directories, possibly on NAS servers on a SAN, with a relational database for storing the structured data that describes a piece of content. Building such a system would require a team of developers and system administrators. The system might need to be redesigned as it grows and starts running into capacity limits in the file system, directory structure, amount of storage, or the performance of a single server or database. The system might also need periodic redesign when it is time to perform technology refreshes as existing systems reach the end of their supported life.

The Sun StorageTek 5800 system stores large files as objects. Each object in the system has associated attributes that describe it and can be used to locate the object in the system. The set of attributes can be customized for each application. This is directly analogous to designing a database schema. The stored file and its metadata are tied together on the Sun StorageTek 5800 system for the entire life of that object. No effort is needed by application developers or system administrators to keep file data and metadata in sync.

The metadata stored on the Sun StorageTek 5800 system is fully indexed and cached in memory for fast retrieval. Searches automatically execute in parallel across all of the nodes in the system. The amount of memory available for caching depends on the number of nodes in the system. The significant cost and complexity of managing a separate highly available database server is avoided with the Sun StorageTek 5800 system.

Fixed content applications require long-term reliability and integrity. The Sun StorageTek 5800 system can handle multiple disk or node failures without impacting data availability. When objects are stored in the system, file-level and block-level checksums, including a SHA-1 secure hash, are stored to verify integrity.
Should a block become corrupt, it can be repaired using the redundancy that is built into the system's RAID algorithms. Once an object is stored, there are no operations in the system APIs that permit the data to be modified. Ensuring that a file cannot be modified would be difficult on a traditional file system.

Applications are completely free of any knowledge of where objects are stored in the system. There is a single complete view of all objects even as the system scales up into the petabyte range. If more capacity is required than a single Sun StorageTek 5800 system can provide, multiple systems can be combined into a single logical unit. No application changes are required, either to the code or to the application's configuration information, as the size of the Sun StorageTek 5800 system changes. Trying to scale a traditional file system-based approach into tens or hundreds of terabytes would require significant amounts of application logic devoted to placing and relocating files as the system changes.

For legacy applications that need a file and directory view of the data in the system, administrators of the Sun StorageTek 5800 system can create multiple views of the objects organized into directory hierarchies by metadata elements. For example, in a library system, the directory tree could be organized as genre/author/title. The views are available through HTTP using WebDAV.

Symmetric Cluster for Reliability

The Sun StorageTek 5800 system explicitly leverages a clustered design. All storage control and data path operations are distributed across the cluster to provide both reliability and performance scaling. Each node is completely independent of all other nodes, and there is complete symmetry in both hardware and software on each node. The system uses stateless design principles in the management, interface, and reliability designs to eliminate single points of failure and contention for resources.
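The principle that any node can coordinate any request, with no central lock for placement, depends on every node computing the same answer from the same inputs. The sketch below illustrates that property with rendezvous (highest-random-weight) hashing. This is a stand-in technique chosen for illustration, not the placement algorithm of the Sun StorageTek 5800 system; the function name `place_fragments`, the disk labels, and the fragment count are all hypothetical.

```python
import hashlib


def place_fragments(object_id: str, disks: list[str], fragment_count: int) -> list[str]:
    """Pick `fragment_count` distinct disks for an object, deterministically.

    Each disk is scored by hashing (object_id, disk); the highest scores win.
    The result depends only on the arguments, so any node that knows the same
    set of disks computes the same placement without talking to other nodes.
    """
    if fragment_count > len(disks):
        raise ValueError("not enough disks for the requested fragment count")

    def score(disk: str) -> int:
        digest = hashlib.sha1(f"{object_id}:{disk}".encode("utf-8")).digest()
        return int.from_bytes(digest, "big")

    return sorted(disks, key=score, reverse=True)[:fragment_count]


# Hypothetical cluster: 4 nodes with 4 disks each.
disks = [f"node{n}-disk{d}" for n in range(4) for d in range(4)]

# Two "nodes" hold the disk list in different orders but agree on placement.
view_a = place_fragments("object-42", disks, fragment_count=7)
view_b = place_fragments("object-42", list(reversed(disks)), fragment_count=7)
print(view_a == view_b)
```

Because the placement is a pure function of the object identifier and the disk set, no lock or directory lookup is needed to find or place an object. A production placement scheme would also have to spread fragments across distinct nodes and react to disk failures; both concerns are omitted here.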
Unlike a conventional high performance cluster, the Sun StorageTek 5800 system leverages clustered servers to fully distribute and load balance both processing and storage functions. A Sun StorageTek 5800 system consists of a cluster of nodes in which each node contains a standard processor, memory, networking, and storage. This design ensures that as capacity increases, each of the other resources increases in proportion as well. Each node provides an API and interface that allows all of the storage in the cluster to be accessed. This is achieved through the Sun StorageTek 5800 system object archive (OA), which implements all of the storage functions as a single image of all nodes with reliability and data integrity protection.

Figure 4. Sun StorageTek 5800 system symmetric cluster (diagram labels: Client; active Switch and standby Switch linked by a heartbeat; Storage Nodes; 16 nodes, 16 CPUs / 48 GB RAM, 64 SATA disks)

Cell Configurations

The cell is the basic building block of the Sun StorageTek 5800 system and consists of 16 storage nodes, two Gigabit Ethernet switches, and one Service Node. A full cell includes 16 storage nodes with sixty-four 500 GB SATA drives. A half-cell configuration with only 8 nodes and 32 drives is the minimum system configuration. The Sun StorageTek 5800 system is a self-contained, rack-mounted system. A rack can contain up to two full cells. Full cells can be combined into a logical unit called a hive. All cells are managed using a single management interface. Table 1 lists the components in the various configurations.

Table 1. Cell configurations

Component          Half Cell   Full Cell   Two Cells
Storage Nodes      8           16          32
CPUs/Memory        8/24 GB     16/48 GB    32/96 GB
Disks              32          64          128
Ethernet Switches  2           2           2 per cell
Service Nodes      1           1           1 per cell

The 1.1 software release allows up to two full cells to be combined into a hive.
Future releases are expected to support greater numbers of cells in a hive, enabling scalability into the petabyte range. The fact that there are multiple cells is completely hidden from the user. The multicell design stores, retrieves, queries, and deletes data transparently across multiple cells. The user does not need to know where the objects are stored or from which cells they are retrieved.

Reliable Storage

As discussed previously, fixed content is typically too large to realistically back up, and RAID 5 takes too long to rebuild after a disk failure. While today's disk drives are increasing dramatically in capacity, transfer rates are improving only very slowly. This means that each time drives double in capacity, it takes almost twice as long to completely write or read the contents of a drive. If one 500 GB drive in a RAID 5 array fails, the window of risk during which another drive could fail and cause data loss is roughly double that of a 250 GB drive. Mirroring is sometimes used to reduce this risk, but it uses only half of the available storage, making it unattractive from a cost perspective. Furthermore, if a mirrored disk fails, a second failure in the same mirror pair causes loss of data. Considering the number of disks required for fixed content storage, the likelihood of two simultaneous disk failures is increased, also making mirroring unacceptable.

In contrast, the Sun StorageTek 5800 system can withstand multiple disk and node failures. This is achieved through distributed RAID (RAID 6), in which objects are written across multiple disks and nodes using Reed Solomon encoding and a self-healing management system.

Reed Solomon Encoding

In order to provide reliability, both data and metadata are stored to disk in the Sun StorageTek 5800 system using the Reed Solomon encoding algorithm commonly used in RAID systems. The Reed Solomon algorithm efficiently encodes redundancy into a file to guarantee reliability in the face of failure of multiple parts of the storage system.
Each file is broken into a series of N fragments. From these N fragments, M additional parity fragments are generated using the Reed Solomon algorithm. If one of the N data fragments is lost, it can be regenerated using the parity fragments. The number of missing data fragments that can be reconstructed is equal to the number of parity fragments.

Figure 5. 5+2 encoding with five data fragments and two parity fragments (diagram labels: Object; D1 D2 D3 D4 D5; P1 P2)

The Sun StorageTek 5800 system uses N = 5 and M = 2, which means that there are five data fragments and two parity fragments. This is illustrated in Figure 5. Since there are two parity fragments (M = 2), up to two of the five (N = 5) data fragments can be reconstructed. It does not matter which fragments are lost; as long as no more than two are missing, the data remains fully available. For example, if D2 and P1 are not accessible, the Sun StorageTek 5800 system can still reconstruct the object using the remaining fragments, as shown in Figure 6.

Figure 6. Decoding with missing fragments (diagram labels: D1 D3 D4 D5 P2; Object)

The intelligent placement algorithm ensures that the seven fragments are distributed to seven different nodes in the system. This method allows for up to two component failures (disk or node) to occur without any effect on data availability. The system is therefore reliable in the face of multiple independent component failures. In addition, hot spares are not required.

In comparison to RAID 5, the Sun StorageTek 5800 system has higher reliability and better storage efficiency. RAID 5 is equivalent to N = 4, M = 1, which yields 67 percent efficiency with the use of a hot spare (4 of 6 disks hold data), compared with about 71 percent (5 of 7 fragments) for the 5+2 encoding. If an additional disk in the RAID 5 array fails while data is being reconstructed on the hot spare, a loss of data will occur. For high capacity drives in the 500 GB range, the reconstruction could take many hours. The data is vulnerable during that entire time.
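The 5+2 behavior described in this section can be illustrated with a toy erasure code. The sketch below is not the system's codec, which the paper does not show. It assumes a prime field GF(257), polynomial evaluation with Lagrange interpolation, and one small integer per fragment, and the names `interpolate`, `encode`, and `reconstruct` are hypothetical. It demonstrates only the property the text relies on: any five of the seven fragments are enough to recover the data.

```python
# Toy 5+2 erasure code over the prime field GF(257).
# Assumption: illustrative only. A production codec works on blocks of bytes
# and its field and layout are not described in the paper.
P = 257          # prime modulus, large enough to hold one byte value
N, M = 5, 2      # five data fragments, two parity fragments


def interpolate(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8 or later)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data):
    """Return seven fragments: data at x = 1..5, parity at x = 6 and 7."""
    assert len(data) == N
    points = list(zip(range(1, N + 1), data))
    parity = [interpolate(points, x) for x in range(N + 1, N + M + 1)]
    return list(data) + parity


def reconstruct(fragments):
    """Recover the five data values; lost fragments are passed as None."""
    points = [(x, y) for x, y in enumerate(fragments, start=1) if y is not None]
    if len(points) < N:
        raise ValueError("more than two fragments lost; data cannot be recovered")
    points = points[:N]                      # any five survivors are sufficient
    return [interpolate(points, x) for x in range(1, N + 1)]


data = [72, 105, 33, 0, 255]
stored = encode(data)
damaged = list(stored)
damaged[1] = None    # D2 lost, as in Figure 6
damaged[5] = None    # P1 lost, as in Figure 6
recovered = reconstruct(damaged)
```

Here `recovered` equals `data` even though one data fragment and one parity fragment are missing; losing a third fragment makes `reconstruct` raise an error.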
Placement Algorithm

The Sun StorageTek 5800 system uses novel patent-pending placement algorithms to determine where data and parity fragments are placed in the cluster. The algorithms’ goals are to:

• Avoid keeping a table that maps identifiers to physical locations. A table would need to be constantly updated as locations changed in response to failures or changes in cluster membership. This would adversely impact reliability and performance.
• Minimize resource thrashing resulting from failure or changes in cluster membership.
• Enable cluster-wide parallel recovery in the event of drive or node failure.
• Maximize reliability and availability by ensuring that no two fragments of the same file are stored on the same device or node.

See “Process for Storing an Object” on page 20 for an overview of how the placement algorithm meets these goals.

Self Healing

A full cell has 64 disks, and a disk or node failure is handled by up to 60 other disks in the remaining 15 nodes. The placement algorithm ensures that fragments of the same object are never stored on the same node, so that node failures can be handled without loss of data. For this reason, the three remaining disks in the node with the failed disk are not considered when placing the relocated fragments from the failed disk. Since each of the disks receives only 1/60th of the data from the failed disk, the load incurred on each disk during the self healing operation is minimal. Additionally, because the data is written to 60 drives, as opposed to a single hot spare, the rebuild time for the failed drive is greatly reduced, thus maximizing data protection. This eliminates the need for hot spares.

Once the self-healing process has completed, the cell is back at its full resiliency level, allowing up to two of the seven disks that hold an object to fail without data loss. The self healing process can be repeated to handle additional failures as long as there is capacity left in the cell.
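The figure of 60 disks follows directly from the cell geometry. The short sketch below works through that arithmetic for a full cell; the function name `healing_targets` is hypothetical, and the even 1/60th share assumes that rebuilt fragments are spread uniformly, which the paper states as the outcome but does not derive.

```python
# Full-cell geometry from the paper: 16 nodes with 4 disks each (64 disks).
NODES, DISKS_PER_NODE = 16, 4


def healing_targets(failed_node):
    """List the (node, disk) pairs eligible to receive rebuilt fragments.

    Every disk on the node that holds the failed disk is excluded, so that
    two fragments of one object never end up on the same node.
    """
    return [
        (node, disk)
        for node in range(1, NODES + 1)
        for disk in range(1, DISKS_PER_NODE + 1)
        if node != failed_node
    ]


targets = healing_targets(failed_node=6)
share = 1 / len(targets)   # fraction of the failed disk's data per target disk
print(len(targets), f"{share:.2%}")
```

With one disk lost on node 6, the other 15 nodes contribute 60 candidate disks, so each one rebuilds under 2 percent of the failed disk's contents.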
In order to ensure the ability to self heal, each cell only accepts requests to store new objects as long as the storage utilization is below 80 percent of capacity. If the 80 percent limit is reached, the system operates in a read-only mode.

Note: A half cell configuration of 8 nodes can only remain at full resiliency after the failure of a single node. After the healing process has completed, every object in the system has one of its seven fragments stored on each of the seven remaining nodes. If another node fails, the requirement that no two fragments can be stored on a single node must be relaxed in order to self heal. The subsequent failure of a node that contains two fragments from the same object removes all resiliency for that object. See “Self Healing” on page 26 for more details on how self healing works.

Capacity Balancing

The Sun StorageTek 5800 system's patent-pending placement algorithm is designed to spread data evenly across all of the available disks in the cell. Neither administrators nor applications need to make any decisions about where to store data or know where it is located in order to retrieve it. Even after the failure of a disk or node, the data from the failed component is evenly spread over the remaining disks in the cell by the self healing process. Since the self healing process also relocates data back to its original location when failed components are replaced, the even load of data across all disks is maintained.

If a half cell configuration of a Sun StorageTek 5800 system (8 nodes with 32 disks) is upgraded to a full cell configuration (16 nodes with 64 disks), the system's placement algorithm and self healing processes automatically spread the existing data onto the new disks. When the process is finished, given the way the placement algorithm works, all of the existing data will be situated as if it were originally stored on a full 16 node cell.
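One way to picture a table-free placement scheme that balances capacity is a seeded pseudo random ordering of disks, filtered by which disks are currently available. The sketch below is an assumption-laden illustration and not Sun's patent-pending algorithm: it uses Python's `random.Random` as the generator, a shuffle of all disks as the sequence, and a greedy walk that skips unavailable disks and nodes already used. The name `place` and the identifier value 4242 are hypothetical.

```python
import random
from collections import Counter

NODES, DISKS_PER_NODE, FRAGMENTS = 16, 4, 7
ALL_DISKS = [(n, d) for n in range(1, NODES + 1) for d in range(1, DISKS_PER_NODE + 1)]


def place(placement_id, available):
    """Return seven (node, disk) pairs for the given placement identifier.

    The same identifier and availability set always give the same answer,
    on any node, without consulting a lookup table.
    """
    rng = random.Random(placement_id)      # deterministic for a given seed
    sequence = list(ALL_DISKS)
    rng.shuffle(sequence)                  # seed-dependent ordering of all disks
    chosen, used_nodes = [], set()
    for node, disk in sequence:
        if (node, disk) not in available:  # mask out failed or missing disks
            continue
        if node in used_nodes:             # never two fragments on one node
            continue
        chosen.append((node, disk))
        used_nodes.add(node)
        if len(chosen) == FRAGMENTS:
            return chosen
    raise RuntimeError("not enough nodes available for seven fragments")


healthy = set(ALL_DISKS)
layout = place(4242, healthy)
after_failure = place(4242, healthy - {layout[0]})   # one chosen disk fails

# Count how often each disk is used across 10,000 placement identifiers.
usage = Counter(d for pid in range(1, 10001) for d in place(pid, healthy))
print(min(usage.values()), max(usage.values()))
```

In this sketch, removing one disk from the availability set changes exactly one entry of the layout, and the usage counts across all 64 disks come out close to one another, which is the balancing behavior the section describes.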
Background Integrity Checking

Maintaining data integrity is extremely important in a fixed content storage system, especially one that is designed to store data for long periods of time. Some applications have requirements for retention periods that are as long as 30 years. A well-known problem with magnetic media such as tapes and disk drives is that the magnetic fields can decay over time, causing errors when the data is read after a long period. Periodically reading all of the data in the system allows errors to be detected and corrected. Corrected blocks are written back out to disk.

The Sun StorageTek 5800 system continually runs a number of background checks that include reading every object in the system and verifying block-level checksums. All of the metadata in the system's high availability database is verified to ensure it is correctly stored, as well as properly indexed. “Background Integrity Checking” on page 48 discusses this topic in more detail.

Integrated High Availability Metadata Database

A large collection of fixed content requires structured data that can be used to easily locate data stored in the system. For example, in a repository of astronomical images, the structured data might include the coordinates, telescope used, date, and format of the image. A traditional system would use a relational database for this purpose. The expense for this could be quite high considering the cost of the database license, additional high availability hardware for the RDBMS, a database administrator, and of course the application logic to tie it all together.

The Sun StorageTek 5800 system includes a tightly integrated high availability database for storing the application-defined attributes of each object. The metadata stays with the objects in the system for the entire lifetime of those objects. Objects and their metadata records can be located using SQL select-like functionality.
Queries are very fast because they are executed in parallel across the cluster. The memory of all of the nodes in the cluster is used to cache metadata indexes for high performance. A separate relational database would be unable to take advantage of the relatively large amount of parallel processing power and memory available in the cluster.

The metadata database is fully distributed across the cluster, not just for performance but for reliability as well. All of the data is fully replicated and stored with the same level of resiliency as data objects. Failures of disks or nodes are handled gracefully without operator intervention. Another very important aspect of distributing the database across every node in the system is that it helps ensure a balanced design. As the amount of storage in the system grows, the CPU and memory resources available to the database grow proportionally.

Programmatic access to the database is fully integrated in the Sun StorageTek 5800 system's APIs. No additional libraries are required other than the provided Sun StorageTek 5800 system Java or C library. Java developers familiar with Java Database Connectivity (JDBC™) should be able to use the Sun StorageTek 5800 system’s Java API with ease. The system’s database is designed to be fully self-maintaining. A database administrator is not required, contributing to the system's very low TCO.

Large Object Support

A fixed content system may need to store very large objects that would exceed the limits imposed by many conventional file systems. The Sun StorageTek 5800 system is designed not to impose restrictions on the size of objects. A key principle in accomplishing this goal is that there are no operations that require an object to fit into memory or virtual memory. All operations in the Sun StorageTek 5800 system and the provided client libraries stream the data object as it is being read or written.
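The streaming principle can be illustrated with the SHA-1 integrity hash that the system stores for each object. The sketch below is a generic Python example, not code from the client libraries: it computes a SHA-1 digest one chunk at a time so that memory use is independent of object size. The function name `sha1_of_stream` and the 64 KB chunk size are assumptions.

```python
import hashlib
import io


def sha1_of_stream(stream, chunk_size=64 * 1024):
    """Return the SHA-1 hex digest of a binary stream, one chunk at a time."""
    digest = hashlib.sha1()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # an empty read signals end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


# Usage: any file-like object opened in binary mode works the same way.
payload = io.BytesIO(b"abc")
checksum = sha1_of_stream(payload)
```

Because only one chunk is held in memory at a time, the same function handles a three-byte object and a multi-gigabyte one.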
Network Design

The Sun StorageTek 5800 system is a complete, self-contained cluster with high availability features. Included with the system are two internal 24 port Gigabit Ethernet switches. The two switches are configured as a primary and a hot standby with automatic failover.

Perhaps the most important feature of the Sun StorageTek 5800 system’s network design is its ease of use. The cluster appears to applications as a single virtual IP address for data access, as if it were a single server. For security purposes, the system's administrative access is provided through a separate dedicated virtual IP address. The cluster and its implementation details are completely hidden from the customer's network. No networking administration is required other than assigning the data and administration virtual IP addresses. The administrator does not need to manage the cluster or the switches. This contributes to keeping the Sun StorageTek 5800 system’s TCO very low.

The integrated switches are the heart of the Sun StorageTek 5800 system's internal network. The switches transparently spread incoming requests across all available nodes (16 in a full cell configuration or 8 in a half cell). This provides improved performance by helping to ensure that the processing power of each node is utilized. Additionally, resiliency is improved by spreading the requests to all available nodes. The switches are off-the-shelf hardware with the addition of customized firmware to provide the load spreading functions.

The redundant switches eliminate single points of failure. One switch is the primary at any given time, while the other is the hot standby. A heartbeat between the switches enables the standby switch to detect a failure of the primary and take over. Each switch is connected to each node. There are redundant connections between the two switches to provide heartbeat monitoring.
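The load spreading performed by the switches can be pictured with a small sketch. The paper's description of the store process says the switch hashes the source port of the incoming connection to a node number between 1 and 16, and that a failed node's requests go to an alternate node. The actual hash and the rule for choosing the alternate are not given, so the plain modulo and the "next healthy node" rule below are assumptions, as is the name `pick_node`.

```python
NODES = 16


def pick_node(source_port, healthy):
    """Map a client's TCP source port to a node number in 1..NODES.

    Assumptions: the hash is a plain modulo, and traffic for a failed node
    goes to the next healthy node in numeric order.
    """
    if not healthy:
        raise RuntimeError("no healthy nodes")
    node = source_port % NODES + 1
    while node not in healthy:
        node = node % NODES + 1     # advance, wrapping from 16 back to 1
    return node


healthy = set(range(1, NODES + 1))
first = pick_node(49152, healthy)               # one connection from a client
second = pick_node(49153, healthy)              # next connection, new source port
rerouted = pick_node(49152, healthy - {first})  # same port after that node fails
```

Consecutive connections from the same host use different source ports and so land on different nodes, while a request that would have gone to a failed node is redirected to a healthy one.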
The network design, illustrated in Figure 7, helps eliminate bottlenecks by spreading the load across all of the available nodes. The switches keep all node-to-node traffic for operations, such as distributing objects across nodes, off of the external network. The full bandwidth of the switch's backplane is available for node-to-node traffic.

Figure 7. Sun StorageTek 5800 system clustered architecture (diagram labels: Administration via CLI or GUI; Clients running client library; Data IP; Management IP; Customer network; Inside a Sun StorageTek 5800 system: L2 switches (active & standby), Opteron/SATA storage nodes, Gigabit Ethernet interconnect, Metadata space, Data space)

Scalability

The Sun StorageTek 5800 system is designed to be deployed in a modular fashion, with capacity added on demand in reasonably sized increments. Clients do not need to know the physical locations of their files, so data can be migrated across new resources without disrupting clients. New server and storage resources can simply be plugged into the cluster without interrupting availability.

Low TCO

A low total cost of ownership was one of the primary design goals of the Sun StorageTek 5800 system. Starting from the hardware design, commodity components that benefit significantly from economies of scale are used throughout. The use of expensive, high availability hardware components is avoided by using a software design that takes advantage of a large number of nodes and can gracefully handle failures. SATA 7200 RPM disks are used, which are much more appropriate to a large fixed content storage system than the SCSI/FC 10,000 or 15,000 RPM drives that are typically used in enterprise servers. The SATA drives have a higher storage density, allowing the system to store more in the same amount of physical space.
7200 RPM drives consume less power, generate less heat, and require less cooling than drives with faster rotational speeds. Since the content in the system does not change after it is written, there would be very little performance benefit in using faster rotational speeds.

The vast majority of the savings in the Sun StorageTek 5800 system's TCO come from operations and administration. There are no RAIDs, LUNs, volumes, hot spares, or Fibre Channel infrastructure to manage. The storage in the Sun StorageTek 5800 system is self-maintaining; it automatically self heals when disks or nodes fail. Content is automatically migrated back to repaired components. While the system is implemented as a cluster, the user does not have to manage the complexity of a cluster; the system appears to the clients as if it were a single server. The entire cluster and all cables are contained in a single cabinet. Administrative functions work across the entire cluster with a single command.

The Sun StorageTek 5800 system also provides cost savings in terms of software licensing and annual software maintenance. The integrated high availability metadata database avoids the need for a separate relational database license with high availability features. The highly reliable object store eliminates the need for volume management software, and there is no need for additional cluster management software, thus reducing the cost and complexity of the system.

Application Program Interfaces (APIs)

The Java and C language APIs enable data and metadata to be stored, retrieved, queried, and deleted through Java and C client libraries. Sample applications and command-line routines that demonstrate the Sun StorageTek 5800 system’s capabilities, and that serve as good programming examples, are provided in the client software developer’s kit (SDK).
The SDK also provides an emulator that imitates the behavior of a Sun StorageTek 5800 system, allowing developers to test software or applications without access to a physical Sun StorageTek 5800 system. The emulator can run on any system with a compatible Java Virtual Machine (JVM), including the Solaris™ Operating System (OS), Red Hat Enterprise Linux, Microsoft Windows, and Mac operating systems.

Chapter 3
Overview of Data Management

This chapter describes the key processes (storage, query, retrieval, self healing, capacity balancing, networking) in detail in order to clearly illustrate the design advantages (distributed RAID, symmetric cluster, direct data paths) and how they work together to form an ideal fixed content storage system.

Process for Storing an Object

The Sun StorageTek 5800 system stores each data object across seven storage nodes and disks, using Reed Solomon encoding to break up a data block into five data fragments and two parity fragments. The system can tolerate up to two missing data or parity fragments for each object. After a failure of a disk or a storage node, the system re-distributes the data and/or parity to other storage nodes and disks. After a rebuild cycle, the system can tolerate another two missing data or parity fragments.

When a request to store a data object comes into the system, the Gigabit Ethernet switch determines which storage node to direct the request to. The selected node divides the object into fragments and calculates additional parity fragments for resiliency. A placement algorithm then decides, out of thousands of possible layouts, where to put the pieces. The fragments are then distributed to the selected nodes. In more detail, the process is as follows:

1. The application issues a store request. The client library initiates an HTTP connection to the data virtual IP address of the cluster.
(Figure 8)

Figure 8. Step 1 (diagram labels: Client; Store request; Data; Gigabit Ethernet Switch; storage nodes, each with Protocol handler, RAID codec, and Fragment placement algorithm)

2. The active Ethernet switch determines which node receives the request and becomes the coordinator for this request. The switch makes this determination by parsing the incoming packet and hashing the source port number. The result is a number between 1 and 16, which is the number of the node to which the switch forwards all of the packets of this request. This effectively spreads incoming requests across all of the nodes of the cluster. An advantage of using the source port number as part of the hash input is that each new connection, even if it is from the same host, uses a different source port number, thus spreading the load to other nodes. TCP/IP defines requirements for the length of time before a port number can be reused. The system keeps track of the health of the storage nodes. When a node fails, the switch is programmed with an alternate node to receive the requests on behalf of the failed node. If another node fails, the switch picks a different alternate node.

3. The coordinator receives the request and data (Figure 9). The node that was selected by the Ethernet switch becomes the coordinator for the entire request. The coordinator receives the object to be stored and writes it into temporary storage. A SHA-1 hash is computed for the received object contents.

Figure 9. Step 3 (diagram labels: Client; Data; Gigabit Ethernet Switch; storage nodes, each with Protocol handler, RAID codec, and Fragment placement algorithm)

4. The coordinator divides the object into the appropriate sized blocks. Each block is then broken down into five data fragments for encoding via the Reed-Solomon algorithm (Figure 10).
The Sun StorageTek 5800 system uses N = 5 data fragments and M = 2 parity fragments. The two parity fragments can be used in the event of a failure to reconstruct up to two missing data fragments. Additional fragment-level checksums are computed to verify data integrity during later retrieval.

Figure 10. Step 4 (diagram labels: Client; Gigabit Ethernet Switch; storage nodes, each with Protocol handler, RAID codec, and Fragment placement algorithm)

5. The coordinator executes the object placement algorithm. The seven fragments (five data and two parity) need to be assigned to 7 disks in the system. The (patent pending) placement algorithm determines which 7 out of the 64 total drives in a full cell (or 32 drives in a half cell) are to be used. No two fragments can be placed on the same node, since that would allow a single node failure to eliminate all resiliency for that object (Figure 11). There are approximately 10,000 possible layouts for the data fragments. The algorithm picks a random number between 1 and 10,000, which is the placement identifier. The placement identifier (PI) becomes part of the permanent object identifier (OID). The PI is then used as the seed for a deterministic pseudo random number generator. The property of a pseudo random number generator that is important here is that, when it is initialized with the same seed number, it returns the same sequence of numbers every time. The generated list of numbers is then masked by the list of disks that are available at that point in time. This yields the list of disks to assign the fragments to. There are several important properties of the placement algorithm. The first is that it can be run on any node of the cluster at any time and produces identical results given the same starting value, or placement identifier.
Second, the algorithm is stateless: it does not depend on the results of any prior calculations or state from the rest of the cluster. Third, changes in the available disk mask produce very small changes in the sequence of disks to use.

Figure 11. Step 5 (diagram labels: Client; Gigabit Ethernet Switch; storage nodes, each with Protocol handler, RAID codec, and Fragment placement algorithm)

6. The coordinator distributes fragments to other nodes. The coordinator distributes the seven fragments to the seven nodes that have the selected disks attached. The traffic flows through the Ethernet switch using the private IP addresses of the nodes. This traffic never appears on the client network. The full internal bandwidth of the Ethernet switch is available for inter-node communications.

Figure 12 illustrates how data and metadata are distributed throughout the system.

Figure 12. Example of the process of breaking an object down into blocks and then data and parity fragments for distribution to seven disks on seven different nodes

Figure 13 depicts how the seven fragments of two different objects are placed on disks throughout the system.

Figure 13. Two objects distributed across nodes and disks

The five data fragments of Object 1, shown as squares, are located on node 1 disk 3, node 3 disk 3, node 6 disk 4, node 8 disk 3, and node 14 disk 2. The two parity fragments are located on node 12 disk 2 and node 9 disk 3. Note that no two fragments are placed on the same node.

Storing an object on the Sun StorageTek 5800 system is either all or nothing. The object must have been completely and reliably stored for the store operation to be considered complete. There are no partial stores.
If a store operation is interrupted, the entire operation fails. Once an OID is returned to the application, the object is known to be durable. This is accomplished without the use of a central transaction coordinator by making use of the synchronous nature of certain file system primitives in the Sun StorageTek 5800 system’s underlying operating system. Avoiding a central transaction coordinator eliminates a potential bottleneck and reduces complexity.

Process for Retrieving an Object

The process for retrieving an object is as follows:

1. The application makes a call to the client library requesting the object by its OID. The client library makes an HTTP connection to the data VIP of the cluster.

2. The Ethernet switch determines which node will be the coordinator for this request and forwards the request packets to that node.

3. The coordinator node extracts the placement ID (the random number used by the placement algorithm when the object was originally stored) from the OID and uses it to generate the same sequence of seven disks on which the five data fragments and two parity fragments are stored. The sequence is then masked with the list of available disks.

4. The data fragments that are stored on available disks, plus parity fragments if any of the data fragments are not available, are requested from the nodes that have those disks attached, via the private IP addresses of those nodes.

5. The coordinator receives the data and reassembles the object. If any of the data fragments are missing, the parity fragments are used to reconstruct the missing data fragments using Reed Solomon. Fragment-level checksums are used to verify that each fragment has been retrieved correctly. If the checksum verification fails, the parity fragments are used to reconstruct the fragment in the same way as if it were unavailable due to a disk failure.

6. The coordinator computes the SHA-1 hash of the reassembled object in order to check the integrity of the retrieved data.
If the verification succeeds, the file is then streamed back to the client.

Self Healing

Once the system has determined that a disk has failed, it must recover the fragments of the objects that were stored on that disk and relocate them to other disks in the system in order to restore full system resiliency. The system is designed to be able to handle the loss of up to two disks or two nodes before any data becomes unavailable. While the healing process is running, the system can still recover the data if another disk or node fails. Once the healing process is complete, the system is again able to handle up to two failures without any data becoming unavailable. As long as there is still capacity in the system, the system can self-heal to maintain full resiliency. This allows hardware service to be deferred until a convenient time, possibly for several months.

Recovering from a Failed Disk

Once a disk fails, the system must determine all of the objects that had fragments stored on that disk. There are approximately 10,000 possible disk layouts that can be used for storing an object. The Placement ID that was assigned to each object when it was stored determines which layout is used by that object. All objects with the same Placement ID share the same layout. The Placement Algorithm is used to determine which Placement IDs would result in data being stored on the failed disk. For each Placement ID that is affected, the Placement Algorithm indicates which disk should be used to store the reconstructed fragment.

The process of recovering from a failed disk is as follows:

1. Every node in the cell is notified that a disk has failed. There is a waiting period to ensure the failure is not a transient problem, such as a node in the process of rebooting. This prevents thrashing.

2. Using the process described above, each node in the cell determines if one of its local disks needs to store reconstructed content due to the failed disk. All of the nodes do this in parallel, which speeds up recovery and minimizes the performance impact by spreading the load out across the entire cell. Determining the subset of Placement IDs that require local reconstruction efficiently partitions the work for each node and prevents duplicated effort. If a node determines that reconstructed data is to be stored on one of its local disks, the node is responsible for rebuilding the data.

3. The list of objects for each affected Placement ID is determined. The remaining fragments of each object are retrieved from the other nodes. The Reed Solomon coding algorithm is then used to reconstruct the missing fragments.

4. The reconstructed fragments are written out to the selected local disk. Full resiliency for that object is now restored.

It is important to understand that the failure of one disk can be handled by up to 60 other disks in a full cell configuration. Each Placement ID generates a different sequence of disks to use. Different Placement IDs may utilize the same disk somewhere in their sequences. Different sequences will yield different alternate disks to use in the case of failure.

Figure 14 illustrates an example of the recovery process. There are 16 nodes with four disks each, depicting a standard full cell configuration. Node 6 disk 4 contains a data fragment from object 1 and a data fragment from object 2. Disk 4 fails, and after the self healing process the data fragment from object 1 has been reconstructed on node 7 disk 2 and the data fragment from object 2 is now on node 15 disk 2.

Figure 14. Disk failure and self healing

Recovering from the failure of a node is identical to the process of recovering from a failed disk. All of the remaining nodes in the cell run the recovery process described above.
However, there could be up to four times as many Placement IDs affected.

Re-distributing Content to a Replaced Disk

Content is automatically re-distributed when a disk is replaced. The process for redistributing content to a replaced disk is as follows:

1. When a new disk is inserted into a node, the threads that monitor each disk detect it. The disk is first formatted. When the formatting is complete, the disk is mounted.
2. Now that the new empty disk is available for storing data, the healing service determines what content should be on this disk. It does this by examining the placement ID for each object in the cell. Given a placement ID, the placement algorithm indicates whether this disk should contain a fragment for that object. The list of disks returned by the placement algorithm also indicates which disk holds the fragment while this disk is unavailable for that particular placement ID.
3. The healing service retrieves the fragment that needs to be relocated from the disk where it is currently stored and copies it onto the new disk. If for some reason the disk currently holding the fragment is unavailable, the fragment can be reconstructed the same way it would be if a disk had failed.

Given the way the self healing process works, the same process can also be used to support technology upgrades. When additional storage capacity is needed and newer, denser disks are supported, an older disk can be replaced. The self healing process reconstructs the data on the new, larger disk.

Multi-Cell Operation

When a larger amount of storage is required than one cell can provide, additional cells can be added and configured to act as one logical system. This is called a hive. The system administration tools allow the cells to be managed as one logical system.

No changes to the applications are required when a system grows into a multi-cell configuration. The application only needs to be configured with the data virtual IP (VIP) address of one of the cells.
Typically this is the data VIP of the first cell. Cells in a hive configuration are aware of the existence of the other cells, their status, load, and configuration.

When an application issues a store request to a cell that is part of a multicell system, the client library transparently retrieves the configuration, including the number of cells, the data VIP address for each cell, and each cell's current utilization. The client library randomly picks two cells from the list. It then sends the storage request to whichever of the two has more capacity available.

The permanent Object Identifier (OID) that is assigned when data is stored contains the number of the cell it is stored on. When a request is made to retrieve an object by its OID, the client library decodes the cell number in order to determine which cell to connect to in order to fetch the object.

The client library also transparently handles searches across a multicell hive. The search request is issued to all cells. When the results are returned, the client library combines the results and returns them to the requesting application. Since the client libraries handle all of the details, the application code and configuration for a multicell system are identical to those for a single cell system.

Adding Capacity

When a half cell configuration of 8 nodes is upgraded to a 16-node full cell, the system can automatically spread the existing content across the nodes in order to evenly balance the available capacity. The process is essentially the same as the healing process for replacing a disk. The healing service on each of the new nodes examines all of the objects in the system and determines, by their placement ID, which of those objects should be located on the new disks.
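The cell selection rule described under Multi-Cell Operation (pick two cells at random, then store to the one with more free capacity) can be illustrated with a short sketch. This is not the client library's code: the names CellSelector, CellInfo, and chooseCell are hypothetical, free capacity is modeled as a plain byte count, and the sketch assumes the two random picks are distinct, which the paper does not state.

```java
import java.util.List;
import java.util.Random;

// Illustrative sketch only: CellSelector and CellInfo are hypothetical names,
// not part of the Sun StorageTek 5800 client library.
final class CellSelector {

    /** Hypothetical snapshot of one cell's address and utilization. */
    static final class CellInfo {
        final int cellId;
        final String dataVip;
        final long freeBytes;

        CellInfo(int cellId, String dataVip, long freeBytes) {
            this.cellId = cellId;
            this.dataVip = dataVip;
            this.freeBytes = freeBytes;
        }
    }

    /** Picks two distinct cells at random and returns the one with more free capacity. */
    static CellInfo chooseCell(List<CellInfo> cells, Random rng) {
        if (cells.isEmpty()) {
            throw new IllegalArgumentException("hive has no cells");
        }
        if (cells.size() == 1) {
            return cells.get(0);
        }
        int first = rng.nextInt(cells.size());
        int second = rng.nextInt(cells.size() - 1);
        if (second >= first) {
            second++; // skip over the first pick so the two candidates differ (assumption)
        }
        CellInfo a = cells.get(first);
        CellInfo b = cells.get(second);
        return a.freeBytes >= b.freeBytes ? a : b;
    }
}
```

Comparing only two random candidates keeps each store request cheap while still steering new data away from the fullest cells: in this sketch, the cell with the least free capacity can never win a comparison against a different cell.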
Chapter 4
Software Interface Overview

The Sun StorageTek 5800 system provides Java and C APIs to enable developers to easily build fixed content applications. The APIs include all of the functions for storing and retrieving objects, as well as querying the attributes in the integrated metadata database. Legacy applications can access the data through HTTP using WebDAV. This chapter discusses the key concepts, the primary operations common to both APIs, the query language, and WebDAV virtual views.

Key Concepts

The following concepts are important in understanding the software interface of the system.

Object Archive

An object in the Sun StorageTek 5800 system consists of two components: the opaque binary file data, and structured data in the form of attribute/value lists. The structured data makes it easy to locate objects by constructing queries. No effort is required by applications to keep the structured data and file data synchronized.

The system operates only on whole objects. Unlike a file system, there are no record level operations that provide read or write access at a level below the whole object.

Note: For applications to efficiently handle large objects, retrieval requests can specify byte ranges, for example from byte 10,001 up to the end of the file. Both store and retrieve requests stream data, so buffering an entire object in memory is never required.

Metadata

Structured data in the form of attribute/value lists is stored with the file data. The metadata is stored in the system's object archive as an XML document, the same way that file data is stored. Therefore, metadata is stored with the same reliability features as file data: up to two disks or nodes can fail without impacting metadata availability. Metadata is indexed by the internal highly available database and kept cached in memory across the cluster for fast searches. The database could be completely reconstructed from the metadata XML documents stored in the object archive.
The Sun StorageTek 5800 system automatically assigns system metadata to every object when it is stored on the system. System metadata includes the unique identifier for each object, called the Object ID or OID. System metadata also includes creation time, data length, and data hash.

Extended metadata goes beyond the system metadata to further describe each data object. For example, if the data stored on the Sun StorageTek 5800 system includes medical records, extended metadata attributes might include patient name, date of visit, doctor name, medical record number, and insurance company. Users can issue queries to retrieve data objects using these attributes. For example, a query could retrieve all records (data objects) for a given doctor and a particular insurance company.

The schema describes what attributes are available. Users can define attributes through the system's administrative interfaces. Table 2 shows an example of a schema file for a system storing music files. This example also defines two WebDAV virtual file system views, one indexed by artist and the other by album. A table named mp3 is defined for retrieving logically grouped metadata.

Table 2. Example schema file
Attributes are typed, and the types include date, time, timestamp, strings, and binary. Strings are UTF-8 for supporting multilingual applications. Table 3 lists the supported metadata types.

Namespaces can be defined as containers to group attributes on a per application basis. Namespaces are essentially directories of metadata names. Just as directories can include subdirectories, namespaces can include subnamespaces (namespaces within namespaces). There is no limit on the number of subnamespaces within a given namespace.

Table 3. Supported metadata types

Valid Types   Description
Long          64 bits. Maximum value: 9223372036854775807. Minimum value: -9223372036854775808.
Double        64 bits. Maximum value: 1.7976931348623157E308d. Minimum positive value: 4.9E-324d.
String        A string of characters from the Basic Multilingual Plane of Unicode values, excluding the null character (0). Characters from the range of Unicode surrogates (D800-DFFF) are not supported. Length can be 0 to 4000 Unicode characters.
Char          A string of eight-bit characters in the ISO-8859-1 (Latin-1) character set, excluding the null character (0). Length can be 0 to 8000 Latin-1 characters.
Binary        A string of bytes from the range 00 to FF. Length can be 0 to 8000 bytes.
Date          Corresponds to the JDBC SQL DATE type. Year/Month/Day.
Time          Corresponds to the JDBC SQL TIME type with precision 0 (seconds past midnight).
Timestamp     Corresponds to the JDBC SQL TIMESTAMP type with precision 3 (absolute Year/Month/Day/Hour/Minute/Second/Millisecond).
ObjectID      Binary that specifies the OID of the data.

Write Once Read Many (WORM) Storage

The software interface does not include any operations to update or change a stored object in any way. If an application needs to change an object, the changed version is stored as a completely new object with a new OID. The previous object is unaffected. The application would need to delete the previous version if the goal is to completely replace the object.
Future versions of the Sun StorageTek 5800 system software are expected to support retention periods and legal holds. A retention period prevents an object from being deleted until the period has expired. In the case of a legal hold, the object cannot be deleted until the legal hold is removed.

Object Identifier (OID)

When an object is stored it is assigned a permanent OID in the form of a 30 byte value. The OID is the key to retrieving an object. OIDs do not change and are globally unique. This makes the OID ideal if an application has a requirement to store an identifier external to the Sun StorageTek 5800 system.

While the OID is opaque to users and applications, the Sun StorageTek 5800 system encodes several key pieces of information into the permanent OID. The OID can be used to uniquely identify the correct location of the object even as the topology of the cell or hive changes over time.

Primary Operations

Both the Java and C APIs are capable of the following primary functions.

Create (Store) an Object

This operation stores file data as well as any metadata that is supplied. After the object is reliably stored (see Chapter 3), an object identifier (OID) is returned. The OID can be used later to retrieve the contents of the stored data or the metadata associated with it.

Retrieve Object

The unstructured file data of an object can be retrieved by providing its OID. The OID is returned from store requests as well as from metadata queries.

Adding Metadata to an Existing Object

This operation allows the storage of additional attribute/value metadata for an existing object in the Sun StorageTek 5800 system. This facilitates locating objects by searching on attribute/value pairs. A new object is created and a new OID is returned. Objects in the Sun StorageTek 5800 system consist of both the file data and the structured data.
The newly returned object references the data stream component of the existing object, as illustrated in Figure 15.

[Figure: Object 1 and Object 2 both reference one shared data stream]
Figure 15. Multiple metadata

Querying Metadata

For simple queries, the mechanism returns the OIDs of matching objects. SQL select style functionality can be used to search the internal database and return matching metadata records and/or object identifiers that can then be used to retrieve the data stream. Matches are on a per object basis, so if additional objects have been created using the add metadata operation, multiple records that share the same underlying data stream might be returned. The application is unaware that the query is actually executed across the entire cluster. See the section on the Sun StorageTek 5800 system query language.

Deleting an Object

The object defined by the supplied OID is deleted from the system. When all objects that share the same underlying data stream have been deleted, the storage for the data stream is released so that the disk space may be reused for new object storage.

Retrieving the Schema

The system schema, including both system defined and user defined attributes and their types, is returned to the requesting application.

Sun StorageTek 5800 System Query Language

The Sun StorageTek 5800 system defines a query language similar to SQL for retrieving metadata records and their corresponding data streams from a Sun StorageTek 5800 system. The query language is similar to the unified SQL that is supported through the JDBC interface and contains many SQL-like features. The query format is similar to the where clause of an SQL query. The two main differences are that Sun StorageTek 5800 system queries do not contain embedded subqueries, and that the only columns available are the attributes defined in the Sun StorageTek 5800 system schema.
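The relationship between the add metadata and delete operations described earlier in this chapter (several objects sharing one data stream, with the stream's storage released only when the last of them is deleted) amounts to reference counting. The following is a minimal in-memory model of that behavior under stated assumptions: SharedStreamModel and its methods are hypothetical names, OIDs are simplified to short strings rather than 30 byte values, and nothing here reflects how the system actually tracks references.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only: class and method names are hypothetical, and OIDs
// are simplified to short strings rather than the real 30 byte identifiers.
final class SharedStreamModel {
    private final Map<String, String> streamOfObject = new HashMap<String, String>();
    private final Map<String, Integer> referenceCount = new HashMap<String, Integer>();
    private int nextId = 1;

    /** Stores new file data and returns the OID of the new object. */
    String store() {
        String streamId = "stream-" + nextId;
        return link(streamId);
    }

    /** Adds metadata to an existing object: creates a new object that shares its data stream. */
    String addMetadata(String existingOid) {
        String streamId = streamOfObject.get(existingOid);
        if (streamId == null) {
            throw new IllegalArgumentException("unknown OID: " + existingOid);
        }
        return link(streamId);
    }

    /** Deletes one object; returns true when the underlying data stream was released. */
    boolean delete(String oid) {
        String streamId = streamOfObject.remove(oid);
        if (streamId == null) {
            throw new IllegalArgumentException("unknown OID: " + oid);
        }
        int remaining = referenceCount.get(streamId) - 1;
        if (remaining == 0) {
            referenceCount.remove(streamId); // last reference gone: space can be reused
            return true;
        }
        referenceCount.put(streamId, remaining);
        return false;
    }

    /** Creates a new OID that references the given stream and bumps its count. */
    private String link(String streamId) {
        String oid = "oid-" + nextId++;
        streamOfObject.put(oid, streamId);
        Integer count = referenceCount.get(streamId);
        referenceCount.put(streamId, count == null ? 1 : count + 1);
        return oid;
    }
}
```

In this model, deleting the original object after metadata has been added leaves the data stream in place, because the second object still references it; only deleting both releases the storage.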
Java API

The Sun StorageTek 5800 system Java client library provides a platform-independent mechanism to upload data and metadata to a Sun StorageTek 5800 system, and to retrieve and query the data and metadata. The Java client library works with any implementation of J2SE™ platform version 4.0 or later with HTTP connectivity to the Sun StorageTek 5800 system.

The library is designed to be high-level and easy to use. Most applications can be implemented using a handful of the library classes. Almost all operations are accomplished through a single (synchronous) method invocation. The client library is implemented in a fairly small JAR file that is less than 500 KB as of the 1.1 release. Applications only need to be deployed with that single JAR file.

The Java client API interacts with the Sun StorageTek 5800 system server entirely through the HTTP protocol. The HTTP communication layer uses the open source Apache Commons HTTP client. Object data is streamed through the Java client library opaquely, and a well-defined SHA-1 data hash is returned for verification purposes.

The root of the Sun StorageTek 5800 system Java client API is the NameValueObjectArchive class, which represents a connection to a single Sun StorageTek 5800 system. All operations are initiated by invoking methods on a NameValueObjectArchive instance after initializing it with the address of a cluster. The fact that a cluster of machines, rather than a single server, is handling the requests is transparent to the application programmer.

A NameValueObjectArchive uses instances of the ObjectIdentifier class to uniquely identify stored data objects. That is, there is a one-to-one correspondence between instances of ObjectIdentifier and Sun StorageTek 5800 system metadata objects.

The NameValueObjectArchive class hides some of the more advanced details of Sun StorageTek 5800 system metadata and makes it easy to implement applications.
When using NameValueObjectArchive, all metadata queries are executed against a Sun StorageTek 5800 system user-configurable index of name-value pair lists. This class also ensures that a metadata entry is created for every data object stored, even if no metadata is provided at store time.

An instance of the NameValueObjectArchive class functions as a proxy for the Sun StorageTek 5800 system server. Instantiation incurs some overhead in establishing communication, so reusing a single instance is the recommended practice. The same instance can be shared by multiple threads.

NameValueObjectArchive also allows all metadata operations to be performed in terms of two classes that represent metadata records: SystemRecord and NameValueRecord. These classes represent Sun StorageTek 5800 system metadata entries. When using NameValueObjectArchive, every stored data object has a corresponding NameValueRecord that contains the extended attributes stored with that data object, and each NameValueRecord has a reference to its SystemRecord, which contains built-in system attributes such as data object size and creation time. In this model, all instances of ObjectIdentifier returned from store operations and metadata queries correspond directly to instances of NameValueRecord.

The results of a Sun StorageTek 5800 system metadata query are returned using instances of the QueryResultSet class, which the application can step through to retrieve metadata or identifiers. This class, along with ObjectArchive, manages the details of fetching one batch of results after another.

Table 4 provides a code example of an application storing a file using the Java API.

Table 4. Example application storing a file using the Java API

    import java.io.*;
    import java.util.Map;

    import com.sun.honeycomb.client.NameValueObjectArchive;
    import com.sun.honeycomb.client.SystemRecord;
    import com.sun.honeycomb.client.NameValueRecord;
    import com.sun.honeycomb.common.ArchiveException;

    // Upload file to the specified StorageTek 5800 server.
    // CommandLine is a command-line parsing helper whose source is not shown here.
    public class StoreFile {

        public static SystemRecord storeFile(String server, String file, Map metadata)
                throws ArchiveException, FileNotFoundException, IOException {
            // Create a NameValueObjectArchive as the main entry point
            NameValueObjectArchive archive = new NameValueObjectArchive(server);
            NameValueRecord r = archive.createRecord();
            r.putAll(metadata);

            // Store the file in the StorageTek 5800 server,
            // closing the input stream whether or not the store succeeds
            FileInputStream in = new FileInputStream(file);
            try {
                return archive.storeObject(in.getChannel(), r);
            } finally {
                in.close();
            }
        }

        public static void main(String[] argv) {
            try {
                CommandLine commandline = new CommandLine(StoreFile.class, 2);
                // Indicate recurring -m metadata flag with values
                commandline.acceptFlag("m", true, true);

                if (commandline.parse(argv) && !commandline.helpMode()) {
                    // Upload the specified file and print out the resulting
                    // Object Identifier which can be used to retrieve it.
                    String server = commandline.getOrderedArg(0);
                    String file = commandline.getOrderedArg(1);
                    // retrieve parsed "name=value" pairs
                    Map metadata = commandline.getNameValuePairs("m", "=");
                    SystemRecord sr = storeFile(server, file, metadata);
                    System.out.println(sr.getObjectIdentifier());
                } else {
                    if (!commandline.helpMode()) {
                        System.exit(1);
                    }
                }
            } catch (Exception e) {
                System.out.println("Operation failed... " + e);
                System.exit(1);
            }
        }
    }

C API

The Sun StorageTek 5800 system also provides a multiplatform, synchronous C API in which operations are accomplished in a few simple function calls. The C client library interacts with the Sun StorageTek 5800 system entirely through an HTTP based protocol.
The API calls include operations for storing, retrieving, deleting, and querying data and metadata records. Operations block until they complete. The straightforward synchronous interface is fully thread-safe and can be used simultaneously in multiple threads from the same process. It can also be used by an application that is not threaded.

Table 5 provides a code example of an application storing a file using the C API.

Table 5. Sample application storing a file using the C API

    /* Callback required by hc_store_both_ez; defined after main. */
    long read_from_file(void* stream, char* buff, long n);

    int main(int argc, char* argv[]) {
        int fileToStore = -1;
        hc_session_t *session = NULL;
        hc_nvr_t *nvr = NULL;

        ... Parse Command Line ...

        /* Open file that is going to be stored to StorageTek 5800 */
        if ((fileToStore = open(cmdLine.localFilename, O_RDONLY)) == -1) {
            ... File Open Failed, exit with error ...
        } else {
            /* Send file and metadata to StorageTek 5800 */
            hc_system_record_t system_record;
            hcerr_t res;

            res = hc_init(malloc, free, realloc);
            if (res != HCERR_OK) { ... init error, exit ... }

            res = hc_session_create_ez(cmdLine.storagetekServerAddress,
                                       STORAGETEK_PORT, &session);
            if (res != HCERR_OK) { ... create session failed, error exit ... }

            res = hc_nvr_create_from_string_arrays(session, &nvr,
                      cmdLine.cmdlineMetadata.namePointerArray,
                      cmdLine.cmdlineMetadata.valuePointerArray,
                      cmdLine.cmdlineMetadata.mapSize);
            if (res != HCERR_OK) { ... error, exit ... }

            /* Store data and metadata to the StorageTek 5800 server */
            res = hc_store_both_ez(session, &read_from_file,
                                   (void *)fileToStore, nvr, &system_record);
            close(fileToStore);

            if (res != HCERR_OK) {
                /* error occurred */
                HandleError(session, res);
                hc_session_free(session);
                hc_cleanup();
                exit(1);
            }
            printf("The new OID = %s\n", system_record.oid);
        }

        /* Let StorageTek 5800 clean up */
        hc_session_free(session);
        hc_cleanup();
        return 0;
    }

    /* Callback function required by hc_store_both_ez. */
    long read_from_file(void* stream, char* buff, long n) {
        long nbytes;
        nbytes = read((int) stream, buff, n);
        return nbytes;
    }

Virtualized Views via WebDAV

The Web-based Distributed Authoring and Versioning (WebDAV) protocol is a set of extensions to the HTTP/1.1 protocol that allows files to be read, added, and deleted on remote Web servers. Virtual file system views can be created in the Sun StorageTek 5800 system that enable the user to use WebDAV to browse through data files on the system as though they were stored in a hierarchical path structure.

Virtual file system views can be set up to allow the use of a Web browser or a WebDAV enabled application to view files that are organized by metadata elements. For example, a system that stores satellite imagery could present images categorized into folders by location, including subfolders for countries and states.

Multiple views of the same data can be created to provide hierarchies that are appropriate for different roles. As an example, a system that stores medical images could be divided into two views. One is organized by doctor, to be used by the doctor and their staff. The other is organized by scanner, to be used by the technicians operating the scanning equipment. The technician view does not contain any patient identification, in order to protect the patient's privacy.

The schema is used to define virtual views. In "Example schema file" on page 31, two virtual views are defined, one by artist and the other by album.

Emulator

The Software Developer's Kit (SDK) comes with an emulator that enables developing and testing client applications without having to connect to a physical Sun StorageTek 5800 system. The emulator mimics the behavior of a Sun StorageTek 5800 system. All data and metadata is stored on the local hard drive of the system running the emulator. The emulator is written in the Java programming language and can be run on the Solaris OS, Linux, or Windows.
Java Development Kit (JDK) version 1.5 or later is required. Applications using either the Java API or the C API can be tested with the emulator.

Chapter 5
System Software and RAS Features

Software Components Overview

The Sun StorageTek 5800 system software includes a variety of layers that all contribute to the overall product goal of reducing costs while improving data manageability. This section outlines the major architectural software components, starting from the highest level and working down to the lowest. Figure 16 contains a diagram of the major components and how they interact with one another.

Figure 16. Sun StorageTek 5800 system software components

Remote Services

A set of remote services runs on every node. These services communicate with the local services that also run on each node to perform tasks such as exchanging data between nodes of the cluster. The remote services implement the higher level protocol and coordination functions. The local services on each node implement the storage and retrieval functions on the four directly attached disks.

Protocol Server

When a client request is sent by the switch to a storage node, the software component that answers it is the protocol layer. This layer is an HTTP 1.1 server that speaks the Sun StorageTek 5800 system's protocol, serializes and deserializes data to and from the network, and sends the request to the appropriate service. The protocol server also handles WebDAV requests for virtualized file system views. The protocol server uses the open source Jetty HTTP server, which is implemented entirely in Java code.

Metadata Service

The metadata service is responsible for managing the storage, retrieval, caching, search, and deletion of object metadata. The metadata service relies on the Object Archive layer to actually store metadata persistently and reliably on disk.
When new metadata arrives, it is first stored reliably as an XML document in the Object Archive, using the same reliability features as file objects. Once the data is stored successfully, the metadata is inserted into the HADB database, which distributes the data to multiple nodes for database reliability and takes advantage of the memory on multiple nodes to provide fast searches.

When the Protocol layer receives search requests, the metadata service executes searches in parallel on many nodes using the cached in-memory indices, and merges the results for return to the client. Since multiple objects may share a single piece of underlying object data, the metadata service must manage this many-to-one relationship. In particular, during metadata delete operations it must ensure that data is not deleted while other objects still reference that data.

Object Archive Service

The object archive service can be thought of as a library that manages the storage, retrieval, and deletion of object data. When the object archive service receives an object to store, it splits the byte stream into regular sized blocks. Each block is passed to the software RAID service, which turns it into five data fragments and two parity fragments by using Reed-Solomon coding.

The object archive service communicates with the placement service in order to determine which disks on which nodes should store the fragments. Each node in the cluster has an internal NFS mount for each disk on the other nodes. On a full cell this is 60 NFS mounts. The object archive service writes the fragments through NFS to the seven disks that are chosen by the placement service.

The object archive service calculates a SHA-1 hash for the entire object that is used to verify data integrity. A checksum is computed for each fragment so that on retrieval the integrity of each fragment can be verified and, if necessary, the fragment can be corrected using the parity fragments.
A permanent object identifier (OID), a 30 byte value, is created for the newly stored object. The OID contains the placement identifier that was returned by the placement service. When the writes for the underlying object data have all successfully completed, a system metadata record is created consisting of the object size, creation date, object ID, placement ID, and SHA-1 hash.

The system metadata, together with the user metadata from the original storeObject request, is stored as an XML document in the object archive. The combined user and system metadata is stored in the database via the HADB client. If any user-supplied metadata is part of the storage request, an additional XML document object is created in the object archive and stored in the database. Finally, the permanent object identifier is returned to the requestor, informing it that the object has been successfully and reliably stored.

When read requests for an object arrive, the object archive service extracts the placement identifier from the OID and sends it to the placement service. The placement service returns the list of disks on which the data and parity fragments of the object are stored. The object archive service retrieves the fragments through the internal NFS mounts, or through a local disk read if one of the four local disks on this node contains any of the fragments.

For any data fragments that are unavailable due to disk or node failures, the object archive service uses the software RAID service to reconstruct the missing fragment using the parity fragments. When a fragment is retrieved, its checksum is verified. If the checksum does not match, the data fragment is reconstructed using the parity fragments just as if it were on a disk that was unavailable.

Additionally, the object archive service is ultimately responsible for managing the deletion of data and metadata from the system's disks when required.
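The read-path check described above, in which a fragment that is missing or fails its checksum is treated as unavailable and rebuilt from the remaining fragments, can be sketched as follows. The paper does not name the per-fragment checksum algorithm, so CRC32 is an assumption here, and FragmentCheck and its methods are hypothetical names. The sketch only decides whether an object is still readable; it does not implement Reed-Solomon decoding.

```java
import java.util.zip.CRC32;

// Illustrative sketch only. The per-fragment checksum algorithm is not named
// in the paper; CRC32 is an assumption, as are all class and method names.
final class FragmentCheck {
    static final int DATA_FRAGMENTS = 5;
    static final int PARITY_FRAGMENTS = 2;

    /** Computes the (assumed) CRC32 checksum of one fragment. */
    static long checksum(byte[] fragment) {
        CRC32 crc = new CRC32();
        crc.update(fragment, 0, fragment.length);
        return crc.getValue();
    }

    /**
     * Counts fragments that are present and intact. A null entry models a
     * fragment on a failed disk or node; a checksum mismatch models corruption.
     */
    static int usableFragments(byte[][] fragments, long[] storedChecksums) {
        int usable = 0;
        for (int i = 0; i < fragments.length; i++) {
            if (fragments[i] != null && checksum(fragments[i]) == storedChecksums[i]) {
                usable++;
            }
        }
        return usable;
    }

    /** With five data and two parity fragments, up to two losses can be tolerated. */
    static boolean recoverable(byte[][] fragments, long[] storedChecksums) {
        return usableFragments(fragments, storedChecksums) >= DATA_FRAGMENTS;
    }
}
```

Treating a corrupt fragment exactly like a missing one keeps the logic simple: both count against the same budget of two tolerated losses per block.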
Software RAID Service

The software RAID service is responsible for implementing the Reed Solomon algorithm that efficiently encodes redundancy into the stored data. During storage operations it takes a block of data and breaks it up into five data fragments and two parity fragments. During retrieval operations, if any of the data fragments are not available due to failures, it can recreate up to two of them by using the two parity fragments.

Placement Service

The placement service implements the Sun StorageTek 5800 system's patent-pending algorithm to determine which disks to store data and parity fragments on, as well as where the fragments are located when the object needs to be read back. When a disk or a node fails, the placement service determines where to relocate any reconstructed fragments that were on the failed component in order to maintain reliability. Each object is assigned a placement ID when it is stored, which becomes part of the permanent Object Identifier (OID).

There are approximately 10,000 possible placement IDs. Placement IDs are chosen to ensure an even distribution of data throughout the cell. Given a placement ID, the placement algorithm generates a sequence of disks to use for storing fragments. The sequence includes the alternate disks to be used when any of the earlier disks in the sequence are unavailable. As disks or nodes fail and are repaired, the list of available disks can change, and fragments might need to be moved or reconstructed to match the new list of disks. The Placement Service is used by the Healing Service in order to determine what needs to be moved.

High Availability Database Client

The High Availability Database (HADB) client services requests from the metadata service by contacting the appropriate HADB local server process that runs on each node. The database is distributed to every node in the cluster for availability as well as performance.
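The idea behind the Placement Service described above, a per-placement-ID sequence of disks in which later entries serve as alternates for unavailable earlier ones, can be sketched in a few lines. The actual algorithm is not detailed in this paper, so a seeded shuffle stands in for it as an assumption. PlacementSketch and its methods are hypothetical names, disks are numbered 0 to diskCount - 1, at least eight disks are assumed, and the sketch ignores any rule about spreading fragments across nodes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustrative sketch only. A seeded shuffle stands in for the real placement
// algorithm, which this paper does not describe in detail.
final class PlacementSketch {
    static final int FRAGMENTS_PER_OBJECT = 7; // five data + two parity

    /** Deterministic ordering of all disks for one placement ID (assumed scheme). */
    static List<Integer> diskSequence(int placementId, int diskCount) {
        List<Integer> disks = new ArrayList<Integer>();
        for (int d = 0; d < diskCount; d++) {
            disks.add(d);
        }
        Collections.shuffle(disks, new Random(placementId));
        return disks;
    }

    /** The first seven disks in the sequence that are currently available. */
    static List<Integer> layout(int placementId, int diskCount, Set<Integer> unavailable) {
        List<Integer> chosen = new ArrayList<Integer>();
        for (Integer disk : diskSequence(placementId, diskCount)) {
            if (chosen.size() == FRAGMENTS_PER_OBJECT) {
                break;
            }
            if (!unavailable.contains(disk)) {
                chosen.add(disk);
            }
        }
        return chosen;
    }

    /** Disk that receives the rebuilt fragment when failedDisk is lost, or -1 if unaffected. */
    static int replacementDisk(int placementId, int diskCount, int failedDisk) {
        List<Integer> before = layout(placementId, diskCount, Collections.<Integer>emptySet());
        if (!before.contains(failedDisk)) {
            return -1; // this placement ID stored nothing on the failed disk
        }
        List<Integer> after = layout(placementId, diskCount, Collections.singleton(failedDisk));
        after.removeAll(before); // leaves only the alternate disk that newly entered the layout
        return after.get(0);
    }
}
```

Under this scheme, a node could loop over all placement IDs and rebuild fragments only for those whose replacementDisk is one of its own local disks, which is one way to picture how recovery work is partitioned across the cell without coordination.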
Local Services

Local services run on each node and are responsible for actually writing and reading data to and from the local disks.

Disk Access Service

The disk access service is the internal NFS server on each node. Each disk is exported as a separate NFS mount to the rest of the cluster.

Disk Management and Monitoring Service

The disk management and monitoring service (DMM) is responsible for managing a node's local disks. It is responsible for mounting each disk at boot time, and unmounting a disk that is to be taken off-line for swapping. When a new disk is inserted, the DMM detects it and formats the disk. The DMM also monitors the disks for errors. If the errors increase in severity, it shuts the disk down and declares the disk as failed. The DMM supplies disk status information to the cluster management system, which is then used by administration tools for status reports and automated notifications.

Healing Service

Healing services running on each node are responsible for performing healing operations on all of the disks on that node. The healing service runs a separate thread for each disk. Under normal operation, these threads periodically scan disks and repair problems. When disks or nodes are added or removed, each thread is responsible for adding fragments to or safely removing fragments from its disk, as dictated by the placement algorithm, helping to maintain full reliability and distribute data evenly throughout the cell.

High Availability Database Server

The High Availability Database (HADB) is the database used for searching the system's metadata. HADB runs on every node. When the system first boots, it elects one master node for housekeeping functions. Any node can be a master node or take over as master node.

Cluster Management

The cluster management component provides a suite of services to enable communication between nodes in a cell (apart from the data connectivity provided by the internal NFS mounts).
Cluster management maintains the list of live nodes that each node keeps, and exchanges heartbeats between the nodes. It executes the leader election algorithm, which picks a single master node for the cell, and re-runs the algorithm if the master dies. Cluster management provides IPC so that services on different nodes can communicate with each other. It also performs failure escalation, restarting subsystems that fail and eventually rebooting nodes if problems cannot be resolved.

Node Manager

The node manager manages the set of services running on each node. It is responsible for starting, stopping, and restarting services. It is aware of any dependencies one service might have on another that require a particular start or restart order. If a service does not start or keeps dying, the node manager might choose to reboot the node in order to bring it back to a known state.

Cluster Membership Management

The cluster membership management service implements a distributed algorithm for tracking which nodes are on-line and are part of the cluster. The on-line nodes form a ring. Each node sends a heartbeat to the next node in the ring. The absence of heartbeats is used to detect when a node fails.

There is one master node in each cell that runs the administrative services for the command line interface and the graphical user interface. The master node accepts administrative requests via HTTP from the GUI. It also accepts Secure Shell (SSH) connections for the admin virtual address. Which node serves as the master is determined by running a distributed election algorithm. The election algorithm also determines a vice master that is used in the event the master node fails.

Switch Manager

The switch manager is responsible for keeping the configuration of the two Ethernet switches up to date. As requests come in, the switch determines which node to send the request to.
This spreads the load of incoming requests across all of the available nodes. When a node fails, the switch manager updates the programming on both switches to prevent requests from being sent to that node. When nodes come on line and are ready to process requests, the switch manager updates the list of available nodes on both switches.

Service State Advertisements (Mailboxes)

A cell-wide dashboard of the state of all of the components in the cell is implemented using mailboxes and service state advertisements.

Clustered IPC

The clustered inter-process communication service implements a remote procedure call messaging system. It is used when a local node needs to execute an action on a remote node.

Administrative Interfaces

There are two interfaces for performing administrative tasks on the Sun StorageTek 5800 system: a command-line interface (CLI) and the administrative graphical user interface (GUI). The CLI is accessed with ssh, the secure shell protocol. The GUI is accessed using a Web browser. Administration tasks can be scripted using the command line interface.

Using either the CLI or the GUI, the user can perform administrative tasks such as monitoring the system and individual components such as nodes or disks, specifying which clients are authorized to access data on the system, setting up the system schema, powering down, and rebooting the system.

Most administrative tasks affect all cells in a multicell configuration and are therefore considered hive-level functions. Examples of hive-level functions include setting the administrative password, specifying which clients are authorized to access the data on the system, and setting up email notifications of system events. Some administrative tasks affect only one cell in a multicell environment. For example, the administrative IP address, the data IP address, and the default gateway for each cell are specified separately.
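Because the CLI is reached over ssh, routine status checks can be driven from a script on any host that can reach the administrative IP address. The Python sketch below is illustrative only. The user name admin follows the glossary entry for the CLI (ssh admin@adminIPaddress), and the command names sysstat, perfstats, df, and version are the ones named under Command Line Administration. The address 192.0.2.21 is a placeholder, and passing a command as an ssh argument rather than typing it in an interactive session is an assumption that has not been checked against a real system.

```python
import shlex
import subprocess

ADMIN_USER = "admin"  # from the glossary: ssh admin@adminIPaddress
STATUS_COMMANDS = ["sysstat", "perfstats", "df", "version"]  # names used in this paper


def build_ssh_command(admin_ip, cli_command):
    """Return the argv list for running one CLI command over ssh."""
    return ["ssh", f"{ADMIN_USER}@{admin_ip}", cli_command]


def collect_status(admin_ip, run=subprocess.run):
    """Run each status command and return {command: captured stdout}.

    `run` is injectable so the function can be exercised without a real cell.
    Assumption: the CLI accepts a command passed non-interactively via ssh.
    """
    report = {}
    for command in STATUS_COMMANDS:
        result = run(build_ssh_command(admin_ip, command),
                     capture_output=True, text=True, check=False)
        report[command] = result.stdout
    return report


if __name__ == "__main__":
    # Placeholder address: only print the command lines, do not connect.
    for command in STATUS_COMMANDS:
        print(shlex.join(build_ssh_command("192.0.2.21", command)))
```

Keeping the command construction separate from execution makes it easy to log or review exactly what a script will send to the administrative address before it runs.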
Command Line Administration

• System status — obtains basic system state information with the command sysstat, which provides an estimate of free space in the system that is available for data storage.
• Performance statistics — displays real-time performance metrics about throughput and operations using perfstats.
• Software version — displays the version of system software.
• FRU listings — displays a list of field-replaceable units (FRUs).
• Disk status — displays a summary of disk usage using df.
• Voltage, temperature, and fan speed — collects and displays voltage, temperature, and fan speed data from system sensors.

Administration GUI

The Sun StorageTek 5800 system administrator GUI presents a pictorial representation of the system that allows administrators to monitor system performance and status, and to perform administrative tasks through a series of menus and screen panels. The system ships with the GUI software application installed. The GUI is not a Web-based application, but is launched from a Web browser. The administration GUI communicates with the cluster through HTTP (XML-RPC) and displays:
• Failed components
• System space usage
• System performance statistics
• Environmental status
• Cell information: software version, nodes in a cell, disks in a cell, cell IP address
• Node and disk information: FRU ID of nodes, node space usage, node status, disk status

Figure 17. Administration GUI

Notifications

The Sun StorageTek 5800 system can send automatic email alerts for a number of important events. Configuring email notifications requires only an SMTP server and a list of addresses to notify.
Notification events include:
• Enabling or disabling of a disk or node
• Switch or node failover
• System shutdown or reboot
• Server node joining or leaving the configuration
• System reaches full capacity
• Configuration changes: IP address, administrator password, or public key change
• Clock time differences across server nodes
• System wiped of all data
• System upgrade
• Schema changes

It is possible to specify an external logging host to which the Sun StorageTek 5800 system sends detailed log messages for debugging purposes or to enable easy integration with existing monitoring systems. Email notifications and the external logging host are configured on a per-hive basis.

Background Integrity Checking

To maintain the long term integrity and availability of the data stored in the Sun StorageTek 5800 system, the system continually runs a number of background tasks that verify that all of the stored data is intact. These tasks allow errors to be detected and corrected early, before they can become worse.

The basic process is to periodically read every object in the system from the disks. This verifies that the disk blocks are still readable. For each block that is retrieved, the block level checksum that was generated as part of the storage process is verified. If any errors are found, the system can use its healing processes to recreate up to two unreadable or bad blocks. The corrected blocks are then written back out to disk.

All of the metadata in the system, in addition to being stored in the high availability database (HADB), is stored as an object in the object archive. As the system reads through all of the stored objects and encounters objects that contain metadata, it verifies that the metadata is properly indexed in HADB. This ensures that queries return complete results.
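The read, verify, and repair cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's implementation: SHA-1 is used as the block checksum only because the paper does not name the block-level algorithm, and the reconstruct callback is a hypothetical stand-in for the healing step (the real system rebuilds blocks from parity). The limit of two repairable blocks comes from the text.

```python
import hashlib

MAX_REPAIRABLE = 2  # the paper states that up to two bad blocks can be recreated


def checksum(block: bytes) -> str:
    # Stand-in checksum; the block-level algorithm is not named in the paper.
    return hashlib.sha1(block).hexdigest()


def scrub_object(fragments, stored_checksums, reconstruct):
    """Verify every block of one object and repair it if possible.

    fragments        -- list of bytes (None if a block could not be read)
    stored_checksums -- checksums recorded when the object was stored
    reconstruct      -- hypothetical callable(fragments, bad_indexes) -> {index: bytes}
                        standing in for the system's healing step
    Returns the indexes that were repaired; raises if too many blocks are bad.
    """
    bad = [i for i, block in enumerate(fragments)
           if block is None or checksum(block) != stored_checksums[i]]
    if not bad:
        return []
    if len(bad) > MAX_REPAIRABLE:
        raise RuntimeError(f"{len(bad)} bad blocks, cannot repair")
    for index, block in reconstruct(fragments, bad).items():
        if checksum(block) != stored_checksums[index]:
            raise RuntimeError(f"rebuilt block {index} failed verification")
        fragments[index] = block  # write the corrected block back
    return bad


if __name__ == "__main__":
    pristine = [bytes([i]) * 8 for i in range(7)]
    sums = [checksum(b) for b in pristine]
    damaged = list(pristine)
    damaged[1] = b"corrupt!"   # checksum mismatch
    damaged[4] = None          # unreadable block
    # Demo-only repair: copies from the pristine list instead of decoding parity.
    print(scrub_object(damaged, sums, lambda frags, bad: {i: pristine[i] for i in bad}))
```

Re-verifying each rebuilt block against its stored checksum before writing it back keeps a faulty repair from silently replacing one bad block with another.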
These processes run at a lower priority than incoming requests so as not to impact the performance of the system.

Backup and Restore via NDMP

The Sun StorageTek 5800 system does not require backups in the conventional sense, since the system heals automatically from any failures. To recover from a catastrophic system loss, however, the Sun StorageTek 5800 system implements a subset of the Network Data Management Protocol (NDMP). With NDMP, the data stored on the system can be backed up to tape and restored later in the event of catastrophic system loss. The Sun StorageTek 5800 system NDMP implementation only allows data to be completely restored to an empty cell or hive (multicell configuration); partial restorations are not supported. Before restoring data to a cell, all data must be deleted from the cell using the CLI or GUI.

Note – The Sun StorageTek 5800 system acts as the data server within the NDMP implementation. It does not implement the optional Direct Access Recovery (DAR) portion of the NDMP protocol, since DAR involves directory structure mechanisms that are not applicable to this system's object-oriented approach.

Sun has tested the NDMP-compliant backup product NetVault, Version 7.4.5, from BakBone Software with the Sun StorageTek 5800 system running on Solaris 10 OS on a SPARC® processor-based system. For detailed information about using NetVault and about which tape libraries are supported, refer to the NetVault user documentation.

Figure 18. Disaster protection with BakBone software

IP Multi-Pathing

Internet Protocol Multi Pathing (IPMP) provides increased reliability, availability, and network performance by allowing the system to utilize two physical network interfaces connected to the same network. Occasionally, a physical interface or the networking hardware attached to that interface might fail or require maintenance.
Traditionally, at that point, the system can no longer be contacted through any of the IP addresses that are associated with the failed interface, and any existing connections to the system using those IP addresses are disrupted. With IPMP, the system remains fully available and existing TCP/IP connections continue using the remaining network connection.

Chapter 6 Hardware Details

The basic Sun StorageTek 5800 system is a full-cell configuration that includes 16 storage nodes, one service node, two Gigabit Ethernet switches, a network patch panel, and pre-installed operating system and software. A half-cell configuration, which includes only 8 storage nodes, is also allowed. A half-cell configuration can be expanded to a full-cell configuration. Full-cell configurations can be combined to create multicell configurations, also referred to as hives. Only full-cells are allowed in multicell configurations.

Storage Nodes

The storage nodes are based on the Sun Fire™ X2100 server and consist of:
• CPU — one single-core AMD Opteron™ processor running at 2.2 GHz with 1 MB of level 2 cache
• Memory — 3 GB using two 1 GB ECC DIMMs and two 512 MB ECC DIMMs
• Drives — four 500 GB SATA hot-swappable disks
• Network I/O — two 10/100/1000BASE-T Gigabit Ethernet ports
• Power supply — 350W
• System management — Intelligent Platform Management Interface (IPMI) 1.5 compliant processor module

The storage node stores data and metadata. Data is broken up into fragments and stored across different disks and nodes. The storage node boots a ramdisk image through the Grand Unified Bootloader (GRUB). The Solaris OS instance runs entirely in memory. The Gigabit Ethernet ports are configured into a Solaris Internet Protocol Multi Pathing (IPMP) group for transparent failover. Figure 19 shows the front panel of the storage node.

Figure 19. Storage node front panel
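As a quick check on the capacity figures quoted for cells, raw capacity follows directly from the node and disk counts listed above. The short Python sketch below reproduces that arithmetic and adds a rough upper bound on usable capacity implied by the 5+2 fragment encoding described in the glossary. It treats 1 TB as 1000 GB and ignores file system, metadata, and other overheads, so the usable figure is an assumption-laden estimate rather than a published specification.

```python
NODES_FULL_CELL = 16
NODES_HALF_CELL = 8
DISKS_PER_NODE = 4
DISK_GB = 500          # per-disk capacity from the storage node specification
DATA_FRAGMENTS = 5     # 5+2 encoding, from the glossary entry for "fragment"
PARITY_FRAGMENTS = 2


def raw_tb(nodes):
    """Raw disk capacity of a cell in TB, taking 1 TB = 1000 GB."""
    return nodes * DISKS_PER_NODE * DISK_GB / 1000


def usable_tb_upper_bound(nodes):
    """Upper bound on usable capacity: only the data share of 5+2 holds user data.

    Assumption: ignores file system, metadata, and reserved space.
    """
    data_share = DATA_FRAGMENTS / (DATA_FRAGMENTS + PARITY_FRAGMENTS)
    return raw_tb(nodes) * data_share


if __name__ == "__main__":
    for label, nodes in (("full-cell", NODES_FULL_CELL), ("half-cell", NODES_HALF_CELL)):
        print(f"{label}: {raw_tb(nodes):.0f} TB raw, "
              f"at most {usable_tb_upper_bound(nodes):.1f} TB usable")
```

The raw results, 32 TB for a full-cell and 16 TB for a half-cell, match the figures given for the cell configurations.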
The rear of the storage node (Figure 20) includes a monitor port and four USB ports. The VGA port and USB ports can be used to hook up a monitor and keyboard for maintenance or troubleshooting.

Figure 20. Storage node back panel

Cell Configurations

Three configurations of the Sun StorageTek 5800 system can be housed in a single cabinet:
• A full-cell with 16 storage nodes and 32 TB of raw storage
• A half-cell with 8 storage nodes and 16 TB of raw storage
• Two full-cells for a total of 32 storage nodes

Systems with more than two full-cells require additional cabinets. Because a half-cell has a reduced number of storage nodes, it does not have the same inherent reliability as a full-cell with 16 storage nodes.

The full-cell is the basic building block of the Sun StorageTek 5800 system. A full-cell includes a service node, 16 storage nodes, two Gigabit Ethernet switches, and a network patch panel. The front view of a two-cell system is shown in Figure 3. Additional full-cells in a multicell system (referred to as a hive) are identical.

Service Processor Node

Each Sun StorageTek 5800 cell includes a single service node with preconfigured software and firmware. The service node is a Sun Fire X2100 M2 server with one 250-gigabyte Serial ATA disk drive, illustrated in Figure 21. The Sun StorageTek 5800 cell uses the service node for initial configuration and troubleshooting, and to upgrade the system software. The system does not use the service node to access the data objects.

The key components of a service node are:
• CPU — one dual-core AMD Opteron processor running at 1.8 GHz with two 1 MB level 2 caches
• Memory — 2 GB using four 512 MB ECC DIMMs
• Media storage — DVD-ROM drive
• Drive — one 250 GB SATA hard drive
• Network I/O — four 10/100/1000BASE-T Gigabit Ethernet ports
• System management — IPMI 2.0-compliant service processor module

Figure 21. Service node front panel

The rear of the service node, shown in Figure 22, includes a monitor port and four USB ports. The monitor port and USB ports can be used to attach a monitor and keyboard for maintenance or troubleshooting. There are also two 1 gigabit Ethernet ports. The PCI slot of the service node is unused.

Figure 22. Service node back panel

Integrated Load Balancing Ethernet Switches

The Gigabit Ethernet switch, shown in Figure 23, is a low cost 24-port switch with preloaded software. A half- or full-cell Sun StorageTek 5800 system includes two Gigabit Ethernet switches. Both switches are connected to the service node, to all the storage nodes, and to the network patch panel.

Figure 23. Gigabit Ethernet switches

The switches allow the cell to be addressable from a single physical Ethernet connection (with a redundant backup) as two virtual IP (VIP) addresses: one for data and one for administration. The switches also provide load balancing for store and retrieve data flows to and from the storage nodes by making use of chipsets that support basic packet header analysis of hash-table based routing information.

One of the switches is designated as primary and the other as standby. By default, the bottom switch is the active, primary switch, and the top switch is the secondary switch in standby mode. If the primary switch fails, the secondary switch automatically takes control and becomes the primary switch. If the original primary switch comes back online, it resumes control.

Storage nodes 1 through 16 are connected to Ethernet ports 1 through 16 of each switch for load balancing and high availability. The service node is connected to port 17 of each switch. The switches are connected to each other for heartbeat communication by ports 23 and 24 of each switch.
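To make the load balancing idea concrete, the following Python sketch models the list of available nodes that the switch manager programs into each switch, and a hash over packet header fields that selects a node from that list. It is a toy model under several assumptions: the real hash function and table layout used by the switch chipset are not described in this paper, CRC-32 over the source address and port is chosen only because it is deterministic, and the node identifiers 101 through 116 extend the NODE-101 style naming that appears in the version listing.

```python
import zlib


class SwitchTable:
    """Toy model of the available-node list programmed into one switch."""

    def __init__(self, node_ids):
        self.available = sorted(node_ids)

    def node_failed(self, node_id):
        """Called by the switch manager so no further requests go to a failed node."""
        if node_id in self.available:
            self.available.remove(node_id)

    def node_ready(self, node_id):
        """Called by the switch manager when a node is ready to process requests."""
        if node_id not in self.available:
            self.available.append(node_id)
            self.available.sort()

    def pick_node(self, src_ip, src_port):
        """Choose a node by hashing header fields (hypothetical hash choice)."""
        if not self.available:
            raise RuntimeError("no storage nodes available")
        key = f"{src_ip}:{src_port}".encode()
        return self.available[zlib.crc32(key) % len(self.available)]


if __name__ == "__main__":
    # Both switches receive the same programming, so either can take over.
    primary = SwitchTable(range(101, 117))
    standby = SwitchTable(range(101, 117))
    chosen = primary.pick_node("192.0.2.10", 40000)
    print(chosen == standby.pick_node("192.0.2.10", 40000))
    for table in (primary, standby):
        table.node_failed(chosen)
    print(primary.pick_node("192.0.2.10", 40000) != chosen)
```

Updating both tables on every membership change is what lets the standby switch route requests correctly the moment it takes control.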
Network Patch Panel

A single network patch panel on the back of the Sun StorageTek 5800 system provides all the attachment points for the network, as illustrated in Figure 24.

Figure 24. Network patch panel port connections for a two-cell system

System Rack

The Sun StorageTek 5800 system rack is a Sun Rack 1000 model 1038 with 38 rack units of space. Power connections are two L6-20, 20 Amp circuits. There are two power distribution systems (PDS), each rated for single phase at 32 Amps and 250 Volts Alternating Current (VAC). The Sun StorageTek 5800 system does not require a specialized sequencer. The standard PDS is sufficient to stagger power-on and does not consume rack space. There are also two power strips, labeled A and B. One power strip is used for each cell. Power strip B is used for the first (bottom) cell. Power strip A is used for the second (upper) cell in the two-cell configuration.

Bundled Software

The Sun StorageTek 5800 system is used and administered as an appliance and is therefore installed, upgraded, and managed as a single unit instead of as a collection of systems. All of the software shipped with the Sun StorageTek 5800 system is bundled together as a single package. The software components of the bundle include:
• Solaris 10 Operating System for x64 systems
• Solaris 10 OS patches
• BIOS
• Server Management Daughter Card (SMDC) firmware
• SATA disk firmware

Even though there are many components, each with its own software, BIOS, or firmware, they are upgraded as a whole unit with a single command. Even if only one component is upgraded, the version number of the bundle changes, and the entire bundle is upgraded. This simplifies management and maintenance of the system. The system can be upgraded from the service node using the DVD or over the network.
Cell Wide System Monitoring

The Sun StorageTek 5800 system management software allows system administrators to gather system data from all of the nodes in the cell with a single command. Available operations include:
• Hardware monitoring: temperature, voltage, fan speed
• System firmware versions: BIOS and server management daughter card (SMDC) (Table 6)
• Disk revisions and serial numbers

Table 6. Cluster wide versions

ST5800 $ version --verbose
ST5800 1.1 release [1.1-11076]
Service Node:
    BIOS Version: 1.1.3
    SMDC Version: 4.13
Switch:
    Overlay Version (sw#1): 11068
    Overlay Version (sw#2): 11068
NODE-101:
    BIOS version: 0.1.8
    SMDC version: 4.18
NODE-102:
    BIOS version: 0.1.8
    SMDC version: 4.18
NODE-103:
    BIOS version: 0.1.8
    SMDC version: 4.18
NODE-104:
    BIOS version: 0.1.8
    SMDC version: 4.18
. . .
ST5800 $

Chapter 7 Futures

Sun is working on a number of innovative features expected to be available in subsequent releases of the Sun StorageTek 5800 system software, including features to support compliance and extensibility.

Compliance Features

• Mandatory retention periods — Objects cannot be deleted until the retention period has expired.
• Legal holds — Information that is required to be retained for lawsuits can be put under a legal hold that prevents deletion until the legal hold has been removed by an authorized administrator.
• WORM storage — Various regulations require data storage that cannot be modified once the data is stored, in order to prevent intentional tampering or accidental destruction. Today, the Sun StorageTek 5800 system's APIs do not permit applications to modify objects. The SHA-1 secure hash algorithm is used to provide a checksum for each object when it is stored, which becomes part of the object's permanent system metadata. The cryptography used in the SHA-1 algorithm makes it highly improbable that an object could be modified in a way that would produce the same checksum.
Future releases are expected to implement methods for providing additional assurances that data has not been tampered with or deleted. Some of these assurances will require the use of strong cryptographic techniques.

Extensibility Through Storage Beans

Future releases of the Sun StorageTek 5800 system software are expected to extend functionality by providing the ability to run small programs (storage beans) on the Sun StorageTek 5800 system itself. Storage beans can run as part of normal requests and modify the behavior of those requests in order to implement important business rules. This is very similar to the way triggers or stored procedures are used in relational databases. These are called Synchronous Storage Beans since they run as part of a client initiated request.

Storage beans are also expected to support tasks that are not tied to specific requests, but are more appropriate to run in the background. For example, tasks that need to look at every object in the system would take a long time to run and would not be appropriate to run as part of a client request. These are called Asynchronous Storage Beans.

The ability to run tasks on the system itself becomes very important with large collections of data that range in the tens or hundreds of terabytes. For instance, it would be extremely inefficient to transport terabytes of data across the network to other servers for computation.

Example Uses of Synchronous Storage Beans

• Audit logs — application specific audit requirements could be implemented by intercepting store, retrieve, and delete requests.
• Watermarking — with an object such as an image or video, a storage bean could digitally watermark the file as it is stored. The watermark would stay with the object for its entire life span and could be used for verifying the source of the data.
• Encryption — storage beans could be used to encrypt or decrypt objects as they are stored or retrieved, thus helping to ensure that data on disks and any backup devices is kept confidential over the life of the system.
• Automatically augment metadata with embedded headers/tags — some files contain structured data in headers or tags that are part of the files. For example, JPEG images contain EXIF tags that might indicate the date and time a photo was taken, exposure settings, type of camera and lens used, etc. A storage bean could extract that information from an object as it is stored and automatically add that data as metadata. This would enable searches of this data in the Sun StorageTek 5800 system’s internal metadata database.
• Full text index — as objects containing text are stored in the system, the text could be extracted and indexed into a full text search system.

Example Uses of Asynchronous Storage Beans

• Format conversions — in many cases the useful life span of data far exceeds the life span of the application that was used to create it. The longer data is stored, the higher the likelihood this will occur several times. When a file format becomes deprecated, an asynchronous storage bean could be used to scan the object archive and convert any objects in that format to the new format.
• Re-sampling/alternate formats — the object archive could be scanned to create additional objects in different formats to meet the needs of applications. For example, from an archiving point of view it is best to store an image in its original high resolution, raw format produced by the camera or scanner. For typical display applications, a lower resolution JPEG might be best suited. The full resolution image might be used for producing large high quality prints.
• Duplicate consolidation — in order to protect the capacity in the system, a process could be run on the Sun StorageTek 5800 system to scan for objects that might be duplicates of each other and to consolidate those, so multiple objects could share the same underlying storage.
• Data scrubbing/sanity checks — all of the metadata in the system could be periodically scanned to analyze whether descriptive attributes are consistently applied and to clean them up.

Upcoming Features

• Multicell configurations beyond two cells — The first release of the system software supports a maximum of two cells in a multicell configuration. Future releases are expected to remove this restriction in order to allow the system to scale into the petabyte range. As with the existing two-cell configurations, applications do not need to be modified as the size of the system scales. In fact, the only configuration information an application needs to know is the data virtual IP address of just one of the cells.
• Migrating data between cells — Future releases of the system software are expected to allow data to be migrated across cells. When new cells are added, this ability can be used to relocate some of the data from the existing cells in order to balance the storage usage across all cells. When this process completes, there will be an equal amount of storage capacity in each cell. This enables the system to then utilize the considerable processor, memory, and network resources that each new cell brings. As requests come in to store new content, the cell is determined automatically in order to maintain balanced utilization.
• Open architecture — To ensure that data can be accessed years into the future, Sun intends to release the code for the Sun StorageTek 5800 system into the open source community.
Sun has also committed to present open client and server APIs as part of the Honeycomb Fixed Content Storage OpenSolaris project (opensolaris.org/os/project/honeycomb). Sun is a strong adopter of the XAM proposed standard from SNIA and is prepared to donate code to this effort. The combination of open source and open standards equals open storage, which no other product is currently achieving.
• Wide area replication — The Sun StorageTek 5800 system provides a high level of resiliency at the cell level in order to accommodate common hardware failures such as failed disks and nodes. For full protection against disasters, the data in the system needs to be copied to a remote physical location to prevent a disaster from affecting the backup site. Future releases are expected to support wide area replication.

Chapter 8 Summary

The Sun StorageTek 5800 system offers innovative and unique features that are designed to fully meet the requirements of fixed content storage and to maximize the business value of fixed content applications. It is an online, disk-based, highly reliable storage system featuring a fully integrated hardware and software architecture with storage nodes arranged in a symmetric cluster. The system is affordable, reliable, scalable, accessible, and extensible.
• Affordable — An affordable entry-level price point is achieved by using low cost components including ATA disk drives, a cluster of low cost x64 servers, and Gigabit Ethernet networking. It is affordable to operate both in terms of low power requirements and low management requirements through self healing, minimized system administration, and a fail-in-place maintenance model.
• Reliable — The system can withstand multiple disk and server failures by combining a stateless symmetric clustered architecture, distributed RAID, and a self-healing management system.
• Seamlessly scalable — The system can scale from an entry level half-cell configuration to a multicell hive without requiring any changes to applications. As storage capacity is added, processing capacity grows as well, as a function of the balanced design.
• Accessible — Java and C APIs are included to encourage development of rich storage applications. Support for a file system interface using virtual file-system views is provided by the WebDAV protocol, which permits legacy applications to flexibly access archived data.
• Extensible — In a future release, the Sun StorageTek 5800 system is expected to be extensible through the use of storage beans. Developers can implement storage beans on the system to change the behavior of read, write, delete, and search operations.
• Open — The Sun StorageTek 5800 system is designed with an open software infrastructure that leverages the Solaris OS and Java technology. Sun has committed to open source its code, which means data can be accessed for years to come. Sun is also a key participant in the XAM standard.

Sun is constantly looking into the future to try to anticipate how companies and organizations use technology to address business challenges. The Sun StorageTek 5800 system is yet another successful example of Sun’s forward-thinking strategy, offering a new solution that provides online access to data, reduces capital and maintenance costs, improves on existing RAS schemes, and helps relieve the headache of metadata management and search.

For More Information

Sun Microsystems posts product information in the form of data sheets, specifications, and white papers on its Web site at www.sun.com.
For more information on the Sun StorageTek 5800 system see: http://www.sun.com/storagetek/disk_systems/enterprise/5800

On-line manuals for the Sun StorageTek 5800 system are located at: http://docs.sun.com/app/docs/coll/st58001.1

White Papers and Articles

• Benefitting Health Care Delivery with Secure Data Management, White Paper, May 2007, http://www.sun.com/storagetek/disk_systems/enterprise/5800/BenefitingHealthcareDeliveryWP.pdf
• EPRINTS and the Sun StorageTek 5800 System, White Paper, November 2007, http://www.sun.com/storagetek/disk_systems/enterprise/5800/SunEPrintsWP.pdf
• Honeycomb Fixed Content Storage at Opensolaris.org, http://opensolaris.org/os/project/honeycomb
• The Storage Evolution: From Blocks, Files, and Objects to Object Storage Systems, SNIA, Christian Bandulet, http://www.snia.org/education/tutorials/2007/spring/storage/The_Storage_Evolution.pdf
• XAM - Extensible Access Method, http://www.snia.org/forums/xam/
• A Tutorial on Reed-Solomon Encoding for Fault Tolerance in RAID Like Systems, http://www.cs.utk.edu/~plank/plank/papers/CS-96-332.pdf
• RFC 3174, US Secure Hash Algorithm 1 (SHA-1), http://www.faqs.org/rfcs/rfc3174.html
• RFC 4918, WebDAV - Web Distributed Authoring and Versioning, http://www.ietf.org/rfc/rfc4918.txt
• NDMP Overview, http://www.ndmp.org/info/overview.shtml
• Internet Protocol Network Multipathing (Updated), Sun BluePrint, http://www.sun.com/blueprints/1102/806-7230.pdf

Appendix A Glossary

administrative IP address
The virtual IP (VIP) address exported by the Sun StorageTek 5800 system for administrative access to a cell.

API
Application programming interface. A set of routines, protocols, and tools that developers use to build software applications.

attribute
An entry in the schema that associates a name with a type. For example, the name Doctor might be of type string.
Metadata is stored by assigning a value of the appropriate type to an attribute name, and attributes can also be used to create virtual file system views.

authorized client
A client that is authorized to access data on the Sun StorageTek 5800 system. By default, the system allows any client on the network to access the data stored on the Sun StorageTek 5800 system, but you can specify a list of authorized clients, which are then the only clients that have access to the data.

cell
The basic building block of the Sun StorageTek 5800 system. A full-cell configuration consists of 16 storage nodes, two gigabit Ethernet switches, and one service node.

CLI
Command-line interface. Text-based form of communication with the Sun StorageTek 5800 system. You access the CLI by issuing the command ssh admin@adminIPaddress from a host on the same network as the Sun StorageTek 5800 system.

client
An application that runs on a personal computer or workstation and relies on a server to perform some operations.

cluster
A term sometimes used to refer to the Sun StorageTek 5800 system cell or cells in a configuration.

CPU
Central processing unit. The brains of the computer, sometimes referred to simply as the processor or central processor. The CPU is where most calculations take place.

ctime
Creation time. The system metadata includes information on the creation time, data length, and data hash.

data hash
Hashes are used for accessing data or for security. A hash, also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.

data IP address
The virtual IP (VIP) address exported by the Sun StorageTek 5800 system for access to the data stored on a cell.

data object
A stored file with an associated Object ID (OID).
disk mask: A current record of disk availability across the system.

DNS: Domain Name Service. A service that defines naming conventions that translate domain names into IP (Internet Protocol) addresses.

DTD: Document Type Definition. Defines the legal building blocks of an XML document. The DTD defines the document structure with a list of legal elements, thus providing an application-independent way of sharing data.

emulator: Software that imitates the behavior of a Sun StorageTek 5800 system, allowing you to test applications.

extended metadata: Metadata that is added by the user of the Sun StorageTek 5800 system. User metadata consists of name=value pairs. The name is defined in the system schema as having a certain type (for example, a string), and the value is associated with the name at the time data is stored.

file system view: See virtual file system view.

fragment: A piece of a file. Files over a certain size are stored in several chunks or fragments rather than in a single contiguous sequence of bits in one place. The Sun StorageTek 5800 system stores fragments of files across multiple disks and nodes using 5+2 encoding. Thus, when an object of any type (for example, an image or a text file) is stored in the Sun StorageTek 5800 system, it is divided into five data fragments and two corresponding parity fragments.

FRU: Field-replaceable unit. Describes any hardware device, or more commonly a part or component of a device or system, that can easily be replaced by a skilled technician without having to send the entire device or system to be repaired. As the name implies, the unit can be replaced in the field (that is, at the user location).

fsView: Section of the metadata schema file where you specify virtual file system views. fsViews are also used to specify which indexes the system creates for responding to metadata queries.

full-cell: A Sun StorageTek 5800 system configuration that includes 16 storage nodes, two gigabit Ethernet switches, and one service node.
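The 5+2 encoding described under fragment can be sketched as follows. This is an illustration only: it shows how an object might be divided into five equal-size data fragments and what the resulting storage overhead is. It does not compute the two parity fragments, which the system derives with Reed-Solomon encoding, and the zero-byte padding and the function names are assumptions.

```python
DATA_FRAGMENTS = 5    # from the glossary: five data fragments
PARITY_FRAGMENTS = 2  # from the glossary: two parity fragments


def split_into_data_fragments(obj: bytes, k: int = DATA_FRAGMENTS) -> list[bytes]:
    """Split obj into k equal-length fragments, zero-padding the tail.

    Assumption: padding with zero bytes is an illustrative choice only.
    """
    frag_len = -(-len(obj) // k)  # ceiling division
    padded = obj.ljust(frag_len * k, b"\0")
    return [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]


def storage_overhead(k: int = DATA_FRAGMENTS, m: int = PARITY_FRAGMENTS) -> float:
    """Ratio of bytes written (data plus parity) to bytes of user data."""
    return (k + m) / k


if __name__ == "__main__":
    fragments = split_into_data_fragments(b"0123456789ab")
    print(fragments)           # five 3-byte data fragments
    print(storage_overhead())  # 7 fragments stored for every 5 of data
```

With five data and two parity fragments, any two of the seven fragments can be lost and the object can still be reconstructed, at the cost of storing 7/5 of the original size.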
gateway: A router that connects the local subnet on which the Sun StorageTek 5800 system resides to the larger network. You must configure a default gateway for each Sun StorageTek 5800 system cell, to enable information about the system to be available on the network.

GB: Gigabyte. Represents 2 to the 30th power (1,073,741,824) bytes. One gigabyte is equal to 1,024 megabytes.

GUI: Graphical user interface. A graphical form of communication with the Sun StorageTek 5800 system. You access the GUI by typing the administrative IP address and GUI port number in the URL line in a Java technology-enabled web browser connected to the same network as the Sun StorageTek 5800 system.

HADB: High-availability database. A highly available and scalable, always-on relational database management system used to store metadata on the Sun StorageTek 5800 system.

half-cell: A Sun StorageTek 5800 system configuration that includes eight storage nodes, two gigabit Ethernet switches, and one service node.

hive: A multicell configuration that includes at least two full cells, each with 16 Sun StorageTek 5800 system storage nodes.

HTML: HyperText Markup Language. Designed to display data and focus on how data looks. The tags you use to mark up HTML documents and the document’s structure are predefined, so you can only use tags that are defined in the HTML standard.

HTTP: HyperText Transfer Protocol. Underlying protocol used by the World Wide Web. HTTP defines how messages are formatted and transmitted, and what actions web servers and browsers should take in response to various commands.

index: A sequence of columns in the metadata database against which queries are made.

metadata: Extra information about the data object. Describes how and when and by whom a particular set of data was collected, and how the data is formatted. There are two main types of metadata in the Sun StorageTek 5800 system: system and extended.
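The relationship between the schema, attributes, and extended metadata can be sketched as a simple type check on name=value pairs. The Doctor attribute comes from the attribute entry; PatientAge, the SCHEMA dictionary, and validate_extended_metadata are hypothetical names introduced for illustration and are not part of the Sun StorageTek 5800 API.

```python
# Hypothetical schema: each attribute name is associated with a type.
SCHEMA = {
    "Doctor": str,      # example attribute from the glossary
    "PatientAge": int,  # hypothetical attribute, for illustration only
}


def validate_extended_metadata(pairs: dict, schema: dict = SCHEMA) -> dict:
    """Check that every name=value pair matches the type in the schema.

    Returns a copy of the pairs if all are valid; raises KeyError for an
    unknown attribute name and TypeError for a value of the wrong type.
    """
    for name, value in pairs.items():
        if name not in schema:
            raise KeyError(f"attribute {name!r} is not defined in the schema")
        if not isinstance(value, schema[name]):
            raise TypeError(
                f"attribute {name!r} expects {schema[name].__name__}, "
                f"got {type(value).__name__}"
            )
    return dict(pairs)


if __name__ == "__main__":
    print(validate_extended_metadata({"Doctor": "Smith", "PatientAge": 42}))
```

Checking the value against the declared type at store time mirrors the glossary's description: the name is defined in the schema with a type, and the value is attached when the data is stored.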
MP3: Moving Picture Experts Group (MPEG) audio layer 3 file. Layer 3 is one of three coding schemes (layer 1, layer 2, and layer 3) for the compression of audio signals.

multicell: A configuration including more than one full-cell of sixteen Sun StorageTek 5800 system storage nodes. Also called a hive.

namespace: A collection of names, identified by a uniform resource identifier (URI), that XML uses to keep names from separate sources from colliding unintentionally. You can have as many namespaces as desired in the Sun StorageTek 5800 system metadata schema. There is also no limit on the number of namespaces that can be encapsulated within a given namespace level (subnamespaces).

NDMP: Network Data Management Protocol. An open standard backup protocol implemented on the Sun StorageTek 5800 system to allow you to back up the data stored on the system to tape and restore that data in the event of catastrophic system loss.

node: A processing location. A node can be a computer or some other device, such as a printer. Every node has a unique network address.

NTP: Network Time Protocol. An Internet standard protocol (built on top of TCP/IP) that assures accurate synchronization to the millisecond of computer clock times in a network.

object: Any item that can be individually selected and manipulated. For example, in object-oriented programming, an object is a self-contained entity that consists of both data and procedures to manipulate the data.

Object archive (OA): The complete set of data that is reliably stored on a Sun StorageTek 5800 system.

OID: Object ID. A unique identifier for each stored object included in the system.

placement algorithm: Calculation that determines where to store the data and parity chunks of an object stored on the Sun StorageTek 5800 system.
When a data object comes into the system, the Gigabit Ethernet switch directs the store request to a storage node, and that node fragments the object and distributes the fragments to different disks in the system according to the placement algorithm.

query: A request for information from a database.

Reed-Solomon Encoding Algorithm: An encoding algorithm that protects data stored in the Sun StorageTek 5800 system. The Reed-Solomon (RS) algorithm is part of a code family that efficiently builds redundancy into a file to guarantee reliability in the face of multiple part failures in the storage system.

SATA: Serial Advanced Technology Attachment (ATA). An evolution of the Parallel ATA physical storage interface. Serial ATA is a serial link (a single cable with a minimum of four wires) that creates a point-to-point connection between devices. Transfer rates for Serial ATA begin at 150 MBps.

schema: Defines how the Sun StorageTek 5800 system metadata is structured. The schema consists of attributes, each of which has a defined type.

SDK: Software developer’s kit. Includes sample applications and command-line routines that demonstrate the Sun StorageTek 5800 system’s capabilities as well as provide good programming examples.

service node: A Sun Microsystems Sun Fire X2100 M2 server with one 250-gigabyte serial ATA (SATA) disk drive. Used by the Sun StorageTek 5800 system for initial configuration and troubleshooting, and to upgrade the system software.

SMTP: Simple Mail Transfer Protocol. A protocol for sending email messages between servers. Most email systems that send mail over the Internet use SMTP to send messages from one server to another.

storage node: A node on which the Sun StorageTek 5800 system stores data. The storage node includes a single-core AMD Opteron processor, 3 GB of memory, four 500-GB disk drives, and two Ethernet ports.
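The figures in the storage node, half-cell, and full-cell entries, combined with the 5+2 encoding described under fragment, allow a rough capacity estimate. The sketch below is back-of-the-envelope arithmetic only: it treats the drive size as the vendor-stated 500 GB, takes 5/7 of raw capacity as an upper bound on space for user data, and ignores metadata and system overhead. The function names are hypothetical.

```python
DISKS_PER_NODE = 4    # from the glossary: four disk drives per storage node
DISK_SIZE_GB = 500    # from the glossary: 500-GB drives (vendor-stated size)
DATA_FRAGMENTS = 5    # 5+2 encoding: five data fragments
PARITY_FRAGMENTS = 2  # 5+2 encoding: two parity fragments


def raw_capacity_gb(storage_nodes: int) -> int:
    """Total raw disk capacity of a cell with the given number of nodes."""
    return storage_nodes * DISKS_PER_NODE * DISK_SIZE_GB


def usable_capacity_gb(storage_nodes: int) -> float:
    """Upper bound on user data capacity once parity is accounted for.

    Assumption: only the 5+2 parity overhead is modelled here.
    """
    data_fraction = DATA_FRAGMENTS / (DATA_FRAGMENTS + PARITY_FRAGMENTS)
    return raw_capacity_gb(storage_nodes) * data_fraction


if __name__ == "__main__":
    for label, nodes in (("half-cell", 8), ("full-cell", 16)):
        print(label, raw_capacity_gb(nodes), round(usable_capacity_gb(nodes)))
```

Under these assumptions a full-cell holds 32,000 GB of raw disk, of which at most about 22,857 GB is available for user data.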
string: A contiguous sequence of symbols or values, such as a character string (a sequence of characters) or a binary digit string (a sequence of binary values). One of the attribute types allowed for metadata on the Sun StorageTek 5800 system.

system metadata: Metadata that includes a unique identifier for each stored object, called the OID, as well as information on creation time (ctime), data length, and data hash. It is automatically maintained by the system.

table: Partition of the metadata schema. You partition the metadata schema into tables and specify each metadata field as a column within a particular table. You can greatly improve the performance of query and store operations by grouping metadata fields that commonly occur together in the same table and by separating metadata fields that do not commonly occur together into separate tables. Objects stored in the Sun StorageTek 5800 system become rows in one or more tables, depending on which fields are associated with that data.

virtual IP (VIP): Virtual IP address. The Sun StorageTek 5800 system exports two public IP addresses, one to access the data and one to access administrative functions.

virtual file system view: Arrangements of the data stored in the Sun StorageTek 5800 system that allow you to use WebDAV to browse the files as though they were stored in a hierarchical path structure. A virtual file system view is defined using the metadata attributes in the metadata schema file.

WebDAV: Web-based Distributed Authoring and Versioning. A set of extensions to the HTTP/1.1 protocol that allows you to read, add, and delete files on remote web servers. Using the metadata schema file, you can set up virtual file system views in the Sun StorageTek 5800 system that allow you to use WebDAV to browse through data files on the system as though they were stored in a hierarchical path structure.

XML: Extensible markup language.
XML offers a widely adopted standard way of representing text and data in a format that can be processed with relatively little human intervention and exchanged across diverse hardware, operating systems, and applications.

Sun StorageTek 5800 System Architecture
On the Web: sun.com

Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com

© 2007 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, J2SE, Java, JDBC, Solaris, Sun Fire, and Sun StorageTek are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon architecture developed by Sun Microsystems, Inc. AMD Opteron and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices, Inc. Information subject to change without notice. Printed in USA 12/07
