Document Sample
–   A two-node High Availability cluster providing NFS fail-over and mirrored storage

A master’s thesis by Jonas Johansson


Ericsson Utvecklings AB                 Department of Microelectronics and Information
                                        Technology at the Royal Institute of Technology
Reliable Network Mass Storage                                                     2(82)
Jonas Johansson at Ericsson UAB and IMIT, KTH                                2002-02-14


Ericsson Utvecklings AB is developing a server platform that meets the common
telecommunication requirements. New wireless communication technologies
are influencing the service and control networks’ servers. It is desirable to have
reliable storage because it promises a more versatile usage of the generic server.

Current generations of the server platform lack support for non-volatile storage,
and this project has investigated the possibility of designing a mass storage prototype
based on open source software components and conventional hardware. The result
was a two-node cluster with fail-over capabilities that provides an NFS file system with
increased availability. The prototypes are somewhat limited, but several proposals that
enhance the solution are discussed.

1       INTRODUCTION                                                                    6
1.1     BACKGROUND                                                                      6
1.2     PURPOSE                                                                         7
1.3     DISPOSITION                                                                     7
1.4     NOTATION USED IN THIS PAPER                                                     7

2       THE SERVER PLATFORM                                                             9
2.1     THE NETWORK SERVER PLATFORM                                                     9
2.1.1   The Generic Ericsson Magazine                                                   9
2.1.2   Switch Boards                                                                  10
2.1.3   Processor Boards                                                               10
2.2     TELORB                                                                         11
2.2.1   Communication                                                                  12
2.2.2   TelORB Processes                                                               13
2.3     STORAGE REQUIREMENTS                                                           14

3       MAGNETIC DISK DEVICE BASICS                                                    15
3.1     DISK DEVICE TERMINOLOGY                                                        15
3.1.1   Basic Hardware Components                                                      15
3.1.2   Data Layout                                                                    16
3.1.3   Data Encoding                                                                  16
3.1.4   Form Factor and Areal Density                                                  17
3.2     SERVICE TIMES                                                                  17
3.3     DISK DEVICE RELIABILITY                                                        18
3.3.1   Common reasons causing disk device failure                                     18
3.3.2   Disk device reliability measurements                                           19
3.3.3   Self Monitoring and Reporting Technology                                       20
3.4     SINGLE DISK STORAGE                                                            20

4       STORAGE DEVICE INTERFACES                                                      21
4.1     ADVANCED TECHNOLOGY ATTACHMENT                                                 21
4.2     SMALL COMPUTERS SYSTEMS INTERFACE                                              22
4.3     ISCSI                                                                          23
4.4     SERIAL STORAGE ARCHITECTURE                                                    24
4.5     FIBRE CHANNEL                                                                  24

5       REDUNDANT DISK ARRAYS                                                          26
5.1     DISK ARRAY BASICS                                                              26
5.1.1   Striping                                                                       26
5.1.2   Disk Array Reliability                                                         27
5.1.3   Redundancy                                                                     27
5.1.4   RAID Array Reliability                                                         28
5.2     RAID LEVELS                                                                    28
5.2.1   Level 0 – Striped and Non-Redundant Disks                                      29
5.2.2   Level 1 – Mirrored Disks                                                       29
5.2.3   Level 0 and Level 1 Combinations – Striped and Mirrored or vice versa          31
5.2.4   Level 2 – Hamming Code for Error Correction                                    32
5.2.5   Level 3 – Bit-Interleaved Parity                                               32
5.2.6   Level 4 – Block-Interleaved Parity                                             33
5.2.7   Level 5 – Block-Interleaved Distributed Parity                                 34
5.2.8   Level 6 – P+Q Redundancy                                                       35

5.2.9   RAID Level Comparison                                               35
5.3     RAID CONTROLLERS                                                    37

6       FILE SYSTEMS                                                        38
6.1     BASICS ABOUT LOCAL FILE SYSTEMS                                     38
6.1.1   Format and Partition                                                38
6.1.2   Data Blocks                                                         39
6.1.3   Inodes                                                              39
6.1.4   Device Drivers                                                      40
6.1.5   Buffers and Synchronisation                                         41
6.1.6   An Example                                                          42
6.1.7   Journaling and Logging                                              42
6.2     LINUX VIRTUAL FILE SYSTEM                                           43
6.3     DISTRIBUTED FILE SYSTEMS                                            43
6.3.1   Network File System                                                 44
6.3.2   Andrew File System                                                  44
6.4     THE FILE SYSTEM AND THE USER                                        45
6.4.1   User Perspective of the File System                                 45
6.4.2   Filesystem Hierarchy Standard                                       45

7       SYSTEM AVAILABILITY                                                 46
7.1     THE TERM AVAILABILITY                                               46
7.2     TECHNIQUES TO INCREASE SYSTEM AVAILABILITY                          46

8.1     PROPOSAL BACKGROUND                                                 49
8.1.1   Simple NFS Server Configuration                                     49
8.1.2   Adding a Redundant NFS Server                                       50
8.1.3   Adding Shared Storage                                               51
8.1.4   Identified Components                                               52
8.2     VIRTUAL SHARED STORAGE                                              52
8.2.1   Network Block Device and Linux Software RAID Mirroring              52
8.2.2   Distributed Replicated Block Device                                 55
8.3     INTEGRATION OF THE COMPONENTS                                       57
8.3.1   Linux NFS Server                                                    57
8.3.2   Linux-HA Heartbeat                                                  57
8.3.3   The two-node High Availability Cluster                              57

9       IMPLEMENTATION                                                      59
9.1     REDBOX                                                              59
9.1.1   Hardware Configuration                                              59
9.1.2   Operating system                                                    60
9.1.3   NFS                                                                 61
9.1.4   Distributed Replicated Block Device                                 61
9.1.5   Network Block Device and Software RAID Mirroring                    62
9.1.6   Heartbeat                                                           63
9.1.7   Integrating the Software Components into a Complete System          63
9.2     BLACKBOX                                                            65
9.2.1   Hardware Configuration                                              65
9.2.2   Software configuration                                              66

10      BENCHMARKING AND TEST RESULTS                                       67

10.1     BENCHMARKING TOOLS                                     67
10.1.1   BogoMips                                               67
10.1.2   Netperf                                                68
10.1.3   IOzone                                                 69
10.1.4   Bonnie                                                 69
10.1.5   DRBD Performance                                       69
10.2     REDBOX BENCHMARK                                       70
10.3     BLACKBOX BENCHMARK                                     70
10.4     REDBOX VERSUS BLACKBOX                                 71
10.5     FAULT INJECTION                                        71

11       CONCLUSIONS                                            72
11.1     GENERAL                                                72
11.2     THE PROTOTYPES                                         72
11.3     DATACOM VERSUS TELECOM                                 73
11.4     LINUX AND THE OPEN SOURCE COMMUNITY                    73
11.5     TSP                                                    74

12       FUTURE WORK                                            75
12.1     POSSIBLE PROTOTYPE IMPROVEMENTS                        75

13       ACKNOWLEDGEMENTS                                       77

14       ABBREVIATIONS                                          78

15       REFERENCES                                             80
15.1     INTERNET                                               80
15.2     PRINTED                                                80





      All building blocks in a modern telecom infrastructure (e.g. servers, switches, media
      gateways and base station controllers) are designed to be extremely fault tolerant and
      highly available. Unplanned system downtime is unacceptable; every second is
      crucial, but it is also vital that planned downtime is kept to a minimum. In modern
      telecom networks, far from all traffic consists of regular phone calls. Mobile units,
      for instance PDAs, are often used to connect to the Internet via cellular telephones,
      and the trend is towards an increasing data/voice traffic ratio: more data traffic and
      less actual talking. New wireless communication technologies with enhanced
      performance, e.g. GPRS and UMTS, are speeding up the convergence of cellular
      telephones and PDAs. The service and control networks’ servers are of course
      influenced by this change. They must be able to adopt new services, and therefore it
      must be easy to configure and scale the servers.

      This project has investigated the possibility of designing a solution based on open
      source software components and conventional hardware. The result was a two-node
      cluster with fail-over capabilities that provides increased availability.


      Ericsson Utvecklings AB is the headquarters of Core Network Development, a
      virtual organisation hosted by several Ericsson companies and partners around the
      world. Ericsson UAB provides Ericsson and its customers with cutting edge
      telecommunications platform products, services and support. This project was
      performed at Ericsson UAB in Älvsjö, Sweden, which is developing a server platform
      informally referred to as The Server Platform (TSP). TSP meets the common
      telecommunication requirements: high availability, high performance, cost efficiency
      and scalability. The first three generations of TSP lack any implementation and
      integration of mass storage solutions. Currently the fourth generation is under
      development, and telecom operators, the primary buyers, have now clearly shown
      interest in the possibility to store large amounts of information in a non-volatile
      memory (e.g. a hard disk device). Many new services and their applications rely on
      the possibility to store large amounts of information. An example of an application
      that needs reliable storage is AAA, which provides authentication, authorisation
      and accounting services. A Home Location Register (HLR) is another example of a
      TSP implementation affected by new storage demands. An HLR can briefly be
      described as a high performance database management system (DBMS) that
      stores client specific information such as billing and last geographical location.

      Today the server platform stores operating system files and files used when booting
      the processor cluster on single hard disk devices attached to dedicated load nodes.
      The processes running in the processor cluster do not use these disks other than for
      booting. Application data generated by a process during execution is stored in a

      database, which is distributed over the cluster members’ volatile memory. To exclude
      the possibility of a single point of failure, another processor always stores a copy of a
      process’ information. This simple principle of storing the same information in two
      different places is an example of the redundancy principle, an efficient way to
      provide a higher level of availability. Volatile memory has two obvious drawbacks: it
      is expensive when storing large amounts of information, and the information is of
      course lost when the power is turned off.
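      The redundancy principle described above can be sketched in a few lines of Python
      (an illustration only; the names are hypothetical and this is not the actual TelORB
      mechanism): every write goes to two independent stores, so the loss of one of them
      loses no data.

```python
class ReplicatedStore:
    """Toy model of the redundancy principle: the same information is
    stored in two places, standing in for two processors' memories."""

    def __init__(self):
        self.primary = {}   # one processor's volatile memory
        self.replica = {}   # a copy held by another processor

    def write(self, key, value):
        # Each write is applied to both copies before being acknowledged.
        self.primary[key] = value
        self.replica[key] = value

    def read(self, key):
        # If the primary copy has been lost, the replica still serves it.
        if key in self.primary:
            return self.primary[key]
        return self.replica[key]

    def fail_primary(self):
        # Simulate losing one processor's volatile memory entirely.
        self.primary.clear()


store = ReplicatedStore()
store.write("subscriber-42", {"location": "Stockholm"})
store.fail_primary()
print(store.read("subscriber-42"))  # the data survives the failure
```

A single in-memory copy would have lost the record here; the second copy is exactly the trade of extra memory for availability that the text describes.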


      The purpose of this thesis was to investigate the possibility of designing a high
      availability mass storage configuration suitable for a telecom server’s requirements,
      and whether it was feasible to implement a prototype with conventional hardware and
      open source software. Questions that pervade the thesis:

      -   How is a storage configuration suitable for a telecommunication server platform
          designed to exclude any single point of failure?

      -   How is a storage configuration prototype implemented using only standard
          components and open source software?

      -   How does the performance compare to commercial solutions?


      The thesis starts with a brief description of the rather complex target system, the
      TSP, and continues with the basics of the fundamental hardware components
      used in storage configurations. A deeper knowledge of the most basic storage
      components increases the understanding of the higher-level system design issues,
      i.e. why single disk storage is inappropriate in a vital system. Techniques and
      theories providing increased reliability and availability at some level, such as RAID,
      clustering and distributed file systems, are discussed. General system design issues
      such as single point of failure and redundancy are also within the scope of the
      thesis. A system’s availability is basically a compromise between hardware
      availability, software availability and human error. High availability is a hot topic;
      several open source projects are designing and implementing new ideas
      concerning high availability for Linux. These theories are of course useful when
      designing high availability mass storage suitable for a telecommunication server
      platform. The next section presents possible designs based on standard
      components and open source software, followed by an evaluation, the final
      conclusions and suggestions for how the prototypes may be improved.


      Terms that need extra attention when presented for the first time are written in
      italics. For instance:

     A widely used file system is EXT2FS that is considered being the standard Linux
     file system today.

There are a few examples of applications used during this project that can be typed
at a computer terminal, and these are written in the fixed-width Courier font style:

     When creating an EXT2 file system it is desirable to use the mke2fs application.

A UNIX or Linux prompt is illustrated with a dollar sign:

     $ ls | grep foo

References are marked as:



        The Server Platform is developed to fulfil increased demands for openness,
        robustness and flexibility. TSP is currently available in three generations, each with
        a set of different configurations. This is a description of ANA 901 02/1 (figure 1), a
        configuration of TSP 3.0 that utilises the fault control software system TelORB for
        traffic handling and database transactions. The ANA 901 02/1 configuration and its
        hardware and software components are used for testing and evaluation of the mass
        storage prototype during this project. In this document TSP refers to this
        configuration, and components are ANA 901 02/1 components if not otherwise
        stated.

        In parallel with this thesis project, the fourth generation of TSP is under
        development. Major differences between it and its precursors are the migration
        from Solaris UNIX to Linux and a homogeneous use of Intel Pentium based
        processor boards.

        Figure 1 – A fully equipped TSP base cabinet.

        The material in this section is more thoroughly described in the ANA 901 02/1 System
        Description [Ericsson01] and the TelORB System Introduction [Ericsson02].


        The TSP is derived from a broad variety of hardware components, and its generic
        hardware platform is called the Network Server Platform (NSP). This section covers
        the most essential components, emphasising those used during this master’s thesis
        project.

2.1.1   The Generic Ericsson Magazine

        TSP is adapted to the Generic Ericsson Magazine (GEM) hardware infrastructure that
        is based on the BYB 501 equipment practise. It is a generic platform sub rack that is
        used in several different Ericsson products. All open hardware components are
        standard Compact PCI components. An Ericsson made carrier board is used to

            CompactPCI or cPCI is a very high performance industrial bus based on the standard PCI electrical
        specification. Compact PCI boards are inserted from the front of the chassis, and I/O can break out either to the

        integrate the standard cPCI components with the GEM back plane practise. This
        provides Ericsson with the possibility to influence essential characteristics such as

2.1.2   Switch Boards

        A sub rack is equipped with two redundant Switch Control Boards (SCB), providing
        a complete electrical and mechanical infrastructure for up to 24 circuit boards with
        15 mm spacing. The SCB includes a level 1 Ethernet switch with 26 100Base-T
        backplane connections, one 1000Base-T connection and two 100Base-TX
        connections in the front. Its primary usage is to provide Ethernet communication
        between a sub rack’s processor boards.

        The Gigabit Ethernet Switch Board (GESB) is a state-of-the-art Gigabit Ethernet
        switch made by Ericsson. It provides many features, but its obvious usage is to
        enable communication between different sub racks. When the SCBs’ Gigabit
        interfaces are connected to the GESB, it is possible to communicate not only
        between processors in a specific sub rack but also between all processors mounted
        in connected sub racks.

2.1.3   Processor Boards

        In the third generation of TSP the Support Processor (SP) is a Sparc based
        processor board from Force Computers, adapted to GEM with a special carrier
        board. In TSP 4 the goal is to replace it with a processor board based on the Intel
        Pentium processor. There are typically four SPs in a base cabinet, working in a load
        sharing mode, which are dedicated to operation and maintenance (O&M) as well as
        other support functions. They are used to start up the cluster, i.e. deliver the
        appropriate boot images and operating system files as well as applications. The SPs
        include all I/O units used in a TSP node, i.e. SCSI hard disks, DVD players and
        tape drives, which is why the SP sometimes internally is referred to as the I/O
        processor. Another important feature of the SPs is that they act as gateways
        between the internal and external networks; all communication to the TSP is
        managed via the SPs’ external Ethernet interfaces or RS-232 serial communication
        ports. The first three generations of SPs run Solaris UNIX, but generation 4 is
        aiming towards Linux.

        The applications are executed on a cluster of Traffic Processors (TP) running the
        TelORB operating system. A TP in TSP 3 is an MXP64GX Pentium III processor
        board from Teknor, which can be described as a PC integrated on a single board.

        front or through the rear. More information can be found at the PCI Industrial
        Computers Manufacturer's Group.



      Each board is by default equipped with two 100Base-TX interfaces in the backplane,
      but it is possible to have at least two additional interfaces in the front. A TP-IP board
      is a TP mounted with a dual port Ethernet PMC module. The TPs can also be
      configured to support Signalling System number 7 (SS7), an out-of-band signalling
      architecture used in telecom networks providing service and control capabilities.
      Similar to TP-IP, a standard TP mounts a PMC module providing the SS7
      functionality. This configuration is referred to as a Signalling Terminal (ST). The
      PMC enables versatile usage of the TP since it provides means to add functionality, as it is

      In TSP 4.0 the TPs can run both Linux and TelORB. The two operating systems can
      coexist, but a single TP cannot of course run both at the same time.

2.2   TELORB

      TelORB is a distributed operating system with real-time characteristics, suitable for
      controlling e.g. telecom applications. It can be loaded onto a group of processors,
      and the group will behave like one system (figure 2).

      Applications run on top of TelORB. Different parts of an application can run on
      different processors, and communication among these parts is possible in a
      transparent manner. A TelORB system is described as a truly scalable system,
      meaning that the capacity is a linear function of the number of processors. If there is
      a need to increase the processing capacity, processors can be added at run-time
      without disturbing ongoing activities.

      [Figure: TelORB kernels booting from PROM on each processor, with a CORBA
      ORB, a Java VM and software management on top, connected to the I/O
      processors.]

      Figure 2 – Overview of the hardware and software used in a TelORB system.

          PCI Mezzanine Connector modules are small PCI cards mountable to the PMC interface standard.

        TelORB also includes a database called DBN. It is distributed over the cluster of
        TelORB processors and stored in their primary memory. Since it is distributed,
        every process can access any database instance regardless of which processor it
        is running on. The distributed DBN database has unique characteristics and is a
        key component in a working TelORB application.

        In addition to TelORB’s real-time and linearity characteristics the system can also be
        described as [Ericsson02]:

        -      Open, since the external protocols of the system are standardised protocols, e.g.
               IIOP and TCP/IP. IIOP, which is part of CORBA and implemented via an ORB,
               is also used for managing the TelORB system. The processors used are
               commercially available, and applications can be programmed in C++ and Java.

        -      Fault Tolerant; a TelORB application is extremely fault tolerant, which is
               achieved primarily by duplication of functionality and persistent data on at least
               two processors. There exists no single point of failure, meaning that one
               component fault alone will not reduce the system’s availability.

        -      High Performance; the true linear scalability makes TelORB applications
               capable of handling large amounts of data.

        -      Object Oriented; a TelORB system is able to run common object oriented
               programming languages such as C++ and Java. TelORB objects are specified
               using IDL, which is supported by the CORBA standard.

2.2.1   Communication

        TSP uses well-specified and open protocol stacks for communication both internally
        and externally (figure 3). A TelORB zone is internally built around two switched
        Ethernet networks physically mounted in the GEM back plane. Thus all processors
        have direct contact with all other TelORB processors in the same zone.

        [Figure: TelORB processors (ST, TP-IP, TP) and support processors (SP) on the
        internal network, communicating via IPC and TCP/IP or UDP/IP; E1/T1 links,
        routers towards the inter-node and O&M IP networks, and a modem with RS-232
        for remote O&M.]
        Figure 3 – Overview of the provided communication possibilities.

        Communication between TelORB processors utilises the Inter Process
        Communication protocol (IPC), which is a protocol layered directly on top of the
        MAC layer. For internal communication between I/O processors and TelORB
        processors, common IP protocol stacks such as UDP/IP and TCP/IP are used. The
        internal processor communication is never exposed to the outside world.

        For operation, maintenance and supervision purposes, the support processors are
        accessible externally using IP protocol stacks such as TCP/IP.

        To provide geographical network redundancy, two TelORB zones are able to
        cooperate and serve as redundant copies. Updates of the two TelORB databases
        are transferred between the zones using TCP/IP connections directly between
        TelORB processors in different zones. Virtual IP (VIP) is a function used as an
        interface towards external IP networks. TSP also supports SS7, an out-of-band
        signalling architecture used in telecom networks providing service and control
        capabilities, and RS-232, which is standard serial communication.

2.2.2   TelORB Processes

        Everything executing on a TelORB processor executes in processes. A process is
        a separate execution environment, and in TelORB processes execute in parallel
        with other processes. Every process is an instance of a special process type
        defined in Delos, which is TelORB’s internal specification language. A process
        cannot affect another process except through special communication mechanisms
        called Dialogues, software entities also defined in Delos that handle
        communication between processes. Dialogues are based on IPC, which takes care
        of packaging, sending, receiving and unpacking.

        Processes can be static or dynamic. Static processes are started when the system
        starts and are always running; if a static process is destroyed, it is automatically
        restarted by the operating system. A dynamic process is created from other
        process instances but never automatically restarted. Dynamic processes are
        started on request and terminated when their task is completed.

        A static process could for example supervise a group of subscribers. Whenever a
        subscriber calls, the supervising process starts a dialogue for creation of a dynamic
        process to handle that particular call. When the call is finished the process closes the
        dialogue and the dynamic process terminates. The dynamic process is not necessarily
        executed on the same processor as its static parent process.

        Dynamic processes are used for increasing the robustness of a TelORB system. If a
        software fault occurs in a call, only the corresponding distributed dynamic process will
        be shut down, all other calls are unaffected and the static parent process could
        immediately start a new dynamic process to handle the ongoing call. A static process
        is automatically restarted by the operating system if a software fault occurs.
      Reliable Network Mass Storage                                                     14(82)
      Jonas Johansson at Ericsson UAB and IMIT, KTH                                 2002-02-14

      Every process type is associated to a distribution unit type (DUT), which in turn is
      associated to a processor pool. The DUT is specified in Delos while the processor
      pool is specified in a special configuration file. A processor pool can be associated
      with several DUTs. When a system starts, TelORB could for example start an
      instance of a static process at a processor. If that particular processor crashes,
      TelORB will automatically restart that process at another processor associated to the
      same DUT.


      Currently TSP does not support non-volatile storage other than the block devices
      used by the support processors and small Flash disks used by the traffic processors
      for the initial booting. It is desirable to support the use of attached reliable storage
      since it promises a more versatile usage of the TSP platform. What exactly the
      storage will be used for is ultimately a matter for the customers.

      The storage must of course be reliable and provide high availability, that is, never
      go down. It is desirable to have a solution that is easy to maintain and that is
      scalable. The solution must at least support continuous reading and writing at a rate
      of 5 to 10 MBytes/s.


        The main purpose of a magnetic hard disk, or hard drive, is to serve as long-term,
        inexpensive storage for information. Compared with other types of memory, e.g.
        DRAM, magnetic disks are considered slow but generally a rather reliable storage
        medium. Hard disks are considered non-volatile storage, meaning that data remains
        even when the power is turned off. A disk drive supports random access, in contrast
        to a tape device, which is referred to as a sequential access technology.


        A head-disk assembly (HDA) is the set of platters, actuator, arms and heads protected
        by an airtight casing to ensure that no outside air contaminates the platters. A hard
        disk device is an HDA and all associated electronics.

3.1.1   Basic Hardware Components

        A hard disk consists of a set of platters (figure 4) coated with a magnetic medium
        designed to store information in the form of magnetic patterns. Usually both surfaces
        of the platters are coated and thus both surfaces are able to store information. The
        platters are mounted by cutting a hole in the centre and stacking them onto a spindle.
        The platters rotate with a constant angular velocity, driven by a spindle motor
        connected to the spindle. Modern hard disks' rotational velocity is usually 5,400,
        7,200 or 10,000 RPM, but there are examples of state-of-the-art SCSI disks with
        speeds as high as 15,000 RPM.

                        Sector       Track              Head              Actuator



        Figure 4 – An overview of the most basic hard disk device terminology.

        A set of arms with magnetic read/write heads is moved radially across the platters’
        surfaces by an actuator. The head is an electromagnet that produces switchable
        magnetic fields to read or record bit streams on a platter’s track. The heads are very
        close to the spinning platters but they never touch the surfaces. In almost every disk
        drive the actuator moves the heads collectively but only one head can read or write
        at a time. When the heads are correctly positioned radially, the correct surface is
        selected by electronically switching to the corresponding head.

3.1.2   Data Layout

        Information, i.e. bit streams of zeroes and ones, can be read from or recorded to the
        platters’ surfaces. Information is stored in concentric tracks. Each track is further
        broken down into small arcs called sectors, each of which typically holds 512 bytes of
        information. Read and write operations physically affect complete sectors since a disk
        is unable to address bits within a sector. A cylinder is the vertical set of tracks at the
        same radius.

        Early disk devices had the same number of sectors on all tracks and thus presented
        a non-uniform data density across the platters’ surface. By placing more sectors on
        the tracks at the outside of the platter and fewer sectors at the inside edge, a
        constant data bit density is maintained across the platter’s surface (figure 5). This
        technique is called zone bit recording (ZBR). Typically, drives have at least 60
        sectors on the outside tracks and usually fewer than 40 on the inside tracks. This
        changes the disk's raw data rate: it is higher on the outside than on the inside. Most
        ZBR drives have at least 3 zones but some may have 30 and even more zones. All of
        the tracks within a zone have the same number of sectors per track.

        Figure 5 – An illustration of a platter divided into three different zones: 1, 2 and 3.
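        The effect of ZBR on the raw data rate can be sketched with a short calculation. The
        sector counts and rotational speed below are illustrative, not taken from any
        particular drive:

```python
def raw_data_rate_mbs(sectors_per_track, rpm, bytes_per_sector=512):
    """Raw media data rate in MBytes/s for one track: all bytes stored
    on the track pass under the head once per revolution."""
    revolutions_per_second = rpm / 60.0
    return sectors_per_track * bytes_per_sector * revolutions_per_second / 1e6

# A hypothetical 7,200 RPM drive with 60 sectors/track in the outer
# zone and 40 sectors/track in the inner zone:
outer = raw_data_rate_mbs(60, 7200)  # ~3.7 MBytes/s
inner = raw_data_rate_mbs(40, 7200)  # ~2.5 MBytes/s
```

        With the same angular velocity, the outer zone thus delivers data roughly 50 %
        faster than the inner zone.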

3.1.3   Data Encoding

        As mentioned above, the bit streams are encoded as series of magnetic fields
        produced by the electromagnetic heads. The magnetic fields are not used as
        absolute measurements, i.e. north polarity representing a zero and south polarity
        representing a one. The technique used is instead based on flux reversals. When a
        head moves over a reversal, e.g. a transition from a field with one polarity to an
        adjacent field of the opposite polarity, a small voltage spike is produced that is much
        easier to detect than the magnetic field’s actual polarity (figure 6).

        Figure 6 – Data is encoded and recorded to the magnetic coating as magnetic fields.

        Suppose a series of more than 500 zeroes is being recorded in the magnetic coating
        of a platter; then it is almost impossible to keep track of all the bits since it is just one
        long unipolar magnetic field. To avoid problems associated with long sequences of
        zeroes or ones, the encoded information contains a clocking synchronisation.

3.1.4   Form Factor and Areal Density

        The platter’s size is the primary determinant of the disk device’s overall physical
        dimensions, also generally called the drive's form factor. All platters in a disk are of
        the same size and it is usually the same for all drives of a given form factor, though
        not always. The most widely used disk device today is the 3.5-inch form factor disk
        and it is used in a wide range of configurations from ordinary PCs to powerful
        storage servers.
        Traditionally, bigger platters meant more storage. Manufacturers extended the
        platters as close to the width of the physical drive package as possible to maximise
        the amount of storage in one drive. But despite this fact the trend is towards smaller
        platters and the primary reason is performance. The areal density of a disk drive is
        the number of bits that can be stored per square inch or centimetre. As areal density
        is increased, the number of tracks per areal unit, referred to as track density, and the
        number of bits per inch stored on each track, usually called linear density or recording
        density, also increases. As data is more closely packed together on the tracks, the
        data can be written and read far more quickly. The areal density is increasing so fast
        that the loss of storage due to smaller platters is negligible.


        Disk performance specifications for hard disks are generally based upon how the hard
        disk performs while reading. The hard disk spends more time reading than writing, and
        the service times for reading are also lower than for writing. Disk device performance is
        a function of service times, which can be divided into three components: seek time,
        rotational latency and data transfer time [Chen93].

        The seek time is the amount of time required for the actuator to move a head to the
        correct radial position, i.e. the correct track. The heads’ movement is a mechanical
        process and thus the seek time is a function of the time needed to initially accelerate
        the heads and the number of tracks traversed. Seek time is the most discussed
        measurement of hard disk performance, but since the number of traversed tracks
        varies it is presented with three different values:

        -   Average: The average seek time is the time required to seek from a random track
            to another random track. Usually in the range of 8 – 10 ms but some of the latest
            SCSI drives are as low as 4 ms.

        -   Full Stroke: The time required to traverse all tracks, from the innermost track
            to the outermost track. It is in the range of 20 ms, and this time
            combined with the average seek time is close to the actual seek time for a full
            hard disk.

        -   Track-to-Track: This measurement is the amount of time that is required to seek
            between adjacent tracks, approximately 1 ms.

        The heads’ movement and the platters’ spin are not synchronised. That is, the desired
        sectors can be anywhere on the track and therefore the head, when positioned over
        the desired track, must wait for the desired sector to rotate beneath it. The amount of
        time spent waiting is called the rotational latency. The waiting depends on the platter’s
        rate of rotation and how far the sector is from the head. The rotational latency
        correlates with the disk spin; hence faster disk spins result in less rotational latency.
        Generally the average rotational latency, calculated for half a rotation, is given by:

        AverageRotationalLatency(x) = ½ · (60,000 / x) ms = 30,000 / x ms

        Equation 1 – Average rotational latency is calculated for half a rotation. x is the platter’s rate of rotation in
        rounds per minute.

        Some manufacturers also provide a worst case scenario, meaning that the sector just
        passed the head and a full rotation is needed before the sector can be read or written.
        The worst case latency is twice the amount of the average rotational latency.
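        The average and worst-case latencies translate directly into code; a minimal sketch
        in Python:

```python
def avg_rotational_latency_ms(rpm):
    """Average rotational latency: the time for half a rotation,
    in milliseconds ((60,000 ms per minute / rpm) / 2)."""
    return 30_000.0 / rpm

def worst_case_latency_ms(rpm):
    """Worst case: the sector just passed the head, so a full
    rotation is needed -- twice the average latency."""
    return 2.0 * avg_rotational_latency_ms(rpm)

# 5,400 RPM -> 5.56 ms average; 7,200 RPM -> 4.17 ms; 10,000 RPM -> 3.0 ms
```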

        Data transfer time is the amount of time required to deliver the requested data. It
        correlates with the disk’s bandwidth, which is a combination of the areal density of the
        disk device medium and the rate at which data can be transferred from the platters’
        surfaces.
        Command overhead time, in effect “the disk’s reaction time”, refers to the time that
        elapses between when a command is sent and when it is executed. It is usually
        just added to the much greater seek time since it is only about 0.5 ms.

        Settle time is the time needed for the heads to stabilise after the actuator has moved
        them. The heads must be stable enough to be able to read or write information. The
        settle time is usually in the range of 0.1 ms and therefore it is often negligible.
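        Taken together, the components above give an estimate of the service time for one
        random request. A sketch with illustrative values rather than figures for a specific
        drive:

```python
def service_time_ms(seek_ms, rpm, transfer_ms,
                    overhead_ms=0.5, settle_ms=0.1):
    """Estimated service time for one random request: seek time plus
    average rotational latency plus data transfer time, with the small
    command overhead and settle time added on top."""
    rotational_latency_ms = 30_000.0 / rpm  # average: half a rotation
    return (seek_ms + rotational_latency_ms + transfer_ms
            + overhead_ms + settle_ms)

# An 8.5 ms average seek on a 7,200 RPM disk with a 0.5 ms transfer
# gives roughly 13.8 ms per request.
```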


        Hard disk devices are manufactured under rigorous quality controls. New devices are
        seldom delivered with hardware faults, but if a vendor accidentally delivers drives
        with chemical contaminants inside the head-disk assembly, they normally fail almost
        instantly. A failure is defined as a detectable physical change to the hardware, a fault
        is an event triggered by non-normal operation, and an error is the consequence of a
        fault.

3.3.1   Common reasons causing disk device failure

        There are three major reasons hard disk drives fail:

        -   Heat is lethal to disk devices and rapidly decreases their overall lifetime. In
            reliable systems hard disk devices are often equipped with a special disk device
            cooling fan. If the cooling halts and a disk gets overheated, it could suffer serious
            hardware faults. “Bad sectors” is one common result, which means that some of
            the platters’ sectors are corrupt and unusable. High heat could in a worst-case
            scenario cause the disk’s heads to get “glued” to the platters’ surfaces and cause
            the spindle trouble when it tries to spin the platters.

        -   Mishandling is of course one major reason that a hard disk fails. The small
            electromechanical components are not designed for drops or earthquake-like
            shocks.
        -   Electronics failure is common due to the heating/cooling cycles that cause breaks
            in the printed circuit board or in the wires inside the chips. Electronics failures
            caused by these cycles are usually sudden and without warning. Ignoring ESD
            precautions (e.g. proper grounding) when handling a drive could cause the
            electronics to fail due to electrostatic discharges.

3.3.2   Disk device reliability measurements

        Disk device vendors present several measurements regarding disk reliability.
        Most of these measurements tend to be hard to interpret and sometimes they are
        misleading, but if interpreted correctly they are helpful when comparing different disk
        devices. Two important reliability measurements are:

        -   Mean Time Between Failures (MTBF) is the most commonly used measurement
            of hard disk device reliability. MTBF is the average amount of time that will pass
            between two random failures on a drive. It is usually measured in hours and
            modern disk devices are usually in the range of 300,000 to 1,200,000 hours. A
            common misinterpretation is that a disk device with an MTBF value of 300,000
            hours (approximately 34 years) will work for as many years without failing. This is
            of course not the case. It is not feasible for a disk device vendor to test a unit for
            34 years, so an aggregated analysis of a large number of devices is used instead.
            The MTBF should be used in conjunction with the service life specification.
            Assume that a disk device has an MTBF value of X hours and a service life of Y
            years. The device is then supposed to work for Y years. During that period of
            time, a large population of disks will on average accumulate X hours of run time
            between failures.

        -   Service life is the most correct measurement to use if the disk device itself is
            used in a system with high reliability demands. The service life is the amount of
            time before the disk device enters a period where the disk’s probability to fail
            over time increases.
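        The relationship between MTBF and a population of disks can be illustrated by
        converting the MTBF into an expected annual failure rate. This sketch assumes a
        constant failure rate within the service life, which is the usual simplification:

```python
def annualised_failure_rate(mtbf_hours):
    """Expected fraction of a large disk population that fails per year,
    assuming a constant failure rate within the service life."""
    hours_per_year = 24 * 365
    return hours_per_year / mtbf_hours

# An MTBF of 300,000 hours corresponds to roughly 2.9 % of a large
# population failing per year -- not to a 34-year lifetime per disk.
```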

3.3.3   Self Monitoring and Reporting Technology

        The Self Monitoring and Reporting Technology (SMART) was first developed at
        Compaq and it tries to detect possible hard disk failures before they occur. It evolved
        from a technology developed by IBM called Predictive Failure Analysis. The
        manufacturers analyse mechanical and electronic characteristics of failed drives to
        determine relationships between predictable failures and trends in various
        characteristics of the drive that suggest the possibility of slow degradation of the
        drive. The exact characteristics monitored depend on the particular manufacturer.


        Conventional hard disks are considered to be rather cheap hardware components.
        For a personal computer used in the office for writing documents or at home for
        playing computer games they are also considered to be sufficiently fast and reliable.
        Most hard disk devices outlive the rest of the computer components in an ordinary
        personal computer, and when changing computer you also tend to change the disk. If
        the disk in your personal computer crashes, the only one affected is you. In a
        real-time and business-critical network with servers handling databases with millions
        of clients, every single minute of downtime could cost a fortune and affect millions of
        people, i.e. clients paying for a working service. Single-disk storage solutions are out
        of the question; critical systems must survive at least a single disk failure.

        A single disk storage system can support multiple user sessions when the disk I/O
        bandwidth is greater than the per session bandwidth requirement by multiplexing the
        disk I/O bandwidth among the users. This is achieved by retrieving data for a user
        session at the disk transfer rate, buffering it, and delivering it to the user at the
        desired rate. Despite new and improved I/O performance, a single disk device is
        often not enough to serve as storage for a high-end server system. Vital servers and
        systems need more I/O bandwidth than a sole disk is able to provide.
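        The multiplexing argument above can be expressed as a simple bound. The disk
        bandwidth and per-session rate below are hypothetical values chosen for
        illustration:

```python
def max_sessions(disk_bandwidth_mbs, per_session_mbs):
    """Upper bound on the number of constant-rate sessions a single
    disk can multiplex, ignoring the seek and rotational overhead
    incurred when switching between per-session reads."""
    return int(disk_bandwidth_mbs // per_session_mbs)

# A disk sustaining 20 MBytes/s serving 1.5 MBytes/s streams can
# support at most 13 concurrent sessions in this idealised model.
```

        In practice the achievable number is lower, since every switch between sessions
        costs a seek and a rotational delay.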


      The interface is the channel where the data transmission between a hard disk device
      and a computer system takes place. It is really a set of hardware and software rules
      that govern the data transmission. Physically the interface exists in many different
      configurations, but traditionally most are implemented in hardware with compatible
      chips on the motherboard and the disk device linked together with a cable. ATA and
      SCSI are two interfaces often utilised today, but there are emerging standards
      improving performance and connectivity as well as reliability.


      The Advanced Technology Attachment (ATA) interface is mostly used in conventional
      PCs. ATA is considered a low-bandwidth interface and is relatively cheap compared
      with other existing interfaces. It originates from the standard bus interface first seen
      on the original IBM AT computer [IBM01].

      ATA is sometimes referred to as Integrated Device Electronics (IDE) and Ultra DMA,
      but the real ANSI standard designation is ATA. Despite the official ATA
      standardisation many vendors have invented their own names, but these should be
      considered marketing names.

      The first method ATA used for transferring data over the interface was a technique
      called programmed I/O (PIO). The system CPU and support hardware execute the
      instructions that transfer the data to and from the drive controller. This works well for
      lower-speed devices such as floppy drives, but high-speed data transfers tend to
      consume all CPU cycles and simply make the system too slow. With Direct
      Memory Access (DMA), the actual data transfer does not involve the CPU. The disk
      and some additional hardware communicate directly with the memory. DMA is a
      generic term used for the peripheral’s possibility to communicate directly with the
      memory. The transfer speed increases because of decreased overhead and the CPU
      workload significantly decreases because it is not involved in the actual data transfer.

      Though not standardised, Ultra-ATA is accepted by the industry as the standard that
      boosts ATA-2’s performance with double transition clocking and includes CRC error
      detection to maintain data integrity.

                                    ATA     ATA-2   Ultra-ATA/33  Ultra-ATA/66  Ultra-ATA/100

       Max. data bus transfer      8.3     16.6    33            66            100
       speed [MBytes/s]

       Max. data bus width         16-bit  16-bit  16-bit        16-bit        16-bit

       Max. device support         2       2       2             2             2
      Table 1 – An overview of different ATA specifications. Note that data rates are maximum rates only
      available for short data transfers.

      The standard ATA cable consists of 40 wires, each with a corresponding pin, and its
      official maximum length is ~ 0.45 metres. It was originally designed for transfer
      speeds less than 5 MBytes/s. When the Ultra-DMA interface was introduced the
      original cable caused problems. The solution was a cable with 80 wires. It is still pin-
      compatible with the original 40-pin ATA interface, since all of the additional 40 wires
      are connected to ground and used as shielding.


      The Small Computer System Interface (SCSI) is the second most used PC interface
      today, accepted as an ANSI standard in 1986. It is preferred over ATA in high-end
      servers. While ATA is primarily a disk interface, it may be more correct to consider
      SCSI a system-level bus, given that each SCSI device has its own intelligent
      controller. SCSI components are generally more expensive than ATA but they are
      considered faster and they load the CPU less [IBM01].

      While the performance of modern ATA transfers correlates with the speed of the
      DMA, SCSI data throughput is influenced by two factors:

      -   Bus width is the number of bits transferred in parallel on the bus. The SCSI term
          Wide refers to a wider data bus, typically 16 bits.

      -   Bus speed refers to the speed of the bus. Fast, Ultra and Ultra2 are typical SCSI
          terms referring to specific data rates.
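      The two factors combine multiplicatively. A minimal sketch, where bus speed is
      taken as millions of transfer cycles per second:

```python
def scsi_bus_bandwidth_mbs(bus_width_bits, bus_speed_mhz):
    """Peak bus bandwidth: bytes per transfer cycle times
    millions of transfer cycles per second."""
    return (bus_width_bits / 8) * bus_speed_mhz

# Ultra2 (8-bit bus at 40 MHz)       -> 40 MBytes/s
# Wide Ultra2 (16-bit bus at 40 MHz) -> 80 MBytes/s
```

      These figures match the maximum rates listed for the corresponding SCSI
      generations in Table 2.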

      Besides the two SCSI characteristics controlling the data throughput there is
      another important characteristic: signalling. There are several standards, but the most
      common are Single Ended (SE), High Voltage Differential (HVD) and Low Voltage
      Differential (LVD). SE is the signalling used in the original standard; it is simple and
      cheap but has some flaws. HVD tried to solve the problems associated with SE by
      using two wires for each signal, but it is expensive, consumes a lot of power and was
      never widely used. When LVD was defined in the SCSI Parallel Interface standard 2,
      it became feasible to increase bus speed and cable length. LVD is today the best
      choice for most configurations and it is the exclusive signalling method for all SCSI
      modes faster than Ultra2 (unless HVD is used).

      All SCSI devices must have a unique id, typically set using jumpers, which is used for
      identifying and prioritising the SCSI devices. The SCSI configuration also requires
      proper bus termination and since there are almost as many signalling standards as
      SCSI standards there are several types of terminators.

                                      SCSI    Fast    Ultra     Wide      Ultra2     Wide      Ultra3
                                              SCSI    SCSI      Ultra     SCSI       Ultra2    SCSI
                                                                SCSI                 SCSI

          Max. data bus transfer 5            10      20        40        40         80        160
          speed [MBytes/s]

          Max. data bus width         8-bit   8-bit   8-bit     16-bit    8-bit      16-bit    16-bit

          Max. cable length [m]       6       3       1.5 - 3 1.5 - 3     12         12        12

          Max. device support         8       8       8-4       8-4       8          16        16

      Table 2 – A comparison of different SCSI generations. Note that data rates are maximum rates only available
      for short data transfers.

      SCSI is often described as intelligent compared to ATA. There are several reasons for this:

      -     Command Queuing and Re-Ordering allows multiple concurrent requests to
            devices on the SCSI bus, while ATA only allows one request at a time.

      -     Negotiation and Domain Validation is a feature that automatically interrogates
            each SCSI device for its supported bus speed. If the supported speed causes
            errors during a validation test, the speed is lowered to increase the data bus
            reliability.
      -     Quick Arbitration and Select allows a SCSI device to quickly access the bus after
            another device has finished sending data. A built-in regulation prevents
            high-priority devices from dominating the bus.

      -     Packetisation is an effort to improve SCSI bus performance by reducing protocol
            overhead.

      Most SCSI implementations also support CRC and bus parity to increase data
      integrity.
4.3   ISCSI

      iSCSI, which is also known as Net SCSI, provides the generic SCSI layer with a
      reliable network transport. It is a mapping of SCSI commands, data and status over
      TCP/IP networks and enables universal access to storage devices and storage area
      networks. TCP ensures data reliability and manages congestion, and IP networks
      provide security, scalability, interoperability and cost efficiency.

      Figure 7 – A layered model of iSCSI: SCSI on top of iSCSI, upper functional layers,
      TCP, and lower functional layers (e.g. IPsec).

      It is described in an Internet draft and its standardisation is
      managed by the IP Storage Working Group of the IETF. It is still under development
      but there are a few working implementations, e.g. the Linux-iSCSI Project.


      Serial Storage Architecture (SSA) is an advanced serial interface that provides higher
      data throughput and scalability compared to conventional SCSI [Shim97]. Its intended
      use, according to IBM, is high-end server systems that need cost-effective and high-
      performance SCSI alternatives.

      Serial Storage Architecture nodes (e.g. devices, subsystems and local host
      processors) are able to aggregate several links’ bandwidth. Common configurations
      use one, two or four pairs of links. A pair consists of one in-link and one out-link. Each
      link supports 20 MBytes/s bandwidth, thus the aggregated link bandwidth is 40, 80 or
      even 160 MBytes/s depending on the number of pairs utilised by the SSA
      configuration [IBM01]. SSA supports several flexible interconnection topologies,
      which includes string, loop and switched architectures. If the media interconnecting
      the nodes is copper, the maximum distance between two nodes is 25 metres but if
      fibre optics is used, the distance is extendable up to 10 km. An SSA loop enables
      simultaneous communication between multiple nodes, which results in higher
      throughput. SSA supports up to 128 devices and implements a fairness algorithm
      intended to provide fair bandwidth sharing among the devices connected to a loop.
      Hot swapping, auto configuration of new devices and support for multiple
      communication paths are features making the systems utilising SSA configurations
      more available.
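      The aggregated link bandwidth mentioned above follows directly from the number of
      link pairs. A small sketch, assuming the 20 MBytes/s per-link figure from the text:

```python
def ssa_aggregate_bandwidth_mbs(link_pairs, per_link_mbs=20):
    """Aggregated SSA bandwidth: each pair contributes one in-link
    and one out-link, each running at per_link_mbs."""
    return 2 * link_pairs * per_link_mbs

# 1, 2 or 4 pairs -> 40, 80 or 160 MBytes/s respectively.
```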


      Fibre Channel (FC) is a rather new open industry-standard interface that has
      attained a strong position in Storage Area Networks (SAN). FC provides the ability to
      connect nodes in several flexible topologies and makes the storage local to all
      servers in the SAN. FC supports topologies such as fabrics (analogous to switched
      Ethernet networks) and arbitrated loops (FC-AL). FC promises high performance,
      reliability and scalability.

      FC-AL is essentially a loop where all devices share the same transmission medium. A
      single loop provides a bandwidth of 100 MBytes/s, but most FC nodes are able to
      utilise at least two loops. The use of two loops not only enhances the data transfer
      rate but also serves as a redundant communication path, which increases reliability.
      An FC-AL can have up to 126 addressable nodes, but even at a low node count the
      shared medium might become a bottleneck.

          IP Storage Working Group home page:

          The Linux-iSCSI Project:

A fabric switch is able to interconnect several loops. A device in a “public loop” gets a
unique address and is allowed to access any other device on the same public loop.
The switched fabric address domain contains 16 M addresses and provides services
such as multicast and broadcast.

FC cabling is either copper wiring or optical fibre. It is designed to provide high
reliability and it supports redundant medium as well as hot swap.

Fibre Channel supports several protocols and thanks to its multi-layered architecture,
it easily adopts new protocols. SCSI-FCP is a serial SCSI protocol using frame
transfers instead of block transfers.


Disk-based storage is the most popular choice for building storage configurations. This is
primarily due to the relatively low price/performance ratio for disk systems in
comparison with other forms of storage, such as magnetic tape drives and solid state
memory devices (e.g. Flash disks). A disk array is basically a set of disks configured
to act as one virtual disk. The primary reason to implement a disk array is to
overcome the drawbacks of single disk storage: reliability and performance.

        The array is often transparent to the system using it, which means that the system
        does not need to know anything about the array’s architecture. It just uses it as a
        regular block device. The disk array systems are often, but not always, encapsulated
        from the public environment and treated as one disk communicating via common I/O
        interfaces such as SCSI and ATA.

There are three basic characteristics when evaluating disk arrays: performance,
reliability and cost [Chen93]. In every configuration there must be at least one
compromise, or else the result is a disk array with modest availability and
performance at an average cost, i.e. the same characteristics as a conventional
hard disk.

Figure 8 – The relation between the disk array's basic characteristics:
performance, reliability and cost.

While RAID organisations (except striping, see section 5.2) protect and increase data
reliability, the storage systems using the array are often unreliable. Many storage
system vendors tend to exaggerate RAID's significance for a storage system's
availability. A system is not highly available just because the data stored in the
system is managed by some RAID organisation. What if the RAID controller fails or
the power is lost?


5.1.1   Striping

Data striping is used to enhance I/O performance and was historically the best
solution to the problem described as “The Pending I/O Crisis”. The problem, in short, is
that an I/O system's performance is limited by the performance of its networks and
magnetic disks. The performance of CPUs and memories is improving extremely fast,
much faster than the performance of I/O units. So far, the modest gain in storage device
performance has been compensated for with striped disk arrays.

Striping means that a stripe of data is divided into a number of strips, which are
written to consecutive disks. Since several disks' I/O capacities are aggregated, the
performance of the array is greatly improved compared to a single disk. Since there is
no need to calculate and store any redundant information, all I/O and storage capacity
is dedicated to user data. Hence striping is fast and relatively cheap, but it does
not provide increased reliability; if anything, the opposite.
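The strip-to-disk mapping described above can be sketched in a few lines; the four-disk layout and the A, B, C... lettering follow figure 9, while the function name is illustrative only:

```python
# Sketch of striping: strip i of the data is written to disk (i mod n),
# and consecutive stripes fill consecutive rows on each disk.
def strip_location(strip_index, num_disks):
    """Return (disk, row) for a logical strip in a striped array."""
    return strip_index % num_disks, strip_index // num_disks

# Distribute strips A..P over a four-disk array, as in figure 9.
strips = [chr(ord("A") + i) for i in range(16)]
layout = [["" for _ in range(4)] for _ in range(4)]  # layout[row][disk]
for i, strip in enumerate(strips):
    disk, row = strip_location(i, 4)
    layout[row][disk] = strip

print(layout[0])  # ['A', 'B', 'C', 'D']
```

Reads and writes of consecutive strips thus land on different disks and can proceed in parallel, which is where the performance gain comes from.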

5.1.2   Disk Array Reliability

Adding more disks to a disk array to enhance the I/O performance significantly
increases the probability of data loss, and therefore the array's data reliability7 is
decreased [Schulze89]. If the disks used in an array have a mean time to failure of
MTTFdisk, and the failures are assumed to be independent and occur at a constant
rate, the corresponding value for the array is:

MTTFarray = MTTFdisk / Number_of_Disks_in_the_Array

Equation 2 – Mean Time to Failure for a disk array without any redundancy.

Conventional disks' service lifetime is approximately 5 years, or 60 months. In a disk
array configuration with 10 disks, the array's service lifetime is drastically decreased to
just 6 months when using the above relationship.
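Equation 2 and the service-lifetime example are easy to check numerically (a minimal sketch):

```python
# Equation 2: the MTTF of a non-redundant array of independent disks
# with a constant failure rate is the single-disk MTTF divided by the disk count.
def mttf_array(mttf_disk_months, num_disks):
    return mttf_disk_months / num_disks

# Ten disks with a 60-month (5-year) service lifetime each:
print(mttf_array(60, 10))  # 6.0 months, as stated above
```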

To increase a disk array's reliability it must be made redundant, i.e. some of the
disks' capacity and bandwidth must be dedicated to storing redundant data. In case of a
failure, the lost data can be reconstructed using the redundant information; an array
using this technique is called a Redundant Array of Independent Disks (RAID). The RAID
approach does not intend to increase each individual component's reliability; its
purpose is to make the array itself more tolerant to failures.

5.1.3   Redundancy

        There are several different redundancy approaches to counteract the decreased
        reliability caused by an increasing number of disks. The most common usable
        implementations are:

        -       Mirroring or shadowing is the traditional approach and the simplest to implement
                but also the least storage effective, regarding MB/$. When data is written to the
                array it is written to two separate disks, hence twice the amount of disk space is
                needed. If one of the two disks fails the other one is used alone, if supported by
                the controller or the software. Some implementations only secure the data and do
                not provide any increased availability.

        -       Codes are parity information calculated from the data stored on the disks using
                special algorithms. They are often used for both error detection and correction
                despite the fact that error detection is a feature already supported by most
                conventional disks, e.g. SMART. Codes are rather unusual due to calculation

7 In this section the data reliability is equal to the data availability.

       overhead, complexity and because they do not significantly decrease the number
       of dedicated disks needed to store the redundant information compared to
       simpler parity schemes.
-      Parity is a redundancy code capable of correcting any single, self-identifying
       failure. Parity is calculated using bitwise exclusive-OR: Parity = Disk 1 ⊕ Disk 2
       ⊕ Disk 3. If Disk 1 fails, exclusive-OR's nature makes it possible to regenerate it
       from the available information: Disk 1 = Parity ⊕ Disk 2 ⊕ Disk 3.
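The regeneration property of exclusive-OR can be demonstrated on arbitrary byte strings (a minimal sketch; the disk contents are made up):

```python
def xor_bytes(*blocks):
    """Bitwise exclusive-OR of equally sized byte blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# The footnote example: 110 XOR 100 = 010
assert xor_bytes(b"\x06", b"\x04") == b"\x02"

# Three data disks and their parity:
disk1, disk2, disk3 = b"\x0b\x41", b"\x22\x43", b"\x31\x45"
parity = xor_bytes(disk1, disk2, disk3)          # Parity = Disk1 xor Disk2 xor Disk3

# If Disk 1 is lost, it is regenerated from the survivors:
assert xor_bytes(parity, disk2, disk3) == disk1  # Disk1 = Parity xor Disk2 xor Disk3
```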

If one disk fails the RAID array is still functional but runs in a degraded mode;
depending on which RAID level is used, the performance may be decreased.
When the failed disk is replaced, regeneration starts, a process that rebuilds
the replaced disk to the state prior to the failure. During regeneration the RAID array is
non-redundant, i.e. if another critical disk fails the whole RAID array also fails. Under
special circumstances mirrored RAIDs can survive multiple disk failures, and RAID
Level 6 is designed to sustain two simultaneous disk failures. How long the
regeneration takes depends on the complexity of the calculations needed to derive
the lost information.

5.1.4   RAID Array Reliability

If a RAID array is broken into nG reliability groups, each with G data disks and 1 disk
with redundant information, the RAID array's reliability could be described as:

MTTFRAID = (MTTFdisk)² / (nG ⋅ G ⋅ (G + 1) ⋅ MTTRdisk)

Equation 3 – The equation provides a somewhat optimistic value of the Mean Time to Failure for a RAID
array, since it does not pay attention to any other hardware.

The equation assumes that the disk failure rate is constant and that MTTRdisk is the
individual disks' mean time to repair. A low MTTR is obtained if a spare disk is
used and the system using the RAID is configured to automatically fence the failed
disk and begin regeneration. The above expression ignores all other hardware and
tends to exaggerate the RAID's MTTF value. For instance, a single RAID Level 5
group with 9 data disks and 1 redundancy disk (each disk with MTTF = 60 months =
43830 hours and MTTR = 2 hours) would, according to the above expression, have a
MTTFRAID of 10672605 hours = 14610 months ≈ 1218 years!
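Equation 3 and the figures in the example can be reproduced in a few lines (a sketch; 730.5 hours per month follows from 60 months = 43830 hours):

```python
# MTTF of a RAID array per Equation 3: nG reliability groups, each with
# G data disks and one redundancy disk; disk failure rate assumed constant.
def mttf_raid(mttf_disk, mttr_disk, n_groups, g_data_disks):
    return mttf_disk ** 2 / (n_groups * g_data_disks * (g_data_disks + 1) * mttr_disk)

# One RAID Level 5 group, 9 data disks + 1 redundancy disk,
# MTTF = 60 months = 43830 hours per disk, MTTR = 2 hours:
hours = mttf_raid(43830, 2, 1, 9)
print(hours)               # 10672605.0 hours
print(hours / 730.5)       # 14610.0 months (730.5 hours per month)
print(hours / 730.5 / 12)  # about 1218 years
```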


In the beginning of RAID's evolution, researchers at the University of California,
Berkeley defined five different RAID organisations, or levels as they are called, level 1 to 5

Exclusive-OR, XOR, symbolised by ⊕, is defined as: 0 ⊕ 0 = 0, 0 ⊕ 1 = 1 ⊕ 0 = 1 and 1 ⊕ 1 = 0.
Example of bitwise XOR: 110 ⊕ 100 = 010.

[Patterson88]. Since then, RAID Levels 0 and 6 have generally been accepted, but
strictly speaking RAID Level 0 is not a redundant array. The levels' numbers are not
to be used as some kind of performance metric; they are just names for particular
RAID organisations.

In the following RAID organisation figures, all white cylinders represent user data and
grey cylinders are used to store redundant data. All organisations are arranged to
provide four disks of user storage, and each stack of cylinders represents a single
disk. The letters A, B, C... represent the order in which the strips are distributed
when written to the array.

5.2.1   Level 0 – Striped and Non-Redundant Disks

RAID Level 0 is a non-redundant disk array with the lowest cost and the best read
performance of any RAID organisation [Chen93]. Data striping is used to enhance I/O
performance, and since there is no need to calculate and store any redundant
information, all I/O and storage capacity is dedicated to user data (figure 9). Due to
the lack of redundancy, a single disk failure results in lost data, and therefore it is often
regarded as a “non-true” RAID.

            A          B        C          D
            E          F        G          H
            I          J        K          L
            M          N        O          P

        Figure 9 – A RAID Level 0 organisation is non-redundant, i.e. any single disk failure results in data-loss

        -    I/O performance is greatly improved by data striping
        -    No parity calculation overhead is involved
        -    Simple design and thus easy to implement
        -    All storage capacity is dedicated to user data

        -    Non-redundant, a single disk failure results in data-loss

Use this organisation when performance, price and capacity are more important than
reliability.

5.2.2   Level 1 – Mirrored Disks

A RAID Level 1, usually referred to as mirroring or shadowing, uses twice as many
disks as a striped, non-redundant RAID organisation to improve data reliability
(figure 10). The RAID Level 1 organisation's write performance is slower than for a
single hard disk device due to disk synchronisation latency, but its reads are faster
since the information can be retrieved from the disk that momentarily presents the
shortest service time, i.e. seek time plus rotational latency. If a disk fails its copy is
used instead, and when the failed disk is replaced it is automatically regenerated and
the RAID array is again redundant.

            A        A           E         E            I         I           M          M
            B        B           F         F            J         J           N          N
            C        C           G         G            K         K           O          O
            D        D           H         H            L         L           P          P
        Figure 10 – A RAID Level 1 organisation

        -    Extremely high reliability, under certain circumstances RAID Level 1 can sustain
             multiple simultaneous drive failures
        -    Simplest RAID organisation

-    Expensive, since twice the number of user storage disks is needed

        Use this organisation when reliability is top priority.

5.2.3   Level 0 and Level 1 Combinations – Striped and Mirrored or vice versa

        The two most basic disk array techniques (striping and mirroring) are combined to
        enhance their respective strengths, high I/O performance and high reliability. There
        are two possible combinations: RAID Level 1+0 (sometimes called RAID Level 10)
        and RAID Level 0+1.

        RAID 1+0 is implemented as a striped array whose segments are mirrored arrays
        (figure 11) and RAID 0+1 is implemented as a mirrored array whose segments are
        striped arrays (figure 12). Both combinations increase performance as well as
        reliability; RAID 1+0 has the same fault tolerance as RAID Level 1 while a RAID 0+1
organisation has the same fault tolerance as RAID Level 5. A drive failure in a RAID
0+1 causes the whole array to degrade to the same level of reliability as a RAID
Level 0 array.

            A        A           B           B          C         C           D          D
            E        E           F           F          G         G           H          H
            I        I           J           J          K         K           L          L
            M        M           N           N          O         O           P          P

        Figure 11 – A RAID Level 1+0 organisation is a striped array whose segments are mirrored arrays.

            A        B         C         D            A         B         C          D
            E        F         G         H            E         F         G          H
            I        J         K         L            I         J         K          L
            M        N         O         P            M         N         O          P

        Figure 12 – A RAID Level 0+1 organisation is a mirrored array whose segments are striped arrays.

-    High I/O performance is achieved by striping; especially reads, but also writes, are
     considerably faster compared to single disk storage
        -    High reliability, under certain circumstances RAID Level 1+0 can sustain multiple
             simultaneous drive failures
        -    Low overhead, no parity calculations needed

        -    Expensive, these organisations require twice as many disks as for user data
        -    Limited scalability

5.2.4   Level 2 – Hamming Code for Error Correction

Hamming code is an error detection and correction technology originally used by
computer designers to increase DRAM memory reliability. RAID Level 2 utilises the
Hamming code to calculate redundant information for the user data stored on the data
disks. The information stored on the dedicated error code disks is used for error
detection, error correction and redundancy (figure 13). The number of disks used for
storing redundant information is proportional to log2 of the total number of disks in
the system. Hence the storage efficiency increases as the number of disks increases,
and compared to mirroring it is more storage efficient.

            A        B         C         D         Hamming     Hamming      Hamming

            E        F         G         H         Hamming     Hamming      Hamming

            I        J         K         L         Hamming     Hamming      Hamming

            M        N         O         P         Hamming     Hamming      Hamming

        Figure 13 - A RAID Level 2 organisation use Hamming code to calculate parity.

-    The array sustains a disk failure
-    Fewer disks are needed to support redundancy compared to mirroring

        -    Overhead, the Hamming code is complex compared with for instance parity
        -    Commercial implementations are rare
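The log2 relationship mentioned above can be illustrated by computing the smallest number of check disks p that can protect d data disks, using the standard Hamming bound 2^p ≥ d + p + 1 (a sketch; the function name is illustrative only):

```python
def hamming_check_disks(data_disks):
    """Smallest p with 2**p >= data_disks + p + 1 (the Hamming bound)."""
    p = 1
    while 2 ** p < data_disks + p + 1:
        p += 1
    return p

print(hamming_check_disks(4))   # 3 check disks for 4 data disks, as in figure 13
print(hamming_check_disks(11))  # 4: storage efficiency improves as the array grows
```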

5.2.5   Level 3 – Bit-Interleaved Parity

On writes, a RAID Level 3 calculates a parity code and writes the information to an
extra disk – a dedicated parity disk (figure 14). During reads the parity information is
read and checked. RAID arrays utilising parity, i.e. RAID Levels 3 to 6, are much
cheaper than the other organisations discussed. They use fewer hard disks to provide
redundancy and utilise conventional disk controllers' features to detect disk errors.

RAID Level 3 is a bit-interleaved organisation and is primarily used in systems that
require high data bandwidth but not as high I/O request rates. In a bit-interleaved
array, read and write requests access all disks, the data disks as well as the parity
disk. Hence the array is only capable of serving one request at a time. Since a
write request accesses all disks, the information needed to calculate the parity is
already known and thus re-reads are unnecessary. When the parity has been
calculated it is written to the dedicated parity disk, a write limited by that
single disk's I/O performance.

            A        B         C          D         A-D parity

            E        F         G          H         E-H parity

            I        J         K          L          I-L parity

            M        N         O          P         M-P parity

Figure 14 - A RAID Level 3 organisation uses bit-interleaved parity. Though similar in organisation to RAID
Level 4, it is important to differentiate between a bit- and a block-oriented disk array.

-    High data bandwidth
-    Cheap, when parity is used for redundancy fewer disks are needed compared to
     mirroring
-    Easy to implement compared to higher RAID Levels since a dedicated parity disk
     is used

-    Low I/O rate, and if the average amount of data requested is small the disks spend
     most of their time seeking

        Used with applications requiring very high throughput and where the average amount
        of data requested is large, i.e. high bandwidth but low request rates, e.g. video
        production and multimedia streaming.

5.2.6   Level 4 – Block-Interleaved Parity

A block-interleaved parity disk array is organised similarly to a bit-interleaved parity
array, but instead of interleaving the data bit-wise it is interleaved in blocks of a
predetermined size. The size of the blocks is called the striping unit. If the size of the
data to read is less than a stripe unit, only one disk is accessed; hence multiple read
requests can be serviced in parallel if they map to different disks. When information is
recorded it may not affect all disks, and since all data in a stripe (a group of
corresponding strips, e.g. strips A, B, C and D in figure 15) is needed to calculate
parity, some strips may be missing. This parity calculation problem is solved by
reading the missing strips and then calculating the parity, a rather performance-
decreasing operation. Because the parity disk is accessed on all write requests it can
easily become a bottleneck, as for RAID Level 2, and thus decrease the overall array
performance, especially when the write load is high.
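The small-write behaviour described above can be sketched with exclusive-OR. Instead of reading the whole stripe, a block-interleaved array may compute the new parity from the old data and the old parity (a read-modify-write); the byte values below are arbitrary:

```python
def xor(a, b):
    """Bitwise exclusive-OR of two equally sized byte blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

# A stripe of three strips and its parity:
strips = [b"\x01", b"\x02", b"\x04"]
parity = b"\x00"
for s in strips:
    parity = xor(parity, s)           # parity is now 0x07

# Small write replacing strip 1: new parity = old parity xor old data xor new data,
# so only the old strip and the old parity must be read back, not the whole stripe.
new_data = b"\x0f"
new_parity = xor(xor(parity, strips[1]), new_data)

# Cross-check against parity recomputed from the full stripe:
full_parity = xor(xor(strips[0], new_data), strips[2])
assert new_parity == full_parity      # both are 0x0a
```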

            A           B            C            D            A-D parity

            E           F            G            H            E-H parity

            I           J            K            L            I-L parity

            M           N            O            P            M-P parity

        Figure 15 - A RAID Level 4 organisation

        -     High read performance, especially for many small reads requesting information
              less than a stripe unit
        -     Cheap, since only one disk is dedicated to store redundant information

        -     Low write performance, especially for many small writes
-     The parity disk can easily become a bottleneck since all write requests
      access the parity disk

        This RAID organisation is seldom used because of the parity disk bottleneck.

5.2.7   Level 5 – Block-Interleaved Distributed Parity

The block-interleaved distributed parity disk array organisation distributes the parity
information over all of the disks, and hence the parity disk bottleneck is eliminated
(figure 16). Another consequence of parity distribution is that the user data is
distributed over all disks, and therefore all disks are able to participate in servicing
read operations. The performance also depends on how the parity is distributed over
the disks. A common distribution, often considered to be the best, is called the left-
symmetric parity distribution.

            A           B            C            D         A-D parity

            F           G            H         E-H parity      E
            K           L         I-L parity      I            J
            P        M-P parity     M             N            O
        Q-T parity      Q           R             S            T

        Figure 16 - A RAID Level 5 organisation with left-symmetric parity distribution that is considered to be the best
        parity distribution scheme available.

-     The best small read, large read and large write performance of any redundant
      RAID organisation

        -     Rather low performance for small writes
        -     Complex controller

Considered to be the most versatile RAID organisation, it is used for a number of
different applications: file servers, web servers and databases.
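The left-symmetric placement can be sketched as follows: assuming that for stripe s on n disks the parity lands on disk (n − 1 − s) mod n and the data strips follow on the subsequent disks, the rotation of figure 16 is reproduced (the function name is illustrative only):

```python
def left_symmetric_stripe(stripe, num_disks):
    """One stripe row: data strip labels plus 'P' on the parity disk."""
    parity_disk = (num_disks - 1 - stripe) % num_disks
    row = [""] * num_disks
    for i in range(num_disks - 1):      # num_disks - 1 data strips per stripe
        disk = (parity_disk + 1 + i) % num_disks
        row[disk] = "D%d" % (stripe * (num_disks - 1) + i)
    row[parity_disk] = "P"
    return row

# Five disks, five stripes: the parity rotates one disk to the left per stripe,
# as in figure 16.
for s in range(5):
    print(left_symmetric_stripe(s, 5))
```

Successive data strips thus fall on successive disks, which is why large sequential reads can engage all disks at once.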

5.2.8   Level 6 – P+Q Redundancy

        RAID Level 6 is basically an extension of RAID Level 5 which allows for additional
        fault tolerance by using a second independent distributed parity scheme or two-
        dimensional parity as it is called (figure 17). RAID Level 6 provides for an extremely
        high data fault tolerance and three concurrent disk failures are required before any
        data is lost. Every write request requires two parity calculations and parity updates.
        Therefore the write performance is extremely low.

            A           B             C            D        A-D parity   A-D parity

            F           G             H        E-H parity   E-H parity      E
            K           L         I-L parity   I-L parity      I            J
            P        M-P parity   M-P parity       M           N            O
        Q-T parity   Q-T parity       Q            R           S            T
        Figure 17 - A RAID Level 6 using a two-dimensional parity, which allows multiple disk failures.

        -     An extremely reliable RAID organisation

        -     Very low write performance
        -     Controller overhead to compute parity addresses is extremely high
        -     Generally complex

Considered to be one of the most reliable RAID organisations available, it is
primarily used for mission critical applications.

5.2.9   RAID Level Comparison

                   RAID 0   RAID 1   RAID 2       RAID 3   RAID 4   RAID 5   RAID 6       RAID 10

Redundant disks    0        n        ∝ log2 k     1        1        1        2            n
needed for n                         (k = total
user disks                           disks)

Redundancy         None     Mirror   ECC          Parity   Parity   Parity   Dual parity  Mirror

Complexity         Medium   Low      High         Medium   Medium   High     Very high    Medium

Reliability        Low      High     High         High     High     High     Very high    High

Table 3 – Comparison of different RAID levels.

It is difficult to compare different RAID Levels and state which level is the best,
since they all have special characteristics suitable for different applications. RAID
Level 5 is the most versatile organisation, while RAID Level 6 is the array providing
the highest reliability since it sustains two concurrent disk device failures.


The unit that takes care of data distribution, parity calculations and regeneration is
called a RAID controller. Controllers are available as hardware and software solutions,
but both are based on software; the big difference between hardware and software
controllers is where the code is executed. A hardware RAID controller is superior to
software RAID in virtually every way, except cost.

      Hardware RAID controllers are essentially small computers dedicated to control the
      disk array. They are usually grouped as:
      -   Controller card or bus-based RAID: The conventional hardware RAID controller,
          which is installed into the server’s PCI slot. The array drives are usually
          connected to it via ATA or SCSI interface. Software running on the server is used
          to operate and maintain the disk array.
      -   External RAID: The controller is as the name implies completely removed from
          the system using it. Usually it is installed in a separate box together with the disk
          array. It is connected to the server using SCSI or Fibre Channel. Ethernet or
          RS232 are common interfaces for operation and maintenance.

An alternative to dedicated hardware RAID controllers is to let the host system
provide the RAID functionality, that is, take care of I/O commands, parity calculations
and distribution algorithms. Software RAID controllers are cheap compared to
hardware controllers, but they require high-end systems to work properly.


        The hard disk device’s platters are the medium where the information actually is
        stored; zeroes and ones are encoded as magnetic fields. A file system provides a
        logical structure of how information is stored and routines to control the access to the
information recorded on a block device. File systems are in most cases hardware
independent, and different operating systems are often able to use more than one file
        system. The emphasis in this section is on different file systems supported by the
        Linux operating system.


Most Linux file systems make use of the same concepts as the UNIX file systems;
files are represented by inodes (see section 6.1.3) and directories are basically tables
with entries for each file in that particular directory. Files on a block device are
accessed with a set of I/O commands, which are defined in the device drivers. Some
specialised applications do not use a file system to access physical disks or
partitions; they use raw access. A database like Oracle uses low-level access from
the application itself, not managed by the kernel.

Though the concepts are general, this subsection tends to emphasise the Second
Extended File System (EXT2), a Linux file system currently installed with virtually all
Linux distributions.

6.1.1   Format and Partition

Before a blank disk device is usable for the first time it must be low-level formatted.
This process outlines the positions of the tracks and sectors on the hard disk and
writes the control structures that define where the tracks and sectors are. Low-level
formatting is not needed for modern disks after they have left the vendor, though older
disks may need it occasionally because their platters are more affected by heat.

Before a disk is usable by the operating system it must be partitioned, which means
dividing a single hard disk into one or more logical drives. Disks must be divided into
partitions even if there is only one partition. A partition is treated as an independent
disk, but it is really a set of contiguous sectors on the physical disk device. Typically a
disk device under Linux is divided into several partitions, each capable of holding its
own kind of file system. A partition table is an index that maps partitions to their
physical location on the hard disk. There is an upper limit to how large certain
partitions can be, depending on file system and hardware; EXT2 supports
approximately 4 TB.

After a disk has been low-level formatted and partitioned, it contains sectors and
logical drives. Still, it is unusable to most operating systems (unless raw access is
used) because they need a structure in which they can store files. High-level
formatting is the process of writing the file system specific structures. While a low-
level format totally cleans a disk device, a high-level format only removes the paths to
the information stored on the disk.

6.1.2   Data Blocks

        The smallest manageable units on a disk device are the sectors. Most file systems,
        including EXT2, do not use individual sectors to store information. Instead they use
        the concept of data blocks to store the data held in files. A data block can be
        described as a contiguous group of sectors on the disk. The data blocks’ sizes are
        specified during the file system’s creation and they are all of the same length within a
        file system, i.e. they contain the same number of sectors. Data blocks are sometimes
        referred to as clusters or allocation units.

        Every file’s size is rounded up to an integral number of data blocks. If the block size is
        1024 bytes, a 1025-byte file requires two data blocks of 1024 bytes each, and the file
        system thus wastes 1023 bytes. On average half a data block is wasted per file. It is
        possible to devise an algorithm that optimises the data block usage, but almost
        every modern operating system accepts some wasted disk space in order to reduce
        the processor’s workload.
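        The rounding above can be sketched in a few lines; this is an illustrative model, not
        EXT2 code:

```python
import math

def blocks_needed(file_size: int, block_size: int = 1024) -> int:
    """Number of data blocks a file of file_size bytes occupies."""
    return math.ceil(file_size / block_size)

def wasted_bytes(file_size: int, block_size: int = 1024) -> int:
    """Slack space left in the last, partially filled data block."""
    return blocks_needed(file_size, block_size) * block_size - file_size

# The 1025-byte file from the text: two blocks, 1023 bytes wasted.
print(blocks_needed(1025), wasted_bytes(1025))  # 2 1023
```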

6.1.3   Inodes

        Every file in EXT2 is represented by a unique structure called an inode. The inodes
        are the basic building blocks of virtually every UNIX-like file system. An inode
        specifies which data blocks a file occupies, as well as its access rights, modification
        dates and file type (figure 18). Each inode has a single unique number and the
        inodes are stored in special inode tables.

        Directories in EXT2 are actually files themselves described by inodes containing
        pointers to all inodes in that particular directory.

        Device files in EXT2 (for the first ATA drive in Red Hat Linux it is typically /dev/hda)
        are not “real files”; they are device handles that provide applications with access to
        Linux devices.

        [Diagram: inode information followed by direct block pointers, an indirect block
        pointer, a double indirect block pointer and a triple indirect block pointer, each path
        leading to the file’s data blocks.]
        Figure 18 – An EXT2FS inode and data blocks.
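        The pointer scheme in figure 18 bounds the maximum file size. A hypothetical
        calculation, assuming EXT2’s twelve direct pointers and 4-byte block pointers:

```python
def max_file_blocks(block_size: int = 1024, ptr_size: int = 4,
                    direct: int = 12) -> int:
    """Data blocks addressable through the direct, single, double and
    triple indirect pointers of an EXT2-style inode."""
    ptrs = block_size // ptr_size   # pointers held by one indirect block
    return direct + ptrs + ptrs ** 2 + ptrs ** 3

# With 1024-byte blocks, one indirect block holds 256 pointers, so a
# single inode can address 12 + 256 + 256**2 + 256**3 blocks.
print(max_file_blocks())  # 16843020 blocks, roughly 16 GB of data
```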

        The EXT2 file system divides the partition, the logical volume it occupies, into a
        series of blocks. The data blocks themselves are aggregated into manageable groups
        called block groups. Each block group keeps track of which of its inodes and blocks
        are in use and which are unallocated (figure 19). Every block group also contains a
        redundant copy of the super block, which is used as a backup in case of file system
        corruption.

            Super Block | Group Descriptor | Block Bitmap | Inode Bitmap | Inode Table | Data Blocks

        Figure 19 – An EXT2 file system Block Group

        Block group number 0 holds the EXT2 file system’s super block. It contains basic
        information about the file system and provides the file system manager with basic
        functionality for handling and maintaining the file system. The EXT2 super block’s
        magic number is 0xEF53, and that number identifies the partition as an EXT2 file
        system. The Linux kernel also uses the super block to indicate the file system’s
        current status:

        -      “Not clean” when mounted read/write. If a reboot occurs while the file system is
               dirty, a file system check is forced the next time Linux boots.

        -      “Clean” when mounted read only, unmounted or when successfully checked.

        -      “Erroneous” when a file system checker finds file system inconsistencies.
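        A sketch of how the magic number and state can be read from a raw image; the field
        offsets (super block at byte 1024, magic at offset 56, state at offset 58) are the classic
        EXT2 layout, and the example builds a fake image rather than touching a real device:

```python
import struct

EXT2_MAGIC = 0xEF53
# s_state values used by the kernel and the file system checker
EXT2_VALID_FS = 1   # cleanly unmounted ("clean")
EXT2_ERROR_FS = 2   # errors detected ("erroneous")

def read_superblock_state(image: bytes):
    """Return (is_ext2, state) from a raw partition image.

    The super block starts 1024 bytes into the partition; the magic
    number is a little-endian 16-bit word at offset 56 within it,
    followed by the state word at offset 58."""
    magic, state = struct.unpack_from("<HH", image, 1024 + 56)
    return magic == EXT2_MAGIC, state

# Minimal fake image: 2 KB of zeroes with only these fields filled in.
img = bytearray(2048)
struct.pack_into("<HH", img, 1024 + 56, EXT2_MAGIC, EXT2_VALID_FS)
print(read_superblock_state(bytes(img)))  # (True, 1)
```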

6.1.4   Device Drivers

        From a file system’s point of view a block device, e.g. a hard disk device, is just a
        series of blocks that can be written and read. Where the actual blocks are stored on
        the physical media does not concern the file system, it is a task for the device drivers.

        A major part of the Linux kernel consists of device drivers, which control the
        interaction between the operating system and the hardware devices they are
        associated with. Linux file systems (like most other file systems) do not know
        anything about the underlying physical structure of the disk; they make use of a
        general block device interface when writing blocks to disk. The device driver takes
        care of the device specifics and maps file system block requests to meaningful device
        information, that is, information concerning cylinders, heads and sectors.

        The block device drivers hide the differences between the physical block device types
        (for instance ATA and SCSI) and, as far as each file system is concerned, the
        physical devices are just linear collections of blocks of data. The block sizes may vary
        between devices but this is also hidden from the users of the system. An EXT2 file
        system appears the same to the application, independent of the device used to hold it.
6.1.5   Buffers and Synchronisation

        The buffer cache contains data buffers that are used by the block device drivers. The
        primary function of a cache is to act as a buffer between a relatively fast device and a
        relatively slow one. These buffers are of fixed sizes and contain blocks of information
        that have either been read from a block device or are being written to it. It is used to
        increase performance since it is possible to "pre-fetch" information that is likely to be
        requested in the near future, for example the sector or sectors immediately after the
        one just requested. Hard disks also have a hardware cache but it is primarily used to
        hold the results of recent reads from the disk.

        When a file system is mounted, that is, attached to the operating system’s file system
        tree structure, it is possible to specify whether to use synchronisation or not; by
        default it is usually disabled. Synchronisation provides the possibility to bypass the
        write buffer cache. Briefly, this means that when a write request is acknowledged, the
        data has really been written to the physical media and not only to the buffer. In some
        cases this is vital, because a power loss empties all volatile memory: the buffers are
        wiped out and the information is lost despite having been acknowledged as
        successfully written. It is a matter of increased performance versus increased data
        integrity.
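        Under Linux this trade-off is visible to applications as well: a program that must not
        lose acknowledged data can flush the buffer cache explicitly. A small sketch using a
        temporary file:

```python
import os
import tempfile

# Write a record and force it onto the physical medium before treating
# it as durable; without fsync() the data may still sit in the volatile
# buffer cache when write() returns.
path = os.path.join(tempfile.mkdtemp(), "record.dat")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
try:
    os.write(fd, b"acknowledged only after fsync\n")
    os.fsync(fd)        # block until the kernel has flushed the data
finally:
    os.close(fd)
```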

6.1.6   An Example

        How, then, is a text file stored on a block device such as the hard disk devices
        discussed in the previous sections?

        [Diagram: bits are grouped into sectors; the disk is divided into partitions, the
        partitions into block groups, the block groups into data blocks, and the data blocks
        make up the file.]

        Figure 20 – A simplified overview of how a text file is stored on a block device such as a hard disk device.

        On the left-hand side (figure 20) there is a series of zeroes and ones, bits. A group of
        bits, typically 512 bytes’ worth, is in this example referred to as a hard disk sector.
        The sectors are the manageable units of a disk device, which consists of numerous
        sectors of a specific size. A disk device is often divided into a number of partitions,
        which in turn are divided into a number of block groups. Each block group consists of
        a number of data blocks whose sizes are constant and specified during file system
        creation. In this example the text file is contained in three file system data blocks.

6.1.7   Journaling and Logging

        Non-journaled file systems, for instance EXT2, rely on file system utilities when
        restarted dirty. These file system checkers (typically fsck) examine all meta-data at
        restart to detect and repair any integrity problems. For large file systems this is a
        time-consuming process. A logical write operation in a non-journaled file system may
        need several device I/Os before it is accomplished.

        Journaling file systems, e.g. JFS, use fundamental database techniques; all file
        system operations are atomic transactions and all operations affecting meta-data are
        also logged. Recovery in the event of a system failure is thus just a matter of applying
        the log records for the corresponding transactions. The recovery time associated with
        journaled file systems is hence much shorter than for traditional file systems, but
        during normal operation a journaled file system may be less effective since operations
        are logged in addition to being performed.
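        The logging idea can be illustrated with a toy key-value store: every update is
        appended to a log before it is applied, so recovery is just a replay of the log. This is a
        conceptual sketch of the database technique, not how any real journaling file system
        is implemented:

```python
import json
import os
import tempfile

class JournaledStore:
    """Toy write-ahead log: log each operation before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}

    def put(self, key, value):
        # 1. Append the operation to the log and force it to disk ...
        with open(self.log_path, "a") as log:
            log.write(json.dumps({"k": key, "v": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. ... only then apply it to the in-memory state.
        self.data[key] = value

    def recover(self):
        """After a crash, rebuild the state by replaying the log."""
        self.data = {}
        with open(self.log_path) as log:
            for line in log:
                record = json.loads(line)
                self.data[record["k"]] = record["v"]

path = os.path.join(tempfile.mkdtemp(), "store.log")
store = JournaledStore(path)
store.put("a", 1)
store.put("b", 2)

restarted = JournaledStore(path)   # a fresh instance, as after a crash
restarted.recover()
print(restarted.data)  # {'a': 1, 'b': 2}
```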
6.2     The Virtual File System


       The Linux kernel provides an abstract file system layer, which presents processes
       with a uniform way of accessing file systems, independent of their real layout. It is
       called the Virtual File System (VFS) and acts as an interaction layer between the file
       system calls and the specific file systems (figure 21). VFS must at all times manage
       the mounted file systems, because it is the only access path. To do that it maintains
       data structures describing both the virtual file system and the real file systems.

       [Diagram: an application in user space issues system calls; in kernel space the VFS,
       with its inode and directory caches, dispatches them to the concrete file systems
       (DOS, EXT2, MINIX), which share the buffer cache and reach the hardware disks
       through the ATA and SCSI drivers.]

      Figure 21 – An overview of the Linux Virtual File System and how it connects with user space processes, file
      systems, drivers and hardware.

       File systems are either built into the kernel or loaded as kernel modules, and they are
       responsible for the interaction between the common buffer cache, which is used by
       all Linux file systems, and the device drivers.

       Besides the buffer cache, VFS also provides inode and directory caches. Frequently
       used VFS inodes (similar to the EXT2 inodes) are cached in the inode cache, which
       makes access to them faster. The directory cache stores a mapping between full
       directory names and their inode numbers, but not the inodes of the directories
       themselves. To keep the caches up to date and valid, they use the Least Recently
       Used (LRU) principle.
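       The LRU principle itself is simple; a minimal sketch (not the kernel’s implementation)
       using an ordered dictionary:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used cache of the kind VFS uses for its
    inode and directory caches."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None                       # a cache miss
        self.entries.move_to_end(key)         # mark as recently used
        return self.entries[key]

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry

cache = LRUCache(2)
cache.put("/etc", 100)
cache.put("/usr", 200)
cache.get("/etc")         # "/etc" becomes the most recently used
cache.put("/var", 300)    # capacity exceeded: "/usr" is evicted
print(cache.get("/usr"))  # None
```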


6.3     Distributed File Systems
       Local file systems such as EXT2, NTFS and JFS are only accessible to the systems
       where they are installed. There are several approaches that export a local file system
       so that it is accessible from other hosts as well. Distributed file systems, as they are
       called, allow sharing of files and/or completely shared storage areas.

6.3.1   Network File System

        The Network File System, designed by Sun Microsystems in the mid-80s, allows
        transparent file sharing among multiple clients and is today the de facto standard in
        heterogeneous computer environments. NFS assumes a hierarchical file system and
        is centralised; several hosts connect to one file server, which manages all access to
        the real file system. NFS works well in small and medium-size installations, preferably
        local area networks. AFS, described below, is more suitable for wide area networks
        and installations where scalability is important.

        Most NFS implementations are based on the Remote Procedure Call (RPC). The
        combination of host address, program number and procedure number specifies one
        remote procedure, and NFS is one example of such a program. The eXternal Data
        Representation (XDR) standard is used to specify the NFS protocol, but it also
        provides a common way of representing the data types sent over the network.

        The NFS protocol was intended to be stateless; a server should not need to maintain
        any protocol state information about any of its clients in order to function correctly.
        Stateless servers have a prominent advantage in the event of a failure: a client needs
        only to retry a request until the server responds, and it does not need to know why the
        server is down. If a stateful server goes down, the client must detect the server failure
        and rebuild the state information or mark the operation as failed. The idea behind a
        near-stateless server is the possibility of writing very simple servers; it is the NFS
        clients that need the intelligence.

        The protocol should not introduce any additional state itself, but there are some
        stateful operations available, implemented as separate services: file and record
        locking, and remote execution.
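        The client-side consequence of statelessness can be sketched as a plain retry loop;
        the flaky-server function is of course hypothetical:

```python
import time

def call_with_retry(request, attempts=5, delay=0.01):
    """Retry a request until the server answers, in the spirit of
    stateless NFS: the client does not need to know why the server
    was unreachable, only that it should try again."""
    last_error = None
    for _ in range(attempts):
        try:
            return request()
        except ConnectionError as err:
            last_error = err
            time.sleep(delay)
    raise last_error

calls = {"count": 0}

def flaky_server():
    """Hypothetical server that misses the first two requests."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server not responding")
    return "reply"

print(call_with_retry(flaky_server))  # reply
```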

6.3.2   Andrew File System

        The Andrew File System (AFS) was developed at Carnegie Mellon University to
        provide a scalable file system suitable for critical distributed computing environments.
        Transarc, an IBM company, is the current owner of AFS and has also released an
        open source version of AFS.

        AFS is suitable for wide area network installations as well as smaller local area
        network installations [IBM02]. AFS is based on secure RPC and provides Kerberos
        authentication to enhance security. Compared with NFS’s centralised client/server
        architecture, AFS is somewhat different. AFS provides a common global namespace;
        files are addressed unambiguously from all clients and the path does not incorporate
        any mount points as it does for NFS.

        Another significant difference is that AFS allows more than one server in a group, or
        cell as it is called. AFS joins together the file systems of multiple file servers and
        exports them as one file system. The clients therefore do not need to know on which
        server the files are stored, which makes access to files as easy as on a local file
        system.

        Important files, e.g. application binaries, may be replicated to other servers. If one of
        the servers goes down, the client automatically accesses the file from another server
        without any interruption. This feature significantly increases the availability of a critical
        system. The use of several file servers also increases efficiency, since the work is
        distributed over several file servers, unlike NFS where a single server manages all
        requests.


6.4.1   User Perspective of the File System

        From a user’s point of view it is irrelevant how the information is recorded onto the
        disk and how it is stored. From a user perspective, modern file systems are based on
        three assumptions [Nielsen96]:

        -      Information is partitioned into coherent and disjunct units, each of which is treated
               as a separate object or file.

        -      Information objects are classified according to a single hierarchy, the subdirectory
               structure.

        -      Each information object is given a semi-unique file name, which users use to
               access information inside the object.

        The fact that information is normally stored in non-contiguous sectors of the hard disk
        is hidden from the end users. The information is usually presented to the user as files,
        the most common abstraction of digital information. That it is possible to read and
        write information, and that it is stored in a safe and unambiguous manner, is much
        more important than knowing exactly on which sectors the information resides.

6.4.2   Filesystem Hierarchy Standard

        The Filesystem Hierarchy Standard (FHS) is a collaborative document that defines a
        set of guidelines and requirements for the names and locations of many files and
        directories under UNIX-like operating systems. Its intended use is to support
        interoperability between applications and to present a uniform file hierarchy to the
        user, independent of distribution. Many independent software suppliers and operating
        system developers provide FHS-compliant systems and applications, which simplifies
        installation and configuration since the files’ directories are known.

            The FHS standard is available from



       A system’s availability is measured as the percentage of time the system is available
       and provides its services correctly. The rest of the time is assumed to be unplanned
       downtime, i.e. time when the system is unavailable. The availability measure uses a
       logarithmic scale based on nines; a system with three nines of availability is thus
       available 99.9% of the time. Each additional nine is associated with more extreme
       requirements and, especially, increased costs. High Availability (HA) refers to
       systems that are close to continuously available, meaning no downtime – an
       expression often associated with telecom equipment that promises up to five nines of
       availability.
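       The scale of nines translates directly into permitted downtime per year:

```python
def downtime_per_year(nines: int) -> float:
    """Unplanned downtime, in hours per year, for a given number of
    nines of availability (3 nines = 99.9%, 5 nines = 99.999%)."""
    unavailability = 10 ** (-nines)
    return unavailability * 365 * 24

print(round(downtime_per_year(3), 2))   # 8.76  -> ~8.8 hours per year
print(round(downtime_per_year(5), 3))   # 0.088 -> ~5 minutes per year
```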

       The concept of availability incorporates both reliability and repairability, which are
       measurable as MTTF and MTTR. Availability comprises at least four components:
      -   hardware availability
      -   software availability
      -   human error
      -   catastrophe

       The determining factor for most systems’ availability is human error. More intuitive
       and more automated user interfaces may prevent most unnecessary errors
       associated with configuration and installation. Today’s hardware generally provides
       good availability. If the intention is to build an HA service, one should consider
       utilising one or more of the existing technologies that improve a system’s availability.


       There are many different approaches to increasing a system’s availability. Some of
       the following explanations use examples of particular systems, but the techniques are
       of course applicable to other systems than those described.

       A system is only as strong as its weakest point, and a system involves more than the
       components actually providing its functionality. If, for instance, a system considered
       to be highly available is powered from a single power supply, its maximum availability
       is identical to that of the power supply it is attached to; if the power supply fails, the
       whole system fails. Single points of failure (SPOF) must be avoided. Just as
       redundant information increases reliability in RAID, a system’s availability is increased
       by adding an extra redundant component. Adding an additional power supply thus
       increases the whole system’s availability, but the system must of course be designed
       to use the redundant component.
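       The effect of the redundant power supply can be quantified: with independent
       failures, the pair is unavailable only when both supplies are down at once. A
       hypothetical 99% supply illustrates the point:

```python
def parallel_availability(a: float, n: int = 2) -> float:
    """Availability of n redundant components, each with availability
    a, assuming independent failures: the system is down only when
    all n components are down simultaneously."""
    return 1 - (1 - a) ** n

# One 99% power supply caps the system at 99%; a second, redundant
# supply lifts that bound to 99.99%.
print(round(parallel_availability(0.99), 4))  # 0.9999
```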

       If a system is vital and it is of the utmost importance that it never goes down,
       geographic redundancy is a final, extreme precaution. Adding redundancy to a
       system increases its availability, but what if a catastrophe such as an earthquake
       destroys the whole system including its redundant components? As the qualifier
       “geographic” intimates, geographic redundancy not only involves redundant
       components, it implies that the redundant components are not placed in the same
       geographic location.

When a system recognises a component as failed and a redundant component is
present, it must be possible to transfer the service from the failed to the working
component. This mechanism is called fail-over and it is often implemented in
active-standby server configurations; one server is active and presents some service,
but when it goes down the standby node is ready to take over and restart the service.
Fail-over often introduces a short time delay, and it is close to impossible to eliminate
this delay without increasing the risk of introducing other problems.

Heartbeats are used to monitor a system’s health. A heartbeat monitor continuously
asks the node or the subsystem if it is working properly. If the answer is negative or if
the question is unanswered after a defined number of tries an action is triggered. It
may involve fail-over, resource fencing and other actions that provide means to
maintain the complete system’s functionality. Heartbeats are available both as
hardware and software solutions but the hardware implementations are desirable if
fast response is required.
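A software heartbeat monitor reduces to counting consecutive missed beats; a
minimal sketch, with the action taken on failure left as a callback:

```python
class HeartbeatMonitor:
    """Declare a node dead after a number of consecutive missed
    heartbeats and trigger an action such as fail-over or fencing."""

    def __init__(self, max_misses=3, on_failure=lambda: None):
        self.max_misses = max_misses
        self.on_failure = on_failure
        self.misses = 0

    def beat(self, answered: bool):
        if answered:
            self.misses = 0           # the node is healthy again
        else:
            self.misses += 1
            if self.misses == self.max_misses:
                self.on_failure()     # e.g. start the fail-over

events = []
monitor = HeartbeatMonitor(max_misses=3,
                           on_failure=lambda: events.append("failover"))
for answered in [True, False, False, False]:
    monitor.beat(answered)
print(events)  # ['failover']
```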

In virtually all systems it is desirable to isolate malfunctioning components. If a
component is active but not working properly, it might introduce new problems to a
system and thus damage it. Resource fencing is an approach that isolates
components identified as malfunctioning from the rest of the system, to prevent them
from disturbing or harming other components. In clusters it eliminates the possibility
that a “half-dead” node presents its arbitrarily working resources; in a two-node
cluster utilising fail-over this is particularly important. Suppose that the heartbeats
between the two systems are late and that the standby node after a while declares
the other node dead. When it turns out that it was just network congestion, both
nodes already believe that they are the active one – a situation known as split brain.
If resource fencing is used, it is possible to control the other node’s power supply and
turn it off during a fail-over.

High availability can also be provided without any specialised hardware. Clustering is
a method where a number of collaborating computers, or nodes, provide a distributed
service and/or serve as backups for each other. Clustering requires specialised
software, so-called cluster managers.

Checkpointing is another software concept, which provides clusters with the possibility
to store information about individual nodes’ processes that are vital to the system. If a
cluster member fails the checkpoint information is used to quickly allow another node
to take over the failed node’s processes and restart them. Thus the time delay
normally associated with a normal fail-over is reduced.
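The mechanism can be sketched as serialising a process’s vital state to a location the
standby node can read; the state contents here are invented for illustration:

```python
import json
import os
import tempfile

def checkpoint(state: dict, path: str) -> None:
    """Persist a node's vital process state for its cluster peers."""
    with open(path, "w") as f:
        json.dump(state, f)

def restore(path: str) -> dict:
    """Used by the take-over node to resume from the last checkpoint
    instead of restarting the service from scratch."""
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "node1.ckpt")
checkpoint({"service": "nfs", "last_seq": 42}, path)
print(restore(path))  # {'service': 'nfs', 'last_seq': 42}
```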

Suppose that a storage system is equipped with a RAID array to provide a higher
level of data reliability. If a disk in the RAID array fails, it is desirable to repair or
replace the failed device without taking down the storage system. Hot swap is a
technology that increases system availability by providing the possibility to replace a
component, for instance a disk device, while the system is running. This of course
requires that the system sustains at least one failed component.

A system’s availability can be increased even further if it is equipped with hot
standby. It is similar to hot swap but requires no manual intervention. If, for instance,
a disk in a RAID configuration supporting hot standby fails, the failed disk is
automatically regenerated onto and replaced by the hot standby. That is, an
additional component is installed but not used until one of the active components
fails. Hot standby sharply decreases the MTTR value for that particular component,
and this also affects the system’s overall MTTR.
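The effect on availability follows from the steady-state formula A = MTTF / (MTTF +
MTTR); the MTTF and repair times below are hypothetical:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: A = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)

# A disk with a 50,000-hour MTTF: cutting the repair time from a
# 24-hour manual replacement to near-instant hot standby raises the
# component's availability noticeably.
print(round(availability(50_000, 24), 5))   # 0.99952
print(round(availability(50_000, 0.1), 6))  # 0.999998
```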

A hardware watchdog is really a timer that is periodically reset by a system when it is
working properly. If the timer is not reset for a period of time it reaches a threshold,
and the watchdog assumes that there are problems with the system. It automatically
inactivates the system and either restarts it or forces it to stay off-line. Watchdogs
are available both as hardware and software, but only the hardware solutions really
provide any increased availability. Assume that a system is utilising a software
watchdog and the watchdog’s environment hangs: the software watchdog is then
totally useless, whereas a hardware watchdog is shielded from any software
influence.
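The timer logic can be modelled in a few lines; real watchdogs live in hardware, so
this is only a behavioural sketch:

```python
class WatchdogTimer:
    """Behavioural model of a watchdog: the supervised system must
    kick (reset) the timer regularly; if enough ticks pass without a
    kick, the watchdog assumes a hang and fires."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.count = 0
        self.expired = False

    def kick(self):
        """Called by the system while it is working properly."""
        self.count = 0

    def tick(self):
        """Called on every timer interval."""
        self.count += 1
        if self.count >= self.threshold:
            self.expired = True  # restart the system or hold it off-line

wd = WatchdogTimer(threshold=3)
wd.tick(); wd.tick()
wd.kick()                        # healthy system resets the timer in time
wd.tick(); wd.tick(); wd.tick()  # system hangs: no more kicks
print(wd.expired)  # True
```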


        Until now, this thesis has discussed basic storage components used in most modern
        storage systems. The purpose of this project is to design and implement a prototype
        that fulfils the telecom requirements regarding availability and performance. The
        forthcoming sections discuss one possible solution and an evaluation of the solution.

        I decided to enhance the availability of Network File System servers. It was desirable
        to use a well-known standard such as NFS, especially since TelORB currently
        supports this file system. The main drawback with centralised storage solutions is the
        problem of availability. If a single NFS server goes down due to hardware or software
        failures, all connected clients are unable to use the data stored on the server’s disks
        for some period of time. This is not acceptable in any high availability system or
        application. Therefore I have tried to create a two-node high availability cluster where
        an active node provides an NFS service and another node stands by, ready to take
        over if the active node fails. If the active node accidentally goes down, the standby
        acquires the former active node’s IP address and restarts its services transparently to
        the clients using them.

        There are systems with load-sharing capabilities on top of their high availability
        features, but this project focuses on HA only. Load-shared storage solutions involve
        mechanisms to give two or more nodes a homogeneous file system image, and this
        is not feasible within this project’s limited time.


        This subsection briefly explains some weaknesses associated with a single NFS
        server and how it is possible to overcome them and implement a solution with
        increased availability.

8.1.1   Simple NFS Server Configuration

        In a simple NFS configuration there is a central server providing the file service, i.e.
        exporting a file system, and a number of clients using the service (figure 22), that is,
        mounting the exported file system. Putting it all together is a rather simple process;
        most Linux distributions come with both NFS server and client support, and a simple
        system needs only minor configuration when starting from scratch.
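        As an illustration, a minimal server-side export and the corresponding client mount
        might look as follows; the paths and addresses are hypothetical, and the exact option
        names vary between NFS implementations:

```
# /etc/exports on the NFS server
/export/data   192.168.1.0/24(rw,sync)

# /etc/fstab entry on each NFS client
server:/export/data   /mnt/data   nfs   defaults   0   0
```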

        If the server accidentally crashes or the network goes down, the clients lose contact
        with the server providing the file service and the stored information becomes
        inaccessible. Hard disk mirroring and other RAID organisations are common methods
        that prevent data loss in case of hard disk failure. But mirroring inside a single
        machine does not increase server availability if a component other than a hard disk
        fails. The server itself and the network are single points of failure (SPOF), and it is
        therefore not enough to increase the server’s “internal” reliability.

        To exclude the server SPOF one might have a redundant server standing by, ready
        to take over if the service provider goes down. This proposal makes use of this idea,
        generally known as fail-over. In short, fail-over means moving services from a failing
        server to another, redundant server standing by.

        [Diagram: a single NFS server connected over one network to a number of NFS
        clients.]

        Figure 22 – A simple NFS configuration where the server and the network are single points of failure.

        To exclude the access path SPOF, that is, the single Ethernet network in figure 22, it
        is possible to add a redundant network, analogous to the NFS server redundancy.
        This prototype proposal only focuses on NFS server redundancy, because the Linux
        NFS implementation used does not support redundant networks. TelORB’s NFS
        implementation supports redundant networks, but porting this to Linux is out of this
        project’s scope. Henceforth the figures show dual networks even though this is not
        implemented.

8.1.2   Adding a Redundant NFS Server

        If another NFS server is added to provide server redundancy, the configuration gets
        more complicated (figure 23). Besides the redundant hardware, additional software is
        also required if the two servers are going to work together. In this proposal “working
        together” does not include any kind of load sharing.


Figure 23 – Adding a redundant server and a redundant network eliminates the server and the network single
points of failure.

If server 2 can monitor server 1's status in real time, the services provided by server 1
can be restarted at server 2 if the primary fails. For static information services, for
instance a web server or a seldom-updated database, this is basically enough to
increase availability. But for a file service, or any other service where the information
is constantly changing, the situation is more complex. If the NFS clients write
information to the active server (they usually do) and it goes down, the standby
server, number 2 in the illustration, must be able to access the information just
written by the clients. The system, consisting of the two servers, must have a
homogeneous image of the file system. In this report a homogeneous file system
means that the in-cluster nodes see exactly the same file system, not that it is
mounted simultaneously.

8.1.3   Adding Shared Storage

        Many reliable storage solutions use redundant servers with access to some sort of
        shared storage (figure 24). The obvious advantage with this approach is that there is
        only one physical place where the information is stored, thus only one file system
        image. It should be a straightforward process to implement this in Linux if it is
        acceptable that only one server has access to the file system at a time. There are,
        however, solutions available where two servers are simultaneously accessing shared
        storage, typically solutions that use FC.

Figure 24 – Two shared storage possibilities: Fibre Channel Arbitrated Loop and shared SCSI. In both cases an
active and a standby server attach to the same storage, with clients reaching them over switched Ethernet.

Shared storage often involves special hardware, such as shared SCSI or Fibre
Channel. Shared SCSI is regular SCSI used by two host adapters instead of just one,
and it does not provide any hardware interface redundancy; it is therefore not
suitable for systems that require extremely high availability. Fibre Channel Arbitrated
Loop (FC-AL) is a Fibre Channel configuration providing high throughput as well as
redundant access paths and hot-swap capabilities. Unfortunately, FC equipment is
much more expensive than conventional hardware such as SCSI. Both SCSI and
FC-AL can of course use RAID controllers to build disk configurations with increased
reliability.

If shared storage is used, the standby node is able to mount the shared storage in
case of a fail-over and provide the clients with exactly the same file system.
Assuming that there are mechanisms for monitoring processes and servers and
restarting them, the technique is rather simple. But neither hardware solution was
appropriate: shared SCSI does not provide redundancy and FC-AL is too expensive.
The proposal is therefore to try a software approach instead and create a so-called
virtual shared storage.

             Cheetah 73LP, a high-end FC disk from Seagate (36.7 GB, 10k RPM, 4.7 ms Avg. Seek), costs $540.00,

8.1.4   Identified Components

        To build a two-node NFS fail-over cluster I identified the following components:
        -    hardware platform
        -    network
        -    cluster service, to provide monitoring, fail-over and IP address binding
        -    shared storage, to present the file managers with a homogenous file system
        -    file server application


It would be desirable to have a disk area that two servers have direct access to,
because a single file system image simplifies the solution. It is possible to create a
virtual shared storage similar to the high-end shared storage described above using
only standard Ethernet networks and additional Linux software. This solution can be
thought of as a cheap model of the above, or as a stand-alone software solution
where geographic redundancy comes for free. Virtual shared storage is considered
shared storage with the limitation that only one file system manager can mount the
file system read/write at a time. In this fail-over system proposal that is an acceptable
limitation, since only one server needs full access to the file system; the other waits
until it is in primary state to mount the disk space, and by that time it is alone.

I have tried two different solutions, both specialised Linux software components.
Both solutions basically provide a mirroring service, and it is important to emphasise
that they should not be compared with high-performance hardware solutions under
the same conditions.

8.2.1   Network Block Device and Linux Software RAID Mirroring

The vital component in this virtual shared storage solution is the enhanced network
block device driver (NBD), a device driver that makes a remote resource look like a
local device in Linux. Typically it is mounted into the file system via /dev/nda. The
driver simulates a block device, such as a hard disk or a hard disk partition, but
access to the physical device is carried across an IP network and hidden from user
processes.

Figure 25 – A Network Block Device configuration where Node 1 transparently accesses a block device
mounted as a local device that is physically located at Node 2. The dotted cylinder represents the network block
device mounted locally at Node 1, and the grey-shaded cylinder is the actual device used at Node 2. A device is
either a physical device or a partition.

On the server side (Node 2 in figure 25), a server daemon accepts requests from the
client daemon at Node 1. At the server, the only extra process running is the server
daemon listening on a pre-configured port. The client must, in addition to running the
client daemon, also insert a Linux kernel module before any NBD is mountable. Once
the module is loaded, both daemons are started, and the block device is mounted
using, for instance, the native Linux EXT2 file system, it can be used as a
conventional block device. In figure 25 the grey-shaded cylinder represents the real
partition or disk device at Node 2, mounted at Node 1 as the dotted cylinder.
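As a concrete illustration, such an NBD pairing can be set up with the NBD userland tools roughly as follows. This is a sketch, not the thesis's exact procedure: daemon names, flags and the device node (/dev/nda here; later NBD releases use /dev/nbd0) vary between NBD versions, and the port, host name and partition are illustrative.

```shell
# Node 2 (server side): export a local partition on a pre-configured
# TCP port.
nbd-server 5000 /dev/hda10

# Node 1 (client side): insert the kernel module, then attach the
# remote export so it appears as a local block device.
modprobe nbd
nbd-client red1 5000 /dev/nda

# The network block device can now be used like any local disk:
mke2fs /dev/nda
mount /dev/nda /mnt/nbd
```

These commands require root privileges and a reachable peer node, so they are shown for orientation rather than as a copy-paste recipe.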

The enhanced network block device driver uses TCP as its data transfer protocol.
Compared to UDP, TCP is a much more reliable protocol due to its consistency and
recovery mechanisms. The developers have accepted the extra overhead associated
with TCP because it significantly simplifies the NBD implementation.

Many Linux distributions are installed by default with kernels supporting Linux
software RAID. If not, recent kernels can be upgraded with a matching RAID patch.
The current RAID software for Linux supports RAID levels 0, 1, 4 and 5. It also
supports a non-standardised RAID level, linear mode, which aggregates one or more
disks to act as one large physical device. The use of spare disks for hot standby is
also supported in the current release. Kernels 2.2.12 and later are able to mount any
type of RAID as the root file system and boot from the software RAID device. There
is also a software package called raidtools that includes the tools needed to set up
and maintain a Linux software RAID device.

Assume that the client in the network block device example divides its local block
device into two partitions, one for operating system files and one identical to the
partition mounted from the NBD server. It is of utmost importance that the partitions
are identically defined with respect to data block size, partition size and all other
block-device-specific parameters. Linux software RAID organisations require identical
partitions or devices to work, but there are no limitations on where they are
geographically or physically installed. It is therefore possible to use a geographically
local partition or device together with a network block device, locally mounted but
physically installed at another node, in a software RAID configuration, assuming they
are identically defined.

Figure 26 – A local partition and a network block device used in a Linux software RAID Level 1 (mirrored pair)
configuration.

If the client creates a partition with parameters identical to the network block device,
these two partitions can be used in a RAID Level 1 configuration, i.e. disk mirroring
(figure 26). Hidden from the application using the apparently locally mounted RAID
device, all file operations are carried out locally and also mirrored to a block
device somewhere on the network.
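With raidtools, the mirrored pair of figure 26 could be described roughly as follows in /etc/raidtab. This is a sketch: the device names are illustrative, and the NBD device node may be named differently depending on the driver version.

```shell
# /etc/raidtab – RAID-1 pairing a local partition with the network
# block device whose storage is physically at the other node.
raiddev /dev/md0
    raid-level            1
    nr-raid-disks         2
    persistent-superblock 1
    device                /dev/hda10   # local partition
    raid-disk             0
    device                /dev/nda     # network block device
    raid-disk             1
```

The array would then be initialised with `mkraid /dev/md0`, given a file system, and mounted like any block device.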

This configuration not only increases the reliability of the data but also its availability.
If either of the two nodes whose disk devices are used in the mirror crashes, the data
is still available from the disk in the surviving node. This can be thought of as LAN
mirroring or virtual shared storage, since everything written on the active side is
transparently written to a redundant copy, which is usable in case of fail-over.

8.2.2   Distributed Replicated Block Device

Distributed Replicated Block Device (DRBD) is an open source kernel module for
Linux. It makes it possible to build a two-node HA cluster with distributed mirrors.
DRBD provides a virtual shared disk to form a highly available storage cluster; it is
similar to the RAID mirror solution but includes some extra features and is
distributed as a single software package.

The Linux virtual file system (VFS) passes data blocks to a block device via file-system-
and device-driver-specific layers (figure 27). DRBD acts as a middle layer that
transparently forwards everything written to the local file system to a mirror connected
to the same network.

Figure 27 – An overview of how DRBD acts as a middle layer and forwards file system operations to a
redundant disk. On each node the stack is VFS, file system, buffer cache, DRBD, disk driver and disk; DRBD
forwards writes over TCP/IP via the NIC to its peer on the other node.

        Three different protocols are available, each with different characteristics suitable for
        a number of applications:

-   Protocol A: Signals the completion of a write request as soon as the block is
    written to the local disk and sent out to the network. This protocol is best suited
    for long-distance mirroring. It has the lowest performance penalty of the three
    protocols but is also the least reliable.

-   Protocol B: A write request is considered completed as soon as the block is
    written to the local disk and the standby system has acknowledged reception of
    the block.

-   Protocol C: Treats a write request as completed as soon as the block is written to
    the local disk and an acknowledgement is received from the standby system
    assuring that the block has been written to its local disk as well. This is the most
    reliable of the three protocols and guarantees transaction semantics in all failure
    cases.
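The protocol is selected per resource in DRBD's configuration file. The fragment below is a sketch only: drbd.conf syntax has changed between DRBD releases, and the host names, devices and addresses are illustrative.

```shell
# Sketch of a DRBD resource selecting protocol C (the most reliable).
resource drbd0 {
  protocol C;            # A = async, B = ack on receipt, C = ack on remote disk write
  on red0 {
    device  /dev/nb0;    # the DRBD block device seen by the file system
    disk    /dev/hda10;  # backing partition
    address;    # cluster-internal network
  }
  on red1 {
    device  /dev/nb0;
    disk    /dev/hda10;
    address;
  }
}
```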

For some file systems, e.g. journaling file systems such as JFS, it is vital that blocks
are recorded to the media in a pre-determined order. DRBD ensures that data
blocks are written in exactly the same order on both the primary and the secondary
disks. It is vital that the nodes in the cluster all have the same up-to-date data, and
nodes that do not must be updated as soon as possible. A small amount of
information referred to as meta-data is stored in non-volatile memory at each node.
The meta-data, consisting of an inconsistent flag and a generation counter, is used to
decide which node has the most up-to-date information. The generation counter is
really a tuple of four counters:

<human-intervention-count, connected-count, arbitrary-count, primary-indicator>

During normal operation, data blocks are mirrored in real time as they are written. If a
node rejoins a cluster after some downtime, the cluster nodes need synchronisation.
The meta-data is used during a cluster node's restart to identify the node with the
most up-to-date data. When the most up-to-date node is identified, the nodes are
synchronised using one of the two mechanisms supported by DRBD:

-   Full synchronisation: the common way to synchronise two nodes is to copy each
    block from the up-to-date node to the node in need of an update. This is not
    performance efficient.

-   Quick synchronisation: if a node leaves the cluster for a short time, a memory
    bitmap that records all block modifications is used to update only the blocks
    modified during the node's absence. DRBD's requirement for "a short time" is
    that the active node is not restarted during this time.

Synchronisation is designed to run in parallel with the data block mirroring and other
services, thus not affecting the node's normal operation. The synchronisation may
therefore only use a limited share of the total network bandwidth.


8.3.1   Linux NFS Server

NFS on Linux was made possible by a collaborative effort of many people, and
currently version 3 is considered the standard installation. NFS version 4 is under
development and includes many features from competing file systems such as the
Andrew File System and the Coda File System. The advantage of NFS today is that it
is mature, standardised and robustly supported across a variety of platforms.

All Linux kernels from version 2.4 have full NFS version 3 functionality. For kernels
2.2.14 and above, patches are available that provide NFS version 3 and reliable file
locking. Linux NFS is backward compatible: version 3 software supports version 2
implementations as well.

8.3.2   Linux-HA Heartbeat

A high availability cluster is a group of computers which work together in such a way
that the failure of any single node in the cluster will not cause the service to become
unavailable [Robertson00]. Heartbeat is open source software that makes it possible
to monitor another system's health by periodically sending "heartbeats" to it. If the
response is delayed or never received, it is possible to define actions that hopefully
increase the complete system's availability. Heartbeat is highly configurable, and it is
possible to develop custom scripts for specific purposes.

8.3.3   The two-node High Availability Cluster

The basic idea of this prototype proposal is to make the Linux NFS server highly
available. I chose NFS because it is today's de facto standard distributed file system,
and under Linux, NFS version 3 is considered a mature and stable implementation.
Whether or not NFS is the best choice is not really the issue in this section. The
purpose of the network mass storage project is to build a prototype, and it is of
course easier to implement a prototype with well-known and already working
components.

Figure 28 – Overview of the two-node HA cluster design proposal: an active and a standby NFS server share a
virtual cluster IP, monitor each other via Heartbeat, and keep LAN-mirrored disk arrays.

The obvious problem with this kind of centralised NFS file server is of course that its
availability is exactly the same as that of the server it is running on. Any single point
of failure in a system is unacceptable, even if it is a complete server. Using Linux-HA
Heartbeat makes it possible to have an identical standby NFS server ready to take
over the active node's operation in case of failure (figure 28). With a hardware RAID,
the reliability of the data is increased locally, but if the active node goes down the
standby node must be able to go online from any state, since we do not know when
the fail-over will happen. Using DRBD, the active node's disk is transparently mirrored
in real time to the standby's disk. DRBD also takes care of synchronisation if either of
the two nodes is restarted.


Two prototypes were built: Redbox and Blackbox. Each consists of two Linux
servers working as one unit, a two-node high availability cluster. This section
focuses on practical issues concerning the assembly process and involves many
technical details.

Redbox was the first of the two prototypes built. When it was working "properly",
some of the hardware and software used in Redbox was reused to build the Blackbox
prototype. The prominent difference between Redbox and Blackbox is the hardware
configuration: Redbox is built from TSP components with some minor tweaks, while
Blackbox is assembled from conventional PC components.

9.1     REDBOX

Because Redbox was the first prototype built, the process of putting it all together
involved many time-consuming mistakes that could be avoided in the Blackbox
prototype. This section about Redbox is therefore more detailed than the next
section, which concerns Blackbox.

Figure 29 – Overview of the physical components (the GEM magazine with the SCB switchboard; the cPCI
magazine hosting Red0 and Red1 with serial, monitor and keyboard connections; and a 3Com SuperStack 3900
Ethernet switch) and the communication possibilities in the Redbox prototype.

9.1.1   Hardware Configuration

        In the Ericsson hardware prototype lab I configured three Teknor MXP64GX cPCI
        processor boards to run Red Hat Linux: Red0, Red1 and Pinkbox (figure 29). Two
        processor boards were running in a cPCI magazine with split backplane and they are

Redbox – the first prototype needed a name, and since it runs Red Hat Linux as its OS, the name had to
include Red.

Blackbox – the second prototype also needed a name; Black Hat Linux is unknown to me, but since the cluster
node cabinets are black the name was as obvious as for Redbox.

therefore treated as separate nodes with no connection to each other other than the
Ethernet networks. The 3.5" hard disks are fed from an external power supply and
attached to the processor boards with a self-made cable (the pin layout is available in
Appendix A). The third node was installed in a TSP cabinet with GEM and Teknor
specific adapter boards. A more detailed description of the hardware components
is available in Appendix A.
I used a 3Com SuperStack 3900 Ethernet switch, an SCB Ethernet switchboard and
standard Ethernet twisted-pair cabling to connect Red0, Red1 and Pinkbox in a
switched private network. Red0 and Red1 were, apart from the switched network,
also connected directly to each other with a second, crossed Ethernet cable. This
interface was used for internal cluster communication between Red0 and Red1,
which constitute the Redbox two-node high availability cluster. The internal cluster
network is of course not connected to any non-cluster members. The third
communication possibility is a simple null-modem cable used by the cluster's high
availability software as a redundant heartbeat path in case the primary path, i.e. the
cluster's internal network, fails.

9.1.2   Operating system

Ericsson UAB is currently porting parts of the TSP's functionality to run under
Linux as a complement to Solaris UNIX. Red Hat Linux is the distribution used for
development, testing and evaluation at Ericsson UAB today, and that is the main
reason why I chose Red Hat as the operating system during this master's thesis
project. Red Hat is a mature and well-known distribution, some even say the most
widespread of them all, but there are many other distributions to choose from;
SuSE, Mandrake, Debian and Slackware are just some examples.

        I downloaded the latest Red Hat Linux distribution from Sunet’s ftp archive, at the
        time release 7.1, which is also known as “Seawolf”. The Linux kernel distributed along
        with this Red Hat release was version 2.4.2. The Linux kernel, the core component of
        the Linux operating system, undergoes constant updates and the latest stable kernels
        and patches are always published at the Linux Kernel Archives homepage.

The installation was a straightforward process compared to earlier distributions
of Linux. Before adding any extra functionality, i.e. cluster and virtual shared
storage software, I configured the three nodes and carefully tested the networks and
the serial connection. When all communication paths were up and running I


A null-modem cable is connected between two computers' serial interfaces and is a cheap and simple way to
get two computers to "talk" to each other.



configured an NFS version 3 server at both cluster nodes. When I was able to
successfully mount the two exported file systems from the third node, Pinkbox, I
decided to move on to the additional software components.

9.1.3   NFS

Red Hat Linux 7.1 supports both NFS version 2 and version 3. On the server side an
administrator typically defines which partitions or directories to export and to whom.
The file /etc/exports is the access control list for the file systems that may be
exported to NFS clients, and it is used by the NFS file server daemon and the NFS
mount daemon (rpc.mountd). Security is of less importance in this proposal since
the prototype is attached to a private network and all clients attached to this particular
network are granted access. Access rights can otherwise be defined in /etc/hosts.allow
and /etc/hosts.deny, but I left them blank.
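For this prototype, a minimal export of the served partition's mount point might look as follows. This is a sketch using the exports(5) syntax; the path and options are illustrative, not the thesis's exact configuration.

```shell
# /etc/exports – allow every host on the private network to mount the
# file system read/write; no_root_squash keeps root's identity intact.
/mnt/drbd   *(rw,sync,no_root_squash)
```

After editing the file, running `exportfs -a` (or restarting the NFS daemons) makes the export active.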

9.1.4   Distributed Replicated Block Device

The first extra non-standard Red Hat component I installed was the distributed
replicated block device. I downloaded the latest release of the DRBD software, at
the time release 0.6.1-pre5. As the name suggests it is a pre-release, but since I am
using the 2.4.2 Linux kernel the previous 0.5.x releases will not work; they require a
2.2.x kernel. Since DRBD is a kernel module, the kernel source code must be
installed, otherwise it is impossible to compile the DRBD source code.

At a glance the software seems really well documented, but when the first problems
arise the only way to solve them is to spend more time testing and tweaking. A
good way to eliminate some basic problems is to join the DRBD developers'
mailing list. Philipp Reisner, the original author of DRBD, is a frequent visitor and
answers any relevant question almost immediately. I have been in touch with him
regarding a bug that arises when DRBD is used together with some NFS-export-specific
parameters.

Two identical partitions are needed to get DRBD up and running, and therefore I
started by creating two small partitions, accessed via /dev/hda10, at both cluster
nodes, each approximately 100 MB. I began with small partitions because re-
synchronisation is time-consuming for large disk spaces. When I finally tested the
DRBD software binaries after compilation and installation, it worked but was really
slow. I started configuring the system, typically /etc/drbd.conf for Red Hat Linux,
and some minor tweaking greatly improved the performance, especially the
synchronisation process. Typical parameters used in the configuration file are
resource name, nodes, ports, file system check operations, synchronisation bandwidth

             Currently hosted at

        utilisation and kernel panic options. See appendix for configuration files used in the
        Redbox implementation.

Because the cluster nodes use a dedicated 100 Mbit/s network for disk replication, it
was possible to utilise the full resynchronisation bandwidth; the Heartbeat signalling
bandwidth usage is negligible. The DRBD software is the limiting factor; today the
maximum resynchronisation throughput is approximately 7

DRBD is distributed with several scripts for various purposes. The most useful is a
service initialisation script that can be executed either with Red Hat's service tool
or as a standalone script. Another useful script is used for benchmarking: it tests
each system's hard disk device as well as the bulk data transfer between the DRBD
nodes for each of the three protocols. Since DRBD utilises TCP for data transfer, I
found it interesting to benchmark the bulk TCP transfer over the 100 Mbit/s Ethernet
network. Hewlett-Packard has developed an interesting application for this purpose.
Netperf, as the software is called, was originally targeted at the UNIX world but is
now distributed for Linux as well, free of charge. More information about the results
can be found in the benchmark section.
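A typical Netperf measurement involves a server daemon on one node and the netperf client on the other. The commands below are a sketch; the host name and test length are illustrative.

```shell
# On red1: start the netperf server daemon (listens on its default port).
netserver

# On red0: run a 30-second bulk TCP throughput test against red1.
# TCP_STREAM is netperf's standard bulk-transfer test.
netperf -H red1 -l 30 -t TCP_STREAM
```

Netperf reports the sustained throughput in Mbit/s, which can be compared directly against DRBD's resynchronisation rate on the same link.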

9.1.5   Network Block Device and Software RAID Mirroring

In parallel with the DRBD testing I also tried to configure the software RAID mirror.
Since I found the DRBD software much more interesting I only tested this solution
briefly, but found some interesting limitations when trying to integrate it with NFS. I
began with a local RAID configuration making use of two local partitions. There were
no problems setting it up and it seemed to work.

The next step was to test the network block device. I used the same partition used by
DRBD and successfully mounted Red1's partition at Red0. When the two components
needed for the distributed RAID mirror worked properly, I tried to integrate them so
that they would work as a unit, transparently to the processes using them. All
configuration files used are published in Appendix B.

The RAID regeneration process is poorly documented and I am not really sure how
it works. Red1 had some problems with the system hardware clock and I think this
affected the regeneration, because some files stored at the mirror disappeared by
mistake. I tried to solve this by integrating rdate, a client application that uses TCP
to retrieve the current time from another machine using the protocol described in
RFC 868. Despite the use of time synchronisation I am not sure that the problem is
truly eliminated, and hence no conclusions are drawn.
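Keeping the clocks aligned with rdate is a one-liner that can be run from cron; the time-server host name below is illustrative.

```shell
# Query red0's RFC 868 time service (TCP port 37) and set the local
# clock to the returned value; -p would only print it without setting.
rdate -s red0
```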

I believe that DRBD is preferable to an NBD RAID mirror configuration, since it is
made solely for this explicit purpose and requires less manual intervention.

9.1.6   Heartbeat

Heartbeat is the high availability software for Linux I intended to use as cluster manager. It is open source and thus free to download. The problems began when I started to configure Red0 and Red1. For instance, there is a file /etc/ha.d/authkeys whose mode must be set to 600, which corresponds to root read/write only. This is easily done using the chmod command, but if the mode is not set the software refuses to start, which is a bit confusing.
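The permission requirement can be illustrated with the following sketch; the path in the text is /etc/ha.d/authkeys, but the demonstration below uses a scratch copy, and the shared secret is a placeholder, not taken from the thesis.

```shell
# Illustrate the required authkeys mode using a scratch copy of the file;
# the secret below is a placeholder.
tmp=$(mktemp -d)
printf 'auth 1\n1 sha1 PlaceholderSecret\n' > "$tmp/authkeys"
chmod 600 "$tmp/authkeys"        # root read/write only, as Heartbeat demands
stat -c '%a' "$tmp/authkeys"     # prints 600
```

On the real system the same chmod is applied to /etc/ha.d/authkeys itself.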

When the Heartbeat software starts, it creates a virtual interface, typically eth0:0, which is bound to the real interface eth0. If the active node is shut down, this interface is rebound to the standby node’s corresponding interface within a specified time interval. It works well and seems stable, but before integrating it with any virtual shared storage I tried to fail over a service presenting static information, i.e. a web server. I installed Apache, a web server, and successfully moved the service from Red0 to Red1 transparently to Pinkbox, even though Red0 was rebooted while Pinkbox was concurrently downloading files from it.

9.1.7   Integrating the Software Components into a Complete System

When every component worked satisfactorily I started integrating them one by one to finally build a complete system. I started with Heartbeat and a NFS configuration exporting a local device, /dev/hda10. The NFS processes were successfully restarted at the standby node when the active node was shut down, but the Pinkbox client was unable to access the exported partition after the fail-over. When trying to access a file or just display a directory listing, the following message was returned by the server: “Stale NFS handle”. According to the NFS specifications this is returned because the file referred to by that file handle no longer exists, or access to it has been revoked [RFC1094] [RFC1813]. Thus a fail-over using the two nodes’ separate disks is of course impossible at the protocol level, since the file systems are separate; this is not really a problem, just an observation.

Another problem with the NFS implementation arose when I tried to add a NBD RAID mirror to serve as virtual shared storage. I manually failed over the NFS service and remounted the local device previously used by the NBD server, but the client still complained about “Stale NFS handle”, even though the files were there and access was granted by the server. Reading the NFS specifications once more revealed another NFS feature: exports and mounts are hardware dependent. In Linux every device has a pair of numbers; in short, they indicate what type of driver to use when accessing a specific device. These numbers are called the major and minor numbers, and a directory listing of the device directory /dev/ produces the following output (edited to fit into the report):

brw-rw----   1 root   disk    3,   0  Mar 24  2001  hda
brw-rw----   1 root   disk    3,   1  Mar 24  2001  hda1
brw-rw----   1 root   disk    3,   2  Mar 24  2001  hda2
brw-rw----   1 root   disk    9,   0  Mar 24  2001  md0
brw-rw----   1 root   disk    9,   1  Mar 24  2001  md1
brw-rw----   1 root   disk   43,   0  Mar 24  2001  nb0
brw-rw----   1 root   disk   43,   1  Mar 24  2001  nb1
brw-r--r--   1 root   root   43,   0  Oct 16  2001  nda

The two numbers before the date are the device’s major and minor numbers, and the last column is the device name.

Hda is the first ATA disk and hda1 is the first partition on the first ATA disk; its major number is 3 and the 1 refers to which partition it is. When using software RAID, a device called mdx is created. As seen for the device md0, its major number is 9 and its minor number is 0; thus its major number differs from hda’s major number. This difference causes a “Stale NFS handle” error, and thus I found it hard to use a NBD RAID in a NFS fail-over configuration. Nbx is a DRBD device and nda is a NBD device. Since an NFS fail-over configuration using DRBD as virtual shared storage accesses the physical device via nbx at both nodes, the problem associated with different major numbers is eliminated.
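A device’s major and minor numbers can be inspected with stat; /dev/null (character device 1, 3) is used below as a portable example, but the nb and md devices discussed above would be checked the same way.

```shell
# stat prints a device file's major (%t) and minor (%T) numbers in hex;
# /dev/null is used here because it exists on every Linux system.
stat -c '%n major=%t minor=%T' /dev/null
# prints: /dev/null major=1 minor=3
```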

I discarded the NBD RAID solution in favour of DRBD and tried to use Heartbeat together with two scripts distributed with the packages, datadisk and filesystem, to finally fail over a NFS service. When I shut down the active node, the services were restarted and the DRBD device remounted. Despite compatible major numbers, a homogeneous file system and identical access rights, the only message I got was “Stale NFS handle”. After some research I found out that when the NFS server is started, a daemon called the NFS state daemon (rpc.statd) is also started. It maintains NFS state-specific information, such as which file systems are mounted and by whom, typically in /var/lib/nfs/. On the active node, with DRBD running and mounted at /shared/, I created the following tree structure as seen from the root:

   /
      shared/
         nfs/
         export/
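A Heartbeat resource configuration tying these pieces together might look like the following sketch of /etc/ha.d/haresources; the node name, the service address and the exact script arguments are assumptions for illustration, not taken from the thesis.

```
# /etc/ha.d/haresources (sketch): on node red0, claim the service address,
# activate the DRBD resource via datadisk, mount it, then start the NFS server.
red0 192.168.1.100 datadisk::drbd0 filesystem::/dev/nb0::/shared::ext2 nfsserver
```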

Since DRBD is running, the same file modifications are of course also carried out on the standby node’s disk. I moved the NFS state information, by default stored in /var/lib/nfs/, to the /shared/nfs/ folder and created a symbolic link from the original location to the copy on the shared storage. Thus it is possible for the standby node to access exactly the same information about the current NFS server state as the active node in case of a fail-over. The /shared/export/ folder is the actual file system exported by the NFS server. Because it is a DRBD device, every file operation is automatically mirrored to the standby node’s disk. I also found it necessary to append the following line to the existing file system table, typically /etc/fstab, because the file system should not be mounted automatically and should only be accessible from the currently active node:

/dev/nb0              /shared             ext2    noauto                    0   0
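The relocation of the NFS state directory described above can be sketched as follows; the demonstration runs in a scratch directory, whereas on the real system the paths are /var/lib/nfs and /shared/nfs as in the text.

```shell
# Demonstrate the move-and-symlink trick in a scratch directory.
root=$(mktemp -d)
mkdir -p "$root/var/lib/nfs" "$root/shared"
echo state > "$root/var/lib/nfs/state"
mv "$root/var/lib/nfs" "$root/shared/nfs"     # move the state onto the mirrored disk
ln -s "$root/shared/nfs" "$root/var/lib/nfs"  # symbolic link from the original path
cat "$root/var/lib/nfs/state"                 # prints: state
```

After the move, both nodes resolve /var/lib/nfs to the mirrored copy, so the standby node sees the same NFS state after a fail-over.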

With these minor tweaks the fail-over worked, and a restarted node automatically resynchronises its DRBD-managed disks; all actions are carried out in the background and are thus hidden from clients using the exported file system. The only thing a client process notices is a short delay of approximately 5 to 10 seconds, which is the time it takes for the standby node to declare the active node dead and to restart the necessary services.

9.2     BLACKBOX

The Redbox testing was somewhat limited since all nodes were controlled from Red0, whose processor board was a prototype with external VGA, keyboard and mouse connectors. If Red0 is rebooted, all possibilities to monitor a fail-over are lost.

        Figure 30 – Overview of the Blackbox prototype.

9.2.1   Hardware Configuration

Blackbox is assembled from conventional computer components available in most computer stores, primarily for cost efficiency and simplicity. Compared with the cPCI components used in Redbox this prototype is considered cheap, but it has great theoretical performance potential. The only limitation besides cost was of course that Linux must support the hardware components.

The problems began with delivery delays; the last components arrived less than two weeks before the project’s end date, and this influenced the time plan negatively.

Despite thorough research into Linux hardware support, the hardware caused compatibility problems when installing the operating system. Many of the drivers provided did not work properly and required tweaking and recompiling. Unfortunately Black1’s main board’s AGP port was not working properly, and together with a failed SCSI disk I spent many hours seeking the problems’ origin. Apart from hardware compatibility problems and delivery delays, the hardware was extremely unstable. I tried to cool down the systems with four extra fans each, but this only raised them to a modest level of stability.

I used the same network components as for Redbox, but utilised Blackbox’s additional Gigabit Ethernet interfaces for disk replication. One of the two 100 Mbit/s Ethernet networks was used exclusively for Heartbeat signalling, but as for Redbox I also used a serial null modem to provide heartbeat redundancy. The second 100 Mbit/s network is connected to an Ericsson UAB internal LAN with access to the Internet.

Originally I also intended to use a GESB–SCB coupling to aggregate ten 100 Mbit/s Ethernet links in order to utilise the server’s 1000 Mbit/s interface. Unfortunately the delays associated with the Blackbox hardware and poor access to TSP equipment forced me to skip this configuration; I only used a SCB and the Superstack.

9.2.2   Software Configuration

Most software and configuration files from Redbox were re-used in Blackbox; the only updated software was DRBD, of which Blackbox utilises release 0.6.1-pre7. Some minor modifications to the configuration files were of course needed to reflect the change in hardware and communication paths.

If an NFS export is configured to use sync and no_wdelay, two /etc/exports specific parameters, the NFS server’s request-response behaviour is extremely slow. sync is used to synchronise file system operations, and no_wdelay forces file operations to be carried out immediately, preventing them from being buffered. A network interface monitor revealed strange behaviour: when a client performs a file operation, e.g. a file copy, the communication between the servers is extremely low for about 5–10 seconds; then, for a short period of time, a fast burst of data is sent to the server and the operation finishes. I contacted the developer to solve the problem, and his suggestion was to use DRBD protocol B instead, or to recompile the software with a small fix he sent me. I suggest that the simplest solution is to skip these parameters; no file system corruption has yet been detected despite several tests, and it was important for the delayed project to move on.
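For reference, an /etc/exports line using the two parameters might look like the following sketch; the exported path follows the text, while the client subnet is an example.

```
# /etc/exports (sketch): export the shared directory with synchronous,
# unbuffered writes; 192.168.1.0/24 is an example client subnet.
/shared/export  192.168.1.0/24(rw,sync,no_wdelay)
```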

There are many databases with information about supported Linux hardware.
IPTraf, a console-based network statistics monitor for Linux.


Benchmarking means measuring the speed with which a computer system executes a computing task [Balsa97]. It is difficult to specify and create valid benchmarking tools; measurements are often abused and seldom used correctly. Deeper knowledge of benchmarking is somewhat peripheral to this thesis, since it is a wide and difficult area that requires a lot of time.

From the thesis project’s point of view the most interesting measurement is how fast clients can read and write a mounted NFS partition. But since I have implemented two prototypes, Redbox and Blackbox, with significantly different hardware configurations, I also found it interesting to benchmark more specific parts of the system. It is difficult to find bottlenecks, but I have tried to measure the most important components influencing the overall system performance. Each test could of course be divided further into even smaller benchmark tests, but since the critical factor is time and benchmarking is time-consuming, I decided that the general tests were enough.


When the Redbox implementation was working and running at an acceptable level of stability, I began looking for performance measurement tools. Since the prototype’s intended use is to export a network file system, I decided to measure the components involved in the process of reading and writing a file: block device, local file system, network, DRBD and network file system. I had some difficulties finding appropriate software tools for Linux, especially tools for hardware-specific performance benchmarking, which is why I dropped the block device performance measurements. All software tools I used and describe here are open source software or free of charge.

The Standard Performance Evaluation Corporation focuses on a standardised set of relevant benchmarks and metrics for performance evaluation of modern computer systems [SPEC]. The tests are unfortunately not free of charge.

An interesting Linux NFS client performance project [LinuxNFS] is carried out at the University of Michigan; they have composed a set of benchmarking procedures that is useful when evaluating NFS under Linux.

10.1.1   BogoMips

Linus Torvalds, the original author of Linux, invented the BogoMips concept, but its intended use is not to serve as a benchmarking tool. The Linux kernel uses a timing loop that must be

         calibrated to the system’s processor speed at boot time. Hence, the kernel measures
         how fast a certain kind of loop runs on a computer each time the system boots. This
         measurement is the BogoMips value and a system’s current BogoMips rating is stored
         in the processor’s state information, typically /proc/cpuinfo.

BogoMips is a compilation of “Bogo” and “MIPS”, which should be interpreted as “bogus” and “millions of instructions per second”. BogoMips is related to the processor’s speed and is sometimes the only portable way of getting some information about different processors’ speeds, but it is totally unscientific. It is not a valid computer speed measurement and it should never be used for benchmark ratings. Despite these facts there are lots of benchmark ratings derived from BogoMips measurements. Somebody humorously defined BogoMips as “the number of million times per second a processor can do absolutely nothing”. Though not a scientific statement, it illustrates the BogoMips concept’s loose correlation with reality, which is why I mention the BogoMips concept.
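The rating mentioned above can be read directly from the processor state information:

```shell
# BogoMips as calibrated by the kernel at boot time.
grep -i bogomips /proc/cpuinfo
```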

10.1.2   Netperf

Netperf is a benchmark for measuring network performance. It was developed by Hewlett-Packard and was originally targeted at UNIX, but is now distributed for Linux as well. Documentation and source are available online.

Netperf’s primary focus is on bulk data transfers, referred to as “streams”, and request/response performance using either TCP or UDP and BSD sockets [HP95]. When the network performance is measured between two hosts, that is, how fast one host can send data to another and/or how fast the other host can receive it, one host acts as server and the other as client. The server can be started manually as a separate process or via inetd. The Netperf distribution also provides several scripts used to measure TCP and UDP stream performance. The default Netperf test is the TCP stream test and it typically creates the following output:

         $ ./netperf
         Recv          Send          Send
         Socket Socket               Message         Elapsed
         Size          Size          Size            Time            Throughput
         bytes         bytes         bytes           secs.           Kbytes/sec

              4096       4096          4096          10.00           8452.23

The BSD socket is a method for accomplishing inter-process communication, used to allow one process to speak to another.

The Internet Daemon; current Linux distributions ship with the improved xinetd instead.

When I performed the benchmark tests I made use of the provided scripts, which test the network performance for different socket and packet sizes over a fixed test time. I benchmarked both systems in both directions, i.e. I measured the Redbox internal performance both from Red0 to Red1 and from Red1 to Red0, so each host acted as both server and client. If the results correlate, I assume that I have eliminated the possibility that one host runs “faster” as server or as client than the other, a possibility that would of course affect the total result.

10.1.3   IOzone

IOzone is a free file system benchmark tool available for many different computer systems [IOzone01]; its source and documentation are available online. IOzone is able to test file system I/O with a broad set of file operations: read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread, mmap, aio_read and aio_write. It is also possible to use IOzone for NFS-specific benchmarking; typical tests are read and re-read latency tests for operations of different sizes.

The benchmark tests utilised IOzone’s fully automatic mode to test all file operations for record sizes from 4 kBytes to 16 MBytes and file sizes from 64 kBytes to 512 MBytes. The typical command line I used looks like:

         $ ./iozone -a -z -b result.xls -U /IOtest

The -a and -z options force IOzone to test all possible record sizes and file sizes in automatic mode. -b is used to specify the Excel output file, and I have also used the mount point option, -U, which unmounts and remounts the specified mount point between tests. This guarantees that the buffer cache is flushed. To use this option the mount point must exist and be specified in the file system table, a file that contains descriptive information about the various file systems, typically /etc/fstab.

10.1.4   Bonnie

Bonnie is also a file system benchmark, but using it was troublesome. According to the brief Bonnie user manual it is important to use a file at least twice the size of the RAM. Since the systems are equipped with 1 GB RAM I naturally used a 2 GB file. That particular file size caused file system errors and forced the systems to halt. File sizes less than twice the amount of RAM result in invalid values, and hence it was impossible to conduct any benchmarking with the Bonnie software.

10.1.5   DRBD Performance

DRBD replication is the process where data is copied from the active server to the standby server. Since a write on the active side is acknowledged only when the information has also been written to the standby node, this is an important component of the overall performance of the system. DRBD replication was measured using the

       performance script distributed along with the DRBD source code. The results were
       also confirmed with a network utilisation monitor that shows the momentary amount of
       data passing the network interface.

DRBD resynchronisation was measured using the Linux command time, which reports the elapsed time while the operation is performed. The test is simple and may not be totally accurate, but I believe it is an acceptable approximation of the performance. I tested resynchronisation on several partitions of different sizes.
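As an illustration of the measurement approach (not of DRBD itself), time(1) can wrap any disk operation; the file path and size below are examples only.

```shell
# time reports the elapsed wall-clock time of the wrapped command; here a
# stand-in write of 64 MBytes of zeroes replaces the actual resynchronisation.
time dd if=/dev/zero of=/tmp/resync_test bs=1M count=64
```

Dividing the amount of data by the reported elapsed time gives the approximate throughput.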


I started by testing different network paths: Red0 to Red1 and Pinkbox to Redbox. There was no remarkable difference depending on the path; the link utilisation was about 8–10.8 MBytes/s, which I believe is a good result. The result depends on the socket sizes as well as the message sizes used by Netperf during the benchmark.

According to the DRBD performance script, the maximum replication speed between the two nodes is about 9.5–10.5 MBytes/s. These values are close to the maximum network bandwidth.

The author of DRBD claims that the maximum resynchronisation speed is approximately 7 MBytes/s. It is today the limiting factor, and my measurements correspond to this value despite the link’s higher bandwidth. Resynchronisation is a process where both reading and writing are involved. It is highly prioritised by the author, and resynchronisation is going to be improved in newer versions of DRBD.

I tried to use IOzone and Bonnie to test the NFS file system read/write performance, but it was troublesome. Therefore I wrote my own scripts, which rapidly write numerous files of various sizes to a mounted NFS file system. On the order of 1000 files were written, and during this time I monitored the network utilisation. The total time for the file writes and the amount of data written were also used to approximate the write performance over a longer period. From the results I estimated the write performance to be approximately 6–7 MBytes/s.
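The scripts themselves were not published in the thesis, but a minimal sketch of the approach, writing a burst of files of increasing size, might look like the following; the target directory, file count and sizes are assumptions, and on the real system the target would be the mounted NFS file system.

```shell
# Write a burst of files of increasing size and count them afterwards;
# the scratch directory stands in for the mounted NFS file system.
dir=$(mktemp -d)
for i in $(seq 1 10); do
    dd if=/dev/zero of="$dir/file$i" bs=64k count="$i" 2>/dev/null
done
ls "$dir" | wc -l   # prints 10
```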


As for Redbox, I tested the Ethernet links between Black0 and Black1 and from Red0 to Blackbox. The Blackbox internal link is Gigabit Ethernet and the result was about 70–95 MBytes/s. Since Red0 uses a 100 Mbit/s interface it makes no difference that Blackbox utilises a Gigabit interface; the result corresponds with the Redbox results.

Since I used Gigabit Ethernet links with much greater bandwidth, I expected higher DRBD replication speed. Sadly, DRBD does not today fully support the usage of Gigabit; only a small increase of about 1 MBytes/s was noticed in the tests.

As with the DRBD replication results, the resynchronisation is limited by the software rather than the hardware, and no major improvements compared to Redbox were noticed.
Using the same approach as for the Redbox NFS write performance, I noticed a small increase in speed, but since this method is considered unscientific, no conclusions are drawn from the increase.


Despite the fact that Blackbox is superior to Redbox in virtually all aspects, the results did not differ much except for the Gigabit links. Currently the DRBD software is the limiting factor, but hopefully this changes as newer versions are released.

In my final week I had the opportunity to very briefly familiarise myself with a commercial NAS from EMC; its fundamental design is similar to Redbox and Blackbox, but it utilises Fibre Channel as internal storage interface and presents the clients with a Gigabit interface.

I performed the same write tests as mentioned above; mounting the file system from Red0, I reached a maximum of 8 MBytes/s. Mounting the NFS file system from Black0 and utilising the Gigabit interfaces, the corresponding value was 14.5 MBytes/s. I emphasise that these tests are not themselves verified, and it is uncertain whether the results are significant.


When the prototypes were up and running I tested different fault scenarios. Both prototypes sustain a complete in-cluster node failure, that is, removing the power from one of the nodes. Since both prototypes are equipped with redundant heartbeat paths they also survive any single heartbeat path failure, both Ethernet and null modem.

If the Ethernet cable connecting the prototypes to the clients is disconnected, all file system activity halts until it is reconnected. When the connection utilised by the DRBD replication was removed, the mirroring also halted. After the cable was reconnected, the software started communicating with the lost DRBD device and initiated resynchronisation.


Personally, this project has been really successful in terms of new knowledge and experience, but I believe that the project’s scope was somewhat wide; making a reliable network mass storage incorporates many hardware and software components. The most troublesome part was to find an appropriate solution that was feasible within the project’s time frame.

Currently the prototypes sustain a complete in-cluster node failure or failures that cause kernel panics, e.g. a hardware fault. Kernel panics restart the failing node and initiate a fail-over at the standby node. Concurrent client reads and writes are possible during a fail-over, subject only to the time delay associated with the fail-over.

11.1   GENERAL

There are many solutions regarding reliable storage, but it was somewhat troublesome to find an open source solution that applies to telecom equipment. There are many commercial solutions available, but most of them rely on datacom techniques that differ considerably from telecom requirements. In short, the task was a bit tougher than I initially thought: there are many theories but fewer actual implementations, and my primary task was to sift through all the information.

The choice of NFS as file system was made simply because TSP already supports NFS and because it is a well-known standard. In retrospect it might have been desirable to test another, more experimental file system, but at the time my focus was on making a system reliable, not on testing a new file system.

Because the software used is entirely written by the open source community, it is hard to really say anything about its quality. DRBD proved to be the limiting factor, resynchronisation is really slow, but if it is enhanced it will become a powerful component.


Due to the rather tight time frame and the requirement of a working prototype, the solution was rather limited. Several approaches that may drastically enhance the proposed prototype’s reliability and performance are however discussed in the next section.

An advantage of Redbox compared to any commercial solution is that it is mountable in GEM, since it is made solely of standardised Ericsson components. The only modification needed is carriers for the disk arrays. Since it is attached to the TSP via the GESB switchboards, no modifications to the TSP hardware are needed.

Simplicity pervades the proposal, and both prototypes are cost effective compared to other solutions. They both use standard components; common Ethernet for

communication, and no specialised hardware is required except for the RAID controllers used in Blackbox. The only real drawback is that twice the amount of disks is needed, an issue that could be solved by introducing shared storage.


There are fundamental differences between standard datacom equipment and telecom equipment, both quantitative and qualitative: there are many cheap datacom products, while telecom products are rather few, expensive and exclusive. It seems that availability and performance are the two most important characteristics for the telecom industry, even before cost. A telecom product must work under any circumstance, but many datacom products are not designed for redundant usage, not even “reliable” NAS solutions. This telecom and datacom contradiction affected the project, since the solution is based entirely on standard datacom components. When building the prototypes there was an obvious difference in hardware stability and reliability; the cPCI components used in Redbox were much more stable than the conventional PC hardware used to build Blackbox.


Linux is a really hot topic in datacom today, with its free networking applications and servers. Lately the open source community has become more accepted in other industrial areas as well, but the question is whether Linux and the rest of the open source community are mature enough to meet the demands of the telecom industry today.

I believe that it is possible to combine open source components to build a HA system, but whether it offers an availability high enough to meet the telecom requirements is uncertain. Perhaps, if the software components are improved and the hardware platform is based on state-of-the-art components, but the software modifications introduce new problems for commercial solutions. According to the GNU General Public License (GNU GPL), a software license intended to guarantee the freedom to share and change free software [GPL], all software under the GNU GPL remains free even if it has been modified. That is, all software based on free software must be free. I hardly believe that a company will happily spend loads of hours improving free software only to release it free again, for anyone to use, e.g. a competitor.

       Constant updates and patches make it hard to keep a system stable, at least stable
       enough to be called an HA system. Linux high availability is advancing, but I consider
       it optimistic to believe that open source software can be used “as is”, without any
       modifications. There are commercial Linux solutions that guarantee a certain level of
       availability, but these are often modified Linux distributions rather than clean open
       source.

11.5   TSP

       Does the prototype fulfil the TSP requirements for reliable network storage? As the
       prototypes stand at the moment, the answer must be no. Currently the prototypes are
       somewhat limited with respect to scalability, maintenance and reliability. Still, the
       basic idea of fail-over and virtual shared storage seems to work. It is a reliable
       approach that is utilised in various solutions. Since the system is made up of
       independent components, it is possible to exchange components individually to
       enhance the system characteristics, e.g. replacing the virtual shared storage with
       shared storage based on FC.

       The main disadvantage of the solution is that it is basically centralised. Despite its
       distributed file system images and clustering features, the information is really stored
       in one place, the two-node cluster. A distributed solution where all processor boards
       contribute would be more TelORB-like, but it would also be much more complex.

       During the second half of the project, long after the prototype implementation was
       finished, I found some really interesting solutions regarding distributed and fault-
       tolerant storage. These are briefly discussed in the next section.


       This section discusses improvements to the prototypes which are assumed to
       increase their performance as well as their availability. The suggestions are all
       existing solutions, but they have been neither integrated nor evaluated.


       Only one Ethernet network is used to access the network file system, and this SPOF
       must be eliminated. Ericsson has a solution, essentially a modified NFS
       implementation, that makes it possible to use redundant networks. The clients use
       whichever currently available network has the lowest traffic load. It may also be
       desirable to remove features that make NFS fail-over complex, such as state
       information and security. Since the file system is used in an internal protected
       network, this should not cause any problems.

       A specialised shared storage solution such as FC could be used to eliminate the
       bottleneck associated with the virtual shared storage. This would increase not only
       the performance but also the reliability and the scalability, since FC is specified to
       support hot swap and redundant media.

       Some sort of monitoring software that continuously monitors networks and local
       processes should be integrated. There is an open source solution called Mon that
       makes it possible to define actions that trigger on certain failures. Mon can restart
       local processes, redirect traffic if a network fails, and kill nodes that are considered
       active but are behaving strangely.
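
       As an illustration, a minimal Mon configuration for the two-node cluster might look
       like the fragment below. The hostgroup name, node names and alert address are
       hypothetical, and the exact syntax and the names of the monitor and alert scripts
       should be checked against the Mon documentation; this is only a sketch of the idea.

       ```
       # Hypothetical mon.cf fragment: ping-monitor the two cluster nodes
       # and mail the administrator when one of them stops responding.
       hostgroup cluster node1 node2

       watch cluster
           service ping
               interval 30s
               monitor fping.monitor
               period wd {Sun-Sat}
                   alert mail.alert admin@example.com
                   upalert mail.alert admin@example.com
       ```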

       Resource fencing is another possible improvement. It eliminates the possibility that
       both nodes believe they are active – the split-brain syndrome. When a fail-over is
       started, it automatically initiates a process where the failed node’s power is cut off.
       This forces the node to reboot and promises a more reliable fail-over, at least where
       split-brain is concerned.

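
       Such a fencing action can be sketched as a small script. The power switch
       command used here is purely hypothetical (a real setup would use the CLI of an
       actual remotely controllable power switch); the point is only the order of operations:
       cut power to the failed node first, and only then let the surviving node take over.

       ```shell
       #!/bin/sh
       # Sketch of a resource-fencing ("STONITH"-style) action. Power-cycling the
       # failed node before take-over guarantees that it can no longer write to the
       # mirrored storage, which eliminates the split-brain syndrome.

       : "${POWER_CMD:=powerswitch}"   # hypothetical remote power switch command

       fence_node() {
           node="$1"
           "$POWER_CMD" off "$node" || return 1   # cut power to the failed node
           sleep 1                                # let the outlet settle
           "$POWER_CMD" on "$node"                # restore power -> clean reboot
       }
       ```

       Only after fence_node returns successfully would the surviving node mount the
       mirrored storage and take over the NFS service.
       
       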
       The Logical Volume Manager (LVM) provides on-line storage management of disks
       and disk subsystems by grouping arbitrary disks into volume groups. The total
       capacity of a volume group can be allocated to logical volumes, which are accessed
       as regular block devices. These block devices are resizable while on-line, so if more
       storage capacity is needed it is just a matter of adding an extra disk and binding it to
       the correct volume group, without interrupting ongoing processes. Logical volumes
       hence decrease downtime and enhance maintainability as well as scalability.
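
       Growing the storage on-line might look as follows. The device names and sizes are
       made up for illustration, the command names follow the common LVM user tools
       (the LVM1 tools contemporary with this report differ slightly), and growing the file
       system itself on-line depends on the file system supporting it.

       ```shell
       # Create a volume group from two disks and carve out a logical volume
       pvcreate /dev/sdb /dev/sdc             # initialise the disks as physical volumes
       vgcreate vg_store /dev/sdb /dev/sdc    # group them into one volume group
       lvcreate -L 20G -n lv_nfs vg_store     # allocate a volume for the NFS export

       # Later, add a third disk and grow the volume without unmounting anything
       pvcreate /dev/sdd
       vgextend vg_store /dev/sdd             # bind the new disk to the volume group
       lvextend -L +20G /dev/vg_store/lv_nfs  # grow the block device on-line
       resize2fs /dev/vg_store/lv_nfs         # grow the file system to match
       ```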

            Mon is a Service Monitoring Daemon available at

             LVM is a storage management application for Linux. Further information is available from

       The Heartbeat software supports the use of the software Watchdog , a daemon that
       checks whether the system is still working properly. If programs in user space are no
       longer executed, it reboots the system. It is not as reliable as a hardware watchdog,
       because a total system hang also affects the Watchdog software, which is then
       unable to force the system to reboot. A hardware watchdog is therefore desirable,
       and such watchdogs exist as standard PCI cards as well as PMC modules.
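
       The watchdog principle can be sketched as a trivial user-space loop: as long as the
       loop keeps writing to the watchdog device, the timer is refreshed; if user space
       hangs, the kernel driver (or the hardware card) reboots the machine when the timer
       expires. The device path and interval below are the conventional ones, but should
       be checked against the driver in use; this is a sketch, not the Watchdog daemon itself.

       ```shell
       # Keep the watchdog device open and refresh the timer periodically.
       exec 3> /dev/watchdog        # opening the device arms the watchdog timer
       while true; do
           echo . >&3               # any write refreshes the timer
           sleep 10                 # must be shorter than the watchdog timeout
       done
       ```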


       During the project I found many interesting solutions regarding storage that promise
       increased performance, scalability, security and reliability. I briefly mention two of
       these projects.

       Network Attached Secure Disks (NASD) is a project at Carnegie Mellon University
       supported by the storage industry leaders. The objective, in short, is to move
       primitives such as data transfer, data layout and quality of service down to the
       storage device itself, while a manager is responsible for policy decisions such as
       namespace, access control, multi-access atomicity and client caches [Gibson99].
       More information is to be found at

       The Global File System (GFS) is a file system in which cluster nodes physically share
       storage devices connected via a network [Soltis96]. This shared storage solution tries
       to exploit the sophistication of new device technology. GFS distributes the file system
       responsibilities across the nodes and storage across the devices. Consistency is
       established by using a locking mechanism maintained by the storage device
       controllers. More information at:

            Watchdog is available from


     I would like to thank Johan Olsson, who gave me the opportunity to do my master’s
     thesis at Ericsson UAB, my supervisor Kjell-Erik Dynesen, who guided me
     throughout the project, and my examiner Mats Brorson.

     I would also like to thank all personnel at Ericsson Utvecklings AB, and especially
     everyone at KY/DR, who have been very helpful during my master’s thesis project.
     Besides all the help, you have made my time at Ericsson a pleasant one, with lots of
     floor ball and talk about Mr Béarnaise; I am especially proud of my bronze medal in
     Go-Cart.


     AFS          Andrew File System
     ANSI         American National Standards Institute
     ATA          Advanced Technology Attachment
     CORBA        Common Object Request Broker Architecture
     cPCI         Compact Peripheral Component Interconnect
     DAS          Direct Attached Storage
     DRBD         Distributed Replicated Block Device
     ECC          Error Correction Code
     EMC          Electromagnetic Compatibility
     ESD          Electrostatic Discharge
     FC           Fibre Channel
     FC-AL        Fibre Channel Arbitrated Loop
     GPRS         General Packet Radio Service
     HDA          Head-Disk Assembly
     I/O          Input/Output
     IDL          Interface Definition Language
     IP           Internet Protocol
     IPC          Inter-Process Communication
     kBytes       Kilobytes
     MB           Megabyte
     MBytes/s Megabytes per Second
     Mbit/s       Megabits per Second
     MTTF         Mean Time to Failure
     MTBF         Mean Time between Failures
     NAS          Network Attached Storage
     NBD          Network Block Device
     NFS          Network File System
     PC           Personal Computer
     PCI          Peripheral Component Interconnect
     RAID         Redundant Array of Independent Disks
     RFC          Request for Comments
     RPC          Remote Procedure Call
     RPM          Revolutions per Minute
     SAN          Storage Area Network
     SCSI         Small Computer System Interface
     SPOF         Single Point of Failure
     SS7          Signalling System number 7
     TCP          Transmission Control Protocol
     TSP          The Server Platform
     UDP          User Datagram Protocol

UMTS         Universal Mobile Telecommunications System
VIP          Virtual IP
ZBR          Zone Bit Recording



       [Balsa97]             André D. Balsa, 1997, “Linux Benchmarking HOWTO”, 2001-11-1

       [Barr01]              Tavis Barr, Nicolai Langfeldt, Seth Vidal, 2001, “Linux NFS-HOWTO”, 2001-10-10

       [Dorst01]             Wim van Dorst, 2001, “BogoMips mini-Howto”, 2001-11-13

       [GPL]                 GNU General Public License, 2002-

       [IBM01]               2001-10-15

       [IBM02]               “The AFS File System In Distributed Computing Environments”, 2002-02-11

       [IOzone01]            “IOzone File system Benchmark”, 2001-11-

       [LinuxNFS]            University of Michigan, “Linux NFS Client performance”, 2001-11-13

       [Nielsen96]           Jakob Nielsen, 1996, “The Death of File Systems”, 2001-10-2

       [SPEC]                Standard Performance Evaluation Corporation, 2002-02-25

15.2   PRINTED

       [Brown97:1]           Aaron Baeten Brown, 1997, “A Decompositional Approach to
                             Computer System Performance Evaluation”, Center for Research
                             in Computing Technology, Harvard University

       [Brown97:2]            Aaron B. Brown, Margo I. Seltzer, 1997, “Operating System
                             Benchmarking in the Wake of Lmbench: A Case Study of the
                              Performance of NetBSD on the Intel x86 Architecture”,
                             Proceedings of the 1997 ACM SIGMETRICS Conference on
                             Measurement and Modeling of Computer Systems

       [Chen93]              Peter Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz,
                             David A. Patterson, 1993, “RAID: High-Performance, Reliable
                             Secondary Storage”, ACM Computing Surveys

[Du96]                D. Du, J. Hsieh, T. Chang, Y. Wang, S. Shim, 1996, “Performance
                      Study of Serial Storage Architecture (SSA) and Fibre Channel -
                      Arbitrated Loop (FC-AL)”, Technical Report at Computer Science
                      Department, University of Minnesota

[Ericsson01]          Ericsson Internal Information, 2001, “ANA 901 02/1 System

[Ericsson02]          Ericsson Internal Information, 2001, “TelORB System Introduction”

[Gibson99]            Garth A. Gibson, David F. Nagle, William Courtright II, Nat Lanza,
                      Paul Mazaitis, Marc Unangst, Jim Zelenka, 1999, “NASD Scalable
                      Storage Systems”, Carnegie Mellon University, Pittsburgh

[HP95]                Information Networks Division Hewlett-Packard Company, 1995,
                      “Netperf: A Network Performance Benchmark, Revision 2.0”

[Katz89]              Randy H. Katz, Garth A. Gibson, David A. Patterson, 1989, “Disk
                      System Architectures for High Performance Computing”,
                      University of California at Berkeley

[Nagle99]             David F. Nagle, Gregory R. Ganger, Jeff Butler, Garth Goodson,
                      Chris Sabol, 1999, “Network Support for Network-Attached
                      Storage”, Carnegie Mellon University, Pittsburgh

[O’Keefe98]           Matthew T. O’Keefe, 1998, “Shared File Systems and Fibre
                      Channel”, Proceedings of the Sixth NASA Goddard Space Flight
                      Conference on Mass Storage Systems and Technologies

[Patterson99]         David A. Patterson, 1999, “Anatomy of I/O Devices: Magnetic
                      Disks”, Lecture Material, University of California at Berkeley

[Patterson89]         David A. Patterson, Peter Chen, Garth Gibson, Randy H. Katz,
                      1989, “Introduction to Redundant Arrays of Inexpensive Disks
                      (RAID)”, Proceedings Spring COMPCON Conference, San
                      Francisco

[Patterson88]         David A. Patterson, Garth Gibson, Randy H. Katz, 1988, “A case
                      for Redundant Arrays of Inexpensive Disks (RAID)”, University of
                      California at Berkeley

[RFC1094]              Sun Microsystems Inc, 1989, “NFS: Network File System Protocol
                      Specification”, Request for Comments: 1094, IETF

[RFC1813]              B. Callaghan, B. Pawlowski, P. Staubach, 1995, “NFS Version 3
                      Protocol Specification”, Request for Comments: 1813, IETF

[Reisner01]           Philipp Reisner, 2001, “DRBD”, Proceedings of UNIX en High
                      Availability, Netherlands UNIX User Group

[Robertson00]         Alan Robertson, 2000, “Linux-HA Heartbeat System Design”,
                      Proceedings of the 4th Annual Linux Showcase & Conference,

[Schulze89]           Martin Schulze, Garth Gibson, Randy Katz, David Patterson,
                      1989, “How reliable is RAID?”, Proceedings Spring COMPCON
                      Conference, San Francisco

[Shim97]              Sangyup Shim, Taisheng Chang, Yuewei Wang, Jenwei Hsieh,
                      David H.C. Du, 1997, “Supporting continuous media: Is Serial
                      Storage Architecture (SSA) better than SCSI?”, Proceedings of the
                      1997 International Conference on Multimedia Computing and
                      Systems (ICMCS '97)

[Soltis96]            Steven R. Soltis, Thomas M. Ruwart, Matthew T. O’Keefe,
                      1996, “The Global File System”, Proceedings of the Fifth NASA
                      Goddard Space Flight Conference on Mass Storage Systems and
                      Technologies