					 TPT-RAID: a High Performance
Box-Fault Tolerant Storage System


                           Erez Zilber




                           Research Thesis


  Submitted in Partial Fulfillment of The Requirements for the
   Degree of Master of Science in Electrical Engineering


                             Erez Zilber


Submitted to the Senate of the Technion - Israel Institute of Technology


      TEVET, 5767              HAIFA                 January 2007
The research thesis was done under the supervision
of Dr. Yitzhak Birk in the Electrical Engineering Department.




Contents

1 Introduction                                                            12
  1.1 Single-box RAID                                                     13
  1.2 Multi-box RAID                                                      13
  1.3 Related work                                                        15

2 Data transfer and interconnects                                         17
  2.1 Fibre Channel                                                       18
  2.2 RDMA overview                                                       18
  2.3 InfiniBand overview                                                 19
      2.3.1 Memory registration                                           20

3 Storage systems overview                                                21
  3.1 SCSI overview                                                       21
  3.2 iSCSI overview                                                      22
  3.3 iSER overview                                                       24
  3.4 RAID overview                                                       26
  3.5 Multi-box RAID with iSER                                            28

4 3rd Party Transfer RAID (TPT-RAID) architecture                         29
  4.1 Overview                                                            29
  4.2 Login and logout                                                    29
  4.3 Request execution                                                   31
      4.3.1 3rd Party Transfer                                            31
      4.3.2 Parity calculation in the targets                             34
      4.3.3 Required protocol changes                                     34
      4.3.4 Symbols and Variables                                         35
      4.3.5 READ requests                                                 35
      4.3.6 Parity block handling in READ requests                        37
      4.3.7 WRITE requests                                                39
      4.3.8 Considerations in ECC calculation                             48
      4.3.9 Two-phase commit in WRITE requests                            51
  4.4 Maximum Throughput comparison                                       54
      4.4.1 READ requests                                                 55
      4.4.2 WRITE requests                                                56
  4.5 Latency comparison                                                  59
      4.5.1 READ requests                                                 59
      4.5.2 WRITE requests                                                60
  4.6 Comparison of data transfer and number of operations                66
      4.6.1 READ requests                                                 66
      4.6.2 WRITE requests                                                66
  4.7 Error handling                                                      69
      4.7.1 Request execution failure                                     69
      4.7.2 Degraded mode                                                 69
  4.8 Concurrent execution of multiple requests                           71

5 Proof of correctness and completeness                                   72
  5.1 Assumptions                                                         72
  5.2 Proof of correctness                                                72
      5.2.1 READ requests                                                 72
      5.2.2 WRITE requests                                                72
  5.3 Proof of completeness                                               74

6 Prototype and performance analysis                                      75
  6.1 System prototypes                                                   75
      6.1.1 Hardware                                                      75
      6.1.2 Software                                                      75
  6.2 Throughput comparison                                               78
      6.2.1 READ requests                                                 78
      6.2.2 WRITE requests                                                82
  6.3 Scalability                                                         85
  6.4 Latency comparison                                                  90
      6.4.1 READ requests                                                 90
      6.4.2 WRITE requests                                                91
      6.4.3 Comparison with theoretical calculation                       92
  6.5 Comparison with single-box RAID                                     96
      6.5.1 READ requests                                                 96
      6.5.2 WRITE requests                                                97

7 Other RAID types                                                        98
  7.1 Mirroring                                                           98
      7.1.1 READ requests                                                 98
      7.1.2 WRITE requests                                                98
      7.1.3 Scalability                                                   99
  7.2 RDP                                                                 99

8 Summary                                                                102

A Required changes to protocols                                          103
  A.1 Changes to the SCSI protocol                                       103
  A.2 Changes to the iSCSI protocol                                      107
      A.2.1 PDU formats                                                  108
      A.2.2 Login and logout                                             108
  A.3 Changes to the Datamover protocol                                  108

B READ and WRITE request examples                                        110
  B.1 READ request                                                       110
      B.1.1 Baseline system                                              110
      B.1.2 TPT-RAID system                                              111
  B.2 WRITE request                                                      112
      B.2.1 Baseline system                                              112
      B.2.2 TPT-RAID system                                              113

C Maximum throughput — detailed comparison                               120
  C.1 READ requests                                                      120
  C.2 WRITE requests                                                     124

List of Figures

   1   Network Attached Storage (nas)                                     12
   2   Storage Area Network (san)                                         12
   3   Multi-box raid                                                     14
   4   Host-target data path in networked storage                         17
   5   scsi Client-Server model                                           22
   6   iscsi Write Sequence                                               23
   7   iscsi Read Sequence                                                23
   8   iscsi over iser                                                    24
   9   Example of scsi read/write with iscsi over iser ([22])             25
  10   iser header ([22])                                                 25
  11   raid-4                                                             26
  12   raid-5                                                             27
  13   Partial and complete stripes                                       27
  14   Baseline raid                                                      28
  15   tpt-raid                                                           29
  16   3rd Party Transfer between a host and targets                      33
  17   Parity calculation in the binary tree                              34
  18   read request in a Baseline raid                                    36
  19   read request in a tpt-raid                                         37
  20   Reading parity blocks in read requests                             38
  21   write request in a Baseline raid                                   40
  22   write request in a tpt-raid                                        42
  23   ECC calculation methods                                            50
  24   Prototypes software architecture                                   77
  25   read request throughput                                            79
  26   Controller’s tx thread cpu usage in read requests                  80
  27   read request throughput with and without parity blocks             81
  28   write request throughput                                           84
  29   Baseline controller’s cpu usage in write requests                  85
  30   tpt controller’s cpu usage in write requests                       86
  31   System scalability (read requests)                                 88
  32   System scalability (write requests)                                89
  33   read request latency                                               90
  34   read request latency = f(block size) with request size = 8mb       91
  35   write request latency                                              92
  36   write request latency = f(block size) with request size = 8mb      93
  37   Calculated and actual read request latency difference              94
  38   Calculated and actual read request latency difference (empirical approximation)   94
  39   Calculated and actual write request latency (Baseline raid)        95
  40   Calculated and actual write request latency (tpt-raid)             95
  41   3rd Party Transfer latency overhead in read requests               96
  42   3rd Party Transfer cpu overhead in read requests                   97
  43   Maximum throughput for read requests (mirroring)                   99
  44   Maximum throughput for write requests (mirroring)                 100
  45   Maximum throughput for write requests (rdp-raid)                  101
  46   prep read command                                                 103
  47   prep write command                                                104
  48   read old block command                                            104
  49   read new block command                                            105
  50   read parity part tmp command                                      105
  51   read parity comp tmp command                                      106
  52   read parity part block command                                    106
  53   read parity comp block command                                    107
  54   write new data command                                            107
  55   Requested blocks example                                          110
  56   Example of a binary tree for parity calculation in partial stripes    117
  57   Example of a binary tree for parity calculation in complete stripes   118

List of Tables

  1   ecc calculation methods                                             49
  2   Data transfers (in blocks) during request execution                 68
  3   Operations during request execution                                 68
  4   raid controller scalability                                         88
  5   raid controller scalability (mirroring)                            100
  6   raid controller scalability (rdp)                                  101

                                       Abstract
    Storage devices are inexpensive. Reliable storage systems, however, remain expen-
sive, and even they are susceptible to a box-level failure, rendering an entire ecc group
unavailable. One solution is a multi-box raid, wherein each error-correction group uses
at most one block from each storage box. We introduce tpt-raid, a highly available,
scalable yet simple multi-box raid. tpt-raid extends the idea of an out-of-band san
controller into the raid: data is sent directly between hosts and targets and between
targets, and the raid controller supervises ecc calculation performed by the targets.
This prevents a communication bottleneck in the controller and improves performance
dramatically while retaining the simplicity of centralized control. tpt-raid moreover
conforms to a conventional switched network architecture, whereas an in-band raid
controller would either constitute a communication bottleneck or would have to be
constructed as a full-fledged router.
    tpt-raid can be implemented as a software extension to a san controller without
hardware changes. This and tpt-raid’s scalability are demonstrated by our tpt-
raid prototype that uses InfiniBand, an emerging very high speed interconnect with
rdma capability. We prove the correctness and completeness of tpt-raid. Finally, we
describe the required protocol extensions.




   Definitions and acronyms

Definitions

Request - An iscsi command sent from a host to a raid controller.

Block - A chunk of data in a disk, which is part of a raid.

Request size - The size of an iscsi command sent from a host to a raid controller.

Transfer size - The size of data that is sent from a target to a controller or a host in a
  single rdma operation.

Block size - The size of a block in a disk, which is part of a raid.



Acronyms

CA (Channel Adaptor): A device that terminates a link and executes transport-level
 functions.

CDB (scsi Command Descriptor Block): Defines a structure with all arguments needed to
 perform a scsi command.

CM (Communication Manager): Supports InfiniBand’s communication management
 mechanisms and protocols.

DAS (Direct Attached Storage): A storage system directly attached to a server or
 workstation.

DDP (Direct Data Placement protocol): Provides information to place the incoming data
 directly into an upper layer protocol’s receive buffer without intermediate buffers.

DI (Datamover Interface): The interface between the iscsi Layer and the Datamover
  Layer.

DMA (Direct Memory Access): A technique that allows hardware to access system
 memory without passing it through the cpu.

ECC (Error Correcting Code): Enables detection of data corruption introduced by errors,
  and repair of the corrupted data.

FC (Fibre Channel): A high speed network technology that is mainly used for storage
  networking.

HCA (Host Channel Adaptor): A channel adapter that supports the InfiniBand Verbs
 interface.

IB (InfiniBand protocol): A high speed network protocol for the transmission of data
  between processors and i/o devices.

iSCSI (Internet scsi): An ip based protocol that carries scsi commands and data over ip
  networks.

iSER (iscsi Extension for rdma): Provides the rdma data transfer capability to iscsi.

iWARP: A suite of wire protocols comprising rdmap, ddp, and mpa layered above tcp. It
  provides rdma operations over tcp.

ITT (Initiator Task Tag): A unique id allocated by an iscsi initiator that identifies a task
  session-wide.

kDAPL (Kernel Direct Access Programming Library): A transport neutral infrastructure
  that provides rdma capabilities inside the kernel.

LU (Logical Unit): A scsi target device object, containing a device server and task
  manager, that implements a device model and manages tasks to process commands sent
  by an application client.

MPA (Marker pdu Aligned Framing): Designed to work as an ”adaptation layer” between
 tcp and ddp, preserving the reliable, in-order delivery of tcp, while adding the
 preservation of higher-level protocol record boundaries that ddp requires.

NAS (Network Attached Storage): A storage system running a network filesystem and
 attached to a network. It uses a file-level protocol.

PDU (Protocol Data Unit): A message that is sent between an iscsi initiator and target.

R2T (Ready to Transfer): An iscsi message sent from an iscsi target to an iscsi initiator to
  control the flow of write data from the initiator to the target.

RAID (Redundant Array of Inexpensive Disks): A system that uses multiple hard drives to
  share or replicate data among the drives.

RDMA (Remote Direct Memory Access): A technique that allows hardware to access
  memory on a remote system, with the local system specifying the remote location of the
  data to be transferred. Employing an rdma engine in the remote system allows the access
  to take place without involving the remote system’s cpu.

RDMAP (rdma Protocol): The protocol that provides rdma read and write semantics over a
  reliable transport; part of the iwarp suite.

SAN (Storage Area Network): A high-speed subnetwork of shared storage devices using a
  block-level protocol.

SCSI (Small Computer System Interface): A parallel/serial protocol for attaching
  peripheral devices to computers.

STag (Steering Tag): An identifier of a tagged buffer on a node. It is used for advertising
  the tagged buffer for rdma operations.

TCA (Target Channel Adaptor): A channel adapter typically used to support i/o devices.
 tcas are not required to support the InfiniBand Verbs interface.

TCP (Transmission Control Protocol): One of the core protocols of the Internet protocol
  suite, providing reliable communication over ip.

ULP (Upper Layer Protocol): The protocol layered above the protocol under discussion (e.g.,
  iscsi above iser).




1    Introduction
Traditionally, storage devices were directly attached to computers. During the 1990’s, large
storage servers became popular. These storage servers were connected to multiple clients
using a block-level interface such as the Small Computer Systems Interface [1] (scsi).
    More recently, with the proliferation of high-speed networks and related storage protocols
as well as distributed file systems, a further physical separation has taken place between the
control of the storage subsystem and the actual storage-containing boxes. This has facilitated
the assembly of multi-vendor systems through virtualization of the storage resources.
    There are presently two main networked storage architectures: Network Attached Storage
(nas), which exposes a file system interface (Fig. 1), and Storage Area Network (san) with
block semantics (Fig. 2). These differences in semantics have a large impact on other aspects
of the system architecture. Another architecture, which has a higher abstraction level than
a san but lower than a nas, is object store. It raises the level of abstraction presented by a
block device by appearing as a collection of objects.




                        Figure 1: Network Attached Storage (nas)




                           Figure 2: Storage Area Network (san)


    Many storage systems use the Internet scsi [2] (iscsi) protocol for sending scsi com-
mands and data over an ip network. Other systems use Fibre-Channel. Some systems use the
remote dma [3] (rdma) mechanism provided by interconnects like InfiniBand [4] and iwarp
(a protocol suite that provides the Remote Direct Memory Access support [3] (rdmap), the
Marker pdu Aligned Framing for tcp support [5] (mpa) and the Direct Data Placement sup-
port [6] (ddp)). The iscsi Extensions for rdma protocol [7] (iser) or scsi rdma Protocol
[8] (srp) are used in order to harness rdma for storage communication purposes.
    This work explores an architecture for scalable storage systems. The work focuses on the
raid level, and fits particularly well with san controllers. Moreover, some nas and object-
store solutions are designed with a back-end san, so the work is often applicable to them as
well.

1.1    Single-box RAID
raids [9] use multiple disks to share or replicate data among the disks. Some raids stripe
data at block granularity along with redundant information (parity etc.) over the disks. For
the remainder of this thesis, the term “block” refers to a raid stripe unit (not an os block).
   Current san controller implementations may be divided into two groups:

   • In-Band controllers: All data sent between hosts and storage boxes (targets) passes
     through the san controller.

   • Out-Of-Band controllers: Data is sent directly between hosts and storage boxes (tar-
     gets) without passing through the san controller. Commands are sent from hosts to
     the san controller and from the san controller to storage boxes.

    In both types, the raid functions (e.g., parity calculation) are carried out by the raid
controller itself, at a granularity below that at which the san controller operates; the san
controller knows nothing about a given target box beyond its properties. (The san controller
is thus not involved in parity calculations.)
    In most current raids, including very large ones, any given error-correcting (ecc) group
resides in a single box. This is convenient and simple. However, although storage devices
are becoming cheaper, such raids are still very expensive because of the need to make
individual boxes reliable and fault tolerant. Also, a single-box raid, regardless of the degree
of redundancy, is still a single point of failure because box failures (e.g., cable disconnection,
flood, coffee spill) render entire ecc groups unavailable.

1.2    Multi-box RAID
An alternative to the current san architecture would be to build a multi-box raid (Fig. 3), in
which the controller and the targets reside in separate machines. Each ecc group comprises
at most one block from any given target box. The controller must be fault tolerant (by
having a hot backup, for example). With this architecture, the failure of a single target box
does not render any data inaccessible. In order to retain simplicity and in accordance with
the general trend in storage management, the preferred approach is to retain centralized
control. Another approach is to build a fully distributed raid, including distributed control.
This has several advantages but is more complicated.


                                 Figure 3: Multi-box raid

    In a multi-box raid, the controller and the targets naturally reside in separate machines.
Therefore, unlike current single-box raids, which are seen by the san controller as opaque
boxes, when using a multi-box raid the san controller is also a raid controller and controls
the raid at the disk level.
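    To make this constraint concrete, the Python sketch below (our own illustration; the
rotated-parity layout and all names are assumptions, not taken from the thesis) shows one
way to place each ecc group across n target boxes so that no box ever holds more than one
block of any group:

    # Illustrative sketch: RAID-5-style rotated placement of an ECC group across
    # separate target boxes, so that no box holds two blocks of the same group.
    def stripe_layout(stripe_index, num_boxes):
        # Each box contributes exactly one block per stripe: num_boxes - 1 data
        # blocks plus one parity block, rotated across boxes from stripe to stripe.
        parity_box = stripe_index % num_boxes
        return [(box, "parity" if box == parity_box else "data")
                for box in range(num_boxes)]

    for stripe in range(4):
        print(stripe, stripe_layout(stripe, num_boxes=5))
    # A box failure removes at most one block from each ECC group, so every
    # group remains reconstructible from the surviving boxes.
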
    Achieving high performance while placing the controller and targets in separate machines
requires a high-performance network, interconnecting the entities through a switch. Also, a
protocol for sending commands and data among the entities is required. A possible solution
is to use the iscsi protocol for transferring scsi traffic over a tcp infrastructure.
    A multi-box raid can tolerate failures that affect an entire box. Also, because it com-
prises inexpensive machines connected by a network, it can be less expensive than single-box
raids while providing at least the same level of availability. However, unlike a single-box
raid that uses a dma engine for data transfer, the aforementioned multi-box raid uses iscsi
over tcp for data transfer. This is less efficient because it requires data copies that affect
the throughput and latency, and moreover burdens the cpu. Overcoming the single point of
failure by going to a multi-box raid thus poses several challenges: the communication must
be efficient and the controller must not become a bottleneck.
    The main architectural goals when designing a multi-box raid are:

   • Permit inexpensive boxes while retaining high availability and performance.

   • Allow a multi-box raid to be scalable.

   • Do the above with minimal changes to the storage communication protocols.

In-band vs. Out-of-band Controller
    In single-box raids, the raid controller is naturally in the data path. An attempt to keep
the controller in the data path in a multi-box raid would result in a problem. If the controller is connected
to the switch like each of the target boxes, it will become a communication bottleneck:
the controller is connected to the switch with a single port while the n targets are jointly
connected to the same switch with n ports; yet, every communicated bit either originates
from the controller or is targeted to it. If, instead, the controller is located inside the switch,
acting as a router, the bottleneck may be relieved but the “orthogonality” of communication
and other functions would be destroyed.
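    As a rough illustration of the resulting bandwidth ceilings, the back-of-the-envelope
Python sketch below compares the two placements; the link rate and target count are
hypothetical numbers chosen for illustration, not measurements from this work:

    # Back-of-the-envelope sketch (hypothetical numbers).
    LINK_GBPS = 10      # assumed per-port link rate
    NUM_TARGETS = 8     # assumed number of target boxes

    # In band: every bit to or from the hosts crosses the controller's single port.
    in_band_ceiling = LINK_GBPS

    # Out of band: data flows directly between hosts and the n target ports; the
    # controller port carries only commands. (Host ports also bound the total.)
    out_of_band_ceiling = NUM_TARGETS * LINK_GBPS

    print(f"in-band ceiling:     {in_band_ceiling} gbit/s")
    print(f"out-of-band ceiling: {out_of_band_ceiling} gbit/s")
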
    In a multi-box raid, the disks reside in separate boxes and all types of boxes are in-
terconnected by a common network. This admits an out-of-band controller configuration,
wherein data is sent directly among hosts and targets. An out-of-band controller has several
inherent features:

   • The controller’s network link, busses, memory and cpu (for protocol processing) do
     not become a bottleneck because no data passes through the controller.

   • Being out of the data path, the controller cannot perform ecc calculations during
     the execution of write requests, in degraded mode and during reconstruction. These
     calculations must be performed by the targets. (This will decrease the usage of the
     controller’s cpu and thus relieve another possible bottleneck.)

    Having an out-of-band controller may also be essential if a storage vendor wants to keep
the capacity of an individual storage box unchanged, e.g., in order to use the same hardware
that is used in single-box storage boxes (for cost-effectiveness reasons). In this case, a single
raid controller has to handle more activity than a controller in a single-box raid. If the
controller is in band, this may lead to scalability problems in communication and possibly
in computing. This all leads to the conclusion that removing the controller from the data
path is essential in a multi-box raid.

1.3    Related work
Previous studies have tried to address the aforementioned problems (the high price of raids,
single box failure and scalability of the raid controller), but did not simultaneously solve
all of them.
    Several studies focused on multi-box (or distributed) raids. Redundant Array of Dis-
tributed Disks [10] (radd) proposed a distributed raid architecture without centralized
control: parity calculation was performed among target boxes that were spread across a
wide-area network. This improved availability when failures that affect an entire box oc-
curred. However, since there was no centralized control over the distributed raid, the targets
had to be intelligent and the implementation became more complicated than in a raid with
centralized control. The TickerTAIP project [11] offered a parallel raid architecture for sup-
porting parallel disk i/o with multiple controllers. However, like radd, it had the potential
problems of decentralized control. The Petal project [12] implemented a single, block-level
storage system with a pool of distributed storage servers. However, the architecture of Petal
was not transparent to clients because they had to be aware of the storage servers.
    Distributed iscsi raid [13] proposed to improve performance by striping data among
iscsi targets with a single controller. In order to improve system reliability, it used rotated
parity for data blocks (raid-5). This was a partial solution: it was less expensive than
a single-box raid and there was no problem of single-box failure because no ecc group
contained more than one block residing in any single target machine. However, a bottleneck
may have been created in the raid controller because it was in the data path and had to
perform ecc calculations.
    The specification of scsi [14] contains block commands that support parity calculation
by the disk drive itself. This can be used to distribute the ecc work among targets and to
reduce the overall required data movement. However, the raid controller is still in the data
path. Also, such a system does not solve the problem of single-box failure and may be even
more expensive than other single-box raids because it requires scsi disks with support for
an extended command set.
    Other studies and proprietary commercial solutions focused on removing the controller
from the data path. svm [15] is a san appliance that provides virtual volume management
and has an out-of-band controller. dsdc [16] discusses the bottleneck that is created in san
controllers because all communication between initiators and targets passes through the san
controller. It solves the problem by sending data directly between initiators and targets.
However, the problems of high cost of raids and single-box failure are not addressed.
    In this work, we present our 3rd Party Transfer multi-box raid architecture, tpt-raid,
which uses an out-of-band raid controller and distributed ecc calculation by the targets, all
under centralized control. We show that by improving the data transfer efficiency, removing
the controller from the data path and transferring the ecc calculation from the controller
to the targets, the performance of a multi-box raid is improved dramatically relative to
that of one with an in-band controller: for sufficiently large requests and block sizes, the
maximum throughput supported by the out-of-band controller is higher (and it is therefore
more scalable), and zero-load latency is not increased. Also, since the tpt-raid offers higher
maximum throughput, queuing time is reduced and latency is lower for all but very light
workloads. We extend the iscsi and iser storage communication protocols in support of
this architecture.
    The remainder of the thesis is organized as follows. Section 2 provides an overview of
relevant data transfer technologies and interconnects. Section 3 provides an overview of
relevant storage technologies. Section 4 describes the proposed architecture of tpt-raid.
Section 5 contains a proof of correctness and completeness. Section 6 contains performance
measurements. Throughout this document, we mostly use raid-5 to illustrate the ideas of
the tpt-raid; Section 7 contains a brief description of how these ideas may be implemented
in other raid types. Section 8 concludes this work.




2     Data transfer and interconnects
When data is sent between a host and a storage box (Fig. 4), it passes through several
entities inside each box (host or storage) and over a network between boxes:

    • When executing read requests:

        – The host sends a command to the storage box.
        – The storage box reads the data from the local disk to a buffer. The decision
          whether to use user or kernel buffers depends on the implementation.
        – The data is sent by the nic over the network to the host. The data may be sent
          from the buffer into the network in several ways as explained later in this section.
        – The host receives the data from its nic into a kernel buffer. The data may be
          received from the network into the buffer in several ways as explained later in this
          section.
        – If necessary, data is copied from kernel space to user space.

    • When executing write requests:

        – If necessary, data is copied from user space to kernel space.
        – The data is sent by the nic over the network to the storage box.
        – The storage box receives the data from its nic into a buffer. Again, the decision
          whether to use user or kernel buffers depends on the implementation.
        – The storage box writes the data from the buffer to its local disk.




                  Figure 4: Host-target data path in networked storage


    The transition to san and nas has been accompanied by the emergence of high-performance
communication infrastructure for storage. Storage protocols (like scsi) have specific commu-
nication requirements such as high data rate and efficient data transfer (minimal loading of
the cpu). With target boxes having large caches, protocol offload engines emerging, and
future storage technologies possibly offering low access times, low latency of storage
communication has also become important.
    In this section, we discuss data transfer technologies that optimize this communication.

2.1    Fibre Channel
Currently, Fibre Channel (fc) is the most prominent high-end storage communication inter-
connect, with data rates of up to 4 gbit/s. fc speed relies on the physical interconnect. Also,
fc fabric is a managed network. Endpoints need to log in to the network, so that switches
can optimize paths. Each node has a fc Host Bus Adapter (hba) containing a dma engine.
The dma engine is used to transfer data between the node’s memory and the hba. fc
equipment is very expensive and the data rate is limited (relative to other communication
interconnects).

2.2    RDMA overview
Direct Memory Access (dma) is confined to working with a single host and its i/o subsys-
tems. Networked storage requires inter-box communication with efficient data transfer. The
rdma protocol provides read and write services directly to applications, and enables data
to be transferred directly into upper layer protocols (ulp) buffers without intermediate data
copies. Data may be transferred between user or kernel buffers. However, in this work we
deal only with kernel buffers.
    When data is sent over tcp/ip, it is copied from a local buffer to the nic in the data
source machine and then copied again from the nic to a local buffer in the data sink machine.
These copy operations add latency, and consume significant cpu and memory resources.
rdma obviates the need for data copy operations, and allows an application to read or write
data on a remote computer’s memory with minimal demands on memory bus bandwidth
and cpu processing overhead. Using rdma does not compromise memory protection: remote
access is limited to buffers that have been explicitly registered and advertised.
    A local node informs a remote peer that a local rdma buffer is available to it by sending
it the tagged buffer identifiers (steering tag (stag), base address, and buffer length). This
action is called “advertisement”.
    The rdma protocol provides several data transfer operations. The following operations
are relevant to this work:

   • rdma write: An rdma write operation uses an rdma Write message to transfer data
     from the data source to a previously advertised buffer at the data sink. The data sink
     enables the data sink tagged buffer for access, and advertises the buffer’s size (length),
     location (address) and steering tag (stag) to the data source through a ulp specific
     mechanism. The ulp at the data source initiates the rdma write operation.

   • rdma read: An rdma read operation transfers data to a tagged buffer at the data
     sink from a tagged buffer at the data source. The ulp at the data source enables the
     data source tagged buffer for access, and advertises the buffer’s size (length), location
     (address) and steering tag (stag) to the data sink through a ulp specific mechanism.
     The ulp at the data sink enables the data sink tagged buffer for access, and initiates
     the rdma read operation.
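    The advertisement handshake and the two operations can be summarized with a toy model;
the Python sketch below is purely illustrative (the Node class and its methods are invented
here and do not correspond to a real rdma interface): the buffer owner registers a buffer and
hands its tagged-buffer identifiers to the peer, which then moves data with rdma write or
rdma read without involving the owner’s cpu.

    class Node:
        def __init__(self, name):
            self.name = name
            self.memory = {}          # stag -> registered buffer (bytearray)
            self.next_stag = 1

        def register(self, length):
            # Registration: make a local buffer available for RDMA, assign a stag.
            stag = self.next_stag
            self.next_stag += 1
            self.memory[stag] = bytearray(length)
            return stag

        def advertise(self, stag):
            # Advertisement: hand out the tagged-buffer identifiers
            # (owner, stag, base "address", length).
            return (self, stag, 0, len(self.memory[stag]))

        def rdma_write(self, data, advert):
            # This node is the data source; it pushes data into the peer's
            # previously advertised buffer without involving the peer's CPU.
            peer, stag, offset, length = advert
            assert len(data) <= length
            peer.memory[stag][offset:offset + len(data)] = data

        def rdma_read(self, advert, length):
            # This node is the data sink; it pulls data from the peer's
            # previously advertised buffer.
            peer, stag, offset, _ = advert
            return bytes(peer.memory[stag][offset:offset + length])

    host, target = Node("host"), Node("target")

    # SCSI-READ-like flow: the host registers and advertises a buffer,
    # and the target fills it with a single rdma write.
    advert = host.advertise(host.register(16))
    target.rdma_write(b"data-from-disk!!", advert)
    print(host.memory[1])       # bytearray(b'data-from-disk!!')
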

    A state of the art architecture that provides rdma operations is InfiniBand, which is
discussed next.

2.3    InfiniBand overview
Several interconnects have been developed for clusters, such as Myrinet (By Myricom) [17]
[18], QsNet (by Quadrics) [19] [20] and InfiniBand [4]. Perhaps the most interesting and
relevant to storage of these is InfiniBand.
    InfiniBand is a new industry-standard architecture for server i/o and inter-server commu-
nication. It provides high levels of reliability, availability, performance (10 − 60 gbit/s end-
to-end and low latency) and scalability for server systems, featuring full protocol offloading.
It also allows user-level networking. In conformance with common storage implementations,
we use InfiniBand only in kernel mode.
    The motivation for the development of InfiniBand is the fact that the capabilities of
standard i/o systems that use busses do not satisfy the needs of server systems, while
processing power is growing.
    Although bus-based i/o systems are simple and have served the industry well, they do
not use their underlying electrical technology well to provide high data transfer rates out of
a system to devices. There are several reasons for this:

   • The fact that busses are shared requires arbitration protocols. As the number of devices
     that use the bus grows, the arbitration penalty increases.

   • A bus does not provide the high level of availability and reliability required by server
     systems. A single device failure may inhibit the correct operation of the bus itself,
     which causes a failure of all devices that use the bus. The bus is also a shared medium,
     so only one pair of entities can communicate at any given time.

    Several server communication vendors supply solutions to the problems that were men-
tioned above. However, these solutions are proprietary, thereby incurring significant costs.
InfiniBand is an industry-standard architecture. It uses point-to-point connections: all data
transfer is point-to-point, not bussed. Arbitration is not needed, and if a device fails, it
doesn’t affect other devices. Using a switched network allows scalability.
    For these reasons and the fact that InfiniBand supplies rdma read and write operations,
it was selected as the interconnect for this work.



2.3.1   Memory registration
A device that terminates an InfiniBand link and executes transport-level functions is called
a Channel Adapter (ca). The collection of features available to host programs is defined by
the Verbs interface. A ca that supports the Verbs interface is called a Host Channel Adapter
(hca). A Target Channel Adapter (tca) has no defined software interface.
    An hca accesses host system memory. Physical address space for host system memory
is typically organized into pages, and a given logical data buffer that spans page boundaries
often has a non-contiguous physical address range.
    Memory registration is a mechanism that allows consumers to describe a set of virtually
contiguous memory locations or a set of physically contiguous memory locations in order to
allow the hca to access them as a virtually contiguous buffer using virtual addresses. A
memory location must be explicitly registered before the hca can access it.
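    As a rough illustration of what registration provides (a toy model only; this is not the
InfiniBand Verbs interface, and the page-table lookup is hypothetical), the Python sketch
below describes a virtually contiguous buffer as an offset plus a list of possibly scattered
physical pages, which is essentially the translation information the hca needs:

    # Toy illustration of memory registration (not the InfiniBand Verbs API):
    # a virtually contiguous buffer is described by an offset into its first page
    # plus a list of (possibly non-contiguous) physical pages.
    PAGE_SIZE = 4096

    def register_memory(virt_addr, length, virt_to_phys):
        # virt_to_phys is a hypothetical stand-in for the OS page tables.
        first_page = virt_addr - (virt_addr % PAGE_SIZE)
        last_byte = virt_addr + length - 1
        last_page = last_byte - (last_byte % PAGE_SIZE)
        pages = [virt_to_phys(page)
                 for page in range(first_page, last_page + PAGE_SIZE, PAGE_SIZE)]
        return {"first_byte_offset": virt_addr % PAGE_SIZE,
                "length": length,
                "pages": pages}

    # Hypothetical page table: consecutive virtual pages sit on scattered physical pages.
    page_table = {0x10000: 0x7A000, 0x11000: 0x23000, 0x12000: 0x5C000}
    print(register_memory(0x10800, 6000, page_table.__getitem__))
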




3     Storage systems overview
Direct-attached storage (das) is the traditional way to access storage. In das, storage is
directly attached by a cable to a processor in a single machine and commands are sent
directly to the storage.
    The independent storage subsystems initially took the form of storage “mainframes”. A
mainframe contains a common storage server attached to multiple computers (like emc’s
Symmetrix [21] storage subsystem).
    Later, two main storage networking variations were developed to replace das:

    • Storage Area Network (san): Network storage is placed on a separate network that is
      specifically designed for storage management. Like das, i/o requests access devices di-
      rectly. It provides block-level storage access to clients. Most sans use Fibre Channel
      infrastructure or iscsi over Ethernet infrastructure.

    • Network Attached Storage (nas): nas devices are often connected to a tcp/ip based
      network. The protocol used with nas is a file based protocol such as nfs. Therefore,
      it provides file-level storage access to clients.

This work is mostly relevant to san (which can also be the back end of a nas system).
    This section presents several storage protocols and architectures; it serves as the basis
for the sections that follow.

3.1    SCSI overview
scsi is a standard protocol between computers and peripheral devices, including disks, tape
drives, cd/dvd drives, printers, communications devices, optical media drives and other
devices. The standard is maintained by the t10 working group of the American National
Standards Institute (ansi).
    The first scsi standard was approved by ansi in 1986, and is presently referred to as
scsi-1. It was very basic, and contained a small command set. It was replaced in 1994 by
scsi-2. scsi-2 supplied a standard command set and support for command queuing. It was
a bundled protocol that defined all layers, from cabling to command set. It was designed
only for das using dma semantics over a bus.
    Later, scsi-3 replaced scsi-2. It introduced some major changes relative to the old scsi-2:

    • Layered protocol: scsi-3 is a collection of different but related standards. For some of
      the layers, many standards may be used.

    • Networked protocol: scsi-3 uses the network as its physical layer.

    Each i/o device is called a “logical unit” (lu). Service interfaces between distributed
objects are represented by the client-server model shown in Fig. 5. The client is called an
“initiator”, and the server is called a “target”.


                            Figure 5: scsi Client-Server model

   A client-server transaction is effected by an initiator sending a scsi “command” to a
target. The command identifies the server and the service to be performed, and includes
the input data. The target replies with a scsi “response” that contains the output data
and request status. Such a transaction is represented as a remote procedure call with inputs
supplied by the initiator. The procedure is processed by the target and returns outputs and
a procedure status.
   A scsi task contains a definition of the work to be performed by the lu in the form of a
command or a group of linked commands. Each task is represented by a unique “initiator
task tag” (itt).
   A scsi command may contain a data phase and a required response phase. In the data
phase, data may be transferred between the initiator and the target in both directions. After
the data phase is complete, a response is sent from the target to the initiator.
   The operation to be performed by the target is defined by a command descriptor block
(cdb). All cdbs contain an operation code and operation specific parameters. The cdb
content and structure are defined by [1] and device-type specific scsi standards.
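    As a concrete illustration, the Python sketch below packs a 10-byte READ(10) cdb, whose
first byte is the operation code (0x28) followed by operation-specific parameters (logical block
address and transfer length); the helper function is our own and is not part of the thesis or
of the standard.

    import struct

    def build_read10_cdb(lba, num_blocks):
        # Byte 0: operation code (0x28), byte 1: flags (0 here),
        # bytes 2-5: logical block address (big-endian),
        # byte 6: group number, bytes 7-8: transfer length, byte 9: control.
        return struct.pack(">BBIBHB", 0x28, 0, lba, 0, num_blocks, 0)

    cdb = build_read10_cdb(lba=0x1000, num_blocks=8)
    print(cdb.hex())    # 28000000100000000800
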

3.2    iSCSI overview
The iscsi protocol [2] is a mapping of the scsi remote procedure invocation model (see [1])
over tcp. scsi commands are encapsulated in iscsi requests, and scsi responses and status
are encapsulated in iscsi responses.
    For the remainder of this document, the terms “initiator” and “target” refer to “iscsi
initiator” and “iscsi target” respectively.
    The messages that are sent between the initiator and target are called “iscsi protocol
data unit” (iscsi pdu). pdus are sent over one or more tcp connections. A group of
connections that link an initiator with a target is called a session.
    Outgoing scsi data (initiator to target) may be sent in one of the following ways (as
shown in Fig. 6):
   • With the command pdu, as Immediate data (as Unsolicited data).
   • In separate Data-Out pdus (as Unsolicited data).
   • In separate Data-Out pdus in response to a ready-to-transfer (r2t) pdu from the
     target (as Solicited data).




                             Figure 6: iscsi Write Sequence

    Sending unsolicited data in write commands can reduce the command latency (since
the initiator doesn’t have to wait for the r2t pdu from the target). It may also be useful
for small data transfers in which the data may be sent as immediate data with no need to
send data-out pdus.
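    Which of these options is used depends on parameters negotiated at login. The Python
sketch below is a condensed illustration based on the standard iscsi keys ImmediateData,
InitialR2T and FirstBurstLength; the decision logic is our own simplification, not the
normative algorithm of the specification.

    # Simplified sketch of how an initiator might deliver write data, based on the
    # negotiated iSCSI keys ImmediateData, InitialR2T and FirstBurstLength.
    def plan_write_data(length, immediate_data=True, initial_r2t=False,
                        first_burst_length=65536):
        plan, sent = [], 0
        if immediate_data:
            # Unsolicited data carried inside the SCSI command PDU itself.
            burst = min(length, first_burst_length)
            plan.append(("immediate data with the command PDU", burst))
            sent += burst
        elif not initial_r2t and length > 0:
            # Unsolicited data in separate Data-Out PDUs, up to FirstBurstLength.
            burst = min(length, first_burst_length)
            plan.append(("unsolicited Data-Out PDUs", burst))
            sent += burst
        if sent < length:
            # The remainder is solicited: Data-Out PDUs sent in response to R2T PDUs.
            plan.append(("Data-Out PDUs in response to R2T", length - sent))
        return plan

    print(plan_write_data(256 * 1024))
    print(plan_write_data(4 * 1024, immediate_data=False, initial_r2t=True))
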
    Incoming scsi data (target to initiator) may be sent in data-in pdus in response to a
read command pdu as shown in Fig. 7.




                             Figure 7: iscsi Read Sequence


3.3    iSER overview
The iscsi Extensions for rdma (iser) protocol maps the iscsi protocol over a network that
provides rdma services (like tcp with rdma services (iwarp) or InfiniBand). This permits
data to be transferred directly into scsi i/o buffers without intermediate data copies.
    The Datamover Architecture (da) defines an abstract model in which the movement of
data between iscsi end nodes is logically separated from the rest of the iscsi protocol. iser
is one Datamover protocol. The interface between the iscsi and a Datamover protocol, iser
in our case, is called Datamover Interface (di). The layering of iscsi over iser is shown in
Fig. 8.




                                 Figure 8: iscsi over iser

    The motivation for iser is the problem of out-of-order tcp segments in the traditional
iscsi model. These segments have to be stored and reassembled before the iscsi layer can
place the data in the iscsi buffers. The tcp reassembly may reduce system performance
because of the needed data copying.
    iser’s goal is to allow direct data placement and rdma capabilities. Data is sent between
the initiator and target with rdma services without any data copy at the ends.
    The main difference between the standard iscsi and iscsi over iser in the execution of
scsi read/write commands is that with iser the target drives all data transfer (with the
exception of iscsi unsolicited data) by issuing rdma write/read operations, respectively.
    When the iscsi layer issues an iscsi command pdu, it calls the Send Control primitive,
which is part of the di. The Send Control primitive sends the stag with the pdu. The iser
layer in the target side notifies the target that the pdu was received with the Control Notify
primitive (which is part of the di). The target calls the Put Data or Get Data primitives
(which are part of the di) to perform an rdma write/read operation respectively. Then, the
target calls the Send Control primitive to send a response to the initiator. An example is
shown in Fig. 9. (In the figure, time progresses from top to bottom.)
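    A condensed sketch of this target-driven sequence is given below; the method names follow
the di primitives mentioned above, but the classes and pdu dictionaries are invented for
illustration and are not the real Datamover Interface.

    class Initiator:
        def __init__(self):
            self.buffers = {}        # stag -> registered buffer (bytearray)
            self.responses = []

    class Target:
        def control_notify(self, pdu, initiator):
            # Called when a command PDU (carrying advertised STags) arrives.
            if pdu["op"] == "SCSI_READ":
                data = bytes(pdu["length"])                      # stand-in for a disk read
                self.put_data(initiator, pdu["stag"], data)      # RDMA Write to the host
            elif pdu["op"] == "SCSI_WRITE":
                data = self.get_data(initiator, pdu["stag"], pdu["length"])  # RDMA Read
                # ... write 'data' to the local disk ...
            self.send_control(initiator, {"op": "SCSI_RESPONSE", "status": "GOOD"})

        def put_data(self, initiator, stag, data):
            initiator.buffers[stag][:len(data)] = data

        def get_data(self, initiator, stag, length):
            return bytes(initiator.buffers[stag][:length])

        def send_control(self, initiator, pdu):
            initiator.responses.append(pdu)

    initiator, target = Initiator(), Target()
    initiator.buffers[7] = bytearray(4096)       # registered and advertised buffer (stag 7)
    target.control_notify({"op": "SCSI_READ", "stag": 7, "length": 4096}, initiator)
    print(initiator.responses)                   # [{'op': 'SCSI_RESPONSE', 'status': 'GOOD'}]
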




           Figure 9: Example of scsi read/write with iscsi over iser ([22])

    All iscsi control-type pdus contain an iser header (shown in Fig. 10), which allows the
initiator to advertise the stags that were generated during buffer registration. The target
will use the stags later for rdma read/write operations.




                               Figure 10: iser header ([22])




3.4    RAID overview
A redundant array of inexpensive disks [9] (raid) uses multiple disks to share or replicate
data among the disks. Some raids stripe data at block granularity along with redundant
information (parity etc.) over the disks.
    There are various raid configurations, which differ in the fraction of redundant data (e.g.,
replication vs. parity) and in the roles played by different disk drives. (e.g., dedicated disk
drives for redundant data vs. having every disk store both data and redundant information.)
In this work, we mainly use raid-4 and raid-5 (although other raid levels may be used
with small changes as shown in section 7).
    raid-4 uses block-level striping with a dedicated parity disk as shown in Fig. 11. Write
requests must update the requested data blocks, and also compute and update the parity
block for each affected stripe. For small write requests, parity is computed by noting how
the new data differs from the old data and applying those differences to the parity block.
Such write requests require 4 disk i/os: 2 for reading the old requested data block and old
parity block and 2 for updating the requested data block and parity block. The parity disk
may become a bottleneck because it must be updated on all write operations.




                                     Figure 11: raid-4

    raid-5 eliminates the parity disk bottleneck that exists in raid-4 by distributing the
parity data across all member disks as shown in Fig. 12. Since there is no dedicated parity
disk, all disks may participate in servicing read requests.
    As mentioned in [9], for small write operations we can recalculate the new parity as
follows:
    new parity = xor(xor(old data,new data),old parity)

We will refer to this ecc calculation method as “incremental parity calculation”.
  For write operations that involve many disks, the following parity calculation method
may be more efficient:
   For each modified stripe:
   modified-blocks = {New data blocks}
   non-modified-blocks = {Non modified data blocks}
   new parity = xor(xor(modified-blocks),xor(non-modified-blocks))

We will refer to this ecc calculation method as “batch parity calculation”.




                                     Figure 12: raid-5
    The raid controller has to select a parity calculation method for each modified stripe.
The main difference between the two methods is the number of read commands that are
sent to the disks, and we use this number to make the selection: if p blocks are modified in
a stripe, then p + 1 read commands are required when using the incremental method, while
raid size − 1 − p commands are required when using the batch method. Therefore, if
p < raid size/2 − 1, we use the incremental method; otherwise, the batch method is used.
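    The following self-contained C sketch illustrates the two parity-calculation methods and the read-count rule used to choose between them. Block contents are modeled as plain byte arrays, and all names and sizes are illustrative assumptions rather than the actual implementation.

    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE 8   /* toy block size; real blocks are much larger */

    static void xor_into(unsigned char *dst, const unsigned char *src)
    {
        for (int i = 0; i < BLOCK_SIZE; i++)
            dst[i] ^= src[i];
    }

    /* Incremental: new parity = xor(xor(old data, new data), old parity). */
    static void parity_incremental(unsigned char *parity,
                                   const unsigned char *old_data,
                                   const unsigned char *new_data)
    {
        xor_into(parity, old_data);
        xor_into(parity, new_data);
    }

    /* Batch: new parity = xor of all modified (new) and non-modified data blocks. */
    static void parity_batch(unsigned char *parity,
                             const unsigned char blocks[][BLOCK_SIZE], int nblocks)
    {
        memset(parity, 0, BLOCK_SIZE);
        for (int i = 0; i < nblocks; i++)
            xor_into(parity, blocks[i]);
    }

    /* Incremental needs p + 1 disk reads, batch needs raid_size - 1 - p. */
    static int use_incremental(int p, int raid_size)
    {
        return (p + 1) < (raid_size - 1 - p);   /* i.e., p < raid_size/2 - 1 */
    }

    int main(void)
    {
        unsigned char old_d[BLOCK_SIZE] = {1}, new_d[BLOCK_SIZE] = {2};
        unsigned char parity[BLOCK_SIZE] = {3};
        unsigned char stripe[3][BLOCK_SIZE] = {{2}, {4}, {5}};

        parity_incremental(parity, old_d, new_d);   /* small write: update in place  */
        parity_batch(parity, stripe, 3);            /* large write: xor whole stripe */
        printf("p=1, raid_size=8 -> %s\n", use_incremental(1, 8) ? "incremental" : "batch");
        printf("p=5, raid_size=8 -> %s\n", use_incremental(5, 8) ? "incremental" : "batch");
        return 0;
    }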
    Read and write requests may involve more than a single stripe. If all blocks in a stripe
need to be read or written, we will refer to it as a “complete stripe”; otherwise, we will refer
to it as a “partial stripe”. An example is shown in Fig. 13.




                          Figure 13: Partial and complete stripes




3.5    Multi-box RAID with iSER
Using iscsi over iser solves the problem of inefficient data transfer in the multi-box raid,
and may be considered a substitute for the dma engine used by a single-box raid.
However, the other problems of the multi-box system remain unsolved:

   • The separation of control and data that is offered by iser is really a protocol separation
     over the same physical path. All data that is transferred between the hosts and the
     disks passes through the controller.

   • ecc calculations require additional data transfers between the controller and the disks
     as well as execution of xor operations by the controller.

    We will use iscsi over iser with an in-band controller as a baseline for comparison and
refer to it as the Baseline system (Fig. 14). In the next section, we introduce tpt-raid,
which extends the use of rdma to help solve the remaining problems.




                                 Figure 14: Baseline raid




4     3rd Party Transfer RAID (TPT-RAID) architecture
4.1    Overview
The 3rd Party Transfer raid (tpt-raid) is a multi-box raid architecture. It comprises a
central out-of-band controller, which is fault-tolerant (by having a hot backup, for example)
and multiple storage (target) boxes, all interconnected by a network with rdma capabilities.
One or more hosts can also be connected to the same network, jointly forming a san.
    tpt-raid combines the simplicity of centralized control with the scalability of direct
host-target and target-target communication as well as distributed ecc computation, all
under controller supervision (Fig. 15).
    tpt-raid uses rdma, which makes data transfers more transparent than in other out-
of-band controllers because the passive side of an rdma operation is unaware of the fact
that it is sending or receiving data - it is all done by the hardware. In current out-of-band
storage systems, hosts actually have to send and receive data packets from targets, which
makes it less transparent for the host.




                                   Figure 15: tpt-raid

   In the remainder of this section, we describe the operation of tpt-raid, addressing login
and logout, i/o request execution and error handling with tpt-raid.

4.2    Login and logout
The login and logout phases in the tpt-raid differ from those of a san with single-box
raids and an out-of-band controller: the controller has to establish a connection with each
target, and each target has to establish a connection with all other targets and with each
host. The architecture of the tpt-raid also raises some potential security problems during
the connection establishment and login phase (unless the hosts, controller and targets all
reside in a trusted environment). The main threat is a rogue target that establishes
connections to other targets and hosts. We next go over the connection establishment and
login phase, and show how these problems are handled:

   • Connection establishment between targets: in order to prevent a situation wherein a
     rogue target connects to other targets, the controller sends a list of all targets to each
     target. A target will not accept a connection request unless it arrives as part of a full
     login phase with a controller, or it comes from a target that appears on the target list.
     An anti-spoofing mechanism is required in order to enforce this.
   • Connection establishment between targets and hosts: in order to prevent a situation
     wherein a rogue target connects to a host, the host and the controller agree on a secret
     key, which is unique to the host-controller pair. The controller sends the key to all
     targets. When a host receives a connection request, it accepts it only if it contains the
     key.

   In order to achieve an even higher level of security, the security mechanisms offered by
iscsi may be used; however, this requires target-target and target-host login sessions. (A
sketch of the basic acceptance checks appears after the login and logout steps below.)
   We now describe the steps of the login and logout phases:
  1. When the controller is started:

      (a) The controller logs in to the targets. This is a standard iscsi login phase.
      (b) During the login phase, the controller sends the list of targets to each target. This
          requires a new type of message.
      (c) When a target receives the target list, it connects to the other targets. This is
          only a connection, without any login phase. A target decides whether to accept
          a connection based on the list that it received from the controller.

  2. When a host logs in to the controller:

      (a) The host logs in to the controller. This is a standard iscsi login phase.
      (b) The host and the controller agree on a secret key.
      (c) The controller instructs the targets to connect to the host.
      (d) Each target connects to the host (no login session, only connection). The host
          accepts the connection request only if it contains the secret key that was agreed
          on with the controller.

  3. When a host logs out:

      (a) The host logs out. This is a standard iscsi logout phase.
      (b) The controller requests the targets to disconnect from the host.
      (c) Each target disconnects from the host.

  4. When the controller is stopped: If a host is still connected, the controller initiates a
     logout phase with the host. Then, the controller logs out from the targets. When a
     target receives a logout request, it disconnects from all other targets.
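   The acceptance checks described above can be summarized by the following minimal C sketch. The types, names and fixed-size arrays are illustrative assumptions only; in particular, the anti-spoofing mechanism mentioned earlier is not shown.

    #include <stdio.h>
    #include <string.h>

    #define MAX_TARGETS 16

    struct target_list {                 /* list distributed by the controller */
        int  count;
        char names[MAX_TARGETS][32];
    };

    /* A target accepts a target-target connection only if the peer appears on
     * the list it received from the controller during the login phase. */
    static int target_accepts_target(const struct target_list *list, const char *peer)
    {
        for (int i = 0; i < list->count; i++)
            if (strcmp(list->names[i], peer) == 0)
                return 1;
        return 0;
    }

    /* A host accepts a target-host connection only if the request carries the
     * secret key agreed on with the controller for this host. */
    static int host_accepts_target(const char *host_ctrl_key, const char *key_in_request)
    {
        return strcmp(host_ctrl_key, key_in_request) == 0;
    }

    int main(void)
    {
        struct target_list list = { .count = 2, .names = { "tgt0", "tgt1" } };
        printf("%d %d\n", target_accepts_target(&list, "tgt1"),    /* 1: on the list */
                          target_accepts_target(&list, "rogue"));  /* 0: rejected    */
        printf("%d\n", host_accepts_target("s3cr3t", "s3cr3t"));   /* 1: key matches */
        return 0;
    }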

4.3     Request execution
tpt-raid features two main elements:

   • 3rd Party Transfer: Data passes directly between the host and the targets (unlike
     the Baseline raid in which data that is transferred between the host and the targets
     passes through the controller). Data for parity calculation is sent between targets under
     controller command (unlike the Baseline raid in which parity data is sent between the
     targets and the controller). Only control traffic passes through the controller.

   • Distributed Parity Calculation: Since the tpt-raid uses 3rd Party Transfer, the con-
     troller is no longer in the data path and therefore cannot perform the parity calculations.
     The parity calculation is done in the targets in the form of a binary tree, under controller
     supervision.

    Both mechanisms make use of rdma, and iser is extended accordingly. Adding them to
a multi-box raid improves the system throughput, and makes it more scalable than other
multi-box raids. Removing the controller from the data path also improves the latency
because data is transferred directly between the host and the targets.
    Out-of-band controllers already exist but, as noted above, rdma makes data transfers in
a system with an out-of-band controller more transparent: the passive side of an rdma
operation is unaware that it is sending or receiving data, since the transfer is done entirely
by the hardware, whereas in current out-of-band storage systems the host cpus have to
participate in sending and receiving data packets from the targets, which makes the scheme
less transparent for the host. Another difference between the tpt-raid and current out-of-band
storage systems is that we carry the idea of 3rd Party Transfer into the raid. We next
describe 3rd Party Transfer and Distributed Parity Calculation in greater detail.

4.3.1   3rd Party Transfer
As mentioned before, 3rd Party Transfer is used for data transfer between hosts and targets
and between targets. A raid controller that receives a command uses the stag carried in the
iser header of the iscsi command pdu: when it sends iscsi command pdus to the targets,
it includes the stag, with a specific offset for each target, in the iser header. The scsi
command pdu also identifies the passive side of the rdma operation (in this case, the host).
The target that receives the pdu uses the stag and the identity of the passive side to perform
an rdma operation.
    The same happens when the raid controller requests a target to perform an rdma
operation with another target: the controller receives an stag from the target that will be
the passive side of the rdma operation, and then sends an iscsi command pdu, containing
that stag, to the target that will be the active side. The scsi command pdu also identifies
the passive side of the rdma operation (in this case, the passive target). The active target
that receives the pdu uses the stag and the identity of the passive side to perform an rdma
operation.
    Fig. 16 provides a step-by-step description of 3rd Party Transfer between a host
and targets. Referring to the numbered steps in the figure:

  1. The host (internally) registers a buffer for an rdma operation. The registration pro-
     duces an stag (as described in section 2.2).

  2. The host sends an iscsi command pdu to the raid controller. The iser header of the
     pdu contains the stag.

  3. The raid controller uses the stag to send iscsi command pdus to the targets. A
     command is sent for each data block. The iser header of each pdu contains the stag
     that was received from the host with a specific offset according to the offset of the data
     block in the buffer that was registered by the host. The pdu also contains the identity
     of the passive side of the rdma operation. (In this case, it is the host.)

  4. Each target that receives an iscsi command pdu uses the stag and the identity of
     the passive side to perform an rdma operation. If a target receives multiple stags,
     it performs multiple rdma operations (one rdma operation per stag). This is due
     to the fact that InfiniBand does not allow multiple stags to be gathered into a single
     stag. Scattering and gathering of stags could have improved the system performance
     in terms of latency and throughput, thereby also improving scalability.

   3rd Party Transfer between targets is very similar to 3rd Party Transfer between a host
and targets:

  1. A target (passive target) registers a buffer for an rdma operation. The registration
     produces an stag.

  2. The passive target sends the stag to the controller.

  3. The raid controller uses the stag to send an iscsi command pdu to another target
     (active target). The iser header of the pdu contains the stag that was received from the
     passive target. The pdu also contains the identity of the passive side of the rdma
     operation. (In this case, it is the passive target.)

  4. The active target that receives the iscsi command pdu uses the stag and the identity
     of the passive side to perform an rdma operation.

    It is important to note that 3rd Party Transfer does not require any change in the rdma
mechanism or in InfiniBand (or iwarp). Moreover, the flow of a 3rd party rdma operation
is similar to that of a standard rdma operation.
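    As an illustration of how the controller might fan a host command out into per-block commands carrying the host's stag plus a per-block offset, consider the following schematic C sketch. The struct fields, the simple round-robin striping (which ignores parity rotation) and all constants are assumptions made for illustration, not the command format defined in appendix A.

    #include <stdio.h>
    #include <stdint.h>

    struct tpt_cmd {
        int      target_id;     /* target that will perform the rdma operation   */
        uint64_t stag;          /* stag taken from the host's iser header        */
        uint64_t stag_offset;   /* offset of this block in the host's buffer     */
        int      passive_id;    /* identity of the passive side (here: the host) */
    };

    int main(void)
    {
        const int      raid_size   = 4;           /* data striped over 4 targets */
        const uint64_t host_stag   = 0xabcd;      /* advertised by the host      */
        const uint64_t block_size  = 64 * 1024;
        const int      block_count = 6;           /* blocks in the host request  */

        for (int b = 0; b < block_count; b++) {
            struct tpt_cmd c = {
                .target_id   = b % raid_size,         /* round-robin placement    */
                .stag        = host_stag,
                .stag_offset = (uint64_t)b * block_size,
                .passive_id  = 0,                     /* host is the passive side */
            };
            printf("block %d -> target %d, stag 0x%llx + offset %llu\n",
                   b, c.target_id,
                   (unsigned long long)c.stag, (unsigned long long)c.stag_offset);
        }
        return 0;
    }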




Figure 16: 3rd Party Transfer between a host and targets




4.3.2   Parity calculation in the targets
Since the tpt-raid uses 3rd Party Transfer, the controller is no longer in the data path.
Therefore, it cannot perform parity calculations. Moving the parity calculation to the
targets relieves another controller bottleneck, because the controller's cpu usage is dramatically
reduced.
   The parity calculation is done in the form of a binary tree. An example is shown in
Fig. 17. The controller sends commands to specific targets to read data from other targets.
The data transfer between the targets is done using the 3rd Party Transfer mechanism. A
target that reads data from another target, performs an xor operation with its own data
and saves the result. The process continues, with each participating target reading a result
block from the previous one and xoring it with its own block, until, finally, the new parity
data is ready.




                      Figure 17: Parity calculation in the binary tree
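    The controller's scheduling of this binary-tree reduction can be sketched as follows (a self-contained C example; the target identities, pairing order and printout are illustrative only). In each round the surviving targets are paired, and every active target reads its partner's block via 3rd Party Transfer and xors it with its own, so after about log2 n rounds a single target holds the xor of all the blocks; the target holding the stripe's parity block then reads that result.

    #include <stdio.h>

    int main(void)
    {
        int ids[8] = {0, 1, 2, 3, 4, 5, 6, 7};   /* targets holding blocks of one stripe */
        int n = 8, round = 1;

        while (n > 1) {
            int survivors = 0;
            for (int i = 0; i + 1 < n; i += 2) {
                /* active target ids[i] reads from passive target ids[i+1] and xors */
                printf("round %d: target %d <- xor <- target %d\n",
                       round, ids[i], ids[i + 1]);
                ids[survivors++] = ids[i];
            }
            if (n % 2)                  /* an unpaired target just advances */
                ids[survivors++] = ids[n - 1];
            n = survivors;
            round++;
        }
        printf("target %d holds the reduced block; the parity target reads it last\n", ids[0]);
        return 0;
    }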


4.3.3   Required protocol changes
3rd Party Transfer and parity calculation in the targets require modifications to the Data-
mover architecture, iscsi and scsi. For more details, refer to appendix A.
    In the next sections, we show how 3rd Party Transfer and parity calculation in the targets
are used when executing read and write requests.




4.3.4    Symbols and Variables
In the next sections, the following symbols and variables will be used:
raid size - The number of targets in the raid
req size - Request size (kb)
block size - Block size (kb)
block count = req size / block size - The number of data blocks in a request
p - The number of data blocks in a partial stripe
c - The number of requested blocks in all complete stripes
rdma latency(k) - The latency of an rdma operation with size k
prep read latency - The latency of a single prep read command
prep write latency - The latency of a single prep write command
disk read latency(k) - The latency of a disk read operation with size k
disk write latency(k) - The latency of a disk write operation with size k
xor latency(k) - The latency of an xor operation in which the length of data that is
read/written is k
link bw - The bandwidth of the network link

4.3.5    READ requests
This section describes the steps for executing a read request in a Baseline raid and a
tpt-raid. The main difference between the two systems is that the target in the tpt-raid
writes the requested data directly to the host while the target in the Baseline raid writes
the requested data to the raid controller.

Baseline RAID
When the host sends a read command to a Baseline raid, the following steps are performed
(as shown in Fig. 18):

  1. The host sends a read command to the raid controller.

  2. The raid controller processes the received command, and sends read commands to
     the targets that contain the requested blocks.

  3. A target that receives a read command performs the following operations:

        (a) Reads the data from the disk.
        (b) Writes the requested data to the RAID controller buffer using an rdma write
            operation.
        (c) Sends a response to the raid controller.

  4. The raid controller receives all responses. At this stage, all requested data is ready in
     its local buffer. Then, it performs the following operations:

        (a) Writes the requested data to the host buffer using an rdma write operation.

      (b) Sends a response to the host.




                       Figure 18: read request in a Baseline raid

TPT-RAID
The execution of read requests in tpt-raid requires new scsi commands. These are listed
in appendix A.1. When the host sends a read command to a tpt-raid, the following steps
are performed (as shown in Fig. 19):
  1. The host sends a read command to the raid controller (similar to the Baseline raid).

  2. The raid controller processes the received command, and sends prep read commands
     for each requested data block. In the last prep read command that is sent to each
     target, the final field is set to ’1’.

  3. A target that receives a prep read command performs the following operations:

      (a) Copies the logical block address and transfer length fields from the
          cdb and the read stag from the iser header.
      (b) If the final field is set to ’1’, it performs the following operations:
            i. Allocates a buffer, and reads from the disk according to the data that was
               received in all prep read commands specifying the same value for the group
               number field as the current prep read command.
           ii. Executes rdma write operations for each of the above prep read commands.
               The buffer that was allocated and the read stag that was received with the
               prep read commands are used for the rdma write operations. The passive
               side of the rdma operation is the host.

        (c) Sends a response to the raid controller.

  4. The raid controller receives all responses. At this stage, all requested data is already
     in the host's local buffer. The controller then sends a response to the host.
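    A target's handling of prep read commands, as described in step 3, might look roughly like the following self-contained C sketch: commands belonging to one request (same group number) are accumulated, and only the command whose final field is set triggers the disk reads and the per-command rdma writes toward the host. All types and field names are simplified stand-ins for the command format in appendix A.

    #include <stdio.h>
    #include <stdint.h>

    struct prep_read {
        uint32_t group;     /* group number: ties commands of one request together */
        uint64_t lba;       /* logical block address on this target's disk         */
        uint32_t length;    /* transfer length in blocks                           */
        uint64_t stag_off;  /* offset into the host's registered buffer (stag)     */
        int      final;     /* set in the last prep read sent to this target       */
    };

    static struct prep_read pending[32];
    static int npending;

    static void target_handle_prep_read(const struct prep_read *cmd)
    {
        pending[npending++] = *cmd;
        if (!cmd->final)
            return;                       /* just record it; the response follows */

        for (int i = 0; i < npending; i++) {
            if (pending[i].group != cmd->group)
                continue;
            /* read pending[i].length blocks at pending[i].lba from the disk ...  */
            printf("rdma write of %u block(s) into host buffer at offset %llu\n",
                   (unsigned)pending[i].length,
                   (unsigned long long)pending[i].stag_off);
        }
        npending = 0;                     /* response to the controller follows   */
    }

    int main(void)
    {
        struct prep_read a = { .group = 7, .lba = 100, .length = 1, .stag_off = 0,     .final = 0 };
        struct prep_read b = { .group = 7, .lba = 104, .length = 1, .stag_off = 65536, .final = 1 };
        target_handle_prep_read(&a);
        target_handle_prep_read(&b);
        return 0;
    }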




                          Figure 19: read request in a tpt-raid


4.3.6    Parity block handling in READ requests
When a read request is executed, the controller requests the targets to read from the disk.
No parity blocks need to be read. However, in the case of multi-stripe reads it may be more
efficient to read them anyway: if multiple data blocks need to be read from a given disk and
a parity block is located between two of them, the disk may be better off reading the entire
range from the media in a single operation and discarding the parity block(s), instead of
executing multiple disk operations.
    Operating the disk more efficiently affects both the zero-load latency of responding to a
user request and the amount of work done by the disk drive in granting the request. A
reduction in the amount of work (disk time) per user request increases the system's maximum
throughput. It also reduces the load on the system for any given workload, thereby reducing
the queuing delay and the response time experienced by the user, especially in heavy-workload
situations.
    An example is shown in Fig. 20:

   • Target 0 will read 3 blocks starting from block 5.

   • Target 1 will read 4 blocks starting from block 6. Parity 12-15 will be ignored.

   • Target 2 will read 4 blocks starting from block 2. Parity 8-11 will be ignored.

   • Target 3 will read 4 blocks starting from block 3. Parity 4-7 will be ignored.

   • Target 4 will read 3 blocks starting from block 4. In this case, parity 0-3 is not read.
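    A minimal C sketch of the per-disk coalescing decision illustrated above is given below. The block numbers, and the assumption that the gap between the two requested extents holds only parity blocks, are hypothetical; they do not reproduce the exact layout of Fig. 20.

    #include <stdio.h>

    struct extent { int first, count; };        /* requested blocks on one disk */

    int main(void)
    {
        /* two requested extents separated by a (hypothetical) parity block */
        struct extent want[2] = { { 6, 2 }, { 9, 2 } };

        int gap = want[1].first - (want[0].first + want[0].count);
        if (gap > 0) {
            /* gap assumed to hold only parity blocks for this request:
             * issue one contiguous read and discard the parity on the way */
            int total = (want[1].first + want[1].count) - want[0].first;
            printf("one read: %d blocks from block %d (discard %d parity block(s))\n",
                   total, want[0].first, gap);
        } else {
            printf("two reads: %d@%d and %d@%d\n",
                   want[0].count, want[0].first, want[1].count, want[1].first);
        }
        return 0;
    }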

   We assume that the Baseline system target is not aware of the difference between data
blocks and parity blocks and, therefore, will send all data that was read from the disk to
the controller using an rdma operation. The target in the tpt-raid system is aware of the
difference between data blocks and parity blocks and will send only data blocks to the host.
Therefore, the difference is only in the load on the interconnect, not on the disks themselves.
   The difference in the behavior of the two targets influences the system throughput because
some of the network bandwidth is wasted on useless parity blocks in the case of the Baseline
system. This further increases the difference in load on the interconnect, and also increases
the load on the target and host ports in the Baseline system.




                    Figure 20: Reading parity blocks in read requests

   Algorithm 4.1 calculates the number of parity blocks that are sent from the targets to
the controller in the Baseline system:




Algorithm 4.1: get parity blocks cnt(req size, first data block idx, first parity block idx)

   parity block cnt = 0
   stripes = get stripe list(req size, first data block idx, first parity block idx)
   for tgt id = 0 to raid size − 1
     do for each stripe ∈ stripes
          do if first stripe(stripe) or last stripe(stripe)
               then continue
             if parity block(stripe[tgt id])
               then parity block cnt = parity block cnt + 1
   return (parity block cnt)


   read req parity blocks cnt(req size) =
        Σ_{fpbi=0}^{raid size−1} (1/raid size) · Σ_{fdbi=0, fdbi≠fpbi}^{raid size−1} (1/(raid size−1)) ·
        get parity blocks cnt(req size, fdbi, fpbi)


4.3.7     WRITE requests
This section describes the steps for executing a write request in a Baseline raid and a
tpt-raid. There are two main differences between the two systems:

  1. The target in the tpt-raid reads the new data directly from the host while the target
     in the Baseline raid reads it from the raid controller.

  2. ecc calculation in the tpt-raid system is carried out in the targets. In the Baseline
     raid system, ecc calculation is carried out in the raid controller.

     During the execution of a write request, data may be sent to the target as unsolicited
data and/or as solicited data [2]. In the tpt-raid, we only modify solicited writes. Unso-
licited data is sent from the host to the controller as a Protocol Data Unit (pdu) without
using rdma services, rendering 3rd Party Transfer irrelevant to this case.

Baseline RAID
When the host sends a write command to a Baseline raid, the following steps are performed
(as shown in Fig. 21):

  1. The host sends a write command to the raid controller.

  2. The raid controller processes the received command, and performs the following op-
     erations:


   (a) Reads the requested data from the host using an rdma read operation.
   (b) Decides which blocks should be read from the targets: Only blocks from partial
       stripes should be read. For each partial stripe, the raid controller decides whether
       to use incremental or batch ecc calculation, and decides which blocks should be
       read from the stripe according to the selected ecc calculation method.
   (c) Sends read commands to the targets that contain blocks that need to be read.
   (d) Recalculates the parity block for each complete stripe.

3. A target that receives a read command performs the following operations:

   (a) Reads the data from the disk.
   (b) Writes the requested data to the RAID controller buffer using an rdma write
       operation.
   (c) Sends a response to the raid controller.

4. The raid controller receives all responses. At this stage, the raid controller can
   recalculate all ecc data, and it performs the following operations:

   (a) Recalculates the parity block for each partial stripe.
   (b) Sends write commands to the targets that contain the requested blocks and
       recalculated parity blocks.




                    Figure 21: write request in a Baseline raid

5. A target that receives a write command performs the following operations:

   (a) Reads the data from the RAID controller buffer using an rdma read operation.

      (b) Writes the data to the disk.
      (c) Sends a response to the raid controller.

  6. The raid controller receives all responses. At this stage, all requested data has been
     written to the disks, and it sends a response to the host.

TPT-RAID
The execution of write requests in tpt-raid requires new scsi commands. These are
listed in appendix A.1.
    Multi-stripe write requests contain 0-2 partial-stripes (the first and/or last stripe) and
complete stripes. These requests are handled as 0-2 partial-stripe requests and a single
multi-stripe request (for all complete stripes). The treatment of the latter is optimized by
reading the blocks in any given target during parity calculation in a single command. Hence
we use different commands for partial and complete stripes.
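    The split into partial and complete stripes can be sketched by the following self-contained C fragment, which, for an illustrative request, counts how many requested data blocks fall into a leading partial stripe, into complete stripes, and into a trailing partial stripe. The geometry and block numbering are assumptions chosen only for the example.

    #include <stdio.h>

    int main(void)
    {
        const int raid_size = 5;                  /* 4 data blocks + 1 parity per stripe */
        const int data_per_stripe = raid_size - 1;
        const int start = 6, count = 13;          /* requested data blocks 6..18         */
        const int end = start + count - 1;

        int head = (start % data_per_stripe)      /* blocks in a leading partial stripe  */
                       ? data_per_stripe - (start % data_per_stripe) : 0;
        if (head > count)
            head = count;
        int tail = (end + 1) % data_per_stripe;   /* blocks in a trailing partial stripe */
        if (head + tail > count)
            tail = count - head;
        int complete = count - head - tail;       /* blocks that fill complete stripes   */

        printf("leading partial stripe : %d block(s)\n", head);
        printf("complete stripes       : %d block(s) (%d stripe(s))\n",
               complete, complete / data_per_stripe);
        printf("trailing partial stripe: %d block(s)\n", tail);
        return 0;
    }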
    When the host sends a write command to a tpt-raid, the following steps are performed
(as shown in Fig. 22):

  1. The host sends a write command to the raid controller.

  2. The raid controller processes the received command, and performs the following op-
     erations:

      (a) sends prep write commands to the targets for each requested data block and
          each parity block that needs to be recalculated in complete stripes. If the request
          contains multiple stripes, a target may receive multiple prep write commands
          for that request. In the last prep write command that is sent to each target,
          the final field is set to ’1’.
      (b) For each partial stripe, it decides which parity calculation method will be used.
          The following operations are performed according to the selected parity calcula-
          tion method:
             i. Incremental parity calculation: Sends xdwrite commands (the disable
                write bit in the cdb is set to 1) to targets that contain requested data blocks
                in partial stripes for each data block and read old block commands to
                targets that contain parity blocks in partial stripes for each parity block. The
                parity mode field in the cdb of each read old block command indicates
                that incremental parity calculation is being used.
            ii. Batch parity calculation: Sends read new block commands to targets that
                contain requested data blocks in partial stripes for each data block and read
                old block commands to targets that contain all other data blocks in partial
                stripes for each data block. The parity mode field in the cdb of each read
                old block command indicates that batch parity calculation is being used.

  3. A target that receives a prep write command performs the following operations:

Figure 22: write request in a tpt-raid


   (a) Copies the logical block address and transfer length fields from the
       cdb and the write stag from the iser header.
   (b) If the final field is set to ’1’, it allocates a buffer, and executes rdma read
       operations for each prep write command specifying the same value for the
       group number field as the current prep write command. The buffer that was
       allocated and the write stag that was received with the prep write commands
       are used for the rdma read operations. The passive side of the rdma operation
       is the host. The data that is read from the host belongs to complete stripes.
    (c) Sends a response to the raid controller.
4. A target that receives an xdwrite command performs the following operations:
   (a) Reads the old data block from the disk according to the logical disk address and
       transfer length in the cdb.
   (b) Reads the new data block (that belongs to a partial stripe) from the host using
       an rdma read operation.
    (c) Sends a response to the raid controller.
5. A target that receives a read old block command performs the following operations:
   (a) Reads the old parity block from the disk according to the logical disk address
       and transfer length in the cdb. The parity mode field in the cdb is used to
       determine how to use the parity block later for parity calculations.
   (b) Sends a response to the raid controller.
6. A target that receives a read new block command performs the following operations:
   (a) Reads the new data block from the host using an rdma read operation.
   (b) Sends a response to the raid controller.
7. The raid controller receives all responses. At this stage, each target has the following
   data:

     • Partial stripes:
          – For stripes in which incremental parity calculation is used: If the target
            contains the parity block of the stripe, the old value of the parity block is
            stored in a local buffer. If the target contains a requested data block in the
            stripe, the new value of the data block is stored in a local buffer and the
            result of an xor operation between the old value and new value of the data
            block is stored in another buffer.
          – For stripes in which batch parity calculation is used: If the target contains a
            requested data block in the stripe, the new value of the data block is stored
            in a local buffer. If the target contains a data block that wasn’t requested in
            the stripe, the old value of the data block is stored in a local buffer.

  • Complete stripes: If the target contains the parity block of the stripe, a buffer
    that contains zero values is stored. If the target contains a data block in the
    stripe, the new value of the data block is stored in a local buffer.

Now, the raid controller may start orchestrating the recalculation of ecc data. The
new parity blocks are recalculated in the form of a binary tree. The calculations for
partial stripes and for complete stripes are done separately:

(a) ecc calculation for partial stripes is shown in algorithms 4.2, 4.3 and 4.4:

     Algorithm 4.2: ECC calculation for partial stripes(partial stripes)

        for each stripe ∈ partial stripes
          do if incremental parity(stripe)
               then calc ecc incremental(stripe)
               else calc ecc batch(stripe)


     Algorithm 4.3: calc ecc incremental(stripe)

        blocks = {all requested data blocks in stripe}
        while list length(blocks) > 1
          do Set blocks in random order
             active blocks = ∅
             passive blocks = ∅
             for i = 0 to list length(blocks) − 1 step 2
               do active blocks = active blocks ∪ {blocks[i]}
                  if i + 1 < list length(blocks)
                    then passive blocks = passive blocks ∪ {blocks[i + 1]}
             for i = 0 to list length(passive blocks) − 1
               do Send read parity part tmp command to active blocks[i].
                  Set the rdma destination id field in the command to passive blocks[i].
                  Set the parity mode field to indicate that incremental parity calculation
                  is used.
             Wait for responses for all read parity part tmp commands
             blocks = active blocks
        Send read parity part block to the target that contains the parity block.
        Set the rdma destination id field in the command to blocks[0].
        Set the parity mode field to indicate that incremental parity calculation is used.

     Algorithm 4.4: calc ecc batch(stripe)

        blocks = {all data blocks in stripe}
        while list length(blocks) > 1
          do Set blocks in random order
             active blocks = ∅
             passive blocks = ∅
             for i = 0 to list length(blocks) − 1 step 2
               do active blocks = active blocks ∪ {blocks[i]}
                  if i + 1 < list length(blocks)
                    then passive blocks = passive blocks ∪ {blocks[i + 1]}
             for i = 0 to list length(passive blocks) − 1
               do Send read parity part tmp command to active blocks[i].
                  Set the rdma destination id field in the command to passive blocks[i].
                  Set the parity mode field to indicate that batch parity calculation is used.
             Wait for responses for all read parity part tmp commands
             blocks = active blocks
        Send read parity part block to the target that contains the parity block.
        Set the rdma destination id field in the command to blocks[0].
        Set the parity mode field to indicate that batch parity calculation is used.


(b) ecc calculation for complete stripes is shown in algorithm 4.5:




        Algorithm 4.5: ecc calculation for complete stripes(a)

           targets = {all targets}
           while list length(targets) > 1
             do Set targets in random order
                active targets = ∅
                passive targets = ∅
                for i = 0 to list length(targets) − 1 step 2
                  do active targets = active targets ∪ {targets[i]}
                     if i + 1 < list length(targets)
                       then passive targets = passive targets ∪ {targets[i + 1]}
                for i = 0 to list length(passive targets) − 1
                  do Send read parity comp tmp command to active targets[i].
                     Set the rdma destination id field in the command to passive targets[i].
                     Set the expected length field to the number of complete stripes.
                Wait for responses for all read parity comp tmp commands
                targets = active targets
           for each stripe ∈ complete stripes
             do Send read parity comp block command to the target that contains
                the parity block.
                Set the rdma destination id field in the command to targets[0].
8. A target that receives a read parity part tmp command performs the following
   operations:

   (a) Reads a block from the target whose id appears in the rdma destination id field
       in the command using an rdma read operation.
   (b) Checks the value of the parity mode field in the cdb:
         i. Incremental parity calculation: It performs an xor operation between the
            rdma read buffer and the buffer that was allocated to store the result of
            the xor operation of the xdwrite command specifying the same values
            for the group number field, the logical block address field, and the
            transfer length field. The result will be stored in the buffer that was
            allocated in the xdwrite command.
        ii. Batch parity calculation: If no previous read parity part tmp commands
            specifying the same values for the group number field, the logical block
            address field and the transfer length field were received, it performs
            an xor operation between the rdma read buffer and the buffer that was
            allocated in the read old block or read new block specifying the same
            values for the group number field, the logical block address field, and
            the transfer length field. Else, it performs an xor operation between

            the rdma read buffer and the buffer in which the xor result was stored in
            the previous read parity part tmp command specifying the same values
            for the group number field, the logical block address field, and the
            transfer length field.
    (c) Sends a response to the raid controller.

 9. A target that receives a read parity part block command performs the following
    operations:

    (a) Reads a block from the target whose id appears in the rdma destination id field
        in the command using an rdma read operation.
    (b) Checks the value of the parity mode field in the cdb:
          i. Incremental parity calculation: It performs an xor operation between the
             rdma read block and the block that was allocated in the read old block
             command specifying the same values for the group number field, the log-
             ical block address field, and the transfer length field. The result of
             the xor operation is the new parity block of a partial stripe.
         ii. Batch parity calculation: The block that was read in the rdma operation is
             the new parity block of a partial stripe.
    (c) Sends a response to the raid controller.

10. A target that receives a read parity comp tmp command performs the following
    operations:

    (a) Reads blocks from the target whose id appears in the rdma destination id field
        in the command using an rdma read operation.
    (b) If no previous read parity comp tmp commands specifying the same value
         for the group number field were received, it performs an xor operation between
        the rdma read buffer and the buffer that was allocated in the last prep write
        command specifying the same values for the group number field, the logical
        block address field, and the transfer length field as the read parity
        comp tmp command. It stores the result in a target buffer. Else, it performs
        an xor operation between the rdma read buffer and the target buffer in which
        the xor result was stored in the previous read parity comp tmp command
        specifying the same values for the group number field, the logical block
        address field, and the transfer length field.
    (c) Sends a response to the raid controller.

11. A target that receives a read parity comp block command performs the following
    operations:


        (a) If the own field in the cdb is set to 0, it reads a block from the target whose id
            appears in the rdma destination id field in the command using an rdma read
            operation. Else, the block is already stored in the target in the buffer in which
            the xor result was stored in the previous read parity comp tmp command
            specifying the same values for the group number field, the logical block
            address field, and the transfer length field. The block that was read is the
            new parity block of a complete stripe.
        (b) Sends a response to the raid controller.

 12. The raid controller receives all responses. At this stage, all targets have the new
     parity blocks in local buffers. The new data is already stored in local buffers in the
     targets. Now, each target can write the new data to the disk. The raid controller
     sends a write new data command to each target that contains requested data.

 13. A target that receives a write new data command writes the following data to the
     disk:

         (a) New data blocks in partial stripes that were read in xdwrite and read new
             block commands.
        (b) New data blocks in complete stripes that were read in the last prep write
            command.
         (c) Parity blocks of partial stripes that were read in read parity part block
             commands.
        (d) Parity blocks of complete stripes that were read in read parity comp block
            commands.

        Then, the target sends a response to the raid controller.

 14. The raid controller receives all responses. At this stage, the requested data was written
     to the disks, and the parity blocks were updated. The raid controller sends a response
     to the host.

4.3.8     Considerations in ECC calculation
As mentioned earlier, ecc calculation in write requests is done in the form of a binary tree.
However, there are other ways to calculate ecc. We will now compare the possible methods
for ecc calculation in the tpt-raid (as shown in Fig. 23) and show why the binary tree
method was chosen:

  1. Chain: The parity block is calculated by passing a temporary result between the
     targets. Each target, in its turn, recalculates the temporary result by performing an
     xor operation between the received temporary result and its own data. For n targets,
     n-1 steps are required. Each step contains a single rdma operation.

  2. Star: Each target sends its own data to the target that contains the parity block.
     Then, the target that contains the parity block performs an xor operation between
     the received data and its own data. For n targets, a single step is required, in which
     n-1 rdma operations are performed. With this method, the center gets all the load;
     however, since raid-5 uses rotating parity, the identity of the target that serves as the
     center is effectively random, which keeps the system load balanced.

  3. Binary tree: For n targets, log2 n steps are required. A total of n-1 rdma operations
     are required. For more details, refer to appendix C.2.

   Table 1 shows a comparison of the three methods:

                             Table 1: ecc calculation methods

                                                Chain    Star    Binary tree
                RDMA operations count            n-1      n-1        n-1
                Number of steps                  n-1       1        log2 n

    Each rdma operation may be counted as two operations (since it involves an active
target and a passive target). However, since this is the case for all three parity calculation
methods, it doesn't matter whether each rdma operation is counted as one operation or two.
    All three methods thus require the same amount of work (in terms of rdma operations).
However, the star method may be problematic in terms of zero-load latency because, for a
single request, a single target has to read data from all other targets and perform all the xor
operations by itself, whereas with the binary tree scheme the work is divided among all targets.
    The chain method is also problematic because its latency is higher than that of the binary
tree method.
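    The counts in Table 1 can be reproduced with a tiny C helper (shown for illustration only): all three methods perform n-1 rdma operations, but they differ in the number of sequential steps.

    #include <stdio.h>

    int main(void)
    {
        for (int n = 2; n <= 16; n *= 2) {
            int tree_steps = 0;
            for (int m = n; m > 1; m = (m + 1) / 2)   /* pairing rounds of the binary tree */
                tree_steps++;
            printf("n=%2d: rdma ops=%2d, steps: chain=%2d, star=1, binary tree=%d\n",
                   n, n - 1, n - 1, tree_steps);
        }
        return 0;
    }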




             (a) Chain




              (b) Star




           (c) Binary tree

Figure 23: ECC calculation methods




4.3.9   Two-phase commit in WRITE requests
When the tpt-raid executes a write request, it is required not to lose data and the request
execution must be atomic (i.e. when the request is completed, all new data and parity is
written to the disks or nothing is written to the disks and an error code is returned). Another
requirement is not to involve the host after the data to be written to the disks was read from
it.
    In order to prove the necessity of two-phase commit for atomicity and for the ability not
to lose data, consider a protocol that does not use it (single-phase commit protocol). The
difference between the single-phase commit protocol and the protocol that was described in
section 4.3.7 (two-phase commit protocol) is that with the single-phase commit protocol,
when a target completes its role in the request execution, the controller notifies the target
that it should complete the request by sending it a comp req command. The target writes
the new data to the disk and releases all buffers that are related to the request. The extra
command is required because the target has data that will be read in an rdma operation by
another target during the request execution. The write new data command is not used
in the single-phase commit protocol.
    The following example shows that the single-phase commit protocol does not ensure data
integrity and atomicity. The example shows that the controller should permit deletion of old
data/parity only once it knows that all targets possess the new data/parity as applicable.
Otherwise, there is no useful error correction group, and the occurrence of a fault will not
be fixable by the raid system itself. In the example, we use raid-5 with 5 disks. The
controller receives a write request that contains two data blocks in a single stripe: one
block on target 1, one block on target 2 and a parity block on target 3. The beginning of
the request execution is common to both protocols:

   • The controller sends read new block commands to targets 1 and 2. Each target
     reads its new data block from the host and sends a response to the controller.

   • The controller sends read old block commands to targets 0 and 4. Each target
     reads a data block from its disk and sends a response to the controller.

   • The controller sends a read parity part tmp command to target 1 (indicating that
     it should read a block from target 4). Target 1 reads a block from target 4, performs
      an xor operation with the data block that it read from the host and saves the result.
     Then, it sends a response to the controller.

   • The controller sends a read parity part tmp command to target 0 (indicating that
     it should read a block from target 2). Target 0 reads a block from target 2, performs
      an xor operation with the data block that it read from the disk and saves the result.
     Then, it sends a response to the controller.

   From this point, the two protocols act differently. The request execution continues in the
system that uses the two-phase commit protocol:


   • The controller sends a read parity part tmp command to target 1 (indicating that
     it should read a block from target 0). Target 1 reads a block from target 0, performs
      an xor operation with the saved result and saves the result. Then, it sends a response
     to the controller.

    At this point, target 1 crashes. No data was written to the disks yet, so no data is lost
and atomicity is preserved. The system moves to degraded mode and the faulty disk can be
easily reconstructed using the data from all other disks.
    The following commands are executed in the system that uses the single-phase commit
protocol:

   • The controller sends a comp req command to target 4. Target 4 releases all buffers
      that belong to the request and sends a response to the controller.

   • The controller sends a comp req command to target 2. Target 2 writes the new data
     to its disk and releases all buffers that belong to the request. Then, it sends a response
     to the controller.

   • The controller sends a read parity part tmp command to target 1 (indicating that
     it should read a block from target 0). Target 1 reads a block from target 0, performs
      an xor operation with the saved result and saves the result. Then, it sends a response
     to the controller.

   • The controller sends a comp req command to target 0. Target 0 releases all buffers
     that belong to the request and sends a response to the controller.

    At this point, target 1 crashes. Now, target 3 (the parity target) needs to read the parity
block from target 1, but cannot. The stripe now contains the following data:

   • Target 0: current data block

   • Target 1: faulty

   • Target 2: new block only

   • Target 3: old parity block

   • Target 4: current data block

    The command cannot be completed because two blocks are missing (on targets 1 and
3) and there is not enough data to recover them (without reading again from the host). On
the other hand, it is impossible to roll back the request because target 2 has already written
the new block to its disk and no longer has the old block. The old block of target 2 cannot be
derived by xoring the data from the other targets because two blocks are missing.
    The old or the new data of target 1 cannot be derived, resulting in loss of data. Thus,
this is not merely an atomicity problem as the system has neither version of the data of

target 1. This is important whenever the blocks of an ecc group have no affinity (e.g., they
belong to different parties). Then, atomicity is not critical, but data loss is.
    Had target 2 saved the old block, recovery would have been possible. However, since the
system does not use two-phase commit, target 2 had to delete the old block from its memory
when it received the comp req command, at a point when not all targets yet had the new
data (to be written to the disk) ready.
    It is important to note that using the single-phase commit protocol does not save com-
mands because the controller has to send the same amount of comp req commands instead
of write new data commands.




4.4    Maximum Throughput comparison
In this section, we compare the total amounts of work that must be carried out by the system
for read and write requests in the Baseline raid and tpt-raid systems. By calculating
the amount of work of each system resource per request, we can estimate the number of
requests/sec that each system resource can handle in the following way:
                 resource max reqs/sec = resource max thr / resource work per req

   The resource that has the lowest resource max reqs/sec value is the system's bottleneck.
Therefore, the system maximum throughput is calculated in the following way:

           system max thr = min ∀resource∈system resources {resource max reqs/sec}
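   A toy C illustration of this bottleneck model follows; the resource names and numbers are made up and serve only to show how the minimum over all resources determines the system throughput.

    #include <stdio.h>

    struct resource {
        const char *name;
        double max_thr;        /* capacity of the resource (e.g., MB/s or cpu %)  */
        double work_per_req;   /* amount of that capacity consumed by one request */
    };

    int main(void)
    {
        struct resource res[] = {
            { "controller cpu",  100.0, 0.004 },
            { "controller link", 940.0, 0.128 },
            { "target link",     940.0, 0.032 },
        };
        double system_max = 0;
        const char *bottleneck = NULL;

        for (int i = 0; i < 3; i++) {
            double reqs = res[i].max_thr / res[i].work_per_req;  /* max requests/sec */
            if (bottleneck == NULL || reqs < system_max) {
                system_max = reqs;
                bottleneck = res[i].name;
            }
        }
        printf("bottleneck: %s, system max throughput: %.0f requests/sec\n",
               bottleneck, system_max);
        return 0;
    }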
   The following resources are used when executing requests:

   • cpu: The cpu is used for the following operations:

        – Sending pdus and initiating rdma operations: In our implementation, both are
          executed in the same context. The cpu percentage usage per (pdu per sec) is
          more or less constant and is marked as cpu pdu work.
        – Receiving pdus: In our implementation, receiving pdus requires less cpu work
          than sending pdus. Not every received pdu results in an interrupt, as Infini-
          Band permits them to be collected and removed from the queue as a batch (by
          polling) rather than interrupting the processor repeatedly. Therefore, the impact
          of received pdus will be ignored. Handling rdma operations as the passive side
          doesn’t require any cpu work.
        – Parity calculation: The cpu usage for xor operations per (block per second) is
          more or less constant and is marked as cpu xor block.

      In this document, cpu usage is expressed in percents.

   • Network link: The network link is used for sending control and data. However, when
     calculating the communication load, we ignore control messages since they are very
     short relative to the size of data that is transferred in rdma operations. The network
     link (InfiniBand) works in full-duplex mode, so sending data and receiving data may
     be considered as using separate resources.

   • Disk: Since the disk operations are identical in the two systems and we want to compare
     the behavior of all other components of the two systems, we assume that the targets
     have disks that never become a bottleneck (because each target box contains many
     disks and/or the disks have large caches that result in memory transfers most of the
     time instead of accessing the media). However, we do not ignore the memory access
     operations that disk operations cause.


   • Memory: The memory is used in the following operations:

          – rdma operations
          – Disk operations
          – Parity calculation

        The memory may also become a bottleneck when data is read and written to it in
        parallel. This is because the requests that are sent towards the memory are interleaved.
        This may happen when rdma operations require reading from the memory while other
        rdma operations require writing to the memory.
   In sections 4.4.1 and 4.4.2, we describe how the system maximum throughput is cal-
culated. A detailed calculation may be found in appendix C.

4.4.1     READ requests
In order to assess the maximum system throughput for read requests for each system, we
calculate the maximum throughput of each entity (host, controller and targets):

    The maximum throughput of the host depends on the maximum throughput of its net-
work link (net max read req^host/sec) and memory (mem max read req^host/sec). The influ-
ence of the cpu is ignored because only a single command is sent for each request. Therefore,
the maximum throughput of the host is:

   host max read thr = min(net max read req^host/sec, mem max read req^host/sec)

    The maximum throughput of the Baseline controller depends on the maximum number
of requests/sec that the cpu can handle (cpu max read req^ctrl_Baseline/sec), the maximum
throughput of its network link (net send max read req^ctrl_Baseline/sec and
net rcv max read req^ctrl_Baseline/sec) and memory (mem max read req^ctrl_Baseline/sec).
Therefore, the maximum throughput of the Baseline controller is:

   ctrl max read thr_Baseline = min(cpu max read req^ctrl_Baseline/sec,
                                    net rcv max read req^ctrl_Baseline/sec,
                                    net send max read req^ctrl_Baseline/sec,
                                    mem max read req^ctrl_Baseline/sec)



    The maximum throughput of the tpt controller depends only on the maximum number
of requests/sec that the cpu can handle (cpu max read req^ctrl_tpt/sec):

   ctrl max read thr_tpt = cpu max read req^ctrl_tpt/sec

    The maximum throughput of the Baseline targets depends on the maximum throughput
of the network link (net send max read req^target_Baseline/sec) and memory
(mem max read req^target_Baseline/sec) in each target. The influence of the cpu is ignored
because only a single command is sent to each target per request. Therefore, the maximum
throughput of the Baseline targets is:

   target max read thr_Baseline = min(net send max read req^target_Baseline/sec,
                                      mem max read req^target_Baseline/sec)

    The maximum throughput of the tpt targets depends on the maximum number of re-
quests/sec that the cpu can handle (cpu max read req^target_tpt/sec), the maximum through-
put of the network link (net send max read req^target_tpt/sec) and memory
(mem max read req^target_tpt/sec) in each target. Therefore, the maximum throughput of
the tpt targets is:

   target max read thr_tpt = min(cpu max read req^target_tpt/sec,
                                 net send max read req^target_tpt/sec,
                                 mem max read req^target_tpt/sec)



   The maximum throughput of the Baseline system is:

   max read thr_Baseline = min(host max read thr, ctrl max read thr_Baseline, target max read thr_Baseline)



   Assuming similar hardware for all entities, the controller is the bottleneck in the Baseline
system because it has to do much more work per operation, especially on its network link.

   The maximum throughput of the tpt system is:

   max read thr_tpt = min(host max read thr, ctrl max read thr_tpt, target max read thr_tpt)
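   To make the structure of these expressions concrete, the following Python sketch evaluates
the same min-of-bottlenecks model for both systems. All capacity figures are illustrative
assumptions (requests/sec), not measurements, and the resource names are placeholders.

# Minimal sketch of the min-of-bottlenecks READ-throughput model above.
# All capacity numbers are illustrative assumptions (requests/sec), not measurements.

def entity_max_thr(caps):
    """An entity is limited by its slowest resource (cpu, network link, memory)."""
    return min(caps.values())

def system_max_read_thr(host_caps, ctrl_caps, target_caps):
    """The system is limited by its slowest entity."""
    return min(entity_max_thr(host_caps),
               entity_max_thr(ctrl_caps),
               entity_max_thr(target_caps))

# Baseline: the controller is bounded by cpu, both network directions and memory.
baseline = system_max_read_thr(
    host_caps={"net": 12000, "mem": 20000},
    ctrl_caps={"cpu": 9000, "net_rcv": 6000, "net_send": 6000, "mem": 10000},
    target_caps={"net_send": 30000, "mem": 40000},
)

# TPT: no data passes through the controller, so only its cpu appears.
tpt = system_max_read_thr(
    host_caps={"net": 12000, "mem": 20000},
    ctrl_caps={"cpu": 9000},
    target_caps={"cpu": 25000, "net_send": 30000, "mem": 40000},
)

print(baseline, tpt)   # with these placeholders: 6000 (Baseline) vs 9000 (TPT)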

4.4.2   WRITE requests
In order to assess the maximum system throughput for write requests for each system, we
calculate the maximum throughput of each entity (host, controller and targets):



   The maximum throughput of the host is identical to its maximum throughput for read
requests:

                     host max write thr = host max read thr
    The maximum throughput of the Baseline controller depends on the maximum number
of requests/sec that the cpu can handle (cpu max write req^ctrl_Baseline/sec), the maximum
throughput of its network link (net send max write req^ctrl_Baseline/sec and
net rcv max write req^ctrl_Baseline/sec) and memory (mem max write req^ctrl_Baseline/sec). Therefore,
the maximum throughput of the Baseline controller is:

   ctrl max write thr_Baseline = min(cpu max write req^ctrl_Baseline/sec, net rcv max write req^ctrl_Baseline/sec,
                                     net send max write req^ctrl_Baseline/sec, mem max write req^ctrl_Baseline/sec)



    The maximum throughput of the tpt controller depends only on the maximum number
of requests/sec that the cpu can handle (cpu max write req^ctrl_tpt/sec):

   ctrl max write thr_tpt = cpu max write req^ctrl_tpt/sec
   The maximum throughput of the Baseline targets depends on the maximum number of
requests/sec that the cpu can handle (cpu max write req^target_Baseline/sec), the maximum throughput
of the network link (net send max write req^target_Baseline/sec and net rcv max write req^target_Baseline/sec)
and memory (mem max write req^target_Baseline/sec) in each target. Therefore, the maximum
throughput of the Baseline targets is:

   target max write thr_Baseline = min(cpu max write req^target_Baseline/sec, net rcv max write req^target_Baseline/sec,
                                       net send max write req^target_Baseline/sec, mem max write req^target_Baseline/sec)



    The maximum throughput of the tpt targets depends on the maximum number of
requests/sec that the cpu can handle (cpu max write req^target_tpt/sec), the maximum throughput
of the network link (net send max write req^target_tpt/sec and net rcv max write req^target_tpt/sec)
and memory (mem max write req^target_tpt/sec) in each target. Therefore, the maximum
throughput of the tpt targets is:

   target max write thr_tpt = min(cpu max write req^target_tpt/sec, net rcv max write req^target_tpt/sec,
                                  net send max write req^target_tpt/sec, mem max write req^target_tpt/sec)

   The maximum throughput of the Baseline system is:

   max write thr_Baseline = min(host max write thr, ctrl max write thr_Baseline, target max write thr_Baseline)



    As with read requests, the controller becomes the bottleneck in the Baseline system
because it has to do much more work per operation.

   The maximum throughput of the tpt system is:

   max write thr_tpt = min(host max write thr, ctrl max write thr_tpt, target max write thr_tpt)




4.5     Latency comparison
In this section, we compare the latency for read and write requests in the Baseline raid
and tpt-raid systems. We measure zero-load latency (i.e. when only a single request is
executed). The measurement starts when a command is sent and ends when a response is
received.

4.5.1   READ requests
When executing read requests, the two systems differ in several ways:

   • The main difference is that a read request in the Baseline raid requires that the
     data be transferred from the targets to the raid controller and then from the raid
     controller to the host (2 hops) while in the tpt-raid, the data is transferred directly
     from the targets to the host (single hop).

   • As described in 4.3.6, when the controller in the Baseline system requests that a target
     read data from its disk and send it to the controller, the target may also read unnec-
     essary parity blocks. This doesn't happen in the tpt-raid system. Therefore, the
     latency of the rdma write operation in the target in the Baseline system is roughly
     1/raid size higher than in the tpt-raid system.

   • A read request in the tpt-raid has more overhead because of the following reasons:

         – More scsi commands (prep read) are required for each read request (and,
           therefore, more interrupts occur, more cpu work is required and the latency of
           sending the commands to the targets increases).
         – Each data block that is read from a target requires an rdma operation. Therefore,
           if more than a single data block is read from a target, multiple rdma operations
           are required for that target (while in the Baseline raid, a single rdma operation
           is required). However, our measurements show that multiple rdma operations do
           not add much latency and may be ignored.

    Assuming that the execution of a prep read command comprises multiple stages that can
be pipelined, the latency incurred by the prep read commands is the sum of the latency
of a single command and the longest pipeline stage for each of the other commands:

   latency_prep read(block count) = prep read latency + (block count − 1) · longest stage in pipe

However, there are several problems:

   • Some pdus (prep read commands) are received as a result of an interrupt while
     others are received as a result of polling the hardware. It is hard to predict whether a

        pdu will be received in an interrupt or by polling when multiple pdus are received in
        parallel. This applies to both controller and targets.

   • When measuring zero-load latency, only a few prep read commands are sent for a
     single request. The system is not in steady state and is therefore noisier.

    The above calculation is accurate mainly in two scenarios: 1) a single-block request; 2) a
request with a very large number of blocks, for which the system is in steady state and almost
all pdus are received by polling the hardware.
    Therefore, we use an upper bound on the latency of the prep read commands, charging
the full latency of a single prep read command for each command. Our measurements
(in section 6.4.3) show that the difference between the upper bound and the actual results
is roughly 30%.

   latency^upper bound_prep read(block count) = block count · prep read latency
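   For illustration only, the following sketch contrasts the pipelined estimate with the upper
bound; the two timing constants are assumed values, not measurements.

# Sketch of the pipelined estimate vs. the upper bound used above.
# PREP_READ_LATENCY and LONGEST_STAGE are assumed, illustrative values (microseconds).
PREP_READ_LATENCY = 20.0
LONGEST_STAGE = 6.0

def prep_read_latency_pipelined(block_count):
    # One full command, plus the longest pipeline stage for every additional command.
    return PREP_READ_LATENCY + (block_count - 1) * LONGEST_STAGE

def prep_read_latency_upper_bound(block_count):
    # Upper bound: charge the full command latency for every block.
    return block_count * PREP_READ_LATENCY

for bc in (1, 4, 16):
    print(bc, prep_read_latency_pipelined(bc), prep_read_latency_upper_bound(bc))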

We next calculate the latency difference between the two systems. The average data length
per target is req size/raid size. Therefore, the latency difference is:

                  latency^baseline_read(req size) − latency^tpt_read(req size) =
        rdma latency(req size) + (1/raid size) · rdma latency(req size/raid size) −
                          latency^upper bound_prep read(block count)

4.5.2     WRITE requests
We next calculate the latency of a single write request in each system.

Baseline raid
The controller reads new data from the host: block count blocks are transferred from the
host to the controller. Therefore:

   latency^Baseline_rdma host→controller(req size) = rdma latency(req size)

    Now, the controller reads data and parity blocks from the targets: For each raid stripe,
the controller decides whether to use incremental or batch parity calculation and sends read
commands to the targets. The latency is determined by the target that receives the most
read commands. The maximum number of read commands sent to any target
is calculated according to algorithm 4.6.

Algorithm 4.6: get max old blocks read(req size, first data blk idx, first parity blk idx)

 procedure stripe get old blocks read count(stripe, read count)
   if inc parity calc(stripe)
     then for tgt id = 0 to raid size − 1
            do if requested block(stripe[tgt id]) or parity block(stripe[tgt id])
                 then read count[tgt id] + +
     else for tgt id = 0 to raid size − 1
            do if not requested block(stripe[tgt id]) and not parity block(stripe[tgt id])
                 then read count[tgt id] + +

 main
  stripes = get stripe list(req size, first data blk idx, first parity blk idx)
  for tgt id = 0 to raid size − 1
    do read count[tgt id] = 0

  first stripe = stripes → first
  if partial stripe(first stripe)
    then stripe get old blocks read count(first stripe, read count)

  if stripes → count > 1
    then last stripe = stripes → last
         if partial stripe(last stripe)
           then stripe get old blocks read count(last stripe, read count)

  max old blocks read = 0
  for tgt id = 0 to raid size − 1
    do if read count[tgt id] > max old blocks read
         then max old blocks read = read count[tgt id]
  return (max old blocks read)

    The average number of blocks that are read by the target that receives the maximal
number of read commands is:

   avg max old blocks read(req size) =
   Σ_{fpbi=0}^{raid size−1} (1/raid size) · Σ_{fdbi=0, fdbi≠fpbi}^{raid size−1} (1/(raid size−1)) · get max old blocks read(req size, fdbi, fpbi)
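   The averaging can be sketched as follows; get max old blocks read is assumed to be an
implementation of algorithm 4.6 (for example, along the lines of the pseudocode above).

# Sketch of the averaging above: uniform over the first parity block index (fpbi)
# and, given fpbi, uniform over the raid_size - 1 possible first data block indices.
def avg_max_old_blocks_read(req_size, raid_size, get_max_old_blocks_read):
    total = 0.0
    for fpbi in range(raid_size):
        for fdbi in range(raid_size):
            if fdbi == fpbi:
                continue  # the first data block cannot reside on the parity target
            total += get_max_old_blocks_read(req_size, fdbi, fpbi)
    return total / (raid_size * (raid_size - 1))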

    As mentioned earlier, the latency of reading the old blocks from the targets to the
controller is the latency of the target that receives the maximal number of read commands.
We assume that a target can perform a disk operation for a command while performing an
rdma operation for another command. We also assume that the latency of a disk operation
is greater than the latency of an rdma operation. Therefore, the latency of reading the old
blocks from the targets is the sum of the latency of reading the data blocks from the disk
and the latency of the rdma operation for the last command:

   latency^Baseline_read old blocks(req size) =
avg max old blocks read(req size) · disk read latency(block size) + rdma latency(block size)

   While reading the old data blocks from the targets, the controller calculates the parity
blocks for the complete stripes. We assume that by the time that the old blocks are ready,
the parity calculation of the complete stripes is already done.

    Now, the controller calculates the new parity blocks for the partial stripes. The number
of blocks that are read/written during the parity calculation is calculated according to
algorithm 4.7:

Algorithm 4.7: get xored blocks count(req size, first data blk idx, first parity blk idx)

 procedure get stripe xored blocks count(stripe)
   if inc parity calc(stripe)
     then return (2 · stripe → data blocks count + 2)
     else return (raid size)

 main
  stripes = get stripe list(req size, first data blk idx, first parity blk idx)
  xored blocks cnt = 0
  first stripe = stripes → first
  if partial stripe(first stripe)
    then xored blocks cnt += get stripe xored blocks count(first stripe)
  if stripes → count > 1
    then last stripe = stripes → last
         if partial stripe(last stripe)
           then xored blocks cnt += get stripe xored blocks count(last stripe)
  return (xored blocks cnt)

   The average number of blocks that are read/written during the parity calculation is:

   avg xored blocks count(req size) =
   Σ_{fpbi=0}^{raid size−1} (1/raid size) · Σ_{fdbi=0, fdbi≠fpbi}^{raid size−1} (1/(raid size−1)) · get xored blocks count(req size, fdbi, fpbi)

   The average latency of the partial stripes parity calculation is, therefore:


   latency^Baseline_partial stripes parity(req size) =
xor latency(avg xored blocks count(req size) · block size)

    Now, the controller writes the new data to the targets: the targets read req size worth
of data blocks, plus the parity blocks, from the controller. Algorithm 4.8 calculates the
number of parity blocks:


Algorithm 4.8: write parity blocks count(req size, f irst data blk idx, f irst parity blk idx)

 stripes = get stripe list(req size, f irst data blk idx, f irst parity blk idx)
 return (stripes → count)

   avg write parity blocks count(req size) =
   Σ_{fpbi=0}^{raid size−1} (1/raid size) · Σ_{fdbi=0, fdbi≠fpbi}^{raid size−1} (1/(raid size−1)) · write parity blocks count(req size, fdbi, fpbi)

   Therefore, the latency of writing the new data to the targets is:

   latency^Baseline_write new data(req size) =
rdma latency((req size + write parity blocks count(req size) · block size)/raid size) +
disk write latency((req size + write parity blocks count(req size) · block size)/raid size)

   The latency of a single write request is:

                        latency^Baseline_write(req size) =
       latency^Baseline_rdma host→controller(req size) + latency^Baseline_read old blocks(req size) +
       latency^Baseline_partial stripes parity(req size) + latency^Baseline_write new data(req size)



tpt-raid
The targets read new data from the host: block count blocks are transferred from the host
to the targets. We use the same upper bound that was used for read requests in tpt-raid.
Our measurements show that the difference between the upper bound and the actual results
is less than 10%. Therefore:



   latency^tpt_prep write(req size) = prep write latency · block count

   latency^tpt_rdma host→targets(req size) = rdma latency(req size)

    Some targets have to read old data blocks from the disk for parity calculation of partial
stripes:

   latency^tpt_read old blocks(req size) =
avg max old blocks read(req size) · disk read latency(block size)

   Now, the targets calculate temporary parity results which will be used later for the new
parity blocks calculation:

    The number of iterations among the targets that hold the data blocks of a partial stripe,
required to calculate the temporary parity result for that stripe, is:

   iterations_part stripe parity calc(bc) =
      0                          if bc = raid size − 1 (complete stripe)
      ⌈log2 bc⌉                  if bc < raid size/2 − 1 (incremental calculation)
      ⌈log2(raid size − 1)⌉      otherwise (batch calculation)
   The number of iterations between the targets required to calculate the temporary parity
results for all complete stripes is:

   iterations_comp stripe parity calc = ⌈log2 raid size⌉
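   A small sketch of these iteration counts (with explicit rounding up, since a binary tree
performs an integral number of iterations) might look as follows; bc and raid size are passed
in explicitly.

import math

# Sketch of the iteration-count formulas above; bc is the number of requested
# data blocks in a partial stripe, raid_size the number of blocks per stripe.
def iterations_part_stripe_parity_calc(bc, raid_size):
    if bc == raid_size - 1:
        return 0                                  # complete stripe, handled separately
    if bc < raid_size / 2 - 1:
        return math.ceil(math.log2(bc))           # incremental calculation
    return math.ceil(math.log2(raid_size - 1))    # batch calculation

def iterations_comp_stripe_parity_calc(raid_size):
    # Complete stripes XOR raid_size operands (raid_size - 1 data blocks plus a
    # zero block at the parity target) in a binary tree.
    return math.ceil(math.log2(raid_size))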

   During the phase of the temporary parity results calculation, three calculations are per-
formed:

   • Temporary parity result for the first stripe (if it’s a partial stripe)

   • Temporary parity results for the complete stripes (if exist)

   • Temporary parity result for the last stripe (if it’s a partial stripe)

    The targets that perform the rdma read operations and xor operations are selected so
as to distribute the load of xor calculations and data transfer uniformly among the targets.
Therefore, we assume that these calculations are performed in parallel. The latency of this
phase is the latency of the longest calculation of these three. It is calculated in the following
way:

   fbc - first stripe data block count
   Complete stripes count: cs = (block count − (fbc mod (raid size − 1)))/(raid size − 1)
   Last stripe block count: lbc = block count − fbc − cs · (raid size − 1)

   latency_parity calc tmp(req size, fbc) =
max(iterations_part stripe parity calc(fbc) · (rdma latency(block size) + xor latency(3 · block size)),
    iterations_comp stripe parity calc · (rdma latency(cs · block size) + xor latency(3 · cs · block size)),
    iterations_part stripe parity calc(lbc) · (rdma latency(block size) + xor latency(3 · block size)))

   The average latency of calculating the temporary parity results is:

   latency^tpt_parity calc tmp(req size) =
   Σ_{fbc=0}^{raid size−1} (1/raid size) · latency_parity calc tmp(req size, fbc)
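   Putting the pieces together, a sketch of latency parity calc tmp is shown below. The
rdma latency and xor latency callables (mapping a size in bytes to a latency) and the two
iteration-count helpers are assumptions passed in as parameters; the integer division
implements the implicit floor in the definition of cs.

def latency_parity_calc_tmp(block_count, fbc, raid_size, block_size,
                            rdma_latency, xor_latency,
                            iterations_part, iterations_comp):
    # Complete-stripe count and last partial-stripe block count, as defined above.
    cs = (block_count - fbc) // (raid_size - 1)
    lbc = block_count - fbc - cs * (raid_size - 1)
    per_block = rdma_latency(block_size) + xor_latency(3 * block_size)
    candidates = [0.0]
    if fbc:
        candidates.append(iterations_part(fbc, raid_size) * per_block)
    if cs:
        candidates.append(iterations_comp(raid_size) *
                          (rdma_latency(cs * block_size) + xor_latency(3 * cs * block_size)))
    if lbc:
        candidates.append(iterations_part(lbc, raid_size) * per_block)
    # The three calculations proceed in parallel on different targets;
    # the phase latency is the longest of them.
    return max(candidates)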

    Now, each target that contains a parity block that needs to be updated reads the
temporary result from the target that holds it. A target may read multiple parity blocks in
parallel. Therefore, the latency of this phase is:

   latency^tpt_read parity block = rdma latency(block size)

   Now, the targets write the new data to the disks. The latency of this phase is:

   latency^tpt_write new data(req size) =
disk write latency((req size + write parity blocks count(req size) · block size)/raid size)


   The latency of a single write request is:

                          latency^tpt_write(req size) =
         latency^tpt_prep write(req size) + latency^tpt_rdma host→targets(req size) +
         latency^tpt_read old blocks(req size) + latency^tpt_parity calc tmp(req size) +
         latency^tpt_read parity block + latency^tpt_write new data(req size)



4.6     Comparison of data transfer and number of operations
We now compare the amount of data that is transferred and the number of operations in the
course of single-block read/write and of a full-stripe read/write in both systems. The
operations that are counted are sending a pdu or initiating an rdma operation.


4.6.1   READ requests
Single-block read

   • Baseline raid: The host sends a read command to the controller. The controller
     sends a command to the target. The target writes the requested data block to the
     controller and sends it a response. The controller writes the data to the host and sends
     it a response.

   • tpt-raid: The host sends a read command to the controller. No data passes through
     the controller. The controller sends a command to the target. The target writes a
     single block directly to the host using rdma and sends a response to the controller.
     The controller sends a response to the host.

Single-stripe read

   • Baseline raid: The host sends a read command to the controller. The controller
     sends a command to raid size − 1 data targets. Each data target writes the requested
     data block to the controller and sends it a response. The controller writes the data to
     the host and sends it a response.

   • tpt-raid: The host sends a read command to the controller. No data passes through
     the controller. The controller sends a command to raid size − 1 data targets. Each
     data target writes the requested data block directly to the host and sends a response
     to the controller. The controller sends a response to the host.

4.6.2   WRITE requests
Single-block write

   • Baseline raid: The host sends a write command to the controller. The controller
     reads the new data block from the host and sends a command to the data target and
     parity target. The targets write the old data block and the old parity block to the
     controller, and each target sends a response. The controller calculates parity and sends
     a command to the data target and parity target. The targets read the new data block
     and the new parity block from the controller, write the new data to their disks and
     each target sends a response to the controller. Then, the controller sends a response
     to the host.

                                             66
   • tpt-raid: The host sends a write command to the controller. The controller sends
     a command to the data target and the parity target. The data target reads a single
     block directly from the host and sends a response to the controller. The parity target
     reads a block from its disk and sends a response to the controller. The controller sends
     another command to the parity target. The parity target reads the old data block
     from the data target and sends a response to the controller. This target-target rdma
     operation is counted twice (only for transferred data calculation) because one target
      sends it and another target receives it. Then, the controller sends commands to the
      data target and the parity target. Each target writes the new data to its disk and
     sends a response to the controller. Then, the controller sends a response to the host.

Single-stripe write

   • Baseline raid: The host sends a write command to the controller. The controller
     reads raid size − 1 blocks from the host, computes parity and sends a command to
     raid size − 1 data targets and the parity target. Each target reads a single block from
     the controller, writes it to its disk and sends a response to the controller. Then, the
     controller sends a response to the host.

   • tpt-raid: The host sends a write command to the controller. The controller sends a
     command to raid size − 1 data targets. Each data target reads a single block directly
     from the host and sends a response to the controller. Then, the targets calculate the
     new parity block in the form of a binary tree. In order to simplify the calculation, we
     assume that raid size − 1 is even. There are (raid size − 1)/2 rdma operations in the
     1st iteration ((raid size − 1)/2 targets read data from the other (raid size − 1)/2
     targets), and a total of raid size − 2 rdma operations are needed for the calculation
     of the parity block. Another rdma operation is required for the parity target to read
     the calculated parity result. Therefore, raid size − 1 rdma operations are required for
     the calculation of the new parity block. These target-target rdma operations are counted
     twice as explained above (only for the transferred-data calculation). For each rdma
     operation, the controller sends a command to a target and the target sends a response
     to the controller. After the new parity is calculated, the controller sends a command to
     raid size − 1 data targets and the parity target. Each target writes the new data to its
     disk and sends a response to the controller. Then, the controller sends a response to
     the host. (The block counting for this case is sketched in code below.)
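    As a sanity check on this counting, the short sketch below reproduces the tpt-raid
full-stripe write row of Table 2 (target-target rdma operations counted twice, as explained
above); the function name is introduced here for illustration only.

# Sketch of the block counting for a full-stripe WRITE in TPT-RAID.
def tpt_full_stripe_write_blocks(raid_size):
    host = raid_size - 1                  # new data blocks sent by the host
    controller = 0                        # no data passes through the controller
    # Targets: raid_size - 1 blocks read from the host, plus raid_size - 1
    # target-target RDMA operations for the parity tree, each counted twice.
    targets = (raid_size - 1) + 2 * (raid_size - 1)
    return {"host": host, "controller": controller, "targets": targets,
            "total": host + controller + targets}

print(tpt_full_stripe_write_blocks(5))   # matches the Full-stripe WRITE column of Table 2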

    Table 2 presents a comparison of the number of transferred blocks during read and
write requests. Table 3 compares the number of operations during read and write
requests.
    The comparison for read and write requests shows that the amount of data that is
transferred through the controller is reduced to zero in the tpt-raid. For the host, there
is no difference between the two systems. In read requests, the amount of data that is
transferred through the targets is the same for the two systems. In write requests, the
amount of data that is transferred through the targets is increased in the tpt-raid. Also,
more control messages are sent in the tpt-raid between the controller and targets.

                 Table 2: Data transfers (in blocks) during request execution

                                        READ requests                 WRITE requests
   System      Entity             1-Block    Full stripe          1-Block    Full stripe
   Baseline    Host                  1      raid size − 1            1      raid size − 1
               Controller            2      2 · (raid size − 1)      5      2 · raid size − 1
               Targets               1      raid size − 1            4      raid size
               Total                 4      4 · (raid size − 1)     10      4 · raid size − 2
   tpt-raid    Host                  1      raid size − 1            1      raid size − 1
               Controller            0      0                        0      0
               Targets               1      raid size − 1            3      3 · (raid size − 1)
               Total                 2      2 · (raid size − 1)      4      4 · (raid size − 1)

                        Table 3: Operations during request execution

                                        READ requests                 WRITE requests
   System      Entity             1-Block    Full stripe          1-Block    Full stripe
   Baseline    Host                  1      1                        1      1
               Controller            3      raid size + 1            6      raid size + 2
               Targets               2      2 · (raid size − 1)      8      2 · raid size
               Total                 6      3 · raid size           15      3 · (raid size + 1)
   tpt-raid    Host                  1      1                        1      1
               Controller            2      raid size                6      3 · raid size − 1
               Targets               2      2 · (raid size − 1)      7      5 · raid size − 4
               Total                 5      3 · raid size − 1       14      8 · raid size − 4


The number of control messages that are sent between hosts and the controller does not change.
We will see in section 6 that the extra data transfers and control messages are negligible
when considering the performance improvement achieved by removing the raid controller
from the data path and moving the ecc calculation to the targets.




4.7     Error handling
4.7.1   Request execution failure
If a failure occurs during a read request, no data is lost. For writes, we must guarantee
atomicity; i.e., upon failure, a request must either be completed or rolled back, and the
controller must be advised accordingly. When a write request is executed in the tpt-
raid, no data is written to the disks before the controller instructs the targets to write the
new data and parity blocks to the local storage. The write new data command is sent
simultaneously to all targets. The targets release the buffers that contain the data being
written to the disks only after the write new data command completes, just as conventional
systems do after a write command completes. Assuming (as do all raid systems) that at most one
disk failure occurs during a given write request, the controller will be able to instruct the
targets how to complete the request. If a target fails to operate, the tpt-raid switches to
degraded mode, which is discussed next.

4.7.2   Degraded mode
When the tpt-raid controller detects the failure of one of the targets, the system switches
to degraded mode. Failure detection is performed much as it is in the Baseline system. In
degraded mode, a spare disk must be installed in place of the faulty disk and a background
reconstruction process begins; this, too, is similar to the Baseline system. Then, write
requests may be executed as they are executed in normal mode. Reading data blocks from
the operational targets is left unchanged. Reading a single data block from the faulty target
is done in the following way:

  1. The data block is calculated by calculating the parity of all other blocks in the stripe.
     This is done exactly like the calculation of the parity block in write requests.
  2. The target that holds the result of the parity calculation (which is the requested block)
     writes it to the host using an rdma operation.

   We now compare the amount of data that is transferred by the raid controller and
targets when a block is read from the non-operating target:

  1. Baseline raid: The controller receives raid size − 1 blocks from the operating targets,
     calculates the requested block and transfers it to the host. Therefore, raid size blocks
     pass through the controller. The targets transfer raid size − 1 blocks to the controller.
  2. tpt-raid: No data passes through the controller. In the targets, raid size − 2 rdma
     operations are required for calculating the requested block (similar to parity calculation
     when a single stripe is written in normal mode). These rdma operations are counted
     twice because in each rdma operation between two targets, one target sends data
     while another target receives it. Another rdma operation is required for transferring
     the requested block directly to the host. Therefore, 2·raid size−3 blocks pass through
     the targets, as the short sketch below illustrates.
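    For reference, the counts from the two items above can be summarized as follows (the
function name is introduced here for illustration only).

# Sketch of the degraded-mode single-block READ comparison above.
def degraded_read_blocks(raid_size):
    baseline = {"controller": raid_size,          # raid_size - 1 blocks in, 1 block out
                "targets": raid_size - 1}
    tpt = {"controller": 0,
           # raid_size - 2 tree RDMA operations counted twice, plus one RDMA to the host
           "targets": 2 * (raid_size - 2) + 1}
    return {"Baseline": baseline, "TPT-RAID": tpt}

print(degraded_read_blocks(5))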

   The results of this comparison are very similar to the comparison of write requests in
normal mode. Therefore, the same performance improvement that is achieved in write
requests in normal mode will be achieved in read requests in degraded mode.

   The reconstruction of the faulty target is done in a very similar way to read requests
execution in degraded mode: for each stripe, the data block is calculated by calculating the
parity of all other blocks in the stripe. Then, the target that is being reconstructed reads
the block from the target that holds the result of the parity calculation.




4.8    Concurrent execution of multiple requests
In this section, we show that the tpt-raid system does not introduce new problems when
executing requests in parallel.
   In the case of parallel read requests, there is no problem, since no data is changed. In
the case of parallel write requests, if two write requests refer to different parity groups,
there is no problem. If two write requests refer to the same parity group, the raid controller
handles it in the same way that the Baseline raid controller would.




5       Proof of correctness and completeness
5.1     Assumptions
In proving the correctness and completeness of the system, we make the following assump-
tions that are ordinary in storage systems:

    • Commands, response messages and data do not get lost and are not damaged when sent
      between the entities in the system. (This may be achieved by a lower-level protocol.)
    • tpt-raid can tolerate a single failure. If more than a single failure occurs, data
      correctness may not be ensured.
    • Disk operations complete successfully with the correct data read/written or complete
      with failure and an error code is returned.

    We next prove that tpt-raid is as correct as the Baseline raid.

5.2     Proof of correctness
In this section we prove that:

    • The correctness of a single read request is ensured, and the transaction will terminate
      with the correct data read into the host’s buffer.
    • The correctness of a single write request is ensured, and the transaction will terminate
      with the correct data written to the disks.

5.2.1    READ requests
During the execution of the last prep read command, each target in the tpt-raid reads
from the disk the same data that a target in the Baseline raid would read when it receives
a read command from the Baseline raid controller.
   The only difference between the targets is that a target in the tpt-raid system writes
the requested data to the buffer in the host while a target in the Baseline raid system writes
the requested data to the buffer in the raid controller.
   The transaction terminates only after all targets that contain requested blocks have
written the requested data to the buffer in the host. Therefore, the transaction will terminate
with the correct data read into the host’s buffer.

5.2.2    WRITE requests
In this section, we show that the correctness of a single write request is ensured, and the
transaction will terminate with the correct data written to the disks if no errors occurred. (If
any errors occurred, the request may not end successfully and it will be handled as defined
in section 4.7.) We show this separately for partial stripes and complete stripes.

• Partial stripes: Let p be the number of requested data blocks from a partial stripe, k
  be the index of the target that contains the 1st requested block in the partial stripe,
  and raid size be the number of blocks in a single raid stripe. In order to simplify the
  calculation, we assume that the index of the target that contains the parity block is
  raid size − 1.
     – Incremental parity calculation: For each stripe, each target that contains a
       requested data block reads its new value from the host and its old value from the
       disk with a xdwrite command. The xdwrite command also produces the
       following result: tmp block_i = xor(new block_i, old block_i), where new block_i is
       the new value that was read from the host for target i and old block_i is the old
       value that was read from the disk for target i. The target that contains the parity
       block reads its old value from the disk (old parity block) with a read old block
       command.
       The iterations of read parity part tmp commands form a binary tree of xor
       operations, which produces the following result:
       xor(... xor(xor(tmp block_k, tmp block_{k+1}), xor(tmp block_{k+2}, tmp block_{k+3})) ...,
       xor(tmp block_{k+p−2}, tmp block_{k+p−1}) ...) =
       xor(tmp block_k, tmp block_{k+1}, ..., tmp block_{k+p−2}, tmp block_{k+p−1})
       The above result is read by the target that contains the parity block during the
       read parity part block command. It performs the following xor operation:
       xor(xor(tmp block_k, tmp block_{k+1}, ..., tmp block_{k+p−2}, tmp block_{k+p−1}), old parity block) =
       xor(tmp block_k, tmp block_{k+1}, ..., tmp block_{k+p−2}, tmp block_{k+p−1}, old parity block) =
       new parity block
      Now, the new data blocks and new parity block for the partial stripe are already
      stored in the target. These blocks are written to the disk during the write new
      data command. The transaction terminates only after all targets that contain
      requested data block or parity blocks have written the new data to the disks.
      Therefore, the transaction will terminate with the correct data written to the
      disks.
    – Batch parity calculation: For each stripe, each target that contains a requested
      data block reads its new value from the host with a read new block command.
      All other targets, except for the target that contains the parity block, read the
      old value of the data block from the disk with a read old block command.
       block_i denotes the data block that was read by target i during the execution of a
       read new block or read old block command.
       The iterations of read parity part tmp commands form a binary tree of xor
       operations, which produces the following result:
       xor(... xor(xor(block_0, block_1), xor(block_2, block_3)) ...,
       xor(block_{raid size−3}, block_{raid size−2}) ...) =
       xor(block_0, block_1, ..., block_{raid size−3}, block_{raid size−2}) = new parity block

            The new parity block is read by the target that contains the parity block during
            the execution of the read parity part block command.
            Now, the new data blocks and new parity block for the partial stripe are already
            stored in the target. These blocks are written to the disk during the write new
            data command. The transaction terminates only after all targets that contain
            requested data block or parity blocks have written the new data to the disks.
            Therefore, the transaction will terminate with the correct data written to the
            disks.

   • Complete stripes: Let raid size be the number of blocks in a single raid stripe. In
     order to simplify the calculation, we assume that the index of the target that contains
     the parity block is raid size − 1.
      For each stripe, each target that contains a data block reads its new value from the
      host with the last prep write command. The target that contains the parity block
      allocates a buffer that contains zero values. block_i denotes the data block that was read
      by target i during the execution of the last prep write command. zero block denotes
      the zero-valued buffer at the target that contains the parity block.
      The iterations of read parity comp tmp commands form a binary tree of xor
      operations, which produces the following result:
      xor(... xor(xor(block_0, block_1), xor(block_2, block_3)) ..., xor(block_{raid size−2}, zero block) ...) =
      xor(block_0, block_1, ..., block_{raid size−2}, zero block) = xor(block_0, block_1, ..., block_{raid size−2}) =
      new parity block
      The new parity block is read by the target that contains the parity block during the
      execution of the read parity comp block command.
      Now, the new data blocks and new parity block for the complete stripe are already
      stored in the target. These blocks are written to the disk during the write new data
      command. The transaction terminates only after all targets that contain requested
      data block or parity blocks have written the new data to the disks. Therefore, the
      transaction will terminate with the correct data written to the disks. (A short
      sanity-check sketch of the binary-tree xor appears below.)
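   As a small sanity check (not part of the prototype), the following sketch verifies that a
binary tree of xor operations over the blocks yields the same result as a flat xor, which is
the property the above argument relies on.

import os
from functools import reduce

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def tree_xor(blocks):
    """Pairwise (binary-tree) XOR, as performed by the targets."""
    while len(blocks) > 1:
        nxt = [xor_blocks(blocks[i], blocks[i + 1])
               for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:
            nxt.append(blocks[-1])        # an odd block waits for the next iteration
        blocks = nxt
    return blocks[0]

# Four random 4kb data blocks (raid_size - 1 = 4): the tree result equals the
# flat XOR, i.e., the conventional parity block.
data_blocks = [os.urandom(4096) for _ in range(4)]
assert tree_xor(list(data_blocks)) == reduce(xor_blocks, data_blocks)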

5.3    Proof of completeness
We state that the new architecture is complete since it covers all aspects of the scsi and
iscsi protocols. No changes have been made in these protocols except for the login/logout
mechanism that was described in section 4.2 and scsi read and write commands that
were described in sections 4.3.5 and 4.3.7.
   To complete the completeness proof, we proved in section 4.7 that atomicity is preserved,
and, therefore, no data loss may occur if any single failure occurs.



6       Prototype and performance analysis
In order to compare the tpt-raid with the Baseline in-band controller, we constructed
prototypes of the two systems. The two systems were compared in terms of zero-load request-
execution latency and maximum throughput. Approximate calculations were provided in
sections 4.4 and 4.5. Here, we present actual performance measurements. We focus only
on data transfer; we ignore connection establishment, login and logout because they are
infrequent, so for them we focus on correctness rather than on performance.
    The tpt-raid prototype is not a complete system and is not fully connected to the Linux
scsi subsystem. Therefore, it was impossible to use standard benchmarks on it. Instead,
low-level throughput and latency tests were executed on the prototype.

6.1     System prototypes
6.1.1    Hardware
The Baseline raid and tpt-raid prototypes are built using the same hardware in order
to make the comparison as fair as possible. For convenience of implementation, we use
identical hardware for all types of boxes (hosts, controller and targets). Each system
comprises a single host, a raid controller and 5 targets. Each machine has dual Intel Pentium
4 xeon 3.2ghz processors with 2mb L2 cache and an 800 mhz front side bus. Each machine
contains a Mellanox mhea28-1t 10gb/sec full-duplex Host Channel Adapter (hca). The
hcas are attached via a pci-Express x8 interface. All machines are connected to a
Mellanox mts2400 InfiniBand switch. Since the target machines have low-end sata disks
that would have become a bottleneck, we simulate targets with multiple disks and large
caches by not sending the scsi commands to the disk; instead, a successful scsi response is
returned immediately. (The returned data blocks contain random data.)

6.1.2    Software
We use the Linux SuSE 9.1 Professional operating system with 2.6.4-52 kernel. Voltaire’s
InfiniBand host and iser [23] are used for InfiniBand. In order to facilitate comparison,
both prototypes are implemented using the same code whenever possible.
   We now describe the software modules as shown in Fig. 24. Unless stated otherwise, all
modules are kernel modules:

    • Baseline raid:

         – Host:
            ∗ Proc fs: the proc file system [24] in Linux is used as a user interface to the
              host application.
            ∗ Host application: used to generate scsi commands, send them down and
              receive scsi response messages.

        ∗ iscsi initiator: basic prototype.
        ∗ iser initiator: implements the Datamover primitives for the initiator side.
        ∗ kdapl [25]: part of the Voltaire InfiniBand host stack. It is a transport
          neutral infrastructure that provides rdma capabilities inside the kernel.
        ∗ ib host: Voltaire InfiniBand host stack. kdapl uses the InfiniBand Verbs and
          Communication Manager [4] (cm) supplied by the stack.
    – Controller:
        ∗ Baseline controller: controls the targets as described in section 3.5. It uses the
          iser target to receive commands from the host. In order to send commands
          to the targets, it uses the iscsi initiator.
        ∗ iscsi initiator
        ∗ iser initiator
        ∗ iser target: implements the Datamover primitives for the target side.
        ∗ kdapl
        ∗ ib host
    – Target:
        ∗   iscsi target: basic prototype.
        ∗   iser target
        ∗   kdapl
        ∗   ib host
        ∗   scsi interface: used to send commands to scsi mid-layer.

• tpt-raid:

    – Host: very similar to the host in the Baseline system. It also includes a connection
      request agent (user space application) that is used to accept incoming connection
      requests from tpt targets.
    – Controller:
        ∗ tpt controller: controls the targets as described in section 4. It uses the
          iser target to receive commands from the host and the iser initiator to
          send commands to the targets. Unlike the Baseline controller, it cannot use
          the iscsi initiator because the tpt controller uses the 3rd Party Transfer
          mechanism, which is not part of the iscsi initiator.
        ∗ iser initiator
        ∗ iser target
        ∗ kdapl
        ∗ ib host
    – Target:

            ∗ tpt iscsi target: a target prototype that supports the 3rd Party Transfer and
              parity calculation in the targets as described in section 4.
            ∗ iser target
            ∗ kdapl
            ∗ ib host
            ∗ scsi interface




                        Figure 24: Prototypes software architecture

Terms and definitions
We use the term request size to denote the requested data length in the command that is
sent by the host. The term block size denotes the striping granularity of the raid. The term
target set refers to raid size targets that constitute a parity group.



6.2     Throughput comparison
The throughput of the two systems was compared by measuring how many requests were
completed in an interval of 10 seconds. A request is considered completed when the host
receives a successful response for it.
    In this test, the host sends a sequence of requests. The measurement starts 5 seconds
after the host starts sending requests and lasts 10 seconds. The host keeps sending requests
for at least 5 more seconds after the measurement ends. This ensures that only the system
throughput in steady state is measured.
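    The measurement procedure can be sketched as follows (the prototype implements it
inside the kernel; the code below is an illustrative user-level rendering with the timing
constants taken from the description above).

import time

# Sketch of the steady-state throughput measurement: requests are sent back to
# back; only completions inside the 10-second window are counted.
def measure_throughput(send_request_and_wait, warmup=5.0, window=10.0, cooldown=5.0):
    completed_in_window = 0
    start = time.monotonic()
    end = start + warmup + window + cooldown
    while time.monotonic() < end:
        send_request_and_wait()              # returns when a successful response arrives
        t = time.monotonic() - start
        if warmup <= t < warmup + window:
            completed_in_window += 1
    return completed_in_window / window      # requests per second in steady state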
    Since the target machines have low end sata disks that would have become a bottleneck,
we simulated targets with multiple disks with large cache by not sending the scsi commands
to the disk. Instead, a successful scsi response was returned immediately. (The returned
data blocks contained random data.) Note that because the measurements are aimed at
assessing the bottlenecks in the controller, any reduction in the target work (that does not
affect the controller) is acceptable.
    As described in 4.4.1, the system maximum throughput may be limited by several re-
sources:
   • cpu

   • Network link

   • Memory
   We now go over the system entities (host, controller and targets) and locate potential
system bottlenecks according to our measurements.

6.2.1   READ requests
The read request throughput test was executed for several request sizes (4kb-8mb) and
several block sizes (4kb-128kb). For a given block size, no data point exists for request size <
block size. Fig. 25 depicts the throughput comparison of the two systems. Fig. 26 shows the
cpu usage of the controller’s tx thread, which is responsible for sending pdus to the targets.
    Baseline system

   • Host:

        – cpu: The cpu usage does not reach more than 50%. Therefore, it never becomes
          a bottleneck.
        – Network link: The system’s maximum throughput is lower than 725 mb/sec. It is
          lower than InfiniBand bandwidth (920 mb/sec) and, therefore, it never becomes
          a bottleneck.
        – Memory: The memory is used mainly for rdma operations (from the controller).
          The memory bandwidth is higher than 725 mb/sec and, therefore, it never be-
          comes a bottleneck.

                       Figure 25: read request throughput

• Controller:

    – cpu: As the request size gets smaller, the controller’s tx thread has to do more
      work. The cpu usage reaches 100% and, therefore, the throughput is low.
    – Network link: When the transfer size is large, the amount of cpu work per byte is
      small. Consequently, InfiniBand’s bandwidth becomes the bottleneck. The actual
      InfiniBand bandwidth is 920mb/sec. As described in 4.3.6, during the execution
      of read requests, targets send parity blocks to the controller. Therefore, since
       the raid comprises 5 targets, roughly 20% of the blocks that are sent from the targets to
      the controller are parity blocks and, therefore, the maximum InfiniBand goodput
      is only 725mb/sec.
    – Memory: The memory is used mainly for rdma operations (from the targets
      and to the host). However, we’ve already seen that the bottleneck is the cpu or
      network link.

• Target:

    – cpu: The cpu usage does not reach more than 65%. Therefore, it never becomes
      a bottleneck.
     – Network link: Each target sends 1/raid size of the data to the controller. Therefore,
       the controller's network link will become a bottleneck before the target's network
       link.
    – Memory: The memory is used mainly for rdma operations (to the controller) and
      disk operations. Therefore, it never becomes a bottleneck.

               Figure 26: Controller’s tx thread cpu usage in read requests

    Fig. 25 and Fig. 26 show that for block size < 64kb, when the request size ≤ 256kb,
the controller’s cpu usage reaches 100% and the throughput is lower than 610mb/sec. For a
larger request size, the controller’s cpu usage is low (less than 60%) and as it decreases, the
controller’s InfiniBand link becomes a bottleneck and the Baseline system throughput does
not go over 725mb/sec.
    When using block size ≥ 64kb, the controller’s cpu usage is low. Therefore, the cpu
never becomes a bottleneck. When all targets participate in each request, the throughput is
never lower than 725mb/sec. For a small request size, the system throughput is higher than
725mb/sec because almost no parity blocks are read from the targets. As the request size
increases over 1024kb, more parity blocks are read and the net system throughput reaches
725mb/sec. Fig. 27 compares the throughput of the Baseline system counted over data
blocks only with the throughput of the same system counted over data and parity blocks,
according to the calculation in section 4.3.6. When the InfiniBand link becomes a
bottleneck, the throughput of data blocks does not exceed 725mb/sec. The throughput of
data and parity blocks reaches 900mb/sec, which is very close to the InfiniBand bandwidth
and shows that InfiniBand is indeed the system's bottleneck.
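    The goodput ceiling follows from simple arithmetic, sketched below; the 920 mb/sec
figure is the measured InfiniBand bandwidth quoted above, and the result is approximate.

# Sketch of the goodput estimate: with raid_size = 5, roughly one block in five
# sent from the targets to the Baseline controller is parity, so only about 4/5
# of the link bandwidth carries useful data.
link_bw_mb_s = 920
raid_size = 5
goodput = link_bw_mb_s * (raid_size - 1) / raid_size
print(goodput)   # ~736 mb/sec, in line with the ~725 mb/sec ceiling observed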




          Figure 27: read request throughput with and without parity blocks

TPT-RAID system

• Host:

    – cpu: The cpu usage does not reach more than 70%. Therefore, it never becomes
      a bottleneck.
    – Network link: The system reaches a maximum throughput of 920 mb/sec, which
      is very close to InfiniBand bandwidth. Therefore, the InfiniBand link is a system
      bottleneck.
    – Memory: The memory is used mainly for rdma operations (from the targets).
      However, we’ve already seen that the bottleneck is the network link.

• Controller:

    – cpu: The amount of work of the controller’s tx thread is influenced by the block
      size and request size. For a small block size and request size, the controller’s tx
      thread has to do more work. The cpu usage reaches 100% and, therefore, the
      throughput is low.
    – Network link: Only control pdus are sent. Therefore, the network link never
      becomes a bottleneck.
     – Memory: No large memory operations are performed, because no data passes
       through the controller.

• Target:

        – cpu: The cpu usage does not reach more than 65%. Therefore, it never becomes
          a bottleneck.
         – Network link: Each target sends 1/raid size of the data to the host. Therefore, the
           host's network link will become a bottleneck before the target's network link.
         – Memory: The memory is used mainly for rdma operations (to the host) and
           disk operations. Therefore, it never becomes a bottleneck.

    Fig. 25 and Fig. 26 show that for block size = 4kb, the controller’s cpu usage reaches
100% and the throughput does not reach more than 240mb/sec. When the request size
≤ 64kb, the throughput of the tpt-raid is lower than the Baseline system. For block size
= 16kb, the controller’s cpu usage reaches 100% when the request size ≤ 512kb and the
throughput does not reach more than 830 mb/sec. For request size > 512kb, the cpu usage
is below 95% and the throughput reaches 895mb/sec. For block size = 32kb, the cpu usage
is high (97%) only when the request size is 128kb and the throughput is 835 mb/sec. For
a larger request size, the cpu usage is below 80%, the bottleneck becomes the InfiniBand
link in the host and the throughput is 925mb/sec. For block size ≥ 64kb, the cpu usage is
below 52%, the bottleneck is, again, the InfiniBand link in the host and the throughput is
925mb/sec.

6.2.2   WRITE requests
The write request throughput test was executed for several request sizes (4kb-8mb) and
several block sizes (4kb-128kb). For a given block size, no data point exists for request size <
block size. Fig. 28 depicts the throughput comparison of the two systems. Fig. 29 shows the
cpu usage of two threads in the Baseline controller: the tx thread, which is responsible for
sending pdus to the targets, and the thread that performs xor operations. Fig. 30 shows
the cpu usage of the tpt controller’s tx thread, which is responsible for sending pdus to
the targets.


   Baseline system

   • Host:

        – cpu: The cpu usage does not reach more than 10%. Therefore, it never becomes
          a bottleneck.
        – Network link: The system’s maximum throughput is lower than 700 mb/sec,
          which is lower than InfiniBand bandwidth. Therefore, the InfiniBand link never
          becomes a bottleneck.
        – Memory: The memory is used mainly for rdma operations (from the targets).
          The memory bandwidth is higher than 700 mb/sec and, therefore, it never be-
          comes a bottleneck.

   • Controller:

        – cpu: For a small request size, the controller’s tx thread has to do more work. The
          cpu usage of the tx thread reaches 100% and, therefore, the throughput is low.
          For a larger request size, each request requires xor operations with more data
          blocks. The cpu usage of the thread that performs xor operations reaches 100%
          and the throughput is low.
        – Network link: The system’s maximum throughput is lower than 700 mb/sec.
          The controller also sends parity blocks to the targets, so the actual maximum
          throughput is 840 mb/sec. This is lower than InfiniBand bandwidth. Therefore,
          the InfiniBand link never becomes a bottleneck.
        – Memory: The memory is used mainly for rdma operations (from the targets and
          to the host) and xor operations. However, we’ve already seen that the bottleneck
          is the cpu.

   • Target:

        – cpu: The cpu usage does not reach more than 65%. Therefore, it never becomes
          a bottleneck.
        – Network link: Each target receives 1/raid size of the data that is sent from the
          controller. Therefore, the target’s network link never becomes a bottleneck. The
          target also sends data to the controller. However, it sends only blocks that are
          required for parity calculation of partial stripes.
        – Memory: The memory is used mainly for rdma operations (to the controller) and
          disk operations. Therefore, it never becomes a bottleneck.

   Fig. 29 shows that the controller’s cpu usage is always near 100%: For block size < 64
kb, the tx thread reaches 100% for a small request size. For a larger request size, the cpu
usage of the xor thread reaches 100% and then the cpu usage of the tx thread decreases.
For block size ≥ 64 kb, the cpu usage of the tx thread is low because the cpu usage of the
xor thread reaches 100%.

   TPT-RAID system

   • Host:

        – cpu: The cpu usage does not reach more than 35%. Therefore, it never becomes
          a bottleneck.
        – Network link: The system reaches a maximum throughput of 850 mb/sec, which is
          lower than InfiniBand bandwidth. Therefore, the InfiniBand link never becomes
          a bottleneck.


                           Figure 28: write request throughput

        – Memory: The memory is used mainly for rdma operations (from the targets).
          The memory bandwidth is higher than 850 mb/sec and, therefore, it never be-
          comes a bottleneck.

   • Controller: Similar to read requests.
   • Target:

        – cpu: The cpu usage does not reach more than 60%. Therefore, it never becomes
          a bottleneck.
        – Network link: The target receives data from the host and other targets and sends
          data to other targets. However, each target does not send more than 200 mb/sec
          and does not receive more than 380 mb/sec. Therefore, the network link never
          becomes a bottleneck.
        – Memory: The memory is used for rdma operations (to the host and other tar-
          gets), xor operations and disk operations. We showed that it is not a bottleneck
          by adding extra memory operations in a different thread and showing that the
          maximum system throughput almost didn’t change.

    Fig. 28 and Fig. 30 show that for any given block size, when only some of the targets
participate in a request (i.e., small requests), the controller’s cpu usage decreases because
the controller has to wait for the targets. When all targets participate in a request (i.e.,
larger requests), the cpu usage is higher. For a larger request size, a request contains more
complete stripes and, therefore, fewer commands are sent and the cpu usage decreases.
    For block size = 4 kb and request size ≥ 32 kb, the controller’s cpu usage reaches 100%
and the maximum throughput does not reach more than 160 mb/sec. For a larger block

               Figure 29: Baseline controller’s cpu usage in write requests

size, the controller’s cpu usage is lower (because the controller has to send fewer commands
for the same request size) and the maximum throughput is higher. However, even when
using block size = 128 kb, the maximum throughput is lower than InfiniBand bandwidth
(although the cpu usage is low). This is because the targets are the active side of some
rdma operations (to the host and to other targets) and the passive side of other rdma
operations (from other targets). The interleaving of read and write requests towards each
target’s memory reduces the maximum throughput.

6.3    Scalability
In this section, we try to assess the scalability of the tpt-raid system, i.e., how much
workload it can handle without rendering any of its components a bottleneck. The Baseline
system has poor scalability because its controller is a bottleneck, as shown in section 6.2.
Therefore, it will not be discussed further in this section.
    Since the controller in the tpt system is out of the data path, the amount of activity that
it can manage is only limited by its cpu’s capability to process commands. Other potential
limiting factors in a complete system are the communication bandwidth of the host(s) and
target(s). In a single-host system, the host’s InfiniBand link is the communication-related
bottleneck.
    Therefore, since the controller’s cpu usage depends on the amount of work of the
tx thread, calculating the cpu usage of a single command that is sent from the controller
to a target will allow us to assess the maximum number of commands that the controller
can send to the targets. By knowing the number of commands that are required for a single
read/write, we can calculate the controller’s maximum throughput when the number of
hosts and targets is unlimited.
    The measurements of the system’s cpu usage and requests/sec count were taken while

                 Figure 30: tpt controller’s cpu usage in write requests

all components of the system were running. Therefore, all internal overheads are already
included in the results.
    The cpu usage of a single command that is sent from the controller to a target may be
calculated in the following way:
    The number of commands that are sent from the controller to the targets for a single
read request is calculated in appendix C. The cpu usage per command/sec is:
       ctrl cpu pdu work = ctrl tx thread cpu util / (requests per sec · sent pdus read^ctrl_tpt(req size))



    According to the throughput and cpu usage values that are shown in Fig. 25 and Fig.
26, the average cpu usage per command/sec (ctrl cpu pdu work) is 0.0015%. Now, we can
use this result in order to calculate the maximum number of read/write requests that a
single controller can handle without becoming a bottleneck:
    tpt ctrl max read reqs/sec(req size) = 100 / (ctrl cpu pdu work · sent pdus read^ctrl_tpt(req size))

    tpt ctrl max write reqs/sec(req size) = 100 / (ctrl cpu pdu work · sent pdus write^ctrl_tpt(req size))
    The maximum controller throughput is, therefore:
         tpt ctrl max read thr = tpt ctrl max read reqs/sec(req size) · req size


        tpt ctrl max write thr = tpt ctrl max write reqs/sec(req size) · req size
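
    To make the arithmetic concrete, the following Python sketch evaluates these formulas.
It is illustrative only: the 0.0015% figure is the measured value quoted above, and the
per-request command count must be supplied from the expressions derived in appendix C
(for read requests in the tpt system, req size/block size).

   # Illustrative evaluation of the controller scalability formulas above.
   CTRL_CPU_PDU_WORK = 0.0015  # measured cpu percentage per (command/sec)

   def tpt_ctrl_max_reqs_per_sec(sent_pdus_per_req):
       # Requests/sec at which the controller's tx thread reaches 100% cpu.
       return 100.0 / (CTRL_CPU_PDU_WORK * sent_pdus_per_req)

   def tpt_ctrl_max_thr(req_size_kb, sent_pdus_per_req):
       # Maximum controller throughput (KB/sec) for the given request size.
       return tpt_ctrl_max_reqs_per_sec(sent_pdus_per_req) * req_size_kb

   # Example: an 8 MB read request with 128 KB blocks requires 64 prep read
   # commands (req size / block size), so the controller alone could sustain:
   print(tpt_ctrl_max_thr(8 * 1024, 64) / 1024, "MB/sec")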



   When more than a single host is connected to the system, a single target-set may also
become a bottleneck. In each target, the cpu or InfiniBand link may become a bottleneck.

By calculating the usage of the cpu and the network link for a single request, we can assess
the maximum number of requests that a single target-set can handle without becoming a
bottleneck. By knowing that, we can calculate the maximum throughput of a single target-
set.
    The number of pdus/rdma operations that are sent from a target for a single read
request is calculated in appendix C. The cpu usage per command/sec is:
       target cpu pdu work = target tx thread cpu util / (requests per sec · sent pdus read^target_tpt(req size))



   According to the throughput and cpu usage values, the average cpu usage per
command/sec (target cpu pdu work) is 0.0022%. Now, we can use this result in
order to calculate the maximum number of read/write requests that a single target-set
can handle without becoming a bottleneck:
    tpt target max read reqs_cpu/sec(req size) = 100 / (target cpu pdu work · sent pdus read^target_tpt(req size))

    tpt target max write reqs_cpu/sec(req size) = 100 / (target cpu pdu work · sent pdus write^target_tpt(req size))

    tpt target max write reqs_network/sec(req size) =
        min(link bw / sent data write^target_tpt(req size), link bw / rcvd data write^target_tpt(req size))


    The maximum throughput for a single target-set is, therefore:

    tpt target max read thr =
        min(tpt target max read reqs_cpu/sec(req size) · req size,         [cpu limitation]
            raid size · link bw)                                           [link bandwidth limitation]

    tpt target max write thr =
        min(tpt target max write reqs_cpu/sec(req size) · req size,        [cpu limitation]
            tpt target max write reqs_network/sec(req size) · req size)    [link bandwidth limitation]
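
    A similar sketch for a single target-set is shown below. The 0.0022% figure is the
measured value quoted above; the link bandwidth constant is an assumption (taken here as
roughly the 925 mb/sec observed on the host link), and the per-request pdu counts and data
volumes must again be supplied from appendix C.

   # Illustrative evaluation of the single-target-set limits above.
   TARGET_CPU_PDU_WORK = 0.0022   # measured cpu percentage per (command/sec)
   LINK_BW = 925 * 1024           # assumed per-target link bandwidth (KB/sec)
   RAID_SIZE = 5

   def target_max_read_thr(req_size_kb, sent_pdus_per_req):
       cpu_limit = (100.0 / (TARGET_CPU_PDU_WORK * sent_pdus_per_req)) * req_size_kb
       link_limit = RAID_SIZE * LINK_BW
       return min(cpu_limit, link_limit)

   def target_max_write_thr(req_size_kb, sent_pdus_per_req, sent_data_kb, rcvd_data_kb):
       cpu_limit = (100.0 / (TARGET_CPU_PDU_WORK * sent_pdus_per_req)) * req_size_kb
       net_reqs = min(LINK_BW / sent_data_kb, LINK_BW / rcvd_data_kb)
       return min(cpu_limit, net_reqs * req_size_kb)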




    Fig. 31 and Fig. 32 show the maximum throughput of the controller when it is not
limited by hosts and targets, and the maximum throughput of a single target-set when it
is not limited by hosts and the controller. Table 4 summarizes the maximum number of
hosts that can be connected to a single-controller tpt system (the numbers for the Baseline
system are in parentheses to the right of the numbers for the tpt-raid) that comprises a
single/multiple target sets without having the controller or the targets become a bottleneck.
(Using multiple target sets means that there are enough targets and they never become a
bottleneck.)




                       Figure 31: System scalability (read requests)



                            Table 4: raid controller scalability

 Block size    Request size           READ requests                      WRITE requests
   (kB)           (kB)          Multiple tgt sets Single tgt set   Multiple tgt sets Single tgt set
      4           ≥ 16               1 (1)            1 (1)             1 (1)            1 (1)
     16           ≥ 64               1 (1)            1 (1)             1 (1)            1 (1)
     32           ≥ 128              2 (1)            2 (1)             1 (1)            1 (1)
     64           ≥ 256              4 (1)            4 (1)             3 (1)            2 (1)
    128           ≥ 512              8 (1)        raid size(1)          5 (1)            2 (1)

   As mentioned before, in the Baseline system, even when a single host is connected, the
controller is already a bottleneck.
   It is important to note that the results in this section were measured with targets that
use memory disks. In practice, the physical storage is expected to be slower and, therefore,

                     Figure 32: System scalability (write requests)

a single controller may be connected to even more targets without becoming a bottleneck.
This is true for both tpt and Baseline systems.




6.4     Latency comparison
The latency of the two systems was compared by measuring the elapsed time from the
moment a request was sent by the host until a response for that request was received by
the host. No other request was sent during that time.

6.4.1   READ requests
The read request latency test was executed for several request sizes (4kb-8mb) and several
block sizes (4-128kb). For a given block size, there is no data if request size < block size. Fig.
33 depicts the latency comparison of the two systems.
    For a small block size, latency does not improve (relative to the Baseline system) because
the overhead of prep read commands is very high. Using a larger block size (≥ 32kb)
reduces the prep read commands overhead (because the controller sends fewer commands
for the same request size) and latency is improved relative to the Baseline system (20% for
block size = 32kb, 30% for block size = 128kb).




                               Figure 33: read request latency

    Fig. 34 shows a comparison of the latency of the two systems for a constant request
size (8mb) and several block size values. Since a read request execution in the Baseline
system does not depend on the block size, changing it has almost no influence on latency.
In the tpt-raid system, however, for a constant request size, the number of prep read
commands depends on the block size and, therefore, the latency is lower for a large block
size.

         Figure 34: read request latency = f(block size) with request size = 8mb

6.4.2   WRITE requests
The write request latency test was executed for several request sizes (4kb-8mb) and several
block sizes (4-128kb). For a given block size, there is no data if request size < block size.
Fig. 35 depicts the latency comparison of the two systems. One can see that the latency of
the two systems becomes very similar when using block size ≥ 16kb. For a small block size,
the latency of the tpt-raid system is higher because the controller has to send more prep
write commands for the same request size.
    Fig. 36 shows a comparison of the latency of the two systems for a constant request
size (8mb) and several block size values. Since a write request execution in the Baseline
system almost does not depend on the block size (except for the length of old data that
the controller reads for parity calculation), changing it has almost no influence on latency.
In the tpt-raid system, however, for a constant request size, the number of prep write
commands depends on the block size and, therefore, the latency is lower for a large block
size.




                             Figure 35: write request latency

6.4.3   Comparison with theoretical calculation
In this section, we compare the measured latency with the latency approximation that was
calculated in section 4.5.

READ requests
Fig. 37 shows the latency difference between the two systems for read requests. The latency
difference is shown both according to the theoretical calculation in 4.5.1 and according to
the actual results. The measurements were taken for several request size and block size
values.
    The main reason for the gap between the actual and the calculated latency difference is
the theoretical treatment of the prep read commands’ latency: we use an upper bound that
assumes that the execution time of prep read commands is linear. This is not accurate
because the stages in each command are pipelined. Our measurements show that each prep
read command adds 70% of the latency of a single prep read command. Fig. 38 shows the
latency difference between the two systems when the latency of prep read commands is
calculated as described above.

WRITE requests
Fig. 39 and Fig. 40 show a comparison between the actual latency of write requests and
the latency calculated according to the theoretical analysis in 4.5.2, for the Baseline and
tpt systems respectively. The

        Figure 36: write request latency = f(block size) with request size = 8mb

measurements were taken for several request sizes and several block size values. One can see
that there is almost no difference between the actual latency and the calculated latency in
both systems.




             (a) block size = 4kb                            (b) block size = 32kb




                                    (c) block size = 128kb

            Figure 37: Calculated and actual read request latency difference




             (a) block size = 4kb                            (b) block size = 32kb




                                    (c) block size = 128kb

Figure 38: Calculated and actual read request latency difference (empirical approximation)



    (a) block size = 4kb                            (b) block size = 32kb




                           (c) block size = 128kb

Figure 39: Calculated and actual write request latency (Baseline raid)




    (a) block size = 4kb                            (b) block size = 32kb




                           (c) block size = 128kb

 Figure 40: Calculated and actual write request latency (tpt-raid)



6.5     Comparison with single-box RAID
So far, we have compared the tpt-raid system with the Baseline raid system. However,
most raid systems that are used today are single-box raid systems (i.e., a single box contains
the raid controller and the disks). We next compare the performance of tpt-raid with that
of a single-box raid system.

6.5.1   READ requests
The only difference between the two systems is that the tpt-raid controller has to send
more commands (prep read) to the targets as part of the 3rd Party Transfer mechanism.
We now check the influence of the prep read commands on tpt-raid performance.
    Fig. 41 shows the latency overhead incurred by prep read commands for several request
sizes (128kb-8mb) and several block sizes (4-128kb). For a small block size, the overhead
incurred by prep read commands is high (relative to the overall latency). For a larger
block size, fewer prep read commands are required and, therefore, the overhead incurred
by them is smaller. For block size = 128kb, the prep read commands latency is negligible
(relative to the overall latency).




             Figure 41: 3rd Party Transfer latency overhead in read requests

   Fig. 42 shows the cpu usage of the controller’s tx thread incurred by sending prep
read commands when the host sends a sequence of read requests. For a small block size
(< 32kb), more prep read commands are required and, therefore, the cpu usage is very
high. As the block size grows, the cpu usage of sending prep read commands decreases
(because InfiniBand becomes the limiting factor). For block size = 128kb, the cpu
usage of sending prep read commands is below 15%.



               Figure 42: 3rd Party Transfer cpu overhead in read requests

6.5.2   WRITE requests
There are a few differences between the two systems in the execution of write requests:

   • In the tpt-raid system, more commands (prep write commands and commands
     for parity calculation) are sent from the raid controller to the targets.

   • In the tpt-raid system, there are rdma operations between the targets as part of
     parity calculation.

     The extra commands mentioned above add more latency than the extra commands that
are sent in the case of a read request. However, this latency is still very small relative to
the overall latency of a write request.
     We next calculate how much latency is incurred by the rdma operations: if the request
size is large enough, the rdma operations for the parity calculation in the partial stripes are
negligible. Therefore, we will calculate only the latency incurred by the rdma operations for
the parity calculation in the complete stripes. For a tpt-raid system with raid size targets,
⌈log2 (raid size)⌉ steps are executed for the parity calculation in the complete stripes. For
raid size = 5, 3 steps are required. Therefore, for k complete stripes where raid size = 5, the
latency incurred by the rdma operations for the parity calculation is 3 · rdma latency(k ·
block size).
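
    A tiny sketch of this estimate is shown below; rdma latency() is a placeholder for a
measured latency model of a single rdma transfer of the given size and is not defined in this
work.

   import math

   def complete_stripe_parity_latency(k, block_size_kb, raid_size, rdma_latency):
       # ceil(log2(raid_size)) reduction steps; each step transfers k * block_size of data.
       steps = math.ceil(math.log2(raid_size))   # 3 steps for raid_size = 5
       return steps * rdma_latency(k * block_size_kb)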
     The maximum throughput of the tpt system may be lower than the throughput of a
single-box raid system when multiple hosts are connected because of the rdma operations
between the targets. Each target reads more data using rdma operations (roughly 80%
more data) and, therefore, its InfiniBand link will become a bottleneck when multiple hosts
are connected.


7       Other RAID types
3rd Party Transfer and ecc calculation in the targets may be used for other raid types. We
investigate two additional prominent disk array configurations: mirroring and Row-Diagonal
Parity [26] (rdp).

    • Mirroring: with the drop in disk drive cost and the ever growing gap between a disk
      drive’s storage and access capabilities, mirroring is a very common organization.

    • rdp: There is a growing importance of 2-fault tolerance. rdp is provably optimal in
      computational complexity, both during construction and reconstruction. Like other
      algorithms, it is optimal in the amount of redundant information stored and accessed.

    In this section, we carry out a brief comparison for mirroring and rdp.

7.1     Mirroring
When using mirroring with 3rd Party Transfer, data is sent directly between hosts and tar-
gets. Unlike raid-5, the data is not striped and, therefore, a target can read/write data
from/to a host with a single rdma operation.


7.1.1    READ requests
The controller has to send the same number of commands to the target whether 3rd Party
Transfer is used or not. The only difference is that when using 3rd Party Transfer, the target
writes the requested data directly to the host. Therefore, the latency is 50% lower.
   The maximum throughput of the tpt-raid when the number of hosts and targets is
unlimited is limited only by the cpu of the controller and, therefore, is much higher than the
maximum throughput of the Baseline system (which is limited by the network link of the
controller). Fig. 43 shows the maximum throughput of the Baseline raid-1 and tpt-raid-1
systems when the number of hosts and targets is unlimited.


7.1.2    WRITE requests
There are two ways to execute write requests:

    • Both targets read the data directly from the host: The controller has to send the same
      number of commands to the targets whether 3rd Party Transfer is used or not. The
      only difference is that when using 3rd Party Transfer, the targets read the requested
      data directly from the host. Therefore, the latency is 50% lower.




              Figure 43: Maximum throughput for read requests (mirroring)

   • A single target (target ’a’) reads the data directly from the host and the other target
      (target ’b’) reads the data from target ’a’: The controller has to send the same number
     of commands to the targets whether 3rd Party Transfer is used or not. However, since
     target ’a’ reads the data directly from the host, the latency is 33% lower.

   The maximum throughput of the tpt-raid when the number of hosts and targets is
unlimited is limited only by the cpu of the controller and, therefore, is much higher than the
maximum throughput of the Baseline system (which is limited by the network link of the
controller). Fig. 44 shows the maximum throughput of the Baseline raid-1 and tpt-raid-1
systems when the number of hosts and targets is unlimited.

7.1.3   Scalability
The system’s scalability can be calculated using the results in Fig. 43 and Fig. 44. Table 5
summarizes the maximum number of hosts that can be connected to a single-controller tpt
system (the numbers for the Baseline system are in parentheses to the right of the numbers
for the tpt-raid) that comprises a single target set without having the controller or the
targets become a bottleneck.

7.2     RDP
Row-Diagonal Parity (rdp) [26] is an extension of a raid-5 array with a second independent
distributed parity scheme. Data and parity are striped on a block level across multiple array


             Figure 44: Maximum throughput for write requests (mirroring)

                      Table 5: raid controller scalability (mirroring)

                  Request size     READ requests       WRITE requests
                    256 kb             18 (1)               9 (1)
                    512 kb             36 (1)              18 (1)
                     1 mb              73 (1)              36 (1)
                     8 mb             587 (1)             293 (1)


members, just like in raid-5, and a second set of parity is calculated and written across all
the drives.
    The execution of read requests is identical to raid-5. When executing write requests, the
row parity is calculated in a similar way to raid-5. Then, the diagonal parity is calculated.
We distinguish between a grouping of p − 1 stripes (p is a parameter of an rdp array; for
more details, refer to [26]) that includes a complete set of row and diagonal parity sets
(a complete stripes-group) and a grouping of fewer than p − 1 stripes (a partial stripes-
group). The diagonal parity calculation for a complete stripes-group is very similar to the
parity calculation of complete stripes in raid-5. The diagonal parity calculation for a partial
stripes-group is similar to the parity calculation of partial stripes in raid-5.
    The maximum throughput of the tpt-raid when the number of hosts and targets is
unlimited is limited only by the cpu of the controller and, therefore, is much higher than
the maximum throughput of the Baseline system (which is limited by the controller’s cpu
that has to perform more xor operations). Fig. 45 shows the maximum throughput of the


Baseline rdp-raid and tpt-rdp-raid systems when the number of hosts and targets is
unlimited.




            Figure 45: Maximum throughput for write requests (rdp-raid)

    The system’s scalability can be calculated using the results in Fig. 45. Table 6 summa-
rizes the maximum number of hosts that can be connected to a single-controller tpt system
(the numbers for the Baseline system are in parentheses to the right of the numbers for the
tpt-raid) that comprises a single target set without having the controller or the targets
become a bottleneck.

                        Table 6: raid controller scalability (rdp)

                    Request size     Block size   WRITE requests
                       1 mb            32 kb          1 (1)
                       1 mb            64 kb          1 (1)
                       8 mb            32 kb          2 (1)
                       8 mb            64 kb          3 (1)




8     Summary
In this research, we focused on relieving the performance bottleneck in multi-box raid sys-
tems by using an out-of-the-data-path controller.

   We proposed to remove the controller from the data path by adding two mechanisms:
3rd Party Transfer and distributed parity calculation in the targets. These mechanisms do
not require hardware changes, and only a few changes are required in the scsi, iscsi and
iser protocols. The required protocol extensions have been detailed.

    In a multi-box raid with 3rd Party Transfer and distributed parity calculation in the
targets, requested data is transferred directly between hosts and targets, while the controller
still manages and handles all the control message exchange. Parity calculations are performed
among the targets under the controller’s command. Consequently, no data passes through
the controller and it does not need to perform parity calculations; therefore, it is limited
only by its cpu and is more scalable than the Baseline system. On the other hand,
using a single controller retains simplicity. The architecture of tpt-raid and its prototype
were both in the context of kernel space. However, tpt-raid can also be implemented in
user space, and the prototype can be ported.

   The performance comparison shows that tpt-raid scales better than the Baseline sys-
tem. In all raid types that were compared, the Baseline controller is a bottleneck even
when a single host is connected. For raid-5, the comparison shows that the tpt controller
can handle up to 5 hosts without becoming a bottleneck. For mirroring, the tpt controller
can handle up to 293 hosts without becoming a bottleneck. For rdp, the tpt controller can
handle up to 3 hosts without becoming a bottleneck.

   The latency of the two systems was also compared. The results showed that for a large
enough block size (≥ 16kb), the latency with tpt-raid is not higher than with the Baseline
system. For a small block size (< 16kb), the latency of tpt-raid is higher than the Baseline
system and the latency difference grows as the block size gets smaller.

   We have proven the correctness of the new mechanisms and shown the potential perfor-
mance increase. By way of a complete prototype, we demonstrated the correctness of the
scheme even with standard Linux machines and InfiniBand hardware.
   tpt-raid dramatically improves scalability, with a single pc-based controller capable
of managing many targets and a large amount of data traffic over a conventional switched-
network infrastructure.




A     Required changes to protocols
A.1     Changes to the SCSI protocol
The tpt-raid architecture only requires additional scsi commands. These commands are
sent only from the raid controller to the target. The new commands are handled by the
target purely in software (i.e., the target’s scsi hardware does not need to support these
opcodes). A short sketch of how a target might handle the prep read and prep write
commands is given at the end of this section.
   The following commands were added:
  1. prep read (see Fig. 46): Requests the target to prepare for a read operation. The
     iser header for this command contains a read stag (refer to [7]). If the final field is
     set to ’1’, the target is requested to perform the following operations:

      (a) Allocate a buffer, and read from the disk according to the data that was received
          in all prep read commands specifying the same value for the group number
          field (in the cdb) as the current prep read command.
      (b) Execute rdma write operations to the host for each prep read command spec-
          ifying the same value for the group number field as the current prep read
          command. The buffer that was allocated by the prep read command and the
          read stag that was received with it should be used for the rdma write operation.

      See the read (10) command in [14] for the definition of all cdb fields.




                            Figure 46: prep read command

  2. prep write (see Fig. 47): Requests the target to prepare for a write operation. The
     iser header for this command contains a write stag. If the final field is set to ’1’,
      the target is requested to allocate a buffer and execute rdma read operations from
     the host for each prep write command specifying the same value for the group
     number field as the current prep write command. The buffer that was allocated
     and the write stag that was received with the prep write command should be used
     for the rdma read operation.
      See the read (10) command in [14] for the definition of all cdb fields.

                         Figure 47: prep write command

3. read old block (see Fig. 48): Requests the target to read a data block from the
   disk and store it in a target buffer. The parity mode field in the cdb specifies the
   parity calculation method.
  See the read (10) command in [14] for the definition of other cdb fields.




                      Figure 48: read old block command

4. read new block (see Fig. 49): Requests the target to perform an rdma read
   operation from the host according to the write stag in the iser header into a target
   buffer.
  See the read (10) command in [14] for the definition of all cdb fields.

5. read parity part tmp (see Fig. 50): Requests the target to perform the following
   operations as part of a partial stripe ecc calculation:

   (a) Read a block from another target using an rdma read operation according to the
       write stag in the iser header.
   (b) Perform an xor operation between the rdma read buffer and the buffer that
       was allocated to store the result of the xor operation of the xdwrite command
       specifying the same values for the group number field, the logical block

                     Figure 49: read new block command

       address field, and the transfer length field. The result will be stored in the
       buffer that was allocated in the xdwrite command.

  The parity mode field in the cdb specifies the parity calculation method. See the
  read (10) command in [14] for the definition of other cdb fields.




                  Figure 50: read parity part tmp command

6. read parity comp tmp (see Fig. 51): Requests the target to perform the following
   operations as part of a complete stripe ecc calculation:

   (a) Read blocks from another target using an rdma read operation according to the
       write stag in the iser header.
   (b) If no previous read parity comp tmp commands specifying the same value
        for the group number field were received, perform an xor operation between
       the rdma read buffer and the buffer that was allocated in the last prep write
       command specifying the same values for the group number field, the logical
       block address field, and the transfer length field as the read parity
       comp tmp command. Store the result in a target buffer. Else, perform an xor
       operation between the rdma read buffer and the target buffer in which the xor
       result was stored in the previous read parity comp tmp command specifying

       the same values for the group number field, the logical block address
       field, and the transfer length field as the current read parity comp tmp
       command.

  See the read (10) command in [14] for the definition of all cdb fields.




                  Figure 51: read parity comp tmp command

7. read parity part block (see Fig. 52): Requests the target to perform the following
   operations:

   (a) Read a block from another target using an rdma read operation according to the
       write stag in the iser header.
   (b) Perform an xor operation between the rdma read block and the block that was
       allocated in the read old block command specifying the same values for the
       group number field, the logical block address field, and the transfer
       length field. The result of the xor operation is the new parity block of a partial
       stripe.

  The parity mode field in the cdb specifies the parity calculation method. See the
  read (10) command in [14] for the definition of other cdb fields.




                 Figure 52: read parity part block command


  8. read parity comp block (see Fig. 53): Requests the target to read a block from
     a target as part of a complete stripe ecc calculation. The block that was read is the
     new parity block of a complete stripe. The own field in the cdb specifies whether the
     target that holds the block to be read is the same target that received the command.
      See the read (10) command in [14] for the definition of other cdb fields.




                    Figure 53: read parity comp block command

  9. write new data (see Fig. 54): Requests the target to write data to the disk. The
     data comprises new data that was read in the last prep write command, parity blocks
     of partial stripes that were read in read parity part block commands and parity
     blocks of complete stripes that were read in read parity comp block commands.
     All commands specify the same value for the group number field.
      See the write (10) command in [14] for the definition of other cdb fields.




                         Figure 54: write new data command
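
   The following sketch (not the prototype’s code) illustrates how a target might process a
stream of prep read commands according to the semantics above. The buffer, disk and
rdma calls are placeholders for the target’s actual services, and the single contiguous disk
read is a simplification of reading “according to the data that was received in all prep read
commands” of the group.

   pending = {}   # group number -> list of saved prep read commands

   def on_prep_read(cmd):
       pending.setdefault(cmd.group_number, []).append(cmd)
       if not cmd.final:
           return
       cmds = pending.pop(cmd.group_number)
       # (a) Allocate a buffer and read from the disk according to all saved
       #     commands of this group (shown here as one contiguous read).
       start = min(c.logical_block_address for c in cmds)
       length = sum(c.transfer_length for c in cmds)
       buf = allocate_buffer(length)
       disk_read(start, length, buf)
       # (b) One rdma write to the host per saved command, using the read stag
       #     carried in that command's iser header.
       for c in cmds:
           rdma_write(stag=c.read_stag, data=buf.slice_for(c))
       send_response(cmd)

   prep write handling is symmetric: on the command whose final field is set to ’1’, the
target allocates a buffer and issues one rdma read from the host per saved prep write
command, using the write stag carried in each.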


A.2     Changes to the iSCSI protocol
The tpt-raid architecture requires several changes to the iscsi protocol. Most of the
required changes are in the raid controller and the targets. Almost no changes are required


in the host. All changes are in software layers. The following sections describe the required
changes.

A.2.1     PDU formats
No changes are required to pdus that are sent between the host and the raid controller. The
tpt-raid architecture requires changes in the following pdu types that are sent between
the raid controller and the targets:

  1. scsi command: An rdma entity id field should be added to the pdu. If a target that
     receives such a pdu needs to perform an rdma operation, it uses the received rdma
     entity id in order to determine the identity of the passive side of the rdma operation.

  2. scsi login request: The following fields should be added to the pdu:

        (a) Local entity id: It is used in order to send the entity id of the sender.
        (b) Remote entity id: it is used in order to notify the receiver of its own entity id.

  3. scsi login response: The local entity id field should be added to the pdu. It is used
     in order to send the entity id of the sender.

A.2.2     Login and logout
This was already described in section 4.2.

A.3      Changes to the Datamover protocol
The tpt-raid architecture requires changes in several Datamover primitives.
   The following list contains new Datamover primitives and Datamover primitives that
require changes:

  1. Send Control:
      Additional input qualifiers: stag.
      It is used by the target to notify the raid controller of a registered target buffer. The
      controller may use the stag later for 3rd Party Transfer between the target and another
      target. The stag input qualifier is also used by the controller when it sends commands
      to targets. The controller does not need to register buffers. Instead, it uses stags that
      were received from hosts or targets.

  2. Register Buffer:
      Input qualifiers: Connection Handle, Buffer Handle
      Return Results: Registered Buffer Handle, stag
      The target uses it to register a buffer to be used later for a 3rd Party Transfer operation.

3. Deregister Buffer:
  Input qualifiers: Connection Handle, Registered Buffer Handle
  Return Results: Not specified.
  The target uses it to deregister a buffer that was registered with the Register Buffer
  primitive.

4. Put Data:
  Additional input qualifiers: stag.
   The stag qualifier is used by the target when the stag that was received in the command
   pdu should not be used. The only case in which the stag qualifier is needed is the last
   prep write command, for which the stag values that were received in the prior prep
   write commands of the group should be used.

5. Get Data:
  Additional input qualifiers: stag.
   The stag qualifier is used by the target when the stag that was received in the command
   pdu should not be used. The only case in which the stag qualifier is needed is the last
   prep read command, for which the stag values that were received in the prior prep
   read commands of the group should be used.
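
   The following sketch shows how the extended primitives might be combined for a target-
to-target 3rd Party Transfer. The primitive names follow the list above; the Python call
signatures and helper names are illustrative only.

   def passive_target_prepare(conn_to_ctrl, buf, response_pdu):
       # Register a local buffer and report its stag to the controller.
       reg_buf, stag = Register_Buffer(Connection_Handle=conn_to_ctrl,
                                       Buffer_Handle=buf)
       Send_Control(conn_to_ctrl, pdu=response_pdu, stag=stag)
       return reg_buf

   def controller_command_other_target(conn_to_active_tgt, cmd_pdu, stag):
       # The controller never registers buffers itself; it forwards an stag it
       # received from a host or a target inside the command it sends.
       Send_Control(conn_to_active_tgt, pdu=cmd_pdu, stag=stag)

   def passive_target_cleanup(conn_to_ctrl, reg_buf):
       # Release the registration once the transfer group completes.
       Deregister_Buffer(Connection_Handle=conn_to_ctrl,
                         Registered_Buffer_Handle=reg_buf)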




B       READ and WRITE request examples
The following examples show the execution of read and write requests in the Baseline
and tpt-raid systems. All examples refer to Fig. 55 in which 15 data blocks are read or
written, starting from block 2.




                            Figure 55: Requested blocks example



B.1      READ request
B.1.1     Baseline system
The following steps are executed:

    1. The host sends a read command to the raid controller.

    2. The raid controller processes the command and sends read commands to all targets:

        (a) Target 0: 3 blocks, starting from block 5.
        (b) Target 1: 4 blocks, starting from block 6.
        (c) Target 2: 4 blocks, starting from block 2.
        (d) Target 3: 4 blocks, starting from block 3.
        (e) Target 4: 3 blocks, starting from block 4.



  3. Each target that receives the command reads the requested blocks from the disk, writes
     them to the raid controller using an rdma write operation, and sends a response.

  4. The raid controller waits until it receives 5 responses (from all targets). Then, it
     writes the requested data to the host using an rdma write operation, and sends a
     response.

B.1.2     TPT-RAID system
The following steps are executed:
  1. The host sends a read command to the raid controller.

  2. The raid controller processes the command and sends the following prep read com-
     mands:

        (a) Block 2 to target 2.
        (b) Block 3 to target 3.
        (c) Block 4 to target 4.
        (d) Block 5 to target 0.
        (e) Block 6 to target 1.
        (f) Block 7 to target 2.
        (g) Block 8 to target 3.
        (h) Block 9 to target 4.
        (i) Block 10 to target 0.
        (j) Block 11 to target 1.
        (k) Block 12 to target 2. The final field is set to ’1’.
        (l) Block 13 to target 3. The final field is set to ’1’.
       (m) Block 14 to target 4. The final field is set to ’1’.
        (n) Block 15 to target 0. The final field is set to ’1’.
        (o) Block 16 to target 1. The final field is set to ’1’.

  3. Each target that receives a prep read command saves its data. If the final field is
     set to ’1’, it performs the following operations:

        (a) Reads the requested data from the disk. The logical block address and
            transfer length are based on the data that was received in the prep read
            commands:
              i. Target 0: 3 blocks starting from block 5.
             ii. Target 1: 4 blocks starting from block 6.

            iii. Target 2: 4 blocks starting from block 2.
            iv. Target 3: 4 blocks starting from block 3.
             v. Target 4: 3 blocks starting from block 4.
        (b) Writes the data that was read from the disk to the host with rdma write opera-
            tions based on the read stag that was received in the prep read commands:
              i.   Target   0   performs   3   rdma   write   operations:   block   5,   block   10 and block 15.
             ii.   Target   1   performs   3   rdma   write   operations:   block   6,   block   11 and block 16.
            iii.   Target   2   performs   3   rdma   write   operations:   block   2,   block   7 and block 12.
            iv.    Target   3   performs   3   rdma   write   operations:   block   3,   block   8 and block 13.
             v.    Target   4   performs   3   rdma   write   operations:   block   4,   block   9 and block 14.

      Then, it sends a response to the raid controller.

  4. The raid controller waits until it receives 15 responses. Then, it sends a response to
     the host.
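
   The assignment of blocks to targets in step 2 above follows the data layout of Fig. 55.
A small sketch of a mapping that is consistent with this example (assuming the rotating-
parity layout implied by the figure, with raid size = 5):

   RAID_SIZE = 5

   def data_block_target(b):
       # Data block b resides on target (b mod raid_size).
       return b % RAID_SIZE

   def parity_block_target(stripe):
       # The parity block of stripe s resides on target (raid_size - 1 - s) mod raid_size.
       return (RAID_SIZE - 1 - stripe) % RAID_SIZE

   # Reproduces step 2: blocks 2..16 map to targets 2,3,4,0,1,2,3,4,0,1,2,3,4,0,1.
   print([data_block_target(b) for b in range(2, 17)])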

B.2      WRITE request
B.2.1     Baseline system
The following steps are executed:

  1. The host sends a write command to the raid controller.

  2. The raid controller processes the command and reads the new data from the host
     using an rdma read operation. Then it sends the following read commands:

        (a) Block 0 from target 0.
        (b) Block 1 from target 1.
        (c) Parity block 16-19 from target 0.
        (d) Block 16 from target 1.

      In the meantime, the raid controller recalculates the parity blocks for the complete
      stripes: parity block 4-7, parity block 8-11 and parity block 12-15.

  3. Each target that receives the read command reads the requested block from the disk,
     writes it to the raid controller using an rdma write operation, and sends a response.

  4. The raid controller waits until it receives 4 responses. Then, it recalculates the parity
     blocks for the partial stripes:

        (a) Parity block 0-3 = xor(old block 0, old block 1, new block 2, new block 3)
        (b) Parity block 16-19 = xor(old parity block 16-19, old block 16, new block 16)

     Then, the raid controller sends write commands to all targets (since all targets have
     requested blocks):

        (a) Target 0: 4 blocks starting at block 5.
        (b) Target 1: 4 blocks starting at block 6.
        (c) Target 2: 4 blocks starting at block 2.
        (d) Target 3: 4 blocks starting at block 3.
        (e) Target 4: 4 blocks starting at parity block 0-3.

  5. Each target that receives the write command reads the data from the raid controller
     using an rdma read operation, writes it to the disk, and sends a response to the raid
     controller.

  6. The raid controller waits until it receives 5 responses. Then, it sends a response to
     the host.

B.2.2     TPT-RAID system
The following steps are executed:

  1. The host sends a write command to the raid controller.

  2. The raid controller processes the command and sends the following prep write
     commands:

        (a) Block 2 to target 2.
        (b) Block 3 to target 3.
        (c) Parity block 0-3 to target 4.
        (d) Block 5 to target 0.
        (e) Block 6 to target 1.
        (f) Block 7 to target 2.
        (g) Parity block 4-7 to target 3.
        (h) Block 4 to target 4.
        (i) Block 10 to target 0.
        (j) Block 11 to target 1.
        (k) parity block 8-11 to target 2.
        (l) Block 8 to target 3.
     (m) Block 9 to target 4.
        (n) Block 15 to target 0.

   (o) Parity block 12-15 to target 1.
   (p) Block 12 to target 2. The final field is set to ’1’.
   (q) Block 13 to target 3. The final field is set to ’1’.
    (r) Block 14 to target 4. The final field is set to ’1’.
    (s) Parity block 16-19 to target 0. The final field is set to ’1’.
    (t) Block 16 to target 1. The final field is set to ’1’.

  The raid controller sends the following xdwrite command:

   (a) Block 16 to target 1.

  The raid controller sends the following read old block commands:

   (a) Block 0 to target 0. (parity mode is batch.)
   (b) Block 1 to target 1. (parity mode is batch.)
   (c) Parity block 16-19 to target 0. (parity mode is incremental.)

  The raid controller sends the following read new block commands:

   (a) Block 2 to target 2.
   (b) Block 3 to target 3.

3. Each target that receives a prep write command saves its data. If the final field
   is set to ’1’, it reads the new data from the host with rdma read operations based on
   the write stag that was received in the prep write commands:

   (a) Target 0 performs 3 rdma read operations: block 5, block 10 and block 15.
   (b) Target 1 performs 3 rdma read operations: block 6, block 11 and block 16.
   (c) Target 2 performs 3 rdma read operations: block 2, block 7 and block 12.
   (d) Target 3 performs 3 rdma read operations: block 3, block 8 and block 13.
   (e) Target 4 performs 3 rdma read operations: block 4, block 9 and block 14.

  Then, it sends a response to the raid controller.

4. Each target that receives an xdwrite command reads the old data block from the
   disk, reads the new block from the host using an rdma read operation, and sends a
   response to the raid controller.

5. Each target that receives a read old block command reads the old data block from
   the disk, and sends a response to the raid controller.

6. Each target that receives a read new block command reads the new block from the
   host using an rdma read operation, and sends a response to the raid controller.

 7. The raid controller waits until it receives 26 responses. Then, it starts recalculating the
    parity blocks. (The following combinations of active and passive targets are random.)
    It sends the following read parity part tmp commands:

     (a) Block 1 to target 1 with write stag of block 0 in target 0. parity mode is batch.
     (b) Block 2 to target 2 with write stag of block 3 in target 3. parity mode is batch.

    The raid controller sends the following read parity comp tmp commands:

     (a) 3 blocks starting from block 6 to target 1 with write stag of blocks 5,10 and 15
         in target 0.
     (b) 3 blocks starting from block 7 to target 2 with write stag of blocks 4,9 and 14 in
         target 4.

 8. Each (active) target that receives a read parity part tmp command reads the data
    block from the other (passive) target using an rdma read operation. Then it checks
    the parity mode field in the cdb:

       • Incremental parity calculation: It performs an xor operation between the rdma
         read buffer and the buffer that was allocated to store the result of the xor oper-
         ation of the xdwrite command of the same data block, stores the result in the
          buffer that was allocated in the xdwrite command, and sends a response to the
         raid controller.
       • Batch parity calculation: Since no previous read parity part tmp commands
         specifying the same values for the group number field, the logical block
         address field and the transfer length field were received, it performs an
         xor operation between the rdma read buffer and the buffer that was allocated
         in the read old block (block 1 in target 1) or read new block (block 2 in
         target 2), and sends a response to the raid controller.

 9. Each (active) target that receives a read parity comp tmp command reads the data
    blocks from the other (passive) target using an rdma read operation, performs an xor
    operation between the rdma read buffer and the buffer that was allocated in the last
    prep write command (since it’s the 1st iteration of read parity comp tmp in the
    target), stores the result in a target buffer, and sends a response to the raid controller.

10. The raid controller waits until it receives 4 responses. Then, it starts the next iteration
    of parity blocks calculation. It sends the following read parity part tmp command:

     (a) Block 2 to target 2 with write stag of block 1 in target 1. parity mode is batch.

    It sends the following read parity comp tmp command:

     (a) 3 blocks starting from block 7 to target 2 with write stag of parity block 4-7,
         block 8 and block 13 in target 3.

11. Target 2 receives the read parity part tmp command, reads the data block from
    target 1 using an rdma read operation, performs an xor operation between the rdma
    read buffer and the buffer that was allocated in the previous read parity part
    tmp command, stores the result in the buffer that was allocated in the previous read
    parity part tmp command, and sends a response to the raid controller.

12. Target 2 receives the read parity comp tmp command, reads the data blocks from
    target 3 using an rdma read operation, performs an xor operation between the rdma
    read buffer and the buffer that was allocated in the previous read parity comp
    tmp command, stores the result in the buffer that was allocated in the previous read
    parity comp tmp command, and sends a response to the raid controller.

13. The raid controller waits until it receives 2 responses from target 2. Then, it sends
    the following read parity part block commands:

     (a) Parity block 0-3 to target 4 with write stag of block 2 in target 2. parity mode
         is batch.
    (b) Parity block 16-19 to target 0 with write stag of block 16 in target 1. parity
        mode is incremental.

    The raid controller sends the following read parity comp block commands:

     (a) Parity block 4-7 to target 3 with write stag of block 7 in target 2.
    (b) Parity block 8-11 to target 2 with its own write stag.
     (c) Parity block 12-15 to target 1 with write stag of block 12 in target 2.

14. Each (active) target that receives a read parity part block command reads the
    block from the other (passive) target using an rdma read operation. If the parity
    mode is incremental, it performs an xor operation between the rdma read buffer and
    the buffer that was allocated in the read old block command of the same block.
    Then, it sends a response to the raid controller. If the parity mode is incremental,
    the result of the xor operation is the new parity block. If the parity mode is batch,
    the block that was read in the rdma operation is the new parity block.

15. Each (active) target that receives a read parity comp block command reads the
    block from the other (passive) target using an rdma read operation, and sends a
    response to the raid controller. The block that was read from the passive target is
    the new parity block.

 16. The raid controller waits until it receives 5 responses. Now, the parity calculation is
     complete. The binary tree that represents the parity calculation can be seen in Fig.
     56 and Fig. 57 (a generic sketch of such a reduction schedule is given at the end of
     this example). Now, the raid controller sends write new data commands to all
     targets (since all targets have requested blocks).



Figure 56: Example of a binary tree for parity calculation in partial stripes




Figure 57: Example of a binary tree for parity calculation in complete stripes




17. Each target that receives a write new data command writes the new data to the
    disk:

     (a) Target 0 writes blocks 5, 10 and 15 that were read from the host during the
         execution of the last prep write command and parity block 16-19 that was
         recalculated in a read parity part block command.
    (b) Target 1 writes blocks 6 and 11 that were read from the host during the execution
        of the last prep write command, parity block 12-15 that was read from another
        target in a read parity comp block command and block 16 that was read
        from the host in a xdwrite command.
     (c) Target 2 writes block 2 that was read from the host in a read new block
         command, blocks 7 and 12 that were read from the host during the execution of
         the last prep write command and parity block 8-11 that was read from another
         target in a read parity comp block command.
    (d) Target 3 writes block 3 that was read from the host in a read new block
        command, parity block 4-7 that was read from another target in a read parity
        comp block command and blocks 8 and 13 that were read from the host during
        the execution of the last prep write command.
     (e) Target 4 writes parity block 0-3 that was recalculated in a read parity part
         block command and blocks 4,9 and 14 that were read from the host during the
         execution of the last prep write command.

18. Each target sends a response to the raid controller.

19. The raid controller waits until it receives 5 responses. Then, it sends a response to
    the host.
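
    The pairwise combination of target buffers in steps 7–16 is one instance of an xor-
reduction schedule; the actual pairing is chosen by the controller and, per step 7, may be
random. A generic sketch of such a schedule:

   def reduction_schedule(holders):
       # Returns a list of rounds; each round is a list of (active, passive) pairs.
       # k holders are combined in ceil(log2(k)) rounds (3 rounds for 5 holders,
       # matching the step count cited in section 6.5.2).
       rounds = []
       remaining = list(holders)
       while len(remaining) > 1:
           pairs = [(remaining[i], remaining[i + 1])
                    for i in range(0, len(remaining) - 1, 2)]
           leftover = [remaining[-1]] if len(remaining) % 2 else []
           rounds.append(pairs)
           # The active side of each pair carries the partial xor into the next round.
           remaining = [active for active, _ in pairs] + leftover
       return rounds

   # Example: 5 buffer holders are combined in 3 rounds.
   print(reduction_schedule([0, 1, 2, 3, 4]))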




C      Maximum throughput — detailed comparison
This section contains a detailed calculation of the maximum throughput for read and write
requests, which was discussed in section 4.4. We compare the Baseline raid with tpt-raid.

C.1     READ requests
In order to assess the maximum system throughput for read requests for each system, we
calculate the amount of work of every resource in each entity. Then, by dividing the capac-
ity of a resource by the amount of work it has to do per request, we arrive at its maximum
throughput (requests/sec).

Host
The only difference between the two architectures is that when using the tpt-raid, the
host acts as the passive side of, possibly, more rdma write operations (from the target).
However, the total size of transferred data is similar.
   The following resources are used by the host in a read request for both architectures:
    • cpu: A single pdu is sent and received. Therefore, the cpu will never become a
      bottleneck.
    • Network link: The host acts as the passive side of rdma write operation(s) and,
      therefore, req size kb are received per request. Therefore:

      net max read req host /sec = net bw / req size

    • Memory: The host acts as a passive side of rdma write operation(s). Therefore,
      req size kb are written to the memory. Therefore:
                                          mem bw
      mem max read req host /sec =        req size

    Therefore, the maximum throughput of the host (for both architectures) is:

$$host\_max\_read\_thr = \min(net\_max\_read\_req^{host}/\mathrm{sec},\ mem\_max\_read\_req^{host}/\mathrm{sec})$$
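
As a quick sanity check, the host-side bound can be evaluated directly; the sketch below assumes,
as above, that req size is given in KB and the bandwidths in KB/sec, and the sample values are
placeholders.

    # Evaluates the host-side read-throughput bound derived above:
    #   host_max_read_thr = min(net_bw / req_size, mem_bw / req_size)
    # Units are assumed to be KB and KB/sec; the sample numbers are placeholders.

    def host_max_read_thr(req_size: float, net_bw: float, mem_bw: float) -> float:
        net_max = net_bw / req_size   # requests/sec limited by the host link
        mem_max = mem_bw / req_size   # requests/sec limited by host memory
        return min(net_max, mem_max)

    if __name__ == "__main__":
        print(host_max_read_thr(req_size=1024.0, net_bw=900_000.0, mem_bw=3_000_000.0))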
RAID controller
The main difference is that the Baseline raid controller acts as a passive side of rdma
write operations (from the target) and as an active side of rdma operations (to the host).
In addition, as described in 4.3.6, the targets in the Baseline raid send unnecessary parity
blocks to the controller. The tpt-raid controller is used only for control pdus (command
and response).
    Another difference between the raid controllers is that the tpt-raid controller has to
send more commands (prep read) to the targets. Sending these commands does not render
the network link a bottleneck because the pdu size is very small. However, it adds cpu
work, especially for a small block size, for which many prep read commands are required
to execute a single read command.
    The following resources are used by the controller in a read request:

   • cpu:

        – Baseline system: The controller sends a read command to each participating
          target (i.e. each target that has requested data blocks). If req size/block size <
          raid size, not all targets participate in the request and, therefore, req size/block size
          commands are sent. Otherwise, we assume that raid size commands are sent
          (although in some cases raid size − 1 commands are sent). The controller also
          sends a response to the host. The response is sent in another context (and may
          even be executed by a different cpu), and since only a single pdu is sent per
          request, we ignore it in the throughput calculation. Therefore:

          $$sent\_pdus\_read^{ctrl}_{Baseline}(req\_size) =
          \begin{cases}
          \frac{req\_size}{block\_size}, & \frac{req\_size}{block\_size} < raid\_size\\
          raid\_size, & \text{else}
          \end{cases}$$

          $$cpu\_max\_read\_req^{ctrl}_{Baseline}/\mathrm{sec} =
          \frac{100}{cpu\_pdu\_work \cdot sent\_pdus\_read^{ctrl}_{Baseline}(req\_size)}$$

        – tpt system: The controller sends a prep read command for each data block.
          Therefore, req size/block size prep read commands are sent. The controller also
          sends a response to the host. The response is sent in another context and, since
          only a single pdu is sent, we ignore it. Therefore:

          $$sent\_pdus\_read^{ctrl}_{tpt}(req\_size) = \frac{req\_size}{block\_size}$$

          $$cpu\_max\_read\_req^{ctrl}_{tpt}/\mathrm{sec} =
          \frac{100}{cpu\_pdu\_work \cdot sent\_pdus\_read^{ctrl}_{tpt}(req\_size)}$$


   • Network link:

        – Baseline system: The targets send the requested data to the controller. Also, as
          explained in section 4.3.6, the targets send parity blocks. Then, the controller
          sends the requested data to the host. Therefore:

          $$rcvd\_data\_read^{ctrl}_{Baseline}(req\_size) = req\_size + read\_req\_parity\_blocks\_cnt(req\_size) \cdot block\_size$$

          $$sent\_data\_read^{ctrl}_{Baseline}(req\_size) = req\_size$$

          $$net\_rcv\_max\_read\_req^{ctrl}_{Baseline}/\mathrm{sec} = \frac{net\_bw}{rcvd\_data\_read^{ctrl}_{Baseline}(req\_size)}$$

          $$net\_send\_max\_read\_req^{ctrl}_{Baseline}/\mathrm{sec} = \frac{net\_bw}{sent\_data\_read^{ctrl}_{Baseline}(req\_size)}$$

        – tpt system: No data passes through the controller.

   • Memory:

        – Baseline system: Very similar to the network link calculation (all memory work
          is caused only by rdma operations):

          $$mem\_max\_read\_req^{ctrl}_{Baseline}/\mathrm{sec} =
          \frac{mem\_bw}{rcvd\_data\_read^{ctrl}_{Baseline}(req\_size) + sent\_data\_read^{ctrl}_{Baseline}(req\_size)}$$

        – tpt system: No data passes through the controller.

    Therefore, the maximum throughput of the Baseline controller is:

$$ctrl\_max\_read\_thr_{Baseline} = \min(cpu\_max\_read\_req^{ctrl}_{Baseline}/\mathrm{sec},\ net\_rcv\_max\_read\_req^{ctrl}_{Baseline}/\mathrm{sec},\ net\_send\_max\_read\_req^{ctrl}_{Baseline}/\mathrm{sec},\ mem\_max\_read\_req^{ctrl}_{Baseline}/\mathrm{sec})$$

    The maximum throughput of the tpt controller is:

$$ctrl\_max\_read\_thr_{tpt} = cpu\_max\_read\_req^{ctrl}_{tpt}/\mathrm{sec}$$
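
The controller-side read bounds can be transcribed almost verbatim. In the sketch below,
read_parity_blocks stands for read req parity blocks cnt(req size) of section 4.3.6 and must be
supplied by the caller, since its exact value is not reproduced here; all other parameters follow
the conventions above.

    # Controller-side read-throughput bounds, transcribed from the formulas above.
    # read_parity_blocks is the value of read_req_parity_blocks_cnt(req_size) from
    # section 4.3.6; it must be supplied by the caller because its exact count is
    # not reproduced in this sketch. All sizes are in KB, bandwidths in KB/sec.

    def ctrl_max_read_thr_baseline(req_size, block_size, raid_size,
                                   net_bw, mem_bw, cpu_pdu_work,
                                   read_parity_blocks):
        blocks = req_size / block_size
        sent_pdus = blocks if blocks < raid_size else raid_size   # read commands sent
        cpu_max = 100.0 / (cpu_pdu_work * sent_pdus)
        rcvd = req_size + read_parity_blocks * block_size         # data + parity from targets
        sent = req_size                                           # data forwarded to the host
        return min(cpu_max, net_bw / rcvd, net_bw / sent, mem_bw / (rcvd + sent))

    def ctrl_max_read_thr_tpt(req_size, block_size, cpu_pdu_work):
        sent_pdus = req_size / block_size                         # one prep read per data block
        return 100.0 / (cpu_pdu_work * sent_pdus)                 # only the cpu matters

    if __name__ == "__main__":
        args = dict(req_size=1024.0, block_size=64.0, raid_size=5,
                    net_bw=900_000.0, mem_bw=3_000_000.0, cpu_pdu_work=0.001)
        print(ctrl_max_read_thr_baseline(**args, read_parity_blocks=1))
        print(ctrl_max_read_thr_tpt(req_size=1024.0, block_size=64.0, cpu_pdu_work=0.001))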
Target
When using the tpt-raid, the target acts as the active side of, possibly, more rdma write
operations (to the host). Also, as explained in section 4.3.6, the targets in the Baseline
system send unnecessary parity blocks.
   The following resources are used by the target in a read request:

   • cpu:

        – Baseline system: A single pdu is received. Therefore, for the throughput calcu-
          lation, it may be ignored.
        – tpt system: The targets send a response for each received pdu. In addition, a
          target performs an rdma write operation for each received prep read command.

          $$sent\_pdus\_read^{target}_{tpt}(req\_size) =
          \frac{sent\_pdus\_read^{ctrl}_{tpt}(req\_size) + \frac{req\_size}{block\_size}}{raid\_size}$$

          $$cpu\_max\_read\_req^{target}_{tpt}/\mathrm{sec} =
          \frac{100}{cpu\_pdu\_work \cdot sent\_pdus\_read^{target}_{tpt}(req\_size)}$$

   • Network link:

        – Baseline system:

          $$sent\_data\_read^{target}_{Baseline}(req\_size) =
          \frac{rcvd\_data\_read^{ctrl}_{Baseline}(req\_size)}{raid\_size}$$

          $$net\_send\_max\_read\_req^{target}_{Baseline}/\mathrm{sec} =
          \frac{net\_bw}{sent\_data\_read^{target}_{Baseline}(req\_size)}$$

        – tpt system:

          $$sent\_data\_read^{target}_{tpt}(req\_size) = \frac{req\_size}{raid\_size}$$

          $$net\_send\_max\_read\_req^{target}_{tpt}/\mathrm{sec} =
          \frac{net\_bw}{sent\_data\_read^{target}_{tpt}(req\_size)}$$



   • Disk operations: Both systems use this resource identically.

     $$disk\_read(req\_size) = sent\_data\_read^{target}_{Baseline}(req\_size)$$

   • Memory:

        – Baseline system:

          $$mem\_max\_read\_req^{target}_{Baseline}/\mathrm{sec} =
          \frac{mem\_bw}{sent\_data\_read^{target}_{Baseline}(req\_size) + disk\_read(req\_size)}$$

        – tpt system:

          $$mem\_max\_read\_req^{target}_{tpt}/\mathrm{sec} =
          \frac{mem\_bw}{sent\_data\_read^{target}_{tpt}(req\_size) + disk\_read(req\_size)}$$



    Therefore, the maximum throughput of the Baseline targets is:

$$target\_max\_read\_thr_{Baseline} = \min(net\_send\_max\_read\_req^{target}_{Baseline}/\mathrm{sec},\ mem\_max\_read\_req^{target}_{Baseline}/\mathrm{sec})$$

    The maximum throughput of the tpt targets is:

$$target\_max\_read\_thr_{tpt} = \min(cpu\_max\_read\_req^{target}_{tpt}/\mathrm{sec},\ net\_send\_max\_read\_req^{target}_{tpt}/\mathrm{sec},\ mem\_max\_read\_req^{target}_{tpt}/\mathrm{sec})$$

    The maximum throughput of the Baseline system is:

$$max\_read\_thr_{Baseline} = \min(host\_max\_read\_thr,\ ctrl\_max\_read\_thr_{Baseline},\ target\_max\_read\_thr_{Baseline})$$

    The maximum throughput of the tpt system is:

$$max\_read\_thr_{tpt} = \min(host\_max\_read\_thr,\ ctrl\_max\_read\_thr_{tpt},\ target\_max\_read\_thr_{tpt})$$

    In the Baseline raid, the bottleneck is the raid controller’s link, because all data passes
through it. In the tpt-raid, the bottleneck is either the controller’s cpu (for a small block
size) or the host’s link (for a larger block size).
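
The sketch below puts the read-side pieces together and computes max_read_thr for both systems,
following the formulas of this subsection; the parameter values and the read_parity_blocks count
are placeholders, so the absolute numbers are meaningless, but the comparison illustrates why the
Baseline controller link and the tpt controller cpu (or host link) emerge as the respective
bottlenecks.

    # End-to-end read-throughput comparison between the Baseline RAID and TPT-RAID,
    # following the formulas of section C.1. Units: KB and KB/sec. The numeric
    # parameters and read_parity_blocks value are placeholders, not measurements.

    def max_read_thr(req_size, block_size, raid_size, net_bw, mem_bw,
                     cpu_pdu_work, read_parity_blocks):
        blocks = req_size / block_size

        # Host (identical for both systems).
        host = min(net_bw / req_size, mem_bw / req_size)

        # Baseline controller: all data (plus unnecessary parity blocks) passes through it.
        b_rcvd = req_size + read_parity_blocks * block_size
        b_sent = req_size
        b_pdus = blocks if blocks < raid_size else raid_size
        ctrl_baseline = min(100.0 / (cpu_pdu_work * b_pdus),
                            net_bw / b_rcvd, net_bw / b_sent, mem_bw / (b_rcvd + b_sent))

        # TPT controller: control pdus only (one prep read per data block).
        ctrl_tpt = 100.0 / (cpu_pdu_work * blocks)

        # Targets.
        disk_read = b_rcvd / raid_size                    # per-target disk traffic
        t_sent_baseline = b_rcvd / raid_size
        target_baseline = min(net_bw / t_sent_baseline,
                              mem_bw / (t_sent_baseline + disk_read))
        t_pdus_tpt = (blocks + blocks) / raid_size        # responses + rdma writes
        t_sent_tpt = req_size / raid_size
        target_tpt = min(100.0 / (cpu_pdu_work * t_pdus_tpt),
                         net_bw / t_sent_tpt,
                         mem_bw / (t_sent_tpt + disk_read))

        return (min(host, ctrl_baseline, target_baseline),
                min(host, ctrl_tpt, target_tpt))

    if __name__ == "__main__":
        baseline, tpt = max_read_thr(req_size=1024.0, block_size=64.0, raid_size=5,
                                     net_bw=900_000.0, mem_bw=3_000_000.0,
                                     cpu_pdu_work=0.001, read_parity_blocks=1)
        print(f"Baseline: {baseline:.1f} req/s, TPT: {tpt:.1f} req/s")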

C.2     WRITE requests
In order to assess the maximum system throughput for write requests for each system,
we calculate the amount of work of all resources in each entity. Then, by dividing the
capacity of a resource by the amount of work it has to do per request, we arrive at its
maximum throughput (requests/sec). In order to simplify the calculation, we assume that
$blocks\_count \bmod (raid\_size - 1) = 0$ and $req\_size \geq block\_size \cdot (raid\_size - 1)$.
    If the 1st stripe contains k data blocks, the last stripe contains raid size − 1 − k data
blocks. If k = raid size − 1, the number of complete stripes is $\frac{blocks\_count}{raid\_size - 1}$. Otherwise, the
number of complete stripes is $\frac{blocks\_count}{raid\_size - 1} - 1$.
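
The decomposition of a write request into a first partial stripe, complete stripes and a last
partial stripe recurs in every formula below; the following sketch merely restates the counting
argument of the preceding paragraph (the function name and return convention are ours).

    # Decomposes a write request into (first partial stripe, complete stripes,
    # last partial stripe), restating the counting argument above. k is the number
    # of data blocks in the first (possibly partial) stripe; blocks_count is assumed
    # to be a multiple of raid_size - 1, as in the text.

    def stripe_decomposition(blocks_count: int, raid_size: int, k: int):
        data_per_stripe = raid_size - 1
        assert blocks_count % data_per_stripe == 0
        assert 1 <= k <= data_per_stripe
        if k == data_per_stripe:
            # Request is stripe-aligned: no partial stripes at all.
            first_part, last_part = 0, 0
            complete = blocks_count // data_per_stripe
        else:
            first_part = k
            last_part = data_per_stripe - k
            complete = blocks_count // data_per_stripe - 1
        return first_part, complete, last_part

    if __name__ == "__main__":
        # 16 data blocks over a 5-box array (4 data blocks per stripe), with the
        # first stripe holding 3 of them.
        first, complete, last = stripe_decomposition(blocks_count=16, raid_size=5, k=3)
        assert first + complete * 4 + last == 16
        print(first, complete, last)   # -> 3 3 1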


Host
Similar to read requests.

$$host\_max\_write\_thr = host\_max\_read\_thr$$
RAID controller
There are several differences between the two systems:

   • The Baseline raid controller acts as an active side of an rdma read operation (from
     the host) in order to read the new data.

   • The Baseline raid controller has to read old blocks from the targets. In order to do
     that, it acts as a passive side of rdma write operations (from the targets).

   • The Baseline raid controller has to perform ecc calculations.

   • The tpt-raid controller sends more pdus to the targets.

   The following resources are used by the controller in a write request:

   • cpu:

        – Baseline system: The cpu is used for sending pdus to the targets and performing
          xor operations.
             ∗ Sending pdus to targets:
                 · Partial stripes:
                    i. Incremental parity calculation: The controller sends p + 1 read
                       commands.
                   ii. Batch parity calculation: The controller sends raid size − 1 − p
                       read commands.

                       $$sent\_pdus\_part\_write^{ctrl}_{Baseline}(p) =
                       \begin{cases}
                       p + 1, & p < \frac{raid\_size}{2} - 1 \text{ (incremental)}\\
                       raid\_size - 1 - p, & \text{else (batch)}
                       \end{cases}$$

                 · Complete stripes: No commands are sent.

               After the parity calculation for the partial and complete stripes is done,
               the controller sends raid size write commands.
               Therefore, the total number of commands is the sum of the commands for
               the first partial stripe (if it exists), the commands for the last partial stripe
               (if it exists) and the write commands that are sent after the parity calculation
               is done. Averaging over the possible values of k (the number of blocks in the
               first partial stripe):

               $$sent\_pdus\_write^{ctrl}_{Baseline}(req\_size) =
               sent\_pdus\_part\_write^{ctrl}_{Baseline}(first\_part\_stripe\_blocks) +
               sent\_pdus\_part\_write^{ctrl}_{Baseline}(last\_part\_stripe\_blocks) + raid\_size$$
               $$= \frac{1}{raid\_size - 1} \cdot 0 +
               \sum_{k=1}^{raid\_size-2} \frac{1}{raid\_size - 1} \cdot
               \left[ sent\_pdus\_part\_write^{ctrl}_{Baseline}(k) +
               sent\_pdus\_part\_write^{ctrl}_{Baseline}(raid\_size - 1 - k) \right] + raid\_size$$

               (The leading $\frac{1}{raid\_size - 1} \cdot 0$ term is the case $k = raid\_size - 1$, in which
               there are no partial stripes; the sum averages over a first partial stripe of $k$ blocks
               and a last partial stripe of $raid\_size - 1 - k$ blocks.)

             ∗ Parity calculation:
                 · Partial stripes:
                    i. Incremental parity calculation: In order to calculate the new par-
                       ity block when p new data blocks are written, an xor operation is
                       performed on 2 · p + 1 blocks. Therefore:
                       $$xor\_part\_inc^{ctrl}_{Baseline}(p) = 2 \cdot p + 1$$
                   ii. Batch parity calculation: In order to calculate the new parity block
                       when p new data blocks are written, an xor operation is performed
                       on raid size − 1 blocks. Therefore:
                       $$xor\_part\_batch^{ctrl}_{Baseline}(p) = raid\_size - 1$$
                 · Complete stripes: In order to calculate the new parity block, an xor
                   operation is performed on raid size − 1 blocks. Therefore:
                   $$xor\_comp^{ctrl}_{Baseline} = raid\_size - 1$$

               We will now calculate the total number of blocks that are used in xor operations
               that are executed during parity calculations by the controller in a single command.
               In each case below, the first term accounts for the complete stripes, the second
               term for the first partial stripe and the third term for the last partial stripe; the
               parity calculation method used for each partial stripe (incremental or batch)
               depends on the number of blocks it contains:

               $$xor^{ctrl}_{Baseline}(req\_size, k) =
               \begin{cases}
               \frac{blocks}{raid\_size - 1} \cdot xor\_comp^{ctrl}_{Baseline}, & k = raid\_size - 1 \text{ (no partial stripes)}\\[1ex]
               (\frac{blocks}{raid\_size - 1} - 1) \cdot xor\_comp^{ctrl}_{Baseline} + xor\_part\_inc^{ctrl}_{Baseline}(k) + xor\_part\_batch^{ctrl}_{Baseline}(raid\_size - 1 - k), & k < \frac{raid\_size}{2} - 1\\[1ex]
               (\frac{blocks}{raid\_size - 1} - 1) \cdot xor\_comp^{ctrl}_{Baseline} + xor\_part\_batch^{ctrl}_{Baseline}(k) + xor\_part\_batch^{ctrl}_{Baseline}(raid\_size - 1 - k), & \frac{raid\_size}{2} \geq k \geq \frac{raid\_size}{2} - 1\\[1ex]
               (\frac{blocks}{raid\_size - 1} - 1) \cdot xor\_comp^{ctrl}_{Baseline} + xor\_part\_batch^{ctrl}_{Baseline}(k) + xor\_part\_inc^{ctrl}_{Baseline}(raid\_size - 1 - k), & k > \frac{raid\_size}{2}
               \end{cases}$$

               Therefore:

               $$xor^{ctrl}_{Baseline}(req\_size) = \frac{1}{raid\_size - 1} \sum_{k=1}^{raid\_size-1} xor^{ctrl}_{Baseline}(req\_size, k)$$

               $$cpu\_max\_write\_req^{ctrl}_{Baseline}/\mathrm{sec} =
               \frac{100}{cpu\_pdu\_work \cdot sent\_pdus\_write^{ctrl}_{Baseline}(req\_size) + cpu\_xor\_block \cdot xor^{ctrl}_{Baseline}(req\_size)}$$


        – tpt system: The cpu is mainly used for sending pdus to the targets.
             ∗ Partial stripes:
                 · Incremental parity calculation: The controller sends p xdwrite com-
                   mands and a single read old block command. Then, during the parity
                   calculation phase, it sends p − 1 read parity part tmp commands and
                   a single read parity part block command.
                 · Batch parity calculation: The controller sends p read new block com-
                   mands and raid size − p − 1 read old block commands. Then, during
                   the parity calculation phase, it sends raid size − 2 read parity part
                   tmp commands and a single read parity part block command.

                   $$sent\_pdus\_part\_write^{ctrl}_{tpt}(p) =
                   \begin{cases}
                   2 \cdot p + 1, & p < \frac{raid\_size}{2} - 1 \text{ (incremental)}\\
                   2 \cdot raid\_size - 2, & \text{else (batch)}
                   \end{cases}$$

             ∗ Complete stripes: The controller sends c prep write commands. Then,
               during the parity calculation phase, it sends raid size − 1 read parity comp
               tmp commands and c/(raid size − 1) read parity comp block commands.

               $$sent\_pdus\_comp\_write^{ctrl}_{tpt}(c) =
               \begin{cases}
               c + raid\_size - 1 + \frac{c}{raid\_size - 1}, & c > 0\\
               0, & \text{else}
               \end{cases}$$

          After the parity calculation for the partial and complete stripes is done, the
          controller sends raid size write new data commands.
          Therefore, the total number of commands is the sum of the commands for the first
          partial stripe (if it exists), the commands for the complete stripes (if they exist), the
          commands for the last partial stripe (if it exists) and the write new data commands
          that are sent after the parity calculation is done. Averaging over k as before:

          $$sent\_pdus\_write^{ctrl}_{tpt}(req\_size) =
          sent\_pdus\_part\_write^{ctrl}_{tpt}(first\_part\_stripe\_blocks) +
          sent\_pdus\_part\_write^{ctrl}_{tpt}(last\_part\_stripe\_blocks) +
          sent\_pdus\_comp\_write^{ctrl}_{tpt}(blocks\_count - first\_part\_stripe\_blocks - last\_part\_stripe\_blocks) + raid\_size$$
          $$= \frac{1}{raid\_size - 1} \cdot sent\_pdus\_comp\_write^{ctrl}_{tpt}(blocks\_count) +
          \sum_{k=1}^{raid\_size-2} \frac{1}{raid\_size - 1} \cdot
          \left[ sent\_pdus\_part\_write^{ctrl}_{tpt}(k) + sent\_pdus\_part\_write^{ctrl}_{tpt}(raid\_size - 1 - k) +
          sent\_pdus\_comp\_write^{ctrl}_{tpt}(blocks\_count - raid\_size + 1) \right] + raid\_size$$

          (The first term is the case $k = raid\_size - 1$, in which there are no partial stripes and
          all $blocks\_count$ data blocks belong to complete stripes. These pdu counts are evaluated
          numerically in a short sketch following the controller throughput summary below.)

          $$cpu\_max\_write\_req^{ctrl}_{tpt}/\mathrm{sec} =
          \frac{100}{cpu\_pdu\_work \cdot sent\_pdus\_write^{ctrl}_{tpt}(req\_size)}$$


   • Network link: We will now calculate the cost of rdma operations in the raid controller.
     We distinguish between partial stripes and complete stripes. We also distinguish be-
     tween the two parity calculation methods that were discussed earlier.

        – Baseline system:
             ∗ Partial stripes:
                 · Incremental parity calculation: In order to write p data blocks when using
                   the incremental parity calculation method, the Baseline raid controller
                   has to read p data blocks from the host, read p + 1 blocks from the
                   targets, and write p + 1 blocks to the targets. Therefore:

                   $$rcvd\_data\_part\_inc\_write^{ctrl}_{Baseline}(p) = (2 \cdot p + 1) \cdot block\_size$$

                   $$sent\_data\_part\_inc\_write^{ctrl}_{Baseline}(p) = (p + 1) \cdot block\_size$$

                 · Batch parity calculation: In order to write p data blocks when using the
                   batch parity calculation method, the Baseline raid controller has to read
                   p data blocks from the host, read raid size − p − 1 data blocks from the
                   targets, and write p + 1 blocks to the targets. Therefore:

                   $$rcvd\_data\_part\_batch\_write^{ctrl}_{Baseline}(p) = (raid\_size - 1) \cdot block\_size$$

                   $$sent\_data\_part\_batch\_write^{ctrl}_{Baseline}(p) = (p + 1) \cdot block\_size$$

             ∗ Complete stripes: In order to write raid size − 1 data blocks, the Baseline
               raid controller has to read raid size − 1 data blocks from the host, and write
               raid size blocks to the targets. Therefore:

               $$rcvd\_data\_comp\_write^{ctrl}_{Baseline} = (raid\_size - 1) \cdot block\_size$$

               $$sent\_data\_comp\_write^{ctrl}_{Baseline} = raid\_size \cdot block\_size$$


          We will now calculate the total amount of data that is received and sent by the
          controller in a single command (same case structure as above: the first term counts
          the complete stripes, the second the first partial stripe and the third the last partial
          stripe). The received data is:

          $$rcvd\_data\_write^{ctrl}_{Baseline}(req\_size, k) =
          \begin{cases}
          \frac{blocks}{raid\_size - 1} \cdot rcvd\_data\_comp\_write^{ctrl}_{Baseline}, & k = raid\_size - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot rcvd\_data\_comp\_write^{ctrl}_{Baseline} + rcvd\_data\_part\_inc\_write^{ctrl}_{Baseline}(k) + rcvd\_data\_part\_batch\_write^{ctrl}_{Baseline}(raid\_size - 1 - k), & k < \frac{raid\_size}{2} - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot rcvd\_data\_comp\_write^{ctrl}_{Baseline} + rcvd\_data\_part\_batch\_write^{ctrl}_{Baseline}(k) + rcvd\_data\_part\_batch\_write^{ctrl}_{Baseline}(raid\_size - 1 - k), & \frac{raid\_size}{2} \geq k \geq \frac{raid\_size}{2} - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot rcvd\_data\_comp\_write^{ctrl}_{Baseline} + rcvd\_data\_part\_batch\_write^{ctrl}_{Baseline}(k) + rcvd\_data\_part\_inc\_write^{ctrl}_{Baseline}(raid\_size - 1 - k), & k > \frac{raid\_size}{2}
          \end{cases}$$

          Therefore:

          $$rcvd\_data\_write^{ctrl}_{Baseline}(req\_size) =
          \frac{1}{raid\_size - 1} \sum_{k=1}^{raid\_size-1} rcvd\_data\_write^{ctrl}_{Baseline}(req\_size, k)$$


          The total length of sent data is (with the same case structure):

          $$sent\_data\_write^{ctrl}_{Baseline}(req\_size, k) =
          \begin{cases}
          \frac{blocks}{raid\_size - 1} \cdot sent\_data\_comp\_write^{ctrl}_{Baseline}, & k = raid\_size - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot sent\_data\_comp\_write^{ctrl}_{Baseline} + sent\_data\_part\_inc\_write^{ctrl}_{Baseline}(k) + sent\_data\_part\_batch\_write^{ctrl}_{Baseline}(raid\_size - 1 - k), & k < \frac{raid\_size}{2} - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot sent\_data\_comp\_write^{ctrl}_{Baseline} + sent\_data\_part\_batch\_write^{ctrl}_{Baseline}(k) + sent\_data\_part\_batch\_write^{ctrl}_{Baseline}(raid\_size - 1 - k), & \frac{raid\_size}{2} \geq k \geq \frac{raid\_size}{2} - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot sent\_data\_comp\_write^{ctrl}_{Baseline} + sent\_data\_part\_batch\_write^{ctrl}_{Baseline}(k) + sent\_data\_part\_inc\_write^{ctrl}_{Baseline}(raid\_size - 1 - k), & k > \frac{raid\_size}{2}
          \end{cases}$$

          Therefore:

          $$sent\_data\_write^{ctrl}_{Baseline}(req\_size) =
          \frac{1}{raid\_size - 1} \sum_{k=1}^{raid\_size-1} sent\_data\_write^{ctrl}_{Baseline}(req\_size, k)$$

          $$net\_rcv\_max\_write\_req^{ctrl}_{Baseline}/\mathrm{sec} =
          \frac{net\_bw}{rcvd\_data\_write^{ctrl}_{Baseline}(req\_size)}$$

          $$net\_send\_max\_write\_req^{ctrl}_{Baseline}/\mathrm{sec} =
          \frac{net\_bw}{sent\_data\_write^{ctrl}_{Baseline}(req\_size)}$$




   – tpt system: No data passes through the controller.
• Memory:

        – Baseline system: Memory is used when performing rdma operations (as described
          above) and xor operations. Memory usage for xor operations in partial stripes
          and complete stripes is different:
             ∗ Partial stripes:
                 · Incremental parity calculation: In order to calculate the new parity block
                   when p new data blocks are written, p old data blocks are read, p new
                   data blocks are read, the old parity block is read and the new parity block
                   is written. Therefore:
                   $$mem\_xor\_part\_inc^{ctrl}_{Baseline}(p) = (2 \cdot p + 2) \cdot block\_size$$
                 · Batch parity calculation: In order to calculate the new parity block when
                   p new data blocks are written, raid size − p − 1 data blocks are read, p
                   new data blocks are read and the new parity block is written. Therefore:
                   $$mem\_xor\_part\_batch^{ctrl}_{Baseline}(p) = raid\_size \cdot block\_size$$
             ∗ Complete stripes: In order to calculate the new parity block, raid size − 1
               new data blocks are read and the new parity block is written. Therefore:
               $$mem\_xor\_comp^{ctrl}_{Baseline} = raid\_size \cdot block\_size$$

          We will now calculate the total amount of memory traffic that is generated during
          parity calculations by the controller in a single command (same case structure as
          above):

          $$mem\_xor^{ctrl}_{Baseline}(req\_size, k) =
          \begin{cases}
          \frac{blocks}{raid\_size - 1} \cdot mem\_xor\_comp^{ctrl}_{Baseline}, & k = raid\_size - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot mem\_xor\_comp^{ctrl}_{Baseline} + mem\_xor\_part\_inc^{ctrl}_{Baseline}(k) + mem\_xor\_part\_batch^{ctrl}_{Baseline}(raid\_size - 1 - k), & k < \frac{raid\_size}{2} - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot mem\_xor\_comp^{ctrl}_{Baseline} + mem\_xor\_part\_batch^{ctrl}_{Baseline}(k) + mem\_xor\_part\_batch^{ctrl}_{Baseline}(raid\_size - 1 - k), & \frac{raid\_size}{2} \geq k \geq \frac{raid\_size}{2} - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot mem\_xor\_comp^{ctrl}_{Baseline} + mem\_xor\_part\_batch^{ctrl}_{Baseline}(k) + mem\_xor\_part\_inc^{ctrl}_{Baseline}(raid\_size - 1 - k), & k > \frac{raid\_size}{2}
          \end{cases}$$

          Therefore:

          $$mem\_xor^{ctrl}_{Baseline}(req\_size) =
          \frac{1}{raid\_size - 1} \sum_{k=1}^{raid\_size-1} mem\_xor^{ctrl}_{Baseline}(req\_size, k)$$

          $$mem\_max\_write\_req^{ctrl}_{Baseline}/\mathrm{sec} =
          \frac{mem\_bw}{rcvd\_data\_write^{ctrl}_{Baseline}(req\_size) + sent\_data\_write^{ctrl}_{Baseline}(req\_size) + mem\_xor^{ctrl}_{Baseline}(req\_size)}$$



    – tpt system: No data passes through the controller.

    Therefore, the maximum throughput of the Baseline controller is:

$$ctrl\_max\_write\_thr_{Baseline} = \min(cpu\_max\_write\_req^{ctrl}_{Baseline}/\mathrm{sec},\ net\_rcv\_max\_write\_req^{ctrl}_{Baseline}/\mathrm{sec},\ net\_send\_max\_write\_req^{ctrl}_{Baseline}/\mathrm{sec},\ mem\_max\_write\_req^{ctrl}_{Baseline}/\mathrm{sec})$$

    The maximum throughput of the tpt controller is:

$$ctrl\_max\_write\_thr_{tpt} = cpu\_max\_write\_req^{ctrl}_{tpt}/\mathrm{sec}$$
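
The sketch below evaluates the average number of pdus sent per write request by the Baseline and
tpt controllers, averaging over the position k of the first partial stripe exactly as the sums above
do. Only the pdu-count part of the cpu bound is reproduced; the xor, network and memory terms
follow the same averaging pattern. The threshold raid size/2 − 1 for switching from the incremental
to the batch parity calculation is taken from the case conditions above.

    # Average number of pdus the controller sends per write request, for the
    # Baseline and TPT controllers, following the sums above. Only the pdu-count
    # part of the cpu bound is reproduced here; the xor/network/memory terms follow
    # the same averaging pattern. blocks_count is the number of data blocks in the
    # request and is assumed to be a multiple of raid_size - 1.

    def part_pdus_baseline(p, raid_size):
        # incremental vs. batch parity calculation for a partial stripe of p blocks
        return p + 1 if p < raid_size / 2 - 1 else raid_size - 1 - p

    def part_pdus_tpt(p, raid_size):
        return 2 * p + 1 if p < raid_size / 2 - 1 else 2 * raid_size - 2

    def comp_pdus_tpt(c, raid_size):
        # c data blocks in complete stripes: prep write + parity-calculation commands
        return c + raid_size - 1 + c / (raid_size - 1) if c > 0 else 0

    def sent_pdus_write_ctrl(blocks_count, raid_size, system):
        data_per_stripe = raid_size - 1
        total = 0.0
        for k in range(1, data_per_stripe + 1):          # average over the first-stripe offset
            if k == data_per_stripe:                     # no partial stripes
                per_k = comp_pdus_tpt(blocks_count, raid_size) if system == "tpt" else 0.0
            else:
                first, last = k, data_per_stripe - k
                comp_blocks = blocks_count - data_per_stripe
                if system == "tpt":
                    per_k = (part_pdus_tpt(first, raid_size) +
                             part_pdus_tpt(last, raid_size) +
                             comp_pdus_tpt(comp_blocks, raid_size))
                else:
                    per_k = (part_pdus_baseline(first, raid_size) +
                             part_pdus_baseline(last, raid_size))
            total += per_k
        return total / data_per_stripe + raid_size       # plus the final write commands

    if __name__ == "__main__":
        for system in ("Baseline", "tpt"):
            print(system, sent_pdus_write_ctrl(blocks_count=16, raid_size=5,
                                               system=system.lower()))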
Target
There are several differences between the two systems:

   • The targets in both systems have to read the new data. The difference is that when
     using the tpt-raid, the target possibly acts as the active side of more rdma read
     operations (from the host). However, the total size of transferred data is similar.

   • In the tpt-raid system, the target has to perform ecc calculations.

   • In the Baseline raid system, the target has to send the old data and old parity blocks
     to the raid controller using an rdma write operation.

   • In the tpt-raid system, the target has to send data blocks to other targets using an
     rdma write operation several times.

   The following resources are used by the target in a write request:

   • cpu: We will now calculate the cost of cpu operations in the target. We distinguish
     between partial stripes and complete stripes. We also distinguish between the two
     parity calculation methods that were discussed earlier.

        – Baseline system: Most of the cpu load in the targets is caused by initiating rdma
          operations and sending response pdus. This work is done by the same thread:
             ∗ Partial stripes:
                 · Incremental parity calculation: The targets receive p + 1 read commands.
                   For each read command, the target performs an rdma operation and
                   sends a response. Therefore:
                   $$sent\_pdus\_part\_inc\_write^{target}_{Baseline}(p) = \frac{2 \cdot (p + 1)}{raid\_size}$$
                 · Batch parity calculation: The targets receive raid size − 1 − p read
                   commands. For each read command, the target performs an rdma op-
                   eration and sends a response. Therefore:
                   $$sent\_pdus\_part\_batch\_write^{target}_{Baseline}(p) = \frac{2 \cdot (raid\_size - 1 - p)}{raid\_size}$$
             ∗ Complete stripes: No read commands are received.
          After the parity calculation for the partial and complete stripes is done, the
          controller sends raid size write commands. For each write command, the
          target performs an rdma operation, writes data to the disk (this is done by
          another thread) and sends a response.
          Therefore, the total number of pdus is the sum of the commands for the
          first partial stripe (if it exists), the commands for the complete stripes (if they
          exist) and the commands for the last partial stripe (if it exists). In each case
          below, the terms account for the first and the last partial stripe (complete stripes
          add no read commands), and the leading 1 accounts for the write command that
          each target handles after the parity calculation:

          $$sent\_pdus\_write^{target}_{Baseline}(req\_size, k) = 1 +
          \begin{cases}
          0, & k = raid\_size - 1\\[1ex]
          sent\_pdus\_part\_inc\_write^{target}_{Baseline}(k) + sent\_pdus\_part\_batch\_write^{target}_{Baseline}(raid\_size - 1 - k), & k < \frac{raid\_size}{2} - 1\\[1ex]
          sent\_pdus\_part\_batch\_write^{target}_{Baseline}(k) + sent\_pdus\_part\_batch\_write^{target}_{Baseline}(raid\_size - 1 - k), & \frac{raid\_size}{2} \geq k \geq \frac{raid\_size}{2} - 1\\[1ex]
          sent\_pdus\_part\_batch\_write^{target}_{Baseline}(k) + sent\_pdus\_part\_inc\_write^{target}_{Baseline}(raid\_size - 1 - k), & k > \frac{raid\_size}{2}
          \end{cases}$$

          Therefore:

          $$sent\_pdus\_write^{target}_{Baseline}(req\_size) =
          \frac{1}{raid\_size - 1} \sum_{k=1}^{raid\_size-1} sent\_pdus\_write^{target}_{Baseline}(req\_size, k)$$

          $$cpu\_max\_write\_req^{target}_{Baseline}/\mathrm{sec} =
          \frac{100}{cpu\_pdu\_work \cdot sent\_pdus\_write^{target}_{Baseline}(req\_size)}$$

        – tpt system: Most of the cpu load in the targets is caused by initiating rdma
          operations, sending response pdus and xor operations. This work is done by the
          same thread.
             ∗ Sending response pdus and rdma operations:
                 · Partial stripes:
                    i. Incremental parity calculation: The targets receive p xdwrite com-
                       mands and a single read old block command. For each xdwrite
                       command, the target performs an rdma operation and sends a re-
                       sponse. For each read old block command, the target reads data
                       from the disk (this is done by another thread) and sends a response.
                       Then, during the parity calculation phase, they receive p − 1 read
                       parity part tmp commands and a single read parity part block
                       command. For each read parity part tmp command and read
                       parity part block command, the target performs an rdma opera-
                       tion and sends a response.
                       $$sent\_pdus\_part\_inc\_write^{target}_{tpt}(p) = \frac{4 \cdot p + 1}{raid\_size}$$
                   ii. Batch parity calculation: The targets receive p read new block
                       commands and raid size − p − 1 read old block commands. For
                       each read new block command, the target performs an rdma op-
                       eration and sends a response. For each read old block command,
                       the target reads data from the disk (this is done by another thread)
                       and sends a response. Then, during the parity calculation phase,
                       they receive raid size − 2 read parity part tmp commands and a
                       single read parity part block command. For each read parity
                       part tmp command and read parity part block command, the
                       target performs an rdma operation and sends a response.
                       $$sent\_pdus\_part\_batch\_write^{target}_{tpt}(p) = \frac{p + 3 \cdot raid\_size - 3}{raid\_size}$$
                 · Complete stripes: The targets receive c prep write commands. For each
                   prep write command, the target sends a response. For the last prep write
                   command, the target performs c/raid size rdma operations and sends a
                   response. Then, during the parity calculation phase, they receive
                   raid size − 1 read parity comp tmp commands and c/(raid size − 1)
                   read parity comp block commands. For each read parity comp tmp
                   command and read parity comp block command, the target performs
                   an rdma operation and sends a response.
                   $$sent\_pdus\_comp\_write^{target}_{tpt}(c) =
                   \begin{cases}
                   \frac{2 \cdot c + 2 \cdot raid\_size - 2 + \frac{2 \cdot c}{raid\_size - 1}}{raid\_size}, & c > 0\\
                   0, & \text{else}
                   \end{cases}$$
               After the parity calculation for the partial and complete stripes is done, the
               controller sends raid size write new data commands. For each write
               new data command, the target writes data to the disk (this is done by
               another thread) and sends a response.
               Therefore, the total number of pdus is the sum of the commands for
               the first partial stripe (if it exists), the commands for the complete stripes
               (if they exist) and the commands for the last partial stripe (if it exists); the
               leading 1 again accounts for the write new data command handled by each
               target:

               $$sent\_pdus\_write^{target}_{tpt}(req\_size, k) = 1 +
               \begin{cases}
               sent\_pdus\_comp\_write^{target}_{tpt}(blocks), & k = raid\_size - 1\\[1ex]
               sent\_pdus\_comp\_write^{target}_{tpt}(blocks - raid\_size + 1) + sent\_pdus\_part\_inc\_write^{target}_{tpt}(k) + sent\_pdus\_part\_batch\_write^{target}_{tpt}(raid\_size - 1 - k), & k < \frac{raid\_size}{2} - 1\\[1ex]
               sent\_pdus\_comp\_write^{target}_{tpt}(blocks - raid\_size + 1) + sent\_pdus\_part\_batch\_write^{target}_{tpt}(k) + sent\_pdus\_part\_batch\_write^{target}_{tpt}(raid\_size - 1 - k), & \frac{raid\_size}{2} \geq k \geq \frac{raid\_size}{2} - 1\\[1ex]
               sent\_pdus\_comp\_write^{target}_{tpt}(blocks - raid\_size + 1) + sent\_pdus\_part\_batch\_write^{target}_{tpt}(k) + sent\_pdus\_part\_inc\_write^{target}_{tpt}(raid\_size - 1 - k), & k > \frac{raid\_size}{2}
               \end{cases}$$

               Therefore:

               $$sent\_pdus\_write^{target}_{tpt}(req\_size) =
               \frac{1}{raid\_size - 1} \sum_{k=1}^{raid\_size-1} sent\_pdus\_write^{target}_{tpt}(req\_size, k)$$


             ∗ Parity calculation:
                 · Partial stripes:
                    i. Incremental parity calculation: In order to calculate the new parity
                       block when p new data blocks are written, p xor operations between
                       the old data blocks and the new data blocks are performed. Then,
                       during the iterations to calculate the new parity block, p xor op-
                       erations are performed (between the temp result and the received
                       block). Therefore:
                       $$xor\_part\_inc^{target}_{tpt}(p) = \frac{4 \cdot p}{raid\_size}$$
                   ii. Batch parity calculation: In order to calculate the new parity block
                       when p new data blocks are written, raid size − 2 xor operations are
                       performed (between the temp result and the received block) during
                       the iterations to calculate the new parity block. Therefore:
                       $$xor\_part\_batch^{target}_{tpt}(p) = \frac{2 \cdot (raid\_size - 2)}{raid\_size}$$
                 · Complete stripes: In order to calculate the new parity blocks for all com-
                   plete stripes, raid size − 2 xor operations are performed per stripe (between
                   the temp result and the received blocks) during the iterations to calculate
                   the new parity blocks, where $\frac{c}{raid\_size - 1}$ is the number of complete
                   stripes. Therefore:
                   $$xor\_comp^{target}_{tpt}(c) = \frac{2 \cdot \frac{c}{raid\_size - 1} \cdot (raid\_size - 2)}{raid\_size}$$
               We will now calculate the total number of blocks that are used in xor op-
               erations that are executed during parity calculations by a target in a single
               request:

               $$xor^{target}_{tpt}(req\_size, k) =
               \begin{cases}
               xor\_comp^{target}_{tpt}(blocks), & k = raid\_size - 1\\[1ex]
               xor\_comp^{target}_{tpt}(blocks - raid\_size + 1) + xor\_part\_inc^{target}_{tpt}(k) + xor\_part\_batch^{target}_{tpt}(raid\_size - 1 - k), & k < \frac{raid\_size}{2} - 1\\[1ex]
               xor\_comp^{target}_{tpt}(blocks - raid\_size + 1) + xor\_part\_batch^{target}_{tpt}(k) + xor\_part\_batch^{target}_{tpt}(raid\_size - 1 - k), & \frac{raid\_size}{2} \geq k \geq \frac{raid\_size}{2} - 1\\[1ex]
               xor\_comp^{target}_{tpt}(blocks - raid\_size + 1) + xor\_part\_batch^{target}_{tpt}(k) + xor\_part\_inc^{target}_{tpt}(raid\_size - 1 - k), & k > \frac{raid\_size}{2}
               \end{cases}$$

               Therefore:

               $$xor^{target}_{tpt}(req\_size) =
               \frac{1}{raid\_size - 1} \sum_{k=1}^{raid\_size-1} xor^{target}_{tpt}(req\_size, k)$$

               $$cpu\_max\_write\_req^{target}_{tpt}/\mathrm{sec} =
               \frac{100}{cpu\_pdu\_work \cdot sent\_pdus\_write^{target}_{tpt}(req\_size) + cpu\_xor\_block \cdot xor^{target}_{tpt}(req\_size)}$$


• Network link: We will now calculate the cost of rdma operations in the target. We
  distinguish between partial stripes and complete stripes. We also distinguish between
  the two parity calculation methods that were discussed earlier.

        – Baseline system:
             ∗ Partial stripes:
                 · Incremental parity calculation: In order to write p data blocks, p + 1 data
                   blocks are read from the targets, and p + 1 data blocks are written to the
                   targets. Therefore:
                   $$rcvd\_data\_part\_inc\_write^{target}_{Baseline}(p) = \frac{(p + 1) \cdot block\_size}{raid\_size}$$
                   $$sent\_data\_part\_inc\_write^{target}_{Baseline}(p) = \frac{(p + 1) \cdot block\_size}{raid\_size}$$
                 · Batch parity calculation: In order to write p data blocks, the targets have
                   to write raid size − 1 − p data blocks to the raid controller, and read
                   p + 1 data blocks from the raid controller. Therefore:
                   $$rcvd\_data\_part\_batch\_write^{target}_{Baseline}(p) = \frac{(p + 1) \cdot block\_size}{raid\_size}$$
                   $$sent\_data\_part\_batch\_write^{target}_{Baseline}(p) = \frac{(raid\_size - 1 - p) \cdot block\_size}{raid\_size}$$
             ∗ Complete stripes: In order to write raid size − 1 data blocks, the targets
               have to read raid size blocks from the raid controller. Therefore:
               $$rcvd\_data\_comp\_write^{target}_{Baseline} = \frac{raid\_size \cdot block\_size}{raid\_size} = block\_size$$

          We will now calculate the total amount of data that is received by a target in a
          single request (same case structure as before):

          $$rcvd\_data\_write^{target}_{Baseline}(req\_size, k) =
          \begin{cases}
          \frac{blocks}{raid\_size - 1} \cdot rcvd\_data\_comp\_write^{target}_{Baseline}, & k = raid\_size - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot rcvd\_data\_comp\_write^{target}_{Baseline} + rcvd\_data\_part\_inc\_write^{target}_{Baseline}(k) + rcvd\_data\_part\_batch\_write^{target}_{Baseline}(raid\_size - 1 - k), & k < \frac{raid\_size}{2} - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot rcvd\_data\_comp\_write^{target}_{Baseline} + rcvd\_data\_part\_batch\_write^{target}_{Baseline}(k) + rcvd\_data\_part\_batch\_write^{target}_{Baseline}(raid\_size - 1 - k), & \frac{raid\_size}{2} \geq k \geq \frac{raid\_size}{2} - 1\\[1ex]
          (\frac{blocks}{raid\_size - 1} - 1) \cdot rcvd\_data\_comp\_write^{target}_{Baseline} + rcvd\_data\_part\_batch\_write^{target}_{Baseline}(k) + rcvd\_data\_part\_inc\_write^{target}_{Baseline}(raid\_size - 1 - k), & k > \frac{raid\_size}{2}
          \end{cases}$$

          Therefore:

          $$rcvd\_data\_write^{target}_{Baseline}(req\_size) =
          \frac{1}{raid\_size - 1} \sum_{k=1}^{raid\_size-1} rcvd\_data\_write^{target}_{Baseline}(req\_size, k)$$


          The total length of sent data is (complete stripes contribute no sent data):

          $$sent\_data\_write^{target}_{Baseline}(req\_size, k) =
          \begin{cases}
          0, & k = raid\_size - 1\\[1ex]
          sent\_data\_part\_inc\_write^{target}_{Baseline}(k) + sent\_data\_part\_batch\_write^{target}_{Baseline}(raid\_size - 1 - k), & k < \frac{raid\_size}{2} - 1\\[1ex]
          sent\_data\_part\_batch\_write^{target}_{Baseline}(k) + sent\_data\_part\_batch\_write^{target}_{Baseline}(raid\_size - 1 - k), & \frac{raid\_size}{2} \geq k \geq \frac{raid\_size}{2} - 1\\[1ex]
          sent\_data\_part\_batch\_write^{target}_{Baseline}(k) + sent\_data\_part\_inc\_write^{target}_{Baseline}(raid\_size - 1 - k), & k > \frac{raid\_size}{2}
          \end{cases}$$

          Therefore:

          $$sent\_data\_write^{target}_{Baseline}(req\_size) =
          \frac{1}{raid\_size - 1} \sum_{k=1}^{raid\_size-1} sent\_data\_write^{target}_{Baseline}(req\_size, k)$$

          $$net\_rcv\_max\_write\_req^{target}_{Baseline}/\mathrm{sec} =
          \frac{net\_bw}{rcvd\_data\_write^{target}_{Baseline}(req\_size)}$$

          $$net\_send\_max\_write\_req^{target}_{Baseline}/\mathrm{sec} =
          \frac{net\_bw}{sent\_data\_write^{target}_{Baseline}(req\_size)}$$




– tpt system:
    ∗ Partial stripes:

      · Incremental parity calculation: We first calculate the number of rdma
        operations between the targets. The height of the binary tree used for
        the parity calculation is log2 p. To simplify the calculation, we assume
        that p is even, so there are p/2 rdma operations in the first iteration
        and, in total, p − 1 rdma operations are needed to compute the parity
        block. One more rdma operation is required for the target that holds the
        parity block to read the recalculated parity result. The targets also
        have to read p data blocks from the host. Therefore:

        rcvd data part inc write^target_tpt(p) = (2 · p · block size) / raid size

        sent data part inc write^target_tpt(p) = (p · block size) / raid size

      · Batch parity calculation: In order to write p data blocks, the targets
        have to read p blocks from the host. Then, the first iteration of parity
        calculation involves raid size − 1 targets, so raid size − 2 rdma
        operations are required. One more rdma operation is required for the
        target that holds the parity block to read the recalculated parity
        result. Therefore:

        rcvd data part batch write^target_tpt(p) = ((p + raid size − 1) · block size) / raid size

        sent data part batch write^target_tpt(p) = ((raid size − 1) · block size) / raid size

  ∗ Complete stripes: The calculation is similar to that for partial stripes:

        rcvd data comp write^target_tpt = (2 · (raid size − 1) · block size) / raid size

        sent data comp write^target_tpt = ((raid size − 1) · block size) / raid size


We will now calculate the total amount of data that is sent/received by a target
in a single request:
rcvd data write^target_tpt(req size, k) =

    (blocks / (raid size − 1)) · rcvd data comp write^target_tpt,
        if k = raid size − 1   (no partial stripes)

    (blocks / (raid size − 1) − 1) · rcvd data comp write^target_tpt   (complete stripes)
      + rcvd data part inc write^target_tpt(k)   (1st stripe, incremental)
      + rcvd data part batch write^target_tpt(raid size − 1 − k)   (last stripe, batch),
        if k < raid size / 2 − 1

    (blocks / (raid size − 1) − 1) · rcvd data comp write^target_tpt   (complete stripes)
      + rcvd data part batch write^target_tpt(k)   (1st stripe, batch)
      + rcvd data part batch write^target_tpt(raid size − 1 − k)   (last stripe, batch),
        if raid size / 2 ≥ k ≥ raid size / 2 − 1

    (blocks / (raid size − 1) − 1) · rcvd data comp write^target_tpt   (complete stripes)
      + rcvd data part batch write^target_tpt(k)   (1st stripe, batch)
      + rcvd data part inc write^target_tpt(raid size − 1 − k)   (last stripe, incremental),
        if k > raid size / 2

Therefore:

rcvd data write^target_tpt(req size) = (1 / (raid size − 1)) · Σ_{k=1}^{raid size − 1} rcvd data write^target_tpt(req size, k)


The total length of sent data is:
sent data write^target_tpt(req size, k) =

    (blocks / (raid size − 1)) · sent data comp write^target_tpt,
        if k = raid size − 1   (no partial stripes)

    (blocks / (raid size − 1) − 1) · sent data comp write^target_tpt   (complete stripes)
      + sent data part inc write^target_tpt(k)   (1st stripe, incremental)
      + sent data part batch write^target_tpt(raid size − 1 − k)   (last stripe, batch),
        if k < raid size / 2 − 1

    (blocks / (raid size − 1) − 1) · sent data comp write^target_tpt   (complete stripes)
      + sent data part batch write^target_tpt(k)   (1st stripe, batch)
      + sent data part batch write^target_tpt(raid size − 1 − k)   (last stripe, batch),
        if raid size / 2 ≥ k ≥ raid size / 2 − 1

    (blocks / (raid size − 1) − 1) · sent data comp write^target_tpt   (complete stripes)
      + sent data part batch write^target_tpt(k)   (1st stripe, batch)
      + sent data part inc write^target_tpt(raid size − 1 − k)   (last stripe, incremental),
        if k > raid size / 2

Therefore:

sent data write^target_tpt(req size) = (1 / (raid size − 1)) · Σ_{k=1}^{raid size − 1} sent data write^target_tpt(req size, k)

net rcv max write req^target_tpt/sec = net bw / rcvd data write^target_tpt(req size)

net send max write req^target_tpt/sec = net bw / sent data write^target_tpt(req size)
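Putting the pieces together, the case analysis above amounts to choosing, for each of
the (at most two) partial stripes in a request, the parity method that matches its
size, and adding the contribution of the complete stripes. The following sketch shows
this for the receive side of a tpt target; the names are illustrative, it reuses the
helpers from the two previous sketches, and the sent side is obtained by substituting
the corresponding sent functions.

    def tpt_partial_rcvd(p, raid_size, block_size):
        # A partial stripe of p data blocks uses incremental parity when it is
        # small (p < raid_size/2 - 1) and batch parity otherwise, exactly as in
        # the case analysis above.
        if p < raid_size / 2 - 1:
            return tpt_part_inc_rcvd(p, raid_size, block_size)
        return tpt_part_batch_rcvd(p, raid_size, block_size)

    def tpt_rcvd_per_request(req_size, k, raid_size, block_size):
        # k = number of data blocks the request writes in its first stripe;
        # k = raid_size - 1 means the request is stripe-aligned. As in the
        # analysis, req_size is assumed to be a whole number of stripes.
        blocks = req_size / block_size        # data blocks in the request
        stripe = raid_size - 1                # data blocks per stripe
        if k == stripe:                       # no partial stripes
            return (blocks / stripe) * tpt_comp_rcvd(raid_size, block_size)
        return ((blocks / stripe - 1) * tpt_comp_rcvd(raid_size, block_size)
                + tpt_partial_rcvd(k, raid_size, block_size)            # 1st stripe
                + tpt_partial_rcvd(stripe - k, raid_size, block_size))  # last stripe

    # Example use: average over all offsets and derive the receive-side bound.
    # rcvd = average_over_offsets(
    #     lambda k: tpt_rcvd_per_request(req_size, k, raid_size, block_size), raid_size)
    # net_rcv_bound = max_requests_per_sec(net_bw, rcvd)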




• Disk operations: Both systems use this resource identically. We distinguish between
  partial stripes and complete stripes. We also distinguish between the two parity
  calculation methods that were discussed earlier:

  – Partial stripes:
      ∗ Incremental parity calculation: In order to write p data blocks, p + 1 blocks
        are read from the disks. Then, p+1 blocks are written to the disks. Therefore:

        disk write part inc(p) = 2 · (p + 1) · block size
      ∗ Batch parity calculation: In order to write p data blocks, raid size − 1 − p
        blocks are read from the disks. Then, p + 1 blocks are written to the disks.
        Therefore:

         disk write part batch(p) = raid size · block size
  – Complete stripes: No blocks are read from the disks. raid size blocks are written
    to the disks. Therefore:

    disk write comp = raid size · block size

We will now calculate the total amount of data that is read from and written to the
disks by a target in a single request:

disk write(req size, k) =

    (blocks / (raid size − 1)) · disk write comp,
        if k = raid size − 1   (no partial stripes)

    (blocks / (raid size − 1) − 1) · disk write comp   (complete stripes)
      + disk write part inc(k)   (1st stripe, incremental)
      + disk write part batch(raid size − 1 − k)   (last stripe, batch),
        if k < raid size / 2 − 1

    (blocks / (raid size − 1) − 1) · disk write comp   (complete stripes)
      + disk write part batch(k)   (1st stripe, batch)
      + disk write part batch(raid size − 1 − k)   (last stripe, batch),
        if raid size / 2 ≥ k ≥ raid size / 2 − 1

    (blocks / (raid size − 1) − 1) · disk write comp   (complete stripes)
      + disk write part batch(k)   (1st stripe, batch)
      + disk write part inc(raid size − 1 − k)   (last stripe, incremental),
        if k > raid size / 2

Therefore:

disk write(req size) = (1 / (raid size − 1)) · Σ_{k=1}^{raid size − 1} disk write(req size, k)
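The disk-traffic terms can be sketched in the same illustrative style; the per-request
total then follows the same case analysis and averaging over k as the network volumes
in the earlier sketches.

    def disk_part_inc(p, block_size):
        # read p old data blocks + old parity, then write p new blocks + new parity
        return 2 * (p + 1) * block_size

    def disk_part_batch(p, raid_size, block_size):
        # read the raid_size - 1 - p untouched data blocks,
        # then write p new blocks + new parity (p cancels out)
        return raid_size * block_size

    def disk_comp(raid_size, block_size):
        # full-stripe write: no reads, raid_size blocks written
        return raid_size * block_size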


• Memory:

     – Baseline system: Memory is used when performing the rdma operations and disk
       operations that were described above. Therefore:

       mem max write req^target_Baseline/sec =
           mem bw / (rcvd data write^target_Baseline(req size) + sent data write^target_Baseline(req size)
                     + disk write(req size))


     – tpt system: Memory is used when performing rdma operations, disk operations
       and xor operations. rdma operations and disk operations were described above.
       For xor operations, we distinguish between partial stripes and complete
       stripes. We also distinguish between the two parity calculation methods that
       were discussed earlier:

        ∗ Partial stripes:

            · Incremental parity calculation: In order to write p new data blocks, p
              xor operations between the old data blocks and the new data blocks are
              performed. Then, during the iterations to calculate the new parity
              block, p xor operations are performed. In each operation, 2 blocks are
              read and 1 block is written. Therefore:

              mem xor part inc^target_tpt(p) = (6 · p · block size) / raid size

            · Batch parity calculation: In order to write p new data blocks,
              raid size − 2 xor operations are performed during the iterations to
              calculate the new parity block. In each operation, 2 blocks are read
              and 1 block is written. Therefore:

              mem xor part batch^target_tpt(p) = (3 · (raid size − 2) · block size) / raid size

        ∗ Complete stripes: Similar to batch parity calculation:

            mem xor comp^target_tpt = (3 · (raid size − 2) · block size) / raid size


We now calculate the total amount of data that is accessed in memory during parity
calculations by the target in a single command:

mem xor^target_tpt(req size, k) =

    (blocks / (raid size − 1)) · mem xor comp^target_tpt,
        if k = raid size − 1   (no partial stripes)

    (blocks / (raid size − 1) − 1) · mem xor comp^target_tpt   (complete stripes)
      + mem xor part inc^target_tpt(k)   (1st stripe, incremental)
      + mem xor part batch^target_tpt(raid size − 1 − k)   (last stripe, batch),
        if k < raid size / 2 − 1

    (blocks / (raid size − 1) − 1) · mem xor comp^target_tpt   (complete stripes)
      + mem xor part batch^target_tpt(k)   (1st stripe, batch)
      + mem xor part batch^target_tpt(raid size − 1 − k)   (last stripe, batch),
        if raid size / 2 ≥ k ≥ raid size / 2 − 1

    (blocks / (raid size − 1) − 1) · mem xor comp^target_tpt   (complete stripes)
      + mem xor part batch^target_tpt(k)   (1st stripe, batch)
      + mem xor part inc^target_tpt(raid size − 1 − k)   (last stripe, incremental),
        if k > raid size / 2

Therefore:

mem xor^target_tpt(req size) = (1 / (raid size − 1)) · Σ_{k=1}^{raid size − 1} mem xor^target_tpt(req size, k)

mem max write req^target_tpt/sec =
    mem bw / (rcvd data write^target_tpt(req size) + sent data write^target_tpt(req size)
              + disk write(req size) + mem xor^target_tpt(req size))
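The xor-related memory traffic can be sketched in the same illustrative style; each
xor reads two blocks and writes one. The memory bound then divides mem bw by the sum
of the averaged receive, send, disk and xor volumes, as in the formulas above.

    def mem_xor_part_inc(p, raid_size, block_size):
        # 2p xor operations (p against the old data, p in the reduction),
        # 3 block accesses each, spread over raid_size targets
        return 6 * p * block_size / raid_size

    def mem_xor_part_batch(p, raid_size, block_size):
        # raid_size - 2 xor operations, 3 block accesses each
        return 3 * (raid_size - 2) * block_size / raid_size

    def mem_xor_comp(raid_size, block_size):
        # complete stripes behave like the batch case
        return 3 * (raid_size - 2) * block_size / raid_size

    # mem_bound = max_requests_per_sec(mem_bw, rcvd + sent + disk + xor)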



Therefore, the maximum throughput of the Baseline targets is:

target max write thr_Baseline =
    min(cpu max write req^target_Baseline/sec, net rcv max write req^target_Baseline/sec,
        net send max write req^target_Baseline/sec, mem max write req^target_Baseline/sec)

The maximum throughput of the tpt targets is:

target max write thr_tpt =
    min(cpu max write req^target_tpt/sec, net rcv max write req^target_tpt/sec,
        net send max write req^target_tpt/sec, mem max write req^target_tpt/sec)

The maximum throughput of the Baseline system is:

max write thr_Baseline =
    min(host max write thr, ctrl max write thr_Baseline, target max write thr_Baseline)

The maximum throughput of the tpt system is:

max write thr_tpt =
    min(host max write thr, ctrl max write thr_tpt, target max write thr_tpt)
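The min() structure of these bounds is straightforward to evaluate. The following
sketch (illustrative names only) combines the per-resource bounds into a per-stage
bound and the per-stage bounds into the system-level write throughput: each stage is
limited by its slowest resource, and the system by its slowest stage.

    def target_max_write_thr(cpu_bound, net_rcv_bound, net_send_bound, mem_bound):
        # a target is limited by its slowest resource
        return min(cpu_bound, net_rcv_bound, net_send_bound, mem_bound)

    def system_max_write_thr(host_thr, ctrl_thr, target_thr):
        # the system is limited by its slowest stage
        return min(host_thr, ctrl_thr, target_thr)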



