

									Chapter 7 Storage Systems

•   Introduction
•   Types of Storage Devices
•   Buses – Connecting I/O Devices to CPU/Memory
•   Reliability, Availability, and Dependability
•   RAID: Redundant Arrays of Inexpensive Disks
•   Errors and Failures in Real Systems
•   I/O Performance Measures
•   A Little Queuing Theory
•   Benchmarks of Storage Performance and Availability
•   Crosscutting Issues
•   Designing an I/O System

7.1 Introduction

           Motivation: Who Cares About I/O?
• CPU performance: 2x every 18 months
• I/O performance limited by mechanical delays (disk I/O)
    – < 10% improvement per year (I/Os per sec or MB per sec)
• Amdahl’s Law: system speed-up limited by the slowest part! (see the sketch below)
    – 10% I/O & 10x CPU → 5x performance (lose 50%)
    – 10% I/O & 100x CPU → 10x performance (lose 90%)
•   I/O bottleneck:
    – Diminishing fraction of time in CPU
    – Diminishing value of faster CPUs
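
To see where the bullet-point numbers come from, here is a minimal Python sketch of Amdahl's Law (the slide rounds the exact results of 5.3x and 9.2x):

```python
# Amdahl's Law: speed up only the CPU portion; the I/O fraction is unchanged.
def overall_speedup(fraction_io: float, cpu_speedup: float) -> float:
    return 1.0 / (fraction_io + (1.0 - fraction_io) / cpu_speedup)

for s in (10, 100):
    print(f"10% I/O, {s}x CPU -> {overall_speedup(0.10, s):.1f}x overall")
# 10% I/O,  10x CPU ->  5.3x overall (about half of the 10x is lost)
# 10% I/O, 100x CPU ->  9.2x overall (about 90% of the 100x is lost)
```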

          Position of I/O in Computer Architecture – Past
• An orphan in the architecture domain
• I/O meant the non-processor and memory stuff
   – Disk, tape, LAN, WAN, etc.
   – Performance was not a major concern
• Devices characterized as:
   – Extraneous, non-priority, infrequently used, slow
• Exception is swap area of disk
   – Part of the memory hierarchy
   – Hence part of system performance but you’re hosed if you use it

           Position of I/O in Computer Architecture – Now
• Trends – I/O is the bottleneck
   – Communication is frequent
      • Voice response & transaction systems, real-time video
      • Multimedia expectations
   – Even standard networks come in gigabit/sec flavors
   – For multi-computers
• Result
   – Significant focus on system bus performance
      • Common bridge to the memory system and the I/O systems
      • Critical performance component for the SMP server platforms

          System vs. CPU Performance
• Care about speed at which user jobs get done
   – Throughput - how many jobs/time (system view)
   – Latency - how quick for a single job (user view)
   – Response time – time between when a command is issued and
     results appear (user view)
• CPU performance main factor when:
    – Job mix fits in memory → there are very few page faults
• I/O performance main factor when:
   – The job is too big for the memory - paging is dominant
   – When the job reads/writes/creates a lot of unexpected files
      • OLTP – Decision support -- Database
   – And then there is graphics & specialty I/O devices

            System Performance
• Depends on many factors in the worst case
   –   CPU
   –   Compiler
   –   Operating System
   –   Cache
   –   Main Memory
   –   Memory-IO bus
   –   I/O controller or channel
   –   I/O drivers and interrupt handlers
   –   I/O devices: there are many types
         • Level of autonomous behavior
         • Amount of internal buffer capacity
         • Device specific parameters for latency and throughput

I/O Systems

[Figure: a Memory – I/O bus connects main memory and several I/O
controllers (disks, graphics, network). The memory bus and the I/O bus
may be the same or different.]
          Keys to a Balanced System

• It’s all about overlap - I/O vs CPU
    – Time_workload = Time_CPU + Time_I/O – Time_overlap
• Consider the benefit of just speeding up one
   – Amdahl’s Law (see P4 as well)
• Latency vs. Throughput

            I/O System Design
• Depends on type of I/O device
   – Size, bandwidth, and type of transaction
   – Frequency of transaction
   – Defer vs. do now
• Appropriate memory bus utilization
• What should the controller do
   –   Programmed I/O
   –   Interrupt vs. polled
   –   Priority or not
   –   DMA
   –   Buffering issues - what happens on over-run
   –   Protection
   –   Validation

           Types of I/O Devices

• Behavior
   –   Read, Write, Both
   –   Once, multiple
   –   Size of average transaction
   –   Bandwidth
   –   Latency
• Partner - the speed of the slowest link theory
   – Human operated (interactive or not)
   – Machine operated (local or remote)

Some I/O Devices

          Is I/O Important?

• Depends on your application
   – Business - disks for file system I/O
   – Graphics - graphics cards or special co-processors
   – Parallelism - the communications fabric
• Our focus = mainline uniprocessing
   – Storage subsystems (Chapter 7)
   – Networks (Chapter 8)
• Noteworthy Point
   – The traditional orphan
   – But now often viewed more as a front line topic

7.2 Types of Storage Devices

          Magnetic Disks

• 2 important Roles
   – Long term, non-volatile storage – file system and OS
   – Lowest level of the memory hierarchy
      • Most of the virtual memory is physically resident on the disk
• Long viewed as a bottleneck
    – Mechanical system → slow
    – Hence they seem to be an easy target for improved technology
    – Disk improvements w.r.t. density have done better than Moore’s Law

            Disks are organized into platters, tracks, and sectors

[Figure: disk geometry – platters (1–12, with 2 recording sides each),
tracks (5,000–30,000 per surface), and sectors (100–500 per track).]

A sector is the smallest
unit that can be read or written

          Physical Organization
• Platters – one or many
• Density - fixed or variable
   – Do all tracks have the same number of sectors?
• Organization - sectors, cylinders, and tracks
   – Actuators - 1 or more
   – Heads - 1 per track or 1 per actuator
   – Access - seek time vs. rotational latency
      • Seek related to distance but not linearly
       • Typical rotation: 3,600 to 15,000 RPM
• Diameter – 1.0 to 3.5 inches

          Typical Physical Organization

• Multiple platters
   – Metal disks covered with magnetic recording material on both sides
• Single actuator (since they are expensive)
   – Single R/W head per arm
   – One arm per surface
   – All heads therefore over same cylinder
• Fixed sector size
• Variable density encoding
• Disk controller – usually built in processor + buffering

Characteristics of Three Magnetic Disks of 2000

          Anatomy of a Read Access

• Steps
  –   Memory mapped I/O over bus to controller
  –   Controller starts access
  –   Seek + rotational latency wait
  –   Sector is read and buffered (validity checked)
  –   Controller says ready or DMA’s to main memory and then says ready

           Access Time

• Access Time
    – Seek time: time to move the arm over the proper track
        • Very non-linear: accelerate and decelerate times complicate
    – Rotational latency (delay): time for the requested sector to rotate under
      the head (on average, half a rotation: 0.5 / (RPM/60) seconds)
    – Transfer time: time to transfer a block of bits (typically a sector)
      under the read-write head
    – Controller overhead: the overhead the controller imposes in
      performing an I/O access
    – Queuing delay: time spent waiting for a disk to become free

Average access time = Average seek time + Average rotational delay +
Transfer time + Controller overhead + Queuing delay

          Access Time Example

• Assumption: average seek time – 5ms; transfer rate –
  40MB/sec; 10,000 RPM; controller overhead – 0.1ms; no
  queuing delay
• What is the average time to r/w a 512-byte sector?
• Answer
  Average access time = 5 ms + 0.5 rotation / 10,000 RPM + 0.5 KB / (40.0 MB/sec)
                        + 0.1 ms = 5.0 + 3.0 + 0.013 + 0.1 = 8.1 ms
  Average rotational delay = 0.5 / (10,000 / 60 RPS) = 0.003 sec = 3.0 ms
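
The same arithmetic as a short Python sketch (values from the example above):

```python
# Average access time = seek + rotational delay + transfer + controller overhead.
seek_ms = 5.0
rpm = 10_000
rotational_ms = 0.5 * 60 / rpm * 1000        # half a rotation: 3.0 ms
transfer_ms = 0.5 / (40.0 * 1000) * 1000     # 0.5 KB at 40 MB/s: ~0.013 ms
overhead_ms = 0.1                            # controller overhead; no queuing delay

total_ms = seek_ms + rotational_ms + transfer_ms + overhead_ms
print(f"average access time = {total_ms:.1f} ms")   # 8.1 ms
```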

          Cost vs. Performance

• Large-diameter drives have more data capacity over which to amortize the
  cost of the electronics → lowest cost per GB
• Higher sales volume → lower manufacturing cost
• The 3.5-inch drive, the largest surviving form factor in 2001, also has
  the highest sales volume, so it unquestionably has the best
  price per GB

          Future of Magnetic Disks

• Areal density: bits/unit area is common improvement metric
• Trends
   – Until 1988: 29% improvement per year
   – 1988 – 1996: 60% per year
   – 1997 – 2001: 100% per year
• 2001
   – 20 billion bits per square inch
    – 60 billion bits per square inch demonstrated in labs

  Areal density = (Tracks / Inch on a disk surface) * (Bits / Inch on a track)

Disk Price Trends

Disk Price Trends – Dollars per MB

Cost vs. Access Time for SRAM, DRAM, and Magnetic Disk

             Disk Alternatives

• Optical Disks
    –   Optical compact disks (CD) – 0.65GB
    –   Digital video discs, digital versatile disks (DVD) – 4.7GB * 2 sides
    –   Rewritable CD (CD-RW) and write-once CD (CD-R)
    –   Rewritable DVD (DVD-RAM) and write-once DVD (DVD-R)
•   Robotic Tape Storage
•   Optical Juke Boxes
•   Tapes – DAT, DLT
•   Flash memory
    – Good for embedded systems
    – Nonvolatile storage and rewritable ROM

7.3 Bus – Connecting I/O
Devices to CPU/Memory

          I/O Connection Issues
                           Connecting the CPU to the I/O device world

• Shared communication link between subsystems
   – Typical choice is a bus
• Advantages
   – Shares a common set of wires and protocols  low cost
   – Often based on a standard – PCI, SCSI, etc. → portability and versatility
• Disadvantages
   – Poor performance
   – Multiple devices imply arbitration and therefore contention
   – Can be a bottleneck

            I/O Connection Issues – Multiple Buses

• I/O bus
   – Lengthy
   – Many types of connected devices
   – Wide range in device bandwidth
   – Follows a bus standard
   – Accepts devices varying in latency and bandwidth
• CPU-memory bus
   – Short
   – High speed
   – Matched to the memory system to maximize CPU-memory bandwidth
   – Knows all types of devices that must connect together

Typical Synchronous Bus Read Transaction

           Bus Design Decisions

• Other things to standardize as well
   –   Connectors
   –   Voltage and current levels
   –   Physical encoding of control signals
   –   Protocols for good citizenship

          Bus Design Decisions (Cont.)

• Bus master: a device that can initiate a R/W transaction
   – Multiple masters: multiple CPUs and I/O devices can initiate bus transactions
   – Multiple bus masters need arbitration (fixed priority or random)
• Split transaction for multiple masters
   – Use packets for the full transaction (does not hold the bus)
   – A read transaction is broken into read-request and memory-reply
   – Make the bus available for other masters while the data is
     read/written from/to the specified address
   – Transactions must be tagged
   – Higher bandwidth, but also higher latency

Split Transaction Bus

          Bus Design Decisions (Cont.)

• Clocking: Synchronous vs. Asynchronous
   – Synchronous
      • Include a clock in the control lines, and a fixed protocol for
        address and data relative to the clock
      • Fast and inexpensive (little or no logic to determine what's next)
      • Everything on the bus must run at the same clock rate
      • Short length (due to clock skew)
      • CPU-memory buses
   – Asynchronous
      • Easier to connect a wide variety of devices, and lengthen the bus
      • Scale better with technological changes
      • I/O buses

Bus Standards

• The Good
   – Let the computer and I/O-device designers work independently
   – Provides a path for second party (e.g. cheaper) competition
• The Bad
   – Become major performance anchors
   – Inhibit change
• How to create a standard
   – Bottom-up
      • A company tries to get the standards committee to approve its latest
         philosophy in hopes that it will get the jump on the others (e.g. S bus,
         PC-AT bus, ...)
      • De facto standards
   – Top-down
      • Design by committee (PCI, SCSI, ...)

Some Sample I/O Buses

Some Sample Serial I/O Buses
   Often used in embedded computers

CPU-Memory Buses Found in 2001 Servers

Crossbar Switch

            Connecting the I/O Bus
• To main memory
    – The I/O bus and the CPU-memory bus may be the same
        • I/O commands on the bus could interfere with the CPU's memory accesses
    – Since cache misses are rare, I/O traffic does not tend to stall the CPU
    – The problem is lack of coherency
    – This is the case we consider here
• To cache
• Access
    – Memory-mapped I/O or distinct instruction (I/O opcodes)
• Interrupt vs. Polling
• DMA or not
    – Autonomous control allows overlap and latency hiding
    – However there is a cost impact

A typical interface of I/O devices and
an I/O bus to the CPU-memory bus

            Processor Interface Issues
• Processor interface
    – Interrupts
    – Memory mapped I/O
•   I/O Control Structures
    –   Polling
    –   Interrupts
    –   DMA
    –   I/O Controllers
    –   I/O Processors
• Capacity, Access Time, Bandwidth
• Interconnections
    – Buses

                  I/O Controller

[Figure: the CPU sends commands and I/O addresses to the controller and
receives status (ready, done, error, …) and interrupts back.]

          Memory Mapped I/O
                   Single Memory & I/O Bus
                   No Separate I/O Instructions

• Some portions of the memory address space are assigned to I/O devices;
  reads/writes to these addresses cause data transfer
[Figure: CPU with L1/L2 caches and memory on the memory bus; RAM and
device interfaces with peripherals on the I/O bus; the two buses are
joined by a bus adaptor.]

            Programmed I/O
• Polling
• The I/O module performs the action
  on behalf of the processor
• But the I/O module does not interrupt
  the CPU when I/O is done
• The processor is kept busy checking
  the status of the I/O module
   – not an efficient way to use the
     CPU unless the device is very fast
• Byte by byte…

           Interrupt-Driven I/O
• The processor is interrupted when the
  I/O module is ready to exchange data
• The processor is free to do other work
• No needless waiting
• Still consumes a lot of processor
  time, because every word read or
  written passes through the
  processor and requires an interrupt
• One interrupt per byte

              Direct Memory Access (DMA)
•   CPU issues request to a DMA
    module (separate module or
    incorporated into I/O module)
•   DMA module transfers a block of
    data directly to or from memory
    (without going through CPU)
•   An interrupt is sent when the task is complete
     – Only one interrupt per block, rather
       than one interrupt per byte
•   The CPU is only involved at the
    beginning and end of the transfer
•   The CPU is free to perform other
    tasks during data transfer
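
A back-of-the-envelope sketch of why this matters; the per-interrupt cycle cost below is an assumed, illustrative number, not something from the text:

```python
# CPU cost of moving one 4 KB block: interrupt-driven I/O takes one
# interrupt per byte; DMA takes one interrupt per block.
block_bytes = 4096
cycles_per_interrupt = 2_000        # assumption, for illustration only

interrupt_driven = block_bytes * cycles_per_interrupt   # 8,192,000 cycles
dma = 1 * cycles_per_interrupt                          # 2,000 cycles
print(f"interrupt-per-byte: {interrupt_driven:,} cycles; DMA: {dma:,} cycles")
```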

                   Input/Output Processors

[Figure: the CPU, IOP, and main memory share the memory bus; the IOP
drives a separate I/O bus to the target devices D1, D2, ...
(1) The CPU issues an instruction to the IOP: OP, device, address.
(2) The IOP looks in memory for commands: OP, Addr, Cnt, Other – what to
    do, where to put the data, how much, and special requests.
(3) Device to/from memory transfers are controlled by the IOP directly;
    the IOP steals memory cycles.
(4) The IOP interrupts the CPU when done.]

7.4 Reliability, Availability, and Dependability

            Dependability, Faults, Errors,
            and Failures
• Computer system dependability is the quality of delivered service such
  that reliance can justifiably be placed on this service. The service
  delivered by a system is its observed actual behavior as perceived by
  other system(s) interacting with this system's users. Each module also
  has an ideal specified behavior, where a service specification is an
  agreed description of the expected behavior. A system failure occurs
  when the actual behavior deviates from the specified behavior. The
  failure occurred because of an error, a defect in that module. The cause
  of an error is a fault. When a fault occurs, it creates a latent error, which
  becomes effective when it is activated; when the error actually affects
  the delivered service, a failure occurs. The time between the occurrence
  of an error and the resulting failure is the error latency. Thus, an error is
  the manifestation in the system of a fault, and a failure is the
  manifestation on the service of an error.

          Faults, Errors, and Failures

• A fault creates one or more latent errors
• The properties of errors are
   – A latent error becomes effective once activated
   – An error may cycle between its latent and effective states
   – An effective error often propagates from one component to another,
     thereby creating new errors.
• A component failure occurs when the error affects the
  delivered service.
• These properties are recursive and apply to any component

          Example of Faults, Errors,
          and Failures
• Example 1
  –   A programming mistake: fault
  –   The consequence is an error or latent error
  –   Upon activation, the error becomes effective
  –   When this effective error produces erroneous data that affect the
      delivered service, a failure occurs
• Example 2
  –   An alpha particle hitting a DRAM → fault
  –   It changes the memory → latent error
  –   The affected memory word is read → effective error
  –   The effective error produces erroneous data that affect the delivered
      service → failure (if ECC had corrected the error, a failure would not occur)

          Service Accomplishment and Service Interruption
• Service accomplishment: service is delivered as specified
• Service interruption: delivered service is different from the
  specified service

• Transitions between these two states are caused by failures
  or restorations

           Measure Reliability and Availability
• Reliability: measure of the continuous service
  accomplishment from a reference initial instant
   – Mean time to failure (MTTF)
   – The reciprocal of MTTF is a rate of failures
   – Service interruption is measured as mean time to repair (MTTR)
• Availability: measure of the service accomplishment w.r.t the
  alternation between the above-mentioned two states
   – Measured as: MTTF/(MTTF + MTTR)
    – Mean time between failures (MTBF) = MTTF + MTTR

• A disk subsystem
   –   10 disks, each rated at 1,000,000-hour MTTF
   –   1 SCSI controller, 500,000-hour MTTF
   –   1 power supply, 200,000-hour MTTF
   –   1 fan, 200,000-hour MTTF
   –   1 SCSI cable, 1,000,000-hour MTTF
• Component lifetimes are exponentially distributed (a component’s age does
  not affect its probability of failure), and failures are independent

  Failure rate = 10 * 1/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
               = 23 / 1,000,000 failures per hour
  MTTF = 1 / Failure rate = 1,000,000 / 23 ≈ 43,500 hours (≈ 5 years)
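
As a sketch, the same computation in Python; with exponential lifetimes and independent failures, component failure rates simply add:

```python
# System failure rate = sum of component failure rates; MTTF = 1 / rate.
component_mttf_hours = [1_000_000] * 10 + [   # 10 disks
    500_000,                                  # SCSI controller
    200_000,                                  # power supply
    200_000,                                  # fan
    1_000_000,                                # SCSI cable
]
failure_rate = sum(1.0 / m for m in component_mttf_hours)  # 23 per million hours
system_mttf = 1.0 / failure_rate
print(f"system MTTF = {system_mttf:,.0f} hours (about 5 years)")   # ~43,478
```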
          Causes of Faults

• Hardware faults: devices that fail
• Design faults: faults in software (usually) and hardware
  design (occasionally)
• Operation faults: mistakes by operations and maintenance
• Environmental faults: fire, flood, earthquake, power failure,
  and sabotage

         Classification of Faults

• Transient faults exist for a limited time and are not recurring
• Intermittent faults cause a system to oscillate between faulty
  and fault-free operation
• Permanent faults do not correct themselves with the passing
  of time

          Reliability Improvements

• Fault avoidance: how to prevent, by construction, the occurrence of faults
• Fault tolerance: how to provide, by redundancy, service
  complying with the service specification in spite of faults
  having occurred or that are occurring
• Error removal: how to minimize, by verification, the
  presence of latent errors
• Error forecasting: how to estimate, by evaluation, the
  presence, creation, and consequences of errors

7.5 RAID: Redundant Arrays of
      Inexpensive Disks

          3 Important Aspects of File Systems
• Reliability – is anything broken?
   – Redundancy is the main technique for increasing reliability
• Availability – is the system still available to the user?
   – When a single point of failure occurs, is the rest of the system still usable?
   – ECC and various correction schemes help (but cannot handle every failure)
• Data Integrity
   – You must know exactly what is lost when something goes wrong

          Disk Arrays

• Multiple arms improve throughput, but do not necessarily
  improve latency
• Striping
   – Spreading data over multiple disks
• Reliability
   – General metric is N devices have 1/N reliability
      • Rule of thumb: MTTF of a disk is about 5 years
   – Hence need to add redundant disks to compensate
      • MTTR ::= mean time to repair (or replace) (hours for disks)
      • If MTTR is small then the array’s MTTF can be pushed out
        significantly with a fairly small redundancy factor

          Data Striping

• Bit-level striping: split the bits of each byte across multiple disks
   – The number of disks can be a multiple of 8, or a divisor of 8
• Block-level striping: blocks of a file are striped across multiple
  disks; with n disks, block i goes to disk (i mod n) + 1 (see the sketch below)
• Every disk participates in every access
   – Number of I/Os per second is the same as for a single disk
   – Amount of data per second is improved
• Provides high data-transfer rates, but does not improve reliability
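
A minimal sketch of the block-level mapping, using the (i mod n) + 1 numbering from the bullet above:

```python
# Block-level striping: with n disks, block i lands on disk (i mod n) + 1.
def disk_for_block(i: int, n_disks: int) -> int:
    return (i % n_disks) + 1

n = 4
print([disk_for_block(i, n) for i in range(8)])
# [1, 2, 3, 4, 1, 2, 3, 4] -- a full stripe every n consecutive blocks
```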

          Redundant Arrays of Disks

• Files are "striped" across multiple disks
• Availability is improved by adding redundant disks
   – If a single disk fails, the lost information can be reconstructed from
     redundant information
   – Capacity penalty to store redundant information
   – Bandwidth penalty to update
   – Redundant Arrays of Inexpensive Disks
   – Redundant Arrays of Independent Disks

RAID Levels, Reliability, and Overhead

            RAID Levels 0 - 1

• RAID 0 – No redundancy (Just block striping)
   – Cheap but unable to withstand even a single failure
• RAID 1 – Mirroring
   –   Each disk is fully duplicated onto its "shadow"
   –   Files written to both, if one fails flag it and get data from the mirror
   –   Reads may be optimized – use the disk delivering the data first
   –   Bandwidth sacrifice on write: Logical write = two physical writes
   –   Most expensive solution: 100% capacity overhead
   –   Targeted for high-I/O-rate, high-availability environments
• RAID 0+1 – stripe first, then mirror the stripe
• RAID 1+0 – mirror first, then stripe the mirror

            RAID Levels 2 & 3
• RAID 2 – Memory style ECC
    – Cuts down number of additional disks
    – Actual number of redundant disks will depend on correction model
    – RAID 2 is not used in practice
• RAID 3 – Bit-interleaved parity
    – Reduce the cost of higher availability to 1/N (N = # of disks)
    – Use one additional redundant disk to hold parity information
    – Bit interleaving allows corrupted data to be reconstructed
    – Interesting trade off between increased time to recover from a failure and
      cost reduction due to decreased redundancy
    – Parity = sum of all corresponding disk blocks (modulo 2)
        • Hence all disks must be accessed on a write – potential bottleneck
    – Targeted for high bandwidth applications: Scientific, Image Processing

        RAID Level 3: Parity Disk

[Figure: a logical record (11001101) is bit-striped across four data
disks; a fifth disk P stores the parity (modulo-2 sum) of each bit row.]

  25% capacity cost for parity in this configuration (1/N)
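
Since parity is a sum modulo 2, it is just a bitwise XOR, and any one lost disk can be rebuilt from the survivors. A small sketch with made-up block values:

```python
from functools import reduce

disks = [0b1001, 0b1110, 0b1010, 0b0110]      # four data blocks (hypothetical)
parity = reduce(lambda a, b: a ^ b, disks)    # parity disk = XOR of all blocks

failed = 2                                    # pretend disk 2 is lost
survivors = [d for i, d in enumerate(disks) if i != failed]
rebuilt = reduce(lambda a, b: a ^ b, survivors, parity)
assert rebuilt == disks[failed]               # contents recovered exactly
```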

            RAID Levels 4 & 5 & 6
• RAID 4 – Block interleaved parity
   – Similar idea as RAID 3 but sum is on a per block basis
   – Hence only the parity disk and the target disk need be accessed
   – Problem still with concurrent writes since parity disk bottlenecks
• RAID 5 – Block interleaved distributed parity
   –   Parity blocks are interleaved and distributed on all disks
   –   Hence parity blocks no longer reside on same disk
   –   The probability of write collisions on a single drive is reduced
   –   Hence higher performance in the consecutive write situation
• RAID 6
   – Similar to RAID 5, but stores extra redundant information to guard
     against multiple disk failures

Raid 4 & 5 Illustration

[Figure: block layout of RAID 4 (dedicated parity disk) vs. RAID 5
(parity distributed across all disks).]
         Targeted for mixed applications
         A logical write becomes four physical I/Os

           Small Write Update on RAID

RAID-5: Small Write Algorithm
           1 Logical Write = 2 Physical Reads + 2 Physical Writes

[Figure: updating D0 to D0' in the stripe D0 D1 D2 D3 P:
(1) read the old data D0; (2) read the old parity P;
XOR the old data with the new data, then XOR the result with the old
parity to produce the new parity P';
(3) write the new data D0'; (4) write the new parity P'.]
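
In code, the trick is that XOR lets the old data cancel out of the old parity, so the other data disks never need to be read. A sketch with hypothetical block values:

```python
# RAID 5 small write: new_parity = old_parity XOR old_data XOR new_data.
old_d0, d1, d2, d3 = 0b0011, 0b0101, 0b1111, 0b0001   # hypothetical blocks
old_p = old_d0 ^ d1 ^ d2 ^ d3                         # parity before the write

new_d0 = 0b1100                                # (1) read old_d0, (2) read old_p
new_p = old_p ^ old_d0 ^ new_d0                # (3) write new_d0, (4) write new_p

assert new_p == new_d0 ^ d1 ^ d2 ^ d3   # parity still covers the whole stripe
```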

7.6 Errors and Failures in Real Systems


•   Berkeley’s Tertiary Disk
•   Tandem
•   VAX
•   FCC

Berkeley’s Tertiary Disk
                   18 months of operation

                   The SCSI backplane, SCSI cables, and Ethernet cables
                   were no more reliable than the data disks
7.7 I/O Performance Measures

          I/O Performance Measures

• Some similarities with CPU performance measures
   – Bandwidth - 100% utilization is maximum throughput
   – Latency - often called response time in the I/O world
• Some unique
   – Diversity - what types can be connected to the system
   – Capacity - how many and how much storage on each unit
• Usual relationship between Bandwidth & Latency

          Latency VS. Throughput

• Response time (latency): the time a task takes from the moment it is
  placed in the buffer until the server finishes the task
• Throughput: the average number of tasks completed by the
  server over a time period
• Knee of the curve (latency vs. throughput): the region where a little more
  throughput results in much longer response time, or a little
  shorter response time results in much lower throughput

[Figure: Proc → queue → Device.]
                            Response time = Queue time + Device service time

           Latency vs. Throughput

[Figure: response time vs. percent of maximum throughput, showing the
knee of the curve.]
          Transaction Model

• In an interactive environment, faster response time is important
• Impact of inherently long latency
• Transaction time: sum of 3 components
   – Entry time - time it takes user (usually human) to enter command
   – System response time - command entry to response out
   – Think time - user reaction time between response and next entry

The Impact of Reducing
Response Time

          Transaction Time Oddity

• As system response time goes down
   – Think time goes down even more
• Could conclude
   – That system performance magnifies human talent
   – OR conclude that with a fast system less thinking is necessary
   – OR conclude that with a fast system less thinking is done

7.8 A Little Queuing Theory


• Helps calculate response time and throughput
• More interested in the long-term, steady state than in startup
   – No. of tasks entering the system = No. of tasks leaving the system
• Little’s Law:
   – Mean number of tasks in system = arrival rate x mean response time
• Applies to any system in equilibrium, as long as nothing in the
  black box is creating or destroying tasks

[Figure: arrivals enter a black box (queue + server: Proc → IOC → Device)
and leave as departures.]

           Little's Law

• Mean no. of tasks in system = arrival rate * mean response time
• We observe a system for Time_observe
• No. of tasks completed during Time_observe is Number_tasks
• Sum of the times each task spends in the system: Time_accumulated

  Mean number of tasks in system = Time_accumulated / Time_observe
  Mean response time = Time_accumulated / Number_tasks
  Arrival rate = Number_tasks / Time_observe

  Time_accumulated / Time_observe
      = (Time_accumulated / Number_tasks) * (Number_tasks / Time_observe)
  i.e., mean number of tasks in system = mean response time * arrival rate
             Queuing Theory Notation
• Queuing models assume a state of equilibrium: input rate = output rate
• Notation:
    – Time_server – average time to service a task
        • Service rate – 1/Time_server
    – Time_queue – average time per task in the queue
    – Time_system – response time, the average time per task in the system
        • Time_system = Time_server + Time_queue
    – Arrival rate – average number of arriving tasks per second
    – Length_server – average number of tasks in service
    – Length_queue – average number of tasks in the queue
    – Length_system – average number of tasks in the system
        • Length_system = Length_server + Length_queue
    – Server utilization = Arrival rate / Service rate (0 – 1, in equilibrium)
• Little’s Law → Length_system = Arrival rate * Time_system

•   An I/O system with a single disk
•   10 I/O requests per second, average time to service = 50 ms
•   Arrival rate = 10 IOPS; Service rate = 1/50 ms = 20 IOPS
•   Server utilization = 10/20 = 0.5

•   Length_queue = Arrival rate * Time_queue
•   Length_server = Arrival rate * Time_server
•   Average time to satisfy a disk request = 50 ms, Arrival rate = 200 IOPS
•   Length_server = Arrival rate * Time_server = 200 * 0.05 = 10
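
The same two calculations as a sketch:

```python
# Single-disk case: 10 I/Os per second, 50 ms average service time.
arrival_rate = 10                     # tasks per second
time_server = 0.050                   # seconds
service_rate = 1 / time_server        # 20 IOPS
utilization = arrival_rate / service_rate
print(f"utilization = {utilization}")            # 0.5

# Second case from the slide: 200 IOPS arriving, same 50 ms service time.
length_server = 200 * time_server
print(f"Length_server = {length_server}")        # 10 tasks in service
```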

            Response Time

[Figure: Proc → queue → server (IOC + Device).]

• Service-time completions vs. waiting time for a busy server: a randomly
  arriving event joins a queue of arbitrary length when the server is busy,
  otherwise it is serviced immediately (assume unlimited-length queues)
• A single-server queue: the combination of a servicing facility that
  accommodates 1 task at a time (server) + a waiting area (queue); together
  they are called a system
• Time_queue (suppose a FIFO queue)
   – Time_queue = Length_queue * Time_server + M
   – M = mean time to complete service of the current task when a new task
     arrives, if the server is busy
       • A new task can arrive at any instant
       • Use the distribution of a random variable: histogram? curve?
       • M is also called the Average Residual Service Time (ARST)

            Response Time (Cont.)
• The server spends a variable amount of time with tasks
   – Weighted mean m1 = (f1 x T1 + f2 x T2 +...+ fn x Tn)/F  (F = f1 + f2 +...)
   – variance = (f1 x T1² + f2 x T2² +...+ fn x Tn²)/F – m1²
      • Must keep track of units of measure (100 ms² vs. 0.1 s²)
   – Squared coefficient of variance: C = variance/m1²
      • Unitless measure
• Three distributions
   – Exponential distribution, C = 1: most tasks short relative to average, a
     few long; 90% < 2.3 x average, 63% < average
   – Hypoexponential distribution, C < 1: most close to average;
     C = 0.5 → 90% < 2.0 x average, only 57% < average
   – Hyperexponential distribution, C > 1: further from average;
     C = 2.0 → 90% < 2.8 x average, 69% < average
• ARST = 0.5 x Weighted mean time x (1 + C)

          Characteristics of the Three Distributions

Memoryless: C does not vary over time and does not consider the past history
of events


           Deriving Time_queue

• Derive Time_queue in terms of Time_server, server utilization, and C
   – Time_queue = Length_queue * Time_server + ARST * server utilization
   – Time_queue = (arrival rate * Time_queue) * Time_server +
                  (0.5 * Time_server * (1 + C)) * server utilization
   – Time_queue = Time_queue * server utilization +
                  (0.5 * Time_server * (1 + C)) * server utilization
   – Time_queue = Time_server * (1 + C) * server utilization /
                  (2 * (1 – server utilization))
• For the exponential distribution, C = 1.0 →
   – Time_queue = Time_server * server utilization / (1 – server utilization)
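
The closed form fits in a one-line helper; a sketch, where C = 1 recovers the simpler exponential-service case:

```python
# M/G/1 mean queuing delay:
# Time_queue = Time_server * (1 + C) * utilization / (2 * (1 - utilization))
def time_queue(time_server: float, utilization: float, c: float = 1.0) -> float:
    assert 0.0 <= utilization < 1.0, "equilibrium requires utilization < 1"
    return time_server * (1 + c) * utilization / (2 * (1 - utilization))

# With C = 1 this reduces to Time_server * u / (1 - u):
print(time_queue(0.020, 0.2))    # 0.005 s = 5 ms, as in the example below
```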

           Queuing Theory
• Predict approximate behavior of random variables
   – Make a sharp distinction between past events (arithmetic measurements)
     and future events (mathematical predictions)
   – In computer system, future rely on past  arithmetic measurements and
     mathematical predictions (distribution) are blurred
• Queuing model assumptions → M/G/1
   – Equilibrium system
   – Exponentially distributed inter-arrival times (the time between two
     successive task arrivals), i.e., a steady average arrival rate
   – Unlimited sources of requests (infinite population model)
   – Unlimited queue length, and a FIFO queue
   – The server starts on the next task immediately after finishing the prior one
   – All tasks must be completed
   – One server

          M/G/1 and M/M/1
• M/G/1 queue
   – M = exponentially random request arrival (C = 1)
       • M for "memoryless" or Markovian
   – G: general service distribution (no restrictions)
   – 1 server
• M/M/1 queue
   – Exponential service distribution (C=1)
• Why the exponential distribution (used often in queuing theory)?
   – A collection of many arbitrary distributions acts like an exponential
     distribution (a computer system comprises many interacting components)
   – Simpler math

• Processor sends 10 disk I/Os per second; requests & service are
  exponentially distributed; avg. disk service = 20 ms
   –   On average, how utilized is the disk?
   –   What is the average time spent in the queue?
   –   What is the 90th percentile of the queuing time?
   –   What is the number of requests in the queue?
   –   What is the average response time for a disk request?
• Answer
   – Arrival rate = 10 IOPS; Service rate = 1/0.02 = 50 IOPS
   – Server utilization = 10/50 = 0.2
   – Time_queue = Time_server * server utilization / (1 – server utilization)
     = 20 * 0.2 / (1 – 0.2) = 20 * 0.25 = 5 ms
   – 90th percentile of the queuing time = 2.3 * 5 = 11.5 ms
     (90% < 2.3 * average for the exponential distribution)
   – Length_queue = Arrival rate * Time_queue = 10 * 0.005 = 0.05
   – Average response time = 5 + 20 = 25 ms
   – Length_system = Arrival rate * Time_system = 10 * 0.025 = 0.25
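
The whole example as a Python sketch; note the seconds-vs-milliseconds conversions in the two Length calculations:

```python
arrival_rate = 10                 # I/Os per second
time_server = 0.020               # 20 ms, in seconds

utilization = arrival_rate * time_server                    # 0.2
time_q = time_server * utilization / (1 - utilization)      # 0.005 s = 5 ms
time_system = time_q + time_server                          # 0.025 s = 25 ms

print(2.3 * time_q)               # 90th percentile of queuing time: 11.5 ms
print(arrival_rate * time_q)      # Length_queue  = 0.05 requests
print(arrival_rate * time_system) # Length_system = 0.25 requests
```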

7.9 Benchmarks of Storage
Performance and Availability

          Transaction Processing (TP)
• TP: database applications, OLTP
• Concerned with I/O rate (# of disk accesses per second)
• Started with anonymous gang of 24 members in 1985
   – DebitCredit benchmark: simulates bank tellers and has as its bottom
     line the number of debit/credit transactions per second (TPS)
• Tighter & more standard benchmark versions
   – TPC-A, TPC-B
   – TPC-C: complex query processing – a more realistic model, the
     order-entry environment of a wholesale supplier
• Also must report the cost per TPS
   – Hence machine configuration is considered

TP Benchmarks

           TP Benchmark -- DebitCredit
• Disk I/O is random reads and writes of 100-byte records along with
  occasional sequential writes
   – 2—10 disk I/Os per transaction
   – 5000 – 20000 CPU instructions per disk I/O
• Performance relies on…
   – The efficiency of the TP software
   – How many disk accesses can be avoided by keeping information in main
     memory (cache)!!! → wrong for measuring disk I/O
• Peak TPS
   – Restriction: 90% of transactions must have a < 2-second response time
   – For TPS to increase, the number of tellers and the size of the account
     file must also increase (more TPS requires more users)
       • To ensure that the benchmark really measures disk I/O (not the cache…)

      Relationship Among TPS,
      Tellers, and Account File Size

The data set generally must scale in size as the throughput increases

           SPEC System-Level File
           Server (SFS) Benchmark
• SPECsfs - system level file server
   –   1990 agreement by 7 vendors to evaluate NFS performance
   –   Mix of file reads, writes, and file operations
   –   Write: 50% done on 8KB blocks, 50% on partial (1, 2, 4KB)
   –   Read: 85% full block, 15% partial block
• Scales the size of FS according to the reported throughput
   – For every 100 NFS operations per second, the capacity must
     increase by 1GB
   – Limit average response time, such as 40ms
• Does not normalize for different configurations
• Retired in June 2001 due to bugs

[Figure: SPECsfs results – response time vs. throughput, with an unfair
configuration highlighted.]


           SPECWeb Benchmark

• Benchmark for evaluating the performance of WWW servers
• The SPECWeb99 workload simulates accesses to a Web server
  provider supporting home pages for several organizations
• For each home page, nine files in each of four classes
   –   Less than 1 KB (small icons): 35% of activity
   –   1–10 KB: 50% of activity
   –   10–100 KB: 14% of activity
   –   100 KB–1 MB (large documents and images): 1% of activity
• SPECWeb99 results in 2000 for Dell Computers
   – Large memory is used for a file cache to reduce disk I/O
   – Impact of Web server software and OS

SPECWeb99 Results for Dell

          Examples of Benchmarks of
          Dependability and Availability
• TPC-C has a dependability requirement: must handle a
  single disk failure
• Brown and Patterson [2000]
   – Focus on the effectiveness of fault tolerance in systems
   – Availability can be measured by examining the variations in system
     QOS metrics over time as faults are injected into the system
   – The initial experiment injected a single disk fault
      • Software RAID by Linux, Solaris, and Windows 2000
           – Reconstruct data onto a hot spare disk
      • Disk emulator injects faults
      • SPECWeb99 workload

  Availability Benchmark for
  Software RAID
(Red Hat 6.0)         (Solaris 7)

Availability Benchmark for
Software RAID (Cont.)
     (Windows 2000)

          Availability Benchmark for
          Software RAID (Cont.)
• The longer the reconstruction (the higher the MTTR), the lower the availability
   – Increased reconstruction speed implies decreased application performance
   – Linux vs. Solaris and Windows 2000 make this trade-off differently
• RAID reconstruction
   – Linux and Solaris: initiate reconstruction automatically
   – Windows 2000: initiate reconstruction manually by operators
• Managing transient faults
   – Linux: paranoid
   – Solaris and Windows: ignore most transient faults

7.10 Crosscutting Issues:
     Interface to OS

          I/O Interface to the OS

• OS controls what I/O technique implemented by HW will
  actually be used
• Early Unix head wedge
   – 16-bit controllers could only transfer 64KB at a time
      • Later controllers went to 32-bit devices
      • And are optimized for much larger blocks
   – Unix, however, did not want to distinguish → so it kept the 64KB bias
      • A new I/O controller designed to efficiently transfer 1 MB files
        would never see more than 63KB at a time under early Unix

          Cache Problems -- Stale Data

• 3 potential copies - cache, memory, and disk
   – Stale data: the CPU or I/O system could modify one copy without
     updating the other copies
   – Where is the I/O system connected to the computer?
      • CPU cache: no stale-data problem
           – All I/O devices and the CPU see the most accurate data
           – Preserves the cache system’s multi-level inclusion
           – Disadvantages
               » Lost CPU performance → all I/O data goes through the
                  cache, but little of it is referenced
               » Arbitration between the CPU and I/O for access to the cache
      • Memory: the stale-data problem occurs

Connect I/O to Cache

               Cache-Coherence Problem

[Figure: three cache/memory snapshots. Initially the cache holds A' = 100,
B' = 200 and memory holds A = 100, B = 200. Output: after the CPU writes
A' = 500 in a write-back cache, memory still holds A = 100, so I/O outputs
the stale A. Input: after I/O writes B = 440 into memory, the cached copy
B' = 200 is stale.]

           Stale Data Problem
• I/O sees stale data on output because memory data is not up to date
   – Write-through cache: OK
   – Write-back cache
      • The OS flushes the data to make sure they are not in the cache before output
      • HW checks cache tags to see if the data are in the cache, and
        interacts with the cache only if the output tries to use in-cache data
• The CPU sees stale data in the cache on input after I/O has updated memory
   – The OS guarantees the input data area cannot possibly be in the cache
   – The OS flushes the data to make sure they are not in the cache before input
   – HW checks tags during an input and invalidates the data on a conflict

          DMA and Virtual Memory

• 2 types of addresses: Virtual (VA) and Physical address (PA)
• Physically addressed I/O poses problems for DMA
   – A block larger than a page
      • will likely not fall on consecutive physical page numbers
   – What happens if the OS victimizes a page while DMA is in progress?
      • Pin the page in memory (do not allow it to be replaced)
      • Or the OS copies user data into the kernel address space and then
        transfers between the kernel address space and the I/O space

          Virtual DMA

• DMA uses VAs that are mapped to PAs during the DMA transfer
• A DMA buffer can be sequential in virtual memory but scattered in
  physical memory
• Virtual addresses provide protection from other processes
• With virtual DMA, the OS updates the address tables of a DMA if a
  process is moved
• Virtual DMA requires a register in the DMA controller for each page to
  be transferred, holding the protection bits and the physical page
  corresponding to each virtual page
Virtual DMA Illustration

7.11 Designing an I/O System

           I/O Design Complexities

• Huge variety of I/O devices
    – Latency
    – Bandwidth
    – Block size
•   Expansion is a must – longer buses, larger power supplies and cabinets
•   Balanced performance and cost
•   Yet another n-dimensional conflicting-constraint problem
    – Yep – it’s NP-hard just like all the rest
    – Experience plays a big role, since the solutions are heuristic

            7 Basic I/O Design Steps
• List types of I/O devices and buses to be supported
• List physical requirements of I/O devices
    – Volume, power, bus slots, expansion slots or cabinets, ...
• List cost of each device and associated controller
• List the reliability of each I/O device
• Record CPU resource demands - e.g. cycles
    – Start, support, and complete I/O operation
    – Stalls due to I/O waits
    – Overhead - e.g. cache flushes and context switches
• List memory and bus bandwidth demands
• Assess the performance of different ways to organize I/O devices
    – Of course you’ll need to get into queuing theory to get it right

            An Example
• Impact on CPU of reading a disk page directly into cache.
• Assumptions
   –   16KB page, 64-bytes cache-block
   –   Addresses of the new page are not in the cache
   –   CPU will not access data in the new page
   –   95% of displaced cache blocks will be read in again (miss)
   –   Write-back cache; 50% of blocks are dirty
   –   I/O buffers a full cache block before writing to the cache
   –   Accesses and misses are spread uniformly over all cache blocks
   –   No other interference between CPU and I/O for the cache slots
   –   15,000 misses per 1 million clock cycles when there is no I/O
   –   Miss penalty = 30 CCs, plus 30 more CCs to write back dirty blocks
   –   1 page is brought in every 1 million clock cycles

          An Example (Cont.)

• Each page fills 16,384/64 = 256 cache blocks
• 0.5 * 256 * 30 CCs to write displaced dirty blocks back to memory
• 95% of the 256 blocks (244) are referenced again and miss
   – All of them are dirty and will need to be written back when replaced
   – 244 * 60 more CCs
• In total: 128 * 30 + 244 * 60 more CCs than the no-I/O case
   – about a 1% decrease in performance
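
The arithmetic, as a sketch:

```python
# Cache cost of DMA-ing one 16 KB page into the cache (per million cycles).
blocks = 16_384 // 64                 # 256 cache blocks per page
dirty_writebacks = (blocks // 2) * 30 # 128 displaced dirty blocks: 3,840 CCs
re_missed = 244                       # ~95% of 256 displaced blocks miss again
extra = dirty_writebacks + re_missed * 60   # re-read + write-back: +14,640 CCs

print(f"{extra:,} extra clock cycles per page")   # 18,480 -> roughly 1% slower
```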

           Five More Examples

•   Naive cost-performance design and evaluation
•   Availability of the first example
•   Response time of the first example
•   Most realistic cost-performance design and evaluation
•   More realistic design for availability and its evaluation

