Chapter 7 Storage Systems




                            1
            Outline
•   Introduction
•   Types of Storage Devices
•   Buses – Connecting I/O Devices to CPU/Memory
•   Reliability, Availability, and Dependability
•   RAID: Redundant Arrays of Inexpensive Disks
•   Errors and Failures in Real Systems
•   I/O Performance Measures
•   A Little Queuing Theory
•   Benchmarks of Storage Performance and Availability
•   Crosscutting Issues
•   Designing an I/O System


                                     2
7.1 Introduction




                   3
           Motivation: Who Cares About
           I/O?
• CPU Performance: 2x every 18 months
• I/O performance limited by mechanical delays (disk I/O)
    – < 10% per year (I/O per sec or MB per sec)
• Amdahl's Law: system speed-up limited by the slowest part!
    – 10% I/O & 10x CPU → 5x Performance (lose 50%)
    – 10% I/O & 100x CPU → 10x Performance (lose 90%)
•   I/O bottleneck:
    – Diminishing fraction of time in CPU
    – Diminishing value of faster CPUs




                                     4
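A quick check of the Amdahl's Law numbers above (a minimal sketch; the function name is ours):

```python
# Amdahl's Law: overall speedup when only the CPU portion is sped up.
def overall_speedup(io_fraction, cpu_speedup):
    # Normalized new time = io_fraction + (1 - io_fraction) / cpu_speedup
    return 1.0 / (io_fraction + (1.0 - io_fraction) / cpu_speedup)

print(overall_speedup(0.10, 10))    # ~5.3x: roughly the "5x" on the slide
print(overall_speedup(0.10, 100))   # ~9.2x: roughly the "10x" on the slide
```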
          Position of I/O in Computer
          Architecture – Past
• An orphan in the architecture domain
• I/O meant the non-processor and memory stuff
   – Disk, tape, LAN, WAN, etc.
   – Performance was not a major concern
• Devices characterized as:
   – Extraneous, non-priority, infrequently used, slow
• Exception is swap area of disk
   – Part of the memory hierarchy
   – Hence part of system performance but you’re hosed if you use it
     often



                                     5
           Position of I/O in Computer
           Architecture – Now
• Trends – I/O is the bottleneck
   – Communication is frequent
      • Voice response & transaction systems, real-time video
      • Multimedia expectations
   – Even standard networks come in gigabit/sec flavors
   – For multi-computers
• Result
   – Significant focus on system bus performance
      • Common bridge to the memory system and the I/O systems
      • Critical performance component for the SMP server platforms



                                   6
          System vs. CPU Performance
• Care about speed at which user jobs get done
   – Throughput - how many jobs/time (system view)
   – Latency - how quick for a single job (user view)
   – Response time – time between when a command is issued and
     results appear (user view)
• CPU performance main factor when:
    – Job mix fits in the memory → there are very few page faults
• I/O performance main factor when:
   – The job is too big for the memory - paging is dominant
   – When the job reads/writes/creates a lot of unexpected files
      • OLTP – Decision support -- Database
   – And then there is graphics & specialty I/O devices


                                     7
            System Performance
• Depends on many factors in the worst case
   –   CPU
   –   Compiler
   –   Operating System
   –   Cache
   –   Main Memory
   –   Memory-IO bus
   –   I/O controller or channel
   –   I/O drivers and interrupt handlers
   –   I/O devices: there are many types
         • Level of autonomous behavior
         • Amount of internal buffer capacity
         • Device specific parameters for latency and throughput

                                         8
I/O Systems
(The Memory – I/O bus below may be the same as or different from the
CPU–memory bus)



                 interrupts
 Processor


  Cache


             Memory - I/O Bus


  Main            I/O                I/O          I/O
 Memory        Controller         Controller   Controller


              Disk    Disk        Graphics        Network


                              9
          Keys to a Balanced System

• It’s all about overlap - I/O vs CPU
   – Timeworkload = Timecpu + TimeI/O - Timeoverlap
• Consider the benefit of just speeding up one
   – Amdahl’s Law (see P4 as well)
• Latency vs. Throughput




                                      10
            I/O System Design
            Considerations
• Depends on type of I/O device
   – Size, bandwidth, and type of transaction
   – Frequency of transaction
   – Defer vs. do now
• Appropriate memory bus utilization
• What should the controller do
   –   Programmed I/O
   –   Interrupt vs. polled
   –   Priority or not
   –   DMA
   –   Buffering issues - what happens on over-run
   –   Protection
   –   Validation


                                        11
           Types of I/O Devices

• Behavior
   –   Read, Write, Both
   –   Once, multiple
   –   Size of average transaction
   –   Bandwidth
   –   Latency
• Partner - the speed of the slowest link theory
   – Human operated (interactive or not)
   – Machine operated (local or remote)




                                     12
Some I/O Device
Characteristics




           13
          Is I/O Important?

• Depends on your application
   – Business - disks for file system I/O
   – Graphics - graphics cards or special co-processors
   – Parallelism - the communications fabric
• Our focus = mainline uniprocessing
   – Storage subsystems (Chapter 7)
   – Networks (Chapter 8)
• Noteworthy Point
   – The traditional orphan
   – But now often viewed more as a front line topic



                                    14
7.2 Types of Storage Devices




                               15
          Magnetic Disks

• 2 important Roles
   – Long term, non-volatile storage – file system and OS
   – Lowest level of the memory hierarchy
      • Most of the virtual memory is physically resident on the disk
• Long viewed as a bottleneck
    – Mechanical system → slow
   – Hence they seem to be an easy target for improved technology
    – Disk improvements w.r.t. density have done better than Moore’s law




                                    16
            Disks are organized into
            platters, tracks, and sectors

[Figure: disk organization]
   – Platters: 1–12, with 2 recording sides each
   – Tracks: 5,000–30,000 per surface
   – Sectors: 100–500 per track

A sector is the smallest
unit that can be read or written
                                   17
          Physical Organization
          Options
• Platters – one or many
• Density - fixed or variable
    – (Do all tracks have the same number of sectors?)
• Organization - sectors, cylinders, and tracks
   – Actuators - 1 or more
   – Heads - 1 per track or 1 per actuator
   – Access - seek time vs. rotational latency
      • Seek related to distance but not linearly
       • Typical rotation: 3,600 to 15,000 RPM
• Diameter – 1.0 to 3.5 inches


                                    18
          Typical Physical Organization

• Multiple platters
   – Metal disks covered with magnetic recording material on both sides
• Single actuator (since they are expensive)
   – Single R/W head per arm
   – One arm per surface
   – All heads therefore over same cylinder
• Fixed sector size
• Variable density encoding
• Disk controller – usually built in processor + buffering



                                   19
Characteristics of Three Magnetic Disks of 2000




                   20
          Anatomy of a Read Access

• Steps
  –   Memory mapped I/O over bus to controller
  –   Controller starts access
  –   Seek + rotational latency wait
  –   Sector is read and buffered (validity checked)
  –   Controller says ready or DMA’s to main memory and then says ready




                                   21
           Access Time

• Access Time
    – Seek time: time to move the arm over the proper track
        • Very non-linear: accelerate and decelerate times complicate
    – Rotation latency (delay): time for the requested sector to rotate under
      the head (on average, half a rotation: 0.5/RPM)
    – Transfer time: time to transfer a block of bits (typically a sector)
      under the read-write head
    – Controller overhead: the overhead the controller imposes in
      performing an I/O access
    – Queuing delay: time spent waiting for a disk to become free

Average Access Time = Average Seek Time + Average Rotational Delay +
Transfer Time + Controller Overhead + Queuing Delay
                                      22
          Access Time Example

• Assumption: average seek time – 5ms; transfer rate –
  40MB/sec; 10,000 RPM; controller overhead – 0.1ms; no
  queuing delay
• What is the average time to r/w a 512-byte sector?
• Answer
   5 ms + 0.5/(10,000 RPM) + 0.5 KB/(40.0 MB/sec) + 0.1 ms
        = 5.0 + 3.0 + 0.013 + 0.1 = 8.1 ms

   Rotational delay: 0.5/(10,000 RPM) = 0.5/((10,000/60) RPS)
                   = 0.003 sec = 3.0 ms




                                      23
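The same computation as a minimal sketch (the function name and the 1 MB = 1000 KB convention follow the slide's arithmetic):

```python
def avg_access_time_ms(seek_ms, rpm, sector_kb, transfer_mb_s,
                       controller_ms, queue_ms=0.0):
    rotational_ms = 0.5 / (rpm / 60.0) * 1000.0  # half a rotation, in ms
    transfer_ms = sector_kb / transfer_mb_s      # 0.5 KB / 40 MB/s = 0.0125 ms
    return seek_ms + rotational_ms + transfer_ms + controller_ms + queue_ms

# 5 ms seek, 10,000 RPM, 512-byte sector, 40 MB/sec, 0.1 ms overhead
print(avg_access_time_ms(5.0, 10_000, 0.5, 40.0, 0.1))   # ~8.1 ms
```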
         Cost VS Performance

• Large-diameter drives have more data over which to amortize the
  cost of electronics → lowest cost per GB
• Higher sales volume → lower manufacturing cost
• 3.5-inch drive, the largest surviving drive in 2001, also has
  the highest sales volume, so it unquestionably has the best
  price per GB




                               24
          Future of Magnetic Disks

• Areal density: bits/unit area is common improvement metric
• Trends
   – Until 1988: 29% improvement per year
   – 1988 – 1996: 60% per year
   – 1997 – 2001: 100% per year
• 2001
   – 20 billion bits per square inch
   – 60 billion bit per square inch demonstrated in labs

   Areal Density = Tracks/Inch on a disk surface * Bits/Inch on a track


                                      25
Disk Price Trends by
Capacity




            26
Disk Price Trends – Dollars
Per MB




            27
Cost VS Access Time for SRAM,
DRAM, and Magnetic Disk




            28
             Disk Alternatives

• Optical Disks
    –   Optical compact disks (CD) – 0.65GB
    –   Digital video discs, digital versatile disks (DVD) – 4.7GB * 2 sides
    –   Rewritable CD (CD-RW) and write-once CD (CD-R)
    –   Rewritable DVD (DVD-RAM) and write-once DVD (DVD-R)
•   Robotic Tape Storage
•   Optical Juke Boxes
•   Tapes – DAT, DLT
•   Flash memory
    – Good for embedded systems
    – Nonvolatile storage and rewritable ROM

                                        29
7.3 Bus – Connecting I/O
Devices to CPU/Memory




                           30
          I/O Connection Issues
                           Connecting the CPU to the I/O device world

• Shared communication link between subsystems
   – Typical choice is a bus
• Advantages
   – Shares a common set of wires and protocols → low cost
   – Often based on a standard (PCI, SCSI, etc.) → portability and versatility
• Disadvantages
   – Poor performance
   – Multiple devices imply arbitration and therefore contention
   – Can be a bottleneck




                                    31
            I/O Connection Issues –
            Multiple Buses
• I/O bus                                  • CPU-memory bus
   – Lengthy                                  – Short
   – Many types of connected                  – High speed
     devices                                  – Match to the memory system to
   – Wide range in device bandwidth             maximize CPU-memory
   – Follow a bus standard                      bandwidth
   – Accept devices varying in                – Knows all types of devices that
     latency and bandwidth                      must connect together
     capabilities




                                      32
Typical Bus Synchronous
Read Transaction




           33
           Bus Design Decisions

• Other things to standardize as well
   –   Connectors
   –   Voltage and current levels
   –   Physical encoding of control signals
   –   Protocols for good citizenship




                                      34
          Bus Design Decisions (Cont.)

• Bus master: devices that can initiate a R/W transaction
   – Multiple masters: multiple CPUs and I/O devices can initiate bus transactions
   – Multiple bus masters need arbitration (fixed priority or random)
• Split transaction for multiple masters
   – Use packets for the full transaction (does not hold the bus)
   – A read transaction is broken into read-request and memory-reply
     transactions
   – Make the bus available for other masters while the data is
     read/written from/to the specified address
   – Transactions must be tagged
   – Higher bandwidth, but also higher latency


                                     35
Split Transaction Bus




            36
          Bus Design Decisions (Cont.)

• Clocking: Synchronous vs. Asynchronous
   – Synchronous
      • Include a clock in the control lines, and a fixed protocol for
        address and data relative to the clock
      • Fast and inexpensive (little or no logic to determine what's next)
      • Everything on the bus must run at the same clock rate
      • Short length (due to clock skew)
      • CPU-memory buses
   – Asynchronous
      • Easier to connect a wide variety of devices, and lengthen the bus
      • Scale better with technological changes
      • I/O buses

                                    37
Synchronous or
Asynchronous?




           38
           Standards
• The Good
   – Let the computer and I/O-device designers work independently
   – Provides a path for second party (e.g. cheaper) competition
• The Bad
   – Become major performance anchors
   – Inhibit change
• How to create a standard
   – Bottom-up
      • A company tries to get the standards committee to approve its latest
         philosophy in hopes of getting the jump on the others (e.g. S bus,
         PC-AT bus, ...)
      • De facto standards
   – Top-down
      • Design by committee (PCI, SCSI, ...)


                                        39
Some Sample I/O Bus
Designs




           40
Some Sample Serial I/O Bus
   Often used in embedded computers




                   41
CPU-Memory Buses Found in
2001 Servers Crossbar Switch




            42
            Connecting the I/O Bus
• To main memory
    – I/O bus and CPU-memory bus may be the same
        • I/O commands on the bus could interfere with the CPU's access to memory
    – Since cache misses are rare, I/O traffic does not tend to stall the CPU
    – Problem is lack of coherency
    – We consider this case here
• To cache
• Access
    – Memory-mapped I/O or distinct instruction (I/O opcodes)
• Interrupt vs. Polling
• DMA or not
    – Autonomous control allows overlap and latency hiding
    – However there is a cost impact


                                       43
A typical interface of I/O devices and
an I/O bus to the CPU-memory bus




                44
            Processor Interface Issues
• Processor interface
    – Interrupts
    – Memory mapped I/O
•   I/O Control Structures
    –   Polling
    –   Interrupts
    –   DMA
    –   I/O Controllers
    –   I/O Processors
• Capacity, Access Time, Bandwidth
• Interconnections
    – Busses

                             45
                 I/O Controller




[Figure: the I/O controller reports status (ready, done, error, ...), is
selected via an I/O address, and exchanges commands and interrupts with
the CPU]

                                        46
          Memory Mapped I/O

[Figure: CPU (with $ and L2 $) and memory on a memory bus; a bus adaptor
bridges to an I/O bus, where interfaces connect peripherals. The address
space is divided among ROM, RAM, and I/O regions]

• Single memory & I/O bus; no separate I/O instructions
• Some portions of the memory address space are assigned to I/O devices;
  reads/writes to these addresses cause data transfers


                                        47
           Programmed I/O
• Polling
• I/O module performs the action,
  on behalf of the processor
• But I/O module does not interrupt
  CPU when I/O is done
• Processor is kept busy checking
  status of I/O module
   – not an efficient way to use the
     CPU unless the device is very
     fast!
• Byte by Byte…




                                       48
          Interrupt-Driven I/O
• Processor is interrupted when
  I/O module ready to exchange
  data
• Processor is free to do other
  work
• No needless waiting
• Consumes a lot of processor
  time because every word read or
  written passes through the
  processor and requires an
  interrupt
• Interrupt per byte


                                    49
              Direct Memory Access (DMA)
•   CPU issues request to a DMA
    module (separate module or
    incorporated into I/O module)
•   DMA module transfers a block of
    data directly to or from memory
    (without going through CPU)
•   An interrupt is sent when the task is
    complete
     – Only one interrupt per block, rather
       than one interrupt per byte
•   The CPU is only involved at the
    beginning and end of the transfer
•   The CPU is free to perform other
    tasks during data transfer


                                              50
                   Input/Output Processors

[Figure: CPU and IOP share the memory bus; the IOP drives devices D1...Dn
over an I/O bus]

(1) CPU issues an instruction (OP, device address) to the IOP
(2) The IOP looks in memory for commands (OP, Addr, Cnt, Other: what to do,
    where to put the data, how much, special requests)
(3) Device to/from memory transfers are controlled by the IOP directly;
    the IOP steals memory cycles
(4) The IOP interrupts the CPU when done
                                                51
7.4 Reliability, Availability, and
         Dependability




                                     52
            Dependability, Faults, Errors,
            and Failures
• Computer system dependability is the quality of delivered service such
  that reliance can justifiably be placed on this service. The service
  delivered by a system is its observed actual behavior as perceived by
  other system(s) interacting with this system's users. Each module also
  has an ideal specified behavior, where a service specification is an
  agreed description of the expected behavior. A system failure occurs
  when the actual behavior deviates from the specified behavior. The
  failure occurred because of an error, a defect in that module. The cause
  of an error is a fault. When a fault occurs, it creates a latent error, which
  becomes effective when it is activated; when the error actually affects
  the delivered service, a failure occurs. The time between the occurrence
  of an error and the resulting failure is the error latency. Thus, an error is
  the manifestation in the system of a fault, and a failure is the
  manifestation on the service of an error.

                                       53
          Faults, Errors, and Failures

• A fault creates one or more latent errors
• The properties of errors are
   – A latent error becomes effective once activated
   – An error may cycle between its latent and effective states
   – An effective error often propagates from one component to another,
     thereby creating new errors.
• A component failure occurs when the error affects the
  delivered service.
• These properties are recursive and apply to any component



                                   54
          Example of Faults, Errors,
          and Failures
• Example 1
  –   A programming mistake: fault
  –   The consequence is an error or latent error
  –   Upon activation, the error becomes effective
  –   When this effective error produces erroneous data that affect the
      delivered service, a failure occurs
• Example 2
  –   An alpha particle hitting a DRAM → fault
  –   It changes the memory → latent error
  –   Affected memory word is read → effective error
  –   The effective error produces erroneous data that affect the delivered
      service → failure (if ECC corrected the error, a failure would not
      occur)

                                     55
         Service Accomplishment and
         Interruption
• Service accomplishment: service is delivered as specified
• Service interruption: delivered service is different from the
  specified service

• Transitions between these two states are caused by failures
  or restorations




                                56
          Measure Reliability And
          Availability
• Reliability: measure of the continuous service
  accomplishment from a reference initial instant
   – Mean time to failure (MTTF)
   – The reciprocal of MTTF is a rate of failures
   – Service interruption is measured as mean time to repair (MTTR)
• Availability: measure of the service accomplishment w.r.t the
  alternation between the above-mentioned two states
   – Measured as: MTTF/(MTTF + MTTR)
   – Mean time between failure = MTTF+ MTTR




                                   57
            Example
• A disk subsystem
   –   10 disks, each rated at 1,000,000-hour MTTF
   –   1 SCSI controller, 500,000-hour MTTF
   –   1 power supply, 200,000-hour MTTF
   –   1 fan, 200,000-hour MTTF
   –   1 SCSI cable, 1,000,000-hour MTTF
• Component lifetimes are exponentially distributed (the component age is
  not important in probability of failure), and independent failure

   Failure Rate = 10 * 1/1,000,000 + 1/500,000 + 1/200,000 + 1/200,000 + 1/1,000,000
                = 23 / 1,000,000 hours

   MTTF = 1 / Failure Rate = 1,000,000 / 23 ≈ 43,500 hours (≈ 5 years)
                                         58
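A short sketch of this calculation (component MTTFs from the list above; the 24-hour MTTR used for the availability line is our own illustrative assumption):

```python
# Failure rates add for independent, exponentially distributed lifetimes.
components_mttf_hours = [1_000_000] * 10 + [500_000, 200_000, 200_000, 1_000_000]

failure_rate = sum(1.0 / m for m in components_mttf_hours)   # failures per hour
system_mttf = 1.0 / failure_rate
print(round(system_mttf))                    # ~43,478 hours, about 5 years

mttr = 24.0                                  # assumed repair time (illustrative)
print(system_mttf / (system_mttf + mttr))    # availability ~0.99945
```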
         Cause of Faults

• Hardware faults: devices that fail
• Design faults: faults in software (usually) and hardware
  design (occasionally)
• Operation faults: mistakes by operations and maintenance
  personnel
• Environmental faults: fire, flood, earthquake, power failure,
  and sabotage




                                59
         Classification of Faults

• Transient faults exist for a limited time and are not recurring
• Intermittent faults cause a system to oscillate between faulty
  and fault-free operation
• Permanent faults do not correct themselves with the passing
  of time




                                60
          Reliability Improvements

• Fault avoidance: how to prevent, by construction, fault
  occurrence
• Fault tolerance: how to provide, by redundancy, service
  complying with the service specification in spite of faults
  having occurred or that are occurring
• Error removal: how to minimize, by verification, the
  presence of latent errors
• Error forecasting: how to estimate, by evaluation, the
  presence, creation, and consequences of errors



                                61
7.5 RAID: Redundant Arrays of
      Inexpensive Disks




                                62
          3 Important Aspects of File
          Systems
• Reliability – is anything broken?
   – Redundancy is main hack to increased reliability
• Availability – is the system still available to the user?
   – When single point of failure occurs is the rest of the system still
     usable?
   – ECC and various correction schemes help (but cannot improve
     reliability)
• Data Integrity
   – You must know exactly what is lost when something goes wrong




                                      63
          Disk Arrays

• Multiple arms improve throughput, but not necessarily
  improve latency
• Striping
   – Spreading data over multiple disks
• Reliability
   – General metric is N devices have 1/N reliability
      • Rule of thumb: MTTF of a disk is about 5 years
   – Hence need to add redundant disks to compensate
      • MTTR ::= mean time to repair (or replace) (hours for disks)
      • If MTTR is small then the array’s MTTF can be pushed out
        significantly with a fairly small redundancy factor


                                    64
          Data Striping

• Bit-level striping: split the bits of each byte across multiple
  disks
   – The number of disks either is a multiple of 8 or evenly divides 8
• Block-level striping: blocks of a file are striped across
  multiple disks; with n disks, block i goes to disk (i mod n)+1
• Every disk participates in every access
   – Number of I/O per second is the same as a single disk
   – Number of data per second is improved
• Provides high data-transfer rates, but does not improve reliability



                                     65
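A minimal sketch of the block-level rule just stated ((i mod n) + 1, with 1-based disk numbers):

```python
def disk_for_block(block_index, n_disks):
    """Block-level striping: block i of a file goes to disk (i mod n) + 1."""
    return (block_index % n_disks) + 1

# With 4 disks, blocks 0..7 land on disks 1, 2, 3, 4, 1, 2, 3, 4
print([disk_for_block(i, 4) for i in range(8)])
```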
          Redundant Arrays of Disks

• Files are "striped" across multiple disks
• Availability is improved by adding redundant disks
   – If a single disk fails, the lost information can be reconstructed from
     redundant information
   – Capacity penalty to store redundant information
   – Bandwidth penalty to update
• RAID
   – Redundant Arrays of Inexpensive Disks
   – Redundant Arrays of Independent Disks




                                      66
RAID Levels, Reliability,
Overhead

[Figure: table of RAID levels, their redundant information, and overhead]




             67
            RAID Levels 0 - 1

• RAID 0 – No redundancy (Just block striping)
   – Cheap but unable to withstand even a single failure
• RAID 1 – Mirroring
   –   Each disk is fully duplicated onto its “shadow”
   –   Files written to both; if one fails, flag it and get data from the mirror
   –   Reads may be optimized – use the disk delivering the data first
   –   Bandwidth sacrifice on write: Logical write = two physical writes
   –   Most expensive solution: 100% capacity overhead
   –   Targeted for high I/O rate, high availability environments
• RAID 0+1 – stripe first, then mirror the stripe
• RAID 1+0 – mirror first, then stripe the mirror

                                        68
            RAID Levels 2 & 3
• RAID 2 – Memory style ECC
    – Cuts down number of additional disks
    – Actual number of redundant disks will depend on correction model
    – RAID 2 is not used in practice
• RAID 3 – Bit-interleaved parity
    – Reduce the cost of higher availability to 1/N (N = # of disks)
    – Use one additional redundant disk to hold parity information
    – Bit interleaving allows corrupted data to be reconstructed
    – Interesting trade off between increased time to recover from a failure and
      cost reduction due to decreased redundancy
    – Parity = sum of all corresponding disk blocks (modulo 2)
        • Hence all disks must be accessed on a write – potential bottleneck
    – Targeted for high bandwidth applications: Scientific, Image Processing

                                         69
        RAID Level 3: Parity Disk
        (Cont.)

[Figure: a logical record (10010011 11001101 10010011 ...) is bit-striped
across four data disks; a fifth parity disk P holds the modulo-2 sum of
each row]

  25% capacity cost for parity in this configuration (1/N)

                               70
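A sketch of the parity idea on this slide: the parity disk stores the modulo-2 (XOR) sum of the data disks, so any one lost disk can be rebuilt from the survivors. The byte values are illustrative, not from the figure:

```python
from functools import reduce

def xor_bytes(blocks):
    """Bytewise XOR (modulo-2 sum) of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data_disks = [b"\x96", b"\xcd", b"\x93", b"\x64"]   # one byte per data disk
parity = xor_bytes(data_disks)                      # contents of disk P

# Disk 2 fails: XOR of the surviving disks plus parity rebuilds it.
survivors = data_disks[:2] + data_disks[3:] + [parity]
print(xor_bytes(survivors) == data_disks[2])        # True
```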
            RAID Levels 4 & 5 & 6
• RAID 4 – Block interleaved parity
   – Similar idea as RAID 3 but sum is on a per block basis
   – Hence only the parity disk and the target disk need be accessed
   – Problem still with concurrent writes since parity disk bottlenecks
• RAID 5 – Block interleaved distributed parity
   –   Parity blocks are interleaved and distributed on all disks
   –   Hence parity blocks no longer reside on same disk
   –   Probability of write collisions to a single drive is reduced
   –   Hence higher performance in the consecutive write situation
• RAID 6
   – Similar to RAID 5, but stores extra redundant information to guard
     against multiple disk failures


                                      71
Raid 4 & 5 Illustration




RAID 4                                        RAID 5
         Targeted for mixed applications
         A logical write becomes four physical I/Os

                          72
Small Write Update on RAID
3




           73
            Small Writes Update on RAID
            4/5
RAID-5: Small Write Algorithm
            1 Logical Write = 2 Physical Reads + 2 Physical Writes

[Figure: stripe of disks D0 D1 D2 D3 P]
   1. Read old data D0
   2. Read old parity P
   3. Write new data D0'
   4. Write new parity P' = (D0 XOR D0') XOR P
                                       74
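A sketch of the small-write update (reusing xor_bytes, data_disks, and parity from the RAID 3 sketch above):

```python
def raid5_small_write(old_data, new_data, old_parity):
    """2 reads (old data, old parity) + 2 writes (new data, new parity)."""
    delta = xor_bytes([old_data, new_data])       # which bits changed
    new_parity = xor_bytes([old_parity, delta])   # fold the change into parity
    return new_data, new_parity

d0_new = b"\x5a"
_, p_new = raid5_small_write(data_disks[0], d0_new, parity)
# Same parity as recomputing over the whole stripe with D0 replaced:
print(p_new == xor_bytes([d0_new] + data_disks[1:]))   # True
```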
7.6 Errors and Failures in Real
           Systems




                                  75
           Examples

•   Berkeley’s Tertiary Disk
•   Tandem
•   VAX
•   FCC




                               76
Berkeley’s Tertiary Disk
                   (18 months of operation)

[Figure: observed failures by component type]

SCSI backplane, cables, and Ethernet cables were no more reliable than
the data disks

              77
7.7 I/O Performance Measures




                               78
          I/O Performance Measures

• Some similarities with CPU performance measures
   – Bandwidth - 100% utilization is maximum throughput
   – Latency - often called response time in the I/O world
• Some unique
   – Diversity - what types can be connected to the system
   – Capacity - how many and how much storage on each unit
• Usual relationship between Bandwidth & Latency




                                    79
          Latency VS. Throughput

• Response time (latency): the time a task takes from the
  moment it is placed in the buffer until the server finishes the
  task
• Throughput: the average number of tasks completed by the
  server over a time period
• Knee of the curve (L VS. T): the area where a little more
  throughput results in much longer response time, or, a little
  shorter response time results in much lower throughput
                         [Diagram: Proc → queue → device;
                          Response time = Queue + Device service time]

                                80
          Latency VS. Throughput

[Figure: latency vs. throughput curve showing the knee]




                     81
          Transaction Model

• In an interactive environment, faster response time is
  important
• Impact of inherent long latency
• Transaction time: sum of 3 components
   – Entry time - time it takes user (usually human) to enter command
   – System response time - command entry to response out
   – Think time - user reaction time between response and next entry




                                   82
The Impact of Reducing
Response Time




           83
          Transaction Time Oddity

• As system response time goes down
   – Think time goes down even more
• Could conclude
   – That system performance magnifies human talent
   – OR conclude that with a fast system less thinking is necessary
   – OR conclude that with a fast system less thinking is done




                                   84
7.8 A Little Queuing Theory




                              85
           Introduction

• Help calculate response time and throughput
• More interested in long-term, steady-state behavior than in startup
   – No. of tasks entering the system = No. of tasks leaving the system
• Little’s Law:
   – Mean number tasks in system = arrival rate x mean response time
• Applies to any system in equilibrium, as long as nothing in
  black box is creating or destroying tasks
[Diagram: arrivals → Proc → queue → server (IOC, device) → departures;
 the queue plus server form the system]


                                     86
          Little's Law

• Mean no. of tasks in system = arrival rate * mean response
  time
• We observe a system for Timeobserve
• No. of tasks completed during Timeobserve is Numbertasks
• Sum of the times each task spends in the system: Timeaccumulated

   Mean number of tasks in system = Timeaccumulated / Timeobserve

   Mean response time = Timeaccumulated / Numbertasks

   Timeaccumulated / Timeobserve
       = (Timeaccumulated / Numbertasks) * (Numbertasks / Timeobserve)
       = Mean response time * Arrival rate

                                      87
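A small sketch of Little's Law on synthetic observations (the arrival/departure times are invented; the final equality is exactly the algebraic identity above):

```python
# Tasks as (arrival_time, departure_time) pairs observed over [0, 10] seconds.
tasks = [(0.5, 1.7), (1.0, 2.1), (2.4, 2.9), (4.0, 6.5), (7.2, 9.0)]
time_observe = 10.0

number_tasks = len(tasks)
time_accumulated = sum(dep - arr for arr, dep in tasks)

mean_in_system = time_accumulated / time_observe    # Lengthsystem
mean_response = time_accumulated / number_tasks     # Timesystem
arrival_rate = number_tasks / time_observe

print(abs(mean_in_system - arrival_rate * mean_response) < 1e-12)  # True
```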
            Queuing Theory Notation
• Queuing models assume state of equilibrium: input rate = output rate
• Notation:
    – Timeserver – average time to service a task
        • Service rate – 1/Timeserver
    – Timequeue – average time/task in queue
    – Timesystem – response, average time/task in system
        • Timesystem = Timeserver + Timequeue
    – Arrival rate – average number of arriving tasks/second
    – Lengthserver – average number of tasks in service
    – Lengthqueue – average number of tasks in queue
    – Lengthsystem – average number of tasks in system
        • Lengthsystem = Lengthserver + Lengthqueue
    – Server Utilization = Arrival Rate / Service Rate (0 – 1) (equilibrium)
• Little’s Law → Lengthsystem = Arrival Rate * Timesystem

                                          88
            Example
•   An I/O system with a single disk
•   10 I/O requests per second, average time to service = 50ms
•   Arrival Rate = 10 IOPS Service Rate = 1/50ms = 20 IOPS
•   Server Utilization = 10/20 = 0.5

•   Lengthqueue = Arrival Rate * Timequeue
•   Lengthserver = Arrival Rate * Timeserver
•   Average time to satisfy a disk request = 50ms, Arrival Rate = 200 IOPS
•   Lengthserver = Arrival Rate * Timeserver = 200 * 0.05 = 10




                                      89
            Response Time
[Diagram: Proc → queue → server (IOC, device); queue + server = system]


• Service time completions vs. waiting time for a busy server: randomly
  arriving event joins a queue of arbitrary length when server is busy,
  otherwise serviced immediately (Assume unlimited length queues)
• A single server queue: combination of a servicing facility that
  accommodates 1 task at a time (server) + waiting area (queue); together
  called a system
• Timequeue (suppose FIFO queue)
   – Timequeue = Lengthqueue * Timeserver + M
   – M = mean time to complete service of current task when new task arrives if
     the server is busy
       • A new task can arrive at any instant
       • Use distribution of a random variable: histogram? curve?
       • M is also called Average Residual Service Time (ARST)


                                       90
           Response Time (Cont.)
• Server spends a variable amount of time with tasks
   – Weighted mean m1 = (f1 x T1 + f2 x T2 +...+ fn x Tn)/F (F = f1 + f2 +...)
   – Variance = (f1 x T1² + f2 x T2² +...+ fn x Tn²)/F – m1²
      • Must keep track of units of measure (100 ms² vs. 0.1 s²)
   – Squared coefficient of variance: C = variance/m1²
      • Unitless measure
• Three Distributions
   – Exponential distribution C = 1 : most short relative to average, few others
     long; 90% < 2.3 x average, 63% < average
   – Hypoexponential distribution C < 1 : most close to average,
     C=0.5 → 90% < 2.0 x average, only 57% < average
   – Hyperexponential distribution C > 1 : further from average,
     C=2.0 → 90% < 2.8 x average, 69% < average
• ARST = 0.5 * Weighted Mean Time * (1 + C)


                                        91
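A sketch of the statistics defined above; the task frequencies and times are invented for illustration:

```python
def distribution_stats(freqs, times):
    """Weighted mean m1, variance, and squared coefficient of variance C."""
    F = sum(freqs)
    m1 = sum(f * t for f, t in zip(freqs, times)) / F
    variance = sum(f * t * t for f, t in zip(freqs, times)) / F - m1 * m1
    C = variance / (m1 * m1)          # unitless
    return m1, variance, C

# e.g. 50 tasks took 5 ms, 30 took 20 ms, 20 took 100 ms (illustrative)
m1, var, C = distribution_stats([50, 30, 20], [5.0, 20.0, 100.0])
print(m1, var, C)                     # C > 1 here: hyperexponential-like
arst = 0.5 * m1 * (1 + C)             # ARST formula from the slide
```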
         Characteristics of Three
         Distributions




Memory-less: C does not vary over time and does not consider past history
of events

                                   92
           Timequeue

• Derive Timequeue in terms of Timeserver, server utilization, and
  C
   – Timequeue = Lengthqueue * Timeserver + ARST * server utilization
   – Timequeue = (arrival rate * Timequeue ) * Timeserver +
                (0.5 * Timeserver * (1+C)) * server utilization
   – Timequeue = Timequeue * server utilization +
                (0.5 * Timeserver * (1+C)) * server utilization
   – Timequeue=Timeserver*(1+C)*server utilization / (2*(1-server utilization))
• For an exponential distribution, C = 1.0 →
   – Timequeue = Timeserver * server utilization / (1 - server utilization)



                                       93
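A sketch of the closed forms just derived, in the slide's notation (M/M/1 is the C = 1 special case):

```python
def time_queue_mg1(time_server, utilization, C):
    """M/G/1: Timequeue = Timeserver * (1 + C) * u / (2 * (1 - u))."""
    return time_server * (1 + C) * utilization / (2 * (1 - utilization))

def time_queue_mm1(time_server, utilization):
    """M/M/1 (exponential service, C = 1): Timeserver * u / (1 - u)."""
    return time_queue_mg1(time_server, utilization, C=1.0)
```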
           Queuing Theory
• Predict approximate behavior of random variables
   – Make a sharp distinction between past events (arithmetic measurements)
     and future events (mathematical predictions)
    – In computer systems, the future relies on the past → arithmetic
      measurements and mathematical predictions (distributions) are blurred
• Queuing model assumptions → M/G/1
   – Equilibrium system
   – Exponential inter-arrival time (time between two successive tasks arriving) or
     arrival rate
   – Unlimited sources of requests (infinite population model)
   – Unlimited queue length, and FIFO queue
    – Server starts on the next task immediately after finishing the prior one
   – All tasks must be completed
   – One server


                                        94
          M/G/1 and M/M/1
• M/G/1 queue
   – M = exponentially random request arrival (C = 1)
       • M for “memoryless” or Markovian
   – G: general service distribution (no restrictions)
   – 1 server
• M/M/1 queue
   – Exponential service distribution (C=1)
• Why exponential distribution (used often in queuing theory)
   – A collection of many arbitrary distributions acts as an exponential
     distribution (A computer system comprises many interacting
     components)
   – Simpler math


                                     95
            Example
• Processor sends 10 disk I/Os per second; requests & service are
  exponentially distributed; avg. disk service = 20 ms
   –   On average, how utilized is the disk?
   –   What is the average time spent in the queue?
   –   What is the 90th percentile of the queuing time?
   –   What is the number of requests in the queue?
   –   What is the average response time for a disk request?
• Answer
   – Arrival rate = 10 IOPS, Service rate = 1/0.02 = 50 IOPS
   – Service utilization = 10/50 = 0.2
   – Timequeue=Timeserver * server utilization / (1-server utilization)
     = 20 * 0.2 / (1 – 0.2) = 20 * 0.25 = 5ms
   – 90th percentile of the queuing time = 2.3 (slide 91) * 5 = 11.5ms
    – Lengthqueue = Arrival rate * Timequeue = 10 * 0.005 = 0.05
   – Average response time = 5 + 20 = 25 ms
    – Lengthsystem = Arrival rate * Timesystem = 10 * 0.025 = 0.25

                                         96
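Running this example through the time_queue_mm1 helper sketched after slide 93:

```python
arrival_rate = 10.0                       # IOPS
time_server = 0.020                       # 20 ms, in seconds
utilization = arrival_rate * time_server  # 10/50 = 0.2

t_q = time_queue_mm1(time_server, utilization)
print(t_q * 1000)                          # 5.0 ms in the queue
print(2.3 * t_q * 1000)                    # 11.5 ms, 90th percentile
print(arrival_rate * t_q)                  # Lengthqueue = 0.05
print((t_q + time_server) * 1000)          # response time = 25 ms
print(arrival_rate * (t_q + time_server))  # Lengthsystem = 0.25
```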
7.9 Benchmarks of Storage
Performance and Availability




                               97
          Transaction Processing (TP)
          Benchmarks
• TP: database applications, OLTP
• Concerned with I/O rate (# of disk accesses per second)
• Started with anonymous gang of 24 members in 1985
    – DebitCredit benchmark: simulates bank tellers and has as its bottom
      line the number of debit/credit transactions per second (TPS)
• Tighter & more standard benchmark versions
   – TPC-A, TPC-B
   – TPC-C: complex query processing - more accurate model of a real
     bank which models credit analysis for loans
   – TPC-D, TPC-H, TPC-R, TPC-W
• Also must report the cost per TPS
   – Hence machine configuration is considered

                                   98
TP Benchmarks




          99
           TP Benchmark -- DebitCredit
• Disk I/O is random reads and writes of 100-byte records along with
  occasional sequential writes
   – 2—10 disk I/Os per transaction
   – 5000 – 20000 CPU instructions per disk I/O
• Performance relies on…
   – The efficiency of TP software
    – How many disk accesses can be avoided by keeping information in main
      memory (cache)!!! → wrong for measuring disk I/O
• Peak TPS
   – Restriction: 90% of transactions have < 2sec. response time
   – For TPS to increase, # of tellers and the size of the account file must also
     increase (more TPS requires more users)
       • To ensure that the benchmark really measure disk I/O (not cache…)

                                        100
      Relationship Among TPS,
      Tellers, and Account File Size




The data set generally must scale in size as the throughput increases




                                101
           SPEC System-Level File
           Server (SFS) Benchmark
• SPECsfs - system level file server
   –   1990 agreement by 7 vendors to evaluate NFS performance
   –   Mix of file reads, writes, and file operations
   –   Write: 50% done on 8KB blocks, 50% on partial (1, 2, 4KB)
   –   Read: 85% full block, 15% partial block
• Scales the size of FS according to the reported throughput
   – For every 100 NFS operations per second, the capacity must
     increase by 1GB
   – Limit average response time, such as 40ms
• Does not normalize for different configurations
• Retired in June 2001 due to bugs

                                   102
            SPECsfs

[Figure: overall response time (ms) vs. NFS throughput; one curve marks
an unfair configuration]




                     103
           SPECWeb

• Benchmark for evaluating the performance of WWW servers
• SPECWeb99 workload simulates accesses to a Web hosting provider
  supporting home pages (HP) for several organizations
• For each HP, nine files in each of the four classes
   –   Less than 1KB (small icons): 35% of activity
   –   1—10KB: 50% of activity
   –   10—100KB: 14% of activity
   –   100KB—1MB (large document and image): 1% of activity
• SPECWeb99 results in 2000 for Dell Computers
   – Large memory is used for a file cache to reduce disk I/O
   – Impact of Web server software and OS

                                   104
SPECWeb99 Results for Dell




           105
          Examples of Benchmarks of
          Dependability and Availability
• TPC-C has a dependability requirement: must handle a
  single disk failure
• Brown and Patterson [2000]
   – Focus on the effectiveness of fault tolerance in systems
   – Availability can be measured by examining the variations in system
     QOS metrics over time as faults are injected into the system
   – The initial experiment injected a single disk fault
      • Software RAID by Linux, Solaris, and Windows 2000
           – Reconstruct data onto a hot spare disk
      • Disk emulator injects faults
      • SPECWeb99 workload


                                  106
  Availability Benchmark for
  Software RAID
(Red Hat 6.0)         (Solaris 7)




                107
Availability Benchmark for
Software RAID (Cont.)
     (Windows 2000)




                 108
          Availability Benchmark for
          Software RAID (Cont.)
• The longer the reconstruction (MTTR), the lower the
  availability
   – Increased reconstruction speed implies decreased application
     performance
   – Linux VS. Solaris and Windows 2000
• RAID reconstruction
   – Linux and Solaris: initiate reconstruction automatically
   – Windows 2000: initiate reconstruction manually by operators
• Managing transient faults
   – Linux: paranoid
   – Solaris and Windows: ignore most transient faults


                                  109
7.10 Crosscutting Issues:
     Interface to OS




                            110
          I/O Interface to the OS

• OS controls what I/O technique implemented by HW will
  actually be used
• Early Unix head wedge
   – 16 bit controllers could only transfer 64KB at a time
      • Later controllers go to 32 bit devices
      • And are optimized for much larger blocks
    – Unix however did not want to distinguish → so it kept the 64KB bias
      • A new I/O controller designed to efficiently transfer 1 MB files
        would never see more than 63KB at a time under early UNIX




                                   111
          Cache Problems -- Stale Data

• 3 potential copies - cache, memory, and disk
   – Stale data: CPU or I/O system could modify one copy without
     updating the other copies
    – Where should the I/O system be connected to the computer?
      • CPU cache: no stale-data problem
           – All I/O devices and CPU see the most accurate data
           – Cache system’s multi-level inclusion
           – Disadvantages
                » Lost CPU performance → all I/O data goes through the
                   cache, but little is referenced
               » Arbitration between CPU and I/O for accessing cache
      • Memory: stale-data problem occurs

                                  112
Connect I/O to Cache




           113
               Cache-Coherence Problem

[Figure: three CPU/cache/memory/I-O snapshots. Initially the cache holds
A' = 100 and B' = 200, matching memory (A = 100, B = 200).
Output case: the CPU updates A' to 500 in a write-back cache while memory
still holds A = 100, so the I/O outputs the stale A.
Input case: input writes B = 440 to memory while the cache still holds
B' = 200, so the CPU reads the stale B'.]
                                 114
          Stale Data Problem
• I/O sees stale data on output because memory data is not
  up-to-date
   – Write-through cache OK
   – Write-back cache
      • OS flushes data to make sure they are not in cache before output
      • HW checks cache tags to see if they are in cache, and only
        interact with cache if the output tries to use in-cache data
• CPU sees stale data in cache on input after I/O has updated
  memory
   – OS guarantees input data area cannot possibly be in cache
   – OS flushes data to make sure they are not in cache before input
    – HW checks tags during an input and invalidates the data on a conflict


                                    115
          DMA and Virtual Memory

• 2 types of addresses: Virtual (VA) and Physical address (PA)
• Physically addressed I/O problems for DMA
   – Block size larger than a page
      • Will likely not fall on consecutive physical page numbers
    – What happens if the OS victimizes a page while DMA is in progress?
       • Pin the page in memory (do not allow it to be replaced)
       • OS copies user data into the kernel address space and then
         transfers between the kernel address space and I/O space




                                 116
         Virtual DMA

• DMA uses VAs that are mapped to PAs during the DMA
• DMA buffer sequential in VM, but can be scattered in PM
• Virtual addresses provide the protection of other processes
• OS updates the address tables of a DMA if a process is
  moved using virtual DMA
• Virtual DMA requires a register for each page to be
  transferred in the DMA controller, showing the protection
  bits and the physical page corresponding to each virtual
  page



                              117
Virtual DMA Illustration




             118
7.11 Designing an I/O System




                               119
           I/O Design Complexities

• Huge variety of I/O devices
    – Latency
    – Bandwidth
    – Block size
•   Expansion is a must – longer buses, larger power supplies and cabinets
•   Balanced Performance and Cost
•   Yet another n-dimensional conflicting-constraint problem
    – Yep - it’s NP hard just like all the rest
    – Experience plays a big role since the solutions are heuristic


                                     120
            7 Basic I/O Design Steps
• List types of I/O devices and buses to be supported
• List physical requirements of I/O devices
    – Volume, power, bus slots, expansion slots or cabinets, ...
• List cost of each device and associated controller
• List the reliability of each I/O device
• Record CPU resource demands - e.g. cycles
    – Start, support, and complete I/O operation
    – Stalls due to I/O waits
    – Overhead - e.g. cache flushes and context switches
• List memory and bus bandwidth demands
• Assess the performance of different ways to organize I/O devices
    – Of course you’ll need to get into queuing theory to get it right


                                         121
            An Example
• Impact on CPU of reading a disk page directly into cache.
• Assumptions
   –   16KB page, 64-bytes cache-block
   –   Addresses of the new page are not in the cache
   –   CPU will not access data in the new page
   –   95% displaced cache block will be read in again (miss)
   –   Write-back cache, 50% are dirty
   –   I/O buffers a full cache block before writing to cache
   –   Access and misses are spread uniformly to all cache blocks
   –   No other interference between CPU and I/O for the cache slots
   –   15,000 misses per 1 million clock cycles when no I/O
    –   Miss penalty = 30 CCs; 30 more CCs to write back a dirty block
   –   1 page is brought in every 1 million clock cycles

                                       122
          An Example (Cont.)

• Each page fills 16,384/64 or 256 blocks
• 0.5 * 256 * 30 CCs to write dirty blocks to memory
• 95% * 256 (≈244) of the displaced blocks are referenced again and miss
   – All of them are dirty and will need to be written back when replaced
   – 244 * 60 more CCs (30 for the re-read miss + 30 for the write-back)
• In total: 128 * 30 + 244 * 60 = 18,480 more CCs than the baseline of
  1,000,000 + 15,000*30 + 7,500*30 = 1,675,000
   – ≈1% decrease in performance




                                   123
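A sketch of the arithmetic above (values copied from the assumptions on slide 122):

```python
page_bytes, block_bytes = 16 * 1024, 64
blocks = page_bytes // block_bytes               # 256 cache blocks per page

miss_cc = writeback_cc = 30
displaced_dirty_cc = (blocks // 2) * writeback_cc    # 128 * 30
rereferenced = 244                                   # ~95% of 256, as on the slide
extra_cc = displaced_dirty_cc + rereferenced * (miss_cc + writeback_cc)

baseline_cc = 1_000_000 + 15_000 * miss_cc + 7_500 * writeback_cc
print(extra_cc, baseline_cc, extra_cc / baseline_cc)  # 18480 1675000 ~0.011
```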
           Five More Examples

•   Naive cost-performance design and evaluation
•   Availability of the first example
•   Response time of the first example
•   Most realistic cost-performance design and evaluation
•   More realistic design for availability and its evaluation




                                 124

				