VMworld Europe

AP02
NFS & iSCSI: Performance
Characterization and Best
Practices in ESX 3.5
       Priti Mishra
       MTS, VMware

       Bing Tsai
       Sr. R&D Manager, VMware
Housekeeping

  Please turn off your mobile phones, BlackBerries and laptops
  Your feedback is valued: please fill in the session evaluation form
  (specific to this session) and hand it to the room monitor or the
  materials pickup area at registration
  Each delegate who returns a completed event evaluation form to the
  materials pickup area will be eligible for a free evaluation copy of
  VMware's ESX 3i
  Please leave the room between sessions, even if your next session is in
  the same room, as you will need to be rescanned
Topics

   General Performance Data and Comparison
    Improvements in ESX 3.5 over ESX 3.0.x
   Performance Best Practices
   Troubleshooting Techniques
    Basic methodology
    Tools
    Case studies
Key performance improvements since ESX 3.0.x (1 of 3)

   NFS
     Accurate CPU accounting further improves load balancing among
     multiple VMs
     Optimized buffer and heap sizes
     Improvements in TSO support
       TSO (TCP segmentation offload) improves large writes
   H/W iSCSI (with QLogic 405x HBA)
     Improvements in PAE (large memory) support
       Results in better multi-VM performance in large systems
     Minimized NUMA performance overhead
       This overhead exists in physical systems as well
     Improved CPU cost per I/O
Key performance improvements since ESX 3.0.x (2 of 3)

   S/W iSCSI (S/W-based initiator in ESX)
     Improvements in CPU costs per I/O
     Accurate CPU accounting further improves load balance among
     multiple VMs
     Increased maximum transfer size
       Minimizes iSCSI protocol processing cost
       Reduces network overhead for large I/Os
     Ability to handle more concurrent I/Os
       Improved multi-VM performance
Key performance improvements since ESX 3.0.x (3 of 3)

   S/W iSCSI (continued)
     Improvements in PAE (large memory) support
      CPU efficiency much improved for systems with >4GB memory
     Minimized NUMA performance overhead
Performance Experiment Setup (1 of 3)

   Workload: Iometer
    Standard set based on
      Request size
         1k, 4k, 8k, 16k, 32k, 64k, 72k, 128k, 256k, 512k
      Access mode
         50% read / 50% write
      Access pattern
         100% sequential
    1 worker, 16 Outstanding I/Os
   Cached runs
    100MB data disks to minimize array/server disk activities
    All I/Os served from server/array cache
    Gives upper bound on performance
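   For illustration only (not part of the original deck), a minimal Python
   sketch of the test matrix implied by the parameters above; the request
   sizes, access mode, pattern, worker count and outstanding I/Os mirror the
   list, and the print loop is just a placeholder for launching each run:

     # Hypothetical sketch of the Iometer-style test matrix described above.
     REQUEST_SIZES_KB = [1, 4, 8, 16, 32, 64, 72, 128, 256, 512]

     test_matrix = [
         {
             "request_size_kb": size_kb,
             "read_pct": 50,           # 50% read / 50% write access mode
             "sequential_pct": 100,    # 100% sequential access pattern
             "workers": 1,
             "outstanding_ios": 16,
         }
         for size_kb in REQUEST_SIZES_KB
     ]

     for cfg in test_matrix:
         print(cfg)                    # each entry corresponds to one Iometer run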
Performance Experiment Setup (2 of 3)

   VM information
    Windows 2003 Enterprise Edition
    1 VCPU; 256 MB memory
      No file system used in VM (Iometer sees disk as physical drive)
      No caching done in VM
    Virtual disks located on RDM device configured in physical mode
      Note: VMFS-formatted volumes are used in some tests where noted
Performance Experiment Setup (3 of 3)

   ESX Server
    4-socket, 8 x 2.4GHz cores
    32GB DRAM
    2 x Gigabit NICs
      One for vmkernel networking: used for NFS and software iSCSI
      protocols
      One for general VM connectivity
   Networking Configuration
    Dedicated VLANs for data traffic isolated from general networking
How to read performance comparison charts

   Throughput
    Higher is better
    Positive change is better → higher throughput
   Latency
    Lower is better
    Negative change is better → lower response time
   CPU cost
    Lower is better
    Negative change is better → reduced CPU cost
    Why does this metric matter?
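   As a worked example (hypothetical numbers, not measured data), the sign
   convention above can be expressed as a percentage change against the
   ESX 3.0.x baseline:

     # Illustrative values only; a positive throughput change and negative
     # latency/CPU-cost changes are improvements.
     def pct_change(new, baseline):
         return 100.0 * (new - baseline) / baseline

     throughput_delta = pct_change(120.0, 100.0)  # +20% -> higher MBps, better
     latency_delta    = pct_change(0.8, 1.0)      # -20% -> lower ms, better
     cpu_cost_delta   = pct_change(45.0, 50.0)    # -10% -> lower MHz/MBps, better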
CPU Costs

   Why is CPU cost data useful?
   Determines how much I/O traffic the system CPUs can handle
     How many I/O-intensive VMs can be consolidated in a host
   How to compute CPU cost
   Measure total physical CPU usage in ESX
     esxtop counter: Physical Cpu(_Total)
   Normalize to per I/O or per MBps
     Example: MHz/MBps =
       {(Physical CPU usage percentage out of 100%) × (# of physical CPUs) ×
        (CPU MHz rating)} / (throughput in MBps)
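   A minimal sketch of the calculation above, using hypothetical esxtop
   readings (the values are illustrative, not measured):

     # CPU cost per MBps, following the formula on this slide.
     physical_cpu_pct = 40.0    # esxtop "Physical Cpu(_Total)" usage, out of 100%
     num_physical_cpus = 8      # physical cores in the host
     cpu_mhz_rating = 2400      # per-core clock rate in MHz
     throughput_mbps = 90.0     # measured I/O throughput in MBps

     used_mhz = (physical_cpu_pct / 100.0) * num_physical_cpus * cpu_mhz_rating
     cost_mhz_per_mbps = used_mhz / throughput_mbps   # MHz consumed per MBps

     # The same usage can be normalized per I/O instead:
     io_rate_iops = 5000.0      # hypothetical I/O operations per second
     cost_mhz_per_io = used_mhz / io_rate_iops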
Performance Data

   First set: Relative to baselines in ESX 3.0.x
   Second set: Comparison of storage options using Fibre
   Channel data as the baseline
   Last: VMFS vs. RDM physical
Software iSCSI – Throughput Comparison to 3.0.x (higher is better)

Software iSCSI – Latency Comparison to 3.0.x (lower is better)

Software iSCSI – CPU Cost Comparison to 3.0.x (lower is better)
Software iSCSI – Performance Summary

   Lower CPU costs
    Can lead to higher throughput for small IO sizes when CPU is
    pegged
    CPU costs per IO also greatly improved for larger block sizes
   Latency is lower
    Especially for smaller data sizes
    Read operations benefit most
   Throughput levels
    Dependent on workload
      Mixed read-write patterns show most gain
      Read I/Os show gains for small data sizes
Hardware iSCSI – Throughput Comparison to 3.0.x (higher is better)

Hardware iSCSI – Latency Comparison to 3.0.x (lower is better)

Hardware iSCSI – CPU Cost Comparison to 3.0.x (lower is better)
Hardware iSCSI – Performance Summary

   Lower CPU costs
    Results in higher throughput levels for small IO sizes
    CPU costs per IO are especially improved for larger data sizes
   Latency is better
    Smaller data sizes show the most gain
    Mixed read-write and read I/Os benefit more
   Throughput levels
    Dependent on workload
      Mixed read-write patterns show most gain for all block sizes
      Pure read and write I/Os show gains for small block sizes
NFS – Performance Summary

   Performance also significantly improved in ESX 3.5
   Data not shown here in the interest of time
Protocol Comparison

   Which storage option to choose?
    IP Storage vs. Fibre Channel
   How to read the charts?
    All data is presented as ratio to the corresponding 2Gb FC (Fibre
    Channel) data
    If the ratio is 1, the FC and IP protocol data is identical; if < 1, FC
    data value is larger
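   A small worked example (hypothetical values) of how the ratios in the
   following charts are formed:

     # Ratio of an IP-storage result to the 2Gb FC baseline (values made up).
     fc_throughput_mbps    = 180.0
     iscsi_throughput_mbps = 162.0

     ratio = iscsi_throughput_mbps / fc_throughput_mbps   # 0.9
     # ratio == 1.0 -> identical to FC; ratio < 1.0 -> the FC value is larger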
Comparison with FC: Throughput (ratio < 1 means the FC value is larger)

Comparison with FC: Latency (lower is better)
VMFS vs. RDM

  Which one has better performance?
  Data shown as ratio to RDM physical
VMFS vs. RDM-physical: Throughput (higher is better)

VMFS vs. RDM-physical: Latency (lower is better)

VMFS vs. RDM-physical: CPU Cost (lower is better)
Topics

   General Performance Data and Comparison
    Improvements in ESX 3.5 over ESX 3.0.x
   Performance Best Practices
   Troubleshooting Techniques
    Basic methodology
    Tools
    Case studies
Pre-Deployment Best Practices: Overview
 Understand the performance capability of your
  Storage server/array
  Networking hardware and configurations
  ESX host platform

 Know your workloads

 Establish performance baselines
Pre-Deployment Best Practices (1 of 4)
 Storage server/array: a complex system by itself
  Total spindle count
  Number of spindles allocated for use
  RAID level and stripe size
  Storage processor specifications
  Read/write cache sizes and caching policy settings
    Read-Ahead, Write-Behind, etc.
  Useful sources of information:
    Vendor documentation: manuals, best practice guides, white papers, etc.
    Third-party benchmarking reports
    NFS-specific tuning information: SPEC-SFS disclosures at http://www.spec.org
Pre-Deployment Best Practices (2 of 4)
 Networking
   Routing topology and path configurations: # of links in between, etc.
   Switch type, speed and capacity
   NIC brand/model, speed and features
   H/W iSCSI HBAs
 ESX host
   CPU: revision, speed and core count
   Architecture basics
     SMP or NUMA?
       Disabling NUMA is not recommended
     Bus speed, I/O subsystems, etc.
   Memory configuration and size
      Note: NUMA nodes may not have equal amounts of memory
Pre-Deployment Best Practices (3 of 4)
 Workload characteristics
  What are the smallest, largest and most common I/O sizes?
  What is the read%? write%?
  Is access pattern sequential? random? mixed?
  Is response time more important, or aggregate throughput?
  Is response time variance an issue?
  Important: know the peak resource usage, not just the average
Pre-Deployment Best Practices (4 of 4)
 Establish performance baselines by running standardized
 benchmarks
   What’s the upper-bound IOps for small I/Os?
   What’s the upper-bound MBps?
   What’s the average/worst case response time?
   What’s the CPU cost of doing I/O?
Additional Considerations (1 of 3)
   NFS parameters
    # of NFS mount points
      Multiple VMs using multiple mount points may give higher aggregate
      throughput with slightly higher CPU cost
     Export options on the NFS server affect performance
   iSCSI protocol parameters
     Header digest processing: slight impact on performance
     Data digest processing: turning off may result in
      Improved CPU utilization
      Slightly lower latencies
      Minor throughput improvement
      Actual outcome highly dependent on workload
Additional Considerations (2 of 3)

   NUMA specific
    If only one VM is doing heavy I/O, it may be beneficial to pin the VM and
    its memory to node 0
       If CPU usage is not a concern, no pinning is necessary
    On each VM reboot, ESX Server will place it on the next adjacent NUMA
    node
       Minor performance implications for certain workloads
       To avoid this movement, affinitize the VM using the VI Client
   SMP VMs
    For I/O workloads within an SMP VM that migrate frequently between
    VCPUs
      Pin the guest thread/process to a specific VCPU (see the sketch below)
    Some versions of Linux have a kHz timer rate and may incur high overhead
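   A minimal sketch (not from the slides) of pinning an I/O-heavy process to
   one virtual CPU from inside a Linux guest; os.sched_setaffinity is
   Linux-only, and the choice of vCPU 0 is a hypothetical example:

     # Pin the current process to vCPU 0 inside a Linux guest (illustrative).
     import os

     target_vcpu = 0                         # hypothetical choice of virtual CPU
     os.sched_setaffinity(0, {target_vcpu})  # pid 0 = the current process
     print("running on CPUs:", os.sched_getaffinity(0))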
Additional Considerations (3 of 3)

    CPU headroom
     Software-initiated iSCSI and NFS protocols can consume a
     significant amount of CPU under certain I/O patterns
        Small I/O workloads require a large amount of CPU; ensure that CPU
        saturation does not restrict the I/O rate
    Networking
     Avoid link over-subscription
     Ensure all networking parameters, and even basic gigabit
     connectivity, are consistent across the full network path
     Use VLANs or zoning intelligently to minimize traffic interference
General Troubleshooting Tips (1 of 3)

   Identify
     Components in the whole I/O path
     Possible issues at each layer in the path
   Check all hardware & software configuration parameters, in particular
     Disk configurations and cache management policies on storage
     server/array
     Network settings and routing topology
   Design experiments to isolate problems, such as:
     Cached runs
       Use a small file or logical device, or a physical host configured with
       RAM disks, to minimize physical disk effects
       Indicates the upper-bound throughput and I/O rate achievable
General Troubleshooting Tips (2 of 3)

    Run tests with a single outstanding I/O
       Easier analysis of packet traces
       Throughput is entirely dependent on I/O response time (e.g., one
       outstanding 64KB I/O at 1ms latency yields roughly 64 MBps)
    Micro-benchmark each layer in the I/O path
    Compare to non-virtualized, native performance results
   Collect data
    Guest OS data: But don’t trust the CPU%
    Esxtop data
    Storage server/array data: Cache hit ratio, storage processor busy%, etc.
    Packet tracing with tools like TCPdump, Ethereal, Wireshark, etc.
General Troubleshooting Tips (3 of 3)

   Analyze performance data
     Do any stats, e.g., throughput or latency, change drastically over time?
     Check esxtop data for anomalies, e.g., CPU spikes or excessive
     queueing
     Server/array stats
       Compare array stats with ESX stats
       Is cache hit ratio reasonable? Storage processor overloaded?
   Network trace analysis
     Inspect packet traces to see whether
       NFS and iSCSI requests are processed in a timely manner
       I/O sizes issued by the guest match the transfer sizes over the wire
       Block addresses are aligned to appropriate boundaries (see the sketch
       below)
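   A minimal sketch of the last two checks above, applied to (offset, size)
   pairs extracted from a packet trace; the 4KB boundary, the expected guest
   I/O size and the sample requests are all assumptions for illustration:

     # Hypothetical I/O requests (byte offset, size in bytes) from a trace.
     requests = [(0, 65536), (65536, 65536), (4608, 4096)]

     GUEST_IO_SIZE = 65536   # size the guest believes it issued (assumption)
     ALIGNMENT = 4096        # boundary to check against (assumption)

     for offset, size in requests:
         if size != GUEST_IO_SIZE:
             print(f"offset {offset}: size {size} differs from guest I/O size")
         if offset % ALIGNMENT != 0:
             print(f"offset {offset}: not aligned to {ALIGNMENT}-byte boundary")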
Isolating Performance Problems: Case
Study#1 (1 of 3)
 Symptoms
   Throughput can reach Gigabit wire speed doing 128KB sequential
   reads from a 20GB LUN on an iSCSI array with 2GB cache
   Throughput degrades for larger data sizes beyond 128KB

 From esxtop data
   CPU utilization is also lower for I/O sizes larger than 128KB
   CPU cost per I/O is in expected range for all I/O sizes
Isolating Performance Problems: Case
Study#1 (2 of 3)
 From esxtop or benchmark output
  I/O response times are in the 10 to 20ms range for the problematic I/Os
    Indicates constant physical disk activity is required to serve the reads
 From network packet traces
  No retransmissions or packet loss observed, indicating no networking issue
  Packet timestamps indicate that the array takes 10ms to 20ms to respond to
  a read request, with no delay in the ESX host
 From cached run results
  No throughput degradation above 128KB!
  Problem exists only for file sizes exceeding cache capacity
    Array appears to have cache-management issues with large sequential reads
Isolating Performance Problems: Case
Study#1 (3 of 3)
 From native tests to same array
   Same problem observed
 From the administration GUI of the array
   Read-ahead policies set to highly aggressive
   Is the policy appropriate for the workload?
 Solution
   Understand performance characteristics of the array
     Experiment with different read-ahead policies
     Try turning off read-ahead entirely to get the baseline behavior
Isolating Performance Problems: Case
Study#2 (1 of 4)
 Symptoms
   1KB random write throughput is much lower (< 10%) than that of
   sequential writes to a 4GB vmdk file located on an NFS server
     Even after an extensive warm-up period
   But very little difference in performance between random and
   sequential reads
 From NFS server spec
   3GB read/write cache
     Most data should be in cache after warming up
Isolating Performance Problems: Case
Study#2 (2 of 4)
 From esxtop and application/benchmark data
   CPU% utilization lower but CPU cost per I/O mostly same
   regardless of randomness
    Not likely a client side (i.e., ESX host) issue
   Random write latency in the 20ms range
   Sequential write < 1ms
 From NFS server stats
   cache hit% much lower for random writes, even after warm-up
Isolating Performance Problems: Case
Study#2 (3 of 4)
 From cached runs to a 100MB vmdk
   Random write latency almost matches sequential write
    Again, suggests that issue is not in ESX host
 From native tests
   Random and sequential write performance is almost same
 From network packet traces
   Server responds to random writes in 10 to 20ms, sequential
   writes in <1ms
   Offset in NFS WRITE requests is not aligned to power-of-2
   boundary
    Packet traces from native runs show correct alignment
Isolating Performance Problems: Case
Study#2 (4 of 4)
 Question
  Why are sequential writes not affected?
 NFS Server file system idiosyncrasies
  Manages cache memory at 4KB granularity
  Old blocks are not updated in place; writes go to new blocks
  Each < 4KB write incurs a read from the old block
  Aggressive read-ahead masks the read latency associated with
  sequential writes
 Solution
  Use a disk alignment tool in the guest OS to align the disk partition
  Alternatively, use an unformatted partition inside the guest OS
Summary and Takeaways

  IP-based storage performance in ESX is being constantly
  improved; Key enhancements in ESX 3.5:
     Overall storage subsystem
     Networking
     Resource scheduling and management
     Optimized NUMA, multi-core, and large memory support
   IP-based network storage technologies are maturing
     Price/performance can be excellent
     Deployment and troubleshooting could be challenging
     Knowledge is key: server/array, networking, host, etc.
  Stay tuned for further updates from VMware
Questions?


 NFS & iSCSI –
   Performance Characterization and Best Practices in ESX 3.5

 Priti Mishra & Bing Tsai
 VMware

				