VMFS Modes and ESX Server Locking in ESX Server 2.1
By Jay Judkowitz 2/24/03

Acknowledgements:
Many thanks to Satyam Vaghani for patiently supplying almost all of this information and for
proofreading it.

Fundamental problem addressed by this tech note:
Many enterprise applications need to share disk between multiple operating systems directly
through multi-headed SCSI arrays or SANs, without the benefit of a fileserver in the middle
negotiating locking and contention issues. Some such applications are MSCS, Oracle 9i RAC,
Oracle 10g, and VCS. As ESX Server matures and moves more into the production datacenter,
these applications are now being run in ESX Server VMs, which increases the demands on
ESX Server's disk locking mechanisms. Furthermore, ESX Server now has the capability to
do VMotion with the help of VirtualCenter. Since VMotion does not move the disk, but rather
hands it off from one ESX Server to another, shared physical media between ESX Servers is
now an even more fundamental need.
Shared disk is a problem that affects many hardware platforms, operating systems, filesystems
and applications. Enabling multiple machines to access shared disk – either one at a time or
concurrently – poses a fundamental difficulty. When the filesystem/disk structure is modified by
one machine, the other machines sharing the disk will not know about the modification. They will
continue to work as if the change had not been made, and will become confused and/or corrupt the
disk/filesystem structure.
With ESX Server, this problem is doubly complicated. The traditional problem, multiple physical
ESX Servers needing to manage shared physical disks, still exists. In addition, the virtual
machines also have to manage access to their shared virtual disks.
Note: This tech note, whether discussing the ESX Servers or the virtual machines, refers to
storage attached directly or through a SAN, not over a network. Network attached storage has
built-in procedures to manage concurrent access from multiple systems on a network. Direct
attached storage offers lower overhead and higher performance, but lacks an inherent capability
to manage concurrent access from multiple systems.

Differences in this document from the 1.5.2 and 2.0 documents:
     -  The underlying mechanisms of disk locking became more complicated from 1.5.2 to 2.0 and
        are even more complex in 2.1. It has gotten to the point where explaining all the details
        is neither feasible nor advisable. It would likely confuse us all and help us to confuse our
        customers and partners. Furthermore, it would risk giving away patent-pending secrets.
        Therefore this tech note will not be anywhere near as thorough as the two preceding
        documents. However, for those interested in all the gory details, Satyam Vaghani is
        working on an engineering-quality document that we will all have access to.
     -  This document will focus solely on:
             o Variables that are under the control of an ESX Server administrator.
             o Which combinations of settings ought to be used to meet a variety of customer use
                  cases.
             o Caveats.

General Architecture:
    -  Virtual Disks are files to the vmkernel. The vmkernel controls all I/O from the VMs to their
       respective virtual disks. It also controls all I/O from the Service Console to the virtual
       disks. The vmkernel is all-knowing regarding I/O on a single ESX Server, but the
       vmkernels of separate ESX Servers do not communicate and only know what others are
       doing via persistent data in the form of files or filesystem metadata on the shared disk
       itself.
    -  Virtual disk files live on VMFS filesystems. In ESX Server 2.1, the recommended
       filesystem is VMFS version 2. VMFS is a filesystem like ext3 on Linux or NTFS on
       Windows. However, one major difference is that VMFS version 2 is a distributed
       filesystem. That is to say, unlike ext3 and NTFS, multiple hosts can, within certain limits,
       read and write to the same filesystem.
    -  Logical Unit Numbers (LUNs) – LUNs are the physical disks themselves that hold the
       VMFSs. The LUNs are controlled by the SCSI protocol. Its locking mechanism (SCSI-2
       reservations) operates at a lower, coarser level than the locking the vmkernel does on
       either files or VMFSs.
    -  LUNs can be dedicated to VMs entirely without first being formatted with a VMFS. The
       VM itself then formats the LUN with ext3 or NTFS. Previously, raw disks were outside the
       context of the disk locking mechanisms that governed virtual disk files. With ESX Server
       2.0.1, we have raw disk mappings (RDMs), which are essentially links to raw disks from
       within a VMFS, making the raw disks look and behave like virtual disk files stored on the
       VMFS containing the link. This will eventually allow VMotion of raw-disk VMs and also
       makes all of the locking concepts discussed in this document equally valid for both raw
       disks and virtual disk files (assuming the raw disk mapping is used – RDMs are
       recommended, but not mandatory). Besides VMotion, the locking on raw disks enabled
       by RDMs allows undoable raw disks and backups of raw disks as files. (These storage
       objects are sketched just below.)
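To make the relationships above concrete, here is a minimal illustrative sketch (Python pseudocode,
not VMware code) of the storage objects just described: a LUN, a VMFS in public or shared mode
holding virtual disk files, and an RDM as a link from a VMFS to a raw LUN. All class names,
attribute names, and the LUN naming in the example are hypothetical.

    class LUN:
        """A physical disk or SAN LUN as seen by the vmkernel."""
        def __init__(self, name):
            self.name = name

    class VMFS:
        """A VMFS-2 volume; 'mode' is the on-disk locking mode discussed below."""
        def __init__(self, label, mode="public"):
            assert mode in ("public", "shared")
            self.label = label
            self.mode = mode
            self.files = {}          # virtual disk files and RDM links, by name

        def add_virtual_disk(self, filename):
            self.files[filename] = {"type": "virtual disk"}

        def add_rdm(self, filename, raw_lun):
            # An RDM is a link stored in the VMFS that makes a raw LUN look and
            # behave like a virtual disk file for locking purposes.
            self.files[filename] = {"type": "rdm", "raw lun": raw_lun.name}

    # Hypothetical usage: one public VMFS holding a virtual disk and an RDM link.
    vmfs = VMFS("vms", mode="public")
    vmfs.add_virtual_disk("web01.vmdk")
    vmfs.add_rdm("quorum.vmdk", LUN("vmhba1:0:3:0"))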

Solution:
ESX Server provides three types of locking – all for different permutations of the shared disk
problem. Different combinations of the three types of locks are needed to accomplish various
practical tasks.

    1) File Locking – File locks are held both in the vmkernel, to handle contention among
       multiple VMs and the Service Console on the same system, and on the VMFS itself,
       to handle contention between separate ESX Servers. Different use cases will cause
       different file locks to be set and unset. All of these locks are set automatically by the
       vmkernel, and there are no administrator settings, so no more will be said about the
       subject in this tech note.

    2) On-disk VMFS locking – Locking the VMFS is done on a per-physical-machine basis, not
       on a per-virtual-machine basis. Just as virtual machines need locks on their virtual disks
       so that they don't get confused by changes made to their internal filesystems that they
       did not initiate, ESX Servers need locks on their physical disks so that they do not get
       confused by changes made to their filesystems (VMFSs) that they did not initiate.
       There are two settings that select different underlying locking schemes to accomplish
       different tasks.
            a. Public
                       i. This is the default VMFS2 state. Almost all virtual disk files are stored in
                          public VMFSs.
                      ii. It allows multiple ESX Servers to access the same filesystem
                          simultaneously as long as they are accessing separate files.
                      iii. The concurrent read/write access to filesystem metadata (as when files
                           are created, expanded, locked or unlocked) by separate ESX Servers is
                           governed by short-term SCSI-2 reservations. Basically, a vmkernel
                           needing to write to the filesystem metadata briefly locks the entire LUN
                           holding the first extent of the VMFS, does the metadata write, and
                           releases the LUN (see the metadata-locking sketch after this list).
                                 1. This fact is the cause of most of the concurrency limitations of
                                    ESX Servers with respect to VMFS.
                                 2. This fact also leads to the critical warning – NEVER PLACE
                                    MORE THAN ONE PUBLIC VMFS VOLUME PER LUN. One
                                    detail is that you can have multiple physical extents on a LUN as
                                    long as they are combined into the same single VMFS. This
                                    use case could occur if a LUN is expanded, and the VMFS
                                    needs to then be expanded to fill the LUN. While dynamic LUN
                                    growth is not yet supported, the locking protocols will allow this
                                    use case when it is supported.
               iv. Since the SCSI-2 reservations used to implement the per-file or VMFS-
                    wide locks are the very same ones used in a shared disk clustering
                    scenario between VMs residing on separate ESX Servers, public VMFSs
                    cannot be used to store shared data or quorum disks for these types of
                    clusters.
        b. Shared
                 i. This mode exists specifically to hold data and quorum disks for VMs
                    clustered across separate ESX Servers using shared disk clustering that
                    relies on SCSI-2 reservations. As stated previously, the file locking
                    implementation of the public VMFS makes it unusable for this purpose.
               ii. One side effect of the locking mechanisms of shared VMFS is that redo
                   logs may never be stored on a shared VMFS.

3) SCSI-2 Reservations – applied on a per-LUN basis.
      a. The SCSI reservation locks an entire LUN for access regardless of what is on it.
          SCSI reservations are used by various clustering products, including Microsoft
          Cluster Server (MSCS). These products expect to be able to lock entire LUNs
          from within virtual machines as they would from within physical machines.
      b. It is important to note that SCSI reservations are allowed (on behalf of the VMs)
          by the vmkernel but not initiated by the vmkernel. The SCSI reservations must be
          requested by the clustering application within the VM, such as MSCS. The type
          of SCSI reservation allowed (virtual or physical) is determined by the virtual
          SCSI bus sharing setting (virtual or physical) configured when the VM is created
          (see the reservation-routing sketch after this list). Note: the vmkernel does
          initiate its own SCSI reservations as well, for the purpose of public VMFS locking.
      c. There are two types of SCSI reservations:
                 i. Virtual
                         1. The SCSI reservation is virtualized and maintained as a file
                            attribute by VMFS. This prevents multiple VMs on the same
                            machine from accessing the same virtual LUN (i.e. virtual disk
                            file).
                                  a. Public VMFSs are fine for use here
                         2. The clustering application requests a SCSI reservation and the
                            vmkernel passes this reservation on to the virtual disk controller,
                            locking the virtual disk file from access by other VMs on the
                            same physical machine.
                ii. Physical
                         1. The SCSI reservation is held on the physical disk controller.
                            This prevents multiple physical machines from accessing the
                            same physical LUN regardless of how many virtual disks it
                            contains.
                                  a. Shared VMFS is required here.
                         2. The clustering application requests a SCSI reservation and the
                            vmkernel passes this reservation on to the physical disk
                            controller, locking the physical LUN from access by any other
                            physical machine.
                         3. Setting the virtual SCSI bus sharing to “physical” is what
                            allows the monitor to pass the SCSI reservation through to the
                            physical disk controller.
                         4. Since physical SCSI reservations are set on entire physical
                            LUNs, please make sure to place only one VMFS on a LUN that
                            may get reserved by a clustering application. Only put virtual
                            disks on that VMFS that will fail over from a VM on one ESX
                            Server to a VM on another ESX Server together at the same
                            time. When in doubt, put only one virtual disk on a VMFS that
                            resides on a LUN that may become reserved.
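
The short-term SCSI-2 reservation sequence used for public VMFS metadata updates (2.a.iii above)
can be summarized in a minimal sketch. This is illustrative Python pseudocode, not VMware code;
the function names and LUN name are hypothetical, and the point is only the reserve / write
metadata / release ordering and why it briefly blocks the entire LUN.

    def scsi2_reserve(lun):
        print("SCSI-2 reserve issued for LUN " + lun)

    def scsi2_release(lun):
        print("SCSI-2 release issued for LUN " + lun)

    def update_public_vmfs_metadata(lun, apply_change):
        """Apply a metadata change (create/expand/lock/unlock a file) to the
        public VMFS whose first extent lives on 'lun'."""
        scsi2_reserve(lun)       # briefly lock the entire physical LUN
        try:
            apply_change()       # perform the metadata write
        finally:
            scsi2_release(lun)   # release the LUN as soon as the write is done

    # Because the reservation covers the whole LUN, anything else on that LUN is
    # blocked for the duration -- hence the warning to place only one public
    # VMFS volume per LUN.
    update_public_vmfs_metadata("vmhba0:0:1:0", lambda: print("create web01.vmdk"))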
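
Likewise, the routing of a SCSI reservation requested by a clustering application inside a VM
(section 3 above) depends on the virtual SCSI bus sharing setting chosen when the VM was created.
The following illustrative Python sketch, with hypothetical function and value names, shows that
decision for the three bus sharing choices discussed in this note (none, virtual, physical).

    def route_guest_scsi_reservation(bus_sharing, target):
        """Decide where a guest-initiated SCSI reservation takes effect."""
        if bus_sharing == "none":
            # No bus sharing: the disk is not shared, so there is nothing for a
            # reservation to arbitrate between VMs.
            return "no shared bus; reservation not applicable"
        if bus_sharing == "virtual":
            # Virtual reservation: kept as a file attribute on the VMFS, locking
            # the virtual disk file against other VMs on the same ESX Server.
            return "virtual reservation recorded on virtual disk file " + target
        if bus_sharing == "physical":
            # Physical reservation: passed through to the physical disk
            # controller, locking the entire LUN against other physical machines.
            return "SCSI-2 reservation passed through to physical LUN " + target
        raise ValueError("unknown bus sharing mode: " + bus_sharing)

    print(route_guest_scsi_reservation("physical", "vmhba1:0:3:0"))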

Common situations, the right settings, and explanations (summarized in a sketch after the list):
   1) Standalone VMs on ESX Servers without SAN disk – Public VMFS + No Bus Sharing.
          a. Since only one ESX Server has access to the VMFS, the question of shared
              VMFS does not even arise.
           b. Since the VMs are not sharing virtual disks, neither virtual nor physical SCSI
               reservations are needed.
   2) Shared nothing cluster on a single physical ESX Server – Public VMFS + No Bus Sharing
          a. Since only one ESX Server has access to the VMFS, the question of shared
              VMFS does not even arise.
           b. Since the VMs are not sharing virtual disks, neither virtual nor physical SCSI
               reservations are needed.
   3) Shared disk cluster on a single physical ESX Server – Public VMFS + Virtual Bus
      Sharing
          a. Since only one ESX Server has access to the VMFS, the question of shared
              VMFS does not even arise.
          b. Since the clustering software needs to be able to ensure that the virtual disk is
              accessed by only one VM at a time, SCSI reservations are needed to protect the
              virtual disk. Since all the clustered VMs are on the same ESX Server, virtual
              SCSI reservations, which are maintained in the vmkernel and VMFS, are
              sufficient.
   4) Cluster with VMs residing on multiple physical machines where the active VM is switched
      between hosts by a script (like in VCS for ESX Server) or manually – Public VMFS + No
      Bus Sharing
           a. Since only one VM (on a single ESX Server) should have access to the disk at a
               given time (the other being shut down or suspended), public VMFS is mandatory.
          b. No SCSI reservations – This is a cheap cluster that does not implement SCSI
              reservations by design. There is no case where two VMs are powered on using
              the same disk simultaneously.
   5) Shared disk cluster with VMs residing on multiple ESX Servers – Shared VMFS +
      Physical Bus Sharing
          a. Since multiple VMs on multiple ESX Servers need to be able to attempt to
              access the virtual disk simultaneously, shared VMFS is the only way to go.
          b. Since the clustering software should be able to ensure that the virtual disk is
              accessed by only one VM (from one ESX Server) at a time, SCSI reservations
              are needed to protect the virtual disk. Since the clustered VMs reside on multiple
              ESX Servers, physical SCSI reservations, which are maintained on the physical
               disk controller, are needed. Maintaining SCSI reservations in the individual
               vmkernels, which are not aware of each other, is insufficient.
   6) Shared disk cluster between a physical machine and a VM – Raw Disk (or RDM in a
      public VMFS) + Physical bus sharing
          a. Raw disk is necessary so that the physical machine and the virtual machine can
              read the same filesystem format.
           b. Since the clustering software should be able to ensure that the raw LUN is
               accessed by only one machine at a time, SCSI reservations are needed to
               protect it. Since one of the machines is a physical machine outside of any ESX
               Server, physical SCSI reservations, which are maintained on the physical disk
               controller, are needed. Maintaining SCSI reservations in the vmkernel, which the
               physical machine is not aware of, is insufficient.
   7) Standalone VMs on ESX Servers with SAN disk – Public VMFS + No Bus Sharing
          a. Since virtual disks files are not being accessed by multiple VMs on separate ESX
              Servers, public VMFS is appropriate.
           b. Since the VMs are not sharing virtual disks, neither virtual nor physical SCSI
               reservations are needed.
8) Multiple VMs (possibly from separate ESX Servers) in non-persistent mode using the
    same base disk – Public VMFS + No bus Sharing
        a. Public VMFS allows concurrent access to the VMFS so that the base disk can be
              read by VMs on separate ESX Servers.
         b. No bus sharing is needed since this is not a clustered environment and SCSI
               reservations are not necessary.
9) Multiple VMs (possibly from separate ESX Servers) in undoable or append mode using
    the same base disk – Impossible
        a. This use case is not possible due to the various layers of locking.
        b. Allowing this use case would risk corrupting the base disk and crashing n-1 VMs.
10) Multiple VMs (possibly from separate ESX Servers) in any mode using different redo logs
    as their base disks while maintaining the same actual base disk – Public VMFS + No bus
    Sharing
        a. Public VMFS allows the concurrent access to the VMFS so that the actual base
              disk can be read by VMs on separate ESX Servers.
         b. No bus sharing is needed since this is not a clustered environment and SCSI
               reservations are not necessary.
11) Dynamic redo logs on virtual disk files with 3rd-party systems backing up the base virtual
     disk file – Cannot be done in 2.1.
12) Dynamic redo logs applied to raw disk mappings with 3rd-party systems backing up the
     raw LUN (not via the raw disk mapping) – Public VMFS (for the raw disk mapping) + No
     Bus Sharing
         a. Since this is not a clustered environment, shared VMFS is unnecessary.
         b. No bus sharing is needed since this is not a clustered environment and SCSI
               reservations are not necessary.
         c. One risk of this scheme is that while reading the base disk is fine so long as the
               dynamic redo log is in place, writing to the raw LUN would be disastrous. No
               writes could safely happen from the 3rd-party system unless there is no
               dynamic redo log in place at the time and the VM is powered off.
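
As a summary, the recommendations above collapse into a simple mapping from use case to a
(VMFS mode, virtual SCSI bus sharing) pair. The sketch below is illustrative Python, not a
supported tool; the use-case labels are paraphrased from the list above, and the cases that are
impossible or cannot be done in 2.1 (9 and 11) are omitted.

    RECOMMENDED_SETTINGS = {
        "standalone VMs (local or SAN disk)":                      ("public", "none"),
        "shared-nothing cluster on one ESX Server":                ("public", "none"),
        "shared-disk cluster on one ESX Server":                   ("public", "virtual"),
        "failover cluster across ESX Servers (script or manual)":  ("public", "none"),
        "shared-disk cluster across ESX Servers":                  ("shared", "physical"),
        "physical-to-virtual shared-disk cluster":                 ("raw disk or RDM on public", "physical"),
        "non-persistent VMs sharing one base disk":                ("public", "none"),
        "VMs with separate redo logs over one base disk":          ("public", "none"),
        "redo logs on RDMs with 3rd-party backup of the raw LUN":  ("public", "none"),
    }

    def recommend(use_case):
        vmfs_mode, bus_sharing = RECOMMENDED_SETTINGS[use_case]
        return "VMFS mode: " + vmfs_mode + "; virtual SCSI bus sharing: " + bus_sharing

    print(recommend("shared-disk cluster across ESX Servers"))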

				