VMFS Modes and ESX Server Locking in ESX Server 2.1
By Jay Judkowitz 2/24/03

Acknowledgements: Many thanks to Satyam Vaghani for patiently supplying almost all of this information and for proofreading it.

Fundamental problem addressed by this tech note:
Many enterprise applications need to share disk between multiple operating systems directly, through multi-headed SCSI arrays or SANs, without the benefit of a fileserver in the middle negotiating locking and contention issues. Examples include MSCS, Oracle 9i RAC, Oracle 10g, and VCS. As ESX Server matures and moves further into the production datacenter, these applications are now being run in ESX Server VMs, which raises the requirements on ESX Server's disk locking mechanisms. Furthermore, ESX Server now has the capability to do VMotion with the help of VirtualCenter. Since VMotion does not move the disk, but rather hands it off from one ESX Server to another, shared physical media between ESX Servers is now an even more fundamental need.

Shared disk is a problem that affects many hardware platforms, operating systems, filesystems, and applications. Enabling multiple machines to access shared disk – either one at a time or concurrently – poses a fundamental difficulty. When the filesystem or disk structure is modified by one machine, the other machines sharing the disk will not know about the modification. They will continue to work as if the change had not been made, and will get confused and/or corrupt the disk or filesystem structure. With ESX Server, this problem is doubly complicated. The traditional problem, multiple physical ESX Servers needing to manage shared physical disks, still exists. In addition, the virtual machines also have to manage access to their shared virtual disks.

Note: This tech note, whether discussing the ESX Servers or the virtual machines, refers to storage attached directly or through a SAN, not over a network. Network attached storage has built-in procedures to manage concurrent access from multiple systems on a network. Direct attached storage offers lower overhead and higher performance, but lacks any inherent capability to manage concurrent access from multiple systems.

Differences in this document from the 1.5.2 and 2.0 documents:
The underlying mechanisms of disk locking got more complicated from 1.5.2 to 2.0 and are even more complex in 2.1. It has gotten to the point where explaining all the details is neither feasible nor advisable. It would likely confuse us all and help us confuse our customers and partners. Furthermore, doing so would give away patent-pending secrets. Therefore this tech note will not be anywhere near as thorough as the two preceding documents. However, for those interested in all the gory details, Satyam Vaghani is working on an engineering-quality document that we will all have access to. This document will focus solely on:
o Variables that are under the control of an ESX Server administrator
o Combinations of settings that ought to be used to meet a variety of customer use cases
o Caveats

General Architecture:
Virtual disks are files to the vmkernel. The vmkernel controls all I/O from the VMs to their respective virtual disks. It also controls all I/O from the Service Console to the virtual disks. The vmkernel is all-knowing regarding I/O on a single ESX Server, but the vmkernels of separate ESX Servers do not communicate; they only know what the others are doing via persistent data, in the form of files or filesystem metadata, on the shared disk itself. Virtual disk files live on VMFS filesystems.

In ESX Server 2.1, the recommended filesystem is VMFS version 2. VMFS is a filesystem like ext3 on Linux or NTFS on Windows. However, one major difference is that VMFS version 2 is a distributed filesystem. That is to say, unlike ext3 and NTFS, multiple hosts can, within certain limits, read and write to the same filesystem.

Logical Unit Numbers (LUNs) – LUNs are the physical disks themselves that hold the VMFSs. The LUNs are controlled by the SCSI protocol, whose locking mechanism (SCSI-2 reservations) operates at a lower and less subtle level than the locking the vmkernel does on either files or VMFSs. LUNs can also be dedicated to VMs entirely, without first being formatted with a VMFS; the VM itself then formats the LUN with ext3 or NTFS. Previously, such raw disks were outside the context of the disk locking mechanisms that governed virtual disk files. With ESX Server 2.0.1, we have raw disk mappings (RDMs), which are essentially links to raw disks from within a VMFS, making the raw disks look and behave like virtual disk files stored on the VMFS containing the link. This will eventually allow VMotion of raw disk VMs, and it also makes all of the locking concepts discussed in this document equally valid for both raw disks and virtual disk files (assuming the raw disk mapping is used – RDMs are recommended, but not mandatory). Besides VMotion, the locking on raw disks enabled by RDMs allows undoable raw disks and backups of raw disks as files.
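To make the RDM idea concrete, here is a minimal Python sketch of an RDM as a VMFS-resident pointer file. All names here (RawDiskMapping, VMFSVolume, the vmhba-style LUN name) are invented for illustration and do not correspond to any real ESX Server interface; the point is only that lock state lives on the mapping file in the VMFS, while I/O resolves to the raw LUN.

    # Illustrative model only; no real ESX Server data structures are shown.
    class RawDiskMapping:
        """A VMFS-resident pointer file standing in for a raw LUN."""
        def __init__(self, lun_id):
            self.lun_id = lun_id      # the physical LUN the mapping resolves to
            self.locked_by = None     # file-level lock state lives on the
                                      # mapping file, not on the raw LUN itself

    class VMFSVolume:
        def __init__(self):
            self.files = {}           # filename -> RawDiskMapping
                                      # (virtual disk files would be modeled similarly)

        def open_disk(self, name, vm):
            entry = self.files[name]
            # Because the RDM is an ordinary VMFS file, the same file-locking
            # (and redo-log, and file-backup) machinery applies to raw disks.
            if entry.locked_by not in (None, vm):
                raise IOError(name + " is locked by " + entry.locked_by)
            entry.locked_by = vm
            return entry.lun_id       # I/O is directed at the raw LUN

    vmfs = VMFSVolume()
    vmfs.files["quorum.rdm"] = RawDiskMapping(lun_id="vmhba1:0:3:0")
    print(vmfs.open_disk("quorum.rdm", vm="node1"))   # -> vmhba1:0:3:0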
Solution:
ESX Server provides three types of locking, each aimed at a different permutation of the shared disk problem. Different combinations of the three are needed to accomplish various practical tasks.

1) File locking – File locks are held both in the vmkernel, to handle contention between multiple VMs on the same system and the Service Console, and on the VMFS itself, to handle contention between separate ESX Servers. Different use cases will cause different file locks to be set and unset. All of these locks are set automatically by the vmkernel, and there are no administrator settings, so no more will be said about them in this tech note.

2) On-disk VMFS locking – Locking the VMFS is done on a per physical machine basis, not on a per virtual machine basis. Just as virtual machines need locks on their virtual disks so that they are not confused by changes made to their internal filesystems that they did not initiate, ESX Servers need locks on their physical disks so that they are not confused by changes made to their filesystems (VMFS) that they did not initiate. There are two settings that select different underlying locking schemes for different purposes.
   a. Public
      i. This is the default VMFS-2 state. Almost all virtual disk files are stored in public VMFSs.
      ii. It allows multiple ESX Servers to access the same filesystem simultaneously, as long as they are accessing separate files.
      iii. Concurrent read/write access to filesystem metadata (as when files are created, expanded, locked, or unlocked) by separate ESX Servers is governed by short-term SCSI-2 reservations. Essentially, a vmkernel needing to write to the filesystem metadata briefly locks the entire LUN holding the first extent of the VMFS, does the metadata write, and releases the LUN. (This reserve/write/release cycle is sketched in code after this list.)
         1. This fact is the cause of most concurrency limitations of ESX Servers on VMFS.
         2. This fact also leads to the critical warning – NEVER PLACE MORE THAN ONE PUBLIC VMFS VOLUME PER LUN.
         One detail: you can have multiple physical extents on a LUN as long as they are combined into the same single VMFS. This use case could occur if a LUN is expanded and the VMFS then needs to be expanded to fill the LUN. While dynamic LUN growth is not yet supported, the locking protocols will allow this use case when it is supported.
      iv. Since the SCSI-2 reservations used to implement the per-file or VMFS-wide locks are the very same ones that would be used in a shared disk clustering scenario between VMs residing on separate ESX Servers, public VMFSs cannot be used to store shared data or quorum disks for these types of clusters.
   b. Shared
      i. Shared VMFS exists specifically to hold data and quorum disks for VMs clustered across separate ESX Servers using shared disk clustering that relies on SCSI-2 reservations. As stated previously, the file locking implementation of the public VMFS makes it unusable for this purpose.
      ii. One side effect of the locking mechanisms of shared VMFS is that redo logs may never be stored on a shared VMFS.

3) SCSI-2 reservations – Set on a per LUN basis.
   a. A SCSI reservation locks an entire LUN for access, regardless of what is on it. SCSI reservations are used by various clustering products, including Microsoft Cluster Server (MSCS). These products expect to be able to lock entire LUNs from within virtual machines as they would from within physical machines.
   b. It is important to note that SCSI reservations are allowed (on behalf of the VMs) by the vmkernel but not initiated by the vmkernel. The SCSI reservations must be requested by the clustering application within the VM, such as MSCS. The type of SCSI reservations allowed (virtual or physical) is determined by the setting of the virtual SCSI bus sharing (virtual or physical) as configured when creating the VM. Note: the vmkernel does initiate its own SCSI reservations as well, for the purpose of public VMFS locking.
   c. There are two types of SCSI reservations (both routing paths are sketched in code after this list):
      i. Virtual
         1. The SCSI reservation is virtualized and maintained as a file attribute by VMFS. This prevents multiple VMs on the same machine from accessing the same virtual LUN (i.e., virtual disk file).
            a. Public VMFSs are fine for use here.
         2. The clustering application requests a SCSI reservation, and the vmkernel passes this reservation on to the virtual disk controller, locking the virtual disk file from access by other VMs on the same physical machine.
      ii. Physical
         1. The SCSI reservation is held on the physical disk controller. This prevents multiple physical machines from accessing the same physical LUN, regardless of how many virtual disks it contains.
            a. Shared VMFS is required here.
         2. The clustering application requests a SCSI reservation, and the vmkernel passes this reservation on to the physical disk controller, locking the physical LUN from access by any other physical machine.
         3. Setting the virtual SCSI bus sharing to "physical" is what allows the monitor to pass the SCSI reservation through to the physical disk controller.
         4. Since physical SCSI reservations are set on entire physical LUNs, make sure to place only one VMFS on a LUN that may get reserved by a clustering application. Only put virtual disks on that VMFS that will fail over together, at the same time, from a VM on one ESX Server to a VM on another ESX Server. When in doubt, put only one virtual disk on a VMFS that resides on a LUN that may become reserved.
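As promised in 2.a.iii, here is a minimal sketch of the public-VMFS metadata protocol. The class and function names are invented for illustration, and a Python lock merely stands in for a device-level SCSI-2 reservation; this is not the real vmkernel code path.

    import threading

    class LUN:
        """Stand-in for a physical LUN; the Python lock models the
        short-term SCSI-2 reservation on the device."""
        def __init__(self, name):
            self.name = name
            self.reservation = threading.Lock()

    def update_vmfs_metadata(first_extent_lun, metadata_write):
        # Reserving the whole LUN serializes metadata writers across all
        # hosts. Two public VMFS volumes on one LUN would fight over this
        # same reservation - hence the one-public-VMFS-per-LUN warning.
        with first_extent_lun.reservation:    # RESERVE the LUN
            metadata_write()                  # e.g., create/expand/lock a file
        # reservation released on exit        # RELEASE the LUN

    lun = LUN("vmhba1:0:1:0")
    update_vmfs_metadata(lun, lambda: print("file created in VMFS metadata"))

Note that, per 2.a.ii, only metadata changes take this path; ordinary data I/O to already-open files does not, which is why multiple hosts can run VMs off the same public VMFS concurrently.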
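And here is the sketch of the two pass-through paths from 3.c: where a guest-initiated reservation lands depends on the bus-sharing mode. Again, every name is invented for exposition.

    class VirtualDiskFile:
        def __init__(self):
            self.attrs = {}          # models VMFS file attributes

    class PhysicalLUN:
        def __init__(self):
            self.reserved = False    # models controller-held reservation state

    def handle_guest_reservation(bus_sharing, disk_file, lun):
        if bus_sharing == "virtual":
            # Kept as a VMFS file attribute: fences other VMs on the
            # SAME host away from the virtual disk file (case 3.c.i).
            disk_file.attrs["reserved"] = True
        elif bus_sharing == "physical":
            # Passed through to the controller: fences OTHER HOSTS away
            # from the entire physical LUN (case 3.c.ii).
            lun.reserved = True
        else:
            # No bus sharing configured: guests may not reserve at all.
            raise PermissionError("SCSI reservations not permitted")

    handle_guest_reservation("physical", VirtualDiskFile(), PhysicalLUN())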
Common situations and the right settings, with explanations:
(The settings for cases 1 through 8 are condensed into a small lookup sketched at the end of this section.)

1) Standalone VMs on ESX Servers without SAN disk – Public VMFS + no bus sharing
   a. Since only one ESX Server has access to the VMFS, the question of shared VMFS does not even arise.
   b. Since the VMs are not sharing virtual disks, neither virtual nor physical SCSI reservations are needed.
2) Shared-nothing cluster on a single physical ESX Server – Public VMFS + no bus sharing
   a. Since only one ESX Server has access to the VMFS, the question of shared VMFS does not even arise.
   b. Since the VMs are not sharing virtual disks, neither virtual nor physical SCSI reservations are needed.
3) Shared disk cluster on a single physical ESX Server – Public VMFS + virtual bus sharing
   a. Since only one ESX Server has access to the VMFS, the question of shared VMFS does not even arise.
   b. Since the clustering software needs to be able to ensure that the virtual disk is accessed by only one VM at a time, SCSI reservations are needed to protect the virtual disk. Since all the clustered VMs are on the same ESX Server, virtual SCSI reservations, which are maintained in the vmkernel and VMFS, are sufficient.
4) Cluster with VMs residing on multiple physical machines where the active VM is switched between hosts by a script (as in VCS for ESX Server) or manually – Public VMFS + no bus sharing
   a. Since only one VM (on a single ESX Server) should have access to the disk at a given time (the others being shut down or suspended), public VMFS is mandatory.
   b. No SCSI reservations – this is a cheap cluster that does not implement SCSI reservations by design. There is no case where two VMs are powered on using the same disk simultaneously.
5) Shared disk cluster with VMs residing on multiple ESX Servers – Shared VMFS + physical bus sharing
   a. Since multiple VMs on multiple ESX Servers need to be able to attempt to access the virtual disk simultaneously, shared VMFS is the only way to go.
   b. Since the clustering software should be able to ensure that the virtual disk is accessed by only one VM (from one ESX Server) at a time, SCSI reservations are needed to protect the virtual disk. Since the clustered VMs reside on multiple ESX Servers, physical SCSI reservations, which are maintained on the physical disk controller, are needed. Maintaining SCSI reservations in the individual vmkernels, which are not aware of each other, is insufficient.
6) Shared disk cluster between a physical machine and a VM – Raw disk (or RDM in a public VMFS) + physical bus sharing
   a. A raw disk is necessary so that the physical machine and the virtual machine can read the same filesystem format.
   b. Since the clustering software should be able to ensure that the raw LUN is accessed by only one machine at a time, SCSI reservations are needed to protect it. Since the machines reside on separate hardware, physical SCSI reservations, which are maintained on the physical disk controller, are needed. Maintaining SCSI reservations in the vmkernel, which the physical machine is not aware of, is insufficient.
7) Standalone VMs on ESX Servers with SAN disk – Public VMFS + no bus sharing
   a. Since virtual disk files are not being accessed by multiple VMs on separate ESX Servers, public VMFS is appropriate.
   b. Since the VMs are not sharing virtual disks, neither virtual nor physical SCSI reservations are needed.
8) Multiple VMs (possibly from separate ESX Servers) in non-persistent mode using the same base disk – Public VMFS + no bus sharing
   a. Public VMFS allows concurrent access to the VMFS so that the base disk can be read by VMs on separate ESX Servers.
   b. No bus sharing is needed, since this is not a clustered environment and SCSI reservations are not necessary.
9) Multiple VMs (possibly from separate ESX Servers) in undoable or append mode using the same base disk – Impossible
   a. This use case is not possible due to the various layers of locking.
   b. Allowing this use case would risk corrupting the base disk and crashing n-1 VMs.
10) Multiple VMs (possibly from separate ESX Servers) in any mode using different redo logs as their base disks, while maintaining the same actual base disk – Public VMFS + no bus sharing
   a. Public VMFS allows the concurrent access to the VMFS so that the actual base disk can be read by VMs on separate ESX Servers.
   b. No bus sharing is needed, since this is not a clustered environment and SCSI reservations are not necessary.
11) Dynamic redo logs on virtual disk files with 3rd-party systems backing up the base virtual disk file – Cannot be done in 2.1.
12) Dynamic redo logs applied to raw disk mappings with 3rd-party systems backing up the raw LUN (not via the raw disk mapping) – Public VMFS (for the raw disk mapping) + no bus sharing
   a. Since this is not a clustered environment, shared VMFS is unnecessary.
   b. No bus sharing is needed, since this is not a clustered environment and SCSI reservations are not necessary.
   c. One risk of this scheme is that while reading the base disk is fine so long as the dynamic redo log is in place, writing to the raw LUN would be disastrous. No writes could safely happen from the 3rd-party system unless there were no dynamic redo log at the time and the VM were powered off. (This rule is sketched in code below.)
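To pin down the rule in case 12c, here is a tiny sketch of when 3rd-party access to the raw LUN is safe. The function name and parameters are invented for illustration; the logic is just a restatement of the text.

    def third_party_access_is_safe(vm_powered_off, redo_log_present, is_write):
        """Restates case 12c: reads of the base disk are safe while a
        dynamic redo log absorbs the VM's writes (or the VM is off);
        direct writes are safe only with the VM off and no redo log."""
        if not is_write:
            return redo_log_present or vm_powered_off
        return vm_powered_off and not redo_log_present

    assert third_party_access_is_safe(False, True, False)     # backup read: OK
    assert not third_party_access_is_safe(False, True, True)  # write: disastrous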
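Finally, as promised above, the settings for cases 1 through 8 condense into a small lookup. The short scenario names are invented; the settings are exactly those given in the list.

    # scenario -> (VMFS mode, virtual SCSI bus sharing)
    SETTINGS = {
        "standalone VMs, local disk":            ("public", "none"),
        "shared-nothing cluster, one host":      ("public", "none"),
        "shared disk cluster, one host":         ("public", "virtual"),
        "scripted/manual failover, many hosts":  ("public", "none"),
        "shared disk cluster, many hosts":       ("shared", "physical"),
        "physical-to-virtual cluster":           ("raw disk or RDM on public", "physical"),
        "standalone VMs, SAN disk":              ("public", "none"),
        "non-persistent VMs, shared base disk":  ("public", "none"),
    }

    def recommend(scenario):
        vmfs_mode, bus_sharing = SETTINGS[scenario]
        return "VMFS mode: %s; bus sharing: %s" % (vmfs_mode, bus_sharing)

    print(recommend("shared disk cluster, many hosts"))
    # -> VMFS mode: shared; bus sharing: physical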