Common issues with NFS.LockDisable=1

After seeing a mention on Scott Lowe’s blog (blog.scottlowe.org) and on the Storage Monkeys blog (blogs.storagemonkeys.com), I’ve decided to discuss the issues that I’ve come across when NFS locking is disabled with the NFS.LockDisable=1 advanced setting.

Although the problem can arise from many different circumstances, most of the feedback I’m receiving points to a VMware HA failover (either intentional or unintentional) as the trigger. So I would like to discuss VMware HA and how it works (based on my experience and knowledge).

But before that, let me mention that the end result of having NFS.LockDisable set to 1 is that Virtual Machines can become corrupt. Windows VMs blue screen and give NTFS errors; Linux guests are more resilient and could potentially be fixed by an fsck, but you should always have a good backup regardless. The corruption happens because, with locking disabled, multiple ESX hosts can start the same VMX at the same time. OK, let’s continue…

From what I can see, when you configure VMware HA the first (4) nodes configured are marked as primary, and every host after the fourth is considered a backup node. In the event of an HA failover, the primary nodes will all attempt to start the VMs that were running on the failed node. It appears they rely on the VM file locks to determine whether the VM is actually down or not. What this means is that, with locking disabled, the VM can be powered on multiple times regardless of Isolation Response. In fact, the couple of times this has happened to me I had the same VM running on up to (3) hosts at once.

You can also see some strange behavior in VirtualCenter, such as the number of Virtual Machines registered on each host jumping up and down within seconds. I would look at the summary of one of my hosts and see the Virtual Machine count go from 20 to 35 to 28 to 40 and so on.

The only true way to clear this up is to do the following from the service console:

  1. Run a vmware-cmd -l on each of your hosts within the cluster.
    • Output this data to a file so you can sort it later (e.g. vmware-cmd -l > host1).
  2. Sort those host files together into one master file (e.g. sort host1 host2 host3 > masterVMlist).
  3. View the master VM list and determine which VMX files are registered multiple times.
    • This is the tricky part. If you have many hosts in a cluster it will take some time to find where the VMs are really running, but at least you know which ones are registered multiple times. With that list of multi-registered VMX files, you could write a script that SSHes to each of your ESX hosts, runs vmware-cmd -l, greps for the VMX file, and returns an exit code telling you whether it is there or not. Since I only had (4) nodes in the cluster that failed, this wasn’t necessary for me.
  4. Run ps aux | grep VMX-FILE on the hosts where the VMX is registered to determine the PID.
  5. Use kill -9 PID to kill the running VM process. Magically, the VM will become unregistered on the invalid hosts.
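If you would rather not eyeball the sorted master list in step 3, the comparison can be scripted. The following is only a rough sketch in Python, not something from my original cleanup: it assumes you already saved the output of vmware-cmd -l from each host into files such as host1, host2 and host3 (as in step 1), and the function names load_listings and find_multi_registered are my own hypothetical helpers.

```python
from collections import defaultdict


def load_listings(paths):
    """Read each saved `vmware-cmd -l` output file; the file name is used as the host name."""
    listings = {}
    for path in paths:
        with open(path) as handle:
            listings[path] = handle.read()
    return listings


def find_multi_registered(listings):
    """Return {vmx_path: sorted list of hosts} for every VMX registered on more than one host.

    `listings` maps a host name to the text captured from `vmware-cmd -l`
    on that host (one VMX path per line).
    """
    hosts_by_vmx = defaultdict(set)
    for host, output in listings.items():
        for line in output.splitlines():
            vmx = line.strip()
            if vmx:  # ignore blank lines
                hosts_by_vmx[vmx].add(host)
    return {
        vmx: sorted(hosts)
        for vmx, hosts in hosts_by_vmx.items()
        if len(hosts) > 1
    }
```

Calling find_multi_registered(load_listings(["host1", "host2", "host3"])) would give you each multi-registered VMX together with the hosts that list it, which tells you where to run steps 4 and 5. It only reports registrations; it does not tell you which host should legitimately own the VM.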

OK, so in closing, I do not want to put all the blame on VMware HA. It is the combination of NFS.LockDisable=1 and HA’s restart behavior that causes the potential corruption. The same result can occur by manually registering and starting the same VMX on multiple hosts, since disabling locking removes that added layer of protection.

It is extremely important that you enable NFS Locking by changing NFS.LockDisable back to the default setting of 0. You should also install VMware Patch ESX350-200808401-BG. I discuss the fix of this issue in another posting, which can be found here.


Created on October 18, 2008 by Rick Scherer

Posted under ESX 3.5 Tips, ESXi 3.5 Tips, NetApp, Storage, VMware, VMware HA.

8 Comments so far

  1. Aaron Delp
    4:57 am on November 2nd, 2008

    Hello Rick, so if the patch is applied and locking is enabled, what is the recommended way to set the HA isolation response?

    Is it OK to leave it set to keep the VMs powered on, or should we change it to power them off?

    I think you are saying that as long as locking is enabled we can leave the VMs powered on, but I just wanted to double-check.

    Thank you!

  2. Rick Scherer
    9:32 am on November 2nd, 2008

    The best practice and default is to Power Off Virtual Machines on the isolated host.

    Long story short, it really depends on how comfortable you feel with your environment. HA depends on your network to test if a host is actually offline or not.

    In the event of a network outage (and not a host failure), setting Isolation Response to Power On will make sure your VMs are not shut down. But this may not be good if the network issue is isolated to that one host: since the VMDK lock is not removed, another host cannot start the VM.

    On the flip side, if you do have a host outage and Isolation Response is set to Power On, it really wouldn’t matter: the VM is killed anyway and the lock is removed, so another host will be able to start it. (It is also possible to have a partial system failure where the machine hasn’t fully crashed but the VMDK locks are still in place.)

    Because the default and best practice is to Power Off/Shutdown, I would have to recommend that. If you’re concerned about false-positive network issues, you should really worry about fixing those problems before changing Isolation Response to Power On.

    I currently run with the Shutdown setting, which attempts a graceful guest shutdown first and then kills the VMX if the guest hasn’t responded.

    Another feature coming out next year that will put all my HA worries to rest is VMware Fault Tolerance. If you haven’t heard about this upcoming feature I recommend checking out this demo http://download3.vmware.com/vdcos/demos/FT_Demo_800x600.html

  3. Aaron Delp
    5:59 am on November 3rd, 2008

    Good morning Rick. Good discussion, and thank you for your time. The default setting for Isolation Response changed with Update 2 to Leave VMs Powered On, so I’m not sure I consider powering off a best practice anymore. The response from our customers has overwhelmingly been to leave them powered on. I have chosen to present the pros and cons to my customers and let them decide. This is more information for me, so that is great. Thank you!!

    So, another question for you. If you have applied the NFS patch for ESX and you have locking set properly, is there a technical NEED to change the Isolation Response to Power Off? I haven’t seen one and I wanted to get your thoughts.

    Thank you again!

  4. Rick Scherer
    12:25 pm on November 3rd, 2008

    Technically I cannot see a reason to change Isolation Response to Power Off if you have the ESX350-200808401-BG patch installed and you have changed NFS.LockDisable back to 0.

    There could still be other reasons why you would want this changed, including intermittent network issues, etc. But you can work around those issues with advanced settings for HA (http://vmwaretips.com/wp/2008/10/20/advanced-settings-for-vmware-ha/), such as das.isolationAddress, or by increasing the detection time with das.failureDetectionTime.

  5. Aaron Delp
    5:53 am on November 4th, 2008

    Thank you very much for the information!

  6. andrewstaflin
    5:13 am on December 17th, 2008

    Is it really worth it to purchase VirtualCenter? Would it still be worth it if I don’t upgrade my VI3 from Standard to Enterprise? The post above really encouraged me to buy it, but it did not mention what it will and will not give me in the Standard edition.

    http://www.virtualizationteam.com/virtualization-vmware/vmware-virtual-server-virtualization-vmware/virtualcenter-for-vm-ware-server-real-value.html

  7. Rick Scherer
    9:19 am on December 17th, 2008

    Andrew, your comment and that link confuse me a little. The blog you referenced talks about the advantages of using VirtualCenter with a VMware Server deployment, and you’re discussing an upgrade from VI3 Standard to Enterprise… You should already have VirtualCenter for your deployment. Can you please elaborate a little more on this?

  8. shankyrhodes
    1:32 am on January 2nd, 2009

    Hi,

    We are considering buying VMware VirtualCenter.
    We have two servers running VMware Standard edition.
    Do you believe it will be worth it? Or do we have to
    upgrade our VMware licenses to Enterprise before upgrading
    VirtualCenter to make it worth it? I had just read the
    following article:
    VMware virtual center real value
