Issue: VMware ESX/ESXi SAN I/O Failure

Word came through the Twitter queue today that VMware has released a new KB article describing a potential SAN I/O failure (Fibre Channel and iSCSI). More information and the most recent updates can be found in the VMware KB; a breakdown of the issue and its resolution is below.

Symptoms
 
One or more of the following may be present:

  • The VMware ESX or ESXi host might get disconnected from VirtualCenter.
     
  • All paths to the LUNs are in standby state.
     
  • esxcfg-rescan might take a long time to complete, or might never complete (it hangs).
     
  • Error messages matching the following pattern are repeated continually in the vmkernel log (a quick way to check for them is sketched just after this list):
    <date and time> <hostname> vmkernel: <server uptime> cpu6:1177)SCSI: 675: Queue for device vml.<Vol. Dev. ID> has been blocked for 7 seconds.
    <date and time> <hostname> vmkernel: <server uptime> cpu7:1184)SCSI: 675: Queue for device vml.<Vol. Dev. ID> has been blocked for 6399 seconds.
     
    If you look at the log entries immediately preceding the first blocked message, you will see storage events and a failover attempt.
    Example:
    <date and time> <hostname> vmkernel: 31:19:32:26.199 cpu3:3824)Fil3: 5004:  READ error 0xbad00e5
    <date and time> <hostname> vmkernel: 31:19:32:29.224 cpu1:3961)StorageMonitor: 196: vmhba0:0:0:0 status = 0/5 0x0 0x0 0x0
    <date and time> <hostname> vmkernel: 31:19:32:29.382 cpu2:1144)FS3: 5034: Waiting for timed-out heartbeat [HB state abcdef02 offset 3736576 gen 26 stamp 2748610023852 uuid 4939b0cf-c85aa695-158d-00144f021dd4 jrnl <FB 383397> drv 4.31]
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)<6>qla2xxx_eh_device_reset(1): device reset failed
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4279: Reset during HBA failover on vmhba1:2:1 returns Failure
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 3746: Could not switchover to vmhba1:2:1. Check Unit Ready Command returned an error instead of NOT READY for standby controller .
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4622: Manual switchover to vmhba1:2:1 completed unsuccessfully.
    <date and time> <hostname> vmkernel: 31:19:32:29.638 cpu3:1053)StorageMonitor: 196: vmhba0:2:1:0 status = 0/1 0x0 0x0 0x0
    <date and time> <hostname> vmkernel: 31:19:32:29.640 cpu2:1067)scsi(1): Waiting for LIP to complete…
    <date and time> <hostname> vmkernel: 31:19:32:29.640 cpu2:1067)<6>qla2x00_fw_ready ha_dev_f=0xc
    <date and time> <hostname> vmkernel: 31:19:32:30.532 cpu2:1026)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
    <date and time> <hostname> last message repeated 31 times
    <date and time> <hostname> vmkernel: 31:19:32:31.535 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x81
    <date and time> <hostname> vmkernel: 31:19:32:31.541 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x82
    <date and time> <hostname> vmkernel: 31:19:32:31.547 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x83
    <date and time> <hostname> vmkernel: 31:19:32:31.568 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x84
    <date and time> <hostname> vmkernel: 31:19:32:31.573 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x85
    <date and time> <hostname> vmkernel: 31:19:32:31.576 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x86
    <date and time> <hostname> vmkernel: 31:19:32:32.531 cpu2:4267)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
    <date and time> <hostname> last message repeated 31 times
    <date and time> <hostname> vmkernel: 31:19:32:32.532 cpu1:3973)StorageMonitor: 196: vmhba0:0:0:0 status = 2/0 0x6 0x29 0x0
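
If you suspect a host is hitting this, the pattern is easy to check for from the service console. Below is a minimal sketch, assuming a standard ESX 3.5 service console where the vmkernel log is /var/log/vmkernel (log paths and adapter names may differ in your environment):

    # Look for the repeated blocked-queue messages described above
    grep "has been blocked for" /var/log/vmkernel | tail -20

    # Look for the failed switchover attempts that precede them
    grep "switchover to vmhba" /var/log/vmkernel | tail -20

    # List every path and its state; with this issue, all paths to the
    # affected LUN show up as standby
    esxcfg-mpath -l
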
Resolution
 
This issue can occur on VMware ESX and ESXi hosts under the following conditions:
  • Hypervisor version: VMware ESX/ESXi 3.5 Update 3.
  • SAN hardware: Active/Passive and Active/Active arrays (Fibre Channel and iSCSI).
  • Trigger: VMFS3 metadata updates happening at the same time as a failover to an alternate path for the LUN on which the VMFS3 volume resides.
A reboot is required to clear this condition.
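
If you're not sure whether a host is on the affected release, the service console will tell you quickly. A minimal sketch (compare the reported build number against VMware's release notes, since builds vary by patch level):

    # Show the ESX release and build number
    vmware -v

    # List the patch bundles installed on this host
    esxupdate query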
 
VMware is working on a patch to address this issue. This knowledge base article will be updated after the patch is available.
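
In the meantime, if a host drops out of VirtualCenter during one of these events, you may be able to reconnect it without a full reboot by restarting the management agents from the service console, as Joe describes in the comments below. A rough sketch of the commands on ESX 3.5 (note that restarting mgmt-vmware has been known to affect running VMs if automatic VM startup/shutdown is enabled, so check that setting first):

    # Restart the host management agent (hostd)
    service mgmt-vmware restart

    # Restart the VirtualCenter agent (vpxa)
    service vmware-vpxa restart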

Created on January 12, 2009 by Rick Scherer

Posted under ESX 3.5 Tips, ESXi 3.5 Tips, Storage.


6 Comments so far

  1. just paul
    8:47 am on January 13th, 2009

    I wouldn’t call that a resolution- I would call that a workaround until the issue is resolved.

  2. Rick Scherer
    9:28 am on January 13th, 2009

    I agree. Although the likelihood of this happening is minimal, it is still a problem that needs to be addressed. I’m sure VMware will release a patch soon.

  3. Joe Moore
    11:23 am on January 18th, 2009

    It’s especially not a good workaround, since once the ESX server is disconnected from VirtualCenter we can’t VMotion the VMs onto another node in our cluster, so rebooting the ESX server takes down the guest servers too.

    –Joe

  4. Joe Moore
    9:11 am on January 19th, 2009

    After some further research, I’ve been able to get the ESX host reconnected to VirtualCenter by restarting the mgmt-vmware and vpxa services on the ESX host, which at least allowed me to VMotion the rest of the VMs.

    –Joe

  5. Alastair
    3:02 am on January 23rd, 2009

    This is a huge problem for us. We’ve had this problem twice in 2 days.

    Come on VMware, let’s have a fix!

  6. Tony Scott
    4:10 am on January 28th, 2009

    I have the Akorri monitoring software in eval right now and it does a great job of catching these issues by monitoring all of the VMware storage. The software is called BalancePoint.
