Issue: VMware ESX/ESXi SAN I/O Failure

I heard via Twitter today that VMware has released a new KB article about a potential SAN I/O failure affecting both Fibre Channel and iSCSI storage.  The VMware KB article has more information and the most recent updates.  A breakdown of the issue and its resolution follows.

One or more of the following symptoms may be present:
  • VMware ESX or ESXi host might get disconnected from VirtualCenter.
  • All paths to the LUNs are in standby state.
  • esxcfg-rescan might take a long time to complete, or might hang and never complete.
  • Error messages matching this pattern are repeated continually in vmkernel:
    vmkernel: cpu6:1177)SCSI: 675: Queue for device vml. has been blocked for 7 seconds.
    vmkernel: cpu7:1184)SCSI: 675: Queue for device vml. has been blocked for 6399 seconds.
    The log entries preceding the first blocked message show storage events and a failover attempt:
    vmkernel: 31:19:32:26.199 cpu3:3824)Fil3: 5004:  READ error 0xbad00e5
    vmkernel: 31:19:32:29.224 cpu1:3961)StorageMonitor: 196: vmhba0:0:0:0 status = 0/5 0x0 0x0 0x0
    vmkernel: 31:19:32:29.382 cpu2:1144)FS3: 5034: Waiting for timed-out heartbeat [HB state abcdef02 offset 3736576 gen 26 stamp 2748610023852 uuid 4939b0cf-c85aa695-158d-00144f021dd4 jrnl <FB 383397> drv 4.31]
    vmkernel: 31:19:32:29.638 cpu3:1053)<6>qla2xxx_eh_device_reset(1): device reset failed
    vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4279: Reset during HBA failover on vmhba1:2:1 returns Failure
    vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 3746: Could not switchover to vmhba1:2:1. Check Unit Ready Command returned an error instead of NOT READY for standby controller .
    vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4622: Manual switchover to vmhba1:2:1 completed unsuccessfully.
    vmkernel: 31:19:32:29.638 cpu3:1053)StorageMonitor: 196: vmhba0:2:1:0 status = 0/1 0x0 0x0 0x0
    vmkernel: 31:19:32:29.640 cpu2:1067)scsi(1): Waiting for LIP to complete…
    vmkernel: 31:19:32:29.640 cpu2:1067)<6>qla2x00_fw_ready ha_dev_f=0xc
    vmkernel: 31:19:32:30.532 cpu2:1026)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
    last message repeated 31 times
    vmkernel: 31:19:32:31.535 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x81
    vmkernel: 31:19:32:31.541 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x82
    vmkernel: 31:19:32:31.547 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x83
    vmkernel: 31:19:32:31.568 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x84
    vmkernel: 31:19:32:31.573 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x85
    vmkernel: 31:19:32:31.576 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x86
    vmkernel: 31:19:32:32.531 cpu2:4267)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
    last message repeated 31 times
    vmkernel: 31:19:32:32.532 cpu1:3973)StorageMonitor: 196: vmhba0:0:0:0 status = 2/0 0x6 0x29 0x0
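If you want to check a host's log for this signature without reading it by eye, the pattern above can be matched with a short script. This is only a rough sketch and not something from the KB: the function name `scan_vmkernel` and both regular expressions are my own assumptions, built purely from the sample messages shown here, so the real message wording on your hosts may differ.

```python
import re

# Assumption: pattern derived from the sample "blocked" messages above.
# The device name is truncated to "vml." in the samples, so match it loosely.
BLOCKED_RE = re.compile(
    r"SCSI: \d+: Queue for device (?P<device>\S+) "
    r"has been blocked for (?P<seconds>\d+) seconds"
)

# Assumption: failover trouble shows up as SCSI warnings that mention
# "failover" or "switchover", as in the sample log above.
FAILOVER_RE = re.compile(
    r"WARNING: SCSI: \d+: .*(failover|switchover)", re.IGNORECASE
)


def scan_vmkernel(lines):
    """Scan vmkernel log lines for the blocked-queue signature.

    Returns a tuple (failover_warnings, blocked):
      failover_warnings: failover/switchover warnings seen before the
                         first blocked-queue message
      blocked:           (device, seconds) for each blocked-queue message
    """
    failover_warnings = []
    blocked = []
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            blocked.append((match.group("device"), int(match.group("seconds"))))
        elif not blocked and FAILOVER_RE.search(line):
            # Only collect warnings that precede the first blocked message.
            failover_warnings.append(line.strip())
    return failover_warnings, blocked
```

To use it, open the vmkernel log from the affected host and pass the file object to `scan_vmkernel`. Where the log lives is for you to confirm on your own system. A non-empty `blocked` list with steadily growing second counts, preceded by failover warnings, would be consistent with the symptoms described above.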
This issue can occur on VMware ESX servers under the following conditions:
  • Hypervisor version: VMware ESX 3.5 U3.
  • SAN hardware: Active/Passive and Active/Active arrays (Fibre Channel and iSCSI).
  • Trigger: VMFS3 metadata updates are in progress at the same time that a failover to an alternate path occurs for the LUN on which the VMFS3 volume resides.
A reboot is required to clear this condition.
VMware is working on a patch to address this issue. This knowledge base article will be updated after the patch is available.