Issue: VMware ESX/ESXi SAN I/O Failure

January 12, 2009 • ESX 3.5 Tips, ESXi 3.5 Tips, Storage

On Twitter today I heard that VMware has released a new KB article regarding a potential SAN I/O failure affecting both Fibre Channel and iSCSI storage. More information and the most recent updates can be found in the VMware KB. I’ve included a breakdown of the issue and resolution below.


Symptoms

One or more of the following may be present:

The VMware ESX or ESXi host might be disconnected from VirtualCenter.
All paths to the LUNs are in standby state.
esxcfg-rescan might take a long time to complete, or might hang and never complete.
Error messages matching this pattern are repeated continually in vmkernel:
vmkernel: cpu6:1177)SCSI: 675: Queue for device vml. has been blocked for 7 seconds.
vmkernel: cpu7:1184)SCSI: 675: Queue for device vml. has been blocked for 6399 seconds.
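
One way to spot this symptom without reading the log by hand is to scan it for the "has been blocked" message and keep the longest blocked time reported for each device. The following is a minimal Python sketch, not a VMware tool; the function name `longest_block_per_device` and the regular expression are my own assumptions, derived only from the two sample lines above.

```python
import re
from typing import Dict, Iterable

# Pattern derived from the sample messages above (an assumption, not an
# official format): the device name is the token that follows "device".
BLOCKED_RE = re.compile(
    r"Queue for device (\S+) has been blocked for (\d+) seconds"
)


def longest_block_per_device(lines: Iterable[str]) -> Dict[str, int]:
    """Return the longest reported blocked time, in seconds, per device."""
    longest: Dict[str, int] = {}
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            device = match.group(1)
            seconds = int(match.group(2))
            longest[device] = max(seconds, longest.get(device, 0))
    return longest


# The two sample lines from this post, device name truncated as in the source.
sample = [
    "vmkernel: cpu6:1177)SCSI: 675: Queue for device vml. has been blocked for 7 seconds.",
    "vmkernel: cpu7:1184)SCSI: 675: Queue for device vml. has been blocked for 6399 seconds.",
]
print(longest_block_per_device(sample))
```

On a real host you would pass the lines of the vmkernel log file instead of `sample`; the location of that file differs between ESX and ESXi, so check your own system rather than relying on a hard-coded path.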

If you look at the log entries before the first blocked message, you will see storage events and a failover attempt.
Example:
vmkernel: 31:19:32:26.199 cpu3:3824)Fil3: 5004: READ error 0xbad00e5
vmkernel: 31:19:32:29.224 cpu1:3961)StorageMonitor: 196: vmhba0:0:0:0 status = 0/5 0x0 0x0 0x0
vmkernel: 31:19:32:29.382 cpu2:1144)FS3: 5034: Waiting for timed-out heartbeat [HB state abcdef02 offset 3736576 gen 26 stamp 2748610023852 uuid 4939b0cf-c85aa695-158d-00144f021dd4 jrnl <FB 383397> drv 4.31]
vmkernel: 31:19:32:29.638 cpu3:1053)<6>qla2xxx_eh_device_reset(1): device reset failed
vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4279: Reset during HBA failover on vmhba1:2:1 returns Failure
vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 3746: Could not switchover to vmhba1:2:1. Check Unit Ready Command returned an error instead of NOT READY for standby controller .
vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4622: Manual switchover to vmhba1:2:1 completed unsuccessfully.
vmkernel: 31:19:32:29.638 cpu3:1053)StorageMonitor: 196: vmhba0:2:1:0 status = 0/1 0x0 0x0 0x0
vmkernel: 31:19:32:29.640 cpu2:1067)scsi(1): Waiting for LIP to complete…
vmkernel: 31:19:32:29.640 cpu2:1067)<6>qla2x00_fw_ready ha_dev_f=0xc
vmkernel: 31:19:32:30.532 cpu2:1026)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
last message repeated 31 times
vmkernel: 31:19:32:31.535 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x81
vmkernel: 31:19:32:31.541 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x82
vmkernel: 31:19:32:31.547 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x83
vmkernel: 31:19:32:31.568 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x84
vmkernel: 31:19:32:31.573 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x85
vmkernel: 31:19:32:31.576 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x86
vmkernel: 31:19:32:32.531 cpu2:4267)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
last message repeated 31 times
vmkernel: 31:19:32:32.532 cpu1:3973)StorageMonitor: 196: vmhba0:0:0:0 status = 2/0 0x6 0x29 0x0
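
The failed failover attempt in the example above shows up as three `WARNING: SCSI:` lines that name the same path. As a rough sketch, those warnings can be counted per path in Python. The regular expression covers only the three message texts that appear in this example, and the function name `failed_failovers` is hypothetical; other ESX builds may word these messages differently.

```python
import re
from collections import Counter
from typing import Iterable

# Matches only the three failover warnings seen in the example above.
# Each alternative captures the path, for example "vmhba1:2:1".
FAILOVER_RE = re.compile(
    r"WARNING: SCSI: \d+: "
    r"(?:Reset during HBA failover on (vmhba[\d:]+) returns Failure"
    r"|Could not switchover to (vmhba[\d:]+)"
    r"|Manual switchover to (vmhba[\d:]+) completed unsuccessfully)"
)


def failed_failovers(lines: Iterable[str]) -> Counter:
    """Count failover-failure warnings per path."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILOVER_RE.search(line)
        if match:
            # Exactly one of the three groups is set for any given match.
            path = next(group for group in match.groups() if group)
            counts[path] += 1
    return counts


# Warning lines copied from the example log above.
sample = [
    "vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4279: Reset during HBA failover on vmhba1:2:1 returns Failure",
    "vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 3746: Could not switchover to vmhba1:2:1. Check Unit Ready Command returned an error instead of NOT READY for standby controller .",
    "vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4622: Manual switchover to vmhba1:2:1 completed unsuccessfully.",
]
print(failed_failovers(sample))
```

A non-empty result followed by repeated "has been blocked" messages for a device would be consistent with the sequence described in this post, though it is not proof that a host has hit this particular issue.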

Resolution

This issue can occur on VMware ESX servers under the following conditions:

Hypervisor version: VMware ESX 3.5 U3.
SAN hardware: Active/Passive and Active/Active arrays (Fibre Channel and iSCSI).
Trigger: The issue occurs when VMFS3 metadata updates take place at the same time as a failover to an alternate path for the LUN on which the VMFS3 volume resides.

A reboot is required to clear this condition.

VMware is working on a patch to address this issue. This knowledge base article will be updated after the patch is available.
