Rick Scherer is a Principal Global Architect & CTO Ambassador at Dell EMC, supporting our long standing relationship with American Express. He is also VCDX #21.
On Twitter today I heard that a new VMware KB article has been released regarding a potential SAN failure (Fibre Channel and iSCSI). More information and the most recent updates can be found in the VMware KB article. I’ve included a breakdown of the issue and resolution below.
Symptoms
One or more of the following may be present:
A VMware ESX or ESXi host might get disconnected from VirtualCenter.
All paths to the LUNs are in standby state.
esxcfg-rescan might take a long time to complete or might never complete (it hangs).
Error messages matching this pattern are repeated continually in vmkernel:
vmkernel: cpu6:1177)SCSI: 675: Queue for device vml. has been blocked for 7 seconds.
vmkernel: cpu7:1184)SCSI: 675: Queue for device vml. has been blocked for 6399 seconds.
If you look at log entries previous to the first blocked message, you will see storage events and a failover attempt. Example:
vmkernel: 31:19:32:26.199 cpu3:3824)Fil3: 5004: READ error 0xbad00e5
vmkernel: 31:19:32:29.224 cpu1:3961)StorageMonitor: 196: vmhba0:0:0:0 status = 0/5 0x0 0x0 0x0
vmkernel: 31:19:32:29.382 cpu2:1144)FS3: 5034: Waiting for timed-out heartbeat [HB state abcdef02 offset 3736576 gen 26 stamp 2748610023852 uuid 4939b0cf-c85aa695-158d-00144f021dd4 jrnl <FB 383397> drv 4.31]
vmkernel: 31:19:32:29.638 cpu3:1053)<6>qla2xxx_eh_device_reset(1): device reset failed
vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4279: Reset during HBA failover on vmhba1:2:1 returns Failure
vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 3746: Could not switchover to vmhba1:2:1. Check Unit Ready Command returned an error instead of NOT READY for standby controller .
vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4622: Manual switchover to vmhba1:2:1 completed unsuccessfully.
vmkernel: 31:19:32:29.638 cpu3:1053)StorageMonitor: 196: vmhba0:2:1:0 status = 0/1 0x0 0x0 0x0
vmkernel: 31:19:32:29.640 cpu2:1067)scsi(1): Waiting for LIP to complete…
vmkernel: 31:19:32:29.640 cpu2:1067)<6>qla2x00_fw_ready ha_dev_f=0xc
vmkernel: 31:19:32:30.532 cpu2:1026)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
last message repeated 31 times
vmkernel: 31:19:32:31.535 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x81
vmkernel: 31:19:32:31.541 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x82
vmkernel: 31:19:32:31.547 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x83
vmkernel: 31:19:32:31.568 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x84
vmkernel: 31:19:32:31.573 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x85
vmkernel: 31:19:32:31.576 cpu2:1067)<6>dpc1 port login OK: logged in ID 0x86
vmkernel: 31:19:32:32.531 cpu2:4267)StorageMonitor: 196: vmhba0:0:0:0 status = 0/2 0x0 0x0 0x0
last message repeated 31 times
vmkernel: 31:19:32:32.532 cpu1:3973)StorageMonitor: 196: vmhba0:0:0:0 status = 2/0 0x6 0x29 0x0
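If you want to check a host for the symptoms above without reading the whole log by eye, here is a minimal Python sketch. It looks for the "Queue for device ... has been blocked for N seconds" message and collects the failover warnings that appear before the first blocked message. The regular expressions are built only from the sample lines shown above, and the function name `summarize` and the log path in the usage note are my own assumptions, not anything from the KB.

```python
import re

# Both patterns are derived from the sample log lines in this post (assumption:
# real messages keep the same wording; adjust the patterns if yours differ).
BLOCKED_RE = re.compile(
    r"SCSI: \d+: Queue for device (?P<device>\S+) "
    r"has been blocked for (?P<seconds>\d+) seconds"
)
FAILOVER_RE = re.compile(
    r"WARNING: SCSI: \d+: "
    r"(Reset during HBA failover|Could not switchover|Manual switchover)"
)


def summarize(lines):
    """Scan vmkernel log lines.

    Returns a tuple of:
      - dict mapping device name to the longest blocked time seen (seconds)
      - list of failover warnings logged before the first blocked message
    """
    blocked = {}
    failover_warnings = []
    seen_blocked = False
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            seen_blocked = True
            device = match.group("device")
            seconds = int(match.group("seconds"))
            blocked[device] = max(blocked.get(device, 0), seconds)
        elif not seen_blocked and FAILOVER_RE.search(line):
            failover_warnings.append(line.strip())
    return blocked, failover_warnings


if __name__ == "__main__":
    sample = [
        "vmkernel: 31:19:32:29.638 cpu3:1053)WARNING: SCSI: 4279: "
        "Reset during HBA failover on vmhba1:2:1 returns Failure",
        "vmkernel: cpu6:1177)SCSI: 675: Queue for device vml. "
        "has been blocked for 7 seconds.",
        "vmkernel: cpu7:1184)SCSI: 675: Queue for device vml. "
        "has been blocked for 6399 seconds.",
    ]
    blocked, warnings = summarize(sample)
    print(blocked)
    print(warnings)
```

To run it against a real host, pass an open file instead of the sample list, for example `summarize(open("/var/log/vmkernel"))`. That path is an assumption about where your host keeps its vmkernel log. A blocked time that keeps growing, with failover warnings just before it, matches the pattern described above.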
Resolution
This issue can occur on VMware ESX servers under the following conditions:
Hypervisor version: VMware ESX 3.5 U3.
SAN hardware: Active/Passive and Active/Active arrays (Fibre Channel and iSCSI).
Trigger: VMFS3 metadata updates are in progress at the same time that a failover to an alternate path occurs for the LUN on which the VMFS3 volume resides.
A reboot is required to clear this condition.
VMware is working on a patch to address this issue. This knowledge base article will be updated after the patch is available.