Committing snapshots generates a content ID mismatch error

I had a big problem Monday morning on one of my core SAP VM instances, which also happens to have a SQL DB server on it. Our VCB process finishes up late Sunday night. If you’re not aware of how VCB works, it basically creates a snapshot of the virtual machine, then mounts the now readable parent VMDK on a proxy server where your backup agent resides. Once the backup is complete, the snapshot is committed. That wasn’t the case Monday morning: the VM crashed and I was paged. The snapshot hadn’t committed and the parent VMDK could not be found, so I had to manually set the parent CID in the delta VMDK file. When I finally got the VM back online, the SQL DB was corrupt :( Luckily I had a full SQL backup from the night before.

This is where VMware KB 1007969 comes into the story…

Symptoms

  • Performing a commit of a snapshot fails
  • The virtual machine shuts down abruptly during snapshot commit
  • Performing a snapshot commit generates the error:
    Content ID mismatch 
  • Powering on the virtual machine generates the error:
    Content ID mismatch 
  • The virtual machine log contains the following:
    Sep 11 03:01:45.328: vmx| DISKLIB-LINK : Attach: Content ID mismatch (d504c2f0 != 62e0e8bf).
    Sep 11 03:01:45.331: vmx| DISKLIB-CHAIN : “/vmfs/volumes/48a1b01c-67422c6d-f5aa-00188b50e0ff/test/w2k3-lsi-64.vmdk” : failed to open (The parent virtual disk has been modified since the child was created).
    Sep 11 03:01:45.336: vmx| DISKLIB-VMFS : “/vmfs/volumes/48a1b01c-67422c6d-f5aa-00188b50e0ff/186-testing/w2k3-lsi-64-000001-delta.vmdk” : closed.
    Sep 11 03:01:45.342: vmx| DISKLIB-VMFS : “/vmfs/volumes/48a1b01c-67422c6d-f5aa-00188b50e0ff/186-testing/w2k3-lsi-64-000013-delta.vmdk” : closed.
    Sep 11 03:01:45.348: vmx| DISKLIB-VMFS : “/vmfs/volumes/48a1b01c-67422c6d-f5aa-00188b50e0ff/test/w2k3-lsi-64-flat.vmdk” : closed.
    Sep 11 03:01:45.352: vmx| DISKLIB-LIB : Failed to open ‘/vmfs/volumes/48a1b01c-67422c6d-f5aa-00188b50e0ff/186-testing/w2k3-lsi-64-000001.vmdk’ with flags 0xa (The parent virtual disk has been modified since the child was created).
    Sep 11 03:01:45.355: vmx| DISK: Cannot open disk “/vmfs/volumes/48a1b01c-67422c6d-f5aa-00188b50e0ff/186-testing/w2k3-lsi-64-000001.vmdk”: The parent virtual disk has been modified since the child was created (18).
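
If you have a large vmware.log to dig through, the two CIDs in the mismatch message can be pulled out with a short script. This is only a rough sketch: the regular expression is my own, based on the single log line quoted above, and the function name is hypothetical. It also makes no claim about which of the two values belongs to the parent and which to the child.

```python
import re
from typing import List, Tuple

# Pattern modelled on the one log line quoted in the KB article (assumption:
# other ESX builds may word the message differently).
MISMATCH = re.compile(r"Content ID mismatch \(([0-9a-fA-F]+) != ([0-9a-fA-F]+)\)")


def find_cid_mismatches(log_text: str) -> List[Tuple[str, str]]:
    """Return every (left, right) CID pair reported as mismatched."""
    return MISMATCH.findall(log_text)


sample_log = (
    "Sep 11 03:01:45.328: vmx| DISKLIB-LINK : Attach: "
    "Content ID mismatch (d504c2f0 != 62e0e8bf).\n"
)
pairs = find_cid_mismatches(sample_log)
print(pairs)
```

The pair it returns tells you which CID values to look for when you open the descriptor files later on.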

Resolution

VMware is aware of, and actively investigating, the issue.

When a snapshot delete is requested:

  1. The CID of the disk being combined into is updated.
  2. The virtual disk is updated with changes.
  3. The CIDs of the children (that are not being removed) are updated.

If a failure occurs during the combine process (I/O errors or running out of disk space), the combine process aborts. The CIDs of the remaining child files never get updated, resulting in a mismatch.
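
To make the failure mode easier to picture, here is a toy model of the three steps. It is not VMware's code; the class, function name, file names, and CID values are all made up for illustration. It only shows why aborting after step 1 leaves a child pointing at a CID its parent no longer has.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Disk:
    name: str
    cid: str
    parent_cid: Optional[str]  # None for the base disk


def combine_into(target: Disk, children: List[Disk], new_cid: str,
                 fail_midway: bool) -> None:
    """Toy version of the snapshot delete sequence described above."""
    target.cid = new_cid                  # step 1: target disk gets a new CID
    if fail_midway:                       # step 2 aborts (I/O error, disk full)
        raise IOError("combine aborted")
    for child in children:                # step 3: remaining children re-pointed
        child.parent_cid = new_cid


# Hypothetical two-disk chain with matching CIDs before the commit.
base = Disk("vm.vmdk", cid="11111111", parent_cid=None)
delta = Disk("vm-000001.vmdk", cid="33333333", parent_cid="11111111")

try:
    combine_into(base, [delta], new_cid="22222222", fail_midway=True)
except IOError:
    pass  # the abort is exactly the case we want to inspect

mismatch = delta.parent_cid != base.cid
print(base.cid, delta.parent_cid, mismatch)
```

After the simulated abort, the base disk carries the new CID while the delta still records the old one as its parent, which is the state the "Content ID mismatch" error describes.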

Warning: Do not perform a Go To and do not Revert to the parent snapshot.

You must correct the snapshot parent/child relationship.

To correct the parent/child relationship:

  1. Log in to the ESX Server console and verify the CID of all the virtual disks. The current snapshot disks are identified in the virtual machine configuration file (.vmx).
  2. Examine the virtual disk header files to verify the CID and ParentCID of each member to ensure that they match all the way up the tree.
  3. When the one that does not match is found, update the ParentCID of the child to match the CID of the file next up the chain.

Note: For more information related to performing these steps, see Consolidating Snapshots (1007849).
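
Steps 1 and 2 above amount to walking the chain and comparing each child's parentCID with its parent's CID. The sketch below does that check on descriptor text held in strings. The CID and parentCID field names come from the VMDK descriptor format, but the function names, file names, and sample values are hypothetical, and it does not read anything from /vmfs. Step 3, the actual edit, is deliberately left as a manual job.

```python
from typing import Dict, List, Optional, Tuple


def parse_descriptor(text: str) -> Dict[str, str]:
    """Collect key=value fields from a VMDK descriptor, dropping quotes."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and extent lines
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"')
    return fields


Problem = Tuple[str, Optional[str], str, Optional[str]]


def find_broken_links(chain: List[Tuple[str, str]]) -> List[Problem]:
    """chain is ordered newest child first, base disk last.

    Returns (child, child's parentCID, parent, parent's CID) for each bad link.
    """
    parsed = [(name, parse_descriptor(text)) for name, text in chain]
    problems: List[Problem] = []
    for (child_name, child), (parent_name, parent) in zip(parsed, parsed[1:]):
        if child.get("parentCID") != parent.get("CID"):
            problems.append((child_name, child.get("parentCID"),
                             parent_name, parent.get("CID")))
    return problems


# Made-up descriptors: the first delta still points at the base disk's old CID.
chain = [
    ("vm-000002.vmdk", 'CID=aaaa0002\nparentCID=aaaa0001\n'
                       'parentFileNameHint="vm-000001.vmdk"\n'),
    ("vm-000001.vmdk", 'CID=aaaa0001\nparentCID=62e0e8bf\n'
                       'parentFileNameHint="vm.vmdk"\n'),
    ("vm.vmdk", 'CID=d504c2f0\nparentCID=ffffffff\n'),
]
problems = find_broken_links(chain)
print(problems)
```

Each tuple it reports names the child whose parentCID needs to be changed and the CID it should be changed to. Take a copy of the descriptor file before editing it.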

The virtual machine powers on normally at this point.

You can safely continue to use the snapshot or perform the commit operation again. VMware recommends performing the commit operation so that all changes found in the delta are written down to the next level of snapshot or base disk (if there was only one level of snapshot).


Created on April 14, 2009 by Rick Scherer

Posted under ESX 3.5 Tips, ESXi 3.5 Tips, Storage.


4 Comments so far

  1. Duncan
    6:58 am on April 14th, 2009

    I never use snapshots on SQL servers or AD or Exchange…. For VCB you could use the VSS integration or just disable the quiescing. I have seen too many issues related to snapshots and databases not being aware of them.

  2. Rick Scherer
    9:02 am on April 14th, 2009

    Yeah, I’ve already disabled quiescing in the config.js file for VCB – problem still occurred. Don’t think this is SQL/DB/Snapshot related, think it is just a bug in ESX that they haven’t resolved yet.

  3. Jones
    7:41 am on July 23rd, 2009

    Confirmed the bug!

    Yesterday I called this exact issue in to VMware support and actually spoke with the same rep who a month or so earlier spent 6 (that’s right, 6) hours on the phone with me trying to pin down the root cause of this same problem. In the earlier call we couldn’t find the culprit because we had a SAN problem that was filling our VM logs with garbage, overwriting any chance of seeing what may have happened. Yesterday we found it, and the rep acknowledged that there is an ongoing research project by VMware and EMC specifically, but maybe more iSCSI vendors, to address a problem where data corruption on the SAN can be caused by iSCSI software initiators. So the good news is a solution is being worked on. The bad news is the only way to fix the problem now is either to keep fixing the snapshot chain when this occurs OR to move all servers off the iSCSI-based datastores and format them (no guarantee the issue won’t resurface).

    Bug is confirmed though! The solution from VMware’s perspective, as I was told, is possibly to have the VMkernel recognize the problem when it occurs and automatically respond to it. This will be until the SAN folks can come up with a way to prevent the corruption in the first place.

  4. Rick Scherer
    11:44 am on July 23rd, 2009

    Hey Jones, this bug has been “supposedly” resolved in patch ESX350-200904401-BG – more info can be found on this blog post.

    I can also say that this problem happens with NFS datastores, so I think it is a problem directly related to the VMkernel and not necessarily the storage subsystem.
