HP MSA1500 Ruined My Customer's Week

A customer of mine with a small ESX deployment ran into some major grief this week with their MSA1500.  Unfortunately for them, it took 3 days before they picked up the phone and got hold of me. This isn't the first time I've run into a problem caused by an MSA1500, and it will not be the last.  Symptoms start out minimal: VMs appear to be running slow…systems become unresponsive…then BOOM, all-out catastrophic failure!

The problem is the controller built into the MSA1500; it simply was not made to handle much throughput.  The sweet spot for these devices is 2 ESX hosts and a handful of Virtual Machines (0-15). Anything more than that and you're asking for trouble.

Here are some of the errors you should expect to see in your vmkernel and vmkwarning logs:

Jan  9 22:13:13 groucho vmkernel: 0:01:32:55.713 cpu0:1146)Fil3: 9811: Max Timeout retries exceeded for caller 0x928f4b (status ‘Timeout’)
Jan  9 22:13:13 groucho vmkernel: 0:01:32:55.713 cpu1:1087)VSCSI: 2803: Reset request on handle 8195 (0 outstanding commands)
Jan  9 22:13:13 groucho vmkernel: 0:01:32:55.713 cpu1:1054)VSCSI: 3019: Resetting handle 8195 [0/0]
Jan  9 12:15:25 groucho vmkernel: 2:18:31:13.494 cpu1:1037)WARNING: FS3: 4784: Reservation error: Timeout
Jan  9 21:28:16 groucho vmkernel: 0:00:47:59.406 cpu1:1033)VSCSI: 2803: Reset request on handle 8192 (1 outstanding commands)
Jan  9 21:28:16 groucho vmkernel: 0:00:47:59.406 cpu1:1054)VSCSI: 3019: Resetting handle 8192 [0/0]
Jan  9 12:15:25 groucho vmkernel: 2:18:31:13.494 cpu1:1037)WARNING: FS3: 4784: Reservation error: Timeout
Jan  9 15:36:08 groucho vmkernel: 2:21:51:56.008 cpu0:1034)VSCSI: 2803: Reset request on handle 8208 (3 outstanding commands)
Jan  9 15:36:08 groucho vmkernel: 2:21:51:56.008 cpu1:1054)VSCSI: 3019: Resetting handle 8208 [0/0]
Jan  9 20:40:49 groucho vmkernel: 0:00:00:02.483 cpu0:1024)CpuSched: 16758: Reset scheduler statistics
Jan  9 20:40:50 groucho vmkernel: 0:00:00:10.004 cpu1:1035)World: vm 1064: 895: Starting world FS3ResMgr with flags 1
Jan  8 07:58:05 groucho vmkernel: 1:14:13:57.304 cpu0:1034)VSCSI: 2803: Reset request on handle 8201 (1 outstanding commands)
Jan  8 07:58:05 groucho vmkernel: 1:14:13:57.305 cpu1:1054)VSCSI: 3019: Resetting handle 8201 [0/0]
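
If you want to check whether your own hosts are heading the same way, a quick grep from the service console will surface the same messages. This is just a rough sketch; the paths assume a classic ESX 3.x service console.

grep -E "Reservation error|Reset request|Max Timeout retries" /var/log/vmkernel
grep -i scsi /var/log/vmkwarning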

So, what happens is the controller in the MSA simply chokes and slows everything down to a screeching halt. This, of course, does not play well with ESX.  The only way to resolve it was to remote into each VM and shut it down; luckily SOME of them responded to the VMware Tools guest shutdown…only two out of the 14 needed to be forcefully killed (kill -9 <VMPID>).  After everything was down we shut down the ESX hosts.  Then we proceeded to shut down and restart the MSA (controller -> shelves -> shelves -> controller).  Once back online we powered on only 2 of the 3 ESX hosts; I did not want to create too much contention on the MSA, and luckily those two hosts can still run all of the VMs without a problem.  Sometime next week we will be migrating to an EVA they have lying around.
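
For reference, the per-VM cleanup looks roughly like the sketch below on an ESX 3.x service console. The VM name and datastore path are placeholders, not the customer's actual setup.

vmware-cmd -l                                                    # list registered VM config files
vmware-cmd /vmfs/volumes/datastore1/MyVM/MyVM.vmx getstate       # is the VM still powered on?
vmware-cmd /vmfs/volumes/datastore1/MyVM/MyVM.vmx stop trysoft   # guest shutdown, hard stop if that fails
ps auxwww | grep MyVM.vmx | grep -v grep                         # find the stuck vmx process
kill -9 <VMPID>                                                  # last resort, using the PID from ps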

So, in the end what did we learn?   MSA = Good for Test and Labs … BAD for Production


Created on January 9, 2009 by Rick Scherer

Posted under Storage.


15 Comments so far

  1. pricemc1
    6:09 pm on January 11th, 2009

    Granted, the MSA1x00 series was never designed to be high-end equipment, but I can't say I agree with your assessment. The MSA is quite capable of supporting 8 ESX hosts with 40-50 VMs if properly configured. There are a number of potential issues that could be affecting the performance. Specifically, it would be interesting to know:

    – what pathing policy the customer was using in ESX?
    – what was the firmware on the MSA controllers?
    – how were the LUNs assigned to the MSA controllers?
    – how many chassis and drives are attached to the MSA?
    – what are the RAID configurations of each array/lun that was created on the MSA?

    You really don't discuss any of these factors in your post. Improper config of any or all of these could seriously impact performance of the MSA and be the cause of the issues your customer was experiencing. The ESX side of that list is quick to check from the service console; see the sketch below.
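
    Something like the following pulls most of the ESX-side answers on a classic ESX 3.x host; this is only a rough sketch, and the MSA firmware and RAID details still have to come from the HP management tools.

    esxcfg-mpath -l       # paths to each LUN and the pathing policy in use
    esxcfg-vmhbadevs -m   # map vmhba LUNs to their VMFS volumes
    vdf -h                # VMFS datastore sizes and usage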

  2. Rick Scherer
    10:45 am on January 12th, 2009

    I'm curious where you have seen those types of numbers with an HP MSA1x00 – I've had two separate customers with completely different setups experience problems.

    The most recent customer only had a single controller and two MSA20 enclosures with SATA disks and the v5.20 active/passive firmware. The arrays were configured in RAID6 and the LUNs were way too large (1TB). All of this mixed together is why I believe they ran into so many problems.

    The previous customer (1-2 years ago) had an MSA1500 with dual controllers and an MSA30 SCSI shelf configured in RAID5 with 2-3 300GB LUNs. The controllers were Active/Active running v7.00 firmware. ESX was configured for MRU and we separated the LUNs to run on independent controllers. After some time, we also ran into SCSI Reservation and other misc I/O errors. This site had 3 ESX hosts as well, with about 20-25 Virtual Machines.

    I can only speak from experience; both customers that utilized MSA1x00 storage systems ran into problems. Both were configured to best practices and both experienced problems in the end.

  3. pricemc1
    11:16 am on January 12th, 2009

    I have an MSA1000 I run with 4 ESX hosts and 25 VMs with no issues. I'm comfortable I could go higher; it's just that I have no need at the moment and a lack of storage space. I'm pretty sure HP has published something saying the MSA was suitable for about 10 servers, but I'd have to do some research on that as a lot of their MSA1x00 documentation is becoming harder to find these days.

    In the case of your recent customer, you yourself said they were not configured optimally, so don't blame the MSA if you're not satisfied the configuration is appropriate. RAID6 is not really ideal for VMs in my humble opinion (unless you are using a NetApp I suppose and you subscribe to their ideology). Configure RAID-10 LUNs and see how it performs. Adjust the cache ratio as well, depending on the type of usage case.

    In the case of the other customer I would tend to imagine their problem resulted from using MRU. I know the MSA docs say use MRU, but they were all written before version 7 of the FW came out. I run Fixed path policy on my MSA and it works well. The problem with the MSA is that it's not truly active/active, so if something's running on the wrong path then performance is going to be seriously degraded, because all the commands get forwarded from the improper controller to the assigned controller and this pretty much kills performance.

  4. Rick Scherer
    12:09 pm on January 12th, 2009

    I agree, my recent customer could’ve had a better configuration. They may have been able to do more with an active/active setup and SCSI disks rather than SATA.

    I’m curious though in your setup, what type of load are you putting on the MSA? What types of applications are those 25 VMs? 25 idle VMs is one thing, but 25 active high I/O VMs is another.

    This recent customer was running their AD domain in VMs and also their Exchange 2003 implementation, not to mention other industry-related software for their company. All of this generates significant I/O — not as much as an enterprise, but for an SMB it is still a lot. I'm still staying firm on my stance: MSAs are good storage arrays but they're not meant for production IMO — they do serve well for low-I/O and test/dev scenarios.

  5. Paul
    6:59 am on January 13th, 2009

    “unless you are using a NetApp I suppose and you subscribe to their ideology”.

    Hehe, that’s real funny.
    I guess businesses using NetApp storage are 'ideologically' supporting their environment.

  6. pricemc1
    9:46 am on January 13th, 2009

    Ok, well, here comes the credibility gap, but in my case the VMs are for lab use, since you asked about load. With that said, I can still say I put load on them occasionally during certain tests. I actually have 4 BL25p blades and 4 BL20p G4 blades all connected to a single Act/Act MSA1000. I run a combination of ESX, XenServer and Virtual Iron hypervisors on them right now. I generally keep the following VMs running all the time on the 4 BL25p ESX servers: 2 AD DCs, 1 SQL, 3 Exchange 2K7, 2 Exchange 2K3, 1 SCOM, 1 SCCM, 1 vCenter, 1 Terminal Server, 1 MDT/WDS, 2 XP desktops, 2 Vista desktops, 1 for Virtual Iron Mgmt, 1 for HP SIM. Of course they don't generate much load on a normal basis.

    My other VMs are typically short-term test systems that I have been putting on the BL20p G4 XenServer and Virtual Iron hosts lately. That's where I might typically run up load. It has not been uncommon to run Exchange LoadGen/LoadSim against test VMs on these other systems to build up I/O. That I/O would still potentially affect the ESX host VMs since they run off the same array. I haven't had the systems grind to a halt in those cases, but perhaps I'm not generating as much load as I think, or my system is configured a little more efficiently?

    The more important issue here though is that if a customer is running 25 or 50 I/O intensive VMs then you are right, the MSA is clearly not suited for that. I wouldn’t try to make that argument.

    In my experience most customers have very few I/O intensive apps, especially if we are talking SMB environments here. AD services typically aren't load intensive in an SMB environment. Exchange certainly can be intensive, but if we're talking SMB I'd be surprised if it is.

    If your customer had an HP EVA lying around, though, I really have to wonder why they were messing with an MSA in the first place? The MSA has never been anything more than an SMB-oriented device. Saying it's sized for SMB doesn't mean it can't reasonably handle 50 production VMs though. It just means that you have to be reasonable about the type of load you expect the device to handle and ensure it is configured as ideally as possible. In your case the customer had 2 strikes against them: a less than optimally configured MSA and, apparently, persistent high I/O requirements. HP EVAs are much more suited to any kind of dedicated high I/O environment, as I'm sure you would agree. If your customer really falls into that category then they should be much happier with the EVA (assuming they have it optimally configured). :)

  7. Rick Scherer
    10:24 am on January 13th, 2009

    I think we're on the same page, pricemc1 – I agree with you that my recent customer was not configured optimally. However, I had another customer (as mentioned earlier) that was configured to best practice and still had performance issues.

    Either way, in the end we both agree — MSA1x00s are good for labs and also minimal SMB use. I'm not sure why they didn't use the EVA to begin with, but it looks like they'll be moving to it now.

  8. pricemc1
    10:29 am on January 13th, 2009

    “Hehe, that’s real funny.
    I guess businesses using NetApp storage are 'ideologically' supporting their environment.”

    I was merely referring to the fact that RAID-DP is essentially RAID-6, and so if you're running stock NetApp setups it is essentially a RAID-6 configuration. No knock against NetApp or anything to that effect. Some folks like the NetApp approach and some folks don't. If you don't like RAID-DP then you're probably not a NetApp customer in the first place.

  9. Paul
    6:55 am on January 14th, 2009

    Hi pricemc1,

    Thanks for the response.

    Not liking RAID-DP, what's not to like about it? Sure, I understand conceptually why it shouldn't be done, why WAFL is theoretically bad for your health and so on.

    Look at it another way then. Imagine a world where only RAID 0 exists. Someone comes along, let's say brand “X”, and offers something new: RAID 5 and RAID 10. “Heh, what's with all this new RAID options stuff – why do we need ‘disk protection’? I say we stick with the way it's always been done.” “Parity, mirroring, it's just not acceptable – maybe for some folk, but not for us, thanks.”

  10. Rick Scherer
    9:19 am on January 14th, 2009

    I agree Paul, RAID-DP is an amazing technology. The concept is so simple: you get the speed and throughput of RAID-5 and the security of RAID-10, but with a lot fewer disks than RAID-10. Also, WAFL is an awesome filesystem. Why would someone ‘not’ want a NetApp? :)

  11. Chad Sakac
    2:36 pm on February 19th, 2009

    I’ll give one answer. RAID-6/RAID-DP has a higher availability envelope than RAID 5, but there’s no escaping

  12. Eric Tam
    11:48 pm on May 21st, 2009

    MSA = Good for Test and Labs

    Totally agreed. The design of the MSA is more like a bunch of SCSI (or SATA) drives in an enclosure with a Fibre/SCSI router.

  13. Eric K. Miller
    11:25 pm on October 13th, 2009

    I’m a little late to the thread, but the MSA1500cs has bugs in the firmware that have never been corrected.

    I started a forum long ago at http://www.msa1500cs.com/ with the hopes of solving the problem, but there seems to be no solution to the crashing problems of the MSA1500cs other than to reduce the number of VMs per LUN to “one”.

    Basically, SCSI I/O Reservation conflicts cause a memory leak, from what I can tell, and ultimately cause the controller to crash with a “table full” message. This is easily reproducible over time, but we were able to do it in less than 2 days by running backups with esXpress against VMs on the same LUN from multiple hosts (thus causing a lot of SCSI I/O reservations).

    I wouldn't blame the problems on the configuration. Many people have run into this problem with perfect, supported (or lack thereof) configurations.

    Eric

  14. Michael
    9:15 am on September 22nd, 2010

    Looks like there was a firmware update on the 5th of May 2010, version 7.20, that fixes the reservation issue (sorry if this has already been mentioned).

  15. Eric K. Miller
    12:16 pm on September 25th, 2010

    Thanks Michael! I wasn’t aware of 7.20 being available.

    They indicate:
    Resolved issue of MSA persistent reservation table becoming full when used with various supported levels of VMware.
    Test Status: Not Verified. Issue could not be reproduced in HP tests.

    Not sure if this means that they “believe” they fixed the problem, or that they fixed the problem and testing afterwards has not shown the issue reproducible in the lab?

    Regardless, I thought the MSA1500cs team had died off since it's such an old unit, but we still run a number of them, so it's nice to see that we might finally (after, what, 7 years?) have a unit that works consistently without reboots to fix the table-full problem. =)

    If anyone has proven that 7.20 truly fixes this problem, please let me know! emiller @ genesishosting.com.

    Thanks!

    Eric
