|
Today's post isn't really a tech post, but more of a discussion starter... And the question is: "Are physical limits insignificant?"
What prompted this was a job last night at one of my customers. We needed to re-patch some ESX servers, and since we didn't want any downtime, we put the hosts in maintenance mode. The hardware was blade enclosures with 16 ESX servers each. While putting all those hosts in maintenance mode and waiting until every VM was VMotioned, I kept time. Fully migrating the whole enclosure took us almost 45 minutes! That's 45 minutes just for migrating all those VMs! And the environment isn't that special, with consolidation ratios of about 20:1 to 30:1.
Anyway, I see a clear trend towards BIGGER, FASTER, LARGER hardware. Just look at Cisco UCS, which allows for a stunning 384GB of memory. Can you imagine how long it would take to VMotion that? Besides that, how big is the impact if you lose one piece of hardware with all those VMs on it (or worse, lose an enclosure with 4, 8 or 16 blades)?
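To put a rough number on that question, here is a back-of-the-envelope sketch. It assumes the full 384GB has to cross the VMotion network exactly once at ideal line rate, which is optimistic: memory pages that change during the copy get re-sent, and real links never reach 100% efficiency. The function name and the `efficiency` parameter are made up for this illustration, and the 1 and 10 Gbit/s link speeds are example values, not measurements from the environment described above.

```python
def transfer_minutes(memory_gb: float, link_gbit_per_s: float,
                     efficiency: float = 1.0) -> float:
    """Minutes needed to copy memory_gb of RAM once over the link.

    Assumption: decimal units (1 GB = 8 Gbit) and a single pass over
    the memory; dirty-page re-copies are ignored.
    """
    gigabits = memory_gb * 8
    seconds = gigabits / (link_gbit_per_s * efficiency)
    return seconds / 60


# Example link speeds (assumed values for illustration).
for link in (1, 10):
    print(f"{link} Gbit/s: {transfer_minutes(384, link):.1f} min")
```

So even in the ideal case a fully loaded 384GB host needs roughly 51 minutes on a 1 Gbit/s link and about 5 minutes on 10 Gbit/s, before any overhead.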
On the other hand, how high is the risk of a piece of hardware failing when you use redundant power supplies, RAID, etc.? To be honest - I've seen large environments and different brands of hardware, and the only things that break once in a while are a management controller, an HDD (which is in a RAID set anyway), or a small fan that can easily be replaced. So while the impact is very high, the risk is very low. Is it OK, then, for us to buy those gigantic machines and run incredible virtualization ratios?
Well, I think other things come into play when we do. One of them is 'human error'. Ever shut down a host or put it in standby without putting it in maintenance mode first? Ever started a firmware update and watched all hosts shut down one by one? Ever pulled out the 'active' cable while you were thinking it was the standby one? Well, I can't say I did those things - but I've seen them happen. And another thing - perhaps there is an SLA that requires you to be able to empty a whole enclosure within a specific time window? OK, OK, that's a bit out of line, or is it? How long would it take to remediate ESX servers using Update Manager?
Anyway, luckily for us, tweaking helps a bit. While the default maximum is 2 VMotions at a time, you are able to tweak this to 12, so things would go somewhat faster. Check it out here on Boche's blog.
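To get a feeling for what the jump from 2 to 12 concurrent VMotions could mean, here is a minimal sketch. It assumes the VMs on a host leave in "waves" of at most `concurrent` migrations and that each wave takes the same time no matter how many migrations run in it. That second assumption is optimistic, because concurrent VMotions share the same network link, which is why "somewhat faster" is probably the honest expectation. The function name, the 25 VMs per host and the 1 minute per VMotion are made-up example values, not measurements.

```python
import math


def evacuation_minutes(vm_count: int, concurrent: int,
                       minutes_per_vmotion: float) -> float:
    """Estimate the time to empty one host.

    Assumption: migrations run in waves of `concurrent` VMs, and every
    wave takes `minutes_per_vmotion` regardless of how full it is.
    """
    waves = math.ceil(vm_count / concurrent)
    return waves * minutes_per_vmotion


# Hypothetical host with 25 VMs, 1 minute per VMotion (assumed values).
for limit in (2, 12):
    print(f"{limit} at a time: {evacuation_minutes(25, limit, 1.0):.0f} min")
```

With these example numbers the host empties in 13 waves at the default limit and in 3 waves at 12, so the best case is roughly a fourfold speed-up per host, not sixfold, and only if the network keeps up.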
Cheers.
|