linux-ha

Nov 18 21:05

Got Interviewed

by @botchagalupe
on Virtualization, Open Source tools and DNS Problems

Oct 19 22:46

Nines, Damn Nines and More Nines

Funny how different experiences lead to different evaluations of tools. The MySQL HA solutions that the MySQL Performance Blog lists are ranked in almost exactly the opposite order of my own impressions.

OK, agreed, I should probably not take my MySQL NDB experiences from 2-3 years ago, with multiple queries of death and more problems than you, into account anymore, but back then NDB went on my list as less stable than a single node. I've had NDB POC setups going down for much more than 05:16 minutes.
NDB comes with a lot of restrictions, there are

As for MySQL on DRBD, I've said this before: I love DRBD, but having to wait for a long InnoDB recovery after a failover just kills your uptime.
I remember being called during Fred's last holiday by a customer who was waiting over 20 minutes for recovery, twice, so putting the DRBD/SAN setup second would not be my preference. But agreed, it's only listed at 99.9%, meaning almost 9 hours of downtime per year are allowed.
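For anyone who wants to check the arithmetic behind those nines, here is a minimal sketch in Python. The function name and the 365-day year are my own assumptions for illustration, not something taken from the Performance Blog list.

```python
# Downtime budget per year for a given availability percentage.
# Assumption: a plain 365-day year, leap years ignored.
SECONDS_PER_YEAR = 365 * 24 * 3600


def allowed_downtime_seconds(availability_percent):
    """Return how many seconds per year a service may be down."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100.0)


if __name__ == "__main__":
    for nines in (99.9, 99.99, 99.999):
        seconds = allowed_downtime_seconds(nines)
        print("%7.3f%% -> %8.2f minutes (%.2f hours) per year"
              % (nines, seconds / 60.0, seconds / 3600.0))
```

At 99.9% the budget works out to roughly 8.76 hours per year. Two 20-minute InnoDB recoveries add up to 2400 seconds, which fits in that budget but would eat most of a 99.99% budget.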

On the other hand, we've seen database uptime of MySQL MultiMaster setups with Heartbeat reaching better figures than 99.99%. Heck, I've seen single nodes achieve better than 99.99% :)

So what does this teach us? There is no golden rule for HA. Every situation is different: it depends on the preferences of the customer, the size of the database, the kind of application, and much more. You always need to think and evaluate the environment.

Feb 02 2009

Everything is a fine whitespace problem ...

A couple of days ago I was working on a Linux Heartbeat v2 setup.
Upon inserting an XML snippet into the CIB, cibadmin started eating memory fast until the OOM killer kicked in.

The environment was running a fairly old heartbeat-2.0.8 version, so I upgraded to heartbeat-2.1.4-2.1, and there I got a nice warning that my XML syntax wasn't correct.

There was a stray whitespace in the XML, between an attribute name and its equals sign.

  <expression attribute="#replicationvalue" id="is_lagged" operation ="gt" ... >

Removing the whitespace solves the problem, also on the older version. So the problem is already fixed upstream, but you might run into it anyhow.
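If you want to catch this before feeding a snippet to an old Heartbeat, a small pre-flight check could look like the sketch below. It is plain Python and not part of Heartbeat; the function name is made up, the `value="10"` attribute in the usage example is a placeholder, and it assumes the snippet contains only tags whose attribute values hold no `<` or `>`.

```python
import re

# Quoted attribute values, so an '=' inside a value can be ignored.
_QUOTED = re.compile(r'"[^"]*"|\'[^\']*\'')
# Anything that looks like a tag (assumes no '<' or '>' inside values).
_TAG = re.compile(r'<[^<>]*>')
# Whitespace directly before or after an equals sign.
_SPACED_EQ = re.compile(r'\s+=|=\s+')


def find_spaced_attributes(xml_text):
    """Return the tags in which an '=' has whitespace next to it."""
    suspicious = []
    for match in _TAG.finditer(xml_text):
        tag = match.group(0)
        # Blank out the values first, then look at what is left.
        skeleton = _QUOTED.sub('""', tag)
        if _SPACED_EQ.search(skeleton):
            suspicious.append(tag)
    return suspicious


if __name__ == "__main__":
    # value="10" is a hypothetical placeholder for the elided attributes.
    bad = '<expression attribute="#replicationvalue" id="is_lagged" operation ="gt" value="10"/>'
    for tag in find_spaced_attributes(bad):
        print("whitespace around '=' in:", tag)
```

An empty result only means this particular pattern was not found; it says nothing about whether the rest of the snippet is valid for the CIB.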

Oct 25 2008

Wholesale High Availability

Alan just coined the term WholeSale HA: the idea of rebooting a whole virtual machine rather than just failing over one service.

He wants to have the best of both worlds in one framework. He doesn't, however, specify which parts he likes from the WholeSale HA setup.

Yes, you want to use it coupled with hardware predictive failure analysis tools in order to achieve higher availability, but I don't think the WholeSale HA part is real HA.

WholeSale HA isn't going to be fast enough for most business-critical environments.
You simply cannot afford to reboot, or even boot, a full machine, given the downtime that brings for your service.

So yes, a best-effort combination, but with a strong focus on the application state, would be preferred. WholeSale is a good start, but it's definitely not where you want to stop.

Sep 24 2008

Bug in ifconfig ?

So earlier this week I ran into the weirdest problem with Linux-HA. Heartbeat was happily adding an IP address as an active resource to one of my nodes when needed, but upon removal it failed to remove the IP from the stack. Further debugging showed that the Heartbeat scripts claimed the IP wasn't on the actual stack.

It was, but the output from ifconfig was different from what Heartbeat expected it to be.

Heartbeat checks the output of ifconfig and expects to find the IP address it added itself on a :0 or similar interface. Now ifconfig only seems to print the first 8 or so characters of the interface name. That means that when you have an interface called eth0:0, the output lists it perfectly and Heartbeat is smart enough to remove the IP again when the node goes to standby. If, however, you have a VLAN with 3 digits on a bond interface, Heartbeat will add :0 to bond0.129. The Heartbeat resource will add the IP address perfectly, but upon checking all the :0 interfaces, the bond0.129:0 interface won't be found, as ifconfig outputs it as bond0.129. The result is a potentially painful situation where 2 nodes still share an IP address.
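The effect is easy to reproduce without a cluster. Below is a rough Python sketch of a fixed-width name column; the helper names are hypothetical, and the width of 9 is an assumption picked to match the bond0.129 output described above, since the real width depends on the ifconfig build.

```python
def displayed_name(ifname, width=9):
    """Mimic a fixed-width name column that cuts off longer names.

    The width is an assumption, not a value read from ifconfig itself.
    """
    return ifname[:width]


def alias_visible(base, alias_index, width=9):
    """Would a script that greps the listing for 'base:N' still find it?"""
    alias = "%s:%d" % (base, alias_index)
    return displayed_name(alias, width) == alias


if __name__ == "__main__":
    for base in ("eth0", "bond0", "bond0.129"):
        alias = "%s:0" % base
        print("%-12s shown as %-10s found: %s"
              % (alias, displayed_name(alias), alias_visible(base, 0)))
```

With these assumptions eth0:0 survives intact, while bond0.129:0 is cut back to bond0.129, so a lookup for the :0 alias comes up empty even though the address is still configured.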

So where's the actual problem: ifconfig or Heartbeat? I'd say both, but the easiest fix will be in Heartbeat. After all, there are other, preferred ways of adding an IP address to an interface. ip addr add comes to mind :)

So I filed a bug report :)

Dec 17 2007

Integrating HA and Virtualization

Alan Robertson is discussing how Managed Virtualization (including HA) conflicts with System Management.

He has some interesting points regarding managing infrastructure. In his vision there are just too many layers that don't talk to each other.
He also points out some of the issues with CIM and SNMP.

Alan thinks the ideal way to go is to have your HA solution manage your Virtualization also.

I'm wondering if this doesn't add too much complexity.

If you are already making sure the services in your virtual machines are highly available, then why would you want to add another layer of complexity? Surely the idea of being able to migrate virtual machines around sounds tempting, but do we really need that extra layer?

I've explained that migrating a virtual machine to another server won't help you when your apps crash or when your physical server fails.

But keeping an overview of which services are running where from 1 place seems like an interesting idea.

I've been toying, however, with the idea of using the resource concept of Linux-HA to serve another purpose than pure high availability. You might want to use its constraints to define how many virtual machines should run on a certain physical machine and how many resources they can use there, and hence create a load-balancing infrastructure with it.
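To make that idea a bit more concrete, here is a toy Python sketch of placing virtual machines under per-host constraints. It does not use any Linux-HA API; the host names, limits, and the greedy "most free memory first" rule are all assumptions for illustration.

```python
def place_vms(hosts, vms):
    """Greedy placement of VMs on hosts under simple constraints.

    hosts: {host: {"max_vms": int, "memory_mb": int}}  (hypothetical limits)
    vms:   {vm: memory_mb needed}
    Returns {vm: host, or None when no host can take it}.
    """
    free = {h: spec["memory_mb"] for h, spec in hosts.items()}
    count = {h: 0 for h in hosts}
    placement = {}
    # Place the largest VMs first so they are not squeezed out later.
    for vm, need in sorted(vms.items(), key=lambda kv: (-kv[1], kv[0])):
        candidates = [h for h in hosts
                      if count[h] < hosts[h]["max_vms"] and free[h] >= need]
        if not candidates:
            placement[vm] = None
            continue
        # Prefer the host with the most free memory; break ties by name.
        best = min(candidates, key=lambda h: (-free[h], h))
        placement[vm] = best
        free[best] -= need
        count[best] += 1
    return placement


if __name__ == "__main__":
    hosts = {"node1": {"max_vms": 2, "memory_mb": 4096},
             "node2": {"max_vms": 2, "memory_mb": 4096}}
    vms = {"web": 2048, "db": 3072, "mail": 1024}
    for vm, host in sorted(place_vms(hosts, vms).items()):
        print(vm, "->", host)
```

In a real cluster the same limits would have to be expressed as resource constraints and evaluated by the cluster manager; this only shows the kind of decision it would be making.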

(I'm really, really hoping someone now replies to this with a URL which gives me a HAResource that does Live Migration :))

Oct 17 2007

Virtual Machine Replication

I don't know on which planet I have been for the past couple of years, days or hours, but since when do VMware’s Vmotion, XenSource’s Xenmotion or Virtual Iron’s Virtual Iron support Replication?

Live Migration, yes, but Replication, no.

I have discussed these kinds of technologies with Mark and Vincent, Moshe and others a zillion times already. Continuous mirroring or realtime replication of a virtual machine is really difficult to do, and I haven't heard of a working, scalable solution yet. (Shared memory issues such as we had with openMosix are still among the issues to be tackled.)

Live Replication would mean that you mirror the full state of your virtual machine in realtime to another running virtual machine. Every piece of disk, memory and screen you are using has to be replicated to the other side of the wire in realtime. Yes, you can take snapshots of filesystems and checkpoints of virtual machines. But continuous checkpointing over the network? I'd love to see that (outside of a lab).

So with a promise like that, our good friends the CIOs will be dreaming, and the vendors will be blamed for not delivering what was promised to them.

But on the subject of using just Live Migration features as an alternative for a real High Availability solution: I know different vendors are singing this song, but it's a bad one.

Using Live Migration in your infrastructure will give you the opportunity to move your applications away from a badly behaving machine when you notice it starts behaving badly, hence giving you a better overall uptime. If, however, you don't notice the machine is failing, or if it just suddenly stops working, or if your application crashes, you are out of luck.
Live Migration won't work anymore since you are too late: you can't migrate a machine that's dead. The only thing you can do is quickly redeploy your virtual machine on another node, which for me doesn't really qualify as a Clustered or HA solution.

Real HA looks at all the aspects of an application: the state of the application, the state of the server it is running on and the state of the network it is connected to. It has an alternative ready if any of these aspects fail. Session data is replicated, data storage is done redundantly and your network has multiple paths. If your monitoring decides something went wrong, another alternative should take over with no visible interruption for the end user. You don't have to wait till your application is restarted on the other side of the network, and you don't have to wait till your virtual machine is rebooted, your filesystems are rechecked and your database has recovered. No, it happens right away.

But Virtual Machine Replication as an alternative for HA? I'd call that wishful thinking and vapourware today.