heartbeat

Nov 18 21:05

Got Interviewed

by @botchagalupe
on Virtualization, Open Source tools and DNS Problems

Oct 16 19:53

Heartbeat 2 OpenAIS

While upgrading a pretty recent Heartbeat cluster to OpenAIS earlier today I ran into the following weird situation:

  Last updated: Fri Oct 16 08:50:03 2009
  Stack: openais
  Current DC: CO_NMS-1 - partition with quorum
  Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
  4 Nodes configured, 2 expected votes
  1 Resources configured.
  ============

  Online: [ CO_NMS-1 CO_NMS-2 ]
  OFFLINE: [ co_nms-1 co_nms-2 ]

or

  crm(live)node# show
  co_nms-1(5c48ab4f-767f-e2dc-20ec-5969cddad152): normal
  co_nms-2(922ff786-eca9-bed0-d79d-8222727a2c5b): normal
  CO_NMS-1: normal
  CO_NMS-2: normal

Whohoo.. OpenAIS must have realized I have uppercase and lowercase cores :)

Funny to see .. but quickly solved..
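
For anyone who wants to spot this before it shows up in the cluster status, here is a minimal sketch of a check for node names that differ only in case. It is not part of Heartbeat, OpenAIS or the crm shell; the function name `case_collisions` is made up for illustration, and the node list is simply copied from the output above.

```python
from collections import defaultdict

def case_collisions(node_names):
    """Group node names that are identical apart from upper/lower case."""
    groups = defaultdict(set)
    for name in node_names:
        groups[name.lower()].add(name)
    # Keep only the groups where more than one spelling exists.
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

# Node names as listed by the cluster in this post.
nodes = ["co_nms-1", "co_nms-2", "CO_NMS-1", "CO_NMS-2"]

for key, variants in case_collisions(nodes).items():
    print(key, "->", variants)
```

Any name reported here is one the cluster may end up treating as two separate nodes.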

Feb 02 2009

Everything is a fine whitespace problem ...

A couple of days ago I was working on a Linux Heartbeat v2 setup.
Upon inserting an XML snippet into the CIB, cibadmin started eating memory fast until the OOM killer kicked in.

The environment was running a fairly old heartbeat-2.0.8 version, so I upgraded to heartbeat-2.1.4-2.1 and there I got a nice warning that my XML syntax wasn't correct.

There was a stray whitespace in the XML, between an attribute name and its equals sign:

  <expression attribute="#replicationvalue" id="is_lagged" operation ="gt" ... ><

Removing the whitespace solves the problem, also on the older version. So the problem is already fixed upstream.. but you might run into it anyhow.
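
If you want to catch this before feeding a snippet to the CIB, a plain text check is enough. The following is a minimal sketch, not something that ships with Heartbeat: the function name `attrs_with_stray_whitespace` and the regular expression are my own, the sample string is a shortened version of the snippet above, and the check is deliberately naive (it does not parse the XML and could be fooled by an equals sign inside an attribute value).

```python
import re

# An attribute name, optional whitespace, '=', optional whitespace, opening quote.
ATTR_ASSIGN = re.compile(r'([\w:.-]+)(\s*)=(\s*)["\']')

def attrs_with_stray_whitespace(xml_text):
    """Return attribute names that have whitespace around their '=' sign."""
    return [
        match.group(1)
        for match in ATTR_ASSIGN.finditer(xml_text)
        if match.group(2) or match.group(3)
    ]

# Shortened, hypothetical version of the expression from this post.
snippet = '<expression attribute="#replicationvalue" id="is_lagged" operation ="gt"/>'

print(attrs_with_stray_whitespace(snippet))
```

An empty list means no attribute in the text has whitespace around its equals sign.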

Sep 24 2008

Bug in ifconfig ?

So earlier this week I ran into the weirdest problem with Linux-HA. Heartbeat was happily adding an IP address as an active resource to one of my nodes when needed, but upon removal it failed to remove the IP from the stack. Further debugging showed that the Heartbeat scripts claimed the IP wasn't on the actual stack.

It was.. but the output from ifconfig was different from what Heartbeat expected it to be.

Heartbeat checks the output of ifconfig and expects to find the IP address it added itself on a :0 or similar interface. Now ifconfig only seems to output the first 9 characters of the interface name. That means that when you have an interface called eth0:0 the output lists it perfectly, and Heartbeat is smart enough to remove the IP again when the node goes to standby. If however you have a VLAN with 3 digits on a bond interface, Heartbeat will add :0 to bond0.129. The Heartbeat resource will add the IP address perfectly, but upon checking all the :0 interfaces the bond0.129:0 interface won't be found, as ifconfig outputs it as bond0.129. The result is a potentially painful situation where 2 nodes still share an IP address.
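
The effect is easy to reproduce on paper. Below is a minimal sketch that only models the truncation; it does not call ifconfig or any Heartbeat script. The width of 9 is an assumption taken from the fact that bond0.129:0 shows up as bond0.129, and the names `displayed_name` and `alias_visible` are invented for this illustration.

```python
# Assumption: the listing prints at most this many characters of the name.
IFCONFIG_NAME_WIDTH = 9

def displayed_name(iface, width=IFCONFIG_NAME_WIDTH):
    """Return the interface label as a fixed-width listing would print it."""
    return iface[:width]

def alias_visible(iface, alias_index=0, width=IFCONFIG_NAME_WIDTH):
    """True if the alias label 'iface:N' survives the truncation intact."""
    label = f"{iface}:{alias_index}"
    return displayed_name(label, width) == label

for base in ("eth0", "bond0", "bond0.129"):
    label = f"{base}:0"
    print(label, "shown as", displayed_name(label), "visible:", alias_visible(base))
```

Any base name longer than the width minus two characters loses its :0 suffix, so a script that greps the listing for the alias will never find it.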

So where's the actual problem: ifconfig or Heartbeat? I'd say both, but the easiest fix will be in Heartbeat. After all, there are other preferred ways of adding an IP address to an interface. ip addr add comes to mind :)

So I filed a bug report :)

Feb 06 2008

It's February again

It seems like for the past 4 years February is the month that O'Reilly really loves me and decides to publish one of my articles.

This year's version was co-written with my colleague Johan Huysmans and tackles the creation of Highly Available Gateways.

Although every HA situation is different and this is a pretty easy setup, it's a good start for other setups.

Enjoy the read

PS. Yes, I know, in 2006 I also had a January article :)

Oct 17 2007

Virtual Machine Replication

I don't know on which planet I have been for the past couple of years, days or hours, but since when do VMware’s VMotion, XenSource’s XenMotion or Virtual Iron’s Virtual Iron support Replication?

Live Migration yes, but Replication, no.

I have discussed these kinds of technologies with Mark and Vincent, Moshe and others a zillion times already.. Continuously mirroring or realtime replication of a virtual machine is really difficult to do. And I haven't heard of a working scalable solution yet .. (Shared memory issues such as we had with openMosix are still amongst the issues to be tackled)

Live Replication would mean that you mirror the full state of your virtual machine in realtime to another running virtual machine. Every piece of disk, memory and screen you are using has to be replicated to the other side of the wire in realtime. Yes, you can take snapshots of filesystems and checkpoints of virtual machines. But continuous checkpointing over the network, I'd love to see that.. (outside of a lab)

So with a promise like that .. our good friends the CIOs will be dreaming, and the vendors will be blamed for not delivering what was promised to them.

But on the subject of using just Live Migration features as an alternative for a real High Availability solution: I know different vendors are singing this song, but it's a bad one.

Using Live Migration in your infrastructure will give you the opportunity to move your applications away from a badly behaving machine when you notice it starts behaving badly, hence giving you a better overall uptime. If however you don't notice the machine is failing, or if it just suddenly stops working, or if your application crashes, you are out of luck.
Live Migration won't work anymore since you are too late: you can't migrate a machine that's dead. The only thing you can do is quickly redeploy your virtual machine on another node, which for me doesn't really qualify as a Clustered or HA solution.

Real HA looks at all the aspects of an application: the state of the application, the state of the server it is running on and the state of the network it is connected to. It has an alternative ready if any of these aspects fail. Session data is replicated, data storage is done redundantly and your network has multiple paths. If your monitoring decides something went wrong, an alternative should take over with no visible interruption for the end user. You don't have to wait till your application is restarted on the other side of the network, you don't have to wait till your virtual machine is rebooted, your filesystems are rechecked and your database has recovered. No, it happens right away.

But Virtual Machine Replication as an alternative for HA? I'd call that wishful thinking and vapourware today.

Sep 05 2007

LinuxConference Europe 2007 5/X

So today is the last day of LinuxConference Europe. Down the stairs in the same building there are a bunch of weirdos sitting at round tables for some highly elite and secret meeting, also known as the KernelSummit.

I just heard someone say that they are figuring out which new bug they are planning to introduce into the new kernels.

I'm in LMB's tutorial on Linux HA, so I'll be musing about one of my favourite topics today :)
Or I'll just pay attention ;)

I'm wondering why Lars just modified one of his slides... maybe I'll ask him over lunch...