ha

Mar 06 14:03

Better days Arrive when Dev Meet Ops

A couple of weeks a go Brian Profitt pinged me for a chat about Devops , the result of that chat , his article can now be found on the Zenoss blog, it's titled Datacenter Barometer: Better days arrive when dev meets ops

It's a very nice read with some pointers to places regular readers of my blog should already know ;)
So with lots of leading Open Source infrastructure companies on different levels, such as config management (OpsCode and Reductive Labs) , monitoring (Zenoss) , deployment (openQRM, RPath, and obviously Consultancy companies , the upcoming Devops conferences around the planet promise to be a lot of fun ! ;)

Oh, and apparently there is some more on the story on /.

Dec 04 19:43

Disabling DHCP on a LibVirt setup

So you have this libvirt setup and you want to have a dhcp server on the virtual machines you are playing with , or you want to have all static IP's.

Libvirt uses dnsmasq to provide dhcp services etc and when you generate a config from the gui it will look like

  1. <network>
  2. <name>piponet</name>
  3. <uuid>e87d3bf1-a2e7-96ca-e131-7ae51ac033f9</uuid>
  4. <bridge name='virbr2' stp='on' delay='0' />
  5. <ip address='192.168.100.1' netmask='255.255.255.0'>
  6. <dhcp>
  7. <range start='192.168.100.128' end='192.168.100.254' />
  8. </dhcp>
  9. </ip>
  10. </network>

If you fully remove the dhcp section, then restart libvirt you'll notice dnsmasq running with no dhcpd on that subnet so you'll have full control again :)

  1. <network>
  2. <name>piponet</name>
  3. <uuid>e87d3bf1-a2e7-96ca-e131-7ae51ac033f9</uuid>
  4. <bridge name='virbr2' stp='on' delay='0' />
  5. <ip address='192.168.100.1' netmask='255.255.255.0'>
  6. </ip>
  7. </network>

Technorati Tags:Technorati Tags:
Nov 18 21:05

Got Interviewed

by @botchagalupe
on Virtualization, Open Source tools and DNS Problems

Nov 12 22:01

Yet Another DNS Issue

While browsing trough my enormous mailinglist backlog I ran into the following message from Gianluca Cecchi on the DRBD-user mailing list

guess I`ll have to give Lars a T-Shirt when we next meet ;)

  1. From: Gianluca Cecchi
  2. To: drbd-user@lists.linbit.com
  3. Subject: [DRBD-user] notes on 8.3.2
  4.  
  5.  
  6. - drbdadm create-md r0 segfaults when the command "hostname" on the
  7. server contains the fully qualified domain name but you have put only
  8. the hostname part in drbd.conf
  9. Instead, the command "drbdadm dump" correctly gives you a warning in
  10. this case (suggesting how to correct the error you made....):
  11.  
  12. suppose complete hostname is virtfed.domainname.com and you put
  13. virtfed alone in drbd.conf
  14. [root@virtfed ~]# drbdadm dump
  15. WARN: no normal resources defined for this host (virtfed.domainname.com)!?
  16.  
  17. while
  18. [root@virtfed ~]# drbdadm create-md r0
  19. Segmentation fault

Guess I`ll have to give the Linbit crowd a T-Shirt when we next meet ;)

Technorati Tags:Technorati Tags:
Oct 19 22:46

Nines , Damn Nines and More Nines

Funny how different experiences lead to different evaluations of tools. The MySQL HA solutions the MySQL Performanceblog list, are almost listed in the complete opposited order of what my impressions are.

Ok agreed, I should probably not put my MySQL NDB experiences from 2-3 years ago with multiple Query of deaths and more problems than you into account anymore , but back then went in the list Less stable than a single node. I've had NDB POC setups going down for much more than 05:16 minutes
Ndb comes with a lot of restrictions, there are

As for MySQL on DRBD, I've said this before , I love DRBD, but having to wait for a long InnoDB recovery after a failover just kills your uptime ,
I remember being called by a customer during Fred last holiday who was waiting over 20 minutes for recovery , twice, so putting the DRBD/San setup second would not be my preference. But agreed .. it's only listed at 99.9% meaning almost 9 hours of downtime per year are allowed.

On the other hand we've seen database uptime of MySQL MultiMaster setups with Heartbeat reaching better figures than 99.99% Heck I've seen single nodes achieve better than 99.99% :)

So what does this teach us ... there is no golden rule for HA, lots of situations are different, it's the preferences of the customer, the size of the database, the kind of application , and much
more .. you always need to think and evaluate the environment ...

Technorati Tags:Technorati Tags:
Oct 16 19:53

Heartbeat 2 OpenAIS

While upgrading a pretty recent Heartbeat cluster to OpenAis earlier today I ran into the following weird situation

  1. Last updated: Fri Oct 16 08:50:03 2009
  2. Stack: openais
  3. Current DC: CO_NMS-1 - partition with quorum
  4. Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
  5. 4 Nodes configured, 2 expected votes
  6. 1 Resources configured.
  7. ============
  8.  
  9. Online: [ CO_NMS-1 CO_NMS-2 ]
  10. OFFLINE: [ co_nms-1 co_nms-2 ]

or

  1. crm(live)node# show
  2. co_nms-1(5c48ab4f-767f-e2dc-20ec-5969cddad152): normal
  3. co_nms-2(922ff786-eca9-bed0-d79d-8222727a2c5b): normal
  4. CO_NMS-1: normal
  5. CO_NMS-2: normal

Whohoo.. OpenAIS must have realized I have upperase and lowercase cores :)

Funny to see .. but quickly solved..

Oct 11 19:18

Monitoring MySQL

Ronald Bradford wants to know what kind of Monitoring you use..
He specifically wants to know about Alerting tools

There's different cases , looking at it from a full infrastructure point my current favourite is Zabbix or good old Nagios,

But when looking at it from a debugging perspective you have MySQLAR or Hyperic, but those aren't in the alerting list.

However, when you are building HA clusters, you have custom scripts running either from mon or from pacemaker ..

Still .. Ronald probably wants more input :)

Technorati Tags:Technorati Tags:
Oct 09 11:06

Why learn to type ?

When your machine knows what you mean ..

  1. [s3p-root@XMS-1 tomcat6]# crm configure
  2. crm(live)configure# bye
  3. [s3p-root@XMS-1 tomcat6]# crm confiure
  4. crm(live)configure# bye
  5. [s3p-root@XMS-1 tomcat6]# crm confiture
  6. crm(live)configure# bye
  7. [s3p-root@XMS-1 tomcat6]#

I'd better

  1. apt-get install coffee

Technorati Tags:Technorati Tags:
Jul 01 15:01

DRBD2, OCFS2, Unexplained crashes

I was trying to setup a dual-primary DRBD environment, with a shared disk with either OCFS2 or GFS. The environment is a Centos 5.3 with DRBD82 (but also tried with DRBD83 from testing) .

Setting up a single primary disk and running bonnie++ on it worked Setting up a dual-primary disk, only mounting it on one node (ext3) and running bonnie++ worked

When setting up ocfs2 on the /dev/drbd0 disk and mounting it on both nodes, basic functionality seemed in place but usually less than 5-10 minutes after I start bonnie++ as a test on one of the nodes , both nodes power cycle with no errors in the logfiles, just a crash.

When at the console at the time of crash it looks like a disk IO (you can type , but actions happen) block happens then a reboot, no panics, no oops , nothing. ( sysctl panic values set to timeouts etc )
Setting up a dual-primary disk , with ocfs2 only mounting it on one node and starting bonnie++ causes only that node to crash.

On DRBD level I got the following error when that node disappears

  1. drbd0: PingAck did not arrive in time.
  2. drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure )
  3. pdsk(UpToDate -> DUnknown )
  4. drbd0: asender terminated
  5. drbd0: Terminating asender thread

That however is an expected error because of the reboot.

At first I assumed OCFS2 to be the root of this problem ..so I moved forward and setup an ISCSI target on a 3rd node, and used that device with the same OCFS2 setup. There no crashes occured and bonnie++ flawlessly completed it test run.

So my attention went back to the combination of DRBD and OCFS
I tried both DRBD 8.2 drbd82-8.2.6-1.el5.centos kmod-drbd82-8.2.6-2 and the 83 variant from Centos Testing

At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm verson but upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour

Both the DRBD as the OCFS mailinglist were fairly supportive pointing me out that it was probably OCFS2 fencing both hosts after missing the heartbeat, and suggested increasing the deathtimetimeout values.

I however wanted to confirm that. As I got no entries in syslog I attached a Cyclades err Avocent Terminal server to the device in the hope that I'd capture the last kernel messsages there ... no such luck either.

On the OCFS2 mailinlist people pointed out that i'd use netconsole to catch the logs on a remote node
I set up netconsole using

  1. modprobe netconsole netconsole="@/,@172.16.32.1/"
  2. sysctl -w kernel.printk="7 4 1 7"

After which indeed I catched error on my remote host..

  1. [base-root@CCMT-A ~]# nc -l -u -p 6666
  2. (8,0):o2hb_write_timeout:166 ERROR: Heartbeat write timeout to device
  3. drbd0 after 478000 milliseconds
  4. (8,0):o2hb_stop_all_regions:1873 ERROR: stopping heartbeat on all active
  5. regions.
  6. ocfs2 is very sorry to be fencing this system by restarting

One'd think that it output over Serial console before it log over the network :) It doesn't .

Next step is that I`ll start fiddling some more with the timeout values :) (note the ":)")

Technorati Tags:Technorati Tags:
Feb 02 2009

Everything is a fine whitespace problem ...

Couple of days ago I was working on a Linux Heartbeat v2 setup.
Upon inserting an XML snippet into the cib cib-adm started eating memory fast until the oom killer kicked in.

The environment was running a fairly old heartbeat-2.0.8 version so I upgraded to heartbeat-2.1.4-2.1 and there I got a nice warning that my XML sintax wasn't correct.

There was a whitespace in the XML syntax.

  1. <expression attribute="#replicationvalue" id="is_lagged" operation ="gt" ... ><

Removing the whitespace solves the problem, also on the older version. So the problem is already fixed upstream.. but you might run into it anyhow.