Kris Buytaert's blog

Jul 31 2009

RiverMuse on RHEL/CentOS 5

Yesterday's post about RiverMuse only being available for Fedora Core 9 wasn't even cold yet, and today RiverMuse already announced the availability of their RHEL5 binaries.

Awesome Open Source!

Jul 30 2009

Djagios

Probably every day there's a new Open Source project popping up left or right. Some disappear quickly, others are here to stay.

Enter Djagios, a Nagios configuration tool written in Django. At this time it's still in the development phase, but development is going pretty well now. Djagios was started by fellow Inuit Jochen Maes.

It aims to fill the gap between the vi and emacs Nagios gurus and the managerial types, making Nagios usable for the latter.

We're already rolling out proof-of-concept setups of Djagios at several key customers, so it's definitely here to stay :)

Jul 30 2009

Rivermuse First Impressions

First of all, I don't come from a Tivoli or OpenView background, I have never touched the commercial network monitoring tools and I'm not a network guy. I'm an infrastructure guy with a focus on Open Source platforms, so I have been using Nagios and more recently Zabbix, Zenoss etc. for the better part of the last 2 decades in large to very large environments.
My syslogs go to a central (r)syslog(-ng) server where I frequently abuse grep. So if my experience with RiverMuse is not what it should be, there's work to be done on both sides ;)

So when looking at my RiverMuse setup (in a VirtualBox FC9 guest) my first thought is "Those RiverMuse folks will really need to explain to me what their tool is all about .. as to me it's just a fancy colortail integrated with SNMP traps."

Hopefully it's not just that and it all becomes clear in a couple of days .. Apart from the FC9 annoyance there are the frequent "Unresponsive script" errors.

And those, I fear, will be the real killer problems for RiverMuse.

On the other hand, RiverMuse does a good job of displaying the actual events in your network and following up the actions that one ... after a while you'll get a good overview of the actual issues as opposed to all the relevant events.

I've dropped RiverMuse into my blade test setup (more on that subject later) and I'll be keeping an eye on what I can learn from it, but the dreaded "Unresponsive script" errors that I know so well from Bamboo really need to be fixed :)

Well time will tell :)

Jul 30 2009

KVM vs Virtualbox

So you need that old FC9 instance on your fresh F11 install.

Obviously I started a KVM instance on my desktop and installed FC9 in a virtual machine.
It took long to install, too long, so I checked whether KVM was working correctly.
The kvm module was loaded but not in use ..
And then you remember why you had VirtualBox on that machine before its upgrade: indeed, this machine was too old, it was not a VT-capable machine. VirtualBox performs much better there ..
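A quick way to see whether a box is VT-capable at all is to look for the `vmx` (Intel VT-x) or `svm` (AMD-V) CPU flags in /proc/cpuinfo. The snippet below is only a minimal sketch of that check; the function name `has_hw_virt` is made up for this example, and a flag being present does not guarantee the feature is enabled in the BIOS.

```python
def has_hw_virt(cpuinfo_text: str) -> bool:
    """Return True if a 'flags' line lists vmx (Intel VT-x) or svm (AMD-V).

    cpuinfo_text is expected to be the content of /proc/cpuinfo.
    has_hw_virt is a hypothetical helper name, not part of any tool.
    """
    wanted = {"vmx", "svm"}
    for line in cpuinfo_text.splitlines():
        # Each line looks like "key<tabs>: value"
        key, _, value = line.partition(":")
        if key.strip() == "flags" and wanted & set(value.split()):
            return True
    return False
```

On the machine itself you would feed it `open("/proc/cpuinfo").read()`. If it returns False, KVM has no hardware support to lean on, which would fit the slow install described above.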

But you already have a working, installed QEMU image.
What do you do? Google tells you about vditool but all you find are broken links ..

So you look further and you find that you can use VBoxManage for the same functionality:

  VBoxManage convertdd FC9.bin FC0.vdi

Jul 30 2009

RiverMuse

RiverMuse is the new Fault Management platform on the market. Their initial release had been lurking around the corner for a while, but earlier this week it finally arrived.

Eager to see what all the fuss was about I jumped to their download page to find a yum repository for Fedora Core 9. That's right .. it's July 2009 and RiverMuse released their platform for a Linux distribution that got its End Of Life notice last month.

On a fresh Fedora 11 box you of course get a zillion dependency mismatches.
So over to the source code. After installing some build dependencies I managed to build the desktop-trunk-8.fc11.noarch.rpm and rivermusece-trunk-8.fc11.noarch.rpm RPMs.

However, these depend on an extremely fresh version of rsyslog (>= 4.1.6), whereas Fedora 11 is only on 3.21 and RHEL is even only on 2.0.6.

The initial checkout I had had no README file yet .. so creating a build wasn't really easy ..

In the meantime they have promised builds for other versions.

So the battle is between me finding time to set up an FC9 box somewhere and them releasing fresher RPMs .. I hope they win :)

Jul 06 2009

KVM or Xen

Over at Virtualization.com I asked the crowd what they plan to do when Red Hat finally migrates from Xen to KVM .. you can have your say too.

Jul 01 2009

SquashFS errors during FC11 install

My FC10 to FC11 yum upgrade got stuck on a zillion dependencies .. well actually libssl and mysql from Remi .. but from there it's a whole chain of other things. So I had the great idea to go for the F11 Live CD. The Live CD works like a charm, only upon trying to install it to my OS partition it started failing on me with a bunch of squashfs errors.

It really looks like this F11 Beta blocking issue, which apparently didn't get solved completely after all .. the annoying part is that I can't reproduce the issue anymore, as my system is now fully working .. maybe on my office desktop next week :)

Oh well .. the full DVD ISO got downloaded and burned, and now I can once again enjoy the annoyances of a freshly installed distro: missing packages, configs that have slightly changed etc :) Wonder if sound is working out of the box now :)

Update: It works .. partly ... (no sound in firefox it seems atm..)

Jul 01 2009

Webbased Administration Interfaces

Dear Hardware / Appliance and any other kind of being that creates web interfaces to manage devices remotely. Please keep in mind that plenty of these machines are being managed from behind an ssh tunnel, so people often connect to them using http://localhost:8080/ where local port 8080 is mapped, e.g. via ssh, to port 80 on the remote device.

If your application does anything close to rewriting the localhost:8080 part to localhost it will fail, and I as a potential purchaser of your device will vote against buying your stuff.
(This is especially annoying if I want to tunnel 16 KVM/IP boxen to my localhost ports 8001 to 8016.)
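To make the complaint concrete, here is a small sketch of the difference between a redirect that survives a tunnel and one that doesn't. Both function names are invented for this example and don't come from any particular device or framework; the point is only that the redirect target should be built from the Host header the client actually sent (or simply be a relative URL) rather than from what the device thinks its own name and port are.

```python
def redirect_location(host_header: str, path: str, scheme: str = "http") -> str:
    """Build an absolute redirect URL that keeps the host:port the client used.

    host_header is the raw value of the HTTP Host header, e.g. "localhost:8080"
    when the request arrived through an ssh tunnel. Hypothetical helper.
    """
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{host_header}{path}"


def naive_redirect_location(host_header: str, path: str) -> str:
    """What not to do: drop the port and assume the default one.

    Through a tunnel this sends the browser to port 80 on the admin's own
    machine instead of back into the tunnel.
    """
    hostname = host_header.split(":")[0]
    return f"http://{hostname}{path}"
```

With a Host header of `localhost:8080` the first function sends the browser back through the tunnel, while the second points it at `http://localhost/login`, where nothing of yours is listening.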

You of course score good points for supporting an ssh connection to that same device that allows me to do everything I actually want to do :)

Jul 01 2009

DRBD, OCFS2, Unexplained crashes

I was trying to set up a dual-primary DRBD environment, with a shared disk with either OCFS2 or GFS. The environment is CentOS 5.3 with DRBD82 (but I also tried with DRBD83 from testing).

Setting up a single-primary disk and running bonnie++ on it worked. Setting up a dual-primary disk, mounting it on only one node (ext3) and running bonnie++ worked too.

When setting up OCFS2 on the /dev/drbd0 disk and mounting it on both nodes, basic functionality seemed in place, but usually less than 5-10 minutes after I started bonnie++ as a test on one of the nodes, both nodes power cycled with no errors in the logfiles, just a crash.

When at the console at the time of the crash it looks like disk I/O blocks (you can type, but no actions happen), then a reboot follows: no panics, no oops, nothing. (sysctl panic values set to timeouts etc.)
Setting up a dual-primary disk with OCFS2, mounting it on only one node and starting bonnie++ causes only that node to crash.

At the DRBD level I got the following errors when that node disappears:

  drbd0: PingAck did not arrive in time.
  drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(UpToDate -> DUnknown )
  drbd0: asender terminated
  drbd0: Terminating asender thread

That however is an expected error because of the reboot.

At first I assumed OCFS2 to be the root of this problem .. so I moved forward and set up an iSCSI target on a 3rd node, and used that device with the same OCFS2 setup. There no crashes occurred and bonnie++ flawlessly completed its test run.

So my attention went back to the combination of DRBD and OCFS2.
I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2) and the 8.3 variant from CentOS Testing.

At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm version, but upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.

Both the DRBD and the OCFS2 mailing lists were fairly supportive, pointing out that it was probably OCFS2 fencing both hosts after missing the heartbeat, and suggesting I increase the deathtimetimeout values.

I however wanted to confirm that. As I got no entries in syslog, I attached a Cyclades err Avocent terminal server to the device in the hope that I'd capture the last kernel messages there ... no such luck either.

On the OCFS2 mailing list people pointed out that I could use netconsole to catch the logs on a remote node.
I set up netconsole using:

  modprobe netconsole netconsole="@/,@172.16.32.1/"
  sysctl -w kernel.printk="7 4 1 7"
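That netconsole string looks cryptic because nearly every field is left empty. As far as I read the kernel's netconsole documentation, the format is `[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`, with empty fields falling back to defaults (source port 6665, target port 6666, a broadcast MAC). The helper below is just a sketch that assembles such a string; `netconsole_param` is a name I made up, and any address other than 172.16.32.1 is purely illustrative.

```python
def netconsole_param(target_ip, target_port=None, source_ip="",
                     source_port=None, device="", target_mac=""):
    """Assemble a netconsole= module parameter value.

    Format assumed from the kernel's netconsole documentation:
        [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]
    Fields left empty fall back to the kernel's defaults.
    netconsole_param is a hypothetical helper name.
    """
    src_port = "" if source_port is None else str(source_port)
    tgt_port = "" if target_port is None else str(target_port)
    source = f"{src_port}@{source_ip}/{device}"
    target = f"{tgt_port}@{target_ip}/{target_mac}"
    return f"{source},{target}"
```

Calling `netconsole_param("172.16.32.1")` gives back exactly the `@/,@172.16.32.1/` used above: everything on defaults, only the receiving host filled in.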

After which I indeed caught the error on my remote host:

  [base-root@CCMT-A ~]# nc -l -u -p 6666
  (8,0):o2hb_write_timeout:166 ERROR: Heartbeat write timeout to device drbd0 after 478000 milliseconds
  (8,0):o2hb_stop_all_regions:1873 ERROR: stopping heartbeat on all active regions.
  ocfs2 is very sorry to be fencing this system by restarting
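The number in that heartbeat message can be tied back to the o2cb configuration. Assuming the relation given in the OCFS2 documentation, timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, the sketch below pulls the device and timeout out of the message (taken as one line) and converts between the two. The function names are invented for this example.

```python
import re

# Matches the o2hb write timeout message as it appears in the capture above.
TIMEOUT_RE = re.compile(
    r"o2hb_write_timeout:\d+ ERROR: Heartbeat write timeout "
    r"to device (\S+) after (\d+) milliseconds"
)


def parse_write_timeout(line):
    """Return (device, timeout_ms) from a heartbeat timeout line, or None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def timeout_for_threshold(threshold):
    """Heartbeat timeout in ms, assuming (threshold - 1) * 2 seconds."""
    return (threshold - 1) * 2000


def threshold_for_timeout(timeout_ms):
    """Inverse of timeout_for_threshold, under the same assumption."""
    return timeout_ms // 2000 + 1
```

Under that formula the 478000 milliseconds in the message corresponds to a threshold of 240, and a threshold of 31 corresponds to 60 seconds.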

You'd think that it would output to the serial console before it logs over the network :) It doesn't.

Next step is that I'll start fiddling some more with the timeout values :) (note the ":)")

Jun 29 2009

Diaper Needs Service Problem

Late last Saturday, Sandy gave birth to our 2nd daughter, Amber.
Pics etc. are on her own site.

So we'll be changing diapers of 2 little Buytaert kids for a while :)

PS. Craig from O'ReillyGMT gets the credit for inventing the new DNS acronym.