Jul 17 2011

Drupal and Configuration Mgmt, we're getting there ...

For those who haven't noticed yet .. I`m into devops .. I`m also a little bit into Drupal, (blame my last name..) , so one of the frustrations I've been having with Drupal (an much other software) is the automation of deployment and upgrades of Drupal sites ...

So for the past couple of days I've been trying to catch up to the ongoing discussion regarding the results of the configuration mgmt sprint , I've been looking at it mainly from a systems point of view , being with the use of Puppet/ Chef or similar tools in mind .. I know I`m late to the discussion but hey , some people take holidays in this season :) So below you can read a bunch of my comments ... and thoughts on the topic ..

First of all , to me JSON looks like a valid option.
Initially there was the plan to wrap the JSON in a PHP header for "security" reasons, but that seems to be gone even while nobody mentioned the problems that would have been caused for external configuration management tools.
When thinking about external tools that should be capable of mangling the file plenty of them support JSON but won't be able to recognize a JSON file with a weird header ( thinking e.g about Augeas (augeas.net) , I`m not talking about IDE's , GUI's etc here, I`m talking about system level tools and libraries that are designed to mangle standard files. For Augeas we could create a separate lens to manage these files , but other tools might have bigger problems with the concept.

As catch suggest a clean .htaccess should be capable of preventing people to access the .json files There's other methods to figure out if files have been tampered with , not sure if this even fits within Drupal (I`m thinking about reusing existing CA setups rather than having yet another security setup to manage) ,

In general to me tools such as puppet should be capable of modifying config files , and then activating that config with no human interaction required , obviously drush is a good candidate here to trigger the system after the config files have been change, but unlike some people think having to browse to a web page to confirm the changes is not an acceptable solution. Just think about having to do this on multiple environments ... manual actions are error prone..

Apart from that I also think the storing of the certificates should not be part of the file. What about a meta file with the appropriate checksums ? (Also if I`m using Puppet or any other tool to manage my config files then the security , preventing to tamper these files, is already covered by the configuration management tools, I do understand that people want to build Drupal in the most secure way possible, but I don't think this belongs in any web application.

When I look at other similar discussions that wanted to provide a similar secure setup they ran into a lot of end user problems with these kind of setups, an alternative approach is to make this configurable and or plugable. The default approach should be to have it enable, but the more experienced users should have the opportunity to disable this, or replace it with another framework. Making it plugable upfront solves a lot of hassle later.

Someone in the discussion noted :
"One simple suggestion for enhancing security might be to make it possible to omit the secret key file and require the user to enter the key into the UI or drush in order to load configuration from disk."

Requiring the user to enter a key in the UI or drush would be counterproductive in the goal one wants to achieve, the last thing you want as a requirement is manual/human interaction when automating setups. therefore a feature like this should never be implemented

Luckily there seems to be new idea around that doesn't plan on using a raped json file
instead of storing the config files in a standard place, we store them in a directory that is named using a hash of your site's private key, like sites/default/config_723fd490de3fb7203c3a408abee8c0bf3c2d302392. The files in this directory would still be protected via .htaccess/web.config, but if that protection failed then the files would still be essentially impossible to find. This means we could store pure, native .json files everywhere instead, to still bring the benefits of JSON (human editable, syntax checkable, interoperability with external configuration management tools, native + speedy encoding/decoding functions), without the confusing and controversial PHP wrapper.

Figuring out the directory name for the configs from a configuration mgmt tool then could be done by something similar to

  1. cd sites/default/conf/$(ls sites/default/conf|head -1)

In general I think the proposed setup looks acceptable , it definitely goes in the right direction of providing systems people with a way to automate the deployment of Drupal sites and applications at scale.

I`ll be keeping a eye on both the direction they are heading into and the evolution of the code !

Jun 10 2011

The case for Augeas

Ever since I met David Lutterkort over steaks at OLS 2007 augeas was this tool in the back of my mind that I couldn't place... I never saw the need for it... or it seemed to be huge overkill for the problem that needed solving .

Till I ran into sipxecs rewriting XML files on the fly .. and putting values in their XML that I could not trace back to an original source. As of Augeas 0.8.x there's an XML lens out there.

Digging innot blah.xml with augtool you can do stuff like

  1. set /augeas/load/Xml/incl[3] /tmp/blah.xml
  2. set /augeas/load/Xml/lens Xml.lns
  3. load
  4. print /files/tmp/blah.xml/profile/settings/param[17]/
  5. /files/tmp/blah.xml/profile/settings/param[17] = "#empty"
  6. /files/tmp/blah.xml/profile/settings/param[17]/#attribute
  7. /files/tmp/blah.xml/profile/settings/param[17]/#attribute/name = "sip-ip"
  8. /files/tmp/blah.xml/profile/settings/param[17]/#attribute/value = "10.255.202.90"
  9. augtool> print /files/tmp/blah.xml/profile/settings/param[18]/
  10. /files/tmp/blah.xml/profile/settings/param[18] = "#empty"
  11. /files/tmp/blah.xml/profile/settings/param[18]/#attribute
  12. /files/tmp/blah.xml/profile/settings/param[18]/#attribute/name = "ext-rtp-ip"
  13. /files/tmp/blah.xml/profile/settings/param[18]/#attribute/value = "auto-nat"
  14. augtool> print /files/tmp/blah.xml/profile/settings/param[16]/
  15. /files/tmp/blah.xml/profile/settings/param[16] = "#empty"
  16. /files/tmp/blah.xml/profile/settings/param[16]/#attribute
  17. /files/tmp/blah.xml/profile/settings/param[16]/#attribute/name = "rtp-ip"
  18. /files/tmp/blah.xml/profile/settings/param[16]/#attribute/value = "10.255.202.90"

and get and set

  1. augtool> get /files/etc/sipxpbx/freeswitch/conf/sip_profiles/sipX_profile.xml/profile/settings/param[17]/#attribute/value
  2. /files/etc/sipxpbx/freeswitch/conf/sip_profiles/sipX_profile.xml/profile/settings/param[17]/#attribute/value = 10.255.202.90
  3. augtool> set /files/etc/sipxpbx/freeswitch/conf/sip_profiles/sipX_profile.xml/profile/settings/param[16]/#attribute/value 10.0.0.2

Putting that into puppet however isn't that tvivial .

When you try to do this

  1. augeas{"sipxprofile" :
  2. changes => [
  3. "set /augeas/load/Xml/incl[last()+1] /etc/sipxpbx/freeswitch/conf/sip_profiles/sipX_profile.xml",
  4. "set /files/etc/sipxpbx/freeswitch/conf/sip_profiles/sipX_profile.xml/profile/settings/param[16]/#attribute/value 10.0.0.2",
  5. "set /files/etc/sipxpbx/freeswitch/conf/sip_profiles/sipX_profile.xml/profile/settings/param[17]/#attribute/value 10.0.0.2",
  6. ],
  7. }

Puppet really doesn't output what you want to do ... it only outputs the snippet you modify ..
It's the load statement above that is the really important piece but puppet can't directly work with that so you need to go around that using
The way to solve this is

  1. augeas{"sipxprofile" :
  2. lens => "Xml.lns",
  3. incl => "/etc/sipxpbx/freeswitch/conf/sip_profiles/sipX_profile.xml",
  4. context => "/files/etc/sipxpbx/freeswitch/conf/sip_profiles/sipX_profile.xml",
  5. changes => [
  6. "set profile/settings/param[16]/#attribute/value $ipaddress",
  7. "set profile/settings/param[17]/#attribute/value $ipaddress",
  8. ],
  9. onlyif => "get profile/settings/param[16]/#attribute/value != $ipaddress",
  10. }

Jun 08 2011

Killing VirtualBox

Note to self :
when virtualbox acts up again .

  1. [sdog@stillmine ~]$ VBoxManage list runningvms
  2. "vagrant-etherpad_1307533136" {60697a54-35fc-44e7-96b6-008cde6995a6}
  3. [sdog@stillmine ~]$ VBoxManage controlvm vagrant-etherpad_1307533136 poweroff
  4. 0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

May 25 2011

Beyond Configuration Mgmt

(This post has been sitting in the drafts folder for way to long, I decided to push the publish button anyhow .. some people might get ideas from it..)

We've all run in to the problem, you've puppetized, or euh .. cooked , about every part of your infrastructure and then there's this one service which has no config files, a broken api that doesn't allow you to configure antyhing, but a magnificent web gui to configure all aspects of the service. Magnificent for the eye , full of AJAX and other fancy stuff which wget isn't really keen on. Off course before it even starts working you need to set it's password , from that webgui.

Sometimes when you are lucky they store al their config in a database, which you can dump, parse and replace all the host specific parameters for other deployments, but is that an approach you like ? As for each new version you'll need to reanalyze the db layout. But no matter how you look at it ,dumping the DB and restoring it is an ugly hack you don't want.

Other alternatives like sniffing the traffic and replaying the POSTS etc were considered ... but fancy AJAX stuff and SSL make that less trivial than it seems

Wo while discussing with an upstream project they proposed to actually screenscrape their config webgui .

So screenscraping the config gui it is .. but how ... I started looking at tools that are typically used for testing rather than for automation, with the purpose of replaying the scenarios one needs to configure the services.

My first attempt was Selenium, it plugs into a browser , so it's easy to acraully record what it has to do, and it saves it's scenarios in a somewhat readable/ editable format.
Having found the export to perl function it alll looked promising. However the export to perl isn't really an export to perl as I epxected .. I assumed it would just generate the perl code to run the same scneario which would be awesome .. it however generates a perl script that instructs a selenium server to run the script.

One of the annoyancies I ran into with Selenium is that a browser
doesn't accept self signed certificates , and one can't preprovision a browser easyily with those freshly created certificates. (Yes Karl I already read about certutil ... )

I had heard good things about Cucumber so I was pretty eager to start testing it ... In short Cucumber lack documentation ,
I tried a couple of things but I couldn't get beyond testing if a certain string was on a page.. couldn't figure out how to fill in a form etc ...
Maybe if anyone could point me to some great documentation on how you should write recipe's here ... I didn't find the documentation all to easy to find ..
Bummer as it really really looks promisiung .. specially since it is so lightweight ..

IP played with JMeter and Sahi too .. but still

So apart from filing bugs to the upstream project/product and hoping they understand your problem and are willing to oopen up their API , what other options do you folks suggest ?

I gave a short talk about this at Puppetcamp in Amsterdam and the audience came up with a bunch of other potential projects to look at .

The main problem still is that all these are tools to automate testing , they don't provide you with a general purpose approach to solve the configuration mgmt problem, each time the upstream vendor modifies the layout of his page you hav e to do the work again and that .. really doesn't sound promising ..

May 02 2011

And then vagrant gave up ... for a while

Don't you just love Ruby errors ... like the one below ?

Don't they almost make Java stack traces look readable ?
57 lines of jibberish ... where all I wanted to read was "VirtualBox in error state, check gui"

People like Randall Hanssen deserve much more visibility .. they do understand how to write a good error message and there are lots of projects that need improvement there.

Anywhow... vagrant had suddenly stopped working on me with the error below

Turned out that I had deleted some unused Virtualbox images , not from the VirtualBox gui and therefore Virtualbox didn't want to play nice ..

Upon starting the VirtualBox gui and cleaning up the images there , everything started working again ..

But the error wasn't really helpfull ..

  1. vagrant up
  2. /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/com/implementer/ffi.rb:95:in `call_and_check': Error in API call to get_teleporter_enabled: 2147942405 (VirtualBox::Exceptions::FFIException)
  3. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/com/implementer/ffi.rb:69:in `call_vtbl_function'
  4. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/com/implementer/ffi.rb:36:in `read_property'
  5. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/com/abstract_interface.rb:122:in `read_property'
  6. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/com/abstract_interface.rb:64:in `teleporter_enabled'
  7. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/interface_attributes.rb:93:in `send'
  8. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/interface_attributes.rb:93:in `spec_to_proc'
  9. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/interface_attributes.rb:32:in `call'
  10. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/interface_attributes.rb:32:in `load_interface_attribute'
  11. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/interface_attributes.rb:13:in `load_interface_attributes'
  12. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/interface_attributes.rb:12:in `each'
  13. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/interface_attributes.rb:12:in `load_interface_attributes'
  14. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/vm.rb:251:in `initialize_attributes'
  15. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/vm.rb:246:in `initialize'
  16. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/vm.rb:229:in `new'
  17. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/vm.rb:229:in `populate_array_relationship'
  18. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/vm.rb:228:in `each'
  19. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/vm.rb:228:in `populate_array_relationship'
  20. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/vm.rb:218:in `populate_relationship'
  21. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/relatable.rb:242:in `populate_relationship'
  22. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model.rb:215:in `populate_relationship'
  23. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/dirty.rb:129:in `ignore_dirty'
  24. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model.rb:215:in `populate_relationship'
  25. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/global.rb:93:in `load_relationship'
  26. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/relatable.rb:192:in `read_relationship'
  27. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/abstract_model/relatable.rb:146:in `vms'
  28. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/vm.rb:185:in `all'
  29. from /usr/lib/ruby/gems/1.8/gems/virtualbox-0.8.3/lib/virtualbox/vm.rb:193:in `find'
  30. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/vm.rb:13:in `find'
  31. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/environment.rb:378:in `load_vms!'
  32. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/environment.rb:377:in `each'
  33. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/environment.rb:377:in `load_vms!'
  34. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/environment.rb:144:in `vms'
  35. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/environment.rb:180:in `multivm?'
  36. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/command/helpers.rb:19:in `target_vms'
  37. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/command/up.rb:8:in `execute'
  38. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/task.rb:22:in `send'
  39. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/task.rb:22:in `run'
  40. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/invocation.rb:118:in `invoke_task'
  41. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/invocation.rb:124:in `invoke_all'
  42. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/config.rb:115:in `map'
  43. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/core_ext/ordered_hash.rb:73:in `each'
  44. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/invocation.rb:124:in `map'
  45. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/invocation.rb:124:in `invoke_all'
  46. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/group.rb:226:in `dispatch'
  47. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/invocation.rb:109:in `send'
  48. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/invocation.rb:109:in `invoke'
  49. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/lib/vagrant/cli.rb:45:in `up'
  50. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/task.rb:22:in `send'
  51. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/task.rb:22:in `run'
  52. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/invocation.rb:118:in `invoke_task'
  53. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor.rb:263:in `dispatch'
  54. from /usr/lib/ruby/gems/1.8/gems/thor-0.14.6/lib/thor/base.rb:389:in `start'
  55. from /usr/lib/ruby/gems/1.8/gems/vagrant-0.7.2/bin/vagrant:15
  56. from /usr/bin/vagrant:19:in `load'
  57. from /usr/bin/vagrant:19

Apr 20 2011

The 4158 second catalog run.

Two of my tweets , sorry dents, earlier today caused some people to ask me what on earth I was doing :)

You don't exist, go away!

Was the first one, indeed .. it was a long time since I had actually seen that one.. this actually happens when you delete the user you are logged in with on a host, when the host notices you don't exist anymore it will tell you.

Now that is exactly what happend .. We were busy reordening the uid's on some hosts , so I modified the puppet config for that host and changed the uid values, a couple of minutes later I was told that I don't exist ..

The last time I saw that , was about 10 years ago when I was trying to fool some collegues :)

Now the second tweet that tracked some people's attention was the one about a very lengthy catalog run

  1. Apr 20 05:12:42 sipx-a puppet-agent[22384]: Finished catalog run in 4158.09 seconds

Indeed, a puppet catalog run of about 69 minutes, yes thats 1 hour and 9 minutes ..

The reason for this lengthy catalog run was the above uid reordering , combined with

  1. "/var/sipxdata/":
  2. owner => "sipxchange", group => "sipxchange",
  3. recurse => true,
  4. ensure => directory;

And about 5K files in that directory .. apparently recurse doesn't translate to chown -R yet :)

Mar 29 2011

Vagrant & Rubylibs

I was testing some MySQL puppet modules on my Vagrant box earlier this week and one of them required augeas.
I kept running into "Could not find a default provider for augeas", however all the appropriate augeas , augeas-lib and ruby-augeas packages were installed. I inspected the different ruby directories and the files were perfectly in /usr/lib/ruby/site_ruby/1.8 where I expected them.

With all the files seemd to be in the right place, my next option was to strace a small ruby script that included augeas, guess what that showed ..

  1. stat64("/opt/ruby/lib/ruby/site_ruby/1.8/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  2. stat64("/opt/ruby/lib/ruby/site_ruby/1.8/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  3. stat64("/opt/ruby/lib/ruby/site_ruby/1.8/i686-linux/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  4. stat64("/opt/ruby/lib/ruby/site_ruby/1.8/i686-linux/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  5. stat64("/opt/ruby/lib/ruby/site_ruby/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  6. stat64("/opt/ruby/lib/ruby/site_ruby/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  7. stat64("/opt/ruby/lib/ruby/vendor_ruby/1.8/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  8. stat64("/opt/ruby/lib/ruby/vendor_ruby/1.8/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  9. stat64("/opt/ruby/lib/ruby/vendor_ruby/1.8/i686-linux/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  10. stat64("/opt/ruby/lib/ruby/vendor_ruby/1.8/i686-linux/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  11. stat64("/opt/ruby/lib/ruby/vendor_ruby/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  12. stat64("/opt/ruby/lib/ruby/vendor_ruby/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  13. stat64("/opt/ruby/lib/ruby/1.8/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  14. stat64("/opt/ruby/lib/ruby/1.8/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  15. stat64("/opt/ruby/lib/ruby/1.8/i686-linux/augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  16. stat64("/opt/ruby/lib/ruby/1.8/i686-linux/augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  17. stat64("./augeas.rb", 0xbfd2af1c) = -1 ENOENT (No such file or directory)
  18. stat64("./augeas.so", 0xbfd2af1c) = -1 ENOENT (No such file or directory)

Indeed ... vagrant throws the default ruby to /opt/ruby .. and obviously there were no ruby-augeas files in there.

Mar 10 2011

Watching the Guards

A couple of weeks ago I noticed a weird drop in web usage stats on the site you are browsing now. Kinda weird as the drop was right around Fosdem when usually there is a spike in traffic.

So before you start.. no I don't preach on practice on my own blog, it's a blog dammit, so I do the occasional upgrades on the actual platform , with backups available, do some sanity tests and move on, yes I break the theme pretty often but ya'll reading this trough RSS anyhow.

My backups showed me that drush had made a copy of the Piwik module somewhere early february, exactly when this drop started showing. I verified the module , I verified my Piwik , - Oh Piwik you say .. yes Piwik, if you want a free alternative to Google Analytics , Piwik rocks .. - I even checked other sites using the same piwik setup and they were all still functional happily humming and being analyzed.... everything fine ... but traffic stayed low ..

This taught me I actually had to upgrade my Piwik too ...

So that brings me to the point I`m actually wanting to make...
as according to @patrickdebois in his chapter on Monitoring "Quis custodiet ipsos custodes?" who's monitoring the monitoring tools, who's monitoring the analytics tools,

So not only should you monitor the availability of yor monitoring tools, you should also monitor if their api hasn't changed in some way or another.
Just like when you are monitoring an web app you shoulnd't just see if you can connect to the appropriate http port, but you should be checking if you get sensible results back from it , no gibberish.

But then again ... there's no revenue in my blog or its statistics :)

Mar 04 2011

24 hours of Puppet Drama

Over the past couple of days I've been fighting with a weird puppet problem , we eventually cracked it , but I promised a bunch of you to fully explain it here ;)

So we were deploying 2 Blade chassis at a pretty remote location with a mix of phyisical and virtual machines, some 48 instances in total. This is a pretty standard rollout, we've got a bunch of similar platforms in our lab , so we knew about a couple of glitches, what to expect etc.

I was just keeping an eye on the deployment, looking at the logs seeing if things were running fine, when suddenly a couple of puppet runs didn't come trough, we had seen such behaviour before, usually it's a matter of running them a gain a couple of times and they will come trough. (Upgrading ruby and putting passenger in front of puppet actually solved those issues,
We'd even had a loop built in the platform that runs puppet a couple of times till it returns with the correct exit code just to make sure. )

We were first scratching the A chain of our setup, so that in the event of failure we could still bring up the B chain of the platform and be up and running again. Actually machines were coming up.. slowly .. some of them took a bit longer . One of the machine's clock was seriously off .. the SSL was barfing on it , so we set the bios clock, and restarted .. it was the machine with 6VM's took a while but everything was back on schedule.. then suddenly things were going down fast more and more puppetruns started failing and .. , at some point in time actually none of our puppet runs were working again .
I'd see the puppetmaster perfectly compile it's catalog

  1. notice: Compiled catalog for ctl-0-a

Then the client .. not wanting to get it ..

  1. Mar 1 11:10:45 ctl-0-a puppet-agent[3674]: Not using expired catalog for ctl-0-a from cache; expired at Tue Mar 01 09:50:06 +0000 2011
  2. Mar 1 11:10:45 ctl-0-a puppet-agent[3674]: Using cached catalog
  3. Mar 1 11:10:45 ctl-0-a puppet-agent[3674]: Could not retrieve catalog; skipping run

We had gone from about 60% of our fresly deployed boxen working fine, to not one
So what do you do .. indeed .. you turn on debugging.
You put both your puppetmaster and client in debug. Nothing, no errors no nothing ..

I asked some collegues, asked on irc .. much ideas but none of them that actually cracked the problem. I did what I knew that solve similar problems before,

I switched our serialization format from yaml back to pson , and back, no luck.
I upgraded ruby to a version from the glei.ch repository. No luck.
I upgraded our Puppet version 2.6 to a version from the TMZ Epel repo , we cleaned out ssl the certificates on all sides multiple times. Cleaned out /var/lib/puppet , We uninstalled puppet and reinstalled it.
It wasn't a DNS Problem

I had started stripping my manifests to empty runs, those worked, then started uncommenting the actual manifests again ... Then in the middle of the debug our VPN connection to the remote location broke down, we'd only be getting it back in the morning ..about 12 hours later not fun. Murphy obviously ..

So the next morning we dived right back in ... making those manifests bigger again, removing all the stages, 1 or 2 successful runs, then with the same config .. back to failure. On and off.. successfull and unscussessful. ... it wasn't in the manifests ..

So we decided to roll the puppetmaster back to it's previous version, that one was known to be stable, there obviously was something really fishy going, so that was the safest bet.

Wrong, the machine came up, but it took longer than expected, and when trying to connect new clients to it .. nothing worked anymore .. same problem as before .. puppetmaster compiles catalog, clients didn't get anything. we started to suspect faulty hardware .. but how could that bee.. the puppetclient looked liked the only malfunctioning thing around .

Then Dim0 suggested me to look at the that one logfile I hadn't looked , /var/log/puppet/masterhttp.log and then we saw it . it was being flooded with ssl errors, ssl errors from clients that shouldn't even be connecting to the puppetmaster at all.

  1. [2011-03-02 13:32:00] ERROR OpenSSL::SSL::SSLError: tlsv1 alert decrypt error
  2. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:44:in `accept'
  3. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:44:in `listen'
  4. /usr/lib/ruby/1.8/webrick/server.rb:173:in `call'
  5. /usr/lib/ruby/1.8/webrick/server.rb:173:in `start_thread'
  6. /usr/lib/ruby/1.8/webrick/server.rb:162:in `start'
  7. /usr/lib/ruby/1.8/webrick/server.rb:162:in `start_thread'
  8. /usr/lib/ruby/1.8/webrick/server.rb:95:in `start'
  9. /usr/lib/ruby/1.8/webrick/server.rb:92:in `each'
  10. /usr/lib/ruby/1.8/webrick/server.rb:92:in `start'
  11. /usr/lib/ruby/1.8/webrick/server.rb:23:in `start'
  12. /usr/lib/ruby/1.8/webrick/server.rb:82:in `start'
  13. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:42:in `listen'
  14. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:41:in `initialize'
  15. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:41:in `new'
  16. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:41:in `listen'
  17. /usr/lib/ruby/1.8/thread.rb:135:in `synchronize'
  18. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:38:in `listen'
  19. /usr/lib/ruby/site_ruby/1.8/puppet/network/server.rb:127:in `listen'
  20. /usr/lib/ruby/site_ruby/1.8/puppet/network/server.rb:142:in `start'
  21. /usr/lib/ruby/site_ruby/1.8/puppet/daemon.rb:124:in `start'
  22. /usr/lib/ruby/site_ruby/1.8/puppet/application/master.rb:114:in `main'
  23. /usr/lib/ruby/site_ruby/1.8/puppet/application/master.rb:46:in `run_command'
  24. /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:287:in `run'
  25. /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:393:in `exit_on_fail'
  26. /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:287:in `run'
  27. /usr/sbin/puppetmasterd:4
  28. [2011-03-02 13:32:00] ERROR OpenSSL::SSL::SSLError: tlsv1 alert decrypt error

What happened was that 'we' decided to bring of the one backup machines back online, afterall once the slow starting server came trough, it would be the passive node in the cluster , no worries there, right ? Wrong,
This physical machine had 6 virtual machines with old ssl certificates that got stuck in an loop which was put there to sure their puppetrun came trough correctly at boot time.

Those 7 rogue clients which generated little to no relevant traffic on the network were saturating the default webrick, killing them solved the problem and we were back to regular deployment in no time.

The sad part is that our upcoming release already has passenger , a fresher version of ruby etc .. and that most of the above mentioned errors won't occur anymore there.
But in short .. don't use the default webrick .. it will kill you :)

And no , not everything is a freaking dns problem, ssl is a big pain in the B too .. :)

Feb 23 2011

dhcpd on Shared Networks

And as I always forget (getting old remember) how to have dhcpd configured to serve for multiple networks on one interface (e.g with aliases)
(as in test setups eg. ) I`ll write it down here, So next time google can point right back to me

  1. shared-network thisismessy {
  2. subnet 10.8.0.0 netmask 255.255.0.0 {
  3. option routers 10.8.0.1;
  4. }
  5. subnet 10.12.0.0 netmask 255.255.0.0 {
  6. option routers 10.12.0.1;
  7. }
  8. subnet 10.16.0.0 netmask 255.255.0.0 {
  9. option routers 10.16.0.1;
  10. }
  11. }