puppet

Mar 04 2011

24 hours of Puppet Drama

Over the past couple of days I've been fighting with a weird puppet problem , we eventually cracked it , but I promised a bunch of you to fully explain it here ;)

So we were deploying 2 Blade chassis at a pretty remote location with a mix of phyisical and virtual machines, some 48 instances in total. This is a pretty standard rollout, we've got a bunch of similar platforms in our lab , so we knew about a couple of glitches, what to expect etc.

I was just keeping an eye on the deployment, looking at the logs seeing if things were running fine, when suddenly a couple of puppet runs didn't come trough, we had seen such behaviour before, usually it's a matter of running them a gain a couple of times and they will come trough. (Upgrading ruby and putting passenger in front of puppet actually solved those issues,
We'd even had a loop built in the platform that runs puppet a couple of times till it returns with the correct exit code just to make sure. )

We were first scratching the A chain of our setup, so that in the event of failure we could still bring up the B chain of the platform and be up and running again. Actually machines were coming up.. slowly .. some of them took a bit longer . One of the machine's clock was seriously off .. the SSL was barfing on it , so we set the bios clock, and restarted .. it was the machine with 6VM's took a while but everything was back on schedule.. then suddenly things were going down fast more and more puppetruns started failing and .. , at some point in time actually none of our puppet runs were working again .
I'd see the puppetmaster perfectly compile it's catalog

  1. notice: Compiled catalog for ctl-0-a

Then the client .. not wanting to get it ..

  1. Mar 1 11:10:45 ctl-0-a puppet-agent[3674]: Not using expired catalog for ctl-0-a from cache; expired at Tue Mar 01 09:50:06 +0000 2011
  2. Mar 1 11:10:45 ctl-0-a puppet-agent[3674]: Using cached catalog
  3. Mar 1 11:10:45 ctl-0-a puppet-agent[3674]: Could not retrieve catalog; skipping run

We had gone from about 60% of our fresly deployed boxen working fine, to not one
So what do you do .. indeed .. you turn on debugging.
You put both your puppetmaster and client in debug. Nothing, no errors no nothing ..

I asked some collegues, asked on irc .. much ideas but none of them that actually cracked the problem. I did what I knew that solve similar problems before,

I switched our serialization format from yaml back to pson , and back, no luck.
I upgraded ruby to a version from the glei.ch repository. No luck.
I upgraded our Puppet version 2.6 to a version from the TMZ Epel repo , we cleaned out ssl the certificates on all sides multiple times. Cleaned out /var/lib/puppet , We uninstalled puppet and reinstalled it.
It wasn't a DNS Problem

I had started stripping my manifests to empty runs, those worked, then started uncommenting the actual manifests again ... Then in the middle of the debug our VPN connection to the remote location broke down, we'd only be getting it back in the morning ..about 12 hours later not fun. Murphy obviously ..

So the next morning we dived right back in ... making those manifests bigger again, removing all the stages, 1 or 2 successful runs, then with the same config .. back to failure. On and off.. successfull and unscussessful. ... it wasn't in the manifests ..

So we decided to roll the puppetmaster back to it's previous version, that one was known to be stable, there obviously was something really fishy going, so that was the safest bet.

Wrong, the machine came up, but it took longer than expected, and when trying to connect new clients to it .. nothing worked anymore .. same problem as before .. puppetmaster compiles catalog, clients didn't get anything. we started to suspect faulty hardware .. but how could that bee.. the puppetclient looked liked the only malfunctioning thing around .

Then Dim0 suggested me to look at the that one logfile I hadn't looked , /var/log/puppet/masterhttp.log and then we saw it . it was being flooded with ssl errors, ssl errors from clients that shouldn't even be connecting to the puppetmaster at all.

  1. [2011-03-02 13:32:00] ERROR OpenSSL::SSL::SSLError: tlsv1 alert decrypt error
  2. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:44:in `accept'
  3. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:44:in `listen'
  4. /usr/lib/ruby/1.8/webrick/server.rb:173:in `call'
  5. /usr/lib/ruby/1.8/webrick/server.rb:173:in `start_thread'
  6. /usr/lib/ruby/1.8/webrick/server.rb:162:in `start'
  7. /usr/lib/ruby/1.8/webrick/server.rb:162:in `start_thread'
  8. /usr/lib/ruby/1.8/webrick/server.rb:95:in `start'
  9. /usr/lib/ruby/1.8/webrick/server.rb:92:in `each'
  10. /usr/lib/ruby/1.8/webrick/server.rb:92:in `start'
  11. /usr/lib/ruby/1.8/webrick/server.rb:23:in `start'
  12. /usr/lib/ruby/1.8/webrick/server.rb:82:in `start'
  13. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:42:in `listen'
  14. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:41:in `initialize'
  15. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:41:in `new'
  16. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:41:in `listen'
  17. /usr/lib/ruby/1.8/thread.rb:135:in `synchronize'
  18. /usr/lib/ruby/site_ruby/1.8/puppet/network/http/webrick.rb:38:in `listen'
  19. /usr/lib/ruby/site_ruby/1.8/puppet/network/server.rb:127:in `listen'
  20. /usr/lib/ruby/site_ruby/1.8/puppet/network/server.rb:142:in `start'
  21. /usr/lib/ruby/site_ruby/1.8/puppet/daemon.rb:124:in `start'
  22. /usr/lib/ruby/site_ruby/1.8/puppet/application/master.rb:114:in `main'
  23. /usr/lib/ruby/site_ruby/1.8/puppet/application/master.rb:46:in `run_command'
  24. /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:287:in `run'
  25. /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:393:in `exit_on_fail'
  26. /usr/lib/ruby/site_ruby/1.8/puppet/application.rb:287:in `run'
  27. /usr/sbin/puppetmasterd:4
  28. [2011-03-02 13:32:00] ERROR OpenSSL::SSL::SSLError: tlsv1 alert decrypt error

What happened was that 'we' decided to bring of the one backup machines back online, afterall once the slow starting server came trough, it would be the passive node in the cluster , no worries there, right ? Wrong,
This physical machine had 6 virtual machines with old ssl certificates that got stuck in an loop which was put there to sure their puppetrun came trough correctly at boot time.

Those 7 rogue clients which generated little to no relevant traffic on the network were saturating the default webrick, killing them solved the problem and we were back to regular deployment in no time.

The sad part is that our upcoming release already has passenger , a fresher version of ruby etc .. and that most of the above mentioned errors won't occur anymore there.
But in short .. don't use the default webrick .. it will kill you :)

And no , not everything is a freaking dns problem, ssl is a big pain in the B too .. :)

Feb 10 2011

Ensure Running

Has anyone noticed that pretty much every puppet module one finds on the internet by default enables the service they try to configure in the module

When looking at it from a single machine point of view it makes sense to include the module , have it configure your service and directly enable it by default.

So I started wondering .. isn't there anybody out there who is building clusters ? Where services have to configured on multiple nodes but should NOT be running acitvely on all nodes by default because there is an external tool which manages that for you (Pacemaker framework eg.)

Agreed it's a small patch to get the functionality you want , but it brings an extra overhead when one upgrades the modules etc.

So if it doesn't bother you please split your puppet module in 2 parts.. one you call to configure the service, another which you call to enable the service , if you want to.

thnx !

Feb 06 2011

At Fosdem

  • on Friday evening , apparently having a confirmed reservation in a resto is not enough to actually be welcome at that restaurant.
  • at DrupalDevdays, only 2 laptops were open during our presentation
  • at DrupalDevdays, almost nobody in the room was already using CI
  • at Fosdem , the parking lot is full before 11:30 on a saturday
  • at Fosdem , much less Macs than last years .
  • at Fosdem , way too much rooms are already at full capacity so you need to have 2-3 backup alternatives ..
  • at Fosdem , people expect me to be in certain rooms, at the same time
  • at Fosdem , even with too much rooms already full one still misses a bunch of interresting talks
  • at Fosdem , one doesn't even realize friends are speaking there too ..
  • at Fosdem , Android is the standard ...
  • at Fosdem , you are confronted with the fact you probably forgot more names of people than you remember ;(
  • at Fosdem , you are surrounded by famous open source people, that aren't even on the schedule
  • at the MySQL Meetup Dinner, Monty brings Salmiakki
  • at Fosdem , you wonder how many other people have survived their 11th edition
  • at Fosdem , you can't get into any devroom on sunday morning
  • at Fosdem , begging on Twitter to get in to a devroom from the other side of the door works (at least for me :))
  • at Fosdem , netbooks are much less popular as opposed to 2-3 years ago ..
  • after fosdem ... you crash ..
  • Jan 16 2011

    Devops Meetups before Fosdem , 2011 Edition

    Just last last year we'll have a Devops meetup just before Fosdem,
    I've setup a page for registrations , that way we'll know how many people to make reservations for.

    If possible we'll go to the same place as last year .. walking distance from the Fosdem Beer event.

    Feel free to spread the news !

    Jan 12 2011

    Appliance or Not Appliance

    That's the question Xavier asks in his blog entry titled
    Security: DIY or Plug’n'Play

    To me the answer is simple, most of the appliances I ran into so far have no way of configuring them apart from the ugly webgui they ship with their device. That means that I can't integrate them with the configuration management framework I have in place for the rest of the infrastructure. There is no way to automatically modify e.g firewall rules together with the relocation of a service which does happen automatically, and there is always some kind of manual interaction required. Applicances tend to sit on a island, either stay un managed ( be honest when's the last time you upgraded the firmware of that terminal server ? ) , or take a lot of additional efort to manage manually. They require yet another set of tools than the set you are already using to manage your network.
    They don't integrate with your backup strategy, and don't tell me they all come with perfect MIB's.

    There's other arguments one could bring up against appliances, obviously people can spread fud about some organisation alledgedly paying people to put backdoors in certain operation systems.. so why would they not pay people to put backdoors in appliances , they don't even need to hide them in there .. but my main concern is manageability .. and only a web gui to manage the box to me just means that the vendor hates me and dooesn't want my business

    A good Appliance (either security or other type) needs to provide me an API that I can use to configure it, in all other cases I prefer a DIY platform, as I can keep it in line with all my other tools, config mgmtn, deployment, upgrade strategies etc.

    Mabye a last question for Xavier to finish my reply ... I`m wondering how Xavier thinks he kan achieve High-availability by using a Virtual environment for Virtual Appliances that are not cluster aware using the virtual environment. A fake comfortable feeling of higher availability , maybe.. but High Availability that I'd like to see.

    Dec 18 2010

    Guest Post Season

    Apparently December is the month where everybody starts writing guest posts for other blogs.

    Earlier this month I wrote an article with the title of this blog for Sysadvent ,

    It's a sysadmin relative of the Perl Advent Calendar: One article for each day of December, ending on the 25th article. With the goals of of sharing, openness, and mentoring, we aim to provide great articles about systems administration topics written by fellow sysadmins

    My article is here, but there's plenty more other articles written about a variety of topics, such as chef, tcpdump , how ls works, cucumber and Devops.

    On the other side, Matthias over at Agile Web Development and Operations is hosting a series on Devops where lots of Devops Advocates and Evangelists are having their say about Devops ...

    My entry about the Challenges the Devops Crowd faces was put online yesterday

    Nov 04 2010

    High Availability MySQL Cookbook , the review

    When I read on the internetz that Alex Davies was about the publish a Packt book on MySQL HA I pinged my contacts at Packt and suggested that I'd review the book .

    I've ran into Alex at some UKUUG conferences before and he's got a solid background on MySQL Cluster and other HA alternatives so I was looking forward to reading the book.

    Alex starts of with a couple of indepth chapters on MySQL Cluster, he does mention that it's not a fit for all problems, but I'd hoped he did it a bit more prominently ... an upfront chapter outlining the different approaches and when which approach is a match could have been better. The avid reader now might be 80 pages into MySQL cluster before he realizes it's not going to be a match for his problem.

    I really loved the part where Alex correcly mentions that you should probably be using Puppet or so to manage the config files of your environment, rather than scp them around your different boxes ..

    Alex then goes on to describe setting up MySQL replication and Multi Master replication with the different approaches one can take here, he gives some nice tips on using LVM to reduce the downtime of your MySQL when having to transfer the dataset of an already existing MySQL setup, good stuff.

    He then goes on to describe MySQL with shared storage ... if you only mount your redundant sandisk once on your MySQL nodes my preference would probably be a Pacemaker stack rather than a RedHat Cluster based setup, but his setup seems to work too. Alex quickly touches on using GFS to have your data disk mounted simultaneously on both nodes (keep in mind with only 1 active MySQLd) and then goes on to describe a full DRBD based MySQL HA setup

    The last chapter titled Performance tuning gives some very nice tips on both tuning your regular storage, as your
    GFS setup but also the tuning parameters for MySQL Cluster

    I was also really happy to see the Appendixes on the basic installation where he advocates the use of Cobbler , Kickstart and LVM ..

    One of the better books I read the past couple of years .. certainly the best book from Packt so far , I hope there is more quality stuff coming from that direction !

    Oct 30 2010

    Puppet broke my Xen

    Actually it didn't , but now I got your attention.
    We just adopted the use of adding headers to all of our files that are managed by puppet so people will know not to touch it

    1. file {
    2. "/etc/xen/scripts/network-custom-vlan-bridges":
    3. owner => "root",
    4. group => "root",
    5. mode => "0755",
    6. content => template(
    7. "headers/header-hash.erb',
    8. "xen/co-mmx-network-custom-vlan-bridges.erb");
    9. }

    All worked nice however upon bootstrapping our Xen host the bridges stopped working .. running the network-custom-vlan-bridges script manually solved everything and created the appropriate bridges. But at boottime it didn't..

    I added some debug info to the script and figured it never got executed at boot time.

    Turns out that when I removed the headers Xen actually does configure the bridges at boot time, Xen probably checks for a shebang at the beginning of the file.

    Putting the header at the end of the file therefore solved the problem. ,

    Jun 01 2010

    @Beaker on #Devops

    Yesterday @beaker posted his ideas on the #devops movement ...

    Apparently we haven't been stressing enough on the fact that it isn't just about Devs and Ops,
    So let me repeat it's not just about Devs and Ops, it's about breaking silo's , about being good at our jobs, about getting conversation started, about talking to different stakeholders in the processes . We are absolutely trying to include all groups, not exclude some.

    @beaker also seems to have seen many presentations where developers are shown to have evolved in practice and methodology, but operators (of all kinds) are described as being stuck in the dark ages. , is that a different point of view on another continent \, on this side of the Atlantic, it's mostly the Ops people that are already using agile methods spreading the word and it isn't about Devs talking about Deopvs yet. It's actually mostly the ops spreading the word because they feel most of the pain .

    Hoff also wonders about routers switches firewall and all the other boxen where we aren't running puppet or chef on , the boxes that are left out of our fully automated environments .
    Indeed, Puppetcamp Europe once again woke up the discussion on how to tackle these boxen, the lack of use of existing standards was covered .. and some mentioned that CIM and family are pretty much death or irrelevant for real life usage , both the Puppet and Chef communities are working on manifest, modules and recipes to solve the issues.

    But the good thing is that we now have the security people involved too, maybe they'll figure out how to survive longer than 6 months in a CSO position if they talk to the others and come out of their Ivory towers :)

    Jun 01 2010

    PuppetCamp Europe 2010

    Last week was pretty heavy on conferences for me. On wednesday I had to give my Building Virtual Appliances talk at the at the Sizing Server event on Advanced Virtualization and Hybrid Cloud Computing , but the most important part of the week was the first edition of Puppetcamp Europe.

    When the first ideas about PuppetCamp Europe started I asked Luke when and where it'd be held. He replied that I should know as I was supposed to organise it... I thanked for the honour , he went on to ask Patrick , he accepted ... I hope I helped him out enough :) I even handed out a personal invitation to some of the most famous configuration mgmt people on this planet and Inuits sponsored the event too

    Luke started with the opening talk, talking about the future and past of puppet , about version numbers, 2.6 does sound familiar and stable doesn't it, about forge.puppetlabs.com
    During @puppetmasterd 's talk @kartar played Bugmaster which was great and almost realtime

    The real fun started with the Open Spaces ... after everybody presented themselves, a mix of usual suspects, first timers and oldskoolers from irc #puppet that finally got faces, different sessions were proposed, ranging from Puppet 101, Alternative Puppet Architectures, Puppet HA, MultiMaster Puppet to Dating for PuppetMasters

    Over the 2 days spread the open space different ideas came up on e.g how to scale puppet. Different people are letting their puppetclients run from cron in batches, but probably the weirdest idea I heard was to run Puppet in Jruby in order to speed it up.

    Lots of talk on certificates and how to solve the pains with them .. e.g like in a HA setup .. you need to create an authority chain .. there was also talk about having a
    --trust-my-network feature that would disable certificates, Luke was open to accepting such a patch, or a patch that would make the whole certificate setup more pluggable
    That would for sure be a feature a lot of people would want to use ..

    The thurday evening conference dinner was "Stoofvlees met Frieten" for most of us .. but for me it was a London Devops Curry in Gent, with @unixdaemon @ripienaar and some others ;)

    But with lots of interesting chatter, free beer and free icecream there's for sure going to be another similar event in Europe next year ..