On the importance of idempotence.

A couple of months ago we were seeing weird behaviour with consul not knowing all it's members at a customer where we had deployed Consul for service registration as a POC
The first couple of weeks we hadn't noticed any difficulties but after a while we had the impression that the number of nodes in the cluster wasn't stable.

Obviously the first thought is that such a new tool probably isn't stable enough so it's expected behaviour , but rest asured that was not the case.

We set out to frequently monitor the number of nodes
a simple cron to create a graph.

  1. NOW=`date +%s`
  2. HOST=`hostname -f`
  3. MEMBERS=`/usr/local/bin/consul members | wc -l`
  4.  
  5. echo "consul_members.$HOST $MEMBERS $NOW" | graphite 2003

It didn't take us very long to see that indeed the number members in the cluster wasn't stable, frequently there were less nodes in a cluster then slowly the expected number of nodes came back on our graph.

Some digging learned us that the changes in number of nodes was in sync with our puppetruns.
But we weren't reconfiguring consul anymore, there were no changes in the configuration of our nodes.
Yet puppet triggered a restart of consul on every run. The restart was because knew it had rewritten the consul config file.
Which was weird as the values in that file were the same.

On closer inspection we noticed that the values in the file didn't change, however the order of the values in the file
changed. From a functional point of view that did not introduce any changes, but puppet rightfully assumed the configuration file
had changed and thus restarted the service dutyfully.

The actually problem lied in the implementation of the writing of the config file which was in JSON,
The ancient Ruby library just took the hash and wrote it in no specific order, each time potentially resulting
in a file with the content in a different order.

A bug fix to the puppet module made sure that the hash was written out in a sorted way , so each time resulting in the
same file being generated.

After that bugfix obviously our graph of the number of nodes in the cluster flatlined as restarts were not being introduced anymore.

This is yet another example of the importance of idempotence . When we trigger a configuration run , we want to
be absolutely sure that it won't change the state of the system if it already has been defined the way we want.
Rewriting the config file should only happen if it gets new content.

The yak is shaved .. and sometimes it's not a funky dns problem but just a legacy ruby library one ..

Comments

Paul Cobbaut's picture

#1 Paul Cobbaut : backticks

Sorry to go off topic, but this:

  1. NOW=$(date +%s)
  2. HOST=$(hostname -f)
  3. MEMBERS=$(/usr/local/bin/consul members | wc -l)

... is easier to read (and posix compliant).

cheers,
paul