Jun 15 16:57

Will containers take over?

And if so, why haven't they done so yet?

Contrary to what many people think, containers are not new; they have been around for more than a decade. They have, however, only recently become popular with a larger part of our ecosystem. Some people think containers will eventually take over.

Imvho it is all about application workloads. When I wrote about a decade of open source virtualization eight years ago, we looked at containers as the solution for running a large number of isolated instances of something on a machine. And by large we meant hundreds or more instances of Apache; one of the example use cases was an ISP that wanted to give a secure but isolated platform to its users. One container per user.

The majority of enterprise use cases, however, were full VMs. Partly because we were still consolidating existing services to VMs and weren't planning on changing the deployment patterns yet, but mainly because most organisations didn't have the need to run 100 similar or identical instances of an application or a service. They were going from 4 bare metal servers to 40-something VMs, but they had not yet reached the point of needing to run hundreds of them. Software architecture had just moved from fat-client applications that talked directly to bloated relational databases containing business logic, to web-enabled multi-tier applications. In those days, when you suggested running one Tomcat instance per VM because VMs were cheap and it would make management easier (oh oops, I shut down the wrong Tomcat instance), people gave you very weird looks.

Slowly, software architectures are changing. Today the new breed of applications is small, single-function and dedicated, and it interacts frequently with its peers; combined, they provide functionality similar to a big fat application of 10 years ago. But when you look at the market, that new breed is still a minority. A modern application might consist of 30-50 really small ones, all with different deployment speeds. And unlike 10 years ago, when we had to fight hard to be able to build separate dev, acceptance and production platforms, people now consider that practice normal. So today we do get environments that quickly go to 100+ instances while requiring similar CPU power as before, and the use case for containers as we proposed it in the early days is slowly becoming a more common one.

So yes, containers might take over ... but before that happens .. a lot of software architectures will need to change, a lot of elephants will need to be sliced, and that is usually what blocks cloud, container, agile and devops adoption.

Jun 13 21:04

Jenkins DSL and Heisenbugs

I'm working on getting even more moving parts automated. Those who use Jenkins frequently probably also have a love-hate relationship with it.

The love comes from the flexibility, stability and power you get from it, the hate from its UI. If you've ever had to create a new Jenkins job or pipeline based on one that already existed, you've gone through the horror of click and paste errors, and you know where the hate breeds.

We've been trying to automate this with different levels of success: we've puppetized the XML jobs, we've used the Buildflow Plugin (reusing the same job for different pipelines is a bad idea..), we've played with JJB, running into issues with some plugins (Promoted Build), and most recently we have put our hope in the Job DSL.

While toying with the DSL I ran into a couple of interesting behaviours. Imagine you have an entry like this, which is supposed to replace ${foldername} with the content of the variable and point at the correct upstream job:

  cloneWorkspace('${foldername}/dashing-dashboard-test', 'Successful')

You generate the job, look inside the Jenkins UI to verify what the generated result was .. save the job and run it .. success ..
Then a couple of runs later that same job gives an error ... it can't find the upstream job to copy the workspace from. You once again open up the job in the UI, look at it, save it, run it again, and then it works .. a typical case of a Heisenbug ..

When you start looking closer at the XML of the job, you notice ..

  <parentJobName>${foldername}/dashing-dashboard-test</parentJobName>

Obviously wrong .. I should have used double quotes, since Groovy only expands ${...} variables inside double-quoted strings ..
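The corrected entry is the same call with a GString, so ${foldername} actually gets interpolated when the job is generated (a minimal sketch of the fix described above):

  cloneWorkspace("${foldername}/dashing-dashboard-test", 'Successful')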

But why doesn't it look wrong in the UI? That's because the UI autoselects the first option from its autogenerated pull-down list .. which actually contained the right upstream workspace I wanted to trigger (that will teach me to use 00 as a prefix for the folder name in all my tests..)

So when working with the DSL .. review the generated XML, not just whether the job works ..

Jun 01 09:00

Linux Troubleshooting 101, 2016 Edition

Back in 2006 I wrote a blog post about Linux troubleshooting. Bert Van Vreckem pointed out that it might be time for an update ..

There's not that much that has changed .. however :)

Everything is a DNS Problem

Everything is a Fscking DNS Problem
No really, Everything is a Fscking DNS Problem
If it's not a fucking DNS Problem ..
It's a Full Filesystem Problem
If your filesystem isn't full
It is a SELinux problem
If you have SELinux disabled
It might be an ntp problem
If it's not an ntp problem
It's an arp problem
If it's not an arp problem...
It is a Java Garbage Collection problem
If you ain't running Java
It's a natting problem
If you are already on IPv6
It's a Spanning Tree problem
If it's not a spanning Tree problem...
It's a USB problem
If it's not a USB Problem
It's a sharing IRQ Problem
If it's not a sharing IRQ Problem
But most often .. it's a Freaking DNS Problem !


May 28 13:16

Docker and volumes hell

We're increasingly using Docker to build packages: a fresh chroot in which we prepare a number of packages, typically builds for Ruby (rvm), Python (virtualenv) or Node stuff where the language ecosystem fails on us ... and then fpm the whole tree into a working artifact.

An example of such a build is my work on packaging Dashing. https://github.com/KrisBuytaert/build-dashing
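The fpm step at the end of such a build boils down to something like this (a rough sketch, not the exact invocation from that repo; the install path is an assumption):

  fpm -s dir -t rpm -n rvm -v 1.27.0 /usr/local/rvm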

Now part of that build is running the actual build script in Docker with a local volume mounted inside the container. This is your typical -v=/home/src/dashing-docker/package-scripts:/scripts param.

Earlier this week, however, I was stuck on a box where that combo did not want to work as expected. Docker clearly mounted the local volume, as it could execute the script in the directory, but for some reason it couldn't write to the mounted volume.

docker run -v=/home/src/dashing-docker/package-scripts:/scripts dashing/rvm /scripts/packagervm
Usage of loopback devices is strongly discouraged for production use. Either use `--storage-opt dm.thinpooldev` or use `--storage-opt dm.no_warn_on_loop_devices=true` to suppress this warning.
corefines: Your Ruby doesn't support refinements, so I'll fake them using plain monkey-patching (not scoped!).
/usr/local/share/gems/gems/corefines-1.9.0/lib/corefines/support/fake_refinements.rb:26: warning: Refinements are experimental, and the behavior may change in future versions of Ruby!
/usr/share/ruby/fileutils.rb:1381:in `initialize': Permission denied - rvm-1.27.0-1.x86_64.rpm (Errno::EACCES)

So what was I doing wrong? Did the Docker params change, did I invert the order of the params, did I mistype them? I added debugging to the script (ls, chmod, etc..) and I couldn't seem to read or modify the directory. So I asked a coworker to be my wobbling duck.

He did more .. he wondered if this wasn't SELinux.

And he was right..

Apr 23 21:47:00 mine23.inuits.eu audit[9570]: AVC avc: denied { write } for pid=9570 comm="fpm" name="package-scripts" dev="dm-2" ino=368957 scontext=system_u:system_r:svirt_lxc_net_t:s0:c47,c929 tcontext=unconfined_u:object_r:unlabeled_t:s0 tclass=dir permissive=0
Apr 23 21:47:02 mine23.inuits.eu python3[9597]: SELinux is preventing fpm from write access on the directory /home/src/dashing-docker/package-scripts.

So while I was looking for errors in Docker, it was just SELinux, set to enforcing, acting up and me not noticing it.
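For the record, checking the enforcing mode and the recent AVC denials is the quickest way to spot this kind of thing; the commands below are standard SELinux tooling, not part of my original debugging session:

  getenforce
  ausearch -m avc -ts recent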

The quick way to verify, obviously, was to setenforce 0 and trigger the build again. That however is not a long term fix, so I changed the context of the directory:

semanage fcontext -a -t cgroup_t '/home/src/dashing-docker/package-scripts'
restorecon -v '/home/src/dashing-docker/package-scripts'

That solves the problem
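An alternative on recent Docker versions is to let Docker relabel the volume itself with the z/Z volume options, avoiding the manual semanage step (hedged: whether this is appropriate depends on your Docker version and on whether the directory is shared with anything else):

  docker run -v=/home/src/dashing-docker/package-scripts:/scripts:Z dashing/rvm /scripts/packagervm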

Mar 23 23:38

It's just rubber, with some air in it.

- A child's balloon.
You inflate it, play with it for a while, then it explodes and you throw it away. You inflate another one, maybe even a different color. The kid plays with it .. till it breaks, then you throw it away...

- An inflatable castle.
You inflate it, play with it for a while, deflate it, move it around, inflate it again. If it has a hole in it, you patch the hole.

- The tyres on your kid's bike.
You inflate them, the kid rides on them, and if they start losing air, you fix them asap so you can continue using them.

All three are really valuable use cases for rubber with air in it;
in a way they are all equally valuable, but they serve different purposes, different use cases.

Now think about this the next time you spin up a container that runs a database and an application server, and that your users ssh into.

It's not just another virtual machine.

Mar 23 23:28

Lies, Damn Lies and Statistics, 2016 Edition

When people sign up for Configuration Management Camp, we ask them which community room they are most interested in.
We ask this question because we have rooms of different sizes: we don't want to put a community with 20 people showing interest in a 120-seat room, and we don't want to put a community with 200 people in a 60-seat room.

But it also gives us the opportunity to build some very interesting graphs of the potential evolution of the communities.

So looking at the figures ... the overall community is obviously growing: from 350, to 420, to just short of 600 people registered now.

The Puppet community is not the biggest anymore; that spot went to the Ansible community room. And all but the CFEngine community are growing.

One more thing: the organisation team discussed several times whether we should rebrand the event. We opted not to .. Infracoders.eu could have been an alternative name .. but we decided to stick with the name that is already known.
The content will evolve, but Config Management Camp will stay the place where people who care about Infrastructure as Code and infrastructure automation meet.

Mar 23 22:58

Recent and Upcoming Talks

I gave a couple of new and updated talks the last couple of months.

At FOSDEM, I opened up the Distributions devroom with a talk on how we could improve the collaboration between the developers of a distro and their users, the ops folks. I was apparently not the only person with such ideas, as the smart folks over at the CentOS project had already been actively talking about their efforts to make some of those ideas reality the day before, at the CentOS Dojo.


Another talk from early this year was an update on why security requires a devops approach, why you want to embed security as a standard practice in your development process, and why continuous delivery actually is a security requirement.


At last week's FLOSS UK Conference in London, I gave an updated version of my MonitoringLove talk, giving an opinionated overview of the current state of open source monitoring tools.


I was scheduled to give the opening keynote today (23/3/2016) at the Making Open Source Software Conference, but sadly I had to cancel due to yesterday's (22/3/2016) events at Brussels Airport. My flight to Bucharest was obviously cancelled.

I'm scheduled to open up the 2nd day of the upcoming Devopsdays London edition.

And Bernd Erk tricked me into giving a follow-up talk to my popular 7 tools for your devops stack talk, aptly titled Another 7 tools for your devops stack.

Jan 07 10:48

Bimodal IT, redefined

There's been a lot of discussion about the silliness of the term Bimodal IT, aka the next big thing for IT organisations that don't dare to change but still want to sound cool.

So here is my idea to reuse that term for something relevant.

Bimodal IT is the idea where you take a fully automated infrastructure built on the principles of Infrastructure as Code, which gets periodic idempotent updates (e.g. every 15 or 30 minutes, or when you orchestrate it) and consistency checks, and where the source code for that infrastructure is versioned, tested and delivered through a traditional continuous delivery pipeline for the majority of your services; and then add realtime reconfiguration capabilities based on service discovery for the other services that change really fast, or in a really elastic way, using tools like Consul, consul-template, etcd, etc.
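For that second, fast-changing mode, a consul-template run that keeps an nginx upstream list in sync could look something like this (a hedged sketch; the template, output path and reload command are hypothetical):

  consul-template -template "/etc/consul-template/upstreams.ctmpl:/etc/nginx/conf.d/upstreams.conf:systemctl reload nginx"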

That way you have two modes of managing your infrastructure, aka Bimodal.

Jul 28 08:35

The power of packaging software, package all the things

Software delivery is hard; plenty of people all over this planet are struggling with delivering software in their own controlled environment. They have invented great patterns that will build an artifact, then do some magic, and the application is up and running.

When talking about continuous delivery, people invariably discuss their delivery pipeline and the different components that need to be in that pipeline.
Often, the focus on getting the application deployed or upgraded from that pipeline is so strong that teams
forget how to deploy their environment from scratch.

After running a number of tests on the code and compiling it where needed, people want to move forward quickly and deploy their release artifact on an actual platform.
This deployment is typically done via a file upload or a checkout from a source-control tool onto the dedicated computer on which the application resides.
Sometimes, dedicated tools are integrated to simulate what a developer would do manually on a computer to get the application running: copy three files left, one right, and make sure you restart the service. Although this is obviously already a large improvement over people manually pasting commands from a 42-page run book, it doesn't solve all problems.

Like the guy who quickly makes a change on the production server and never commits it (say goodbye to git pull for your upgrade process).
If you package your software, there are a couple of things you get for free from your packaging system. Questions like: has this file been modified since I deployed it, where did this file come from, when was it deployed, and what version of software X do I have running on all my servers, are easily answered by the same tools we already use for every other package on the system. Not only can you use existing tools, you are also using tools that are well known by your ops team and that they already use for every other piece of software on your system.
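On an rpm-based system, for instance, those questions map directly onto the standard package tooling (the package and file names below are hypothetical):

  rpm -V yaja                          # has any file of this package been modified since install?
  rpm -qf /opt/yaja/yaja.war           # where did this file come from (which package owns it)?
  rpm -qi yaja | grep 'Install Date'   # when was it deployed?
  rpm -qa 'yaja*'                      # which version is installed on this server?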

If your build process creates a package and uploads it to a package repository which is available to the hosts in the environment you want to deploy to, there is no need anymore for a script that copies the artifact from a 3rd-party location, and even less for that 42-page text document which never gets updated and still tells you to download yaja.3.1.9.war from a location where you can only find 3.2 and 3.1.8, while the developer who knows whether you can use 3.2, or why 3.1.9 got removed, has just left for the long weekend.

Another, and maybe even more important, thing is the sadly growing practice of having yet another tool in place that translates that 42-page text document into a bunch of shell scripts created from a drag-and-drop interface; typically that "deploy tool" is even triggered from within the pipeline. Apart from the fact that it usually stimulates a pattern of non-reusable code, distributes even more ssh keys, or adds yet another agent on all systems, it doesn't take into account that you want to think of your servers as cattle and be able to deploy new instances of your application fast.
Do you really want to deploy your five new nodes on AWS with a full Apache stack ready for production, then reconfigure your load balancers, only to figure out that someone needs to go click in your continuous integration or deployment tool to deploy the application to the new hosts? That one manual action someone forgets?
Imvho deployment tools are a phase in the maturity process of a product team.. yes, it's a step up from manually deploying software, but it creates more and different problems; once your team grows in maturity, refactoring out that tool is trivial.

The obvious and trivial approach to this problem, and it comes with even more benefits, is called packaging. When you package your artifacts as operating system packages (e.g., .deb or .rpm), you can include that package in the list of packages to be deployed at installation time (via Kickstart or debootstrap). Similarly, when your configuration management tool (e.g., Puppet or Chef) provisions the computer, you can specify which version of the application you want to have deployed by default.
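Once the package lives in a repository your hosts can reach, pinning a node to a specific application version is a one-liner at provisioning or upgrade time (a sketch for an rpm-based host, reusing the hypothetical yaja example from above):

  yum install -y yaja-3.1.9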

So, when you’re designing how you want to deploy your application, think about deploying new instances or deploying to existing setups (or rather, upgrading your application).
Doing so will make life so much easier when you want to deploy a new batch of servers.

May 11 2015

On the importance of idempotence.

A couple of months ago we were seeing weird behaviour with Consul not knowing all its members at a customer where we had deployed Consul for service registration as a PoC.
The first couple of weeks we hadn't noticed any difficulties, but after a while we had the impression that the number of nodes in the cluster wasn't stable.

Obviously the first thought is that such a new tool probably isn't stable enough, so this is expected behaviour; but rest assured, that was not the case.

We set out to frequently monitor the number of nodes with a simple cron job that creates a graph:

  # count how many cluster members this node currently sees
  NOW=`date +%s`
  HOST=`hostname -f`
  MEMBERS=`/usr/local/bin/consul members | wc -l`

  # push the metric (Graphite plaintext format: path value timestamp) to port 2003
  echo "consul_members.$HOST $MEMBERS $NOW" | graphite 2003

It didn't take us very long to see that the number of members in the cluster indeed wasn't stable: frequently there were fewer nodes in the cluster, and then slowly the expected number of nodes came back on our graph.

Some digging taught us that the changes in the number of nodes were in sync with our Puppet runs.
But we weren't reconfiguring Consul anymore; there were no changes in the configuration of our nodes.
Yet Puppet triggered a restart of Consul on every run. It restarted the service because it had rewritten the Consul config file,
which was weird, as the values in that file were the same.

On closer inspection we noticed that the values in the file didn't change; however, the order of the values in the file did. From a functional point of view that did not introduce any changes, but Puppet rightfully assumed the configuration file had changed and thus dutifully restarted the service.

The actual problem lay in the implementation of the writing of the config file, which was in JSON. The ancient Ruby library just took the hash and wrote it out in no specific order, each time potentially resulting in a file with the content in a different order.

A bug fix to the Puppet module made sure that the hash was written out in a sorted way, each time resulting in the same file being generated.
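The essence of that fix, sketched in Ruby (an illustration of the idea, not the actual module code; the helper name, sample config and path are made up):

  require 'json'

  # Recursively sort hash keys so the rendered JSON is identical on every run,
  # regardless of the order in which the hash was built.
  def deep_sort(obj)
    case obj
    when Hash  then obj.sort.to_h.transform_values { |v| deep_sort(v) }
    when Array then obj.map { |v| deep_sort(v) }
    else obj
    end
  end

  config = { 'server' => true, 'datacenter' => 'dc1', 'retry_join' => ['10.0.0.1'] }
  File.write('/etc/consul/config.json', JSON.pretty_generate(deep_sort(config)) + "\n")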

After that bug fix, our graph of the number of nodes in the cluster obviously flatlined, as restarts were no longer being triggered.

This is yet another example of the importance of idempotence. When we trigger a configuration run, we want to be absolutely sure that it won't change the state of the system if it has already been defined the way we want. Rewriting the config file should only happen if it gets new content.

The yak is shaved .. and sometimes it's not a funky DNS problem but just a legacy Ruby library one ..