Dominic Cleal's Blog


A couple of experiences recently have reminded me that I ought to rely more on profilers when debugging performance problems. Usually I'll start with logging/printing or an inline REPL to figure out where the time is going, but in heavily iterative functions, a profiler identifies the cause much more quickly.

1. Augeas path expression performance

Two users mentioned that Augeas's path expression performance was terrible in some circumstances. One was working on an Augeas provider for Puppet and it was slow matching entries when there were hundreds or thousands of services defined in a Nagios config, while netcf developers reported the utility was very slow with hundreds of VLAN interfaces.

Specifically, it was found that filtering nodes from the in-memory tree was slow when they shared the same label, even when accessed directly by index, e.g. /files/foo[0] or /files/foo[512]. Running Valgrind's callgrind tool against a simple test run of the binary showed a huge proportion of the time being spent in the ns_add and memmove functions.
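Gathering the data is just a matter of running the binary under callgrind and opening the file it writes (callgrind.out followed by the process ID). The augtool command below is only representative, not the actual test case:

$ valgrind --tool=callgrind augtool print /files/etc/hosts
$ kcachegrind callgrind.out.12345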

Using KCachegrind to analyse the output file from callgrind:

Visually, it's very easy to spot from the callee map where the time is being spent. The two candidates are the red and yellow squares shown on the left, which are being called through two different paths (hence showing up twice). The third large group of functions is from readline library initialisation (since I was testing the CLI, not the API), and the fourth block is Augeas's own initialisation. These have a much more regular spread of time across functions.

The bulk of the time in the two functions on the left was eliminated by David by:

  1. Replacing an expensive check in ns_add that was attempting to build a set of unique nodes. Instead of rescanning the existing set each time to see whether the new node was already present, a flag inside the node now indicates whether it has already been added.
  2. Batching up many memmove calls. Instead of removing candidate nodes from the start of a list by memmoving the rest of the list towards the start as each one is identified, the number of nodes to remove is counted up and they are removed in one operation.

After the fix, the group of expensive ns_* functions has moved completely in the callee map. They're now over on the right-hand side of KCachegrind, leaving aug_init as the most expensive part, followed by readline and then the actual filtering.

2. JGrep filtering performance

Puppet module tests for theforeman/puppet had recently started timing out during test startup. The dependencies would install, the RSpec test framework would start up, and then there would be a long pause before it began listing the tests being run. This caused Travis CI to cancel the builds after 10 minutes of inactivity.

Julien dug into it and narrowed it down to a method call to rspec-puppet-facts, which loads hashes of system data (facts) for each supported operating system and generates sets of tests for each OS. This in turn calls facterdb, a simple library that loads JSON files into memory and provides a filtering API using jgrep. On my system, this was taking about seven or eight seconds per call, once per test, so a total of about two minutes on this one method call.

I was rather prejudiced when I started looking at this, as I knew from chasing a Heisenbug into facterdb a few weeks earlier that it loaded all of its JSON files into one large string before passing it into JGrep to be deserialised and filtered. I figured that it was either taking a long time to parse and deserialise the huge blob of JSON, or that the data was too large to filter quickly. Putting some basic logging into the method showed that I was completely wrong about the JSON: it was extremely quick to parse and was a non-issue.

Next, with some basic logging and pry (a REPL) inserted into jgrep, I could see that parsing the filter string took about two-thirds of the time, and applying it the other third. This surprised me, because although the filter string was reasonably complex (it lists all of the supported OSes and versions), it was only a few lines long. Putting ruby-prof around the JGrep method call quickly showed where the issues lay.
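Wrapping the call only takes a few lines. This is a sketch rather than the exact code I used: the JGrep.jgrep entry point, the variables and the output file name are assumptions, and the CallTreePrinter API differs between ruby-prof versions.

require 'ruby-prof'
require 'jgrep'

# Profile only the suspect call; json and filter come from facterdb
result = RubyProf.profile do
  JGrep.jgrep(json, filter)
end

# Quick flat report on stdout
RubyProf::FlatPrinter.new(result).print(STDOUT)

# Callgrind-compatible output for KCachegrind; older ruby-prof versions
# accept an IO here, newer ones take options such as path: and profile:
RubyProf::CallTreePrinter.new(result).print(File.open('callgrind.out.jgrep', 'w'))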

ruby-prof has a variety of output formats, including callgrind-compatible files which can be imported straight into KCachegrind. These quickly showed an alarming amount of time being spent splitting strings and matching regular expressions:

The string splitting was only there to access individual characters (by splitting the filter into an array of characters) while parsing the grep filter, so it could easily be replaced by String#[] for an instant performance boost. The regex matching was mostly checking whether certain characters appeared in a string, or whether the string started with a certain substring, both of which can be done much more cheaply with String#include? and String#start_with?. After these two small changes, the map looks a lot healthier with a more even distribution of runtime between methods:
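The kind of change involved looks roughly like this (the variable names and checks are illustrative, not the actual jgrep internals):

# Before: split into an array just to look at individual characters,
# and use regular expressions for simple containment/prefix checks
chars  = filter.split("")
quoted = !filter.match(/'/).nil?
lte    = !filter.match(/^<=/).nil?

# After: index into the string and use cheap predicates instead
first  = filter[0]
quoted = filter.include?("'")
lte    = filter.start_with?("<=")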

Actual runtime has dropped from seven or eight seconds to a fraction of a second, and the tests no longer time out during startup.

Both callgrind and ruby-prof are extremely easy to use - it's just a matter of running a binary under Valgrind for the former, and wrapping a code block in a RubyProf.profile call for the latter - and the results of fixing some low-hanging performance problems are very satisfying!

Reproducible builds on Copr with tito and git-annex

Fedora recently launched its Copr service, which lets individuals create personal repos and build packages on Fedora's servers for a number of Fedora and EL versions (similar to OBS and PPAs).

I've set up a couple of repos under it, one of which contains builds of various gems as dependencies for librarian-puppet. Setting up and tracking RPM builds is made quite easy with git and tito, which lets you lay out a simple directory structure in your RPM repo, track specs and source binaries (.gem files), and tag and release changes to Copr:

$ tree 
|-- rel-eng
|   |-- packages
|   |   |-- rubygem-highline
|   |   |-- rubygem-librarian
|   |   |-- rubygem-librarian-puppet
|   |   `-- rubygem-thor
|   |-- releasers.conf
|   `-- tito.props
|-- rubygem-highline
|   |-- highline-1.6.20.gem
|   `-- rubygem-highline.spec
|-- rubygem-librarian
|   |-- librarian-0.1.2.gem
|   `-- rubygem-librarian.spec
|-- rubygem-librarian-puppet
|   |-- librarian-puppet-0.9.17.gem
|   `-- rubygem-librarian-puppet.spec
|-- rubygem-thor
|   |-- rubygem-thor.spec
|   `-- thor-0.15.4.gem

6 directories, 16 files

(my librarian-puppet-copr repo)

However, storing binary files in git has lots of problems, including the resulting size of the repo. To solve this, I use git-annex so the repo only stores metadata and a local cache of the binaries. On a clean checkout, git-annex can then fetch the binaries from the web on the fly.

Setting this up is easy:

  1. mkdir myrepo && cd myrepo
  2. git init
  3. git annex init
  4. tito init

Next, configure Copr and tito to release to it, following Mirek Suchý's blog post and my rel-eng directory as examples. The key part of the rel-eng configuration for using the git-annex support in tito 0.5.0 is setting the builder in tito.props:

builder = tito.builder.GitAnnexBuilder

Adding new packages is now a matter of doing:

  1. mkdir rubygem-foo && cd rubygem-foo
  2. vim rubygem-foo.spec
  3. git annex addurl --file=foo-1.2.3.gem <gem URL> to download and register the gem, or copy the file into place and run git annex add foo-1.2.3.gem
  4. git commit -am "Add foo 1.2.3"
  5. tito tag --keep-version
  6. tito release copr-domcleal

git-annex will store the file in its local storage under .git/annex, replacing the file in git with a symlink based on the checksum of the contents. When you push the repo to a remote without git-annex support (like GitHub), the binaries won't be transferred, keeping the size small. Other users can then fetch the binaries by adding a shared remote (e.g. a WebDAV or rsync share), or straight from the web for files that were registered with git annex addurl.
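For example, a fresh clone can pull an individual gem down from wherever git-annex knows it is available (the repository URL is a placeholder here):

$ git clone <url-of-your-rpm-repo> && cd <repo>
$ git annex init
$ git annex get rubygem-thor/thor-0.15.4.gem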

When tito builds SRPMs, it will "unlock" the files, fetch them from available remotes, build the RPM and then re-lock the checkout.

So the combination of git (for tracking sources and specs), git-annex (for keeping sources outside git), tito (for tagging builds) and Copr (for building and publishing) makes it easy to build and release your own RPMs, while allowing you to make the source code and build process accessible and transparent.

Automatic DNS updates from libvirt guests

Recently I've been relying a lot on Vagrant, and vagrant-libvirt in particular, for spinning up a variety of OSes for testing. One irritating habit I've developed: whenever I need to view a website or app hosted on the VM itself, I check the VM's IP each time it comes up and paste it into my browser, rather than using a DNS name.

Since libvirt runs dnsmasq by default to provide DHCP and DNS inside NAT virtual networks, getting DNS working from the hypervisor (my desktop) is very easy. First ensure that the libvirt network has the domain name correctly configured - initially this was set to my hostname, but I changed it:

$ sudo virsh net-edit default
<network>
  <name>default</name>
  <forward mode='nat'>
    <nat>
      <port start='1024' end='65535'/>
    </nat>
  </forward>
  <bridge name='virbr0' stp='on' delay='0'/>
  <mac address='52:54:00:03:62:29'/>
  <domain name=''/>
  <ip address='' netmask=''>
    <dhcp>
      <range start='' end=''/>
    </dhcp>
  </ip>
</network>

Change the <domain> tag's name attribute, then run virsh net-destroy default followed by virsh net-start default to restart the network and apply the change.

Next I updated the images and VMs I use in libvirt to send hostnames when making DHCP requests, ensuring that dnsmasq would associate the hostname with the DHCP lease. On Red Hat variants, set DHCP_HOSTNAME in /etc/sysconfig/network-scripts/ifcfg-eth0 and run service network restart.
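The setting looks like this (the FQDN is a placeholder; the domain part must match the one configured on the libvirt network):

DHCP_HOSTNAME=foreman.example.vm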

Provided the domain in DHCP_HOSTNAME is the same as the domain set on libvirt's network (which is used to configure dnsmasq), the hostname will be registered and shown in /var/lib/libvirt/dnsmasq/default.leases with the domain removed:

1387893215 52:54:00:15:d1:73 foreman *

If hostnames aren't shown, check syslog for messages such as this, which indicate a domain name mismatch:

Dec 24 12:32:46 cobalt dnsmasq-dhcp[8571]: Ignoring domain for DHCP host name foreman

Lastly, I configure NetworkManager on my desktop to redirect queries for the libvirt domain to the dnsmasq instance serving the libvirt network. First I switch NetworkManager to run a dnsmasq instance of its own, instead of pointing resolv.conf at external resolvers:

$ sudoedit /etc/NetworkManager/NetworkManager.conf
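The relevant change is enabling NetworkManager's dnsmasq plugin in the [main] section, leaving any existing settings alone:

[main]
dns=dnsmasq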

Then add a dnsmasq config snippet to send queries for that domain through to the dnsmasq instance bound to the gateway IP address of the libvirt network:

$ sudoedit /etc/NetworkManager/dnsmasq.d/libvirt_dnsmasq.conf
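The snippet needs a single server line mapping the domain to the gateway address of the libvirt network (both values below are placeholders; use the domain and IP from the network definition above):

server=/example.vm/192.168.122.1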

And now, from my desktop, I can bring up a VM and immediately resolve its hostname or use it to access the VM.

Extending Foreman quickly with hook scripts

Foreman integrates with a bunch of systems such as DNS (BIND, AD), DHCP (ISC) and Puppet when systems get created and destroyed. It does most of this via its smart proxy architecture, where an agent (the proxy) is deployed on your other servers and gets called from your central Foreman host during "orchestration".

Most production setups are a bit more complex, though, involving other systems that Foreman doesn't integrate with, so the plugin support in version 1.1 is there to extend Foreman's capabilities. Since writing a plugin means learning some Ruby and Rails, I've released foreman_hooks so admins can instead write simple shell scripts that hook straight into Foreman's orchestration feature.

I'm going to walk through an example hook that runs a PuppetDB command to clean up stored data (facts, resources) when hosts are deleted in Foreman.

Foreman 1.2+ installed from RPMs: please use the ruby193-rubygem-foreman_hooks package from the plugins repo.

Other users: first we'll install the plugin by adding this config file:

# echo "gem 'foreman_hooks'" > ~foreman/bundler.d/foreman_hooks.rb
And then run this to install and restart Foreman:
# cd ~foreman && sudo -u foreman bundle update foreman_hooks
Fetching source index for
Installing foreman_hooks (0.3.1)
# touch ~foreman/tmp/restart.txt

Next, we'll create a hook script that runs when a managed host is destroyed in Foreman:

# mkdir -p ~foreman/config/hooks/host/destroy
# cat << 'EOF' > ~foreman/config/hooks/host/destroy/70_deactivate_puppetdb
#!/bin/sh

# We can only deactivate data in PuppetDB, not add it
[ "x$1" = xdestroy ] || exit 0

# Remove data in PuppetDB, supplying the FQDN
# assumes sudo rules are set up, as this runs as 'foreman':
#   foreman ALL = NOPASSWD : /usr/bin/puppet node deactivate *
#   Defaults:foreman !requiretty
sudo puppet node deactivate $2
EOF

# chmod +x ~foreman/config/hooks/host/destroy/70_deactivate_puppetdb

(on Foreman 1.2, change "host" in the path to "host/managed")

There are a few things here to note. The path we're creating under config/hooks/ refers to the Foreman object and event (see the README). For hosts we can use "create", "update" and "destroy" to extend the orchestration process, and there are similar events for every other object in Foreman. The 70 prefix influences the ordering relative to other tasks; see grep -r priority ~foreman/app/models/orchestration for examples.

The script gets two arguments. The first is the event name ("destroy" etc.) and, very importantly for orchestration events, this can change: if an orchestration task fails, the process is rolled back and the script will be called again with the opposite event to the one it was first called with. For example, a hook in the create/ directory will first be called with "create"; if a later task fails, it will be called again with "destroy" to revert the change. Orchestration of DNS records etc. in Foreman works in the same way. Since this example can only remove data and not add it, the test at the top of the script checks the argument and exits if it isn't being asked to destroy. Other scripts should take note of the value of this argument.

The second argument is the string representation of the object, i.e. the FQDN for host objects. On stdin, we receive a full JSON representation of the object, which gives access to its other attributes. There are helper functions for accessing this data; see the examples/ directory to get a copy.
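Hooks don't have to be shell scripts - anything executable works - so a hook could just as easily be written in Ruby. Here's a hypothetical sketch that handles both the initial event and a rollback, and parses the JSON from stdin (don't rely on particular attribute names; inspect the data for your Foreman version):

#!/usr/bin/env ruby
# Hypothetical hook: foreman_hooks passes the event name and the object's
# string representation as arguments, and its JSON representation on stdin
require 'json'

event, name = ARGV
data = JSON.parse($stdin.read)

case event
when 'create'
  # register "name" with an external system, perhaps using values from "data"
when 'destroy'
  # called for a real deletion, or to roll back a failed creation,
  # so this must safely undo whatever the 'create' branch did
end

exit 0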

Lastly, in this example we run the PuppetDB command to remove the data. The exit code of orchestration hooks is important here: if it is non-zero, the hook is treated as a failure, so Foreman will cancel the operation and roll back other tasks that have already completed.

Now when the host gets deleted from either the Foreman UI or the API, the host gets deactivated in PuppetDB:

# puppet node status
Deactivated at 2013-04-07T13:39:59.574Z
Last catalog: 2013-04-07T11:56:40.551Z
Last facts: 2013-04-07T13:39:30.114Z

There's a decent amount of logging, so you can grep the log file for the word "hook" to see hooks being queued and run. You can increase the verbosity with config.log_level in ~foreman/config/environments/production.rb.

# grep -i hook /var/log/foreman/production.log
Queuing hook 70_deactivate_puppetdb for Host#destroy at priority 70
Running hook: /usr/share/foreman/config/hooks/host/destroy/70_deactivate_puppetdb destroy

One area I intend to improve is the ability to interact with Foreman from within the hook script. At the moment, I'd suggest using the foremanBuddy command line tool from within your script - though you shouldn't edit the object itself from within the hook, unless you're using the after_create/after_save events.

I hope this sets your imagination going about which systems you could now integrate Foreman with, without needing to start with a full plugin. Maybe we could begin collecting the most useful hooks in the community-templates repo or on the wiki.

(PuppetDB users should actually use Daniel's dedicated puppetdb_foreman plugin; I hope hooks like this one will serve as a natural stepping stone to dedicated plugins down the line.)

Introducing tests for Augeas resources in Puppet

Augeas resources in Puppet have always been a bit of a black box, as they use somewhat esoteric commands (based on augtool, but with a different parser), are often not idempotent without some work, and are difficult to test.

Last year I wrote about a small test framework I'd written for augeasproviders, a set of new types and providers. These tools allowed me to check that the providers were making changes correctly and idempotently.

Before his talk on testing Puppet modules at Puppet Camp Ghent, vStone asked me about testing Augeas resources in the same way, so I decided to refactor this code into a new extension for Tim Sharpe's rspec-puppet tool, allowing it to be used against resources in manifests.

rspec-puppet-augeas is a small gem you can install to run Augeas resources inside your rspec-puppet tests. It runs the resource against as many test fixtures as you like, then runs it again to make sure it's idempotent. It can provide debug output from the provider when something goes wrong, and adds a number of file inspection tools to check that it went right.

The README has info on setting up (it's quick) and I've published an example module showing how it should look. There are lots of examples showing off each feature inside the project tests themselves.

Feedback gratefully received. File issues, PRs or contact me on IRC.

Testing techniques for Puppet providers using Augeas

Over the past few months I've continued working on augeasproviders, a Puppet module that implements a few providers which use the Augeas configuration API to modify files. One of the interesting challenges has been testing providers, which I'm going to explain in this entry.

The challenge appears quite simple: the provider will alter the contents of a config file by adding, removing or editing a line, and the spec (that is, a set of RSpec examples using the puppetlabs_spec_helper) has to test that this was successful. Unlike core Puppet providers, which may alter package databases and entire system state, augeasproviders operates only on individual files, so it can be run safely within a test harness.

The main flow of testing is shown above and follows this sequence (a rough sketch in code follows the list):

  1. A temporary copy is made of a fixture file (a sample input file), which here is a typical /etc/hosts file. The resource is created on the fly from the type, referencing the temporary copy of the fixture.
  2. The resource is added to a Puppet catalog instance, and the catalog, containing this single resource, is applied using Puppet. This tests the provider in a near "real life" setting, as all normal provider functionality will be used.
  3. Puppet will call methods to create, destroy or change individual parts of an existing resource via getter/setter methods. In augeasproviders, these use the Augeas API to open the target file, make the change or retrieve the value and close the file.
  4. The catalog run will complete and the temporary fixture will have been updated with any changes made. If there have been any resource failures or log messages at warning or above, the spec fails.
  5. A second catalog run, which shouldn't produce any further changes, is performed to catch idempotency problems. If a getter and setter aren't consistent, Puppet will attempt to set the property on every run, since the two implementations differ. This useful test discovered a few issues in augeasproviders.
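A stripped-down sketch of this flow (the real augeasproviders spec helpers also capture logs and fail on warnings or unexpected change events; the host resource and fixture path here are illustrative):

require 'spec_helper'   # puppetlabs_spec_helper, as used by augeasproviders
require 'fileutils'
require 'tmpdir'

describe 'applying a single host resource through a catalog' do
  def apply_host(target)
    # Build the resource on the fly from the type, pointing at the copy (steps 1/2)
    resource = Puppet::Type.type(:host).new(
      :name => 'example.com', :ip => '192.168.0.1', :target => target
    )

    # Single-resource catalog, applied through the normal provider code (steps 2/3)
    catalog = Puppet::Resource::Catalog.new
    catalog.add_resource resource
    catalog.apply   # returns the transaction for inspection
  end

  it 'changes the file and is idempotent' do
    # Temporary copy of the fixture, so the original is untouched (step 1)
    target = File.join(Dir.mktmpdir, 'hosts')
    FileUtils.cp('spec/fixtures/hosts', target)

    apply_host(target)         # first run makes the change (step 4)

    txn = apply_host(target)   # second run should be a no-op (step 5)
    # a fuller spec would also assert that no change events were logged here
    expect(txn.any_failed?).to be_falsey
  end
end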

Now that the updated fixture is complete, its contents need to be tested. Originally, the tests used the Augeas API again to load the fixture and then check certain parts of the tree. Since the file has been persisted to disk using a lens that is tested and typechecked upstream, this is pretty safe and allows us to skip over implementation details such as whitespace formatting within the output.

Unfortunately it's a very verbose way of testing. An improvement was to use the augparse utility shipped with Augeas, normally used to test that lenses transform correctly between a sample file and a given tree. The spec was changed to generate a module for augparse that takes the fixture and an "expected tree" given in the spec, and compares the two via the lens.
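For the hosts example, the module passed to augparse looks something like this (an illustrative fragment showing the shape of the expected tree, rather than the helper's exact output):

module Test_hosts =
  test Hosts.lns get "192.168.0.1 example.com www\n" =
    { "1"
      { "ipaddr" = "192.168.0.1" }
      { "canonical" = "example.com" }
      { "alias" = "www" }
    }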

An additional feature was to filter the full fixture down to just the lines expected to change, so that only part of the entire file needs to be compared against a tree. (Clearly a bad bug could cause other parts to change, but it's a trade-off against maintaining the tree text for the entire file.)

The above techniques used to test the augeasproviders code could easily be adapted into other provider tests, whether or not the provider itself uses the Augeas API.

Update: the new idempotency code and all the fixes it's helped with have now been released as version 0.3.0.

