The random rantings of a concerned programmer.

Why Nagios Sucks

June 07th, 2011 | Category: Random

fml, dependency hell

The first thing you’ll notice when deploying Nagios is that it’s a bloody mess of languages. While I don’t normally mind when some software does this, when I have to install and configure a long list of software packages that all need their own dependencies configured separately, I start to groan.

Back in the day (was it the Nagios 2.x era?) Nagios was simply a cluster of C, some embedded Perl, and a bit of C++ in the event broker subsystem. That was kind of a mess since the CGI scripts (also written in C) had to be served in a special way. But it was one special way.

It seems like the new distribution of Nagios now ships with PHP wrappers around the CGI interfaces. It may not sound like much more, but sweet pineapples, really? At this point you’re running some kind of webserver (nginx) that’s configured to serve static files, serve out requests to the FCGI-wrapped (with yet another piece of software) Nagios CGI scripts, and serve out requests to a PHP daemon for trivial nonsense. For a single bloody monitoring tool you’ve now got 4 heavy daemons running (Nagios, Nagios CGI Scripts, Nginx, PHP). This is too much for me.

clusterfuck configuration DSL

Once you get past that part (and I’m skipping configuration of the Nagios daemon itself — not covered in the object files you need to write), you have to configure all your services and hosts and contacts and stuff. The configuration language isn’t absolutely terrible (but perhaps I’ve just gotten used to it over the years?) but it is exceptionally verbose. Arguably, it supports a lot of functionality, but rarely in a non-enterprise setup are you going to need to specify that a service alert should be escalated to a different group of people on holidays.

The sample configuration files are over 1000 lines long. A non-trivial deployment can expect to around 2000-10000 lines of their configuration DSL, depending on how much you can consolidate common functionality in templates.

the Nagios plugin API is a shitty cludge

A “Nagios plugin” is basically an external executable on the machine you’re probing. The input/output format is quite simple — your executable takes any command-line arguments you want (you have to configure them in the Nagios configuration files anyway), and returns output via stdout and the POSIX return value. It’s not too shabby, and it’s kind of nice to use whatever language you want to write plugins.

The cludge, however, is that they’re returning three distinct strings via stdout. I’ll let the documentation explain:

At a minimum, plugins should return at least one of text output. Beginning with Nagios 3, plugins can optionally return multiple lines of output. Plugins may also return optional performance data that can be processed by external applications. The basic format for plugin output is shown below:


Getting your output data into that format isn’t too bad, but…

oh, you wanted to USE that data?

Once you’ve got your pretty install working with a bunch of blinking green lights, you might get the bright idea to use some of the data to make graphs and such. There are plenty of “scripts” to do this already, but they’re all clunky as shit. Why?

Nagios is designed to throw data away.

It doesn’t store data persistently. The only state it stores is the current state of all services (and even that gets thrown away when you kill the Nagios daemon).

To ameliorate this, Nagios has a component called the event broker. I haven’t mucked with it in a couple of years, but IIRC it’s a tacked-on piece written in C++ (whereas the rest of Nagios is in C). The idea is that it provides a hook for an external service to listen and collect information and package it off somewhere. So you’ve got tools like rrdtool (and I remember using one that dumped to a PostgreSQL database too, but can’t find it) that grab and warehouse the log data.

The problem is that you’ve got all this log data which is just a bunch of unstructured strings. To add insult to injury, when I was hacking on it a couple years ago, the event broker didn’t bother broking the long text line (I think it still did the perfdata though). Hacking that fix in was a nightmare, mostly because the event broker code is a clusterfuck of indirection. It necessarily has to be, because it’s loading in arbitrary shared object files which contain the plugin code.

sound like a piece of shit yet?

No? Here’s my current use-case: I have some hosts connected via low-reliability SSH tunnels, and want a graph of how often they’re online. To use Nagios, I need to

  • Configure 5 daemons (Nginx, PHP, Nagios, Nagios-CGI, an event broker plugin, and probably an RDBMS for the plugin).
  • Write a custom check plugin that serializes and deserializes input through an insane format.
  • Write a web interface which gets the data out of the RDBMS and generates a fancy graph.

…at some point, it becomes more work than just hacking a brand new piece of software.

Arguably, my use-case is a pretty atypical one for Nagios. Maybe I should be writing a custom piece of software.

But fuck it I’m writing one anyway so I might as well bitch about stuff.