Jun 7

Why Nagios Sucks

Category: Random

fml, dependency hell

The first thing you’ll notice when deploying Nagios is that it’s a bloody mess of languages. While I don’t normally mind when some software does this, when I have to install and configure a long list of software packages that all need their own dependencies configured separately, I start to groan.

Back in the day (was it the Nagios 2.x era?) Nagios was simply a cluster of C, some embedded Perl, and a bit of C++ in the event broker subsystem. That was kind of a mess since the CGI scripts (also written in C) had to be served in a special way. But it was one special way.

It seems like the new distribution of Nagios now ships with PHP wrappers around the CGI interfaces. It may not sound like much more, but sweet pineapples, really? At this point you’re running some kind of webserver (nginx) that’s configured to serve static files, serve out requests to the FCGI-wrapped (with yet another piece of software) Nagios CGI scripts, and serve out requests to a PHP daemon for trivial nonsense. For a single bloody monitoring tool you’ve now got 4 heavy daemons running (Nagios, Nagios CGI Scripts, Nginx, PHP). This is too much for me.

clusterfuck configuration DSL

Once you get past that part (and I’m skipping configuration of the Nagios daemon itself — not covered in the object files you need to write), you have to configure all your services and hosts and contacts and stuff. The configuration language isn’t absolutely terrible (but perhaps I’ve just gotten used to it over the years?) but it is exceptionally verbose. Arguably, it supports a lot of functionality, but rarely in a non-enterprise setup are you going to need to specify that a service alert should be escalated to a different group of people on holidays.

The sample configuration files are over 1000 lines long. A non-trivial deployment can expect to run to around 2000-10000 lines of the configuration DSL, depending on how much common functionality you can consolidate into templates.

the Nagios plugin API is a shitty kludge

A “Nagios plugin” is basically an external executable on the machine you’re probing. The input/output format is quite simple: your executable takes any command-line arguments you want (you have to configure them in the Nagios configuration files anyway), and returns output via stdout and the process exit status. It’s not too shabby, and it’s kind of nice to be able to write plugins in whatever language you want.
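To make that concrete, here is a minimal sketch of a plugin in Python. The TCP check itself and the names `check_tcp` and `main` are made up for illustration; the only convention it leans on is the usual mapping of exit statuses 0, 1, 2 and 3 to OK, WARNING, CRITICAL and UNKNOWN.

```python
#!/usr/bin/env python3
"""Sketch of a check plugin: is a TCP port on a host reachable?"""
import socket
import sys

# Exit statuses that Nagios maps to service states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0):
    """Return (status, message) for a single TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} is reachable"
    except OSError as exc:
        # Covers refused connections, timeouts and DNS failures.
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"


def main(argv):
    """Parse HOST and PORT, print one line of output, return the status."""
    if len(argv) != 3 or not argv[2].isdigit():
        print(f"UNKNOWN - usage: {argv[0]} HOST PORT")
        return UNKNOWN
    status, message = check_tcp(argv[1], int(argv[2]))
    print(message)
    return status


# To use this as a real plugin, end the file with:
#     sys.exit(main(sys.argv))
```

Everything Nagios learns from a run is the printed line plus the exit status, which is why the command-line arguments have to be spelled out again on the Nagios side.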

The kludge, however, is that they’re returning three distinct strings (the summary text, the long text, and the perfdata) via stdout. I’ll let the documentation explain:

At a minimum, plugins should return at least one line of text output. Beginning with Nagios 3, plugins can optionally return multiple lines of output. Plugins may also return optional performance data that can be processed by external applications. The basic format for plugin output is shown below:

TEXT OUTPUT | OPTIONAL PERFDATA
LONG TEXT LINE 1
LONG TEXT LINE 2
...
LONG TEXT LINE N  | PERFDATA LINE 2
PERFDATA LINE 3
...
PERFDATA LINE N
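
For illustration, a helper that assembles that layout could look roughly like the sketch below. The function name `format_plugin_output` and its arguments are hypothetical, and putting all perfdata on the first line when there is no long text is my assumption, not something the quoted documentation spells out.

```python
def format_plugin_output(text, long_text=(), perfdata=()):
    """Build plugin stdout in the layout quoted above.

    text      -- the one-line summary
    long_text -- extra lines of detail (may be empty)
    perfdata  -- performance data strings (may be empty)
    """
    long_text = list(long_text)
    perfdata = list(perfdata)

    # "|" separates text from perfdata, so it cannot appear in the text.
    for line in [text] + long_text:
        if "|" in line or "\n" in line:
            raise ValueError(f"text may not contain '|' or a newline: {line!r}")

    if not perfdata:
        return "\n".join([text] + long_text)
    if not long_text:
        # Assumption: with no long text, keep all perfdata on the first line.
        return text + " | " + " ".join(perfdata)

    # First perfdata entry rides on the summary line.
    lines = [text + " | " + perfdata[0]] + long_text
    if len(perfdata) > 1:
        # Remaining perfdata starts after a "|" on the last long text line.
        lines[-1] += " | " + perfdata[1]
        lines.extend(perfdata[2:])
    return "\n".join(lines)
```

A plugin would print the returned string and then exit with its status code.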

Getting your output data into that format isn’t too bad, but…

oh, you wanted to USE that data?

Once you’ve got your pretty install working with a bunch of blinking green lights, you might get the bright idea to use some of the data to make graphs and such. There are plenty of “scripts” to do this already, but they’re all clunky as shit. Why?

Nagios is designed to throw data away.

It doesn’t store data persistently. The only state it stores is the current state of all services (and even that gets thrown away when you kill the Nagios daemon).

To ameliorate this, Nagios has a component called the event broker. I haven’t mucked with it in a couple of years, but IIRC it’s a tacked-on piece written in C++ (whereas the rest of Nagios is in C). The idea is that it provides a hook for an external service to listen and collect information and package it off somewhere. So you’ve got tools like rrdtool (and I remember using one that dumped to a PostgreSQL database too, but can’t find it) that grab and warehouse the log data.

The problem is that you’ve got all this log data which is just a bunch of unstructured strings. To add insult to injury, when I was hacking on it a couple years ago, the event broker didn’t bother broking the long text line (I think it still did the perfdata though). Hacking that fix in was a nightmare, mostly because the event broker code is a clusterfuck of indirection. It necessarily has to be, because it’s loading in arbitrary shared object files which contain the plugin code.
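
Getting structure back out of those strings means parsing them yourself. Here is a rough sketch of what that takes in Python, assuming perfdata entries look like `label=value[unit];warn;crit;min;max`. The name `parse_plugin_output` is hypothetical, this is not the event broker's code, and it ignores edge cases such as labels containing an escaped apostrophe.

```python
import re
import shlex

# A leading number followed by an optional unit, e.g. "2643MB" or "0.5s".
_VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)(.*)$")


def parse_plugin_output(raw):
    """Split raw plugin stdout into (text, long_text, metrics).

    metrics maps each perfdata label to a (value, unit) tuple.
    """
    text, long_text, perf_chunks = "", [], []
    in_perfdata = False
    for index, line in enumerate(raw.splitlines()):
        if index == 0:
            # Summary line: text, then optional perfdata after "|".
            text, _, perf = line.partition("|")
            text = text.strip()
            if perf.strip():
                perf_chunks.append(perf.strip())
        elif in_perfdata:
            # Once a later line has had a "|", the rest is perfdata.
            perf_chunks.append(line.strip())
        else:
            body, sep, perf = line.partition("|")
            long_text.append(body.rstrip())
            if sep:
                in_perfdata = True
                if perf.strip():
                    perf_chunks.append(perf.strip())

    metrics = {}
    for chunk in perf_chunks:
        # shlex honours single-quoted labels that contain spaces.
        for entry in shlex.split(chunk):
            label, sep, fields = entry.partition("=")
            if not sep:
                continue
            match = _VALUE_RE.match(fields.split(";")[0])
            if match:
                metrics[label] = (float(match.group(1)), match.group(2))
    return text, long_text, metrics
```

Even this much only recovers the perfdata numbers; the text lines are still free-form strings that every plugin formats differently.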

sound like a piece of shit yet?

No? Here’s my current use-case: I have some hosts connected via low-reliability SSH tunnels, and want a graph of how often they’re online. To use Nagios, I need to:

- Configure 5 daemons (Nginx, PHP, Nagios, Nagios-CGI, and probably an RDBMS), plus an event broker plugin to feed the RDBMS.
- Write a custom check plugin that serializes and deserializes input through an insane format.
- Write a web interface which gets the data out of the RDBMS and generates a fancy graph.

…at some point, it becomes more work than just hacking a brand new piece of software.
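
For a sense of scale, the data-keeping half of such a replacement is small. This is only a sketch, not the tool I'm writing: the `results` table and the `open_store`, `record` and `availability` functions are hypothetical names, and it assumes each check result is stored as a host, a timestamp and a plugin exit status (0 meaning OK).

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) a SQLite database holding check results."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " host TEXT NOT NULL,"
        " checked_at REAL NOT NULL,"
        " status INTEGER NOT NULL)"
    )
    return db


def record(db, host, status, checked_at=None):
    """Store one check result; status is the plugin's exit status."""
    when = time.time() if checked_at is None else checked_at
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO results (host, checked_at, status) VALUES (?, ?, ?)",
            (host, when, status),
        )


def availability(db, host, since=0.0):
    """Fraction of checks since `since` that returned OK, or None if no data."""
    total, ok = db.execute(
        "SELECT COUNT(*), SUM(status = 0) FROM results"
        " WHERE host = ? AND checked_at >= ?",
        (host, since),
    ).fetchone()
    return None if total == 0 else ok / total
```

Because every result is kept rather than thrown away, the graph becomes a query over the table instead of something bolted onto an event hook.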

Arguably, my use-case is a pretty atypical one for Nagios. Maybe I should be writing a custom piece of software.

But fuck it I’m writing one anyway so I might as well bitch about stuff.

Tagged with: anyone else wwdc today, bitching about stuff, nagios, stacks deeper than you're anus
16 Comments so far

Anonymous June 7th, 2011 8:32 am

This seems like it could be done pretty easily with python and some plotting lib? Or am I missing something?
Taro June 7th, 2011 9:34 am
Yep, doing it with Python.

But really, Nagios is just the following things:
- A scheduler
- A collection of scripts the scheduler runs
- A configuration language for telling it which scripts to run
- An event system to do things when scripts return certain values
- A flashy interface with blinking lights
Note that “warehouse the collected data points” is not on the list.

So I wrote a short Python script scheduler which (in 6 hours) will work with the existing Nagios plugins (because there are a lot of good plugins which work really well). Just need to write the flashy wank web interface.
Anonymous June 8th, 2011 1:28 am

Ah, cool. Will you release the code somewhere?
Anonymous June 8th, 2011 4:13 am

You used “warehouse” as a verb twice. Maybe it’s time to switch jobs for a bit.
Taro June 8th, 2011 9:11 am

> code release

probably, but not until I spend some time polishing it up a lot.

> “warehouse”

ihbt
Rocketman June 4th, 2012 8:07 pm

If I may suggest Shinken: it is a full Nagios rewrite in 100% Python, with a distributed architecture, a sensible scheduler, Livestatus compatibility, and a Python frontend or Livestatus-based frontend.

It supports the Nagios configuration. It has a nice broker module that sends data to Graphite.

It has integrated high-performance poller modules, notably a new SNMP one soon to be released to do away with the myriad of little SNMP scripts (finally).

Everything is Python: daemons, graphing (Graphite), frontends (Multisite, Shinken WebUI). NagVis is PHP, though.

It fixes soo much of what is seriously wrong with Nagios in a tenth of the code! Ha!

Code you can actually read and understand.

Cheers. Try it using OMD distribution package or use the new installer script. See if you like it.
Taro June 7th, 2012 9:53 am

OMD packages don’t support FreeBSD, IIRC. I’m just going to do a source install in a jail. Thanks for the heads up, from a cursory glance, it doesn’t look half bad!
Anonymous July 11th, 2012 9:59 am

What about collectd (http://collectd.org/)? Simple, expandable, scalable, set up in a few minutes, small footprint, and fast as hell.

The worst part is the interface, but it has several options. It relies on RRD, so Cacti or similar can be adapted.
vij_pun December 11th, 2012 3:22 am

Ohh looks like someone is too angry with Nagios or should I say Jealous ?

Nagios is FREE… Easy to use, everything is configurable.

What else you want to do ?

Built another tool which does the same thing ?
& pretty much good support is available.
Taro December 11th, 2012 11:30 am

I never thought I’d see a Nagios apologist, lol.
Maro January 2nd, 2013 11:58 am

Right,
A big blob of mud.
The scattered documentation is out of date; configuration file locations and sections are a mess. Like a car that 100 mechanics have each done a fix on.
Matt Stevens January 30th, 2013 11:29 am

Agree, 100%. In fact I only found this post because I Googled “Nagios Design” – hoping somebody had finally sorted it out. I’ve been using it for years though, and it’s never let me down. It just looks like crap. Maybe I should just get off my arse and make one instead of bitching…
NagiosLover February 7th, 2013 11:37 am

Nagios has no equal. you can do a survey/poll of network monitoring applications. see for yourself who wins.
Kiryl Hakhovich February 27th, 2013 10:41 pm

the only thing nagios SUCK at is a fucking CGI interface.

everything else about it – a solid piece of work that does one thing well – monitoring and alerting.

it was never meant to be a statistical-monitoring-intelligence-analysis shitty ass software which everyone lately is talking about in every corner of IT you go.

While i DO like to see a project cacti+nagios together – but it is nearly impossible in the opensource world, thus we get separate tools that do only ONE job but do it well.

and yes nagios is for old school – kids who don’t like it – fuck go write your own shit based on tweeter framework :)) maybe you can get old schoolers to follow along.
ilSimo April 19th, 2013 10:42 am

Do you seriously believe that the only thing in nagios that sucks is the CGI interface? that can be easily replaced, after all.

with active checks, performance is a big issue, and making it scale using monitoring “slaves” looks more like a hack than a feature.
As a consequence distributed monitoring is also a mess.

And, honestly, if it wasn’t for check_mk our system with about 4500 services would be an overkill to manage.
Taro April 20th, 2013 12:25 am

> Do you seriously believe that the only thing in nagios that sucks is the CGI interface?

Did you read the blog post?

* Dependency hell (how many runtimes does it depend on? fuck me)
* Shitty configuration DSL
* Plugin API could be better structured
* Event broker plugins are shite

I don’t know why so many people are qq’ing about a blog post I made years ago. I don’t like nagios. Deal with it. Done with these comments.