At the company I work for, we have a NOC department that is manned 24/7/365. Our NOC monitors a lot of network equipment, and its monitoring system needs to be up and running around the clock.
Our old platform was built on a commercial solution that couldn't keep up with our demands. We needed something that was more dynamic, more scalable and could easily be extended with other tools.
Our base requirements are quite basic. We need a monitoring system that can process active and passive checks, send alerts over SMS or mail and write meta/performance data in a nice way. Almost every monitoring system can deliver this, regardless of whether it's commercial or open source.
Our extended requirements cover things like HA (High Availability), modularity, and application clusters for load distribution and HA. These can't be found in an out-of-the-box solution. Our old platform was an out-of-the-box solution in that sense. It had HA, but it was not modular or scalable in the same way that our new platform is.
So what did we end up with? Icinga.
Well, Icinga is only one part of our stack. It acts as our core system that collects monitoring data and distributes it to other applications. Because of that, I'll focus on Icinga in this post.
So why did we choose Icinga?
Short answer: There is no such thing as a short answer to this question.
- Regarding performance, Icinga scales very well compared to other monitoring software. Apart from well-written code, that's also because the system is modular.
- The configuration language is almost a scripting language for monitoring. Icinga refers to this as "Monitoring as Code". We use this a lot in our setup. For example, we are able to create hundreds of thousands of service objects in Icinga by "looping" the config. In our setup, no Icinga service is created in a static way. This also makes performance testing easy for us: we simply loop the creation of x number of hosts and services.
- Their integration portfolio is big and kept up to date with the latest technologies. For example, we export data to InfluxDB, which Grafana reads from. We also export data to a MySQL database that runs on a three-node MySQL Galera cluster.
- Icinga has an official Puppet module which allows us to automate the deployment of Icinga components. Unfortunately, we deployed our installation of Icinga before the module was released, but we are working at full speed on implementing it.
- We are able to version control the configuration and test it in the lab before pushing it to production. If all is well in the lab, the new config commit will most likely work in production as well.
- It has a configuration GUI, which we don't actually use yet because it lacks the version control and lab testing workflow described in the previous point. (No configuration GUI could do that anyway, so it's fine.)
- It's Open Source. This is a huge advantage for us. People tend to think of Open Source as complicated and hard to understand, but for us it's a piece of cake. (I will write a post that explains why Open Source is an advantage some day.)
- Icinga has an active and big community.
- Icinga clusters have a replay log that actually works. If the connection between the master zone and the satellite zone goes down, our satellites will operate on their own and alert us over SMS if an event triggers an alarm. When the connection is brought back, Icinga will replay the event log to our master zone. In our old platform this was not always the case.
- Because we built our platform ourselves, we have a great understanding of how every component works.
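
To make the "Monitoring as Code" point a bit more concrete, here is a minimal sketch of the load-testing idea: producing x hosts with y services each. Note that in our setup the looping happens inside Icinga's own configuration language; the Python below is a stand-in that simply renders plain `object Host` and `object Service` definitions as text. The function names, the `loadtest-` naming scheme and the `127.0.0.1` address are made up for illustration, and the `hostalive` and `dummy` check commands are assumed to be available from the Icinga Template Library.

```python
def render_host(name: str, address: str) -> str:
    """Return one Icinga 2 Host object definition as text."""
    return (
        f'object Host "{name}" {{\n'
        f'  address = "{address}"\n'
        f'  check_command = "hostalive"\n'  # assumes the ITL is included
        f'}}\n'
    )


def render_service(host: str, name: str) -> str:
    """Return one Icinga 2 Service object definition bound to a host."""
    return (
        f'object Service "{name}" {{\n'
        f'  host_name = "{host}"\n'
        f'  check_command = "dummy"\n'  # cheap check, enough for load tests
        f'}}\n'
    )


def render_load_test(hosts: int, services_per_host: int) -> str:
    """Render `hosts` hosts, each with `services_per_host` services."""
    blocks = []
    for h in range(1, hosts + 1):
        host = f"loadtest-host-{h:05d}"  # hypothetical naming scheme
        blocks.append(render_host(host, "127.0.0.1"))
        for s in range(1, services_per_host + 1):
            blocks.append(render_service(host, f"loadtest-svc-{s:03d}"))
    return "\n".join(blocks)


if __name__ == "__main__":
    # Two hosts with three services each, printed to stdout.
    print(render_load_test(hosts=2, services_per_host=3))
```

Raising the two numbers is all it takes to scale the test up, which is the same property we rely on when the loop lives in the Icinga config itself.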
We have been using Icinga for about one and a half years now, running with a couple of pilot customers for the first eight months. During that time we developed the platform, evaluated three time series databases (Graphite, OpenTSDB and InfluxDB), researched how to use Icinga's configuration language in a dynamic way, and built surrounding systems like MySQL, Postfix, Icingaweb2, Nagvis and so on.
Every component in our platform runs on its own set of virtual machines, and with the current config our Icinga system can grow to about twenty times the current number of hosts and services.