Mirabelle and Cabourotte for blackbox monitoring
In this article, I will explain how I monitor my personal infrastructure from the outside using two tools I wrote: Cabourotte for healthchecks and Mirabelle for stream processing and alerting.
Cabourotte for blackbox monitoring
It’s always useful to execute various kind of healthchecks on your infrastructure. As explained in the Cabourotte website, networking is complex today and having the ability to execute healthchecks from various sources (in order to detect small/transient network issues, timeouts… from parts of your infrastructure) is for me important.
People always use services like Pingdom for blackbox monitoring for their public endpoints, but for me running these kinds of healthchecks everywhere in the stack is really a good idea.
I was not satisfied with the existing tools to do that and so I wrote my own. I wanted for example:
-
A tool where everything can be controlled throught the API or with a configuration file, or both without conflicts between the two.
-
A lot of healthchecks options: TCP, DNS, HTTPs, TLS, certificates monitoring, with a lot of options like custom body/headers, mTLS, being able to configure which status code/response is expected, having the ability to configure expected failures for services which should be unreachable…
-
Hot reload.
-
Exporters to push healthchecks results to external systems.
-
Good metrics and logging: rate/duration for healthchecks, metrics on the tool itself (exporters, HTTP API, internal queues…). In Cabourotte, tons of metrics are exposed using the Prometheus format and logs are in JSON.
-
Lightweight and simple to deploy.
Cabourotte is for me the perfect tool for what I listed.
I have a small infrastructure running on Exoscale, on the managed Kubernetes product (you should really try it, it’s very good, and I’m not saying that because I helped built the product :D).
I first wanted to monitor all my external/public endpoints using Cabourotte (I plan to do internal monitoring of pods a bit later).
For example, here is a part of the Cabourotte configuration file in my case:
http:
host: "0.0.0.0"
port: 9013
http-checks:
- name: "https-mcorbin.fr"
description: "https healthcheck on mcorbin.fr"
valid-status:
- 200
target: "www.mcorbin.fr"
port: 443
protocol: "https"
path: "/pages/health/"
timeout: 5s
interval: 5s
labels:
site: mcorbin.fr
- name: "redirect-mcorbin.fr"
description: "redirect on mcorbin.fr"
valid-status:
- 308
target: "mcorbin.fr"
port: 80
protocol: "http"
path: "/pages/health"
timeout: 5s
interval: 5s
labels:
site: mcorbin.fr
exporters:
riemann:
- name: "mirabelle"
host: "89.145.167.133"
port: 5555
ttl: "120s"
As you can see, I have two HTTP healthchecks on mcorbin.fr
, one testing HTTPS connections and one testing a redirection. In reality, I have a lot more healthchecks (DNS ones, healthchecks targeting my subdomains…) but putting a giant YAML file in this article is not interesting.
You can check the Cabourotte configuration documentation to see the available options.
I also configure an exporter of type riemann
. Riemann is an amazing stream processing tool written primarly by Kyle Kingsbury. I contributed at some point to it, and decided this year (after thinking a lot about it the last few years) to write a new tool heavily inspired by it (and using the same protocol so all Riemann tooling and integrations should work on Mirabelle) named Mirabelle.
I wrote an article about why I wrote Mirabelle, which is available here. Since, a lot of things were added to it, and I spent a lot of time on the documentation.
So, let’s go back to Cabourotte. Once launched, I see logs like that:
{"level":"info","ts":1623064957.3846312,"caller":"exporter/root.go:126","msg":"Healthcheck successful","name":"https-mcorbin.fr","labels":{"site":"mcorbin.fr"},"healthcheck-timestamp":1623064957}
{"level":"info","ts":1623064962.3829973,"caller":"exporter/root.go:126","msg":"Healthcheck successful","name":"redirect-mcorbin.fr","labels":{"site":"mcorbin.fr"},"healthcheck-timestamp":1623064962}
So far so good. I could scrape the /metrics
endpoint with for example Prometheus to gather my healthchecks metrics (in order to detect performances degradations for example, or monitor the number of errors) but I will use Mirabelle instead.
The Riemann/Mirabelle exporter
Everytime an healthcheck is executed, Cabourotte will generate and push to my Mirabelle instance an event. The event looks like (in JSON, in reality protobuf is used for performances, like in Riemann):
{
"host": "cabourotte-6fc6967b4b-nh6f2",
"service": "cabourotte-healthcheck",
"state": "ok",
"description": "redirect on mcorbin.fr on mcorbin.fr:80: success",
"metric": 0.00105877,
"tags": [
"cabourotte"
],
"time": 1623000193.0,
"ttl": 120.0,
"healthcheck": "redirect-mcorbin.fr",
"site": "mcorbin.fr"
}
If you compare that with the healthcheck configuration:
- name: "redirect-mcorbin.fr"
description: "redirect on mcorbin.fr"
valid-status:
- 308
target: "mcorbin.fr"
port: 80
protocol: "http"
path: "/pages/health"
timeout: 5s
interval: 5s
labels:
site: mcorbin.fr
Things should be clear. The metric
field is the duration of the healthcheck execution. Let’s now configure Mirabelle in order to alert on these events.
Mirabelle
I wrote a small Ansible Role to install Mirabelle, but I will describe in this article how to configure it step by step.
You can either use the prebuilt Java jar or the Docker image to launch it, as explained in the documentation.
The first thing to do is to write the Mirabelle configuration file (again, everything is explained in the documentation).
{:tcp {:host "0.0.0.0"
:port 5555}
:http {:host "0.0.0.0"
:port 5558}
:websocket {:host "0.0.0.0"
:port 5556}
:stream {:directories ["/etc/mirabelle/streams"]
:actions {}}
:io {:directories ["/etc/mirabelle/io"]
:custom {}}
:logging {:level "info"
:console {:encoder :json}}}
Some important things are the :stream
configuration which references a directory where the streams will be compiled, and the io
directory where I/O (stateful components in Mirabelle) will be defined.
I/O
Let’s first put a file name io.edn
in /etc/mirabelle/io
and define a new Pagerduty I/O named :pagerduty-client
:
{:pagerduty-client {:config {:service-key "your-pagerduty-integration-key"
:source-key :host
:summary-keys [:host :service :state]
:dedup-keys [:host :service]
:http-options {:socket-timeout 5000
:connection-timeout 5000}}
:type :pagerduty}}
The Pagerduty I/O options are described in the documentation.
Now, let’s write our streams.
Streams
Now, we will create a new file named stream.clj
in the directory you want (it will compiled later). What we first want to do is:
-
Filter expired events (to avoid handling old events, which should in theory not happen with Cabourotte).
-
Filter all events related to Cabourotte healthchecks.
-
Send an alert to Pagerduty if an healthcheck is not successful, and resolve the alert when it’s back.
Here is the Mirabelle configuration to do this:
(streams
(stream
{:name :cabourotte :default true}
(not-expired
(where [:= :service "cabourotte-healthcheck"]
(by [:host :healthcheck]
(changed :state "ok"
(sformat "cabourotte-healthcheck-alert-%s" :service [:healthcheck]
(index [:host :service])
(info)
(tap :cabourotte-pagerduty)
(push-io! :pagerduty-client))))))))
We first use not-expired
to filter expired events, and then keep only events with :service
equal to cabourotte-healthcheck
.
We then want to treat each healthcheck independently. We indeed want to alert and resolve alerts based on states transitions for each healthcheck and don’t want conflicts between healthchecks or Cabourotte instances (you can have multiple Cabourotte instances pushing events from multiple hosts).
We do that in Mirabelle by using the by
action, here with by [:host :healthcheck]
. Everything after by
will be splitted for each combination of :host
and :healthcheck
.
Then, we do (changed :state "ok"
. This action will let events pass only on states transitions, the default value being "ok"
. It will prevent Mirabelle to spam the Pagerduty API for each event.
Finally, we use sformat
to rebuild the :service
value. Events will be updated and will have as :service
the value cabourotte-healthcheck-alert-<healthcheck-name>
.
Indeed, we configured our Pagerduty client with :host
and :service
as :dedup-key
(to detect which event belongs to which alert), so the service value should be unique for each healthcheck.
We then do several things:
-
(index [:host :service])
: Index (in memory) the event. -
(info)
: Log the event. -
(tap :cabourotte-pagerduty)
: it will be used in tests (more about this later). -
(push-io! :pagerduty-client)
: push the event to Pagerduty.
And we’re done !
You now need to compile this file into an EDN representation. Everything is explained in the documentation (it’s only a java -jar mirabelle.jar compile <src-directory> <dst-directory>
). The result should be put in /etc/mirabelle/streams
which is the directory referenced in your configuration file.
Writing tests
A lot of monitoring tools do not allow you to write unit tests on their configurations. It’s a big issue because you are unable to verify the configuration because of that.
Mirabelle has a simple but powerful test system built in. Here is a test I wrote for ths previous configuration:
{:cabourotte {:input [{:metric 0.1
:time 0
:state "ok"
:service "cabourotte-healthcheck"
:healthcheck "dns-mcorbin.fr"
:host "host1"}
{:metric 1
:time 5
:state "critical"
:service "cabourotte-healthcheck"
:healthcheck "dns-mcorbin.fr"
:host "host1"}
{:metric 0.5
:time 10
:state "critical"
:service "cabourotte-healthcheck"
:healthcheck "dns-mcorbin.fr"
:host "host1"}]
:tap-results {:cabourotte-pagerduty
[{:metric 1
:time 5
:state "critical"
:service "cabourotte-healthcheck-alert-dns-mcorbin.fr"
:healthcheck "dns-mcorbin.fr"
:host "host1"}]}}}
Events in :input
will be injected in Mirabelle and we can then verify what was recorded in the tap
actions (in our case, the :cabourotte-pagerduty
one).
As you can see, things work as expected: only states transitions are forwarded to Pagerduty, and my :service
key was correctly updated.
Does it really works ?
I killed my blog containers to test the setup. I Immediately receive the alerts in Pagerduty:
Once the website is back, the alert is automatically resolved:
I could to a lot more, like using the Mirabelle stable
action to avoid flapping states for example. But for my simple use case this configuration is enough.
Can we do more ?
Yes ! Mirabelle is powerful and extensible (you can write your own actions). As I said before, the events :metric
field contains the healthchecks duration. Why not compute the percentiles on the fly for each healthcheck ?
(by [:healthcheck]
(moving-event-window 10
(coll-percentiles [0.5 0.75 0.98 0.99]
(with {:service "cabourotte-healthcheck-percentile" :host nil}
(publish! :healthchecks-percentiles)))))
If you add this code snippet after (not expired
in your original configuration, percentiles will be computed for each healthcheck independently (note that :host
is not used in by
here, we want to compute the percentiles on events for a given healthcheck fromm all Cabourotte instances together).
The events generated for each quantile will be updated with a new service name and then passed to publish!
, which allows use to subscribe to it using a websocket client (users can also subscribe to the index).
For example, I could use the Python script referenced in the Mirabelle documentation to subscribe to the percentiles events for the healthcheck "https-mcorbin.fr"
:
./websocket.py --channel healthchecks-percentiles --query '[:= :healthcheck "https-mcorbin.fr"]'
:
> Received at Mon Jun 7 12:07:52 2021
{
"host": null,
"service": "cabourotte-healthcheck-percentile",
"state": "ok",
"description": "https healthcheck on mcorbin.fr on www.mcorbin.fr:443: success",
"metric": 0.004632761,
"tags": [
"cabourotte"
],
"time": 1623067652.0,
"ttl": 120.0,
"healthcheck": "https-mcorbin.fr",
"site": "mcorbin.fr",
"quantile": 0.99
}
> Received at Mon Jun 7 12:07:57 2021
{
"host": null,
"service": "cabourotte-healthcheck-percentile",
"state": "ok",
"description": "https healthcheck on mcorbin.fr on www.mcorbin.fr:443: success",
"metric": 0.002387217,
"tags": [
"cabourotte"
],
"time": 1623067632.0,
"ttl": 120.0,
"healthcheck": "https-mcorbin.fr",
"site": "mcorbin.fr",
"quantile": 0.5
}
Mirabelle, like Riemann, is very powerful. You have tons of built-in actions available, can have multiple streams running in parallel with different clocks (for real time computations and continuous queries…), an API to manage streams, an InfluxDB integration… The documentation explains all of that ;)
Conclusion
Writing monitoring tools is fun.
I believe Cabourotte is really a small but powerful tool, which does one thing (executing healthchecks) but do it really well and in a "modern" way (API, metrics…). Don’t hesitate to try it, alongside Prometheus for example.
On the other hand, Mirabelle is also powerful. If you like Riemann, Clojure, and stream processing, or need an "event router", you should also try it.
Add a comment
If you have a bug/issue with the commenting system, please send me an email (my email is in the "About" section).