21 mai 2017

A tour of Riemann : check disk, by, throttle, email

How tu use the (by) stream in Riemann ?

The problem

I now want to monitor disk usage. If a filesystem is 80 % full, fire an email. But i don’t want to be spammed , so i want at most 2 mails every hours for each distinct full filesystem.

I will receive events in Riemann like this one :

{:host "debian-mathieu.corbin"
 :service "df-root/percent_bytes-used"
 :state nil
 :description nil
 :metric 73.04872131347656
 :tags []
 :time 1495380355
 :ttl 20.0}

Here, the root fs is 73 % full for host debian-mathieu.corbin.

Email

You can send email using Riemann. Let’s define a stream to send email.

Create a file mycorp/output/email.clj :

(ns mycorp.output.email
  "send email"
  (:require [riemann.config :refer :all]
            [riemann.streams :refer :all]
            [riemann.test :refer :all]
            ;; we should import riemann.email
            [riemann.email :refer :all]
            [clojure.tools.logging :refer :all]))

;; this stream can be used to send email
(def email (mailer {:from "me@mcorbin.fr"
                    :host "mail.foo.com"
                    :user "foo"
                    :password "bar"}))

Here, we use def to define a new stream named email. We will use it to send emails.

Throttle

We will use throttle to limit the number of email. Take a look at the Riemann howto for more informations about throttle

Solution

Tests

First, let’s create a file mycorp/system/disk.clj and write the tests for our use case:

(ns mycorp.system.disk
  "Check disk"
  (:require [riemann.config :refer :all]
            [riemann.streams :refer :all]
            [riemann.test :refer :all]
            [mycorp.output.email :as email]
            [clojure.tools.logging :refer :all]))

(def disk-stream)

(tests
 (deftest disk-stream-test
   ;; i inject test events only in disk-stream
   (let [result (inject! [mycorp.system.disk/disk-stream]
                         [;; ok
                          {:host "debian-mathieu.corbin"
                           :service "df-root/percent_bytes-used"
                           :state nil
                           :description nil
                           :metric 73
                           :tags []
                           :time 1
                           :ttl 20.0}
                          ;; random event
                          {:host "debian-mathieu.corbin"
                           :service "random_service"
                           :state nil
                           :description nil
                           :metric 100
                           :tags []
                           :time 1
                           :ttl 20.0}
                          ;; debian-mathieu.corbin/root full
                          {:host "debian-mathieu.corbin"
                           :service "df-root/percent_bytes-used"
                           :state nil
                           :description nil
                           :metric 90
                           :tags []
                           :time 3
                           :ttl 20.0}
                          ;; debian-mathieu.corbin/var-log full
                          {:host "debian-mathieu.corbin"
                           :service "df-var-log/percent_bytes-used"
                           :state nil
                           :description nil
                           :metric 90
                           :tags []
                           :time 4
                           :ttl 20.0}
                          ;; debian-mathieu.corbin/root full
                          {:host "debian-mathieu.corbin"
                           :service "df-root/percent_bytes-used"
                           :state nil
                           :description nil
                           :metric 90
                           :tags []
                           :time 4
                           :ttl 20.0}
                          ;; debian-mathieu.corbin/root full
                          {:host "debian-mathieu.corbin"
                           :service "df-root/percent_bytes-used"
                           :state nil
                           :description nil
                           :metric 91
                           :tags []
                           :time 4
                           :ttl 20.0}
                          ;; guixsd-mathieu.corbin/root full
                          {:host "guixsd-mathieu.corbin"
                           :service "df-root/percent_bytes-used"
                           :state nil
                           :description nil
                           :metric 90
                           :tags []
                           :time 4
                           :ttl 20.0}
                          ;; debian-mathieu.corbin/root full
                          {:host "debian-mathieu.corbin"
                           :service "df-root/percent_bytes-used"
                           :state nil
                           :description nil
                           :metric 93
                           :tags []
                           :time 3605
                           :ttl 20.0}])]
     ;; :disk-stream-tap-1 should contains all events indicating a full fs
     (is (= (:disk-stream-tap-1 result)
            [{:host "debian-mathieu.corbin"
              :service "df-root/percent_bytes-used"
              :state nil
              :description nil
              :metric 90
              :tags []
              :time 3
              :ttl 20.0}
             {:host "debian-mathieu.corbin"
              :service "df-var-log/percent_bytes-used"
              :state nil
              :description nil
              :metric 90
              :tags []
              :time 4
              :ttl 20.0}
             {:host "debian-mathieu.corbin"
              :service "df-root/percent_bytes-used"
              :state nil
              :description nil
              :metric 90
              :tags []
              :time 4
              :ttl 20.0}
             {:host "debian-mathieu.corbin"
              :service "df-root/percent_bytes-used"
              :state nil
              :description nil
              :metric 91
              :tags []
              :time 4
              :ttl 20.0}
             {:host "guixsd-mathieu.corbin"
              :service "df-root/percent_bytes-used"
              :state nil
              :description nil
              :metric 90
              :tags []
              :time 4
              :ttl 20.0}
             {:host "debian-mathieu.corbin"
              :service "df-root/percent_bytes-used"
              :state nil
              :description nil
              :metric 93
              :tags []
              :time 3605
              :ttl 20.0}]))
     ;; :disk-stream-tap-2 should contains all events passed to the email stream.
     ;; for each host/service, we want maximum 2 mails every 3600 seconds
     (is (= (:disk-stream-tap-2 result)
            [ ;; first debian-mathieu/root
             {:host "debian-mathieu.corbin"
              :service "df-root/percent_bytes-used"
              :state nil
              :description nil
              :metric 90
              :tags []
              :time 3
              :ttl 20.0}
             ;; first debian-mathieu/var-log
             {:host "debian-mathieu.corbin"
              :service "df-var-log/percent_bytes-used"
              :state nil
              :description nil
              :metric 90
              :tags []
              :time 4
              :ttl 20.0}
             ;; second debian-mathieu/root
             {:host "debian-mathieu.corbin"
              :service "df-root/percent_bytes-used"
              :state nil
              :description nil
              :metric 90
              :tags []
              :time 4
              :ttl 20.0}
             ;; first debian-mathieu/guixsd
             {:host "guixsd-mathieu.corbin"
              :service "df-root/percent_bytes-used"
              :state nil
              :description nil
              :metric 90
              :tags []
              :time 4
              :ttl 20.0}
             ;; next window (time = 3605), first debian-mathieu/root
             {:host "debian-mathieu.corbin"
              :service "df-root/percent_bytes-used"
              :state nil
              :description nil
              :metric 93
              :tags []
              :time 3605
              :ttl 20.0}])))))

In this test suite, we have 2 :tap.

The first one, :disk-stream-tap-1, will contain all events representing a fs > to 80 %. The second, :disk-stream-tap-2, all events actually send by email. The distinction is important.

Remember, we only want 2 email per hour for each distinct full filesystem to not be spammed.

Look at the :disk-stream-tap-2 tests. We injected 3 events commented debian-mathieu.corbin/root full, but in :disk-stream-tap-2 we only had 3, because of throttle (2 in the first 3600 seconds, 1 after).

Don’t forget to add in riemann.config the new files :

(include "mycorp/output/email.clj")
(include "mycorp/system/ram.clj")

First (incorrect) solution

We saw in a previous article how to perform a simple check. Why not reuse it with throttle and email ?

(def disk-stream
  "Check if disk if > to 80 %, email if it is. Send only 2 email for each alert type."
  ;; #"percent_bytes-used$" is a regex, we only want events where :service match the regex
  (where (and (service #"percent_bytes-used$")
              ;; Test if disk is 80 % full
              (> (:metric event) 80))
    (tap :disk-stream-tap-1)
    ;; 2 events max every 3600 secondes using throttle
    (throttle 2 3600
      (tap :disk-stream-tap-2)
      ;; send email using the email stream defined in mycorp.output.email
      (io (email/email "foo@mcorbin.fr")))))

Launch riemann test riemann.config. It fails in the second test (:disk-stream-tap-2). Why ? because in this solution, we only send 2 email regardless the host/service fields. If we have 10 alerts for 10 differents filesystem, with this solution we will send only 2 emails for the 2 first alerts.

We want to have independant throttle for each host/filesystem. And for this, we will use the by stream.

Final solution

We just need to add (by):

(def disk-stream
  "Check if disk if > to 80 %, email if it is. Send only 2 email for each alert type."
  ;; #"percent_bytes-used$" is a regex, we only want events where :service match the regex
  (where (and (service #"percent_bytes-used$")
              ;; Test if disk is 80 % full
              (> (:metric event) 80))
    (tap :disk-stream-tap-1)
    ;; use (by) to have independant streams for each host/service couple
    (by [:host :service]
      ;; 2 events max every 3600 secondes using throttle
      (throttle 2 3600
        (tap :disk-stream-tap-2)
        ;; send email using the email stream defined in mycorp.output.email
        (io (email/email "foo@mcorbin.fr"))))))

riemann test riemann.config is now passing !

Conclusion

You now know how to send email, and how to use by and throttle.

Code here.

Tags: clojure riemann tour-of-riemann english
Top of page