The Monitor Daemon

So, you've got an Arborist Manager running, and that's great. If you're astute, you may have noticed that it doesn't actually do very much by itself. To make things interesting, let's throw the Monitor daemon into the mix.

It knows how to speak the Manager's API, and it serves the following roles:

  1. It asks the Manager for the nodes that match each monitor's criteria.
  2. It runs the monitors' checks against those nodes.
  3. It reports the results back to the Manager as updates to node attributes.

Once new attribute data is set on a node, the Manager starts firing events all over the place -- we'll talk more about that later.

For now, just like before, we'll create a new directory -- this time to store the monitor files -- and use Arborist's built-in file loader source. Assuming you're still in /usr/local/arborist...

mkdir monitors
touch monitors/example.rb

Everything is launched in the same fashion as the Manager -- I won't bother going over all of that again, because it's late, I'm getting tired, and really, you can just go back a page and re-read that part if you need to. Laziness is a virtue. Recall that you start Arborist components like arborist start <component> <source>.

arborist -c config.yml -l info start monitors monitors
[2016-08-30 00:03:15.641867 7541/main]  info {} -- Loading config from #<Pathname:config.yml> with defaults for sections: [:logging, :arborist].
[2016-08-30 00:03:15.700270 7541/main]  info {Arborist::Monitor} -- Loading monitor file monitors/example.rb...
[2016-08-30 00:03:15.700512 7541/main]  info {Arborist::Client:0x1ea66f8} -- Connecting to the tree socket "ipc:///tmp/arborist_tree.sock"
[2016-08-30 00:03:15.700577 7541/main]  info {} -- Using ZeroMQ 4.0.3/CZMQ 2.0.1

Monitors 101

Let's write our first monitor. There are a couple of questions you should ask yourself before starting.

  1. What node types does this monitor apply to?
  2. What constitutes an up state vs a down state?

A monitor asks the Manager for the nodes that match certain criteria. You can exclude nodes with the same mechanism, use tags to organize them, or search on any other attributes you please.
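
For example, the selection part of a monitor might look something like this -- a sketch, assuming a hypothetical 'testing' tag on some of your nodes:

Arborist::Monitor 'match and exclude demo', :demo do
    every 60
    match type: 'service'
    exclude tag: 'testing'    # skip anything tagged 'testing' (hypothetical tag)

    exec do |nodes|
        # Placeholder work: report every matched node as fine.
        results = {}
        nodes.each_key {|identifier| results[ identifier ] = {} }
        results
    end
end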

Let's start with host nodes, and let's say that if they are reachable via an ICMP ping, they are considered operational. I'll also demonstrate a few different ways to achieve the same result -- there are lots-o-ways to do things, each with advantages and disadvantages. Arborist just offers up the options; it's up to you to decide which way to go with them.

Let's start with the Simplest Thing That Could Possibly Work™.

Arborist::Monitor.new do
    description 'ping check'
    key :ping
    every 5
    match type: 'host'
    use :addresses

    exec do |nodes|
        results = {}

        # Ping each node's first address once.  An empty hash marks
        # the node as fine; an :error key marks it as in trouble.
        nodes.each_pair do |node, attributes|
            `ping -c 1 -W 1 #{attributes['addresses'].first}`
            if $?.success?
                results[ node ] = {}
            else
                results[ node ] = { error: 'Unable to reach host' }
            end
        end

        results
    end
end

As you might rightfully assume, this is not what I'd recommend in production. Ping takes different flags for timeouts on different operating systems, for one thing. However, it illustrates some important points -- let's go over them.

  1. description is a human-friendly label for the monitor, and key identifies it, so that problems reported by different monitors are tracked separately on each node.
  2. every is how often the monitor runs, in seconds.
  3. match describes which nodes the monitor applies to -- here, every node of type 'host'.
  4. use asks the Manager to send the named node attributes (here, each host's addresses) along with the matched nodes.
  5. exec does the actual work: it receives a hash of node identifiers mapped to the requested attributes, and returns a hash of node identifiers mapped to attribute updates. An empty hash means the node checked out fine; an error key marks it as having a problem.
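
Concretely, the exchange looks something like this (identifiers borrowed from the example tree; the addresses and values are made up):

# What the exec block receives: matched node identifiers, mapped to
# the attributes requested via `use`.
nodes = {
    'web01'    => { 'addresses' => ['10.3.0.21'] },
    'vmhost01' => { 'addresses' => ['10.3.0.1'] },
}

# What it should return: identifiers mapped to attribute updates.
# An empty hash means "all is well"; an :error key flags a problem.
results = {
    'web01'    => {},
    'vmhost01' => { error: 'Unable to reach host' },
}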

3rd Party Modules

Arborist has a number of built-in monitors for various things -- anything that requires an external dependency is broken out into a separate gem. Let's improve on the above monitor by installing fping, and using it instead. Use whatever method you want to get the fping binary installed, then install the arborist-fping gem.

gem install arborist-fping

This gem has a small Ruby module that knows how to parse fping output, and it sets the round-trip time (RTT) of each pinged host as a node attribute. With it, you can ping thousands of hosts quickly and efficiently.
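
Conceptually, the module turns fping's output into the same kind of results hash our hand-rolled exec block produced -- something along these lines (values and error text illustrative; you can see real rtt updates in the event stream at the end of this page):

# Illustrative only: parsed fping output as node attribute updates.
{
    'web01'    => { rtt: 0.22 },
    'vmhost01' => { rtt: 0.53 },
    'vm01'     => { error: 'host unreachable' },
}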

After a few more additions and alterations, our monitor now looks like this.

using Arborist::TimeRefinements

require 'arborist/monitor/fping'

Arborist::Monitor 'ping check', :ping do
    every 15.seconds
    match type: 'host'
    exec 'fping', '-e', '-t', '150'
    exec_callbacks( Arborist::Monitor::FPing )
end

If you want to make your own reusable monitoring modules, you can read the source of gems like arborist-fping, and the section of the Cookbook that goes into more detail. Here are the other changes to our monitor:

  1. The description and key are now passed as arguments to the monitor, rather than declared in its block.
  2. The using line enables Arborist's time refinements, so every can be expressed as 15.seconds instead of a bare number of seconds.
  3. exec is handed an external command and its arguments to run, instead of a Ruby block.
  4. exec_callbacks points at the module from the gem, which takes care of feeding the matched nodes to fping and parsing its output into attribute updates.

We also removed the use :addresses declaration -- 3rd party modules can automatically request the data they require from the node. Handy.

Even More Modules!

We also want to perform simple TCP port checks (the socket monitor ships with Arborist itself) and disk availability checks via the arborist-snmp gem, so let's add those in too.
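
As with fping, the SNMP support lives in its own gem, and it assumes the target hosts are running an SNMP agent:

gem install arborist-snmp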

using Arborist::TimeRefinements

require 'arborist/monitor/socket'
require 'arborist/monitor/fping'
require 'arborist/monitor/snmp'

Arborist::Monitor 'ping check', :ping do
    every 15.seconds
    match type: 'host'
    exec 'fping', '-e', '-t', '150'
    exec_callbacks( Arborist::Monitor::FPing )
end

Arborist::Monitor 'port checks for tcp services', :tcp do
    every 15.seconds
    match type: 'service', protocol: 'tcp'
    exec( Arborist::Monitor::Socket::TCP )
end

Arborist::Monitor 'disk usage checks', :disk do
    every 1.minute
    match type: 'resource', category: 'disk'
    exec( Arborist::Monitor::SNMP::Disk )
end
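
These monitors only fire if the tree actually contains matching nodes. As a rough sketch -- the host name and address here are made up, and the Manager page covers node declarations properly -- something like this would satisfy all three:

# Sketch: a host with two TCP services and a disk resource.
Arborist::Host 'web01' do
    address '10.3.0.21'

    service 'ssh'     # service nodes default to TCP, so the :tcp monitor matches
    service 'http'

    resource 'disk'   # a resource node for the :disk monitor to find
end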

Testing a Monitor

When working on a specific monitor, it can be a drag to continually start and stop the Monitor daemon, wait for the timers, and debug while iterating. Fortunately, the Arborist command line has a way to quickly try out new things -- a one-shot mode.

Here's an example monitor that randomly causes ups and downs on host and service nodes each time it is fired, normally once a minute. We'll just throw this in our temp directory at /tmp/chaos.rb.

Arborist::Monitor 'chaos monkey!' do
    description 'An example inline monitor.'
    key :eek_eek
    every 60
    match type: [ 'host', 'service' ]

    exec do |nodes|
        results = {}
        nodes.each_pair do |node, attributes|
            # Flip a coin for each node: 0 leaves it up, 1 marks it
            # down via the :error key.
            val = rand( 2 )
            if val.zero?
                results[ node ] = { roll: "I rolled a #{val}." }
            else
                results[ node ] = { error: "I rolled a #{val}.  OF DOOM." }
            end
        end
        results
    end
end

The output of this can be tested against real nodes in the Manager with the run_once command. The monitor's timing is ignored completely, and node state is left unchanged -- it only shows you what it would have done.

arborist run_once /tmp/chaos.rb
An example inline monitor.
vm01
{:error=>"I rolled a 1.  OF DOOM."}

vm01-memcached
{:roll=>"I rolled a 0."}

vmhost01
{:roll=>"I rolled a 0."}

vmhost01-ssh
{:roll=>"I rolled a 0."}

web01
{:roll=>"I rolled a 0."}

web01-http
{:error=>"I rolled a 1.  OF DOOM."}

web01-ssh
{:error=>"I rolled a 1.  OF DOOM."}

Multiple Monitor daemons

If you have many monitors doing a lot of heavy lifting, you can optionally scale them out across any number of machines. How their workload is divided up is up to you. The only configuration change needed is to the tree_api_url key -- instead of the default local socket, give it a listening IP address and port:

---
arborist:
  tree_api_url: tcp://10.3.0.75:5011

The Manager will use this to bind, and the Monitors will use it to connect. More details on horizontally scaling things out can be found in the Cookbook.
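
One simple way to split the work is to give each machine its own directory of monitor files and point its daemon at that directory (the directory names here are hypothetical):

arborist -c config.yml start monitors monitors/network
arborist -c config.yml start monitors monitors/snmp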

Watching the Monitors

Once the monitor daemons are running, the quickest way to query the Manager for its status is with the client, described on the Manager page. Again, see the Reference for a deeper dive into the specifics. Here's an example that finds the currently up nodes:

search( status: 'up' ).keys
=> ["_", "vmhost01", "vmhost01-ssh", "vmhost01-disk", "vm01", "vm01-memcached", "web01", "web01-ssh", "web01-http"]

You can also use the watch Arborist command to see the firehose stream of events happening within the Manager.

arborist -c config.yml watch
Subscription "76b6ed6c-fa11-41b2-9c71-4a63e2c8f59c"
Subscribing to manager heartbeat events.
Watching for events on manager at ipc:///tmp/arborist_events.sock
[2016-10-08 14:59:17 PDT] vmhost01 updated: host is up
[2016-10-08 14:59:17 PDT] vm01 updated: host is up
[2016-10-08 14:59:17 PDT] vm01 delta, changes: rtt: 0.53 -> 0.57
[2016-10-08 14:59:17 PDT] web01 updated: host is up
[2016-10-08 14:59:17 PDT] web01 delta, changes: rtt: 0.29 -> 0.22
[2016-10-08 14:59:17 PDT] vmhost01-ssh updated: service is up
[2016-10-08 14:59:17 PDT] vmhost01-ssh delta, changes: time: 2016-10-08T14:59:12-07:00 -> 2016-10-08T14:59:17-07:00, duration: 5.8226e-05 -> 4.4903e-05
[2016-10-08 14:59:17 PDT] web01-ssh updated: service is up
[2016-10-08 14:59:17 PDT] web01-ssh delta, changes: time: 2016-10-08T14:59:12-07:00 -> 2016-10-08T14:59:17-07:00, duration: 5.8226e-05 -> 4.4903e-05
[2016-10-08 14:59:17 PDT] web01-http updated: service is up
[2016-10-08 14:59:17 PDT] web01-http delta, changes: time: 2016-10-08T14:59:12-07:00 -> 2016-10-08T14:59:17-07:00, duration: 0.000192603 -> 0.000169996
[2016-10-08 14:59:17 PDT] vm01-memcached updated: service is up
[2016-10-08 14:59:17 PDT] vm01-memcached delta, changes: time: 2016-10-08T14:59:12-07:00 -> 2016-10-08T14:59:17-07:00, duration: 0.000192603 -> 0.000408556
[2016-10-08 14:59:17 PDT] vmhost01-disk updated: resource is up

Okay, so where are we now? We have a running Manager, and a Monitor daemon that is checking the state of reality and reporting it back to the Manager. That sounds great for the world of the robots, but where does this leave us humans? The methods above are nice for testing and debugging, but for practical use, you'll want automatic actions to occur when events happen. That's where Observers come in. Let's move on!