The Monitor Daemon
So, you've got an Arborist Manager running, and that's great. If you're astute, you may have noticed that it doesn't actually do very much by itself. To make things interesting, let's throw the Monitor daemon into the mix.
It knows how to speak the Manager's API, and it serves the following roles:
- Keeps a schedule for when to execute tasks against the Manager's nodes
- When the time comes, asks the Manager for lists of relevant nodes and any appropriate attributes
- Performs the monitoring heavy lifting
- Sets any gathered data on the checked nodes
Once new attribute data is set on a node, the Manager starts firing events all over the place -- we'll talk more about that later.
For now, just like before, we'll create a new directory to store the
configuration files, and use Arborist's built-in file loader source.
Assuming you're still in /usr/local/arborist ...
mkdir monitors
touch monitors/example.rb
Everything is launched in the same fashion as the Manager -- I won't
bother going over all of that again, because it's late, I'm getting
tired, and really, you can just go back a page and re-read that part if
you need to. Laziness is a virtue. Recall that you start Arborist
components like arborist start <component> <source>.
arborist -c config.yml -l info start monitors monitors
[2016-08-30 00:03:15.641867 7541/main] info {} -- Loading config from # with defaults for sections: [:logging, :arborist].
[2016-08-30 00:03:15.700270 7541/main] info {Arborist::Monitor} -- Loading monitor file monitors/example.rb...
[2016-08-30 00:03:15.700512 7541/main] info {Arborist::Client:0x1ea66f8} -- Connecting to the tree socket "ipc:///tmp/arborist_tree.sock"
[2016-08-30 00:03:15.700577 7541/main] info {} -- Using ZeroMQ 4.0.3/CZMQ 2.0.1
Monitors 101
Let's write our first monitor. There are a couple of questions you should ask yourself before starting.
- What node types does this monitor apply to?
- What constitutes an up state vs a down state?
The Manager is asked for nodes that match certain criteria. You can exclude nodes using the same mechanism, use tags to organize, or search on any other attributes you please.
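For instance, here's a quick sketch of how criteria might be combined. Note that the exclude and tag matching shown here are illustrative assumptions based on the description above, not calls taken from this chapter's worked examples:
Arborist::Monitor.new do
    description 'an illustrative matching example'
    key :example
    every 30
    match type: 'service', protocol: 'tcp'  # multiple criteria narrow the match
    exclude tag: 'testing'                  # assumed syntax: skip nodes tagged 'testing'
    # exec ...
end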
Let's start with host nodes, and let's say that if they are reachable via an ICMP ping, they are considered operational. I'll also demonstrate a few different ways to achieve the same result -- there are lots-o-ways to do things, each with advantages and disadvantages. Arborist just offers up the options; it's up to you to decide which way to go with them.
Let's start with the Simplest Thing That Could Possibly Work™.
Arborist::Monitor.new do
    description 'ping check'
    key :ping
    every 5
    match type: 'host'
    use :addresses

    exec do |nodes|
        results = {}
        nodes.each_pair do |node, attributes|
            `ping -c 1 -W 1 #{attributes['addresses'].first}`

            if $?.success?
                results[ node ] = {}
            else
                results[ node ] = { error: 'Unable to reach host' }
            end
        end
        results
    end
end
As you might rightfully assume, this is not what I'd recommend in production. Ping takes different flags for timeouts on different operating systems, for one thing. However, it illustrates some important points -- let's go over them.
- Line 2: A descriptive label that shows up in logs.
- Line 3: This key is the name of a group the monitor belongs to, so errors can be attributed back to it.
- Line 4: This is the frequency to run the monitor, in seconds.
- Line 5: The criteria for nodes to match. This can be any attribute on the node -- multiple attributes are ANDed together.
- Line 6: Request these node attributes from the Manager when performing the monitoring check. Addresses are the host's IPs, returned as an array. You'll probably be using that one a lot.
- Line 8: exec takes a few different forms, depending on what you're after. If called with a ruby block, you can make quick monitors inline -- the block is passed the matching nodes as a hash, keyed by node identifier. The block should return the same structure, but with modified attributes (a rough sketch of both follows this list).
- Line 14: Successful checks simply don't have an error key set. Any other attributes of your choosing may be set on the node, and they'll be carried across to all other Arborist components.
- Line 16: Unsuccessful checks may also set whatever they please, but they should have an error key, with descriptive text as the value.
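To make that concrete, here is a rough sketch of both structures. The attribute keys shown are illustrative -- they depend on what you declare with use and how your nodes are defined; only the overall shape (a hash keyed by node identifier going in, the same shape with result attributes coming out) comes from the description above.
# What the exec block receives (illustrative; keys depend on your `use` declaration):
nodes = {
    'web01' => { 'addresses' => ['10.3.0.10'] },
    'vm01'  => { 'addresses' => ['10.3.0.21'] }
}

# What it should return -- the same node identifiers, with any attributes you
# want recorded; an error key marks the check as failed:
results = {
    'web01' => {},                                # up: nothing to report
    'vm01'  => { error: 'Unable to reach host' }  # down: with a descriptive reason
}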
3rd Party Modules
Arborist has a number of built-in monitors for various things -- anything that requires an external dependency is broken out into a separate gem. Let's improve on the above monitor by installing fping, and using it instead. Use whatever method you want to get the fping binary installed, then install the arborist-fping gem.
gem install arborist-fping
This gem has a small ruby module that knows how to parse fping output, setting the round-trip time (RTT) of each pinged host as a node attribute. With it, you can quickly ping-strobe thousands of hosts very efficiently.
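As a rough illustration, the parser turns fping's per-host output into the same results-hash shape we built by hand above (the rtt attribute shows up later in the watch output; the exact error text and hostnames here are assumptions -- check the gem's source for the real keys):
# Illustrative only -- actual keys come from the arborist-fping gem:
#   "web01.example.com is alive (0.29 ms)"  =>  { 'web01' => { rtt: 0.29 } }
#   "vm01.example.com is unreachable"       =>  { 'vm01'  => { error: 'unreachable' } }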
After some other additions and alterations, our monitor now looks like this.
using Arborist::TimeRefinements

require 'arborist/monitor/fping'

Arborist::Monitor 'ping check', :ping do
    every 15.seconds
    match type: 'host'
    exec 'fping', '-e', '-t', '150'
    exec_callbacks( Arborist::Monitor::FPing )
end
If you want to make your own reusable monitoring modules, you can read the source of gems like arborist-fping, and the section of the Cookbook that goes into more detail. Here are the other changes to our monitor:
- Line 1: Include a helper mixin that adds time methods to integers. It makes the DSL more readable when you can express intervals as 1.minute or 3.hours.
- Line 3: Require the ruby module for the fping parser.
- Line 5: The description and key can be arguments to the constructor for convenience.
- Line 8: A different form of exec that calls an external program. There are additional helpers to manage sending node data to external scripts, either via arguments or via stdin, and subsequently parsing the results. Using external programs instead of doing everything inline provides better parallelization, and lets you use languages outside of ruby.
- Line 9: exec_callbacks is a shortcut to a module that can parse the results of exec. Without it, you'd instead use the inline blocks exec_input, exec_arguments, and handle_results. See the Reference for detailed information on these calls.
We also removed the use :addresses declaration -- 3rd party modules can automatically request the data they require from the node. Handy.
Even More Modules!
We also want to perform simple TCP port checks, as well as disk availability checks via the arborist-snmp gem, so let's add those in too.
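As before, the external dependency is packaged as its own gem, so install it first:
gem install arborist-snmp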
using Arborist::TimeRefinements

require 'arborist/monitor/socket'
require 'arborist/monitor/fping'
require 'arborist/monitor/snmp'

Arborist::Monitor 'ping check', :ping do
    every 15.seconds
    match type: 'host'
    exec 'fping', '-e', '-t', '150'
    exec_callbacks( Arborist::Monitor::FPing )
end

Arborist::Monitor 'port checks for tcp services', :tcp do
    every 15.seconds
    match type: 'service', protocol: 'tcp'
    exec( Arborist::Monitor::Socket::TCP )
end

Arborist::Monitor 'disk usage checks', :disk do
    every 1.minute
    match type: 'resource', category: 'disk'
    exec( Arborist::Monitor::SNMP::Disk )
end
Testing a Monitor
When working on a specific monitor, it can be a drag to continually
start and stop the Monitor daemon, wait for the timers, and debug
while iterating. Fortunately, the Arborist command line has a way to
test quickly when trying out new things -- a one-shot mode.
Here's an example monitor that randomly causes ups and downs on host
and service
nodes each time it is fired, normally once a minute.
We'll just throw this in our temp directory at /tmp/chaos.rb.
Arborist::Monitor 'chaos monkey!' do
    description 'An example inline monitor.'
    key :eek_eek
    every 60
    match type: [ 'host', 'service' ]

    exec do |nodes|
        results = {}
        nodes.each_pair do |node, attributes|
            val = rand( 2 )
            if val.zero?
                results[ node ] = { roll: "I rolled a #{val}." }
            else
                results[ node ] = { error: "I rolled a #{val}. OF DOOM." }
            end
        end
        results
    end
end
The output of this can be tested against real nodes in the Manager
with the run_once
command. The timing of the monitor is completely
ignored, and the node state is left unchanged. It only shows you what
it would have done.
arborist run_once /tmp/chaos.rb
An example inline monitor.
vm01
{:error=>"I rolled a 1. OF DOOM."}
vm01-memcached
{:roll=>"I rolled a 0."}
vmhost01
{:roll=>"I rolled a 0."}
vmhost01-ssh
{:roll=>"I rolled a 0."}
web01
{:roll=>"I rolled a 0."}
web01-http
{:error=>"I rolled a 1. OF DOOM."}
web01-ssh
{:error=>"I rolled a 1. OF DOOM."}
Multiple Monitor daemons
If you have many monitors doing a lot of heavy lifting, you can
optionally scale them out across any number of machines. How their
workload is divided up is up to you. The only configuration change
needed is to the tree_api_url
key -- instead of the local socket
default, provide it with a listening IP and port:
---
arborist:
  tree_api_url: tcp://10.3.0.75:5011
The Manager will use this to bind, and the Monitors will use it to connect. More details on horizontally scaling things out can be found in the Cookbook.
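For example, one (hypothetical) way to divide the workload is to split the monitor files into per-purpose directories and start a daemon on each machine against its own subset, using the same start command as before:
# On monitoring host A -- just the ping monitors:
arborist -c config.yml start monitors monitors/ping

# On monitoring host B -- the SNMP checks:
arborist -c config.yml start monitors monitors/snmp
Both daemons read tree_api_url from their config and connect to the Manager at tcp://10.3.0.75:5011.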
Watching the Monitors
Once the monitor daemons are running, the quickest way to query the Manager for its status is the client, described on the Manager page. Again, see the Reference for a deeper dive into the specifics. Here's an example to see currently up nodes:
search( status: 'up' ).keys
=> ["_", "vmhost01", "vmhost01-ssh", "vmhost01-disk", "vm01", "vm01-memcached", "web01", "web01-ssh", "web01-http"]
You can also use the Arborist watch command to see the fire-hose stream of events happening within the Manager.
arborist -c config.yml watch
Subscription "76b6ed6c-fa11-41b2-9c71-4a63e2c8f59c"
Subscribing to manager heartbeat events.
Watching for events on manager at ipc:///tmp/arborist_events.sock
[2016-10-08 14:59:17 PDT] vmhost01 updated: host is up
[2016-10-08 14:59:17 PDT] vm01 updated: host is up
[2016-10-08 14:59:17 PDT] vm01 delta, changes: rtt: 0.53 -> 0.57
[2016-10-08 14:59:17 PDT] web01 updated: host is up
[2016-10-08 14:59:17 PDT] web01 delta, changes: rtt: 0.29 -> 0.22
[2016-10-08 14:59:17 PDT] vmhost01-ssh updated: service is up
[2016-10-08 14:59:17 PDT] vmhost01-ssh delta, changes: time: 2016-10-08T14:59:12-07:00 -> 2016-10-08T14:59:17-07:00, duration: 5.8226e-05 -> 4.4903e-05
[2016-10-08 14:59:17 PDT] web01-ssh updated: service is up
[2016-10-08 14:59:17 PDT] web01-ssh delta, changes: time: 2016-10-08T14:59:12-07:00 -> 2016-10-08T14:59:17-07:00, duration: 5.8226e-05 -> 4.4903e-05
[2016-10-08 14:59:17 PDT] web01-http updated: service is up
[2016-10-08 14:59:17 PDT] web01-http delta, changes: time: 2016-10-08T14:59:12-07:00 -> 2016-10-08T14:59:17-07:00, duration: 0.000192603 -> 0.000169996
[2016-10-08 14:59:17 PDT] vm01-memcached updated: service is up
[2016-10-08 14:59:17 PDT] vm01-memcached delta, changes: time: 2016-10-08T14:59:12-07:00 -> 2016-10-08T14:59:17-07:00, duration: 0.000192603 -> 0.000408556
[2016-10-08 14:59:17 PDT] vmhost01-disk updated: resource is up
Okay, so where are we now? We have a running Manager, and a Monitor daemon that is checking the state of reality and reporting it back to the Manager. That sounds great for the world of the robots, but where does this leave us humans? The above methods are nice for testing and debugging, but for practical use, you'll want automatic actions when events happen. That's where Observers come in. Let's move on!