The Observer Daemon
If you're still with me here, it's because you have the other components up and running, and you want to be automatically notified of the goings-on of the Arborist node tree. This is where the Observer daemon comes into the mix -- its job is to bundle up your selections and frequency timers, and subscribe to events as they happen at the Manager. When a matching event takes place, the Manager publishes it to all interested observing parties.
It has a few specific roles:
- Take search criteria and create a Manager subscription
- Handle event counts, rollups, time windowing -- signal-to-noise tweaks
- Do something with the event (email, sms, relay to whatever the fashionable internet service is right now)
Let's set up a playground file for examples again -- this all should be
familiar by now! Assuming you're still in /usr/local/arborist...
mkdir observers
touch observers/example.rb
Then start it up with arborist start <component> <source>.
arborist -c config.yml -l info start observers observers
[2016-10-06 17:05:15.563959 12721/main] info {} -- Loading config from # with defaults for sections: [:logging, :arborist].
[2016-10-06 17:05:15.621350 12721/main] info {Arborist::Observer} -- Loading observer file observers/example.rb...
[2016-10-06 17:05:15.621572 12721/main] info {Arborist::Client:0x2fc2d88} -- Connecting to the event socket "ipc:///tmp/arborist_events.sock"
[2016-10-06 17:05:15.621622 12721/main] info {} -- Using ZeroMQ 4.0.3/CZMQ 2.0.1
Like the other daemons, restart the Observer as we make changes to see them take effect. We'll start out by demonstrating a simple case.
Observers 101
The Arborist Manager has events bubbling around all over the place.
When nodes have their attributes updated and their states change,
they broadcast to their children, propagate to their parents, and
publish to their subscribers. Because Arborist stores state as a
tree, if you wanted to listen to all events, you can simply subscribe
to the root node, and catch everything that is propagating upwards.
As it so happens, subscribing to the root node is the default behavior.
It's as if someone thought about the normal, optimal case. Weird, huh?
Here's how to subscribe to any node that transitions to a down
state.
Arborist::Observer "Notify on downed nodes" do
  subscribe to: 'node.down'
  action do |node, event|
    # do something amazing
  end
end
And that's it. The node.down event is only emitted when a node's
state changes, not every time it is checked. The action block is
executed only once per state transition to down.
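To make the "once per transition" behavior concrete, here is a minimal sketch in plain Ruby. ToyNode is a made-up class for illustration only, not Arborist's node class or its real state machine; it simply shows why repeated failing checks don't produce repeated node.down events.

```ruby
# A toy node that reports an event only when its status actually changes.
# Hypothetical illustration -- not Arborist's internal implementation.
class ToyNode
  attr_reader :status

  def initialize
    @status = 'unknown'
  end

  # Record a check result. Returns an event name such as "node.down" on a
  # state change, or nil when the status is the same as before.
  def update( new_status )
    return nil if new_status == @status
    @status = new_status
    "node.#{new_status}"
  end
end

node   = ToyNode.new
events = %w[up down down down up].map {|status| node.update(status) }
p events.compact
```

Five checks come in, but only three of them change the status, so only three events come out: one node.up, one node.down, and one more node.up.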
I should pause for a moment to discuss some Arborist internals. This won't be a departure from Observers, I promise. In fact, having a little knowledge bomb dropped in your ear will probably help you figure out how to properly observe what you're after. Gather round everyone, let's go over...
Node Events
Yes, node events! What fun!
When a node's state changes, various events are generated within the Manager and ferried about. Here's a list of those events -- note that the primary events match the possible states of any one particular node. This is not a coincidence. When a node transitions from state X to state N, a single N event is propagated upwards through the node tree, ripe for capture by an eager Observer.
- node.acked (a downed node that has been acknowledged by a human, down --> acked)
- node.disabled (a node NOT down that has been acknowledged -- a pre-emptive acknowledgement, up --> disabled)
- node.down (a node that failed its Monitor test, up --> down)
- node.quieted (a node whose parent or dependency went down)
- node.unknown (a node that hasn't had a Monitor check yet, or needs retesting after dependencies/parents came back up)
- node.up (a node that was tested and checks out as good)
- node.warn (a node whose monitor indicates a potential upcoming down state)
Those events are issued once and only once per state transition.
In addition to the state transition events, there are two other informational events that are fired on any and every change to a node -- not just state changes.
- node.delta (this event captures a diff of the node before and after an attribute change)
- node.update (this event simply broadcasts the node as-is, after an update - even if no change is made)
Using these informational events, you can trigger actions on all events (not just transitions), or extremely specific changes to any arbitrary attribute -- this is great for relaying metrics data to external time-series databases, if you're building an Arborist user interface, or just otherwise want to drink from the FIRE HOSE.
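As a rough picture of what a delta carries, here is a small plain-Ruby sketch. It assumes a delta pairs the old and new value of each changed attribute, which is what the node.delta subscription examples in this document suggest (status: ['down', 'up']). The attribute_delta helper and the sample attributes are hypothetical, not part of Arborist.

```ruby
# Compare two attribute hashes and return only the keys that changed,
# each mapped to an [old, new] pair. Hypothetical helper for illustration.
def attribute_delta( before, after )
  ( before.keys | after.keys ).each_with_object( {} ) do |key, delta|
    delta[ key ] = [ before[key], after[key] ] unless before[key] == after[key]
  end
end

# Made-up node attributes before and after a successful Monitor check.
before = { 'status' => 'down', 'error' => 'timeout', 'addresses' => ['10.3.0.9'] }
after  = { 'status' => 'up',   'error' => nil,       'addresses' => ['10.3.0.9'] }

p attribute_delta( before, after )
```

The unchanged addresses attribute is left out of the result, while status and error each show up with their before and after values. A node.update, by contrast, would hand you the whole node regardless of what changed.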
Status Transitions
Status transitions are internally managed via a state machine. Here are
the possible status transitions -- useful to have if you want to observe
a specific transition from one state to another, using the node.delta
event.
Observer Pragmas
Okay, sidebar over. Now armed with that info, we can continue. An
Observer has three basic pragmas. subscribe catches the event, while
action and summarize actually do something with it.
subscribe
Subscriptions require the event that you're interested in seeing. If
you want to match on more than one event, add multiple subscribe
lines. They'll all be tested as events flow in, and any one of them
that matches is sufficient to trigger an action. A subscription accepts
the following arguments:
- to: Required. The event type to match. See the list above.
- on: Optional. The identifier of what node to subscribe to. With this, your observer can hook into a specific subset of the tree, omitting everything else. By default, it will subscribe to the root node, so it will see everything.
- where: Optional. Attribute selection criteria for a matched event.
- exclude: Optional. Negative criteria matches, accepting the same syntax as where.
# Subscribe to events where nodes transition into a down state
subscribe to: 'node.down'
# Subscribe to every event where a node is detected as down
subscribe to: 'node.update', where: { status: 'down' }
# Subscribe to events that are generated by humans
subscribe to: 'node.acked' # acknowledging a downed node
subscribe to: 'node.disabled' # pre-emptive acknowledgement (maintenance windows)
subscribe to: 'node.delta', where: { delta: {status: [ 'disabled', 'unknown' ]}} # re-enabling a disabled node
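If it helps to picture how where and exclude narrow things down, here is a simplified sketch in plain Ruby. The real matching happens inside Arborist and may well behave differently; criteria_match? and interesting? are hypothetical names, and the rule that a scalar criterion matches an array attribute when the array includes it is an assumption made for this example.

```ruby
# Does every key in +criteria+ match the corresponding value in +data+?
# Nested hashes are compared recursively. Assumption: a scalar criterion
# matches an array attribute (such as tags) if the array includes it.
def criteria_match?( criteria, data )
  criteria.all? do |key, wanted|
    actual = data[ key.to_s ]
    if wanted.is_a?( Hash )
      actual.is_a?( Hash ) && criteria_match?( wanted, actual )
    else
      actual == wanted ||
        ( actual.is_a?(Array) && !wanted.is_a?(Array) && actual.include?(wanted) )
    end
  end
end

# An event is interesting if it satisfies +where+ and does not satisfy
# a non-empty +exclude+.
def interesting?( data, where: {}, exclude: {} )
  return false unless criteria_match?( where, data )
  return false if !exclude.empty? && criteria_match?( exclude, data )
  true
end

# Made-up event data for illustration.
event_data = {
  'status' => 'down',
  'tags'   => [ 'vm', 'staging' ],
  'delta'  => { 'status' => ['up', 'down'] }
}

p interesting?( event_data, where: { status: 'down' } )
p interesting?( event_data, where: { status: 'down' }, exclude: { tags: 'staging' } )
p interesting?( event_data, where: { delta: { status: ['up', 'down'] } } )
```

The first and third checks pass. The second is rejected, because the exclude criteria match the staging tag even though the where criteria are satisfied.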
action
An action block is executed on an incoming event. You may have any
number of action blocks within an Observer, each with their own
settings.
- after: Execute the block when the number of matching events reaches this number. Defaults to 1.
- during: Optional. A time period description - this action is only executed if the current time falls within it. Time period is in Schedulability syntax.
- within: Optional. Time in seconds. Used alongside the after option, you can easily set up actions that only trigger after X events within N seconds.
When executed, the block is sent the node that generated the event, along with a hash of events, keyed on the time each event was generated.
# Perform the action as soon as an event comes in
action { ... }
# Perform the action if 3 events happen within a minute
action( after: 3, within: 60 ) { ... }
# Only trigger the action during core work hours
action( during: 'wd {Mon-Fri} hr {9am-4pm}' ) { ... }
If what goes inside of an action block has seemed nebulous so far... well, you're right. That's because you can do anything that is possible within the bounds of Ruby, which practically means anything at all. Send email, sms, pushover, influxdb, splunk, ascii log files, relay to AMQP brokers, break out shared behavior to modules... really, the sky is the limit. Hack something together and use your imagination.
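The after and within options described above boil down to counting events inside a sliding time window. Here is a minimal plain-Ruby sketch of that idea. ThresholdCounter is a hypothetical class, not Arborist's implementation, and clearing the history once the threshold fires is an assumption of this sketch.

```ruby
# Tracks event times and reports when +after+ events have arrived,
# optionally within a sliding window of +within+ seconds.
# Hypothetical illustration -- not how Arborist implements actions.
class ThresholdCounter
  def initialize( after: 1, within: nil )
    @after  = after
    @within = within
    @times  = []
  end

  # Record an event at +time+. Returns true when the threshold is met,
  # and clears the history so the next run starts from zero (assumption).
  def record( time = Time.now )
    @times << time
    @times.reject! {|t| time - t > @within } if @within
    return false if @times.length < @after
    @times.clear
    true
  end
end

counter = ThresholdCounter.new( after: 3, within: 60 )
start   = Time.at( 0 )

# Events at 0s, 10s, 90s, 100s and 110s: only the last three fall
# inside the same 60 second window.
p [ 0, 10, 90, 100, 110 ].map {|offset| counter.record(start + offset) }
# => [false, false, false, false, true]
```

The two early events age out of the window before a third one arrives, so nothing fires until the burst at 90, 100 and 110 seconds.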
summarize
A summarize block is executed not as an event comes in, but rather
when its options are satisfied. You may have any number of summarize
blocks within an Observer, each with their own settings.
- count: Execute the block when the number of matching events reaches this number.
- during: Optional. A time period description - this action is only executed if the current time falls within it. Time period is in Schedulability syntax.
- every: Execute the block once this amount of time has passed, in seconds.
Either the count or the every option is required. If you use both,
whichever option is satisfied first wins.
When executed, the block is sent a hash of events, keyed on the time each event was generated.
# Trigger the block after 20 events
summarize( count: 20 ) { ... }
# Trigger the block once an hour, but only after work hours
summarize( every: 3600, during: 'hr {6pm-7am}' ) { ... }
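Here is one way to picture the "whichever is satisfied first wins" rule, as a plain-Ruby sketch. SummaryBuffer and its method names are hypothetical and not part of Arborist; the sketch only collects events keyed by time and hands them over once either count or every is met.

```ruby
# Collects events keyed by the time they were generated, and releases
# them once +count+ events have arrived or +every+ seconds have passed.
# Hypothetical illustration -- not how Arborist implements summarize.
class SummaryBuffer
  def initialize( count: nil, every: nil, started_at: Time.now )
    raise ArgumentError, "need count: or every:" unless count || every
    @count      = count
    @every      = every
    @events     = {}
    @last_flush = started_at
  end

  # Store an event under the time it was generated.
  def add( time, event )
    @events[ time ] = event
  end

  # Return the collected events (and start over) if either option is
  # satisfied and there is something to report. Otherwise return nil.
  def flush_if_due( now = Time.now )
    due_by_count = @count && @events.length >= @count
    due_by_time  = @every && ( now - @last_flush ) >= @every
    return nil unless due_by_count || due_by_time

    @last_flush = now
    return nil if @events.empty?

    batch, @events = @events, {}
    batch
  end
end

buffer = SummaryBuffer.new( count: 2, every: 3600, started_at: Time.at(0) )
buffer.add( Time.at(5), { 'identifier' => 'web01' } )
p buffer.flush_if_due( Time.at(10) )          # neither option satisfied yet
buffer.add( Time.at(20), { 'identifier' => 'db01' } )
p buffer.flush_if_due( Time.at(30) ).length   # count reached before the hour is up
```

An empty buffer never produces a summary in this sketch, even when the timer expires, so a quiet hour stays quiet.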
Putting it All Together
Whoof, finally! This example is clearly large, and has plenty of opportunities to reduce code duplication, use variables, etc. I'm leaving it expanded out, so the core ideas of Observers aren't abstracted away. Here we go!
require 'mail'

using Arborist::TimeRefinements

NOT_WORK_HOURS = 'wd {Mon-Fri} hr {5pm-8am}, wd { Sat-Sun }'

Arborist::Observer "Notify on down" do
  subscribe to: 'node.down'

  action do |event|
    data = event[ 'data' ]
    mailer = Mail::Message.new
    mailer.delivery_method :sendmail
    mailer.from = Mail::Address.new( 'alerting@example.com' )
    mailer.to = Mail::Address.new( 'recipients@example.com' )
    mailer.subject = "Node down: %s" % [ event['identifier'] ]
    mailer.body = "%s (%s) is down: %s" % [
      event['identifier'],
      data['addresses'].first,
      data['error']
    ]
    mailer.deliver
  end

  action( during: NOT_WORK_HOURS ) do |event|
    send_sms_to_night_crew( event )
  end

  summarize( every: 1.hour ) do |events|
    mailer = Mail::Message.new
    mailer.delivery_method :sendmail
    mailer.from = Mail::Address.new( 'alerting@example.com' )
    mailer.to = Mail::Address.new( 'recipients@example.com' )
    mailer.subject = "Down events for the past hour"
    body = ""
    events.sort_by{|k, v| k }.each do |time, event|
      body << " - [%s] %s\n" % [
        time, event['identifier']
      ]
    end
    mailer.body = body
    mailer.deliver
  end
end


Arborist::Observer "Notify on recovery" do
  subscribe to: 'node.delta',
    where: { delta: {status: ['down', 'up']} }
  subscribe to: 'node.delta',
    where: { delta: {status: ['acked', 'up']} }

  action do |event|
    mailer = Mail::Message.new
    mailer.delivery_method :sendmail
    mailer.from = Mail::Address.new( 'alerting@example.com' )
    mailer.to = Mail::Address.new( 'recipients@example.com' )
    mailer.subject = "Node recovered: %s" % [ event['identifier'] ]
    mailer.body = "%s has recovered" % [ event['identifier'] ]
    mailer.deliver
  end
end


Arborist::Observer "Notify on ack/disabled/enabled" do
  subscribe to: 'node.acked'
  subscribe to: 'node.disabled'
  subscribe to: 'node.delta',
    where: { delta: {status: ['disabled', 'unknown']} }

  action do |event|
    data = event[ 'data' ]
    verb = data[ 'status' ]
    verb = 're-enabled' if event['type'] == 'node.delta'

    mailer = Mail::Message.new
    mailer.delivery_method :sendmail
    mailer.from = Mail::Address.new( 'alerting@example.com' )
    mailer.to = Mail::Address.new( 'recipients@example.com' )
    mailer.subject = "Node %s %s" % [ event['identifier'], verb ]
    if verb == 'acked' or verb == 'disabled'
      mailer.body = "%s says: \"%s\"" % [
        data['ack']['sender'],
        data['ack']['message']
      ]
    end
    mailer.deliver
  end
end


Arborist::Observer "Relay every event elsewhere" do
  subscribe to: 'node.update'

  action do |event|
    send_elsewhere( event['data'] )
  end
end
First observer:
- Line 3 (using Arborist::TimeRefinements): Activate a helper refinement that adds time methods to integers. It makes the DSL more readable when you can express intervals as 1.minute or 3.hours.
- Line 5 (NOT_WORK_HOURS): Schedulability syntax for defining a reusable time period.
- Line 8 (subscribe to: 'node.down'): This observer will watch any state change that puts a node into the down status.
- Line 10: First action -- send email to an interested party.
- Line 25: Second action: Also send an SMS message, but only if it's after regular working hours.
- Line 29: Third action (the summarize block): Send a brief summary message with all down events within an hour's timeframe. If there are no events in the timeframe, this block isn't executed.
Second observer:
- Lines 48 and 50 (the two node.delta subscriptions): Subscribe to specific transitions to the up status. Why not just subscribe to node.up? Because I specifically don't want to be alerted on transitions from unknown to up.
Third observer:
- Line 68 (the node.delta subscription): Looking back at the state chart, when a disabled node is manually re-enabled, it flips back to unknown until it is retested. This subscription catches that.
- Line 81 (the check for acked or disabled): Disabling or acknowledging a node carries additional metadata -- who acked it, and why? Include that in the notification!
Fourth observer:
- Line 93 (subscribe to: 'node.update'): Catch every node update as it happens.
- Line 96 (send_elsewhere): Hand wavey -- send to a websocket, send to AMQP, send to a time-series database... integrate, integrate, integrate!
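Since send_elsewhere is left to the imagination, here is one possible shape for it, as a sketch only. This version writes each event's data as a line of JSON to any IO-like object (a log file, a socket, a pipe). The io: and now: parameters and the received_at field are inventions for this example, not anything Arborist provides.

```ruby
require 'json'
require 'time'

# Hypothetical stand-in for the send_elsewhere helper used in the example:
# write the event data as one line of JSON to any IO-like object.
# The received_at field and both keyword arguments are made up here.
def send_elsewhere( data, io: $stdout, now: Time.now )
  record = { 'received_at' => now.utc.iso8601 }.merge( data )
  io.puts JSON.generate( record )
end

# Made-up event data for illustration.
send_elsewhere( { 'identifier' => 'web01', 'status' => 'down' } )
```

Newline-delimited JSON is an easy format to tail, ship, or feed to something else later, which makes it a reasonable first stop before wiring up a real destination.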
If you want to get more details within an action than what the
event provides, you can always get a reference to the Manager
itself, and perform any queries you like against it using a
Client object. One is always in scope via the client variable:
Arborist::Observer "Report on VM status changes" do
  subscribe to: 'node.down', where: { tags: 'vm' }
  subscribe to: 'node.up', where: { tags: 'vm' }
  action do |event|
    vms = client.search tag: 'vm'
    down, up = vms.partition{|id, vm| vm['status'] == 'down' }
    puts "%d virtual machines are down, %d are operational!" % [
      down.length, up.length
    ]
  end
end
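To see what the partition step in that action does without a running Manager, here is the same logic against made-up data. The hash below is only a guess at the general shape of search results (identifier mapped to node attributes), and status_summary is a hypothetical helper name.

```ruby
# Split nodes into "down" and "everything else", then build the report
# line used in the observer above. Hypothetical helper for illustration.
def status_summary( nodes )
  down, up = nodes.partition {|_id, node| node['status'] == 'down' }
  "%d virtual machines are down, %d are operational!" % [ down.length, up.length ]
end

# Made-up search results: identifier => node attributes.
vms = {
  'vm-web01' => { 'status' => 'up' },
  'vm-web02' => { 'status' => 'down' },
  'vm-db01'  => { 'status' => 'acked' }
}

puts status_summary( vms )
```

Note that anything not strictly down lands in the second group, so acked or unknown nodes are counted as operational here. Adjust the block if that isn't the split you want.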
Multiple Observer daemons
Just like Monitors, you can scale out observing across any
number of machines. The only configuration change needed is to the
event_api_url key -- instead of the local socket default, provide it
with a listening IP and port:
---
arborist:
  event_api_url: tcp://10.3.0.75:5012