Docs/simplify ad hoc maintenance

From Zabbix.org

Making maintenance easier for users

System administrators often like to shut down notifications for a while, either when they briefly want to restart something, or when there is a known monitoring artifact swamping them in messages. Sometimes these people only have Zabbix user privileges or hardly know Zabbix. Different mechanisms in Zabbix can be used to achieve this result. They differ in scope, complexity, spontaneity, visibility and the user permissions necessary to implement them. Some are very hackish, too. It is vital to pick a sane mechanism and integrate it with the UI in a simple and useful way.

Host or trigger-based?

What suits our needs better: stopping alerts for a whole host or just for a particular trigger? How can both approaches have visible and reliable results? Apart from regular maintenance, there are other potential candidates:

Mechanisms

Trigger dependency or a particular trigger expression

A trigger depending on a low severity trigger like ...
{Backup:maintenance.time(0)} > 010000 & {Backup:maintenance.time(0)} < 020000
... can be created to mask events between 1 and 2 A.M. Such dependencies would have to be set up on the fly. They can be visible on the dashboard. If they are created on a central fake host (like here!), users can hardly be notified of the "maintenance" starting or ending, but it can be visible as an "issue", depending on the dashboard settings. Since this causes an actual event, you can also keep track of the maintenances under Monitoring/Events. Unfortunately, there is no upstream API option for retrieving dependent triggers; see ZBXNEXT-2554 for a patch. Pretty much the same can be achieved with a trigger like ...
{My_host:actual_item.last(0)} > 10 & {My_host:actual_item_maintenance.last(0)} = 1
... where actual_item_maintenance is a trapper item that you send "1" to. As written, the trigger can only fire while that item's last value is 1.
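
Setting the trapper item could be scripted around zabbix_sender. The following is a minimal sketch, assuming zabbix_sender is installed on the machine running it and that the host name and item key match the trigger above; the helper names are hypothetical and not part of Zabbix.

```python
import subprocess


def sender_command(server, host, key, value):
    """Build the zabbix_sender argument list for one trapper value."""
    return [
        "zabbix_sender",
        "-z", server,      # Zabbix server or proxy to send to
        "-s", host,        # host name as configured in Zabbix
        "-k", key,         # key of the trapper item
        "-o", str(value),  # value to store in the item
    ]


def set_flag(server, host, key, value):
    """Run zabbix_sender and report whether it exited successfully."""
    result = subprocess.run(sender_command(server, host, key, value))
    return result.returncode == 0
```

For instance, set_flag("zabbix.example.com", "My_host", "actual_item_maintenance", 1) would send "1" to the trapper item, with zabbix.example.com standing in for your own server name.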

Ups

  • Fine-grained control
  • Can be manipulated by every admin (a maintenance is not shown to an admin who lacks permissions on any of the hosts it affects)
  • Easy to keep track of in Monitoring/Events

Downs

  • Dependency or trigger expression must be constructed on the fly or in advance
  • Can take a while to take effect depending on the value for CacheUpdateFrequency on the Zabbix server, if constructed on the fly
  • Not considered in IT service SLA calculation
  • Not visible in the way hosts in maintenance usually are (or should be, take ZBXNEXT-669, for instance!)
  • The dependencies can be hard to keep track of
  • Potentially a vast number of fake items and triggers

Disabling things (Actions, triggers, items)

If your permissions allow for it, you can just disable an entity before or after a problem has occurred.

Ups

  • Easy to do

Downs

  • High privileges necessary
  • Potentially unforeseeable consequences
  • Probably doesn't do a thing at all
  • Easy to forget to put it back
  • Hard to keep track of (Hopefully auditing works for what you are trying, see ZBX-2815)
  • Can take a while to take effect depending on the value for CacheUpdateFrequency on the Zabbix server

Operation conditions

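You can stop or avert notifications by defining operation conditions on an action, for example "Event acknowledged = Not Ack", so that an operation only runs while the event is still unacknowledged.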

Ups

  • Multiple triggers can be acknowledged in one go in the upstream UI

Downs

  • Must be set up in advance
  • Purely reactive, due to the nature of acknowledging
  • Acknowledgements don't cut it for "Multiple events" triggers
  • Anybody can acknowledge an event and thus disturb things. It can't be taken back. I also dislike that you can't just comment on an event without acknowledging it.

Possible implementations

Host-based maintenance made easier

The main problem with the current maintenance configuration workflow is that you need to define both a maintenance and a maintenance period. This can be confusing to new users and is an annoying overhead for what we want to do here. It can also be bothersome that you cannot define a timestamp for the end of the period, only its start and duration. Math is hard!

One option would be a slimmed-down version of Configuration/Maintenance. It would take you straight to a configuration form that lacks the Periods tab. When you create the maintenance, it implicitly creates a period that matches the start and end of the maintenance.
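
The implicit period can be derived from a start and an end timestamp, which spares the user the arithmetic. Below is a minimal sketch of the parameters such a form might pass to the maintenance.create API method. The field names follow the Zabbix 2.2 API as I understand it and should be checked against the documentation for your version; the function names and the auth token argument are hypothetical.

```python
import json
from datetime import datetime, timezone


def one_time_maintenance(name, host_ids, start, end):
    """Return maintenance.create params with a single matching period."""
    if end <= start:
        raise ValueError("end must be after start")
    since = int(start.timestamp())
    till = int(end.timestamp())
    return {
        "name": name,
        "active_since": since,
        "active_till": till,
        "hostids": list(host_ids),
        "timeperiods": [{
            "timeperiod_type": 0,    # one-time period
            "start_date": since,
            "period": till - since,  # duration in seconds, derived from the end
        }],
    }


def json_rpc_request(params, auth_token, request_id=1):
    """Wrap the params in a JSON-RPC 2.0 envelope for the Zabbix API."""
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "maintenance.create",
        "params": params,
        "auth": auth_token,
        "id": request_id,
    })
```

The user only supplies the two timestamps; the period is always as long as the maintenance itself, so the two cannot drift apart.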

Some people may also want to allow regular users to set up maintenances for particular hosts. This can be worked around. It requires some more work to let people view, delete and change the maintenances they set.

  • ZBXNEXT-1548 asks for UI integration between hosts and maintenance setup

POC implementation

File:Zabbix-2.2.5-quick maintenance.patch

  • Resides in the Monitoring menu, so that it's even accessible for regular users; Disabled for guest users!
  • Works for regular users with write permissions on hosts (the restriction was removed from the API class; this can easily be changed back)
  • Integrates with the host's JavaScript menu and leads to the same page, with the host pre-populated
  • Creates a separate maintenance with a one-time period of the same size; this comes with the extra benefit of having a meaning in "approaching, active and expired" in the list of maintenances
  • Only hosts can be chosen; Host groups were removed to avoid possible negative consequences
  • A name must be specified; the Zabbix user name and the end of the maintenance are appended to it for your mouse-over comfort later. This means you can create multiple maintenances of the same name, as long as they don't end at the same time. This probably isn't an ideal solution for everyone.
  • You can choose between 1, 3 and 6 hours. This list can easily be extended.
  • The description is optional; the time of creation is appended to it
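
The naming rules above can be sketched as follows. This is an illustration only and is not taken from the patch: the exact string format, the function name and the timestamp layout are assumptions.

```python
from datetime import datetime, timedelta

DURATIONS_HOURS = (1, 3, 6)  # choices offered by the form; easily extended


def quick_maintenance_fields(name, user, hours, description="", now=None):
    """Return (name, description, end) for a quick maintenance."""
    if hours not in DURATIONS_HOURS:
        raise ValueError("unsupported duration: %r" % (hours,))
    now = now or datetime.now()
    end = now + timedelta(hours=hours)
    stamp = "%Y-%m-%d %H:%M"  # hypothetical timestamp layout
    # User name and end time make the name unique unless two
    # maintenances by the same user end in the same minute.
    full_name = "{} ({}, until {})".format(name, user, end.strftime(stamp))
    # The description is optional; the creation time is always appended.
    full_description = "{} (created {})".format(
        description, now.strftime(stamp)).strip()
    return full_name, full_description, end
```

Because the end time is part of the name, two maintenances called "Reboot" can coexist as long as they end at different times.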

Quick maintenance1.png

Quick maintenance2.png

Trigger-based "maintenance" enhanced

Using this approach is rather difficult. It is tempting to use the JS host menu and just fire a script labeled "Doze 3h". This would use LLD, the API, zabbix_sender or whatever to exploit the first mechanism laid out above. However, only very few macros are expanded for scripts, and there are no additional form elements that would allow you to specify a reason for the maintenance.