Docs/specs/ZBXNEXT-2430

Monitoring fast-growing log files

ZBXNEXT-2430

Status: Work in progress

Owner: Andris Mednis

Summary

Zabbix log[] and logrt[] items work well with slowly growing log files: every line matching the pattern is sent to the server.

This method is not suitable for fast-growing log files. Sending a massive number of log lines to Zabbix server overloads both the server and its backend database.

New options should be added to log[] and logrt[] items so that fast-growing log files can be processed without overload.

In addition, new items for counting matching lines should be added, because in some use cases the number of lines is of more interest than the lines themselves.

Approaches to handling fast-growing log files

When a log file grows faster than the monitoring system can process it, there are two options:

  1. Continue processing at a rate the monitoring system can handle and accept the growing time lag, in the hope that the system will eventually catch up with the most recent lines in the log.
    Advantage: no lines are ignored.
    Drawback: important lines may be analyzed only after a long delay, so alerts are generated too late.
    This is the current mode of operation of log[] and logrt[] items.
    Overload protection is implemented as described in "Current rate limit mechanism".
  2. Ignore some older lines to keep up with the most recent lines in the log.
    Advantage: immediate attention to the most recent lines.
    Drawback: ignored lines may be important, but the corresponding alerts are never generated.
    In addition to overload protection, an "ignore" mechanism is proposed as described in "Proposed modification with "maxdelay" parameter".

Zabbix should be modified to allow both approaches.

Current rate limit mechanism

Currently log[] and logrt[] items apply two rate limits to the number of log file lines:

  1. To prevent overloading Zabbix server and network resources, Zabbix agent does not send more than "maxlines" lines of a log file per second.
    The maximum number of lines that may be sent to the server in one update interval is:
    s_count = maxlines * update_interval
    where
    maxlines value: min = 1, default = 20, max = 1000
    If the maxlines parameter is not specified, the default value provided by the "MaxLinesPerSecond" parameter in the agent configuration file is used.
  2. To prevent overloading the CPU and I/O of the monitored host when a log file grows too fast, Zabbix agent does not process more than
    p_count = 4 * s_count
    lines per update interval.

So the maximum configurable rates are 1000 lines/s sent to the server and 4000 lines/s analyzed by the agent.
These limits will not be changed.
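
The two limits above can be illustrated with a small calculation. This is only a sketch of the arithmetic; the function name rate_limits and its argument names are made up for the example and are not part of the agent.

```python
def rate_limits(maxlines=20, update_interval=1):
    """Return (s_count, p_count) for one update interval.

    maxlines        - lines per second allowed to be sent (1..1000)
    update_interval - item update interval in seconds
    """
    if not 1 <= maxlines <= 1000:
        raise ValueError("maxlines must be between 1 and 1000")
    s_count = maxlines * update_interval  # lines that may be sent to the server
    p_count = 4 * s_count                 # lines that may be analyzed by the agent
    return s_count, p_count


print(rate_limits(20, 30))   # (600, 2400)
print(rate_limits(1000, 1))  # (1000, 4000)
```

With the default maxlines = 20 and a 30-second update interval, at most 600 lines are sent and at most 2400 lines are analyzed per check.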

Proposed modification with "maxdelay" parameter

  1. If maxdelay > 0 is specified, collect additional data in each check:
    • number_of_processed_bytes, the number of bytes processed from the log files,
    • t_proc, the "wall-clock" time spent on processing.
      t_proc should include all the time between successive checks of the item.
      One part of t_proc, the time spent by the agent on reading the log files and analyzing their lines, becomes known during the current check.
      The other part of t_proc, the time spent on sending results and metadata to the server and on checking other items, becomes available only during the next check.
  2. At the beginning of the next check calculate:
    • processing speed (bytes per second) as
      v = number_of_processed_bytes / t_proc
    • current delay (seconds) as
      t_del = number_of_remaining_bytes / v
  3. If 0 < t_del <= maxdelay, the delay is acceptable; proceed with analyzing the log file from the current position.
  4. If t_del > maxdelay, ignore lines by "jumping" bytes_to_jump bytes ahead, where
    bytes_to_jump = number_of_remaining_bytes * (t_del - maxdelay) / t_del
    Most likely the jump will "land" somewhere in the middle of a line. Search for the end of that line and do not analyze it.
    Note that ignored lines are not even read into a buffer; only an approximate position to jump to in the file is calculated.
    The fact that log file lines were skipped should be logged in the agent log file at LOG_LEVEL_WARNING level.
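
The steps above can be sketched for a single file as follows. This is a minimal illustration under stated assumptions, not the agent implementation: the function names bytes_to_jump and jump are hypothetical, and returning 0 when no speed estimate is available yet is an assumption the specification does not cover. Note that after the jump the bytes still to be read correspond to a delay of about maxdelay seconds at speed v.

```python
import io


def bytes_to_jump(processed_bytes, t_proc, remaining_bytes, maxdelay):
    """Return how many bytes to skip; 0 means the delay is acceptable."""
    if maxdelay <= 0 or remaining_bytes <= 0:
        return 0                      # feature disabled or nothing left to read
    if processed_bytes <= 0 or t_proc <= 0:
        return 0                      # assumption: no speed estimate yet, do not jump
    v = processed_bytes / t_proc      # processing speed, bytes per second
    t_del = remaining_bytes / v       # estimated delay, seconds
    if t_del <= maxdelay:
        return 0
    return int(remaining_bytes * (t_del - maxdelay) / t_del)


def jump(f, position, skip):
    """Move `skip` bytes past `position`, then to the start of the next line."""
    if skip <= 0:
        f.seek(position)
        return position
    f.seek(position + skip)
    f.readline()                      # discard the line we landed in, unanalyzed
    return f.tell()


# 1000 bytes were processed in 2 s (v = 500 B/s); 10000 bytes remain (t_del = 20 s).
print(bytes_to_jump(1000, 2.0, 10000, maxdelay=5))  # 7500

log = io.BytesIO(b"aaaa\nbbbb\ncccc\ndddd\n")
print(jump(log, 0, 7))                              # 10, the start of "cccc"
```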

For items dealing with log rotation (e.g. logrt[]), number_of_processed_bytes, number_of_remaining_bytes and t_proc should be calculated over all log files selected for the current check. "Jumping" over ignored lines may result in "landing" in a different log file.
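
How a jump can cross file boundaries may be sketched as below. The helper names remaining_bytes and land, and the representation of the selected files as a list of sizes ordered from oldest to newest, are assumptions made for illustration only.

```python
def remaining_bytes(file_sizes, index, position):
    """Bytes not yet processed, summed over the current and all newer files."""
    return (file_sizes[index] - position) + sum(file_sizes[index + 1:])


def land(file_sizes, index, position, skip):
    """Return (file index, offset) reached after skipping `skip` bytes.

    file_sizes - sizes in bytes of the selected log files, oldest first
    index      - index of the file currently being read
    position   - current offset in that file
    """
    # Consume whole remainders of files while the jump reaches past their end.
    while index < len(file_sizes) - 1 and position + skip >= file_sizes[index]:
        skip -= file_sizes[index] - position
        index += 1
        position = 0
    # Never land beyond the end of the newest file.
    return index, min(position + skip, file_sizes[index])


sizes = [1000, 2000, 500]
print(remaining_bytes(sizes, 0, 400))  # 3100
print(land(sizes, 0, 400, 1500))       # (1, 900)
```

As in the single-file case, the landing offset would normally fall inside a line, so the rest of that line would be skipped before analysis resumes.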

Changes in item parameters

Add a new parameter "maxdelay" to log[] and logrt[] items:

log[/path/to/file/file_name,<regexp>,<encoding>,<maxlines>,<mode>,<output>,<maxdelay>]
logrt[/path/to/file/regexp_describing_filename_pattern,<regexp>,<encoding>,<maxlines>,<mode>,<output>,<maxdelay>]

"maxdelay" parameter is the maximum acceptable delay in seconds.
Type: float.
Values:

  • 0 - (default) standard behaviour, never ignore records.
  • > 0 - older lines may be ignored if necessary for keeping up with most recent lines in a fast growing log file.

The new parameter is optional.
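
Because "maxdelay" is the last positional parameter, any unused parameters before it must be left empty. The sketch below builds such keys; the helper log_key, the file path and the regular expression are hypothetical, and quoting of parameters that contain commas or brackets is deliberately not handled.

```python
def log_key(path, regexp="", encoding="", maxlines="", mode="", output="",
            maxdelay="", rotated=False):
    """Build a log[] or logrt[] item key with positional parameters."""
    name = "logrt" if rotated else "log"
    params = [path, regexp, encoding, maxlines, mode, output, maxdelay]
    return "{}[{}]".format(name, ",".join(str(p) for p in params))


# Send at most 100 lines/s and allow older lines to be skipped
# once the estimated delay exceeds 30 seconds.
print(log_key("/var/log/app.log", regexp="ERROR", maxlines=100, maxdelay=30))
# log[/var/log/app.log,ERROR,,100,,,30]
```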

New functionality: counting matching lines in log files

In some cases the user is not interested in sending every matching log file line to Zabbix, only in the number of matching lines.

Add two new items log.count[] and logrt.count[] to implement "count" functionality for log files:

log.count[/path/to/file/file_name,<regexp>,<encoding>,<maxproclines>,<mode>,<maxdelay>]
logrt.count[/path/to/file/regexp_describing_filename_pattern,<regexp>,<encoding>,<maxproclines>,<mode>,<maxdelay>]

where maxproclines is the maximum number of lines per second to be analyzed by the agent.
maxproclines value: min = 1, default = 80, max = 4000.
If the maxproclines parameter is not specified, the default value is set to 4 * MaxLinesPerSecond (a parameter in the agent configuration file).

The value of log.count[] and logrt.count[] items will be the number of matching lines found during the configured update interval of the item.

While log[] and logrt[] send data only when matching lines have been detected, the new items log.count[] and logrt.count[] will send data on every update interval, even if the log file has not changed ("0" in this case).

Similarly to log[] and logrt[], if an error occurs when sending the result to the server, the new items log.count[] and logrt.count[] will not advance the current position in the log file until communication is restored, unless maxdelay > 0 is specified. Note that this can affect log.count[] and logrt.count[] results. For example, one check finds 100 matching lines in a log file, but due to a communication problem this result cannot be sent to the server. In the next check the agent counts the same 100 matching lines plus 70 new matching lines, and communication is restored. The agent now sends count = 170 as if all of them had been found in one check.
If maxdelay > 0 is specified, a "jump" over log file lines takes place, and the result of the log.count[] or logrt.count[] check cannot be sent to the server, then the position after the "jump" is kept and the result is discarded.

Similarly to log[], the new item log.count[] will send NOTSUPPORTED if the log file does not exist or is not accessible.

Similarly to logrt[], the new item logrt.count[] will NOT send NOTSUPPORTED if the log file(s) do not exist or are not accessible, as this could be a result of rotation.
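
The 100 + 70 = 170 example can be reproduced with a small simulation. It models only the case without "jumps" (maxdelay = 0), and the function run_checks and its arguments are invented for illustration; it is not how the agent stores its state.

```python
def run_checks(new_matches_per_check, send_ok_per_check):
    """Return the values received by the server (None where sending failed)."""
    pending = 0      # matches counted since the last successful send
    received = []
    for new, ok in zip(new_matches_per_check, send_ok_per_check):
        # The position was not advanced after a failed send,
        # so earlier matches are counted again together with the new ones.
        pending += new
        if ok:
            received.append(pending)
            pending = 0
        else:
            received.append(None)
    return received


# First check: 100 matches, sending fails. Second check: 70 new matches, sending works.
print(run_checks([100, 70], [False, True]))  # [None, 170]
```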

Front-end changes

Modify items log[] and logrt[] by adding a new parameter "maxdelay".

Add two new items log.count[] and logrt.count[] as described above.

API changes

Translation strings

Added strings:

  • Number of matching lines since the last check of the item. Returns integer
  • Number of matching lines since the last check of the item with log rotation support. Returns integer

Database changes

None planned.

Documentation

  • What's new section
  • Item types, Zabbix agent - describe new "maxdelay" parameter for log[] and logrt[] items, describe new items log.count[] and logrt.count[].
  • Log file monitoring - describe how "maxdelay" parameter and skipping of lines work.

ChangeLog