Improved handling of items with timeout errors
Owner: Andris Zeila
Zabbix treats item timeouts the same way as network timeouts. If an item's execution time exceeds the timeout configured on the server, the host is assumed to be unreachable. The host is then marked as unreachable for UnreachableDelay seconds, and all of its checks during this period are moved to the unreachable poller and scheduled to execute after UnreachableDelay. However, there is a high chance that once the delay passes, the timed-out item will be at the top of the unreachable poller queue. The host is then marked as unreachable again and the cycle repeats, with the following consequences:
- Item execution times will be erratic at best; in the worst case other items will not be checked at all and the host might end up marked as unavailable.
- The log file will be flooded with warnings about the host becoming unreachable/available.
- The problematic item is not marked in any way, so apart from the logs users have no way of telling that something is wrong with it.
Ideally, Zabbix should detect such items as soon as possible, mark them as not supported with an appropriate error message, and stop treating further timeouts of these items as network timeouts.
Add an unreachable flag to items. Set it to 1 after a timeout error and reset it to 0 after a successful execution.
The queue must give priority to reachable items if they are scheduled to be executed at the same time as unreachable items.
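The flag and the queue priority rule can be illustrated with a minimal Python sketch. This is not Zabbix code: the `Item` and `ItemQueue` classes, the `record_result` helper, and the item keys are hypothetical names chosen for illustration. It assumes the queue is ordered by scheduled time first, with the unreachable flag used only to break ties.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class Item:
    key: str
    nextcheck: int        # scheduled execution time (seconds)
    unreachable: int = 0  # 1 after a timeout error, 0 after a successful check


def record_result(item, timed_out):
    """Set the flag after a timeout error, reset it after success."""
    item.unreachable = 1 if timed_out else 0


class ItemQueue:
    """Hypothetical poller queue: earliest nextcheck first, and among items
    scheduled for the same time, reachable items before unreachable ones."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker so Item objects are never compared

    def push(self, item):
        entry = (item.nextcheck, item.unreachable, next(self._seq), item)
        heapq.heappush(self._heap, entry)

    def pop(self):
        return heapq.heappop(self._heap)[-1]


queue = ItemQueue()
slow = Item("custom.slow", nextcheck=100)
record_result(slow, timed_out=True)          # previous check timed out
queue.push(slow)
queue.push(Item("agent.ping", nextcheck=100))
queue.push(Item("system.uptime", nextcheck=90))

order = [queue.pop().key for _ in range(3)]
print(order)  # ['system.uptime', 'agent.ping', 'custom.slow']
```

Even though `custom.slow` was queued first, `agent.ping` is polled before it because both are due at the same time and only `custom.slow` carries the unreachable flag.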
If an unreachable item fails with a network/gateway error (that is, it has failed at least twice) but the host is not marked as unreachable, this must be treated as an item failure rather than a network failure.
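The decision above might be sketched as follows. The `Result` and `Failure` enumerations and the `classify_failure` function are hypothetical and exist only for this example. Treating a repeated TIMEOUT_ERROR the same way as a network/gateway error is an assumption made here, not something this specification states explicitly.

```python
from enum import Enum


class Result(Enum):
    SUCCEED = "succeed"
    TIMEOUT_ERROR = "timeout"
    NETWORK_ERROR = "network"
    GATEWAY_ERROR = "gateway"


class Failure(Enum):
    NONE = "none"        # check succeeded
    ITEM = "item"        # problem is with the item itself
    NETWORK = "network"  # problem is with the host or network


def classify_failure(result, item_unreachable, host_unreachable):
    """Decide whether a failed check counts against the item or the network.

    item_unreachable: the item's flag was already set by an earlier timeout.
    host_unreachable: the host is currently marked as unreachable.
    """
    if result is Result.SUCCEED:
        return Failure.NONE
    # Assumption: TIMEOUT_ERROR is handled like NETWORK_ERROR/GATEWAY_ERROR here.
    if item_unreachable and not host_unreachable:
        return Failure.ITEM
    return Failure.NETWORK


# First failure of an item: still handled as a network failure.
print(classify_failure(Result.NETWORK_ERROR, False, False))  # Failure.NETWORK
# Repeated failure while the host itself is reachable: item failure.
print(classify_failure(Result.NETWORK_ERROR, True, False))   # Failure.ITEM
```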
Unreachable items must be kept in the unreachable poller with 'not supported' status and polled at the 'Refresh unsupported items' interval configured in the Zabbix frontend.
For passive agent items, TIMEOUT_ERROR must be returned if the item times out while waiting for a response from the agent. NETWORK_ERROR must still be returned if connecting or sending data fails. Items returning TIMEOUT_ERROR must be marked as unreachable items and processed as such.
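A minimal Python sketch of this error mapping is shown below. It is not the Zabbix implementation: `poll_passive_item` and `FakeTransport` are hypothetical, and the string constants merely mirror the error code names used above. Mapping a read failure other than a timeout (for example a connection reset) to NETWORK_ERROR is an assumption, since this specification covers only the read-timeout case.

```python
import socket

SUCCEED = "SUCCEED"
NETWORK_ERROR = "NETWORK_ERROR"
TIMEOUT_ERROR = "TIMEOUT_ERROR"


def poll_passive_item(transport, key):
    """Run one passive check and return (code, value)."""
    try:
        transport.connect()
        transport.send(key)
    except OSError:
        # Connection or sending failed, including a connect timeout.
        return NETWORK_ERROR, None
    try:
        value = transport.recv()
    except (socket.timeout, TimeoutError):
        # The agent accepted the request but did not answer in time.
        return TIMEOUT_ERROR, None
    except OSError:
        # Assumption: any other read failure is a network error.
        return NETWORK_ERROR, None
    return SUCCEED, value


class FakeTransport:
    """Stand-in for an agent connection that fails at a chosen stage."""

    def __init__(self, connect_exc=None, send_exc=None, recv_exc=None, reply="1"):
        self.connect_exc, self.send_exc, self.recv_exc = connect_exc, send_exc, recv_exc
        self.reply = reply

    def connect(self):
        if self.connect_exc:
            raise self.connect_exc

    def send(self, data):
        if self.send_exc:
            raise self.send_exc

    def recv(self):
        if self.recv_exc:
            raise self.recv_exc
        return self.reply


print(poll_passive_item(FakeTransport(recv_exc=TimeoutError()), "custom.slow"))
# ('TIMEOUT_ERROR', None)
print(poll_passive_item(FakeTransport(connect_exc=ConnectionRefusedError()), "agent.ping"))
# ('NETWORK_ERROR', None)
```

Splitting the exchange into two `try` blocks is what keeps a connect timeout mapped to NETWORK_ERROR while a read timeout maps to TIMEOUT_ERROR.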
- What's new
- v1.1: renamed 'timeout' items to 'unreachable' items; clarified the error code for passive agent items failing with a read timeout.