Prometheus监控告警规则介绍

Prometheus 监控告警规则介绍

一：配置加载异常检测

Prometheus configuration reload failure - alert: PrometheusConfigurationReloadFailure expr: prometheus_config_last_reload_successful != 1 for: 0m labels: severity: warning annotations: summary: Prometheus configuration reload failure (instance {{ $labels.instance }}) description: "Prometheus configuration reload error\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

expr: prometheus_config_last_reload_successful != 1

解读：

prometheus_config_last_reload_successful 是 Prometheus 监控系统中的一个重要指标，用于指示 Prometheus 配置文件的最后一次重新加载是否成功

指标含义：

当值为 1 时，表示最后一次配置重新加载成功。当值为 0 时，表示最后一次配置重新加载失败。

二：重启次数

Prometheus too many restarts- alert: PrometheusTooManyRestarts expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2 for: 0m labels: severity: warning annotations: summary: Prometheus too many restarts (instance {{ $labels.instance }}) description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2

解读：

process_start_time_seconds:

这是一个 Prometheus 指标，表示进程的启动时间（以 Unix 时间戳的形式）

{job=~"prometheus|pushgateway|alertmanager"}:

这是一个标签匹配器，用于选择特定的作业。=~ 表示正则表达式匹配。它会匹配 job 标签为 "prometheus"、"pushgateway" 或 "alertmanager" 的时间序列

[15m]:

这定义了一个 15 分钟的时间范围

changes():

这是一个函数，用于计算给定时间范围内时间序列的值变化的次数。在这个上下文中，它计算 process_start_time_seconds 在过去 15 分钟内变化的次数

魔女团新闻

韵味老鸟