Prometheus monitoring items explained
1: Template configuration monitoring
Prometheus template text expansion failures

  - alert: PrometheusTemplateTextExpansionFailures
    expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} template text expansion failures\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
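Interpretation:
This PromQL expression detects whether any template text expansion failure occurred during the past 3 minutes. Because the rule sets for: 0m, the alert fires as soon as the expression returns a result.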
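As a rough illustration of what the expression computes, the sketch below sums the growth of a counter across the samples in a window and compares it with the threshold 0. It is a simplification, not Prometheus's implementation: the real increase() also extrapolates to the window boundaries, which is omitted here. The names simple_increase and should_fire are hypothetical helpers introduced only for this example.

```python
from typing import Sequence


def simple_increase(samples: Sequence[float]) -> float:
    """Sum the growth between consecutive counter samples.

    A drop in value is treated as a counter reset: the counter is assumed
    to have restarted from zero, so the new value counts as growth.
    Assumption: no extrapolation to the window edges, unlike PromQL.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            total += curr  # counter reset between the two samples
    return total


def should_fire(samples: Sequence[float], threshold: float = 0.0) -> bool:
    """Mirror the rule condition `increase(...[3m]) > 0` on raw samples."""
    return simple_increase(samples) > threshold


if __name__ == "__main__":
    # Samples scraped inside one 3-minute window (values are made up).
    print(should_fire([3, 3, 3]))  # counter did not move
    print(should_fire([3, 4, 6]))  # counter grew by 3
```

A flat counter yields an increase of zero, so the condition is false; any growth inside the window, even a single failure, makes it true.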
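prometheus_template_text_expansion_failures_total:
A Prometheus counter that records the total number of template text expansion failures.
increase():
A function that computes how much a counter has increased over the given time range.
2: Rule complexity evaluation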
Prometheus rule evaluation slow

  - alert: PrometheusRuleEvaluationSlow
    expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
      description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
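Interpretation:
This PromQL expression detects whether a rule group takes longer to evaluate than its configured evaluation interval. With for: 5m, the condition must hold for 5 minutes before the alert fires.
prometheus_rule_group_last_duration_seconds:
The time, in seconds, that the most recent evaluation of the rule group took.
prometheus_rule_group_interval_seconds:
The scheduled evaluation interval of the rule group, in seconds.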
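The comparison above, combined with the for: 5m clause, can be sketched as follows. This is a simplified model under stated assumptions, not how Prometheus tracks alert state internally: Observation and firing_since are hypothetical names, and the observations are assumed to arrive in timestamp order for a single rule group.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Observation:
    timestamp: float      # seconds since some arbitrary start
    last_duration: float  # prometheus_rule_group_last_duration_seconds
    interval: float       # prometheus_rule_group_interval_seconds


def firing_since(observations: Iterable[Observation],
                 hold_seconds: float = 300.0) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    The condition `last_duration > interval` must hold at every
    observation for at least `hold_seconds` (the `for: 5m` clause);
    a single observation where it does not hold resets the pending state.
    """
    pending_start: Optional[float] = None
    for obs in observations:
        if obs.last_duration > obs.interval:
            if pending_start is None:
                pending_start = obs.timestamp  # alert becomes pending
            if obs.timestamp - pending_start >= hold_seconds:
                return obs.timestamp           # alert becomes firing
        else:
            pending_start = None               # condition cleared
    return None


if __name__ == "__main__":
    # A group scheduled every 60 s whose evaluations keep taking 70 s.
    slow = [Observation(t, 70.0, 60.0) for t in range(0, 361, 60)]
    print(firing_since(slow))
```

In this model, a group that is slow at every observation starts firing once five minutes have passed, while a single evaluation that finishes within the interval resets the wait.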