Operations log for the monitoring stack
Q1:level=error component=notifier alertmanager=http://alertmanager:9093/api/v2/alerts count=1 msg="Error sending alert"
This error appears inside the prometheus container.
Fix: make sure all the containers are attached to the same network.
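One way to confirm the containers share a network and that Prometheus can reach Alertmanager (a sketch; the network name `monitoring` and container names `prometheus`/`alertmanager` are assumptions, not taken from the log above):

```shell
# List networks, then show which containers are attached to the one we expect.
docker network ls
docker network inspect monitoring --format '{{range .Containers}}{{.Name}} {{end}}'

# If the prometheus container is missing from that network, attach it.
docker network connect monitoring prometheus

# From inside the prometheus container, verify alertmanager resolves and responds.
docker exec prometheus wget -qO- http://alertmanager:9093/-/healthy
```

If the last command fails to resolve `alertmanager`, the two containers are on different networks and the "Error sending alert" message above is expected.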
Q5: Chinese translations of Prometheus metric names
This page has decent Chinese translations of the Prometheus metric names:
https://n9e.github.io/docs/appendix/grafana-agent/integrations/mongodb-exporter-config/
Q3: Deploying perf
# Install the perf performance-analysis tools
apt-get update -y
apt-get install -y linux-tools-common linux-tools-generic linux-cloud-tools-generic linux-tools-$(uname -r)
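A minimal smoke test after installation (a sketch; sampling duration and report length are arbitrary choices, and `perf record -a` needs root):

```shell
# Sample call stacks on all CPUs for 10 seconds.
perf record -a -g -- sleep 10

# Print the hottest functions from the recorded profile.
perf report --stdio | head -n 20
```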
Q4: Passwordless copy over SSH
# Append the local public key to the local authorized_keys
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
# Copy the file to the remote host (note: this overwrites the remote authorized_keys)
scp /root/.ssh/authorized_keys root@192.168.102.47:~/.ssh/authorized_keys
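Since the scp above overwrites any keys already present on the remote host, `ssh-copy-id` is usually safer because it appends instead (a sketch; the rsa key type is an assumption, the commands above use a dsa key):

```shell
# Generate a key pair if one does not exist yet (no passphrase).
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa

# Append the public key to the remote authorized_keys without clobbering it.
ssh-copy-id root@192.168.102.47
```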
Q5: Kubernetes alerting metrics
Reference: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapidown/
Q6: The alert push platform is PrometheusAlert
Triggering a test alert:
stress -c 4   # spin up 4 CPU workers to raise CPU usage and fire the alert
https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/system-func.md
Q7: Alert notification grouping (Alertmanager configuration)
global:                   # global config
  resolve_timeout: 5m     # timeout, defaults to 5m
inhibit_rules:
  - source_match:         ## source alert rule
      severity: 90
    target_match:         ## alert rule being inhibited
      severity: 80
    equal: [kafka, instance]   ## both alerts must carry the same labels and values, otherwise inhibition does not apply
route:
  receiver: webhook1
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [demo, kafka, instance]   # corresponds to `team` in the Prometheus rule files
  routes:
    - receiver: webhook2              # matches the receiver definition below
      group_by: [nodeExt2]
      matchers:
        - team = kafka
      group_interval: 10s
      group_wait: 30s
      repeat_interval: 60m
    - receiver: webhook3
      group_by: [nodeExt3]
      matchers:
        - team = instance
      group_interval: 10s
      group_wait: 30s
      repeat_interval: 60m
receivers:
  - name: webhook1
    webhook_configs:                  # webhook alert config
      - url: http://172.16.1.165:29098/maintenanceApi/order/alarm
  - name: webhook2
    webhook_configs:
      - url: http://172.16.1.165:29098/maintenanceApi/order/alarm2
  - name: webhook3
    webhook_configs:
      - url: http://172.16.1.165:29098/maintenanceApi/order/alarm3
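Before reloading Alertmanager with a config like the one above, the file can be validated and hot-reloaded (a sketch, assuming the config is saved as alertmanager.yml and Alertmanager listens on localhost:9093):

```shell
# Validate the configuration file syntax and semantics.
amtool check-config alertmanager.yml

# Ask the running Alertmanager to reload its configuration without a restart.
curl -X POST http://localhost:9093/-/reload
```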