[#6] Add configuration for deploying Alertmanager with Prometheus, including configuration file snippets
parent 41bd3ca1f6
commit f47d6560f5

src/alert/alertmanager/config/rule_files/README.md
Normal file (60 lines)
@@ -0,0 +1,60 @@
# Alert configuration

> Reference: [Custom Prometheus alerting rules](https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/prometheus-alert-rule)

Configuring alerts in Prometheus takes two steps:

1. Write an alerting rule file (rules file).
2. Load the rules in prometheus.yml and configure Alertmanager.

## 1. Write the alerting rule file

An example alerting rule file:

```yml
groups:
  - name: example-rules
    interval: 30s  # evaluate this group every 30 seconds
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unresponsive for more than 1 minute."

      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on instance {{ $labels.instance }} has been above 80% for 5 minutes."
```

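Where:

- `alert`: the name of the alerting rule.
- `expr`: the trigger condition, written as a PromQL expression. The rule is active for every time series that satisfies it.
- `for`: optional wait duration. The alert is sent only after the condition has held continuously for this long; until then, a newly triggered alert is in the `pending` state.
- `labels`: custom labels to attach to the alert. Alertmanager can use them for routing and grouping.
- `annotations`: additional information, such as an alert summary and a detailed description. Annotations are sent to Alertmanager together with the alert when it fires.
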
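The routing and grouping role of `labels` can be illustrated with a minimal alertmanager.yml sketch. It assumes the `severity` values used in the rules above; the receiver names (`default-receiver`, `oncall-receiver`) and the webhook URL are hypothetical placeholders and are not part of this commit.

```yaml
# Sketch only: receiver names and the webhook URL are hypothetical.
route:
  receiver: default-receiver           # fallback for alerts that match no child route
  group_by: ["alertname", "instance"]  # alerts sharing these label values are batched together
  routes:
    - matchers:
        - severity="critical"          # matches the label set on InstanceDown
      receiver: oncall-receiver

receivers:
  - name: default-receiver             # no integrations configured: alerts are accepted but not forwarded
  - name: oncall-receiver
    webhook_configs:
      - url: "http://localhost:5001/alerts"  # hypothetical endpoint
```

With a layout like this, `warning` alerts such as HighCpuUsage would fall through to the default receiver, while `critical` alerts would go to the on-call receiver.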
## 2. Reference the rules in prometheus.yml

Add `rule_files` and `alerting` to prometheus.yml:

```yml
# Bracketed entries follow the schema notation of the Prometheus docs:
# optional fields with <placeholder> values.
global:
  [ evaluation_interval: <duration> | default = 1m ]

rule_files:
  [ - <filepath_glob> ... ]

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093"  # Alertmanager address
```
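For illustration, a filled-in version of the snippet above might look like the following. This is a sketch under assumptions: the `rule_files/*.yml` glob assumes the rule files live in a `rule_files` directory next to prometheus.yml, and the 30s evaluation interval is only an example value. Adjust both to the actual deployment.

```yaml
global:
  evaluation_interval: 30s     # example value; a group-level `interval` overrides it for that group

rule_files:
  - "rule_files/*.yml"         # assumed location of the rule files

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "localhost:9093" # Alertmanager address, as in the snippet above
```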
src/alert/alertmanager/config/rule_files/example_rules.yml
Normal file (37 lines)
@@ -0,0 +1,37 @@
groups:
  - name: example-rules
    interval: 30s  # evaluate this group every 30 seconds
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unresponsive for more than 1 minute."

      - alert: HighCpuUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on instance {{ $labels.instance }} has been above 80% for 5 minutes."
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on instance {{ $labels.instance }} has been above 80% for 5 minutes."
      - alert: DiskSpaceLow
        expr: (node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} - node_filesystem_free_bytes{fstype!~"tmpfs|overlay"}) / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Disk usage on instance {{ $labels.instance }} has been above 90% for 10 minutes."