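Prometheus Deployment, PromQL, and Monitoring Common System Services
I. Core Concepts
1. Prometheus architecture components
- Prometheus Server: scrapes and stores metrics, and serves the web UI and query API
- Grafana: dashboards and visualization
- Pushgateway: accepts pushed metrics from custom or short-lived jobs
- Alertmanager: alert handling and notification (DingTalk, WeCom, email, and so on)
- Exporters: agents on the monitored side that expose metrics
2. Prometheus metric types (interview focus)
| Type | Description | Typical use |
|---|---|---|
| Gauge | A current value that can go up or down | Memory usage, disk capacity, temperature |
| Counter | A monotonically increasing count (restarts from zero when the process restarts) | Total HTTP requests, total errors |
| Histogram | Observations counted into cumulative buckets | Response-time quantile analysis |
| Summary | Quantiles precomputed on the client | Precomputed quantile metrics |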
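Because a Counter only grows and restarts from zero when the process restarts, queries work on the growth between samples rather than on the raw value. The following is a minimal Python sketch of that idea using made-up samples. `counter_increase` is a hypothetical helper, and unlike PromQL's `increase()` it does not extrapolate to the window boundaries.

```python
def counter_increase(samples):
    """Return the total growth of a counter across samples, tolerating resets."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter was reset (process restart) and restarted from zero,
            # so the whole current value counts as growth.
            total += curr
    return total


# Hypothetical samples of a request counter; the process restarts after 130.
samples = [100, 120, 130, 5, 25]
print(counter_increase(samples))
```

A Gauge needs no such treatment, since its latest sample is already the value of interest.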
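3. Black-box vs. white-box monitoring
- White-box monitoring: internal metrics that can reveal a problem before it surfaces (for example, memory utilization)
- Black-box monitoring: external probing that detects a problem once it has occurred (for example, a website is unreachable)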
II. Key Commands and Configuration
1. Deploying Prometheus Server
# Download
wget https://github.com/prometheus/prometheus/releases/download/v3.10.0/prometheus-3.10.0.linux-amd64.tar.gz
# Extract and run
tar xf prometheus-3.10.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/prometheus-3.10.0.linux-amd64/
./prometheus
# Web UI
http://IP:9090/
2. Deploying Node Exporter (monitoring Linux hosts)
# Download and run
wget https://github.com/prometheus/node_exporter/releases/download/v1.10.2/node_exporter-1.10.2.linux-amd64.tar.gz
tar xf node_exporter-1.10.2.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/node_exporter-1.10.2.linux-amd64/
./node_exporter
# Metrics endpoint
http://IP:9100/metrics
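3. Prometheus configuration file
global:
  scrape_interval: 3s  # scrape interval (short, for demos; the default is 1m)
scrape_configs:
  - job_name: "node-exporter"
    metrics_path: "/metrics"
    scheme: "http"
    static_configs:
      - targets: ["10.0.0.41:9100","10.0.0.42:9100"]
4. Hot-reloading the configuration
# Requires Prometheus to be started with --web.enable-lifecycle
curl -X POST http://IP:9090/-/reload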
5. Installing Grafana
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.5.21_amd64.deb
dpkg -i grafana-enterprise_9.5.21_amd64.deb
systemctl enable --now grafana-server
# Open http://IP:3000; the default credentials are admin/admin
III. Common PromQL Queries (interview focus)
1. CPU utilization (%)
(1 - sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) /
sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
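The arithmetic behind this query can be checked by hand: take the growth of every CPU mode over the window, then subtract the idle share from 1. The sketch below does this in Python with hypothetical counter values. `cpu_usage_percent` and the numbers are illustrative assumptions, not output from a real host.

```python
def cpu_usage_percent(before, after):
    """before/after map CPU mode -> cumulative seconds, summed over all cores."""
    deltas = {mode: after[mode] - before[mode] for mode in after}
    total = sum(deltas.values())
    if total == 0:
        return 0.0
    return (1 - deltas["idle"] / total) * 100


# Hypothetical counter values one minute apart on a 4-core host
# (4 cores x 60 s = 240 CPU-seconds in total).
before = {"idle": 1000.0, "user": 200.0, "system": 100.0, "iowait": 20.0}
after = {"idle": 1180.0, "user": 240.0, "system": 115.0, "iowait": 25.0}
print(round(cpu_usage_percent(before, after), 2))
```

Here 180 of the 240 CPU-seconds were idle, so the host was busy for the remaining quarter of the window.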
2. Share of CPU time spent in system mode (%)
(sum(increase(node_cpu_seconds_total{mode="system"}[1m])) by (instance) /
sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
3. Number of CPU cores
count(node_cpu_seconds_total{mode="idle"}) by (instance)
4. Node uptime in minutes
(time() - node_boot_time_seconds) / 60
5. Exact, regex, and negative matching
# Exact match
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu="1"}
# Regex match (fully anchored; "i.*" matches idle, iowait, and irq)
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu="1",mode=~"i.*"}
# Negative match
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu!="1"}
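The four matcher operators (`=`, `!=`, `=~`, `!~`) can be modelled in a few lines. This is a rough Python sketch, not the Prometheus implementation: `matches` is a hypothetical helper, and Python's `re` module stands in for the RE2 syntax that Prometheus uses. The point it illustrates is that regex matchers must cover the whole label value.

```python
import re


def matches(labels, name, op, value):
    """Evaluate one PromQL-style label matcher against a label set."""
    actual = labels.get(name, "")  # a missing label behaves like an empty value
    if op == "=":
        return actual == value
    if op == "!=":
        return actual != value
    if op == "=~":
        return re.fullmatch(value, actual) is not None
    if op == "!~":
        return re.fullmatch(value, actual) is None
    raise ValueError(f"unknown operator: {op}")


series = {"instance": "10.0.0.42:9100", "cpu": "1", "mode": "idle"}
print(matches(series, "mode", "=~", "i.*"))  # the regex covers the whole value
print(matches(series, "mode", "=~", "dle"))  # a partial match is not enough
print(matches(series, "cpu", "!=", "1"))
```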
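IV. Common Interview Questions
- How does Prometheus differ from Zabbix?
  - Prometheus is better suited to containers and microservices
  - Prometheus pulls metrics from its targets, while Zabbix also supports push
  - Prometheus has the powerful PromQL query language
- How do the four metric types differ, and when is each used?
  - Gauge: a current state value
  - Counter: a cumulative value, used together with rate()/increase()
  - Histogram: quantiles are computed on the server side
  - Summary: quantiles are precomputed on the client side
- What is the workflow for monitoring a service?
  - The monitored side exposes metrics
  - Add the target to the Prometheus configuration
  - Hot-reload the configuration
  - Verify in the web UI
  - Import a Grafana dashboard template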
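To make "Histogram: quantiles are computed on the server side" concrete, here is a simplified Python sketch of the estimate: find the bucket that contains the requested rank, then interpolate linearly inside it. The bucket values are hypothetical, and this version leaves out edge cases that PromQL's `histogram_quantile()` handles, such as out-of-range quantiles.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical request-duration buckets (le, cumulative count).
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.9, buckets))
print(histogram_quantile(0.95, buckets))
```

The result is only as precise as the bucket boundaries, which is the trade-off against a Summary's client-side quantiles.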
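Monitoring Common Middleware with Prometheus and Building Custom Grafana Dashboards
I. Core Concepts
1. How an exporter works
An exporter is an agent on the monitored side that:
- Exposes a metrics endpoint
- Collects metrics from the application
- Serves the data in the standard Prometheus format
2. Core Grafana components
- Dashboard: a collection of panels
- Panel: a single chart
- Row: a collapsible group of panels
- Variable: a value used for dynamic filtering
- DataSource: where the data comes from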
II. Key Commands and Configuration
1. Monitoring MySQL
# Run the MySQL exporter
mysqld_exporter --config.my-cnf=/root/.my.cnf
# Contents of /root/.my.cnf
[client]
host = 10.0.0.41
port = 3306
user = liuxin
password = liuxin
# Prometheus configuration
- job_name: "mysql-exporter"
  static_configs:
    - targets: ["10.0.0.42:9104"]
Grafana dashboard IDs: 14057, 17320
2. Monitoring Redis
# Run the Redis exporter
redis_exporter -redis.addr redis://10.0.0.41:6379 -web.listen-address :9121
# Prometheus configuration
- job_name: "redis-exporter"
  static_configs:
    - targets: ["10.0.0.42:9121"]
Grafana dashboard IDs: 11835, 14091
3. Monitoring MongoDB
# Run the MongoDB exporter
mongodb_exporter --mongodb.uri=mongodb://10.0.0.43:27017 --collect-all
# Prometheus configuration
- job_name: "mongodb-exporter"
  static_configs:
    - targets: ["10.0.0.42:9216"]
Grafana dashboard ID: 16504
4. Monitoring Nginx (requires compiling in the VTS module)
# Build nginx with the VTS module
./configure --add-module=/root/nginx-module-vts
make && make install
# nginx.conf
http {
    vhost_traffic_status_zone;
    server {
        location /status {
            vhost_traffic_status_display;
        }
    }
}
# Prometheus configuration
- job_name: "nginx-vts"
  metrics_path: "/status/format/prometheus"
  static_configs:
    - targets: ["10.0.0.41:80"]
Grafana dashboard ID: 9785
5. Monitoring Docker (cAdvisor)
# Run cAdvisor
docker run -d \
  --volume=/:/rootfs:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  -p 18080:8080 \
  --name=cadvisor \
  gcr.io/cadvisor/cadvisor-amd64:v0.52.1
Grafana dashboard ID: 10619
6. Monitoring Elasticsearch
# Run the Elasticsearch exporter
elasticsearch_exporter --es.uri="https://elastic:123456@10.0.0.91:9200"
Grafana dashboard ID: 14191
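III. Custom Grafana Dashboards
1. Defining a variable
# Variable type: Query
# Query: label_values(instance)
# Label filter: {job="node-exporter"}
2. Referencing a variable
# Assumes the variable defined above is named myhost
node_cpu_seconds_total{instance="$myhost"}
3. Managing plugins
# Install a plugin
grafana-cli plugins install natel-discrete-panel
# Restart Grafana for the plugin to take effect
systemctl restart grafana-server
4. Backing up and restoring dashboards
- Backup: Dashboard settings → Export → Save to file
- Restore: Dashboards → Import → Upload JSON file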
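IV. Common Interview Questions
- What does an exporter do?
  - Converts application metrics into the Prometheus format
  - Exposes the metrics over an HTTP endpoint
- How do you choose a Grafana dashboard template?
  - Pick one that matches the exporter type
  - Check version compatibility
  - Adjust the PromQL as needed
- What is the standard workflow for monitoring middleware?
  - Deploy the matching exporter
  - Add the target to the Prometheus configuration
  - Import a Grafana dashboard template
  - Verify that the data is displayed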
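Custom Monitoring with Pushgateway, Custom Exporters, Service Discovery, and Federation in Practice
I. Core Concepts
1. When to use Pushgateway
- Monitoring ad hoc and batch jobs
- Short-lived jobs that exit before they can be scraped on a regular schedule
- Pushing custom business metrics
2. Service discovery types
| Type | Description | Suitable for |
|---|---|---|
| static_configs | Static configuration; changes require a hot reload | Fixed targets |
| file_sd_configs | Dynamic discovery from files | Targets managed by a configuration center |
| consul_sd_configs | Discovery through Consul | Microservice architectures |
| kubernetes_sd_configs | Kubernetes service discovery | Container environments |
3. Federation
- A multi-level Prometheus architecture
- The upper-level server pulls data from the lower-level servers
- Reduces I/O pressure on any single node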
II. Key Commands and Configuration
1. Deploying and using Pushgateway
# Deploy
wget https://github.com/prometheus/pushgateway/releases/download/v1.11.2/pushgateway-1.11.2.linux-amd64.tar.gz
pushgateway --web.listen-address=:9091
# Push a single value
echo "student_online 35" | curl --data-binary @- http://10.0.0.42:9091/metrics/job/myjob/instance/10.0.0.43
# Push multiple values
cat <<EOF | curl --data-binary @- http://IP:9091/metrics/job/myjob/instance/10.0.0.41
# TYPE disk_usage gauge
disk_usage 92.56
# TYPE student_online gauge
student_online 150
EOF
# Delete one group (the URL must carry the full grouping key, including instance)
curl -X DELETE http://IP:9091/metrics/job/myjob/instance/10.0.0.41
# Delete everything (requires starting Pushgateway with --web.enable-admin-api)
curl -X PUT http://IP:9091/api/v1/admin/wipe
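The same push can be prepared from Python. The sketch below only builds the URL and the request body, so it runs without a Pushgateway. `push_url` and `render_gauges` are hypothetical helpers. Label values that contain a slash need Pushgateway's base64 URL form, which is not handled here.

```python
from urllib.parse import quote


def push_url(base, job, grouping):
    """Build the Pushgateway URL /metrics/job/<job>/<label>/<value>..."""
    parts = [base.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for name, value in grouping.items():
        parts += [quote(name, safe=""), quote(value, safe="")]
    return "/".join(parts)


def render_gauges(metrics):
    """Render {name: value} in the text exposition format with TYPE lines."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"  # the body has to end with a newline


url = push_url("http://10.0.0.42:9091", "myjob", {"instance": "10.0.0.41"})
body = render_gauges({"disk_usage": 92.56, "student_online": 150})
print(url)
print(body, end="")
```

To send it, the body could be posted with `urllib.request.urlopen(urllib.request.Request(url, data=body.encode(), method="POST"))`, which mirrors the `curl --data-binary` call.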
2. Scraping Pushgateway from Prometheus
- job_name: "pushgateway"
  honor_labels: true  # on conflict, keep the labels carried by the pushed metrics
  static_configs:
    - targets: ["10.0.0.42:9091"]
3. Custom monitoring script example (TCP connection states)
#!/bin/bash
# Push the number of TCP connections per state to Pushgateway.
pushgateway_url="http://10.0.0.42:9091/metrics/job/tcp_status/instance/10.0.0.31"
states="SYN-SENT SYN-RECV FIN-WAIT-1 FIN-WAIT-2 TIME-WAIT CLOSE CLOSE-WAIT LAST-ACK LISTEN CLOSING ESTAB UNKNOWN"
{
  echo "# HELP tcp_connections Number of TCP connections per state."
  echo "# TYPE tcp_connections gauge"
  for s in $states; do
    # Compare the State column exactly so that CLOSE does not also count CLOSE-WAIT
    count=$(ss -tan | awk -v s="$s" '$1 == s {n++} END {print n+0}')
    echo "tcp_connections{state=\"$s\"} $count"
  done
} > /tmp/tcp.txt
curl --data-binary @/tmp/tcp.txt "$pushgateway_url"
4. Custom exporter in Python
from flask import Flask, jsonify
from prometheus_client import Counter, Summary, start_http_server

app = Flask(__name__)
# The Summary records how long each request takes; the Counter counts requests
# (the client library exposes it as request_count_total).
REQUEST_TIME = Summary("request_processing_seconds", "Time spent processing requests")
REQUEST_COUNT = Counter("request_count", "Total request count")


@app.route("/apps")
@REQUEST_TIME.time()
def requests_count():
    REQUEST_COUNT.inc()
    return jsonify({"office": "https://www.oldboyedu.com"})


if __name__ == "__main__":
    start_http_server(8000)  # metrics at http://IP:8000/metrics
    app.run(host="0.0.0.0", port=8001)  # application at http://IP:8001/apps
5. File-based service discovery
# prometheus.yml
- job_name: "file-sd"
  file_sd_configs:
    - files:
        - /tmp/targets.json
        - /tmp/targets.yaml
# targets.json (changes to this file are picked up without a reload)
[
  {
    "targets": ["10.0.0.41:9100"],
    "labels": {"school": "oldboyedu", "class": "linux102"}
  }
]
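Since Prometheus re-reads these files on change, whatever generates them should avoid leaving a half-written file behind. One common approach is to write to a temporary file and rename it into place. The Python sketch below assumes that approach. `write_targets` is a hypothetical helper, and the demo writes to a temporary directory rather than /tmp/targets.json.

```python
import json
import os
import tempfile


def write_targets(path, targets, labels):
    """Atomically write one file_sd target group to `path`."""
    groups = [{"targets": list(targets), "labels": dict(labels)}]
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(groups, tmp, indent=2)
    os.chmod(tmp_path, 0o644)   # mkstemp creates the file as 0600
    os.replace(tmp_path, path)  # a rename within one filesystem is atomic


demo_path = os.path.join(tempfile.mkdtemp(), "targets.json")
write_targets(demo_path, ["10.0.0.41:9100", "10.0.0.42:9100"],
              {"school": "oldboyedu", "class": "linux102"})
with open(demo_path) as handle:
    print(handle.read())
```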
6. Consul-based service discovery
# prometheus.yml
- job_name: "consul-sd"
  consul_sd_configs:
    - server: 10.0.0.43:8500
  relabel_configs:
    - source_labels: [__meta_consul_service]
      regex: consul
      action: drop  # drop Consul's own service
# Register a service with Consul
curl -X PUT -d '{"id":"node42","name":"node-exporter","address":"10.0.0.42","port":9100}' \
  http://10.0.0.43:8500/v1/agent/service/register
# Deregister the service
curl -X PUT http://10.0.0.43:8500/v1/agent/service/deregister/node42
7. Federation configuration
# Configuration on the upper-level Prometheus
- job_name: "federate-32"
  metrics_path: "/federate"
  honor_labels: true
  params:
    "match[]":
      - '{job="prometheus"}'
      - '{__name__=~"node.*"}'
  static_configs:
    - targets: ["10.0.0.32:9090"]
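Each `match[]` entry becomes a repeated query parameter on the `/federate` request, and the lower-level server returns every series that matches at least one selector. The sketch below shows roughly what that request URL looks like. `federate_url` is a hypothetical helper, and the `http` scheme is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(target, matchers):
    """Build the URL that a federation scrape of `target` requests."""
    query = urlencode({"match[]": matchers}, doseq=True)
    return f"http://{target}/federate?{query}"


url = federate_url("10.0.0.32:9090", ['{job="prometheus"}', '{__name__=~"node.*"}'])
print(url)
print(parse_qs(urlsplit(url).query)["match[]"])
```

Fetching such a URL with curl is a quick way to check which series a selector would federate before adding it to the job.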
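III. Common Interview Questions
- What are the drawbacks of Pushgateway?
  - Pushed data is lost on restart unless --persistence.file is set
  - It is a single point of failure
  - It is not suitable for monitoring long-running services
- Static configuration vs. dynamic discovery?
  - Static: changes require a hot reload or restart
  - Dynamic: targets are discovered automatically, with no restart
- What does honor_labels do?
  - true: on conflict, the labels in the scraped data take precedence over the labels Prometheus attaches
  - false: on conflict, the scraped label is renamed with an "exported_" prefix
- When is federation used?
  - Monitoring across multiple data centers
  - Layered architectures that spread load
  - Aggregated views of data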
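The honor_labels behaviour described above can be written out as a small merge function. This is a simplified Python sketch of the conflict rule only. `merge_labels` is a hypothetical helper, and details such as empty label values and external labels are left out.

```python
def merge_labels(scraped, target, honor_labels):
    """Combine labels from a scraped sample with labels attached by Prometheus."""
    result = dict(scraped)
    for name, value in target.items():
        if name in scraped:
            if honor_labels:
                continue  # keep the scraped value and ignore the target label
            result["exported_" + name] = scraped[name]  # move the scraped value aside
        result[name] = value
    return result


scraped = {"__name__": "student_online", "job": "myjob", "instance": "10.0.0.43"}
target = {"job": "pushgateway", "instance": "10.0.0.42:9091"}
print(merge_labels(scraped, target, honor_labels=True))
print(merge_labels(scraped, target, honor_labels=False))
```

This is why Pushgateway and federation jobs usually set `honor_labels: true`: the pushed or federated `job` and `instance` describe the real source of the data.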
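Remote Storage, Blacklists and Whitelists, Enabling HTTPS, Label Management, Black-box Monitoring, and Grafana Data Source Configuration
I. Core Concepts
1. Remote storage
- Removes the capacity limits of local storage
- Supported backends include VictoriaMetrics, Thanos, and Cortex
- Configure remote_write to push samples to the remote endpoint
2. Blacklists and whitelists
| Policy | Exporter side | Server side |
|---|---|---|
| Blacklist | --no-collector.cpu | exclude[]: cpu |
| Whitelist | --collector.disable-defaults --collector.cpu | collect[]: cpu,uname |
3. Label processing flow
Service discovery → configuration → relabel_configs → scrape → metric_relabel_configs
4. Black-box vs. white-box monitoring
- Black-box monitoring: external probing (HTTP/TCP/ICMP)
- White-box monitoring: internal metrics (CPU/memory/disk)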
II. Key Commands and Configuration
1. Deploying VictoriaMetrics
# Download
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.93.16/victoria-metrics-linux-amd64-v1.93.16.tar.gz
# Start (a bare number for -retentionPeriod is in months)
victoria-metrics-prod \
  -httpListenAddr=0.0.0.0:8428 \
  -storageDataPath=/data/victoria-metrics \
  -retentionPeriod=3
2. Prometheus remote storage configuration
# prometheus.yml
remote_write:
  - url: http://10.0.0.43:8428/api/v1/write
3. Enabling HTTPS in Prometheus
# Generate the CA certificate
openssl genrsa -out ca.key 4096
openssl req -x509 -new -nodes -sha512 -days 3650 \
  -subj "/C=CN/ST=Beijing/L=Beijing/O=example/CN=yinzhengjie.com" \
  -key ca.key -out ca.crt
# Generate the server certificate
openssl genrsa -out prometheus.key 4096
openssl req -new -subj "/C=CN/ST=Beijing/CN=prometheus.yinzhengjie.com" \
  -key prometheus.key -out prometheus.csr
openssl x509 -req -days 3650 \
  -CA ca.crt -CAkey ca.key -CAcreateserial \
  -in prometheus.csr -out prometheus.crt
# auth.yml
tls_server_config:
  cert_file: prometheus.crt
  key_file: prometheus.key
basic_auth_users:
  admin: $2b$10$...  # bcrypt-hashed password
# Add to the startup flags
--web.config.file=auth.yml
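4. Enabling authenticated access
# Generate a password hash
python3 -c 'import bcrypt; print(bcrypt.hashpw("password".encode(), bcrypt.gensalt()).decode())'
# Requests must now authenticate (-k skips verification of the self-signed certificate)
curl -k -u admin:password https://IP:9090/metrics
5. Node Exporter blacklist and whitelist
# Blacklist (exclude CPU metrics)
./node_exporter --no-collector.cpu
# Whitelist (collect only cpu and uname)
./node_exporter --collector.disable-defaults --collector.cpu --collector.uname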
6. Blacklist and whitelist on the Prometheus server side
# Blacklist
- job_name: "blacklist"
  params:
    exclude[]: [cpu]
  static_configs:
    - targets: ["10.0.0.42:9100"]
# Whitelist
- job_name: "whitelist"
  params:
    collect[]: [uname, diskstats]
  static_configs:
    - targets: ["10.0.0.41:9100"]
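7. Label management with relabel_configs
# Attach labels to a target
- job_name: "custom-labels"
  static_configs:
    - targets: ["10.0.0.41:9100"]
      labels:
        school: oldboyedu
        class: linux102
  # Build an endpoint label from existing labels.
  # https? is used because with (http|https) the first alternative wins,
  # so an https target would be split as "http" + "s...".
  relabel_configs:
    - source_labels: [__scheme__, __address__, __metrics_path__]
      separator: ""
      regex: "(https?)(.*)"
      target_label: endpoint
      replacement: "${1}://${2}"
      action: replace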
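The `replace` action can be traced step by step: join the source label values with the separator, match the anchored regex against the result, then expand the replacement into the target label. Below is a rough Python model of those steps. `relabel_replace` is a hypothetical helper, and Python's `re` stands in for RE2, which follows the same leftmost-first alternation rule in this case.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label,
                    replacement="$1", separator=";"):
    """Apply one relabel rule with action: replace to a label set."""
    joined = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, joined)  # relabel regexes are anchored at both ends
    if match is None:
        return dict(labels)  # no match: the label set is left unchanged
    # Translate ${1} / $1 references into Python's \g<1> syntax before expanding.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


labels = {"__scheme__": "https", "__address__": "10.0.0.41:9100",
          "__metrics_path__": "/metrics"}
sources = ["__scheme__", "__address__", "__metrics_path__"]
for regex in ("(https?)(.*)", "(http|https)(.*)"):
    out = relabel_replace(labels, sources, regex, "endpoint",
                          replacement="${1}://${2}", separator="")
    print(regex, "->", out["endpoint"])
```

With an empty separator the scheme and address run together, so the regex has to take the optional "s" greedily for an https target. The second pattern shows how an alternation that tries "http" first leaves the "s" in the address part.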
8. Deploying Blackbox Exporter
# Deploy
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.28.0/blackbox_exporter-0.28.0.linux-amd64.tar.gz
./blackbox_exporter
# Web UI
http://IP:9115
9. HTTP website monitoring
- job_name: 'blackbox-http'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://www.oldboyedu.com/
        - http://10.0.0.31:9090
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.0.0.43:9115
Grafana dashboard IDs: 7587, 13659
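The three relabel steps in this job redirect the scrape: the listed target becomes the `target` query parameter, it is kept as the `instance` label, and the address actually scraped is the Blackbox Exporter itself. The Python sketch below follows those steps for one target. `blackbox_scrape` is a hypothetical helper, and the `http` scheme for the exporter is an assumption.

```python
from urllib.parse import urlencode


def blackbox_scrape(target, module, exporter):
    """Mimic the three relabel steps used for Blackbox Exporter jobs."""
    labels = {"__address__": target, "__param_module": module}
    labels["__param_target"] = labels["__address__"]  # step 1: target becomes ?target=
    labels["instance"] = labels["__param_target"]     # step 2: keep it as the instance label
    labels["__address__"] = exporter                  # step 3: scrape the exporter instead
    query = urlencode({"module": labels["__param_module"],
                       "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}", labels["instance"]


url, instance = blackbox_scrape("https://www.oldboyedu.com/", "http_2xx", "10.0.0.43:9115")
print(url)
print(instance)
```

Requesting the printed URL by hand is a convenient way to debug a probe, because it returns the same metrics that Prometheus would scrape.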
10. ICMP host liveness monitoring
- job_name: 'blackbox-icmp'
  metrics_path: /probe
  params:
    module: [icmp]
  static_configs:
    - targets:
        - 10.0.0.41
        - 10.0.0.42
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.0.0.43:9115
11. TCP port monitoring
- job_name: 'blackbox-tcp'
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
    - targets:
        - 10.0.0.41:80
        - 10.0.0.42:22
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.0.0.43:9115
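12. Configuring MySQL as Grafana's database
# /etc/grafana/grafana.ini
# These keys belong to the [database] section, which stores Grafana's own data
# (users, dashboards); data sources for queries are added separately.
[database]
type = mysql
host = 10.0.0.41:3306
name = grafana
user = grafana
password = password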
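III. Common Interview Questions
- Local storage vs. remote storage?
  - Local: simple, but limited by capacity
  - Remote: scales well and supports long-term retention
- relabel_configs vs. metric_relabel_configs?
  - relabel_configs: rewrites target labels before the scrape
  - metric_relabel_configs: rewrites metric labels after the scrape
- When are blacklists and whitelists used?
  - Blacklist: exclude sensitive or useless metrics
  - Whitelist: collect only the metrics that are needed
- Which probe types does black-box monitoring use?
  - HTTP/HTTPS: website availability
  - TCP: port liveness
  - ICMP: host liveness
  - DNS: name resolution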
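Alertmanager Alerting, and etcd Cluster Backup, Restore, and Monitoring in Practice
I. Core Concepts
1. Core Alertmanager features
- Deduplication: collapses identical alerts
- Grouping: batches similar alerts into one notification
- Routing: dispatches alerts by label
- Silencing: temporarily mutes alerts
- Inhibition: mutes related alerts while a triggering alert is firing
2. etcd characteristics
- Distributed key-value store
- Uses the Raft consensus algorithm
- Ports: 2379 (clients), 2380 (peer communication)
- Backing store for Kubernetes
3. Alert states
| State | Description |
|---|---|
| Inactive | Not triggered |
| Pending | Triggered, but the "for" duration has not yet elapsed |
| Firing | Triggered for the full duration; the alert is sent |
| Resolved | Recovered |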
II. Key Commands and Configuration
1. Deploying Alertmanager
# Deploy
wget https://github.com/prometheus/alertmanager/releases/download/v0.31.1/alertmanager-0.31.1.linux-amd64.tar.gz
./alertmanager
# Web UI
http://IP:9093
2. Basic alerting configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: 'alert@example.com'
  smtp_smarthost: 'smtp.example.com:465'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'team-email'
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        send_resolved: true
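With `group_by: ['alertname']`, alerts that share an alertname land in the same group and go out as one notification. The grouping step can be sketched in Python as below. `group_alerts` and the sample alerts are illustrative assumptions, and the timing settings (`group_wait`, `group_interval`, `repeat_interval`) are not modelled.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of their group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


alerts = [
    {"alertname": "NodeDown", "instance": "10.0.0.41:9100"},
    {"alertname": "NodeDown", "instance": "10.0.0.42:9100"},
    {"alertname": "DiskFull", "instance": "10.0.0.41:9100"},
]
for key, members in group_alerts(alerts, ["alertname"]).items():
    print(key, len(members))
```

Adding `instance` to `group_by` would split the two NodeDown alerts into separate notifications.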
3. Child route configuration
route:
  receiver: 'sre_system'
  routes:
    - receiver: 'sre_dba'
      match_re:
        job: mysql.*
      continue: true  # keep evaluating the routes that follow
    - receiver: 'sre_k8s'
      match_re:
        job: k8s.*
      continue: true
receivers:
  - name: 'sre_dba'
    email_configs:
      - to: 'dba@example.com'
  - name: 'sre_k8s'
    email_configs:
      - to: 'k8s@example.com'
  - name: 'sre_system'
    email_configs:
      - to: 'system@example.com'
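How an alert travels through this tree can be traced with a short model: child routes are tried in order, `continue: true` lets evaluation move on to the next sibling, and the parent receiver is used only when no child matches. This is a Python sketch of a one-level tree. `route_alert` is a hypothetical helper and Python's `re` stands in for Alertmanager's regex engine.

```python
import re


def route_alert(labels, routes, default_receiver):
    """Return the receivers an alert reaches in a one-level routing tree."""
    receivers = []
    for route in routes:
        matched = all(
            re.fullmatch(pattern, labels.get(name, "")) is not None
            for name, pattern in route["match_re"].items()
        )
        if matched:
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break  # without continue, the first matching route wins
    # Only alerts that match no child route fall back to the parent receiver.
    return receivers or [default_receiver]


routes = [
    {"receiver": "sre_dba", "match_re": {"job": "mysql.*"}, "continue": True},
    {"receiver": "sre_k8s", "match_re": {"job": "k8s.*"}, "continue": True},
]
print(route_alert({"job": "mysql-exporter"}, routes, "sre_system"))
print(route_alert({"job": "node-exporter"}, routes, "sre_system"))
```

In this model an alert from a MySQL job reaches only sre_dba. It does not also reach sre_system, because the parent receiver applies only to alerts that no child route matched.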
4. Prometheus alerting rules
# rules.yml
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "The node has been down for more than 1 minute"
# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["10.0.0.43:9093"]
rule_files:
  - "rules.yml"
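The `for: 1m` clause is what separates Pending from Firing: the expression has to keep returning a result on every evaluation for the whole duration before the alert fires. Below is a minimal Python sketch of that state machine. `alert_states` is a hypothetical helper, and the 15-second evaluation interval in the example is an assumption.

```python
def alert_states(condition_by_eval, for_evals):
    """Return the alert state after each rule evaluation.

    condition_by_eval: booleans, True when the expression returned a result.
    for_evals: how many evaluation intervals the `for` duration spans.
    """
    states = []
    active_for = None  # evaluations elapsed since the condition first became true
    for is_true in condition_by_eval:
        if not is_true:
            active_for = None
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + 1
        states.append("firing" if active_for >= for_evals else "pending")
    return states


# Hypothetical timeline: one evaluation every 15s, so for: 1m spans 4 intervals.
timeline = [False, True, True, True, True, True, False]
print(alert_states(timeline, for_evals=4))
```

A single evaluation where the condition is false resets the timer, which is how `for` filters out short blips.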
5. Custom email template
<!-- /oldboyedu/softwares/alertmanager/tmpl/email.tmpl -->
{{ define "email.html" }}
<h2>Alert Notification</h2>
<table border="1">
<tr><th>Alert name</th><th>Instance</th><th>Threshold</th><th>Time</th></tr>
{{ range .Alerts }}
<tr>
<td>{{ .Labels.alertname }}</td>
<td>{{ .Labels.instance }}</td>
<td>{{ .Annotations.value }}</td>
<td>{{ .StartsAt }}</td>
</tr>
{{ end }}
</table>
{{ end }}
# Reference the template in alertmanager.yml
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'team@example.com'
        html: '{{ template "email.html" . }}'
        headers: { Subject: "[Alert] Service abnormal" }
templates:
  - '/oldboyedu/softwares/alertmanager/tmpl/*.tmpl'
6. DingTalk alerts
# Deploy the DingTalk webhook adapter
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
# config.yml
targets:
  ops-team:
    url: https://oapi.dingtalk.com/robot/send?access_token=TOKEN
    secret: "SEC..."  # signing secret
# Start
./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060"
# alertmanager.yml
receivers:
  - name: 'dingtalk'
    webhook_configs:
      - url: 'http://10.0.0.42:8060/dingtalk/ops-team/send'
        send_resolved: true
7. Alert silencing
- Create silences in the Alertmanager web UI
- Set the matchers and the duration
- Useful for planned maintenance windows
8. Alert inhibition
# alertmanager.yml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['dc']  # inhibit only when both alerts are in the same data center
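Read as a predicate, this rule says that a warning alert is muted while some critical alert with the same `dc` label is firing. The Python sketch below checks that predicate against sample alerts. `is_inhibited` and the alert data are hypothetical, and only exact-match rules are modelled.

```python
def is_inhibited(target, firing_alerts, rule):
    """Check whether `target` is muted by any firing alert under one inhibit rule."""
    def has_all(alert, wanted):
        return all(alert.get(k) == v for k, v in wanted.items())

    if not has_all(target, rule["target_match"]):
        return False
    for source in firing_alerts:
        if source is target or not has_all(source, rule["source_match"]):
            continue
        # Missing labels compare as empty strings on both sides.
        if all(source.get(k, "") == target.get(k, "") for k in rule["equal"]):
            return True
    return False


rule = {"source_match": {"severity": "critical"},
        "target_match": {"severity": "warning"},
        "equal": ["dc"]}
critical = {"alertname": "NodeDown", "severity": "critical", "dc": "bj"}
warning_same_dc = {"alertname": "HighLoad", "severity": "warning", "dc": "bj"}
warning_other_dc = {"alertname": "HighLoad", "severity": "warning", "dc": "sh"}
print(is_inhibited(warning_same_dc, [critical], rule))
print(is_inhibited(warning_other_dc, [critical], rule))
```

Because missing labels compare as equal, an `equal` label that neither alert carries does not prevent inhibition, so the listed labels should be ones the alerts actually have.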
9. Deploying an etcd cluster
# Download
wget https://github.com/etcd-io/etcd/releases/download/v3.6.10/etcd-v3.6.10-linux-amd64.tar.gz
# Configuration file: /oldboyedu/softwares/etcd/etcd.config.yml
name: 'etcd-node1'
data-dir: /var/lib/etcd
listen-peer-urls: 'https://10.0.0.41:2380'
listen-client-urls: 'https://10.0.0.41:2379,http://127.0.0.1:2379'
initial-advertise-peer-urls: 'https://10.0.0.41:2380'
advertise-client-urls: 'https://10.0.0.41:2379'
initial-cluster: 'etcd-node1=https://10.0.0.41:2380,etcd-node2=https://10.0.0.42:2380,etcd-node3=https://10.0.0.43:2380'
initial-cluster-token: 'etcd-cluster'
initial-cluster-state: 'new'
client-transport-security:
  cert-file: '/oldboyedu/certs/etcd/etcd.pem'
  key-file: '/oldboyedu/certs/etcd/etcd-key.pem'
  trusted-ca-file: '/oldboyedu/certs/etcd/ca.pem'
# Start
systemctl start etcd
10. Basic etcd operations
# Write a key
etcdctl put /key value
# Read keys
etcdctl get /key
etcdctl get / --prefix --keys-only  # list all keys under /
# Delete keys
etcdctl del /key
etcdctl del / --prefix  # delete everything under /
# Show cluster status
etcdctl endpoint status --write-out=table
11. etcd backup and restore
# Back up
etcdctl snapshot save /backup/etcd-$(date +%F).db
# Inspect a snapshot
etcdutl snapshot status /backup/etcd.db -w table
# Restore (stop the etcd service first)
etcdutl snapshot restore /backup/etcd.db --data-dir=/var/lib/etcd-restore
# Point the configuration file at the new data directory, then restart etcd
12. Monitoring etcd with Prometheus
# prometheus.yml
- job_name: "etcd-cluster"
  scheme: https
  tls_config:
    ca_file: certs/etcd/ca.pem
    cert_file: certs/etcd/etcd.pem
    key_file: certs/etcd/etcd-key.pem
  static_configs:
    - targets:
        - 10.0.0.41:2379
        - 10.0.0.42:2379
        - 10.0.0.43:2379
Grafana dashboard IDs: 21473, 3070, 10323
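III. Common Interview Questions
- What does Alertmanager do?
  - Deduplicates, groups, and routes alerts
  - Sends notifications through multiple channels
  - Provides silencing and inhibition
- Alert silencing vs. alert inhibition?
  - Silencing: set manually, a temporary mute
  - Inhibition: applied automatically when a rule's condition is met
- Why does etcd need three nodes?
  - Raft requires a majority (quorum) of members to agree
  - An odd member count keeps the cluster highly available; three nodes tolerate one failure
- What is the etcd backup and restore procedure?
  - Take snapshots on a regular schedule
  - Stop the service
  - Restore the data directory
  - Restart the service
- How do you implement alert severity levels?
  - Define different severity labels
  - Configure multiple routing rules
  - Send to different receivers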
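The answer to the etcd node-count question comes down to majority arithmetic: a cluster of n members needs floor(n/2) + 1 votes, so it survives n minus that many failures. A short Python check (`quorum` and `fault_tolerance` are illustrative helper names):

```python
def quorum(members):
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def fault_tolerance(members):
    """How many nodes can fail while a majority remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5):
    print(n, quorum(n), fault_tolerance(n))
```

Four nodes tolerate one failure, the same as three, which is why clusters are normally sized at three or five members.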
Appendix: Commonly Used Grafana Dashboard IDs
| Monitored target | Dashboard ID |
|---|---|
| Node Exporter | 1860 |
| MySQL | 14057, 17320 |
| Redis | 11835, 14091 |
| MongoDB | 16504 |
| Nginx (vts) | 9785, 2949 |
| Docker (cAdvisor) | 10619 |
| ElasticSearch | 14191 |
| Kafka | 21078, 7589 |
| Zookeeper | 10465 |
| Consul | 12049 |
| Blackbox Exporter | 7587, 13659 |
| etcd | 21473, 3070, 10323 |
| Windows | 23847, 14694 |
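Troubleshooting Common Errors
| Error message | Cause | Fix |
|---|---|---|
| server returned HTTP status 400 | Protocol mismatch | Check the scheme setting |
| failed to verify certificate | Certificate verification failed | Set ca_file in tls_config, or insecure_skip_verify: true for testing only |
| server returned HTTP status 401 | Authentication failed | Configure basic_auth |
| 550 Connection frequency limited | Mail sending rate limit | Switch mailboxes or send less often |
Note: This document distills the key points of a five-day course and covers the full Prometheus ecosystem, from deployment, configuration, and monitoring through alerting to advanced features. It is suited to interview review and hands-on reference.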

