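Prometheus Deployment, PromQL, and Monitoring Common System Services

I. Core Concepts

1. Prometheus architecture components

Prometheus Server: scrapes and stores metrics, and serves the query API and web UI
Grafana: dashboard visualization
Pushgateway: receives pushed custom metrics (short-lived jobs)
Alertmanager: alert handling and notification (DingTalk, WeCom, email, and so on)
Exporters: agents on the monitored side that expose metrics

2. Prometheus metric types (interview focus)

Type	Description	Typical use
Gauge	Current value that can go up or down	Memory in use, disk capacity, temperature
Counter	Monotonically increasing count (resets to 0 on restart)	Total HTTP requests, total errors
Histogram	Observations counted into buckets	Response-time quantile analysis
Summary	Quantiles computed on the client	Precomputed quantile metrics

3. Black-box vs white-box monitoring

White-box monitoring: internal metrics that can reveal a problem before it causes an outage (for example, memory usage)
Black-box monitoring: external probing that detects a problem once it has happened (for example, a website is unreachable)

II. Key Commands and Configuration

1. Deploying Prometheus Server

# Download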
wget https://github.com/prometheus/prometheus/releases/download/v3.10.0/prometheus-3.10.0.linux-amd64.tar.gz

# Extract and run
tar xf prometheus-3.10.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/prometheus-3.10.0.linux-amd64/
./prometheus

# Web UI
http://IP:9090/

2. Deploying Node Exporter (monitoring Linux hosts)

# Download and run
wget https://github.com/prometheus/node_exporter/releases/download/v1.10.2/node_exporter-1.10.2.linux-amd64.tar.gz
tar xf node_exporter-1.10.2.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/node_exporter-1.10.2.linux-amd64/
./node_exporter

# Metrics endpoint
http://IP:9100/metrics

3. Prometheus configuration file

global:
  scrape_interval: 3s  # scrape interval

scrape_configs:
  - job_name: "node-exporter"
    metrics_path: "/metrics"
    scheme: "http"
    static_configs:
      - targets: ["10.0.0.41:9100","10.0.0.42:9100"]

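4. Hot-reloading the configuration

# The reload endpoint only works if Prometheus was started with --web.enable-lifecycle
curl -X POST http://IP:9090/-/reload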
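
The same reload call can be scripted. The following is a minimal Python sketch using only the standard library; the function names and the address in the usage note are illustrative assumptions, not part of Prometheus.

```python
import urllib.request


def build_reload_request(base_url: str) -> urllib.request.Request:
    """Build (without sending) the POST request for the /-/reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    # An empty body with an explicit method mirrors `curl -X POST`.
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str, timeout: float = 5.0) -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Calling reload_prometheus("http://IP:9090") against a server started with --web.enable-lifecycle should return 200; an HTTP error status raises urllib.error.HTTPError instead.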

5. Installing Grafana

wget https://dl.grafana.com/enterprise/release/grafana-enterprise_9.5.21_amd64.deb
dpkg -i grafana-enterprise_9.5.21_amd64.deb
systemctl enable --now grafana-server
# Open http://IP:3000; default username/password: admin/admin

III. Common PromQL Queries (interview focus)

1. CPU usage

(1 - sum(increase(node_cpu_seconds_total{mode="idle"}[1m])) by (instance) / 
 sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100
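
To make the arithmetic behind this expression concrete, here is a small Python sketch. The function name and the sample numbers are made up for illustration; the formula is the same (1 - idle / total) * 100.

```python
def cpu_usage_percent(idle_increase: float, total_increase: float) -> float:
    """Return (1 - idle / total) * 100 for one instance over one window."""
    if total_increase <= 0:
        raise ValueError("total_increase must be positive")
    return (1 - idle_increase / total_increase) * 100


# Made-up numbers: a 4-core host accumulates about 4 * 60 = 240 CPU-seconds
# across all modes in a 1m window; suppose 180 of them were spent idle.
usage = cpu_usage_percent(idle_increase=180.0, total_increase=240.0)
print(f"{usage:.1f}%")  # prints "25.0%"
```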

2. Share of CPU time spent in system mode

(sum(increase(node_cpu_seconds_total{mode="system"}[1m])) by (instance) /
 sum(increase(node_cpu_seconds_total[1m])) by (instance)) * 100

3. Number of CPU cores

count(node_cpu_seconds_total{mode="idle"}) by (instance)

4. Node uptime (minutes)

(time() - node_boot_time_seconds) / 60

5. Exact, regex, and negative matching

# Exact match
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu="1"}

# Regex match
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu="1",mode=~"i.*"}

# Negation
node_cpu_seconds_total{instance="10.0.0.42:9100",cpu!="1"}

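IV. Common Interview Questions

How does Prometheus differ from Zabbix?
- Prometheus is a better fit for container and microservice monitoring
- Prometheus pulls metrics from targets; Zabbix also supports push (active agent mode)
- Prometheus has the powerful PromQL query language
How do the four metric types differ, and where is each used?
- Gauge: current state value
- Counter: cumulative value, normally queried with rate()/increase()
- Histogram: quantiles are computed on the server at query time from bucket counts
- Summary: quantiles are precomputed on the client
What is the workflow for monitoring a service?
- The monitored side exposes metrics
- Add the target to the Prometheus configuration
- Hot-reload the configuration
- Verify in the web UI
- Import a Grafana dashboard template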
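
For the Histogram answer above, the server-side quantile is an estimate interpolated from cumulative bucket counts. The sketch below shows that idea in plain Python. It is a simplified illustration and not Prometheus's own code; the bucket values are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The highest bucket must have upper bound +Inf, as Prometheus histograms do.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if not 0 <= q <= 1 or total == 0 or not math.isinf(buckets[-1][0]):
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Nothing to interpolate: fall back to the previous finite bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Invented data: 100 requests, cumulative counts per "le" upper bound (seconds).
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, sample))  # prints 0.75
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries fit the real distribution, which is the main trade-off against a client-side Summary.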

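Monitoring Common Middleware with Prometheus and Building Custom Grafana Dashboards

I. Core Concepts

1. How an exporter works

An exporter is an agent on the monitored side. It is responsible for:

Exposing a metrics endpoint
Collecting application metrics
Serving the data in the standard Prometheus format

2. Core Grafana components

Dashboard: a dashboard
Panel: a single chart
Row: a collapsible group of panels
Variable: a variable for dynamic filtering
DataSource: a data source

II. Key Commands and Configuration

1. Monitoring MySQL

# Run the MySQL exporter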
mysqld_exporter --config.my-cnf=/root/.my.cnf

# .my.cnf

[client]
host = 10.0.0.41
port = 3306
user = liuxin
password = liuxin

# Prometheus configuration
- job_name: "mysql-exporter"
  static_configs:
    - targets: ["10.0.0.42:9104"]

Grafana dashboard IDs: 14057, 17320

2. Monitoring Redis

# Run the Redis exporter
redis_exporter -redis.addr redis://10.0.0.41:6379 -web.listen-address :9121

# Prometheus configuration
- job_name: "redis-exporter"
  static_configs:
    - targets: ["10.0.0.42:9121"]

Grafana dashboard IDs: 11835, 14091

3. Monitoring MongoDB

# Run the MongoDB exporter
mongodb_exporter --mongodb.uri=mongodb://10.0.0.43:27017 --collect-all

# Prometheus configuration
- job_name: "mongodb-exporter"
  static_configs:
    - targets: ["10.0.0.42:9216"]

Grafana dashboard ID: 16504

4. Monitoring Nginx (requires compiling the vts module)

# Build nginx with the vts module
./configure --add-module=/root/nginx-module-vts
make && make install

# nginx.conf
http {
    vhost_traffic_status_zone;
    server {
        location /status {
            vhost_traffic_status_display;
        }
    }
}

# Prometheus configuration
- job_name: "nginx-vts"
  metrics_path: "/status/format/prometheus"
  static_configs:
    - targets: ["10.0.0.41:80"]

Grafana dashboard ID: 9785

5. Monitoring Docker (cAdvisor)

# Run cAdvisor
docker run -d \
  --volume=/:/rootfs:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  -p 18080:8080 \
  --name=cadvisor \
  gcr.io/cadvisor/cadvisor-amd64:v0.52.1

Grafana dashboard ID: 10619

6. Monitoring Elasticsearch

# Run the Elasticsearch exporter
elasticsearch_exporter --es.uri="https://elastic:123456@10.0.0.91:9200"

Grafana dashboard ID: 14191

III. Custom Grafana Dashboards

1. Defining a variable

# Name: myhost
# Variable type: Query
# Query: label_values(up{job="node-exporter"}, instance)

2. Referencing the variable

node_cpu_seconds_total{instance="$myhost"}

3. Plugin management

# Install a plugin
grafana-cli plugins install natel-discrete-panel

# Restart for the plugin to take effect
systemctl restart grafana-server

4. Dashboard backup and restore

Backup: Dashboard settings → Export → Save to file
Restore: Dashboards → Import → Upload JSON file

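IV. Common Interview Questions

What does an exporter do?
- Converts application metrics into the Prometheus format
- Exposes the metrics over an HTTP endpoint
How do you choose a Grafana dashboard template?
- Pick one that matches the exporter type
- Check version compatibility
- Adjust the PromQL as needed
What is the standard workflow for monitoring middleware?
- Deploy the matching exporter
- Add the target to the Prometheus configuration
- Import a Grafana dashboard template
- Verify that the data is displayed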

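Custom Monitoring with Pushgateway, Custom Exporters, Service Discovery, and Federation in Practice

I. Core Concepts

1. When to use Pushgateway

Monitoring ad hoc and batch jobs
Short-lived jobs that cannot be scraped on a regular schedule
Pushing custom business metrics

2. Service discovery types

Type	Description	Typical use
static_configs	Static configuration; changes require a reload	Fixed targets
file_sd_configs	File-based dynamic discovery	Targets managed by a configuration center
consul_sd_configs	Consul-based service discovery	Microservice architectures
kubernetes_sd_configs	Kubernetes service discovery	Container environments

3. Federation

A multi-level Prometheus architecture
The upper level pulls data from the lower level
Reduces I/O pressure on any single node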

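II. Key Commands and Configuration

1. Deploying and using Pushgateway

# Deploy
wget https://github.com/prometheus/pushgateway/releases/download/v1.11.2/pushgateway-1.11.2.linux-amd64.tar.gz
tar xf pushgateway-1.11.2.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/pushgateway-1.11.2.linux-amd64/
./pushgateway --web.listen-address=:9091

# Push a single value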
echo "student_online 35" | curl --data-binary @- http://10.0.0.42:9091/metrics/job/myjob/instance/10.0.0.43

# Push multiple values
cat <<EOF | curl --data-binary @- http://IP:9091/metrics/job/myjob/instance/10.0.0.41
# TYPE disk_usage gauge
disk_usage 92.56
# TYPE student_online gauge
student_online 150
EOF

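# Delete data (the path must be the full grouping key used when pushing; /job/myjob alone only removes the group that has no other labels)
curl -X DELETE http://IP:9091/metrics/job/myjob
curl -X PUT http://IP:9091/api/v1/admin/wipe  # delete everything (requires starting Pushgateway with --web.enable-admin-api)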
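
The push commands above send a small text payload to a URL built from the grouping key. The Python sketch below shows how both pieces are put together. The helper names are hypothetical, and it assumes the job and instance values contain no "/" characters.

```python
def push_url(base_url: str, job: str, instance: str) -> str:
    """Build the grouping-key URL: /metrics/job/<job>/instance/<instance>."""
    return f"{base_url.rstrip('/')}/metrics/job/{job}/instance/{instance}"


def render_gauges(metrics: dict[str, float]) -> str:
    """Render gauges in the text exposition format (one TYPE line per metric)."""
    lines = []
    for name, value in metrics.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    # The request body has to end with a newline.
    return "\n".join(lines) + "\n"


print(push_url("http://10.0.0.42:9091", "myjob", "10.0.0.41"))
print(render_gauges({"disk_usage": 92.56, "student_online": 150}), end="")
```

The rendered string is what `curl --data-binary @-` receives on standard input in the shell examples.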

2. Scraping Pushgateway from Prometheus

- job_name: "pushgateway"
  honor_labels: true  # on a label conflict, keep the labels from the pushed data
  static_configs:
    - targets: ["10.0.0.42:9091"]

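3. Custom monitoring script example (TCP connection states)

#!/bin/bash
# Count TCP connections per state and push the result to Pushgateway.
pushgateway_url="http://10.0.0.42:9091/metrics/job/tcp_status/instance/10.0.0.31"
# State names as printed in the first column of `ss -tan`
states="SYN-SENT SYN-RECV FIN-WAIT-1 FIN-WAIT-2 TIME-WAIT CLOSE CLOSE-WAIT LAST-ACK LISTEN CLOSING ESTAB UNKNOWN"

{
    echo "# HELP tcp_connections Number of TCP connections in each state."
    echo "# TYPE tcp_connections gauge"
    for s in $states; do
        # Compare the whole State column so that CLOSE does not also count CLOSE-WAIT
        count=$(ss -tan | awk -v s="$s" '$1 == s {n++} END {print n+0}')
        echo "tcp_connections{state=\"$s\"} $count"
    done
} > /tmp/tcp.txt

curl --data-binary @/tmp/tcp.txt "$pushgateway_url"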

4. Custom exporter in Python

from prometheus_client import start_http_server, Counter, Summary
from flask import Flask, jsonify

app = Flask(__name__)
# Summary: records the count and total duration of handled requests
REQUEST_TIME = Summary('request_processing_seconds', 'Time spent')
# Counter: exposed as request_count_total
COUNTER_TIME = Counter("request_count", "Total request count")

@app.route("/apps")
@REQUEST_TIME.time()  # time every call to the view function
def requests_count():
    COUNTER_TIME.inc()
    return jsonify({"office": "https://www.oldboyedu.com"})

if __name__ == "__main__":
    start_http_server(8000)  # metrics are served on :8000/metrics
    app.run(host='0.0.0.0', port=8001)  # the business API listens on :8001

5. File-based service discovery

# prometheus.yml
- job_name: "file-sd"
  file_sd_configs:
    - files:
        - /tmp/targets.json
        - /tmp/targets.yaml

# targets.json
[
  {
    "targets": ["10.0.0.41:9100"],
    "labels": {"school": "oldboyedu", "class": "linux102"}
  }
]
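
A targets file like this is usually generated by a script or a configuration system. The sketch below is one hedged way to do that in Python; the function names are hypothetical. It writes to a temporary file and renames it, so Prometheus never reads a half-written file.

```python
import json
import os
import tempfile


def build_file_sd(targets: list[str], labels: dict[str, str]) -> str:
    """Return file_sd JSON: a list of target groups with targets and labels."""
    return json.dumps([{"targets": targets, "labels": labels}], indent=2)


def write_atomically(path: str, content: str) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    # mkstemp creates the file as 0600; make it readable by the Prometheus user.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)


print(build_file_sd(["10.0.0.41:9100"], {"school": "oldboyedu", "class": "linux102"}))
```

Prometheus watches the listed files and picks up changes without a reload, so a cron job calling write_atomically("/tmp/targets.json", ...) would be enough to keep targets current.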

6. Consul-based service discovery

# prometheus.yml
- job_name: "consul-sd"
  consul_sd_configs:
    - server: 10.0.0.43:8500
  relabel_configs:
    - source_labels: [__meta_consul_service]
      regex: consul
      action: drop  # drop Consul's own service

# Register a service with Consul
curl -X PUT -d '{"id":"node42","name":"node-exporter","address":"10.0.0.42","port":9100}' \
  http://10.0.0.43:8500/v1/agent/service/register

# Deregister the service
curl -X PUT http://10.0.0.43:8500/v1/agent/service/deregister/node42

7. Federation configuration

# Configuration on the upper-level Prometheus
- job_name: "federate-32"
  metrics_path: "/federate"
  honor_labels: true
  params:
    "match[]":
      - '{job="prometheus"}'
      - '{__name__=~"node.*"}'
  static_configs:
    - targets: ["10.0.0.32:9090"]

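III. Common Interview Questions

What are Pushgateway's drawbacks?
- Pushed data is lost on restart unless --persistence.file is set
- It is a single point of failure
- It is not suited to monitoring long-running services
Static configuration vs dynamic discovery?
- Static: target changes require a hot reload or a restart
- Dynamic: targets are discovered automatically, with no restart needed
What does the honor_labels parameter do?
- true: on a conflict, the labels in the scraped data are kept and the server-side labels are ignored
- false: on a conflict, the scraped label is renamed with an "exported_" prefix
Where is federation used?
- Monitoring across multiple data centers
- Layered architectures that spread the load
- Aggregating data for display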
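
The honor_labels answer above can be modeled as a small label merge. This Python sketch is a simplified illustration of the conflict rule only, not Prometheus's implementation; the function name and sample labels are assumptions.

```python
def merge_labels(scraped: dict[str, str], target: dict[str, str], honor_labels: bool) -> dict[str, str]:
    """Merge labels from scraped data with the labels Prometheus attaches to the target."""
    merged = dict(scraped)
    for name, value in target.items():
        if name in scraped:
            if honor_labels:
                continue  # keep the scraped value, ignore the server-side one
            # Conflict without honor_labels: rename the scraped label.
            merged["exported_" + name] = scraped[name]
        merged[name] = value
    return merged


pushed = {"job": "myjob", "instance": "10.0.0.43"}
server_side = {"job": "pushgateway", "instance": "10.0.0.42:9091"}
print(merge_labels(pushed, server_side, honor_labels=True))
print(merge_labels(pushed, server_side, honor_labels=False))
```

With honor_labels set to true the series keeps job="myjob", which is why the Pushgateway scrape job enables it.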

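Remote Storage, Blacklists and Whitelists, Enabling HTTPS, Label Management, Black-box Monitoring, Grafana Data Source Configuration

I. Core Concepts

1. Remote storage

Removes the capacity limit of local storage
Supported backends include VictoriaMetrics, Thanos, and Cortex
Configure remote_write to push samples to the remote end

2. Blacklists and whitelists

Policy	Exporter side	Server side
Blacklist	`--no-collector.cpu`	`exclude[]: cpu`
Whitelist	`--collector.disable-defaults --collector.cpu`	`collect[]: cpu,uname`

3. Label management flow

Service discovery → configuration → relabel_configs → scrape → metric_relabel_configs

4. Black-box vs white-box monitoring

Black-box monitoring: external probing (HTTP/TCP/ICMP)
White-box monitoring: internal metrics (CPU/memory/disk)

II. Key Commands and Configuration

1. Deploying VictoriaMetrics

# Download
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.93.16/victoria-metrics-linux-amd64-v1.93.16.tar.gz

# Start (a -retentionPeriod value without a unit suffix is counted in months)
victoria-metrics-prod \
  -httpListenAddr=0.0.0.0:8428 \
  -storageDataPath=/data/victoria-metrics \
  -retentionPeriod=3

2. Prometheus remote storage configuration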

# prometheus.yml
remote_write:
  - url: http://10.0.0.43:8428/api/v1/write

3. Enabling HTTPS in Prometheus

# Generate the CA certificate
openssl genrsa -out ca.key 4096
openssl req -x509 -new -nodes -sha512 -days 3650 \
  -subj "/C=CN/ST=Beijing/L=Beijing/O=example/CN=yinzhengjie.com" \
  -key ca.key -out ca.crt

# Generate the server certificate
openssl genrsa -out prometheus.key 4096
openssl req -new -subj "/C=CN/ST=Beijing/CN=prometheus.yinzhengjie.com" \
  -key prometheus.key -out prometheus.csr
openssl x509 -req -days 3650 \
  -CA ca.crt -CAkey ca.key -CAcreateserial \
  -in prometheus.csr -out prometheus.crt

# auth.yml
tls_server_config:
  cert_file: prometheus.crt
  key_file: prometheus.key
basic_auth_users:
  admin: $2b$10$...  # bcrypt-hashed password

# Add to the startup flags
--web.config.file=auth.yml

4. Enabling authenticated access

# Generate the password hash
python3 -c 'import bcrypt; print(bcrypt.hashpw("password".encode(), bcrypt.gensalt()).decode())'

# Requests now need credentials
curl -k -u admin:password https://IP:9090/metrics

5. Node Exporter blacklist and whitelist

# Blacklist (exclude the CPU collector)
./node_exporter --no-collector.cpu

# Whitelist (collect only cpu and uname)
./node_exporter --collector.disable-defaults --collector.cpu --collector.uname

6. Blacklist and whitelist on the Prometheus Server side

# Blacklist
- job_name: "blacklist"
  params:
    exclude[]: [cpu]
  static_configs:
    - targets: ["10.0.0.42:9100"]

# Whitelist
- job_name: "whitelist"
  params:
    collect[]: [uname, diskstats]
  static_configs:
    - targets: ["10.0.0.41:9100"]

7. Label management: relabel_configs

# Attach labels to a target
- job_name: "custom-labels"
  static_configs:
    - targets: ["10.0.0.41:9100"]
      labels:
        school: oldboyedu
        class: linux102

# Rewrite labels (https? is used because (http|https) would match only "http" and leave the "s" in the second group)
relabel_configs:
  - source_labels: [__scheme__, __address__, __metrics_path__]
    separator: ""
    regex: "(https?)(.*)"
    target_label: endpoint
    replacement: "${1}://${2}"
    action: replace
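
To see what a replace rule of this kind does step by step, here is a Python sketch that imitates it: join the source label values with the separator, match the regex against the whole string, and expand the replacement into the target label. It uses Python's `re` module rather than the RE2 engine Prometheus uses, the function name is hypothetical, and the pattern is written as `(https?)(.*)`, which is a choice made for this sketch.

```python
import re


def relabel_replace(labels, source_labels, regex, target_label,
                    replacement="$1", separator=";"):
    """Imitate a relabel rule with action: replace on a dict of labels."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # relabel regexes are fully anchored
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Convert $1 / ${1} references into Python's \g<1> template syntax.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


target = {"__scheme__": "https", "__address__": "10.0.0.41:9100", "__metrics_path__": "/metrics"}
out = relabel_replace(
    target,
    ["__scheme__", "__address__", "__metrics_path__"],
    regex="(https?)(.*)",
    target_label="endpoint",
    replacement="${1}://${2}",
    separator="",
)
print(out["endpoint"])  # prints https://10.0.0.41:9100/metrics
```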

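8. Deploying Blackbox Exporter

# Deploy
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.28.0/blackbox_exporter-0.28.0.linux-amd64.tar.gz
tar xf blackbox_exporter-0.28.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/blackbox_exporter-0.28.0.linux-amd64/
./blackbox_exporter

# Web UI: http://IP:9115

9. HTTP website monitoring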

- job_name: 'blackbox-http'
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://www.oldboyedu.com/
        - http://10.0.0.31:9090
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.0.0.43:9115

Grafana dashboard IDs: 7587, 13659
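
The three relabel steps in the blackbox job are easier to follow when traced by hand. The Python sketch below walks one target through them and builds the URL that Prometheus would end up scraping. The function names are hypothetical and the trace is a simplified model.

```python
from urllib.parse import urlencode


def apply_blackbox_relabel(address: str, exporter: str) -> dict[str, str]:
    """Trace the three relabel steps for a single configured target."""
    labels = {"__address__": address}
    labels["__param_target"] = labels["__address__"]  # step 1: target becomes ?target=
    labels["instance"] = labels["__param_target"]     # step 2: keep it as the instance label
    labels["__address__"] = exporter                  # step 3: scrape the exporter instead
    return labels


def probe_url(labels: dict[str, str], module: str) -> str:
    """Build the resulting scrape URL on the blackbox exporter."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"


labels = apply_blackbox_relabel("https://www.oldboyedu.com/", "10.0.0.43:9115")
print(labels["instance"])
print(probe_url(labels, "http_2xx"))
```

The same trace applies to the ICMP and TCP jobs; only the module name and the target format change.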

10. ICMP host liveness monitoring

- job_name: 'blackbox-icmp'
  metrics_path: /probe
  params:
    module: [icmp]
  static_configs:
    - targets:
        - 10.0.0.41
        - 10.0.0.42
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.0.0.43:9115

11. TCP port monitoring

- job_name: 'blackbox-tcp'
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
    - targets:
        - 10.0.0.41:80
        - 10.0.0.42:22
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 10.0.0.43:9115

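12. Configuring MySQL as Grafana's backend database

# /etc/grafana/grafana.ini ([database] section: where Grafana stores its own users and dashboards, not a query data source)
[database]
type = mysql
host = 10.0.0.41:3306
name = grafana
user = grafana
password = password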

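III. Common Interview Questions

Local storage vs remote storage?
- Local: simple, but limited by disk capacity
- Remote: scales well and supports long-term retention
relabel_configs vs metric_relabel_configs?
- relabel_configs: rewrites target labels before the scrape
- metric_relabel_configs: rewrites metric labels after the scrape
When are blacklists and whitelists used?
- Blacklist: exclude sensitive or useless metrics
- Whitelist: collect only the metrics you need
What probe types does black-box monitoring offer?
- HTTP/HTTPS: website availability
- TCP: port liveness
- ICMP: host liveness
- DNS: name resolution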

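Alertmanager Alerting, and etcd Cluster Backup, Restore, and Monitoring in Practice

I. Core Concepts

1. Core Alertmanager features

Deduplication: collapses duplicate alerts
Grouping: batches related alerts into a single notification
Routing: dispatches alerts to receivers based on labels
Silencing: temporarily mutes matching alerts
Inhibition: suppresses related alerts while a specified alert is firing

2. etcd characteristics

Distributed key-value store
Uses the Raft consensus algorithm
Ports: 2379 (clients), 2380 (peer communication)
Backing store for Kubernetes

3. Alert states

State	Description
Inactive	Alert condition not met
Pending	Condition met, but not yet for the full `for` duration
Firing	Condition met for the full `for` duration; the alert is sent to Alertmanager
Resolved	Condition no longer met; the alert has cleared

II. Key Commands and Configuration

1. Deploying Alertmanager

# Deploy
wget https://github.com/prometheus/alertmanager/releases/download/v0.31.1/alertmanager-0.31.1.linux-amd64.tar.gz
tar xf alertmanager-0.31.1.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/alertmanager-0.31.1.linux-amd64/
./alertmanager

# Web UI: http://IP:9093

2. Basic alerting configuration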

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: 'alert@example.com'
  smtp_smarthost: 'smtp.example.com:465'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname']
  group_wait: 5s
  group_interval: 10s
  repeat_interval: 5m
  receiver: 'team-email'

receivers:
- name: 'team-email'
  email_configs:
  - to: 'team@example.com'
    send_resolved: true

3. Sub-route configuration

route:
  receiver: 'sre_system'
  routes:
    - receiver: 'sre_dba'
      match_re:
        job: mysql.*
      continue: true  # keep evaluating the sibling routes that follow
    - receiver: 'sre_k8s'
      match_re:
        job: k8s.*
      continue: true

receivers:
- name: 'sre_dba'
  email_configs:
  - to: 'dba@example.com'
- name: 'sre_k8s'
  email_configs:
  - to: 'k8s@example.com'
- name: 'sre_system'
  email_configs:
  - to: 'system@example.com'

4. Prometheus alerting rules

# rules.yml
groups:
- name: node-alerts
  rules:
  - alert: NodeDown
    expr: up{job="node-exporter"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.instance }} is down"
      description: "The node has been down for more than 1 minute"

# prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["10.0.0.43:9093"]

rule_files:
  - "rules.yml"

5. Custom email template

<!-- /oldboyedu/softwares/alertmanager/tmpl/email.tmpl -->
{{ define "email.html" }}
<h2>Alert notification</h2>
<table border="1">
  <tr><th>Alert name</th><th>Instance</th><th>Threshold</th><th>Time</th></tr>
  {{ range .Alerts }}
  <tr>
    <td>{{ .Labels.alertname }}</td>
    <td>{{ .Labels.instance }}</td>
    <td>{{ .Annotations.value }}</td>
    <td>{{ .StartsAt }}</td>
  </tr>
  {{ end }}
</table>
{{ end }}

# Reference the template in alertmanager.yml
receivers:
- name: 'team-email'
  email_configs:
  - to: 'team@example.com'
    html: '{{ template "email.html" . }}'
    headers: { Subject: "[Alert] Service abnormal" }

templates:
  - '/oldboyedu/softwares/alertmanager/tmpl/*.tmpl'

6. DingTalk alerts

# Deploy the DingTalk webhook adapter
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.1.0/prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz

# config.yml
targets:
  ops-team:
    url: https://oapi.dingtalk.com/robot/send?access_token=TOKEN
    secret: "SEC..."  # signing secret

# Start
./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060"

# alertmanager.yml
receivers:
- name: 'dingtalk'
  webhook_configs:
    - url: 'http://10.0.0.42:8060/dingtalk/ops-team/send'
      send_resolved: true

7. Alert silencing

Created in the Alertmanager web UI
Set the matchers and the duration
Suited to planned maintenance windows

8. Alert inhibition

# alertmanager.yml
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['dc']  # inhibit only when both alerts have the same dc label value
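
The effect of this rule can be checked with a small Python model: a warning alert is muted when a critical alert with the same dc value is firing. The sketch is simplified (exact label matches only) and the function name and sample alerts are assumptions.

```python
def is_inhibited(target, firing, source_match, target_match, equal):
    """Return True if `target` is muted by any firing alert under one inhibit rule."""

    def matches(labels, wanted):
        return all(labels.get(name) == value for name, value in wanted.items())

    if not matches(target, target_match):
        return False  # the rule does not apply to this alert at all
    return any(
        matches(source, source_match)
        # A missing label and an empty label are treated as equal.
        and all(source.get(name, "") == target.get(name, "") for name in equal)
        for source in firing
    )


firing = [{"alertname": "NodeDown", "severity": "critical", "dc": "bj"}]
rule = {"source_match": {"severity": "critical"}, "target_match": {"severity": "warning"}, "equal": ["dc"]}

same_dc = {"alertname": "HighLoad", "severity": "warning", "dc": "bj"}
other_dc = {"alertname": "HighLoad", "severity": "warning", "dc": "sh"}
print(is_inhibited(same_dc, firing, **rule))   # True
print(is_inhibited(other_dc, firing, **rule))  # False
```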

9. Deploying an etcd cluster

# Download
wget https://github.com/etcd-io/etcd/releases/download/v3.6.10/etcd-v3.6.10-linux-amd64.tar.gz

# Configuration file: /oldboyedu/softwares/etcd/etcd.config.yml
name: 'etcd-node1'
data-dir: /var/lib/etcd
listen-peer-urls: 'https://10.0.0.41:2380'
listen-client-urls: 'https://10.0.0.41:2379,http://127.0.0.1:2379'
initial-advertise-peer-urls: 'https://10.0.0.41:2380'
advertise-client-urls: 'https://10.0.0.41:2379'
initial-cluster: 'etcd-node1=https://10.0.0.41:2380,etcd-node2=https://10.0.0.42:2380,etcd-node3=https://10.0.0.43:2380'
initial-cluster-token: 'etcd-cluster'
initial-cluster-state: 'new'
client-transport-security:
  cert-file: '/oldboyedu/certs/etcd/etcd.pem'
  key-file: '/oldboyedu/certs/etcd/etcd-key.pem'
  trusted-ca-file: '/oldboyedu/certs/etcd/ca.pem'

# Start
systemctl start etcd

10. Basic etcd operations

# Write a key
etcdctl put /key value

# Read keys
etcdctl get /key
etcdctl get / --prefix --keys-only  # list all keys

# Delete keys
etcdctl del /key
etcdctl del / --prefix  # delete everything

# Check cluster status
etcdctl endpoint status --write-out=table

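11. etcd backup and restore

# Back up
etcdctl snapshot save /backup/etcd-$(date +%F).db

# Inspect the snapshot (/backup/etcd.db stands for the file saved above)
etcdutl snapshot status /backup/etcd.db -w table

# Restore into a new data directory (stop the etcd service first)
etcdutl snapshot restore /backup/etcd.db --data-dir=/var/lib/etcd-restore

# Point data-dir in the configuration file at the new directory, then restart etcd

12. Monitoring etcd with Prometheus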

# prometheus.yml
- job_name: "etcd-cluster"
  scheme: https
  tls_config:
    ca_file: certs/etcd/ca.pem
    cert_file: certs/etcd/etcd.pem
    key_file: certs/etcd/etcd-key.pem
  static_configs:
    - targets:
        - 10.0.0.41:2379
        - 10.0.0.42:2379
        - 10.0.0.43:2379

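Grafana dashboard IDs: 21473, 3070, 10323

III. Common Interview Questions

What does Alertmanager do?
- Deduplicates, groups, and routes alerts
- Sends notifications through multiple channels
- Provides silencing and inhibition
Alert silencing vs alert inhibition?
- Silencing: set up manually to mute alerts for a period
- Inhibition: applied automatically when a configured condition is met
Why does etcd need three nodes?
- Raft needs a majority (quorum) of members to agree on every write and leader election
- Three nodes survive the loss of one; an even-sized cluster tolerates no more failures than the next smaller odd size, so odd sizes are preferred
What is the etcd backup and restore procedure?
- Take snapshot backups regularly
- Stop the service
- Restore the data directory
- Restart the service
How do you implement alert severity levels?
- Define different severity labels
- Configure multiple routing rules
- Send each level to a different receiver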
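
For the three-node etcd question above, the quorum arithmetic is short enough to compute directly. The Python sketch below uses the standard majority formula; the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


for n in range(1, 6):
    print(f"members={n} quorum={quorum(n)} tolerated_failures={fault_tolerance(n)}")
```

The output shows that 3 and 4 members both tolerate one failure while 5 tolerates two, which is why adding a fourth node buys no extra availability.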

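Appendix: Common Grafana Dashboard IDs

Monitored target	Dashboard ID
Node Exporter	1860
MySQL	14057, 17320
Redis	11835, 14091
MongoDB	16504
Nginx (vts)	9785, 2949
Docker (cAdvisor)	10619
ElasticSearch	14191
Kafka	21078, 7589
Zookeeper	10465
Consul	12049
Blackbox Exporter	7587, 13659
etcd	21473, 3070, 10323
Windows	23847, 14694

Troubleshooting Common Errors

Error message	Cause	Fix
`server returned HTTP status 400`	Protocol mismatch (for example, plain HTTP sent to an HTTPS port)	Check the `scheme` setting
`failed to verify certificate`	Certificate verification failed	Set `ca_file` to the issuing CA, or add `insecure_skip_verify: true` (test environments only)
`server returned HTTP status 401`	Authentication failed	Configure `basic_auth`
`550 Connection frequency limited`	The mail provider is rate-limiting sends	Switch mailboxes or send less often

Note: this document distills the key points of a five-day course. It covers the full Prometheus ecosystem workflow, from deployment, configuration, and monitoring to alerting and advanced features, and is meant for interview review and as a hands-on reference.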