Monitoring and Alerting for Python Web Services: Configuring and Using Prometheus and Grafana

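Hello everyone. Today we will look at how to monitor a Python web service and alert on it with Prometheus and Grafana. In production, monitoring is key to keeping a service stable. Prometheus collects and stores the monitoring data, while Grafana handles visualization and alert configuration.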

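1. Choosing and Exposing Metrics

First, we need to decide which metrics to monitor. For a Python web service, common metrics include:

Request count: measures the service's throughput.
Request latency: measures how quickly the service responds.
Error rate: measures the service's stability.
CPU usage: measures the service's compute consumption.
Memory usage: measures the service's memory consumption.
Database connection count: measures the load on the database.
Custom business metrics: track application-specific values as business needs dictate.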

Next, we need to expose these metrics to Prometheus. There are several ways to do this; here we use prometheus_client, the official Prometheus client library for Python.

Install prometheus_client:

pip install prometheus_client

Example code (Flask web service):

from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, REGISTRY
import time
import random

app = Flask(__name__)

# Define the metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total number of HTTP requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency in seconds', ['method', 'endpoint'])
ERROR_COUNT = Counter('http_errors_total', 'Total number of HTTP errors', ['method', 'endpoint', 'status_code'])
CPU_USAGE = Gauge('cpu_usage_percent', 'CPU usage percentage')
MEMORY_USAGE = Gauge('memory_usage_bytes', 'Memory usage in bytes')

# Simulate CPU and memory readings
def update_resource_metrics():
    CPU_USAGE.set(random.randint(0, 100))
    MEMORY_USAGE.set(random.randint(1000000, 2000000))

@app.route('/')
def index():
    start_time = time.time()
    method = request.method
    endpoint = request.path

    try:
        # Simulate business logic
        time.sleep(random.uniform(0.01, 0.1))
        response = "Hello, World!"
        status_code = 200
    except Exception as e:
        response = str(e)
        status_code = 500
        ERROR_COUNT.labels(method=method, endpoint=endpoint, status_code=status_code).inc()

    latency = time.time() - start_time
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(latency)

    return response, status_code

@app.route('/metrics')
def metrics():
    update_resource_metrics()  # refresh the CPU and memory gauges
    return generate_latest(REGISTRY), 200, {'Content-Type': 'text/plain; charset=utf-8'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

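Code walkthrough:

Counter: records the total number of requests and errors. A counter only ever increases.
Histogram: records the distribution of request latencies.
Gauge: records values that can go up and down, such as CPU and memory usage.
REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc(): increments the counter for the given label values.
REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(latency): records one latency observation in the histogram for the given label values.
generate_latest(REGISTRY): renders all registered metrics in the text format that Prometheus reads.
/metrics endpoint: the URL that Prometheus scrapes to collect the metrics.

Important: in production, the values of CPU_USAGE and MEMORY_USAGE should come from the operating system, for example via the psutil library. The code above only simulates them with random numbers.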
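To make the Histogram entry above more concrete, here is a minimal sketch of how cumulative buckets fill up as observations arrive. It is a toy model written with the standard library only, not the prometheus_client implementation; the class name SimpleHistogram and the bucket bounds are assumptions chosen for illustration.

```python
import bisect
import math

# Hypothetical bucket upper bounds in seconds (not the library defaults).
BUCKET_BOUNDS = [0.05, 0.1, 0.25, 0.5, math.inf]


class SimpleHistogram:
    """Toy model of a Prometheus histogram with cumulative buckets."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        self.counts = [0] * len(self.bounds)  # per-bucket, non-cumulative
        self.total = 0.0  # corresponds to the _sum series
        self.n = 0        # corresponds to the _count series

    def observe(self, value):
        # Find the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.bounds, value)
        self.counts[index] += 1
        self.total += value
        self.n += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, as in the _bucket series."""
        running, result = 0, []
        for bound, count in zip(self.bounds, self.counts):
            running += count
            result.append((bound, running))
        return result


hist = SimpleHistogram(BUCKET_BOUNDS)
for latency in (0.02, 0.07, 0.07, 0.3):
    hist.observe(latency)

for bound, count in hist.cumulative():
    print(f'le="{bound}" {count}')
```

Each bucket counts every observation less than or equal to its bound, which is why the last (infinite) bucket always equals the total number of observations.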

2. Configuring Prometheus

Next, we need to configure Prometheus to scrape the web service's metrics.

Download Prometheus:

Download the package for your platform from the Prometheus website: https://prometheus.io/download/

Configure Prometheus (prometheus.yml):

global:
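  scrape_interval:     15s  # scrape metrics every 15 seconds
  evaluation_interval: 15s  # evaluate rules every 15 seconds

scrape_configs:
  - job_name: 'python_web_service'  # job name
    static_configs:
      - targets: ['localhost:5000']  # address and port of the web service; assumed here to run locally on port 5000
    metrics_path: /metrics  # path where the metrics are exposed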

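Configuration notes:

scrape_interval: how often Prometheus scrapes metrics.
evaluation_interval: how often Prometheus evaluates recording and alerting rules.
job_name: the name of the job, used to tell monitoring targets apart.
targets: the addresses and ports to scrape.
metrics_path: the path where the target exposes its metrics; it matches the /metrics endpoint in the Python code.

Start Prometheus: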

./prometheus --config.file=prometheus.yml

Once it is running, open http://localhost:9090 to see the Prometheus web UI. Enter http_requests_total on the "Graph" page to see the request counts for the web service.
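The same query can also be run from a script through the Prometheus HTTP API at /api/v1/query. The sketch below uses only the standard library; the helper names build_query_url, parse_vector, and instant_query are my own, and the canned response is made-up sample data in the documented response shape so the parsing can be tried without a running server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(payload):
    """Turn an instant-vector JSON response into {label tuple: float value}."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    results = {}
    for item in payload["data"]["result"]:
        labels = tuple(sorted(item["metric"].items()))
        _timestamp, value = item["value"]  # the value arrives as a string
        results[labels] = float(value)
    return results


def instant_query(base_url, promql, timeout=5):
    """Run the query against a live Prometheus server (needs network access)."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Made-up response in the documented shape, so the parser can be tried offline.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "http_requests_total", "endpoint": "/", "method": "GET"},
                "value": [1700000000.0, "42"],
            },
        ],
    },
}
print(build_query_url("http://localhost:9090", "http_requests_total"))
print(parse_vector(sample))
```

With Prometheus running locally, instant_query("http://localhost:9090", "http_requests_total") should return one entry per label combination.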

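3. Configuring Grafana

Now we need to configure Grafana to visualize the metrics that Prometheus collects.

Download Grafana:

Download the package for your platform from the Grafana website: https://grafana.com/grafana/download

Start Grafana:

How you start it depends on how you installed it. On Linux, you can usually run systemctl start grafana-server.

Configure Grafana: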

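Open the Grafana web UI: the default address is http://localhost:3000, and the default username and password are admin/admin.
Add the Prometheus data source:
- Click "Configuration" -> "Data sources".
- Click "Add data source".
- Select "Prometheus".
- In the "URL" field, enter the Prometheus address (for example, http://localhost:9090).
- Click "Save & Test".
Create a dashboard:
- Click "+" -> "Dashboard".
- Click "Add new panel".
- In the "Query" field, enter a PromQL query (for example, sum(rate(http_requests_total[5m])) by (endpoint), which returns the per-second request rate for each endpoint, averaged over the last 5 minutes).
- Choose a suitable visualization type (for example, "Time series").
- Set the panel title, axes, and so on.
- Click "Apply".
- Repeat these steps to add panels for the other metrics.
- Click "Save" to save the dashboard.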

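Common Prometheus queries (PromQL):

Metric	PromQL query	Description
Request rate (by endpoint)	`sum(rate(http_requests_total[1m])) by (endpoint)`	Per-second request rate for each endpoint, averaged over the last minute. Multiply by 60 for requests per minute.
95th-percentile request latency (by endpoint)	`histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))`	95th-percentile request latency for each endpoint over the last 5 minutes.
Error rate (by endpoint)	`sum(rate(http_errors_total[5m])) by (endpoint) / sum(rate(http_requests_total[5m])) by (endpoint)`	Fraction of requests to each endpoint that failed over the last 5 minutes.
CPU usage	`avg(cpu_usage_percent)`	Average CPU usage across all scraped instances.
Memory usage	`avg(memory_usage_bytes)`	Average memory usage across all scraped instances.
Rate of status-500 errors (by endpoint)	`sum(rate(http_errors_total{status_code="500"}[5m])) by (endpoint)`	Per-second rate of status-500 errors for each endpoint over the last 5 minutes. Use increase() instead of rate() for the total count.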
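Because rate() is easy to misread, here is a rough sketch of the arithmetic behind it: the counter's increase over the window divided by the elapsed seconds, with counter resets handled. This is a simplification under my own assumptions; real Prometheus also extrapolates to the window edges, and the function name simple_rate and the sample values are made up.

```python
def simple_rate(samples):
    """Approximate PromQL rate() over (timestamp_seconds, counter_value) samples.

    Simplification: real Prometheus also extrapolates to the window edges.
    """
    if len(samples) < 2:
        return None  # rate() needs at least two samples in the window
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example, a process restart),
        # so the whole current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    return increase / elapsed


# Scrapes every 15 s over one minute; the counter grows by 30 per scrape.
window = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
print(simple_rate(window))       # 2.0 requests per second
print(simple_rate(window) * 60)  # 120.0 requests per minute
```

The result is always per second regardless of the window length; the window only controls how much history is averaged.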

Important: PromQL syntax is fairly involved and takes time to learn. See the official Prometheus documentation: https://prometheus.io/docs/prometheus/latest/querying/basics/
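As one example of what PromQL does under the hood, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the matching bucket, which is the idea behind histogram_quantile. It is a simplified model, not the Prometheus source: edge cases such as NaN values or q outside [0, 1] are left out, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Quantile falls in the +Inf bucket (or an empty one):
                # report the highest bound known so far.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Invented cumulative counts for 100 requests.
buckets = [(0.05, 10), (0.1, 60), (0.25, 95), (0.5, 100), (math.inf, 100)]
print(f"p50 = {histogram_quantile(0.5, buckets):.3f} s")
print(f"p95 = {histogram_quantile(0.95, buckets):.3f} s")
```

The estimate can never be more precise than the bucket boundaries, so bucket bounds should bracket the latencies you care about.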

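4. Configuring Alerts in Grafana

Grafana can trigger alerts based on the values of monitored metrics. The steps below follow the panel-based alerting UI of older Grafana releases; newer releases manage alert rules and contact points under the "Alerting" menu, so the labels you see may differ.

Configure an alert:

In the dashboard, select the panel you want to alert on.
Click "Edit" to edit the panel.
Switch to the "Alert" tab.
Configure the alert rule:
- "Evaluate every": how often the rule is evaluated.
- "For": how long the condition must hold before the alert fires.
- "Conditions": the condition that triggers the alert, for example "WHEN avg() OF query(A, 5m) IS ABOVE 100".
- "Notifications": how notifications are delivered, for example Email, Slack, or Webhook.
Configure a notification channel:
- Click "Configuration" -> "Notification channels".
- Click "Add channel".
- Choose a notification type and fill in its settings.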

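Alert example:

Suppose we want to trigger an alert when an endpoint's request rate stays above 100 requests per second for 5 minutes.

In Grafana, select the request-rate panel.
Edit the panel and switch to the "Alert" tab.
Configure the alert rule:
- "Evaluate every": "1m" (evaluate once per minute).
- "For": "5m" (the condition must hold for 5 minutes).
- "Conditions": "WHEN avg() OF query(A, 5m) IS ABOVE 100", where A is the query sum(rate(http_requests_total[1m])) by (endpoint).
- "Notifications": select the notification channel you configured.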
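The interaction between "Evaluate every" and "For" is easier to see as a small state machine. The sketch below is a simplified model under my own assumptions, not Grafana's implementation: a breach first puts the rule into a pending state, and it only starts alerting once the condition has held for the full "For" duration. The function name alert_states and the sample values are invented.

```python
def alert_states(values, threshold, for_evaluations):
    """Yield 'ok', 'pending' or 'alerting' for each evaluated value.

    for_evaluations is the "For" duration divided by the evaluation
    interval, for example 5m / 1m = 5.
    """
    breaching = 0  # consecutive evaluations above the threshold
    for value in values:
        if value > threshold:
            breaching += 1
            # The first breach starts the clock; the alert fires once the
            # condition has held for the whole "For" duration.
            yield "alerting" if breaching > for_evaluations else "pending"
        else:
            breaching = 0
            yield "ok"


# One value per minute (requests per second), threshold 100, "For" 5m.
readings = [80, 120, 130, 90, 110, 115, 120, 125, 130, 140, 150]
for minute, state in enumerate(alert_states(readings, 100, 5)):
    print(minute, readings[minute], state)
```

The short spike at minutes 1 and 2 never alerts because the rate drops back below the threshold before 5 minutes have passed, which is exactly the noise that "For" is meant to filter out.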

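5. Extensions and Best Practices

Use psutil for more accurate system resource data: psutil is a cross-platform Python library that reports CPU, memory, disk, network, and other system resource information.
Use service discovery: when there are many web service instances, maintaining the Prometheus targets list by hand becomes tedious. Prometheus service discovery can find and monitor instances automatically. Common options include Consul and Kubernetes.
Custom metrics: monitor business-specific values such as user sign-ups or order counts.
Metric naming conventions: following the Prometheus naming conventions makes metrics easier to read and maintain.
Alert inhibition: when many alerts fire at once, inhibition suppresses the redundant ones and prevents an alert storm.
Log integration: collecting the web service's logs alongside its Prometheus metrics makes troubleshooting easier. Common logging stacks include EFK (Elasticsearch, Fluentd, Kibana) and PLG (Promtail, Loki, Grafana).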
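As a small illustration of the naming conventions mentioned in the list above, the following sketch checks a metric name against a few widely used rules: the classic metric-name character set, the _total suffix for counters, and base units such as seconds and bytes. The rule set and the function name naming_issues are my own assumptions and cover only part of the official guidance.

```python
import re

# Classic metric-name pattern from the Prometheus data model.
METRIC_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")

# Hypothetical list of non-base-unit suffixes to flag; extend as needed.
NON_BASE_UNIT_SUFFIXES = (
    "_ms", "_milliseconds", "_minutes", "_kb", "_mb", "_kilobytes", "_megabytes",
)


def naming_issues(name, metric_type):
    """Return a list of human-readable problems with a metric name."""
    issues = []
    if not METRIC_NAME_RE.match(name):
        issues.append("is not a valid classic metric name")
    if ":" in name:
        issues.append("colons are reserved for recording rules")
    if metric_type == "counter" and not name.endswith("_total"):
        issues.append("counter names should end in _total")
    if name.endswith(NON_BASE_UNIT_SUFFIXES):
        issues.append("prefer base units such as seconds or bytes")
    return issues


for name, kind in [
    ("http_requests_total", "counter"),
    ("http_request_duration_seconds", "histogram"),
    ("http_request_duration_ms", "histogram"),
    ("http-errors", "counter"),
]:
    print(name, naming_issues(name, kind))
```

A check like this could run in a unit test so that badly named metrics are caught before they reach dashboards and alert rules.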

6. Example: Getting CPU and Memory Data with `psutil`

import psutil
from flask import Flask, request
from prometheus_client import Counter, Histogram, Gauge, generate_latest, REGISTRY
import time
import random

app = Flask(__name__)

# Define the metrics
REQUEST_COUNT = Counter('http_requests_total', 'Total number of HTTP requests', ['method', 'endpoint'])
REQUEST_LATENCY = Histogram('http_request_duration_seconds', 'HTTP request latency in seconds', ['method', 'endpoint'])
ERROR_COUNT = Counter('http_errors_total', 'Total number of HTTP errors', ['method', 'endpoint', 'status_code'])
CPU_USAGE = Gauge('cpu_usage_percent', 'CPU usage percentage')
MEMORY_USAGE = Gauge('memory_usage_bytes', 'Memory usage in bytes')

def update_resource_metrics():
    CPU_USAGE.set(psutil.cpu_percent(interval=1))
    MEMORY_USAGE.set(psutil.virtual_memory().used)

@app.route('/')
def index():
    start_time = time.time()
    method = request.method
    endpoint = request.path

    try:
        # Simulate business logic
        time.sleep(random.uniform(0.01, 0.1))
        response = "Hello, World!"
        status_code = 200
    except Exception as e:
        response = str(e)
        status_code = 500
        ERROR_COUNT.labels(method=method, endpoint=endpoint, status_code=status_code).inc()

    latency = time.time() - start_time
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(latency)

    return response, status_code

@app.route('/metrics')
def metrics():
    update_resource_metrics()  # refresh the CPU and memory gauges
    return generate_latest(REGISTRY), 200, {'Content-Type': 'text/plain; charset=utf-8'}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

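In this example, psutil.cpu_percent(interval=1) reads the CPU usage and psutil.virtual_memory().used reads the amount of memory in use. Note that interval=1 makes the call block for one second, so every scrape of /metrics takes at least that long.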
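One way to keep resource sampling out of the request path is to refresh the gauges from a background thread. The sketch below is a minimal, standard-library-only pattern; the class name BackgroundSampler is hypothetical, and the two lambdas are stand-ins for a real reading (such as psutil.cpu_percent(interval=None)) and a real gauge update (such as CPU_USAGE.set).

```python
import threading


class BackgroundSampler:
    """Call sample() every `interval` seconds and hand the result to publish()."""

    def __init__(self, sample, publish, interval=5.0):
        self._sample = sample
        self._publish = publish
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while True:
            self._publish(self._sample())
            # Event.wait returns False on timeout and True once stop() was called.
            if self._stop.wait(self._interval):
                break


latest = {}
sampler = BackgroundSampler(
    sample=lambda: 12.5,                             # stand-in for a real reading
    publish=lambda value: latest.update(cpu=value),  # stand-in for a gauge update
    interval=0.01,
)
sampler.start()
sampler.stop()
print(latest)
```

With this in place the /metrics handler only has to render the registry, so a scrape no longer waits on the CPU measurement.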

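7. Choosing How to Expose Metrics

Besides a /metrics endpoint, there are other ways to expose metrics:

Pushgateway: suited to short-lived jobs, or jobs that Prometheus cannot scrape directly.
Textfile Collector: a node_exporter feature that reads metrics from files on disk.

Which one to choose depends on the use case.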
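For the textfile approach, a job only needs to write a file in the Prometheus text format into the directory that node_exporter watches. The following is a minimal sketch using the standard library; the function name write_textfile, the metric name, and the file name are made up for illustration, and the file is written to a temporary directory here instead of a real collector directory.

```python
import os
import tempfile


def write_textfile(path, metrics):
    """Atomically write gauges in the Prometheus text format.

    metrics maps a metric name to a (help_text, value) pair.
    """
    lines = []
    for name, (help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a reader never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file readable by the owner only
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as directory:
    target = os.path.join(directory, "batch_job.prom")
    write_textfile(
        target,
        {"batch_job_last_success_timestamp_seconds": ("Unix time of the last successful run", 1700000000)},
    )
    with open(target) as handle:
        content = handle.read()
print(content)
```

In a real setup the target path would sit inside the directory passed to node_exporter's --collector.textfile.directory flag, and the file name must end in .prom.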

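8. Some Thoughts on Monitoring

With the steps above, we have a basic monitoring and alerting system for a Python web service. Building a monitoring system is an iterative process that needs continual tuning as conditions change. I hope today's session was helpful.

Recap of the Key Steps

This article showed how to expose metrics from a Python web service with prometheus_client, configure Prometheus to scrape them, and configure Grafana to visualize the data and set up alert rules. In practice, choose the metrics and alert thresholds that fit your own service.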