What Are Control Groups (cgroups)? Limiting a Process's CPU Quota and I/O Weight at the Kernel Level

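Good afternoon, everyone.

Today we take a close look at Control Groups (cgroups), a powerful and essential resource-management mechanism in the Linux kernel. I know from experience that in today's complex systems, allocating and limiting compute resources efficiently and fairly is the foundation of stable, high-performance services. Whether you run cloud-native containers (Docker, Kubernetes) or a plain multi-process server, fine-grained control over CPU, memory, and I/O is critical, and cgroups are the Swiss Army knife Linux provides for the job.

This lecture covers the core concepts of cgroups and then concentrates on how to use them to limit a process's CPU quota and I/O weight at the kernel level. I will keep the reasoning rigorous and back it with practical code examples, so that you can turn the theory into working skills.

Opening: The Core Challenge of Resource Management

Picture a multitasking operating system in which many processes compete for limited hardware. A compute-intensive batch job can monopolize every CPU core and make interactive services sluggish; a disk-heavy backup job can make file access painfully slow for every other application. Without effective isolation and management, this "noisy neighbor" problem seriously degrades overall performance, stability, and user experience.

Early Linux systems adjusted process priority mainly through the scheduler (for example, nice values), but that is only an advisory hint: it cannot guarantee a hard upper or lower bound on resource usage. As virtualization and container technology took off, the need for resource isolation and QoS (Quality of Service) became more pressing than ever. cgroups were created to meet these challenges.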

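Meet Control Groups (cgroups): The Linux Kernel's Resource Steward

Control Groups, or cgroups for short, are a Linux kernel mechanism that organizes processes into hierarchical groups and applies resource limits, prioritization, accounting, and control to those groups. They let an administrator or an automated system enforce resource-usage policies on a specific set of processes.

History and Design Philosophy

cgroups were first proposed and developed by Google engineers in 2006 under the name "process containers". The core idea was to create a "container" for a set of processes and manage the resources of everything inside it as a unit. The feature was later renamed cgroups, because the word "container" already carried several meanings in the Linux kernel context and the new name better reflects what the feature is: a control group. It was first merged into the mainline kernel in Linux 2.6.24.

The design philosophy of cgroups is:

Hierarchical organization: processes are arranged in a tree. Child groups inherit attributes from their parent and can apply finer-grained control on top of them.
Controllers (subsystems): each resource type (CPU, memory, I/O, and so on) is managed by its own controller, so different resource types can be controlled independently.
Flexible configuration: through a virtual filesystem interface (cgroupfs, conventionally mounted under /sys/fs/cgroup), user-space programs can create, modify, and delete cgroups and set their resource parameters.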

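Core Concepts: Task, cgroup, Controller, Hierarchy

To understand cgroups, we need a handful of core concepts:

Task: in cgroups terminology, a task is a schedulable entity, that is, a process or one of its threads. Every task has a unique ID (a PID for a process, a TID for a thread).
cgroup (control group): a logical collection of zero or more tasks. Each cgroup has a unique path, for example /sys/fs/cgroup/cpu/my_group.
Controller (subsystem): a kernel component that manages one type of resource. For example:
- cpu: controls how CPU time is distributed.
- cpuacct: accounts for CPU usage.
- blkio: controls access to block-device I/O.
- memory: limits memory usage.
- pids: limits the number of processes.
- freezer: suspends or resumes the processes in a cgroup.
- net_cls / net_prio: network traffic classification and priority.
  And so on.
Hierarchy: cgroups are organized into a tree called a hierarchy. A hierarchy can have one or more controllers attached. In cgroups v1, different controllers can be mounted on different hierarchies; in cgroups v2, all controllers share a single unified hierarchy.

The cgroup Filesystem: The Bridge Between User Space and the Kernel

cgroups expose their management interface to user space through a special virtual filesystem, usually mounted at /sys/fs/cgroup or in subdirectories beneath it. Users create, configure, and manage cgroups by reading and writing files there.

For example, to create a CPU cgroup named my_cpu_group, create a subdirectory with that name under /sys/fs/cgroup/cpu: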

sudo mkdir /sys/fs/cgroup/cpu/my_cpu_group

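The kernel automatically populates the my_cpu_group directory with a set of control files such as cpu.shares and cpu.cfs_period_us.

Common files in a cgroup directory:

File	Description
`cgroup.procs`	IDs of the processes in this cgroup (thread group IDs, i.e. the PID of each process's main thread).
`tasks`	v1 only. IDs of all tasks in this cgroup, including individual thread IDs.
`cgroup.event_control`	v1 only. Used for event notification.
`cgroup.subtree_control`	v2 only. Enables or disables controllers for child cgroups.
`notify_on_release`	v1 only. Whether to notify user space (through `release_agent`) when the cgroup becomes empty.
`release_agent`	v1 only, present in the hierarchy root. Path of the program to run when a cgroup with `notify_on_release` enabled becomes empty.
`cpu.*`	Parameter files for the `cpu` controller.
`blkio.*`	Parameter files for the `blkio` controller.
`memory.*`	Parameter files for the `memory` controller.
…	Parameter files for other controllers.

To move a process into a cgroup, write its PID to the cgroup.procs file in that cgroup's directory (or, on v1, write a thread ID to tasks to move a single thread). For example, to move the process with PID 12345 into my_cpu_group:

sudo sh -c "echo 12345 > /sys/fs/cgroup/cpu/my_cpu_group/cgroup.procs"

Once moved, the process is subject to every resource limit configured on that cgroup.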
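
Membership can also be checked from the process side. As a minimal sketch, the helper below parses the text of /proc/<PID>/cgroup, where each line has the form hierarchy-ID:controller-list:path. The function name parse_proc_cgroup and the sample text are illustrative assumptions, not output captured from a real system.

```python
from typing import NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: tuple
    path: str


def parse_proc_cgroup(text):
    """Parse the contents of /proc/<PID>/cgroup into a list of entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split on the first two colons only; the path itself may contain ':'
        hierarchy_id, controllers, path = line.split(":", 2)
        names = tuple(c for c in controllers.split(",") if c)
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries


# Hypothetical sample: two v1 hierarchies plus a v2-style entry (empty controller list)
SAMPLE = "4:cpu,cpuacct:/my_cpu_group\n3:blkio:/\n0::/user.slice\n"

for entry in parse_proc_cgroup(SAMPLE):
    print(entry.hierarchy_id, entry.controllers or "(unified)", entry.path)
```

On a live system the same function could be fed the result of open(f"/proc/{pid}/cgroup").read().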

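CPU Resource Management: Fine-Grained Control of Compute Power

The cpu controller is one of the most commonly used and most important cgroup controllers. It gives fine-grained control over the CPU usage of a group of processes, through either relative weights or absolute quotas.

CPU Controller Overview

The cpu controller manages CPU resources through the following mechanisms:

cpu.shares: relative, weight-based allocation.
cpu.cfs_period_us / cpu.cfs_quota_us: absolute quotas enforced by the Completely Fair Scheduler (CFS).
cpu.rt_period_us / cpu.rt_runtime_us: quotas for real-time (RT) scheduling.

We will go through them one by one.

CPU Shares (cpu.shares): Relative Weight Allocation

cpu.shares is a relative weight. It guarantees no absolute amount of CPU time; instead, when CPU is contended, time is divided among sibling cgroups in proportion to their cpu.shares values.

How It Works and When to Use It
- How it works: by default every new cgroup has cpu.shares set to 1024. If a parent cgroup has two children A and B with cpu.shares of 1024 and 512, then under heavy CPU contention A receives roughly twice as much CPU time as B. If only A has runnable work, it can use all available CPU because nothing competes with it.
- When to use it: a good fit for dividing resources among several non-critical services, such as multiple users' virtual machines in a development environment or several background batch jobs on one server. It provides soft prioritization: when the system is idle every task can use the spare capacity, and when it is busy the higher-weighted tasks get more.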
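
The proportional rule is easy to check with a few lines of arithmetic. The sketch below computes the fraction of a fully contended CPU that each sibling cgroup would be expected to receive; the function name expected_cpu_split is a hypothetical helper, and the model ignores idle groups and multi-core effects.

```python
def expected_cpu_split(shares):
    """Return each group's expected fraction of one fully contended CPU.

    shares maps a cgroup name to its cpu.shares value. Only groups that
    actually have runnable work should be passed in.
    """
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# The A/B example from the text: 1024 versus 512 shares
split = expected_cpu_split({"A": 1024, "B": 512})
for name, fraction in split.items():
    print(f"{name}: {fraction:.1%}")
```

Passing only group A models the uncontended case, in which A's fraction becomes 1.0.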
Command-Line Practice: Creating cgroups and Setting Shares

First make sure the cpu cgroup filesystem is mounted. On cgroups v1 systems it is usually mounted automatically at /sys/fs/cgroup/cpu (often as a symlink to /sys/fs/cgroup/cpu,cpuacct).

# 1. Create the new cgroups
sudo mkdir /sys/fs/cgroup/cpu/my_low_priority_group
sudo mkdir /sys/fs/cgroup/cpu/my_high_priority_group

# 2. Set each cgroup's cpu.shares value
# The default is 1024. Here the low-priority group gets 512 and the high-priority group 2048
sudo sh -c "echo 512 > /sys/fs/cgroup/cpu/my_low_priority_group/cpu.shares"
sudo sh -c "echo 2048 > /sys/fs/cgroup/cpu/my_high_priority_group/cpu.shares"

# 3. Verify the settings
cat /sys/fs/cgroup/cpu/my_low_priority_group/cpu.shares
cat /sys/fs/cgroup/cpu/my_high_priority_group/cpu.shares

echo "cgroups 'my_low_priority_group' and 'my_high_priority_group' created with shares 512 and 2048 respectively."

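Code Example: A CPU-Bound Task in C, Bound to a cgroup

We will write a simple C program that computes in a tight loop to simulate a CPU-bound task. We then run several instances, place them in different cgroups, and observe their CPU usage.

cpu_burner.c: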

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>

// A simple CPU-bound task that runs for a specified duration
int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <duration_seconds>\n", argv[0]);
        return 1;
    }

    int duration = atoi(argv[1]);
    if (duration <= 0) {
        fprintf(stderr, "Duration must be a positive integer.\n");
        return 1;
    }

    printf("PID %d: Starting CPU burn for %d seconds...\n", getpid(), duration);

    time_t start_time = time(NULL);
    while (time(NULL) - start_time < duration) {
        // Perform some CPU-intensive operations (e.g., calculations)
        // This loop keeps the CPU busy without blocking on I/O
        volatile double x = 1.0;
        for (int i = 0; i < 1000000; ++i) {
            x = x * x + 0.0000000001;
        }
    }

    printf("PID %d: CPU burn finished.\n", getpid());
    return 0;
}

Compile cpu_burner.c:

gcc cpu_burner.c -o cpu_burner -Wall

Test script run_cpu_shares_test.sh, which starts the burners so you can observe them:

#!/bin/bash

# Ensure running as root for cgroup operations
if [[ $EUID -ne 0 ]]; then
   echo "This script must be run as root."
   exit 1
fi

CGROUP_BASE="/sys/fs/cgroup/cpu"
LOW_PRIO_GROUP="${CGROUP_BASE}/my_low_priority_group"
HIGH_PRIO_GROUP="${CGROUP_BASE}/my_high_priority_group"

# Cleanup previous cgroups if they exist
rmdir "$LOW_PRIO_GROUP" 2>/dev/null
rmdir "$HIGH_PRIO_GROUP" 2>/dev/null

# Create cgroups
mkdir "$LOW_PRIO_GROUP"
mkdir "$HIGH_PRIO_GROUP"

# Set shares
echo 512 > "$LOW_PRIO_GROUP/cpu.shares"
echo 2048 > "$HIGH_PRIO_GROUP/cpu.shares"

echo "cgroups setup complete."
echo "Low priority group shares: $(cat $LOW_PRIO_GROUP/cpu.shares)"
echo "High priority group shares: $(cat $HIGH_PRIO_GROUP/cpu.shares)"
echo ""

# Compile the CPU burner program if not already compiled
if [ ! -f ./cpu_burner ]; then
    echo "Compiling cpu_burner.c..."
    gcc cpu_burner.c -o cpu_burner -Wall
    if [ $? -ne 0 ]; then
        echo "Compilation failed. Exiting."
        exit 1
    fi
fi

DURATION=30 # seconds

echo "Starting CPU burners. Monitor CPU usage with 'top' or 'htop' in another terminal."
echo "Press Ctrl+C to stop the test."

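# Run the CPU burners in the background and capture their PIDs.
# All three are pinned to CPU 0 with taskset so that they actually compete;
# on an idle multi-core machine each burner would otherwise get its own core.
taskset -c 0 ./cpu_burner $DURATION &
PID_DEFAULT=$!
echo "Default priority burner (PID: $PID_DEFAULT) started."

taskset -c 0 ./cpu_burner $DURATION &
PID_LOW=$!
echo "Low priority burner (PID: $PID_LOW) started."

taskset -c 0 ./cpu_burner $DURATION &
PID_HIGH=$!
echo "High priority burner (PID: $PID_HIGH) started."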

# Move processes to their respective cgroups
echo "$PID_LOW" > "$LOW_PRIO_GROUP/cgroup.procs"
echo "$PID_HIGH" > "$HIGH_PRIO_GROUP/cgroup.procs"

echo "Processes moved to cgroups."
echo ""
echo "You can check their cgroups using: cat /proc/<PID>/cgroup"
echo "Example: cat /proc/$PID_LOW/cgroup"
echo ""

# Wait for processes to finish or for script to be interrupted
wait $PID_DEFAULT $PID_LOW $PID_HIGH

echo "All CPU burners finished. Cleaning up cgroups."

# Cleanup
# Move processes out of cgroups before removing them (optional, but good practice)
# In case processes are still running, move them to root cgroup
echo "$PID_LOW" > "$CGROUP_BASE/cgroup.procs" 2>/dev/null
echo "$PID_HIGH" > "$CGROUP_BASE/cgroup.procs" 2>/dev/null

# Remove cgroups
rmdir "$LOW_PRIO_GROUP"
rmdir "$HIGH_PRIO_GROUP"

echo "Cleanup complete."

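Steps:

Save cpu_burner.c and run_cpu_shares_test.sh.
chmod +x run_cpu_shares_test.sh
Run sudo ./run_cpu_shares_test.sh in one terminal.
In another terminal, run top or htop, or pidstat -u -p <PID_LOW>,<PID_HIGH>,<PID_DEFAULT> 5 (substituting the PIDs the script prints), to watch CPU usage.
As long as the three burners compete for the same CPU (for example on a single-core machine, or when they are pinned to one core with taskset), PID_HIGH shows clearly higher CPU usage than PID_LOW, and PID_DEFAULT falls in between: it stays in the root cgroup, where a task at the default nice level carries a weight of 1024. With the CPU saturated, usage splits roughly 2048:1024:512, that is 4:2:1. On a multi-core machine with idle cores, each burner simply gets a core of its own and the shares have no visible effect.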

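CFS Quota (cpu.cfs_period_us, cpu.cfs_quota_us): Absolute Quota Limits

Unlike the relative allocation of cpu.shares, the CFS (Completely Fair Scheduler) quota imposes a hard ceiling on CPU usage. It lets you define how much CPU time a cgroup may consume within a given period.

How It Works and When to Use It
- cpu.cfs_period_us: the length of the enforcement period in microseconds. The default is 100000 us (100 ms).
- cpu.cfs_quota_us: the total CPU time, also in microseconds, that the cgroup may use within one cfs_period_us. The default of -1 means no limit.
- Calculation: the number of CPU cores' worth of time a cgroup may use = cpu.cfs_quota_us / cpu.cfs_period_us. With a period of 100000 us and a quota of 50000 us, the cgroup can use at most 0.5 of one core. With a quota of 200000 us it can use up to 2 cores.
- When to use it: giving containers or virtual machines a fixed CPU allocation, preventing a single application from taking so much CPU that the system becomes unstable, and enforcing stricter isolation.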
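
The quota arithmetic above can be sketched in both directions. The helper names cfs_quota_us and effective_cpus are hypothetical, and the bounds checked here (a period between 1 ms and 1 s, a quota of at least 1 ms) reflect my reading of the kernel's CFS bandwidth documentation; treat them as assumptions to verify on your kernel.

```python
def cfs_quota_us(cpus, period_us=100_000):
    """Return the cpu.cfs_quota_us value that grants `cpus` cores of CPU time."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    if not 1_000 <= period_us <= 1_000_000:
        raise ValueError("period_us is assumed to lie between 1 ms and 1 s")
    # Assumed lower bound of 1 ms on the quota itself
    return max(int(round(cpus * period_us)), 1_000)


def effective_cpus(quota_us, period_us):
    """Return the core-equivalent ceiling, or None when the quota is unlimited (-1)."""
    if quota_us < 0:
        return None
    return quota_us / period_us


print(cfs_quota_us(0.5))                 # quota for half a core
print(cfs_quota_us(2))                   # quota for two cores
print(effective_cpus(50_000, 100_000))   # ceiling implied by 50000/100000
```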
Command-Line Practice: Setting an Absolute CPU Quota

# 1. Create a new cgroup
sudo mkdir /sys/fs/cgroup/cpu/my_limited_cpu_group

# 2. Set cfs_period_us and cfs_quota_us
# The default period is 100000 us (100 ms)
# A quota of 50000 us lets this cgroup use at most 0.5 of a CPU core
sudo sh -c "echo 100000 > /sys/fs/cgroup/cpu/my_limited_cpu_group/cpu.cfs_period_us"
sudo sh -c "echo 50000 > /sys/fs/cgroup/cpu/my_limited_cpu_group/cpu.cfs_quota_us"

# 3. Verify the settings
cat /sys/fs/cgroup/cpu/my_limited_cpu_group/cpu.cfs_period_us
cat /sys/fs/cgroup/cpu/my_limited_cpu_group/cpu.cfs_quota_us

echo "cgroup 'my_limited_cpu_group' created with 0.5 CPU core limit."

Code Example: Automating Configuration and Testing with a Python Script

We will use a Python script to create the cgroup, set the CFS quota, run the CPU-bound task, and observe its CPU usage.

run_cpu_quota_test.py:

#!/usr/bin/env python3
import os
import sys
import time
import subprocess
import signal

CGROUP_BASE = "/sys/fs/cgroup/cpu"
LIMITED_GROUP_NAME = "my_limited_cpu_group"
LIMITED_GROUP_PATH = os.path.join(CGROUP_BASE, LIMITED_GROUP_NAME)

def setup_cgroup(quota_us, period_us):
    """Creates a cgroup and sets CPU quota/period."""
    print(f"Setting up cgroup: {LIMITED_GROUP_PATH}")
    try:
        os.makedirs(LIMITED_GROUP_PATH, exist_ok=True)
    except OSError as e:
        print(f"Error creating cgroup directory: {e}. Are you root?")
        sys.exit(1)

    try:
        with open(os.path.join(LIMITED_GROUP_PATH, "cpu.cfs_period_us"), "w") as f:
            f.write(str(period_us))
        with open(os.path.join(LIMITED_GROUP_PATH, "cpu.cfs_quota_us"), "w") as f:
            f.write(str(quota_us))
        print(f"  cpu.cfs_period_us set to: {period_us}")
        print(f"  cpu.cfs_quota_us set to: {quota_us}")
    except IOError as e:
        print(f"Error writing to cgroup files: {e}. Check permissions and cgroup existence.")
        sys.exit(1)

def cleanup_cgroup():
    """Removes the created cgroup."""
    print(f"\nCleaning up cgroup: {LIMITED_GROUP_PATH}")
    try:
        # Before removing, move any lingering processes to the root cgroup
        with open(os.path.join(LIMITED_GROUP_PATH, "cgroup.procs"), "r") as f:
            pids = [int(p) for p in f.read().splitlines() if p.strip()]
        if pids:
            print(f"  Moving processes {pids} out of cgroup.")
            for pid in pids:
                # The kernel accepts one PID per write(), so reopen the file for each PID
                with open(os.path.join(CGROUP_BASE, "cgroup.procs"), "w") as f:
                    f.write(f"{pid}\n")

        os.rmdir(LIMITED_GROUP_PATH)
        print("  Cgroup removed successfully.")
    except OSError as e:
        print(f"Error cleaning up cgroup: {e}. It may still contain processes, or you may lack permission.")

def main():
    if os.geteuid() != 0:
        print("This script must be run as root.")
        sys.exit(1)

    # CPU Quota settings: 0.5 CPU core
    PERIOD_US = 100000  # 100ms
    QUOTA_US = 50000    # 50ms (0.5 CPU)
    TEST_DURATION = 30  # seconds

    setup_cgroup(QUOTA_US, PERIOD_US)

    # Compile the CPU burner program if not already compiled
    if not os.path.exists("./cpu_burner"):
        print("Compiling cpu_burner.c...")
        compile_cmd = ["gcc", "cpu_burner.c", "-o", "cpu_burner", "-Wall"]
        result = subprocess.run(compile_cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Compilation failed:\n{result.stderr}")
            cleanup_cgroup()
            sys.exit(1)
        print("Compilation successful.")

    print(f"\nRunning CPU burner with a limit of {QUOTA_US / PERIOD_US} CPU core(s) for {TEST_DURATION} seconds.")
    print("Monitor CPU usage with 'top' or 'htop' in another terminal.")
    print("Look for the 'cpu_burner' process and its CPU% (should be ~50% if system has enough capacity).")

    # Start the CPU burner process
    burner_cmd = ["./cpu_burner", str(TEST_DURATION)]
    burner_process = subprocess.Popen(burner_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    burner_pid = burner_process.pid
    print(f"CPU burner (PID: {burner_pid}) started.")

    # Move the process to the limited cgroup
    try:
        with open(os.path.join(LIMITED_GROUP_PATH, "cgroup.procs"), "w") as f:
            f.write(str(burner_pid))
        print(f"Process {burner_pid} moved to {LIMITED_GROUP_PATH}")
    except IOError as e:
        print(f"Error moving process to cgroup: {e}. Killing burner and cleaning up.")
        burner_process.terminate()
        burner_process.wait()
        cleanup_cgroup()
        sys.exit(1)

    print(f"You can check its cgroup: cat /proc/{burner_pid}/cgroup")

    # Wait for the burner process to complete
    try:
        burner_process.wait(timeout=TEST_DURATION + 5) # Add a small buffer
    except subprocess.TimeoutExpired:
        print("CPU burner did not finish in time, terminating.")
        burner_process.terminate()
        burner_process.wait()

    stdout, stderr = burner_process.communicate()
    if stdout:
        print(f"Burner stdout:\n{stdout}")
    if stderr:
        print(f"Burner stderr:\n{stderr}")

    # Read CPU statistics from the cgroup
    try:
        print("\nCPU usage statistics from cgroup:")
        with open(os.path.join(LIMITED_GROUP_PATH, "cpu.stat"), "r") as f:
            print(f.read())
    except IOError as e:
        print(f"Error reading cpu.stat: {e}")

    cleanup_cgroup()

if __name__ == "__main__":
    main()

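Steps:

Make sure cpu_burner.c is in the same directory (copy it from the previous example).
Save run_cpu_quota_test.py.
chmod +x run_cpu_quota_test.py
Run sudo ./run_cpu_quota_test.py in one terminal.
In another terminal, run top or htop. The cpu_burner process is held to about 50% CPU (more generally, quota_us / period_us * 100%). Even when the system has spare CPU, the process cannot exceed this ceiling.
The cpu.stat file reports detailed statistics: nr_periods (number of enforcement periods elapsed), nr_throttled (number of periods in which the cgroup was throttled), and throttled_time (total time throttled, in nanoseconds). These figures show how the quota behaves in practice.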
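
To turn those counters into something readable, a small parser is enough. This is a sketch that assumes the v1 cpu.stat layout of one "name value" pair per line; the helper names and the sample numbers are illustrative, not measurements.

```python
def parse_cpu_stat(text):
    """Parse cpu.stat ("name value" per line) into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Hypothetical contents of cpu.stat after a 30-second run
SAMPLE = "nr_periods 300\nnr_throttled 285\nthrottled_time 14250000000\n"

stats = parse_cpu_stat(SAMPLE)
print(f"throttled in {throttled_fraction(stats):.0%} of periods")
print(f"total throttled time: {stats['throttled_time'] / 1e9:.2f} s")
```

A high throttled fraction for a CPU-bound task is expected; for a latency-sensitive service it usually signals that the quota is too tight.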

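Real-Time (RT) Quota (cpu.rt_period_us, cpu.rt_runtime_us): Protecting Critical Tasks

The cpu controller can also limit real-time processes, those using the SCHED_FIFO or SCHED_RR scheduling policies. This matters for systems with strict real-time requirements, but it must be configured carefully to avoid destabilizing the system.

How It Works and Caveats
- cpu.rt_period_us: the real-time scheduling period in microseconds. The default is 1000000 us (1 second).
- cpu.rt_runtime_us: the maximum CPU time, in microseconds, that real-time processes in this cgroup may use within one rt_period_us.
- Calculation: CPU quota for real-time processes = cpu.rt_runtime_us / cpu.rt_period_us.
- When to use it: hard real-time systems such as industrial control and aerospace.
- Caveats: a bad configuration can starve normal processes or even make the system completely unresponsive. rt_runtime_us must not exceed rt_period_us, and a child cgroup cannot be granted more real-time runtime than its parent has available. A newly created cgroup starts with rt_runtime_us set to 0, so real-time tasks cannot be moved into it until a runtime is assigned. These files exist only when the kernel is built with CONFIG_RT_GROUP_SCHED. Writing them requires root; giving a process a real-time scheduling policy in the first place typically requires CAP_SYS_NICE.
Command-Line Practice: Configuring a Real-Time Quota

# 1. Create a new cgroup
sudo mkdir /sys/fs/cgroup/cpu/my_rt_group

# 2. Set rt_period_us and rt_runtime_us
# Suppose real-time processes in this cgroup may run at most 100 ms in each 1-second period (0.1 CPU)
sudo sh -c "echo 1000000 > /sys/fs/cgroup/cpu/my_rt_group/cpu.rt_period_us"
sudo sh -c "echo 100000 > /sys/fs/cgroup/cpu/my_rt_group/cpu.rt_runtime_us"

# 3. Verify the settings
cat /sys/fs/cgroup/cpu/my_rt_group/cpu.rt_period_us
cat /sys/fs/cgroup/cpu/my_rt_group/cpu.rt_runtime_us

echo "cgroup 'my_rt_group' created with 0.1 CPU core RT limit."

# Move a real-time process (assume its PID is 54321) into the cgroup
# sudo sh -c "echo 54321 > /sys/fs/cgroup/cpu/my_rt_group/cgroup.procs"

To actually test the real-time quota you need an application that runs under SCHED_FIFO or SCHED_RR. That usually takes dedicated code and the right system privileges.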

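I/O Resource Management: Balancing Disk Throughput and Operations

Besides CPU, disk I/O is another critical shared resource; unrestricted I/O can degrade system performance just as badly. The blkio controller manages I/O on block devices.

Block I/O Controller (blkio) Overview

The blkio controller lets us:

Set I/O weights: when the device is busy, divide I/O bandwidth proportionally.
Set I/O throttling: impose hard limits on I/O bandwidth (BPS, bytes per second) or IOPS (I/O operations per second).

Weights can be set for the cgroup as a whole or per block device; throttling limits are always set per device.

I/O Weight (blkio.weight, blkio.weight_device): Relative Priority

blkio.weight and blkio.weight_device provide relative I/O prioritization, similar to cpu.shares.

How It Works and When to Use It
- How it works: when several cgroups issue I/O to the same block device at the same time, the blkio controller divides the device's I/O time according to their weights. Proportional weights are implemented by the I/O scheduler, so they only take effect when the device uses CFQ (Completely Fair Queueing, removed in Linux 5.0) or BFQ (Budget Fair Queueing). With BFQ the file is named blkio.bfq.weight.
- blkio.weight: the cgroup's default I/O weight for all devices. Under CFQ the default is 500 and the allowed range is 10 to 1000.
- blkio.weight_device: overrides the weight for a specific device. For example, 8:0 500 sets a weight of 500 for /dev/sda (major number 8, minor number 0).
- When to use it: on a database server, give the critical transaction-processing group a higher I/O weight and the analytics or backup group a lower one.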
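
How the default weight and a per-device override combine can be sketched as follows. The function io_split, the group names, and the weights are hypothetical; the model simply assumes that the scheduler divides a saturated device in proportion to each group's effective weight on that device.

```python
def io_split(groups, device):
    """Return each group's expected fraction of a saturated device.

    groups maps a name to (default_weight, overrides), where overrides
    maps "major:minor" to a per-device weight, mirroring blkio.weight
    and blkio.weight_device.
    """
    weights = {
        name: overrides.get(device, default)
        for name, (default, overrides) in groups.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


groups = {
    "transactions": (800, {"8:0": 900}),  # override applies to device 8:0 only
    "backup": (100, {}),
}

print(io_split(groups, "8:0"))    # the override is in effect
print(io_split(groups, "8:16"))   # falls back to the default weights
```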
Command-Line Practice: Setting Global and Per-Device Weights

First make sure the blkio cgroup filesystem is mounted, usually at /sys/fs/cgroup/blkio.

# 1. Create the new cgroups
sudo mkdir /sys/fs/cgroup/blkio/my_low_io_group
sudo mkdir /sys/fs/cgroup/blkio/my_high_io_group

# 2. Set each cgroup's blkio.weight
# The CFQ default is 500. Here the low-priority group gets 200 and the high-priority group 800
sudo sh -c "echo 200 > /sys/fs/cgroup/blkio/my_low_io_group/blkio.weight"
sudo sh -c "echo 800 > /sys/fs/cgroup/blkio/my_high_io_group/blkio.weight"

# 3. Verify the settings
cat /sys/fs/cgroup/blkio/my_low_io_group/blkio.weight
cat /sys/fs/cgroup/blkio/my_high_io_group/blkio.weight

echo "cgroups 'my_low_io_group' and 'my_high_io_group' created with weights 200 and 800 respectively."

# Look up the device's major:minor numbers. /dev/sda is usually 8:0
# ls -l /dev/sda
DEVICE_MAJOR_MINOR=$(lsblk -dno MAJ:MIN /dev/sda | tr -d ' ')
echo "Detected /dev/sda as device: $DEVICE_MAJOR_MINOR"

# 4. Set a per-device weight (optional)
# Suppose high_io_group should get a higher weight of 900 on /dev/sda
echo "$DEVICE_MAJOR_MINOR 900" | sudo tee /sys/fs/cgroup/blkio/my_high_io_group/blkio.weight_device > /dev/null

# 5. Check the per-device weight (depending on the kernel and I/O scheduler, the file may not echo back the value you wrote)
# cat /sys/fs/cgroup/blkio/my_high_io_group/blkio.weight_device

Note: depending on the kernel version and I/O scheduler, reading blkio.weight_device may not show the value you wrote. The reliable way to confirm that the weight is in effect is an actual I/O load test.

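Code Example: An I/O-Bound Task in C, Bound to a cgroup

dd is a simple, widely available way to generate I/O load, but for finer control we write a small C program that writes to a file. One caveat: under cgroups v1 the blkio controller only accounts for I/O that a process submits itself. Buffered writes are flushed to disk later by kernel writeback and are not charged to the writer's cgroup, so a test workload must use direct I/O (O_DIRECT, or dd with oflag=direct) for weights and throttles to apply.

io_burner.c: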

#define _GNU_SOURCE // Exposes O_DIRECT in <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>
#include <time.h>

#define BUFFER_SIZE (1024 * 1024) // 1MB buffer
#define ALIGNMENT 4096            // O_DIRECT needs block-aligned buffers; 4096 covers common devices

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <output_file> <duration_seconds>\n", argv[0]);
        return 1;
    }

    const char *output_file = argv[1];
    int duration = atoi(argv[2]);
    if (duration <= 0) {
        fprintf(stderr, "Duration must be a positive integer.\n");
        return 1;
    }

    // O_DIRECT bypasses the page cache so the I/O is charged to this process's cgroup
    // (cgroups v1 does not attribute buffered writeback to the writer)
    int fd = open(output_file, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd == -1) {
        fprintf(stderr, "PID %d: Error opening file %s: %s\n", getpid(), output_file, strerror(errno));
        return 1;
    }

    // posix_memalign returns an error code instead of setting errno
    void *aligned = NULL;
    int rc = posix_memalign(&aligned, ALIGNMENT, BUFFER_SIZE);
    if (rc != 0) {
        fprintf(stderr, "PID %d: Error allocating buffer: %s\n", getpid(), strerror(rc));
        close(fd);
        return 1;
    }
    char *buffer = aligned;
    memset(buffer, 'A', BUFFER_SIZE); // Fill buffer with some data

    printf("PID %d: Starting I/O burn to %s for %d seconds...\n", getpid(), output_file, duration);

    time_t start_time = time(NULL);
    long long total_bytes_written = 0;

    while (time(NULL) - start_time < duration) {
        ssize_t bytes_written = write(fd, buffer, BUFFER_SIZE);
        if (bytes_written == -1) {
            fprintf(stderr, "PID %d: Error writing to file %s: %s\n", getpid(), output_file, strerror(errno));
            break;
        }
        total_bytes_written += bytes_written;
    }

    printf("PID %d: I/O burn finished. Total bytes written: %lld\n", getpid(), total_bytes_written);

    free(buffer);
    close(fd);
    return 0;
}

Compile io_burner.c:

gcc io_burner.c -o io_burner -Wall

Test script run_blkio_shares_test.sh, which starts the burners so you can observe them:

#!/bin/bash

# Ensure running as root for cgroup operations
if [[ $EUID -ne 0 ]]; then
   echo "This script must be run as root."
   exit 1
fi

CGROUP_BASE="/sys/fs/cgroup/blkio"
LOW_PRIO_GROUP="${CGROUP_BASE}/my_low_io_group"
HIGH_PRIO_GROUP="${CGROUP_BASE}/my_high_io_group"

# Cleanup previous cgroups and files if they exist
rm -f ./low_prio_output.bin ./high_prio_output.bin
rmdir "$LOW_PRIO_GROUP" 2>/dev/null
rmdir "$HIGH_PRIO_GROUP" 2>/dev/null

# Create cgroups
mkdir "$LOW_PRIO_GROUP"
mkdir "$HIGH_PRIO_GROUP"

# Set weights
echo 200 > "$LOW_PRIO_GROUP/blkio.weight"
echo 800 > "$HIGH_PRIO_GROUP/blkio.weight"

echo "cgroups setup complete."
echo "Low priority group weight: $(cat $LOW_PRIO_GROUP/blkio.weight)"
echo "High priority group weight: $(cat $HIGH_PRIO_GROUP/blkio.weight)"
echo ""

# Compile the IO burner program if not already compiled
if [ ! -f ./io_burner ]; then
    echo "Compiling io_burner.c..."
    gcc io_burner.c -o io_burner -Wall
    if [ $? -ne 0 ]; then
        echo "Compilation failed. Exiting."
        exit 1
    fi
fi

DURATION=30 # seconds

echo "Starting IO burners. Monitor disk I/O with 'iostat -x 1' or 'pidstat -d -p <PIDs> 1' in another terminal."
echo "Press Ctrl+C to stop the test."

# Run the IO burners in the background and capture their PIDs
./io_burner ./low_prio_output.bin $DURATION &
PID_LOW=$!
echo "Low priority burner (PID: $PID_LOW) started, writing to low_prio_output.bin."

./io_burner ./high_prio_output.bin $DURATION &
PID_HIGH=$!
echo "High priority burner (PID: $PID_HIGH) started, writing to high_prio_output.bin."

# Move processes to their respective cgroups
echo "$PID_LOW" > "$LOW_PRIO_GROUP/cgroup.procs"
echo "$PID_HIGH" > "$HIGH_PRIO_GROUP/cgroup.procs"

echo "Processes moved to cgroups."
echo ""
echo "You can check their cgroups using: cat /proc/<PID>/cgroup"
echo "Example: cat /proc/$PID_LOW/cgroup"
echo ""

# Wait for processes to finish or for script to be interrupted
wait $PID_LOW $PID_HIGH

echo "All IO burners finished. Cleaning up cgroups and output files."

# Cleanup
# Move processes out of cgroups before removing them
echo "$PID_LOW" > "$CGROUP_BASE/cgroup.procs" 2>/dev/null
echo "$PID_HIGH" > "$CGROUP_BASE/cgroup.procs" 2>/dev/null

# Remove cgroups
rmdir "$LOW_PRIO_GROUP"
rmdir "$HIGH_PRIO_GROUP"

# Remove output files
rm -f ./low_prio_output.bin ./high_prio_output.bin

echo "Cleanup complete."

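Steps:

Save io_burner.c and run_blkio_shares_test.sh.
chmod +x run_blkio_shares_test.sh
Run sudo ./run_blkio_shares_test.sh in one terminal.
In another terminal, run iostat -x 1 or pidstat -d -p <PID_LOW>,<PID_HIGH> 1 (substituting the PIDs the script prints) to watch disk throughput.
Provided the device uses a weight-aware I/O scheduler and the workload issues direct I/O, the PID_HIGH process achieves markedly higher write bandwidth and IOPS than the PID_LOW process, roughly in the 800:200 (4:1) ratio of their weights.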

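I/O Throttling (blkio.throttle.*): Absolute Rate Limits

The blkio controller also provides hard I/O throttling, which sets absolute ceilings on read and write bandwidth or IOPS.

Read/Write Bandwidth Limits (read_bps_device, write_bps_device)
- blkio.throttle.read_bps_device: limits read bandwidth from the specified device, in bytes per second.
- blkio.throttle.write_bps_device: limits write bandwidth to the specified device, in bytes per second.
- Format: major:minor bytes_per_second. For example, 8:0 1048576 limits bandwidth on /dev/sda to 1 MiB/s.
Read/Write IOPS Limits (read_iops_device, write_iops_device)
- blkio.throttle.read_iops_device: limits read operations on the specified device, in operations per second (IOPS).
- blkio.throttle.write_iops_device: limits write operations on the specified device, in operations per second.
- Format: major:minor iops_limit. For example, 8:0 100 limits /dev/sda to 100 IOPS.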
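
The rule format is simple enough to generate programmatically. The sketch below builds a rule string and estimates how long a transfer takes under a bandwidth cap; both helper names are hypothetical, and the estimate assumes the throttle is the only bottleneck.

```python
def throttle_rule(major, minor, limit):
    """Build the "major:minor limit" line that the blkio.throttle.* files expect."""
    if limit < 0:
        raise ValueError("limit must not be negative")
    return f"{major}:{minor} {limit}"


def seconds_to_transfer(total_bytes, bps_limit):
    """Lower bound on the time needed to move total_bytes at bps_limit bytes/s."""
    return total_bytes / bps_limit


MIB = 1024 * 1024

print(throttle_rule(8, 0, 1 * MIB))            # bandwidth rule for /dev/sda
print(throttle_rule(8, 0, 100))                # IOPS rule for /dev/sda
print(seconds_to_transfer(60 * MIB, 2 * MIB))  # 60 MiB at 2 MiB/s
```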
Command-Line Practice: Configuring I/O Throttling

# 1. Create a new cgroup
sudo mkdir /sys/fs/cgroup/blkio/my_throttled_io_group

# 2. Look up the device's major:minor numbers. /dev/sda is usually 8:0
DEVICE_MAJOR_MINOR=$(lsblk -dno MAJ:MIN /dev/sda | tr -d ' ')
echo "Detected /dev/sda as device: $DEVICE_MAJOR_MINOR"

# 3. Set the I/O throttle limits
# Limit write bandwidth to /dev/sda to 2 MiB/s (2 * 1024 * 1024 bytes/s)
echo "$DEVICE_MAJOR_MINOR 2097152" | sudo tee /sys/fs/cgroup/blkio/my_throttled_io_group/blkio.throttle.write_bps_device > /dev/null
# Limit reads from /dev/sda to 500 IOPS
echo "$DEVICE_MAJOR_MINOR 500" | sudo tee /sys/fs/cgroup/blkio/my_throttled_io_group/blkio.throttle.read_iops_device > /dev/null

echo "cgroup 'my_throttled_io_group' created with write BPS limit 2MB/s and read IOPS limit 500 for device $DEVICE_MAJOR_MINOR."

# 4. Verify the settings (reading a throttle file shows the current limits)
cat /sys/fs/cgroup/blkio/my_throttled_io_group/blkio.throttle.write_bps_device
cat /sys/fs/cgroup/blkio/my_throttled_io_group/blkio.throttle.read_iops_device

Code Example: Automating I/O Throttle Configuration and Testing with a Python Script

We keep using io_burner.c, but this time a Python script sets up the I/O throttle.

run_blkio_throttle_test.py:

#!/usr/bin/env python3
import os
import sys
import time
import subprocess
import signal

CGROUP_BASE = "/sys/fs/cgroup/blkio"
THROTTLED_GROUP_NAME = "my_throttled_io_group"
THROTTLED_GROUP_PATH = os.path.join(CGROUP_BASE, THROTTLED_GROUP_NAME)

def get_device_major_minor(device_path):
    """Gets the major:minor number for a block device path such as /dev/sda."""
    try:
        # For a device node, st_rdev holds the device number
        stat_info = os.stat(device_path)
        major = os.major(stat_info.st_rdev)
        minor = os.minor(stat_info.st_rdev)
        return f"{major}:{minor}"
    except OSError as e:
        print(f"Error getting major:minor for {device_path}: {e}")
        sys.exit(1)

def setup_cgroup(device_major_minor, write_bps_limit, read_iops_limit):
    """Creates a cgroup and sets I/O throttle limits."""
    print(f"Setting up cgroup: {THROTTLED_GROUP_PATH}")
    try:
        os.makedirs(THROTTLED_GROUP_PATH, exist_ok=True)
    except OSError as e:
        print(f"Error creating cgroup directory: {e}. Are you root?")
        sys.exit(1)

    try:
        if write_bps_limit is not None:
            with open(os.path.join(THROTTLED_GROUP_PATH, "blkio.throttle.write_bps_device"), "w") as f:
                f.write(f"{device_major_minor} {write_bps_limit}")
            print(f"  Write BPS limit set to {write_bps_limit} for {device_major_minor}")

        if read_iops_limit is not None:
            with open(os.path.join(THROTTLED_GROUP_PATH, "blkio.throttle.read_iops_device"), "w") as f:
                f.write(f"{device_major_minor} {read_iops_limit}")
            print(f"  Read IOPS limit set to {read_iops_limit} for {device_major_minor}")

    except IOError as e:
        print(f"Error writing to cgroup files: {e}. Check permissions and cgroup existence.")
        sys.exit(1)

def cleanup_cgroup():
    """Removes the created cgroup."""
    print(f"\nCleaning up cgroup: {THROTTLED_GROUP_PATH}")
    try:
        # Before removing, move any lingering processes to the root cgroup
        with open(os.path.join(THROTTLED_GROUP_PATH, "cgroup.procs"), "r") as f:
            pids = [int(p) for p in f.read().splitlines() if p.strip()]
        if pids:
            print(f"  Moving processes {pids} out of cgroup.")
            for pid in pids:
                # The kernel accepts one PID per write(), so reopen the file for each PID
                with open(os.path.join(CGROUP_BASE, "cgroup.procs"), "w") as f:
                    f.write(f"{pid}\n")

        os.rmdir(THROTTLED_GROUP_PATH)
        print("  Cgroup removed successfully.")
    except OSError as e:
        print(f"Error cleaning up cgroup: {e}. It may still contain processes, or you may lack permission.")

def main():
    if os.geteuid() != 0:
        print("This script must be run as root.")
        sys.exit(1)

    # Target device: this must be the whole disk that holds the output file.
    # /dev/sda is an assumption; adjust it for your system (check with lsblk).
    TARGET_DEVICE_PATH = "/dev/sda"
    device_major_minor = get_device_major_minor(TARGET_DEVICE_PATH)

    # I/O Throttle settings
    WRITE_BPS_LIMIT = 2 * 1024 * 1024  # 2 MB/s
    READ_IOPS_LIMIT = None # Not setting read IOPS for this example
    TEST_DURATION = 30  # seconds

    # Cleanup previous output file
    output_file = "./throttled_output.bin"
    if os.path.exists(output_file):
        os.remove(output_file)

    setup_cgroup(device_major_minor, WRITE_BPS_LIMIT, READ_IOPS_LIMIT)

    # Compile the IO burner program if not already compiled
    if not os.path.exists("./io_burner"):
        print("Compiling io_burner.c...")
        compile_cmd = ["gcc", "io_burner.c", "-o", "io_burner", "-Wall"]
        result = subprocess.run(compile_cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Compilation failed:\n{result.stderr}")
            cleanup_cgroup()
            sys.exit(1)
        print("Compilation successful.")

    print(f"\nRunning IO burner with write BPS limit of {WRITE_BPS_LIMIT / (1024*1024)} MB/s for {TEST_DURATION} seconds.")
    print("Monitor disk I/O with 'iostat -x 1' or 'pidstat -d -p <PID> 1' in another terminal.")

    # Start the IO burner process
    burner_cmd = ["./io_burner", output_file, str(TEST_DURATION)]
    burner_process = subprocess.Popen(burner_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    burner_pid = burner_process.pid
    print(f"IO burner (PID: {burner_pid}) started.")

    # Move the process to the throttled cgroup
    try:
        with open(os.path.join(THROTTLED_GROUP_PATH, "cgroup.procs"), "w") as f:
            f.write(str(burner_pid))
        print(f"Process {burner_pid} moved to {THROTTLED_GROUP_PATH}")
    except IOError as e:
        print(f"Error moving process to cgroup: {e}. Killing burner and cleaning up.")
        burner_process.terminate()
        burner_process.wait()
        cleanup_cgroup()
        sys.exit(1)

    print(f"You can check its cgroup: cat /proc/{burner_pid}/cgroup")

    # Wait for the burner process to complete
    try:
        burner_process.wait(timeout=TEST_DURATION + 5) # Add a small buffer
    except subprocess.TimeoutExpired:
        print("IO burner did not finish in time, terminating.")
        burner_process.terminate()
        burner_process.wait()

    stdout, stderr = burner_process.communicate()
    if stdout:
        print(f"Burner stdout:\n{stdout}")
    if stderr:
        print(f"Burner stderr:\n{stderr}")

    # Read I/O statistics from the cgroup
    try:
        print("\nI/O usage statistics from cgroup (blkio.throttle.io_service_bytes):")
        with open(os.path.join(THROTTLED_GROUP_PATH, "blkio.throttle.io_service_bytes"), "r") as f:
            print(f.read())
        print("\nI/O usage statistics from cgroup (blkio.throttle.io_serviced):")
        with open(os.path.join(THROTTLED_GROUP_PATH, "blkio.throttle.io_serviced"), "r") as f:
            print(f.read())
    except IOError as e:
        print(f"Error reading blkio.throttle.io_service_bytes or blkio.throttle.io_serviced: {e}")

    cleanup_cgroup()
    if os.path.exists(output_file):
        os.remove(output_file)

if __name__ == "__main__":
    main()

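Steps:

Make sure io_burner.c is in the same directory.
Save run_blkio_throttle_test.py.
chmod +x run_blkio_throttle_test.py
Run sudo ./run_blkio_throttle_test.py in one terminal.
In another terminal, run iostat -x 1 or pidstat -d -p <PID_OF_BURNER> 1 to watch the write bandwidth.
As long as the workload issues direct I/O to the throttled device, the io_burner process is held to about 2 MiB/s of write bandwidth, even though the disk itself can deliver far more.
The blkio.throttle.io_service_bytes and blkio.throttle.io_serviced files report per-device I/O statistics, including bytes and operations read and written, and can be used to verify the effect of the throttle.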
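
Those statistics files are line-oriented and easy to parse. The sketch below assumes the v1 layout of "major:minor Operation value" lines followed by a grand-total line; the function name and the sample figures are illustrative rather than captured output.

```python
def parse_blkio_stat(text):
    """Parse blkio.throttle.io_service_bytes-style text into {device: {op: value}}."""
    per_device = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank lines and the trailing grand-total line
        device, operation, value = fields
        per_device.setdefault(device, {})[operation] = int(value)
    return per_device


# Hypothetical contents after a 30-second throttled run
SAMPLE = (
    "8:0 Read 0\n"
    "8:0 Write 62914560\n"
    "8:0 Sync 62914560\n"
    "8:0 Async 0\n"
    "8:0 Total 62914560\n"
    "Total 62914560\n"
)

stats = parse_blkio_stat(SAMPLE)
written = stats["8:0"]["Write"]
print(f"average write rate over 30 s: {written / 30 / (1024 * 1024):.2f} MiB/s")
```

Comparing the computed average against the configured limit is a quick sanity check that the throttle was in effect for the whole run.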

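cgroups v1 and v2: Evolution and Choice

Linux cgroups come in two major versions, v1 and v2. The examples so far follow v1 behavior, because v1 is still widely deployed and its controllers are relatively independent of one another.

v1: Hierarchy Limitations and Controller Coupling

Characteristics of cgroups v1:

Multiple hierarchies: each controller (or set of controllers) can have its own hierarchy. A process can therefore belong to several cgroups at once, one in each hierarchy. For example, a process can be in group_A in the cpu hierarchy and in group_B in the memory hierarchy.
Independently mounted controllers: each controller can be mounted separately at /sys/fs/cgroup/<controller_name>.
Complexity: the multi-hierarchy design makes management harder, especially where resource sharing and inheritance are concerned.
Stale cgroups: when the last process in a cgroup exits, the cgroup directory is not removed automatically and must be cleaned up by hand (or by a release_agent).

v2: A Unified Hierarchy and Stronger Isolation

cgroups v2 was introduced to address the complexity and design flaws of v1. It aims for a more unified and more robust resource-management model.

Characteristics of cgroups v2:

Unified hierarchy: all controllers share one hierarchy, and a process belongs to exactly one cgroup.
Root cgroup: there is a single root cgroup, and every other cgroup descends from it.
Enabling and disabling controllers: controllers are not mounted on child cgroups directly. Instead, a parent's cgroup.subtree_control file enables or disables controllers for its children.
Clearer resource model: v2 enforces stricter rules for distributing resources. Under the "no internal processes" rule, a non-root cgroup can enable resource controllers for its children only if it contains no processes itself, so in practice processes live in leaf cgroups.
Better delegation: management of an entire subtree can be delegated to an unprivileged user.
Default mount point: usually /sys/fs/cgroup.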
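
The v1 CPU settings used earlier have v2 counterparts: cpu.shares corresponds to cpu.weight (range 1 to 10000), and the quota/period pair is written as a single "quota period" line in cpu.max. The sketch below shows one linear shares-to-weight mapping that container runtimes such as runc have used; treat the exact formula as an assumption, since tools differ and may change it, and note that both function names are hypothetical.

```python
def shares_to_cpu_weight(shares):
    """Map v1 cpu.shares (2..262144) onto v2 cpu.weight (1..10000), linearly."""
    if not 2 <= shares <= 262144:
        raise ValueError("cpu.shares is assumed to lie between 2 and 262144")
    return 1 + ((shares - 2) * 9999) // 262142


def cpu_max_line(quota_us, period_us=100_000):
    """Build the v2 cpu.max content from a v1-style quota and period."""
    quota = "max" if quota_us < 0 else str(quota_us)
    return f"{quota} {period_us}"


print(shares_to_cpu_weight(1024))   # weight for the v1 default shares
print(cpu_max_line(50_000))         # half a core
print(cpu_max_line(-1))             # unlimited
```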

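How to Tell Which Version a System Uses

Check /proc/mounts or the output of the mount command to see which version of cgroups is in use:

If you see several separate mount points such as /sys/fs/cgroup/cpu and /sys/fs/cgroup/blkio, the system uses cgroups v1.
If you see a single /sys/fs/cgroup mount point whose type is cgroup2, the system uses cgroups v2.

mount | grep cgroup

For example, output like this: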

cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)

indicates v1.

Output like this:

cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)

indicates v2.
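
The same check can be scripted. This sketch classifies a system from text in /proc/mounts format (device, mount point, filesystem type, options); the function name, the "hybrid" label for systems that mount both types, and the sample lines are assumptions for illustration.

```python
def detect_cgroup_mode(mounts_text):
    """Classify cgroup usage from /proc/mounts-style text."""
    fstypes = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # The third field of each /proc/mounts line is the filesystem type
        if len(fields) >= 3 and fields[2] in ("cgroup", "cgroup2"):
            fstypes.add(fields[2])
    if fstypes == {"cgroup2"}:
        return "v2"
    if fstypes == {"cgroup"}:
        return "v1"
    if fstypes:
        return "hybrid"
    return "none"


V1_SAMPLE = (
    "tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec 0 0\n"
    "cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0\n"
    "cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0\n"
)
V2_SAMPLE = "cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0\n"

print(detect_cgroup_mode(V1_SAMPLE))
print(detect_cgroup_mode(V2_SAMPLE))
```

On a real machine, pass it the result of open("/proc/mounts").read().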

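Modern Linux distributions (Fedora 31 and later, Ubuntu 21.10 and later) default to cgroups v2, and container runtimes such as containerd and runc support it. Understanding v1 still helps with v2, because the core resource-control ideas carry over.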

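cgroups in Modern Systems

cgroups are not just a low-level tool for administrators to operate by hand; they are deeply integrated into modern Linux systems and the container ecosystem.

systemd: A Unified Resource-Management Interface

systemd, the de facto standard init system and service manager on modern Linux distributions, is tightly integrated with cgroups. It automatically creates and manages cgroups for every service, scope (such as a user login session), and slice (a resource group such as system.slice or user.slice).

You can set resource limits on a service with systemctl, and systemd translates them into the corresponding cgroup configuration.
For example, to set CPU and I/O limits on a service named my_service.service:

sudo systemctl set-property my_service.service CPUQuota=50%
sudo systemctl set-property my_service.service IOWeight=200

Behind the scenes, systemd applies these settings to the cgroup of my_service.service. On a cgroups v1 system that means writing cpu.cfs_quota_us and blkio.weight; on cgroups v2 the corresponding files are cpu.max and io.weight.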

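Docker and Kubernetes: The Foundation of Container Technology

cgroups are the foundation on which container technology builds resource isolation.

Docker: every Docker container runs in its own set of cgroups. When you pass options such as --cpus, --cpu-shares, --blkio-weight, or --device-read-bps to docker run, the Docker daemon uses cgroups to create and configure the matching limits in the kernel. These limits keep resource allocation between containers fair and isolated.
Kubernetes: as a container orchestrator, Kubernetes uses the resources.requests and resources.limits fields of the Pod spec to declare what each container needs and may use. For example, a CPU limit of cpu: "500m" (500 millicores) translates to cpu.cfs_quota_us=50000 with cpu.cfs_period_us=100000, while CPU requests are mapped to cpu.shares. Kubernetes relies on the underlying container runtime (such as containerd or CRI-O), which ultimately enforces these limits through cgroups.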
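
The millicore translation can be written out explicitly. This sketch mirrors the conversion as I understand the kubelet to perform it on cgroups v1; the minimum values (1000 us of quota, 2 shares) are assumptions to verify against your Kubernetes version, and the function names are hypothetical.

```python
def millicpu_to_quota_us(millicpu, period_us=100_000):
    """CPU limit in millicores -> cpu.cfs_quota_us for the given period."""
    # Assumed floor of 1000 us, matching the kernel's minimum quota
    return max(millicpu * period_us // 1000, 1_000)


def millicpu_to_shares(millicpu):
    """CPU request in millicores -> cpu.shares (1000m corresponds to 1024)."""
    # Assumed floor of 2, the smallest cpu.shares value the kernel accepts
    return max(millicpu * 1024 // 1000, 2)


print(millicpu_to_quota_us(500))   # limit of 500m
print(millicpu_to_shares(250))     # request of 250m
```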

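Best Practices and Caveats

Using cgroups effectively takes a few good habits and an awareness of the pitfalls.

Hierarchy design and naming:
- Keep the hierarchy clear and logical; it usually mirrors how your services are organized.
- Use meaningful cgroup names so groups are easy to identify and manage.
- Avoid creating cgroups you do not need; each one adds management overhead.
Granularity of isolation:
- Choose the granularity your workload actually needs. Too fine adds overhead; too coarse leaves isolation gaps.
- For CPU, cpu.shares suits soft prioritization and cpu.cfs_quota_us suits hard ceilings.
- For I/O, blkio.weight suits relative priority and blkio.throttle.* suits hard bandwidth or IOPS limits.
Monitoring and tuning:
- Monitor cgroup resource usage regularly. Files such as cpu.stat and blkio.throttle.io_service_bytes provide valuable statistics.
- Combine system tools such as top, htop, pidstat, and iostat with each process's cgroup membership to find bottlenecks.
- Adjust cgroup parameters based on what you measure to balance performance and utilization.
Security:
- Creating and modifying cgroups normally requires root privileges, or write permission on a cgroup directory that has been delegated to you.
- Be careful when delegating cgroup management to unprivileged users. Some cgroups v1 features, notably release_agent, have been used in container escapes; cgroups v2 is designed with safer delegation in mind.
Behavior at the limits:
- A process that hits its memory limit may be killed by the OOM (Out Of Memory) killer.
- A process that hits a CPU or I/O limit is throttled: it slows down but does not crash.
- Knowing these behaviors helps when diagnosing problems.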

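Looking Ahead: Resource Management Keeps Evolving

cgroups are at the core of Linux resource management, and they continue to evolve. cgroups v2 is becoming the new standard, offering a more unified, more manageable, and more capable isolation model. Future kernels may add new controllers or improve the performance and features of existing ones.

For us as engineers, understanding cgroups is more than memorizing a few commands and file interfaces; it is a window into how the operating system manages resources. That understanding is essential for designing high-performance, highly available, scalable distributed systems. Mastering cgroups is like holding the conductor's baton for Linux resource allocation: it lets you tune application performance and keep the system running reliably. Let us look forward to, and embrace, the continued progress of Linux resource management.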
