The Assembly-Level Truth of Context Switching: How Does the C++ Runtime Save and Restore CPU Register State? - 智猿学院

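Colleagues, and everyone curious about how systems work under the hood, welcome to today's lecture. Together we will dig into a core concept that is everywhere in modern operating systems and high-performance computing, yet usually hidden beneath layers of abstraction: context switching. More specifically, we will look from the assembly-language level at how the C++ runtime environment saves and restores CPU register state, and so take the mystery out of multitasking.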

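We take multitasking for granted: listening to music while browsing the web while a build runs in the background. It feels as though the computer does many things at once. On a single CPU core, however, this is not true parallelism but rapid time-sharing: the core switches between tasks so quickly that we never notice. (A multi-core machine does run tasks in parallel, but each core still time-shares among the tasks assigned to it.) The act of handing the CPU from one task to another is a context switch.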

1. The Foundation of Multitasking: Context and Why Switching Is Needed

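To understand context switching, we first need to pin down what a "context" is. From the CPU's point of view, a task's context is a complete snapshot of its current execution state. The snapshot contains:

CPU register values: the general-purpose registers, instruction pointer, stack pointer, flags register, and so on. These hold the most critical data and control information of the running task.
Memory state: the task's code, data, heap and stack. For a process switch, the virtual address space mapping (page tables) also matters.
Other system resources: open file handles, network connections, signal handlers, and the like.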

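When the operating system decides to pause the current task (Task A) and run another (Task B), it must:

Save Task A's current context so that Task A can later resume seamlessly from where it was interrupted.
Load Task B's context so that it continues from where it was last paused.
Hand control of the CPU to Task B.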
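The three steps above can be sketched as a toy model in plain C++. This is only an illustration of the bookkeeping, not how any kernel is written: the names CpuState, Task and context_switch are hypothetical, the register set is deliberately abbreviated, and "handing over control" is reduced to returning the id of the task that now owns the CPU.

```cpp
#include <array>
#include <cstdint>

// Illustrative model only: a real CPU has far more state than this.
struct CpuState {
    std::array<std::uint64_t, 16> gpr{};  // stands in for RAX..R15
    std::uint64_t rip = 0;                // instruction pointer
    std::uint64_t rsp = 0;                // stack pointer
    std::uint64_t rflags = 0;             // flags register
};

struct Task {
    int id = 0;
    CpuState saved;  // snapshot taken the last time this task was paused
};

// Performs the three steps of a context switch on the modelled CPU and
// returns the id of the task that now owns it.
int context_switch(CpuState& cpu, Task& from, Task& to) {
    from.saved = cpu;  // 1. save the outgoing task's context
    cpu = to.saved;    // 2. load the incoming task's context
    return to.id;      // 3. control now belongs to the incoming task
}
```

Switching from A to B and later from B back to A leaves the modelled CPU exactly as A last saw it, which is the property the real mechanism has to guarantee.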

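This process is the context switch. It happens in several situations:

Time slice expiry: the scheduler periodically interrupts the running task when its time slice runs out, giving other tasks a chance to run.
System calls: a task asks the operating system for a service (reading or writing a file, creating a thread); while it waits for the service to complete, the CPU may be given to another task.
Interrupts: when a hardware event occurs (keyboard input, an arriving network packet, a timer tick), the CPU suspends the current task to handle it, and a context switch may follow.
Page faults: a task touches a page that is not in physical memory, so the operating system has to step in and load it; a switch may happen in the meantime.
Task synchronization: a task waits for a resource (a lock, a condition variable) and gives up the CPU voluntarily.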
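On POSIX systems the distinction between giving up the CPU voluntarily (blocking on I/O or a lock) and being preempted (time slice expiry, interrupts) is visible from user space. The sketch below reads both counters for the current process with getrusage; SwitchCounts and context_switch_counts are hypothetical names chosen for this example.

```cpp
#include <sys/resource.h>

struct SwitchCounts {
    long voluntary;    // task blocked or yielded on its own (ru_nvcsw)
    long involuntary;  // task was preempted by the scheduler (ru_nivcsw)
};

// Returns the context-switch counters the kernel keeps for this process,
// or {-1, -1} if getrusage fails.
SwitchCounts context_switch_counts() {
    rusage usage{};
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return {-1, -1};
    }
    return {usage.ru_nvcsw, usage.ru_nivcsw};
}
```

Calling it before and after a blocking operation such as a short sleep is a simple way to watch the voluntary counter grow, while a long CPU-bound loop tends to raise the involuntary one.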

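Context switching is the foundation on which operating systems build multitasking and multithreading, and on which modern C++ runtimes build concurrency mechanisms such as coroutines.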

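2. The CPU's View: Core Registers and Their Roles (x86-64)

To get down to the assembly level we need to know the CPU's internal structure, in particular its registers. We focus on x86-64, the dominant architecture in today's servers and desktops.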

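Register class	Register names	Main purpose
General-purpose registers	RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, R8-R15	Hold data, addresses, function arguments, return values, and so on. The ABI defines which are caller-saved and which are callee-saved.
Instruction pointer	RIP (Instruction Pointer)	Holds the address of the next instruction to execute and so determines control flow.
Stack pointer	RSP (Stack Pointer)	Holds the address of the current top of the stack; manages the function call stack.
Base pointer	RBP (Base Pointer)	Commonly used as the frame pointer: it points at the base of the current stack frame, which makes locals and arguments easy to address.
Flags register	RFLAGS (Flags Register)	Holds status flags (zero, carry, overflow, ...) and control flags (such as the interrupt-enable flag).
Segment registers	CS, SS, DS, ES, FS, GS	Hold segment selectors. In 64-bit mode the bases of CS, SS, DS and ES are treated as zero (a flat model); FS and GS keep a usable base address and are commonly used for thread-local storage (TLS).
Floating-point/SIMD registers	XMM0-XMM15, YMM0-YMM15, ZMM0-ZMM31	Used for floating-point arithmetic and single-instruction multiple-data (SIMD) vector operations.
Control registers	CR0, CR2, CR3, CR4	Hold flags and addresses that control the CPU's operating mode. For example, CR3 holds the physical base address of the top-level page table (the PML4 under 4-level paging) and is central to virtual memory management.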

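Which of these registers must be saved and restored in a context switch?

RIP (Instruction Pointer): absolutely required. Without it the CPU does not know where to continue.
RSP (Stack Pointer): absolutely required. Every task has its own stack, and only a restored RSP lets the task use it correctly.
RFLAGS (Flags Register): absolutely required when a task can be interrupted at an arbitrary instruction. The condition codes and the interrupt-enable state have to be preserved.
General-purpose registers (RAX-R15): must be saved. They hold the task's temporary data and intermediate results.
Floating-point/SIMD registers (XMM/YMM/ZMM): usually need to be saved. Their state is critical in multimedia and scientific workloads in particular.
Segment registers (CS, SS, DS, ES, FS, GS): under the modern flat memory model, CS, SS, DS and ES normally stay constant in user mode, so they are not always saved and restored explicitly. FS and GS may carry thread-local storage, so their base addresses do need to be saved and restored.
Control registers (CRx): for a process switch, CR3 (the page-table base) must change because every process has its own virtual address space. For a thread switch within one process, CR3 normally stays the same.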

Knowing what these registers do is the groundwork for reading context-switch assembly code.

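3. Kernel-Mode Context Switching: Alternating Between Processes and Threads

The operating system kernel is the final arbiter of context switching. On an interrupt (including the timer interrupt that signals an expired time slice) or a system call, the CPU moves from user mode to kernel mode and hands control to the kernel.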

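3.1 What the Hardware Saves Automatically and How the Kernel Takes Over

Take a typical interrupt as an example:

Hardware save: when an interrupt arrives in user mode, the CPU first switches to the kernel stack (taken from the TSS: the RSP0 field, or an IST entry if the gate requests one) and then pushes the interrupted SS (stack segment), RSP (user stack pointer), RFLAGS, CS (code segment) and RIP onto it, in that order. For exceptions that define one, an error code is pushed as well.
Privilege-level switch: the CPU loads CS and RIP from the gate descriptor in the Interrupt Descriptor Table (IDT), which moves execution to ring 0 at the handler's address.
Kernel takes over: control passes to the entry point of the kernel's interrupt handler.

At this point the kernel already holds some of the interrupted task's key register state, but not all of it. It still has to save the remaining registers itself.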
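The hardware-pushed frame has a fixed shape in 64-bit mode, so it can be described as a plain struct. The sketch below lists the fields from the lowest address upward (the order a handler sees them starting at RSP, for vectors that push no error code). InterruptFrame and interrupted_user_mode are hypothetical names for this illustration, not kernel identifiers.

```cpp
#include <cstddef>
#include <cstdint>

// Frame pushed by the CPU on interrupt entry in 64-bit mode, lowest address
// first. SS is pushed first and therefore ends up at the highest address.
struct InterruptFrame {
    std::uint64_t rip;     // where the interrupted code resumes
    std::uint64_t cs;      // code segment selector of the interrupted code
    std::uint64_t rflags;  // flags at the moment of the interrupt
    std::uint64_t rsp;     // stack pointer of the interrupted code
    std::uint64_t ss;      // stack segment selector
};

static_assert(sizeof(InterruptFrame) == 40, "five 8-byte slots");
static_assert(offsetof(InterruptFrame, ss) == 32, "SS sits at the highest address");

// The low two bits of the saved CS selector give the privilege level of the
// interrupted code: 3 means user mode, 0 means kernel mode.
constexpr bool interrupted_user_mode(const InterruptFrame& frame) {
    return (frame.cs & 3) == 3;
}
```

A kernel entry path makes this same check on the saved CS to decide whether it is returning to user code or to an interrupted piece of kernel code.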

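3.2 How the Kernel Saves the "Complete" Context

Once inside the common interrupt handling path, the kernel saves the remaining general-purpose registers. This is usually done with a series of PUSH instructions, or by storing the register values directly into a memory area such as the task's kernel stack or a dedicated Task Control Block (TCB).

Taking Linux as the example, the simplified but representative _switch_to routine below shows the core assembly logic of a thread switch. The name and layout are simplified for teaching: in recent kernels the assembly part is __switch_to_asm in arch/x86/entry/entry_64.S and the C part is __switch_to() in arch/x86/kernel/process_64.c, and the real code pushes the callee-saved registers onto the outgoing task's kernel stack instead of storing each one in a struct field. Note that _switch_to runs in kernel mode and switches between two threads that are both already in the kernel. When the running thread enters schedule() and the scheduler picks a different thread to run, the call chain ends in a routine of this kind.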

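; Pseudocode: simplified Linux-style _switch_to flow (x86-64, AT&T operand order)
; Arguments: %rdi = prev_task (pointer to the outgoing task's thread_info structure)
;            %rsi = next_task (pointer to the incoming task's thread_info structure)
;            %rdx = last (a pointer used to store prev_task's stack pointer)

ENTRY(_switch_to)
    ; 1. Save the CPU state of the current task (prev_task).
    ;    Only callee-saved registers are stored here. Caller-saved registers
    ;    were already saved before _switch_to was entered (or their values do
    ;    not matter for resuming prev_task), or they carry the function
    ;    arguments, whose values have already been passed.
    ;    The fuller save happens at interrupt / system-call entry.

    ; The key step is saving the current task's stack pointer (RSP); everything
    ; needed to resume the task, including the interrupt return frame, lives on that stack.
    ; prev_task->thread.sp = RSP
    MOVQ %rsp, THREAD_SP(%rdi) ; THREAD_SP is the offset of the stack-pointer field in the structure

    ; Save the other callee-saved registers
    MOVQ %rbp, THREAD_BP(%rdi)
    MOVQ %rbx, THREAD_BX(%rdi)
    MOVQ %r12, THREAD_R12(%rdi)
    MOVQ %r13, THREAD_R13(%rdi)
    MOVQ %r14, THREAD_R14(%rdi)
    MOVQ %r15, THREAD_R15(%rdi)

    ; Save FPU/SIMD state, typically with FXSAVE or XSAVE (restored with FXRSTOR or XRSTOR).
    ; Older kernels did this lazily, touching FPU state only when the next task
    ; first used the FPU; recent Linux kernels save eagerly and defer the
    ; restore until the return to user mode. Shown here only conceptually.
    ; FXSAVE prev_task->thread.fpu_state

    ; 2. Switch to the stack of the next task (next_task)
    ;    by loading its saved stack pointer into RSP
    MOVQ THREAD_SP(%rsi), %rsp

    ; 3. Restore the CPU state of the next task
    MOVQ THREAD_BP(%rsi), %rbp
    MOVQ THREAD_BX(%rsi), %rbx
    MOVQ THREAD_R12(%rsi), %r12
    MOVQ THREAD_R13(%rsi), %r13
    MOVQ THREAD_R14(%rsi), %r14
    MOVQ THREAD_R15(%rsi), %r15

    ; Restore FPU/SIMD state
    ; FXRSTOR next_task->thread.fpu_state

    ; 4. (If needed) switch CR3 (page-table base)
    ;    A process switch must change CR3. A thread switch within the same
    ;    process leaves it alone. This sketch assumes a thread switch, or that
    ;    CR3 is handled by higher-level logic.
    ;    If a switch were needed: load the new base into a register, then MOVQ %reg, %cr3

    ; 5. Switch the FS/GS segment base addresses (used for TLS).
    ;    This is typically done with WRMSR (Write Model Specific Register) on
    ;    MSR_FS_BASE and MSR_GS_BASE. WRMSR takes the MSR index in ECX and the
    ;    64-bit value split across EDX:EAX:
    ;    MOVL $MSR_FS_BASE, %ecx
    ;    (low 32 bits of THREAD_FSBASE(%rsi) -> EAX, high 32 bits -> EDX)
    ;    WRMSR
    ;    and likewise for MSR_GS_BASE with THREAD_GSBASE(%rsi).

    ; 6. Return to where the scheduler was called from.
    ;    _switch_to returns to the point where next_task was previously paused,
    ;    which normally means the point where next_task itself called _switch_to.
    ;    In the kernel this can be a plain RET: the stack has already been
    ;    switched, so RET pops the return address from the new stack.
    RET
END(_switch_to)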

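Key points:

Privilege level: _switch_to runs in kernel mode, so it can access every register and all memory directly.
Stack: every task (thread) has its own kernel stack. The heart of a task switch is pointing RSP at the new kernel stack.
Save location: register state is stored in the task's thread_info structure (or in the thread field of task_struct).
iretq: when a task goes back from kernel mode to user mode after an interrupt, the kernel uses iretq (Interrupt Return). iretq pops RIP, CS, RFLAGS, RSP and SS from the stack and switches the CPU back to user mode to continue the user code. (The fast system-call path can return with sysretq instead.) _switch_to itself contains no iretq; it only swaps the contexts of two tasks inside the kernel. The iretq happens at the end of interrupt handling, once the kernel has decided which task's user mode to return to.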

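3.3 Process vs. Thread Context Switches

Aspect	Process context switch	Thread context switch
Address space	CR3 must be reloaded to switch page tables, which invalidates non-global TLB entries (unless PCID tagging is in use).	Threads share one address space; CR3 normally stays the same.
Stack	Switches both the kernel stack and the user stack.	Switches both the kernel stack and the user stack.
Resources	Each process has its own resources (file handles, signal handling), which need more management.	Threads share the process's resources, so management overhead is smaller.
Cost	Higher (page-table switch, TLB refill, and so on).	Lower (mainly registers and stacks).
FPU/SIMD	FPU/SIMD state usually has to be saved and restored.	FPU/SIMD state usually has to be saved and restored.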
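The "address space" row can be observed directly on a POSIX system. In the sketch below a child process created with fork increments a global variable; because the child runs in its own address space, the parent's copy is untouched. The names counter and counter_after_child_increment are hypothetical, and the example assumes a Unix-like system.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;

// Forks a child that adds 100 to its own copy of `counter`, waits for it,
// and returns the parent's value afterwards (or -1 if fork fails).
int counter_after_child_increment() {
    counter = 0;
    const pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        counter += 100;                  // changes only the child's address space
        _exit(counter == 100 ? 0 : 1);   // leave without running parent cleanup
    }
    int status = 0;
    waitpid(pid, &status, 0);            // reap the child
    return counter;                      // the parent's copy was never modified
}
```

Had the increment been done by a second thread instead, the parent would see 100, because threads of one process share the same page tables and therefore the same variable.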

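4. User-Mode Context Switching: The Secret Behind C++ Coroutines and Fibers

The kernel is responsible for switching threads and processes, but in high-performance or otherwise specialized applications programmers sometimes want to manage task switching themselves in user mode, to get finer-grained control over concurrency and to avoid the cost of entering the kernel. This is where fibers and user-level coroutines come in.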

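4.1 Why Switch in User Mode?

Low overhead: no kernel trap, no privilege-level change, no TLB flush, all of which are expensive.
Cooperative multitasking: tasks give up the CPU voluntarily instead of being preempted, which suits some event-driven programming models well.
Custom scheduling: the application can implement its own scheduling policy.
Lightweight: fibers are usually lighter than threads, so creating and destroying them costs less.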
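Cooperative multitasking and custom scheduling can be illustrated without any register manipulation at all. In the sketch below each task is a step function that runs until its next voluntary yield point and reports whether it has more work; the scheduler is a plain round-robin loop. Step, run_round_robin and make_logger are hypothetical names, and real fibers would keep their state on a private stack rather than in a closure.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One call runs the task up to its next yield point.
// The return value says whether the task still has work left.
using Step = std::function<bool()>;

// A user-space round-robin scheduler: no task is ever preempted; each one
// runs until it returns, and unfinished tasks go to the back of the queue.
void run_round_robin(std::deque<Step> ready) {
    while (!ready.empty()) {
        Step step = std::move(ready.front());
        ready.pop_front();
        if (step()) {
            ready.push_back(std::move(step));
        }
    }
}

// Builds a task that appends `name` to `log` once per step, `count` times.
Step make_logger(std::string name, int count, std::vector<std::string>& log) {
    return [name, count, &log]() mutable {
        log.push_back(name);
        return --count > 0;
    };
}
```

Swapping the deque for a priority queue, or reordering it on I/O readiness, is all it takes to change the scheduling policy, which is the flexibility the list above refers to.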

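4.2 What Has to Be Saved in User Mode?

A user-mode context switch only cares about letting a function (or a piece of code) continue later from the point where it stopped. It does not need to deal with kernel stacks, page tables, or the kernel's FPU save strategy, because those are still handled by the operating system whenever the thread hosting the fiber is preempted.

For a user-mode context, the essentials are:

RIP (Instruction Pointer): where execution starts next time.
RSP (Stack Pointer): which stack the incoming task should use.
RBP (Base Pointer): if RBP is used as the frame pointer, its value must be saved.
Callee-saved registers: the ABI (Application Binary Interface) lets a caller assume that these registers hold the same values after a call returns. A switch routine looks like an ordinary function call to the task that invokes it, so the values the task had in these registers when it was paused must be back in place when it resumes. They therefore have to be saved and restored.
- x86-64 System V ABI (Linux/macOS): RBX, RBP, R12, R13, R14, R15
- x86-64 Microsoft x64 ABI (Windows): RBX, RBP, RDI, RSI, R12, R13, R14, R15, plus the vector registers XMM6-XMM15
RFLAGS (Flags Register): the ABI does not preserve the arithmetic flags across a function call, so a switch implemented as a function call can usually skip them. What the System V ABI does treat as callee-saved is the floating-point control state (the MXCSR control bits and the x87 control word), which libraries such as Boost.Context save as well.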
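The standard library already contains a primitive that saves this kind of minimal context: setjmp records the stack pointer, the resume address and the callee-saved registers in a jmp_buf, and longjmp restores them. The sketch below uses it to jump back up the call stack. It is not a full context switch, since longjmp may only return to a frame that is still active, and jump_demo and deep_work are hypothetical names.

```cpp
#include <csetjmp>

static std::jmp_buf saved_context;  // holds the saved registers
static int resume_count = 0;

// Abandons the current call chain and restores the state captured by setjmp.
static void deep_work() {
    ++resume_count;
    std::longjmp(saved_context, 1);  // does not return
}

// Returns how many times execution came back through the saved context.
int jump_demo() {
    resume_count = 0;
    if (setjmp(saved_context) == 0) {
        // First pass: the context was just saved. Dive into a callee.
        deep_work();
        return -1;  // not reached
    }
    // Second pass: reached via longjmp, with RSP and the callee-saved
    // registers as they were at the setjmp call.
    return resume_count;
}
```

A fiber switch differs in one essential way: it restores a context that lives on a different stack, which is why a separate stack pointer has to be part of the saved state.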

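4.3 User-Mode Context Switching in C++: A Boost.Context-Style Example

The C++ standard library itself offers no user-mode context-switch primitive, but many third-party libraries do, Boost.Context among them. They are usually implemented with inline assembly or separate assembly files. We will use a simplified, swapcontext-like mechanism to show how this works at the assembly level.

We define a ContextState struct to hold the registers that need saving: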

// context.hpp
#pragma once

#include <cstddef>
#include <cstdint>

// x86-64 System V ABI (Linux/macOS)
// Callee-saved registers: RBX, RBP, R12, R13, R14, R15
// RSP and RIP are also crucial.
// The assembly code addresses these fields by fixed byte offsets, so the
// member order must not change.
struct ContextState {
    std::uint64_t rbx;
    std::uint64_t rbp;
    std::uint64_t r12;
    std::uint64_t r13;
    std::uint64_t r14;
    std::uint64_t r15;
    std::uint64_t rsp; // Stack Pointer
    std::uint64_t rip; // Instruction Pointer
    // RFLAGS and FPU/SIMD state could be added if needed, but a minimal context switch often omits them:
    // the OS handles them if the thread is preempted.
};

// Function prototypes for assembly-level context switching
// (_save_context and _restore_context are not shown in this article)
extern "C" void _save_context(ContextState* old_ctx);
extern "C" void _restore_context(ContextState* new_ctx);
extern "C" void _switch_context(ContextState* old_ctx, ContextState* new_ctx); // Combines save and restore

// A helper to set up the initial context for a new fiber/coroutine
// The 'entry_point' function will be called with 'arg' when this context is restored.
extern "C" void _make_context(ContextState* ctx, void* stack_base, std::size_t stack_size,
                               void (*entry_point)(void*), void* arg);

// Cleanup routine defined in context.s; reached when a fiber's entry function returns.
extern "C" void _fiber_cleanup_entry();
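Because the assembly routines address the fields of ContextState by hard-coded byte offsets (0x00 for rbx up to 0x38 for rip), it is worth letting the compiler verify that the C++ struct really has that layout. The following sketch repeats the struct so that it compiles on its own; context_layout_matches_assembly is a hypothetical helper name, and in a real project the static_asserts would simply sit next to the struct in context.hpp.

```cpp
#include <cstddef>
#include <cstdint>

struct ContextState {
    std::uint64_t rbx;
    std::uint64_t rbp;
    std::uint64_t r12;
    std::uint64_t r13;
    std::uint64_t r14;
    std::uint64_t r15;
    std::uint64_t rsp;
    std::uint64_t rip;
};

// Compile-time checks: if someone reorders or adds a member, the build
// fails instead of the assembly silently reading the wrong field.
static_assert(offsetof(ContextState, rbx) == 0x00, "rbx offset");
static_assert(offsetof(ContextState, rbp) == 0x08, "rbp offset");
static_assert(offsetof(ContextState, r12) == 0x10, "r12 offset");
static_assert(offsetof(ContextState, r13) == 0x18, "r13 offset");
static_assert(offsetof(ContextState, r14) == 0x20, "r14 offset");
static_assert(offsetof(ContextState, r15) == 0x28, "r15 offset");
static_assert(offsetof(ContextState, rsp) == 0x30, "rsp offset");
static_assert(offsetof(ContextState, rip) == 0x38, "rip offset");
static_assert(sizeof(ContextState) == 0x40, "eight 8-byte slots");

// Runtime-queryable version of the same check, e.g. for a unit test.
constexpr bool context_layout_matches_assembly() {
    return offsetof(ContextState, rsp) == 0x30 &&
           offsetof(ContextState, rip) == 0x38 &&
           sizeof(ContextState) == 0x40;
}
```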

Now let's look at the corresponding assembly code (context.s). We take _switch_context as the example; it combines saving the old context with restoring the new one.

# context.s (x86-64 System V ABI, GNU assembler, AT&T syntax)
#
# void _switch_context(ContextState* old_ctx, ContextState* new_ctx);
# old_ctx in RDI, new_ctx in RSI
#
# ContextState layout (offsets):
# 0x00: rbx
# 0x08: rbp
# 0x10: r12
# 0x18: r13
# 0x20: r14
# 0x28: r15
# 0x30: rsp
# 0x38: rip

    .text
    .global _switch_context

_switch_context:
    # --- Step 1: Save current (old_ctx) state ---
    # Save callee-saved registers to old_ctx structure
    movq %rbx, 0x00(%rdi)
    movq %rbp, 0x08(%rdi)
    movq %r12, 0x10(%rdi)
    movq %r13, 0x18(%rdi)
    movq %r14, 0x20(%rdi)
    movq %r15, 0x28(%rdi)

    # Save current RSP to old_ctx structure.
    # At this point RSP still points at the return address pushed by the caller.
    movq %rsp, 0x30(%rdi)

    # Save the resume address as RIP for old_ctx.
    # We could take the return address from the stack; here we instead record
    # the address of label 1 below, so a restored old_ctx continues at the
    # 'ret' and returns to its caller through its own stack.
    leaq 1f(%rip), %rax     # Get address of label 1:
    movq %rax, 0x38(%rdi)   # Save it as RIP for old_ctx

    # --- Step 2: Load new (new_ctx) state ---
    # Load RSP first, so that everything from here on uses the new stack.
    movq 0x30(%rsi), %rsp   # Load new RSP

    # Load callee-saved registers from new_ctx structure
    movq 0x00(%rsi), %rbx
    movq 0x08(%rsi), %rbp
    movq 0x10(%rsi), %r12
    movq 0x18(%rsi), %r13
    movq 0x20(%rsi), %r14
    movq 0x28(%rsi), %r15

    # Load new RIP and jump to it
    movq 0x38(%rsi), %rax   # Load new RIP into RAX
    jmp *%rax               # Indirect jump: transfers control to the new context.

1:  # Resume point for old_ctx when it is restored later.
    ret                     # Executed only when old_ctx is restored.
                            # Returns from the original call to _switch_context.

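_make_context: Initializing a New Task's Stack

When we create a new fiber or coroutine, a ContextState structure for the registers is not enough; the task also needs a stack of its own. The job of _make_context (or a function like it) is to prepare that stack and the saved register values so that the new context looks like one that was suspended earlier. The first _switch_context into it can then start executing correctly.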

# context.s (x86-64 System V ABI, GNU assembler, AT&T syntax), continued
#
# void _make_context(ContextState* ctx, void* stack_base, size_t stack_size,
#                    void (*entry_point)(void*), void* arg);
# ctx in RDI, stack_base in RSI, stack_size in RDX, entry_point in RCX, arg in R8
#
# _switch_context restores only callee-saved registers, so the argument cannot
# be delivered in RDI directly. Instead, _make_context parks entry_point in RBX
# and arg in R12 (both callee-saved, so _switch_context will load them) and
# points RIP at a small trampoline that performs an ABI-compliant call.

    .text
    .global _make_context

_make_context:
    # The stack grows downwards on x86-64; stack_base is the lowest address.
    # The initial RSP is the highest address of the block, aligned to 16 bytes.
    leaq (%rsi,%rdx), %rax      # rax = stack_base + stack_size (top of stack memory)
    andq $-16, %rax             # Round down to a 16-byte boundary
    movq %rax, 0x30(%rdi)       # ctx->rsp = aligned stack top

    leaq _fiber_trampoline(%rip), %rax
    movq %rax, 0x38(%rdi)       # ctx->rip = trampoline

    movq %rcx, 0x00(%rdi)       # ctx->rbx = entry_point
    movq $0,   0x08(%rdi)       # ctx->rbp = 0 (terminates frame-pointer chains)
    movq %r8,  0x10(%rdi)       # ctx->r12 = arg
    movq $0,   0x18(%rdi)       # ctx->r13
    movq $0,   0x20(%rdi)       # ctx->r14
    movq $0,   0x28(%rdi)       # ctx->r15
    ret

# First code executed in a new context. On entry RSP is 16-byte aligned,
# RBX = entry_point and R12 = arg, as arranged by _make_context.
_fiber_trampoline:
    movq %r12, %rdi             # The System V ABI passes the first argument in RDI
    call *%rbx                  # entry_point(arg); the pushed return address gives
                                # the callee the stack alignment the ABI expects
    # entry_point returned: there is no caller to go back to, so fall through
    # into the cleanup routine.

# A simple cleanup routine for when a fiber's entry function returns.
# A real library would hand control back to a scheduler here.
# For demonstration, we just exit (Linux only; bypasses C++ runtime shutdown).
    .global _fiber_cleanup_entry
_fiber_cleanup_entry:
    movq $60, %rax              # Linux x86-64 syscall number for exit
    movq $0, %rdi               # exit code 0
    syscall

    .section .note.GNU-stack,"",@progbits   # Mark the program stack as non-executable
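The stack-top computation that _make_context performs (add the size to the base, then round down to a 16-byte boundary) is easy to get wrong by one alignment step, so here it is once more as a small C++ function. initial_stack_top is a hypothetical name used only for this sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the initial stack pointer for a stack occupying
// [stack_base, stack_base + stack_size): the highest address of the block,
// rounded down to the 16-byte alignment the x86-64 ABIs require.
std::uintptr_t initial_stack_top(const void* stack_base, std::size_t stack_size) {
    const std::uintptr_t top =
        reinterpret_cast<std::uintptr_t>(stack_base) + stack_size;
    return top & ~static_cast<std::uintptr_t>(15);  // same effect as andq $-16
}
```

Rounding down rather than up matters: rounding up could place the first push outside the allocated block.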

Using these assembly functions from C++:

// main.cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <memory>
#include "context.hpp" // Our ContextState and prototypes

// Global contexts for demonstration (bad practice in real code)
ContextState main_context;
ContextState fiber_context;

// A simple fiber function
void fiber_func(void* arg) {
    int id = *static_cast<int*>(arg);
    std::cout << "Fiber " << id << " started." << std::endl;
    std::cout << "Fiber " << id << " finishing." << std::endl;

    // Yield back to main. A real fiber would yield to its scheduler here.
    // If fiber_func returned instead, control would reach the cleanup
    // routine that _make_context arranges for, and the process would exit.
    _switch_context(&fiber_context, &main_context);
}

int main() {
    std::cout << "Main context started." << std::endl;

    // 1. Allocate stack for the fiber
    std::size_t stack_size = 64 * 1024; // 64KB
    auto stack_mem = std::make_unique<std::uint8_t[]>(stack_size);
    void* stack_base = stack_mem.get();

    // 2. Prepare data for the fiber
    int fiber_id = 123;

    // 3. Initialize the fiber's context.
    // _make_context sets up fiber_context's RIP and RSP so that the first
    // switch into it runs fiber_func(&fiber_id) on the new stack.
    _make_context(&fiber_context, stack_base, stack_size, fiber_func, &fiber_id);

    // 4. Switch to the fiber context
    std::cout << "Switching to fiber..." << std::endl;
    _switch_context(&main_context, &fiber_context); // Save main_context, load fiber_context

    // Execution continues here once the fiber switches back to main_context
    std::cout << "Main context resumed." << std::endl;

    // In a full implementation you would manage multiple fibers, and a
    // finished fiber would return control to the scheduler.
    std::cout << "Main context finishing." << std::endl;

    return 0;
}

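Compile and link (Linux x86-64; context.s is written in AT&T syntax, so it is assembled by the GNU assembler through the g++ driver rather than by NASM):

g++ -c main.cpp -o main.o
g++ -c context.s -o context.o
g++ main.o context.o -o fiber_app
./fiber_app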

Sample output (depends on the actual implementation of _make_context and fiber_func):

Main context started.
Switching to fiber...
Fiber 123 started.
Fiber 123 finishing.
Main context resumed.
Main context finishing.

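Note: argument passing is the delicate part of _make_context. Under the System V ABI the entry function expects its argument in RDI, but _switch_context restores only callee-saved registers. Production-grade libraries therefore start every new context in a small assembly trampoline: the initial stack or the saved registers carry entry_point and its argument, and the trampoline moves the argument into RDI and CALLs the real entry_point. When entry_point returns, control comes back to the trampoline, which jumps to a designated cleanup function or to the scheduler. This keeps the scheme compatible with the standard C++ calling convention.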
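For comparison with the hand-written assembly, the POSIX ucontext family (getcontext, makecontext, swapcontext) offers the same make-and-switch operations as library calls on Linux. The sketch below is a minimal example under that assumption; run_fiber_demo, fiber_body and the trace vector are hypothetical names. One caveat: glibc's swapcontext also saves and restores the signal mask with a system call, so it is noticeably slower than a pure register switch.

```cpp
#include <string>
#include <ucontext.h>
#include <vector>

static ucontext_t main_ctx;
static ucontext_t fiber_ctx;
static std::vector<std::string> trace;

// Runs on the fiber's private stack.
static void fiber_body() {
    trace.push_back("fiber: step 1");
    swapcontext(&fiber_ctx, &main_ctx);  // yield: save fiber, resume main
    trace.push_back("fiber: step 2");
    // Returning from here resumes the context named in uc_link (main_ctx).
}

// Switches into the fiber twice and returns the order in which things ran.
std::vector<std::string> run_fiber_demo() {
    trace.clear();
    std::vector<char> stack(64 * 1024);  // the fiber's stack memory

    getcontext(&fiber_ctx);              // start from a valid context
    fiber_ctx.uc_stack.ss_sp = stack.data();
    fiber_ctx.uc_stack.ss_size = stack.size();
    fiber_ctx.uc_link = &main_ctx;       // where to go when fiber_body returns
    makecontext(&fiber_ctx, fiber_body, 0);

    trace.push_back("main: start");
    swapcontext(&main_ctx, &fiber_ctx);  // first entry into the fiber
    trace.push_back("main: resumed");
    swapcontext(&main_ctx, &fiber_ctx);  // resume the fiber after its yield
    trace.push_back("main: done");
    return trace;
}
```

Here makecontext plays the role of _make_context and uc_link plays the role of the cleanup path: it names the context that takes over when the entry function returns.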

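4.4 C++20 Coroutines: A Language-Level Abstraction

C++20 introduced coroutines, giving asynchronous programming and lazy evaluation strong language support. Their implementation, however, is fundamentally different from the low-level register save/restore of _switch_context above.

How C++20 coroutines are implemented:

Compiler transformation: the compiler turns a coroutine function (one that uses co_await, co_yield or co_return) into a state machine.
Coroutine frame: all of the coroutine's local variables, its parameters and its current execution point are packed into a special structure, the coroutine frame. The frame is usually allocated on the heap.
Suspend and resume:
- When a coroutine reaches co_await or co_yield and suspends, it actually returns control to its caller. Its state (the coroutine frame and the current position in the state machine) stays in memory.
- When the coroutine is resumed, its caller (or a scheduler) calls handle.resume(), which invokes the coroutine's internal state-machine function again. The compiler-generated code uses the saved state to jump to the point of the last suspension and continues from there.

Key differences:

No direct register manipulation: C++20 coroutines do not save and restore the general-purpose registers, RSP or RIP themselves. They rely on the ordinary function call and return mechanism. When a coroutine suspends, it simply returns like a normal function; the stack frame of that activation is torn down while the heap-allocated coroutine frame lives on. When it resumes, that is a new function call whose internal logic skips the part that has already run.
State storage: the coroutine's state is stored explicitly in the coroutine frame data structure, not implicitly in CPU registers.
Level of abstraction: C++20 coroutines are a language-level abstraction. Programmers do not have to think about assembly details or register management, which improves readability and safety.

So although C++20 coroutines achieve a transfer of logical control flow that resembles a context switch, their assembly-level truth is a compiler-generated state machine, not a direct save and restore of a register context.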
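To make the state-machine idea concrete, here is a hand-written stand-in for what a compiler might conceptually produce for a coroutine that yields 0, 1, ..., limit-1. It is an assumption-laden sketch, not actual compiler output: CounterFrame and start_counter are hypothetical names, and a real coroutine frame also carries a promise object and resume/destroy function pointers.

```cpp
#include <memory>

// Plays the role of the coroutine frame: everything that must survive a
// suspension lives here instead of in registers or on the stack.
struct CounterFrame {
    int state = 0;    // which resume point to continue from
    int limit = 0;    // the coroutine's parameter, copied into the frame
    int i = 0;        // a local variable that outlives each suspension
    int current = 0;  // the most recently yielded value

    // One call runs the body up to the next "co_yield".
    // Returns true if a value was yielded, false once the body has finished.
    bool resume() {
        switch (state) {
        case 0: i = 0; break;  // first entry: start of the function body
        case 1: ++i; break;    // re-entry just after the previous yield
        default: return false; // already finished
        }
        if (i < limit) {
            current = i;       // "co_yield i"
            state = 1;
            return true;       // suspend = plain return to the caller
        }
        state = 2;             // "co_return"
        return false;
    }
};

// Mirrors the heap allocation of the frame when the coroutine is first called.
std::unique_ptr<CounterFrame> start_counter(int limit) {
    std::unique_ptr<CounterFrame> frame(new CounterFrame());
    frame->limit = limit;
    return frame;
}
```

Every resume() is an ordinary function call on the caller's stack, and every suspension is an ordinary return, which is why no stack pointer or instruction pointer needs to be saved by hand.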

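5. Performance Considerations and Cost Analysis

Context switches are not free. Their cost is a key factor in the efficiency of any multitasking system.

Cost type	Kernel-mode switch (process/thread)	User-mode switch (fiber/coroutine)
Privilege-level change	Required (user -> kernel -> user).	Not needed (execution stays in user mode).
TLB flush	Required on a process switch (CR3 changes); usually not on a thread switch.	Not needed (same address space).
Register save	All general-purpose, flags, segment and FPU/SIMD registers.	Only the necessary registers (RSP, RIP, callee-saved).
Cache pollution	The new task's data and instructions may evict the old task's cache contents.	Happens too; because switches can be far more frequent, the effect may be more pronounced.
Kernel data structures	Task control blocks (TCBs) and other kernel structures must be read and updated.	Only the context structures defined in user space are touched.
Scheduler cost	The kernel scheduler runs and may involve complex algorithms.	A user-space scheduler runs; it is customizable and usually lighter.
Typical cost (rough, hardware- and workload-dependent)	Roughly one to tens of microseconds.	Roughly tens of nanoseconds to a few microseconds.

Conclusion: user-mode context switches (fibers, for example) are often an order of magnitude or more faster than kernel-mode switches, because they avoid the expensive privilege-level change and TLB flush. That is why programmers in some high-performance settings (highly concurrent network servers, game engines) prefer user-level coroutines or fibers for concurrency.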
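Figures like these are best measured on the machine at hand. One classic way to get a feel for the kernel-mode side is a pipe ping-pong between two processes: each blocking read forces the kernel to run the other process. The sketch below assumes a POSIX system; pipe_round_trip_ns is a hypothetical name, and the result includes the cost of the read and write system calls themselves, so it is an upper bound on switch cost rather than a precise measurement.

```cpp
#include <chrono>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns the average time in nanoseconds for one parent -> child -> parent
// round trip over a pair of pipes, or -1.0 on error.
double pipe_round_trip_ns(int rounds) {
    if (rounds <= 0) return -1.0;
    int to_child[2];
    int to_parent[2];
    if (pipe(to_child) != 0) return -1.0;
    if (pipe(to_parent) != 0) {
        close(to_child[0]);
        close(to_child[1]);
        return -1.0;
    }

    const pid_t pid = fork();
    if (pid < 0) {
        close(to_child[0]);
        close(to_child[1]);
        close(to_parent[0]);
        close(to_parent[1]);
        return -1.0;
    }

    char byte = 'x';
    if (pid == 0) {
        // Child: echo every byte back until the parent closes its end.
        close(to_child[1]);
        close(to_parent[0]);
        while (read(to_child[0], &byte, 1) == 1) {
            if (write(to_parent[1], &byte, 1) != 1) _exit(1);
        }
        _exit(0);
    }

    // Parent: send a byte, then block until the child echoes it.
    close(to_child[0]);
    close(to_parent[1]);
    bool ok = true;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < rounds && ok; ++i) {
        ok = write(to_child[1], &byte, 1) == 1 &&
             read(to_parent[0], &byte, 1) == 1;
    }
    const auto stop = std::chrono::steady_clock::now();

    close(to_child[1]);   // the child's read now returns 0, so it exits
    close(to_parent[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    if (!ok) return -1.0;
    return std::chrono::duration<double, std::nano>(stop - start).count() / rounds;
}
```

Pinning both processes to the same core makes each round trip contain two genuine context switches; on separate cores the numbers mostly reflect wake-up latency instead.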

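6. Summary and Outlook

Context switching is the fundamental mechanism by which operating systems and runtime environments implement concurrency. Whether the kernel is scheduling processes and threads, or a library is implementing lightweight fibers in user space, the core is always the precise saving and restoring of CPU register state.

Assembly language shows what really happens underneath: with instructions such as MOVQ, PUSH and POP, the instruction pointer, the stack pointer and the general-purpose register values are moved between memory and the CPU, so that execution can pass seamlessly from one task to another. C++ runtimes build higher-level abstractions on top of this. Boost.Context wraps these assembly instructions directly, while C++20 coroutines have the compiler turn asynchronous logic into a state machine, avoiding direct register manipulation and offering a safer, more modern concurrency model. Understanding these low-level mechanisms is a key step toward mastering complex systems and writing efficient code.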