Core Dump Analysis for C++ Programs: Post-Mortem Debugging with GDB/LLDB - 智猿学院

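Hello everyone. Today we will talk about analyzing core dumps of C++ programs and using GDB/LLDB for post-mortem debugging. Most of us have seen a program crash unexpectedly during development; having a snapshot of the process at the moment of the crash makes the problem far easier to locate. A core dump records the memory state of a program when it crashes, and a debugger can load it to analyze the cause.

What Is a Core Dump?

A core dump is a file in which the operating system saves the in-memory state of a program (data, heap, stacks, registers, and so on) when the program terminates abnormally. A debugger such as GDB or LLDB can load this file together with the executable, letting us inspect the context at the moment of the crash much as we would inspect a stopped live process, although the process itself can no longer be resumed.

When Is a Core Dump Produced?

Common situations that crash a program and produce a core dump include:

Segmentation fault (SIGSEGV): accessing memory that is unmapped or that the process has no permission to access, for example dereferencing a null pointer or indexing far outside an array.
Illegal instruction (SIGILL): the program executed an instruction the processor does not recognize.
Division by zero (SIGFPE): integer division by zero. On x86 this raises SIGFPE; floating-point division by zero normally yields infinity or NaN instead of crashing.
Call to abort(): the program terminates itself by calling abort(), which raises SIGABRT.
Uncaught exception: an exception is thrown and no try-catch block handles it, so std::terminate() is called, which by default calls abort().
Stack overflow: the call depth is too great (for example, runaway recursion) and the stack space is exhausted, which usually shows up as SIGSEGV.
Resource exhaustion: system resources such as memory or file handles run out. This crashes the program only indirectly, for example through an uncaught std::bad_alloc; a process killed by the Linux OOM killer receives SIGKILL and leaves no core dump.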
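
Each crash in the list above reaches the process as a signal, and the signal is the first thing a debugger reports when it opens a core file. The sketch below maps the signals mentioned in the list to their usual causes. The name describe_fatal_signal is a hypothetical helper chosen for illustration, not part of any library, and the mapping is deliberately incomplete.

```cpp
#include <cassert>
#include <csignal>
#include <string>

// Hypothetical helper (not a library function): describe the crash cause
// that a fatal signal usually indicates. Covers only the signals named above.
std::string describe_fatal_signal(int sig) {
  assert(sig > 0);  // valid signal numbers are positive
  switch (sig) {
    case SIGSEGV: return "SIGSEGV: invalid memory access";
    case SIGILL:  return "SIGILL: illegal instruction";
    case SIGFPE:  return "SIGFPE: arithmetic error such as integer division by zero";
    case SIGABRT: return "SIGABRT: abort(), including an uncaught exception";
    default:      return "signal " + std::to_string(sig) + ": not covered by this sketch";
  }
}
```

When GDB prints "Program terminated with signal SIGSEGV", this is the kind of lookup worth doing mentally before reading the backtrace: the signal already narrows down the class of bug.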

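Contents of a Core Dump File

A core dump file captures, or lets the debugger reconstruct, the following information from the moment of the crash:

Text segment: the program's executable code. On Linux, unmodified file-backed pages such as code are normally left out of the core by default, and the debugger reads them from the executable and shared libraries, which is why the matching binaries are needed for analysis.
Data segment: the program's global and static variables.
Heap: the program's dynamically allocated memory.
Stack: the stack frames of the function calls in each thread.
Register state: the values of the CPU registers.
Process information: process ID, user ID, and so on.
Other information: for example, the signal that terminated the process and the list of memory-mapped files.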

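Enabling Core Dumps

On Linux, core dumps may be disabled by default, or the core file size limit may be set to 0. We need to enable core dumps manually and set the core file size limit.

Check whether core dumps are enabled:

ulimit -c

An output of 0 means core dumps are disabled.

Enable core dumps (temporary, current shell session only):

ulimit -c unlimited  # no limit on core file size

Or set a specific limit:

ulimit -c 1024       # limit core files to 1024 blocks (1024 KB in bash, whose block size is 1024 bytes)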
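
The same soft limit can also be raised from inside the program, which helps when the process is not started from a shell where ulimit was run. A minimal sketch, assuming a POSIX system; it uses the standard getrlimit/setrlimit calls with RLIMIT_CORE, and raise_core_limit_to_hard_max is a hypothetical helper name:

```cpp
#include <cassert>
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE (POSIX)

// Hypothetical helper: raise the soft core-size limit of the calling process
// to its hard limit. Returns true on success.
bool raise_core_limit_to_hard_max() {
  struct rlimit limit {};
  if (getrlimit(RLIMIT_CORE, &limit) != 0) {
    return false;  // could not read the current limits
  }
  assert(limit.rlim_cur <= limit.rlim_max);  // soft limit never exceeds hard limit
  limit.rlim_cur = limit.rlim_max;  // an unprivileged process may go up to the hard limit
  return setrlimit(RLIMIT_CORE, &limit) == 0;
}
```

Calling this early in main() only helps if the hard limit itself is nonzero; if the hard limit is 0, it has to be raised by the administrator, for example through limits.conf.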

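Enable core dumps permanently:

Edit /etc/security/limits.conf and add the following lines:

* soft core unlimited
* hard core unlimited

The change takes effect after you log in again or reboot.

Set the core file name and location:

Write to /proc/sys/kernel/core_pattern (root privileges required). For example, to store core files in the /tmp/cores directory and name them core.<pid>:

echo "/tmp/cores/core.%p" > /proc/sys/kernel/core_pattern

%p is replaced with the process ID. The /tmp/cores directory must already exist and be writable by the crashing process, and a value written this way lasts only until the next reboot.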
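
To make the substitution concrete, here is a simplified sketch of how a pattern such as the one above turns into a file name. It is only an illustration of the idea, not the kernel's implementation: expand_core_pattern is a hypothetical helper, and it handles only %p and %% while copying every other specifier through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: expand a core_pattern-style template.
// Handles %p (process ID) and %% (literal percent sign) only;
// any other specifier is copied to the result unchanged.
std::string expand_core_pattern(const std::string& pattern, long pid) {
  assert(pid > 0);  // process IDs are positive
  std::string result;
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    if (pattern[i] == '%' && i + 1 < pattern.size()) {
      const char spec = pattern[i + 1];
      if (spec == 'p') { result += std::to_string(pid); ++i; continue; }
      if (spec == '%') { result += '%'; ++i; continue; }
    }
    result += pattern[i];
  }
  return result;
}
```

For example, expand_core_pattern("/tmp/cores/core.%p", 4567) returns "/tmp/cores/core.4567", which is the file to pass to the debugger for process 4567.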

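Note: in production environments, set the permissions and storage path of core dumps carefully, because a core file contains the process's memory and can leak sensitive information.

Analyzing a Core Dump with GDB

GDB (the GNU Debugger) is a powerful debugging tool that can analyze core dump files.

Basic usage:

gdb <executable> <core_file>

<executable>: the program's executable file. It should be the exact binary that produced the core, ideally built with debug information (the -g option); without debug information GDB shows far less, with no source lines or local variables.
<core_file>: the core dump file.

Common GDB commands:

Command	Description
`bt`	(backtrace) Show the function call stack.
`frame <n>`	Select a stack frame (n is the frame number).
`info locals`	Show the local variables of the current stack frame.
`print <variable>`	Print the value of a variable.
`list`	Show the source code around the current line.
`up`	Move up one stack frame (toward the caller).
`down`	Move down one stack frame (toward the callee).
`quit`	Exit GDB.

Example code: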

#include <iostream>

void crash_function(int* ptr) {
  std::cout << "Dereferencing pointer: " << *ptr << std::endl; // deliberately dereference a null pointer
}

int main() {
  int* null_ptr = nullptr;
  crash_function(null_ptr);
  return 0;
}

Compile:

g++ -g crash.cpp -o crash

Run (this will crash):

./crash

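If core dumps are enabled, a core file is generated in the current directory, or wherever the /proc/sys/kernel/core_pattern setting directs it. On many distributions core_pattern pipes the dump to a handler such as apport or systemd-coredump, in which case no core file appears in the current directory.

Analyze with GDB: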

gdb crash core

GDB output (example):

GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from crash...
[New LWP 4567]
Core was generated by `./crash'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  crash_function (ptr=0x0) at crash.cpp:4
4    std::cout << "Dereferencing pointer: " << *ptr << std::endl;
(gdb) bt
#0  crash_function (ptr=0x0) at crash.cpp:4
#1  0x000055555555519a in main () at crash.cpp:9
(gdb) frame 0
#0  crash_function (ptr=0x0) at crash.cpp:4
4    std::cout << "Dereferencing pointer: " << *ptr << std::endl;
(gdb) print ptr
$1 = (int *) 0x0
(gdb)

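The GDB output shows that the program crashed at line 4 of crash.cpp, inside crash_function, because ptr is a null pointer.

Analyzing a Core Dump with LLDB

LLDB, the LLVM project's debugger, is the default debugger on macOS and iOS and is also available on Linux and Windows. Its usage is similar to GDB's.

Basic usage:

lldb <executable> -c <core_file>

<executable>: the program's executable file. It should be the exact binary that produced the core, ideally built with debug information (the -g option).
<core_file>: the core dump file.

Common LLDB commands:

Command	Description
`bt`	(backtrace) Show the function call stack.
`frame select <n>`	Select a stack frame (n is the frame number).
`frame variable`	Show the local variables of the current stack frame.
`print <variable>`	Print the value of a variable.
`list`	Show the source code around the current line.
`up`	Move up one stack frame (toward the caller).
`down`	Move down one stack frame (toward the callee).
`quit`	Exit LLDB.

Analyzing with LLDB (using crash.cpp from above):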

lldb crash -c core

LLDB output (example):

(lldb) target create "crash"
Current executable set to 'crash' (x86_64).
(lldb) core file  core
(lldb) bt
* thread #1, stop reason = signal SIGSEGV (fault address: 0x0)
  * frame #0: 0x0000000000001164 crash`crash_function(ptr=0x0000000000000000) at crash.cpp:4
    frame #1: 0x00000000000011a5 crash`main at crash.cpp:9
    frame #2: 0x00007f99e93890b3 libc.so.6`__libc_start_main(main=(crash`main at crash.cpp:7), argc=1, argv=0x00007ffcfc215608, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffcfc2155f8) at __libc_start_main.c:308
    frame #3: 0x000000000000102e crash`_start at crt1.c:31
(lldb) frame select 0
frame #0: 0x0000000000001164 crash`crash_function(ptr=0x0000000000000000) at crash.cpp:4
(lldb) print ptr
(int *) ptr = 0x0000000000000000
(lldb)

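The LLDB output is similar to GDB's and likewise shows that the program crashed in crash_function while dereferencing a null pointer.

Techniques for Analyzing Complex Core Dumps

Complex core dumps may require additional techniques to pin down the problem.

Inspect threads: for a multithreaded program, use info threads (GDB) or thread list (LLDB) to see the state of every thread, then switch to a specific thread with thread <n> (GDB) or thread select <n> (LLDB).
Inspect memory: use the x command (GDB/LLDB) to examine the contents of a memory address. For example, x/10x 0x7fffffffe400 displays 10 units in hexadecimal starting at address 0x7fffffffe400. Other formats are available; for example, x/s 0x400544 interprets the contents at address 0x400544 as a string.
Breakpoints do not apply: a core dump is a static snapshot, not a live process, so the program cannot be resumed with continue or run up to a breakpoint. To follow the execution flow, walk the stack frames in the core, or reproduce the crash by running the program under the debugger with breakpoints set.
Use scripts: GDB and LLDB both support scripting (for example, in Python), so core dump analysis can be automated.
Combine with logs: program logs provide additional information about the program's state at the time of the crash.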

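Suggestions for Avoiding Core Dumps

Core dumps help us locate problems, but prevention is better than cure. Here are some suggestions for avoiding crashes in the first place:

Code review: review code with particular attention to error-prone areas such as pointer operations, array access, and memory allocation and deallocation.
Unit testing: write unit tests that cover boundary conditions and error cases.
Static analysis: use static analysis tools such as Coverity or Cppcheck to find potential problems in the code.
Memory checking tools: use memory checkers such as Valgrind to detect memory leaks, out-of-bounds accesses, and similar problems.
Defensive programming: add checks to the code, for example verifying that a pointer is not null before dereferencing it.
Exception handling: use exception handling appropriately to catch and handle the exceptions that may occur.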
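
As one possible reading of the defensive-programming advice above, the crash_function example from earlier could check its pointer before dereferencing it. This is a sketch, not the only way to do it: print_value and deref_checked are hypothetical names, and the split between a runtime check at the boundary and an assert for an internal contract is a design choice made here for illustration.

```cpp
#include <cassert>
#include <iostream>
#include <ostream>

// Internal helper whose contract says ptr is never null.
// The assert documents and checks that contract in debug builds.
int deref_checked(const int* ptr) {
  assert(ptr != nullptr);
  return *ptr;
}

// Boundary function: validates the input at run time instead of crashing.
// Returns false (and reports the problem) when ptr is null.
bool print_value(const int* ptr, std::ostream& out) {
  if (ptr == nullptr) {
    out << "Null pointer, nothing to dereference\n";
    return false;
  }
  out << "Dereferencing pointer: " << deref_checked(ptr) << '\n';
  return true;
}
```

With this version, print_value(nullptr, std::cout) reports the problem and returns false instead of raising SIGSEGV, so the caller decides how to recover.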

Core Dump Analysis Example: Memory Corruption

Suppose we have the following code:

#include <iostream>
#include <cstring>

int main() {
  char buffer[10];
  strcpy(buffer, "This is a very long string"); // buffer overflow
  std::cout << buffer << std::endl;
  return 0;
}

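This code has a buffer overflow: strcpy copies "This is a very long string" (26 characters plus the terminating null byte) into buffer, which is only 10 bytes, corrupting the memory next to it.

Compile: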

g++ -g overflow.cpp -o overflow

Run:

./overflow

The program may crash and generate a core dump file.

Analyze with GDB:

gdb overflow core

GDB output (example):

GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from overflow...
[New LWP 1234]
Core was generated by `./overflow'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f1234567890 in strcpy () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f1234567890 in strcpy () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00005555555551a5 in main () at overflow.cpp:6
(gdb) frame 1
#1  0x00005555555551a5 in main () at overflow.cpp:6
6    strcpy(buffer, "This is a very long string");
(gdb) info locals
buffer = "This is a v"
(gdb)

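In this example output the program crashes inside strcpy, and frame 1 points at the strcpy call on line 6 of overflow.cpp; info locals shows that buffer holds only the beginning of the string, while the rest was written past its end. Where an overflow like this actually crashes depends on the compiler, optimization level, and stack layout: with stack protection enabled (the default on many distributions), the program typically aborts with SIGABRT and a "stack smashing detected" message when main returns, and the crash site can be far from the faulty write.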
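
One way to avoid the overflow in this example is to bound the copy by the size of the destination. A minimal sketch, assuming a hypothetical helper named copy_bounded built on the standard std::snprintf, which never writes more than the given size and always null-terminates when the size is nonzero:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: copy src into dst without writing past dst_size bytes.
// Returns true if the whole string fit, false if it was truncated.
bool copy_bounded(char* dst, std::size_t dst_size, const char* src) {
  assert(dst != nullptr && src != nullptr && dst_size > 0);
  // snprintf returns the length the full string would have needed,
  // excluding the terminating null byte.
  const int needed = std::snprintf(dst, dst_size, "%s", src);
  return needed >= 0 && static_cast<std::size_t>(needed) < dst_size;
}
```

With char buffer[10], copy_bounded(buffer, sizeof buffer, "This is a very long string") returns false and leaves the truncated string "This is a" in buffer rather than corrupting the stack. In C++ code, holding the text in a std::string avoids the fixed-size buffer altogether.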

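Core Dumps Are an Important Debugging Tool

In short, core dumps are a very important debugging tool for locating the cause of a crash. By enabling core dumps and using debuggers such as GDB or LLDB, we can analyze the state at the moment of the crash, find the root cause, and fix the bug. Remember to compile with the -g option so that debug information is available for later analysis. At the same time, pay attention to code quality to keep crashes from happening in the first place.

Know Your Tools and Build Good Habits

Mastering core dump analysis and becoming fluent with debuggers such as GDB and LLDB can significantly improve development efficiency and code quality. Good habits such as code review, unit testing, and static analysis effectively reduce how often programs crash.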
