C++ Concurrency Debugging: Combining `Helgrind` and `TSan` with `rr` (record and replay)

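Hello, everyone!

Today let's talk about C++ concurrency debugging, a topic that gives everyone a headache. Concurrent programming is like cooking several dishes at once in one kitchen: slip up and you're scrambling, with all sorts of strange bugs popping up. These bugs are often hard to reproduce, which is maddening. Don't worry: today I'll walk you through a combo of Helgrind, TSan, and rr (record and replay) for tackling these concurrency problems.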

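I. The pitfalls of concurrent programming

First, we need to know which pitfalls concurrent programming has in store. The common ones are:

Data Race: Two or more threads access the same shared variable concurrently, at least one of them writes, and nothing (a lock, an atomic) orders the accesses. In C++ this is undefined behavior, so the results are unpredictable.
Deadlock: Threads wait on each other to release resources, so none of them can make progress.
Livelock: Threads keep retrying an operation but, because of interference from other threads, never succeed. It resembles deadlock, except the threads are not blocked; they stay busy doing useless work.
Race Condition: The program's behavior depends on the relative order in which threads run. Even with no data race in the strict sense, different interleavings can produce different results.
Atomicity Violation: A sequence of operations is meant to execute as one atomic unit, but another thread gets in between them.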

These problems are all well hidden and hard to find with traditional debugging methods. Next, let's see how tools can help.

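II. Helgrind: a nemesis of deadlocks and data races

Helgrind is one of the tools in the Valgrind suite. It detects data races and potential deadlocks (inconsistent lock ordering) in C and C++ programs that use POSIX threads. It works by monitoring the threads' accesses to shared memory and their lock operations, and checking for potential conflicts.

1. Installing and using Helgrind

First, you need to install Valgrind. On Debian/Ubuntu, you can install it with: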

sudo apt-get update
sudo apt-get install valgrind

Once it's installed, you can run your program under Helgrind (build with -g so the report includes file names and line numbers):

valgrind --tool=helgrind ./your_program

Helgrind prints a detailed report telling you where a deadlock or data race may exist.

2. A Helgrind example

Let's look at a simple example that shows how to detect a data race with Helgrind:

#include <iostream>
#include <thread>

int counter = 0;

void increment() {
    for (int i = 0; i < 100000; ++i) {
        counter++; // data race
    }
}

int main() {
    std::thread t1(increment);
    std::thread t2(increment);

    t1.join();
    t2.join();

    std::cout << "Counter: " << counter << std::endl;
    return 0;
}

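In this example, two threads increment counter with no synchronization at all, which is a data race. Running the program under Helgrind produces a report along these lines (illustrative; the exact wording, thread numbers, addresses, and line numbers vary with the Valgrind version and your build):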

==25611== Possible data race during write of size 4 at 0x10c884 in thread #2
==25611==    at 0x109203: increment() (example.cpp:6)
==25611==    by 0x4E4277F: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30)
==25611==    by 0x48409DA: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
==25611==  Address 0x10c884 is located in the .data section
==25611==
==25611== Previous write of size 4 at 0x10c884 in thread #1
==25611==    at 0x109203: increment() (example.cpp:6)
==25611==    by 0x4E4277F: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30)
==25611==    by 0x48409DA: ??? (in /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2)
==25611==  Address 0x10c884 is located in the .data section

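The report points at increment() in example.cpp, which is where the unsynchronized counter++ lives. (The sample output says line 6; in the listing above counter++ sits on line 8, so treat the exact numbers as illustrative.)

3. How do you fix the data race?

To fix the data race, you can protect the shared variable with a mutex: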

#include <iostream>
#include <thread>
#include <mutex>

int counter = 0;
std::mutex counter_mutex;

void increment() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex);
        counter++;
    }
}

int main() {
    std::thread t1(increment);
    std::thread t2(increment);

    t1.join();
    t2.join();

    std::cout << "Counter: " << counter << std::endl;
    return 0;
}

In this revised version, std::mutex protects counter, so only one thread at a time can access it. Run the program under Helgrind again and the data race report should be gone.

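III. TSan: broader data race detection

TSan (ThreadSanitizer) is a tool for detecting data races in C++ and Go programs. It instruments the program at compile time instead of running it on Valgrind's synthetic CPU, so it is usually much faster than Helgrind, although it still slows the program down noticeably and uses several times more memory. It can also detect more kinds of data races.

1. Installing and using TSan

TSan usually ships with the compiler toolchain. In GCC and Clang, you enable it with the -fsanitize=thread option: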

g++ -fsanitize=thread -g example.cpp -o example
./example

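TSan checks for data races while the program runs and prints a report when it finds one. It only sees the code paths that actually execute, so a race on a path your run never takes goes unreported.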

2. A TSan example

Let's reuse the earlier example and see how TSan detects the data race:

#include <iostream>
#include <thread>

int counter = 0;

void increment() {
    for (int i = 0; i < 100000; ++i) {
        counter++; // data race
    }
}

int main() {
    std::thread t1(increment);
    std::thread t2(increment);

    t1.join();
    t2.join();

    std::cout << "Counter: " << counter << std::endl;
    return 0;
}

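Compile and run this program with TSan and you get a report along these lines (illustrative; the exact header, addresses, and stack frames depend on your compiler and library versions):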

==================
WARNING: ThreadSanitizer: data race (detected at pc 0x000000400d33 bp 0x7b9000000950 sp 0x7b9000000948)
  Write of size 4 at 0x000000602004 by thread T2:
    #0 increment() example.cpp:6 (example+0x400d33)
    #1 std::execute_native_thread_routine(void*) /usr/include/c++/11/thread:240
    #2 start_thread /lib/x86_64-linux-gnu/libpthread.so.0 +0x8643
    #3 clone /lib/x86_64-linux-gnu/libc.so.6 +0x12141f

  Previous write of size 4 at 0x000000602004 by thread T1:
    #0 increment() example.cpp:6 (example+0x400d33)
    #1 std::execute_native_thread_routine(void*) /usr/include/c++/11/thread:240
    #2 start_thread /lib/x86_64-linux-gnu/libpthread.so.0 +0x8643
    #3 clone /lib/x86_64-linux-gnu/libc.so.6 +0x12141f

  Location is global 'counter' of size 4 at 0x000000602004 (example+0x000000002004)

  Thread T2 (tid=2, running) created at:
    #0 std::thread::_M_start_thread(_Unique_if_no_destroy<_Thread_impl<std::unique_ptr<std::packaged_task<void ()> >, void> >*) /usr/include/c++/11/thread:345
    #1 main example.cpp:14 (example+0x400e76)

  Thread T1 (tid=1, running) created at:
    #0 std::thread::_M_start_thread(_Unique_if_no_destroy<_Thread_impl<std::unique_ptr<std::packaged_task<void ()> >, void> >*) /usr/include/c++/11/thread:345
    #1 main example.cpp:13 (example+0x400e26)
==================

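Again the report points at counter++ in increment() (shown as line 6 in this sample output; line 8 in the listing above). It also names the global variable involved and shows where each thread was created.

3. Advantages of TSan

Fast: TSan is much faster than Helgrind and slows the program down far less.
Broader detection: TSan catches more kinds of data races, including races that span compilation units, provided every unit involved is compiled with -fsanitize=thread.
Better error reports: TSan's reports are more detailed, with stacks for both accesses plus each thread's creation point, which makes the problem easier to locate.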

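IV. rr: record and replay a concurrent program's execution

rr (record and replay) is a powerful debugging tool that records a program's execution and then replays it. That is a big help with concurrent programs, whose behavior is often nondeterministic and hard to reproduce. One caveat: rr runs all of the recorded program's threads on a single core, so some races show up less often while recording. rr's chaos mode (rr record --chaos) randomizes scheduling to help shake them out.

1. Installing and using rr

First, you need to install rr. On Debian/Ubuntu, you can install it with: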

sudo apt-get update
sudo apt-get install rr

Once it's installed, you can record a run of your program with rr:

rr record ./your_program

rr creates a directory that holds the recording of the run. You can then replay it with the rr replay command, which lets you debug the replayed execution in GDB:

rr replay

2. An rr example

Here's an example that shows how to debug a concurrent program with rr:

#include <iostream>
#include <thread>
#include <vector>
#include <random>
#include <chrono>

std::vector<int> data;
bool ready = false; // plain bool: read and written by both threads with no synchronization

void producer() {
    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<> distrib(1, 100);

    for (int i = 0; i < 10; ++i) {
        data.push_back(distrib(gen));
        std::this_thread::sleep_for(std::chrono::milliseconds(distrib(gen))); // simulate some work
    }
    ready = true;
    std::cout << "Producer finished" << std::endl;
}

void consumer() {
    while (!ready) {
        std::this_thread::sleep_for(std::chrono::milliseconds(10)); // wait for the producer
    }

    std::cout << "Consumer started" << std::endl;
    for (std::size_t i = 0; i < data.size(); ++i) { // reads data with no synchronization against the producer
        std::cout << "Data[" << i << "] = " << data[i] << std::endl;
    }
}

int main() {
    std::thread t1(producer);
    std::thread t2(consumer);

    t1.join();
    t2.join();

    return 0;
}

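In this example, the producer thread appends values to the data vector and then sets the ready flag. The consumer thread waits for ready and then reads the vector. The catch is that ready is a plain bool and data is shared by both threads, with no synchronization at all. That is a data race: the compiler is free to hoist the read of ready out of the loop, so the consumer may spin forever, and even when the consumer does see ready == true, nothing guarantees that it also sees the vector's final contents.

First, record the program's execution with rr: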

rr record ./example

Then replay the recording with rr replay and debug it in GDB:

rr replay

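In GDB you can set breakpoints, single-step, inspect variables, and so on. Because rr replays the recorded execution exactly, you can go over the same run as many times as you need until you find the problem.

In this example, you could set a breakpoint in the consumer thread and check ready and data.size() when the loop starts. You can also put a watchpoint on ready and use reverse-continue to run backwards to the write that set it.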

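3. Advantages of rr

Exact replay: rr replays the recorded execution exactly, including the thread scheduling order, random number generation, and so on.
Easy debugging: rr integrates with GDB, so you can debug a concurrent program much as you would a single-threaded one.
Time travel: rr lets you move forward or backward through the execution (reverse-continue, reverse-step, and so on in GDB), which makes it easier to see how the program reached a given state.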

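V. Using Helgrind, TSan, and rr together

The three tools can be combined to get more out of each. The usual workflow is:

Use Helgrind or TSan to detect data races and deadlocks.
If they find a problem, record the program's execution with rr.
Replay the recording with rr replay and debug it in GDB.

This way you can quickly locate and fix bugs in concurrent programs.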

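VI. Summary

Tool	What it does	Pros	Cons
`Helgrind`	Detects data races and deadlocks	Finds fairly obvious data races and deadlocks; easy to get started with.	Slow; relatively high false-positive rate; can miss some data races.
`TSan`	Detects data races	Fast; low false-positive rate; finds more kinds of data races.	Needs compiler support; has some impact on program performance.
`rr`	Records and replays a program's execution	Replays an execution exactly; makes concurrent programs easier to debug; supports time travel.	Has a learning curve; recordings can be large; recording may slow the program down.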

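Concurrent programming is a complex topic that takes continual learning and practice. I hope today's post helps you debug your C++ concurrent programs more effectively. Remember, tools only assist; what matters more is understanding the principles of concurrent programming and building good coding habits. Happy coding!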
