C++ CPU Affinity: Binding Threads to Specific CPU Cores for Better Performance

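Alright, dear viewers and fellow coders, welcome to today's special session, "Pin Your Threads to CPUs and Watch Performance Climb"! I'm your old friend, the performance-tuning specialist whose job is to help you make your code faster and stronger. Today we'll look at how to set CPU affinity in C++, so that your threads and CPU cores hit it off and you can squeeze the last drop of performance out of the hardware.

Opening: Do You Really Know Your CPU?

Before we start, let me ask you a question: do you really know your CPU? Don't rush to answer "Of course: so many cores, so many threads, such-and-such boost clock." What I'm asking is whether you know how your program actually runs on the CPU.

Picture the threads of your program as a flock of hungry chicks and the CPU cores as hard-working mother hens. By default, the operating system's scheduler decides which hen feeds which chick, and it is free to move a chick to another hen at any time. Everyone gets fed in the end, but the efficiency is another matter.

CPU affinity lets us specify which chicks may be fed only by which hens. The chicks stop running around, the hens stop switching back and forth, a good deal of effort is saved, and performance improves.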

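What Is CPU Affinity?

Simply put, CPU affinity means binding a process or thread to one or more specific CPU cores. This reduces how often the thread migrates between cores and raises the cache hit rate, which can improve performance.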
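To make the definition concrete, here is a minimal Linux-only sketch that lists the logical CPUs the calling thread is currently allowed to run on, which is exactly the set that affinity controls. The helper name allowed_cpus is made up for this illustration; sched_getaffinity and the CPU_* macros come from glibc.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // needed for sched_getaffinity and the CPU_* macros
#endif
#include <sched.h>
#include <vector>

// Hypothetical helper: returns the logical CPU ids the calling thread may run on.
// An empty vector means the query failed.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    // pid 0 means "the calling thread"
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        return {};
    }
    std::vector<int> cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) {
            cpus.push_back(cpu);
        }
    }
    return cpus;
}
```

On an unrestricted machine the result usually contains every online CPU; inside a container or under taskset it can be much shorter.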

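Why Set CPU Affinity?

Fewer migrations: When the scheduler moves a thread to another CPU core, the thread has to warm up that core's caches and TLB again, which costs CPU time. Binding the thread avoids or reduces these migrations. Ordinary context switches on the same core still happen.
Higher cache hit rate: A thread that keeps running on the same CPU core can reuse the data already in that core's caches, which reduces trips to main memory and improves performance.
NUMA optimization: On NUMA (Non-Uniform Memory Access) systems, the latency of a memory access depends on which CPU core issues it and which node holds the memory. Binding a thread to cores on the node closest to the memory it uses lowers memory access latency.
Less interference: In some cases you may want to bind critical threads to specific CPU cores so that other threads interfere with them less. This works best when the other threads are kept off those cores as well.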

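How to Set CPU Affinity in C++

The C++ standard library has no affinity interface, so setting CPU affinity relies on APIs provided by the operating system. Different operating systems provide different APIs; the Linux and Windows approaches are described below.

1. CPU Affinity on Linux

On Linux, the sched_setaffinity and sched_getaffinity system calls set and get the CPU affinity of a thread identified by its kernel thread ID. For threads created through pthreads or std::thread, the glibc wrappers pthread_setaffinity_np and pthread_getaffinity_np take a pthread_t instead, and those are what the code below uses.

The pthread_setaffinity_np function: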

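#ifndef _GNU_SOURCE
#define _GNU_SOURCE // Exposes CPU_SET, CPU_ZERO and pthread_setaffinity_np; g++ usually predefines it
#endif
#include <sched.h>
#include <pthread.h>
#include <cstring>  // std::strerror
#include <iostream>
#include <unistd.h> // Required for sysconf

// Restrict 'thread' to the single logical CPU 'cpu_id'. Returns 0 on success, -1 on failure.
int set_cpu_affinity(pthread_t thread, int cpu_id) {
    if (cpu_id < 0 || cpu_id >= CPU_SETSIZE) {
        std::cerr << "CPU id out of range: " << cpu_id << std::endl;
        return -1;
    }

    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);

    // pthread functions return the error number directly instead of setting errno
    int result = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    if (result != 0) {
        std::cerr << "Failed to set CPU affinity: " << std::strerror(result) << std::endl;
        return -1;
    }
    return 0;
}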

pthread_t thread: the thread whose affinity is being set.
size_t cpusetsize: the size of cpuset in bytes, usually sizeof(cpu_set_t).
cpu_set_t *cpuset: a CPU set specifying the CPU cores the thread may run on.
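Because cpuset is a set rather than a single id, a thread can also be allowed to run on several cores at once. The sketch below shows one way to build such a set; the helper names make_cpu_set and bind_current_thread are invented for this illustration, and ids outside the range a cpu_set_t can hold are simply skipped, which is a design choice of the sketch rather than anything the API requires.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE // needed for cpu_set_t, the CPU_* macros and pthread_setaffinity_np
#endif
#include <sched.h>
#include <pthread.h>
#include <vector>

// Hypothetical helper: builds a cpu_set_t from a list of logical CPU ids,
// silently skipping ids outside [0, CPU_SETSIZE).
cpu_set_t make_cpu_set(const std::vector<int>& cpu_ids) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int id : cpu_ids) {
        if (id >= 0 && id < CPU_SETSIZE) {
            CPU_SET(id, &set);
        }
    }
    return set;
}

// Hypothetical helper: lets the calling thread run on any CPU in cpu_ids.
// Returns 0 on success, otherwise the error number from pthread_setaffinity_np.
int bind_current_thread(const std::vector<int>& cpu_ids) {
    cpu_set_t set = make_cpu_set(cpu_ids);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

A call such as bind_current_thread({0, 1}) would let the scheduler move the thread freely between cores 0 and 1 while keeping it off every other core.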

Usage example:

#include <iostream>
#include <thread>
#include <vector>

// Assumes set_cpu_affinity() and the headers from the previous listing.
void worker_thread(int id) {
    // Bind this thread to CPU core 'id'
    if (set_cpu_affinity(pthread_self(), id) == 0) {
        std::cout << "Thread " << id << " bound to CPU core " << id << std::endl;
    }

    // Simulate some work; 'volatile' keeps the compiler from deleting the loop
    volatile long counter = 0;
    for (int i = 0; i < 1000000; ++i) {
        counter = counter + 1;
    }
    std::cout << "Thread " << id << " finished." << std::endl;
}

int main() {
    long num_cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (num_cores < 1) num_cores = 1;
    std::cout << "Number of available cores: " << num_cores << std::endl;

    // std::vector replaces the variable-length array, which is not standard C++
    std::vector<std::thread> threads;
    for (int i = 0; i < num_cores; ++i) {
        threads.emplace_back(worker_thread, i);
    }

    for (auto& t : threads) {
        t.join();
    }

    return 0;
}
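In the example above each worker pins itself. A variation is for the parent to pin a std::thread through native_handle(), which is a pthread_t with libstdc++ and libc++ on Linux. The sketch below is one possible shape under that assumption; first_allowed_cpu and spawn_pinned_worker are names invented here, and the promise/future pair is only there so the worker does not start its work before the parent has applied the affinity.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <pthread.h>
#include <future>
#include <thread>

// Hypothetical helper: lowest-numbered CPU the calling thread may use, or -1.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Hypothetical helper: starts a worker, pins it to 'cpu' from the parent thread,
// and returns how many CPUs the worker reports as allowed (1 on success, -1 on error).
int spawn_pinned_worker(int cpu) {
    std::promise<void> pinned;               // parent -> worker: "affinity is set"
    std::future<void> ready = pinned.get_future();
    std::promise<int> report;                // worker -> parent: allowed CPU count
    std::future<int> result = report.get_future();

    std::thread worker([&ready, &report] {
        ready.wait();                        // do not start work before being pinned
        cpu_set_t mine;
        CPU_ZERO(&mine);
        int rc = pthread_getaffinity_np(pthread_self(), sizeof(mine), &mine);
        report.set_value(rc == 0 ? CPU_COUNT(&mine) : -1);
    });

    int rc = -1;                             // stays -1 when 'cpu' is out of range
    if (cpu >= 0 && cpu < CPU_SETSIZE) {
        cpu_set_t target;
        CPU_ZERO(&target);
        CPU_SET(cpu, &target);
        rc = pthread_setaffinity_np(worker.native_handle(), sizeof(target), &target);
    }
    pinned.set_value();                      // release the worker either way

    int count = result.get();
    worker.join();
    return rc == 0 ? count : -1;
}
```

Pinning from the parent keeps the affinity policy in one place, at the cost of the small handshake shown here.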

The pthread_getaffinity_np function:

#include <sched.h>
#include <pthread.h>
#include <iostream>

int get_cpu_affinity(pthread_t thread) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);

    int result = pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    if (result != 0) {
        std::cerr << "Failed to get CPU affinity: " << result << std::endl;
        return -1;
    }

    std::cout << "Thread affinity: ";
    for (int i = 0; i < CPU_SETSIZE; ++i) {
        if (CPU_ISSET(i, &cpuset)) {
            std::cout << i << " ";
        }
    }
    std::cout << std::endl;
    return 0;
}

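pthread_t thread: the thread whose affinity is queried; pass pthread_self() for the calling thread.
size_t cpusetsize: the size of cpuset in bytes, usually sizeof(cpu_set_t).
cpu_set_t *cpuset: a CPU set that receives the CPU cores the thread is allowed to run on.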
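One practical use of the getter is to save the current mask before changing it, so it can be put back later. The following is a minimal sketch of that idea as an RAII guard; the class name ScopedAffinity and the helper current_cpu_count are hypothetical, and the guard only affects the calling thread.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <pthread.h>

// Hypothetical RAII guard: pins the calling thread to one CPU for the lifetime
// of the object and restores the previous affinity mask on destruction.
class ScopedAffinity {
public:
    explicit ScopedAffinity(int cpu) {
        CPU_ZERO(&saved_);
        if (pthread_getaffinity_np(pthread_self(), sizeof(saved_), &saved_) != 0) {
            return; // could not read the old mask, so leave everything untouched
        }
        if (cpu < 0 || cpu >= CPU_SETSIZE) return;
        cpu_set_t target;
        CPU_ZERO(&target);
        CPU_SET(cpu, &target);
        active_ = pthread_setaffinity_np(pthread_self(), sizeof(target), &target) == 0;
    }
    ~ScopedAffinity() {
        if (active_) {
            pthread_setaffinity_np(pthread_self(), sizeof(saved_), &saved_);
        }
    }
    ScopedAffinity(const ScopedAffinity&) = delete;
    ScopedAffinity& operator=(const ScopedAffinity&) = delete;

    // True when the pin was applied and will be undone by the destructor.
    bool active() const { return active_; }

private:
    cpu_set_t saved_;
    bool active_ = false;
};

// Number of CPUs the calling thread may currently run on, or -1 on error.
int current_cpu_count() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0) return -1;
    return CPU_COUNT(&set);
}
```

Wrapping only the hot section of a function in such a guard limits the pin to the part that benefits from it.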

Usage example:

#include <iostream>
#include <thread>
#include <vector>

// Assumes set_cpu_affinity(), get_cpu_affinity() and the headers from the previous listings.
void worker_thread(int id) {
    // Bind this thread to CPU core 'id'
    if (set_cpu_affinity(pthread_self(), id) == 0) {
        std::cout << "Thread " << id << " bound to CPU core " << id << std::endl;
    }

    // Get and print the CPU affinity
    get_cpu_affinity(pthread_self());

    // Simulate some work; 'volatile' keeps the compiler from deleting the loop
    volatile long counter = 0;
    for (int i = 0; i < 1000000; ++i) {
        counter = counter + 1;
    }
    std::cout << "Thread " << id << " finished." << std::endl;
}

int main() {
    long num_cores = sysconf(_SC_NPROCESSORS_ONLN);
    if (num_cores < 1) num_cores = 1;
    std::cout << "Number of available cores: " << num_cores << std::endl;

    // std::vector replaces the variable-length array, which is not standard C++
    std::vector<std::thread> threads;
    for (int i = 0; i < num_cores; ++i) {
        threads.emplace_back(worker_thread, i);
    }

    for (auto& t : threads) {
        t.join();
    }

    return 0;
}

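2. CPU Affinity on Windows

On Windows, SetThreadAffinityMask sets a thread's CPU affinity. There is no GetThreadAffinityMask function: the previous mask is the return value of SetThreadAffinityMask, and the current mask can be read with GetThreadGroupAffinity.

The SetThreadAffinityMask function: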

#include <windows.h>
#include <iostream>
#include <thread>

DWORD WINAPI worker_thread(LPVOID lpParam) {
    DWORD_PTR mask = (DWORD_PTR)lpParam;
    if (!SetThreadAffinityMask(GetCurrentThread(), mask)) {
        std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
    } else {
        std::cout << "Thread bound to CPU mask: " << mask << std::endl;
    }

    // Simulate some work
    for (int i = 0; i < 1000000; ++i) {
        // Do something
    }
    std::cout << "Thread finished." << std::endl;
    return 0;
}

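HANDLE hThread: handle of the thread whose affinity is being set.
DWORD_PTR dwThreadAffinityMask: a bit mask specifying the CPU cores the thread may run on. For example, to let the thread run on CPU cores 0 and 1, set dwThreadAffinityMask to 0x03 (binary 00000011). The mask must be a subset of the process affinity mask, and it only covers the logical processors of the thread's processor group (at most 64).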
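Writing masks by hand gets error-prone beyond a couple of cores, so it can help to build them from core indices. The sketch below is portable C++ rather than Windows code: std::uintptr_t stands in for DWORD_PTR (an assumption made so the snippet compiles anywhere), and mask_from_cores is a name invented for this illustration.

```cpp
#include <cstdint>
#include <initializer_list>

// Stand-in for DWORD_PTR (assumption): an unsigned integer as wide as a pointer.
using AffinityMask = std::uintptr_t;

// Hypothetical helper: sets bit n for every core index n in 'cores'.
// Indices that do not fit in the mask are ignored.
AffinityMask mask_from_cores(std::initializer_list<unsigned> cores) {
    constexpr unsigned kBits = sizeof(AffinityMask) * 8;
    AffinityMask mask = 0;
    for (unsigned core : cores) {
        if (core < kBits) {
            mask |= static_cast<AffinityMask>(1) << core;
        }
    }
    return mask;
}
```

For instance, mask_from_cores({0, 1}) yields 0x03, the value used in the parameter description above.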

Usage example:

#include <iostream>
#include <thread>
#include <windows.h>

DWORD WINAPI worker_thread(LPVOID lpParam) {
    DWORD_PTR mask = (DWORD_PTR)lpParam;
    if (!SetThreadAffinityMask(GetCurrentThread(), mask)) {
        std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
    } else {
        std::cout << "Thread bound to CPU mask: " << mask << std::endl;
    }

    // Simulate some work
    for (int i = 0; i < 1000000; ++i) {
        // Do something
    }
    std::cout << "Thread finished." << std::endl;
    return 0;
}

int main() {
    // Get the number of processors
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    int num_cores = sysinfo.dwNumberOfProcessors;

    std::cout << "Number of available cores: " << num_cores << std::endl;

    // Create a thread and bind it to core 0
    HANDLE thread = CreateThread(
        NULL,   // default security attributes
        0,      // use default stack size
        worker_thread, // thread function
        (LPVOID)1,    // argument to thread function (CPU mask)
        0,      // use default creation flags
        NULL);  // returns the thread identifier

    if (thread == NULL) {
        std::cerr << "Failed to create thread: " << GetLastError() << std::endl;
        return 1;
    }

    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);

    return 0;
}

Querying a thread's affinity mask:

#include <windows.h>
#include <iostream>

DWORD WINAPI worker_thread(LPVOID lpParam) {
    DWORD_PTR processMask, systemMask;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask)) {
        std::cerr << "Failed to get process affinity mask: " << GetLastError() << std::endl;
    } else {
        std::cout << "Process Affinity Mask: " << processMask << std::endl;
        std::cout << "System Affinity Mask: " << systemMask << std::endl;
    }

    // SetThreadAffinityMask returns the previous mask, or 0 on failure
    DWORD_PTR previousMask = SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)lpParam);
    if (!previousMask) {
        std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
    } else {
        std::cout << "Previous thread mask: " << previousMask << std::endl;
        std::cout << "Thread bound to CPU mask: " << (DWORD_PTR)lpParam << std::endl;
    }

    // Windows has no GetThreadAffinityMask; GetThreadGroupAffinity reports the current mask
    GROUP_AFFINITY groupAffinity;
    if (!GetThreadGroupAffinity(GetCurrentThread(), &groupAffinity)) {
        std::cerr << "Failed to get thread affinity mask: " << GetLastError() << std::endl;
    } else {
        std::cout << "Thread Affinity Mask: " << groupAffinity.Mask << std::endl;
        std::cout << "Processor Group: " << groupAffinity.Group << std::endl;
    }

    // Simulate some work
    for (int i = 0; i < 1000000; ++i) {
        // Do something
    }
    std::cout << "Thread finished." << std::endl;
    return 0;
}

HANDLE hThread: handle of the thread to query (first parameter of GetThreadGroupAffinity).
PGROUP_AFFINITY GroupAffinity: pointer to a GROUP_AFFINITY structure that receives the thread's processor group (Group) and its affinity mask within that group (Mask).
PDWORD_PTR lpProcessAffinityMask, lpSystemAffinityMask: outputs of GetProcessAffinityMask, used at the top of the function; they receive the process affinity mask and the system affinity mask.
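The listing prints masks as plain decimal numbers, which are hard to read. A small decoding step turns a mask back into core indices. This is a portable sketch, not Windows code: std::uintptr_t stands in for DWORD_PTR (an assumption), and cores_from_mask is a hypothetical helper name.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: lists the indices of the set bits in an affinity mask.
// std::uintptr_t stands in for DWORD_PTR (assumption).
std::vector<unsigned> cores_from_mask(std::uintptr_t mask) {
    std::vector<unsigned> cores;
    for (unsigned bit = 0; mask != 0; ++bit, mask >>= 1) {
        if (mask & 1u) {
            cores.push_back(bit);
        }
    }
    return cores;
}
```

A mask of 0x5, for example, decodes to cores 0 and 2.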

Usage example:

#include <iostream>
#include <thread>
#include <windows.h>

DWORD WINAPI worker_thread(LPVOID lpParam) {
    DWORD_PTR processMask, systemMask;
    if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask)) {
        std::cerr << "Failed to get process affinity mask: " << GetLastError() << std::endl;
    } else {
        std::cout << "Process Affinity Mask: " << processMask << std::endl;
        std::cout << "System Affinity Mask: " << systemMask << std::endl;
    }

    DWORD_PTR threadMask = SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)lpParam);
    if (!threadMask) {
        std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
    } else {
        std::cout << "Thread bound to CPU mask: " << (DWORD_PTR)lpParam << std::endl;
    }

    // Windows has no GetThreadAffinityMask; GetThreadGroupAffinity reports the current mask
    GROUP_AFFINITY groupAffinity;
    if (!GetThreadGroupAffinity(GetCurrentThread(), &groupAffinity)) {
        std::cerr << "Failed to get thread affinity mask: " << GetLastError() << std::endl;
    } else {
        std::cout << "Thread Affinity Mask: " << groupAffinity.Mask << std::endl;
        std::cout << "Processor Group: " << groupAffinity.Group << std::endl;
    }

    // Simulate some work
    for (int i = 0; i < 1000000; ++i) {
        // Do something
    }
    std::cout << "Thread finished." << std::endl;
    return 0;
}

int main() {
    // Get the number of processors
    SYSTEM_INFO sysinfo;
    GetSystemInfo(&sysinfo);
    int num_cores = sysinfo.dwNumberOfProcessors;

    std::cout << "Number of available cores: " << num_cores << std::endl;

    // Create a thread and bind it to core 0
    HANDLE thread = CreateThread(
        NULL,   // default security attributes
        0,      // use default stack size
        worker_thread, // thread function
        (LPVOID)1,    // argument to thread function (CPU mask)
        0,      // use default creation flags
        NULL);  // returns the thread identifier

    if (thread == NULL) {
        std::cerr << "Failed to create thread: " << GetLastError() << std::endl;
        return 1;
    }

    WaitForSingleObject(thread, INFINITE);
    CloseHandle(thread);

    return 0;
}

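Notes and Best Practices

Get the number of CPU cores: Before setting CPU affinity, always query the number of CPU cores so that you never request a core that does not exist. Also check which CPUs the process is actually allowed to use, since containers, cgroups, or taskset may restrict it to a subset.
NUMA: On NUMA systems, take memory locality into account and bind each thread to CPU cores on the node closest to the memory it uses.
Hyper-threading: Hyper-threading presents one physical core as two logical cores that share the core's execution units and caches. Find out the system's hyper-threading layout before setting affinity, and avoid binding two compute-heavy threads to the two logical cores of the same physical core, which can lower performance.
Test and verify: After setting CPU affinity, always test and verify that it delivers the expected gain. Profiling tools can monitor per-thread CPU utilization and cache hit rates.
Don't over-optimize: CPU affinity is no silver bullet, and over-optimizing increases code complexity and maintenance cost. Consider it only when a performance bottleneck really exists.
Avoid hard-coding: Do not hard-code CPU core IDs. Set CPU affinity dynamically, based on the actual system.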
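On Linux, the hyper-threading and NUMA layout mentioned above can be read from sysfs instead of being hard-coded. The sketch below parses the kernel's "cpulist" text format (for example "0-3,8") and reads the hyper-threading siblings of one CPU from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The function names parse_cpu_list and thread_siblings are invented for this illustration, and the sketch returns an empty list when the file cannot be read.

```cpp
#include <exception>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parses the Linux "cpulist" format, e.g. "0-3,8", into CPU ids.
// Malformed fields are skipped.
std::vector<int> parse_cpu_list(const std::string& text) {
    std::vector<int> cpus;
    std::stringstream fields(text);
    std::string field;
    while (std::getline(fields, field, ',')) {
        try {
            std::size_t dash = field.find('-');
            int first = std::stoi(field.substr(0, dash));
            int last = (dash == std::string::npos) ? first : std::stoi(field.substr(dash + 1));
            for (int cpu = first; cpu <= last; ++cpu) {
                cpus.push_back(cpu);
            }
        } catch (const std::exception&) {
            // not a number or a range: ignore this field
        }
    }
    return cpus;
}

// Hypothetical helper: logical CPUs that share a physical core with 'cpu'
// (including 'cpu' itself), or an empty list if sysfs is unavailable.
std::vector<int> thread_siblings(int cpu) {
    std::ifstream file("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                       "/topology/thread_siblings_list");
    std::string line;
    if (!std::getline(file, line)) {
        return {};
    }
    return parse_cpu_list(line);
}
```

If thread_siblings(0) returns two ids, those two logical CPUs share one physical core, so two compute-heavy threads are better placed on CPUs from different sibling lists.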

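When to Set CPU Affinity

In the following scenarios, setting CPU affinity can bring a noticeable performance gain:

Compute-intensive multithreaded applications: For multithreaded applications that do heavy computation, such as image processing or scientific computing, binding each thread to its own CPU core keeps the threads from migrating and from competing for the same core, so the multi-core CPU's parallel capacity is put to good use.
Applications with tight real-time requirements: For applications such as audio/video processing or games, binding critical threads to specific CPU cores reduces interference from other threads and protects their performance.
NUMA systems: On NUMA systems, binding a thread to CPU cores on the node closest to the memory it uses lowers memory access latency and improves performance.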

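Summary

CPU affinity is a powerful performance-optimization technique that helps you make full use of a multi-core CPU's parallel capacity and speed up your program. It is no silver bullet, though, and has to be tested and verified against your actual workload. I hope today's session helps you understand and apply CPU affinity, so that your code runs faster and steadier!

A Final Easter Egg: A More Advanced Example

Suppose we have an image-processing application that processes an image in blocks. We can split the image into several blocks and create several threads, each responsible for one block. To improve performance, we can bind each thread to a different CPU core.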

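// _GNU_SOURCE must be defined before the first system header to expose the affinity API
#if !defined(_WIN32) && !defined(_GNU_SOURCE)
#define _GNU_SOURCE
#endif

#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <numeric>

#ifdef _WIN32
#include <windows.h>
#else
#include <sched.h>
#include <pthread.h>
#include <unistd.h>
#endif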

// Function to set CPU affinity for a thread
#ifdef _WIN32
bool set_thread_affinity(HANDLE thread, DWORD_PTR mask) {
    if (!SetThreadAffinityMask(thread, mask)) {
        std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
        return false;
    }
    return true;
}
#else
bool set_thread_affinity(pthread_t thread, int cpu_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);

    int result = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    if (result != 0) {
        std::cerr << "Failed to set CPU affinity: " << result << std::endl;
        return false;
    }
    return true;
}
#endif

// Function to simulate image processing
void process_image_chunk(int chunk_id, int num_chunks, int image_width, int image_height, std::vector<int>& image_data) {
    // Determine the start and end rows for this chunk
    int start_row = (chunk_id * image_height) / num_chunks;
    int end_row = ((chunk_id + 1) * image_height) / num_chunks;

    // Simulate some processing
    for (int row = start_row; row < end_row; ++row) {
        for (int col = 0; col < image_width; ++col) {
            image_data[row * image_width + col] += chunk_id; // Just add the chunk_id to each pixel
        }
    }

    std::cout << "Chunk " << chunk_id << " processed." << std::endl;
}

int main() {
    // Image dimensions
    int image_width = 2048;
    int image_height = 2048;

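    // Number of chunks and threads; hardware_concurrency() may return 0 when the count is unknown
    int num_chunks = static_cast<int>(std::thread::hardware_concurrency());
    if (num_chunks < 1) {
        num_chunks = 1;
    }
    std::cout << "Using " << num_chunks << " threads." << std::endl;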

    // Image data
    std::vector<int> image_data(image_width * image_height, 0);

    // Start time
    auto start_time = std::chrono::high_resolution_clock::now();

    // Create and run threads
    std::vector<std::thread> threads;
    for (int i = 0; i < num_chunks; ++i) {
        threads.emplace_back([i, num_chunks, image_width, image_height, &image_data]() {
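            // Set CPU affinity (assumes logical CPUs 0..num_chunks-1 are available to this process)
#ifdef _WIN32
            HANDLE current_thread = GetCurrentThread();
            // DWORD_PTR is pointer-sized, so shift a DWORD_PTR; valid only for i below its bit width
            set_thread_affinity(current_thread, static_cast<DWORD_PTR>(1) << i); // Bind to core i
#else
            set_thread_affinity(pthread_self(), i); // Bind to core i
#endif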

            // Process the image chunk
            process_image_chunk(i, num_chunks, image_width, image_height, image_data);
        });
    }

    // Join threads
    for (auto& thread : threads) {
        thread.join();
    }

    // End time
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

    std::cout << "Image processing took " << duration.count() << " milliseconds." << std::endl;

    // Verify result (optional): every pixel in chunk k should now hold the value k
    long long actual_sum = std::accumulate(image_data.begin(), image_data.end(), 0LL);

    // Use the same row split as process_image_chunk so integer division is accounted for
    long long expected_sum = 0;
    for (int chunk_id = 0; chunk_id < num_chunks; ++chunk_id) {
      int start_row = (chunk_id * image_height) / num_chunks;
      int end_row = ((chunk_id + 1) * image_height) / num_chunks;
      expected_sum += (long long)chunk_id * image_width * (end_row - start_row);
    }
    if (actual_sum == expected_sum){
      std::cout << "Verification successful." << std::endl;
    } else {
      std::cout << "Verification failed. Expected sum: " << expected_sum << ", Actual sum: " << actual_sum << std::endl;
    }

    return 0;
}

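This example shows how to bind several threads to different CPU cores and have them process different parts of an image in parallel. In this way we can make good use of the multi-core CPU's parallel capacity and speed up the image processing.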
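The example binds worker i to core i, which assumes that cores 0 to N-1 are all available to the process. When the allowed CPUs are a subset (for example inside a container), one hedged alternative is to cycle through the allowed list. The sketch below shows only that mapping step; cpu_for_worker is a hypothetical helper, and the allowed list is assumed to come from a query such as sched_getaffinity on Linux.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: picks the CPU for worker number 'worker' by cycling through
// the CPUs the process may actually use. Returns -1 when 'allowed' is empty.
int cpu_for_worker(std::size_t worker, const std::vector<int>& allowed) {
    if (allowed.empty()) {
        return -1;
    }
    return allowed[worker % allowed.size()];
}
```

With allowed CPUs {2, 3, 6}, workers 0, 1, 2, 3 would land on CPUs 2, 3, 6, 2, so extra workers wrap around instead of requesting a core that is off limits.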

Thanks for watching, and I hope you got something out of it. Class dismissed!
