C++ CPU Affinity: Binding Threads to Specific CPU Cores for Better Performance

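Alright, dear viewers and fellow coders, welcome to today's session, "Pin Your Threads to CPUs and Watch Performance Climb"! I'm your old friend, the performance optimization specialist whose job is to help you make your code faster and stronger. Today we'll talk about how to set CPU affinity in C++, so that your threads and CPU cores get along perfectly and you can squeeze the last drop of performance out of the hardware.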

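Opening: Do You Really Know Your CPU?

Before we start, one question: do you really know your CPU? Don't rush to answer "of course, so many cores, so many threads, such-and-such boost clock." What I'm asking is whether you know how your program actually runs on the CPU.

Picture your program's threads as a flock of hungry chicks, and the CPU cores as hard-working mother hens. By default, each chick goes to whichever hen the scheduler picks, and it may be moved to a different hen at any time. Everyone gets fed in the end, but the efficiency is another matter.

CPU affinity lets us specify which chicks may only be fed by which hens. The chicks stop running around, the hens stop switching back and forth, a good deal of effort is saved, and performance often improves.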

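What Is CPU Affinity?

Simply put, CPU affinity means binding a process or thread to one or more specific CPU cores. This reduces how often the thread migrates between cores and improves the cache hit rate, which can improve performance.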

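Why Set CPU Affinity?

  • Fewer migrations: when the scheduler moves a thread to another core, the thread arrives with cold caches and a cold TLB and has to warm them up again, which wastes CPU time. Binding avoids or reduces these migrations. It does not remove the ordinary context switches between threads that share a core.
  • Better cache hit rate: a thread that keeps running on the same core can reuse the data already in that core's caches, so it goes to main memory less often.
  • NUMA optimization: on NUMA (Non-Uniform Memory Access) systems, memory latency depends on which core accesses which memory. Binding a thread to a core close to the memory it uses lowers its memory access latency.
  • Less interference: sometimes you want critical threads on dedicated cores so that other threads do not disturb them and their performance stays predictable.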
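
To make the migrations mentioned above visible, here is a minimal Linux-only sketch. It samples sched_getcpu() in a loop and counts how often the reported CPU changes, so the count can be compared before and after pinning. The helper names pin_to_current_cpu and count_migrations are made up for this illustration, and the sample count is arbitrary.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // sched_getcpu and the CPU_* macros (g++ usually predefines this)
#endif
#include <sched.h>

// Hypothetical helper: pin the calling thread to the CPU it is running on
// right now. Returns that CPU id, or -1 on failure.
int pin_to_current_cpu() {
    int cpu = sched_getcpu();
    if (cpu < 0 || cpu >= CPU_SETSIZE) return -1;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread".
    if (sched_setaffinity(0, sizeof(set), &set) != 0) return -1;
    return cpu;
}

// Hypothetical helper: sample sched_getcpu() repeatedly and count how often
// the reported CPU differs from the previous sample.
int count_migrations(int samples) {
    int migrations = 0;
    int last = sched_getcpu();
    for (int i = 0; i < samples; ++i) {
        int now = sched_getcpu();
        if (now != last) ++migrations;
        last = now;
    }
    return migrations;
}
```

Calling count_migrations before and after pin_to_current_cpu() shows the effect: on a busy machine the first count may be nonzero, while after pinning the thread has only one CPU left to run on.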

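How Do You Set CPU Affinity in C++?

Standard C++ has no affinity interface, so setting CPU affinity relies on APIs provided by the operating system. Each OS has its own; the Linux and Windows approaches are covered below.

1. Setting CPU Affinity on Linux

On Linux, sched_setaffinity and sched_getaffinity set and get affinity by PID or TID (0 means the calling thread). When you hold a pthread_t, as the examples below do, the equivalents are pthread_setaffinity_np and pthread_getaffinity_np.

  • Setting affinity (sched_setaffinity / pthread_setaffinity_np):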

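    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE // For CPU_SET, CPU_ZERO; g++ usually predefines it
    #endif
    #include <sched.h>
    #include <pthread.h>
    #include <iostream>
    #include <unistd.h> // Required for sysconf

    // Bind 'thread' to a single CPU core. Returns 0 on success, -1 on failure.
    int set_cpu_affinity(pthread_t thread, int cpu_id) {
        if (cpu_id < 0 || cpu_id >= CPU_SETSIZE) {
            std::cerr << "CPU id out of range: " << cpu_id << std::endl;
            return -1;
        }

        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(cpu_id, &cpuset);

        // pthread functions return the error number directly instead of setting errno
        int result = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        if (result != 0) {
            std::cerr << "Failed to set CPU affinity: " << result << std::endl;
            return -1;
        }
        return 0;
    }

    Parameters of pthread_setaffinity_np:

    • pthread_t thread: the thread whose affinity is being set.
    • size_t cpusetsize: the size of cpuset, normally sizeof(cpu_set_t).
    • cpu_set_t *cpuset: a CPU set specifying the cores the thread may run on.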
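
A cpu_set_t can hold more than one core, and the sched_setaffinity call mentioned earlier accepts the same kind of set. The following is a minimal sketch under those assumptions; make_cpu_set and bind_calling_thread are hypothetical helper names, not part of any library.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // CPU_* macros (g++ usually predefines this)
#endif
#include <sched.h>
#include <vector>

// Hypothetical helper: build a cpu_set_t from a list of CPU ids.
// Ids outside [0, CPU_SETSIZE) are skipped rather than passed to CPU_SET.
cpu_set_t make_cpu_set(const std::vector<int>& cpu_ids) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int id : cpu_ids) {
        if (id >= 0 && id < CPU_SETSIZE) CPU_SET(id, &set);
    }
    return set;
}

// Hypothetical helper: restrict the calling thread to the given CPUs.
// With sched_setaffinity, pid 0 means "the calling thread".
bool bind_calling_thread(const std::vector<int>& cpu_ids) {
    cpu_set_t set = make_cpu_set(cpu_ids);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

For instance, bind_calling_thread({0, 1}) would let the scheduler move the thread between cores 0 and 1 only. An empty set is rejected by the kernel, so the call reports failure in that case.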

    Usage example:

    #include <iostream>
    #include <thread>
    #include <vector>
    #include <pthread.h>
    #include <unistd.h> // sysconf

    void worker_thread(int id) {
        // Bind this thread to CPU core 'id'
        // (assumes CPU ids 0..N-1 are all online and allowed for this process)
        if (set_cpu_affinity(pthread_self(), id) == 0) {
            std::cout << "Thread " << id << " bound to CPU core " << id << std::endl;
        }

        // Simulate some work; 'volatile' keeps the compiler from deleting the loop
        volatile int sink = 0;
        for (int i = 0; i < 1000000; ++i) {
            sink = sink + 1;
        }
        std::cout << "Thread " << id << " finished." << std::endl;
    }

    int main() {
        int num_cores = static_cast<int>(sysconf(_SC_NPROCESSORS_ONLN));
        std::cout << "Number of available cores: " << num_cores << std::endl;

        // std::vector instead of a variable-length array, which is not standard C++
        std::vector<std::thread> threads;
        for (int i = 0; i < num_cores; ++i) {
            threads.emplace_back(worker_thread, i);
        }

        for (auto& t : threads) {
            t.join();
        }

        return 0;
    }
  • Reading affinity (sched_getaffinity / pthread_getaffinity_np):

    #include <sched.h>
    #include <pthread.h>
    #include <iostream>
    
    int get_cpu_affinity(pthread_t thread) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
    
        int result = pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        if (result != 0) {
            std::cerr << "Failed to get CPU affinity: " << result << std::endl;
            return -1;
        }
    
        std::cout << "Thread affinity: ";
        for (int i = 0; i < CPU_SETSIZE; ++i) {
            if (CPU_ISSET(i, &cpuset)) {
                std::cout << i << " ";
            }
        }
        std::cout << std::endl;
        return 0;
    }
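
    Parameters of pthread_getaffinity_np:

    • pthread_t thread: the thread whose affinity is queried. (sched_getaffinity takes a pid_t instead; passing 0 there means the calling thread.)
    • size_t cpusetsize: the size of cpuset, normally sizeof(cpu_set_t).
    • cpu_set_t *cpuset: a CPU set that receives the cores the thread is allowed to run on.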
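
The pid-based sched_getaffinity variant is handy for finding out which CPUs are actually usable before pinning anything. Below is a minimal sketch; allowed_cpus is a hypothetical helper name, and the fixed-size cpu_set_t assumes a machine with at most CPU_SETSIZE logical CPUs.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // CPU_* macros (g++ usually predefines this)
#endif
#include <sched.h>
#include <vector>

// Hypothetical helper: ids of the CPUs the calling thread may run on,
// in ascending order. Returns an empty vector if the query fails.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    // pid 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return {};

    std::vector<int> ids;
    for (int i = 0; i < CPU_SETSIZE; ++i) {
        if (CPU_ISSET(i, &set)) ids.push_back(i);
    }
    return ids;
}
```

Inside a container or under taskset, this list can be shorter than the number of online CPUs, which is why it is a safer source of core ids than counting from 0.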

    Usage example:

    #include <iostream>
    #include <thread>
    #include <vector>
    #include <pthread.h>
    #include <unistd.h> // sysconf

    void worker_thread(int id) {
        // Bind this thread to CPU core 'id'
        if (set_cpu_affinity(pthread_self(), id) == 0) {
            std::cout << "Thread " << id << " bound to CPU core " << id << std::endl;
        }

        // Get and print the CPU affinity
        get_cpu_affinity(pthread_self());

        // Simulate some work; 'volatile' keeps the compiler from deleting the loop
        volatile int sink = 0;
        for (int i = 0; i < 1000000; ++i) {
            sink = sink + 1;
        }
        std::cout << "Thread " << id << " finished." << std::endl;
    }

    int main() {
        int num_cores = static_cast<int>(sysconf(_SC_NPROCESSORS_ONLN));
        std::cout << "Number of available cores: " << num_cores << std::endl;

        // std::vector instead of a variable-length array, which is not standard C++
        std::vector<std::thread> threads;
        for (int i = 0; i < num_cores; ++i) {
            threads.emplace_back(worker_thread, i);
        }

        for (auto& t : threads) {
            t.join();
        }

        return 0;
    }
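
The Linux examples above have each thread pin itself with pthread_self(). A thread can also be pinned from the outside through std::thread::native_handle(). The sketch below assumes that native_handle() yields a pthread_t, which is the case for libstdc++ and libc++ on Linux; pin_thread and run_pinned are hypothetical names.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // CPU_* macros, sched_getcpu, pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <future>
#include <thread>

// Hypothetical helper: pin an already-created std::thread to one CPU.
bool pin_thread(std::thread& t, int cpu_id) {
    if (cpu_id < 0 || cpu_id >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_id, &set);
    return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) == 0;
}

// Start a worker, pin it from the parent, then let it report where it runs.
// Returns the CPU the worker observed, or -1 if pinning failed.
int run_pinned(int cpu_id) {
    std::promise<void> go;
    std::shared_future<void> start = go.get_future().share();
    std::promise<int> observed;
    std::future<int> result = observed.get_future();

    std::thread worker([start, &observed] {
        start.wait();                        // block until the parent has pinned us
        observed.set_value(sched_getcpu());  // report the CPU we woke up on
    });

    bool pinned = pin_thread(worker, cpu_id);
    go.set_value();                          // release the worker
    int cpu = result.get();
    worker.join();
    return pinned ? cpu : -1;
}
```

Holding the worker back until the parent has finished pinning avoids a window in which it would start its real work on an arbitrary core.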

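2. Setting CPU Affinity on Windows

On Windows, SetThreadAffinityMask sets a thread's CPU affinity. The masks in effect can be read back with GetProcessAffinityMask (process level) and GetThreadGroupAffinity (thread level).

  • The SetThreadAffinityMask function: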

    #include <windows.h>
    #include <iostream>
    #include <thread>
    
    DWORD WINAPI worker_thread(LPVOID lpParam) {
        DWORD_PTR mask = (DWORD_PTR)lpParam;
        if (!SetThreadAffinityMask(GetCurrentThread(), mask)) {
            std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
        } else {
            std::cout << "Thread bound to CPU mask: " << mask << std::endl;
        }
    
        // Simulate some work
        for (int i = 0; i < 1000000; ++i) {
            // Do something
        }
        std::cout << "Thread finished." << std::endl;
        return 0;
    }
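
    Parameters of SetThreadAffinityMask:

    • HANDLE hThread: handle of the thread whose affinity is being set.
    • DWORD_PTR dwThreadAffinityMask: a bit mask specifying the CPU cores the thread may run on. For example, to let the thread run on cores 0 and 1, set dwThreadAffinityMask to 0x03 (binary 00000011). The mask must be a subset of the process affinity mask. On success the function returns the thread's previous mask; on failure it returns 0.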
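
The 0x03 example follows directly from the bit layout: bit n of the mask stands for logical processor n. Here is a small portable sketch of that arithmetic. It uses std::uint64_t as a stand-in for DWORD_PTR on 64-bit Windows, and make_affinity_mask and mask_allows are hypothetical helper names.

```cpp
#include <cstdint>
#include <initializer_list>

// Hypothetical helper: build an affinity bit mask from logical processor indexes.
// Indexes outside [0, 64) are ignored, because a single mask covers at most
// the 64 processors of one processor group.
std::uint64_t make_affinity_mask(std::initializer_list<int> cpus) {
    std::uint64_t mask = 0;
    for (int cpu : cpus) {
        if (cpu >= 0 && cpu < 64) mask |= (std::uint64_t{1} << cpu);
    }
    return mask;
}

// Hypothetical helper: true if the mask allows the given logical processor.
bool mask_allows(std::uint64_t mask, int cpu) {
    return cpu >= 0 && cpu < 64 && ((mask >> cpu) & 1u) != 0;
}
```

On Windows the result would be passed along as static_cast<DWORD_PTR>(make_affinity_mask({0, 1})), which is the same value as the 0x03 in the description above.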

    Usage example:

    #include <iostream>
    #include <thread>
    #include <windows.h>
    
    // worker_thread is the function from the listing above.

    int main() {
        // Get the number of processors
        SYSTEM_INFO sysinfo;
        GetSystemInfo(&sysinfo);
        int num_cores = sysinfo.dwNumberOfProcessors;
    
        std::cout << "Number of available cores: " << num_cores << std::endl;
    
        // Create a thread and bind it to core 0
        HANDLE thread = CreateThread(
            NULL,   // default security attributes
            0,      // use default stack size
            worker_thread, // thread function
            (LPVOID)1,    // argument to thread function (CPU mask)
            0,      // use default creation flags
            NULL);  // returns the thread identifier
    
        if (thread == NULL) {
            std::cerr << "Failed to create thread: " << GetLastError() << std::endl;
            return 1;
        }
    
        WaitForSingleObject(thread, INFINITE);
        CloseHandle(thread);
    
        return 0;
    }
  • Reading the thread affinity mask (the Win32 API has no GetThreadAffinityMask function; use GetThreadGroupAffinity, or the previous mask returned by SetThreadAffinityMask):

    #include <windows.h>
    #include <iostream>

    DWORD WINAPI worker_thread(LPVOID lpParam) {
        DWORD_PTR processMask, systemMask;
        if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask)) {
            std::cerr << "Failed to get process affinity mask: " << GetLastError() << std::endl;
        } else {
            std::cout << "Process Affinity Mask: " << processMask << std::endl;
            std::cout << "System Affinity Mask: " << systemMask << std::endl;
        }

        // On success SetThreadAffinityMask returns the thread's previous mask (0 on failure)
        DWORD_PTR previousMask = SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)lpParam);
        if (!previousMask) {
            std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
        } else {
            std::cout << "Thread bound to CPU mask: " << (DWORD_PTR)lpParam << std::endl;
            std::cout << "Previous Thread Affinity Mask: " << previousMask << std::endl;
        }

        // Read the current thread mask back (Windows 7 or later)
        GROUP_AFFINITY groupAffinity;
        if (!GetThreadGroupAffinity(GetCurrentThread(), &groupAffinity)) {
            std::cerr << "Failed to get thread affinity mask: " << GetLastError() << std::endl;
        } else {
            std::cout << "Thread Affinity Mask: " << groupAffinity.Mask
                      << " (processor group " << groupAffinity.Group << ")" << std::endl;
        }

        // Simulate some work
        for (int i = 0; i < 1000000; ++i) {
            // Do something
        }
        std::cout << "Thread finished." << std::endl;
        return 0;
    }

    Parameters:

    • HANDLE hProcess: handle of the process to query (GetProcessAffinityMask).
    • PDWORD_PTR lpProcessAffinityMask: pointer that receives the process affinity mask.
    • PDWORD_PTR lpSystemAffinityMask: pointer that receives the system affinity mask.
    • PGROUP_AFFINITY GroupAffinity: pointer that receives the thread's processor group and mask (GetThreadGroupAffinity).

    Usage example:

    #include <iostream>
    #include <thread>
    #include <windows.h>
    
    // worker_thread is the function from the listing above.

    int main() {
        // Get the number of processors
        SYSTEM_INFO sysinfo;
        GetSystemInfo(&sysinfo);
        int num_cores = sysinfo.dwNumberOfProcessors;
    
        std::cout << "Number of available cores: " << num_cores << std::endl;
    
        // Create a thread and bind it to core 0
        HANDLE thread = CreateThread(
            NULL,   // default security attributes
            0,      // use default stack size
            worker_thread, // thread function
            (LPVOID)1,    // argument to thread function (CPU mask)
            0,      // use default creation flags
            NULL);  // returns the thread identifier
    
        if (thread == NULL) {
            std::cerr << "Failed to create thread: " << GetLastError() << std::endl;
            return 1;
        }
    
        WaitForSingleObject(thread, INFINITE);
        CloseHandle(thread);
    
        return 0;
    }

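Notes and Best Practices

  • Check which CPUs you can use: before setting affinity, query the number of CPU cores, and preferably the set your process is actually allowed to run on (containers, cgroups, or taskset can shrink it), so that you never request a core outside that range.
  • NUMA: on NUMA systems, think about memory locality and bind each thread to a core close to the memory it uses.
  • Hyper-Threading: Hyper-Threading (SMT) presents one physical core as two logical cores that share its execution units and caches. Learn the system's SMT layout before pinning, and avoid placing two compute-heavy threads on the two logical cores of the same physical core, where they compete for the same resources.
  • Test and verify: after setting affinity, measure. Use a profiler to watch per-thread CPU utilization and cache hit rates and confirm that the change delivers the gain you expected.
  • Don't over-optimize: CPU affinity is no silver bullet. It adds complexity and maintenance cost, and a bad pinning can be slower than letting the scheduler decide. Reach for it only when a real bottleneck has been measured.
  • Avoid hard-coding: do not hard-code CPU core ids. Derive the affinity at run time from what the system actually provides.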
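
For the Hyper-Threading and hard-coding points, Linux publishes the topology in sysfs as CPU lists such as "0-3,8". The sketch below parses that format and reads the SMT siblings of one logical CPU. parse_cpu_list and thread_siblings are hypothetical helper names, and the sysfs path is assumed to be present, which is not guaranteed inside every container or sandbox.

```cpp
#include <cstddef>
#include <exception>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse the kernel's CPU list format, e.g. "0-3,8,10-11".
// Pieces that are not a number or an ascending range are skipped.
std::vector<int> parse_cpu_list(const std::string& text) {
    std::vector<int> cpus;
    std::stringstream ss(text);
    std::string piece;
    while (std::getline(ss, piece, ',')) {
        try {
            std::size_t dash = piece.find('-');
            int first = std::stoi(piece.substr(0, dash));
            int last = (dash == std::string::npos)
                           ? first
                           : std::stoi(piece.substr(dash + 1));
            for (int cpu = first; cpu <= last; ++cpu) cpus.push_back(cpu);
        } catch (const std::exception&) {
            // std::stoi rejected this piece: ignore it
        }
    }
    return cpus;
}

// Hypothetical helper: logical CPUs sharing a physical core with 'cpu'
// (including 'cpu' itself). Empty if the sysfs file cannot be read.
std::vector<int> thread_siblings(int cpu) {
    std::ifstream in("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                     "/topology/thread_siblings_list");
    std::string line;
    if (!std::getline(in, line)) return {};
    return parse_cpu_list(line);
}
```

One possible policy built on this: for every allowed CPU, keep it only if it is the smallest id in its sibling list. That yields one logical CPU per physical core for the compute-heavy threads.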

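When to Set CPU Affinity

In the following scenarios, setting CPU affinity may bring a noticeable performance gain:

  • Multi-threaded, compute-intensive applications: for heavy computation such as image processing or scientific computing, giving each thread its own core keeps its working set in that core's caches and stops threads from piling onto the same core, so the multi-core hardware is used more fully.
  • Latency-sensitive applications: for workloads with real-time requirements such as audio/video processing or games, binding critical threads to specific cores shields them from interference by other threads.
  • NUMA systems: binding a thread to a core close to the memory it uses lowers memory access latency.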

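Summary

CPU affinity is a powerful optimization technique that helps you make fuller use of a multi-core CPU. It is not a cure-all, though, and it has to be tested and verified against your actual workload. I hope today's session helps you understand and apply CPU affinity so that your code runs faster and steadier!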

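One Last Easter Egg: A More Advanced Example

Suppose we have an image-processing application that works on an image in blocks. We split the image into chunks and create several threads, each responsible for one chunk. To improve performance, we bind each thread to a different CPU core.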

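// Must come before any #include so that glibc exposes the CPU_* macros
// and pthread_setaffinity_np (g++ usually predefines _GNU_SOURCE already)
#if !defined(_WIN32) && !defined(_GNU_SOURCE)
#define _GNU_SOURCE
#endif

#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
#include <numeric>

#ifdef _WIN32
#include <windows.h>
#else
#include <sched.h>
#include <pthread.h>
#include <unistd.h>
#endif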

// Function to set CPU affinity for a thread
#ifdef _WIN32
bool set_thread_affinity(HANDLE thread, DWORD_PTR mask) {
    if (!SetThreadAffinityMask(thread, mask)) {
        std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
        return false;
    }
    return true;
}
#else
bool set_thread_affinity(pthread_t thread, int cpu_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_id, &cpuset);

    int result = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
    if (result != 0) {
        std::cerr << "Failed to set CPU affinity: " << result << std::endl;
        return false;
    }
    return true;
}
#endif

// Function to simulate image processing
void process_image_chunk(int chunk_id, int num_chunks, int image_width, int image_height, std::vector<int>& image_data) {
    // Determine the start and end rows for this chunk
    int start_row = (chunk_id * image_height) / num_chunks;
    int end_row = ((chunk_id + 1) * image_height) / num_chunks;

    // Simulate some processing
    for (int row = start_row; row < end_row; ++row) {
        for (int col = 0; col < image_width; ++col) {
            image_data[row * image_width + col] += chunk_id; // Just add the chunk_id to each pixel
        }
    }

    std::cout << "Chunk " << chunk_id << " processed." << std::endl;
}

int main() {
    // Image dimensions
    int image_width = 2048;
    int image_height = 2048;

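    // Number of chunks and threads; hardware_concurrency() may return 0 when unknown
    unsigned hw_threads = std::thread::hardware_concurrency();
    int num_chunks = hw_threads > 0 ? static_cast<int>(hw_threads) : 1;
    std::cout << "Using " << num_chunks << " threads." << std::endl;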

    // Image data
    std::vector<int> image_data(image_width * image_height, 0);

    // Start time
    auto start_time = std::chrono::high_resolution_clock::now();

    // Create and run threads
    std::vector<std::thread> threads;
    for (int i = 0; i < num_chunks; ++i) {
        threads.emplace_back([i, num_chunks, image_width, image_height, &image_data]() {
            // Set CPU affinity
#ifdef _WIN32
            HANDLE current_thread = GetCurrentThread();
            set_thread_affinity(current_thread, 1ULL << i); // Bind to core i
#else
            set_thread_affinity(pthread_self(), i); // Bind to core i
#endif

            // Process the image chunk
            process_image_chunk(i, num_chunks, image_width, image_height, image_data);
        });
    }

    // Join threads
    for (auto& thread : threads) {
        thread.join();
    }

    // End time
    auto end_time = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

    std::cout << "Image processing took " << duration.count() << " milliseconds." << std::endl;

    // Verify result (optional): every pixel in chunk k was incremented by k, so the
    // expected sum is k * width * rows(k), using the same row split as the workers
    // (this stays exact when image_height is not divisible by num_chunks).
    long long actual_sum = std::accumulate(image_data.begin(), image_data.end(), 0LL);

    long long expected_sum = 0;
    for (int chunk_id = 0; chunk_id < num_chunks; ++chunk_id) {
      int start_row = (chunk_id * image_height) / num_chunks;
      int end_row = ((chunk_id + 1) * image_height) / num_chunks;
      expected_sum += (long long)chunk_id * image_width * (end_row - start_row);
    }
    if (actual_sum == expected_sum){
      std::cout << "Verification successful." << std::endl;
    } else {
      std::cout << "Verification failed. Expected sum: " << expected_sum << ", Actual sum: " << actual_sum << std::endl;
    }

    return 0;
}

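This example shows how to bind several threads to different CPU cores and have them process different parts of an image in parallel. It puts the parallel computing power of a multi-core CPU to work; whether pinning beats the default scheduler on your machine is something to measure.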
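
The row split used in process_image_chunk is worth a closer look, because integer division makes the chunks slightly uneven when the height is not a multiple of the chunk count. The following sketch isolates that arithmetic; chunk_rows and covered_rows are hypothetical helper names that mirror the formulas in the listing above.

```cpp
#include <utility>

// Same arithmetic as process_image_chunk: the half-open row range
// [first, second) assigned to one chunk.
std::pair<int, int> chunk_rows(int chunk_id, int num_chunks, int image_height) {
    int start_row = (chunk_id * image_height) / num_chunks;
    int end_row = ((chunk_id + 1) * image_height) / num_chunks;
    return {start_row, end_row};
}

// Total number of rows covered by all chunks together.
int covered_rows(int num_chunks, int image_height) {
    int total = 0;
    for (int i = 0; i < num_chunks; ++i) {
        std::pair<int, int> rows = chunk_rows(i, num_chunks, image_height);
        total += rows.second - rows.first;
    }
    return total;
}
```

Because each chunk ends exactly where the next one begins, and the last one ends at image_height, every row is processed once regardless of rounding. With 10 rows and 3 chunks, for example, the ranges are [0, 3), [3, 6) and [6, 10).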

Thanks for watching, and I hope you got something out of it! Class dismissed!
