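All right, dear viewers and fellow code monkeys, welcome to today's special lecture: "Pin Your Threads to CPUs and Watch Performance Soar"! I'm your old friend, the performance optimization expert whose job is to help you make your code faster and stronger. Today we'll talk about how to set CPU affinity in C++, so that your threads and CPU cores become a perfect match and you can squeeze the last drop of performance out of your hardware.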
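Opening: Do You Really Know Your CPU?
Before we start, let me ask you a question: do you really know your CPU? Don't rush to answer "Of course I do: this many cores, this many threads, this much turbo boost." What I'm asking is whether you know how your program actually runs on the CPU.
Imagine the threads of your program as a flock of hungry chicks, and the CPU cores as hardworking hens. By default, the operating system's scheduler decides which hen feeds which chick, and it may move a chick from one hen to another at any time. Everyone gets fed in the end, but the efficiency is another matter.
CPU affinity lets us specify which chicks may be fed only by which hens. The chicks no longer run all over the place, the hens no longer keep switching broods, a lot of effort is saved, and performance goes up.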
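What Is CPU Affinity?
Simply put, CPU affinity means binding a process or thread to one or more specific CPU cores. This reduces how often the thread migrates between cores and improves the cache hit rate, which can improve performance.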
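Why Set CPU Affinity?
- Less migration overhead: When the scheduler moves a thread to a different core, the thread has to warm up that core's caches and TLB all over again, which wastes CPU time. Binding the thread avoids or reduces these migrations. (It does not by itself remove context switches between threads that share the same core.)
- Higher cache hit rate: A thread that keeps running on the same core can reuse the data already in that core's caches, so it goes to main memory less often.
- NUMA optimization: On NUMA (Non-Uniform Memory Access) systems, memory access latency depends on which CPU core touches which memory. Binding a thread to a core close to the memory it uses lowers that latency.
- Less interference: Sometimes you want to bind critical threads to specific cores so that other threads do not disturb them and their performance stays predictable.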
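How to Set CPU Affinity in C++
The C++ standard library has no affinity API, so setting CPU affinity in C++ relies on the APIs of the operating system. Each operating system has its own; below we cover Linux and Windows in turn.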
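1. Setting CPU Affinity on Linux
On Linux, sched_setaffinity and sched_getaffinity set and query affinity by thread or process ID. Their pthread counterparts, pthread_setaffinity_np and pthread_getaffinity_np, take a pthread_t instead; the examples below use the pthread variants.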
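- Setting affinity (pthread_setaffinity_np, the pthread counterpart of sched_setaffinity):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE // For CPU_SET, CPU_ZERO; must come before any #include
    #endif
    #include <sched.h>
    #include <pthread.h>
    #include <iostream>
    #include <unistd.h> // Required for sysconf

    // Binds `thread` to the single core `cpu_id`; returns 0 on success, -1 on failure.
    int set_cpu_affinity(pthread_t thread, int cpu_id) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(cpu_id, &cpuset);
        // pthread_setaffinity_np returns an error number directly instead of setting errno
        int result = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        if (result != 0) {
            std::cerr << "Failed to set CPU affinity: " << result << std::endl;
            return -1;
        }
        return 0;
    }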
- pthread_t thread: the thread whose affinity is being set.
- size_t cpusetsize: the size of cpuset, usually sizeof(cpu_set_t).
- cpu_set_t *cpuset: a CPU set that specifies which cores the thread may run on.
Usage example:

    #include <iostream>
    #include <thread>
    #include <vector>
    #include <chrono>
    // Assumes set_cpu_affinity() and its includes from the listing above are in the same file.

    void worker_thread(int id) {
        // Bind this thread to CPU core 'id'
        if (set_cpu_affinity(pthread_self(), id) == 0) {
            std::cout << "Thread " << id << " bound to CPU core " << id << std::endl;
        }
        // Simulate some work
        for (int i = 0; i < 1000000; ++i) {
            // Do something
        }
        std::cout << "Thread " << id << " finished." << std::endl;
    }

    int main() {
        int num_cores = static_cast<int>(sysconf(_SC_NPROCESSORS_ONLN));
        std::cout << "Number of available cores: " << num_cores << std::endl;
        // std::vector replaces the original variable-length array, which is not standard C++
        std::vector<std::thread> threads;
        for (int i = 0; i < num_cores; ++i) {
            threads.emplace_back(worker_thread, i);
        }
        for (auto& t : threads) {
            t.join();
        }
        return 0;
    }
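- Querying affinity (pthread_getaffinity_np, the pthread counterpart of sched_getaffinity):

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE // For CPU_ZERO, CPU_ISSET; must come before any #include
    #endif
    #include <sched.h>
    #include <pthread.h>
    #include <iostream>

    // Prints the cores `thread` may run on; returns 0 on success, -1 on failure.
    int get_cpu_affinity(pthread_t thread) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        int result = pthread_getaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        if (result != 0) {
            std::cerr << "Failed to get CPU affinity: " << result << std::endl;
            return -1;
        }
        std::cout << "Thread affinity: ";
        for (int i = 0; i < CPU_SETSIZE; ++i) {
            if (CPU_ISSET(i, &cpuset)) {
                std::cout << i << " ";
            }
        }
        std::cout << std::endl;
        return 0;
    }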
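- pthread_t thread: the thread whose affinity is being queried. (sched_getaffinity takes a pid_t pid here instead; passing 0 means the calling thread.)
- size_t cpusetsize: the size of cpuset, usually sizeof(cpu_set_t).
- cpu_set_t *cpuset: a CPU set that receives the cores the thread may run on.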
Usage example:

    #include <iostream>
    #include <thread>
    #include <vector>
    #include <chrono>
    // Assumes set_cpu_affinity() and get_cpu_affinity() from the listings above are in the same file.

    void worker_thread(int id) {
        // Bind this thread to CPU core 'id'
        if (set_cpu_affinity(pthread_self(), id) == 0) {
            std::cout << "Thread " << id << " bound to CPU core " << id << std::endl;
        }
        // Get and print the CPU affinity
        get_cpu_affinity(pthread_self());
        // Simulate some work
        for (int i = 0; i < 1000000; ++i) {
            // Do something
        }
        std::cout << "Thread " << id << " finished." << std::endl;
    }

    int main() {
        int num_cores = static_cast<int>(sysconf(_SC_NPROCESSORS_ONLN));
        std::cout << "Number of available cores: " << num_cores << std::endl;
        // std::vector replaces the original variable-length array, which is not standard C++
        std::vector<std::thread> threads;
        for (int i = 0; i < num_cores; ++i) {
            threads.emplace_back(worker_thread, i);
        }
        for (auto& t : threads) {
            t.join();
        }
        return 0;
    }
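The pid-based calls named at the start of this section, sched_getaffinity and sched_setaffinity, work in much the same way. The following is only a minimal sketch, assuming Linux with glibc; the helper names allowed_cpus and pin_self_to are made up for this illustration and are not part of any library.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes the CPU_* macros; must precede every #include
#endif
#include <sched.h>
#include <vector>

// Hypothetical helper: returns the CPU ids the calling thread may run on
// (empty on failure). A pid of 0 means "the calling thread".
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return cpus;
    for (int i = 0; i < CPU_SETSIZE; ++i) {
        if (CPU_ISSET(i, &set)) cpus.push_back(i);
    }
    return cpus;
}

// Hypothetical helper: restricts the calling thread to one CPU.
// Returns true on success, false for an invalid id or a failed system call.
bool pin_self_to(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

A typical use would be to call allowed_cpus() once at startup and pass one of the returned ids to pin_self_to() from inside each worker, rather than assuming that core 0 is always available.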
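2. Setting CPU Affinity on Windows
On Windows, SetThreadAffinityMask sets a thread's CPU affinity. The Win32 API has no GetThreadAffinityMask function; to query affinity, use GetProcessAffinityMask for the process and GetThreadGroupAffinity for a single thread. SetThreadAffinityMask also returns the thread's previous mask.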
- SetThreadAffinityMask function:

    #include <windows.h>
    #include <iostream>

    // Thread entry point: lpParam carries the affinity mask to apply.
    DWORD WINAPI worker_thread(LPVOID lpParam) {
        DWORD_PTR mask = (DWORD_PTR)lpParam;
        // SetThreadAffinityMask returns the previous mask on success and 0 on failure
        if (!SetThreadAffinityMask(GetCurrentThread(), mask)) {
            std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
        } else {
            std::cout << "Thread bound to CPU mask: " << mask << std::endl;
        }
        // Simulate some work
        for (int i = 0; i < 1000000; ++i) {
            // Do something
        }
        std::cout << "Thread finished." << std::endl;
        return 0;
    }
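- HANDLE hThread: handle of the thread whose affinity is being set.
- DWORD_PTR dwThreadAffinityMask: a bit mask that specifies which cores the thread may run on; it must be a subset of the process affinity mask. For example, to let the thread run on cores 0 and 1, set dwThreadAffinityMask to 0x03 (binary 00000011).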
Usage example:

    #include <iostream>
    #include <windows.h>

    // Thread entry point: lpParam carries the affinity mask to apply.
    DWORD WINAPI worker_thread(LPVOID lpParam) {
        DWORD_PTR mask = (DWORD_PTR)lpParam;
        if (!SetThreadAffinityMask(GetCurrentThread(), mask)) {
            std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
        } else {
            std::cout << "Thread bound to CPU mask: " << mask << std::endl;
        }
        // Simulate some work
        for (int i = 0; i < 1000000; ++i) {
            // Do something
        }
        std::cout << "Thread finished." << std::endl;
        return 0;
    }

    int main() {
        // Get the number of processors
        SYSTEM_INFO sysinfo;
        GetSystemInfo(&sysinfo);
        int num_cores = static_cast<int>(sysinfo.dwNumberOfProcessors);
        std::cout << "Number of available cores: " << num_cores << std::endl;
        // Create a thread and bind it to core 0 (mask 0x1)
        HANDLE thread = CreateThread(
            NULL,           // default security attributes
            0,              // use default stack size
            worker_thread,  // thread function
            (LPVOID)1,      // argument to thread function (CPU mask)
            0,              // use default creation flags
            NULL);          // thread identifier not needed
        if (thread == NULL) {
            std::cerr << "Failed to create thread: " << GetLastError() << std::endl;
            return 1;
        }
        WaitForSingleObject(thread, INFINITE);
        CloseHandle(thread);
        return 0;
    }
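The 0x03 mask mentioned above is simply bits 0 and 1 set. As a small illustration of how such a mask can be built from a list of core indices, here is a portable sketch. The helper names make_affinity_mask and mask_contains are made up for this example, and std::uint64_t merely stands in for DWORD_PTR on 64-bit Windows, which is an assumption of the sketch.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: sets bit i for every core index i in `cores`.
// std::uint64_t stands in for DWORD_PTR on 64-bit Windows (assumption).
std::uint64_t make_affinity_mask(const std::vector<int>& cores) {
    std::uint64_t mask = 0;
    for (int core : cores) {
        if (core < 0 || core >= 64) {
            throw std::out_of_range("core index must be in [0, 63]");
        }
        mask |= std::uint64_t{1} << core;
    }
    return mask;
}

// Hypothetical helper: true if `mask` allows the given core.
bool mask_contains(std::uint64_t mask, int core) {
    return core >= 0 && core < 64 && ((mask >> core) & 1u) != 0;
}
```

With these helpers, make_affinity_mask({0, 1}) yields 0x03, and the result could be cast to DWORD_PTR before being handed to SetThreadAffinityMask. Machines with more than 64 logical processors use processor groups, which this sketch does not cover.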
- Querying affinity (GetProcessAffinityMask and GetThreadGroupAffinity):

    #include <windows.h>
    #include <iostream>

    DWORD WINAPI worker_thread(LPVOID lpParam) {
        DWORD_PTR processMask, systemMask;
        if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask)) {
            std::cerr << "Failed to get process affinity mask: " << GetLastError() << std::endl;
        } else {
            std::cout << "Process Affinity Mask: " << processMask << std::endl;
            std::cout << "System Affinity Mask: " << systemMask << std::endl;
        }
        // On success SetThreadAffinityMask returns the thread's previous mask
        DWORD_PTR previousMask = SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)lpParam);
        if (!previousMask) {
            std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
        } else {
            std::cout << "Thread bound to CPU mask: " << (DWORD_PTR)lpParam << std::endl;
        }
        // Read the thread's current affinity back (requires Windows 7 or later)
        GROUP_AFFINITY threadAffinity;
        if (!GetThreadGroupAffinity(GetCurrentThread(), &threadAffinity)) {
            std::cerr << "Failed to get thread affinity: " << GetLastError() << std::endl;
        } else {
            std::cout << "Thread Affinity Mask: " << threadAffinity.Mask
                      << " (processor group " << threadAffinity.Group << ")" << std::endl;
        }
        // Simulate some work
        for (int i = 0; i < 1000000; ++i) {
            // Do something
        }
        std::cout << "Thread finished." << std::endl;
        return 0;
    }
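GetProcessAffinityMask takes the following parameters:
- HANDLE hProcess: handle of the process to query.
- PDWORD_PTR lpProcessAffinityMask: a pointer that receives the affinity mask of the process.
- PDWORD_PTR lpSystemAffinityMask: a pointer that receives the affinity mask of the system.
GetThreadGroupAffinity takes the thread handle (HANDLE hThread) and a pointer to a GROUP_AFFINITY structure that receives the thread's processor group and mask.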
Usage example:

    #include <iostream>
    #include <windows.h>

    DWORD WINAPI worker_thread(LPVOID lpParam) {
        DWORD_PTR processMask, systemMask;
        if (!GetProcessAffinityMask(GetCurrentProcess(), &processMask, &systemMask)) {
            std::cerr << "Failed to get process affinity mask: " << GetLastError() << std::endl;
        } else {
            std::cout << "Process Affinity Mask: " << processMask << std::endl;
            std::cout << "System Affinity Mask: " << systemMask << std::endl;
        }
        // On success SetThreadAffinityMask returns the thread's previous mask
        DWORD_PTR previousMask = SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)lpParam);
        if (!previousMask) {
            std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
        } else {
            std::cout << "Thread bound to CPU mask: " << (DWORD_PTR)lpParam << std::endl;
        }
        // Read the thread's current affinity back (requires Windows 7 or later)
        GROUP_AFFINITY threadAffinity;
        if (!GetThreadGroupAffinity(GetCurrentThread(), &threadAffinity)) {
            std::cerr << "Failed to get thread affinity: " << GetLastError() << std::endl;
        } else {
            std::cout << "Thread Affinity Mask: " << threadAffinity.Mask
                      << " (processor group " << threadAffinity.Group << ")" << std::endl;
        }
        // Simulate some work
        for (int i = 0; i < 1000000; ++i) {
            // Do something
        }
        std::cout << "Thread finished." << std::endl;
        return 0;
    }

    int main() {
        // Get the number of processors
        SYSTEM_INFO sysinfo;
        GetSystemInfo(&sysinfo);
        int num_cores = static_cast<int>(sysinfo.dwNumberOfProcessors);
        std::cout << "Number of available cores: " << num_cores << std::endl;
        // Create a thread and bind it to core 0 (mask 0x1)
        HANDLE thread = CreateThread(
            NULL,           // default security attributes
            0,              // use default stack size
            worker_thread,  // thread function
            (LPVOID)1,      // argument to thread function (CPU mask)
            0,              // use default creation flags
            NULL);          // thread identifier not needed
        if (thread == NULL) {
            std::cerr << "Failed to create thread: " << GetLastError() << std::endl;
            return 1;
        }
        WaitForSingleObject(thread, INFINITE);
        CloseHandle(thread);
        return 0;
    }
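Notes and Best Practices
- Get the number of CPU cores: Before setting affinity, always query how many logical CPUs the system has, so that you never request a core that is out of range.
- NUMA architecture: On NUMA systems, consider the locality of each thread's memory accesses and try to bind the thread to a core close to the memory it uses.
- Hyper-threading: Hyper-threading presents one physical core as two logical cores that share execution resources. Know your system's hyper-threading layout when setting affinity, and avoid binding two compute-heavy threads to the two logical cores of the same physical core, because they will compete with each other and slow down.
- Test and verify: After setting affinity, always measure to confirm that it delivers the expected gain. Use a profiler to monitor per-thread CPU utilization and cache hit rates.
- Don't over-optimize: CPU affinity is not a cure-all, and overusing it makes code more complex and more expensive to maintain. Reach for it only when a real performance bottleneck exists.
- Avoid hardcoding: Do not hardcode CPU core IDs. Set affinity dynamically, based on the actual system the program is running on.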
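For the hyper-threading point, Linux exposes the sibling relationship through sysfs: as far as I know, /sys/devices/system/cpu/cpuN/topology/thread_siblings_list lists the logical CPUs that share a physical core with cpuN, in a format such as "0,4" or "0-1". The sketch below parses that list format; the function names parse_cpu_list and thread_siblings are made up for this illustration, and error handling for malformed input is deliberately left out.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parses a Linux "cpulist" string such as "0-3,8"
// into {0, 1, 2, 3, 8}. std::stoi throws on malformed items.
std::vector<int> parse_cpu_list(const std::string& text) {
    std::vector<int> cpus;
    std::istringstream in(text);
    std::string item;
    while (std::getline(in, item, ',')) {
        if (item.empty()) continue;
        std::size_t dash = item.find('-');
        int first = std::stoi(item.substr(0, dash));
        int last = (dash == std::string::npos) ? first
                                               : std::stoi(item.substr(dash + 1));
        for (int cpu = first; cpu <= last; ++cpu) cpus.push_back(cpu);
    }
    return cpus;
}

// Hypothetical helper: logical CPUs sharing a physical core with `cpu`,
// or an empty vector if the sysfs file is missing or unreadable.
std::vector<int> thread_siblings(int cpu) {
    std::ifstream file("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                       "/topology/thread_siblings_list");
    std::string line;
    if (!file || !std::getline(file, line)) return {};
    return parse_cpu_list(line);
}
```

One possible policy built on top of this is to keep only the first entry of each sibling list when choosing cores for compute-heavy threads, so that no two of them land on the same physical core.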
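When to Set CPU Affinity
In the following scenarios, setting CPU affinity may bring a significant performance gain:
- Multithreaded, compute-intensive applications: For multithreaded workloads that do heavy computation, such as image processing or scientific computing, binding different threads to different CPU cores makes full use of the parallelism of a multi-core CPU.
- Applications with strict real-time requirements: For latency-sensitive applications such as audio and video processing or games, binding critical threads to specific cores shields them from interference by other threads and keeps their performance steady.
- NUMA systems: On NUMA systems, binding a thread to a core close to the memory it uses lowers memory access latency and improves performance.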
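When spreading threads over cores, keep in mind that the cores a process may use are not necessarily numbered 0 to N-1, for example inside a container or under taskset. A rough sketch of one way to plan the mapping is shown below: it distributes workers round-robin over whatever list of allowed cores it is given. The function name plan_cpu_assignment is hypothetical, and the allowed list is assumed to come from an operating system query such as sched_getaffinity or GetProcessAffinityMask.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: maps worker i to a CPU id picked round-robin from
// `allowed`. Returns an empty plan if there are no allowed CPUs or no workers.
std::vector<int> plan_cpu_assignment(const std::vector<int>& allowed, int workers) {
    std::vector<int> plan;
    if (allowed.empty() || workers <= 0) return plan;
    plan.reserve(static_cast<std::size_t>(workers));
    for (int i = 0; i < workers; ++i) {
        // Wrap around when there are more workers than allowed CPUs
        plan.push_back(allowed[static_cast<std::size_t>(i) % allowed.size()]);
    }
    return plan;
}
```

With allowed cores {2, 3, 6} and five workers, the plan is {2, 3, 6, 2, 3}. Worker i would then bind itself to plan[i] instead of to core i.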
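Summary
CPU affinity is a powerful performance optimization technique that helps you make full use of the parallelism of a multi-core CPU. It is not a cure-all, though, and its benefit has to be tested and verified for each workload. I hope today's lecture helps you understand and apply CPU affinity, so that your code runs faster and steadier!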
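One Last Bonus: A More Advanced Example
Suppose we have an image processing application that works on an image in blocks. We can split the image into several chunks and create several threads, each responsible for one chunk. To improve performance, we bind each thread to a different CPU core.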
    #ifndef _WIN32
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE // Must be defined before any #include to expose CPU_SET and friends
    #endif
    #endif
    #include <iostream>
    #include <thread>
    #include <vector>
    #include <chrono>
    #include <numeric>
    #ifdef _WIN32
    #include <windows.h>
    #else
    #include <sched.h>
    #include <pthread.h>
    #include <unistd.h>
    #endif

    // Function to set CPU affinity for a thread
    #ifdef _WIN32
    bool set_thread_affinity(HANDLE thread, DWORD_PTR mask) {
        if (!SetThreadAffinityMask(thread, mask)) {
            std::cerr << "Failed to set CPU affinity: " << GetLastError() << std::endl;
            return false;
        }
        return true;
    }
    #else
    bool set_thread_affinity(pthread_t thread, int cpu_id) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(cpu_id, &cpuset);
        int result = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        if (result != 0) {
            std::cerr << "Failed to set CPU affinity: " << result << std::endl;
            return false;
        }
        return true;
    }
    #endif

    // Function to simulate image processing
    void process_image_chunk(int chunk_id, int num_chunks, int image_width, int image_height, std::vector<int>& image_data) {
        // Determine the start and end rows for this chunk
        int start_row = (chunk_id * image_height) / num_chunks;
        int end_row = ((chunk_id + 1) * image_height) / num_chunks;
        // Simulate some processing; each chunk writes to its own rows, so no locking is needed
        for (int row = start_row; row < end_row; ++row) {
            for (int col = 0; col < image_width; ++col) {
                image_data[row * image_width + col] += chunk_id; // Just add the chunk_id to each pixel
            }
        }
        std::cout << "Chunk " << chunk_id << " processed." << std::endl;
    }

    int main() {
        // Image dimensions
        int image_width = 2048;
        int image_height = 2048;
        // Number of chunks and threads; hardware_concurrency() may return 0 if unknown
        int num_chunks = static_cast<int>(std::thread::hardware_concurrency());
        if (num_chunks <= 0) {
            num_chunks = 1;
        }
        std::cout << "Using " << num_chunks << " threads." << std::endl;
        // Image data
        std::vector<int> image_data(image_width * image_height, 0);
        // Start time
        auto start_time = std::chrono::high_resolution_clock::now();
        // Create and run threads
        std::vector<std::thread> threads;
        for (int i = 0; i < num_chunks; ++i) {
            threads.emplace_back([i, num_chunks, image_width, image_height, &image_data]() {
                // Set CPU affinity
    #ifdef _WIN32
                // Bind to core i; assumes i is smaller than the bit width of DWORD_PTR
                set_thread_affinity(GetCurrentThread(), static_cast<DWORD_PTR>(1) << i);
    #else
                set_thread_affinity(pthread_self(), i); // Bind to core i
    #endif
                // Process the image chunk
                process_image_chunk(i, num_chunks, image_width, image_height, image_data);
            });
        }
        // Join threads
        for (auto& thread : threads) {
            thread.join();
        }
        // End time
        auto end_time = std::chrono::high_resolution_clock::now();
        auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
        std::cout << "Image processing took " << duration.count() << " milliseconds." << std::endl;
        // Verify result (optional)
        long long actual_sum = std::accumulate(image_data.begin(), image_data.end(), 0LL);
        // Compute the expected sum chunk by chunk, so uneven row splits are accounted for
        long long expected_sum = 0;
        for (int chunk_id = 0; chunk_id < num_chunks; ++chunk_id) {
            int start_row = (chunk_id * image_height) / num_chunks;
            int end_row = ((chunk_id + 1) * image_height) / num_chunks;
            expected_sum += (long long)chunk_id * image_width * (end_row - start_row);
        }
        if (actual_sum == expected_sum) {
            std::cout << "Verification successful." << std::endl;
        } else {
            std::cout << "Verification failed. Expected sum: " << expected_sum << ", Actual sum: " << actual_sum << std::endl;
        }
        return 0;
    }
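This example shows how to bind multiple threads to different CPU cores and have them process different parts of an image in parallel. Done this way, the work makes full use of the parallelism of a multi-core CPU and the image is processed faster.
Thanks for watching, and I hope you got something out of it! Class dismissed!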
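In the example above, every thread pins itself. A possible variant, sketched here under the assumption of Linux, where std::thread::native_handle() is a pthread_t in libstdc++ and libc++, is to pin the thread from the code that created it, before the work is released. The names pin_thread and run_pinned are invented for this sketch, and it is meant to be built with something like g++ -std=c++17 -pthread.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes CPU_* macros and pthread_*affinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <atomic>
#include <thread>

// Hypothetical helper: pins an already-started std::thread to one CPU.
// Assumes native_handle() is a pthread_t, which holds on Linux.
bool pin_thread(std::thread& t, int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t.native_handle(), sizeof(set), &set) == 0;
}

// Hypothetical helper: starts a worker, pins it from the outside, releases
// it, and reports whether the worker saw exactly `cpu` in its own affinity.
bool run_pinned(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    std::atomic<bool> go{false};
    bool verified = false;
    std::thread worker([&] {
        // Wait until the creator has finished pinning this thread
        while (!go.load(std::memory_order_acquire)) std::this_thread::yield();
        cpu_set_t now;
        CPU_ZERO(&now);
        if (pthread_getaffinity_np(pthread_self(), sizeof(now), &now) == 0) {
            verified = CPU_COUNT(&now) == 1 && CPU_ISSET(cpu, &now);
        }
    });
    bool pinned = pin_thread(worker, cpu);
    go.store(true, std::memory_order_release);
    worker.join();  // join also makes the write to `verified` visible here
    return pinned && verified;
}
```

Pinning from the outside keeps the affinity policy in one place, at the cost of a small start-up handshake; whether that is worth it depends on the application.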