C++ `std::thread::hardware_concurrency()`: Getting the Number of Hardware Threads

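Alright, everyone, welcome to today's casual chat about C++ threads! Today's topic is an important little fellow that often gets overlooked: std::thread::hardware_concurrency(). Think of it as a heartbeat monitor for your CPU: it tells you how many "little workers" your machine can put to work at the same time.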

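Opening: How Many "Little Workers" Live in Your Computer?

Imagine your computer is a busy factory and the CPU is the boss. The boss has a crew of workers, and each worker can handle one task at a time. std::thread::hardware_concurrency() is the function that counts how many workers the factory actually has.

Why bother counting? Because the core idea of multithreaded programming is to split a big job into many small ones and hand them to those workers to run in parallel. If you hand out tasks blindly without knowing how many workers there are, they may end up fighting over resources, which lowers efficiency and can throw the whole factory into chaos.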

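std::thread::hardware_concurrency(): Enter the Star!

Enough small talk. Please welcome today's star: std::thread::hardware_concurrency().

The function is very simple: it is a static member of std::thread, takes no arguments, and returns an unsigned int that estimates how many threads the hardware can run concurrently. On typical machines this equals the number of logical processors, that is, the number of physical cores multiplied by the number of hardware threads per core (2 when hyper-threading is enabled).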

#include <iostream>
#include <thread>

int main() {
  unsigned int n = std::thread::hardware_concurrency();
  std::cout << "This machine supports " << n << " concurrent threads.\n";
  return 0;
}

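Run this code and you will see how many concurrent threads your machine reports.

Caveats: This Fellow Is Not Always Reliable!

Convenient as std::thread::hardware_concurrency() looks, it is not a cure-all. Keep the following in mind:

It may return 0: If the value cannot be determined, the function returns 0. In that case you have to find the core count some other way, or play it safe and use a default value.
It is only a hint: The return value merely suggests how many threads the hardware may be able to run concurrently. Real performance also depends on memory bandwidth, cache size, I/O speed, and more. Do not rely on this number alone; measure.
Virtual machines: Inside a virtual machine, the return value usually reflects the virtual CPUs assigned to the guest, not the real core count of the host.
Dynamic changes: On some systems the number of available cores can change at run time, for example when cores are taken offline to save power or to cool down. Do not assume the return value stays the same forever.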
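To make the "only a hint" caveat a bit more concrete, here is a minimal sketch that asks Linux how many CPUs the current thread is actually allowed to run on, and falls back to std::thread::hardware_concurrency() everywhere else. The helper name usable_cpu_count is made up for this example, the Linux branch relies on the glibc sched_getaffinity and CPU_COUNT interfaces, and the fallback value of 1 is an arbitrary assumption. Whether the two numbers differ on your machine depends on how the process was launched.

```cpp
#include <cassert>
#include <thread>
#ifdef __linux__
#include <sched.h>  // sched_getaffinity, cpu_set_t, CPU_COUNT (glibc)
#endif

// Hypothetical helper: number of logical CPUs the calling thread may run on.
// On Linux it consults the affinity mask; elsewhere it falls back to
// hardware_concurrency(), and to an assumed default of 1 if that reports 0.
unsigned int usable_cpu_count() {
#ifdef __linux__
  cpu_set_t set;
  CPU_ZERO(&set);
  // pid 0 means "the calling thread".
  if (sched_getaffinity(0, sizeof(set), &set) == 0) {
    const int n = CPU_COUNT(&set);
    assert(n <= CPU_SETSIZE);
    if (n > 0) return static_cast<unsigned int>(n);
  }
#endif
  const unsigned int hw = std::thread::hardware_concurrency();
  return hw != 0 ? hw : 1;
}
```

Calling usable_cpu_count() next to std::thread::hardware_concurrency() and printing both is a quick way to see whether the process has been restricted to a subset of the machine's CPUs.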

Code Example: Handling the Return Value Gracefully

To cope with std::thread::hardware_concurrency() possibly returning 0, we can wrap it in a more robust function:

#include <iostream>
#include <thread>
#include <algorithm> // std::max

unsigned int get_max_threads() {
  unsigned int n = std::thread::hardware_concurrency();
  // If the call reports 0, fall back to a default such as 2 or 4
  return n == 0 ? 4 : n; // alternative: std::max(n, 2u), which enforces a minimum of 2
}

int main() {
  unsigned int max_threads = get_max_threads();
  std::cout << "The maximum number of threads to use is: " << max_threads << std::endl;
  return 0;
}

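In this example, if std::thread::hardware_concurrency() returns 0, we use a default of 4. Pick whatever default suits your application. Alternatively, std::max(n, 2u) guarantees a minimum of 2; note the unsigned literal 2u, since std::max(n, 2) does not compile when n is an unsigned int, and note that this variant also turns a reported 1 into 2.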

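Use Cases: Make Your Program Fly!

So where should we use std::thread::hardware_concurrency()?

Thread pool size: When creating a thread pool, std::thread::hardware_concurrency() is a natural starting point for its size. For CPU-bound work, a pool larger than the hardware concurrency mostly adds context-switching overhead and can lower performance; a pool whose threads spend most of their time waiting on I/O may reasonably be larger.
Parallel computation: For large-scale parallel computation, use std::thread::hardware_concurrency() to decide how many threads to create for the work.
Task decomposition: When splitting a big task into smaller ones, use std::thread::hardware_concurrency() to decide how many pieces to split it into.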
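The task decomposition idea can be sketched as two small helpers: one that caps the thread count by the amount of work available, and one that cuts an index range into nearly equal pieces. The names choose_thread_count and split_range, the min_items_per_thread parameter, and the fallback of 2 threads are all assumptions made for this illustration, not part of the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical helper: pick a worker count for item_count items so that no
// thread gets fewer than min_items_per_thread items (except for tiny inputs).
unsigned int choose_thread_count(std::size_t item_count,
                                 std::size_t min_items_per_thread,
                                 unsigned int hardware_threads) {
  assert(min_items_per_thread > 0);
  if (item_count == 0) return 1;
  // Upper bound imposed by the amount of work, rounded up.
  const std::size_t max_by_work =
      (item_count + min_items_per_thread - 1) / min_items_per_thread;
  // hardware_concurrency() may report 0; fall back to an assumed default of 2.
  const std::size_t hw = hardware_threads != 0 ? hardware_threads : 2;
  return static_cast<unsigned int>(std::min(hw, max_by_work));
}

// Split [0, item_count) into `parts` contiguous half-open ranges whose sizes
// differ by at most one.
std::vector<std::pair<std::size_t, std::size_t>> split_range(
    std::size_t item_count, unsigned int parts) {
  assert(parts > 0);
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  ranges.reserve(parts);
  const std::size_t base = item_count / parts;
  const std::size_t extra = item_count % parts;  // first `extra` ranges get one more
  std::size_t begin = 0;
  for (unsigned int i = 0; i < parts; ++i) {
    const std::size_t size = base + (i < extra ? 1 : 0);
    ranges.emplace_back(begin, begin + size);
    begin += size;
  }
  return ranges;
}
```

A typical call would be choose_thread_count(data.size(), 10000, std::thread::hardware_concurrency()) followed by split_range(data.size(), that_count), with one worker per returned range. The threshold of 10000 items is only a placeholder; the right value depends on how expensive each item is.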

Code Example: A Simple Thread Pool

Below is a simple thread pool that uses std::thread::hardware_concurrency() to decide its size:

#include <chrono> // std::chrono::milliseconds, used by the tasks in main
#include <iostream>
#include <thread>
#include <vector>
#include <queue>
#include <mutex>
#include <condition_variable>
#include <functional>

class ThreadPool {
public:
  ThreadPool(size_t num_threads) : stop_(false) {
    threads_.resize(num_threads);
    for (size_t i = 0; i < num_threads; ++i) {
      threads_[i] = std::thread([this]() {
        while (true) {
          std::function<void()> task;
          {
            std::unique_lock<std::mutex> lock(queue_mutex_);
            cv_.wait(lock, [this]() { return stop_ || !tasks_.empty(); });
            if (stop_ && tasks_.empty()) return;
            task = std::move(tasks_.front()); // move instead of copying the std::function
            tasks_.pop();
          }
          task();
        }
      });
    }
  }

  ~ThreadPool() {
    {
      std::unique_lock<std::mutex> lock(queue_mutex_);
      stop_ = true;
    }
    cv_.notify_all();
    for (std::thread &thread : threads_) {
      thread.join();
    }
  }

  template <typename F>
  void enqueue(F task) {
    {
      std::unique_lock<std::mutex> lock(queue_mutex_);
      tasks_.push(task);
    }
    cv_.notify_one();
  }

private:
  std::vector<std::thread> threads_;
  std::queue<std::function<void()>> tasks_;
  std::mutex queue_mutex_;
  std::condition_variable cv_;
  bool stop_;
};

int main() {
  unsigned int num_threads = std::thread::hardware_concurrency();
  if (num_threads == 0) {
    num_threads = 4; // fallback thread count
  }

  ThreadPool pool(num_threads);

  for (int i = 0; i < 10; ++i) {
    pool.enqueue([i]() {
      // Lines printed by different threads may interleave, since std::cout is shared
      std::cout << "Task " << i << " is running on thread " << std::this_thread::get_id() << std::endl;
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    });
  }

  // No sleep is needed here: ~ThreadPool() runs when main returns, lets the
  // workers drain the queue, and then joins them.
  return 0;
}

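In this example, we first call std::thread::hardware_concurrency() to get the number of concurrent threads the hardware supports, then create a thread pool of exactly that size. Next we submit 10 tasks to the pool; each one prints a message and sleeps for 100 milliseconds. The pool's destructor keeps the workers running until the queue is empty, so every submitted task finishes before the program exits.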

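Performance Test: Seeing Is Believing!

Talk is cheap, so let's run a simple performance test and see whether choosing the thread count with std::thread::hardware_concurrency() really helps.

The job is a simple computation: the sum of squares of a very large array.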

#include <iostream>
#include <vector>
#include <thread>
#include <numeric> // std::accumulate
#include <chrono>

// Compute the sum of squares of the array serially
long long calculate_sum_of_squares(const std::vector<int>& data) {
  long long sum = 0;
  for (int x : data) {
    sum += static_cast<long long>(x) * x;
  }
  return sum;
}

// Compute the sum of squares in parallel with num_threads workers (num_threads >= 1)
long long calculate_sum_of_squares_parallel(const std::vector<int>& data, int num_threads) {
  std::vector<std::thread> threads;
  std::vector<long long> partial_sums(num_threads, 0);
  const std::size_t chunk_size = data.size() / num_threads;

  for (int i = 0; i < num_threads; ++i) {
    const std::size_t start = static_cast<std::size_t>(i) * chunk_size;
    // The last thread also takes the remainder left over by the integer division
    const std::size_t end = (i == num_threads - 1) ? data.size() : start + chunk_size;

    threads.emplace_back([&data, start, end, &partial_sums, i]() {
      long long sum = 0;
      for (std::size_t j = start; j < end; ++j) {
        sum += static_cast<long long>(data[j]) * data[j];
      }
      partial_sums[i] = sum; // each thread writes only its own slot, so no lock is needed
    });
  }

  for (auto& thread : threads) {
    thread.join();
  }

  return std::accumulate(partial_sums.begin(), partial_sums.end(), 0LL);
}

int main() {
  // Build a large array. Values are kept small (1..1000) so that the sum of
  // squares fits in a long long; filling it with 1..10000000 would overflow.
  int array_size = 10000000;
  std::vector<int> data(array_size);
  for (int i = 0; i < array_size; ++i) {
    data[i] = i % 1000 + 1;
  }

  // Serial calculation
  auto start = std::chrono::high_resolution_clock::now();
  long long serial_sum = calculate_sum_of_squares(data);
  auto end = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
  std::cout << "Serial calculation took " << duration.count() << " milliseconds (sum = " << serial_sum << ")" << std::endl;

  // Parallel calculation, using the hardware concurrency
  int num_threads = static_cast<int>(std::thread::hardware_concurrency());
  if (num_threads == 0) {
    num_threads = 4; // fallback thread count
  }

  start = std::chrono::high_resolution_clock::now();
  long long parallel_sum = calculate_sum_of_squares_parallel(data, num_threads);
  end = std::chrono::high_resolution_clock::now();
  duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
  std::cout << "Parallel calculation with " << num_threads << " threads took " << duration.count() << " milliseconds (sum = " << parallel_sum << ")" << std::endl;

  // Parallel calculation, using more threads than the hardware concurrency
  int excessive_threads = num_threads * 2; // twice as many threads
  start = std::chrono::high_resolution_clock::now();
  long long excessive_parallel_sum = calculate_sum_of_squares_parallel(data, excessive_threads);
  end = std::chrono::high_resolution_clock::now();
  duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
  std::cout << "Parallel calculation with " << excessive_threads << " threads took " << duration.count() << " milliseconds (sum = " << excessive_parallel_sum << ")" << std::endl;

  return 0;
}

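Run this code (with optimizations turned on) and you will typically find:

The parallel version is usually faster than the serial one, although for a loop this simple the gain is often limited by memory bandwidth rather than by core count.
Choosing the thread count with std::thread::hardware_concurrency() usually lands at or near the best performance for CPU-bound work.
Using more threads than the hardware concurrency may lower performance, because thread creation and context-switching overhead eats into the gains from parallelism.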
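Since these observations vary from machine to machine, it can help to sweep over several thread counts rather than compare just two. The following is a minimal sketch under that assumption: the Timing struct and the sum_of_squares and sweep_thread_counts functions are names invented for this example, the sweep doubles the thread count each round, and std::chrono::steady_clock is used for the timing. A single run per thread count is noisy, so treat the numbers as a rough guide only.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sum of squares computed with num_threads workers (num_threads >= 1).
// Unsigned 64-bit accumulation, so wrap-around is well defined.
std::uint64_t sum_of_squares(const std::vector<int>& data, unsigned int num_threads) {
  assert(num_threads >= 1);
  std::vector<std::uint64_t> partial(num_threads, 0);
  std::vector<std::thread> workers;
  const std::size_t chunk = data.size() / num_threads;
  for (unsigned int i = 0; i < num_threads; ++i) {
    const std::size_t begin = i * chunk;
    // The last worker also takes the remainder of the division.
    const std::size_t end = (i + 1 == num_threads) ? data.size() : begin + chunk;
    workers.emplace_back([&data, &partial, begin, end, i] {
      std::uint64_t local = 0;
      for (std::size_t j = begin; j < end; ++j) {
        const std::int64_t v = data[j];
        local += static_cast<std::uint64_t>(v * v);
      }
      partial[i] = local;  // each worker writes only its own slot
    });
  }
  for (auto& w : workers) w.join();
  return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}

// One measurement: thread count, elapsed milliseconds, computed result.
struct Timing {
  unsigned int threads;
  double ms;
  std::uint64_t result;
};

// Time the computation for 1, 2, 4, ... threads up to max_threads.
std::vector<Timing> sweep_thread_counts(const std::vector<int>& data,
                                        unsigned int max_threads) {
  std::vector<Timing> timings;
  for (unsigned int t = 1; t <= max_threads; t *= 2) {
    const auto start = std::chrono::steady_clock::now();
    const std::uint64_t result = sum_of_squares(data, t);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    timings.push_back({t, elapsed.count(), result});
  }
  return timings;
}
```

Calling sweep_thread_counts(data, 2 * n), where n is the (non-zero) value reported by std::thread::hardware_concurrency(), and printing each entry shows where the curve flattens out or turns back up on your particular machine.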

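Summary Table: Pros and Cons of std::thread::hardware_concurrency()

Aspect	Pros	Cons
Return value	Gives an estimate of the number of concurrent threads the hardware supports.	May return 0, meaning the hardware concurrency could not be determined.
Ease of use	Simple to use; takes no arguments.	The return value is only a hint; real performance depends on other factors too.
Thread pool size	Can be used to size a thread pool and avoid creating too many threads.	In a virtual machine it may reflect the virtual CPU count, not the host's real core count.
Parallel computation	Can be used to decide how many threads a parallel computation should create.	The number of available CPU cores may change at run time.
Task decomposition	Can be used to decide how many smaller tasks a big task should be split into.	-
Portability	Part of the C++ standard library, so it works across operating systems.	-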

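Advanced Topics: Hyper-Threading and NUMA

If you dig deeper into multithreaded programming, you will probably run into hyper-threading and NUMA architectures.

Hyper-Threading: Hyper-threading is a technology that presents a single physical CPU core as two logical CPU cores. The operating system therefore sees two cores even though there is only one physical core behind them. Hyper-threading can improve CPU utilization, but it does not truly double performance.
NUMA (Non-Uniform Memory Access): NUMA is a memory architecture in which different CPU cores reach different regions of memory at different speeds. A core accessing memory close to it is fast; accessing memory far from it is slower. On a NUMA machine you need to place threads on suitable cores to avoid memory accesses that cross NUMA nodes.

The value returned by std::thread::hardware_concurrency() typically counts hyper-threads as separate hardware threads, but it says nothing about the NUMA layout. If you need to optimize for NUMA, use the APIs provided by the operating system to get more detailed CPU topology information.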
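As one example of what "APIs provided by the operating system" can look like, Linux exposes the NUMA layout as files under /sys/devices/system/node. The sketch below reads them with plain file I/O. The helper name numa_node_cpulists and the max_nodes limit are invented for this example, and the code assumes node numbers are contiguous starting at 0, which is common but not guaranteed. On systems without these files it simply returns an empty list.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: returns one CPU-list string per NUMA node, read from
// Linux sysfs. Returns an empty vector if the files are not present
// (non-Linux systems, or kernels built without NUMA support).
std::vector<std::string> numa_node_cpulists(std::size_t max_nodes = 1024) {
  assert(max_nodes > 0);
  std::vector<std::string> nodes;
  for (std::size_t n = 0; n < max_nodes; ++n) {
    std::ifstream file("/sys/devices/system/node/node" + std::to_string(n) +
                       "/cpulist");
    if (!file) break;  // assumption: node numbers are contiguous from 0
    std::string cpus;
    std::getline(file, cpus);  // may be empty for a memory-only node
    nodes.push_back(cpus);
  }
  return nodes;
}
```

The size of the returned vector is the number of NUMA nodes found, and each string lists the logical CPUs that belong to that node. Actually pinning threads to those CPUs requires further platform-specific calls that are outside the scope of this sketch.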

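Wrap-Up: std::thread::hardware_concurrency(), Your Multithreading Sidekick!

All in all, std::thread::hardware_concurrency() is a very practical function that helps you write better multithreaded code. It is not a cure-all, but as long as you understand its limitations and back it up with real performance testing, you can make the most of it and make your program fly!

That's it for today's chat. Thanks for tuning in! Next time you write code, don't forget this little fellow: std::thread::hardware_concurrency(). Happy coding, and may your bugs be few!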
