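XGBoost/LightGBM Tuning and Optimization: Hyperparameter Search and Ensemble Learning Strategies - 智猿学院: lectures on cutting-edge technology in front-end, back-end, databases, artificial intelligence, cloud computing, and more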

Today we're going to talk about XGBoost and LightGBM, and how to tune this pair so they behave better and perform harder!


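Hello everyone! I'm today's lecturer, a veteran who has spent years climbing in and out of machine learning pitfalls. No fluffy concepts today: we go straight to the practical material and talk about two heavy hitters, XGBoost and LightGBM. You have probably used them at some point, but using them well is another matter.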

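1. Why tune at all? (Skipping it has real consequences!)

Imagine you buy a sports car and find that, with the factory settings, it drives like a tractor. You would tune it, right? XGBoost and LightGBM are the same: the default parameters run, but squeezing out their full performance takes some thought and careful tuning.

What happens if you skip tuning? At best the model is mediocre and you have wasted time and compute; at worst it overfits, does miserably on the test set, and makes you question your life choices. So tuning is a required stop on the road to machine learning mastery!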

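2. What is a hyperparameter? (Don't let the name scare you!)

A hyperparameter is simply a parameter you set by hand before training starts. Hyperparameters control the learning process and directly affect the final model. Common ones include:

Learning rate (Learning Rate): scales how much each new tree contributes to the model. Too large and training oscillates; too small and convergence is slow.
Tree depth (Max Depth): the maximum depth of each tree. Too deep tends to overfit; too shallow tends to underfit.
Leaf minimum (Min Child Weight/Min Data in Leaf): limits how small a leaf can get. LightGBM's min_data_in_leaf is a minimum sample count; XGBoost's min_child_weight is a minimum sum of instance weights (Hessians) in a child, which equals the sample count only for squared-error loss with unit weights. Either way, it stops the model from splitting too finely and reduces overfitting.
Regularization (L1/L2 Regularization): penalizes model complexity to prevent overfitting.
Row subsampling (Subsample/Bagging Fraction): the fraction of training samples used in each iteration.
Column subsampling (Colsample bytree/Feature Fraction): the fraction of features sampled for each tree.
Number of trees (N Estimators/Num Boost Round): the number of boosting iterations, that is, how many trees are built.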
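To make the list above concrete, here is a minimal sketch of how each concept maps to a keyword argument in the scikit-learn wrappers of the two libraries (XGBClassifier and LGBMClassifier). The helper `build_kwargs` and the numeric values are hypothetical and only for illustration; they are not recommended settings.

```python
# Concept -> (XGBClassifier keyword, LGBMClassifier keyword).
# Keyword names come from the scikit-learn wrappers of each library.
PARAM_NAMES = {
    "learning rate": ("learning_rate", "learning_rate"),
    "tree depth": ("max_depth", "max_depth"),
    # Not equivalent: a Hessian-sum threshold vs. a sample-count threshold.
    "leaf minimum": ("min_child_weight", "min_child_samples"),
    "L1 regularization": ("reg_alpha", "reg_alpha"),
    "L2 regularization": ("reg_lambda", "reg_lambda"),
    # In LightGBM, subsample only takes effect when subsample_freq > 0.
    "row subsampling": ("subsample", "subsample"),
    "column subsampling": ("colsample_bytree", "colsample_bytree"),
    "number of trees": ("n_estimators", "n_estimators"),
}


def build_kwargs(settings, library):
    """Translate concept-level settings into keyword arguments for one library."""
    column = 0 if library == "xgboost" else 1
    return {PARAM_NAMES[concept][column]: value for concept, value in settings.items()}


# Illustrative values only.
settings = {
    "learning rate": 0.1,
    "tree depth": 5,
    "leaf minimum": 5,
    "row subsampling": 0.8,
    "column subsampling": 0.8,
    "number of trees": 200,
}

xgb_kwargs = build_kwargs(settings, "xgboost")
lgbm_kwargs = build_kwargs(settings, "lightgbm")
print(xgb_kwargs)
print(lgbm_kwargs)
```

The resulting dictionaries could be passed as `XGBClassifier(**xgb_kwargs)` or `LGBMClassifier(**lgbm_kwargs)`. Because the two leaf-minimum parameters measure different things, the same number should not be expected to behave identically in both libraries.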

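3. Hyperparameter search methods (let the machine do the work!)

Tuning by hand? Far too slow! Let's learn to be lazy and have the machine do the work for us. Common hyperparameter search methods are:

Grid search (Grid Search):

As the name suggests, it tries every combination in the parameter grid you specify. It is simple and brute-force, but the cost multiplies with every parameter you add and explodes quickly.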

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3]
}

# Create the XGBoost classifier
# (use_label_encoder is deprecated in recent XGBoost releases, so it is not passed here)
xgb = XGBClassifier(eval_metric='logloss')

# Create the grid search object
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, scoring='accuracy', cv=3, verbose=2)

# Run the grid search
grid_search.fit(X, y)

# Print the best parameters and the best cross-validated score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

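Pros: simple to understand; covers every combination in the grid.
Cons: computationally expensive and slow; unsuitable when there are many parameters.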
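To see why the cost explodes, it helps to count the model fits a grid implies. The sketch below uses only the standard library and the same three-parameter grid with 3-fold cross-validation; the extra `subsample` parameter added at the end is a hypothetical addition for illustration.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
}
cv_folds = 3

# Every combination of one value per parameter.
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds
print("combinations:", len(combinations))
print("cross-validation fits:", n_fits)

# Adding one more parameter with three candidate values triples the work.
bigger_grid = dict(param_grid, subsample=[0.6, 0.8, 1.0])
bigger_fits = len(list(product(*bigger_grid.values()))) * cv_folds
print("fits with a fourth parameter:", bigger_fits)
```

With three values for each of three parameters this comes to 27 combinations and 81 cross-validation fits, and a fourth three-valued parameter raises that to 243.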

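Random search (Random Search):

Samples a fixed number of parameter combinations at random from the parameter space. It is more efficient than grid search because it does not have to try every combination.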

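from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from scipy.stats import randint, uniform

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Define the parameter distributions
# randint(low, high) excludes high; uniform(loc, scale) covers [loc, loc + scale]
param_distributions = {
    'n_estimators': randint(100, 501),     # integers 100..500
    'max_depth': randint(3, 9),            # integers 3..8
    'learning_rate': uniform(0.01, 0.29)   # floats in [0.01, 0.30]
}

# Create the XGBoost classifier
# (use_label_encoder is deprecated in recent XGBoost releases, so it is not passed here)
xgb = XGBClassifier(eval_metric='logloss')

# Create the random search object; random_state makes the sampled combinations repeatable
random_search = RandomizedSearchCV(estimator=xgb, param_distributions=param_distributions,
                                   scoring='accuracy', cv=3, n_iter=10, verbose=2,
                                   random_state=42)

# Run the random search
random_search.fit(X, y)

# Print the best parameters and the best cross-validated score
print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_)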

Pros: efficient; suitable when there are many parameters.
Cons: no guarantee of finding the best parameters; results vary from run to run.
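What random search does under the hood can be sketched in a few lines of standard-library Python. The function `sample_config` is a hypothetical helper, and drawing the learning rate on a log scale is a design choice assumed here (not something the lecture prescribes) so that small values such as 0.01 to 0.03 get as many draws as 0.1 to 0.3.

```python
import math
import random


def sample_config(rng):
    """Draw one random hyperparameter combination."""
    return {
        "n_estimators": rng.randint(100, 500),  # random.randint includes both ends
        "max_depth": rng.randint(3, 8),
        # Log-uniform draw between 0.01 and 0.3.
        "learning_rate": 10 ** rng.uniform(math.log10(0.01), math.log10(0.3)),
    }


rng = random.Random(42)  # fixed seed so the draws are repeatable
configs = [sample_config(rng) for _ in range(10)]

for config in configs:
    print(config)
```

Each sampled dictionary would then be scored with cross-validation, and the best-scoring one kept; the number of draws is a budget you choose, independent of how many parameters there are.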

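Bayesian optimization (Bayesian Optimization):

Builds a probabilistic model of the objective function and uses it to decide which parameter combination to try next. It is smarter than the two methods above and usually needs fewer evaluations to reach a good result.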

from bayes_opt import BayesianOptimization
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Define the objective function: mean 3-fold accuracy for one parameter combination
def xgb_evaluate(n_estimators, max_depth, learning_rate):
    params = {
        # The optimizer proposes floats, so integer parameters are truncated with int()
        'n_estimators': int(n_estimators),
        'max_depth': int(max_depth),
        'learning_rate': learning_rate,
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'random_state': 42
    }
    xgb = XGBClassifier(**params)
    cv_scores = cross_val_score(xgb, X, y, cv=3, scoring='accuracy')
    return cv_scores.mean()

# Define the parameter bounds
pbounds = {
    'n_estimators': (100, 500),
    'max_depth': (3, 8),
    'learning_rate': (0.01, 0.3)
}

# Create the Bayesian optimization object
optimizer = BayesianOptimization(
    f=xgb_evaluate,
    pbounds=pbounds,
    random_state=42,
)

# Run the optimization: 2 random starting points, then 3 guided steps
# (kept tiny so the demo runs quickly; use more in practice)
optimizer.maximize(
    init_points=2,
    n_iter=3,
)

# Print the best parameters and the best score
print(optimizer.max)

Pros: efficient; can find better parameter combinations.
Cons: relatively complex to implement.

Installing bayes_opt:

pip install bayesian-optimization
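One practical detail: the optimizer works in a continuous space, so `optimizer.max` reports floats even for parameters such as `max_depth`. Below is a minimal sketch of converting that result into arguments a model accepts. The helper `to_model_params` is hypothetical, and `example_max` is a made-up dictionary that only mimics the shape of `optimizer.max`; it is not a real search result.

```python
def to_model_params(raw_params, int_keys=("n_estimators", "max_depth")):
    """Truncate the parameters that must be integers, as the objective function does."""
    params = dict(raw_params)
    for key in int_keys:
        if key in params:
            params[key] = int(params[key])
    return params


# Made-up values in the same shape as optimizer.max: {'target': ..., 'params': {...}}.
example_max = {
    "target": 0.93,
    "params": {"learning_rate": 0.12, "max_depth": 5.7, "n_estimators": 312.4},
}

best_params = to_model_params(example_max["params"])
print(best_params)
```

Truncating with `int()` here matches what the objective function did during the search, so the final model is trained with the same values that produced the reported score.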

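Hyperopt:

Hyperopt is a Python library built for optimizing the hyperparameters of machine learning models. It uses a probabilistic model to guide the search and offers several search algorithms, such as Tree-structured Parzen Estimator (TPE) and random search. Its strengths are flexibility, efficiency, and support for many kinds of hyperparameter spaces.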

from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from sklearn.datasets import make_classification

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Define the objective function
def objective(space):
    params = {
        # hp.quniform yields floats, so integer parameters are cast with int()
        'n_estimators': int(space['n_estimators']),
        'max_depth': int(space['max_depth']),
        'learning_rate': space['learning_rate'],
        'objective': 'binary:logistic',
        'eval_metric': 'logloss',
        'random_state': 42
    }
    xgb = XGBClassifier(**params)
    cv_scores = cross_val_score(xgb, X, y, cv=3, scoring='accuracy')
    return {'loss': -cv_scores.mean(), 'status': STATUS_OK}  # Hyperopt minimizes loss

# Define the parameter space
space = {
    'n_estimators': hp.quniform('n_estimators', 100, 500, 1),
    'max_depth': hp.quniform('max_depth', 3, 8, 1),
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.3)
}

# Create a Trials object to store information from the search
trials = Trials()

# Run Hyperopt
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=10,
            trials=trials)

# Print the best parameters (quniform values come back as floats, e.g. 5.0)
print("Best parameters:", best)

# Print the Trials information
# print(trials.trials)  # shows the details of every evaluation

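Installing Hyperopt:

pip install hyperopt

Pros: flexible and efficient; handles many kinds of hyperparameter spaces.
Cons: you have to define an objective function and a parameter space, which is relatively complex.

Summary table: hyperparameter search methods compared

Method	Pros	Cons	When to use
Grid search	Simple to understand; covers every combination in the grid	Expensive and slow; explodes with many parameters	Few parameters, small datasets
Random search	Efficient; suitable for many parameters	No guarantee of finding the best parameters; results are random	Many parameters, large datasets
Bayesian optimization	Efficient; can find better parameter combinations	Relatively complex to implement	High accuracy requirements with ample compute
Hyperopt	Flexible and efficient; handles many kinds of parameter spaces	Requires defining an objective function and a parameter space; relatively complex	More advanced optimization needs, with some programming experience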

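4. Ensemble learning strategies (strength in numbers!)

Tuning alone is not enough. We can also use ensemble learning strategies, which let several models work together to push performance further. Common ensemble strategies are:

Bagging (Bootstrap Aggregating):

Draws several different training sets from the original data by sampling with replacement, trains one model on each, and then averages the models' predictions or takes a vote. Random forest (Random Forest) is a typical example of Bagging.
Boosting:

Trains models one after another, with each new model concentrating on what the ensemble so far still gets wrong. AdaBoost does this by raising the weights of misclassified samples; gradient boosting, which XGBoost and LightGBM implement, fits each new tree to the gradient of the loss (the residuals, in the squared-error case).
Stacking:

Makes predictions with several different models, uses those predictions as new features, and trains a meta model (Meta Model) on them to produce the final prediction.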
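The boosting idea can be sketched without any library: start from a constant prediction, then repeatedly fit a small tree to the current residuals and add a shrunken copy of it to the model. This is a toy illustration on one-dimensional data with squared-error loss, assuming depth-1 trees (stumps); the function names are hypothetical, and XGBoost and LightGBM add regularization, second-order information, and many optimizations on top of this.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on 1-D data that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda x: left_value if x <= threshold else right_value


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error: each stump is fit to the residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return base, stumps, preds


xs = list(range(20))
ys = [(x - 10) ** 2 for x in xs]

base, stumps, preds = boost(xs, ys)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, preds)
print("MSE of the constant model:", initial_mse)
print("MSE after boosting:", final_mse)
```

The `learning_rate` and `n_rounds` arguments play the same roles as the learning rate and number of trees described earlier: a smaller learning rate takes smaller steps and needs more rounds to reach the same training error.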

Code example: Stacking ensemble learning

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import numpy as np

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the base models
model1 = RandomForestClassifier(n_estimators=100, random_state=42)
model2 = XGBClassifier(n_estimators=100, eval_metric='logloss', random_state=42)
model3 = LGBMClassifier(n_estimators=100, random_state=42)
base_models = [model1, model2, model3]

# Build the meta model's training features from out-of-fold predictions on the
# training set, so the meta model is never trained on test labels
oof_preds = [
    cross_val_predict(model, X_train, y_train, cv=5, method='predict_proba')[:, 1]
    for model in base_models
]
X_meta_train = np.column_stack(oof_preds)

# Train each base model on the full training set, then predict the test set
for model in base_models:
    model.fit(X_train, y_train)
X_meta_test = np.column_stack([model.predict_proba(X_test)[:, 1] for model in base_models])

# Define the meta model
meta_model = LogisticRegression()

# Train the meta model on the out-of-fold training predictions
meta_model.fit(X_meta_train, y_train)

# Make the final predictions on the test set
y_pred = meta_model.predict(X_meta_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Stacking Accuracy:", accuracy)
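Stacking is also available ready-made as scikit-learn's StackingClassifier, which trains the meta model on cross-validated predictions of the base models. The sketch below is one possible setup, not the lecture's own: to stay runnable with scikit-learn alone it assumes RandomForestClassifier and GradientBoostingClassifier as stand-in base models, and XGBClassifier or LGBMClassifier could be listed in `estimators` the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the manual example.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Base models are given as (name, estimator) pairs; these two are stand-ins.
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
]

# cv=5: the meta model is fit on 5-fold out-of-fold predictions of the base models.
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

stack_accuracy = accuracy_score(y_test, stack.predict(X_test))
print("StackingClassifier accuracy:", stack_accuracy)
```

The test set is used only once, for the final score, which is what keeps the reported accuracy an honest estimate.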

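5. Some tuning tips for XGBoost and LightGBM

XGBoost:
- gamma: the minimum loss reduction required to split a leaf node further. The larger the value, the more conservative the algorithm.
- scale_pos_weight: for imbalanced datasets. It scales the weight of the positive class; a common starting value is the number of negative samples divided by the number of positive samples.
LightGBM:
- num_leaves: controls tree complexity. Usually set to a value below 2^max_depth.
- min_data_in_leaf: the minimum number of data points required in each leaf.
- boosting_type: gbdt, rf, dart, or goss. goss is faster but may be slightly less accurate. (Recent LightGBM releases select GOSS through data_sample_strategy instead.)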
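Two of these tips reduce to simple arithmetic. The sketch below computes a starting value for scale_pos_weight as the negative-to-positive ratio and the 2^max_depth ceiling for num_leaves. Both helper functions are hypothetical, and the label counts are made up for illustration.

```python
def suggest_scale_pos_weight(labels):
    """Ratio of negative to positive samples, a common starting value."""
    positives = sum(1 for label in labels if label == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive samples in labels")
    return negatives / positives


def num_leaves_upper_bound(max_depth):
    """A tree of depth max_depth has at most 2 ** max_depth leaves."""
    return 2 ** max_depth


# Made-up imbalanced labels: 100 positives, 900 negatives.
labels = [1] * 100 + [0] * 900

weight = suggest_scale_pos_weight(labels)
leaf_cap = num_leaves_upper_bound(6)
print("scale_pos_weight starting point:", weight)
print("keep num_leaves below:", leaf_cap)
```

Treat both numbers as starting points for a search rather than final settings.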

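6. A hands-on case (all talk and no practice gets you nowhere!)

Let's take a simple classification task as the example and show how to tune XGBoost with Bayesian optimization.

(The code is the same as the Bayesian optimization example above.)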

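7. Summary (review the old to learn the new!)

Today we covered tuning and optimizing XGBoost and LightGBM, in these main areas:

Why tuning matters: skipping it has real consequences!
What hyperparameters are: parameters that control how the model learns.
Hyperparameter search methods: grid search, random search, Bayesian optimization, Hyperopt.
Ensemble learning strategies: Bagging, Boosting, Stacking.
Some tuning tips for XGBoost and LightGBM.

I hope today's session helps you get a better grip on XGBoost and LightGBM and takes you further down the machine learning road!

Finally, remember one truth: there is no best model, only the model that fits best. Experiment a lot, think a lot, and you too can become a tuning expert!

Thank you for listening! If you have any questions, feel free to ask at any time!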
