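Python Machine Learning: Advanced Applications of Scikit-learn in Model Selection, Hyperparameter Tuning, and Pipeline Construction - 智猿学院 (technical lectures on front-end and back-end development, databases, artificial intelligence, cloud computing, and related fields)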

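Hello everyone. Today we take a closer look at advanced uses of Scikit-learn for model selection, hyperparameter tuning, and Pipeline construction. Scikit-learn is one of the most popular machine learning libraries in Python, and it provides the tools needed to build efficient, reliable models. Through example code and detailed explanations, this lecture aims to help you master these techniques and improve your model development skills.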

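1. Model Selection: Evaluation and Comparison

Choosing a suitable model is critical in a machine learning project. Scikit-learn provides a range of evaluation metrics and cross-validation methods that let us compare the performance of different models systematically.

1.1 Evaluation Metrics

Evaluation metrics measure how accurate a model's predictions are and, when computed on held-out data, how well the model generalizes. Different metrics apply depending on the task type (classification or regression).

Classification metrics:
- Accuracy: the proportion of samples classified correctly.
- Precision: among the samples predicted positive, the proportion that are truly positive.
- Recall: among all truly positive samples, the proportion correctly predicted as positive.
- F1-score: the harmonic mean of precision and recall.
- AUC (Area Under the ROC Curve): the area under the ROC curve, which measures how well the model separates positive from negative examples.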
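
The definitions above can be checked by hand on a tiny example. The sketch below uses made-up labels (an assumption for illustration only), derives each metric from the confusion-matrix counts, and compares the result with the corresponding scikit-learn function.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels, chosen only for illustration (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Metrics computed directly from their definitions
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy:  {accuracy:.3f} (sklearn: {accuracy_score(y_true, y_pred):.3f})")
print(f"Precision: {precision:.3f} (sklearn: {precision_score(y_true, y_pred):.3f})")
print(f"Recall:    {recall:.3f} (sklearn: {recall_score(y_true, y_pred):.3f})")
print(f"F1-score:  {f1:.3f} (sklearn: {f1_score(y_true, y_pred):.3f})")
```

Here precision (3 of 5 predicted positives are correct) is lower than recall (3 of 4 actual positives are found), which is the kind of difference that accuracy alone hides.
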
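
Regression metrics:
- Mean Squared Error (MSE): the mean of the squared differences between predicted and true values.
- Root Mean Squared Error (RMSE): the square root of MSE; easier to interpret because it is in the same units as the target.
- Mean Absolute Error (MAE): the mean of the absolute differences between predicted and true values.
- R-squared (R²): the proportion of variance explained by the model. The best possible value is 1, and larger values indicate a better fit; on held-out data it can be negative when the model does worse than always predicting the mean.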
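
The lecture's code examples focus on classification, so here is a minimal sketch of the regression metrics on a handful of made-up values (an assumption for illustration). It also includes a deliberately poor constant prediction to show that R² is not bounded below by 0.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted values, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # same units as the target
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# A constant prediction far from the data does worse than predicting the mean
y_bad = np.full_like(y_true, 10.0)
r2_bad = r2_score(y_true, y_bad)

print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}")
print(f"R^2:  {r2:.3f}")
print(f"R^2 of the constant prediction: {r2_bad:.3f}")
```
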

Code example: evaluating a classification model with several metrics

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.datasets import make_classification

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a logistic regression model
model = LogisticRegression(random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1] # probability of the positive class

# Compute the evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)

# Print the evaluation results
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")
print(f"AUC: {auc}")

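1.2 Cross-Validation

Cross-validation is an effective way to estimate how well a model generalizes. It splits the dataset into several subsets (folds), uses each fold in turn as the validation set while training on the remaining folds, and averages the scores over all rounds.

- K-Fold Cross-Validation: splits the dataset into K folds; each round uses one fold for validation and the other K-1 folds for training.
- Stratified K-Fold Cross-Validation: like K-fold, but each fold preserves the class proportions of the full dataset; suited to imbalanced datasets.
- Leave-One-Out Cross-Validation: each round holds out a single sample for validation and trains on the rest; suited to small datasets, since it requires one model fit per sample.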
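
To see how these splitters differ, the sketch below counts the positive samples that land in each validation fold. The imbalanced label vector is a made-up assumption, and `positives_per_fold` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, StratifiedKFold

# Hypothetical imbalanced labels: 90 negatives followed by 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not matter for splitting


def positives_per_fold(splitter):
    """Count the positive samples in each validation fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]


kfold_counts = positives_per_fold(KFold(n_splits=5))
stratified_counts = positives_per_fold(StratifiedKFold(n_splits=5))
n_loo_rounds = LeaveOneOut().get_n_splits(X)

print(f"KFold positives per fold:           {kfold_counts}")
print(f"StratifiedKFold positives per fold: {stratified_counts}")
print(f"Leave-one-out rounds:               {n_loo_rounds}")
```

Without shuffling, plain `KFold` puts all ten positives into the last fold of this sorted dataset, while `StratifiedKFold` places two in every fold. Leave-one-out needs as many rounds as there are samples.
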

Code example: evaluating a model with K-fold cross-validation

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import numpy as np

# Create a sample classification dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Create a logistic regression model
model = LogisticRegression(random_state=42)

# Create the K-fold cross-validation splitter
kf = KFold(n_splits=5, shuffle=True, random_state=42) # shuffle=True shuffles the data once before it is split into folds

# Evaluate the model with cross-validation
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy') # scoring selects the evaluation metric

# Print the cross-validation results
print(f"Cross-validation scores: {scores}")
print(f"Mean cross-validation score: {np.mean(scores)}")

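1.3 Model Comparison

With evaluation metrics and cross-validation in hand, we can compare models systematically. Present the results in a table or chart and choose the model that performs best on the metric that matters for the task.

Code example: comparing the performance of several models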

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import numpy as np
import pandas as pd

# Create a sample classification dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Define the models to compare
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'SVM': SVC(random_state=42, probability=True), # probability=True enables predict_proba; optional here, since the 'roc_auc' scorer can also use decision_function
    'Random Forest': RandomForestClassifier(random_state=42)
}

# Create the K-fold cross-validation splitter
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Store the cross-validation results
results = {}

# Loop over each model
for name, model in models.items():
    # Evaluate the model with cross-validation
    scores_accuracy = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
    scores_roc_auc = cross_val_score(model, X, y, cv=kf, scoring='roc_auc')

    # Store the results
    results[name] = {
        'Accuracy': scores_accuracy.mean(),
        'ROC AUC': scores_roc_auc.mean()
    }

# Convert the results to a DataFrame
results_df = pd.DataFrame(results).T

# Print the results
print(results_df)

The code above prints a table with each model's mean cross-validated accuracy and ROC AUC, which makes the models easy to compare.
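
The comparison above calls `cross_val_score` once per metric, so every model is trained twice on each fold. As one possible alternative, the sketch below uses `cross_validate`, which records several metrics from a single cross-validation run. The dataset and model settings mirror the earlier examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

X, y = make_classification(n_samples=100, n_features=10, random_state=42)
model = LogisticRegression(random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# One cross-validation run that records both metrics for every fold
cv_results = cross_validate(model, X, y, cv=kf, scoring=['accuracy', 'roc_auc'])

# Each metric is stored under the key 'test_<metric name>'
mean_accuracy = cv_results['test_accuracy'].mean()
mean_roc_auc = cv_results['test_roc_auc'].mean()

print(f"Mean accuracy: {mean_accuracy:.3f}")
print(f"Mean ROC AUC:  {mean_roc_auc:.3f}")
```
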

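2. Hyperparameter Tuning: Finding the Best Configuration

A model's performance depends heavily on its hyperparameters. Hyperparameters are set before training rather than learned from the data; examples include the learning rate and the regularization strength. The goal of hyperparameter tuning is to find the combination of hyperparameters that gives the best performance on validation data.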
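
The distinction between hyperparameters and learned parameters is easy to see on a scikit-learn estimator. In the minimal sketch below, the value `C=0.5` is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# C is a hyperparameter: we choose it before training
model = LogisticRegression(C=0.5, random_state=42)
hyperparameters = model.get_params()
print(f"C before fit: {hyperparameters['C']}")

# coef_ and intercept_ are parameters: they are learned from the data during fit
model.fit(X, y)
print(f"Learned coefficient array shape: {model.coef_.shape}")
print(f"C after fit (unchanged by training): {model.get_params()['C']}")
```
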

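2.1 Grid Search

Grid search is a brute-force method: you list candidate values for each hyperparameter, and it trains and evaluates the model for every possible combination, then selects the one that performs best.

Code example: tuning the hyperparameters of an SVM with grid search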

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Create a sample classification dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Define the hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.01, 0.1, 1, 'scale'],
    'kernel': ['rbf', 'linear']
}

# Create the SVM model
model = SVC(random_state=42)

# Create the grid search object
grid_search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy', verbose=2) # verbose controls how much progress output is printed

# Run the grid search
grid_search.fit(X, y)

# Print the best hyperparameters and the best score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")

# Get the best model
best_model = grid_search.best_estimator_
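
To make the cost of exhaustive search concrete, the sketch below counts the candidates in the SVM grid from the example above using `ParameterGrid`. It also shows one possible refinement, offered here as a suggestion rather than as part of the original example: `gamma` is ignored by the linear kernel, so a list of two grids avoids evaluating linear-kernel candidates that differ only in `gamma`.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the SVM grid in the example above
full_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.01, 0.1, 1, 'scale'],
    'kernel': ['rbf', 'linear'],
}

# Alternative: one grid per kernel, so gamma is only varied for the RBF kernel
split_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1, 'scale']},
]

cv_folds = 3
n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))

print(f"Full grid:  {n_full} candidates, {n_full * cv_folds} fits with cv={cv_folds}")
print(f"Split grid: {n_split} candidates, {n_split * cv_folds} fits with cv={cv_folds}")
```

`GridSearchCV` accepts a list of dictionaries like `split_grid` directly as its `param_grid` argument.
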

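2.2 Randomized Search

Randomized search samples a fixed number of hyperparameter combinations at random from the specified ranges or distributions, trains and evaluates the model for each, and selects the one that performs best. It is usually cheaper than grid search because the number of evaluations is set by n_iter rather than by the size of the grid, which matters most when there are many hyperparameters.

Code example: tuning the hyperparameters of a random forest with randomized search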

from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from scipy.stats import randint

# Create a sample classification dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

# Define the hyperparameter distributions
# randint(low, high) samples integers from low up to high - 1
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(5, 15),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 5)
}

# Create the random forest model
model = RandomForestClassifier(random_state=42)

# Create the randomized search object
random_search = RandomizedSearchCV(model, param_distributions, cv=3, scoring='accuracy', n_iter=10, random_state=42, verbose=2) # n_iter sets how many combinations are sampled

# Run the randomized search
random_search.fit(X, y)

# Print the best hyperparameters and the best score
print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_}")

# Get the best model
best_model = random_search.best_estimator_
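
Randomized search is not limited to integer ranges. For hyperparameters that span several orders of magnitude, a log-uniform distribution is a common choice. The sketch below uses `ParameterSampler` to show what sampled candidates look like; the search space is a made-up assumption for an RBF-kernel SVM and is not taken from the example above.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search space, for illustration only
param_distributions = {
    'C': loguniform(1e-2, 1e2),      # continuous, uniform on a log scale
    'gamma': loguniform(1e-4, 1e0),
}

# Draw five candidate combinations; random_state makes the draw repeatable
samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=42))

for params in samples:
    print(f"C={params['C']:.4f}, gamma={params['gamma']:.6f}")
```

The same `param_distributions` dictionary can be passed to `RandomizedSearchCV`.
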

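2.3 Bayesian Optimization

Bayesian optimization is a more advanced tuning method. It builds a probabilistic model of how hyperparameter combinations relate to performance and uses it to pick the most promising combination to evaluate next. Because each trial is informed by earlier results, it can often find good hyperparameters in fewer evaluations than grid search or randomized search. It requires an additional library such as scikit-optimize or Optuna.

Code example: hyperparameter optimization with Optuna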

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Create a sample classification dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=42)

def objective(trial):
    # Define the hyperparameter search space
    n_estimators = trial.suggest_int('n_estimators', 50, 200)
    max_depth = trial.suggest_int('max_depth', 5, 15)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 5)

    # Create the random forest model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        random_state=42
    )

    # Evaluate the model with cross-validation
    scores = cross_val_score(model, X, y, cv=3, scoring='accuracy')

    # Return the mean cross-validation score
    return scores.mean()

# Create the Optuna study object
study = optuna.create_study(direction='maximize') # 'maximize' maximizes the objective; 'minimize' minimizes it

# Run the optimization
study.optimize(objective, n_trials=10) # n_trials sets the number of trials

# Print the best hyperparameters and the best score
print(f"Best parameters: {study.best_params}")
print(f"Best score: {study.best_value}")

# Get the best model (it must be retrained with the best parameters)
best_model = RandomForestClassifier(**study.best_params, random_state=42)
best_model.fit(X, y)

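3. Building Pipelines: Streamlining Model Development

A Pipeline chains several data-processing steps and a model-training step into one complete machine learning workflow. Using a Pipeline simplifies development, improves code readability and maintainability, and helps avoid data leakage.

3.1 Basic Concepts

A Pipeline consists of a sequence of steps, each of which can be data preprocessing, feature engineering, or model training. The steps run in order: every step except the last transforms the data and passes it on, and the final step (usually a model) produces the predictions.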
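
One way to see what a Pipeline does is to compare it with the same steps written out by hand. The sketch below fits a two-step Pipeline and a manual scaler-plus-classifier version on the same made-up dataset, then checks that both give the same predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Pipeline version: a single fit call runs every step in order
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42)),
])
pipeline.fit(X, y)

# Manual version of the same two steps
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
classifier = LogisticRegression(random_state=42)
classifier.fit(X_scaled, y)

manual_pred = classifier.predict(scaler.transform(X))
same_predictions = np.array_equal(pipeline.predict(X), manual_pred)

print(f"Step names: {list(pipeline.named_steps)}")
print(f"Same predictions as the manual version: {same_predictions}")
```
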

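3.2 Advantages of Pipelines

- Less code: several steps are wrapped in a single object.
- Better readability: the data-processing and training flow is laid out explicitly.
- No data leakage: when a Pipeline is cross-validated, its preprocessing steps are fit on the training folds only, so information from the validation fold does not leak into training.
- Easier deployment: the whole Pipeline can be saved and deployed as one object.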
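
The data-leakage point can be illustrated by contrasting two ways of combining scaling with cross-validation. This is a minimal sketch on a made-up dataset; with simple standardization the two scores are usually close, and the gap tends to matter more for steps that use the labels, such as feature selection.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Leaky pattern: the scaler sees every sample, including those later used for validation
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(random_state=42), X_scaled_all, y, cv=5)

# Leak-free pattern: the scaler is refit on the training folds of each split
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42)),
])
clean_scores = cross_val_score(pipeline, X, y, cv=5)

print(f"Scaling before CV (leaky):  {leaky_scores.mean():.3f}")
print(f"Scaling inside the Pipeline: {clean_scores.mean():.3f}")
```
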

Code example: building a complete machine learning workflow with a Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score # needed for the evaluation step

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the Pipeline object
pipeline = Pipeline([
    ('scaler', StandardScaler()), # standardize the features
    ('classifier', LogisticRegression(random_state=42)) # logistic regression model
])

# Train the Pipeline
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
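
Because a fitted Pipeline is a single object, it can be serialized in one step. The sketch below round-trips a Pipeline through the standard-library `pickle` module in memory; writing to a file (for example with `joblib`) is a common variant. Treat this as an illustration rather than a deployment recipe: a pickled model should only be loaded from a trusted source and with a compatible scikit-learn version.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42)),
])
pipeline.fit(X, y)

# Serialize the fitted Pipeline to bytes, then restore it
payload = pickle.dumps(pipeline)
restored = pickle.loads(payload)

# The restored object includes the fitted scaler and the fitted classifier
identical = np.array_equal(pipeline.predict(X), restored.predict(X))
print(f"Restored Pipeline gives identical predictions: {identical}")
```
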

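3.3 Hyperparameter Tuning with a Pipeline

A Pipeline can be combined with grid search or randomized search to tune the hyperparameters of the entire workflow.

Code example: tuning the hyperparameters of a Pipeline with grid search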

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Create a sample classification dataset
X, y = make_classification(n_samples=100, n_features=20, random_state=42)

# Create the Pipeline object
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42))
])

# Define the hyperparameter grid
param_grid = {
    'classifier__C': [0.1, 1, 10, 100], # note the naming convention: step name, two underscores, hyperparameter name
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['liblinear']
}

# Create the grid search object
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy', verbose=2)

# Run the grid search
grid_search.fit(X, y)

# Print the best hyperparameters and the best score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_}")

# Get the best model
best_model = grid_search.best_estimator_
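
If you are unsure which names are valid in a grid, the `step__parameter` keys can be listed from the Pipeline itself. The sketch below prints the names that belong to the `classifier` step and then sets two of them with `set_params`; the chosen values are arbitrary and only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=42)),
])

# Every tunable name appears as a key of get_params()
classifier_params = sorted(
    name for name in pipeline.get_params() if name.startswith('classifier__')
)
print(classifier_params)

# The same names work with set_params, for any step in the Pipeline
pipeline.set_params(classifier__C=10, scaler__with_mean=False)
print(pipeline.named_steps['classifier'].C)
print(pipeline.named_steps['scaler'].with_mean)
```
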

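4. More Complex Pipelines

A Pipeline can include more elaborate steps such as feature selection and feature transformation. Below is an example Pipeline that includes feature selection.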

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create sample data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(score_func=f_classif, k=10)), # keep the 10 features with the highest ANOVA F-scores
    ('classifier', LogisticRegression(random_state=42))
])

# Train and evaluate the Pipeline
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"Accuracy: {accuracy}")
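
The number of selected features, `k`, is itself a hyperparameter, and it can be tuned together with the classifier through the same `step__parameter` naming. The sketch below is one way to do that; the candidate values and the dataset size are assumptions chosen to keep the example fast, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('feature_selection', SelectKBest(score_func=f_classif)),
    ('classifier', LogisticRegression(random_state=42)),
])

# Illustrative candidate values for a preprocessing step and for the model
param_grid = {
    'feature_selection__k': [5, 10, 15, 20],
    'classifier__C': [0.1, 1, 10],
}

# Feature selection is refit inside every fold, so the labels of the
# validation fold never influence which features are kept
grid_search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```
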

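Table: Common Scikit-learn tools for model selection and evaluation

Tool/Method	Description
`train_test_split`	Splits a dataset into training and test sets.
`cross_val_score`	Evaluates model performance with cross-validation.
`GridSearchCV`	Tunes hyperparameters with grid search.
`RandomizedSearchCV`	Tunes hyperparameters with randomized search.
`Pipeline`	Chains data-processing steps and model training into one complete machine learning workflow.
Evaluation metrics (Accuracy, Precision, Recall, F1-score, ROC AUC, MSE, MAE, R-squared)	Measure different aspects of model performance; the right choice depends on the task and the data.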

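Conclusion

Today's lecture covered advanced uses of Scikit-learn for model selection, hyperparameter tuning, and Pipeline construction. With these techniques you can build and optimize machine learning models more effectively and raise the quality of your projects. I hope you find this material useful. Thank you all for attending!

Model selection, hyperparameter tuning, and Pipelines are all key parts of a machine learning project

Model selection means choosing evaluation metrics and a cross-validation scheme that fit the data and the business requirements, then comparing candidate models. Hyperparameter tuning means choosing a suitable search method to find the best combination of hyperparameters. Pipelines simplify the development workflow and improve code readability and maintainability.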
