6.3 Building and Evaluating Classification Models


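Classification means constructing a model that takes a sample's feature values as input and outputs the corresponding class, mapping each sample to one of a set of predefined classes. Because a classification model is built on a dataset that already carries class labels, it is a form of supervised learning. In practice, classification algorithms are used for behavior analysis, object recognition, image detection, and similar tasks.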

6.3.1 Building a Classification Model with sklearn Estimators

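Data analysis offers many classification algorithms whose underlying principles differ widely: nearest-neighbor methods based on distances between samples, decision trees based on the information entropy of features, random forests based on bagging, and gradient-boosted classification trees based on boosting. The implementation workflow, however, is much the same for all of them, as shown in the figure.
[Figure: workflow for building a classification model]
sklearn provides a large number of classification algorithms, spread across different modules. Commonly used ones are listed in the table below: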
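To illustrate how the workflow stays the same across algorithms, here is a minimal sketch that fits the four algorithm families mentioned above, plus an SVM, through the same `fit` / `score` calls. The module paths are the ones sklearn uses. The dataset, the split, the default hyperparameters, and the names `classifiers` and `scores` are assumptions made for this illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumed setup: the breast cancer dataset with an 80/20 split
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each key records the module in which the estimator lives
classifiers = {
    "sklearn.neighbors.KNeighborsClassifier": KNeighborsClassifier(),
    "sklearn.tree.DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "sklearn.ensemble.RandomForestClassifier": RandomForestClassifier(random_state=0),
    "sklearn.ensemble.GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "sklearn.svm.SVC": SVC(),
}

# The same two calls work for every estimator: fit, then score
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.3f}")
```

Because every estimator exposes the same interface, swapping one algorithm for another usually means changing only the line that constructs it.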

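1. Building an SVM model with an sklearn estimator

2. Confusion matrix and accuracy of the classification results

The code below covers both steps.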

# 1. Build an SVM model with an sklearn estimator
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X = cancer['data']
y = cancer['target']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Standardize the features (fit the scaler on the training set only)
stdScaler = StandardScaler().fit(X_train)
X_trainstd = stdScaler.transform(X_train)
X_teststd = stdScaler.transform(X_test)
# Build the SVM model
svm = SVC().fit(X_trainstd, y_train)
print(svm)
# Predict the test set
y_pre = svm.predict(X_teststd)
print(y_pre[:20])  # print the first 20 predictions

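# 2. Confusion matrix and accuracy of the classification results
n_correct = np.sum(y_test == y_pre)  # number of correct predictions
n_total = y_test.shape[0]  # total number of test samples
print("Correct predictions:", n_correct)
print("Incorrect predictions:", n_total - n_correct)
print("Accuracy:", n_correct / n_total)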
# 3. Plot the confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Labels follow the sorted class order: 0 = malignant, 1 = benign
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pre,
    display_labels=cancer['target_names'], cmap=plt.cm.Greens, colorbar=True)
# Other colormaps: plt.cm.Reds (red), plt.cm.Blues (blue), plt.cm.gray (gray)
plt.title("Confusion Matrix")
# To use a Chinese title instead, first set a font that can render Chinese:
# from matplotlib import rcParams
# rcParams['font.sans-serif'] = 'SimHei'
# plt.title("混淆矩阵")
plt.show()

[Figure: confusion matrix of the SVM predictions on the test set]
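The confusion matrix can also be obtained as a plain array rather than a plot, which makes the link to accuracy explicit. The following is a minimal sketch, assuming the same dataset, split, and default SVC as in the code above. The variable names (`cm`, `tn`, `fp`, `fn`, `tp`) are chosen for illustration, and label 1 (benign) is treated as the positive class because sklearn orders the labels as [0, 1].

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardize with statistics from the training set, then fit and predict
scaler = StandardScaler().fit(X_train)
model = SVC().fit(scaler.transform(X_train), y_train)
y_pred = model.predict(scaler.transform(X_test))

# Rows are true classes, columns are predicted classes, in label order [0, 1]
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the diagonal (correct predictions) divided by the total count
accuracy = (tn + tp) / cm.sum()
print(cm)
print("Accuracy from the confusion matrix:", accuracy)
print("accuracy_score:", accuracy_score(y_test, y_pred))
```

The two printed accuracy values should agree, since both count the same correct predictions.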

6.3.2 Evaluating Classification Models

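The accuracy a classification model achieves on the test set does not, by itself, reflect the model's performance well, especially when the classes are imbalanced. To judge a predictive model properly, compare its predictions with the true labels and compute metrics such as precision, recall, the F1 score, and Cohen's Kappa coefficient. Common evaluation metrics for classification models are listed in the table. For the first four evaluation methods, a higher score is better, and they are used in essentially the same way.
The metrics module of sklearn also provides a function, classification_report, that outputs an evaluation report for a classification model.
[Table: common evaluation metrics for classification models]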
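To make the metrics named above concrete, here is a small sketch that computes precision, recall, F1, and Cohen's Kappa by hand from the four confusion-matrix counts. The ten labels in `y_true` and `y_pred` are made-up toy data, not results from the SVM model, and the formulas are the standard textbook definitions for the binary case with class 1 as the positive class.

```python
import numpy as np
from sklearn.metrics import (cohen_kappa_score, f1_score,
                             precision_score, recall_score)

# Toy labels, invented for illustration only
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0])

# Confusion-matrix counts, with class 1 as the positive class
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
n = len(y_true)

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Cohen's Kappa: observed agreement corrected for chance agreement
p_o = (tp + tn) / n
p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
kappa = (p_o - p_e) / (1 - p_e)

print("precision:", precision, "sklearn:", precision_score(y_true, y_pred))
print("recall:", recall, "sklearn:", recall_score(y_true, y_pred))
print("F1:", f1, "sklearn:", f1_score(y_true, y_pred))
print("kappa:", kappa, "sklearn:", cohen_kappa_score(y_true, y_pred))
```

Each hand-computed value should match the corresponding sklearn function, which is a quick way to check one's understanding of the definitions.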

# Build an SVM model with an sklearn estimator
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, cohen_kappa_score
cancer = load_breast_cancer()
X = cancer['data']
y = cancer['target']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)
# Standardize the features (fit the scaler on the training set only)
stdScaler = StandardScaler().fit(X_train)
X_trainstd = stdScaler.transform(X_train)
X_teststd = stdScaler.transform(X_test)
# Build the SVM model
svm = SVC().fit(X_trainstd, y_train)
# print(svm)
# Predict the test set
y_pre = svm.predict(X_teststd)
# print(y_pre[:20])  # print the first 20 predictions

# 1. Common evaluation metrics for classification models
print("Accuracy:", accuracy_score(y_test, y_pre))
print("Precision:", precision_score(y_test, y_pre))
print("Recall:", recall_score(y_test, y_pre))
print("F1:", f1_score(y_test, y_pre))
print("Cohen's Kappa:", cohen_kappa_score(y_test, y_pre))

# 2. Classification report
from sklearn.metrics import classification_report
print("Classification report:\n", classification_report(y_test, y_pre))

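# 3. Plot the ROC curve
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
# Use continuous decision scores rather than hard 0/1 predictions;
# hard labels would reduce the ROC curve to a single operating point
y_score = svm.decision_function(X_teststd)
# Compute the x-axis (false positive rate) and y-axis (true positive rate)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
plt.figure(figsize=(10, 6))
plt.xlim(0, 1)  # x-axis range
plt.ylim(0.0, 1.1)  # y-axis range
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(fpr, tpr, linewidth=2, color='red')
plt.show()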

[Figure: ROC curve of the SVM model]
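A ROC curve is often summarized by the area under it (AUC). The sketch below, under the same assumed dataset, split, and default SVC as in the code above, computes AUC with `roc_auc_score` from the continuous output of `decision_function`. For comparison it also builds the curve from hard 0/1 predictions, which yields at most three points and therefore hides how the model behaves at other thresholds. The variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=22)

scaler = StandardScaler().fit(X_train)
X_test_std = scaler.transform(X_test)
model = SVC().fit(scaler.transform(X_train), y_train)

# Continuous scores: signed distance of each sample from the decision boundary
y_score = model.decision_function(X_test_std)
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

# Hard labels give only the trivial end points plus one operating point
fpr_hard, tpr_hard, _ = roc_curve(y_test, model.predict(X_test_std))

print("AUC from decision scores:", auc)
print("ROC points from scores:", len(fpr))
print("ROC points from hard labels:", len(fpr_hard))
```

An AUC close to 1 indicates that the scores rank positive samples above negative ones almost everywhere, while 0.5 corresponds to random ranking.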