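Classifier Confidence Estimation: A Complete Implementation from Decision Functions to Probability Outputs
Why Uncertainty Quantification Matters
In machine learning practice, a predicted class alone is often not enough; knowing how confident the model is in that prediction matters just as much. This uncertainty information is useful in several areas: flagging doubtful cases for physicians in medical diagnosis, identifying anomalous transactions in financial risk control, choosing the most valuable samples to label in active learning, and improving model interpretability.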
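Decision Functions and Probability Outputs
Classifiers usually express confidence in one of two ways. A decision function returns a real-valued score that reflects the signed distance of a sample from the decision boundary. A probability output maps that score into the interval [0, 1] and reads more directly as the likelihood of class membership. For logistic regression the two are linked exactly by the sigmoid function; for other linear models, such as SVMs, a sigmoid mapping has to be fitted separately.
In mathematical terms, suppose a linear classifier outputs \( f(x) = w^T x + b \). The predicted class is given by the sign of \( f(x) \), and the probability form is \( p(x) = \frac{1}{1+e^{-f(x)}} \).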
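The relation between score, probability and predicted label can be checked numerically. The sketch below is only an illustration: the weights `w`, bias `b` and input `x` are hypothetical values, not taken from any fitted model.

```python
import math

def sigmoid(score: float) -> float:
    """Map a real-valued decision score to a probability in [0, 1]."""
    # Branch on the sign so that exp() is never called on a large positive number.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

# Hypothetical weights, bias and input, chosen only for illustration.
w = [0.8, -0.5]
b = 0.1
x = [1.0, 2.0]

f_x = sum(wi * xi for wi, xi in zip(w, x)) + b  # decision score f(x) = w^T x + b
p_x = sigmoid(f_x)                              # probability of the positive class
label = 1 if f_x >= 0 else 0                    # equivalent to thresholding p_x at 0.5

print(f"score={f_x:.3f}  probability={p_x:.3f}  label={label}")
```

A score of exactly zero maps to a probability of 0.5, which is why taking the sign of the score and thresholding the probability at 0.5 give the same label.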
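How Mainstream Classifiers Implement It
Logistic Regression
This model outputs probabilities natively. The probability is the sigmoid of the linear score, \( p(y=1|x) = \frac{1}{1+e^{-(w^T x+b)}} \), and the decision boundary is the set of points where \( w^T x + b = 0 \), that is, where \( p(y=1|x) = 0.5 \).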
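Support Vector Machines
The raw output of an SVM is the decision function value \( f(x)=w^T x+b \), which can be converted into a probability with Platt scaling: \( p(y=1|x) = \frac{1}{1+e^{A f(x)+B}} \). The parameters A and B are fitted by maximum likelihood on decision values obtained from held-out or cross-validated data. With this sign convention A is typically negative, so larger scores yield higher probabilities.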
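To make the fitting step concrete, here is a minimal sketch of Platt scaling using the same sign convention as the formula above. It uses plain gradient descent on the log loss as a stand-in; this is not the optimizer used by scikit-learn or libsvm, and the helper name `fit_platt`, the learning rate, and the toy scores and targets are all assumptions made for illustration.

```python
import numpy as np

def fit_platt(scores, targets, lr=0.1, n_iter=2000):
    """Fit A, B in p = 1 / (1 + exp(A * f + B)) by gradient descent on log loss."""
    scores = np.asarray(scores, dtype=float)
    targets = np.asarray(targets, dtype=float)
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        # Gradient of the mean log loss with respect to A and B.
        grad_A = np.mean((targets - p) * scores)
        grad_B = np.mean(targets - p)
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

# Hypothetical held-out decision values and their true binary labels.
scores = np.array([-2.0, -1.5, -0.5, -0.2, 0.3, 0.4, 1.2, 2.5])
targets = np.array([0, 0, 0, 1, 1, 0, 1, 1])

A, B = fit_platt(scores, targets)
calibrated = 1.0 / (1.0 + np.exp(A * scores + B))
print("A =", A, "B =", B)
print("calibrated probabilities:", np.round(calibrated, 3))
```

In practice the scores passed to the fit should come from data the SVM was not trained on, otherwise the calibrated probabilities tend to be overconfident.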
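Random Forests
An ensemble can obtain probabilities by voting: \( p(y=c|x)=\frac{1}{N}\sum_{i=1}^N \mathbb{I}(T_i(x)=c) \), where \( T_i \) is the i-th of N trees, so the probability is the fraction of trees that predict class c. Note that scikit-learn's RandomForestClassifier instead averages the class probabilities predicted by the individual trees; the two coincide only when every leaf is pure.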
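The vote-fraction formula can be reproduced with a few lines of NumPy. The sketch below uses a hand-written, hypothetical matrix of hard tree predictions rather than a trained forest.

```python
import numpy as np

# Hypothetical hard predictions of 5 trees for 3 samples (class ids 0..2).
tree_votes = np.array([
    [0, 1, 2],
    [0, 1, 1],
    [0, 2, 1],
    [1, 1, 1],
    [0, 1, 2],
])  # shape: (n_trees, n_samples)
n_classes = 3

# For each class c, the fraction of trees whose prediction equals c.
vote_proba = np.stack(
    [(tree_votes == c).mean(axis=0) for c in range(n_classes)], axis=1
)  # shape: (n_samples, n_classes)

print(vote_proba)
```

Each row sums to one, and the third sample, where the trees split three to two, gets the least decisive distribution.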
Hands-On Implementation
Data Preparation
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the Iris data; the column names below must match those in iris.csv.
dataset = pd.read_csv("iris.csv")
features = dataset[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
labels = dataset['species']
# Hold out 20% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)
# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
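Probability Prediction with Logistic Regression
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train_scaled, y_train)
# Each row sums to 1; the columns follow the order of log_reg.classes_.
proba_log = log_reg.predict_proba(X_test_scaled)
print("Logistic regression probabilities for the first 5 samples:", proba_log[:5])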
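SVM Probability Calibration
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV
# probability=True is not needed here: CalibratedClassifierCV calibrates the
# decision_function output itself, so SVC's internal Platt fit would be wasted work.
svm_base = SVC()
# method="sigmoid" is Platt scaling, fitted with 5-fold cross-validation.
svm_calibrated = CalibratedClassifierCV(svm_base, method="sigmoid", cv=5)
svm_calibrated.fit(X_train_scaled, y_train)
proba_svm = svm_calibrated.predict_proba(X_test_scaled)
print("SVM probabilities for the first 5 samples:", proba_svm[:5])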
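Random Forest Probability Output
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
# predict_proba averages the class probabilities of the individual trees.
proba_rf = rf_model.predict_proba(X_test_scaled)
print("Random forest probabilities for the first 5 samples:", proba_rf[:5])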
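Uncertainty Scoring Function
import numpy as np
def compute_uncertainty(proba_array):
    # Shannon entropy of each row: 0 for a one-hot row, largest for a uniform row.
    # The per-row standard deviation behaves the other way around (it is highest
    # for confident predictions), so it is not used as the score here.
    p = np.clip(proba_array, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)
uncertainty_scores = compute_uncertainty(proba_rf)
print("Uncertainty of the first 5 samples:", uncertainty_scores[:5])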
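Several scores are commonly used to turn a row of class probabilities into a single uncertainty value. The sketch below compares three of them on hand-written, hypothetical rows; the helper names are illustrative and not part of any library. It also prints the per-row standard deviation to show that it moves in the opposite direction.

```python
import numpy as np

def least_confidence(proba):
    """1 - max probability: 0 when the model is fully confident."""
    return 1.0 - proba.max(axis=1)

def margin_score(proba):
    """1 - (best - second best): close to 1 when the top two classes tie."""
    top2 = np.sort(proba, axis=1)[:, -2:]
    return 1.0 - (top2[:, 1] - top2[:, 0])

def entropy_score(proba):
    """Shannon entropy in nats: largest for a uniform distribution."""
    p = np.clip(proba, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

# Hypothetical probability rows, ordered from confident to uncertain.
rows = np.array([
    [0.98, 0.01, 0.01],  # confident
    [0.50, 0.45, 0.05],  # two classes nearly tied
    [0.34, 0.33, 0.33],  # close to uniform
])

print("least confidence:", least_confidence(rows))
print("margin score:    ", margin_score(rows))
print("entropy:         ", entropy_score(rows))
print("std (inverse!):  ", rows.std(axis=1))
```

All three scores grow from the first row to the last, while the standard deviation shrinks, so ranking samples by a high standard deviation would select the most confident predictions rather than the most uncertain ones.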
Visualizing the Uncertainty Distribution
import matplotlib.pyplot as plt
plt.hist(uncertainty_scores, bins=20, edgecolor='black')
plt.xlabel("Uncertainty Score")
plt.ylabel("Frequency")
plt.title("Uncertainty Distribution of Random Forest")
plt.show()
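Selecting High-Uncertainty Samples
# argsort is ascending, so the last five indices have the highest scores.
top_uncertain_idx = np.argsort(uncertainty_scores)[-5:]
high_uncertain_samples = X_test_scaled[top_uncertain_idx]
print("Features of the 5 most uncertain samples:", high_uncertain_samples)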
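Cross-Validated Probability Estimates
from sklearn.model_selection import cross_val_predict
# Out-of-fold probabilities for every sample (5-fold by default). The unscaled
# features are used here, which is fine for tree-based models.
cv_proba = cross_val_predict(rf_model, features, labels, method='predict_proba')
cv_uncertainty = compute_uncertainty(cv_proba)
print("First 5 cross-validated uncertainty values:", cv_uncertainty[:5])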
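Sample Selection for Active Learning
# Query the five samples with the highest cross-validated uncertainty.
active_learn_idx = np.argsort(cv_uncertainty)[-5:]
active_samples = features.iloc[active_learn_idx]
print("Indices of samples to label:", active_samples.index.tolist())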
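The selection step generalizes to any query size. The helper below is a small sketch of that idea; the name `select_most_uncertain` and the score values are hypothetical.

```python
import numpy as np

def select_most_uncertain(scores, k):
    """Return the indices of the k largest scores, most uncertain first."""
    scores = np.asarray(scores)
    k = min(k, scores.size)  # never request more samples than exist
    return np.argsort(scores)[::-1][:k]

# Hypothetical uncertainty scores for five unlabeled samples.
scores = np.array([0.10, 0.90, 0.30, 0.75, 0.05])
query_idx = select_most_uncertain(scores, k=2)
print("query these samples next:", query_idx.tolist())
```

Returning the indices in descending order of uncertainty makes it easy to label only a prefix of the batch when the annotation budget runs short.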
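Application to Anomaly Detection
# Flag samples whose uncertainty score exceeds a threshold. The value 0.2 is
# illustrative; a suitable threshold depends on the scale of the score.
anomaly_threshold = 0.2
anomaly_mask = uncertainty_scores > anomaly_threshold
anomalies = X_test_scaled[anomaly_mask]
print(f"Detected {len(anomalies)} anomalous samples")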
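A fixed threshold only makes sense for one particular score scale. One possible alternative, sketched below under the assumption that roughly the top 5% of samples should be reviewed, is to derive the threshold from a quantile of the observed scores. The random scores are a stand-in for real uncertainty values.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(200)  # stand-in for real uncertainty scores

# Use the 95th percentile of the observed scores as the cut-off.
threshold = np.quantile(scores, 0.95)
flagged = scores > threshold

print(f"threshold={threshold:.3f}, flagged {flagged.sum()} of {scores.size} samples")
```

The review fraction is a budget decision rather than a statistical one, so the 0.95 quantile should be treated as a tunable assumption.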
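Model Interpretability Analysis
import shap
shap_explainer = shap.TreeExplainer(rf_model)
# For multiclass models the return type of shap_values may depend on the shap
# version (a list of per-class arrays in older releases, a single 3-D array in newer ones).
shap_values = shap_explainer.shap_values(X_test_scaled)
# X_test_scaled is a NumPy array, so pass the column names explicitly for readable labels.
shap.summary_plot(shap_values, X_test_scaled, feature_names=list(features.columns), plot_type="bar")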
End-to-End Workflow
# Full flow in one place: load, split, scale, train, predict, score uncertainty.
data = pd.read_csv("iris.csv")
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X_tr, X_te, y_tr, y_te = train_test_split(data[feature_cols], data['species'], test_size=0.2, random_state=42)
# Fit the scaler on the training split only and reuse it for the test split.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr_s, y_tr)
proba = model.predict_proba(X_te_s)
# Shannon entropy per sample: higher means a less certain prediction.
p = np.clip(proba, 1e-12, 1.0)
unc = -np.sum(p * np.log(p), axis=1)
# X_te is a DataFrame, so positional row selection needs .iloc.
print("Features of the most uncertain samples:\n", X_te.iloc[np.argsort(unc)[-5:]])