Logistic Regression for Binary and Multiclass Classification [pyspark]

Last edited by regan on 2018-4-23 09:55

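1. Among classification models, logistic regression is one of the most commonly used. It is a generalized linear model, and despite the word "regression" in its name it is a classification model. Logistic regression has several variants: it is most often used for binary classification, but it also applies to multiclass problems. For K classes, Spark's RDD-based API (spark.mllib) builds the multiclass model from K-1 binary logistic regression models regressed against a reference class. That API implements two training algorithms, mini-batch gradient descent and the quasi-Newton method L-BFGS, and the official documentation recommends L-BFGS because it converges faster. The examples in this post use the DataFrame-based pyspark.ml.classification.LogisticRegression.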
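From the loss function we can compute the gradient with respect to the weights, and then update the weights along the negative gradient direction until convergence. Once the model is trained, the prediction for an input feature vector X uses the logistic (sigmoid) function f(z) = 1 / (1 + e^(-z)),

where z = W'X + b. If f(z) > 0.5 the sample is predicted as positive; if f(z) < 0.5 it is predicted as negative.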
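
The decision rule can be sketched in plain Python. This is only a minimal illustration, not Spark's implementation: the function names `sigmoid` and `predict`, the example weights, and the choice to treat f(z) = 0.5 as negative are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # for negative z, exp(z) cannot overflow
    return e / (1.0 + e)


def predict(weights, bias, features, threshold=0.5):
    """Return (label, probability) for one feature vector.

    z = W'X + b; the label is 1 when f(z) exceeds the threshold, else 0.
    """
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return (1 if p > threshold else 0), p


# Hypothetical weights and bias, chosen only for illustration
W = [0.5, -1.0]
b = 0.25
label, prob = predict(W, b, [2.0, 0.5])  # z = 1.0 - 0.5 + 0.25 = 0.75
print(label, prob)
```

Passing a different `threshold` shifts the trade-off between false positives and false negatives without retraining the weights.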
The sigmoid function and its derivative
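
The figure that originally appeared here is not available. The standard definitions, written with the same f and z as in the text, are:

```latex
f(z) = \frac{1}{1 + e^{-z}}, \qquad
f'(z) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = f(z)\,\bigl(1 - f(z)\bigr)
```

The compact form f'(z) = f(z)(1 - f(z)) is what keeps the gradient of the loss simple.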

The logistic loss function
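
The original formula is missing here. One common way to write it is as a likelihood. The notation is an assumption, not taken from the post: m training samples (x_i, y_i) with labels y_i in {0, 1}, and z_i = W'x_i + b.

```latex
L(W, b) = \prod_{i=1}^{m} f(z_i)^{\,y_i}\,\bigl(1 - f(z_i)\bigr)^{\,1 - y_i},
\qquad z_i = W^{\top} x_i + b
```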

Taking the logarithm
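
The original formula is missing here as well. Assuming the likelihood L(W, b) is the product over m samples of f(z_i)^{y_i} (1 - f(z_i))^{1 - y_i}, with y_i in {0, 1} and z_i = W'x_i + b, taking the logarithm turns the product into a sum. Negating and averaging then gives a loss J to minimize; the symbol J and the 1/m scaling are choices made for this sketch.

```latex
\ell(W, b) = \log L(W, b)
  = \sum_{i=1}^{m} \Bigl[\, y_i \log f(z_i) + (1 - y_i) \log\bigl(1 - f(z_i)\bigr) \Bigr],
\qquad
J(W, b) = -\frac{1}{m}\,\ell(W, b)
```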

Partial derivatives and the parameter update
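
The original formula is missing here. As a sketch, take J(W, b) to be the average negative log-likelihood over m samples (x_i, y_i) with y_i in {0, 1} and z_i = W'x_i + b. Using f'(z) = f(z)(1 - f(z)), the partial derivatives and a plain gradient-descent step are as follows; the learning rate \eta and the index j over features are notation introduced for this example.

```latex
\frac{\partial J}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \bigl(f(z_i) - y_i\bigr)\, x_{ij},
\qquad
\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \bigl(f(z_i) - y_i\bigr)

w_j \leftarrow w_j - \eta\, \frac{\partial J}{\partial w_j},
\qquad
b \leftarrow b - \eta\, \frac{\partial J}{\partial b}
```

Each step moves the parameters along the negative gradient, as described earlier in the post.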

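Next we use the binary logistic regression algorithm. regParam sets the regularization strength, elasticNetParam sets the mix between the L1 and L2 penalties (0 is pure L2, 1 is pure L1), and maxIter=10 caps the algorithm at 10 iterations.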
from pyspark.ml.classification import LogisticRegression

# `spark` is assumed to be an existing SparkSession (for example, the one the pyspark shell provides)
training = spark.read.format("libsvm").load("/datas/lib_svm.txt")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the binomial model and print its parameters
lrModel = lr.fit(training)
print("Coefficients: " + str(lrModel.coefficients))
print("Intercept: " + str(lrModel.intercept))
# The multinomial family can also be used on the same binary data
mlr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8, family="multinomial")
mlrModel = mlr.fit(training)
print("Multinomial coefficients: " + str(mlrModel.coefficientMatrix))
print("Multinomial intercepts: " + str(mlrModel.interceptVector))

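How do we judge whether a trained model is good? For classification, precision and recall are commonly used, and the F1 score derived from them is their harmonic mean. Another performance measure is the ROC curve, whose horizontal axis is the false positive rate (FPR) and whose vertical axis is the true positive rate (TPR); the closer the curve is to the top-left corner, the better the classifier.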
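
To make these metrics concrete, here is a small plain-Python sketch that computes them from confusion-matrix counts. The function names and the counts are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def roc_point(tp, fp, tn, fn):
    """One point (FPR, TPR) of the ROC curve for a fixed threshold."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    return fpr, tpr


# Hypothetical counts: 8 true positives, 2 false positives,
# 6 true negatives, 4 false negatives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
fpr, tpr = roc_point(tp=8, fp=2, tn=6, fn=4)
print(p, r, f1, fpr, tpr)
```

Sweeping the decision threshold and plotting the resulting (FPR, TPR) points traces out the ROC curve.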

LogisticRegression provides a LogisticRegressionTrainingSummary for measuring performance over the training run. The summary object's fMeasureByThreshold gives the F1 score computed at each candidate threshold.
from pyspark.ml.classification import LogisticRegression

# Summary of the binary model trained above
trainingSummary = lrModel.summary
# Objective value at each iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)
# ROC curve as a DataFrame, and the area under it
trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))
# Pick the threshold that maximizes the F-Measure
fMeasure = trainingSummary.fMeasureByThreshold
maxFMeasure = fMeasure.groupBy().max('F-Measure').select('max(F-Measure)').head()
bestThreshold = fMeasure.where(fMeasure['F-Measure'] == maxFMeasure['max(F-Measure)']).select('threshold').head()['threshold']
lr.setThreshold(bestThreshold)

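Finally, fMeasureByThreshold is used to pick the threshold at which the F-Measure is largest, and setThreshold applies that best-performing threshold. Note that the code calls setThreshold on the estimator lr, so the new threshold applies to models fit afterwards.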

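2. Besides binary classification, logistic regression can also be used for multiclass tasks. The model outputs a probability for each class, computed with the softmax function.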
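
The formula that originally appeared here is missing. In its standard form, for K classes with a weight vector w_k and intercept b_k per class (this notation is an assumption for the sketch), the softmax probability of class k is:

```latex
P(Y = k \mid x) = \frac{\exp\bigl(w_k^{\top} x + b_k\bigr)}
                       {\sum_{j=0}^{K-1} \exp\bigl(w_j^{\top} x + b_j\bigr)},
\qquad k = 0, 1, \dots, K - 1
```

The K probabilities sum to 1, and the predicted class is the one with the largest probability.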

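Training minimizes the negative log-likelihood plus an elastic-net penalty: a combination of L1 and L2 regularization on the weights, mixed according to alpha, which helps avoid overfitting.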
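
As a rough illustration of the penalty term, the sketch below computes an elastic-net penalty in plain Python. It assumes the usual parametrization, lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2), with lambda playing the role of regParam and alpha the role of elasticNetParam; the function name and the sample weights are hypothetical.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: reg_param * (alpha * L1 + (1 - alpha) / 2 * L2^2)."""
    l1 = sum(abs(w) for w in weights)        # ||w||_1
    l2_squared = sum(w * w for w in weights)  # ||w||_2^2
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


# Hypothetical weights, using the same regParam / elasticNetParam as the examples
w = [3.0, -4.0]
mixed = elastic_net_penalty(w, reg_param=0.3, elastic_net_param=0.8)
pure_l2 = elastic_net_penalty(w, reg_param=0.3, elastic_net_param=0.0)
pure_l1 = elastic_net_penalty(w, reg_param=0.3, elastic_net_param=1.0)
print(mixed, pure_l2, pure_l1)
```

With alpha near 1 the penalty behaves like L1 and tends to push weights to exactly zero, while with alpha near 0 it behaves like L2 and shrinks weights smoothly.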

Next, let's look at a multiclass example:
from pyspark.ml.classification import LogisticRegression
training = spark.read.format("libsvm").load("/datas/multi_classification.txt")
lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(training)
print("Coefficients: \n" + str(lrModel.coefficientMatrix))
print("Intercept: " + str(lrModel.interceptVector))

trainingSummary = lrModel.summary
# Objective value at each iteration
objectiveHistory = trainingSummary.objectiveHistory
print("objectiveHistory:")
for objective in objectiveHistory:
    print(objective)

# Per-label metrics
print("False positive rate by label:")
for i, rate in enumerate(trainingSummary.falsePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

print("Precision by label:")
for i, prec in enumerate(trainingSummary.precisionByLabel):
    print("label %d: %s" % (i, prec))

print("Recall by label:")
for i, rec in enumerate(trainingSummary.recallByLabel):
    print("label %d: %s" % (i, rec))

print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s" % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

