I have been studying Spark's classification algorithms for our log-text classification work, but could not find a corresponding demo on the official site or any of the major sites.
[mw_shl_code=scala,true]import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val sparkConf = new SparkConf().setAppName("DecisionTree1").setMaster("local[2]")
val sc = new SparkContext(sparkConf)
val data1 = sc.textFile("/XXX/sample_libsvm_data.txt")
val hashingTF = new HashingTF()
// Each line is a label followed by whitespace-separated terms (see the sample data below)
val data = data1.map { line =>
  val parts = line.split("\\s+")
  LabeledPoint(parts(0).toDouble, hashingTF.transform(parts.tail))
}
val splits = data.randomSplit(Array(0.9, 0.1))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 5
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
println("--------------------train--------------------")
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
println("--------------------Test--------------------")
// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  (point.label, model.predict(point.features))
}
val testErr = labelAndPreds.filter { case (l, p) => l != p }.count.toDouble / testData.count
println(s"Test Error = $testErr")
// Predict the class of a new, unseen document
val testStr = Array("l", "o", "k")
val prediction = model.predict(hashingTF.transform(testStr))
println("-----------------------------------------")
println(prediction)
println("-----------------------------------------")
[/mw_shl_code]
Sample data:
2 f g k m
3 o p s d
4 i l o v
4 i l o w
4 i l o f
4 i l o k
4 i l o n
4 i l o a
2 f g i m
2 f g o m
2 f g u m
2 f g w m
3 o k s d
3 o m s d
3 o s s d
3 o i s d
The classification algorithms only accept Double-valued features, so the core problem is converting strings into a vector of Doubles. Spark 1.3.0 ships HashingTF to do that conversion, and once you use it the program turns out to be very simple.
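To see what HashingTF is doing under the hood, here is a minimal plain-Scala sketch of the hashing trick, assuming Spark 1.3's default behavior (each term's `hashCode` taken modulo the feature dimensionality, 2^20 by default); the object name `HashingSketch` and its helpers are mine, not part of Spark:

```scala
// Minimal sketch of the hashing trick behind HashingTF (assumption: it
// mirrors the default String.hashCode-modulo-numFeatures bucketing).
object HashingSketch {
  // HashingTF's default feature-vector dimensionality in Spark 1.3
  val numFeatures: Int = 1 << 20

  // Map a term to a bucket in [0, numFeatures), keeping the result non-negative
  // even when hashCode is negative (like Spark's nonNegativeMod helper).
  def indexOf(term: String): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  def main(args: Array[String]): Unit = {
    val doc = Seq("i", "l", "o", "v")
    // Term frequencies keyed by hashed index -- conceptually the sparse
    // vector that hashingTF.transform(parts.tail) produces above.
    val tf: Map[Int, Double] = doc.groupBy(indexOf).mapValues(_.size.toDouble).toMap
    tf.foreach { case (idx, count) => println(s"$idx -> $count") }
  }
}
```

Because the mapping is a pure function of the term, the same token always lands in the same index, which is why the model trained on hashed vectors can score a new document like `Array("l", "o", "k")` without any shared vocabulary table.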