Spark MLlib Algorithms: A Worked KMeans Example [Code Download Included]

This post was last edited by xuanxufeng on 2017-4-4 15:03
Guiding questions

1. What is KMeans?
2. What can the KMeans algorithm be used for?
3. How is KMeans implemented in code?

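1. KMeans concept
   KMeans is a partition-based clustering method. Given a sample data set Sample and the number of clusters K to form, it groups the samples into K clusters. Similarity is measured by the "distance" between a data point and a cluster center. Common choices of distance include the absolute distance, the Euclidean distance, and the Minkowski distance; the distance used here is the Euclidean distance, which is the ordinary straight-line distance between two points in m-dimensional space.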
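
To make the distance measures named above concrete, here is a minimal plain-Scala sketch. It does not use Spark, and the object name DistanceSketch and its helper names are made up for this illustration. It treats the absolute distance as the Minkowski distance of order p = 1 and the Euclidean distance as the case p = 2.

```scala
// Illustrative only: DistanceSketch and its method names are hypothetical, not part of MLlib.
object DistanceSketch {
  // Minkowski distance of order p between two points of equal dimension.
  def minkowski(a: Array[Double], b: Array[Double], p: Double): Double = {
    require(a.length == b.length, "points must have the same dimension")
    val sum = a.zip(b).map { case (x, y) => math.pow(math.abs(x - y), p) }.sum
    math.pow(sum, 1.0 / p)
  }

  // Absolute distance: the Minkowski distance with p = 1.
  def absolute(a: Array[Double], b: Array[Double]): Double = minkowski(a, b, 1.0)

  // Euclidean distance: the Minkowski distance with p = 2, written out directly.
  def euclidean(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "points must have the same dimension")
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)
  }

  def main(args: Array[String]): Unit = {
    val p1 = Array(0.0, 0.0, 0.0)
    val p2 = Array(3.0, 4.0, 0.0)
    println("absolute  = " + absolute(p1, p2))
    println("euclidean = " + euclidean(p1, p2))
  }
}
```

For the two points in main, the absolute distance is 3 + 4 = 7 and the Euclidean distance is the square root of 9 + 16, which is 5.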
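2. KMeans worked example
2.1 Data preparation

The Spark source package downloaded from the official site has an mllib folder under its data folder. It contains kmeans_data.txt, whose contents are: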
0.0 0.0 0.0
0.1 0.1 0.1
0.2 0.2 0.2
9.0 9.0 9.0
9.1 9.1 9.1
9.2 9.2 9.2
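2.2 Implementation outline
1. Set up the runtime environment;
2. Load the kmeans_data.txt data set;
3. Cluster the data set into 2 clusters, running up to 20 iterations, to build the model;
4. Print the model's two cluster centers to the console;
5. Evaluate the model with the sum of squared errors;
6. Test the model on individual points;
7. Cross-evaluation 1: return only the results;
8. Cross-evaluation 2: return the data set together with the results.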
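
Before turning to the Spark code, it may help to see what the clustering step does conceptually. The following plain-Scala sketch runs the classic assign-then-recompute loop (Lloyd's algorithm) on the six sample points. It is not how MLlib is implemented: the object name LloydSketch, its helper names, and the choice of the first and last points as initial centers are all assumptions made for illustration, and MLlib selects its initial centers differently.

```scala
// Illustrative only: a tiny in-memory k-means loop, not MLlib's implementation.
object LloydSketch {
  type Point = Array[Double]

  // The six rows of kmeans_data.txt.
  val samplePoints: Seq[Point] = Seq(
    Array(0.0, 0.0, 0.0), Array(0.1, 0.1, 0.1), Array(0.2, 0.2, 0.2),
    Array(9.0, 9.0, 9.0), Array(9.1, 9.1, 9.1), Array(9.2, 9.2, 9.2))

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def closest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => squaredDistance(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point = {
    val dim = points.head.length
    Array.tabulate(dim)(d => points.map(_(d)).sum / points.size)
  }

  // Repeat: assign every point to its closest center, then move each center to the mean of its points.
  def run(points: Seq[Point], initial: Seq[Point], maxIterations: Int): Seq[Point] = {
    var centers = initial
    for (_ <- 1 to maxIterations) {
      val groups = points.groupBy(p => closest(p, centers))
      // A center that attracts no points keeps its previous position.
      centers = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
    centers
  }

  def main(args: Array[String]): Unit = {
    // Assumption: seed with the first and last sample points.
    val initial = Seq(samplePoints.head, samplePoints.last)
    run(samplePoints, initial, 20).foreach(c => println(c.mkString("[", ",", "]")))
  }
}
```

With this seeding the two centers settle at roughly (0.1, 0.1, 0.1) and (9.1, 9.1, 9.1), the means of the two obvious groups in the data.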

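3. Let the code do the talking
1. Set up the runtime environment
[mw_shl_code=scala,true]import org.apache.spark.{SparkConf, SparkContext}
// Run Spark locally in a single JVM; the application name is arbitrary.
val conf = new SparkConf().setAppName("Kmeans").setMaster("local")
val sc = new SparkContext(conf)  [/mw_shl_code]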

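2. Load the data set
[mw_shl_code=scala,true]import org.apache.spark.mllib.linalg.Vectors
// Read the file with minPartitions = 1; each line holds space-separated doubles.
val data = sc.textFile("E:\\spark-2.1.0\\spark-2.1.0\\data\\mllib\\kmeans_data.txt", 1)
// Parse each line into a dense MLlib vector.
val parseData = data.map(s => Vectors.dense(s.split(" ").map(_.toDouble)))  [/mw_shl_code]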

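3. Cluster the data set into 2 clusters, with up to 20 iterations, to build the model
[mw_shl_code=scala,true]import org.apache.spark.mllib.clustering.KMeans
val numClusters = 2      // K, the number of clusters
val numIterations = 20   // maximum number of iterations
val model = KMeans.train(parseData, numClusters, numIterations)  [/mw_shl_code]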

4. Cluster centers of the model
[mw_shl_code=scala,true]println("Cluster centers:")
for (c <- model.clusterCenters) {
   println(" " + c.toString)
}  [/mw_shl_code]

5. Evaluate the model with the sum of squared errors
[mw_shl_code=scala,true]val cost = model.computeCost(parseData)
println("Within Set Sum of Squared Errors = " + cost)  [/mw_shl_code]
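
As a rough cross-check of what this cost measures, the sketch below computes the sum of squared errors by hand in plain Scala: for every point it takes the squared Euclidean distance to the nearest center and adds the values up. The object name CostSketch and the centers it uses are assumptions for illustration (the centers are simply the means of the two obvious groups in the sample data); they are not output copied from a Spark run.

```scala
// Illustrative only: a hand-rolled sum of squared errors with assumed centers.
object CostSketch {
  type Point = Array[Double]

  // The six rows of kmeans_data.txt.
  val samplePoints: Seq[Point] = Seq(
    Array(0.0, 0.0, 0.0), Array(0.1, 0.1, 0.1), Array(0.2, 0.2, 0.2),
    Array(9.0, 9.0, 9.0), Array(9.1, 9.1, 9.1), Array(9.2, 9.2, 9.2))

  // Assumption: the means of the two visible groups stand in for the trained centers.
  val assumedCenters: Seq[Point] = Seq(Array(0.1, 0.1, 0.1), Array(9.1, 9.1, 9.1))

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum, over all points, of the squared distance to the nearest center.
  def cost(points: Seq[Point], centers: Seq[Point]): Double =
    points.map(p => centers.map(c => squaredDistance(p, c)).min).sum

  def main(args: Array[String]): Unit =
    println("Sum of squared errors = " + cost(samplePoints, assumedCenters))
}
```

Under these assumed centers, four of the six points each contribute about 0.03 and the other two sit on a center, so the total comes to about 0.12. A smaller value means the points lie closer to their centers.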

6. Test the model on individual points
[mw_shl_code=scala,true]
println("Vector 0.2 0.2 0.2 belongs to cluster: " + model.predict(Vectors.dense(0.2, 0.2, 0.2)))

println("Vector 0.25 0.25 0.25 belongs to cluster: " + model.predict(Vectors.dense(0.25, 0.25, 0.25)))

println("Vector 8 8 8 belongs to cluster: " + model.predict(Vectors.dense(8.0, 8.0, 8.0)))[/mw_shl_code]

7. Cross-evaluation 1: return only the results

[mw_shl_code=scala,true]val testdata = data.map(s => Vectors.dense(s.split(" ").map(_.toDouble)))
val result = model.predict(testdata)
result.saveAsTextFile("F:\\machine-learning\\result1")  [/mw_shl_code]

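8. Cross-evaluation 2: return the data set together with the results

[mw_shl_code=scala,true]// Pair each input line with its predicted cluster index, then save the result.
val result2 = data.map { line =>
   val linevector = Vectors.dense(line.split(" ").map(_.toDouble))
   val prediction = model.predict(linevector)
   line + " " + prediction
}
result2.saveAsTextFile("F:\\machine-learning\\result2")  [/mw_shl_code]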

4. Results
Cluster centers:

Single-point test results:

Cross-evaluation 1 (results only):

Cross-evaluation 2 (data set and results):

Link: http://pan.baidu.com/s/1pL8WzQN Password: kw6g

Source:
CSDN RiverCode