Running the distributed ItemCF recommendation algorithm on Mahout 0.9 + CDH 5.2

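Guiding questions
1. What does the future hold for Mahout?
2. The original code effectively does the command line's work inside Java, so why not run the job directly from the command line?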

Environment:

hadoop-2.5.0-cdh5.2.0

mahout-0.9-cdh5.2.0

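Introduction

Although Mahout has announced that it will stop developing on MapReduce and move to Spark, the reality is that our company cluster does not have enough memory for Spark, a beast that eats memory for breakfast. Add the pressure of project deadlines and the current skill set of the developers, and we have no choice but to keep using Mahout for a while.
This post records how to run ItemCF on Hadoop from the command line.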

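Background

I had read a number of articles by more experienced developers on programming Mahout ItemCF on Hadoop. They all describe how to implement ItemCF on Hadoop by writing code against Mahout. Not having had time to look into it myself, I simply followed their programmatic approach, for example the following code, which appears frequently across many blogs: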

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.mahout.cf.taste.hadoop.item.RecommenderJob;
public class ItemCFHadoop {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ItemCFHadoop.class);
        GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
        String[] remainingArgs = optionParser.getRemainingArgs();
        if (remainingArgs.length != 5) {
            System.out.println("args length: "+remainingArgs.length);
            System.err.println("Usage: hadoop jar <jarname> <package>.ItemCFHadoop <inputpath> <outputpath> <tmppath> <booleanData> <similarityClassname>");
            System.exit(2);
        }
        System.out.println("input : "+remainingArgs[0]);
        System.out.println("output : "+remainingArgs[1]);
        System.out.println("tempdir : "+remainingArgs[2]);
        System.out.println("booleanData : "+remainingArgs[3]);
        System.out.println("similarityClassname : "+remainingArgs[4]);
        
        StringBuilder sb = new StringBuilder();
        sb.append("--input ").append(remainingArgs[0]);
        sb.append(" --output ").append(remainingArgs[1]);
        sb.append(" --tempDir ").append(remainingArgs[2]);
        sb.append(" --booleanData ").append(remainingArgs[3]);
        sb.append(" --similarityClassname ").append(remainingArgs[4]);
        conf.setJobName("ItemCFHadoop");
        RecommenderJob job = new RecommenderJob();
        job.setConf(conf);
        job.run(sb.toString().split(" "));
    }
}

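The code above does run: pass the correct arguments on the command line and the ItemCF on Hadoop job completes without trouble.
But with this logic, the Java code is really just doing the command line's job, so why not run the job directly from the command line?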
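
To make concrete what the wrapper actually does, here is a minimal sketch of the same flag/value assembly, written so that the array is built element by element instead of joining a string and splitting it on spaces (which would break a path containing a space). The class name RecommenderArgs and its build method are hypothetical helpers for illustration; they are not part of Mahout or of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: assembles the same five flag/value pairs as the wrapper above.
public class RecommenderArgs {

    // Returns the argument array that would be handed to RecommenderJob.run(...).
    public static String[] build(String input, String output, String tempDir,
                                 String booleanData, String similarityClassname) {
        List<String> args = new ArrayList<>();
        args.addAll(Arrays.asList("--input", input));
        args.addAll(Arrays.asList("--output", output));
        args.addAll(Arrays.asList("--tempDir", tempDir));
        args.addAll(Arrays.asList("--booleanData", booleanData));
        args.addAll(Arrays.asList("--similarityClassname", similarityClassname));
        return args.toArray(new String[0]);
    }

    public static void main(String[] argv) {
        String[] args = build("/UserPreference", "/CFOutput", "/tmp", "false",
                "org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.LoglikelihoodSimilarity");
        // Print the array as it would appear on a command line.
        System.out.println(String.join(" ", args));
    }
}
```

The array is nothing more than the flags one would type after the class name on the command line, which is the point of the question above.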

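Official documentation

The earlier articles pointed the way: the ItemCF on Hadoop job is implemented by the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob class.
The official Javadoc (https://builds.apache.org/job/Mahout-Quality/javadoc/) describes the org.apache.mahout.cf.taste.hadoop.item.RecommenderJob class as follows: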

Runs a completely distributed recommender job as a series of mapreduces.
Preferences in the input file should look like userID, itemID[, preferencevalue]
Preference value is optional to accommodate applications that have no notion of a preference value (that is, the user simply expresses a preference for an item, but no degree of preference).
The preference value is assumed to be parseable as a double. The user IDs and item IDs are parsed as longs.
Command line arguments specific to this class are:
--input(path): Directory containing one or more text files with the preference data
--output(path): output path where recommender output should go
--tempDir (path): Specifies a directory where the job may place temp files (default "temp")
--similarityClassname (classname): Name of vector similarity class to instantiate or a predefined similarity from VectorSimilarityMeasure
--usersFile (path): only compute recommendations for user IDs contained in this file (optional)
--itemsFile (path): only include item IDs from this file in the recommendations (optional)
--filterFile (path): file containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user (optional)
--numRecommendations (integer): Number of recommendations to compute per user (10)
--booleanData (boolean): Treat input data as having no pref values (false)
--maxPrefsPerUser (integer): Maximum number of preferences considered per user in final recommendation phase (10)
--maxSimilaritiesPerItem (integer): Maximum number of similarities considered per item (100)
--minPrefsPerUser (integer): ignore users with less preferences than this in the similarity computation (1)
--maxPrefsPerUserInItemSimilarity (integer): max number of preferences to consider per user in the item similarity computation phase, users with more preferences will be sampled down (1000)
--threshold (double): discard item pairs with a similarity value below this

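The original text is quoted above; the notes below restate it, with a few additions:

Runs a completely distributed recommendation job, implemented as a series of MapReduce jobs.

Preference data in the input files has the format userID, itemID[, preferencevalue]. The preferencevalue is optional.

userID and itemID are parsed as long; preferencevalue is parsed as double.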

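The class accepts the following command-line arguments:

--input (path): directory holding the user preference data; it may contain one or more text files of preference data
--output (path): output directory for the computed results
--tempDir (path): directory for temporary files
--similarityClassname (classname): the vector similarity class. The classes in the measures package are CityBlockSimilarity, CooccurrenceCountSimilarity, CosineSimilarity, CountbasedMeasure, EuclideanDistanceSimilarity, LoglikelihoodSimilarity, PearsonCorrelationSimilarity and TanimotoCoefficientSimilarity (CountbasedMeasure is the abstract base class of the count-based measures, not a measure to pass here). Note that the argument must include the package name.
--usersFile (path): path to a file of userIDs; recommendations are computed only for the userIDs it contains (optional)
--itemsFile (path): path to a file of itemIDs; only these itemIDs are included in the recommendations (optional)
--filterFile (path): path to a file of comma-separated userID,itemID pairs; the listed item is never recommended to that user (optional)
--numRecommendations (integer): number of items recommended per user, default 10
--booleanData (boolean): set to true if the input data contains no preference values, default false
--maxPrefsPerUser (integer): maximum number of preferences used per user in the final recommendation phase, default 10
--maxSimilaritiesPerItem (integer): maximum number of similarities considered per item, default 100
--minPrefsPerUser (integer): users with fewer preferences than this are ignored in the similarity computation, default 1
--maxPrefsPerUserInItemSimilarity (integer): maximum number of preferences considered per user in the item similarity phase, default 1000; users with more preferences are sampled down
--threshold (double): discard item pairs whose similarity is below this threshold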

Running from the command line

User preference data used for testing, in the form (userID, itemID, preferencevalue):
1,101,2
1,102,5
1,103,1
2,101,1
2,102,3
2,103,2
2,104,6
3,101,1
3,104,1
3,105,1
3,107,2
4,101,2
4,103,2
4,104,5
4,106,3
5,101,3
5,102,5
5,103,6
5,104,8
5,105,1
5,106,1
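
The records above follow the userID,itemID[,preferencevalue] format from the Javadoc, with both IDs read as long and the optional value read as double. A minimal sketch of that parsing rule, useful for sanity-checking an input file before uploading it, might look like the following. PreferenceLineParser is a hypothetical class written for this post; it is not Mahout's own parser and only illustrates the documented format.

```java
// Hypothetical pre-flight check for input lines; not part of Mahout.
public class PreferenceLineParser {

    /** One parsed record; value is null when the line has no preference value. */
    public static final class Preference {
        public final long userId;
        public final long itemId;
        public final Double value;

        Preference(long userId, long itemId, Double value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }
    }

    /** Parses "userID,itemID[,preferencevalue]"; throws if the line does not fit. */
    public static Preference parse(String line) {
        String[] fields = line.trim().split("\\s*,\\s*");
        if (fields.length < 2 || fields.length > 3) {
            throw new IllegalArgumentException(
                    "Expected userID,itemID[,preferencevalue] but got: " + line);
        }
        long userId = Long.parseLong(fields[0]);   // IDs must be parseable as long
        long itemId = Long.parseLong(fields[1]);
        Double value = fields.length == 3 ? Double.valueOf(fields[2]) : null;
        return new Preference(userId, itemId, value);
    }

    public static void main(String[] args) {
        // Two lines from the test data plus one boolean-style line without a value.
        for (String line : new String[] {"1,101,2", "5,104,8", "3,107"}) {
            Preference p = parse(line);
            System.out.println(p.userId + " " + p.itemId + " " + p.value);
        }
    }
}
```

A line with a non-numeric ID fails in Long.parseLong, which is a cheap way to catch string IDs before the job starts.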

Once the base environment is fully configured, run the following command to perform the ItemCF on Hadoop recommendation computation:
hadoop jar $MAHOUT_HOME/mahout-core-0.9-cdh5.2.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob --input /UserPreference --output /CFOutput --tempDir /tmp --similarityClassname org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.LoglikelihoodSimilarity

Note: only the most important parameters are used here; tuning the remaining parameters needs to be tested against the actual project.

Results, in the form userID [itemID1:score1,itemID2:score2,...]:
1 [104:3.4706533,106:1.7326527,105:1.5989419]
2 [106:3.8991857,105:3.691359]
3 [106:1.0,103:1.0,102:1.0]
4 [105:3.2909648,102:3.2909648]
5 [107:3.2898135]
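
To consume these results downstream, each line has to be split back into a user ID and an ordered list of item/score pairs. The sketch below assumes exactly the line shape shown above (a user ID, whitespace, then a bracketed comma-separated list of itemID:score pairs); RecommendationLineParser is a hypothetical helper, not a Mahout class.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for the result lines shown above; not part of Mahout.
public class RecommendationLineParser {

    /** Returns the user ID that precedes the opening bracket. */
    public static long parseUserId(String line) {
        int open = line.indexOf('[');
        if (open < 0) {
            throw new IllegalArgumentException("No '[' in line: " + line);
        }
        return Long.parseLong(line.substring(0, open).trim());
    }

    /** Returns itemID -> score, preserving the order in which items are listed. */
    public static Map<Long, Double> parseItems(String line) {
        int open = line.indexOf('[');
        int close = line.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("No bracketed list in line: " + line);
        }
        Map<Long, Double> items = new LinkedHashMap<>();
        String body = line.substring(open + 1, close).trim();
        if (body.isEmpty()) {
            return items; // user with no recommendations
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.trim().split(":");
            items.put(Long.parseLong(kv[0]), Double.parseDouble(kv[1]));
        }
        return items;
    }

    public static void main(String[] args) {
        String line = "1 [104:3.4706533,106:1.7326527,105:1.5989419]";
        System.out.println(parseUserId(line) + " -> " + parseItems(line));
    }
}
```

A LinkedHashMap is used so that the order of the items in the line is kept when iterating.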
