Chinese Word Segmentation and Word-Frequency Counting on Hadoop: A Hands-On Exercise

2014-12-1 23:05 · 1115 views

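First, a pointer to related material: http://xiaoxia.org/2011/12/18/map-reduce-program-of-rmm-word-count-on-hadoop/. Xiaoxia's piece on ranking character names in wuxia novels by how often they appear is great fun, so here is my own humbler imitation of it.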

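This exercise differs from that post in the following ways:

  0) That post uses Hadoop Streaming; here the job is written in Java directly against the MapReduce API.

  1) The Chinese word segmentation method is different: IKAnalyzer is used here, whose home page is http://code.google.com/p/ik-analyzer/.

  2) The text analyzed here is 《射雕英雄传》 (The Legend of the Condor Heroes). Ha, something has to change.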

0) Start from the WordCount source code and modify its Map step so that it uses IKAnalyzer's segmentation.

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class ChineseWordCount {

    // Mapper: segments each input line with IKAnalyzer and emits (word, 1).
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            // Text.getBytes() returns the backing array, which can be longer
            // than the valid data, and an InputStreamReader without a charset
            // falls back to the platform default encoding. Decoding through
            // toString() (always UTF-8) avoids both problems.
            Reader read = new StringReader(value.toString());
            // true selects IKAnalyzer's smart mode instead of fine-grained mode.
            IKSegmenter iks = new IKSegmenter(read, true);
            Lexeme t;
            while ((t = iks.next()) != null) {
                word.set(t.getLexemeText());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as the combiner): sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // On Hadoop 2.x this constructor is deprecated in favor of
        // Job.getInstance(conf, "word count").
        Job job = new Job(conf, "word count");
        job.setJarByClass(ChineseWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
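The job registers IntSumReducer as both combiner and reducer. That is safe because summing partial sums gives the same totals as summing the raw (word, 1) pairs. The following is a minimal, Hadoop-free sketch of that idea, not the job itself: the class WordCountFlowSketch, its methods map and sum, and the two sample splits are hypothetical, and the tokens are assumed to be already segmented (the step IKSegmenter performs in the real mapper).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local illustration of map -> combine -> reduce for word counting.
// All names here are hypothetical; nothing in this class touches Hadoop.
final class WordCountFlowSketch {

    // "Map" step: emit one (token, 1) pair per already-segmented token.
    static List<Map.Entry<String, Integer>> map(List<String> tokens) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : tokens) {
            pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
        }
        return pairs;
    }

    // "Reduce" step, also usable as the combiner: sum the values per key.
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Two made-up input splits, as if handled by two map tasks.
        List<String> split1 = Arrays.asList("郭靖", "黄蓉", "郭靖");
        List<String> split2 = Arrays.asList("黄蓉", "郭靖");

        // Without a combiner: every raw pair reaches the reducer.
        List<Map.Entry<String, Integer>> raw = new ArrayList<>(map(split1));
        raw.addAll(map(split2));
        Map<String, Integer> direct = sum(raw);

        // With a combiner: each split is pre-summed, then the reducer
        // sums the partial sums.
        List<Map.Entry<String, Integer>> partial =
                new ArrayList<>(sum(map(split1)).entrySet());
        partial.addAll(sum(map(split2)).entrySet());
        Map<String, Integer> combined = sum(partial);

        System.out.println(direct);
        System.out.println("same totals with combiner: " + combined.equals(direct));
    }
}
```

The combiner only reduces the number of pairs shuffled between map and reduce; it does not change the final counts.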

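1) So, that's done, and it runs fine in the local plugin-based simulated environment. Package it (bundling the segmentation jar) and throw it onto the cluster.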

hadoop fs -put chinese_in.txt chinese_in.txt

hadoop jar WordCount.jar chinese_in.txt out0

...mapping reducing...

hadoop fs -ls ./out0

hadoop fs -get out0/part-r-00000 words.txt

2) Post-processing the data:

2.1) Sorting the data

head words.txt

tail words.txt

sort -k2 words.txt >0.txt

head 0.txt

tail 0.txt

sort -k2r words.txt>0.txt

head 0.txt

tail 0.txt

sort -k2rn words.txt>0.txt

head -n 50 0.txt
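The three sort attempts above differ in how the count column is compared: -k2 and -k2r compare it as text, while -k2rn compares it as a number, which is what a frequency ranking needs. Below is a rough Java sketch of that difference. The class SortByCountSketch and its methods are hypothetical, the input is assumed to be "word<whitespace>count" lines like those in words.txt, and the text comparison only approximates what sort does under a given locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Contrasts text ordering with numeric ordering of the count column.
// Hypothetical helper names; assumes "word<whitespace>count" lines.
final class SortByCountSketch {

    // Second field of the line, as text.
    static String countText(String line) {
        return line.trim().split("\\s+")[1];
    }

    // Second field of the line, parsed as an integer.
    static int count(String line) {
        return Integer.parseInt(countText(line));
    }

    // Roughly what `sort -k2r` does: descending by the count compared as text.
    static List<String> sortLexicalDesc(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> countText(b).compareTo(countText(a)));
        return sorted;
    }

    // Roughly what `sort -k2rn` does: descending by the count compared as a number.
    static List<String> sortNumericDesc(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        sorted.sort((a, b) -> Integer.compare(count(b), count(a)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("郭靖\t6427", "武功\t950", "黄蓉\t4621");
        // As text, "950" sorts above "6427" because '9' > '6'.
        System.out.println(sortLexicalDesc(lines));
        // As numbers, 6427 > 4621 > 950.
        System.out.println(sortNumericDesc(lines));
    }
}
```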

2.2) Extracting the targets

awk '{if(length($1)>=2) print $0}' 0.txt >1.txt
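The awk filter keeps only lines whose word is at least two characters long, which drops single-character tokens. As far as I know, whether awk's length() counts characters or bytes depends on the awk implementation and the locale, so here is the same filter as a small Java sketch that counts Unicode code points explicitly. The class LengthFilterSketch, the method keepMultiCharWords, and the single-character sample line are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Keeps "word<whitespace>count" lines whose word has at least two characters.
// Hypothetical helper; mirrors the intent of the awk one-liner.
final class LengthFilterSketch {

    static List<String> keepMultiCharWords(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String word = line.trim().split("\\s+")[0];
            // Count code points rather than bytes or UTF-16 units.
            if (word.codePointCount(0, word.length()) >= 2) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The single-character line is made up for illustration.
        List<String> lines = Arrays.asList("郭靖\t6427", "道\t100", "洪七公\t1225");
        for (String line : keepMultiCharWords(lines)) {
            System.out.println(line);
        }
    }
}
```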

2.3) Presenting the results

head 1.txt -n 50 | sed = | sed 'N;s/\n//'

1郭靖 6427

2黄蓉 4621

3欧阳 1660

4甚么 1430

5说道 1287

6洪七公 1225

7笑道 1214

8自己 1193

9一个 1160

10师父 1080

11黄药师 1059

12心中 1046

13两人 1016

14武功 950

15咱们 925

16一声 912

17只见 827

18他们 782

19心想 780

20周伯通 771

21功夫 758

22不知 755

23欧阳克 752

24听得 741

25丘处机 732

26当下 668

27爹爹 664

28只是 657

29知道 654

30这时 639

31之中 621

32梅超风 586

33身子 552

34都是 540

35不是 534

36如此 531

37柯镇恶 528

38到了 523

39不敢 522

40裘千仞 521

41杨康 520

42你们 509

43这一 495

44却是 478

45众人 476

46二人 475

47铁木真 469

48怎么 464

49左手 452

50地下 448

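Many of the words that are not character names are interesting in their own right, for example: 5 说道 (said), 7 笑道 (said with a laugh), 12 心中 (in one's heart), 17 只见 (one saw), 22 不知 (not knowing), 30 这时 (at this moment), 49 左手 (left hand).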

阿飞's personal space: https://www.aboutyun.com/?3890