GridGain, an Accelerator for Hadoop

Posted by fc013 on 2015-11-15 20:23:17



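Questions to guide your reading:

1. How do you configure a Hadoop cluster?
2. How do you configure the GridGain accelerator?
3. How do you count words with MapReduce?
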
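In-memory data grid (IMDG) products such as GridGain can do more than serve as a simple cache: accelerating MapReduce jobs on Hadoop is another of their selling points. That gives in-memory computing one more approach to choose from, rather than leaving the field to Spark alone. For an overview of GridGain's features, see "GridGain, an Open-Source IMDG" (《开源IMDG之GridGain》).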


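1. Installing Hadoop 2.7.1
A long time ago I wrote "Getting Started with Hadoop (1): Pseudo-Distributed Installation" (《Hadoop入门(一):Hadoop伪分布安装》), which still used version 0.20. Hadoop has since moved on to 2.7.1, so the first half of this post walks through setting up a pseudo-distributed cluster on 2.7.1, the latest release at the time of writing.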

1.1 Passwordless SSH
Set up passwordless SSH login for the current user, then run ssh localhost to check whether a password is still requested.

    [root@vm Software]# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    Generating public/private rsa key pair.
    Your identification has been saved in /root/.ssh/id_rsa.
    Your public key has been saved in /root/.ssh/id_rsa.pub.
    The key fingerprint is:
    28:58:5c:c8:0a:b3:52:83:4f:c1:9a:71:65:12:61:b1 root@BC-VM-edce4ac67d304079868c0bb265337bd4
    The key's randomart image is:
    +--[ RSA 2048]----+
    | oBBo..          |
    |=.*=o.           |
    | %Eoo            |
    |= oo   .         |
    |. . . . S        |
    |     .           |
    |                 |
    |                 |
    |                 |
    +-----------------+
    [root@vm Software]# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    [root@vm Software]# ssh localhost
    Last login: Wed Sep  9 15:43:19 2015 from localhost
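
The cat ... >> authorized_keys step above appends the key every time it is run, so repeating the setup leaves duplicate entries. A minimal sketch of an idempotent variant follows. The function name add_key_once is made up for illustration, and it assumes the public key file holds a single line, as the file written by ssh-keygen does.

```shell
# Hypothetical helper (not part of OpenSSH): append a public key to an
# authorized_keys file only if that exact line is not already present.
add_key_once() {
    local pub="$1" auth="$2"
    touch "$auth"                      # make sure the target file exists
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
}

# Usage (paths taken from the listing above):
#   add_key_once ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh -o BatchMode=yes localhost true   # fails instead of prompting for a password
```

BatchMode=yes makes the check usable in scripts: ssh exits with an error rather than waiting for a password.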

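1.2 Environment variables
Edit ~/.bash_profile or /etc/profile and add the HADOOP_HOME environment variable. Many of the startup scripts live in the sbin directory, so both sbin and bin are added to PATH here. Adjust the path to wherever Hadoop is unpacked (the logs later in this post come from an install under /root/Software/hadoop-2.7.1).

    export HADOOP_HOME=/home/hadoop-2.7.1
    export PATH=$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$PATH

Edit etc/hadoop/hadoop-env.sh. If JAVA_HOME is not set, or you want Hadoop to use a specific JDK, replace the right-hand side of the following line with the JDK path:

    export JAVA_HOME=${JAVA_HOME}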

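Hadoop's Java version requirements
“Hadoop requires Java 7 or a late version of Java 6. It is built and tested on both OpenJDK and Oracle (HotSpot)’s JDK/JRE”. As this description from the official site shows, Hadoop runs fine on either OpenJDK or Oracle's JDK/JRE, on late Java 6 releases as well as Java 7 and above. Starting with Hadoop 2.7, however, Java 7 or later is required.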
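
The version rule can be checked in a script before installing. The sketch below is only an illustration: the function names java_major and java_ok_for_hadoop27 are made up, and the parsing assumes version strings shaped like 1.7.0_71 (the legacy scheme seen in the logs of this post) or 11.0.2 (the newer scheme).

```shell
# Hypothetical helper: reduce a Java version string to its major number.
# Handles the legacy "1.x" scheme (1.7.0_71 -> 7) and the newer one (11.0.2 -> 11).
java_major() {
    local v="$1"
    case "$v" in
        1.*) v="${v#1.}" ;;            # drop the legacy "1." prefix
    esac
    echo "${v%%[._-]*}"                # keep everything before the first '.', '_' or '-'
}

# Succeeds when the given version meets the Java 7 minimum of Hadoop 2.7.
java_ok_for_hadoop27() {
    [ "$(java_major "$1")" -ge 7 ]
}

# Usage:
#   ver=$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')
#   java_ok_for_hadoop27 "$ver" && echo "Java $ver is new enough"
```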

1.3 core-site.xml
Edit etc/hadoop/core-site.xml:

    <configuration>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/usr/opt/hadoop/tmp</value>
        </property>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>
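
To double-check what a configuration file actually sets, a property can be read back from the command line. The following is a rough sketch, not a general XML parser: the function name get_prop is made up, and it assumes every <name> and <value> tag sits on its own line, as in the listings of this post.

```shell
# Hypothetical helper: print the <value> of a named property from a Hadoop
# *-site.xml file. Assumes one <name> or <value> tag per line.
get_prop() {
    local file="$1" key="$2"
    awk -v key="$key" '
        /<name>/ {
            name = $0
            sub(/.*<name>/, "", name); sub(/<\/name>.*/, "", name)
        }
        /<value>/ && name == key {
            value = $0
            sub(/.*<value>/, "", value); sub(/<\/value>.*/, "", value)
            print value
            exit
        }
    ' "$file"
}

# Usage:
#   get_prop "$HADOOP_HOME/etc/hadoop/core-site.xml" fs.defaultFS
```

On a working installation, hdfs getconf -confKey fs.defaultFS reports the value as Hadoop itself resolves it, which is the more reliable check once the binaries are on PATH.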

1.4 hdfs-site.xml
Edit etc/hadoop/hdfs-site.xml:

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>

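1.5 yarn-site.xml
Edit etc/hadoop/yarn-site.xml. Note that the official single-node setup guide places mapreduce.framework.name in etc/hadoop/mapred-site.xml rather than in yarn-site.xml; the listing below reproduces the configuration used for this post.

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>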

With that, the pseudo-distributed Hadoop cluster is fully configured.


2. Starting the Hadoop cluster

2.1 Formatting the NameNode
Before Hadoop is started for the first time, the NameNode must be formatted:

    [root@vm hadoop-2.7.1]# hdfs namenode -format
    15/09/09 13:03:08 INFO namenode.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = BC-vm/192.168.1.111
    STARTUP_MSG:   args = [-format]
    STARTUP_MSG:   version = 2.7.1
    STARTUP_MSG:   classpath = /root/Software/hadoop-2.7.1/etc/hadoop:/root/Software/hadoop-2.7.1/share/hadoop/common/lib/commons-digester-1.8.jar:...
    STARTUP_MSG:   build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 15ecc87ccf4a0228f35af08fc56de536e6ce657a; compiled by 'jenkins' on 2015-06-29T06:04Z
    STARTUP_MSG:   java = 1.7.0_71
    ************************************************************/
    15/09/09 13:03:08 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
    15/09/09 13:03:08 INFO namenode.NameNode: createNameNode [-format]
    15/09/09 13:03:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Formatting using clusterid: CID-7fbd2609-fb3e-459d-bbcf-c24d32473ffb
        ...
    15/09/09 13:03:09 INFO util.ExitUtil: Exiting with status 0
    15/09/09 13:03:09 INFO namenode.NameNode: SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down NameNode at BC-vm/192.168.1.111
    ************************************************************/
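
Formatting is a one-time step: running it again on a cluster that already holds data discards the existing file system metadata. A small guard can make setup scripts safe to re-run. This sketch assumes the NameNode directory is left at its default location under hadoop.tmp.dir (that is, /usr/opt/hadoop/tmp/dfs/name with the core-site.xml shown earlier); the function name needs_format is made up.

```shell
# Hypothetical guard: succeed only when the NameNode directory has not been
# formatted yet, i.e. it holds no current/VERSION file.
needs_format() {
    local name_dir="$1"
    [ ! -f "$name_dir/current/VERSION" ]
}

# Usage, assuming the default dfs.namenode.name.dir:
#   if needs_format /usr/opt/hadoop/tmp/dfs/name; then
#       hdfs namenode -format
#   fi
```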

2.2 Starting HDFS
Note: sbin/start-all.sh itself states "This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh", so do not use it to start Hadoop. After a successful start, the jps command should list three Java processes: NameNode, SecondaryNameNode and DataNode.

    [root@vm hadoop-2.7.1]# start-dfs.sh
    Starting namenodes on [localhost]
    localhost: starting namenode, logging to /root/Software/hadoop-2.7.1/logs/hadoop-root-namenode-BC-VM-edce4ac67d304079868c0bb265337bd4.out
    localhost: starting datanode, logging to /root/Software/hadoop-2.7.1/logs/hadoop-root-datanode-BC-VM-edce4ac67d304079868c0bb265337bd4.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /root/Software/hadoop-2.7.1/logs/hadoop-root-secondarynamenode-BC-VM-edce4ac67d304079868c0bb265337bd4.out
    [root@BC-vm hadoop-2.7.1]# jps
    20128 Jps
    19825 DataNode
    19688 NameNode
    20007 SecondaryNameNode

2.3 Starting YARN
Hadoop 2 split resource management out into a separate component, YARN (Yet Another Resource Negotiator). After YARN is started, two more Java processes appear: NodeManager and ResourceManager.

    [root@vm hadoop-2.7.1]# start-yarn.sh
    starting yarn daemons
    starting resourcemanager, logging to /root/Software/hadoop-2.7.1/logs/yarn-root-resourcemanager-BC-VM-edce4ac67d304079868c0bb265337bd4.out
    localhost: starting nodemanager, logging to /root/Software/hadoop-2.7.1/logs/yarn-root-nodemanager-BC-VM-edce4ac67d304079868c0bb265337bd4.out
    [root@vm hadoop-2.7.1]# jps
    20212 ResourceManager
    19825 DataNode
    20630 Jps
    19688 NameNode
    20007 SecondaryNameNode
    20507 NodeManager
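
Reading the jps output by eye works for one machine; in a script it is easier to ask which of the expected daemons are absent. A possible sketch, with a made-up function name (missing_daemons), that takes jps output on standard input:

```shell
# Hypothetical helper: read `jps` output on stdin and print every expected
# daemon name (given as arguments) that does not appear in it.
missing_daemons() {
    local running name
    running=$(awk '{ print $2 }')      # second column of jps output is the class name
    for name in "$@"; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode
        printf '%s\n' "$running" | grep -qxF -- "$name" || echo "$name"
    done
}

# Usage:
#   jps | missing_daemons NameNode DataNode SecondaryNameNode ResourceManager NodeManager
```

An empty result means all five daemons from the listings above are running.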

Detailed logs are written under $HADOOP_HOME/logs.

3. Testing MapReduce
The classic WordCount example is used again here as a quick check of Hadoop 2's performance.

3.1 Uploading the data file
big.txt serves as the test file again. I used it earlier in "Applications of Tries and Optimizing a Spell Checker" (《Trie的应用及拼写检查器的优化》), for anyone interested. Note that the output directory must not be created in advance: if it already exists, Hadoop aborts the job with an error saying so.

    [root@vm hadoop-2.7.1]# wget http://www.norvig.com/big.txt
    [root@vm hadoop-2.7.1]# hadoop fs -mkdir -p /test/wordcount/input
    [root@vm hadoop-2.7.1]# hadoop fs -put big.txt /test/wordcount/input
    [root@vm hadoop-2.7.1]# hadoop fs -ls /test/wordcount/input
    Found 1 items
    -rw-r--r--   1 root supergroup        124 2015-09-09 14:21 /test/wordcount/input/big.txt

3.2 Running the WordCount job
As before, the WordCount job ships in share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar. big.txt is only a little over 6 MB, so the run is quick: about 7 seconds to start up and 15 seconds to compute, roughly 22 seconds in total. The command seq 150 | xargs -i cat big.txt >> bigbig.txt produces a bigbig.txt of about 1 GB to use as a larger test file; on that input Hadoop took 214 seconds.

    [root@vm hadoop-2.7.1]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /test/wordcount/input /test/wordcount/output
    15/09/09 15:23:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/09/09 15:23:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    15/09/09 15:23:52 INFO input.FileInputFormat: Total input paths to process : 1
    15/09/09 15:23:52 INFO mapreduce.JobSubmitter: number of splits:1
    15/09/09 15:23:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1441775536578_0003
    15/09/09 15:23:52 INFO impl.YarnClientImpl: Submitted application application_1441775536578_0003
    15/09/09 15:23:52 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1441775536578_0003/
    15/09/09 15:23:52 INFO mapreduce.Job: Running job: job_1441775536578_0003
    15/09/09 15:23:57 INFO mapreduce.Job: Job job_1441775536578_0003 running in uber mode : false
    15/09/09 15:23:57 INFO mapreduce.Job:  map 0% reduce 0%
    15/09/09 15:24:05 INFO mapreduce.Job:  map 100% reduce 0%
    15/09/09 15:24:12 INFO mapreduce.Job:  map 100% reduce 100%
    15/09/09 15:24:12 INFO mapreduce.Job: Job job_1441775536578_0003 completed successfully
    15/09/09 15:24:12 INFO mapreduce.Job: Counters: 49
        File System Counters
            FILE: Number of bytes read=1251830
            FILE: Number of bytes written=2734521
            ...
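
The startup and compute times quoted above can be read straight off the log timestamps. A minimal sketch of that arithmetic follows; the function names to_seconds and elapsed are made up, and both timestamps are assumed to fall on the same day.

```shell
# Hypothetical helpers for reading durations off the client log timestamps.
# Convert HH:MM:SS to seconds since midnight (awk treats "08" as decimal 8).
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Seconds between two timestamps on the same day.
elapsed() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}
```

With the timestamps from the listing, elapsed 15:23:50 15:23:57 gives 7, elapsed 15:23:57 15:24:12 gives 15, and elapsed 15:23:50 15:24:12 gives 22, matching the figures in the text.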

3.3 Verifying the result
Now look at the output, using sort and head to list the 20 most frequent words. As expected, they are all function words:

    [root@vm hadoop-2.7.1]# hadoop fs -cat /test/wordcount/output/part-r-00000 | sort -rn -k 2 | head -n 20
    the 71744
    of  39169
    and 35968
    to  27895
    a   19811
    in  19515
    that    11216
    was 11129
    his 9561
    he  9362
    with    9358
    is  9247
    as  7333
    had 7275
    it  6545
    by  6384
    for 6358
    at  6237
    not 6201
    The 6149
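
For small inputs the job's result can be cross-checked locally with standard tools. The pipeline below is a rough stand-in, not the Hadoop implementation: it assumes words are split on whitespace only and compared case-sensitively, which is consistent with "the" and "The" being counted separately in the output above. The function names are made up.

```shell
# Hypothetical local stand-in for the WordCount job.
wordcount_local() {
    tr -s '[:space:]' '\n' |            # "map": one word per line
        grep -v '^$' |                  # drop the empty line a leading blank can leave
        LC_ALL=C sort |                 # "shuffle": bring equal words together
        uniq -c |                       # "reduce": count each run of equal words
        awk '{ print $2 "\t" $1 }'      # word<TAB>count, like part-r-00000
}

# Same ranking step as the pipeline in the listing above.
top_words() {
    sort -rn -k 2 | head -n "${1:-20}"
}

# Usage:
#   wordcount_local < big.txt | top_words 20
```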

Repeating the test is simple: delete the output directory with hadoop fs -rm -r /test/wordcount/output, then run the WordCount job again.
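
When the job is rerun from a script, the cleanup can be made conditional so that it does not fail on the first run, when no output directory exists yet. A small sketch with a made-up function name (clean_output); it relies on hadoop fs -test -d, which exits with status 0 when the path is a directory.

```shell
# Hypothetical wrapper: remove an HDFS output directory only if it exists,
# so that a job can be resubmitted.
clean_output() {
    local out_dir="$1"
    if hadoop fs -test -d "$out_dir"; then
        hadoop fs -rm -r "$out_dir"
    fi
}

# Usage:
#   clean_output /test/wordcount/output
```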


4. Using the GridGain accelerator
With all the groundwork done, we finally reach the main topic of this post.

4.1 Installing GridGain
First download the Hadoop Acceleration edition of GridGain. It is a separate distribution, not the same as the fabric edition used when exploring GridGain's data grid features.
GridGain has a few environment requirements:
  • Java 7 or later
  • JAVA_HOME pointing to a JDK or JRE
  • Hadoop 2.2 or later
  • HADOOP_HOME set
Now run the bin/setup-hadoop.sh script to replace Hadoop's configuration files.

    [root@vm gridgain-community-hadoop-1.3.3]# bin/setup-hadoop.sh
       __________  ________________
      /  _/ ___/ |/ /  _/_  __/ __/
    _/ // (7 7    // /  / / / _/
    /___/\___/_/|_/___/ /_/ /___/
                    for Apache Hadoop
    ver. 1.3.3#20150803-sha1:7d747d2a
    2015 Copyright(C) Apache Software Foundation
      > IGNITE_HOME is set to '/root/Software/gridgain-community-hadoop-1.3.3'.
      > HADOOP_HOME is set to '/root/Software/hadoop-2.7.1'.
      > HADOOP_COMMON_HOME is not set, will use '/root/Software/hadoop-2.7.1/share/hadoop/common'.
    <  Ignite JAR files are not found in Hadoop 'lib' directory. Create appropriate symbolic links? (Y/N): Y
    >  Yes.
      > Creating symbolic link '/root/Software/hadoop-2.7.1/share/hadoop/common/lib/ignite-shmem-1.0.0.jar'.
      > Creating symbolic link '/root/Software/hadoop-2.7.1/share/hadoop/common/lib/ignite-core-1.3.3.jar'.
      > Creating symbolic link '/root/Software/hadoop-2.7.1/share/hadoop/common/lib/ignite-hadoop-1.3.3.jar'.
    <  Replace 'core-site.xml' and 'mapred-site.xml' files with preconfigured templates (existing files will be backed up)? (Y/N): Y
    >  Yes.
      > Replacing file '/root/Software/hadoop-2.7.1/etc/hadoop/core-site.xml'.
      > Replacing file '/root/Software/hadoop-2.7.1/etc/hadoop/mapred-site.xml'.
      > Apache Hadoop setup is complete.

Once the files have been replaced, start two GridGain nodes:

    [root@vm gridgain-community-hadoop-1.3.3]# nohup bin/ignite.sh &
    [root@vm gridgain-community-hadoop-1.3.3]# nohup bin/ignite.sh &

Then start Hadoop. Note the "Incorrect configuration" warning in the output: start-dfs.sh no longer finds a NameNode address, presumably because the replaced core-site.xml does not carry over the fs.defaultFS setting from section 1.3.

    [root@BC-VM-edce4ac67d304079868c0bb265337bd4 hadoop-2.7.1]# start-dfs.sh
    15/09/09 17:11:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.
    Starting namenodes on []
    localhost: starting namenode, logging to /root/Software/hadoop-2.7.1/logs/hadoop-root-namenode-BC-VM-edce4ac67d304079868c0bb265337bd4.out
    localhost: starting datanode, logging to /root/Software/hadoop-2.7.1/logs/hadoop-root-datanode-BC-VM-edce4ac67d304079868c0bb265337bd4.out
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: starting secondarynamenode, logging to /root/Software/hadoop-2.7.1/logs/hadoop-root-secondarynamenode-BC-VM-edce4ac67d304079868c0bb265337bd4.out

4.2 Running the test

Now test the GridGain accelerator by submitting the job exactly as before. In my virtual machine the results were disappointing: on one or two GB of data, the GridGain accelerator, with either one node or two, performed about the same as plain Hadoop and was sometimes slower. The cause may be my environment or the implementation, and the difference might only become apparent on larger data sets. One thing worth checking: the client log below still reports connecting to the YARN ResourceManager at 0.0.0.0:8032 and gives a tracking URL on port 8088, which suggests this job may have been scheduled by YARN rather than by the accelerator.

    [root@BC-VM-edce4ac67d304079868c0bb265337bd4 hadoop-2.7.1]# hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /test/wordcount/input /test/wordcount/output
    15/09/09 15:58:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    15/09/09 15:58:58 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    15/09/09 15:58:59 INFO input.FileInputFormat: Total input paths to process : 1
    15/09/09 15:58:59 INFO mapreduce.JobSubmitter: number of splits:9
    15/09/09 15:59:00 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1441785297218_0002
    15/09/09 15:59:00 INFO impl.YarnClientImpl: Submitted application application_1441785297218_0002
    15/09/09 15:59:00 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1441785297218_0002/
    15/09/09 15:59:00 INFO mapreduce.Job: Running job: job_1441785297218_0002
    15/09/09 15:59:07 INFO mapreduce.Job: Job job_1441785297218_0002 running in uber mode : false
    15/09/09 15:59:07 INFO mapreduce.Job:  map 0% reduce 0%
    15/09/09 15:59:20 INFO mapreduce.Job:  map 2% reduce 0%
    15/09/09 15:59:23 INFO mapreduce.Job:  map 3% reduce 0%
        ...
    15/09/09 16:01:24 INFO mapreduce.Job:  map 96% reduce 26%
    15/09/09 16:01:26 INFO mapreduce.Job:  map 96% reduce 30%
    15/09/09 16:01:28 INFO mapreduce.Job:  map 100% reduce 30%
    15/09/09 16:01:29 INFO mapreduce.Job:  map 100% reduce 45%
    15/09/09 16:01:31 INFO mapreduce.Job:  map 100% reduce 100%
    15/09/09 16:01:31 INFO mapreduce.Job: Job job_1441785297218_0002 completed successfully
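
When comparing an accelerated run with a plain one, it helps to confirm which engine actually accepted the job. The sketch below is a rough heuristic under one assumption: a job submitted through YARN logs the RMProxy and YarnClientImpl lines seen in the listings of this post. The function name submitted_via is made up, and anything without those lines is simply reported as unknown.

```shell
# Hypothetical helper: guess from a job client log (stdin) whether the job
# was submitted through YARN, based on two client log lines.
submitted_via() {
    if grep -qE 'client\.RMProxy: Connecting to ResourceManager|impl\.YarnClientImpl: Submitted application'; then
        echo yarn
    else
        echo unknown
    fi
}

# Usage (the client writes its log to stderr, hence 2>&1):
#   hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar \
#       wordcount /test/wordcount/input /test/wordcount/output 2>&1 | tee job.log
#   submitted_via < job.log
```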




1 comment

qazzxc5200, posted on 2015-11-16 10:09:19
Thanks for sharing.
I'll try it out myself when I get the chance.