
Enabling LZO Compression in a Big Data Environment



Installing the LZO Compression Tools on CentOS 7

I. Prerequisites:
yum -y install  lzo-devel  zlib-devel  gcc autoconf automake libtool

II. Install LZO
1. Download, extract, compile, and install
cd /opt/software
wget http://www.oberhumer.com/opensource/lzo/download/lzo-2.09.tar.gz
tar -zxvf lzo-2.09.tar.gz
cd lzo-2.09
./configure --enable-shared --prefix=/usr/local/hadoop/lzo/
make && make test && make install
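If the build succeeds, the shared libraries should land under the chosen prefix. A quick check (the exact file list may vary slightly by version; this is roughly what lzo-2.09 installs):
ls /usr/local/hadoop/lzo/lib
# expect liblzo2.a, liblzo2.la, liblzo2.so, liblzo2.so.2, liblzo2.so.2.0.0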

2. Copy the libraries
Copy /usr/local/hadoop/lzo/lib/* into /usr/lib/ and /usr/lib64/:
cp /usr/local/hadoop/lzo/lib/* /usr/lib/
cp /usr/local/hadoop/lzo/lib/* /usr/lib64/
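As an alternative to copying the libraries around, the prefix's lib directory can be registered with the dynamic linker instead; a minimal sketch (run as root, path assumes the prefix used above):
echo "/usr/local/hadoop/lzo/lib" > /etc/ld.so.conf.d/lzo.conf
ldconfig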

3. Edit the environment variables (vi ~/.bash_profile) and add:
export PATH=/usr/local/hadoop/lzo/:$PATH

III. Install lzop
1. Download and extract
cd /opt/software
wget http://www.lzop.org/download/lzop-1.04.tar.gz
tar -zxvf lzop-1.04.tar.gz

2. Before compiling, add the following environment variable (~/.bash_profile):
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include/
Note: without this variable, the build fails with: configure: error: LZO header files not found. Please check your installation or set the environment variable `CPPFLAGS'.

3. Enter the extracted directory, then compile and install
cd /opt/software/lzop-1.04
./configure --enable-shared --prefix=/usr/local/hadoop/lzop
make  && make install

4. Link lzop into /usr/bin/
ln -s /usr/local/hadoop/lzop/bin/lzop /usr/bin/lzop

5. Test lzop
Run it against any file, for example: lzop nohup.out
If a compressed file with the .lzo suffix (here nohup.out.lzo) is produced, the installation works.
Note: the test may fail with: lzop: error while loading shared libraries: liblzo2.so.2: cannot open shared object file: No such file or directory
    Fix: add an environment variable (~/.bash_profile): export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64

IV. Install Hadoop-LZO
Note: the build requires Maven; have Maven installed and configured beforehand.
1. Download the source: https://github.com/twitter/hadoop-lzo

2. Extract and build:
cd /opt/software/hadoop-lzo-release-0.4.19
mvn clean package -Dmaven.test.skip=true
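If Maven cannot find the LZO headers or libraries (likely here, since LZO was installed under a non-standard prefix), exporting the search paths before the build usually resolves it; a sketch using the prefix from the LZO install above:
export C_INCLUDE_PATH=/usr/local/hadoop/lzo/include
export LIBRARY_PATH=/usr/local/hadoop/lzo/lib
mvn clean package -Dmaven.test.skip=true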

3. After the build completes, run:
tar -cBf - -C target/native/Linux-amd64-64/lib . | tar -xBvf - -C /app/hadoop-2.6.0-cdh5.7.0/lib/native
cp target/hadoop-lzo-0.4.19.jar /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/

In a cluster environment, the next step is to sync /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar and /app/hadoop-2.6.0-cdh5.7.0/lib/native/ to all other Hadoop nodes; a sketch follows below.
Note: make sure the user running Hadoop has execute permission on the native libraries under /app/hadoop-2.6.0-cdh5.7.0/lib/native/.
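A minimal sync sketch, assuming hypothetical worker hostnames hadoop002 and hadoop003 (replace with your own nodes):
for host in hadoop002 hadoop003; do   # hypothetical hostnames
  scp /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar $host:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/
  rsync -av /app/hadoop-2.6.0-cdh5.7.0/lib/native/ $host:/app/hadoop-2.6.0-cdh5.7.0/lib/native/
done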

V. Generate an index file
cd /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common
hadoop jar hadoop-lzo-0.4.19.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/page_views_parquet1/page_views_parquet.lzo
Note: the .lzo file must already reside in HDFS.
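For many files or very large files, hadoop-lzo also ships a MapReduce-based indexer that does the same job as a distributed job; a sketch using the same jar and input path as above:
hadoop jar hadoop-lzo-0.4.19.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/page_views_parquet1/page_views_parquet.lzo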

This completes the LZO installation on CentOS 7.


Enabling LZO Compression in Hadoop

Configure Hadoop to enable LZO compression, then verify it. Steps:
I. Add the following to Hadoop's hadoop-env.sh:
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib



II. Add the following to core-site.xml:
<!-- list of supported compression codecs -->
<property>
    <name>io.compression.codecs</name>
    <value>
      org.apache.hadoop.io.compress.GzipCodec,
      org.apache.hadoop.io.compress.DefaultCodec,
      org.apache.hadoop.io.compress.BZip2Codec,
      org.apache.hadoop.io.compress.SnappyCodec,
      com.hadoop.compression.lzo.LzoCodec,
      com.hadoop.compression.lzo.LzopCodec
    </value>
</property>
<!-- codec class used for LZO -->
<property>
   <name>io.compression.codec.lzo.class</name>
   <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
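After restarting HDFS, the effective codec list can be sanity-checked from the client-side configuration:
hdfs getconf -confKey io.compression.codecs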



III. Add the following to mapred-site.xml:
<!-- compress intermediate map output -->
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<!-- codec for intermediate map output (mapreduce.map.output.compress.codec is the Hadoop 2.x name; the old mapred.map.output.compression.codec key is deprecated) -->
<property>
   <name>mapreduce.map.output.compress.codec</name>
   <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<!-- compress final job output -->
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>
<!-- codec for final job output -->
<property>
   <name>mapreduce.output.fileoutputformat.compress.codec</name>
   <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>
<!-- native library path for child task JVMs -->
<property>
    <name>mapred.child.env</name>
    <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>



IV. Test with Hadoop's built-in wordcount example
1. Generate an .lzo output file
cd /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce
hadoop jar hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount /input/test1.txt /output/wc2


Result:
[hadoop@spark220 mapreduce]$ hdfs dfs -ls  /output/wc2
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2018-03-17 00:21 /output/wc2/_SUCCESS
-rw-r--r--   1 hadoop supergroup        113 2018-03-17 00:21 /output/wc2/part-r-00000.lzo
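To confirm the output really is LZO-compressed text, let HDFS decode it via the registered codec; this should print the word counts in plain text:
hdfs dfs -text /output/wc2/part-r-00000.lzo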

2. Generate the index file:
cd /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common
hadoop jar hadoop-lzo-0.4.19.jar com.hadoop.compression.lzo.LzoIndexer /output/wc2/part-r-00000.lzo


Log output:
18/03/17 00:23:05 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
18/03/17 00:23:05 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev 049362b7cf53ff5f739d6b1532457f2c6cd495e8]
18/03/17 00:23:06 INFO lzo.LzoIndexer: [INDEX] LZO Indexing file /output/wc2/part-r-00000.lzo, size 0.00 GB...
18/03/17 00:23:07 INFO Configuration.deprecation: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
18/03/17 00:23:07 INFO lzo.LzoIndexer: Completed LZO Indexing in 0.80 seconds (0.00 MB/s).  Index size is 0.01 KB.

Result:
[hadoop@spark220 common]$ hdfs dfs -ls  /output/wc2
Found 3 items
-rw-r--r--   1 hadoop supergroup          0 2018-03-17 00:21 /output/wc2/_SUCCESS
-rw-r--r--   1 hadoop supergroup        113 2018-03-17 00:21 /output/wc2/part-r-00000.lzo
-rw-r--r--   1 hadoop supergroup          8 2018-03-17 00:23 /output/wc2/part-r-00000.lzo.index

This completes the configuration and testing.


Enabling LZO Compression in Spark

Configure Spark to enable LZO compression. Steps:
I. spark-env.sh configuration
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/tools/lib/*:/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/jars/*



II. spark-defaults.conf configuration
spark.driver.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar
spark.executor.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar


Note: these point to the hadoop-lzo jar produced by the build above.
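Also note that SPARK_CLASSPATH has been deprecated since Spark 1.0; if spark-shell aborts with "Found both spark.executor.extraClassPath and SPARK_CLASSPATH. Use only the former.", remove the SPARK_CLASSPATH export from spark-env.sh and rely on the spark-defaults.conf entries above.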

III. Testing
1. Read an .lzo file
spark-shell --master local[2]
scala> import com.hadoop.compression.lzo.LzopCodec
scala> val page_views = sc.textFile("/user/hive/warehouse/page_views_lzo/page_views.dat.lzo")
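A plain sc.textFile treats each .lzo file as a single, unsplittable input. To take advantage of the .lzo.index files, hadoop-lzo's input format can be used instead; a sketch run in the same spark-shell (LzoTextInputFormat comes from the hadoop-lzo jar, and the path reuses the one above):
scala> import com.hadoop.mapreduce.LzoTextInputFormat
scala> import org.apache.hadoop.io.{LongWritable, Text}
scala> val rdd = sc.newAPIHadoopFile("/user/hive/warehouse/page_views_lzo/page_views.dat.lzo", classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])
scala> rdd.map(_._2.toString).take(3).foreach(println)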


2. Write an .lzo file
spark-shell --master local[2]
scala> import com.hadoop.compression.lzo.LzopCodec
scala> val lzoTest = sc.parallelize(1 to 10)
scala> lzoTest.saveAsTextFile("/input/test_lzo", classOf[LzopCodec])
Result:
[hadoop@spark220 common]$ hdfs dfs -ls /input/test_lzo
Found 3 items
-rw-r--r--   1 hadoop supergroup          0 2018-03-16 23:24 /input/test_lzo/_SUCCESS
-rw-r--r--   1 hadoop supergroup         60 2018-03-16 23:24 /input/test_lzo/part-00000.lzo
-rw-r--r--   1 hadoop supergroup         61 2018-03-16 23:24 /input/test_lzo/part-00001.lzo
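As a round-trip check in the same shell, reading the output directory back should count the 10 values just written (the empty _SUCCESS marker contributes no lines):
scala> sc.textFile("/input/test_lzo").count()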



This completes the configuration and testing.

IV. Problems encountered during configuration and testing
1. Native libraries referenced, but LD_LIBRARY_PATH not set
   1.1 Error:
Caused by: java.lang.RuntimeException: native-lzo library not available
      at com.hadoop.compression.lzo.LzopCodec.getDecompressorType(LzopCodec.java:120)
      at org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:178)
      at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:111)
      at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
      at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:246)
      at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:245)
      at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:203)
      at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:94)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
      at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      at org.apache.spark.scheduler.Task.run(Task.scala:108)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)



   1.2 Fix: add the following to spark-env.sh in Spark's conf directory:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/app/hadoop-2.6.0-cdh5.7.0/lib/native
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/yarn/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/hdfs/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce/lib/*:/app/hadoop-2.6.0-cdh5.7.0/share/hadoop/tools/lib/*:/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/jars/*


2. LzopCodec class not found
   2.1 Error:
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzopCodec not found.
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:135)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:175)
    at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzopCodec not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:128)


   2.2 Fix: add the following to spark-defaults.conf in Spark's conf directory:
spark.driver.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar
spark.executor.extraClassPath /app/hadoop-2.6.0-cdh5.7.0/share/hadoop/common/hadoop-lzo-0.4.19.jar



