
Hive optimization notes summarized from work

These optimizations are all solutions to bottlenecks we actually hit at the company, the essence of what worked. Of course, in the big data era any problem has more than one solution; wise readers will judge for themselves.

The slowstart parameter controls how far the map phase must get before reduce tasks start. The default is 0.05; setting it to 1 means reduce only starts after the maps have pulled all of their data. That way, when map-side data skew drags out the map phase, you are not holding slots for both the map and reduce stages at the same time, which is essentially a memory-utilization issue.
set mapreduce.job.reduce.slowstart.completedmaps=1;


(Data applications are) like sausages: it is best not to see how they are made. -- Otto von Bismarck

Job priority: set mapred.job.priority=VERY_HIGH;

stage-label.sql has far too many small files, and they are tiny, only a few KB each; at its worst a single job produced more than 90,000 maps and 100 reduces.

First SQL:
Setting the reduce count had no effect, merging map inputs had no effect, and compression had no effect. Setting bzip2 compression finally helped somewhat, because bzip2 supports splitting files, but it is very hard on the CPU, given 1-2 TB of data per hour.
I decided to drop gzip and bzip2 and use DeflateCodec (only in stage-label.sql).
Solution:
1. The script contains many SQL statements, so I optimized the SQL itself. Aliases cannot be used in the where clause, and some statements wrapped a large-table query in a nested subquery just to filter on a type field. I removed the nesting and put the function-processed field directly in the where clause, without an alias (see the sketch after this list).
2. Some statements produce no reduce phase; add distribute by rand(1234) after the where clause to force a reduce, together with the mapred.reduce.tasks parameter.
3. Add compression properties: set hive.exec.compress.output=true;
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.DeflateCodec;
set mapred.reduce.tasks=50;
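
A minimal sketch of points 1 and 2 above; the table and column names (tgt_label, src_log, uid, url, type_col) and the getType UDF are placeholders assumed for illustration:
-- instead of wrapping the big query in a subquery just to filter on an aliased type field,
-- filter on the function-processed field directly (no alias in the where clause)
set mapred.reduce.tasks=50;
insert overwrite table tgt_label
select uid, url, getType(type_col)
from src_log
where getType(type_col) = 'car'
distribute by rand(1234);   -- forces a reduce phase so the many tiny map outputs get merged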

65.5 GB of source data came out to 24.5 GB in the target table; after DeflateCodec compression it is 8.5 GB, a compression ratio of roughly 35%.

Second SQL:
This one uses the previous SQL's target table as its source and inserts into another table. The compressed files are not splittable and are not decompressed before the map phase, so only 50 maps were produced, and again there was no reduce phase.
It produced dozens of KB-sized small files; clearly setting one global reduce count for the whole script is unreasonable.
Solution:
1. In the getEc UDF: if the input parameter is null, return an empty string; if it is the literal 'null', it is not handled and could also return an empty string (not optimized for now). Remove the where ec is null condition, then remove the nesting and put the UDF-processed field directly in the where clause; also change the filter from the UDF-processed field != '' to the raw input field != ''. This calls the UDF less often, drops one where condition, and removes one level of nesting (one fewer pass over a large volume of data). See the sketch after this list.
2. Force a reduce phase and set the reduce count again (overriding the earlier reduce-count setting): set mapred.reduce.tasks=1;
This works: the small files do get merged, so DeflateCodec output can be merged after all, but the reduce phase is slow; my guess is that deserializing, merging, and then re-compressing is what slows it down.
I also tested without compression, only overriding the reduce-count parameter; the reduce phase is indeed much faster, since it only has to deserialize. (I kept the default compression in the end; its efficiency is acceptable.)
set mapreduce.reduce.tasks=1;
distribute by rand(1);
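
A minimal sketch of point 1 plus the forced reduce; getEc and the ec column come from the text above, while src_tab, tgt_tab and uid are placeholders assumed for illustration:
set mapreduce.reduce.tasks=1;
insert overwrite table tgt_tab
select getEc(ec), uid
from src_tab
where ec != ''              -- filter on the raw column instead of the UDF output, no nesting
distribute by rand(1);      -- forces a single reduce so the KB-sized files are merged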

Third SQL:



To be optimized:
use db;
insert into table tj_idfa
select '${date_desc}',count(distinct imei),
count(distinct idfa),count(distinct id_card),count(distinct tel_no),count(distinct user_name)
from (     
select imei,idfa,id_card,tel_no,user_name  
from z02_dw_tab_telphone_info_cde where pt_date='${date_desc}'
group by imei,imsi,idfa,id_card,tel_no,user_name) b;
Optimization 1:
use db;
create table tmp_z02_dw_tab_telphone_info_cde
as
select imei,idfa,id_card,tel_no,user_name  
from z02_dw_tab_telphone_info_cde where pt_date='${date_desc}'
group by imei,imsi,idfa,id_card,tel_no,user_name;
insert into table tj_idfa
select '${date_desc}',
count(distinct case when imei is null or imei = 'null' or imei = 'NULL' then '999' else imei end),
count(distinct case when idfa is null or idfa = 'null' or idfa = 'NULL' then '999' else idfa end),
count(distinct case when id_card is null or id_card = 'null' or id_card = 'NULL' then '999' else id_card end),
count(distinct case when tel_no is null or tel_no = 'null' or tel_no = 'NULL' then '999' else tel_no end),
count(distinct case when user_name is null or user_name = 'null' or user_name = 'NULL' then '999' else user_name end)
from tmp_z02_dw_tab_telphone_info_cde;
Optimization 2:
use db;
create table tmp_z02_dw_tab_telphone_info_cde
as
select imei,idfa,id_card,tel_no,user_name  
from z02_dw_tab_telphone_info_cde where pt_date='${date_desc}'
group by imei,imsi,idfa,id_card,tel_no,user_name;

from tmp_z02_dw_tab_telphone_info_cde
insert into table tj_idfa
select '${date_desc}',1+count(distinct case when imei = 'null' or imei = 'NULL' then '999' else imei end),
'' as idfa,'' as id_card,'' as tel_no,'' as user_name where imei is not null
insert into table tj_idfa
select '${date_desc}','' as imei,1+count(distinct case when idfa = 'null' or idfa = 'NULL' then '999' else idfa end),
'' as id_card,'' as tel_no,'' as user_name where idfa is not null
insert into table tj_idfa
select '${date_desc}','' as imei,'' as idfa,1+count(distinct case when id_card = 'null' or id_card = 'NULL' then '999' else id_card end),
'' as tel_no,'' as user_name where id_card is not null
insert into table tj_idfa
select '${date_desc}','' as imei,'' as idfa,'' as id_card,1+count(distinct case when tel_no = 'null' or tel_no = 'NULL' then '999' else tel_no end),
'' as user_name where tel_no is not null
insert into table tj_idfa
select '${date_desc}','' as imei,'' as idfa,'' as id_card,'' as tel_no,1+count(distinct case when user_name = 'null' or user_name = 'NULL' then '999' else user_name end)
where user_name is not null;



Switching to the Tez execution engine:
It has a few inconveniences: it cannot interact with the local Linux disk, so load data local inpath does not work and add jar with a local jar does not work (uploading the jar to HDFS fixes that); also, when you need to use a table, Tez cannot be used if some partition of it has no data underneath.
A good workaround:
When loading local data, set hive.execution.engine=mr; or use hdfs dfs -put and then alter table.
Write a shell script to clean out partition directories that hold no data, or simply drop the Hive table partition when there are no files (a handy trick: with the Tez engine, run select count(*) from tabname; Tez will list the partition directories that contain no data; copy them out and -rmr them).
Of course it is even better to script and automate this (see the sketch below).
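
A minimal sketch of the workaround; the local path, the partition values, and the pt_date partition column of tabname are placeholders assumed for illustration:
-- loading local data: fall back to the MR engine, since Tez cannot read the local disk here
set hive.execution.engine=mr;
load data local inpath '/tmp/label.txt' into table tabname partition (pt_date='20170213');
set hive.execution.engine=tez;
-- drop a partition whose HDFS directory holds no data, so Tez can query the table again
alter table tabname drop if exists partition (pt_date='20170201');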

The benefit is that when submitting jobs, initialization happens only once within a single session.

When executing a statement in the CLI window, if it runs for a long time without returning a result, it errors out with:
Error: org.apache.thrift.transport.TTransportException: java.net.SocketException: Broken pipe (state=08S01,code=0)


Disk storage solution:
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;
insert overwrite table compress_test partition(d) select a,b,c,'20160517' from compress_test where d = 20160517; -- the raw data is uncompressed and the CPU is fairly saturated, so DefaultCodec is the first choice

set mapred.child.java.opts=-Xmx1024m;
set mapred.reduce.tasks=500;
set hive.optimize.skewjoin = true;
set hive.auto.convert.join = true;
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

Key points for optimization:
1. Indexes.
2. When joining tables, push filtering down to the source of the main (driving) table.
3. Use count(distinct) only when you must, and then go through a subquery: select count(*) from (select column from tab group by column) t.
4. Avoid three levels of nesting; if multiple levels really are required, use a temporary table.
5. For joins between a large table and a small table, use mapjoin.
6. Analyze the data, e.g. the null rate: if the only empty values in a field are '', there is no need to write a long list of null filters.
7. Avoid joining on null keys.
8. Sort out the source tables and data lineage; shared data should be produced once by an upstream layer, or, when several target tables in one script share a source table, use multi group by (multi-insert).
9. Parameter tuning; avoid order by for sorting; compression settings; control of small files; handling of data skew.
10. When a union all statement gets extremely complex, split it into two scripts: first insert overwrite, then insert into.
11. Data analysis matters: for example, if you know the url field has many null values and you need the UV, use select count(distinct url)+1 from tab where url is not null.
12. One more optimization worth mentioning: when parsing inside a UDF it can pay to use a Map, and filters like where column != '' and column != 'null' and column != 'NULL' and column is not null can be wrapped into one UDF, checkNull, that accepts multiple parameters and returns false if any of them matches one of those four null-like values. A sketch of the mapjoin and count(distinct) points follows this list.
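
A minimal sketch of the mapjoin and count(distinct) points above; big_tab, small_tab and their columns are placeholders assumed for illustration:
-- large/small table join: hint the small side into a map-side join
select /*+ MAPJOIN(s) */ b.uid, s.city_name
from big_tab b join small_tab s on b.city_id = s.city_id;
-- count(distinct) through a group-by subquery, so the distinct work is spread across reducers
select count(*) from (select uid from big_tab group by uid) t;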


set hive.groupby.skewindata=true;
set hive.hadoop.supports.splittable.combineinputformat=true; -- merge small files
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.map.aggr=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set mapred.max.split.size=256000000;
set hive.merge.smallfiles.avgsize=32000000;
set hive.merge.size.per.task=67108864;
set hive.exec.reducers.bytes.per.reducer=67108864;
set mapred.job.reuse.jvm.num.tasks=-1;
set hive.exec.compress.output=true; -- default is false
set mapred.output.compress=true; -- default is false
set mapred.output.compression.type=BLOCK; -- default is RECORD
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;
set io.compression.codecs=org.apache.hadoop.io.compress.BZip2Codec; -- or DeflateCodec
set mapreduce.job.reduce.slowstart.completedmaps=1; -- run reduce only after all maps finish, maximizing memory utilization
set hive.exec.parallel=true; -- whether stages may run in parallel, default is false
set hive.exec.parallel.thread.number=8; -- default is 8

Temporary tables are best created as partitioned tables as well, to avoid the case where an hourly job has not finished yet and the next hour's job drops the table created by the previous hour. With a partitioned temp table, each run only needs to drop the partition from two hours earlier; alternatively, put the hour in the temp table's name (see the sketch below).
When finally dropping intermediate tables, drop both the current hour's and the previous hour's, in case the previous hour's job died before it could drop its intermediate table.
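
A minimal sketch of the hourly partitioned temp table; tmp_stage, src_log and the ${hour}/${hour_minus_2} variables are placeholders assumed to be supplied by the scheduler:
-- each hourly run writes its own partition instead of recreating the whole temp table
create table if not exists tmp_stage (uid string, url string) partitioned by (pt_hour string);
insert overwrite table tmp_stage partition (pt_hour='${hour}')
select uid, url from src_log where pt_hour='${hour}';
-- clean up the partition from two hours ago rather than dropping a table another job may still need
alter table tmp_stage drop if exists partition (pt_hour='${hour_minus_2}');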








Multi-insert example:
from dwd_ev_pub_ec_info_text
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'Ec_汽车' as data_type,'ec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='car'
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'Ec_金融' as data_type,'ec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='fin'
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'Ec_家电' as data_type,'ec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='hea'
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'Ec_房产' as data_type,'ec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='house'
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'Ec_手机' as data_type,'ec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='mobile';

from dwd_ev_pub_act_inter_text
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'notEc_汽车','notec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='car'
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'notEc_金融','notec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='fin'
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'notEc_家电','notec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='hea'
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'notEc_房产','notec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='house'
insert into table z03_total_count partition(pt_date='${date_desc}')
select 'notEc_手机','notec' as ec_type,count(1) as totals_count
where  pt_hour rlike '${date_desc}*' and indu_cd='mobile';


Comments (4)

arsenduan posted on 2017-2-13 15:05:50:

The OP seems to be missing some SQL. What are the actual statements for the first and second SQL?


zhuqitian posted on 2017-2-13 15:29:48, replying to arsenduan:

These are optimization notes I recorded half a year ago and pasted here without much editing; the missing SQL may have been accidentally cut out at some point.

Missway posted on 2017-2-14 09:57:04:
Saved straight to my cloud notes.

叫我胖哥 posted on 2017-2-14 13:46:15:
Thanks OP, bookmarked.