
Classic Hive SQL Optimization Cases



Questions addressed:

1. How do you optimize queries on a table with hundreds of millions of rows?
2. How do you optimize a complex SQL query?
3. How do you optimize a complex SQL query that uses functions?




Hive SQL Optimization Case 1:
1.1 The query to be executed (it took over an hour to return results):

  SELECT dt as DATA_DATE,STRATEGY,AB_GROUP,SOURCE,
      count(distinct case when lower(event) not like '%push%' and event!='corner_mark_show' then udid else null end) as DAU,
      count(case when event='client_show' then 1 else null end) as TOTAL_VSHOW,
      count(distinct case when event='client_show' then vid else null end) as TOTAL_VIDEO_VSHOW,
      count(case when event='video_play' then 1 else null end) as TOTAL_VV_VP,
      count(distinct case when event='video_play' then udid else null end) as TOTAL_USERS_VP,
      count(case when event='effective_play' then 1 else null end) as TOTAL_VV_EP,
      count(distinct case when event='effective_play' then udid else null end) as TOTAL_USERS_EP,
      sum(case when event='video_over' then duration else 0 end) as TOTAL_DURATION,
      count(case when event='video_over' then 1 else null end) as TOTAL_VOVER,
      sum(case when event='video_over' then play_cnts else 0 end) as TOTAL_VOVER_PCNTS,
      count(case when event='push_video_clk' then 1 else null end) as TOTAL_PUSH_VC,
      count(distinct case when event='app_start' and body_source = 'push' then udid else null end) as TOTAL_PUSH_START,
      count(case when event='post_comment' then 1 else null end) as TOTAL_REPLY,
      count(distinct case when event='post_comment' then udid else null end) as TOTAL_USERS_REPLY
  FROM dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl
  group by dt,strategy,ab_group,source;

1.2 The table involved in the query holds 770 million+ rows (verified with the query below):

  jdbc:hive2://ks-hdp-master-01.dns.rightpad (default)> select count(*) from dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl;


1.3 Optimization idea: since the query groups by the four fields dt, strategy, ab_group, source, we can build the table partitioned by some combination of those fields (1, 2, 3, or all 4 of them). Queries like count(distinct ...) then operate on much smaller data subsets and should run far more efficiently, because each subtask no longer has to de-duplicate and aggregate over all 770 million+ rows.
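As a side note (my illustration, not from the original post; the literal filter values are made up), the payoff of partitioning by these keys is that Hive prunes partitions at compile time: a query constrained to one group reads a single partition of the table built in 1.5 below, instead of scanning all 770 million+ rows:

  -- Hedged sketch: with dt/source/strategy/ab_group as partition keys,
  -- this query touches exactly one partition.
  SELECT count(DISTINCT udid) AS dau
  FROM dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_lym
  WHERE dt = '2019-03-01' AND source = '1' AND strategy = '1' AND ab_group = '1';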


1.4 First, check how many partition values each field would produce (a Hive table should not have too many partitions; as a rule of thumb, the partitions touched by a single query should be kept under about 2,000):

  jdbc:hive2://ks-hdp-master-01.dns.rightpad (default)>
  select count(distinct dt) as dis_dt, count(distinct strategy) as dis_strategy,
  count(distinct ab_group) as dis_ab_group,
  count(distinct source) as dis_source
  from dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl;

[Screenshot: the distinct counts are dis_dt = 1, dis_strategy = 14, dis_ab_group = 72, dis_source = 2.]

  [hue@ks-hdp-client-v02 10:55:08 /usr/local/hue]$ python
  Python 2.7.12 (default, Dec  4 2017, 14:50:18)
  [GCC 5.4.0 20160609] on linux2
  Type "help", "copyright", "credits" or "license" for more information.
  >>> 2*14*72
  2016
  # source (2) * strategy (14) * ab_group (72); dt contributes a factor of 1.
  # 2016 partitions is acceptable.

1.5 Create a new partitioned table modeled on the original table, and insert the original data into it:

  jdbc:hive2://ks-hdp-master-01.dns.rightpad (default)> show create table dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl;

[Screenshot: the original table's DDL.]

Create the new table, partitioned by dt, source, strategy, ab_group (note the order: the convention is to put the columns with fewer distinct values first; from the query in 1.4 we know dt = 1, source = 2, strategy = 14, ab_group = 72).

  create external table `dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_lym`(
    event string,
    udid string,
    vid string,
    duration string,
    body_source string,
    play_cnts string
  )
  PARTITIONED BY (
    dt string,
    source string,
    strategy string,
    ab_group string
  );

Insert the original table's data into the new table:

  insert into `dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_lym` partition(dt,source,strategy,ab_group)
  select event,udid,vid,duration,body_source,play_cnts,dt,source,strategy,ab_group
  from `dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl`;
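One practical note not shown in the original post (so treat the exact values as my assumption): a dynamic-partition insert that creates roughly 2,016 partitions in one statement exceeds Hive's defaults (strict mode is on by default, and hive.exec.max.dynamic.partitions defaults to 1000), so session settings along these lines are typically required before running the insert:

  -- Hedged sketch: enable dynamic partitioning and raise the limits
  -- above the ~2016 partitions this insert creates at once.
  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;
  SET hive.exec.max.dynamic.partitions = 5000;
  SET hive.exec.max.dynamic.partitions.pernode = 5000;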

Verify that the two tables hold the same data:

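The screenshot of this check is not reproduced here; a minimal version of the comparison (my sketch) is a row count on each table:

  -- Both counts should match (770 million+ rows).
  select count(*) from dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_zkl;
  select count(*) from dwb_v8sp_tmp.base_report_bystrategy_byab_source_column_lym;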

1.6 Run the query against the new table: the same query as in 1.1, pointed at the new partitioned table, returns in about 5 minutes (down from over an hour):


Hive SQL Optimization Case 2:

Problem: a complex SQL query that errors out after running for a while; in practice it never returns a result.

Window (analytic) functions over large tables are not Hive's strong suit. In cases like this, we break the query into several smaller result sets, make good use of Hive partitioned tables, and then join the pieces back together.

2.1 The query to be executed:

  create table bi_tmp.aloha_UserLoyalty_190301_190303 as
      select aid, imei, idfa, udid, event, duration, dt, time_local, hour, source,
          first_value(time_local) over(partition by udid, event order by time_local) as first_time,
          last_value(time_local) over(partition by udid, event order by time_local) as last_time,
          count(time_local) over(partition by udid, event, dt) as event_count_per_day,
          sum(duration) over(partition by udid, event, dt) as event_duration_each_day
      from dwb_v8sp.event_column_info_new_hour
      where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment',
          'comment_success')
          and dt >= '2019-03-01' and dt <= '2019-03-03';

  -- Row-count check, used later to validate the rewritten version:
  select count(*)
  from dwb_v8sp.event_column_info_new_hour
  where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 'comment_success')
  and dt >= '2019-03-01' and dt <= '2019-03-03';


Check how many distinct event values are involved (these become the partition values of the intermediate tables):

  select count(distinct event) as dis_event
  from dwb_v8sp.event_column_info_new_hour
  where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 'comment_success')
  and dt >= '2019-03-01' and dt <= '2019-03-03';

Decompose the query into three subsets and save them to three tables: bi_tmp.zyt1, bi_tmp.zyt2, bi_tmp.zyt3. (Note that min(time_local) and max(time_local) grouped by udid and event give the intended first and last event times; they also sidestep the pitfall that last_value() with the default window frame returns the current row rather than the true last value.)

  -- drop table if exists bi_tmp.zyt1;
  create table bi_tmp.zyt1 partitioned by(event)
  as
  select udid,
         min(time_local) as first_time,
         max(time_local) as last_time,
         event
  from dwb_v8sp.event_column_info_new_hour
  where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 'comment_success')
  and dt >= '2019-03-01' and dt <= '2019-03-03'
  group by udid, event;

  -- drop table if exists bi_tmp.zyt2 purge;
  create table bi_tmp.zyt2 partitioned by(dt,event)
  as
  select udid,
         count(time_local) as event_count_per_day,
         sum(duration) as event_duration_each_day,
         dt,
         event
  from dwb_v8sp.event_column_info_new_hour
  where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment', 'comment_success')
  and dt >= '2019-03-01' and dt <= '2019-03-03'
  group by udid, dt, event;

  create table bi_tmp.zyt3 partitioned by(dt,event)
  as select aid, imei, idfa, udid, duration, time_local, hour, source, dt, event
  from dwb_v8sp.event_column_info_new_hour t3
      where event in ('app_start', 'app_exit', 'effective_play', 'share_succ', 'like', 'unlike', 'like_comment', 'unlike_comment',
          'comment_success')
          and dt >= '2019-03-01' and dt <= '2019-03-03';

  -- Build the target table:
  create table bi_tmp.aloha_UserLoyalty_190301_190303 as
      select t3.aid, t3.imei, t3.idfa, t3.udid, t3.event, t3.duration, t3.dt, t3.time_local, t3.hour, t3.source,
          t1.first_time,
          t1.last_time,
          t2.event_count_per_day,
          t2.event_duration_each_day
      from bi_tmp.zyt1 t1 join bi_tmp.zyt2 t2 on t1.event=t2.event and t1.udid=t2.udid
      join bi_tmp.zyt3 t3 on t2.dt=t3.dt and t2.event=t3.event and t2.udid=t3.udid;

  -- Validate: the row count matches the count(*) query in 2.1.

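The validation screenshot is not reproduced here; the check it shows amounts to comparing the result table's row count against the count(*) query from 2.1 (my sketch):

  -- Should equal the count(*) of the filtered source query in 2.1.
  select count(*) from bi_tmp.aloha_UserLoyalty_190301_190303;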

Hive SQL Optimization Case 3:

Problem: the SQL below (shown in the screenshot) calls the PERCENTILE_APPROX function many times; my initial analysis was that the sheer number of calls caused the severe performance degradation.

Digging into the function, I found that it accepts an array of percentiles in a single call, which makes the optimization straightforward.
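Since the original SQL survives only as a screenshot, here is a hedged reconstruction of the pattern with a hypothetical table and column. The rewrite collapses many separate PERCENTILE_APPROX calls into one call that takes an array of percentiles and returns an array<double>:

  -- Before (hypothetical names): one heavy aggregation buffer per call.
  SELECT percentile_approx(duration, 0.5)  AS p50,
         percentile_approx(duration, 0.75) AS p75,
         percentile_approx(duration, 0.9)  AS p90,
         percentile_approx(duration, 0.99) AS p99
  FROM dwb_v8sp.some_event_table;

  -- After: one call computes all percentiles at once; index into the array.
  SELECT pcts[0] AS p50, pcts[1] AS p75, pcts[2] AS p90, pcts[3] AS p99
  FROM (
    SELECT percentile_approx(duration, array(0.5, 0.75, 0.9, 0.99)) AS pcts
    FROM dwb_v8sp.some_event_table
  ) t;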

[Screenshot: the original SQL, which appears only as an image in the source post.]

3.1 Performance of the original SQL:

[Screenshot: timing of the original SQL.]

3.2 Performance of the optimized SQL:

[Screenshot: timing of the optimized SQL.]

The optimized SQL runs more than 4x faster.








---------------------

Author: 大数据技术与架构
Source: WeChat (weixin)
Original: Hive SQL经典优化案例




