分享

Hive分析窗口函数(一) SUM,AVG,MIN,MAX

gefieder 2015-4-29 23:27:30 发表于 连载型 [显示全部楼层] 回帖奖励 阅读模式 关闭右栏 2 142089
问题导读:
1、Hive分析窗口函数SUM如何实现?
2、Hive分析窗口函数AVG脚本如何编写?
3、Hive分析窗口函数MIN、MAX脚本如何实现?




Hive中提供了越来越多的分析函数,用于完成负责的统计分析。抽时间将所有的分析窗口函数理一遍,将陆续发布。
今天先看几个基础的,SUM、AVG、MIN、MAX。
用于实现分组内所有和连续累积的统计。
Hive版本为 apache-hive-0.13.1数据准备
  1.     CREATE EXTERNAL TABLE lxw1234 (
  2.     cookieid string,
  3.     createtime string, --day
  4.     pv INT
  5.     ) ROW FORMAT DELIMITED
  6.     FIELDS TERMINATED BY ','
  7.     stored as textfile location '/tmp/lxw11/';
  8.      
  9.     DESC lxw1234;
  10.     cookieid STRING
  11.     createtime STRING
  12.     pv INT
  13.      
  14.     hive> select * from lxw1234;
  15.     OK
  16.     cookie1 2015-04-10 1
  17.     cookie1 2015-04-11 5
  18.     cookie1 2015-04-12 7
  19.     cookie1 2015-04-13 3
  20.     cookie1 2015-04-14 2
  21.     cookie1 2015-04-15 4
  22.     cookie1 2015-04-16 4
复制代码

SUM — 注意,结果和ORDER BY相关,默认为升序
  1.     SELECT cookieid,
  2.     createtime,
  3.     pv,
  4.     SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
  5.     SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
  6.     SUM(pv) OVER(PARTITION BY cookieid) AS pv3,        --分组内所有行
  7.     SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行
  8.     SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行
  9.     SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行
  10.     FROM lxw1234;
  11.      
  12.     cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
  13.     -----------------------------------------------------------------------------
  14.     cookie1 2015-04-10 1 1 1 26 1 6 26
  15.     cookie1 2015-04-11 5 6 6 26 6 13 25
  16.     cookie1 2015-04-12 7 13 13 26 13 16 20
  17.     cookie1 2015-04-13 3 16 16 26 16 18 13
  18.     cookie1 2015-04-14 2 18 18 26 17 21 10
  19.     cookie1 2015-04-15 4 22 22 26 16 20 8
  20.     cookie1 2015-04-16 4 26 26 26 13 13 4
复制代码
pv1: 分组内从起点到当前行的pv累积,如,11号的pv1=10号的pv+11号的pv, 12号=10号+11号+12号
pv2: 同pv1
pv3: 分组内(cookie1)所有的pv累加
pv4: 分组内当前行+往前3行,如,11号=10号+11号, 12号=10号+11号+12号, 13号=10号+11号+12号+13号, 14号=11号+12号+13号+14号
pv5: 分组内当前行+往前3行+往后1行,如,14号=11号+12号+13号+14号+15号=5+7+3+2+4=21
pv6: 分组内当前行+往后所有行,如,13号=13号+14号+15号+16号=3+2+4+4=13,14号=14号+15号+16号=2+4+4=10

如果不指定ROWS BETWEEN,默认为从起点到当前行;
如果不指定ORDER BY,则将分组内所有值累加;
关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点
–其他AVG,MIN,MAX,和SUM用法一样。
  1.     --AVG
  2.     SELECT cookieid,
  3.     createtime,
  4.     pv,
  5.     AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
  6.     AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
  7.     AVG(pv) OVER(PARTITION BY cookieid) AS pv3,        --分组内所有行
  8.     AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行
  9.     AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行
  10.     AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行
  11.     FROM lxw1234;
  12.     cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
  13.     -----------------------------------------------------------------------------
  14.     cookie1 2015-04-10 1 1.0 1.0 3.7142857142857144 1.0 3.0 3.7142857142857144
  15.     cookie1 2015-04-11 5 3.0 3.0 3.7142857142857144 3.0 4.333333333333333 4.166666666666667
  16.     cookie1 2015-04-12 7 4.333333333333333 4.333333333333333 3.7142857142857144 4.333333333333333 4.0 4.0
  17.     cookie1 2015-04-13 3 4.0 4.0 3.7142857142857144 4.0 3.6 3.25
  18.     cookie1 2015-04-14 2 3.6 3.6 3.7142857142857144 4.25 4.2 3.3333333333333335
  19.     cookie1 2015-04-15 4 3.6666666666666665 3.6666666666666665 3.7142857142857144 4.0 4.0 4.0
  20.     cookie1 2015-04-16 4 3.7142857142857144 3.7142857142857144 3.7142857142857144 3.25 3.25 4.0
复制代码
  1.     --MIN
  2.     SELECT cookieid,
  3.     createtime,
  4.     pv,
  5.     MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
  6.     MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
  7.     MIN(pv) OVER(PARTITION BY cookieid) AS pv3,        --分组内所有行
  8.     MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行
  9.     MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行
  10.     MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行
  11.     FROM lxw1234;
  12.      
  13.     cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
  14.     -----------------------------------------------------------------------------
  15.     cookie1 2015-04-10 1 1 1 1 1 1 1
  16.     cookie1 2015-04-11 5 1 1 1 1 1 2
  17.     cookie1 2015-04-12 7 1 1 1 1 1 2
  18.     cookie1 2015-04-13 3 1 1 1 1 1 2
  19.     cookie1 2015-04-14 2 1 1 1 2 2 2
  20.     cookie1 2015-04-15 4 1 1 1 2 2 4
  21.     cookie1 2015-04-16 4 1 1 1 2 2 4
复制代码
  1.     ----MAX
  2.     SELECT cookieid,
  3.     createtime,
  4.     pv,
  5.     MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- 默认为从起点到当前行
  6.     MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, --从起点到当前行,结果同pv1
  7.     MAX(pv) OVER(PARTITION BY cookieid) AS pv3,        --分组内所有行
  8.     MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, --当前行+往前3行
  9.     MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, --当前行+往前3行+往后1行
  10.     MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 ---当前行+往后所有行
  11.     FROM lxw1234;
  12.      
  13.     cookieid createtime pv pv1 pv2 pv3 pv4 pv5 pv6
  14.     -----------------------------------------------------------------------------
  15.     cookie1 2015-04-10 1 1 1 7 1 5 7
  16.     cookie1 2015-04-11 5 5 5 7 5 7 7
  17.     cookie1 2015-04-12 7 7 7 7 7 7 7
  18.     cookie1 2015-04-13 3 7 7 7 7 7 4
  19.     cookie1 2015-04-14 2 7 7 7 7 7 4
  20.     cookie1 2015-04-15 4 7 7 7 7 7 4
  21.     cookie1 2015-04-16 4 7 7 7 4 4 4
复制代码
其他函数的介绍将陆续整理发布。。



相关内容:


Hive分析窗口函数(一) SUM,AVG,MIN,MAX

Hive分析窗口函数(二) NTILE,ROW_NUMBER,RANK,DENSE_RANK


Hive分析窗口函数(四) LAG,LEAD,FIRST_VALUE,LAST_VALUE


Hive分析窗口函数(五) GROUPING SETS,GROUPING__ID,CUBE,ROLLUP











资料来源:http://029bigdata.com/?p=176

欢迎加入about云群371358502、39327136,云计算爱好者群,亦可关注about云腾讯认证空间||关注本站微信

已有(2)人评论

跳转到指定楼层
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

关闭

推荐上一条 /2 下一条