大数据项目之电商数仓（总结）（十二）:用户行为数据仓库

问题导读：
1、如何统计最近7天未登录的用户？
2、如何使用周活明细表dws_uv_detail_wk输出数据？
3、如何统计最近连续三周活跃用户数？
4、如何使用日活明细表dws_uv_detail_day输出数据？

上一篇：大数据项目之电商数仓（总结）（十一）:用户行为数据仓库

第12章需求六：流失用户数
流失用户：最近7天未登录我们称之为流失用户

12.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据

12.2 ADS层

1）建表语句

hive (gmall)>
drop table if exists ads_wastage_count;
create external table ads_wastage_count(
    `dt` string COMMENT '统计日期',
    `wastage_count` bigint COMMENT '流失设备数'
)
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_wastage_count';
复制代码

2）导入2019-02-20数据

hive (gmall)>
insert into table ads_wastage_count
select
     '2019-02-20',
     count(*)
from 
(
    select mid_id
from dws_uv_detail_day
    group by mid_id
    having max(dt)<=date_add('2019-02-20',-7)
)t1;
复制代码

12.3 编写脚本
1）在hadoop102的/home/kgg/bin目录下创建脚本

[kgg@hadoop102 bin]$ vim ads_wastage_log.sh
    在脚本中编写如下内容
#!/bin/bash

if [ -n "$1" ];then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_wastage_count
select
     '$do_date',
     count(*)
from 
(
    select mid_id
    from "$APP".dws_uv_detail_day
    group by mid_id
    having max(dt)<=date_add('$do_date',-7)
)t1;"

$hive -e "$sql"
复制代码

2）增加脚本执行权限

[kgg@hadoop102 bin]$ chmod 777 ads_wastage_log.sh
复制代码

3）脚本使用

[kgg@hadoop102 module]$ ads_wastage_log.sh 2019-02-20
复制代码

4）查询结果

hive (gmall)> select * from ads_wastage_count;
复制代码

5）脚本执行时间
企业开发中一般在每日凌晨30分~1点

第13章需求七：最近连续三周活跃用户数
最近3周连续活跃的用户：通常是周一对前3周的数据做统计，该数据一周计算一次。

13.1 DWS层
使用周活明细表dws_uv_detail_wk作为DWS层数据

13.2 ADS层

1）建表语句

hive (gmall)>
drop table if exists ads_continuity_wk_count;
create external table ads_continuity_wk_count( 
    `dt` string COMMENT '统计日期,一般用结束周周日日期,如果每天计算一次,可用当天日期',
    `wk_dt` string COMMENT '持续时间',
    `continuity_count` bigint
) 
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_wk_count';
复制代码

2）导入2019-02-20所在周的数据

hive (gmall)>
insert into table ads_continuity_wk_count
select
     '2019-02-20',
     concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-1)),
     count(*)
from 
(
    select mid_id
    from dws_uv_detail_wk
    where wk_dt>=concat(date_add(next_day('2019-02-20','MO'),-7*3),'_',date_add(next_day('2019-02-20','MO'),-7*2-1)) 
    and wk_dt<=concat(date_add(next_day('2019-02-20','MO'),-7),'_',date_add(next_day('2019-02-20','MO'),-1))
    group by mid_id
    having count(*)=3
)t1;
复制代码

3）查询

hive (gmall)> select * from ads_continuity_wk_count;
复制代码

13.3 编写脚本
1）在hadoop102的/home/kgg/bin目录下创建脚本

[kgg@hadoop102 bin]$ vim ads_continuity_wk_log.sh
    在脚本中编写如下内容
#!/bin/bash

if [ -n "$1" ];then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_continuity_wk_count
select 
     '$do_date',
     concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-1)),
     count(*)
from 
(
    select mid_id
    from "$APP".dws_uv_detail_wk
    where wk_dt>=concat(date_add(next_day('$do_date','MO'),-7*3),'_',date_add(next_day('$do_date','MO'),-7*2-1)) 
    and wk_dt<=concat(date_add(next_day('$do_date','MO'),-7),'_',date_add(next_day('$do_date','MO'),-1))
    group by mid_id
    having count(*)=3
)t1;"

$hive -e "$sql"
复制代码

2）增加脚本执行权限

[kgg@hadoop102 bin]$ chmod 777 ads_continuity_wk_log.sh
复制代码

3）脚本使用

[kgg@hadoop102 module]$ ads_continuity_wk_log.sh 2019-02-20
复制代码

4）查询结果

hive (gmall)> select * from ads_continuity_wk_count;
复制代码

5）脚本执行时间
企业开发中一般在每周一凌晨30分~1点

第14章需求八：最近七天内连续三天活跃用户数
说明：最近7天内连续3天活跃用户数

14.1 DWS层
使用日活明细表dws_uv_detail_day作为DWS层数据

14.2 ADS层

1）建表语句

hive (gmall)>
drop table if exists ads_continuity_uv_count;
create external table ads_continuity_uv_count( 
    `dt` string COMMENT '统计日期',
    `wk_dt` string COMMENT '最近7天日期',
    `continuity_count` bigint
) COMMENT '连续活跃设备数'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_continuity_uv_count';
复制代码

2）写出导入数据的SQL语句

hive (gmall)>
insert into table ads_continuity_uv_count
select
    '2019-02-12',
    concat(date_add('2019-02-12',-6),'_','2019-02-12'),
    count(*)
from
(
    select mid_id
    from
    (
        select mid_id      
        from
        (
            select 
                mid_id,
                date_sub(dt,rank) date_dif
            from
            (
                select 
                    mid_id,
                    dt,
                    rank() over(partition by mid_id order by dt) rank
                from dws_uv_detail_day
                where dt>=date_add('2019-02-12',-6) and dt<='2019-02-12'
            )t1
        )t2 
        group by mid_id,date_dif
        having count(*)>=3
    )t3 
    group by mid_id
)t4;
复制代码

3）查询

hive (gmall)> select * from ads_continuity_uv_count;
复制代码

14.3 编写脚本
1）在hadoop102的/home/kgg/bin目录下创建脚本

[kgg@hadoop102 bin]$ vim ads_continuity_log.sh
    在脚本中编写如下内容
#!/bin/bash

if [ -n "$1" ];then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

hive=/opt/module/hive/bin/hive
APP=gmall

echo "-----------导入日期$do_date-----------"

sql="
insert into table "$APP".ads_continuity_uv_count
select 
     '$do_date',
     concat(date_add('$do_date',-6),'_','$do_date') dt,
     count(*) 
from 
(
    select mid_id
    from
    (
        select mid_id
        from 
        (
            select
                mid_id,
                date_sub(dt,rank) date_diff
            from 
            (
                select 
                    mid_id,
                    dt,
                    rank() over(partition by mid_id order by dt) rank
                from "$APP".dws_uv_detail_day
                where dt>=date_add('$do_date',-6) and dt<='$do_date'
            )t1
        )t2
        group by mid_id,date_diff
        having count(*)>=3
    )t3 
    group by mid_id
)t4;
"

$hive -e "$sql"
复制代码

2）增加脚本执行权限

[kgg@hadoop102 bin]$ chmod 777 ads_continuity_log.sh
复制代码

3）脚本使用

[kgg@hadoop102 module]$ ads_continuity_log.sh 2019-02-12
复制代码

4）查询结果

hive (gmall)> select * from ads_continuity_uv_count;
复制代码

5）脚本执行时间
企业开发中一般在每日凌晨30分~1点

最新经典文章，欢迎关注公众号

图文精华

大数据项目之电商数仓（总结）（十二）:用户行为数据仓库

最佳新人

热心会员

推荐 /2