大数据项目之电商数仓（总结）（六）：系统业务数据仓库

问题导读：
1、如何统计新付费用户数？
2、如何使用付费用户数以及新增用户表作为数据源？
3、如何统计每个用户最近一次购买时间？
4、如何设计用户行为宽表作为DWS层数据？

上一篇：大数据项目之电商数仓（总结）（五）：系统业务数据仓库

第8章需求五：新付费用户数

8.1 DWS层
8.1.1 建表语句

drop table if exists dws_pay_user_detail;
create external table dws_pay_user_detail(   
    `user_id` string comment '付费用户id',
    `name` string comment '付费用户姓名',
    `birthday` string COMMENT '',
    `gender` string COMMENT '',
    `email` string COMMENT '',
    `user_level` string COMMENT ''
) COMMENT '付费用户表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_pay_user_detail/';
复制代码

8.1.2 导入数据

insert overwrite table dws_pay_user_detail partition(dt='2019-10-03')
select
   ua.user_id,
   ui.name,
   ui.birthday,
   ui.gender,
   ui.email,
   ui.user_level
from (
  select user_id from dws_user_action where dt='2019-10-03' 
) ua join( 
  select * from dwd_user_info where dt='2019-10-03'
) ui on ua.user_id=ui.id
left join dws_pay_user_detail ud on ua.user_id=ud.user_id
where ud.user_id is null
复制代码

8.1.3 导入数据脚本
1）在/home/kgg/bin目录下创建脚本dws_pay_user_detail.sh

[kgg@hadoop102 bin]$ vim dws_pay_user_detail.sh
在脚本中填写如下内容
#!/bin/bash
db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop

if [[ -n $1 ]]; then
    do_date=$1
else
    do_date=`date -d '-1 day' +%F`
fi

sql="
use gmall;
insert overwrite table dws_pay_user_detail partition(dt='$do_date')
select
   ua.user_id,
   ui.name,
   ui.birthday,
   ui.gender,
   ui.email,
   ui.user_level
from (
  select user_id from dws_user_action where dt='$do_date'
) ua join(
  select * from dwd_user_info where dt='$do_date'
) ui on ua.user_id=ui.id
left join dws_pay_user_detail ud on ua.user_id=ud.user_id
where ud.user_id is null;
"

$hive -e "$sql"
复制代码

2）增加脚本执行权限

[kgg@hadoop102 bin]$ chmod 777 dws_pay_user_detail.sh
复制代码

3）脚本使用

[kgg@hadoop102 module]$ dws_pay_user_detail.sh 2019-02-20
复制代码

4）查询结果

hive (gmall)> select * from dws_pay_user_detail;
复制代码

5）脚本执行时间

企业开发中一般在每天凌晨30分~1点
复制代码

8.2 ADS层
8.2.1 建表语句

drop table if exists ads_pay_user_count;
create external table ads_pay_user_count(   
    dt string COMMENT '统计日期',
    pay_count   bigint  COMMENT '付费用户数'
) COMMENT '付费用户表'
stored as parquet
location '/warehouse/gmall/dws/ads_pay_user_count/';
复制代码

8.2.2 导入数据

insert into table ads_pay_user_count
select
    '2019-02-10',
    count(*) pay_count
from
    dws_pay_user_detail
where
    dt='2019-02-10';
复制代码

8.2.3 导入数据脚本
1）在/home/kgg/bin目录下创建脚本ads_pay_user_count.sh

[kgg@hadoop102 bin]$ vim ads_pay_user_count.sh
在脚本中填写如下内容
#!/bin/bash

# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop

# 如果是输入的日期按照取输入日期；如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
   do_date=$1
else 
   do_date=`date -d "-1 day" +%F`
fi

echo "===日志日期为 $do_date==="
sql="
insert into table "$APP".ads_pay_user_count
select
    '$do_date',
    count(*) pay_count
from
    "$APP".dws_pay_user_detail
where
    dt='$do_date';
"

$hive -e "$sql"
复制代码

2）增加脚本执行权限

[kgg@hadoop102 bin]$ chmod 777 ads_pay_user_count.sh
复制代码

3）脚本使用

[kgg@hadoop102 module]$ ads_pay_user_count.sh 2019-02-20
复制代码

4）查询结果

hive (gmall)> select * from ads_pay_user_count;
复制代码

5）脚本执行时间
企业开发中一般在每天凌晨30分~1点

第9章需求六：付费用户率
9.1 DWS层
使用付费用户数以及新增用户表作为数据源

9.2 ADS层
9.2.1 建表语句

drop table if exists ads_pay_user_ratio;
create external table ads_pay_user_ratio (   
    dt string comment '统计日期',
    pay_count   bigint  comment '总付费用户数',
    user_count bigint comment '总用户数',
    pay_count_ratio decimal(10,2) COMMENT '付费用户比率'
) COMMENT '付费用户率表'
stored as parquet
location '/warehouse/gmall/dws/ads_pay_user_ratio';
复制代码

9.2.2 导入数据

insert into table ads_pay_user_ratio
select
    '2019-02-10' dt,
    pay_count,
    new_mid_count,
    pay_count/new_mid_count*100 pay_count_ratio
from
    (select
        '2019-02-10' dt,
        pay_count
    from
        ads_pay_user_count
    )pay_user
join
    (select
        '2019-02-10' dt, 
        sum(new_mid_count) new_mid_count
    from
        ads_new_mid_count
    where
        create_date<='2019-02-10'
    )user_total_count
on
    pay_user.dt=user_total_count.dt;
复制代码

9.2.3 导入数据脚本
1）在/home/kgg/bin目录下创建脚本ads_pay_user_ratio.sh

[kgg@hadoop102 bin]$ vim ads_pay_user_ratio.sh
在脚本中填写如下内容
#!/bin/bash

# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop

# 如果是输入的日期按照取输入日期；如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
   do_date=$1
else 
   do_date=`date -d "-1 day" +%F`
fi

echo "===日志日期为 $do_date==="
sql="
insert into table "$APP".ads_pay_user_ratio
select
    '$do_date' dt,
    pay_count,
    new_mid_count,
    pay_count/new_mid_count*100 pay_count_ratio
from
    (select
        '$do_date' dt,
        pay_count
    from
        "$APP".ads_pay_user_count
    )pay_user
join
    (select
        '$do_date' dt, 
        sum(new_mid_count) new_mid_count
    from
        "$APP".ads_new_mid_count
    where
        create_date<='$do_date'
    )user_total_count
on
    pay_user.dt=user_total_count.dt;
"

$hive -e "$sql"
复制代码

2）增加脚本执行权限

[kgg@hadoop102 bin]$ chmod 777 ads_pay_user_ratio.sh
复制代码

3）脚本使用

[kgg@hadoop102 module]$ ads_pay_user_ratio.sh 2019-02-20
复制代码

4）查询结果

hive (gmall)> select * from ads_pay_user_ratio;
复制代码

5）脚本执行时间

<div>企业开发中一般在每天凌晨30分~1点</div>
复制代码

第10章需求七：每个用户最近一次购买时间

10.1 DWS层
使用用户行为宽表作为DWS层数据

10.2 ADS层
10.2.1 建表语句

drop table if exists ads_user_last_pay;
create external table ads_user_last_pay(
    user_id   string  comment '用户id',
    pay_date string comment '最近一次购买时间'
) COMMENT '用户最近一次购买时间表'
stored as parquet
location '/warehouse/gmall/dws/ads_user_last_pay/';
复制代码

10.2.2 导入数据
初始化数据

insert into table ads_user_last_pay
select
    user_id,
    '2019-02-10'
from
    dws_user_action
where
    dt='2019-02-10'
    and
    payment_amount>0;
复制代码

导入其他日期数据

insert overwrite table ads_user_last_pay
select
    if(du.user_id is null, au.user_id, du.user_id),  
    if(du.user_id is null, au.pay_date,'2019-02-11')
from
    ads_user_last_pay  au
full join
    (select
        user_id
    from
        dws_user_action
    where
        dt='2019-02-11'
        and
        payment_amount>0) du
on
    au.user_id=du.user_id;
复制代码

10.2.3 导入数据脚本

1）在/home/kgg/bin目录下创建脚本ads_user_last_pay.sh
[kgg@hadoop102 bin]$ vim ads_user_last_pay.sh
在脚本中填写如下内容
#!/bin/bash

# 定义变量方便修改
APP=gmall
hive=/opt/module/hive/bin/hive
hadoop=/opt/module/hadoop-2.7.2/bin/hadoop

# 如果是输入的日期按照取输入日期；如果没输入日期取当前时间的前一天
if [ -n "$1" ] ;then
   do_date=$1
else 
   do_date=`date -d "-1 day" +%F`
fi

echo "===日志日期为 $do_date==="
sql="
insert overwrite table "$APP".ads_user_last_pay
select
    if(du.user_id is null, au.user_id, du.user_id),  
    if(du.user_id is null, au.pay_date,'$do_date')
from
    "$APP".ads_user_last_pay  au
full join
    (select
        user_id
    from
        "$APP".dws_user_action
    where
        dt='$do_date'
        and
        payment_amount>0) du
on
    au.user_id=du.user_id;
"

$hive -e "$sql"
复制代码

2）增加脚本执行权限

[kgg@hadoop102 bin]$ chmod 777 ads_user_last_pay.sh
复制代码

3）脚本使用

[kgg@hadoop102 module]$ ads_user_last_pay.sh 2019-02-12
复制代码

4）查询结果

hive (gmall)> select * from ads_user_last_pay;
复制代码

5）脚本执行时间

企业开发中一般在每天凌晨30分~1点
复制代码

最新经典文章，欢迎关注公众号

图文精华

大数据项目之电商数仓（总结）（六）：系统业务数据仓库

最佳新人

热心会员

推荐 /2