
Big Data Project: E-Commerce Data Warehouse (Summary) (Part 5): Business Data Warehouse

Guiding questions:
1. How do we count, by month, the users who purchased a product two or more times?
2. How do we design the user purchase detail table (wide table)?
3. How do we compute the top-10 products by repurchase rate for each user level?
4. How do we count order counts per product for each user level?


Previous post: Big Data Project: E-Commerce Data Warehouse (Summary) (Part 4): Business Data Warehouse

Chapter 6 Requirement 3: Brand Repurchase Rate
Requirement: count, by month, the users who purchased a product two or more times.
6.1 Repurchase Rate Calculation

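The calculation works on a per-user monthly order count: within a statistics month, a user who placed at least one order for a brand counts toward buycount, at least two orders toward buy_twice_last, and at least three toward buy_3times_last. The single repurchase rate is then buy_twice_last / buycount, and the multiple repurchase rate is buy_3times_last / buycount. Below is a minimal sketch of this sum(if(...)) pattern, assuming a hypothetical per_user_month_count table that already holds one monthly order_count row per user and brand (the full query appears in 6.3.2):

select
    sum(if(order_count>=1,1,0)) buycount,                                         -- buyers
    sum(if(order_count>=2,1,0)) buy_twice_last,                                   -- buyers with 2+ orders
    sum(if(order_count>=2,1,0))/sum(if(order_count>=1,1,0)) buy_twice_last_ratio  -- single repurchase rate
from per_user_month_count;  -- hypothetical pre-aggregated input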

6.2 DWS Layer
6.2.1 User Purchase Detail Table (Wide Table)



hive (gmall)>
drop table if exists dws_sale_detail_daycount;
create external table dws_sale_detail_daycount
(
    user_id string comment 'user id',
    sku_id string comment 'sku id',
    user_gender string comment 'user gender',
    user_age string comment 'user age',
    user_level string comment 'user level',
    order_price decimal(10,2) comment 'sku price',
    sku_name string comment 'sku name',
    sku_tm_id string comment 'brand id',
    sku_category3_id string comment 'level-3 category id',
    sku_category2_id string comment 'level-2 category id',
    sku_category1_id string comment 'level-1 category id',
    sku_category3_name string comment 'level-3 category name',
    sku_category2_name string comment 'level-2 category name',
    sku_category1_name string comment 'level-1 category name',
    spu_id string comment 'spu id',
    sku_num int comment 'quantity purchased',
    order_count bigint comment 'orders placed that day',
    order_amount decimal(16,2) comment 'order amount that day'
) COMMENT 'user purchase detail table'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dws/dws_sale_detail_daycount/'
tblproperties ("parquet.compression"="snappy");
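An optional quick check that the table was created with the intended storage format and HDFS location:

hive (gmall)>
desc formatted dws_sale_detail_daycount;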

6.2.2 Data Load
hive (gmall)>
with
tmp_detail as
(
    select
        user_id,
        sku_id,
        sum(sku_num) sku_num,
        count(*) order_count,
        sum(od.order_price*sku_num) order_amount
    from dwd_order_detail od
    where od.dt='2019-02-10'
    group by user_id, sku_id
)
insert overwrite table dws_sale_detail_daycount partition(dt='2019-02-10')
select
    tmp_detail.user_id,
    tmp_detail.sku_id,
    u.gender,
    months_between('2019-02-10', u.birthday)/12 age,
    u.user_level,
    price,
    sku_name,
    tm_id,
    category3_id,
    category2_id,
    category1_id,
    category3_name,
    category2_name,
    category1_name,
    spu_id,
    tmp_detail.sku_num,
    tmp_detail.order_count,
    tmp_detail.order_amount
from tmp_detail
left join dwd_user_info u on tmp_detail.user_id=u.id and u.dt='2019-02-10'
left join dwd_sku_info s on tmp_detail.sku_id=s.id and s.dt='2019-02-10';
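Optionally verify the load; each row in the new partition is one user/sku combination for the day:

hive (gmall)>
select count(*) from dws_sale_detail_daycount where dt='2019-02-10';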


6.2.3 Data Load Script
1) Create the script dws_sale.sh in the /home/kgg/bin directory
[kgg@hadoop102 bin]$ vim dws_sale.sh
    Enter the following content in the script:
#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
with
tmp_detail as
(
    select
        user_id,
        sku_id,
        sum(sku_num) sku_num,
        count(*) order_count,
        sum(od.order_price*sku_num) order_amount
    from "$APP".dwd_order_detail od
    where od.dt='$do_date'
    group by user_id, sku_id
)
insert overwrite table "$APP".dws_sale_detail_daycount partition(dt='$do_date')
select
    tmp_detail.user_id,
    tmp_detail.sku_id,
    u.gender,
    months_between('$do_date', u.birthday)/12 age,
    u.user_level,
    price,
    sku_name,
    tm_id,
    category3_id,
    category2_id,
    category1_id,
    category3_name,
    category2_name,
    category1_name,
    spu_id,
    tmp_detail.sku_num,
    tmp_detail.order_count,
    tmp_detail.order_amount
from tmp_detail
left join "$APP".dwd_user_info u
on tmp_detail.user_id=u.id and u.dt='$do_date'
left join "$APP".dwd_sku_info s on tmp_detail.sku_id=s.id and s.dt='$do_date';
"
$hive -e "$sql"


2) Grant execute permission on the script
[kgg@hadoop102 bin]$ chmod 777 dws_sale.sh

3) Run the script to load data
[kgg@hadoop102 bin]$ dws_sale.sh 2019-02-11

4) Verify the loaded data
hive (gmall)>
select * from dws_sale_detail_daycount where dt='2019-02-11' limit 2;

6.3 ADS Layer: Brand Repurchase Rate



6.3.1 Table Creation Statement
hive (gmall)>
drop table if exists ads_sale_tm_category1_stat_mn;
create external table ads_sale_tm_category1_stat_mn
(
    tm_id string comment 'brand id',
    category1_id string comment 'level-1 category id',
    category1_name string comment 'level-1 category name',
    buycount bigint comment 'number of buyers',
    buy_twice_last bigint comment 'buyers with two or more purchases',
    buy_twice_last_ratio decimal(10,2) comment 'single repurchase rate',
    buy_3times_last bigint comment 'buyers with three or more purchases',
    buy_3times_last_ratio decimal(10,2) comment 'multiple repurchase rate',
    stat_mn string comment 'statistics month',
    stat_date string comment 'statistics date'
) COMMENT 'repurchase rate statistics'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_sale_tm_category1_stat_mn/';

6.3.2 Data Load
1) Load the data
hive (gmall)>
insert into table ads_sale_tm_category1_stat_mn
select
    mn.sku_tm_id,
    mn.sku_category1_id,
    mn.sku_category1_name,
    sum(if(mn.order_count>=1,1,0)) buycount,
    sum(if(mn.order_count>=2,1,0)) buy_twice_last,
    sum(if(mn.order_count>=2,1,0))/sum(if(mn.order_count>=1,1,0)) buy_twice_last_ratio,
    sum(if(mn.order_count>=3,1,0)) buy_3times_last,
    sum(if(mn.order_count>=3,1,0))/sum(if(mn.order_count>=1,1,0)) buy_3times_last_ratio,
    date_format('2019-02-10','yyyy-MM') stat_mn,
    '2019-02-10' stat_date
from
(
    select
        user_id,
        sd.sku_tm_id,
        sd.sku_category1_id,
        sd.sku_category1_name,
        sum(order_count) order_count
    from dws_sale_detail_daycount sd
    where date_format(dt,'yyyy-MM')=date_format('2019-02-10','yyyy-MM')
    group by user_id, sd.sku_tm_id, sd.sku_category1_id, sd.sku_category1_name
) mn
group by mn.sku_tm_id, mn.sku_category1_id, mn.sku_category1_name;
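A quick worked example of the ratio logic: the inner subquery first collapses the daily wide table into one monthly order_count per user, brand, and level-1 category, so order_count>=2 really means "this user ordered the brand at least twice during the month". If 100 users bought a brand at least once in the month and 25 of them bought it at least twice, then buy_twice_last_ratio = 25/100 = 0.25.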


2) Query the loaded data
hive (gmall)> select * from ads_sale_tm_category1_stat_mn;

6.3.3 Data Load Script
1) Create the script ads_sale.sh in the /home/kgg/bin directory
[kgg@hadoop102 bin]$ vim ads_sale.sh
    Enter the following content in the script:
#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
sql="
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table "$APP".ads_sale_tm_category1_stat_mn
select
    mn.sku_tm_id,
    mn.sku_category1_id,
    mn.sku_category1_name,
    sum(if(mn.order_count>=1,1,0)) buycount,
    sum(if(mn.order_count>=2,1,0)) buy_twice_last,
    sum(if(mn.order_count>=2,1,0))/sum(if(mn.order_count>=1,1,0)) buy_twice_last_ratio,
    sum(if(mn.order_count>=3,1,0)) buy_3times_last,
    sum(if(mn.order_count>=3,1,0))/sum(if(mn.order_count>=1,1,0)) buy_3times_last_ratio,
    date_format('$do_date','yyyy-MM') stat_mn,
    '$do_date' stat_date
from
(
    select
        user_id,
        od.sku_tm_id,
        od.sku_category1_id,
        od.sku_category1_name,
        sum(order_count) order_count
    from "$APP".dws_sale_detail_daycount od
    where date_format(dt,'yyyy-MM')=date_format('$do_date','yyyy-MM')
    group by user_id, od.sku_tm_id, od.sku_category1_id, od.sku_category1_name
) mn
group by mn.sku_tm_id, mn.sku_category1_id, mn.sku_category1_name;
"
$hive -e "$sql"

2) Grant execute permission on the script
[kgg@hadoop102 bin]$ chmod 777 ads_sale.sh

3) Run the script to load data
[kgg@hadoop102 bin]$ ads_sale.sh 2019-02-11


4) Verify the loaded data
hive (gmall)>
select * from ads_sale_tm_category1_stat_mn limit 2;


Chapter 7 Requirement 4: Top-10 Products by Repurchase Rate per User Level
7.1 DWS Layer
Use the user purchase detail wide table (dws_sale_detail_daycount) as the DWS data source.
7.2 ADS Layer
7.2.1 Table Creation Statement
hive (gmall)>
drop table if exists ads_ul_rep_ratio;
create external table ads_ul_rep_ratio
(
    user_level string comment 'user level',
    sku_id string comment 'sku id',
    buy_count bigint comment 'total buyers',
    buy_twice_count bigint comment 'buyers with two or more purchases',
    buy_twice_rate decimal(10,2) comment 'second-purchase repurchase rate',
    rank string comment 'rank',
    stat_date string comment 'statistics date'
) COMMENT 'repurchase rate statistics'
row format delimited fields terminated by '\t'
location '/warehouse/gmall/ads/ads_ul_rep_ratio/';

7.2.2 Data Load
hive (gmall)>
with
tmp_count as(
    select -- order count per user and per product within each user level
        user_level,
        user_id,
        sku_id,
        sum(order_count) order_count
    from dws_sale_detail_daycount
    where dt<='2019-10-05'
    group by user_level, user_id, sku_id
)
insert overwrite table ads_ul_rep_ratio
select
    *
from(
    select
        user_level,
        sku_id,
        sum(if(order_count>=1, 1, 0)) buy_count,
        sum(if(order_count>=2, 1, 0)) buy_twice_count,
        sum(if(order_count>=2, 1, 0)) / sum(if(order_count>=1, 1, 0)) * 100 buy_twice_rate,
        row_number() over(partition by user_level order by sum(if(order_count>=2, 1, 0)) / sum(if(order_count>=1, 1, 0)) desc) rn,
        '2019-10-05'
    from tmp_count
    group by user_level, sku_id
) t1
where rn<=10;
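row_number() numbers rows independently within each user_level partition, ordered by the repurchase rate, so the outer filter rn<=10 keeps the ten best products per level. For example, if level-1 users give sku A a rate of 40, sku B a rate of 35, and sku C a rate of 10, those rows get rn = 1, 2, 3 within level 1, while level 2's rows are numbered separately, starting again from 1.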

7.2.3 Data Load Script
1) Create the script ads_ul_rep_ratio.sh in the /home/kgg/bin directory
[kgg@hadoop102 bin]$ vim ads_ul_rep_ratio.sh
    Enter the following content in the script:
#!/bin/bash
db=gmall
hive=/opt/module/hive-1.2.1/bin/hive
# If a date argument is given, use it; otherwise default to yesterday
if [[ -n $1 ]]; then
    do_date=$1
else
    do_date=`date -d '-1 day' +%F`
fi
sql="
use gmall;
with
tmp_count as(
    select -- order count per user and per product within each user level
        user_level,
        user_id,
        sku_id,
        sum(order_count) order_count
    from dws_sale_detail_daycount
    where dt<='$do_date'
    group by user_level, user_id, sku_id
)
insert overwrite table ads_ul_rep_ratio
select
    *
from(
    select
        user_level,
        sku_id,
        sum(if(order_count>=1, 1, 0)) buy_count,
        sum(if(order_count>=2, 1, 0)) buy_twice_count,
        sum(if(order_count>=2, 1, 0)) / sum(if(order_count>=1, 1, 0)) * 100 buy_twice_rate,
        row_number() over(partition by user_level order by sum(if(order_count>=2, 1, 0)) / sum(if(order_count>=1, 1, 0)) desc) rn,
        '$do_date'
    from tmp_count
    group by user_level, sku_id
) t1
where rn<=10;
"
$hive -e "$sql"

2) Grant execute permission on the script
[kgg@hadoop102 bin]$ chmod 777 ads_ul_rep_ratio.sh

3) Run the script
[kgg@hadoop102 module]$ ads_ul_rep_ratio.sh 2019-02-20

4) Query the result
hive (gmall)> select * from ads_ul_rep_ratio;

5) Script execution time
In production, these scripts typically run daily between 00:30 and 01:00 a.m.
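A minimal scheduling sketch, assuming plain crontab (a workflow scheduler such as Azkaban or Oozie would also work and handles job ordering explicitly) and the script paths used above; with no date argument each script defaults to yesterday:

# min hour dom mon dow  command  -- staggered so DWS finishes before ADS reads it
30 0 * * * /home/kgg/bin/dws_sale.sh
40 0 * * * /home/kgg/bin/ads_sale.sh
50 0 * * * /home/kgg/bin/ads_ul_rep_ratio.sh

Note the fixed offsets only approximate the DWS-to-ADS dependency; a real pipeline should chain the jobs on success rather than on the clock.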

