分享

大数据项目之电商数仓(总结)(九):用户行为数据仓库

问题导读:
1、如何设计当日(dau)、当周、当月活动的每个设备明细表?
2、如何根据日用户访问明细,获得周用户访问明细?
3、如何设计当日、当周、当月活跃设备数表?
4、如何编写ADS层加载数据脚本?


上一篇:大数据项目之电商数仓(总结)(八):用户行为数据仓库

第6章 需求一:用户活跃主题

6.1 DWS层
目标:统计当日(dau)、当周、当月活动的每个设备明细


6.1.1 每日活跃设备明细
2020-12-09_200533.jpg

1)建表语句
  1. hive (gmall)>
  2. drop table if exists dws_uv_detail_day;
  3. create external table dws_uv_detail_day
  4. (
  5.     `mid_id` string COMMENT '设备唯一标识',
  6.     `user_id` string COMMENT '用户标识',
  7.     `version_code` string COMMENT '程序版本号',
  8.     `version_name` string COMMENT '程序版本名',
  9.     `lang` string COMMENT '系统语言',
  10.     `source` string COMMENT '渠道号',
  11.     `os` string COMMENT '安卓系统版本',
  12.     `area` string COMMENT '区域',
  13.     `model` string COMMENT '手机型号',
  14.     `brand` string COMMENT '手机品牌',
  15.     `sdk_version` string COMMENT 'sdkVersion',
  16.     `gmail` string COMMENT 'gmail',
  17.     `height_width` string COMMENT '屏幕宽高',
  18.     `app_time` string COMMENT '客户端日志产生时的时间',
  19.     `network` string COMMENT '网络模式',
  20.     `lng` string COMMENT '经度',
  21.     `lat` string COMMENT '纬度'
  22. )
  23. partitioned by(dt string)
  24. stored as parquet
  25. location '/warehouse/gmall/dws/dws_uv_detail_day';
复制代码




2)数据导入
以用户单日访问为key进行聚合,如果某个用户在一天中使用了两种操作系统、两个系统版本、多个地区,登录不同账号,只取其中之一
  1. hive (gmall)>
  2. set hive.exec.dynamic.partition.mode=nonstrict;
  3. insert overwrite table dws_uv_detail_day
  4. partition(dt='2020-10-14')
  5. select  
  6.     mid_id,
  7.     concat_ws('|', collect_set(user_id)) user_id,
  8.     concat_ws('|', collect_set(version_code)) version_code,
  9.     concat_ws('|', collect_set(version_name)) version_name,
  10.     concat_ws('|', collect_set(lang))lang,
  11.     concat_ws('|', collect_set(source)) source,
  12.     concat_ws('|', collect_set(os)) os,
  13.     concat_ws('|', collect_set(area)) area,
  14.     concat_ws('|', collect_set(model)) model,
  15.     concat_ws('|', collect_set(brand)) brand,
  16.     concat_ws('|', collect_set(sdk_version)) sdk_version,
  17.     concat_ws('|', collect_set(gmail)) gmail,
  18.     concat_ws('|', collect_set(height_width)) height_width,
  19.     concat_ws('|', collect_set(app_time)) app_time,
  20.     concat_ws('|', collect_set(network)) network,
  21.     concat_ws('|', collect_set(lng)) lng,
  22.     concat_ws('|', collect_set(lat)) lat
  23. from dwd_start_log
  24. where dt='2020-10-14'
  25. group by mid_id;
复制代码


3)查询导入结果
  1. hive (gmall)> select * from dws_uv_detail_day limit 1;
  2. hive (gmall)> select count(*) from dws_uv_detail_day;
复制代码


4)思考:不同渠道来源的每日活跃数统计怎么计算?

6.1.2 每周活跃设备明细

2020-12-09_200614.jpg

根据日用户访问明细,获得周用户访问明细。
1)建表语句
  1. hive (gmall)>
  2. drop table if exists dws_uv_detail_wk;
  3. create external table dws_uv_detail_wk(
  4.     `mid_id` string COMMENT '设备唯一标识',
  5.     `user_id` string COMMENT '用户标识',
  6.     `version_code` string COMMENT '程序版本号',
  7.     `version_name` string COMMENT '程序版本名',
  8.     `lang` string COMMENT '系统语言',
  9.     `source` string COMMENT '渠道号',
  10.     `os` string COMMENT '安卓系统版本',
  11.     `area` string COMMENT '区域',
  12.     `model` string COMMENT '手机型号',
  13.     `brand` string COMMENT '手机品牌',
  14.     `sdk_version` string COMMENT 'sdkVersion',
  15.     `gmail` string COMMENT 'gmail',
  16.     `height_width` string COMMENT '屏幕宽高',
  17.     `app_time` string COMMENT '客户端日志产生时的时间',
  18.     `network` string COMMENT '网络模式',
  19.     `lng` string COMMENT '经度',
  20.     `lat` string COMMENT '纬度',
  21.     `monday_date` string COMMENT '周一日期',
  22.     `sunday_date` string COMMENT  '周日日期'
  23. ) COMMENT '活跃用户按周明细'
  24. PARTITIONED BY (`wk_dt` string)
  25. stored as parquet
  26. location '/warehouse/gmall/dws/dws_uv_detail_wk/';
复制代码




2)数据导入
  1. hive (gmall)>
  2. set hive.exec.dynamic.partition.mode=nonstrict;
  3. insert overwrite table dws_uv_detail_wk partition(wk_dt)
  4. select  
  5.     mid_id,
  6.     concat_ws('|', collect_set(user_id)) user_id,
  7.     concat_ws('|', collect_set(version_code)) version_code,
  8.     concat_ws('|', collect_set(version_name)) version_name,
  9.     concat_ws('|', collect_set(lang)) lang,
  10.     concat_ws('|', collect_set(source)) source,
  11.     concat_ws('|', collect_set(os)) os,
  12.     concat_ws('|', collect_set(area)) area,
  13.     concat_ws('|', collect_set(model)) model,
  14.     concat_ws('|', collect_set(brand)) brand,
  15.     concat_ws('|', collect_set(sdk_version)) sdk_version,
  16.     concat_ws('|', collect_set(gmail)) gmail,
  17.     concat_ws('|', collect_set(height_width)) height_width,
  18.     concat_ws('|', collect_set(app_time)) app_time,
  19.     concat_ws('|', collect_set(network)) network,
  20.     concat_ws('|', collect_set(lng)) lng,
  21.     concat_ws('|', collect_set(lat)) lat,
  22.     date_add(next_day('2020-10-14','MO'),-7),
  23.     date_add(next_day('2020-10-14','MO'),-1),
  24.     concat(date_add( next_day('2020-10-14','MO'),-7), '_' , date_add(next_day('2020-10-14','MO'),-1)
  25. )
  26. from dws_uv_detail_day
  27. where dt>=date_add(next_day('2020-10-14','MO'),-7) and dt<=date_add(next_day('2020-10-14','MO'),-1)
  28. group by mid_id;
复制代码




3)查询导入结果
  1. hive (gmall)> select * from dws_uv_detail_wk limit 1;
  2. hive (gmall)> select count(*) from dws_uv_detail_wk;
复制代码




6.1.3 每月活跃设备明细
2020-12-09_200651.jpg

1)建表语句
  1. hive (gmall)>
  2. drop table if exists dws_uv_detail_mn;
  3. create external table dws_uv_detail_mn(
  4.     `mid_id` string COMMENT '设备唯一标识',
  5.     `user_id` string COMMENT '用户标识',
  6.     `version_code` string COMMENT '程序版本号',
  7.     `version_name` string COMMENT '程序版本名',
  8.     `lang` string COMMENT '系统语言',
  9.     `source` string COMMENT '渠道号',
  10.     `os` string COMMENT '安卓系统版本',
  11.     `area` string COMMENT '区域',
  12.     `model` string COMMENT '手机型号',
  13.     `brand` string COMMENT '手机品牌',
  14.     `sdk_version` string COMMENT 'sdkVersion',
  15.     `gmail` string COMMENT 'gmail',
  16.     `height_width` string COMMENT '屏幕宽高',
  17.     `app_time` string COMMENT '客户端日志产生时的时间',
  18.     `network` string COMMENT '网络模式',
  19.     `lng` string COMMENT '经度',
  20.     `lat` string COMMENT '纬度'
  21. ) COMMENT '活跃用户按月明细'
  22. PARTITIONED BY (`mn` string)
  23. stored as parquet
  24. location '/warehouse/gmall/dws/dws_uv_detail_mn/';
复制代码




2)数据导入
  1. hive (gmall)>
  2. set hive.exec.dynamic.partition.mode=nonstrict;
  3. insert overwrite table dws_uv_detail_mn partition(mn)
  4. select  
  5.     mid_id,
  6.     concat_ws('|', collect_set(user_id)) user_id,
  7.     concat_ws('|', collect_set(version_code)) version_code,
  8.     concat_ws('|', collect_set(version_name)) version_name,
  9.     concat_ws('|', collect_set(lang)) lang,
  10.     concat_ws('|', collect_set(source)) source,
  11.     concat_ws('|', collect_set(os)) os,
  12.     concat_ws('|', collect_set(area)) area,
  13.     concat_ws('|', collect_set(model)) model,
  14.     concat_ws('|', collect_set(brand)) brand,
  15.     concat_ws('|', collect_set(sdk_version)) sdk_version,
  16.     concat_ws('|', collect_set(gmail)) gmail,
  17.     concat_ws('|', collect_set(height_width)) height_width,
  18.     concat_ws('|', collect_set(app_time)) app_time,
  19.     concat_ws('|', collect_set(network)) network,
  20.     concat_ws('|', collect_set(lng)) lng,
  21.     concat_ws('|', collect_set(lat)) lat,
  22.     date_format('2020-10-14','yyyy-MM')
  23. from dws_uv_detail_day
  24. where date_format(dt,'yyyy-MM') = date_format('2020-10-14','yyyy-MM')
  25. group by mid_id;
复制代码




3)查询导入结果
  1. hive (gmall)> select * from dws_uv_detail_mn limit 1;
  2. hive (gmall)> select count(*) from dws_uv_detail_mn ;
复制代码




6.1.4 DWS层加载数据脚本
  1. 1)在hadoop102的/home/kgg/bin目录下创建脚本
  2. [kgg@hadoop102 bin]$ vim dws_uv_log.sh
  3.     在脚本中编写如下内容
  4. #!/bin/bash
  5. # 定义变量方便修改
  6. APP=gmall
  7. hive=/opt/module/hive/bin/hive
  8. # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
  9. if [ -n "$1" ] ;then
  10.     do_date=$1
  11. else
  12.     do_date=`date -d "-1 day" +%F`  
  13. fi
  14. echo "===日志日期为 $do_date==="
  15. sql="
  16.   set hive.exec.dynamic.partition.mode=nonstrict;
  17.   insert overwrite table "$APP".dws_uv_detail_day partition(dt='$do_date')
  18.   select  
  19.     mid_id,
  20.     concat_ws('|', collect_set(user_id)) user_id,
  21.     concat_ws('|', collect_set(version_code)) version_code,
  22.     concat_ws('|', collect_set(version_name)) version_name,
  23.     concat_ws('|', collect_set(lang)) lang,
  24.     concat_ws('|', collect_set(source)) source,
  25.     concat_ws('|', collect_set(os)) os,
  26.     concat_ws('|', collect_set(area)) area,
  27.     concat_ws('|', collect_set(model)) model,
  28.     concat_ws('|', collect_set(brand)) brand,
  29.     concat_ws('|', collect_set(sdk_version)) sdk_version,
  30.     concat_ws('|', collect_set(gmail)) gmail,
  31.     concat_ws('|', collect_set(height_width)) height_width,
  32.     concat_ws('|', collect_set(app_time)) app_time,
  33.     concat_ws('|', collect_set(network)) network,
  34.     concat_ws('|', collect_set(lng)) lng,
  35.     concat_ws('|', collect_set(lat)) lat
  36.   from "$APP".dwd_start_log
  37.   where dt='$do_date'  
  38.   group by mid_id;
  39.   insert overwrite table "$APP".dws_uv_detail_wk partition(wk_dt)
  40.   select  
  41.     mid_id,
  42.     concat_ws('|', collect_set(user_id)) user_id,
  43.     concat_ws('|', collect_set(version_code)) version_code,
  44.     concat_ws('|', collect_set(version_name)) version_name,
  45.     concat_ws('|', collect_set(lang)) lang,
  46.     concat_ws('|', collect_set(source)) source,
  47.     concat_ws('|', collect_set(os)) os,
  48.     concat_ws('|', collect_set(area)) area,
  49.     concat_ws('|', collect_set(model)) model,
  50.     concat_ws('|', collect_set(brand)) brand,
  51.     concat_ws('|', collect_set(sdk_version)) sdk_version,
  52.     concat_ws('|', collect_set(gmail)) gmail,
  53.     concat_ws('|', collect_set(height_width)) height_width,
  54.     concat_ws('|', collect_set(app_time)) app_time,
  55.     concat_ws('|', collect_set(network)) network,
  56.     concat_ws('|', collect_set(lng)) lng,
  57.     concat_ws('|', collect_set(lat)) lat,
  58.     date_add(next_day('$do_date','MO'),-7),
  59.     date_add(next_day('$do_date','MO'),-1),
  60.     concat(date_add( next_day('$do_date','MO'),-7), '_' , date_add(next_day('$do_date','MO'),-1)
  61.   )
  62.   from "$APP".dws_uv_detail_day
  63.   where dt>=date_add(next_day('$do_date','MO'),-7) and dt<=date_add(next_day('$do_date','MO'),-1)
  64.   group by mid_id;
  65.   insert overwrite table "$APP".dws_uv_detail_mn partition(mn)
  66.   select
  67.     mid_id,
  68.     concat_ws('|', collect_set(user_id)) user_id,
  69.     concat_ws('|', collect_set(version_code)) version_code,
  70.     concat_ws('|', collect_set(version_name)) version_name,
  71.     concat_ws('|', collect_set(lang))lang,
  72.     concat_ws('|', collect_set(source)) source,
  73.     concat_ws('|', collect_set(os)) os,
  74.     concat_ws('|', collect_set(area)) area,
  75.     concat_ws('|', collect_set(model)) model,
  76.     concat_ws('|', collect_set(brand)) brand,
  77.     concat_ws('|', collect_set(sdk_version)) sdk_version,
  78.     concat_ws('|', collect_set(gmail)) gmail,
  79.     concat_ws('|', collect_set(height_width)) height_width,
  80.     concat_ws('|', collect_set(app_time)) app_time,
  81.     concat_ws('|', collect_set(network)) network,
  82.     concat_ws('|', collect_set(lng)) lng,
  83.     concat_ws('|', collect_set(lat)) lat,
  84.     date_format('$do_date','yyyy-MM')
  85.   from "$APP".dws_uv_detail_day
  86.   where date_format(dt,'yyyy-MM') = date_format('$do_date','yyyy-MM')   
  87.   group by mid_id;
  88. "
  89. $hive -e "$sql"
复制代码


2)增加脚本执行权限
  1. [kgg@hadoop102 bin]$ chmod 777 dws_uv_log.sh
复制代码

3)脚本使用
  1. [kgg@hadoop102 module]$ dws_uv_log.sh 2019-02-11
复制代码

4)查询结果
  1. hive (gmall)> select count(*) from dws_uv_detail_day where dt='2019-02-11';
  2. hive (gmall)> select count(*) from dws_uv_detail_wk;
  3. hive (gmall)> select count(*) from dws_uv_detail_mn ;
复制代码

5)脚本执行时间
企业开发中一般在每日凌晨30分~1点


6.2 ADS层
目标:当日、当周、当月活跃设备数

6.2.1 活跃设备数

2020-12-09_200751.jpg

  1. 1)建表语句
  2. hive (gmall)>
  3. drop table if exists ads_uv_count;
  4. create external table ads_uv_count(
  5.     `dt` string COMMENT '统计日期',
  6.     `day_count` bigint COMMENT '当日用户数量',
  7.     `wk_count`  bigint COMMENT '当周用户数量',
  8.     `mn_count`  bigint COMMENT '当月用户数量',
  9.     `is_weekend` string COMMENT 'Y,N是否是周末,用于得到本周最终结果',
  10.     `is_monthend` string COMMENT 'Y,N是否是月末,用于得到本月最终结果'
  11. ) COMMENT '活跃设备数'
  12. row format delimited fields terminated by '\t'
  13. location '/warehouse/gmall/ads/ads_uv_count/';
复制代码




2)导入数据
  1. hive (gmall)>
  2. insert into table ads_uv_count
  3. select  
  4.   '2020-10-14' dt,
  5.    daycount.ct,
  6.    wkcount.ct,
  7.    mncount.ct,
  8. if(date_add(next_day('2020-10-14','MO'),-1)='2020-10-14','Y','N') ,
  9.    if(last_day('2020-10-14')='2020-10-14','Y','N')
  10. from
  11. (
  12.    select  
  13.       '2020-10-14' dt,
  14.        count(*) ct
  15.    from dws_uv_detail_day
  16.    where dt='2020-10-14'  
  17. )daycount join
  18. (
  19.    select  
  20.      '2020-10-14' dt,
  21.      count (*) ct
  22.    from dws_uv_detail_wk
  23.    where wk_dt=concat(date_add(next_day('2020-10-14','MO'),-7),'_' ,date_add(next_day('2020-10-14','MO'),-1) )
  24. ) wkcount on daycount.dt=wkcount.dt
  25. join
  26. (
  27.    select  
  28.      '2020-10-14' dt,
  29.      count (*) ct
  30.    from dws_uv_detail_mn
  31.    where mn=date_format('2020-10-14','yyyy-MM')  
  32. )mncount on daycount.dt=mncount.dt;
复制代码

3)查询导入结果
  1. hive (gmall)> select * from ads_uv_count ;
复制代码


6.2.2 ADS层加载数据脚本
  1. 1)在hadoop102的/home/kgg/bin目录下创建脚本
  2. [kgg@hadoop102 bin]$ vim ads_uv_log.sh
  3.     在脚本中编写如下内容
  4. #!/bin/bash
  5. # 定义变量方便修改
  6. APP=gmall
  7. hive=/opt/module/hive/bin/hive
  8. # 如果是输入的日期按照取输入日期;如果没输入日期取当前时间的前一天
  9. if [ -n "$1" ] ;then
  10.     do_date=$1
  11. else
  12.     do_date=`date -d "-1 day" +%F`  
  13. fi
  14. echo "===日志日期为 $do_date==="
  15. sql="
  16.   set hive.exec.dynamic.partition.mode=nonstrict;
  17. insert into table "$APP".ads_uv_count
  18. select  
  19.   '$do_date' dt,
  20.    daycount.ct,
  21.    wkcount.ct,
  22.    mncount.ct,
  23.    if(date_add(next_day('$do_date','MO'),-1)='$do_date','Y','N') ,
  24.    if(last_day('$do_date')='$do_date','Y','N')
  25. from
  26. (
  27.    select  
  28.       '$do_date' dt,
  29.        count(*) ct
  30.    from "$APP".dws_uv_detail_day
  31.    where dt='$do_date'  
  32. )daycount   join
  33. (
  34.    select  
  35.      '$do_date' dt,
  36.      count (*) ct
  37.    from "$APP".dws_uv_detail_wk
  38.    where wk_dt=concat(date_add(next_day('$do_date','MO'),-7),'_' ,date_add(next_day('$do_date','MO'),-1) )
  39. )  wkcount  on daycount.dt=wkcount.dt
  40. join
  41. (
  42.    select  
  43.      '$do_date' dt,
  44.      count (*) ct
  45.    from "$APP".dws_uv_detail_mn
  46.    where mn=date_format('$do_date','yyyy-MM')  
  47. )mncount on daycount.dt=mncount.dt;
  48. "
  49. $hive -e "$sql"
复制代码

2)增加脚本执行权限
  1. [kgg@hadoop102 bin]$ chmod 777 ads_uv_log.sh
复制代码

3)脚本使用
  1. [kgg@hadoop102 module]$ ads_uv_log.sh 2019-02-11
复制代码

4)脚本执行时间
企业开发中一般在每日凌晨30分~1点

5)查询导入结果
  1. hive (gmall)> select * from ads_uv_count;
复制代码




最新经典文章,欢迎关注公众号


本帖被以下淘专辑推荐:

没找到任何评论,期待你打破沉寂

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

关闭

推荐上一条 /2 下一条