分享

HiveSQL电商分析案例30题

问题导读:
1、如何通过UDF自定义 MD5加密函数 ,对地址、邮箱等信息进行加密?
2、如何对表内容进行检查?
3、如何进行数据分析?

一、前言
Hive 学习过程中的一个练习项目,如果不妥的地方或者更好的建议,欢迎指出!我们主要进行一下一些练习:
  •     数据结构
  •     数据清洗
  •     基于Hive的数据分析

二、项目需求
首先和大家讲一下这个项目的需求:

「对某零售企业最近1年门店收集的数据进行数据分析」

  •     潜在客户画像
  •     用户消费统计
  •     门店的资源利用率
  •     消费的特征人群定位
  •     数据的可视化展现

三、数据结构
本次练习一共用到四张表,如下:
Customer表
2.png
Transaction表
3.png
Store表
4.png
Review表
5.png

四、项目实战

「Create HDFS Folder」
  1. hdfs dfs -mkdir -p /tmp/shopping/data/customer
  2. hdfs dfs -mkdir -p /tmp/shopping/data/transaction
  3. hdfs dfs -mkdir -p /tmp/shopping/data/store
  4. hdfs dfs -mkdir -p /tmp/shopping/data/review
复制代码
「Upload the file to HDFS」
  1. hdfs dfs -put /opt/soft/data/customer_details.csv /tmp/shopping/data/customer/
  2. hdfs dfs -put /opt/soft/data/transaction_details.csv /tmp/shopping/data/transaction/
  3. hdfs dfs -put /opt/soft/data/store_details.csv /tmp/shopping/data/store/
  4. hdfs dfs -put /opt/soft/data/store_review.csv /tmp/shopping/data/review/
复制代码
「Create database」
  1. drop database if exists shopping cascade
  2. create database shopping
复制代码
「Use database」
  1. use shopping
复制代码
「Create external table」

「创建四张对应的外部表,也就是本次项目中的近源表。」

  1. create external table if not exists ext_customer_details(
  2. customer_id string,
  3. first_name string,
  4. last_name string,
  5. email string,
  6. gender string,
  7. address string,
  8. country string,
  9. language string,
  10. job string,
  11. credit_type string,
  12. credit_no string
  13. )
  14. row format delimited fields terminated by ','
  15. location '/tmp/shopping/data/customer/'
  16. tblproperties('skip.header.line.count'='1')
  17. create external table if not exists ext_transaction_details(
  18. transaction_id string,
  19. customer_id string,
  20. store_id string,
  21. price double,
  22. product string,
  23. buydate string,
  24. buytime string
  25. )
  26. row format delimited fields terminated by ','
  27. location '/tmp/shopping/data/transaction'
  28. tblproperties('skip.header.line.count'='1')
  29. create external table if not exists ext_store_details(
  30. store_id string,
  31. store_name string,
  32. employee_number int
  33. )
  34. row format delimited fields terminated by ','
  35. location '/tmp/shopping/data/store/'
  36. tblproperties('skip.header.line.count'='1')
  37. create external table if not exists ext_store_review(
  38. transaction_id string,
  39. store_id string,
  40. review_score int
  41. )
  42. row format delimited fields terminated by ','
  43. location '/tmp/shopping/data/review'
  44. tblproperties('skip.header.line.count'='1')
复制代码
通过UDF自定义 MD5加密函数

「Create MD5 encryption function」

这里通过UDF自定义 MD5加密函数 ,对地址、邮箱等信息进行加密。
  1. -- md5 udf自定义加密函数
  2. --add jar /opt/soft/data/md5.jar
  3. --create temporary function md5 as 'com.shopping.services.Encryption'
  4. --select md5('abc')
  5. --drop temporary function encrymd5
复制代码
「Clean and Mask customer_details 创建明细表」
  1. create table if not exists customer_details
  2. as select customer_id,first_name,last_name,md5(email) email,gender,md5(address) address,country,job,credit_type,md5(credit_no)
  3. from ext_customer_details
复制代码
对表内容进行检查,为数据清洗做准备

「Check ext_transaction_details data」对transaction表的transaction_id进行检查,查看重复的、错误的、以及空值的数量。

这里从表中我们可以看到transaction_id存在100个重复的值。
  1. with
  2. t1 as (select 'countrow' as status,count(transaction_id) as val from ext_transaction_details),
  3. t2 as (select 'distinct' as status,(count(transaction_id)-count(distinct transaction_id)) as val from ext_transaction_details),
  4. t3 as (select 'nullrow' as status,count(transaction_id) as val from ext_transaction_details where transaction_id is null),
  5. t4 as (select 'errorexp' as status,count(regexp_extract(transaction_id,'^([0-9]{1,4})
  6. select * from t1 union all select * from t2 union all select * from t3 union all select * from t4
复制代码

6.png

「Clean transaction_details into partition table」
  1. create table if not exists transaction_details(
  2. transaction_id string,
  3. customer_id string,
  4. store_id string,
  5. price double,
  6. product string,
  7. buydate string,
  8. buytime string
  9. )
  10. partitioned by (partday string)
  11. row format delimited fields terminated by ','
  12. stored as rcfile
复制代码

「开启动态分区」
  1. set hive.exec.dynamic.partition=true
  2. set hive.exec.dynamic.partition.mode=nonstrict
复制代码
开启动态分区,通过窗口函数对数据进行清洗

「Clear data and import data into transaction_details」

  1. -- partday 分区 transaction_id 重复
  2. select if(t.ct=1,transaction_id,concat(t.transaction_id,'_',t.ct-1))
  3. transaction_id,customer_id,store_id,price,product,buydate,buytime,date_format(buydate,'yyyy-MM')
  4. as partday
  5. from (select *,row_number() over(partition by transaction_id) as ct
  6. from ext_transaction_details) t
复制代码
  1. insert into transaction_details partition(partday)
  2. select if(t.ct=1,transaction_id,concat(t.transaction_id,'_',t.ct-1)) transaction_id,customer_id,store_id,price,product,buydate,buytime,date_format(regexp_replace(buydate,'/','-'),'yyyy-MM')
  3. as partday from (select *,row_number() over(partition by transaction_id) as ct
  4. from ext_transaction_details) t
复制代码
  •     「row_number() over(partition by transaction_id)」 窗口函数 :从1开始,按照顺序,生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列  这里我们对分组的transaction_id
  •     if(t.ct=1,transaction_id,concat(t.transaction_id,'_',t.ct-1)) 如果满足ct=1,就是transaction_id,否则进行字符串拼接生成新的id
7.png
「Clean store_review table」
  1. create table store_review
  2. as select transaction_id,store_id,nvl(review_score,ceil(rand()*5))
  3. as review_score from ext_store_review
复制代码

「NVL(E1, E2)的功能为:如果E1为NULL,则函数返回E2,否则返回E1本身。」
8.png
我们可以看到表中的数据存在空值,通过NVL函数对数据进行填充。
  1. show tables
复制代码
9.png
通过清洗后的近源表和明细表如上。

数据分析

Customer分析

  •     找出顾客最常用的信用卡

  1. select credit_type,count(credit_type) as peoplenum from customer_details
  2. group by credit_type order by peoplenum desc limit 1
复制代码
  •     找出客户资料中排名前五的职位名称
  1. select job,count(job) as jobnum from customer_details
  2. group by job
  3. order by jobnum desc
  4. limit 5
复制代码
  •     在美国女性最常用的信用卡
  1. select credit_type,count(credit_type) as femalenum from customer_details
  2. where gender='Female'
  3. group by credit_type
  4. order by femalenum desc
  5. limit 1
复制代码
  •     按性别和国家进行客户统计

  1. select count(*) as customernum,country,gender from customer_details
  2. group by country,gender
复制代码
Transaction分析

  •     计算每月总收入
  1. select partday,sum(price) as countMoney from transaction_details group by partday
复制代码
  •     计算每个季度的总收入「Create Quarter Macro 定义季度宏」,将时间按季度进行划分
  1. create temporary macro
  2. calQuarter(dt string)
  3. concat(year(regexp_replace(dt,'/','-')),'年第',ceil(month(regexp_replace(dt,'/','-'))/3),'季度')
复制代码
  1. select calQuarter(buydate) as quarter,sum(price) as sale
  2. from transaction_details group by calQuarter(buydate)
复制代码
    10.png
  • 按年计算总收入
  1. create temporary macro calYear(dt string) year(regexp_replace(dt,'/','-'))
  2. select calYear(buydate) as year,sum(price) as sale from transaction_details group by calYear(buydate)
复制代码
  •     按工作日计算总收入

  1. create temporary macro calWeek(dt string) concat('星期',dayofweek(regexp_replace(dt,'/','-'))-1)
复制代码
  1. select concat('星期',dayofweek(regexp_replace(buydate,'/','-'))-1) as week,sum(price) as sale
  2. from transaction_details group by dayofweek(regexp_replace(buydate,'/','-'))
复制代码
11.png
  •     按时间段计算总收入(需要清理数据)
  1. select concat(regexp_extract(buytime,'[0-9]{1,2}',0),'时') as time,sum(price) as sale from transaction_details group by regexp_extract(buytime,'[0-9]{1,2}',0)
复制代码
12.png
  •     按时间段计算平均消费「Time macro」

  1. create temporary macro calTime(time string) if(split(time,' ')[1]='PM',regexp_extract(time,'[0-9]{1,2}',0)+12,
  2. if(split(time,' ')[1]='AM',regexp_extract(time,'[0-9]{1,2}',0),split(time,':')[0]))
复制代码
  1. select calTime(buytime) as time,sum(price) as sale from transaction_details group by calTime(buytime)
复制代码

13.png

  1. --define time bucket
  2. --early morning: (5:00, 8:00]
  3. --morning: (8:00, 11:00]
  4. --noon: (11:00, 13:00]
  5. --afternoon: (13:00, 18:00]
  6. --evening: (18:00, 22:00]
  7. --night: (22:00, 5:00] --make it as else, since it is not liner increasing
  8. --We also format the time. 1st format time to 19:23 like, then compare, then convert minites to hours
  9. with
  10. t1 as
  11. (select calTime(buytime) as time,sum(price) as sale from transaction_details group by calTime(buytime) order by time),
  12. t2 as
  13. (select if(time>5 and time<=8,'early morning',if(time >8 and time<=11,'moring',if(time>11 and time <13,'noon',
  14. if(time>13 and time <=18,'afternoon',if(time >18 and time <=22,'evening','night'))))) as sumtime,sale
  15. from t1)
  16. select sumtime,sum(sale) from t2
  17. group by sumtime
复制代码
14.png
  •     按工作日计算平均消费
  1. select concat('星期',dayofweek(regexp_replace(buydate,'/','-'))-1)
  2. as week,avg(price) as sale from transaction_details
  3. where dayofweek(regexp_replace(buydate,'/','-'))-1 !=0 and dayofweek(regexp_replace(buydate,'/','-'))-1 !=6
  4. group by dayofweek(regexp_replace(buydate,'/','-'))
复制代码
15.png

  •     计算年、月、日的交易总数
  1. select buydate as month,count(*) as salenum from transaction_details group by buydate
复制代码
  •     找出交易量最大的10个客户
  1. select c.customer_id,c.first_name,c.last_name,count(c.customer_id) as custnum from customer_details c
  2. inner join transaction_details t
  3. on c.customer_id=t.customer_id
  4. group by c.customer_id,c.first_name,c.last_name
  5. order by custnum desc
  6. limit 10
复制代码
  •     找出消费最多的前10位顾客
  1. select c.customer_id,c.first_name,c.last_name,sum(price) as sumprice from customer_details c
  2. inner join transaction_details t
  3. on c.customer_id=t.customer_id
  4. group by c.customer_id,c.first_name,c.last_name
  5. order by sumprice desc
  6. limit 10
复制代码
  •     统计该期间交易数量最少的用户
  1. select c.customer_id,c.first_name,c.last_name,count(*) as custnum from customer_details c
  2. inner join transaction_details t
  3. on c.customer_id=t.customer_id
  4. group by c.customer_id,c.first_name,c.last_name
  5. order by custnum asc
  6. limit 1
复制代码
  •     计算每个季度的独立客户总数
  1. select calQuarter(buydate) as quarter,count(distinct customer_id) as uninum
  2. from transaction_details
  3. group by calQuarter(buydate)
复制代码
  •     计算每周的独立客户总数
  1. select calWeek(buydate) as quarter,count(distinct customer_id) as uninum
  2. from transaction_details
  3. group by calWeek(buydate)
复制代码
  •     计算整个活动客户平均花费的最大值
  1. select sum(price)/count(*) as sale
  2. from transaction_details
  3. group by customer_id
  4. order by sale desc
  5. limit 1
复制代码
  •     统计每月花费最多的客户
  1. with
  2. t1 as
  3. (select customer_id,partday,count(distinct buydate) as visit from transaction_details group by partday,customer_id),
  4. t2 as
  5. (select customer_id,partday,visit,row_number() over(partition by partday order by visit desc) as visitnum from t1)
  6. select * from t2 where visitnum=1
复制代码
  •     统计每月访问次数最多的客户
  1. with
  2. t1 as
  3. (select customer_id,partday,sum(price) as pay from transaction_details group by partday,customer_id),
  4. t2 as
  5. (select customer_id,partday,pay,row_number() over(partition by partday order by pay desc) as paynum from t1)
  6. select * from t2 where paynum=1
复制代码
  •     按总价找出最受欢迎的5种产品
  1. select product,sum(price) as sale from transaction_details
  2. group by product
  3. order by sale desc
  4. limit 5
复制代码
  •     根据购买频率找出最畅销的5种产品
  1. select product,count(*) as num from transaction_details
  2. group by product
  3. order by num desc
  4. limit 5
复制代码
  •     根据客户数量找出最受欢迎的5种产品
  1. select product,count(distinct customer_id) as num from transaction_details
  2. group by product
  3. order by num desc
  4. limit 5
复制代码
  •     验证前5个details
  1. select * from transaction_details where product in ('Goat - Whole Cut')
复制代码

Store分析

  •     按客流量找出最受欢迎的商店
  1. with
  2. t1 as (select store_id,count(*) as visit from transaction_details
  3. group by
  4. store_id order by visit desc limit 1)
  5. select s.store_name,t.visit
  6. from t1 t
  7. inner join
  8. ext_store_details s
  9. on t.store_id=s.store_id
复制代码
  •     根据顾客消费价格找出最受欢迎的商店
  1. with
  2. t1 as (select store_id,sum(price) as sale from transaction_details
  3. group by
  4. store_id order by sale desc limit 1)
  5. select s.store_name,t.sale
  6. from t1 t
  7. inner join
  8. ext_store_details s
  9. on t.store_id=s.store_id
复制代码
  •     根据顾客交易情况找出最受欢迎的商店
  1. with
  2. t1 as
  3. (select store_id,store_name from ext_store_details)
  4. select t.store_id,store_name,count(distinct t.customer_id) as num
  5. from transaction_details t
  6. inner join t1 s
  7. on s.store_id=t.store_id
  8. group by t.store_id,store_name
  9. order by num desc
  10. limit 1
复制代码
  •     根据商店和唯一的顾客id获取最受欢迎的产品
  1. with
  2. t1 as (select store_id,product,count(distinct customer_id) as num from transaction_details
  3. group by store_id,product order by num desc limit 1)
  4. select s.store_name,t.num,t.product
  5. from t1 t
  6. inner join
  7. ext_store_details s
  8. on t.store_id=s.store_id
复制代码
  •     获取每个商店的员工与顾客比
  1. with
  2. t1 as (select store_id,count(distinct customer_id) as num from transaction_details
  3. group by store_id )
  4. select s.store_name,employee_number/num as vs from t1 t
  5. inner join ext_store_details s
  6. on t.store_id=s.store_id
复制代码
  •     按年和月计算每家店的收入
  1. select store_id,partday,sum(price) from transaction_details group by store_id,partday
复制代码
  •     按店铺制作总收益饼图
  1. select store_id,sum(price) from transaction_details group by store_id
复制代码
  •     找出每个商店最繁忙的时间段
  1. with
  2. t1 as
  3. (select store_id,count(customer_id) as peoplenum from transaction_details group by store_id,concat(regexp_extract(buytime,'[0-9]{1,2}',0),'时')),
  4. t2 as
  5. (select store_id,peoplenum,row_number() over(partition by store_id order by peoplenum desc) as peo from t1 )
  6. select t.store_id,e.store_name,t.peoplenum from t2 t
  7. inner join ext_store_details e
  8. on e.store_id = t.store_id
  9. where peo =1
复制代码
  •     找出每家店的忠实顾客
  1. with
  2. t1 as
  3. (select customer_id,store_id,count(customer_id) as visit from transaction_details group by store_id,customer_id ),
  4. t2 as
  5. (select customer_id,store_id,visit,row_number() over(partition by store_id order by visit desc) as most from t1)
  6. select r.customer_id,concat(first_name,last_name) as customer_name,r.store_id,store_name,r.visit from t2 r
  7. inner join customer_details c
  8. on c.customer_id=r.customer_id
  9. inner join ext_store_details e
  10. on e.store_id=r.store_id
  11. where most=1
复制代码
  •     根据每位员工的最高收入找出明星商店
  1. with
  2. t1 as
  3. (select store_id,sum(price) as sumprice from transaction_details group by store_id)
  4. select t.store_id,s.store_name,sumprice/employee_number as avgprice  from t1 t
  5. inner join ext_store_details s
  6. on s.store_id=t.store_id
  7. order by avgprice desc
复制代码

Review分析

  •     在ext_store_review中找出存在冲突的交易映射关系
  1. select t.transaction_id,t.store_id from transaction_details t
  2. inner join ext_store_review e
  3. on e.transaction_id=t.transaction_id
  4. where e.store_id!=t.store_id
复制代码
  •     了解客户评价的覆盖率
  1. with
  2. trans as (select store_id,count(transaction_id) as countSale from transaction_details group by store_id),
  3. rev as (select store_id,count(distinct transaction_id) as review from store_review group by store_id)
  4. select s.store_name,(r.review*100/t.countSale) as cover from  trans t
  5. inner join rev r
  6. on t.store_id=r.store_id
  7. inner join ext_store_details s
  8. on t.store_id=s.store_id
复制代码
  •     根据评分了解客户的分布情况
  1. select store_id,review_score,count(review_score) as numview from ext_store_review  where review_score>0 group by review_score,store_id
复制代码
  •     根据交易了解客户的分布情况
  1. select store_id,count(transaction_id) as transactionnum from ext_store_review  group by store_id
复制代码
  •     客户给出的最佳评价是否总是同一家门店
  1. select store_id,customer_id,count(customer_id) as visit from transaction_details t
  2. join ext_store_review e
  3. on e.transaction_id = t.transaction_id
  4. where e.review_score=5
  5. group by t.store_id,t.customer_id
复制代码





获取更多资源:
领取100本书+1T资源
http://www.aboutyun.com/forum.php?mod=viewthread&tid=26480

大数据5个项目视频
http://www.aboutyun.com/forum.php?mod=viewthread&tid=25235

名企资源、名企面试题、最新BAT面试题、专题面试题等资源汇总
https://www.aboutyun.com/forum.php?mod=viewthread&tid=27732


名企资源、名企面试题、最新BAT面试题、专题面试题等资源汇总
https://www.aboutyun.com/forum.php?mod=viewthread&tid=27732



原文链接
https://mp.weixin.qq.com/s/PZcAtTgn0ByzhLrAhs2P0w

已有(2)人评论

跳转到指定楼层
wqkenqing 发表于 2020-11-2 08:57:23
非常感兴趣,多看看.
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

关闭

推荐上一条 /2 下一条