What Hive Bucketing Is For

Posted 2017-4-20 17:29

The main benefit of partitioning is that it lets us run a query over only part of the data, which speeds the query up.

What is bucketing

Suppose we have a table t_buck.

create table t_buck(id string,name string)
clustered by (id) sorted by(id) into 4 buckets;

This declares that the table is split into 4 buckets by id.

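The DDL only declares that the table is bucketed; the bucket files themselves are produced when data is loaded. The best way to load the data is insert into table with a select, because load data only copies files into the table directory and does not hash the rows into buckets.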

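At first all of our data sits together. After bucketing as declared above, several files appear under the table directory /user/hive/warehouse/test_db/t_buck/:

000000_0
000001_0
000002_0
000003_0

Each file holds the rows whose id hashes to that bucket, that is, hash(id) modulo the number of buckets.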
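To make the hashing step concrete, here is a minimal Python sketch of how rows could be assigned to bucket files. It is a toy model, not Hive's code: `toy_hash`, `bucket_of` and `bucketize` are hypothetical names, and the Java-style 31-multiplier string hash is an assumption (the hash Hive actually uses depends on the Hive version and column type).

```python
# Toy model of bucket assignment; not Hive's actual code.

NUM_BUCKETS = 4  # matches "into 4 buckets" in the DDL


def toy_hash(key: str) -> int:
    """Java-style 31-multiplier string hash, kept to 32 bits (an assumption)."""
    h = 0
    for byte in key.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Bucket index = non-negative hash modulo the number of buckets."""
    return (toy_hash(key) & 0x7FFFFFFF) % num_buckets


def bucketize(rows, num_buckets: int = NUM_BUCKETS):
    """Group (id, name) rows into one list per bucket, like one file per bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_of(row[0], num_buckets)].append(row)
    return buckets


if __name__ == "__main__":
    # ids 1..14 as strings, the same ids that appear in the experiment data
    rows = [(str(i), str(i)) for i in range(1, 15)]
    for index, bucket in enumerate(bucketize(rows)):
        print(f"{index:06d}_0 -> {[r[0] for r in bucket]}")
```

The point of the sketch is only that a given id always lands in the same bucket, so a reader who knows the id also knows which file to look in.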

Experiment

Create the tables with the following code:

create table t_p(id string,name string)
row format delimited fields terminated by ',';

load data local inpath '/root/buck.data' overwrite into table t_p;

create table t_buck(id string,name string)
clustered by(id) sorted by(id)
into 4 buckets
row format delimited fields terminated by ',';

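-- Enable bucketing enforcement (needed on Hive 1.x and earlier; Hive 2.0 and later always enforce it)
set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;

-- The column named in cluster by is the key used to partition rows across reducers.
-- Within each partition, rows are sorted by id.
insert into table t_buck
select * from t_p cluster by(id);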

Now let's look at what sort by produces.

set mapreduce.job.reduces=4; select * from t_p sort by id;

The output is:

+---------+-----------+--+
| t_p.id  | t_p.name  |
+---------+-----------+--+
| 12      | 12        |
| 13      | 13        |
| 4       | 4         |
| 8       | 8         |
| 14      | 14        |
| 2       | 2         |
| 6       | 6         |
| 1       | 1         |
| 10      | 10        |
| 11      | 11        |
| 3       | 3         |
| 5       | 5         |
| 7       | 7         |
| 9       | 9         |
+---------+-----------+--+

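The rows are clearly sorted within each reducer, not globally. Since id is a string column, each reducer's rows are in lexicographic order (12, 13, 4, 8 in the first group).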

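set mapreduce.job.reduces=4;
select * from t_p sort by id;

distribute by(id) specifies the column used to distribute rows across reducers, and sort by specifies the column used to sort rows within each reducer.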
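The difference between sorting per reducer and sorting globally can be modeled in a few lines of Python. This is only a sketch under stated assumptions: `reducer_of` and `distribute_and_sort` are hypothetical names, and CRC32 stands in for whatever hash the execution engine really uses.

```python
# Toy model of "distribute by (id) sort by (id)"; not how Hive is implemented.
import zlib

NUM_REDUCERS = 4  # mirrors set mapreduce.job.reduces=4


def reducer_of(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Pick a reducer from a stable hash of the key (CRC32 is a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def distribute_and_sort(rows, num_reducers: int = NUM_REDUCERS):
    """Send each row to a reducer by its id, then sort inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[reducer_of(row[0], num_reducers)].append(row)
    # String ids sort lexicographically, as in the query output of the post.
    return [sorted(part, key=lambda r: r[0]) for part in reducers]


if __name__ == "__main__":
    rows = [(str(i), str(i)) for i in range(1, 15)]
    parts = distribute_and_sort(rows)
    for index, part in enumerate(parts):
        print(f"reducer {index}: {[r[0] for r in part]}")
    # Concatenating the per-reducer outputs is what the client sees.
    output = [r[0] for part in parts for r in part]
    print("globally sorted:", output == sorted(output))
```

Each reducer's list is ordered, but the concatenation generally is not, which is the pattern visible in the query output.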

What bucketing is for

Consider the following equivalence.

cluster by(id) = distribute by(id) sort by(id)

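If tables a and b are both bucketed, and both are bucketed on the id column, then an operation that matches their rows on id (a join, for example) no longer needs a full Cartesian product of the two tables; only buckets with the same number have to be compared. But if a table is declared as bucketed while its data was never actually bucketed, the result will be wrong.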
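The saving can be sketched with a toy Python comparison of a compare-everything join and a bucket-by-bucket join. All names here (`naive_join`, `bucket_join`, the sample tables) are hypothetical, CRC32 is a stand-in hash, and the comparison counts describe this toy model only, not Hive's execution plan.

```python
# Toy comparison of a naive join with a bucket-by-bucket join; illustration only.
import zlib

NUM_BUCKETS = 4


def bucket_of(key: str) -> int:
    """Stable stand-in hash; Hive's real bucketing hash differs."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def bucketize(rows):
    """Place each (id, value) row in the bucket chosen by its id."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[0])].append(row)
    return buckets


def naive_join(a_rows, b_rows):
    """Compare every row of a with every row of b; return (matches, comparisons)."""
    matches, comparisons = [], 0
    for a in a_rows:
        for b in b_rows:
            comparisons += 1
            if a[0] == b[0]:
                matches.append((a[0], a[1], b[1]))
    return matches, comparisons


def bucket_join(a_buckets, b_buckets):
    """Join bucket i of a only against bucket i of b."""
    matches, comparisons = [], 0
    for a_bucket, b_bucket in zip(a_buckets, b_buckets):
        found, cost = naive_join(a_bucket, b_bucket)
        matches.extend(found)
        comparisons += cost
    return matches, comparisons


if __name__ == "__main__":
    a = [(str(i), f"a{i}") for i in range(1, 15)]
    b = [(str(i), f"b{i}") for i in range(1, 15)]
    full, full_cost = naive_join(a, b)
    fast, fast_cost = bucket_join(bucketize(a), bucketize(b))
    print("same rows:", sorted(full) == sorted(fast))
    print("comparisons:", full_cost, "vs", fast_cost)

    # b is "declared" bucketed but was split without hashing its ids.
    mislabeled_b = [b[i::NUM_BUCKETS] for i in range(NUM_BUCKETS)]
    broken, _ = bucket_join(bucketize(a), mislabeled_b)
    print("matches with mislabeled buckets:", len(broken), "of", len(full))
```

The last part models the failure case from the paragraph above: when the files do not really follow the hash, matching ids can sit in different bucket numbers and the bucket-wise join can miss them.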

From the personal space of user 2017: https://www.aboutyun.com/?58380
