Creating Indexes in Hive

2014-4-19 21:21

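Indexing is a standard database technique, and Hive has supported it since version 0.7. Hive's index support is limited: unlike a traditional relational database, Hive has no concept of keys. Users can still create indexes on certain columns to speed up some operations, and the index data for a table is stored in a separate table. Hive's indexing is still relatively immature and offers few options. However, indexes are designed to be customizable through pluggable Java code, so users can extend the feature to fit their own needs.

Not every query benefits from a Hive index. You can use the EXPLAIN statement to check whether a HiveQL statement can use an index to improve query performance. As with indexes in an RDBMS, you need to evaluate whether creating an index is worthwhile: an index takes extra disk space, and building and maintaining it has a cost. You have to weigh the benefits of an index against those costs.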
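As a minimal sketch of the EXPLAIN check, the statement below prints the execution plan for a filter query instead of running it. The table and predicate are the ones used in the walkthrough later in this post. What the plan looks like when an index is picked up depends on the Hive version and configuration, so no output is shown here.

```sql
-- Print the query plan rather than executing the query.
-- `user` and `id` are the example table and column used later in this post.
EXPLAIN
SELECT * FROM user WHERE id = 500000;
```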

Here is how to create an index.

1. Create the table:

    hive> create table user(id int, name string)
        > ROW FORMAT DELIMITED
        > FIELDS TERMINATED BY '\t'
        > STORED AS TEXTFILE;

2. Load the data:

    hive> load data local inpath '/export1/tmp/wyp/row.txt'
        > overwrite into table user;

3. Test before creating the index:

    hive> select * from user where id = 500000;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Cannot run job locally: Input Size (= 356888890) is larger than
    hive.exec.mode.local.auto.inputbytes.max (= 134217728)
    Starting Job = job_1384246387966_0247, Tracking URL =
    http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
    Kill Command=/home/q/hadoop/bin/hadoop job -kill job_1384246387966_0247
    Hadoop job information for Stage-1: number of mappers:2; number of reducers:0
    2013-11-13 15:09:53,336 Stage-1 map = 0%, reduce = 0%
    2013-11-13 15:09:59,500 Stage-1 map=50%,reduce=0%, Cumulative CPU 2.0 sec
    2013-11-13 15:10:00,531 Stage-1 map=100%,reduce=0%, Cumulative CPU 5.63 sec
    2013-11-13 15:10:01,560 Stage-1 map=100%,reduce=0%, Cumulative CPU 5.63 sec
    MapReduce Total cumulative CPU time: 5 seconds 630 msec
    Ended Job = job_1384246387966_0247
    MapReduce Jobs Launched:
    Job 0: Map: 2 Cumulative CPU: 5.63 sec
    HDFS Read: 361084006 HDFS Write: 357 SUCCESS
    Total MapReduce CPU Time Spent: 5 seconds 630 msec
    500000 wyp.
    Time taken: 14.107 seconds, Fetched: 1 row(s)

The query took 14.107 s in total.

4. Create an index on user:

    hive> create index user_index on table user(id)
        > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
        > with deferred rebuild
        > IN TABLE user_index_table;
    hive> alter index user_index on user rebuild;
    hive> select * from user_index_table limit 5;
    0 hdfs://mycluster/user/hive/warehouse/table02/000000_0 [0]
    1 hdfs://mycluster/user/hive/warehouse/table02/000000_0 [352]
    2 hdfs://mycluster/user/hive/warehouse/table02/000000_0 [704]
    3 hdfs://mycluster/user/hive/warehouse/table02/000000_0 [1056]
    4 hdfs://mycluster/user/hive/warehouse/table02/000000_0 [1408]
    Time taken: 0.244 seconds, Fetched: 5 row(s)

An index has now been created on the user table. Each row of the index table holds an indexed value, the HDFS file that contains it, and the offsets within that file.
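To confirm that the index exists, Hive releases that support indexes also provide a SHOW INDEX statement. A minimal sketch, assuming the user table and the user_index index from step 4:

```sql
-- List the indexes defined on the table, with column headers.
SHOW FORMATTED INDEX ON user;

-- Remove the index (and its index table) when it is no longer needed.
DROP INDEX IF EXISTS user_index ON user;
```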

5. Run the same test again on user now that the index exists:

    hive> select * from user where id = 500000;
    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since there's no reduce operator
    Cannot run job locally: Input Size (= 356888890) is larger than
    hive.exec.mode.local.auto.inputbytes.max (= 134217728)
    Starting Job = job_1384246387966_0247, Tracking URL =
    http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
    Kill Command=/home/q/hadoop/bin/hadoop job -kill job_1384246387966_0247
    Hadoop job information for Stage-1: number of mappers:2; number of reducers:0
    2013-11-13 15:23:12,336 Stage-1 map = 0%, reduce = 0%
    2013-11-13 15:23:53,240 Stage-1 map=50%,reduce=0%, Cumulative CPU 2.0 sec
    2013-11-13 15:24:00,253 Stage-1 map=100%,reduce=0%, Cumulative CPU 5.27 sec
    2013-11-13 15:24:01,650 Stage-1 map=100%,reduce=0%, Cumulative CPU 5.27 sec
    MapReduce Total cumulative CPU time: 5 seconds 630 msec
    Ended Job = job_1384246387966_0247
    MapReduce Jobs Launched:
    Job 0: Map: 2 Cumulative CPU: 5.63 sec
    HDFS Read: 361084006 HDFS Write: 357 SUCCESS
    Total MapReduce CPU Time Spent: 5 seconds 630 msec
    500000 wyp.
    Time taken: 13.042 seconds, Fetched: 1 row(s)

This time the query took 13.042 s, about the same as without the index.
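One possible reason for the similar timings, offered as an assumption rather than something verified on the cluster used here: Hive does not rewrite queries to use an index unless the optimizer is told to. The sketch below uses the hive.optimize.index.filter family of settings. Defaults vary by release, so check them for your version. The default minimum input size for automatic compact-index use is commonly documented as 5 GB, which is larger than the roughly 357 MB table in this test, so the threshold is lowered to 0.

```sql
-- Allow the optimizer to use indexes when evaluating filters (off by default).
SET hive.optimize.index.filter=true;
-- Minimum input size, in bytes, before a compact index is used automatically.
-- Setting it to 0 lets the small test table qualify.
SET hive.optimize.index.filter.compact.minsize=0;

-- Re-run the test query and compare the timing.
SELECT * FROM user WHERE id = 500000;
```

Even with these settings, an index lookup adds its own job overhead, so a table of this size may not show a speedup.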

Index creation in Hive also has a bug: if a table's schema information comes from a SerDe, Hive cannot create an index:

    hive> CREATE INDEX employees_index
        > ON TABLE employees (country)
        > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
        > WITH DEFERRED REBUILD
        > IDXPROPERTIES ('creator' = 'me','created_at' = 'some_time')
        > IN TABLE employees_index_table
        > COMMENT 'Employees indexed by country and name.';
    FAILED: Error in metadata: java.lang.RuntimeException: \
    Check the index columns, they should appear in the table being indexed.
    FAILED: Execution Error, return code 1 from \
    org.apache.hadoop.hive.ql.exec.DDLTask

This bug affects Hive 0.10.0, 0.10.1, and 0.11.0 and was fixed in Hive 0.12.0. For details, see https://issues.apache.org/jira/browse/HIVE-4251

Please respect the original work. When reposting, credit the source: reposted from 过往记忆 (http://www.iteblog.com/)


sunshine_junge's personal space: https://www.aboutyun.com/?3779
