How to create an index on a complex data type in Hive

2014-7-15 23:47

Reading guide:
1. How do you create an index on a complex data type?
2. Can Hive index an individual attribute inside it?

This post extends an earlier one:
Using Hive complex data types: array, map, struct

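We know that Hive can create indexes. But what about a complex data structure that contains several attributes, such as the one below?
In the student table, the info column contains name and age.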

hive> create table student_test(id INT, info struct<name:STRING, age:INT>)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> COLLECTION ITEMS TERMINATED BY ':';
OK
Time taken: 0.446 seconds
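
To make the two delimiters concrete, here is a minimal sketch of loading a data file into this table and reading the struct members back. The file path /tmp/student_test.txt and the first sample row are assumptions made for illustration; only the row for id 4 appears in the output later in this post.

```sql
-- Hypothetical sample file /tmp/student_test.txt, one row per line:
--   1,zhang:20      (invented row, for illustration only)
--   4,li:80         (matches the row shown in the query output)
-- ',' separates the columns id and info;
-- ':' separates the struct members name and age.
LOAD DATA LOCAL INPATH '/tmp/student_test.txt' INTO TABLE student_test;

-- Struct members are addressed with dot notation.
SELECT id, info.name, info.age
FROM student_test
WHERE info.age = 80;
```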

Suppose we run the following query:

select * from student_test where info.age=80 ;

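To speed this up, can we create an index on age alone?
The answer is no.
So what can we do?
Hive does support indexes on complex data types, but the index covers the whole column (here the entire info struct), not a single attribute inside it.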
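
If filtering on age is the common case, one possible workaround (not part of the original post) is to copy the struct members into ordinary top-level columns and index the plain column. The sketch below assumes a Hive release before 3.0, since CREATE INDEX was removed in Hive 3.0. The names student_flat and student_flat_age_index are hypothetical.

```sql
-- Hypothetical helper table: flatten the struct into top-level columns.
CREATE TABLE student_flat AS
SELECT id, info.name AS name, info.age AS age
FROM student_test;

-- Compact index on the plain INT column age (Hive < 3.0 only).
CREATE INDEX student_flat_age_index ON TABLE student_flat (age)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;

-- WITH DEFERRED REBUILD leaves the index empty until it is rebuilt.
ALTER INDEX student_flat_age_index ON student_flat REBUILD;
```

The trade-off is a second copy of the data that has to be refreshed whenever student_test changes.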

Below is the query run before the index was created.

1. Before creating the index
The query took 91.105 seconds:

hive> select * from student_test where info.age=80 ;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1405418310771_0001, Tracking URL = http://master:8088/proxy/application_1405418310771_0001/
Kill Command = /usr/hadoop/bin/hadoop job  -kill job_1405418310771_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-07-15 23:03:29,706 Stage-1 map = 0%,  reduce = 0%
2014-07-15 23:04:07,900 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.97 sec
2014-07-15 23:04:08,977 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.97 sec
MapReduce Total cumulative CPU time: 1 seconds 970 msec
Ended Job = job_1405418310771_0001
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.97 sec HDFS Read: 275 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 970 msec
OK
4 {"name":"li","age":80}
Time taken: 91.105 seconds, Fetched: 1 row(s)

2. Creating the index

hive> create index student_test_index on table student_test(info)
> as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
> with deferred rebuild
> IN TABLE student_test_index_table;

hive> alter index student_test_index on student_test rebuild;
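
After the rebuild it can be useful to confirm that the index exists and that its table was populated. A minimal sketch, assuming the index and index table names used above and a Hive release that still supports indexes (before 3.0):

```sql
-- List the indexes defined on the base table.
SHOW FORMATTED INDEX ON student_test;

-- Inspect the index table named in the IN TABLE clause.
-- For a compact index it typically holds the indexed column
-- plus the bookkeeping columns _bucketname and _offsets.
DESCRIBE student_test_index_table;
SELECT * FROM student_test_index_table LIMIT 5;
```

If the SELECT returns no rows, the rebuild has probably not run yet, because WITH DEFERRED REBUILD creates the index empty.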

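3. Comparing the queries

This run took 26.167 seconds, much faster than the original 91.105 seconds. Note that both runs used a single mapper and read only 275 bytes from HDFS, so on a table this small part of the wall-clock difference may come from job startup overhead rather than from the index alone.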

hive> select * from student_test where info.age=80;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1405418310771_0003, Tracking URL = http://master:8088/proxy/application_1405418310771_0003/
Kill Command = /usr/hadoop/bin/hadoop job  -kill job_1405418310771_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-07-15 23:19:05,465 Stage-1 map = 0%,  reduce = 0%
2014-07-15 23:19:17,093 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.61 sec
2014-07-15 23:19:18,148 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.61 sec
MapReduce Total cumulative CPU time: 1 seconds 610 msec
Ended Job = job_1405418310771_0003
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 1.61 sec HDFS Read: 275 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 610 msec
OK
4 {"name":"li","age":80}
Time taken: 26.167 seconds, Fetched: 1 row(s)
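
To check whether the index is actually consulted, one option is to enable automatic index use and look at the query plan. This is a sketch under assumptions: hive.optimize.index.filter is off by default in Hive releases that support indexes, and the original post does not show whether it was set or whether the optimizer applies an index on the info struct to a predicate on info.age.

```sql
-- Automatic use of indexes for filtering is disabled by default.
SET hive.optimize.index.filter=true;

-- Inspect the plan. If the index is used, the plan should contain
-- an extra stage that reads student_test_index_table.
EXPLAIN SELECT * FROM student_test WHERE info.age = 80;

-- Re-run the query and compare "HDFS Read" and "Time taken"
-- with the earlier runs.
SELECT * FROM student_test WHERE info.age = 80;
```

Repeating each run several times gives a fairer comparison, since the first job on an idle cluster often takes longer to start.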