Hive Study Notes


This post was last edited by sunshine_junge on 2014-9-19 14:25

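Guiding questions:
1. How do you configure the Hive metastore?
2. What is the difference between Hive internal tables and external tables?
3. How do you work with Hive indexes?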

I. Installing Hive
            Copy hive-0.9.0.tar.gz to /usr/local
            Extract hive-0.9.0.tar.gz and rename the directory

#cd /usr/local
#tar -zxvf hive-0.9.0.tar.gz
#mv hive-0.9.0 hive

Edit /etc/profile:

#vi /etc/profile

Add:

export HIVE_HOME=/usr/local/hive

Modify PATH:

export PATH=$JAVA_HOME/bin:$PATH:$HADOOP_HOME/bin:$HIVE_HOME/bin

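Save and exit, then reload the profile and create the Hive configuration files from the templates in the conf directory:

#source /etc/profile
#cd $HIVE_HOME/conf
#cp hive-env.sh.template hive-env.sh
#cp hive-default.xml.template hive-site.xml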

1. Edit Hadoop's hadoop-env.sh (otherwise Hive fails at startup with a class-not-found error):

export HADOOP_CLASSPATH=.:$CLASSPATH:$HADOOP_CLASSPATH:$HADOOP_HOME/bin

2. Edit hive-config.sh in $HIVE_HOME/bin and add the following three lines:

export JAVA_HOME=/usr/local/jdk
export HIVE_HOME=/usr/local/hive
export HADOOP_HOME=/usr/local/hadoop

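II. Ways to access Hive
      1. Command-line mode: run the hive executable directly (#$HIVE_HOME/bin/hive), or enter #hive --service cli

      2. Web interface (port 9999), for accessing Hive from a browser. Start it with:
            #hive --service hwi &
            and then open:
            http://hadoop0:9999/hwi/

      3. Remote service (port 10000), which must be running for access from Java. Start it with:
            #hive --service hiveserver &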

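III. The metastore
      The metastore is the central store for Hive metadata. By default it uses an embedded Derby database as its storage engine.
      Drawback of the Derby engine: only one session can be open at a time.
      Using MySQL as an external storage engine lets multiple users access the metastore at the same time.

      Configuring MySQL as the metastore

      1. Upload mysql-connector-java-5.1.10.jar to $HIVE_HOME/lib

      2. Log in to MySQL and create the database hive

#mysql -uroot -padmin
mysql>create database hive;
mysql>grant all ON hive.* TO root@'%' IDENTIFIED BY 'admin';
mysql>flush privileges;
mysql>set global binlog_format='MIXED';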

3. Edit $HIVE_HOME/conf/hive-site.xml (the connection user name and password must match the MySQL account granted in step 2)

                 <property>
                        <name>javax.jdo.option.ConnectionURL</name>
                        <value>jdbc:mysql://hadoop001:3306/hive?createDatabaseIfNotExist=true</value>
                </property>
                <property>
                        <name>javax.jdo.option.ConnectionDriverName</name>
                        <value>com.mysql.jdbc.Driver</value>
                </property>
                <property>
                        <name>javax.jdo.option.ConnectionUserName</name>
                        <value>root</value>
                </property>
                <property>
                        <name>javax.jdo.option.ConnectionPassword</name>
                        <value>admin</value>
                </property>

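IV. Hive data storage and operations
      Hive stores its data on Hadoop HDFS.
      Hive has no dedicated data storage format.
      The storage structure mainly consists of databases, files, tables, and views.
      By default Hive can load text files (TextFile) directly; it also supports SequenceFile.
      When you create a table you specify the column and row delimiters, and Hive can then parse the data.

      Database: similar to a database in a traditional RDBMS.
      The default database is "default".
      After starting the CLI with #hive, if you do not run hive> use <database name>; the default database is used. You can also select it explicitly with hive> use default;
      Create a new database:
      hive> create database test_dw;

      Table: conceptually similar to a table in a relational database.
      Each table has a corresponding directory in HDFS that stores its data. For example, a table named test is stored at /warehouse/test, where warehouse is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. The data of all tables (excluding external tables) is stored under this directory.
      When an internal table is dropped, both its metadata and its data are deleted.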

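Differences between internal and external tables:
      Internal table: creating the table and loading data are two steps (they can be done in a single statement). While loading, the actual data is moved into the data warehouse directory, and all later access reads it directly from there. Dropping the table deletes both the data and the metadata.
      External table: there is only one step; the table is created and attached to the data at the same time. The data is not moved into the data warehouse directory; the table only holds a link to the external data. Dropping an external table removes only that link, not the data.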
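The drop behaviour described above can be sketched in Python, with local directories standing in for HDFS. This is a toy model rather than Hive code: the `metastore` dictionary, the function names, and the file names are all assumptions made up for the illustration.

```python
import os
import shutil
import tempfile

def load_internal(metastore, warehouse_dir, table, src_file):
    """Internal table: the data file is moved into the warehouse directory."""
    table_dir = os.path.join(warehouse_dir, table)
    os.makedirs(table_dir, exist_ok=True)
    shutil.move(src_file, table_dir)
    metastore[table] = {"location": table_dir, "external": False}

def create_external(metastore, table, data_dir):
    """External table: only a link (the location) is recorded; nothing moves."""
    metastore[table] = {"location": data_dir, "external": True}

def drop_table(metastore, table):
    """Dropping removes the metadata; data is deleted only for internal tables."""
    info = metastore.pop(table)
    if not info["external"]:
        shutil.rmtree(info["location"])

# Hypothetical layout under a scratch directory.
root = tempfile.mkdtemp()
warehouse = os.path.join(root, "warehouse")
external_dir = os.path.join(root, "external_data")
os.makedirs(warehouse)
os.makedirs(external_dir)
for path in (os.path.join(root, "a.txt"), os.path.join(external_dir, "b.txt")):
    with open(path, "w") as f:
        f.write("k1\n")

metastore = {}
load_internal(metastore, warehouse, "inner_table", os.path.join(root, "a.txt"))
create_external(metastore, "ext_table", external_dir)

moved_into_warehouse = os.path.exists(os.path.join(warehouse, "inner_table", "a.txt"))
drop_table(metastore, "inner_table")
drop_table(metastore, "ext_table")
internal_data_left = os.path.exists(os.path.join(warehouse, "inner_table"))
external_data_left = os.path.exists(os.path.join(external_dir, "b.txt"))
shutil.rmtree(root)  # clean up the scratch directory
```

After both drops the metadata is gone for both tables, but only the external table's file is still on disk.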

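Creating an internal table (three alternative forms; run only one of them)

create table inner_table (key string);  -- with no delimiter specified, the default field delimiter is '\001'
create table inner_table (key string) row format delimited fields terminated by ',';  -- explicit delimiter
create table inner_table (key string) row format delimited fields terminated by ',' stored as SEQUENCEFILE;  -- storage format; SEQUENCEFILE is Hadoop's built-in binary file format, which supports compression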
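To make the default delimiter concrete: '\001' is the Ctrl-A control character (byte 0x01). Here is a minimal Python sketch of how one line of a delimited text file maps to column values; `parse_row` and the sample values are hypothetical and only illustrate the idea.

```python
# Illustrative only: how one delimited text row maps to columns.
DEFAULT_FIELD_DELIM = "\x01"  # the '\001' (Ctrl-A) default field delimiter

def parse_row(line, delim=DEFAULT_FIELD_DELIM):
    """Split one text line into column values, dropping the trailing newline."""
    return line.rstrip("\n").split(delim)

row_default = parse_row("k1\x01v1\n")        # file written with the default delimiter
row_comma = parse_row("k1,v1\n", delim=",")  # file for a table declared with ','
```

A file whose delimiter does not match the table definition would come back as a single column per line, which is why the delimiter has to be declared when the table is created.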

Viewing the table structure

describe inner_table;
-- or
desc inner_table;

Loading data

load data local inpath '/root/test.txt' into table inner_table;
load data local inpath '/root/test.txt' overwrite into table inner_table;  -- the data was wrong, so reload it and replace the existing contents

Querying data

select * from inner_table;
select count(*) from inner_table;

Dropping a table

drop table inner_table;

Renaming a table

alter table inner_table rename to new_table_name;

Changing a column

alter table inner_table change key key1 string;  -- the column type must be given

Adding a column

alter table inner_table add columns (value string);

Creating a partitioned table

create table partition_table(name string,salary float,gender string,level string) partitioned by(dt string,dep string) row format delimited fields terminated by ',' stored as textfile;

Listing a table's partitions

SHOW PARTITIONS partition_table;

Viewing the table structure

describe partition_table;
-- or
desc partition_table;

Loading data into a partition

load data local inpath '/hadoop/out/partition.txt' into table partition_table partition (dt='2014-04-01',dep='001');

Inspecting the HDFS file layout

${hive.metastore.warehouse.dir}/yourDatabase.db/partition_table/...
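As a rough illustration of that layout, each partition column becomes one `column=value` directory level under the table directory. The helper below is a hypothetical sketch, not a Hive API; it assumes the usual convention that a non-default database gets a `<name>.db` directory, and it uses /user/hive/warehouse as the warehouse path for the example.

```python
def partition_path(warehouse_dir, database, table, partition_spec):
    """Build the directory for one partition.

    partition_spec is an ordered list of (column, value) pairs, in the same
    order as the PARTITIONED BY clause.
    """
    db_part = [] if database == "default" else [database + ".db"]
    partition_dirs = ["{}={}".format(col, val) for col, val in partition_spec]
    return "/".join([warehouse_dir.rstrip("/")] + db_part + [table] + partition_dirs)

path = partition_path(
    "/user/hive/warehouse",  # assumed value of hive.metastore.warehouse.dir
    "test_dw",
    "partition_table",
    [("dt", "2014-04-01"), ("dep", "001")],
)
```

For the partition loaded earlier this gives a path ending in partition_table/dt=2014-04-01/dep=001.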

Adding partitions

alter table partition_table add partition (dt='2014-09-01',dep='001') location '/out/20140901/001' partition (dt='2014-09-02',dep='002') location '/out/20140901/002';

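There are two cases in which a Hive query does not run a MapReduce job.

select * from partition_table;  -- case 1: a plain select * with no conditions
select * from partition_table where dt='2014-09-01';  -- case 2: the where clause filters only on partition columns
select * from partition_table where dt='2014-09-01' and dep='001';  -- case 2 again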

Example of case when .. then .. else .. end:

select name,gender,case when salary<5000 then 'L1' when salary>=5000 and salary<=10000 then 'L2' when salary>10000 then 'L3' else 'L0' end as salary_ from partition_table;
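The same branching logic, mirrored in Python, shows when the else branch can actually be reached. This is only a sketch of the expression's semantics; `salary_level` is a hypothetical name, and `None` stands in for a SQL NULL.

```python
def salary_level(salary):
    """Mirror of the case expression: branches are tested top to bottom."""
    if salary is None:
        return "L0"   # every comparison with NULL is unknown, so only else matches
    if salary < 5000:
        return "L1"
    if 5000 <= salary <= 10000:
        return "L2"
    if salary > 10000:
        return "L3"
    return "L0"       # reached only for values that fail every comparison

levels = [salary_level(s) for s in (4999.0, 5000.0, 10000.0, 10001.0, None)]
```

Since the three conditions cover every real number, 'L0' effectively marks rows where salary is missing.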

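group by:

select gender,sum(salary) from partition_table group by gender;

hive.map.aggr controls how the aggregation is done. When it is true, Hive performs a first level of aggregation on the map side, which is usually faster but needs more memory, in proportion to the amount of data. It defaults to true in Hive 0.3 and later.

set hive.map.aggr=true;
select gender,sum(salary) from partition_table group by gender;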
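Here is a small Python sketch of what map-side aggregation buys, assuming two input splits with made-up rows. Each mapper sums per key before the shuffle, so it sends one partial result per key instead of one record per row; the function names are hypothetical.

```python
from collections import defaultdict

def map_side_aggregate(rows):
    """First-level aggregation inside one mapper: one partial sum per key."""
    partial = defaultdict(float)
    for gender, salary in rows:
        partial[gender] += salary
    return dict(partial)

def reduce_merge(partials):
    """Reducer: merge the partial sums coming from all mappers."""
    total = defaultdict(float)
    for partial in partials:
        for gender, subtotal in partial.items():
            total[gender] += subtotal
    return dict(total)

split_1 = [("M", 6000.0), ("F", 7000.0), ("M", 4000.0)]
split_2 = [("F", 3000.0), ("M", 12000.0)]

partials = [map_side_aggregate(split_1), map_side_aggregate(split_2)]
totals = reduce_merge(partials)

rows_without_map_aggr = len(split_1) + len(split_2)    # records shuffled with no map-side aggregation
records_with_map_aggr = sum(len(p) for p in partials)  # records shuffled with it
```

The memory cost comes from the per-mapper dictionary, which grows with the number of distinct group keys the mapper sees.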

having:

If your Hive version supports having:
select gender,sum(salary) from partition_table where dt='2014-04-01' group by gender having sum(salary)>25000;
If your Hive version does not support having, filter in an outer query instead:
select * from (select gender,sum(salary) as sum_salary from partition_table where dt='2014-04-01' group by gender) e where e.sum_salary>25000;

join:
Hive join supports only equality joins!
reduce join:

select c.class,s.score from score_test s join class_test c on s.classno=c.no;

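This is the most common join strategy. It works regardless of data size and is also called a reduce side join. It is the least efficient kind of join, and it is carried out by a single MapReduce job.

How it works:
Both the large table and the small table first go through the map phase. In the map-side shuffle, each map output key becomes table_name_tag_prefix + join_column_value, but partitioning still hashes only join_column_value, so rows from both tables with the same join value reach the same reducer.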
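The mechanism above can be imitated in a few lines of Python. This is a simplified sketch under stated assumptions: the composite key is modelled as a (join value, tag) tuple rather than a prefixed string, the sample rows are invented, and integer join keys are used so the partitioning is easy to follow.

```python
from collections import defaultdict

def map_phase(table_tag, rows, key_index):
    """Emit (composite key, row); the key carries the join value and a table tag."""
    for row in rows:
        yield (row[key_index], table_tag), row

def partition(composite_key, num_reducers):
    """Partition on the join value only, ignoring the tag."""
    join_value, _tag = composite_key
    return hash(join_value) % num_reducers

def reduce_side_join(score_rows, class_rows, num_reducers=2):
    # Shuffle: route every tagged row to a reducer, grouped by join value.
    reducers = [defaultdict(lambda: {"s": [], "c": []}) for _ in range(num_reducers)]
    tagged = list(map_phase("s", score_rows, 0)) + list(map_phase("c", class_rows, 0))
    for key, row in tagged:
        join_value, tag = key
        reducers[partition(key, num_reducers)][join_value][tag].append(row)
    # Reduce: within each join value, pair the rows from the two tables.
    joined = []
    for groups_by_value in reducers:
        for groups in groups_by_value.values():
            for _classno, score in groups["s"]:
                for _no, class_name in groups["c"]:
                    joined.append((class_name, score))
    return joined

score_test = [(1, 90), (2, 80), (1, 70), (3, 60)]  # (classno, score)
class_test = [(1, "A"), (2, "B")]                   # (no, class)
result = sorted(reduce_side_join(score_test, class_test))
```

The row with classno 3 has no partner in class_test, so it produces no output, as expected for an inner join.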

map join, for when only one of the tables is large:

select /*+MAPJOIN(c)*/ c.class,s.score from score_test s join class_test c on s.classno=c.no;  -- name the small table in the MAPJOIN hint

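A map join runs in two steps. First, the small table's data is turned into a hash table and broadcast to every map task. Second, the large table's data is split sensibly, and during the map phase each row of the large table probes the small table's hash table; when the join keys are equal, the joined row is written to HDFS. It is called a map join because all of the work is done on the map side.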
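The build and probe steps look roughly like this in Python. It is a single-process sketch with invented sample rows; in a real job the hash table would be shipped to every map task, and `map_join` is a hypothetical name.

```python
def map_join(big_rows, small_rows):
    """Hash join done entirely in the 'map' step: build on the small table, probe with the big one."""
    # Step 1: build a hash table from the small table, keyed on the join column.
    hashtable = {}
    for no, class_name in small_rows:
        hashtable.setdefault(no, []).append(class_name)
    # Step 2: stream the big table row by row and probe the hash table.
    for classno, score in big_rows:
        for class_name in hashtable.get(classno, []):
            yield class_name, score

score_test = [(1, 90), (2, 80), (1, 70), (3, 60)]  # large table: (classno, score)
class_test = [(1, "A"), (2, "B")]                   # small table: (no, class)
joined = list(map_join(score_test, class_test))
```

No shuffle or reduce step is needed, which is why this only works when the small table fits in memory.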

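left semi join (the right-hand table may be referenced only in the on clause, not in the select list):

select s.classno,s.score from score_test s left semi join class_test c on s.classno=c.no;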
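For comparison, here is a Python sketch of what the left semi join query returns: each left-hand row appears at most once if any matching key exists on the right, and no right-hand columns are returned. The function name and sample rows are made up for the illustration.

```python
def left_semi_join(left_rows, right_rows):
    """Keep each left row whose key exists on the right; never duplicate it."""
    right_keys = {no for no, _class_name in right_rows}
    return [row for row in left_rows if row[0] in right_keys]

score_test = [(1, 90), (3, 60), (2, 80)]             # (classno, score)
class_test = [(1, "A"), (1, "A-duplicate"), (2, "B")]  # key 1 appears twice on the right
kept = left_semi_join(score_test, class_test)
```

An ordinary inner join would return the classno 1 row twice here; the semi join returns it once.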

order by (global sort: all data is sent to a single reducer, so this is very slow on large data sets):

set hive.mapred.mode=nonstrict;  -- non-strict mode (the default)
set hive.mapred.mode=strict;  -- strict mode: order by must specify a limit, otherwise the query fails
select * from score_test order by score desc limit 1;

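sort by:

sort by sorts only within each reducer, so it guarantees only that each reducer's output is ordered (not a global sort). This can make a later global sort more efficient.
sort by is not affected by whether hive.mapred.mode is strict or nonstrict.
With sort by you can set the number of reducers to run:

set mapred.reduce.tasks=3;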
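A Python sketch of the difference, using made-up scores: `sort_by` orders rows inside each of three simulated reducers, and a cheap merge of those already-sorted runs then yields the global order. The round-robin distribution is an assumption standing in for however rows actually reach the reducers.

```python
import heapq

def sort_by(rows, num_reducers):
    """Sort descending within each reducer only; there is no global order."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        reducers[i % num_reducers].append(row)  # round-robin stand-in for the real distribution
    return [sorted(part, reverse=True) for part in reducers]

def order_by(rows):
    """Global sort: conceptually everything goes through one reducer."""
    return sorted(rows, reverse=True)

scores = [70, 95, 60, 88, 79, 91]
per_reducer = sort_by(scores, num_reducers=3)
# Merging pre-sorted runs is cheaper than sorting everything from scratch.
merged = list(heapq.merge(*per_reducer, reverse=True))
```

Each reducer's output is ordered on its own, but concatenating the outputs without the merge would not be globally sorted.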

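union all

select * from score_test union all select * from score_test;  -- fails: a top-level union all is not accepted in this Hive version
select * from (select * from score_test union all select * from score_test) e;  -- works: wrap the union in a subquery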

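V. Hive indexes
            Characteristics:  1. Index keys are stored redundantly, providing a key-based view of the data
                              2. The storage design is optimized for query and retrieval performance
                              3. For some queries an index reduces I/O and so improves performance
            Notes:            1. By default an index is partitioned the same way as its base table
                              2. An index cannot be created on a view
                              3. An index's storage format can be configured with stored as

            Create the table

create table index_test(id int,name string) partitioned by (dt string) row format delimited fields terminated by ',' stored as textfile;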

Create a temporary staging table

create table index_tmp(id int,name string,dt string) row format delimited fields terminated by ',' stored as textfile;

Load the data:

load data local inpath '/hadoop/out/index.txt' into table index_tmp;

Set the dynamic partition mode to non-strict

set hive.exec.dynamic.partition.mode=nonstrict;

Enable dynamic partitioning

set hive.exec.dynamic.partition=true;

Insert the data from index_tmp into index_test

insert overwrite table index_test partition(dt) select id,name,dt from index_tmp;
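What the dynamic-partition insert does with the last selected column can be sketched in Python: rows are routed to a partition named after their dt value, and dt itself is no longer stored as a regular column. The sample rows and the function name are invented for the illustration.

```python
from collections import defaultdict

def dynamic_partition_insert(rows):
    """Group (id, name, dt) rows by dt; the partition column comes last in the select list."""
    partitions = defaultdict(list)
    for row_id, name, dt in rows:
        partitions["dt=" + dt].append((row_id, name))
    return dict(partitions)

index_tmp = [
    (1, "tom", "2014-09-01"),
    (2, "amy", "2014-09-02"),
    (3, "bob", "2014-09-01"),
]
index_test_partitions = dynamic_partition_insert(index_tmp)
```

One partition is created per distinct dt value found in the data, which is why no partition value has to be written in the statement.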

Create the index

create index index_dom01 on table index_test(id) as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;

Rebuild the index (because it was created with deferred rebuild, it holds no data until it is rebuilt)

alter index index_dom01 on index_test rebuild;
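As a conceptual sketch only: my understanding is that a compact index records, for each indexed value, which data file it occurs in and at which offsets, so a lookup can skip files that cannot match. The Python below models that idea with invented file names and offsets; it is not the handler's actual implementation, and every name in it is hypothetical.

```python
from collections import defaultdict

def build_compact_index(records):
    """records: iterable of (indexed value, file name, offset of the row in that file)."""
    index = defaultdict(set)
    for value, file_name, offset in records:
        index[(value, file_name)].add(offset)
    return {key: sorted(offsets) for key, offsets in index.items()}

def lookup(index, value):
    """Return {file name: offsets} for one indexed value."""
    return {f: offs for (v, f), offs in index.items() if v == value}

records = [(1, "file_0", 0), (2, "file_0", 12), (1, "file_0", 24), (1, "file_1", 0)]
index = build_compact_index(records)
hits_for_1 = lookup(index, 1)
hits_for_2 = lookup(index, 2)
```

A query on id = 2 would only need to read file_0, which is the I/O saving the notes above refer to.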

Show the indexes

show index on index_test;

Drop the index

drop index index_dom01 on index_test;

Source: http://www.51studyit.com/html/notes/20140916/1057.html

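quenlang · posted 2014-9-27 15:32:31

Learned something, showing my support.

EASONLIU · posted 2014-12-17 08:58:01

Just browsing~~~~~~

hery · posted 2015-3-4 23:34:35

Learned something, showing my support.

phonix512 · posted 2015-3-5 00:42:55

Learned something, thanks to the original poster for sharing!

sunt99 · posted 2016-5-19 17:18:47

666666666

wangb · posted 2016-5-31 21:27:18

Learned something, thanks to the original poster.

QIDOUDOU · posted 2016-6-14 21:09:10

Learned something, showing my support.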
