hive：如何查看hive执行计划，调优hive执行

问题导读

1.如何预先查看hive执行情况？
2.如何查看hive有多少job?

Hive中在做多表关联时,由于Hive的SQL优化引擎还不够强大,表的关联顺序不同往往导致产生不同数量的MapReduce作业数。这时就需要通过分析执行计划对SQL进行调整，以获得最少的MapReduce作业数。举一个例子（案例基于Hive 0.6.0）：

create table ljn1(

k1 bigint,

k2 String,

v1 int

);

create table ljn2(

k1 bigint,

v2 int

);

create table ljn3(

k1 bigint,

v3 int

);

create table ljn4(

k1 bigint,

v4 int

);

create table ljn5(

k1 bigint,

v5 int

);

create table ljn6(

k2 string,

v6 int

);

然后看一下下面这个SQL的执行计划:

explain

select a.v1

from

ljn1 a

left outer join ljn2 b on (a.k1 = b.k1)

left outer join ljn3 c on (a.k1 = c.k1)

left outer join ljn4 d on (a.k1 = d.k1)

left outer join ljn6 e on (a.k2 = e.k2)

left outer join ljn5 f on (a.k1 = f.k1);

STAGE DEPENDENCIES:

Stage-5 is a root stage

Stage-1 depends on stages: Stage-5

Stage-2 depends on stages: Stage-1

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-5

Map Reduce

Alias -> Map Operator Tree:

a

TableScan

alias: a

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 0

value expressions:

expr: k1

type: bigint

expr: k2

type: string

expr: v1

type: int

b

TableScan

alias: b

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 1

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

condition expressions:

0 {VALUE._col0} {VALUE._col1} {VALUE._col2}

1

handleSkewJoin: false

outputColumnNames: _col0, _col1, _col2

File Output Operator

compressed: true

GlobalTableId: 0

table:

input format: org.apache.hadoop.mapred.SequenceFileInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Stage: Stage-1

Map Reduce

Alias -> Map Operator Tree:

$INTNAME

Reduce Output Operator

key expressions:

expr: _col0

type: bigint

sort order: +

Map-reduce partition columns:

expr: _col0

type: bigint

tag: 0

value expressions:

expr: _col1

type: string

expr: _col2

type: int

c

TableScan

alias: c

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 1

d

TableScan

alias: d

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 2

f

TableScan

alias: f

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 3

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

Left Outer Join0 to 2

Left Outer Join0 to 3

condition expressions:

0 {VALUE._col3} {VALUE._col4}

1

2

3

handleSkewJoin: false

outputColumnNames: _col3, _col4

File Output Operator

compressed: true

GlobalTableId: 0

table:

input format: org.apache.hadoop.mapred.SequenceFileInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Stage: Stage-2

Map Reduce

Alias -> Map Operator Tree:

$INTNAME

Reduce Output Operator

key expressions:

expr: _col3

type: string

sort order: +

Map-reduce partition columns:

expr: _col3

type: string

tag: 0

value expressions:

expr: _col4

type: int

e

TableScan

alias: e

Reduce Output Operator

key expressions:

expr: k2

type: string

sort order: +

Map-reduce partition columns:

expr: k2

type: string

tag: 1

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

condition expressions:

0 {VALUE._col10}

1

handleSkewJoin: false

outputColumnNames: _col10

Select Operator

expressions:

expr: _col10

type: int

outputColumnNames: _col0

File Output Operator

compressed: true

GlobalTableId: 0

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0

Fetch Operator

limit: -1

常规来讲,这个SQL非常简单,a表是主表，与其他表左外关联用到了k1和k2两个关联键，使用两个MapReduce作业完全可以搞定。但是这个SQL的执行计划却给出了3个作业：（Stage-0用做数据的最终展示，该作业可以忽略不计）第1个作业（Stage-5）是a表与b表关联；第2个作业（Stage-1）是第1个作业的中间结果再与c、d、f三表关联；第3个作业（Stage-2）是第2个作业的中间结果再与e表关联。

有点搞不懂了吧，第1和第2个作业明明可以合并在一起来完成的呀！其实我也搞不懂，从执行计划中看不出原由。而且如果这个SQL去掉c或者e其中的一个关联表，第1和第2个作业就可以合并在一起！很奇妙，我没有深入探究，应该是Hive的规则优化器还不够完美。

总之，遇到这种多表关联的情况一定要记得看一下执行计划，看看Hive是不是生成了多余的作业。如果Hive真的犯傻生成了多余的作业，就要尝试改变一下SQL的写法。通常是将关联键相同的表放在一起，如果还不行就再引入子查询。例如上面这个例子改为如下SQL就可以只生成2个作业了：

explain

select t.v1

from

(

select a.k2,a.v1

from

ljn1 a

left outer join ljn2 b on (a.k1 = b.k1)

left outer join ljn3 c on (a.k1 = c.k1)

left outer join ljn4 d on (a.k1 = d.k1)

left outer join ljn5 f on (a.k1 = f.k1)

) t

left outer join ljn6 e on (t.k2 = e.k2)

;

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-2 depends on stages: Stage-1

Stage-0 is a root stage

STAGE PLANS:

Stage: Stage-1

Map Reduce

Alias -> Map Operator Tree:

t:a

TableScan

alias: a

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 0

value expressions:

expr: k2

type: string

expr: v1

type: int

t:b

TableScan

alias: b

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 1

t:c

TableScan

alias: c

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 2

t:d

TableScan

alias: d

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 3

t:f

TableScan

alias: f

Reduce Output Operator

key expressions:

expr: k1

type: bigint

sort order: +

Map-reduce partition columns:

expr: k1

type: bigint

tag: 4

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

Left Outer Join0 to 2

Left Outer Join0 to 3

Left Outer Join0 to 4

condition expressions:

0 {VALUE._col1} {VALUE._col2}

1

2

3

4

handleSkewJoin: false

outputColumnNames: _col1, _col2

Select Operator

expressions:

expr: _col1

type: string

expr: _col2

type: int

outputColumnNames: _col0, _col1

File Output Operator

compressed: true

GlobalTableId: 0

table:

input format: org.apache.hadoop.mapred.SequenceFileInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Stage: Stage-2

Map Reduce

Alias -> Map Operator Tree:

$INTNAME

Reduce Output Operator

key expressions:

expr: _col0

type: string

sort order: +

Map-reduce partition columns:

expr: _col0

type: string

tag: 0

value expressions:

expr: _col1

type: int

e

TableScan

alias: e

Reduce Output Operator

key expressions:

expr: k2

type: string

sort order: +

Map-reduce partition columns:

expr: k2

type: string

tag: 1

Reduce Operator Tree:

Join Operator

condition map:

Left Outer Join0 to 1

condition expressions:

0 {VALUE._col1}

1

handleSkewJoin: false

outputColumnNames: _col1

Select Operator

expressions:

expr: _col1

type: int

outputColumnNames: _col0

File Output Operator

compressed: true

GlobalTableId: 0

table:

input format: org.apache.hadoop.mapred.TextInputFormat

output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Stage: Stage-0

Fetch Operator

limit: -1

Joker · 发表于 2015-10-29 09:35:36

这种执行计划是怎么做的？对Hive的理解自己写的还是有软件？谢谢

wubaozhou · 发表于 2015-1-1 16:38:20

pig2 · 发表于 2015-10-29 10:11:42

Joker 发表于 2015-10-29 09:35
这种执行计划是怎么做的？对Hive的理解自己写的还是有软件？谢谢

这是hive自带的

大数据爱我 · 发表于 2015-11-1 19:53:58

我觉得这种问题是因为数据量的大小导致的。

webank_hadoop · 发表于 2015-11-11 17:19:03

Joker 发表于 2015-10-29 09:35
这种执行计划是怎么做的？对Hive的理解自己写的还是有软件？谢谢

和关系型数据库差不多，
explain
select 1 from dual;

雷夫23 · 发表于 2017-6-29 16:59:55

真不错，学习学习

hasqjh7 · 发表于 2017-8-2 18:32:05

记得，多个Join的条件一样，就是一个Job，不一样就会产生多个是Job。

翱翔于天变 · 发表于 2018-1-30 16:05:07

不错，学习，学习！

图文精华

hive：如何查看hive执行计划，调优hive执行

已有(8)人评论

活跃会员

热心会员

优秀版主

论坛元老

最佳新人

推荐 /2