Hive – Distinct解析

问题导读

1.本文是如何解析sql的？

2.如何查看sql执行计划?

准备数据

语句
[mw_shl_code=bash,true]SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;[/mw_shl_code]

[mw_shl_code=bash,true]hive> SELECT * FROM logs;
OK
a 苹果 3
a 橙子 3
a 烧鸡 1
b 烧鸡 3

hive> SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;[/mw_shl_code]

根据count分组，计算独立用户数。

计算过程

1. 第一步先在mapper计算部分值，会以count和uid作为key，如果是distinct并且之前已经出现过，则忽略这条计算。第一步是以组合为key，第二步是以count为key.
2. ReduceSink是在mapper.close()时才执行的，在GroupByOperator.close()时，把结果输出。注意这里虽然key是count和uid，但是在reduce时分区是按count来的！
3. 第一步的distinct计算的值没用，要留到reduce计算的才准确。这里只是减少了key组合相同的行。不过如果是普通的count，后面是会合并起来的。
4. distinct通过比较lastInvoke判断要不要+1（因为在reduce是排序过了的，所以判断distict的字段变了没有，如果没变，则不+1）

Operator

Explain

[mw_shl_code=bash,true]hive> explain select count, count(distinct uid) from logs group by count;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL count)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL uid)))) (TOK_GROUPBY (TOK_TABLE_OR_COL count))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
Map Reduce
   Alias -> Map Operator Tree:
      logs
      TableScan //表扫描
         alias: logs
         Select Operator//列裁剪，取出uid，count字段就够了
            expressions:
                  expr: count
                  type: int
                  expr: uid
                  type: string
            outputColumnNames: count, uid
            Group By Operator //先来map聚集
            aggregations:
                  expr: count(DISTINCT uid) //聚集表达式
            bucketGroup: false
            keys:
                  expr: count
                  type: int
                  expr: uid
                  type: string
            mode: hash //hash方式
            outputColumnNames: _col0, _col1, _col2
            Reduce Output Operator
               key expressions: //输出的键
                     expr: _col0 //count
                     type: int
                     expr: _col1 //uid
                     type: string
               sort order: ++
               Map-reduce partition columns: //这里是按group by的字段分区的
                     expr: _col0 //这里表示count
                     type: int
               tag: -1
               value expressions:
                     expr: _col2
                     type: bigint
   Reduce Operator Tree:
      Group By Operator //第二次聚集
      aggregations:
            expr: count(DISTINCT KEY._col1:0._col0) //uid:count
      bucketGroup: false
      keys:
            expr: KEY._col0 //count
            type: int
      mode: mergepartial //合并
      outputColumnNames: _col0, _col1
      Select Operator //列裁剪
         expressions:
               expr: _col0
               type: int
               expr: _col1
               type: bigint
         outputColumnNames: _col0, _col1
         File Output Operator //输出结果到文件
            compressed: false
            GlobalTableId: 0
            table:
               input format: org.apache.hadoop.mapred.TextInputFormat
               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
Fetch Operator
   limit: -1[/mw_shl_code]

szcountryboy · 发表于 2015-9-22 11:38:17

a,3 的value是2还是1？

感觉图片上面计算过程里面的第一行，a 3 map后应该是2个吧？

图文精华

Hive – Distinct解析

已有(1)人评论

推荐 /2