[Help] A question about Hive data skew

Hi everyone,

First, here is my HQL statement:

    insert into table YHX_Report_SourceSplit
    select /*+mapjoin(sensor_shopinfo_localId) */
        sensor_shopinfo_localId.localId,
        localId,
        '', '', '',
        from_unixtime(rawdata_sensor_a.rd_sen_ts,'yyyy-MM-dd HH:00:00'),
        from_unixtime(rawdata_sensor_a.rd_sen_ts,'yyyy-MM-dd'),
        from_unixtime(rawdata_sensor_a.rd_sen_ts),
        rawdata_sensor_a.rd_sen_ts,
        if(rawdata_sensor_a.rd_sen_rssi>=sensor_shopinfo_localId.maxRssi,'1','2'),
        '1',
        rawdata_sensor_a.rd_sen_devmac,
        rawdata_sensor_a.rd_sen_smac,
        rawdata_sensor_a.rd_sen_apmac,
        rawdata_sensor_a.rd_sen_rssi,
        '', '', '', '', ''
    from sensor_shopinfo_localId
    left outer join rawdata_sensor_a
        on datediff(from_unixtime(rd_sen_ts),'2014-11-29')=0
        and sensor_shopinfo_localId.sensorMac = rawdata_sensor_a.rd_sen_smac;

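I am not doing any GROUP BY or DISTINCT. I am only splitting the data from table A and storing it in table B.

The final result looks like this:
[Image: 大数据入库test1.png]

As you can see, 13 reducers were created in total, but if you look closely, most of them received almost no data, which means there is severe data skew.
I tried setting hive.groupby.skewindata=true, but that parameter seems to apply only to GROUP BY, so it did not help.

So I would like to ask: how should I resolve this kind of data skew?

Thanks,
hark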

lixiaoliang7 · posted on 2014-12-25 11:14:41

I think I have a lead now. I used EXPLAIN to look at Hive's execution plan
and saw this:

 Reduce Output Operator
                key expressions: rd_sen_smac (type: string)
                sort order: +
                Map-reduce partition columns: rd_sen_smac (type: string)
                Statistics: Num rows: 1 Data size: 503 Basic stats: COMPLETE Column stats: NONE
                value expressions: rd_sen_smac (type: string), rd_sen_devmac (type: string), rd_sen_rssi (type: string), rd_sen_ts (type: int), rd_sen_apmac (type: string)

It looks like the reduce key is the column the two tables are joined on, rd_sen_smac.

I then ran a GROUP BY on rd_sen_smac to count the rows for each key, and the counts are as follows:

rd_sen_smac     _c1
0023cd02041d    38045088
0023cd020457    83382687
0023cd020540    65777040
0023cd02086d    449712

Ugh, so what can I do about this? There are only four keys, yet there are 13 reducers...
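
A small simulation may help explain the empty reducers. In a shuffle join, every row with the same join key is routed to the same reducer, typically by hashing the key modulo the reducer count, so four distinct rd_sen_smac values can occupy at most four of the 13 reducers. The sketch below uses the row counts from the post. The hash function is a Java-style string hash chosen purely for illustration; that is an assumption and not necessarily what Hive computes internally.

```python
# Row counts per join key, taken from the GROUP BY result in the post.
KEY_COUNTS = {
    "0023cd02041d": 38045088,
    "0023cd020457": 83382687,
    "0023cd020540": 65777040,
    "0023cd02086d": 449712,
}
NUM_REDUCERS = 13


def illustrative_hash(key: str) -> int:
    """Java-style string hash (h = 31*h + char), kept non-negative.

    Illustration only: Hive's real partitioning hash may differ.
    """
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h & 0x7FFFFFFF


def reducer_loads(key_counts, num_reducers):
    """Return rows per reducer when whole keys are routed by hash."""
    loads = [0] * num_reducers
    for key, rows in key_counts.items():
        loads[illustrative_hash(key) % num_reducers] += rows
    return loads


loads = reducer_loads(KEY_COUNTS, NUM_REDUCERS)
busy = [i for i, rows in enumerate(loads) if rows > 0]
print("rows per reducer:", loads)
print("reducers with data:", len(busy), "of", NUM_REDUCERS)
```

Whatever hash is used, the number of reducers with data cannot exceed the number of distinct keys, so at least nine of the 13 reducers stay empty for this join.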

lixiaoliang7 · posted on 2014-12-25 11:24:52

Here is the full execution plan. Could everyone please take a look:

Explain
STAGE DEPENDENCIES:
  Stage-5 is a root stage , consists of Stage-1
  Stage-1
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-5
    Conditional Operator

  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: rawdata_sensor_a
            Statistics: Num rows: 3 Data size: 1510 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (datediff(from_unixtime(rd_sen_ts), '2014-11-29') = 0) (type: boolean)
              Statistics: Num rows: 1 Data size: 503 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                key expressions: rd_sen_smac (type: string)
                sort order: +
                Map-reduce partition columns: rd_sen_smac (type: string)
                Statistics: Num rows: 1 Data size: 503 Basic stats: COMPLETE Column stats: NONE
                value expressions: rd_sen_smac (type: string), rd_sen_devmac (type: string), rd_sen_rssi (type: string), rd_sen_ts (type: int), rd_sen_apmac (type: string)
          TableScan
            alias: sensor_shopinfo_localid
            Statistics: Num rows: 8 Data size: 280 Basic stats: COMPLETE Column stats: NONE
            Reduce Output Operator
              key expressions: sensormac (type: string)
              sort order: +
              Map-reduce partition columns: sensormac (type: string)
              Statistics: Num rows: 8 Data size: 280 Basic stats: COMPLETE Column stats: NONE
              value expressions: localid (type: string), maxrssi (type: int)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col2} {VALUE._col3}
            1 {VALUE._col0} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5}
          outputColumnNames: _col2, _col3, _col6, _col8, _col9, _col10, _col11
          Statistics: Num rows: 8 Data size: 308 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: _col2 (type: string), _col2 (type: string), '' (type: string), '' (type: string), '' (type: string), CAST( from_unixtime(_col10, 'yyyy-MM-dd HH:00:00') AS TIMESTAMP) (type: timestamp), CAST( from_unixtime(_col10, 'yyyy-MM-dd') AS DATE) (type: date), CAST( from_unixtime(_col10) AS TIMESTAMP) (type: timestamp), _col10 (type: int), UDFToByte(if((_col9 >= _col3), '1', '2')) (type: tinyint), UDFToByte('1') (type: tinyint), _col8 (type: string), _col6 (type: string), _col11 (type: string), UDFToFloat(_col9) (type: float), UDFToFloat('') (type: float), UDFToFloat('') (type: float), UDFToInteger('') (type: int), UDFToDouble('') (type: double), UDFToDouble('') (type: double)
            outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19
            Statistics: Num rows: 8 Data size: 308 Basic stats: COMPLETE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 8 Data size: 308 Basic stats: COMPLETE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: yhx_report_test.yhx_report_sourcesplit

  Stage: Stage-0
    Move Operator
      tables:
          replace: false
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: yhx_report_test.yhx_report_sourcesplit

  Stage: Stage-2
    Stats-Aggr Operator

Time taken: 0.266 seconds, Fetched: 69 row(s)

lixiaoliang7 · posted on 2014-12-25 11:26:51

Also, the left table is tiny, with only a handful of rows. The right table holds roughly 10 GB of data.
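
Given those sizes, a map-side (broadcast) join would normally avoid the shuffle, yet the plan above shows a reduce-side Join Operator, so the mapjoin hint apparently did not take effect. One likely reason, offered as an assumption rather than a confirmed diagnosis: a map join streams the large table while each mapper sees only part of it, so the preserved side of an outer join has to be the streamed table, and here the preserved (left) table is the small one. The sketch below mimics a map-side inner join in plain Python. The sample rows and the helper name map_side_join are hypothetical.

```python
# Hypothetical sample data; the real tables are far larger.
shop_info = [  # small table: (sensorMac, localId, maxRssi)
    ("0023cd02041d", "shop-1", -60),
    ("0023cd020457", "shop-2", -55),
    ("ffffffffffff", "shop-3", -50),  # sensor with no raw data
]
raw_rows = [  # big table: (rd_sen_smac, rd_sen_devmac, rd_sen_rssi)
    ("0023cd02041d", "dev-a", -58),
    ("0023cd020457", "dev-b", -70),
    ("0023cd020457", "dev-c", -40),
]


def map_side_join(small_rows, big_split):
    """Inner-join one split of the big table against an in-memory copy
    of the small table. No shuffle and no reducers are involved."""
    lookup = {mac: (local_id, max_rssi) for mac, local_id, max_rssi in small_rows}
    for smac, devmac, rssi in big_split:
        match = lookup.get(smac)
        if match is not None:
            local_id, max_rssi = match
            flag = "1" if rssi >= max_rssi else "2"
            yield (local_id, devmac, smac, rssi, flag)


# Each "mapper" handles its own split independently.
splits = [raw_rows[:2], raw_rows[2:]]
joined = [row for split in splits for row in map_side_join(shop_info, split)]
for row in joined:
    print(row)

# A single mapper cannot tell whether "ffffffffffff" matched nothing in the
# other splits, which is why the small table cannot be the preserved side.
```

If rows for sensors without any raw data are not needed in the target table, rewriting the query as an inner join may let Hive apply the map join. Whether that holds for a given Hive version and configuration would need to be checked with EXPLAIN.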

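desehawk · posted on 2014-12-25 12:46:22

The OP has already done some optimization, and according to the execution plan there is only one job.

I see there are empty values in the OP's SQL, so you could optimize along these lines:

    select * from log a left outer join users b on case when a.user_id is null then concat('hive',rand() ) else a.user_id end = b.user_id;

The above is only an example. The OP can try replacing the null values with some specific value.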
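
The same idea of rewriting the join key can in principle be aimed at the hot keys instead of the nulls, a technique usually called key salting (named explicitly here because it is not what the post above shows). Each big-table row gets a salt from 0 to SALTS-1 attached to its key, and the small table is replicated once per salt, so one hot key is spread over several groups while the join result stays the same. The sketch below is a plain Python simulation with inner-join semantics for brevity. SALTS, the sample rows, and the helper names are assumptions.

```python
from collections import defaultdict

SALTS = 4  # assumed number of salt buckets per key

shop_info = {"0023cd020457": "shop-2", "0023cd02086d": "shop-4"}  # small table
raw_rows = [("0023cd020457", f"dev-{i}") for i in range(10)]      # hot key
raw_rows.append(("0023cd02086d", "dev-x"))                        # rare key


def plain_join(small, big):
    """Reference result: join directly on the MAC address."""
    return sorted((small[mac], dev) for mac, dev in big if mac in small)


def salted_join(small, big, salts):
    """Join on (mac, salt); returns the joined rows and the rows per salted key."""
    # Replicate every small-table row once per salt value.
    replicated = {(mac, s): local_id
                  for mac, local_id in small.items()
                  for s in range(salts)}
    groups = defaultdict(list)  # one group per salted key, as a reducer would see it
    for i, (mac, dev) in enumerate(big):
        groups[(mac, i % salts)].append(dev)  # round-robin salt, deterministic here
    joined = sorted((replicated[key], dev)
                    for key, devs in groups.items() if key in replicated
                    for dev in devs)
    return joined, {key: len(devs) for key, devs in groups.items()}


joined, group_sizes = salted_join(shop_info, raw_rows, SALTS)
print("same result as plain join:", joined == plain_join(shop_info, raw_rows))
print("largest salted group:", max(group_sizes.values()), "of", len(raw_rows), "rows")
```

The cost of this approach is that the small table is replicated SALTS times, which is cheap when it has only a few rows, as in this thread.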

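lixiaoliang7 · posted on 2014-12-25 14:25:00

This post was last edited by lixiaoliang7 on 2014-12-25 10:28

desehawk posted on 2014-12-25 08:46
The OP has already done some optimization, and according to the execution plan there is only one job.

I see there are empty values in the OP's SQL, so you could optimize along these lines

Actually, the crux is that some reducers are overloaded while others have almost nothing to do. I don't know how to solve that.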

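rsgg03 · posted on 2014-12-25 17:32:56

This may not be what the OP thinks it is. With data skew, the last map or reduce tasks run very slowly.
If some of the output files are empty, the OP can set the number of reducers,
or merge the files at the end before writing the output.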
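
Some quick arithmetic on the per-key counts posted earlier suggests why changing the reducer count alone may not balance this job: as long as each key is routed whole to one reducer, the busiest reducer handles at least the largest key, which is about 44% of all rows here. The sketch below computes that best-case bound for a few reducer counts. The function name best_case_max_load is hypothetical.

```python
import math

# Per-key row counts from the GROUP BY result earlier in the thread.
KEY_COUNTS = [38045088, 83382687, 65777040, 449712]


def best_case_max_load(key_counts, num_reducers):
    """Lower bound on the busiest reducer's row count when each key
    must be handled by a single reducer."""
    even_share = math.ceil(sum(key_counts) / num_reducers)
    return max(max(key_counts), even_share)


total = sum(KEY_COUNTS)
for n in (1, 2, 4, 13):
    bound = best_case_max_load(KEY_COUNTS, n)
    print(f"{n:>2} reducers: busiest reducer gets at least {bound} rows "
          f"({bound / total:.0%} of the total)")
```

Fewer reducers would mainly cut down the number of empty output files. Spreading the heavy keys requires changing how rows are keyed or avoiding the shuffle altogether.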

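lixiaoliang7 · posted on 2014-12-25 18:07:32

rsgg03 posted on 2014-12-25 13:32
This may not be what the OP thinks it is. With data skew, the last map or reduce tasks run very slowly.
If some of the output files are empty, the OP can set the number of red ...

Thanks. I have to deal with another issue first.
Once that is done I will come back and try this. If I get it solved, I will post an update in this thread.
Many thanks to the moderator for the kind reply.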

zzuyao · posted on 2015-3-13 17:02:17

OP, how did this turn out?

ainubis · posted on 2015-3-28 15:56:26

Just passing by to learn!
