
Hadoop failure: tracking down a BlockAlreadyExistsException

The clues left behind when the Hive job failed:

Failed with exception org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/tmp/hive-root/hive_2011-08-15_00-31-02_332_247809173824307798/-ext-10000/access_bucket-2011-08-14_00004
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1257)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor2037.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.CopyTask

And in the DataNode (DN) log:

2011-08-15 00:47:09,138 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_8964076545845199727_216399 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_8964076545845199727_216399 is valid, and cannot be written to.
2011-08-15 00:47:09,138 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.23:50010, storageID=DS-52195649-192.168.1.23-50010-1299427987620, infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_8964076545845199727_216399 is valid, and cannot be written to.
at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:983)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:98)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:259)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
at java.lang.Thread.run(Thread.java:662)
2011-08-15 00:47:15,366 WARN org.apache.hadoop.util.Shell: Could not get disk usage information
org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access `/data/hadoop/data/dfs.data.dir/tmp/blk_-1540848236479330018_216371.meta': No such file or directory
du: cannot access `/data/hadoop/data/dfs.data.dir/tmp/blk_-1540848236479330018': No such file or directory
at org.apache.hadoop.util.Shell.runCommand(Shell.java:195)
at org.apache.hadoop.util.Shell.run(Shell.java:134)
at org.apache.hadoop.fs.DU.access$200(DU.java:29)
at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:84)
at java.lang.Thread.run(Thread.java:662)

It looks like the DN ran into an unresponsive service while writing; checking further, it turned out that on every DN we had forgotten to set ...



pig2 replied on 2014-1-7 17:22:42:

You can refer to the blog post below:

Our cluster crashed once, and something similar had happened before. The symptom was that a datanode would die again not long after being restarted, with the log showing the following:

2013-04-04 00:37:22,842 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-145226050820856206_12015772 src: /192.168.1.51:9879 dest: /192.168.1.51:50010
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_2613577836754795446_11189945 is valid, and cannot be written to.
at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1402)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:99)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:299)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:107)
at java.lang.Thread.run(Thread.java:662)
----
2013-04-04 00:43:29,975 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.51:50010, storageID=DS-1881855868-192.168.1.51-50010-1338289396565, infoPort=50075, ipcPort=50020):DataXceiver
java.lang.OutOfMemoryError: GC overhead limit exceeded
2013-04-04 00:43:33,082 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.51:50010, dest: /192.168.1.46:17879, bytes: 6175632, op: HDFS_READ, cliID: DFSClient_1354167993, offset: 0, srvID: DS-1881855868-192.168.1.51-50010-1338289396565, blockid: blk_890732859468346910_12015888, duration: 21544233000
2013-04-04 00:43:34,336 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2013-04-04 00:43:36,872 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at hadoop-node-51/192.168.1.51
************************************************************/
Once enough of the errors from the upper half of the log had accumulated, the datanode would OOM and shut down.
At first I read through DataXceiver.java and DataNode.java in the source without finding anything useful; the useful leads came from FSDataset.java, which shows up in the trace, and from BlockAlreadyExistsException.java, the class whose only job is to throw this exception.
FSDataset.java contains this passage:
    if (isValidBlock(b)) {
      if (!isRecovery) {
        throw new BlockAlreadyExistsException("Block " + b + " is valid, and cannot be written to.");
      }
      // If the block was successfully finalized because all packets
      // were successfully processed at the Datanode but the ack for
      // some of the packets were not received by the client. The client
      // re-opens the connection and retries sending those packets.
      // The other reason is that an "append" is occurring to this block.
      detachBlock(b, 1);
    }
And BlockAlreadyExistsException.java contains this passage, which is in fact the entire class:
    class BlockAlreadyExistsException extends IOException {
      private static final long serialVersionUID = 1L;

      public BlockAlreadyExistsException() {
        super();
      }

      public BlockAlreadyExistsException(String msg) {
        super(msg);
      }
    }
The cause seems to be this: after the crash, recovery had to restore the replication factor of 3, so blocks from other nodes were written back to this node. But those blocks already existed on this node, so they needed a recover or a delete, and the datanode first checks the permissions of the existing blk files. Looking into the storage path, some blk files had permissions 664 while others had 644, and the 644 ones can be neither recovered nor deleted. Changing the 644 files to 664 made the error stop; scripting a deletion of the 644 blocks works just as well (a sketch of such a fix follows below). How the 644-permission files were produced in the first place, I still have not figured out.

As for the crash itself: with the exception being thrown over and over, GC could not reclaim memory fast enough, and the DN OOMed. For reference, normal block files and block meta files should have permissions 664; the subdir directories that the datanode hashes out should all be 775, and the blocks under each subdir should again be 664. Finally, increase the memory values for HADOOP_HEAPSIZE and HADOOP_CLIENT_OPTS in hadoop-env.sh (an illustrative snippet follows the sketch).
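Below is a minimal sketch of that permission fix in Java, walking one dfs.data.dir volume and rewriting 644 blk files to 664. The volume path and the class name FixBlockPerms are assumptions made for illustration, not anything from the original post; adapt them to your deployment and dry-run on a copy first.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.attribute.PosixFilePermission;
    import java.nio.file.attribute.PosixFilePermissions;
    import java.util.Set;
    import java.util.stream.Stream;

    public class FixBlockPerms {
      // One dfs.data.dir volume; this path is an assumption, adjust to your deployment.
      private static final Path DATA_DIR = Paths.get("/data/hadoop/data/dfs.data.dir");

      public static void main(String[] args) throws IOException {
        // Target mode 664 (rw-rw-r--), the normal mode for block and block meta files.
        Set<PosixFilePermission> rw664 = PosixFilePermissions.fromString("rw-rw-r--");
        try (Stream<Path> files = Files.walk(DATA_DIR)) {
          files.filter(Files::isRegularFile)
               .filter(p -> p.getFileName().toString().startsWith("blk_"))
               .forEach(p -> {
                 try {
                   // A 644 file is one missing group write; rewrite it to 664.
                   if (!Files.getPosixFilePermissions(p).contains(PosixFilePermission.GROUP_WRITE)) {
                     Files.setPosixFilePermissions(p, rw664);
                     System.out.println("fixed " + p);
                   }
                 } catch (IOException e) {
                   System.err.println("skipped " + p + ": " + e.getMessage());
                 }
               });
        }
      }
    }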

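For the heap settings, the corresponding lines in hadoop-env.sh would look something like the following; the sizes here are illustrative only and need to be tuned to your cluster.

    # hadoop-env.sh -- illustrative sizes, tune to your cluster
    export HADOOP_HEAPSIZE=2048                               # daemon heap, in MB
    export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"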
