此 DataNode 未连接到其一个或多个 NameNode。-HDFS-About云-梭伦科技

grinsky 发表于 2019-12-12 21:19:06

此 DataNode 未连接到其一个或多个 NameNode。

最近hdfs集群出现异常，出现：此 DataNode 未连接到其一个或多个 NameNode。报警
查看日志，也有找到一些错误信息，但还是没有定位到具体问题原因。

该集群之前一直都正常运行，最近也并没有做什么改动，就是突然发生该情况，当时没发现设么，就重启datanode ok。
但该问题并没有解决，即后续每个2-3天会有1个datanode出现该问题（不是同一台datanode）。
查看了监控，发生问题的时间点cpu，内存及GC情况均未出现较大波动（发生问题的时间并非业务高峰时段）。

下面该时间段的一些WARN和ERR信息 ipaddr 是发生问题的datanode的ip：
namenode日志：

晚上6点31:01.610分WARNBlockManagerPendingReplicationMonitor timed out blk_1176622248_105822235
晚上6点31:01.610分WARNBlockManagerPendingReplicationMonitor timed out blk_1176622125_105822105
晚上6点31:01.610分WARNBlockManagerPendingReplicationMonitor timed out blk_1176622384_105822379
晚上6点31:01.610分WARNBlockManagerPendingReplicationMonitor timed out blk_1176621497_105821443
晚上6点31:01.610分WARNBlockManagerPendingReplicationMonitor timed out blk_1176622521_105822523
晚上6点31:01.610分WARNBlockManagerPendingReplicationMonitor timed out blk_1176621498_105821444
晚上6点31:01.610分WARNBlockManagerPendingReplicationMonitor timed out blk_1176621116_105821044

晚上6点31:27.390分WARNNetworkTopologyThe cluster does not contain node: /default/ipaddr:50010
晚上6点31:27.390分WARNNetworkTopologyThe cluster does not contain node: /default/ipaddr:50010
晚上6点31:27.390分WARNNetworkTopologyThe cluster does not contain node: /default/ipaddr:50010
晚上6点31:27.390分WARNNetworkTopologyThe cluster does not contain node: /default/ipaddr:50010
发生问题的datanode的日志：

晚上6点31:03.506分WARNUserGroupInformationPriviledgedActionException as:blk_1175113165_104282103 (auth:SIMPLE) cause:java.io.IOException: replica.getGenerationStamp() < block.getGenerationStamp(), block=BP-202159622-xx.xx.xx.xx-1529480710771:blk_1175113165_104282103, replica=ReplicaWaitingToBeRecovered, blk_1175113165_104281990, RWRgetNumBytes() = 35262477getBytesOnDisk()= 35262477getVisibleLength()= -1getVolume() = /onstardiskl/dfs/dn/currentgetBlockFile() = /onstardiskl/dfs/dn/current/BP-202159622-xx.xx.xx.xx-1529480710771/current/rbw/blk_1175113165
晚上6点31:03.506分INFOServerIPC Server handler 37 on 50020, call org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol.getReplicaVisibleLength from xx.xx.xx.xx:50356 Call#4713650 Retry#0
java.io.IOException: replica.getGenerationStamp() < block.getGenerationStamp(), block=BP-202159622-10.216.83.101-1529480710771:blk_1175113165_104282103, replica=ReplicaWaitingToBeRecovered, blk_1175113165_104281990, RWRgetNumBytes() = 35262477getBytesOnDisk()= 35262477getVisibleLength()= -1getVolume() = /onstardiskl/dfs/dn/currentgetBlockFile() = /onstardiskl/dfs/dn/current/BP-202159622-xx.xx.xx.xx-1529480710771/current/rbw/blk_1175113165
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getReplicaVisibleLength(FsDatasetImpl.java:2591) at org.apache.hadoop.hdfs.server.datanode.DataNode.getReplicaVisibleLength(DataNode.java:2756) at org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.getReplicaVisibleLength(ClientDatanodeProtocolServerSideTranslatorPB.java:107) at org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:17873) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)

晚上6点31:34.343分INFODataNodeSlow BlockReceiver write packet to mirror took 405ms (threshold=300ms)
晚上6点31:34.841分INFODataNodeSlow BlockReceiver write packet to mirror took 411ms (threshold=300ms)
晚上6点31:34.990分INFODataNodeSlow BlockReceiver write packet to mirror took 508ms (threshold=300ms)

晚上6点32:01.135分INFOclienttracesrc: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 231a57d3b76a29bd4df8aaa59539067e, slotIdx: 27, srvID: 3a9de8a1-0c96-4ddc-ab10-bbe2dd0dbd13, success: true
晚上6点32:01.135分INFOclienttracesrc: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 8d170d6023e785411dbbe4bee2e64a18, slotIdx: 126, srvID: 3a9de8a1-0c96-4ddc-ab10-bbe2dd0dbd13, success: true
晚上6点32:01.136分INFOclienttracesrc: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: c3d854d0358db8162ba15be80d2376d5, slotIdx: 79, srvID: 3a9de8a1-0c96-4ddc-ab10-bbe2dd0dbd13, success: true
晚上6点32:01.136分INFOclienttracesrc: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 8d170d6023e785411dbbe4bee2e64a18, slotIdx: 77, srvID: 3a9de8a1-0c96-4ddc-ab10-bbe2dd0dbd13, success: true0

晚上6点40:26.442分INFODataNodeLikely the client has stopped reading, disconnecting it (ipaddr:50010:DataXceiver error processing READ_BLOCK operationsrc: /ipaddr:57414 dst: /ipaddr:50010); java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel

目前还不清楚是什么原因造成的这个问题，不知道有没有遇到过类似情况的小伙伴，希望大家帮帮忙看下，谢谢。

阿飞发表于 2019-12-13 07:34:07

楼主提供的信息比较全，这里给咱们分析下

首先知道PendingReplicationMonitor 用来干什么的？NameNode发现DataNode节点复制Block的副本失败是通过一种检查机制，也就是PendingReplicationMonitor

PendingReplicationMonitor线程会把copy失败的Blocks放到timedOutItems中。
通过这个错误
PendingReplicationMonitor timed out blk_1176622248_105822235
我们看到输出上面信息。我们再来看源码
/**
* Iterate through all items and detect timed-out items
* 通过所有项目迭代检测超时项目
*/
void pendingReplicationCheck() {

// 使用synchronized关键字对pendingReplications进行同步
synchronized (pendingReplications) {

// 获取集合pendingReplications的迭代器
   Iterator<Map.Entry<Block, PendingBlockInfo>> iter =
                              pendingReplications.entrySet().iterator();

   // 记录当前时间now
   long now = now();
   if(LOG.isDebugEnabled()) {
   LOG.debug("PendingReplicationMonitor checking Q");
   }

   // 遍历pendingReplications集合中的每个元素
   while (iter.hasNext()) {

   // 取出每个<Block, PendingBlockInfo>条目
   Map.Entry<Block, PendingBlockInfo> entry = iter.next();

   // 取出Block对应的PendingBlockInfo实例pendingBlock
   PendingBlockInfo pendingBlock = entry.getValue();

   // 判断pendingBlock自其生成时的timeStamp以来到现在，是否已超过timeout时间
   if (now > pendingBlock.getTimeStamp() + timeout) {

   // 超过的话，
   // 取出timeout实例block
      Block block = entry.getKey();

      // 使用synchronized关键字对timedOutItems进行同步
      synchronized (timedOutItems) {

         // 将block添加入复制请求超时的块列表timedOutItems
         timedOutItems.add(block);
      }
      LOG.warn("PendingReplicationMonitor timed out " + block);

      // 从迭代器中移除该条目
      iter.remove();
   }

PendingReplicationMonitor timed out blk_1176622248_105822235这条信息是代码       LOG.warn("PendingReplicationMonitor timed out " + block);
输出的，也就是说pendingReplications检测到了blk_1176622248_105822235副本缺失了。

也就是通过这个信息
PendingReplicationMonitor timed out blk_1176622248_105822235
我们知道了hdfs副本缺失导致产生这条报警。

————————————————
那么为何会丢失副本：
下面两条错误是重要信息
src: /ipaddr:57414 dst: /ipaddr:50010);
The cluster does not contain node: /default/ipaddr:50010

上面其实是代表节点出问题了，其实这个问题正说明了上面问题的，副本丢失。所以这是根本的错误，如果我们解决了上面两个问题，那么副本丢失就可能解决掉。那么该如何解决掉上面问题，这里推测可能原因如下，楼主也可以找更多信息，更准确定位错误：
1.网络原因，比如防火墙、端口、ip地址、hosts、hostname等等可能的网络问题，
2.进程出问题了，可能出现僵尸进程。
3.楼主说重启解决问题，可能就是上面进程的问题。还需要详细查看日志，追踪根本原因
4.另外可能是数据量过多等造成的。

总结
上面只是猜测，我们知道原因后，剩下的就是细心问题了。多看看日志。

grinsky 发表于 2019-12-19 21:01:40

阿飞发表于 2019-12-13 07:34
楼主提供的信息比较全，这里给咱们分析下

首先知道PendingReplicationMonitor 用来干什么的？NameNode发 ...

多谢帮助，目前该问题已经解决了。
问题原因：以前系统中虽然修改了用户的最大线程数，但没有修改系统的最大线程数，
导致现在线程数达到系统的最大限制后无法再创建新的线程，所以就出现上述问题。

解决方放：修改系统最大线程数即可。
/etc/sysctl.conf中配置kernel.pid_max

目前更改后已过去1周，未再次出现该问题。

页: [1]

About云-梭伦科技's Archiver

此 DataNode 未连接到其一个或多个 NameNode。