
The role of ZooKeeper in YARN HA

CR_Y posted on 2018-1-11 14:39:40
In YARN HA, what data does the ResourceManager write to ZooKeeper? My RM and ZK are both reporting errors and I cannot find the cause. Some of the relevant logs are below:
ResourceManager log:
2018-01-07 16:05:11,244 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1
2018-01-07 16:05:11,245 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode...
2018-01-07 16:05:11,869 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.200.185.131/10.200.185.131:2181. Will not attempt to authenticate using SASL (unknown error)
2018-01-07 16:05:11,870 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.200.185.131/10.200.185.131:2181, initiating session
2018-01-07 16:05:11,872 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.200.185.131/10.200.185.131:2181, sessionid = 0x45a8e21311c01ed, negotiated timeout = 10000
2018-01-07 16:05:11,881 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x45a8e21311c01ed, likely server has closed socket, closing socket connection and attempting reconnect
2018-01-07 16:05:11,981 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.

2018-01-07 16:46:52,350 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 999
2018-01-07 16:46:53,069 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server 10.200.185.134/10.200.185.134:2181. Will not attempt to authenticate using SASL (unknown error)
2018-01-07 16:46:53,070 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to 10.200.185.134/10.200.185.134:2181, initiating session
2018-01-07 16:46:53,071 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server 10.200.185.134/10.200.185.134:2181, sessionid = 0x45a8e21311c01ed, negotiated timeout = 10000
2018-01-07 16:46:53,086 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x45a8e21311c01ed, likely server has closed socket, closing socket connection and attempting reconnect


ZooKeeper log:
2018-01-07 16:05:11,145 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x25dd0e6d1f300ec, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:744)
2018-01-07 16:05:12,681 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x45a8e21311c01ed due to java.io.IOException: Len error 1530558
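
A note on the last ZooKeeper line, offered as a hint rather than a confirmed diagnosis: the server logs `Len error N` when a client request declares a length larger than its `jute.maxbuffer` limit, and then closes the connection. The stock default for that limit is 0xfffff (1048575 bytes), and 1530558 is above it. Since ZKRMStateStore stores ResourceManager state in ZooKeeper, an oversized write from the RM is one plausible source. The sketch below pulls such lengths out of a server log and compares them with a limit. The function name `len_errors_over_limit` is hypothetical, and the default limit is an assumption that should be checked against the cluster's actual `jute.maxbuffer` setting.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read ZooKeeper server log lines on stdin and report
# every "Len error N" whose N is larger than the given limit.
# Assumption: the limit defaults to 1048575 bytes (0xfffff), the stock
# jute.maxbuffer value; pass the real setting as $1 if it was changed.
len_errors_over_limit() {
  local limit="${1:-1048575}"
  grep -o 'Len error [0-9]*' |
    awk -v limit="$limit" '$3 + 0 > limit { print $3, "exceeds", limit }'
}

# Example using the message from the log above.
echo 'java.io.IOException: Len error 1530558' | len_errors_over_limit
```

To scan a whole log, pipe the file in instead, for example `len_errors_over_limit < zookeeper.out` (the file name depends on the installation).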

5 comments
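sstutu posted on 2018-1-11 14:50:51
Two questions:
1. What authentication method does the cluster use? SSH, or SASL?
2. How is NTP time synchronization set up on the cluster?
Do all nodes sync against network time servers, or does the master sync against the network while the other nodes sync against the master? If clocks across the cluster disagree, session communication can fail.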

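CR_Y posted on 2018-1-11 14:57:53
sstutu posted on 2018-1-11 14:50
Two questions:
1. What authentication method does the cluster use? SSH, or SASL?
2. How is NTP time synchronization set up on the cluster?

We run a stock Hadoop cluster. The authentication method should be SSH (I am not entirely sure about this), and every node syncs its clock against network time.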

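sstutu posted on 2018-1-11 15:53:44
CR_Y posted on 2018-1-11 14:57
We run a stock Hadoop cluster. The authentication method should be SSH (I am not entirely sure about this), and every node syncs its clock against network time.

How many ZooKeeper nodes are there? It looks like only the one with [myid:1] is having problems. Look at its log on its own, and also check whether its process is still running.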

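CR_Y posted on 2018-1-11 16:35:22
sstutu posted on 2018-1-11 15:53
How many ZooKeeper nodes are there? It looks like only the one with [myid:1] is having problems. Look at its log on its own, and also check whether its process is still running.

There are seven ZK nodes in total, and the other nodes show similar log entries. This is the log from the node with myid 1: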
2018-01-07 16:05:11,145 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x25dd0e6d1f300ec, likely client has closed socket
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
        at java.lang.Thread.run(Thread.java:744)


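sstutu posted on 2018-1-11 17:08:35
CR_Y posted on 2018-1-11 16:35
There are seven ZK nodes in total, and the other nodes show similar log entries. This is the log from the node with myid 1:
2018-01-07 16:05:11,145 [myid:1] -  ...

Run the following command to see what is going on: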
[mw_shl_code=bash,true]netstat -antp | grep 2181
[/mw_shl_code]
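
On a busy ZooKeeper node the raw `netstat` listing can be long, so it may help to tally the connections on port 2181 by TCP state. The sketch below assumes the usual Linux `netstat -antp` column layout (local address in column 4, remote address in column 5, state in column 6). The function name `count_zk_states` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `netstat -antp`-style lines on stdin and count
# TCP connections whose local or remote port is 2181, grouped by state.
# Assumption: column 4 = local address, column 5 = remote address,
# column 6 = state, as in Linux net-tools output.
count_zk_states() {
  awk '$1 ~ /^tcp/ && ($4 ~ /:2181$/ || $5 ~ /:2181$/) { n[$6]++ }
       END { for (s in n) print s, n[s] }' | sort
}

# Usage on a ZooKeeper or ResourceManager host:
#   netstat -antp | count_zk_states
```

Many connections stuck in states such as `CLOSE_WAIT` or `TIME_WAIT`, or a count that keeps growing between runs, would fit the reconnect loop seen in the ResourceManager log.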
