The master log showed the following problem:
2013-11-19 06:24:35,134 DEBUG [h1,60000,1384804461419-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
2013-11-19 06:29:35,451 DEBUG [h1,60000,1384804461419-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
2013-11-19 06:34:35,578 DEBUG [h1,60000,1384804461419-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
2013-11-19 06:39:35,697 DEBUG [h1,60000,1384804461419-BalancerChore] balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)
The master log shows that only one regionserver is active, even though my test environment is configured with two regionservers.
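A quick way to spot this condition is to pull the active regionserver count out of the BalancerChore lines. The snippet below is only a minimal sketch: the helper name `active_regionservers` and the `EXPECTED` constant are my own illustrative choices, not part of HBase, and the regular expression is written against the exact message text shown in the log above.

```python
import re

# Matches the BalancerChore message seen in the master log above.
PATTERN = re.compile(r"Not running balancer because only (\d+) active regionserver\(s\)")

def active_regionservers(log_line):
    """Return the active regionserver count reported in a balancer line, or None."""
    match = PATTERN.search(log_line)
    return int(match.group(1)) if match else None

line = ("2013-11-19 06:24:35,134 DEBUG [h1,60000,1384804461419-BalancerChore] "
        "balancer.BaseLoadBalancer: Not running balancer because only 1 active regionserver(s)")

EXPECTED = 2  # assumption: the number of regionservers configured in this test setup
count = active_regionservers(line)
if count is not None and count < EXPECTED:
    print(f"only {count} of {EXPECTED} regionservers active")
```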
So I checked the logs on each of the two regionservers.
Sure enough, one of them contained:
2013-11-19 08:09:34,900 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2311)
2013-11-19 08:09:34,901 INFO [Shutdownhook:regionserver60020] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-4,5,main]
2013-11-19 08:09:34,901 INFO [Shutdownhook:regionserver60020] regionserver.HRegionServer: STOPPED: Shutdown hook
2013-11-19 08:09:34,901 INFO [Shutdownhook:regionserver60020] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2013-11-19 08:09:34,906 INFO [Shutdownhook:regionserver60020] regionserver.ShutdownHook: Shutdown hook finished.
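I tried restarting the HBase cluster, and following the logs showed that this regionserver could not start.
A closer look at the regionserver log turned up the following: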
2013-11-19 08:09:34,765 FATAL [regionserver60020] regionserver.HRegionServer: Master rejected startup because clock is out of sync
org.apache.hadoop.hbase.ClockOutOfSyncException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server h2,60020,1384819773571 has been rejected; Reported time is too far out of sync with master. Time difference of 32504ms > max allowed of 30000ms
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:235)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1926)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:790)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException: org.apache.hadoop.hbase.ClockOutOfSyncException: Server h2,60020,1384819773571 has been rejected; Reported time is too far out of sync with master. Time difference of 32504ms > max allowed of 30000ms
at org.apache.hadoop.hbase.master.ServerManager.checkClockSkew(ServerManager.java:314)
at org.apache.hadoop.hbase.master.ServerManager.regionServerStartup(ServerManager.java:215)
at org.apache.hadoop.hbase.master.HMaster.regionServerStartup(HMaster.java:1292)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:5085)
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2146)
at org.apache.hadoop.hbase.ipc.RpcServer$Handler.run(RpcServer.java:1851)
at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1446)
at org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.java:1650)
at org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementation.callBlockingMethod(RpcClient.java:1708)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:5402)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:1924)
... 2 more
Time difference of 32504ms > max allowed of 30000ms
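That message made me think of the operating system clocks on the cluster servers. After I resynchronized the time across the servers in the cluster, the regionserver started normally.
The takeaway: the clock difference between a regionserver and the master in an HBase cluster must not exceed 30 seconds (the 30000 ms limit reported in the log). My cluster had no time server configured, and the resulting clock drift caused this failure.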
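To make the numbers concrete, here is a small sketch of the kind of check the master appears to perform, based only on the message in the log. The function names `parse_skew` and `is_rejected` are hypothetical, and the comparison (absolute difference strictly greater than the limit) is my assumption inferred from the "32504ms > max allowed of 30000ms" wording, not code copied from HBase. As far as I know, the limit itself comes from the `hbase.master.maxclockskew` setting, which defaults to 30000 ms.

```python
import re

MAX_SKEW_MS = 30000  # limit reported in the log above

def parse_skew(message):
    """Extract (observed_skew_ms, allowed_ms) from the exception message, or None."""
    m = re.search(r"Time difference of (\d+)ms > max allowed of (\d+)ms", message)
    return (int(m.group(1)), int(m.group(2))) if m else None

def is_rejected(master_ms, regionserver_ms, max_skew_ms=MAX_SKEW_MS):
    """Assumed rule: reject when the absolute clock difference exceeds the limit."""
    return abs(master_ms - regionserver_ms) > max_skew_ms

msg = "Time difference of 32504ms > max allowed of 30000ms"
skew, limit = parse_skew(msg)
print(skew - limit)            # 2504 (ms over the limit)
print(is_rejected(0, 32504))   # True
print(is_rejected(0, 30000))   # False
```

In my case the regionserver was only about 2.5 seconds past the limit, which is why a simple time resync was enough to bring it back.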