Many problems with Cluster / Hazelcast plugin

Hello,

we have an Openfire 3.10.3 cluster (with HazelCast 2.1.2 plugin) with 4 nodes running on Linux behind a load balancer (HAProxy). Unfortunately, it happens quite often, that our servers crash unexpectedly about once a week :-(. Users can’t login anymore, Openfire does not respond, messages get lost, etc…

Our current solution is to restart all Openfire nodes, if we or a customer detect the errorneous behavior.

Crawling through the log files, I’ve found many HazelCast errors. Usually it starts with an OperationTimeoutException:

(here on server 172.25.3.64, which seems to get no response from 172.25.3.63)

2016.01.26 08:12:43 org.jivesoftware.openfire.handler.IQHandler - Interner Serverfehler
com.hazelcast.core.OperationTimeoutException: No response for 60000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:mapService', op=GetOperation{Routing Users Cache}, partitionId=95, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=30000, target=Address[172.25.3.63]:5701, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:277)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
    at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:435)
    at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:254)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:82)
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.get(ClusteredCache.java:92)
    at org.jivesoftware.util.cache.CacheWrapper.get(CacheWrapper.java:121)
    at org.jivesoftware.openfire.spi.RoutingTableImpl.getRoutes(RoutingTableImpl.java:805)
    at org.jivesoftware.openfire.SessionManager.getSessions(SessionManager.java:893)
    at org.jivesoftware.openfire.handler.IQPrivacyHandler.updateOrCreateList(IQPrivacyHandler.java:356)
    at org.jivesoftware.openfire.handler.IQPrivacyHandler.handleIQ(IQPrivacyHandler.java:124)
    at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:65)
    at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:380)
    at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:123)
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:76)
    at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:108)
    at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:69)
    at org.jivesoftware.openfire.http.HttpSession.sendPendingPackets(HttpSession.java:655)
    at org.jivesoftware.openfire.http.HttpSession$HttpPacketSender.run(HttpSession.java:1275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    at ------ End remote and begin local stack-trace ------.(Unknown Source)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
    at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:435)
    at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:254)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:82)
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.get(ClusteredCache.java:92)
    at org.jivesoftware.util.cache.CacheWrapper.get(CacheWrapper.java:121)
    at org.jivesoftware.openfire.spi.RoutingTableImpl.getRoutes(RoutingTableImpl.java:805)
    at org.jivesoftware.openfire.SessionManager.getSessions(SessionManager.java:893)
    at org.jivesoftware.openfire.handler.IQPrivacyHandler.updateOrCreateList(IQPrivacyHandler.java:356)
    at org.jivesoftware.openfire.handler.IQPrivacyHandler.handleIQ(IQPrivacyHandler.java:124)
    at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:65)
    at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:380)
    at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:123)
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:76)
    at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:108)
    at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:69)
    at org.jivesoftware.openfire.http.HttpSession.sendPendingPackets(HttpSession.java:655)
    at org.jivesoftware.openfire.http.HttpSession$HttpPacketSender.run(HttpSession.java:1275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

The followed by:

2016.01.26 08:13:48 com.hazelcast.concurrent.lock.operations.UnlockOperation - [172.25.3.64]:5701 [openfire] [3.5.1] Current thread is not owner of the lock! -> Owner: 10bfd686-5e66-4d0f-bb77-d1c28f4f143b, thread-id: 2251
java.lang.IllegalMonitorStateException: Current thread is not owner of the lock! -> Owner: 10bfd686-5e66-4d0f-bb77-d1c28f4f143b, thread-id: 2251
    at com.hazelcast.concurrent.lock.operations.UnlockOperation.ensureUnlocked(UnlockOperation.java:71)
    at com.hazelcast.concurrent.lock.operations.UnlockOperation.unlock(UnlockOperation.java:65)
    at com.hazelcast.concurrent.lock.operations.UnlockOperation.run(UnlockOperation.java:57)
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:137)
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:309)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.processPacket(OperationThread.java:142)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.process(OperationThread.java:115)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.doRun(OperationThread.java:101)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.run(OperationThread.java:76)

Later:

2016.01.26 08:26:17 org.jivesoftware.openfire.handler.PresenceUpdateHandler - Interner Serverfehler. Triggered by packet: <presence from="..." to="..." type="unavailable"/>
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:mapService', op=GetOperation{Roster}, partitionId=59, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=6, callTimeout=30000, target=Address[172.25.3.63]:5701, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0

Followed by:

2016.01.26 08:26:26 org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
java.lang.IllegalArgumentException: Requested node 10bfd686-5e66-4d0f-bb77-d1c28f4f143b not found in cluster
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory.doClusterTask(ClusteredCacheFactory.java:307)
    at org.jivesoftware.util.cache.CacheFactory.doClusterTask(CacheFactory.java:574)
    at org.jivesoftware.openfire.plugin.util.cluster.ClusterPacketRouter.routePacket(ClusterPacketRouter.java:46)
    at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:363)
    at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:239)
    at org.jivesoftware.openfire.roster.Roster.broadcastPresence(Roster.java:609)
    at org.jivesoftware.openfire.handler.PresenceUpdateHandler.broadcastUpdate(PresenceUpdateHandler.java:308)
    at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:162)
    at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:137)
    at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:201)
    at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:148)
    at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:84)
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84)
    at org.jivesoftware.openfire.SessionManager$ClientSessionListener.onConnectionClose(SessionManager.java:1242)
    at org.jivesoftware.openfire.nio.NIOConnection.notifyCloseListeners(NIOConnection.java:261)
    at org.jivesoftware.openfire.nio.NIOConnection.close(NIOConnection.java:243)
    at org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:182)

And finally resulting in:

2016.01.26 08:27:53 org.jivesoftware.openfire.handler.PresenceUpdateHandler - Interner Serverfehler. Triggered by packet: <presence from="..." to="..." type="unavailable"/>
com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
    at com.hazelcast.spi.AbstractDistributedObject.throwNotActiveException(AbstractDistributedObject.java:81)
    at com.hazelcast.spi.AbstractDistributedObject.lifecycleCheck(AbstractDistributedObject.java:76)
    at com.hazelcast.spi.AbstractDistributedObject.getNodeEngine(AbstractDistributedObject.java:70)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.unlock(MapProxyImpl.java:239)
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.unlock(ClusteredCache.java:189)
    at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$ClusterLock.unlock(ClusteredCacheFactory.java:465)
    at org.jivesoftware.openfire.handler.PresenceUpdateHandler.broadcastUnavailableForDirectedPresences(PresenceUpdateHandler.java:480)
    at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:163)
    at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:137)
    at org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:201)
    at org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:148)
    at org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:84)
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84)
    at org.jivesoftware.openfire.SessionManager.removeSession(SessionManager.java:1135)
    at org.jivesoftware.openfire.plugin.util.cache.ClusterListener.cleanupNode(ClusterListener.java:277)
    at org.jivesoftware.openfire.plugin.util.cache.ClusterListener.memberRemoved(ClusterListener.java:684)
    at com.hazelcast.cluster.impl.ClusterServiceImpl.dispatchEvent(ClusterServiceImpl.java:1422)
    at com.hazelcast.cluster.impl.ClusterServiceImpl.dispatchEvent(ClusterServiceImpl.java:116)
    at com.hazelcast.spi.impl.eventservice.impl.LocalEventDispatcher.run(LocalEventDispatcher.java:63)
    at com.hazelcast.util.executor.StripedExecutor$Worker.process(StripedExecutor.java:190)
    at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:174)

At the same time server 172.25.3.63 gets similar errors, when targeting 172.25.3.64:

2016.01.26 08:12:42 org.jivesoftware.openfire.spi.RoutingTableImpl - Primary packet routing failed
com.hazelcast.core.OperationTimeoutException: No response for 60000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:mapService', op=GetOperation{Routing Users Cache}, partitionId=168, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=30000, target=Address[172.25.3.64]:5701, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)

Even a weird one like (how can the domain be null suddenly?):

2016.01.26 08:26:43 org.jivesoftware.openfire.handler.IQHandler - Interner Serverfehler
java.lang.NullPointerException: Domain cannot be null
    at org.xmpp.packet.JID.<init>(JID.java:512)
    at org.xmpp.packet.JID.<init>(JID.java:496)
    at org.jivesoftware.openfire.SessionManager.userBroadcast(SessionManager.java:1066)
    at org.jivesoftware.openfire.handler.IQPrivacyHandler.updateOrCreateList(IQPrivacyHandler.java:369)
    at org.jivesoftware.openfire.handler.IQPrivacyHandler.handleIQ(IQPrivacyHandler.java:124)
    at org.jivesoftware.openfire.handler.IQHandler.process(IQHandler.java:65)
    at org.jivesoftware.openfire.IQRouter.handle(IQRouter.java:380)
    at org.jivesoftware.openfire.IQRouter.route(IQRouter.java:123)
    at org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:76)
    at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:108)
    at org.jivesoftware.openfire.SessionPacketRouter.route(SessionPacketRouter.java:69)
    at org.jivesoftware.openfire.http.HttpSession.sendPendingPackets(HttpSession.java:655)
    at org.jivesoftware.openfire.http.HttpSession$HttpPacketSender.run(HttpSession.java:1275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

After these errors, the whole cluster is unusable and we have to restart erverything.

I’ve also found these from another crash:

2015.12.18 15:48:25 com.hazelcast.spi.impl.operationservice.impl.Invocation - [172.25.3.63]:5701 [openfire] [3.5.1] While asking 'is-executing': Invocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.impl.operation.QueryPartitionOperation{serviceName='null', partitionId=238, callId=0, invocationTime=1450445461399, waitTimeout=-1, callTimeout=30000}, partitionId=238, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=30000, target=Address[172.25.3.62]:5701, backupsExpected=0, backupsCompleted=0}
com.hazelcast.core.MemberLeftException: Member [172.25.3.62]:5701 has left cluster!
    at com.hazelcast.spi.impl.operationservice.impl.InvocationRegistry$OnMemberLeftTask.run(InvocationRegistry.java:398)
    at com.hazelcast.util.executor.CachedExecutorServiceDelegate$Worker.run(CachedExecutorServiceDelegate.java:209)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
    at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
    at ------ End remote and begin local stack-trace ------.(Unknown Source)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.access$200(InvocationFuture.java:51)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture$1.run(InvocationFuture.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
    at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)
2016.01.21 15:52:12 com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation - [172.25.3.63]:5701 [openfire] [3.5.1] Op: com.hazelcast.concurrent.lock.operations.LockOperation{serviceName='null', partitionId=186, callId=3887632, invocationTime=1453387902620, waitTimeout=-2310001, callTimeout=30000}, Error: Cannot send response: CallTimeoutResponse{callId=3887632, urgent=false} to Address[172.25.3.65]:5701
com.hazelcast.core.HazelcastException: Cannot send response: CallTimeoutResponse{callId=3887632, urgent=false} to Address[172.25.3.65]:5701
    at com.hazelcast.spi.impl.ResponseHandlerFactory$RemoteInvocationResponseHandler.sendResponse(ResponseHandlerFactory.java:131)
    at com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation.run(WaitingOperation.java:139)
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:137)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.processOperation(OperationThread.java:154)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.process(OperationThread.java:110)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.doRun(OperationThread.java:101)
    at com.hazelcast.spi.impl.operationexecutor.classic.OperationThread.run(OperationThread.java:76)

as well as this one in the info.log

2016.01.21 15:52:11 com.hazelcast.core.LifecycleService - [172.25.3.65]:5701 [openfire] [3.5.1] Address[172.25.3.65]:5701 is SHUTTING_DOWN
2016.01.21 15:52:11 com.hazelcast.instance.Node - [172.25.3.65]:5701 [openfire] [3.5.1] Shutting down connection manager...

Before such a crash, the CPU also goes high as far as I can tell (couldn’t verify, that it’s hight for every crash).

The above scenario happens too often, about every week…

Please help! Do you have any ideas, what might be the cause or what we can try to prevent the crashes? Thanks!

1 Like

Any help for this issue?