Messages not reaching clients when using Hazelcast cluster configuration

I have a two-node Hazelcast cluster configured, and it seems to work fine most of the time, except when I have clients on both nodes. In that case, messages sent by a user on one node stop reaching users connected to the other node. This does not happen all the time, but when it does, the logs fill up with messages like:

java.util.concurrent.TimeoutException
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.resolveResponse(InvocationImpl.java:450)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:298)
    at com.hazelcast.util.executor.DelegatingFuture.get(DelegatingFuture.java:66)
    at com.jivesoftware.util.cache.ClusteredCacheFactory.doSynchronousClusterTask(ClusteredCacheFactory.java:333)
    at org.jivesoftware.util.cache.CacheFactory.doSynchronousClusterTask(CacheFactory.java:586)
    at org.jivesoftware.openfire.SessionManager.getConnectionsCount(SessionManager.java:894)
    at org.jivesoftware.openfire.plugin.StatCollector.run(StatCollector.java:94)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

2014.09.18 11:39:00 com.jivesoftware.util.cache.ClusteredCacheFactory - Failed to execute cluster task within 30 seconds

java.util.concurrent.TimeoutException
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.resolveResponse(InvocationImpl.java:450)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:298)
    at com.hazelcast.util.executor.DelegatingFuture.get(DelegatingFuture.java:66)
    at com.jivesoftware.util.cache.ClusteredCacheFactory.doSynchronousClusterTask(ClusteredCacheFactory.java:333)
    at org.jivesoftware.util.cache.CacheFactory.doSynchronousClusterTask(CacheFactory.java:586)
    at org.jivesoftware.openfire.SessionManager.getConnectionsCount(SessionManager.java:894)
    at org.jivesoftware.openfire.plugin.StatCollector.run(StatCollector.java:94)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

2014.09.18 11:39:30 com.jivesoftware.util.cache.ClusteredCacheFactory - Failed to execute cluster task within 30 seconds

java.util.concurrent.TimeoutException
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.resolveResponse(InvocationImpl.java:450)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:298)
    at com.hazelcast.util.executor.DelegatingFuture.get(DelegatingFuture.java:66)
    at com.jivesoftware.util.cache.ClusteredCacheFactory.doSynchronousClusterTask(ClusteredCacheFactory.java:333)
    at org.jivesoftware.util.cache.CacheFactory.doSynchronousClusterTask(CacheFactory.java:586)
    at org.jivesoftware.openfire.SessionManager.getConnectionsCount(SessionManager.java:894)
    at org.jivesoftware.openfire.plugin.StatCollector.run(StatCollector.java:94)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

2014.09.18 11:39:35 com.jivesoftware.util.cache.ClusteredCacheFactory - Failed to execute cluster task within 30 seconds

java.util.concurrent.TimeoutException
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.resolveResponse(InvocationImpl.java:450)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:298)
    at com.hazelcast.util.executor.DelegatingFuture.get(DelegatingFuture.java:66)
    at com.jivesoftware.util.cache.ClusteredCacheFactory.doSynchronousClusterTask(ClusteredCacheFactory.java:333)
    at org.jivesoftware.util.cache.CacheFactory.doSynchronousClusterTask(CacheFactory.java:586)
    at org.jivesoftware.openfire.SessionManager.getConnectionsCount(SessionManager.java:894)
    at org.jivesoftware.openfire.plugin.StatCollector.run(StatCollector.java:94)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

2014.09.18 11:39:45 com.jivesoftware.util.cache.ClusteredCacheFactory - Failed to execute cluster task within 30 seconds

java.util.concurrent.TimeoutException
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.resolveResponse(InvocationImpl.java:450)
    at com.hazelcast.spi.impl.InvocationImpl$InvocationFuture.get(InvocationImpl.java:298)
    at com.hazelcast.util.executor.DelegatingFuture.get(DelegatingFuture.java:66)
    at com.jivesoftware.util.cache.ClusteredCacheFactory.doSynchronousClusterTask(ClusteredCacheFactory.java:333)
    at org.jivesoftware.util.cache.CacheFactory.doSynchronousClusterTask(CacheFactory.java:586)
    at org.jivesoftware.openfire.SessionManager.getConnectionsCount(SessionManager.java:894)
    at org.jivesoftware.openfire.plugin.StatCollector.run(StatCollector.java:94)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

In the web interface on both servers I can also see that they fail to list remote sessions (sessions on the other cluster node). This makes me think that the cluster is timing out when requesting the remote sessions, which causes the node to ignore them and fail to send messages to them.
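
To illustrate the pattern the stack trace points at (a caller waiting on a future for a bounded time and giving up with a TimeoutException), here is a minimal sketch using only the JDK. It is not Openfire or Hazelcast code: the class name ClusterTaskTimeoutSketch, the method runWithTimeout, and the choice to return null on timeout are all hypothetical, picked only to show how a caller could end up with no remote result when the other node does not answer in time.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ClusterTaskTimeoutSketch {

    // Hypothetical stand-in for a synchronous cluster task: run the task on
    // another thread and wait at most timeoutMillis for its result.
    // Returns null if the task does not answer in time or fails.
    static Integer runWithTimeout(Callable<Integer> remoteTask, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Integer> future = executor.submit(remoteTask);
            // Bounded wait: throws TimeoutException when the deadline passes.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Assumption for illustration: the caller treats a timeout as "no data".
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            return null;
        } finally {
            // Cancel the task if it is still running and release the thread.
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A node that answers quickly.
        Integer fast = runWithTimeout(() -> 3, 500);
        // A node that takes longer than the caller is willing to wait.
        Integer slow = runWithTimeout(() -> {
            Thread.sleep(2000);
            return 5;
        }, 100);
        System.out.println("responsive node: " + fast);
        System.out.println("unresponsive node: " + slow);
    }
}
```

If the real code behaves along these lines, each timed-out request would leave the requesting node without the remote session data for that call, which would be consistent with what the admin console shows. That is still a guess on my part and would need to be checked against the Openfire source.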

Is someone else having this issue? Is there a fix or workaround? Any help or tips would be greatly appreciated.

I am using Openfire 3.9.2 with Hazelcast 3.1.7.

I am wondering whether there is any actual cluster installation based on Hazelcast in production. Given that messages fail to reach users with this configuration, I am not sure how it can be used in production.

I can only give you the anecdotal evidence that Hazelcast has been working just fine in my environments. We’re currently running v1.2.2 of the plugin (based on Hazelcast 3.17) on Openfire 3.9.3, but have been using this and earlier versions for over a year now (starting on Openfire 3.8.2). I’ve never had a problem with messages reaching clients due to Hazelcast, as far as I know. We have had some hiccups when recycling one of the cluster nodes, where clients drop their connections but then almost immediately re-connect.

It would be helpful if we could get some more expert advice on how to troubleshoot Hazelcast on Openfire, as it’s not very well documented.