Hazelcast cluster state bug

Vivek_S · March 19, 2015, 10:02am

The state of the Openfire Hazelcast cluster has an issue.

The org.jivesoftware.openfire.cluster.ClusterManager class has 2 methods which report the state of the cluster: isClusteringStarting() and isClusteringStarted(). These methods get their values from the org.jivesoftware.util.cache.CacheFactory class, which has 2 boolean variables, “clusteringStarting” and “clusteringStarted”.

In CacheFactory.startClustering(), clusteredCacheFactoryStrategy.startCluster() is called and its return value is assigned to “clusteringStarting”. In the startCluster() method of the com.jivesoftware.util.cache.ClusteredCacheFactory class, Hazelcast is initialized and then the ClusterListener constructor is called. In that flow, CacheFactory.joinedCluster() is called, which sets “clusteringStarting” to “false” and “clusteringStarted” to “true”. After this, however, startCluster() evaluates “cluster != null” and returns true, and that return value is assigned to “clusteringStarting” in CacheFactory.startClustering(). So once the cluster has started successfully, both “clusteringStarting” and “clusteringStarted” are true. This is a problem, and it causes a memory leak in group chat scenarios.
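To make the ordering problem easier to see, here is a minimal sketch of the flow described above. It is a simplified model, not the Openfire source: the class names CacheFactoryModel and FlagOverwriteDemo are made up for illustration, and the Hazelcast initialization and the ClusterListener constructor are reduced to a placeholder object and a direct call.

```java
// Simplified model of the flow described in the post; NOT the actual Openfire code.
class CacheFactoryModel {
    static boolean clusteringStarting = false;
    static boolean clusteringStarted = false;

    // Stands in for ClusteredCacheFactory.startCluster().
    static boolean startCluster() {
        Object cluster = new Object(); // placeholder for the initialized Hazelcast cluster
        joinedCluster();               // in the real flow this is reached via the ClusterListener constructor
        return cluster != null;        // true on success
    }

    // Stands in for CacheFactory.joinedCluster().
    static void joinedCluster() {
        clusteringStarting = false;
        clusteringStarted = true;
    }

    // Stands in for CacheFactory.startClustering().
    static void startClustering() {
        // joinedCluster() has already set the flag to false by the time
        // startCluster() returns, so this assignment overwrites it with true.
        clusteringStarting = startCluster();
    }
}

class FlagOverwriteDemo {
    public static void main(String[] args) {
        CacheFactoryModel.startClustering();
        System.out.println("clusteringStarting=" + CacheFactoryModel.clusteringStarting
                + ", clusteringStarted=" + CacheFactoryModel.clusteringStarted);
    }
}
```

In this model both flags end up true after a successful start, which matches the state reported above.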

The abstract class org.jivesoftware.openfire.muc.cluster.MUCRoomTask executes tasks related to rooms in a cluster. Its “execute” method calls ClusterManager.isClusteringStarting(). If a room is not found, an IllegalArgumentException is caught, and in the exception handling block, if the “clusteringStarting” value is true, the task is added to a queue in QueuedTasksManager. The QueuedTasksManager removes tasks only when “clusteringStarting” is false. Because “clusteringStarting” stays true once the cluster has started, a task is added to the queue every time a room is not found (for whatever reason) and is never removed, so the queue grows without bound and leaks memory.
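The queueing behaviour can be modelled in a few lines. This is only a sketch under assumptions: QueuedTasksModel, its methods, and the roomExists parameter are hypothetical names, and the real MUCRoomTask and QueuedTasksManager are more involved. It shows only why a flag stuck at true means the queue is never drained.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Simplified model of the task queueing described in the post; NOT the actual Openfire code.
class QueuedTasksModel {
    private final Queue<Runnable> queue = new ArrayDeque<>();

    // Models ClusterManager.isClusteringStarting(); stuck at true because of the bug.
    boolean clusteringStarting = true;

    // Models the "execute" method: run the task, or queue it if the room is missing
    // while the cluster is (apparently) still starting.
    void execute(Runnable task, boolean roomExists) {
        try {
            if (!roomExists) {
                throw new IllegalArgumentException("room not found");
            }
            task.run();
        } catch (IllegalArgumentException e) {
            if (clusteringStarting) {
                queue.add(task);
            }
        }
    }

    // Models the queue processing: tasks are only removed once starting is false.
    void processQueue() {
        if (clusteringStarting) {
            return;
        }
        Runnable task;
        while ((task = queue.poll()) != null) {
            task.run();
        }
    }

    int pending() {
        return queue.size();
    }
}
```

With clusteringStarting left at true, every execute call for a missing room adds one entry and processQueue never removes any. Setting the flag to false lets the same queue drain.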

Could you please log a bug for this?

akrherz · March 19, 2015, 1:14pm

OF-891 has been filed with your report, thanks!

xy1 · March 20, 2015, 2:23am

Hello @Vivek S, you say: "But after this, the startCluster() method evaluates “cluster != null” and returns true which is assigned to the variable “clusteringStarting” in the CacheFactory.startClustering() method". Why is this method called twice?

Vivek_S · March 30, 2015, 9:10am

Hello @xy, the method is not called twice. In the flow, “clusteringStarting” is first set to false in CacheFactory.joinedCluster(), which is correct. Then, when startCluster() finishes, “cluster != null” is evaluated and is true, and this return value is wrongly assigned back to “clusteringStarting”.
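For illustration only, one possible way to avoid the overwrite is to leave the flags alone when joinedCluster() has already run. This is a sketch of that idea, not the change made for OF-891; GuardedCacheFactoryModel is a hypothetical name and the model ignores threading and failure handling.

```java
// Hypothetical guard against the overwrite; NOT the actual Openfire fix.
class GuardedCacheFactoryModel {
    static boolean clusteringStarting = false;
    static boolean clusteringStarted = false;

    // Stands in for ClusteredCacheFactory.startCluster().
    static boolean startCluster() {
        Object cluster = new Object(); // placeholder for the initialized Hazelcast cluster
        joinedCluster();               // reached during startup, before startCluster() returns
        return cluster != null;
    }

    // Stands in for CacheFactory.joinedCluster().
    static void joinedCluster() {
        clusteringStarting = false;
        clusteringStarted = true;
    }

    // Stands in for CacheFactory.startClustering(), with a guard added.
    static void startClustering() {
        boolean result = startCluster();
        // Only record the return value if the join has not completed yet;
        // otherwise keep the values that joinedCluster() already set.
        if (!clusteringStarted) {
            clusteringStarting = result;
        }
    }
}
```

In this model, after startClustering() returns, “clusteringStarting” is false and “clusteringStarted” is true.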