Openfire Hazelcast cluster threads TIMED_WAITING

Hi,

Openfire version: 3.10.2

Hazelcast version: 3.5.1

3 node cluster.

The number of Openfire threads suddenly spikes to a very large value (around 30,000) and the process crashes. A jstack capture taken while the thread count is very high shows the following stack, and roughly 10,000 threads are in this TIMED_WAITING state.

"TaskEngine-pool-64721" #69065 daemon prio=5 os_prio=0 tid=0x00007f12552ad000 nid=0x55d7 in Object.wait() [0x00007f131146c000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.pollResponse(InvocationFuture.java:299)
        - locked <0x00000006edad3278> (a com.hazelcast.spi.impl.operationservice.impl.InvocationFuture)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:247)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.getSafely(InvocationFuture.java:216)
        at com.hazelcast.concurrent.lock.LockProxySupport.lock(LockProxySupport.java:79)
        at com.hazelcast.concurrent.lock.LockProxySupport.lock(LockProxySupport.java:73)
        at com.hazelcast.map.impl.proxy.MapProxyImpl.lock(MapProxyImpl.java:223)
        at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.lock(ClusteredCache.java:173)
        at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$ClusterLock.lock(ClusteredCacheFactory.java:449)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.getRoutes(RoutingTableImpl.java:800)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToBareJID(RoutingTableImpl.java:531)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:303)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:239)
        at org.jivesoftware.openfire.net.SocketPacketWriteHandler.process(SocketPacketWriteHandler.java:68)
        at org.jivesoftware.openfire.spi.PacketDelivererImpl.deliver(PacketDelivererImpl.java:56)
        at org.jivesoftware.openfire.http.HttpSession$4.run(HttpSession.java:1083)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
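From the trace, these threads bottom out in Hazelcast's LockProxySupport.lock(), which is the path taken by IMap.lock() with no timeout, so each thread waits indefinitely for a distributed map lock. The following is only a minimal standalone sketch (not Openfire's actual code; the map and key names are made up) contrasting that unbounded lock() with a bounded tryLock(), which would at least let a thread give up instead of piling up:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

import java.util.concurrent.TimeUnit;

public class DistributedLockSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical map and key names, purely for illustration.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> routes = hz.getMap("example-routing-cache");
        String key = "user@example.com";

        // This is what the stack trace shows: lock() blocks until the lock
        // is granted, with no upper bound on the wait.
        routes.lock(key);
        try {
            // ... read/update the entry ...
        } finally {
            routes.unlock(key);
        }

        // A bounded alternative: wait at most 5 seconds, then back off,
        // so threads do not accumulate forever behind a stuck lock owner.
        if (routes.tryLock(key, 5, TimeUnit.SECONDS)) {
            try {
                // ... read/update the entry ...
            } finally {
                routes.unlock(key);
            }
        } else {
            System.err.println("Could not acquire lock for " + key + " within 5s");
        }

        hz.shutdown();
    }
}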

Along with the stack trace above, the other trace I get is the following; there are also about 15,000 of these threads in the same jstack output.

The corresponding snippet from Openfire 3.10.2 is the block that adds the stanza to the "pendingElements" data structure:

if (!delivered) {
    if (pendingConnections > 0) {
        Log.warn("Unable to deliver a stanza (it is being queued instead), although there are available connections! RID / Connection processing is out of sync!");
    }
    pendingElements.add(stanza);
}

"TaskEngine-pool-64701" #69020 daemon prio=5 os_prio=0 tid=0x00007f1255087800 nid=0x5516 waiting for monitor entry [0x00007f1311b53000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
        - waiting to lock <0x00000004e58dfd00> (a java.util.Collections$SynchronizedRandomAccessList)
        at org.jivesoftware.openfire.http.HttpSession.deliver(HttpSession.java:1004)
        at org.jivesoftware.openfire.http.HttpSession.deliver(HttpSession.java:970)
        at org.jivesoftware.openfire.session.LocalSession.process(LocalSession.java:289)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToBareJID(RoutingTableImpl.java:633)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:303)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:239)
        at org.jivesoftware.openfire.net.SocketPacketWriteHandler.process(SocketPacketWriteHandler.java:68)
        at org.jivesoftware.openfire.spi.PacketDelivererImpl.deliver(PacketDelivererImpl.java:56)
        at org.jivesoftware.openfire.http.HttpSession$4.run(HttpSession.java:1083)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Any clues on how to proceed? Is Hazelcast stuck, and are the other threads BLOCKED because of that?

Do I have to move to a later version of Hazelcast? I found the following link:

Hazelcast stuck in TIMED_WAITING when using 2nd-level cache · Issue #4406 · hazelcast/hazelcast · GitHub

I repeated the experiment with a single node, without any cluster configuration, and still ran into the trace below.

Openfire version: 3.10.2

Single Node

Connection types: a mix of BOSH (Strophe) and TCP (Android & iOS)

"TaskEngine-pool-64701" #69020 daemon prio=5 os_prio=0 tid=0x00007f1255087800 nid=0x5516 waiting for monitor entry [0x00007f1311b53000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
        - waiting to lock <0x00000004e58dfd00> (a java.util.Collections$SynchronizedRandomAccessList)
        at org.jivesoftware.openfire.http.HttpSession.deliver(HttpSession.java:1004)
        at org.jivesoftware.openfire.http.HttpSession.deliver(HttpSession.java:970)
        at org.jivesoftware.openfire.session.LocalSession.process(LocalSession.java:289)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToBareJID(RoutingTableImpl.java:633)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:303)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:239)
        at org.jivesoftware.openfire.net.SocketPacketWriteHandler.process(SocketPacketWriteHandler.java:68)
        at org.jivesoftware.openfire.spi.PacketDelivererImpl.deliver(PacketDelivererImpl.java:56)
        at org.jivesoftware.openfire.http.HttpSession$4.run(HttpSession.java:1083)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

A little digging through the source code told me that this happens in the failDelivery() method in HttpSession.java. All packets for a closed HTTP session keep getting queued for delivery in a data structure, a thread is spawned for each such connection, and none of them ever gets access to the monitor. After a while no more threads can be spawned and the system crashes. The fundamental problem is to find out why some thread takes the lock and then disappears without releasing it, causing every other thread to become BLOCKED.
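To make the pattern concrete, here is a minimal standalone sketch of what I believe is happening (the names are made up for illustration and this is not Openfire's actual code): one thread holds the synchronized list's monitor for a long time while draining it, and every delivery thread that calls add() on the same list blocks on that monitor, so worker threads pile up.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PendingElementsContentionSketch {
    // Stand-in for HttpSession.pendingElements: a synchronized list whose
    // single monitor is shared by add(), iteration, removal, etc.
    private static final List<String> pendingElements =
            Collections.synchronizedList(new ArrayList<>());

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();

        // Stand-in for the close / failDelivery path: hold the list's monitor
        // while slowly draining it. While this runs, no other thread can
        // enter pendingElements.add(...).
        pool.submit(() -> {
            synchronized (pendingElements) {
                for (int i = 0; i < 100; i++) {
                    try {
                        Thread.sleep(100); // simulate slow per-stanza work under the lock
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        });

        // Stand-in for the delivery threads: each one blocks in add(), waiting
        // for the same monitor, just like the BLOCKED TaskEngine-pool threads.
        for (int i = 0; i < 50; i++) {
            final int n = i;
            pool.submit(() -> pendingElements.add("stanza-" + n));
        }

        pool.shutdown();
    }
}

Taking a jstack of this sketch while it runs should show the same "waiting to lock ... (a java.util.Collections$SynchronizedRandomAccessList)" frames in Collections$SynchronizedCollection.add as in the traces above.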

Note that this happens only during an HTTP connection close. Any pointers as to why this would be happening?

What OS are you using? We have a problem that I think is the same as yours. We are using just BOSH, and the number of open files keeps increasing. Maybe it comes from the problem you describe.