Openfire Hazelcast cluster threads TIMED_WAITING

Hi,

Openfire version: 3.10.2

Hazelcast version: 3.5.1

3 node cluster.

The number of Openfire threads suddenly spikes to a very large value (around 30,000) and the process crashes. A jstack capture taken while the thread count is very high shows the following stack, and roughly 10,000 threads are in this TIMED_WAITING state.

"TaskEngine-pool-64721" #69065 daemon prio=5 os_prio=0 tid=0x00007f12552ad000 nid=0x55d7 in Object.wait() [0x00007f131146c000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.pollResponse(InvocationFuture.java:299)
        - locked <0x00000006edad3278> (a com.hazelcast.spi.impl.operationservice.impl.InvocationFuture)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:247)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.getSafely(InvocationFuture.java:216)
        at com.hazelcast.concurrent.lock.LockProxySupport.lock(LockProxySupport.java:79)
        at com.hazelcast.concurrent.lock.LockProxySupport.lock(LockProxySupport.java:73)
        at com.hazelcast.map.impl.proxy.MapProxyImpl.lock(MapProxyImpl.java:223)
        at org.jivesoftware.openfire.plugin.util.cache.ClusteredCache.lock(ClusteredCache.java:173)
        at org.jivesoftware.openfire.plugin.util.cache.ClusteredCacheFactory$ClusterLock.lock(ClusteredCacheFactory.java:449)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.getRoutes(RoutingTableImpl.java:800)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToBareJID(RoutingTableImpl.java:531)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:303)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:239)
        at org.jivesoftware.openfire.net.SocketPacketWriteHandler.process(SocketPacketWriteHandler.java:68)
        at org.jivesoftware.openfire.spi.PacketDelivererImpl.deliver(PacketDelivererImpl.java:56)
        at org.jivesoftware.openfire.http.HttpSession$4.run(HttpSession.java:1083)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
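From the trace, these threads bottom out in Hazelcast's LockProxySupport.lock(), which is the path taken by IMap.lock() with no timeout, so each thread waits indefinitely for a distributed map lock. The following is only a minimal standalone sketch (not Openfire's actual code; the map and key names are made up) contrasting that unbounded lock() with a bounded tryLock(), which would at least let a thread give up instead of piling up:

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

import java.util.concurrent.TimeUnit;

public class DistributedLockSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical map and key names, purely for illustration.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, String> routes = hz.getMap("example-routing-cache");
        String key = "user@example.com";

        // This is what the stack trace shows: lock() blocks until the lock
        // is granted, with no upper bound on the wait.
        routes.lock(key);
        try {
            // ... read/update the entry ...
        } finally {
            routes.unlock(key);
        }

        // A bounded alternative: wait at most 5 seconds, then back off,
        // so threads do not accumulate forever behind a stuck lock owner.
        if (routes.tryLock(key, 5, TimeUnit.SECONDS)) {
            try {
                // ... read/update the entry ...
            } finally {
                routes.unlock(key);
            }
        } else {
            System.err.println("Could not acquire lock for " + key + " within 5s");
        }

        hz.shutdown();
    }
}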

Along with the stack trace above, the other trace I get is the following; there are also about 15,000 of these threads in the same jstack output.

The corresponding snippet from Openfire 3.10.2 is the block that adds the stanza to the "pendingElements" data structure:

if (!delivered) {
    if (pendingConnections > 0) {
        Log.warn("Unable to deliver a stanza (it is being queued instead), although there are available connections! RID / Connection processing is out of sync!");
    }
    pendingElements.add(stanza);
}

"TaskEngine-pool-64701" #69020 daemon prio=5 os_prio=0 tid=0x00007f1255087800 nid=0x5516 waiting for monitor entry [0x00007f1311b53000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
        - waiting to lock <0x00000004e58dfd00> (a java.util.Collections$SynchronizedRandomAccessList)
        at org.jivesoftware.openfire.http.HttpSession.deliver(HttpSession.java:1004)
        at org.jivesoftware.openfire.http.HttpSession.deliver(HttpSession.java:970)
        at org.jivesoftware.openfire.session.LocalSession.process(LocalSession.java:289)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToBareJID(RoutingTableImpl.java:633)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:303)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:239)
        at org.jivesoftware.openfire.net.SocketPacketWriteHandler.process(SocketPacketWriteHandler.java:68)
        at org.jivesoftware.openfire.spi.PacketDelivererImpl.deliver(PacketDelivererImpl.java:56)
        at org.jivesoftware.openfire.http.HttpSession$4.run(HttpSession.java:1083)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Any clues on how to proceed? Is Hazelcast stuck, and are the other threads BLOCKED because of that?

Do I have to move to a later version of Hazelcast? I found the following link:

Hazelcast stuck in TIMED_WAITING when using 2nd-level cache · Issue #4406 · hazelcast/hazelcast · GitHub

I repeated the experiment with a single node, without any cluster configuration, and still ran into the trace below.

Openfire version: 3.10.2

Single Node

Connection types: a mix of BOSH (Strophe) and TCP (Android & iOS)

"TaskEngine-pool-64701" #69020 daemon prio=5 os_prio=0 tid=0x00007f1255087800 nid=0x5516 waiting for monitor entry [0x00007f1311b53000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.Collections$SynchronizedCollection.add(Collections.java:2035)
        - waiting to lock <0x00000004e58dfd00> (a java.util.Collections$SynchronizedRandomAccessList)
        at org.jivesoftware.openfire.http.HttpSession.deliver(HttpSession.java:1004)
        at org.jivesoftware.openfire.http.HttpSession.deliver(HttpSession.java:970)
        at org.jivesoftware.openfire.session.LocalSession.process(LocalSession.java:289)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToBareJID(RoutingTableImpl.java:633)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routeToLocalDomain(RoutingTableImpl.java:303)
        at org.jivesoftware.openfire.spi.RoutingTableImpl.routePacket(RoutingTableImpl.java:239)
        at org.jivesoftware.openfire.net.SocketPacketWriteHandler.process(SocketPacketWriteHandler.java:68)
        at org.jivesoftware.openfire.spi.PacketDelivererImpl.deliver(PacketDelivererImpl.java:56)
        at org.jivesoftware.openfire.http.HttpSession$4.run(HttpSession.java:1083)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

A little digging through the source code told me that this happens in the failDelivery() method in HttpSession.java. All packets for a closed HTTP session keep getting queued for delivery in a data structure, a thread is spawned for each such connection, and none of them ever gets access to the monitor. After a while no more threads can be spawned and the system crashes. The fundamental problem is to find out why some thread takes the lock and then disappears without releasing it, causing every other thread to become BLOCKED.
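To make the pattern concrete, here is a minimal standalone sketch of what I believe is happening (the names are made up for illustration and this is not Openfire's actual code): one thread holds the synchronized list's monitor for a long time while draining it, and every delivery thread that calls add() on the same list blocks on that monitor, so worker threads pile up.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PendingElementsContentionSketch {
    // Stand-in for HttpSession.pendingElements: a synchronized list whose
    // single monitor is shared by add(), iteration, removal, etc.
    private static final List<String> pendingElements =
            Collections.synchronizedList(new ArrayList<>());

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();

        // Stand-in for the close / failDelivery path: hold the list's monitor
        // while slowly draining it. While this runs, no other thread can
        // enter pendingElements.add(...).
        pool.submit(() -> {
            synchronized (pendingElements) {
                for (int i = 0; i < 100; i++) {
                    try {
                        Thread.sleep(100); // simulate slow per-stanza work under the lock
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            }
        });

        // Stand-in for the delivery threads: each one blocks in add(), waiting
        // for the same monitor, just like the BLOCKED TaskEngine-pool threads.
        for (int i = 0; i < 50; i++) {
            final int n = i;
            pool.submit(() -> pendingElements.add("stanza-" + n));
        }

        pool.shutdown();
    }
}

Taking a jstack of this sketch while it runs should show the same "waiting to lock ... (a java.util.Collections$SynchronizedRandomAccessList)" frames in Collections$SynchronizedCollection.add as in the traces above.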

Note that this happens only during an HTTP connection close. Any pointers as to why this would be happening?

What OS are you using? We have a problem that I think is the same as yours. We are using just BOSH, and the number of open files keeps increasing. Maybe it comes from the problem you describe.