Using Smack with Apache servlets for many concurrent chats

Hi All,

I’m using Smack 3.3.1 to create connections from a servlet in Apache to an ejabberd server on the same box. Essentially, someone makes a web call to my servlet which, in the background, creates a user and starts a room-based chat for them on the ejabberd server. They can then make various web calls to send and receive messages to the other ‘business’ users, eventually terminating the conversation, which closes the chat room and deletes the associated users. I don’t want to give users direct access to the XMPP server via Smack, as there is some business logic around creating accounts and sending/receiving messages that I need to enforce.

Unfortunately, because all of the users and all of the XMPP connections exist on the same machine, my JVM memory and thread count grow very large very quickly. We have a high throughput of short-lived users (around 2,000) with their associated chat rooms, so some 6,000 Smack threads are spawned, and the upshot is that the Apache process falls over after a few hours of use.

Are there any config changes I can make to Smack to improve memory or thread usage for my scenario, or is my only option to tweak the JVM hosting the process and/or throw more RAM into the machine hosting my setup?

Any help/suggestions here would be very helpful!

Unfortunately, you cannot configure the thread usage of Smack at this time. It currently creates 3 threads per connection, which obviously isn't ideal in scenarios like your own.

Thanks rcollier, unfortunately this is what I suspected. I’ve attached a screenshot of the heap dump from jvisualvm, taken just before the Tomcat server falls over. It seems that there are a substantial number of reentrant locks and concurrent hash map segments/entries. Is there anything unusual about this, or is there anything I can do to minimise it?

http://i1326.photobucket.com/albums/u653/Padraig_Meaney/heapdump_zps0f42ba8b.png

As well as that, exception messages are appearing in the log, even though packets do not appear to be getting lost:

stream:error (text)
   at org.jivesoftware.smack.PacketReader.parsePackets(PacketReader.java:251)
   at org.jivesoftware.smack.PacketReader.access$000(PacketReader.java:46)
   at org.jivesoftware.smack.PacketReader$1.run(PacketReader.java:72)

And sometimes, though much less frequently, the following exception:

java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(Unknown Source)
    at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(Unknown Source)
    at java.util.concurrent.ArrayBlockingQueue.put(Unknown Source)
    at org.jivesoftware.smack.PacketWriter.sendPacket(PacketWriter.java:88)
    at org.jivesoftware.smack.XMPPConnection.sendPacket(XMPPConnection.java:496)
    at org.jivesoftware.smack.keepalive.KeepAliveManager$5.run(KeepAliveManager.java:280)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown Source)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

I’m having a related problem in my scenario: a web-based application where one XMPPConnection is used to create many (thousands of) short-lived chats.

I run into heavy lock contention in PacketWriter.sendPacket because it uses an ArrayBlockingQueue. I strongly suspect that is not the best choice here, since any access (read or write) locks the entire array. So all my writers starve each other and even the reader thread.

It would probably be better to use a ConcurrentLinkedDeque, although that does not have a size limit. However, a size limit could be implemented in XMPPConnection using an AtomicInteger. I haven’t done any profiling yet to see the impact of this, but I think it would remove the lock contention in these kinds of scenarios. For a more traditional “one connection, a few chats” setup, I think performance would be the same as today.
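Something like this rough, untested sketch is what I have in mind (the wrapper class and its names are hypothetical, not Smack code):

import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a lock-free deque with the size limit tracked
// by an AtomicInteger instead of a single queue-wide lock.
public class LockFreeSendQueue<T> {
    private final ConcurrentLinkedDeque<T> deque = new ConcurrentLinkedDeque<>();
    private final AtomicInteger size = new AtomicInteger();
    private final int maxSize;

    public LockFreeSendQueue(int maxSize) {
        this.maxSize = maxSize;
    }

    public boolean offer(T item) {
        // Note: the check and the add are not atomic together, so the
        // limit is only approximate under heavy contention.
        if (size.get() >= maxSize) {
            return false; // “full”: the caller has to decide what to do
        }
        deque.addLast(item);
        size.incrementAndGet();
        return true;
    }

    public T poll() {
        T item = deque.pollFirst();
        if (item != null) {
            size.decrementAndGet();
        }
        return item;
    }
}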

Using an unbounded data structure will introduce the requirement for an AtomicInteger if we want to keep the limit on the PacketWriter’s queue. But this AtomicInteger would have to be synchronized. Even worse, comparing such an AtomicInteger against the maximum queue size and conditionally blocking on the result would need to be synchronized against a lock. So you again have synchronization.

So all my writers starve each other and even the reader thread.

I guess you are talking about the “chat” writers and not Smack’s PacketWriter. It is true that if you have many producer threads you will see a heavy performance loss, because the lock jumps between the different threads. But I don’t think there is a solution for this if we want to keep the maximum-queue-size blocking semantics. And we want to keep those semantics.

I suggest that you rethink your design and use multiple XMPPConnections, which would mean more distinct lock objects and less lock contention.

I am thinking about replacing the ArrayBlockingQueue with a LinkedBlockingQueue. This may come with a slight performance improvement.

Flow wrote:

Using an unbounded data structure will introduce the requirement for an AtomicInteger if we want to keep the limit on the PacketWriter’s queue. But this AtomicInteger would have to be synchronized. Even worse, comparing such an AtomicInteger against the maximum queue size and conditionally blocking on the result would need to be synchronized against a lock. So you again have synchronization.

AtomicIntegers do not need external synchronization; in fact, they do not even use object mutexes. That’s why they are called Atomic and live in the java.util.concurrent.atomic package.

Assume you have:

final AtomicInteger currentMessagesInQueue = new AtomicInteger();

If one thread runs this:

if (currentMessagesInQueue.get() > 500) {
    // Do whatever when you are not allowed to add more things
} else {
    currentMessagesInQueue.getAndIncrement();
}

And another thread does this:

currentMessagesInQueue.getAndDecrement();

it is perfectly thread safe; no need for the synchronized keyword anywhere. The only problem, of course, is that the counter is separate from the queue, so every code path that adds to or removes from the queue must also increment or decrement it.

So all my writers starve each other and even the reader thread.

I guess you are talking about the “chat” writers and not Smack’s PacketWriter. It is true that if you have many producer threads you will see a heavy performance loss, because the lock jumps between the different threads. But I don’t think there is a solution for this if we want to keep the maximum-queue-size blocking semantics. And we want to keep those semantics.

Yes, this is exactly what I’m seeing, good description.

I suggest that you rethink your design and use multiple XMPPConnections, which would mean more distinct lock objects and less lock contention.

Which is what I’m planning on doing, good to hear my plan wasn’t insane but that we have the same analysis

Flow wrote:

I am thinking about replacing the ArrayBlockingQueue with a LinkedBlockingQueue. This may come with a slight performance improvement.

That is a good idea; it would probably have the same (or better) characteristics as a separate counter plus a deque. The really good thing with a queue/deque is that the reader thread takes from the head while writers add to the tail. If there is more than one message in the queue, the writers will not block the reader thread, which is what happens with the ArrayBlockingQueue with fairness.

The advantage of a separate counter is that it is very cheap to read for statistics reporting. You definitely do not want to ask a deque for size(), since that can, depending on the implementation, cost some performance.
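For concreteness, the trade-off between the two queue types looks roughly like this (illustrative snippet, not Smack code; the class name is made up):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueChoice {
    public static void main(String[] args) throws InterruptedException {
        // One lock shared by put() and take(): producers and the
        // consumer all contend on the same monitor.
        BlockingQueue<String> arrayQueue = new ArrayBlockingQueue<>(500);

        // Separate put and take locks: producers contend with each
        // other but not with the reader. size() stays cheap because it
        // is backed by an AtomicInteger, at the cost of allocating a
        // Node object per insert.
        BlockingQueue<String> linkedQueue = new LinkedBlockingQueue<>(500);

        arrayQueue.put("packet");
        linkedQueue.put("packet");
        System.out.println(arrayQueue.size() + " / " + linkedQueue.size());
    }
}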

AtomicIntegers do not need external synchronization; in fact, they do not even use object mutexes. That’s why they are called Atomic and live in the java.util.concurrent.atomic package.
I am well aware of that

if (currentMessagesInQueue.get() > 500) {
    // Do whatever when you are not allowed to add more things
} else {
    currentMessagesInQueue.getAndIncrement();
}

You just built yourself the textbook example of a deadlock caused by the lost-update problem.

it is perfectly thread safe; no need for the synchronized keyword anywhere.
No, it’s not.

Flow wrote:

AtomicIntegers do not need external synchronization; in fact, they do not even use object mutexes. That’s why they are called Atomic and live in the java.util.concurrent.atomic package.

I am well aware of that

if (currentMessagesInQueue.get() > 500) {
    // Do whatever when you are not allowed to add more things
} else {
    currentMessagesInQueue.getAndIncrement();
}

You just built yourself the textbook example of a deadlock caused by the lost-update problem.

it is perfectly thread safe; no need for the synchronized keyword anywhere.

No, it’s not.

Oh yes, I see what you mean; there is a window between the check and the update that can allow more threads than the limit to sneak in. I don’t really see how it can deadlock or lose updates, though, even if you were to use compareAndSet in a loop. Interesting problem, but we’re getting a bit off topic. The best solution seems to be, as you said, to use more connections. Thanks.

Oh yes, I see what you mean; there is a window between the check and the update that can allow more threads than the limit to sneak in. I don’t really see how it can deadlock or lose updates, though, even if you were to use compareAndSet in a loop.

Consider thread A, which calls sendPacket(), and the PacketWriter’s thread, which we’ll call B. A evaluates ‘currentMessagesInQueue.get() > 500’ to true; this implies that B is currently writing packets, as the queue size is greater than 0. The code block within the ‘if (currentMessagesInQueue.get() > 500)’ will call sleep() to block the sendPacket() call, because that’s the only thing we can do.

Now suppose that right after the evaluation, A gets unscheduled by the OS scheduler, and B finishes writing the packets and calls notify() to unblock the threads sleeping in sendPacket(). But A is not amongst them, because it could not call sleep() yet and was simply unscheduled. The notify() update is lost.

Flow wrote:

Oh yes, I see what you mean; there is a window between the check and the update that can allow more threads than the limit to sneak in. I don’t really see how it can deadlock or lose updates, though, even if you were to use compareAndSet in a loop.

Consider thread A, which calls sendPacket(), and the PacketWriter’s thread, which we’ll call B. A evaluates ‘currentMessagesInQueue.get() > 500’ to true; this implies that B is currently writing packets, as the queue size is greater than 0. The code block within the ‘if (currentMessagesInQueue.get() > 500)’ will call sleep() to block the sendPacket() call, because that’s the only thing we can do.

Now suppose that right after the evaluation, A gets unscheduled by the OS scheduler, and B finishes writing the packets and calls notify() to unblock the threads sleeping in sendPacket(). But A is not amongst them, because it could not call sleep() yet and was simply unscheduled. The notify() update is lost.

What can I say other than: yes, I didn’t think about putting the caller to sleep. I hope the OP doesn’t mind the thread hijacking, but I’m learning a lot. When I said “lost updates” I meant that the incrementAndGet wouldn’t overwrite another thread’s value.

Maybe a spin lock would sort of get around that problem (I usually never call sleep without a timeout), but now I feel like I’m just slapping patches on a sinking ship.
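For the record, the classic way to keep the blocking semantics without the lost-notify race is to do the capacity check and the wait under the same lock, which is essentially what ArrayBlockingQueue already does internally with a ReentrantLock and condition objects. A minimal sketch (hypothetical class, not Smack code):

import java.util.ArrayDeque;
import java.util.Queue;

// Minimal sketch: the capacity check and the wait happen under the
// same monitor, so a notification issued between "check" and "sleep"
// cannot be lost -- wait() releases the monitor atomically and the
// condition is re-checked on wake-up.
public final class BoundedSendQueue<T> {
    private final Queue<T> queue = new ArrayDeque<>();
    private final int capacity;

    public BoundedSendQueue(int capacity) {
        this.capacity = capacity;
    }

    public synchronized void put(T item) throws InterruptedException {
        while (queue.size() >= capacity) {
            wait(); // releases the monitor; condition re-checked on wake-up
        }
        queue.add(item);
        notifyAll(); // wake a reader blocked on an empty queue
    }

    public synchronized T take() throws InterruptedException {
        while (queue.isEmpty()) {
            wait();
        }
        T item = queue.remove();
        notifyAll(); // wake writers blocked on a full queue
        return item;
    }
}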

Just FYI: I decided against switching from ArrayBlockingQueue to LinkedBlockingQueue because of the additional overhead of the Node class creation, which introduces the following drawbacks:

  • Slower insert operations, because a Node must be created first, vs. simply setting a reference in the ArrayBlockingQueue

  • Additional object creation, which also involves garbage collection afterwards


A follow-up on this. Let me lay out the use case:

We have a web application that sends XMPP messages to users using Smack 3.4.1. The XMPP messages are not regular chat messages but a machine-to-machine protocol used to talk to JavaScript clients (using strophe.js).

On the server side we have the usual request threads, i.e. a client makes an Ajax request that results in a ‘response’ message sent over XMPP.

We also have long-running threads on the server side that send XMPP messages based on events in other systems. So there are many machine-based sources of events that result in XMPP messages being sent.

What happened under load is that we saturated the Smack queue, causing lock contention in the ArrayBlockingQueue, which blocked many request threads. The blocked request threads really hurt the runtime behaviour of the server; it basically ran into the ground and did not recover gracefully.

In our particular use case, blocked threads are quite possibly the worst thing that can happen; we would much rather have got a straight Exception or a refusal to send if the outgoing queue was full. But then again, this is an unusual use case for a chat client 🙂

This is what we did: we pool several XMPPConnection objects and use a round-robin scheme when the server wants to send a message. Basically, we let our server sign in several “sender” users in Openfire. This works very well.
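In rough outline it looks like this (a sketch with made-up names; only XMPPConnection.sendPacket() is the actual Smack call):

import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

import org.jivesoftware.smack.XMPPConnection;
import org.jivesoftware.smack.packet.Packet;

// Sketch of the round-robin scheme described above.
public class SenderPool {
    private final List<XMPPConnection> connections; // pre-authenticated "sender" users
    private final AtomicLong counter = new AtomicLong();

    public SenderPool(List<XMPPConnection> connections) {
        this.connections = connections;
    }

    public void send(Packet packet) {
        // Each send goes to the next connection in turn, spreading the
        // load over several PacketWriter queues and sender threads.
        int index = (int) (counter.getAndIncrement() % connections.size());
        connections.get(index).sendPacket(packet);
    }
}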

We are collecting metrics on all sorts of things, for example the latency of “giving a message to Smack”, so we can monitor and add connections if we need to. This is not automated, since load testing showed that in our case 9 “sender” users per server was more than enough for our typical load.

We also implemented a circuit breaker. That mechanism monitors the error rate and latency of “giving a message to Smack”. If it starts taking too much time (which might indicate contention), the server simply stops sending messages for a brief time.
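Simplified, the breaker works roughly like this (made-up names, not a real library):

import java.util.concurrent.TimeUnit;

// Simplified sketch of the circuit breaker described above. If handing
// a message to Smack takes too long, sending is suspended for a
// cool-down period instead of blocking request threads.
public class SendCircuitBreaker {
    private final long latencyThresholdNanos;
    private final long coolDownMillis;
    private volatile long openUntilMillis = 0;

    public SendCircuitBreaker(long latencyThresholdMillis, long coolDownMillis) {
        this.latencyThresholdNanos = TimeUnit.MILLISECONDS.toNanos(latencyThresholdMillis);
        this.coolDownMillis = coolDownMillis;
    }

    public boolean trySend(Runnable send) {
        if (System.currentTimeMillis() < openUntilMillis) {
            return false; // breaker is open: drop the message rather than block
        }
        long start = System.nanoTime();
        send.run(); // e.g. pool.send(packet)
        if (System.nanoTime() - start > latencyThresholdNanos) {
            // Too slow: likely contention, so open the breaker briefly.
            openUntilMillis = System.currentTimeMillis() + coolDownMillis;
        }
        return true;
    }
}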

Using these mechanisms, we have no problem getting an average throughput of several hundred messages per second from each server, mostly due to having several connections, which gives us more Smack sender threads and more Smack queues to handle bursts. When the circuit breaker “opens”, i.e. briefly throws messages away, Smack rather quickly drains its queues and is ready for business again. In this way we can survive huge message bursts. This comes at the cost of losing messages, of course, but the alternative is crashing the server.

So I would say the lessons learned and things to think about in this type of situation are:

  • Blocked request threads can drag the whole server down; decide whether you would rather drop messages or block callers when the outgoing queue is full

  • Pool several XMPPConnections to spread load over more sender threads and queues

  • Measure the latency of handing messages to Smack so you know when to add connections

  • Use a circuit breaker to shed load during bursts instead of letting the server crash

A final word about Openfire: it is a beast. It can take huge amounts of messages; we have been able to handle bursts of 800-1000 messages a second on a modest machine while also serving between 1000-2000 BOSH connections. The only time Openfire has been in a less-than-graceful state is when the OS ran out of memory, causing the machine to run extremely slowly due to swapping. So avoid that at all costs 🙂
