Same situation around here. Roughly 2,000 concurrent users, Kraken in use as the IM gateway, Spark 2.5.8 deployed, PEP disabled.
The server dies within 2-3 weeks; Nagios reports Java memory errors prior to the server failure. Usually the first sign of an imminent failure is the loss of the admin interface. Fortunately, our SLA for Spark is low and we don’t have to provide 24x7 support.
Hi Guus, I really appreciate your support on this.
I am actually using a later version of the MINA library (1.1.7) that helped resolve some issues (with Unicode), but unfortunately it doesn’t support the probe because it lacks org/apache/mina/management/MINAStatCollector.
Anyhow, I will revert to the trunk version of MINA and get the probe working again. This will also rule out the MINA library as the cause of the problem.
I’ll let you know when I have the probe dump.
On a side note, I have also updated the Jetty library to 7 RC 6, which fixed a lot of issues with BOSH. About 10% of our connections use BOSH binding and the rest use plain sockets.
I have gotten the java-monitor probe working. Is there some way I can send you a dump of the whole lot?
Anyway, we have given Openfire 3 GB of RAM so that it doesn’t crash too quickly… it still crashes every few hours during peak time, though.
I notice that we have a large number of TIMED_WAITING threads (over 200), and the count increases rapidly before the server crashes. Could it be an issue with a synchronized lock somewhere?
We do have some custom plugins but restarting them doesn’t seem to free up the memory or threads.
I am getting a thread dump and will post it shortly.
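For watching the TIMED_WAITING count between full thread dumps, the standard java.lang.management API can report live threads grouped by state. A small self-contained sketch (plain JDK, not Openfire-specific; the class name is mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCounter {
    // Count live threads in this JVM, grouped by Thread.State,
    // using the platform ThreadMXBean.
    public static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null) { // a thread may have died since getAllThreadIds()
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

Polling this periodically (e.g. from a monitoring servlet or a scheduled task) would show whether the TIMED_WAITING growth correlates with the memory loss.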
Interestingly, the blocking looks related to Jetty, but maybe that’s normal - I don’t know.
UPDATE - I tried turning BOSH off on our live server for almost an hour, and as expected the thread count dropped, but the memory loss was unchanged. So the issue doesn’t seem to be related to Jetty in any way. I will try to get a heap dump when it crashes next time; I don’t have any experience with that.
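For what it’s worth, getting a heap dump doesn’t have to wait for the crash: on a HotSpot JVM, adding -XX:+HeapDumpOnOutOfMemoryError to the JVM options writes one automatically at the moment of the OutOfMemoryError, and `jmap -dump:live,format=b,file=heap.hprof <pid>` triggers one externally. It can also be done from inside the JVM via the HotSpotDiagnostic MXBean - a minimal sketch (standard JDK API, unrelated to Openfire itself):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // Write an hprof heap dump of the running JVM to the given path.
    // The target file must not exist yet; live=true restricts the dump
    // to objects that are still reachable.
    public static File dumpHeap(String path, boolean live) {
        try {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(path, live);
            return new File(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("openfire-heap", ".hprof");
        tmp.delete(); // dumpHeap refuses to overwrite an existing file
        File dump = dumpHeap(tmp.getAbsolutePath(), true);
        System.out.println("Wrote " + dump.length() + " bytes to " + dump);
        dump.delete();
    }
}
```

The resulting .hprof file can then be opened in a heap analyser such as Eclipse MAT.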
Disclaimer: I hardly had any sleep in the last few days (but my baby boy is doing great!). I’ve noticed that this does affect my concentration considerably, so what I’m about to type might not make any sense at all, but:
Why all the Jetty-related threads? I’ll be the first to admit that I haven’t looked at the Jetty Continuation mechanism yet, but isn’t the idea behind async servlet handling precisely not having to keep all of those threads around? There appear to be two pretty big thread pools - are those required? Can the sizes of those pools be monitored and/or capped, perhaps?
Great news! We have resolved this issue which was caused by a number of factors…
I found the problem by analysing a heap dump.
By default, Openfire caches the roster (the username2roster cache) for 6 hours after the last activity.
On average, each of our cached roster entries is about 90 KB (the real size, not the size Openfire reports).
Because roughly 50,000 people can log in within a 6-hour window during peak time, that cache can grow to several GB of RAM (50,000 × 90 KB ≈ 4.3 GB), even though fewer than 9,000 people are logged in at any one time.
Unfortunately, the cache wasn’t cleaned up correctly by Openfire when it reached its limit, because Openfire miscalculates the size of cached entries in org.jivesoftware.openfire.roster.RosterItem.getCachedSize() (and possibly in org.jivesoftware.openfire.roster.Roster as well, at first glance). This is part of the Cacheable interface.
A lot of information has been added to roster items since this code was written, so it now reports only about a third or less of the actual size of each item. A new issue needs to be created for this, and I would be happy to prepare a patch that accurately calculates the size of the roster objects.
If this calculation were correct, the cache would have been cleaned up properly and the memory issues wouldn’t have occurred.
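To illustrate the shape of the fix: the size method has to account for every field, including ones added after the original code was written. This is a simplified stand-in, not the actual RosterItem code - the field names and byte estimates are mine; the real patch would presumably use the helpers in Openfire’s CacheSizes utility class:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a cached roster item; fields are illustrative,
// not the actual org.jivesoftware.openfire.roster.RosterItem fields.
public class CachedRosterItem {
    String jid = "someone@example.com";
    String nickname = "Someone";
    List<String> groups = new ArrayList<>();
    List<String> sharedGroups = new ArrayList<>();    // fields added after the
    List<String> invisibleGroups = new ArrayList<>(); // size code was first written

    // Rough per-object byte estimates, in the spirit of cache-size helpers.
    static int sizeOfString(String s) { return s == null ? 0 : 4 + s.length() * 2; }
    static int sizeOfStringList(List<String> l) {
        int size = 16; // collection overhead
        for (String s : l) size += sizeOfString(s);
        return size;
    }

    // The point of the fix: count EVERY field. If newer fields are skipped,
    // the cache under-reports its footprint and never evicts entries when
    // the configured byte limit is reached.
    public int getCachedSize() {
        return 8 // object overhead
            + sizeOfString(jid)
            + sizeOfString(nickname)
            + sizeOfStringList(groups)
            + sizeOfStringList(sharedGroups)
            + sizeOfStringList(invisibleGroups);
    }
}
```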
We worked around the issue in the meantime by adding the property cache.username2roster.maxLifetime and setting it to 1200000, i.e. 20 minutes in milliseconds (rather than the default of 6 hours). For anyone who may be having a similar issue now, this property can be added through the admin interface - no programming required.
I guess the other option would be to reduce cache.username2roster.size to about a quarter of the value you actually want, to compensate for the calculation issue explained above.
There are a couple of internal plugins on the Openfire server, but they are used for admin purposes only (password reset, user creation)… I suspect Kraken is the driver of our issue. We have reduced the number of Kraken users and the failures have decreased noticeably.
Excellent find, and thanks for documenting a workaround too. I’ve created OF-333 to track this issue. If you could provide a patch - please! We’re quite short-handed at the moment - any help is welcome.
I made some modifications to Roster, RosterItem and CacheSizes. These should improve your mileage quite a bit. Could you verify whether this works for you?
The long-term solution would be to remove the “Cacheable” interface completely. As far as I can see, we need it only for the DefaultCache implementation that Jive wrote - we should swap that for an open source alternative, which would also give us the opportunity to move Coherence out. That would give us a clustering solution that’s completely open source.
I think the MAIN memory leak results from bad TCP sessions (besides smaller leaks in other modules).
Java, like .NET, has a GC that cleans up unused data, but the GC doesn’t reclaim threads that were never shut down in code. Each thread also occupies both JVM memory and native OS memory simultaneously.
Threads can be left running because of hung TCP sessions (there are several threads per TCP session): when a TCP error (lost ACKs and the like) occurs for a session, Openfire may not notice it (for instance when the connection goes through a switch), and the client (or a transport server for ICQ, MSN and the like) simply establishes a new connection. The old session is never closed, and these hung sessions consume both Java and NATIVE memory.
So that is the main memory leak that appears in the transports and in Openfire connections.
I wrote a plugin that disconnects idle sessions after 60 minutes, but the problem is that it cleans up Openfire’s logical sessions, not the underlying network sockets.
So I don’t yet know how to detect idle network sessions.
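The sweep logic the plugin needs is simple in itself. Here is a self-contained sketch of the idea, using a toy Session class rather than Openfire’s real org.jivesoftware.openfire.session.Session (all names and fields here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class IdleSessionSweeper {
    // Minimal stand-in for a client session; a real plugin would work with
    // Openfire's session objects and their last-activity timestamps instead.
    public static class Session {
        final String jid;
        final long lastActiveMillis;
        boolean closed = false;
        Session(String jid, long lastActiveMillis) {
            this.jid = jid;
            this.lastActiveMillis = lastActiveMillis;
        }
        void close() { closed = true; }
    }

    // Close every session that has been idle for longer than idleTimeoutMillis.
    // Returns the sessions that were closed in this pass.
    public static List<Session> sweep(List<Session> sessions, long nowMillis,
                                      long idleTimeoutMillis) {
        List<Session> closed = new ArrayList<>();
        for (Session s : sessions) {
            if (!s.closed && nowMillis - s.lastActiveMillis > idleTimeoutMillis) {
                s.close();
                closed.add(s);
            }
        }
        return closed;
    }
}
```

As the post notes, closing the logical session doesn’t guarantee the OS-level socket goes away when the peer has silently vanished; enabling TCP keep-alive at the OS level is one common way to have such dead connections detected eventually.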
I also use a 128 KB thread stack in the JVM settings (-Xss128k) - that delays the point at which the out-of-memory errors appear.
When we did the memory dumps, these two collections accounted for a significant share of the cache, even though we don’t have any invisible or shared groups in our system. They are still very large and present in every single roster item. From memory, about a third of the roster cache was simply these two empty collections.
Do you know why this is the case? I assume they should be set to null? Is there some way of storing the roster cache without these collections when they are empty?
It would make a significant difference to memory usage.
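One common remedy for this pattern (an assumption on my part, not necessarily what was done in Openfire) is to represent empty collections with the shared immutable singletons from java.util.Collections instead of allocating a fresh collection per roster item - the singleton is one object shared JVM-wide, so thousands of cached items can point at it for free:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class EmptyCollectionDemo {
    // Collections.emptySet() always returns the same shared immutable instance,
    // so it costs nothing per cached item.
    public static boolean isShared() {
        Set<String> a = Collections.emptySet();
        Set<String> b = Collections.emptySet();
        return a == b; // same object
    }

    // Every new HashSet is a separate allocation with its own internal
    // structures, even when it never holds an element.
    public static boolean isDistinct() {
        Set<String> a = new HashSet<>();
        Set<String> b = new HashSet<>();
        return a == b; // always false: two allocations
    }

    public static void main(String[] args) {
        System.out.println("emptySet shared:  " + isShared());
        System.out.println("HashSet shared:   " + isDistinct());
    }
}
```

The trade-off is that the singleton is immutable, so the owning class must swap in a real collection the first time something is actually added.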
(We will be putting the latest trunk build into our live system shortly - so I’ll be able to test a lot of the latest changes before the next beta release)