Same situation around here. Roughly 2,000 concurrent users, Kraken in use as the IM gateway, Spark 2.5.8 deployed, PEP disabled.
The server dies within 2-3 weeks; Nagios reports Java memory errors prior to the server failure. Usually the first sign of an imminent failure is the loss of the admin interface. Fortunately, our SLA for Spark is low and we don’t have to provide 24x7 support.
Hi Guus, I really appreciate your support on this.
I am actually using a later version of the MINA library (1.1.7) that helped resolve some issues (with Unicode), but unfortunately it doesn’t support the probe because it lacks org/apache/mina/management/MINAStatCollector.
Anyhow, I will revert to the trunk version of MINA and get the probe working again. This will also rule out the MINA library as the cause of the problem.
I’ll let you know when I have the probe dump.
On a side note, I have also updated the Jetty library to 7 RC 6, which fixed a lot of issues with BOSH. About 10% of our connections use BOSH binding and the rest use plain sockets.
I have gotten the java-monitor probe working. Is there some way I can send you a dump of the whole lot?
Anyway, we have given Openfire 3 GB of RAM so that it doesn’t crash too quickly… it still crashes every few hours during peak time, though.
I notice that we have a large number of TIMED_WAITING threads (over 200), and the count increases rapidly before the server crashes. Could it be an issue with a synchronized lock somewhere?
We do have some custom plugins but restarting them doesn’t seem to free up the memory or threads.
I am getting a thread dump and will post it shortly.
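For watching the TIMED_WAITING count between full thread dumps, the standard java.lang.management API can report live threads grouped by state. A small self-contained sketch (plain JDK, not Openfire-specific; the class name is mine):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadStateCounter {
    // Count live threads in this JVM, grouped by Thread.State,
    // using the platform ThreadMXBean.
    public static Map<Thread.State, Integer> countByState() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null) { // a thread may have died since getAllThreadIds()
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

Polling this periodically (e.g. from a monitoring servlet or a scheduled task) would show whether the TIMED_WAITING growth correlates with the memory loss.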
Interestingly, the blocking looks related to Jetty, but maybe that’s normal - I don’t know.
UPDATE - I tried turning BOSH off on our live server for almost an hour, and as expected the thread count dropped, but the memory loss was unchanged. So the issue doesn’t seem to be related to Jetty in any way. I will try to get a heap dump when it crashes next time; I don’t have any experience with that.
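For what it’s worth, getting a heap dump doesn’t have to wait for the crash: on a HotSpot JVM, adding -XX:+HeapDumpOnOutOfMemoryError to the JVM options writes one automatically at the moment of the OutOfMemoryError, and `jmap -dump:live,format=b,file=heap.hprof <pid>` triggers one externally. It can also be done from inside the JVM via the HotSpotDiagnostic MXBean - a minimal sketch (standard JDK API, unrelated to Openfire itself):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // Write an hprof heap dump of the running JVM to the given path.
    // The target file must not exist yet; live=true restricts the dump
    // to objects that are still reachable.
    public static File dumpHeap(String path, boolean live) {
        try {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(path, live);
            return new File(path);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("openfire-heap", ".hprof");
        tmp.delete(); // dumpHeap refuses to overwrite an existing file
        File dump = dumpHeap(tmp.getAbsolutePath(), true);
        System.out.println("Wrote " + dump.length() + " bytes to " + dump);
        dump.delete();
    }
}
```

The resulting .hprof file can then be opened in a heap analyser such as Eclipse MAT.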
Disclaimer: I hardly had any sleep in the last few days (but my baby boy is doing great!). I’ve noticed that this does affect my concentration considerably, so what I’m about to type might not make any sense at all, but:
Why all the Jetty-related threads? I’ll be the first to admit that I haven’t looked at the Jetty Continuation mechanism yet, but isn’t the idea behind async servlet handling precisely not having to keep all of those threads around? There appear to be two pretty big thread pools - are those required? Can the sizes of those pools be monitored and/or capped, perhaps?
Great news! We have resolved this issue which was caused by a number of factors…
I found the problem by analysing a heap dump.
By default, Openfire caches the roster (the username2roster cache) for 6 hours after the last activity.
On average, each of our cached roster entries is about 90 KB (the real size, not the size Openfire reports).
Because roughly 50,000 people can log in within a 6-hour window during peak time, that cache can grow to several GB of RAM (50,000 × 90 KB ≈ 4.3 GB), even though fewer than 9,000 people are logged in at any one time.
Unfortunately, the cache wasn’t cleaned up correctly by Openfire when it reached its limit, because Openfire miscalculates the size of cached entries in org.jivesoftware.openfire.roster.RosterItem.getCachedSize() (and possibly in org.jivesoftware.openfire.roster.Roster as well, at first glance). This is part of the Cacheable interface.
A lot of information has been added to roster items since this code was written, so it now reports only about a third or less of the actual size of each item. A new issue needs to be created for this, and I would be happy to prepare a patch that accurately calculates the size of the roster objects.
If this calculation were correct, the cache would have been cleaned up properly and the memory issues wouldn’t have occurred.
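To illustrate the shape of the fix: the size method has to account for every field, including ones added after the original code was written. This is a simplified stand-in, not the actual RosterItem code - the field names and byte estimates are mine; the real patch would presumably use the helpers in Openfire’s CacheSizes utility class:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a cached roster item; fields are illustrative,
// not the actual org.jivesoftware.openfire.roster.RosterItem fields.
public class CachedRosterItem {
    String jid = "someone@example.com";
    String nickname = "Someone";
    List<String> groups = new ArrayList<>();
    List<String> sharedGroups = new ArrayList<>();    // fields added after the
    List<String> invisibleGroups = new ArrayList<>(); // size code was first written

    // Rough per-object byte estimates, in the spirit of cache-size helpers.
    static int sizeOfString(String s) { return s == null ? 0 : 4 + s.length() * 2; }
    static int sizeOfStringList(List<String> l) {
        int size = 16; // collection overhead
        for (String s : l) size += sizeOfString(s);
        return size;
    }

    // The point of the fix: count EVERY field. If newer fields are skipped,
    // the cache under-reports its footprint and never evicts entries when
    // the configured byte limit is reached.
    public int getCachedSize() {
        return 8 // object overhead
            + sizeOfString(jid)
            + sizeOfString(nickname)
            + sizeOfStringList(groups)
            + sizeOfStringList(sharedGroups)
            + sizeOfStringList(invisibleGroups);
    }
}
```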
We worked around the issue in the meantime by adding the property cache.username2roster.maxLifetime and setting it to 1200000, i.e. 20 minutes in milliseconds (rather than the default of 6 hours). For anyone who may be having a similar issue now, this property can be added through the admin interface - no programming required.
I guess the other option would be to reduce cache.username2roster.size to about a quarter of the value you actually want, to compensate for the calculation issue explained above.
There are a couple of internal plugins on the Openfire server, but they are used for admin purposes only (password reset, user creation)… I suspect Kraken is the driver of our issue. We have reduced the number of Kraken users and the failures have decreased noticeably.
Excellent find, and thanks for documenting a workaround too. I’ve created OF-333 to track this issue. If you could provide a patch - please! We’re quite short-handed at the moment - any help is welcome.
I made some modifications to Roster, RosterItem and CacheSizes. These should improve your mileage quite a bit. Could you verify whether this works for you?
The long-term solution would be to remove the “Cacheable” interface completely. As far as I can see, we need it only for the DefaultCache implementation that Jive wrote - we should swap that for an open source alternative, which would also give us the opportunity to move Coherence out. That would give us a clustering solution that’s completely open source.
I think the MAIN memory leak results from bad TCP sessions (besides smaller leaks in other modules).
Java, like .NET, has a GC that cleans up unused data, but the GC doesn’t reclaim threads that were never shut down in code. Each thread also occupies both JVM memory and native OS memory simultaneously.
Threads can be left running because of hung TCP sessions (there are several threads per TCP session): when a TCP error (lost ACKs and the like) occurs for a session, Openfire may not notice it (for instance when the connection goes through a switch), and the client (or a transport server for ICQ, MSN and the like) simply establishes a new connection. The old session is never closed, and these hung sessions consume both Java and NATIVE memory.
So that is the main memory leak that appears in the transports and in Openfire connections.
I wrote a plugin that disconnects idle sessions after 60 minutes, but the problem is that it cleans up Openfire’s logical sessions, not the underlying network sockets.
So I don’t yet know how to detect idle network sessions.
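The sweep logic the plugin needs is simple in itself. Here is a self-contained sketch of the idea, using a toy Session class rather than Openfire’s real org.jivesoftware.openfire.session.Session (all names and fields here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class IdleSessionSweeper {
    // Minimal stand-in for a client session; a real plugin would work with
    // Openfire's session objects and their last-activity timestamps instead.
    public static class Session {
        final String jid;
        final long lastActiveMillis;
        boolean closed = false;
        Session(String jid, long lastActiveMillis) {
            this.jid = jid;
            this.lastActiveMillis = lastActiveMillis;
        }
        void close() { closed = true; }
    }

    // Close every session that has been idle for longer than idleTimeoutMillis.
    // Returns the sessions that were closed in this pass.
    public static List<Session> sweep(List<Session> sessions, long nowMillis,
                                      long idleTimeoutMillis) {
        List<Session> closed = new ArrayList<>();
        for (Session s : sessions) {
            if (!s.closed && nowMillis - s.lastActiveMillis > idleTimeoutMillis) {
                s.close();
                closed.add(s);
            }
        }
        return closed;
    }
}
```

As the post notes, closing the logical session doesn’t guarantee the OS-level socket goes away when the peer has silently vanished; enabling TCP keep-alive at the OS level is one common way to have such dead connections detected eventually.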
I also use a 128 KB thread stack in the JVM settings (-Xss128k) - that delays the point at which the out-of-memory errors appear.
When we did the memory dumps, these two collections accounted for a significant share of the cache, even though we don’t have any invisible or shared groups in our system. They are still very large and present in every single roster item. From memory, about a third of the roster cache was simply these two empty collections.
Do you know why this is the case? I assume they should be set to null? Is there some way of storing the roster cache without these collections when they are empty?
It would make a significant difference to memory usage.
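One common remedy for this pattern (an assumption on my part, not necessarily what was done in Openfire) is to represent empty collections with the shared immutable singletons from java.util.Collections instead of allocating a fresh collection per roster item - the singleton is one object shared JVM-wide, so thousands of cached items can point at it for free:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class EmptyCollectionDemo {
    // Collections.emptySet() always returns the same shared immutable instance,
    // so it costs nothing per cached item.
    public static boolean isShared() {
        Set<String> a = Collections.emptySet();
        Set<String> b = Collections.emptySet();
        return a == b; // same object
    }

    // Every new HashSet is a separate allocation with its own internal
    // structures, even when it never holds an element.
    public static boolean isDistinct() {
        Set<String> a = new HashSet<>();
        Set<String> b = new HashSet<>();
        return a == b; // always false: two allocations
    }

    public static void main(String[] args) {
        System.out.println("emptySet shared:  " + isShared());
        System.out.println("HashSet shared:   " + isDistinct());
    }
}
```

The trade-off is that the singleton is immutable, so the owning class must swap in a real collection the first time something is actually added.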
(We will be putting the latest trunk build into our live system shortly - so I’ll be able to test a lot of the latest changes before the next beta release)