20 Replies Latest reply on Sep 30, 2011 9:41 AM by Michael Adams

    Memory Leak Issues - Anyone?

    StaticVortex Silver

      I am starting another thread because we are still experiencing a serious memory leak and can't get to the bottom of it.

       

      We have PEP disabled and don't actually have any Empathy clients - all our clients are XIFF.

       

      By increasing the memory significantly we can keep Openfire running for about 24 hours before it crashes from out of memory exceptions.

       

      We are running on a 64-bit Ubuntu 9.04 server with 4 GB RAM and 32-bit Java (we had issues with the 64-bit version).

       

      We have an average of between 5000 and 8000 people online at any time. CPU and DB load is all fine.

       

      The memory slowly leaks away until it is in the red zone and then eventually it dies.

       

      Is anyone else experiencing a similar issue? I appreciate any help on this issue.

       

      Thanks

      Daniel

        • Re: Memory Leak Issues - Anyone?
          Guus der Kinderen KeyContributor

          Hey Daniel,

           

          Well, as in other threads that describe similar issues, my first suggestions are:

           

          I'd be happy to help, if you can provide me with the data.

           

          Regards,

           

            Guus

            • Re: Memory Leak Issues - Anyone?
              StaticVortex Silver

              Hi Guus, I really appreciate your support on this.

               

              I am actually using a later version of the MINA library (1.1.7) that helped resolve some issues (with Unicode), but unfortunately it doesn't support the probe because it doesn't support org/apache/mina/management/MINAStatCollector.

               

              Anyhow I will revert this back to the trunk version of MINA and get the probe working again. This will also eliminate the MINA library being the cause of the problem.

               

              I'll let you know when I have the probe dump.

               

              On a side note, I have also updated the JETTY library to 7 RC 6, which totally fixed a lot of issues with BOSH. About 10% of our connections are binding and the rest are socket.

               

              Thanks

              Daniel Haigh

              • Re: Memory Leak Issues - Anyone?
                StaticVortex Silver

                I have gotten the java-monitor probe working. Is there some way I can send you a dump of the whole lot?

                 

                Anyway, we have given Openfire 3 GB of RAM so that it doesn't crash too quickly. It still crashes every few hours during peak time though.

                 

                I notice that we have heaps of TIMED_WAITING threads (over 200) and this increases rapidly before the server crashes. Could it be an issue with a synchronized lock somewhere?
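
For anyone who wants to watch the thread states without taking a full thread dump each time, here is a minimal sketch that counts threads by state. It only sees threads in the JVM it runs in, so it would have to be called from inside Openfire (for example from a plugin); the class name is made up for illustration. Note that, per the `Thread.State` documentation, TIMED_WAITING means a thread is sleeping or waiting with a timeout, while threads stuck on a synchronized lock show up as BLOCKED, so a rising TIMED_WAITING count does not by itself point to lock contention.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper, not part of Openfire.
final class ThreadStateCount {
    /** Counts the live threads of the current JVM, grouped by Thread.State. */
    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints one line per state that has at least one thread.
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

Logging these counts every few minutes would show whether the TIMED_WAITING growth tracks the memory growth or is unrelated to it.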

                 

                We do have some custom plugins but restarting them doesn't seem to free up the memory or threads.

                 

                I am getting a thread dump and will post it shortly.

                 

                Thanks

                  • Re: Memory Leak Issues - Anyone?
                    StaticVortex Silver

                    Here is the latest thread dump - the server is relatively stable at the moment but has:

                     

                    2294.73 MB of 2958.25 MB (77.6%) used and going up...

                     

                    The CPU usage is always under 10%.

                     

                    We are using Jetty 7 as in this thread: http://www.igniterealtime.org/community/message/199896

                    Interestingly the blocking looks related to Jetty but maybe this is normal - I don't know.

                     

                    UPDATE - I tried turning BOSH off on our live server for almost an hour and, as expected, the threads reduced but the memory loss was still the same. So the issue doesn't seem to be related to Jetty in any way. I will work on trying to get a heap dump when it crashes next time. I don't have any experience with that.
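
On getting a heap dump: on a HotSpot JVM the usual routes are the `-XX:+HeapDumpOnOutOfMemoryError` startup flag (optionally with `-XX:HeapDumpPath=...`) or `jmap -dump:format=b,file=...` against the running process. A dump can also be triggered from code inside the JVM. The following is a minimal sketch of that last option, assuming a HotSpot JVM and Java 7 or later; the class name and the default file name are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Openfire. HotSpot-specific.
final class HeapDumper {
    /**
     * Writes a heap dump of the current JVM to the given file.
     * The file must not exist yet; use an ".hprof" extension.
     * With liveObjectsOnly = true only reachable objects are dumped.
     */
    static void dump(String file, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(file, liveObjectsOnly);
    }

    public static void main(String[] args) throws IOException {
        dump(args.length > 0 ? args[0] : "openfire-heap.hprof", true);
    }
}
```

The resulting file can be opened in a heap analyser to see which classes retain the most memory. Keep in mind that a dump of a multi-GB heap is itself multi-GB and pauses the JVM while it is written.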

                     

                    Message was edited by: StaticVortex

                      • Re: Memory Leak Issues - Anyone?
                        StaticVortex Silver

                        Just on a side note - although we have up to 9000 user sessions at any time, we actually have around 200,000 logins every day.

                         

                        Could there be a memory leak associated with having so many people connect and disconnect (i.e. not cleaning up very well)? Has this ever been tested?

                        • Re: Memory Leak Issues - Anyone?
                          Guus der Kinderen KeyContributor

                          Disclaimer: I hardly had any sleep in the last few days (but my baby boy is doing great). I've noticed that this does affect my concentration considerably, so what I'm about to type might not make any sense at all, but:

                           

                          Why all the Jetty-related threads? I'll be the first one to admit that I haven't looked at the Jetty Continuation thingy yet, but isn't the idea behind async servlet handling not having to keep all of those threads open? There appear to be two pretty big threadpools - are those required? Can the sizes of those pools be monitored and/or capped, perhaps?

                    • Re: Memory Leak Issues - Anyone?
                      Walter Ebeling KeyContributor

                      Hi Daniel,

                       

                      same situation around here. Roughly 2000 concurrent users, Kraken as IM gateway in use, Spark 2.5.8 deployed, PEP disabled.

                       

                      The server dies within 2-3 weeks, and Nagios reports Java memory errors prior to the server failure. Usually the first sign of an imminent server failure is the loss of the admin interface. Fortunately, our SLA for Spark is low and we don't have to provide 24x7.

                       

                      Walter

                      • Re: Memory Leak Issues - Anyone?
                        StaticVortex Silver

                        Great news! We have resolved this issue which was caused by a number of factors...

                         

                        I found the problem by analysing a heap dump.

                         

                         

                        By default Openfire caches the Roster (username2roster) for 6 hours since the last activity.

                         

                        On average each one of our roster items is about 90k in size in the cache. (This is the real size)

                         

                        Because we could have about 50,000 people logging in within 6 hours during peak time, that would use many GB of RAM, even though we would have fewer than 9000 people logged in at any time.
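
To put rough numbers on that, here is the arithmetic as a small sketch. It assumes that "90k" means about 90 KiB per cached roster entry and that every one of the 50,000 logins leaves one entry in the cache for the full lifetime; the class name is made up.

```java
// Back-of-the-envelope estimate only; figures are taken from the post above.
final class RosterCacheEstimate {
    /** Footprint in GiB of the given number of cached rosters at kibPerRoster KiB each. */
    static double gib(long rosters, long kibPerRoster) {
        return rosters * kibPerRoster / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Everyone who logged in during the 6-hour cache lifetime.
        System.out.printf("cached over 6 hours: %.2f GiB%n", gib(50_000, 90));
        // Only the users online at the same moment.
        System.out.printf("online at once:      %.2f GiB%n", gib(9_000, 90));
    }
}
```

That works out to roughly 4.3 GiB for the 6-hour window against under 0.8 GiB for the users actually online, which is more than the 3 GB heap mentioned earlier in the thread.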

                         

                        Unfortunately the cache wasn't cleaned up correctly by Openfire when it reached the limit, because Openfire is incorrectly calculating the size of the cache in org.jivesoftware.openfire.roster.RosterItem.getCachedSize() (and possibly in org.jivesoftware.openfire.roster.Roster at an initial glance). This is a part of the Cacheable interface.

                         

                        A lot of information has been added to the roster items since this code was written so it is calculating about a third or less of the actual size of the roster items. A new issue needs to be created for this - and I would be happy to prepare the patch which will accurately calculate the size of the roster objects.

                         

                        If this calculation was correct then the cache would have cleaned up properly and the memory issues wouldn't have occurred.
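
To illustrate why an underestimated size defeats the limit, here is a toy model of a size-bounded cache. This is not Openfire's DefaultCache; all names are made up, and the only point is that eviction is driven by the size each entry reports, not by what it really occupies.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model, not Openfire code.
final class SizeBoundedCache {
    /** Hypothetical cached object that reports its own size. */
    static final class Entry {
        final long actualBytes;   // what the object really occupies
        final long reportedBytes; // what its size calculation claims
        Entry(long actualBytes, long reportedBytes) {
            this.actualBytes = actualBytes;
            this.reportedBytes = reportedBytes;
        }
    }

    private final long maxReportedBytes;
    private long reportedTotal = 0;
    private final Map<String, Entry> entries = new LinkedHashMap<>(); // insertion order

    SizeBoundedCache(long maxReportedBytes) {
        this.maxReportedBytes = maxReportedBytes;
    }

    void put(String key, Entry e) {
        Entry old = entries.remove(key);
        if (old != null) reportedTotal -= old.reportedBytes;
        entries.put(key, e);
        reportedTotal += e.reportedBytes;
        // Evict the oldest entries until the *reported* total fits the limit.
        Iterator<Entry> it = entries.values().iterator();
        while (reportedTotal > maxReportedBytes && it.hasNext()) {
            reportedTotal -= it.next().reportedBytes;
            it.remove();
        }
    }

    /** Memory really held by the entries still in the cache. */
    long actualBytes() {
        long sum = 0;
        for (Entry e : entries.values()) sum += e.actualBytes;
        return sum;
    }

    public static void main(String[] args) {
        SizeBoundedCache cache = new SizeBoundedCache(900_000); // limit on reported bytes
        for (int i = 0; i < 1_000; i++) {
            cache.put("user" + i, new Entry(90_000, 30_000));   // reports a third of its size
        }
        System.out.println("actual bytes held: " + cache.actualBytes());
    }
}
```

With entries that report a third of their real size, the cache settles at 30 entries and holds 2,700,000 bytes against a nominal limit of 900,000, so the real footprint is three times what was configured.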

                         

                        We corrected the issue in the meantime by adding the property cache.username2roster.maxLifetime and setting it to 419430400, which is 20 minutes (rather than its default 6 hours). For anyone interested, who may be having a similar issue now, this property can be added through the admin interface - no programming required.
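
A sanity check on that value for anyone copying it: 419430400 is exactly 400 × 1024 × 1024, which looks like a byte count (400 MB) rather than a duration. Assuming Openfire reads the cache maxLifetime properties in milliseconds (an assumption, worth verifying against your version), 20 minutes would be 1200000. A small sketch of the conversion, with a made-up class name:

```java
import java.util.concurrent.TimeUnit;

// Unit conversion only; the millisecond unit for maxLifetime is an assumption.
final class LifetimeValue {
    /** Converts a lifetime in minutes to milliseconds. */
    static long minutesToMillis(long minutes) {
        return TimeUnit.MINUTES.toMillis(minutes);
    }

    public static void main(String[] args) {
        System.out.println("20 minutes in ms: " + minutesToMillis(20));
        System.out.println("400 MiB in bytes: " + 400L * 1024 * 1024);
        // How long 419430400 would be if it were interpreted as milliseconds.
        System.out.println("419430400 ms in days: "
                + 419_430_400L / (double) TimeUnit.DAYS.toMillis(1));
    }
}
```

Read as milliseconds, 419430400 is still a little under 5 days, so it is worth double-checking which value actually ended up in the property.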

                         

                        I guess the other option would be to reduce the cache.username2roster.size to about a quarter of what you want to account for the calculation issue explained above.

                         

                        Let me know if you have any questions,

                         

                        Thanks

                        Daniel

                          • Re: Memory Leak Issues - Anyone?
                            Guus der Kinderen KeyContributor

                            Excellent find, and thanks for documenting a workaround too. I've created OF-333 to track this issue. If you could provide a patch - please! We're quite short-handed at the moment - any help is welcome.

                              • Re: Memory Leak Issues - Anyone?
                                Guus der Kinderen KeyContributor

                                I made some modifications to Roster, RosterItem and CacheSizes. These should improve your mileage quite a bit. Could you verify if this works for you?

                                 

                                The long term solution would be to move out the "Cacheable" interface completely. As far as I see, we need it only for the DefaultCache implementation that Jive wrote - we should switch that for an open source alternative, which will give us the opportunity to move Coherence out too. That should give us a clustering solution that's completely open source.

                                  • Re: Memory Leak Issues - Anyone?
                                    StaticVortex Silver

                                    I will test your changes.

                                     

                                    Have you considered using memcached for the clustering solution?

                                     

                                    Open source

                                    Totally scalable

                                    Easy to implement

                                    Would work on clustered env.

                                    Should be no contention issues

                                    Will store the serialized objects

                                     

                                    You would lose the ability to set the sizes of the individual caches but the timeouts would all work.

                                    I would make this configurable through the clustering plugin so as to not complicate stand alone installations, but I guess you could use it by default.

                                     

                                    We use memcached extensively on our web servers and it works great.

                                     

                                    I assume this would remove the need for Coherence libs?

                                    • Re: Memory Leak Issues - Anyone?
                                      StaticVortex Silver

                                      Hi Guus,

                                       

                                      I notice you have updated the calculation to include these collections:

                                       

                                      size += CacheSizes.sizeOfCollection(invisibleSharedGroups);
                                      size += CacheSizes.sizeOfCollection(sharedGroups);

                                      When we did the memory dumps, these two collections used a significant amount of the cache, even though we don't have any invisible or shared groups in our system. They are still very large and in every single roster item. Just from memory I think about a third of the roster cache was simply these two empty collections.

                                       

                                      Do you know why this is the case? I assume they should be set to null? Is there some way of storing roster cache but not storing these empty collections if they are empty?

                                       

                                      It would make a significant difference to memory usage.
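
One common way to avoid paying for empty collections is to point the field at a shared immutable empty list and only allocate a real list on first use. A minimal sketch of that idea follows; the class is hypothetical and is not Openfire's RosterItem, whose field types and serialization requirements may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical roster item, not Openfire's RosterItem.
final class LazyGroupsItem {
    // Shared immutable empty list: no per-item allocation while there are no groups.
    private List<String> sharedGroups = Collections.emptyList();

    void addSharedGroup(String group) {
        if (sharedGroups.isEmpty()) {
            // Allocate a real, mutable list only when the first group is added.
            sharedGroups = new ArrayList<>();
        }
        sharedGroups.add(group);
    }

    /** Read-only view, so callers cannot modify the internal list. */
    List<String> getSharedGroups() {
        return Collections.unmodifiableList(sharedGroups);
    }
}
```

Compared with setting the field to null, this keeps callers free of null checks while items without groups carry only a reference to the shared empty list.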

                                       

                                      (We will be putting the latest trunk build into our live system shortly - so I'll be able to test a lot of the latest changes before the next beta release)

                                       

                                      Thanks

                                      Daniel

                                • Re: Memory Leak Issues - Anyone?
                                  Bronze

                                  Hello everyone.

                                   

                                  I have a memory leak too.

                                  Here is what I think.

                                  I think the MAIN memory leak comes from bad TCP sessions (apart from smaller leaks in other modules).

                                  Java, like .NET, has a GC that cleans up unused data. But the GC doesn't clean up threads that were never closed in code. Also, each thread occupies memory in both the Java virtual machine and OS native memory at the same time.

                                  Threads may stay open because of hung TCP sessions (there are several threads per TCP session): when a TCP error (acs or another error) occurs on a session, OF can't see it (if you use a switch for the network), and the OF client (or a transport server like ICQ, MSN and others) establishes a new connection with it. But the old session is not closed and hangs around, eating into Java and NATIVE memory.

                                  So that is the main memory leak, and it appears in the transport and OF connections.

                                  I wrote a plugin that disconnects idle sessions after 60 minutes, but the problem is that it cleans up the logical sessions of OF, not the network sessions.

                                  So I don't yet know how to pick out the idle network sessions.

                                  Also, I use 128k of stack for threads in the JVM settings, which increases the time before the out of memory exception comes.

                                  • Memory Leak Issues - Anyone?
                                    Michael Adams Bronze

                                    I figured out how to capture a memory leak report on the latest build (September 30, 2011). Memory usage was heavily concentrated in PEPService, TaskQueue, and the language modules.

                                     

                                    http://community.igniterealtime.org/servlet/JiveServlet/previewBody/2207-102-1-2490/leak_report_09302011.pdf