Openfire's Achilles' heel

Version 1

    "You're going to bash Openfire on the Ignite blog?"

     

    Perhaps I shouldn't have asked my former coworker to review a draft of this article while the only thing that was written was the problem description. On the other hand, this guy knows me, and knows I love to work with Openfire. Maybe a bit of a disclaimer is not completely uncalled for.

     

    I'm about to describe a serious problem in Openfire, one that potentially cripples an entire XMPP domain, if overlooked. I've learned about this problem the hard way. Most of what I'm about to describe was learned in the trenches, while battling unexpected outages.

     

    I'm not ranting here, though. Instead, I'd like to warn administrators and developers of Openfire. The problem that I'm describing here is one that can pop up in environments that process high volumes of data. Luckily, there are a few good ways of avoiding, or at the very least greatly reducing, the problem. You'll find them in the second half of the article.

     

    Right, enough with the pleasantries, let's get started.

    The Problem

    As I said, the Achilles' heel typically pops up at high-traffic Openfire installations. The more data your XMPP domain processes, the more likely it is that you'll run into it. Low-volume setups (such as most private and small office setups) hardly ever run into it, though. As a rough rule of thumb, this article applies to your setup if you count your concurrently connected users in the thousands.

     

    To describe the problem, you'll need to understand a bit of how Openfire works internally.

    What happens when data is sent to Openfire?

    In the next few paragraphs, I'll paint a pretty technical picture of how Openfire processes data that it receives. Feel free to skip to the next section if you're not interested in the technical background.

     

    Openfire uses socket acceptors to receive raw data that is sent. For each type of connection, a specific socket acceptor is instantiated within Openfire. All data that is sent from client connections, for example, is handled by one socket acceptor (although a separate acceptor is used for clients that connect using SSL). All data that is sent over connections from multiplexers (the connection managers) is handled by another acceptor. The data sent from external components is handled by yet another acceptor.

     

    A socket acceptor will first parse the raw bytes into XML fragments. Next, these XML fragments are handed over to a connection handler (specific to the socket acceptor) that will process the data. This is where most of the magic happens: sessions are opened and closed, XMPP stanzas are parsed from the XML data and are routed through all the appropriate subsystems of Openfire. Eventually, the stanza is delivered to its destination.

     

    A connection handler performs a lot of operations. To allow for optimal performance, multiple events are processed concurrently. To achieve this, each connection handler uses a thread pool. These thread pools (which I'll refer to as the core thread pools) are fixed-size pools, meaning that they use a fixed number of threads to do all of the processing. By default, Openfire uses 17 threads for each pool.

    Achilles reveals his heel

    Up to here, things are pretty technical. The important bit that you need to realize to understand the Achilles' heel that I'm about to describe, is this:  Openfire uses a fixed number of Java threads (from the core thread pool) to do most of its work.

     

    Each task that needs to be processed is pretty small. Typically, one XMPP stanza is processed. Usually (note the emphasis) it takes very little time to process such a task. But what happens if "usually" doesn't hold true? Thread pools are fronted by a queue, where tasks are stored if all threads from the pool are busy, but this only goes so far. What happens if a particular type of stanza takes quite some time to process and all threads in the pool are busy processing a stanza of that type?

     

    Easy: Openfire stops processing data. If this happens in the core thread pool that handles your client connections, service is very visibly interrupted. Quite often, Openfire doesn't recover from this situation. The server effectively dies.
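
    The effect is easy to reproduce outside of Openfire. The following is a minimal sketch that uses plain java.util.concurrent classes, not Openfire's own code: the class name, the pool size of two and the sleep that stands in for a slow stanza are all assumptions made for illustration. It fills a fixed-size pool with slow tasks and then measures how long an otherwise instant task has to wait for a free thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class PoolStarvationDemo {

    /**
     * Fills a fixed-size pool with slow tasks, then submits one fast task and
     * returns how long (in milliseconds) that fast task waited in the queue.
     */
    static long fastTaskWaitMillis(int poolSize, long slowTaskMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            CountDownLatch allThreadsBusy = new CountDownLatch(poolSize);
            for (int i = 0; i < poolSize; i++) {
                pool.submit(() -> {
                    allThreadsBusy.countDown();
                    try {
                        Thread.sleep(slowTaskMillis); // stands in for a slow stanza
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allThreadsBusy.await(); // every thread in the pool is now occupied

            // The fast task only records the moment at which it starts running.
            Callable<Long> fastTask = System::nanoTime;
            long submitted = System.nanoTime();
            Future<Long> started = pool.submit(fastTask);
            return TimeUnit.NANOSECONDS.toMillis(started.get() - submitted);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two threads instead of the 17 that Openfire uses, to keep the demo small.
        long waited = fastTaskWaitMillis(2, 500);
        System.out.println("Fast task waited about " + waited + " ms for a free thread");
    }
}
```

    The fast task needs no time at all to run, yet it is stuck for roughly as long as the slow tasks take. Replace the sleep with a task that never returns, and the fast task never runs.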

    What are the odds?

    Surprisingly high, actually. There are two things needed for the problem to occur:

    1. at least one (type of) stanza that takes considerable time to be processed.
    2. all of the threads of a core thread pool busy processing this type of stanza.

     

    It is surprisingly easy to create stanzas that take some time to process. If, for example, your clients send you an IQ request that depends on a costly database query to be answered, you can easily add seconds to the processing time of a single stanza.

     

    The second part of the equation is all about odds: The longer it takes a single thread to process a particular task, the bigger the chance that somewhere during that time the other threads also start processing such tasks. Also, if your community grows, your pool will start processing more data. This increases the chance that your troublesome stanza is being processed in parallel.
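
    A back-of-the-envelope estimate makes these odds more tangible. The sketch below multiplies the arrival rate of a stanza type by its processing time, which gives the average number of threads that this stanza type keeps busy (the offered load, in queueing terms). The traffic figures are hypothetical and only serve as an illustration; the pool size of 17 is the default mentioned earlier.

```java
public class PoolLoadEstimate {

    /**
     * Offered load: the average number of threads kept busy by a stanza type
     * that arrives at the given rate and takes the given time to process.
     */
    static double offeredLoad(double stanzasPerSecond, double secondsPerStanza) {
        return stanzasPerSecond * secondsPerStanza;
    }

    public static void main(String[] args) {
        int poolSize = 17; // default size of a core thread pool, as described above

        // Hypothetical traffic: 10 costly IQ requests per second, 2 seconds each.
        double load = offeredLoad(10, 2.0);

        System.out.println("Threads needed on average: " + load + ", available: " + poolSize);
        if (load >= poolSize) {
            System.out.println("The pool cannot keep up: its queue keeps growing.");
        }
    }
}
```

    With these numbers, the troublesome stanza alone needs more threads than the pool has. Note that the request rate is what grows with your community, while the processing time is the part that you control as a developer.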

    Will the clustering plugin solve this problem?

    One can argue that moving to a clustered environment would reduce the risk of having troublesome stanzas processed at the same time. The workload is divided over more processors, after all. Sadly, the clustering implementation itself suffers from a similar problem.

     

    The clustering plugin adds considerable overhead. Things need to be kept in sync with other cluster nodes, after all. Some of this overhead will be added to the core thread pools of other cluster members. For some specific tasks, the requesting cluster node will wait synchronously (meaning: keeping the thread from its core pool waiting) for an answer. Not only does this add to the processing time of that request (cluster operations introduce network IO latency), it also spreads the problem to the other cluster members. Something to ponder: what could happen if the processing cluster member uses yet another synchronous cluster operation to process the data?

     

    In practice, I've found that the clustering plugin adds to the problem, rather than takes away from it. This is, however, highly dependent on the type of traffic that your XMPP domain gets to process. If your community uses MultiUserChat a lot, for example, things go south quite fast. If you're running a vanilla Openfire cluster and you use only a small number of the features that are offered by Openfire, you might not run into any problems at all.

    Is this all?

    Not exactly. Up until this point, I have discussed a problem that relates to threads. The problem doesn't stop there, though. Similar problems can pop up in any single resource that's shared between different parts of Openfire. An obvious example would be the database connection pool. Openfire defines just one pool, that can be used by any part of Openfire. If one part of Openfire has a bug, or simply makes excessive use of the resource, it can affect the performance of another part of Openfire.

    Detecting and monitoring the problem

    If the Achilles' heel problem occurs, at least one of the connection handlers will start using all threads from its thread pool, and will continue to do so. Typically, this is a client connection handler. Openfire will be very slow to respond, or will suffer from catastrophic failure.

     

    Java-monitor.com's plugin for Openfire allows you to monitor the state of the thread pool in the client connection handler. Similar monitoring tools are easily developed, by tapping into the JMX-based functionality that Openfire (and MINA) provide.
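
    To give an idea of what such a monitor looks at, here is a minimal sketch of a saturation check. It reads the figures directly from a standard java.util.concurrent.ThreadPoolExecutor; how Openfire and MINA expose their pools through JMX is not shown here, and the class and method names are made up for this example.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSaturationCheck {

    /** True when every thread is busy and work is already waiting in the queue. */
    static boolean isSaturated(ThreadPoolExecutor pool) {
        return pool.getActiveCount() >= pool.getMaximumPoolSize()
                && !pool.getQueue().isEmpty();
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                2, 2, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch started = new CountDownLatch(2);
        CountDownLatch release = new CountDownLatch(1);

        System.out.println("Idle pool saturated? " + isSaturated(pool));

        // Three blocking tasks: two occupy the threads, the third waits in the queue.
        for (int i = 0; i < 3; i++) {
            pool.execute(() -> {
                started.countDown();
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        started.await();
        System.out.println("Busy pool saturated? " + isSaturated(pool)
                + ", tasks queued: " + pool.getQueue().size());

        release.countDown();
        pool.shutdown();
    }
}
```

    A pool that is briefly saturated is no cause for alarm. A pool that stays saturated over several consecutive samples, while its queue keeps growing, is the signature of the problem described in this article.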

    So, how do we solve this?

    I can identify at least two ways to battle the problem.

     

    Extensions to Openfire (plugins, mainly) should be modified in such a way that they use dedicated resources while executing code. A bit of defensive programming will go a long way in preventing the problem from occurring. In practice, most problems that I've seen are caused by third party development.

     

    The root of the problem can and should be removed too. For this, Openfire needs to be modified in such a way that it will use a dedicated set of resources for every part of the system. For example, the routing routines, arguably the most important part of Openfire, should get a dedicated thread pool (and database connection pool). The execution of various (all) listeners should no longer be synchronous. Instead, their execution should be offloaded to another dedicated thread pool.
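
    What "a dedicated set of resources" could look like for thread pools is sketched below. This is an illustration of the idea only, not a proposal for the actual Openfire code: the ModulePools class and the module names "listeners" and "routing" are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Gives every named module its own fixed-size thread pool. */
public class ModulePools {

    private final Map<String, ExecutorService> pools = new ConcurrentHashMap<>();
    private final int threadsPerModule;

    public ModulePools(int threadsPerModule) {
        this.threadsPerModule = threadsPerModule;
    }

    /** Returns the pool that belongs to the given module, creating it on first use. */
    public ExecutorService forModule(String module) {
        return pools.computeIfAbsent(module,
                name -> Executors.newFixedThreadPool(threadsPerModule));
    }

    public void shutdown() {
        pools.values().forEach(ExecutorService::shutdownNow);
    }

    public static void main(String[] args) throws Exception {
        ModulePools pools = new ModulePools(2);

        // A misbehaving listener module ties up all of its own threads...
        for (int i = 0; i < 2; i++) {
            pools.forModule("listeners").submit(() -> {
                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // ...but routing has threads of its own and is not held up by them.
        long start = System.nanoTime();
        Future<String> routed = pools.forModule("routing").submit(() -> "delivered");
        String result = routed.get(500, TimeUnit.MILLISECONDS);
        long millis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println(result + " after " + millis + " ms");

        pools.shutdown();
    }
}
```

    The trade-off is a higher total number of threads. In return, a slow module can only exhaust its own pool.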

     

    We, Openfire developers, should put our heads together and come up with a battle plan to eliminate the problem from the Openfire code. In the remainder of the article, I'll elaborate on tactics to be employed by developers of extensions to Openfire.

    Practical guidelines for plugin/component developers

    There are a number of things that you, as a plugin or component developer, can do to prevent the problem from appearing. In order of importance:

    1. Externalize your code

    If the functionality that you're implementing does not require you to access the internal Openfire API, consider developing an application that runs outside of Openfire. An easy way to do this is using Whack to create an external component.

    2. Prefer Component over PacketInterceptor

    Base your implementation on the Component interface, rather than creating a PacketInterceptor (or any other listener type that Openfire provides). Listeners hook into Openfire's stanza processing at a very low level. Additionally, all stanzas need to be checked in order to determine if they are to be processed by your code, which adds overhead.

     

    Components, on the other hand, are addressable by nature. Sending stanzas to Components does not use any hooks in the Openfire core. Instead, it relies on the same routing mechanisms that are used for every stanza.

    3. Use AbstractComponent

    In version 1.2 of the Tinder library, the AbstractComponent class is introduced. This implementation of the Component interface has been designed specifically to prevent the Achilles' heel problem described here. Apart from that, it tackles a couple of other common problems of Component implementations.

     

    I'd suggest using AbstractComponent as the base of every Component that is under development. Even for existing implementations, checking if the implementation can be retrofitted to use AbstractComponent would be sensible.

    4. Apply the Producer/Consumer design pattern manually

    Whenever your code accepts workload, make sure that this work is processed on a different thread from the one that feeds you the work. Release the thread that delivers the workload as soon as possible. An easy way of doing this is by using Java's ExecutorService to implement a producer/consumer pattern.
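
    A minimal sketch of such a hand-off follows. It uses only java.util.concurrent; the OffloadingHandler class, its accept() method and the string that stands in for a stanza are assumptions made for this example, not part of the Openfire or Tinder API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hands incoming work to a private pool, so that the caller's thread is released at once. */
public class OffloadingHandler<T> {

    private final ExecutorService workers;
    private final Consumer<T> processor;

    public OffloadingHandler(int threads, int queueCapacity, Consumer<T> processor) {
        // A bounded queue keeps a flood of work from exhausting memory.
        this.workers = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity));
        this.processor = processor;
    }

    /**
     * Called on the thread that delivers the work (the producer). Returns
     * immediately; false means that the queue was full and the work was refused.
     */
    public boolean accept(T work) {
        try {
            workers.execute(() -> processor.accept(work)); // runs on a consumer thread
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    public void shutdown() throws InterruptedException {
        workers.shutdown();
        workers.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        OffloadingHandler<String> handler = new OffloadingHandler<>(2, 100, stanza -> {
            try {
                Thread.sleep(200); // stands in for slow processing
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            System.out.println("processed " + stanza);
        });

        long start = System.nanoTime();
        boolean accepted = handler.accept("<iq id='1'/>");
        long micros = TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start);
        System.out.println("accepted=" + accepted + ", returned after " + micros + " microseconds");

        handler.shutdown();
    }
}
```

    Refusing work when the queue is full is a deliberate choice in this sketch. What to do with refused work (answer with an error, drop it, log it) depends on your application.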

    5. Avoid synchronous calls

    Avoid code where you wait for something to happen, before you continue work, while keeping the current thread waiting. Instead, queue the task somewhere, and allow the thread to continue work. Listen for the event that you're waiting for, find your queued task, and continue. Beware of tasks that are never continued though!
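
    One way to do this is sketched below, using CompletableFuture from the Java standard library (the orTimeout method requires Java 9 or later). The PendingRequests class and the request identifiers are hypothetical. The timeout is there to deal with tasks that would otherwise never be continued.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Tracks requests that are awaiting a reply, without parking a thread per request. */
public class PendingRequests<R> {

    private final Map<String, CompletableFuture<R>> pending = new ConcurrentHashMap<>();

    /**
     * Registers a request and returns a future for its reply. The future fails
     * with a TimeoutException if no reply arrives in time, so that a queued
     * task cannot linger forever.
     */
    public CompletableFuture<R> register(String requestId, long timeout, TimeUnit unit) {
        CompletableFuture<R> reply = new CompletableFuture<>();
        pending.put(requestId, reply);
        // Forget the entry however the future completes: by reply or by timeout.
        reply.whenComplete((result, error) -> pending.remove(requestId, reply));
        return reply.orTimeout(timeout, unit);
    }

    /** Called when a reply arrives; returns false if nobody is waiting for it. */
    public boolean complete(String requestId, R reply) {
        CompletableFuture<R> waiting = pending.remove(requestId);
        return waiting != null && waiting.complete(reply);
    }

    public int size() {
        return pending.size();
    }

    public static void main(String[] args) throws Exception {
        PendingRequests<String> requests = new PendingRequests<>();

        // Continue the work in a callback, instead of blocking until the reply is in.
        requests.register("q1", 1, TimeUnit.SECONDS)
                .thenAccept(reply -> System.out.println("q1 answered: " + reply));
        requests.complete("q1", "result");

        // A request that is never answered is cleaned up by its timeout.
        requests.register("q2", 100, TimeUnit.MILLISECONDS)
                .exceptionally(error -> {
                    System.out.println("q2 gave up: " + error);
                    return null;
                });
        Thread.sleep(300);
        System.out.println("still pending: " + requests.size());
    }
}
```

    The thread that calls register() is free to go as soon as the method returns. The callback runs on whichever thread delivers the reply (or on the timer thread, in case of a timeout), so keep the callback itself short, or hand it off to a pool of your own.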

    Conclusion

    An XMPP server is primarily a router. Based on the addressing information that's available in each stanza, this router delivers stanzas to recipients. Openfire's code should minimize the influence that other, secondary functionality has on the performance of the routing functionality. This can be achieved by defining modules for each set of functionality. Each module uses a dedicated set of resources (such as a thread pool and a database connection pool) that cannot be used outside of the module.