[Bug] Deadlock in PrivacyList logic

I am experiencing a deadlock in the OpenFire 3.9.3 (not sure if present in the last stable version), when using PrivacyLists.

High level description

I implemented a custom OpenFire plugin, which manages PrivacyLists. At some point the plugin’s worker thread calls PrivacyListManager.getDefaultPrivacyList() to get default privacy list for some user. But my plugin happens to deadlock sometimes on this call.

The problem is that two different threads use synchronized() call on two different objects in a different order.

It is caused when one client thread processing presence update calls PresenceManager.probePresence() and a different thread tries to load a default privacy list for the same user, calling PrivacyListManager.getDefaultPrivacyList().

Detailed explanation

I am using a thread dump (bottom of this message) as a reference.

Thread:TaskExecutor.0:BLOCKED

Call flow (from top to bottom, first line causes call of the second line and so on):

  1. [Irrelevant calls prior this call]
  2. PrivacyListManager.getDefaultPrivacyList("test@demo.com"), called from my application code - plugin.
  1. [Irrelevant calls, without nested locks]
  2. RosterManager.getRoster("test@demo.com"); called from PrivacyList. -> updateList -> updateList -> getRoster.
  • synchronized (("test@demo.com" + " ro").intern()) , in RosterManager.java:112. Deadlock. This lock is held by the thread Thread:client-9:BLOCKED.
  1. [Deadlock]

Thread:client-9:BLOCKED

Call flow (from top to bottom, first line causes call of the second line and so on):

  1. [Irrelevant calls prior this call]
  2. RosterManager.getRoster("test@demo.com"); called from PresenceManagerImpl.canProbePresence(PresenceManagerImpl.java:331)
  • synchronized (("test@demo.com" + " ro").intern()) , in RosterManager.java:112
  1. [Irrelevant calls, without nested locks]
  2. GroupManager.getSharedGroups()
  • synchronized("test@demo.com".intern()), in GroupManager:454. Deadlock. This lock is held by the thread Thread:TaskExecutor.0
  1. [Deadlock]

Workaround

As the following locking order: synchronized ((username + " ro").intern()) , synchronized(username.intern()) is common in the OpenFire (code path by RosterManager.getRoster()) I would suggest to preserve this locking order also in the getDefaultPrivacyList() call. The workaround (which worked for me) is to call getDefaultPrivacyList() in the synchronized ((username + " ro").intern()) block:

synchronized ((username + " ro").intern()) {          // Following call synchronizes on username.intern(), then on  (username + "ro").          final PrivacyList list = privListManager.getDefaultPrivacyList(username);
          // Do your job here
     // ...
}

As the synchronized() block uses reentrant locking, when execution flow reaches getRoster() call the thread already holds username + " ro" lock.

ThreadDump

Below is the thread dump from the deadlocked OpenFire:

Thread:TaskExecutor.0:BLOCKED
--> org.jivesoftware.openfire.roster.RosterManager.getRoster(RosterManager.java:113)
--> org.jivesoftware.openfire.privacy.PrivacyList.getRoster(PrivacyList.java:247)
--> org.jivesoftware.openfire.privacy.PrivacyList.updateList(PrivacyList.java:230)
--> org.jivesoftware.openfire.privacy.PrivacyList.updateList(PrivacyList.java:209)
--> org.jivesoftware.openfire.privacy.PrivacyList.<init>(PrivacyList.java:108)
--> org.jivesoftware.openfire.privacy.PrivacyListProvider.loadDefaultPrivacyList(PrivacyListProvider.java:243)
--> org.jivesoftware.openfire.privacy.PrivacyListManager.getDefaultPrivacyList(PrivacyListManager.java:157)
--> TRUNCATED (our application code, holding no locks) Thread:client-9:BLOCKED
--> org.jivesoftware.openfire.group.GroupManager.getSharedGroups(GroupManager.java:455)
--> org.jivesoftware.openfire.roster.RosterManager.getSharedGroups(RosterManager.java:190)
--> org.jivesoftware.openfire.roster.Roster.<init>(Roster.java:128)
--> org.jivesoftware.openfire.roster.RosterManager.getRoster(RosterManager.java:116)
--> org.jivesoftware.openfire.spi.PresenceManagerImpl.canProbePresence(PresenceManagerImpl.java:331)
--> org.jivesoftware.openfire.spi.PresenceManagerImpl.probePresence(PresenceManagerImpl.java:341)
--> org.jivesoftware.openfire.handler.PresenceUpdateHandler.initSession(PresenceUpdateHandler.java:255)
--> org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:153)
--> org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:137)
--> org.jivesoftware.openfire.handler.PresenceUpdateHandler.process(PresenceUpdateHandler.java:201)
--> org.jivesoftware.openfire.PresenceRouter.handle(PresenceRouter.java:148)
--> org.jivesoftware.openfire.PresenceRouter.route(PresenceRouter.java:84)
--> org.jivesoftware.openfire.spi.PacketRouterImpl.route(PacketRouterImpl.java:84)
--> org.jivesoftware.openfire.net.StanzaHandler.processPresence(StanzaHandler.java:348)
--> org.jivesoftware.openfire.net.ClientStanzaHandler.processPresence(ClientStanzaHandler.java:100)
--> org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:265)
--> org.jivesoftware.openfire.net.StanzaHandler.process(StanzaHandler.java:187)
--> org.jivesoftware.openfire.nio.ConnectionHandler.messageReceived(ConnectionHandler.java:181)
--> org.apache.mina.common.support.AbstractIoFilterChain$TailFilter.messageReceived(AbstractIoFilterChain.java:570)
--> org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
--> org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
--> org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
--> org.apache.mina.common.IoFilterAdapter.messageReceived(IoFilterAdapter.java:80)
--> org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
--> org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
--> org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
--> org.apache.mina.filter.codec.support.SimpleProtocolDecoderOutput.flush(SimpleProtocolDecoderOutput.java:58)
--> org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:185)
--> org.apache.mina.common.support.AbstractIoFilterChain.callNextMessageReceived(AbstractIoFilterChain.java:299)
--> org.apache.mina.common.support.AbstractIoFilterChain.access$1100(AbstractIoFilterChain.java:53)
--> org.apache.mina.common.support.AbstractIoFilterChain$EntryImpl$1.messageReceived(AbstractIoFilterChain.java:648)
--> org.apache.mina.filter.executor.ExecutorFilter.processEvent(ExecutorFilter.java:239)
--> org.apache.mina.filter.executor.ExecutorFilter$ProcessEventsRunnable.run(ExecutorFilter.java:283)
--> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
--> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
--> org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
--> java.lang.Thread.run(Thread.java:745)

Awesome post! I have filed it in Jira: https://igniterealtime.org/issues/browse/OF-944

Do you think an acceptable solution would be if we remove the " ro" altogether, so that every code uses the same lock monitor?

For easy reference, a copy/paste of org.jivesoftware.openfire.privacy.PrivacyListManager#getDefaultPrivacyList:

public PrivacyList getDefaultPrivacyList(String username) {
    // Check if we have the default list in the cache
    String cacheKey = getDefaultCacheKey(username);
    PrivacyList list = listsCache.get(cacheKey);
    if (list == null) {
        synchronized (username.intern()) {
            list = listsCache.get(cacheKey);
            if (list == null) {
                // Load default list from the database
                list = provider.loadDefaultPrivacyList(username);
                if (list != null) {
                    listsCache.put(cacheKey, list);
                }
            }
        }
    }
    return list;
}

Instead of wrapping the entire call in another lock, can we loosen up things? I think the mutex is there only to ensure thread safety of the cache. If so, we could:

  • use a different mutex in line 6 (one that’s not uses by other code), as well as

  • move the database call in line 10 outside of the synchronized code block (it’s not wise to lock threads on network IO anyway).

We’d probably run the risk of doing some duplicate database calls, but I think that’s preferable to potential deadlocking.

If we were on JDK 8, we could use Map.computeIfAbsent() if Cache is based on ConcurrentHashMap and completely get rid of the synchronization.

This method is designed for this kind of stuff (get, compute, put), while it’s still thread-safe, see it’s JavaDoc.

Is there a desire to push this potential patch in before Openfire 4.0 release?

We can’t, because afaik we want to support JDK 7 as well.

Anybody wish to comment on this issue with regards to if it still exists with Openfire 4.0.3 ?