Cluster member node doesn't join the cluster senior node

One particularly nasty problem has been keeping us busy all day. I’m hoping that someone can help us out, because we’re running out of ideas on how to proceed.

Our current Openfire setup (two cluster nodes running Openfire 3.4.1 with a couple of patches) runs fine in my test environment. Moving the exact same code to two other machines (which form another XMPP domain) gives us a strange problem: the first node starts up fine. The second node, however, fails to join the cluster, giving us this exception:

2007-11-28 12:30:08.651 Oracle Coherence 3.3/387 <Info> (thread=pool-12-thread-1, member=n/a): Loaded operational configuration from resource "jar:file:/srv/openfire/plugins/enterprise/lib/coherence.jar!/tangosol-coherence.xml"
2007-11-28 12:30:08.662 Oracle Coherence 3.3/387 <Info> (thread=pool-12-thread-1, member=n/a): Loaded operational overrides from resource "jar:file:/srv/openfire/plugins/enterprise/lib/coherence.jar!/tangosol-coherence-override-dev.xml"
2007-11-28 12:30:08.665 Oracle Coherence 3.3/387 <Info> (thread=pool-12-thread-1, member=n/a): Loaded operational overrides from resource "file:/srv/openfire/enterprise/tangosol-coherence-override.xml"
Oracle Coherence Version 3.3/387 Grid Edition: Development mode
Copyright (c) 2000-2007 Oracle. All rights reserved. 0.1627200 secs]
2007-11-28 12:30:09.537 Oracle Coherence GE 3.3/387 <Warning> (thread=pool-12-thread-1, member=n/a): UnicastUdpSocket failed to set receive buffer size to 1428 packets (2096304 bytes); actual size is 89 packets (131071 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.
2007-11-28 12:30:09.876 Oracle Coherence GE 3.3/387 <D5> (thread=Cluster, member=n/a): Service Cluster joined the cluster with senior service member n/a
2007-11-28 12:30:09.892 Oracle Coherence GE 3.3/387 <Error> (thread=Cluster, member=n/a): Assertion failed:
        at com.tangosol.coherence.component.net.Member.configure(Member.CDB:6)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ClusterService$NewMemberAnnounceReply.onReceived(ClusterService.CDB:66)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onMessage(Service.CDB:9)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onNotify(Service.CDB:123)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ClusterService.onNotify(ClusterService.CDB:3)
        at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:35)
        at java.lang.Thread.run(Thread.java:619)
2007-11-28 12:30:09.892 Oracle Coherence GE 3.3/387 <Error> (thread=Cluster, member=n/a): Terminating ClusterService due to unhandled exception: com.tangosol.util.AssertionException
2007-11-28 12:30:09.892 Oracle Coherence GE 3.3/387 <Error> (thread=Cluster, member=n/a):
com.tangosol.util.AssertionException:
        at com.tangosol.coherence.Component._assertFailed(Component.CDB:12)
        at com.tangosol.coherence.Component._assert(Component.CDB:3)
        at com.tangosol.coherence.component.net.Member.configure(Member.CDB:6)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ClusterService$NewMemberAnnounceReply.onReceived(ClusterService.CDB:66)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onMessage(Service.CDB:9)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.Service.onNotify(Service.CDB:123)
        at com.tangosol.coherence.component.util.daemon.queueProcessor.service.ClusterService.onNotify(ClusterService.CDB:3)
        at com.tangosol.coherence.component.util.Daemon.run(Daemon.CDB:35)
        at java.lang.Thread.run(Thread.java:619)
2007-11-28 12:30:09.894 Oracle Coherence GE 3.3/387 <D5> (thread=Cluster, member=n/a): Service Cluster left the cluster

We’ve checked everything we could think of, including:

  • Firewalls are disabled;

  • We used udpcast to make sure that multicasting between the hosts works;

  • We’ve verified that (one) UDP packet arrives at the ‘master’ node every time we try to join the second node to the cluster;

  • JVMs are identical.

Does anyone have suggestions?

Hi Guus,

  1. Does it work if you shut down your test system? Maybe a very lame suggestion, especially if you have a separate network for your test servers.

  2. Do the servers have more than one network interface? If so, it may help to set “tangosol.coherence.localhost”. Reference: http://wiki.tangosol.com/display/COH32UG/Command+Line+Setting+Override+Feature and http://wiki.tangosol.com/display/COH32UG/unicast-listener#unicast-listener-address: “The localhost setting may not work on systems that define localhost as the loopback …”

  3. “Clustering Openfire - Unicast” may be another way to test whether it works without multicast … I’m not sure whether such a configuration helps you, or how it would help to get multicast working.
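
For suggestion 2, a small Java sketch like the one below could help spot which address each node might bind to. It only uses the standard java.net API to list the active, non-loopback IPv4 addresses per interface and prints a matching “tangosol.coherence.localhost” override for each one. The class and method names (ListLocalAddresses, isCandidate, jvmFlag, candidateAddresses) are made up for this example, and restricting the list to IPv4 is an assumption, not something Coherence is known to require here.

```java
import java.io.UncheckedIOException;
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Openfire or Coherence.
final class ListLocalAddresses {

    // Assumption: only non-loopback IPv4 addresses are worth trying.
    static boolean isCandidate(InetAddress address) {
        return address instanceof Inet4Address && !address.isLoopbackAddress();
    }

    // Builds the JVM system property override for a given address.
    static String jvmFlag(String hostAddress) {
        return "-Dtangosol.coherence.localhost=" + hostAddress;
    }

    // One line per candidate address on every interface that is up.
    static List<String> candidateAddresses() {
        List<String> lines = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> interfaces =
                    NetworkInterface.getNetworkInterfaces();
            if (interfaces == null) {
                return lines; // no interfaces reported
            }
            for (NetworkInterface nic : Collections.list(interfaces)) {
                if (!nic.isUp()) {
                    continue;
                }
                for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                    if (isCandidate(address)) {
                        lines.add(nic.getName()
                                + " (multicast=" + nic.supportsMulticast() + "): "
                                + jvmFlag(address.getHostAddress()));
                    }
                }
            }
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : candidateAddresses()) {
            System.out.println(line);
        }
    }
}
```

Running this on both nodes and comparing the output shows whether either machine has more than one candidate address. If so, passing the printed flag for the intended interface when starting the JVM is one way to try the override.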

LG

We’ve tried most of that. The thing is that multicast communication seems to work fine. We’ve tried UDPCast and the Coherence MulticastTest tool: both succeed flawlessly.