Issues with 2 node cluster + S2S to other domain (git HEAD)

Dear Devs,

We are attempting to set up a test cluster of two nodes with a third host talking to the cluster via S2S.

When running the two nodes as a standalone cluster, XMPP clients talking to the cluster behave as expected when the node they are connected to goes down: they receive a disconnect, and on rejoin everything works as expected.

We’ve encountered issues when using a separate XMPP host via S2S: communication is lost or left dangling when the cluster node that is the endpoint of the S2S connection goes down. When we then attempt to send further “groupchat” messages (causing the creation of new S2S connections), we are in a bad state.

Example scenario:

  • Client 1 Spark connects as dan@xmpp.domain to “testroom@conference.xmpp.domain” -> directed to cluster node lh01.xmpp.domain
  • Client 2 Spark connects as test@dh01.standalone.domain to “testroom@conference.xmpp.domain” -> direct connection to host, S2S connection created to lh01.xmpp.domain via the load balancer.

At this point, both clients see each other in the room and can exchange group chat messages.

  • Halt of lh01.xmpp.domain node

The server shuts down, the cluster promotes the junior to senior (lh02) and Client 1 Spark is forced to reconnect - and reconnects successfully to the room. No other participants are visible in the room.

Client 2 Spark does not receive any notice or visible indication that an error has occurred. The logs of “dh01.standalone.domain” show the disconnection of the S2S connection.

When typing further messages in Client 2 Spark, the following is received:

qut
7Df031
<not-acceptable xmlns="urn:ietf:params:xml:ns:xmpp-stanzas"/>
<offline/>
<delivered/>
<displayed/>
<composing/>
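To narrow down whether dh01 can still open a fresh S2S connection after the failover, a plain TCP reachability check run from dh01 may help. The following is only a sketch: the hostnames are the ones used in this report, port 5269 is the S2S port handled by the load balancer, and `can_connect` is a hypothetical helper name. It only shows whether the port accepts connections, not whether XMPP stream negotiation succeeds.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


# Assumption: S2S traffic uses port 5269, as listed for the load balancer.
S2S_PORT = 5269
# Load-balanced name first, then the individual cluster nodes.
TARGETS = ["xmpp.domain", "lh01.xmpp.domain", "lh02.xmpp.domain"]

if __name__ == "__main__":
    for host in TARGETS:
        state = "reachable" if can_connect(host, S2S_PORT) else "NOT reachable"
        print(f"{host}:{S2S_PORT} {state}")
```

Running this before and after halting lh01 should show whether the load-balanced name still leads to a live node once lh02 has taken over.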

Version / Setup information:

Openfire version: Git checkout of https://github.com/igniterealtime/Openfire/commit/34971f9562fbe07cb7befebb120f883f66493850

Platform: Linux CentOS 6.8

Database: Oracle 12.1

Load balancer: HAProxy for ports 5222 and 5269

Cluster Node1 host: lh01.xmpp.domain

Cluster Node1 XMPP domain: xmpp.domain

Cluster Node2 host: lh02.xmpp.domain

Cluster Node2 XMPP domain: xmpp.domain

Node3 host: dh01.standalone.domain

Node3 XMPP domain: dh01.standalone.domain

Relevant DNS entries (others like the oracle host are not shown):

lh01.xmpp.domain. IN A 10.0.0.11

lh02.xmpp.domain. IN A 10.0.0.21

xmpp.domain. IN A 10.0.0.50

conference.xmpp.domain. IN CNAME xmpp.domain

dh01.standalone.domain. IN A 10.0.0.60

conference.dh01.standalone.domain. IN CNAME dh01.standalone.domain.

_xmpp-client._tcp.xmpp.domain. IN SRV 0 0 5222 xmpp.domain.

_xmpp-server._tcp.xmpp.domain. IN SRV 0 0 5222 xmpp.domain.

_xmpp-server._tcp.conference.xmpp.domain. IN SRV 0 0 5222 conference.xmpp.domain.

NOTE: The dh01 IP address as listed above is the HAProxy IP address, so that incoming connections to dh01 look like they are coming from the “xmpp.domain” IP address rather than from individual cluster nodes.
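As a sanity check on the SRV records listed above, the sketch below parses them and compares each port against the conventional XMPP defaults (5222 for client connections, 5269 for server-to-server). Those defaults are an assumption about what this setup intends, not something confirmed here, and the function names are hypothetical. With the records exactly as listed, both `_xmpp-server` entries are flagged because they point at 5222.

```python
# SRV records copied from the DNS entries listed above.
RECORDS = """\
_xmpp-client._tcp.xmpp.domain. IN SRV 0 0 5222 xmpp.domain.
_xmpp-server._tcp.xmpp.domain. IN SRV 0 0 5222 xmpp.domain.
_xmpp-server._tcp.conference.xmpp.domain. IN SRV 0 0 5222 conference.xmpp.domain.
"""

# Assumption: conventional XMPP ports, no custom port mapping on the load balancer.
EXPECTED_PORTS = {"_xmpp-client": 5222, "_xmpp-server": 5269}


def parse_srv(line):
    """Parse 'name IN SRV priority weight port target'; return None for other lines."""
    parts = line.split()
    if len(parts) != 7 or parts[1:3] != ["IN", "SRV"]:
        return None
    name, _, _, priority, weight, port, target = parts
    return {
        "name": name,
        "service": name.split(".")[0],  # e.g. "_xmpp-server"
        "priority": int(priority),
        "weight": int(weight),
        "port": int(port),
        "target": target,
    }


def unexpected_ports(text):
    """Return (name, actual_port, expected_port) for SRV records off the default port."""
    findings = []
    for line in text.splitlines():
        record = parse_srv(line)
        if record is None:
            continue
        expected = EXPECTED_PORTS.get(record["service"])
        if expected is not None and record["port"] != expected:
            findings.append((record["name"], record["port"], expected))
    return findings


if __name__ == "__main__":
    for name, port, expected in unexpected_ports(RECORDS):
        print(f"{name}: SRV port {port}, conventional port is {expected}")
```

If the 5222 values are intentional (for example, the load balancer remaps ports), the flagged entries can be ignored; otherwise they may be worth checking against the live zone.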

I have generated trusted certs that have all appropriate alternate names and imported them into the necessary nodes.

I realise I’m potentially asking for a world of pain using HEAD from git - if there’s a specific version I should be trying this with, please let me know.

I have the lab cluster still up and running for further investigations / testing.

Thanks for any pointers that can be given - even if it’s “add more debugging to the server connections here and show us the logs”.

Kind regards,

Dan