Losing Messages with 3.9.x

Hi,

Since the upgrade from 3.8.x to 3.9.x, I get reproducible problems with messages not being delivered.

The problem is as follows:

You send someone a message and after some time you ask why he hasn’t responded yet. It turns out he never got the first message.

As we are in the same office, I checked their clients and their message logs; there was indeed no message received.

As soon as a discussion is started, everything is fine and no more messages are lost.

I could reproduce this now with 4 different people. At first I thought it was a client problem, because it mostly happened between a Miranda and a Pidgin client, but the last case was between 2 Pidgin clients.

As the clients are all on desktop PCs with Ethernet connections, I doubt it’s a stream management problem.

The Openfire logs don’t show any problems around that time.

Any idea how this could be fixed? This is kind of critical, as we can’t be sure that anything we write is transmitted.

greetings

Christian

Hm, you are the first one to report such behavior.

Unfortunately I have no idea, especially if there’s nothing in the logs.

Do your users use Privacy Lists? Maybe it’s related, but then all messages would be blocked.

If I understand you correctly, only the first message is lost? But how can the discussion be started when the message never arrives?

A bug-fix version may help, too, e.g. 3.9.3, 3.9.2…

Thank you for your reply!

I’m currently using 3.9.3.

To clarify the behavior, here is an example from yesterday that I experienced myself:

(12:13:30) christian.michallek: blabla?
(12:30:27) christian.michallek: blablabla
(12:49:49) christian.michallek: why dont you respond?

Only the last line reached the user. I checked that myself, because I couldn’t believe it.

The first 2 lines were lost and couldn’t be found anywhere.

There is one similarity among all users with this problem: every one of them has multiple closed sessions/resources, which I asked about here:

Perhaps it has something to do with this:

https://igniterealtime.org/issues/browse/OF-818

greetings

christian

There is one similarity among all users with this problem: every one of them has multiple closed sessions/resources, which I asked about here:

Yes, it seems like in that case, Openfire routes messages to one of the old (dead) sessions.

It could be related to OF-818, but I doubt it, because the sessions likely don’t have a negative priority (which is only set when the client sends presence with a negative priority, and they probably don’t do that).
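To make the suspicion concrete, here is a minimal Python sketch of priority-based delivery for a message addressed to a user rather than to a specific resource. It is a toy model only, not Openfire’s routing code: the `Session` class and `route_message` function are hypothetical names. It assumes the server picks the highest non-negative priority session and that a dead session is still in its table.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Session:
    resource: str
    priority: int
    alive: bool  # False models a ghost: the client is gone, the entry remains
    inbox: List[str] = field(default_factory=list)

    def deliver(self, body: str) -> bool:
        """Return True only if a real client actually receives the message."""
        if not self.alive:
            return False  # message silently disappears
        self.inbox.append(body)
        return True


def route_message(sessions: List[Session], body: str) -> Optional[Session]:
    """Hand the message to the highest-priority session with priority >= 0."""
    candidates = [s for s in sessions if s.priority >= 0]
    if not candidates:
        return None  # sessions with negative priority never get such messages
    target = max(candidates, key=lambda s: s.priority)  # first one wins a tie
    target.deliver(body)
    return target


sessions = [Session("old", 0, alive=False), Session("current", 0, alive=True)]
chosen = route_message(sessions, "blabla?")
print(chosen.resource, sessions[1].inbox)
```

In this model the ghost and the live session have the same priority, the ghost is picked, and the live client’s inbox stays empty, which matches the symptom of a message that is neither delivered nor logged anywhere.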

I have no concrete idea right now, maybe it’s also related to LDAP.

If you want, you could try with the latest nightly build:

I guess I could solve my problem by removing these dead sessions, but I can’t close them via the Openfire web console.

Any idea how I can get rid of them?

I’m not sure if I’m desperate enough to try out the nightly builds, but I’ll keep this option open.

This is similar to what I reported. I call the dead sessions ghost sessions. The dead/ghost sessions do not show up in the admin interface. The only way I have gotten rid of them is to restart Openfire. No LDAP is in use.

Thanks a lot!

I never thought I could get rid of the sessions by restarting; I thought they were somehow persistent.

But it worked, they are all gone now.

I’ll keep an eye out if they come back again.

Seems like these ghosts are haunting me; some are already back.

This leads to an additional problem:

Users are shown as online, but they are not. There is no client connected, yet my roster shows them as away.

They are not visible at all in the Openfire management console.

I guess regular restarts of Openfire are the solution for now?

The problem persists in the recent nightly build.

What I could find out so far:

  • The problem so far exists only with Pidgin, and only for users who haven’t set their resource themselves

  • System hibernation seems to be the trigger for this problem

So I guess the following happens:

  • The client gets disconnected by hibernation or some other network interruption.

  • Openfire does not notice this correctly and keeps the zombie alive, even if it doesn’t get a ping reply.

  • Pidgin reconnects and gets a “resource in use” reply.

  • Pidgin generates a new resource and connects with that one.

  • This repeats until you have dozens of zombie sessions.

Clients with manually set resource IDs never got this problem.
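The guessed sequence above can be played through in a small Python toy model. Everything in it is an assumption for illustration: `ToyServer` is not Openfire, the `pidgin`, `pidgin1`, `pidgin2` naming is a made-up stand-in for whatever resource names the client really generates, and the model only covers the generated-resource case.

```python
import itertools
from typing import Dict, List


class ToyServer:
    """Toy model: a server that never cleans up silently dropped sessions."""

    def __init__(self) -> None:
        self.sessions: Dict[str, bool] = {}  # resource -> client still there?

    def connect(self, resource: str) -> bool:
        if resource in self.sessions:
            return False  # "resource in use"; the stale entry is NOT kicked
        self.sessions[resource] = True
        return True

    def lose_client(self, resource: str) -> None:
        # Client vanished (hibernation); the server keeps the session entry.
        self.sessions[resource] = False

    def ghosts(self) -> List[str]:
        return [r for r, alive in self.sessions.items() if not alive]


def login_with_generated_resource(server: ToyServer, base: str = "pidgin") -> str:
    """Try the base name first, then base1, base2, ... until one is free."""
    for n in itertools.count():
        resource = base if n == 0 else f"{base}{n}"
        if server.connect(resource):
            return resource


server = ToyServer()
for _ in range(5):  # five hibernate/resume cycles
    current = login_with_generated_resource(server)
    server.lose_client(current)

print(server.ghosts())
```

Each cycle leaves one more dead entry behind, so the number of zombies grows with every hibernation until the server is restarted.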

So I have 2 problems here:

  • Resource conflicts don’t get resolved, even with the option to close the old resource when a conflict occurs

  • XMPP Ping doesn’t seem to work correctly, as clients are shown online that aren’t connected (the PC isn’t even powered on)
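For reference, this is roughly what a ping-based liveness check would be expected to do, sketched in Python. The stanza follows XEP-0199 (`urn:xmpp:ping`); the JIDs, the `should_close` helper and the 10-second reply timeout are assumptions for illustration, not Openfire settings or code.

```python
import xml.etree.ElementTree as ET


def build_ping(server_domain: str, client_jid: str, ping_id: str) -> str:
    """Serialize a XEP-0199 ping request from the server to one client session."""
    iq = ET.Element(
        "iq",
        {"from": server_domain, "to": client_jid, "id": ping_id, "type": "get"},
    )
    ET.SubElement(iq, "ping", {"xmlns": "urn:xmpp:ping"})
    return ET.tostring(iq, encoding="unicode")


def should_close(sent_at_ms: int, replied: bool, now_ms: int,
                 reply_timeout_ms: int = 10_000) -> bool:
    """Expected rule: no reply within the timeout means the session is dead."""
    return (not replied) and (now_ms - sent_at_ms >= reply_timeout_ms)


# Hypothetical addresses, for illustration only.
print(build_ping("example.org", "user@example.org/pidgin", "ping-1"))
print(should_close(sent_at_ms=0, replied=False, now_ms=15_000))
```

A powered-off PC can never answer such a ping, so under this rule its session should be closed once the timeout has passed; the behavior described above suggests that this cleanup step is not happening.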

Perhaps it’s a good idea to create a ticket?

greetings

Christian

Thanks for your help, it’s interesting input!

I can somewhat follow the problem and I think it’s related to the memory leak issues that were reported.

There are a lot of unclosed connections/sessions, which result in the leak and in your problem.

It is said that the memory leaks don’t occur in 3.9.1 (or at least they are not that severe).

Maybe you can try 3.9.1 and report back whether the problem is gone. That way, we could narrow it down to a problem introduced in 3.9.2 and confirm that both problems have the same cause.

Thanks, good to know.

Can I just downgrade to 3.9.1, or is this not recommended?

Yes, I think a downgrade should work. See https://community.igniterealtime.org/message/240240

As CSH said, a downgrade should work (it worked for me). So, I see that such ghost sessions are not even shown on the Sessions page? That would explain why I didn’t see anything unusual in the Admin Console and yet it was running out of memory after a few days (compared to 20-30 days on 3.9.1).

Interesting info about hibernating, though it is strange to hear that hibernation is so widely used. We don’t have such an issue with ghost sessions (we are not using Pidgin). Our users are all on a nightly build of Spark (2.7.0 632 build or something), and as I have set a low value for xmpp.client.idle (30000, which is 30 seconds), I often see this behavior: a PC goes into standby after a long idle period, after some time Spark loses the connection for a second, the user goes offline for a moment, then is instantly online and back to away. I notice this because I use the “Notify when user is available” option in Spark: it notifies me that the user is online, but when I check, he’s away. It looks like XMPP Ping can’t reach a client on a PC in standby/sleep mode and closes the connection, but Spark is still active and restores the connection. So it constantly switches between online and offline every 30 seconds. Very annoying.
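The 30-second flapping described above can be reproduced with a tiny Python timeline. This is a toy model, not Openfire’s idle handling: it assumes the sleeping client produces no traffic at all, that the server closes the connection once the idle time reaches the xmpp.client.idle value of 30000 ms, and that the client reconnects instantly. The `simulate` function and the event labels are made up for illustration.

```python
from typing import List, Tuple

IDLE_TIMEOUT_MS = 30_000  # the xmpp.client.idle value quoted in the post


def simulate(duration_ms: int,
             timeout_ms: int = IDLE_TIMEOUT_MS) -> List[Tuple[int, str]]:
    """Return (time, event) pairs for a client that never sends any traffic."""
    events: List[Tuple[int, str]] = []
    last_activity_ms = 0
    for now_ms in range(0, duration_ms + 1, 1000):  # check once per second
        if now_ms - last_activity_ms >= timeout_ms:
            events.append((now_ms, "closed: idle"))
            events.append((now_ms, "reconnected"))
            last_activity_ms = now_ms  # the reconnect counts as activity
    return events


for time_ms, event in simulate(90_000):
    print(time_ms, event)
```

With these assumptions the connection is closed and restored at 30, 60 and 90 seconds, which is the offline/online switching seen in the roster. A larger idle value would make the flapping less frequent in this model, at the cost of noticing truly dead clients later.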

I downgraded yesterday; only the encryption of the MySQL username/password was a problem, but that was easily resolved.

So far it looks GOOD, but I want to wait at least 48 hours to be absolutely sure.

About the ghost sessions:

They are only visible if the corresponding user is online; otherwise they are invisible, but the session overview still counts them.

That was my first clue that something was wrong: the first 4 tabs on the session overview were empty.

Just to confirm: 3.9.1 works fine, no more ghost sessions and therefore no more lost messages.

Perhaps 3.9.2+ should be pulled from the website, or there should be a warning to wait for the next version?

Thanks to everyone who helped me!

Christian helpfully posted on another thread about this problem.

For the sake of tracking it, the thread is here https://community.igniterealtime.org/message/239911

There are two sides to the coin. I completely understand (as an admin hit by the memory leak in 3.9.2) that this is annoying, that you have to apologize to your users for service downtime, do downgrades, etc. On the other hand, if we just pull it down, it will be harder to pinpoint where the problem is. Without your reports we wouldn’t know about ghost sessions. Also, it looks like only (or mostly) Pidgin users are hit by this. It’s not like we get dozens of reports on the forums about this, so for some it probably works OK. We need more reports to find out what exactly is happening.

I think I will post an announcement (which only 1% of users read) about this issue and maybe a poll to find out how many are running 3.9.3 (which again won’t get many answers)…

Thanks wroot.

I’ll post a few more details about my set up on the thread I linked to above.

I have filed a ticket for this issue and linked it with a memory leak ticket - OF-829

Where can I get the 3.9.1 deb package? We are also affected by the “ghosts”, but on a Linux server.

Thanks

Andreas