Event loss issue

sriggi · March 27, 2017, 12:07pm

Dear all,

I’m experiencing problems with the event system in my application and I would like to ask you suggestions on how to debug and understand the error.
I have 3 C++ device servers with some attributes (created dynamically at device startup with Yat) and a C++ device client subscribing to CHANGE/USER/PERIODIC events from those attributes (about 20 attributes in total) with a unique callback. The maximum polling period for those attributes is 1000 ms. The client has a user thread where events are processed (e.g. event data received in the callback are passed to the worker thread).

The problem I’m facing is that after the client has run for a while some of the events are not received anymore. I suspect that I have some deadlock (+memory issue?).

One of the problem in the client, if I’m not wrong, it should be addressed to code that pushes events manually from a user thread.
Infact I occasionally saw these error messages in the client:


Tango exception   
Severity = ERROR   
Error reason = API_CommandTimedOut   
Desc : Not able to acquire serialization (dev, class or process) monitor   
Origin : TangoMonitor::get_monitor

and also:


Tango::ZmqEventConsumer::push_heartbeat_event() timeout on channel monitor of (dserver address)

and these (apparently) disappeared if I shut down event pushing in the user threads (e.g. leave only events generation from polling thread).
Could you give me more details on this Tango core messages?

However the issue with the event stop did not disappear.
I understand that it is very difficult being helped without code sketches or additional details, so I would like to ask first if you ever experienced the same (Tango version 9.2.5a, zmq v4.2.0, omni v4.2.1) and how to debug the problem. Is there any flag (e.g. related to zmq) I can turn on to explore the issue in detail both in the servers and in the client? Could it be a zmq issue (e.g. some queue full)?
Right now I’m preparing a very simple device client that just subscribes to the same events and monitor the received event rate to understand more but if you have hints that would help me a lot.

Thanks for your support,

Simone

sriggi · March 28, 2017, 5:26pm

Dear all,

I did some further tests and produced a simple example to reproduce the issue (see attachment).
Basically there are 3 device servers with few attributes and a client that just subscribe to server attributes.
To run the example, configure the client (EventListener device) properties (rx_proxy_name, spf_proxy_name, ds_proxy_name) giving the device server names (as configured in your host) and run the four servers.

In the Tango configuration of my laptop (Ubuntu 16.04, Tango v9.2.5a, omniORB v4.2.1, zmq v4.2.0), after a while (few minutes) I do not receive anymore the events with the smallest polling period (1000 ms).
I run the same test in another Tango host (Scientific Linux 6, Tango v9.2.2, zmq v4.2.1) and I did not see the issue so there should be something wrong in my laptop configuration or a very trivial error that I’m doing. Any hints or suggestion to see what’s wrong? Could it be related to zmq (e.g. PUB events dropped or a bug?).

If you have time could you run the same test on your Tango host (latest Tango release) to confirm if you see the same problem?

Many thanks for your help,

Simone

agotz · March 29, 2017, 6:48am

Dear Simone,

I will try your code today and let you know if I get the same results. It will be interesting to know if someone else observes the same.

Cheers

Andy

lpivetta · March 29, 2017, 8:54am

Hi guys,
the exception reported we sometimes face as well. As you know is caused by a contention on the tango monitor. This may be caused by too many clients (e.g the polling thread(s) and additional client(s) reading any non polled attributes) accessing the device. Or by a underestimated polling period with respect to method execution time and number of attributes.
Moreover, my feeling is that event the push-event-by-code approach is acquiring some locks in the TANGO core… but this can be better confirmed by someone actively working on it.
Cheers,
Lorenzo

agotz · March 29, 2017, 9:59am

Simone,
I have your servers running on my laptop. I will let you know if I see the same problem or not.
Cheers
Andy

sriggi · March 29, 2017, 10:04am

Dear all,

many thanks for your support!
Yesterday I re-installed Tango+DB on my laptop with a previous zmq version (v4.0.8) and I did not notice the same error. I will repeat tests today.

Thanks again,

Simone

agotz · March 29, 2017, 10:20am

I am running on Ubuntu 16.04 with Tango 9.2.5a and zmq 5 (from Ubuntu repository) and so far I have not seen the issue either.
How soon does it happen when it does happen? I am counting the events too to see if there is any correlation.
Cheers
Andy

sriggi · March 29, 2017, 11:06am

Dear Andy,

many thanks for the prompt report.
I saw the issue after few minutes (<15 minutes). If I’m not wrong you should be using zmq 4.1.4-7. In case I do the check with this lib also.

Cheers,

Simone

agotz · March 30, 2017, 4:56am

Hi Simone,
I ran your servers on my laptop yesterday for 6 hours without any event loss. More than 340000 events were received by EventListener. This morning I have increased the event rate to 10 Hz to see if I can reproduce your problem. So far not. I will let you you know the result.
I checked on Ubuntu 16.04 and indeed I am using version 4.1.4-7 of zmq from the Ubuntu repository. The standard one installed by Ubuntu.
I haven’t noticed anything strange i your code - it looks very well written to me!
Cheers
Andy

rbourtembourg · March 30, 2017, 7:36am

[quote=“lorenzo”]
Moreover, my feeling is that event the push-event-by-code approach is acquiring some locks in the TANGO core… but this can be better confirmed by someone actively working on it.
Cheers,
Lorenzo[/quote]

Hi,
I confirm what Lorenzo was writing. The push-event-by-code approach is acquiring some locks in the TANGO core.
Cheers,
Reynald

sriggi · March 30, 2017, 9:15am

Dear all,

many thanks for trying the sample code. I also run the tests with zmq 4.0.8 and found no issues. Cannot say at present if it was a problem with the specific zmq version (v4.2.0) or if I messed up the TangoDB in my laptop. I will use the current (apparently working) configuration and in case post if I have problems again.

Instead the deadlock with pushing events from the code seem confirmed. Is this particularly affecting the AlarmSystem?

Cheers,

Simone

tgmrt · January 16, 2018, 10:26am

Team,

Any update on this issue, we are facing the issue.

rbourtembourg · January 16, 2018, 10:40am

Dear TCS_GMRT team,

Is the issue you are facing the same as the one you reported on http://www.tango-controls.org/community/forum/c/development/java/attribute-events-gets-missed/ ?
If yes, please comment directly on the forum topic you created.
If not, please give us the maximum of details about your configuration and a good description of the issue you are facing and some ways to reproduce your issue.
Otherwise it will be very difficult for us to help you.

Kind regards,
Reynald

agotz · January 16, 2018, 11:05am

I can only agree with Reynald. Have you mixed up the issues? This issue was resolved as not being reproducible. Can you reproduce it? Under what circumstances?

If you are referring the open issue you created please try to work consistently through the different proposals given by Reynald, Lorenzo and myself. We need to be able to reproduce your issue or have more information. Can you reproduce it in a test server and send us the code?

Cheers

Andy

tgmrt · January 18, 2018, 3:55pm

[quote]I can only agree with Reynald. Have you mixed up the issues? This issue was resolved as not being reproducible. Can you reproduce it? Under what circumstances?
[/quote]
No both are two different issues for us. In this post, we are referring to the push_heartbeat event related issue.

Our scenario:
TANGO version : 9.2.2
ZMQ version : 4.0.7
OS : UBUNTU 16.04 LTS

We have Taurus UI based on PyTango which subscribes to attributes of Java DS on change event. Along with this, we have displayed Taurus labels using a model to display the attribute values on GUI. We have observed the following error when both DS and GUI are running “push_heartbeat” event issue. Attached is the error screenshot for easy reference.

Some explanation for the attached image:
An arrow in Greenish Shade: The issue in discussion.
An arrow in Red: We want to know what it is trying to describe.
An arrow in Blue: Spectrum tango attribute initialized by “##” by Java DS during its init.
An arrow in Orange: Change event response/error caused by the fluctuations (start-stop of DS) in the dserver/lmcds/c03

At this time GUI gets irresponsive and shows above error. We are unable to debug this issue as it is not following a pattern.
When we check the status of dserver DS then it shows exported - false for 1 or 2 sec and again shows true. This gives a client event subscriber a change event (Arrow in Orange) and the issue occurs. We would like to know why dserver DS gets down, if someone has faced the similar issue?

During this issue the GUI client freezes, we suspect that this might be one factor in the freezing of client. Other factors might be CPU & Memory usage.

We would also like to understand more about the push_heartbeat concept.

Note: LST, mncStatus, systemState, UTC … in the image are the Java DS Tango attributes.

Additionally steps that we have performed:
We tried installing ZMQ v4.2.3 but we hit another roadblock, meanwhile, we will be trying to fix the roadblock and update ZMQ to see if we still get this issue.

rbourtembourg · January 18, 2018, 6:20pm

Hi,

In Tango there is an heartbeat mechanism between the device server (DS) and the clients which subscribed to events on this DS.
The DS will push regularly some special events (heartbeat events) to the clients to inform them it is still alive.
The server should send these heartbeat event at least every 10 seconds.
If a client did not receive any heartbeat event during more than 10 seconds, then it considers a reconnection must occur with the server, so the client will try to re-subscribe to the events every 10 seconds, until the server responds again.

The reasons why the heartbeat was not received could be:

The server is dead
The network was down
The server has been restarted
…

We know there are some issues with the Tango event system related to heartbeats sometimes when using different TANGO_HOST values on client and server sides, or when using network alias names in the TANGO_HOST values. In this case, you may experience reconnections every 10 seconds.
I don’t remember well but there might be some issues when using some uppercase letters in the hostnames as well.

What you see on your logs (which are the logs from the client side if I understand well(??)) are push_heartbeat_event timeouts, which means no heartbeat event was received from the admin device tango://c03:10000/dserver/lmcds/c03 during at least 10 seconds.

In red, I think you see a message from Taurus which switches to do some polling from the client side since it got notified the events were no longer working as expected. This means taurus no longer relies on events for the attributes from lmcds/c03 device server and tries to read the attributes directly at a regular period.

So what you can check is:

is c03 a network alias name?
what is the value of TANGO_HOST environment variable on the server side. Are you using the same TANGO_HOST value on client and server side?
do you have any network issues between your client and server?
do you have a firewall in betweeen filtering some network ports?
is the machine where lmcds/c03 DS is running overloaded and cannot send heartbeat events because of lack of CPU resources?

I won’t be available until next Wednesday so don’t worry if you don’t get any answer from me before that day…
In the mean time, other experts from the community could help you too…

Hoping this helps,
Reynald

agotz · January 19, 2018, 8:02am

It sounds like you are not getting events at all. Probably due to the TANGO_HOST issues mentioned by Reynald. Can you try all he suggests and start atkpanel on your device to see if it receives any events. You will see this in the Debug panel. You should see 0 events received and you should see the same errors in the Error tab as the ones you have seen in the C++ client window. I am not sure changing the zmq version will help but it is worth a try.

This will confirm if events are working at all or not.

Andy

tgmrt · January 22, 2018, 9:13am

Yes , co3 is the hostname of system

Both server and client are running on same system using same TANGO_HOST. (TANGO_HOST = c03)

No, as they are running on the same system sharing same network.

Firewall service is currently inactive.

Yes, sometimes we face CPU overloading issue at our setup.

Along with this we are working on the suggestion provided by andy i.e. testing with atkpanel. We will provide you the updates about that.

lpivetta · January 22, 2018, 9:23am

[quote=“TCS_GMRT”][quote]

is c03 a network alias name?
[/quote]
Yes , co3 is the hostname of system

hostname and network alias are not exactly the same… so is that the real hostname, as defined in the DNS, or an alias (e.g. an additional name you can use to refer to the same host, the DNS is configured with)?

Both server and client are running on same system using same TANGO_HOST. (TANGO_HOST = c03)

You’re missing the port number here…

No, as they are running on the same system sharing same network.

Firewall service is currently inactive.

Yes, sometimes we face CPU overloading issue at our setup.

Have you identified the cause of CPU overloading? Can you mitigate this?

Along with this we are working on the suggestion provided by andy i.e. testing with atkpanel. We will provide you the updates about that.

[/quote]

Cheers,
Lorenzo

zreszela · January 23, 2018, 3:33pm

Hi all,

I just wanted to add one more idea to check. I think that the API_EventTimeout can happen as well in case you have a long callback (more than 10 s) subscribed the an attribute. Or a burst of events that their callbacks will make the event consumer thread not able to handle the heartbeat event on time. Please experts correct me if I’m wrong.

Just to discard this one, do you add your own listeners to the taurus models? Or maybe directly subscribe to the events using PyTango?

As Reynald explains, in case of the API_EventTimeout, Taurus enables its own polling for these attributes, which eventually gets disabled whenever a correct event arrives. What makes me think that your client finally receives the events correctly is that the API_EventTimeout appears somehow periodically, more or less every 1m 20s. I would say that, in between, Taurus disables the polling and relies again on events. But it is hard to say without debugging. Maybe you could try to enable debug logs in Taurus. Either use the --taurus-log-level=debug when starting your GUI or set it programmatically with taurus.setLoggingLevel(10).

Cheers,
Zibi