No heartbeat error on Event Subscription

Hi,

I am exploring Tango events. I have created a device server “JDeviceforEvent” with a single attribute named “Speed”, configured as follows.

Attribute: Speed
Attribute Type: DevDouble
Read/Write Type: READ_WRITE
isPolled: true
pollingPeriod: 3000 ms
pushChangeEvent: false
checkChangeEvent: true
changeEventAbsolute: “1”

I’ve written a simple client which subscribes to the change event of the Speed attribute.

I run the device server and then run the client. When the client first subscribes, I get an event as expected. But every 10 s after that I get an error stating “No heartbeat from dserver”.

I have tested it on the TANGO VirtualBox image and it works as expected there, i.e. whenever the value of the Speed attribute changes by an absolute value of 1 or more, an event is raised. I don’t get any heartbeat exception.
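
To make the expected behaviour concrete, here is a minimal sketch of how an absolute change threshold such as changeEventAbsolute = 1 can be evaluated on each polled value. This is an illustration only, not the JTango implementation: the class and method names are made up, and comparing against the last value for which an event was sent is an assumption.

```java
// Hypothetical illustration only; this is not the JTango implementation.
final class AbsoluteChangeDetector {
    private final double threshold; // plays the role of changeEventAbsolute
    private Double lastSent;        // last value an event was sent for; null before the first poll

    AbsoluteChangeDetector(double threshold) {
        this.threshold = threshold;
    }

    /** Returns true if a change event would be sent for this polled value. */
    boolean onPolledValue(double value) {
        // Assumption: the first value always produces an event, and later values
        // are compared with the last value that produced one.
        if (lastSent == null || Math.abs(value - lastSent) >= threshold) {
            lastSent = value;
            return true;
        }
        return false;
    }
}
```

With a threshold of 1, a sequence of polled values 0.0, 0.5, 1.0 would produce events for 0.0 and 1.0 but not for 0.5.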

I have already posted this issue on the mailing list. A bug was found and fixed there, and a new client API was provided, but I am still not able to resolve the issue. The conclusion was that it is caused by my network configuration. I don’t understand what changes should be made to the network configuration in order to resolve this.

Please help me in resolving the issue.

I’ve attached the code for my device server, client and the modified client API. I’m using JTango-9.0.3 jar.

Note: The suggestion I received on the mailing list for using the modified client API was to put TangORB-9.0.3-a.jar at the beginning of the CLASSPATH.

Hi,

It seems that I’m not able to attach the jar file for modified client API. You can use the following link to download the modified TangORB-9.0.3-a.jar file:

Regards,
Vatsal Trivedi

Hi Vatsal,

Have you managed to work around this problem?

From the mailing list I understand that the issue is reproducible only on boxes with 2 or more network interfaces. That is quite unhelpful, as our production boxes normally have 2.

In addition, after the No_Heartbeat error the client still gets a value that is read synchronously (ZmqEventConsumer.java:576), which is confusing.

Regards,

Igor.

Hi Igor,

In the mail trail, it was pointed out that the issue can be reproduced if the client runs on a machine with more than one network interface. The new client API which was provided didn’t help in resolving the issue. It was then concluded that the issue is somehow related to my network configuration.
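
Since the problem seems tied to machines with more than one network interface, it can help to list what the JVM actually sees on the client machine. Below is a minimal sketch using only java.net. The class name InterfaceLister is made up for illustration, and nothing here is specific to Tango; it only shows which addresses are candidates when a library has to pick one.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper; not part of any Tango library.
final class InterfaceLister {
    /** Returns one "name -> address" line per address of every interface that is up and not loopback. */
    static List<String> activeAddresses() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return result; // older JVMs may return null when no interface is found
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                if (!nic.isUp() || nic.isLoopback()) {
                    continue;
                }
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    result.add(nic.getName() + " -> " + addr.getHostAddress());
                }
            }
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        activeAddresses().forEach(System.out::println);
    }
}
```

If more than one line is printed, the machine falls into the "more than one network interface" case discussed on the mailing list.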

I’m still not able to resolve it. Your help would be greatly beneficial.

I’m running the Tango Server and the client on the same machine. System configuration is as below:

IP Address :-> 192.168.118.210.
Hostname :-> PC5-HP
TANGO_HOST :-> 192.168.118.210:20000

I’ve attached a snapshot of “ipconfig /all” and the “hosts” file of my Windows system. I hope it helps you understand the network configuration.

I’ve set the logging level to “TRACE” for the Tango Device Server and attached the log generated by the device server. I’m also attaching the output of the Tango client. The server log states “Heartbeat sent for tango://PC5-HP.ncra.tifr.res.in:20000/dserver/jdeviceforevent/jdevt1.heartbeat”, but somehow the heartbeat is not reaching the client.

I also get a strange error stating “device tango/admin/pc5-hp not defined in the database” in the “Command Prompt” from which I start the Tango Database server. I’m not sure whether the issue is because of it. I’ve also attached the snapshot of the same.

Please help in resolving the issue. As I’m not able to use events I’m not able to use half of the major functionality provided by the Tango Control System Framework.

Regards,
Vatsal Trivedi

Hi Vatsal,

Thank you very much for all the information.

Here is a description of how I worked around this particular problem.

The event system seems to be working even though you get this API_NoHeartbeat exception, so I decided to just ignore it:


// event listener defined as a field
private TangoEventListener<Long> tikTakListener = new TangoEventListener<Long>() {
    @Override
    public void onEvent(EventData<Long> data) {
        // do stuff
    }

    @Override
    public void onError(Exception cause) {
        // ignore Heartbeat errors; getMessage() may return null, so guard against it
        String message = cause.getMessage();
        if (message != null && message.contains("API_NoHeartbeat")) return;
        // otherwise log the error and set the state to FAULT
        logger.error(message, cause);
        setState(DevState.FAULT);
    }
};

I also started a new branch, TangORB-9.1.1.hzg, in which I removed the synchronous reads from the remote Tango device that happen on NoHeartbeat and when the client subscribes.

So now I have the proper behavior in my test cases (and hopefully in production this week :) ):

  • client subscribes for an attribute change
  • once server starts pushing events client gets them
  • when server stops pushing client does not get anything

This works fine. Currently the client is on a Windows machine with two network interfaces and the server is on Debian 7, also with two network interfaces.

Though it is neither a fix nor a real explanation of why this NoHeartbeat is happening, it seems to be a workaround for us.

Hope this helps.

Igor.

Igor,

this sounds like a bug. Have you filed a bug report?

Andy

Andy,

No I did not. Can do it in a moment.

I would also raise an issue concerning the client API implementation, specifically the synchronous calls made when the client subscribes and when NoHeartbeat happens. This is very misleading, as the client cannot tell whether it got a value because of an event or because the API happened to read the value and pass it on. So in my case the client got values even though the server did not produce anything (the server pushes events deliberately). And as the client uses this event as a trigger for some routine (data acquisition in this case), you can imagine what was happening.

So basically the API should not decide on the client’s behalf to read the value synchronously; the client may do so itself in the error handler.

Same story with the first read when the client subscribes. I can imagine why this was done (Hello GUI!), i.e. the client subscribes, conveniently gets a value, displays it and then waits for a change. But this should be done by the client explicitly: the client reads the value, displays it, subscribes for changes and waits.
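
The contract proposed above can be sketched as follows. This is a self-contained toy model, not the Tango client API: FakeAttribute, ExplicitClient and all method names are hypothetical, and the "remote attribute" is just an in-memory object. The point is only that subscribing delivers nothing by itself, so every callback corresponds to a real push from the server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-in for a remote attribute; none of these names come from the Tango API.
final class FakeAttribute {
    private double value;
    private final List<Consumer<Double>> listeners = new ArrayList<>();

    FakeAttribute(double initial) {
        this.value = initial;
    }

    /** Synchronous read, requested explicitly by the client. */
    double read() {
        return value;
    }

    /** Registers a listener; deliberately does NOT deliver the current value. */
    void subscribe(Consumer<Double> listener) {
        listeners.add(listener);
    }

    /** Server side: store a new value and push it to all subscribers. */
    void push(double newValue) {
        value = newValue;
        for (Consumer<Double> listener : listeners) {
            listener.accept(newValue);
        }
    }
}

final class ExplicitClient {
    final List<String> log = new ArrayList<>();

    void start(FakeAttribute attr) {
        log.add("read:" + attr.read());             // 1. explicit initial read, e.g. to fill a GUI
        attr.subscribe(v -> log.add("event:" + v)); // 2. from here on, only real events arrive
    }
}
```

With this split, a client that uses events as a trigger (as in the data acquisition case) never mistakes a convenience read for a pushed event.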

What do you think? Does the C++ implementation have the same contract?

Igor.

Igor,

thanks.

Yes C++ has the same behaviour concerning events. I agree with you it is confusing and hides the fact that events are not coming through sometimes. Your proposal sounds reasonable but it might be difficult to change now because a number of GUIs depend on this behaviour. It should at least be discussed with the community to see if changing the behaviour is possible in a future release.

Andy

Hi Igor,

Thanks for developing a workaround to the issue.

As per my understanding the client uses the Heartbeat event to make sure the Device Server which is publishing the event is still alive. But currently there is no way to know at the client end whether the missing heartbeat is because the device server is dead or because the heartbeat has got lost. Considering this your solution is acceptable.
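
To illustrate the limitation described above, here is a minimal sketch of a client-side heartbeat check. It is not the logic of ZmqEventConsumer; the class name is made up, and the 10 s timeout in the usage note is only taken from the interval at which the error shows up in this thread.

```java
// Hypothetical sketch; not the actual Tango client implementation.
final class HeartbeatWatchdog {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    HeartbeatWatchdog(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    /** Called whenever a heartbeat message arrives. */
    void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /**
     * True if no heartbeat has been seen within the timeout.
     * Note that this cannot distinguish a dead server from a heartbeat
     * that was sent but never reached the client.
     */
    boolean isHeartbeatMissing(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }
}
```

A client could create it with new HeartbeatWatchdog(10_000, System.currentTimeMillis()) and call isHeartbeatMissing periodically. All the watchdog knows is the time of the last heartbeat it received, which is why both failure cases look the same from the client side.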

I hope that the issue gets fixed by the time of Tango 9 release.

Also I want to confirm my understanding regarding the way in which the heartbeat event mechanism works (is implemented) in Tango.

The heartbeat event raised by the device server is first sent to the device of the DServer class residing in the device server process, and the DServer device then forwards it to all the clients that have subscribed for events. I inferred this from the following line of the device server log:

DEBUG 2015-08-10 12:41:14,259 [Event HeartBeat - dserver/JDeviceForEvent/jdEvt1] org.tango.server.events.EventManager.run:603 - Heartbeat sent for tango://PC5-HP.ncra.tifr.res.in:20000/dserver/jdeviceforevent/jdevt1.heartbeat

Similarly, I believe that the subscription request sent by the client goes to the DServer device, which then updates its state, for example by adding the client to a list of subscribers.

Is my understanding correct?

Also, is the behavior the same for all events raised by the device server, or is it specific to the heartbeat event?

Once again I appreciate the efforts which you put for resolving the issue.

Regards,
Vatsal

Dear Vatsal,

I discussed the event issue with our Java expert here and he confirms that this feature works and is used extensively here. This means the problem you are encountering is either a bug or specific to your setup. The workaround from Igor will not solve the problem. Your problem is you are not getting any events. The heartbeat is simply a symptom of this. You are right the heartbeat is to check the device server is alive. I don’t know the details of the implementation exactly but your assumption that the DServer common admin device sends the heartbeat sounds logical.

To find out why events are not working, could you fire up atkpanel on your device and check what errors are in the error log and what the View → Diagnostics window says about event support for your device attributes?

I see you are on Windows - have you switched the firewall off? If I think of any other reasons why events could not be working and how you can check I will let you know.

Kind regards

Andy

Dear Andy,

Yes I have switched the firewall OFF on my windows system.

Also I opened the error log from the ATK Panel for my device. There are no errors in the error log.

I checked in the Diagnostic Window. It says “tango://192.168.118.210:20000/dserver/JDeviceForEvent/jdEvt1 has no event channel defined in the database 192.168.118.210:20000 May be the server is not running.”

I have attached the snapshot of the Diagnostic Window and Error Log.

Could the issue be because the event channel is not defined ? If that is the case then please suggest the way of defining the event channel in the database.

In the Appendix D (Section D.3 and D.4) of the Tango Control System manual (v8.1) it is mentioned that the event channel is required for tango release prior to version 8. I’ve installed tango v 8.1.2 and I believe it uses ZMQ for events. Also the Tango jar (Tango-9.0.3 jar) file which I’m using has all the API for ZMQ.

So, I’m not able to understand why the Diagnostic window is showing this error.

Also is there any way of specifying the logging level on the client side ? The detailed log on the client side might help you in understanding the issue better.

Also I want to know if I am the only one who is facing this issue ? Are you not able to reproduce the issue on any system ?

Regards,
Vatsal

Dear Vatsal,

I have downloaded your server and test client and run them on my Ubuntu system. The events work on my system. One minor problem was the while loop in your client does not sleep and uses 100% of the cpu!

So the events problem is not with your server or client but rather with your setup. We still have to understand why.

I used JTangoServer-1.1.7-all.jar which I got from the sourceforge download site. I had errors compiling with the version of the server you pointed to in your initial post (log4j etc were missing).

Here is the screenshot of the server and client running in my eclipse workbench. I have included the relevant windows for jive and atkpanel showing how they should look when events work.

Vatsal,

some more answers inline:

Good

I don’t know where this error is coming from. It sounds suspiciously like an error message from the old corba events system. Are you sure you are not including an old TangORB in your classpath?

Yes there is via environment variables. I have forgotten how …

I cannot reproduce it on Linux for now. But by persevering we will get to the bottom of this.

Andy

I have managed to reproduce the problem! When using TangORB-9.0.3-a.jar I get the same error from your client as you do:

home/goetz/workspaces/tango/JDeviceForEvent/bin:/home/goetz/tango/jevents/TangORB-9.0.3-a.jar:/home/goetz/tango/jevents/JTangoServer-1.1.7-all.jar:/usr/share/java/jayatanaag.jar
Device name: JD/Evt/1
====================== ZMQ (3.22) event system is available ============================
tcp://127.0.0.1:51103 Connected !!!!
Tue Aug 11 10:05:01 CEST 2015
Event Name: JD/Evt/1/speed
Event: change
Event Type: 0
Event Source: 0
Event Error: false
Attribute Value: 0.0


--------------------------------------------------------------------


tango://pc35.home:10000/dserver/jdeviceforevent/test Not found
tango://pc35.home:10000/dserver/jdeviceforevent/test Not found
Tue Aug 11 10:05:21 CEST 2015
Event Name: JD/Evt/1
Event: change
Event Type: 0
Event Source: 0
Event Error: true
Error list size: 1
Error 1 Description: No heartbeat from dserver/jdeviceforevent/test
Error 1 Severity: ERR
Error 1 Reason: API_NoHeartbeat
Error 1 Origin: ZmqEventConsumer.checkIfHeartbeatSkipped()

I propose you replace this jar with the latest stable one from sourceforge. You can download it from here:

http://sourceforge.net/projects/tango-cs/files/JTango/JTangoServer-1.1.7-all.jar/download

Let me know if this fixes your problem.

Andy

Dear Andy,

I downloaded the new stable JTangoServer-1.1.7-all.jar from the link you provided. I have also modified the client code as per your suggestion. Instead of the busy infinite loop I have added the following code:

while(true)
{
    Thread.sleep(20000);
}

I have attached the new client code and the same device server code. It produced some strange results.

The event channel error in the ATKPanel Diagnostic Window is now resolved. I’ve attached a snapshot of the ATKPanel, and it now looks exactly the same as yours.

On the client side I no longer get the heartbeat error every 10 seconds, but I receive the value of the Speed attribute even though it has not changed. So the event behaves as if it were periodic (although it is not).

I have attached the client log and the device server log. The device server log indicates that the “ZmqEventSubscriptionChange” command of the admin device is called by the client at a fixed interval, which results in a re-subscription to the event and a synchronous read of the attribute value. I inferred this from the regular repetition of the following line in the device server log:

REQUEST 2015-08-11 15:59:09,816 [dserver/JDeviceForEvent/jdEvt1] - Operation command_inout_4 (cmd = ZmqEventSubscriptionChange) from cache_device requested from PC5-HP.ncra.tifr.res.in (Java client with main class org.tango.console.TestClient.TestClient_Console - PID=7456)
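
To check whether the re-subscription really happens at a fixed interval, the timestamps of these REQUEST lines can be extracted from the server log and the gaps between them computed. A minimal sketch follows. It assumes the line layout shown above ("REQUEST", then a timestamp in yyyy-MM-dd HH:mm:ss,SSS form, then text containing "cmd = ZmqEventSubscriptionChange"); the class name is made up, and other log configurations may need a different pattern.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper; the line format is assumed from the single sample line in this post.
final class ResubscriptionIntervals {
    private static final Pattern LINE = Pattern.compile(
            "^REQUEST (\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) .*cmd = ZmqEventSubscriptionChange.*$");
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    /** Returns the timestamps of all ZmqEventSubscriptionChange request lines, in input order. */
    static List<LocalDateTime> timestamps(List<String> lines) {
        List<LocalDateTime> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                result.add(LocalDateTime.parse(m.group(1), STAMP));
            }
        }
        return result;
    }

    /** Returns the gaps between consecutive timestamps. */
    static List<Duration> intervals(List<LocalDateTime> stamps) {
        List<Duration> result = new ArrayList<>();
        for (int i = 1; i < stamps.size(); i++) {
            result.add(Duration.between(stamps.get(i - 1), stamps.get(i)));
        }
        return result;
    }
}
```

Feeding it the lines of the device server log (for example via java.nio.file.Files.readAllLines) gives a list of gaps; if they are all roughly equal, the re-subscription is periodic.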

I appreciate the efforts the members of Tango community are putting in to resolve the issue.

Regards,
Vatsal Trivedi

Hi

You wrote:
Hostname:-> PC5-HP
TANGO_HOST :-> 192.168.118.210:20000

Did you try with:
TANGO_HOST=PC5-HP:20000 ?

Hi,

I was not able to attach the client log and the Device Server log due to the size issue along with my previous post so I have attached the logs with this post.

I should also mention that the system I’m using is currently in a workgroup and not in a domain. There is also no entry in the DNS server mapping the hostname of my system to its IP address. I have manually added the entry to the hosts file of my system. All the systems in my office are on the same LAN and internally use the Link-Local Multicast Name Resolution (LLMNR) protocol to resolve hostnames to IP addresses.

I’m not sure whether the information I provided in the above paragraph will be useful to you. I just thought that it might help you in understanding my network configuration better.

Regards,
Vatsal Trivedi

Why do you specify the TANGO_HOST with the ip address instead of the ip name? Is the name resolution working? If this is not working due to your setup then indeed it will be difficult for the server to contact the client using the ip name. This would explain why events aren’t working …

Try to make ip name resolution work for your PC or try on another PC.

Andy

Hi,

@Pascal
I tried by using Hostname (PC5-HP) instead of IP Address in the TANGO_HOST environment variable. It does not resolve the issue.

@Andy
I also tried your suggestion of changing the hostname to PC5HP. It does not help either.

If some new ideas come to your mind let me know I’ll keep on trying. Thanks for your continuous support.

Regards,
Vatsal Trivedi

I am not sure I understood. Can you resolve the PC5-HP hostname from a client, e.g. does ping PC5-HP work?

When you changed the TANGO_HOST to the ip name, what changed? Do jive and atkpanel still work? In your last screenshot you showed TANGO_HOST=192.168.118.210:20000. Can you still use jive and your client with TANGO_HOST=PC5-HP.ncra.tifr.res.in:20000, i.e. the Fully Qualified Domain Name (FQDN)?

The way the network connection works for TANGO events is that the device server will try to build a connection to the client using the FQDN hostname of the client. If it cannot resolve this name then the server cannot send events to the client. I am more and more convinced this is your problem.

Possible solutions are:

(1) use /etc/hosts and add an alias for PC5-HP.ncra.tifr.res.in for the ip address of the pc

(2) use /etc/hosts and change the hostname to be PC-HP5 and have an entry in /etc/hosts for this host

(3) make DNS work correctly so that you can resolve the FQDN to the ip address

In ALL cases the host name displayed in the log output must be resolvable for events to work.
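
A quick way to check name resolution from Java, independent of Tango, is sketched below. It prints the local host name and the canonical name the JVM reports, and tries to resolve any names passed on the command line. The class name and the hostOf helper are made up for illustration, and whether these are exactly the names the Tango libraries use internally is an assumption.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic helper; not part of any Tango library.
final class NameResolutionCheck {
    /** Extracts the host part of a TANGO_HOST style value such as "PC5-HP:20000". */
    static String hostOf(String tangoHost) {
        int colon = tangoHost.lastIndexOf(':');
        return colon < 0 ? tangoHost : tangoHost.substring(0, colon);
    }

    /** Returns the resolved IP address, or null if the name cannot be resolved. */
    static String resolve(String hostName) {
        try {
            return InetAddress.getByName(hostName).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) throws UnknownHostException {
        InetAddress local = InetAddress.getLocalHost();
        System.out.println("local host name : " + local.getHostName());
        System.out.println("canonical name  : " + local.getCanonicalHostName());
        for (String arg : args) {
            String host = hostOf(arg);
            System.out.println(host + " -> " + resolve(host));
        }
    }
}
```

Running it with the host name from the server log as an argument, e.g. PC5-HP.ncra.tifr.res.in:20000, shows whether that name resolves on the machine where it is run. A result of null means the name cannot be resolved there.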

Andy