I am exploring Tango events. I have created a device server “JDeviceforEvent” that has only one attribute, named “Speed”, with the following configuration.
I’ve written a simple client which subscribes to the change event of the Speed attribute.
I run the device server and then run the client. When the event is subscribed for the first time I get an event as expected. But every 10 s I get an error stating “No heartbeat from dserver”.
I have tested it on the TANGO Virtual box and it works as expected, i.e. whenever the value of the Speed attribute changes by an absolute value of 1 or more, an event is raised. I don’t get any heartbeat exception there.
I have already posted this issue on the mailing list. They found that it was caused by a bug, fixed it and provided a new client API, but I am still not able to resolve the issue. It appears to be related to my network configuration, and I don’t understand what changes should be made to the network configuration in order to resolve this.
Please help me in resolving the issue.
I’ve attached the code for my device server, client and the modified client API. I’m using JTango-9.0.3 jar.
Note: The suggestion I received on the mailing list for using the modified client API was to put TangORB-9.0.3-a.jar at the beginning of the CLASSPATH.
It seems that I’m not able to attach the jar file for modified client API. You can use the following link to download the modified TangORB-9.0.3-a.jar file:
From the mailing list I understand that the issue is reproducible only on boxes with two or more network interfaces. Quite unhelpful, as our production boxes normally have two.
In addition, after the No_Heartbeat error the client still gets a value, which is read synchronously (ZmqEventConsumer.java:576). This is confusing.
In the mail trail, it was pointed out that the issue can be reproduced if the client runs on a machine with more than one network interface. The new client API which was provided didn’t help in resolving the issue. It was then concluded that the issue is somehow related to my network configuration.
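Since the problem is reported to depend on how many network interfaces the client machine has, it can help to list what the JVM itself sees. The following is a minimal sketch using only `java.net`; the class name `InterfaceLister` is made up for illustration and nothing here is part of the Tango API.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Tango: lists the addresses the JVM can see.
class InterfaceLister {

    /** Returns one "interfaceName -> address" entry per address on every interface that is up. */
    static List<String> listActiveAddresses() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
            if (interfaces == null) {
                return result; // no interfaces reported
            }
            for (NetworkInterface nif : Collections.list(interfaces)) {
                if (!nif.isUp()) {
                    continue; // skip disabled or disconnected interfaces
                }
                for (InetAddress address : Collections.list(nif.getInetAddresses())) {
                    result.add(nif.getName() + " -> " + address.getHostAddress());
                }
            }
        } catch (SocketException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        listActiveAddresses().forEach(System.out::println);
    }
}
```

If more than one non-loopback address shows up, the machine falls into the multi-interface case discussed on the mailing list.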
I’m still not able to resolve it. Your help would be greatly beneficial.
I’m running the Tango Server and the client on the same machine. System configuration is as below:
IP Address :-> 192.168.118.210.
Hostname :-> PC5-HP
TANGO_HOST :-> 192.168.118.210:20000
I’ve attached a snapshot of “ipconfig /all” and the “hosts” file of my Windows system. I hope it helps you understand my network configuration.
I’ve set the logging level to “TRACE” for the Tango Device Server and attached the log generated by the device server. I’m also attaching the output of the Tango client. The server log states “Heartbeat sent for tango://PC5-HP.ncra.tifr.res.in:20000/dserver/jdeviceforevent/jdevt1.heartbeat”, but somehow the heartbeat is not reaching the client.
I also get a strange error stating “device tango/admin/pc5-hp not defined in the database” in the “Command Prompt” from which I start the Tango Database server. I’m not sure whether the issue is because of it. I’ve also attached the snapshot of the same.
Please help in resolving the issue. As I’m not able to use events I’m not able to use half of the major functionality provided by the Tango Control System Framework.
Here is a description of how I worked around this particular problem.
The events system seems to be working even though you get this API_NoHeartbeat exception, so I decided to just ignore it:
// event listener defined as a field
private TangoEventListener<Long> tikTakListener = new TangoEventListener<Long>() {
    @Override
    public void onEvent(EventData<Long> data) {
        // do stuff
    }

    @Override
    public void onError(Exception cause) {
        // ignore the heartbeat error; getMessage() may be null, so guard against it
        String message = cause.getMessage();
        if (message != null && message.contains("API_NoHeartbeat")) return;
        // otherwise set state to FAULT
        logger.error(message, cause);
        setState(DevState.FAULT);
    }
};
I also started a new branch, TangORB-9.1.1.hzg, where I removed the synchronous reads from the remote Tango device both when NoHeartbeat happens and when the client subscribes.
So now I have proper behavior in my test cases (and hopefully in production this week):
client subscribes for an attribute change
once server starts pushing events client gets them
when server stops pushing client does not get anything
This works fine. Currently the client is on a Windows machine with two network interfaces and the server is on Debian 7, also with two network interfaces.
Though this is neither a fix nor a real understanding of why NoHeartbeat is happening, it seems to be a workaround for us.
I would also like to raise an issue concerning the client API implementation, specifically the synchronous calls made when the client subscribes and when NoHeartbeat happens. This is very misleading, as the client cannot tell whether it got a value because of an event or because the API happened to read the value and pass it on. In my case the client got values even though the server did not produce anything (the server pushes events deliberately). And as the client uses this event as a trigger for a routine (data acquisition in this case), you can imagine what was happening.
So basically the API should not decide on the client’s behalf to read the value synchronously; the client may do so in its error handler.
Same story with the first read when the client subscribes. I can imagine why this was done (Hello GUI!), i.e. the client subscribes, conveniently gets a value, displays it and then waits for a change. But this must be done by the client explicitly: the client reads the value, displays it, subscribes for changes and waits.
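The explicit "read, then subscribe" contract proposed above can be sketched as follows. This is only an illustration under stated assumptions: `FakeAttribute` and `Client` are hypothetical in-memory stand-ins, not Tango classes, and the subscription deliberately performs no implicit read, so every value the client sees is tagged with where it came from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class ExplicitReadThenSubscribe {

    /** Hypothetical stand-in for a remote attribute; not part of the Tango API. */
    static class FakeAttribute {
        private long value;
        private final List<Consumer<Long>> listeners = new ArrayList<>();

        FakeAttribute(long initial) {
            this.value = initial;
        }

        /** Synchronous read, always requested explicitly by the caller. */
        long read() {
            return value;
        }

        /** Registers a listener; note that no value is delivered at subscription time. */
        void subscribe(Consumer<Long> listener) {
            listeners.add(listener);
        }

        /** Simulates the server pushing a change event. */
        void push(long newValue) {
            value = newValue;
            for (Consumer<Long> listener : listeners) {
                listener.accept(newValue);
            }
        }
    }

    /** Client that records the origin of every value it receives. */
    static class Client {
        final List<String> received = new ArrayList<>();

        void start(FakeAttribute attribute) {
            received.add("READ:" + attribute.read());              // explicit initial read
            attribute.subscribe(v -> received.add("EVENT:" + v));  // then wait for events
        }
    }

    public static void main(String[] args) {
        FakeAttribute speed = new FakeAttribute(5);
        Client client = new Client();
        client.start(speed);
        speed.push(7);
        System.out.println(client.received); // prints [READ:5, EVENT:7]
    }
}
```

With this split, a routine triggered by events (such as data acquisition) can never be started by a value that was merely read synchronously.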
What do you think? Does the C++ implementation have the same contract?
Yes C++ has the same behaviour concerning events. I agree with you it is confusing and hides the fact that events are not coming through sometimes. Your proposal sounds reasonable but it might be difficult to change now because a number of GUIs depend on this behaviour. It should at least be discussed with the community to see if changing the behaviour is possible in a future release.
As per my understanding the client uses the Heartbeat event to make sure the Device Server which is publishing the event is still alive. But currently there is no way to know at the client end whether the missing heartbeat is because the device server is dead or because the heartbeat has got lost. Considering this your solution is acceptable.
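To make the limitation concrete, here is a minimal sketch of a client-side heartbeat watchdog. It is not Tango's implementation; the class `HeartbeatWatchdog` is hypothetical and the 10-second timeout is only assumed from the error interval reported earlier in this thread. The watchdog can say that no heartbeat arrived in time, but it has no way of telling whether the server died or the heartbeat was lost on the network.

```java
// Hypothetical sketch, not the Tango client code.
class HeartbeatWatchdog {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    HeartbeatWatchdog(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = nowMillis;
    }

    /** Called whenever a heartbeat event is received. */
    void onHeartbeat(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    /**
     * True when no heartbeat has arrived within the timeout.
     * The cause (dead server or lost heartbeat) cannot be determined from here.
     */
    boolean isExpired(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        HeartbeatWatchdog watchdog = new HeartbeatWatchdog(10_000, start); // assumed 10 s timeout
        System.out.println("expired after 9 s:  " + watchdog.isExpired(start + 9_000));
        System.out.println("expired after 11 s: " + watchdog.isExpired(start + 11_000));
    }
}
```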
I hope that the issue gets fixed by the time of Tango 9 release.
Also I want to confirm my understanding regarding the way in which the heartbeat event mechanism works (is implemented) in Tango.
The heartbeat event raised by the device server is first sent to the device of the DServer class residing in the device server process, and the DServer device then forwards it to all the clients that have subscribed for events. I inferred this from the following line of the device server log:
DEBUG 2015-08-10 12:41:14,259 [Event HeartBeat - dserver/JDeviceForEvent/jdEvt1] org.tango.server.events.EventManager.run:603 - Heartbeat sent for tango://PC5-HP.ncra.tifr.res.in:20000/dserver/jdeviceforevent/jdevt1.heartbeat
Similarly I believe that the subscription request sent by the client would come to the DServer device and then the DServer device will make some changes like adding the name of the client in a list.
Is my understanding correct?
Also, is the behavior the same for all events raised by the device server, or is it specific to the heartbeat event?
Once again I appreciate the effort you are putting into resolving the issue.
I discussed the event issue with our Java expert here and he confirms that this feature works and is used extensively here. This means the problem you are encountering is either a bug or specific to your setup. The workaround from Igor will not solve the problem. Your problem is you are not getting any events. The heartbeat is simply a symptom of this. You are right the heartbeat is to check the device server is alive. I don’t know the details of the implementation exactly but your assumption that the DServer common admin device sends the heartbeat sounds logical.
To find out why events are not working, could you fire up atkpanel on your device and check what the errors are in the error log and what the View → Diagnostics window says about support for events for your device attributes?
I see you are on Windows - have you switched the firewall off? If I think of any other reasons why events could not be working and how you can check I will let you know.
Yes I have switched the firewall OFF on my windows system.
Also I opened the error log from the ATK Panel for my device. There are no errors in the error log.
I checked in the Diagnostic Window. It says “tango://192.168.118.210:20000/dserver/JDeviceForEvent/jdEvt1 has no event channel defined in the database 192.168.118.210:20000 May be the server is not running.”
I have attached the snapshot of the Diagnostic Window and Error Log.
Could the issue be that the event channel is not defined? If so, please suggest a way of defining the event channel in the database.
In the Appendix D (Section D.3 and D.4) of the Tango Control System manual (v8.1) it is mentioned that the event channel is required for tango release prior to version 8. I’ve installed tango v 8.1.2 and I believe it uses ZMQ for events. Also the Tango jar (Tango-9.0.3 jar) file which I’m using has all the API for ZMQ.
So, I’m not able to understand why the Diagnostic window is showing this error.
Also, is there any way of specifying the logging level on the client side? A detailed log on the client side might help you understand the issue better.
I would also like to know whether I am the only one facing this issue. Are you not able to reproduce it on any system?
I have downloaded your server and test client and run them on my Ubuntu system. The events work on my system. One minor problem was that the while loop in your client does not sleep and uses 100% of the CPU!
So the events problem is not with your server or client but rather with your setup. We still have to understand why.
I used JTangoServer-1.1.7-all.jar which I got from the sourceforge download site. I had errors compiling with the version of the server you pointed to in your initial post (log4j etc were missing).
Here is the screenshot of the server and client running in my eclipse workbench. I have included the relevant windows for jive and atkpanel showing how they should look when events work.
I don’t know where this error is coming from. It sounds suspiciously like an error message from the old corba events system. Are you sure you are not including an old TangORB in your classpath?
Yes there is via environment variables. I have forgotten how …
I cannot reproduce it on Linux for now. But by persevering we will get to the bottom of this.
I downloaded the new stable JTangoServer-1.1.7-all.jar from the link you provided. I have also modified the client code as per your suggestion. Instead of the busy infinite loop I have added the following code:
while (true)
{
    Thread.sleep(20000);
}
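An alternative to the sleep loop, offered only as a sketch and not taken from the attached client, is to block the main thread on a `CountDownLatch` until a stop is requested. Event callbacks arrive on other threads, so the main thread only has to stay alive without consuming CPU. The class name `KeepAlive` is made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: keeps a client process alive without busy-waiting.
class KeepAlive {
    private final CountDownLatch stop = new CountDownLatch(1);

    /** Releases any thread blocked in awaitStop. */
    void requestStop() {
        stop.countDown();
    }

    /**
     * Blocks without using CPU until a stop is requested or the timeout elapses.
     * Returns true if a stop was requested, false on timeout.
     */
    boolean awaitStop(long timeout, TimeUnit unit) {
        try {
            return stop.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return stop.getCount() == 0;
        }
    }

    public static void main(String[] args) {
        KeepAlive keepAlive = new KeepAlive();
        // Release the latch when the JVM shuts down (e.g. on Ctrl+C).
        Runtime.getRuntime().addShutdownHook(new Thread(keepAlive::requestStop));
        // A real client would wait much longer; this demo waits only 200 ms.
        boolean stopped = keepAlive.awaitStop(200, TimeUnit.MILLISECONDS);
        System.out.println("stop requested: " + stopped);
    }
}
```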
I have attached the new client code and the same device server code. This produced a strange result.
The event channel error in the ATKPanel Diagnostic Window is now resolved. I’ve attached the snapshot of the ATKPanel and now it looks exactly same as yours.
On the client side I no longer get the Heartbeat error every 10 seconds, but I get the value of the Speed attribute even though it has not changed. So the event appears to be periodic in nature (although it is not).
I have attached the client log and the device server log. The device server log indicates that the “ZmqEventSubscriptionChange” command of the admin device is called at a fixed interval by the client, which results in re-subscription of the event and a synchronous read of the attribute value. I inferred this from the regular repetition of the following line in the device server log:
REQUEST 2015-08-11 15:59:09,816 [dserver/JDeviceForEvent/jdEvt1] - Operation command_inout_4 (cmd = ZmqEventSubscriptionChange) from cache_device requested from PC5-HP.ncra.tifr.res.in (Java client with main class org.tango.console.TestClient.TestClient_Console - PID=7456)
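To check how regular the re-subscription is, the intervals between such log lines can be computed directly. The sketch below assumes the timestamp format shown in the line above (`yyyy-MM-dd HH:mm:ss,SSS`) and simply filters on the text `cmd = ZmqEventSubscriptionChange`. The class name is hypothetical, and the second sample line in `main` is invented for illustration; it is not taken from the attached log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log helper, not part of Tango.
class ResubscriptionIntervals {
    private static final Pattern TIMESTAMP =
            Pattern.compile("\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}");
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    /** Returns the gaps in milliseconds between consecutive ZmqEventSubscriptionChange requests. */
    static List<Long> intervalsMillis(List<String> logLines) {
        List<LocalDateTime> stamps = new ArrayList<>();
        for (String line : logLines) {
            if (!line.contains("cmd = ZmqEventSubscriptionChange")) {
                continue; // unrelated log line
            }
            Matcher matcher = TIMESTAMP.matcher(line);
            if (matcher.find()) {
                stamps.add(LocalDateTime.parse(matcher.group(), FORMAT));
            }
        }
        List<Long> intervals = new ArrayList<>();
        for (int i = 1; i < stamps.size(); i++) {
            intervals.add(Duration.between(stamps.get(i - 1), stamps.get(i)).toMillis());
        }
        return intervals;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "REQUEST 2015-08-11 15:59:09,816 [dserver/JDeviceForEvent/jdEvt1] - Operation command_inout_4 (cmd = ZmqEventSubscriptionChange)",
                // Invented second line, for illustration only:
                "REQUEST 2015-08-11 15:59:19,816 [dserver/JDeviceForEvent/jdEvt1] - Operation command_inout_4 (cmd = ZmqEventSubscriptionChange)");
        System.out.println(intervalsMillis(sample));
    }
}
```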
I appreciate the efforts the members of Tango community are putting in to resolve the issue.
I was not able to attach the client log and the Device Server log due to the size issue along with my previous post so I have attached the logs with this post.
I should also mention that the system I’m using is currently in a workgroup and not in a domain. There is also no DNS mapping for the hostname and IP address of my system in the DNS server. I have manually added the entry to the hosts file of my system. All the systems in my office are on the same LAN, and they internally use the Link-Local Multicast Name Resolution (LLMNR) protocol to resolve hostnames and IP addresses.
I’m not sure whether the information I provided in the above paragraph will be useful to you. I just thought that it might help you in understanding my network configuration better.
Why do you specify the TANGO_HOST with the ip address instead of the ip name? Is the name resolution working? If this is not working due to your setup then indeed it will be difficult for the server to contact the client using the ip name. This would explain why events aren’t working …
Try to make ip name resolution work for your PC or try on another PC.
I am not sure I understood. Can you resolve the PC5-HP hostname from a client, e.g. does ping PC5-HP work?
When you changed the TANGO_HOST to the ip name, what changed? Do jive and atkpanel still work? In your last screenshot you showed TANGO_HOST=192.168.118.210:20000. Can you still use jive and your client with TANGO_HOST=PC5-HP.ncra.tifr.res.in:20000, i.e. the Fully Qualified Domain Name (FQDN)?
The way the network connection works for TANGO events is that the device server will try to build a connection to the client using the FQDN hostname of the client. If it cannot resolve this name then the server cannot send events to the client. I am more and more convinced this is your problem.
Possible solutions are:
(1) use /etc/hosts and add an alias for PC5-HP.ncra.tifr.res.in for the ip address of the pc
(2) use /etc/hosts and change the hostname to be PC5-HP and have an entry in /etc/hosts for this host
(3) make DNS work correctly so that you can resolve the FQDN to the ip address
In ALL cases the host name displayed in the log output must be resolvable for events to work.
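Whichever option is chosen, the result can be checked from Java with the standard `java.net.InetAddress` class. The following is a minimal sketch; `TangoHostCheck` and its methods are hypothetical helpers, not Tango API, and the default values in `main` are just the TANGO_HOST value and the FQDN quoted in this thread.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic helper, not part of Tango.
class TangoHostCheck {

    /** Extracts the host part of a TANGO_HOST value of the form "host:port". */
    static String hostOf(String tangoHost) {
        int colon = tangoHost.lastIndexOf(':');
        return colon < 0 ? tangoHost : tangoHost.substring(0, colon);
    }

    /** True if the host looks like a dotted IPv4 address rather than a name. */
    static boolean isIpv4Literal(String host) {
        return host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    /** Returns the resolved address as text, or null if the name cannot be resolved. */
    static String resolve(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String tangoHost = args.length > 0 ? args[0] : "192.168.118.210:20000";
        String fqdn = args.length > 1 ? args[1] : "PC5-HP.ncra.tifr.res.in";

        String host = hostOf(tangoHost);
        System.out.println("TANGO_HOST host part: " + host
                + (isIpv4Literal(host) ? " (IP address, not a name)" : " (name)"));

        String address = resolve(fqdn);
        System.out.println(fqdn + (address == null ? " does NOT resolve" : " resolves to " + address));
    }
}
```

If the FQDN printed in the server log does not resolve here, or resolves to an address other than the expected one, name resolution is the first thing to fix.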