We have implemented, for the first time, a device server aimed at controlling a dome (a protective enclosure for the telescope).
Let me describe the environment to explain the situation:
we have 2 machines on which Tango 8.x is installed as recommended
machine 1, called the workstation (Ubuntu 14.04 LTS), which hosts the database (TANGO_HOST)
machine 2, called the LCU (Local Control Unit), which is the machine connected to the hardware (Ubuntu 14.04 LTS, hosting a Windows virtual machine because the dome driver is only provided for Windows)
we wrapped the dome driver in Python, in order to be able to test the driver itself independently of Tango (among other things)
we implemented a device server containing a Tango class that controls the dome through this wrapper
this dome device server runs on the LCU (machine 2)
we wrote a PyTango script to test the dome device server (attribute access, command execution, …)
this script is executed from the workstation (machine 1)
When executing this script, we see the following symptoms:
sometimes the script runs correctly once, sometimes for a long period (2-3 hours)
but sometimes we encounter errors/failures: the script stops because of the 3 s Tango timeout
in most cases it stops on a State or Status command (we have not overloaded them, we use the inherited ones), but it can also be on an attribute read (implemented by us, not turned into an asynchronous mechanism, because it should reply in well under 3 s)
the script does not always fail at the same point
So the problem is erratic, not reproducible, and seems to be linked to the environment.
The machines are powerful and dedicated to our control experiment.
The network is Gigabit Ethernet.
Endurance tests (Python scripts) have been performed several times over several hours, using the dome wrapper alone (without Tango), and they are OK.
When testing the State and Status commands of our device server from the Jive test-device utility on the LCU, the commands take between 150 ms and 1800 ms.
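To quantify that kind of per-call jitter, a small timing loop helps. The sketch below is illustrative, not the actual test script: `measure_call` is a hypothetical helper, and the lambda stands in for a real call such as `DeviceProxy.state()`.

```python
import time
import statistics

def measure_call(fn, n=20):
    """Time n invocations of fn; return (min, max, mean) in milliseconds.
    In the real test, fn would be e.g. proxy.state with a tango.DeviceProxy;
    here a stub that sleeps 5 ms stands in for the remote call."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return min(samples), max(samples), statistics.mean(samples)

# Stub standing in for proxy.state(); sleeps 5 ms per call.
lo, hi, avg = measure_call(lambda: time.sleep(0.005))
print("min %.1f ms  max %.1f ms  mean %.1f ms" % (lo, hi, avg))
```

A large spread between min and max (like the 150 ms to 1800 ms reported above) is itself a useful symptom to show in a bug report.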
We did some extra tests with other device servers, not written by us:
with TangoTest, the C++ test DS provided with the Tango distribution:
if the DS runs on the Linux OS of the LCU => it is very fast, the State command takes 0 ms from the Jive test device utility
if the DS runs on the Windows virtual machine of the LCU => it is also very fast, the State command takes 0 ms from the Jive test device utility
with the Python test DS (DevTest.py):
if the DS runs on the Linux OS of the LCU => it is very slow, and it fails with the 3 s Tango timeout error when retrieving the attributes
if the DS runs on the Windows virtual machine of the LCU => same behaviour: very slow, and it fails with the 3 s Tango timeout error when retrieving the attributes
It seems that:
the virtual machine does not affect the behaviour
the language (C++ or Python) does affect the behaviour
but …???
Have other PyTango users (or other users) already had this problem, or this kind of problem?
Does someone have an idea to help us with this problem?
Very strange. This kind of configuration (and ones with many more hosts) works well at the ESRF and at many other sites. I have not tested the DevTest.py device server, but it should not slow down in the way you describe. It seems a mystery. Maybe there is a problem with your version of PyTango. How did you install it? Have you checked whether the Python device server is using a lot of memory / CPU? I would switch on logging in the Python device server, send it to a file, and try to see what is happening by correlating it with the logging messages from the client.
At what frequency does the client interrogate the server?
Thanks for your fast answer.
When you say "This kind of configuration (and ones with many more hosts) works well at the ESRF and at many sites", I agree: something must indeed be wrong in our configuration.
I will only be at IPAG (in front of the machines) on Thursday.
There is a good chance that you have some kind of "deadlock" in your device server process. To check this, increase the timeout
value in your client to 3.5 s (DeviceProxy::set_timeout_millis() method) and check the error message you receive.
There is a chance that it won't be "Timeout" anymore but rather something like "Not able to acquire serialization monitor".
Could you confirm that the error message you now get is this "not able to …"?
Could you also tell us if you have your own thread(s) within your Tango class?
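The mechanism behind that "Not able to acquire serialization monitor" error can be illustrated without Tango: by default, calls on a device are serialized through a per-device monitor, so a command (or a user thread) that holds it for too long blocks the next request. A minimal sketch using a plain Python lock to stand in for the monitor (all names are illustrative, not the Tango implementation):

```python
import threading
import time

# Stands in for Tango's per-device serialization monitor.
monitor = threading.Lock()

def slow_command():
    """A command (or user thread) that holds the monitor too long,
    e.g. doing blocking hardware I/O under the lock."""
    with monitor:
        time.sleep(2.0)

def state_command(timeout=0.5):
    """A concurrent state() request: it must acquire the same monitor first."""
    if not monitor.acquire(timeout=timeout):
        return "NOT ABLE TO ACQUIRE SERIALIZATION MONITOR"
    try:
        return "ON"
    finally:
        monitor.release()

t = threading.Thread(target=slow_command)
t.start()
time.sleep(0.1)           # let slow_command grab the monitor first
print(state_command())    # prints "NOT ABLE TO ACQUIRE SERIALIZATION MONITOR"
t.join()
```

This is exactly why a device server with its own threads should avoid doing long blocking work while holding the device monitor.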
I attach the three files (described in the first post) we used for our device server, in case it helps:
the dome driver wrapper: functions have been made synchronous to handle the end of each "action"
the Tango class included in the device server: functions have been made asynchronous (using the Python threading module) to avoid the 3 s Tango timeout error
the PyTango script we used to test the device server; since we call asynchronous functions, we wait for the end of each operation (command, …) with the function waitForState(state, delay), which polls until the device state is no longer the given one
PyTango version is 8.1.1
PyTango has been installed from the provided Ubuntu package, as recommended
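A waitForState(state, delay) helper of the kind described above could look like the sketch below. This is an assumption about the attachment, not its actual code: the timeout guard and the stub proxy are illustrative; in the real script the proxy would be a tango.DeviceProxy.

```python
import time

def waitForState(proxy, state, delay, timeout=30.0):
    """Poll proxy.state() every `delay` seconds until the device has
    left `state` (e.g. wait for MOVING to end). The timeout guard is
    an illustrative addition so a stuck device cannot block forever."""
    deadline = time.monotonic() + timeout
    while proxy.state() == state:
        if time.monotonic() >= deadline:
            raise TimeoutError("device still in state %r" % state)
        time.sleep(delay)

# Stub device: reports "MOVING" for the first 3 polls, then "ON".
class StubProxy:
    def __init__(self):
        self.calls = 0
    def state(self):
        self.calls += 1
        return "MOVING" if self.calls <= 3 else "ON"

p = StubProxy()
waitForState(p, "MOVING", delay=0.01)
print(p.state())  # → ON
```

Note that each poll is itself a State command on the device, which is relevant to the question above about how often the client interrogates the server.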
On our LCU, in the Windows virtual machine:
Windows PowerShell powershell.exe 38,148 K
python.exe *32 35,616 K
Performance:
CPU usage between 0% and 5% generally, can reach ~20% in a peak
Memory 773 MB
I've increased the timeout to 3.5 s while testing the DevTest DS, which is the Python test DS.
I don't get the "Not able to acquire serialization monitor" error, but still the timeout error.
Whether the DS runs in the LCU VM, on the LCU's Linux, or even on the workstation (I did this extra test to run the DS on the same machine as the TANGO_HOST), the error is:
DevError[
desc = Timeout (3000 mS) exceeded on device toto/python/test (when the timeout is set to 3 s)
…
or
DevError[
desc = Timeout (3500 mS) exceeded on device toto/python/test (when the timeout is set to 3.5 s)
…
I attach the log file as well as some error screenshots.
The error is produced when reading the attr_asyn_to attribute (output on the console where the DS was started).
Start the DevTest DS and run the client test script:
xtramgr@xtworkstation:~$ ./DevTest.py test
Failed to import EventChannelFactory notifd/factory/xtworkstation from the Tango database
In DevTestClass init
In DevTest init
In Python init_device method
In read_attr_hardware
[DevTest::read_attr] attribute name Long_attr
[DevTest::read_attr] attribute name DefUserAttr
[DevTest::read_attr] attribute name DefClassUserAttr
In read_attr_hardware
[DevTest::read_attr] attribute name attr_wrong_type
In read_attr_hardware
[DevTest::read_attr] attribute name Float_attr
In read_attr_hardware
[DevTest::read_attr] attribute name UShort_ima_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name Short_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name UChar_spec_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name Short_spec_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name ULong64_attr_rw
In read_attr_hardware
[DevTest::read_attr] attribute name attr_asyn_to
Leaving reading attr_asyn_to attribute
and I can reproduce the timeout error when reading this attr_asyn_to attribute in the jive test device utility
So it takes more than 4 seconds to read this attribute.
Therefore it is normal to get a timeout exception if you have set your client timeout to 3 s or 3.5 s.
Increasing the timeout, or changing the code of the attr_asyn_to read method so that it returns in less than 3 s or 3.5 s, should solve this problem.
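A common pattern for making a slow attribute read return quickly, consistent with the thread-based approach mentioned earlier in this thread, is to do the slow hardware I/O in a background thread and serve a cached value from the read callback. This is a sketch with plain Python threading and hypothetical names, not the DevTest or dome code:

```python
import threading
import time

class DomePosition:
    """Sketch: a worker thread polls the (slow) hardware and caches the
    result, so the Tango read callback can return instantly. The
    _read_slow_hw stub and all names here are illustrative."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
        threading.Thread(target=self._poll, daemon=True).start()

    def _read_slow_hw(self):
        time.sleep(0.2)  # stands in for a multi-second driver call
        return 42.0

    def _poll(self):
        while True:
            v = self._read_slow_hw()
            with self._lock:
                self._value = v
            time.sleep(0.05)

    def read(self):
        """What the attribute read method would call: returns at once,
        never blocks on the hardware."""
        with self._lock:
            return self._value

dome = DomePosition()
time.sleep(0.3)           # let the worker complete one hardware read
t0 = time.perf_counter()
print(dome.read())        # → 42.0
print("read took %.1f ms" % ((time.perf_counter() - t0) * 1000))
```

The trade-off is that the client sees a value that may be slightly stale (up to one polling period old), which is usually acceptable for slow hardware like a dome.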
Thanks; I've modified the code of the DevTest DS by reducing the sleep time to under 3 s, and of course it's OK now.
I still have to find out what goes wrong in OUR DS (see first post), but first I have to reproduce the problem and rebuild a system (because the "hardware has left").
I will investigate as soon as possible and ask again if needed.