I am subscribing to a spectrum attribute of a device server (a/b/c). Sometimes I get an error:
org.omg.CORBA.TRANSIENT: Server-side Exception: resource limit reached vmcid: OMG minor code: 1 completed: No Severity: PANIC Reason: TangoApi_CANNOT_IMPORT_DEVICE Origin: Connection.dev_import(a/b/c)
Can anybody tell me what this error actually indicates and what the probable reasons for it are?
Good question.
I don’t think I have ever encountered this error, but the message suggests the Tango database device server didn’t have enough resources (RAM, open file descriptor limit, …?) to execute your request properly when trying to import a Tango device.
Is the computer where your TANGO database server is running under heavy load?
Do you have a huge number of clients doing a huge number of queries in parallel at high frequency? (Many client applications trying to connect to TANGO devices which are not currently running?)
Are you creating DeviceProxies all the time at high frequency, without reusing the created DeviceProxy objects?
You can use the Database Monitoring tool (DBBench) to monitor the number and type of commands sent to the Database Server.
You can start this tool from Astor by right-clicking on the TANGO_HOST database and selecting “Database Monitoring”.
In a similar way, you can also use the Database BlackBox right-click menu item to see the last 50 queries sent to the Database server. This could help identify a client sending an abnormally large number of commands to the Database server.
We are still facing this issue. These are some of the artefacts that we have. I will post the other artefacts shortly.
Yes, it is under load, but system resources are still available according to the output of the htop command.
We have approximately 20 Tango device servers running in parallel on one host. (The same host runs the TANGO facility.)
Yes, there may be 30+ TANGO device servers running across 30+ TANGO facilities. The device servers on the host of interest (where we get the “resource limit reached” error) try to reach these 30+ device servers, which may or may not be alive.
No. We re-use a DeviceProxy once it is created. If we fail to get it, we may retry, but once we have the proxy we do not re-create it.
I will try to do this and get back with the updates.
Yes, there are still resources available, but it does look like a somewhat loaded system…
20 is not a large number of devices running on a host. But the specific question is whether you have a large number of clients querying at high frequency…
This is typically something you want to avoid. It is really preferable to have TANGO devices always running, even just sitting idle, rather than starting and stopping services: clients hitting non-running devices create unwanted load on the database server, which, depending on how aggressive the client is, can turn into a heavy load.
Thanks, Lorenzo, for the prompt reply. We are actively looking at this issue so that we can fix it. I will keep you posted with the artefacts as and when my testing produces some.
My quick answer is: check the number of open file descriptors for the database server and your operating system’s per-process limit on open file descriptors. If you reach the operating system’s per-process limit, this could explain your problem.
The solution is to increase the number of file descriptors per process.
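As a rough sketch of that check, the snippet below reads the current process’s file descriptor limit and counts its open descriptors via `/proc` (so it assumes Linux); for the database server you would pass its PID instead of your own.

```python
import os
import resource

# Soft/hard per-process limit on open file descriptors
# (the soft limit is what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limit: soft={soft} hard={hard}")

def open_fd_count(pid):
    """Count open file descriptors of a process via /proc (Linux)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

print(f"this process has {open_fd_count(os.getpid())} fds open")
```

The soft limit can be raised with `ulimit -n <N>` in the shell that starts the server, or with `LimitNOFILE=` in a systemd unit.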
You should also check how many import calls are being made per second. You can do that using the database timing attributes. If there are a lot of calls per second you should check which client is doing this and why. It might be a badly configured client.
We have installations with many more devices per database, so the number of devices in itself shouldn’t be a problem.
So my guess would be that there are far too many connection attempts to this device server and JacORB is no longer able to handle more requests. A queue must be full somewhere, or it has reached the maximum number of allowed open file descriptors?
You can check the number of file descriptors opened by your device server by executing the following shell command on cmsserver (Thanks Emmanuel Taurel for the tip):
ls /proc/<DEVICE_SERVER_PID>/fd | wc -l
So, in the case of your jive screenshot (if you didn’t restart Node/AGN0 device server since that screenshot), it would be:
ls /proc/6828/fd | wc -l
Could it be that you have some clients which are creating new DeviceProxy objects without deleting old DeviceProxy objects?
Do you receive this exception all the time?
If you start from a clean state (no client and device server restarted), how long does it take to reach this state where you get this error?
I would advise you to stop all the clients of this device server (you can see the clients using the blackbox feature), restart from a clean state, and add clients back one by one slowly until you hopefully see which client is triggering the problem (if the problem comes from a specific client, which still needs to be proven).