I am subscribing to a spectrum attribute of a device server (a/b/c). Sometimes I get an error:
org.omg.CORBA.TRANSIENT: Server-side Exception: resource limit reached vmcid: OMG minor code: 1 completed: No Severity: PANIC Reason: TangoApi_CANNOT_IMPORT_DEVICE Origin: Connection.dev_import(a/b/c)
Can anybody tell me what this error actually indicates and what the probable reasons for it are?
Good question.
I don’t think I have ever encountered this error, but the message suggests the Tango database device server didn’t have enough resources (RAM, open file descriptor limit, …?) to execute your request properly when trying to import a Tango device.
Is the computer where your TANGO database server is running under heavy load?
Do you have a huge number of clients doing a huge number of queries in parallel at high frequency? (Many client applications trying to connect to TANGO devices which are not currently running?)
Are you creating DeviceProxies all the time at high frequency, without reusing the created DeviceProxy objects?
You can use the Database Monitoring tool (DBBench) to monitor the number and type of commands sent to the Database Server.
You can start this tool from Astor by right-clicking on the TANGO_HOST database and selecting “Database Monitoring”.
In a similar way, you can also use the Database BlackBox right-click menu item to see the last 50 queries sent to the Database server. This could help identify a client sending an abnormally large number of commands to the Database server.
We are still facing this issue. These are some of the artefacts that we have. I will post the other artefacts shortly.
Yes, it is under load, but system resources are still available according to the output of the htop command.
We have approximately 20 Tango device servers running in parallel on one host. (The same host runs the TANGO facility.)
Yes, there may be 30+ TANGO device servers running across 30+ TANGO facilities. The device servers on the host of interest (where we get the “resource limit reached” error) try to reach these 30+ device servers, which may or may not be alive.
No. We re-use a DeviceProxy once it is created. If we fail to get it, we may retry, but once we have the proxy we do not re-create it.
I will try to do this and get back with the updates.
Yes, there are still resources available, but it does look like a somewhat loaded system…
20 is not a large number of devices running on a host. But the specific question is whether you have a large number of clients querying at high frequency…
This is typically something you want to avoid. It is really preferable to have TANGO devices always running, even just sitting idle, rather than starting and stopping services: clients hitting non-running devices create unwanted load on the database server, which, depending on how aggressive the client is, can turn into a heavy load.
Thanks, Lorenzo, for the prompt reply. We are actively looking at this issue so that we can fix it. I will keep you posted with the artefacts as and when my testing produces some.
My quick answer is: check the number of open file descriptors for the database server and your operating system’s per-process limit on open file descriptors. If you reach the operating system’s per-process limit, this could explain your problem.
The solution is to increase the number of file descriptors per process.
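As a rough sketch of that check, the snippet below reads the current process’s file descriptor limit and counts its open descriptors via `/proc` (so it assumes Linux); for the database server you would pass its PID instead of your own.

```python
import os
import resource

# Soft/hard per-process limit on open file descriptors
# (the soft limit is what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"fd limit: soft={soft} hard={hard}")

def open_fd_count(pid):
    """Count open file descriptors of a process via /proc (Linux)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

print(f"this process has {open_fd_count(os.getpid())} fds open")
```

The soft limit can be raised with `ulimit -n <N>` in the shell that starts the server, or with `LimitNOFILE=` in a systemd unit.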
You should also check how many import calls are being made per second. You can do that using the database timing attributes. If there are a lot of calls per second you should check which client is doing this and why. It might be a badly configured client.
We have installations with many more devices per database, so the number of devices in itself shouldn’t be a problem.
So my guess would be that there are far too many connection attempts to this device server and JacORB is no longer able to handle more requests. A queue must be full somewhere, or it has reached the maximum number of allowed open file descriptors?
You can check the number of file descriptors opened by your device server by executing the following shell command on cmsserver (Thanks Emmanuel Taurel for the tip):
ls /proc/<DEVICE_SERVER_PID>/fd | wc -l
So, in the case of your jive screenshot (if you didn’t restart Node/AGN0 device server since that screenshot), it would be:
ls /proc/6828/fd | wc -l
Could it be that you have some clients which are creating new DeviceProxy objects without deleting old DeviceProxy objects?
Do you receive this exception all the time?
If you start from a clean state (no client and device server restarted), how long does it take to reach this state where you get this error?
I would advise you to stop all the clients of this device server (you can see the clients using the blackbox feature), restart from a clean state, and add clients back one by one slowly until you hopefully see which client is triggering the problem (if the problem comes from a specific client, which still needs to be proven).