API_FwdAttrInconsistency error in PollThread

Hi everyone,

I’m hitting a pretty strange issue involving forwarded attributes. I have two Tango devices, A and B, running on the same server (Tango v9.3.4, Ubuntu 20.04 LTS). Device A is an aggregator, with several forwarded attributes mapped to the corresponding attributes in B. Both devices run fine for about 10-15 minutes, but then device A suddenly crashes.

Here’s the GDB stack trace:

(gdb) bt full
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
        set = {__val = {18446744066192564231, 77309411330, 140737241121104, 0, 0, 0, 4916306193379385684, 7234307576299943525, 140737241124608, 140737241124552, 
            140737241122944, 140737241122952, 140737241124792, 0, 140737241124808, 0}}
        pid = <optimized out>
        tid = <optimized out>
        ret = <optimized out>
#1  0x00007ffff634d859 in __GI_abort () at abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x7ffff65185c0 <_IO_2_1_stderr_>, sa_sigaction = 0x7ffff65185c0 <_IO_2_1_stderr_>}, sa_mask = {__val = {
              140737327667140, 3432, 140737324492712, 2, 140737324492712, 16, 140737325925824, 2, 2, 1, 140736951513728, 2, 140737325926272, 140736951513824, 
              140737343928876, 140737488344288}}, sa_flags = -161598044, sa_restorer = 0x7ffff6518780 <stderr>}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007ffff65d88d1 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#3  0x00007ffff65e437c in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#4  0x00007ffff65e43e7 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#5  0x00007ffff65e4699 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
No symbol table info available.
#6  0x000055555574e5fd in Tango::Except::throw_exception (reason=0x7ffff7701001 "API_FwdAttrInconsistency", desc=" not registered in map of root attribute!", 
    origin=0x7ffff7701be8 "RootAttRegistry::get_local_att_name", sever=Tango::ERR) at /opt/Software/TANGO/v9.3.4/include/tango/except.h:595
        errors = {<_CORBA_Unbounded_Sequence<Tango::DevError>> = {<_CORBA_Sequence<Tango::DevError>> = {pd_max = 1, pd_len = 1, pd_rel = true, pd_bounded = false, 
              pd_buf = 0x7fffbc00b118}, <No data fields>}, <No data fields>}
#7  0x00007ffff7520e8a in Tango::RootAttRegistry::RootAttConfCallBack::get_local_att_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/Software/TANGO/v9.3.4/lib/libtango.so.9
No symbol table info available.
#8  0x00007ffff7439673 in Tango::RootAttRegistry::get_local_att_name(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
   from /opt/Software/TANGO/v9.3.4/lib/libtango.so.9
No symbol table info available.
#9  0x00007ffff7521de8 in Tango::RootAttRegistry::auto_unsub() () from /opt/Software/TANGO/v9.3.4/lib/libtango.so.9
No symbol table info available.
#10 0x00007ffff751682c in Tango::PollThread::auto_unsub() () from /opt/Software/TANGO/v9.3.4/lib/libtango.so.9
No symbol table info available.
#11 0x00007ffff7510f6c in Tango::PollThread::one_more_poll() () from /opt/Software/TANGO/v9.3.4/lib/libtango.so.9
No symbol table info available.
#12 0x00007ffff750eb6b in Tango::PollThread::run_undetached(void*) () from /opt/Software/TANGO/v9.3.4/lib/libtango.so.9
No symbol table info available.
#13 0x00007ffff78d3620 in omni_thread_wrapper () from /opt/Software/omniORB/v4.2.5/lib/libomnithread.so.4
No symbol table info available.
#14 0x00007ffff6256609 in start_thread (arg=<optimized out>) at pthread_create.c:477
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737241126656, -925966338890497742, 140737488344158, 140737488344159, 140737488344288, 140737241124800, 
                925960324307648818, 925950210945089842}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = 0
#15 0x00007ffff644a353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

As you can see, frame #6 shows an API_FwdAttrInconsistency exception thrown from RootAttRegistry::get_local_att_name (rootattreg.cpp). My interpretation is that, after running fine for 10-15 minutes, the polling thread is suddenly unable to find the local name of one of the forwarded attributes. It’s as if device A at some point loses track of an attribute it’s monitoring, and the crash happens when that attribute is accessed. But why?

The puzzling part is that we have three other servers running this same setup (same omniORB, ZeroMQ, and Tango versions, same version of devices A and B, and identical attribute configuration), yet they all run perfectly fine, with no crashes.

Could this be some sort of concurrency issue? Any insight would be highly appreciated.

Thanks in advance,
Cristobal