tgmrt
(TCS_GMRT)
September 25, 2018, 7:31am
1
I am working on a Tango device server. After I start the device server, it is killed automatically after some time. I have attached a screenshot of Jive and the GDB logs. From the GDB logs, I can see that the device server process receives a SIGABRT signal, which is why it is killed.
I do not understand why it receives the SIGABRT signal.
Does anyone have any idea why it gets SIGABRT with the message: Assertion failed: erased == 1 (mtrie.cpp:292)?
Hi,
It would be interesting to get more details.
What version of the C++ tango library are you using?
What version of ZMQ are you using?
Can you reproduce this easily with a simple C++ device server?
If yes, can you please share the source code of the simple device server and give instructions on how to reproduce the problem?
I found these issues reported on zmq github repo which look similar to your problem:
opened 12:43PM - 12 Feb 18 UTC
closed 03:59AM - 05 Mar 18 UTC
Platform (linux/generic)
Symptom (Assertion failure)
# Issue description
When all of these conditions are satisfied, the assertion failure from `mtrie.cpp` occurs:
- A connection between a `PUB` socket and many `SUB` sockets.
- A `SUB` socket subscribe/unsubscribe many prefixes.
- Call `zmq_getsockopt()` with `ZMQ_EVENTS` for `SUB` sockets.
```
Assertion failed: erased == 1 (src/mtrie.cpp:297)
[1] 30266 abort (core dumped) ./a.out
```
# Environment
* libzmq version (commit hash if unreleased): 4.2.0 and 4.2.3
* OS: Ubuntu 16.04 LTS
# Minimal test code / Steps to reproduce the issue
To reproduce this crash, we should prepare a `PUB` socket and many `SUB` sockets.
We will call this sequence (pseudo-code): `pub.connect(sub) or sub.connect(pub); pub.getsockopt(ZMQ_EVENTS); sub.subscribe(prefix); sub.getsockopt(ZMQ_EVENTS); sub.unsubscribe(prefix); sub.getsockopt(ZMQ_EVENTS)`. There will be many prefixes to subscribe/unsubscribe.
Calling `getsockopt(ZMQ_EVENTS)` after `SUB`'s `SUBSCRIBE`/`UNSUBSCRIBE`, or `PUB`'s `zmq_connect()` will produce a crash due to the assertion failure in `mtrie_t::rm_helper`.
You can switch `PUB<->SUB` connection topology by the `pub_to_sub` variable.
```c
#include "zmq.h"
#include <stdio.h>

// Set 1 or 0 to switch the PUB<->SUB connection topology.
static int pub_to_sub = 1;

// Writes an 8-character hex prefix plus the terminating NUL into topic,
// so the caller must provide at least 9 bytes.
void gen_topic(int n, char* topic)
{
    // Simple hash function to generate a subscription prefix from a number.
    // Unsigned arithmetic keeps the 32-bit wrap-around well defined.
    unsigned int h = (unsigned int)n * 2654435761u;
    sprintf(topic, "%08x", h);
}

void getsockopt_events_within_many_subscriptions(void* sub)
{
    char topic[9]; // 8 hex digits + NUL written by sprintf
    char opt[256];
    size_t opt_len = 256;
    for (int j = 0; j < 10000; ++j)
    {
        gen_topic(j, topic);
        // Subscribe to the 8-byte prefix (the NUL is not part of the topic).
        zmq_setsockopt(sub, ZMQ_SUBSCRIBE, topic, 8);
        // CRASH: Get ZMQ_EVENTS from a SUB socket.
        zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
    }
    for (int j = 0; j < 10000; ++j)
    {
        gen_topic(j, topic);
        zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, topic, 8);
        // CRASH: Get ZMQ_EVENTS from a SUB socket.
        zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
    }
}

int main()
{
    printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);
    void *context = zmq_ctx_new();
    void *pub = zmq_socket(context, ZMQ_PUB);
    void *sub;
    char addr[256]; size_t addr_len = 256;
    char opt[256]; size_t opt_len = 256;
    if (pub_to_sub)
    {
        // PUB->SUB
        for (int i = 0; i < 100; ++i)
        {
            sub = zmq_socket(context, ZMQ_SUB);
            zmq_bind(sub, "tcp://127.0.0.1:*");
            // zmq_getsockopt() overwrites addr_len with the actual length,
            // so reset it to the buffer size before every call.
            addr_len = sizeof(addr);
            zmq_getsockopt(sub, ZMQ_LAST_ENDPOINT, addr, &addr_len);
            zmq_connect(pub, addr);
            getsockopt_events_within_many_subscriptions(sub);
        }
    }
    else
    {
        // SUB->PUB
        zmq_bind(pub, "tcp://127.0.0.1:*");
        zmq_getsockopt(pub, ZMQ_LAST_ENDPOINT, addr, &addr_len);
        for (int i = 0; i < 100; ++i)
        {
            sub = zmq_socket(context, ZMQ_SUB);
            zmq_connect(sub, addr);
            getsockopt_events_within_many_subscriptions(sub);
            // CRASH: Get ZMQ_EVENTS from the PUB socket.
            zmq_getsockopt(pub, ZMQ_EVENTS, opt, &opt_len);
        }
    }
}
```
# What's the actual result? (include assertion message & call stack if applicable)
```console
$ gcc zmq_events_crash.c -L ~/usr/local/lib -lzmq && ./a.out
4.2.3
Assertion failed: erased == 1 (src/mtrie.cpp:297)
[1] 30266 abort (core dumped) ./a.out
```
# What's the expected result?
```console
$ gcc zmq_events_crash.c -L ~/usr/local/lib -lzmq && ./a.out
4.2.3
$ echo $?
0
```
When `SUB` sockets connect to the `PUB` socket, this crash doesn't happen.
opened 10:10AM - 13 Dec 16 UTC
closed 08:14AM - 09 Jan 17 UTC
I've been building a game server with ZeroMQ's Pub/Sub messaging pattern. I ran into a critical problem using `PUB`/`SUB` sockets. Sometimes my processes are aborted with an assertion failure from ZeroMQ:
```
Assertion failed: erased == 1 (src/mtrie.cpp:297)
```
I tried with pyzmq-16.0.2 over libzmq-4.2.0.
In my case, a `SUB` socket binds to an address and then a `PUB` socket connects to that address. All `PUB` and `SUB` sockets in a cluster connect to each other, making a fully connected network among 500+ server processes.
A `SUB` socket frequently subscribes to or unsubscribes from its topics. The number of topics in a cluster keeps growing after the cluster starts. At one point when I checked, one of the `SUB` sockets was subscribed to 3000+ topics.
I saw 3 aborting scenarios:
1. When a `SUB` socket closes, some `PUB` sockets abort. Perhaps it is a concurrency bug in [pyzmq](https://github.com/zeromq/pyzmq), which I'm using. I reproduced it with a test case, and I think [I fixed it](https://github.com/what-studio/pyzmq/commit/94ab0a88dbef7d0f33b34cdf18e55487735dde01).
2. When a `PUB` socket joins a mature cluster, it aborts almost immediately. A mature cluster is one that already has many subscribed topics and many subscribe/unsubscribe synchronization messages.
3. A `PUB` socket on a weak host machine (e.g. AWS EC2 t2.medium) sometimes aborts. I'm not sure what the cause is.
Unfortunately, I couldn't reproduce the last 2 scenarios with a small test program, but my servers are still aborting.
The assertion failure occurs when a `PUB` socket tries to remove a pipe to a `SUB` socket but there's no matched pipe. I'm wondering if ZeroMQ guarantees the consistency of subscribe/unsubscribe synchronizations between busy `PUB` and `SUB` sockets.
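To make the failure mode described above concrete, here is a minimal toy model in C. It is only a sketch: libzmq's real `mtrie` is a prefix trie, not a flat table, and every name here (`sub_table`, `table_add`, `table_rm`, `pipe_id`) is hypothetical. The point it illustrates is that removing a (prefix, pipe) pair that was never recorded erases zero entries, so any caller asserting `erased == 1` would abort on an unmatched unsubscribe.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 64
#define MAX_PREFIX  16

/* One (prefix, pipe) pair; pipe_id stands in for a pointer to a SUB peer's pipe. */
struct entry { char prefix[MAX_PREFIX]; int pipe_id; int used; };

struct sub_table { struct entry entries[MAX_ENTRIES]; };

void table_init(struct sub_table *t) { memset(t, 0, sizeof *t); }

/* Record a subscription. Returns 1 if added, 0 if it already exists or the table is full. */
int table_add(struct sub_table *t, const char *prefix, int pipe_id)
{
    int free_slot = -1;
    assert(strlen(prefix) < MAX_PREFIX); /* toy limit, not a libzmq limit */
    for (int i = 0; i < MAX_ENTRIES; ++i) {
        struct entry *e = &t->entries[i];
        if (e->used && e->pipe_id == pipe_id && strcmp(e->prefix, prefix) == 0)
            return 0;
        if (!e->used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return 0;
    strcpy(t->entries[free_slot].prefix, prefix);
    t->entries[free_slot].pipe_id = pipe_id;
    t->entries[free_slot].used = 1;
    return 1;
}

/* Remove a subscription. Returns the number of entries erased: 1 if matched, 0 if not. */
int table_rm(struct sub_table *t, const char *prefix, int pipe_id)
{
    for (int i = 0; i < MAX_ENTRIES; ++i) {
        struct entry *e = &t->entries[i];
        if (e->used && e->pipe_id == pipe_id && strcmp(e->prefix, prefix) == 0) {
            e->used = 0;
            return 1;
        }
    }
    return 0; /* unmatched unsubscribe: nothing to erase */
}
```

In this model, a second `table_rm()` for the same pair returns 0. A publisher that treats that return value as an invariant violation rather than as a condition to handle would abort in the same situation.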
Cheers,
Reynald
Looking at libzmq issue #2942 ("PUB crash when SUB exceeded SNDHWM"), here is the description that was added to the libzmq README when it was fixed (in ZMQ 4.2.4):
[quote]
Fixed #2942 - ZMQ_PUB crash when due to high volume of subscribe and
unsubscribe messages, an unmatched unsubscribe message is
received in certain conditions[/quote]
Do you have a high volume of subscribe and unsubscribe messages coming to your device server?
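As a quick way to check whether a given libzmq build should already contain that fix, a version comparison along these lines may help. This is a sketch only: `version_at_least` and `has_mtrie_fix` are hypothetical helper names, and the 4.2.4 threshold is taken from the note quoted above. In a real program the three numbers would come from `zmq_version(&major, &minor, &patch)`, which reports the version of the library actually loaded at run time.

```c
#include <assert.h>

/* Returns 1 if major.minor.patch >= req_major.req_minor.req_patch, otherwise 0. */
int version_at_least(int major, int minor, int patch,
                     int req_major, int req_minor, int req_patch)
{
    if (major != req_major)
        return major > req_major;
    if (minor != req_minor)
        return minor > req_minor;
    return patch >= req_patch;
}

/* Hypothetical helper: 1 if this libzmq version is 4.2.4 or newer,
   i.e. the release in which issue #2942 is reported as fixed. */
int has_mtrie_fix(int major, int minor, int patch)
{
    assert(major >= 0 && minor >= 0 && patch >= 0);
    return version_at_least(major, minor, patch, 4, 2, 4);
}
```

For example, `has_mtrie_fix(4, 2, 3)` returns 0, while `has_mtrie_fix(4, 2, 5)` returns 1.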
tgmrt
(TCS_GMRT)
September 25, 2018, 10:37am
4
Hi Reynald, thanks for the prompt reply.
We are using version 9.2.2 of the C++ Tango library and version 4.0.7 of ZMQ.
I will try to reproduce this with a simple device server program.
Approximately 5 to 7 clients subscribe to a particular attribute; the attribute value changes at a rate of approximately 2 messages/sec.
tgmrt
(TCS_GMRT)
September 27, 2018, 7:23am
5
Further findings after updating to ZMQ v4.2.5:
Attribute value changes are not updated in the Taurus labels; however, attributes with static values are displayed in the Taurus labels.
Taurus version: 3.7.0
Please find attached the POC screenshots taken with the different ZMQ versions.