Discussion:
[zeromq-dev] What is the canonical handling of zeromq sockets when fork+exec?
zmqdev
2016-11-25 09:37:24 UTC
Permalink
* Background

I have a service that starts workers on demand with fork+exec.
The requests arrive over zeromq sockets.

After the fork, before the exec, I close all file descriptors > 2,
keeping only stdin/out/err. I then exec the requested program.
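
To be concrete, the spawn path looks roughly like this (an untested
sketch; worker_path and worker_argv are placeholders for the real
program and its arguments):

#include <sys/types.h>
#include <unistd.h>

/* Sketch: spawn a worker by fork+exec, closing everything above
   stderr in the child first. */
static pid_t spawn_worker (const char *worker_path, char *const worker_argv[])
{
    pid_t pid = fork ();
    if (pid == 0) {
        /* Child: keep only stdin/stdout/stderr, then exec. */
        long max_fd = sysconf (_SC_OPEN_MAX);
        for (long fd = 3; fd < max_fd; fd++)
            close ((int) fd);
        execv (worker_path, worker_argv);
        _exit (127);        /* reached only if execv() fails */
    }
    return pid;             /* parent: child pid, or -1 on error */
}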


* Problem

It works, except that I get some rare core dumps (of the service) with
the following assertion failure:

Bad file descriptor (src/epoll.cpp:90)

and the backtrace:

#0 0xf77f5430 in __kernel_vsyscall ()
#1 0xf743f1f7 in raise () from /lib/libc.so.6
#2 0xf7440a33 in abort () from /lib/libc.so.6
#3 0xf7067134 in zmq::zmq_abort(char const*) () from $LIBS/libzmq.so.5
#4 0xf7065e6c in zmq::epoll_t::rm_fd(void*) () from $LIBS/libzmq.so.5
#5 0xf7068823 in zmq::io_object_t::rm_fd(void*) () from $LIBS/libzmq.so.5
#6 0xf70958af in zmq::stream_engine_t::unplug() () from $LIBS/libzmq.so.5
#7 0xf7098711 in zmq::stream_engine_t::error(zmq::stream_engine_t::error_reason_t) () from $LIBS/libzmq.so.5
#8 0xf7098867 in zmq::stream_engine_t::timer_event(int) () from $LIBS/libzmq.so.5
#9 0xf707f972 in zmq::poller_base_t::execute_timers() () from $LIBS/libzmq.so.5
#10 0xf7066209 in zmq::epoll_t::loop() () from $LIBS/libzmq.so.5
#11 0xf7066467 in zmq::epoll_t::worker_routine(void*) () from $LIBS/libzmq.so.5
#12 0xf709d67e in thread_routine () from $LIBS/libzmq.so.5
#13 0xf7619b2c in start_thread () from /lib/libpthread.so.0
#14 0xf750808e in clone () from /lib/libc.so.6

This is with zeromq-4.1.4 on RHEL 7.3 x86_64.

So I wonder: is there some interaction between parent and child?


* Documentation

The Guide and the FAQ do not explicitly address the fork+exec case.

The question has been asked several times on the mailing list in various
forms, without a definitive answer (for dummies like me at least).


* Questions

Do I need to zmq_close the sockets in the child?
Or is zmq_term in the child enough?
Does closing the file descriptors in the child cause problems in the parent?

What is the correct way to handle this?
Luca Boccassi
2016-11-25 10:50:29 UTC
Permalink
Post by zmqdev
* Background
I have a service that starts workers on demand with fork+exec.
The requests arrive over zeromq sockets.
After the fork, before the exec, I close all file descriptors > 2,
keeping only stdin/out/err. I then exec the requested program.
* Problem
It works, except that I get some rare core dumps (of the service) with:
Bad file descriptor (src/epoll.cpp:90)
#0 0xf77f5430 in __kernel_vsyscall ()
#1 0xf743f1f7 in raise () from /lib/libc.so.6
#2 0xf7440a33 in abort () from /lib/libc.so.6
#3 0xf7067134 in zmq::zmq_abort(char const*) () from $LIBS/libzmq.so.5
#4 0xf7065e6c in zmq::epoll_t::rm_fd(void*) () from $LIBS/libzmq.so.5
#5 0xf7068823 in zmq::io_object_t::rm_fd(void*) () from $LIBS/libzmq.so.5
#6 0xf70958af in zmq::stream_engine_t::unplug() () from $LIBS/libzmq.so.5
#7 0xf7098711 in zmq::stream_engine_t::error(zmq::stream_engine_t::error_reason_t) () from $LIBS/libzmq.so.5
#8 0xf7098867 in zmq::stream_engine_t::timer_event(int) () from $LIBS/libzmq.so.5
#9 0xf707f972 in zmq::poller_base_t::execute_timers() () from $LIBS/libzmq.so.5
#10 0xf7066209 in zmq::epoll_t::loop() () from $LIBS/libzmq.so.5
#11 0xf7066467 in zmq::epoll_t::worker_routine(void*) () from $LIBS/libzmq.so.5
#12 0xf709d67e in thread_routine () from $LIBS/libzmq.so.5
#13 0xf7619b2c in start_thread () from /lib/libpthread.so.0
#14 0xf750808e in clone () from /lib/libc.so.6
This is with zeromq-4.1.4 on RHEL 7.3 x86_64.
So I wonder: is there some interaction between parent and child?
* Documentation
The Guide and the FAQ do not explicitly address the fork+exec case.
The question has been asked several times on the mailing list in various
forms, without a definitive answer (for dummies like me at least).
Do I need to zmq_close the sockets in the child?
Or is zmq_term in the child enough?
Does closing the file descriptors in the child cause problems in the parent?
What is the correct way to handle this?
Hi,

I have not dealt with this case personally, so perhaps other folks who
have can chip in.

What I can say is that we have a unit test for this situation:

https://github.com/zeromq/libzmq/blob/master/tests/test_fork.cpp

And the child closes the (TCP) socket explicitly before the context.
Which is in fact what should happen in all cases.

The parent then can receive messages on the sockets just fine.

Maybe it's a linger issue? By default a socket has 30s of linger grace
period.

Try setting ZMQ_LINGER to 0 in the socket in the child, close the socket
and then terminate the context perhaps.
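
Something like this in the child, right after the fork (an untested
sketch; socket and context stand for the handles inherited from the
parent):

#include <zmq.h>

/* Untested sketch: child-side cleanup right after fork(). */
static void child_zmq_cleanup (void *context, void *socket)
{
    int linger = 0;
    zmq_setsockopt (socket, ZMQ_LINGER, &linger, sizeof (linger));
    zmq_close (socket);
    zmq_term (context);
}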

Kind regards,
Luca Boccassi
zmqdev
2016-11-25 11:22:17 UTC
Permalink
Post by Luca Boccassi
https://github.com/zeromq/libzmq/blob/master/tests/test_fork.cpp
And the child closes the (TCP) socket explicitly before the context.
Which is in fact what should happen in all cases.
The parent then can receive messages on the sockets just fine.
Maybe it's a linger issue? By default a socket has 30s of linger grace
period.
Try setting ZMQ_LINGER to 0 in the socket in the child, close the socket
and then terminate the context perhaps.
Thanks. Formatted differently:

1. zmq_close sockets in child (perhaps setting ZMQ_LINGER to 0 beforehand)
2. zmq_term context in child

and only then

3. close rest of file descriptors in child
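
In code, that ordering in the child before the exec would be roughly
the following (untested sketch; context, socket, worker_path and
worker_argv are placeholders for the real handles and program):

#include <zmq.h>
#include <unistd.h>

static void child_cleanup_and_exec (void *context, void *socket,
                                    const char *worker_path,
                                    char *const worker_argv[])
{
    int linger = 0;
    zmq_setsockopt (socket, ZMQ_LINGER, &linger, sizeof (linger));
    zmq_close (socket);                    /* 1. close zmq sockets    */
    zmq_term (context);                    /* 2. terminate context    */

    long max_fd = sysconf (_SC_OPEN_MAX);  /* 3. close remaining fds  */
    for (long fd = 3; fd < max_fd; fd++)
        close ((int) fd);

    execv (worker_path, worker_argv);
    _exit (127);                           /* only if execv() fails   */
}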

The reason I went directly to point 3 is this line from the man page of
fork(2):

The child process is created with a single thread—the one
that called fork().

(see http://man7.org/linux/man-pages/man2/fork.2.html)

Michael Kerrisk in "The Linux Programming Interface" insists:

When a multithreaded process calls fork(), only the calling thread is
replicated in the child process. (The ID of the thread in the child is
the same as the ID of the thread that called fork() in the parent.)
All of the other threads vanish in the child; no thread-specific data
destructors or cleanup handlers are executed for those threads.
(...)

Of course, that's where I run into the problem?!
Luca Boccassi
2016-11-26 15:52:03 UTC
Permalink
Post by zmqdev
Post by Luca Boccassi
https://github.com/zeromq/libzmq/blob/master/tests/test_fork.cpp
And the child closes the (TCP) socket explicitly before the context.
Which is in fact what should happen in all cases.
The parent then can receive messages on the sockets just fine.
Maybe it's a linger issue? By default a socket has 30s of linger grace
period.
Try setting ZMQ_LINGER to 0 in the socket in the child, close the socket
and then terminate the context perhaps.
1. zmq_close sockets in child (perhaps setting ZMQ_LINGER to 0 beforehand)
2. zmq_term context in child
and only then
3. close rest of file descriptors in child
The reason I went directly to point 3 is this line from the man page of fork(2):
The child process is created with a single thread—the one
that called fork().
(see http://man7.org/linux/man-pages/man2/fork.2.html)
When a multithreaded process calls fork(), only the calling thread is
replicated in the child process. (The ID of the thread in the child is
the same as the ID of the thread that called fork() in the parent.)
All of the other threads vanish in the child; no thread-specific data
destructors or cleanup handlers are executed for those threads.
(...)
Of course, that's where I run into the problem?!
Yes, I suspect the background I/O thread suddenly going missing might
have caused issues.

Did setting the linger and closing the socket help?

Kind regards,
Luca Boccassi
