Discussion:
[zeromq-dev] ZMQ_SUB socket stuck in recv() call
Marc Rossi
2010-09-01 16:53:49 UTC
Permalink
Hi all. Have a pub/sub setup with a single publisher and several
subscribers with default subscriptions (""). A fairly high volume of messages
is published, and several times throughout the day I'll have to restart a
client because it is no longer receiving updates and its memory footprint
is growing rapidly.

Have been attaching to the hung clients with gdb and have found the same
basic thing every time: 4 threads in the zmq::epoll_t::loop() function
(epoll.cpp line 161) and one thread in zmq::signaler_t::recv()
(signaler.cpp line 263) that never returns from the ::recv() call.

Usually when one of the clients enters this state the others keep going
just fine; I'm assuming the memory footprint growth is due to messages from
the publisher being queued up.

I read in the docs about different sockets and their "exception" states that
would cause them to block until the issue is resolved, but I didn't see
anything in there about SUB sockets entering an exception state.

Any help / ideas to look into would be greatly appreciated.

Marc
Pieter Hintjens
2010-09-01 17:07:12 UTC
Permalink
Hi Marc,

If you could tell us:

* the version of 0MQ you're using
* your operating system and hardware
* the programming language
* the 0MQ transports you're using
* and provide a minimal test case that reproduces the problem

It'll be easier to see what's going on.

-Pieter
--
-
Pieter Hintjens
iMatix - www.imatix.com
Wolfgang Richter
2010-09-01 17:10:35 UTC
Permalink
Just as a quick question, does 0MQ have a ticket tracking system?
--
Wolf
Pieter Hintjens
2010-09-01 17:15:04 UTC
Permalink
Yes, there is a public issue tracker on the github.com repository:
http://github.com/zeromq/zeromq2/issues

-Pieter
--
-
Pieter Hintjens
iMatix - www.imatix.com
Wolfgang Richter
2010-09-01 17:16:06 UTC
Permalink
Thanks :)
--
Wolf
Marc Rossi
2010-09-01 17:27:09 UTC
Permalink
Hi Pieter,

0MQ version 2.0.7
Linux (Fedora Core 12 -- kernel 2.6.32.14-127.fc123.x86_64)
2 quad-core Xeon processors.
C++
TCP transport

Quick description of the process: the publisher receives data from a 3rd-party
stock market feed and makes it available on address 'tcp://*:5555'. I'll
see what I can do about a simplified test case without any of the
dependencies in my environment.

Thanks,
Marc
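
For reference, a minimal test case along these lines might look roughly like
the sketch below. This is not Marc's actual code -- the port, message size
and counting interval are assumptions -- and it targets the 0MQ 2.0.x C API,
whose zmq_init/zmq_send/zmq_recv signatures changed in later releases, so
treat it as a starting point rather than a drop-in reproducer.

    //  Hypothetical minimal test case -- NOT Marc's actual code.  Run one
    //  instance as "testcase pub" and one or more as "testcase sub".
    //  Targets the 0MQ 2.0.x C API.
    #include <zmq.h>
    #include <cstdio>
    #include <cstring>

    int main (int argc, char *argv [])
    {
        bool pub = argc > 1 && strcmp (argv [1], "pub") == 0;

        //  2.0.x signature: (app_threads, io_threads, flags).
        void *ctx = zmq_init (1, 1, 0);

        void *s = zmq_socket (ctx, pub ? ZMQ_PUB : ZMQ_SUB);
        if (pub)
            zmq_bind (s, "tcp://*:5555");
        else {
            //  Default "" subscription, as in the setup described above.
            zmq_setsockopt (s, ZMQ_SUBSCRIBE, "", 0);
            zmq_connect (s, "tcp://localhost:5555");
        }

        for (unsigned long n = 0; ; n++) {
            zmq_msg_t msg;
            if (pub) {
                //  Blast fixed-size dummy "ticks" as fast as possible.
                zmq_msg_init_size (&msg, 64);
                memset (zmq_msg_data (&msg), 0, 64);
                zmq_send (s, &msg, 0);
            }
            else {
                zmq_msg_init (&msg);
                //  This is the call that hangs in the stuck subscribers.
                zmq_recv (s, &msg, 0);
            }
            zmq_msg_close (&msg);
            if (!pub && n % 1000000 == 0)
                printf ("received %lu messages\n", n);
        }
        //  Never reached; cleanup (zmq_close/zmq_term) omitted in this sketch.
    }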
Pieter Hintjens
2010-09-01 17:33:02 UTC
Permalink
Hi Marc,

I'd advise you to upgrade to 2.0.8 (stable); there have been a number of
bug fixes since 2.0.7.

I don't immediately see this issue in the changelog but it's worth
using the latest stable release in any case.

-Pieter
--
-
Pieter Hintjens
iMatix - www.imatix.com
Marc Rossi
2010-09-08 18:27:53 UTC
Permalink
Just an update. Have upgraded to 2.0.9, and the problem still exists. Trying
to get my head around the zmq architecture so I can figure out what I'm
doing wrong.

I'm assuming the io_threads are responsible for pulling the data off the tcp
socket and are working properly, as a netstat shows no data in the receive
queue. When a process enters this state (not sure what triggers it, but it
will happen once or twice through the trading day) memory continues to grow,
so my assumption is that the handoff from the io_threads to my thread calling
socket->recv() is breaking down. I'm going to keep digging; hopefully
someone can give me some pointers before I become a zeromq internals expert.

Threads & stack trace are shown below.

Thanks
Marc

(gdb) info threads
12 Thread 0x7f0f69118710 (LWP 27483) 0x0000003290e0e53c in recv () from
/lib64/libpthread.so.0
11 Thread 0x7f0f63fff710 (LWP 27484) 0x00000032902ded73 in epoll_wait ()
from /lib64/libc.so.6
10 Thread 0x7f0f635fe710 (LWP 27485) 0x00000032902ded73 in epoll_wait ()
from /lib64/libc.so.6
9 Thread 0x7f0f62bfd710 (LWP 27486) 0x00000032902ded73 in epoll_wait ()
from /lib64/libc.so.6
8 Thread 0x7f0f621fc710 (LWP 27487) 0x00000032902ded73 in epoll_wait ()
from /lib64/libc.so.6
7 Thread 0x7f0f617fb710 (LWP 27488) 0x0000003290e0b04c in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
6 Thread 0x7f0f60dfa710 (LWP 27489) 0x0000003290e0b04c in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
5 Thread 0x7f0f5bfff710 (LWP 27490) 0x0000003290e0b04c in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
4 Thread 0x7f0f5b5fe710 (LWP 27491) 0x0000003290e0b04c in
pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
3 Thread 0x7f0f5abfd710 (LWP 27495) 0x00000032902d7553 in select () from
/lib64/libc.so.6
2 Thread 0x7f0f597fb710 (LWP 27497) 0x0000003290e0e43d in accept () from
/lib64/libpthread.so.0
* 1 Thread 0x7f0f6993a820 (LWP 27482) 0x00000032902a4d5d in nanosleep ()
from /lib64/libc.so.6
(gdb) thread 12
[Switching to thread 12 (Thread 0x7f0f69118710 (LWP 27483))]#0
0x0000003290e0e53c in recv () from /lib64/libpthread.so.0
(gdb) where
#0 0x0000003290e0e53c in recv () from /lib64/libpthread.so.0
#1 0x00007f0f6a3f8af6 in zmq::signaler_t::recv (this=0x7f0f640023b0,
cmd_=0x7f0f69117a50, block_=true) at signaler.cpp:274
#2 0x00007f0f6a3ea724 in zmq::app_thread_t::process_commands
(this=0x7f0f64002380, block_=<value optimized out>, throttle_=<value
optimized out>) at app_thread.cpp:88
#3 0x00007f0f6a3f903c in zmq::socket_base_t::recv (this=0x7f0f640023f0,
msg_=0x7f0f69117bc0, flags_=0) at socket_base.cpp:443
#4 0x000000000042ff0f in zmq::socket_t::recv (this=0x7f0f69117c70,
msg_=0x7f0f69117bc0, flags_=0) at /usr/local/include/zmq.hpp:256
#5 0x000000000042dbcc in Altair::MktDataSub::Consumer () at
MktDataSub_gpb.cpp:110
#6 0x00000000004353b5 in boost::detail::thread_data<void (*)()>::run
(this=0xaeb810) at /usr/include/boost/thread/detail/thread.hpp:56
#7 0x00007f0f6b5a4670 in thread_proxy () from
/usr/lib64/libboost_thread-mt.so.5
#8 0x0000003290e06a3a in start_thread () from /lib64/libpthread.so.0
#9 0x00000032902de77d in clone () from /lib64/libc.so.6
#10 0x0000000000000000 in ?? ()
Martin Sustrik
2010-09-13 13:18:38 UTC
Permalink
Hi Marc,
Post by Marc Rossi
Just an update. Have upgraded to 2.0.9, and problem still exists.
Trying to get my hands around the zmq architecture so I can figure out
what I'm doing wrong.
It looks like you are doing nothing wrong.
Post by Marc Rossi
I'm assuming the io_threads are responsible for pulling the data off the
tcp socket and are working properly as a netstat shows no data in the
receive queue.
Yes. That's why your memory usage is growing -- the I/O thread reads the
messages from the network and stores them in a queue, while the application
thread is asleep and not reading them.
Post by Marc Rossi
When a process enters this state (not sure what triggers
it but it will happen once or twice through trading day) memory
continues to grow so my assumption is the handoff from the io_threads to
my thread calling socket->recv() is breaking down.
Yes. It looks like that.
Thanks for the stack trace. It confirms the analysis above (the I/O thread
is receiving messages while the application thread is stuck, never informed
that there are messages waiting for it).

Three questions:

1. Are you using HWM?
2. Are you using SWAP?
3. Did the stuck subscriber receive any messages before it got stuck, or is
it a fresh consumer?

Martin
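
For readers following along: HWM and SWAP here refer to the ZMQ_HWM and
ZMQ_SWAP socket options, set with zmq_setsockopt() before the socket is
connected or bound. A rough sketch against the 2.0.x API is shown below;
the values are purely illustrative and the helper function is hypothetical,
not a recommendation for Marc's setup.

    //  Illustrative sketch only -- the values below are made up.
    //  In 0MQ 2.0.x, ZMQ_HWM caps how many messages may be queued in memory
    //  on a socket's pipes, and ZMQ_SWAP lets the overflow spill to disk.
    #include <zmq.h>
    #include <stdint.h>

    static void configure_sub_socket (void *sub)
    {
        uint64_t hwm = 100000;               //  keep at most ~100k messages in memory
        zmq_setsockopt (sub, ZMQ_HWM, &hwm, sizeof hwm);

        int64_t swap = 256LL * 1024 * 1024;  //  allow 256 MB of on-disk swap
        zmq_setsockopt (sub, ZMQ_SWAP, &swap, sizeof swap);

        //  Set the options before zmq_connect()/zmq_bind() so they apply to
        //  the connections the socket subsequently creates.
    }

Neither option would explain the hang by itself, but with no HWM the queue is
unbounded, which matches the memory growth Marc sees once the application
thread stops draining messages.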
Marc Rossi
2010-09-15 17:42:33 UTC
Permalink
Martin, thanks for the response. Apologies for the late reply; I haven't
been able to get back to this one until today.
Post by Martin Sustrik
Thanks for the stack trace. It confirms the analysis above (the I/O thread
is receiving messages while the application thread is stuck, never informed
that there are messages waiting for it).
1. Are you using HWM?
2. Are you using SWAP?
3. Did the stuck subscriber receive any messages before it got stuck, or is
it a fresh consumer?
No HWM or SWAP set. The subscriber does receive messages prior to getting
stuck. Right now I have 3 clients (SUB) listening to market data
throughout the trading day. The clients create a thread in which all the zmq
code is maintained (context creation, socket creation, recv, etc.).

Yesterday there were 2 occasions where one of the clients entered this
state (different clients, different times). So far today everything is
running smoothly, but I would assume there will be at least 1 hiccup before
the day is up.

I am going to start digging deeper into the zmq lib to see if I can
get a better handle on what is going on. But as always, any pointers /
ideas from the zmq experts would be much appreciated.

Thanks again,
Marc
Martin Sustrik
2010-09-17 07:43:53 UTC
Permalink
Marc,
Post by Marc Rossi
I am going to start doing some digging deeper in the zmq lib to see if I
can get a better handle on what is going on. But as always, any
pointers / ideas from the zmq experts would be much appreciated.
I cannot really help as I don't see the problem; however, I'll be happy
to answer any questions you may have about how the code works.

What you should have a look at is that zmq_recv is stuck waiting for a
command from an I/O thread notifying it that a message has arrived
(the commands are passed via the signaler_t object). The specific command
the writer should send to wake up the reader is "revive" (see the
send_revive call in pipe_t::writer_t and process_revive in
pipe_t::reader_t).
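
To make that concrete for readers who haven't looked at the 0MQ source, the
sketch below is a toy illustration of the general pattern Martin describes
(commands passed over a socketpair between the I/O thread and the application
thread). It is emphatically not the real zmq::signaler_t code, and the type
and method names are invented; the blocking ::recv() at the top of Marc's
stack trace is the reader side of exactly this kind of handoff, waiting for a
"revive"-style command that, in the hung clients, apparently never arrives.

    //  Toy illustration of a socketpair-based signaler -- NOT the actual
    //  zmq::signaler_t implementation, just the general handoff pattern.
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <cassert>

    struct toy_signaler_t
    {
        int w;      //  write end, used by the I/O thread
        int r;      //  read end, used by the application thread

        toy_signaler_t ()
        {
            int fds [2];
            int rc = socketpair (AF_UNIX, SOCK_STREAM, 0, fds);
            assert (rc == 0);
            w = fds [0];
            r = fds [1];
        }

        //  I/O thread side: "a message is waiting for you, wake up".
        void send_command (char cmd_)
        {
            ssize_t nbytes = ::send (w, &cmd_, 1, 0);
            assert (nbytes == 1);
        }

        //  Application thread side: this is the blocking ::recv() seen in
        //  the stack trace; if no command is ever sent, it blocks forever
        //  while the I/O thread keeps queueing messages.
        char wait_for_command ()
        {
            char cmd;
            ssize_t nbytes = ::recv (r, &cmd, 1, 0);
            assert (nbytes == 1);
            return cmd;
        }
    };

Whether the problem lies in sending, receiving, or deciding to send such a
command is exactly what Martin suggests inspecting via send_revive and
process_revive.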

Btw, what's the microarchitecture? x86?

Martin
Marc Rossi
2010-09-17 13:51:30 UTC
Permalink
Post by Martin Sustrik
I cannot really help as I don't see the problem; however, I'll be happy
to answer any questions you may have about how the code works.
Unfortunately this is where I'm at as well. All looks good, but the problem
still exists. I do appreciate the time you have spent helping me, though.
Post by Martin Sustrik
What you should have a look at is that zmq_recv is stuck waiting for a
command from an I/O thread notifying it that a message has arrived
(the commands are passed via the signaler_t object). The specific command
the writer should send to wake up the reader is "revive" (see the
send_revive call in pipe_t::writer_t and process_revive in
pipe_t::reader_t).
That's about where I'm at in understanding the zmq lib. I'm going to try to
recreate the problem while hacking up the zmq lib to get a better
understanding of what triggers it.
Post by Martin Sustrik
Btw, what's the microarchitecture? x86?
x86_64

I'll be sure to post anything I find to the list.
